When the coelacanth was first discovered, there was a lot of excitement. It was a living example of a group of fish that was thought to exist only as fossils. And it wasn’t just any group of fish: with their long, stalk-like fins, the coelacanth and its kin are thought to include the ancestors of all non-fish vertebrates – all vertebrates with limbs, that is, including us humans, among many other things.
Since then, however, evidence has been mounting that we are more closely related to the freshwater lungfish that inhabit Africa, Australia and South America. Lungfish are a bit odd, though: the African and South American species have thin, floppy fins rather than the limb-like fins of their ancestors. And their evolutionary history is tricky to grasp, in part because they have the largest genomes known in any animal. The genome of the South American lungfish contains more than 90 billion base pairs, 30 times the DNA of a human.
New sequencing technologies have made such challenges manageable, and an international collaboration has now completed the largest genome ever assembled, one in which all but one chromosome carries more DNA than the entire human genome. The study shows that the South American lungfish has added 3 billion bases of DNA every 10 million years over the past 200 million years, but it hasn’t added many new genes in that time; instead, it appears to have lost the ability to suppress junk DNA.
Long reads
This work was made possible by a technique commonly known as “long-read sequencing.” Most completed genomes were assembled from short reads, usually around 100-200 base pairs long. The trick was to sequence so much DNA that, on average, every base in the genome was sequenced multiple times. Given that, a cleverly designed computer program could figure out where two reads overlapped, register them as one longer sequence, and repeat the process until it spat out a long string of contiguous bases.
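To make the overlap idea concrete, here is a minimal sketch in Python. It is a toy greedy merger, not the algorithm used by any real assembler or by the lungfish study; the function names `overlap` and `greedy_assemble`, the `min_len` threshold and the 14-base example sequence are all invented for illustration.

```python
from __future__ import annotations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences that share the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical 6-base "reads" sampled every 2 bases from a made-up sequence.
reads = ["ATGCGT", "GCGTAC", "GTACGT", "ACGTTA", "GTTAGC"]
contigs = greedy_assemble(reads)
print(contigs)
```

With these reads every overlap is unambiguous, so the pieces collapse into a single contiguous sequence. Real assemblers also have to cope with sequencing errors, reads from both DNA strands, and far more data than a quadratic pairwise comparison could handle.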
The problem is that most non-microbial species have repetitive sequences (think hundreds of copies of the bases G and A in a row) that are more than a few hundred bases long, and near-identical sequences appear in multiple places in the genome. Short reads from these regions cannot be matched to a unique location, so the output of genome assembly software contains many gaps of unknown length and sequence.
This becomes extremely difficult for genomes like the lungfish's, which are filled with non-functional “junk” DNA that is typically repetitive: the software tends to generate assemblies with more gaps than sequence.
Long-read technologies, just as their name suggests, avoid this problem. Rather than sequencing fragments of 200 bases or so, they produce reads thousands of base pairs long, easily covering entire repeats that would otherwise leave gaps. One early version of long-read technology involved threading a long DNA molecule through a pore and monitoring the voltage changes across the pore as different bases passed through it. In another version, a DNA-copying enzyme made a copy of the long strand while the instrument monitored the fluorescence changes as different bases were added. These early versions were somewhat prone to error, but they have since been improved upon, and there are now several competing technologies on the market.
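The effect of read length on repeats can be illustrated with a short Python sketch. The toy "genome" below, its flanking sequences and the helper `find_positions` are made up for this example; the point is only that a read lying entirely inside a repeat matches many places, while a read long enough to include unique sequence on both sides matches exactly one.

```python
from __future__ import annotations


def find_positions(genome: str, read: str) -> list[int]:
    """Return every start index where `read` occurs in `genome`."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)  # allow overlapping matches
    return positions


# A made-up sequence containing the same 20-base GA repeat twice.
repeat = "GA" * 10
genome = "CCTTAC" + repeat + "TGCATG" + repeat + "AACCGT"

short_read = "GA" * 4       # 8 bases, lies entirely inside a repeat
long_read = genome[3:29]    # 26 bases: the first repeat plus 3 unique bases on each side

short_hits = find_positions(genome, short_read)
long_hits = find_positions(genome, long_read)
print(len(short_hits), "possible locations for the short read")
print(len(long_hits), "possible location(s) for the long read")
```

The short read could have come from many positions in either copy of the repeat, so an assembler has no way to place it. The long read is anchored by the unique bases at its ends, which is the property that lets long-read data close gaps in repeat-rich genomes.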
In 2021, researchers used this technology to complete the genome of the Australian lungfish, a species that retains the limb-like fins of the ancestor that gave rise to tetrapods. Now, genomes have been completed for the African and South American species, which appear to have gone their separate ways during the breakup of the supercontinent Gondwana, which began about 200 million years ago. Having all three genomes should provide some insight into the features common to all lungfish species, features that were likely shared with the distant ancestor that gave rise to tetrapods.