NEWS

One Gene, Two Genes, Three Genes, Four: Scientists Wonder How Many More

Tom Reynolds

When you were a kid, did you ever try to win a prize by guessing how many jelly beans were in the big jar? Is it 500? 5,000? Hard to tell when the numbers get so high.

If you have played this kind of guessing game, the Cold Spring Harbor (N.Y.) Laboratory’s Genesweep will probably seem familiar. But instead of a pile of candy, the guesses aim at a basic question about the foundations of human life: How many genes make up the human genome?

The contest was organized by Ewan Birney, Ph.D., of the European Bioinformatics Institute in Cambridge, England, at CSHL’s genome conference in May, where the topic reportedly was "hotly debated."

Now that the genome’s 3 billion DNA base pairs have been sequenced, can’t scientists just count the genes? Well, no; it is not quite that simple.

When a simple organism’s genome is sequenced—a bacterium or yeast, for example—it is relatively easy to identify the genes because they make up most of the organism’s DNA. For these creatures that reproduce fast and die young, evolution has ensured that the Biology 101 axiom—DNA makes RNA makes protein—works in a highly efficient and mostly straightforward way.

But for slower-paced humans and other vertebrates, it is more complicated. Unlike the compact, no-nonsense genomes found in Escherichia coli, Caenorhabditis elegans, or even Drosophila, only about 3% of the human genome’s DNA is used to make protein. Perhaps because fast reproduction is not crucial to our survival strategy, DNA replication can afford to be less efficient, and thus evolution has not culled from our chromosomes the excess DNA accumulated over the millennia from viral infections, erroneous duplications, and other events. Most confounding to gene counters are pseudogenes, stretches of DNA that resemble genes but are actually botched copies that make no protein.

Not until full mRNA sequences are made for all these putative genes can researchers be sure which are bona fide genes, which are pieces of genes, and which are imposters, said Phil Green, Ph.D., of the University of Washington, Seattle. Lacking that information, they can only make educated guesses based on limited available data.

Of three studies published in the June Nature Genetics that estimate gene number using different methods, two yielded estimates that are among the lowest ever, while a third produced a much higher figure.

In one report, Green and colleague Brent Ewing, Ph.D., used two sets of gene sequences that produced similar estimates, which are lower than most of those publicized in recent years. An estimate of 34,700 genes was based on the well-characterized chromosome 22, and an estimate of 33,630 came from the public GenBank database.

Green said these numbers, fewer than twice the 19,000 genes found in the 959-cell worm C. elegans, suggest that the vast complexity of humans and other vertebrates is mediated not primarily by an increase in gene number but by greater diversity in the gene regulatory networks that are also encoded in the genome’s DNA.

In a second report, researchers at Genoscope, the French national sequencing center in Evry, led by center director Jean Weissenbach, Ph.D., used what might seem an unlikely interspecies comparison to estimate human gene number: the genome of the pufferfish Tetraodon nigroviridis, separated from Homo sapiens by 400 million years of evolution. Using a technique they dubbed "Exofish" (EXOn FInding by Sequence Homology), based on matching up evolutionarily conserved regions of the two genomes, Weissenbach and colleagues concluded that the human genome contains between 28,000 and 34,000 genes.

Green said the Weissenbach group’s approach complements his own, with each relying on the assumption of specific values for uncertain parameters. "I think there’s enough uncertainty in both estimates to say they are compatible with each other in pointing to a fairly low number of human genes," he said.

In sharp contrast to those two papers, estimates from the third are much higher. Led by John Quackenbush, Ph.D., researchers at The Institute for Genomic Research in Rockville, Md., created a human gene index based on about 1.6 million expressed sequence tags in the GenBank database. After the data were "cleaned" to remove contaminating sequences such as bacterial or mitochondrial DNA, and clustered to eliminate redundant ones, the TIGR team calculated that human genes number between 110,000 and 134,000. Using chromosome 22 data to check these findings, they arrived at a comparable number, 118,000. TIGR’s figures are close to the estimates in the 130,000 to 140,000 range offered in September 1999 by Incyte Genomics Inc., of Palo Alto, Calif.
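The size of the gap between these figures and the two lower estimates can be checked with quick arithmetic. The sketch below uses only the numbers quoted in this article; taking the midpoint of each published range as its representative value is an assumption made here for illustration, not something any of the three groups did.

```python
# Gene-count estimates quoted in the article, as (low, high) bounds.
# Single-value estimates are stored with identical bounds.
estimates = {
    "Green and Ewing (chromosome 22)": (34_700, 34_700),
    "Green and Ewing (GenBank)": (33_630, 33_630),
    "Genoscope (Exofish)": (28_000, 34_000),
    "TIGR (gene index)": (110_000, 134_000),
    "TIGR (chromosome 22 check)": (118_000, 118_000),
}

C_ELEGANS_GENES = 19_000  # worm gene count cited by Green


def midpoint(bounds):
    """Midpoint of a (low, high) range; an illustrative summary, not a published value."""
    low, high = bounds
    return (low + high) / 2


def ratio_to_worm(bounds):
    """How many times larger the estimate is than the C. elegans gene count."""
    return midpoint(bounds) / C_ELEGANS_GENES


tigr_vs_genoscope = (
    midpoint(estimates["TIGR (gene index)"])
    / midpoint(estimates["Genoscope (Exofish)"])
)

for name, bounds in estimates.items():
    print(f"{name}: {midpoint(bounds):,.0f} genes, "
          f"{ratio_to_worm(bounds):.1f}x C. elegans")
print(f"TIGR / Genoscope midpoint ratio: {tigr_vs_genoscope:.1f}")
```

On these midpoints the TIGR figure is a little under four times the Genoscope figure, and both Green and Ewing estimates stay below twice the worm's gene count.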

Why the fourfold discrepancy? Green believes the TIGR group was off base on several key numbers used in its computations. Quackenbush and co-workers estimated that 55% of all genes in the genome are represented by EST "contigs," sets of DNA clones that together contain a contiguous segment of code, but Green said that figure resulted from errors in their methods and the true figure is 80% to 85%. TIGR also underestimated the average number of EST contigs per gene and the percentage of genes in the genome that are on chromosome 22, he said, and the net effect of these underestimates is to raise the predicted number of genes.
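To see why an underestimated coverage fraction pushes the gene count up, consider a deliberately simplified model in which the estimate is inversely proportional to the fraction of genes represented by EST contigs. That proportionality, and the function name `rescale_estimate`, are assumptions made here for illustration; neither TIGR's nor Green's actual computation is described in enough detail in this article to reproduce.

```python
def rescale_estimate(estimate, assumed_fraction, revised_fraction):
    """Rescale a gene-count estimate after revising the coverage fraction.

    Assumes (hypothetically) that the estimate is inversely proportional to
    the fraction of genes represented by EST contigs, so a larger fraction
    gives a smaller gene count.
    """
    for fraction in (assumed_fraction, revised_fraction):
        if not 0 < fraction <= 1:
            raise ValueError("fractions must be in the interval (0, 1]")
    return estimate * assumed_fraction / revised_fraction


# Figures quoted in the article.
TIGR_RANGE = (110_000, 134_000)  # TIGR's estimated gene count
TIGR_FRACTION = 0.55             # TIGR's coverage fraction
GREEN_FRACTIONS = (0.80, 0.85)   # Green's proposed coverage fraction

for revised in GREEN_FRACTIONS:
    low, high = (
        rescale_estimate(value, TIGR_FRACTION, revised) for value in TIGR_RANGE
    )
    print(f"coverage {revised:.0%}: {low:,.0f} to {high:,.0f} genes")
```

Under this simplified model, revising the coverage fraction alone lowers the TIGR range to roughly 71,000 to 92,000 genes, still well above the 28,000 to 34,700 reported in the other two papers. That is consistent with Green's point that several parameters, not just one, drive the difference.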

"I went through the calculations the same way they did, but using our estimates of the parameters, and I got about 31,500 genes," he said.

Quackenbush, in an e-mail response to the News, said he has likewise pointed out to Green what he believes are flaws in Green and Ewing’s methods, and now believes the true number is "somewhere in between the extremes."

"But at this point," he added, "arguing about the absolute number of genes is akin to medieval scholars arguing about how many angels can dance on the head of a pin. ...The disparate estimates point out just what a difficult task we will face in identifying genes in the genomic sequence and the challenges we will face in assigning function to the genes."

Some have suggested that genome companies favor inflated gene numbers for marketing purposes: the more genes there are, the more genes available to patent and sell, and the more the company is seemingly worth.

"The more the better from their point of view—they’re not highly motivated to get the number down," Green said.

However, Quackenbush said TIGR, a nonprofit research institute, has nothing to gain by inflating its estimates, adding that "It would be hard to imagine that a for-profit company would do that either, since you cannot patent a gene that does not exist and, in the long run, they would suffer from misrepresenting their data."

A Nature Genetics editorial calls on the companies to publicly release some of their data, both to allow public scrutiny of their computations and to refute "insinuation that commercial interests factor into the equation."

The number of variables involved in gene counting suggests that controversy is likely to continue even as the estimates are refined. Even what defines a gene is debated, so a working definition was formulated for the sweepstakes (see box, opposite page). And at CSHL’s 2002 genome meeting, a year before Genesweep’s day of reckoning, scientists will have to vote to decide on the method used to determine the winning number.

"In a way, it’s not surprising that there is no firm agreement on the number of genes in the genome," Quackenbush noted. "Sequencing of the yeast genome was completed in 1995 and I doubt anyone could tell you precisely how many genes it contains."



             
Copyright © 2000 Oxford University Press (unless otherwise stated)