Concept 39: A genome is an entire set of genes.
James Watson describes sequencing the human genome using markers and BACs, and Craig Venter explains using cDNA libraries, ESTs, and shotgun sequencing.
Hi, I’m Jim Watson. You may remember me from concept 19. In 1953, Francis Crick and I discovered the structure of DNA. Prior to having a lot of genomic sequences, "finding" a gene was difficult and involved looking for markers to fix a gene's position on a chromosome. The earliest chromosome maps used visible banding patterns, from stains, as markers. However, there are millions of nucleotides and thousands of genes within each band. As the first director of the government-sponsored Human Genome Project, one of my goals was to put more markers on these chromosomes. These markers would be based on unique DNA sequences. We tried to find markers about 150,000 nucleotides from each other. Since an average gene is around 10,000 nucleotides, it's much easier to find a gene within 150,000 nucleotides than within a chromosome's worth of DNA. Ultimately, we generated a map with 30,000 distinct markers. Each marker was a unique 100 to 200 base pair sequence within the genome.
Hi, I’m Craig Venter. When I was at the National Institutes of Health (NIH) with Jim Watson, I was also interested in locating human genes. Genes only make up about 3% of our genome, and I figured out a faster and cheaper method of finding them. I first used my method to find genes expressed in the human brain. I used brain cDNAs that others had stored in a phage library. The library was made by first extracting mRNAs from human brain tissue. The mRNAs are reverse transcribed, and the cDNAs are stored as a library of clones in phage particles. We used the polymerase chain reaction (PCR) to generate small 150 to 400 base pair DNA segments from these cDNAs. A non-specific primer was used in these PCR reactions. Starting from the primer, Taq polymerase made short copies of the brain cDNAs. I called the resulting short sequences Expressed Sequence Tags (ESTs). They vary in length, and are all unique. Each EST identifies a gene expressed in the brain, because the ESTs are derived from brain mRNA.
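The marker-density reasoning in Watson's part of this passage can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the 3.2 billion base pair genome size is the figure quoted later in this transcript, the variable names are made up for illustration, and markers are assumed to be evenly spaced, which real markers are not.

```python
# Figures quoted in the transcript; the names are illustrative only.
GENOME_BP = 3_200_000_000      # approximate size of the human genome
TARGET_SPACING_BP = 150_000    # desired distance between neighboring markers
MARKERS_MADE = 30_000          # markers on the finished map
AVERAGE_GENE_BP = 10_000       # rough length of an average gene

# Markers needed if they were spaced exactly at the target distance.
markers_needed = GENOME_BP // TARGET_SPACING_BP

# Average spacing actually achieved with 30,000 evenly spread markers.
achieved_spacing_bp = round(GENOME_BP / MARKERS_MADE)

# How many average-gene lengths fit between two markers at the target spacing.
gene_lengths_per_interval = TARGET_SPACING_BP // AVERAGE_GENE_BP

print(markers_needed)             # 21333
print(achieved_spacing_bp)        # 106667
print(gene_lengths_per_interval)  # 15
```

Under these assumptions, 30,000 markers would give an average spacing somewhat tighter than the 150,000 nucleotide goal.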
We sequenced 2,375 ESTs from the human brain, and compared the sequences with genes already in public databases. Only 17% of our ESTs matched previously known gene sequences. Some of the ESTs in this 17% matched up with known genes expressed throughout the body, like beta-actin. Others were more specific to brain tissue, like the big-brain gene from Drosophila. Most excitingly, 83% of the ESTs represented previously unknown genes! We continued making ESTs, and, in a few years, tagged over 30,000 new genes. To some, this is a treasure trove of information. Several biotech companies, such as Human Genome Sciences and Incyte Genomics, were started to capitalize on this gene bank.
It seemed to me that Craig and his industry buddies just wanted to find genes they could patent. I was interested in the entire genome and how it all works. EST libraries are fine, but genes that are not highly expressed will not be represented. Also, there won't be any upstream control regions in an EST library. To get information from the entire genome, we had to break it down into smaller manageable pieces. First, each chromosome is cut with rare-cutting restriction enzymes to generate pieces approximately 150,000 base pairs in length. These fragments are cloned into bacterial artificial chromosomes (BACs). We cut copies of the same chromosome with different enzymes to make sure we had overlapping fragments representing the entire chromosome. Our physical markers identified some of the BACs. BACs lacking markers were further mapped using more common restriction enzymes. These restriction enzyme sites, also markers, are used to identify and line up the BACs. The BACs have now been assembled in an orderly manner. After ordering, we selected the minimal number of BACs needed to span the entire chromosome. However, BACs are still too big for automated sequencing. We repeated the breakdown process, and randomly broke each BAC into pieces about 1,500 base pairs long.
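Choosing the minimal number of ordered BACs that still span a chromosome can be pictured as an interval-covering problem. The sketch below is not the Human Genome Project's actual procedure. It assumes each BAC's start and end position is already known from the marker map, and it uses a standard greedy rule: among the clones that begin inside the region covered so far, keep the one that reaches furthest. The function name, the clone names, and the coordinates are all hypothetical.

```python
def minimal_tiling_path(clones, region_start, region_end):
    """Pick a smallest set of clones covering [region_start, region_end).

    clones: list of (name, begin, end) tuples with begin < end,
    positions in base pairs (hypothetical mapped coordinates).
    """
    clones = sorted(clones, key=lambda c: c[1])  # order by start position
    path = []
    covered = region_start
    i = 0
    while covered < region_end:
        best = None
        # Look at every clone that starts within the covered region
        # and remember the one extending furthest to the right.
        while i < len(clones) and clones[i][1] <= covered:
            if best is None or clones[i][2] > best[2]:
                best = clones[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"gap in clone coverage at position {covered}")
        path.append(best)
        covered = best[2]
    return path


# Six made-up BACs of roughly 150,000 base pairs on a 500,000 bp region.
bacs = [
    ("A", 0, 150_000),
    ("B", 100_000, 250_000),
    ("C", 120_000, 270_000),
    ("D", 240_000, 390_000),
    ("E", 260_000, 410_000),
    ("F", 380_000, 500_000),
]
tiling = minimal_tiling_path(bacs, 0, 500_000)
print([name for name, _, _ in tiling])  # ['A', 'C', 'E', 'F']
```

In this toy example, clones B and D are redundant, so four of the six BACs are enough to span the region.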
Each piece was subcloned into phage. For simplicity, we are only showing two copies. Again, we made sure to get overlapping pieces. One end of each subclone was sequenced in the automated sequencer. It wasn't really necessary to sequence the entire piece, because the sequenced end of one subclone will overlap the unknown end of another. When we finished sequencing, the subclones from each BAC were lined up in the proper order by aligning matching sequences. Using the sequences and the marker maps, all the subclones were assembled into BACs, and the BACs were assembled into chromosomes. When each chromosome was assembled, we had sequenced the entire human genome!
Hi, I’m back. After my EST technology was commercialized, it didn't take long to make a number of good expression libraries. I began to look for other projects. I thought the federally funded Human Genome Project was plodding along too slowly, so I came up with – you guessed it – a cheaper and faster method. This method, which I called shotgun sequencing, starts by first making random, overlapping DNA pieces of about 2,000 base pairs. This was done by physically forcing a solution of DNA through a syringe. We also made 10,000 base pair pieces. All the pieces were stored individually in bacterial plasmids. The ends of each piece were sequenced in the automated sequencer. As you learned before, when many pieces overlap each other, it isn't necessary to sequence the entire piece. Then, we entered the data into a computer program developed specifically for ordering all the pieces. The program assembled the entire genome by matching overlapping sequences. With shotgun sequencing, there is no need to bother with markers or assembling and subcloning BACs. I first tested this method by sequencing the 1.8 million base pair genome of the bacterium Haemophilus influenzae in 1995, after I had left NIH.
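The idea of ordering pieces by matching overlapping sequences can be illustrated with a toy greedy assembler. This is a simplified sketch, not the program Venter's group used: it assumes error-free reads from a single strand, a made-up 19-base "genome" with no repeats, and whole reads rather than sequenced ends. The function names and the minimum overlap of 4 bases are illustrative choices.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no such overlap of at least min_len bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=4):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing overlaps any more; leave separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


toy_genome = "ATGGCGTACCTGAATCGGA"  # hypothetical sequence, no repeats
toy_reads = [toy_genome[8:16], toy_genome[0:8], toy_genome[11:19], toy_genome[4:12]]
contigs = greedy_assemble(toy_reads)
print(contigs)  # ['ATGGCGTACCTGAATCGGA']
```

Comparing every read with every other read is far too slow for a real genome; the point here is only that overlaps alone can recover the original order when the sequence contains no repeats.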
Then, in 1999, as a dry run for the human genome, my own company, Celera, sequenced and assembled the Drosophila melanogaster fruit fly genome in collaboration with the Drosophila Genome Project. On June 26, 2000, the director of the Human Genome Project, Francis Collins, and I announced at the White House that we had both completed a "working draft" DNA sequence of the human genome. We published an initial analysis of this working draft in February 2001. Despite the 2001 publications, there is still a lot of work to be done on the human genome. There are small "holes" in the sequence that need to be patched. Some of these holes may never be completely resolved because they occur near centromeres that contain multiple repeated sequences. As you can see below, pieces with repeated sequences can be lined up in several ways. Only one will be correct. Also, the hard work of understanding what all the 3.2 billion base pairs of DNA represent and the kinds of information encoded into our sequences has really just begun.
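The problem with repeated sequences can be shown with a small, artificial example. In this sketch, two different arrangements of the same unique segments are separated by copies of a made-up repeat. When every read is shorter than the repeat, both arrangements produce exactly the same collection of reads, so overlaps alone cannot tell them apart. The sequences, names, and read lengths are hypothetical, and reads are idealized as every window of a fixed length.

```python
from collections import Counter


def read_spectrum(genome, read_len):
    """Count every read_len-long window of the genome (an idealized read set)."""
    return Counter(
        genome[i:i + read_len] for i in range(len(genome) - read_len + 1)
    )


REPEAT = "GGGGGG"  # hypothetical repeat, longer than the short reads below

# Same unique segments (AC, TA, CT, AT), but the middle two are swapped.
order_1 = "AC" + REPEAT + "TA" + REPEAT + "CT" + REPEAT + "AT"
order_2 = "AC" + REPEAT + "CT" + REPEAT + "TA" + REPEAT + "AT"

# Reads of 4 bases never span a whole repeat, so the two read sets match.
short_reads_identical = read_spectrum(order_1, 4) == read_spectrum(order_2, 4)

# Reads of 10 bases can cross a repeat and see both flanks, so they differ.
long_reads_differ = read_spectrum(order_1, 10) != read_spectrum(order_2, 10)

print(order_1 != order_2)      # True
print(short_reads_identical)   # True
print(long_reads_differ)       # True
```

This is one way to see why long repeat arrays, such as those near centromeres, leave gaps: only pieces long enough to reach unique sequence on both sides of a repeat can settle which arrangement is correct.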
Craig Venter's company, Celera, placed an ad to find the six people who donated their DNA for sequencing.
Why is the genome divided into separate chromosomes?