Whole-Genome Shotgun Sequencing: overview, steps and achievements
Many virus genomes had been sequenced in the past, but sequencing a bacterial genome was not yet possible. Before 1995, whole-genome sequencing of bacteria was not feasible because the computational power needed to assemble a genome from thousands of DNA fragments was unavailable.
The group of J. Craig Venter and Hamilton Smith then sequenced the genomes of two free-living bacteria, Mycoplasma genitalium and Haemophilus influenzae. The H. influenzae genome was the first to be sequenced; it contains about 1,743 genes in 1,830,137 base pairs and is much larger than a virus genome.
Venter and Smith developed an approach called whole-genome shotgun sequencing. The process is fairly complex when considered in detail, and many procedures are needed to ensure the accuracy of the results, but the following summary gives a general idea of the approach originally employed by The Institute for Genomic Research (TIGR).
For simplicity, this approach is broken into four stages:
- Library construction
- Random sequencing
- Fragment alignment and gap closure
- Editing
Library construction
The large bacterial chromosomes were randomly broken into fairly small fragments, about the size of a gene or less, using ultrasonic waves; the fragments were then purified. These fragments were attached to plasmid vectors, and plasmids with a single insert were isolated. Special E. coli strains lacking restriction enzymes were transformed with the plasmids to produce a library of the plasmid clones.
Random sequencing
After the clones were prepared and the DNA purified, thousands of bacterial DNA fragments were sequenced with automated sequencers, employing special dye-labeled primers. Thousands of templates were used, normally with universal primers that recognized the plasmid DNA sequences just next to the bacterial DNA insert. The nature of the process is such that almost all stretches of the genome are sequenced many times, and this increases the accuracy of the final results.
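The claim that random sequencing covers almost every stretch of the genome many times can be illustrated with a little arithmetic. The sketch below is not the procedure TIGR used; it assumes reads land uniformly at random and uses the standard Poisson approximation for the fraction of bases that no read touches. The read length of 500 bases and the read counts are illustrative assumptions; only the genome size comes from the text.

```python
import math


def coverage_stats(num_reads: int, read_length: int, genome_size: int):
    """Return (average coverage depth, expected fraction of bases never sequenced)."""
    # Average number of times each base is sequenced.
    coverage = num_reads * read_length / genome_size
    # Poisson approximation: chance that a given base is hit by zero reads.
    unsequenced = math.exp(-coverage)
    return coverage, unsequenced


GENOME_SIZE = 1_830_137   # H. influenzae genome size, from the text
READ_LENGTH = 500         # assumed read length, for illustration only

for num_reads in (5_000, 10_000, 25_000):  # assumed read counts
    depth, missing = coverage_stats(num_reads, READ_LENGTH, GENOME_SIZE)
    print(f"{num_reads:>6} reads: {depth:4.1f}x coverage, "
          f"{missing:.4%} of bases expected to be missed")
```

Under these assumptions, the expected unsequenced fraction falls off exponentially as the read count grows, which is why a modest amount of oversampling leaves only a few gaps to close by other means.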
Fragment alignment and gap closure
Using specialized computer programs, the sequenced DNA fragments were clustered and assembled into longer stretches of sequence by comparing the nucleotide sequence overlaps between the fragments. Two fragments were joined together to form a larger stretch of DNA if the ends of the sequences overlapped and matched (i.e., were the same). This overlap comparison process resulted in a set of larger contiguous nucleotide sequences, or contigs.
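The overlap-and-join idea can be sketched in a few lines. This is a toy greedy assembler, not the software TIGR used: it repeatedly merges the two sequences with the longest exact end overlap until no overlap of at least `min_overlap` bases remains. The function names and the `min_overlap` parameter are made up for this illustration, and the sketch ignores sequencing errors, reverse-complement reads, and repeats, all of which real assemblers must handle.

```python
def overlap_length(left: str, right: str, min_overlap: int) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Merge reads by their best exact end overlaps; return the resulting contigs."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        # Find the ordered pair with the longest suffix/prefix overlap.
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to join
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


reads = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
print(greedy_assemble(reads))
```

When two groups of reads share no overlap, the function returns more than one contig, which corresponds to the gaps that the next stage has to close.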
Finally, the contigs were aligned in the proper order to form the completed genome sequence. If gaps existed between two contigs, sometimes fragment samples with their ends in the two adjacent contigs were available. These fragments could be analyzed and the gaps filled in with their sequences. When this approach was not possible, a variety of other techniques were used to align contigs and fill in gaps.
For example, phage libraries containing large bacterial DNA fragments were constructed. The large fragments in these libraries overlapped the previously sequenced contigs. These fragments were then combined with oligonucleotide probes that matched the ends of the contigs to be aligned. If the probes bound to a library fragment, it could be used to prepare a stretch of DNA that represented the gap region. Overlaps between the sequence of the new fragment and the two contigs allowed the contigs to be placed side by side and the gap between them to be filled in.
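Gap closure with a bridging fragment can be sketched the same way. The example below assumes the simplest case: a fragment whose start matches the end of one contig exactly and whose end matches the start of the next. The helper names and the `min_overlap` threshold are hypothetical, and the probe hybridization step described above is not modeled; only the final sequence join is shown.

```python
def suffix_prefix_overlap(left: str, right: str, min_overlap: int) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def close_gap(left_contig: str, right_contig: str, bridge: str, min_overlap: int = 4):
    """Join two contigs through a bridging fragment, or return None if it does not span the gap."""
    left_ov = suffix_prefix_overlap(left_contig, bridge, min_overlap)
    right_ov = suffix_prefix_overlap(bridge, right_contig, min_overlap)
    if left_ov == 0 or right_ov == 0:
        return None  # the fragment does not anchor in both contigs
    # Keep the left contig whole, then add only the new bases from each piece.
    return left_contig + bridge[left_ov:] + right_contig[right_ov:]


left = "ATGGCCATTG"
right = "TTAGGCAACC"
bridge = "ATTGCGCGTTAG"  # starts inside `left`, ends inside `right`
print(close_gap(left, right, bridge))
```

A fragment that matches neither contig end yields `None`, which corresponds to the case where another library clone or another technique has to be tried.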
Editing
The sequence was then carefully proofread to resolve any ambiguities. It was also checked for unwanted frameshift mutations and corrected if necessary.
Some achievements of whole-genome sequencing
- The approach worked so well that it took less than 4 months to sequence the M. genitalium genome (about 500,000 base pairs in size). The shotgun technique has also been used successfully by Celera Genomics to sequence the human genome and the Drosophila genome.
- Annotation begins once the genome sequence has been established. The goal of annotation is to determine the location of specific genes in the genome map.
- Every open reading frame (ORF)—a reading frame sequence not interrupted by a stop codon—larger than 100 codons is considered to be a potential protein coding sequence.
- The sequence of each predicted ORF is then compared against large databases containing the nucleotide and amino acid sequences of known enzymes and other proteins. If a bacterial sequence matches one in the database, it is assumed to code for the same protein.
- Although this comparison process is not without errors, it can provide tentative function assignments for about 40 to 50% of the presumed coding regions.
- It also gives some information about transposable elements, operons, repeat sequences, the presence of various metabolic pathways, and other genome features.
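The ORF rule in the annotation steps above (a stretch not interrupted by a stop codon and longer than 100 codons) can be sketched as a simple scan. This is a minimal illustration, not an annotation pipeline: it assumes the standard stop codons TAA, TAG and TGA, scans only the three reading frames of the given strand, and follows the definition used here, so it does not look for a start codon. The function name and the `min_codons` parameter are assumptions made for this example; the opposite strand could be covered by running the same scan on the reverse complement.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard stop codons (assumed genetic code)


def stop_free_stretches(seq: str, min_codons: int = 100):
    """Yield (frame, start, end) for stop-free runs longer than `min_codons` codons."""
    seq = seq.upper()
    for frame in range(3):
        run_start = frame
        # Walk the sequence codon by codon in this reading frame.
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - run_start) // 3 > min_codons:
                    yield frame, run_start, pos
                run_start = pos + 3  # next run begins after the stop codon
        # Handle a run that reaches the end of the sequence without a stop.
        end = frame + ((len(seq) - frame) // 3) * 3
        if (end - run_start) // 3 > min_codons:
            yield frame, run_start, end


# Toy sequence with a deliberately small threshold so the output is non-empty.
toy = "ATG" + "GCT" * 5 + "TAA" + "GG"
for frame, start, end in stop_free_stretches(toy, min_codons=3):
    print(f"frame {frame}: bases {start}-{end} ({(end - start) // 3} codons)")
```

Each stretch reported this way is only a candidate; as the list above notes, the database comparison is what supplies a tentative function.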