8.3: Sequencing the Human Genome - Biology


4.3 Sequencing the Human Genome

In humans, each cell normally contains 23 pairs of chromosomes, for a total of 46. Twenty-two of these pairs, called autosomes, look the same in both males and females; the remaining pair consists of the sex chromosomes. Sex-determination systems differ among species: in chickens, for example, ZW individuals are female and ZZ individuals are male.

The total length of the human genome is over 3 billion base pairs. The genome also includes the mitochondrial DNA (Figure 4.12).

Figure 4.18 DNA Karyotype. The 22 autosomes are numbered by size. The other two chromosomes, X and Y, are the sex chromosomes. This picture of the human chromosomes lined up in pairs is called a karyotype. Karyotypes are prepared using standardized staining procedures that reveal characteristic structural features for each chromosome, usually from white blood cells.

Image by: U.S. National Library of Medicine


The first human genome sequences were published in nearly complete draft form in February 2001 by the Human Genome Project and Celera Corporation. Completion of the Human Genome Project's sequencing effort was announced in 2004 with the publication of a draft genome sequence. Researchers working on the Human Genome Project deciphered the human genome in three major ways: determining the order, or "sequence," of all the bases in our genome's DNA; making maps that show the locations of genes for major sections of all our chromosomes; and producing what are called linkage maps, through which inherited traits (such as those for genetic disease) can be tracked over generations.

Prior to the acquisition of the full genome sequence, estimates of the number of human genes ranged from 50,000 to 140,000 (with occasional vagueness about whether these estimates included non-protein coding genes). As genome sequence quality and the methods for identifying protein-coding genes improved, the count of recognized protein-coding genes dropped to 19,000-20,000. However, a fuller understanding of the role played by genes expressing regulatory RNAs that do not encode proteins has raised the total number of genes to at least 46,831, plus another 2300 micro-RNA genes. By 2012, functional DNA elements that encode neither RNA nor proteins have also been noted. Protein-coding sequences account for only a very small fraction of the genome (approximately 1.5%), and the rest is associated with non-coding RNA genes, regulatory DNA sequences, long interspersed nucleotide elements (LINEs), short interspersed nucleotide elements (SINEs), introns, and sequences for which as yet no function has been determined.

Recall that a gene is defined as a sequence of nucleotides in DNA or RNA that codes for a molecule that has a function. During gene expression, the DNA is first copied into RNA. The RNA can be directly functional or be the intermediate template for a protein that performs a function. Gene structure is the organisation of specialized sequence elements within a gene (Figure 4.19). Genes contain the information necessary for living cells to survive and reproduce. The processes of transcription, which leads to the production of RNA from the DNA template, and translation, which produces protein from the messenger RNA (mRNA) sequence, are controlled by specific sequence elements or regions within the gene. Every gene, therefore, requires multiple sequence elements to be functional. This includes the sequence that actually encodes the functional protein or ncRNA, as well as multiple regulatory sequence regions. These regions may be as short as a few base pairs or as long as many thousands of base pairs.

Much of gene structure is broadly similar between eukaryotes and prokaryotes. These common elements largely result from the shared ancestry of cellular life in organisms with roughly 3.8 billion years of evolution. Key differences in gene structure between eukaryotes and prokaryotes reflect their divergent transcription and translation machinery. Understanding gene structure is the foundation of understanding gene annotation, expression, and function.

Figure 4.19. The Process of Eukaryotic Gene Expression. The upper blue panel shows the structural elements common to eukaryotic genes. Gene transcription produces a messenger RNA (mRNA) molecule that must be modified post-transcriptionally (gray panel) to remove the non-coding intron sequences and add the 5' cap and poly-A tail. The mature mRNA is transported from the nucleus to the cytoplasm, where it is translated by the ribosome into the protein sequence (red panel).

Image from Wikipedia


The structures of both eukaryotic and prokaryotic genes involve several nested sequence elements. Each element has a specific function in the multi-step process of gene expression. The sequences and lengths of these elements vary, but the same general functions are present in most genes. Although DNA is a double-stranded molecule, typically only one of the strands encodes information that the RNA polymerase reads to produce protein-coding mRNA or non-coding RNA. This 'sense' or 'coding' strand, runs in the 5' to 3' direction where the numbers refer to the carbon atoms of the backbone's ribose sugar. The open reading frame (ORF) of a gene is therefore usually represented as an arrow indicating the direction in which the sense strand is read.
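The idea of a sense strand read 5' to 3' and an ORF running from a start codon to a stop codon can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names and the toy sequence are hypothetical, the sequence is written as the DNA sense strand, ATG is taken as the only start codon, and TAA, TAG and TGA as the stop codons.

```python
# Illustrative sketch only: names and the toy sequence are made up for this example.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the opposite strand, also written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def find_orfs(sense):
    """Scan the sense strand 5' to 3' in all three reading frames.

    Returns a list of (start, end, sequence) tuples, where each sequence
    begins with ATG and ends with the first in-frame stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(sense) - 2, 3):
            codon = sense[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orfs.append((start, i + 3, sense[start:i + 3]))
                start = None
    return orfs


toy_sense = "CCATGGCTTGGTAAGC"
print(find_orfs(toy_sense))            # one ORF: ATG GCT TGG TAA
print(reverse_complement(toy_sense))   # the template (antisense) strand
```

Real gene finders also scan the reverse-complement strand and apply length and sequence-context filters; this sketch looks only at the strand it is given.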

Regulatory sequences are located at the extremities of genes. These sequence regions can either be next to the transcribed region (the promoter) or separated by many kilobases (enhancers and silencers). The promoter is located at the 5' end of the gene and is composed of a core promoter sequence and a proximal promoter sequence. The core promoter marks the start site for transcription by binding RNA polymerase and other proteins necessary for copying DNA to RNA. The proximal promoter region binds transcription factors that modify the affinity of the core promoter for RNA polymerase. Genes may be regulated by multiple enhancer and silencer sequences that further modify the activity of promoters by binding activator or repressor proteins. Enhancers and silencers may be distantly located from the gene, many thousands of base pairs away. The binding of different transcription factors, therefore, regulates the rate of transcription initiation at different times and in different cells.

Regulatory elements can overlap one another, with a section of DNA able to interact with many competing activators and repressors as well as RNA polymerase. For example, some repressor proteins can bind to the core promoter to prevent polymerase binding. For genes with multiple regulatory sequences, the rate of transcription is the product of all of the elements combined. Binding of activators and repressors to multiple regulatory sequences has a cooperative effect on transcription initiation.

An additional layer of regulation occurs for protein coding genes after the mRNA has been processed to prepare it for translation to protein. Only the region between the start and stop codons encodes the final protein product. The flanking untranslated regions (UTRs) contain further regulatory sequences. The 3' UTR contains a terminator sequence, which marks the endpoint for transcription and releases the RNA polymerase. The 5’ UTR binds the ribosome, which translates the protein-coding region into a string of amino acids that fold to form the final protein product. In the case of genes for non-coding RNAs the RNA is not translated but instead folds to be directly functional.

The structure of eukaryotic genes includes features not found in prokaryotes. Most of these relate to post-transcriptional modification of pre-mRNAs to produce mature mRNA ready for translation into protein. Eukaryotic genes typically have more regulatory elements to control gene expression compared to prokaryotes. This is particularly true in multicellular eukaryotes, where gene expression varies widely among different tissues.

A key feature of the structure of eukaryotic genes is that their transcripts are typically subdivided into exon and intron regions. Exon regions are the coding portion of the mRNA and are retained in the final mature mRNA molecule, while intron regions are non-coding and are spliced out (excised) during post-transcriptional processing. Indeed, the intron regions of a gene can be considerably longer than the exon regions. Once spliced together, the exons form a single continuous protein-coding region, and the splice boundaries are not detectable. Eukaryotic post-transcriptional processing also adds a 5' cap to the start of the mRNA and a poly-adenosine tail (poly-A-tail) to the end of the mRNA. These additions stabilize the mRNA and direct its transport from the nucleus to the cytoplasm.
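The processing steps described above (excise introns, join exons, add a cap and a poly-A tail) can be mimicked on a toy string. This is a minimal sketch, not a model of the spliceosome: the function name, the toy sequences, the exon coordinates, the "m7G-" text label standing in for the cap, and the tail length are all assumptions made for illustration.

```python
# Toy model of eukaryotic mRNA processing; all sequences here are invented.
def process_pre_mrna(pre_mrna, exons, poly_a_length=5):
    """Join the exon intervals and decorate the result.

    pre_mrna      -- primary transcript as an RNA string (A, C, G, U)
    exons         -- list of (start, end) half-open intervals, in order
    poly_a_length -- number of adenosines appended (illustrative value)
    """
    # Splicing: keep only the exon intervals; the introns between them are dropped.
    spliced = "".join(pre_mrna[start:end] for start, end in exons)
    # "m7G-" is just a text label standing in for the 5' cap structure.
    return "m7G-" + spliced + "A" * poly_a_length, spliced


exon1 = "AUGGCC"
intron = "GUAAGUCCCAG"   # toy intron, starting with GU and ending with AG
exon2 = "UUUUAA"
pre = exon1 + intron + exon2

# Exon coordinates derived from the pieces above.
exon_coords = [
    (0, len(exon1)),
    (len(exon1) + len(intron), len(pre)),
]

mature, coding = process_pre_mrna(pre, exon_coords)
print(coding)   # AUGGCCUUUUAA: exons joined, splice boundary no longer visible
print(mature)   # m7G-AUGGCCUUUUAAAAAAA
```

Once the exons are joined, nothing in the spliced string marks where the intron used to be, which is the point made in the paragraph above.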

The overall organization of prokaryotic genes is markedly different from that of the eukaryotes. The most obvious difference is that prokaryotic ORFs are often grouped into a structure called a polycistronic operon, under the control of a shared set of regulatory sequences (Figure 4.20). These ORFs are all transcribed onto the same mRNA and so are co-regulated and often serve related functions. Each ORF typically has its own ribosome binding site (RBS) so that ribosomes can simultaneously translate the different ORFs on the same mRNA. Some operons also display translational coupling, where the translation rates of multiple ORFs within an operon are linked. This can occur when the ribosome remains attached at the end of an ORF and simply translocates along to the next without the need for a new RBS. Translational coupling is also observed when translation of an ORF affects the accessibility of the next RBS through changes in RNA secondary structure. Having multiple ORFs on a single mRNA is only possible in prokaryotes because their transcription and translation take place at the same time and in the same subcellular location.

Figure 4.20 The Process of Prokaryotic Gene Expression. The upper blue panel displays the organization of a typical prokaryotic polycistronic operon, wherein multiple genes are regulated by common regulatory elements and are transcribed as a single mRNA. Unlike in eukaryotic systems, there is little to no post-transcriptional modification of the resulting mRNA, and protein translation (red panel) often begins before transcription is complete.

Image from Wikipedia


The operator sequence next to the promoter is the main regulatory element in prokaryotes. Repressor proteins bound to the operator sequence physically obstruct the RNA polymerase enzyme, preventing transcription. Riboswitches are another important regulatory sequence commonly present in prokaryotic UTRs. These sequences switch between alternative secondary structures in the RNA depending on the concentration of key metabolites. The secondary structures then either block or reveal important sequence regions such as ribosomal binding sites. Introns are extremely rare in prokaryotes and therefore do not play a significant role in prokaryotic gene regulation.


Overall, the packaging, unpackaging, replication and transcription of DNA is a highly dynamic process that is constantly being moderated by signals and cues from the environment. The following video created by Drew Berry for WEHI.tv provides one of the most dynamic views of the major packaging and processing of DNA within the cell. In later chapters we will explore the processes of DNA replication and transcription in greater detail.

[video width="1280" height="720" mp4="https://wou.edu/chemistry/files/2020...ation-ever.mp4"][/video]

Created by Drew Berry and Etsuko Uno and sponsored by the Walter and Eliza Hall Institute of Medical Research


Sequencing the Human Genome

The human genome is the complete set of genetic information, stored as DNA within the nucleus of nearly every one of the trillions of cells in the human body. Every person’s genome is different and is a large part of what makes us into unique individuals. The first effort to decode the human genome, considered a draft sequence, resulted in its publication in 2001. Six years later a high-quality sequence of a single individual, called a diploid genome, was published, containing all genetic information from both parents.


WHY DO WE SEQUENCE?

Sequencing is the set of methods and biochemical techniques used to determine the order of nucleotides (A, T, C and G) in DNA. Its objective is to establish the order of all the nucleotides in an organism's DNA.

The first organisms sequenced were two bacteria, Haemophilus influenzae and Mycoplasma genitalium, in 1995. One year later, the genome of a fungus (Saccharomyces cerevisiae) was sequenced.

Further eukaryotic sequencing projects followed: Caenorhabditis elegans (a nematode) was sequenced in 1998, Drosophila melanogaster (the fruit fly) in 2000, and the human genome in 2001.

But why do we sequence? In the case of the human genome, the knowledge is needed to help alleviate or prevent diseases.

Some of the organisms sequenced are model organisms, chosen for their:

  • Medical importance: some are pathogens, and we know the diseases they can cause.
  • Economic importance: organisms that humans eat can be improved with molecular techniques.
  • Study of evolution: in 2007 more than 11 species of Drosophila were sequenced in an attempt to understand the evolutionary relationships between their chromosomes. Similar work has been done in mammals (ENCORE Project).

Scientists Say They’ve Finally Sequenced the Entire Human Genome. Yes, All of It.

Twenty-one years ago, researchers announced the first “draft” of sequencing the complete human genome. It was a monumental achievement, but the sequence was still missing about 8 percent of the genome. Now, scientists working together around the world say they’ve finally filled in that reclusive 8 percent.

If their work holds up to peer review and it turns out they really did sequence and assemble the human genome in its entirety, gaps and all, it could change the future of medicine.

What’s in a Genome?

Sequencing the human genome has long been a huge project with worthy goals. Why? Because as humans understand their genetic code better, they can make better, more customized medicines, including the kind of gene-focused medicine that powered the first effective COVID-19 vaccines.

Humans have 46 chromosomes, in 23 pairs, that represent tens of thousands of individual genes. Each gene consists of some number of base pairs made of adenine (A), thymine (T), guanine (G), and cytosine (C). There are billions of base pairs in the human genome.

In June 2000, the Human Genome Project (HGP) and private company Celera Genomics announced that first “draft” of the human genome. This was the result of years of work that picked up the pace as humans continued to make better computers and algorithms for processing the genome. At the time, scientists were surprised that of the over 3 billion individual “letters” of base pairs, they estimated humans have just 30,000 to 35,000 genes. Today, that number is far lower, hovering just above 20,000.

Three years later, HGP completed its mission to map the whole human genome and defined its terms this way:

“Current technology” is doing a lot of heavy lifting here. At the time, HGP used a process called bacterial artificial chromosome (BAC), where scientists used a bacterium to clone each piece of the genome, and then study them in smaller groups. A complete “BAC library” is 20,000 carefully prepared bacteria with cloned genes inside.

But that BAC process inherently misses some portions of the whole genome. The reason why is a great lead-in to what the new team of scientists has helped to accomplish.

A Sequencing Breakthrough

What’s lurking in the secretive 8 percent of the genome that the 2000 “draft” of the genome left untouched? The base pairs in this section are made of many, many repeated patterns that just made it too unwieldy to study using the bacteria cloning method.

BAC and other approaches just weren’t right for the repeats-heavy remaining 8 percent of the genome. “The current workhorse DNA sequencers, made by Illumina, take little fragments of DNA, decode them, and reassemble the resulting puzzle,” Stat’s Matthew Herper reports. “This works fine for most of the genome, but not in areas where DNA code is the result of long repeating patterns.”

That makes intuitive sense: imagine counting from 1 to 50 versus simply counting 1, 2, 1, 2, . . . over and over again. Part of what made the BAC method successful is that scientists took care to minimize and match up the overlaps, which became almost impossible in the repeats-heavy unexplored portion of the genome.
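The counting analogy can be made concrete with a small Python sketch. It is only a toy: the "genome" string, the reads, and the function name are invented for illustration, and real assemblers work with overlaps and error-prone reads rather than exact matching. The point it shows is that a short read taken from inside a repeat fits in many places, while a read long enough to span the whole repeat plus some unique flanking sequence fits in exactly one.

```python
# Toy illustration of why repeats defeat short reads; all sequences are invented.
def count_placements(genome, read):
    """Count the positions where the read matches the genome exactly
    (overlapping matches are counted separately)."""
    width = len(read)
    return sum(
        1
        for i in range(len(genome) - width + 1)
        if genome[i:i + width] == read
    )


unique_left = "ACGTTGCA"
repeat = "AT" * 10          # a 20-base stretch of the same two letters
unique_right = "GGCC"
genome = unique_left + repeat + unique_right

short_read = "ATAT"                        # lies entirely inside the repeat
long_read = "TGCA" + repeat + "GG"         # spans the repeat and both flanks

print(count_placements(genome, short_read))  # 9 possible placements
print(count_placements(genome, long_read))   # 1 placement
```

With nine equally good placements, the short read says nothing about how long the repeat is or where it sits; the single placement of the long read does.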

So, what’s different in the new approaches? Let’s first look at what they are. The California-based Pacific Biosciences (PacBio) and the U.K.-based Oxford Nanopore have different technologies, but are racing toward the same goal.

PacBio uses a system called HiFi, where base pairs are circulated, literally as circles, until they’re read in full and in high fidelity, hence the name. The system dates back just a few years and represents a big step forward in both length and accuracy for those longer sequences.

Oxford Nanopore, meanwhile, uses electrical current in its proprietary devices. Strands of base pairs are pressed through a microscopic nanopore, just one molecule at a time, where a current zaps them in order to observe what kind of molecule they are. By zapping each molecule, scientists can identify the full strand.

In the new study published in the biology preprint server bioRxiv, an international consortium of about 100 scientists used both PacBio and Oxford Nanopore technologies to chase down some of the remaining unknown sections of the human genome.

The amount of ground the consortium covered is staggering. “The consortium said that it increased the number of DNA bases from 2.92 billion to 3.05 billion, a 4.5 [percent] increase. But the count of genes increased by just 0.4 [percent], to 19,969,” Stat reports. This shows how big the heavily repeating base pair sequences in this zone are compared to the genes they represent.
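As a quick check on the quoted figures, the arithmetic behind the base-count increase can be reproduced in Python. The function and variable names are hypothetical, and the inputs are the rounded numbers from the quotation, so the result is only as precise as those.

```python
def percent_increase(old, new):
    """Relative growth from old to new, expressed in percent."""
    return (new - old) / old * 100


bases_before = 2.92e9   # bases in the earlier assembly, as quoted
bases_after = 3.05e9    # bases reported by the consortium, as quoted

added_bases = bases_after - bases_before
growth = percent_increase(bases_before, bases_after)

print(f"added bases: {added_bases / 1e6:.0f} million")   # added bases: 130 million
print(f"growth: {growth:.2f}%")                          # growth: 4.45%
```

The computed value of about 4.45 percent rounds to the 4.5 percent quoted, which is more than ten times the 0.4 percent growth in the gene count.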

The Missing Links

Sequencing godfather George Church, a biologist at Harvard University, told Stat if this work goes through peer review successfully, it will be the first time any vertebrate genome has been fully mapped. And the reason seems to be simply that both new technologies allow very long strings of base pairs to be read at once.

Why is the missing gene information so important? Well, the study of genes experiences a lot of favoritism, with a handful of most popular genes taking up the bulk of research interest and funding. The overlooked genes hold a lot of key mechanisms that cause disease, for example.

There’s one little snag, although it was also a snag for the 2000 announcement of the first draft of the genome. Both projects studied cells that had just 23 chromosomes instead of the full 46. That’s because they use cells derived from the reproductive system, where eggs and sperm each carry half of a full chromosomal load.

The cell is from a hydatidiform mole, a kind of reproductive growth that represents an extremely early, unviable union between a sperm and an egg cell that has no nucleus. Choosing this kind of cell, which has been kept and cultured as a “cell line” used for research purposes, cuts the huge sequencing job in half.

The next step is for the study to appear in a peer-reviewed publication. After that, though, both PacBio and Oxford seek to sequence the entire 46-chromosome human genome. But we might be waiting a while.


Discussion

We presented SMOOTH-seq, a novel method for single-cell genome analysis based on SMRT sequencing technology, which enables accurate SV detection by taking advantage of long high-fidelity reads. As with other scWGS methods, CNV and SNV information can be obtained from an individual cell. Ideally, individual cells would be sequenced to sufficient depth to allow a comprehensive evaluation of a method. However, because SMRT sequencing is currently much more expensive than NGS, it is unrealistic to sequence a cell by SMOOTH-seq to the depth that NGS achieves. In this study, we pooled about 16 cells in a sequencing library, which could generate around 400Gb of data per run on the PacBio Sequel II platform. However, this yields only about 15Gb of CCS data, corresponding to about 1Gb for each individual cell. Even so, the total cost for one individual cell was ~$260, which makes the method very expensive for large numbers of cells.

Previous applications used Tn5 transposition with two adaptors of different sequences to prevent fragment self-looping. However, only half of the DNA fragments will be tagged by different adaptors on both ends, leading to a 50% loss of the original DNA fragments during sequencing. In SMOOTH-seq, we used just one Tn5 adaptor sequence to avoid this loss of original DNA fragments. Meanwhile, the long fragments generated by a low concentration of Tn5 transposase reduce the chance of self-looping. It is very unlikely that the same genomic sites on different alleles are cut identically by the transposases by chance; thus, we can easily remove duplicated reads arising from overamplification of the same gDNA fragment, as the previous transposition-based WGS method did [8]. However, as a trade-off, we obtained limited genomic coverage (an average of 19% of the genome) and sequencing depth (an average of 0.4× depth) for each individual cell. For the same reason, the sensitivity and accuracy of calling CNVs and SNVs should also improve with more sequencing data for each cell. This would no longer be a problem if the cost of TGS sequencing decreases to the same level as that of NGS in the future.
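The 50% figure for the two-adaptor design follows from simple counting, sketched below in Python. The sketch assumes, as a simplification not stated in the text, that each fragment end receives one of the two adaptors independently and with equal probability; the adaptor labels and the function name are hypothetical.

```python
from fractions import Fraction
from itertools import product


def fraction_with_different_ends(adaptors):
    """Fraction of fragments whose two ends carry different adaptors,
    assuming each end is tagged independently and uniformly at random."""
    end_pairs = list(product(adaptors, repeat=2))       # (left end, right end)
    different = [pair for pair in end_pairs if pair[0] != pair[1]]
    return Fraction(len(different), len(end_pairs))


# Two adaptors: the end pairs are AA, AB, BA, BB, and only AB and BA qualify.
print(fraction_with_different_ends(["A", "B"]))   # 1/2
```

Under that assumption, half of the fragments (AA and BB) are lost in a two-adaptor workflow. With the single adaptor used in SMOOTH-seq every fragment has the same adaptor on both ends by design, so this particular loss does not arise.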

There is no doubt that SMOOTH-seq performed well in SV detection, especially for insertion, translocation, and duplication events in both cell lines and in vivo cancer cells. We can directly capture the complete variant structures instead of deducing them by read assembly. This has additional advantages when there are repeats inside the variant structures. We infer that this is why we observed many duplication events flanking the telomere of the chromosome [34], while NGS-based methods have difficulty detecting such SV events. PCR chimera artifacts [24, 35, 36] could not be distinguished from SVs such as translocation events, but these random events could be excluded by requiring that SVs be identified in multiple individual cell samples.

We are excited that SMOOTH-seq offers a way to investigate ecDNAs, a recently identified type of SV, at the single-cell level. A single transposition event on a single ecDNA molecule made it possible to capture the full-length ecDNA. When the ecDNA is relatively short (generally no more than 10kb), a single sequencing read can recover its full-length sequence. However, longer candidate ecDNAs can only be detected by their circularization sites, which are difficult to distinguish from large duplication events, and currently there is no effective way to distinguish ecDNAs from duplication events. In this analysis, by removing the candidates that overlapped with duplication events, we eliminated the interference from duplication events to a certain extent. It is worth mentioning that we assumed and captured ecDNAs with the simplest sequence structures; ecDNAs with more complex structures could probably be found once more appropriate analysis pipelines for long sequencing reads are available. In all, SMOOTH-seq represents a breakthrough in scWGS analysis, especially facilitating single-cell SV and ecDNA detection.


Researchers claim to have sequenced the entire human genome

A large international team of researchers is claiming to have at last sequenced the entire human genome. Collectively, the team is known as the Telomere-to-Telomere (T2T) Consortium, and they have written a paper describing their efforts and have uploaded it to the bioRxiv preprint server.

Back in 2000, a team from the Human Genome Project working with biotech firm Celera Genomics announced that they had completed the first draft of a sequencing of the human genome. That first draft was missing approximately 15% of the genome. Work since that time has brought the percentage down to just 8%. And now, the T2T Consortium is claiming to have reduced it to zero—though there are still some caveats. The group acknowledges that they had some trouble with approximately 0.3% of the genome, and there might be a few errors here and there—but no gaps, which is why the group have called themselves the Telomere-to-Telomere Consortium. As part of their work, the team discovered approximately 115 new genes that code for proteins, giving a new total of 19,969.

The researchers suggest their work was only possible because they had use of new technology developed by Oxford Nanopore and Pacific Biosciences—the new technology allowed for sequencing without cutting the DNA into pieces. It instead ran the DNA through a nanoscopic hole with lasers reading the sequences repeatedly to reduce errors. They also claim that the number of known bases has now grown from 2.92 billion to 3.05 billion and the number of known genes has increased by 0.4%.

The researchers also note that the genome they sequenced did not come from a person but rather from a hydatidiform mole, which is a growth that on rare occasions forms on the inside of a woman's uterus. Such growths occur when a sperm manages to fertilize an egg that has no nucleus; thus, it has just 23 chromosomes rather than the normal 46 found in most human cells. The researchers chose to sequence the hydatidiform mole because it made their computations easier.

The results have yet to be peer reviewed, which is why the genomics community is holding off on commenting—in the meantime, the T2T team is making plans to take their work further by sequencing multiple people from across the globe.



Timeline of Costs

How much did it cost to generate the first human genome sequence as part of the Human Genome Project?

The HGP generated a 'reference' sequence of the human genome - specifically, it sequenced one representative version of all parts of each human chromosome (totaling ~3 billion bases). In the end, the quality of the 'finished' sequence was very high, with an estimated error rate of <1 in 100,000 bases (note that this quality is much higher than that of a typical human genome sequence produced today). The generated sequence did not come from one person's genome, and, being a 'reference' sequence of ~3 billion bases, really reflects half of what is generated when an individual person's ~6-billion-base genome is sequenced (see below).

The HGP involved first mapping and then sequencing the human genome. The former was required at the time because there was otherwise no 'framework' for organizing the actual sequencing or the resulting sequence data. The maps of the human genome served as 'scaffolds' on which to connect individual segments of assembled DNA sequence. These genome-mapping efforts were quite expensive, but were essential at the time for generating an accurate genome sequence. It is difficult to estimate the costs associated with the 'human genome mapping phase' of the HGP, but it was certainly in the many tens of millions of dollars (and probably hundreds of millions of dollars).

Once significant human genome sequencing began for the HGP, a 'draft' human genome sequence (as described above) was produced over a 15-month period (from April 1999 to June 2000). The estimated cost for generating that initial 'draft' human genome sequence is ~$300 million worldwide, of which NIH provided roughly 50-60%.

The HGP then proceeded to refine the 'draft' and produce a 'finished' human genome sequence (as described above), which was achieved by 2003. The estimated cost for advancing the 'draft' human genome sequence to the 'finished' sequence is ~$150 million worldwide. Of note, generating the final human genome sequence by the HGP also relied on the sequences of small targeted regions of the human genome that were generated before the HGP's main production-sequencing phase; it is impossible to estimate the costs associated with these various other genome-sequencing efforts, but they likely total in the tens of millions of dollars.

The above explanation illustrates the difficulty in coming up with a single, accurate number for the cost of generating that first human genome sequence as part of the HGP. Such a calculation requires a clear delineation about what does and does not get 'counted' in the estimate; further, most of the cost estimates for individual components can only be given as ranges. At the lower bound, it would seem that this cost figure is at least $500 million; at the upper bound, this cost figure could be as high as $1 billion. The truth is likely somewhere in between.
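One way to read these bounds is to subtract the two components that do have point estimates (the draft and finishing phases) and see what is left over for everything else. The Python sketch below does that arithmetic. Treating the remainder as the combined cost of mapping and earlier targeted sequencing is an interpretation made here for illustration, not NHGRI's own accounting, and the constant and function names are hypothetical.

```python
# All amounts in millions of US dollars, taken from the approximate figures quoted above.
DRAFT_COST = 300        # initial 'draft' sequence
FINISHING_COST = 150    # advancing the 'draft' to the 'finished' sequence
LOWER_BOUND = 500       # stated lower bound on the total
UPPER_BOUND = 1000      # stated upper bound on the total


def implied_other_costs(total):
    """Amount left for mapping and other sequencing once the draft and
    finishing estimates are subtracted from a given total."""
    return total - (DRAFT_COST + FINISHING_COST)


low = implied_other_costs(LOWER_BOUND)
high = implied_other_costs(UPPER_BOUND)
print(f"implied remainder: ${low} million to ${high} million")
# implied remainder: $50 million to $550 million
```

That remainder is consistent with the text's description of the mapping phase as costing many tens of millions and probably hundreds of millions of dollars, plus tens of millions for earlier targeted sequencing.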

The above estimated cost for generating the first human genome sequence by the HGP should not be confused with the total cost of the HGP. The originally projected cost for the U.S.'s contribution to the HGP was $3 billion; in actuality, the Project ended up taking less time (~15 years) and requiring less funding (~$2.7 billion). But the latter number represents the total U.S. funding for a wide range of scientific activities under the HGP's umbrella beyond human genome sequencing, including technology development, physical and genetic mapping, model organism genome mapping and sequencing, bioethics research, and program management. Further, this amount does not reflect the additional funds for an overlapping set of activities pursued by other countries that participated in the HGP.

As the HGP was nearing completion, genome-sequencing pipelines had stabilized to the point that NHGRI was able to collect fairly reliable cost information from the major sequencing centers funded by the Institute. Based on these data, NHGRI estimated that the hypothetical 2003 cost to generate a 'second' reference human genome sequence using the then-available approaches and technologies was in the neighborhood of $50 million.

How much did it cost to sequence a human genome in 2006 (i.e., roughly a decade ago)?

Since the completion of the HGP and the generation of the first 'reference' human genome sequence, efforts have increasingly shifted to the generation of human genome sequences from individual people. Sequencing an individual's 'personal' genome actually involves establishing the identity and order of ~6 billion bases of DNA (rather than a ~3-billion-base 'reference' sequence; see above). Thus, the generation of a person's genome sequence is a notably different endeavor than what the HGP did.

Within a few years following the end of the HGP (e.g., in 2006), the landscape of genome sequencing was beginning to change. While revolutionary new DNA sequencing technologies, such as those in use today, were not quite implemented at that time, genomics groups continued to refine the basic methodologies used during the HGP and continued lowering the costs for genome sequencing. Considerable efforts were being made to the sequencing of nonhuman genomes (much more so than human genomes), but the cost-accounting data collected at that time can be used to estimate the approximate cost that would have been associated with human genome sequencing at that time.

Based on data collected by NHGRI from the Institute's funded genome-sequencing groups, the cost to generate a high-quality 'draft' human genome sequence had dropped to about $14 million by 2006. Hypothetically, it would have likely cost upwards of $20-25 million to generate a 'finished' human genome sequence - expensive, but still considerably less so than for generating the first reference human genome sequence.

How much does it cost to sequence a human genome in 2016 (i.e., today)?

The decade following the HGP brought revolutionary advances in DNA sequencing technologies that are fundamentally changing the nature of genomics. So-called 'next-generation' DNA sequencing methods arrived on the scene, and their effects quickly became evident in terms of lowering genome-sequencing costs (note that these NHGRI-collected data are 'retroactive' in nature, and do not always accurately reflect the 'projected' costs for genome sequencing going forward).

In 2015, the most common routine for sequencing an individual's human genome involves generating a 'draft' sequence and comparing it to a reference human genome sequence, so as to catalog all sequence variants in that genome; such a routine does not involve any sequence finishing. In short, nearly all human genome sequencing in 2015 yields high-quality 'draft' (but unfinished) sequence. That sequencing is typically targeted to all exons (whole-exome sequencing) or aimed at the entire roughly 6-billion-base genome (whole-genome sequencing), as discussed above. The quality of the resulting 'draft' sequences is heavily dependent on the amount of average base redundancy provided by the generated data (with higher redundancy costing more).
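The link between redundancy and cost can be made concrete with a small calculation. The sketch below is illustrative only: it assumes that average base redundancy is the total number of sequenced bases divided by the genome size, and the read length (150 bases), the target redundancies (10-fold and 30-fold), and the choice of a 3-billion-base reference as the denominator are assumptions for the example, not figures from NHGRI.

```python
import math


def average_redundancy(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is covered by the generated reads."""
    return num_reads * read_length / genome_size


def reads_needed(target_redundancy: int, read_length: int, genome_size: int) -> int:
    """Smallest number of reads whose combined length reaches the target redundancy."""
    total_bases = target_redundancy * genome_size
    return math.ceil(total_bases / read_length)


# Assumed, illustrative inputs (not taken from the NHGRI cost data).
GENOME_SIZE = 3_000_000_000  # size of the 'reference' sequence, in bases
READ_LENGTH = 150            # hypothetical read length, in bases

for redundancy in (10, 30):
    n = reads_needed(redundancy, READ_LENGTH, GENOME_SIZE)
    print(f"{redundancy}-fold redundancy needs about {n:,} reads of {READ_LENGTH} bases")
```

Because the number of reads grows in direct proportion to the target redundancy, tripling the redundancy triples the amount of raw sequence that must be generated, which is why higher-redundancy drafts cost more.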

Adding to the complex landscape of genome sequencing in 2015 has been the emergence of commercial enterprises offering genome-sequencing services at competitive pricing. Direct comparisons between commercial and academic genome-sequencing operations can be particularly challenging because of the many nuances about what each includes in any cost estimates (with such details often not revealed by private companies). The cost data that NHGRI collects from its funded genome-sequencing groups include information about a wide range of activities and components, such as: reagents, consumables, DNA-sequencing instruments, certain computer equipment, other equipment, laboratory pipeline development, laboratory information management systems, initial data processing, submission of data to public databases, project management, utilities, other indirect costs, labor, and administration. Note that such cost-accounting does not typically include activities such as quality assurance/quality control (QA/QC), alignment of generated sequence to a reference human genome, sequence assembly, genomic variant calling, or annotation. Almost certainly, companies vary in terms of which of the items in the above lists get included in any cost estimates, making direct cost comparisons with academic genome-sequencing groups difficult. It is thus important to consider these variables - along with the distinction between retrospective and projected costs - when comparing genome-sequencing costs claimed by different groups. Anyone comparing costs for genome sequencing should also be aware of the distinction between 'price' and 'cost' - a given price may be either higher or lower than the actual cost.

Based on the data collected from NHGRI-funded genome-sequencing groups, the cost to generate a high-quality 'draft' whole human genome sequence in mid-2015 was just above $4,000; by late in 2015, that figure had fallen below $1,500. The cost to generate a whole-exome sequence was generally below $1,000. Commercial prices for whole-genome and whole-exome sequences have often (but not always) been slightly below these numbers.
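To put the scale of the decline in perspective, the short sketch below computes fold reductions from the estimates quoted in this section. It is a rough illustration rather than an NHGRI calculation: the figures are rounded ("just above $4,000" is entered as 4,000 and "below $1,500" as 1,500), and the 2003 estimate refers to a 'second' reference sequence while the later ones refer to 'draft' sequences, so the products being compared are not identical.

```python
# Approximate per-genome cost estimates quoted in the text, in U.S. dollars.
# Values are rounded from phrases such as "just above" and "below".
estimates = [
    ("2003, hypothetical 'second' reference sequence", 50_000_000),
    ("2006, high-quality draft", 14_000_000),
    ("mid-2015, high-quality draft", 4_000),
    ("late 2015, high-quality draft", 1_500),
]


def fold_reduction(earlier_cost: float, later_cost: float) -> float:
    """How many times cheaper the later estimate is than the earlier one."""
    return earlier_cost / later_cost


baseline_label, baseline_cost = estimates[0]
for label, cost in estimates[1:]:
    ratio = fold_reduction(baseline_cost, cost)
    print(f"{label}: ${cost:,} (roughly {ratio:,.0f}-fold below the 2003 estimate)")
```

Even with these caveats, the drop from millions of dollars to a few thousand corresponds to a reduction of about four orders of magnitude in roughly a decade.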

How much did it cost to generate the first human genome sequence as part of the Human Genome Project?

The HGP generated a 'reference' sequence of the human genome - specifically, it sequenced one representative version of all parts of each human chromosome (totaling roughly 3 billion bases). In the end, the quality of the 'finished' sequence was very high, with an estimated error rate of <1 in 100,000 bases (note that this quality is much higher than that of a typical human genome sequence produced today). The generated sequence did not come from one person's genome, and, being a 'reference' sequence of roughly 3 billion bases, really reflects half of what is generated when an individual person's roughly 6-billion-base genome is sequenced (see the discussion of sequencing an individual's genome).

The HGP involved first mapping and then sequencing the human genome. The former was required at the time because there was otherwise no 'framework' for organizing the actual sequencing or the resulting sequence data. The maps of the human genome served as 'scaffolds' on which to connect individual segments of assembled DNA sequence. These genome-mapping efforts were quite expensive, but were essential at the time for generating an accurate genome sequence. It is difficult to estimate the costs associated with the 'human genome mapping phase' of the HGP, but it was certainly in the many tens of millions of dollars (and probably hundreds of millions of dollars).

Once significant human genome sequencing began for the HGP, a 'draft' human genome sequence (as described above) was produced over a 15-month period (from April 1999 to June 2000). The estimated cost for generating that initial 'draft' human genome sequence is about $300 million worldwide, of which NIH provided roughly 50-60%.

The HGP then proceeded to refine the 'draft' and produce a 'finished' human genome sequence (as described above), which was achieved by 2003. The estimated cost for advancing the 'draft' human genome sequence to the 'finished' sequence is about $150 million worldwide. Of note, generating the final human genome sequence by the HGP also relied on the sequences of small targeted regions of the human genome that were generated before the HGP's main production-sequencing phase; it is impossible to estimate precisely the costs associated with these various other genome-sequencing efforts, but they likely total in the tens of millions of dollars.

The above explanation illustrates the difficulty in coming up with a single, accurate number for the cost of generating that first human genome sequence as part of the HGP. Such a calculation requires a clear delineation about what does and does not get 'counted' in the estimate; further, most of the cost estimates for individual components can only be given as ranges. At the lower bound, it would seem that this cost figure is at least $500 million; at the upper bound, this cost figure could be as high as $1 billion. The truth is likely somewhere in between.
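The arithmetic behind these bounds can be sketched as follows. The draft and finishing figures and the two bounds come from the estimates above; the function names are hypothetical, and the calculation simply shows how much of each bound would be left for the components that the text can only describe loosely (genome mapping and the earlier targeted sequencing). It is not an official breakdown.

```python
# All amounts are in millions of U.S. dollars, taken from the estimates in the text.
DRAFT_COST = 300          # initial 'draft' sequence, worldwide
FINISHING_COST = 150      # advancing the 'draft' to the 'finished' sequence
TOTAL_LOWER_BOUND = 500   # "at least $500 million"
TOTAL_UPPER_BOUND = 1000  # "as high as $1 billion"


def implied_other_costs(total: int, draft: int = DRAFT_COST,
                        finishing: int = FINISHING_COST) -> int:
    """Portion of a total left over after the draft and finishing estimates."""
    return total - (draft + finishing)


def nih_share_of_draft(percent: int, draft: int = DRAFT_COST) -> int:
    """NIH contribution to the draft, given its share as a whole-number percentage."""
    return draft * percent // 100


low = implied_other_costs(TOTAL_LOWER_BOUND)
high = implied_other_costs(TOTAL_UPPER_BOUND)
print(f"Draft plus finishing: ${DRAFT_COST + FINISHING_COST} million")
print(f"Implied mapping and other sequencing costs: ${low}-{high} million")
print(f"NIH share of the draft: ${nih_share_of_draft(50)}-{nih_share_of_draft(60)} million")
```

The implied remainder is consistent with the description of mapping costs as "many tens of millions" to "hundreds of millions" of dollars plus tens of millions for the earlier targeted sequencing.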

The above estimated cost for generating the first human genome sequence by the HGP should not be confused with the total cost of the HGP. The originally projected cost for the U.S.'s contribution to the HGP was $3 billion; in actuality, the Project ended up taking less time (about 15 years) and requiring less funding - about $2.7 billion. But the latter number represents the total U.S. funding for a wide range of scientific activities under the HGP's umbrella beyond human genome sequencing, including technology development, physical and genetic mapping, model organism genome mapping and sequencing, bioethics research, and program management. Further, this amount does not reflect the additional funds for an overlapping set of activities pursued by other countries that participated in the HGP.


