When was the HGP finished?
The problem is that the pieces from repeating regions all look nearly the same. The role of centromere sequences, like many other repeating regions, is not yet fully understood, but they are most classically known as the key to cell division. When a cell divides in two, a protein spindle attaches to the centromeres, yanking the chromosomes apart to make sure that each cell gets the right number.
When this goes wrong in eggs or sperm, babies can be born with chromosomal anomalies such as Down syndrome or Turner syndrome. When it goes wrong in other parts of the body, we can end up with blood cells, for example, that have too many or too few chromosomes. Examining the sequences in those regions might yield new clues to chromosomal anomalies. The repeat-rich short arms of the chromosomes are similarly mysterious. They definitely play some role in the cellular machinery that translates genes into proteins, and knowing their sequences could shed more light on that function.
As impressive as the technical feat of sequencing a complete human genome is, scientists told me that one genome is only one snapshot. Seeing how these repeating regions change over time from person to person, species to species, will be far more interesting.
What happens in development? What happens if you compare offspring to parents? The consortium proved that these repeating regions are sequenceable with the new long-read technologies. Now they can be applied to more genomes, allowing scientists to compare one with another. Indeed, Miga says that the ultimate dream is to make every genome that scientists attempt to sequence complete from end to end, telomere to telomere. But first, the group has a more immediate goal in mind.
To simplify the task, the group used cells from a particular type of tumor that develops from an abnormal fertilized egg and ends up with just 23 single chromosomes. Garg has used PacBio HiFi to rapidly assemble human genomes, minus some tricky regions such as the centromeres, at a rate of a few per day.
This may underestimate the proportion of genome missing from the finished sequence, however, because focused efforts were made to capture genomic sequence containing missing messenger RNAs. Analysis of random genomic plasmids. As an additional and broader test of coverage, we analysed paired end-sequences from 5, small-insert 3—4?
After excluding heterochromatic repeats and other artefacts, we found that For 0. For another 0. The current genome sequence contains gaps, which could not be closed with available techniques.
We briefly describe the nature of these gaps and discuss the prospects for eventual closure. See Supplementary Information Notes 2 and 4. Heterochromatic regions (33 gaps). The heterochromatic regions of the human genome were not targeted by the HGP, because their highly repetitive properties make them largely refractory to current cloning and sequencing strategies.
There are 33 heterochromatic regions falling into four types. The three secondary constrictions are immediately adjacent to the centromere on chromosome arms 1q, 9q and 16q and contain various satellite repeats (beta, gamma, satellite I, II, III). Finally, there is a single large region on distal Yq composed primarily of thousands of copies of several repeat families. The heterochromatic regions all tend to be highly polymorphic in length in the human population.
Euchromatic boundary regions (35 gaps). The euchromatic regions of the human genome are bounded proximally by heterochromatin and distally by a telomere consisting of several kilobases of the hexamer repeat TTAGGG. We examined the current genome sequence for evidence of the expected boundaries on the 43 euchromatic arms.
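The telomeric boundary check described here can be illustrated with a minimal sketch. It assumes an arm's sequence is written 5' to 3' toward the chromosome end and counts perfect tandem copies of the TTAGGG hexamer at its tip. The function name and the perfect-match rule are illustrative assumptions, not the consortium's actual procedure, and real telomeres also carry variant repeats that this simple pattern would miss.

```python
import re

# Hexamer named in the text; everything else in this sketch is hypothetical.
TELOMERE_UNIT = "TTAGGG"


def trailing_telomere_repeats(seq: str, unit: str = TELOMERE_UNIT) -> int:
    """Count perfect tandem copies of `unit` at the 3' end of `seq`.

    Assumes `seq` is oriented so that the chromosome end is on the right.
    Returns 0 when the sequence does not end in at least one full unit.
    """
    match = re.search(rf"(?:{re.escape(unit)})+$", seq.upper())
    if match is None:
        return 0
    return len(match.group(0)) // len(unit)


if __name__ == "__main__":
    arm_end = "ACGT" + TELOMERE_UNIT * 3
    print(trailing_telomere_repeats(arm_end))
```

An arm whose finished sequence ends in many such copies would count as reaching the telomeric repeat; an arm that ends in unique sequence would be flagged for closer inspection.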
See Supplementary Information Note 4. At the proximal ends, 30 of the 43 cases show sequence characteristic of either heterochromatin or immediately flanking regions such as higher-order centromeric repeats, stretches of at least 10? We cannot exclude the possibility that there is additional unique sequence between this point and the proximal heterochromatin; but efforts to extend the finished sequence further were unsuccessful.
In the remaining 13 cases, the finished sequence contains no evidence of heterochromatin-related sequence. At the telomeric ends, 21 of the 43 cases show continuous sequence extending to the telomeric repeat.
This sequence was typically obtained by isolation and sequencing of half-YAC clones spanning to the telomere. An additional 18 cases are sequence gaps, in which half-YACs reaching to the telomere were isolated but finished sequence could not be obtained.
The remaining four cases are physical gaps, in which large-insert clones extending to the telomere could not be obtained. Euchromatic interior regions gaps. The remaining gaps are located within the current genome sequence. These consist of physical gaps for which no clones could be isolated, and 58 sequence gaps for which clones were found but reliable finished sequence could not be obtained.
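The boundary counts given above can be cross-checked with a little arithmetic. The sketch below uses only figures stated in the text. The grouping of the 13 proximal cases without heterochromatin evidence together with the 18 telomeric sequence gaps and 4 telomeric physical gaps is one reading that reproduces the stated total of 35 boundary gaps, and it is offered as an assumption rather than as the authors' own bookkeeping.

```python
# All numbers below are taken from the text; the dictionary keys are
# illustrative labels, not terms used by the authors.
EUCHROMATIC_ARMS = 43

proximal_ends = {"heterochromatin_evidence": 30, "no_evidence": 13}
telomeric_ends = {"continuous_to_repeat": 21, "sequence_gap": 18, "physical_gap": 4}


def boundary_gap_count() -> int:
    """Return the number of boundary gaps implied by the per-arm counts."""
    # Each set of categories should account for every euchromatic arm.
    assert sum(proximal_ends.values()) == EUCHROMATIC_ARMS
    assert sum(telomeric_ends.values()) == EUCHROMATIC_ARMS
    return (
        proximal_ends["no_evidence"]
        + telomeric_ends["sequence_gap"]
        + telomeric_ends["physical_gap"]
    )


if __name__ == "__main__":
    print(boundary_gap_count())
```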
The physical gaps are greatly enriched in regions of segmental duplication Fig. Such segmental duplications are especially frequent in pericentromeric regions, and gaps are notably more frequent in these regions. The association of gaps with segmental duplications is examined in detail elsewhere Large duplications are shown to approximate scale; smaller ones are indicated as ticks. Sequence gaps are indicated above the chromosomes in red.
Unfinished clones are indicated as black ticks. The blue bars show the result of direct analysis of near-complete sequence. The gold bars show an independent estimate 65 using whole-genome shotgun data to correct for potential mis-assembly of such segmental duplications.
The strong agreement suggests that most segmental duplications are properly represented in near-complete genome sequence. The discrepancy for chromosome X is probably a result of errors in the independent estimate, due to limited coverage and diversity of data from this chromosome. The most extreme case occurs near the centromere of chromosome 9. The most proximal 5 Mb on 9p and 4 Mb on 9q comprise a mere 0.
These two pericentric regions are unique in the genome with respect to density of segmental duplication and the average degree of intrachromosomal sequence identity. Other proximal regions also show a higher-than-average density of gaps.
For example, the proximal 2 Mb on the remaining 41 euchromatic arms comprise 2. Nearly all of these proximal gaps are flanked by segmental duplications Fig. There is also a clustering of such gaps in subtelomeric regions. The terminal 1 Mb on the 43 euchromatic arms represents 1. The most proximal regions are crowded with alpha satellite sequences and other centromeric repeats; composition, density and order may vary considerably between chromosome arms. Just outside this region, there is usually a high density of inter- and intra-chromosomal duplication.
For details, see text and refs 39, 40, 66 and The terminal repeat tract consists of 2—15? Short 50—? Proximal to the Srpt region is chromosome-specific genomic DNA, typically with a high GC content and high gene density. Stretches of segmentally duplicated DNA that occur only once within subtelomeric regions tan are interspersed with 1-copy subtelomeric DNA yellow in a telomere-specific fashion. Closing the remaining gaps. These represent regions that could not be reliably mapped, cloned and sequenced with current methods.
Rather than applying further brute force, it is now time to develop focused strategies to resolve the regions. The remaining euchromatic gaps probably reflect two major issues. The first pertains to regions harbouring segmentally duplicated sequence. Such regions are challenging to map because it can be extremely difficult to discern whether two clones with small sequence differences represent different loci or different alleles at a single locus.
This challenge was eventually resolved for chromosome Y ref. By using DNA from a single haploid source, it was possible to rely on differences at only a handful of nucleotides to distinguish repeated sequences.
This approach could be applied to the rest of the genome by using appropriate haploid sources, such as a hydatidiform mole or monochromosomal hybrids. In both instances, use of parental controls to guard against being misled by somatic rearrangements would be well advised. It may be useful to test these approaches on individual chromosomes. The second issue is that some gaps are likely to correspond to regions that cannot be efficiently propagated in current large-insert vectors and hosts.
It may be useful to test new kinds of large-insert libraries for clones containing unique sequences not contained in the current human genome sequence perhaps seeded by probes derived from random small-insert genomic plasmids, as discussed above. In addition, genome completion may benefit from long-range mapping techniques such as optical mapping 38 , which may provide independent information about difficult regions.
Completing the euchromatic sequence is an important goal, but is clearly now a research effort rather than a high-throughput project.
Sequencing the human heterochromatin poses an even greater challenge. The current human sequence penetrates only the periphery of the heterochromatin—for example, the pericentric regions on a few chromosome arms 39 , This progress has required concerted efforts with specialized mapping techniques and painstaking assembly.
The fundamental issue is that current shotgun strategies are poorly suited to assembling large, highly repetitive regions. The hierarchical shotgun strategy faces the challenge of accurate assembly of individual BACs and accurate overlap of BAC clones, with the underlying data consisting of nearly identical sequence; the whole-genome shotgun strategy compounds these problems.
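Why nearly identical sequence defeats shotgun assembly can be seen in a toy example. The sketch below assumes two hypothetical repeat copies that differ at a single base and asks which copies a read could have come from. A read is placeable only if it matches exactly one copy, which here requires it to span the distinguishing base. The sequences and function name are invented for illustration, and exact substring matching stands in for real alignment.

```python
def copies_matching(read: str, copies: list[str]) -> list[int]:
    """Return the indices of the repeat copies that contain `read` exactly."""
    return [i for i, copy in enumerate(copies) if read in copy]


# Two made-up repeat copies that differ only at position 15 (C versus T).
COPY_A = "ACGTTGCAAGGCTTACGATCGA"
COPY_B = COPY_A[:15] + "T" + COPY_A[16:]
COPIES = [COPY_A, COPY_B]

if __name__ == "__main__":
    short_read = COPY_A[0:8]   # lies entirely in sequence shared by both copies
    long_read = COPY_A[4:20]   # long enough to span the distinguishing base
    print(copies_matching(short_read, COPIES))  # ambiguous: both copies match
    print(copies_matching(long_read, COPIES))   # unique: only one copy matches
```

The same logic suggests why longer effective read lengths, or a haploid source in which every observed difference marks a distinct repeat copy rather than an allele, make such regions more tractable.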
Conceivably, the hierarchical strategy could be adapted as was done for repetitive regions of chromosome Y. Approaches might include the use of the following: haploid DNA sources to restrict the problem to a single haplotype; single chromosome sources to avoid confusion among related centromeres on different chromosomes; sheared BAC libraries to avoid biases caused by the unusual distribution of restriction sites within the repeat sequences; assembly based on rare base differences that distinguish near-identical repeats; cloning vectors that minimize rearrangements; and subclone libraries of varying insert lengths.
Such an approach will also require ensuring accurate recovery and stability of heterochromatic regions in large-insert clones. Even so, the path is likely to be arduous and expensive to obtain regions of uncertain information content. Alternatively, it may be possible to develop new approaches. These might include methods to obtain much longer effective read lengths, directed reads from known locations and long-range mapping information about the location of rare base differences among repeat copies (such as optical mapping 38 or padlock probes). The present genome sequence enables far more precise analyses of the human genome, especially those that depend sensitively on high accuracy and near-completeness.
Rather than revisit all of the analyses in our initial analysis of the human genome, we have chosen four examples that illustrate the utility of the current near-complete sequence. The human genome is notable for its high proportion of recent segmental duplications.
They are of great medical interest because their unusual structure often predisposes them to deletion or rearrangement with consequent phenotypic effects; prominent examples include the Williams syndrome region (7q), Charcot-Marie-Tooth region (17p), DiGeorge syndrome region (22q) and the AZF-C region (Y). Some regions of segmental duplication have also recently been shown to be evolutionary nurseries in which coding sequences are undergoing strong positive selection. Accurate analysis of segmental duplications was previously impossible because the draft sequence also contained a high degree of artefactual duplication.
This difficulty was recognized at the time and the approximate proportion of true and artefactual duplication was inferred indirectly. With near-complete sequence, the artefacts are now largely eliminated and true segmental duplications can be reliably studied. The proportion of segmental duplication and the degree of sequence identity are clearly substantially higher in the human genome than in the mouse 44 or rat 45 genomes although precise figures for the rodent genomes must await finished sequence.
The use of large insert clones, representing a single haplotype, was critical in resolving these regions. The distribution of segmental duplication varies widely across chromosomes, as does the proportion of intrachromosomal versus interchromosomal duplications 15 Fig 4b. In addition, many pericentromeric and subtelomeric regions are rich in dispersed segmental duplications Fig. A central goal of genome analysis is the comprehensive identification of all human genes.
This task remains challenging, but is greatly aided by the near-complete sequence together with other improved resources such as expanded cDNA collections, genome sequence from other organisms and better computational methods.
The current version of the human gene catalogue Ensembl 34d contains 22, gene loci with a total of 34, transcripts, corresponding to 1. Comparison of the initial and current gene catalogues highlights the substantial improvement. Many of the earlier gene models were erroneous due to defects in the draft sequence. Examples resulting from a duplication, inversion and premature stop codon are shown in Fig.
The improvement can be quantified by mapping the current gene models onto the draft sequence, to determine whether they could have been accurately identified.
The terminal exon was also missing. Automated gene annotation has now been complemented by manual annotation of most chromosomes, based on a careful review of gene structure and examination of expressed-sequence-tag EST and transcript evidence.
Gb of the euchromatic genome. Although this annotation has further improved the quality of the gene models 47 by dealing with special cases and unusual features not yet handled in the automated programs, and resolving instances of conflicting experimental data , it has not significantly affected the total gene count for these chromosomes. On the basis of available evidence, our best estimate is that the total number of protein-coding genes is in the range 20,—25, The lower bound seems secure, based on the number of currently known genes 19, The upper bound is based on estimates of the number of additional genes.
Despite intense automated and manual analysis using cDNA, EST and cross-species homology, only 2, gene predictions have been added to the known set. This predicted set is likely to represent substantially fewer than 2, true genes, owing to fragmentation and false predictions arising from pseudogenes. On the other hand, the set is likely to be incomplete because some protein-coding genes have surely continued to escape detection. The range of 20,—25, is also consistent with recent estimates J.
Weissenbach, unpublished of the number of protein-coding genes based on cross-species homology using the Exofish method In our initial analysis of the draft sequence 15 , we estimated the count of human protein-coding genes at roughly 30, The estimate was derived as follows.
With the current high-quality sequence, it is now possible to revisit this earlier analysis. We directly compared the previous gene models with the current gene models, to determine whether our previous estimates of the various error rates were correct.
It is clear that the main reason for the earlier overestimate is that the fragmentation rate was substantially underestimated. The fragmentation rate is defined as the average number of the previous gene models that map to the same true gene; we assessed it by mapping to the current gene catalogue.
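The definition of the fragmentation rate translates directly into a small calculation. The sketch below assumes a mapping from each earlier gene model to the current gene it corresponds to; the identifiers and function name are hypothetical, and models that map to no current gene are assumed to have been filtered out beforehand.

```python
from collections import Counter


def fragmentation_rate(model_to_gene: dict[str, str]) -> float:
    """Average number of earlier gene models that map to the same current gene.

    `model_to_gene` maps an earlier gene-model identifier to the identifier
    of the current gene it overlaps (both hypothetical identifiers).
    """
    if not model_to_gene:
        raise ValueError("no gene models were mapped")
    models_per_gene = Counter(model_to_gene.values())
    return sum(models_per_gene.values()) / len(models_per_gene)


if __name__ == "__main__":
    # Two earlier models fall within gene A, one within gene B.
    example = {"old_1": "gene_A", "old_2": "gene_A", "old_3": "gene_B"}
    print(fragmentation_rate(example))
```

A rate of 1.0 would mean that no gene had been split into several models; higher values inflate a gene count that is based on the number of models.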
Small differences in the estimated rate of false positive and negative predictions account for the remainder of the discrepancy. It should be emphasized that the count above refers to the count of protein-coding genes. In addition, there is evidence that the human genome gives rise to many additional RNA transcripts. It is unclear whether most such transcripts have specific biological functions or reflect reproducible transcriptional noise; few contain substantial open-reading frames and thus they are unlikely to encode proteins.
There is a need for reliable experimental and computational methods for comprehensive identification of non-coding RNAs. Finally, the near-complete sequence makes it possible to undertake systematic searches for pseudogenes.
Automated annotation of chromosomes has focused primarily on identifying large pseudogenes of more recent origin. This is surely still an underestimate, because such analysis will miss pseudogenes that are extremely old or that contain primarily untranslated regions.
The total number of pseudogenes is thus likely to exceed the total number of functional genes. A particular type of pseudogene recently arising non-processed pseudogenes is discussed in more detail below. The birth of new genes is of interest because it provides raw material for adaptive evolution, with extra copies of genes able to undergo functional divergence in response to positive selection.
The quality and completeness of the current sequence make it possible to study this question; such analysis would have been unreliable with the earlier draft sequence, because the extensive artefactual local duplication would have given rise to many false positives. We searched for clusters of nearby homologous genes, indicative of local gene duplication. The divergence between such genes was assessed at sites likely to be selectively neutral, by measuring the estimated substitution rate per synonymous site K S.
This threshold corresponds roughly to duplications arising after divergence from the rodent lineage, either by recent gene duplication or perhaps recent gene conversion of older duplications see Methods in Supplementary Information.
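The text does not say here how K S was computed; the details are deferred to the Supplementary Information. As a minimal sketch of one standard approach, and assuming the numbers of synonymous differences and synonymous sites have already been counted from a codon alignment, the Jukes-Cantor correction used in Nei-Gojobori-style estimates converts the observed proportion of differences into an estimated substitution rate per synonymous site. The function name and its inputs are hypothetical, and this may not be the method the authors used.

```python
import math


def ks_jukes_cantor(syn_differences: float, syn_sites: float) -> float:
    """Estimate synonymous substitutions per synonymous site.

    Applies the Jukes-Cantor correction K = -(3/4) * ln(1 - (4/3) * p),
    where p is the observed proportion of synonymous differences.
    """
    if syn_sites <= 0:
        raise ValueError("syn_sites must be positive")
    p = syn_differences / syn_sites
    if not 0 <= p < 0.75:
        # The correction is undefined once p reaches saturation at 3/4.
        raise ValueError("proportion of differences must be in [0, 0.75)")
    return -0.75 * math.log(1 - (4.0 / 3.0) * p)


if __name__ == "__main__":
    print(round(ks_jukes_cantor(10, 100), 4))
```

The corrected value is always at least as large as the raw proportion, because it accounts for multiple substitutions at the same site.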
A total of 1, genes exhibit such divergence from a neighbouring gene see Methods in Supplementary Information. These genes often fall within larger clusters of paralogous genes including genes with greater divergence and reflecting older duplications. The recent duplications are enriched in genes with immune and olfactory function, as well as those likely to be involved in reproductive functions. For example, the gene families encoding the pregnancy-specific betaglycoprotein and choriogonadotropin beta proteins may be involved in the extended gestational period in the human lineage; the latter family is known to have expanded recently within the catarrhine primate lineage The distribution of K S values Fig.
There are several possible explanations for this peak. First, it may reflect a true explosion in the rate of gene duplication in the primate lineage. The primate lineage does show an increase in the rate of dispersed segmental duplication, although it is less extreme; the rate of local duplication will need to be carefully evaluated in comparative studies.
Second, it may partly reflect the ongoing process of gene conversion of older gene duplication events. However, we offer a third explanation: the peak primarily reflects the transient presence of duplicated genes that are too young relative to the characteristic time of deletion. If so, most of these new genes are destined to be culled due to lack of functional benefit. In contrast to the first explanation, this would predict that a similar peak would be seen in most mammals.
K S , the number of synonymous substitutions per synonymous site, was estimated for each gene from alignment with its most closely related human paralogue.
This provides an indication of evolutionary time since divergence. Gene death is another phenomenon that sheds light on lineage-specific evolution, but which was difficult to analyse with the earlier draft sequence. To study gene death, we scanned the genome sequence for recently arising non-processed pseudogenes—that is, nearly intact human genes that appear to have recently acquired an inactivating mutation.
Specifically, we examined genomic intervals bounded at each end by two consecutive genes, with each belonging to an orthology triplet in the human, mouse and rat genomes and the interval containing at most 50 genes see Methods in Supplementary Information.
We then examined the Ensembl gene predictions in the corresponding intervals of the three genomes and identified instances in which the mouse and rat genomes contained orthologues, but the human genome appeared to contain no predicted orthologous gene. In each instance, the rodent genes were aligned to the corresponding human genomic interval to look for clear evidence of a human pseudogene—that is, a highly similar sequence containing one or more inactivating mutations in its genomic sequence see Methods in Supplementary Information.
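The first stage of this screen amounts to a set comparison within each interval. The sketch below assumes each interval has already been reduced to sets of orthology-group identifiers per species; the `Interval` type, its field names and the reading of the 50-gene limit as a single per-interval count are all illustrative assumptions. The subsequent alignment step, which looks for an inactivating mutation in the human sequence, is not modelled.

```python
from dataclasses import dataclass

MAX_GENES_PER_INTERVAL = 50  # limit stated in the text


@dataclass
class Interval:
    """Hypothetical summary of one anchored interval across three genomes."""
    human: set[str]   # orthology groups with a predicted human gene
    mouse: set[str]   # orthology groups with a predicted mouse gene
    rat: set[str]     # orthology groups with a predicted rat gene
    gene_count: int   # number of genes in the interval (assumed definition)


def candidate_losses(interval: Interval) -> set[str]:
    """Groups present in both rodents but with no predicted human gene."""
    if interval.gene_count > MAX_GENES_PER_INTERVAL:
        return set()  # interval excluded from the screen
    return (interval.mouse & interval.rat) - interval.human


if __name__ == "__main__":
    iv = Interval(
        human={"g1", "g3"},
        mouse={"g1", "g2", "g3"},
        rat={"g1", "g2", "g4"},
        gene_count=4,
    )
    print(candidate_losses(iv))
```

Each group returned would then be a candidate for alignment of the rodent genes against the human interval.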
We also required that the inactivating mutation was present in any human mRNA sequences corresponding to the locus. This analysis excludes many older pseudogenes that do not show sufficient similarity to the rodent homologues because they have substantially degenerated. A total of 37 candidate pseudogenes were identified, with an average of 0. Similar analyses performed on the draft sequence yielded a much larger list, including many apparent inactivating mutations that were errors and were corrected in the current sequence.
We carefully examined these candidates to confirm that they did not reflect errors in the current genome sequence by resequencing or examination of an independently finished clone and to determine their evolutionary origin by re-sequencing in a panel of 24 diverse humans and comparison with a draft sequence of the chimpanzee genome.
Complete experimental data could be obtained for 34 cases. The identification of a pseudogene was confirmed in 33 of the 34 cases; one case was due to an error in the current sequence Table 5. The 19 pseudogenes with two or more inactivating mutations were all found to be pseudogenes in chimpanzee as well.
The 14 pseudogenes with exactly one inactivating mutation fell into the following three classes: eight pseudogenes shared with chimpanzee; five pseudogenes fixed in the human population but functional genes in the chimpanzee; and one pseudogene that is a segregating polymorphism in the human population. In 20 cases, the inactivating mutation occurs in the final or only exon. Although this could in principle be compatible with a functional gene, the truncation removes a functionally important domain in all but one case.
Of the 32 pseudogenes fixed in the human population, 10 are derived from olfactory receptors. Olfactory receptors thus occur prominently in both birth and death analyses, indicating a dynamic expansion and contraction of this large gene family; the net effect has been an overall significant decrease in the number of functional olfactory receptors in humans compared with rodents 59 , The remaining 22 recent pseudogenes include a wide variety, such as genes homologous to a cationic amino-acid transporter, a serine-threonine kinase, a calreticulin, a putative G-protein coupled receptor and a cystatin.
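The pseudogene counts reported above fit together as follows. The sketch uses only numbers stated in the text and checks that the classes add up; the variable names are illustrative labels.

```python
def pseudogene_tally() -> dict[str, int]:
    """Cross-check the pseudogene counts stated in the text."""
    confirmed = 33                      # of the 34 cases with complete data
    multiple_mutations = 19             # all also pseudogenes in chimpanzee
    single_mutation = {
        "shared_with_chimpanzee": 8,
        "fixed_in_human_functional_in_chimpanzee": 5,
        "segregating_in_human": 1,
    }
    olfactory_receptor_derived = 10

    assert sum(single_mutation.values()) == 14
    assert multiple_mutations + sum(single_mutation.values()) == confirmed

    # Every confirmed pseudogene except the segregating one is fixed.
    fixed = confirmed - single_mutation["segregating_in_human"]
    return {
        "fixed_in_human": fixed,
        "fixed_non_olfactory": fixed - olfactory_receptor_derived,
    }


if __name__ == "__main__":
    print(pseudogene_tally())
```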
The Human Genome Project marked a new approach in biomedical research, one in which the scientific community came together to characterize systematically a large domain of important biological knowledge. Notably, the finishing phase required roughly equal resources of time and expense as the draft phase. The issue is no longer scale, but rather the need for new approaches to understand and resolve these recalcitrant segments. Continuing efforts should be devoted towards the eventual goal of complete closure.
Nonetheless, the euchromatic human genome can now be regarded as effectively known. The accuracy and completeness of the current near-complete human genome sequence has important consequences for biomedical research. It allows systematic searches for the causes of disease—for example, to find all key heritable factors predisposing to diabetes or somatic mutations underlying breast cancer—with confidence that little can escape detection.
It facilitates experimental tools to recognize cellular components—for example, detectors for mRNAs based on specific oligonucleotide probes or mass-spectrometric identification of proteins based on specific peptide sequences—with confidence that these features provide a unique signature. It allows sophisticated computational analyses—for example, to study genome structure and evolution—with confidence that subtle results will not be swamped or swayed by noisy data.
At a practical level, it eliminates tedious confirmatory work by researchers, who can now rely on highly accurate information. At a conceptual level, the near-complete picture makes it reasonable for the first time to contemplate systems approaches to cellular circuitry, without fear that major components are missing. The HGP provides an essential foundation for the sequencing and analysis of additional large genomes.
With the experience gained from the human genome, it has already become scientifically and economically feasible to produce draft genome sequence from many vertebrates, which will be a crucial tool for identifying the functional elements in the human genome through comparative analysis. Ultimately, we believe that such projects should aim higher to produce genome sequence with even greater accuracy and completeness.
This will require digesting the diverse experience from the finishing phase of human sequencing and selecting a subset of techniques that can be most efficiently streamlined and scaled up to improve accuracy and completeness of genome sequence. A good example is the systematic closure of gaps by primer-directed walking on fosmid templates covering each gap, which may be able to close the vast majority of gaps in a draft sequence in an automated fashion.
More generally, the HGP demonstrates the tremendous potential value of coordinated projects to create community resources to propel biomedical research. Key challenges that lie ahead 61 include: 1 systematic identification of all genetic polymorphisms carried in the human population, to facilitate the study of their association with disease; this will require comprehensive study of hundreds to thousands of human genomes.
Absolute completeness will be elusive but, as with the HGP, obtaining the substantial majority of the information will greatly accelerate the pace of biomedical research in thousands of laboratories.
Correspondence and requests for materials should be addressed to F. Collins fc23a nih. Lander lander broad. Rogers jrh sanger. Waterston waterston gs. The sequence described here has been deposited in public databases, with the 24 human chromosomes having accession numbers NC to NC.