Even though most people carry more or less the same genes, there is a significant amount of variation in the sequence of the human genome when one individual is compared to another. A new reference genome has now been created in an effort to capture more of that diversity, and since it includes genetic data from 47 different people, it has been called a pangenome. Efforts to expand that diversity are also ongoing; researchers are hoping to include genetic data from 350 people in the pangenome by the year 2025. The effort was led by the international Human Pangenome Reference Consortium, and has been reported in Nature Biotechnology and other journals.
The original reference genome is about twenty years old, and although it is continuously being updated, most of its sequence comes from a single person, so it captures very little of the diversity that is seen in human genetic sequences. The genomes of most people are about 99 percent identical, but that 1 percent can have a significant impact. Some of those variations are commonly found in many people, and may appear in more than 1 percent of the population. Other variants are rarer. A lot of research effort has already gone into learning more about the biological consequences of those variations, but much more is yet to be revealed.
"Everyone has a unique genome, so using a single reference genome sequence for every person can lead to inequities in genomic analyses," said study co-author Adam Phillippy, Ph.D., a senior investigator at NHGRI's Intramural Research Program. "For example, predicting a genetic disease might not work as well for someone whose genome is more different from the reference genome."
While the human genome has been mostly complete for many years, there were gaps in the sequence for a long time, primarily in highly repetitive regions where sequencing technologies had a difficult time deciphering the actual sequence.
The new pangenome reference sequence is 99 percent complete, and includes single nucleotide variants (SNVs), which are changes in single bases of the sequence that are found in one person or another. Larger variants are captured as well, and they can be visualized. For example, a yellow loop that repeats the same nucleotide sequence indicates a duplication variant, while pink counterclockwise loops that follow the nucleotide sequence backwards represent inversion variants.
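To make the three variant types named above more concrete, here is a minimal Python sketch that applies each one to a made-up eight-base sequence. The function names, the toy sequence, and the coordinates are all hypothetical and chosen only for illustration; this is not how the consortium builds or analyzes the pangenome. The inversion is modeled as a reverse complement, which is the usual convention when a stretch of DNA is read backwards, and is an assumption here.

```python
# Toy illustration only: the sequence and coordinates are made up.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def apply_snv(seq, pos, base):
    """Replace the single base at 0-based index pos (single nucleotide variant)."""
    return seq[:pos] + base + seq[pos + 1:]


def apply_duplication(seq, start, end):
    """Repeat seq[start:end] immediately after itself (tandem duplication)."""
    return seq[:end] + seq[start:end] + seq[end:]


def apply_inversion(seq, start, end):
    """Replace seq[start:end] with its reverse complement (inversion)."""
    segment = seq[start:end]
    inverted = "".join(COMPLEMENT[b] for b in reversed(segment))
    return seq[:start] + inverted + seq[end:]


reference = "ACGTTGCA"
snv = apply_snv(reference, 2, "A")        # one base changes
dup = apply_duplication(reference, 2, 5)  # "GTT" appears twice in a row
inv = apply_inversion(reference, 2, 5)    # "GTT" is read backwards

print(reference, snv, dup, inv)
```

Note that the SNV and the inversion leave the sequence length unchanged, while the duplication makes it longer.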
Because many different versions of the genome are included in the pangenomic sequence, it contains many more nucleotide bases than a single reference sequence does, which gives researchers more data to work with.
"By using the pangenome reference, we can more accurately identify larger genomic variants called structural variants," said co-first study author Mobin Asri, a graduate student at the University of California Santa Cruz. "We are able to find variants that were not identified using previous methods that depend on linear reference sequences."
Sources: NIH/National Human Genome Research Institute (NHGRI), Nature Biotechnology