ABSTRACT
Objective:
With the increasing use of whole-exome sequencing, one of the challenges in identifying the causal allele for a Mendelian disease is the lack of availability of population-specific human genetic variation reference databases. The people of Turkey were not represented in GnomAD or other publicly available large databases until recently, when the first comprehensive genomic variation database, Turkish Variome (TRV), was published. The aim of this study was to evaluate whether TRV or other publicly available large genomic variation databases can reliably be used for rare disease variant evaluation in Turkish individuals.
Methods:
Sixty non-disease-causing, non-synonymous variants (minor allele frequencies >1%) were identified in 58 genes that are known to be associated with idiopathic hypogonadotropic hypogonadism from a large Turkish patient cohort. The allelic frequencies of these variants were then compared with those in various public genomic variation databases, including TRV.
Results:
Our cohort variants showed the highest correlations with those in the TRV, Iranome, and The Greater Middle East Variome, in decreasing order.
Conclusion:
These results suggest that the TRV is the appropriate database to use for rare genomic variant evaluations in the Turkish population. Our data also suggest that variomes from neighboring geographic regions may serve as substitute references for populations that lack their own genomic variation databases.
What is already known on this topic?
The absence of population-specific genetic variation reference databases causes misleading results in rare variant evaluations.
What this study adds?
This study confirmed that Turkish Variome could represent the Turkish population for rare genomic variant evaluation.
Introduction
The widespread use of next-generation sequencing (NGS), particularly whole-exome sequencing (WES), in medical practice has resulted in massive data accumulation (1). To interpret the differences in the DNA sequences of individuals accurately, criteria based on specific parameters are used. One of the essential parameters is allele frequency (AF), which represents the prevalence of a gene variant in a given population. Variants with minor AFs less than 1% are considered rare and can play a causative role in Mendelian and complex disorders. Genetic alterations observed with a much higher frequency than expected for the disease in a population are generally interpreted as benign (2,3). As many variants have proven to be population-specific, large databases have evolved into comprehensive bodies of data comprising datasets from individual subpopulations (4). Failure to use population-specific databases can lead to unreliable or even misleading results in variant evaluation.
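As a minimal illustration of the AF parameter described above, the sketch below computes an allele frequency from an alternate-allele count, assuming a diploid autosomal site with every individual genotyped. The function names and the example counts are hypothetical and are not taken from the study.

```python
def allele_frequency(alt_allele_count: int, n_individuals: int) -> float:
    """Alternate-allele frequency at a diploid autosomal site.

    Assumes every individual contributes exactly two called alleles.
    """
    if n_individuals <= 0:
        raise ValueError("n_individuals must be positive")
    total_alleles = 2 * n_individuals
    if not 0 <= alt_allele_count <= total_alleles:
        raise ValueError("alt_allele_count must lie between 0 and 2 * n_individuals")
    return alt_allele_count / total_alleles


def minor_allele_frequency(af: float) -> float:
    """Frequency of the less common of the two alleles at a biallelic site."""
    return min(af, 1.0 - af)


def is_rare(af: float, threshold: float = 0.01) -> bool:
    """A variant is treated as rare when its minor AF is below the threshold (1% here)."""
    return minor_allele_frequency(af) < threshold


# Hypothetical counts: 6 alternate alleles among 290 individuals (580 alleles).
example_af = allele_frequency(6, 290)
print(example_af, is_rare(example_af))
```

Under these assumptions, a cohort of 290 individuals carries 580 alleles, so a variant must be seen on at least 6 alleles to exceed an AF of 1%.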
The people of Turkey live in the Anatolian peninsula, which lies geographically at the crossroads of three major continents and through which major population movements have occurred during all periods of human history. Therefore, it was thought that this geographic region might have a genetic admixture. The genetic structure of Turkish people has been investigated in small-scale studies using different methods (5,6,7,8). In a recent study, Kars et al. (9) published the first comprehensive genomic variation database, Turkish Variome (TRV), which compiles whole-genome and whole-exome data from 3362 individuals from various regions of Turkey.
The aim of this study was to evaluate whether any large population variomes, including TRV, can reliably be used in variant evaluations for the population of Turkey. Therefore, 60 non-disease-causing non-synonymous variants (minor AFs greater than 1%) in 58 genes, known to be associated with idiopathic hypogonadotropic hypogonadism (IHH), were compared with the AFs in various population databases worldwide.
Methods
Patient Cohort
The study used genetic variants from a large, rare disease cohort. The cohort included a total of 290 independent patients (112 female and 178 male) from seven geographic regions of Turkey (the Marmara region, Black Sea, Aegean, Mediterranean, Central Anatolia, Eastern Anatolia, and Southeastern Anatolia), roughly representing the population of Turkey.
Genetic Analyses
A total of 290 WES data sets were screened for potentially pathogenic nucleotide changes [frameshifts, in-frame changes (insertion and deletion), nonsense (stop-loss and stop-gain), two-base splice-sites (donor/acceptor), and missense] located in the exons of 58 genes known to be associated with IHH. Intronic areas, distant regions, and synonymous changes were excluded. The currently known IHH-associated genes are listed in Table 1 (11).
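The screening step described above can be sketched as a simple filter over annotated variants. This is only an illustration: the consequence labels, the record layout, and the three-gene subset are assumptions made for the example, and they do not reproduce the annotation pipeline or the full gene list in Table 1.

```python
# Hypothetical consequence labels; real annotation tools use their own vocabularies.
RETAINED_CONSEQUENCES = {
    "frameshift",
    "inframe_insertion",
    "inframe_deletion",
    "stop_gained",
    "stop_lost",
    "splice_donor",
    "splice_acceptor",
    "missense",
}

# Placeholder subset of IHH-associated genes, not the full list in Table 1.
EXAMPLE_GENES = {"GNRHR", "KISS1R", "TACR3"}


def screen_variants(variants, genes=EXAMPLE_GENES):
    """Keep variants in the gene panel whose consequence is one of the retained classes.

    Synonymous, intronic, and other non-listed consequences are dropped.
    """
    return [
        v for v in variants
        if v["gene"] in genes and v["consequence"] in RETAINED_CONSEQUENCES
    ]


# Hypothetical annotated records.
records = [
    {"gene": "GNRHR", "consequence": "missense"},
    {"gene": "GNRHR", "consequence": "synonymous"},
    {"gene": "TACR3", "consequence": "intronic"},
    {"gene": "OTHER1", "consequence": "missense"},
]
print(screen_variants(records))
```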
Selection of the Study Variants
Based on the prevalence of IHH (1/10.100.000), variants with an AF lower than 0.0001 were excluded from the study because they may be highly pathogenic. We also excluded variants with an AF of 0.0001-0.01, which can be potentially pathogenic. Only variants with an AF greater than 0.01, which are extremely unlikely to be disease-causing for IHH, were included.
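The three AF tiers described above can be expressed as a small classifier. The sketch below is one possible reading of the selection rule; the tier labels, the function names, and the treatment of a value exactly equal to a cut-off are assumptions, since the text specifies only "lower than 0.0001" and "greater than 0.01".

```python
RARE_CUTOFF = 0.0001   # below this: excluded, possibly highly pathogenic
COMMON_CUTOFF = 0.01   # above this: included, very unlikely to cause IHH


def classify_by_af(af: float) -> str:
    """Assign a cohort AF to one of the three tiers used for variant selection."""
    if not 0.0 <= af <= 1.0:
        raise ValueError("AF must lie between 0 and 1")
    if af < RARE_CUTOFF:
        return "excluded_rare"
    if af > COMMON_CUTOFF:
        return "included_common"
    # Values from 0.0001 to 0.01 inclusive fall here (boundary handling is an assumption).
    return "excluded_intermediate"


def select_study_variants(cohort_afs: dict) -> dict:
    """Return only the variants whose cohort AF is greater than the common cut-off."""
    return {
        variant_id: af
        for variant_id, af in cohort_afs.items()
        if classify_by_af(af) == "included_common"
    }


# Hypothetical variant identifiers and frequencies.
example = {"var_a": 0.00005, "var_b": 0.005, "var_c": 0.02}
print(select_study_variants(example))
```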
WES Analyses
Briefly, the genomic DNA samples from each patient were prepared as an Illumina sequencing library. The sequencing libraries were then enriched for the target regions with the Illumina Exome Enrichment protocol. Captured libraries were sequenced on an Illumina HiSeq 2000 sequencer (Macrogen, Seoul, South Korea). The reads were mapped to UCSC hg19.
Databases
The seven established databases used for the AF correlations with our cohort were: GnomAD, which includes European Finnish, European Non-Finnish (ENF), Ashkenazi Jewish, East Asian, South Asian, Latino/Admixed American, and African/African-American subcategories (12); The NHLBI Trans-Omics for Precision Medicine, representing a diverse population around the world with multi-ethnic data content (European, Hispanic/Latino, African, Asian) (13); The Greater Middle East (GME) Variome Project, which includes the GME world population, from Morocco in the west to Pakistan in the east, including 163 alleles from the Turkish peninsula (14); Iranome, which includes Iranian Arabs, Kurds, Persians, Persian Gulf Islanders, Azeris, and Turkmen ethnic groups (15); GenomeAsia, which includes South East Asian, Oceania, North East Asian, African, West Eurasia, South Asian, and American subpopulations (16); the 4.7KJPN, which represents the overall Japanese population (17); and the Online Archive of Brazilian Mutations, which represents the Brazilian population (18). The GnomAD ENF category includes Southern European, Bulgarian, North-Western European, Swedish, and Estonian subpopulations. Categories named “others” were not included in the study. The AFs were collected from the databases in February 2022. URLs of the databases are provided in the web resources.
Statistical Analysis
Statistical analyses were performed using the Statistical Package for Social Sciences, version 20.0 (IBM Inc., Armonk, NY, USA), and a p value of <0.001 was considered statistically significant. Spearman’s correlation method was used because the variables compared between the groups were not normally distributed. The correlation coefficients (CCs) between the study cohort and each of the databases/subgroups were analyzed separately. All correlation analysis results were found to be statistically significant. Next, we compared the CCs based on the concept of comparison of correlations from independent samples.
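The two statistical steps described above can be sketched in a few lines of standard-library Python. This is an illustration, not the SPSS procedure used in the study: the comparison of correlations is implemented here with Fisher's z-transformation, which is a common choice for independent samples but is an assumption about the exact test applied, and the function names and input values are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def compare_independent_correlations(r1, n1, r2, n2):
    """Two-sided Fisher z-test for the difference between two independent correlations."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    standard_error = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical AFs for the same five variants in a cohort and a reference database.
cohort_af = [0.012, 0.150, 0.034, 0.410, 0.078]
reference_af = [0.010, 0.170, 0.029, 0.395, 0.090]
print(spearman(cohort_af, reference_af))
```

A call such as `compare_independent_correlations(r1, n1, r2, n2)` takes two coefficients and the number of variants underlying each; those sample sizes must come from the actual per-database variant counts.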
Results
In this study, a total of 60 variants with an AF greater than 1% were detected in 30 of 58 IHH-associated genes in the WES data from the cohort of 290 independent Turkish IHH patients (Table 2). No variants above the cut-off were observed in 26 of the listed IHH genes, while 17 genes had more than one variant (maximum five) and 13 genes had only one. The great majority of the changes (95.0%) were missense, and 5.0% were frameshift (two insertions and one deletion). Iranome and GnomAD were the only databases in which every variant from the study cohort was observed.
A statistically significant correlation was observed between the study cohort and each of the databases analyzed (Table 3). The highest CCs were observed between the study cohort and the following databases, in decreasing order: TRV (0.994), Iranome (0.983), and GME (0.981). Comparison of correlations from independent samples indicated that the CCs of these three databases with our study cohort were not statistically different from each other (Table 3, shown in bold). The remaining 32 CCs were significantly different. Thus, the comparison results were interpreted to mean that the three databases (TRV, Iranome, and GME) can be used as reference databases for Turkish individuals.
Discussion
Population studies have repeatedly revealed the importance of local datasets in research and clinical practice, rather than using comprehensive databases with a wide-ranging sample size (4,19,20). Knowing the AF differences between populations is also essential for developing machine-learning-based methods that use clustering scores for pathogenicity classifications (21). Disease genetics studies in a given population may also provide information on community characteristics, such as mutation history and local adaptations, and help to avoid false-positive genetic diagnoses of Mendelian disorders. In this way, identifying and labeling population-specific genetic changes, such as individual/family-specific variants, will significantly reduce the burden of variants of uncertain significance (22,23). In our study, the common variants of Turkish IHH patients were observed at varying frequencies in different populations, supporting the hypothesis that a population-specific reference database should be used to facilitate the selection of pathogenic variants.
It is essential to understand that common and rare alleles have different characteristics. A rare variant needs to survive many generations to rise to a moderate frequency, while common ones tend to be inherited over long periods due to negligible effects and are most likely classified as benign. Thus, they are excellent candidates for determining demographic histories or periodical features, such as ancestral origins and migration routes (24,25,26). Blekhman et al. (27) observed that Mendelian-disease gene variants, in general, are under purifying selection pressures. The IHH-associated gene variants should be expected to be subjected to additional negative selection pressures, as pathogenic variants in these genes result in infertility. This reproductive disadvantage causes them to be rapidly purged from the population (27). Consequently, the AFs of the IHH gene variants are expected to be more skewed than those of most Mendelian disease genes, except for those with very high mortality. However, common variants (minor allele frequencies >1%) are free of such distortions. Based on the foregoing argument, we selectively compared the common variants in the IHH-related genes with those of the TRV and other publicly available databases. Our cohort results showed a nearly one-to-one correlation (0.994) with TRV, which comprises NGS data from individuals participating in genetic studies of various diseases, such as obesity, amyotrophic lateral sclerosis, and Parkinson’s disease. Our study, which used a rare disease model, IHH, that is not represented in the TRV patient subpopulations, confirms that TRV is well representative of the Turkish population. Kars et al. (9) also reported the close genetic relationship between Balkan and Caucasian populations and those of Turkey. Previously, in a paradigm similar to ours, Alkan et al. (5) studied 16 genomes from various regions of Turkey, compared them to those in the 1000 Genomes Project, and showed that the genetic structure of the people of Turkey is closer to that of Europe, particularly the Southern Europe/Mediterranean region, than to other gene pools. Similarly, in our study, a close correlation, albeit to a lesser extent, was also observed with those of West Eurasia, including Caucasia (0.974) and Southern Europe (0.955).
It is well known that consanguineous unions increase the incidence of recessively inherited diseases (28). Our study included 290 independent IHH patients, and consanguinity was present in 56.0%. This rate is higher than that of the general Turkish population (21.1%), probably because IHH is a rare disease that can be recessively inherited (29). Studies have reported that consanguineous marriage is common in many regions, including Turkey, Iran, and Pakistan (29,30,31). Kinship unions are influenced by culture, religion, geographical conditions, or socioeconomic boundaries. The AFs in our study cohort did not show remarkable similarity to those in databases from distant geographies. However, the close correlations with the non-European neighbors of the Anatolian peninsula, Iranome (0.983) and GME (0.981), suggest genetic similarity for alleles that are relatively difficult to spread due to this social structure (28).
Study Limitations
The use of WES analyses performed at different periods in the study may have resulted in differences between reads that confidently support alleles.
Conclusion
Our findings confirm that TRV can be reliably used for variant evaluations in the Turkish population. Our results also indicate that variomes from neighboring geographic regions may serve as substitute references in variant evaluation for populations that lack representative databases.