Populations studied

We used data from the UK Biobank (UKB), a large-scale, population-based prospective cohort study of approximately 500,000 people aged 40-69 years at recruitment, which took place across the UK between March 2006 and October 2010. Full details of genotyping and imputation are described elsewhere14,15.

Our study populations for each of the three disease outcomes are defined as follows:

  • The breast cancer-eligible population comprised women who had not had breast cancer, carcinoma in situ, or a mastectomy before inclusion.

  • For the hypertension-eligible population, we excluded people with missing or implausible systolic blood pressure (SBP) measurements ( 270 mmHg) at baseline, and those with major adverse cardiovascular events (MACE) prior to baseline.

  • The population eligible for dementia was limited to people without a diagnosis of dementia before inclusion.

We further restricted the analysis to genetically white British individuals (UKB Data Field 22006) and excluded related individuals (3rd degree or greater), sex-discordant individuals, and outliers for genotype missingness or heterozygosity, based on UK Biobank sample quality control data (UKB Data Field 22020). This gives the final size N of each population studied.

Disease ascertainment in the UKB during the follow-up period used linkage to the death registry, the cancer registry, and Hospital Episode Statistics (HES). Hypertension is defined as SBP >= 140 mmHg at baseline; International Classification of Diseases (ICD) codes for breast cancer and dementia can be found in Supplementary Tables 14-16.

PRS selection

For each of the three disease outcomes, we selected a pair of recently published PRSs to compare, typically published within two years of each other. The earlier PRS is denoted PRS-A, while the more recent is PRS-B.

When choosing PRSs, we identified scores that were derived using the same trait definition in predominantly white European populations, so as to be appropriate for use in the UK Biobank, and that had been derived by large consortia for their respective disease. Where possible, we selected scores listed in the Polygenic Score Catalog (PGS Catalog), an online database that collects and organizes PRSs from the published literature and makes their metadata available in a standardized way.

Table 1 summarizes the characteristics of the PRSs chosen for each trait, including the method of construction. Supplementary Table 1 contains additional information about each PRS, including the source of the weights (i.e., derivation dataset), population characteristics, and validation dataset.

  • For breast cancer, PRS-A (313 SNPs, PGS ID: PGS000004)12 has been extensively validated and is included in the current implementation of the BOADICEA breast cancer risk model16,17. For PRS-B, we used a score (118,388 SNPs, PGS ID: PGS000511)13 that was largely developed from the same Breast Cancer Association Consortium (BCAC) GWAS data as PRS-A18.

  • For hypertension, PRS-A for SBP (267 SNPs, not in the PGS Catalog)19 was selected from the literature (more details later in this section). The effect sizes of PRS-B for SBP (884 SNPs, PGS ID: PGS000812)20 are derived from the International Consortium for Blood Pressure Genome-Wide Association Studies (ICBP), the Million Veteran Program (MVP), and the Estonian Genome Center at the University of Tartu (EGCUT).

  • For dementia, both PRS-A (57 SNPs, PGS ID: PGS000812)21 and PRS-B (39 SNPs, PGS ID: PGS001775)22 used effect sizes from the International Genomics of Alzheimer’s Project (IGAP) GWAS23. PRS-A was built in January 2021, while PRS-B was developed in September of the same year. We selected PRSs that did not contain the two APOE SNPs (rs429358 and rs7412), to prevent the APOE genotype from dominating the PRS.

Table 1 PRS selected for each condition.

We examined the SNP overlap between PRS-A and PRS-B for each disease, counting both identical SNPs and SNPs in high linkage disequilibrium (LD) (R² > 0.8).

We took care to avoid sample overlap, a potential pitfall for PRS24. Since we planned to calculate PRSs in the UKB population (i.e., the target cohort), we preferred PRSs that were not derived in the UKB population. During the PRS selection step, we examined the base GWAS cohorts used for PRS derivation and attempted to ensure that they did not contain our target cohort (i.e., UKB).

We were able to identify such PRSs for breast cancer and dementia but, to the best of our knowledge, the recent literature contains no such PRS-A for SBP. We investigated all available PRSs for SBP in the PGS Catalog and found that UKB was present in all derivation populations. Given the extensive blood pressure measurements and genetic data in UKB, it is not surprising that researchers include UKB in their derivation populations for PRSs.
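The overlap check can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function name `snp_overlap`, the SNP identifiers, and the `ld_r2` lookup table (pairwise R² values that would in practice come from an LD reference panel) are all hypothetical.

```python
from typing import Dict, Set, Tuple


def snp_overlap(
    prs_a: Set[str],
    prs_b: Set[str],
    ld_r2: Dict[Tuple[str, str], float],
    r2_threshold: float = 0.8,
):
    """Return (identical SNPs, cross-score SNP pairs in high LD)."""
    shared = prs_a & prs_b  # SNPs present in both scores
    proxies = set()
    for (snp1, snp2), r2 in ld_r2.items():
        if r2 <= r2_threshold:
            continue
        # LD is symmetric, so check the pair in both orientations.
        for a, b in ((snp1, snp2), (snp2, snp1)):
            if a in prs_a and b in prs_b and a != b:
                proxies.add((a, b))
    return shared, proxies


# Toy input with made-up SNP IDs and R² values (illustrative only).
prs_a = {"rs1", "rs2", "rs3"}
prs_b = {"rs2", "rs4", "rs5"}
ld_r2 = {("rs1", "rs4"): 0.93, ("rs3", "rs5"): 0.41}
shared, proxies = snp_overlap(prs_a, prs_b, ld_r2)
```

In this toy input, `rs2` is shared directly and `rs1` in PRS-A is tagged by `rs4` in PRS-B, while the `rs3`/`rs5` pair falls below the 0.8 threshold.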

Calculation of PRS

We calculated the PRS of an individual \(j\) as the weighted sum of the SNPs associated with the trait,

$$PRS_{j} = \sum\limits_{i=1}^{N} \beta_{i} \times dosage_{ij}$$

where N is the total number of SNPs, \(\beta_{i}\) is the effect size (beta) of SNP \(i\), and \(dosage_{ij}\) is the number of effect alleles (usually coded as 0, 1 or 2) at SNP \(i\) for individual \(j\).
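As a concrete illustration of the weighted sum, the following sketch computes the score for one individual. The function name, SNP identifiers, and weights are invented for the example; real weights would come from the published scoring files.

```python
from typing import Dict


def polygenic_risk_score(weights: Dict[str, float], dosages: Dict[str, float]) -> float:
    """Weighted sum of effect-allele dosages for one individual.

    weights: SNP ID -> published effect size (beta_i)
    dosages: SNP ID -> effect-allele dosage for this individual (0 to 2)
    SNPs absent from `dosages` (e.g. removed by QC) are skipped.
    """
    return sum(beta * dosages[snp] for snp, beta in weights.items() if snp in dosages)


# Toy example with made-up SNP IDs and weights (illustrative only).
weights = {"rs_a": 0.10, "rs_b": -0.05, "rs_c": 0.20}
dosages_j = {"rs_a": 2, "rs_b": 1, "rs_c": 0}
prs_j = polygenic_risk_score(weights, dosages_j)  # 0.10*2 - 0.05*1 + 0.20*0
```

Imputed dosages are often fractional, which is why the dosage type here is `float` rather than `int`.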

We applied genetic quality control (QC) pipelines to SNPs and samples. During SNP QC, we removed ambiguous SNPs (A/T or C/G SNPs with minor allele frequency (MAF) > 0.49) and rare variants with MAF 0, 4) (Supplementary Table 1). During sample QC, we excluded participants who were sex discordant, outliers for genotype missingness or heterozygosity, or 3rd degree or greater relatives, using UKB Data Field 22020.

We then weighted the SNPs that passed our QC by the published effect sizes for each score, taken either from the supplementary materials of the source article or from the PGS Catalog, to calculate the PRS for each participant in the study population for each of the three disease outcomes.
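The ambiguous-SNP rule can be expressed as a small filter. This sketch covers only the strand-ambiguity criterion stated above (A/T or C/G with MAF > 0.49); the rare-variant criterion is not reproduced. The record layout (`id`, `a1`, `a2`, `maf`) and function names are assumptions made for the example.

```python
from typing import Dict, List

# Allele pairs that read the same on both strands (palindromic SNPs).
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("CG")}


def is_strand_ambiguous(allele1: str, allele2: str) -> bool:
    """True for A/T or C/G SNPs, whose strand cannot be inferred from alleles."""
    return frozenset((allele1.upper(), allele2.upper())) in AMBIGUOUS_PAIRS


def drop_ambiguous(snps: List[Dict], maf_cutoff: float = 0.49) -> List[Dict]:
    """Remove SNPs that are palindromic AND have MAF above the cutoff."""
    return [
        s for s in snps
        if not (is_strand_ambiguous(s["a1"], s["a2"]) and s["maf"] > maf_cutoff)
    ]


# Toy records with made-up IDs and frequencies (illustrative only).
snps = [
    {"id": "rs1", "a1": "A", "a2": "T", "maf": 0.495},  # palindromic, MAF near 0.5: removed
    {"id": "rs2", "a1": "A", "a2": "T", "maf": 0.20},   # palindromic, but MAF resolves strand
    {"id": "rs3", "a1": "C", "a2": "G", "maf": 0.498},  # removed
    {"id": "rs4", "a1": "A", "a2": "G", "maf": 0.499},  # not palindromic: kept
]
kept_ids = [s["id"] for s in drop_ambiguous(snps)]
```

The MAF condition matters because a palindromic SNP with a clearly unequal allele frequency can still be aligned by frequency; near 0.5 that is no longer possible.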

Quantification of PRS stability

In each disease-specific study population, we first calculated the correlation coefficient between each pair of PRSs. We then calculated age- and sex-adjusted odds ratios (ORs) at various thresholds (e.g., top 1% or top 5% of the PRS distribution), with the middle quintile of the PRS as the reference group.

We calculated two versions of the area under the receiver operating characteristic curve (AUC), obtained from two separate logistic regression models for each continuous PRS:

  1. Crude-AUC, where only the PRS was fitted, adjusting for genotyping array and the first 5 principal components (PCs) of ancestry (UK Biobank Field 22009), to reflect the predictive performance of the PRS itself.

  2. Multivariate adjusted AUC (Multi-AUC), where the model was further adjusted for age and sex (if applicable).
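The grouping step that precedes the OR models can be sketched as follows. This is an illustration under stated assumptions: the reference quintile is taken to be the 40th-60th percentile band, ties are broken by input order, and the function name `assign_groups` is hypothetical. The age- and sex-adjusted logistic regression that would then compare "top" with "reference" is not shown.

```python
from typing import List


def assign_groups(scores: List[float], top_percent: int) -> List[str]:
    """Label each individual as 'top', 'reference' (40th-60th percentile) or 'other'.

    Ranks are compared with integer arithmetic to avoid floating-point
    surprises at the percentile boundaries.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    groups = ["other"] * n
    for position, i in enumerate(order, start=1):  # position 1 = lowest score
        if position * 100 > n * (100 - top_percent):
            groups[i] = "top"
        elif n * 40 < position * 100 <= n * 60:
            groups[i] = "reference"
    return groups


# Toy data: 100 individuals with distinct scores (illustrative only).
toy_scores = [float(k) for k in range(100)]
groups_top5 = assign_groups(toy_scores, top_percent=5)
groups_top1 = assign_groups(toy_scores, top_percent=1)
```

Individuals labelled "other" would be excluded from a given top-versus-reference comparison.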

The outcome of these logistic regression models is disease status (yes/no) for each of the three diseases. Crude-AUC measures the predictive power of the PRS alone, while Multi-AUC measures it after taking age and sex into account.
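For readers unfamiliar with the AUC, it equals the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen control. The sketch below computes it directly from that definition; it assumes the predicted risks have already been produced by a fitted model, and the function name and toy values are illustrative. The pairwise loop is quadratic and meant for exposition, not for cohort-scale data.

```python
from typing import List


def auc(scores: List[float], labels: List[int]) -> float:
    """AUC as the fraction of case-control pairs ranked correctly (ties count 0.5)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Toy predicted risks and outcomes (illustrative only).
toy_risk = [0.1, 0.4, 0.35, 0.8]
toy_outcome = [0, 0, 1, 1]
toy_auc = auc(toy_risk, toy_outcome)  # 3 of 4 case-control pairs ranked correctly
```

Under this reading, the two AUC versions differ only in which covariates enter the model that produces the predicted risks.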

The continuous net reclassification index (NRI) was used to compare PRS-A to PRS-B in multivariate logistic models (Table 2). The categorical NRI was used in the cross-classification of PRS percentile risk categories. The percentage reclassification for participants who experienced the outcome is shown in Supplementary Tables 5-7 and 11-13, for the highest 1% and 5% risk, respectively.
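The continuous NRI is commonly defined as the sum of two components: among participants with the outcome, the proportion whose predicted risk rises minus the proportion whose risk falls; among those without the outcome, the proportion whose risk falls minus the proportion whose risk rises. The sketch below implements that standard definition. Which score plays the role of the "old" and "new" model is a choice made for this example, and the function name and toy values are hypothetical.

```python
from typing import List


def continuous_nri(p_old: List[float], p_new: List[float], labels: List[int]) -> float:
    """Continuous NRI: (up - down | events) + (down - up | non-events)."""
    up_event = down_event = n_event = 0
    up_nonevent = down_nonevent = n_nonevent = 0
    for old, new, y in zip(p_old, p_new, labels):
        if y == 1:
            n_event += 1
            up_event += new > old
            down_event += new < old
        else:
            n_nonevent += 1
            up_nonevent += new > old
            down_nonevent += new < old
    if n_event == 0 or n_nonevent == 0:
        raise ValueError("need at least one event and one non-event")
    event_component = (up_event - down_event) / n_event
    nonevent_component = (down_nonevent - up_nonevent) / n_nonevent
    return event_component + nonevent_component


# Toy predicted risks from two models (illustrative only).
risk_old = [0.2, 0.3, 0.5, 0.6]
risk_new = [0.1, 0.2, 0.7, 0.5]
outcome = [0, 0, 1, 1]
toy_nri = continuous_nri(risk_old, risk_new, outcome)
```

In the toy data both non-events move down (component 1.0) while one event moves up and one moves down (component 0.0). The statistic ranges from -2 to 2.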

Table 2 Compared PRS for each outcome and their performance characteristics in the UK Biobank.

The 95% confidence intervals (CIs) of the AUC and NRI were estimated using 1,000 bootstrap replicates.
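A bootstrap interval of this kind can be sketched as below. The percentile method is assumed here because the text does not state which bootstrap interval was used; the function name, seed, and toy data are likewise illustrative.

```python
import random
from typing import Callable, List, Sequence, Tuple


def bootstrap_ci(
    data: Sequence,
    statistic: Callable[[List], float],
    n_replicates: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI: resample rows with replacement, recompute the statistic."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    estimates = []
    for _ in range(n_replicates):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    estimates.sort()
    lower_index = int(n_replicates * alpha / 2)
    upper_index = int(n_replicates * (1 - alpha / 2)) - 1
    return estimates[lower_index], estimates[upper_index]


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


# Toy data: CI for the mean of 1..50 (illustrative only).
toy_data = [float(k) for k in range(1, 51)]
ci_low, ci_high = bootstrap_ci(toy_data, mean)
```

For the AUC or NRI, each element of `data` would be one participant's record (predicted risks plus outcome), so that whole individuals are resampled together; a statistic such as the AUC would also need to handle resamples that happen to contain no cases.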

Ethical approval and consent to participate

The UK Biobank study (https://www.ukbiobank.ac.uk) has received ethical approval from the North West Multi-center Research Ethics Committee (REC reference: 11/NW/03820). All participants provided written informed consent prior to enrollment in the study, which was conducted in accordance with the principles of the Declaration of Helsinki. This study was conducted under UK Biobank Application ID 33952.

Patient and community involvement

The analyses presented here are based on existing data from the UK Biobank cohort study, and the authors were not involved in the recruitment of participants. To our knowledge, no patients were explicitly involved in the design or implementation of the UK Biobank study. No patients were asked to give their opinion on the interpretation or writing of these results. UK Biobank results are regularly disseminated to study participants via the study website and social media.

Consent to publication


Transparency statement

The lead author asserts that this manuscript is an honest, accurate, and transparent account of the reported study; that no important aspect of the study has been omitted; and that any deviation from the planned study has been explained.
