Genetic profiling using genome-wide SNP panels: A practical example.
(1) Background. The history of the sample collection is the single most informative item. Knowing that a sample collection originated in "Munich" [21] already narrows the pool of candidates for an individual in the data set from approximately 1 in 6.839 billion to 1 in 1.314 million. Adding the criteria "child" and "asthma" narrows the pool further to 1 in 32,850. As study participation is not equal among social classes (with a bias towards inner-city participants and towards more severe cases attending a university department), the target group may be narrowed down to a pool of fewer than 2,000 and, with known sex, fewer than 1,000.
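The narrowing described above can be expressed as bits of identifying information. The following is a minimal sketch that uses only the pool sizes quoted in the text; the entropy framing and the function names are illustrative assumptions, and the last two pool sizes are upper bounds.

```python
import math

# Candidate-pool sizes quoted in the text; each step adds one piece of metadata.
# The last two values are upper bounds ("fewer than").
steps = [
    ("world population", 6_839_000_000),
    ("origin Munich", 1_314_000),
    ("child with asthma", 32_850),
    ("participation bias", 2_000),
    ("known sex", 1_000),
]


def bits_to_single_out(pool_size):
    """Bits of information still needed to single out one person in the pool."""
    return math.log2(pool_size)


def narrowing_table(steps):
    """Return (label, pool size, bits remaining, bits gained by this step)."""
    rows = []
    previous = None
    for label, size in steps:
        remaining = bits_to_single_out(size)
        gained = 0.0 if previous is None else math.log2(previous / size)
        rows.append((label, size, remaining, gained))
        previous = size
    return rows


for label, size, remaining, gained in narrowing_table(steps):
    print(f"{label:20s} 1 in {size:>13,d}  "
          f"remaining {remaining:5.1f} bits  gained {gained:4.1f} bits")
```

On this scale, knowing the city of origin alone contributes roughly 12 of the approximately 33 bits needed to single out one person worldwide.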
(2) Subsets. Data subgroups and extreme dimensions of SNP array data may be checked by multidimensional scaling techniques. In the example above, there is stratification in the data deposited online, which is most likely explained by the inclusion of a minority group in the sample. Since public population databases include allele counts for hundreds of populations, this information can be used to trace the immigrant children to a region in Turkey. Even if there were no stratification but only inbreeding, the inbreeding coefficient could still allow a guess about which population was included. Average linkage disequilibrium values may further tell whether the population is more rural or more metropolitan. Pairwise identity-by-state distances between individuals also allow related individuals to be uncovered. Indeed, there are unreported siblings in the examined data set. Taken together, this information points to several unique cases that nurses and physicians at the Children's Hospital Munich may immediately recognize, along with pharmacists, health insurance employees, school teachers or football coaches in inner-city Munich.
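Two of the quantities mentioned above, pairwise identity-by-state (IBS) distance and the inbreeding coefficient, can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure applied to the examined data set: the genotypes, sample names and allele frequencies are hypothetical toy values, genotypes are coded as minor-allele counts (0, 1 or 2), and the inbreeding coefficient uses the simple estimate F = 1 - observed/expected heterozygosity.

```python
from itertools import combinations

# Hypothetical toy genotypes: minor-allele count per SNP, None = missing call.
genotypes = {
    "child_A": [0, 1, 2, 1, 0, 2, 1, 0],
    "child_B": [0, 1, 2, 1, 0, 2, 0, 0],
    "child_C": [2, 0, 0, 1, 2, 0, 1, 2],
}


def ibs_distance(g1, g2):
    """1 minus the mean proportion of alleles shared identical by state."""
    shared = 0
    compared = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue  # skip SNPs not genotyped in both individuals
        shared += 2 - abs(a - b)  # 0, 1 or 2 alleles shared at this SNP
        compared += 1
    if compared == 0:
        raise ValueError("no SNPs genotyped in both individuals")
    return 1.0 - shared / (2.0 * compared)


def pairwise_ibs(genos):
    """IBS distance for every unordered pair of individuals."""
    return {
        (i, j): ibs_distance(genos[i], genos[j])
        for i, j in combinations(sorted(genos), 2)
    }


def inbreeding_coefficient(genotype, allele_freqs):
    """F = 1 - observed / expected number of heterozygous genotypes."""
    observed = 0
    expected = 0.0
    for g, p in zip(genotype, allele_freqs):
        if g is None:
            continue
        observed += 1 if g == 1 else 0
        expected += 2.0 * p * (1.0 - p)  # Hardy-Weinberg expectation
    if expected == 0.0:
        raise ValueError("no informative SNPs")
    return 1.0 - observed / expected


distances = pairwise_ibs(genotypes)
for pair, d in distances.items():
    print(pair, round(d, 4))

# Hypothetical reference allele frequencies (all 0.5 for simplicity).
print(inbreeding_coefficient(genotypes["child_A"], [0.5] * 8))
```

In the toy data, child_A and child_B differ at a single allele and therefore have a much smaller distance than either has to child_C, which is the kind of signal that would flag a candidate related pair. The resulting distance matrix is also the usual input for a multidimensional scaling step.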
(3) Phenotype prediction. DNA-based prediction of individual phenotype characteristics is still in its infancy. So far, only the sex of a participant can be unequivocally determined, using for example the AmelX/AmelY marker system; height prediction may be possible using markers in hedgehog signalling or extracellular matrix genes; body mass can be predicted by MC4R gene variants. Furthermore, skin and hair colour can be guessed from OCA2 gene variants, and there might even be a small chance of predicting age, as aging introduces de novo mutations. Unique characteristics like refractive errors (MYP2), digital clubbing (HPGD), cryptorchidism (NR5A1), ear wax (ABCC11), bitter taste reception (TAS2R), freckling (BNC2), male baldness (AR, PAX1) or hair morphology (TCHH) are predictable, together with behavioural traits like aggression (MAO-A) or anxiety-type disorders (RGS2). There is a gene (AVPR1A) thought to influence divorce rate, while alcohol dependency (GHS-R1A, NPY2R) and addictive smoking (APBB1, CHRNA3) may be detectable as well. One gene became famous as the "god gene" (VMAT2) for being connected to religiosity, and another gene was assumed to influence intelligence (IQSEC2). Clearly, most of these predictions come with rather low predictive values, but it may well be possible in the near future to run an SNP-derived prediction against a leaked Facebook profile.
(4) Disease prediction. Another source of information comes from scanning for known disease variants. If present in a data set, these variants are highly associated with the development of rare monogenic disease. Each of the ~2,400 OMIM phenotypes with known molecular background may be revealing when present in an individual DNA sample. In addition, copy number variants as well as haplotypes may also be used for de-anonymization because these represent unique characteristics even in close relatives.
(5) Clinical status. Genotyping is usually done on a pool of blood cells, and the composition of that pool may be affected by several diseases. For example, cells undergoing somatic recombination may be detected in the pool if their proportion is particularly high (leukaemia) or low (AIDS). There might even be a chance of finding foreign cells (microchimerism indicative of pregnancy or recent abortion), as even 1 in 1,000 cells can be detected by genome-wide SNP arrays.
Depending on the data structure and the degree of supplemental information, there is a good chance that at least some individuals can be immediately re-identified by a profiling approach.