Analysis of Data on Related Individuals Through Inference of Identity by Descent

Tech Report Number



Pedigrees exist within populations. While close pedigree relationships may be known, in any genetic epidemiological study there are likely also relationships among pedigrees. Our ultimate goal is to combine information from within-pedigree gene descent and between-pedigree genome sharing into a single analysis. Data from SNP genotype assays, in which 300,000 or more SNP variants may be typed across the genome, provide a powerful basis for inferring segments of genome inherited identical by descent (ibd) by remote relatives, whose exact pedigree relationships may be unknown. We develop a new paradigm for the analysis of trait data on observed individuals, one that would apply if ibd among all observed individuals were exactly imputable. We then consider the extent to which such imputation is possible, presenting first a simple model for ibd between a pair of chromosomes. We extend the model in two directions: first, to allow for the linkage disequilibrium (LD) which undoubtedly exists among many of the SNPs of a dense genotyping array, and second, to model the ibd process along a chromosome for an arbitrary number of genomes jointly. Our models are illustrated by analyses of simulated data of 10,000 SNPs over a chromosome length of 1 Morgan (10^8 bp). In the first data set there is highly structured LD extending over millions of bp. In this case, LD cannot be ignored, but if an approximately correct LD model is used, ibd segments can be accurately inferred. More realistic haplotypic data are constructed using Yoruban chromosome 19 haplotypes from the HapMap project. In this case, the extent of LD is less than the segment lengths of ibd to be detected, and a no-LD analysis performs well, both for pairs and for quartets of haplotypes. If the set of four haplotypes is reduced to unphased genotypic data on a pair of individuals, inferences are much less clear and less accurate. However, in reality ibd patterns will be less challenging than the pattern constructed for our illustrative examples, and additionally family members will give at least partial phase information.

The research described in this report was supported in part by NIH grants GM046255 and HG004175. It comprises material presented in July 2008 at the Australian Statistical Conference, Melbourne, Australia, and at the IMS and Bernoulli Society 7th World Congress in Probability and Statistics, Singapore. Partial expenses for these meetings were provided by the Australian Statistical Society and by the Institute of Mathematical Statistics (IMS).

Keywords: identity by descent, SNP genotype assays, inferred genome sharing, Ewens’ sampling formula, linkage disequilibrium, multiple genomes.
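
As an illustration only, the kind of "simple model for ibd between a pair of chromosomes" with no LD that the abstract mentions can be pictured as a two-state hidden Markov model along the chromosome. The sketch below is not taken from the report: the function name, the parameters `beta` (stationary ibd probability), `alpha` (rate of change of ibd state per Morgan) and `eps` (allelic error rate), and the emission form are all assumptions chosen to keep the example small.

```python
import math

def pairwise_ibd_posterior(h1, h2, freqs, dists, beta=0.05, alpha=10.0, eps=0.01):
    """Posterior probability of ibd at each SNP for two haplotypes.

    Hypothetical two-state HMM (state 0 = not ibd, state 1 = ibd), no LD.
    h1, h2 : lists of 0/1 alleles
    freqs  : frequency of allele 1 at each SNP (strictly between 0 and 1)
    dists  : genetic distance in Morgans from SNP j-1 to SNP j (dists[0] unused)
    """
    n = len(h1)

    def emit(j):
        # Assumed emission model: independent alleles when not ibd;
        # a shared allele (up to error rate eps) when ibd.
        p = (1.0 - freqs[j], freqs[j])
        indep = p[h1[j]] * p[h2[j]]
        same = p[h1[j]] if h1[j] == h2[j] else 0.0
        return (indep, (1.0 - eps) * same + eps * indep)

    def trans(d):
        # Two-state continuous-time Markov chain with stationary
        # distribution (1 - beta, beta) and total switching rate alpha.
        change = 1.0 - math.exp(-alpha * d)
        to_ibd = beta * change
        from_ibd = (1.0 - beta) * change
        return ((1.0 - to_ibd, to_ibd), (from_ibd, 1.0 - from_ibd))

    # Forward pass, rescaled at each SNP to avoid underflow.
    fwd = []
    prev = (1.0 - beta, beta)
    for j in range(n):
        e = emit(j)
        if j == 0:
            cur = [prev[s] * e[s] for s in (0, 1)]
        else:
            t = trans(dists[j])
            cur = [e[s] * (prev[0] * t[0][s] + prev[1] * t[1][s]) for s in (0, 1)]
        z = cur[0] + cur[1]
        prev = (cur[0] / z, cur[1] / z)
        fwd.append(prev)

    # Backward pass, also rescaled.
    bwd = [(1.0, 1.0)] * n
    nxt = (1.0, 1.0)
    for j in range(n - 2, -1, -1):
        e = emit(j + 1)
        t = trans(dists[j + 1])
        cur = [t[s][0] * e[0] * nxt[0] + t[s][1] * e[1] * nxt[1] for s in (0, 1)]
        z = cur[0] + cur[1]
        nxt = (cur[0] / z, cur[1] / z)
        bwd[j] = nxt

    # Combine and normalise to get P(ibd at SNP j | all data).
    post = []
    for j in range(n):
        not_ibd = fwd[j][0] * bwd[j][0]
        ibd = fwd[j][1] * bwd[j][1]
        post.append(ibd / (not_ibd + ibd))
    return post
```

Under these assumptions, a long run of matching alleles pushes the posterior towards 1 and a run of mismatches pushes it towards 0. Modelling LD, or more than two genomes jointly, would require the richer state spaces and emission models the report describes.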


tr539.pdf (546.55 KB)