Chapter 8 Exercises: Variable selection using MCMC - cystic fibrosis genetics.
The dataset contains genetic haplotypes from 186 individuals. The first variable, y, is a case-control indicator, equal to 1 if the individual has cystic fibrosis and 0 otherwise. The matrix loc has 186 corresponding rows and 23 columns, each column corresponding to a genetic locus; an entry is 1 if one or more mutant alleles are present at that locus, and 0 otherwise. We wish to determine the approximate region of the disease-causing gene on the genome by searching for loci which are associated (in either direction) with disease status.
1. Develop a logistic regression model for these data, including all 23 loci as predictors. Use a standard logistic prior for the intercept. For the coefficients, use a normal prior with mean 0 and variance chosen to give 95% certainty that the true odds ratio is between 1/50 and 50. Which loci appear to be associated with disease status? (Note that an association in either direction could suggest that the disease-causing gene is close to this locus on the genome.)
2. Implement the MCMC-based variable selection technique discussed in Section 8.8.2. Note that the snippet of code given in the book is rather misleading - see the discussion at http://www.mrc-bsu.cam.ac.uk/bugs/thebugsbook/errata for a more helpful guide to implementing this method. Use a prior probability of 0.5 for including each covariate, and an effect size of exactly zero if the covariate is excluded from the model. If a covariate is included, use the same prior for its effect as in part 1. Note that initial values should be supplied for the inclusion indicators since WinBUGS has difficulty generating them.
According to the posterior probabilities of covariate inclusion, which locus or loci are most likely to be associated with the disease?
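As a rough aid to setting up the priors described in parts 1 and 2, the following Python sketch (not BUGS code, and not the book's solution) shows one way to turn the "95% certainty that the odds ratio is between 1/50 and 50" statement into a normal variance on the log odds ratio scale, and how the inclusion-indicator prior from part 2 behaves when simulated. The names prior_sd, draw_prior and inclusion_probabilities are hypothetical and chosen only for illustration. The sketch assumes the 95% interval is the central one, so that log(50) sits about 1.96 standard deviations from zero.

```python
import math
import random
from statistics import NormalDist

# Part 1 prior: 95% certainty that the odds ratio lies in (1/50, 50) means
# the log odds ratio lies within +/- log(50) with probability 0.95
# (assumption: a central interval of a mean-zero normal).
z = NormalDist().inv_cdf(0.975)          # about 1.96
prior_sd = math.log(50) / z
prior_var = prior_sd ** 2
prior_precision = 1 / prior_var          # BUGS dnorm is parameterised by precision

N_LOCI = 23        # number of loci (columns of loc)
P_INCLUDE = 0.5    # prior inclusion probability from part 2


def draw_prior(rng):
    """Draw one set of (indicators, coefficients) from the part 2 prior."""
    indicators = [1 if rng.random() < P_INCLUDE else 0 for _ in range(N_LOCI)]
    # Effect each covariate would have if included: same prior as part 1.
    slab = [rng.gauss(0.0, prior_sd) for _ in range(N_LOCI)]
    # Excluded covariates get an effect of exactly zero.
    coefficients = [i * d for i, d in zip(indicators, slab)]
    return indicators, coefficients


def inclusion_probabilities(indicator_draws):
    """Per-covariate mean of a list of 0/1 indicator vectors."""
    n = len(indicator_draws)
    return [sum(column) / n for column in zip(*indicator_draws)]


rng = random.Random(1)
indicator_draws = [draw_prior(rng)[0] for _ in range(5000)]
prior_inclusion = inclusion_probabilities(indicator_draws)

print(f"prior sd = {prior_sd:.3f}, variance = {prior_var:.3f}, "
      f"precision = {prior_precision:.3f}")
print("prior inclusion frequencies (first 5 loci):",
      [round(p, 2) for p in prior_inclusion[:5]])
```

The draws here come from the prior only, so every inclusion frequency should sit near 0.5. Applied to indicator vectors sampled by the MCMC run instead, the same averaging gives an estimate of each covariate's posterior inclusion probability, which is the quantity the final question asks about.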