Chapter 8 Examples: Variable selection using MCMC - cystic fibrosis genetics.
Solutions

The dataset under the arrows contains genetic haplotypes from 186 individuals. The first variable
y is a case-control indicator with 1 if the individual has cystic fibrosis. The matrix loc has 186 corresponding rows, and 23 columns each corresponding to a genetic locus, which is 1 if one or more mutant alleles are present, and 0 otherwise. We wish to determine the approximate region of the disease-causing gene on the genome by searching for loci which are associated (in either direction) with disease status.

Click arrow for data

1. Develop a logistic regression model for these data, including all 23 loci as predictors. Use a standard logistic prior for the intercept. For the coefficients, use a normal prior with mean 0 and variance chosen to give 95% certainty that the true odds ratio is between 50 and 1/50. Which loci appear to be associated with disease status? (note an association in either direction could suggest that the disease-causing gene is close to this locus on the genome).


model {
for (i in 1:n) {
y[i] ~ dbern(p[i])
logit(p[i]) <- mu[i]
mu[i] <- alpha + inprod(beta[],loc[i,])
}
alpha ~ dlogis(0, 1)
for (j in 1:23) {
beta[j] ~ dnorm(0, 0.25) # prior for OR between exp(+-2*sqrt(1/0.25)) = 1/50 to 50.
            or[j] <- exp(beta[j])
}
}


Note the use of the inprod function for constructing the linear predictor. Initial values can be safely generated automatically under these weakly informative priors. The summary statistics below suggest that loci 5, 18, 22 are associated with disease status, so that the true disease-causing gene may be close to these areas on the genome.

   node   mean   sd   MC error   2.5%   median   97.5%   start   sample
   alpha   -0.0999   1.274   0.05426   -2.597   -0.1187   2.422   1001   19000
   beta[1]   -1.019   0.6636   0.01415   -2.33   -1.007   0.2633   1001   19000
   beta[2]   -0.6753   0.7715   0.012   -2.198   -0.6731   0.835   1001   19000
   beta[3]   0.9367   0.6455   0.01213   -0.3242   0.9323   2.222   1001   19000
   beta[4]   0.9392   1.027   0.0354   -1.132   0.9636   2.936   1001   19000
   beta[5]   2.045   0.9976   0.03632   0.03561   2.066   4.01   1001   19000
   beta[6]   0.8677   1.141   0.02685   -1.347   0.8511   3.125   1001   19000
   beta[7]   -0.2698   1.129   0.02677   -2.542   -0.2418   1.895   1001   19000
   beta[8]   0.1637   0.4976   0.007071   -0.7929   0.1569   1.146   1001   19000
   beta[9]   0.2151   1.212   0.03902   -2.11   0.2031   2.65   1001   19000
   beta[10]   -2.237   1.343   0.04616   -4.949   -2.212   0.3133   1001   19000
   beta[11]   0.8785   1.121   0.03732   -1.269   0.8817   3.134   1001   19000
   beta[12]   -0.2625   1.51   0.07296   -3.281   -0.2398   2.627   1001   19000
   beta[13]   1.062   0.8981   0.02202   -0.697   1.043   2.842   1001   19000
   beta[14]   -0.0909   1.493   0.07065   -2.91   -0.1536   2.983   1001   19000
   beta[15]   -1.059   0.7906   0.01728   -2.592   -1.07   0.5127   1001   19000
   beta[16]   -0.3471   1.0   0.03088   -2.335   -0.3274   1.575   1001   19000
   beta[17]   -0.9884   1.035   0.03623   -3.018   -0.9792   1.061   1001   19000
   beta[18]   -1.921   0.7133   0.01356   -3.347   -1.92   -0.5514   1001   19000
   beta[19]   1.526   0.8057   0.02068   -0.01777   1.513   3.137   1001   19000
   beta[20]   -0.921   0.8293   0.01571   -2.577   -0.9179   0.7002   1001   19000
   beta[21]   0.2627   0.9951   0.03132   -1.619   0.2425   2.232   1001   19000
   beta[22]   -1.774   0.8075   0.02014   -3.33   -1.772   -0.1953   1001   19000
   beta[23]   0.3885   0.9269   0.02598   -1.429   0.4034   2.189   1001   19000
   or[1]   0.4482   0.3221   0.006572   0.0973   0.3654   1.301   1001   19000
   or[2]   0.686   0.6354   0.009067   0.111   0.5101   2.305   1001   19000
   or[3]   3.153   2.346   0.04186   0.7231   2.54   9.224   1001   19000
   or[4]   4.314   5.929   0.1702   0.3225   2.621   18.85   1001   19000
   or[5]   12.76   17.38   0.5305   1.036   7.896   55.15   1001   19000
   or[6]   4.646   7.998   0.1653   0.2599   2.342   22.75   1001   19000
   or[7]   1.415   2.095   0.04513   0.07867   0.7852   6.653   1001   19000
   or[8]   1.335   0.7304   0.01031   0.4525   1.17   3.146   1001   19000
   or[9]   2.655   5.233   0.1591   0.1212   1.225   14.15   1001   19000
   or[10]   0.2507   0.4599   0.01366   0.00709   0.1095   1.368   1001   19000
   or[11]   4.56   8.015   0.2226   0.2812   2.415   22.97   1001   19000
   or[12]   2.221   4.499   0.158   0.03759   0.7867   13.83   1001   19000
   or[13]   4.379   5.171   0.1255   0.4981   2.838   17.15   1001   19000
   or[14]   2.92   7.372   0.289   0.05447   0.8576   19.74   1001   19000
   or[15]   0.4775   0.4753   0.009961   0.07486   0.3429   1.67   1001   19000
   or[16]   1.152   1.413   0.03953   0.09678   0.7208   4.833   1001   19000
   or[17]   0.6373   0.8843   0.02862   0.04888   0.3756   2.888   1001   19000
   or[18]   0.1883   0.1517   0.002636   0.0352   0.1465   0.5761   1001   19000
   or[19]   6.475   7.651   0.1799   0.9824   4.542   23.04   1001   19000
   or[20]   0.5605   0.5644   0.009287   0.07599   0.3994   2.014   1001   19000
   or[21]   2.154   2.838   0.08092   0.1982   1.274   9.315   1001   19000
   or[22]   0.2358   0.2338   0.005478   0.03578   0.1699   0.8226   1001   19000
   or[23]   2.256   2.581   0.06157   0.2395   1.497   8.922   1001   19000




2.
Implement the MCMC-based variable selection technique discussed in Section 8.8.2. Note that the snippet of code given in the book is rather misleading - see the discussion at http://www.mrc-bsu.cam.ac.uk/bugs/thebugsbook/errata for a more helpful guide to implementing this method. Use a prior probability of 0.5 for including each covariate, and an effect size of exactly zero if the covariate is excluded from the model. If a covariate is included, use the same prior for its effect as in part 1. Note that initial values should be supplied for the inclusion indicators since WinBUGS has difficulty generating them.

According to the posterior probabilities of covariate inclusion, which locus or loci is most likely to be associated with the disease?


model {
for (i in 1:n) {
y[i] ~ dbern(p[i])
logit(p[i]) <- mu[i]
mu[i] <- alpha + inprod(beta[],loc[i,])
}
alpha ~ dlogis(0, 1)
for (j in 1:23) {
beta[j] <- b[j, pick[j]]
pick[j] <- I[j] + 1
I[j] ~ dbern(0.5)
   b[j,1] <- 0
b[j,2] ~ dnorm(0, 0.25)
   or[j] <- exp(beta[j])
}
}


Initial values
list(alpha=0, I=c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0))

   node   mean   sd   MC error   2.5%   median   97.5%   start   sample
   I[1]   0.4144   0.4926   0.008746   0.0   0.0   1.0   1001   39000
   I[2]   0.6185   0.4858   0.006462   0.0   1.0   1.0   1001   39000
   I[3]   0.37   0.4828   0.008057   0.0   0.0   1.0   1001   39000
   I[4]   0.4355   0.4958   0.009155   0.0   0.0   1.0   1001   39000
   I[5]   0.8026   0.3981   0.008695   0.0   1.0   1.0   1001   39000
   I[6]   0.417   0.4931   0.005327   0.0   0.0   1.0   1001   39000
   I[7]   0.3433   0.4748   0.004723   0.0   0.0   1.0   1001   39000
   I[8]   0.1884   0.391   0.003191   0.0   0.0   1.0   1001   39000
   I[9]   0.4314   0.4953   0.008555   0.0   0.0   1.0   1001   39000
   I[10]   0.7941   0.4043   0.01159   0.0   1.0   1.0   1001   39000
   I[11]   0.4835   0.4997   0.01149   0.0   0.0   1.0   1001   39000
   I[12]   0.337   0.4727   0.008552   0.0   0.0   1.0   1001   39000
   I[13]   0.3522   0.4777   0.007128   0.0   0.0   1.0   1001   39000
   I[14]   0.3373   0.4728   0.008218   0.0   0.0   1.0   1001   39000
   I[15]   0.4764   0.4994   0.007814   0.0   0.0   1.0   1001   39000
   I[16]   0.3337   0.4715   0.006459   0.0   0.0   1.0   1001   39000
   I[17]   0.4919   0.4999   0.01237   0.0   0.0   1.0   1001   39000
   I[18]   0.9175   0.2751   0.006248   0.0   1.0   1.0   1001   39000
   I[19]   0.7324   0.4427   0.01069   0.0   1.0   1.0   1001   39000
   I[20]   0.4159   0.4929   0.005996   0.0   0.0   1.0   1001   39000
   I[21]   0.3344   0.4718   0.0071   0.0   0.0   1.0   1001   39000
   I[22]   0.7735   0.4185   0.008501   0.0   1.0   1.0   1001   39000
   I[23]   0.3394   0.4735   0.006561   0.0   0.0   1.0   1001   39000
   alpha   0.2996   1.175   0.04145   -2.032   0.2923   2.586   1001   39000
   beta[1]   -0.3536   0.5883   0.01275   -1.879   0.0   0.1028   1001   39000
   beta[2]   -0.7156   0.7762   0.01133   -2.321   -0.5896   0.05648   1001   39000
   beta[3]   0.2796   0.5339   0.01099   -0.1906   0.0   1.723   1001   39000
   beta[4]   0.2326   0.7977   0.02425   -1.178   0.0   2.336   1001   39000
   beta[5]   1.162   0.9324   0.02954   0.0   1.113   3.315   1001   39000
   beta[6]   0.317   0.6226   0.009441   -0.3526   0.0   1.971   1001   39000
   beta[7]   0.09067   0.5351   0.008142   -1.189   0.0   1.339   1001   39000
   beta[8]   0.01759   0.2084   0.001695   -0.4277   0.0   0.622   1001   39000
   beta[9]   -0.1502   0.8372   0.02389   -2.175   0.0   1.742   1001   39000
   beta[10]   -1.596   1.215   0.04002   -3.936   -1.711   0.0   1001   39000
   beta[11]   0.5527   0.9131   0.02884   -0.5156   0.0   2.807   1001   39000
   beta[12]   -0.06111   0.6662   0.02076   -1.845   0.0   1.43   1001   39000
   beta[13]   0.238   0.5925   0.01217   -0.5368   0.0   1.918   1001   39000
   beta[14]   -0.01253   0.6755   0.02   -1.659   0.0   1.673   1001   39000
   beta[15]   -0.4522   0.6991   0.01347   -2.127   0.0   0.2858   1001   39000
   beta[16]   0.1306   0.5463   0.01191   -0.9272   0.0   1.616   1001   39000
   beta[17]   -0.5047   0.7681   0.02416   -2.274   0.0   0.3349   1001   39000
   beta[18]   -1.618   0.7984   0.01916   -3.026   -1.694   0.0   1001   39000
   beta[19]   1.059   0.9032   0.02481   0.0   1.079   2.849   1001   39000
   beta[20]   -0.3663   0.6619   0.01048   -2.086   0.0   0.3259   1001   39000
   beta[21]   0.02684   0.5198   0.0109   -1.148   0.0   1.408   1001   39000
   beta[22]   -1.015   0.7901   0.01794   -2.623   -1.045   0.0   1001   39000
   beta[23]   0.08762   0.5227   0.01027   -1.003   0.0   1.519   1001   39000
   or[1]   0.8005   0.3235   0.006673   0.1527   1.0   1.108   1001   39000
   or[2]   0.6292   0.3989   0.005891   0.09814   0.5545   1.058   1001   39000
   or[3]   1.599   1.38   0.02695   0.8265   1.0   5.6   1001   39000
   or[4]   2.002   3.693   0.1063   0.308   1.0   10.34   1001   39000
   or[5]   5.426   8.953   0.2826   1.0   3.043   27.53   1001   39000
   or[6]   1.836   2.701   0.046   0.7029   1.0   7.179   1001   39000
   or[7]   1.28   1.044   0.01262   0.3045   1.0   3.814   1001   39000
   or[8]   1.043   0.2895   0.002484   0.652   1.0   1.863   1001   39000
   or[9]   1.264   1.96   0.04295   0.1136   1.0   5.709   1001   39000
   or[10]   0.3904   0.4656   0.0141   0.01952   0.1806   1.0   1001   39000
   or[11]   3.039   4.868   0.1332   0.5971   1.0   16.57   1001   39000
   or[12]   1.207   1.748   0.04568   0.158   1.0   4.179   1001   39000
   or[13]   1.639   1.961   0.03971   0.5846   1.0   6.808   1001   39000
   or[14]   1.368   2.955   0.08447   0.1903   1.0   5.33   1001   39000
   or[15]   0.7694   0.4116   0.00746   0.1192   1.0   1.331   1001   39000
   or[16]   1.377   1.279   0.02655   0.3957   1.0   5.033   1001   39000
   or[17]   0.7577   0.4549   0.01226   0.1029   1.0   1.398   1001   39000
   or[18]   0.2784   0.264   0.006789   0.04852   0.1837   1.0   1001   39000
   or[19]   4.421   5.06   0.1164   1.0   2.941   17.27   1001   39000
   or[20]   0.8159   0.3846   0.005623   0.1242   1.0   1.385   1001   39000
   or[21]   1.217   1.183   0.02364   0.3171   1.0   4.086   1001   39000
   or[22]   0.4797   0.3406   0.007706   0.07259   0.3518   1.0   1001   39000
   or[23]   1.3   1.18   0.02232   0.3669   1.0   4.568   1001   39000


Locus number 18 has the highest posterior probability of inclusion in the model (0.92), with corresponding odds ratio (accounting for the 9% chance that this will be 1) of 0.28. In fact, the true disease-related gene is known to be somewhere between the loci labelled 17 and 18 in this dataset.