Chapter 8 Examples: Variable selection using MCMC - cystic fibrosis genetics.
Solutions
The dataset under the arrows contains genetic haplotypes from 186 individuals. The first variable
y
is a case-control indicator with 1 if the individual has cystic fibrosis. The matrix
loc
has 186 corresponding rows, and 23 columns each corresponding to a genetic locus, which is 1 if one or more mutant alleles are present, and 0 otherwise. We wish to determine the approximate region of the disease-causing gene on the genome by searching for loci which are associated (in either direction) with disease status.
Click arrow for data
1. Develop a logistic regression model for these data, including all 23 loci as predictors. Use a standard logistic prior for the intercept. For the coefficients, use a normal prior with mean 0 and variance chosen to give 95% certainty that the true odds ratio is between 50 and 1/50. Which loci appear to be associated with disease status? (note an association in either direction could suggest that the disease-causing gene is close to this locus on the genome).
model {
for (i in 1:n) {
y[i] ~ dbern(p[i])
logit(p[i]) <- mu[i]
mu[i] <- alpha + inprod(beta[],loc[i,])
}
alpha ~ dlogis(0, 1)
for (j in 1:23) {
beta[j] ~ dnorm(0, 0.25) # prior for OR between exp(+-2*sqrt(1/0.25)) = 1/50 to 50.
or[j] <- exp(beta[j])
}
}
Note the use of the
inprod
function for constructing the linear predictor. Initial values can be safely generated automatically under these weakly informative priors. The summary statistics below suggest that loci 5, 18, 22 are associated with disease status, so that the true disease-causing gene may be close to these areas on the genome.
node mean sd MC error 2.5% median 97.5% start sample
alpha -0.0999 1.274 0.05426 -2.597 -0.1187 2.422 1001 19000
beta[1] -1.019 0.6636 0.01415 -2.33 -1.007 0.2633 1001 19000
beta[2] -0.6753 0.7715 0.012 -2.198 -0.6731 0.835 1001 19000
beta[3] 0.9367 0.6455 0.01213 -0.3242 0.9323 2.222 1001 19000
beta[4] 0.9392 1.027 0.0354 -1.132 0.9636 2.936 1001 19000
beta[5] 2.045 0.9976 0.03632 0.03561 2.066 4.01 1001 19000
beta[6] 0.8677 1.141 0.02685 -1.347 0.8511 3.125 1001 19000
beta[7] -0.2698 1.129 0.02677 -2.542 -0.2418 1.895 1001 19000
beta[8] 0.1637 0.4976 0.007071 -0.7929 0.1569 1.146 1001 19000
beta[9] 0.2151 1.212 0.03902 -2.11 0.2031 2.65 1001 19000
beta[10] -2.237 1.343 0.04616 -4.949 -2.212 0.3133 1001 19000
beta[11] 0.8785 1.121 0.03732 -1.269 0.8817 3.134 1001 19000
beta[12] -0.2625 1.51 0.07296 -3.281 -0.2398 2.627 1001 19000
beta[13] 1.062 0.8981 0.02202 -0.697 1.043 2.842 1001 19000
beta[14] -0.0909 1.493 0.07065 -2.91 -0.1536 2.983 1001 19000
beta[15] -1.059 0.7906 0.01728 -2.592 -1.07 0.5127 1001 19000
beta[16] -0.3471 1.0 0.03088 -2.335 -0.3274 1.575 1001 19000
beta[17] -0.9884 1.035 0.03623 -3.018 -0.9792 1.061 1001 19000
beta[18] -1.921 0.7133 0.01356 -3.347 -1.92 -0.5514 1001 19000
beta[19] 1.526 0.8057 0.02068 -0.01777 1.513 3.137 1001 19000
beta[20] -0.921 0.8293 0.01571 -2.577 -0.9179 0.7002 1001 19000
beta[21] 0.2627 0.9951 0.03132 -1.619 0.2425 2.232 1001 19000
beta[22] -1.774 0.8075 0.02014 -3.33 -1.772 -0.1953 1001 19000
beta[23] 0.3885 0.9269 0.02598 -1.429 0.4034 2.189 1001 19000
or[1] 0.4482 0.3221 0.006572 0.0973 0.3654 1.301 1001 19000
or[2] 0.686 0.6354 0.009067 0.111 0.5101 2.305 1001 19000
or[3] 3.153 2.346 0.04186 0.7231 2.54 9.224 1001 19000
or[4] 4.314 5.929 0.1702 0.3225 2.621 18.85 1001 19000
or[5] 12.76 17.38 0.5305 1.036 7.896 55.15 1001 19000
or[6] 4.646 7.998 0.1653 0.2599 2.342 22.75 1001 19000
or[7] 1.415 2.095 0.04513 0.07867 0.7852 6.653 1001 19000
or[8] 1.335 0.7304 0.01031 0.4525 1.17 3.146 1001 19000
or[9] 2.655 5.233 0.1591 0.1212 1.225 14.15 1001 19000
or[10] 0.2507 0.4599 0.01366 0.00709 0.1095 1.368 1001 19000
or[11] 4.56 8.015 0.2226 0.2812 2.415 22.97 1001 19000
or[12] 2.221 4.499 0.158 0.03759 0.7867 13.83 1001 19000
or[13] 4.379 5.171 0.1255 0.4981 2.838 17.15 1001 19000
or[14] 2.92 7.372 0.289 0.05447 0.8576 19.74 1001 19000
or[15] 0.4775 0.4753 0.009961 0.07486 0.3429 1.67 1001 19000
or[16] 1.152 1.413 0.03953 0.09678 0.7208 4.833 1001 19000
or[17] 0.6373 0.8843 0.02862 0.04888 0.3756 2.888 1001 19000
or[18] 0.1883 0.1517 0.002636 0.0352 0.1465 0.5761 1001 19000
or[19] 6.475 7.651 0.1799 0.9824 4.542 23.04 1001 19000
or[20] 0.5605 0.5644 0.009287 0.07599 0.3994 2.014 1001 19000
or[21] 2.154 2.838 0.08092 0.1982 1.274 9.315 1001 19000
or[22] 0.2358 0.2338 0.005478 0.03578 0.1699 0.8226 1001 19000
or[23] 2.256 2.581 0.06157 0.2395 1.497 8.922 1001 19000
2.
Implement the MCMC-based variable selection technique discussed in Section 8.8.2. Note that the snippet of code given in the book is rather misleading - see the discussion at
http://www.mrc-bsu.cam.ac.uk/bugs/thebugsbook/errata
for a more helpful guide to implementing this method. Use a prior probability of 0.5 for including each covariate, and an effect size of exactly zero if the covariate is excluded from the model. If a covariate is included, use the same prior for its effect as in part 1. Note that initial values should be supplied for the inclusion indicators since WinBUGS has difficulty generating them.
According to the posterior probabilities of covariate inclusion, which locus or loci is most likely to be associated with the disease?
model {
for (i in 1:n) {
y[i] ~ dbern(p[i])
logit(p[i]) <- mu[i]
mu[i] <- alpha + inprod(beta[],loc[i,])
}
alpha ~ dlogis(0, 1)
for (j in 1:23) {
beta[j] <- b[j, pick[j]]
pick[j] <- I[j] + 1
I[j] ~ dbern(0.5)
b[j,1] <- 0
b[j,2] ~ dnorm(0, 0.25)
or[j] <- exp(beta[j])
}
}
Initial values
list(alpha=0, I=c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0))
node mean sd MC error 2.5% median 97.5% start sample
I[1] 0.4144 0.4926 0.008746 0.0 0.0 1.0 1001 39000
I[2] 0.6185 0.4858 0.006462 0.0 1.0 1.0 1001 39000
I[3] 0.37 0.4828 0.008057 0.0 0.0 1.0 1001 39000
I[4] 0.4355 0.4958 0.009155 0.0 0.0 1.0 1001 39000
I[5] 0.8026 0.3981 0.008695 0.0 1.0 1.0 1001 39000
I[6] 0.417 0.4931 0.005327 0.0 0.0 1.0 1001 39000
I[7] 0.3433 0.4748 0.004723 0.0 0.0 1.0 1001 39000
I[8] 0.1884 0.391 0.003191 0.0 0.0 1.0 1001 39000
I[9] 0.4314 0.4953 0.008555 0.0 0.0 1.0 1001 39000
I[10] 0.7941 0.4043 0.01159 0.0 1.0 1.0 1001 39000
I[11] 0.4835 0.4997 0.01149 0.0 0.0 1.0 1001 39000
I[12] 0.337 0.4727 0.008552 0.0 0.0 1.0 1001 39000
I[13] 0.3522 0.4777 0.007128 0.0 0.0 1.0 1001 39000
I[14] 0.3373 0.4728 0.008218 0.0 0.0 1.0 1001 39000
I[15] 0.4764 0.4994 0.007814 0.0 0.0 1.0 1001 39000
I[16] 0.3337 0.4715 0.006459 0.0 0.0 1.0 1001 39000
I[17] 0.4919 0.4999 0.01237 0.0 0.0 1.0 1001 39000
I[18] 0.9175 0.2751 0.006248 0.0 1.0 1.0 1001 39000
I[19] 0.7324 0.4427 0.01069 0.0 1.0 1.0 1001 39000
I[20] 0.4159 0.4929 0.005996 0.0 0.0 1.0 1001 39000
I[21] 0.3344 0.4718 0.0071 0.0 0.0 1.0 1001 39000
I[22] 0.7735 0.4185 0.008501 0.0 1.0 1.0 1001 39000
I[23] 0.3394 0.4735 0.006561 0.0 0.0 1.0 1001 39000
alpha 0.2996 1.175 0.04145 -2.032 0.2923 2.586 1001 39000
beta[1] -0.3536 0.5883 0.01275 -1.879 0.0 0.1028 1001 39000
beta[2] -0.7156 0.7762 0.01133 -2.321 -0.5896 0.05648 1001 39000
beta[3] 0.2796 0.5339 0.01099 -0.1906 0.0 1.723 1001 39000
beta[4] 0.2326 0.7977 0.02425 -1.178 0.0 2.336 1001 39000
beta[5] 1.162 0.9324 0.02954 0.0 1.113 3.315 1001 39000
beta[6] 0.317 0.6226 0.009441 -0.3526 0.0 1.971 1001 39000
beta[7] 0.09067 0.5351 0.008142 -1.189 0.0 1.339 1001 39000
beta[8] 0.01759 0.2084 0.001695 -0.4277 0.0 0.622 1001 39000
beta[9] -0.1502 0.8372 0.02389 -2.175 0.0 1.742 1001 39000
beta[10] -1.596 1.215 0.04002 -3.936 -1.711 0.0 1001 39000
beta[11] 0.5527 0.9131 0.02884 -0.5156 0.0 2.807 1001 39000
beta[12] -0.06111 0.6662 0.02076 -1.845 0.0 1.43 1001 39000
beta[13] 0.238 0.5925 0.01217 -0.5368 0.0 1.918 1001 39000
beta[14] -0.01253 0.6755 0.02 -1.659 0.0 1.673 1001 39000
beta[15] -0.4522 0.6991 0.01347 -2.127 0.0 0.2858 1001 39000
beta[16] 0.1306 0.5463 0.01191 -0.9272 0.0 1.616 1001 39000
beta[17] -0.5047 0.7681 0.02416 -2.274 0.0 0.3349 1001 39000
beta[18] -1.618 0.7984 0.01916 -3.026 -1.694 0.0 1001 39000
beta[19] 1.059 0.9032 0.02481 0.0 1.079 2.849 1001 39000
beta[20] -0.3663 0.6619 0.01048 -2.086 0.0 0.3259 1001 39000
beta[21] 0.02684 0.5198 0.0109 -1.148 0.0 1.408 1001 39000
beta[22] -1.015 0.7901 0.01794 -2.623 -1.045 0.0 1001 39000
beta[23] 0.08762 0.5227 0.01027 -1.003 0.0 1.519 1001 39000
or[1] 0.8005 0.3235 0.006673 0.1527 1.0 1.108 1001 39000
or[2] 0.6292 0.3989 0.005891 0.09814 0.5545 1.058 1001 39000
or[3] 1.599 1.38 0.02695 0.8265 1.0 5.6 1001 39000
or[4] 2.002 3.693 0.1063 0.308 1.0 10.34 1001 39000
or[5] 5.426 8.953 0.2826 1.0 3.043 27.53 1001 39000
or[6] 1.836 2.701 0.046 0.7029 1.0 7.179 1001 39000
or[7] 1.28 1.044 0.01262 0.3045 1.0 3.814 1001 39000
or[8] 1.043 0.2895 0.002484 0.652 1.0 1.863 1001 39000
or[9] 1.264 1.96 0.04295 0.1136 1.0 5.709 1001 39000
or[10] 0.3904 0.4656 0.0141 0.01952 0.1806 1.0 1001 39000
or[11] 3.039 4.868 0.1332 0.5971 1.0 16.57 1001 39000
or[12] 1.207 1.748 0.04568 0.158 1.0 4.179 1001 39000
or[13] 1.639 1.961 0.03971 0.5846 1.0 6.808 1001 39000
or[14] 1.368 2.955 0.08447 0.1903 1.0 5.33 1001 39000
or[15] 0.7694 0.4116 0.00746 0.1192 1.0 1.331 1001 39000
or[16] 1.377 1.279 0.02655 0.3957 1.0 5.033 1001 39000
or[17] 0.7577 0.4549 0.01226 0.1029 1.0 1.398 1001 39000
or[18] 0.2784 0.264 0.006789 0.04852 0.1837 1.0 1001 39000
or[19] 4.421 5.06 0.1164 1.0 2.941 17.27 1001 39000
or[20] 0.8159 0.3846 0.005623 0.1242 1.0 1.385 1001 39000
or[21] 1.217 1.183 0.02364 0.3171 1.0 4.086 1001 39000
or[22] 0.4797 0.3406 0.007706 0.07259 0.3518 1.0 1001 39000
or[23] 1.3 1.18 0.02232 0.3669 1.0 4.568 1001 39000
Locus number 18 has the highest posterior probability of inclusion in the model (0.92), with corresponding odds ratio (accounting for the 9% chance that this will be 1) of 0.28. In fact, the true disease-related gene is known to be somewhere between the loci labelled 17 and 18 in this dataset.