Priors in survextrap models
Christopher Jackson chris.jackson@mrc-bsu.cam.ac.uk
2024-06-29
The flexible survival models in survextrap are Bayesian.
This means that they analyse data under a parametric model, which assumes that each data point \(x\) comes from a sampling distribution with probability density function \(f(x|\theta)\). The parameters \(\theta\) are estimated by combining a prior distribution \(p(\theta)\), representing beliefs about model parameters external to the data, with a “likelihood” that represents the information from a set of data \(\boldsymbol{x}\), producing a posterior distribution \(p(\theta | \boldsymbol{x})\).
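In symbols, for a set of independent observations \(\boldsymbol{x} = (x_1, \ldots, x_n)\), the posterior combines the prior and the likelihood as
\[
p(\theta \mid \boldsymbol{x}) \propto p(\theta) \prod_{i=1}^{n} f(x_i \mid \theta).
\]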
The principle behind the package is that any information that is relevant to extrapolation should be represented transparently. This is arguably clearest if expressed through data or parametric mechanisms.
- Data \(\boldsymbol{x}\) include individual-level, right-censored survival data, and aggregate counts of survivors, as described on the front page. Expert judgements about survival can be converted to pseudo-data in the form of aggregate counts.
- Parametric mechanisms denote the form of \(f()\). These might include the notion that some people are cured (mixture cure models), or that disease-specific and background hazards can be modelled additively (relative survival models). Conventional parametric survival models (like the Weibull and log-logistic) are generally used out of familiarity, not because they are thought to represent meaningful mechanisms for survival.
The default parametric model in survextrap is a flexible spline-based model. This is designed as a “black box”, which aims to fit the data as well as possible. It is not designed to be treated as a mechanism for extrapolation.
However, this spline model has parameters \(\theta\), which still need prior distributions, because the model is Bayesian!
This article describes these two kinds of “prior” judgements that can be used in survextrap models:
- Judgements about survival probabilities, that can be converted to pseudo-data, and treated as part of the model likelihood for data \(\boldsymbol{x}\).
- Priors on parameters \(\theta\) in spline models.
Converting prior judgements about survival to external data
We suppose there is some expert judgement about survival over the time period from \(t\) to \(u\), and we can “elicit” this as the probability \(p\) that a person who is alive at \(t\) will have died before \(u\).
This elicitation might be done in many ways (e.g. through consensus methods, or consulting multiple experts and aggregating their beliefs). survextrap does not include tools or guidance for doing elicitation, but further information is available, e.g. from Bojke et al., or SHELF.
One simple way is to elicit:
- a best guess for what \(p\) is. We could interpret this as the median of the prior distribution.
- upper and lower prior credible limits, such that we are (for example) 90% sure that \(p\) is within this range. We can interpret these as the 5% and 95% quantiles.
As an example, suppose we elicited a best guess of 0.6, and a 90% credible interval of 0.4 to 0.8. These can be fitted to a probability distribution representing beliefs about \(p\). The SHELF R package provides a useful function, fitdist, to do this. We use this to fit a Beta distribution, a distribution commonly used for probabilities.
library(SHELF)
p <- c(0.4, 0.6, 0.8)
bet <- fitdist(vals=p, probs=c(0.05, 0.5, 0.95), lower=0, upper=1)$Beta
bet
## shape1 shape2
## 1 9.190895 6.19854
This gives a Beta(9.2, 6.2) distribution. We can plot its probability density function, to check that the distribution represents the beliefs about \(p\) that we intended to use:
library(ggplot2)
ggplot(data.frame(p = c(0, 1)), aes(x = p)) +
stat_function(fun = dbeta, args=list(shape1=bet$shape1, shape2=bet$shape2)) +
ylab("Density")
We now convert the elicited Beta distribution to pseudo-data.
The trick for doing this comes from a basic result about Bayesian estimation of an event probability \(p\). Suppose we start with a Beta\((a,b)\) prior for \(p\), and observe \(y\) events out of \(n\) individual observations, where each observation may or may not have resulted in an event, so that \(y\) comes from a Binomial distribution with probability \(p\). The posterior for \(p\) is then Beta\((a+y, b+n-y)\).
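To see why, note that the Beta\((a,b)\) prior density is proportional to \(p^{a-1}(1-p)^{b-1}\), and the Binomial likelihood for \(y\) events out of \(n\) is proportional to \(p^{y}(1-p)^{n-y}\), so their product is
\[
p^{a-1}(1-p)^{b-1} \times p^{y}(1-p)^{n-y} = p^{(a+y)-1}(1-p)^{(b+n-y)-1},
\]
which is the kernel of a Beta\((a+y,\ b+n-y)\) density.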
We can then equate our elicited belief (call it Beta\((e_1, e_2)\), say) with a posterior distribution under a vague prior (say a Beta\((a=0,b=0)\)). Therefore \(e_1=a+y\) and \(e_2 = b+n-y\), hence \(y=e_1\), \(n = e_1 + e_2\).
In other words, our elicited Beta\((e_1,e_2)\) distribution is equivalent to the knowledge from an aggregate count of \(e_1\) events out of \(e_1 + e_2\) observations. This aggregate “pseudo-data” can be used as an external data source in survextrap.
For example, the Beta(9,6) distribution above is equivalent to pseudo-data of 9 events out of 15 trials. The point estimate of the event probability from this data would be 9/15 = 0.6 (our original guess). A pseudo-dataset of 90 events out of 150 trials would have the same median, but a narrower credible interval, since the dataset is larger, implying greater confidence that the true \(p\) is near to the point estimate.
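As a minimal sketch of how this pseudo-data could be supplied to survextrap, assuming the aggregate external data format described on the front page, with columns start, stop, n (number alive at start) and r (number of those still alive at stop). The interval times of 5 to 10 years below are hypothetical, chosen only for illustration; 9 deaths out of 15 corresponds to 6 survivors out of 15.
external_data <- data.frame(start = 5, stop = 10, n = 15, r = 15 - 9)  # 6 survivors out of 15
ext_mod <- survextrap(Surv(years, status) ~ 1, data = colons, external = external_data, fit_method = "opt")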
Further notes on elicitation
Academic note: there are different ways of defining a vague Beta prior for a probability. Beta\((0,0)\) is a uniform distribution for the log odds \(\log(p/(1-p))\), and Beta\((1,1)\) is uniform on the scale of the probability \(p\). The “Jeffreys” prior Beta\((0.5, 0.5)\) is sometimes used as a compromise between these. The difference between these choices is unlikely to matter in practice.
- We could elicit survival probabilities over different time intervals, and use these as multiple external datasets in survextrap.
- It would probably not be appropriate to include beliefs about the same quantity, elicited from multiple experts, as multiple external datasets in survextrap. This is because the experts may have shared beliefs based on their common knowledge of the field. If multiple experts are consulted about the same quantity, their beliefs should be aggregated into a single distribution before using survextrap, e.g. by using mathematical aggregation or a consensus method. Again, see, e.g. Bojke et al. or SHELF, for more information about elicitation.
Priors on spline parameters
The parameters of the spline model are mathematical abstractions which do not have meaningful interpretations such as means or variances for survival times. However, for Bayesian modelling, we still have to place priors on them.
A general way to ensure that the chosen priors represent meaningful beliefs is through simulation. survextrap provides some tools for this.
Plotting plausible hazard trajectories
Before fitting a model, we can use the prior_sample_hazard function to check that hazard curves simulated from the prior would be considered plausible representations of the knowledge that exists outside the data.
To use this before the model is fitted, we have to know the knots defining the spline, and choose a set of priors. survextrap provides a set of default priors, which may or may not be sensible in a given application, and a procedure for picking knots (see the methods vignette).
First we fit a default model, and print the default priors that are used.
library(survextrap)
nd_mod <- survextrap(Surv(years, status) ~ 1, data=colons, fit_method="opt")
print_priors(nd_mod)
## Priors:
## Baseline log hazard scale: normal(location=0,scale=20)
## Smoothness SD: gamma(shape=2,rate=1)
The “baseline log hazard scale” parameter is the parameter \(\log(\eta)\) in the definition of the M-spline function for the hazard: \(h(t) = \eta \sum_k p_k b_k(t)\). By itself, \(\eta\) does not define a hazard. It only defines a hazard after being combined with the basis coefficients \(p_k\) and basis terms \(b_k()\).
The “smoothness SD” is the parameter \(\sigma\). This is even harder to interpret than \(\eta\). A value of \(\sigma=0\) represents certainty that the hazard is the one defined by the prior mean for the \(p_k\). Values of 1 or more favour wiggly hazard curves, while very high values are not meaningful.
The prior mean for the \(p_k\) can be examined in coefs_mean - by default, this defines a hazard function \(h(t)\) that is constant over time. The number and location of knots can also be examined in the mspline component of the model.
round(nd_mod$coefs_mean,2)
sp <- nd_mod$mspline
Together, these knots and prior choices define the model to be fitted to the data. We can now simulate the consequences of these choices, in terms of:
- the typical level of the hazard
- what range of hazard trajectories are plausible
- how much the hazard tends to vary through time
The fitted model object (nd_mod here) has a component called prior_sample, which is a list of functions that can be used to simulate various interesting quantities from this prior specification.
Constant hazard level
The first thing to verify is the typical level of the hazard. If no prior for the hazard scale parameter \(\eta\) is supplied in survextrap, then a default normal(0, 20) prior for \(\log(\eta)\) is used, which is very vague. We can illustrate the implications of this choice by deriving the prior for the constant hazard \(h(t) = \eta \sum_k p_k b_k(t)\) which is implied by the prior for \(\eta\) and the special values of \(p_k\) that imply a constant hazard (found using mspline_constant_coefs()).
The haz_const function returns a summary of the implied prior for this constant hazard, and the implied prior for the mean survival time \(1/h(t)\).
nd_mod$prior_sample$haz_const()
## haz mean
## 2.5% 8.336083e-18 1.073931e-17
## 50% 8.810345e-01 1.135029e+00
## 97.5% 9.311588e+16 1.199604e+17
These credible limits are extreme. Perhaps this is reasonable if the dataset is large and would dominate any choice of prior. In most applications, however, we have some idea of what a typical survival time should be, hence what a typical hazard should be, say within an order or two of magnitude.
A simple heuristic is to specify:
- a prior guess (e.g. we guess that average survival time is 50 years after the start of the study)
- a prior upper credible limit (e.g. we guess it is unlikely that the mean survival time is more than 80 years after the start of the study). Note that this represents uncertainty about the mean survival in a population, rather than the variability between individuals in the population, so perhaps the credible interval should be narrower than you think.
This gives a median and lower credible limit for \(h(t)\), and hence (after dividing by the constant \(\sum_k p_k b_k(t)\) for an arbitrary \(t\) and taking logs) a median \(m\) and lower credible limit \(L\) for \(\log(\eta)\). A normal prior for \(\log(\eta)\) can then be defined with median \(m\) and standard deviation \((m-L)/2\).
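As a rough check of this heuristic (not the exact calculation performed by p_meansurv, which is introduced next): the constant \(\sum_k p_k b_k(t)\) cancels in the difference \(m - L\), which therefore equals \(\log(1/50) - \log(1/80) = \log(80/50)\), so the prior SD is roughly
log(80/50) / 2   # approximately 0.235, close to the scale derived by p_meansurv below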
This is implemented in the function p_meansurv. Note that a spline knot specification is required to obtain the constant \(\sum_k p_k b_k(t)\).
p_meansurv(median=50, upper=80, mspline=sp)
For a given prior on \(\eta\) and spline specification, the function prior_haz_const can be used to summarise the implied prior on the constant hazard and mean survival time. We use this now to check that the prior that we defined with p_meansurv is the one that we intended to define.
prior_haz_const(mspline=sp,
prior_hscale = p_meansurv(median=50, upper=80, mspline=sp))
## haz mean
## 2.5% 0.0125 31.25
## 50% 0.0200 50.00
## 97.5% 0.0320 80.00
The prior median and upper credible limit match the numbers that we put in. We also get a lower 2.5% credible limit as a consequence of the normal assumption - this should be checked for plausibility, and the prior revised if necessary.
Finally, an updated survextrap model can be fitted that includes the weakly informative prior that we defined and checked. For reporting purposes, we can use print_priors to view the normal prior for \(\log(\eta)\) that was automatically derived.
ndi_mod <- survextrap(Surv(years, status) ~ 1, data=colons, fit_method="opt",
prior_hscale = p_meansurv(median=50, upper=80, mspline=sp))
print_priors(ndi_mod)
## Priors:
## Baseline log hazard scale: normal(location=-3.7853644873736,scale=0.2398021764446)
## Smoothness SD: gamma(shape=2,rate=1)
Hazard as a function of time
The prior_sample component includes a function called haz which simulates a desired number of hazard trajectories over time, given the prior. This function returns a data frame (here called haz_sim) which contains the hazard value haz, time point time and simulation replicate rep.
Here we simulate nsim=30 prior hazard trajectories from the model ndi_mod, and plot them with ggplot.
set.seed(1)
haz_sim <- ndi_mod$prior_sample$haz(nsim=30)
ggplot(haz_sim, aes(x=time, y=haz, group=rep)) +
geom_line() + xlab("Years") + ylab("Hazard") + ylim(0,0.05)
An alternative way is to use the prior_sample_hazard function, which does not require a model to be specified with the survextrap function, but does require the knots and priors to be supplied explicitly.
set.seed(1)
haz_sim <- prior_sample_hazard(knots=sp$knots, degree=sp$degree,
coefs_mean = ndi_mod$coefs_mean,
prior_hsd = p_gamma(2,1),
prior_hscale = p_meansurv(median=50, upper=80, mspline=sp),
tmax=3, nsim=30)
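The resulting haz_sim data frame can then be plotted with the same ggplot call as before:
ggplot(haz_sim, aes(x=time, y=haz, group=rep)) +
geom_line() + xlab("Years") + ylab("Hazard") + ylim(0,0.05)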
These plots can give some idea of how much risks might be expected to vary over time. The parameter driving these variations is \(\sigma\), whose prior can be defined with the prior_hsd argument to survextrap. In the standard survextrap model, \(\sigma=0\) means that we are sure that the hazard will be constant. Therefore we might want to define the prior for \(\sigma\) to rule out implausibly large changes. For example, a Gamma(2,20):
set.seed(1)
haz_sim <- prior_sample_hazard(knots=sp$knots, degree=sp$degree,
coefs_mean = ndi_mod$coefs_mean,
prior_hsd = p_gamma(2,20),
prior_hscale = p_meansurv(median=50, upper=80, mspline=sp),
tmax=3, nsim=30)
ggplot(haz_sim, aes(x=time, y=haz, group=rep)) +
geom_line() + xlab("Years") + ylab("Hazard") + ylim(0,0.05)
The choice of this prior might be important if we want to extrapolate outside the data, while accounting for potential hazard changes outside the data. If the data informing extrapolation are weak, then the extrapolations will be sensitive to the prior for these potential hazard changes.
Plotting multiple trajectories gives a rough idea of the typical amount of variation through time, but ideally we want a more quantitative summary of this.
Ratio between high and low hazards over time
The variation in a hazard curve can be summarised simply by taking a fine grid of equally-spaced points between the time boundary knots, and computing the empirical standard deviation of the hazard on those points. If the hazard is constant, then the SD will be zero.
Since the SD depends on the scale of the hazard, a more useful measure of variation would be the ratio between a particularly high and a particularly low hazard on this grid. High and low might be defined from the 90% and 10% quantiles over time. If the hazard is constant over time, this ratio will be 1.
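For a single simulated hazard trajectory, these summaries could be sketched as follows, using the haz_sim data frame simulated above (haz_sd(), described next, does this over many prior draws and reports quantiles of the results):
h1 <- haz_sim$haz[haz_sim$rep == 1]    # hazard values over the time grid for one prior draw
sd(h1)                                 # SD of the hazard over time
quantile(h1, 0.9) / quantile(h1, 0.1)  # ratio of a high to a low hazard value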
These measures are computed by the haz_sd() function: the sd_haz component is the SD over time of the hazard, sd_mean is the SD over time of the inverse hazard, and hr is the high/low hazard ratio (90%/10% by default).
set.seed(1)
ndi_mod$prior_sample$haz_sd()
## sd_haz sd_mean hr
## 2.5% 0.001515510 3.114473e+01 1.817126e+00
## 50% 0.008722217 6.809896e+02 4.013356e+01
## 97.5% 0.024717065 8.197922e+11 4.182825e+08
This shows that the default Gamma\((a=2,b=1)\) prior used in this model is very vague. We can make it less vague by reducing \(a/b\). A Gamma(2, 5), for example, leads to an upper 90% credible limit of about 5 for the hazard ratio over time. Trial and error is probably necessary to produce a prior that is judged plausible by this metric.
set.seed(1)
prior_haz_sd(mspline=sp,
prior_hsd = p_gamma(2,5),
prior_hscale = p_meansurv(median=50, upper=80, mspline=sp),
quantiles = c(0.1, 0.5, 0.9),
nsim=1000)
## sd_haz sd_mean hr
## 10% 0.0004542116 9.037567 1.080665
## 50% 0.0018053422 37.800321 1.517070
## 90% 0.0050991490 231.092233 4.620972
Note that a large number of simulations nsim from the prior may be required for a precise estimate of these tail quantiles, and the distribution is skewed, so that the 95% credible limit may be much higher than the 90% limit, for example.
Let us now fit a model and compare the posterior 95% credible interval for this hazard ratio to the prior. The posterior interval is narrower than the prior interval, showing the influence of the data.
library(dplyr)
nd_mod <- survextrap(Surv(years, status) ~ 1, data=colons, fit_method="mcmc",
chains=1, iter=1000,
prior_hsd = p_gamma(2,5))
qgamma(c(0.025, 0.975), 2, 5)
summary(nd_mod) %>% filter(variable=="hsd") %>% select(lower, upper)
In practical applications, if the prior is suspected to be influential, then the results of interest should be compared between different reasonable priors.
Hazard ratios for effects of covariates
Hazard ratios for effects of predictors in survival models are commonly presented and discussed in practice, so prior judgements about them may be easier to make, compared to the other parameters in survextrap models.
survextrap uses normal (or t) priors for the log hazard ratio. The default is normal(0, 2.5). A utility prior_hr() is provided to check what credible interval on the hazard ratio scale is implied by a particular prior for the log hazard ratio. This shows that the default is actually very vague, and supports hazard ratios of well over 100 (the 97.5% prior quantile is about 134, as shown below). The prior SD should be reduced if this is implausible and the sample size is small enough that the prior would be influential.
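For example, a call along the following lines produces the summary printed below. The exact call is an assumption here: prior_hr() is taken to accept a prior specification built with p_normal().
prior_hr(p_normal(0, 2.5))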
## 2.5% 50% 97.5%
## 7.447254e-03 1.000000e+00 1.342777e+02
Conversely, the function p_hr goes the other way round, and converts a prior median and upper 95% credible limit for the hazard ratio to a normal prior on the log scale. This shows that we should use a prior SD of 1.17 for the log HR if we wanted to give a low probability (2.5%) to HRs above 10. p_hr returns a list in the same format as p_normal, which can be supplied directly to the prior_loghr argument of survextrap models.
p_hr(median=1, upper=10)$scale
## [1] 1.17481
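As a further check (again assuming prior_hr() accepts the prior object returned by p_hr()), the implied quantiles on the hazard ratio scale should reproduce the median and upper limit that we supplied, as shown below:
prior_hr(p_hr(median=1, upper=10))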
## 2.5% 50% 97.5%
## 0.1 1.0 10.0
SD of hazard ratios over time in non-proportional hazards models
The novel non-proportional hazards model in survextrap works by modelling the spline coefficients in terms of covariates, as well as the hazard scale parameter. The model is described in the methods vignette and an example of fitting one is given in the examples vignette.
The parameters that relate the covariates to the spline coefficients do not have any direct interpretation in terms of how an effect of the covariate on the hazard changes over time. In theory, the model allows the hazard ratio to increase and decrease arbitrarily in a flexible way, in the same manner as a spline function.
The parameter \(\tau\) governs the amount of flexibility in the hazard ratio function. With \(\tau=0\) the model is a proportional hazards model, and as \(\tau\) increases the hazard ratio becomes a more flexible function of time. The exact value of \(\tau\) does not have an interpretation, but as we did for \(\sigma\) above, we can use simulation to calibrate the prior so that it matches plausible judgements.
This is done with the prior_hr_sd function, which works for \(\tau\) in the same way as prior_haz_sd does for \(\sigma\). The new requirement here is that the values of covariates should be supplied in newdata and newdata0, and the associated regression model formula in formula. The hazard ratio between newdata and newdata0 will be computed.
All the other priors should be supplied: prior_hscale for \(\eta\), prior_hsd for \(\sigma\), and prior_loghr for the “baseline” hazard ratio to which the non-proportionality effects are applied. The function will simulate from the joint prior distribution of hazard ratios implied by all these priors, together with the one that we want to calibrate: prior_hrsd, the Gamma prior for \(\tau\).
prior_hr_sd(mspline=sp,
prior_hsd = p_gamma(2,5),
prior_hscale = p_meansurv(median=50, upper=80, mspline=sp),
prior_loghr = p_normal(0,1),
prior_hrsd = p_gamma(2,3),
formula = ~ treat,
newdata = list(treat = 1),
newdata0 = list(treat = 0),
quantiles = c(0.05, 0.95),
nsim=1000)
## sd_hr hrr
## 5% 0.01502037 1.453464
## 95% 53.42515945 12.894767
The sd_hr column is the standard deviation of the hazard ratio, and the hrr column is the ratio between a high and a low value (by default the 90% versus the 10% quantile, but these can be customised with the hq argument). These quantities both have their own implicit prior distributions. Since we supplied quantiles=c(0.05, 0.95), 90% prior credible intervals for these two quantities are shown here. These should be checked for consistency with beliefs, and the Gamma prior for \(\tau\) (supplied as p_gamma(2,3)) revised if necessary.