Flexible Bayesian parametric survival models. Individual data are represented using M-splines and a proportional hazards or flexible non-proportional hazards model. External aggregate data can be included, for example, to enable extrapolation outside the individual data. A fixed background hazard can also be included in an additive hazards (relative survival) model. Mixture cure versions of these models can also be used.
Usage
survextrap(
formula,
data = NULL,
external = NULL,
cure = FALSE,
nonprop = FALSE,
prior_hscale = p_normal(0, 20),
prior_loghr = NULL,
prior_hsd = p_gamma(2, 1),
prior_cure = p_beta(1, 1),
prior_logor_cure = NULL,
prior_hrsd = p_gamma(2, 1),
backhaz = NULL,
backhaz_strata = NULL,
mspline = NULL,
add_knots = NULL,
smooth_model = "exchangeable",
hsd = "bayes",
coefs_mean = NULL,
fit_method = "mcmc",
loo = (fit_method == "mcmc"),
...
)
Arguments
- formula
A survival formula in standard R formula syntax, with a call to
Surv()
on the left hand side.Covariates included on the right hand side of the formula with be modelled with proportional hazards, or if
nonprop
isTRUE
then a non-proportional hazards is used.If
data
is omitted, so that the model is being fitted to external aggregate data alone, without individual data, then the formula should not include aSurv()
call. The left-hand side of the formula will then be empty, and the right hand side specifies the covariates as usual. For example,formula = ~1
if there are no covariates.- data
Data frame containing variables in
formula
. Variables should be in a data frame, and not in the working environment.This may be omitted, in which case
external
must be supplied. This allows a model to be fitted to external aggregate data alone, without any individual-level data.- external
External data as a data frame of aggregate survival counts with columns named:
start
: Start timestop
: Follow-up timen
: Number of people alive atstart
r
: Number of those people who are still alive atstop
If there are covariates in
formula
, then the values they take in the external data must be supplied as additional columns inexternal
. Therefore if there are external data, the covariates informula
anddata
should not be namedstart
,stop
,n
orr
.- cure
If
TRUE
, a mixture cure model is used, where the "uncured" survival is defined by the M-spline model, and the cure probability is estimated.- nonprop
Non-proportional hazards model specification. This is achieved by modelling the spline basis coefficients in terms of the covariates. See the methods vignette for more details.
If
TRUE
, then all covariates are modelled with non-proportional hazards, using the same model formula asformula
.If this is a formula, then this is assumed to define a model for the dependence of the basis coefficients on the covariates.
IF this is
NULL
orFALSE
(the default) then any covariates are modelled with proportional hazards.- prior_hscale
Prior for the baseline log hazard scale parameter (
alpha
orlog(eta)
). This should be a call to a prior constructor function, such asp_normal(0,1)
orp_t(0,2,2)
. Supported prior distribution families are normal (parameters mean and SD) and t distributions (parameters location, scale and degrees of freedom). The default is a normal distribution with mean 0 and standard deviation 20.Note that
eta
is not in itself a hazard, but it is proportional to the hazard (see the vignette for the full model specification)."Baseline" is defined by the continuous covariates taking a value of zero and factor covariates taking their reference level. To use a different baseline, the data should be transformed appropriately beforehand, so that a value of zero has a different meaning. For continuous covariates, it helps for both computation and interpretation to define the value of zero to denote a typical value in the data, e.g. the mean.
- prior_loghr
Priors for log hazard ratios. This should be a call to
p_normal()
orp_t()
. A list of calls can also be provided, to give different priors to different coefficients, where the name of each list component matches the name of the coefficient, e.g.list("age45-59" = p_normal(0,1), "age60+" = p_t(0,2,3))
The default is
p_normal(0,2.5)
for all coefficients.- prior_hsd
Gamma prior for the standard deviation that controls the variability over time (or smoothness) of the hazard function. This should be a call to
p_gamma()
. The default isp_gamma(2,1)
. Seeprior_haz_sd
for a way to calibrate this to represent a meaningful belief.- prior_cure
Prior for the baseline cure probability. This should be a call to
p_beta()
. The default is a uniform prior,p_beta(1,1)
. Baseline is defined by the mean of continuous covariates and the reference level of factor covariates.- prior_logor_cure
Priors for log odds ratios on cure probabilities. This should be a call to
p_normal()
orp_t()
. The default isp_normal(0,2.5)
.- prior_hrsd
Prior for the standard deviation parameters that smooth the non-proportionality effects over time in non-proportional hazards models. This should be a call to
p_gamma()
or a list of calls top_gamma()
with one component per covariate, as inprior_loghr
. Seeprior_hr_sd
for a way to calibrate this to represent a meaningful belief.- backhaz
Background hazard, that is, for causes of death other than the cause of interest. This defines a "relative survival" model where the overall hazard is the sum of a cause-specific hazard and a background hazard. The background hazard is assumed to be known, and the cause-specific hazard is modelled with the flexible parametric model.
The background hazard can be supplied in two forms. The meaning of predictions from the model depends on which of these is used.
(a) A data frame with columns
"hazard"
and"time"
, specifying the background hazard at all times as a piecewise-constant (step) function. Each row gives the background hazard between the specified time and the next time. The first element of"time"
should be 0, and the final row specifies the hazard at all times greater than the last element of"time"
. Predictions from the model fitted bysurvextrap
will then include this background hazard, because it is known at all times.(b) The (quoted) name of a variable in the data giving the background hazard. For censored cases, the exact value does not matter. The predictions from
survextrap
will then describe the excess hazard or survival on top of this background. The overall hazard cannot be predicted in general, because the background hazard is only specified over a limited range of time.If there is external data, and
backhaz
is supplied in form (b), then the user should also supply the background survival at the start and stop points in columns of the external data named"backsurv_start"
and"backsurv_stop"
. This should describe the same reference population asbackhaz
, though the package does not check for consistency between these.If there are stratifying variables specified in
backhaz_strata
, then there should be multiple rows giving the background hazard value for each time period and stratifying variable.If
backhaz
isNULL
(the default) then no background hazard component is included in the model.- backhaz_strata
A character vector of names of variables that appear in
backhaz
that indicate strata, e.g.backhaz_strata = c("agegroup","sex")
. This allows different background hazard values to be used for different subgroups. These variables must also appear in the datasets being modelled, that is, indata
,external
or both. Each row of those datasets should then have a corresponding row inbackhaz
which has the same values of the stratifying variables.This is
NULL
by default, indicating no stratification of the background hazard.If stratification is done, then
backhaz
must be supplied in form (a), as a data frame rather than a variable in the data.- mspline
A list of control parameters defining the spline model.
knots
: Spline knots. If this is not supplied, then the number of knots is taken fromdf
, and their location is taken from equally-spaced quantiles of the observed event times in the individual-level data.add_knots
: This is intended to be used when there areexternal
data included in the model. External data are typically outside the time period covered by the individual data.add_knots
would then be chosen to span the time period covered by the external data, so that the hazard trajectory can vary over that time.If there are external data, and both
knots
andadd_knots
are omitted, then a default set of knots is chosen to span both the individual and external data, by taking the quantiles of a vector defined by concatenating the individual-level event times with thestart
andstop
times in the external data.df
: Degrees of freedom, i.e. the number of parameters (or basis terms) intended to result from choosing knots based on quantiles of the data. The total number of parameters will then bedf
plus the number of additional knots specified inadd_knots
.df
defaults to 10. This does not necessarily overfit, because the function is smoothed through the prior.degree
: Polynomial degree used for the basis function. The default is 3, giving a cubic. This can only be changed from 3 ifbsmooth
isFALSE
.bsmooth
: IfTRUE
(on by default) the spline is smoother at the highest knot, by defining the derivative and second derivative at this point to be zero.- add_knots
Any extra knots beyond those chosen from the individual-level data (or supplied in
knots
). All other spline specifications are set to their defaults, as described inmspline
. For example,add_knots = 10
is a shorthand formspline = list(add_knots = 10)
.- smooth_model
The default
"exchangeable"
uses independent logistic priors on the multinomial-logit spline coefficients, conditionally on a common smoothing variance parameter.The alternative,
"random_walk"
, specifies a random walk prior for the multinomial-logit spline coefficients, based on logistic distributions. See the methods vignette for full details.In non-proportional hazards models, setting
smooth_model
also determines whether an exchangeable or random walk model is used for the non-proportionality parameters (\(\delta\)).- hsd
Smoothing variance parameter estimation.
"bayes"
: the smoothing parameter is estimated by full Bayes (the default)."eb"
: empirical Bayes is used.Alternatively, if a number is supplied here, then the smoothing parameter is fixed to this number.
- coefs_mean
Spline basis coefficients that define the prior mean for the hazard function. By default, these are set to values that define a constant hazard function (see
mspline_constant_coefs
). They are normalised to sum to 1 internally (if they do not already).- fit_method
Method from rstan used to fit the model.
If
fit_method="mcmc"
then a sample from the posterior is drawn using Markov Chain Monte Carlo sampling, via rstan'srstan::sampling()
function. This is the default. It is the most accurate but the slowest method.If
fit_method="opt"
, then instead of an MCMC sample from the posterior,survextrap
returns the posterior mode calculated using optimisation, via rstan'srstan::optimizing()
function. A sample from a normal approximation to the (real-line-transformed) posterior distribution is drawn in order to obtain credible intervals.If
fit_method="vb"
, then variational Bayes methods are used, via rstan'srstan::vb()
function. This is labelled as "experimental" by rstan. It might give a better approximation to the posterior thanfit_method="opt"
, and is faster than MCMC, in particular for large datasets, but has not been investigated in depth for these models.- loo
Compute leave-one-out cross-validation statistics. This is done by default. Set to
FALSE
to not compute them. If these statistics are computed, then they are returned in theloo
andloo_external
components of the object returned bysurvextrap
.loo
describes the fit of the model to the individual-level data, andloo_external
describes fit to the external data.See the
"examples"
vignette for more explanation of these.- ...
Additional arguments to supply to control the Stan fit, passed to the appropriate rstan function, depending on which is chosen through the
fit_method
argument.
Value
A list of objects defining the fitted model. These are not
intended to be extracted directly by users. Instead see
summary.survextrap
for summarising the parameter
estimates, and the functions hazard
,
survival
, rmst
and others for
computing interesting summaries of the fitted survival
distribution.
The object returned by rstan
containing samples from the fitted
model is returned in the stanfit
component. See the
rstan documentation. The
function get_draws
is provided to convert this to a
simple matrix of posterior samples with all chains collapsed.
References
Jackson, C. (2023) survextrap
: a package for flexible and transparent
survival extrapolation. BMC Medical Research Methodology 23:282.
doi:10.1186/s12874-023-02094-1