`R/disbayes.R`

`disbayes.Rd`

Estimates a three-state disease model from incomplete data. It is designed to estimate case fatality and incidence, given data on mortality and at least one of incidence and prevalence. Remission may also be included in the data and modelled.

```
disbayes(
data,
inc_num = NULL,
inc_denom = NULL,
inc_prob = NULL,
inc_lower = NULL,
inc_upper = NULL,
prev_num = NULL,
prev_denom = NULL,
prev_prob = NULL,
prev_lower = NULL,
prev_upper = NULL,
mort_num = NULL,
mort_denom = NULL,
mort_prob = NULL,
mort_lower = NULL,
mort_upper = NULL,
rem_num = NULL,
rem_denom = NULL,
rem_prob = NULL,
rem_lower = NULL,
rem_upper = NULL,
age = "age",
cf_model = "smooth",
inc_model = "smooth",
rem_model = "const",
prev_zero = FALSE,
inc_trend = NULL,
cf_trend = NULL,
cf_init = 0.01,
eqage = 30,
eqagehi = NULL,
sprior = c(1, 1, 1),
hp_fixed = NULL,
rem_prior = c(1.1, 1),
inc_prior = c(2, 0.1),
cf_prior = c(2, 0.1),
method = "opt",
draws = 1000,
iter = 10000,
stan_control = NULL,
bias_model = NULL,
...
)
```

- data
Data frame containing some of the variables below. The variables below are provided as character strings naming columns in this data frame. For each disease measure available, one of the following three combinations of variables must be specified:

(1) numerator and denominator (2) estimate and denominator (3) estimate with lower and upper credible limits.

Mortality must be supplied, and at least one of incidence and prevalence. If remission is assumed to be possible, then remission data should also be supplied (see below).

Estimates refer to the probability of having some event within a year, rather than rates. Rates per year $r$ can be converted to probabilities $p$ as $p = 1 - exp(-r)$, assuming the rate is constant within the year.

For estimates based on registry data assumed to cover the whole population, then the denominator will be the population size.

- inc_num
Numerator for the incidence data, assumed to represent the observed number of new cases within a year among a population of size

`inc_denom`

.- inc_denom
Denominator for the incidence data.

The function

`ci2num`

can be used to convert a published estimate and interval for a proportion to an implicit numerator and denominator.Note that to include extra uncertainty beyond that implied by a published interval, the numerator and denominator could be multiplied by a constant, for example, multiplying both the numerator and denominator by 0.5 would give the data source half its original weight.

- inc_prob
Estimate of the incidence probability

- inc_lower
Lower credible limit for the incidence estimate

- inc_upper
Upper credible limit for the incidence estimate

- prev_num
Numerator for the estimate of prevalence, i.e. number of people currently with a disease.

- prev_denom
Denominator for the estimate of prevalence (e.g. the size of the survey used to obtain the prevalence estimate)

- prev_prob
Estimate of the prevalence probability

- prev_lower
Lower credible limit for the prevalence estimate

- prev_upper
Upper credible limit for the prevalence estimate

- mort_num
Numerator for the estimate of the mortality probability, i.e number of deaths attributed to the disease under study within a year

- mort_denom
Denominator for the estimate of the mortality probability (e.g. the population size, if the estimates were obtained from a comprehensive register)

- mort_prob
Estimate of the mortality probability

- mort_lower
Lower credible limit for the mortality estimate

- mort_upper
Upper credible limit for the mortality estimate

- rem_num
Numerator for the estimate of the remission probability, i.e number of people observed to recover from the disease within a year.

Remission data should be supplied if remission is permitted in the model, either as a numerator and denominator or as an estimate and lower credible interval. Conversely, if no remission data are supplied, then remission is assumed to be impossible. These "data" may represent a prior judgement rather than observation - lower denominators or wider credible limits represent weaker prior information.

- rem_denom
Denominator for the estimate of the remission probability

- rem_prob
Estimate of the remission probability

- rem_lower
Lower credible limit for the remission estimate

- rem_upper
Upper credible limit for the remission estimate

- age
Variable in the data indicating the year of age. This must start at age zero, but can end at any age.

- cf_model
Model for how case fatality rate varies with age.

`"smooth"`

(the default). Case fatality rate is modelled as a smooth function of age, using a spline.`"indep"`

Case fatality rates are estimated independently for each year of age. This may be useful for determining how much information is in the data. That is, if the posterior from this model is identical to the prior for a certain age, then there is no information in the data alone about case fatality at that age, indicating that some other structural assumption (such as a smooth function of age) or external data are equired to give more precise estimates.`"increasing"`

Case fatality rate is modelled as a smooth and increasing function of age.`"const"`

Case fatality rate is modelled as constant with age.- inc_model
Model for how incidence rates vary with age.

`"smooth"`

(the default). Incidence rates are modelled as a smooth spline function of age.`"indep"`

Incidence rates for each year of age are estimated independently.- rem_model
Model for how remission rates vary with age, which are typically less well-informed by data, compared to incidence and case fatality.

`"const"`

(the default). Constant remission rate over all ages.`"smooth"`

Remission rates are modelled as a smooth spline function of age.`"indep"`

Remission rates estimated independently over all ages.- prev_zero
If

`TRUE`

, attempt to estimate prevalence at age zero from the data, as part of the Bayesian model, even if the observed prevalence is zero. Otherwise (the default) this is assumed to be zero if the count is zero, and estimated otherwise.- inc_trend
Matrix of constants representing trends in incidence through calendar time by year of age. There are

`nage`

rows and`nage`

columns, where`nage`

is the number of years of age represented in the data. The entry in the ith row and jth column represents the ratio between the incidence`nage+j`

years prior to the year of the data, year, and the incidence in the year of the data, for a person i-1 years of age. For example, if`nage=100`

and the data refer to the year 2017, then the first column refers to the year 1918 and the last (100th) column refers to 2017. The last column should be all 1, unless the current data are supposed to be biased.To produce this format from a long data frame, filter to the appropriate outcome and subgroup, and use

`pivot_wider`

, e.g.`trends <- ihdtrends filter(outcome=="Incidence", gender=="Female") pivot_wider(names_from="year", values_from="p2017") select(-age, -gender, -outcome) as.matrix()`

- cf_trend
Matrix of constants representing trends in case fatality through calendar time by year of age, in the same format as

`inc_trend`

.- cf_init
Initial guess at a typical case fatality value, for an average age.

- eqage
Case fatalities (and incidence and remission rates) are assumed to be equal for all ages below this age, inclusive, when using the smoothed model.

- eqagehi
Case fatalities (and incidence and remission rates) are assumed to be equal for all ages above this age, inclusive, when using the smoothed model.

- sprior
Rates of the exponential prior distributions used to penalise the coefficients of the spline model. The default of 1 should adapt appropriately to the data, but Higher values give stronger smoothing, or lower values give weaker smoothing, if required.

This can be a named vector with names

`"inc","cf","rem"`

in any order, giving the prior smoothness rates for incidence, case fatality and remission. If any of these are not smoothed they can be excluded, e.g.`sprior = c(cf=10, inc=1)`

.This can also be an unnamed vector of three elements, where the first refers to the spline model for incidence, the second for case fatality, the third for remission. If one of the rates (e.g. remission) is not being modelled with a spline, any number can be supplied here and it is just ignored.

- hp_fixed
A list with one named element for each hyperparameter to be fixed. The value should be either

a number (to fix the hyperparameter at this number)

`TRUE`

(to fix the hyperparameter at the posterior mode from a training run where it is not fixed)

If the element is either

`NULL`

,`FALSE`

, or omitted from the list, then the hyperparameter is given a prior and estimated as part of the Bayesian model.The hyperparameters that can be fixed are

`scf`

Smoothness parameter for the spline relating case fatality to age.`sinc`

Smoothness parameter for the spline relating incidence to age.

For example, to fix the case fatality smoothness to 1.2 and fix the incidence smoothness to its posterior mode, specify

`hp_fixed = list(scf=1.2, sinc=TRUE)`

.- rem_prior
Vector of two elements giving the Gamma shape and rate parameters of the prior for the remission rate, used in both

`rem_model="const"`

and`rem_model="indep"`

.- inc_prior
Vector of two elements giving the Gamma shape and rate parameters of the prior for the incidence rate. Only used if

`inc_model="indep"`

, for each age-specific rate.- cf_prior
Vector of two elements giving the Gamma shape and rate parameters of the prior for the case fatality rate. Only used if

`cf_model="const"`

, or if`cf_model="indep"`

, for each age-specific rate, and for the rate at`eqage`

in`cf_model="increasing"`

.- method
String indicating the inference method, defaulting to

`"opt"`

.If

`method="mcmc"`

then a sample from the posterior is drawn using Markov Chain Monte Carlo sampling, via rstan's`rstan::sampling()`

function. This is the most accurate but the slowest method.If

`method="opt"`

, then instead of an MCMC sample from the posterior,`disbayes`

returns the posterior mode calculated using optimisation, via rstan's`rstan::optimizing()`

function. A sample from a normal approximation to the (real-line-transformed) posterior distribution is drawn in order to obtain credible intervals.If the optimisation fails to converge (non-zero return code), try increasing the number of iterations from the default 1000, e.g.

`disbayes(..., iter=10000, ...)`

, or changing the algorithm to`disbayes(..., algorithm="Newton", ...)`

.If there is an error message that mentions

`chol`

, then the computed Hessian matrix is not positive definite at the reported optimum, hence credible intervals cannot be computed. This can occur either because of numerical error in computation of the Hessian, or because the reported optimum is wrong. If you are willing to believe the optimum and live without credible intervals, then set`draws=0`

to skip computation of the Hessian. To examine the problematic Hessian, set`hessian=TRUE,draws=0`

, then look at the`$fit$hessian`

component of the`disbayes`

return object. If it can be inverted, do`sqrt(diag(solve()))`

on the Hessian, and check for`NaN`

s, indicating the problematic parameters. Otherwise, diagonal entries of the Hessian matrix that are very small may indicate parameters that are poorly identified from the data, leading to computational problems.If

`method="vb"`

, then variational Bayes methods are used, via rstan's`rstan::vb()`

function. This is labelled as "experimental" by rstan. It might give a better approximation to the posterior than`method="opt"`

, but has not been investigated much for`disbayes`

models.- draws
Number of draws from the normal approximation to the posterior when using

`method="opt"`

.- iter
Number of iterations for MCMC sampling, or maximum number of iterations for optimization.

- stan_control
(

`method="mcmc"`

only). List of options supplied as the`control`

argument to`rstan::sampling()`

in rstan for the main model fit.- bias_model
Experimental model for bias in the incidence estimates due to conflicting information between the different data sources. If

`bias_model=NULL`

(the default) no bias is assumed, and all data are assumed to be generated from the same age-specific incidences.Otherwise there are assumed to be two alternative curves of incidence by age (denoted 2 and 1) where curve 2 is related to curve 1 via a constant hazard ratio that is estimated from the data, given a standard normal prior on the log scale. Three distinct curves would not be identifiable from the data.

If

`bias_model="inc"`

then the incidence data is assumed to be generated from curve 2, and the prevalence and mortality data from curve 1.`bias_model="prev"`

then the prevalence data is generated from curve 2, and the incidence and mortality data from curve 1.If

`bias_model="incprev"`

then both incidence and prevalence data are generated from curve 2, and the mortality data from curve 1.- ...
Further arguments passed to

`rstan::sampling()`

to control MCMC sampling, or`rstan::optimizing()`

to control optimisation, in Stan.

A list including the following components

`call`

: Function call that was used.

`fit`

: An object containing posterior samples from the fitted model,
in the `stanfit`

format returned by the `stan`

function in the rstan package.

`method`

: Optimisation method that was chosen.

`nage`

: Number of years of age in the data

`dat`

: A list containing the input data in the form of numerators
and denominators.

`stan_data`

: Full list of data supplied to Stan

`stan_inits`

: Full list of parameter initial values supplied to Stan

`hp_fixed`

Values of any hyperparameters that are fixed during the main model fit.

Use the `tidy.disbayes`

method to return summary statistics
from the fitted models, simply by calling `tidy()`

on the fitted model.

Jackson C, Zapata-Diomedi B, Woodcock J. (2023) "Bayesian multistate modelling of incomplete chronic disease burden data" Journal of the Royal Statistical Society, Series A, 186(1), 1-19 https://doi.org/10.1093/jrsssa/qnac015