a2bcovid: introduction

a2bcovid is a tool to estimate the likelihood that an infection was transmitted between particular individuals, and then to identify clusters of infections.

It uses data on times of symptom onset, optionally combined with genome sequences and the locations of infected individuals, and applies a statistical and evolutionary model.

It is currently designed to apply to COVID-19 infection dynamics on hospital wards.

It is available as both an R package and a web app.

An a2bcovid analysis using the R package is shown here for a simple example.

The data files for this example are installed with the package and can be read into R as follows. For each CSV file, the first two rows are shown to illustrate the required format. Full details of the format are given in the R help page help(a2bcovid).

File names such as pat_file below should be given as full path names. If a file (for example myfile.csv) is in your current working directory, you can construct its full path with code such as file.path(getwd(),"myfile.csv").
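As a minimal sketch of the path construction described above, the following writes a small placeholder file and builds its full path. The file name myfile.csv and its contents are illustrative assumptions, and the temporary directory is used only so that the example does not touch your own files.

```r
# Work in a temporary directory so the sketch leaves no files behind.
old_wd <- setwd(tempdir())

# Hypothetical placeholder file, standing in for your own data file.
write.csv(data.frame(x = 1), "myfile.csv", row.names = FALSE)

# Build the full path from the current working directory.
my_path <- file.path(getwd(), "myfile.csv")

# Check that the path points at a real file before passing it on.
path_ok <- file.exists(my_path)
print(path_ok)

# Restore the original working directory.
setwd(old_wd)
```

Checking file.exists() before calling the analysis function makes a mistyped path easier to spot.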

Times of symptom onset

pat_file <- system.file("extdata", "Example_genetic_temporal_data.csv", package="a2bcovid")
head(read.csv(pat_file),2)

##   patient_study_id onset_date onset_date_source infection_type sequence_id
## 1       CAMP001001 01/05/2020                 1              2    CAM00001
## 2       CAMP001002 05/05/2020                 1              2    CAM00002
##   sample_collection_date sample_received_date
## 1             04/05/2020           04/05/2020
## 2             08/05/2020           08/05/2020
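The dates in this file are stored as text. As a rough sketch, and assuming they are in day/month/year order (the staff location columns such as X21.05.2020 suggest this), they can be converted to R dates with base R. The data frame below simply retypes the two rows shown above; the column days_to_sample is a name introduced here for illustration.

```r
# The two example rows shown above, retyped by hand for illustration.
onset <- data.frame(
  patient_study_id = c("CAMP001001", "CAMP001002"),
  onset_date = c("01/05/2020", "05/05/2020"),
  sample_collection_date = c("04/05/2020", "08/05/2020"),
  stringsAsFactors = FALSE
)

# Assumption: dates are written as day/month/year.
fmt <- "%d/%m/%Y"
onset$onset_date <- as.Date(onset$onset_date, format = fmt)
onset$sample_collection_date <- as.Date(onset$sample_collection_date, format = fmt)

# Days between symptom onset and sample collection (illustrative column).
onset$days_to_sample <- as.numeric(onset$sample_collection_date - onset$onset_date)
print(onset)
```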

Genome sequence data

The example sequences file has been constructed so that all of the variants in the sequences appear in the first 10 positions of the genome. Viewing the file in an alignment viewer gives a simple picture of how the sequences relate to one another.

ali_file <- system.file("extdata", "Example_sequences.fa", package="a2bcovid")
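If no alignment viewer is to hand, a similar impression can be had by counting pairwise differences between aligned sequences. The sketch below uses three short made-up sequences (seq_a, seq_b and seq_c are hypothetical and are not the contents of the example file) and plain base R; it is not how a2bcovid itself processes the alignment.

```r
# Made-up aligned sequences of equal length, for illustration only.
seqs <- c(
  seq_a = "ACGTACGTACTT",
  seq_b = "ACGTTCGTACTT",
  seq_c = "TCGTTCGAACTT"
)

# Split each sequence into single characters.
chars <- strsplit(seqs, "")

# Count the positions at which each pair of sequences differs.
n <- length(seqs)
dist_mat <- matrix(0L, n, n, dimnames = list(names(seqs), names(seqs)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    dist_mat[i, j] <- sum(chars[[i]] != chars[[j]])
  }
}
print(dist_mat)
```

Pairs with few differences are the ones that would look closely related in an alignment viewer.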

Patient location data

Two alternative formats are accepted for the patient location data file, as illustrated here. The format is detected automatically: if a variable called “start_date” is supplied, long format is assumed, and if there is a variable called “StartDate_0”, wide format is assumed.
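The detection rule described above can be sketched as a small function. This is only an illustration of the rule as stated here: detect_loc_format is a hypothetical helper, not a function exported by a2bcovid, and the package's own implementation may differ.

```r
# Hypothetical helper illustrating the rule; not part of a2bcovid.
detect_loc_format <- function(col_names) {
  if ("start_date" %in% col_names) {
    "long"
  } else if ("StartDate_0" %in% col_names) {
    "wide"
  } else {
    NA_character_  # neither marker column is present
  }
}

# Column names taken from the two example layouts.
wide_names <- c("patient_study_id", "LocationName_0", "StartDate_0", "EndDate_0")
long_names <- c("patient_study_id", "from_ward", "start_date", "to_ward", "end_date")

print(detect_loc_format(wide_names))
print(detect_loc_format(long_names))
```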

Wide format: one row per patient

pat_loc_file <- system.file("extdata", "Example_pat_loc_file.csv", package="a2bcovid")
head(read.csv(pat_loc_file),2)

##   patient_study_id ward_cluster_network hcw_status
## 1       CAMP001001                    A    patient
## 2       CAMP001002                    A    patient
##   patient_movement_data_available LocationName_0 StartDate_0  EndDate_0
## 1         patient_moves_available        WARD_01  01/05/2020 07/05/2020
## 2         patient_moves_available        WARD_01  04/05/2020 09/05/2020
##   LocationName_1 StartDate_1  EndDate_1
## 1        WARD_04  07/05/2020 09/05/2020
## 2        WARD_05  09/05/2020 12/05/2020

Long format: one row per ward stay

pat_loc_file_long <- system.file("extdata", "Example_pat_loc_file_long.csv", package="a2bcovid")
head(read.csv(pat_loc_file_long),2)

##   patient_study_id from_ward start_date   to_ward   end_date
## 1       CAMP001001   WARD_01 01/05/2020 Discharge 07/05/2020
## 2       CAMP001001   WARD_04 07/05/2020 Discharge 09/05/2020
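To show how the two layouts relate, here is a rough base R sketch that reshapes the wide example rows into one row per ward stay. It assumes that LocationName_i, StartDate_i and EndDate_i correspond to from_ward, start_date and end_date, and it leaves out to_ward because its meaning is not described here. The data are retyped from the output above.

```r
# The two wide-format rows shown earlier, retyped by hand.
wide <- data.frame(
  patient_study_id = c("CAMP001001", "CAMP001002"),
  LocationName_0 = c("WARD_01", "WARD_01"),
  StartDate_0 = c("01/05/2020", "04/05/2020"),
  EndDate_0 = c("07/05/2020", "09/05/2020"),
  LocationName_1 = c("WARD_04", "WARD_05"),
  StartDate_1 = c("07/05/2020", "09/05/2020"),
  EndDate_1 = c("09/05/2020", "12/05/2020"),
  stringsAsFactors = FALSE
)

# Find the stay indices (0, 1, ...) present in the column names.
loc_cols <- grep("^LocationName_", names(wide), value = TRUE)
stay_idx <- sub("^LocationName_", "", loc_cols)

# Build one block of rows per stay index, then stack the blocks.
stays <- lapply(stay_idx, function(i) {
  data.frame(
    patient_study_id = wide$patient_study_id,
    from_ward = wide[[paste0("LocationName_", i)]],
    start_date = wide[[paste0("StartDate_", i)]],
    end_date = wide[[paste0("EndDate_", i)]],
    stringsAsFactors = FALSE
  )
})
long <- do.call(rbind, stays)

# Drop empty stays and sort by patient, then by start date (day/month/year).
long <- long[!is.na(long$from_ward) & long$from_ward != "", ]
long <- long[order(long$patient_study_id,
                   as.Date(long$start_date, format = "%d/%m/%Y")), ]
rownames(long) <- NULL
print(long)
```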

Staff location data

hcw_loc_file <- system.file("extdata", "Example_hcw_loc_file.csv", package="a2bcovid")
head(read.csv(hcw_loc_file),2)

##   patient_study_id ward_cluster X05.05.2020 X06.05.2020 X07.05.2020
## 1       CAMP001004      WARD_01           Y           Y           N
## 2       CAMP001005      WARD_01           N           N           Y
##   X08.05.2020 X09.05.2020 X10.05.2020 X11.05.2020 X12.05.2020 X13.05.2020
## 1           N           N           N           N           Y           N
## 2           N           N           N           N           N           N
##   X14.05.2020 X15.05.2020 X16.05.2020 X17.05.2020 X18.05.2020 X19.05.2020
## 1           N           N           N           N           N           N
## 2           Y           N           Y           N           N           N
##   X20.05.2020 X21.05.2020
## 1           N           N
## 2           N           N
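In this layout each date is a column, and R prefixes the column names with X when reading the file. The sketch below, using only the first three date columns retyped from the output above, recovers the dates from the column names and lists the days flagged "Y" for each staff member. Reading "Y" as "present on the ward that day" is an assumption made here for illustration.

```r
# First three date columns of the example, retyped by hand.
hcw <- data.frame(
  patient_study_id = c("CAMP001004", "CAMP001005"),
  ward_cluster = c("WARD_01", "WARD_01"),
  X05.05.2020 = c("Y", "Y")[c(1, 1)],
  X06.05.2020 = c("Y", "N"),
  X07.05.2020 = c("N", "Y"),
  stringsAsFactors = FALSE
)
hcw$X05.05.2020 <- c("Y", "N")  # values as shown in the example output

# Pick out the date columns and turn their names into Date objects.
date_cols <- grep("^X[0-9]{2}\\.[0-9]{2}\\.[0-9]{4}$", names(hcw), value = TRUE)
dates <- as.Date(sub("^X", "", date_cols), format = "%d.%m.%Y")

# For each staff member, keep the dates flagged "Y".
days_on_ward <- lapply(seq_len(nrow(hcw)), function(i) {
  dates[unlist(hcw[i, date_cols]) == "Y"]
})
names(days_on_ward) <- hcw$patient_study_id
print(days_on_ward)
```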

Example analysis

Using just the symptom onset times

library(a2bcovid)
a <- a2bcovid(pat_file = pat_file)
plot_a2bcovid(a)

Individual 1008 does not seem to infect anyone else, but otherwise most of the cases appear to be connected to each other. Individuals with lower numbers are generally more likely to have infected individuals with higher numbers.

Using symptom onset times and sequence data

a <- a2bcovid(pat_file = pat_file, ali_file = ali_file)
plot_a2bcovid(a)

With the addition of sequence information, individuals 1009 and 1010 appear more separate. They may have been infected by 1001 or 1002 but most of the links to them at type 0 have disappeared. There seems to be a cluster of individuals 1001 to 1008.

Using symptom onset times, sequence data and patient location data

Suppose that individuals 1004 to 1006 are healthcare workers, and the remaining individuals are patients in the ward. Suppose we have location data for all patients, but not the healthcare workers.

a <- a2bcovid(pat_file = pat_file, ali_file = ali_file, pat_loc_file = pat_loc_file)
plot_a2bcovid(a)

The links from 1001 and 1002 to 1009 and 1010 are now gone, with these last two individuals being seen as clearly separate from the remaining cases. The plot suggests that 1007 infected 1008 in the absence of other known cases.

Using symptom onset times, sequence data, patient location and staff location data

Finally, we add in location data for the healthcare workers, individuals 1004 to 1006.

a <- a2bcovid(pat_file = pat_file, ali_file = ali_file, pat_loc_file = pat_loc_file, hcw_loc_file = hcw_loc_file)
plot_a2bcovid(a)

More resolution is now seen around these individuals. For example, 1006 appears not to have infected anyone else, but may have been infected by 1002 or 1003. In general, the question of who infected whom is not resolved for the cluster of cases at the top right of the plot. However, a cluster linking individuals 1001 to 1008, and a second linking 1009 and 1010, can be identified from this plot for further investigation.

Note that in the current version of a2bcovid, the locations of patients and of healthcare workers are supplied in two different files with different formats. The calculation does not require this; it is a legacy of the original setting in which the package was used. In a future version, the data format might be standardised.

Additional plotting options

By default, the individuals in the plot are sorted in a way that highlights potential clusters of infections. To sort them in the order that they were provided in the original data, specify cluster=FALSE.

plot_a2bcovid(a, cluster=FALSE)

By default, the colours in the plot indicate ranges of significance levels for a test of the hypothesis that transmission occurred between a pair of individuals. Three categories are shown, corresponding to different ranges of the test p-value:

(p > 0.05) Data “Consistent” with transmission
(0.01 < p < 0.05) Data of “Borderline” significance
(p < 0.01) Data “Unlikely” to have arisen from a transmission event
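The three ranges above can be reproduced in base R with cut(). The p-values below are made up for illustration and are not output from a2bcovid, and the treatment of values exactly equal to 0.01 or 0.05 is an assumption, since the ranges above do not say which category they fall in.

```r
# Made-up p-values, for illustration only.
p_values <- c(0.20, 0.03, 0.004)

# Assumption: boundary values go to the lower category,
# i.e. intervals [0, 0.01], (0.01, 0.05] and (0.05, 1].
category <- cut(
  p_values,
  breaks = c(0, 0.01, 0.05, 1),
  labels = c("Unlikely", "Borderline", "Consistent"),
  include.lowest = TRUE
)
print(data.frame(p_values, category))
```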

A smoother plot can be obtained by specifying continuous=TRUE. Here the colours vary smoothly with the p-value.

plot_a2bcovid(a, continuous=TRUE)
