added references + additional descriptive covariates

Commit 38745375 authored Apr 29, 2019 by Kevin
Showing with 138 additions and 206 deletions

manuscript/manuscript.Rmd +99 -201

manuscript/references.bib +32 -5

scripts/download_v1.1.sh +7 -0

--- a/manuscript/manuscript.Rmd
+++ b/manuscript/manuscript.Rmd
@@ -2,10 +2,10 @@
 title: "Model-based longitudinal imputation of cross-sectional outcomes in traumatic brain injury"
 output: 
  word_document: default
-  pdf_document: default
+  pdf_document:  default
  html_document:
    keep_md: yes
-bibliography: "references.bib"
+bibliography:  "references.bib"
 params:
  data_dir:    "../output/v1.1/data"
  config_file: "../config.yml"
@@ -24,46 +24,25 @@ config <- yaml::read_yaml(params$config_file)
 Assessments of global functional outcome such as the Glasgow Outcome Scale (GOS)
 and the Glasgow Outcome Scale extended (GOSe) capture 
 meaningful differences across the full spectrum of recovery,
-<<<<<<< HEAD
-and have popularity as endpoints in TBI studies @horton2018randomized.
+and are popular endpoints in traumatic brain injury (TBI) studies (@horton2018).
 However, missing outcome data is a common problem in TBI research, 
 and for longitudinal studies completion rates at six months can be 
-lower than 70% @richter2019handling.
-=======
-and have popularity as endpoints in traumatic brain injury (TBI0 studies 
-(Horton et al 2018).
-However, missing outcome data is a common problem in TBI research, 
-and for longitudinal studies completion rates at six months can be 
-lower than 70% (Richter et al).
+lower than 70% (@richter2019).
 This is important since it is well known that complete-case analyses may 
-introduce bias or reduce power [@white2010bias]. 
+introduce bias or reduce power (@white2010). 

->>>>>>> 943d2c27a63fd82c265933c46a0d6ab674191f03
 Imputation of patient outcomes is gradually gaining acceptance in the TBI
 field as a method of dealing with missing data.
 Recent longitudinal studies have successfully employed techniques for both
-<<<<<<< HEAD
-single @clifton2011 @silver2006 @skolnick2014 and
-multiple imputation @bulger2010 @kirkness2006 @wright2014 @robertson2014.
-The advantage of multiple imputation over single imputation clearly lies in the
-fact that the uncertainty about imputed values can be properly accounted for 
-in a subsequent statistical analysis.
-Despite these considerations, last observation carried forward (LOCF) is still
-a widely applied method for dealing with missing data in TBI research,
-e.g. simply substituting the respective 3-months outcome for a missing 6-months data point [REFERENCE]. 
-Although LOCF is simple to understand and implement, the technique clearly 
-lacks in several respects.
-=======
-single (Clifton et al., 2011; Silver et al 2006; Skolnick et al 2014) and
-multiple imputation (Bulger et al, 2010, Kirkness et al 2006;
-Wright et al, 2014; Robertson et al, 2014).
+single (@clifton2011; @silver2006; @skolnick2014) and
+multiple imputation (@bulger2010; @kirkness2006; @wright2014; @robertson2014).
 Last observation carried forward (LOCF) is a widely applied single-imputation
 method for dealing with missing data in TBI research.
-Typically, 3-month outcome is substituted for missing 6-month data [REFERENCE].
-Although LOCF is easz to understand and implement, the technique is clearly 
+Typically, the 3-month outcome is substituted for missing 6-month data (@steyerberg2008).
+Although LOCF is easy to understand and implement, the technique is clearly 
 lacking in several respects.
->>>>>>> 943d2c27a63fd82c265933c46a0d6ab674191f03
-Firstly, it is biased in that it neglects potential trends in the GOS(e) trajectories.
+Firstly, it is biased in that it neglects potential trends in the GOS(e) 
+trajectories.
 Secondly, a naive application of LOCF is also inefficient since it neglects
 data observed shortly after the target time window.
 E.g., a GOS(e) value recorded at 200 days post-injury is likely to be more
@@ -77,7 +56,7 @@ cannot be used to obtain multiply imputed data sets by design.

 In this manuscript, three model-based imputation strategies for GOSe at 
 6 months (= 180 days) post-injury in the longitudinal CENTER-TBI study 
-[@center2015collaborative] are compared with LOCF.
+(@center2015collaborative) are compared with LOCF.
 While we acknowledge the superiority in principle of multiple imputation over
 single imputation for propagating imputation uncertainty,
 we focus on single-imputation performance for primarily practical reasons.
@@ -98,13 +77,6 @@ for imputing cross-sectional GOSe at 6 months exploiting the longitudinal
 GOSe measurements.
 Each model is fit in a version with and without baseline covariates.

-<<<<<<< HEAD
-We propose three different model-based approaches - a mixed-effects model, 
-a Gaussian process regression, and a multi-state model - for imputing GOSe 
-longitudinally each of which we fit in a version including baseline covariates
-and without.
-=======
->>>>>>> 943d2c27a63fd82c265933c46a0d6ab674191f03


 # Methods
@@ -112,7 +84,49 @@ and without.
 ## Study population

 ```{r read-population-data, include=FALSE}
-df_baseline <- read_rds(paste0(params$data_dir, "/df_baseline.rds"))
+df_baseline <- read_rds(paste0(params$data_dir, "/df_baseline.rds")) %>% 
+  mutate(
+    InjuryHx.PupilsBaselineDerived = factor(InjuryHx.PupilsBaselineDerived,
+                                            levels = 0:2),
+    InjuryHx.GCSScoreBaselineDerived_dscr = case_when(
+      InjuryHx.GCSScoreBaselineDerived <= 8  ~ "Severe",
+      InjuryHx.GCSScoreBaselineDerived <= 12 ~ "Moderate",
+      InjuryHx.GCSScoreBaselineDerived  > 12 ~ "Mild"
+    ) %>% as.factor()
+  ) %>% 
+  left_join(
+    read_rds(
+      "../data/v1.1/df_baseline_descriptive.rds"
+    ) %>%
+    transmute(
+      gupi,
+      Subject.PatientType = factor(
+        Subject.PatientType, 
+        levels = 1:3, 
+        labels = c("Emergency Room", "Admission to Hospital", "Intensive Care Unit")
+      ),
+      InjuryHx.InjCause = factor(InjuryHx.InjCause,
+        levels = c(1:6, 88, 99),
+        labels = c(
+          "Road traffic incident",
+          "Incidental fall",
+          "Other non-intentional injury",
+          "Violence/assault",
+          "Act of mass violence",
+          "Suicide attempt",
+          "Unknown",
+          "Other"
+        ) %>% 
+        fct_recode(
+          Other              = "Other non-intentional injury",
+          Other              = "Suicide attempt",
+          `Violence/assault` = "Act of mass violence"
+        )
+      ),
+      InjuryHx.TotalISS
+    ),
+    by = "gupi"
+  )

 n_pat <- nrow(df_baseline)
 ```
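Review note: the recoding added in this hunk can be checked on toy data. The following is only a base-R sketch of the same banding and level collapsing, not the code used in the manuscript; the vectors `gcs` and `cause_code` are made up, and `cut()` and `levels<-` stand in for `case_when()` and `fct_recode()`.

```r
# Toy GCS scores (hypothetical); NA mimics a missing baseline value
gcs <- c(3, 8, 9, 12, 13, 15, NA)

# Same bands as the case_when() call: <= 8 severe, 9-12 moderate, > 12 mild
severity <- cut(
  gcs,
  breaks = c(-Inf, 8, 12, Inf),
  labels = c("Severe", "Moderate", "Mild")
)

# Toy injury-cause codes (hypothetical), using the level/label pairs from the diff
cause_code <- c(1, 2, 3, 4, 5, 6, 88, 99, NA)
cause <- factor(
  cause_code,
  levels = c(1:6, 88, 99),
  labels = c(
    "Road traffic incident", "Incidental fall", "Other non-intentional injury",
    "Violence/assault", "Act of mass violence", "Suicide attempt",
    "Unknown", "Other"
  )
)

# Collapse rare categories; assigning an existing level name merges the levels
levels(cause)[levels(cause) %in%
  c("Other non-intentional injury", "Suicide attempt")] <- "Other"
levels(cause)[levels(cause) == "Act of mass violence"] <- "Violence/assault"

print(table(severity, useNA = "ifany"))
print(table(cause, useNA = "ifany"))
```

One difference to keep in mind: `cut()` orders the levels as given (Severe, Moderate, Mild), whereas `as.factor()` on the character output of `case_when()` sorts them alphabetically.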
@@ -127,7 +141,7 @@ Follow-up of participants was scheduled per protocol at 2 weeks, 3 months,
 and 6 months in group (a) and at 3 months, 6 months, and 12 months in groups 
 (b) and (c).   
 Outcome assessments at all timepoints included the GOSe 
-[@jennett1981disability, @mcmillan2016glasgow]. 
+(@jennett1981disability; @mcmillan2016glasgow). 
 The GOSe is an eight-point scale with the following categories: 
 (1) dead, (2) vegetative state, (3) lower severe disability, 
 (4) upper severe disability, (5) lower moderate disability, 
@@ -142,13 +156,6 @@ The rationale for conducting the comparison conditional on 6-months survival
 is simply that the GOSe can only be missing at 6-months if the individuals
 are still alive since GOSe would be (1) otherwise.

-**TODO:**
-
-* ->LW: I explicitly asked for additional covariates that are needed from the DB,
-happy to include them if you pass me the Neurobot codes!
-* yes, the entire document is automatically generated to ensure reproducibility and
-up-to-date data; once we agree on the manuscript we can pimp the table formatting and
-add cross-references to the respective figures (that is only supported for pdf output which you could not edit directly)


 ```{r baseline-table-continuous, echo=FALSE, results='asis'}
@@ -171,7 +178,7 @@ df_baseline %>%
  unnest() %>%
  spread(statistic, value) %>% 
  unnest() %>% 
-  pander::pandoc.table("Discrete baseline variables", digits = 3)
+  pander::pandoc.table("Continuous baseline variables", digits = 3, split.tables = 120)
 ```

 ```{r baseline-table-discrete, echo=FALSE, results='asis'}
@@ -195,6 +202,8 @@ summarizer <- function(x) {
 df_baseline %>%
  select_if(~!is.numeric(.)) %>% 
  select(-gupi) %>% 
+  mutate_all(as.factor) %>% 
+  mutate_all(fct_explicit_na) %>% 
  mutate_all(as.character) %>% 
  gather(variable, value) %>% 
  group_by(variable) %>% 
@@ -204,8 +213,9 @@ df_baseline %>%
  unnest() %>%
  spread(statistic, value) %>% 
  unnest(`# NA`) %>%
+  select(-`# NA`) %>% 
  unnest(table) %>% 
-  pander::pandoc.table("Continuous baseline variables", digits = 3)
+  pander::pandoc.table("Discrete baseline variables", digits = 3, split.tables = 120)
 ```

 ```{r read-gose-data, include=FALSE}
@@ -225,7 +235,7 @@ n_gose_pp <- df_gose %>%
 Only GOSe observations between injury and 18 months post injury are used
 since extremely late follow-ups are not providing enough information.
 This leads to a total of `r nrow(df_gose)` GOSe observations of the study
-population being availabe for the analyses in this manuscript.
+population being available for the analyses in this manuscript.
 Only for `r n_gose_180` (`r round(n_gose_180 / n_pat * 100, 1)`%) individuals,
 GOSe observations at 180 +/- 14 days post injury are available,
 for `r n_gose_pp` (`r round(n_gose_pp / n_pat * 100, 1)`%) individuals
@@ -344,25 +354,26 @@ GOSe value for subjects where at least one value is available within the
 first 180 days post injury.
 We account for this lack of complete coverage under LOCF by performing all
 performance comparisons including LOCF only on the subset of individuals
-for which a LOCF-imputed value can be obtained (cf. Section ???). 
+for which a LOCF-imputed value can be obtained. 


 ## Mixed-effects model

-Mixed effects models are a a widely used approach in longitudinal 
-data analysis andd model individual deviations from a population mean trajectory 
-[@verbeke2009linear].
+Mixed effects models are a widely used approach in longitudinal 
+data analysis and model individual deviations from a population mean trajectory 
+(@verbeke2009linear).
 To account for the fact that the GOSe outcome is an ordered factor,
-we employ a cumulative link function model with flexible intercepts [@Agresti2003].
-The population mean is modeled as spline function with knots at ???
-to allow a non-linear population mean trajectory.
+we employ a cumulative link function model with flexible intercepts 
+(@Agresti2003).
+The population mean is modeled as a cubic spline function to allow a 
+non-linear population mean trajectory.
 Patient-individual deviations from this population mean are modeled 
 as quadratic polynomials to allow sufficient flexibility (random effects).
 Baseline covariates are added as linear fixed effects to the 
 population mean.
 The model was fitted using Bayesian statistics via the brms 
-package [@brms2017, @brms2018] for the R environment for statistical 
-computing [@R2016] and the Stan modelling language [@stan2017]. 
+package (@brms2017; @brms2018) for the R environment for statistical 
+computing (@R2016) and the Stan modelling language (@stan2017). 
 Non-informative priors were used for the model parameters.
 A potential drawback of the proposed longitudinal mixed effects model 
 is the fact that the individual deviations from the population mean 
@@ -423,7 +434,7 @@ Since the number of observations per individual is very limited in our
 data set (1 to 4 GOSe observations per individual),
 an approach explicitly modelling transition probabilities might be 
 more suitable to capture the dynamics of the GOSe trajectories.
-To explore this further, a Markov multi-state model is considered [REFERENCE].
+To explore this further, a Markov multi-state model is considered (@meira2009).
 This model class assumes that the transitions between adjacent GOSe 
 states can be modeled as a Markov process and the transition 
 intensities between adjacent states are fitted to the observed data.
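Review note: to make the link between transition intensities and imputed probabilities concrete, here is a minimal base-R sketch. It is not the fitted model from the manuscript: every intensity below is invented, treating GOSe 1 (dead) as an absorbing state is an assumption, and `mat_exp()` is a hypothetical helper (scaling and squaring with a truncated Taylor series) rather than a function from any package.

```r
# Hypothetical generator matrix Q for GOSe states 1..8 with transitions
# only between adjacent states; all rates are per day and made up.
n_states <- 8
Q <- matrix(0, n_states, n_states)
for (i in 2:(n_states - 1)) {   # row 1 stays zero: state 1 assumed absorbing
  Q[i, i + 1] <- 0.010          # improvement by one category
  Q[i, i - 1] <- 0.002          # deterioration by one category
}
Q[n_states, n_states - 1] <- 0.002
diag(Q) <- -rowSums(Q)          # rows of a generator matrix sum to zero

# Hypothetical helper: matrix exponential via scaling and squaring
mat_exp <- function(A, squarings = 10, terms = 20) {
  A <- A / 2^squarings
  result <- diag(nrow(A))
  term <- diag(nrow(A))
  for (k in seq_len(terms)) {
    term <- term %*% A / k      # k-th Taylor term A^k / k!
    result <- result + term
  }
  for (s in seq_len(squarings)) result <- result %*% result
  result
}

# Transition probabilities over 90 days, e.g. from a 3-month observation
# to the 6-month target; row i is the distribution given current state i
P_90 <- mat_exp(Q * 90)
print(round(P_90[3, ], 3))
```

Under the Markov assumption, the row of `P_90` for the last observed state gives the class probabilities at the target time, which is the quantity an imputation would be based on.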
@@ -586,7 +597,7 @@ All measures are considered both conditional on the ground-truth
 LOCF, by design, cannot provide imputed values when there are no
 observations before 180 days post injury.
 A valid comparison of LOCF with the other methods must therefore be 
-baseed on the set of individuals for whom an LOCF imputation is possible.
+based on the set of individuals for whom an LOCF imputation is possible.
 ```{r non-locf-ids, include=FALSE}
 idx <- df_predictions %>% 
  filter(model == "LOCF", !complete.cases(.)) %>% 
@@ -710,7 +721,7 @@ Both the raw count as well as the relative (by left-out observed GOSe) confusion
 are presented in Figure ???.

 ```{r confusion-matrix-locf, warning=FALSE, message=FALSE, echo=FALSE, fig.cap="Confusion matrices on LOCF subset.", fig.height=9, fig.width=6}
-plot_confusion_matrices <- function(df_predictions, models) {
+plot_confusion_matrices <- function(df_predictions, models, nrow = 2) {
  
  df_average_confusion_matrices <- df_predictions %>% 
    filter(model %in% models) %>% 
@@ -748,7 +759,7 @@ plot_confusion_matrices <- function(df_predictions, models) {
      theme(
        panel.grid = element_blank()
      ) + 
-      facet_wrap(~model, nrow = 2) +
+      facet_wrap(~model, nrow = nrow) +
       ggtitle("Average confusion matrix across folds (absolute counts)")
  
  p_cnf_mtrx_colnrm <- df_average_confusion_matrices %>%
@@ -768,7 +779,7 @@ plot_confusion_matrices <- function(df_predictions, models) {
      theme(
        panel.grid = element_blank()
      ) + 
-      facet_wrap(~model, nrow = 2) +
+      facet_wrap(~model, nrow = nrow) +
       ggtitle("Average confusion matrix across folds (column fraction)")
  
  cowplot::plot_grid(p_cnf_mtrx_raw, p_cnf_mtrx_colnrm, ncol = 1, align = "v")      
@@ -777,7 +788,8 @@ plot_confusion_matrices <- function(df_predictions, models) {

 plot_confusion_matrices(
  df_predictions %>% filter(!(gupi %in% idx)), 
-  c("MSM", "GP + cov", "MM", "LOCF")
+  c("MSM", "GP + cov", "MM", "LOCF"),
+  nrow = 2
 )

 ggsave(filename = "confusion_matrices_locf.pdf", width = 6, height = 9)
@@ -796,77 +808,12 @@ Both the MSM and the MM models account for this by almost never imputing a
 GOSe of 4.
 Instead, the respective cases tend to be imputed to GOSe 3 or 5.

-**TODO:**
-
-* this section table is the one we David requested in our last meeting,
-not entirely convinced though ...
-* ... 1 -> > 1 is not relevant since our imputation is conditional
-on not being 1 at 6 months
-* ... the comparison seems to favor LOCF since only
-upward confusions are considered (which LOCF by design tends to do less)]
-* Is there a clinical interpretation along the way that '4' might constitue
-a short-term transition state or is it just defined in a way that makes it
-highly unlikely to be observed in practice?
-
-```{r crossing-table, echo=FALSE, warning=FALSE, results='asis'}
-models <- c("MSM", "GP + cov", "MM")
-df_average_confusion_matrices <- df_predictions %>% 
-    filter(model %in% models) %>% 
-    filter(!(gupi %in% idx)) %>% 
-    group_by(fold, model) %>% 
-    do(
-      confusion_matrix = caret::confusionMatrix(
-          data = factor(.$prediction, levels = 1:8), 
-          reference = factor(.$GOSE, levels = 1:8)
-        ) %>% 
-        as.matrix %>% as_tibble %>% 
-        mutate(`Predicted GOSE` = row_number() %>% as.character) %>% 
-        gather(`Observed GOSE`, n, 1:8)
-    ) %>% 
-    unnest %>% 
-    group_by(model, `Predicted GOSE`, `Observed GOSE`) %>% 
-    summarize(n = mean(n)) %>% 
-    ungroup %>% 
-    mutate(model = factor(model, models))
-rbind(
-df_average_confusion_matrices %>% 
-  filter(model %in% c("LOCF", "MM", "GP + cov", "MSM")) %>% 
-  group_by(model) %>% 
-  filter(`Observed GOSE` <= 3) %>% 
-  mutate(n_total = sum(n)) %>% 
-  filter(`Predicted GOSE` > 3) %>% 
-  summarize(fraction = sum(n / n_total)) %>% 
-  mutate(`Event` = "<=3 -> >3"),
-
-df_average_confusion_matrices %>% 
-  filter(model %in% c("LOCF", "MM", "GP + cov", "MSM")) %>% 
-  group_by(model) %>% 
-  filter(`Observed GOSE` == 4) %>% 
-  mutate(n_total = sum(n)) %>% 
-  filter(`Predicted GOSE` > 4) %>% 
-  summarize(fraction = sum(n / n_total)) %>% 
-  mutate(`Event` = "4 -> >4"),
-
-df_average_confusion_matrices %>% 
-  filter(model %in% c("LOCF", "MM", "GP + cov", "MSM")) %>% 
-  group_by(model) %>% 
-  filter(`Observed GOSE` < 8) %>% 
-  mutate(n_total = sum(n)) %>% 
-  filter(`Predicted GOSE` == 8) %>% 
-  summarize(fraction = sum(n / n_total)) %>% 
-  mutate(`Event` = "<8 -> 8")
-) %>% 
-  transmute(Model = model, Percent = 100*fraction, Event) %>% 
-  spread(Event, Percent) %>% 
-  pander::pandoc.table("Some specific confusion percentages, LOCF subset.", digits = 3)
-```
-
 To better understand the overall performance assessment in Figure ???,
 we also consider the performance conditional on the respective ground-truth
 (i.e. the observed GOSe values in the test sets).
-The results are shown in Figure ??? (vertical bars are =/- one standard error of the mean).
+The results are shown in Figure ??? (vertical bars are +/- one standard error of the mean).

-```{r error-scores-locf, echo=FALSE, fig.height=5, fig.width=9}
+```{r error-scores-locf, echo=FALSE, fig.height=4, fig.width=9}
 plot_summary_measures_cond <- function(df_predictions, models, label) {
  
  df_predictions %>%
@@ -950,10 +897,10 @@ positive and negative biases conditional on low/high GOSe values canceling out
 in the overall population.
 The MSM and MM models are fairly similar with respect to accuracy but MSM
 clearly dominates with respect to bias.
-Note that irrespective of the exact definition of bias used, MSM ominates the other
+Note that irrespective of the exact definition of bias used, MSM dominates the other
 model-based approaches. 
 Comparing LOCF and MSM, there is a slight advantage of MSM in terms of accuracy for
-the majority classes 3, 7, 8 which explain the overall difference shwon in Figure ???.
+the majority classes 3, 7, and 8, which explains the overall difference shown in Figure ???.
 With respect to bias, MSM also performs better than LOCF for the most frequently
 observed categories, but the extent of this improvement depends on the performance measure.

@@ -968,79 +915,30 @@ where only GOSe values after 180 days post-injury are available.
 The relative characteristics of the three considered approaches are comparable
 to the LOCF subset.

-**TODO**

-* decide whether figures go in appendix - David and I agree on them being actually the 
-primary analysis. we just needto convince people of the fact that LOCF should be dropped *first*. As always, I am open to debate this but we should just make a decision, figurexit or figuremain?

-```{r confusion-matrix, warning=FALSE, message=FALSE, echo=FALSE, fig.cap="Confusion matrices, full training set without LOCF.", fig.height=9, fig.width=6}
+```{r confusion-matrix, warning=FALSE, message=FALSE, echo=FALSE, fig.cap="Confusion matrices, full test set without LOCF.", fig.height=6, fig.width=6}
 plot_confusion_matrices(
  df_predictions, 
-  c("MSM", "GP + cov", "MM")
+  c("MSM", "GP + cov", "MM"),
+  nrow = 1
 )

-ggsave(filename = "confusion_matrices_all.pdf", width = 6, height = 9)
-ggsave(filename = "confusion_matrices_all.png", width = 6, height = 9)
+ggsave(filename = "confusion_matrices_all.pdf", width = 6, height = 6)
+ggsave(filename = "confusion_matrices_all.png", width = 6, height = 6)
 ```

-```{r crossing-table-full, echo=FALSE, warning=FALSE, results='asis'}
-models <- c("MSM", "GP + cov", "MM")
-df_average_confusion_matrices <- df_predictions %>% 
-    filter(model %in% models) %>% 
-    group_by(fold, model) %>% 
-    do(
-      confusion_matrix = caret::confusionMatrix(
-          data = factor(.$prediction, levels = 1:8), 
-          reference = factor(.$GOSE, levels = 1:8)
-        ) %>% 
-        as.matrix %>% as_tibble %>% 
-        mutate(`Predicted GOSE` = row_number() %>% as.character) %>% 
-        gather(`Observed GOSE`, n, 1:8)
-    ) %>% 
-    unnest %>% 
-    group_by(model, `Predicted GOSE`, `Observed GOSE`) %>% 
-    summarize(n = mean(n)) %>% 
-    ungroup %>% 
-    mutate(model = factor(model, models))
-rbind(
-df_average_confusion_matrices %>% 
-  group_by(model) %>% 
-  filter(`Observed GOSE` <= 3) %>% 
-  mutate(n_total = sum(n)) %>% 
-  filter(`Predicted GOSE` > 3) %>% 
-  summarize(fraction = sum(n / n_total)) %>% 
-  mutate(`Event` = "<=3 -> >3"),
-
-df_average_confusion_matrices %>% 
-  group_by(model) %>% 
-  filter(`Observed GOSE` == 4) %>% 
-  mutate(n_total = sum(n)) %>% 
-  filter(`Predicted GOSE` > 4) %>% 
-  summarize(fraction = sum(n / n_total)) %>% 
-  mutate(`Event` = "4 -> >4"),
-
-df_average_confusion_matrices %>% 
-  group_by(model) %>% 
-  filter(`Observed GOSE` < 8) %>% 
-  mutate(n_total = sum(n)) %>% 
-  filter(`Predicted GOSE` == 8) %>% 
-  summarize(fraction = sum(n / n_total)) %>% 
-  mutate(`Event` = "<8 -> 8")
-) %>% 
-  transmute(Model = model, Percent = 100*fraction, Event) %>% 
-  spread(Event, Percent) %>% 
-  pander::pandoc.table("Some specific confusion percentages, full data set.", digits = 3)
-```

-```{r error-scores-all, echo=FALSE, fig.height=5, fig.width=99}
+
+```{r error-scores-all, echo=FALSE, fig.height=4, fig.width=9}
 plot_summary_measures_cond(
  df_predictions %>% filter(!(gupi %in% idx)), 
  c("MSM", "GP + cov", "MM"), 
  "Summary measures by observed GOSe, full test set"
 )

-ggsave(filename = "imputation_error.pdf", width = 9, height = 5)
-ggsave(filename = "imputation_error.png", width = 9, height = 5)
+ggsave(filename = "imputation_error.pdf", width = 9, height = 4)
+ggsave(filename = "imputation_error.png", width = 9, height = 4)
 ```


@@ -1056,7 +954,7 @@ data in the first place [comment: I strongly feel we should lead with this
 sentence or something in the same spirit to make it absolutely clear that 
 statistics cannot be used to impute data out of nowhere. 
 Raising awareness of the complexity of missing data problems should rather be seen
-as an incetive to invest more effort upfront in preventing missingness in the first place ;)]
+as an incentive to invest more effort upfront in preventing missingness in the first place ;)]
 Nevertheless, in practice, missing values due to loss-to-follow-up will always
 occur and should be addressed effectively.
 There is a wide consensus that statistically sound imputation of missing values 
@@ -1066,7 +964,7 @@ imputation on a per-analysis basis including analysis-specific covariates to
 further reduce bias and to preserve the imputation uncertainty in the 
 downstream analysis.
 In practice, however, there are good reasons for providing a set of single-imputed
-default values in large bservational studies such as CENTER-TBI.
+default values in large observational studies such as CENTER-TBI.
 Consortia are increasingly committed to making their databases 
 available to a wider range of researchers. 
 In fact, more liberal data-sharing policies are becoming a core requirement 
@@ -1078,13 +976,13 @@ Furthermore, the imputed values of a multiple-imputation procedure are
 inherently random and it is thus difficult to ensure consistency across 
 different analysis teams if the values themselves cannot be stored 
 directly in a database.
-For this reason, as a pratical way forward, we suggest providing a default 
+For this reason, as a practical way forward, we suggest providing a default 
 single-imputation with appropriate measures of uncertainty for key outcomes 
 in the published database itself.
 This mitigates problems with complete-case analyses and provides a
 principled and consistent default approach to handling missing values.
 Since we strongly suggest employing a model-based approach to imputation,
-the fitted class probabilities can be provided in the core databse along the
+the fitted class probabilities can be provided in the core database alongside the
 imputed values.
 Based on these probabilities, it is easy to draw samples for a multiple imputation
 analysis.
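Review note: the step from stored class probabilities to multiple imputations can be sketched in a few lines of base R. This is an illustration under assumptions, not the manuscript's procedure: the probability matrix `probs` is invented, and using the most probable category as the single-imputed default is only one possible choice.

```r
set.seed(42)

gose_levels <- 1:8

# Hypothetical fitted class probabilities for two patients (rows sum to 1);
# GOSe 1 has probability 0 because imputation is conditional on survival
probs <- rbind(
  patient_a = c(0, 0.05, 0.30, 0.05, 0.25, 0.20, 0.10, 0.05),
  patient_b = c(0, 0.00, 0.02, 0.03, 0.10, 0.15, 0.30, 0.40)
)
stopifnot(all(abs(rowSums(probs) - 1) < 1e-12))

# One possible single-imputed default: the most probable category
single_imputation <- gose_levels[apply(probs, 1, which.max)]

# m multiply imputed values per patient, drawn from the stored probabilities
m <- 5
multiple_imputations <- t(apply(
  probs, 1,
  function(p) sample(gose_levels, size = m, replace = TRUE, prob = p)
))

print(single_imputation)
print(multiple_imputations)
```

Storing `probs` alongside the default values lets each analysis team fix its own seed and number of draws while starting from the same fitted model.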
@@ -1155,17 +1053,17 @@ df_ground_truth %>%

 ## Reproducible Research Strategy

-CENTER-TBI is commited to reproducible research. 
+CENTER-TBI is committed to reproducible research. 
 To this end, the entire source code to run the analyses is publicly available
 at https://git.center-tbi.eu/kunzmann/gose-6mo-imputation.
 Scripts for automatically downloading the required data from the central
 access-restricted 'Neurobot' (https://neurobot.incf.org/) database at 
 https://center-tbi.incf.org/ are provided.
-The analysis is completely automated using the workflow managment tool 'snakemake'
+The analysis is completely automated using the workflow management tool 'snakemake'
 [@koster2012snakemake] and a singularity [@kurtzer2017singularity] container image
 containing all required dependencies is publicly available from zenodo.org
 at https://zenodo.org/record/2600385#.XJzZwEOnw5k (DOI: 10.5281/zenodo.2600385).
-Detailed step-bz-step instructions on how to reproduce the analysis are provided 
+Detailed step-by-step instructions on how to reproduce the analysis are provided 
 in the README.md file of the GitLab repository. 



--- a/manuscript/references.bib
+++ b/manuscript/references.bib
-@article{horton2018randomized,
+@article{horton2018,
  title={Randomized controlled trials in adult traumatic brain injury: A systematic review on the use and reporting of clinical outcome assessments},
  author={Horton, Lindsay and Rhodes, Jonathan and Wilson, Lindsay},
  journal={Journal of neurotrauma},
@@ -9,14 +9,14 @@
  publisher={Mary Ann Liebert, Inc. 140 Huguenot Street, 3rd Floor New Rochelle, NY 10801 USA}
 }

-@article{richter2019handling,
+@article{richter2019,
  title={Handling missing outcome data in traumatic brain injury research-a systematic review},
  author={Richter, Sophie and Stevenson, Susan and Newman, Tom and Wilson, Lindsay and Menon, David and Maas, Andrew and Nieboer, Daan and Lingsma, Hester and Steyerberg, Ewout and Newcombe, Virginia},
  year={2019},
  publisher={Mary Ann Liebert Inc.}
 }

-@article{white2010bias,
+@article{white2010,
  title={Bias and efficiency of multiple imputation compared with complete-case analysis for missing covariate values},
  author={White, Ian R and Carlin, John B},
  journal={Statistics in medicine},
@@ -102,6 +102,7 @@
  pages={36--47},
  year={2014},
  publisher={American Medical Association}
+}

 @article{center2015collaborative,
  title={Collaborative European neurotrauma effectiveness research in traumatic brain injury (CENTER-TBI): A prospective longitudinal observational study},
@@ -175,7 +176,6 @@
  pages={2520--2522},
  year={2012},
  publisher={Oxford University Press}
->>>>>>> 943d2c27a63fd82c265933c46a0d6ab674191f03
 }


@@ -242,8 +242,33 @@ year = {2017}
 author = {Agresti, Alan},
 editor = {{John Wiley {\&} Sons}},
 title = {{Categorical data analysis}},
-year = {2003}
+year =
+{2003}
+}
+
+@article{steyerberg2008,
+  title={Predicting outcome after traumatic brain injury: development and international validation of prognostic scores based on admission characteristics},
+  author={Steyerberg, Ewout W and Mushkudiani, Nino and Perel, Pablo and Butcher, Isabella and Lu, Juan and McHugh, Gillian S and Murray, Gordon D and Marmarou, Anthony and Roberts, Ian and Habbema, J Dik F and others},
+  journal={PLoS medicine},
+  volume={5},
+  number={8},
+  pages={e165},
+  year={2008},
+  publisher={Public Library of Science}
+}
+
+@article{meira2009,
+  title={Multi-state models for the analysis of time-to-event data},
+  author={Meira-Machado, Lu{\'\i}s and de U{\~n}a-{\'A}lvarez, Jacobo and Cadarso-Suarez, Carmen and Andersen, Per K},
+  journal={Statistical methods in medical research},
+  volume={18},
+  number={2},
+  pages={195--222},
+  year={2009},
+  publisher={SAGE Publications Sage UK: London, England}
 }
+
+
 @article{brms2018,
 author = {B{\"{u}}rkner, Paul},
 journal = {The R Journal},
@@ -253,6 +278,7 @@ title = {{Advanced Bayesian Multilevel Modeling with the R Package brms}},
 volume = {10},
 year = {2018}
 }
+
 @article{brms2017,
 author = {B{\"{u}}rkner, Paul},
 journal = {Journal Of Statistical Software},
@@ -262,6 +288,7 @@ title = {{brms: An R Package for Bayesian Multilevel Models Using Stan}},
 volume = {80},
 year = {2017}
 }
+
 @article{Steyerberg2008,
 author = {Steyerberg, Ewout W and Mushkudiani, Nino and Perel, Pablo and Butcher, Isabella and Lu, Juan and McHugh, Gillian S and Murray, Gordon D and Marmarou, Anthony and Roberts, Ian and Habbema, J. Dik F and Maas, Andrew I. R},
 doi = {10.1371/journal.pmed.0050165},

--- a/scripts/download_v1.1.sh
+++ b/scripts/download_v1.1.sh
@@ -39,3 +39,10 @@ curl \
  --digest https://center-tbi.incf.org/api/data/_5c548a5b6b3f2f22e14d20a2.csv > \
  $OUT/df_baseline.csv
 Rscript -e "library(tidyverse); saveRDS(as_tibble(read_csv('$OUT/df_baseline.csv')), file = '$OUT/df_baseline.rds')"
+
+# baseline descriptive
+curl \
+  --user $NEUROBOT_USR:$NEUROBOT_API \
+  --digest https://center-tbi.incf.org/api/data/_5cc703eb3a4c5139c387f8b5.csv > \
+  $OUT/df_baseline_descriptive.csv
+Rscript -e "library(tidyverse); saveRDS(as_tibble(read_csv('$OUT/df_baseline_descriptive.csv')), file = '$OUT/df_baseline_descriptive.rds')"