working on manuscript

21167658 · Kevin Kunzmann · 9b92dbeb · 21167658 · 21167658 · 21167658
Commit 21167658 authored Mar 21, 2019 by Kevin Kunzmann
Hide whitespace changes
Inline Side-by-side

Showing with 221 additions and 112 deletions

.gitignore .gitignore +2 -3

manuscript/manuscript.Rmd manuscript/manuscript.Rmd +154 -109

manuscript/references.bib manuscript/references.bib +65 -0

No files found.
--- a/.gitignore
+++ b/.gitignore
 data
-output
+output*
 .snakemake
 .Rproj.user
 .cache
-container.sif
+*.sif
 *.zip
 *.pdf
 *.Rproj
@@ -11,4 +11,3 @@ container.sif
 *.Rhistory
 *.out
 *.docx
-fetch
--- a/manuscript/manuscript.Rmd
+++ b/manuscript/manuscript.Rmd
 ---
-title: "Model-based longitudinal imputation of cross-sectional GOSe in CENTER-TBI"
+title: "Model-based longitudinal imputation of cross-sectional outcomes in traumatic brain injury"
 output: 
  word_document: default
  pdf_document: default
@@ -24,26 +24,25 @@ config <- yaml::read_yaml(params$config_file)
 Assessments of global functional outcome such as the Glasgow Outcome Scale (GOS)
 and the Glasgow Outcome Scale extended (GOSe) capture 
 meaningful differences across the full spectrum of recovery,
-and have popularity as endpoints in TBI studies (Horton et al 2018).
+and have popularity as endpoints in traumatic brain injury (TBI0 studies 
+(Horton et al 2018).
 However, missing outcome data is a common problem in TBI research, 
 and for longitudinal studies completion rates at six months can be 
 lower than 70% (Richter et al).
+This is important since it is well known that complete-case analyses may 
+introduce bias or reduce power [@white2010bias]. 
 Imputation of patient outcomes is gradually gaining acceptance in the TBI
-field as a method of dealing with missing data in studies.
+field as a method of dealing with missing data.
-This is important since it is well known that complete-case analyses may
-introduce bias or reduce power [REFERENCE].
 Recent longitudinal studies have successfully employed techniques for both
 single (Clifton et al., 2011; Silver et al 2006; Skolnick et al 2014) and
 multiple imputation (Bulger et al, 2010, Kirkness et al 2006;
 Wright et al, 2014; Robertson et al, 2014).
-The advantage of multiple imputation over single imputation clearly lies in the
+Last observation carried forward (LOCF) is a widely applied single-imputation
-fact that the uncertainty about imputed values can be properly accounted for 
+method for dealing with missing data in TBI research.
-in a subsequent statistical analysis.
+Typically, 3-month outcome is substituted for missing 6-month data [REFERENCE].
-Despite these considerations, last observation carried forward (LOCF) is still
+Although LOCF is easz to understand and implement, the technique is clearly 
-a widely applied method for dealing with missing data in TBI research,
+lacking in several respects.
-e.g. simply substituting the respective 3-months outcome for a missing 6-months data point [REFERENCE]. 
-Although LOCF is simple to understand and implement, the technique clearly 
-lacks in several respects.
 Firstly, it is biased in that it neglects potential trends in the GOS(e) trajectories.
 Secondly, a naive application of LOCF is also inefficient since it neglects
 data observed briefly after the target time window.
@@ -56,44 +55,74 @@ This also implies that it is impossible to include additional covariates
 to further reduce bias introduced by the imputation method and that LOCF
 cannot be used to obtain multiply imputed data sets by design.
-In this manuscript, we three model-based imputation strategies for longitudinal
+In this manuscript, three model-based imputation strategies for GOSe at 
-GOSe data at 6 months (180 days) post-injury in TBI studies by example of the 
+6 months (= 180 days) post-injury in the longitudinal CENTER-TBI study 
-CENTER-TBI study and compare the results with standard LOCF.
+[@center2015collaborative] are compared with LOCF.
 While we acknowledge the principle superiority of multiple imputation over
 single imputation to propagating imputation uncertainty,
-we focus on single-imputation performance for primarily practical reasons:
+we focus on single-imputation performance for primarily practical reasons.
-CENTER-TBI is committed to providing a curated data base to facilitate subsequent
+CENTER-TBI is committed to providing a curated database to facilitate subsequent
 analyses.
-Since the primary endpoint is functional outcome at 6 months, an default imputed
+Since the primary endpoint in CENTER-TBI is functional outcome at 6 months, 
-value for as many study participants as possible is desirable. 
+a single default imputed value for as many study participants as possible is 
+desirable. 
 To this end, a single imputed value is to be included in the published database
 together with a measure of uncertainty about the imputed value.
-Subsequent analyses may then either resort to using the provided imputed values
+Subsequent analyses could then either use the imputed values or fit a custom 
-or - based on the findings presented in this manuscript - fit a custom
+imputation model to adjust for different covariates.
-imputation model to adjust for different covariates or obtain multiply
+Multiply imputed values can be sampled from the provided class probabilities or
-imputed values.
+the custom fit model respectively.
+We compare three different model-based approaches - 
-We propose three different model-based approaches - a mixed-effects model, a Gaussian process regression, and a multi-state model - for imputing GOSe 
+a mixed-effects model, a Gaussian process regression, and a multi-state model - 
-longitudinally each of which we fit in a version including baseline covariates
+for imputing cross-sectional GOSe at 6 months exploiting the longitudinal 
-and without.
+GOSe measurements.
+Each model is fit in a version with and without baseline covariates.
-# Study population
+# Methods
-The study population for our empirical methods comparison are all individuals
+## Study population
-from the CENTER-TBI database (n = 4509) whose GOSe status was queried at least
-once and who were still alive 180 days post-injury.
-The rationale for conducting the comparison conditional on 6-months survival
-is simply that the GOSe can only be missing at 6-months if the individuals
-are still alive since GOSe would be 1 (dead) otherwise.
 ```{r read-population-data, include=FALSE}
 df_baseline <- read_rds(paste0(params$data_dir, "/df_baseline.rds"))
+n_pat <- nrow(df_baseline)
 ```
-Overall, `r nrow(df_baseline)` individuals are eligible for a model-based 
+The CENTER-TBI project methods and design are described in detail 
-imputation.
+elsewhere [@center2015collaborative]. 
+Participants with TBI were recruited into three strata: 
+(a) patients attending the emergency room, 
+(b) patients admitted to hospital but not intensive care, 
+and (c) patients admitted to intensive care. 
+Follow-up of participants was scheduled per protocol at 2 weeks, 3 months, 
+and 6 months in group (a) and at 3 months, 6 months, and 12 months in groups 
+(b) and (c).   
+Outcome assessments at all timepoints included the GOSe 
+[@jennett1981disability, @mcmillan2016glasgow]. 
+The GOSe is an eight-point scale with the following categories: 
+(1) dead, (2) vegetative state, (3) lower severe disability, 
+(4) upper severe disability, (5) lower moderate disability, 
+(6) upper moderate disability, (7) lower good recovery, 
+(8) upper good recovery.
+The study population for this empirical methods comparison are all individuals
+from the CENTER-TBI database (total of n = 4509) whose GOSe status was queried 
+at least once within the first 18 months and who were still alive 180 days 
+post-injury `r n_pat`.
+The rationale for conducting the comparison conditional on 6-months survival
+is simply that the GOSe can only be missing at 6-months if the individuals
+are still alive since GOSe would be (1) otherwise.
+**TODO:**
+* ->LW: I explicitly asked for additional covariates that are needed from the DB,
+happy to include them if you pass me the Neurobot codes!
+* yes, the entire document is automatically generated to ensure reproducibility and
+up-to-date data; once we agree on the manuscript we can pimp the table formatting and
+add cross-references to the respective figures (that is only supported for pdf output which you could not edit directly)
 ```{r baseline-table-continuous, echo=FALSE, results='asis'}
 summarizer <- function(x) {
@@ -154,21 +183,37 @@ df_baseline %>%
 ```{r read-gose-data, include=FALSE}
 df_gose <- read_rds(paste0(params$data_dir, "/df_gose.rds"))
+n_gose_180 <- df_gose %>% 
+  filter(between(Outcomes.DerivedCompositeGOSEDaysPostInjury, 180 - 14, 180 + 14)) %>% 
+  group_by(gupi) %>% 
+  n_groups()
+n_gose_pp <- df_gose %>% 
+  filter(between(Outcomes.DerivedCompositeGOSEDaysPostInjury, 5*30, 8*30)) %>% 
+  group_by(gupi) %>% 
+  n_groups()
 ```
-A total of `r nrow(df_gose)` GOSe observations are available for the study 
+Only GOSe observations between injury and 18 months post injury are used
-population.
+since extremely late follow-ups are not providing enough information.
-The marginal distribution of their sampling times as well as absolute and
+This leads to a total of `r nrow(df_gose)` GOSe observations of the study
-relative frequencies of the respective GOSe categories are depicted in 
+population being availabe for the analyses in this manuscript.
+Only for `r n_gose_180` (`r round(n_gose_180 / n_pat * 100, 1)`%) individuals,
+GOSe observations at 180 +/- 14 days post injury are available,
+for `r n_gose_pp` (`r round(n_gose_pp / n_pat * 100, 1)`%) individuals
+GOSe observations within the per-protocol window of 5-8 months post injury
+exist.
+The distribution of GOSe sampling times and both absolute and
+relative frequencies of the respective GOSe categories are shown in 
 Figure (???).
-Note that the small fraction of GOSe 1 (dead) observations at 6 months is due
+True observation times were mapped to categories by rounding to the 
-to the fact that the true observation times where mapped to the respective
+closest time point, i.e., 
-categories by rounding to the respective closest.
+the ‘6 months’ category contains observations up to 9 months post-injury.  
-I.e., the '6 months' category actually contains observations of up to 9 months 
+Thus the figures include a small proportion of GOSE 1 representing patients 
-post-injury and the figures are thus not contradicting the study population
+who died between 6 and 9 months.
-inclusion criteria.
+```{r gose-plots, echo=FALSE, fig.cap="GOSe sampling time distribution and distribution at per-protocol time points.", fig.height=5, fig.width=7}
-```{r gose-plots, echo=FALSE, fig.cap="GOSe time point distribution and marginal distribution at per-protocol time points.", fig.height=5, fig.width=7}
 p_gose_times <- df_gose %>% 
  ggplot(aes(Outcomes.DerivedCompositeGOSEDaysPostInjury / 30)) +
    geom_histogram(binwidth = 1) +
@@ -253,41 +298,35 @@ gridExtra::grid.arrange(
-# Methods
+# Imputation methods
-We compare the current default method of last observation carried forward 
+We compare last observation carried forward (LOCF) to a mixed effect model (MM), 
-(LOCF) to 
+a Gaussian process regression (GP), and a multi-state model (MSM). 
-a mixed effect model (MM), a Gaussian process regression (GP)
+For all model-based approaches we additionally explore variants including the key 
-as well as a multi-state model (MSM).
+IMPACT [@steyerberg2008predicting] predictors as covariates. 
-For all model-based approaches we additionally explore variants including
-the key IMPACT [REFERENCE] predictors as covariates.
-For the MSM approach the inclusion of all predictors proved to be numerically 
-unstable due to the resulting model complexity and we only considered 
-age at injury.
 ## Last-observation-carried-forward
-Since LOCF is still widely used to impute missing outcomes in TBI studies, 
+Since LOCF is widely used to impute missing outcomes in TBI studies, 
-it serves as baseline methods for the comparison at hand.
+it serves as the baseline method.
 Here, LOCF is defined as the last GOSe observation before the
-imputation time point of exactly 180 days post-injury.
+imputation time point of 180 days post-injury.
-Note that LOCF is not model-based and may not permit the imputation
+LOCF is not model-based and, by definition, only permits the imputation of a 
-of a GOSe value for all subjects in the study population.
+GOSe value for subjects where at least one value is available within the 
-A value cannot be imputed, if only valid GOSe observations after 180 days 
+first 180 days post injury.
-post-injury are available for a subject but none before.
 We account for this lack of complete coverage under LOCF by performing all
-performance comparisons including LOCF on the subset of individuals
+performance comparisons including LOCF only on the subset of individuals
 for which a LOCF-imputed value can be obtained (cf. Section ???). 
 ## Mixed-effects model
-Mixed effects models are a standard approach to longitudinal data analysis  
+Mixed effects models are a a widely used approach in longitudinal 
-which model individual deviations from a population mean trajectory 
+data analysis andd model individual deviations from a population mean trajectory 
-(REFERENCE).
+[@verbeke2009linear].
 To account for the fact that the GOSe outcome is an ordered factor,
-we employ a cumulative link function model with flexible intercepts @Agresti2003.
+we employ a cumulative link function model with flexible intercepts [@Agresti2003].
 The population mean is modeled as spline function with knots at ???
 to allow a non-linear population mean trajectory.
 Patient-individual deviations from this population mean are modeled 
@@ -295,38 +334,39 @@ as quadratic polynomials to allow sufficient flexibility (random effects).
 Baseline covariates are added as linear fixed effects to to the 
 population mean.
 The model was fitted using Bayesian statistics via the BRMS 
-package @brms2017, @brms2018 for the R environment for statistical 
+package [@brms2017, @brms2018] for the R environment for statistical 
-computing @R2016 and the Stan modelling language @stan2017. 
+computing [@R2016] and the Stan modelling language [@stan2017]. 
 Non-informative priors were used for the model parameters.
-## Gaussian process model
 A potential drawback of the proposed longitudinal mixed effects model 
 is the fact that the individual deviations from the population mean 
 are modeled globally using polynomials.
 Since linear and quadratic terms are not identifiable for patients with 
-only one observed GOSE value, this implies large uncertainty over the patient-specific effects for individuals with one or two observations 
+only one observed GOSE value, this implies large uncertainty over the 
+patient-specific effects for individuals with one or two observations 
 only by falling back on the non-informative priors on these model 
 parameters. 
 Thus, the overall uncertainty associated with model-based imputations 
 at exactly 180 days may become relatively large.
-Gaussian process regression allows to flexibly model 
+## Gaussian process model
+Gaussian process regression allows flexible modelling of 
 both the individual GOSE trajectories as well as the population mean
-in a Bayesian non-parametric way @rasmussen2006.
+in a Bayesian non-parametric way [@rasmussen2006].
-This non-parametric paradigm leads to a very flexible model and
+This non-parametric paradigm leads to low model-uncertainty in the 
-low model-uncertainty in the vicinity of actually observed GOSe outcomes.
+vicinity of actually observed GOSe outcomes.
 To account for the discreteness of the GOSe outcome, 
 the continuous output of the Gaussian process model is rounded to the
 nearest integer in  1 to 8 (GOSe categories).
-The commonly used squared exponential covariance function with shared 
+The squared exponential covariance function with shared 
 length scale for all individuals is used to model intra-individual
 dependency of GOSe outcomes.
 The population mean trajectory of the Gaussian process is modeled as
 mean of an independent Gaussian process with pseudo observations at 
-`r config$gp$t_knots %>% as.numeric()` days post-injury (also using a squared 
+`r config$gp$t_knots %>% as.numeric()` days post-injury 
-exponential covariance function).
+(also using a squared exponential covariance function).
 This approach maintains flexibility of the population mean function 
 similar to the spline-based approach for the mixed effects model
 while avoiding the the computational complexity of a fully
@@ -340,7 +380,8 @@ Due to the sparseness of the data, the estimated length scale will naturally
 tend towards extremely large values implying unrealistically 
 long-range dependency between observations.
 We therefore limit the length scale to a maximum of 120 days (4 months) and 
-impose a Gaussian prior $\mathcal{N}(60, 14)$.
+impose a Gaussian prior with a mean of 60 days post injury and and a 
+standard deviation of 14.
 ## Multi-state model
@@ -397,12 +438,11 @@ piece wise constant transition intensities were fitted to the intervals
 [0, 90), [90, 270), and 270+ days post-injury.
 The model was fit using the msm @msm2011 package for the R environment
 for statistical computing @R2016.
-Due to the large number of 19 transition intensities in the proposed model
+Due to the relatively large number of 19 transition intensities in the proposed model
 (cf. arrows in Figure ???, structure of transition graph),
 inclusion of all baseline covariates turned out to be numerically unstable.
 For the MSM model, instead of including all covariates, 
-only a model adjusting for age at injury via a proportional hazard approach  
+only a model adjusting for age at injury via a proportional hazard approach was fit.
-was fit.
 ## Performance assessment
@@ -425,15 +465,6 @@ for (i in 1:config$folds) {
 }
 ```
-Model performance was assessed via three fold cross validation on the subset 
-of individuals with a valid GOSe value within 180 +/- 14 days post-injury 
-(n = `r nrow(df_ground_truth)`.
-All models were fit on the entire available data after removing the 
-180 +/- 14 days post-injury observation from the respective test fold
-thus mimicking a missing completely at random missing data mechanism.
-The distribution of GOSe values in the respective three test sets is well 
-balanced, cf. Figure ??? (Appendix).
 ```{r misc, include=FALSE}
 modelnames <- lapply(
    list.dirs(
@@ -502,20 +533,35 @@ df_predictions <- df_model_posteriors %>%
  )
 ```
+Model performance was assessed via three fold cross validation on the subset 
+of individuals with a valid GOSe value within 180 +/- 14 days post-injury 
+(n = `r nrow(df_ground_truth)`.
+All models were fit on the entire available data after removing the 
+180 +/- 14 days post-injury observation from the respective test fold
+thus mimicking a missing completely at random missing data mechanism.
+The distribution of GOSe values in the respective three test sets is well 
+balanced, cf. Figure ??? (Appendix).
 Performance is assessed using the absolute-count and the normalized 
 (proportions) confusion matrices as well as
-bias, mean absolute error (MAE) and root mean squared error (RMSE).
+bias, mean absolute error (MAE), and root mean squared error (RMSE).
-Bias, MAE, and RMSE are considered both conditional on the ground-truth
+Bias indicates whether, on average, predicted values are systematically 
-(unobserved true GOSe) as well as averaged over the entire test set.
+lower (negative) or higher (positive) than observed values.
-Note that comparisons in terms of bias, MAE, and RMSE tacitly assume that 
+MAE and RMSE are both a measures of average precision where
+RMSE puts more weight on large deviations as compared to RMSE.
+Comparisons in terms of bias, MAE, and RMSE tacitly assume that 
 GOSe values can be sensibly interpreted on an interval scale.
+We therefore also consider $Pr[est > true] - Pr[est < true]$ as an
+alternative measure of bias.
+Note that the scale is not directlz comparable to the one of the
+other three quantities!
+All measures are considered both conditional on the ground-truth
+(unobserved true GOSe) as well as averaged over the entire test set.
-LOCF, by design, may not be able to provide imputed values whenever 
-no observation before 180 days post-injury is observed.
+LOCF, by design, cannot provide imputed values when there are no
-Since there is reason to assume that these individuals 6-months GOSE will be 
+observations before 180 days post injury.
-harder to predict, 
+A valid comparison of LOCF with the other methods must therefore be 
-a comparison of LOCF with the other methods must be conditional on the 
+baseed on the set of individuals for whom an LOCF imputation is possible.
-set of individuals for which an LOCF imputation is possible.
 ```{r non-locf-ids, include=FALSE}
 idx <- df_predictions %>% 
  filter(model == "LOCF", !complete.cases(.)) %>% 
@@ -524,7 +570,6 @@ idx <- df_predictions %>%
 Overall, `r length(idx)` out of 
 `r df_predictions %>% filter(model == "LOCF") %>% nrow` test cases 
 (`r round(100 * length(idx) / (df_predictions %>% filter(model == "LOCF") %>% nrow), 1)`%) cannot be imputed with the LOCF approach.
 In the entire study population, `r df_gose %>% group_by(gupi) %>% summarize(LOCF = any(Outcomes.DerivedCompositeGOSEDaysPostInjury <= 180)) %>% ungroup %>% summarize(n_LOCF = sum(!LOCF)) %>% .[["n_LOCF"]]`
 individuals do not permit an LOCF imputation (`r round(100 * (df_gose %>% group_by(gupi) %>% summarize(LOCF = any(Outcomes.DerivedCompositeGOSEDaysPostInjury <= 180)) %>% ungroup %>% summarize(n_LOCF = sum(!LOCF)) %>% .[["n_LOCF"]]) / (df_gose$gupi %>% unique %>% length), 1)`%).

--- a/manuscript/references.bib
+++ b/manuscript/references.bib
+@article{center2015collaborative,
+  title={Collaborative European neurotrauma effectiveness research in traumatic brain injury (CENTER-TBI): A prospective longitudinal observational study},
+  author={CENTER-TBI Participants and Investigators and others},
+  journal={Neurosurgery},
+  volume={76},
+  number={1},
+  pages={67--80},
+  year={2015},
+  publisher={Lippincott Williams and Wilkins}
+}
+@article{mcmillan2016glasgow,
+  title={The Glasgow Outcome Scale—40 years of application and refinement},
+  author={McMillan, Tom and Wilson, Lindsay and Ponsford, Jennie and Levin, Harvey and Teasdale, Graham and Bond, Michael},
+  journal={Nature Reviews Neurology},
+  volume={12},
+  number={8},
+  pages={477},
+  year={2016},
+  publisher={Nature Publishing Group}
+}
+@article{jennett1981disability,
+  title={Disability after severe head injury: observations on the use of the Glasgow Outcome Scale.},
+  author={Jennett, B and Snoek, J and Bond, MR and Brooks, N},
+  journal={Journal of Neurology, Neurosurgery \& Psychiatry},
+  volume={44},
+  number={4},
+  pages={285--293},
+  year={1981},
+  publisher={BMJ Publishing Group Ltd}
+}
+@article{steyerberg2008predicting,
+  title={Predicting outcome after traumatic brain injury: development and international validation of prognostic scores based on admission characteristics},
+  author={Steyerberg, Ewout W and Mushkudiani, Nino and Perel, Pablo and Butcher, Isabella and Lu, Juan and McHugh, Gillian S and Murray, Gordon D and Marmarou, Anthony and Roberts, Ian and Habbema, J Dik F and others},
+  journal={PLoS medicine},
+  volume={5},
+  number={8},
+  pages={e165},
+  year={2008},
+  publisher={Public Library of Science}
+}
+@book{verbeke2009linear,
+  title={Linear mixed models for longitudinal data},
+  author={Verbeke, Geert and Molenberghs, Geert},
+  year={2009},
+  publisher={Springer Science \& Business Media}
+}
 @article{msm2011,
 author = {Jackson, Christopher H.},
 doi = {10.18637/jss.v038.i08},
@@ -9,6 +61,19 @@ title = {{Multi-State Models for Panel Data: The msm Package for R}},
 volume = {38},
 year = {2011}
 }
+@article{white2010bias,
+  title={Bias and efficiency of multiple imputation compared with complete-case analysis for missing covariate values},
+  author={White, Ian R and Carlin, John B},
+  journal={Statistics in medicine},
+  volume={29},
+  number={28},
+  pages={2920--2931},
+  year={2010},  
+  publisher={Wiley Online Library}
+}
 @article{R2016,
 archivePrefix = {arXiv},
 arxivId = {arXiv:1011.1669v3},