reduced manuscript to figures

2c0d88df · Kevin Kunzmann · c81702ce · 2c0d88df · c81702ce · c81702ce
Commit 2c0d88df authored Oct 08, 2019 by Kevin Kunzmann
5 changed files
--- a/Snakefile
+++ b/Snakefile
@@ -10,17 +10,19 @@ include: "rules/final_imputation_report.rule"
-rule manuscript:
+rule manuscript_figures:
    input:
        pop_report = rules.prepare_data_v1_1.output,
        posteriors = rules.fit_models_validation_v1_1.input,
        markdown   = "manuscript/manuscript.Rmd"
    output:
-        "output/v1.1/manuscript.docx"
+        "output/v1.1/manuscript-figures.zip"
    shell:
        """
        mkdir -p output/v1.1
-        Rscript -e "rmarkdown::render('{input.markdown}', output_dir = 'output/v1.1', params = list(data_dir = '../output/v1.1/data', config_file = '../config.yml'))"
+        Rscript scripts/manuscript_figures.R
+        zip -rm manuscript-figures.zip *.eps
+        mv manuscript-figures.zip output/v1.1/manuscript-figures.zip
        """

--- a/manuscript/manuscript.Rmd
+++ b/manuscript/manuscript.Rmd
---
-title: "Model-based longitudinal imputation of cross-sectional outcomes in traumatic brain injury"
-output: 
-  word_document: default
-  pdf_document:  default
-  html_document:
-    keep_md: yes
-bibliography:  "references.bib"
-params:
-  data_dir:    "../output/v1.1/data"
-  config_file: "../config.yml"
---
-```{r setup-chunk, include=FALSE}
-library(tidyverse)
-knitr::opts_chunk$set(dpi = 300, fig.width = 8, fig.height = 5)
-config <- yaml::read_yaml(params$config_file)
-```
-*Abstract:* TODO
-# Introduction
-Assessments of global functional outcome such as the Glasgow Outcome Scale (GOS)
-and the Glasgow Outcome Scale extended (GOSe) capture 
-meaningful differences across the full spectrum of recovery,
-and have popularity as endpoints in traumatic brain injury (@horton2018).
-However, missing outcome data is a common problem in TBI research, 
-and for longitudinal studies completion rates at six months can be 
-lower than 70% (@richter2019).
-This is important since it is well known that complete-case analyses may 
-introduce bias or reduce power (@white2010). 
-Imputation of patient outcomes is gradually gaining acceptance in the TBI
-field as a method of dealing with missing data.
-Recent longitudinal studies have successfully employed techniques for both
-single (@clifton2011; @silver2006; @skolnick2014) and
-multiple imputation (@bulger2010; @kirkness2006; @wright2014; @robertson2014).
-Last observation carried forward (LOCF) is a widely applied single-imputation
-method for dealing with missing data in TBI research.
-The 3-month outcome is recognized as one approach of subsituting for missing
-6-month data (@steyerberg2008), 
-and has been used in recent trials (@skolnick2014).
-Although LOCF is easy to understand and implement, the technique is 
-suboptimal in several respects.
-Firstly, it is biased in that it neglects potential time trends in 
-GOS(e) trajectories.
-Secondly, a naive application of LOCF is also inefficient since it neglects
-data observed briefly after the target time window.
-E.g., a GOS(e) value recorded at 200 days post-injury is likely to be more
-informative about the status at 180 days post-injury than a value observed 90 
-days post-injury.
-Finally, the *ad-hoc* nature of the LOCF method implies that there is no 
-probabilistic model, and thus no measure of uncertainty about the imputed values.
-This also implies that it is impossible to include additional covariates
-to further reduce bias introduced by the imputation method and that LOCF
-cannot be used to obtain multiply imputed data sets by design.
-The variation in timing of outcome assessments for patients with TBI 
-varies between studies.
-Some studies define very stringent time windows (e.g. +/- 2 weeks [TRACK-TBI ???????????]), 
-but this can lead to a substantial amount missing data (@richter2019).  Consequently other studies have defined more pragmatic protocol windows 
-(e.g. -1 month to +2 months, cf. @maas2014).
-While the wider windows enable more complete data collection, 
-they suffer from the problem that outcome can be evolving over this period,
-and an outcome assessment obtained at five months 
-(the beginning of this window) in one subject may not be strictly comparable
-with outcomes obtained just before eight months (the end of the window) in 
-another subject.
-Consequently, even where outcomes are available within pragmatic protocol 
-windows, 
-there may be a benefit from being able to impute an outcome more 
-precisely at the 180 day (6 month) time point.
-In this manuscript, three model-based imputation strategies for GOSe at 
-6 months (=180 days) post-injury in the longitudinal CENTER-TBI study 
-(@maas2014) are compared with LOCF.
-While we acknowledge the principle superiority of multiple imputation over
-single imputation to propagating imputation uncertainty,
-we focus on single-imputation performance primarily for practical reasons.
-Unlike randomized trials which involve a single principle analysis,
-CENTER-TBI is committed to providing a curated database to facilitate 
-multiple subsequent analyses.
-Since the primary endpoint in CENTER-TBI is functional outcome at 6 months, 
-a single default imputed value for as many study participants as possible is 
-desirable. 
-To this end, a single imputed value is to be included in the published database
-together with a measure of uncertainty about the imputed value.
-Subsequent analyses could then either use the imputed values or fit a custom 
-imputation model to adjust for different covariates.
-Multiply imputed values can be sampled from the provided class probabilities or
-the custom fit model respectively.
-We compare three different model-based approaches - 
-a mixed-effects model, a Gaussian process regression, and a multi-state model - 
-for imputing cross-sectional GOSe at 6 months exploiting the longitudinal 
-GOSe measurements.
-Each model is fit in a version with and without baseline covariates.
-# Methods
-## Study population
-```{r read-population-data, include=FALSE}
-df_baseline <- read_rds(paste0(params$data_dir, "/df_baseline.rds")) %>% 
-  mutate(
-    InjuryHx.PupilsBaselineDerived = factor(InjuryHx.PupilsBaselineDerived,
-                                            levels = 0:2),
-    InjuryHx.GCSScoreBaselineDerived = case_when(
-      InjuryHx.GCSScoreBaselineDerived <= 8  ~ "Severe",
-      InjuryHx.GCSScoreBaselineDerived <= 12 ~ "Moderate",
-      InjuryHx.GCSScoreBaselineDerived  > 12 ~ "Mild"
-    ) %>% as.factor(),
-    Labs.DLGlucosemmolL = exp(Labs.DLGlucosemmolL_log)
-  ) %>% 
-  select(-Labs.DLGlucosemmolL_log) %>% 
-  left_join(
-    read_rds(
-      "../data/v1.1/df_baseline_descriptive.rds"
-    ) %>%
-    transmute(
-      gupi,
-      Subject.PatientType = factor(
-        Subject.PatientType, 
-        level = 1:3, 
-        labels = c("Emergency Room", "Admission to Hospital", "Intensive Care Unit")
-      ),
-      InjuryHx.InjCause = factor(InjuryHx.InjCause,
-        level  = c(1:6, 88, 99),
-        labels = c(
-          "Road traffic incident",
-          "Incidental fall",
-          "Other non-intentional injury",
-          "Violence/assault",
-          "Act of mass violence",
-          "Suicide attempt",
-          "Unknown",
-          "Other"
-        ) %>% 
-        fct_recode(
-          Other              = "Other non-intentional injury",
-          Other              = "Suicide attempt",
-          `Violence/assault` = "Act of mass violence"
-        )
-      ),
-      InjuryHx.TotalISS
-    ),
-    by = "gupi"
-  )
-var_labels <- readr::read_csv("varlabels.csv")
-get_label <- function(varname) var_labels$label[var_labels$varname == varname] 
-get_label <- Vectorize(get_label)
-get_label(names(df_baseline))
-names(df_baseline) <- get_label(names(df_baseline))
-n_pat <- nrow(df_baseline)
-```
-The CENTER-TBI project methods and design are described in detail 
-elsewhere (@maas2014). 
-Participants with TBI were recruited into three strata: 
-(a) patients attending the emergency room, 
-(b) patients admitted to hospital but not intensive care, 
-and (c) patients admitted to intensive care. 
-Follow-up of participants was scheduled per protocol at 2 weeks, 3 months, 
-and 6 months in group (a) and at 3 months, 6 months, and 12 months in groups 
-(b) and (c).
-The protocol time window for the 6 months GOSe was between 5 and 8 months
-post injury.
-Outcome assessments at all timepoints included the GOSe 
-(@jennett1981disability, @mcmillan2016glasgow). 
-The GOSe is an eight-point scale with the following categories: 
-(1) dead, (2) vegetative state, (3) lower severe disability, 
-(4) upper severe disability, (5) lower moderate disability, 
-(6) upper moderate disability, (7) lower good recovery, 
-(8) upper good recovery.
-The study population for this empirical methods comparison are all 
-individuals from the CENTER-TBI database (total of n = 4509) whose GOSe 
-status was queried at least once within the first 18 months and who were 
-still alive 180 days post-injury (n = `r n_pat`).
-The rationale for conducting the comparison conditional on 6-months survival
-is simply that the GOSe can only be missing at 6-months if the individuals
-are still alive since GOSe would be (1) otherwise.
-Data for the CENTER-TBI study has been collected through the Quesgen e-CRF 
-(Quesgen Systems Inc, USA), 
-hosted on the INCF platform and extracted via the INCF Neurobot tool (https://neurobot.incf.org/). 
-Release 1.1 of the database was used (cf. Appendix for details).
-Basic summary statistics for population characteristics are listed in
-Table 1.
-```{r baseline-table-continuous, echo=FALSE, results='asis'}
-summarizer <- function(x) {
-  return(tibble(
-    statistic = c("# NA", "min", "mean", "median", "max", "std") %>% 
-      factor(., levels = ., ordered = TRUE),
-    value     = list(sum(is.na(x)), min(x, na.rm = TRUE), mean(x, na.rm = TRUE), median(x, na.rm = TRUE),
-                 max(x, na.rm = TRUE), sd(x, na.rm = TRUE))
-  ))
-}
-df_baseline %>%
-  select_if(is.numeric) %>% 
-  gather(variable, value) %>% 
-  group_by(variable) %>% 
-  do(
-    smry = summarizer(.$value)
-  ) %>%
-  unnest() %>%
-  spread(statistic, value) %>% 
-  unnest() %>% 
-  pander::pandoc.table("Discrete baseline variables", digits = 3, split.tables = 120)
-```
-```{r baseline-table-discrete, echo=FALSE, results='asis'}
-summarizer <- function(x) {
-  rbind(table(x), prop.table(table(x))) %>% 
-    t %>% 
-    {tmp <- .; colnames(tmp) <- c("n", "%"); tmp} %>%  
-    as_tibble(rownames = "level") %>% 
-    mutate(`%` = 100 * `%`)
-}
-df_baseline %>%
-  select_if(~!is.numeric(.)) %>% 
-  select(-gupi) %>% 
-  mutate_all(as.factor) %>% 
-  mutate_all(fct_explicit_na) %>% 
-  mutate_all(as.character) %>% 
-  gather(variable, value) %>% 
-  group_by(variable) %>% 
-  do(
-    smry = summarizer(.$value)
-  ) %>% 
-  unnest() %>% 
-  pander::pandoc.table("Continuous baseline variables", digits = 3, split.tables = 120)
-```
-```{r read-gose-data, include=FALSE}
-df_gose <- read_rds(paste0(params$data_dir, "/df_gose.rds"))
-n_gose_180 <- df_gose %>% 
-  filter(between(Outcomes.DerivedCompositeGOSEDaysPostInjury, 180 - 14, 180 + 14)) %>% 
-  group_by(gupi) %>% 
-  n_groups()
-n_gose_pp <- df_gose %>% 
-  filter(between(Outcomes.DerivedCompositeGOSEDaysPostInjury, 5*30, 8*30)) %>% 
-  group_by(gupi) %>% 
-  n_groups()
-```
-We decided to use only those GOSe observations obtained between injury 
-and 18 months post injury
-since extremely late follow-ups were considered to be irrelevant to the
-index follow-up time point of 6 months post injury.
-This lead to a total of `r nrow(df_gose)` GOSe observations of the study
-population being available for the analyses in this manuscript.
-For `r n_gose_180` (`r round(n_gose_180 / n_pat * 100, 1)`%) individuals,
-GOSe observations at 180 +/- 14 days post injury were available and
-`r n_gose_pp` (`r round(n_gose_pp / n_pat * 100, 1)`%) individuals had
-GOSe observations within the per-protocol window of 5-8 months post injury.
-The distribution of GOSe sampling times and both absolute and
-relative frequencies of the respective GOSe categories are shown in 
-Figure 1.
-True observation times were mapped to categories by rounding to the 
-closest time point, i.e., 
-the ‘6 months’ category contains observations up to 9 months post-injury.  
-Thus the figures include a small proportion of GOSE 1 representing patients 
-who died between 6 and 9 months.
-```{r gose-plots, echo=FALSE, fig.cap="GOSe sampling time distribution and distribution at per-protocol time points (actual date rounded to nearest).", fig.height=6, fig.width=6}
-p_gose_times <- df_gose %>% 
-  ggplot(aes(Outcomes.DerivedCompositeGOSEDaysPostInjury / 30)) +
-    geom_histogram(binwidth = 1) +
-    theme_bw() +
-    scale_x_continuous("Months post-injury",
-      breaks = seq(0, 50)
-    ) + 
-    scale_y_continuous("", breaks = seq(0, 2000, 200)) +
-    theme(
-      panel.grid.minor = element_blank(),
-      panel.grid.major.x = element_blank()       
-    )
-timepoint <- function(x) {
-  abs(
-    matrix(rep(x, 4), ncol = 4)
-    - matrix(rep(c(14, 90, 180, 360), length(x)), nrow = length(x), byrow = TRUE)
-  ) %>% 
-  apply(., 1, which.min) %>% 
-  factor(
-    levels = 1:4, 
-    labels = c("2 weeks", "3 months", "6 months", "12 months")
-  )
-}
-tmp <- df_gose %>% 
-  transmute(
-    gupi,
-    t    = timepoint(Outcomes.DerivedCompositeGOSEDaysPostInjury),
-    GOSe = Outcomes.DerivedCompositeGOSE
-  )
-p1 <- tmp %>% 
-  ggplot(aes(GOSe)) +
-    geom_bar(aes(fill = GOSe), width = .66, color = "black") +
-    scale_fill_grey(start = .1, end = .9) + 
-    facet_wrap(~t, nrow = 1) + 
-    ylab("absolute count") +
-    theme_bw() +
-    theme(
-      panel.grid.major.x = element_blank(),
-      legend.position = "none"
-    ) 
-p2 <- tmp %>% 
-  group_by(t, GOSe) %>% 
-  summarize(n = n()) %>% 
-  group_by(t) %>% 
-  mutate(freq = n / sum(n)) %>% 
-  ggplot(aes(x = GOSe, y = freq)) +
-    geom_bar(aes(fill = GOSe), width = .66, color = "black", stat = "identity") +
-    scale_fill_grey(start = .1, end = .9) + 
-    facet_wrap(~t, nrow = 1) + 
-    ylab("relative frequency") +
-    theme_bw() +
-    theme(
-      panel.grid.major.x  = element_blank(),
-      legend.position = "none"
-    )
-gridExtra::grid.arrange(
-  p_gose_times,
-  p1,
-  p2,
-  nrow    = 3,
-  heights = c(1.5, 1, 1)
-)
-```
-# Imputation methods
-We compared last observation carried forward (LOCF) to a mixed effect 
-model (MM), 
-a Gaussian process regression (GP), and a multi-state model (MSM). 
-For all model-based approaches we additionally explored variants 
-including the key IMPACT [@steyerberg2008] predictors as covariates. 
-These are age, GCS motot score, pupil reactivity (0, 1, 2), hypoxia, hypotension, Marshall CT classification, traumatic subarachnoid hemorrhage, epidural hematoma,
-glucose, and Hb.
-## Last-observation-carried-forward
-Since LOCF is widely used to impute missing outcomes in TBI studies, 
-it served as the baseline method.
-Here, LOCF was defined as the last GOSe observation before the
-imputation time point of 180 days post-injury.
-LOCF is not a model-based method and, by definition, 
-only permits the imputation of a  GOSe value for subjects where at least one 
-value is available within the first 180 days post injury.
-We accounted for this lack of complete coverage under LOCF by performing all
-performance comparisons including LOCF only on the subset of individuals
-for which a LOCF-imputed value can be obtained. 
-## Model-based methods 
-Model-based imputation approaches offer richer output (probabilistic imputation,
-multiple imputation) and may reduce the LOCF-inherent bias.
-We compared LOCF with three model-based approaches.
-We considered mixed effects models (MM) are a widely used approach in 
-longitudinal data analysis and model individual deviations from a population 
-mean trajectory (@verbeke2009linear).
-An alternative non-linear regression model for longitudinal data is 
-Gaussian process regression (GP) which allows flexible modelling of 
-both the individual GOSE trajectories as well as the population mean
-in a Bayesian non-parametric way [@rasmussen2006].
-Both the mixed effects model as well as the Gaussian process regression 
-model are non-linear regression techniques for longitudinal 
-data.
-While these are powerful tools to model longitudinal trajectories,
-they do not explicitly model the probability of transitions between 
-GOSe states.
-Since the number of observations per individual is limited in our
-data set (1 to 4 GOSe observations per individual),
-an approach explicitly modelling transition probabilities might be 
-more suitable to capture the dynamics of the GOSe trajectories.
-To explore this further, a Markov multi-state model (MSM)
-was considered (@meira2009).
-All models were fitted using eiter none or all IMPACT predictors except for the 
-MSM model which only used age due to issues with numerical stability.
-Further details on the respective implementations are given in the Appendix.
-## Performance assessment
-```{r load-test-data, include=FALSE}
-df_ground_truth <- c()
-for (i in 1:config$folds) {
-  df_ground_truth <- rbind(
-    df_ground_truth,
-    # note that GOSE values in test data are not subject to mi - loading the 
-    # first data set is sufficient!
-    sprintf("%s/validation/df_test_mi_1_fold_%i.rds", params$data_dir, i) %>%
-      readRDS() %>% 
-      mutate(fold = i) %>% 
-      select(gupi, fold, Outcomes.DerivedCompositeGOSEDaysPostInjury, Outcomes.DerivedCompositeGOSE)
-  )
-}
-modelnames <- lapply(
-    list.dirs(
-      sprintf("%s/validation/posteriors", params$data_dir), 
-      recursive = FALSE
-    ),
-    basename
-  ) %>% 
-  as.character()
-df_model_posteriors <- c()
-for (modelname in modelnames) {
-  for (i in 1:config$mi_m) {
-    for (j in 1:config$folds) {
-      tryCatch(
-        df_model_posteriors <- rbind(
-          df_model_posteriors,
-          sprintf("%s/validation/posteriors/%s/df_posterior_mi_%i_fold_%i.rds", 
-                  params$data_dir, modelname, i, j) %>%
-            readRDS() %>% 
-            mutate(imp = i, fold = j, model = modelname) %>% 
-            select(gupi, fold, imp, model, everything()) %>% 
-            filter(
-              gupi %in% (df_ground_truth %>% filter(fold == j) %>% .[["gupi"]])
-            )
-        ), 
-        error = function(e) print(e)
-      )
-    }
-  }
-}
-df_model_posteriors <- df_model_posteriors %>% 
-  mutate(
-    model = factor(model,
-      levels = c("locf", "mm_nb", "mm", "gp_nb", "gp", "msm", "msm_age"),
-      labels = c("LOCF", "MM", "MM + cov", "GP", "GP + cov", "MSM", "MSM + age"),
-    ) %>% as.character
-  )
-df_predictions <- df_model_posteriors %>%
-  group_by(fold, model, gupi, t, GOSE) %>%
-  summarise(probability = mean(probability)) %>%
-  arrange(model, fold, gupi, t, GOSE) %>% 
-  ungroup() %>% 
-  filter(t == 180) %>% 
-  select(-t) %>% 
-  group_by(fold, model, gupi) %>% 
-  do(
-    GOSE = .$GOSE,
-    prediction = rep(which.max(.$probability) %>% 
-      ifelse(length(.) == 0, NA_integer_, .) %>% 
-      ifelse(length(.) > 1, round(mean(.)), .), 8),
-    probability = .$probability
-  ) %>% 
-  unnest %>% 
-  spread(GOSE, probability) %>% 
-  right_join(
-    df_ground_truth %>% 
-      mutate(
-        gupi = as.character(gupi),
-        GOSE = as.integer(Outcomes.DerivedCompositeGOSE)
-    ), 
-    by = c("gupi", "fold")
-  )
-```
-Model performance was assessed via three-fold cross validation on the subset 
-of individuals with a valid GOSe value within 180 +/- 14 days post-injury 
-(n = `r nrow(df_ground_truth)`).
-All models were fit on the entire available data after removing the 
-180 +/- 14 days post-injury observation from the respective test fold
-thus mimicking a missing completely at random missing data mechanism.
-The distribution of GOSe values in the respective three test sets is well 
-balanced, (cf. Appendix, Figure A.1).
-Performance was assessed using the absolute-count and the normalized 
-(proportions) confusion matrices as well as
-bias, mean absolute error (MAE), and root mean squared error (RMSE).
-All confusion matrices are reported as averages over the three-fold cross
-validation test sets.
-The normalized confusion matrices are normalized within each stratum of observed 
-GOSe value and are thus estimates of confusion probability conditional on the
-observed GOSe.
-Bias indicates whether, on average, predicted values are systematically 
-lower (negative) or higher (positive) than observed values.
-MAE and RMSE are both a measures of average precision where
-RMSE puts more weight on large deviations as compared to RMSE.
-Comparisons in terms of bias, MAE, and RMSE tacitly assume that 
-GOSe values can be sensibly interpreted on an interval scale.
-We therefore also considered directional bias (bias'), 
-the difference between the model-fitted 
-probability of exceeding the true value minus the model-fitted probability of 
-undershooting the true GOSe ($Pr[imputed > observed] - Pr[imputed < observed]$) as an
-alternative measure of bias which does not require this assumption.
-Note that the scale of the directional bias is not directly comparable to the 
-one of the other three quantities!
-All measures are considered both conditional on the ground-truth
-(unobserved observed GOSe) as well as averaged over the entire test set.
-```{r non-locf-ids, include=FALSE}
-idx <- df_predictions %>% 
-  filter(model == "LOCF", !complete.cases(.)) %>% 
-  .[["gupi"]]
-```
-LOCF, by design, cannot provide imputed values when there are no
-observations before 180 days post injury.
-A valid comparison of LOCF with the other methods must therefore be 
-based on the set of individuals for whom an LOCF imputation is possible.
-Overall, `r length(idx)` out of 
-`r df_predictions %>% filter(model == "LOCF") %>% nrow` test cases 
-(`r round(100 * length(idx) / (df_predictions %>% filter(model == "LOCF") %>% nrow), 1)`%) could not be imputed with the LOCF approach.
-In the entire study population, `r df_gose %>% group_by(gupi) %>% summarize(LOCF = any(Outcomes.DerivedCompositeGOSEDaysPostInjury <= 180)) %>% ungroup %>% summarize(n_LOCF = sum(!LOCF)) %>% .[["n_LOCF"]]`
-individuals (`r round(100 * (df_gose %>% group_by(gupi) %>% summarize(LOCF = any(Outcomes.DerivedCompositeGOSEDaysPostInjury <= 180)) %>% ungroup %>% summarize(n_LOCF = sum(!LOCF)) %>% .[["n_LOCF"]]) / (df_gose$gupi %>% unique %>% length), 1)`%) did not have data that would permit an LOCF imputation.
-The subset used for comparison of the imputation approaches with the LOCF 
-approach  was similar to the overall dataset (cf. Appendix, Table ???).
-# Results
-The overall performance of all fitted models in terms of bias, bias', MAE, and RMSE
-is depicted in Figure 2 both conditional on LOCF being applicable (gray) and,
-excluding LOCF, on the entire test set (black).
-Values are reported as mean over the three cross-validation folds and 
-error bars indicate +/- 1.96 standard errors.
-```{r overall-comparison-all-methods, echo=FALSE, fig.height=3.5, fig.width=6, warning=FALSE}
-compute_summary_measures <- function(df) {
-  df %>%
-    group_by(model, fold) %>% 
-    summarize(
-      RMSE    = mean((GOSE - prediction)^2, na.rm = TRUE) %>% sqrt,
-      MAE     = mean(abs(GOSE - prediction), na.rm = TRUE),
-      bias    = mean(prediction, na.rm = TRUE) - mean(GOSE, na.rm = TRUE),
-      `d-bias` = mean(prediction > GOSE, na.rm = TRUE) - mean(prediction < GOSE, na.rm = TRUE)
-    ) %>% 
-    ungroup %>% 
-    gather(error, value, -model, -fold) %>% 
-    mutate(
-      error = factor(error, c(
-        "bias",
-        "d-bias",
-        "MAE",
-        "RMSE"
-      ))
-    ) %>% 
-    group_by(model, error) %>%
-    summarize(
-      mean_error = mean(value),
-      se_error   = sd(value) / sqrt(n()),
-    )
-}
-df_plot <- rbind(
-  compute_summary_measures(
-      df_predictions %>% 
-        mutate(prediction = ifelse(model == "LOCF", NA, prediction))
-    ) %>% 
-    mutate(label = "full test set"),
-  compute_summary_measures(df_predictions %>% filter(!(gupi %in% idx))) %>% 
-    mutate(label = "LOCF subset")
-) %>% 
-  ungroup
-ggplot(df_plot, aes(x = model, y = mean_error)) + 
-  geom_hline(yintercept = 0, color = "black") +
-  geom_point(aes(color = label), size = .8, position = position_dodge(.4)) +
-  geom_errorbar(
-    aes(ymin = mean_error - 1.96*se_error, ymax = mean_error + 1.96*se_error, color = label),
-    width = .25, position = position_dodge(.3)
-  ) + 
-  scale_color_grey("", start = .0, end = .5) + 
-  facet_wrap(~error, nrow = 1) +
-  scale_y_continuous(name = "", breaks = seq(-2, 8, .25), limits = c(-.5, 1.25)) +
-  scale_x_discrete("") + 
-  theme_bw() +
-  theme(
-    panel.grid.minor   = element_blank(),
-    panel.grid.major.x = element_blank(),
-    axis.text.x        = element_text(angle = 66, hjust = 1),
-    legend.position    = "bottom" 
-  )
-ggsave(filename = "overall_locf.pdf", width = 6, height = 3.5)
-ggsave(filename = "overall_locf.png", width = 6, height = 3.5, dpi = 300)
-```
-Firstly, LOCF is overall negatively biased, i.e., 
-on average it imputes lower-than-observed GOSe values.
-This reflects a population average trend towards continued
-recovery within the first 6 months post injury.
-The fact that both ways of measuring bias qualitatively agree, 
-indicates that the interpretation of GOSe as an interval measure which
-tacitly underlies Bias, MAE, and RMSE comparisons is not too restrictive.
-In terms of accuracy, LOCF does perform worst but differences between
-methods are less pronounced than in terms of bias.
-Notably, the RMSE difference between LOCF and the other methods is slightly
-larger than the MAE difference which indicates that LOCF tends to produce more 
-large deviations, i.e., across several GOSe categories.
-Secondly, including baseline covariates only affects the GP regression model
-notably. 
-Both the MM and MSM model perform more or less the same irrespective of adjustment
-for baseline covariates.
-This indicates that the additional predictive value of baseline 
-covariates over the information contained in at least one observed GOSe value
-is limited.
-Furthermore, note that both variants of the mixed effects model fail to correct
-the overall bias of the imputed values.
-We proceed with a detailed analyses of a subset of models both in direct 
-comparison with LOCF and in the entire data set including those cases where
-LOCF is not applicable.
-In the following we only consider the baseline-adjusted Gaussian process model 
-('GP + baseline'), the mixed effect model without baseline covariates ('MM')
-and the multi-state model without baseline covariates ('MSM').
-The rationale behind dropping baseline adjustment for MM and MSM being that
-the additional complexity does not substantially alter overall performance.
-On the other hand, the GP model benefits from the inclusion of the IMPACT
-baseline covariates.
-## Detailed comparison conditional on LOCF subset
-We first consider results for the set of test cases which allow LOCF imputation 
-(n = `r df_predictions %>% filter(model == "LOCF") %>% nrow - length(idx)`).
-Both the raw count as well as the relative (by left-out observed GOSe) confusion matrices
-are presented in Figure 3.
-The GOSe scale is restricted to 3+ since the imputation is conditional on 
-an observed GOSe larger than 1 (deaths are known and no imputation necessary)
-and no GOSe 2 was observed.
-```{r confusion-matrix-locf, warning=FALSE, message=FALSE, echo=FALSE, fig.cap="Confusion matrices on LOCF subset.", fig.height=3, fig.width=6}
-plot_confusion_matrices <- function(df_predictions, models, nrow = 2, legendpos, scriptsize) {
-  df_average_confusion_matrices <- df_predictions %>% 
-    select(-`1`, -`2`) %>% 
-    filter(model %in% models) %>% 
-    group_by(fold, model) %>% 
-    do(
-      confusion_matrix = caret::confusionMatrix(
-          data      = factor(.$prediction, levels = 3:8), 
-          reference = factor(.$GOSE, levels = 3:8)
-        ) %>% 
-        as.matrix %>% as_tibble %>% 
-        mutate(`Predicted GOSE` = {row_number() + 2} %>% as.character) %>% 
-        gather(`Observed GOSE`, n, 1:6)
-    ) %>% 
-    unnest %>% 
-    group_by(model, `Predicted GOSE`, `Observed GOSE`) %>% 
-    summarize(n = mean(n)) %>% 
-    ungroup %>% 
-    mutate(model = factor(model, models))
-  # p_cnf_mtrx_raw <- 
-  df_average_confusion_matrices %>%
-    ggplot(aes(`Observed GOSE`, `Predicted GOSE`)) +
-      geom_text(
-        aes(
-          label    = sprintf("%3.f", n) %>% ifelse(. == "0.0", "", .),
-          fontface = ifelse(`Observed GOSE` == `Predicted GOSE`, 'bold', 'plain')
-        ), 
-        vjust = .5, size = scriptsize
-      ) +
-      geom_hline(yintercept = c(2, 4, 6) + .5, color = "black") +
-      geom_vline(xintercept = c(2, 4, 6) + .5, color = "black") +
-      coord_fixed(expand = FALSE, xlim = c(0.5, 6.5), ylim = c(0.5, 6.5)) + 
-      labs(x = "observed GOSe", y = "imputed GOSe", fill = "") +
-      theme_bw() +
-      theme(
-        panel.grid      = element_blank(),
-        legend.position = "none"
-      ) + 
-      facet_wrap(~model, nrow = nrow) +
-      ggtitle("Average confusion matrix across folds (absolute counts)")
-  # p_cnf_mtrx_colnrm <- df_average_confusion_matrices %>%
-  #   group_by(model, `Observed GOSE`) %>%
-  #   mutate(
-  #     `fraction (column)` = n / sum(n),
-  #     `fraction (column)` = ifelse(is.nan(`fraction (column)`), 0, `fraction (column)`)
-  #   ) %>% 
-  #   ggplot(aes(`Observed GOSE`, `Predicted GOSE`, fill = `fraction (column)`)) +
-  #     geom_raster() +
-  #     geom_hline(yintercept = c(2, 4, 6) + .5, color = "black") +
-  #     geom_vline(xintercept = c(2, 4, 6) + .5, color = "black") +
-  #     scale_fill_gradient("", low = "white", high = "black", limits = c(0, 1)) +
-  #     coord_fixed(expand = FALSE) + 
-  #     labs(x = "observed GOSe", y = "imputed GOSe", fill = "") +
-  #     theme_bw() +
-  #     theme(
-  #       panel.grid = element_blank(),
-  #       legend.position = legendpos
-  #     ) + 
-  #     facet_wrap(~model, nrow = nrow) +
-  #     ggtitle("Average confusion matrix across folds (column fraction)")
-  # cowplot::plot_grid(p_cnf_mtrx_raw, p_cnf_mtrx_colnrm, ncol = 1, align = "h")      
-}
-plot_confusion_matrices(
-  df_predictions %>% 
-    filter(!(gupi %in% idx)), 
-  c("MSM", "GP + cov", "MM", "LOCF"),
-  nrow = 1,
-  legendpos = "none",
-  scriptsize = 2.5
-)
-ggsave(filename = "confusion_matrices_locf.pdf", width = 6, height = 3)
-ggsave(filename = "confusion_matrices_locf.png", width = 6, height = 3, dpi = 300)
-```
-The absolute-count confusion matrices show that most imputed values are 
-within +/- one GOSE categories of the observed ones. 
-However, they also reflect the category imbalance (cf. Figures 1) in the study population.
-The performance conditional on the (in practice unknown) observed GOSe value
-clearly shows that imputation for the most infrequent category 4
-is hardest.
-This is, however, true across the range of methods considered.
-Both the MSM and the MM models account for this by almost never imputing a 
-GOSe of 4.
-Instead, the respective cases tend to be imputed to GOSe 3 or 5.
-To better understand the overall performance assessment in Figure 2,
-we also consider the performance conditional on the respective ground-truth
-(i.e. the observed GOSe values in the test sets).
-The results are shown in Figure 4 (vertical bars are +/-1.96 standard error of the mean).
-```{r error-scores-locf, echo=FALSE, fig.height=3.5, fig.width=6}
-plot_summary_measures_cond <- function(df_predictions, models, label) {
-  df_predictions %>%
-    filter(model %in% models) %>% 
-    group_by(model, fold, GOSE) %>% 
-    summarize(
-      RMSE = mean((GOSE - prediction)^2, na.rm = TRUE) %>% sqrt,
-      MAE  = mean(abs(GOSE - prediction), na.rm = TRUE),
-      bias = mean(prediction, na.rm = TRUE) - mean(GOSE, na.rm = TRUE),
-      `d-bias` = mean(prediction > GOSE, na.rm = TRUE) - mean(prediction < GOSE, na.rm = TRUE)
-    ) %>% 
-    gather(error, value, -model, -GOSE, -fold) %>% 
-    group_by(GOSE, model, error, fold) %>% 
-    summarize(
-        mean_per_fold = mean(value) # overall mean
-    ) %>%
-    group_by(GOSE, model, error) %>% 
-    summarize(
-        mean = mean(mean_per_fold), # overall mean
-        se   = sd(mean_per_fold) / sqrt(n())
-    ) %>%
-    ungroup() %>% 
-    mutate(
-      model = factor(model, models),
-      error = factor(error, c(
-        "bias",
-        "d-bias",
-        "MAE",
-        "RMSE"
-      ))   
-    ) %>% 
-  ggplot(aes(GOSE, color = model)) + 
-      geom_hline(yintercept = 0, color = "black") +
-      geom_line(aes(y = mean), alpha = .5) + 
-      geom_point(aes(y = mean), size = .3,
-        position = position_dodge(.3)) + 
-      geom_errorbar(aes(ymin = mean - 1.96*se, ymax = mean + 1.96*se), 
-        width = .33,
-        size = .5,
-        position = position_dodge(.2)
-      ) +
-      geom_point(aes(y = mean),
-        position = position_dodge(.2)) + 
-      xlab("observed GOSe") + 
-      facet_wrap(~error, nrow = 1) +
-      scale_y_continuous(name = "", breaks = seq(-2, 8, .5)) +
-      theme_bw() +
-      theme(
-        panel.grid.minor   = element_blank(),
-        panel.grid.major.x = element_blank(),
-        legend.title       = element_blank(),
-        legend.position    = "bottom" 
-      ) +
-      ggtitle(label)
-}
-plot_summary_measures_cond(
-  df_predictions %>% filter(!(gupi %in% idx)), 
-  c("MSM", "GP + cov", "MM", "LOCF"), 
-  "Summary measures by observed GOSe, LOCF subset"
-)
-ggsave(filename = "errors_stratified_locf.pdf", width = 6, height = 3.5)
-ggsave(filename = "errors_stratified_locf.png", width = 6, height = 3.5, scale = 1.25, dpi = 300)
-```
-Just as with the overall performance, differences are most pronounced in terms
-of bias.
-Interestingly, the conditional perspective reveals differences between bias
-as difference between mean imputed and mean observed values (tacitly assuming
-an at least interval scale) and the difference in the probability to
-over- or undershoot the observed value. 
-Again, the category imbalance in the GOSe distribution explains the fact that
-all model-based approaches tend to perform better for the most frequent
-categories 6, 7, and 8 while sacrificing performance for the less frequent
-categories 4 and 5 as compared to LOCF.
-Bias-wise all methods exhibit a certain regression to the mean effect since low
-categories tend to be confused with better (higher) GOSe on average while
-high observed GOSe values are dampened (negative bias at 7, 8).
-Since LOCF does not take the category imbalance into account and since it exhibits
-a relatively large negative bias at the most frequent GOSe values, it is
-overall negatively biased.
-Interestingly, the conditional assessment of the GP regressions bias profile
-reveals that the overall unbiasedness can be explained by the relatively high
-positive and negative biases conditional on low/high GOSe values canceling out
-in the overall population.
-The MSM and MM models are fairly similar with respect to accuracy but MSM
-clearly dominates with respect to bias.
-Note that irrespective of the exact definition of bias used, MSM dominates the other
-model-based approaches. 
-Comparing LOCF and MSM, there is a slight advantage of MSM in terms of accuracy for
-the majority classes 3, 7, 8 which explain the overall difference shown in Figure 2.
-With respect to bias, MSM also performs better than LOCF for the most frequently
-observed categories, but the extent of this improvement depend on the performance measure.
-## Detailed comparison on full test set
-In the following, LOCF is not considered since a meaningful comparison 
-including LOCF is not possible on the entire set of test candidates due to the 
-fact that LOCF is not applicable in cases
-where only GOSe values after 180 days post-injury are available.
-The qualitative performance of the three imputation approaches in the complete 
-dataset was similar to their performance in the subset of data used for 
-comparison with LOCF.
-```{r confusion-matrix, warning=FALSE, message=FALSE, echo=FALSE, fig.cap="Confusion matrices, full test set without LOCF.", fig.height=6, fig.width=6}
-plot_confusion_matrices(
-  df_predictions, 
-  c("MSM", "GP + cov", "MM"),
-  nrow = 1,
-  legendpos = "none",
-  scriptsize = 3
-)
-ggsave(filename = "confusion_matrices_all.pdf", width = 6, height = 3)
-ggsave(filename = "confusion_matrices_all.png", width = 6, height = 3, dpi = 300)
-```
-```{r error-scores-all, echo=FALSE, fig.height=3.5, fig.width=6}
-plot_summary_measures_cond(
-  df_predictions %>% filter(!(gupi %in% idx)), 
-  c("MSM", "GP + cov", "MM"), 
-  "Summary measures by observed GOSe, full test set"
-)
-ggsave(filename = "imputation_error.pdf", width = 6, height = 3.5)
-ggsave(filename = "imputation_error.png", width = 6, height = 3.5, scale = 1.25, dpi = 300)
-```
-# Discussion
-Handling missing data *post-hoc* to prevent biased analyses often requires
-great effort.
-It is thus of the utmost importance to implement measures for avoiding missing
-data in the first place.
-Nevertheless, in practice, missing values due to loss-to-follow-up will always
-occur and should be addressed effectively.
-There is a wide consensus that statistically sound imputation of missing values 
-is beneficial for both the reduction of bias and for increasing statistical power.
-The current gold-standard for imputing missing values is multiple
-imputation on a per-analysis basis including analysis-specific covariates to
-further reduce bias and to preserve the imputation uncertainty in the 
-downstream analysis.
-In practice, however, there are good reasons for providing a set of single-imputed
-default values in large observational studies such as CENTER-TBI.
-Consortia are increasingly committed to making their databases 
-available to a wider range of researchers. 
-In fact, more liberal data-sharing policies are becoming a core requirement 
-for funding bodies (cf. https://www.openaire.eu/).
-In this context, it might not be possible to ensure that every analysis team
-has the necessary statistical expertise to properly conduct a per-analysis
-multiple imputation in the future.
-Furthermore, the imputed values of a multiple-imputation procedure are 
-inherently random and it is thus difficult to ensure consistency across 
-different analysis teams if the values themselves cannot be stored 
-directly in a database.
-For this reason, as a practical way forward, we suggest providing a default 
-single-imputation with appropriate measures of uncertainty for key outcomes 
-in the published data base itself.
-This mitigates problems with complete-case analyses and provides a
-principled and consistent default approach to handling missing values.
-Since we strongly suggest to employ a model-based approach to imputation,
-the fitted class probabilities can be provided in the core database along the
-imputed values.
-Based on these probabilities, it is easy to draw samples for a multiple imputation
-analysis.
-Wherever necessary and practical, a custom, analysis-specific multiple 
-imputation approach might still be employed.
-In these cases, the model providing the single-imputed values may 
-be used as a starting point.
-Several reasons disqualify LOCF as method of choice.
-Not only is it inherently biased, but LOCF is also inefficient in that it 
-does not properly take into account the category imbalance of GOSe in 
-the respective target population. 
-Albeit simple to implement, LOCF - by definition - is not capable of 
-exploiting longitudinal information after the target time point. 
-This results in a smaller subset of individuals for which imputed values can
-be provided in the first place.
-LOCF also lacks flexibility to adjust for further covariates which might be
-necessary in some cases to further reduce bias under a missing at random assumption.
-Finally, LOCF cannot produce an adequate measure of imputation uncertainty 
-since it is not model based.
-We draw two main conclusions from our comparison of three 
-alternative, model-based approaches.
-Firstly, despite its theoretical drawbacks, LOCF is hard to beat in terms of
-accuracy.
-Still, small improvements are possible.
-The main advantages of a model-based approach is thus a reduction of bias,
-the ability to provide a measure of uncertainty together with the imputed 
-values
-(or to use the same very same model to draw multiple imputations), 
-as well as the possibility of including further analysis-specific covariates.
-Secondly, we found that the inclusion of established baseline predictors for
-GOSe at 6 months post-injury had little effect on the imputation quality.
-Note that this does not refute their predictive value but only indicates that
-there is little marginal benefit over knowing at least one other value. 
-Differences between the model-based approaches tend to be rather nuanced.
-We nevertheless favor the multi-state model (MSM).
-It is well-interpretable in terms of transition intensities.
-An efficient implementation [@msm2011] in standard statistical software [@R2016] is 
-available.
-It succeeds in eliminating the bias introduced by LOCF, is
-able to provide imputed values for the entire population and is able to
-provide a probabilistic output.
-## Funding sources statement:
-Data used in preparation of this manuscript were obtained in the context of
-CENTER-TBI, a large collaborative project with the support of the European
-Union 7th Framework program (EC grant 602150).
-Additional funding was obtained from the Hannelore Kohl Stiftung (Germany),
-from OneMind (USA) and from Integra LifeSciences Corporation (USA).
-# Appendix / Supplemental Material
-## Ethical approval statement
-The CENTER-TBI study (EC grant 602150) has been conducted in accordance with all relevant laws of the EU if directly applicable or of direct effect and all relevant laws of the country where the Recruiting sites  were  located,  including  but  not  limited  to,  the  relevant  privacy  and  data  protection  laws  and regulations (the “Privacy Law”), the relevant laws and regulations on the use of human materials, and all relevant guidance relating to clinical studies from time to time in force including, but not limited to, the ICH Harmonised Tripartite Guideline for Good Clinical Practice (CPMP/ICH/135/95) (“ICH GCP”) and the World Medical Association Declaration of Helsinki entitled “Ethical Principles for Medical Research Involving Human Subjects”.
-Informed  Consent by the patients and/or the legal representative/next of kin was obtained, accordingly to thelocal legislations,for all patients recruited in the Core Dataset of CENTER-TBI and documented in the e-CRF.
-Ethical approval was obtained for each recruiting site.
-The list of sites, Ethical Committees, approval numbers and approval dates can be foundon the website: https://www.center-tbi.eu/project/ethical-approval.
-## Distribution of GOSe in validation folds
-```{r marginal-GOSe-folds, echo=FALSE, fig.cap="Marginal GOSe distribution per validation fold.", fig.width=6, fig.height=3}
-df_ground_truth %>% 
-    ggplot(aes(Outcomes.DerivedCompositeGOSE)) +
-        geom_bar() +
-        facet_wrap(~fold) +
-        theme_bw() +
-        ylab("") +
-        xlab("GOSe") +
-        theme(
-            panel.grid.minor   = element_blank(), 
-            panel.grid.major.x = element_blank()
-        )
-ggsave(filename = "gose_marignal_per_fold.pdf", width = 6, height = 3)
-ggsave(filename = "gose_marignal_per_fold.png", width = 6, height = 3, dpi = 300)
-```
-## Comparison of LOCF and non-LOCF subgroups
-```{r baseline-table-continuous2, echo=FALSE, results='asis'}
-summarizer <- function(x) {
-  return(tibble(
-    statistic = c("n", "# NA", "min", "mean", "median", "max", "std") %>% 
-      factor(., levels = ., ordered = TRUE),
-    value     = list(length(x), sum(is.na(x)), min(x, na.rm = TRUE), mean(x, na.rm = TRUE), median(x, na.rm = TRUE),
-                 max(x, na.rm = TRUE), sd(x, na.rm = TRUE))
-  ))
-}
-get_cont_smry <- function(dfs) {
-  res <- list()
-  for (i in 1:length(dfs)) {
-    res[[i]] <- dfs[[i]] %>% 
-      select_if(is.numeric) %>% 
-      gather(variable, value) %>% 
-      group_by(variable) %>% 
-      do(
-        smry = summarizer(.$value)
-      ) %>%
-      unnest %>%
-      spread(statistic, value) %>%
-      unnest() 
-  }
-  return(res)
-}
-df_baseline %>%
-  mutate(locf = ifelse(!(gupi %in% idx), "LOCF possible", "LOCF impossible")) %>% 
-  group_by(locf) %>% 
-  nest %>% 
-  do(
-    locf = .$locf,
-    smry = get_cont_smry(.$data)
-  ) %>% 
-  unnest %>% unnest %>% 
-  arrange(variable, locf) %>% 
-  gather(key = "Statistic", value = "value", 3:9) %>% 
-  spread(Statistic, value) %>% 
-  arrange(variable, locf) %>% 
-  pander::pandoc.table("Discrete baseline variables by LOCF status", digits = 3, split.tables = 120)
-```
-```{r baseline-table-discrete2, echo=FALSE, results='asis'}
-summarizer <- function(x) {
-  rbind(table(x), prop.table(table(x))) %>% 
-    t %>% 
-    {tmp <- .; colnames(tmp) <- c("n", "%"); tmp} %>%  
-    as_tibble(rownames = "level") %>% 
-    mutate(`%` = 100 * `%`)
-}
-get_cont_smry <- function(dfs) {
-  res <- list()
-  for (i in 1:length(dfs)) {
-    res[[i]] <- dfs[[i]] %>% 
-      select_if(~!is.numeric(.)) %>% 
-      select(-gupi) %>% 
-      mutate_all(as.factor) %>% 
-      mutate_all(fct_explicit_na) %>% 
-      mutate_all(as.character) %>% 
-      gather(variable, value) %>% 
-      group_by(variable) %>% 
-      do(
-        smry = summarizer(.$value)
-      ) %>% 
-      unnest()
-  }
-  return(res)
-}
-df_baseline %>%
-  mutate(locf = ifelse(!(gupi %in% idx), "LOCF possible", "LOCF impossible")) %>% 
-  group_by(locf) %>% 
-  nest %>% 
-  do(
-    locf = .$locf,
-    smry = get_cont_smry(.$data)
-  ) %>% 
-  unnest %>% unnest %>% 
-  gather("stat", "value", n, `%`) %>% 
-  unite(tmp, c(stat, locf), sep = ", ") %>% 
-  spread(tmp, value) %>% 
-  mutate_at(3:6, ~ifelse(is.na(.x), 0, .x)) %>% 
-  pander::pandoc.table("Continuous baseline variables by LOCF status", digits = 3, split.tables = 140)
-```
-## Models
-### Mixed-effects model
-Mixed effects models are a widely used approach in longitudinal 
-data analysis and model individual deviations from a population mean trajectory 
-(@verbeke2009linear).
-To account for the fact that the GOSe outcome is an ordered factor,
-we employ a cumulative link function model with flexible intercepts 
-(@Agresti2003).
-The population mean is modeled as cubic spline function to allow a 
-non-linear population mean trajectory.
-Patient-individual deviations from this population mean are modeled 
-as quadratic polynomials to allow sufficient flexibility (random effects).
-Baseline covariates are added as linear fixed effects to to the 
-population mean.
-The model was fitted using Bayesian statistics via the BRMS 
-package (@brms2017; @brms2018) for the R environment for statistical 
-computing (@R2016) and the Stan modelling language (@stan2017). 
-Non-informative priors were used for the model parameters.
-A potential drawback of the proposed longitudinal mixed effects model 
-is the fact that the individual deviations from the population mean 
-are modeled globally using polynomials.
-Since linear and quadratic terms are not identifiable for patients with 
-only one observed GOSE value, this implies large uncertainty over the 
-patient-specific effects for individuals with one or two observations 
-only by falling back on the non-informative priors on these model 
-parameters. 
-Thus, the overall uncertainty associated with model-based imputations 
-at exactly 180 days may become relatively large.
-### Gaussian process model
-Gaussian process regression allows flexible modelling of 
-both the individual GOSE trajectories as well as the population mean
-in a Bayesian non-parametric way [@rasmussen2006].
-This non-parametric paradigm leads to low model-uncertainty in the 
-vicinity of actually observed GOSe outcomes.
-To account for the discreteness of the GOSe outcome, 
-the continuous output of the Gaussian process model is rounded to the
-nearest integer in  1 to 8 (GOSe categories).
-The squared exponential covariance function with shared 
-length scale for all individuals is used to model intra-individual
-dependency of GOSe outcomes.
-The population mean trajectory of the Gaussian process is modeled as
-mean of an independent Gaussian process with pseudo observations at 
-`r config$gp$t_knots %>% as.numeric()` days post-injury 
-(also using a squared exponential covariance function).
-This approach maintains flexibility of the population mean function 
-similar to the spline-based approach for the mixed effects model
-while avoiding the the computational complexity of a fully
-hierarchical Gaussian process model.
-Again, the impact of baseline covariates is modeled via linear effects
-on the population mean of the Gaussian process.
-All parameters are estimated in a fully Bayesian fashion using the Stan 
-modelling language @stan2017 and non-informative priors except for the 
-length scale of the squared exponential kernel.
-Due to the sparseness of the data, the estimated length scale will naturally 
-tend towards extremely large values implying unrealistically 
-long-range dependency between observations.
-We therefore limit the length scale to a maximum of 120 days (4 months) and 
-impose a Gaussian prior with a mean of 60 days post injury and and a 
-standard deviation of 14.
-### Multi-state model
-Both the mixed effects model as well as the Gaussian process regression 
-model are essentially non-linear regression techniques for longitudinal 
-data.
-While they are both powerful tools to model longitudinal trajectories,
-they do not explicitly model the probability of transitions between 
-GOSe states.
-Since the number of observations per individual is very limited in our
-data set (1 to 4 GOSe observations per individual),
-an approach explicitly modelling transition probabilities might be 
-more suitable to capture the dynamics of the GOSe trajectories.
-To explore this further, a Markov multi-state model is considered (@meira2009).
-This model class assumes that the transitions between adjacent GOSe 
-states can be modeled as a Markov process and the transition 
-intensities between adjacent states are fitted to the observed data.
-```{r diagram, echo=FALSE, fig.cap="Structural diagram of allowed transitions between GOSe states for the proposed multi-state model.", fig.width=8, fig.height=4}
-# define transition matrix
-Q <- matrix(0, nrow = 8, ncol = 8)
-for (i in 1:8) {
-  for (j in 1:8) {
-    if (i == j - 1 & i != 1) {
-      Q[i, j] <- 1
-    }
-    if (i == j + 1) {
-      Q[i, j] <- 1
-    }
-  }
-}
-Q[2:8, 1] <- 1 # allow instantaneous deaths
-Q <- t(Q)
-curvature <- Q / 5
-curvature[1, ] <- 0
-diagram::plotmat(
-  Q, 
-  relsize = 1, 
-  shadow.size = 0, 
-  pos = c(1, 7), 
-  box.size = .033, 
-  arr.pos = .55, 
-  arr.length = .33, 
-  arr.type = "triangle", 
-  cex.txt = 0, 
-  curve = curvature
-)
-```
-To account for the fact that state-transitions might be more frequent 
-in the early post-injury phase, 
-piece wise constant transition intensities were fitted to the intervals 
-[0, 90), [90, 270), and 270+ days post-injury.
-The model was fit using the msm @msm2011 package for the R environment
-for statistical computing @R2016.
-Due to the relatively large number of 19 transition intensities in the proposed model
-(cf. arrows in Figure ???, structure of transition graph),
-inclusion of all baseline covariates turned out to be numerically unstable.
-For the MSM model, instead of including all covariates, 
-only a model adjusting for age at injury via a proportional hazard approach was fit.
-## Reproducible Research Strategy
-CENTER-TBI is committed to reproducible research. 
-To this end, the entire source code to run the analyses is publicly available
-at https://git.center-tbi.eu/kunzmann/gose-6mo-imputation.
-Scripts for automatically downloading the required data from the central
-access restricted 'Neurobot' (https://neurobot.incf.org/) database at 
-https://center-tbi.incf.org/ are provided.
-The analysis is completely automated using the workflow management tool 'snakemake'
-[@koster2012snakemake] and a singularity [@kurtzer2017singularity] container image
-containing all required dependencies is publicly available from zenodo.org
-(DOI: 10.5281/zenodo.2600385).
-Detailed step-by-step instructions on how to reproduce the analysis are provided 
-in the README.md file of the GitLab repository. 
-# References
--- a/manuscript/references.bib
+++ b/manuscript/references.bib
-@article{horton2018,
-  title={Randomized controlled trials in adult traumatic brain injury: A systematic review on the use and reporting of clinical outcome assessments},
-  author={Horton, Lindsay and Rhodes, Jonathan and Wilson, Lindsay},
-  journal={Journal of neurotrauma},
-  volume={35},
-  number={17},
-  pages={2005--2014},
-  year={2018},
-  publisher={Mary Ann Liebert, Inc. 140 Huguenot Street, 3rd Floor New Rochelle, NY 10801 USA}
-}
-@article{richter2019,
-  title={Handling missing outcome data in traumatic brain injury research-a systematic review},
-  author={Richter, Sophie and Stevenson, Susan and Newman, Tom and Wilson, Lindsay and Menon, David and Maas, Andrew and Nieboer, Daan and Lingsma, Hester and Steyerberg, Ewout and Newcombe, Virginia},
-  year={2019},
-  publisher={Mary Ann Liebert Inc.}
-}
-@article{white2010,
-  title={Bias and efficiency of multiple imputation compared with complete-case analysis for missing covariate values},
-  author={White, Ian R and Carlin, John B},
-  journal={Statistics in medicine},
-  volume={29},
-  number={28},
-  pages={2920--2931},
-  year={2010},
-  publisher={Wiley Online Library}
-}
-@article{clifton2011,
-  title={Very early hypothermia induction in patients with severe brain injury (the National Acute Brain Injury Study: Hypothermia II): a randomised trial},
-  author={Clifton, Guy L and Valadka, Alex and Zygun, David and Coffey, Christopher S and Drever, Pamala and Fourwinds, Sierra and Janis, L Scott and Wilde, Elizabeth and Taylor, Pauline and Harshman, Kathy and others},
-  journal={The Lancet Neurology},
-  volume={10},
-  number={2},
-  pages={131--139},
-  year={2011},
-  publisher={Elsevier}
-}
-@article{silver2006,
-  title={Effects of rivastigmine on cognitive function in patients with traumatic brain injury},
-  author={Silver, JM and Koumaras, B and Chen, M and Mirski, D and Potkin, SG and Reyes, P and Warden, D and Harvey, Philip D and Arciniegas, D and Katz, DI and others},
-  journal={Neurology},
-  volume={67},
-  number={5},
-  pages={748--755},
-  year={2006},
-  publisher={AAN Enterprises}
-}
-@article{skolnick2014,
-  title={A clinical trial of progesterone for severe traumatic brain injury},
-  author={Skolnick, Brett E and Maas, Andrew I and Narayan, Raj K and Van Der Hoop, Roland Gerritsen and MacAllister, Thomas and Ward, John D and Nelson, Neta R and Stocchetti, Nino},
-  journal={New England Journal of Medicine},
-  volume={371},
-  number={26},
-  pages={2467--2476},
-  year={2014},
-  publisher={Mass Medical Soc}
-}
-@article{bulger2010,
-  title={Out-of-hospital hypertonic resuscitation following severe traumatic brain injury: a randomized controlled trial},
-  author={Bulger, Eileen M and May, Susanne and Brasel, Karen J and Schreiber, Martin and Kerby, Jeffrey D and Tisherman, Samuel A and Newgard, Craig and Slutsky, Arthur and Coimbra, Raul and Emerson, Scott and others},
-  journal={Jama},
-  volume={304},
-  number={13},
-  pages={1455--1464},
-  year={2010},
-  publisher={American Medical Association}
-}
-@article{kirkness2006,
-  title={Effect of continuous display of cerebral perfusion pressure on outcomes in patients with traumatic brain injury},
-  author={Kirkness, Catherine J and Burr, Robert L and Cain, Kevin C and Newell, David W and Mitchell, Pamela H},
-  journal={American journal of critical care},
-  volume={15},
-  number={6},
-  pages={600--609},
-  year={2006},
-  publisher={AACN}
-}
-@article{wright2014,
-  title={Very early administration of progesterone for acute traumatic brain injury},
-  author={Wright, David W and Yeatts, Sharon D and Silbergleit, Robert and Palesch, Yuko Y and Hertzberg, Vicki S and Frankel, Michael and Goldstein, Felicia C and Caveney, Angela F and Howlett-Smith, Harriet and Bengelink, Erin M and others},
-  journal={New England Journal of Medicine},
-  volume={371},
-  number={26},
-  pages={2457--2466},
-  year={2014},
-  publisher={Mass Medical Soc}
-}
-@article{robertson2014,
-  title={Effect of erythropoietin and transfusion threshold on neurological recovery after traumatic brain injury: a randomized clinical trial},
-  author={Robertson, Claudia S and Hannay, H Julia and Yamal, Jos{\'e}-Miguel and Gopinath, Shankar and Goodman, J Clay and Tilley, Barbara C and Baldwin, Athena and Lara, Lucia Rivera and Saucedo-Crespo, Hector and Ahmed, Osama and others},
-  journal={Jama},
-  volume={312},
-  number={1},
-  pages={36--47},
-  year={2014},
-  publisher={American Medical Association}
-}
-@article{maas2014,
-  title={Collaborative European NeuroTrauma Effectiveness Research in Traumatic Brain Injury (CENTER-TBI) A Prospective Longitudinal Observational Study},
-  author={Maas, Andrew IR and Menon, David K and Steyerberg, Ewout W and Citerio, Giuseppe and Lecky, Fiona and Manley, Geoffrey T and Hill, Sean and Legrand, Valerie and Sorgner, Annina},
-  journal={Neurosurgery},
-  volume={76},
-  number={1},
-  pages={67--80},
-  year={2014},
-  publisher={Oxford University Press}
-}
-@article{kurtzer2017singularity,
-  title={Singularity: Scientific containers for mobility of compute},
-  author={Kurtzer, Gregory M and Sochat, Vanessa and Bauer, Michael W},
-  journal={PloS one},
-  volume={12},
-  number={5},
-  pages={e0177459},
-  year={2017},
-  publisher={Public Library of Science}
-}
-@article{mcmillan2016glasgow,
-  title={The Glasgow Outcome Scale—40 years of application and refinement},
-  author={McMillan, Tom and Wilson, Lindsay and Ponsford, Jennie and Levin, Harvey and Teasdale, Graham and Bond, Michael},
-  journal={Nature Reviews Neurology},
-  volume={12},
-  number={8},
-  pages={477},
-  year={2016},
-  publisher={Nature Publishing Group}
-}
-@article{jennett1981disability,
-  title={Disability after severe head injury: observations on the use of the Glasgow Outcome Scale.},
-  author={Jennett, B and Snoek, J and Bond, MR and Brooks, N},
-  journal={Journal of Neurology, Neurosurgery \& Psychiatry},
-  volume={44},
-  number={4},
-  pages={285--293},
-  year={1981},
-  publisher={BMJ Publishing Group Ltd}
-}
-@book{verbeke2009linear,
-  title={Linear mixed models for longitudinal data},
-  author={Verbeke, Geert and Molenberghs, Geert},
-  year={2009},
-  publisher={Springer Science \& Business Media}
-}
-@article{koster2012snakemake,
-  title={Snakemake—a scalable bioinformatics workflow engine},
-  author={K{\"o}ster, Johannes and Rahmann, Sven},
-  journal={Bioinformatics},
-  volume={28},
-  number={19},
-  pages={2520--2522},
-  year={2012},
-  publisher={Oxford University Press}
-}
-@article{msm2011,
-author = {Jackson, Christopher H.},
-doi = {10.18637/jss.v038.i08},
-journal = {Journal of Statistical Software},
-month = {jan},
-number = {8},
-pages = {1--28},
-title = {{Multi-State Models for Panel Data: The msm Package for R}},
-volume = {38},
-year = {2011}
-}
-@article{white2010bias,
-  title={Bias and efficiency of multiple imputation compared with complete-case analysis for missing covariate values},
-  author={White, Ian R and Carlin, John B},
-  journal={Statistics in medicine},
-  volume={29},
-  number={28},
-  pages={2920--2931},
-  year={2010},
-  publisher={Wiley Online Library}
-}
-@article{R2016,
-archivePrefix = {arXiv},
-arxivId = {arXiv:1011.1669v3},
-author = {R Core Team},
-doi = {10.1007/978-3-540-74686-7},
-eprint = {arXiv:1011.1669v3},
-isbn = {3{\_}900051{\_}00{\_}3},
-issn = {3-900051-07-0},
-journal = {R Foundation for Statistical Computing},
-pmid = {16106260},
-title = {{R: A Language and Environment for Statistical Computing}},
-year = {2016}
-}
-@book{rasmussen2006,
-abstract = {Gaussian processes (GPs) are natural generalisations of multivariate Gaussian random variables to infinite (countably or continuous) index sets. GPs have been applied in a large number of fields to a diverse range of ends, and very many deep theoretical analyses of various properties are available. This paper gives an introduction to Gaussian processes on a fairly elementary level with special emphasis on characteristics relevant in machine learning. It draws explicit connections to branches such as spline smoothing models and support vector machines in which similar ideas have been investigated. Gaussian process models are routinely used to solve hard machine learning problems. They are attractive because of their flexible non-parametric nature and computational simplicity. Treated within a Bayesian framework, very powerful statistical methods can be implemented which offer valid estimates of uncertainties in our predictions and generic model selection procedures cast as nonlinear optimization problems. Their main drawback of heavy computational scaling has recently been alleviated by the introduction of generic sparse approximations.13,78,31 The mathematical literature on GPs is large and often uses deep concepts which are not required to fully understand most machine learning applications. In this tutorial paper, we aim to present characteristics of GPs relevant to machine learning and to show up precise connections to other "kernel machines" popular in the community. Our focus is on a simple presentation, but references to more detailed sources are provided.},
-archivePrefix = {arXiv},
-arxivId = {026218253X},
-author = {Rasmussen, Carl Edward and Williams, Christopher K. I.},
-booktitle = {International Journal of Neural Systems},
-doi = {10.1142/S0129065704001899},
-eprint = {026218253X},
-isbn = {026218253X},
-issn = {0129-0657},
-pmid = {15112367},
-title = {{Gaussian Processes for Machine Learning}},
-year = {2006}
-}
-@article{stan2017,
-author = {Carpenter, B and Gelman, A and Hoffman, MD and Lee, D},
-journal = {Journal of statistical software},
-number = {1},
-title = {{Stan: A Probabilistic Programming Language}},
-volume = {76},
-year = {2017}
-}
-@book{Agresti2003,
-author = {Agresti, Alan},
-editor = {{John Wiley {\&} Sons}},
-title = {{Categorical data analysis}},
-year =
-{2003}
-}
-@article{steyerberg2008,
-  title={Predicting outcome after traumatic brain injury: development and international validation of prognostic scores based on admission characteristics},
-  author={Steyerberg, Ewout and Mushkudiani, Nino and Perel, Pablo and Butcher, Isabella and Lu, Juan and McHugh, Gillian S and Murray, Gordon D and Marmarou, Anthony and Roberts, Ian and Habbema, J Dik F and others},
-  journal={PLoS medicine},
-  volume={5},
-  number={8},
-  pages={e165},
-  year={2008},
-  publisher={Public Library of Science}
-}
-@article{meira2009,
-  title={Multi-state models for the analysis of time-to-event data},
-  author={Meira-Machado, Lu{\'\i}s and de U{\~n}a-{\'A}lvarez, Jacobo and Cadarso-Suarez, Carmen and Andersen, Per K},
-  journal={Statistical methods in medical research},
-  volume={18},
-  number={2},
-  pages={195--222},
-  year={2009},
-  publisher={SAGE Publications Sage UK: London, England}
-}
-@article{brms2018,
-author = {B{\"{u}}rkner, Paul},
-journal = {The R Journal},
-number = {1},
-pages = {395--411},
-title = {{Advanced Bayesian Multilevel Modeling with the R Package brms}},
-volume = {10},
-year = {2018}
-}
-@article{brms2017,
-author = {B{\"{u}}rkner, Paul},
-journal = {Journal Of Statistical Software},
-number = {1},
-pages = {1--28},
-title = {{brms: An R Package for Bayesian Multilevel Models Using Stan}},
-volume = {80},
-year = {2017}
-}
--- a/scripts/manuscript_figures.R
+++ b/scripts/manuscript_figures.R
+library(tidyverse)
+knitr::opts_chunk$set(dpi = 600, fig.width = 8, fig.height = 5)
+config <- yaml::read_yaml('config.yml')
+# baseline data (no table produced) ============================================
+df_baseline <- read_rds("output/v1.1/data/df_baseline.rds") %>%
+  mutate(
+    InjuryHx.PupilsBaselineDerived = factor(InjuryHx.PupilsBaselineDerived,
+                                            levels = 0:2),
+    InjuryHx.GCSScoreBaselineDerived = case_when(
+      InjuryHx.GCSScoreBaselineDerived <= 8  ~ "Severe",
+      InjuryHx.GCSScoreBaselineDerived <= 12 ~ "Moderate",
+      InjuryHx.GCSScoreBaselineDerived  > 12 ~ "Mild"
+    ) %>% as.factor(),
+    Labs.DLGlucosemmolL = exp(Labs.DLGlucosemmolL_log)
+  ) %>%
+  select(-Labs.DLGlucosemmolL_log) %>%
+  left_join(
+    read_rds(
+      "data/v1.1/df_baseline_descriptive.rds"
+    ) %>%
+    transmute(
+      gupi,
+      Subject.PatientType = factor(
+        Subject.PatientType,
+        level = 1:3,
+        labels = c("Emergency Room", "Admission to Hospital", "Intensive Care Unit")
+      ),
+      InjuryHx.InjCause = factor(InjuryHx.InjCause,
+        level  = c(1:6, 88, 99),
+        labels = c(
+          "Road traffic incident",
+          "Incidental fall",
+          "Other non-intentional injury",
+          "Violence/assault",
+          "Act of mass violence",
+          "Suicide attempt",
+          "Unknown",
+          "Other"
+        ) %>%
+        fct_recode(
+          Other              = "Other non-intentional injury",
+          Other              = "Suicide attempt",
+          `Violence/assault` = "Act of mass violence"
+        )
+      ),
+      InjuryHx.TotalISS
+    ),
+    by = "gupi"
+  )
+var_labels <- readr::read_csv("varlabels.csv")
+get_label <- function(varname) var_labels$label[var_labels$varname == varname]
+get_label <- Vectorize(get_label)
+names(df_baseline) <- get_label(names(df_baseline))
+n_pat <- nrow(df_baseline)
+df_gose <- read_rds("output/v1.1/data/df_gose.rds")
+n_gose_180 <- df_gose %>%
+  filter(between(Outcomes.DerivedCompositeGOSEDaysPostInjury, 180 - 14, 180 + 14)) %>%
+  group_by(gupi) %>%
+  n_groups()
+n_gose_pp <- df_gose %>%
+  filter(between(Outcomes.DerivedCompositeGOSEDaysPostInjury, 5*30, 8*30)) %>%
+  group_by(gupi) %>%
+  n_groups()
+# figure 1: gose longitudinal ==================================================
+p_gose_times <- df_gose %>%
+  ggplot(aes(Outcomes.DerivedCompositeGOSEDaysPostInjury / 30)) +
+    geom_histogram(binwidth = 1) +
+    theme_bw() +
+    scale_x_continuous("Months post-injury",
+      breaks = seq(0, 50)
+    ) +
+    scale_y_continuous("", breaks = seq(0, 2000, 200)) +
+    theme(
+      panel.grid.minor = element_blank(),
+      panel.grid.major.x = element_blank()
+    )
+timepoint <- function(x) {
+  abs(
+    matrix(rep(x, 4), ncol = 4)
+    - matrix(rep(c(14, 90, 180, 360), length(x)), nrow = length(x), byrow = TRUE)
+  ) %>%
+  apply(., 1, which.min) %>%
+  factor(
+    levels = 1:4,
+    labels = c("2 weeks", "3 months", "6 months", "12 months")
+  )
+}
+tmp <- df_gose %>%
+  transmute(
+    gupi,
+    t    = timepoint(Outcomes.DerivedCompositeGOSEDaysPostInjury),
+    GOSe = Outcomes.DerivedCompositeGOSE
+  )
+p1 <- tmp %>%
+  ggplot(aes(GOSe)) +
+    geom_bar(aes(fill = GOSe), width = .66, color = "black") +
+    scale_fill_grey(start = .1, end = .9) +
+    facet_wrap(~t, nrow = 1) +
+    ylab("absolute count") +
+    theme_bw() +
+    theme(
+      panel.grid.major.x = element_blank(),
+      legend.position = "none"
+    )
+p2 <- tmp %>%
+  group_by(t, GOSe) %>%
+  summarize(n = n()) %>%
+  group_by(t) %>%
+  mutate(freq = n / sum(n)) %>%
+  ggplot(aes(x = GOSe, y = freq)) +
+    geom_bar(aes(fill = GOSe), width = .66, color = "black", stat = "identity") +
+    scale_fill_grey(start = .1, end = .9) +
+    facet_wrap(~t, nrow = 1) +
+    ylab("relative frequency") +
+    theme_bw() +
+    theme(
+      panel.grid.major.x  = element_blank(),
+      legend.position = "none"
+    )
+p3 <- gridExtra::grid.arrange(
+  p_gose_times,
+  p1,
+  p2,
+  nrow    = 3,
+  heights = c(1.5, 1, 1)
+)
+ggsave(filename = "gose_longitudinally.eps", plot = p3, width = 6, height = 6, colormodel = "cmyk")
+# load model data ==============================================================
+df_ground_truth <- c()
+for (i in 1:config$folds) {
+  df_ground_truth <- rbind(
+    df_ground_truth,
+    # note that GOSE values in test data are not subject to mi - loading the
+    # first data set is sufficient!
+    sprintf("output/v1.1/data/validation/df_test_mi_1_fold_%i.rds", i) %>%
+      readRDS() %>%
+      mutate(fold = i) %>%
+      select(gupi, fold, Outcomes.DerivedCompositeGOSEDaysPostInjury, Outcomes.DerivedCompositeGOSE)
+  )
+}
+modelnames <- lapply(
+    list.dirs(
+      "output/v1.1/data/validation/posteriors",
+      recursive = FALSE
+    ),
+    basename
+  ) %>%
+  as.character()
+df_model_posteriors <- c()
+for (modelname in modelnames) {
+  for (i in 1:config$mi_m) {
+    for (j in 1:config$folds) {
+      tryCatch(
+        df_model_posteriors <- rbind(
+          df_model_posteriors,
+          sprintf("output/v1.1/data/validation/posteriors/%s/df_posterior_mi_%i_fold_%i.rds",
+                  modelname, i, j) %>%
+            readRDS() %>%
+            mutate(imp = i, fold = j, model = modelname) %>%
+            select(gupi, fold, imp, model, everything()) %>%
+            filter(
+              gupi %in% (df_ground_truth %>% filter(fold == j) %>% .[["gupi"]])
+            )
+        ),
+        error = function(e) print(e)
+      )
+    }
+  }
+}
+df_model_posteriors <- df_model_posteriors %>%
+  mutate(
+    model = factor(model,
+      levels = c("locf", "mm_nb", "mm", "gp_nb", "gp", "msm", "msm_age"),
+      labels = c("LOCF", "MM", "MM + cov", "GP", "GP + cov", "MSM", "MSM + age"),
+    ) %>% as.character
+  )
+df_predictions <- df_model_posteriors %>%
+  group_by(fold, model, gupi, t, GOSE) %>%
+  summarise(probability = mean(probability)) %>%
+  arrange(model, fold, gupi, t, GOSE) %>%
+  ungroup() %>%
+  filter(t == 180) %>%
+  select(-t) %>%
+  group_by(fold, model, gupi) %>%
+  do(
+    GOSE = .$GOSE,
+    prediction = rep(which.max(.$probability) %>%
+      ifelse(length(.) == 0, NA_integer_, .) %>%
+      ifelse(length(.) > 1, round(mean(.)), .), 8),
+    probability = .$probability
+  ) %>%
+  unnest %>%
+  spread(GOSE, probability) %>%
+  right_join(
+    df_ground_truth %>%
+      mutate(
+        gupi = as.character(gupi),
+        GOSE = as.integer(Outcomes.DerivedCompositeGOSE)
+    ),
+    by = c("gupi", "fold")
+  )
+idx <- df_predictions %>%
+  filter(model == "LOCF", !complete.cases(.)) %>%
+  .[["gupi"]]
+compute_summary_measures <- function(df) {
+  df %>%
+    group_by(model, fold) %>%
+    summarize(
+      RMSE    = mean((GOSE - prediction)^2, na.rm = TRUE) %>% sqrt,
+      MAE     = mean(abs(GOSE - prediction), na.rm = TRUE),
+      bias    = mean(prediction, na.rm = TRUE) - mean(GOSE, na.rm = TRUE),
+      `d-bias` = mean(prediction > GOSE, na.rm = TRUE) - mean(prediction < GOSE, na.rm = TRUE)
+    ) %>%
+    ungroup %>%
+    gather(error, value, -model, -fold) %>%
+    mutate(
+      error = factor(error, c(
+        "bias",
+        "d-bias",
+        "MAE",
+        "RMSE"
+      ))
+    ) %>%
+    group_by(model, error) %>%
+    summarize(
+      mean_error = mean(value),
+      se_error   = sd(value) / sqrt(n()),
+    )
+}
+df_plot <- rbind(
+  compute_summary_measures(
+      df_predictions %>%
+        mutate(prediction = ifelse(model == "LOCF", NA, prediction))
+    ) %>%
+    mutate(label = "full test set"),
+  compute_summary_measures(df_predictions %>% filter(!(gupi %in% idx))) %>%
+    mutate(label = "LOCF subset")
+) %>%
+  ungroup
+# figure 2: overall locf =======================================================
+ggplot(df_plot, aes(x = model, y = mean_error)) +
+  geom_hline(yintercept = 0, color = "black") +
+  geom_point(aes(color = label), size = .8, position = position_dodge(.4)) +
+  geom_errorbar(
+    aes(ymin = mean_error - 1.96*se_error, ymax = mean_error + 1.96*se_error, color = label),
+    width = .25, position = position_dodge(.3)
+  ) +
+  scale_color_grey("", start = .0, end = .5) +
+  facet_wrap(~error, nrow = 1) +
+  scale_y_continuous(name = "", breaks = seq(-2, 8, .25), limits = c(-.5, 1.25)) +
+  scale_x_discrete("") +
+  theme_bw() +
+  theme(
+    panel.grid.minor   = element_blank(),
+    panel.grid.major.x = element_blank(),
+    axis.text.x        = element_text(angle = 66, hjust = 1),
+    legend.position    = "bottom"
+  )
+ggsave(filename = "overall_locf.eps", width = 6, height = 3.5, colormodel = "cmyk")
+# figure 3: overall confusion ==================================================
+plot_confusion_matrices <- function(df_predictions, models, nrow = 2, legendpos, scriptsize) {
+  df_average_confusion_matrices <- df_predictions %>%
+    select(-`1`, -`2`) %>%
+    filter(model %in% models) %>%
+    group_by(fold, model) %>%
+    do(
+      confusion_matrix = caret::confusionMatrix(
+          data      = factor(.$prediction, levels = 3:8),
+          reference = factor(.$GOSE, levels = 3:8)
+        ) %>%
+        as.matrix %>% as_tibble %>%
+        mutate(`Predicted GOSE` = {row_number() + 2} %>% as.character) %>%
+        gather(`Observed GOSE`, n, 1:6)
+    ) %>%
+    unnest %>%
+    group_by(model, `Predicted GOSE`, `Observed GOSE`) %>%
+    summarize(n = mean(n)) %>%
+    ungroup %>%
+    mutate(model = factor(model, models))
+  # p_cnf_mtrx_raw <-
+  df_average_confusion_matrices %>%
+    ggplot(aes(`Observed GOSE`, `Predicted GOSE`)) +
+      geom_text(
+        aes(
+          label    = sprintf("%3.f", n) %>% ifelse(. == "0.0", "", .),
+          fontface = ifelse(`Observed GOSE` == `Predicted GOSE`, 'bold', 'plain')
+        ),
+        vjust = .5, size = scriptsize
+      ) +
+      geom_hline(yintercept = c(2, 4, 6) + .5, color = "black") +
+      geom_vline(xintercept = c(2, 4, 6) + .5, color = "black") +
+      coord_fixed(expand = FALSE, xlim = c(0.5, 6.5), ylim = c(0.5, 6.5)) +
+      labs(x = "observed GOSe", y = "imputed GOSe", fill = "") +
+      theme_bw() +
+      theme(
+        panel.grid      = element_blank(),
+        legend.position = "none"
+      ) +
+      facet_wrap(~model, nrow = nrow) +
+      ggtitle("Average confusion matrix across folds (absolute counts)")
+}
+plot_confusion_matrices(
+  df_predictions %>%
+    filter(!(gupi %in% idx)),
+  c("MSM", "GP + cov", "MM", "LOCF"),
+  nrow = 1,
+  legendpos = "none",
+  scriptsize = 2.5
+)
+ggsave(filename = "confusion_matrices_locf.eps", width = 6, height = 3, colormodel = "cmyk")
+# figure 4: errors stratified locf =============================================
+plot_summary_measures_cond <- function(df_predictions, models, label) {
+  df_predictions %>%
+    filter(model %in% models) %>%
+    group_by(model, fold, GOSE) %>%
+    summarize(
+      RMSE = mean((GOSE - prediction)^2, na.rm = TRUE) %>% sqrt,
+      MAE  = mean(abs(GOSE - prediction), na.rm = TRUE),
+      bias = mean(prediction, na.rm = TRUE) - mean(GOSE, na.rm = TRUE),
+      `d-bias` = mean(prediction > GOSE, na.rm = TRUE) - mean(prediction < GOSE, na.rm = TRUE)
+    ) %>%
+    gather(error, value, -model, -GOSE, -fold) %>%
+    group_by(GOSE, model, error, fold) %>%
+    summarize(
+        mean_per_fold = mean(value) # overall mean
+    ) %>%
+    group_by(GOSE, model, error) %>%
+    summarize(
+        mean = mean(mean_per_fold), # overall mean
+        se   = sd(mean_per_fold) / sqrt(n())
+    ) %>%
+    ungroup() %>%
+    mutate(
+      model = factor(model, models),
+      error = factor(error, c(
+        "bias",
+        "d-bias",
+        "MAE",
+        "RMSE"
+      ))
+    ) %>%
+  ggplot(aes(GOSE, color = model)) +
+      geom_hline(yintercept = 0, color = "black") +
+      geom_line(aes(y = mean)) +
+      geom_point(aes(y = mean), size = .3,
+        position = position_dodge(.3)) +
+      geom_errorbar(aes(ymin = mean - 1.96*se, ymax = mean + 1.96*se),
+        width = .33,
+        size = .5,
+        position = position_dodge(.2)
+      ) +
+      geom_point(aes(y = mean),
+        position = position_dodge(.2)) +
+      xlab("observed GOSe") +
+      facet_wrap(~error, nrow = 1) +
+      scale_y_continuous(name = "", breaks = seq(-2, 8, .5)) +
+      theme_bw() +
+      theme(
+        panel.grid.minor   = element_blank(),
+        panel.grid.major.x = element_blank(),
+        legend.title       = element_blank(),
+        legend.position    = "bottom"
+      ) +
+      ggtitle(label)
+}
+plot_summary_measures_cond(
+  df_predictions %>% filter(!(gupi %in% idx)),
+  c("MSM", "GP + cov", "MM", "LOCF"),
+  "Summary measures by observed GOSe, LOCF subset"
+)
+ggsave(filename = "errors_stratified_locf.eps", width = 6, height = 3.5, colormodel = "cmyk")
+# figure 5: confusion all ==================================================
+plot_confusion_matrices(
+  df_predictions,
+  c("MSM", "GP + cov", "MM"),
+  nrow = 1,
+  legendpos = "none",
+  scriptsize = 3
+)
+ggsave(filename = "confusion_matrices_all.eps", width = 6, height = 3, colormodel = "cmyk")
+# figure 6: errors overall ==================================================
+plot_summary_measures_cond(
+  df_predictions %>% filter(!(gupi %in% idx)),
+  c("MSM", "GP + cov", "MM"),
+  "Summary measures by observed GOSe, full test set"
+)
+ggsave(filename = "imputation_error.eps", width = 6, height = 3.5, colormodel = "cmyk")
+# figure 7:marginal gose per fold ==================================================
+df_ground_truth %>%
+    ggplot(aes(Outcomes.DerivedCompositeGOSE)) +
+        geom_bar() +
+        facet_wrap(~fold) +
+        theme_bw() +
+        ylab("") +
+        xlab("GOSe") +
+        theme(
+            panel.grid.minor   = element_blank(),
+            panel.grid.major.x = element_blank()
+        )
+ggsave(filename = "gose_marignal_per_fold.eps", width = 6, height = 3, colormodel = "cmyk")
+# figure 8: diagram ==================================================
+Q <- matrix(0, nrow = 8, ncol = 8)
+for (i in 1:8) {
+  for (j in 1:8) {
+    if (i == j - 1 & i != 1) {
+      Q[i, j] <- 1
+    }
+    if (i == j + 1) {
+      Q[i, j] <- 1
+    }
+  }
+}
+Q[2:8, 1] <- 1 # allow instantaneous deaths
+Q <- t(Q)
+curvature <- Q / 5
+curvature[1, ] <- 0
+setEPS()
+postscript("diagram.eps", width = 8, height = 4)
+diagram::plotmat(
+  Q,
+  relsize = 1,
+  shadow.size = 0,
+  pos = c(1, 7),
+  box.size = .033,
+  arr.pos = .55,
+  arr.length = .33,
+  arr.type = "triangle",
+  cex.txt = 0,
+  curve = curvature
+)
+dev.off()
+unlink('Rplots.pdf')
--- a/manuscript/varlabels.csv
+++ b/manuscript/varlabels.csv