Commit 0badf884 authored by Kevin Kunzmann

revision of manuscript

parent 21167658
......@@ -463,9 +463,7 @@ for (i in 1:config$folds) {
)
}
```
```{r misc, include=FALSE}
modelnames <- lapply(
list.dirs(
sprintf("%s/validation/posteriors", params$data_dir),
......@@ -540,7 +538,7 @@ All models were fit on the entire available data after removing the
180 +/- 14 days post-injury observations from the respective test fold,
thus mimicking a missing-completely-at-random (MCAR) missingness mechanism.
The distribution of GOSe values in the respective three test sets is well
balanced, cf. Figure ??? (Appendix).
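To make the fold-wise masking concrete, a minimal sketch is given below; it is
illustrative only, and the column name `days` (time post-injury) as well as the
precomputed fold assignment `fold` are assumptions, not identifiers taken from
the manuscript code.
```r
library(dplyr)

# Sketch: remove the 180 +/- 14 days post-injury observations of the current
# test fold before refitting; the held-out rows later serve as ground truth.
# 'df_gose' (long format: gupi, fold, days, GOSE) is a hypothetical object.
mask_test_fold <- function(df_gose, test_fold) {
  df_gose %>%
    filter(!(fold == test_fold & abs(days - 180) <= 14))
}
```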
Performance is assessed using the absolute-count and the normalized
(proportions) confusion matrices as well as
bias, mean absolute error (MAE), and root mean squared error (RMSE).
......@@ -551,7 +549,7 @@ RMSE puts more weight on large deviations as compared to MAE.
Comparisons in terms of bias, MAE, and RMSE tacitly assume that
GOSe values can be sensibly interpreted on an interval scale.
We therefore also consider $Pr[est > true] - Pr[est < true]$ as an
alternative measure of bias which does not require this tacit assumption.
Note that its scale is not directly comparable to that of the
other three quantities.
All measures are considered both conditional on the ground-truth
......@@ -584,83 +582,79 @@ is depicted in Figure ??? both conditional on LOCF being applicable and,
excluding LOCF, on the entire test set.
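Before turning to the results, the error measures can be made explicit with a
short, self-contained sketch on toy vectors (not CENTER-TBI data); `observed`
and `imputed` correspond to `GOSE` and `prediction` in the chunks below.
```r
# Toy illustration of the summary measures used throughout this section
observed <- c(3, 5, 6, 7, 4, 8)
imputed  <- c(3, 4, 6, 6, 5, 8)

bias      <- mean(imputed) - mean(observed)       # interval-scale bias
mae       <- mean(abs(imputed - observed))        # mean absolute error
rmse      <- sqrt(mean((imputed - observed)^2))   # penalizes large deviations
sign_bias <- mean(imputed > observed) - mean(imputed < observed)  # Pr[est > true] - Pr[est < true]
```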
```{r overall-comparison-all-methods, echo=FALSE, fig.height=7}
plot_summary_measures <- function(df, label) {
df %>%
group_by(model, fold) %>%
summarize(
RMSE = mean((GOSE - prediction)^2, na.rm = TRUE) %>% sqrt,
MAE = mean(abs(GOSE - prediction), na.rm = TRUE),
Bias = mean(prediction, na.rm = TRUE) - mean(GOSE, na.rm = TRUE),
`Pr[est. > true] - Pr[est. < true]` = mean(prediction > GOSE, na.rm = TRUE) - mean(prediction < GOSE, na.rm = TRUE)
) %>%
ungroup %>%
gather(error, value, -model, -fold) %>%
mutate(
error = factor(error, c(
"Bias",
"Pr[est. > true] - Pr[est. < true]",
"MAE",
"RMSE"
))
) %>%
ggplot(aes(model, value)) +
geom_hline(yintercept = 0, color = "black") +
geom_boxplot() +
facet_wrap(~error, nrow = 1) +
scale_y_continuous(name = "", breaks = seq(-2, 8, .25), limits = c(-.5, 1.5)) +
scale_x_discrete("") +
theme_bw() +
theme(
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
axis.text.x = element_text(angle = 66, hjust = 1)
) +
ggtitle(label)
}
cowplot::plot_grid(
plot_summary_measures(
df_predictions %>% filter(!(gupi %in% idx)),
"Summary measures, LOCF subset"
),
plot_summary_measures(
df_predictions,
"Summary measures, full test set"
),
ncol = 1, align = "v", axis = "lr"
)
```
Firstly, LOCF is overall negatively biased, i.e.,
on average it imputes lower-than-observed GOSe values.
This reflects a population-average trend towards continued
recovery within the first 6 months post-injury.
The fact that both ways of quantifying bias agree qualitatively
indicates that the interpretation of GOSe as an interval measure, which
tacitly underlies the Bias, MAE, and RMSE comparisons, is not too restrictive.
In terms of accuracy, LOCF does perform worst, but differences between
methods are less pronounced than in terms of bias.
Notably, the RMSE difference between LOCF and the other methods is slightly
larger than the MAE difference, indicating that LOCF tends to produce more
large deviations, i.e., deviations across several GOSe categories.
Secondly, the inclusion of baseline covariates only notably affects the GP
regression model.
Both the MM and MSM model perform more or less the same irrespective of adjustment
for baseline covariates.
This indicates that the additional predictive value of baseline
covariates over the information contained in at least one observed GOSe value
is limited.
Furthermore, note that both variants of the mixed effects model fail to correct
the overall bias of the imputed values.
We proceed with a detailed analysis of a subset of models, both in direct
comparison with LOCF and in the entire data set including those cases where
LOCF is not applicable.
In the following we only consider the baseline-adjusted Gaussian process model
......@@ -668,96 +662,137 @@ In the following we only consider the baseline-adjusted Gaussian process model
and the multi-state model without baseline covariates ('MSM').
The rationale for dropping the baseline adjustment for MM and MSM is that
the additional complexity does not substantially alter overall performance.
On the other hand, the GP model benefits from the inclusion of the IMPACT
baseline covariates.
## Detailed comparison conditional on LOCF subset
We first consider results for the set of test cases which allow LOCF imputation
(n = `r df_predictions %>% filter(model == "LOCF") %>% nrow - length(idx)`).
Both the raw-count and the relative (by left-out true GOSe) confusion matrices
are presented in Figure ???.
```{r confusion-matrix-locf, warning=FALSE, message=FALSE, echo=FALSE, fig.cap="Confusion matrices on LOCF subset."}
plot_confusion_matrices <- function(df_predictions, models) {
df_average_confusion_matrices <- df_predictions %>%
filter(model %in% models) %>%
group_by(fold, model) %>%
do(
confusion_matrix = caret::confusionMatrix(
data = factor(.$prediction, levels = 1:8),
reference = factor(.$GOSE, levels = 1:8)
) %>%
as.matrix %>% as_tibble %>%
mutate(`Predicted GOSE` = row_number() %>% as.character) %>%
gather(`True GOSE`, n, 1:8)
) %>%
unnest %>%
group_by(model, `Predicted GOSE`, `True GOSE`) %>%
summarize(n = mean(n)) %>%
ungroup %>%
mutate(model = factor(model, models))
p_cnf_mtrx_raw <- df_average_confusion_matrices %>%
ggplot(aes(`True GOSE`, `Predicted GOSE`, fill = n)) +
geom_raster() +
geom_text(aes(
label = sprintf("%.1f", n) %>%
ifelse(. == "0.0", "", .)
),
size = 1.5
) +
geom_hline(yintercept = c(2, 4, 6) + .5, color = "black") +
geom_vline(xintercept = c(2, 4, 6) + .5, color = "black") +
scale_fill_gradient(low = "white", high = "#555555") +
coord_fixed(expand = FALSE) +
labs(x = "true GOSe", y = "imputed GOSe", fill = "") +
theme_bw() +
theme(
panel.grid = element_blank()
) +
facet_wrap(~model, nrow = 1) +
ggtitle("Average confusion matrix accross folds (absolute counts)")
p_cnf_mtrx_colnrm <- df_average_confusion_matrices %>%
group_by(model, `True GOSE`) %>%
mutate(
`fraction (column)` = n / sum(n),
`fraction (column)` = ifelse(is.nan(`fraction (column)`), 0, `fraction (column)`)
) %>%
ggplot(aes(`True GOSE`, `Predicted GOSE`, fill = `fraction (column)`)) +
geom_raster() +
geom_hline(yintercept = c(2, 4, 6) + .5, color = "black") +
geom_vline(xintercept = c(2, 4, 6) + .5, color = "black") +
scale_fill_gradient("", low = "white", high = "black", limits = c(0, 1)) +
coord_fixed(expand = FALSE) +
labs(x = "true GOSe", y = "imputed GOSe", fill = "") +
theme_bw() +
theme(
panel.grid = element_blank()
) +
facet_wrap(~model, nrow = 1) +
ggtitle("Average confusion matrix accross folds (column fraction)")
cowplot::plot_grid(p_cnf_mtrx_raw, p_cnf_mtrx_colnrm, ncol = 1, align = "v")
}
plot_confusion_matrices(
df_predictions %>% filter(!(gupi %in% idx)),
c("MSM", "GP + cov", "MM", "LOCF")
)
ggsave(filename = "confusion_matrices_locf.pdf", width = 7, height = 7)
ggsave(filename = "confusion_matrices_locf.png", width = 7, height = 7)
ggsave(filename = "confusion_matrices_locf.pdf", width = 7, height = 6)
ggsave(filename = "confusion_matrices_locf.png", width = 7, height = 6)
```
The absolute-count confusion matrices show that most imputed values are
within +/- one GOSe category of the observed ones.
However, they also reflect the category imbalance (cf. Figures ??? and ???,
Appendix) in the study population.
The performance conditional on the (in practice unknown) observed GOSe value
clearly shows that imputation performance for the most infrequent category 4
is most problematic.
This is, however, true across the range of methods considered.
Both the MSM and the MM models account for this by almost never imputing a
GOSe of 4.
Instead, the respective cases tend to be imputed to GOSe 3 or 5.
Additionally, the following table ??? lists some confusion probabilities
of particular interest.
**TODO:**
* this table is the one David requested in our last meeting,
not entirely convinced though ...
* ... 1 -> > 1 is not relevant since our imputation is conditional
on not being 1 at 6 months
* ... the comparison seems to favor LOCF since only
upward confusions are considered (which LOCF by design tends to do less)
* Is there a clinical interpretation along the lines that '4' might constitute
a short-term transition state, or is it just defined in a way that makes it
highly unlikely to be observed in practice?
```{r crossing-table, echo=FALSE, warning=FALSE, results='asis'}
models <- c("MSM", "GP + cov", "MM")
df_average_confusion_matrices <- df_predictions %>%
filter(model %in% models) %>%
filter(!(gupi %in% idx)) %>%
group_by(fold, model) %>%
do(
confusion_matrix = caret::confusionMatrix(
data = factor(.$prediction, levels = 1:8),
reference = factor(.$GOSE, levels = 1:8)
) %>%
as.matrix %>% as_tibble %>%
mutate(`Predicted GOSE` = row_number() %>% as.character) %>%
gather(`True GOSE`, n, 1:8)
) %>%
unnest %>%
group_by(model, `Predicted GOSE`, `True GOSE`) %>%
summarize(n = mean(n)) %>%
ungroup %>%
mutate(model = factor(model, models))
rbind(
df_average_confusion_matrices %>%
filter(model %in% c("LOCF", "MM", "GP + cov", "MSM")) %>%
......@@ -791,48 +826,71 @@ df_average_confusion_matrices %>%
pander::pandoc.table("Some specific confusion percentages, LOCF subset.", digits = 3)
```
To better understand the overall performance assessment in Figure ???,
we also consider the performance conditional on the respective ground-truth
(i.e., the observed GOSe values in the test sets).
The results are shown in Figure ??? (vertical bars are +/- one standard error of the mean).
```{r error-scores-locf, echo=FALSE, fig.height=3, fig.width=9}
plot_summary_measures_cond <- function(df_predictions, models, label) {
df_predictions %>%
filter(model %in% models) %>%
group_by(model, fold, GOSE) %>%
summarize(
RMSE = mean((GOSE - prediction)^2, na.rm = TRUE) %>% sqrt,
MAE = mean(abs(GOSE - prediction), na.rm = TRUE),
Bias = mean(prediction, na.rm = TRUE) - mean(GOSE, na.rm = TRUE),
`Pr[est. > true] - Pr[est. < true]` = mean(prediction > GOSE, na.rm = TRUE) - mean(prediction < GOSE, na.rm = TRUE)
) %>%
gather(error, value, -model, -GOSE, -fold) %>%
group_by(GOSE, model, error, fold) %>%
summarize(
mean_per_fold = mean(value) # overall mean
) %>%
group_by(GOSE, model, error) %>%
summarize(
mean = mean(mean_per_fold), # overall mean
se = sd(mean_per_fold) / sqrt(n())
) %>%
ungroup() %>%
mutate(
model = factor(model, models),
error = factor(error, c(
"Bias",
"Pr[est. > true] - Pr[est. < true]",
"MAE",
"RMSE"
))
) %>%
ggplot(aes(GOSE, color = model)) +
geom_hline(yintercept = 0, color = "black") +
geom_errorbar(aes(ymin = mean - se, ymax = mean + se),
width = .2,
position = position_dodge(.33)
) +
geom_line(aes(y = mean), alpha = .5) +
xlab("GOSe") +
facet_wrap(~error, nrow = 1) +
scale_y_continuous(name = "", breaks = seq(-2, 8, .25)) +
theme_bw() +
theme(
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
legend.title = element_blank()
) +
ggtitle(label)
}
plot_summary_measures_cond(
df_predictions %>% filter(!(gupi %in% idx)),
c("MSM", "GP + cov", "MM", "LOCF"),
"Summary measures by observed GOSe, LOCF subset"
)
ggsave(filename = "errors_stratified_locf.pdf", width = 7, height = 7)
ggsave(filename = "errors_stratified_locf.png", width = 7, height = 7)
ggsave(filename = "errors_stratified_locf.pdf", width = 9, height = 3)
ggsave(filename = "errors_stratified_locf.png", width = 9, height = 3)
```
Just as with the overall performance, differences are most pronounced in terms
......@@ -871,69 +929,40 @@ where only GOSe values after 180 days post-injury are available.
The relative characteristics of the three considered approaches are comparable
to the LOCF subset.
**TODO**
* decide whether these figures go in the appendix - David and I agree on them actually being the
primary analysis. We just need to convince people that LOCF should be dropped *first*. As always, I am open to debate this, but we should just make a decision: figurexit or figuremain?
```{r confusion-matrix, warning=FALSE, message=FALSE, echo=FALSE, fig.cap="Confusion matrices, full training set without LOCF."}
plot_confusion_matrices(
df_predictions,
c("MSM", "GP + cov", "MM")
)
ggsave(filename = "confusion_matrices_all.pdf", width = 7, height = 7)
ggsave(filename = "confusion_matrices_all.png", width = 7, height = 7)
ggsave(filename = "confusion_matrices_all.pdf", width = 7, height = 6)
ggsave(filename = "confusion_matrices_all.png", width = 7, height = 6)
```
```{r crossing-table-full, echo=FALSE, warning=FALSE, results='asis'}
models <- c("MSM", "GP + cov", "MM")
df_average_confusion_matrices <- df_predictions %>%
filter(model %in% models) %>%
group_by(fold, model) %>%
do(
confusion_matrix = caret::confusionMatrix(
data = factor(.$prediction, levels = 1:8),
reference = factor(.$GOSE, levels = 1:8)
) %>%
as.matrix %>% as_tibble %>%
mutate(`Predicted GOSE` = row_number() %>% as.character) %>%
gather(`True GOSE`, n, 1:8)
) %>%
unnest %>%
group_by(model, `Predicted GOSE`, `True GOSE`) %>%
summarize(n = mean(n)) %>%
ungroup %>%
mutate(model = factor(model, models))
rbind(
df_average_confusion_matrices %>%
group_by(model) %>%
......@@ -964,39 +993,15 @@ df_average_confusion_matrices %>%
pander::pandoc.table("Some specific confusion percentages, full data set.", digits = 3)
```
```{r error-scores-all, echo=FALSE, fig.height=3, fig.width=9}
plot_summary_measures_cond(
df_predictions,
c("MSM", "GP + cov", "MM"),
"Summary measures by observed GOSe, full test set"
)
ggsave(filename = "imputation_error.pdf", width = 7, height = 7)
ggsave(filename = "imputation_error.png", width = 7, height = 7)
ggsave(filename = "imputation_error.pdf", width = 9, height = 3)
ggsave(filename = "imputation_error.png", width = 9, height = 3)
```
......@@ -1005,40 +1010,49 @@ ggsave(filename = "imputation_error.png", width = 7, height = 7)
# Discussion
The most important measure to prevent missing data bias
is to avoid missing values in the first place!
Study protocols and procedures should be checked preemptively for their
applicability to clinical practice, and adequate effort should be put into
ensuring good compliance with follow-up protocols.
Nevertheless, in practice, missing values due to loss-to-follow-up will always
occur and should be addressed effectively.
There is a wide consensus that statistically sound imputation of missing values
is beneficial both for reducing bias and for increasing statistical power.
The current gold standard for imputing missing values is multiple
imputation on a per-analysis basis including analysis-specific covariates to
further reduce bias and to preserve the imputation uncertainty in the
downstream analysis.
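As an illustration of this per-analysis workflow, a minimal sketch using the
mice package is given below; the data set `df_analysis` and its variables are
hypothetical placeholders, and GOSe is treated as numeric purely for brevity.
```r
library(mice)

# Sketch: per-analysis multiple imputation with analysis-specific covariates,
# followed by pooling of the analysis results via Rubin's rules.
imp  <- mice(df_analysis, m = 20, seed = 42)          # 20 imputed data sets
fits <- with(imp, lm(gose_6mo ~ age + gcs_baseline))  # analysis refitted per imputation
pool(fits)                                            # pooled estimates and variances
```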
In practice, however, there are good reasons for providing a set of single-imputed
default values in large observational studies such as CENTER-TBI.
Consortia are increasingly committed to making their databases
available to a wider range of researchers.
In fact, more liberal data-sharing policies are becoming a core requirement
for funding bodies (cf. https://www.openaire.eu/).
In this context, it might not be possible to ensure that every analysis team
has the necessary statistical expertise to properly conduct a per-analysis
multiple imputation in the future.
Furthermore, the imputed values of a multiple-imputation procedure are
inherently random and it is thus difficult to ensure consistency across
different analysis teams if the values themselves cannot be stored
directly in a database.
For this reason, as a practical way forward, we suggest providing a default
single imputation with appropriate measures of uncertainty for key outcomes
in the published database itself.
This mitigates problems with complete-case analyses and provides a
principled and consistent default approach to handling missing values.
Since we strongly suggest employing a model-based approach to imputation,
the fitted class probabilities can be provided in the core database alongside the
imputed values.
Based on these probabilities, it is easy to draw samples for a multiple imputation
analysis.
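A minimal sketch of such a sampling step is given below; the object `df_probs`
(one row per patient, columns `p1`, ..., `p8` holding the fitted GOSe class
probabilities) is a hypothetical placeholder for the stored model output.
```r
# Sketch: draw m imputed GOSe values per patient from stored class probabilities;
# each column of the returned matrix is one complete imputation.
draw_imputations <- function(df_probs, m = 10) {
  prob_matrix <- as.matrix(df_probs[, paste0("p", 1:8)])
  replicate(m, apply(prob_matrix, 1, function(p) sample(1:8, size = 1, prob = p)))
}
```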
Wherever necessary and practical, a custom, analysis-specific multiple
imputation approach might still be employed.
In these cases, the model providing the single-imputed values may
be used as a starting point.
Several reasons disqualify LOCF as the method of choice.
Not only is it inherently biased, but LOCF is also inefficient in that it
......@@ -1053,34 +1067,29 @@ necessary in some cases to further reduce bias.
Finally, LOCF cannot produce an adequate measure of imputation uncertainty
since it is not model based.
We draw two main conclusions from our comparison of three
alternative, model-based approaches.
Firstly, despite its theoretical drawbacks, LOCF is hard to beat in terms of
accuracy.
Still, small improvements are possible (cf. Section ???).
The main advantages of a model-based approach are thus a reduction of bias,
the ability to provide a measure of uncertainty together with the imputed
values (or to use the very same model to draw multiple imputations),
as well as the possibility of including further analysis-specific covariates.
Secondly, we found that the inclusion of established baseline predictors for
GOSe at 6 months post-injury had little effect on the imputation quality.
Note that this does not refute their predictive value but only indicates that
there is little marginal benefit over knowing at least one other GOSe value.
Differences between the model-based approaches tend to be rather nuanced.
We nevertheless favor the multi-state model (MSM).
It is well-interpretable in terms of transition intensities.
An efficient implementation [@msm2011] in standard statistical software [@R2016] is
available.
It succeeds in eliminating the bias introduced by LOCF, is
able to provide imputed values for the entire population and is able to
provide a probabilistic output.
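For reference, a minimal sketch of fitting such a multi-state model with the
msm package is given below; the long-format data set `df_long` (columns `gupi`,
`days`, `gose`) and the simple adjacent-state transition structure are
illustrative assumptions and need not match the model specification used in
this manuscript.
```r
library(msm)

# Sketch: GOSe categories 1-8 as states of a continuous-time multi-state model;
# only transitions between adjacent states are allowed, state 1 (death) is absorbing.
Q <- matrix(0, 8, 8)
for (s in 2:7) Q[s, s + 1] <- 0.1  # improvement to the next-higher state
for (s in 2:8) Q[s, s - 1] <- 0.1  # deterioration to the next-lower state
fit <- msm(gose ~ days, subject = gupi, data = df_long, qmatrix = Q)

# probabilistic output: transition probabilities over a 180-day horizon,
# from which per-patient GOSe class probabilities can be derived
pmatrix.msm(fit, t = 180)
```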