Commit b0dd68da authored by Kevin Kunzmann

manuscript details, more compact figures

parent 32996c4e
The 3-month outcome is recognized as one approach to substituting for missing
and has been used in recent trials (@skolnick2014).
Although LOCF is easy to understand and implement, the technique is
suboptimal in several respects.
Firstly, it is biased in that it neglects potential time trends in
GOS(e) trajectories.
Secondly, a naive application of LOCF is also inefficient since it neglects
data observed shortly after the target time window.
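For concreteness, the LOCF rule can be sketched in a few lines (illustrative Python, not part of the analysis code; the helper name and the 180-day target are assumptions for the example):

```python
def locf_impute(times, values, target_day=180):
    """Last observation carried forward: return the GOSe value of the
    last assessment at or before target_day, or None if no such
    assessment exists (the case where LOCF is not applicable)."""
    last = None
    for t, v in zip(times, values):  # times assumed sorted ascending
        if t > target_day:
            break
        last = v
    return last

# Assessed at 14 and 95 days, then again at 370 days: LOCF uses the
# 95-day value and ignores the later observation entirely.
locf_impute([14, 95, 370], [3, 4, 6])  # -> 4
locf_impute([370], [6])                # -> None (no prior observation)
```

This also makes the two drawbacks concrete: the 370-day observation is discarded, and a subject with only post-target assessments receives no imputation at all.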
to further reduce bias introduced by the imputation method and that LOCF
cannot be used to obtain multiply imputed data sets by design.
The timing of outcome assessments for patients with TBI varies between
studies.
Some studies define very stringent time windows (e.g. +/- 2 weeks [TRACK-TBI]),
but this can lead to a substantial amount of missing data (@richter2019).
Consequently, other studies have defined more pragmatic protocol windows
(e.g. -1 month to +2 months, cf. @maas2014).
While the wider windows enable more complete data collection,
they suffer from the problem that outcomes may still be evolving over this period,
and an outcome assessment obtained at five months
(the beginning of this window) in one subject may not be strictly comparable
with outcomes obtained just before eight months (the end of the window) in
another subject.
Consequently, even where outcomes are available within pragmatic protocol
windows,
there may be a benefit from being able to impute an outcome more
n_pat <- nrow(df_baseline)
```
The CENTER-TBI project methods and design are described in detail
elsewhere (@maas2014).
Participants with TBI were recruited into three strata:
(a) patients attending the emergency room,
(b) patients admitted to hospital but not intensive care,
and (c) patients admitted to intensive care.
Follow-up of participants was scheduled per protocol at 2 weeks, 3 months,
and 6 months in group (a) and at 3 months, 6 months, and 12 months in groups
(b) and (c).
The protocol time window for the 6-month GOSe was between 5 and 8 months
post injury.
Outcome assessments at all timepoints included the GOSe
The GOSe is an eight-point scale with the following categories:
The study population for this empirical methods comparison are all
individuals from the CENTER-TBI database (total of n = 4509) whose GOSe
status was queried at least once within the first 18 months and who were
still alive 180 days post-injury (n = `r n_pat`).
The rationale for conducting the comparison conditional on 6-month survival
is simply that GOSe can only be missing at 6 months if the individual is
still alive, since GOSe would be 1 (dead) otherwise.
Data for the CENTER-TBI study have been collected through the Quesgen e-CRF
(Quesgen Systems Inc, USA),
hosted on the INCF platform and extracted via the INCF Neurobot tool (https://neurobot.incf.org/).
Release 1.1 of the database was used (cf. Appendix for details).
Basic summary statistics for population characteristics are listed in
Table 1.
GOSe observations at 180 +/- 14 days post injury were available and
GOSe observations within the per-protocol window of 5-8 months post injury.
The distribution of GOSe sampling times and both absolute and
relative frequencies of the respective GOSe categories are shown in
Figure 1.
True observation times were mapped to categories by rounding to the
closest time point, i.e.,
the 6 months category contains observations up to 9 months post-injury.
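This mapping is a simple nearest-neighbour rounding on the protocol visit grid; a sketch (illustrative Python; the day values for the scheduled visits are assumptions based on the protocol described above):

```python
# Scheduled follow-up visits in days post-injury (assumed values:
# 2 weeks, 3 months, 6 months, 12 months).
SCHEDULED_DAYS = [14, 90, 180, 360]

def nearest_visit(day):
    """Assign an observation time to the closest scheduled visit."""
    return min(SCHEDULED_DAYS, key=lambda visit: abs(visit - day))

nearest_visit(265)  # -> 180: ~9 months still maps to the 6-month category
```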
model (MM),
a Gaussian process regression (GP), and a multi-state model (MSM).
For all model-based approaches we additionally explored variants
including the key IMPACT [@steyerberg2008] predictors as covariates.
These are age, GCS motor score, pupil reactivity (0, 1, 2), hypoxia,
hypotension, Marshall CT classification, traumatic subarachnoid hemorrhage,
epidural hematoma, glucose, and Hb.
in a Bayesian non-parametric way [@rasmussen2006].
Both the mixed effects model as well as the Gaussian process regression
model are non-linear regression techniques for longitudinal
data.
While these are powerful tools to model longitudinal trajectories,
they do not explicitly model the probability of transitions between
GOSe states.
Since the number of observations per individual is limited in our
data set (1 to 4 GOSe observations per individual),
an approach explicitly modelling transition probabilities might be
more suitable to capture the dynamics of the GOSe trajectories.
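To illustrate why explicit transition modelling is attractive, consider a toy discrete-time Markov chain (the MSM used here is a continuous-time model with transition intensities; the states and probabilities below are purely illustrative, not fitted values):

```python
# Toy 3-state, discrete-time analogue of the multi-state idea.
# Rows: current state, columns: next state, per 3-month step.
P = [
    [0.7, 0.2, 0.1],   # from state 0
    [0.1, 0.6, 0.3],   # from state 1
    [0.0, 0.1, 0.9],   # from state 2
]

def step(dist, P):
    """Propagate a state distribution one transition step forward."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

dist_3m = [1.0, 0.0, 0.0]   # observed in state 0 at 3 months
dist_6m = step(dist_3m, P)  # implied probabilistic imputation at 6 months
```

Even a single earlier observation thus induces a full predictive distribution over later states, which is exactly the kind of probabilistic output the model-based imputation needs.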
was considered (@meira2009).
All models were fitted using either none or all IMPACT predictors, except for the
MSM model which only used age due to issues with numerical stability.
Further details on the respective implementations are given in the Appendix.
All models were fit on the entire available data after removing the
180 +/- 14 days post-injury observation from the respective test fold
thus mimicking a missing-completely-at-random (MCAR) missing-data mechanism.
The distribution of GOSe values in the respective three test sets is well
balanced (cf. Appendix, Figure A.1).
Performance was assessed using the absolute-count and the normalized
(proportions) confusion matrices as well as
bias, mean absolute error (MAE), and root mean squared error (RMSE).
MAE and RMSE are both measures of average precision, where
RMSE puts more weight on large deviations as compared to MAE.
Comparisons in terms of bias, MAE, and RMSE tacitly assume that
GOSe values can be sensibly interpreted on an interval scale.
We therefore also considered directional bias (bias'),
defined as the model-fitted
probability of exceeding the true value minus the model-fitted probability of
undershooting the true GOSe ($Pr[imputed > observed] - Pr[imputed < observed]$) as an
alternative measure of bias which does not require this assumption.
Note that the scale of the directional bias is not directly comparable to
that of the other three quantities.
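As a sketch of how the four measures relate (plain Python with hypothetical helper names; bias' operates on fitted category probabilities rather than point imputations):

```python
import math

def bias(imputed, observed):
    """Mean signed difference (negative: imputations too low)."""
    return sum(i - o for i, o in zip(imputed, observed)) / len(observed)

def mae(imputed, observed):
    return sum(abs(i - o) for i, o in zip(imputed, observed)) / len(observed)

def rmse(imputed, observed):
    return math.sqrt(
        sum((i - o) ** 2 for i, o in zip(imputed, observed)) / len(observed)
    )

def directional_bias(probs, observed):
    """bias': Pr[imputed > observed] - Pr[imputed < observed], averaged
    over subjects; probs[k] maps GOSe category -> fitted probability."""
    over = sum(p for pk, o in zip(probs, observed)
               for g, p in pk.items() if g > o)
    under = sum(p for pk, o in zip(probs, observed)
                for g, p in pk.items() if g < o)
    return (over - under) / len(observed)

# Errors of -1 and +1 cancel in the bias but not in MAE/RMSE:
bias([3, 5, 8], [4, 5, 7])  # -> 0.0
mae([3, 5, 8], [4, 5, 7])   # -> 2/3
```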
idx <- df_predictions %>%
filter(model == "LOCF", !complete.cases(.)) %>%
.[["gupi"]]
```
LOCF, by design, cannot provide imputed values when there are no
observations before 180 days post injury.
A valid comparison of LOCF with the other methods must therefore be
approach was similar to the overall dataset (cf. Appendix, Table ???).
# Results
The overall performance of all fitted models in terms of bias, bias', MAE, and RMSE
is depicted in Figure 2 both conditional on LOCF being applicable (gray) and,
excluding LOCF, on the entire test set (black).
Values are reported as mean over the three cross-validation folds and
error bars indicate +/- 1.96 standard errors.
```{r overall-comparison-all-methods, echo=FALSE, fig.height=3.5, fig.width=6, warning=FALSE}
compute_summary_measures <- function(df) {
```

Firstly, LOCF is overall negatively biased, i.e.,
on average it imputes lower-than-observed GOSe values.
This reflects a population average trend towards continued
recovery within the first 6 months post injury.
The fact that both ways of measuring bias qualitatively agree
indicates that the interpretation of GOSe as an interval measure which
tacitly underlies Bias, MAE, and RMSE comparisons is not too restrictive.
In terms of accuracy, LOCF does perform worst but differences between
baseline covariates.
We first consider results for the set of test cases which allow LOCF imputation
(n = `r df_predictions %>% filter(model == "LOCF") %>% nrow - length(idx)`).
Both the raw count as well as the relative (by left-out observed GOSe) confusion matrices
are presented in Figure 3.
The GOSe scale is restricted to 3+ since imputation is conditional on
an observed GOSe larger than 1 (deaths are known, so no imputation is necessary)
and no GOSe of 2 was observed.
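The relative (row-normalized) matrices are obtained by dividing each row of the raw counts by its row sum; a generic sketch (illustrative Python, not the manuscript's R plotting code):

```python
def normalize_rows(counts):
    """Turn an absolute-count confusion matrix (rows: observed GOSe,
    columns: imputed GOSe) into row-wise proportions."""
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized

normalize_rows([[8, 2], [1, 3]])  # -> [[0.8, 0.2], [0.25, 0.75]]
```

Row-normalization is what makes performance comparable across GOSe categories despite the category imbalance in the study population.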
```{r confusion-matrix-locf, warning=FALSE, message=FALSE, echo=FALSE, fig.cap="Confusion matrices on LOCF subset.", fig.height=6, fig.width=6}
ggsave(filename = "confusion_matrices_locf.png", width = 6, height = 6)
```
The absolute-count confusion matrices show that most imputed values are
within +/- one GOSe category of the observed ones.
However, they also reflect the category imbalance (cf. Figure 1) in the study population.
The performance conditional on the (in practice unknown) observed GOSe value
clearly shows that imputation for the most infrequent category 4
is hardest.
This is, however, true across the range of methods considered.
Both the MSM and the MM models account for this by almost never imputing a
GOSe of 4.
Instead, the respective cases tend to be imputed to GOSe 3 or 5.
To better understand the overall performance assessment in Figure 2,
we also consider the performance conditional on the respective ground-truth
(i.e. the observed GOSe values in the test sets).
The results are shown in Figure 4 (vertical bars are +/- 1.96 standard errors of the mean).
```{r error-scores-locf, echo=FALSE, fig.height=3.5, fig.width=6}
plot_summary_measures_cond <- function(df_predictions, models, label) {
ggplot(aes(GOSE, color = model)) +
geom_hline(yintercept = 0, color = "black") +
geom_line(aes(y = mean)) +
geom_errorbar(aes(ymin = mean - 1.96*se, ymax = mean + 1.96*se),
width = .2,
position = position_dodge(.33),
size = 1
) +
xlab("observed GOSe") +
facet_wrap(~error, nrow = 1) +
scale_y_continuous(name = "", breaks = seq(-2, 8, .5)) +
theme_bw() +
theme(
panel.grid.minor = element_blank(),
clearly dominates with respect to bias.
Note that irrespective of the exact definition of bias used, MSM dominates the other
model-based approaches.
Comparing LOCF and MSM, there is a slight advantage of MSM in terms of accuracy for
the majority classes 3, 7, 8, which explains the overall difference shown in Figure 2.
With respect to bias, MSM also performs better than LOCF for the most frequently
observed categories, but the extent of this improvement depends on the performance measure.
great effort.
It is thus of the utmost importance to implement measures for avoiding missing
data in the first place.
Nevertheless, in practice, missing values due to loss-to-follow-up will always
occur and should be addressed effectively.
There is a wide consensus that statistically sound imputation of missing values
is beneficial for both the reduction of bias and for increasing statistical power.
The current gold-standard for imputing missing values is multiple
......@@ -941,17 +949,17 @@ the respective target population.
Albeit simple to implement, LOCF - by definition - is not capable of
exploiting longitudinal information after the target time point.
This results in a smaller subset of individuals for which imputed values can
be provided in the first place.
LOCF also lacks flexibility to adjust for further covariates which might be
necessary in some cases to further reduce bias under a missing at random assumption.
Finally, LOCF cannot produce an adequate measure of imputation uncertainty
since it is not model based.
We draw two main conclusions from our comparison of three
alternative, model-based approaches.
Firstly, despite its theoretical drawbacks, LOCF is hard to beat in terms of
accuracy.
Still, small improvements are possible.
The main advantages of a model-based approach are thus a reduction of bias,
the ability to provide a measure of uncertainty together with the imputed
values
as well as the possibility of including further analysis-specific covariates.
Secondly, we found that the inclusion of established baseline predictors for
GOSe at 6 months post-injury had little effect on the imputation quality.
Note that this does not refute their predictive value but only indicates that
there is little marginal benefit over knowing at least one other value.
Differences between the model-based approaches tend to be rather nuanced.
We nevertheless favor the multi-state model (MSM).
It is well-interpretable in terms of transition intensities.
able to provide imputed values for the entire population and is able to
provide a probabilistic output.
## Funding sources statement
Data used in preparation of this manuscript were obtained in the context of
CENTER-TBI, a large collaborative project with the support of the European
Union 7th Framework program (EC grant 602150).
Additional funding was obtained from the Hannelore Kohl Stiftung (Germany),
from OneMind (USA) and from Integra LifeSciences Corporation (USA).
# Appendix / Supplemental Material
## Ethical approval statement
The CENTER-TBI study (EC grant 602150) has been conducted in accordance with all relevant laws of the EU if directly applicable or of direct effect and all relevant laws of the country where the Recruiting sites were located, including but not limited to, the relevant privacy and data protection laws and regulations (the “Privacy Law”), the relevant laws and regulations on the use of human materials, and all relevant guidance relating to clinical studies from time to time in force including, but not limited to, the ICH Harmonised Tripartite Guideline for Good Clinical Practice (CPMP/ICH/135/95) (“ICH GCP”) and the World Medical Association Declaration of Helsinki entitled “Ethical Principles for Medical Research Involving Human Subjects”.
Informed Consent by the patients and/or the legal representative/next of kin was obtained, according to the local legislation, for all patients recruited in the Core Dataset of CENTER-TBI and documented in the e-CRF.
Ethical approval was obtained for each recruiting site.
The list of sites, Ethical Committees, approval numbers and approval dates can be found on the website: https://www.center-tbi.eu/project/ethical-approval.
## Distribution of GOSe in validation folds
https://center-tbi.incf.org/ are provided.
The analysis is completely automated using the workflow management tool 'snakemake'
[@koster2012snakemake], and a singularity [@kurtzer2017singularity] container image
containing all required dependencies is publicly available from zenodo.org
(DOI: 10.5281/zenodo.2600385).
Detailed step-by-step instructions on how to reproduce the analysis are provided
in the README.md file of the GitLab repository.
year = {2011}
@article{R2016,
author = {R Core Team},
doi = {10.1007/978-3-540-74686-7},
isbn = {3{\_}900051{\_}00{\_}3},
title = {{R: A Language and Environment for Statistical Computing}},
year = {2016}
}
@book{rasmussen2006,
abstract = {Gaussian processes (GPs) are natural generalisations of multivariate Gaussian random variables to infinite (countably or continuous) index sets. GPs have been applied in a large number of fields to a diverse range of ends, and very many deep theoretical analyses of various properties are available. This paper gives an introduction to Gaussian processes on a fairly elementary level with special emphasis on characteristics relevant in machine learning. It draws explicit connections to branches such as spline smoothing models and support vector machines in which similar ideas have been investigated. Gaussian process models are routinely used to solve hard machine learning problems. They are attractive because of their flexible non-parametric nature and computational simplicity. Treated within a Bayesian framework, very powerful statistical methods can be implemented which offer valid estimates of uncertainties in our predictions and generic model selection procedures cast as nonlinear optimization problems. Their main drawback of heavy computational scaling has recently been alleviated by the introduction of generic sparse approximations.13,78,31 The mathematical literature on GPs is large and often uses deep concepts which are not required to fully understand most machine learning applications. In this tutorial paper, we aim to present characteristics of GPs relevant to machine learning and to show up precise connections to other "kernel machines" popular in the community. Our focus is on a simple presentation, but references to more detailed sources are provided.},