Commit f1a20e8d authored by Kevin

tidier data report, imposed age restriction to 12-80 (configurable)

parent 36aab205
...@@ -34,7 +34,7 @@ rule prepare_data: ...@@ -34,7 +34,7 @@ rule prepare_data:
shell: shell:
""" """
mkdir -p output/{wildcards.version}/data mkdir -p output/{wildcards.version}/data
Rscript -e "rmarkdown::render(\\"{input.markdown}\\", output_dir = \\"output/{wildcards.version}\\", params = list(datapath = \\"../data/{wildcards.version}\\", max_lab_days = {config[max_lab_days]}, seed = {config[seed]}))" Rscript -e "rmarkdown::render(\\"{input.markdown}\\", output_dir = \\"output/{wildcards.version}\\", params = list(datapath = \\"../data/{wildcards.version}\\", max_lab_days = {config[max_lab_days]}, seed = {config[seed]}, age_min = {config[age_min]}, age_max = {config[age_max]}))"
mv reports/*.rds output/{wildcards.version}/data mv reports/*.rds output/{wildcards.version}/data
mv reports/figures.zip {output.figures} mv reports/figures.zip {output.figures}
""" """
...@@ -88,7 +88,7 @@ rule generate_validation_data_v1_1: ...@@ -88,7 +88,7 @@ rule generate_validation_data_v1_1:
for j in range(1, config["folds"] + 1) for j in range(1, config["folds"] + 1)
] ]
......
...@@ -2,6 +2,10 @@ seed: ...@@ -2,6 +2,10 @@ seed:
42 42
max_lab_days: max_lab_days:
3 3
age_min:
12
age_max:
80
mi_m: mi_m:
5 5
mi_maxiter: mi_maxiter:
......
...@@ -13,6 +13,8 @@ params: ...@@ -13,6 +13,8 @@ params:
datapath: "../data/v1.1" datapath: "../data/v1.1"
max_lab_days: 3 max_lab_days: 3
seed: 42 seed: 42
age_min: 12
age_max: 80
--- ---
...@@ -38,7 +40,9 @@ df_gose <- readRDS(sprintf('%s/df_gose.rds', params$datapath)) ...@@ -38,7 +40,9 @@ df_gose <- readRDS(sprintf('%s/df_gose.rds', params$datapath))
# Extract data # Extract data
```{r} ## Baseline and death times
```{r extract-deathtimes}
df_deaths <- df_baseline %>% df_deaths <- df_baseline %>%
transmute( transmute(
gupi, gupi,
...@@ -50,7 +54,11 @@ df_deaths <- df_baseline %>% ...@@ -50,7 +54,11 @@ df_deaths <- df_baseline %>%
) %>% ) %>%
filter(complete.cases(.)) filter(complete.cases(.))
``` ```
```{r}
We use exact deathtimes (Subject.DeathDate).
Death dates are recorded for `r df_deaths %>% nrow`.
```{r extract-baseline-covariates}
df_baseline <- df_baseline %>% df_baseline <- df_baseline %>%
select(-Subject.DeathDate) %>% select(-Subject.DeathDate) %>%
mutate( mutate(
...@@ -121,13 +129,13 @@ df_baseline <- df_baseline %>% ...@@ -121,13 +129,13 @@ df_baseline <- df_baseline %>%
) )
``` ```
Overall, `r nrow(df_baseline)` individuals have recorded baseline data.
## GOSE data ## GOSE data
```{r gose-outcomes-ambiguity} ```{r extract-gose}
df_gose <- df_gose %>% df_gose <- df_gose %>%
distinct %>% distinct %>%
filter(complete.cases(.)) %>% filter(complete.cases(.)) %>%
...@@ -136,17 +144,16 @@ df_gose <- df_gose %>% ...@@ -136,17 +144,16 @@ df_gose <- df_gose %>%
mutate(Outcomes.DerivedCompositeGOSE = factor(Outcomes.DerivedCompositeGOSE, levels = 1:8)) mutate(Outcomes.DerivedCompositeGOSE = factor(Outcomes.DerivedCompositeGOSE, levels = 1:8))
``` ```
This results in `r nrow(df_gose)` GOSE measurements of Overall, `r nrow(df_gose)` GOSE measurements of
`r df_gose %>% group_by(gupi) %>% n_groups()` individuals. `r df_gose %>% group_by(gupi) %>% n_groups()` individuals are available.
To these observations, we add a GOSE of 1 at the recorded death times.
We then exclude all patients with a recorded death time of less than 6 months,
since there 6 months GOSE is known exactly (1, dead).
# Compile final datasets The target population is thus the subset of individuals with
1. at least one valid GOSE observation
* exclude all patient who do not survive first 6 months (no need to impute) 2. no confirmed death within 6 months
* exclude all patients with no GOSE measurement (no imputation)
```{r} ```{r exclude-early-deaths}
early_deaths_gupi <- df_deaths %>% early_deaths_gupi <- df_deaths %>%
filter(days <= 180 - 14) %>% filter(days <= 180 - 14) %>%
.[["gupi"]] .[["gupi"]]
...@@ -185,14 +192,14 @@ df_baseline <- df_baseline %>% ...@@ -185,14 +192,14 @@ df_baseline <- df_baseline %>%
This results in `r nrow(df_gose)` GOSE measurements of This results in `r nrow(df_gose)` GOSE measurements of
`r df_gose %>% group_by(gupi) %>% n_groups()` individuals. `r df_gose %>% group_by(gupi) %>% n_groups()` individuals.
## Plausibility check ## Plausibility check
The only genuinly numerical variables are Age, Glucose_mmolL, and Hb_dL. The only genuinly numerical variables are Age, Glucose_mmolL, and Hb_dL.
All other variables are factors and may therefore not contain outliers. All other variables are factors and may therefore not contain outliers.
```{r} ```{r baseline-histograms-raw}
df_baseline %>% df_baseline %>%
select(Subject.Age, Labs.DLGlucosemmolL, Labs.DLHemoglobingdL) %>% select(Subject.Age, Labs.DLGlucosemmolL, Labs.DLHemoglobingdL) %>%
gather(Variable, value) %>% gather(Variable, value) %>%
...@@ -203,19 +210,40 @@ df_baseline %>% ...@@ -203,19 +210,40 @@ df_baseline %>%
theme(panel.grid = element_blank()) theme(panel.grid = element_blank())
``` ```
### Age range
The observed age range is quite wide, we further restrict the study population
to individuals between `r params$age_min` and `r params$age_max`.
```{r restrict-age-range}
df_baseline <- df_baseline %>%
filter(
Subject.Age >= params$age_min,
Subject.Age <= params$age_max
)
df_gose <- df_gose %>%
filter(gupi %in% df_baseline$gupi)
```
This reduces the number GOSE observations to `r nrow(df_gose)` of
`r df_gose %>% group_by(gupi) %>% n_groups()` individuals.
### Glucose and Hemoglobin
Glucose is obviously left-skewed and a log transfrom might improve fits in linear Glucose is obviously left-skewed and a log transfrom might improve fits in linear
models. models.
All values above 50 are considered implausible and set to missing
(probably meant as missing).
```{r, echo=TRUE} ```{r glucose-outliers}
df_baseline %>% df_baseline %>%
select(gupi, Labs.DLHemoglobingdL) %>% select(gupi, Labs.DLHemoglobingdL) %>%
filter(Labs.DLHemoglobingdL > 50) filter(Labs.DLHemoglobingdL > 50)
``` ```
All values above 50 are considered implausible and set to missing ```{r log-trans-hemoglobin}
(probably meant as missing).
```{r}
df_baseline <- df_baseline %>% df_baseline <- df_baseline %>%
mutate( mutate(
Labs.DLHemoglobingdL = ifelse(Labs.DLHemoglobingdL > 50, NA_real_, Labs.DLHemoglobingdL), Labs.DLHemoglobingdL = ifelse(Labs.DLHemoglobingdL > 50, NA_real_, Labs.DLHemoglobingdL),
...@@ -334,18 +362,20 @@ ggsave("gose_alluvial_differential_coloring.pdf", height = 5, width = 8) ...@@ -334,18 +362,20 @@ ggsave("gose_alluvial_differential_coloring.pdf", height = 5, width = 8)
ggsave("gose_alluvial_differential_coloring.png", height = 5, width = 8) ggsave("gose_alluvial_differential_coloring.png", height = 5, width = 8)
``` ```
Out of the `r df_gose %>% group_by(gupi) %>% n_groups` individuals in the
final dataset, `r df_gose %>% group_by(gupi) %>% filter(!any(Outcomes.DerivedCompositeGOSEDaysPostInjury >= 5*30 & Outcomes.DerivedCompositeGOSEDaysPostInjury <= 8*30)) %>% n_groups` do not
have per-protocol 6 months GOSE observations and are eligible for model-based
imputation.
# Session Info
```{r zip-figures} ```{r zip-figures}
system("zip figures.zip *.png *.pdf") system("zip figures.zip *.png *.pdf")
system("rm *.png *.pdf") system("rm *.png *.pdf")
``` ```
# Session Info
```{r session-info} ```{r session-info}
sessionInfo() sessionInfo()
``` ```
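As a rough illustration of how the new `age_min` and `age_max` config entries end up in the `params` list handed to `rmarkdown::render`, the sketch below mimics the `{config[...]}` placeholder substitution with plain Python `str.format`. This is not Snakemake itself: the `config` dict and the shortened `template` string are assumptions made for illustration, and the real rule also fills in `wildcards`, `input`, and `output`.

```python
# Hypothetical stand-in for the values loaded from the workflow config file.
config = {"max_lab_days": 3, "seed": 42, "age_min": 12, "age_max": 80}

# Shortened version of the params list passed to rmarkdown::render in the rule.
# "{config[age_min]}" is standard format-string syntax for a dictionary lookup.
template = (
    "params = list(max_lab_days = {config[max_lab_days]}, seed = {config[seed]}, "
    "age_min = {config[age_min]}, age_max = {config[age_max]})"
)

rendered = template.format(config=config)
print(rendered)
```

Under these assumptions, changing `age_min` or `age_max` in the config changes the rendered command without touching the rule. The defaults in the report's YAML header (12 and 80) would apply only when the report is rendered without these overrides.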