Commit f1a20e8d authored by Kevin

tidier data report, imposed age restriction to 12-80 (configurable)

parent 36aab205
......@@ -34,7 +34,7 @@ rule prepare_data:
shell:
"""
mkdir -p output/{wildcards.version}/data
Rscript -e "rmarkdown::render(\\"{input.markdown}\\", output_dir = \\"output/{wildcards.version}\\", params = list(datapath = \\"../data/{wildcards.version}\\", max_lab_days = {config[max_lab_days]}, seed = {config[seed]}))"
Rscript -e "rmarkdown::render(\\"{input.markdown}\\", output_dir = \\"output/{wildcards.version}\\", params = list(datapath = \\"../data/{wildcards.version}\\", max_lab_days = {config[max_lab_days]}, seed = {config[seed]}, age_min = {config[age_min]}, age_max = {config[age_max]}))"
mv reports/*.rds output/{wildcards.version}/data
mv reports/figures.zip {output.figures}
"""
......@@ -88,7 +88,7 @@ rule generate_validation_data_v1_1:
for j in range(1, config["folds"] + 1)
]
......
......@@ -2,6 +2,10 @@ seed:
42
max_lab_days:
3
age_min:
12
age_max:
80
mi_m:
5
mi_maxiter:
......
......@@ -13,6 +13,8 @@ params:
datapath: "../data/v1.1"
max_lab_days: 3
seed: 42
age_min: 12
age_max: 80
---
......@@ -38,7 +40,9 @@ df_gose <- readRDS(sprintf('%s/df_gose.rds', params$datapath))
# Extract data
```{r}
## Baseline and death times
```{r extract-deathtimes}
df_deaths <- df_baseline %>%
transmute(
gupi,
......@@ -50,7 +54,11 @@ df_deaths <- df_baseline %>%
) %>%
filter(complete.cases(.))
```
```{r}
We use exact death times (Subject.DeathDate).
Death dates are recorded for `r df_deaths %>% nrow` individuals.
```{r extract-baseline-covariates}
df_baseline <- df_baseline %>%
select(-Subject.DeathDate) %>%
mutate(
......@@ -121,13 +129,13 @@ df_baseline <- df_baseline %>%
)
```
Overall, `r nrow(df_baseline)` individuals have recorded baseline data.
## GOSE data
```{r gose-outcomes-ambiguity}
```{r extract-gose}
df_gose <- df_gose %>%
distinct %>%
filter(complete.cases(.)) %>%
......@@ -136,17 +144,16 @@ df_gose <- df_gose %>%
mutate(Outcomes.DerivedCompositeGOSE = factor(Outcomes.DerivedCompositeGOSE, levels = 1:8))
```
This results in `r nrow(df_gose)` GOSE measurements of
`r df_gose %>% group_by(gupi) %>% n_groups()` individuals.
# Compile final datasets
* exclude all patients who do not survive first 6 months (no need to impute)
* exclude all patients with no GOSE measurement (no imputation)
Overall, `r nrow(df_gose)` GOSE measurements of
`r df_gose %>% group_by(gupi) %>% n_groups()` individuals are available.
To these observations, we add a GOSE of 1 at the recorded death times.
We then exclude all patients with a recorded death time of less than 6 months,
since their 6-month GOSE is known exactly (1, dead).
The target population is thus the subset of individuals with
1. at least one valid GOSE observation
2. no confirmed death within 6 months
```{r}
```{r exclude-early-deaths}
early_deaths_gupi <- df_deaths %>%
filter(days <= 180 - 14) %>%
.[["gupi"]]
......@@ -185,14 +192,14 @@ df_baseline <- df_baseline %>%
This results in `r nrow(df_gose)` GOSE measurements of
`r df_gose %>% group_by(gupi) %>% n_groups()` individuals.
## Plausibility check
The only genuinely numerical variables are Age, Glucose_mmolL, and Hb_dL.
All other variables are factors and therefore cannot contain outliers.
```{r}
```{r baseline-histograms-raw}
df_baseline %>%
select(Subject.Age, Labs.DLGlucosemmolL, Labs.DLHemoglobingdL) %>%
gather(Variable, value) %>%
......@@ -203,19 +210,40 @@ df_baseline %>%
theme(panel.grid = element_blank())
```
### Age range
The observed age range is quite wide; we therefore further restrict the study
population to individuals between `r params$age_min` and `r params$age_max`.
```{r restrict-age-range}
df_baseline <- df_baseline %>%
filter(
Subject.Age >= params$age_min,
Subject.Age <= params$age_max
)
df_gose <- df_gose %>%
filter(gupi %in% df_baseline$gupi)
```
This reduces the number of GOSE observations to `r nrow(df_gose)` of
`r df_gose %>% group_by(gupi) %>% n_groups()` individuals.
### Glucose and Hemoglobin
Glucose is obviously left-skewed and a log transform might improve fits in linear
models.
All values above 50 are considered implausible and set to missing
(probably meant as missing).
```{r, echo=TRUE}
```{r glucose-outliers}
df_baseline %>%
select(gupi, Labs.DLHemoglobingdL) %>%
filter(Labs.DLHemoglobingdL > 50)
```
All values above 50 are considered implausible and set to missing
(probably meant as missing).
```{r}
```{r log-trans-hemoglobin}
df_baseline <- df_baseline %>%
mutate(
Labs.DLHemoglobingdL = ifelse(Labs.DLHemoglobingdL > 50, NA_real_, Labs.DLHemoglobingdL),
......@@ -334,18 +362,20 @@ ggsave("gose_alluvial_differential_coloring.pdf", height = 5, width = 8)
ggsave("gose_alluvial_differential_coloring.png", height = 5, width = 8)
```
Out of the `r df_gose %>% group_by(gupi) %>% n_groups` individuals in the
final dataset, `r df_gose %>% group_by(gupi) %>% filter(!any(Outcomes.DerivedCompositeGOSEDaysPostInjury >= 5*30 & Outcomes.DerivedCompositeGOSEDaysPostInjury <= 8*30)) %>% n_groups` do not
have per-protocol 6 months GOSE observations and are eligible for model-based
imputation.
# Session Info
```{r zip-figures}
system("zip figures.zip *.png *.pdf")
system("rm *.png *.pdf")
```
# Session Info
```{r session-info}
sessionInfo()
```
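For readers following the change outside R, the age restriction introduced in this commit can be sketched in plain Python. This is only an illustration under stated assumptions, not the report's code: the function name `restrict_age_range` and the example records are hypothetical, rows are represented as dictionaries, and the defaults 12 and 80 mirror `age_min` and `age_max` from the configuration. Rows with a missing age are dropped, which matches how `dplyr::filter` treats comparisons that evaluate to `NA`.

```python
def restrict_age_range(baseline, gose, age_min=12, age_max=80):
    """Keep baseline rows with age in [age_min, age_max] and their GOSE rows.

    Hypothetical helper: mirrors the two filter steps of the
    `restrict-age-range` chunk, using lists of dicts instead of data frames.
    """
    kept_baseline = [
        row for row in baseline
        if row["Subject.Age"] is not None  # missing ages are excluded
        and age_min <= row["Subject.Age"] <= age_max  # bounds are inclusive
    ]
    # Propagate the restriction to the GOSE table via the patient identifier.
    kept_ids = {row["gupi"] for row in kept_baseline}
    kept_gose = [row for row in gose if row["gupi"] in kept_ids]
    return kept_baseline, kept_gose


# Made-up example records (not taken from the study data).
example_baseline = [
    {"gupi": "A", "Subject.Age": 11},
    {"gupi": "B", "Subject.Age": 12},
    {"gupi": "C", "Subject.Age": 80},
    {"gupi": "D", "Subject.Age": 81},
    {"gupi": "E", "Subject.Age": None},
]
example_gose = [
    {"gupi": "A", "gose": 5},
    {"gupi": "B", "gose": 6},
    {"gupi": "B", "gose": 7},
    {"gupi": "C", "gose": 3},
    {"gupi": "E", "gose": 4},
]

kept_baseline, kept_gose = restrict_age_range(example_baseline, example_gose)
n_individuals = len({row["gupi"] for row in kept_gose})
print(len(kept_gose), "GOSE observations of", n_individuals, "individuals")
```

Because both bounds are inclusive, individuals aged exactly `age_min` or `age_max` stay in the study population, as they do with the `>=` and `<=` comparisons in the R chunk.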