Commit b0e5c713 authored by Kevin Kunzmann

Refactor to gtex v8

parent 8d918f3c
.*
.config
.snakemake
output
logs
nohup.out
container.sif
*.err
*.out
*.sif
*.simg
*.html
*.pdf
.DS_Store
.Rproj.user
# Impute gene expression for CENTER-TBI with PrediXcan
# Impute genetically regulated gene expression for CENTER-TBI using PrediXcan
The singularity container with most software dependencies is available at
https://doi.org/10.5281/zenodo.3376504.
Data currently needs to be accessed manually due to access restrictions.
This workflow is designed for *.vcf.gz files with dosage (DS) information.
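
For orientation, a dosage (DS) value is the expected count of the alternate allele (between 0 and 2), stored per sample in a VCF record. The Python sketch below shows how DS values could be read from a single VCF data line. The helper name `extract_dosages` and the example record are hypothetical and not part of this repository; the container ships `bcftools`, which is presumably what the workflow uses for this step.

```python
def extract_dosages(vcf_line):
    """Return the DS value of every sample in one VCF data line (None if missing)."""
    fields = vcf_line.rstrip("\n").split("\t")
    # columns 1-8 are fixed, column 9 is FORMAT, samples start at column 10
    format_keys = fields[8].split(":")
    if "DS" not in format_keys:
        raise ValueError("record has no DS field")
    ds_index = format_keys.index("DS")
    dosages = []
    for sample in fields[9:]:
        values = sample.split(":")
        # trailing fields may be dropped in VCF; "." marks a missing value
        raw = values[ds_index] if ds_index < len(values) else "."
        dosages.append(None if raw == "." else float(raw))
    return dosages


# hypothetical record with two samples
line = "chr1\t12345\trs0000\tA\tG\t.\tPASS\tINFO=0.95\tGT:DS\t0|1:1.02\t0|0:0.01"
print(extract_dosages(line))
```
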
This repository contains all scripts necessary to impute genetically
regulated gene expression (GREx) using the
[PrediXcan](https://github.com/hakyimlab/MetaXcan) methodology for
[CENTER-TBI](https://www.center-tbi.eu/) imputed whole-genome data.
More information on PrediXcan can be found here https://github.com/hakyimlab/PrediXcan and in:
> Gamazon ER†, Wheeler HE†, Shah KP†, Mozaffari SV, Aquino-Michaels K, Carroll RJ, Eyler AE, Denny JC,
Nicolae DL, Cox NJ, Im HK. (2015) A gene-based association method for mapping traits using reference transcriptome data.
[1] Gamazon ER†, Wheeler HE, Shah KP, Mozaffari SV, Aquino-Michaels K, Carroll RJ, Eyler AE, Denny JC,
Nicolae DL, Cox NJ, Im HK. (2015) *A gene-based association method for mapping traits using reference transcriptome data.*
Nat Genet. doi:10.1038/ng.3367.
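
Conceptually, PrediXcan imputes GREx for a gene as a weighted sum of allele dosages, with per-variant weights taken from a pre-trained prediction model. A minimal Python sketch of that idea follows; the function name `impute_grex`, the variant identifiers, and all numbers are illustrative assumptions, not the PrediXcan implementation.

```python
def impute_grex(weights, dosages):
    """Weighted sum of effect-allele dosages over the variants of one gene.

    weights: dict mapping variant id -> model weight
    dosages: dict mapping variant id -> effect-allele dosage (0 to 2)
    Variants without a dosage are skipped.
    """
    return sum(w * dosages[v] for v, w in weights.items() if v in dosages)


# made-up model weights for one gene and dosages for one individual
weights = {"chr1_100_A_G_b38": 0.5, "chr1_200_C_T_b38": -0.25, "chr1_300_G_A_b38": 0.1}
dosages = {"chr1_100_A_G_b38": 2.0, "chr1_200_C_T_b38": 1.0}
print(impute_grex(weights, dosages))
```

Model variants that cannot be matched to the genotype data simply drop out of the sum, which is why the share of matched variants matters for the quality of the imputed values.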
## Dependencies
1. linux shell (`bash`), possibly via virtual machine on Windows/Mac
2. `wget` (pre-installed or via distribution package manager)
3. `singularity` container software (tested on 3.3.0, https://sylabs.io/guides/3.3/user-guide)
4. `git` (https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
5. for data download: access to the FIMM GCP bucket `fimm-horizon-outgoing-data/CENTER_TBI_data_freeze_190829/Imputed_data`
### Optional
5. python 3.7+ and snakemake
6. slurm cluster
We use snakemake to organize the workflow (also pre-installed in the container) and support cluster execution.
Snakemake is available via `pip` package for python 3.7.
> Johannes Köster, Sven Rahmann, Snakemake—a scalable bioinformatics workflow engine, Bioinformatics,
Volume 28, Issue 19, 1 October 2012, Pages 2520–2522, https://doi.org/10.1093/bioinformatics/bts480
## Execution
Download and extract the contents of this repository (might be access restricted)
[2] Barbeira, A., Shah, K. P., Torres, J. M., Wheeler, H. E., Torstenson, E. S., Edwards, T., ... & Im, H. K. (2016). *MetaXcan: summary statistics based gene-level association method infers accurate PrediXcan results.* BioRxiv, 045260.
git clone https://github.com/kkmann/impute-gene-expression
cd impute-gene-expression
Download the container image
bash scripts/download_container.sh
Obtain the imputed genomes (GCP bucket: fimm-horizon-outgoing-data/CENTER_TBI_data_freeze_190829/Imputed_data).
`gsutil` is pre-installed in the container image. To authenticate with your
GCP account, run the following and follow the interactive instructions
singularity shell container.sif
gcloud auth login
snakemake download_imputed_genotypes
exit
The pipeline itself is implemented using [snakemake](https://snakemake.readthedocs.io/en/stable/) and
[singularity](https://sylabs.io/) containers to guarantee reproducibility.
Execute the workflow inside the container on a single core (takes a while!)
singularity exec container.sif snakemake impute
Optionally, if snakemake is installed, the workflow can be run in parallel via
snakemake --use-singularity -j 8 impute
## Dependencies
where '8' can be replaced by the number of available cores.
Cluster execution is enabled via the `scripts/slurm_snakemake.sh` script as
1. [`Linux` system](https://ubuntu.com/download/desktop)
2. [`git`](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
3. [`singularity`](https://sylabs.io/guides/3.5/user-guide/quick_start.html#quick-installation-steps) container software (tested on 3.5.0)
4. *optionally:* [Python 3](https://www.python.org/) and [snakemake](https://github.com/snakemake/snakemake)
bash scripts/slurm-snakemake.sh impute
To access the imputed genotypes, access to the
[FIMM](https://www.fimm.fi/) GCS bucket
`fimm-horizon-outgoing-data` is necessary (requires a FIMM user account).
It is recommended to run the workflow on a cluster system,
since execution time on a desktop machine may take several hours.
An example [slurm](https://slurm.schedmd.com/documentation.html)
cluster [profile](https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles) is saved in `config/snakemake/mrc-bsu-cluster`
for reference.
## Results
An output folder is created with the PrediXcan dosage files and imputed gene
expressions.
## Execution
## Versions
### Local execution
Release tags should point to the respective FIMM imputed genotype source data.
1. Clone this repository and change the working directory to the newly
created folder
```
git clone https://git.center-tbi.eu/kunzmann/impute-gene-expression.git
cd impute-gene-expression
```
Make sure to check out the exact version of the pipeline you want to run,
either by commit hash or by an available tag, e.g. via
```
git checkout v0.1.0
```
2. Download the container image specified at the top of the `Snakefile`
(singularity: ...) manually via, e.g. (make sure to adjust to the exact specification!)
```
singularity pull library://kkmann/default/center-grex-imputation:[tag]
```
where `[tag]` is the desired container tag or hash (see Snakefile to make
sure that you download the exact right version).
This will download the container from the Sylabs cloud and store it as
`center-grex-imputation_[tag].sif`.
3. Authenticate with your FIMM account
```
singularity shell -H $PWD [container-name].sif
[container-name].sif> gcloud auth login
```
Here, the `-H $PWD` flag ensures that singularity does not mount the user's
home directory but uses the working directory instead.
Follow the on-screen instructions to authenticate in your browser,
then leave the container
```
[container-name].sif> exit
```
This should have created a hidden folder `.config/gcloud` with authentication
details in your working directory.
4. Adjust tissues, data date, and target cohort in `config/parameters.yaml`
5. Run the pipeline
```
singularity exec -H $PWD [container-name].sif snakemake -j 1 all
```
Here `-j 1` specifies the maximal number of parallel jobs.
Increase this on more powerful machines, but bear in mind that roughly 16 GB
of RAM per job are required to avoid out-of-memory issues.
An output folder is created with intermediate files, a mapping report,
and the imputed GREx levels per tissue in `output/imputed-grex`.
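
As a rough guide for choosing `-j`, the figure of about 16 GB per job can be turned into an upper bound on the number of parallel jobs. The Python sketch below is only a back-of-the-envelope helper; the function name `max_parallel_jobs` and the example machine sizes are assumptions.

```python
def max_parallel_jobs(total_ram_gb, n_cores, ram_per_job_gb=16):
    """Largest job count that fits both the RAM and the core budget (at least 1)."""
    by_memory = int(total_ram_gb // ram_per_job_gb)
    return max(1, min(by_memory, n_cores))


print(max_parallel_jobs(total_ram_gb=64, n_cores=8))  # limited by memory: 4
print(max_parallel_jobs(total_ram_gb=16, n_cores=4))  # limited by memory: 1
```
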
### Cluster execution
1. Same as the first step in 'Local execution'.
2. Make sure that `Python 3` is available and `snakemake` installed
(e.g. load Python via modules and install snakemake into a virtual environment)
3. Configure snakemake for your execution environment and store the profile in,
e.g., `config/my-cluster`.
It is highly recommended to use the `-H $PWD` flag to make sure that there
is no interference with configuration files in the user's home directory.
4. Run
```
snakemake --profile config/my-cluster container
```
to download the correct container image automatically.
This will create a file `[container-hash].simg` in the working folder.
5. Authenticate with your FIMM account as in 'Local execution', point 3.,
with a shell in the downloaded `[container-hash].simg`
6. Adjust tissues, data date, and target cohort in `config/parameters.yaml`
7. Run the distributed pipeline
```
snakemake --profile config/snakemake/my-cluster all
```
## Logo
{
"__default__" :
{
"account" : "MRC-BSU-SL2-CPU",
"partition" : "skylake",
"time" : "02:00:00",
"ntasks" : "1",
"nodes" : "1",
"ncpu" : "1",
"name" : "{rule}__{wildcards}",
"output" : "logs/{rule}__{wildcards}.out",
"error" : "logs/{rule}__{wildcards}.out"
}
}
# where to put things
output_dir: 'output'
# suggested filtering options for PrediXcan
min_MAF: '0.01'
min_INFO: '0.8'
# brain regions to impute expression for
brain_regions:
- 'Amygdala'
- 'Anterior_cingulate_cortex_BA24'
- 'Caudate_basal_ganglia'
- 'Cerebellar_Hemisphere'
- 'Cerebellum'
- 'Cortex'
- 'Frontal_Cortex_BA9'
- 'Hippocampus'
- 'Hypothalamus'
- 'Nucleus_accumbens_basal_ganglia'
- 'Putamen_basal_ganglia'
- 'Spinal_cord_cervical_c-1'
- 'Substantia_nigra'
jobs: 99
local-cores: 1
use-singularity: True
cluster: "sbatch -A MRC-BSU-SL2-CPU -p skylake --nodes 1 --ntasks 1 --cpus-per-task 1 --mem 16000 -t 02:00:00 --job-name '{rule}[{wildcards}]' --output 'logs/{rule}[{wildcards}].log' --error 'logs/{rule}[{wildcards}].log'"
singularity-args: "-H $PWD"
singularity-prefix: "."
# tissues to impute expression for, all brain except Pituitary and
# Spinal cord (cervical c-1)
tissues:
- 'Brain_Amygdala'
- 'Brain_Anterior_cingulate_cortex_BA24'
- 'Brain_Caudate_basal_ganglia'
- 'Brain_Cerebellar_Hemisphere'
- 'Brain_Cerebellum'
- 'Brain_Cortex'
- 'Brain_Frontal_Cortex_BA9'
- 'Brain_Hippocampus'
- 'Brain_Hypothalamus'
- 'Brain_Nucleus_accumbens_basal_ganglia'
- 'Brain_Putamen_basal_ganglia'
- 'Brain_Substantia_nigra'
data_date: "20200306"
cohort: "all-acgm-filtered"
Version: 1.0
RestoreWorkspace: Default
SaveWorkspace: Default
AlwaysSaveHistory: Default
EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 4
Encoding: UTF-8
RnwWeave: Sweave
LaTeX: pdfLaTeX
AutoAppendNewline: Yes
StripTrailingWhitespace: Yes
# Ignore everything in this directory
*
# Except this file
!.gitignore
#!/bin/bash
set -ex
sudo singularity build center-impute-grex.sif singularity.def
singularity push -U center-impute-grex.sif library://kkmann/default/center-grex-imputation:latest
wget https://zenodo.org/record/3376504/files/container.sif -O container.sif
#!/usr/bin/env Rscript
options(tidyverse.quiet = TRUE)
suppressPackageStartupMessages(library(tidyverse, warn.conflicts = FALSE))
library(glue, warn.conflicts = FALSE)
args <- commandArgs(trailingOnly = TRUE)
chromosome <- as.integer(args[[1]])
tbl_gtex_lookup <- read_csv(
"output/model-variants-lookup-table.csv.gz",
col_types = cols(
chr = col_character(),
position = col_integer(),
rsid = col_character(),
gtex_id = col_character(),
allele1 = col_character(),
allele2 = col_character()
)
) %>%
filter(chr == glue("chr{chromosome}")) %>%
select(chr, position) %>%
arrange(position) %>%
write_delim(
path = glue("output/model-variant-positions-chromosome-{chromosome}.txt"),
delim = "\t",
col_names = FALSE
)
#!/usr/bin/env Rscript
options(tidyverse.quiet = TRUE)
suppressPackageStartupMessages(library(tidyverse, warn.conflicts = FALSE))
library(glue, warn.conflicts = FALSE)
library(DBI)
tissues <- yaml::read_yaml("config/parameters.yml")$tissues
tbl_weights <- map(
tissues,
function(tissue) {
con <- dbConnect(
RSQLite::SQLite(),
glue("output/weights/en_{tissue}.db")
)
res <- dbReadTable(con, "weights") %>%
as_tibble() %>%
mutate(tissue = tissue) %>%
select(tissue, everything())
dbDisconnect(con)
return(res)
}
) %>%
bind_rows() %>%
separate(
varID,
into = c("chr", "position", "allele1", "allele2", "dummy"),
sep = "_",
remove = FALSE
) %>% {
assertthat::assert_that(
all((.$allele1 == .$ref_allele) & (.$allele2 == .$eff_allele))
)
.
} %>%
select(
chr, position, rsid, varID, allele1, allele2
) %>%
rename(
gtex_id = varID
) %>%
mutate(
position = as.integer(position)
) %>%
distinct()
write_csv(
tbl_weights,
"output/model-variants-lookup-table.csv.gz"
)
options(
repos = list(
"https://mran.revolutionanalytics.com/snapshot/2020-03-01/"
)
)
install.packages("tidyverse")
install.packages("pander")
install.packages("vroom")
install.packages("DBI")
install.packages("RSQLite")
---
title: "Mapping hg38 to GTEx model variants"
author: "Kevin Kunzmann"
date: "`r Sys.time()`"
output:
html_document:
code_folding: hide
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE
)
library(tidyverse, warn.conflicts = FALSE)
library(glue, warn.conflicts = FALSE)
set.seed(42)
# load data
tbl_unmatched <- glue(
"output/dosages/unmatched-chromosome-{1:23}.csv.gz"
) %>%
map(
~tryCatch(
read_csv(
., col_types = cols(
chr = col_character(),
position = col_integer(),
rsid = col_character(),
gtex_id = col_character(),
allele1..model = col_character(),
allele2..model = col_character(),
allele1..dosage = col_character(),
allele2..dosage = col_character()
)
),
error = function(e) tibble(
chr = character(0L),
position = integer(0L),
rsid = character(0L),
gtex_id = character(0L),
allele1..model = character(0L),
allele2..model = character(0L),
allele1..dosage = character(0L),
allele2..dosage = character(0L)
)
)
) %>%
bind_rows() %>%
mutate(
chromosome = str_extract(
chr,
"(?<=^chr)[1-9]{1}[0-9]{0,1}$"
) %>%
as.integer()
) %>%
select(-chr) %>%
select(chromosome, everything()) %>%
arrange(chromosome, position)
tbl_model_variants <- read_csv(
"output/model-variants-lookup-table.csv.gz",
col_types = cols(
chr = col_character(),
position = col_integer(),
rsid = col_character(),
gtex_id = col_character(),
allele1 = col_character(),
allele2 = col_character()
)
)
```
### Unmatched variants
```{r}
tbl_model_variants %>%
mutate(
chromosome = str_extract(
chr,
"(?<=^chr)[1-9]{1}[0-9]{0,1}$"
) %>%
as.integer()
) %>%
select(-chr) %>%
select(chromosome, everything()) %>%
mutate(
missing = gtex_id %in% tbl_unmatched$gtex_id
) %>%
group_by(chromosome) %>%
summarize(
`# missing` = sum(missing),
`% missing` = sum(missing) / n() * 100
) %>%
pander::pander(
caption = "Number of unmatched model variants per chromosome",
digits = 2
)
```
```{r}
complementary_base <- function(allele) {
case_when(
allele == "A" ~ "T",
allele == "T" ~ "A",
allele == "G" ~ "C",
allele == "C" ~ "G"
) %>% {
assertthat::assert_that(!any(is.na(.)));
.
}
}
tbl_unmatched %>%
filter(
complementary_base(allele1..model) == allele1..dosage
) %>%
pander::pander(
caption = "Variants that match on complementary strand",
split.table = Inf
)
```
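
The strand check above can also be expressed outside of R. The following Python sketch mirrors the idea: a model variant matches on the complementary strand if the complement of its allele equals the allele observed in the dosage file. The helper names and example alleles are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complementary_base(allele):
    """Complement of a single-nucleotide allele; raises KeyError for anything else."""
    return COMPLEMENT[allele]


def matches_on_complementary_strand(model_allele, dosage_allele):
    """True if the dosage allele is the complement of the model allele."""
    return complementary_base(model_allele) == dosage_allele


print(matches_on_complementary_strand("A", "T"))
print(matches_on_complementary_strand("A", "G"))
```

Like the R chunk, this sketch handles single-nucleotide alleles only and fails loudly for anything else, such as indels.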
## Session Info
```{r, echo=FALSE}
sessionInfo() %>%
pander::pander()
```
#!/usr/bin/env Rscript
options(tidyverse.quiet = TRUE)
suppressPackageStartupMessages(library(tidyverse, warn.conflicts = FALSE))
library(glue, warn.conflicts = FALSE)
args <- commandArgs(trailingOnly = TRUE)
chromosome <- as.integer(args[[1]])
dosages_file <- glue("output/dosages/chromosome-{chromosome}.dosage.txt.gz")
out_file_matched <- glue("output/dosages/matched-chromosome-{chromosome}.dosage.txt.gz")
out_file_unmatched <- glue("output/dosages/unmatched-chromosome-{chromosome}.csv.gz")
sample_ids <- read_delim(
"output/dosages/samples.txt",
delim = "\t",
col_types = "cc",
col_names = c("FID", "IID")
)$IID
tbl_model_variants <- read_csv(
"output/model-variants-lookup-table.csv.gz",
col_types = cols(
chr = col_character(),
position = col_integer(),
rsid = col_character(),
gtex_id = col_character(),
allele1 = col_character(),
allele2 = col_character()
)
) %>%
filter(
chr == glue("chr{chromosome}")
)
tbl_dosage <- read_csv(
glue("output/dosages/chromosome-{chromosome}.dosage.csv"),
trim_ws = TRUE,
na = ".",
progress = TRUE,
col_names = c(
"chr",
"rsid",
"position",
"allele1",
"allele2",
"MAF",
sample_ids
),
col_types = cols(
.default = col_double(),
chr = col_character(),
rsid = col_character(),
position = col_integer(),
allele1 = col_character(),
allele2 = col_character()
)
)
tbl_matched <- tbl_model_variants %>%
inner_join(
select(tbl_dosage, -chr, -rsid),
by = c("position", "allele1", "allele2")
)
write_delim(
tbl_matched %>%
arrange(position) %>%
select(-rsid) %>%
select(chr, gtex_id, position, allele1, allele2, MAF, everything()) %>%
rename(chromosome = chr) %>%
filter(complete.cases(.)),
delim = " ",
col_names = FALSE,
path = out_file_matched
)
tbl_unmatched <- left_join(
tbl_model_variants,
select(tbl_dosage, position, allele1, allele2),
by = "position",
suffix = c("..model", "..dosage")
) %>%
filter(
!(gtex_id %in% tbl_matched$gtex_id)
)
write_csv(
tbl_unmatched,
path = out_file_unmatched
)
cat(sprintf(
"%s: %9i (%6.2f%%) model variants found for chromosome %i\n\r",
Sys.time(),
nrow(tbl_matched), 100*nrow(tbl_matched)/nrow(tbl_model_variants),
chromosome
))
mkdir -p output/imputed-gene-expressions
# trailing slash on the source copies the folder contents instead of nesting the folder
rsync -avzhe ssh --progress \
kk656@login-cpu.hpc.cam.ac.uk:/home/kk656/scratch/impute-gene-expression/output/imputed-gene-expressions/ \
output/imputed-gene-expressions
Bootstrap: docker
From: ubuntu:18.04
From: rocker/verse:3.6.2
%files
install.R /tmp/install.R
%post
# non-interactive debconf
@@ -17,45 +20,36 @@ From: ubuntu:18.04
apt-get update && apt-get -y install google-cloud-sdk
# install python3 and snakemake
apt-get -y install \
python3 python3-pip
pip3 install snakemake
apt-get -y install python3 python3-pip
pip3 install snakemake==5.10.0
# install bcftools
export BCFVER=1.10.2
apt-get -y install \
gcc wget make zlib1g zlib1g-dev libbz2-dev liblzma-dev libcurl4-openssl-dev
wget https://github.com/samtools/bcftools/releases/download/1.9/bcftools-1.9.tar.bz2
tar -xvjf bcftools-1.9.tar.bz2
cd bcftools-1.9
gcc wget make zlib1g zlib1g-dev libbz2-dev liblzma-dev libcurl4-openssl-dev pv
wget https://github.com/samtools/bcftools/releases/download/$BCFVER/bcftools-$BCFVER.tar.bz2
tar -xvjf bcftools-$BCFVER.tar.bz2
cd bcftools-$BCFVER
./configure --prefix=/usr/bcftools
make
make install
(cd /usr/bin; ln -s /usr/bcftools/bin/bcftools bcftools)
# install PrediXcan and python dependencies (uses python 2.7)
# install PrediXcan and python (2) dependencies
apt-get -y install \
wget python-pip
wget https://raw.githubusercontent.com/hakyimlab/PrediXcan/master/Software/PrediXcan.py -O /usr/bin/predixcan
chmod +x /usr/bin/predixcan
python-pip
pip install \
argparse datetime numpy
# download, extract and store (brain) weights
mkdir /usr/predixcan
wget https://s3.amazonaws.com/predictdb2/GTEx-V7_HapMap-2017-11-29.tar.gz -O /usr/predixcan/GTEx-V7_HapMap-2017-11-29.tar.gz
(cd /usr/predixcan; \
mkdir GTEx-V7_HapMap-2017-11-29; \
# we only need the brain tissue weights:
tar -xvz -f GTEx-V7_HapMap-2017-11-29.tar.gz -C GTEx-V7_HapMap-2017-11-29 --wildcards "*_Brain_*"; \
rm GTEx-V7_HapMap-2017-11-29.tar.gz \
)
# predixcan connects to the weights database with sql, needs write permission
# even if the file system will be read only for the container
chmod -R 777 /usr/predixcan
# install R and packages
apt-get -y install r-base
Rscript -e "install.packages('dplyr')"
Rscript -e "install.packages('yaml')"
Rscript -e "install.packages('purrr')"
Rscript -e "install.packages('readr')"
Rscript -e "install.packages('tidyr')"
numpy==1.16.6
mkdir -p /usr/PrediXcan
(cd /usr; \
git clone https://github.com/hakyimlab/PrediXcan; \
cd PrediXcan; \
git checkout e77dd8a04a0345cb63aa634d4f8acc6aca9e25e0)
ln -s /usr/PrediXcan/Software/PrediXcan.py /usr/bin/predixcan
chmod +x /usr/bin/predixcan
# install bc
apt-get -y install bc
# install R packages
Rscript /tmp/install.R
mkdir -p logs
nohup snakemake $1 \
--jobs 13 \
--use-singularity \
--cluster-config cluster.json \
--cluster "sbatch -A {cluster.account} -p {cluster.partition} --ntasks {cluster.ntasks} --cpus-per-task {cluster.ncpu} --nodes {cluster.nodes} -t {cluster.time} --job-name {cluster.name} --output {cluster.output} --error {cluster.error}" &