Commit b0e5c713 authored by Kevin Kunzmann

Refactor to gtex v8

parent 8d918f3c
.*
.config
.snakemake
output
logs
*.err
*.out
*.sif
*.simg
*.html
*.pdf
.DS_Store
.Rproj.user
# Impute genetically regulated gene expression for CENTER-TBI using PrediXcan

This repository contains all scripts necessary to impute genetically
regulated gene expression (GREx) using the
[PrediXcan](https://github.com/hakyimlab/MetaXcan) methodology for
[CENTER-TBI](https://www.center-tbi.eu/) imputed whole-genome data.

[1] Gamazon ER†, Wheeler HE†, Shah KP†, Mozaffari SV, Aquino-Michaels K, Carroll RJ, Eyler AE, Denny JC,
Nicolae DL, Cox NJ, Im HK. (2015) *A gene-based association method for mapping traits using reference transcriptome data.*
Nat Genet. doi:10.1038/ng.3367.

[2] Barbeira, A., Shah, K. P., Torres, J. M., Wheeler, H. E., Torstenson, E. S., Edwards, T., ... & Im, H. K. (2016). *MetaXcan: summary statistics based gene-level association method infers accurate PrediXcan results.* BioRxiv, 045260.
The pipeline itself is implemented using
[snakemake](https://snakemake.readthedocs.io/en/stable/) and
[singularity](https://sylabs.io/) containers to guarantee reproducibility.
## Dependencies

1. [`Linux` system](https://ubuntu.com/download/desktop)
2. [`git`](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
3. [`singularity`](https://sylabs.io/guides/3.5/user-guide/quick_start.html#quick-installation-steps) container software (tested on 3.5.0)
4. *optionally:* [Python 3](https://www.python.org/) and [snakemake](https://github.com/snakemake/snakemake)

Downloading the imputed genotypes requires access to the
[FIMM](https://www.fimm.fi/) GCS bucket
`fimm-horizon-outgoing-data` (i.e., a FIMM user account).

It is recommended to run the workflow on a cluster system,
since execution on a desktop machine may take several hours.
An example [slurm](https://slurm.schedmd.com/documentation.html)
cluster [profile](https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles)
is saved in `config/snakemake/mrc-bsu-cluster` for reference.
## Execution

### Local execution

1. Clone this repository and change the working directory to the newly
created folder
```
git clone https://git.center-tbi.eu/kunzmann/impute-gene-expression.git
cd impute-gene-expression
```
Make sure to check out the exact version of the pipeline you want to run,
either by commit hash or an available tag, e.g. via
```
git checkout v0.1.0
```
2. Download the container image specified at the top of the `Snakefile`
(`singularity: ...`) manually via, e.g. (make sure to adjust to the exact specification!)
```
singularity pull library://kkmann/default/center-grex-imputation:[tag]
```
where `[tag]` is the desired container tag or hash (see the `Snakefile` to make
sure that you download exactly the right version).
This will download the container from the Sylabs cloud and store it as
`center-grex-imputation_[tag].sif`.
3. Authenticate with your FIMM account
```
singularity shell -H $PWD [container-name].sif
[container-name].sif> gcloud auth login
```
Here, the `-H $PWD` flag ensures that singularity does not mount the user's
home directory but uses the working directory instead.
Follow the on-screen instructions to authenticate in your browser,
then leave the container
```
[container-name].sif> exit
```
This should have created a hidden folder `.config/gcloud` with authentication
details in your working directory.
4. Adjust tissues, data date, and target cohort in `config/parameters.yml`
5. Run the pipeline
```
singularity exec -H $PWD [container-name].sif snakemake -j 1 all
```
Here `-j 1` specifies the maximal number of parallel jobs.
Increase this on more powerful machines, but bear in mind that roughly 16 GB
of RAM per job are required to avoid out-of-memory issues.
An output folder is created with intermediate files, a mapping report,
and the imputed GREx levels per tissue in `output/imputed-grex`.
### Cluster execution
1. Same as the first step in 'Local execution'.
2. Make sure that `Python 3` is available and `snakemake` is installed
(e.g. load Python via modules and install snakemake into a virtual environment)
3. Configure snakemake for your execution environment and store the profile in,
e.g., `config/my-cluster`.
It is highly recommended to use the `-H $PWD` flag to make sure that there
is no interference with configuration files in the user's home directory.
4. Run
```
snakemake --profile config/my-cluster container
```
to download the correct container image automatically.
This will create a file `[container-hash].simg` in the working folder.
5. Authenticate with your FIMM account as in 'Local execution', point 3.,
with a shell in the downloaded `[container-hash].simg`
6. Adjust tissues, data date, and target cohort in `config/parameters.yml`
7. Run the distributed pipeline
```
snakemake --profile config/my-cluster all
```
## Logo

CC-BY 4.0, Ali Zifan
# define the singularity (https://sylabs.io/) container image to use
# see also https://cloud.sylabs.io/library/kkmann/default/center-grex-imputation
singularity: "library://kkmann/default/center-grex-imputation:sha256.321833b357b20feda51877dcef6e9bac2dfa706e9ed6c32da39baaa8ead9e1f7"

# modify this file to define global parameters
# (e.g. set of tissues to impute genetically regulated expression levels for)
configfile: "config/parameters.yml"

# these rules are computationally less expensive and run locally, even in
# cluster execution mode
localrules: all, container, download_genome, download_weights, samples_file,
    model_variant_lookup_table, mapping_report

# this rule is executed by default, i.e.
# > snakemake
# is equivalent to
# > snakemake all
# the target imputes gene expression for all tissues listed in the config file
# and the report on mapping model variants to available dosage information.
rule all:
    input:
        expand(
            "output/imputed-grex/{tissue}.igrex.txt.gz",
            tissue=config["tissues"]
        ),
        "output/mapping-report.html"

# dummy rule doing nothing; can be used to download the required container;
# this might be helpful to set up authentication with the data servers
# (requires an interactive step); see readme.md for details
rule container:
    shell:
        """
        echo done.
        """

# download vcf files using gsutil (needs to be set up and configured!)
rule download_genome:
    output:
        "output/imputed-genotypes/chromosome-{chr}.vcf.gz",
        "output/imputed-genotypes/chromosome-{chr}.vcf.gz.tbi"
    shell:
        """
        set -ex
        mkdir -p output/imputed-genotypes
        FILEDIR=gs://fimm-horizon-outgoing-data/genetic-data/imputed-genomes/{config[data_date]}/{config[cohort]}
        gsutil cp $FILEDIR/chromosome-{wildcards.chr}.vcf.gz \
            output/imputed-genotypes/chromosome-{wildcards.chr}.vcf.gz
        gsutil cp $FILEDIR/chromosome-{wildcards.chr}.vcf.gz.tbi \
            output/imputed-genotypes/chromosome-{wildcards.chr}.vcf.gz.tbi
        """

# download PrediXcan EQTL weights
rule download_weights:
    output:
        directory("output/weights")
    shell:
        """
        set -ex
        mkdir -p logs
        mkdir -p output/weights
        # wget https://zenodo.org/record/3518299/files/mashr_eqtl.tar?download=1 \
        #     -O output/weights/eqtl.tar
        wget https://zenodo.org/record/3519321/files/elastic_net_eqtl.tar?download=1 \
            -O output/weights/eqtl.tar
        tar -C output/weights -xvf output/weights/eqtl.tar
        rm output/weights/eqtl.tar
        mv output/weights/elastic_net_models/* output/weights/
        rm -rf output/weights/elastic_net_models
        """

# generate a table with all model variants
rule model_variant_lookup_table:
    input:
        rules.download_weights.output
    output:
        "output/model-variants-lookup-table.csv.gz"
    shell:
        """
        set -ex
        Rscript scripts/get-model-variants.R
        """

# generate the samples file in the format required by PrediXcan
rule samples_file:
    input:
        "output/imputed-genotypes/chromosome-1.vcf.gz"
    output:
        "output/dosages/samples.txt"
    shell:
        """
        set -ex
        mkdir -p output/dosages
        # query sample ids from first chromosome
        bcftools query -l {input} > output/samples.txt
        # predixcan allows family and sample id, duplicate column and overwrite
        Rscript -e "library(dplyr); readr::read_tsv('output/samples.txt', col_names = 'SID', col_types = 'c') %>% dplyr::mutate(FID = SID) %>% readr::write_tsv(path = 'output/dosages/samples.txt', col_names = FALSE)"
        """
# extract and map dosage information to model variants by position and alleles
# (only works with hg38 inputs, otherwise a liftover needs to be performed!)
rule dosage:
    input:
        vcf = "output/imputed-genotypes/chromosome-{chr}.vcf.gz",
        index = "output/imputed-genotypes/chromosome-{chr}.vcf.gz.tbi",
        lookup = "output/model-variants-lookup-table.csv.gz",
        samples = "output/dosages/samples.txt"
    output:
        "output/dosages/matched-chromosome-{chr}.dosage.txt.gz",
        "output/dosages/unmatched-chromosome-{chr}.csv.gz"
    shell:
        """
        # setup
        set -ex
        mkdir -p output/dosages
        # create file with chromosomal positions for all model variants
        Rscript scripts/get-model-variant-positions.R {wildcards.chr}
        nlines=$(wc -l < "output/model-variant-positions-chromosome-{wildcards.chr}.txt" | bc)
        if [[ $nlines == 0 ]]
        then
            touch "output/dosages/matched-chromosome-{wildcards.chr}.dosage.txt.gz"
            touch "output/dosages/unmatched-chromosome-{wildcards.chr}.csv.gz"
        else
            # filter for model variant positions and extract dosage information
            bcftools query \
                -o "output/dosages/chromosome-{wildcards.chr}.dosage.csv" \
                -R "output/model-variant-positions-chromosome-{wildcards.chr}.txt" \
                -f "%CHROM,%ID,%POS,%REF,%ALT,%INFO/MAF[,%DS]\n\r" \
                {input.vcf}
            # match model variants with extracted dosage
            Rscript scripts/match-variants.R {wildcards.chr}
            # cleanup
            rm "output/dosages/chromosome-{wildcards.chr}.dosage.csv"
        fi
        # cleanup
        rm "output/model-variant-positions-chromosome-{wildcards.chr}.txt"
        """
# build an overview of the matched model SNPs per chromosome
rule mapping_report:
    input:
        dosage_files = expand(
            "output/dosages/matched-chromosome-{chr}.dosage.txt.gz",
            chr=range(1,24)
        )
    output:
        "output/mapping-report.html"
    shell:
        """
        set -ex
        Rscript -e "rmarkdown::render('scripts/mapping-report.Rmd', output_dir = 'output', output_file = 'mapping-report.html', knit_root_dir = '$PWD')"
        """
# impute GREx for a specific tissue (sums up the mapped dosages times the
# PrediXcan weights)
rule grex:
    input:
        samples_file = "output/dosages/samples.txt",
        dosage_files = expand(
            "output/dosages/matched-chromosome-{chr}.dosage.txt.gz",
            chr=range(1,24)
        )
    output:
        "output/imputed-grex/{tissue}.igrex.txt.gz"
    shell:
        """
        mkdir -p output/imputed-grex
        predixcan \
            --predict \
            --dosages output/dosages \
            --dosages_prefix matched-chromosome- \
            --samples samples.txt \
            --weights output/weights/en_{wildcards.tissue}.db \
            --output_prefix output/imputed-grex/{wildcards.tissue} \
            --model_variant_id varID
        mv output/imputed-grex/{wildcards.tissue}_predicted_expression.txt \
            output/imputed-grex/{wildcards.tissue}.igrex.txt
        gzip output/imputed-grex/{wildcards.tissue}.igrex.txt
        """
jobs: 99
local-cores: 1
use-singularity: True
cluster: "sbatch -A MRC-BSU-SL2-CPU -p skylake --nodes 1 --ntasks 1 --cpus-per-task 1 --mem 16000 -t 02:00:00 --job-name '{rule}[{wildcards}]' --output 'logs/{rule}[{wildcards}].log' --error 'logs/{rule}[{wildcards}].log'"
singularity-args: "-H $PWD"
singularity-prefix: "."
# tissues to impute expression for, all brain except Pituitary and
# Spinal cord (cervical c-1)
tissues:
- 'Brain_Amygdala'
- 'Brain_Anterior_cingulate_cortex_BA24'
- 'Brain_Caudate_basal_ganglia'
- 'Brain_Cerebellar_Hemisphere'
- 'Brain_Cerebellum'
- 'Brain_Cortex'
- 'Brain_Frontal_Cortex_BA9'
- 'Brain_Hippocampus'
- 'Brain_Hypothalamus'
- 'Brain_Nucleus_accumbens_basal_ganglia'
- 'Brain_Putamen_basal_ganglia'
- 'Brain_Substantia_nigra'
data_date: "20200306"
cohort: "all-acgm-filtered"
Version: 1.0
RestoreWorkspace: Default
SaveWorkspace: Default
AlwaysSaveHistory: Default
EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 4
Encoding: UTF-8
RnwWeave: Sweave
LaTeX: pdfLaTeX
AutoAppendNewline: Yes
StripTrailingWhitespace: Yes
# Ignore everything in this directory
*
# Except this file
!.gitignore
#!/bin/bash
set -ex
sudo singularity build center-impute-grex.sif singularity.def
singularity push -U center-impute-grex.sif library://kkmann/default/center-grex-imputation:latest
#!/usr/bin/env Rscript
options(tidyverse.quiet = TRUE)
suppressPackageStartupMessages(library(tidyverse, warn.conflicts = FALSE))
library(glue, warn.conflicts = FALSE)
args <- commandArgs(trailingOnly = TRUE)
chromosome <- as.integer(args[[1]])
tbl_gtex_lookup <- read_csv(
"output/model-variants-lookup-table.csv.gz",
col_types = cols(
chr = col_character(),
position = col_integer(),
rsid = col_character(),
gtex_id = col_character(),
allele1 = col_character(),
allele2 = col_character()
)
) %>%
filter(chr == glue("chr{chromosome}")) %>%
select(chr, position) %>%
arrange(position) %>%
write_delim(
path = glue("output/model-variant-positions-chromosome-{chromosome}.txt"),
delim = "\t",
col_names = FALSE
)
#!/usr/bin/env Rscript
options(tidyverse.quiet = TRUE)
suppressPackageStartupMessages(library(tidyverse, warn.conflicts = FALSE))
library(glue, warn.conflicts = FALSE)
library(DBI)
tissues <- yaml::read_yaml("config/parameters.yml")$tissues
tbl_weights <- map(
tissues,
function(tissue) {
con <- dbConnect(
RSQLite::SQLite(),
glue("output/weights/en_{tissue}.db")
)
res <- dbReadTable(con, "weights") %>%
as_tibble() %>%
mutate(tissue = tissue) %>%
select(tissue, everything())
dbDisconnect(con)
return(res)
}
) %>%
bind_rows() %>%
separate(
varID,
into = c("chr", "position", "allele1", "allele2", "dummy"),
sep = "_",
remove = FALSE
) %>% {
assertthat::assert_that(
all((.$allele1 == .$ref_allele) & (.$allele2 == .$eff_allele))
)
.
} %>%
select(
chr, position, rsid, varID, allele1, allele2
) %>%
rename(
gtex_id = varID
) %>%
mutate(
position = as.integer(position)
) %>%
distinct()
write_csv(
tbl_weights,
"output/model-variants-lookup-table.csv.gz"
)
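The `separate()` call above relies on the GTEx v8 variant identifier convention `chr_position_ref_alt_build`. A quick, self-contained illustration with a hypothetical variant:

```
library(dplyr)
library(tidyr)

tibble(varID = "chr1_100_A_G_b38") %>%
  separate(
    varID,
    into = c("chr", "position", "allele1", "allele2", "dummy"),
    sep = "_",
    remove = FALSE
  )
# -> chr = "chr1", position = "100", allele1 = "A",
#    allele2 = "G", dummy = "b38" (genome build tag)
```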
options(
repos = list(
"https://mran.revolutionanalytics.com/snapshot/2020-03-01/"
)
)
install.packages("tidyverse")
install.packages("pander")
install.packages("vroom")
install.packages("DBI")
install.packages("RSQLite")
---
title: "Mapping hg38 to GTEx model variants"
author: "Kevin Kunzmann"
date: "`r Sys.time()`"
output:
html_document:
code_folding: hide
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE
)
library(tidyverse, warn.conflicts = FALSE)
library(glue, warn.conflicts = FALSE)
set.seed(42)
# load data
tbl_unmatched <- glue(
"output/dosages/unmatched-chromosome-{1:23}.csv.gz"
) %>%
map(
~tryCatch(
read_csv(
., col_types = cols(
chr = col_character(),
position = col_integer(),
rsid = col_character(),
gtex_id = col_character(),
allele1..model = col_character(),
allele2..model = col_character(),
allele1..dosage = col_character(),
allele2..dosage = col_character()
)
),
error = function(e) tibble(
chr = character(0L),
position = integer(0L),
rsid = character(0L),
gtex_id = character(0L),
allele1..model = character(0L),
allele2..model = character(0L),
allele1..dosage = character(0L),
allele2..dosage = character(0L)
)
)
) %>%
bind_rows() %>%
mutate(
chromosome = str_extract(
chr,
"(?<=^chr)[1-9]{1}[0-9]{0,1}$"
) %>%
as.integer()
) %>%
select(-chr) %>%
select(chromosome, everything()) %>%
arrange(chromosome, position)
tbl_model_variants <- read_csv(
"output/model-variants-lookup-table.csv.gz",
col_types = cols(
chr = col_character(),
position = col_integer(),
rsid = col_character(),
gtex_id = col_character(),
allele1 = col_character(),
allele2 = col_character()
)
)
```
## Unmatched variants
```{r}
tbl_model_variants %>%
mutate(
chromosome = str_extract(
chr,
"(?<=^chr)[1-9]{1}[0-9]{0,1}$"
) %>%
as.integer()
) %>%
select(-chr) %>%
select(chromosome, everything()) %>%
mutate(
missing = gtex_id %in% tbl_unmatched$gtex_id
) %>%
group_by(chromosome) %>%
summarize(
`# missing` = sum(missing),
`% missing` = sum(missing) / n() * 100
) %>%
pander::pander(
caption = "Number of unmatched model variants per chromosome",
digits = 2
)
```
```{r}
complementary_base <- function(allele) {
case_when(
allele == "A" ~ "T",
allele == "T" ~ "A",
allele == "G" ~ "C",
allele == "C" ~ "G"
) %>% {
assertthat::assert_that(!any(is.na(.)));
.
}
}
tbl_unmatched %>%
filter(
complementary_base(allele1..model) == allele1..dosage
) %>%
pander::pander(
caption = "Variants that match on complementary strand",
split.table = Inf
)
```
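To make the strand-flip table above concrete: a model variant recorded with alleles A/G can only match a dosage entry reported as T/C on the opposite strand, since each base is complemented (hypothetical example):

```{r}
# hypothetical strand flip: model alleles A/G vs. dosage alleles T/C
complementary_base(c("A", "G"))
# [1] "T" "C"
```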
## Session Info
```{r, echo=FALSE}
sessionInfo() %>%
pander::pander()
```
#!/usr/bin/env Rscript
options(tidyverse.quiet = TRUE)
suppressPackageStartupMessages(library(tidyverse, warn.conflicts = FALSE))
library(glue, warn.conflicts = FALSE)
args <- commandArgs(trailingOnly = TRUE)
chromosome <- as.integer(args[[1]])
dosages_file <- glue("output/dosages/chromosome-{chromosome}.dosage.csv")
out_file_matched <- glue("output/dosages/matched-chromosome-{chromosome}.dosage.txt.gz")
out_file_unmatched <- glue("output/dosages/unmatched-chromosome-{chromosome}.csv.gz")
sample_ids <- read_delim(
"output/dosages/samples.txt",
delim = "\t",
col_types = "cc",
col_names = c("FID", "IID")
)$IID
tbl_model_variants <- read_csv(
"output/model-variants-lookup-table.csv.gz",
col_types = cols(
chr = col_character(),
position = col_integer(),
rsid = col_character(),
gtex_id = col_character(),
allele1 = col_character(),
allele2 = col_character()
)
) %>%
filter(
chr == glue("chr{chromosome}")
)
tbl_dosage <- read_csv(
glue("output/dosages/chromosome-{chromosome}.dosage.csv"),
trim_ws = TRUE,
na = ".",
progress = TRUE,
col_names = c(
"chr",
"rsid",
"position",
"allele1",
"allele2",
"MAF",
sample_ids
),
col_types = cols(
.default = col_double(),
chr = col_character(),
rsid = col_character(),
position = col_integer(),
allele1 = col_character(),
allele2 = col_character()
)
)
tbl_matched <- tbl_model_variants %>%
inner_join(
select(tbl_dosage, -chr, -rsid),
by = c("position", "allele1", "allele2")
)
write_delim(
tbl_matched %>%
arrange(position) %>%
select(-rsid) %>%
select(chr, gtex_id, position, allele1, allele2, MAF, everything()) %>%
rename(chromosome = chr) %>%
filter(complete.cases(.)),
delim = " ",
col_names = FALSE,
path = out_file_matched
)
tbl_unmatched <- left_join(
tbl_model_variants,
select(tbl_dosage, position, allele1, allele2),
by = "position",
suffix = c("..model", "..dosage")
) %>%
filter(
!(gtex_id %in% tbl_matched$gtex_id)
)
write_csv(
tbl_unmatched,
path = out_file_unmatched
)
cat(sprintf(
"%s: %9i (%6.2f%%) model variants found for chromosome %i\n\r",
Sys.time(),
nrow(tbl_matched), 100*nrow(tbl_matched)/nrow(tbl_model_variants),
chromosome
))
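The two joins above drive the whole matching step: the `inner_join` on position and both alleles yields the matched dosages, while the `left_join` on position alone surfaces candidates whose alleles disagree. A toy sketch with made-up variants:

```
library(dplyr)

# two hypothetical model variants on one chromosome
model <- tibble(
  gtex_id  = c("chr1_100_A_G_b38", "chr1_200_C_T_b38"),
  position = c(100L, 200L),
  allele1  = c("A", "C"),
  allele2  = c("G", "T")
)
# dosage data: position 100 matches alleles, position 200 is strand-flipped
dosage <- tibble(
  position = c(100L, 200L),
  allele1  = c("A", "G"),
  allele2  = c("G", "A")
)

inner_join(model, dosage, by = c("position", "allele1", "allele2"))
# -> only chr1_100_A_G_b38 is matched

left_join(model, dosage, by = "position", suffix = c("..model", "..dosage"))
# -> chr1_200_C_T_b38 appears with conflicting allele columns; only such
#    rows survive the filter and end up in the "unmatched" output
```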
Bootstrap: docker
From: rocker/verse:3.6.2

%files
install.R /tmp/install.R

%post
# non-interactive debconf
export DEBIAN_FRONTEND=noninteractive

apt-get update && apt-get -y install google-cloud-sdk

# install python3 and snakemake
apt-get -y install python3 python3-pip
pip3 install snakemake==5.10.0

# install bcftools
export BCFVER=1.10.2
apt-get -y install \
gcc wget make zlib1g zlib1g-dev libbz2-dev liblzma-dev libcurl4-openssl-dev pv
wget https://github.com/samtools/bcftools/releases/download/$BCFVER/bcftools-$BCFVER.tar.bz2
tar -xvjf bcftools-$BCFVER.tar.bz2
cd bcftools-$BCFVER
./configure --prefix=/usr/bcftools
make
make install
(cd /usr/bin; ln -s /usr/bcftools/bin/bcftools bcftools)

# install PrediXcan and python (2) dependencies
apt-get -y install \
python-pip
pip install \
numpy==1.16.6
mkdir -p /usr/PrediXcan
(cd /usr; \
git clone https://github.com/hakyimlab/PrediXcan; \
cd PrediXcan; \
git checkout e77dd8a04a0345cb63aa634d4f8acc6aca9e25e0)
ln -s /usr/PrediXcan/Software/PrediXcan.py /usr/bin/predixcan
chmod +x /usr/bin/predixcan

# install bc
apt-get -y install bc

# install R packages
Rscript /tmp/install.R