Commit b0e5c713 authored by Kevin Kunzmann's avatar Kevin Kunzmann

Refactor to gtex v8

parent 8d918f3c
.*
.config
.snakemake
output
logs
nohup.out
container.sif
*.err
*.out
*.sif
*.simg
*.html
*.pdf
.DS_Store
.Rproj.user
# Impute gene expression for CENTER-TBI with PrediXcan
# Impute genetically regulated gene expression for CENTER-TBI using PrediXcan
The singularity container with most software dpendencies is available at
https://doi.org/10.5281/zenodo.3376504.
Data currently needs to be accessed manually due to access restrictions.
This workflow is design for *.vcf.gz files with dosage (DS) information.
This repository contains all scripts necessary to run the impute genetically
regulated gene expression (GREx) using the
[PrediXcan](https://github.com/hakyimlab/MetaXcan) methodology for
[CENTER-TBI](https://www.center-tbi.eu/) imputed whole genomes data.
More information on PrediXcan can be found here https://github.com/hakyimlab/PrediXcan and in:
> Gamazon ER†, Wheeler HE†, Shah KP†, Mozaffari SV, Aquino-Michaels K, Carroll RJ, Eyler AE, Denny JC,
Nicolae DL, Cox NJ, Im HK. (2015) A gene-based association method for mapping traits using reference transcriptome data.
[1] Gamazon ER†, Wheeler HE, Shah KP, Mozaffari SV, Aquino-Michaels K, Carroll RJ, Eyler AE, Denny JC,
Nicolae DL, Cox NJ, Im HK. (2015) *A gene-based association method for mapping traits using reference transcriptome data.*
Nat Genet. doi:10.1038/ng.3367.
## Dependencies
1. linux shell (`bash`), possibly via virtual machine on Windows/Mac
2. `wget` (pre-installed or via distribution package manager)
3. `singularity` container software (tested on 3.3.0, https://sylabs.io/guides/3.3/user-guide)
4. `git` (https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
5. for data download fimm GCP bucket access to `fimm-horizon-outgoing-data/CENTER_TBI_data_freeze_190829/Imputed_data`
### Optional
5. python 3.7+ and snakemake
6. slurm cluster
We use snakemake to organize the workflow (also pre-installed in the container) and support cluster execution.
Snakemake is available via `pip` package for python 3.7.
> Johannes Köster, Sven Rahmann, Snakemake—a scalable bioinformatics workflow engine, Bioinformatics,
Volume 28, Issue 19, 1 October 2012, Pages 2520–2522, https://doi.org/10.1093/bioinformatics/bts480
## Execution
Download and extract the contents of this repository (might be access restricted)
[2] Barbeira, A., Shah, K. P., Torres, J. M., Wheeler, H. E., Torstenson, E. S., Edwards, T., ... & Im, H. K. (2016). *MetaXcan: summary statistics based gene-level association method infers accurate PrediXcan results.* BioRxiv, 045260.
git clone https://github.com/kkmann/impute-gene-expression
cd impute-gene-expression
Download the container image
bash scripts/download_container.sh
Obtain the imputed genomes (GCP bucket: fimm-horizon-outgoing-data/CENTER_TBI_data_freeze_190829/Imputed_data).
`gsutils` is pre-installed in the container image, to authenticate with your
GCP account run and follow the interactive instructions
singularity shell container.sif
gcloud auth login
snakemake download_imputed_genotypes
exit
The pipeline itself is implemented using [snakemake](https://snakemake.readthedocs.io/en/stable/) and
[singularity](https://sylabs.io/) containers to guarantee reproducibility.
Execute the workflow inside the container on a single core (takes a while!)
singularity exec container.sif snakemake impute
Optionally, if snakemake is installed, the workflow can be run in parallel via
snakemake --use-singularity -j 8 impute
## Dependencies
where '8' can be replaced by the number of available cores.
Cluster execution is enables via the `scripts/slurm_snakemake.sh` script as
1. [`Linux` system](https://ubuntu.com/download/desktop)
2. [`git`](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
3. [`singularity`](https://sylabs.io/guides/3.5/user-guide/quick_start.html#quick-installation-steps) container software (tested on 3.5.0)
4. *optionally:* [Python 3](https://www.python.org/) and [snakemake](https://github.com/snakemake/snakemake)
bash scripts/slurm-snakemake.sh impute
To access the imputed genotypes, access to the
[FIMM](https://www.fimm.fi/) GCS bucket
`fimm-horizon-outgoing-data` is necessary (requires a FIMM user account).
It is recommended to run the workflow on a cluster system,
since execution time on a desktop machine may take several hours.
An example [slurm](https://slurm.schedmd.com/documentation.html)
cluster [profile](https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles) is saved in `config/snakemake/mrc-bsu-cluster`
for reference.
## Results
an output folder is created with the PrediXcan dosage files and imputed gene
expressions.
## Execution
## Versions
### Local execution
release tags should point to respective FIMM imputed genotype source data.
1. Clone this repository and change the working directory to the newly
created folder
```
git clone https://git.center-tbi.eu/kunzmann/impute-gene-expression.git
cd impute-gene-expression
```
Make sure the ckeck out the exact version of the pipeline you want to run,
either by comit hash, or an available tag, e.g. via
```
git checkout v0.1.0
```
2. Download the container image specified at the to of the `Snakefile`
(singularity: ...) manually via, e.g. (make sure to adjust to the exact specification!)
```
singularity pull library://kkmann/default/center-grex-imputation:[tag]
```
where `[tag]` is the desired container tag or hash (see Snakefile to make
sure that you download the exact right version).
This will download the container from the sylab cloud and store it as
`center-grex-imputation_[tag].sif`.
3. Authenticate with your FIMM account
```
singularity shell -H $PWD [container-name].sif
[container-name].sif> gcloud auth login
```
Here, the `-H $PWD` flag ensures that singularity does not mount the users
home directory but uses the working directory instead.
Follow the on-screen instructions to authenticate in your browser,
then leave the container
```
[container-name].sif> exit
```
This should have creates a hidden folder `.config/gcloud` with authentification
details in your working directory.
4. Adjust tissues, data date, and target cohort in `config/parameters.yaml`
4. Run the pipeline
```
singularity exec -H $PWD -j 1 [container-name].sif snakemake all
```
Here `-j 1` specifies the maximal number of parallel jobs.
Increase this on more powerful machines but bear in mind that roughly 16Gb
of RAM per job are required to avoid out of memory issues.
An output folder is created with intermediate files, a mapping report,
and the imputed GREx levels per tissue in `output/imputed-grex`.
### Cluster execution
1. Same as the first step in 'Local execution'.
2. Make sure that `Python 3` is available and `snakemake` installed
(e.g. load Python via modules and install snakemake into a virtual environment)
3. Configure snakemake for your execution environment and store the profile in,
e.g., `config/my-cluster`.
It is highly recommended to use the `-H $PWD` flag to make sure that there
is no interference with configuration files int he users home directory.
3. Run
```
snakemake --profile config/my-cluster container
```
to download the correct container image automatically.
This will create a file `[container-hash].simg` in the working folder.
4. Authenticate with your FIMM account as in 'Local execution', point 3.,
with a shell in the downloaded `[container-hash].simg`
5. Adjust tissues, data date, and target cohort in `config/parameters.yaml`
6. Run the distributed pipeline
```
snakemake --profile config/snakemake/my-cluster all
```
## Logo
......
# edit configuration here (see cluster.json for running on slurm scheduler)
configfile: "config.yml"
# define the singularity (https://sylabs.io/) container image to use
# see also https://cloud.sylabs.io/library/kkmann/default/center-grex-imputation#
singularity: "library://kkmann/default/center-grex-imputation:sha256.321833b357b20feda51877dcef6e9bac2dfa706e9ed6c32da39baaa8ead9e1f7"
localrules: impute, clean, download_container, generate_samples_file
# modify this file to define global parameters
# (e.g. set of tissues to impute genetically regulated expression levels for)
configfile: "config/parameters.yml"
# these rules are computationally less expensive and run locally, even in
# cluster execution mode
localrules: all, container, download_genome, download_weights, samples_file,
model_variant_lookup_table, mapping_report
# this rule runs the entire pipeline and creates a R data set file (rds)
# containing all imputed expression values
rule impute:
# this rule is exected by default, i.e.
# > snakemake
# is equivalent to
# > snakemake all
# the target imputes gene expression for all tissues listed in the config file
# and the report on mapping mopdel variants to available dosage information.
rule all:
input:
"container.sif",
expand("{output_dir}/imputed-gene-expressions/{region}.expression.txt.gz", region = config['brain_regions'], output_dir = config['output_dir'])
expand(
"output/imputed-grex/{tissue}.igrex.txt.gz",
tissue=config["tissues"]
),
"output/mapping-report.html"
# delete output and logs (if run on slurm cluster)
rule clean:
# dummy rule doing nothing; can be used to download the required container;
# this might be helpful to set up authentification with the data servers
# (requires an interactive step); see readme.md for details
rule container:
shell:
"""
rm -rf {config[output_dir]}
rm -rf logs
rm -rf nohup.out
echo done.
"""
# download vcf files using gsutil (needs to be set up and configured!
rule download_imputed_genotype_chromosome:
# download vcf files using gsutil (needs to be set up and configured!)
rule download_genome:
output:
"output/imputed-genotypes/CENTER_TBI_imputed_genotype_chr{i}.vcf.gz"
singularity:
"container.sif"
"output/imputed-genotypes/chromosome-{chr}.vcf.gz",
"output/imputed-genotypes/chromosome-{chr}.vcf.gz.tbi"
shell:
"""
set -ex
mkdir -p output/imputed-genotypes
gsutil cp gs://fimm-horizon-outgoing-data/CENTER_TBI_data_freeze_191014/Imputed_data/CENTER.TBI_imputed_N3741_hg19_HRC_MAC1_freeze_191014_chr{wildcards.i}.vcf.gz \
output/imputed-genotypes/CENTER_TBI_imputed_genotype_chr{wildcards.i}.vcf.gz
FILEDIR=gs://fimm-horizon-outgoing-data/genetic-data/imputed-genomes/{config[data_date]}/{config[cohort]}
gsutil cp $FILEDIR/chromosome-{wildcards.chr}.vcf.gz \
output/imputed-genotypes/chromosome-{wildcards.chr}.vcf.gz
gsutil cp $FILEDIR/chromosome-{wildcards.chr}.vcf.gz.tbi \
output/imputed-genotypes/chromosome-{wildcards.chr}.vcf.gz.tbi
"""
rule download_imputed_genotypes:
# download PrediXcan EQTL weights
rule download_weights:
output:
directory("output/weights")
shell:
"""
set -ex
mkdir -p logs
mkdir -p output/weights
# wget https://zenodo.org/record/3518299/files/mashr_eqtl.tar?download=1 \
-O output/weights/eqtl.tar
wget https://zenodo.org/record/3519321/files/elastic_net_eqtl.tar?download=1 \
-O output/weights/eqtl.tar
tar -C output/weights -xvf output/weights/eqtl.tar
rm output/weights/eqtl.tar
mv output/weights/elastic_net_models/* output/weights/
rm -rf output/weights/elastic_net_models
"""
# generate a table with all model variants
rule model_variant_lookup_table:
input:
expand("output/imputed-genotypes/CENTER_TBI_imputed_genotype_chr{i}.vcf.gz",
i = list(map(str, range(1, 23))) + ['X']
)
rules.download_weights.output
output:
"output/model-variants-lookup-table.csv.gz"
shell:
"""
set -ex
Rscript scripts/get-model-variants.R
"""
# download the required singularity container from zenodo.org
rule download_container:
# generate the samples file in the format required by PrediXcan
rule samples_file:
input:
"output/imputed-genotypes/chromosome-1.vcf.gz"
output:
"container.sif"
"output/dosages/samples.txt"
shell:
"""
bash scripts/download_container.sh
set -ex
mkdir -p output/dosages
# query sample ids from first chromosome
bcftools query -l {input} > output/samples.txt
# predixcan allows family and sample id, duplicate column and overwrite
Rscript -e "library(dplyr); readr::read_tsv('output/samples.txt', col_names = 'SID', col_types = 'c') %>% dplyr::mutate(FID = SID) %>% readr::write_tsv(path = 'output/dosages/samples.txt', col_names = FALSE)"
"""
# extract dosages in custom PrediXcan format from vcf file
rule vcf_to_dosages:
# extract and map dosage information to model variants by position and alleles
# (only works with hg38 inputs, otherwise a liftover needs to be performed!)
rule dosage:
input:
"container.sif",
vcf_gz_file = "output/imputed-genotypes/CENTER_TBI_imputed_genotype_chr{i}.vcf.gz"
vcf = "output/imputed-genotypes/chromosome-{chr}.vcf.gz",
index = "output/imputed-genotypes/chromosome-{chr}.vcf.gz.tbi",
lookup = "output/model-variants-lookup-table.csv.gz",
samples = "output/dosages/samples.txt"
output:
"{output_dir}/dosages/chr{i}.dosage.txt.gz"
singularity:
"container.sif"
"output/dosages/matched-chromosome-{chr}.dosage.txt.gz",
"output/dosages/unmatched-chromosome-{chr}.csv.gz"
shell:
"""
export prefix={wildcards.output_dir}/dosages
mkdir -p $prefix
echo "decompress and filter out: 'MAF[0]<{config[min_MAF]} | INFO<{config[min_INFO]} | TYPE!="snp" | N_ALT!=1' ..."
bcftools filter -e 'MAF[0]<{config[min_MAF]} | INFO<{config[min_INFO]} | TYPE!="snp" | N_ALT!=1' {input.vcf_gz_file} > $prefix/chr{wildcards.i}_.vcf
echo "add MAFs ..."
bcftools +fill-tags $prefix/chr{wildcards.i}_.vcf > $prefix/chr{wildcards.i}.vcf
rm $prefix/chr{wildcards.i}_.vcf
echo 'querying dosages ...'
bcftools query -f '%CHROM %ID %POS %REF %ALT %INFO/MAF [%DS ]\n' $prefix/chr{wildcards.i}.vcf > $prefix/chr{wildcards.i}.dosage.txt
rm $prefix/chr{wildcards.i}.vcf
echo 'compressing ...'
gzip $prefix/chr{wildcards.i}.dosage.txt
printf "done.\n\r\n\r"
"""
# extract sample file for PrediXcan
rule generate_samples_file:
# setup
set -ex
mkdir -p output/dosages
# create file with chromosomal positions for all model variants
Rscript scripts/get-model-variant-positions.R {wildcards.chr}
nlines=$(wc -l < "output/model-variant-positions-chromosome-{wildcards.chr}.txt" | bc)
if [[ $nlines == 0 ]]
then
touch "output/dosages/matched-chromosome-{wildcards.chr}.dosage.txt.gz"
touch "output/dosages/unmatched-chromosome-{wildcards.chr}.csv.gz"
else
# filter for model variant positions and extract dosage information
bcftools query \
-o "output/dosages/chromosome-{wildcards.chr}.dosage.csv" \
-R "output/model-variant-positions-chromosome-{wildcards.chr}.txt" \
-f "%CHROM,%ID,%POS,%REF,%ALT,%INFO/MAF[,%DS]\n\r" \
{input.vcf}
# match model variants with extracted dosage
Rscript scripts/match-variants.R {wildcards.chr}
# cleanup
rm "output/dosages/chromosome-{wildcards.chr}.dosage.csv"
fi
# cleanup
rm "output/model-variant-positions-chromosome-{wildcards.chr}.txt"
"""
# build an overview over the found model SNPS per chromosome
rule mapping_report:
input:
"container.sif",
vcf_gz_file = "output/imputed-genotypes/CENTER_TBI_imputed_genotype_chr1.vcf.gz"
dosage_files = expand(
"output/dosages/matched-chromosome-{chr}.dosage.txt.gz",
chr=range(1,24)
)
output:
"{output_dir}/dosages/samples.txt"
singularity:
"container.sif"
params:
format = "\'{ printf(\"%s %s\\n\", $1, $1); }\'"
"output/mapping-report.html"
shell:
"""
export prefix={wildcards.output_dir}/dosages
mkdir -p $prefix
bcftools query -l {input.vcf_gz_file} >> $prefix/samples_.txt
# family ID = individual ID
awk {params.format} < $prefix/samples_.txt > $prefix/samples.txt
rm $prefix/samples_.txt
set -ex
Rscript -e "rmarkdown::render('scripts/mapping-report.Rmd', output_dir = 'output', output_file = 'mapping-report.html', knit_root_dir = '$PWD')"
"""
# run PrediXcan to impute gene expression for individual tissue type
rule impute_gene_expressions:
# impute GREx for a specific tissue (sums up the mapped dosages times the
# PrediXcan weights)
rule grex:
input:
"container.sif",
samples_file = expand("{output_dir}/dosages/samples.txt", output_dir = config['output_dir']),
dosage_files = expand("{output_dir}/dosages/chr{i}.dosage.txt.gz",
i = list(map(str, range(1, 23))) + ['X'],
output_dir = config['output_dir']
samples_file = "output/dosages/samples.txt",
dosage_files = expand(
"output/dosages/matched-chromosome-{chr}.dosage.txt.gz",
chr=range(1,24)
)
output:
"{output_dir}/imputed-gene-expressions/{region}.expression.txt.gz"
singularity:
"container.sif"
"output/imputed-grex/{tissue}.igrex.txt.gz"
shell:
"""
mkdir -p {wildcards.output_dir}/imputed-gene-expressions
mkdir -p output/imputed-grex
predixcan \
--predict \
--dosages {config[output_dir]}/dosages \
--dosages_prefix chr \
--dosages output/dosages \
--dosages_prefix matched-chromosome- \
--samples samples.txt \
--weights /usr/predixcan/GTEx-V7_HapMap-2017-11-29/gtex_v7_Brain_{wildcards.region}_imputed_europeans_tw_0.5_signif.db \
--output_prefix {wildcards.output_dir}/imputed-gene-expressions/{wildcards.region}
mv {wildcards.output_dir}/imputed-gene-expressions/{wildcards.region}_predicted_expression.txt \
{wildcards.output_dir}/imputed-gene-expressions/{wildcards.region}.expression.txt
gzip {wildcards.output_dir}/imputed-gene-expressions/{wildcards.region}.expression.txt
--weights output/weights/en_{wildcards.tissue}.db \
--output_prefix output/imputed-grex/{wildcards.tissue} \
--model_variant_id varID
mv output/imputed-grex/{wildcards.tissue}_predicted_expression.txt \
output/imputed-grex/{wildcards.tissue}.igrex.txt
gzip output/imputed-grex/{wildcards.tissue}.igrex.txt
"""
{
"__default__" :
{
"account" : "MRC-BSU-SL2-CPU",
"partition" : "skylake",
"time" : "02:00:00",
"ntasks" : "1",
"nodes" : "1",
"ncpu" : "1",
"name" : "{rule}__{wildcards}",
"output" : "logs/{rule}__{wildcards}.out",
"error" : "logs/{rule}__{wildcards}.out"
}
}
# where to put things
output_dir: 'output'
# suggested filtering options for PrediXcan
min_MAF: '0.01'
min_INFO: '0.8'
# brain regions to impute expression for
brain_regions:
- 'Amygdala'
- 'Anterior_cingulate_cortex_BA24'
- 'Caudate_basal_ganglia'
- 'Cerebellar_Hemisphere'
- 'Cerebellum'
- 'Cortex'
- 'Frontal_Cortex_BA9'
- 'Hippocampus'
- 'Hypothalamus'
- 'Nucleus_accumbens_basal_ganglia'
- 'Putamen_basal_ganglia'
- 'Spinal_cord_cervical_c-1'
- 'Substantia_nigra'
jobs: 99
local-cores: 1
use-singularity: True
cluster: "sbatch -A MRC-BSU-SL2-CPU -p skylake --nodes 1 --ntasks 1 --cpus-per-task 1 --mem 16000 -t 02:00:00 --job-name '{rule}[{wildcards}]' --output 'logs/{rule}[{wildcards}].log' --error 'logs/{rule}[{wildcards}].log'"
singularity-args: "-H $PWD"
singularity-prefix: "."
# tissues to impute expression for, all brain except Pituitary and
# Spinal cord (cervical c-1)
tissues:
- 'Brain_Amygdala'
- 'Brain_Anterior_cingulate_cortex_BA24'
- 'Brain_Caudate_basal_ganglia'
- 'Brain_Cerebellar_Hemisphere'
- 'Brain_Cerebellum'
- 'Brain_Cortex'
- 'Brain_Frontal_Cortex_BA9'
- 'Brain_Hippocampus'
- 'Brain_Hypothalamus'
- 'Brain_Nucleus_accumbens_basal_ganglia'
- 'Brain_Putamen_basal_ganglia'
- 'Brain_Substantia_nigra'
data_date: "20200306"
cohort: "all-acgm-filtered"
Version: 1.0
RestoreWorkspace: Default
SaveWorkspace: Default
AlwaysSaveHistory: Default
EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 4
Encoding: UTF-8
RnwWeave: Sweave
LaTeX: pdfLaTeX
AutoAppendNewline: Yes
StripTrailingWhitespace: Yes
# Ignore everything in this directory
*
# Except this file
!.gitignore
#!/bin/bash
set -ex
sudo singularity build center-impute-grex.sif singularity.def
singularity push -U center-impute-grex.sif library://kkmann/default/center-grex-imputation:latest
wget https://zenodo.org/record/3376504/files/container.sif -O container.sif
#!/usr/bin/env Rscript
options(tidyverse.quite = TRUE)
suppressPackageStartupMessages(library(tidyverse, warn.conflicts = FALSE))
library(glue, warn.conflicts = FALSE)
args <- commandArgs(trailingOnly = TRUE)
chromosome <- as.integer(args[[1]])
tbl_gtex_lookup <- read_csv(
"output/model-variants-lookup-table.csv.gz",
col_types = cols(
chr = col_character(),
position = col_integer(),
rsid = col_character(),
gtex_id = col_character(),
allele1 = col_character(),
allele2 = col_character()
)
) %>%
filter(chr == glue("chr{chromosome}")) %>%
select(chr, position) %>%
arrange(position) %>%
write_delim(
path = glue("output/model-variant-positions-chromosome-{chromosome}.txt"),
delim = "\t",
col_names = FALSE
)
#!/usr/bin/env Rscript
options(tidyverse.quite = TRUE)
suppressPackageStartupMessages(library(tidyverse, warn.conflicts = FALSE))
library(glue, warn.conflicts = FALSE)
library(DBI)
tissues <- yaml::read_yaml("config/parameters.yml")$tissues
tbl_weights <- map(
tissues,
function(tissue) {
con <- dbConnect(
RSQLite::SQLite(),
glue("output/weights/en_{tissue}.db")
)
res <- dbReadTable(con, "weights") %>%
as_tibble() %>%
mutate(tissue = tissue) %>%
select(tissue, everything())
dbDisconnect(con)
return(res)
}
) %>%
bind_rows() %>%
separate(
varID,
into = c("chr", "position", "allele1", "allele2", "dummy"),
sep = "_",
remove = FALSE
) %>% {
assertthat::assert_that(
all((.$allele1 == .$ref_allele) & (.$allele2 == .$eff_allele))
)
.
} %>%
select(
chr, position, rsid, varID, allele1, allele2
) %>%
rename(
gtex_id = varID
) %>%
mutate(
position = as.integer(position)
) %>%
distinct()
write_csv(
tbl_weights,
"output/model-variants-lookup-table.csv.gz"
)
options(
repos = list(
"https://mran.revolutionanalytics.com/snapshot/2020-03-01/"
)
)
install.packages("tidyverse")
install.packages("pander")
install.packages("vroom")
install.packages("DBI")
install.packages("RSQLite")
---
title: "Mapping hg38 to GTEx model variants"
author: "Kevin Kunzmann"
date: "`r Sys.time()`"
output:
html_document:
code_folding: hide
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE
)
library(tidyverse, warn.conflicts = FALSE)
library(glue, warn.conflicts = FALSE)
set.seed(42)
# load data
tbl_unmatched <- glue(
"output/dosages/unmatched-chromosome-{1:23}.csv.gz"
) %>%
map(
~tryCatch(
read_csv(
., col_types = cols(
chr = col_character(),
position = col_integer(),
rsid = col_character(),
gtex_id = col_character(),
allele1..model = col_character(),
allele2..model = col_character(),
allele1..dosage = col_character(),
allele2..dosage = col_character()
)
),
error = function(e) tibble(
chr = character(0L),
position = integer(0L),
rsid = character(0L),
gtex_id = character(0L),
allele1..model = character(0L),
allele2..model = character(0L),
allele1..dosage = character(0L),
allele2..dosage = character(0L)
)
)
) %>%
bind_rows() %>%
mutate(
chromosome = str_extract(
chr,
"(?<=^chr)[1-9]{1}[0-9]{0,1}$"
) %>%
as.integer()
) %>%
select(-chr) %>%
select(chromosome, everything()) %>%
arrange(chromosome, position)
tbl_model_variants <- read_csv(
"output/model-variants-lookup-table.csv.gz",
col_types = cols(
chr = col_character(),
position = col_integer(),
rsid = col_character(),
gtex_id = col_character(),
allele1 = col_character(),
allele2 = col_character()
)
)
```
### Unmatched variants
```{r}
tbl_model_variants %>%
mutate(
chromosome = str_extract(
chr,
"(?<=^chr)[1-9]{1}[0-9]{0,1}$"
) %>%
as.integer()
) %>%
select(-chr) %>%
select(chromosome, everything()) %>%
mutate(
missing = gtex_id %in% tbl_unmatched$gtex_id
) %>%
group_by(chromosome) %>%
summarize(
`# missing` = sum(missing),
`% missing` = sum(missing) / n() * 100
) %>%
pander::pander(
caption = "Number of unmatched model variants per chromosome",
digits = 2
)
```
```{r}
complementary_base <- function(allele) {
case_when(
allele == "A" ~ "T",
allele == "T" ~ "A",
allele == "G" ~ "C",
allele == "C" ~ "G"
) %>% {
assertthat::assert_that(!any(is.na(.)));
.
}
}
tbl_unmatched %>%
filter(
complementary_base(allele1..model) == allele1..dosage
) %>%
pander::pander(
caption = "Variants that match on complementary strand",
split.table = Inf
)
```
## Session Info
```{r, echo=FALSE}
sessionInfo() %>%
pander::pander()
```
#!/usr/bin/env Rscript
options(tidyverse.quite = TRUE)
suppressPackageStartupMessages(library(tidyverse, warn.conflicts = FALSE))
library(glue, warn.conflicts = FALSE)
args <- commandArgs(trailingOnly = TRUE)
chromosome <- as.integer(args[[1]])
dosages_file <- glue("output/dosages/chromosome-{chromosome}.dosage.txt.gz")
out_file_matched <- glue("output/dosages/matched-chromosome-{chromosome}.dosage.txt.gz")
out_file_unmatched <- glue("output/dosages/unmatched-chromosome-{chromosome}.csv.gz")
sample_ids <- read_delim(
"output/dosages/samples.txt",
delim = "\t",
col_types = "cc",
col_names = c("FID", "IID")
)$IID
tbl_model_variants <- read_csv(
"output/model-variants-lookup-table.csv.gz",
col_types = cols(
chr = col_character(),
position = col_integer(),
rsid = col_character(),
gtex_id = col_character(),
allele1 = col_character(),
allele2 = col_character()
)
) %>%
filter(
chr == glue("chr{chromosome}")
)
tbl_dosage <- read_csv(
glue("output/dosages/chromosome-{chromosome}.dosage.csv"),
trim_ws = TRUE,
na = ".",
progress = TRUE,
col_names = c(
"chr",
"rsid",
"position",
"allele1",
"allele2",
"MAF",
sample_ids
),
col_types = cols(
.default = col_double(),
chr = col_character(),
rsid = col_character(),
position = col_integer(),
allele1 = col_character(),
allele2 = col_character()
)
)
tbl_matched <- tbl_model_variants %>%
inner_join(
select(tbl_dosage, -chr, -rsid),
by = c("position", "allele1", "allele2")
)
write_delim(
tbl_matched %>%
arrange(position) %>%
select(-rsid) %>%
select(chr, gtex_id, position, allele1, allele2, MAF, everything()) %>%
rename(chromosome = chr) %>%
filter(complete.cases(.)),
delim = " ",
col_names = FALSE,
path = out_file_matched
)
tbl_unmatched <- left_join(
tbl_model_variants,
select(tbl_dosage, position, allele1, allele2),
by = "position",
suffix = c("..model", "..dosage")
) %>%
filter(
!(gtex_id %in% tbl_matched$gtex_id)
)
write_csv(
tbl_unmatched,
path = out_file_unmatched
)
cat(sprintf(
"%s: %9i (%6.2f%%) model variants found for chromosome %i\n\r",
Sys.time(),
nrow(tbl_matched), 100*nrow(tbl_matched)/nrow(tbl_model_variants),
chromosome
))
mkdir -p output/imputed-gene-expressions
rsync -avzhe ssh --progress \
kk656@login-cpu.hpc.cam.ac.uk:/home/kk656/scratch/impute-gene-expression/output/imputed-gene-expressions \
output/imputed-gene-expressions
Bootstrap: docker
From: ubuntu:18.04
From: rocker/verse:3.6.2
%files
install.R /tmp/install.R
%post
# non-interactive debconf
......@@ -17,45 +20,36 @@ From: ubuntu:18.04
apt-get update && apt-get -y install google-cloud-sdk
# install python3 and snakemake
apt-get -y install \
python3 python3-pip
pip3 install snakemake
apt-get -y install python3 python3-pip
pip3 install snakemake==5.10.0
# install bcftools
export BCFVER=1.10.2
apt-get -y install \
gcc wget make zlib1g zlib1g-dev libbz2-dev liblzma-dev libcurl4-openssl-dev
wget https://github.com/samtools/bcftools/releases/download/1.9/bcftools-1.9.tar.bz2
tar -xvjf bcftools-1.9.tar.bz2
cd bcftools-1.9
gcc wget make zlib1g zlib1g-dev libbz2-dev liblzma-dev libcurl4-openssl-dev pv
wget https://github.com/samtools/bcftools/releases/download/$BCFVER/bcftools-$BCFVER.tar.bz2
tar -xvjf bcftools-$BCFVER.tar.bz2
cd bcftools-$BCFVER
./configure --prefix=/usr/bcftools
make
make install
(cd /usr/bin; ln -s /usr/bcftools/bin/bcftools bcftools)
# install PrediXcan and python dependencies (uses python 2.7)
# install PrediXcan and python (2) dependencies
apt-get -y install \
wget python-pip
wget https://raw.githubusercontent.com/hakyimlab/PrediXcan/master/Software/PrediXcan.py -O /usr/bin/predixcan
chmod +x /usr/bin/predixcan
python-pip
pip install \
argparse datetime numpy
# download, extract and store (brain) weights
mkdir /usr/predixcan
wget https://s3.amazonaws.com/predictdb2/GTEx-V7_HapMap-2017-11-29.tar.gz -O /usr/predixcan/GTEx-V7_HapMap-2017-11-29.tar.gz
(cd /usr/predixcan; \
mkdir GTEx-V7_HapMap-2017-11-29; \
# we only need the brain tissue weights:
tar -xvz -f GTEx-V7_HapMap-2017-11-29.tar.gz -C GTEx-V7_HapMap-2017-11-29 --wildcards "*_Brain_*"; \
rm GTEx-V7_HapMap-2017-11-29.tar.gz \
)
# predixcan connects to the weights database with sql, needs write permission
# even if the file system will be read only for the cvontainer
chmod -R 777 /usr/predixcan
# install R and packages
apt-get -y install r-base
Rscript -e "install.packages('dplyr')"
Rscript -e "install.packages('yaml')"
Rscript -e "install.packages('purrr')"
Rscript -e "install.packages('readr')"
Rscript -e "install.packages('tidyr')"
numpy==1.16.6
mkdir -p /usr/PrediXcan
(cd /usr; \
git clone https://github.com/hakyimlab/PrediXcan; \
cd PrediXcan; \
git checkout e77dd8a04a0345cb63aa634d4f8acc6aca9e25e0)
ln -s /usr/PrediXcan/Software/PrediXcan.py /usr/bin/predixcan
chmod +x /usr/bin/predixcan
# install bc
apt-get -y install bc
# install R packages
Rscript /tmp/install.R
mkdir -p logs
nohup snakemake $1 \
--jobs 13 \
--use-singularity \
--cluster-config cluster.json \
--cluster "sbatch -A {cluster.account} -p {cluster.partition} --ntasks {cluster.ntasks} --cpus-per-task {cluster.ncpu} --nodes {cluster.nodes} -t {cluster.time} --job-name {cluster.name} --output {cluster.output} --error {cluster.error}" &
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment