Commit b0e5c713 authored by Kevin Kunzmann

Refactor to gtex v8

parent 8d918f3c
.*
.config
.snakemake
output
logs
*.err
*.out
*.sif
*.simg
*.html
*.pdf
.DS_Store
.Rproj.user
# Impute genetically regulated gene expression for CENTER-TBI using PrediXcan

This repository contains all scripts necessary to impute genetically
regulated gene expression (GREx) using the
[PrediXcan](https://github.com/hakyimlab/MetaXcan) methodology for
[CENTER-TBI](https://www.center-tbi.eu/) imputed whole-genome data.

[1] Gamazon ER†, Wheeler HE†, Shah KP†, Mozaffari SV, Aquino-Michaels K, Carroll RJ, Eyler AE, Denny JC,
Nicolae DL, Cox NJ, Im HK. (2015) *A gene-based association method for mapping traits using reference transcriptome data.*
Nat Genet. doi:10.1038/ng.3367.

[2] Barbeira, A., Shah, K. P., Torres, J. M., Wheeler, H. E., Torstenson, E. S., Edwards, T., ... & Im, H. K. (2016). *MetaXcan: summary statistics based gene-level association method infers accurate PrediXcan results.* BioRxiv, 045260.
The pipeline itself is implemented using
[snakemake](https://snakemake.readthedocs.io/en/stable/) and
[singularity](https://sylabs.io/) containers to guarantee reproducibility.
## Dependencies

1. [`Linux` system](https://ubuntu.com/download/desktop)
2. [`git`](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
3. [`singularity`](https://sylabs.io/guides/3.5/user-guide/quick_start.html#quick-installation-steps) container software (tested on 3.5.0)
4. *optionally:* [Python 3](https://www.python.org/) and [snakemake](https://github.com/snakemake/snakemake)

Downloading the imputed genotypes requires access to the
[FIMM](https://www.fimm.fi/) GCS bucket
`fimm-horizon-outgoing-data` (i.e., a FIMM user account).

It is recommended to run the workflow on a cluster system,
since execution on a desktop machine may take several hours.
An example [slurm](https://slurm.schedmd.com/documentation.html)
cluster [profile](https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles)
is saved in `config/snakemake/mrc-bsu-cluster` for reference.
## Execution

### Local execution

1. Clone this repository and change the working directory to the newly
created folder
```
git clone https://git.center-tbi.eu/kunzmann/impute-gene-expression.git
cd impute-gene-expression
```
Make sure to check out the exact version of the pipeline you want to run,
either by commit hash or an available tag, e.g. via
```
git checkout v0.1.0
```
2. Download the container image specified at the top of the `Snakefile`
(`singularity: ...`) manually via, e.g. (make sure to adjust to the exact specification!)
```
singularity pull library://kkmann/default/center-grex-imputation:[tag]
```
where `[tag]` is the desired container tag or hash (see the `Snakefile` to make
sure that you download exactly the right version).
This will download the container from the Sylabs cloud and store it as
`center-grex-imputation_[tag].sif`.
3. Authenticate with your FIMM account
```
singularity shell -H $PWD [container-name].sif
[container-name].sif> gcloud auth login
```
Here, the `-H $PWD` flag ensures that singularity does not mount the user's
home directory but uses the working directory instead.
Follow the on-screen instructions to authenticate in your browser,
then leave the container
```
[container-name].sif> exit
```
This should have created a hidden folder `.config/gcloud` with authentication
details in your working directory.
4. Adjust tissues, data date, and target cohort in `config/parameters.yml`
5. Run the pipeline
```
singularity exec -H $PWD [container-name].sif snakemake -j 1 all
```
Here `-j 1` specifies the maximal number of parallel jobs.
Increase this on more powerful machines, but bear in mind that roughly 16 GB
of RAM per job are required to avoid out-of-memory issues.
An output folder is created with intermediate files, a mapping report,
and the imputed GREx levels per tissue in `output/imputed-grex`.
### Cluster execution
1. Same as the first step in 'Local execution'.
2. Make sure that `Python 3` is available and `snakemake` is installed
(e.g. load Python via modules and install snakemake into a virtual environment)
3. Configure snakemake for your execution environment and store the profile in,
e.g., `config/my-cluster`.
It is highly recommended to use the `-H $PWD` flag to make sure that there
is no interference with configuration files in the user's home directory.
4. Run
```
snakemake --profile config/my-cluster container
```
to download the correct container image automatically.
This will create a file `[container-hash].simg` in the working folder.
5. Authenticate with your FIMM account as in 'Local execution', point 3.,
with a shell in the downloaded `[container-hash].simg`
6. Adjust tissues, data date, and target cohort in `config/parameters.yml`
7. Run the distributed pipeline
```
snakemake --profile config/my-cluster all
```
## Logo

CC-BY 4.0, Ali Zifan
# define the singularity (https://sylabs.io/) container image to use
# see also https://cloud.sylabs.io/library/kkmann/default/center-grex-imputation
singularity: "library://kkmann/default/center-grex-imputation:sha256.321833b357b20feda51877dcef6e9bac2dfa706e9ed6c32da39baaa8ead9e1f7"

# modify this file to define global parameters
# (e.g. set of tissues to impute genetically regulated expression levels for)
configfile: "config/parameters.yml"

# these rules are computationally less expensive and run locally, even in
# cluster execution mode
localrules: all, container, download_genome, download_weights, samples_file,
    model_variant_lookup_table, mapping_report

# this rule is executed by default, i.e.
# > snakemake
# is equivalent to
# > snakemake all
# the target imputes gene expression for all tissues listed in the config file
# and the report on mapping model variants to available dosage information.
rule all:
    input:
        expand(
            "output/imputed-grex/{tissue}.igrex.txt.gz",
            tissue=config["tissues"]
        ),
        "output/mapping-report.html"

# dummy rule doing nothing; can be used to download the required container;
# this might be helpful to set up authentication with the data servers
# (requires an interactive step); see readme.md for details
rule container:
    shell:
        """
        echo done.
        """

# download vcf files using gsutil (needs to be set up and configured!)
rule download_genome:
    output:
        "output/imputed-genotypes/chromosome-{chr}.vcf.gz",
        "output/imputed-genotypes/chromosome-{chr}.vcf.gz.tbi"
    shell:
        """
        set -ex
        mkdir -p output/imputed-genotypes
        FILEDIR=gs://fimm-horizon-outgoing-data/genetic-data/imputed-genomes/{config[data_date]}/{config[cohort]}
        gsutil cp $FILEDIR/chromosome-{wildcards.chr}.vcf.gz \
            output/imputed-genotypes/chromosome-{wildcards.chr}.vcf.gz
        gsutil cp $FILEDIR/chromosome-{wildcards.chr}.vcf.gz.tbi \
            output/imputed-genotypes/chromosome-{wildcards.chr}.vcf.gz.tbi
        """

# download PrediXcan EQTL weights
rule download_weights:
    output:
        directory("output/weights")
    shell:
        """
        set -ex
        mkdir -p logs
        mkdir -p output/weights
        # wget https://zenodo.org/record/3518299/files/mashr_eqtl.tar?download=1 \
        #     -O output/weights/eqtl.tar
        wget https://zenodo.org/record/3519321/files/elastic_net_eqtl.tar?download=1 \
            -O output/weights/eqtl.tar
        tar -C output/weights -xvf output/weights/eqtl.tar
        rm output/weights/eqtl.tar
        mv output/weights/elastic_net_models/* output/weights/
        rm -rf output/weights/elastic_net_models
        """

# generate a table with all model variants
rule model_variant_lookup_table:
    input:
        rules.download_weights.output
    output:
        "output/model-variants-lookup-table.csv.gz"
    shell:
        """
        set -ex
        Rscript scripts/get-model-variants.R
        """

# generate the samples file in the format required by PrediXcan
rule samples_file:
    input:
        "output/imputed-genotypes/chromosome-1.vcf.gz"
    output:
        "output/dosages/samples.txt"
    shell:
        """
        set -ex
        mkdir -p output/dosages
        # query sample ids from first chromosome
        bcftools query -l {input} > output/samples.txt
        # predixcan allows family and sample id, duplicate column and overwrite
        Rscript -e "library(dplyr); readr::read_tsv('output/samples.txt', col_names = 'SID', col_types = 'c') %>% dplyr::mutate(FID = SID) %>% readr::write_tsv(path = 'output/dosages/samples.txt', col_names = FALSE)"
        """
# extract and map dosage information to model variants by position and alleles
# (only works with hg38 inputs, otherwise a liftover needs to be performed!)
rule dosage:
    input:
        vcf = "output/imputed-genotypes/chromosome-{chr}.vcf.gz",
        index = "output/imputed-genotypes/chromosome-{chr}.vcf.gz.tbi",
        lookup = "output/model-variants-lookup-table.csv.gz",
        samples = "output/dosages/samples.txt"
    output:
        "output/dosages/matched-chromosome-{chr}.dosage.txt.gz",
        "output/dosages/unmatched-chromosome-{chr}.csv.gz"
    shell:
        """
        # setup
        set -ex
        mkdir -p output/dosages
        # create file with chromosomal positions for all model variants
        Rscript scripts/get-model-variant-positions.R {wildcards.chr}
        nlines=$(wc -l < "output/model-variant-positions-chromosome-{wildcards.chr}.txt" | bc)
        if [[ $nlines == 0 ]]
        then
            touch "output/dosages/matched-chromosome-{wildcards.chr}.dosage.txt.gz"
            touch "output/dosages/unmatched-chromosome-{wildcards.chr}.csv.gz"
        else
            # filter for model variant positions and extract dosage information
            bcftools query \
                -o "output/dosages/chromosome-{wildcards.chr}.dosage.csv" \
                -R "output/model-variant-positions-chromosome-{wildcards.chr}.txt" \
                -f "%CHROM,%ID,%POS,%REF,%ALT,%INFO/MAF[,%DS]\n\r" \
                {input.vcf}
            # match model variants with extracted dosage
            Rscript scripts/match-variants.R {wildcards.chr}
            # cleanup
            rm "output/dosages/chromosome-{wildcards.chr}.dosage.csv"
        fi
        # cleanup
        rm "output/model-variant-positions-chromosome-{wildcards.chr}.txt"
        """
# build an overview of the matched model SNPs per chromosome
rule mapping_report:
    input:
        dosage_files = expand(
            "output/dosages/matched-chromosome-{chr}.dosage.txt.gz",
            chr=range(1,24)
        )
    output:
        "output/mapping-report.html"
    shell:
        """
        set -ex
        Rscript -e "rmarkdown::render('scripts/mapping-report.Rmd', output_dir = 'output', output_file = 'mapping-report.html', knit_root_dir = '$PWD')"
        """
# impute GREx for a specific tissue (sums up the mapped dosages times the
# PrediXcan weights)
rule grex:
    input:
        samples_file = "output/dosages/samples.txt",
        dosage_files = expand(
            "output/dosages/matched-chromosome-{chr}.dosage.txt.gz",
            chr=range(1,24)
        )
    output:
        "output/imputed-grex/{tissue}.igrex.txt.gz"
    shell:
        """
        mkdir -p output/imputed-grex
        predixcan \
            --predict \
            --dosages output/dosages \
            --dosages_prefix matched-chromosome- \
            --samples samples.txt \
            --weights output/weights/en_{wildcards.tissue}.db \
            --output_prefix output/imputed-grex/{wildcards.tissue} \
            --model_variant_id varID
        mv output/imputed-grex/{wildcards.tissue}_predicted_expression.txt \
            output/imputed-grex/{wildcards.tissue}.igrex.txt
        gzip output/imputed-grex/{wildcards.tissue}.igrex.txt
        """
jobs: 99
local-cores: 1
use-singularity: True
cluster: "sbatch -A MRC-BSU-SL2-CPU -p skylake --nodes 1 --ntasks 1 --cpus-per-task 1 --mem 16000 -t 02:00:00 --job-name '{rule}[{wildcards}]' --output 'logs/{rule}[{wildcards}].log' --error 'logs/{rule}[{wildcards}].log'"
singularity-args: "-H $PWD"
singularity-prefix: "."
# tissues to impute expression for, all brain except Pituitary and
# Spinal cord (cervical c-1)
tissues:
- 'Brain_Amygdala'
- 'Brain_Anterior_cingulate_cortex_BA24'
- 'Brain_Caudate_basal_ganglia'
- 'Brain_Cerebellar_Hemisphere'
- 'Brain_Cerebellum'
- 'Brain_Cortex'
- 'Brain_Frontal_Cortex_BA9'
- 'Brain_Hippocampus'
- 'Brain_Hypothalamus'
- 'Brain_Nucleus_accumbens_basal_ganglia'
- 'Brain_Putamen_basal_ganglia'
- 'Brain_Substantia_nigra'
data_date: "20200306"
cohort: "all-acgm-filtered"
Version: 1.0
RestoreWorkspace: Default
SaveWorkspace: Default
AlwaysSaveHistory: Default
EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 4
Encoding: UTF-8
RnwWeave: Sweave
LaTeX: pdfLaTeX
AutoAppendNewline: Yes
StripTrailingWhitespace: Yes
# Ignore everything in this directory
*
# Except this file
!.gitignore
#!/bin/bash
set -ex
sudo singularity build center-impute-grex.sif singularity.def
singularity push -U center-impute-grex.sif library://kkmann/default/center-grex-imputation:latest
#!/usr/bin/env Rscript
options(tidyverse.quiet = TRUE)
suppressPackageStartupMessages(library(tidyverse, warn.conflicts = FALSE))
library(glue, warn.conflicts = FALSE)
args <- commandArgs(trailingOnly = TRUE)
chromosome <- as.integer(args[[1]])
tbl_gtex_lookup <- read_csv(
"output/model-variants-lookup-table.csv.gz",
col_types = cols(
chr = col_character(),
position = col_integer(),
rsid = col_character(),
gtex_id = col_character(),
allele1 = col_character(),
allele2 = col_character()
)
) %>%
filter(chr == glue("chr{chromosome}")) %>%
select(chr, position) %>%
arrange(position) %>%
write_delim(
path = glue("output/model-variant-positions-chromosome-{chromosome}.txt"),
delim = "\t",
col_names = FALSE
)
#!/usr/bin/env Rscript
options(tidyverse.quiet = TRUE)
suppressPackageStartupMessages(library(tidyverse, warn.conflicts = FALSE))
library(glue, warn.conflicts = FALSE)
library(DBI)
tissues <- yaml::read_yaml("config/parameters.yml")$tissues
tbl_weights <- map(
tissues,
function(tissue) {
con <- dbConnect(
RSQLite::SQLite(),
glue("output/weights/en_{tissue}.db")
)
res <- dbReadTable(con, "weights") %>%
as_tibble() %>%
mutate(tissue = tissue) %>%
select(tissue, everything())
dbDisconnect(con)
return(res)
}
) %>%
bind_rows() %>%
separate(
varID,
into = c("chr", "position", "allele1", "allele2", "dummy"),
sep = "_",
remove = FALSE
) %>% {
assertthat::assert_that(
all((.$allele1 == .$ref_allele) & (.$allele2 == .$eff_allele))
)
.
} %>%
select(
chr, position, rsid, varID, allele1, allele2
) %>%
rename(
gtex_id = varID
) %>%
mutate(
position = as.integer(position)
) %>%
distinct()
write_csv(
tbl_weights,
"output/model-variants-lookup-table.csv.gz"
)
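The `separate()` call above relies on the GTEx v8 variant identifier convention `chr_position_ref_alt_build`. A quick, self-contained illustration with a hypothetical variant:

```
library(dplyr)
library(tidyr)

tibble(varID = "chr1_100_A_G_b38") %>%
  separate(
    varID,
    into = c("chr", "position", "allele1", "allele2", "dummy"),
    sep = "_",
    remove = FALSE
  )
# -> chr = "chr1", position = "100", allele1 = "A",
#    allele2 = "G", dummy = "b38" (genome build tag)
```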
options(
repos = list(
"https://mran.revolutionanalytics.com/snapshot/2020-03-01/"
)
)
install.packages("tidyverse")
install.packages("pander")
install.packages("vroom")
install.packages("DBI")
install.packages("RSQLite")
---
title: "Mapping hg38 to GTEx model variants"
author: "Kevin Kunzmann"
date: "`r Sys.time()`"
output:
html_document:
code_folding: hide
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE
)
library(tidyverse, warn.conflicts = FALSE)
library(glue, warn.conflicts = FALSE)
set.seed(42)
# load data
tbl_unmatched <- glue(
"output/dosages/unmatched-chromosome-{1:23}.csv.gz"
) %>%
map(
~tryCatch(
read_csv(
., col_types = cols(
chr = col_character(),
position = col_integer(),
rsid = col_character(),
gtex_id = col_character(),
allele1..model = col_character(),
allele2..model = col_character(),
allele1..dosage = col_character(),
allele2..dosage = col_character()
)
),
error = function(e) tibble(
chr = character(0L),
position = integer(0L),
rsid = character(0L),
gtex_id = character(0L),
allele1..model = character(0L),
allele2..model = character(0L),
allele1..dosage = character(0L),
allele2..dosage = character(0L)
)
)
) %>%
bind_rows() %>%
mutate(
chromosome = str_extract(
chr,
"(?<=^chr)[1-9]{1}[0-9]{0,1}$"
) %>%
as.integer()
) %>%
select(-chr) %>%
select(chromosome, everything()) %>%
arrange(chromosome, position)
tbl_model_variants <- read_csv(
"output/model-variants-lookup-table.csv.gz",
col_types = cols(
chr = col_character(),
position = col_integer(),
rsid = col_character(),
gtex_id = col_character(),
allele1 = col_character(),
allele2 = col_character()
)
)
```
## Unmatched variants
```{r}
tbl_model_variants %>%
mutate(
chromosome = str_extract(
chr,
"(?<=^chr)[1-9]{1}[0-9]{0,1}$"
) %>%
as.integer()
) %>%
select(-chr) %>%
select(chromosome, everything()) %>%
mutate(
missing = gtex_id %in% tbl_unmatched$gtex_id
) %>%
group_by(chromosome) %>%
summarize(
`# missing` = sum(missing),
`% missing` = sum(missing) / n() * 100
) %>%
pander::pander(
caption = "Number of unmatched model variants per chromosome",
digits = 2
)
```
```{r}
complementary_base <- function(allele) {
case_when(
allele == "A" ~ "T",
allele == "T" ~ "A",
allele == "G" ~ "C",
allele == "C" ~ "G"
) %>% {
assertthat::assert_that(!any(is.na(.)));
.
}
}
tbl_unmatched %>%
filter(
complementary_base(allele1..model) == allele1..dosage
) %>%
pander::pander(
caption = "Variants that match on complementary strand",
split.table = Inf
)
```
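To make the strand-flip table above concrete: a model variant recorded with alleles A/G can only match a dosage entry reported as T/C on the opposite strand, since each base is complemented (hypothetical example):

```{r}
# hypothetical strand flip: model alleles A/G vs. dosage alleles T/C
complementary_base(c("A", "G"))
# [1] "T" "C"
```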
## Session Info
```{r, echo=FALSE}
sessionInfo() %>%
pander::pander()
```
#!/usr/bin/env Rscript
options(tidyverse.quiet = TRUE)
suppressPackageStartupMessages(library(tidyverse, warn.conflicts = FALSE))
library(glue, warn.conflicts = FALSE)
args <- commandArgs(trailingOnly = TRUE)
chromosome <- as.integer(args[[1]])
dosages_file <- glue("output/dosages/chromosome-{chromosome}.dosage.csv")
out_file_matched <- glue("output/dosages/matched-chromosome-{chromosome}.dosage.txt.gz")
out_file_unmatched <- glue("output/dosages/unmatched-chromosome-{chromosome}.csv.gz")
sample_ids <- read_delim(
"output/dosages/samples.txt",
delim = "\t",
col_types = "cc",
col_names = c("FID", "IID")
)$IID
tbl_model_variants <- read_csv(
"output/model-variants-lookup-table.csv.gz",
col_types = cols(
chr = col_character(),
position = col_integer(),
rsid = col_character(),
gtex_id = col_character(),
allele1 = col_character(),
allele2 = col_character()
)
) %>%
filter(
chr == glue("chr{chromosome}")
)
tbl_dosage <- read_csv(
glue("output/dosages/chromosome-{chromosome}.dosage.csv"),
trim_ws = TRUE,
na = ".",
progress = TRUE,
col_names = c(
"chr",
"rsid",
"position",
"allele1",
"allele2",
"MAF",
sample_ids
),
col_types = cols(
.default = col_double(),
chr = col_character(),
rsid = col_character(),
position = col_integer(),
allele1 = col_character(),
allele2 = col_character()
)
)
tbl_matched <- tbl_model_variants %>%
inner_join(
select(tbl_dosage, -chr, -rsid),
by = c("position", "allele1", "allele2")
)
write_delim(
tbl_matched %>%
arrange(position) %>%
select(-rsid) %>%
select(chr, gtex_id, position, allele1, allele2, MAF, everything()) %>%
rename(chromosome = chr) %>%
filter(complete.cases(.)),
delim = " ",
col_names = FALSE,
path = out_file_matched
)
tbl_unmatched <- left_join(
tbl_model_variants,
select(tbl_dosage, position, allele1, allele2),
by = "position",
suffix = c("..model", "..dosage")
) %>%
filter(
!(gtex_id %in% tbl_matched$gtex_id)
)
write_csv(
tbl_unmatched,
path = out_file_unmatched
)
cat(sprintf(
"%s: %9i (%6.2f%%) model variants found for chromosome %i\n\r",
Sys.time(),
nrow(tbl_matched), 100*nrow(tbl_matched)/nrow(tbl_model_variants),
chromosome
))
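The two joins above drive the whole matching step: the `inner_join` on position and both alleles yields the matched dosages, while the `left_join` on position alone surfaces candidates whose alleles disagree. A toy sketch with made-up variants:

```
library(dplyr)

# two hypothetical model variants on one chromosome
model <- tibble(
  gtex_id  = c("chr1_100_A_G_b38", "chr1_200_C_T_b38"),
  position = c(100L, 200L),
  allele1  = c("A", "C"),
  allele2  = c("G", "T")
)
# dosage data: position 100 matches alleles, position 200 is strand-flipped
dosage <- tibble(
  position = c(100L, 200L),
  allele1  = c("A", "G"),
  allele2  = c("G", "A")
)

inner_join(model, dosage, by = c("position", "allele1", "allele2"))
# -> only chr1_100_A_G_b38 is matched

left_join(model, dosage, by = "position", suffix = c("..model", "..dosage"))
# -> chr1_200_C_T_b38 appears with conflicting allele columns; only such
#    rows survive the filter and end up in the "unmatched" output
```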
Bootstrap: docker
From: rocker/verse:3.6.2

%files
install.R /tmp/install.R

%post
# non-interactive debconf
export DEBIAN_FRONTEND=noninteractive

apt-get update && apt-get -y install google-cloud-sdk

# install python3 and snakemake
apt-get -y install python3 python3-pip
pip3 install snakemake==5.10.0

# install bcftools
export BCFVER=1.10.2
apt-get -y install \
gcc wget make zlib1g zlib1g-dev libbz2-dev liblzma-dev libcurl4-openssl-dev pv
wget https://github.com/samtools/bcftools/releases/download/$BCFVER/bcftools-$BCFVER.tar.bz2
tar -xvjf bcftools-$BCFVER.tar.bz2
cd bcftools-$BCFVER
./configure --prefix=/usr/bcftools
make
make install
(cd /usr/bin; ln -s /usr/bcftools/bin/bcftools bcftools)

# install PrediXcan and python (2) dependencies
apt-get -y install \
python-pip
pip install \
numpy==1.16.6
mkdir -p /usr/PrediXcan
(cd /usr; \
git clone https://github.com/hakyimlab/PrediXcan; \
cd PrediXcan; \
git checkout e77dd8a04a0345cb63aa634d4f8acc6aca9e25e0)
ln -s /usr/PrediXcan/Software/PrediXcan.py /usr/bin/predixcan
chmod +x /usr/bin/predixcan

# install bc
apt-get -y install bc

# install R packages
Rscript /tmp/install.R