Refactor to gtex v8

Commit b0e5c713 authored Mar 14, 2020 by Kevin Kunzmann
19 changed files
--- a/.gitignore
+++ b/.gitignore
+.*
 .config
 .snakemake
 output
-logs
-nohup.out
-container.sif
+
+*.err
+*.out
+*.sif
+*.simg
+*.html
+*.pdf
+
 .DS_Store
+.Rproj.user
--- a/README.md
+++ b/README.md
-# Impute gene expression for CENTER-TBI with PrediXcan
+# Impute genetically regulated gene expression for CENTER-TBI using PrediXcan

-The singularity container with most software dependencies is available at
-https://doi.org/10.5281/zenodo.3376504.
-Data currently needs to be accessed manually due to access restrictions.
-This workflow is designed for *.vcf.gz files with dosage (DS) information.
+This repository contains all scripts necessary to impute genetically
+regulated gene expression (GREx) using the
+[PrediXcan](https://github.com/hakyimlab/MetaXcan) methodology for
+[CENTER-TBI](https://www.center-tbi.eu/) imputed whole-genome data.

-More information on PrediXcan can be found here https://github.com/hakyimlab/PrediXcan and in:
-> Gamazon ER†, Wheeler HE†, Shah KP†, Mozaffari SV, Aquino-Michaels K, Carroll RJ, Eyler AE, Denny JC,
-Nicolae DL, Cox NJ, Im HK. (2015) A gene-based association method for mapping traits using reference transcriptome data.
+[1] Gamazon ER†, Wheeler HE, Shah KP, Mozaffari SV, Aquino-Michaels K, Carroll RJ, Eyler AE, Denny JC,
+Nicolae DL, Cox NJ, Im HK. (2015) *A gene-based association method for mapping traits using reference transcriptome data.*
 Nat Genet. doi:10.1038/ng.3367.

-## Dependencies
-
-1. linux shell (`bash`), possibly via virtual machine on Windows/Mac
-2. `wget` (pre-installed or via distribution package manager)
-3. `singularity` container software (tested on 3.3.0, https://sylabs.io/guides/3.3/user-guide)
-4. `git` (https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
-5. for data download fimm GCP bucket access to `fimm-horizon-outgoing-data/CENTER_TBI_data_freeze_190829/Imputed_data`
-
-### Optional
-
-5. python 3.7+ and snakemake
-6. slurm cluster
-
-We use snakemake to organize the workflow (also pre-installed in the container) and support cluster execution.
-Snakemake is available via `pip` package for python 3.7.
-> Johannes Köster, Sven Rahmann, Snakemake—a scalable bioinformatics workflow engine, Bioinformatics,
-Volume 28, Issue 19, 1 October 2012, Pages 2520–2522, https://doi.org/10.1093/bioinformatics/bts480
-
-
-## Execution
-
-Download and extract the contents of this repository (might be access restricted)
+[2] Barbeira, A., Shah, K. P., Torres, J. M., Wheeler, H. E., Torstenson, E. S., Edwards, T., ... & Im, H. K. (2016). *MetaXcan: summary statistics based gene-level association method infers accurate PrediXcan results.* BioRxiv, 045260.

-    git clone https://github.com/kkmann/impute-gene-expression
-    cd impute-gene-expression
-
-Download the container image
-
-    bash scripts/download_container.sh
-
-Obtain the imputed genomes (GCP bucket: fimm-horizon-outgoing-data/CENTER_TBI_data_freeze_190829/Imputed_data).
-`gsutils` is pre-installed in the container image, to authenticate with your
-GCP account run and follow the interactive instructions
-
-    singularity shell container.sif
-    gcloud auth login
-    snakemake download_imputed_genotypes
-    exit
+The pipeline itself is implemented using [snakemake](https://snakemake.readthedocs.io/en/stable/) and
+[singularity](https://sylabs.io/) containers to guarantee reproducibility.

-Execute the workflow inside the container on a single core (takes a while!)

-    singularity exec container.sif snakemake impute

-Optionally, if snakemake is installed, the workflow can be run in parallel via
-
-    snakemake --use-singularity -j 8 impute
+## Dependencies

-where '8' can be replaced by the number of available cores.
-Cluster execution is enabled via the `scripts/slurm-snakemake.sh` script as
+1. [`Linux` system](https://ubuntu.com/download/desktop)
+2. [`git`](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
+3. [`singularity`](https://sylabs.io/guides/3.5/user-guide/quick_start.html#quick-installation-steps) container software (tested on 3.5.0)
+4. *optionally:* [Python 3](https://www.python.org/) and [snakemake](https://github.com/snakemake/snakemake) 

-    bash scripts/slurm-snakemake.sh impute
+Downloading the imputed genotypes requires access to the
+[FIMM](https://www.fimm.fi/) GCS bucket
+`fimm-horizon-outgoing-data` (a FIMM user account is needed).

+It is recommended to run the workflow on a cluster system,
+since execution on a desktop machine may take several hours.
+An example [slurm](https://slurm.schedmd.com/documentation.html)
+cluster [profile](https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles) is saved in `config/mrc-bsu-cluster`
+for reference.

-## Results

-an output folder is created with the PrediXcan dosage files and imputed gene
-expressions.

+## Execution

-## Versions
+### Local execution

-release tags should point to respective FIMM imputed genotype source data.
+1.  Clone this repository and change the working directory to the newly 
+    created folder
+    ```
+    git clone https://git.center-tbi.eu/kunzmann/impute-gene-expression.git
+    cd impute-gene-expression
+    ```
+    Make sure to check out the exact version of the pipeline you want to run,
+    either by commit hash or by an available tag, e.g. via
+    ```
+    git checkout v0.1.0
+    ```
+2.  Download the container image specified at the top of the `Snakefile`
+    (singularity: ...) manually via, e.g. (make sure to adjust to the exact specification!)
+    ```
+    singularity pull library://kkmann/default/center-grex-imputation:[tag]
+    ```
+    where `[tag]` is the desired container tag or hash (see Snakefile to make
+    sure that you download the exact right version).
+    This will download the container from the Sylabs Cloud library and store it as
+    `center-grex-imputation_[tag].sif`.
+3.  Authenticate with your FIMM account
+    ```
+    singularity shell -H $PWD [container-name].sif
+    [container-name].sif> gcloud auth login
+    ```
+    Here, the `-H $PWD` flag ensures that singularity does not mount the user's
+    home directory but uses the working directory instead.
+    Follow the on-screen instructions to authenticate in your browser, 
+    then leave the container
+    ```
+    [container-name].sif> exit
+    ```
+    This should have created a hidden folder `.config/gcloud` with authentication
+    details in your working directory.
+4.  Adjust tissues, data date, and target cohort in `config/parameters.yml`
+5.  Run the pipeline
+    ```
+    singularity exec -H $PWD [container-name].sif snakemake -j 1 all
+    ```
+    Here `-j 1` specifies the maximal number of parallel jobs.
+    Increase this on more powerful machines but bear in mind that roughly 16 GB
+    of RAM per job are required to avoid out-of-memory issues.
+    An output folder is created with intermediate files, a mapping report, 
+    and the imputed GREx levels per tissue in `output/imputed-grex`.
+
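Note on choosing `-j`: the README quotes roughly 16 GB of RAM per job. A minimal sketch of how that figure could be turned into a conservative job count is given here; the helper `max_parallel_jobs` and the example machine size are hypothetical and not part of the pipeline.

```python
import os


def max_parallel_jobs(total_ram_gb, ram_per_job_gb=16.0, cores=None):
    """Return a conservative value for snakemake's -j flag.

    ram_per_job_gb defaults to the ~16 GB per job quoted in the README;
    the function itself is a hypothetical helper, not part of the pipeline.
    """
    if cores is None:
        cores = os.cpu_count() or 1
    # Number of jobs that fit into memory at the same time.
    by_memory = int(total_ram_gb // ram_per_job_gb)
    # Never go below one job, never exceed the available cores.
    return max(1, min(cores, by_memory))


if __name__ == "__main__":
    # Example: a 64 GB workstation with 8 cores.
    print(max_parallel_jobs(64, cores=8))  # prints 4
```

On a machine with less than 16 GB the sketch still returns 1, so the run may hit the out-of-memory issues the README warns about.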
+### Cluster execution
+
+1.  Same as the first step in 'Local execution'.
+2.  Make sure that `Python 3` is available and `snakemake` installed 
+    (e.g. load Python via modules and install snakemake into a virtual environment)
+3.  Configure snakemake for your execution environment and store the profile in, 
+    e.g., `config/my-cluster`.
+    It is highly recommended to use the `-H $PWD` flag to make sure that there
+    is no interference with configuration files in the user's home directory.
+4.  Run
+    ```
+    snakemake --profile config/my-cluster container
+    ```
+    to download the correct container image automatically.
+    This will create a file `[container-hash].simg` in the working folder.
+5.  Authenticate with your FIMM account as in 'Local execution', step 3,
+    with a shell in the downloaded `[container-hash].simg`
+6.  Adjust tissues, data date, and target cohort in `config/parameters.yml`
+7.  Run the distributed pipeline
+    ```
+    snakemake --profile config/my-cluster all
+    ```


 ## Logo

--- a/Snakefile
+++ b/Snakefile
--- a/cluster.json
+++ b/cluster.json
-{
-    "__default__" :
-    {
-        "account"       : "MRC-BSU-SL2-CPU",
-        "partition"     : "skylake",
-        "time"          : "02:00:00",
-        "ntasks"        : "1",
-        "nodes"         : "1",
-        "ncpu"          : "1",
-        "name"          : "{rule}__{wildcards}",
-        "output"        : "logs/{rule}__{wildcards}.out",
-        "error"         : "logs/{rule}__{wildcards}.out"
-    }
-}
--- a/config.yml
+++ b/config.yml
-# where to put things
-output_dir: 'output'
-
-# suggested filtering options for PrediXcan
-min_MAF:  '0.01'
-min_INFO: '0.8'
-
-# brain regions to impute expression for
-brain_regions:
-    - 'Amygdala'
-    - 'Anterior_cingulate_cortex_BA24'
-    - 'Caudate_basal_ganglia'
-    - 'Cerebellar_Hemisphere'
-    - 'Cerebellum'
-    - 'Cortex'
-    - 'Frontal_Cortex_BA9'
-    - 'Hippocampus'
-    - 'Hypothalamus'
-    - 'Nucleus_accumbens_basal_ganglia'
-    - 'Putamen_basal_ganglia'
-    - 'Spinal_cord_cervical_c-1'
-    - 'Substantia_nigra'
--- a/config/mrc-bsu-cluster/config.yaml
+++ b/config/mrc-bsu-cluster/config.yaml
+jobs:               99
+local-cores:        1
+use-singularity:    True
+cluster:            "sbatch -A MRC-BSU-SL2-CPU -p skylake  --nodes 1 --ntasks 1 --cpus-per-task 1 --mem 16000 -t 02:00:00 --job-name '{rule}[{wildcards}]' --output 'logs/{rule}[{wildcards}].log' --error 'logs/{rule}[{wildcards}].log'"
+singularity-args:   "-H $PWD"
+singularity-prefix: "."
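Note on the profile: as far as I understand the Snakemake profile mechanism, each key in `config.yaml` stands for the long command-line option of the same name. The sketch below illustrates that mapping for the keys in this file; `profile_to_cli_args` is a hypothetical helper, the `cluster` entry is left out for brevity, and this is not how Snakemake itself parses profiles.

```python
def profile_to_cli_args(profile):
    """Translate profile key/value pairs into long command-line options."""
    args = []
    for key, value in profile.items():
        flag = f"--{key}"
        if value is True:
            # Boolean switches are passed without a value.
            args.append(flag)
        elif value is False or value is None:
            # Disabled switches are simply omitted.
            continue
        else:
            args.extend([flag, str(value)])
    return args


# Values copied from config/mrc-bsu-cluster/config.yaml (without `cluster`).
profile = {
    "jobs": 99,
    "local-cores": 1,
    "use-singularity": True,
    "singularity-args": "-H $PWD",
    "singularity-prefix": ".",
}

if __name__ == "__main__":
    print(" ".join(profile_to_cli_args(profile)))
```

Read this way, `snakemake --profile config/mrc-bsu-cluster all` behaves roughly like passing `--jobs 99 --local-cores 1 --use-singularity ...` by hand.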
--- a/config/parameters.yml
+++ b/config/parameters.yml
+# tissues to impute expression for, all brain except Pituitary and
+# Spinal cord (cervical c-1)
+tissues:
+    - 'Brain_Amygdala'
+    - 'Brain_Anterior_cingulate_cortex_BA24'
+    - 'Brain_Caudate_basal_ganglia'
+    - 'Brain_Cerebellar_Hemisphere'
+    - 'Brain_Cerebellum'
+    - 'Brain_Cortex'
+    - 'Brain_Frontal_Cortex_BA9'
+    - 'Brain_Hippocampus'
+    - 'Brain_Hypothalamus'
+    - 'Brain_Nucleus_accumbens_basal_ganglia'
+    - 'Brain_Putamen_basal_ganglia'
+    - 'Brain_Substantia_nigra'
+
+data_date: "20200306"
+
+cohort: "all-acgm-filtered"
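Note on the parameters: a quick sanity check before a long run can catch a mistyped date or an empty tissue list. The following is a sketch under assumptions: the parameters are given as a plain dictionary (reading the YAML file is left out), `validate_parameters` and `weight_db_paths` are hypothetical helpers, and only the `output/weights/en_{tissue}.db` pattern is taken from `scripts/get-model-variants.R`.

```python
from datetime import datetime


def validate_parameters(params):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    tissues = params.get("tissues") or []
    if not tissues:
        problems.append("no tissues configured")
    if len(set(tissues)) != len(tissues):
        problems.append("duplicate tissue names")
    date = str(params.get("data_date", ""))
    try:
        # Require exactly eight characters that parse as a calendar date.
        if len(date) != 8:
            raise ValueError(date)
        datetime.strptime(date, "%Y%m%d")
    except ValueError:
        problems.append("data_date is not a valid YYYYMMDD date")
    if not params.get("cohort"):
        problems.append("no cohort configured")
    return problems


def weight_db_paths(params):
    """Weight databases following the pattern used in get-model-variants.R."""
    return [f"output/weights/en_{tissue}.db" for tissue in params["tissues"]]


# Shortened example mirroring config/parameters.yml.
params = {
    "tissues": ["Brain_Amygdala", "Brain_Cortex"],
    "data_date": "20200306",
    "cohort": "all-acgm-filtered",
}

if __name__ == "__main__":
    print(validate_parameters(params))
    print(weight_db_paths(params))
```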
--- a/impute-gene-expression.Rproj
+++ b/impute-gene-expression.Rproj
+Version: 1.0
+
+RestoreWorkspace: Default
+SaveWorkspace: Default
+AlwaysSaveHistory: Default
+
+EnableCodeIndexing: Yes
+UseSpacesForTab: Yes
+NumSpacesForTab: 4
+Encoding: UTF-8
+
+RnwWeave: Sweave
+LaTeX: pdfLaTeX
+
+AutoAppendNewline: Yes
+StripTrailingWhitespace: Yes
--- a/logs/.gitignore
+++ b/logs/.gitignore
+# Ignore everything in this directory
+*
+# Except this file
+!.gitignore
--- a/scripts/build-container.sh
+++ b/scripts/build-container.sh
+#!/bin/bash
+
+set -ex
+
+sudo singularity build center-impute-grex.sif singularity.def
+
+singularity push -U center-impute-grex.sif library://kkmann/default/center-grex-imputation:latest
--- a/scripts/download_container.sh
+++ b/scripts/download_container.sh
-wget https://zenodo.org/record/3376504/files/container.sif -O container.sif
--- a/scripts/get-model-variant-positions.R
+++ b/scripts/get-model-variant-positions.R
+#!/usr/bin/env Rscript
+
+options(tidyverse.quiet = TRUE)
+suppressPackageStartupMessages(library(tidyverse, warn.conflicts = FALSE))
+library(glue, warn.conflicts = FALSE)
+
+args <- commandArgs(trailingOnly = TRUE)
+
+chromosome <- as.integer(args[[1]])
+
+tbl_gtex_lookup <- read_csv(
+        "output/model-variants-lookup-table.csv.gz",
+        col_types  = cols(
+            chr      = col_character(),
+            position = col_integer(),
+            rsid     = col_character(),
+            gtex_id  = col_character(),
+            allele1  = col_character(),
+            allele2  = col_character()
+        )
+    ) %>%
+    filter(chr == glue("chr{chromosome}")) %>%
+    select(chr, position) %>%
+    arrange(position) %>%
+    write_delim(
+        path     = glue("output/model-variant-positions-chromosome-{chromosome}.txt"),
+        delim     = "\t",
+        col_names = FALSE
+    )
--- a/scripts/get-model-variants.R
+++ b/scripts/get-model-variants.R
+#!/usr/bin/env Rscript
+
+options(tidyverse.quiet = TRUE)
+suppressPackageStartupMessages(library(tidyverse, warn.conflicts = FALSE))
+library(glue, warn.conflicts = FALSE)
+library(DBI)
+
+tissues  <- yaml::read_yaml("config/parameters.yml")$tissues
+
+tbl_weights <- map(
+        tissues,
+        function(tissue) {
+            con <- dbConnect(
+                RSQLite::SQLite(),
+                glue("output/weights/en_{tissue}.db")
+            )
+            res <- dbReadTable(con, "weights") %>%
+                as_tibble() %>%
+                mutate(tissue = tissue) %>%
+                select(tissue, everything())
+            dbDisconnect(con)
+            return(res)
+        }
+    ) %>%
+    bind_rows() %>%
+    separate(
+        varID,
+        into = c("chr", "position", "allele1", "allele2", "dummy"),
+        sep = "_",
+        remove = FALSE
+    ) %>% {
+        assertthat::assert_that(
+            all((.$allele1 == .$ref_allele) & (.$allele2 == .$eff_allele))
+        )
+        .
+    } %>%
+    select(
+        chr, position, rsid, varID, allele1, allele2
+    ) %>%
+    rename(
+        gtex_id  = varID
+    ) %>%
+    mutate(
+        position = as.integer(position)
+    ) %>%
+    distinct()
+
+write_csv(
+    tbl_weights,
+    "output/model-variants-lookup-table.csv.gz"
+)
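Note on the `separate()` call: the script splits each `varID` on underscores into chromosome, position, two alleles, and a fifth field that it discards as `dummy`. A minimal Python sketch of the same split is given here for illustration. The example ID is made up, and naming the fifth field `build` is my assumption. Unlike the R code, the sketch raises an error when the number of fields is not five.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    position: int
    allele1: str
    allele2: str
    build: str  # the field that the R script calls "dummy"


def parse_var_id(var_id):
    """Split an underscore-separated variant ID into its five fields."""
    parts = var_id.split("_")
    if len(parts) != 5:
        raise ValueError(f"unexpected varID layout: {var_id!r}")
    chrom, position, allele1, allele2, build = parts
    return Variant(chrom, int(position), allele1, allele2, build)


if __name__ == "__main__":
    # Hypothetical ID, used only to illustrate the layout.
    print(parse_var_id("chr1_12345_A_G_b38"))
```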
--- a/scripts/install.R
+++ b/scripts/install.R
+options(
+    repos = list(
+        "https://mran.revolutionanalytics.com/snapshot/2020-03-01/"
+    )
+)
+
+install.packages("tidyverse")
+install.packages("pander")
+install.packages("vroom")
+install.packages("DBI")
+install.packages("RSQLite")
--- a/scripts/mapping-report.Rmd
+++ b/scripts/mapping-report.Rmd
+---
+title:  "Mapping hg38 to GTEx model variants"
+author: "Kevin Kunzmann"
+date:   "`r Sys.time()`"
+output: 
+    html_document:
+        code_folding: hide
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(
+    echo     = TRUE
+)
+
+library(tidyverse, warn.conflicts = FALSE)
+library(glue, warn.conflicts = FALSE)
+
+set.seed(42)
+
+# load data
+tbl_unmatched <- glue(
+        "output/dosages/unmatched-chromosome-{1:23}.csv.gz"
+    ) %>% 
+    map(
+        ~tryCatch(
+            read_csv(
+                ., col_types = cols(
+                    chr = col_character(),
+                    position = col_integer(),
+                    rsid = col_character(),
+                    gtex_id = col_character(),
+                    allele1..model = col_character(),
+                    allele2..model = col_character(),
+                    allele1..dosage = col_character(),
+                    allele2..dosage = col_character()
+                )
+            ),
+            error = function(e) tibble(
+                chr = character(0L),
+                position = integer(0L),
+                rsid = character(0L),
+                gtex_id = character(0L),
+                allele1..model = character(0L),
+                allele2..model = character(0L),
+                allele1..dosage = character(0L),
+                allele2..dosage = character(0L)
+            )
+        )
+    ) %>% 
+    bind_rows() %>% 
+    mutate(
+        chromosome = str_extract(
+                chr, 
+                "(?<=^chr)[1-9]{1}[0-9]{0,1}$"
+            ) %>% 
+            as.integer()
+    ) %>% 
+    select(-chr) %>% 
+    select(chromosome, everything()) %>% 
+    arrange(chromosome, position)
+
+tbl_model_variants <- read_csv(
+    "output/model-variants-lookup-table.csv.gz",
+    col_types = cols(
+        chr      = col_character(),
+        position = col_integer(),
+        rsid     = col_character(),
+        gtex_id  = col_character(),
+        allele1  = col_character(),
+        allele2  = col_character()
+    )
+)
+```
+
+
+
+### Unmatched variants
+
+```{r}
+tbl_model_variants %>% 
+    mutate(
+        chromosome = str_extract(
+                chr, 
+                "(?<=^chr)[1-9]{1}[0-9]{0,1}$"
+            ) %>% 
+            as.integer()
+    ) %>% 
+    select(-chr) %>% 
+    select(chromosome, everything()) %>% 
+    mutate(
+        missing = gtex_id %in% tbl_unmatched$gtex_id
+    ) %>% 
+    group_by(chromosome) %>% 
+    summarize(
+        `# missing` = sum(missing),
+        `% missing` = sum(missing) / n() * 100
+    ) %>% 
+    pander::pander(
+        caption = "Number of unmatched model variants per chromosome",
+        digits  = 2
+    )
+```
+
+```{r}
+complementary_base <- function(allele) {
+    case_when(
+        allele == "A" ~ "T",
+        allele == "T" ~ "A",
+        allele == "G" ~ "C",
+        allele == "C" ~ "G"
+    ) %>% {
+        assertthat::assert_that(!any(is.na(.))); 
+        .
+    }
+}
+
+tbl_unmatched %>% 
+    filter(
+        complementary_base(allele1..model) == allele1..dosage
+    ) %>% 
+    pander::pander(
+        caption = "Variants that match on complementary strand",
+        split.table = Inf
+    )
+```
+
+
+
+## Session Info
+
+```{r, echo=FALSE}
+sessionInfo() %>% 
+    pander::pander()
+```
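Note on the report logic: two small pieces carry most of the reasoning, namely extracting the chromosome number from labels such as `chr7` and checking whether an unmatched variant would match on the complementary strand. The sketch below restates both in Python; all function names are hypothetical. One deliberate difference from the R chunk: `complementary_base` there asserts that every allele is one of A, C, G, T, whereas this sketch returns `None` for anything else (for example indels).

```python
import re

# Same pattern as in the report: "chr" followed by 1-2 digits, no leading zero.
_CHROM = re.compile(r"chr([1-9][0-9]?)")
_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def chromosome_number(label):
    """Integer chromosome for labels like 'chr7'; None for e.g. 'chrX'."""
    match = _CHROM.fullmatch(label)
    return int(match.group(1)) if match else None


def complement(allele):
    """Complement a single-base allele; None for indels or other symbols."""
    return _COMPLEMENT.get(allele)


def matches_on_complementary_strand(model_allele, dosage_allele):
    """True if the dosage allele is the complement of the model allele."""
    flipped = complement(model_allele)
    return flipped is not None and flipped == dosage_allele


if __name__ == "__main__":
    print(chromosome_number("chr7"), chromosome_number("chrX"))
    print(matches_on_complementary_strand("A", "T"))
```

Like the R chunk, the check only compares `allele1`; a stricter variant would also require the second allele to be complementary.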
--- a/scripts/match-variants.R
+++ b/scripts/match-variants.R
+#!/usr/bin/env Rscript
+
+options(tidyverse.quiet = TRUE)
+suppressPackageStartupMessages(library(tidyverse, warn.conflicts = FALSE))
+library(glue, warn.conflicts = FALSE)
+
+args <- commandArgs(trailingOnly = TRUE)
+
+chromosome <- as.integer(args[[1]])
+
+dosages_file       <- glue("output/dosages/chromosome-{chromosome}.dosage.txt.gz")
+out_file_matched   <- glue("output/dosages/matched-chromosome-{chromosome}.dosage.txt.gz")
+out_file_unmatched <- glue("output/dosages/unmatched-chromosome-{chromosome}.csv.gz")
+
+
+sample_ids <- read_delim(
+        "output/dosages/samples.txt",
+        delim     = "\t",
+        col_types = "cc",
+        col_names = c("FID", "IID")
+    )$IID
+
+
+tbl_model_variants <- read_csv(
+        "output/model-variants-lookup-table.csv.gz",
+        col_types  = cols(
+            chr      = col_character(),
+            position = col_integer(),
+            rsid     = col_character(),
+            gtex_id  = col_character(),
+            allele1  = col_character(),
+            allele2  = col_character()
+        )
+    ) %>%
+    filter(
+        chr == glue("chr{chromosome}")
+    )
+
+tbl_dosage <- read_csv(
+    glue("output/dosages/chromosome-{chromosome}.dosage.csv"),
+    trim_ws    = TRUE,
+    na         = ".",
+    progress   = TRUE,
+    col_names  = c(
+        "chr",
+        "rsid",
+        "position",
+        "allele1",
+        "allele2",
+        "MAF",
+        sample_ids
+    ),
+    col_types  = cols(
+        .default = col_double(),
+        chr      = col_character(),
+        rsid     = col_character(),
+        position = col_integer(),
+        allele1  = col_character(),
+        allele2  = col_character()
+    )
+)
+
+
+
+tbl_matched <- tbl_model_variants %>%
+    inner_join(
+        select(tbl_dosage, -chr, -rsid),
+        by = c("position", "allele1", "allele2")
+    )
+
+write_delim(
+    tbl_matched %>%
+        arrange(position) %>%
+        select(-rsid) %>%
+        select(chr, gtex_id, position, allele1, allele2, MAF, everything()) %>%
+        rename(chromosome = chr) %>%
+        filter(complete.cases(.)),
+    delim     = " ",
+    col_names = FALSE,
+    path      = out_file_matched
+)
+
+
+
+tbl_unmatched <- left_join(
+        tbl_model_variants,
+        select(tbl_dosage, position, allele1, allele2),
+        by     = "position",
+        suffix = c("..model", "..dosage")
+    ) %>%
+    filter(
+        !(gtex_id %in% tbl_matched$gtex_id)
+    )
+
+write_csv(
+    tbl_unmatched,
+    path  = out_file_unmatched
+)
+
+cat(sprintf(
+    "%s: %9i (%6.2f%%) model variants found for chromosome %i\n\r",
+    Sys.time(),
+    nrow(tbl_matched), 100*nrow(tbl_matched)/nrow(tbl_model_variants),
+    chromosome
+))
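Note on the matching step: the script joins model variants to dosage rows on `position`, `allele1`, and `allele2`, and reports every model variant without a partner as unmatched. A compact Python sketch of that idea follows. The toy records are invented, `match_variants` is a hypothetical helper, and the sketch assumes each key occurs at most once in the dosage data (with duplicates the `inner_join` would emit several rows, whereas the dictionary keeps only the last one).

```python
def match_variants(model_variants, dosage_rows):
    """Split model variants into matched and unmatched lists."""
    # Index dosage rows by the same key that the R script joins on.
    dosage_index = {
        (row["position"], row["allele1"], row["allele2"]): row
        for row in dosage_rows
    }
    matched, unmatched = [], []
    for variant in model_variants:
        key = (variant["position"], variant["allele1"], variant["allele2"])
        row = dosage_index.get(key)
        if row is None:
            unmatched.append(variant)
        else:
            # Combine model annotation and dosage columns into one record.
            matched.append({**variant, **row})
    return matched, unmatched


# Invented toy data: the second variant has swapped, complementary alleles.
model = [
    {"gtex_id": "chr1_100_A_G_b38", "position": 100, "allele1": "A", "allele2": "G"},
    {"gtex_id": "chr1_200_C_T_b38", "position": 200, "allele1": "C", "allele2": "T"},
]
dosage = [
    {"position": 100, "allele1": "A", "allele2": "G", "MAF": 0.12},
    {"position": 200, "allele1": "G", "allele2": "A", "MAF": 0.30},
]

if __name__ == "__main__":
    matched, unmatched = match_variants(model, dosage)
    share = 100 * len(matched) / len(model)
    print(f"{len(matched)} of {len(model)} ({share:.2f}%) model variants matched")
```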
--- a/scripts/pull.sh
+++ b/scripts/pull.sh
-mkdir -p output/imputed-gene-expressions
-rsync -avzhe ssh --progress \
-    kk656@login-cpu.hpc.cam.ac.uk:/home/kk656/scratch/impute-gene-expression/output/imputed-gene-expressions \
-    output/imputed-gene-expressions
--- a/container.def
+++ b/container.def
 Bootstrap: docker

-From: ubuntu:18.04
+From: rocker/verse:3.6.2
+
+%files
+    install.R /tmp/install.R

 %post
    # non-interactive debconf
@@ -17,45 +20,36 @@ From: ubuntu:18.04
    apt-get update && apt-get -y install google-cloud-sdk

    # install python3 and snakemake
-    apt-get -y install \
-        python3 python3-pip
-    pip3 install snakemake
+    apt-get -y install python3 python3-pip
+    pip3 install snakemake==5.10.0

    # install bcftools
+    export BCFVER=1.10.2
    apt-get -y install \
-        gcc wget make zlib1g zlib1g-dev libbz2-dev liblzma-dev libcurl4-openssl-dev
-    wget https://github.com/samtools/bcftools/releases/download/1.9/bcftools-1.9.tar.bz2
-    tar -xvjf bcftools-1.9.tar.bz2
-    cd bcftools-1.9
+        gcc wget make zlib1g zlib1g-dev libbz2-dev liblzma-dev libcurl4-openssl-dev pv
+    wget https://github.com/samtools/bcftools/releases/download/$BCFVER/bcftools-$BCFVER.tar.bz2
+    tar -xvjf bcftools-$BCFVER.tar.bz2
+    cd bcftools-$BCFVER
    ./configure --prefix=/usr/bcftools
    make
    make install
    (cd /usr/bin; ln -s /usr/bcftools/bin/bcftools bcftools)

-    # install PrediXcan and python dependencies (uses python 2.7)
+    # install PrediXcan and python (2) dependencies
    apt-get -y install \
-        wget python-pip
-    wget https://raw.githubusercontent.com/hakyimlab/PrediXcan/master/Software/PrediXcan.py -O /usr/bin/predixcan
-    chmod +x /usr/bin/predixcan
+        python-pip
    pip install \
-        argparse datetime numpy
-    # download, extract and store (brain) weights
-    mkdir /usr/predixcan
-    wget https://s3.amazonaws.com/predictdb2/GTEx-V7_HapMap-2017-11-29.tar.gz -O /usr/predixcan/GTEx-V7_HapMap-2017-11-29.tar.gz
-    (cd /usr/predixcan; \
-        mkdir GTEx-V7_HapMap-2017-11-29; \
-        # we only need the brain tissue weights:
-        tar -xvz -f GTEx-V7_HapMap-2017-11-29.tar.gz -C GTEx-V7_HapMap-2017-11-29 --wildcards "*_Brain_*"; \
-        rm GTEx-V7_HapMap-2017-11-29.tar.gz \
-        )
-    # predixcan connects to the weights database with sql, needs write permission
-    # even if the file system will be read only for the container
-    chmod -R 777 /usr/predixcan
-
-    # install R and packages
-    apt-get -y install r-base
-    Rscript -e "install.packages('dplyr')"
-    Rscript -e "install.packages('yaml')"
-    Rscript -e "install.packages('purrr')"
-    Rscript -e "install.packages('readr')"
-    Rscript -e "install.packages('tidyr')"
+        numpy==1.16.6
+    mkdir -p /usr/PrediXcan
+    (cd /usr; \
+        git clone https://github.com/hakyimlab/PrediXcan; \
+        cd PrediXcan; \
+        git checkout e77dd8a04a0345cb63aa634d4f8acc6aca9e25e0)
+    ln -s /usr/PrediXcan/Software/PrediXcan.py /usr/bin/predixcan
+    chmod +x /usr/bin/predixcan
+
+    # install bc
+    apt-get -y install bc
+
+    # install R packages
+    Rscript /tmp/install.R
--- a/scripts/slurm-snakemake.sh
+++ b/scripts/slurm-snakemake.sh
-mkdir -p logs
-nohup snakemake $1 \
-    --jobs 13 \
-    --use-singularity \
-    --cluster-config cluster.json \
-    --cluster "sbatch -A {cluster.account} -p {cluster.partition} --ntasks {cluster.ntasks} --cpus-per-task {cluster.ncpu} --nodes {cluster.nodes} -t {cluster.time} --job-name {cluster.name} --output {cluster.output} --error {cluster.error}" &