bugfix

Commit 36e4a0ef authored Sep 02, 2019 by Kevin Kunzmann

Showing with 29 additions and 19 deletions

README.md +27 -17

Snakefile +2 -2

--- a/README.md
+++ b/README.md
 # Impute gene expression for CENTER-TBI with PrediXcan

-The singularity container with most software dpendencies is available at 
-https://doi.org/10.5281/zenodo.3376504. 
+The singularity container with most software dependencies is available at
+https://doi.org/10.5281/zenodo.3376504.
 Data currently needs to be accessed manually due to access restrictions.
 This workflow is designed for *.vcf.gz files with dosage (DS) information.

 More information on PrediXcan can be found here https://github.com/hakyimlab/PrediXcan and in:
-> Gamazon ER†, Wheeler HE†, Shah KP†, Mozaffari SV, Aquino-Michaels K, Carroll RJ, Eyler AE, Denny JC, 
+> Gamazon ER†, Wheeler HE†, Shah KP†, Mozaffari SV, Aquino-Michaels K, Carroll RJ, Eyler AE, Denny JC,
 Nicolae DL, Cox NJ, Im HK. (2015) A gene-based association method for mapping traits using reference transcriptome data.
 Nat Genet. doi:10.1038/ng.3367.

 ## Dependencies

-1. linux shell (`bash`), possibly via virtual machine on Windows/Mac 
+1. linux shell (`bash`), possibly via virtual machine on Windows/Mac
 2. `wget` (pre-installed or via distribution package manager)
 3. `singularity` container software (tested on 3.3.0, https://sylabs.io/guides/3.3/user-guide)
 4. `git` (https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
+5. for data download: access to the fimm GCP bucket `fimm-horizon-outgoing-data/CENTER_TBI_data_freeze_190829/Imputed_data`

 ### Optional

-5. python 3.7+ and snakemake 
+5. python 3.7+ and snakemake
 6. slurm cluster

 We use snakemake to organize the workflow (also pre-installed in the container) and support cluster execution.
 Snakemake is available via `pip` package for python 3.7.
-> Johannes Köster, Sven Rahmann, Snakemake—a scalable bioinformatics workflow engine, Bioinformatics, 
+> Johannes Köster, Sven Rahmann, Snakemake—a scalable bioinformatics workflow engine, Bioinformatics,
 Volume 28, Issue 19, 1 October 2012, Pages 2520–2522, https://doi.org/10.1093/bioinformatics/bts480


 ## Execution
- 
+
 Download and extract the contents of this repository (might be access restricted)

    git clone https://github.com/kkmann/impute-gene-expression
@@ -38,24 +39,33 @@ Download and extract the contents of this repository (might be access restricted
 Download the container image

    bash scripts/download_container.sh
-    
+
+Obtain the imputed genomes (GCP bucket: fimm-horizon-outgoing-data/CENTER_TBI_data_freeze_190829/Imputed_data).
+`gsutil` is pre-installed in the container image. To authenticate with your
+GCP account, run the following and follow the interactive instructions
+
+    singularity shell container.sif
+    gcloud auth login
+    snakemake download_imputed_genotypes
+    exit
+
 Execute the workflow inside the container on a single core (takes a while!)
-  
-    singularity exec container.sif snakemake impute 
-    
+
+    singularity exec container.sif snakemake impute
+
 Optionally, if snakemake is installed, the workflow can be run in parallel via
-  
+
    snakemake --use-singularity -j 8 impute
-    
+
 where '8' can be replaced by the number of available cores.
 Cluster execution is enabled via the `scripts/slurm_snakemake.sh` script as

    bash scripts/slurm_snakemake.sh impute
-    
-    
+
+
 ## Results

-Intermediate files (PrediXcan dosage files and raw outputs) are stored in the `outputs/` 
+Intermediate files (PrediXcan dosage files and raw outputs) are stored in the `outputs/`
 subfolder of the working directory.
-The file `output/gene_expressions_combined.rds` combines imputed gene expression levels across 
+The file `output/gene_expressions_combined.rds` combines imputed gene expression levels across
 all available brain regions in a compressed .rds file  (R data set).
--- a/Snakefile
+++ b/Snakefile
@@ -70,7 +70,7 @@ rule vcf_to_dosages:
        export prefix={wildcards.output_dir}/dosages
        mkdir -p $prefix
        echo "extracting and computing MAFs ..."
-        bcftools +fill-tags {inputs.vcf_gz_file} > $prefix/chr{wildcards.i}.vcf
+        bcftools +fill-tags {input.vcf_gz_file} > $prefix/chr{wildcards.i}.vcf
        echo 'querying dosages ...'
        bcftools query -e 'MAF[0]>{config[min_MAF]} | INFO>{config[min_INFO]} | TYPE!="snp" | N_ALT!=1' -f '%CHROM %ID %POS %REF %ALT %INFO/MAF [%DS ]\n' $prefix/chr{wildcards.i}.vcf > $prefix/chr{wildcards.i}.dosage.txt
        echo 'compressing ...'
@@ -94,7 +94,7 @@ rule generate_samples_file:
        """
        export prefix={wildcards.output_dir}/dosages
        mkdir -p $prefix
-        bcftools query -l {inputs.vcf_gz_file} >> $prefix/samples_.txt
+        bcftools query -l {input.vcf_gz_file} >> $prefix/samples_.txt
        # family ID = individual ID
        awk {params.format} < $prefix/samples_.txt > $prefix/samples.txt
        rm $prefix/samples_.txt
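
Note on the Snakefile change: both hunks replace `{inputs.vcf_gz_file}` with `{input.vcf_gz_file}`. Snakemake fills the placeholders in a rule's shell command from a fixed set of names (`input`, `output`, `wildcards`, `params`, `config`, among others), so a misspelled name such as `inputs` cannot be resolved. The sketch below is only an analogy built on Python's `str.format`, not Snakemake's actual implementation; the `rule_namespace` dictionary, the `render` helper, and the example file path are hypothetical.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the names a rule's shell command can refer to.
# Snakemake builds this itself; the path and wildcard values are invented.
rule_namespace = {
    "input": SimpleNamespace(vcf_gz_file="data/chr1.vcf.gz"),
    "wildcards": SimpleNamespace(output_dir="output", i="1"),
}


def render(command: str) -> str:
    """Fill {name.attribute} placeholders from the namespace via str.format."""
    return command.format(**rule_namespace)


fixed = "bcftools query -l {input.vcf_gz_file}"
broken = "bcftools query -l {inputs.vcf_gz_file}"

# The corrected placeholder resolves to the declared input file.
print(render(fixed))

# The misspelled placeholder has no matching name, so formatting fails
# before any shell command could run.
try:
    render(broken)
except KeyError as err:
    print(f"unknown placeholder: {err}")
```

In this analogy the failure surfaces as a `KeyError`; Snakemake reports unknown names with its own error message, but the cause is the same: the placeholder must match one of the names the rule actually provides.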