Commit 36e4a0ef authored by Kevin Kunzmann

bugfix

parent ba722fac
# Impute gene expression for CENTER-TBI with PrediXcan
The singularity container with most software dependencies is available at
https://doi.org/10.5281/zenodo.3376504.
Data currently needs to be accessed manually due to access restrictions.
This workflow is designed for `*.vcf.gz` files with dosage (DS) information.
More information on PrediXcan can be found at https://github.com/hakyimlab/PrediXcan and in:
> Gamazon ER†, Wheeler HE†, Shah KP†, Mozaffari SV, Aquino-Michaels K, Carroll RJ, Eyler AE, Denny JC,
Nicolae DL, Cox NJ, Im HK. (2015) A gene-based association method for mapping traits using reference transcriptome data.
Nat Genet. doi:10.1038/ng.3367.
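
Because the workflow relies on dosage (DS) information, it can help to confirm up front that an input file declares a DS FORMAT field in its header. The following is a minimal sketch and not part of this repository: the function name `has_ds_field` is hypothetical, and it assumes only `gzip` and `awk`, relying on bgzip-compressed files being readable by `gzip`.

```shell
# has_ds_field FILE
# Succeeds (exit status 0) if the header of the gzip-compressed VCF declares
# a DS FORMAT field, fails otherwise. Hypothetical helper, not part of the workflow.
has_ds_field() {
    # Stop reading at the first non-header line so large files are not scanned fully.
    gzip -dc "$1" | awk '
        /^##FORMAT=<ID=DS,/ { found = 1 }
        !/^#/               { exit }
        END                 { exit !found }
    '
}

# Example (the file name is a placeholder):
#   has_ds_field chr1.vcf.gz && echo "DS field declared"
```

This only inspects the header declaration; it does not verify that every record actually carries a DS value.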
## Dependencies
1. linux shell (`bash`), possibly via virtual machine on Windows/Mac
2. `wget` (pre-installed or via distribution package manager)
3. `singularity` container software (tested on 3.3.0, https://sylabs.io/guides/3.3/user-guide)
4. `git` (https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
5. for data download: access to the fimm GCP bucket `fimm-horizon-outgoing-data/CENTER_TBI_data_freeze_190829/Imputed_data`
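
The command-line dependencies listed above can be checked before starting. This is a minimal sketch rather than a script shipped with the repository; the function name `missing_tools` is hypothetical and it uses only the POSIX `command -v` builtin.

```shell
# missing_tools TOOL...
# Prints each given command that is not found on PATH, one per line;
# prints nothing if all are available. Hypothetical helper.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example:
#   missing_tools bash wget singularity git
```

An empty result means all four tools were found; GCP bucket access still has to be verified separately.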
### Optional
5. python 3.7+ and snakemake
6. slurm cluster

We use snakemake to organize the workflow (also pre-installed in the container) and support cluster execution.
Snakemake is available as a `pip` package for python 3.7+.
> Johannes Köster, Sven Rahmann, Snakemake—a scalable bioinformatics workflow engine, Bioinformatics,
Volume 28, Issue 19, 1 October 2012, Pages 2520–2522, https://doi.org/10.1093/bioinformatics/bts480
## Execution
Download and extract the contents of this repository (might be access restricted)

    git clone https://github.com/kkmann/impute-gene-expression
@@ -38,24 +39,33 @@ Download and extract the contents of this repository (might be access restricted
Download the container image

    bash scripts/download_container.sh

Obtain the imputed genomes (GCP bucket: `fimm-horizon-outgoing-data/CENTER_TBI_data_freeze_190829/Imputed_data`).
`gsutil` is pre-installed in the container image. To authenticate with your
GCP account, run the following and follow the interactive instructions

    singularity shell container.sif
    gcloud auth login
    snakemake download_imputed_genotypes
    exit

Execute the workflow inside the container on a single core (takes a while!)

    singularity exec container.sif snakemake impute

Optionally, if snakemake is installed, the workflow can be run in parallel via

    snakemake --use-singularity -j 8 impute

where '8' can be replaced by the number of available cores.
Cluster execution is enabled via the `scripts/slurm_snakemake.sh` script as

    bash scripts/slurm_snakemake.sh impute

## Results
Intermediate files (PrediXcan dosage files and raw outputs) are stored in the `outputs/`
subfolder of the working directory.
The file `output/gene_expressions_combined.rds` combines imputed gene expression levels across
all available brain regions in a compressed .rds file (R data set).
@@ -70,7 +70,7 @@ rule vcf_to_dosages:
export prefix={wildcards.output_dir}/dosages
mkdir -p $prefix
echo "extracting and computing MAFs ..."
- bcftools +fill-tags {inputs.vcf_gz_file} > $prefix/chr{wildcards.i}.vcf
+ bcftools +fill-tags {input.vcf_gz_file} > $prefix/chr{wildcards.i}.vcf
echo 'querying dosages ...'
bcftools query -e 'MAF[0]>{config[min_MAF]} | INFO>{config[min_INFO]} | TYPE!="snp" | N_ALT!=1' -f '%CHROM %ID %POS %REF %ALT %INFO/MAF [%DS ]\n' $prefix/chr{wildcards.i}.vcf > $prefix/chr{wildcards.i}.dosage.txt
echo 'compressing ...'
@@ -94,7 +94,7 @@ rule generate_samples_file:
"""
export prefix={wildcards.output_dir}/dosages
mkdir -p $prefix
- bcftools query -l {inputs.vcf_gz_file} >> $prefix/samples_.txt
+ bcftools query -l {input.vcf_gz_file} >> $prefix/samples_.txt
# family ID = individual ID
awk {params.format} < $prefix/samples_.txt > $prefix/samples.txt
rm $prefix/samples_.txt