Last updated: 2023-09-22

Checks: 2 0

Knit directory: snakemake_tutorial/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version bc62309. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/config.Rmd) and HTML (docs/config.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
html bc62309 Jean Morrison 2023-09-22 updates
Rmd 8972212 Jean Morrison 2023-09-21 updates
html 8972212 Jean Morrison 2023-09-21 updates
Rmd 656c8b4 Jean Morrison 2023-09-21 add new content
html 656c8b4 Jean Morrison 2023-09-21 add new content

Introduction

In this section we will learn about the params: field for rules how to pass options to Snakemake using a config file.

In this Section:

  1. Modify a rule to use params:
  2. Write and use a config file in our workflow

Params

In our previous workflow, two of our steps, variant_qc and ld_prune include parameters that we hard coded into the command on the shell: line. We could hav also passed these in using the params: field and placeholders.

For example, we could use

rule variant_qc:
    input: "data/all.vcf.gz"
    output: multiext("data/all_varqc", ".pgen", ".psam", ".pvar", ".log")
    params: geno_thresh = 0.1, qual_thresh = 95, maf = 0.01
    shell: """
           plink2 --vcf {input} --set-missing-var-ids '@:#' --snps-only \
                 --var-min-qual {params.qual_thresh} \
                 --geno {params.geno_thresh} \
                 --maf {params.maf} \
                 --make-pgen \
                 --out data/all_varqc 
          """
          
rule ld_prune:
    input: multiext("data/{prefix}", ".pgen", ".psam", ".pvar")
    output: "data/{prefix}.prune.in"
    params: kb = "1000kb", r2 = 0.01
    shell: """
          plink2 --pfile data/{wildcards.prefix} \
          --indep-pairwise {params.kb} {params.r2} \
          --out data/{wildcards.prefix}
          """

Notice that we are now using triple quotes on the shell line. This allows us to have line breaks inside of the command which makes for easier reading.

Config Files

Putting the parameters on the params: line makes for easier reading but doesn’t really add much flexibility. The parameters are still hard coded into the Snakefile. Often the utility of the params: field comes from combining it with a configuration file. This is especially useful if you want to distribute your workflow to others. It is a lot easier to tell someone how to change a configuration file than to tell them how to edit your Snakefile.

Lets create a configuration file that contains the variant QC parameters and the LD pruning parameters. The config file will be a yaml file (“yet another markup language”) which is a nice, human readable way to specify information. Create a file in your tutorial directory called config.yaml (or any other name you prefer) and put this content in it:

QC:
    geno_thresh: 0.1
    qual_thresh: 95
    maf: 0.01
LD: 
    kb: "1000kb"
    r2: 0.01

Each line of the yaml file begins with a label followed by a colon. You can have as many sub-levels as you want, indicated by tabs.

Now we need to tell Snakemake where to find the configuration file. Add this line to the top of your Snakefile

configfile: "config.yaml"

Finally, we need to access the values in the config file. We can do this with nested brackets, so config["QC"]["geno_thresh"] indicates the content of the geno_thresh: line.

Modify the params: lines of the variant_qc and ld_prune rules. These rules should now read

rule variant_qc:
    input: "data/all.vcf.gz"
    output: multiext("data/all_varqc", ".pgen", ".psam", ".pvar", ".log")
    params: geno_thresh = config["QC"]["geno_thresh"], 
            qual_thresh = config["QC"]["qual_thresh"], 
            maf = config["QC"]["maf"]
    shell: """
           plink2 --vcf {input} --set-missing-var-ids '@:#' --snps-only \
                 --var-min-qual {params.qual_thresh} \
                 --geno {params.geno_thresh} \
                 --maf {params.maf} \
                 --make-pgen \
                 --out data/all_varqc 
          """

rule ld_prune:
    input: multiext("data/{prefix}", ".pgen", ".psam", ".pvar")
    output: "data/{prefix}.prune.in"
    params: kb = config["LD"]["kb"], r2 = config["LD"]["r2"]
    shell: """
          plink2 --pfile data/{wildcards.prefix} \
          --indep-pairwise {params.kb} {params.r2} \
          --out data/{wildcards.prefix}
          """

Special Challenge: Modify the maf line of the config file to give a list, like [0.01, 0.05]. Can you modify the pipeline so that two sets of pca results are produced, one using each of the values given to the maf line in the config file? An answer is in code/6.Snakefile and code/6.config.yaml.

Final Snakefile

The Snakefile you should have at the end of this section (excluding the special challenge) can be found in code/5.Snakefile (that file uses 5.config.yaml instead of config.yaml).

configfile: "config.yaml"

rule all:
    input: "results/all_varqc.eigenvec", expand("results/chr{c}_stats.txt", c = range(20, 23))

rule get_stats:
    input: "data/chr{c}.vcf.gz"
    output: "results/chr{c}_stats.txt"
    shell: "mkdir -p results; bcftools stats {input} > {output}"

rule combine_data:
    input: expand("data/chr{c}.vcf.gz", c = range(20, 23))
    output: "data/all.vcf.gz"
    shell: "bcftools concat -o {output} {input}"

rule variant_qc:
    input: "data/all.vcf.gz"
    output: multiext("data/all_varqc", ".pgen", ".psam", ".pvar", ".log")
    params: geno_thresh = config["QC"]["geno_thresh"],
            qual_thresh = config["QC"]["qual_thresh"],
            maf = config["QC"]["maf"]
    shell: """
           plink2 --vcf {input} --set-missing-var-ids '@:#' --snps-only \
                 --var-min-qual {params.qual_thresh} \
                 --geno {params.geno_thresh} \
                 --maf {params.maf} \
                 --make-pgen \
                 --out data/all_varqc 
          """

rule ld_prune:
    input: multiext("data/{prefix}", ".pgen", ".psam", ".pvar")
    output: "data/{prefix}.prune.in"
    params: kb = config["LD"]["kb"], r2 = config["LD"]["r2"]
    shell: """
          plink2 --pfile data/{wildcards.prefix} \
          --indep-pairwise {params.kb} {params.r2} \
          --out data/{wildcards.prefix}
          """

rule pca:
    input: data = multiext("data/{prefix}", ".pgen", ".psam", ".pvar"), prune = "data/{prefix}.prune.in"
    output: "results/{prefix}.eigenvec"
    shell: "plink2 --pfile data/{wildcards.prefix} --extract {input.prune} --pca --out results/{wildcards.prefix}"