Last updated: 2023-09-22
Checks: 2 0
Knit directory: snakemake_tutorial/
This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version bc62309. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish
or
wflow_git_commit
). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/config.Rmd
) and HTML
(docs/config.html
) files. If you’ve configured a remote Git
repository (see ?wflow_git_remote
), click on the hyperlinks
in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
html | bc62309 | Jean Morrison | 2023-09-22 | updates |
Rmd | 8972212 | Jean Morrison | 2023-09-21 | updates |
html | 8972212 | Jean Morrison | 2023-09-21 | updates |
Rmd | 656c8b4 | Jean Morrison | 2023-09-21 | add new content |
html | 656c8b4 | Jean Morrison | 2023-09-21 | add new content |
In this section we will learn about the params:
field
for rules how to pass options to Snakemake using a config file.
In this Section:
params:
In our previous workflow, two of our steps, variant_qc
and ld_prune
include parameters that we hard coded into the
command on the shell:
line. We could hav also passed these
in using the params:
field and placeholders.
For example, we could use
rule variant_qc:
input: "data/all.vcf.gz"
output: multiext("data/all_varqc", ".pgen", ".psam", ".pvar", ".log")
params: geno_thresh = 0.1, qual_thresh = 95, maf = 0.01
shell: """
plink2 --vcf {input} --set-missing-var-ids '@:#' --snps-only \
--var-min-qual {params.qual_thresh} \
--geno {params.geno_thresh} \
--maf {params.maf} \
--make-pgen \
--out data/all_varqc
"""
rule ld_prune:
input: multiext("data/{prefix}", ".pgen", ".psam", ".pvar")
output: "data/{prefix}.prune.in"
params: kb = "1000kb", r2 = 0.01
shell: """
plink2 --pfile data/{wildcards.prefix} \
--indep-pairwise {params.kb} {params.r2} \
--out data/{wildcards.prefix}
"""
Notice that we are now using triple quotes on the shell line. This allows us to have line breaks inside of the command which makes for easier reading.
Putting the parameters on the params:
line makes for
easier reading but doesn’t really add much flexibility. The parameters
are still hard coded into the Snakefile. Often the utility of the
params:
field comes from combining it with a configuration
file. This is especially useful if you want to distribute your workflow
to others. It is a lot easier to tell someone how to change a
configuration file than to tell them how to edit your Snakefile.
Lets create a configuration file that contains the variant QC
parameters and the LD pruning parameters. The config file will be a yaml
file (“yet another markup language”) which is a nice, human readable way
to specify information. Create a file in your tutorial directory called
config.yaml
(or any other name you prefer) and put this
content in it:
QC:
geno_thresh: 0.1
qual_thresh: 95
maf: 0.01
LD:
kb: "1000kb"
r2: 0.01
Each line of the yaml file begins with a label followed by a colon. You can have as many sub-levels as you want, indicated by tabs.
Now we need to tell Snakemake where to find the configuration file. Add this line to the top of your Snakefile
configfile: "config.yaml"
Finally, we need to access the values in the config file. We can do
this with nested brackets, so config["QC"]["geno_thresh"]
indicates the content of the geno_thresh:
line.
Modify the params:
lines of the variant_qc
and ld_prune
rules. These rules should now read
rule variant_qc:
input: "data/all.vcf.gz"
output: multiext("data/all_varqc", ".pgen", ".psam", ".pvar", ".log")
params: geno_thresh = config["QC"]["geno_thresh"],
qual_thresh = config["QC"]["qual_thresh"],
maf = config["QC"]["maf"]
shell: """
plink2 --vcf {input} --set-missing-var-ids '@:#' --snps-only \
--var-min-qual {params.qual_thresh} \
--geno {params.geno_thresh} \
--maf {params.maf} \
--make-pgen \
--out data/all_varqc
"""
rule ld_prune:
input: multiext("data/{prefix}", ".pgen", ".psam", ".pvar")
output: "data/{prefix}.prune.in"
params: kb = config["LD"]["kb"], r2 = config["LD"]["r2"]
shell: """
plink2 --pfile data/{wildcards.prefix} \
--indep-pairwise {params.kb} {params.r2} \
--out data/{wildcards.prefix}
"""
Special Challenge: Modify the maf
line
of the config file to give a list, like [0.01, 0.05]
. Can
you modify the pipeline so that two sets of pca results are produced,
one using each of the values given to the maf
line in the
config file? An answer is in code/6.Snakefile
and
code/6.config.yaml
.
The Snakefile you should have at the end of this section (excluding
the special challenge) can be found in code/5.Snakefile
(that file uses 5.config.yaml
instead of
config.yaml
).
configfile: "config.yaml"
rule all:
input: "results/all_varqc.eigenvec", expand("results/chr{c}_stats.txt", c = range(20, 23))
rule get_stats:
input: "data/chr{c}.vcf.gz"
output: "results/chr{c}_stats.txt"
shell: "mkdir -p results; bcftools stats {input} > {output}"
rule combine_data:
input: expand("data/chr{c}.vcf.gz", c = range(20, 23))
output: "data/all.vcf.gz"
shell: "bcftools concat -o {output} {input}"
rule variant_qc:
input: "data/all.vcf.gz"
output: multiext("data/all_varqc", ".pgen", ".psam", ".pvar", ".log")
params: geno_thresh = config["QC"]["geno_thresh"],
qual_thresh = config["QC"]["qual_thresh"],
maf = config["QC"]["maf"]
shell: """
plink2 --vcf {input} --set-missing-var-ids '@:#' --snps-only \
--var-min-qual {params.qual_thresh} \
--geno {params.geno_thresh} \
--maf {params.maf} \
--make-pgen \
--out data/all_varqc
"""
rule ld_prune:
input: multiext("data/{prefix}", ".pgen", ".psam", ".pvar")
output: "data/{prefix}.prune.in"
params: kb = config["LD"]["kb"], r2 = config["LD"]["r2"]
shell: """
plink2 --pfile data/{wildcards.prefix} \
--indep-pairwise {params.kb} {params.r2} \
--out data/{wildcards.prefix}
"""
rule pca:
input: data = multiext("data/{prefix}", ".pgen", ".psam", ".pvar"), prune = "data/{prefix}.prune.in"
output: "results/{prefix}.eigenvec"
shell: "plink2 --pfile data/{wildcards.prefix} --extract {input.prune} --pca --out results/{wildcards.prefix}"