The Why: Analysis Reproducibility
- Modern CMS analysis is too complex to be managed by memory or manual scripts.
- Reproducibility is a productivity tool: it makes your own work easier to revise and update.
- Capturing the Workflow (the “Glue”) is the final step in ensuring an analysis is truly reusable.
Introduction to Snakemake
- A Snakefile defines the workflow.
- A rule contains input, output, and a shell command.
- You execute the workflow by asking for the output file, not the rule name.
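The three points above can be sketched with a one-rule workflow. To keep the example self-contained, the short Python script below stores the Snakefile text and writes it to disk. The rule name make_greeting and the paths data/name.txt and results/greeting.txt are hypothetical and are not taken from the lesson.

```python
from pathlib import Path
import textwrap

# Hypothetical one-rule workflow: names and paths are illustrative only.
SNAKEFILE = textwrap.dedent("""\
    rule make_greeting:
        input:
            "data/name.txt"
        output:
            "results/greeting.txt"
        shell:
            "echo Hello $(cat {input}) > {output}"
    """)


def write_snakefile(directory="."):
    """Save the workflow text as 'Snakefile' in the given directory."""
    path = Path(directory) / "Snakefile"
    path.write_text(SNAKEFILE)
    return path


# The target you request is the output file, not the rule name.
TARGET_COMMAND = ["snakemake", "--cores", "1", "results/greeting.txt"]
```

With the Snakefile in place and data/name.txt present, running the command held in TARGET_COMMAND would ask Snakemake for results/greeting.txt, and Snakemake would pick the rule that produces it.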
Chaining Rules (The DAG)
- Declarative Workflows: Unlike bash scripts where you define the order of steps, in Snakemake you define the dependencies (inputs/outputs), and Snakemake figures out the order (DAG).
- The all Rule: It is convention to include a rule named all at the top of the workflow to define the final targets of your analysis.
- Lazy Execution: Snakemake re-runs a rule only if an output file is missing or an input file has changed (has a newer timestamp than the output). Recent Snakemake versions can also re-run a rule when its code or parameters change.
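To make the declarative idea concrete, here is a minimal pure-Python sketch of how an execution order can be derived from inputs and outputs alone, plus a timestamp check in the spirit of lazy execution. It only mimics the idea and is not Snakemake's implementation. The rule and file names are hypothetical.

```python
from graphlib import TopologicalSorter
import os

# Hypothetical rules, deliberately listed out of order.
RULES = {
    "plot":      {"input": ["hist.root"],   "output": ["plot.png"]},
    "skim":      {"input": ["events.root"], "output": ["skim.root"]},
    "histogram": {"input": ["skim.root"],   "output": ["hist.root"]},
}


def execution_order(rules):
    """Order rules so every producer runs before its consumers."""
    producer = {out: name for name, r in rules.items() for out in r["output"]}
    # Map each rule to the set of rules that create its input files.
    graph = {
        name: {producer[f] for f in r["input"] if f in producer}
        for name, r in rules.items()
    }
    return list(TopologicalSorter(graph).static_order())


def needs_rerun(inputs, outputs):
    """Timestamp check: rerun if an output is missing or older than an input."""
    if not all(os.path.exists(o) for o in outputs):
        return True
    oldest_output = min(os.path.getmtime(o) for o in outputs)
    return any(
        os.path.getmtime(i) > oldest_output
        for i in inputs
        if os.path.exists(i)
    )


print(execution_order(RULES))  # ['skim', 'histogram', 'plot']
```

The order in which the rules are written does not matter: the dependency chain events.root, skim.root, hist.root, plot.png fixes the order.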
Scaling with Wildcards
- Wildcards: Use {name} in filenames to define a generic rule.
- Constraints: Snakemake fills wildcards by looking at the output you requested and propagating that value to the input.
- expand(): A Python function that generates a list of filenames from a pattern. It is commonly used in rule all to define the final targets.
- Parallelism: With wildcards, Snakemake can run multiple independent jobs in parallel using the --cores flag.
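The wildcard mechanics above can be illustrated with a small pure-Python sketch. The helpers match_wildcards and expand_pattern are hypothetical stand-ins written for this example, not Snakemake's own functions, and the sketch assumes each wildcard name appears only once in a pattern.

```python
import re
from itertools import product


def match_wildcards(pattern, filename):
    """Return the wildcard values if filename fits pattern, else None."""
    # Split "hist/{sample}.root" into literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        # Even positions are literal text, odd positions are wildcard names.
        regex += re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+)"
    m = re.fullmatch(regex, filename)
    return m.groupdict() if m else None


def expand_pattern(pattern, **values):
    """Fill a pattern with every combination of the given wildcard values."""
    names = list(values)
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*(values[n] for n in names))
    ]


# Requested output -> wildcard value -> matching input.
wildcards = match_wildcards("hist/{sample}.root", "hist/ttbar.root")
input_file = "skim/{sample}.root".format(**wildcards)
print(wildcards, input_file)  # {'sample': 'ttbar'} skim/ttbar.root

# A list of final targets, as one might place in rule all.
targets = expand_pattern("hist/{sample}.root", sample=["ttbar", "wjets"])
print(targets)  # ['hist/ttbar.root', 'hist/wjets.root']
```

Because each target in the list depends only on its own sample, the corresponding jobs are independent, which is what allows them to run in parallel.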
Visualizing the Workflow
- DAG: A visual map of your analysis dependencies.
- Dry-run (-n): Always perform a dry-run to verify the plan before executing.
- Rule Graph: A simplified visualization showing the relationship between rules rather than individual files.
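The difference between the full DAG and the rule graph can be sketched in a few lines of Python: the DAG has one node per job, while the rule graph collapses all jobs of the same rule into one node. The job list below is hypothetical, and the output is Graphviz DOT text of the kind a tool such as dot can render.

```python
# Hypothetical job-level DAG: one job per (rule, sample) pair.
JOB_EDGES = [
    (("skim", "ttbar"), ("histogram", "ttbar")),
    (("skim", "wjets"), ("histogram", "wjets")),
    (("histogram", "ttbar"), ("plot", "all")),
    (("histogram", "wjets"), ("plot", "all")),
]


def rule_edges(job_edges):
    """Collapse job-level edges into unique rule-to-rule edges."""
    return sorted({(src[0], dst[0]) for src, dst in job_edges})


def to_dot(edges):
    """Render edges as Graphviz DOT text."""
    lines = ["digraph workflow {"]
    lines += [f'    "{a}" -> "{b}";' for a, b in edges]
    lines.append("}")
    return "\n".join(lines)


# Four job-level edges reduce to two rule-level edges.
print(to_dot(rule_edges(JOB_EDGES)))
```

In Snakemake itself, the equivalent views are usually produced on the command line, for example with snakemake -n for the dry-run, and snakemake --dag or snakemake --rulegraph piped into dot (assuming Graphviz is installed).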
Containerized Execution
- container:: A rule-level directive that specifies the Docker/Apptainer image to use.
- --use-apptainer: The command-line flag required to enable container execution.
- --apptainer-args: Use this to bind external storage paths (like /eos or /cvmfs) so the container can see them.
- Environment Agnostic: You can mix and match different containers in a single workflow, ensuring each step has the exact dependencies it needs.
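As a rough sketch of how these pieces fit together, the Python snippet below holds a rule with a container: directive and builds the matching command line. The image URI, the script name make_hist.py, and the file patterns are placeholders invented for this example. The flag names are the ones listed above; their availability may depend on the Snakemake version installed.

```python
import shlex
import textwrap

# Hypothetical rule: the image URI and script name are placeholders.
CONTAINER_RULE = textwrap.dedent("""\
    rule histogram:
        input:
            "skim/{sample}.root"
        output:
            "hist/{sample}.root"
        container:
            "docker://example.org/analysis-image:latest"
        shell:
            "python make_hist.py {input} {output}"
    """)


def snakemake_command(target, bind_paths, cores=1):
    """Build a command line that enables containers and binds storage paths."""
    bind_args = " ".join(f"--bind {p}" for p in bind_paths)
    return [
        "snakemake", "--cores", str(cores),
        "--use-apptainer",
        "--apptainer-args", bind_args,
        target,
    ]


cmd = snakemake_command("hist/ttbar.root", ["/eos", "/cvmfs"])
print(shlex.join(cmd))
```

The bind options are passed as a single quoted argument so that they reach Apptainer unchanged rather than being parsed by Snakemake.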
Bonus: The CMSDAS Challenge
- Integration: You can wrap almost any existing script in Snakemake, provided the Input/Output filenames are predictable.
- Determinism: If a script produces random timestamps or unique IDs in filenames, you must “patch” it to ensure Snakemake can track the files.
- Hybrid Environments: While container: is preferred, you can explicitly call apptainer exec inside a shell block when you need complex environment sourcing (like cmsenv).
- Orchestration: Snakemake can seamlessly connect completely different software stacks (e.g., Python/Coffea and C++/ROOT/Combine) into a single reproducible pipeline.
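The Determinism point can be illustrated with a minimal sketch of the kind of patch involved. Both functions are hypothetical and stand in for whatever naming logic an existing script uses: the first embeds a timestamp in the filename, the second derives the name only from its arguments so that it matches a pattern such as hist/{sample}.root.

```python
import time
from pathlib import Path


def timestamped_output(outdir, sample):
    """Unpatched behaviour (hypothetical): the filename changes on every run."""
    return Path(outdir) / f"{sample}_{int(time.time())}.root"


def deterministic_output(outdir, sample):
    """Patched behaviour: the filename depends only on the arguments."""
    return Path(outdir) / f"{sample}.root"


# The patched name is the same on every call, so a rule can declare it
# as an output and Snakemake can check whether it already exists.
first = deterministic_output("hist", "ttbar")
second = deterministic_output("hist", "ttbar")
print(first == second)  # True
```

An alternative with the same effect is to let the script accept the output path as a command-line argument, so that the workflow, not the script, decides the filename.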