The Why: Analysis Reproducibility


  • Modern CMS analysis is too complex to be managed by memory or manual scripts.
  • Reproducibility is a productivity tool: it makes your own work easier to revise and update.
  • Capturing the Workflow (the “Glue”) is the final step in ensuring an analysis is truly reusable.

Introduction to Snakemake


  • A Snakefile defines the workflow.
  • A rule contains input, output, and a shell command.
  • You typically execute the workflow by asking for the output file you want; a rule name works as a target only when the rule has no wildcards.

Chaining Rules (The DAG)


  • Declarative Workflows: Unlike bash scripts where you define the order of steps, in Snakemake you define the dependencies (inputs/outputs), and Snakemake figures out the order (DAG).
  • The all Rule: It is convention to include a rule named all at the top of the workflow to define the final targets of your analysis.
  • Lazy Execution: Snakemake re-runs a rule only if its output file is missing or its input files have changed (have a newer timestamp) since the last run. Recent Snakemake versions can also re-run a job when the rule's code or parameters have changed.
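
The two ideas above (deriving the order from dependencies, and skipping up-to-date outputs) can be illustrated with a small toy model in plain Python. This is only a sketch of the concept, not Snakemake's own implementation; the file names and the RULES table are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical rules: each output file is mapped to the inputs it needs.
RULES = {
    "skim.root": ["data.root"],
    "hist.root": ["skim.root"],
    "plot.png": ["hist.root"],
}


def plan(target, rules):
    """Return the outputs to build, in dependency order, for one requested target."""
    graph = {}
    stack = [target]
    while stack:
        out = stack.pop()
        if out in rules and out not in graph:
            # Only inputs that are themselves produced by a rule become graph edges;
            # files such as data.root are treated as already existing sources.
            graph[out] = [i for i in rules[out] if i in rules]
            stack.extend(rules[out])
    return list(TopologicalSorter(graph).static_order())


def is_stale(output_mtime, input_mtimes):
    """Timestamp-only check: run if the output is missing (None) or older than an input."""
    return output_mtime is None or any(t > output_mtime for t in input_mtimes)


print(plan("plot.png", RULES))
```

Requesting plot.png is enough: the order skim, hist, plot is never written down anywhere, it follows from the declared inputs and outputs.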

Scaling with Wildcards


  • Wildcards: Use a placeholder such as {name} in filenames to define a generic rule.
  • Wildcard resolution: Snakemake fills wildcards by matching the output you requested against the rule's output pattern and propagating the matched value to the input.
  • expand(): A Python function that generates a list of filenames from a pattern. It is commonly used in rule all to define the final targets.
  • Parallelism: With wildcards, Snakemake can run multiple independent jobs in parallel using the --cores flag.
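
How a wildcard value travels from the requested output back to the input, and what expand() produces, can be mimicked in a few lines of plain Python. The sketch below is a simplified stand-in, not Snakemake's code: match_wildcards and expand_like are hypothetical helper names, the file patterns are invented, and each wildcard is assumed to appear only once per pattern.

```python
import re
from itertools import product


def match_wildcards(pattern, filename):
    """Match a pattern such as 'hists/{sample}.root' against a requested file.

    Returns a dict of wildcard values, or None if the file does not fit.
    """
    regex = ""
    for part in re.split(r"(\{\w+\})", pattern):
        if re.fullmatch(r"\{\w+\}", part):
            regex += f"(?P<{part[1:-1]}>.+)"  # wildcard becomes a named group
        else:
            regex += re.escape(part)  # literal text must match exactly
    m = re.fullmatch(regex, filename)
    return m.groupdict() if m else None


def expand_like(pattern, **values):
    """Stand-in for expand(): fill the pattern with every combination of values."""
    names = list(values)
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*(values[n] for n in names))
    ]


# The requested output fixes the wildcard, which then fills the input pattern.
wildcards = match_wildcards("hists/{sample}.root", "hists/ttbar.root")
needed_input = "skims/{sample}.root".format(**wildcards)

# The list a rule all would typically request.
targets = expand_like("hists/{sample}.root", sample=["ttbar", "dy"])
```

Because each entry in targets resolves to an independent job, a run with several cores (for example --cores 4) can process the samples side by side.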

Visualizing the Workflow


  • DAG: A visual map of your analysis dependencies.
  • Dry-run (-n): Always perform a dry-run to verify the plan before executing.
  • Rule Graph: A simplified visualization showing the relationship between rules rather than individual files.
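
The difference between the full DAG and the rule graph can be shown with a toy example: the DAG has one node per job (rule plus wildcard values), while the rule graph collapses all jobs of the same rule into one node. The job list below is hypothetical, and the helpers are an illustration only, not how Snakemake builds its graphs.

```python
# Hypothetical job-level edges: (rule, sample) -> (rule, sample).
JOB_EDGES = [
    (("skim", "ttbar"), ("hist", "ttbar")),
    (("skim", "dy"), ("hist", "dy")),
    (("hist", "ttbar"), ("plot", "all")),
    (("hist", "dy"), ("plot", "all")),
]


def rule_graph(job_edges):
    """Collapse job-level edges into unique rule-level edges."""
    return sorted({(src[0], dst[0]) for src, dst in job_edges})


def to_dot(edges):
    """Render (source, destination) edges as Graphviz DOT text."""
    lines = ["digraph workflow {"]
    lines += [f'    "{a}" -> "{b}";' for a, b in edges]
    lines.append("}")
    return "\n".join(lines)


print(to_dot(rule_graph(JOB_EDGES)))
```

On the command line, the equivalent pictures are usually produced with snakemake --dag | dot -Tpng > dag.png and snakemake --rulegraph | dot -Tpng > rulegraph.png, assuming Graphviz is installed.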

Containerized Execution


  • container:: A rule-level directive that specifies the Docker/Apptainer image to use.
  • --use-apptainer: The command-line flag required to enable container execution.
  • --apptainer-args: Use this to bind external storage paths (like /eos or /cvmfs) so the container can see them.
  • Environment Agnostic: You can mix and match different containers in a single workflow, ensuring each step has the exact dependencies it needs.
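
As a sketch of how the flags above combine, the helper below assembles a Snakemake command line with container execution enabled and storage paths bound. The function name and the target file are hypothetical; the flag spellings are the ones given in this section and may differ between Snakemake versions, so check snakemake --help for your installation.

```python
import shlex


def container_run_cmd(targets, binds=(), cores=1):
    """Build the argument list for a containerized Snakemake run."""
    cmd = ["snakemake", "--cores", str(cores), "--use-apptainer"]
    if binds:
        # Apptainer accepts a comma-separated list of paths after --bind.
        cmd += ["--apptainer-args", "--bind " + ",".join(binds)]
    return cmd + list(targets)


cmd = container_run_cmd(["plot.png"], binds=["/eos", "/cvmfs"], cores=4)
print(shlex.join(cmd))  # copy-pasteable form with correct shell quoting
```

The list form can be passed directly to subprocess.run, which avoids quoting problems with the space inside the --apptainer-args value.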

Bonus: The CMSDAS Challenge


  • Integration: You can wrap almost any existing script in Snakemake, provided the Input/Output filenames are predictable.
  • Determinism: If a script produces random timestamps or unique IDs in filenames, you must “patch” it to ensure Snakemake can track the files.
  • Hybrid Environments: While container: is preferred, you can explicitly call apptainer exec inside a shell block when you need complex environment sourcing (like cmsenv).
  • Orchestration: Snakemake can seamlessly connect completely different software stacks (e.g., Python/Coffea and C++/ROOT/Combine) into a single reproducible pipeline.
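
For the determinism point, one alternative to editing the script itself is a thin wrapper step that renames the unpredictably named file to a fixed name right after the script finishes. The sketch below assumes the script writes exactly one matching file per run; the helper name and the file names in the usage comment are hypothetical.

```python
from pathlib import Path


def pin_output_name(workdir, glob_pattern, fixed_name):
    """Rename the single file matching glob_pattern to a predictable name."""
    matches = sorted(Path(workdir).glob(glob_pattern))
    if len(matches) != 1:
        # Zero or several candidates means the run is ambiguous; fail loudly
        # instead of letting the workflow track the wrong file.
        raise RuntimeError(f"expected exactly one match, found {len(matches)}")
    target = Path(workdir) / fixed_name
    matches[0].replace(target)
    return target


# Example: pin_output_name("results", "limits_*.root", "limits.root")
```

The rule can then declare the fixed name as its output, so Snakemake always knows which file to look for.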