Key Points

The Why: Analysis Reproducibility

Modern CMS analysis is too complex to be managed by memory or manual scripts.
Reproducibility is a productivity tool: it makes your own work easier to revise and update.
Capturing the Workflow (the “Glue”) is the final step in ensuring an analysis is truly reusable.

Declarative Workflows: Unlike bash scripts where you define the order of steps, in Snakemake you define the dependencies (inputs/outputs), and Snakemake figures out the order (DAG).
The all Rule: It is convention to include a rule named all at the top of the workflow to define the final targets of your analysis.
Lazy Execution: Snakemake only re-runs a rule if the output file is missing or if the input files have changed (have a newer timestamp) since the last run.

Wildcards: Use name in filenames to define a generic rule.
Constraints: Snakemake fills wildcards by looking at the output you requested and propagating that value to the input.
expand(): A Python function that generates a list of filenames from a pattern. It is commonly used in rule all to define the final targets.
Parallelism: With wildcards, Snakemake can run multiple independent jobs in parallel using the --cores flag.

DAG: A visual map of your analysis dependencies.
Dry-run (-n): Always perform a dry-run to verify the plan before executing.
Rule Graph: A simplified visualization showing the relationship between rules rather than individual files.

container:: A rule-level directive that specifies the Docker/Apptainer image to use.
–use-apptainer: The command-line flag required to enable container execution.
*–apptainer-args: Use this to bind external storage paths (like /eos or /cvmfs) so the container can see them.
Environment Agnostic: You can mix and match different containers in a single workflow, ensuring each step has the exact dependencies it needs.

Integration: You can wrap almost any existing script in Snakemake, provided the Input/Output filenames are predictable.
Determinism: If a script produces random timestamps or unique IDs in filenames, you must “patch” it to ensure Snakemake can track the files.
Hybrid Environments: While container: is preferred, you can explicitly call apptainer exec inside a shell block when you need complex environment sourcing (like cmsenv).
Orchestration: Snakemake can seamlessly connect completely different software stacks (e.g., Python/Coffea and C++/ROOT/Combine) into a single reproducible pipeline.