Summary and Schedule
In this tutorial, we will move beyond the limitations of manual Bash scripts and discover how to build reproducible, scalable, and portable analysis workflows using Snakemake. We will start by installing the necessary tools with Pixi, learn the “grammar” of Snakemake rules, and master the art of running jobs inside isolated containers. Finally, we will put it all together by converting a real CMSDAS analysis into a fully automated pipeline that runs on your laptop just as easily as it does on the grid. By the end of this session, you will have the skills to turn your complex physics ideas into robust, one-command workflows.
| Setup Instructions | Download files required for the lesson | |
| Duration: 00h 00m | 1. The Why: Analysis Reproducibility | Why do we need workflow orchestration in CMS? What are the three pillars of a reusable analysis? Who is the primary beneficiary of a reproducible workflow? |
| Duration: 00h 10m | 2. Introduction to Snakemake | What is a Snakefile? How do I define a single processing step? How do I execute a rule? |
| Duration: 00h 30m | 3. Chaining Rules (The DAG) | How does Snakemake connect different rules? What is a DAG? How does Snakemake know what to re-run? |
| Duration: 01h 00m | 4. Scaling with Wildcards | How can I use one rule to process multiple different samples? What is a wildcard and how does Snakemake “fill” it? How do I tell Snakemake to generate a list of all my target files? |
| Duration: 01h 30m | 5. Visualizing the Workflow | How can I see the dependencies between my rules? What is a Directed Acyclic Graph (DAG)? How do I preview what Snakemake intends to do? |
| Duration: 01h 50m | 6. Containerized Execution | How do I run specific steps of my analysis in a controlled environment? How can I use CMSSW or specific Python versions without installing them locally? How does Snakemake handle Apptainer/Singularity? |
| Duration: 02h 20m | 7. Bonus: The CMSDAS Challenge | How do I integrate existing CMS analysis repositories into Snakemake? How do I handle scripts that produce non-deterministic outputs (timestamps)? How do I chain different software environments (Coffea → Combine)? |
| Duration: 03h 20m | Finish | |
The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.
Basic Knowledge
This tutorial assumes you have a basic understanding of the following packages and concepts:
- git: Version control system for tracking changes in code.
- containers: Lightweight, portable environments for running applications (e.g., Docker, Singularity).
- conda: Package and environment management system.
- coffea: A Python library for high-energy physics data analysis. (You won’t program anything in coffea during this tutorial, but it’s good to know what it is.)
Setup and Installation
Most of this tutorial can be run on your local machine, but we will also demonstrate how to run a heavy-duty physics workflow on a remote cluster.
We will prepare our local environment using Pixi to manage Snakemake. While the heavy-duty physics code will eventually run inside CMSSW, conda environments, or containers, we need a reliable “Orchestrator” on our laptops to manage the workflow.
Why Pixi instead of Conda?
For years, conda (and mamba) has been the standard for HEP environment management. However, Pixi is a modern alternative built on the same foundations (conda-forge) but with several key advantages:
- Speed: It is significantly faster at resolving dependencies than standard Conda.
- Reproducibility: It creates a pixi.lock file automatically, ensuring that every student in this tutorial has the exact same version of every package.
- Project-Centric: Pixi keeps dependencies local to your project folder rather than burying them in a global /envs/ directory.
- Single Tool: It handles environment creation, package installation, and task execution (like make) in one binary.
More information about Pixi can be found in the Pixi documentation.
1. Installing Pixi
First, we need to install the pixi binary itself. Open
your terminal and run the command appropriate for your system:
macOS and Linux:
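A minimal sketch, assuming `curl` is available and that the installer script published at pixi.sh is the intended installation method:

```bash
# Download the Pixi installer script and run it with bash.
# Assumption: curl is installed and the machine has network access.
curl -fsSL https://pixi.sh/install.sh | bash
```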
(Note: You may need to restart your terminal or source your .bashrc / .zshrc after installation.)
Verify the installation:
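One way to check, assuming the installer has put the `pixi` binary on your `PATH`:

```bash
# Print the installed Pixi version.
# A "command not found" error usually means the shell has not reloaded its PATH yet.
pixi --version
```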
2. Creating the Project and Installing Snakemake
Now, we will initialize a new project directory and install Snakemake. This ensures that our workflow is self-contained.
BASH
# 1. Create a new directory for the tutorial
mkdir snakemake-cms-tutorial
cd snakemake-cms-tutorial
# 2. Initialize a pixi project
pixi init .
# Snakemake is hosted in the bioconda channel, so we need to add it to our project
pixi project channel add bioconda
# 3. Add Snakemake and Graphviz
# (Graphviz is used to visualize our workflow diagrams)
pixi add snakemake graphviz
3. Verification
To ensure everything is working correctly, we will run a simple command through the pixi environment. Pixi uses the run command to execute software inside the environment it just created.
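A likely form of this check, assuming it is run from inside the project directory created in the previous step:

```bash
# Ask the Snakemake installed in the project's Pixi environment for its version.
# Assumption: the current directory contains the pixi.toml created by "pixi init".
pixi run snakemake --version
```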
If you see a version number (e.g., 8.x.x), you are ready
to go!
Check your environment
Look inside your snakemake-cms-tutorial folder. Can you find the pixi.toml file? Open it with a text editor and identify where snakemake is listed.
If you prefer to use conda instead of pixi,
you can create a conda environment and install Snakemake there. However,
keep in mind that this tutorial is designed around Pixi’s workflow
management, so you may need to adjust some commands accordingly.
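As a rough sketch of the conda route, assuming conda is already installed; the environment name `snakemake-tutorial` is a hypothetical choice and any name works:

```bash
# Create an environment containing Snakemake and Graphviz.
# Snakemake comes from bioconda; its dependencies come from conda-forge.
# "snakemake-tutorial" is a hypothetical name used only for illustration.
conda create -y -n snakemake-tutorial -c conda-forge -c bioconda snakemake graphviz

# Activate the environment and confirm that Snakemake is available
conda activate snakemake-tutorial
snakemake --version
```

With this setup, the activated environment takes the place of `pixi run`, so a command written as `pixi run snakemake ...` would be run as `snakemake ...`.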