Summary and Schedule

In this tutorial, we will move beyond the limitations of manual Bash scripts and build reproducible, scalable, and portable analysis workflows using Snakemake. We will start by installing the necessary tools with Pixi, learn the “grammar” of Snakemake rules, and run jobs inside isolated containers. Finally, we will put it all together by converting a real CMSDAS analysis into a fully automated pipeline that runs on your laptop just as easily as it does on the grid. By the end of this session, you will have the skills to turn your complex physics ideas into robust, one-command workflows.

Setup Instructions

Download files required for the lesson

00h 00m

1. The Why: Analysis Reproducibility

Why do we need workflow orchestration in CMS?
What are the three pillars of a reusable analysis?
Who is the primary beneficiary of a reproducible workflow?

00h 10m

2. Introduction to Snakemake

What is a Snakefile?
How do I define a single processing step?
How do I execute a rule?

00h 30m

3. Chaining Rules (The DAG)

How does Snakemake connect different rules?
What is a DAG?
How does Snakemake know what to re-run?

01h 00m

4. Scaling with Wildcards

How can I use one rule to process multiple different samples?
What is a wildcard and how does Snakemake “fill” it?
How do I tell Snakemake to generate a list of all my target files?

01h 30m

5. Visualizing the Workflow

How can I see the dependencies between my rules?
What is a Directed Acyclic Graph (DAG)?
How do I preview what Snakemake intends to do?

01h 50m

6. Containerized Execution

How do I run specific steps of my analysis in a controlled environment?
How can I use CMSSW or specific Python versions without installing them locally?
How does Snakemake handle Apptainer/Singularity?

02h 20m

7. Bonus: The CMSDAS Challenge

How do I integrate existing CMS analysis repositories into Snakemake?
How do I handle scripts that produce non-deterministic outputs (timestamps)?
How do I chain different software environments (Coffea → Combine)?

03h 20m

Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.

Basic Knowledge

This tutorial assumes you have a basic understanding of the following packages and concepts:

git: Version control system for tracking changes in code.
containers: Lightweight, portable environments for running applications (e.g., Docker, Singularity).
conda: Package and environment management system.
coffea: A Python library for high-energy physics data analysis. (You won’t program anything in coffea during this tutorial, but it’s good to know what it is.)

Setup and Installation

Most of this tutorial can be run on your local machine, but we will also demonstrate how to run a heavy-duty physics workflow on a remote cluster.

We will prepare our local environment using Pixi to manage Snakemake. The heavy-duty physics code will eventually run inside CMSSW, conda environments, or containers, but we still need a reliable “Orchestrator” on our laptops to manage the workflow.

Discussion

Why Pixi instead of Conda?

For years, conda (and mamba) have been the standard for HEP environment management. However, Pixi is a modern alternative built on the same foundations (conda-forge) but with several key advantages:

Speed: It is significantly faster at resolving dependencies than standard Conda.
Reproducibility: It creates a pixi.lock file automatically, ensuring that every student in this tutorial has the exact same version of every package.
Project-Centric: Pixi keeps dependencies local to your project folder rather than burying them in a global /envs/ directory.
Single Tool: It handles environment creation, package installation, and task execution (like make) in one binary.

More information about Pixi can be found in the Pixi documentation.

1. Installing Pixi

First, we need to install the pixi binary itself. Open your terminal and run the command appropriate for your system:

macOS and Linux:

BASH

curl -fsSL https://pixi.sh/install.sh | bash

(Note: You may need to restart your terminal or source your .bashrc / .zshrc after installation.)

Verify the installation:

BASH

pixi --version
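If the version check reports "command not found", the current shell most likely has not picked up the PATH change made by the installer. The following is a minimal sketch, assuming the installer used its default location, ~/.pixi/bin (adjust PIXI_BIN if you chose a custom install directory). It only affects the current terminal session.

```shell
# Assumption: Pixi was installed to its default location, ~/.pixi/bin.
PIXI_BIN="$HOME/.pixi/bin"

if ! command -v pixi >/dev/null 2>&1; then
    # Make pixi visible in the current shell session only
    export PATH="$PIXI_BIN:$PATH"
fi

# Print where pixi resolves, or a hint if it is still missing
command -v pixi || echo "pixi still not found; check that $PIXI_BIN exists"
```

If this makes pixi available, restarting the terminal or sourcing your shell startup file should make the change permanent.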

2. Creating the Project and Installing Snakemake

Now, we will initialize a new project directory and install Snakemake. This ensures that our workflow is self-contained.

BASH


# 1. Create a new directory for the tutorial
mkdir snakemake-cms-tutorial
cd snakemake-cms-tutorial

# 2. Initialize a pixi project
pixi init .
# Snakemake is hosted in the bioconda channel, so we need to add that channel to our project
pixi project channel add bioconda

# 3. Add Snakemake and Graphviz
# (Graphviz is used to visualize our workflow diagrams)
pixi add snakemake graphviz

3. Verification

To ensure everything is working correctly, we will run a simple command through the pixi environment. Pixi uses the run command to execute software inside the environment it just created.

BASH

pixi run snakemake --version

If you see a version number (e.g., 8.x.x), you are ready to go!
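A version number only shows that the executable starts. As an optional extra check, the sketch below writes a throwaway one-rule workflow and runs it on a single core. The file name Snakefile.smoke, the rule name hello, and the output hello.txt are all hypothetical and chosen only for this check. It assumes you run it from inside the project directory, and it skips the run if pixi or pixi.toml is not found.

```shell
# Write a throwaway workflow with one rule (all names here are hypothetical)
cat > Snakefile.smoke <<'EOF'
rule hello:
    output:
        "hello.txt"
    shell:
        "echo 'Snakemake works' > {output}"
EOF

# Run it only when pixi and a pixi project are available in this directory
if command -v pixi >/dev/null 2>&1 && [ -f pixi.toml ]; then
    pixi run snakemake --snakefile Snakefile.smoke --cores 1
    cat hello.txt
else
    echo "Skipping run: pixi or pixi.toml not found in $(pwd)"
fi
```

Snakemake leaves a hidden .snakemake directory behind. Once the check passes, it can be deleted together with Snakefile.smoke and hello.txt.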

Discussion

Check your environment

Look inside your snakemake-cms-tutorial folder. Can you find the pixi.toml file? Open it with a text editor and identify where snakemake is listed.
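If you prefer to stay in the terminal, the same check can be done with grep. This is a small sketch: find_dependency is a hypothetical helper name, not a Pixi command, and it assumes you run it from the project directory unless you pass a different manifest path.

```shell
# Hypothetical helper: print the numbered lines of a manifest that mention a package.
# Usage: find_dependency PACKAGE [MANIFEST]   (MANIFEST defaults to pixi.toml)
find_dependency() {
    fd_package="$1"
    fd_manifest="${2:-pixi.toml}"
    if [ ! -f "$fd_manifest" ]; then
        echo "No $fd_manifest here; are you in the project directory?" >&2
        return 1
    fi
    grep -n "$fd_package" "$fd_manifest"
}

# Look for snakemake in the manifest of the current directory
find_dependency snakemake || true
```

The same helper works for any other package you added, for example graphviz.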

If you want to use conda instead…

If you prefer to use conda instead of pixi, you can create a conda environment and install Snakemake there. However, keep in mind that this tutorial is written around Pixi, so you will need to adjust some commands accordingly (for example, dropping the "pixi run" prefix once the conda environment is activated).

BASH


# Create the environment:
conda create -c conda-forge -c bioconda -n snakemake snakemake

# Activate the environment:
conda activate snakemake

# Verify the installation:
snakemake --version