Introduction to Snakemake
Last updated on 2026-02-17 | Edit this page
Estimated time: 20 minutes
Overview
Questions
- What is a Snakefile?
- How do I define a single processing step?
- How do I execute a rule?
Objectives
- Create a
Snakefile. - Write a valid Snakemake rule with
input,output, andshell. - Execute a specific target file.
Why Snakemake for CMS?
In CMS analysis, we rarely run a single script. We create an analysis chain: Skimming \(\rightarrow\) Processing \(\rightarrow\) Plotting \(\rightarrow\) Fitting. Traditionally, we managed this with “Mega-Bash” scripts or by manually submitting JDL files to HTCondor/Slurm.
Key Advantages
- Portability: Works on your laptop, LXPLUS, or the LPC with the same code.
- Environment Management: Seamless integration with Containers (Apptainer/Singularity) and Conda.
- Visualization: It automatically generates a “map” (Directed Acyclic Graph or DAG) of your analysis.
Snakemake vs. LAW (Luigi Analysis Workflow): A bias comparison
- Boilerplate: LAW is based on Luigi and requires significant Python boilerplate for every task. Snakemake uses a “Domain Specific Language” (DSL) that is much closer to plain English or a configuration file.
- The Learning Curve: Snakemake is widely used in Genomics and Data Science. If you search for a problem on StackOverflow, you’ll find an answer in seconds. LAW is more niche to the HEP community.
- Automation: Snakemake is file-based. It looks at the “last modified” timestamp of your files. If you change your plotting script, it won’t re-run the 5-hour skimming step unless you ask it to.
- Workflow Isolation: All your code can run independently of Snakemake. You can think of Snakemake as a “workflow manager” that orchestrates your existing scripts (like a bash script on steroids). LAW requires you to write your tasks as Luigi Tasks, which can be more cumbersome.
Credits and Documentation
Snakemake is an open-source project created by Johannes Köster (University of Duisburg-Essen) in 2012. While this tutorial focuses on CMS-specific use cases, the official documentation is comprehensive and covers thousands of features.
- Documentation: https://snakemake.github.io
- Citation: If you use Snakemake in your analysis, proper citation is expected in our field:
Köster, Johannes and Rahmann, Sven. “Snakemake—a scalable bioinformatics workflow engine”. Bioinformatics, 2012.
The Essentials
To run a workflow, you need two things:
-
The Snakefile: A file named
Snakefile(capital S, no extension) where you define your rules. It uses a Python-based syntax, so you can write standard Python code inside it alongside your rules. -
The Execution Command: You run the workflow from
the terminal by invoking
snakemake.
The basic command structure is:
-
--cores <N>: Tells Snakemake how many CPU cores to use. (e.g.,--cores 1for sequential execution,--cores 4for parallel). -
<target_file>: The file you want to generate. Snakemake will figure out the steps to get there. - If you dont specify a snakefile, Snakemake will look for a file
named
Snakefilein the current directory. You can specify a different file with-s <filename>.
The Anatomy of a Rule
The building block of Snakemake is the rule. Think of it as a recipe. To make a dish, you need ingredients (inputs), a kitchen (the environment), and a set of instructions (the shell command).
PYTHON
rule skim_data:
input:
"raw_data.txt"
output:
"skimmed_data.txt"
shell:
"grep 'Signal' {input} > {output}"
Important Properties:
- Rule Name: Must be unique.
- Input/Output: These are strings (or lists of strings). Snakemake uses these to “connect the dots” between rules.
-
Shell: The bash command to execute. Notice the
{input}and{output}placeholders—Snakemake automatically fills these in with the paths you defined above.
Understanding the Syntax
A Snakefile is essentially a Python script with some
extra keywords added. This is powerful because it means you can use
Python variables, functions, and libraries directly in your workflow
definition.
-
Indentation matters: Just like in Python, Snakemake
uses indentation (usually 4 spaces) to group code blocks. The
input,output, andshelldirectives must be indented relative to therule. -
Strings: File paths and commands are strings, so
they must be enclosed in quotes (
"..."or'...'). -
Comments: Use
#for comments, just like in Python/Bash. - Lists: If a rule has multiple inputs or outputs, you can list them with commas:
Running the Rule
- Create dummy data:
Create a
Snakefilewith the content shown above.Run Snakemake. We must tell it what file we want to generate:
BASH
pixi run snakemake --cores 1 skimmed_data.txt
# if you run snakemake from a conda environment or pip, use:
# snakemake --cores 1 skimmed_data.txt
If successful, you will see Finished jobid: 0.
OUTPUT
Assuming unrestricted shared filesystem usage.
host: gluon
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job count
--------- -------
skim_data 1
total 1
Select jobs to execute...
Execute 1 jobs...
[Sun Feb 8 12:05:08 2026]
localrule skim_data:
input: raw_data.txt
output: skimmed_data.txt
jobid: 0
reason: Missing output files: skimmed_data.txt
resources: tmpdir=/tmp
[Sun Feb 8 12:05:08 2026]
Finished jobid: 0 (Rule: skim_data)
1 of 1 steps (100%) done
Syntax Error Hunt
Intentionally break your indentation (remove a space before
input:). Run the command again.
What error does Snakemake give you?
This IndentationError is the most common error you will
encounter.
- A
Snakefiledefines the workflow. - A
rulecontainsinput,output, and ashellcommand. - You execute the workflow by asking for the output file, not the rule name.