Chaining Rules (The DAG)
Last updated on 2026-02-17
Estimated time: 30 minutes
Overview
Questions
- How does Snakemake connect different rules?
- What is a DAG?
- How does Snakemake know what to re-run?
Objectives
- Connect two rules by matching input/output filenames.
- Use the rule all convention.
- Observe “Lazy Execution” in action.
Thinking Backwards
The most difficult paradigm shift when learning Snakemake is that you stop writing imperative instructions (Step A, then Step B) and start writing declarative goals.
In a Bash script, you say: “Run the skimmer. Then run the plotter.”
In Snakemake, you say: “I want the plot. To get the plot, I need the skimmed file. To get the skimmed file, I need the raw data.”
Snakemake determines the dependencies automatically by matching
filenames. If Rule A outputs file.txt and Rule B takes
file.txt as input, Snakemake knows Rule A must run first.
This chain of dependencies is called a DAG (Directed
Acyclic Graph).
Activity: Extending the Analysis
We have a rule that skims data. Now we want to count the events in that skimmed file.
Add this second rule to your Snakefile
(below the first one):
PYTHON
rule count_events:
    input:
        "skimmed_data.txt"
    output:
        "counts.txt"
    shell:
        "wc -l {input} > {output}"
Crucial Link: Notice that the input of
count_events matches the output of
skim_data. This is how Snakemake builds the
Directed Acyclic Graph (DAG).
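You can also ask Snakemake to show you the graph it builds. A minimal sketch, assuming Graphviz’s dot tool is installed alongside the pixi environment used in this lesson:

```shell
# Emit the DAG in Graphviz format and render it to an image.
# Assumes `dot` (from Graphviz) is available on your PATH.
pixi run snakemake --dag counts.txt | dot -Tpng > dag.png
```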
Running the Chain
Ask for the final result:
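Using the same pixi invocation as the rest of this episode, request the final file as the target:

```shell
# Ask for counts.txt; Snakemake works out every upstream step itself.
pixi run snakemake --cores 1 counts.txt
```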
Snakemake realizes:
- You want counts.txt.
- count_events can produce it, but it needs skimmed_data.txt.
- skim_data can produce skimmed_data.txt.
Plan: Run skim_data -> Run count_events.
The rule all Convention
By default, Snakemake runs the first rule it sees if
you don’t specify a file. To avoid typing counts.txt every
time, we add a “dummy” rule at the very top.
Add this to the top of your Snakefile:
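A minimal sketch of such a rule, assuming counts.txt is your final target:

```python
# "Dummy" target rule: it produces nothing itself, it only
# declares which file(s) a default run should build.
rule all:
    input:
        "counts.txt"
```

Because all is now the first rule in the file, its inputs become the default targets.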
Now you can simply run:
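With the all rule in place, the plain invocation is enough:

```shell
# No target needed: Snakemake builds the inputs of the first rule (all).
pixi run snakemake --cores 1
```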
Lazy Execution (The “Why”)
- Run pixi run snakemake --cores 1 again.
What happens?
OUTPUT
Assuming unrestricted shared filesystem usage.
host: xxxx
Building DAG of jobs...
Nothing to be done (all requested files are present and up to date).
- Modify the original raw data:
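For example, use touch to bump the file’s modification time:

```shell
# Update the modification time of the raw input file,
# making it newer than everything downstream.
touch raw_data.txt
```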
For macOS users: some students have reported that touch does not update the timestamp. In this case, you can delete the file and re-create it, or try changing its content.
- Run Snakemake again.
What happens?
When you first run the command, Snakemake checks if
counts.txt exists. Since it doesn’t, it calculates the
steps needed to create it. The second time you run the command,
Snakemake sees that counts.txt exists and is newer than its
inputs, so it does nothing.
When you “touch” raw_data.txt, you update its
modification time. Snakemake notices that an input
(raw_data.txt) is now strictly newer than the downstream
files (skimmed_data.txt and counts.txt). It
marks them as “stale” and re-runs the chain.
This is the crucial benefit for large analyses. If you had a workflow with 500 rules and you only modified the input for rule 499, Snakemake would not re-run rules 1 through 498. It selectively re-executes only the parts of the DAG that are affected by your change. In CMS terms: if you change a plotting style, you don’t have to re-run the N-tuplizer.
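You can preview which jobs Snakemake considers stale, without executing anything, by adding the dry-run flag:

```shell
# -n (--dry-run) prints the planned jobs without running them.
pixi run snakemake --cores 1 -n
```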
Key Points
- Declarative Workflows: Unlike bash scripts where you define the order of steps, in Snakemake you define the dependencies (inputs/outputs), and Snakemake figures out the order (DAG).
- The all Rule: It is conventional to include a rule named all at the top of the workflow to define the final targets of your analysis.
- Lazy Execution: Snakemake only re-runs a rule if the output file is missing or if the input files have changed (have a newer timestamp) since the last run.