Tutorial #

This tutorial takes you step by step trough a pypipegraph2 toy example.

The very first pipegraph #

Let’s assume we have an input file ‘input.dat’, in which for reasons we want to count the lower case letters. We want to do this in a pipegraph, to properly keep track of the dependencies.

Here is input.dat

HelloWorld

And here’s the python function to count lower case letters. It’s a stand in for any kind of brittle bioinformatics software that transforms files.

import collections
def count_letters(input_string):
   counts = collections.Counter()
   for c in input_string:
       if c.islower():
           counts[c] += 1
       else:
           raise ValueError("invalid letter")
    return counts

Now in an ideal world, this would work:

import pypipegraph2 as ppg
import json
import collections


def count_letters(input_string):
    counts = collections.Counter()
    for c in input_string:
        if c.islower():
            counts[c] += 1
        else:
            raise ValueError("invalid letter")
    return counts


input_file = Path("input.dat")
ppg.new()
ppg.FileGeneratingJob(
    "output.dat",
    lambda of: of.write_text(json.dumps(count_letters(input_file.read_text()))),
).depends_on_file(input_file)
ppg.run()

But alas, the input file is not quite compatible with the count_letters function, as we see when we run our file wit python:

   0:00:00.125824s | Job failed  : 'output.dat'

	More details (stdout, locals) in .ppg/errors/2024-05-07_12-55-54.2642035/0_exception.txt
	Failed after 0.013s.

Exception: ValueError invalid letter
        Traceback (most recent call last):
        /home/tp/upstream/ppg2_rust/python/pypipegraph2/jobs.py":798, 
in run
                  795 
                  796                         try:
                > 797                             
self.generating_function(*input)
                  798                             stdout.flush()
                  799                             stderr.flush()
                  800                             # else:
        /home/tp/upstream/ppg2_rust/ex/shu.py":21, in <lambda>
                  18 ppg.FileGeneratingJob(
                  19     "output.dat",
                > 20     lambda of: of.write_text(json.dumps(count_letters(input_file.read_text()))),
                  21 ).depends_on_file(input_file)
                  22 ppg.run()
                  23 
        /home/tp/upstream/ppg2_rust/ex/shu.py":14, in count_letters
                  11             counts += 1
                  12         else:
                > 13             raise ValueError("invalid letter")
                  14 
                  15 
                  16 input_file = Path("input.dat")
        Exception (repeated from above): ValueError invalid letter
  0:00:00.146061s | At least one job failed
Traceback (most recent call last):
  File "/home/tp/upstream/ppg2_rust/ex/shu.py", line 23, in <module>
    ppg.run()
  File "/home/tp/upstream/ppg2_rust/python/pypipegraph2/__init__.py", line 135, in run
    return global_pipegraph.run(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/tp/upstream/ppg2_rust/python/pypipegraph2/graph.py", line 184, in run
    raise JobsFailed(do_raise[1], exceptions=do_raise[2])
pypipegraph2.exceptions.JobsFailed: At least one job failed

Transforming the input #

We need to transform the input file so it works with the count_letters function.

import pypipegraph2 as ppg
import json
import collections
from pathlib import Path



def count_letters(input_string):
    counts = collections.Counter()
    for c in input_string:
        if c.islower():
            counts[c] += 1
        else:
            raise ValueError("invalid letter")
    return counts


input_file = Path("input.dat")
ppg.new()


job_transform = ppg.FileGeneratingJob('transformed.dat', 
    lambda of: of.write_text(input_file.read_text().lower().strip())
).depends_on_file(input_file).self # depends_on_file returns a named tuple (invariant, self)

job_output = ppg.FileGeneratingJob(
    "output.dat",
    lambda of: of.write_text(json.dumps(count_letters(job_transform.files[0].read_text()))),
).depends_on(job_transform) # depends_on returns self.

ppg.run()

This time we get the expected output: no output on the console and a file output.dat containing:

{"h": 1, "e": 1, "l": 3, "o": 2, "w": 1, "r": 1, "d": 1}

Building intuition on recalculations #

At this point, you have a graph that looks like this:

graph TD; FileInvariant:input.dat-->FileGeneratingJob:transformed.dat; FileGeneratingJob:transformed.dat-->FileGeneratingJob:output.dat; FunctionInvariant:FItransformed.dat-->FileGeneratingJob:transformed.dat; FunctionInvariant:FIoutput.dat-->FileGeneratingJob:output.dat;

And you should investigate the following questions:

What happens if you change input.dat?
What happens when you change the function doing the transformation
Does anything happen when you change the function count_letters?
How do you get it to track changes in count_letters()

To investigate, you might want to add a ’tracking’ function that adds a counter to a file, as a side effect to your job functions (promote them from lambdas to real defs!)

def count(filename):
    try:
        count = int(Path(filename).read_text().strip())
    except:
        count = 0
    count += 1
    Path(filename).write_text(str(count))