mPyPl

Monadic Pipeline Library for Python

The main goal of mPyPl is to let data-processing tasks in Python be expressed in a functional style. It uses the pipe syntax provided by the Pipe package and augments it with named pipelines.
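To make the pipe syntax concrete, here is a minimal sketch of how a `|`-based pipeline can be built in plain Python. The `Pipe` class and the `select`, `where` and `as_list` stages below are written from scratch for illustration; they are assumptions about the general technique, not the actual implementation of the Pipe package or of mPyPl.

```python
class Pipe:
    """Illustrative stand-in for a pipeable function (not the Pipe package itself)."""

    def __init__(self, func):
        self.func = func

    def __ror__(self, iterable):
        # `iterable | stage` ends up here because the left operand
        # (a range, generator, list...) does not know how to `|` with a Pipe.
        return self.func(iterable)

    def __call__(self, *args, **kwargs):
        # Calling a stage with arguments returns a new, parameterized stage.
        return Pipe(lambda iterable: self.func(iterable, *args, **kwargs))


@Pipe
def select(iterable, func):
    # Lazily map `func` over the stream.
    return (func(x) for x in iterable)


@Pipe
def where(iterable, predicate):
    # Lazily keep only the items that satisfy `predicate`.
    return (x for x in iterable if predicate(x))


@Pipe
def as_list(iterable):
    # Terminal stage: consume the stream into a list.
    return list(iterable)


squares_of_evens = (
    range(10)
    | where(lambda x: x % 2 == 0)
    | select(lambda x: x * x)
    | as_list
)
print(squares_of_evens)
```

Each stage receives the stream produced by the stage on its left, so the pipeline reads top to bottom in the order the data flows.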

Pandas is often used for data-processing tasks. Its central concept is the DataFrame, which holds data in tabular form. New features are derived from the existing data as computed columns.

In mPyPl, a data stream is represented by a generator, which can load data from disk on demand. Data transformations are described by applying lazily evaluated functions to those streams. Each stream typically consists of dictionary-like objects (called mdicts) that contain named fields, and new features can be computed and stored in those fields.
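The idea of a lazily evaluated stream of records can be sketched with ordinary generators and plain dictionaries. The names `numbers_stream` and `apply_field` are hypothetical helpers invented for this illustration, and a plain `dict` stands in for an mdict, which may behave differently in the real library.

```python
def numbers_stream(n):
    # A generator: records are produced one at a time, on demand.
    for i in range(n):
        yield {"value": i}


def apply_field(stream, src, dst, func):
    # Lazily compute a new field `dst` from the existing field `src`.
    for record in stream:
        record[dst] = func(record[src])
        yield record


calls = []  # records every invocation so laziness is observable


def square(x):
    calls.append(x)
    return x * x


stream = apply_field(numbers_stream(3), "value", "square", square)
calls_before = list(calls)   # nothing has been computed yet

first = next(stream)         # pulling one record triggers exactly one computation
calls_after = list(calls)

print(calls_before, first, calls_after)
```

Building the pipeline does no work by itself; computation happens only when a record is requested, which is what allows large datasets to be processed without loading them fully into memory.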

Core Concepts

mPyPl is based on three main ideas:

The main advantage of this approach is the ability to create pipelines that combine several streams of data together.
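One way to picture combining streams is to zip several generators of records and merge their fields. This is only a sketch in plain Python: `merge_streams`, `filenames` and `labels` are hypothetical names made up for the example and do not correspond to any particular mPyPl function.

```python
def merge_streams(*streams):
    """Zip several record streams, merging the fields of aligned records."""
    for records in zip(*streams):
        merged = {}
        for record in records:
            merged.update(record)
        yield merged


def filenames():
    for name in ["a.jpg", "b.jpg"]:
        yield {"filename": name}


def labels():
    for label in ["cat", "dog"]:
        yield {"label": label}


combined = list(merge_streams(filenames(), labels()))
print(combined)
```

Because every record carries named fields, the merged stream can be passed to later stages that refer to `filename` and `label` by name, regardless of which source stream each field came from.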

Quickstart

Consider a simple example: we have a number of .jpg files in a directory, and we want to imprint each file's modification date on top of the image, producing a result similar to the date-stamped photographs made by some old cameras. This can be accomplished using the following code:

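import mPyPl as mp

# imread, get_date and imprint are not defined in this snippet;
# they are assumed to be imported or defined elsewhere.
images = (
 mp.get_files('images', ext='.jpg')              # .jpg files in the 'images' directory
 | mp.as_field('filename')                        # wrap each path in a record with a 'filename' field
 | mp.apply('filename', 'image', lambda x: imread(x))   # load the image
 | mp.apply('filename', 'date', get_date)         # look up the file's date
 | mp.apply(['image', 'date'], 'result', lambda x: imprint(x[0], x[1]))  # stamp the date on the image
 | mp.select_field('result')                      # keep only the 'result' field
 | mp.as_list)                                    # run the pipeline and collect the results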
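The helper `get_date` used in the pipeline is not shown in the original. A minimal sketch, assuming it takes a file name and returns the file's last-modification date as a string, could look like this (the `YYYY-MM-DD` format is an arbitrary choice for the example):

```python
import os
from datetime import datetime


def get_date(filename):
    """Return the file's last-modification date as a 'YYYY-MM-DD' string."""
    mtime = os.path.getmtime(filename)  # seconds since the epoch
    # Convert to local time and format; the format is an assumption.
    return datetime.fromtimestamp(mtime).strftime("%Y-%m-%d")
```

`imread` and `imprint` would typically come from, or be written with, an image-processing library; they are left out here because the original does not say which one is used.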

Let's go over the pipeline line by line:

Video Tutorial + Hands On

If you would like to see mPyPl in action, have a look at the tutorial video. You can also follow the same steps using this notebook in Azure: just sign in with a Microsoft Account, clone it, and experiment!

You can also watch a short 3-minute demo.