
Recently our team built a data pipeline: a few large inputs, a few large outputs, a lot of processing in between, and a lot of parallelization & work with large datasets needed. Essentially you could view the entire process as writing one very complex function.

We approached this by first outlining the procedure and specifying the types involved, then outlining functions from each type to the next. You could essentially think of our types as tables, so our outline was f1: t1 -> t2, f2: t2 -> t3, ..., fn: tn -> t_output (although not quite that linear).
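
To make that concrete, here is a minimal Scala sketch of the kind of outline we wrote; the stage types and the Table alias are invented for illustration, not our real schema:

  object PipelineOutline {
    // Table is a stand-in for whatever collection type actually backs
    // the tables (for us, a distributed dataset).
    type Table[A] = Seq[A]

    // Invented row types for three of the stages.
    case class Raw(id: Long, payload: String)
    case class Parsed(id: Long, fields: Map[String, String])
    case class Output(id: Long, score: Double)

    // The outline is just the chain of signatures; bodies come later.
    def f1(in: Table[Raw]): Table[Parsed] = ???
    def f2(in: Table[Parsed]): Table[Output] = ???

    // The whole pipeline is their composition; the compiler rejects any
    // stage whose output type doesn't match the next stage's input.
    def pipeline(in: Table[Raw]): Table[Output] = f2(f1(in))
  }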

This let us split up work on the different functions across a team of 6 people. Specifying the input and output types was basically enough to make sure the functions were correct most of the time, and baking the interfaces into types enforced by the compiler made it easy to refactor & coordinate on changes when necessary. Feedback when we made an error was generally immediately available, because IntelliJ would highlight the function that produced an output value of the wrong type, or the compiler would catch it.

In contrast, if we had relied primarily on unit tests to check the functions, that would have made coordination more difficult, refactoring harder, and would have required us to either generate or acquire test data to feed through each function. But this architecture let us successfully build out most of the logic even while we had no access to real data & a different team was working on data ingestion.



This is interesting. Do you mind being more specific: what was the data, how big was it, and how long did the functions run?

Assuming you are talking about real types and have something like

  f1 :: t1 -> t2
and

  f2 :: t2 -> t3
you suggest you were able to do

  g :: t1 -> t3
  g = f2 . f1
which works perfectly well, but is sometimes nontrivial for more complex functions, in particular if they are not pure (e.g. they do IO because the data is too big for memory), if you do some logging and house-keeping in between, or because of runtime behavior that might be hard to predict.

Does f1 consume all input before f2 can run? Is it "streamed-through", like in `sh`, e.g.

  $ find /home/foobar | grep hs$ | xargs wc -l
which is often done as an optimization?

I really like the concept and it works great, but I find it simpler to apply to smaller constructs, and I am still investigating how to apply it to more "business logic".


Almost all the functions are pure; the only impure functions we use read from or write to a data store. We use Apache Spark, which lets you write pure functions that operate on data too large to handle on a single box, and it overall works quite well. E.g. when we started designing this project we wrote something like:

  g = f5 . f4 . f3 . f2 . f1

where f1 reads from S3 and f5 writes to a DB. Then the implementation work mostly involved breaking these down further, e.g. f1 = h4 . h3 . h2 . h1, where only h1 is stateful and everything else is pure.
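
A rough Scala/Spark sketch of that shape (the bucket, schema, and DB connection details are all invented for illustration):

  import org.apache.spark.sql.{Dataset, SparkSession}

  object PipelineSketch {
    case class Event(id: Long, payload: String)
    case class Scored(id: Long, score: Double)

    val spark = SparkSession.builder().appName("pipeline").getOrCreate()
    import spark.implicits._

    // f1's stateful piece (h1) is the read itself; parsing etc. stay pure.
    def readEvents(path: String): Dataset[Event] =
      spark.read.json(path).as[Event]

    // A pure transform standing in for the middle of the pipeline.
    def score(events: Dataset[Event]): Dataset[Scored] =
      events.map(e => Scored(e.id, e.payload.length.toDouble))

    // f5: the stateful write at the very end.
    def writeScores(scores: Dataset[Scored]): Unit =
      scores.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://db-host/scores") // hypothetical DB
        .option("dbtable", "scores")
        .save()

    def main(args: Array[String]): Unit =
      writeScores(score(readEvents("s3a://example-bucket/events/")))
  }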

Spark is lazily evaluated, and in practice it will stream through many operations: f1 will generally not be done consuming the input by the time f2 starts, although sometimes we force it to for debugging purposes.
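
Forcing a stage to materialize before the next one runs is usually just a cache plus an action; a sketch reusing the invented names from above:

  // Force the read/parse stage to be fully computed before scoring runs,
  // so errors and skew show up attributed to the right stage.
  val events = readEvents("s3a://example-bucket/events/").cache()
  events.count()             // action: materializes the stage eagerly
  val scored = score(events) // scoring now starts from the cached result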

Lazy evaluation and the discarding of side effects make logging difficult, which is one of the downsides. There are various monitoring and debugging tools that help, but it's still definitely harder than the single-machine case.
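
As one small example of tooling that survives lazy evaluation, an accumulator can count interesting records as they stream through a transform (again a sketch using the invented names from above):

  // Rough monitoring via an accumulator; updates made inside a
  // transformation can over-count if tasks are retried, so treat the
  // value as approximate.
  val emptyPayloads = spark.sparkContext.longAccumulator("emptyPayloads")

  def scoreWithCount(events: Dataset[Event]): Dataset[Scored] =
    events.map { e =>
      if (e.payload.isEmpty) emptyPayloads.add(1)
      Scored(e.id, e.payload.length.toDouble)
    }
  // emptyPayloads.value is only meaningful after an action has run.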



