
Recently our team built a data pipeline: a few large inputs, a few large outputs, a lot of processing in between, and a lot of parallelization & work with large datasets needed. Essentially you could view the entire process as writing one very complex function.

We approached this by first outlining the procedure and specifying the types involved, then outlining functions from each type to the next. You could essentially think of our types as tables, so our outline was f1: t1 -> t2, f2: t2 -> t3, ..., fn: tn -> t_output (although not quite that linear).
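
To make that concrete, here is a minimal Scala sketch of the kind of outline we wrote; the stage types and the Table alias are invented for illustration, not our real schema:

  object PipelineOutline {
    // Table is a stand-in for whatever collection type actually backs
    // the tables (for us, a distributed dataset).
    type Table[A] = Seq[A]

    // Invented row types for three of the stages.
    case class Raw(id: Long, payload: String)
    case class Parsed(id: Long, fields: Map[String, String])
    case class Output(id: Long, score: Double)

    // The outline is just the chain of signatures; bodies come later.
    def f1(in: Table[Raw]): Table[Parsed] = ???
    def f2(in: Table[Parsed]): Table[Output] = ???

    // The whole pipeline is their composition; the compiler rejects any
    // stage whose output type doesn't match the next stage's input.
    def pipeline(in: Table[Raw]): Table[Output] = f2(f1(in))
  }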

This let us split up work on the different functions across a team of 6 people. Specifying the input and output types was basically enough to make sure the functions were correct most of the time, and baking the interfaces into types enforced by the compiler made it easy to refactor & coordinate on changes when necessary. Feedback when we made an error was generally immediately available, because IntelliJ would highlight the function that produced an output value of the wrong type, or the compiler would catch it.

In contrast, if we had relied primarily on unit tests to check the functions, that would have made coordination more difficult, refactoring harder, and would have required us to either generate or acquire test data to feed through each function. But this architecture let us successfully build out most of the logic even while we had no access to real data & a different team was working on data ingestion.



This is interesting. Do you mind being more specific: what was the data, how big was it, and how long did the functions run?

Assuming you are talking about real types and have something like

  f1 :: t1 -> t2
and

  f2 :: t2 -> t3
you suggest you were able to do

  g :: t1 -> t3
  g = f2 . f1
which works perfectly well, but is sometimes nontrivial for more complex functions, in particular if they are not pure (e.g. they do IO because the data is too big for memory), if you do some logging and house-keeping in between, or because of runtime behavior that might be hard to predict.

Does f1 consume all input before f2 can run? Is it "streamed-through", like in `sh`, e.g.

  $ find /home/foobar | grep hs$ | xargs wc -l
which is often done as an optimization?

I really like the concept and it works great, but I find it simpler to apply to smaller constructs, and I am still investigating how to apply it to more "business logic".


Almost all the functions are pure; the only impure functions we use read from or write to a data store. We use Apache Spark, which lets you write pure functions that operate on data too large to handle on a single box, and it overall works quite well. E.g. when we started designing this project we wrote something like:

  g = f5 . f4 . f3 . f2 . f1

where f1 reads from S3 and f5 writes to a DB. Then the implementation work mostly involved breaking these down further, e.g. f1 = h4 . h3 . h2 . h1, where only h1 is stateful and everything else is pure.
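
A rough Scala/Spark sketch of that shape (the bucket, schema, and DB connection details are all invented for illustration):

  import org.apache.spark.sql.{Dataset, SparkSession}

  object PipelineSketch {
    case class Event(id: Long, payload: String)
    case class Scored(id: Long, score: Double)

    val spark = SparkSession.builder().appName("pipeline").getOrCreate()
    import spark.implicits._

    // f1's stateful piece (h1) is the read itself; parsing etc. stay pure.
    def readEvents(path: String): Dataset[Event] =
      spark.read.json(path).as[Event]

    // A pure transform standing in for the middle of the pipeline.
    def score(events: Dataset[Event]): Dataset[Scored] =
      events.map(e => Scored(e.id, e.payload.length.toDouble))

    // f5: the stateful write at the very end.
    def writeScores(scores: Dataset[Scored]): Unit =
      scores.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://db-host/scores") // hypothetical DB
        .option("dbtable", "scores")
        .save()

    def main(args: Array[String]): Unit =
      writeScores(score(readEvents("s3a://example-bucket/events/")))
  }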

Spark is lazily evaluated, and in practice it will stream through many operations: f1 will generally not be done consuming the input by the time f2 starts, although sometimes we force it to for debugging purposes.
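
Forcing a stage to materialize before the next one runs is usually just a cache plus an action; a sketch reusing the invented names from above:

  // Force the read/parse stage to be fully computed before scoring runs,
  // so errors and skew show up attributed to the right stage.
  val events = readEvents("s3a://example-bucket/events/").cache()
  events.count()             // action: materializes the stage eagerly
  val scored = score(events) // scoring now starts from the cached result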

Lazy evaluation and the discarding of side effects make logging difficult, which is one of the downsides. There are various monitoring and debugging tools that help, but it's still definitely harder than the single-machine case.
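
As one small example of tooling that survives lazy evaluation, an accumulator can count interesting records as they stream through a transform (again a sketch using the invented names from above):

  // Rough monitoring via an accumulator; updates made inside a
  // transformation can over-count if tasks are retried, so treat the
  // value as approximate.
  val emptyPayloads = spark.sparkContext.longAccumulator("emptyPayloads")

  def scoreWithCount(events: Dataset[Event]): Dataset[Scored] =
    events.map { e =>
      if (e.payload.isEmpty) emptyPayloads.add(1)
      Scored(e.id, e.payload.length.toDouble)
    }
  // emptyPayloads.value is only meaningful after an action has run.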



