Strange Loop

2009 - 2023

St. Louis, MO

Expressing complex data aggregations with Histogrammar

Since the 1970s, data analysis in high-energy physics has revolved around the histogram: an array of integers approximating a distribution. Fortran codes drawing ASCII-art plots bear a strong resemblance to the analysis scripts that discovered the Higgs boson: imperative for-loops filling histogram objects.

If this sounds cumbersome, it is. Explicit for-loops must be manually edited to add concurrency. However, the histogram concept itself is powerful: many data visualizations can be constructed by cleverly filling suites of related histograms, adding them, subtracting them, and dividing them bin by bin.

Today, high-energy physics analysis is colliding with tools from the Big Data community. In my work with physicists adopting Apache Spark, I've found that histograms can benefit from a functional style, accepting fill rules as lambda functions, and they can be subdivided into more fundamental units.
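To make the functional style concrete, here is a minimal Python sketch, not Histogrammar's actual API. The class names (`Count`, `Average`, `Bin`), their parameters, and the toy `events` records are all assumptions chosen for illustration. The fill rule is passed in as a lambda, and a histogram is nothing more than a `Bin` whose cells are `Count` aggregators; swapping the cell type for `Average` turns the same structure into a profile plot.

```python
import math

class Count:
    """Most basic unit: sums the weights of the data it has seen."""
    def __init__(self):
        self.entries = 0.0

    def fill(self, datum, weight=1.0):
        self.entries += weight


class Average:
    """Tracks the weighted mean of a quantity extracted by a fill rule."""
    def __init__(self, quantity):
        self.quantity = quantity      # fill rule: datum -> number
        self.entries = 0.0
        self.total = 0.0

    def fill(self, datum, weight=1.0):
        self.entries += weight
        self.total += weight * self.quantity(datum)

    @property
    def mean(self):
        return self.total / self.entries if self.entries else math.nan


class Bin:
    """Splits [low, high) into equal-width cells, each holding a sub-aggregator."""
    def __init__(self, num, low, high, quantity, make_value=Count):
        self.num, self.low, self.high = num, low, high
        self.quantity = quantity      # fill rule: datum -> number
        self.values = [make_value() for _ in range(num)]

    def fill(self, datum, weight=1.0):
        x = self.quantity(datum)
        if self.low <= x < self.high:
            index = int(self.num * (x - self.low) / (self.high - self.low))
            index = min(index, self.num - 1)   # guard against float rounding
            self.values[index].fill(datum, weight)


# Hypothetical event records; the field names are made up for this example.
events = [
    {"pt": 12.0, "eta": 0.5},
    {"pt": 37.5, "eta": -1.2},
    {"pt": 41.0, "eta": 2.0},
    {"pt": 88.0, "eta": 0.1},
]

# A plain histogram of pt: Bin of Count.
hist = Bin(4, 0.0, 100.0, lambda e: e["pt"])

# A profile of eta versus pt: same binning, but each cell averages eta.
profile = Bin(4, 0.0, 100.0, lambda e: e["pt"],
              make_value=lambda: Average(lambda e: e["eta"]))

for event in events:
    hist.fill(event)
    profile.fill(event)

counts = [cell.entries for cell in hist.values]
```

The loop over `events` is the only imperative part left, and because the aggregators do not care who drives them, that loop could equally be a fold over a distributed collection.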

In fact, all the tricks for building complex data visualizations with histograms can be formalized as a grammar of "aggregator monoids." These aggregators are associative for easy concurrency and simple enough to implement in many languages.
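The monoid property can be illustrated with a short Python sketch. This is an assumption-laden toy, not Histogrammar's implementation: the `Histogram` class, `zero`, and `fill_partition` are hypothetical names. An aggregator is a monoid when it has an empty value (`zero`) and an associative combine (`+`), so partial results computed on separate partitions can be merged in any grouping and still agree with a single sequential pass.

```python
from functools import reduce

class Histogram:
    """Fixed-binning histogram with integer counts (illustrative only)."""
    def __init__(self, num, low, high, counts=None):
        self.num, self.low, self.high = num, low, high
        self.counts = list(counts) if counts is not None else [0] * num

    def fill(self, x):
        if self.low <= x < self.high:
            index = int(self.num * (x - self.low) / (self.high - self.low))
            self.counts[min(index, self.num - 1)] += 1

    def __add__(self, other):
        # Combining is only meaningful for identically binned histograms.
        if (self.num, self.low, self.high) != (other.num, other.low, other.high):
            raise ValueError("cannot add histograms with different binning")
        summed = [a + b for a, b in zip(self.counts, other.counts)]
        return Histogram(self.num, self.low, self.high, summed)


def zero():
    """Identity element: an empty histogram with the shared binning."""
    return Histogram(5, 0.0, 10.0)


def fill_partition(values):
    """What one worker would do with its slice of the data."""
    hist = zero()
    for x in values:
        hist.fill(x)
    return hist


# Three partitions standing in for data spread across a cluster.
partitions = [[1.0, 2.5, 9.9], [4.2, 4.8], [7.1, 0.3, 5.5, 6.0]]
partials = [fill_partition(p) for p in partitions]

# Associativity: the grouping of the merges does not matter.
left = (partials[0] + partials[1]) + partials[2]
right = partials[0] + (partials[1] + partials[2])

# Merging partial results matches a single sequential pass over all data.
merged = reduce(lambda a, b: a + b, partials, zero())
sequential = fill_partition([x for p in partitions for x in p])
```

The same three pieces (an empty value, a per-datum fill, and a combine) are what Spark's `RDD.aggregate(zeroValue, seqOp, combOp)` asks for, which is why aggregators of this shape parallelize without any change to the analysis code.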

In this talk, I'll show you how to use Histogrammar, a lightweight, cross-language suite of histogramming primitives. Examples will include cluster-wide histogramming in Spark and tapping intermediate values in a GPU computation. I believe that the Big Data community can learn as much from physicists as physicists are learning from Big Data.

Jim Pivarski

Jim trained as a physicist, earning a Ph.D. from Cornell, and has performed many physics analyses himself. He helped to commission the CMS experiment at the Large Hadron Collider before joining Open Data Group, a Big Data analytics consultancy. There, he developed the Portable Format for Analytics (PFA). He has since returned to high-energy physics as part of DIANA-HEP, a project to introduce machine learning and Big Data techniques to high-energy physicists.