© 2018 Strange Loop
Query systems are widely used among data scientists, but this style of real-time analysis is virtually unknown to particle physicists. Considering that thousands of physicists per collaboration share data, a query system would be a perfect fit.
Traditionally, particle physicists write one-off C++ programs to reduce sets of files into smaller sets of files. Some of us are exploring the use of SparkSQL, Impala, and Kudu, but our experience revealed a fundamental distinction: some data are flat tables of numbers, which can be processed quickly with cache-friendly strides over a columnar representation, but others are nested objects in arbitrary-length lists, processed without the same level of optimization.
However, storage formats like ROOT and Parquet prove that even nested data can be represented as contiguous arrays. The challenge is to perform calculations directly on this "shredded" data. To my surprise, there don't seem to be any compilers that translate operations on objects into operations on shredded arrays. So I began work on Femtocode.
Femtocode is a total functional language for fast queries on nested data. It uses a dependent type system to eliminate runtime errors and translates the nested-object view into columnar calculations. Even at this early stage, simple queries in Femtocode run 10,000 times faster than the equivalent in our current C++ frameworks, and having a query system dramatically simplifies the human effort of launching distributed calculations.
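To make the idea of computing on "shredded" data concrete, here is a minimal Python sketch. It is not Femtocode or the code Femtocode generates. It assumes a simple offsets-plus-content layout for lists of numbers, and the function names (`shred`, `sum_per_event_objects`, `sum_per_event_columnar`) are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple


def shred(events: List[List[float]]) -> Tuple[List[float], List[int]]:
    """Flatten arbitrary-length lists into one contiguous content array.

    offsets[i] and offsets[i + 1] mark where event i starts and stops
    inside content, so no nested objects need to be kept.
    """
    content: List[float] = []
    offsets: List[int] = [0]
    for event in events:
        content.extend(event)
        offsets.append(len(content))
    return content, offsets


def sum_per_event_objects(events: List[List[float]]) -> List[float]:
    """Object view: walk each nested list and add up its values."""
    return [sum(event) for event in events]


def sum_per_event_columnar(content: List[float], offsets: List[int]) -> List[float]:
    """Columnar view: the same sums, computed only from the flat arrays."""
    totals: List[float] = []
    for i in range(len(offsets) - 1):
        total = 0.0
        # Stride over the contiguous slice that belongs to event i.
        for j in range(offsets[i], offsets[i + 1]):
            total += content[j]
        totals.append(total)
    return totals


# Three events with 2, 0, and 3 values each (made-up numbers).
events = [[10.0, 20.5], [], [5.0, 7.5, 2.5]]
content, offsets = shred(events)
```

Both functions return the same per-event sums, but the columnar version never touches a nested object. It reads two flat arrays in order, which is the access pattern that makes cache-friendly processing possible. A compiler of the kind described above would aim to produce the second form automatically from a query written against the first.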
Jim was trained as a physicist with a Ph.D. from Cornell and has done a lot of physics analyses himself. He helped to commission the CMS experiment at the Large Hadron Collider before joining Open Data Group, a Big Data analytics consultancy. There, he developed the Portable Format for Analytics (PFA) and has since returned to high energy physics as part of DIANA-HEP, a project to introduce machine learning and Big Data techniques to high energy particle physicists.