If you're working with Histogrammar in Python and want to fill your histograms with Numpy arrays, read this page.
Author: Jim Pivarski
Preliminaries
This tutorial assumes familiarity with using Histogrammar in Python. It only shows how to accelerate your histogram filling with Numpy.
It should work with Histogrammar version 1.0.4 and above.
Quick example
If you want to fill your histograms with data from a Numpy array named arr, you might be tempted to do this:
from histogrammar import *
histogram = Bin(100, 0, 10, lambda x: x**2) # or any other aggregator
for x in arr:
    histogram.fill(x)
It would be much faster if you instead did this:
histogram.fill.numpy(arr)
The result would be the same (histogram is now filled), but depending on the complexity of histogram, the outcome could be achieved ten to a hundred times sooner.
Due to duck typing and Numpy's overloaded operators, the lambda x: x**2 that transforms scalar values in the first case transforms whole Numpy arrays in the second case.
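For instance, the same function body produces a scalar or an array depending on what you pass it (a minimal illustration, independent of Histogrammar):

import numpy

f = lambda x: x**2

print(f(3.0))                       # scalar in, scalar out: 9.0
print(f(numpy.array([1.0, 2.0])))   # array in, array out: [1. 4.]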
Numpy overview and motivation
Numpy is fast because it does less work: its arrays have a fixed size and a fixed type, such as all 32-bit integers or all 64-bit floats. Unlike Python’s for loop, Numpy’s internal loops do not have to check the type of each item before each operation. Also, its arrays are contiguous in memory with some attempt to stride them for cache efficiency, which is the overriding performance consideration for modern CPUs. (Memory access is the bottleneck in modern CPUs, not computation, so good cache utilization is essential.) Python references, even for primitive types like integers and floating-point numbers, are pointers that could reside anywhere in memory.
(Of course it's possible to thwart this efficiency: a Numpy array with dtype=object operates on Python objects and does not provide as much of a speed-up.)
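You can see this for yourself with a quick timing comparison (the exact numbers are machine-dependent):

import timeit
import numpy

fast = numpy.arange(1000000, dtype=numpy.float64)
slow = fast.astype(object)   # same values, but boxed as Python objects

print(timeit.timeit(lambda: fast**2, number=10))   # fast, fixed-type loop
print(timeit.timeit(lambda: slow**2, number=10))   # much slower object loop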
The downside is that converting a pure Python analysis to Numpy requires you to fundamentally rethink your algorithm in many cases. In pure Python, for loops usually contain more than one operation, but each Numpy operation has a single for loop embedded within it. Let's start with an easy example to illustrate this. You have two Python lists, xlist and ylist, and you want to find the Euclidean distance of each x, y pair from the origin.
import math

result = []
for x, y in zip(xlist, ylist):
    result.append(math.sqrt(x**2 + y**2))
This is, I believe, the most “natural” way to think about it. You're thinking about events independently (an x, y pair is like an “event”), and you write down what you want to do to each of them. It's slow because Python doesn't know how large result will grow, so it keeps reallocating as results accumulate. Then the for loop constructs and then deconstructs a two-tuple to define each pair, and Python has to verify that the values are numbers when it squares them, again when it adds them, and again when it computes the square root. We know that all this extra work is not necessary, but Python doesn't.
The Numpy version looks pretty similar:
result = numpy.sqrt(xarray**2 + yarray**2)
but what it's doing is fundamentally different. Numpy squares all elements of xarray, making a new, unnamed, temporary array xarray**2. Then it does the same with yarray to make yarray**2. Then it adds the two temporary arrays, and finally it computes the square root of each element (note that this is numpy.sqrt and not math.sqrt). Numpy has allocated a lot of temporary arrays, but in the end, it's 37 times faster than the pure Python version (on my laptop with a million-element list).
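If you want to reproduce the comparison, a harness like the following works (the input data here is made up for illustration, and the ratio you get will of course vary with your machine):

import math
import timeit
import numpy

xlist = [float(i) for i in range(1000000)]
ylist = list(reversed(xlist))
xarray = numpy.array(xlist)
yarray = numpy.array(ylist)

def pure_python():
    result = []
    for x, y in zip(xlist, ylist):
        result.append(math.sqrt(x**2 + y**2))
    return result

def with_numpy():
    return numpy.sqrt(xarray**2 + yarray**2)

print(timeit.timeit(pure_python, number=1))   # row-by-row
print(timeit.timeit(with_numpy, number=1))    # column-by-column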
If we think of our data schematically as a table with two columns (x and y) and a million rows, the pure Python implementation steps row-by-row, computing the full expression for each row individually. The Numpy implementation computes each operation of the expression a whole column (or temporary column) at a time. The Numpy syntax is great because it makes the fast column-wise expression look just like the slow row-wise expression.
However, this syntax doesn't work if the calculation is anything but a table transformation (same number of output rows, each row transformed independently). Numpy has a nice syntax for filtering rows (with square brackets, like R), and this allows for the full complexity of SQL with SELECT and WHERE but no GROUP BY. Here are some things that are hard to translate unless there is a Numpy function that does exactly what you want:
- Rows containing arbitrary-length objects, such as events containing lists of jets, or events that have N tracks, the n-th of which has M_n hits.
- Iterative procedures on each row. If a procedure is supposed to iterate until a condition defined on a row is true, the Numpy version will have to iterate until it's true on all rows. (You could do this by performing unnecessary iterations on the completed rows or by allocating smaller subarrays of survivors with each iteration. Which one is faster depends on the straggler distribution. See the sketch after this list.)
- Operations that mix data from different rows.
- Data reductions, such as aggregation (SQL's GROUP BY). Simple cases like numpy.sum are part of the Numpy library, but a general reducer taking a Python lambda function, like Spark's aggregate, would lose most of the Numpy performance because it would have to call out to Python on each row.
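To make the second item concrete, here is a minimal sketch of the "iterate until the condition is true on all rows" pattern, using Newton's method for square roots as a stand-in for a real per-row procedure (rows that have already converged keep iterating harmlessly):

import numpy

a = numpy.random.uniform(1, 100, 1000000)
x = a.copy()                          # initial guess for sqrt(a), per row
while True:
    delta = (x**2 - a) / (2 * x)      # Newton step for every row at once
    if numpy.all(numpy.abs(delta) < 1e-12):
        break                         # stop only when ALL rows have converged
    x -= delta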
Reductions like choosing histogram bins using one Numpy array and computing averages in each bin with another become very complicated:
# Get a bin index for each x in xarray.
xindex = numpy.array(numpy.floor(
    num * (xarray - low) / (high - low)), dtype=int)
profiles = []
# For each bin,
for index in range(num):
    # find the y values that fall in this bin,
    yinbin = yarray[xindex == index]
    # and average them.
    profiles.append(yinbin.mean())
# Now look at the profile plot.
print(profiles)
Numpy has a built-in numpy.histogram for the simple case (count in each bin), but not for anything more complicated (e.g. average in each bin).
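For reference, the simple case is a one-liner:

import numpy

values = numpy.random.normal(0, 1, 1000000)
counts, edges = numpy.histogram(values, bins=100, range=(-5, 5))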
Histogrammar in Numpy
This is where Histogrammar comes in. Histogrammar describes complex aggregations in a declarative way, so although its generic implementation operates row-by-row, it can just as easily be computed column-by-column. A pure Python workflow built out of nested Histogrammar primitives can be Numpy-enhanced with little or no alteration to the code.
For instance, if we want to fill a histogram with Euclidean distances, the pure Python approach would be
# data = [{"x": x0, "y": y0}, {"x": x1, "y": y1}, ...]
histogram = Bin(100, 0, 5, lambda datum: math.sqrt(datum["x"]**2 + datum["y"]**2))
for datum in data:
    histogram.fill(datum)
print(histogram.numericalValues)
and the Numpy approach would be
# data = {"x": numpy.array([x0, x1, ...]), "y": numpy.array([y0, y1, ...])}
histogram = Bin(100, 0, 5, lambda data: numpy.sqrt(data["x"]**2 + data["y"]**2))
histogram.fill.numpy(data)
print(histogram.numericalValues)
It's the same construction of Histogrammar primitives (just a Bin here, but it could have been any complex structure). What differs is the structure of the input data passed as arguments to the lambda functions and the return values of the lambda functions.
In the pure Python case, we have a big list of individual rows and the lambda takes one row (datum) each time it is invoked (many times). In the Numpy case, we have a single dict of column arrays and the lambda takes the whole collection when it is invoked (once).
Histogrammar’s string-based shorthand for function definitions hides this distinction:
histogram = Bin(100, 0, 5, "sqrt(x**2 + y**2)")
for datum in data:
    histogram.fill(datum)
histogram = Bin(100, 0, 5, "sqrt(x**2 + y**2)")
histogram.fill.numpy(data)
The string gets compiled into a Python function using math.sqrt for sqrt if you don't have Numpy and numpy.sqrt if you do. Given x and y as floating point scalars, Python will compute a floating point scalar, and given x and y as Numpy arrays, Python will compute a Numpy array. The fact that we passed in a datum in the first case and data in the second determines how the function will be evaluated, and the fact that we called fill in the first case and fill.numpy in the second determines how Histogrammar will attempt to use that result.
If the wrong type is used in either case, Histogrammar will raise an error: fill requires lambda functions to produce scalars, and fill.numpy requires lambda functions to produce one-dimensional Numpy arrays. (Histogrammar will also raise errors if Numpy arrays at any level of the aggregation have different lengths. They're supposed to represent a single, rectangular table.)
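A quick way to see the rectangular-table check in action (the exact exception type may depend on your Histogrammar version, so this catches broadly):

import numpy
from histogrammar import *

histogram = Bin(100, 0, 5, "sqrt(x**2 + y**2)")
try:
    # y is shorter than x: not a rectangular table
    histogram.fill.numpy({"x": numpy.arange(10.0), "y": numpy.arange(5.0)})
except Exception as err:
    print(err)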
Generality of the interface
In my example above, the Numpy implementation extracts all data from a dict of Numpy columns. It could have been anything else. The job of producing one-dimensional Numpy arrays from data, whatever data happens to be, is up to the lambda functions. If you have a weird data source, you need correspondingly weird lambda functions.
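For instance, here is a sketch with a deliberately odd source, a plain tuple of arrays, where the lambda indexes by position instead of by name (the tuple layout is just an assumption for illustration):

import numpy
from histogrammar import *

xarray = numpy.random.normal(0, 1, 1000000)
yarray = numpy.random.normal(0, 1, 1000000)

# The lambda, not Histogrammar, knows how to get columns out of the tuple.
histogram = Bin(100, 0, 5, lambda data: numpy.sqrt(data[0]**2 + data[1]**2))
histogram.fill.numpy((xarray, yarray))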
However, the idea of having some string-key index to look up equal-length, one-dimensional columns of numbers is a pretty general one. It’s how I do most of my Numpy-based analysis, and it’s also how Numpy’s built-in structured arrays are presented to the user. In addition, it’s the same rough idea as Pandas DataFrames, so all of the following work with the same function string passed to Histogrammar:
import random
import numpy
import pandas
from histogrammar import *

class Data:
    def __init__(self, x, y):
        self.x = x
        self.y = y

xarray = numpy.array([random.gauss(0, 1) for i in range(1000000)])
yarray = numpy.array([random.gauss(0, 1) for i in range(1000000)])
# dict of arrays
histogram = Bin(100, 0, 5, "sqrt(x**2 + y**2)")
histogram.fill.numpy({"x": xarray, "y": yarray})
# object of arrays
histogram = Bin(100, 0, 5, "sqrt(x**2 + y**2)")
histogram.fill.numpy(Data(xarray, yarray))
# Numpy record array
r = numpy.rec.array((xarray, yarray), names=["x", "y"])
histogram = Bin(100, 0, 5, "sqrt(x**2 + y**2)")
histogram.fill.numpy(r)
# Pandas DataFrame
df = pandas.DataFrame({"x": xarray, "y": yarray})
histogram = Bin(100, 0, 5, "sqrt(x**2 + y**2)")
histogram.fill.numpy(df)
More could be added to Histogrammar's inspection rules (SQLAlchemy?). These libraries are not dependencies; if any of them is not available on your system, Histogrammar simply won't attempt to use it.
Performance and coverage
As mentioned in the beginning of this tutorial, the speedup depends on the complexity of the aggregator. A histogram with many bins could even be slower in Numpy because the whole dataset needs to be examined to find the subset matching each individual bin. (But not always: normal histograms, i.e. Bin of Count, use numpy.histogram as a shortcut and therefore scale better than Bin of arbitrary content. This shortcut also requires all-finite values: no inf, -inf, or nan.)
Also, if the dataset is small or empty, Numpy may lose due to time spent setting up and tearing down the calculation. As usual, optimization is driven by particular cases, which is why it’s good that the same Histogrammar construct runs in each mode. You can test both without rewriting your analysis, which is not the case for general Numpy translations!
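Testing both modes can be as simple as this (the data layouts here are invented for illustration; substitute your own rows and columns):

import timeit
import numpy
from histogrammar import *

# Assumed: the same million events in both layouts.
data_cols = {"x": numpy.random.normal(0, 1, 1000000),
             "y": numpy.random.normal(0, 1, 1000000)}
data_rows = [{"x": x, "y": y} for x, y in zip(data_cols["x"], data_cols["y"])]

h1 = Bin(100, 0, 5, "sqrt(x**2 + y**2)")
h2 = Bin(100, 0, 5, "sqrt(x**2 + y**2)")

def row_by_row():
    for datum in data_rows:
        h1.fill(datum)

print(timeit.timeit(row_by_row, number=1))                        # pure Python
print(timeit.timeit(lambda: h2.fill.numpy(data_cols), number=1))  # Numpy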
Here is a rough table of speedup factors (wall time for pure Python divided by wall time for Numpy) for primitives in typical applications. For instance, it's possible to use Stack on a suite of Counts, but it's usually used on a suite of (Bin of Count).
| Primitive | Numpy speedup | Primitive | Numpy speedup |
|---|---|---|---|
| Count | ~100X | Categorize | 1.5X |
| Sum | 40-100X | Fraction | 4-20X (100 bins) |
| Average | 40-100X | Stack | 2-12X (10 plots) |
| Deviate | 40-80X | Select | 4-20X (100 bins) |
| Minimize | 50-150X | Limit | pass-through |
| Maximize | 50-150X | Label | pass-through |
| Bin | 5-25X (100 bins) | UntypedLabel | pass-through |
| SparselyBin | 4-5X (~100 bins) | Index | pass-through |
| CentrallyBin | 25-40X (10 bins) | Branch | pass-through |
| IrregularlyBin | 1-4X (10 plots) | Bag | 1.5-2X |
Some caveats:
- Count cannot be used alone in Numpy because it does not extract a quantity (a one-dimensional array to examine and get the length of). Counts embedded in primitives that extract quantities are fine because those other primitives determine (and assert!) the lengths of the arrays.
- Bin has a wide range of speedups because some cases can use numpy.histogram and others can't. (See above.)
- Categorize operates on strings and builds a dictionary: not something you can do in Numpy, so it falls back on the pure Python implementation.
- Label, UntypedLabel, Index, and Branch are just containers; they pass the data along in the same way whether the fill happens row-by-row or column-by-column.
- Bag defers to its pure Python implementation.