This page describes Histogrammar in detail, but without reference to any particular implementation, including the composable primitives, their required functions and argument lists, and their JSON representations.
This is the normative specification; any implementations that don't adhere to the definitions on this page should be corrected.
General features
Histogrammar consists of twenty aggregation primitives, each performing a simple task, that can be composed to solve complex aggregation problems. These primitives share the following properties.
Reducers
The aggregators, individually and collectively, are reducers in the map-reduce sense. That is, they reduce a large table of data points to a small data structure that summarizes the whole. For instance, a traditional histogram is an array of counts, produced by incrementing array bins for each data point associated with it, and the resulting data structure approximates the probability density of the dataset in a lossy-compressed form.
Composable
A primitive that accepts a sub-aggregator as an argument to its constructor may accept any sub-aggregator. The suite of primitives should be thought of as building blocks to compute whatever complex statistics are needed.
Commutative Monoids
Every primitive has a combine
method, usually presented as the +
operator, that allows aggregations of data partitions to be combined into an aggregation of the whole dataset. This operation is associative, commutative, and has an identity (the aggregator’s initial state). Such an algebra is known as a commutative monoid, and it allows aggregation jobs to be distributed arbitrarily: the dataset may be partitioned in any way that is convenient, the data accumulated in any order, and the final result is always the same.
Serializable
Every filled aggregator has a language-independent JSON representation, allowing results to be transmitted across networks, co-processors, frameworks, and languages. JSON was chosen because aggregated data is supposed to be a small representation of the whole. For large aggregates, consider binary JSON alternatives (such as Apache Avro).
Verbs
Each type of primitive has two forms:
- the “present tense,” which knows how to fill itself;
- the “past tense,” which has lost this information due to being serialized and reconstituted.
When constructing an aggregator, the data analyst defines functions for extracting the relevant quantities from the data. A present-tense aggregator uses these functions in its fill
method to update its state. After serializing to and from JSON, an aggregator may have crossed to an entirely new language where the fill functions are no longer meaningful. It retains its data as a past-tense aggregator that can be combined but not filled.
As a mnemonic, each primitive is named as a verb, with the present-tense “-ing” form for mutable, fillable aggregators and the past-tense “-ed” form for immutable, filled aggregators. The infinitive is used for the factory that constructs both forms (if the factory pattern is used, rather than constructors). Dynamically typed languages like Python and R may use the same class for present and past tenses (named with the infinitive) with different sets of fields for the two cases.
Exception-safe
The fill
methods only update an aggregator’s state after the user-defined functions have all been evaluated. An exception in a user-defined function leaves a complex aggregator in its state prior to the fill attempt, not an inconsistent state.
Version numbers
Specifications are released with two digit version numbers: major.minor. Implementations are released with three digit version numbers: major.minor.bugfix.
The major and minor version numbers of implementations must match the major and minor numbers of the specification they implement. Bugfix numbers increase as errors are corrected and functions outside the scope of the specification are added.
For example, Histogrammar-Python 0.8.5 should be adhere to version 0.8 of the specification and therefore be interoperable with Histogrammar-Scala 0.8.3. Always use the latest bugfix version, but pick a major.minor version for compatibility.
Starting in version 1.0, all versions of the specification with the same major number are backward compatible. That is, code using Histogrammar 1.5 interfaces will still work in Histogrammars 1.6, 1.7, and 1.8, but maybe not 2.0 and maybe not 1.4.
Data model
The input data for all aggregators is taken to be a stream, possibly an infinite stream, of entries. User-defined functions map an entry to something an aggregator can use, such as a boolean for filtering, a floating point number for computing a mean or selecting a bin, or a string for categorical data.
A composite aggregator may contain many plotable data structures, such as histograms, profile plots, and scatter plots, but they must all be filled by a single data source. For instance, data from one sample can be plotted hundreds of different ways within a composite aggregator, revealing subtle relationships in that one dataset, but it cannot be compared with data in a qualitatively different sample (a table with different columns). For that, the data analyst needs to construct independent aggregators and run separate jobs.
It may be possible to merge data from multiple sources with the equivalent of an SQL JOIN
operation before it enters Histogrammar, but in that case, Histogrammar’s source is still a single stream.
Data weights
Selection functions in Histogrammar are “fuzzy.” While they may entirely keep or entirely drop entries, they can also keep an entry with partial weight, such as 0.5 (half), 0.1 (tenth), or 1.5 (overweight). A weight of 0.0 is equivalent to dropping an entry, a weight of 1.0 is equivalent to not weighting, and a weight of 2.0 is equivalent to filling with two identical events, each with weight 1.0. Negative and NaN weights are treated as 0.0 and ignored. The total weight can be positive infinity, but it can never be NaN or negative. All primitives interpret weights this same way.
When two selection functions are nested, their weights multiply. Different parts of a composite aggregator may apply different selections, but all of the selections applied to a particular primitive can be deduced by following the path from the root of the tree to that primitive. Composite aggregators are in this way self-documenting.
Hisogrammar weights are metadata: though they are calculated from the data in user-defined functions, they are passed from one fill
method to the next separately from the data entries themselves. Therefore, if a statistical procedure needs to make use of “negative weights,” it can do so by calculating its own weights, which is simply data from Histogrammar’s point of view. For example, a Count primitive accumulates the sum of Histogrammar weights, which are always non-negative. To accumulate the sum of possibly negative weights, replace Count with Sum, and give the Sum an appropriate lambda function.
User-defined functions
Each language has its own way of declaring functions; Histogrammar only assumes that functions (of some form) can be passed as arguments. For instance, Java technically does not have first class functions, but it can define objects in place with an apply
method.
Histogrammar implementations must also be able to assign names to the functions as strings known at runtime. These names may be treated as metadata, carried separately from the functions themselves. When an aggregator is serialized as JSON, the functions are lost but their names are passed through for bookkeeping. For instance, a data analyst may define a function that extracts a distance in centimeters, label it as “x (cm)”, and use it in a dozen histograms. After aggregating, serializing, transmitting, and reconstituting the histograms, they are still labeled as “x (cm)” and this label may be used on the axis of a plot.
Moreover, if the analyst ever changes the units, introducing a factor of 10 in the function and changing the label to “x (mm)”, this calculation and its label are correctly updated in all the histograms in which it was used. The same would not be true if all histograms that use “x (mm)” were labeled independently.
JSON representation
Every primitive has a JSON fragment representation, which are nested in JSON the same way the primitives themselves are nested. The fragment does not carry information about the primitive type; this information is encoded as a field in the fragment’s parent. This reduces redundancy in the serialized form, since containers might store large collections of sub-aggregators of the same type. The top of the tree wraps the topmost fragment in a JSON object containing
version
(JSON string), the major.minor version number of the specificationtype
(JSON string), name of the topmost primitive type (infinitive)data
, the fragment itself.
{"version": VERSION, "type": TYPENAME, "data": FRAGMENT}
Thus, serializing or deserializing a whole Histogrammar document is distinct from serializing or deserializing any one primitive. The whole-document serialization/deserialization functions must be available to the user, but the fragment functions may be hidden as an implementation detail.
Other key-value pairs may be included at this top level of the document as metadata for the user, and they are ignored by the deserializer. Extraneous key-value pairs are not allowed in JSON fragments, however.
JSON fragments for primitives that accept a function have an optional "name"
attribute in their JSON representation to carry the function name. To reduce redundancy, collections of aggregators with the same name are named once in the parent as "values:name"
or similar. Serialization of an aggregator should not produce "values:name"
in the parent and "name"
in the children, but if this is encountered in deserialization, the child’s "name"
overrides the parent’s "values:name"
. Examples are included in the documentation below.
Conventions used in this specification
In the rest of this document, constructor arguments, required members, fill
, combine
algorithms, and JSON representations are given for each type of aggregator primitive. Histogrammar implementations should reproduce these names, calling structure, and data types exactly, though of course the syntax differs from one language to another.
Implementations match if they produce the same numerical results with a relative and absolute tolerance of 10-12, exactly agree on NaN and infinite cases, exactly agree on strings/null/boolean values, produce lists in the same order, and maps with the same key-value pairs in any order. Although IEEE double precision is the standard for floating-point numbers in Histogrammar, some environments like GPUs may have only single precision floats or premultiplied integers, requiring a looser (and documented) tolerance. Results must match regardless of how the input entries are partitioned (associativity) or shuffled (commutativity).
The fill
and combine
algorithms are expressed in Python syntax for concreteness, with a goal of clarity, not performance. (In many cases, the Python version of Histogrammar is implemented differently for performance). The fact that each primitive has required members does not preclude implementations from defining more member functions and member data, even with semi-standard names and argument lists across implementations. But they are not required.
Also, some arguments are represented here as lists of values. If the language allows it, they may be varargs. Histogrammar implementations have a preference for varargs over list arguments.
If this document disagrees with the behavior of a Histogrammar implementation, this document should be taken as normative and the implementation should be corrected.
Happy histogramming!
Zeroth kind: depend only on weights
Count: sum of weights
Count entries by accumulating the sum of all observed weights or a sum of transformed weights (e.g. sum of squares of weights).
An optional transform
function can be applied to the weights before summing. To accumulate the sum of squares of weights, use lambda x: x**2
, for instance. This is unlike any other primitive’s quantity
function in that its domain is the weights (always floating-point numbers), not entries (any type).
Counting constructor and required members
Count.ing(transform=identity)
transform
(function from double to double) transforms each weight.entries
(mutable double) is the number of entries, initially 0.0.
Counted constructor and required members
Count.ed(entries)
entries
(double) is the number of entries.
Fill and combine algorithms
identity = lambda x: x
def fill(counting, datum, weight):
if weight > 0.0:
counting.entries += counting.transform(weight)
def combine(one, two):
return Count.ed(one.entries + two.entries)
JSON fragment format
Simply a JSON number (or JSON string “inf” for infinite values) representing the entries
. This is the only aggregator whose representation is not a JSON object.
Example:
Here is a stand-alone Histogrammar document representing one Count.
{"version": "1.0", "type": "Count", "data": 123.0}
And here is Count in a more typical context: embedded as values (and underflow, overflow, nanflow) of a Bin.
{"version": "1.0",
"type": "Bin",
"data": {
"low": -5.0,
"high": 5.0,
"entries": 613.0,
"values:type": "Count",
"values": [10.0, 20.0, 20.0, 30.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0],
"underflow:type": "Count", "underflow": 5.0,
"overflow:type": "Count", "overflow": 8.0,
"nanflow:type": "Count", "nanflow": 0.0}}
First kind: aggregate data without sub-aggregators
Sum: sum of a given quantity
Accumulate the (weighted) sum of a given quantity, calculated from the data.
Sum differs from Count in that it computes a quantity on the spot, rather than percolating a product of weight metadata from nested primitives. Also unlike weights, the sum can add both positive and negative quantities (weights are always non-negative).
Summing constructor and required members
Sum.ing(quantity)
quantity
(function returning double) computes the quantity of interest from the data.entries
(mutable double) is the number of entries, initially 0.0.sum
(mutable double) is the running sum, initially 0.0.
Summed constructor and required members
Sum.ed(entries, sum)
entries
(double) is the number of entries.sum
(double) is the sum.
Fill and combine algorithms
def fill(summing, datum, weight):
if weight > 0.0:
q = summing.quantity(datum)
summing.entries += weight
summing.sum += q * weight
def combine(one, two):
return Sum.ed(one.entries + two.entries, one.sum + two.sum)
JSON fragment format
JSON object containing
entries
(JSON number or “inf”)sum
(JSON number, “nan”, “inf”, or “-inf”)- optional
name
(JSON string), name of thequantity
function, if provided.
Example:
{"version": "1.0",
"type": "Sum",
"data": {"entries": 123.0, "sum": 3.14, "name": "myfunc"}}
Average: mean of a quantity
Accumulate the weighted mean of a given quantity.
Uses the numerically stable weighted mean algorithm described in “Incremental calculation of weighted mean and variance,” Tony Finch, Univeristy of Cambridge Computing Service, 2009.
Averaging constructor and required members
Average.ing(quantity)
quantity
(function returning double) computes the quantity of interest from the data.entries
(mutable double) is the number of entries, initially 0.0.mean
(mutable double) is the running mean, initially NaN.
Averaged constructor and required members
Average.ed(entries, mean)
entries
(double) is the number of entries.mean
(double) is the mean or NaN if no data were observed.
Fill and combine algorithms
The relevant part of the fill
method is the three lines in its else
clause. Complications arise when handling infinite and NaN values such that combine
is associative.
def fill(averaging, datum, weight):
if weight > 0.0:
q = averaging.quantity(datum)
if averaging.entries == 0.0:
averaging.mean = 0.0 # make it not NaN (has no weight in total)
averaging.entries += weight
if math.isnan(averaging.mean) or math.isnan(q):
averaging.mean = float("nan")
elif math.isinf(averaging.mean) or math.isinf(q):
if math.isinf(averaging.mean) and math.isinf(q) and averaging.mean * q < 0.0:
averaging.mean = float("nan") # opposite-sign infinities is bad
elif math.isinf(q):
averaging.mean = q # mean becomes infinite with sign of q
else:
pass # mean is already infinite
if math.isinf(averaging.entries) or math.isnan(averaging.entries):
averaging.mean = float("nan") # non-finite denominator is bad
else: # handle finite case
delta = q - averaging.mean
shift = delta * weight / averaging.entries
averaging.mean += shift
def combine(one, two):
entries = one.entries + two.entries
if one.entries == 0.0:
mean = two.mean
elif two.entries == 0.0:
mean = one.mean
else:
mean = (one.entries*one.mean + two.entries*two.mean)/entries
return Average.ed(entries, mean)
JSON fragment format
JSON object containing
entries
(JSON number or “inf”)mean
(JSON number, “nan”, “inf”, or “-inf”)- optional
name
(JSON string), name of thequantity
function, if provided.
Example:
{"version": "1.0",
"type": "Average",
"data": {"entries": 123.0, "mean": 3.14, "name": "myfunc"}}
Deviate: mean and variance
Accumulate the weighted mean and weighted variance of a given quantity.
The variance is computed around the mean, not zero.
Uses the numerically stable weighted mean and weighted variance algorithms described in “Incremental calculation of weighted mean and variance,” Tony Finch, Univeristy of Cambridge Computing Service, 2009.
Deviating constructor and required members
Deviate.ing(quantity)
quantity
(function returning double) computes the quantity of interest from the data.entries
(mutable double) is the number of entries, initially 0.0.mean
(mutable double) is the running mean, initially NaN.variance
(mutable double) is the running sum of squared deviations from the mean, initially NaN. This is a sample variance without a correction for the one degree of freedom removed by computing it relative to the mean: for an unbiased sample variance, multiply byentries/(entries - 1)
. If no data are observed,variance
is NaN; if one datum is observed,variance
is 0.0.
Deviated constructor and required members
Deviate.ed(entries, mean, variance)
entries
(double) is the number of entries.mean
(double) is the mean or NaN if no data were observed.variance
(double) is the sum of squared deviations from the mean or NaN if no data were observed. See above for the relation to sample variance.
Fill and combine algorithms
The relevant part of the fill
method is the four lines in its else
clause. Complications arise when handling infinite and NaN values such that combine
is associative.
def fill(deviating, datum, weight):
if weight > 0.0:
q = deviating.quantity(datum)
if deviating.entries == 0.0:
deviating.mean = 0.0 # make it not NaN (has no weight in total)
deviating.variance = 0.0
varianceTimesEntries = deviating.variance * deviating.entries
deviating.entries += weight
if math.isnan(deviating.mean) or math.isnan(q):
deviating.mean = float("nan")
varianceTimesEntries = float("nan")
elif math.isinf(deviating.mean) or math.isinf(q):
if math.isinf(deviating.mean) and math.isinf(q) and deviating.mean * q < 0.0:
deviating.mean = float("nan") # opposite-sign infinities is bad
elif math.isinf(q):
deviating.mean = q # mean becomes infinite with sign of q
else:
pass # mean and variance are already infinite
if math.isinf(deviating.entries) or math.isnan(deviating.entries):
deviating.mean = float("nan") # non-finite denominator is bad
# any infinite value makes the variance NaN
varianceTimesEntries = float("nan")
else: # handle finite case
delta = q - deviating.mean
shift = delta * weight / deviating.entries
deviating.mean += shift
varianceTimesEntries += weight * delta * (q - deviating.mean)
deviating.variance = varianceTimesEntries / deviating.entries
def combine(one, two):
entries = one.entries + two.entries
if one.entries == 0.0:
mean = two.mean
varianceTimesEntries = two.variance * two.entries
elif two.entries == 0.0:
mean = one.mean
varianceTimesEntries = one.variance * one.entries
else:
mean = (one.entries*one.mean + two.entries*two.mean) / entries
varianceTimesEntries = one.entries*one.variance + two.entries*two.variance \
+ one.entries*one.mean*one.mean + two.entries*two.mean*two.mean \
- 2.0*mean*(one.entries*one.mean + two.entries*two.mean) \
+ entries*mean*mean
if entries == 0.0:
variance = varianceTimesEntries
else:
variance = varianceTimesEntries / entries
return Deviate.ed(entries, mean, variance)
Note: the combine
function is not explicitly defined in Tony Finch’s paper, but it’s derivable from the algebra.
JSON fragment format
JSON object containing
entries
(JSON number or “inf”)mean
(JSON number, “nan”, “inf”, or “-inf”)variance
(JSON number, “nan”, “inf”, or “-inf”)- optional
name
(JSON string), name of thequantity
function, if provided.
Example:
{"version": "1.0",
"type": "Deviate",
"data": {"entries": 123.0, "mean": 3.14, "variance": 0.1, "name": "myfunc"}}
Minimize: minimum value
Find the minimum value of a given quantity. If no data are observed, the result is NaN.
Minimizing constructor and required members
Minimize.ing(quantity)
quantity
(function returning double) computes the quantity of interest from the data.entries
(mutable double) is the number of entries, initially 0.0.min
(mutable double) is the lowest value of the quantity observed, initially NaN.
Minimized constructor and required members
Minimize.ed(entries, min)
entries
(double) is the number of entries.min
(double) is the lowest value of the quantity observed or NaN if no data were observed.
Fill and combine algorithms
def fill(minimizing, datum, weight):
if weight > 0.0:
q = minimizing.quantity(datum)
minimizing.entries += weight
if math.isnan(minimizing.min) or q < minimizing.min:
minimizing.min = q
def combine(one, two):
entries = one.entries + two.entries
if math.isnan(one.min):
min = two.min
elif math.isnan(two.min):
min = one.min
elif one.min < two.min:
min = one.min
else:
min = two.min
return Minimize.ed(entries, min)
JSON fragment format
JSON object containing
entries
(JSON number or “inf”)min
(JSON number, “nan”, “inf”, or “-inf”)- optional
name
(JSON string), name of thequantity
function, if provided.
Example:
{"version": "1.0",
"type": "Minimize",
"data": {"entries": 123.0, "min": 3.14, "name": "myfunc"}}
Maximize: maximum value
Find the maximum value of a given quantity. If no data are observed, the result is NaN.
Maximizing constructor and required members
Maximize.ing(quantity)
quantity
(function returning double) computes the quantity of interest from the data.entries
(mutable double) is the number of entries, initially 0.0.max
(mutable double) is the highest value of the quantity observed, initially NaN.
Maximized constructor and required members
Maximize.ed(entries, min)
entries
(double) is the number of entries.max
(double) is the highest value of the quantity observed or NaN if no data were observed.
Fill and combine algorithms
def fill(maximizing, datum, weight):
if weight > 0.0:
q = maximizing.quantity(datum)
maximizing.entries += weight
if math.isnan(maximizing.max) or q > maximizing.max:
maximizing.max = q
def combine(one, two):
entries = one.entries + two.entries
if math.isnan(one.max):
max = two.max
elif math.isnan(two.max):
max = one.max
elif one.max > two.max:
max = one.max
else:
max = two.max
return Maximize.ed(entries, max)
JSON fragment format
JSON object containing
entries
(JSON number or “inf”)max
(JSON number, “nan”, “inf”, or “-inf”)- optional
name
(JSON string), name of thequantity
function, if provided.
Example:
{"version": "1.0",
"type": "Maximize",
"data": {"entries": 123.0, "max": 3.14, "name": "myfunc"}}
Bag: accumulate values for scatter plots
Accumulate raw numbers, vectors of numbers, or strings, with identical values merged.
A bag is the appropriate data type for scatter plots: a container that collects raw values, maintaining multiplicity but not order. (A “bag” is also known as a “multiset.”) Conceptually, it is a mapping from distinct raw values to the number of observations: when two instances of the same raw value are observed, one key is stored and their weights add.
Although the user-defined function may return scalar numbers, fixed-dimension vectors of numbers, or categorical strings, it may not mix range types. For the purposes of Label and Index (which can only collect aggregators of a single type), bags with different ranges are different types.
This primitive is likely to be eliminated or replaced by BagNumbers, BagStrings, BagVectors.
Bagging constructor and required members
Bag.ing(quantity, range)
quantity
(function returning a double, a vector of doubles, or a string) computes the quantity of interest from the data.range
("N"
,"N#"
where#
is a positive integer, or"S"
): the data type: number, vector of numbers, or string.entries
(mutable double) is the number of entries, initially 0.0.values
(mutable map from quantity return type to double) is the number of entries for each unique item.
If the range
parameter can be inferred from static type constraints, it may be omitted (or used as an assertion).
Bagged constructor and required members
Bag.ed(entries, values, range)
entries
(double) is the number of entries.values
(map from double, vector of doubles, or string to double) is the number of entries for each unique item.range
("N"
,"N#"
where#
is a positive integer, or"S"
): the data type: number, vector of numbers, or string.
Fill and combine algorithms
def fill(bagging, datum, weight):
if weight > 0.0:
q = bagging.quantity(datum)
if bagging.range == "N":
if math.isnan(q): # something to avoid NaN != NaN
q = "nan" # (handling is more complex in type-safe languages)
elif bagging.range[0] == "N":
q = tuple("nan" if math.isnan(qi) else qi for qi in q)
bagging.entries += weight
if q in bagging.values:
bagging.values[q] += weight
else:
bagging.values[q] = weight
def combine(one, two):
if one.range != two.range:
raise Exception
entries = one.entries + two.entries
values = {}
for v in set(one.values.keys()).union(set(two.values.keys())):
if v in one.values and v in two.values:
values[v] = one.values[v] + two.values[v]
elif v in one.values:
values[v] = one.values[v]
elif v in two.values:
values[v] = two.values[v]
return Bag.ed(entries, values, one.range)
JSON fragment format
JSON object containing
entries
(JSON number or “inf”)values
(JSON array of JSON objects containingw
(JSON number), the total weight of entries for a unique value andv
(JSON number, array of numbers, or string), the value); canonical form is sorted byv
(lexicographically)range
(JSON string), therange
type (see above)- optional
name
(JSON string), name of thequantity
function, if provided.
Examples:
{"version": "1.0",
"type": "Bag",
"data": {
"entries": 123.0,
"values": [
{"w": 23.0, "v": -999.0},
{"w": 20.0, "v": -4.0},
{"w": 20.0, "v": -2.0},
{"w": 30.0, "v": 0.0},
{"w": 30.0, "v": 2.0}],
"range": "N"}}
{"version": "1.0",
"type": "Bag",
"data": {
"entries": 123.0,
"values": [
{"w": 23.0, "v": [1.0, 2.0, 3.0]},
{"w": 20.0, "v": [3.14, 3.14, 3.14]},
{"w": 20.0, "v": [99.0, 50.0, 1.0]},
{"w": 30.0, "v": [7.0, 2.2, 9.8]},
{"w": 30.0, "v": [33.3, 66.6, 99.9]}],
"range": "N3"}}
{"version": "1.0",
"type": "Bag",
"data": {
"entries": 123.0,
"values": [
{"w": 23.0, "v": "five"},
{"w": 20.0, "v": "four"},
{"w": 20.0, "v": "one"},
{"w": 30.0, "v": "three"},
{"w": 30.0, "v": "two"}],
"range": "S"}}
Second kind: pass to different sub-aggregators based on values seen in data
Bin: regular binning for histograms
Split a quantity into equally spaced bins between a low and high threshold and fill exactly one bin per datum.
When composed with Count, this produces a standard histogram:
Bin.ing(100, 0, 10, fill_x, Count.ing())
and when nested, it produces a two-dimensional histogram:
Bin.ing(100, 0, 10, fill_x,
Bin.ing(100, 0, 10, fill_y, Count.ing()))
Combining with Deviate produces a physicist’s “profile plot:”
Bin.ing(100, 0, 10, fill_x, Deviate.ing(fill_y))
and so on.
Binning constructor and required members
Bin.ing(num, low, high, quantity, value=Count.ing(), underflow=Count.ing(), overflow=Count.ing(), nanflow=Count.ing())
num
(32-bit integer) is the number of bins; must be at least one.low
(double) is the minimum-value edge of the first bin.high
(double) is the maximum-value edge of the last bin; must be strictly greater thanlow
.quantity
(function returning double) computes the quantity of interest from the data.value
(present-tense aggregator) generates sub-aggregators to put in each bin.underflow
(present-tense aggregator) is a sub-aggregator to use for data whose quantity is less thanlow
.overflow
(present-tense aggregator) is a sub-aggregator to use for data whose quantity is greater than or equal tohigh
.nanflow
(present-tense aggregator) is a sub-aggregator to use for data whose quantity is NaN.entries
(mutable double) is the number of entries, initially 0.0.values
(list of present-tense aggregators) are the sub-aggregators in each bin.
Binned constructor and required members
Bin.ed(low, high, entries, values, underflow, overflow, nanflow)
low
(double) is the minimum-value edge of the first bin.high
(double) is the maximum-value edge of the last bin; must be strictly greater thanlow
.entries
(double) is the number of entries.values
(list of past-tense aggregators) is the filled sub-aggregators, one for each bin.underflow
(past-tense aggregator) is the filled underflow bin.overflow
(past-tense aggregator) is the filled overflow bin.nanflow
(past-tense aggregator) is the filled nanflow bin.
Fill and combine algorithms
def fill(binning, datum, weight):
if weight > 0.0:
q = binning.quantity(datum)
if math.isnan(q):
fill(binning.nanflow, datum, weight)
elif q < binning.low:
fill(binning.underflow, datum, weight)
elif q >= binning.high:
fill(binning.overflow, datum, weight)
else:
bin = int(math.floor(binning.num * \
(q - binning.low) / (binning.high - binning.low)))
fill(binning.values[bin], datum, weight)
binning.entries += weight
def combine(one, two):
if one.num != two.num or one.low != two.low or one.high != two.high:
raise Exception
entries = one.entries + two.entries
values = [combine(x, y) for x, y in zip(one.values, two.values)]
underflow = combine(one.underflow, two.underflow)
overflow = combine(one.overflow, two.overflow)
nanflow = combine(one.nanflow, two.nanflow)
return Bin.ed(one.low, one.high, entries, values, underflow, overflow, nanflow)
JSON fragment format
JSON object containing
low
(JSON number)high
(JSON number)entries
(JSON number or “inf”)values:type
(JSON string), name of the values sub-aggregator typevalues
(JSON array of sub-aggregators)underflow:type
(JSON string), name of the underflow sub-aggregator typeunderflow
sub-aggregatoroverflow:type
(JSON string), name of the overflow sub-aggregator typeoverflow
sub-aggregatornanflow:type
(JSON string), name of the nanflow sub-aggregator typenanflow
(sub-aggregator)- optional
name
(JSON string), name of thequantity
function, if provided. - optional
values:name
(JSON string), name of thequantity
function used by each value. If specified here, it is not specified in all the values, thereby streamlining the JSON.
Examples:
Here is a five-bin histogram, whose bin centers are at -4, -2, 0, 2, and 4. It counts the number of measurements made at each position.
{"version": "1.0",
"type": "Bin",
"data": {
"low": -5.0,
"high": 5.0,
"entries": 123.0,
"name": "position [cm]",
"values:type": "Count",
"values": [10.0, 20.0, 20.0, 30.0, 30.0],
"underflow:type": "Count",
"underflow": 5.0,
"overflow:type": "Count",
"overflow": 8.0,
"nanflow:type": "Count",
"nanflow": 0.0}}
Here is another five-bin histogram on the same domain, this one quantifying an average value in each bin. The quantity measured by the average has a name ("average time [s]"
), which would have been a "name"
field in the JSON objects representing the averages if it had not been specified once in "values:name"
.
{"version": "1.0",
"type": "Bin",
"data": {
"low": -5.0,
"high": 5.0,
"entries": 123.0,
"name": "position [cm]",
"values:type": "Average",
"values:name": "average time [s]",
"values": [
{"entries": 10.0, "mean": 4.25},
{"entries": 20.0, "mean": 16.21},
{"entries": 20.0, "mean": 20.28},
{"entries": 30.0, "mean": 16.19},
{"entries": 30.0, "mean": 4.23}],
"underflow:type": "Count",
"underflow": 5.0,
"overflow:type": "Count",
"overflow": 8.0,
"nanflow:type": "Count",
"nanflow": 0.0}}
SparselyBin: ignore zeros
Split a quantity into equally spaced bins, creating them whenever their entries
would be non-zero. Exactly one sub-aggregator is filled per datum.
Use this when you have a distribution of known scale (bin width) but unknown domain (lowest and highest bin index).
Unlike fixed-domain binning, this aggregator has the potential to use unlimited memory. A large number of distinct outliers can generate many unwanted bins.
Like fixed-domain binning, the bins are indexed by integers, though they are 64-bit and may be negative. Bin indexes below -(2**63 - 1)
are put in the -(2**63 - 1)
are bin and indexes above (2**63 - 1)
are put in the (2**63 - 1)
bin.
SparselyBinning constructor and required members
SparselyBin.ing(binWidth, quantity, value=Count.ing(), nanflow=Count.ing(), origin=0.0)
binWidth
(double) is the width of a bin; must be strictly greater than zero.quantity
(function returning double) computes the quantity of interest from the data.value
(present-tense aggregator) generates sub-aggregators to put in each bin.nanflow
(present-tense aggregator) is a sub-aggregator to use for data whose quantity is NaN.origin
(double) is the left edge of the bin whose index is 0.entries
(mutable double) is the number of entries, initially 0.0.bins
(mutable map from 64-bit integer to present-tense aggregator) is the map, probably a hashmap, to fill with values when theirentries
become non-zero.
SparselyBinned constructor and required members
SparselyBin.ed(binWidth, entries, contentType, bins, nanflow, origin)
binWidth
(double) is the width of a bin.entries
(double) is the number of entries.contentType
(string) is the value’s sub-aggregator type (must be provided to determine type for the case whenbins
is empty).bins
(map from 64-bit integer to past-tense aggregator) is the non-empty bin indexes and their values.nanflow
(past-tense aggregator) is the filled nanflow bin.origin
(double) is the left edge of the bin whose index is zero.
Fill and combine algorithms
def fill(sparselybinning, datum, weight):
if weight > 0.0:
q = sparselybinning.quantity(datum)
if math.isnan(q):
fill(sparselybinning.nanflow, datum, weight)
else:
softbin = (q - sparselybinning.origin) / sparselybinning.binWidth
if softbin <= -(2**63 - 1):
bin = -(2**63 - 1)
elif softbin >= (2**63 - 1):
bin = (2**63 - 1)
else:
bin = int(math.floor(softbin))
if bin not in sparselybinning.bins:
sparselybinning.bins[bin] = sparselybinning.value.copy()
fill(sparselybinning.bins[bin], datum, weight)
sparselybinning.entries += weight
def combine(one, two):
if one.binWidth != two.binWidth or one.origin != two.origin or one.contentType != two.contentType:
raise Exception
entries = one.entries + two.entries
contentType = one.contentType
bins = {}
for key in set(one.bins.keys()).union(set(two.bins.keys())):
if key in one.bins and key in two.bins:
bins[key] = combine(one.bins[key], two.bins[key])
elif key in one.bins:
bins[key] = one.bins[key].copy()
elif key in two.bins:
bins[key] = two.bins[key].copy()
nanflow = combine(one.nanflow, two.nanflow)
return SparselyBin.ed(one.binWidth, entries, contentType, \
bins, nanflow, one.origin)
JSON fragment format
JSON object containing
binWidth
(JSON number)entries
(JSON number or “inf”)bins:type
(JSON string), name of the bins sub-aggregator typebins
(JSON object), keys are string representations of the bin indexes (decimal, no leading zeros) and values are sub-aggregatorsnanflow:type
(JSON string), name of the nanflow sub-aggregator typenanflow
(sub-aggregator)origin
(JSON number)- optional
name
(JSON string), name of thequantity
function, if provided. - optional
values:name
(JSON string), name of thequantity
function used by each bin value. If specified here, it is not specified in all the values, thereby streamlining the JSON.
Example:
{"version": "1.0",
"type": "SparselyBin",
"data": {
"binWidth": 2.0,
"entries": 123.0,
"bins:type": "Count",
"bins": {"-999": 5.0, "-4": 10.0, "-2": 20.0, "0": 20.0, "2": 30.0, "4": 30.0, "12345": 8.0},
"nanflow:type": "Count",
"nanflow": 0.0,
"origin": 0.0,
"name": "myfunc"}}
CentrallyBin: fully partitioning with centers
Split a quantity into bins defined by irregularly spaced bin centers, with exactly one sub-aggregator filled per datum (the closest one). When exactly between two centers, select the righter (greater valued) bin. (Thus, the intervals of validity are closed on the left and open on the right, with the exception of positive infinity being included in the last bin, just like IrregularlyBin.)
Unlike irregular bins defined by low-high ranges, irregular bins defined by bin centers are guaranteed to fully partition the space with no gaps and no overlaps. It could be viewed as cluster scoring in one dimension.
CentrallyBinning constructor and required members
CentrallyBin.ing(centers, quantity, value=Count.ing(), nanflow=Count.ing())
centers
(list of doubles) is the centers of all binsquantity
(function returning double) computes the quantity of interest from the data.value
(present-tense aggregator) generates sub-aggregators to put in each bin.nanflow
(present-tense aggregator) is a sub-aggregator to use for data whose quantity is NaN.entries
(mutable double) is the number of entries, initially 0.0.bins
(list of double, present-tense aggregator pairs) are the bin centers and sub-aggregators in each bin.
CentrallyBinned constructor and required members
CentrallyBin.ed(entries, bins, nanflow)
entries
(double) is the number of entries.bins
(list of double, past-tense aggregator pairs) is the list of bin centers and their accumulated data.nanflow
(past-tense aggregator) is the filled nanflow bin.
Fill and combine algorithms
def fill(centrallybinning, datum, weight):
if weight > 0.0:
q = centrallybinning.quantity(datum)
if math.isnan(q):
fill(centrallybinning.nanflow, datum, weight)
else:
for index in range(len(centrallybinning.bins)):
if index == len(centrallybinning.bins) - 1:
break
thisCenter = centrallybinning.bins[index][0]
nextCenter = centrallybinning.bins[index + 1][0]
if q < (thisCenter + nextCenter)/2.0:
break
fill(centrallybinning.bins[index][1], datum, weight)
centrallybinning.entries += weight
def combine(one, two):
if set(one.centers) != set(two.centers):
raise Exception
entries = one.entries + two.entries
bins = []
for c1, v1 in one.bins:
v2 = [v for c2, v in two.bins if c1 == c2][0]
bins.append((c1, combine(v1, v2)))
nanflow = combine(one.nanflow, two.nanflow)
return CentrallyBin.ed(entries, bins, nanflow)
JSON fragment format
JSON object containing
entries
(JSON number or “inf”)bins:type
(JSON string), name of the bins sub-aggregator typebins
(JSON array of JSON objects containingcenter
(JSON number) anddata
(sub-aggregator)), collection of bin centers and their associated datananflow:type
(JSON string), name of the nanflow sub-aggregator typenanflow
(sub-aggregator)- optional
name
(JSON string), name of thequantity
function, if provided. - optional
bins:name
(JSON string), name of thequantity
function used by each bin value. If specified here, it is not specified in all the values, thereby streamlining the JSON.
Examples:
{"version": "1.0",
"type": "CentrallyBin",
"data": {
"entries": 123.0,
"bins:type": "Count",
"bins": [
{"center": -999.0, "data": 5.0},
{"center": -4.0, "data": 10.0},
{"center": -2.0, "data": 20.0},
{"center": 0.0, "data": 20.0},
{"center": 2.0, "data": 30.0},
{"center": 4.0, "data": 30.0},
{"center": 12345.0, "data": 8.0}],
"nanflow:type": "Count",
"nanflow": 0.0,
"name": "myfunc"}}
Here is an example with Average
sub-aggregators:
{"version": "1.0",
"type": "CentrallyBin",
"data": {
"entries": 123.0,
"bins:type": "Average",
"bins": [
{"center": -999.0, "data": {"entries": 5.0, "mean": -1.0}},
{"center": -4.0, "data": {"entries": 10.0, "mean": 4.25}},
{"center": -2.0, "data": {"entries": 20.0, "mean": 16.21}},
{"center": 0.0, "data": {"entries": 20.0, "mean": 20.28}},
{"center": 2.0, "data": {"entries": 20.0, "mean": 16.19}},
{"center": 4.0, "data": {"entries": 30.0, "mean": 4.23}},
{"center": 12345.0, "data": {"entries": 8.0, "mean": -1.0}}],
"nanflow:type": "Count",
"nanflow": 0.0,
"name": "myfunc",
"bins:name": "myfunc2"}}
IrregularlyBin: fully partitioning with edges
Accumulate a suite of aggregators, each between two thresholds, filling exactly one per datum.
This is a variation on Stack, which fills N + 1
aggregators with N
successively tighter cut thresholds. IrregularlyBin fills N + 1
aggregators in the non-overlapping intervals between N
thresholds.
IrregularlyBin is also similar to CentrallyBin, in that they both partition a space into irregular subdomains with no gaps and no overlaps. However, CentrallyBin is defined by bin centers and IrregularlyBin is defined by bin low edges, the first of which is negative infinity (inclusive) and the last is implicitly positive infinity (inclusive).
IrregularlyBinning constructor and required members
IrregularlyBin.ing(thresholds, quantity, value=Count.ing(), nanflow=Count.ing())
thresholds
(list of doubles) specifiesN
lower cut thresholds, so the IrregularlyBin will fillN + 1
aggregators in distinct intervals.quantity
(function returning double) computes the quantity of interest from the data.value
(present-tense aggregator) generates sub-aggregators for each bin.nanflow
(present-tense aggregator) is a sub-aggregator to use for data whose quantity is NaN.entries
(mutable double) is the number of entries, initially 0.0.bins
(list of double, present-tense aggregator pairs) are theN + 1
lower thresholds and sub-aggregators. The first threshold is minus infinity; the rest are the ones specified bythresholds
.
IrregularlyBinned constructor and required members
IrregularlyBin.ed(entries, bins, nanflow)
entries
(double) is the number of entries.bins
(list of double, past-tense aggregator pairs) are theN + 1
thresholds and sub-aggregator pairs.nanflow
(past-tense aggregator) is the filled nanflow bin.
Fill and combine algorithms
def fill(irregularlybinning, datum, weight):
if weight > 0.0:
q = irregularlybinning.quantity(datum)
if math.isnan(q):
fill(irregularlybinning.nanflow, datum, weight)
else:
lowEdges = irregularlybinning.bins
highEdges = list(irregularlybinning.bins[1:]) + [(float("nan"), None)]
for (low, sub), (high, _) in zip(lowEdges, highEdges):
if q >= low and not q >= high: # include high endpoint only for the last bin
fill(sub, datum, weight)
break
irregularlybinning.entries += weight
def combine(one, two):
if one.thresholds != two.thresholds:
raise Exception
entries = one.entries + two.entries
bins = [(c, combine(v1, v2)) for (c, v1), (_, v2) in zip(one.bins, two.bins)]
nanflow = combine(one.nanflow, two.nanflow)
return IrregularlyBin.ed(entries, bins, nanflow)
JSON fragment format
JSON object containing
entries
(JSON number or “inf”)bins:type
(JSON string), name of the sub-aggregator typebins
(JSON array of JSON objects containingatleast
(JSON number or “-inf”) anddata
(sub-aggregator)), collection of lower cut thresholds (including minus infinity) and their associated datananflow:type
(JSON string), name of the nanflow sub-aggregator typenanflow
(sub-aggregator)- optional
name
(JSON string), name of thequantity
function, if provided. - optional
bins:name
(JSON string), name of thequantity
function used by the sub-aggregators. If specified here, it is not specified in all the sub-aggregators, thereby streamlining the JSON.
Examples:
{"version": "1.0",
"type": "IrregularlyBin",
"data": {
"entries": 123.0,
"bins:type": "Count",
"bins": [
{"atleast": "-inf", "data": 23.0},
{"atleast": 1.0, "data": 20.0},
{"atleast": 2.0, "data": 20.0},
{"atleast": 3.0, "data": 30.0},
{"atleast": 4.0, "data": 30.0}],
"nanflow:type": "Count",
"nanflow": 0.0,
"name": "myfunc"}}
{"version": "1.0",
"type": "IrregularlyBin",
"data": {
"entries": 123.0,
"bins:type": "Average",
"bins": [
{"atleast": "-inf", "data": {"entries": 23.0, "mean": 3.14}},
{"atleast": 1.0, "data": {"entries": 20.0, "mean": 2.28}},
{"atleast": 2.0, "data": {"entries": 20.0, "mean": 1.16}},
{"atleast": 3.0, "data": {"entries": 30.0, "mean": 8.9}},
{"atleast": 4.0, "data": {"entries": 30.0, "mean": 22.7}}],
"nanflow:type": "Count",
"nanflow": 0.0,
"name": "myfunc",
"bins:name": "myfunc2"}}
Categorize: string-valued bins, bar charts
Split a given quantity by its categorical value and fill only one category per datum.
A bar chart may be thought of as a histogram with string-valued (categorical) bins, so this is the equivalent of Bin for bar charts. The order of the strings is deferred to the visualization stage.
Unlike SparselyBin, this aggregator has the potential to use unlimited memory. A large number of distinct categories can generate many unwanted bins.
Categorizing constructor and required members
Categorize.ing(quantity, value=Count.ing())
quantity
(function returning string) computes the quantity of interest from the data.value
(present-tense aggregator) generates sub-aggregators to put in each bin.entries
(mutable double) is the number of entries, initially 0.0.bins
(mutable map from string to present-tense aggregator) is the map, probably a hashmap, to fill with values when theirentries
become non-zero.
Categorized constructor and required members
Categorize.ed(entries, contentType, bins)
entries
(double) is the number of entries.contentType
(string) is the value’s sub-aggregator type (must be provided to determine type for the case whenbins
is empty).bins
(map from string to past-tense aggregator) is the non-empty bin categories and their values.
Fill and combine algorithms
def fill(categorizing, datum, weight):
if weight > 0.0:
q = categorizing.quantity(datum)
if q not in categorizing.bins:
categorizing.bins[q] = categorizing.value.copy()
fill(categorizing.bins[q], datum, weight)
categorizing.entries += weight
def combine(one, two):
if one.contentType != two.contentType:
raise Exception
entries = one.entries + two.entries
contentType = one.contentType
bins = {}
for key in set(one.bins.keys()).union(set(two.bins.keys())):
if key in one.bins and key in two.bins:
bins[key] = combine(one.bins[key], two.bins[key])
elif key in one.bins:
bins[key] = one.bins[key].copy()
elif key in two.bins:
bins[key] = two.bins[key].copy()
return Categorize.ed(entries, contentType, bins)
JSON fragment format
JSON object containing
entries
(JSON number or “inf”)bins:type
(JSON string), name of the sub-aggregator typebins
(JSON object), keys are the string-valued categories and values are sub-aggregators- optional
name
(JSON string), name of thequantity
function, if provided. - optional
bins:name
(JSON string), name of thequantity
function used by each bin value. If specified here, it is not specified in all the values, thereby streamlining the JSON.
Example:
{"version": "1.0",
"type": "Categorize",
"data": {
"entries": 123.0,
"bins:type": "Count",
"bins": {"one": 23.0, "two": 20.0, "three": 20.0, "four": 30.0, "five": 30.0},
"name": "myfunc"}}
Fraction: efficiency plots
Accumulate two aggregators, one containing only entries that pass a given selection (numerator) and another that contains all entries (denominator).
The aggregator may be a simple Count to measure the efficiency of a cut, a Bin to plot a turn-on curve, or anything else to be tested with and without a cut.
As a side effect of NaN values returning false for any comparison, a NaN return value from the selection is treated as a failed cut (the denominator is filled but the numerator is not).
Note that the selection function can return any non-negative number. Fractional selections would fractionally weight the numerator and would not affect the denominator. It is even possible to make the numerator exceed the denominator with weights greater than 1.0.
Fractioning constructor and required members
Fraction.ing(quantity, value=Count.ing())
quantity
(function returning boolean or double) computes the quantity of interest from the data and interprets it as a selection (multiplicative factor on weight).value
(present-tense aggregator) generates sub-aggregators for the numerator and denominator.entries
(mutable double) is the number of entries, initially 0.0.numerator
(present-tense aggregator) is the sub-aggregator of entries that pass the selection.denominator
(present-tense aggregator) is the sub-aggregator of all entries.
Fractioned constructor and required members
Fraction.ed(entries, numerator, denominator)
entries
(double) is the number of entries.numerator
(past-tense aggregator) is the filled numerator.denominator
(past-tense aggregator) is the filled denominator.
Fractioned alternate constructor
Fraction.build(numerator, denominator)
numerator
(past-tense aggregator) is the filled numerator.denominator
(past-tense aggregator) is the filled denominator.
This constructor will make a past-tense Fraction object that can be used to represent any ratio, not just a fraction. The numerator and denominator may come from different sources, but they must have compatible binning.
Fill, combine, and alternate constructor algorithms
def fill(fractioning, datum, weight):
if weight > 0.0:
w = weight * fractioning.quantity(datum)
fill(fractioning.denominator, datum, weight)
if w > 0.0:
fill(fractioning.numerator, datum, w)
fractioning.entries += weight
def combine(one, two):
entries = one.entries + two.entries
numerator = combine(one.numerator, two.numerator)
denominator = combine(one.denominator, two.denominator)
return Fraction.ed(entries, numerator, denominator)
def build(numerator, denominator):
return Fraction.ed(denominator.entries, numerator, denominator)
JSON fragment format
JSON object containing
entries
(JSON number or “inf”)sub:type
(JSON string), name of the numerator/denominator typenumerator
(sub-aggregator)denominator
(sub-aggregator)- optional
name
(JSON string), name of thequantity
function, if provided. - optional
sub:name
(JSON string), name of thequantity
function used by the numerator and denominator. If specified here, it is not specified in all the sub-aggregators, thereby streamlining the JSON.
Example:
{"version": "1.0",
"type": "Fraction",
"data": {
"entries": 123.0,
"name": "trigger",
"sub:name": "energy [GeV]",
"sub:type": "Bin",
"numerator": {
"low": -5.0,
"high": 5.0,
"entries": 98.0,
"values:type": "Count",
"values": [2.0, 15.0, 18.0, 25.0, 30.0],
"underflow:type": "Count",
"underflow": 0.0,
"overflow:type": "Count",
"overflow": 8.0,
"nanflow:type": "Count",
"nanflow": 0.0},
"denominator": {
"low": -5.0,
"high": 5.0,
"entries": 123.0,
"values:type": "Count",
"values": [10.0, 20.0, 20.0, 30.0, 30.0],
"underflow:type": "Count",
"underflow": 5.0,
"overflow:type": "Count",
"overflow": 8.0,
"nanflow:type": "Count",
"nanflow": 0.0}}}
Stack: cumulative filling
Accumulates a suite of aggregators, each filtered with a tighter selection on the same quantity.
This is a generalization of Fraction, which fills two aggregators, one with a cut, the other without. Stack fills N + 1
aggregators with N
successively tighter cut thresholds. The first is always filled (like the denominator of Fraction), the second is filled if the computed quantity exceeds its threshold, the next is filled if the computed quantity exceeds a higher threshold, and so on.
The thresholds are presented in increasing order and the computed value must be greater than or equal to a threshold to fill the corresponding bin, and therefore the number of entries in each filled bin is greatest in the first and least in the last.
Although this aggregation could be plotted as a stack of histograms, stacks of histograms are often used to represent something different: data from different sources, rather than cuts on the same source. For example, it is common to stack distributions from different Monte Carlo simulations to show that they add up to the observed data. The Stack aggregator does not make plots of this type because aggregation trees in Histogrammar draw data from exactly one source.
To make plots from different sources in Histogrammar, one must perform separate aggregation runs. It may then be convenient to stack the results of those runs as though they were created with a Stack aggregation, so that plotting code can treat both cases uniformly. For this reason, Stack has an alternate constructor to build a Stack manually from distinct aggregators, even if those aggregators came from different aggregation runs.
Stacking constructor and required members
Stack.ing(thresholds, quantity, value=Count.ing(), nanflow=Count.ing())
thresholds
(list of doubles) specifiesN
lower cut thresholds, so the Stack will fillN + 1
aggregators, each overlapping the last.quantity
(function returning double) computes the quantity of interest from the data.value
(present-tense aggregator) generates sub-aggregators for each bin.nanflow
(present-tense aggregator) is a sub-aggregator to use for data whose quantity is NaN.entries
(mutable double) is the number of entries, initially 0.0.bins
(list of double, present-tense aggregator pairs) are theN + 1
lower thresholds and sub-aggregators. The first is minus infinity; the rest are the ones specified bythresholds
.
Stacked constructor and required members
Stack.ed(entries, bins, nanflow)
entries
(double) is the number of entries.bins
(list of double, past-tense aggregator pairs) are theN + 1
thresholds and sub-aggregator pairs.nanflow
(past-tense aggregator) is the filled nanflow bin.
Stacked alternate constructor
Stack.build(aggregators)
aggregators
(list of aggregators of the same type from any source); the algorithm will attempt to add them, so they must also have the same binning/bounds/etc.
This constructor will make a past-tense Stacked object with NaN as cut thresholds and Count.ed(0.0)
as nanflow
.
Fill, combine, and alternate constructor algorithms
def fill(stacking, datum, weight):
if weight > 0.0:
q = stacking.quantity(datum)
if math.isnan(q):
fill(stacking.nanflow, datum, weight)
else:
for threshold, sub in stacking.bins:
if q >= threshold:
fill(sub, datum, weight)
stacking.entries += weight
def combine(one, two):
if [c for c, v in one.bins] != [c for c, v in two.bins]:
raise Exception
entries = one.entries + two.entries
bins = []
for (c, v1), (_, v2) in zip(one.bins, two.bins):
bins.append((c, combine(v1, v2)))
nanflow = combine(one.nanflow, two.nanflow)
return Stack.ed(entries, bins, nanflow)
def build(aggregators):
entries = sum(x.entries for x in aggregators)
bins = []
for i in range(len(aggregators)):
stackedAggregators = reduce(lambda a, b: combine(a, b), aggregators[i:])
bins.append((float("nan"), stackedAggregators))
return Stack.ed(entries, bins, Count.ed(0.0))
JSON fragment format
JSON object containing
entries
(JSON number or “inf”)bins:type
(JSON string), name of the sub-aggregator typebins
(JSON array of JSON objects containingatleast
(JSON number or “-inf”) anddata
(sub-aggregator)), collection of lower cut thresholds (including minus infinity) and their associated datananflow:type
(JSON string), name of the nanflow sub-aggregator typenanflow
(sub-aggregator)- optional
name
(JSON string), name of thequantity
function, if provided. - optional
bins:name
(JSON string), name of thequantity
function used by the sub-aggregators. If specified here, it is not specified in all the sub-aggregators, thereby streamlining the JSON.
Examples:
{"version": "1.0",
"type": "Stack",
"data": {
"entries": 123.0,
"bins:type": "Count",
"bins": [
{"atleast": "-inf", "data": 123.0},
{"atleast": 1.0, "data": 100.0},
{"atleast": 2.0, "data": 82.0},
{"atleast": 3.0, "data": 37.0},
{"atleast": 4.0, "data": 4.0}],
"nanflow:type": "Count",
"nanflow": 0.0,
"name": "myfunc"}}
{"version": "1.0",
"type": "Stack",
"data": {
"entries": 123.0,
"bins:type": "Average",
"bins": [
{"atleast": "-inf", "data": {"entries": 123.0, "mean": 3.14}},
{"atleast": 1.0, "data": {"entries": 100.0, "mean": 2.28}},
{"atleast": 2.0, "data": {"entries": 82.0, "mean": 1.16}},
{"atleast": 3.0, "data": {"entries": 37.0, "mean": 8.9}},
{"atleast": 4.0, "data": {"entries": 4.0, "mean": 22.7}}],
"nanflow:type": "Count",
"nanflow": 0.0,
"name": "myfunc",
"bins:name": "myfunc2"}}
Select: apply a cut
Filter or weight data according to a given selection.
This primitive is a basic building block, intended to be used in conjunction with anything that needs a user-defined cut. In particular, a standard histogram often has a custom selection, and this can be built by nesting Select → Bin → Count.
Select also resembles Fraction, but without the denominator
.
The efficiency of a cut in a Select aggregator named x
is simply x.cut.entries / x.entries
(because all aggregators have an entries
member).
Note that the selection function can return any non-negative number. Fractional selections would fractionally weight the sub-aggregator.
Selecting constructor and required members
Select.ing(quantity, cut=Count())
quantity
(function returning boolean or double) computes the quantity of interest from the data and interprets it as a selection (multiplicative factor on weight).cut
(present-tense aggregator) will only be filled with data that pass the cut, and which are weighted by the cut.entries
(mutable double) is the number of entries, initially 0.0.
Selected constructor and required members
Select.ed(entries, cut)
entries
(double) is the number of entries.cut
(past-tense aggregator) is the filled sub-aggregator.
Fill and combine algorithms
def fill(selecting, datum, weight):
if weight > 0.0:
w = weight * selecting.quantity(datum)
if w > 0.0:
fill(selecting.cut, datum, w)
selecting.entries += weight
def combine(one, two):
entries = one.entries + two.entries
cut = combine(one.cut, two.cut)
return Select.ed(entries, cut)
JSON fragment format
JSON object containing
entries
(JSON number or “inf”)sub:type
(JSON string), name of the sub-aggregator typedata
(sub-aggregator)- optional
name
(JSON string), name of thequantity
function, if provided.
Examples:
{"version": "1.0",
"type": "Select",
"data": {
"entries": 123.0,
"name": "trigger",
"sub:type": "Count",
"data": 98.0}}
{"version": "1.0",
"type": "Select",
"data": {
"entries": 123.0,
"name": "trigger",
"sub:name": "energy [GeV]",
"sub:type": "Bin",
"data": {
"low": -5.0,
"high": 5.0,
"entries": 98.0,
"values:type": "Count",
"values": [2.0, 15.0, 18.0, 25.0, 30.0],
"underflow:type": "Count",
"underflow": 0.0,
"overflow:type": "Count",
"overflow": 8.0,
"nanflow:type": "Count",
"nanflow": 0.0}}}
Third kind: broadcast to every sub-aggregator, independent of data
Label: directory with string-based keys
Accumulate any number of aggregators of the same type and label them with strings. Every sub-aggregator is filled with every input datum.
This primitive simulates a directory of aggregators. For sub-directories, nest collections within the Label collection.
Note that all sub-aggregators within a Label must have the same type (e.g. histograms of different binnings, but all histograms). To collect objects of different types with string-based look-up keys, use UntypedLabel.
To collect aggregators of the same type without naming them, use Index. To collect aggregators of different types without naming them, use Branch.
In strongly typed languages, the restriction to a single type allows nested objects to be extracted without casting.
Labeling constructor and required members
Label.ing(pairs)
pairs
(map of string, present-tense aggregator pairs) is the collection of aggregators to fill.entries
(mutable double) is the number of entries, initially 0.0.
Labeled constructor and required members
Label.ed(entries, pairs)
entries
(double) is the number of entries.pairs
(map of string, past-tense aggregator pairs) is the collection of filled aggregators.
Fill and combine algorithms
def fill(labeling, datum, weight):
if weight > 0.0:
for _, v in labeling.pairs.items():
fill(v, datum, weight)
labeling.entries += weight
def combine(one, two):
if set(one.pairs.keys()) != set(two.pairs.keys()):
raise Exception
entries = one.entries + two.entries
pairs = {}
for l, v1 in one.pairs.items():
v2 = two.pairs[l]
pairs[l] = combine(v1, v2)
return Label.ed(entries, pairs)
JSON fragment format
JSON object containing
entries
(JSON number or “inf”)sub:type
(JSON string), name of the sub-aggregator typedata
(JSON object), keys are the labels and values are sub-aggregators
The fact that Label requires all contents to have a single type allows the sub:type
to be specified once here, instead of once for every item in the collection. UntypedLabel is more verbose.
Example:
{"version": "1.0",
"type": "Label",
"data": {
"entries": 123.0,
"sub:type": "Average",
"data": {
"one": {"entries": 123.0, "mean": 3.14},
"two": {"entries": 123.0, "mean": 6.28},
"three": {"entries": 123.0, "mean": 99.9}}}}
UntypedLabel: directory of different types
Accumulate any number of aggregators of any type and label them with strings. Every sub-aggregator is filled with every input datum.
This primitive simulates a directory of aggregators. For sub-directories, nest collections within the UntypedLabel.
Note that sub-aggregators within an UntypedLabel may have different types. In strongly typed languages, this flexibility poses a problem: nested objects must be type-cast before they can be used. To collect objects of the same type with string-based look-up keys, use Label.
To collect aggregators of the same type without naming them, use Index. To collect aggregators of different types without naming them, use Branch.
UntypedLabeling constructor and required members
UntypedLabel.ing(pairs)
pairs
(map of string, present-tense aggregator pairs) is the collection of aggregators to fill.entries
(mutable double) is the number of entries, initially 0.0.
UntypedLabeled constructor and required members
UntypedLabel.ed(entries, pairs)
entries
(double) is the number of entries.pairs
(map of string, past-tense aggregator pairs) is the collection of filled aggregators.
Fill and combine algorithms
def fill(untypedlabeling, datum, weight):
if weight > 0.0:
for _, v in untypedlabeling.pairs.items():
fill(v, datum, weight)
untypedlabeling.entries += weight
def combine(one, two):
if set(one.pairs.keys()) != set(two.pairs.keys()):
raise Exception
entries = one.entries + two.entries
pairs = {}
for l, v1 in one.pairs.items():
v2 = two.pairs[l]
pairs[l] = combine(v1, v2)
return UntypedLabel.ed(entries, pairs)
JSON fragment format
JSON object containing
entries
(JSON number or “inf”)data
(JSON object mapping labels to JSON objects containingtype
(JSON string), name of sub-aggregator type anddata
(sub-aggregator))
The fact that UntypedLabel allows each element to have a different type forces the type
annotation to be nested with the sub-aggregator. For collections that actually contain only one type, Label avoids this redundancy.
Example:
{"version": "1.0",
"type": "UntypedLabel",
"data": {
"entries": 123.0,
"data": {
"one": {"type": "Count", "data": 123.0},
"two": {"type": "Average", "data": {
"entries": 123.0,
"mean": 3.14}},
"three": {"type": "Bin", "data": {
"low": -5.0,
"high": 5.0,
"entries": 123.0,
"name": "position [cm]",
"values:type": "Count",
"values": [10.0, 20.0, 20.0, 30.0, 30.0],
"underflow:type": "Count",
"underflow": 5.0,
"overflow:type": "Count",
"overflow": 8.0,
"nanflow:type": "Count",
"nanflow": 0.0}}}}}
Index: list with integer keys
Accumulate any number of aggregators of the same type in a list. Every sub-aggregator is filled with every input datum.
This primitive provides an anonymous collection of aggregators (unless the integer index is taken to have special meaning, but generally such bookkeeping should be encoded in strings). Indexes can be nested to create two-dimensional ordinal grids of aggregators. (Use Bin if the space is to have a metric interpretation.)
Note that all sub-aggregators within an Index must have the same type (e.g. histograms of different binnings, but all histograms). To collect objects of different types, still indexed by integer, use Branch.
To collect aggregators of the same type with string-based labels, use Label. To collect aggregators of different types with string-based labels, use UntypedLabel.
In strongly typed languages, the restriction to a single type allows nested objects to be extracted without casting.
Indexing constructor and required members
Index.ing(values)
values
(list of present-tense aggregators) is the collection of aggregators to fill.entries
(mutable double) is the number of entries, initially 0.0.
Indexed constructor and required members
Index.ed(entries, values)
entries
(double) is the number of entries.values
(list of past-tense aggregators) is the collection of filled aggregators.
Fill and combine algorithms
def fill(indexing, datum, weight):
if weight > 0.0:
for v in indexing.values:
fill(v, datum, weight)
indexing.entries += weight
def combine(one, two):
if len(one.values) != len(two.values):
raise Exception
entries = one.entries + two.entries
values = []
for v1, v2 in zip(one.values, two.values):
values.append(combine(v1, v2))
return Index.ed(entries, values)
JSON fragment format
JSON object containing
entries
(JSON number or “inf”)sub:type
(JSON string), name of the sub-aggregator typedata
(JSON array of aggregators)
The fact that Index requires all contents to have a single type allows the sub:type
to be specified once here, instead of once for every item in the collection. Branch is more verbose.
Example:
{"version": "1.0",
"type": "Index",
"data": {
"entries": 123.0,
"sub:type": "Average",
"data": [
{"entries": 123.0, "mean": 3.14},
{"entries": 123.0, "mean": 6.28},
{"entries": 123.0, "mean": 99.9}]}}
Branch: tuple of different types
Accumulate aggregators of different types, indexed by i0 through i9. Every sub-aggregator is filled with every input datum.
This primitive provides an anonymous collection of aggregators of different types, usually for gluing together various statistics. For instance, if the following associates a sum of weights to every bin in a histogram,
Bin.ing(100, 0, 1, lambda d: d.x,
Sum.ing(lambda d: d.weight))
the following would associate the sum of weights and the sum of squared weights to every bin:
Bin.ing(100, 0, 1, lambda d: d.x,
Branch.ing(Sum.ing(lambda d: d.weight),
Sum.ing(lambda d: d.weight**2)))
Branch is a basic building block for complex aggregators. The limitation to ten branches, indexed from i0 to i9, is a concession to type inference in statically typed languages. It is not a fundamental limit, but the type-metaprogramming becomes increasingly complex as branches are added. Error messages may be convoluted as the compiler presents internals of the type-metaprogramming in response to a user’s simple mistake.
Therefore, individual implementations may allow more than ten branches, but the Histogrammar standard only requires ten.
To collect an unlimited number of aggregators of the same type without naming them, use Index. To collect aggregators of the same type with string-based labels, use Label. To collect aggregators of different types with string-based labels, use UntypedLabel.
Branching constructor and required members
Branch.ing(values)
values
(list of present-tense aggregators) is the collection of aggregators to fill.entries
(mutable double) is the number of entries, initially 0.0.
Branched constructor and required members
Branch.ed(entries, values)
entries
(double) is the number of entries.values
(list of past-tense aggregators) is the collection of filled aggregators.
Fill and combine algorithms
def fill(branching, datum, weight):
if weight > 0.0:
for v in branching.values:
fill(v, datum, weight)
branching.entries += weight
def combine(one, two):
if len(one.values) != len(two.values):
raise Exception
entries = one.entries + two.entries
values = []
for v1, v2 in zip(one.values, two.values):
values.append(combine(v1, v2))
return Branch.ed(entries, values)
JSON fragment format
JSON object containing
entries
(JSON number or “inf”)data
(JSON array of JSON objects containingtype
(JSON string), name of sub-aggregator type anddata
(sub-aggregator))
Example:
{"version": "1.0",
"type": "Branch",
"data": {
"entries": 123.0,
"data": [
{"type": "Count", "data": 123.0},
{"type": "Average", "data": {
"entries": 123.0,
"mean": 3.14}},
{"type": "Bin", "data": {
"low": -5.0,
"high": 5.0,
"entries": 123.0,
"name": "position [cm]",
"values:type": "Count",
"values": [10.0, 20.0, 20.0, 30.0, 30.0],
"underflow:type": "Count",
"underflow": 5.0,
"overflow:type": "Count",
"overflow": 8.0,
"nanflow:type": "Count",
"nanflow": 0.0}}]}}
Aliases: common functions and compositions
Although the following could be constructed by hand, they are so often used in data analyses that they have their own constructors. They are not distinct types, though: an aggregator created by explicit construction is completely interchangeable with an aggregator created by one of the following convenience functions.
Identity
A transform (for Count) named "identity"
exists in the Histogrammar namespace with the following definition:
def identity(weight):
return weight
Unweighted
A weighting function named "unweighted"
exists in the Histogrammar namespace with the following definition:
def unweighted(datum):
return 1.0
Histogram
An interval is divided into bins and the number of entries (sum of weights) is counted in each bin. All plotting front-ends should be capable of displaying this.
def Histogram(num, low, high, quantity, selection=unweighted):
return Select.ing(selection, Bin.ing(num, low, high, quantity,
Count.ing(), Count.ing(), Count.ing(), Count.ing()))
num
(32-bit integer) is the number of bins; must be at least one.low
(double) is the minimum-value edge of the first bin.high
(double) is the maximum-value edge of the last bin; must be strictly greater thanlow
.quantity
(function returning double) computes the quantity to bin from the data.selection
(function returning boolean or double) computes the quantity to use as a selection (multiplicative factor on weight).
SparselyHistogram
No memory is allocated for empty bins. This allows the analyst to plot a whole distribution without first knowing its support. (The analyst must have an appropriate choice of bin width, however.) Most plotting front-ends convert it into a dense representation immediately before plotting.
This kind of aggregator has two dangers: (1) if the binWidth
is too small or the distribution has long tails, it can use a large amount of memory, and (2) if it has a few far outliers (e.g. 1e9 to express “missing value”), then the conversion to a dense representation can use a large amount of memory.
def SparselyHistogram(binWidth, quantity, selection=unweighted, origin=0.0):
return Select.ing(selection,
SparselyBin.ing(binWidth, quantity, Count.ing(), Count.ing(), origin))
binWidth
(double) is the width of a bin; must be strictly greater than zero.quantity
(function returning double) computes the quantity to bin from the data.selection
(function returning boolean or double) computes the quantity to use as a selection (multiplicative factor on weight).origin
(double) is the left edge of the bin whose index is 0.
Profile
Views a two-dimensional dataset as a function by binning along one axis and averaging the other.
For a “profile plot” as defined in HBOOK, PAW, and ROOT, see ProfileErr, which includes the variance in each bin to compute errors on the means.
def Profile(num, low, high, binnedQuantity, averagedQuantity, selection=unweighted):
return Select.ing(selection,
Bin.ing(num, low, high, binnedQuantity,
Average.ing(averagedQuantity)))
num
(32-bit integer) is the number of bins; must be at least one.low
(double) is the minimum-value edge of the first bin.high
(double) is the maximum-value edge of the last bin; must be strictly greater thanlow
.binnedQuantity
(function returning double) computes the quantity to bin from the data.averagedQuantity
(function returning double) computes the quantity to average from the data.selection
(function returning boolean or double) computes the quantity to use as a selection (multiplicative factor on weight).
SparselyProfile
Views a two-dimensional dataset as a function by sparsely binning along one axis and averaging the other.
def SparselyProfile(binWidth, binnedQuantity, averagedQuantity, selection=unweighted, origin=0.0):
return Select.ing(selection,
SparselyBin.ing(binWidth, binnedQuantity,
Average.ing(averagedQuantity), Count.ing(), origin))
binWidth
(double) is the width of a bin; must be strictly greater than zero.binnedQuantity
(function returning double) computes the quantity to bin from the data.averagedQuantity
(function returning double) computes the quantity to average from the data.selection
(function returning boolean or double) computes the quantity to use as a selection (multiplicative factor on weight).origin
(double) is the left edge of the bin whose index is 0.
ProfileErr
Views a two-dimensional dataset as a function by binning along one axis and averaging the other, with variances to compute the error on the mean.
def ProfileErr(num, low, high, binnedQuantity, averagedQuantity, selection=unweighted):
return Select.ing(selection,
Bin.ing(num, low, high, binnedQuantity,
Deviate.ing(averagedQuantity)))
num
(32-bit integer) is the number of bins; must be at least one.low
(double) is the minimum-value edge of the first bin.high
(double) is the maximum-value edge of the last bin; must be strictly greater thanlow
.binnedQuantity
(function returning double) computes the quantity to bin from the data.averagedQuantity
(function returning double) computes the quantity to average from the data.selection
(function returning boolean or double) computes the quantity to use as a selection (multiplicative factor on weight).
SparselyProfileErr
Views a two-dimensional dataset as a function by sparsely binning along one axis and averaging the other, with variances to compute the error on the mean.
def SparselyProfileErr(binWidth, binnedQuantity, averagedQuantity, selection=unweighted, origin=0.0):
return Select.ing(selection,
SparselyBin.ing(binWidth, binnedQuantity,
Deviate.ing(averagedQuantity), Count.ing(), origin))
binWidth
(double) is the width of a bin; must be strictly greater than zero.binnedQuantity
(function returning double) computes the quantity to bin from the data.averagedQuantity
(function returning double) computes the quantity to average from the data.selection
(function returning boolean or double) computes the quantity to use as a selection (multiplicative factor on weight).origin
(double) is the left edge of the bin whose index is 0.
TwoDimensionallyHistogram
Views a two-dimensional distribution in its entirety by counting the number of entries (sum of weights) in a grid of rectangular bins.
def TwoDimensionallyHistogram(xnum, xlow, xhigh, xquantity,
ynum, ylow, yhigh, yquantity,
selection=unweighted):
return Select.ing(selection,
Bin.ing(xnum, xlow, xhigh, xquantity,
Bin.ing(ynum, ylow, yhigh, yquantity)))
xnum
(32-bit integer) is the number of bins along the x-axis; must be at least one.xlow
(double) is the minimum-value edge of the first x-axis bin.xhigh
(double) is the maximum-value edge of the last x-axis bin; must be strictly greater thanxlow
.xquantity
(function returning double) computes the quantity to bin along the x-axis from the data.ynum
(32-bit integer) is the number of bins along the y-axis; must be at least one.ylow
(double) is the minimum-value edge of the first y-axis bin.yhigh
(double) is the maximum-value edge of the last y-axis bin; must be strictly greater thanylow
.yquantity
(function returning double) computes the quantity to bin along the y-axis from the data.selection
(function returning boolean or double) computes the quantity to use as a selection (multiplicative factor on weight).
TwoDimensionallySparselyHistogram
Views a two-dimensional distribution in its entirety by counting the number of entries (sum of weights) in a conceptual grid of rectangular bins; the bins are only filled if non-zero.
def TwoDimensionallySparselyHistogram(xbinWidth, xquantity,
ybinWidth, yquantity,
selection=unweighted,
xorigin=0.0, yorigin=0.0):
return Select.ing(selection,
SparselyBin.ing(xbinWidth, xquantity,
SparselyBin.ing(ybinWidth, yquantity,
Count.ing(), Count.ing(), yorigin), Count.ing(), xorigin))
xbinWidth
(double) is the width of a bin along the x-axis; must be strictly greater than zero.xquantity
(function returning double) computes the quantity to bin along the x-axis from the data.ybinWidth
(double) is the width of a bin along the y-axis; must be strictly greater than zero.yquantity
(function returning double) computes the quantity to bin along the y-axis from the data.selection
(function returning boolean or double) computes the quantity to use as a selection (multiplicative factor on weight).xorigin
(double) is the left edge of the bin on the x-axis whose index is 0.yorigin
(double) is the left edge of the bin on the y-axis whose index is 0.