# HISTOGRAM: The Breathless Horror and Disgust -- by JD Smith

There is something at work in my soul which I do not understand . . . there is a love for the marvellous, a belief in the marvellous, intertwined in all my projects, which hurries me out of the common pathways of men, even to the wild sea and unvisited regions I am about to explore. --Mary Shelley,FRANKENSTEIN

*[Editor's Note: It would not be putting too fine a point on the subject
to say that the following tutorial, written by JD Smith, is famous. And
justifiably so. It is the only coherent and complete explanation of
the Histogram command extant. Read it several times. You will
learn more each time you do. And, maybe, if you read it enough, the
secrets of the command will be revealed to you. Stranger things have
happened.]*

You've seen it... scattered about old tomes of code like some
forgotten incantation drawn from an ancient tongue, summoning unspoken
forces. Perhaps you felt a deep, visceral uneasiness in the presence of
its occult construction, arcane invocation, and dark power. Though it
may turn your stomach, `HISTOGRAM` is a monster which, when tamed
and brought into your service, can be a powerful ally in your quest to
avoid the dreaded and oft-inefficient IDL `for` loop. With it
you can sort and select, trim and smooth, index and decimate: almost the
full range of array manipulations, including many you probably didn't
think could be done efficiently in IDL.

Although a full account of `HISTOGRAM`'s many uses would
occupy a large volume, here I'll try to build a foundation on which many
novel variations can be built.

## Remedial Review

A histogram, as we all know (at least those of us who didn't spend Algebra II passing notes and perfecting spitball ranging techniques) represents nothing more than a fancy way to count: a series of numbers in an input vector is divvied up into bins according to their value, and the (fixed) size of the histogram bin. For each input value which falls into a given bin, a single "drop in the bucket" is added. Let's start with a simple, familiar example:

IDL> print,histogram(indgen(10),BINSIZE=5) 5 5

Indeed, 5 values each fell in the two bins of size 5. What about:

IDL> print,histogram(indgen(10)) 1 1 1 1 1 1 1 1 1 1

Here you see the default is a binsize of 1 -- one drop fell into each bin. You can of course also take histograms of floating point values:

IDL> print,histogram(randomu(sd,100),BINSIZE=.2,MIN=0.) 20 20 20 18 22

Looks about right. If this were all there was to `HISTOGRAM`,
there wouldn't be much use in a tutorial. You'll soon see this is far
from the case.

## We Can Build Him Faster

A surprising point worth mentioning about `HISTOGRAM` is that
it's very fast: *much* faster than direct `for` loops, and
usually noticeably faster than equivalent solutions using,
e.g. `WHERE` (where they exist). Sometimes these savings result
from a more efficient algorithm, e.g. avoiding unnecessary repeated
linear searches through an input vector. But even when the number of
operations (e.g. indexing) is equivalent, `HISTOGRAM` can
outperform other IDL built-ins -- it's very well optimized.

Another thing to remember: although the `for` loop is quite
slow in IDL, if the number of iterations is kept small, and the work
done per iteration is large, you won't feel the looping penalty.
`HISTOGRAM` is often very helpful in this respect. A typical
rule of thumb: if you're looping over each data element individually,
there's (probably) a faster way to do it.

## Bins and Buckets

A slight detour brings us through the littered landscape strewn with
various inputs to `HISTOGRAM` which work together to specify that
all important ingredient: the size of the histogram bin.

`MAX`- The maximum value of the array to consider.
*Can change from the input*-- see`NBINS`. `MIN`- The minimum value of the array to consider.
`BINSIZE`- The size of the bin, which defaults to 1. It's worth noting that the binsize is constant throughout the histogram.
`NBINS`- A relative of
`BINSIZE`,`NBINS`is something of a misnomer. The relation`HISTOGRAM`uses to compute the bin size if passed`NBINS`is:

`BINSIZE`=`(``MAX`-`MIN`)/(`NBINS`-1)

and if`NBINS`is specified,`MAX`is changed to be (independent of any value passed as`MAX`):

`MAX`=`NBINS`*`BINSIZE`+`MIN`

As such, it's probably better to avoid`NBINS`, if you care about the`MAX`value staying put. A better relation which would leave`MAX`as is and give you exactly`NBINS`bins between`MIN`and`MAX`:

`BINSIZE`=`(``MAX`-`MIN`)/`NBINS` `OMIN|OMAX`- These output keywords give you the min and max value used to construct the histogram. Useful if you don't specify them directly.

Remember that `MIN` and `MAX` will default to the minimum
and maximum of the array (except for byte data, which defaults to
`MIN`=0). It's very helpful to use `OMIN` if you don't
specify the minimum, so you know what the first bin corresponds to.

## When Good `HISTOGRAM`'s Go Bad

An important point to keep in mind is that `HISTOGRAM` creates
one bin for each unit bin size between the minimum and maximum, even if
no such data are present in the input! This means that histograms of
sparse arrays can be very wasteful:

IDL> h=histogram(MIN=0,[100000000L]) IDL> print,n_elements(h) 100000001

That's a lot of zeroes, just to tell us we have one data point in the
100,000,000th bin! In some cases, you can partially mitigate this
problem by not specifying `MIN` directly, using the `OMIN`
output instead:

IDL> h=histogram(OMIN=om,[100000000L]) IDL> print,n_elements(h) 1 IDL> print,om 100000000

But more often for sparsely populated data this won't help:

IDL> h=histogram(OMIN=om,[0L,100000000L]) IDL> print,n_elements(h) 100000001

You must always balance the speed gains of `HISTOGRAM` against
its potentially large memory usage in the case of very sparse arrays.
The sparseness of course depends on the size of your bin (e.g. if I had
made a bin size of 100,000,001 above, there would have only been one
bin), but for the typical case of integers with binsize=1, it's easy to
understand the data sparseness as "what fraction of integers, on
average, are present in the data between its min and max?" Below you'll
see how to compute this.

## Reverse Indices

This little `HISTOGRAM` jewel, obtained by setting the
`REVERSE_INDICES` keyword, is actually its most useful and least
understood aspect. It tells you, for each bin in the histogram, the
actual indices of the data which fell into that bin. The format of the
reverse indices vector is sadly quite obtuse, turning many away from its
use, but in fact it's really quite simple. In a call like:

IDL> data=fix(randomu(101,25)*12) IDL> h=histogram(data,OMIN=om,REVERSE_INDICES=ri)

The `ri` vector is *actually* (and quite disturbingly) two
vectors in one, concatenated together. I call them the
`i`-vector and the `o`-vector:

======================================================= The `i' vector The `o' vector ri = iiiiiiiiiiiiiiiiiiioooooooooooooooooooooooooooooo |-----------------||----------------------------| |0 nh||nh+1 nh+total(h)| =======================================================

The last index of the `i`-vector, `nh`, is the number of
elements in the histogram. The `i`-vector is actually an index
into the `o`-vector portion of the reverse indices list itself: a
list which indexes itself! Scary, yes, but that's just unusual
bookkeeping. It's much simpler than all that. The easiest way to think
of it is as follows: each bin of the histogram has a short list of zero
or more indices associated with it -- picture each drop in the bucket
painted with the index of the data to which it corresponds:

| | | | |6 | | | | | | | | | |5 | | |10 | | | | | | |3 | |7 |12 | | |15 | | | | | | |9 | |11 |20 | |16 |19 | | | | | |8 |17 | |14 |21 |1 |24 |23 |0 |2 |4 | |18 |13 |22 | +---+---+---+---+---+---+---+---+---+---+---+---+ 0 1 2 3 4 5 6 7 8 9 10 11

The `o`-vector contains these indices, in order, and the
`i`-vector just shows us where to go to get them. E.g., to get
the indices from the first bin of the histogram, we use:

IDL> print, ri[ri[0]:ri[1]-1] 7 11 14 IDL> print, data[ri[ri[0]:ri[1]-1]] 0 0 0 IDL> print,ri[0],ri[1]-1 13 15

and from the 5th bin:

IDL> print, ri[ri[4]:ri[5]-1] 6 10 15 19 23 IDL> print, data[ri[ri[4]:ri[5]-1]] 4 4 4 4 4

That is, adjacent values in the `i`-vector part of `ri`
specify the *range* of elements in the `o`-vector part of
`ri` which contain the indices of data present in that bin. In
the first example, there were 3 data elements which fell in the first
bin (all, as you'd anticipate, with a value of 0). What if no data fall
in a given bin?

IDL> print,where(h eq 0) 8 IDL> print,ri[8],ri[9] 31 31

In this case, you see that the two adjacent elements of the
`i`-vector are the same: they don't span any elements of the
`o`-vector. Typically, you need to test for this to avoid a null
index range. Something like this.

IDL> if ri[4] ne ri[5] then print, data[ri[ri[4]:ri[5]-1]] else print, 'No data in bin 4' IDL> if ri[8] ne ri[9] then print, data[ri[ri[8]:ri[8]-1]] else print, 'No data in bin 8' 4 4 4 4 4 No data in bin 8

Note that the smallest value in the `i`-vector is
`nh+1`, i.e. the first index of the `o`-vector, one more
than the number of elements in the histogram. This is just because
`i` and `o` are glued together like that -- had the
`i`-vector and `o`-vector been kept separate, the former
would have started at 0.

The main point to remember: `HISTOGRAM` actually has
*three* very useful outputs, not one: the histogram itself,
`h`, the reverse index self-index vector `i`, and the
original index vector `o`. Sometimes they're useful together,
and sometimes only one of the three holds the key to solving a
particular problem.

## Flexing your `HISTOGRAM` Muscle

The best way to learn `HISTOGRAM` techniques is to dive right
into a set of working examples. Don't forget to come up for air!

### Using the `h`-vector

Perhaps the most common set of operations in IDL is building,
reformatting, dissecting, and applying lists of indices to arrays of
various dimensionality (see, for instance the dimensional juggling tutorial).
`HISTOGRAM` is no stranger to these index slinging operations. A
simple example taken from the manual:

**Problem**: Increment each element of a vector listed in a index
vector by 1.

IDL> inds=fix(randomu(sd,8)*3) IDL> print,inds 2 2 0 1 2 0 1 2 IDL> vec=intarr(3) IDL> vec=histogram(inds,INPUT=vec,MIN=0) IDL> print,vec 2 2 4

This works even if individual indices are mentioned several times (in
which case the explicit `vec[inds]=vec[inds]+1` does *not*
work). `INPUT` is a convenience, equivalent to
`vec=vec+histogram(...)`. You simply can't do this with normal
indexing, without a `for` loop. You needn't just increment by
one either:

**Problem**: Increment all elements of vec with indices mentioned at
least twice by PI.

IDL> vec=vec+(histogram(inds,MIN=0,MAX=n_elements(vec)-1) ge 2)*!PI

**Problem**: Find the value intersection of two vectors,
a.k.a. which values are present in both `a` and `b`?

IDL> a=fix(randomu(sd,8)*20) IDL> b=fix(randomu(sd,8)*20) IDL> print,a,b 6 7 16 13 15 13 19 7 16 5 8 17 7 7 4 7 IDL> print,where(histogram(a,OMIN=om) gt 0 AND histogram(b,MIN=om) gt 0)+om 7 16

I.e., we look for which buckets contain at least one drop for both
vectors, using `OMIN` to save some work (since there can be no
overlap below the minimum of either array). You could also easily use
this method to find values repeated exactly twice, or present in one but
not another (the set *difference*).

**Problem**: Remove some elements, listed in random order, from a
vector.

IDL> vec=randomu(sd,10) IDL> remove=[3,7,2,8] IDL> keep=where(histogram(remove,MIN=0,MAX=n_elements(vec)-1) eq 0,cnt) IDL> if cnt ne 0 then vec=vec[keep] IDL> print,keep 0 1 4 5 6 9

We've used `HISTOGRAM` and `WHERE` to simply generate a
list of kept indices.

**Problem**: Find a multi-dimensional histogram of a set of input
coordinate tuples (e.g. `(x,y,z)`).

This general class of problems (solved in 2D by RSI's
`HIST_2D`, and in up to eight dimensions by HIST_ND)
also has other applications. The key trick is to scale all data into
integer bins yourself, e.g.:

IDL> h=(data-min(data))/binsize

and to convert these multi-dimensional scaled "indices" into a single
index (exactly analogous to converting a `(column, row)` index of
an array into a single running index). In 3D this looks like:

IDL> h=h0+nbins[0]*(h1+nbins[1]*h2) IDL> ret=histogram(h,MIN=0,MAX=total_bins-1)

where `h0`,`h1`, and `h2` are the scaled data
"indices" from the first, second and third dimensions, and
`total_bins` is the total number of bins in your grid (just the
product of the bin sizes). You can see how to generalize to higher
dimensions, and how to apply the same technique to do other work with
data sets of more than one dimension.

**Problem**: Determine how sparse a set of input data is.

IDL> data=fix(randomu(sd,1000)*10000) IDL> h=histogram(data) IDL> print,total(h ne 0)/n_elements(h) 0.0953624

I.e., very close to 1 in 10 integers are present. This works even if your data are very clustered and have a tendency towards repeated values.

### Using the `o`-vector

The `o`-vector portion of the reverse indices vector is essential
for its intended use: gathering together those elements of the input
data vector which fall in an individual bin. With it, you can easily
perform operations on data which have been divided up into bins of any
given size. And unlike linear search methods, you can precompute the
histogram with reverse indices, and subsequently perform many different
operations on different data sub-groups using the same histogram, for a
substantial speedup.

**Problem**: Find the individual median in each quartile of a data
set.

IDL> data=randomu(sd,100)*100 IDL> h=histogram(data,BINSIZE=25, REVERSE_INDICES=ri) IDL> med=fltarr(4) IDL> for j=0L,3L do if ri[j+1] gt ri[j] then $ med[j]=median(data[ri[ri[j]:ri[j+1]-1]]) IDL> print,med 13.3847 39.0525 68.0970 87.5475

This pattern forms the standard usage of the `o`-vector component
of the reverse indices vector: loop over histogram bins, and, if
non-empty, do something to the data which fell into that bin.

The reverse indices don't have to be used to index the original
array, much as an index list returned by `SORT` doesn't have to
be used to sort the vector it was passed:

Editor's Note: Please note that quartiles can only be calculated like this if you data is uniformly distributed. The proper definition of quartile would suggest that you are going to have four bins and the same number of points would be in each bin. That is, the median would separate the data set into two bins of equal number of points. Then, taking the median of those two sub-bins would result in the two quartiles (25% and 75%). In practice, say you wanted to draw a box and whisker plot, you might do something like this if you had a non-uniform distribution of points.

data=randomu(sd,100)*100 minVal = min(data) maxVal = max(data) medianVal = median(data,/even) ; Find the quartiles. qtr_25th = Median(data[Where(data LE medianVal, countlowerhalf)]) qtr_75th = Median(data[Where(data GT medianVal, countupperhalf)])

**Problem**: Total data together according to a separate list of
indices, e.g.:

inds=[3 , 1, 0, 4, 2, 3, 3, 2, 4, 0] data=[.1,.8,.6,.4,.2,.9,.7,.3,.5,.2] vec[0]=data[2]+data[9] vec[1]=data[1] ....

This is easy, using the `o`-vector -- simply use the reverse
index of the `inds` vector as an index to another vector --
`data`:

IDL> h=histogram(inds,reverse_indices=ri,OMIN=om) IDL> for j=0L,n_elements(h)-1 do if ri[j+1] gt ri[j] then $ vec[j+om]=total(data[ri[ri[j]:ri[j+1]-1]])

For large histograms, there are even more efficient ways to do this with very short or no loops (e.g. using a histogram of the histogram). Additonal information on this topic can be found in the article on array decimation.

**Problem**: Threshold a vector, setting all elements between 20 and 60,
inclusive, to zero.

IDL> data=fix(randomu(sd,100)*100) IDL> h=histogram(data,OMIN=om,REVERSE_INDICES=ri) IDL> data[ri[ri[20-om>0]:ri[61-om>0]-1]]=0.

What's this? We've clearly violated the time honored
`r[i]:r[i+1]` convention, and skipped right through 41 bins worth
of `o`-vector indices in one go. Since the `o`-vector
contains adjacent groups of indices one right after another, this is
perfectly legal. You'll get into trouble if there are no elements
between the limits (i.e. if `ri[20-om>0] eq ri[61-om>0]`), but
you can test for that.

**Problem**: Throw away data with more than twice the median
repeat count rate.

data=fix(randomn(sd,100)*5) h=histogram(data,REVERSE_INDICES=ri) keep=where(h le 2*median(h)) num=h[keep] for n=1,max(num) do begin wh=where(num ge n,cnt) if cnt eq 0 then continue if n_elements(k_inds) eq 0 then k_inds=ri[ri[keep[wh]]+n-1] else $ k_inds=[k_inds,ri[ri[keep[wh]]+n-1]] endfor data=data[k_inds]

Here we've chosen to loop over bin count and not directly over the reverse indices, to reduce the size of the loop (with its costly concatenation). The simpler form using the standard pattern is:

for i=0,n_elements(keep)-1 do begin k=keep[i] if ri[k+1] gt ri[k] then $ if n_elements(k_inds) eq 0 then k_inds=ri[ri[k]:ri[k+1]-1] $ else k_inds=[k_inds,ri[ri[k]:ri[k+1]-1]] endfor data=data[k_inds]

If you need to preserve the ordering of the kept elements, accumulate a
throw-away index list similarly (e.g. `bad=where(h ge
2*median(h))`, and then use the method from the previous section to
generate an ordered kept-index list from it.

### Using the `i`-vector

The `i`-vector is the least used portion of `HISTOGRAM`'s
output (aside from the trivial case when it's used to index the
`o`-vector), but it can do some very neat things.

**Problem**: Construct a vector with values from one vector and a
repeat count from another (so called chunk indexing). E.g., turn:

IDL> d=[4,5,6,8] IDL> n=[2,1,3,4]

into

new_d=[4,4,5,6,6,6,8,8,8,8]

This is easy with the `i`-vector:

IDL> h=histogram(total(n,/CUMULATIVE)-1,/BINSIZE,MIN=0,REVERSE_INDICES=ri) IDL> i=ri[0:n_elements(h)-1]-ri[0] IDL> print,d[i] 4 4 5 6 6 6 8 8 8 8

What have we done here? Well, we've taken the cumulative sum of
`n`:

IDL> print,total(n,/CUMULATIVE)-1 1.00000 2.00000 5.00000 9.00000

created a histogram from it, and used the fact that the
`i`-vector contains multiple repeated indices for all the empty
bins:

IDL> print,i 0 0 1 2 2 2 3 3 3 3

Note that this subject is discussed in considerably more depth in a related article on array decimation.

## Applying the Current

Don't worry if everything doesn't come together at once. No monster was built in a day. Just keep tinkering with the examples, examining the by-products along the way, and soon you'll be terrorizing unsuspecting villagers the world-round with your creations.

Copyright © 2002 JD Smith

Last Updated Oct 14, 2002