
Use eCDFs instead of histograms

Histograms are usually the first plot people reach for when they want to look at a distribution. They are familiar and often good enough. But for many distribution questions, they are the wrong default.

My default has long been the empirical CDF, or eCDF. Once you get used to reading them, eCDFs are almost always cleaner: there is no bin width to choose, percentiles can be read directly off the curve, and zooming in does not make you forget about the tails.

An eCDF plots the fraction of observations less than or equal to each value. Formally, for a sample $x_1, \ldots, x_n$, it is

$$\hat F(t) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}(x_i \le t).$$

In plain English: the x-axis is the value, and the y-axis is the cumulative share of the data. If the curve is at 0.8 when $x = 12$, then 80% of the observations are at or below 12.
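The definition translates almost line for line into code. Here is a minimal sketch in plain Python; the function name `ecdf` and the ten-value sample are made up for illustration and are not taken from the interactive example.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return F_hat, where F_hat(t) is the fraction of observations <= t."""
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must be non-empty")

    def f_hat(t):
        # bisect_right returns how many sorted values are <= t.
        return bisect_right(xs, t) / n

    return f_hat


# Hypothetical sample, chosen so that 8 of the 10 values are at or below 12.
sample = [3, 7, 8, 9, 10, 11, 12, 12, 15, 40]
F = ecdf(sample)

print(F(12))  # 0.8
print(F(2))   # 0.0
print(F(40))  # 1.0
```

Sorting once and using binary search keeps each lookup cheap, but a plain `sum(x <= t for x in sample) / len(sample)` would give the same numbers.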

Here is a simple example. The sample mostly comes from a central distribution, with a visible right-tailed outlier component. Draw a few random samples, change the histogram bin width, and toggle the x-axis zoom. The histogram keeps asking you to make plotting choices; the eCDF answers distribution questions directly.

Histogram vs eCDF on the same sample

The sample has a visible right-tailed outlier component. Zooming hides those x-values, but the eCDF still shows the missing tail mass as vertical distance from 100%.

[Interactive figure, two panels plotted against value. Histogram: local mass depends on bin width. eCDF: cumulative share keeps the tail mass visible, with the median and 90th percentile marked. Readouts at the full sample range: sample size 420 (420 visible); hidden tails 0.0% (0 below / 0 above); median 0.2, read at eCDF = 50%; 90th percentile 1.9, read at eCDF = 90%.]

No bins

The first advantage is that eCDFs have no bin width.

With a histogram, the shape can change depending on where the bins start and how wide they are. Make the bins too narrow and the plot looks noisy. Make them too wide and you smooth away real structure. There are rules of thumb for choosing bin widths, but the fact that you need a rule of thumb is already a smell.

An eCDF has no bins. Every observation is used directly, and the plot means exactly what it says. There is no hidden smoothing parameter quietly changing the visual impression.
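To make the bin problem concrete, here is a small sketch with a made-up six-point sample. The helpers `histogram_counts` and `ecdf_points` are hypothetical names, and the bin widths and origins are arbitrary choices, which is rather the point.

```python
import math
from collections import Counter


def histogram_counts(sample, width, origin=0.0):
    """Count observations per bin [origin + k*width, origin + (k+1)*width)."""
    counts = Counter(math.floor((x - origin) / width) for x in sample)
    # Key each bin by its left edge.
    return {origin + k * width: c for k, c in sorted(counts.items())}


def ecdf_points(sample):
    """(value, cumulative share) pairs; no width or origin to choose.

    With tied values this emits one point per observation, so the top
    point at each x is the eCDF height there.
    """
    xs = sorted(sample)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


# Hypothetical sample: three tight pairs around 1, 2 and 3.
sample = [0.9, 1.1, 1.9, 2.1, 2.9, 3.1]

# Same data, same bin width, different bin origin: the "shape" changes.
print(histogram_counts(sample, width=1.0))              # {0.0: 1, 1.0: 2, 2.0: 2, 3.0: 1}
print(histogram_counts(sample, width=1.0, origin=0.5))  # {0.5: 2, 1.5: 2, 2.5: 2}

# The eCDF is determined by the data alone.
print(ecdf_points(sample))
```

With the first origin the histogram looks peaked in the middle; shift the bins by half a unit and it looks flat. The eCDF points do not move.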

Percentiles are the native units

The second advantage is that percentiles are trivial to read. Want the median? Look at where the curve crosses 0.5. Want the 90th percentile? Look at where it crosses 0.9. Want to compare two distributions? Plot two eCDFs and see which curve reaches a given cumulative share sooner. This makes the plot especially nice for questions like “how bad is the worst 10%?” or “what share is under this threshold?”

Histograms are worse for this because they show local mass rather than cumulative probability. That is sometimes what you want, but often you end up mentally integrating the bars anyway. If the question is about ranks, percentiles, thresholds, or tail probabilities, the eCDF is already in the right units.
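Reading a percentile or a threshold share off the curve corresponds to two tiny functions. This is a sketch under one assumption: "where the curve crosses p" is taken to mean the smallest observed value whose eCDF height is at least p, which can differ slightly from interpolating quantile conventions. The names and both samples are hypothetical.

```python
def ecdf_quantile(sample, p):
    """Smallest observed value whose eCDF height is at least p (0 < p <= 1)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    xs = sorted(sample)
    n = len(xs)
    # Walk up the steps of the eCDF until the height reaches p.
    return next(x for i, x in enumerate(xs) if (i + 1) / n >= p)


def share_at_or_below(sample, threshold):
    """eCDF height at the threshold: fraction of observations <= threshold."""
    return sum(x <= threshold for x in sample) / len(sample)


# Two made-up samples to compare.
a = [3, 7, 8, 9, 10, 11, 12, 12, 15, 40]
b = [5, 9, 10, 12, 13, 14, 16, 18, 25, 60]

print(ecdf_quantile(a, 0.5), ecdf_quantile(b, 0.5))        # 10 13
print(ecdf_quantile(a, 0.9), ecdf_quantile(b, 0.9))        # 15 25
print(share_at_or_below(a, 12), share_at_or_below(b, 12))  # 0.8 0.4
```

Sample `a` reaches every cumulative share at a smaller value than `b`, which on a plot shows up as the eCDF of `a` sitting to the left of the eCDF of `b`.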

Zooming without losing the tails

The third advantage is that eCDFs behave much better when there are outliers.

With a histogram, a few extreme values can stretch the x-axis so far that the relevant part of the distribution gets crushed into a small region. You can zoom in, but then the tails disappear from the plot entirely. So you are forced to choose between seeing the main body clearly and remembering that the tails exist.

With an eCDF, zooming into the relevant x-range does not have the same failure mode. If you cut off the x-axis at, say, the 99th percentile, the curve simply stops near 0.99. The missing 1% is still visible as missing vertical distance. You can focus on the region where most of the data lives without pretending that the tail mass is zero.

This is a nice separation of concerns. The x-axis can focus on the region you care about, while the y-axis still accounts for the whole sample. If the curve ends at 0.96, then 4% of the sample remains above the visible range. That is much harder to miss than a zoomed histogram with chopped-off bars.
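The bookkeeping behind that claim is simple enough to sketch. The function below computes how much of the sample falls outside a chosen x-range, in the spirit of a "hidden tails" readout; the name `hidden_tail_shares`, the sample, and the axis limits are all made up for illustration.

```python
from bisect import bisect_left, bisect_right


def hidden_tail_shares(sample, x_min, x_max):
    """Share of the sample strictly below x_min and strictly above x_max."""
    xs = sorted(sample)
    n = len(xs)
    below = bisect_left(xs, x_min) / n          # values < x_min
    above = (n - bisect_right(xs, x_max)) / n   # values > x_max
    return below, above


# Hypothetical sample: 96 values in the main body plus 4 far-right outliers.
sample = list(range(1, 97)) + [500, 800, 1200, 5000]

below, above = hidden_tail_shares(sample, x_min=0, x_max=100)
print(below, above)  # 0.0 0.04
```

Zoomed to the range 0 to 100, the eCDF of this sample ends at 0.96 on the right edge, and the remaining 0.04 is exactly the `above` share. A histogram zoomed to the same range would show the 96 body values and nothing else.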

The familiarity trap

The real advantage of histograms is familiarity. People know how to read them because they have seen them a thousand times. That is not nothing, and in narrow cases it may be decisive. Otherwise, “histogram” mostly means “the plot everyone is used to”, which is not a statistical argument.

For routine distribution checks, histograms are usually a bad bargain. They make you choose bins, they hide percentiles, and they handle outliers poorly. If the question is about thresholds, ranks, quantiles, tail probabilities, or comparing distributions, a histogram is forcing the reader to do extra work. The eCDF is already showing the relevant object.
