On how many bins to choose for a histogram

A question

We often make histograms, and sooner or later most of us ask the same question: how many bins should I use? So what is the answer?

An answer

Here is an answer. In R, the function hist() defaults to breaks = "Sturges", which sets the number of bins to

$$ k = \lceil \log_2 n \rceil + 1, $$

where $\lceil \cdot \rceil$ is the ceiling function and $n$ is the number of data points. This should be our first try. If we are not entirely happy with the result, we can set breaks = "FD". FD stands for the Freedman–Diaconis rule, which sets the number of bins to

$$ k = \left\lceil\frac{\max(\text{data}) - \min(\text{data})}{h}\right\rceil, $$

where the bin width is

$$ h = 2\,\frac{\text{IQR}(\text{data})}{\sqrt[3]{n}} $$

and IQR is the interquartile range. In both cases hist() treats $k$ as a suggestion only, because the breakpoints are then set to "pretty" values. There is some evidence that the Freedman–Diaconis rule is “robust and works well in practice” (see [1] and [2]).
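To make the two formulas concrete, here is a minimal sketch that computes both bin counts by hand in base R. The simulated data, the seed and the variable names (x, k_sturges, k_fd) are assumptions made for illustration only.

```r
# Sketch: compute the Sturges and Freedman-Diaconis bin counts by hand.
set.seed(1)        # arbitrary seed, only for reproducibility
x <- rnorm(200)    # illustrative data (assumption, not from the post)
n <- length(x)

# Sturges: k = ceiling(log2(n)) + 1
k_sturges <- ceiling(log2(n)) + 1

# Freedman-Diaconis: bin width h = 2 * IQR / n^(1/3),
# then k = ceiling((max - min) / h)
h <- 2 * IQR(x) / n^(1/3)
k_fd <- ceiling((max(x) - min(x)) / h)

# Base R's helper for Sturges' rule, for comparison with the hand calculation
k_sturges_builtin <- nclass.Sturges(x)

cat("Sturges:", k_sturges, " built-in:", k_sturges_builtin, "\n")
cat("Freedman-Diaconis: h =", h, " k =", k_fd, "\n")
```

Base R also provides nclass.FD(x). It rounds the data to a few significant digits before taking the IQR, so it should usually, but not necessarily always, agree with the hand-computed k_fd.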

Note that if we use the ggplot2 package, we can first calculate $h$ and $k$, and then use either geom_histogram(aes(x), binwidth = h) or geom_histogram(aes(x), bins = k).
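As a rough sketch of that ggplot2 workflow, the snippet below computes the Freedman–Diaconis $h$ and $k$ and passes them to geom_histogram(). The data frame df, the simulated data and the plot object names are hypothetical, and the ggplot2 package is assumed to be installed.

```r
library(ggplot2)

set.seed(1)                         # arbitrary seed, only for reproducibility
df <- data.frame(x = rnorm(200))    # illustrative data (assumption)

n <- nrow(df)
h <- 2 * IQR(df$x) / n^(1/3)                 # Freedman-Diaconis bin width
k <- ceiling((max(df$x) - min(df$x)) / h)    # corresponding number of bins

# Two ways to apply the rule: fix the bin width, or fix the bin count
p_width <- ggplot(df) + geom_histogram(aes(x), binwidth = h)
p_bins  <- ggplot(df) + geom_histogram(aes(x), bins = k)
```

Printing p_width or p_bins draws the histogram. The two plots need not look identical, because ggplot2 chooses the bin boundaries differently depending on which argument is given.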

References

[1] Wikipedia

[2] Stack Exchange

Lingyun Zhang (张凌云)
Design Analyst

I have research interests in Statistics, applied probability and computation.