On how many bins to choose for a histogram
A question
We often make histograms. Have you ever asked yourself: how many bins should I set? Most of us have. But what is really the answer to this question?
An answer
Here is an answer. In R's hist() function, the default is breaks = "Sturges", which means the number of bins is
$$
k = \lceil \log_2 n \rceil + 1,
$$
where $\lceil \cdot \rceil$ is the ceiling function and $n$ is the number of data points. So this should be our first try. If we are not entirely happy with the result, we can set breaks = "FD". FD stands for the Freedman–Diaconis method, for which the number of bins is
$$
k = \left\lceil\frac{\hbox{max(data)} - \hbox{min(data)}}{h}\right\rceil,
$$
where
$$
h = 2\frac{\hbox{IQR}(\hbox{data})}{\sqrt[3]{n}}.
$$
There is some evidence that the Freedman–Diaconis method is “robust and works well in practice” (see [1] and [2]).
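To make the two formulas concrete, here is a minimal R sketch that computes both bin counts by hand. The helper names sturges_bins, fd_binwidth and fd_bins are made up for this illustration, and the data are simulated; the sketch also assumes the IQR of the data is positive, since otherwise $h = 0$.

```r
# Sturges' rule: k = ceil(log2(n)) + 1
sturges_bins <- function(x) {
  ceiling(log2(length(x))) + 1
}

# Freedman-Diaconis bin width: h = 2 * IQR(data) / n^(1/3)
# (assumes IQR(x) > 0; constant or near-constant data would give h = 0)
fd_binwidth <- function(x) {
  2 * IQR(x) / length(x)^(1/3)
}

# Freedman-Diaconis bin count: k = ceil((max(data) - min(data)) / h)
fd_bins <- function(x) {
  ceiling((max(x) - min(x)) / fd_binwidth(x))
}

set.seed(1)        # arbitrary seed, only for reproducibility
x <- rnorm(500)    # simulated data, purely for illustration

sturges_bins(x)    # 10, since ceil(log2(500)) = 9
fd_bins(x)         # depends on the sample

# Number of bins hist() actually uses with breaks = "FD"
length(hist(x, breaks = "FD", plot = FALSE)$counts)
```

The last line may not match fd_bins(x) exactly: hist() treats the computed number as a suggestion and passes it to pretty() to pick round breakpoints.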
Note that if we use the ggplot2 package, we can first calculate $h$ and $k$, and then use geom_histogram(aes(x), binwidth = h) or geom_histogram(aes(x), bins = k).
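As a rough sketch of how this might look in practice, the following computes $h$ and $k$ from the Freedman–Diaconis formulas and passes them to geom_histogram(). The data frame df and its column x are simulated and purely illustrative, and the code assumes ggplot2 is installed.

```r
library(ggplot2)

set.seed(1)                        # arbitrary seed, only for reproducibility
df <- data.frame(x = rnorm(500))   # simulated data, purely for illustration

# Freedman-Diaconis bin width and bin count, as defined above
h <- 2 * IQR(df$x) / nrow(df)^(1/3)
k <- ceiling((max(df$x) - min(df$x)) / h)

# Option 1: fix the bin width
p_width <- ggplot(df) + geom_histogram(aes(x), binwidth = h)

# Option 2: fix the number of bins
p_bins <- ggplot(df) + geom_histogram(aes(x), bins = k)

# print(p_width) or print(p_bins) draws the plot
```

The two plots need not look identical, because ggplot2 places the bin boundaries differently depending on whether a width or a count is given.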
References
[1] Wikipedia
[2] Stack Exchange