Histogram estimators (nonparametric statistics)

In nonparametric statistics, where you let the data speak for themselves instead of assuming a parametric model, the first tool many people think of is the histogram estimator. There is also a more advanced class of nonparametric estimators, namely kernel density estimators (KDE).

This post covers R code and some explanations for histogram estimators. The following post will cover KDEs and compare them to histogram estimators.

Histogram Estimators

For this post, we will use random draws from the skewed t distribution. We set a seed first so that the results are reproducible.

library(skewt)
set.seed(73)
draw_skewed_t = rskt(n = 50, df = 6, gamma = 0.4)
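As a quick illustration of what the seed buys us, the sketch below draws the sample twice, resetting the seed before each draw, and checks that both draws are identical. The names first_draw and second_draw are only illustrative and are not used elsewhere in this post.

```r
library(skewt)  # provides rskt(), as used above

# Draw the same sample twice, resetting the seed before each draw
set.seed(73)
first_draw <- rskt(n = 50, df = 6, gamma = 0.4)
set.seed(73)
second_draw <- rskt(n = 50, df = 6, gamma = 0.4)

# Check whether both draws consist of exactly the same numbers
same_draws <- identical(first_draw, second_draw)
print(same_draws)
```

Without the second set.seed(73) call, the two draws would generally differ, because the random number generator would continue from where the first draw left off.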

Then we plot the true density of the skewed t distribution with the same parameters that we used to draw our random sample.

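# Evaluate the true density at the sampled points
true_density = dskt(draw_skewed_t, df = 6, gamma = 0.4)
# Plot the true pdf on [-4, 4] and mark the sampled points on the curve
plot(function(x) dskt(x, df = 6, gamma = 0.4), -4, 4, main = "Plot of true pdf and our random draws", xlab = "", ylab = "")
points(draw_skewed_t, true_density)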

If you use the same seed, your plot will look like this:

Next, we create the histogram estimate using R's built-in hist() function and plot the results.

# Histogram on the density scale, so it is comparable to the true pdf
hist(draw_skewed_t, probability = TRUE, xlab = "", main = "Random sample, histogram estimator, true density")
# Overlay the true density and the sampled points
plot(function(x) dskt(x, df = 6, gamma = 0.4), -10, 4, add = TRUE)
points(draw_skewed_t, true_density)
legend("topleft", legend = c("True density", "Histogram estimate", "Random sample"),
       col = c("black", "gray", "black"), lty = c(1, 1, NA), pch = c(NA, NA, 1), cex = 0.7)
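To make explicit what hist() computes on the density scale, here is a minimal sketch that rebuilds the bar heights by hand: the height of each bar is the number of observations in the bin divided by (sample size × bin width). It reuses the bins that hist() picks by default; the names h, bin_index, counts and manual_density are only illustrative.

```r
library(skewt)  # provides rskt(), as used above

set.seed(73)
draw_skewed_t <- rskt(n = 50, df = 6, gamma = 0.4)

# Let hist() choose the bins, but do not draw anything
h <- hist(draw_skewed_t, plot = FALSE)

n <- length(draw_skewed_t)
bin_width <- diff(h$breaks)

# Assign each observation to a bin, using the same conventions as the
# hist() defaults: right-closed intervals, lowest break included
bin_index <- cut(draw_skewed_t, breaks = h$breaks,
                 right = TRUE, include.lowest = TRUE)
counts <- as.vector(table(bin_index))

# Histogram estimator: count / (n * bin width) for each bin
manual_density <- counts / (n * bin_width)

# Compare with the heights computed by hist()
print(all.equal(manual_density, h$density))

# The bar areas add up to one, as they should for a density estimate
print(sum(manual_density * bin_width))
```

The division by the bin width is what turns relative frequencies into a density, which is why the histogram can be drawn on the same axis as the true pdf.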

We see that the histogram estimate is pretty close to the true density in the interval [-1, 2] but is not a good approximation in the interval (-∞, -1). What we could do is adjust the width of the bins. This adjustment leads to the topic of the next blog post: kernel density estimation (KDE).
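Before moving on, the effect of the bin width can be seen directly through the breaks argument of hist(). The following is a rough sketch, not a recommendation for how to choose the width: the three bin widths are arbitrary values picked for illustration, and make_breaks() is a hypothetical helper written for this example.

```r
library(skewt)  # provides rskt() and dskt(), as used above

set.seed(73)
draw_skewed_t <- rskt(n = 50, df = 6, gamma = 0.4)

# Hypothetical helper: equally spaced breaks of a given width that
# cover the whole sample
make_breaks <- function(x, width) {
  seq(floor(min(x) / width) * width,
      ceiling(max(x) / width) * width,
      by = width)
}

# Arbitrary bin widths, chosen only for illustration
bin_widths <- c(0.25, 1, 2)

# One panel per bin width, each with the true density on top
old_par <- par(mfrow = c(1, 3))
for (width in bin_widths) {
  hist(draw_skewed_t, breaks = make_breaks(draw_skewed_t, width),
       probability = TRUE, xlab = "", main = paste("Bin width", width))
  curve(dskt(x, df = 6, gamma = 0.4), add = TRUE)
}
par(old_par)
```

Narrow bins tend to produce a jagged estimate that follows individual observations, while wide bins smooth away the shape of the distribution. The bin width plays a role similar to that of the bandwidth in kernel density estimation.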

The next post will look at KDE in general and compare its accuracy to that of histogram estimators.

As always, you can find the complete script on GitHub.