Estimating Parameter N of Discrete Uniform Distribution
Introduction
Problem statement: Suppose we have a finite population of $N$ items with labels being $ 1,\ 2,\ \ldots,\ N $, respectively. Taking a simple random sample (SRS) without replacement, we observe
$$ \{ y_1,\ y_2,\ \ldots,\ y_n\}, $$
where $ y_i \in \{1,\ 2,\ \ldots, N\} $. Based on the SRS, we want to estimate the unknown parameter $ N $.We have the following possible estimators:
$$ \hat{N}_1 = 2 \bar{y} -1, $$
where $ \bar{y} = \frac{1}{n}\sum_{i=1}^n y_i $;$$ \hat{N}_2 = y_{(n)} := \hbox{max}\{y_1,\ \ldots,\ y_n\}; $$
$$ \hat{N}_3 = \frac{n+1}{n} y_{(n)} - 1; $$
$$ \hat{N}_4 = \hbox{max}\{\hat{N}_1,\ \hat{N}_2\}. $$
Remarks:
- The distribution of $y_{(n)}$ is given by
$$ \hbox{Pr}(y_{(n)} = i) = {i-1 \choose n-1} / {N \choose n},\ \hbox{for}\ i=n,\ n+1,\ \ldots, N. $$
- It is easy to show that
$$ E(\hat{N}_1) = N. $$
- It can be shown that
$$ E(\hat{N}_2) = \frac{n(N+1)}{n+1}, $$
which implies$$ E(\hat{N}_3) = N. $$
-
According to this Wikipedia article, $\hat{N}_3$ is the Uniformly Minimum Variance Unbiased Estimator (UMWUE) of $N$.
-
It is of interest to know the relative performances of the above estimators. To compare two estimators' performance, we compare their Mean Squared Errors (MSE) and calculate the following probability
$$ \hbox{Pr}(|\hat{N}_i - N| \le |\hat{N}_j - N|), $$
where $ 1 \le i < j \le 4. $In the next section, we do a small simulation study.
Simulation Study
simu <- function(nbr_rep = 100000, N = 100, n = 10, the_seed = 23456)
{set.seed(the_seed)
data <- parallel::mclapply(1:nbr_rep, function(i) sample(1:N, size = n,
replace = FALSE),
mc.cores = 20)
hat_N_1 <- vapply(data, function(y) 2 * mean(y) - 1, FUN.VALUE = numeric(1))
hat_N_2 <- vapply(data, max, FUN.VALUE = numeric(1))
hat_N_3 <- (n + 1) / n * hat_N_2 - 1
hat_N_4 <- pmax(hat_N_1, hat_N_2)
output_df <- data.frame(est_1 = hat_N_1,
est_2 = hat_N_2,
est_3 = hat_N_3,
est_4 = hat_N_4)
return(output_df)
}
analysis <- function(df, N)
{MSE <- vapply(df, function(x) mean((x - N)^2), FUN.VALUE = numeric(1))
d <- dim(df)[1]
abs_dev_df <- purrr::map_dfc(df, function(x) abs(x - N))
est_1_vs_est_2 <- sum(abs_dev_df[[1]] <= abs_dev_df[[2]]) / d
est_1_vs_est_3 <- sum(abs_dev_df[[1]] <= abs_dev_df[[3]]) / d
est_1_vs_est_4 <- sum(abs_dev_df[[1]] <= abs_dev_df[[4]]) / d
est_2_vs_est_3 <- sum(abs_dev_df[[2]] <= abs_dev_df[[3]]) / d
est_2_vs_est_4 <- sum(abs_dev_df[[2]] <= abs_dev_df[[4]]) / d
est_3_vs_est_4 <- sum(abs_dev_df[[3]] <= abs_dev_df[[4]]) / d
output_df <- data.frame(comp_1_vs_other = c(MSE[1], NA, NA, NA),
comp_2_vs_other = c(est_1_vs_est_2, MSE[2],
NA, NA),
comp_3_vs_other = c(est_1_vs_est_3, est_2_vs_est_3,
MSE[3], NA),
comp_4_vs_other = c(est_1_vs_est_4, est_2_vs_est_4,
est_3_vs_est_4, MSE[4]))
return(output_df)
}
test_1 <- simu(N = 5, n = 3)
(my_re_1 <- analysis(test_1, N = 5))
## comp_1_vs_other comp_2_vs_other comp_3_vs_other comp_4_vs_other
## 1 1.334324 0.40033 0.4003300 0.798460
## 2 NA 0.70269 0.6981500 0.800050
## 3 NA NA 0.8007467 0.699580
## 4 NA NA NA 1.210784
test_2 <- simu(N = 100, n = 10)
(my_re_2 <- analysis(test_2, N = 100))
## comp_1_vs_other comp_2_vs_other comp_3_vs_other comp_4_vs_other
## 1 302.6019 0.33147 0.26323 0.70556
## 2 NA 129.50745 0.41678 0.67666
## 3 NA NA 75.80357 0.65454
## 4 NA NA NA 217.83004
test_3 <- simu(N = 1000, n = 100)
(my_re_3 <- analysis(test_3, N = 1000))
## comp_1_vs_other comp_2_vs_other comp_3_vs_other comp_4_vs_other
## 1 3001.361 0.12317 0.09670 0.56325
## 2 NA 164.02811 0.41018 0.87716
## 3 NA NA 86.95230 0.74058
## 4 NA NA NA 1585.57204
test_4 <- simu(N = 10000, n = 100)
(my_re_4 <- analysis(test_4, N = 10000))
## comp_1_vs_other comp_2_vs_other comp_3_vs_other comp_4_vs_other
## 1 331305.6 0.13012 0.09768 0.56785
## 2 NA 18977.91834 0.39616 0.86991
## 3 NA NA 9637.27100 0.74545
## 4 NA NA NA 175634.20105
test_5 <- simu(N = 1000000, n = 1000)
(my_re_5 <- analysis(test_5, N = 1000000))
## comp_1_vs_other comp_2_vs_other comp_3_vs_other comp_4_vs_other
## 1 334959460 4.421000e-02 3.187000e-02 5.22140e-01
## 2 NA 2.014845e+06 3.923900e-01 9.55790e-01
## 3 NA NA 1.010327e+06 7.85420e-01
## 4 NA NA NA 1.68419e+08