Estimating Parameter N of Discrete Uniform Distribution

Introduction

Problem statement: Suppose we have a finite population of $N$ items with labels being $ 1,\ 2,\ \ldots,\ N $, respectively. Taking a simple random sample (SRS) without replacement, we observe

$$ \{ y_1,\ y_2,\ \ldots,\ y_n\}, $$

where $ y_i \in \{1,\ 2,\ \ldots, N\} $. Based on the SRS, we want to estimate the unknown parameter $ N $.

We have the following possible estimators:

$$ \hat{N}_1 = 2 \bar{y} -1, $$

where $ \bar{y} = \frac{1}{n}\sum_{i=1}^n y_i $;

$$ \hat{N}_2 = y_{(n)} := \hbox{max}\{y_1,\ \ldots,\ y_n\}; $$

$$ \hat{N}_3 = \frac{n+1}{n} y_{(n)} - 1; $$

$$ \hat{N}_4 = \hbox{max}\{\hat{N}_1,\ \hat{N}_2\}. $$

Remarks:

  • The distribution of $y_{(n)}$ is given by

$$ \hbox{Pr}(y_{(n)} = i) = {i-1 \choose n-1} / {N \choose n},\ \hbox{for}\ i=n,\ n+1,\ \ldots, N. $$

  • It is easy to show that

$$ E(\hat{N}_1) = N. $$

  • It can be shown that

$$ E(\hat{N}_2) = \frac{n(N+1)}{n+1}, $$

which implies

$$ E(\hat{N}_3) = N. $$

  • According to this Wikipedia article, $\hat{N}_3$ is the Uniformly Minimum Variance Unbiased Estimator (UMWUE) of $N$.

  • It is of interest to know the relative performances of the above estimators. To compare two estimators' performance, we compare their Mean Squared Errors (MSE) and calculate the following probability

$$ \hbox{Pr}(|\hat{N}_i - N| \le |\hat{N}_j - N|), $$

where $ 1 \le i < j \le 4. $

In the next section, we do a small simulation study.

Simulation Study

simu <- function(nbr_rep = 100000, N = 100, n = 10, the_seed = 23456)
{set.seed(the_seed)
 
 data <- parallel::mclapply(1:nbr_rep, function(i) sample(1:N, size = n, 
                                                      replace = FALSE),
                            mc.cores = 20) 
 hat_N_1 <- vapply(data, function(y) 2 * mean(y) - 1, FUN.VALUE = numeric(1))
 hat_N_2 <- vapply(data, max, FUN.VALUE = numeric(1))
 hat_N_3 <- (n + 1) / n * hat_N_2 - 1
 hat_N_4 <- pmax(hat_N_1, hat_N_2)
 
 output_df <- data.frame(est_1 = hat_N_1, 
                         est_2 = hat_N_2, 
                         est_3 = hat_N_3, 
                         est_4 = hat_N_4)
 return(output_df)
}

analysis <- function(df, N)
{MSE <- vapply(df, function(x) mean((x - N)^2), FUN.VALUE = numeric(1))

 d <- dim(df)[1]
 abs_dev_df <- purrr::map_dfc(df, function(x) abs(x - N))
 est_1_vs_est_2 <- sum(abs_dev_df[[1]] <= abs_dev_df[[2]]) / d
 est_1_vs_est_3 <- sum(abs_dev_df[[1]] <= abs_dev_df[[3]]) / d
 est_1_vs_est_4 <- sum(abs_dev_df[[1]] <= abs_dev_df[[4]]) / d
 est_2_vs_est_3 <- sum(abs_dev_df[[2]] <= abs_dev_df[[3]]) / d
 est_2_vs_est_4 <- sum(abs_dev_df[[2]] <= abs_dev_df[[4]]) / d
 est_3_vs_est_4 <- sum(abs_dev_df[[3]] <= abs_dev_df[[4]]) / d

 output_df <- data.frame(comp_1_vs_other = c(MSE[1], NA, NA, NA),
                         comp_2_vs_other = c(est_1_vs_est_2, MSE[2], 
                                             NA, NA),
                         comp_3_vs_other = c(est_1_vs_est_3, est_2_vs_est_3,
                                             MSE[3], NA),
                         comp_4_vs_other = c(est_1_vs_est_4, est_2_vs_est_4,
                                             est_3_vs_est_4, MSE[4]))
 return(output_df)
}

test_1 <- simu(N = 5, n = 3)
(my_re_1 <- analysis(test_1, N = 5))
##   comp_1_vs_other comp_2_vs_other comp_3_vs_other comp_4_vs_other
## 1        1.334324         0.40033       0.4003300        0.798460
## 2              NA         0.70269       0.6981500        0.800050
## 3              NA              NA       0.8007467        0.699580
## 4              NA              NA              NA        1.210784
test_2 <- simu(N = 100, n = 10)
(my_re_2 <- analysis(test_2, N = 100))
##   comp_1_vs_other comp_2_vs_other comp_3_vs_other comp_4_vs_other
## 1        302.6019         0.33147         0.26323         0.70556
## 2              NA       129.50745         0.41678         0.67666
## 3              NA              NA        75.80357         0.65454
## 4              NA              NA              NA       217.83004
test_3 <- simu(N = 1000, n = 100)
(my_re_3 <- analysis(test_3, N = 1000))
##   comp_1_vs_other comp_2_vs_other comp_3_vs_other comp_4_vs_other
## 1        3001.361         0.12317         0.09670         0.56325
## 2              NA       164.02811         0.41018         0.87716
## 3              NA              NA        86.95230         0.74058
## 4              NA              NA              NA      1585.57204
test_4 <- simu(N = 10000, n = 100)
(my_re_4 <- analysis(test_4, N = 10000))
##   comp_1_vs_other comp_2_vs_other comp_3_vs_other comp_4_vs_other
## 1        331305.6         0.13012         0.09768         0.56785
## 2              NA     18977.91834         0.39616         0.86991
## 3              NA              NA      9637.27100         0.74545
## 4              NA              NA              NA    175634.20105
test_5 <- simu(N = 1000000, n = 1000)
(my_re_5 <- analysis(test_5, N = 1000000))
##   comp_1_vs_other comp_2_vs_other comp_3_vs_other comp_4_vs_other
## 1       334959460    4.421000e-02    3.187000e-02     5.22140e-01
## 2              NA    2.014845e+06    3.923900e-01     9.55790e-01
## 3              NA              NA    1.010327e+06     7.85420e-01
## 4              NA              NA              NA     1.68419e+08
Lingyun Zhang (张凌云)
Lingyun Zhang (张凌云)
Design Analyst

I have research interests in Statistics, applied probability and computation.