A Note on R Function base::sample()

By chance I came across Professor Tillé’s paper, Remarks on some misconceptions about unequal probability sampling without replacement. The thing that drew my attention is:

By reading all the articles mentioned above, we understand that these authors all realized that this intuitive method [“Weighted Random Sampling”] is false.

Unfortunately, this method is implemented in the sample(.) function of the R language when we specify a sampling without replacement with unequal probabilities.

(NB: I put the words in the square brackets.)

Professor Tillé provides a simple example (see below), which shows that base::sample() “does not select a sample that respects the inclusion probabilities”.

N <- 12
n <- 4
pik <- (1:N)/sum(1:N) * n
p <- pik/n
pikest <- rep(0, N)
SIM <- 100000
for (j in 1:SIM) {
    s <- sample(x = 1:N, size = n, replace = FALSE, prob = p)
    pikest[s] <- pikest[s] + 1
}
pikest <- pikest/SIM
pik
##  [1] 0.05128205 0.10256410 0.15384615 0.20512821 0.25641026 0.30769231
##  [7] 0.35897436 0.41025641 0.46153846 0.51282051 0.56410256 0.61538462
pikest
##  [1] 0.06291 0.11844 0.17503 0.22988 0.27894 0.32774 0.37196 0.41046 0.45424
## [10] 0.48931 0.52241 0.55868
(pikest - pik)/sqrt(pik * (1 - pik)/SIM)
##  [1]  16.670624  16.547747  18.566790  19.384092  16.316307  13.735880
##  [7]   8.560398   0.130887  -4.629670 -14.874244 -26.588107 -36.858000

The above example reminds me of a story: In 2019, I and my ex-colleague Chris Hansen noticed that base::sample() got “something” wrong. Chris modified my example and provided the following one to show this:

pi <- c(1.0, 0.75, 0.25)
nit <- 100000
system.time({
  s <- 
    lapply(1:nit,
           function(i) base::sample(letters[1:3], 
                                    size = 2, 
                                    replace = FALSE, 
                                    prob = pi)
           ) 
  })
##    user  system elapsed 
##    1.53    0.02    1.56
table(unlist(s)) / nit
## 
##       a       b       c 
## 0.87157 0.80293 0.32550

Furthermore, he pointed out that we can instead use sampling::UPbrewer()

system.time({
   s <- lapply(
     1:nit,
     function(i) letters[1:3][sampling::UPbrewer(pi) == 1]
   ) 
 })
##    user  system elapsed 
##    7.23    0.01    7.33
table(unlist(s)) / nit
## 
##       a       b       c 
## 1.00000 0.75192 0.24808

I write down this note to record our little story. More importantly, the “defect” in base:sample() should be known by more people. BTW, it seems that I became a “fan” of R package sampling, for which Professor Tillé is the leading author.

References

Tillé, Y. (2023). Remarks on some misconceptions about unequal probability sampling without replacement. Computer Science Review, 47.

Lingyun Zhang (张凌云)
Lingyun Zhang (张凌云)
Design Analyst

I have research interests in Statistics, applied probability and computation.