A Note on R Function base::sample()

Last updated on Jun 17, 2023 2 min read

By chance I came across Professor Tillé’s paper, Remarks on some misconceptions about unequal probability sampling without replacement. The thing that drew my attention is:

By reading all the articles mentioned above, we understand that these authors all realized that this intuitive method [“Weighted Random Sampling”] is false.

Unfortunately, this method is implemented in the sample(.) function of the R language when we specify a sampling without replacement with unequal probabilities.

(NB: I put the words in the square brackets.)

Professor Tillé provides a simple example (see below), which shows that base::sample() “does not select a sample that respects the inclusion probabilities”.

N <- 12
n <- 4
pik <- (1:N)/sum(1:N) * n
p <- pik/n
pikest <- rep(0, N)
SIM <- 100000
for (j in 1:SIM) {
    s <- sample(x = 1:N, size = n, replace = FALSE, prob = p)
    pikest[s] <- pikest[s] + 1
}
pikest <- pikest/SIM
pik

##  [1] 0.05128205 0.10256410 0.15384615 0.20512821 0.25641026 0.30769231
##  [7] 0.35897436 0.41025641 0.46153846 0.51282051 0.56410256 0.61538462

pikest

##  [1] 0.06291 0.11844 0.17503 0.22988 0.27894 0.32774 0.37196 0.41046 0.45424
## [10] 0.48931 0.52241 0.55868

(pikest - pik)/sqrt(pik * (1 - pik)/SIM)

##  [1]  16.670624  16.547747  18.566790  19.384092  16.316307  13.735880
##  [7]   8.560398   0.130887  -4.629670 -14.874244 -26.588107 -36.858000

The above example reminds me of a story: In 2019, I and my ex-colleague Chris Hansen noticed that base::sample() got “something” wrong. Chris modified my example and provided the following one to show this:

pi <- c(1.0, 0.75, 0.25)
nit <- 100000
system.time({
  s <- 
    lapply(1:nit,
           function(i) base::sample(letters[1:3], 
                                    size = 2, 
                                    replace = FALSE, 
                                    prob = pi)
           ) 
  })

##    user  system elapsed 
##    1.53    0.02    1.56

table(unlist(s)) / nit

## 
##       a       b       c 
## 0.87157 0.80293 0.32550

Furthermore, he pointed out that we can instead use sampling::UPbrewer()

system.time({
   s <- lapply(
     1:nit,
     function(i) letters[1:3][sampling::UPbrewer(pi) == 1]
   ) 
 })

##    user  system elapsed 
##    7.23    0.01    7.33

table(unlist(s)) / nit

## 
##       a       b       c 
## 1.00000 0.75192 0.24808

I write down this note to record our little story. More importantly, the “defect” in base:sample() should be known by more people. BTW, it seems that I became a “fan” of R package sampling, for which Professor Tillé is the leading author.

References

Tillé, Y. (2023). Remarks on some misconceptions about unequal probability sampling without replacement. Computer Science Review, 47.

A Note on R Function base::sample()

References

Lingyun Zhang (张凌云)

Design Analyst