A Note on R Function base::sample()
By chance I came across Professor Tillé’s paper, Remarks on some misconceptions about unequal probability sampling without replacement. The thing that drew my attention is:
By reading all the articles mentioned above, we understand that these authors all realized that this intuitive method [“Weighted Random Sampling”] is false.
Unfortunately, this method is implemented in the sample(.) function of the R language when we specify a sampling without replacement with unequal probabilities.
(NB: I put the words in the square brackets.)
Professor Tillé provides a simple example (see below), which shows that base::sample()
“does not select a sample that respects the inclusion probabilities”.
N <- 12
n <- 4
pik <- (1:N)/sum(1:N) * n
p <- pik/n
pikest <- rep(0, N)
SIM <- 100000
for (j in 1:SIM) {
s <- sample(x = 1:N, size = n, replace = FALSE, prob = p)
pikest[s] <- pikest[s] + 1
}
pikest <- pikest/SIM
pik
## [1] 0.05128205 0.10256410 0.15384615 0.20512821 0.25641026 0.30769231
## [7] 0.35897436 0.41025641 0.46153846 0.51282051 0.56410256 0.61538462
pikest
## [1] 0.06291 0.11844 0.17503 0.22988 0.27894 0.32774 0.37196 0.41046 0.45424
## [10] 0.48931 0.52241 0.55868
(pikest - pik)/sqrt(pik * (1 - pik)/SIM)
## [1] 16.670624 16.547747 18.566790 19.384092 16.316307 13.735880
## [7] 8.560398 0.130887 -4.629670 -14.874244 -26.588107 -36.858000
The above example reminds me of a story:
In 2019, I and my ex-colleague Chris Hansen noticed that base::sample()
got “something” wrong. Chris modified my example and provided the following one to show this:
pi <- c(1.0, 0.75, 0.25)
nit <- 100000
system.time({
s <-
lapply(1:nit,
function(i) base::sample(letters[1:3],
size = 2,
replace = FALSE,
prob = pi)
)
})
## user system elapsed
## 1.53 0.02 1.56
table(unlist(s)) / nit
##
## a b c
## 0.87157 0.80293 0.32550
Furthermore, he pointed out that we can instead use sampling::UPbrewer()
system.time({
s <- lapply(
1:nit,
function(i) letters[1:3][sampling::UPbrewer(pi) == 1]
)
})
## user system elapsed
## 7.23 0.01 7.33
table(unlist(s)) / nit
##
## a b c
## 1.00000 0.75192 0.24808
I write down this note to record our little story. More importantly, the “defect” in base:sample()
should be known by more people. BTW, it seems that I became a “fan” of R package sampling
, for which Professor Tillé is the leading author.
References
Tillé, Y. (2023). Remarks on some misconceptions about unequal probability sampling without replacement. Computer Science Review, 47.