Thursday, July 10, 2014

Stratified random sampling in R using dplyr

A recent Gist I wrote

Setup

Let’s say I have a number of sample units for which I have observed some characteristic(s) at two time-points. In my specific case, I have species abundance data for 120 plots in 1992 and 2011. Using these data, I calculated the species turn-over between the two time points for each plot. I then shuffled the 2011 plots, leading to random pairing of plots between the two time points, and recalculated the turn-over.
There are many cases in which we may want to do something similar to this, and many non-parametric randomization methods use a similar setup. The particular problem I faced is that the plots were stratified into broad vegetation types, Fynbos, Thicket, and Grassland. When shuffling the 2011 plots, I wanted to shuffle plots only within their vegetation type. I thought up of a number of complicated ways to write a function to do this, and even started coding one up. Then I thought about how I could use dplyr to carry out stratified random sampling. Here’s an example of how it works.

Make a data set

Here is a sample data set including 20 plots (p1, …, p20), randomly assigned into one of three categories. I’ve printed out the data set, since it’s small.
## Load dplyr
require( dplyr )
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Make data.frame
df <- data.frame( plot = paste( "p", 1:20, sep = "" ),
                  category = sample( x = letters[1:3], size = 20, replace = TRUE ),
                  stringsAsFactors = FALSE )

## Print data.frame, arranged by category
print( arrange( df, category ) )
##    plot category
## 1    p3        a
## 2    p8        a
## 3   p12        a
## 4    p1        b
## 5    p2        b
## 6    p4        b
## 7    p5        b
## 8    p9        b
## 9   p15        b
## 10  p17        b
## 11  p19        b
## 12  p20        b
## 13   p6        c
## 14   p7        c
## 15  p10        c
## 16  p11        c
## 17  p13        c
## 18  p14        c
## 19  p16        c
## 20  p18        c

Simple shuffle

Shuffling plots, disregarding their category classification, is easy - just use sample. Below I’ve printed out the shuffled paired-plots.
## Shuffle plots
plots_shuffled <- sample( df$plot )

## Print plots and plots_shuffled together
print( cbind( df$plot, plots_shuffled ) )
##             plots_shuffled
##  [1,] "p1"  "p5"          
##  [2,] "p2"  "p13"         
##  [3,] "p3"  "p16"         
##  [4,] "p4"  "p14"         
##  [5,] "p5"  "p1"          
##  [6,] "p6"  "p3"          
##  [7,] "p7"  "p20"         
##  [8,] "p8"  "p2"          
##  [9,] "p9"  "p11"         
## [10,] "p10" "p19"         
## [11,] "p11" "p10"         
## [12,] "p12" "p17"         
## [13,] "p13" "p9"          
## [14,] "p14" "p7"          
## [15,] "p15" "p6"          
## [16,] "p16" "p8"          
## [17,] "p17" "p15"         
## [18,] "p18" "p4"          
## [19,] "p19" "p12"         
## [20,] "p20" "p18"

Stratified random sampling (shuffling)

But what if we want to account for the category classification? Here’s how I used dplyr to perform stratified random sampling.
## Use dplyr group_by and mutate to randomly sample within category
df <- 
  group_by( df, category ) %.%
  mutate( strat_rsamp = sample( plot ) )

print( arrange( df, category ) )
## Source: local data frame [20 x 3]
## Groups: category
## 
##    plot category strat_rsamp
## 1    p3        a         p12
## 2    p8        a          p3
## 3   p12        a          p8
## 4    p1        b          p5
## 5    p2        b         p20
## 6    p4        b          p9
## 7    p5        b         p15
## 8    p9        b         p17
## 9   p15        b          p1
## 10  p17        b          p2
## 11  p19        b          p4
## 12  p20        b         p19
## 13   p6        c         p11
## 14   p7        c         p14
## 15  p10        c          p7
## 16  p11        c         p10
## 17  p13        c         p16
## 18  p14        c         p18
## 19  p16        c         p13
## 20  p18        c          p6
We could also return just a vector of the shuffled samples, without the data.frame. Convenient, but not very pretty code-wise
( group_by( df, category ) %.%
    mutate( strat_rsamp = sample( plot ) ) )$strat_rsamp
##  [1] "p5"  "p20" "p8"  "p9"  "p15" "p16" "p11" "p3"  "p19" "p18" "p14"
## [12] "p12" "p13" "p10" "p4"  "p6"  "p1"  "p7"  "p17" "p2"

Conclusion

There you have it - stratified random sampling. There may be an even easier way to do this (perhaps I missed a function or didn’t dive into sample enough?), but this seems pretty easy to me. Thanks dplyr!

1 comment:

  1. The above is a little Gist that I wrote this morning. The source code can be found here: https://gist.github.com/96c9e597471d48a8f69d.git

    ReplyDelete