# Stratified random sampling with dplyr

*Matthew E. Aiello-Lammens*

*July 10, 2014*

#### Setup

Let’s say I have a number of sample units for which I have observed some characteristic(s) at two time points. In my specific case, I have species abundance data for 120 plots in 1992 and 2011. Using these data, I calculated the species turn-over between the two time points for each plot. I then shuffled the 2011 plots, leading to a random pairing of plots between the two time points, and recalculated the turn-over.

There are many cases in which we may want to do something similar, and many non-parametric randomization methods use a similar setup. The particular problem I faced is that the plots were stratified into broad vegetation types: Fynbos, Thicket, and Grassland. When shuffling the 2011 plots, I wanted to shuffle plots *only within* their vegetation type. I thought of a number of complicated ways to write a function to do this, and even started coding one up. Then I thought about how I could use `dplyr` to carry out stratified random sampling. Here’s an example of how it works.

#### Make a data set

Here is a sample data set of 20 plots (p1, …, p20), each randomly assigned to one of three categories. I’ve printed out the data set, since it’s small.

```
## Load dplyr
require( dplyr )
```

```
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
```

```
## Make data.frame
df <- data.frame( plot = paste( "p", 1:20, sep = "" ),
                  category = sample( x = letters[1:3], size = 20, replace = TRUE ),
                  stringsAsFactors = FALSE )
## Print data.frame, arranged by category
print( arrange( df, category ) )
```

```
## plot category
## 1 p3 a
## 2 p8 a
## 3 p12 a
## 4 p1 b
## 5 p2 b
## 6 p4 b
## 7 p5 b
## 8 p9 b
## 9 p15 b
## 10 p17 b
## 11 p19 b
## 12 p20 b
## 13 p6 c
## 14 p7 c
## 15 p10 c
## 16 p11 c
## 17 p13 c
## 18 p14 c
## 19 p16 c
## 20 p18 c
```
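
Because `sample` draws at random, the category assignments will differ on every run, so the printed output above will not be reproduced exactly. A minimal sketch of how the draw could be made repeatable, assuming a fixed seed is acceptable (the seed value and the names `cats_first` and `cats_second` are illustrative, not from the original analysis):

```r
## Fix the random number generator state so the draw can be regenerated
## (the seed value 42 is an arbitrary choice)
set.seed( 42 )
cats_first <- sample( x = letters[1:3], size = 20, replace = TRUE )

## Resetting the same seed reproduces exactly the same draw
set.seed( 42 )
cats_second <- sample( x = letters[1:3], size = 20, replace = TRUE )

## TRUE when both draws match element for element
identical( cats_first, cats_second )
```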

#### Simple shuffle

Shuffling plots, disregarding their category classification, is easy: just use `sample`. Below I’ve printed out the shuffled plot pairs.

```
## Shuffle plots
plots_shuffled <- sample( df$plot )
## Print plots and plots_shuffled together
print( cbind( df$plot, plots_shuffled ) )
```

```
## plots_shuffled
## [1,] "p1" "p5"
## [2,] "p2" "p13"
## [3,] "p3" "p16"
## [4,] "p4" "p14"
## [5,] "p5" "p1"
## [6,] "p6" "p3"
## [7,] "p7" "p20"
## [8,] "p8" "p2"
## [9,] "p9" "p11"
## [10,] "p10" "p19"
## [11,] "p11" "p10"
## [12,] "p12" "p17"
## [13,] "p13" "p9"
## [14,] "p14" "p7"
## [15,] "p15" "p6"
## [16,] "p16" "p8"
## [17,] "p17" "p15"
## [18,] "p18" "p4"
## [19,] "p19" "p12"
## [20,] "p20" "p18"
```
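
To see how often the simple shuffle pairs plots from different categories, one option is to look up each shuffled partner's category and compare it with the original. This is only a sketch; `df_demo` and `partner_category` are names made up for the illustration, and the data set is regenerated so the block runs on its own:

```r
## Rebuild a small example data set with the same structure as above
df_demo <- data.frame( plot = paste( "p", 1:20, sep = "" ),
                       category = sample( x = letters[1:3], size = 20, replace = TRUE ),
                       stringsAsFactors = FALSE )

## Simple shuffle, ignoring category
plots_shuffled <- sample( df_demo$plot )

## Look up the category of each shuffled partner by plot name
partner_category <- df_demo$category[ match( plots_shuffled, df_demo$plot ) ]

## Number of pairs whose two plots belong to different categories
sum( partner_category != df_demo$category )
```

With three categories, this count will usually be well above zero, which is the problem the stratified approach addresses.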

#### Stratified random sampling (shuffling)

But what if we want to account for the category classification? Here’s how I used `dplyr` to perform stratified random sampling.

```
## Use dplyr group_by and mutate to randomly sample within category
df <-
  group_by( df, category ) %>%
  mutate( strat_rsamp = sample( plot ) )
print( arrange( df, category ) )
```

```
## Source: local data frame [20 x 3]
## Groups: category
##
## plot category strat_rsamp
## 1 p3 a p12
## 2 p8 a p3
## 3 p12 a p8
## 4 p1 b p5
## 5 p2 b p20
## 6 p4 b p9
## 7 p5 b p15
## 8 p9 b p17
## 9 p15 b p1
## 10 p17 b p2
## 11 p19 b p4
## 12 p20 b p19
## 13 p6 c p11
## 14 p7 c p14
## 15 p10 c p7
## 16 p11 c p10
## 17 p13 c p16
## 18 p14 c p18
## 19 p16 c p13
## 20 p18 c p6
```
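
A quick way to confirm that the shuffle really stayed within categories is to check that every shuffled partner has the same category as the plot it was paired with. The sketch below assumes a recent `dplyr` that provides the `%>%` pipe and `ungroup()`; `df_demo`, `df_strat`, and `partner_category` are illustrative names:

```r
library( dplyr )

## Rebuild a small example data set with the same structure as above
df_demo <- data.frame( plot = paste( "p", 1:20, sep = "" ),
                       category = sample( x = letters[1:3], size = 20, replace = TRUE ),
                       stringsAsFactors = FALSE )

## Shuffle within category, then drop the grouping
df_strat <- df_demo %>%
  group_by( category ) %>%
  mutate( strat_rsamp = sample( plot ) ) %>%
  ungroup()

## Category of each shuffled partner, looked up by plot name
partner_category <- df_strat$category[ match( df_strat$strat_rsamp, df_strat$plot ) ]

## TRUE if every plot was paired with a plot from its own category
all( partner_category == df_strat$category )
```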

We could also return just a vector of the shuffled samples, without the data.frame. Convenient, but not very pretty code-wise.

```
( group_by( df, category ) %>%
    mutate( strat_rsamp = sample( plot ) ) )$strat_rsamp
```
```

```
## [1] "p5" "p20" "p8" "p9" "p15" "p16" "p11" "p3" "p19" "p18" "p14"
## [12] "p12" "p13" "p10" "p4" "p6" "p1" "p7" "p17" "p2"
```
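
One caveat about the `sample( plot )` calls above: they behave as expected here because the plot names are character strings. If the identifiers were numeric, a category containing a single plot would likely not be handled as intended, since `sample` treats a single number `n` as shorthand for `1:n`. A hedged sketch of an index-based workaround follows; the helper name `resample` is my own choice, and the pattern follows the one suggested in the `?sample` help page:

```r
## sample() expands a single number n >= 1 into 1:n before shuffling
length( sample( 7 ) )   # 7 values, not 1

## Index-based helper that only ever permutes the elements it is given
resample <- function( x ) x[ sample.int( length( x ) ) ]

resample( 7 )      # always 7
resample( "p7" )   # character vectors were never affected
```

Using `resample( plot )` in place of `sample( plot )` inside `mutate` should make the stratified shuffle safe for numeric identifiers as well.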

#### Conclusion

There you have it: stratified random sampling. There may be an even easier way to do this (perhaps I missed a function or didn’t dive into `sample` enough?), but this seems pretty easy to me. Thanks, `dplyr`!

The above is a little Gist that I wrote this morning. The source code can be found here: https://gist.github.com/96c9e597471d48a8f69d.git
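
For comparison, here is a sketch of one possible base R route to the same result, using `split` and `unsplit`. It is not part of the Gist; `shuffle_within` is a hypothetical helper name, and the plot and category vectors are regenerated so the block runs on its own:

```r
## Hypothetical helper: permute x separately within each level of strata
shuffle_within <- function( x, strata ) {
  pieces <- split( x, strata )
  ## Index-based shuffle, so single-element groups come back unchanged
  shuffled <- lapply( pieces, function( p ) p[ sample.int( length( p ) ) ] )
  ## Put the shuffled pieces back in the original row order
  unsplit( shuffled, strata )
}

plots <- paste( "p", 1:20, sep = "" )
category <- sample( x = letters[1:3], size = 20, replace = TRUE )

## One stratified shuffle
shuffle_within( plots, category )

## Several independent stratified shuffles, one per column
shuffles <- replicate( 5, shuffle_within( plots, category ) )
dim( shuffles )
```

Wrapping the shuffle in `replicate` is one way to generate the repeated random pairings that a randomization test needs.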
