Monday, July 14, 2014

Reflection

Today, a bit of a story ...


Kinsman Pond from ledges of North Kinsman


I stood near the edge of the clearing with the two scientists, straining my ears to hear the call of a Bicknell’s Thrush. They played the recording, and we listened. As I recall, it was a relatively cool summer evening, though it was often cool in the evening at Kinsman Pond campsite. It was 2002. I had just finished my third year of college, and I was working as a backcountry campsite caretaker for the Appalachian Mountain Club. I had been dreaming of having this job for at least two years at that point, and a few cold, rainy days aside, I was enjoying every moment of it. It was a stark contrast from the previous summer, when I worked at Brookhaven National Labs as a research assistant for one of my professors. That too was a great experience and important for my development as a scientist. But living in the woods, spending my days doing trail work, sleeping in a canvas tent, dealing with black flies ... that was the life!

On this particular evening, two scientists from the Vermont Institute of Natural Science came to my site, and were planning to spend the night so they could survey for Bicknell's Thrush both that evening and very early the next morning. We chatted for a while, and they told me that they were using known locations of this bird, along with environmental information from those locations, to build a model to predict where it might be observed. I had no knowledge of species distribution modeling, and honestly, probably only a vague notion of what a "species niche" was. What I did know was that I as a physics major, I knew what the word model meant. I knew that any such model would likely be constructed using some computer programming language. I naively thought you would probably code that up in C or C++, because hey, isn't that what most scientists were using? But most importantly, I thought "I could probably learn to do that". 

I've told this story many times over the last 7+ years, in part as a way of explaining to people how I went from physics to ecology. To be honest, the path from physics to ecology was not direct. I finished my degree, spent two more years working as a caretaker, and then spent two years working in a neuroscience research group. But this experience stuck with me, and eventually I got to "build ecological models". I was reflecting on this again recently and I finally went and looked to see if any publications resulted from this field work these scientists had been doing. Here's what I found:

Lambert et al. 2005. A practical model of Bicknell's Thrush distribution in the northeaster United States. The Wilson Bulletin 117(1): 1-12.

And here was the line I was looking for - "during 2000–2002 ... we recorded the presence or presumed absence of Bicknell’s Thrush on 130 mountains first sampled by Atwood et al. (1996) (New York: n = 30, Vermont: n = 56, New Hampshire: n = 26, Maine: n = 18)." I'm assuming that one of the 26 New Hampshire locations was Kinsman Pond.

Perhaps the best part of finding this paper is that since that meeting in 2002, I have acquired the knowledge necessary to understand what these researchers were working on. The species distribution model that they constructed is pretty uncommon compared to today's methods. It appears to be a mixture of quantile regression, which is reminiscent of some of the first distribution models used (e.g., Bioclim), and basic pattern observation. The model options for presence-absence data are quite numerous now, and it would be interesting to see how different the distribution predictions are using more modern techniques. 

It's nice to look back on this story and think "yeah, I did learn how to do that."

Thursday, July 10, 2014

Stratified random sampling in R using dplyr

A recent Gist I wrote

Setup

Let’s say I have a number of sample units for which I have observed some characteristic(s) at two time-points. In my specific case, I have species abundance data for 120 plots in 1992 and 2011. Using these data, I calculated the species turn-over between the two time points for each plot. I then shuffled the 2011 plots, leading to random pairing of plots between the two time points, and recalculated the turn-over.
There are many cases in which we may want to do something similar to this, and many non-parametric randomization methods use a similar setup. The particular problem I faced is that the plots were stratified into broad vegetation types, Fynbos, Thicket, and Grassland. When shuffling the 2011 plots, I wanted to shuffle plots only within their vegetation type. I thought up of a number of complicated ways to write a function to do this, and even started coding one up. Then I thought about how I could use dplyr to carry out stratified random sampling. Here’s an example of how it works.

Make a data set

Here is a sample data set including 20 plots (p1, …, p20), randomly assigned into one of three categories. I’ve printed out the data set, since it’s small.
## Load dplyr
require( dplyr )
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Make data.frame
df <- data.frame( plot = paste( "p", 1:20, sep = "" ),
                  category = sample( x = letters[1:3], size = 20, replace = TRUE ),
                  stringsAsFactors = FALSE )

## Print data.frame, arranged by category
print( arrange( df, category ) )
##    plot category
## 1    p3        a
## 2    p8        a
## 3   p12        a
## 4    p1        b
## 5    p2        b
## 6    p4        b
## 7    p5        b
## 8    p9        b
## 9   p15        b
## 10  p17        b
## 11  p19        b
## 12  p20        b
## 13   p6        c
## 14   p7        c
## 15  p10        c
## 16  p11        c
## 17  p13        c
## 18  p14        c
## 19  p16        c
## 20  p18        c

Simple shuffle

Shuffling plots, disregarding their category classification, is easy - just use sample. Below I’ve printed out the shuffled paired-plots.
## Shuffle plots
plots_shuffled <- sample( df$plot )

## Print plots and plots_shuffled together
print( cbind( df$plot, plots_shuffled ) )
##             plots_shuffled
##  [1,] "p1"  "p5"          
##  [2,] "p2"  "p13"         
##  [3,] "p3"  "p16"         
##  [4,] "p4"  "p14"         
##  [5,] "p5"  "p1"          
##  [6,] "p6"  "p3"          
##  [7,] "p7"  "p20"         
##  [8,] "p8"  "p2"          
##  [9,] "p9"  "p11"         
## [10,] "p10" "p19"         
## [11,] "p11" "p10"         
## [12,] "p12" "p17"         
## [13,] "p13" "p9"          
## [14,] "p14" "p7"          
## [15,] "p15" "p6"          
## [16,] "p16" "p8"          
## [17,] "p17" "p15"         
## [18,] "p18" "p4"          
## [19,] "p19" "p12"         
## [20,] "p20" "p18"

Stratified random sampling (shuffling)

But what if we want to account for the category classification? Here’s how I used dplyr to perform stratified random sampling.
## Use dplyr group_by and mutate to randomly sample within category
df <- 
  group_by( df, category ) %.%
  mutate( strat_rsamp = sample( plot ) )

print( arrange( df, category ) )
## Source: local data frame [20 x 3]
## Groups: category
## 
##    plot category strat_rsamp
## 1    p3        a         p12
## 2    p8        a          p3
## 3   p12        a          p8
## 4    p1        b          p5
## 5    p2        b         p20
## 6    p4        b          p9
## 7    p5        b         p15
## 8    p9        b         p17
## 9   p15        b          p1
## 10  p17        b          p2
## 11  p19        b          p4
## 12  p20        b         p19
## 13   p6        c         p11
## 14   p7        c         p14
## 15  p10        c          p7
## 16  p11        c         p10
## 17  p13        c         p16
## 18  p14        c         p18
## 19  p16        c         p13
## 20  p18        c          p6
We could also return just a vector of the shuffled samples, without the data.frame. Convenient, but not very pretty code-wise
( group_by( df, category ) %.%
    mutate( strat_rsamp = sample( plot ) ) )$strat_rsamp
##  [1] "p5"  "p20" "p8"  "p9"  "p15" "p16" "p11" "p3"  "p19" "p18" "p14"
## [12] "p12" "p13" "p10" "p4"  "p6"  "p1"  "p7"  "p17" "p2"

Conclusion

There you have it - stratified random sampling. There may be an even easier way to do this (perhaps I missed a function or didn’t dive into sample enough?), but this seems pretty easy to me. Thanks dplyr!