Monday, August 4, 2014

Did I say local field work? Going international!

One of my earliest posts here was about why I enjoyed local field work (and some of the pitfalls of it too). Well, I'm writing this post from Calvinia, South Africa, a location that is definitely not local for me. I am here as part of my postdoc work, collecting data on plant functional traits  in one of the most diverse floras in the world with a group of about eight people. As the postdoc on this project I had to do quite a bit of leg work before getting here related to examining existing data from the region, putting together species lists, helping to determine optimal sampling locations, etc. Fortunately, because the research team I'm with has carried out several similar studies in various parts of South Africa, most of the challenging logistics were worked out by others with much more experience. To that end, all I really need to do is learn how to drive a standard, on the left side of the road. But putting together a species observation list and sampling location plan was definitely a new experience for me. When I started a few months ago, I had imagined that I would have to have things planned out roughly to the day, as our time is limited and the area we are covering is vast. I quickly learned that what we're after is more of a sketch of a plan. What that means is that most every night I'll sit down with the botanist in our team and help him decide on the best spots to sample the next day. And our species list is getting updated daily - checking off what we've observed, dropping some species, adding others. Really, the key is to be flexible. However, this flexibility requires preparation. It's best if I can load up the vegetation survey data a colleague collected several years ago in a matter of minutes, and the hours of examining these data before hand has definitely helped me to navigate it on the fly.

Planning considerations aside, scientifically the most important part of this trip is for me to just get out into these incredible plant communities and to experience the field and lab work involved in our project. I've been spending the past year organizing, cleaning, and analyzing data collected in previous years without fully understanding what the system looks like. I've read a lot about it, but have never seen it with my own eyes. Now I can better understand some of the ambiguities in our datasets. For example, I've now held a specimen in my hand and had the thought "what the heck is the leaf on this thing?" This experience is absolutely invaluable for me while I'm wrapping up some of the analyses I've started, and moving forward with new ones. I'm hoping to have a few posts on some of my insights (or causes of confusion!) during the rest of this trip.

On a personal level, this trip is also my longest international trip, and I gave a lot of thought into packing, and trying to be minimal about what I brought (even if I did most of my packing the day before I left). I didn't want to bring anything I wouldn't use (except my small first aid kit. I'm ok carrying that around without using it). So far it's working out well. I promise more biologically relevant pictures in the future, but for now, here's a pic of what I thought I would need for multiple weeks of field work. (Note - Nessy, the green monster, didn't come. The shoes are my wife's (though I'm told they are comfortable). And yes, one of my carry-on items did end up being the Leaf Area Meter. I'll try to sucker one of the other group members into dragging it home!)


Monday, July 14, 2014

Reflection

Today, a bit of a story ...


Kinsman Pond from ledges of North Kinsman


I stood near the edge of the clearing with the two scientists, straining my ears to hear the call of a Bicknell’s Thrush. They played the recording, and we listened. As I recall, it was a relatively cool summer evening, though it was often cool in the evening at Kinsman Pond campsite. It was 2002. I had just finished my third year of college, and I was working as a backcountry campsite caretaker for the Appalachian Mountain Club. I had been dreaming of having this job for at least two years at that point, and a few cold, rainy days aside, I was enjoying every moment of it. It was a stark contrast from the previous summer, when I worked at Brookhaven National Labs as a research assistant for one of my professors. That too was a great experience and important for my development as a scientist. But living in the woods, spending my days doing trail work, sleeping in a canvas tent, dealing with black flies ... that was the life!

On this particular evening, two scientists from the Vermont Institute of Natural Science came to my site, and were planning to spend the night so they could survey for Bicknell's Thrush both that evening and very early the next morning. We chatted for a while, and they told me that they were using known locations of this bird, along with environmental information from those locations, to build a model to predict where it might be observed. I had no knowledge of species distribution modeling, and honestly, probably only a vague notion of what a "species niche" was. What I did know was that I as a physics major, I knew what the word model meant. I knew that any such model would likely be constructed using some computer programming language. I naively thought you would probably code that up in C or C++, because hey, isn't that what most scientists were using? But most importantly, I thought "I could probably learn to do that". 

I've told this story many times over the last 7+ years, in part as a way of explaining to people how I went from physics to ecology. To be honest, the path from physics to ecology was not direct. I finished my degree, spent two more years working as a caretaker, and then spent two years working in a neuroscience research group. But this experience stuck with me, and eventually I got to "build ecological models". I was reflecting on this again recently and I finally went and looked to see if any publications resulted from this field work these scientists had been doing. Here's what I found:

Lambert et al. 2005. A practical model of Bicknell's Thrush distribution in the northeaster United States. The Wilson Bulletin 117(1): 1-12.

And here was the line I was looking for - "during 2000–2002 ... we recorded the presence or presumed absence of Bicknell’s Thrush on 130 mountains first sampled by Atwood et al. (1996) (New York: n = 30, Vermont: n = 56, New Hampshire: n = 26, Maine: n = 18)." I'm assuming that one of the 26 New Hampshire locations was Kinsman Pond.

Perhaps the best part of finding this paper is that since that meeting in 2002, I have acquired the knowledge necessary to understand what these researchers were working on. The species distribution model that they constructed is pretty uncommon compared to today's methods. It appears to be a mixture of quantile regression, which is reminiscent of some of the first distribution models used (e.g., Bioclim), and basic pattern observation. The model options for presence-absence data are quite numerous now, and it would be interesting to see how different the distribution predictions are using more modern techniques. 

It's nice to look back on this story and think "yeah, I did learn how to do that."

Thursday, July 10, 2014

Stratified random sampling in R using dplyr

A recent Gist I wrote

Setup

Let’s say I have a number of sample units for which I have observed some characteristic(s) at two time-points. In my specific case, I have species abundance data for 120 plots in 1992 and 2011. Using these data, I calculated the species turn-over between the two time points for each plot. I then shuffled the 2011 plots, leading to random pairing of plots between the two time points, and recalculated the turn-over.
There are many cases in which we may want to do something similar to this, and many non-parametric randomization methods use a similar setup. The particular problem I faced is that the plots were stratified into broad vegetation types, Fynbos, Thicket, and Grassland. When shuffling the 2011 plots, I wanted to shuffle plots only within their vegetation type. I thought up of a number of complicated ways to write a function to do this, and even started coding one up. Then I thought about how I could use dplyr to carry out stratified random sampling. Here’s an example of how it works.

Make a data set

Here is a sample data set including 20 plots (p1, …, p20), randomly assigned into one of three categories. I’ve printed out the data set, since it’s small.
## Load dplyr
require( dplyr )
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Make data.frame
df <- data.frame( plot = paste( "p", 1:20, sep = "" ),
                  category = sample( x = letters[1:3], size = 20, replace = TRUE ),
                  stringsAsFactors = FALSE )

## Print data.frame, arranged by category
print( arrange( df, category ) )
##    plot category
## 1    p3        a
## 2    p8        a
## 3   p12        a
## 4    p1        b
## 5    p2        b
## 6    p4        b
## 7    p5        b
## 8    p9        b
## 9   p15        b
## 10  p17        b
## 11  p19        b
## 12  p20        b
## 13   p6        c
## 14   p7        c
## 15  p10        c
## 16  p11        c
## 17  p13        c
## 18  p14        c
## 19  p16        c
## 20  p18        c

Simple shuffle

Shuffling plots, disregarding their category classification, is easy - just use sample. Below I’ve printed out the shuffled paired-plots.
## Shuffle plots
plots_shuffled <- sample( df$plot )

## Print plots and plots_shuffled together
print( cbind( df$plot, plots_shuffled ) )
##             plots_shuffled
##  [1,] "p1"  "p5"          
##  [2,] "p2"  "p13"         
##  [3,] "p3"  "p16"         
##  [4,] "p4"  "p14"         
##  [5,] "p5"  "p1"          
##  [6,] "p6"  "p3"          
##  [7,] "p7"  "p20"         
##  [8,] "p8"  "p2"          
##  [9,] "p9"  "p11"         
## [10,] "p10" "p19"         
## [11,] "p11" "p10"         
## [12,] "p12" "p17"         
## [13,] "p13" "p9"          
## [14,] "p14" "p7"          
## [15,] "p15" "p6"          
## [16,] "p16" "p8"          
## [17,] "p17" "p15"         
## [18,] "p18" "p4"          
## [19,] "p19" "p12"         
## [20,] "p20" "p18"

Stratified random sampling (shuffling)

But what if we want to account for the category classification? Here’s how I used dplyr to perform stratified random sampling.
## Use dplyr group_by and mutate to randomly sample within category
df <- 
  group_by( df, category ) %.%
  mutate( strat_rsamp = sample( plot ) )

print( arrange( df, category ) )
## Source: local data frame [20 x 3]
## Groups: category
## 
##    plot category strat_rsamp
## 1    p3        a         p12
## 2    p8        a          p3
## 3   p12        a          p8
## 4    p1        b          p5
## 5    p2        b         p20
## 6    p4        b          p9
## 7    p5        b         p15
## 8    p9        b         p17
## 9   p15        b          p1
## 10  p17        b          p2
## 11  p19        b          p4
## 12  p20        b         p19
## 13   p6        c         p11
## 14   p7        c         p14
## 15  p10        c          p7
## 16  p11        c         p10
## 17  p13        c         p16
## 18  p14        c         p18
## 19  p16        c         p13
## 20  p18        c          p6
We could also return just a vector of the shuffled samples, without the data.frame. Convenient, but not very pretty code-wise
( group_by( df, category ) %.%
    mutate( strat_rsamp = sample( plot ) ) )$strat_rsamp
##  [1] "p5"  "p20" "p8"  "p9"  "p15" "p16" "p11" "p3"  "p19" "p18" "p14"
## [12] "p12" "p13" "p10" "p4"  "p6"  "p1"  "p7"  "p17" "p2"

Conclusion

There you have it - stratified random sampling. There may be an even easier way to do this (perhaps I missed a function or didn’t dive into sample enough?), but this seems pretty easy to me. Thanks dplyr!