Tuesday, September 17, 2013

Examining whether the order of scaling and log-transformation matters

The following question has come up as I continue to explore datasets related to my current PostDoc work. Given a dataset that requires log-transformation in order to fit a normal distribution, does it matter if I log-transform then scale the data versus scale then log-transform?
First, let's get some data that could be considered as needing log-transformation to meet the assumptions of normality. I downloaded the dataset of plant specific leaf area (SLA) from Reich 1999, as used and cited in Logan 2012, from the website associated with Logan 2012 here.
## Require packages
require(ggplot2)
## Loading required package: ggplot2
require(reshape2)
## Loading required package: reshape2

## Read in the data
LeafArea <- read.csv("~/Google Drive/Professional/Short-R-Examples/reich.csv")

## Quick peek at the data
head(LeafArea)
##   LOCATION FUNCTION LEAFAREA
## 1   Newmex    Shrub    105.0
## 2   Newmex     Tree    124.0
## 3   Newmex     Tree     83.8
## 4   Newmex    Shrub     39.7
## 5   Newmex    Shrub     51.2
## 6   Newmex    Shrub     66.0

## Make a histogram of the LeafArea
qplot(LEAFAREA, data = LeafArea, geom = "histogram", binwidth = 10)
[Figure: histogram of LEAFAREA, binwidth = 10]

## Now look at this same data, log10 transformed
qplot(log10(LEAFAREA), data = LeafArea, geom = "histogram")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust
## this.
## Warning: position_stack requires constant width: output may be incorrect
[Figure: histogram of log10(LEAFAREA)]
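Histograms are a visual check only. As a rough numerical complement, here is a minimal sketch using the Shapiro-Wilk test from base R. Because reich.csv is not bundled with this post, the vector `x` below is a hypothetical log-normal stand-in for LEAFAREA, and the seed and parameters are arbitrary assumptions.

```r
# Hypothetical stand-in for LEAFAREA: strictly positive, right-skewed data
set.seed(42)
x <- exp(rnorm(n = 200, mean = 4, sd = 0.5))

# Shapiro-Wilk W statistic: values closer to 1 are more consistent with normality
raw_w <- unname(shapiro.test(x)$statistic)
log_w <- unname(shapiro.test(log10(x))$statistic)

print(c(raw = raw_w, log10 = log_w))
```

With right-skewed data like this, W should move closer to 1 after the log transform, which is what the second histogram suggests for the real data.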
Ok, now let's see how things look when I scale, then log10 transform, versus log10 transform, then scale.
## First scale, then log transform
LeafArea$ScaleLog <- log10(scale(LeafArea$LEAFAREA))
## Warning: NaNs produced

## Next log, then scale
LeafArea$LogScale <- scale(log10(LeafArea$LEAFAREA))

## Now plot these two
LeafArea_m <- melt(data = LeafArea, id.vars = c(1:3))
p <- ggplot(LeafArea_m, aes(x = value, colour = variable)) + geom_density()
p
## Warning: Removed 33 rows containing non-finite values (stat_density).
## Warning: Removed 1 rows containing non-finite values (stat_density).
[Figure: density curves of ScaleLog and LogScale]
Definitely different in the density plots. What about histograms?
h <- ggplot(LeafArea_m, aes(x = value, fill = variable)) + geom_histogram(position = "identity", 
    alpha = 0.4)
h
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust
## this.
[Figure: overlaid histograms of ScaleLog and LogScale]
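The "NaNs produced" and "Removed ... rows" warnings above can be traced to the centering step in scale(). A minimal sketch, again using a hypothetical simulated vector `x` rather than the Reich data, counts how many values become negative after scaling and confirms that exactly those values turn into NaN under log10().

```r
# Hypothetical right-skewed, strictly positive data (not the Reich dataset)
set.seed(1)
x <- exp(rnorm(n = 200, mean = 4, sd = 0.5))

# scale() returns a one-column matrix; flatten it to a plain numeric vector
z <- as.numeric(scale(x))

# Centering subtracts the mean, so every below-average value becomes negative
n_negative <- sum(z < 0)

# log10() of a negative number is NaN (with a warning, suppressed here)
scale_log <- suppressWarnings(log10(z))
n_nan <- sum(is.nan(scale_log))

# Log first, then scale: the input to log10() is positive, so nothing is lost
log_scale <- as.numeric(scale(log10(x)))

print(c(negative_after_scaling = n_negative, nan_after_log = n_nan))
```

So roughly the lower half of the data is discarded when scaling comes first, while log then scale keeps every observation.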

Let's try this one more time with simulated data.
# Generate some random data that needs a log transform
Sample_Data <- exp(rnorm(n = 500, mean = 0, sd = 1))
# Plot the data before transform
qplot(x = Sample_Data, geom = "histogram", binwidth = 1)
[Figure: histogram of Sample_Data, binwidth = 1]
# Plot the data after transform
qplot(x = log(Sample_Data), geom = "histogram")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust
## this.
[Figure: histogram of log(Sample_Data)]

# Now compare Scaling then log transform vs Log transform then scaling
Sample_Data_Test <- data.frame(ScaleLog = log(scale(Sample_Data)), LogScale = scale(log(Sample_Data)))
## Warning: NaNs produced

h <- ggplot(melt(Sample_Data_Test), aes(x = value, fill = variable)) + geom_histogram(position = "identity", 
    alpha = 0.4)
## Using as id variables
h
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust
## this.
[Figure: overlaid histograms of ScaleLog and LogScale for the simulated data]
I'm not sure what to make of this. It's clear that the order of scaling and log transforming matters. However, I'm not sure which order makes more sense. It certainly seems that Log then Scale produces a nice centered distribution, though the result seems a bit leptokurtic.
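One way to put a number on the "leptokurtic" impression is to estimate excess kurtosis directly. The sketch below is an assumption-laden check, not part of the original analysis: `excess_kurtosis` is a hypothetical helper using a simple moment-based estimate (0 for a normal distribution, positive for heavier tails), and the seed is arbitrary. Since scale() is a linear transformation, it cannot change the shape of a distribution, so any kurtosis in LogScale comes from the log step alone.

```r
# Simulated data generated the same way as Sample_Data above (arbitrary seed)
set.seed(123)
sample_data <- exp(rnorm(n = 500, mean = 0, sd = 1))

# Hypothetical helper: simple moment-based estimate of excess kurtosis
excess_kurtosis <- function(v) {
  z <- (v - mean(v)) / sd(v)
  mean(z^4) - 3
}

k_raw <- excess_kurtosis(sample_data)
k_log <- excess_kurtosis(log(sample_data))
k_log_scaled <- excess_kurtosis(as.numeric(scale(log(sample_data))))

# Scaling after the log leaves the kurtosis estimate unchanged
print(c(raw = k_raw, log = k_log, log_then_scale = k_log_scaled))
```

For data simulated this way, the log-transformed values are a normal sample, so the estimate should sit near 0. A peaked look in the overlaid histogram may partly reflect the shared default bin width rather than the data.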
However, one observation is very clear: by scaling the data first, I ended up with many values below 0, because centering makes every below-average observation negative. The log of a negative number is undefined, so R returned NaN for those values and they were dropped from the plots. This happened in both examples. This definitely leads me to think that the order to do things is Log then Scale.
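It is the centering part of scale(), not the division by the standard deviation, that causes the trouble. A minimal sketch on simulated data (arbitrary seed, hypothetical variable names): dividing by a positive constant before the log only shifts the logged values by a constant, because log(x / s) = log(x) - log(s), so a later standardization gives the same result either way.

```r
# Simulated positive data (arbitrary seed)
set.seed(7)
x <- exp(rnorm(n = 500, mean = 0, sd = 1))
s <- sd(x)

# Divide by the standard deviation only (no centering), then log
divide_then_log <- log(x / s)

# Log first, then subtract the constant log(s)
log_then_shift <- log(x) - log(s)

# Standardizing afterwards removes the constant shift entirely
std_a <- as.numeric(scale(divide_then_log))
std_b <- as.numeric(scale(log(x)))

print(all.equal(divide_then_log, log_then_shift))
print(all.equal(std_a, std_b))
```

In other words, rescaling without centering is harmless before a log transform, whereas centering first produces negative values that the log cannot handle.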