During a staff meeting at work, the price of solid state hard drives came up (what are the prices, are they nonlinear in size, etc.). I decided to sample 120 solid state hard drives from newegg.com and record their size (in GB) and price (in USD) as well as their class (SATA II or SATA III). Note that the sampling was semi-random: I had no particular agenda, but I did not go to great lengths to sample randomly. To look at the data, I used ggplot2.
## read the data and treat class as a factor
ssd <- read.csv("http://joshuawiley.com/files/ssd.csv")
ssd$class <- factor(ssd$class)
library(ggplot2)

## first pass
p <- ggplot(ssd, aes(x = price, y = size, colour = class)) +
  geom_point()
print(p)
Not too bad, but the data are sparser at higher sizes and prices, so we can use a log-log scale to make things a little easier to see, and add locally weighted regression (loess) lines to assess linearity (or lack thereof).
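## add smooths and log scales to make the pattern clearer
## (breaks start at 100 because 0 cannot be shown on a log axis)
p <- p + stat_smooth(method = "loess", se = FALSE) +
  scale_x_log10(breaks = seq(100, 1000, 100)) +
  scale_y_log10(breaks = seq(100, 600, 100))
print(p)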
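As an aside on why log-log axes help when assessing linearity: a power-law relationship of the form size = a * price^b becomes a straight line with slope b once both variables are logged. The sketch below is only an illustration with made-up numbers (a = 2 and b = 0.9 are arbitrary choices, not estimates from the SSD data), and it uses base R only.

```r
## Made-up prices and an exact power law, for illustration only
## (these are NOT values from the Newegg sample)
price <- c(50, 100, 200, 400, 800)
size  <- 2 * price^0.9

## On the log10 scale the relationship is linear:
## log10(size) = log10(2) + 0.9 * log10(price)
fit <- lm(log10(size) ~ log10(price))
loglog <- unname(coef(fit))

a_hat <- 10^loglog[1]  # back-transformed multiplier
b_hat <- loglog[2]     # exponent, i.e. the slope on the log-log plot
print(c(a = a_hat, b = b_hat))
```

Read this way, a loess line that bends noticeably on the log-log plot would suggest that a single power law does not describe the whole price range.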
Okay, that is nice. Lastly, let’s add better labels, keep the x-axis text from overlapping, and annotate the plot with the intercept and slope of the linear line of best fit for each class of hard drive.
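## fit a model with a separate intercept and slope for each class
m <- lm(size ~ 0 + class * price, data = ssd)
est <- round(coef(m), 2)
## coefficient order: class II intercept, class III intercept,
## class II slope, slope difference for class III
## (assumes SATA II is the first level of class)
size2 <- paste0("II Size = ", est[1], " + ", est[3], "price")
size3 <- paste0("III Size = ", est[2], " + ", est[3] + est[4], "price")

## finalize (labs() and theme() replace opts() and theme_text(),
## which were removed from later ggplot2 releases)
p <- p +
  annotate("text", x = 100, y = 600, label = size2) +
  annotate("text", x = 100, y = 500, label = size3) +
  labs(x = "Price in USD", y = "Size in GB",
       title = "Log-Log Plot of SSD Size and Price") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
print(p)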
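To see how the coefficients of a model like size ~ 0 + class * price translate into one intercept and one slope per class, here is a small sketch on toy data. The data frame toy, its values, and the names m_toy and lines_toy are all invented for illustration and have nothing to do with the Newegg sample; the two classes are built to lie exactly on known lines so the mapping is easy to check.

```r
## Toy data, invented for illustration only (not the Newegg sample)
toy <- data.frame(
  class = factor(rep(c("II", "III"), each = 4)),
  price = rep(c(100, 200, 300, 400), times = 2)
)
## Exact lines: class II  size = 10 + 0.5 * price
##              class III size = 20 + 0.8 * price
toy$size <- ifelse(toy$class == "II",
                   10 + 0.5 * toy$price,
                   20 + 0.8 * toy$price)

m_toy <- lm(size ~ 0 + class * price, data = toy)

## Coefficient names, in order:
## classII, classIII, price, classIII:price
print(names(coef(m_toy)))

b <- unname(coef(m_toy))
lines_toy <- data.frame(
  class     = levels(toy$class),
  intercept = c(b[1], b[2]),        # one intercept per class (no global intercept)
  slope     = c(b[3], b[3] + b[4])  # class III slope = baseline slope + interaction
)
print(lines_toy)
```

The point of the sketch is that the fourth coefficient is a difference in slopes rather than a slope in its own right, so the class III slope has to be assembled from the third and fourth coefficients.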
(guest post by Joshua Wiley)