Layered Grammar of Graphics in R
Contributing Authors

Inspired by writing an answer to this question on StackOverflow, I decided to write up a more detailed description of creating a new transformation using the scales package (and also to make sure that I understood all the details about how to really do it).


To start with, it helps to understand the philosophy behind the scales package. From the description of the scales package:

Scales map data to aesthetics, and provide methods for automatically determining breaks and labels for axes and legends.

Within the realm of scales, a transformation allows for a manipulation of the data space prior to its mapping to an aesthetic. In particular, it is responsible for

  • The mapping, in both directions, between the data space and an intermediate representation space
  • Providing a mechanism for determining “nice” breaks in the data space
  • Providing a mechanism for formatting the labels in the data space

There are two main use cases for a transformation:

  • Taking an existing continuous scale and performing a functional transformation of it prior to mapping. For example, taking the logarithm, exponential, square root, reciprocal, inverse, etc. of a variable.
  • Providing a way of handling a variable of a type which represents a continuous quantity, but has specific structure and/or formatting conventions, typically represented with a class. Prototypical examples of this are dates and datetimes.

These variable transformations take place before any stats are performed on the data. In fact, they are equivalent, in terms of their effect on the data, to applying the transform to the variable itself (though the axis breaks and labels are different). Quoting from ggplot2: Elegant Graphics for Data Analysis (page 100):

Of course, you can also perform the transformation yourself. For example, instead of using scale_x_log10(), you could plot log10(x). That produces an identical result inside the plotting region, but the axis and tick labels won’t be the same. If you use a transformed scale, the axes will be labelled in the original data space. In both cases, the transformation occurs before the statistical summary.

I reproduce Figure 6.4 using the current version of the code because it is different from what was published.

qplot(log10(carat), log10(price), data=diamonds)
qplot(carat, price, data=diamonds) + 
  scale_x_log10() + scale_y_log10()

Building blocks

The pieces that are needed to create a transformation are described on the help page for trans_new, but I’ll go through them in more detail.

transform and inverse

These are the workhorses of the transformation and define the functions that map from the original data space to the intermediate data space (transform) and back again (inverse). These can be specified as a function (an anonymous function or a function object) or as a character string giving the name of a function to use (looked up by name).

Each of these functions should take a vector of values and return a vector of values of the same length. Calling inverse on the results of transform should recover the original vector (to within any error introduced by floating point arithmetic). That is, all.equal(inverse(transform(x)), x) should be TRUE for any x for which transform is defined (see domain below).

Both of these functions are required.
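As a minimal sketch of these two pieces, a transformation can be built from just a transform and an inverse, and the round-trip property checked directly. (The cube-root transformation here is my own illustrative example, not one shipped with scales.)

```r
library(scales)

## Hypothetical example: a cube-root transformation defined only by
## its transform and inverse functions.
cuberoot_trans <- function() {
  trans_new("cuberoot",
            transform = function(x) x ^ (1 / 3),
            inverse   = function(x) x ^ 3)
}

tr <- cuberoot_trans()
x <- c(1, 8, 27, 64)

## inverse(transform(x)) should recover x, up to floating point error
all.equal(tr$inverse(tr$transform(x)), x)  # TRUE
```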


breaks

breaks is a function which takes a vector of length 2 representing the range of the data to be displayed, expressed in the original data space. This range includes any requested expansion in addition to the actual data values. breaks should return a vector of whatever length it deems appropriate, with each break represented by one element of the vector. Optionally, the vector can be named; if it is, the default formatter will use the names as the displayed version of the values.

In general, this is a hard problem, primarily because breaks should look “nice”, which is difficult for an algorithm to determine. Luckily, others have spent time working on the problem, and much of what they have learned and implemented can often be used without having to do much yourself. In particular, there are existing break determination algorithms in scales such as pretty_breaks (which is based on base::pretty), which finds breaks for a simple numeric scale; extended_breaks, which is based on extensions of work by Wilkinson covering the same territory; log_breaks, which gives integer breaks on a log-transformed scale; and date_breaks, which works with date data.

All these functions are generators, meaning that they are functions which return functions which do the actual work of finding the breaks. These functions can take parameters which define the properties of the breaking algorithm, such as the number of breaks, the base of the logarithm, or the spacing in time between dates.

This argument is optional, and if not supplied a default algorithm is used which will evenly space the ticks in the original data space.
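The generator pattern can be seen by calling one of these break functions directly. The outputs shown are what I would expect from the current scales package; the exact breaks chosen may differ slightly across versions.

```r
library(scales)

## Break generators return functions; the returned function takes the
## data range and returns the break positions.
pb <- pretty_breaks(n = 5)
pb(c(0, 100))
#> [1]   0  20  40  60  80 100

lb <- log_breaks(base = 10)
lb(c(1, 1000))
#> [1]    1   10  100 1000
```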


format

format is a function which takes a vector of values in the original data space (those returned by the breaks function) and returns either a character vector of the same length or a list of expressions of the same length. The latter is useful for making expressions that can be handled by plotmath.

scales includes many formatting functions, including comma_format, which puts commas between thousands, millions, billions, etc.; dollar_format, which rounds to either cents or dollars (threshold definable in the generator) and adds a “$” in front and commas; percent_format, which multiplies by 100 and adds a percent sign (“%”); and parse_format and math_format, which aid in making plotmath expressions for labels.

As with breaks, all the functions are generators which means that they are functions which return functions. The returned function is the one that takes a single vector, and is what is assigned to the format argument.

This argument is optional and if not supplied, the default algorithm will use any names returned with the breaks. If there are no names, then format is called on the passed values.
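The same generator pattern applies to formatters, as a quick sketch shows (default generator parameters; exact output strings may vary slightly between scales versions):

```r
library(scales)

## Formatters are generators too: comma_format() and percent_format()
## return functions that turn numeric breaks into label strings.
cf <- comma_format()
cf(c(1000, 1500000))
#> [1] "1,000"     "1,500,000"

pf <- percent_format()
pf(c(0.1, 0.25))
#> [1] "10%" "25%"
```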


domain

The domain is the set of values over which the transform function is defined. For example, square root and logarithm are only defined for positive values (ggplot does not deal with complex values), and an arcsine transformation would only be defined between -1 and 1. The domain is represented by a length-2 vector giving the (inclusive) endpoints of the defined range.

This argument is optional, and if missing it is assumed that the transformation is valid over all numeric values.
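The built-in transformations illustrate this; for example, the square-root transformation shipped with scales carries a non-negative domain:

```r
library(scales)

## The built-in square-root transformation is only defined for
## non-negative values, and its domain records that.
sqrt_trans()$domain
#> [1]   0 Inf
```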


name

A character string to identify the transformation. It is used in summary output, but has no computational value.

Example: reverse logarithm

There are built-in transformations for logarithms and for reversing a scale, but there is not one that does both at once (running largest to smallest, left to right). The code for each of these, taken from the scales package, is

log_trans <- function(base = exp(1)) {
  trans <- function(x) log(x, base)
  inv <- function(x) base ^ x
  trans_new(str_c("log-", format(base)), trans, inv,
    log_breaks(base = base), domain = c(1e-100, Inf))
}

reverse_trans <- function() {
  trans_new("reverse", function(x) -x, function(x) -x)
}

To make a reversed log scale, the breaks are the same as for a regular log scale, so that part does not need to be recreated. In fact, much of log_trans can be reused with changes being made to just the transformation and inverse functions.

reverselog_trans <- function(base = exp(1)) {
    trans <- function(x) -log(x, base)
    inv <- function(x) base^(-x)
    trans_new(paste0("reverselog-", format(base)), trans, inv,
              log_breaks(base = base), domain = c(1e-100, Inf))
}

I have opted here to follow the general pattern of the other trans functions, making this a generator of the transformation, parameterized by the base of the logarithm. Since I’ve used the _trans naming convention, I can also just call it (with the default parameters) as a string in the trans argument of scale_x_continuous. Some examples of it at work:

dat <- data.frame(x=1:20, y=1:20)

ggplot(dat, aes(x,y)) + geom_point() +
  scale_x_continuous(trans = reverselog_trans(10))

ggplot(dat, aes(x,y)) + geom_point() +
  scale_x_continuous(trans = "reverselog")
Reproducibility details

R-2.15.1, ggplot2-0.9.1, scales-0.2.1

code available at

Jim Perkins is a scientific illustrator who recently contributed a guest post about the various meanings of DPI (dots-per-inch) at Symbiartic, a Scientific American blog about “The art of science and the science of art.” He also has a couple of previous posts about calibrating monitors.

Christopher Gandrud uses ggplot to illustrate his analysis of violence in national legislative chambers (e.g. Turkey, above). After gathering a data set of incidents of legislative violence, he applied logistic regression for rare events to identify the most important variables and the extent of their importance. He then predicted the probability of violence in a range of conditions with a round of simulations, depicted below.

Christopher discusses his approach to this plot in detail here.

The take-home on legislative violence: new democracies with poor concordance between votes from the electorate, seats in the legislature, and proportion of governmental power are more likely to see legislative violence.

Over a staff meeting at work, the topic of the price of solid-state hard drives came up (what do they cost, is price nonlinear with size, etc.). I decided to sample 120 solid-state hard drives and recorded their size (in GB) and price (in USD), as well as their class (SATA II or SATA III). Note that the sampling was semi-random: I had no particular agenda, but did not go to great lengths to sample randomly. To look at this, I used ggplot2.

 ssd <- read.csv("")
 ssd$class <- factor(ssd$class)

 ## first pass
 p <- ggplot(ssd, aes(x = price, y = size, colour = class)) +
  geom_point()
 print(p)

Scatter plot of Size and Price of SSDs

Not too bad, but the data is sparser at higher sizes and prices, so we can use a log-log scale to make it a little easier to see, and add locally weighted regression (loess) lines to assess linearity (or lack thereof).

 ## add smooths and log to make clearer
 p <- p +
  stat_smooth(se=FALSE) +
  scale_x_log10(breaks = seq(0, 1000, 100)) +
  scale_y_log10(breaks = seq(0, 600, 100))

Scatter plot of Size and Price of SSDs in log-10 scale with loess smooth lines

Okay, that is nice. Lastly, let’s add better labels, make the x-axis text not overlap, and include the intercept and slope parameters for the linear lines of best fit for each class of hard drive.

 ## fit separate intercept and slope for each class
 m <- lm(size ~ 0 + class + class:price, data = ssd)
 est <- round(coef(m), 2)

 size2 <- paste0("II Size = ", est[1], " + ", est[3], " * price")
 size3 <- paste0("III Size = ", est[2], " + ", est[4], " * price")

 ## finalize
 p <- p +
  annotate("text", x = 100, y = 600, label = size2) +
  annotate("text", x = 100, y = 500, label = size3) +
  labs(x = "Price in USD", y = "Size in GB") +
  opts(title = "Log-Log Plot of SSD Size and Price",
       axis.text.x = theme_text(angle = 45, hjust = 1, vjust = 1))

Fancy scatter plot of Size and Price of SSDs in log-10 scale with loess smooth lines

(guest post by Joshua Wiley)


In publications, presentations, and popular media, scientific results are predominantly communicated through graphs. But are these figures clear and honest or misleading? We examine current practices in data visualization and discuss improvements, advocating design choices which reveal data rather than hide it.

Written by Ramon Saccilotto, shared by @m4xl1n

This dynamic representation of the popularity of names over the years is a favorite. It’s not new, but I still find new things to appreciate, like names that used to apply to both sexes and now only one (Ellie), or vice versa (Harley). It seems like there are more very popular male names than very popular female names; I can hardly guess what underlies that.