ggplot2

Layered Grammar of Graphics in R
Posts tagged "scales"

Inspired by writing an answer to this question on StackOverflow, I decided to write up a more detailed description of creating a new transformation using the scales package (and also to make sure that I understood all the details about how to really do it).

Background

To start with, it helps to understand the philosophy behind the scales package. From the description of the scales package:

Scales map data to aesthetics, and provide methods for automatically determining breaks and labels for axes and legends.

Within the realm of scales, a transformation allows for a manipulation of the data space prior to its mapping to an aesthetic. In particular, it is responsible for

  • The mapping, in both directions, between the data space and an intermediate representation space
  • Providing a mechanism for determining “nice” breaks in the data space
  • Providing a mechanism for formatting the labels in the data space
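As a rough illustration of these three responsibilities, the sketch below pokes at the built-in log10 transformation. It assumes a version of scales in which `log10_trans()` returns a list-like object with `transform`, `inverse`, `breaks` and `format` elements; the variable names `tr` and `brks` are only for illustration.

```r
library(scales)

tr <- log10_trans()

# Mapping in both directions between the data space and the
# intermediate representation space
tr$transform(c(1, 10, 100))   # data space -> intermediate space
tr$inverse(c(0, 1, 2))        # intermediate space -> data space

# Breaks are computed from a length-2 range given in the data space
brks <- tr$breaks(c(1, 1000))
brks

# Labels are produced from those breaks
tr$format(brks)
```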

There are two main use cases for a transformation:

  • Taking an existing continuous scale and performing a functional transformation of it prior to mapping. For example, taking the logarithm, exponential, square root, reciprocal, inverse, etc. of a variable.
  • Providing a way of handling a variable of a type which represents a continuous quantity, but has specific structure and/or formatting conventions, typically represented with a class. Prototypical examples of this are dates and datetimes.

These variable transformations take place before any stats are performed on the data. In terms of their effect on the data, they are equivalent to using the transformed variable itself (though the axis breaks and labels are different). Quoting from ggplot2: Elegant Graphics for Data Analysis (page 100):

Of course, you can also perform the transformation yourself. For example, instead of using scale_x_log(), you could plot log10(x). That produces an identical result inside the plotting region, but the axis and tick labels won’t be the same. If you use a transformed scale, the axes will be labelled in the original data space. In both cases, the transformation occurs before the statistical summary.

I reproduce Figure 6.4 using the current version of the code because it is different than what was published.

qplot(log10(carat), log10(price), data=diamonds)
qplot(carat, price, data=diamonds) + 
  scale_x_log10() + scale_y_log10()

Building blocks

The pieces that are needed to create a transformation are described on the help page for trans_new, but I’ll go through them in more detail.

transform and inverse

These are the workhorses of the transformation and define the functions that map from the original data space to the intermediate data space (transform) and back again (inverse). These can be specified as a function (an anonymous function or a function object) or as a character string which will cause a function of that name to be used (as determined by match.fun).

Each of these functions should take a vector of values and return a vector of values of the same length. Calling inverse on the results of transform should result in the original vector (to within any error introduced by floating point arithmetic). That is, all.equal(inverse(transform(x)), x) should be TRUE for any x (for which transform is defined; see domain below).

Both of these functions are required.
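The round-trip requirement can be checked directly in base R. The pair of functions below (`sqrt_fwd` and `sqrt_inv`) is a hypothetical example, not something taken from scales.

```r
# Hypothetical pair of functions for a square-root transformation
sqrt_fwd <- function(x) sqrt(x)
sqrt_inv <- function(x) x^2

x <- c(0, 0.25, 2, 100)

# Same length in, same length out
same_length <- length(sqrt_fwd(x)) == length(x)

# The round trip recovers the original values, up to floating point error
round_trip <- isTRUE(all.equal(sqrt_inv(sqrt_fwd(x)), x))

# A character string is resolved to a function by match.fun
f <- match.fun("sqrt")
resolved <- identical(f, sqrt)

c(same_length = same_length, round_trip = round_trip, resolved = resolved)
```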

breaks

breaks is a function which takes a vector of length 2 which represents the range of the data, expressed in the original data space, that is to be represented. This will include any requested expansion, in addition to the actual data values. breaks should return a vector of whatever length it deems appropriate such that each break is represented by one element of the vector. Optionally, the vector can be a named vector. If it is, the default formatter will use the names as the displayed version of the values.

In general, this is a hard problem, primarily because breaks should look “nice”, which is difficult for an algorithm to determine. Luckily, others have spent time working on the problem, and often much of what they have learned and implemented can be used without having to do much yourself. In particular, there are existing break determination algorithms in scales: pretty_breaks (which is based on base::pretty), which finds breaks for a simple numeric scale; extended_breaks, which is based on extensions of work by Wilkinson and covers the same territory; log_breaks, which gives integer breaks on a log-transformed scale; and date_breaks, which works with date data.

All these functions are generators, meaning that they are functions which return functions which do the actual work of finding the breaks. The generators can take parameters which define the properties of the breaking algorithm, such as the number of breaks, the base of the logarithm, or the spacing in time between dates.

This argument is optional, and if not supplied a default algorithm is used which will evenly space the ticks in the original data space.
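To make the generator idea concrete, here is a small sketch using break generators from scales, assuming the argument names `n` and `base` as in current releases. The function `named_breaks` is a hypothetical hand-written break function, included only to show the named-vector option.

```r
library(scales)

# Each call returns a function; the parameters tune the algorithm
lin_breaks <- pretty_breaks(n = 5)
ext_breaks <- extended_breaks(n = 5)
lg_breaks  <- log_breaks(base = 10)

# The returned functions take the range of the data (a length-2 vector)
lin_breaks(c(0, 97))
ext_breaks(c(0, 97))
lg_breaks(c(1, 10000))

# Hypothetical break function returning a named vector; the names are
# what the default formatter would display
named_breaks <- function(range) {
  b <- seq(range[1], range[2], length.out = 3)
  names(b) <- c("low", "mid", "high")
  b
}
named_breaks(c(0, 10))
```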

format

format is a function which takes a vector of values in the original data space (those returned by the breaks function) and returns either a character vector of the same length or a list of expressions of the same length. The latter is useful for making expressions that can be handled by plotmath.

scales includes many formatting functions including comma_format, which puts commas between thousands, millions, billions, etc.; dollar_format, which rounds to either cents or dollars (threshold definable in generator) and adds a “$” in front and commas; percent_format, which multiplies by 100 and adds a percent sign (“%”); and parse_format and math_format, which aid in making plotmath expressions for labels.

As with breaks, all the functions are generators which means that they are functions which return functions. The returned function is the one that takes a single vector, and is what is assigned to the format argument.

This argument is optional and if not supplied, the default algorithm will use any names returned with the breaks. If there are no names, then format is called on the passed values.
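A short sketch of the formatter generators mentioned above, called with their default parameters. Exact rounding behaviour may differ between scales versions, so no output is shown here.

```r
library(scales)

# Generators: each call returns the function that does the formatting
fmt_comma   <- comma_format()
fmt_dollar  <- dollar_format()
fmt_percent <- percent_format()
fmt_math    <- math_format()   # default expression is 10^x

# The returned functions take a vector of break values
lab_comma   <- fmt_comma(c(1000, 1234567))
lab_dollar  <- fmt_dollar(c(5, 1500))
lab_percent <- fmt_percent(c(0.05, 0.5))
lab_math    <- fmt_math(1:3)   # plotmath expressions, one per value

lab_comma
lab_dollar
lab_percent
lab_math
```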

domain

The domain is the set of values over which the transform function is defined. For example, square root is only defined for non-negative values and logarithm only for positive values (ggplot does not deal with complex values), and an arcsine transformation would only be defined between -1 and 1. This is represented by a length 2 vector of the endpoints (inclusive) of the defined range.

This argument is optional, and if missing it is assumed that the transformation is valid over all numeric values.

name

A character string to identify the transformation. It is used in summary output, but has no computational value.
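The sketch below puts domain and name together in a small transformation built with trans_new. The function `asin_demo_trans` and its name string are hypothetical, and the arguments are passed by name because the positional order of trans_new may not be the same in every scales version.

```r
library(scales)

# Hypothetical arcsine transformation, valid only on [-1, 1]
asin_demo_trans <- function() {
  trans_new(
    name      = "asin-demo",
    transform = "asin",      # character strings are resolved via match.fun
    inverse   = "sin",
    domain    = c(-1, 1)
  )
}

tr <- asin_demo_trans()
tr$name     # identifier used in printed output
tr$domain   # endpoints of the range on which transform is defined

# Built-in transformations carry a domain too
sqrt_trans()$domain
```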

Example: reverse logarithm

There are built-in transformations for logarithms and for reversing a scale, but there is not one to do both at once (largest to smallest, left to right). The code for each of these, taken from the scales package, is

log_trans <- function(base = exp(1)) {
  trans <- function(x) log(x, base)
  inv <- function(x) base ^ x
  trans_new(str_c("log-", format(base)), trans, inv,
    log_breaks(base = base), domain = c(1e-100, Inf))
}

reverse_trans <- function() {
  trans_new("reverse", function(x) -x, function(x) -x)
}

To make a reversed log scale, the breaks are the same as for a regular log scale, so that part does not need to be recreated. In fact, much of log_trans can be reused with changes being made to just the transformation and inverse functions.

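reverselog_trans <- function(base = exp(1)) {
    trans <- function(x) -log(x, base)
    inv <- function(x) base^(-x)
    # breaks and domain are passed by name so the call does not depend on
    # the positional order of trans_new's arguments
    trans_new(paste0("reverselog-", format(base)), trans, inv,
              breaks = log_breaks(base = base), domain = c(1e-100, Inf))
}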

I have opted here to follow the general pattern of the other trans functions making this a generator of the transformation, parameterizable by the base of the logarithm. Since I’ve used the _trans naming convention, I can also just call it (with the default parameters) as a string in the trans argument of scale_x_continuous. Some examples of it at work:

dat <- data.frame(x=1:20, y=1:20)

ggplot(dat, aes(x,y)) + geom_point() +
    scale_x_continuous(trans="reverselog")

ggplot(dat, aes(x,y)) + geom_point() +
    scale_x_continuous(trans=reverselog_trans(base=2))
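Before plotting, it can be reassuring to check the new transformation against the round-trip requirement described earlier. The block below repeats the definition of reverselog_trans so that it runs on its own (with breaks passed by name, which is an assumption-safe choice across scales versions); the test values in `x` are arbitrary.

```r
library(scales)

# Same definition as in the post, repeated so this block is self-contained
reverselog_trans <- function(base = exp(1)) {
  trans <- function(x) -log(x, base)
  inv <- function(x) base^(-x)
  trans_new(paste0("reverselog-", format(base)), trans, inv,
            breaks = log_breaks(base = base), domain = c(1e-100, Inf))
}

rl <- reverselog_trans(base = 10)
x <- c(1, 10, 100, 1000)

# Larger data values map to smaller transformed values (the reversal)
tx <- rl$transform(x)
is_reversed <- all(diff(tx) < 0)

# inverse undoes transform, up to floating point error
round_trip <- isTRUE(all.equal(rl$inverse(tx), x))

c(is_reversed = is_reversed, round_trip = round_trip)
```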

Reproducibility details

R-2.15.1, ggplot2-0.9.1, scales-0.2.1

code available at https://github.com/BrianDiggs/trans

There are many resources on the use of colours in R, several packages, and a number of schemes already implemented in ggplot2. In the previous part, we saw how ggplot2 selects a default colour palette according to the type of variable, discrete or continuous. There are further options, illustrated below:

[Figure: default ggplot palettes]

Choosing colours for a graphic is often some kind of a compromise. On the one hand, you want the computer, some algorithm, to choose a sensible colour scheme and pick automatically the required number of colours from this scale. On the other hand, there are always external human preferences that constrain the choices, and are not always easy to formalise.

Some choices, even prevalent in the literature such as the rainbow color scale (also known as Matlab’s flashy colorjet),

[Figure: Matlab-style rainbow colour scale]

are just not good enough. They introduce artefacts, highlight regions of the data that should have a smooth transition with their surroundings, and do not degrade gracefully in black-and-white print, or when viewed by colour-impaired people.

If good colours for scientific graphics are not (entirely) in the eye of the beholder, what are the guides to make the best choice?

A recent blog post illustrates the search for a pleasing colour scheme in bar graphs. On the default HCL (Hue Chroma Luminance, pdf) choice of ggplot2 for discrete variables, the author remarks

The colour choice is not a bad one, but there’s something about the intensity of the colours that makes me want to find a new set of colours somewhat more soothing to my eyes.

and documents his heuristic search for satisfying colours,

I shuffled through many different colours on the Color Hex website, and nothing else seemed to work with me as I wasn’t selecting colours based on any theory

A good discussion is offered in the colorspace package and its accompanying vignettes and papers, e.g. Escaping RGBland: Selecting Colors for Statistical Graphics (pdf)

Despite this omnipresence of color, there is often only little guidance in statistical software packages on how to choose a palette appropriate for a particular visualization task

In this instance, I would argue that the hcl colour scale of ggplot2 is a good start for a well-balanced graphic that doesn’t draw the attention to a particular colour. If the colours are too flashy in bar plots (large areas), the saturation and luminosity can easily be muted by tuning the scale.
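A minimal sketch of that kind of tuning, assuming the `c` (chroma) and `l` (luminance) arguments of scale_fill_hue. The `mpg` data set ships with ggplot2, and the values 40 and 75 are arbitrary choices for illustration rather than recommendations.

```r
library(ggplot2)

# Bar plot with the fill mapped to a discrete variable
p <- ggplot(mpg, aes(class, fill = class)) + geom_bar()

# Default hue scale (chroma 100, luminance 65)
p_default <- p + scale_fill_hue()

# Lower chroma and higher luminance give softer colours
p_muted <- p + scale_fill_hue(c = 40, l = 75)

p_default
p_muted
```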

This basic idea of tuning the HCL colour scale to suit the application was discussed in more depth in Colour for Presentation Graphics (pdf). Bar plots and maps can also benefit from trying a few different colour palettes from the excellent ColorBrewer website. An interface is provided in R and ggplot2 through the RColorBrewer package.

The RColorBrewer package

Easily accessed with scale_colour_brewer(), it is trivial to choose among 35 palettes (see RColorBrewer::display.brewer.all()).

Sequential palettes, suited to ordered data that progress from low to high. Lightness steps dominate the look of these schemes, with light colors for low data values to dark colors for high data values.

[Figure: RColorBrewer sequential palettes]

Qualitative palettes, do not imply magnitude differences between legend classes, and hues are used to create the primary visual differences between classes. Qualitative schemes are best suited to representing nominal or categorical data.

[Figure: RColorBrewer qualitative palettes]

Diverging palettes, put equal emphasis on mid-range critical values and extremes at both ends of the data range. The critical class or break in the middle of the legend is emphasized with light colors and low and high extremes are emphasized with dark colors that have contrasting hues.

[Figure: RColorBrewer diverging palettes]
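For reference, one palette of each type can be pulled out directly with RColorBrewer and applied through ggplot2. The palette names used here ("Blues", "Set2", "RdBu") are just examples picked from the package, and the plot uses the `mpg` data set from ggplot2.

```r
library(ggplot2)
library(RColorBrewer)

# Number of palettes known to the package
n_palettes <- nrow(brewer.pal.info)
n_palettes

# Raw colours from one palette of each type
seq_cols  <- brewer.pal(5, "Blues")   # sequential
qual_cols <- brewer.pal(5, "Set2")    # qualitative
div_cols  <- brewer.pal(5, "RdBu")    # diverging

# A qualitative palette applied to a discrete colour scale
p <- ggplot(mpg, aes(displ, hwy, colour = drv)) + geom_point()
p_qual <- p + scale_colour_brewer(palette = "Set2")
p_qual
```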

In the next post, we’ll look at some special cases where the user might want finer control over these scales, or define completely new colour palettes tailored for a specific graphic.

In this series of three posts, we’ll look at colours in R graphics produced with ggplot2: what are the available choices of colour schemes, and how to choose a colour palette most suitable for a particular graphic?

In kindergarten, choosing a colour was easy, palettes were limited to a few classics. As cool kids grow older and use R, the spectrum expands to present us with overwhelming choice of millions of colours, most of them with poorly defined labels such as "#A848F2" or "lavenderblush3". Inasmuch as scientific graphics resemble a paint-by-numbers game, R can help us design more elegant palettes with pertinent colour choices based on the data to display.

Overview of basic colour functions in R

Base graphics rely mostly on the grDevices package for the selection of colours, with a few palettes to choose from:

(some palettes can have many more colours, this image is only an illustration of their structure)
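As a stand-in for the illustration, the following sketch draws one strip of swatches for several of the grDevices palettes. The number of colours `n` and the plot layout are arbitrary choices.

```r
n <- 7

# Each palette function returns a character vector of n hex colour codes
palettes <- list(
  rainbow = rainbow(n),
  heat    = heat.colors(n),
  terrain = terrain.colors(n),
  topo    = topo.colors(n),
  cm      = cm.colors(n)
)
str(palettes)

# One strip of n swatches per palette
op <- par(mfrow = c(length(palettes), 1), mar = c(0.5, 4, 0.5, 0.5))
for (nm in names(palettes)) {
  image(matrix(seq_len(n)), col = palettes[[nm]], axes = FALSE)
  mtext(nm, side = 2, las = 1, line = 0.5)
}
par(op)
```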

The package also provides a number of basic operations to convert colours (adjustcolor, col2rgb, make.rgb, rgb2hsv, convertColor) and create interpolating palettes (rgb, hsv, hcl, gray, colorRamp, colorRampPalette, densCols, gray.colors).
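A few of these functions in action, using only base R. The colours are the two labels mentioned earlier in this post plus arbitrary endpoints for the ramp.

```r
# Named colour -> RGB values (a 3 x 1 integer matrix)
rgb_mat <- col2rgb("lavenderblush3")
rgb_mat

# Hex colour -> RGB -> HSV
hsv_mat <- rgb2hsv(col2rgb("#A848F2"))
hsv_mat

# Semi-transparent version of a colour
half_red <- adjustcolor("red", alpha.f = 0.5)
half_red

# Interpolating palette between two colours
ramp <- colorRampPalette(c("white", "steelblue"))
ramp_cols <- ramp(5)
ramp_cols
```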

Beyond that, a good resource is the colorspace package which provides further utilities to convert from one colorspace to another (HLS, HSV, LAB, LUV, RGB, sRGB, XYZ) and perform various operations on colours. A special note can be made of a few palette functions, “diverge_hcl”, “diverge_hsv”, “heat_hcl”, “rainbow_hcl”, “sequential_hcl”, “terrain_hcl”, which provide an easy way to produce colour palettes following a particular path in the colour space (varying hue with constant luminosity and saturation, for example).

Other packages such as RColorBrewer, munsell and dichromat provide more colour palettes and utilities.

While the combination of these tools is quite flexible, the user interface becomes a little bit chaotic. More recently, the scales package has provided wrappers around these functions to provide some consistency in the naming schemes and organise the different categories of palettes in a structured way:

Utility functions, such as col2hcl, fullseq, muted, rescale, rescale_mid, rescale_none, rescale_pal, seq_gradient_pal, show_col

Palettes with consistent interface, brewer_pal, dichromat_pal, gradient_n_pal, div_gradient_pal, hue_pal, grey_pal, identity_pal, manual_pal.
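A brief sketch of a few of these helpers, assuming a scales version that still exports them under these names. The palette functions are generators: they return a function that is then called with the number of colours wanted.

```r
library(scales)

# Utilities
scaled <- rescale(1:5)        # map values linearly onto [0, 1]
scaled
soft_red <- muted("red")      # a less saturated version of a colour
soft_red

# Palettes: call the generator, then ask for n colours
blues <- brewer_pal(palette = "Blues")(5)
hues  <- hue_pal()(4)
greys <- grey_pal()(3)

# Quick visual check of a set of colours
show_col(c(blues, hues, greys))
```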

The ggplot2 package uses scales internally, and mirrors this structure. In this first part, we’ll review the basic commands to assign colours in ggplot2.

Colours in ggplot2

Let’s consider three plots for illustration:

p1 maps the colour of points to a continuous variable, p2 maps the fill of bars to a discrete variable, and p3 maps the fill of tiles to a continuous variable.
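The original plots are not reproduced here; the sketch below builds three stand-ins that match the description, using data sets shipped with R and ggplot2 (`mtcars` and `faithfuld`). The choice of variables is mine and purely illustrative.

```r
library(ggplot2)

# p1: colour of points mapped to a continuous variable
p1 <- ggplot(mtcars, aes(wt, mpg, colour = hp)) +
  geom_point()

# p2: fill of bars mapped to a discrete variable
p2 <- ggplot(mtcars, aes(factor(cyl), fill = factor(cyl))) +
  geom_bar()

# p3: fill of tiles mapped to a continuous variable
p3 <- ggplot(faithfuld, aes(waiting, eruptions, fill = density)) +
  geom_tile()

p1
p2
p3
```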

Colour vs fill aesthetic

Fill and colour scales in ggplot2 can use the same palettes. Some shapes such as lines only accept the colour aesthetic, while others, such as polygons, accept both colour and fill aesthetics. In the latter case, the colour refers to the border of the shape, and the fill to the interior.
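A small sketch of the difference, with a made-up data frame `d`: the bars take both a border colour and an interior fill, while the line only has a colour.

```r
library(ggplot2)

# Made-up data for illustration
d <- data.frame(x = c("a", "b", "c"), y = c(2, 3, 1))

# Bars: colour is the border, fill is the interior
bars <- ggplot(d, aes(x, y)) +
  geom_col(colour = "black", fill = "grey80")

# Lines: only the colour aesthetic applies
lines <- ggplot(d, aes(x, y, group = 1)) +
  geom_line(colour = "red")

bars
lines
```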

Aesthetic mapping vs set values

Another common source of confusion, general to ggplot2, is the distinction between set values and mapped values in a layer. Consider the following example,

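 library(ggplot2)
 library(gridExtra)  # provides grid.arrange

 d <- data.frame(x = 1:10, y = rnorm(10), z = gl(5, 2))
 a <- ggplot(d, aes(x, y, group = z))

 # Left: colour is set to a constant, so every path is red.
 # Right: colour is mapped to z inside aes(), so each level of z gets its own colour.
 grid.arrange(a + geom_path(colour = "red"),
              a + geom_path(aes(colour = z)),
              nrow = 1)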

[Figure: set colour (left) vs mapped colour (right)]

Continuous scales

The default continuous scale in ggplot2 is a blue gradient, from low = "#132B43" to high = "#56B1F7" which can be reproduced as

  scales::seq_gradient_pal(low = "#132B43", high = "#56B1F7", space = "Lab")

[Figure: default continuous colour scale]
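The call above returns a palette function rather than colours. As a sketch of how it is used, the function can be evaluated at positions between 0 and 1; the positions chosen here are arbitrary.

```r
library(scales)

pal <- seq_gradient_pal(low = "#132B43", high = "#56B1F7", space = "Lab")

# The palette maps positions in [0, 1] to colours along the gradient
ends_and_middle <- pal(c(0, 0.5, 1))
ends_and_middle

# A finer sampling, displayed as swatches
steps <- pal(seq(0, 1, length.out = 9))
show_col(steps)
```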

Discrete scales

The default discrete scale in ggplot2 is a range of hues from hcl,

  scales::hue_pal(h = c(0, 360) + 15, c = 100, l = 65, h.start = 0, 
                          direction = 1)

[Figure: default discrete colour scale]
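As with the continuous case, hue_pal returns a function. Calling it with the number of levels gives the colours ggplot2 would assign to a discrete variable with that many levels; the counts 3 and 6 below are arbitrary.

```r
library(scales)

pal <- hue_pal(h = c(0, 360) + 15, c = 100, l = 65, h.start = 0,
               direction = 1)

# Evenly spaced hues for 3 and for 6 levels
three <- pal(3)
six   <- pal(6)
three
six

show_col(six)
```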

In the next post of this series we’ll describe how one can fine-tune or change altogether these default colours, and, perhaps more importantly, give some pointers on choosing an appropriate colour scheme for a particular graphic.


Source code for the graphs