ggbio

ggbio (package, publication) is an extension and specialization of ggplot designed for visualizing genomics annotations and high-throughput data.

A frequent requirement for plots of genomic data is indicating the relationship of measurements on different scales. For example, the x-axis of the lower plots in this image is the linear chromosome position, but the upper plot shows the expression levels of each exon (which are indicated in the lower image by vertical lines). In the past I’ve written code to generate the diagonal lines between the images that show the relationship of the two scales, and I’ll be happy if ggbio makes that unnecessary.

Defining a new transformation for ggplot2/scales - Part II

In my previous blog post, I explored what was needed to create a new transformation for the scales package and gave an example of a mathematical transformation. In this post, I want to show an additional example related to the other use case mentioned there (mapping a continuous-like variable with specific structure and formatting) and extend the example into creating new scale functions which integrate into ggplot even more directly.

Time
Dates and times are tricky to work with because they have detailed external constraints and conventions. Within the R ecosystem, several packages exist solely to deal with dates and times (chron, lubridate, date, mondate, timeDate, TimeWarp, etc.), and an article has appeared in R News on the topic (Brian D. Ripley and Kurt Hornik. Date-time classes. R News, 1(2):8-11, June 2001.).

There is already support for dates (using the `Date` class, via `date_trans` in `scales` and `scale_*_date` in `ggplot2`) and datetimes (using the `POSIXt` class, via `time_trans` in `scales` and `scale_*_datetime` in `ggplot2`). The piece that is missing is for time, separate from any date; “clock time”, if you will.

Exercising the first of the three great virtues of a programmer, laziness, it is worth seeing what has already been done (classes and functions) to deal with clock time.

The `chron` package has a class `times` which can specify times of day, independent of a date. Additionally, there are many supporting functions for this class:

```
> methods(class="times")
[1] [.times* [[.times* [<-.times*
[4] as.character.times* as.data.frame.times* axis.times*
[7] Axis.times* c.times* diff.times*
[10] format.times* hist.times* identify.times*
[13] is.na.times* lines.times* Math.times*
[16] mean.times* Ops.times* plot.times*
[19] points.times* pretty.times* print.times*
[22] quantile.times* summary.times* Summary.times*
[25] trunc.times* unique.times* xtfrm.times*
Non-visible functions are asterisked
```

Following the pattern of the previous post, each of the parts of the transformation can be determined.

`transform` and `inverse`
When dealing with a variable that is an instance of a class, `transform` must take the specific representation and convert it to a simple numeric representation (map to [part of] the real line in mathematical terms); `inverse` does the opposite functional mapping. Generally, this requires delving into the structure of the class to see how it is really put together. To do that, let’s create some data. The `times` documentation says it can convert a character vector (by default in 24-hour, minute, second format, separated by colons) to times.

```
library(chron)  # provides the times class
Time <- times(c("18:37:11", "16:51:34", "15:05:57", "13:20:20",
                "11:34:43", "09:49:06", "08:03:29", "06:17:52",
                "04:32:15", "02:46:38", "01:01:01"))
```

which if printed gives

```
> Time
[1] 18:37:11 16:51:34 15:05:57 13:20:20 11:34:43 09:49:06
[7] 08:03:29 06:17:52 04:32:15 02:46:38 01:01:01
```

So far, so good. But what does this object/class really look like?

```
> str(Time)
Class 'times' atomic [1:11] 0.776 0.702 0.629 0.556 0.482 ...
..- attr(*, "format")= chr "h:m:s"
> dput(Time)
structure(c(0.775821759259259, 0.702476851851852, 0.629131944444444,
0.555787037037037, 0.48244212962963, 0.409097222222222, 0.335752314814815,
0.262407407407407, 0.1890625, 0.115717592592593, 0.0423726851851852
), format = "h:m:s", class = "times")
```

`times` are just vectors with an attribute and a class. A little more digging and testing can show that the numeric part is just the fraction of a day that that time represents.

```
> str(times(c("00:00:00","6:00:00","12:00:00","23:59:59")))
Class 'times' atomic [1:4] 0 0.25 0.5 1
..- attr(*, "format")= chr "h:m:s"
> dput(times(c("00:00:00","6:00:00","12:00:00","23:59:59")))
structure(c(0, 0.25, 0.5, 0.999988425925926), format = "h:m:s", class = "times")
```
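
To make the "fraction of a day" reading concrete, the same numbers can be recomputed by hand from the hours, minutes and seconds. The helper below is only a sketch for illustration: `hms_to_day_fraction` is a made-up name (it is not part of `chron`), it uses base R only, and it assumes well-formed 24-hour `"hh:mm:ss"` strings.

```r
# Hypothetical helper (not part of chron): convert "hh:mm:ss" strings
# to the fraction of a day, using base R only.
hms_to_day_fraction <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) {
    p <- as.numeric(p)
    (p[1] * 3600 + p[2] * 60 + p[3]) / 86400  # 86400 seconds in a day
  }, numeric(1))
}

frac <- hms_to_day_fraction(c("00:00:00", "06:00:00", "12:00:00", "18:37:11"))
print(frac)
```

If `times` really stores values this way, the last element should agree with the first element of `dput(Time)` shown earlier (0.775821759259259), up to floating point error.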

Most of the work of creating a mapping to numeric values is already done; all that is needed is to strip off the class and attributes. `as.numeric()` does that nicely.

```
> as.numeric(Time)
[1] 0.77582176 0.70247685 0.62913194 0.55578704 0.48244213
[6] 0.40909722 0.33575231 0.26240741 0.18906250 0.11571759
[11] 0.04237269
```

That is only half the mapping. We also need to go from this representation to a `times` object. Looking at the constructor for `times`, it can take a numeric vector representing “number of days since an origin.” It’s not stated, but maybe times are then just fractions of a day?

```
> times(as.numeric(Time))
[1] 18:37:11 16:51:34 15:05:57 13:20:20 11:34:43 09:49:06
[7] 08:03:29 06:17:52 04:32:15 02:46:38 01:01:01
```

Sure looks like it.

```
> identical(Time, times(as.numeric(Time)))
[1] TRUE
```

So `transform` is just the `as.numeric` function and `inverse` is the `times` function.

Getting breaks on time right is important; an axis where the ticks are every 7 seconds is going to look odd (unless there is a really compelling reason), as would 25 seconds. In base graphics, the generic function `pretty` has the responsibility to find “nice” breaks. Looking at the methods for `times`, there is a `pretty.times`. Does it work (well enough)?

```
> pretty(Time)
[1] 03:00:00 06:00:00 09:00:00 12:00:00 15:00:00 18:00:00
attr(,"labels")
[1] 03:00 06:00 09:00 12:00 15:00 18:00
```

That’s pretty reasonable. Checking under the hood to see what is going on, `chron::pretty.times` calls `chron::pretty.chron` which calls `grDevices::pretty.POSIXt` which calls `grDevices::prettyDate`. Looking at the code for `prettyDate`, the allowed (sub-day) breaks are 1 second, 2 seconds, 5 seconds, 10 seconds, 15 seconds, 30 seconds, 1 minute, 2 minutes, 5 minutes, 10 minutes, 15 minutes, 30 minutes, 1 hour, 3 hours, 6 hours, and 12 hours. I might have added a 2 hour option, but its absence is not worth throwing away others’ work over. `pretty_breaks` already wraps `pretty` in the format expected by scales, so we can just use `pretty_breaks()` as the `breaks` function.

```
> pretty_breaks()(range(Time))
03:00 06:00 09:00 12:00 15:00 18:00
03:00:00 06:00:00 09:00:00 12:00:00 15:00:00 18:00:00
attr(,"labels")
[1] 03:00 06:00 09:00 12:00 15:00 18:00
```

format
Here, we almost catch another break. When `format` is not defined, the names that are associated with what `breaks` returns are, in principle, used. Unfortunately, in practice, this is not the case because inside ggplot code, the breaks get transformed back and forth between data spaces and lose their attributes (names). If we want the default formatting (full hour, minute and second), then this can simply be `format`. If we only want seconds to appear when they are not all 0 (when the increment is less than 1 minute), then we have to write our own function that passes the appropriate flag (`simplify`) as to whether the seconds should be suppressed.

```
fmt <- function(x) {
  format(x, simplify = !any(diff(x) < 1/(24*60)))
}
```

domain
Since `times` is defined in terms of a fraction of a day, it is only meaningful in the range 0 to 1 (inclusive on the left, exclusive on the right). `domain` does not have a way of defining inclusivity or exclusivity of the endpoints, so the domain is just `c(0,1)`.

The transform object for datetime (`POSIXt`) objects already uses the name “time”, so the obvious name “times” would be confusing. I’ve chosen “chrontimes” as a name, to indicate that it is the `times` object from the `chron` package.

```
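library(scales)  # trans_new, pretty_breaks

times_trans <- function() {
  # show seconds only when some break spacing is under one minute
  fmt <- function(x) {
    format(x, simplify = !any(diff(x) < 1/(24*60)))
  }
  trans_new("chrontimes",
            transform = as.numeric,
            inverse = times,
            breaks = pretty_breaks(),
            format = fmt,
            domain = c(0, 1))
}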
```

Using the transformation in ggplot
Using the `Time` values previously created, and some other random data, make a data frame to plot.

```
dat <- data.frame(time = Time,
value = c(7L, 6L, 9L, 11L, 10L, 1L,
4L, 2L, 3L, 5L, 8L))
```

A default plot of this gives

```
ggplot(dat, aes(time, value)) + geom_point()
```

Applying the new transformation to the x axis instead gives

```
ggplot(dat, aes(time, value)) + geom_point() +
scale_x_continuous(trans=times_trans())
```

Integrating as a ggplot scale
If you want to go the next step and create `scale_*_times` functions for using directly in ggplot, you can. In doing so, you may realize that, when done along a y-axis, you would expect time to run from top to bottom, not bottom to top as the y axis typically runs. Using the ideas of the reversed scale described in the previous post, a reversed times transformation can also be made. Then making the `scale_x_times` and `scale_y_times` is just a matter of passing the right transformation to `scale_x_continuous` and `scale_y_continuous`.

```
timesreverse_trans <- function() {
  trans <- function(x) {-as.numeric(x)}
  inv <- function(x) {times(-x)}
  fmt <- function(x) {format(x, simplify = !any(diff(x) < 1/(24*60)))}
  trans_new("chrontimes-reverse",
            transform = trans,
            inverse = inv,
            breaks = pretty_breaks(),
            format = fmt,
            domain = c(0, 1))
}
scale_x_times <- function(..., trans=NULL) {
  scale_x_continuous(trans=times_trans(), ...)
}
scale_y_times <- function(..., trans=NULL) {
  scale_y_continuous(trans=timesreverse_trans(), ...)
}
```

Examples of plots with times on each axis in full ggplot syntax

```
ggplot(dat, aes(time, value)) + geom_point() +
scale_x_times()
```

```
ggplot(dat, aes(value, time)) + geom_point() +
scale_y_times()
```

Mapping Quantitative Values to Color

Nils Gehlenborg & Bang Wong discuss “Mapping Quantitative Values to Color” in this month’s issue of Nature Methods. The article is paywalled, but I was able to access the figure without a subscription. They map out a systematic approach to color choice, starting with considering the salient regions of your data range and any values with special meaning (e.g. zero, 32 degrees Fahrenheit or sea level). They make explicit two options for mapping gradients to values:

…we can translate the ends of the color gradient to (i) zero and the theoretical maximum value or (ii) the observed minimum and maximum. The former approach allows us to interpret the data in the context of the theoretical data range (Fig. 1a). However, if higher contrast is needed from the graphical representation and zero is irrelevant as a reference point, then it is reasonable to map the lowest observed value to the lightest color and the highest observed value to the darkest color (Fig. 1b).
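
The two options differ only in which data values are pinned to the ends of the gradient. As a rough sketch in base R (the data, the theoretical maximum of 100, the white-to-dark-blue ramp and the helper names `rescale_to_unit` and `to_hex` are all made up for illustration, not taken from the article):

```r
# Hypothetical helper: rescale values to [0, 1] given the two data values
# that should sit at the ends of the colour gradient.
rescale_to_unit <- function(x, from) {
  (x - from[1]) / (from[2] - from[1])
}

values <- c(20, 40, 60)   # made-up measurements
theoretical_max <- 100    # assumed theoretical maximum

# (i) gradient ends at zero and the theoretical maximum
pos_theoretical <- rescale_to_unit(values, c(0, theoretical_max))
# (ii) gradient ends at the observed minimum and maximum
pos_observed <- rescale_to_unit(values, range(values))

# Turn gradient positions into colours with a light-to-dark ramp
ramp <- grDevices::colorRamp(c("white", "darkblue"))
to_hex <- function(pos) {
  m <- ramp(pos)
  grDevices::rgb(m[, 1], m[, 2], m[, 3], maxColorValue = 255)
}
cols_theoretical <- to_hex(pos_theoretical)
cols_observed <- to_hex(pos_observed)
```

Under option (i) the three values occupy only the 0.2 to 0.6 stretch of the gradient, while option (ii) spreads them across the whole ramp, which is where the extra contrast comes from.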

Check out Baptiste’s posts about choosing color palettes for more ideas and implementation: Introduction and Educated Choices.

ggplot syntax reminders

There are ggplot features that I use often enough to know they exist but not often enough to remember in detail. Lately I’ve started moving examples of these features to the menu bar. I use the Mac utility called ClipMenu, which I first started using as a clipboard manager, but now I’m using the snippets feature for this.

This screen shot shows two ggplot snippets. The first one contains the code “+ opts(axis.text.x=theme_text(angle=90, hjust=0)).” To insert it in an R document, I click the ClipMenu icon on the menu bar, highlight “ggplot” and click on “rotate axis text.”

I’m sure there are numerous other ways this type of strategy could be implemented; Quicksilver’s Shelf feature comes to mind. A web search suggests TextExpander might suffice in Windows; this page lists some more options.

Defining a new transformation for ggplot2/scales

Inspired by writing an answer to this question on StackOverflow, I decided to write up a more detailed description of creating a new transformation using the `scales` package (and also to make sure that I understood all the details about how to really do it).

To start with, it helps to understand the philosophy behind the `scales` package. From the description of the `scales` package:

Scales map data to aesthetics, and provide methods for automatically determining breaks and labels for axes and legends.

Within the realm of `scales`, a transformation allows for a manipulation of the data space prior to its mapping to an aesthetic. In particular, it is responsible for

- The mapping, in both directions, between the data space and an intermediate representation space
- Providing a mechanism for determining “nice” breaks in the data space
- Providing a mechanism for formatting the labels in the data space

There are two main use cases for a transformation:

- Taking an existing continuous scale and performing a functional transformation of it prior to mapping. For example, taking the logarithm, exponential, square root, reciprocal, inverse, etc. of a variable.
- Providing a way of handling a variable of a type which represents a continuous quantity, but has specific structure and/or formatting conventions, typically represented with a class. Prototypical examples of this are dates and datetimes.

These variable transformations take place before any `stat`s are performed on the data. In fact, they are equivalent, in terms of effects on the data, to putting a transform in as the variable itself (though the axis breaks and labels are different). Quoting from *ggplot2: Elegant Graphics for Data Analysis* (page 100):

Of course, you can also perform the transformation yourself. For example, instead of using `scale_x_log()`, you could plot `log10(x)`. That produces an identical result inside the plotting region, but the axis and tick labels won’t be the same. If you use a transformed scale, the axes will be labelled in the original data space. In both cases, the transformation occurs before the statistical summary.

I reproduce Figure 6.4 using the current version of the code because it is different than what was published.

```
qplot(log10(carat), log10(price), data=diamonds)
qplot(carat, price, data=diamonds) +
scale_x_log10() + scale_y_log10()
```
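
The last point in the quotation, that the transformation occurs before the statistical summary, matters because the two orders generally give different answers. A tiny base R illustration with made-up numbers:

```r
x <- c(1, 10, 100)  # made-up data

# Transforming first, then summarising (what a transformed scale does)
mean_of_logs <- mean(log10(x))

# Summarising first, then transforming (not what ggplot does)
log_of_mean <- log10(mean(x))

print(c(mean_of_logs = mean_of_logs, log_of_mean = log_of_mean))
```

Any `stat` computed on a transformed scale therefore behaves like the first calculation, not the second.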

Building blocks
The pieces that are needed to create a transformation are described on the help page for `trans_new`, but I’ll go through them in more detail.

`transform` and `inverse`
These are the workhorses of the transformation and define the functions that map from the original data space to the intermediate data space (`transform`) and back again (`inverse`). These can be specified as a function (an anonymous function or a function object) or as a character string which will cause a function of that name to be used (as determined by `match.fun`).

Each of these functions should take a vector of values and return a vector of values of the same length. Calling `inverse` on the results of `transform` should result in the original vector (to within any error introduced by floating point arithmetic). That is, `all.equal(inverse(transform(x)), x)` should be `TRUE` for any `x` (for which `transform` is defined; see `domain` below).
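
As a quick sanity check of these two requirements, here is a sketch using a base-10 logarithm pair. The names `trans` and `inv` and the test values are chosen only for illustration.

```r
# A transform/inverse pair for a base-10 logarithm
trans <- function(x) log10(x)
inv <- function(x) 10^x

x <- c(0.5, 1, 20, 300)  # values inside the domain (positive numbers)

# Same length in, same length out
same_length <- length(trans(x)) == length(x)
# Round trip recovers the input, up to floating point error
round_trip <- isTRUE(all.equal(inv(trans(x)), x))
print(c(same_length = same_length, round_trip = round_trip))
```

Running the same two checks against a candidate pair before handing it to `trans_new` is a cheap way to catch a mismatched inverse.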

Both of these functions are required.

`breaks`

`breaks` is a function which takes a vector of length 2 which represents the range of the data, expressed in the original data space, that is to be represented. This will include any requested expansion, in addition to the actual data values. `breaks` should return a vector of whatever length it deems appropriate such that each break is represented by one element of the vector. Optionally, the vector can be a named vector. If it is, the default formatter will use the names as the displayed version of the values.

In general, this is a hard problem, primarily because breaks should look “nice” which is difficult for an algorithm to determine. Luckily, others have spent time working on the problem and often much of what they have learned and implemented can be used without having to do much yourself. In particular, there are existing break determination algorithms in `scales` such as `pretty_breaks` (which is based on `base::pretty`), which finds breaks for a simple numeric scale; `extended_breaks`, which is based on extensions of work by Wilkinson and covers the same territory; `log_breaks`, which gives integer breaks on a log-transformed scale; and `date_breaks`, which works with date data.

All these functions are generators, meaning that they are functions which return functions which do the actual work of finding the breaks. These functions can take parameters which define the properties of the breaking algorithm, such as the number of breaks, the base of the logarithm, or the spacing in time between dates.

This argument is optional, and if not supplied a default algorithm is used which will evenly space the ticks in the original data space.
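
The generator pattern is easier to see in a stripped-down form. The following is a sketch in base R only; `simple_breaks` is a hypothetical stand-in written for illustration and is not the `scales` implementation of `pretty_breaks`.

```r
# Hypothetical generator in the style of the scales break functions:
# it returns a function that maps a length-2 range to a vector of breaks.
simple_breaks <- function(n = 5) {
  function(limits) {
    pretty(limits, n = n)
  }
}

brk_fun <- simple_breaks(n = 5)  # configure once...
brks <- brk_fun(c(0, 100))       # ...then call with the data range
print(brks)
```

The outer call fixes the tuning parameter, and the returned closure is the piece that would be passed as the `breaks` argument.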

`format`

`format` is a function which takes a vector of values in the original data space (those returned by the `breaks` function) and returns either a character vector of the same length or a list of expressions of the same length. The latter is useful for making expressions that can be handled by plotmath.

`scales` includes many formatting functions including `comma_format`, which puts commas between thousands, millions, billions, etc.; `dollar_format`, which rounds to either cents or dollars (threshold definable in generator) and adds a “$” in front and commas; `percent_format`, which multiplies by 100 and adds a percent sign (“%”); and `parse_format` and `math_format`, which aid in making plotmath expressions for labels.

As with `breaks`, all the functions are generators which means that they are functions which return functions. The returned function is the one that takes a single vector, and is what is assigned to the `format` argument.

This argument is optional and if not supplied, the default algorithm will use any names returned with the breaks. If there are no names, then `format` is called on the passed values.
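
A formatter generator follows the same shape as a breaks generator. Below is a base R sketch; `simple_percent_format` is a made-up name for illustration and only loosely mimics what `percent_format` is described as doing above.

```r
# Hypothetical generator: returns a formatter that turns proportions
# into percentage labels with a fixed number of decimal places.
simple_percent_format <- function(digits = 0) {
  fmt <- paste0("%.", digits, "f%%")
  function(x) {
    sprintf(fmt, 100 * x)
  }
}

label_fun <- simple_percent_format(digits = 1)
labs_out <- label_fun(c(0, 0.25, 0.5, 1))
print(labs_out)
```

The returned `label_fun` takes a single vector and returns a character vector of the same length, which is the contract the `format` argument expects.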

`domain`

The `domain` is the set of values over which the `transform` function is defined. For example, square root and logarithm are only defined for positive values (ggplot does not deal with complex values), and an arcsine transformation would only be defined between -1 and 1. This is represented by a length 2 vector of the endpoints (inclusive) of the defined range.

This argument is optional, and if missing it is assumed that the transformation is valid over all numeric values.
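
A few base R calls show what goes wrong outside the domain, and suggest why the `log_trans` code quoted later in this post uses a tiny positive lower bound (`1e-100`) instead of zero. The variable names are for illustration only.

```r
sqrt_outside <- suppressWarnings(sqrt(-1))  # NaN: outside the domain
log_at_zero  <- log(0)                      # -Inf: not a usable position
log_tiny     <- log(1e-100)                 # finite, so still plottable

print(c(sqrt_outside = sqrt_outside,
        log_at_zero = log_at_zero,
        log_tiny = log_tiny))
```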

`name`

A character string to identify the transformation. It is used in summary output, but has no computational value.

Example: reverse logarithm
There are built-in transformations for logarithms and for reversing a scale, but there is not one to do both at once (largest to smallest, left to right). The code for each of these, taken from the `scales` package, is

```
log_trans <- function(base = exp(1)) {
  trans <- function(x) log(x, base)
  inv <- function(x) base ^ x
  trans_new(str_c("log-", format(base)), trans, inv,
            log_breaks(base = base), domain = c(1e-100, Inf))
}
reverse_trans <- function() {
  trans_new("reverse", function(x) -x, function(x) -x)
}
```

To make a reversed log scale, the `breaks` are the same as for a regular log scale, so that part does not need to be recreated. In fact, much of `log_trans` can be reused with changes being made to just the transformation and inverse functions.

```
reverselog_trans <- function(base = exp(1)) {
  trans <- function(x) -log(x, base)
  inv <- function(x) base^(-x)
  trans_new(paste0("reverselog-", format(base)), trans, inv,
            log_breaks(base = base), domain = c(1e-100, Inf))
}
```

I have opted here to follow the general pattern of the other trans functions making this a generator of the transformation, parameterizable by the base of the logarithm. Since I’ve used the `_trans` naming convention, I can also just call it (with the default parameters) as a string in the `trans` argument of `scale_x_continuous`. Some examples of it at work:

```
dat <- data.frame(x=1:20, y=1:20)
ggplot(dat, aes(x,y)) + geom_point() +
scale_x_continuous(trans="reverselog")
```

```
ggplot(dat, aes(x,y)) + geom_point() +
scale_x_continuous(trans=reverselog_trans(base=2))
```

Reproducibility details
R-2.15.1, ggplot2-0.9.1, scales-0.2.1

code available at https://github.com/BrianDiggs/trans

Dot density measurements

Jim Perkins is a scientific illustrator who recently contributed a guest post about the various meanings of DPI (dots-per-inch) at Symbiartic, a Scientific American blog about “The art of science and the science of art.” He also has a couple previous posts about calibrating monitors.

Simulating conditions of legislative violence

Christopher Gandrud uses ggplot to illustrate his analysis of violence in national legislative chambers (e.g. Turkey, above). After gathering a data set of incidents of legislative violence, he applied logistic regression for rare events to identify the most important variables and the extent of their importance. He then predicted the probability of violence in a range of conditions with a round of simulations, depicted below.

Christopher discusses his approach to this plot in detail here.

The take home on legislative violence: new democracies with poor concordance between votes from the electorate, seats in the legislature, and proportion of governmental power are more likely to see legislative violence.

Assessing the Price of Solid State Harddrives

During a staff meeting at work, the topic of the price of solid state hard drives came up (what are they, is it non-linear with size, etc.). I decided to sample 120 solid state hard drives from newegg.com and record their size (in GB) and price (in USD) as well as their class (SATA II or SATA III). Note that the sampling was semi-random, in that I had no particular agenda, but did not go to great lengths to sample randomly. To look at this, I used ggplot2.

```
ssd <- read.csv("http://joshuawiley.com/files/ssd.csv")
ssd$class <- factor(ssd$class)
require(ggplot2)
## first pass
p <- ggplot(ssd, aes(x = price, y = size, colour = class)) +
geom_point()
print(p)
```

Not too bad, but the data is sparser at higher sizes and prices, so we can use a log-log scale to make it a little easier to see, and add locally weighted regression (loess) lines to assess linearity (or lack thereof).

```
## add smooths and log to make clearer
## (zero has no position on a log axis, so the breaks start at 100)
p <- p +
  stat_smooth(se=FALSE) +
  scale_x_log10(breaks = seq(100, 1000, 100)) +
  scale_y_log10(breaks = seq(100, 600, 100))
```

Okay, that is nice. Lastly, let’s add better labels, make the x-axis text not overlap, and include the intercept and slope parameters for the linear lines of best fit for each class of hard drive.

```
## fit separate intercept and slope model
## (class:price without a price main effect gives one slope per class,
##  so est[3] and est[4] are the slopes for class II and III)
m <- lm(size ~ 0 + class + class:price, data = ssd)
est <- round(coef(m), 2)
size2 <- paste0("II Size = ", est[1], " + ", est[3], "price")
size3 <- paste0("III Size = ", est[2], " + ", est[4], "price")
## finalize
p <- p +
  annotate("text", x = 100, y = 600, label = size2) +
  annotate("text", x = 100, y = 500, label = size3) +
  labs(x = "Price in USD", y = "Size in GB") +
  opts(title = "Log-Log Plot of SSD Size and Price",
       axis.text.x = theme_text(angle = 45, hjust = 1, vjust = 1))
```

(guest post by Joshua Wiley)

Name popularity

This dynamic representation of the popularity of names over the years is a favorite. It’s not new, but I still find new things to appreciate, like names that used to apply to both sexes and now only one (Ellie), or vice versa (Harley). It seems like there are more very popular male names than very popular female names; I can hardly guess what underlies that.

Choosing colour palettes. Part II: Educated Choices.

There are many resources on the use of colours in R, several packages, and a number of schemes already implemented in `ggplot2`. In the previous part, we saw how `ggplot2` selects a default colour palette according to the type of variable, discrete or continuous. There are further options, illustrated below:

Choosing colours for a graphic is often some kind of a compromise. On one hand, you want the *computer*, some algorithm, to choose a sensible colour scheme and pick automatically the required number of colours from this scale. On the other hand, there are always external *human* preferences that constrain the choices, and are not always easy to formalise.

Some choices, even prevalent in the literature such as the rainbow color scale (also known as Matlab’s flashy colorjet), are just not good enough. They introduce artefacts, highlight regions of the data that should have a smooth transition with their surroundings, and do not degrade gracefully in black-and-white print, or when viewed by colour-impaired people.

If good colours for scientific graphics are not (entirely) in the eye of the beholder, what are the guides to make the best choice?

A recent blog post illustrates the search for a pleasing colour scheme in bar graphs. On the default HCL (Hue Chroma Luminance, pdf) choice of `ggplot2` for discrete variables, the author remarks

The colour choice is not a bad one, but there’s something about the intensity of the colours that makes me want to find a new set of colours somewhat more soothing to my eyes.

and documents his heuristic search for satisfying colours,

I shuffled through many different colours on the Color Hex website, and nothing else seemed to work with me as I wasn’t selecting colours based on any theory

A good discussion is offered in the colorspace package and its accompanying vignettes and papers, e.g. Escaping RGBland: Selecting Colors for Statistical Graphics (pdf)

Despite this omnipresence of color, there is often only little guidance in statistical software packages on how to choose a palette appropriate for a particular visualization task

In this instance, I would argue that the `hcl` colour scale of `ggplot2` is a good start for a well-balanced graphic that doesn’t draw the attention to a particular colour. If the colours are too flashy in bar plots (large areas), the saturation and luminosity can easily be muted by tuning the scale.
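
As a rough sketch of what such tuning does, the colours can be generated directly with base R’s `hcl()`. The helper `hue_colours` is made up for illustration; the default chroma and luminance (`c = 100`, `l = 65`) and the hue range are the ones quoted for `hue_pal` in Part I, and the muted values (`c = 40`, `l = 80`) are arbitrary choices. In `ggplot2` itself the equivalent knobs should be the `c` and `l` arguments of `scale_fill_hue()`.

```r
# Hypothetical helper: n evenly spaced hues at a given chroma and luminance.
hue_colours <- function(n, c = 100, l = 65) {
  h <- seq(15, 375, length.out = n + 1)[seq_len(n)]
  grDevices::hcl(h = h, c = c, l = l)
}

default_like <- hue_colours(4)                  # chroma/luminance quoted for the default
muted        <- hue_colours(4, c = 40, l = 80)  # lower chroma, higher luminance
print(rbind(default_like, muted))
```

The hues stay the same in both rows, so the categories remain distinguishable while large filled areas become less intense.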

This basic idea of tuning the HCL colour scale to suit the application was discussed in more depth in Colour for Presentation Graphics (pdf).

Bar plots and maps can also benefit from trying a few different colour palettes from the excellent ColorBrewer website. An interface is provided in `R` and `ggplot2` through the `RColorBrewer` package.

Easily accessed with `scale_colour_brewer()`, it is trivial to choose among 35 palettes (see `RColorBrewer::display.brewer.all()`).

**Sequential palettes**, *suited to ordered data that progress from low to high. Lightness steps dominate the look of these schemes, with light colors for low data values to dark colors for high data values.*

**Qualitative palettes**, *do not imply magnitude differences between legend classes, and hues are used to create the primary visual differences between classes. Qualitative schemes are best suited to representing nominal or categorical data.*

**Diverging palettes**, *put equal emphasis on mid-range critical values and extremes at both ends of the data range. The critical class or break in the middle of the legend is emphasized with light colors and low and high extremes are emphasized with dark colors that have contrasting hues.*

In the next post, we’ll look at some special cases where the user might want finer control over these scales, or define completely new colour palettes tailored for a specific graphic.

Managing the deluge of DNA data

The explosion in DNA sequencing capacity has shifted the experimental bottleneck from sequencing to analyzing and interpreting sequences. The Bioconductor package cummeRbund uses ggplot as part of its tool set for organizing, exploring and visualizing sequencing data related to gene expression. Congrats to the authors on their recent publication (paywall).

Posts about ggplot2 on r-bloggers

You can also get your fix of ggplot2 on r-bloggers (where this blog is also syndicated): http://www.r-bloggers.com/search/ggplot

Choosing colour palettes. Part I: Introduction

In this series of three posts, we’ll look at colours in R graphics produced with `ggplot2`: what are the available choices of colour schemes, and how to choose a colour palette most suitable for a particular graphic?

In kindergarten, choosing a colour was easy; palettes were limited to a few classics. As cool kids grow older and use R, the spectrum expands to present us with an overwhelming choice of millions of colours, most of them with poorly defined labels such as `"#A848F2"` or `"lavenderblush3"`. Inasmuch as scientific graphics resemble a paint-by-numbers game, R can help us design more elegant palettes with pertinent colour choices based on the data to display.

Base graphics rely mostly on the `grDevices` package for the selection of colours, with a few palettes to choose from:

The package also provides a number of basic operations to convert colours (`adjustcolor, col2rgb, make.rgb, rgb2hsv, convertColor`) and create interpolating palettes (`rgb, hsv, hcl, gray, colorRamp, colorRampPalette, densCols, gray.colors`).

Beyond that, a good resource is the `colorspace` package which provides further utilities to convert from one colorspace to another (`HLS, HSV, LAB, LUV, RGB, sRGB, XYZ`) and perform various operations on colours.

A special note can be made of a few palette functions, “diverge_hcl”, “diverge_hsv”, “heat_hcl”, “rainbow_hcl”, “sequential_hcl”, “terrain_hcl”, which provide an easy way to produce colour palettes following a particular path in the colour space (varying hue with constant luminosity and saturation, for example).

Other packages such as `RColorBrewer`, `munsell` and `dichromat` provide more colour palettes and utilities.

While the combination of these tools is quite flexible, the user interface becomes a little bit chaotic. More recently, the `scales` package has provided wrappers around these functions to provide some consistency in the naming schemes and organise the different categories of palettes in a structured way:

**Utility functions**, such as `col2hcl, fullseq, muted, rescale, rescale_mid, rescale_none, rescale_pal, seq_gradient_pal, show_col`.

**Palettes with consistent interface**, `brewer_pal, dichromat_pal, gradient_n_pal, div_gradient_pal, hue_pal, grey_pal, identity_pal, manual_pal`.

The `ggplot2` package uses `scales` internally, and mirrors this structure. In this first part, we’ll review the basic commands to assign colours in ggplot2.

Let’s consider three plots for illustration:

`p1` maps the colour of points to a continuous variable, `p2` maps the fill of bars to a discrete variable, and `p3` maps the fill of tiles to a continuous variable.

Fill and colour scales in ggplot2 can use the same palettes. Some shapes such as lines only accept the colour aesthetic, while others, such as polygons, accept both colour and fill aesthetics. In the latter case, the colour refers to the border of the shape, and the fill to the interior.

Aesthetic mapping vs set values
Another common source of confusion, general to `ggplot2`, is the distinction between set values and mapped values in a layer. Consider the following example,

```
d = data.frame(x = 1:10, y = rnorm(10), z = gl(5, 2))
a = ggplot(d, aes(x, y, group=z))
grid.arrange(a + geom_path( colour = "red" ),
a + geom_path( aes(colour = z )),
nrow=1)
```

Continuous scales
The default continuous scale in `ggplot2` is a blue gradient, from `low = "#132B43"` to `high = "#56B1F7"` which can be reproduced as

```
scales::seq_gradient_pal(low = "#132B43", high = "#56B1F7", space = "Lab")
```
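The call above returns a palette function rather than a vector of colours. As a small sketch (the names `default_gradient` and `cols` are illustrative), the palette can be evaluated at values between 0 and 1 to obtain the corresponding hex colours:

```r
library(scales)

# Palette function for the default gradient described above.
default_gradient <- seq_gradient_pal(low = "#132B43", high = "#56B1F7",
                                     space = "Lab")

# Evaluate the palette at five equally spaced positions in [0, 1].
cols <- default_gradient(seq(0, 1, length.out = 5))
print(cols)
```

Passing the result to `show_col(cols)` displays the swatches on the current graphics device.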

Discrete scales
The default discrete scale in `ggplot2` is a range of hues from hcl,

```
scales::hue_pal(h = c(0, 360) + 15, c = 100, l = 65, h.start = 0,
direction = 1)
```
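As with the continuous case, `hue_pal()` returns a function. The sketch below (with illustrative names `default_hues`, `cols3` and `cols6`) asks it for a given number of colours, which is roughly what happens when a discrete variable with that many levels is mapped to colour or fill:

```r
library(scales)

# Palette function with the default settings quoted above.
default_hues <- hue_pal(h = c(0, 360) + 15, c = 100, l = 65, h.start = 0,
                        direction = 1)

# The hues are spread evenly around the colour wheel,
# so the colours returned depend on how many are requested.
cols3 <- default_hues(3)
cols6 <- default_hues(6)
print(cols3)
print(cols6)
```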

In the next post of this series we’ll describe how to fine-tune these default colours or replace them altogether and, perhaps more importantly, give some pointers on choosing an appropriate colour scheme for a particular graphic.

Automating repetitive plot elements

The syntax of **ggplot2** emphasizes constructing plots by adding components, or layers, using `+`.

Possibly one of the most useful, but least remarked upon, consequences of this syntax is that it allows for an incredible degree of flexibility in saving and reusing components of plots. Here are two very simple examples that come up frequently for me.

I frequently make line plots where the x axis is categorical. For instance, consider the following example data:

```
x <- factor(paste(1990:2005,1991:2006,sep = "-"))
dat <- data.frame(x = x,y = rnorm(length(x)))
```

which we can plot like so:

```
p <- ggplot(dat,aes(x = x,y = y)) + geom_line(aes(group = 1))
p
```

Obviously, we can’t keep those x axis labels like that, they’re unreadable! So I’m frequently doing something like the following:

```
p + opts(axis.text.x = theme_text(size = 7,
hjust = 0,
vjust = 1,
angle = 310))
```

But who wants to type all that over and over for each plot? So instead, I just store the results of that `opts()` call:

```
x_angle <- opts(axis.text.x = theme_text(size = 7,
hjust = 0,
vjust = 1,
angle = 310))
p + x_angle
```
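In ggplot2 releases from 0.9.2 onward, `opts()` and `theme_text()` were superseded by `theme()` and `element_text()`. The following is a sketch of the same stored component under that newer interface, reusing the example data from above; the name `p_rotated` is illustrative.

```r
library(ggplot2)

# Same example data and base plot as above.
x <- factor(paste(1990:2005, 1991:2006, sep = "-"))
dat <- data.frame(x = x, y = rnorm(length(x)))
p <- ggplot(dat, aes(x = x, y = y)) + geom_line(aes(group = 1))

# Store the theme adjustment once, then add it to any plot.
x_angle <- theme(axis.text.x = element_text(size = 7,
                                            hjust = 0,
                                            vjust = 1,
                                            angle = 310))
p_rotated <- p + x_angle
```

The stored object behaves like any other plot component, so the idea of saving and reusing it carries over unchanged.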

While in this case it doesn’t look too bad, if my x axis has even more values, showing all of the labels can seem a little excessive. Maybe we really only need to show every other x axis tick label:

```
l <- levels(dat$x)[seq(1,length(levels(dat$x)),by = 2)]
p + x_angle + scale_x_discrete(breaks = l,labels = l)
```

Again, this kind of thing comes up a lot, and typing this over and over can get a bit tedious. But you can write a simple function that takes the axis tick labels (in the correct order) and returns the `scale_x_discrete` object as needed:

```
every_other <- function(labs,side = "x",...){
    l <- labs[seq(1,length(labs),by = 2)]
    if (side == 'x'){
        return(scale_x_discrete(breaks = l,labels = l,...))
    }
    if (side == 'y'){
        return(scale_y_discrete(breaks = l,labels = l,...))
    }
}
```

So in the end you can do all that simply with the following code:

```
p + x_angle + every_other(levels(dat$x))
```
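The same idea extends to other spacings. Below is a sketch of a hypothetical generalisation, `every_nth()` (not part of ggplot2 or of the code above), which keeps every n-th label instead of every other one:

```r
library(ggplot2)

# Hypothetical helper: keep every n-th label on a discrete x axis.
every_nth <- function(labs, n = 2, ...) {
  keep <- labs[seq(1, length(labs), by = n)]
  scale_x_discrete(breaks = keep, labels = keep, ...)
}

labs <- paste(1990:2005, 1991:2006, sep = "-")
sc <- every_nth(labs, n = 4)

# The breaks stored on the scale are the 1st, 5th, 9th and 13th labels.
sc$breaks
```

With the plot from above, this would be used as `p + x_angle + every_nth(levels(dat$x), n = 4)`.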

These examples are fairly simple, but perhaps they’ll get you thinking about components of your plots that can be stored and reused, or generated by functions.

Adding watermarks to plots

A question was raised today on the mailing list: Is there an easy way to add a watermark to a ggplot?

There are several options, depending on the type of watermark and the required level of control over the output:

- add a text label using `annotate` (the original idea of the poster)
- add a custom grob (graphical object from the Grid package), using `annotation_custom`

In either case, the placement of a watermark at an absolute location on the plot is greatly facilitated if you use +/- Inf values, which correspond to the extreme edges of the plot panel.

Here is an example with `annotate`

```
library(ggplot2)
library(grid)
qplot(1:10, rnorm(10)) +
annotate("text", x = Inf, y = -Inf, label = "PROOF ONLY",
hjust=1.1, vjust=-1.1, col="white", cex=6,
fontface = "bold", alpha = 0.8)
```

where the label is placed at the bottom-right, and the justification is adjusted to make sure the label stays in the panel area.
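If the same label is needed on several plots, the `annotate` call can be wrapped in a small function and reused. This is only a sketch: `watermark_label()` is a hypothetical helper, and it uses the ggplot2 names `colour` and `size` in place of `col` and `cex`.

```r
library(ggplot2)

# Hypothetical helper returning a reusable text watermark layer,
# anchored at the bottom-right corner of the panel.
watermark_label <- function(label = "PROOF ONLY", size = 6, alpha = 0.8) {
  annotate("text", x = Inf, y = -Inf, label = label,
           hjust = 1.1, vjust = -1.1, colour = "white", size = size,
           fontface = "bold", alpha = alpha)
}

d <- data.frame(x = 1:10, y = rnorm(10))
p <- ggplot(d, aes(x, y)) + geom_point() + watermark_label()
```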

Below is a fancier example with a custom grob, which we define such that its width spans the full plot panel, even after resizing the interactive plot window,

```
watermarkGrob <- function(lab = "PROOF ONLY"){
grob(lab=lab, cl="watermark")
}
## custom draw method to
## calculate expansion factor on-the-fly
drawDetails.watermark <- function(x, rot = 45, ...){
  ## ratio of the viewport width to the natural width of the label
  cex <- convertUnit(unit(1,"npc"), "mm", valueOnly=TRUE) /
    convertUnit(unit(1,"grobwidth", textGrob(x$lab)), "mm", valueOnly=TRUE)
  grid.text(x$lab, rot=rot, gp=gpar(cex = cex, col="white",
                                    fontface = "bold", alpha = 0.5))
}
qplot(1:10, rnorm(10)) +
annotation_custom(xmin=-Inf, ymin=-Inf, xmax=Inf, ymax=Inf, watermarkGrob())
```

You can of course replace this grob with a more complex one, e.g. a table of labels to tile the panel with multiple repetitions of the watermark, or an external graphic (consider the `annotation_raster` function), etc.
As an example, the following code uses `rpatternGrob` from the gridExtra package to tile multiple copies of the R logo, imported as a raster image,

```
library(png)
library(gridExtra)
## import logo as raster image
m <- readPNG(system.file("img", "Rlogo.png", package="png"), FALSE)
w <- matrix(rgb(m[,,1],m[,,2],m[,,3], m[,,4] * 0.2), # adjust alpha
nrow=dim(m)[1])
qplot(1:10, rnorm(10), geom = "blank") +
annotation_custom(xmin=-Inf, ymin=-Inf, xmax=Inf, ymax=Inf,
rpatternGrob(motif=w, motif.width = unit(1, "cm"))) +
geom_point()
```

This time we made sure that the logo was the first layer plotted, so that it doesn’t obscure the data but stays in the background.

ggplot2 presentation at Victoria University of Wellington

Next week I’ll present a glimpse of R and ggplot2 graphics at VUW. This is a MESA seminar on 'Data analysis and plotting with free and open source tools' where we’ll present spreadsheet alternatives based on gnuplot, Python, and R.


2012