Winston Chang has a guest post on the Revolution Analytics blog about embedding fonts into PDFs using the extrafont package which, not surprisingly, works very well with ggplot2.
ggbio (package, publication) is an extension and specialization of ggplot designed for visualizing genomics annotations and high-throughput data.
A frequent requirement for plots of genomic data is indicating the relationship of measurements on different scales. For example, the x-axis of the lower plots in this image is the linear chromosome position, but the upper plot shows the expression levels of each exon (which are indicated in the lower image by vertical lines). In the past I’ve written code to generate the diagonal lines between the images that show the relationship of the two scales, and I’ll be happy if ggbio makes that unnecessary.
In my previous blog post, I explored what was needed to create a new transformation for the scales package and gave an example of a mathematical transformation. In this post, I want to show an additional example related to the other use case mentioned there (mapping a continuous-like variable with specific structure and formatting) and extend the example into creating new scale functions which integrate into ggplot even more directly.
Dates and times are tricky to work with because they have detailed external constraints and conventions. Within the R ecosystem, several packages exist solely to deal with dates and times (chron, lubridate, date, mondate, timeDate, TimeWarp, etc.), and an article has appeared in R News on the topic (Brian D. Ripley and Kurt Hornik. Date-time classes. R News, 1(2):8-11, June 2001.).
There is already support for dates (using the Date class, via date_trans in scales and scale_*_date in ggplot2) and datetimes (using the POSIXt class, via time_trans in scales and scale_*_datetime in ggplot2). The piece that is missing is for time, separate from any date; “clock time”, if you will.
Exercising the first of the three great virtues of a programmer, laziness, it is worth seeing what has already been done (classes and functions) to deal with clock time.
The chron package has a class times which can specify times of day, independent of a date. Additionally, there are many supporting functions for this class:
> methods(class="times")
[1] [.times* [[.times* [<-.times*
[4] as.character.times* as.data.frame.times* axis.times*
[7] Axis.times* c.times* diff.times*
[10] format.times* hist.times* identify.times*
[13] is.na.times* lines.times* Math.times*
[16] mean.times* Ops.times* plot.times*
[19] points.times* pretty.times* print.times*
[22] quantile.times* summary.times* Summary.times*
[25] trunc.times* unique.times* xtfrm.times*
Non-visible functions are asterisked
Following the pattern of the previous post, each of the parts of the transformation can be determined.
transform and inverse
When dealing with a variable that has a class, transform must take the specific representation and convert it to a simple numeric representation (map to [part of] the real line in mathematical terms); inverse does the opposite functional mapping. Generally, this requires delving into the structure of the class to see how it is really put together. To do that, let’s create some data. The times documentation says it can convert a character vector (by default in 24-hour, minute, second format, separated by colons) to times.
Time <- times(c("18:37:11", "16:51:34", "15:05:57", "13:20:20",
"11:34:43", "09:49:06", "08:03:29", "06:17:52",
"04:32:15", "02:46:38", "01:01:01"))
which if printed gives
> Time
[1] 18:37:11 16:51:34 15:05:57 13:20:20 11:34:43 09:49:06
[7] 08:03:29 06:17:52 04:32:15 02:46:38 01:01:01
So far, so good. But what does this object/class really look like?
> str(Time)
Class 'times' atomic [1:11] 0.776 0.702 0.629 0.556 0.482 ...
..- attr(*, "format")= chr "h:m:s"
> dput(Time)
structure(c(0.775821759259259, 0.702476851851852, 0.629131944444444,
0.555787037037037, 0.48244212962963, 0.409097222222222, 0.335752314814815,
0.262407407407407, 0.1890625, 0.115717592592593, 0.0423726851851852
), format = "h:m:s", class = "times")
times are just vectors with an attribute and a class. A little more digging and testing can show that the numeric part is just the fraction of a day that the time represents.
> str(times(c("00:00:00","6:00:00","12:00:00","23:59:59")))
Class 'times' atomic [1:4] 0 0.25 0.5 1
..- attr(*, "format")= chr "h:m:s"
> dput(times(c("00:00:00","6:00:00","12:00:00","23:59:59")))
structure(c(0, 0.25, 0.5, 0.999988425925926), format = "h:m:s", class = "times")
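The fraction-of-a-day representation can be reproduced by hand in base R, without chron. The following is only a sketch to make the arithmetic concrete; the helper names hms_to_fraction and fraction_to_hms are made up for this illustration and are not part of chron.

```r
# Hypothetical helpers (not from chron): convert between "HH:MM:SS"
# strings and the fraction of a day that they represent.
hms_to_fraction <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) {
    p <- as.numeric(p)
    # hours, minutes, seconds -> seconds since midnight -> fraction of a day
    (p[1] * 3600 + p[2] * 60 + p[3]) / 86400
  }, numeric(1))
}

fraction_to_hms <- function(f) {
  # Round to whole seconds to absorb floating point error
  secs <- as.integer(round(f * 86400))
  sprintf("%02d:%02d:%02d",
          secs %/% 3600L, (secs %% 3600L) %/% 60L, secs %% 60L)
}

hms_to_fraction(c("00:00:00", "06:00:00", "12:00:00"))  # 0 0.25 0.5
fraction_to_hms(0.775821759259259)                      # "18:37:11"
```

The values agree with the dput output above: 18:37:11 is 67031 seconds after midnight, and 67031/86400 is 0.7758218 to seven places.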
Most of the work of creating a mapping to numeric values is already done; all that is needed is to strip off the class and attributes. as.numeric() does that nicely.
> as.numeric(Time)
[1] 0.77582176 0.70247685 0.62913194 0.55578704 0.48244213
[6] 0.40909722 0.33575231 0.26240741 0.18906250 0.11571759
[11] 0.04237269
That is only half the mapping. We also need to go from this representation to a times object. Looking at the constructor for times, it can take a numeric vector representing “number of days since an origin.” It’s not stated, but maybe times are then just fractions of a day?
> times(as.numeric(Time))
[1] 18:37:11 16:51:34 15:05:57 13:20:20 11:34:43 09:49:06
[7] 08:03:29 06:17:52 04:32:15 02:46:38 01:01:01
Sure looks like it.
> identical(Time, times(as.numeric(Time)))
[1] TRUE
So transform is just the as.numeric function and inverse is the times function.
Getting breaks on time right is important; an axis where the ticks are every 7 seconds is going to look odd (unless there is a really compelling reason), as would 25 seconds. In base graphics, the generic function pretty has the responsibility to find “nice” breaks. Looking at the methods for times, there is a pretty.times. Does it work (well enough)?
> pretty(Time)
[1] 03:00:00 06:00:00 09:00:00 12:00:00 15:00:00 18:00:00
attr(,"labels")
[1] 03:00 06:00 09:00 12:00 15:00 18:00
That’s pretty reasonable. Checking under the hood to see what is going on, chron::pretty.times calls chron::pretty.chron which calls grDevices::pretty.POSIXt which calls grDevices::prettyDate. Looking at the code for prettyDate, the allowed (sub-day) breaks are 1 second, 2 seconds, 5 seconds, 10 seconds, 15 seconds, 30 seconds, 1 minute, 2 minutes, 5 minutes, 10 minutes, 15 minutes, 30 minutes, 1 hour, 3 hours, 6 hours, and 12 hours. I might have added a 2 hour option, but that is not worth throwing away others’ work over. pretty_breaks already wraps pretty in the format expected by scales, so we can just use pretty_breaks() as the breaks function.
> pretty_breaks()(range(Time))
03:00 06:00 09:00 12:00 15:00 18:00
03:00:00 06:00:00 09:00:00 12:00:00 15:00:00 18:00:00
attr(,"labels")
[1] 03:00 06:00 09:00 12:00 15:00 18:00
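To see why a fixed list of allowed steps gives sensible ticks, here is a rough base R sketch that picks a step from that list and places breaks on multiples of it. This is not how prettyDate is implemented; the selection rule (the step whose break count is closest to a target n) and the names steps_sec, pick_step, and clock_breaks are assumptions made for illustration.

```r
# Allowed sub-day steps, in seconds, as listed in the text above.
steps_sec <- c(1, 2, 5, 10, 15, 30,
               60, 120, 300, 600, 900, 1800,
               3600, 3 * 3600, 6 * 3600, 12 * 3600)

# Assumed rule: choose the step that yields a number of intervals
# closest to the target n. range_frac is c(min, max) as fractions of a day.
pick_step <- function(range_frac, n = 5) {
  span_sec <- diff(range_frac) * 86400
  steps_sec[which.min(abs(span_sec / steps_sec - n))]
}

# Place breaks on whole multiples of the chosen step inside the range.
clock_breaks <- function(range_frac, n = 5) {
  step <- pick_step(range_frac, n) / 86400
  lo <- ceiling(range_frac[1] / step)
  hi <- floor(range_frac[2] / step)
  seq(lo, hi) * step
}

# Range of the Time data: 01:01:01 to 18:37:11
rng <- c(0.0423726851851852, 0.775821759259259)
pick_step(rng) / 3600    # 3 (hours)
clock_breaks(rng) * 24   # 3 6 9 12 15 18 (hours after midnight)
```

For this data the sketch lands on the same 3-hour breaks that pretty(Time) returned, though the two will not agree in general.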
Here, we almost catch another break. When format is not defined, the names associated with what breaks returns are, in principle, used as labels. Unfortunately, in practice this is not the case, because inside the ggplot code the breaks get transformed back and forth between data spaces and lose their attributes (names). If we want the default formatting (full hour, minute, and second), then format can simply be the format function. If we only want seconds to appear when they are not all 0 (when the increment is less than 1 minute), then we have to write our own function that passes the appropriate flag (simplify) to indicate whether the seconds should be suppressed.
fmt <- function(x) {
  format(x, simplify = !any(diff(x) < 1/(24*60)))
}
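The test inside fmt can be tried on plain numbers, since breaks are just fractions of a day. A small sketch, using a made-up helper name show_seconds for the condition that fmt negates:

```r
# One minute expressed as a fraction of a day
one_minute <- 1 / (24 * 60)

# Hypothetical helper: TRUE when any gap between consecutive breaks is
# shorter than a minute, so the seconds carry information.
# fmt passes simplify = !show_seconds(x) to format().
show_seconds <- function(breaks) any(diff(breaks) < one_minute)

hourly      <- c(3, 6, 9, 12) / 24       # breaks every 3 hours
half_minute <- c(0, 30, 60, 90) / 86400  # breaks every 30 seconds

show_seconds(hourly)       # FALSE: seconds can be dropped
show_seconds(half_minute)  # TRUE: seconds must be shown
```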
Since times is defined in terms of a fraction of a day, it is only meaningful in the range 0 to 1 (inclusive on the left, exclusive on the right). domain does not have a way of defining inclusivity or exclusivity of the endpoints, so the domain is just c(0,1).
The transform object for datetime (POSIXt) objects already uses the name “time”, so the obvious name “times” would be confusing. I’ve chosen “chrontimes” as a name, to indicate that it is the times object from the chron package.
times_trans <- function() {
  fmt <- function(x) {
    format(x, simplify = !any(diff(x) < 1/(24*60)))
  }
  trans_new("chrontimes",
            transform = as.numeric,
            inverse = times,
            breaks = pretty_breaks(),
            format = fmt,
            domain=c(0,1))
}
Using the Time values previously created, and some other random data, make a data frame to plot.
dat <- data.frame(time = Time,
value = c(7L, 6L, 9L, 11L, 10L, 1L,
4L, 2L, 3L, 5L, 8L))
A default plot of this gives
ggplot(dat, aes(time, value)) + geom_point()
Specifying the new transformation for the x-axis gives
ggplot(dat, aes(time, value)) + geom_point() +
scale_x_continuous(trans=times_trans())
If you want to go the next step and create scale_*_times functions for using directly in ggplot, you can. In doing so, you may realize that, when done along a y-axis, you would expect time to run from top to bottom, not bottom to top as the y axis typically runs. Using the ideas of the reversed scale described in the previous post, a reversed times transformation can also be made. Then making the scale_x_times and scale_y_times is just a matter of passing the right transformation to scale_x_continuous and scale_y_continuous.
timesreverse_trans <- function() {
  trans <- function(x) {-as.numeric(x)}
  inv <- function(x) {times(-x)}
  fmt <- function(x) {format(x, simplify = !any(diff(x) < 1/(24*60)))}
  trans_new("chrontimes-reverse",
            transform = trans,
            inverse = inv,
            breaks = pretty_breaks(),
            format = fmt,
            domain=c(0,1))
}
scale_x_times <- function(..., trans=NULL) {
  scale_x_continuous(trans=times_trans(), ...)
}
scale_y_times <- function(..., trans=NULL) {
  scale_y_continuous(trans=timesreverse_trans(), ...)
}
Examples of plots with times on each axis in full ggplot syntax
ggplot(dat, aes(time, value)) + geom_point() +
scale_x_times()
ggplot(dat, aes(value, time)) + geom_point() +
scale_y_times()
Nils Gehlenborg & Bang Wong discuss “Mapping Quantitative Values to Color” in this month’s issue of Nature Methods. The article is paywalled, but I was able to access the figure without a subscription. They map out a systematic approach to color choice, starting with considering the salient regions of your data range and any values with special meaning (e.g. zero, 32 degrees Fahrenheit, or sea level). They make explicit two options for mapping gradients to values:
…we can translate the ends of the color gradient to (i) zero and the theoretical maximum value or (ii) the observed minimum and maximum. The former approach allows us to interpret the data in the context of the theoretical data range (Fig. 1a). However, if higher contrast is needed from the graphical representation and zero is irrelevant as a reference point, then it is reasonable to map the lowest observed value to the lightest color and the highest observed value to the darkest color (Fig. 1b).
Check out Baptiste’s posts about choosing color palettes for more ideas and implementation: Introduction and Educated Choices.
There are ggplot features that I use often enough to know they exist but not often enough to remember in detail. Lately I’ve started moving examples of these features to the menu bar. I use the Mac utility called ClipMenu, which I first started using as a clipboard manager, but now I’m using the snippets feature for this.
This screen shot shows two ggplot snippets. The first one contains the code “+ opts(axis.text.x=theme_text(angle=90, hjust=0))”. To insert it in an R document, I click the ClipMenu icon on the menu bar, highlight “ggplot”, and click on “rotate axis text”.
I’m sure there are numerous other ways this type of strategy could be implemented; Quicksilver’s Shelf feature comes to mind. A web search suggests TextExpander might suffice in Windows; this page lists some more options.