Multi-level labels with ggplot2

Recently I needed to create multi-level labels with ggplot2 and had no idea how to do it. Multi-level labels imply a hierarchical structure in the data: for example, survey questions may be grouped by topic, and dates on a timeline may be grouped by year. About 15 minutes of Google-fu turned up various solutions on Stack Overflow that worked with varying success for different types of charts. An important consideration is whether data points in different groups should be connected. A bar chart (plot A below) is an example where data points from different groups should not be connected, and a line chart (plot B below) is an example where data points should be connected both within and between groups:

Below, I discuss possible solutions to multi-level labels for these two charts. The line chart will have dates on the x axis, as this is probably the most common case. I deliberately keep the charts simple, with no customization of colours, etc.

Bar chart

First, let’s simulate the data for the bar chart. Suppose it is the result of a hypothetical survey with 9 questions, labelled Q1 to Q9 and combined into 3 groups. The question labels will serve as the x axis labels. The y axis values are drawn from a uniform distribution and represent the proportion of respondents who answered each question correctly.

# For `tibble()`, `if_else()`, `lag()` and the pipe `%>%`.
library(dplyr)
# Reproducibility.
set.seed(4) 
# Counts of elements in each of the three groups.
TIMES <- c(2, 4, 3)
# Simulate data.
data <- tibble(group = rep(paste("Group", 1:3), times = TIMES),
               question = paste0("Q", 1:sum(TIMES)),
               proportion = runif(sum(TIMES)))
# Print data.
data
## # A tibble: 9 x 3
##   group   question proportion
##   <chr>   <chr>         <dbl>
## 1 Group 1 Q1          0.586  
## 2 Group 1 Q2          0.00895
## 3 Group 2 Q3          0.294  
## 4 Group 2 Q4          0.277  
## 5 Group 2 Q5          0.814  
## 6 Group 2 Q6          0.260  
## 7 Group 3 Q7          0.724  
## 8 Group 3 Q8          0.906  
## 9 Group 3 Q9          0.949

Here is a simple bar chart:

library(ggplot2)
data %>% 
  ggplot(aes(x = question, y = proportion)) +
  geom_col()

Why would you want to add a grouping variable? To make the chart more intuitive for the reader by clearly showing that questions belong to different topics.

Possible solutions

The most popular solutions for bar charts employ faceting with either facet_grid() (for example, here) or facet_wrap() (for example, here). In the ggplot2 version used for this post, the latter does not have the space argument, which lets the width of facets vary so that all bars keep the same width. Therefore, if the number of categories differs among groups, facet_grid() should be preferred. I really liked this approach and demonstrate it step by step below.
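To make the difference between the two faceting functions concrete, here is a minimal sketch. The data frame toy and the plot names p_base, p_wrap and p_grid are hypothetical and used only for illustration; the group sizes mirror the example above.

```r
library(ggplot2)

# Hypothetical toy data with the same group sizes as above (2, 4 and 3 questions).
toy <- data.frame(
  group = rep(paste("Group", 1:3), times = c(2, 4, 3)),
  question = paste0("Q", 1:9),
  proportion = seq(0.1, 0.9, by = 0.1)
)

p_base <- ggplot(toy, aes(x = question, y = proportion)) +
  geom_col()

# facet_wrap(): the x scale may vary, but without a `space` setting every panel
# gets the same width, so bars in the 2-question group are drawn wider.
p_wrap <- p_base +
  facet_wrap(~group, nrow = 1, scales = "free_x", strip.position = "bottom")

# facet_grid(): `space = "free_x"` makes panel width proportional to the number
# of questions in the panel, so all bars end up with the same width.
p_grid <- p_base +
  facet_grid(~group, scales = "free_x", space = "free_x", switch = "x")
```

Printing p_wrap and p_grid side by side should show the difference in bar widths.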

Example with faceting

The faceting approach starts with a simple bar chart that is turned into a small-multiples bar chart with facet_grid():

data %>% 
  ggplot(aes(x = question, y = proportion)) +
  geom_col() +
  facet_grid(~group)

Let’s rewrite the code and pass three additional arguments to the faceting function:

p_bars <- data %>% 
  ggplot(aes(x = question, y = proportion)) +
  geom_col() +
  facet_grid(~group, 
             scales = "free_x", # Let the x axis vary across facets.
             space = "free_x",  # Let the width of facets vary and force all bars to have the same width.
             switch = "x")      # Move the facet labels to the bottom.

p_bars

The final step is to customize the non-data components of the chart with the theme() function:

p_bars + 
  theme(strip.placement = "outside",                      # Place facet labels outside x axis labels.
        strip.background = element_rect(fill = "white"),  # Make facet label background white.
        axis.title = element_blank())                     # Remove x and y axis titles.

Done.

Line chart with dates

Again, we start with simulating data for the line chart. Let it be the sales volume of some healthy product over 16 months from November 2018 to February 2020. I deliberately chose this period as it partly covers three years.

# Number of months.
N <- 16
# Reproducibility.
set.seed(4)
# Simulate data.
data <- tibble(date = seq(as.Date("2018-11-01"), by = "1 month", length.out = N),
               sales = rnorm(n = N, mean = 100, sd = 5))
# Print data.
data
## # A tibble: 16 x 2
##    date       sales
##    <date>     <dbl>
##  1 2018-11-01 101. 
##  2 2018-12-01  97.3
##  3 2019-01-01 104. 
##  4 2019-02-01 103. 
##  5 2019-03-01 108. 
##  6 2019-04-01 103. 
##  7 2019-05-01  93.6
##  8 2019-06-01  98.9
##  9 2019-07-01 109. 
## 10 2019-08-01 109. 
## 11 2019-09-01 103. 
## 12 2019-10-01 100. 
## 13 2019-11-01 102. 
## 14 2019-12-01  99.8
## 15 2020-01-01 100. 
## 16 2020-02-01 101.

The first step is to create a simple line chart:

p_line <- data %>% 
  ggplot(aes(x = date, y = sales)) +
  geom_line()

p_line

Your x axis labels may look different depending on your regional settings. My default region is Latvia. The locale can be changed with Sys.setlocale() (the locale name "english" used below is a Windows one, as the output shows):

# Change locale.
Sys.setlocale(category = "LC_ALL", locale = "english")
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
# Reprint the chart.
p_line
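The locale name "english" is understood on Windows but not necessarily elsewhere. As a hedged alternative, the sketch below switches only the date and time category to the "C" locale, which yields English month abbreviations, and then restores the previous setting. The variable names old_time_locale and month_label are made up for this example.

```r
# Remember the current date/time locale so that it can be restored later.
old_time_locale <- Sys.getlocale("LC_TIME")

# Switch only the date/time category; "C" gives English month names.
Sys.setlocale("LC_TIME", "C")

# Abbreviated month name under the "C" locale.
month_label <- format(as.Date("2018-11-01"), "%b")
month_label
## [1] "Nov"

# Restore the previous locale.
Sys.setlocale("LC_TIME", old_time_locale)
```

Changing LC_TIME alone leaves number formatting, collation and character encoding untouched, which is usually all that is needed for axis labels.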

Why would you want to use multi-level labels for the x axis in this case? Because labelling each month creates a lot of clutter even with a short date format:

p_line + 
  scale_x_date(date_breaks = "1 month",  # Date labels for each month.
               date_labels = "%b%y")     # Date format: abbreviated month and a 2-digit year.

The faceting approach is not suitable in this scenario because we need to connect data points between facets:

# Faceting approach does not suit this chart.
data %>% 
  ggplot(aes(x = date, y = sales)) +
  geom_line() +
  scale_x_date(date_breaks = "1 month",
               date_labels = "%b") +
  facet_grid(~lubridate::year(date), 
             scales = "free_x", 
             switch = "x", 
             space = "free_x") + 
  theme(strip.placement = "outside", 
        strip.background = element_rect(fill = "white"), 
        axis.title = element_blank())

Even if we remove the white space between the facets with theme(panel.spacing = unit(0, units = "cm")), the end data points of the facets will still be disconnected. Notice the discontinuities between Decembers and Januaries.
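For reference, the zero-spacing variant can be sketched as follows. This is a self-contained approximation of the chart above: the names sales_data and p_facets are hypothetical, and the year column is derived with format() instead of lubridate::year() only to avoid an extra dependency.

```r
library(ggplot2)

# Re-simulate the sales data so that the example is self-contained.
set.seed(4)
sales_data <- data.frame(
  date = seq(as.Date("2018-11-01"), by = "1 month", length.out = 16),
  sales = rnorm(n = 16, mean = 100, sd = 5)
)
# Year as a character column, used as the faceting variable.
sales_data$year <- format(sales_data$date, "%Y")

p_facets <- ggplot(sales_data, aes(x = date, y = sales)) +
  geom_line() +
  scale_x_date(date_breaks = "1 month", date_labels = "%b") +
  facet_grid(~year, scales = "free_x", space = "free_x", switch = "x") +
  theme(strip.placement = "outside",
        strip.background = element_rect(fill = "white"),
        axis.title = element_blank(),
        panel.spacing = unit(0, units = "cm"))  # No gap between facets.

p_facets
```

Each facet draws its own line from its own subset of rows, so no segment is drawn from a December to the following January, whatever the spacing.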

Possible solutions

One solution offered here is to use annotations. I don’t use this approach for two reasons:

  1. coord_cartesian(clip = "off") can cause unexpected results as it allows drawing of data points anywhere on the plot, including in the plot margins.
  2. Fixed horizontal positions for text annotations in annotate(geom = "text", x = 2.5 + 4 * (0:4)) do not depend on data and other ggplot2 customizations, such as axis limits. This solution may not work for a different data set.

Another solution with grobs here appears overly complicated to me. Instead, I offer a different solution, which seems to me simple and generalizable, i.e. it does not depend on a particular data set.

Example with text wrap

My alternative solution is to modify the x axis labels with a text formatting function (in this case, a date formatting function). Suppose we have a date vector. The function does three things:

  1. Gets the abbreviated names of the months.
  2. Gets the four-digit years.
  3. Appends the year to the month if
    • it is the first element in the vector, or
    • the previous date in the vector has a different year component (so the month does not have to be January).

The separator is the newline character \n, which does the trick: it wraps the text and makes it look like a multi-level label. However, it is still a single vector of x axis labels, whose appearance can be adjusted with theme(axis.text.x = element_text(...)).

Here is the function:

format_dates <- function(x) {
  months <- strftime(x, format = "%b")              # Abbreviated name of the month.
  years <- lubridate::year(x)                       # Year as a 4-digit number.
  if_else(is.na(lag(years)) | lag(years) != years,  # Conditions for pasting.
          true = paste(months, years, sep = "\n"), 
          false = months)
}

Try it on two date vectors:

# Test the function.
format_dates(as.Date(c("2018-09-30", "2018-12-31", "2019-03-31", "2019-06-30")))
## [1] "Sep\n2018" "Dec"       "Mar\n2019" "Jun"
format_dates(as.Date(c("2018-12-31", "2019-03-31", "2019-06-30", "2019-09-30")))
## [1] "Dec\n2018" "Mar\n2019" "Jun"       "Sep"
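To see why only some labels carry a year, it may help to trace the intermediate values for the first test vector. The sketch below uses base R only: c(NA, head(years, -1)) stands in for dplyr's lag(), and the variable names are made up for this illustration.

```r
x <- as.Date(c("2018-09-30", "2018-12-31", "2019-03-31", "2019-06-30"))

# Year of each date.
years <- as.integer(format(x, "%Y"))
years
## [1] 2018 2018 2019 2019

# Year of the previous date; the first element has no predecessor, hence NA.
prev_years <- c(NA, head(years, -1))
prev_years
## [1]   NA 2018 2018 2019

# TRUE where the year should be appended: the first element or a change of year.
year_changed <- is.na(prev_years) | prev_years != years
year_changed
## [1]  TRUE FALSE  TRUE FALSE
```

The first element is TRUE because is.na() is TRUE there, and TRUE | NA evaluates to TRUE in R.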

We are now ready. Let’s start from the beginning with a simple line chart, then supply the date formatting function defined above to scale_x_date(labels = ...) along with a few other arguments. As a final touch, left-justify the labels with theme().

data %>% 
  ggplot(aes(x = date, y = sales)) +
  geom_line() + 
  # Customize x axis.
  scale_x_date(date_breaks = "1 month",          # Date labels for each month.
               minor_breaks = NULL,              # No minor grid lines between `date_breaks`.
               expand = expansion(add = 15),     # Add 15 days on each side of the x axis (`expand_scale()` before ggplot2 3.3.0).
               labels = format_dates) +          # Supply a user-defined date formatting function.
  
  # Customize the non-data components.
  theme(axis.text.x = element_text(hjust = 0),   # Left-justify x-axis labels.
        axis.title = element_blank())            # Remove x and y axis titles.

Done.

If abbreviated month names still look cluttered, we may use just the first letter of each month’s name by altering the format_dates() function and replacing

months <- strftime(x, format = "%b")      # Abbreviated name of the month.

with

months <- strftime(x, format = "%b") %>%  # Abbreviated name of the month.
  stringr::str_to_upper() %>%             # May or may not be needed depending on your locale.
  stringr::str_sub(start = 1, end = 1)    # Extract just the first letter.
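If you would rather keep both variants side by side than edit format_dates() in place, the first-letter version could be written as a separate function. This is only a sketch: the name format_dates_initial is hypothetical, and it uses base R (format(), substr(), toupper(), ifelse()) in place of the dplyr, lubridate and stringr calls used in this post.

```r
format_dates_initial <- function(x) {
  # First letter of the abbreviated month name, upper-cased.
  initials <- toupper(substr(format(x, "%b"), start = 1, stop = 1))
  # Year as a 4-digit string.
  years <- format(x, "%Y")
  # Year of the previous element; NA for the first element.
  prev_years <- c(NA, head(years, -1))
  # Append the year to the first element and wherever the year changes.
  year_changed <- is.na(prev_years) | prev_years != years
  ifelse(year_changed, paste(initials, years, sep = "\n"), initials)
}

# Example call; the letters returned depend on the locale's month names.
format_dates_initial(as.Date(c("2018-12-31", "2019-03-31", "2019-06-30")))
```

It can be passed to the scale in the same way as before, with scale_x_date(labels = format_dates_initial).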

Then the chart would look like this:

