Better data communication with {ggplot2}
Styling a chart and improving the text using {forcats}, {ggtext} and {scales}
After attending the inspiring and eye-opening talk about data communication by John Burn-Murdoch at the 2021 RStudio global conference, I decided to show how to put into practice one of his tips for better data communication: democratise plots with text. He showed how important the text in a chart is: whoever looks at the plot will most likely read the title and the axis titles first, and only after lingering there for a bit will they look at the plot area itself. If the text is so important, it should be given as much attention as the plot itself.
The title and subtitle are key tools for data communication. They should not be overlooked, and they should already convey a clear message rather than just describe the type of data used for the plot. This helps everyone looking at the chart understand the message, whatever their level of data literacy.
In this post, I will try to apply this concept to the data about the Tate Gallery artwork from week 3 of the 2021 tidytuesday. Let’s pretend I am trying to convince the director of the Tate Gallery to buy more digital prints, and I want to show how many art pieces of this type the gallery has acquired in the past couple of decades. I will first create a simple bar plot with a communicative title, and then work on styling the chart and the text to make sure that whoever views the chart focuses on the intended message.
Since the R programming language is my go-to tool for data analysis, I will use it to summarise and plot the data. I will not go into why I use R or the {tidyverse} here, but will let the code (and its readability) speak for itself.
Let’s start simple with a bar plot that shows the number of art pieces created after 1990, by medium (code below).
Already here, using labs(), we can add a title, a subtitle and a caption, and set the x and y axis labels. The chosen title and subtitle show what I am trying to communicate with this chart:
- Most of the art pieces made from 1990 onward are etchings on paper,
- Only 179 art pieces are digital prints
This exemplifies the idea that the title should be used to guide your audience towards reading the chart correctly, as the main messages are communicated in plain sight.
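A minimal sketch of such a first chart follows. It assumes a small made-up table, `artwork_toy`, standing in for the Tate data: only the 179 digital prints come from the text above, while the other counts and the caption wording are invented for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the Tate artwork table: one row per art piece.
# Only the 179 digital prints match the post; 500 and 300 are made up.
artwork_toy <- tibble(
  medium = rep(
    c("Etching on paper", "Screenprint on paper", "Digital print on paper"),
    times = c(500, 300, 179)
  )
)

# Summarise: one row per medium with the number of art pieces.
artwork_counts <- artwork_toy %>%
  count(medium, name = "artworks")

# Bar plot whose title and subtitle already state the message.
p_basic <- ggplot(artwork_counts, aes(x = artworks, y = medium)) +
  geom_col() +
  labs(
    title = "Most art pieces made from 1990 onward are etchings on paper",
    subtitle = "Only 179 art pieces are digital prints",
    caption = "Data: toy example, not the real Tate collection",
    x = "Number of art pieces",
    y = NULL
  )

p_basic
```

All the annotation lives in a single `labs()` call, so the message can be refined without touching the rest of the plot.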
With little code and some basic styling I created a simple bar chart. What is the first lesson here?
Annotating (here done using labs()) is a fundamental part of the plot: do not focus only on the plot area!
Let’s go a step further in the data visualisation and order the bars according to their height. To achieve this, we need to order the underlying data. We can treat the variable medium as a factor and order it based on how many artworks were created with each medium (medium is our grouping variable). This tidies up the plot and immediately gives a sense of which media are the most and least used, without people having to look at the numbers.
The go-to tool to achieve this in the {tidyverse} framework is the {forcats} package, a set of functions that help us deal with factors. If you have never looked into it, please do: it will greatly help your plotting skills. In this specific case I will use fct_infreq() to order a factor by frequency. This means that medium values that appear less frequently in the original table will become the last levels of the factor.
Think of factors as labels that you give to your data using colored post-it notes. You first label the data, and then you might want to order the labels following the color gradient of the post-its. This is similar to what we are doing, except that here we are ordering the post-its based on how often they appear in the data.
Look at the following code to see how they are ordered now. levels() is the function we use to inspect the order. The first level, “Etching on paper”, is the one that appears most often and is therefore first in the list.
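As a small illustration, assuming a made-up vector `medium_toy` rather than the real Tate column, the levels before and after fct_infreq() can be compared like this:

```r
library(forcats)

# Hypothetical vector of media; the frequencies are invented for illustration.
medium_toy <- factor(c(
  rep("Screenprint on paper", 3),
  rep("Etching on paper", 5),
  rep("Digital print on paper", 2)
))

# By default, factor levels are sorted alphabetically.
default_levels <- levels(medium_toy)
default_levels
# "Digital print on paper" "Etching on paper" "Screenprint on paper"

# fct_infreq() puts the most frequent value first.
infreq_levels <- levels(fct_infreq(medium_toy))
infreq_levels
# "Etching on paper" "Screenprint on paper" "Digital print on paper"
```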
I can then use fct_infreq() before count(), that is, before I summarise the data. If I plot the data like this, the bars will be ordered, but in ascending order. Since I want the bars in descending order, I can wrap the call in fct_rev(), which reverses the order of the factor levels. (The same result could be achieved by reversing the y axis, but in this case I have chosen the {forcats} solution.)
With just the following extra line in our code:
mutate(medium = fct_rev(fct_infreq(medium)))
I have achieved my goal: the bars are ordered in descending order!
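Here is a sketch of where that line could sit in the pipeline, again assuming the made-up `artwork_toy` table (only the 179 digital prints come from the post):

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Hypothetical stand-in for the Tate artwork table: one row per art piece.
artwork_toy <- tibble(
  medium = rep(
    c("Etching on paper", "Screenprint on paper", "Digital print on paper"),
    times = c(500, 300, 179)
  )
)

artwork_ordered <- artwork_toy %>%
  # Order the levels by frequency, then reverse them, before summarising.
  mutate(medium = fct_rev(fct_infreq(medium))) %>%
  count(medium, name = "artworks")

# The least frequent medium is now the first level...
levels(artwork_ordered$medium)

# ...and since a discrete y axis draws the first level at the bottom,
# the tallest bar ends up at the top of the chart.
p_ordered <- ggplot(artwork_ordered, aes(x = artworks, y = medium)) +
  geom_col()

p_ordered
```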
Yes, this prevalence of cats in the {tidyverse} makes you wonder whether the main developers are cat people. I dare not answer this question.
This plot will be the starting point for building the final version, a little bit more styled and ready to be shared in my presentation. Instead of coloring all the bars the same, I can selectively color only the bar that I want to highlight and keep the others in a duller tone. I can achieve this with scale_fill_identity(). (Strictly speaking, this function ships with {ggplot2} itself rather than with the {scales} package.)
To use this function, I first need to create a new variable that contains the chosen color for each bar as a string, which I can then map directly to the fill aesthetic. To create this new variable, which I called medium_col and which depends on the value of medium, I will write a conditional statement with if_else(). If the medium is “Digital print on paper” I will set it to a specific color (“#70284a”); if not, to another color (“#bdbdbd”).
I can then modify the code in the following way:
For readability, I have omitted parts of it, so you can focus only on the code that has changed.
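A minimal sketch of the highlighting step, under the same assumption of a made-up `artwork_toy` table; the column name medium_col and both hex colors are the ones mentioned above:

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Hypothetical stand-in for the Tate artwork table: one row per art piece.
artwork_toy <- tibble(
  medium = rep(
    c("Etching on paper", "Screenprint on paper", "Digital print on paper"),
    times = c(500, 300, 179)
  )
)

artwork_coloured <- artwork_toy %>%
  mutate(medium = fct_rev(fct_infreq(medium))) %>%
  count(medium, name = "artworks") %>%
  # One color string per bar: highlight digital prints, grey out the rest.
  mutate(
    medium_col = if_else(
      medium == "Digital print on paper", "#70284a", "#bdbdbd"
    )
  )

p_highlight <- ggplot(
  artwork_coloured,
  aes(x = artworks, y = medium, fill = medium_col)
) +
  geom_col() +
  # Use the strings in medium_col as colors, as they are, without a legend.
  scale_fill_identity()

p_highlight
```

Because the color is stored in the data, the highlight can be moved to another bar by changing only the condition in if_else().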
Now let’s focus again on the text of the plot.
Using {ggtext} I can style the text in the plot to match the color of the highlighted bar. This package allows us to use HTML and markdown syntax to style chosen parts of the text in the chart. I will focus now on the subtitle, coloring and emphasising the number of digital prints (179) and the “digital prints” text itself.
To modify the subtitle text, I need to incorporate HTML code such as <b style='color:#70284a'>, which sets the text in bold and in the color specified by the hex string. I then declare that the plot subtitle contains HTML by applying element_markdown() to plot.subtitle in theme().
The result is that the chosen parts are now in bold and in the chosen color, while the rest of the subtitle keeps the default {ggplot2} style. Quite nice already.
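The subtitle styling can be sketched as follows, assuming a made-up, already summarised table `artwork_counts` (only the 179 digital prints come from the post, and the title wording is illustrative):

```r
library(dplyr)
library(ggplot2)
library(ggtext)

# Hypothetical summary table; 500 and 300 are invented counts.
artwork_counts <- tibble(
  medium = c("Etching on paper", "Screenprint on paper", "Digital print on paper"),
  artworks = c(500, 300, 179)
)

# Subtitle with inline HTML: bold, colored text for the key number and words.
subtitle_html <- paste0(
  "Only <b style='color:#70284a'>179</b> art pieces are ",
  "<b style='color:#70284a'>digital prints</b>"
)

p_subtitle <- ggplot(artwork_counts, aes(x = artworks, y = medium)) +
  geom_col() +
  labs(
    title = "Most art pieces made from 1990 onward are etchings on paper",
    subtitle = subtitle_html
  ) +
  # Tell ggplot2 to render the subtitle as markdown/HTML instead of plain text.
  theme(plot.subtitle = element_markdown())

p_subtitle
```

Without the theme() line the tags would be printed literally in the subtitle.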
Let’s now go the extra mile with {ggtext} and also style the text inside the plot, like in the chart below.
The concept will not differ from before, but the code needs a little bit of work.
First, I will create the medium_styled column, which holds the text of the y axis labels, to which I have selectively added the HTML code using if_else() and glue(). This gives me a new variable that I can then use for the y axis of the plot.
For simplicity, I have done this after the count(), and to avoid messing up the factor order, I reorder the levels based on the artworks values using a different {forcats} function: fct_reorder().
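A sketch of this step, assuming the same made-up summary table `artwork_counts`; wrapping glue() in as.character() is my own precaution to keep both branches of if_else() plain character vectors:

```r
library(dplyr)
library(forcats)
library(glue)

# Hypothetical summary table, as it could look after count();
# only the 179 digital prints come from the post.
artwork_counts <- tibble(
  medium = c("Etching on paper", "Screenprint on paper", "Digital print on paper"),
  artworks = c(500, 300, 179)
)

artwork_styled <- artwork_counts %>%
  mutate(
    # Wrap only the highlighted medium in HTML; leave the others untouched.
    medium_styled = if_else(
      medium == "Digital print on paper",
      as.character(glue("<b style='color:#70284a'>{medium}</b>")),
      medium
    ),
    # Reorder the new labels by bar height (ascending, so the tallest bar
    # is the last level and is drawn at the top of a discrete y axis).
    medium_styled = fct_reorder(medium_styled, artworks)
  )

levels(artwork_styled$medium_styled)
```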
I can then apply the same reasoning to the artworks variable, to apply different colors to the numbers on top of the bars. In this way, everything that relates to digital prints on paper will have the same color. Once I have created the styled variables, I can apply them to the plot by mapping the y axis and the text labels to them. For styling the y axis, I can follow the same concept as for the subtitle:
I can add axis.text.y = ggtext::element_markdown() to the theme().
For the text labels, I need to replace geom_text() with a different geom, borrowed from the {ggtext} package: geom_richtext(). geom_richtext() is actually the natural substitute for geom_label(), so to remove the box around the labels you need to add a few extra arguments such as fill = NA and label.color = NA. We need to do this because {ggtext} has no direct substitute for geom_text().
Take a look at the code below to see how these steps have been implemented. With this, you are now ready to create the final version of our bar chart.
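Putting the pieces together, the final chart could be sketched like this. The table `artwork_counts` is made up (only the 179 digital prints come from the post), and the helper column is_digital, the artworks_styled name, the title wording and the hjust value are my own illustrative choices.

```r
library(dplyr)
library(forcats)
library(ggplot2)
library(ggtext)
library(glue)

# Hypothetical summary table; 500 and 300 are invented counts.
artwork_counts <- tibble(
  medium = c("Etching on paper", "Screenprint on paper", "Digital print on paper"),
  artworks = c(500, 300, 179)
)

artwork_final <- artwork_counts %>%
  mutate(
    is_digital = medium == "Digital print on paper",
    # Bar color, stored as a string for scale_fill_identity().
    medium_col = if_else(is_digital, "#70284a", "#bdbdbd"),
    # Styled y axis labels.
    medium_styled = if_else(
      is_digital,
      as.character(glue("<b style='color:#70284a'>{medium}</b>")),
      medium
    ),
    medium_styled = fct_reorder(medium_styled, artworks),
    # Styled numbers shown next to the bars.
    artworks_styled = if_else(
      is_digital,
      as.character(glue("<b style='color:#70284a'>{artworks}</b>")),
      as.character(artworks)
    )
  )

p_final <- ggplot(
  artwork_final,
  aes(x = artworks, y = medium_styled, fill = medium_col)
) +
  geom_col() +
  # Rich text labels; fill and label.color set to NA to drop the label box.
  geom_richtext(
    aes(label = artworks_styled),
    hjust = 0, fill = NA, label.color = NA
  ) +
  scale_fill_identity() +
  labs(
    title = "Most art pieces made from 1990 onward are etchings on paper",
    subtitle = paste0(
      "Only <b style='color:#70284a'>179</b> art pieces are ",
      "<b style='color:#70284a'>digital prints</b>"
    ),
    x = "Number of art pieces",
    y = NULL
  ) +
  theme(
    plot.subtitle = element_markdown(),
    axis.text.y = element_markdown()
  )

p_final
```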
I hope that this little exercise helps you see how useful it is to learn the {scales}, {ggtext} and {forcats} packages, and how they can be wonderful tools to style your plots and communicate data insights effectively.
I hope you have enjoyed this little exercise and have learnt some tips on how to modify your data visualisations to make them as readable and effective as possible. Of course, you are more than welcome to explore other visualisation types for the same kind of data, such as tree maps, lollipop charts or waffle charts, but keep the same data communication idea in mind however complex your chart is.
You may also find more tips for improving your data visualisation on the EPFL Extension School Twitter page (like the tweet below) and of course on my personal page @giurugg.
Originally published at https://github.com.