These are my notes while going through the book Data Visualization: A Practical Introduction, by Kieran Healy: http://vissoc.co/index.html#preface
CHAPTERS ONE AND TWO
Let’s first load all required libraries:
library(tidyverse)
library(socviz)
library(gapminder)
library(here)
library(ggrepel)
library(maps)
Now that our functions are loaded through their respective libraries, we can start loading some data from the gapminder library:
gapminder
Let’s make a quick plot:
p <- ggplot(data = gapminder,
mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point()
CHAPTER THREE
The central activity of visualizing data with ggplot more or less always involves the same sequence of steps. There is some structured relationship, some mapping, between the variables in your data and their representation in the plot displayed on your screen or on the page. ggplot provides you with a set of tools to map data to visual elements on your plot, to specify the kind of plot you want, and then subsequently to control the fine details of how it will be displayed. In ggplot, the logical connections between your data and the plot elements are called aesthetic mappings or just aesthetics.
-Tell the ggplot() function what your data is
-Then how the variables in this data logically map onto the plot’s aesthetics
-Take the result and say what general sort of plot you want. In ggplot, the overall type of plot is called a geom. Each geom has a function that creates it.
-You combine these two pieces, the ggplot() object and the geom, by literally adding them together in an expression, using the “+” symbol.
So:
1. Data (what data we want to use)
2. Mapping (what relationships we want to see)
3. Geom (how we want to see the relationships in our data)
4. Coordinates and Scales (reference)
5. Labels and Guides (explanation)
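The five steps above can be sketched as one plot that is built up piece by piece. This is only an illustration: it uses R's built-in mtcars data instead of gapminder (an assumption, so the snippet runs with ggplot2 alone), and the object name p_steps and the labels are made up.

```r
library(ggplot2)

# 1. Data and 2. Mapping: which data frame, and which variable goes on which aesthetic
p_steps <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg))

# 3. Geom: how we want to see the relationship (points plus a linear fit)
p_steps <- p_steps + geom_point() + geom_smooth(method = "lm")

# 4. Coordinates and scales, 5. Labels and guides
p_steps <- p_steps +
  scale_x_continuous(breaks = 2:5) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       title = "Weight and fuel economy (mtcars)")

p_steps
```

Each `+` adds one piece to the same object, so the plot can be inspected after any of the steps.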
p + geom_point() + geom_smooth()
p + geom_smooth() + geom_point() # draws the points last, on top of the smooth line
p + geom_point() + geom_smooth(method = "lm")
It’s possible to give geoms separate instructions that they will follow instead, but in the absence of any other information, each geom will look for the instructions it needs in the ggplot() function, or the object created by it.
An aesthetic mapping specifies that a variable will be expressed by one of the available visual elements, such as size, or color, or shape, and so on. Code does not give a direct instruction like “color the points purple”. Instead it says, “the property ‘color’ will represent the variable continent”, or “color will map continent”. The aes() function is for mappings only. Do not use it to change properties to a particular value. If we want to set a property, we do it in the geom_ we are using, and outside the mapping = aes(…) step.
The various geom_ functions can take many other arguments that will affect how the plot looks, but that do not involve mapping variables to aesthetic elements.
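The difference between mapping and setting can be made concrete with a small sketch. It uses the built-in mtcars data rather than gapminder (an assumption, to keep it self-contained); the object names are made up.

```r
library(ggplot2)

p_base <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg))

# Setting: color is a fixed property of the geom, given outside aes()
p_set <- p_base + geom_point(color = "purple", size = 2, alpha = 0.6)

# Mapping: color represents a variable, so ggplot builds a scale and a legend
p_map <- p_base + geom_point(mapping = aes(color = factor(cyl)))

# Common mistake: inside aes(), "purple" is treated as a one-value variable,
# so the points get a default palette color and a legend with the single entry "purple"
p_oops <- p_base + geom_point(mapping = aes(color = "purple"))
```

Printing p_set and p_oops side by side shows why aes() should be reserved for variables.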
Now a more polished view:
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, colour = continent, fill = continent))
p + geom_point(alpha = 0.3) +
geom_smooth(method = "gam") +
scale_x_log10(labels = scales::dollar) +
labs(x = "GDP per Capita", y = "Life Expectancy in Years",
title = "Economic Growth and Life Expectancy, by Continent",
subtitle = "Data points are country and years",
caption = "Source: Gapminder")
By default, geoms inherit their mappings from the ggplot() function. We can change this by mapping the aesthetics we want only to geom_ functions that we want them to apply to. We use the same mapping = aes(…) expression as in the initial call to ggplot(), but now use it in the geom_ functions as well, specifying only the mappings we want to apply to each one. Mappings specified only in the initial ggplot() function will carry through to all subsequent geoms.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(mapping = aes(color = continent)) +
geom_smooth(method = "loess", color = "black") +
scale_x_log10(labels = scales::dollar) +
labs(x = "GDP per Capita", y = "Life Expectancy in Years",
title = "Economic Growth and Life Expectancy, by Continent",
subtitle = "Data points are country and years",
caption = "Source: Gapminder")
You can set the default size of plots within your .Rmd document by setting an option in your first code chunk. This one tells R to make 12x5 figures (width x height, in inches):
knitr::opts_chunk$set(fig.width=12, fig.height=5) # this sets the environment's default
The same can be done on a per-chart basis, by setting fig.width and fig.height in the header of an individual chunk:
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(mapping = aes(color = continent)) +
geom_smooth(method = "loess", color = "black") +
scale_x_log10(labels = scales::dollar) +
labs(x = "GDP per Capita", y = "Life Expectancy in Years",
title = "Economic Growth and Life Expectancy, by Continent",
subtitle = "Data points are country and years",
caption = "Source: Gapminder")
Now save an image of the chart (using the here-package)
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(mapping = aes(color = continent)) +
geom_smooth(method = "loess", color = "black") +
scale_x_log10(labels = scales::dollar) +
labs(x = "GDP per Capita", y = "Life Expectancy in Years",
title = "Economic Growth and Life Expectancy, by Continent",
subtitle = "Data points are country and years",
caption = "Source: Gapminder")
ggsave(here("figures", "lifeexp_vs_gdp.png"), width = 12, height = 4) # called on its own line rather than added with +; by default it saves the last plot
CHAPTER FOUR
The ggplot library is an implementation of the “grammar” of graphics: a set of rules for producing graphics from data, taking pieces of data and mapping them to geometric objects that have aesthetic attributes, together with further rules for transforming the data if needed, adjusting scales, and projecting the results onto a different coordinate system.
p4 <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p4 + geom_line(color = "gray70", aes(group = country)) +
geom_smooth(size = 1.1, method = "loess", se = FALSE) +
scale_y_log10(labels = scales::dollar) +
facet_wrap(~ continent, ncol = 5) +
labs(x = "Year",
y = "GDP per Capita",
title = "GDP per Capita on Five Continents")
Now let’s create a grid of more than one variable:
p45 <- ggplot(data = gss_sm, mapping = aes(x = age, y = childs))
p45 + geom_point(alpha = 0.2) +
geom_smooth() +
facet_grid(sex ~ race) +
theme_minimal()
Every geom_ function has an associated stat_ function that it uses by default. The reverse is also the case: every stat_ function has an associated geom_ function that it will plot by default if you ask it to. Below we ask geom_bar() for the prop(ortion) statistic, written ..prop.., rather than the default count statistic, written ..count..
p46 <- ggplot(data = gss_sm, mapping = aes(x = bigregion))
p46 + geom_bar(mapping = aes(y = ..prop..))
p46 <- ggplot(data = gss_sm, mapping = aes(x = bigregion))
p46 + geom_bar(mapping = aes(y = ..prop.., group = 1)) # group = 1 treats all observations as a single group, so each bar shows its bigregion's proportion of the total; without it, every bar is its own group and every proportion is 1
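What group = 1 does to the computed proportions can be checked on a tiny made-up data frame. This sketch assumes ggplot2 3.3.0 or later, where after_stat(prop) is the newer spelling of ..prop..; the data and object names are hypothetical.

```r
library(ggplot2)

# Toy data: two observations of "a", one each of "b" and "c"
toy <- data.frame(x = c("a", "a", "b", "c"))

# Without group = 1, every bar is its own group, so each proportion is 1
p_each <- ggplot(toy, aes(x = x)) +
  geom_bar(aes(y = after_stat(prop)))

# With group = 1, all rows form one group and the proportions sum to 1
p_total <- ggplot(toy, aes(x = x)) +
  geom_bar(aes(y = after_stat(prop), group = 1))

# layer_data() returns the values the stat computed for the first layer
layer_data(p_each)$prop
layer_data(p_total)$prop
```

Inspecting layer_data() like this is a quick way to see what a stat is actually handing to the geom.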
p49 <- ggplot(data = gss_sm, mapping = aes(x = religion, fill = religion))
p49 + geom_bar(mapping = aes(y = ..prop.., group = 1)) +
theme_minimal() +
guides(fill = FALSE)
p410 <- ggplot(data = gss_sm,
mapping = aes(x = bigregion, fill = religion))
p410 + geom_bar(position = "fill")
p410 + geom_bar(position = "dodge",
mapping = aes(y = ..prop.., group = religion))
p414 <- ggplot(data = gss_sm,
mapping = aes(x = religion))
p414 + geom_bar(position = "dodge",
mapping = aes(y = ..prop.., group = bigregion)) +
facet_wrap(~ bigregion, ncol = 2) +
theme_minimal()
p416 <- ggplot(data = subset(midwest, subset = state %in% c("OH", "WI", "IL")),
mapping = aes(x = percollege, fill = state))
p416 + geom_histogram(alpha = 0.4, bins = 20)
p418 <- ggplot(data = midwest,
mapping = aes(x = area, fill = state, color = state))
p418 + geom_density(alpha = 0.4)
Now some Titanic data, to show that stat = 'identity' tells geom_bar() not to do any stat calculation (like count or prop). Alternatively you can use geom_col(), which does this automatically (it skips the stat calculation).
Also note how position = 'dodge' changes the bar chart from stacked to side-by-side.
p420 <- ggplot(data = titanic, mapping = aes(x = fate, y = percent, fill = sex))
p420 + geom_bar(stat = 'identity') +
theme_minimal() +
theme(legend.position = "top")
p420 + geom_bar(position = 'dodge',stat = 'identity') +
theme_minimal() +
theme(legend.position = "top")
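The relation between geom_bar(stat = "identity"), geom_col() and position = "dodge" can also be seen on a made-up table of pre-computed percentages. The data below is hypothetical and is not the socviz titanic table.

```r
library(ggplot2)

# Hypothetical pre-computed percentages, one row per bar segment
toy <- data.frame(fate = c("died", "died", "lived", "lived"),
                  sex = c("f", "m", "f", "m"),
                  percent = c(10, 60, 20, 10))

p_toy <- ggplot(toy, aes(x = fate, y = percent, fill = sex))

p_identity <- p_toy + geom_bar(stat = "identity")  # bar heights taken from `percent` as-is
p_col      <- p_toy + geom_col()                   # same result, less typing
p_dodge    <- p_toy + geom_col(position = "dodge") # side by side instead of stacked
```

In the stacked versions the segments for one fate sit on top of each other; in the dodged version every bar starts at zero.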
Now try a difference map, but show it vertical instead of the horizontal view the book provides
p421 <- ggplot(data = oecd_sum, mapping = aes(x = year, y = diff, fill = hi_lo))
p421 + geom_col() +
guides(fill = FALSE) +
theme_minimal() +
labs(x = NULL, y = "Difference in years",
title = "US vs rest Life Expectancy gap")
CHAPTER FIVE
Prepping the data before visualizing it, using dplyr. Also introducing the pipe-operator: %>%. Data goes in one side of the pipe, actions are performed via functions, and results come out the other. A pipeline is typically a series of operations, like group_by(), filter() rows, select() columns, mutate(), and summarize(). As dplyr’s functions see things, summarizing actions “peel off” one grouping level at a time, so that the resulting summaries are at the next level up.
Let’s start by creating a table:
rel_by_region <- gss_sm %>%
group_by(bigregion, religion) %>%
summarize(N = n()) %>%
mutate(freq = N / sum(N),
pct = round((freq*100), 0))
Factor `religion` contains implicit NA, consider using `forcats::fct_explicit_na`
The variables specified in group_by() are retained in the new summary table; the variables created with summarize() and mutate() are added, and all the other variables in the original dataset are dropped.
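The "peel off" behaviour can be checked directly. A minimal sketch, assuming dplyr 1.0.0 or later (for the .groups argument) and using the built-in mtcars data as a stand-in for gss_sm; the object name is made up.

```r
library(dplyr)

by_cyl_gear <- mtcars %>%
  group_by(cyl, gear) %>%
  # "drop_last" is the default behaviour, spelled out here: `gear` is peeled off,
  # and the result stays grouped by `cyl`
  summarize(N = n(), .groups = "drop_last") %>%
  # because the table is still grouped by `cyl`, sum(N) is the total within each `cyl`
  mutate(freq = N / sum(N),
         pct = round(freq * 100, 0))

group_vars(by_cyl_gear)

# Sanity check: the shares should add up to 1 within every `cyl`
by_cyl_gear %>% summarize(total = sum(freq))
```

The same kind of check on a summary table (do the percentages add up to 100 within each group?) is a cheap way to catch a grouping mistake before plotting.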
p52 <- ggplot(rel_by_region, aes(x = bigregion, y = pct, fill = religion))
p52 + geom_col(position = "dodge2") +
geom_col(position = "dodge2", color = "black", fill = NA) +
labs(x = "region", y = "percent", fill = "religion") +
theme(legend.position = "top")
Two ways of drawing the same chart (although the book mentions the second cannot work this way). Mapping religion to y directly, without coord_flip(), works as of ggplot2 3.3.0:
p53 <- ggplot(rel_by_region, aes(x = religion, y = pct, fill = religion))
p53 + geom_col(position = "dodge2") +
labs(x = NULL, y = "%", fill = "Religions") +
guides(fill = FALSE) +
coord_flip() +
facet_grid(~ bigregion)
p53a <- ggplot(rel_by_region, aes(x = pct, y = religion, fill = religion))
p53a + geom_col(position = "dodge2") +
labs(x = "%", y = NULL) +
guides(fill = FALSE) +
facet_grid(~bigregion)
p54 <- ggplot(data = organdata, aes(x = year, y = donors))
p54 + geom_point()
p55 <- ggplot(data = organdata, aes(x = year, y = donors))
p55 + geom_line(aes(group = country)) + facet_wrap(~ country) + theme_minimal()
p56 <- ggplot(data = organdata, aes(x = country, y = donors))
p56 + geom_boxplot() + coord_flip()
To order the countries by their donation rate instead of alphabetically, we can use the reorder() function. It takes two required arguments. The first is the categorical variable or factor that we want to reorder. The second is the variable we want to reorder it by. The third and optional argument to reorder() is the function you want to use as a summary statistic; if it is left out, reorder() uses the mean.
p58 <- ggplot(data = organdata, aes(x = reorder(country, donors, na.rm=TRUE) ,y = donors, fill = world))
p58 + geom_boxplot() + labs(x=NULL) + coord_flip() + theme(legend.position = "top")
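How reorder() decides on the order is easiest to see on a few made-up values. The vectors below are hypothetical and only stand in for the country and donors columns.

```r
# Toy data: three countries with two values each, one of them missing
country <- c("A", "A", "B", "B", "C", "C")
donors  <- c(2, 10, 4, NA, 5, 5)

# Default summary function is mean(); na.rm = TRUE is passed on to it
by_mean <- reorder(country, donors, na.rm = TRUE)
levels(by_mean)  # ordered by group mean: B (4), C (5), A (6)

# Third argument: order by another summary statistic, here the minimum
by_min <- reorder(country, donors, FUN = min, na.rm = TRUE)
levels(by_min)   # ordered by group minimum: A (2), B (4), C (5)
```

The values themselves are untouched; only the order of the factor levels changes, which is what ggplot uses to arrange the axis.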
by_country <- organdata %>% group_by(consent_law, country) %>%
summarize(donors_mean= mean(donors, na.rm = TRUE),
donors_sd = sd(donors, na.rm = TRUE),
gdp_mean = mean(gdp, na.rm = TRUE),
health_mean = mean(health, na.rm = TRUE),
roads_mean = mean(roads, na.rm = TRUE),
cerebvas_mean = mean(cerebvas, na.rm = TRUE))
by_country <- organdata %>% group_by(consent_law, country) %>%
summarize_if(is.numeric, list(mean = mean, sd = sd), na.rm = TRUE) %>% # a named list of functions replaces the deprecated funs()
ungroup()
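In newer versions of dplyr the same kind of summary is usually written with across(). A minimal sketch, assuming dplyr 1.0.0 or later and using the built-in mtcars data as a stand-in for organdata; the object name is made up.

```r
library(dplyr)

by_cyl <- mtcars %>%
  group_by(cyl) %>%
  # apply mean and sd to every numeric column; new columns are named <column>_<function>
  summarize(across(where(is.numeric),
                   list(mean = ~ mean(.x, na.rm = TRUE),
                        sd   = ~ sd(.x, na.rm = TRUE))),
            .groups = "drop")

names(by_cyl)[1:3]
```

The resulting column names follow the same <variable>_mean and <variable>_sd pattern as the summary table above.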
Note: funs() is soft deprecated as of dplyr 0.8.0. Use a list of either functions or lambdas instead: a simple named list such as list(mean = mean, median = median), an auto-named one with tibble::lst(mean, median), or lambdas such as list(~ mean(., trim = .2), ~ median(., na.rm = TRUE)).
The next phase is adding some labels. And although geom_text() does the trick, it doesn’t have many clean options that work if the standard does not suffice. That’s where we load the ggrepel library. The ggrepel library provides geom_text_repel() and geom_label_repel(), two geoms that can pick out labels much more flexibly than the default geom_text().
p518 <- ggplot(elections_historic, aes(x = popular_pct, y = ec_pct, label = winner_label))
p518 + geom_hline(yintercept = 0.5, size = 1.4, color = "gray80") +
geom_vline(xintercept = 0.5, size = 1.4, color = "gray80") +
geom_point() +
geom_text_repel() +
scale_x_continuous(labels = scales::percent) +
scale_y_continuous(labels = scales::percent) +
labs(x = "Winner's share Pop. Vote", y = "Winner's share Elec. Vote", title = "Presidents", subtitle = "Elections 1824 - 2016", caption = "Caption text")
Labelling only subsets of the data, based on conditions:
p518 <- ggplot(data = by_country, mapping = aes(x = gdp_mean, y = health_mean))
p518 + geom_point() +
geom_text_repel(data = subset(by_country,
gdp_mean > 25000 | health_mean < 1500 | country %in% "Belgium"),
mapping = aes(label = country))
Annotate is for adding text that is not directly related to the data. We can still use the geom_text arguments by using annotate as follows:
p + geom_point() + annotate(geom = "text", x = 91, y = 33, label = "A surprisingly high recovery rate.", hjust = 0)
The annotate() function can work with other geoms, too. Use it to draw rectangles, line segments, and arrows. Just remember to pass along the right arguments to the geom you use.
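A sketch of annotate() with a rectangle, an arrow and a text label. It uses the built-in mtcars data (an assumption, to keep it self-contained); the coordinates and the label are made up for illustration.

```r
library(ggplot2)

p_note <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point() +
  # shaded rectangle: takes xmin/xmax/ymin/ymax, the aesthetics of geom_rect()
  annotate(geom = "rect", xmin = 5, xmax = 5.6, ymin = 9, ymax = 16,
           fill = "red", alpha = 0.2) +
  # arrow: a line segment from (x, y) to (xend, yend) with an arrow head
  annotate(geom = "segment", x = 4.6, xend = 5, y = 22, yend = 16.5,
           arrow = arrow(length = unit(0.2, "cm"))) +
  # text label placed at the start of the arrow
  annotate(geom = "text", x = 4.6, y = 23, label = "Heaviest cars", hjust = 0.5)

p_note
```

Each annotate() call adds its own layer, and the arguments it needs are the ones the named geom would need.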
Consistent with ggplot’s overall approach, adjusting some visible feature of the graph means first thinking about the relationship that the feature has with the underlying data. Roughly speaking, if the change you want to make will affect the substantive interpretation of any particular geom, then most likely you will either be mapping an aesthetic to a variable using that geom’s aes() function, or you will be specifying a change via some scale_ function. If the change you want to make does not affect the interpretation of a given geom_, then most likely you will either be setting a variable inside the geom_ function, or making a cosmetic change via the theme() function.
Because we have several potential mappings, and each mapping might be to one of several different scales, we end up with a lot of individual scale_ functions. Each deals with one combination of mapping and scale. They are named following the pattern scale_<MAPPING>_<KIND>(), for example scale_x_log10() or scale_color_manual().
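A sketch with one scale_ function per mapping, to show how the names combine a mapping with a kind of scale. It uses the built-in mtcars data (an assumption); the colors and breaks are arbitrary.

```r
library(ggplot2)

p_scales <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl), fill = hp)) +
  geom_point(shape = 21, size = 3) +             # shape 21 has both an outline color and a fill
  scale_x_log10() +                              # mapping: x, kind: log10
  scale_y_continuous(breaks = seq(10, 35, 5)) +  # mapping: y, kind: continuous
  scale_color_manual(values = c("4" = "gray20", "6" = "gray50", "8" = "gray80")) +  # mapping: color, kind: manual
  scale_fill_gradient(low = "white", high = "firebrick")  # mapping: fill, kind: gradient

p_scales
```

A discrete variable (factor(cyl)) gets a discrete kind of scale, while the continuous variables get continuous kinds.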
CHAPTER SIX is mostly about working with statistical models, which I’m skipping for now
CHAPTER SEVEN - MAPS
election %>% select(state, total_vote, r_points, pct_trump, party, census) %>% sample_n(5)
party_colors <- c("#2E74C0","#CB454A")
p72 <- ggplot(data = subset(election, st %nin% "DC"),
mapping = aes(x = r_points,
y = reorder(state, r_points),
color = party))
p72a <- p72 + geom_vline(xintercept = 0, color = "gray30") +
geom_point(size = 2)
p72b <- p72a + scale_color_manual(values = party_colors)
p72c <- p72b + scale_x_continuous(breaks = c(-30, -20, -10, 0, 10, 20, 30, 40),
labels = c("30\n (Clinton)", "20", "10", "0", "10", "20", "30", "40\n(Trump)"))
p72d <- p72c + facet_wrap(~ census, ncol = 1, scales = "free_y") +
guides(color=FALSE) +
labs(x = "Pt. mrgn", y ="")+
theme(axis.text = element_text(size = 8))
p72d
Now we are going to use the maps package (library call added at the top). The mapproj package also needs to be installed, because coord_map() uses it for the projection.
us_states <- map_data("state")
p73 <- ggplot(data = us_states, mapping = aes(x = long, y = lat, group = group, fill = region))
p73 + geom_polygon(color = "gray90", size = 0.1) +
coord_map(projection = "albers", lat0 = 39, lat1 = 45) +
guides(fill = FALSE)
You will probably end up merging the data you want to show at the state level with the data that is used for drawing the map. Make sure every join works, or you may end up with NA data. To reiterate, it is important to know your data and variables well enough to check that they have merged properly.
election$region <- tolower(election$state)
us_states_elec <- left_join(us_states, election, by = "region") # join key stated explicitly
#head(us_states_elec)
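One way to check that a join like this has worked is to look for keys without a match. A minimal sketch on two made-up miniature tables (hypothetical names and values, not the real map or election data):

```r
library(dplyr)

# Hypothetical miniature versions of the map table and the results table
map_rows <- data.frame(region = c("ohio", "ohio", "iowa", "district of columbia"),
                       group = c(1, 1, 2, 3))
results  <- data.frame(state = c("Ohio", "Iowa", "Washington DC"),
                       value = c(10, 20, 30))
results$region <- tolower(results$state)

# Which map regions have no matching row in the results?
unmatched <- anti_join(map_rows, results, by = "region")
unmatched$region

# After the join, unmatched regions show up as NA and would be drawn without a fill
merged <- left_join(map_rows, results, by = "region")
sum(is.na(merged$value))
```

Here the mismatch comes from two different spellings of the same place, which is a typical reason for a join to silently produce NA values.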
p76 <- ggplot(data = us_states_elec,
aes(x = long, y = lat, group = group, fill = party))
p76 + geom_polygon(color = "gray90", size = 0.1) +
coord_map(projection = "albers", lat0 = 39, lat1 = 45) +
scale_fill_manual(values = party_colors) +
labs(title = "Elections 2016", fill = NULL)
p78 <- ggplot(data = us_states_elec,
aes(x = long, y = lat, group = group, fill = pct_trump))
p78 + geom_polygon(color = "gray90", size = 0.1) +
coord_map(projection = "albers", lat0 = 39, lat1 = 45) +
scale_fill_gradient(low = "white", high = "#CB454A") +
labs(title = "Trump vote", fill = "Percent")
CHAPTER EIGHT
Looking at opportunities to tweak or customize things. It’s only when we have some specific plot in mind that the question of polishing the results comes up.
p81 <- ggplot(data = subset(asasec, Year == 2014),
mapping = aes(x = Members, y = Revenues, label = Sname))
p81 + geom_point(mapping = aes(color = Journal)) +
geom_smooth(method = "lm", se = FALSE, color = "gray80") +
geom_text_repel(data = subset(asasec,
Year == 2014 & Revenues > 10000),
size = 3) +
labs(x = "Membership", y = "Revenues", color = "Section has own journal",
title = "ASA Sections", subtitle = "2014 calendar year", caption = "Source: ASA annual report") +
scale_y_continuous(labels = scales::dollar) +
theme_dark() +
theme(legend.position = "bottom")
Themes can be turned on or off using the theme_set() function.
Internally, theme functions are a set of detailed instructions to turn on, turn off, or modify a large number of graphical elements on the plot. Once set, a theme applies to all subsequent plots and it remains active until it is replaced by a different theme.
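Because theme_set() changes the theme for all subsequent plots, it can be handy to keep the previous theme around and restore it afterwards. A minimal sketch, using the built-in mtcars data (an assumption); the object names are made up.

```r
library(ggplot2)

# theme_set() returns the previously active theme, so it can be stored
old_theme <- theme_set(theme_bw())

p_theme <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
p_theme                    # drawn with theme_bw(), the active theme
p_theme + theme_minimal()  # override for this one plot only

# Restore whatever theme was active before
theme_set(old_theme)
```

Adding a theme with + only affects that one plot, while theme_set() stays in effect until it is replaced.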
# Democrat Blue and Republican Red
party_colors <- c("#2E74C0", "#CB454A")
p0 <- ggplot(data = subset(county_data,
flipped == "No"),
mapping = aes(x = pop,
y = black/100))
p1 <- p0 + geom_point(alpha = 0.15, color = "gray50") +
scale_x_log10(labels=scales::comma)
p1
p2 <- p1 + geom_point(data = subset(county_data,
flipped == "Yes"),
mapping = aes(x = pop, y = black/100,
color = partywinner16)) +
scale_color_manual(values = party_colors)
p2
p3 <- p2 + scale_y_continuous(labels=scales::percent) +
labs(color = "County flipped to ... ",
x = "County Population (log scale)",
y = "Percent Black Population",
title = "Flipped counties, 2016",
caption = "Counties in gray did not flip.")
p3
p4 <- p3 + geom_text_repel(data = subset(county_data,
flipped == "Yes" &
black > 25),
mapping = aes(x = pop,
y = black/100,
label = state), size = 2)
p4 + theme_minimal() +
theme(legend.position="top")
theme_set(theme_bw())
p4 + theme(legend.position="top")
theme_set(theme_dark())
p4 + theme(legend.position="top")
library(ggthemes)
theme_set(theme_economist())
p4 + theme(legend.position="top")
theme_set(theme_wsj())
p4 + theme(plot.title = element_text(size = rel(0.6)),
legend.title = element_text(size = rel(0.35)),
plot.caption = element_text(size = rel(0.35)),
legend.position = "top")
The theme() function allows you to exert very fine-grained control over the appearance of all kinds of text and graphical elements in a plot.
yrs <- c(seq(1972,1988,4),1993, seq(1996,2016,4))
mean_age <- gss_lon %>%
filter(age %nin% NA & year %in% yrs) %>% # element-wise & (not &&) inside filter()
group_by(year) %>%
summarize(xbar = round(mean(age, na.rm = TRUE),0))
mean_age$y <- 0.3
yr_labs <- data.frame(x = 85, y = 0.8, year = yrs)
p816 <- ggplot(data = subset(gss_lon, year %in% yrs),
mapping = aes(x = age))
p816a <- p816 + geom_density(fill = "gray20", color = FALSE,
alpha = 0.9, mapping = aes(y = ..scaled..)) +
geom_vline(data = subset(mean_age, year %in% yrs),
aes(xintercept = xbar), color = "white", size = 0.5) +
geom_text(data = subset(mean_age, year %in% yrs),
aes(x = xbar, y = y, label = xbar), nudge_x = 7.5,
color = "white", size = 3.5, hjust = 1) +
geom_text(data = subset(yr_labs, year %in% yrs),
aes(x = x, y = y, label = year)) +
facet_grid(year ~ ., switch = "y")
p816a
p816a + theme_minimal(base_size = 10) +
theme(plot.title = element_text(size = 16),
axis.text.x= element_text(size = 12),
axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y = element_blank(),
strip.background = element_blank(),
strip.text.y = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
labs(x = "Age",
y = NULL,
title = "Age Distribution of\nGSS Respondents")
p822 <- ggplot(data = yahoo, mapping = aes(x = Employees, y = Revenue))
p822 + geom_path(color = "gray40", size = 2, alpha = 0.3) +
geom_text(aes(color = Mayer, label = Year),
size = 4, fontface = "bold") +
theme(legend.position = "bottom") +
labs(color = "Mayer is CEO",
x = "employees", y = "revenue (Mln)",
title = "Yahoos") +
scale_y_continuous(labels = scales::dollar) +
scale_x_continuous(labels = scales::comma)
