This is a beginner session; if you feel you know the basics, jump to the reading list for something more useful. It focuses on tools and techniques rather than on the foundations and theoretical body of work on scientific/information visualization. As we have only two hours, we will spend just a few minutes pointing you to the reading list below. The rest of the session is spent discussing choices, libraries and, of course, examples of generating visualizations using R and R-based wrapper packages for your projects and wider research work.
I think we all agree that:
- it would be hard to do any data analysis without charts
- it would be even more difficult to communicate that analysis without visualizing it
- as a bonus, you could even get an audience moving by changing bars to tree maps; watch John Stasko’s EuroVis capstone talk if you have time to find out why. But be warned!
I think we all know the sad effect of “flatten the curve”:
Getting started:
Creating a visualization requires a number of nuanced judgments. One must determine which questions to ask, identify the appropriate data, and select effective visual encodings to map data values to graphical features such as position, size, shape, and color.
~(Heer, Bostock, and Ogievetsky 2010)
This is the list of libraries we will be using in this short session:
ggplot2, gridExtra, dplyr, plotly, rjson, rbokeh, stats19.
Some of these packages, such as rbokeh and stats19, are not really necessary. You might end up just using plotly or ggplot2 for all your work. Having said that, as data scientists you will end up (in Python or R) doing a lot of package/library management in your workflows.
Feel free to go with Python; I hope you will find nothing dogmatic for or against R here. I think we discussed this in the Git/GitHub session, but just in case.
We have everything we need:
If you need datasets, remember one of the base packages in R is called datasets. You can discover the datasets in it in your own time (tip: try library(help = "datasets")).
rownames(installed.packages(priority = "base"))
## [1] "base" "compiler" "datasets" "graphics" "grDevices" "grid"
## [7] "methods" "parallel" "splines" "stats" "stats4" "tcltk"
## [13] "tools" "utils"
# rownames(installed.packages(priority="recommended"))
# stats, graphics etc
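As a small sketch of that tip, the data sets can also be listed programmatically. This assumes only base R; the variable name ds is just for illustration.

```r
# data() with a package argument returns an object whose $results matrix
# has one row per data set (columns include "Item" and "Title")
ds = data(package = "datasets")$results
head(ds[, c("Item", "Title")])

# iris, used throughout this session, is one of them
"iris" %in% ds[, "Item"]
```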
This is actually not the most basic R bar plot but OK we accept it:
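# create dummy data (placeholder values: the original ones did not survive)
d = data.frame(
  name = c("A", "B", "C", "D", "E"),
  value = c(3, 12, 5, 18, 45)
)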
# barplot(height=d$value, names=d$name)
# this is the most basic R barplot
# barplot(height=d$value)
# or
# barplot(iris[1:10,1])
# with names
# barplot(iris[1:10,1], names=iris[1:10,5])
What I like about base R is that there are no wrapper functions and, just like with other functions in base R, what you see is what you get:
# a boring example
height = sample(50:100, 10)
weight = sample(50:100, 10)
# plot(height,weight)
# barplot(height, names = height)
and for histograms:
n = 100
x = rchisq(n = n, df = 3) # n number, df = degrees of freedom
# hist(x = x, freq = FALSE, xlim = c(0, 15))
# lines(x = density(x = x), col = "red")
Sometimes it is useful to show error bars on graphs. To do this, we need to calculate standard deviations; I believe Excel or PSPP (an alternative to SPSS) can do it from a given column. This, though, is where we leave it to ggplot2 to do the heavy lifting rather than doing it in base R.
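For comparison, a minimal base R sketch of bars with standard deviation error bars might look like the following. It uses the built-in iris data and arrows() to draw the error bars; the variable names are illustrative and not taken from the session material.

```r
# group means and standard deviations of Petal.Length per species
means = tapply(iris$Petal.Length, iris$Species, mean)
sds = tapply(iris$Petal.Length, iris$Species, sd)

# barplot() returns the x positions of the bar midpoints
mids = barplot(means,
               ylim = c(0, max(means + sds) * 1.1),
               ylab = "Petal length (cm)")

# arrows with flat heads (angle = 90) at both ends (code = 3) act as error bars
arrows(x0 = mids, y0 = means - sds,
       x1 = mids, y1 = means + sds,
       angle = 90, code = 3, length = 0.05)
```

This works, but every piece (summary, axis limits, bar positions) has to be handled by hand, which is the heavy lifting ggplot2 takes over.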
Scatter plots
There is great support for visualizing data in base R; let's try a custom scatter plot:
Saving as files
# pdf
pdf("~/Downloads/Sepal vs Petal Length in Iris.pdf")
plot(iris$Sepal.Length, iris$Petal.Length,
     col = iris$Species,
     main = "Sepal vs Petal Length in Iris",
     xlab = "Sepal.Length",
     ylab = "Petal.Length")
dev.off() # close the device so the file is actually written
It is actually possible; try ?boxplot. The following example is from its help page:
boxplot(count ~ spray, data = InsectSprays, col = "lightgray")
# p <- recordPlot()
# plot.new() ## clean up device
# p # redraw
For a short section with more on the base R plotting functions, have a read here if the reading list below is too much.
Finally, there is great support for arranging your plots, margins and more, so you can rely only on base R if all you do is black and white scientific work. As shown above, you can save them as png or pdf files to attach to your ground-breaking publications.
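As a hedged sketch of arranging plots and margins in base R, the following puts two plots side by side with par() and writes them to a PDF. The file name and the panel choices are assumptions made for illustration.

```r
out = file.path(tempdir(), "iris_panels.pdf") # illustrative file name
pdf(out, width = 8, height = 4)

old = par(mfrow = c(1, 2),      # one row, two columns of plots
          mar = c(4, 4, 2, 1))  # margins: bottom, left, top, right (in lines)

plot(iris$Sepal.Length, iris$Petal.Length,
     xlab = "Sepal.Length", ylab = "Petal.Length", main = "Scatter")
boxplot(Petal.Length ~ Species, data = iris, main = "Boxplot")

par(old)   # restore the previous graphical parameters
dev.off()  # close the device so the file is written
```

Swapping pdf() for png(), with width and height given in pixels, should produce a bitmap instead.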
Let's start similarly with the most basic barplot:
# p = ggplot(data = iris[1:10,], aes(x = Species, y = Sepal.Length)) +
# geom_bar(stat='identity')
p = ggplot(diamonds, aes(cut)) +
geom_bar(fill = "#0073C2FF")
# p
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
Remember this session should not be about ggplot2, otherwise I have missed my point. Having said that, the above needs some clarification:
- <DATA> is your data; work with columns and keep your rows for your “lines”
- <GEOM_FUNCTION> comes in a variety; examples are geom_bar and geom_histogram
- crucially, the mapping argument takes the crucial function aes(), or in the fancy words of its help page, “Construct aesthetic mappings”
p = ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point()
# p
# or
# ggplot(data = mpg) +
# geom_point(mapping = aes(x = displ, y = hwy, color = class))
We have seen or heard that ggplot2 can also do “smoothing”. We can see a simple example by extending the code above:
p = ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth()
# p
Notice the message given by the package, making it clear what “method” and what “formula” have been used to generate the smoothing? Let’s jump ahead by exploring the different methods that are used. The results are commented out on purpose.
plots = lapply(list("lm", "gam", "glm"), function(x){
  ggplot(mpg, aes(displ, hwy)) +
    geom_point(mapping = aes(color = class)) +
    geom_smooth(method = x)
})
# plots
# ml = marrangeGrob(plots, nrow=2, ncol=2)
# ggsave("multiplot.pdf", ml)
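To get a feel for what a method such as "lm" does, here is a hedged sketch that fits the same straight line outside ggplot2. It assumes geom_smooth() uses its default formula y ~ x for this method, which is what the message printed by ggplot2 reports; the name fit is illustrative.

```r
library(ggplot2) # only needed here for the mpg data set

# ordinary least-squares fit of highway mileage on engine displacement
fit = lm(hwy ~ displ, data = mpg)

coef(fit) # intercept and slope of the fitted line
```

A negative slope here matches the downward trend of the smoothed line in the scatter plot.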
To overcome the overlapping points in the above scatter plots, we can use another function from ggplot2. Check the examples below.
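The name of the function meant here did not survive in the text, so the following is only a sketch of one common option, geom_jitter(), combined with transparency via alpha. The width, height and alpha values are arbitrary choices for illustration.

```r
library(ggplot2)

# many cars in mpg share the same (cty, hwy) pair, so plain points overlap
p_overlap = ggplot(mpg, aes(cty, hwy)) +
  geom_point()
# p_overlap

# geom_jitter() adds a little random noise to each point so that
# overlapping observations become visible; alpha makes points see-through
p_jitter = ggplot(mpg, aes(cty, hwy)) +
  geom_jitter(width = 0.3, height = 0.3, alpha = 0.5)
# p_jitter

# number of rows whose coordinates repeat an earlier row
n_dup = sum(duplicated(mpg[, c("cty", "hwy")]))
n_dup
```

Jittering trades exact positions for visibility, so it suits exploration better than plots where the exact values matter.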
theme_set(theme_classic()) # play with these? theme_bw()
# Histogram on a Categorical variable
p = ggplot(mpg, aes(manufacturer))
p = p + geom_bar(aes(fill=class), width = 0.5) +
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
labs(title="Histogram on Categorical Variable",
subtitle="Manufacturer across Vehicle Classes")
# p
How about a density plot?
# Plot
p = ggplot(mpg, aes(cty))
p = p + geom_density(aes(fill=factor(cyl)), alpha=0.8) +
labs(title="Title: Density plot",
subtitle="Subtitle: City Mileage Grouped by Number of cylinders",
caption="Caption: Source: mpg",
x="City Mileage",
fill="# Cylinders")
# p
Boxplots
I have great memories of boxplots; the reason is the type of studies I did during my PhD, and maybe something special about boxplots in general too. For some reason someone has written an R package using Shiny and got themselves a space in the “Correspondence” section of Nature (Spitzer et al. 2014); I do think that is an overstatement though.
p = ggplot(iris, aes(x = Species, y = Sepal.Length)) +
geom_boxplot(width = 0.4, fill = "white") +
geom_jitter(aes(color = Species),
width = 0.1, size = 1) +
scale_color_manual(values = c("#00AFBB", "#E7B800", "#E7B8AF"))
# p
Rather than doing “fancy” visualization, as researchers we might actually need scientific information embedded in our work. So I think it is worth adding an example of bars with errors (standard deviation?).
# first we need a summary
iris.sum = iris %>%
group_by(Species) %>% # group
summarise(mean_PL = mean(Petal.Length), # Petal.Length mean for bars
sd_PL = sd(Petal.Length), # Petal.Length sd
n_PL = n(), # just keeping the number of each group
SE_PL = sd(Petal.Length)/sqrt(n())) # R gives us the SE too
p = ggplot(iris.sum, aes(Species, mean_PL)) +
geom_col() +
geom_errorbar(aes(ymin = mean_PL - sd_PL, ymax = mean_PL + sd_PL),
width=0.2) +
labs(y="Petal length (cm) ± s.d.", x = "Species") + theme_classic()
# p
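The summary above also computes a standard error column that the plot does not use. As a hedged variation, the same chart can show the standard error, sd / sqrt(n), instead of the standard deviation. The object names iris_se and p_se are illustrative.

```r
library(dplyr)
library(ggplot2)

# per-species mean and standard error of Petal.Length
iris_se = iris %>%
  group_by(Species) %>%
  summarise(mean_PL = mean(Petal.Length),
            SE_PL = sd(Petal.Length) / sqrt(n()))

p_se = ggplot(iris_se, aes(Species, mean_PL)) +
  geom_col() +
  geom_errorbar(aes(ymin = mean_PL - SE_PL, ymax = mean_PL + SE_PL),
                width = 0.2) +
  labs(y = "Petal length (cm) ± s.e.", x = "Species") +
  theme_classic()
# p_se
```

With 50 flowers per species the standard error is much smaller than the standard deviation, so it is worth stating in the axis label which one is drawn.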
I use React as my JS library (framework if you like) and the package I borrowed from Uber Eng is called ReactVis, which is what I use directly. In R, a prominent and stable option is “plotly”, and I should mention Python’s mighty Bokeh. The latter of course has an interface in R called rbokeh. So I will pick plotly for this section, although I fancy Bokeh more.
Shall we try a Boxplot? Just because they too like to start with
p = plot_ly(iris, x = ~Sepal.Length, color = ~Species, type = "box")
# p
This is where the R community work gets interesting and great. Run this or just have a look:
# same as before
p = ggplot(diamonds, aes(cut)) +
  geom_bar(fill = "#0073C2FF", width=1)
# p
fig = ggplotly(p)
# fig
And sometimes it is useful to be able to hover over a stacked plot:
p = ggplot(diamonds, aes(price, fill = cut)) +
geom_histogram(binwidth = 500) # bin is 30 by default
# p
fig = ggplotly(p)
# fig
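If the goal is to put such an interactive figure on a web page, one possible route is to write the widget to an HTML file with htmlwidgets::saveWidget(). This is a sketch under assumptions: the output path is illustrative, and selfcontained = FALSE is chosen here so that pandoc is not required.

```r
library(ggplot2)
library(plotly)

p_hist = ggplot(diamonds, aes(price, fill = cut)) +
  geom_histogram(binwidth = 500)
fig_hist = ggplotly(p_hist)

# write the widget to an HTML file; selfcontained = FALSE keeps the
# JavaScript dependencies in a folder next to the file
out = file.path(tempdir(), "diamonds_hist.html") # illustrative path
htmlwidgets::saveWidget(fig_hist, out, selfcontained = FALSE)
```

The resulting file can be opened in a browser or embedded in a page, for example through an iframe.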
Please feel free to ignore this example; I have included it for those of you who might be battling with tmap or a similar package, or who want to have something interactive in their web pages. Credit: the example is copied from the plotly docs here:
url = 'https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json'
counties = rjson::fromJSON(file=url)
url2= "https://raw.githubusercontent.com/plotly/datasets/master/fips-unemp-16.csv"
df = read.csv(url2, colClasses=c(fips="character"))
g = list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('white')
)
fig = plot_ly()
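fig = fig %>% add_trace( # this is smart in my opinion
  # arguments restored from the plotly docs example; check them against the original
  type = "choropleth",
  geojson = counties,
  locations = df$fips,
  z = df$unemp,
  colorscale = "Viridis",
  zmin = 0,
  zmax = 12,
  marker = list(line = list(width = 0))
)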
fig = fig %>% colorbar(title = "Unemployment Rate (%)")
fig = fig %>% layout(title = "2016 US Unemployment by County")
fig = fig %>% layout(geo = g)
# fig # <= is slow so bear with it.
Just to show that you do have loads of options and should not feel pushed to learn a particular method of doing your visualization, we can see how Bokeh generates boxplots:
# install.packages("rbokeh")
# data below
p = figure(ylab = "Sepal.Length") %>%
ly_boxplot(Species, Sepal.Length, data = iris, color = "black")
# p
It would be good to have a look at some real, messy data if we have time, or at least to show what code is needed before we arrive at generating graphs.
Dataset - US Elections
# get your csv from this link as we cannot link directly to the file
# https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/42MVDX
d = read.csv("~/Desktop/data/us-elections/1976-2016-president.csv")
# names(d)
# head(d)
# Winner and runner up candidates
# dw = d[(d$candidatevotes/d$totalvotes) > .4, ]
d = d %>% select(state, year, party, candidatevotes, totalvotes) %>%
mutate(percent = (candidatevotes/totalvotes) * 100)
# country wide percent
dc = d %>% group_by(year, party) %>%
summarise(percent = sum(percent), n= n()) %>%
mutate(percent = percent/n, n = NULL) %>% ungroup()
# head(dc)
# length(unique(dc$party))
# winning parties
dw = dc %>% filter(percent > 25)
# length(unique(dw$party))
# head(dw)
# add democratic-farmer-labor to democrat
# filter(dw, stringr::str_detect(party, "democratic-farmer-labor|democrat"))
# filter(dw, party == "democratic-farmer-labor")
# create new column to identify these two
dw$dems = stringr::str_detect(dw$party, "democratic-farmer-labor|democrat")
# dw
dw = dw %>% group_by(year, dems) %>%
summarise(percent = sum(percent), n= n()) %>%
mutate(percent = percent/n, n = NULL) %>%
mutate(party = ifelse(dems, "democrat", "republican")) %>% # add back party
mutate(dems = NULL)
# length(unique(dw$party))
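# keep percent numeric for the bar heights and format a separate label column
dw$label = format(round(dw$percent, 2), nsmall = 2)
p = ggplot(dw, aes(x = factor(year), y = percent,
                   fill = party)) +
  geom_bar(stat = "identity", position = "dodge") +
  # the geom_text() call and its label mapping are reconstructed (assumed)
  geom_text(aes(label = label),
            position=position_dodge(width=0.9), vjust=-0.25) +
  # define the colours for the parties
  scale_fill_manual(
    values = c("democrat" = "blue",
               "republican" = "red"))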
# p
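Since the CSV cannot be linked directly, the core wrangling step can be tried on a tiny made-up table. All numbers and state names below are invented for illustration; only the column names follow the real file. The sketch also contrasts the unweighted mean of state percentages, which is what the pipeline above computes, with a vote-weighted share.

```r
library(dplyr)

# hypothetical toy data: two states, two parties, one election year
toy = data.frame(
  state = c("A", "A", "B", "B"),
  year = 2016,
  party = c("democrat", "republican", "democrat", "republican"),
  candidatevotes = c(60, 40, 90, 210),
  totalvotes = c(100, 100, 300, 300)
)

toy_c = toy %>%
  mutate(percent = (candidatevotes / totalvotes) * 100) %>%
  group_by(year, party) %>%
  summarise(
    mean_percent = mean(percent),  # unweighted mean across states
    weighted_percent = sum(candidatevotes) / sum(totalvotes) * 100,
    .groups = "drop"
  )
toy_c
```

The two columns differ whenever states have different turnout, so which one to plot depends on the question being asked.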
This is an example from work we have done for the stats19 R package. We download accident data from DfT, do some joining and then plot a line with a smooth function:
crashes = get_stats19(year = 2017, type = "accident", ask = FALSE)
casualties = get_stats19(year = 2017, type = "casualties", ask = FALSE)
crashes_wy = crashes[crashes$police_force == "West Yorkshire", ]
sel = casualties$accident_index %in% crashes_wy$accident_index
casualties_wy = casualties[sel, ]
cas_types = casualties_wy %>%
select(accident_index, casualty_type) %>%
mutate(n = 1) %>%
group_by(accident_index, casualty_type) %>%
summarise(n = sum(n)) %>%
tidyr::spread(casualty_type, n, fill = 0)
cas_types$Total = rowSums(cas_types[-1])
cj = left_join(crashes_wy, cas_types, by = "accident_index")
# then group by dates
crashes_dates = cj %>%
st_set_geometry(NULL) %>%
  group_by(date) %>%
  summarise(
    walking = sum(Pedestrian),
    cycling = sum(Cyclist),
    passenger = sum(`Car occupant`)
  ) %>%
tidyr::gather(mode, casualties, -date)
# finally
p = ggplot(crashes_dates, aes(date, casualties)) +
geom_smooth(aes(colour = mode), method = "loess") +
ylab("Casualties per day")
# p
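The reshaping step in the pipeline above uses tidyr::gather(), which tidyr now documents as superseded by pivot_longer(). As a hedged sketch, the same wide-to-long reshape on a small made-up table could look like this; the dates and counts are invented for illustration.

```r
library(tidyr)

# hypothetical daily counts in wide form, one column per mode
wide = data.frame(
  date = as.Date(c("2017-01-01", "2017-01-02")),
  walking = c(1, 0),
  cycling = c(2, 1)
)

# one row per date and mode, which is the shape ggplot2 wants
# for aes(colour = mode)
long = pivot_longer(wide, cols = -date,
                    names_to = "mode", values_to = "casualties")
long
```

The long table has one row per date and mode, matching what gather(mode, casualties, -date) produces in the original code.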
If you would like to know how visualization is used in everyday work, feel free to explore these scripts, which are part of our “SaferActive” project. They were written by our colleagues, who are looking for patterns in how road safety changes with road safety interventions and cycling uptake in the UK.
- John Stasko: The Value of Visualization…and Why Interaction Matters, EuroVis Capstone Talk.