Layik Hama

Data visualization

L Hama 2020-11-30

Intro + Setup

This is a beginner session; if you feel you know the basics, jump to the reading list for something more useful. It focuses on tools and techniques rather than the foundations and theoretical body of work on scientific/information visualization. As we have only two hours, we will spend just a few minutes pointing you to the reading list below. The rest of the session is spent discussing choices, libraries and, of course, examples of generating visualizations using R and R-based wrapper packages for your projects and wider research work.

I think we all agree that:

- it would be hard to do any data analysis without charts
- it is even more difficult to communicate an analysis without visualizing it
- as a bonus, you could even get an audience moving by changing bars to treemaps; watch John Stasko’s EuroVis capstone talk (in the watching list below) if you have time to find out why. But be warned!

I think we all know the sad effect of the “flatten the curve” chart.

Getting started:

Creating a visualization requires a number of nuanced judgments. One must determine which questions to ask, identify the appropriate data, and select effective visual encodings to map data values to graphical features such as position, size, shape, and color.

(Heer, Bostock, and Ogievetsky 2010)

So:

Required R libraries

This is the list of libraries we will be using in this short session: ggplot2, gridExtra, dplyr, plotly, rjson, rbokeh, stats19. Packages gridExtra, rbokeh and stats19 are not really necessary; you might end up just using plotly or ggplot2 for all your work. Having said that, as data scientists (in Python or R) you will end up doing a lot of package/library management in your workflows.
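
If any of these are missing on your machine, a one-liner is enough to install them (a sketch; what CRAN asks you depends on your setup):

# install the session's packages in one go; skip any you already have
install.packages(c("ggplot2", "gridExtra", "dplyr", "plotly",
                   "rjson", "rbokeh", "stats19"))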

Why R?

Feel free to go with Python; I hope you will find nothing dogmatic here for or against R. I think we discussed this in the Git/GitHub session, but just in case:

We have everything we need:

plot(iris[,1:4])

If you need datasets, remember one of the base packages in R is called datasets. You can discover the datasets in it in your own time (tip: try library(help = "datasets")).
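
For example, to see what is in there (both of these just open listings, nothing is downloaded):

library(help = "datasets")  # overview of the datasets package
data(package = "datasets")  # list the datasets it ships
# ?iris                     # help page for one of them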

BaseR

rownames(installed.packages(priority="base"))
##  [1] "base"      "compiler"  "datasets"  "graphics"  "grDevices" "grid"     
##  [7] "methods"   "parallel"  "splines"   "stats"     "stats4"    "tcltk"    
## [13] "tools"     "utils"
# rownames(installed.packages(priority="recommended"))
# stats, graphics etc

This is actually not the most basic R bar plot, but it will do:

# create dummy data
d = data.frame(
  name=letters[1:5],
  value=sample(seq(4,15),5)
)
# barplot(height=d$value, names=d$name)
# this is the most basic R barplot
# barplot(height=d$value)
# or
# barplot(iris[1:10,1])
# with names
# barplot(iris[1:10,1], names=iris[1:10,5])

What I like about base R is that there are no wrapper functions: just like other functions in base R, you get what you see:

set.seed(12)
# a boring example
height = sample(50:100, 10)
weight = sample(50:100, 10) 
# plot(height,weight)
# barplot(height, names = height)

and for histograms:

set.seed(12)
n = 100
x = rchisq(n = n, df = 3) # n number, df = degrees of freedom
# hist(x = x, freq = FALSE, xlim = c(0, 15))
# lines(x = density(x = x), col = "red")

Sometimes it is useful to show error bars on graphs. To do this, we need to calculate standard deviations; I believe Excel or PSPP (an alternative to SPSS) can do it from a given column. This is where we normally leave it to ggplot2 to do the heavy lifting rather than doing it in base R (there is a ggplot2 example further down), though a base R sketch follows.
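
For completeness, a minimal base R sketch (my own, not from the reading list): compute group means and standard deviations, then draw the bars and use arrows() with flat caps for the error bars.

# mean and s.d. of Petal.Length per Species
means = tapply(iris$Petal.Length, iris$Species, mean)
sds   = tapply(iris$Petal.Length, iris$Species, sd)
# barplot() returns the bar midpoints, which we need to position the error bars
mids = barplot(means, ylim = c(0, max(means + sds) * 1.1),
               ylab = "Petal length (cm)")
# arrows() with angle = 90 and code = 3 draws a flat cap at both ends
arrows(mids, means - sds, mids, means + sds,
       angle = 90, code = 3, length = 0.05)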

Scatter plots

There is great support for visualizing data in base R; let’s try a custom scatter plot:
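
Something like this, for example (a sketch; the PDF example in the next subsection reuses the same plot):

# colour the points by species and add a matching legend
plot(iris$Sepal.Length, iris$Petal.Length,
     col = iris$Species, pch = 19,
     xlab = "Sepal.Length", ylab = "Petal.Length",
     main = "Sepal vs Petal Length in Iris")
legend("topleft", legend = levels(iris$Species),
       col = 1:3, pch = 19)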

Saving as files

# pdf

pdf("~/Downloads/Sepal vs Petal Length in Iris.pdf")

plot(iris$Sepal.Length, iris$Petal.Length,
     col = iris$Species,
     main = "Sepal vs Petal Length in Iris", 
     xlab = "Sepal.Length",
     ylab = "Petal.Length")

dev.off()
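
The same pattern works for bitmap devices such as png() (a sketch; adjust the path to wherever you want the file):

# open a png device, draw, then close the device
png("~/Downloads/iris-scatter.png", width = 800, height = 600)
plot(iris$Sepal.Length, iris$Petal.Length, col = iris$Species)
dev.off()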

Boxplots in base R?

It is actually possible; try ?boxplot. The following example is from the docs:

boxplot(count ~ spray, data = InsectSprays, col = "lightgray")
# p <- recordPlot()
# plot.new() ## clean up device
# p # redraw

For a short but useful overview of the base R plotting functions, have a read here if the reading list below is too much.

Finally, there is great support for arranging your plots, margins and more, so you could rely only on base R if all you do is black-and-white scientific work. As shown above, you can save the results as PNGs or PDFs to attach to your groundbreaking publications.
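
A quick sketch of what “arranging your plots” means in base R: par() controls the layout and margins of the current device.

# 1 row x 2 columns, with slightly tighter margins
old_par = par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
hist(iris$Sepal.Length, main = "Sepal length")
boxplot(Sepal.Length ~ Species, data = iris, main = "Sepal length by species")
par(old_par)  # restore the previous settings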

ggplot2

Hello Bar Chart

Let’s start, similarly, with the most basic barplot:

library(ggplot2)
# p = ggplot(data = iris[1:10,], aes(x = Species, y = Sepal.Length)) +
#   geom_bar(stat='identity')
p = ggplot(diamonds, aes(cut)) +
  geom_bar(fill = "#0073C2FF")
# p

ggplot2 formula

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

Remember, this session should not be about ggplot2, otherwise I have missed my point. Having said that, the above needs some clarification:

- <DATA> is your data: work with columns, and keep your rows for your “lines” (observations)
- <GEOM_FUNCTION> is the geometric layer you want, e.g. geom_bar(), geom_point(), geom_boxplot()
- <MAPPINGS> inside aes() map columns of your data to visual properties such as x, y, colour and fill
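
To make that concrete, here is the template filled in (a minimal sketch using iris, not one of the session’s own examples):

library(ggplot2)
# <DATA> = iris, <GEOM_FUNCTION> = geom_point, <MAPPINGS> = x, y and colour
ggplot(data = iris) +
  geom_point(mapping = aes(x = Sepal.Length, y = Petal.Length, colour = Species))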

Scatter plot

library(ggplot2)
p = ggplot(mpg, aes(displ, hwy, colour = class)) + 
  geom_point()
# p 
# or 
# ggplot(data = mpg) + 
#   geom_point(mapping = aes(x = displ, y = hwy, color = class))

We have seen or heard that ggplot2 can also do “smoothing”. We can see a simple example by building on the code above:

p = ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth()
# p

Notice the message given by the package making it clear which “method” and which “formula” have been used to generate the smoothing? Let’s jump ahead by exploring the different methods that can be used. The results are commented out on purpose.

library(gridExtra)
plots = lapply(list("lm", "gam", "glm"), function(x){
  ggplot(mpg, aes(displ, hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth(method = x)
}) 
# plots
# ml = marrangeGrob(plots, nrow=2, ncol=2) # arrange the plots in a 2x2 grid
# ml
# ggsave("multiplot.pdf", ml)

Distribution

To overcome the overlapping points in the above scatter plots, we can use other functions from ggplot2; check these examples, adapted from here:
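
Before those, one quick option (my own sketch, not one of the linked examples) is to jitter the points and make them translucent:

library(ggplot2)
# small random offsets plus transparency reveal how many points overlap
ggplot(mpg, aes(displ, hwy)) +
  geom_jitter(width = 0.1, height = 0.1, alpha = 0.5)

Now back to the linked examples: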

library(ggplot2)
theme_set(theme_classic()) # play with these? theme_bw()

# Histogram on a Categorical variable
p = ggplot(mpg, aes(manufacturer))
p = p + geom_bar(aes(fill=class), width = 0.5) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title="Histogram on Categorical Variable", 
       subtitle="Manufacturer across Vehicle Classes") 
# p

How about a density plot?

library(ggplot2)
theme_set(theme_classic())

# Plot
p = ggplot(mpg, aes(cty))
p = p + geom_density(aes(fill=factor(cyl)), alpha=0.8) + 
    labs(title="Title: Density plot", 
         subtitle="Subtitle: City Mileage Grouped by Number of cylinders",
         caption="Caption: Source: mpg",
         x="City Mileage",
         fill="# Cylinders")
# p

Boxplots

I have great memories of boxplots, partly because of the type of studies I did during my PhD and maybe because there is something special about boxplots in general too. For some reason someone wrote an R package using Shiny and got themselves a space in the “Correspondence” section of Nature Methods (Spitzer et al. 2014); I do think that is an overstatement though.

p = ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(width = 0.4, fill = "white") +
  geom_jitter(aes(color = Species), 
              width = 0.1, size = 1) +
  scale_color_manual(values = c("#00AFBB", "#E7B800", "#E7B8AF")) 
# p

Bars with errors

Rather than doing “fancy” visualization, as researchers we might actually need scientific information embedded in our work. So I think it is worth adding an example of bars with error bars (standard deviation here).

library(ggplot2)
library(dplyr)
# first we need a summary
iris.sum = iris %>% 
  group_by(Species) %>% # group
  summarise(mean_PL = mean(Petal.Length),  # Petal.Length mean for bars
            sd_PL = sd(Petal.Length), # Petal.Length sd
            n_PL = n(),  # just keeping the number of each group
            SE_PL = sd(Petal.Length)/sqrt(n())) # R gives us the SE too

p = ggplot(iris.sum, aes(Species, mean_PL)) + 
                   geom_col() +  
                   geom_errorbar(aes(ymin = mean_PL - sd_PL, ymax = mean_PL + sd_PL),
                                 width=0.2) +
  labs(y="Petal length (cm) ± s.d.", x = "Species") + theme_classic()
# p

JS (web)

I use React as my JS library (framework if you like), and the package I borrowed from Uber Engineering is called ReactVis; that is what I use directly. In R, a prominent and stable option is plotly, and I should also mention Python’s mighty Bokeh. The latter of course has an R interface called rbokeh.

Plotly

So I will pick plotly for this section, although I fancy Bokeh more. Shall we try a boxplot? Just because they too like to start with one.

library(plotly)
p = plot_ly(iris, x = ~Sepal.Length, color = ~Species, type = "box")
# p

ggplot + plotly

This is where the R community work gets interesting and great. Run this or just have a look:

library(plotly)
# same as before
p = ggplot(diamonds, aes(cut)) +
  geom_bar(fill = "#0073C2FF", width=1)
# p
fig = ggplotly(p)
# fig

And sometimes it is useful to be able to hover over a stacked plot:

p = ggplot(diamonds, aes(price, fill = cut)) +
  geom_histogram(binwidth = 500) # the default is bins = 30; here we set binwidth instead
# p
fig = ggplotly(p)
# fig

plotly geospatial

Please feel free to ignore this example; I have included it for those of you who might be battling with tmap or a similar package, or who would want something interactive on their web pages. Credit: the example is copied from the plotly docs here:

library(plotly)
library(rjson)

url = 'https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json'
counties = rjson::fromJSON(file=url)
url2= "https://raw.githubusercontent.com/plotly/datasets/master/fips-unemp-16.csv"
df = read.csv(url2, colClasses=c(fips="character"))
g = list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('white')
)
fig = plot_ly()
fig = fig %>% add_trace( # this is smart in my opinion
    type="choropleth",
    geojson=counties,
    locations=df$fips,
    z=df$unemp,
    colorscale="Viridis",
    zmin=0,
    zmax=12,
    marker=list(line=list(
      width=0)
    )
  )
fig = fig %>% colorbar(title = "Unemployment Rate (%)")
fig = fig %>% layout(title = "2016 US Unemployment by County")
fig = fig %>% layout(geo = g)

# fig # <= this is slow, so bear with it.

Bokeh mention

Just to show that you do have loads of options and you should not feel pushed to learn a particular method of doing your visualization, we can see how Bokeh generates boxplots:

# install.packages("rbokeh")
# data below
library(rbokeh)
p = figure(ylab = "Sepal.Length") %>%
  ly_boxplot(Species, Sepal.Length, data = iris, color = "black")
# p

Real data

It would be good to have a look at some real, messy data if we have time, or at least show the code that is needed before we arrive at generating graphs.

US elections

Dataset: US Elections (see the Harvard Dataverse link in the code below)

# get your csv from this link as we cannot link directly to the file
# https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/42MVDX
d = read.csv("~/Desktop/data/us-elections/1976-2016-president.csv")
library(dplyr)
# names(d)
# head(d)
# Winner and runner up candidates
# dw = d[(d$candidatevotes/d$totalvotes) > .4, ]
d = d %>% select(state, year, party, candidatevotes, totalvotes) %>%
  mutate(percent = (candidatevotes/totalvotes) * 100)

# country wide percent
dc = d %>% group_by(year, party) %>%
  summarise(percent = sum(percent), n= n()) %>%
  mutate(percent = percent/n, n = NULL) %>% ungroup()
# head(dc)
# length(unique(dc$party))

# winning parties
dw = dc %>% filter(percent > 25)
# length(unique(dw$party))
# head(dw)

# add democratic-farmer-labor to democrat
# filter(dw, stringr::str_detect(party, "democratic-farmer-labor|democrat"))
# filter(dw, party == "democratic-farmer-labor")

# create new column to identify these two
dw$dems = stringr::str_detect(dw$party, "democratic-farmer-labor|democrat")
# dw
dw = dw %>% group_by(year, dems) %>%
  summarise(percent = sum(percent), n= n()) %>%
  mutate(percent = percent/n, n = NULL) %>%
  mutate(party = ifelse(dems, "democrat", "republican")) %>% # add back party
  mutate(dems = NULL)

# length(unique(dw$party))
dw$label = format(round(dw$percent, 2), nsmall = 2) # keep percent numeric; formatted text goes in a new label column

library(ggplot2)

p = ggplot(dw, aes(x = factor(year), y = percent,
               fill = party)) +
  # define the colours for the parties
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label=label),
            position=position_dodge(width=0.9), vjust=-0.25) +
  scale_fill_manual("legend",
                    values = c("democrat" = "blue",
                               "republican" = "red"))
# p

UK Road Safety

This is an example from work we have done for the stats19 R package. We download accident data from the DfT, do some joining and then plot a line with a smooth function:

library(stats19)
library(dplyr)
crashes = get_stats19(year = 2017, type = "accident", ask = FALSE)
casualties = get_stats19(year = 2017, type = "casualties", ask = FALSE)
crashes_wy = crashes[crashes$police_force == "West Yorkshire", ]
sel = casualties$accident_index %in% crashes_wy$accident_index
casualties_wy = casualties[sel, ]
cas_types = casualties_wy %>% 
  select(accident_index, casualty_type) %>% 
  mutate(n = 1) %>% 
  group_by(accident_index, casualty_type) %>% 
  summarise(n = sum(n)) %>% 
  tidyr::spread(casualty_type, n, fill = 0) 
cas_types$Total = rowSums(cas_types[-1])
cj = left_join(crashes_wy, cas_types, by = "accident_index")
# then group by dates
library(ggplot2)
crashes_dates = cj %>% 
  # st_set_geometry(NULL) %>% # only needed (with sf loaded) if crashes were converted to sf
  group_by(date) %>% 
  summarise(
    walking = sum(Pedestrian),
    cycling = sum(Cyclist),
    passenger = sum(`Car occupant`)
    ) %>% 
  tidyr::gather(mode, casualties, -date)
# finally
p = ggplot(crashes_dates, aes(date, casualties)) +
  geom_smooth(aes(colour = mode), method = "loess") +
  ylab("Casualties per day")
# p

If you would like to know how visualization is used in everyday work, feel free to explore these scripts, which are part of our “SaferActive” project; our colleagues there are looking for patterns in road safety changes based on road safety interventions and cycling uptake in the UK.

Reading List

Watching List

Bar vis: John Stasko, “The Value of Visualization… and Why Interaction Matters”, EuroVis Capstone Talk. https://vimeo.com/98986594

More?

References

Heer, Jeffrey, Michael Bostock, and Vadim Ogievetsky. 2010. “A Tour Through the Visualization Zoo.” *Communications of the ACM* 53 (6): 59–67.
Lovelace, Robin, Jakub Nowosad, and Jannes Muenchow. 2019. *Geocomputation with R*. CRC Press.
Munzner, Tamara. 2014. *Visualization Analysis and Design*. CRC Press.
Shneiderman, Ben. 1996. “The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations.” In *Proceedings 1996 IEEE Symposium on Visual Languages*, 336–43. IEEE.
Spitzer, Michaela, Jan Wildenhain, Juri Rappsilber, and Mike Tyers. 2014. “BoxPlotR: A Web Tool for Generation of Box Plots.” *Nature Methods* 11 (2): 121.
Wickham, Hadley. 2016. *ggplot2: Elegant Graphics for Data Analysis*. Springer.