--- title: "Plotting multi-dimensional data" sub-title: "07-lab" author: "Alison Hill" output: html_document: highlight: pygments keep_md: yes number_sections: yes smart: no theme: journal toc: yes toc_depth: 2 toc_float: yes pdf_document: toc: yes subtitle: Lab 5 always_allow_html: yes --- ```{r setup, include = FALSE} knitr::opts_chunk$set(error = TRUE, comment = NA, warnings = FALSE, messages = FALSE) ``` ```{r load_packages, include = FALSE} suppressWarnings(suppressMessages(library(dplyr))) suppressWarnings(suppressMessages(library(ggplot2))) suppressWarnings(suppressMessages(library(tidyr))) suppressWarnings(suppressMessages(library(viridis))) suppressWarnings(suppressMessages(library(GGally))) suppressWarnings(suppressMessages(library(ggfortify))) suppressWarnings(suppressMessages(library(cowplot))) theme_set(theme_gray()) # switch to default ggplot2 theme for good ``` For the first two plots, we'll use the `iris` dataset: > This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica. `iris` is a data frame with 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. # Scatterplot matrix ## Step 1: Load and say HLO to the data ```{r} glimpse(iris) ``` ## Step 2: Install and load the package ```{r eval = FALSE} install.packages(GGally) library(GGally) ``` `ggpairs` is a special form of a ggmatrix that produces a pairwise comparison of multivariate data. By default, `ggpairs` provides two different comparisons of each pair of columns and displays either the density or count of the respective variable along the diagonal. With different parameter settings, the diagonal can be replaced with the axis values and variable labels. There are many hidden features within `ggpairs`. ## Step 3: Plot the data ```{r} ggpairs(iris, columns = c(1:4)) ggpairs(iris, columns = c("Sepal.Length", "Sepal.Width"), columnLabels = c("Length", "Width")) ``` Aesthetics can be applied to every subplot with the mapping parameter. ```{r} ggpairs(iris, columns = c(1:4), mapping = aes(color = Species)) ``` Matrix sections: http://ggobi.github.io/ggally/#matrix_sections ```{r} ggpairs(iris, columns = c(1:4), mapping = aes(color = Species), diag = list(continuous = wrap("densityDiag")), upper = list(continuous = wrap("cor", size = 5)) ) ``` # Parallel coordinates ## Step 1: Load and say HLO to the data ```{r} glimpse(iris) ``` ## Step 2: Prepare the data This type of plot in `ggplot` requires a tidy dataframe, which `iris` is not because each of our four flower attributes are stored in separate columns- we need them in one column. To tidy it up, we'll use the `tidyr::gather` function. ```{r} tidy_iris <- iris %>% gather(key = flower_att, value = measurement, Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ``` ```{r} head(tidy_iris) ``` ```{r eval = FALSE} # less typing tidy_iris <- iris %>% gather(key = flower_att, value = measurement, -Species) # less typing tidy_iris <- iris %>% gather(key = flower_att, value = measurement, Sepal.Length:Petal.Width) ``` ## Step 3: Plot the data Now, `measurement` will be our y-axis, and we have one variable `flower_att` that we want to be represented along the x-axis. We would like to color by `Species` too. ```{r} ggplot(tidy_iris, aes(x = flower_att, y = measurement, color = Species)) + geom_line() ``` Maybe a `group` aesthetic will help? ```{r} ggplot(tidy_iris, aes(x = flower_att, y = measurement, color = Species, group = Species)) + geom_line() ``` Wah wah. That is disappointing. The problem is `ggplot` doesn't know that my first Petal.Length value is for the same "row" as my first Petal.Width value, etc. We need to number each of the rows, but by *each* flower attribute ## Step 4: Prepare the data, again ```{r} tidy_iris <- iris %>% gather(key = flower_att, value = measurement, -Species) %>% # tidyr group_by(flower_att) %>% # dplyr mutate(rownum = seq_len(n())) # dplyr ``` ```{r} head(tidy_iris) ``` ## Step 5: Plot the data, again ```{r} ggplot(tidy_iris, aes(x = flower_att, y = measurement, colour = Species, group = rownum)) + geom_line(alpha =.5) ``` # Heatmap http://flowingdata.com/2010/01/21/how-to-make-a-heatmap-a-quick-and-easy-solution/ ## Step 1: Load and say HLO to the data ```{r} nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv") ``` ```{r} glimpse(nba) head(nba) ``` ## Step 2: Sort the data The data is sorted by points per game (`PTS`), greatest to least. Let’s make it the other way around so that it’s least to greatest. ```{r} nba$Name <- factor(nba$Name, levels = nba$Name[order(nba$PTS)]) # base R way head(nba) ``` ## Step 3: Prepare the data This data is in wide format and it needs to be tidy- notice a pattern here? Also, the game statistics have very different ranges, so to make them comparable all the individual statistics need to be rescaled- we'll use z-scores (mean = 0, sd = 1). ```{r} tidy_nba <- nba %>% gather(key = player_stat, value = player_value, -Name) %>% group_by(player_stat) %>% mutate(player_value_z = (player_value - mean(player_value))/sd(player_value)) head(tidy_nba) ``` ## Step 4: Start plotting First draft heatmap plot ```{r} ggplot(tidy_nba, aes(x = player_stat, y = Name)) + geom_tile(aes(fill = player_value_z)) ``` Change up the colors ```{r} ggplot(tidy_nba, aes(x = player_stat, y = Name)) + geom_tile(aes(fill = player_value_z)) + scale_fill_viridis(begin = 1, end = 0) ``` Tweaking all the things ```{r} nba_heat <- ggplot(tidy_nba, aes(x = player_stat, y = Name)) + geom_tile(aes(fill = player_value_z), colour = "white") + scale_fill_viridis(guide = FALSE, begin = 1, end = 0, option = "magma") + scale_x_discrete(expand = c(0, 0), name = "") + scale_y_discrete(expand = c(0, 0), name = "") + theme(axis.text = element_text(size = 6), axis.text.x = element_text(angle = 310, vjust = .5)) nba_heat ``` This ensures the plot will have a 1:1 aspect ratio (i.e. `geom_tile()`–which draws rectangles–will draw nice squares). ```{r} nba_heat + coord_equal() ``` # Bubble plot ## Step 1: Load and say HLO to the data ```{r} library(gapminder) glimpse(gapminder) head(country_colors) ``` ## Step 2: Prepare the data Let's get rid of the Oceania continent for this visualization ```{r} gm <- gapminder %>% filter(!continent == "Oceania") ``` ## Step 3: Plot the data Simple scatterplot with facetting ```{r} gm_scatter <- ggplot(subset(gm, year == 2007), aes(x = gdpPercap, y = lifeExp, fill = country)) + geom_point(size = 4, pch = 21, color = 'grey20', show.legend = FALSE, alpha = .6) + facet_wrap(~ continent) + scale_fill_manual(values = country_colors) + theme(strip.text = element_text(size = rel(1.1))) gm_scatter + scale_x_log10(limits = c(150, 115000)) ``` Now map size as an aesthetic... ```{r} gm_bubble1 <- ggplot(subset(gm, year == 2007), aes(x = gdpPercap, y = lifeExp, fill = country)) + geom_point(aes(size = sqrt(pop/pi)), pch = 21, color = 'grey20', show.legend = FALSE, alpha = .6) + facet_wrap(~ continent) + scale_fill_manual(values = country_colors) + theme(strip.text = element_text(size = rel(1.1))) + scale_x_log10(limits = c(150, 115000)) gm_bubble1 ``` Tweaking ```{r} gm_bubble2 <- ggplot(subset(gm, year == 2007), aes(x = gdpPercap, y = lifeExp, fill = country)) + ylim(c(16, 100)) + geom_point(aes(size = sqrt(pop/pi)), pch = 21, color = 'grey20', show.legend = FALSE, alpha = .6) + scale_size_continuous(range = c(1, 20)) + facet_wrap(~ continent) + coord_fixed(ratio = 1/43) + scale_fill_manual(values = country_colors) + theme(strip.text = element_text(size = rel(1.1))) + scale_x_log10(limits = c(150, 115000)) gm_bubble2 ``` Similar plot with lines ```{r} ggplot(gm, aes(x = year, y = lifeExp, group = country, colour = country)) + geom_line(lwd = 1, show.legend = FALSE) + facet_wrap(~ continent) + scale_color_manual(values = country_colors) + theme(strip.text = element_text(size = rel(1.1))) ``` # Principal Components Analysis ## Step 1: Back to the iris data ## Step 2: Install and load packages ```{r eval= FALSE} install.packages("cowplot") library(cowplot) # required to arrange multiple plots in a grid ``` ## Step 3: EDA plot If we want to find out which characteristics are most distinguishing between iris plants, we have to make many individual plots and hope we can see distinguishing patterns: ```{r} p1 <- ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species)) + geom_point() p2 <- ggplot(iris, aes(x=Petal.Length, y=Petal.Width, color=Species)) + geom_point() p3 <- ggplot(iris, aes(x=Sepal.Length, y=Petal.Length, color=Species)) + geom_point() p4 <- ggplot(iris, aes(x=Sepal.Width, y=Petal.Width, color=Species)) + geom_point() plot_grid(p1, p2, p3, p4, labels = "AUTO") ``` In this particular case, it seems that petal length and petal width are most distinct for the three species. Principal Components Analysis (PCA) allows us to systematically discover such patterns, and it works also when there are many more variables than just four. ## Step 4: Do the PCA The basic steps in PCA are to (i) prepare a data frame that holds only the numerical columns of interest, (ii) scale the data to 0 mean and unit variance, and (iii) do the PCA with the function prcomp(): [see also the Setosa Intro to PCA](http://setosa.io/ev/principal-component-analysis/) ```{r} pca <- iris %>% # store result as `pca` select(-Species) %>% # remove Species column scale() %>% # scale to 0 mean and unit variance prcomp() # do PCA ``` ```{r} # now display the results from the PCA analysis pca ``` The main results from PCA are the standard deviations and the rotation matrix. Squares of the std. devs represent the % variance explained by each PC. The rotation matrix tells us which variables contribute to which PCs. For example, Sepal.Width contributes little to PC1 but makes up much of PC2. ```{r} pca$rotation ``` We can also recover each original observation expressed in PC coordinates; the rotated data are available as pca$x: ```{r} head(pca$x) ``` ```{r} # add species information back into PCA data pca_data <- data.frame(pca$x, Species=iris$Species) head(pca_data) ``` ## Step 5: Plot the PCA results Let’s plot the data in the principal components. Specifically, we will plot PC2 vs. PC1. ```{r} ggplot(pca_data, aes(x=PC1, y=PC2, color=Species)) + geom_point() ``` In the PC2 vs PC1 plot, versicolor and virginica are much better separated. [PCA in `ggplot2` and `ggfortify`](https://cran.r-project.org/web/packages/ggfortify/vignettes/plot_pca.html) ```{r eval = FALSE} install.packages("ggfortify") library(ggfortify) ``` ```{r} autoplot(pca) ``` ```{r} autoplot(pca, data = iris, colour = 'Species') autoplot(pca, data = iris, colour = 'Species', label = TRUE, label.size = 3) autoplot(pca, data = iris, colour = 'Species', shape = FALSE, label.size = 3) autoplot(pca, data = iris, colour = 'Species', loadings = TRUE) autoplot(pca, data = iris, colour = 'Species', loadings = TRUE, loadings.colour = 'blue', loadings.label = TRUE, loadings.label.size = 3) ```