For the first two plots, we'll use the iris dataset:

This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

iris is a data frame with 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.

1 Scatterplot matrix

1.1 Step 1: Load and say HLO to the data

glimpse(iris)
Observations: 150
Variables: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9,...
$ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1,...
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,...
$ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1,...
$ Species      <fctr> setosa, setosa, setosa, setosa, setosa, setosa, ...

1.2 Step 2: Install and load the package

install.packages(GGally)
library(GGally)

ggpairs is a special form of a ggmatrix that produces a pairwise comparison of multivariate data. By default, ggpairs provides two different comparisons of each pair of columns and displays either the density or count of the respective variable along the diagonal. With different parameter settings, the diagonal can be replaced with the axis values and variable labels.

There are many hidden features within ggpairs.

1.3 Step 3: Plot the data

ggpairs(iris, columns = c(1:4))

ggpairs(iris, columns = c("Sepal.Length", "Sepal.Width"), columnLabels = c("Length", "Width"))

Aesthetics can be applied to every subplot with the mapping parameter.

ggpairs(iris, columns = c(1:4), mapping = aes(color = Species))

Matrix sections: http://ggobi.github.io/ggally/#matrix_sections

ggpairs(iris,
  columns = c(1:4), mapping = aes(color = Species),
  diag = list(continuous = wrap("densityDiag")),
  upper = list(continuous = wrap("cor", size = 5))
)

2 Parallel coordinates

2.1 Step 1: Load and say HLO to the data

glimpse(iris)
Observations: 150
Variables: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9,...
$ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1,...
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,...
$ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1,...
$ Species      <fctr> setosa, setosa, setosa, setosa, setosa, setosa, ...

2.2 Step 2: Prepare the data

This type of plot in ggplot requires a tidy dataframe, which iris is not because each of our four flower attributes are stored in separate columns- we need them in one column. To tidy it up, we'll use the tidyr::gather function.

tidy_iris <- iris %>% 
  gather(key = flower_att, value = measurement,
       Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
head(tidy_iris)
  Species   flower_att measurement
1  setosa Sepal.Length         5.1
2  setosa Sepal.Length         4.9
3  setosa Sepal.Length         4.7
4  setosa Sepal.Length         4.6
5  setosa Sepal.Length         5.0
6  setosa Sepal.Length         5.4
# less typing
tidy_iris <- iris %>% 
  gather(key = flower_att, value = measurement, -Species)
# less typing
tidy_iris <- iris %>% 
  gather(key = flower_att, value = measurement, Sepal.Length:Petal.Width)

2.3 Step 3: Plot the data

Now, measurement will be our y-axis, and we have one variable flower_att that we want to be represented along the x-axis. We would like to color by Species too.

ggplot(tidy_iris, aes(x = flower_att, y = measurement, color = Species)) +
  geom_line()

Maybe a group aesthetic will help?

ggplot(tidy_iris, aes(x = flower_att, y = measurement, color = Species, group = Species)) +
  geom_line()

Wah wah. That is disappointing. The problem is ggplot doesn't know that my first Petal.Length value is for the same "row" as my first Petal.Width value, etc. We need to number each of the rows, but by each flower attribute

2.4 Step 4: Prepare the data, again

tidy_iris <- iris %>% 
  gather(key = flower_att, value = measurement, -Species) %>% # tidyr
  group_by(flower_att) %>% # dplyr
  mutate(rownum = seq_len(n())) # dplyr
head(tidy_iris)
Source: local data frame [6 x 4]
Groups: flower_att [1]

  Species   flower_att measurement rownum
   <fctr>        <chr>       <dbl>  <int>
1  setosa Sepal.Length         5.1      1
2  setosa Sepal.Length         4.9      2
3  setosa Sepal.Length         4.7      3
4  setosa Sepal.Length         4.6      4
5  setosa Sepal.Length         5.0      5
6  setosa Sepal.Length         5.4      6

2.5 Step 5: Plot the data, again

ggplot(tidy_iris, aes(x = flower_att, y = measurement, colour = Species, group = rownum)) +
  geom_line(alpha =.5)

3 Heatmap

http://flowingdata.com/2010/01/21/how-to-make-a-heatmap-a-quick-and-easy-solution/

3.1 Step 1: Load and say HLO to the data

nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
glimpse(nba)
Observations: 50
Variables: 21
$ Name <fctr> Dwyane Wade , LeBron James , Kobe Bryant , Dirk Nowitzki...
$ G    <int> 79, 81, 82, 81, 67, 74, 51, 50, 78, 66, 77, 78, 81, 72, 5...
$ MIN  <dbl> 38.6, 37.7, 36.2, 37.7, 36.2, 39.0, 38.2, 36.6, 38.5, 34....
$ PTS  <dbl> 30.2, 28.4, 26.8, 25.9, 25.8, 25.3, 24.6, 23.1, 22.8, 22....
$ FGM  <dbl> 10.8, 9.7, 9.8, 9.6, 8.5, 8.9, 6.7, 9.7, 8.1, 8.1, 8.0, 8...
$ FGA  <dbl> 22.0, 19.9, 20.9, 20.0, 19.1, 18.8, 15.9, 19.5, 16.1, 18....
$ FGP  <dbl> 0.491, 0.489, 0.467, 0.479, 0.447, 0.476, 0.420, 0.497, 0...
$ FTM  <dbl> 7.5, 7.3, 5.9, 6.0, 6.0, 6.1, 9.0, 3.7, 5.8, 5.6, 6.5, 5....
$ FTA  <dbl> 9.8, 9.4, 6.9, 6.7, 6.9, 7.1, 10.3, 5.0, 6.7, 7.1, 8.0, 6...
$ FTP  <dbl> 0.765, 0.780, 0.856, 0.890, 0.878, 0.863, 0.867, 0.738, 0...
$ X3PM <dbl> 1.1, 1.6, 1.4, 0.8, 2.7, 1.3, 2.3, 0.0, 0.8, 1.0, 0.2, 1....
$ X3PA <dbl> 3.5, 4.7, 4.1, 2.1, 6.7, 3.1, 5.4, 0.1, 2.3, 2.6, 0.6, 2....
$ X3PP <dbl> 0.317, 0.344, 0.351, 0.359, 0.404, 0.422, 0.415, 0.000, 0...
$ ORB  <dbl> 1.1, 1.3, 1.1, 1.1, 0.7, 1.0, 0.6, 3.4, 0.9, 1.6, 2.8, 1....
$ DRB  <dbl> 3.9, 6.3, 4.1, 7.3, 4.4, 5.5, 3.0, 7.5, 4.7, 5.2, 7.2, 3....
$ TRB  <dbl> 5.0, 7.6, 5.2, 8.4, 5.1, 6.5, 3.6, 11.0, 5.5, 6.8, 10.0, ...
$ AST  <dbl> 7.5, 7.2, 4.9, 2.4, 2.7, 2.8, 2.7, 1.6, 11.0, 3.4, 2.5, 5...
$ STL  <dbl> 2.2, 1.7, 1.5, 0.8, 1.0, 1.3, 1.2, 0.8, 2.8, 1.1, 0.9, 1....
$ BLK  <dbl> 1.3, 1.1, 0.5, 0.8, 1.4, 0.7, 0.2, 1.7, 0.1, 0.4, 1.0, 0....
$ TO   <dbl> 3.4, 3.0, 2.6, 1.9, 2.5, 3.0, 2.9, 1.8, 3.0, 3.0, 2.3, 1....
$ PF   <dbl> 2.3, 1.7, 2.3, 2.2, 3.1, 1.8, 2.3, 2.8, 2.7, 3.0, 2.5, 1....
head(nba)
            Name  G  MIN  PTS  FGM  FGA   FGP FTM FTA   FTP X3PM X3PA
1   Dwyane Wade  79 38.6 30.2 10.8 22.0 0.491 7.5 9.8 0.765  1.1  3.5
2  LeBron James  81 37.7 28.4  9.7 19.9 0.489 7.3 9.4 0.780  1.6  4.7
3   Kobe Bryant  82 36.2 26.8  9.8 20.9 0.467 5.9 6.9 0.856  1.4  4.1
4 Dirk Nowitzki  81 37.7 25.9  9.6 20.0 0.479 6.0 6.7 0.890  0.8  2.1
5 Danny Granger  67 36.2 25.8  8.5 19.1 0.447 6.0 6.9 0.878  2.7  6.7
6  Kevin Durant  74 39.0 25.3  8.9 18.8 0.476 6.1 7.1 0.863  1.3  3.1
   X3PP ORB DRB TRB AST STL BLK  TO  PF
1 0.317 1.1 3.9 5.0 7.5 2.2 1.3 3.4 2.3
2 0.344 1.3 6.3 7.6 7.2 1.7 1.1 3.0 1.7
3 0.351 1.1 4.1 5.2 4.9 1.5 0.5 2.6 2.3
4 0.359 1.1 7.3 8.4 2.4 0.8 0.8 1.9 2.2
5 0.404 0.7 4.4 5.1 2.7 1.0 1.4 2.5 3.1
6 0.422 1.0 5.5 6.5 2.8 1.3 0.7 3.0 1.8

3.2 Step 2: Sort the data

The data is sorted by points per game (PTS), greatest to least. Let’s make it the other way around so that it’s least to greatest.

nba$Name <- factor(nba$Name, levels = nba$Name[order(nba$PTS)]) # base R way
head(nba)
            Name  G  MIN  PTS  FGM  FGA   FGP FTM FTA   FTP X3PM X3PA
1   Dwyane Wade  79 38.6 30.2 10.8 22.0 0.491 7.5 9.8 0.765  1.1  3.5
2  LeBron James  81 37.7 28.4  9.7 19.9 0.489 7.3 9.4 0.780  1.6  4.7
3   Kobe Bryant  82 36.2 26.8  9.8 20.9 0.467 5.9 6.9 0.856  1.4  4.1
4 Dirk Nowitzki  81 37.7 25.9  9.6 20.0 0.479 6.0 6.7 0.890  0.8  2.1
5 Danny Granger  67 36.2 25.8  8.5 19.1 0.447 6.0 6.9 0.878  2.7  6.7
6  Kevin Durant  74 39.0 25.3  8.9 18.8 0.476 6.1 7.1 0.863  1.3  3.1
   X3PP ORB DRB TRB AST STL BLK  TO  PF
1 0.317 1.1 3.9 5.0 7.5 2.2 1.3 3.4 2.3
2 0.344 1.3 6.3 7.6 7.2 1.7 1.1 3.0 1.7
3 0.351 1.1 4.1 5.2 4.9 1.5 0.5 2.6 2.3
4 0.359 1.1 7.3 8.4 2.4 0.8 0.8 1.9 2.2
5 0.404 0.7 4.4 5.1 2.7 1.0 1.4 2.5 3.1
6 0.422 1.0 5.5 6.5 2.8 1.3 0.7 3.0 1.8

3.3 Step 3: Prepare the data

This data is in wide format and it needs to be tidy- notice a pattern here? Also, the game statistics have very different ranges, so to make them comparable all the individual statistics need to be rescaled- we'll use z-scores (mean = 0, sd = 1).

tidy_nba <- nba %>% 
  gather(key = player_stat, value = player_value, -Name) %>% 
  group_by(player_stat) %>% 
  mutate(player_value_z = (player_value - mean(player_value))/sd(player_value))
head(tidy_nba)
Source: local data frame [6 x 4]
Groups: player_stat [1]

            Name player_stat player_value player_value_z
          <fctr>       <chr>        <dbl>          <dbl>
1   Dwyane Wade            G           79      0.6179300
2  LeBron James            G           81      0.7693834
3   Kobe Bryant            G           82      0.8451101
4 Dirk Nowitzki            G           81      0.7693834
5 Danny Granger            G           67     -0.2907906
6  Kevin Durant            G           74      0.2392964

3.4 Step 4: Start plotting

First draft heatmap plot

ggplot(tidy_nba, aes(x = player_stat, y = Name)) +
  geom_tile(aes(fill = player_value_z))

Change up the colors

ggplot(tidy_nba, aes(x = player_stat, y = Name)) +
  geom_tile(aes(fill = player_value_z)) +
  scale_fill_viridis(begin = 1, end = 0) 

Tweaking all the things

nba_heat <- ggplot(tidy_nba, aes(x = player_stat, y = Name)) +
  geom_tile(aes(fill = player_value_z), colour = "white") +
  scale_fill_viridis(guide = FALSE, begin = 1, end = 0, option = "magma") +
  scale_x_discrete(expand = c(0, 0), name = "") +
  scale_y_discrete(expand = c(0, 0), name = "") +
  theme(axis.text = element_text(size = 6),
        axis.text.x = element_text(angle = 310, vjust = .5)) 
nba_heat

This ensures the plot will have a 1:1 aspect ratio (i.e. geom_tile()–which draws rectangles–will draw nice squares).

nba_heat + coord_equal()

4 Bubble plot

4.1 Step 1: Load and say HLO to the data

library(gapminder)
glimpse(gapminder)
Observations: 1,704
Variables: 6
$ country   <fctr> Afghanistan, Afghanistan, Afghanistan, Afghanistan,...
$ continent <fctr> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...
head(country_colors)
         Nigeria            Egypt         Ethiopia Congo, Dem. Rep. 
       "#7F3B08"        "#833D07"        "#873F07"        "#8B4107" 
    South Africa            Sudan 
       "#8F4407"        "#934607" 

4.2 Step 2: Prepare the data

Let's get rid of the Oceania continent for this visualization

gm <- gapminder %>% 
  filter(!continent == "Oceania")

4.3 Step 3: Plot the data

Simple scatterplot with facetting

gm_scatter <- ggplot(subset(gm, year == 2007), 
                    aes(x = gdpPercap, y = lifeExp, fill = country)) +
  geom_point(size = 4, pch = 21, color = 'grey20', show.legend = FALSE, alpha = .6) + 
  facet_wrap(~ continent) + 
  scale_fill_manual(values = country_colors) +
  theme(strip.text = element_text(size = rel(1.1)))

gm_scatter + scale_x_log10(limits = c(150, 115000)) 

Now map size as an aesthetic...

gm_bubble1 <- ggplot(subset(gm, year == 2007), 
                    aes(x = gdpPercap, y = lifeExp, fill = country)) +
  geom_point(aes(size = sqrt(pop/pi)), pch = 21, 
             color = 'grey20', show.legend = FALSE, alpha = .6) + 
  facet_wrap(~ continent) + 
  scale_fill_manual(values = country_colors) +
  theme(strip.text = element_text(size = rel(1.1))) +
  scale_x_log10(limits = c(150, 115000)) 

gm_bubble1

Tweaking

gm_bubble2 <- ggplot(subset(gm, year == 2007), aes(x = gdpPercap, y = lifeExp, fill = country)) +
  ylim(c(16, 100)) +
  geom_point(aes(size = sqrt(pop/pi)), pch = 21, 
             color = 'grey20', show.legend = FALSE, alpha = .6) + 
  scale_size_continuous(range = c(1, 20)) +
  facet_wrap(~ continent) + 
  coord_fixed(ratio = 1/43) +
  scale_fill_manual(values = country_colors) +
  theme(strip.text = element_text(size = rel(1.1))) +
  scale_x_log10(limits = c(150, 115000))
gm_bubble2

Similar plot with lines

ggplot(gm, aes(x = year, y = lifeExp, group = country, colour = country)) +
  geom_line(lwd = 1, show.legend = FALSE) + 
  facet_wrap(~ continent) +
  scale_color_manual(values = country_colors) +
  theme(strip.text = element_text(size = rel(1.1)))

5 Principal Components Analysis

5.1 Step 1: Back to the iris data

5.2 Step 2: Install and load packages

install.packages("cowplot")
library(cowplot) # required to arrange multiple plots in a grid

5.3 Step 3: EDA plot

If we want to find out which characteristics are most distinguishing between iris plants, we have to make many individual plots and hope we can see distinguishing patterns:

p1 <- ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species)) + 
  geom_point()
p2 <- ggplot(iris, aes(x=Petal.Length, y=Petal.Width, color=Species)) + 
  geom_point()
p3 <- ggplot(iris, aes(x=Sepal.Length, y=Petal.Length, color=Species)) + 
  geom_point()
p4 <- ggplot(iris, aes(x=Sepal.Width, y=Petal.Width, color=Species)) + 
  geom_point()
plot_grid(p1, p2, p3, p4, labels = "AUTO")
Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
instead

In this particular case, it seems that petal length and petal width are most distinct for the three species. Principal Components Analysis (PCA) allows us to systematically discover such patterns, and it works also when there are many more variables than just four.

5.4 Step 4: Do the PCA

The basic steps in PCA are to (i) prepare a data frame that holds only the numerical columns of interest, (ii) scale the data to 0 mean and unit variance, and (iii) do the PCA with the function prcomp():

see also the Setosa Intro to PCA

pca <- iris %>%               # store result as `pca`
  select(-Species) %>%        # remove Species column
  scale() %>%                 # scale to 0 mean and unit variance
  prcomp()                    # do PCA
# now display the results from the PCA analysis
pca
Standard deviations:
[1] 1.7083611 0.9560494 0.3830886 0.1439265

Rotation:
                    PC1         PC2        PC3        PC4
Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971

The main results from PCA are the standard deviations and the rotation matrix. Squares of the std. devs represent the % variance explained by each PC. The rotation matrix tells us which variables contribute to which PCs. For example, Sepal.Width contributes little to PC1 but makes up much of PC2.

pca$rotation
                    PC1         PC2        PC3        PC4
Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971

We can also recover each original observation expressed in PC coordinates; the rotated data are available as pca$x:

head(pca$x)
           PC1        PC2         PC3          PC4
[1,] -2.257141 -0.4784238  0.12727962  0.024087508
[2,] -2.074013  0.6718827  0.23382552  0.102662845
[3,] -2.356335  0.3407664 -0.04405390  0.028282305
[4,] -2.291707  0.5953999 -0.09098530 -0.065735340
[5,] -2.381863 -0.6446757 -0.01568565 -0.035802870
[6,] -2.068701 -1.4842053 -0.02687825  0.006586116
# add species information back into PCA data
pca_data <- data.frame(pca$x, Species=iris$Species)
head(pca_data)
        PC1        PC2         PC3          PC4 Species
1 -2.257141 -0.4784238  0.12727962  0.024087508  setosa
2 -2.074013  0.6718827  0.23382552  0.102662845  setosa
3 -2.356335  0.3407664 -0.04405390  0.028282305  setosa
4 -2.291707  0.5953999 -0.09098530 -0.065735340  setosa
5 -2.381863 -0.6446757 -0.01568565 -0.035802870  setosa
6 -2.068701 -1.4842053 -0.02687825  0.006586116  setosa

5.5 Step 5: Plot the PCA results

Let’s plot the data in the principal components. Specifically, we will plot PC2 vs. PC1.

ggplot(pca_data, aes(x=PC1, y=PC2, color=Species)) + 
  geom_point()

In the PC2 vs PC1 plot, versicolor and virginica are much better separated.

PCA in ggplot2 and ggfortify

install.packages("ggfortify")
library(ggfortify)
autoplot(pca)

autoplot(pca, data = iris, colour = 'Species')

autoplot(pca, data = iris, colour = 'Species', label = TRUE, label.size = 3)

autoplot(pca, data = iris, colour = 'Species', shape = FALSE, label.size = 3)

autoplot(pca, data = iris, colour = 'Species', loadings = TRUE)

autoplot(pca, data = iris, colour = 'Species',
         loadings = TRUE, loadings.colour = 'blue',
         loadings.label = TRUE, loadings.label.size = 3)