For the first two plots, we'll use the iris
dataset:
This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
iris
is a data frame with 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.
glimpse(iris)
Observations: 150
Variables: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9,...
$ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1,...
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,...
$ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1,...
$ Species <fctr> setosa, setosa, setosa, setosa, setosa, setosa, ...
install.packages(GGally)
library(GGally)
ggpairs
is a special form of a ggmatrix that produces a pairwise comparison of multivariate data. By default, ggpairs
provides two different comparisons of each pair of columns and displays either the density or count of the respective variable along the diagonal. With different parameter settings, the diagonal can be replaced with the axis values and variable labels.
There are many hidden features within ggpairs
.
ggpairs(iris, columns = c(1:4))
ggpairs(iris, columns = c("Sepal.Length", "Sepal.Width"), columnLabels = c("Length", "Width"))
Aesthetics can be applied to every subplot with the mapping parameter.
ggpairs(iris, columns = c(1:4), mapping = aes(color = Species))
Matrix sections: http://ggobi.github.io/ggally/#matrix_sections
ggpairs(iris,
columns = c(1:4), mapping = aes(color = Species),
diag = list(continuous = wrap("densityDiag")),
upper = list(continuous = wrap("cor", size = 5))
)
glimpse(iris)
Observations: 150
Variables: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9,...
$ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1,...
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,...
$ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1,...
$ Species <fctr> setosa, setosa, setosa, setosa, setosa, setosa, ...
This type of plot in ggplot
requires a tidy dataframe, which iris
is not because each of our four flower attributes are stored in separate columns- we need them in one column. To tidy it up, we'll use the tidyr::gather
function.
tidy_iris <- iris %>%
gather(key = flower_att, value = measurement,
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
head(tidy_iris)
Species flower_att measurement
1 setosa Sepal.Length 5.1
2 setosa Sepal.Length 4.9
3 setosa Sepal.Length 4.7
4 setosa Sepal.Length 4.6
5 setosa Sepal.Length 5.0
6 setosa Sepal.Length 5.4
# less typing
tidy_iris <- iris %>%
gather(key = flower_att, value = measurement, -Species)
# less typing
tidy_iris <- iris %>%
gather(key = flower_att, value = measurement, Sepal.Length:Petal.Width)
Now, measurement
will be our y-axis, and we have one variable flower_att
that we want to be represented along the x-axis. We would like to color by Species
too.
ggplot(tidy_iris, aes(x = flower_att, y = measurement, color = Species)) +
geom_line()
Maybe a group
aesthetic will help?
ggplot(tidy_iris, aes(x = flower_att, y = measurement, color = Species, group = Species)) +
geom_line()
Wah wah. That is disappointing. The problem is ggplot
doesn't know that my first Petal.Length value is for the same "row" as my first Petal.Width value, etc. We need to number each of the rows, but by each flower attribute
tidy_iris <- iris %>%
gather(key = flower_att, value = measurement, -Species) %>% # tidyr
group_by(flower_att) %>% # dplyr
mutate(rownum = seq_len(n())) # dplyr
head(tidy_iris)
Source: local data frame [6 x 4]
Groups: flower_att [1]
Species flower_att measurement rownum
<fctr> <chr> <dbl> <int>
1 setosa Sepal.Length 5.1 1
2 setosa Sepal.Length 4.9 2
3 setosa Sepal.Length 4.7 3
4 setosa Sepal.Length 4.6 4
5 setosa Sepal.Length 5.0 5
6 setosa Sepal.Length 5.4 6
ggplot(tidy_iris, aes(x = flower_att, y = measurement, colour = Species, group = rownum)) +
geom_line(alpha =.5)
http://flowingdata.com/2010/01/21/how-to-make-a-heatmap-a-quick-and-easy-solution/
nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
glimpse(nba)
Observations: 50
Variables: 21
$ Name <fctr> Dwyane Wade , LeBron James , Kobe Bryant , Dirk Nowitzki...
$ G <int> 79, 81, 82, 81, 67, 74, 51, 50, 78, 66, 77, 78, 81, 72, 5...
$ MIN <dbl> 38.6, 37.7, 36.2, 37.7, 36.2, 39.0, 38.2, 36.6, 38.5, 34....
$ PTS <dbl> 30.2, 28.4, 26.8, 25.9, 25.8, 25.3, 24.6, 23.1, 22.8, 22....
$ FGM <dbl> 10.8, 9.7, 9.8, 9.6, 8.5, 8.9, 6.7, 9.7, 8.1, 8.1, 8.0, 8...
$ FGA <dbl> 22.0, 19.9, 20.9, 20.0, 19.1, 18.8, 15.9, 19.5, 16.1, 18....
$ FGP <dbl> 0.491, 0.489, 0.467, 0.479, 0.447, 0.476, 0.420, 0.497, 0...
$ FTM <dbl> 7.5, 7.3, 5.9, 6.0, 6.0, 6.1, 9.0, 3.7, 5.8, 5.6, 6.5, 5....
$ FTA <dbl> 9.8, 9.4, 6.9, 6.7, 6.9, 7.1, 10.3, 5.0, 6.7, 7.1, 8.0, 6...
$ FTP <dbl> 0.765, 0.780, 0.856, 0.890, 0.878, 0.863, 0.867, 0.738, 0...
$ X3PM <dbl> 1.1, 1.6, 1.4, 0.8, 2.7, 1.3, 2.3, 0.0, 0.8, 1.0, 0.2, 1....
$ X3PA <dbl> 3.5, 4.7, 4.1, 2.1, 6.7, 3.1, 5.4, 0.1, 2.3, 2.6, 0.6, 2....
$ X3PP <dbl> 0.317, 0.344, 0.351, 0.359, 0.404, 0.422, 0.415, 0.000, 0...
$ ORB <dbl> 1.1, 1.3, 1.1, 1.1, 0.7, 1.0, 0.6, 3.4, 0.9, 1.6, 2.8, 1....
$ DRB <dbl> 3.9, 6.3, 4.1, 7.3, 4.4, 5.5, 3.0, 7.5, 4.7, 5.2, 7.2, 3....
$ TRB <dbl> 5.0, 7.6, 5.2, 8.4, 5.1, 6.5, 3.6, 11.0, 5.5, 6.8, 10.0, ...
$ AST <dbl> 7.5, 7.2, 4.9, 2.4, 2.7, 2.8, 2.7, 1.6, 11.0, 3.4, 2.5, 5...
$ STL <dbl> 2.2, 1.7, 1.5, 0.8, 1.0, 1.3, 1.2, 0.8, 2.8, 1.1, 0.9, 1....
$ BLK <dbl> 1.3, 1.1, 0.5, 0.8, 1.4, 0.7, 0.2, 1.7, 0.1, 0.4, 1.0, 0....
$ TO <dbl> 3.4, 3.0, 2.6, 1.9, 2.5, 3.0, 2.9, 1.8, 3.0, 3.0, 2.3, 1....
$ PF <dbl> 2.3, 1.7, 2.3, 2.2, 3.1, 1.8, 2.3, 2.8, 2.7, 3.0, 2.5, 1....
head(nba)
Name G MIN PTS FGM FGA FGP FTM FTA FTP X3PM X3PA
1 Dwyane Wade 79 38.6 30.2 10.8 22.0 0.491 7.5 9.8 0.765 1.1 3.5
2 LeBron James 81 37.7 28.4 9.7 19.9 0.489 7.3 9.4 0.780 1.6 4.7
3 Kobe Bryant 82 36.2 26.8 9.8 20.9 0.467 5.9 6.9 0.856 1.4 4.1
4 Dirk Nowitzki 81 37.7 25.9 9.6 20.0 0.479 6.0 6.7 0.890 0.8 2.1
5 Danny Granger 67 36.2 25.8 8.5 19.1 0.447 6.0 6.9 0.878 2.7 6.7
6 Kevin Durant 74 39.0 25.3 8.9 18.8 0.476 6.1 7.1 0.863 1.3 3.1
X3PP ORB DRB TRB AST STL BLK TO PF
1 0.317 1.1 3.9 5.0 7.5 2.2 1.3 3.4 2.3
2 0.344 1.3 6.3 7.6 7.2 1.7 1.1 3.0 1.7
3 0.351 1.1 4.1 5.2 4.9 1.5 0.5 2.6 2.3
4 0.359 1.1 7.3 8.4 2.4 0.8 0.8 1.9 2.2
5 0.404 0.7 4.4 5.1 2.7 1.0 1.4 2.5 3.1
6 0.422 1.0 5.5 6.5 2.8 1.3 0.7 3.0 1.8
The data is sorted by points per game (PTS
), greatest to least. Let’s make it the other way around so that it’s least to greatest.
nba$Name <- factor(nba$Name, levels = nba$Name[order(nba$PTS)]) # base R way
head(nba)
Name G MIN PTS FGM FGA FGP FTM FTA FTP X3PM X3PA
1 Dwyane Wade 79 38.6 30.2 10.8 22.0 0.491 7.5 9.8 0.765 1.1 3.5
2 LeBron James 81 37.7 28.4 9.7 19.9 0.489 7.3 9.4 0.780 1.6 4.7
3 Kobe Bryant 82 36.2 26.8 9.8 20.9 0.467 5.9 6.9 0.856 1.4 4.1
4 Dirk Nowitzki 81 37.7 25.9 9.6 20.0 0.479 6.0 6.7 0.890 0.8 2.1
5 Danny Granger 67 36.2 25.8 8.5 19.1 0.447 6.0 6.9 0.878 2.7 6.7
6 Kevin Durant 74 39.0 25.3 8.9 18.8 0.476 6.1 7.1 0.863 1.3 3.1
X3PP ORB DRB TRB AST STL BLK TO PF
1 0.317 1.1 3.9 5.0 7.5 2.2 1.3 3.4 2.3
2 0.344 1.3 6.3 7.6 7.2 1.7 1.1 3.0 1.7
3 0.351 1.1 4.1 5.2 4.9 1.5 0.5 2.6 2.3
4 0.359 1.1 7.3 8.4 2.4 0.8 0.8 1.9 2.2
5 0.404 0.7 4.4 5.1 2.7 1.0 1.4 2.5 3.1
6 0.422 1.0 5.5 6.5 2.8 1.3 0.7 3.0 1.8
This data is in wide format and it needs to be tidy- notice a pattern here? Also, the game statistics have very different ranges, so to make them comparable all the individual statistics need to be rescaled- we'll use z-scores (mean = 0, sd = 1).
tidy_nba <- nba %>%
gather(key = player_stat, value = player_value, -Name) %>%
group_by(player_stat) %>%
mutate(player_value_z = (player_value - mean(player_value))/sd(player_value))
head(tidy_nba)
Source: local data frame [6 x 4]
Groups: player_stat [1]
Name player_stat player_value player_value_z
<fctr> <chr> <dbl> <dbl>
1 Dwyane Wade G 79 0.6179300
2 LeBron James G 81 0.7693834
3 Kobe Bryant G 82 0.8451101
4 Dirk Nowitzki G 81 0.7693834
5 Danny Granger G 67 -0.2907906
6 Kevin Durant G 74 0.2392964
First draft heatmap plot
ggplot(tidy_nba, aes(x = player_stat, y = Name)) +
geom_tile(aes(fill = player_value_z))
Change up the colors
ggplot(tidy_nba, aes(x = player_stat, y = Name)) +
geom_tile(aes(fill = player_value_z)) +
scale_fill_viridis(begin = 1, end = 0)
Tweaking all the things
nba_heat <- ggplot(tidy_nba, aes(x = player_stat, y = Name)) +
geom_tile(aes(fill = player_value_z), colour = "white") +
scale_fill_viridis(guide = FALSE, begin = 1, end = 0, option = "magma") +
scale_x_discrete(expand = c(0, 0), name = "") +
scale_y_discrete(expand = c(0, 0), name = "") +
theme(axis.text = element_text(size = 6),
axis.text.x = element_text(angle = 310, vjust = .5))
nba_heat
This ensures the plot will have a 1:1 aspect ratio (i.e. geom_tile()
–which draws rectangles–will draw nice squares).
nba_heat + coord_equal()
library(gapminder)
glimpse(gapminder)
Observations: 1,704
Variables: 6
$ country <fctr> Afghanistan, Afghanistan, Afghanistan, Afghanistan,...
$ continent <fctr> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
$ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
$ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...
head(country_colors)
Nigeria Egypt Ethiopia Congo, Dem. Rep.
"#7F3B08" "#833D07" "#873F07" "#8B4107"
South Africa Sudan
"#8F4407" "#934607"
Let's get rid of the Oceania continent for this visualization
gm <- gapminder %>%
filter(!continent == "Oceania")
Simple scatterplot with facetting
gm_scatter <- ggplot(subset(gm, year == 2007),
aes(x = gdpPercap, y = lifeExp, fill = country)) +
geom_point(size = 4, pch = 21, color = 'grey20', show.legend = FALSE, alpha = .6) +
facet_wrap(~ continent) +
scale_fill_manual(values = country_colors) +
theme(strip.text = element_text(size = rel(1.1)))
gm_scatter + scale_x_log10(limits = c(150, 115000))
Now map size as an aesthetic...
gm_bubble1 <- ggplot(subset(gm, year == 2007),
aes(x = gdpPercap, y = lifeExp, fill = country)) +
geom_point(aes(size = sqrt(pop/pi)), pch = 21,
color = 'grey20', show.legend = FALSE, alpha = .6) +
facet_wrap(~ continent) +
scale_fill_manual(values = country_colors) +
theme(strip.text = element_text(size = rel(1.1))) +
scale_x_log10(limits = c(150, 115000))
gm_bubble1
Tweaking
gm_bubble2 <- ggplot(subset(gm, year == 2007), aes(x = gdpPercap, y = lifeExp, fill = country)) +
ylim(c(16, 100)) +
geom_point(aes(size = sqrt(pop/pi)), pch = 21,
color = 'grey20', show.legend = FALSE, alpha = .6) +
scale_size_continuous(range = c(1, 20)) +
facet_wrap(~ continent) +
coord_fixed(ratio = 1/43) +
scale_fill_manual(values = country_colors) +
theme(strip.text = element_text(size = rel(1.1))) +
scale_x_log10(limits = c(150, 115000))
gm_bubble2
Similar plot with lines
ggplot(gm, aes(x = year, y = lifeExp, group = country, colour = country)) +
geom_line(lwd = 1, show.legend = FALSE) +
facet_wrap(~ continent) +
scale_color_manual(values = country_colors) +
theme(strip.text = element_text(size = rel(1.1)))
install.packages("cowplot")
library(cowplot) # required to arrange multiple plots in a grid
If we want to find out which characteristics are most distinguishing between iris plants, we have to make many individual plots and hope we can see distinguishing patterns:
p1 <- ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species)) +
geom_point()
p2 <- ggplot(iris, aes(x=Petal.Length, y=Petal.Width, color=Species)) +
geom_point()
p3 <- ggplot(iris, aes(x=Sepal.Length, y=Petal.Length, color=Species)) +
geom_point()
p4 <- ggplot(iris, aes(x=Sepal.Width, y=Petal.Width, color=Species)) +
geom_point()
plot_grid(p1, p2, p3, p4, labels = "AUTO")
Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
instead
In this particular case, it seems that petal length and petal width are most distinct for the three species. Principal Components Analysis (PCA) allows us to systematically discover such patterns, and it works also when there are many more variables than just four.
The basic steps in PCA are to (i) prepare a data frame that holds only the numerical columns of interest, (ii) scale the data to 0 mean and unit variance, and (iii) do the PCA with the function prcomp():
see also the Setosa Intro to PCA
pca <- iris %>% # store result as `pca`
select(-Species) %>% # remove Species column
scale() %>% # scale to 0 mean and unit variance
prcomp() # do PCA
# now display the results from the PCA analysis
pca
Standard deviations:
[1] 1.7083611 0.9560494 0.3830886 0.1439265
Rotation:
PC1 PC2 PC3 PC4
Sepal.Length 0.5210659 -0.37741762 0.7195664 0.2612863
Sepal.Width -0.2693474 -0.92329566 -0.2443818 -0.1235096
Petal.Length 0.5804131 -0.02449161 -0.1421264 -0.8014492
Petal.Width 0.5648565 -0.06694199 -0.6342727 0.5235971
The main results from PCA are the standard deviations and the rotation matrix. Squares of the std. devs represent the % variance explained by each PC. The rotation matrix tells us which variables contribute to which PCs. For example, Sepal.Width contributes little to PC1 but makes up much of PC2.
pca$rotation
PC1 PC2 PC3 PC4
Sepal.Length 0.5210659 -0.37741762 0.7195664 0.2612863
Sepal.Width -0.2693474 -0.92329566 -0.2443818 -0.1235096
Petal.Length 0.5804131 -0.02449161 -0.1421264 -0.8014492
Petal.Width 0.5648565 -0.06694199 -0.6342727 0.5235971
We can also recover each original observation expressed in PC coordinates; the rotated data are available as pca$x:
head(pca$x)
PC1 PC2 PC3 PC4
[1,] -2.257141 -0.4784238 0.12727962 0.024087508
[2,] -2.074013 0.6718827 0.23382552 0.102662845
[3,] -2.356335 0.3407664 -0.04405390 0.028282305
[4,] -2.291707 0.5953999 -0.09098530 -0.065735340
[5,] -2.381863 -0.6446757 -0.01568565 -0.035802870
[6,] -2.068701 -1.4842053 -0.02687825 0.006586116
# add species information back into PCA data
pca_data <- data.frame(pca$x, Species=iris$Species)
head(pca_data)
PC1 PC2 PC3 PC4 Species
1 -2.257141 -0.4784238 0.12727962 0.024087508 setosa
2 -2.074013 0.6718827 0.23382552 0.102662845 setosa
3 -2.356335 0.3407664 -0.04405390 0.028282305 setosa
4 -2.291707 0.5953999 -0.09098530 -0.065735340 setosa
5 -2.381863 -0.6446757 -0.01568565 -0.035802870 setosa
6 -2.068701 -1.4842053 -0.02687825 0.006586116 setosa
Let’s plot the data in the principal components. Specifically, we will plot PC2 vs. PC1.
ggplot(pca_data, aes(x=PC1, y=PC2, color=Species)) +
geom_point()
In the PC2 vs PC1 plot, versicolor and virginica are much better separated.
install.packages("ggfortify")
library(ggfortify)
autoplot(pca)
autoplot(pca, data = iris, colour = 'Species')
autoplot(pca, data = iris, colour = 'Species', label = TRUE, label.size = 3)
autoplot(pca, data = iris, colour = 'Species', shape = FALSE, label.size = 3)
autoplot(pca, data = iris, colour = 'Species', loadings = TRUE)
autoplot(pca, data = iris, colour = 'Species',
loadings = TRUE, loadings.colour = 'blue',
loadings.label = TRUE, loadings.label.size = 3)