R is a programming language based off of S from Bell Labs. R is:
Install R from CRAN, the Comprehensive R Archive Network. Please choose a precompiled binary distribution for your operating system.
Launch R. You should see one console with a command line interpreter (>
).
>
and type x <- 2 + 2
, hit enter or return, then type x
, and hit enter/return again.[1] 4
prints to the screen, you have successfully installed R.RStudio provides a nice user interface for R, called an integrated development environment. RStudio includes:
>
),Install the free, open-source edition of RStudio: http://www.rstudio.com/products/rstudio/download/
Launch RStudio. You should get a window similar to the screenshot you see here, but yours will be empty.
>
and type x <- 2 * 2
, hit enter or return, then type x
, and hit enter/return again.[1] 4
prints to the screen, you have successfully installed RStudio.R Markdown documents are documents that combine text, R code, and R output, including figures. They are a great way to produce self-contained and documented statistical analyses. Create a new R Markdown file in RStudio and try to do some basic markdown editing. After you have made a change to the document, press "Knit HTML" in R Studio and see what kind of a result you get.
Try out basic R Markdown features, as described here. Write some text that is bold, and some that is in italics. Make a numbered list and a bulleted list. Make a nested list. Try the block-quote feature.
R code embedded in R chunks will be executed and the output will be shown.
x <- 5
y <- 7
z <- x * y
z
[1] 35
Play around with some basic R code. E.g., test that you can add comments to a code chunk by typing a #
followed by some random text and show that case matters in R code.
Next, use the code included in the blank R Markdown document you started with to plot the pressure
data set.
plot(pressure)
In order to use all of the great options offered within RStudio (in particular, knitting to PDF), you will need a full installation of TeX. If you don't already have TeX, this is a big download.
Code you can use in R will look like this:
## This is a comment
data <- c(1, 1, 4, 1, 1, 4, 1)
data
[1] 1 1 4 1 1 4 1
The first box shows something you can type into the R console. The second shows what you'd see as output if you did.
## Get the working directory
getwd()
[1] "/Users/hillali/Documents/Teaching/OHSU/data-vis/01-lab"
FALSE
and TRUE
, logical operators and programming words like Inf
, for
, else
, break
, function
, and words for special entities like NA
and NaN
.q
or c
or mean
, or pi
, or range
, or var
.letters # letters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
[18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
pi # pi
[1] 3.141593
iris # datasets in `datasets` package like iris
Source: local data frame [150 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
(dbl) (dbl) (dbl) (dbl) (fctr)
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
.. ... ... ... ... ...
Note that I am only showing the first few lines of the iris dataset here.
## This is a vector of numbers
my_numbers <- c(1, 2, 3, 1, 3, 5, 25)
my_numbers
[1] 1 2 3 1 3 5 25
summary(my_numbers)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.500 3.000 5.714 4.000 25.000
Classes:
?double
: "identical to numeric", but "double precision")a few others you may run into ?typeof
Depending on what type of object something is, you can extract bits of information from it.
class(my_numbers)
[1] "numeric"
class(summary)
[1] "function"
typeof(my_numbers)
[1] "double"
typeof(summary)
[1] "closure"
R denotes missing data with a special type of thing, NA
, which is not a character (and hence not in quotes). It is actually a logical, which returns TRUE or FALSE.
?logical
missing <- c(NA, NA, NA)
typeof(missing)
[1] "logical"
is.na(missing)
[1] TRUE TRUE TRUE
dollars <- c(12, 1, 2, 3, NA)
typeof(dollars)
[1] "double"
is.na(dollars)
[1] FALSE FALSE FALSE FALSE TRUE
dollars*3
[1] 36 3 6 9 NA
mean(dollars)
[1] NA
mean(dollars, na.rm = TRUE)
[1] 4.5
## A Function takes arguments inside parentheses
my_summary <- summary(my_numbers)
class(my_summary)
[1] "summaryDefault" "table"
my_summary
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.500 3.000 5.714 4.000 25.000
my_numbers * 2
[1] 2 4 6 2 6 10 50
table(my_numbers)
my_numbers
1 2 3 5 25
2 1 2 1 1
sd(my_numbers)
[1] 8.616153
class(my_numbers)
[1] "numeric"
class(my_summary)
[1] "summaryDefault" "table"
class(table)
[1] "function"
typeof(my_numbers)
[1] "double"
typeof(my_summary)
[1] "double"
typeof(table)
[1] "closure"
str(my_numbers)
num [1:7] 1 2 3 1 3 5 25
str(my_summary)
Classes 'summaryDefault', 'table' Named num [1:6] 1 1.5 3 5.71 4 ...
..- attr(*, "names")= chr [1:6] "Min." "1st Qu." "Median" "Mean" ...
str(summary)
function (object, ...)
install.packages("gapminder")
head(gapminder)
## What is it?
class(gapminder)
[1] "tbl_df" "tbl" "data.frame"
## What's inside?
str(gapminder)
Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ lifeExp : num 28.8 30.3 32 34 36.1 ...
$ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
$ gdpPercap: num 779 821 853 836 740 ...
## Get the dimensions of the data frame
dim(gapminder)
[1] 1704 6
## Another way to look at a data frame
head(gapminder)
Source: local data frame [6 x 6]
country continent year lifeExp pop gdpPercap
(fctr) (fctr) (int) (dbl) (int) (dbl)
1 Afghanistan Asia 1952 28.801 8425333 779.4453
2 Afghanistan Asia 1957 30.332 9240934 820.8530
3 Afghanistan Asia 1962 31.997 10267083 853.1007
4 Afghanistan Asia 1967 34.020 11537966 836.1971
5 Afghanistan Asia 1972 36.088 13079460 739.9811
6 Afghanistan Asia 1977 38.438 14880372 786.1134
Always do sanity-checks on your data after import! As in...
## Make an object containing the plot
## try str(p) if you like. Objects can be complex!
p <- ggplot(gapminder,
aes(x = gdpPercap,
y = lifeExp))
## Take our data and make a scatter plot
p + geom_point()
Two key concepts in the grammar of graphics: aesthetics map features of the data (for example, the lifeExp variable) to features of the visualization (for example, the y-axis coordinate), and geoms concern what actually gets plotted (here, each data point becomes a point in the plot).
Another key aspect of ggplot2: the ggplot() function creates a graphics object; additional controls are added with the + operator. The actual plot is made when the object is printed.
The following is equivalent to the code above. The actual plot isn't created until the p2 object is printed. (When you type an object's name at the R prompt, it gets printed, and that's the usual way that these plots get created.)
p1 <- ggplot(gapminder, aes(x=gdpPercap, y=lifeExp))
p2 <- p1 + geom_point()
p2
Change x-axis to log scale
ggplot(gapminder, aes(x=gdpPercap, y=lifeExp)) +
geom_point() +
scale_x_log10()
Could have also done just this:
p2 + scale_x_log10()
For a scatterplot, additional aesthetics include shape, size, and color.
For example, we might make our scatterplot for all countries, with data from 1952, and then color the points according to the continent.
gm_1952 <- filter(gapminder, year==1952)
ggplot(gm_1952, aes(x=gdpPercap, y=lifeExp)) +
geom_point() + scale_x_log10() +
aes(color=continent)
Note that we could have put color=continent within the call to ggplot(): the following is equivalent to the above.
ggplot(gm_1952, aes(x=gdpPercap, y=lifeExp, color=continent)) +
geom_point() +
scale_x_log10()
Try out the size
, shape
, and color
aesthetics, both with categorical variables (such as continent
) and numeric variables (such as pop
).
lifeExp
on the x-axis and gdpPercap
on the y-axis.pop
on the x-axis and gdpPercap on the y-axis.year
on the x-axis and any continuous variable on the y-axis.You don't have to do any of these, but if you are new to R these are some recommended resources for getting started.
"Learn R, in R."
Swirl is an R package that turns the R console into an interactive learning environment. It stands for Statistics with Interactive R Learning. If you are new to R, you may find some of these lessons useful. You can always save your place in Swirl and come back later.
install.packages("swirl", dependencies = TRUE)
library(swirl) #once a package is installed, you must load it before using it
install_from_swirl("R Programming") #give it a second to install the course
swirl() #the program should help you take it from there! There are 15 R Programming lessons in all.
"An intro to R for new programmers."
If you are a dog-lover (as I am), you may quickly realize that this is not the norm in the R community. Nevertheless, the site rforcats has a lot of great information, some of which overlaps with the lessons in swirl
, but some of which may be new (see the no no's and the do do's, and the section on the magrittr
pipe operator: %>%
).
"a newcomer's (angry) guide to R."
Also highly recommended, especially the section on data.frames
: