---
title: "Data Visualization"
date: "September 29, 2016"
author: "Alison Presmanes Hill"
output:
  html_document:
    keep_md: TRUE
    highlight: pygments
    theme: journal
    smart: false
    toc: TRUE
    toc_float: TRUE
    number_sections: TRUE
---

```{r setup, include = FALSE, cache = FALSE}
## knitr's chunk options are singular: `warning`, `message`, `error`
knitr::opts_chunk$set(error = TRUE, comment = NA, warning = FALSE, message = FALSE, tidy = FALSE, eval = TRUE)
```

```{r load-packages, include = FALSE}
suppressWarnings(suppressMessages(library(tidyverse)))
suppressWarnings(suppressMessages(library(gapminder)))
```

# Install R

R is a programming language based on S, which was developed at Bell Labs. R is:

* Free
* Open source
* Available on almost every major platform

Install R from [CRAN, the Comprehensive R Archive Network](https://cran.rstudio.com). Please choose a **precompiled binary distribution** for your operating system.

* If you need more help, check out one of the following videos (courtesy of Roger Peng at Johns Hopkins Biostatistics):
    - [Installing R on a Mac](https://www.youtube.com/watch?v=Icawuhf0Yqo&feature=youtu.be)
    - [Installing R on Windows](https://www.youtube.com/watch?v=mfGFv-iB724&feature=youtu.be)
* If you need even more help, read this [step-by-step guide](https://beckmw.files.wordpress.com/2014/09/r_install_guide.pdf), including screenshots.

# Test R

Launch R. You should see one console with a command line interpreter (`>`).

* Place your cursor where you see `>` and type `x <- 2 + 2`, hit enter or return, then type `x`, and hit enter/return again.
* If `[1] 4` prints to the screen, you have successfully installed R.
* Close R.

# Install RStudio

**RStudio** provides a nice user interface for R, called an *integrated development environment*. RStudio includes:

* a console (the standard command line interface: `>`),
* a syntax-highlighting editor that supports direct code execution, and
* tools for plotting, history, debugging and workspace management.

Install the free, open-source edition of RStudio: http://www.rstudio.com/products/rstudio/download/

# Test RStudio

Launch RStudio. You should get a window similar to the screenshot you see [here](https://www.rstudio.com/wp-content/uploads/2014/04/rstudio-workbench.png), but yours will be empty.

* Place your cursor where you see `>` and type `x <- 2 * 2`, hit enter or return, then type `x`, and hit enter/return again.
* If `[1] 4` prints to the screen, you have successfully installed RStudio.

# Use R Markdown

R Markdown documents combine text, R code, and R output, including figures. They are a great way to produce self-contained and documented statistical analyses.

Create a new R Markdown file in RStudio and try some basic markdown editing. After you have made a change to the document, press "Knit HTML" in RStudio and see what kind of result you get.

## Basic Markdown editing

Try out basic R Markdown features, as described [here](http://rmarkdown.rstudio.com/authoring_basics.html). Write some text that is bold, and some that is in italics. Make a numbered list and a bulleted list. Make a nested list. Try the block-quote feature.

## Embedding R code

R code embedded in R chunks will be executed and the output will be shown.

```{r}
x <- 5
y <- 7
z <- x * y
z
```

Play around with some basic R code. For example, test that you can add comments to a code chunk by typing a `#` followed by some random text, and show that case matters in R code.

Next, use the code included in the blank R Markdown document you started with to plot the `pressure` data set.

```{r}
plot(pressure)
```

# Install TeX

In order to use all of the great options offered within RStudio (in particular, knitting to PDF), you will need a full installation of TeX. If you don't already have TeX, this is a big download.

* Install the appropriate full TeX distribution for your OS:
    -

# Getting Started in R

Code you can use in R will look like this:

```{r example, echo=TRUE}
## This is a comment
data <- c(1, 1, 4, 1, 1, 4, 1)
data
```

The first box shows something you can type into the R console. The second shows what you'd see as output if you did.

## Always know where R thinks you are

```{r getwd}
## Get the working directory
getwd()
```

## Everything has a *name*

- Some names are forbidden. These include words like `FALSE` and `TRUE`, logical operators and programming words like `Inf`, `for`, `else`, `break`, `function`, and words for special entities like `NA` and `NaN`.
- Some names you should not use. These include words that are also the names of very widely used objects like `q` or `c` or `mean`, or `pi`, or `range`, or `var`.
- All names are case sensitive.

## Everything is an *object*

- Objects are built in to R, are added via packages, or are created by the user.

```{r built-in-obj}
letters # letters
pi # pi
```

```{r iris-no-print, echo = TRUE, eval = FALSE}
iris # datasets in `datasets` package like iris
```

```{r iris-print, echo = FALSE}
## as_tibble() replaces the deprecated tbl_df()
iris %>% as_tibble()
```

Note that I am only showing the first few lines of the iris dataset here.

```{r objects, echo=TRUE}
## This is a vector of numbers
my_numbers <- c(1, 2, 3, 1, 3, 5, 25)
my_numbers
summary(my_numbers)
```

## Every object has a *class*

Common classes and types (`class()` reports the first kind of label, `typeof()` the second):

* Numeric
* Character
* Factor
* Logical
* Double (`?double`: "identical to numeric", but "double precision")
* Closure (a function)
* a few others you may run into; see `?typeof`

- Depending on what type of object something is, you can extract bits of information from it.

```{r classes, echo=TRUE}
class(my_numbers)
class(summary)
typeof(my_numbers)
typeof(summary)
```

R denotes missing data with a special value, `NA`, which is not a character string (and hence not in quotes). On its own, `NA` is a logical constant; to test for it, use `is.na()`, which returns `TRUE` or `FALSE` for each element.
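
As a minimal sketch of how `NA` behaves (the `heights` vector is made-up example data, not one of this tutorial's datasets): comparing anything with `NA` gives `NA` back, so `is.na()` is the function that actually answers `TRUE` or `FALSE`.

```r
## A numeric vector with one missing value (hypothetical example data)
heights <- c(150, NA, 172)

## Comparisons with NA are themselves NA, so `==` cannot find missing values
heights == NA

## is.na() is the reliable test: one TRUE or FALSE per element
is.na(heights)

## A bare NA is logical, but inside a numeric vector it is stored as a double
typeof(NA)
typeof(heights)

## Count the missing values
n_missing <- sum(is.na(heights))
n_missing
```

Counting with `sum(is.na(...))` works because `TRUE` is treated as 1 and `FALSE` as 0 when summed.
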

```{r missing-values-in-vector}
?logical
missing <- c(NA, NA, NA)
typeof(missing)
is.na(missing)
dollars <- c(12, 1, 2, 3, NA)
typeof(dollars)
is.na(dollars)
dollars*3
mean(dollars)
mean(dollars, na.rm = TRUE)
```

## Functions take Data (or Functions) as inputs, and produce outputs

```{r functions, echo=TRUE}
## A Function takes arguments inside parentheses
my_summary <- summary(my_numbers)
class(my_summary)
my_summary
```

- For now just remember that you do things in R by creating and manipulating objects, and that you manipulate objects by feeding them to functions and getting output back as a result.

```{r output}
my_numbers * 2
table(my_numbers)
sd(my_numbers)
```

## When in Doubt

### If you're not sure what something is, ask for its class/type:

```{r getclass}
class(my_numbers)
class(my_summary)
class(table)
```

```{r gettypeof}
typeof(my_numbers)
typeof(my_summary)
typeof(table)
```

### If you're not sure what something is, ask for its structure

```{r str}
str(my_numbers)
str(my_summary)
str(summary)
```

# Let's get some Data ...

```{r eval = FALSE}
install.packages("gapminder")
library(gapminder) # load the package so the gapminder data frame is available
head(gapminder)
```

```{r getdata-1, echo = TRUE}
## What is it?
class(gapminder)
## What's inside?
str(gapminder)
```

```{r getdata-3, echo = TRUE}
## Get the dimensions of the data frame
dim(gapminder)
## Another way to look at a data frame
head(gapminder)
```

Always do sanity checks on your data after import! As in...

![](hlo.png)

## ... and get ready to plot it

```{r FirstPlot-2, fig.align = "center"}
## Make an object containing the plot
## try str(p) if you like. Objects can be complex!
p <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp))
```

```{r FirstPlot-3}
## Take our data and make a scatter plot
p + geom_point()
```

Two key concepts in the grammar of graphics: aesthetics map features of the data (for example, the `lifeExp` variable) to features of the visualization (for example, the y-axis coordinate), and geoms concern what actually gets plotted (here, each data point becomes a point in the plot).

Another key aspect of ggplot2: the `ggplot()` function creates a graphics object; additional controls are added with the `+` operator. The actual plot is made when the object is printed.

The following is equivalent to the code above. The actual plot isn't created until the `p2` object is printed. (When you type an object's name at the R prompt, it gets printed, and that's the usual way that these plots get created.)

```{r}
p1 <- ggplot(gapminder, aes(x=gdpPercap, y=lifeExp))
p2 <- p1 + geom_point()
p2
```

Change the x-axis to a log scale:

```{r}
ggplot(gapminder, aes(x=gdpPercap, y=lifeExp)) + geom_point() + scale_x_log10()
```

We could also have done just this:

```{r eval = FALSE}
p2 + scale_x_log10()
```

For a scatterplot, additional aesthetics include shape, size, and color.

For example, we might make our scatterplot for all countries, with data from 1952, and then color the points according to the continent.

```{r}
gm_1952 <- filter(gapminder, year==1952)
ggplot(gm_1952, aes(x=gdpPercap, y=lifeExp)) + geom_point() + scale_x_log10() + aes(color=continent)
```

Note that we could have put `color=continent` within the call to `ggplot()`: the following is equivalent to the above.

```{r}
ggplot(gm_1952, aes(x=gdpPercap, y=lifeExp, color=continent)) + geom_point() + scale_x_log10()
```

Try out the `size`, `shape`, and `color` aesthetics, both with categorical variables (such as `continent`) and numeric variables (such as `pop`).

# Try this

1. Create an R Markdown file for your work if you haven't already.
2. Look again at the data.
3. Put `lifeExp` on the x-axis and `gdpPercap` on the y-axis.
4. Plot `pop` on the x-axis and `gdpPercap` on the y-axis.
5. Plot `year` on the x-axis and any continuous variable on the y-axis.

# Help

You don't have to do any of these, but if you are new to R, these are some recommended resources for getting started.

## Swirl

_"Learn R, in R."_

Swirl is an R package that turns the R console into an interactive learning environment. It stands for Statistics with Interactive R Learning. If you are new to R, you may find some of these lessons useful. You can always save your place in Swirl and come back later.

```{r eval = FALSE, comment = NA}
install.packages("swirl", dependencies = TRUE)
library(swirl) # once a package is installed, you must load it before using it
install_from_swirl("R Programming") # give it a second to install the course
swirl() # the program should help you take it from there! There are 15 R Programming lessons in all.
```

## R for cats

_"An intro to R for new programmers."_

If you are a dog-lover (as I am), you may quickly realize that this is not the norm in the R community. Nevertheless, the site [rforcats](http://rforcats.net) has a lot of great information, some of which overlaps with the lessons in `swirl`, but some of which may be new (see the no no's and the do do's, and the section on the `magrittr` pipe operator: `%>%`).

## aRrgh

_"a newcomer's (angry) guide to R."_

Also highly recommended, especially the section on `data.frames`:

*