Babynames dplyr drills

0.1 Question 1

Create a new data frame that includes only the years from 2000 onward.

library(dplyr)
library(babynames)
newest <- babynames %>%
  filter(year >= 2000)

newest

Source: local data frame [492,895 x 5]

    year   sex      name     n        prop
   (dbl) (chr)     (chr) (int)       (dbl)
1   2000     F     Emily 25952 0.013012976
2   2000     F    Hannah 23073 0.011569374
3   2000     F   Madison 19967 0.010011949
4   2000     F    Ashley 17995 0.009023139
5   2000     F     Sarah 17687 0.008868700
6   2000     F    Alexis 17627 0.008838615
7   2000     F  Samantha 17264 0.008656598
8   2000     F   Jessica 15704 0.007874375
9   2000     F Elizabeth 15088 0.007565497
10  2000     F    Taylor 15078 0.007560483
..   ...   ...       ...   ...         ...

0.2 Question 2

Which name has been used for the greatest number of years (counting each name/sex combination separately)?

Hint 1: find the total number of years each name/sex combination has been used.
Hint 2: what is the maximum total number of years any name has been used?

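# Count the rows (one per year) for each name/sex combination.
years <- babynames %>%
  group_by(name, sex) %>%
  summarise(use_years = n())

# Keep the combinations that share the maximum count.
years %>%
  ungroup() %>%
  filter(use_years == max(use_years))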

Source: local data frame [934 x 3]

      name   sex use_years
     (chr) (chr)     (int)
1    Aaron     M       135
2    Abbie     F       135
3      Abe     M       135
4     Abel     M       135
5  Abigail     F       135
6    Abner     M       135
7  Abraham     M       135
8    Abram     M       135
9      Ada     F       135
10    Adam     M       135
..     ...   ...       ...

It is a 934-way tie at 135 years.
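
The same count-then-filter pattern can be tried on a small table first. This is a minimal sketch, assuming a reasonably recent dplyr; the `toy` data and the names `toy_years` and `toy_top` are made up for illustration and are not part of the exercise.

```r
library(dplyr)

# Hypothetical toy data standing in for babynames.
toy <- tibble(
  year = c(2000, 2001, 2002, 2000, 2001, 2002),
  sex  = c("F", "F", "F", "M", "M", "F"),
  name = c("Alex", "Alex", "Alex", "Alex", "Bea", "Bea")
)

# One row per name/sex combination, with the number of rows (years) it has.
toy_years <- toy %>%
  group_by(name, sex) %>%
  summarise(use_years = n()) %>%
  ungroup()

# Keep every combination that ties for the maximum.
toy_top <- toy_years %>%
  filter(use_years == max(use_years))
```

Here only the Alex/F combination has three rows, so `toy_top` holds a single row.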

0.3 Question 3

How many names have been used in only one year?

years %>%
  ungroup %>%
  group_by(sex) %>%
  filter(use_years == 1)

Source: local data frame [22,787 x 3]
Groups: sex [2]

      name   sex use_years
     (chr) (chr)     (int)
1    Aabid     M         1
2    Aaden     F         1
3  Aadhyan     M         1
4   Aadian     M         1
5  Aadrian     M         1
6   Aadrit     M         1
7  Aafreen     F         1
8    Aagam     M         1
9     Aage     M         1
10   Aagot     F         1
..     ...   ...       ...

There are 22,787 such name/sex combinations.
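
Rather than reading the row count off the printed header, the tally can be computed directly. A minimal sketch on made-up data; `toy_years` and `one_year` are hypothetical names.

```r
library(dplyr)

# Hypothetical per-name/sex year counts, shaped like the `years` table.
toy_years <- tibble(
  name      = c("Alex", "Alex", "Bea", "Cal"),
  sex       = c("F", "M", "F", "M"),
  use_years = c(3L, 1L, 1L, 2L)
)

one_year <- toy_years %>%
  filter(use_years == 1)

# Total number of name/sex combinations used in exactly one year.
total_one_year <- nrow(one_year)

# The same tally split by sex.
by_sex <- one_year %>%
  count(sex)
```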

0.4 Question 4

Create a new column that shows, for each year, the number of sexes each name was used for. Note: not recommended for slower computers.

boygirl <- babynames %>%
  group_by(year, name) %>%
  mutate(both = length(year)) %>% 
  select(year, name, both)

boygirl

Source: local data frame [1,825,433 x 3]
Groups: year, name [1664780]

    year      name  both
   (dbl)     (chr) (int)
1   1880      Mary     2
2   1880      Anna     2
3   1880      Emma     2
4   1880 Elizabeth     2
5   1880    Minnie     2
6   1880  Margaret     1
7   1880       Ida     2
8   1880     Alice     1
9   1880    Bertha     1
10  1880     Sarah     1
..   ...       ...   ...

0.5 Question 5

Make a data set of names that have been both boy and girl names.

Hint: use distinct()

both <- boygirl %>%
  filter(both == 2) %>%
  distinct(name)

both

Source: local data frame [160,653 x 3]
Groups: year, name [160653]

    year      name  both
   (dbl)     (chr) (int)
1   1880      Mary     2
2   1880      Anna     2
3   1880      Emma     2
4   1880 Elizabeth     2
5   1880    Minnie     2
6   1880       Ida     2
7   1880     Annie     2
8   1880     Clara     2
9   1880  Florence     2
10  1880      Cora     2
..   ...       ...   ...
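
Because boygirl is still grouped by year and name, the result above keeps one row per year/name pair, so a name can appear once for every year in which it was used for both sexes. If one row per name is wanted, dropping the grouping before calling distinct() is one option. A minimal sketch on made-up data (`toy` and `toy_both` are hypothetical names):

```r
library(dplyr)

# Hypothetical toy data: Alex is used for both sexes in two different years.
toy <- tibble(
  year = c(2000, 2000, 2001, 2001, 2001),
  sex  = c("F", "M", "F", "M", "F"),
  name = c("Alex", "Alex", "Alex", "Alex", "Bea")
)

toy_both <- toy %>%
  group_by(year, name) %>%
  mutate(both = n()) %>%
  ungroup() %>%            # drop the year/name grouping before deduplicating
  filter(both == 2) %>%
  distinct(name)
```

With the grouping removed, Alex appears once even though it qualifies in both 2000 and 2001.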

0.6 Question 6

For each year, display the total number of names that were used. Treat boy and girl versions of the same name as two separate names.

new <- babynames %>%
  group_by(year, sex) %>%
  summarise(num_names = length(year))

new

Source: local data frame [270 x 3]
Groups: year [?]

    year   sex num_names
   (dbl) (chr)     (int)
1   1880     F       942
2   1880     M      1058
3   1881     F       938
4   1881     M       997
5   1882     F      1028
6   1882     M      1099
7   1883     F      1054
8   1883     M      1030
9   1884     F      1172
10  1884     M      1125
..   ...   ...       ...
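
The table above reports the count separately for each sex; adding the two rows for a year gives that year's total (for 1880, 942 + 1058 = 2000). Grouping by year alone returns the total directly, since every name/sex row is still counted once. A minimal sketch on made-up data (`toy` and `toy_totals` are hypothetical names):

```r
library(dplyr)

# Hypothetical toy data standing in for babynames.
toy <- tibble(
  year = c(2000, 2000, 2001, 2001, 2001),
  sex  = c("F", "M", "F", "F", "M"),
  name = c("Alex", "Alex", "Alex", "Bea", "Cal")
)

# One row per year; each name/sex row counts as a separate name.
toy_totals <- toy %>%
  group_by(year) %>%
  summarise(num_names = n())
```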

0.7 Question 7

Which name received the largest percentage of any name for any year? (Consider boy and girl names as distinct.)

babynames %>%
  filter(prop == max(prop))

Source: local data frame [1 x 5]

   year   sex  name     n       prop
  (dbl) (chr) (chr) (int)      (dbl)
1  1880     M  John  9655 0.08154561

John in 1880

0.8 Question 8

Which girl's name received the largest percentage of any girl's name for any year?

babynames %>%
  filter(sex == "F") %>%
  filter(prop == max(prop))

Source: local data frame [1 x 5]

   year   sex  name     n       prop
  (dbl) (chr) (chr) (int)      (dbl)
1  1880     F  Mary  7065 0.07238359

Mary in 1880

0.9 Question 9

Display the average percentage each name received during all the years it was used. Treat girl and boy versions of the same name as different names.

name_perc <- babynames %>%
  group_by(name, sex) %>%
  summarise(avg_prop = mean(prop))

name_perc

Source: local data frame [104,110 x 3]
Groups: name [?]

        name   sex     avg_prop
       (chr) (chr)        (dbl)
1      Aaban     M 5.028412e-06
2      Aabha     F 3.617440e-06
3      Aabid     M 2.381589e-06
4  Aabriella     F 2.491848e-06
5      Aadam     M 4.136221e-06
6      Aadan     M 5.956601e-06
7    Aadarsh     M 5.380576e-06
8      Aaden     F 2.473744e-06
9      Aaden     M 1.329001e-04
10    Aadesh     M 2.394223e-06
..       ...   ...          ...

0.10 Question 10

Which name recorded in the data set has been out of use for the longest time?

name_last <- babynames %>%
  group_by(name) %>%
  summarise(last_use = max(year)) 

name_last %>%
  filter(last_use == min(last_use))

Source: local data frame [2 x 2]

    name last_use
   (chr)    (dbl)
1   Roll     1881
2 Zilpah     1881

Two names, Roll and Zilpah, were last used in 1881.

0.11 Question 11

In a new column, display the total number of years a name has been used (as either a boy's name or a girl's name).

name_years <- babynames %>%
  group_by(name) %>%
  mutate(num_years = length(year))

name_years

Source: local data frame [1,825,433 x 6]
Groups: name [93889]

    year   sex      name     n       prop num_years
   (dbl) (chr)     (chr) (int)      (dbl)     (int)
1   1880     F      Mary  7065 0.07238359       265
2   1880     F      Anna  2604 0.02667896       267
3   1880     F      Emma  2003 0.02052149       246
4   1880     F Elizabeth  1939 0.01986579       269
5   1880     F    Minnie  1746 0.01788843       208
6   1880     F  Margaret  1578 0.01616720       254
7   1880     F       Ida  1472 0.01508119       208
8   1880     F     Alice  1414 0.01448696       236
9   1880     F    Bertha  1320 0.01352390       213
10  1880     F     Sarah  1288 0.01319605       260
..   ...   ...       ...   ...        ...       ...
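
Note that length(year) counts rows, so a year in which a name was recorded for both sexes is counted twice. That is why values such as 265 for Mary exceed 135, the maximum found in Question 2. If calendar years are wanted instead, n_distinct() is one way to count them. A minimal sketch on made-up data (`toy` and `toy_counts` are hypothetical names):

```r
library(dplyr)

# Hypothetical toy data: Alex appears for both sexes in 2000.
toy <- tibble(
  year = c(2000, 2000, 2001, 2001),
  sex  = c("F", "M", "F", "F"),
  name = c("Alex", "Alex", "Alex", "Bea")
)

toy_counts <- toy %>%
  group_by(name) %>%
  mutate(
    num_rows  = n(),               # counts 2000 twice for Alex
    num_years = n_distinct(year)   # counts each calendar year once
  ) %>%
  ungroup()
```

For Alex, `num_rows` is 3 while `num_years` is 2.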

0.12 Question 12

In general, are names that have been used for both boys and girls more popular for boys or girls?

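# `both` is a data frame, so match against its name column.
bg_names <- name_perc %>%
  filter(name %in% both$name) %>%
  group_by(name) %>%
  summarise(diff = avg_prop[sex == "M"] - avg_prop[sex == "F"])
# Share of these names whose average proportion is at least as high for boys.
mean(bg_names$diff >= 0)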

More popular for girls. bg_names$diff >= 0 is TRUE (i.e., 1) when the boys' version is at least as popular as the girls' version. This occurred less than half the time (i.e., mean(bg_names$diff >= 0) < 0.5).
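
Taking the mean of a logical vector gives the share of TRUE values, which is what the comparison above relies on. A minimal sketch with made-up proportions (`toy_perc`, `toy_diff`, and `share_boys` are hypothetical names):

```r
library(dplyr)

# Hypothetical average proportions for three names used for both sexes.
toy_perc <- tibble(
  name     = c("Alex", "Alex", "Bea", "Bea", "Cal", "Cal"),
  sex      = c("F", "M", "F", "M", "F", "M"),
  avg_prop = c(0.002, 0.001, 0.001, 0.003, 0.004, 0.001)
)

# Boys' average minus girls' average, one row per name.
toy_diff <- toy_perc %>%
  group_by(name) %>%
  summarise(diff = avg_prop[sex == "M"] - avg_prop[sex == "F"])

# Share of names where the boys' version is at least as popular.
share_boys <- mean(toy_diff$diff >= 0)
```

Only Bea has a non-negative difference, so `share_boys` is one third, and in this toy table the names lean toward girls.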

0.13 Question 13

Which name received the largest percentage in a single year since 2000?

# newest already contains only the years from 2000 onward.
newest %>%
  filter(prop == max(prop))

Source: local data frame [1 x 5]

   year   sex  name     n       prop
  (dbl) (chr) (chr) (int)      (dbl)
1  2000     M Jacob 34465 0.01651561

Jacob in 2000