Create a new data frame that includes only the years since 2000.
newest <- babynames %>%
  filter(year >= 2000)
newest
Source: local data frame [492,895 x 5]
year sex name n prop
(dbl) (chr) (chr) (int) (dbl)
1 2000 F Emily 25952 0.013012976
2 2000 F Hannah 23073 0.011569374
3 2000 F Madison 19967 0.010011949
4 2000 F Ashley 17995 0.009023139
5 2000 F Sarah 17687 0.008868700
6 2000 F Alexis 17627 0.008838615
7 2000 F Samantha 17264 0.008656598
8 2000 F Jessica 15704 0.007874375
9 2000 F Elizabeth 15088 0.007565497
10 2000 F Taylor 15078 0.007560483
.. ... ... ... ... ...
Which name has been used for the greatest number of years (when used for a single gender)?
years <- babynames %>%
  group_by(name, sex) %>%
  summarise(use_years = length(name))
years %>%
  ungroup() %>%
  filter(use_years == max(use_years))
Source: local data frame [934 x 3]
name sex use_years
(chr) (chr) (int)
1 Aaron M 135
2 Abbie F 135
3 Abe M 135
4 Abel M 135
5 Abigail F 135
6 Abner M 135
7 Abraham M 135
8 Abram M 135
9 Ada F 135
10 Adam M 135
.. ... ... ...
It is a 934-way tie.
How many names have only been used one year?
years %>%
  ungroup() %>%
  group_by(sex) %>%
  filter(use_years == 1)
Source: local data frame [22,787 x 3]
Groups: sex [2]
name sex use_years
(chr) (chr) (int)
1 Aabid M 1
2 Aaden F 1
3 Aadhyan M 1
4 Aadian M 1
5 Aadrian M 1
6 Aadrit M 1
7 Aafreen F 1
8 Aagam M 1
9 Aage M 1
10 Aagot F 1
.. ... ... ...
There are 22,787 of them.
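The count above is read off the header of the printed result. As a minimal sketch of getting the number directly, the example below filters and then calls `nrow()`. The table `toy_years` is made-up data standing in for `years`, so the figure it produces is illustrative only.

```r
library(dplyr)

# Hypothetical stand-in for the `years` summary built above.
toy_years <- data.frame(
  name      = c("Ann", "Bo", "Cy", "Di"),
  sex       = c("F", "M", "M", "F"),
  use_years = c(1L, 3L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Keep names used in exactly one year, then count the remaining rows.
one_year <- toy_years %>%
  filter(use_years == 1)

n_one_year <- nrow(one_year)
n_one_year
```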
Create a new column that shows, for each year, the number of genders each name was used for. Note: not recommended for slower computers.
boygirl <- babynames %>%
  group_by(year, name) %>%
  mutate(both = length(year)) %>%
  select(year, name, both)
boygirl
Source: local data frame [1,825,433 x 3]
Groups: year, name [1664780]
year name both
(dbl) (chr) (int)
1 1880 Mary 2
2 1880 Anna 2
3 1880 Emma 2
4 1880 Elizabeth 2
5 1880 Minnie 2
6 1880 Margaret 1
7 1880 Ida 2
8 1880 Alice 1
9 1880 Bertha 1
10 1880 Sarah 1
.. ... ... ...
Make a data set of names that have been both boy and girl names. Hint: use distinct().
both <- boygirl %>%
  filter(both == 2) %>%
  distinct(name)
both
Source: local data frame [160,653 x 3]
Groups: year, name [160653]
year name both
(dbl) (chr) (int)
1 1880 Mary 2
2 1880 Anna 2
3 1880 Emma 2
4 1880 Elizabeth 2
5 1880 Minnie 2
6 1880 Ida 2
7 1880 Annie 2
8 1880 Clara 2
9 1880 Florence 2
10 1880 Cora 2
.. ... ... ...
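The result above is still grouped by year and name (160,653 groups), so it has one row per year and name rather than one row per name. If a list of unique names is wanted, one option is to drop the grouping before calling `distinct()`. This is a sketch on made-up data (`toy_boygirl` is hypothetical), assuming a recent version of dplyr in which `distinct(name)` returns only the `name` column.

```r
library(dplyr)

# Hypothetical stand-in for `boygirl`: one row per name-sex-year record,
# still grouped by year and name as in the exercise.
toy_boygirl <- data.frame(
  year = c(1990, 1990, 1991, 1991, 1991),
  name = c("Alex", "Alex", "Alex", "Alex", "Sam"),
  both = c(2L, 2L, 2L, 2L, 1L),
  stringsAsFactors = FALSE
) %>%
  group_by(year, name)

# Drop the grouping first so distinct() compares names across all years.
toy_both <- toy_boygirl %>%
  ungroup() %>%
  filter(both == 2) %>%
  distinct(name)

toy_both
```

Here "Alex" appears once, even though it was used for both sexes in two different years.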
For each year, display the total number of names that were used. Treat boy and girl versions of the same name as two separate names.
new <- babynames %>%
  group_by(year, sex) %>%
  summarise(num_names = length(year))
new
Source: local data frame [270 x 3]
Groups: year [?]
year sex num_names
(dbl) (chr) (int)
1 1880 F 942
2 1880 M 1058
3 1881 F 938
4 1881 M 997
5 1882 F 1028
6 1882 M 1099
7 1883 F 1054
8 1883 M 1030
9 1884 F 1172
10 1884 M 1125
.. ... ... ...
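The result above has one row per year and sex. If a single total per year is wanted, with the boy and girl versions of a name still counted separately, grouping by year alone should be enough, because each row of the data is already one name-sex pair. A minimal sketch on made-up data (`toy_names` is hypothetical):

```r
library(dplyr)

# Hypothetical miniature of babynames: one row per year, sex, and name.
toy_names <- data.frame(
  year = c(1990, 1990, 1990, 1991, 1991),
  sex  = c("F", "M", "M", "F", "M"),
  name = c("Alex", "Alex", "Sam", "Alex", "Sam"),
  stringsAsFactors = FALSE
)

# n() counts rows per year, so "Alex (F)" and "Alex (M)" count as two names.
toy_totals <- toy_names %>%
  group_by(year) %>%
  summarise(num_names = n())

toy_totals
```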
Which name received the largest percentage of any name for any year (consider boy and girl names as distinct)?
babynames %>%
  filter(prop == max(prop))
Source: local data frame [1 x 5]
year sex name n prop
(dbl) (chr) (chr) (int) (dbl)
1 1880 M John 9655 0.08154561
John in 1880
Which girl's name received the largest percentage of any girl's name for any year?
babynames %>%
  filter(sex == "F") %>%
  filter(prop == max(prop))
Source: local data frame [1 x 5]
year sex name n prop
(dbl) (chr) (chr) (int) (dbl)
1 1880 F Mary 7065 0.07238359
Mary in 1880
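The two questions above can also be answered in one pass by grouping on sex before filtering, so that `max(prop)` is computed within each group. This is a sketch on made-up data (`toy_names` and its values are hypothetical), not the approach used in the answers above.

```r
library(dplyr)

# Hypothetical miniature of babynames with a proportion column.
toy_names <- data.frame(
  year = c(1990, 1990, 1991, 1991),
  sex  = c("F", "M", "F", "M"),
  name = c("Ann", "Bob", "Cat", "Dan"),
  prop = c(0.05, 0.07, 0.06, 0.04),
  stringsAsFactors = FALSE
)

# Within each sex, keep the row(s) whose prop equals that group's maximum.
top_by_sex <- toy_names %>%
  group_by(sex) %>%
  filter(prop == max(prop)) %>%
  ungroup()

top_by_sex
```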
Display the average percentage each name received during all the years it was used. Treat girl and boy versions of the same name as different names.
name_perc <- babynames %>%
  group_by(name, sex) %>%
  summarise(avg_prop = mean(prop))
name_perc
Source: local data frame [104,110 x 3]
Groups: name [?]
name sex avg_prop
(chr) (chr) (dbl)
1 Aaban M 5.028412e-06
2 Aabha F 3.617440e-06
3 Aabid M 2.381589e-06
4 Aabriella F 2.491848e-06
5 Aadam M 4.136221e-06
6 Aadan M 5.956601e-06
7 Aadarsh M 5.380576e-06
8 Aaden F 2.473744e-06
9 Aaden M 1.329001e-04
10 Aadesh M 2.394223e-06
.. ... ... ...
Which name recorded in the data set has been out of use for the longest time?
name_last <- babynames %>%
  group_by(name) %>%
  summarise(last_use = max(year))
name_last %>%
  filter(last_use == min(last_use))
Source: local data frame [2 x 2]
name last_use
(chr) (dbl)
1 Roll 1881
2 Zilpah 1881
Two names, Roll and Zilpah, have been out of use since 1881.
In a new column, display the total number of years a name has been used (as either a boy's name or a girl's name).
name_years <- babynames %>%
  group_by(name) %>%
  mutate(num_years = length(year))
name_years
Source: local data frame [1,825,433 x 6]
Groups: name [93889]
year sex name n prop num_years
(dbl) (chr) (chr) (int) (dbl) (int)
1 1880 F Mary 7065 0.07238359 265
2 1880 F Anna 2604 0.02667896 267
3 1880 F Emma 2003 0.02052149 246
4 1880 F Elizabeth 1939 0.01986579 269
5 1880 F Minnie 1746 0.01788843 208
6 1880 F Margaret 1578 0.01616720 254
7 1880 F Ida 1472 0.01508119 208
8 1880 F Alice 1414 0.01448696 236
9 1880 F Bertha 1320 0.01352390 213
10 1880 F Sarah 1288 0.01319605 260
.. ... ... ... ... ... ...
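In the output above, `num_years` reaches values such as 269, well above the 135 years found earlier for any single name and sex. That suggests `length(year)` is counting rows, so a year in which a name was used for both sexes is counted twice. If distinct calendar years are wanted instead, `n_distinct(year)` is one option. A minimal sketch on made-up data (`toy_names` is hypothetical) comparing the two:

```r
library(dplyr)

# Hypothetical miniature of babynames: "Alex" is used for both sexes in 1990.
toy_names <- data.frame(
  year = c(1990, 1990, 1991, 1991),
  sex  = c("F", "M", "F", "M"),
  name = c("Alex", "Alex", "Alex", "Sam"),
  stringsAsFactors = FALSE
)

toy_name_years <- toy_names %>%
  group_by(name) %>%
  mutate(
    num_rows  = length(year),     # rows per name (one per sex and year)
    num_years = n_distinct(year)  # distinct calendar years per name
  ) %>%
  ungroup()

toy_name_years
```

For "Alex", `num_rows` is 3 while `num_years` is 2.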
In general, are names that have been used for both boys and girls more popular for boys or girls?
bg_names <- name_perc %>%
  # compare against the name column of `both`, not the whole data frame
  filter(name %in% both$name) %>%
  group_by(name) %>%
  summarise(diff = avg_prop[sex == "M"] - avg_prop[sex == "F"])
mean(bg_names$diff >= 0)
More popular for girls. bg_names$diff >= 0 is TRUE (i.e., 1) when the boys' version is more popular. This occurred less than half the time (i.e., mean(bg_names$diff >= 0) < 0.5).
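The reasoning above relies on the fact that R treats TRUE as 1 and FALSE as 0, so the mean of a logical vector is the proportion of TRUE values. A small sketch with made-up differences (`toy_diff` is hypothetical and unrelated to the real result):

```r
# Hypothetical differences: boys' average share minus girls' average share.
toy_diff <- c(0.002, -0.001, -0.004, 0.0005, -0.003)

# TRUE where the boys' version is at least as popular as the girls' version.
boys_more_popular <- toy_diff >= 0

# Proportion of names for which the boys' version is at least as popular.
share_boys <- mean(boys_more_popular)
share_boys
```

A value below 0.5 means the girls' version is more popular for most of the names.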
Which name has received the largest percentage in a single year since 2000?
newest %>%
  filter(year >= 2000,
         prop == max(prop))
Source: local data frame [1 x 5]
year sex name n prop
(dbl) (chr) (chr) (int) (dbl)
1 2000 M Jacob 34465 0.01651561
Jacob in 2000