Since October 2018, I’ve watched David Robinson’s Tidy Tuesday screencasts and learned so much about data analysis in R. As a result, I’m writing a series of posts called Independent Learning from Tidy Tuesday Screencast. These are mostly written so that I can refer to them in the future, but by sharing these I hope they serve as useful cheatsheets for data analysis in the tidyverse.
Why should you read these posts instead of just reading through David’s code from the screencasts? Well, these posts include interpretations of graphs, tricks for better data visualization and manipulation, and advice about data analysis that David talks about in his screencasts but doesn’t write down. I hope you enjoy them.
College Majors Screencast
- When beginning your analysis, always load up the `tidyverse`, import the raw data, and `View()` it (a minimal setup sketch follows this list).
- After exploratory data analysis on the raw data, which includes asking questions about the data with graphs, perform the necessary cleaning steps and assign the cleaned data to a new variable (e.g. `majors_processed`, `by_major_category`).
- Generate a lot of barplots in the early stages of analysis to get a sense of the categorical variables.
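Here is a minimal sketch of that setup, assuming the college majors data lives at the 2018-10-16 path of the #TidyTuesday repository (the URL is my assumption); the cleaning step mirrors the `majors_processed` table the rest of this section uses:

library(tidyverse)
library(scales)  # for dollar_format() and comma_format() below

# Assumed location of the 2018-10-16 #TidyTuesday college majors data
recent_grads <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018-10-16/recent-grads.csv")

# Inspect the raw data interactively
View(recent_grads)

# Keep the raw data intact; assign the cleaned version to a new variable
majors_processed <- recent_grads %>%
  arrange(desc(Median)) %>%
  mutate(Major = str_to_title(Major),
         Major = fct_reorder(Major, Median))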
recent_grads %>%
  mutate(Major_category = fct_reorder(Major_category, Median)) %>%
  ggplot(aes(Major_category, Median)) +
  geom_boxplot() +
  scale_y_continuous(labels = dollar_format()) +
  labs(x = "",
       y = "Median salary") +
  ggtitle("Distr. of Median Salary by Major Category") +
  coord_flip()
- `mutate(Major_category = fct_reorder(Major_category, Median))` reorders the major categories by median salary, so the reader sees an ordered boxplot. This line of code shows up in pretty much every one of David’s screencasts.
recent_grads %>%
  count(Major_category, wt = Total, sort = TRUE) %>%
  mutate(Major_category = fct_reorder(Major_category, n)) %>%
  ggplot(aes(x = Major_category, y = n, fill = Major_category)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = comma_format()) +
  labs(x = "",
       y = "Total # of graduates",
       title = "What major categories were most common?") +
  theme(legend.position = "none")
- Useful barplot trick from Julia Silge: add colors to the bars with `fill = Major_category` and drop the legend. The rainbow effect makes the graph easier to look at.
majors_processed %>%
  head(20) %>%
  ggplot(aes(Major, Median, color = Major_category)) +
  geom_point() +
  geom_errorbar(aes(ymin = P25th, ymax = P75th)) +
  coord_flip() +
  scale_y_continuous(labels = dollar_format()) +
  labs(title = "What are the highest earning majors?",
       x = "",
       y = "Median Salary",
       color = NULL) +
  # Start the Median axis at 0
  expand_limits(y = 0)
- David mentions this graph is very insightful and that he would include it in a blog post if he were to write one.
- The error bars do NOT represent confidence intervals (we are not unsure about the median value); they show the 25th and 75th percentile salaries.
- Court Reporting has a very narrow interval, meaning that most people in court reporting make around $50K.
majors_processed %>%
  tail(20) %>%
  ggplot(aes(Major, Median, color = Major_category)) +
  geom_point() +
  geom_errorbar(aes(ymin = P25th, ymax = P75th)) +
  coord_flip() +
  scale_y_continuous(labels = dollar_format()) +
  labs(x = "",
       y = "Median Salary",
       color = "") +
  # Start the Median axis at 0
  expand_limits(y = 0) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
- A great thing about creating a cleaned dataset like `majors_processed` is that we can easily run `tail(20)` to get the 20 lowest-earning majors, like Library Science and Counseling Psychology.
- David would drop this graph from a blog post because all of these majors have similar salaries.
- Instead, he would build a Shiny app that allows readers to interact with the graph.
- Lastly, he emphasizes the importance of sample size (`n`). If `n` is low, we can’t really trust the data because it might contain a lot of variation.
majors_processed %>%
  ggplot(aes(Sample_size, Median)) +
  geom_point(aes(color = Major_category)) +
  scale_x_log10() +
  labs(x = "Sample Size",
       y = "Median Salary",
       color = "Major Category") +
  ggtitle("Bland–Altman plot")
- There are too many categories for this graph to be insightful.
- David mentions that if his screencasts were intended to teach, he would show graphs like this that don’t work before moving on to useful ones.
majors_processed %>%
  ggplot(aes(Sample_size, Median)) +
  geom_point() +
  geom_text(aes(label = Major), check_overlap = TRUE, vjust = 1, hjust = 1) +
  scale_x_log10() +
  labs(x = "Sample Size")
- The `geom_text` argument `check_overlap = TRUE` is just a work of art.
library(broom)  # for tidy()

majors_processed %>%
  select(Major, Major_category, Total, ShareWomen, Sample_size, Median) %>%
  add_count(Major_category) %>%
  filter(n >= 10) %>%
  nest(-Major_category) %>%
  mutate(model = map(data, ~ lm(Median ~ ShareWomen, data = ., weights = Sample_size)),
         tidied = map(model, tidy)) %>%
  unnest(tidied) %>%
  # Filter out the intercept term
  filter(term == "ShareWomen") %>%
  arrange(estimate) %>%
  mutate(fdr = p.adjust(p.value, method = "fdr"))
## # A tibble: 9 x 7
## Major_category term estimate std.error statistic p.value fdr
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Biology & Life Scien… ShareW… -43735. 20982. -2.08 0.0592 0.106
## 2 Engineering ShareW… -33912. 15418. -2.20 0.0366 0.0937
## 3 Computers & Mathemat… ShareW… -28694. 18552. -1.55 0.156 0.235
## 4 Business ShareW… -28171. 9810. -2.87 0.0152 0.0937
## 5 Agriculture & Natura… ShareW… -16263. 5975. -2.72 0.0297 0.0937
## 6 Physical Sciences ShareW… -12820. 13349. -0.960 0.365 0.469
## 7 Education ShareW… -1996. 3084. -0.647 0.528 0.594
## 8 Humanities & Liberal… ShareW… -1814. 4128. -0.439 0.668 0.668
## 9 Health ShareW… 54721. 23427. 2.34 0.0416 0.0937
- The above code chunk seeks to answer the question: within each major category (engineering, health, business, and so on), what is the correlation between share of women and salary?
- For better understanding, take this DataCamp course: Machine Learning in the Tidyverse
- The false discovery rate (`fdr`) method adjusts the p-values for multiple comparisons (a toy example of `p.adjust()` follows this list).
- If you are writing a paper for publication, be very careful about FDR values. For internal reports, this is fine.
- At least a few major categories show a negative correlation between share of women and typical salary.
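To see what `p.adjust(..., method = "fdr")` does, here is a tiny sketch with made-up p-values. The Benjamini–Hochberg ("fdr") procedure multiplies each sorted p-value by the number of tests divided by its rank, then enforces monotonicity:

# Three hypothetical raw p-values
p <- c(0.01, 0.02, 0.04)

# Each sorted p-value is scaled by n / rank, then made non-decreasing:
# 0.01 * 3/1 = 0.03, 0.02 * 3/2 = 0.03, 0.04 * 3/3 = 0.04
p.adjust(p, method = "fdr")
## [1] 0.03 0.03 0.04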
Movie Profits Screencast
- If you see warning messages when importing data, you can pull up a dataframe of those messages (and more) with `problems(movie_profit_raw)`.
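The import itself isn’t shown in these notes; a minimal sketch, assuming the data lives at the 2018-10-23 path of the #TidyTuesday repository (the URL is my assumption):

library(tidyverse)
library(lubridate)  # for parse_date_time() and year() below

# Assumed location of the 2018-10-23 #TidyTuesday movie profits data
movie_profit_raw <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018-10-23/movie_profit.csv")

# Pull up a dataframe of parsing problems raised during import
problems(movie_profit_raw)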
movie_profit <- movie_profit_raw %>%
  # Get rid of the row number column
  select(-X1) %>%
  # Parse the release date
  mutate(release_date = as.Date(parse_date_time(release_date, "%m!/%d/%Y"))) %>%
  # Skip the 2018 releases
  filter(release_date < "2018-01-01") %>%
  # Reverse the order of the rows
  arrange(desc(row_number())) %>%
  # Very dirty data: rows are duplicated. By running distinct() after
  # reversing the row order, we keep the second row of each pair
  distinct(movie, release_date, .keep_all = TRUE) %>%
  mutate(distributor = fct_lump(distributor, 6)) %>%
  filter(worldwide_gross > 0) %>%
  mutate(profit_ratio = worldwide_gross / production_budget) %>%
  mutate(decade = 10 * floor(year(release_date) / 10))
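The reverse-then-`distinct()` trick is easiest to see on a made-up example: `distinct()` keeps the first occurrence of each key, so reversing the rows first means the later of each duplicated pair survives.

# Toy tibble: movie "A" is duplicated
df <- tibble(movie = c("A", "A", "B"), gross = c(1, 2, 3))

df %>%
  arrange(desc(row_number())) %>%   # row order is now B, A(2), A(1)
  distinct(movie, .keep_all = TRUE) # keeps B and the second A
## # A tibble: 2 x 2
##   movie gross
##   <chr> <dbl>
## 1 B         3
## 2 A         2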
# Histogram
movie_profit %>%
  ggplot(aes(production_budget)) +
  geom_histogram() +
  labs(x = "Production Budget",
       y = "Count") +
  scale_x_log10(labels = dollar_format())
- Anytime you are dealing with money data, you are going to need to put it on a log scale (`scale_x_log10`).
movie_profit %>%
  ggplot(aes(distributor, production_budget)) +
  geom_boxplot() +
  scale_y_log10(labels = dollar_format()) +
  labs(x = "Distributor",
       y = "Production Budget") +
  coord_flip()
- This graph shows that movies with no distributor have far smaller production budgets than movies with a distributor.
movie_profit %>%
  mutate(genre = fct_reorder(genre, production_budget)) %>%
  filter(!is.na(distributor)) %>%
  ggplot(aes(genre, production_budget)) +
  geom_boxplot() +
  labs(y = "Production Budget",
       x = NULL) +
  scale_y_log10(labels = dollar_format()) +
  coord_flip() +
  facet_wrap(~distributor) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
- Across all seven studios, Action and Adventure movies cost the most money.
- Horror generally costs the least money to produce.
movie_profit %>%
  group_by(genre, year = year(release_date)) %>%
  summarize(median_profit_ratio = median(profit_ratio),
            movies = n()) %>%
  ungroup() %>%
  filter(year >= 2000) %>%
  arrange(desc(median_profit_ratio)) %>%
  mutate(genre = fct_reorder(genre, median_profit_ratio)) %>%
  ggplot(aes(year, median_profit_ratio, color = genre)) +
  geom_line() +
  labs(x = "Year",
       y = "Median Profit Ratio",
       color = "Genre") +
  scale_y_continuous(labels = function(x) paste0(x, "X"))
- Apparently, horror movies started being way more profitable around 2013.
- Here we see `fct_reorder(genre, median_profit_ratio)` again!
- You can adjust how labels are shown with a function: `scale_y_continuous(labels = function(x) paste0(x, "X"))`
horror_movies <- movie_profit %>%
  filter(genre == "Horror") %>%
  arrange(desc(profit_ratio))
horror_movies %>%
  head(20) %>%
  mutate(movie = paste0(movie, " (", year(release_date), ")"),
         movie = fct_reorder(movie, profit_ratio)) %>%
  ggplot(aes(movie, profit_ratio)) +
  geom_col(aes(fill = distributor)) +
  coord_flip() +
  scale_y_continuous(labels = function(x) paste0(x, "X")) +
  labs(x = "",
       y = "Ratio of worldwide gross to production budget",
       fill = "Distributor") +
  ggtitle("What horror movies have most outgrossed their budget?")
- This graph shows us that most of the movies that have outgrossed their budget are recent.
- Paranormal Activity, The Blair Witch Project, and Halloween have made far more relative to their budgets than other horror movies.
movie_profit %>%
  filter(!is.na(distributor)) %>%
  count(distributor, genre) %>%
  ggplot(aes(genre, n, fill = genre)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~distributor, scales = "free_x") +
  labs(x = "",
       y = "") +
  coord_flip()
- This is a faceted barplot with genres on the x-axis. This way, you can look at each distributor’s breakdown of genres.
- Interpretation of the graph: horror is the rarest genre for every distributor.
- Anytime you make a barplot, you need a `coord_flip()` for readability.
- `scales = "free_x"` is important to show the distribution of genres within each distributor.
- Give each barplot a `fill` aesthetic to make it easier to read, and add `show.legend = FALSE` to drop the legend.
horror_movies %>%
  filter(release_date >= "1994-01-01",
         profit_ratio >= 0.01) %>%
  ggplot(aes(release_date, profit_ratio)) +
  geom_point() +
  geom_smooth(method = "lm") +
  geom_text(aes(label = movie), vjust = 1, hjust = 1, check_overlap = TRUE) +
  labs(x = "Release Date",
       y = "Profit Ratio",
       title = "") +
  scale_y_log10(labels = function(x) paste0(x, "X"), breaks = c(0.1, 1, 10, 100))
- You absolutely need `check_overlap = TRUE` in these kinds of graphs with overlapping labels.
- When David sees an upward trend, he likes to throw in a line of best fit (`geom_smooth(method = "lm")`).
g <- movie_profit %>%
  filter(release_date >= "1990-01-01",
         profit_ratio >= 0.01) %>%
  ggplot(aes(release_date, profit_ratio, label = movie)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Release Date",
       y = "Profit Ratio") +
  scale_y_log10(labels = function(x) paste0(x, "X"), breaks = c(0.1, 1, 10, 100)) +
  facet_wrap(~genre)

library(plotly)
ggplotly(g)
- A Plotly graph allows readers to zoom in on particular movies.
- The reader can see that all genres have flat trend lines except horror movies, due to movies such as Paranormal Activity, Insidious, Saw, etc.
- Warning: `plotly` does not support `check_overlap = TRUE` (a workaround sketch follows this list).
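One workaround (my suggestion, not from the screencast) is to lean on Plotly’s hover tooltips instead of static text labels; since `g` maps `label = movie`, the tooltip can be restricted to the movie name:

# Show only the movie name on hover, replacing the overlapping text labels
ggplotly(g, tooltip = "label")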
R Downloads Screencast
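The import step isn’t shown in these notes; a minimal sketch, assuming the year of R download logs lives at the 2018-10-30 path of the #TidyTuesday repository (the URL and file name are my assumptions):

library(tidyverse)
library(lubridate)  # for floor_date() below

# Assumed location of the 2018-10-30 #TidyTuesday R downloads data
r_downloads_year <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018-10-30/r_downloads_year.csv")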
r_downloads_year %>%
  count(date) %>%
  ggplot(aes(date, n)) +
  geom_line() +
  expand_limits(y = 0) +
  labs(y = "# of R downloads per day",
       x = "")
r_downloads_year %>%
  # floor_date() is a great lubridate function for aggregating by week
  count(week = floor_date(date, "week")) %>%
  filter(week > min(week)) %>%
  ggplot(aes(week, n)) +
  geom_line() +
  expand_limits(y = 0) +
  labs(y = "# of R downloads per week",
       x = "")
- This graph shows the same data as the first graph, but aggregated by week, which reveals a clearer trend.
r_download_gaps <- r_downloads_year %>%
  mutate(datetime = as.POSIXlt(date) + time) %>%
  arrange(datetime) %>%
  group_by(ip_id) %>%
  mutate(gap = as.numeric(datetime - lag(datetime))) %>%
  filter(!is.na(gap))
r_download_gaps
## # A tibble: 934,367 x 10
## # Groups: ip_id [3,736]
## date time size version os country ip_id country_name
## <date> <drt> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 2017-10-20 00:12 7.82e7 3.4.2 win TW 913 Taiwan
## 2 2017-10-20 00:19 6.42e7 3.4.2 osx US 915 United Stat…
## 3 2017-10-20 00:20 7.82e7 3.4.2 win ID 921 Indonesia
## 4 2017-10-20 00:43 1.11e6 3.4.2 win US 1284 United Stat…
## 5 2017-10-20 00:53 6.42e7 latest osx US 27 United Stat…
## 6 2017-10-20 00:55 2.17e7 3.4.2 src US 1468 United Stat…
## 7 2017-10-20 01:05 8.29e6 3.4.2 win AU 1766 Australia
## 8 2017-10-20 01:11 7.82e7 3.4.2 win JP 125 Japan
## 9 2017-10-20 01:18 7.82e7 3.4.2 win JP 125 Japan
## 10 2017-10-20 01:29 7.82e7 3.4.2 win JP 1921 Japan
## # … with 934,357 more rows, and 2 more variables: datetime <dttm>,
## # gap <dbl>
- Interestingly, we create `datetime` by literally adding the date and time: `datetime = as.POSIXlt(date) + time` (see the sketch below).
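A quick sketch of why that works: converting the `Date` to `POSIXlt` gives midnight of that day, and the `time` column is a number of seconds (an `hms`/`difftime`), so adding it shifts the timestamp forward.

# Midnight of 2017-10-20 plus 5,400 seconds (1.5 hours, standing in for the time column)
as.POSIXlt(as.Date("2017-10-20")) + 5400
## [1] "2017-10-20 01:30:00 UTC"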
r_download_gaps %>%
  ggplot(aes(gap)) +
  geom_histogram() +
  geom_vline(color = "red", lty = 2, xintercept = 86400) +
  labs(x = "Gap", y = "Count") +
  scale_x_log10(breaks = 60 ^ (0:4),
                labels = c("Second", "Minute", "Hour", "2.5 Days", "150 Days"))
- This is a bimodal log-normal distribution.
- Along with money data, anytime you have time data, you want to use a log scale (`scale_x_log10`). The breaks `60 ^ (0:4)` are 1 second, 1 minute, 1 hour, 216,000 seconds (2.5 days), and 12,960,000 seconds (150 days).
- A bunch of people install R in gaps of exactly one day (the red dashed line at 86,400 seconds).
- David thinks these are automated systems that install R every single day.
package_downloads <- read_csv("http://cran-logs.rstudio.com/2018/2018-10-27.csv.gz")

package_downloads %>%
  filter(country %in% c("US", "IN")) %>%
  count(country, package, sort = TRUE) %>%
  spread(country, n, fill = 0) %>%
  mutate(total = US + IN,
         IN = (IN + 1) / sum(IN + 1),
         US = (US + 1) / sum(US + 1),
         ratio = US / IN) %>%
  filter(total > 1000) %>%
  arrange(desc(total)) %>%
  arrange(desc(ratio))
## # A tibble: 158 x 5
## package IN US total ratio
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 schumaker 0.0000121 0.00232 1136 191.
## 2 Ckmeans.1d.dp 0.0000243 0.00277 1354 114.
## 3 copula 0.0000364 0.00246 1205 67.5
## 4 bold 0.0000729 0.00212 1044 29.2
## 5 Rserve 0.000279 0.00231 1150 8.26
## 6 covr 0.000437 0.00237 1193 5.41
## 7 devtools 0.00219 0.0108 5474 4.95
## 8 mitml 0.000437 0.00202 1023 4.62
## 9 rgdal 0.000546 0.00247 1251 4.51
## 10 jomo 0.000461 0.00202 1024 4.37
## # … with 148 more rows
- When calculating the ratio, add 1 to the numerator and the denominator (Laplace smoothing) to prevent the ratio from blowing up to infinity when a count is zero (a toy example follows this list).
- `arrange(desc(ratio))` surfaces the packages downloaded disproportionately in the US; `arrange(ratio)` surfaces the packages downloaded disproportionately in India.
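A tiny sketch of why the +1 matters, using made-up counts for two hypothetical packages (the first has zero downloads from India):

# Hypothetical download counts for two packages
US <- c(500, 20)
IN <- c(0, 20)

# Raw ratio: division by zero makes the first package's ratio infinite
US / IN
## [1] Inf   1

# Laplace-smoothed shares keep every ratio finite
((US + 1) / sum(US + 1)) / ((IN + 1) / sum(IN + 1))
# about 21.1 for the first package, 0.04 for the second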