Introduction: Seven Weeks of R
It has been seven weeks since I started my summer research in Educational Data Mining. That means seven weeks of using R to import, tidy, transform, and visualize data (Grolemund and Wickham, R for Data Science). While I really wanted to work in R, I ran into many obstacles, major and minor, and had to work through a number of data manipulation and visualization issues before I could extract insights from the data. In this blog post, I share some of the tricks I learned for manipulating and visualizing data in R with the tidyverse.
Data Manipulation and Visualization Tricks
- Throughout the summer, the most important skill I gained was validating my code. For example, I checked whether group_by and summarise did what I intended them to do, as in this code chunk:
Test Code
library(tidyverse)
test <- data.frame(anon_screen_name = c("howard", "baik", "tomatoes", "bannas"),
cluster = c(1, 1, 2, 2))
test_dropout <- c("howard", "tomatoes")
test %>%
group_by(cluster) %>%
summarise(dropout_prop = mean(anon_screen_name %in% test_dropout))
## # A tibble: 2 x 2
## cluster dropout_prop
## <dbl> <dbl>
## 1 1 0.5
## 2 2 0.5
- I wanted to test whether my dplyr code actually calculated the proportion of matches between values in a column of my data frame (anon_screen_name) and a separate vector (test_dropout).
- The test revealed it did!
Original Code
bar_first <- new_clust_first_kmeans %>%
mutate(cluster = as.character(cluster)) %>%
group_by(cluster) %>%
summarise(dropout_prop = mean(anon_screen_name %in% total_dropout))
When in doubt about my code, I learned to run the same code on a dummy dataset and check whether the output matches my expectations. That way, I can be more confident that it will produce accurate output on my original dataset.
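The same check can also be written as an assertion, so that a wrong result stops the script instead of relying on reading the printed tibble. This is only a sketch that reuses the dummy data from the test code; the name result and the expected value of 0.5 per cluster are specific to this toy example.

```r
library(dplyr)

# Dummy data from the test code: each cluster holds exactly one dropout
test <- data.frame(anon_screen_name = c("howard", "baik", "tomatoes", "bannas"),
                   cluster = c(1, 1, 2, 2))
test_dropout <- c("howard", "tomatoes")

result <- test %>%
  group_by(cluster) %>%
  summarise(dropout_prop = mean(anon_screen_name %in% test_dropout))

# Stop with an error if the proportions are not what I worked out by hand
stopifnot(isTRUE(all.equal(result$dropout_prop, c(0.5, 0.5))))
```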
- dplyr::distinct() is a super useful verb for selecting distinct/unique rows. According to the documentation, distinct retains only unique/distinct rows from an input tbl. This is similar to unique.data.frame, but considerably faster.
Table with Duplicate Rows
Cleaned Table
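A minimal sketch of distinct() on a made-up table. The students data and its values are hypothetical and are not taken from my research data.

```r
library(dplyr)

# Hypothetical toy data: the second and third rows are exact duplicates
students <- tibble(
  anon_screen_name = c("howard", "baik", "baik", "tomatoes"),
  cluster = c(1, 1, 1, 2)
)

# Keep one copy of each fully identical row
unique_students <- students %>% distinct()

# Keep the first row found for each cluster, retaining the other columns
one_per_cluster <- students %>% distinct(cluster, .keep_all = TRUE)
```

Calling distinct() with no column names compares whole rows, while naming columns deduplicates on those columns only.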
- tidyr::replace_na() allows you to replace missing values. After I saw the table below, I immediately turned to replace_na() to replace the NA values in each column with 0.
replace_na(list("1" = 0,
"2" = 0,
"3" = 0,
"4" = 0,
"5" = 0,
"6" = 0,
"7" = 0,
"8" = 0,
"9" = 0,
"10" = 0))
Messy Table
Cleaned Table
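For a self-contained version, here is a sketch on a hypothetical two-week table (the weekly data frame is made up). It also shows one possible shortcut, assuming every numeric column should be filled the same way: combining mutate(across()) with the vector form of replace_na() avoids listing each column by name.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per cluster, one column per week
weekly <- tibble(cluster = c("a", "b"), `1` = c(3, NA), `2` = c(NA, 5))

# Name each column explicitly, as in the snippet above
filled <- weekly %>% replace_na(list(`1` = 0, `2` = 0))

# Alternative: fill every numeric column without listing the names
filled_all <- weekly %>%
  mutate(across(where(is.numeric), ~ replace_na(.x, 0)))
```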
- The col.names parameter in kableExtra::kable allows you to change the column names.
Table with Raw Column Names
Table with Customized Column Names
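A small sketch of col.names, written here against knitr::kable, which I know takes this argument. The dropout_summary table and the new header text are hypothetical.

```r
library(knitr)

# Hypothetical summary table with raw column names
dropout_summary <- data.frame(cluster = c("1", "2"),
                              dropout_prop = c(0.5, 0.5))

# col.names changes the header text only; the data frame keeps its names
pretty_table <- kable(dropout_summary,
                      format = "markdown",
                      col.names = c("Cluster", "Dropout Proportion"))
```

Because only the rendered header changes, later code can keep using the original column names.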
- scales::percent_format allows you to change the labels on your plots to percentages!
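As a sketch, percent_format() can be passed to the labels argument of a continuous scale. The dropout_summary data below is made up for illustration.

```r
library(ggplot2)

# Hypothetical proportions per cluster
dropout_summary <- data.frame(cluster = c("1", "2", "3"),
                              dropout_prop = c(0.5, 0.25, 0.8))

# percent_format() returns a labelling function that the scale applies to its breaks
p <- ggplot(dropout_summary, aes(x = cluster, y = dropout_prop)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent_format())
```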
- When drawing boxplots, I found too many outliers and would have loved to somehow jitter just these points. The outlier.alpha parameter in geom_boxplot changes the transparency of only the outliers.
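A sketch of both ideas on simulated data (the scores data frame is hypothetical). The second plot is one common workaround, not true outlier-only jittering: it hides the boxplot's own outlier points and jitters every observation on top.

```r
library(ggplot2)

# Simulated, skewed scores so that each box has several outliers
set.seed(1)
scores <- data.frame(cluster = rep(c("1", "2"), each = 100),
                     score = c(rexp(100), rexp(100, rate = 0.5)))

# Option 1: fade only the outlier points
p_alpha <- ggplot(scores, aes(x = cluster, y = score)) +
  geom_boxplot(outlier.alpha = 0.2)

# Option 2: hide the boxplot's outliers, then jitter all points
p_jitter <- ggplot(scores, aes(x = cluster, y = score)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.15, alpha = 0.3)
```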
- dplyr::case_when is a super useful function that replaces my old trick of nested ifelse statements.
An Incomprehensible Example of ifelse
# Code from another project:
odd_man = ifelse(odd_man %in% c("1-0", "2-1", "3-2", "4-3"), "one_man",
                 ifelse(odd_man %in% c("2-0", "3-1", "4-2", "3-0"), "two_plus_man",
                        "all_other_shots"))
A Readable Example of case_when
case_when(
between(first_submit, week_seq[1], week_seq[2]) ~ 1,
first_submit <= week_seq[3] & first_submit >= as.Date("2014-07-02") ~ 2,
first_submit <= week_seq[4] & first_submit >= as.Date("2014-07-09") ~ 3,
first_submit <= week_seq[5] & first_submit >= as.Date("2014-07-16") ~ 4,
first_submit <= week_seq[6] & first_submit >= as.Date("2014-07-23") ~ 5,
first_submit <= week_seq[7] & first_submit >= as.Date("2014-07-30") ~ 6,
first_submit <= week_seq[8] & first_submit >= as.Date("2014-08-06") ~ 7,
first_submit <= week_seq[9] & first_submit >= as.Date("2014-08-13") ~ 8,
first_submit <= week_seq[10] & first_submit >= as.Date("2014-08-20") ~ 9,
first_submit <= week_seq[11] & first_submit >= as.Date("2014-08-27") ~ 10,
TRUE ~ 0
)
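To see the pattern run end to end, here is a shortened sketch with three weeks instead of ten. The week_seq boundaries and the submissions dates are hypothetical stand-ins, not the values from my project.

```r
library(dplyr)

# Hypothetical weekly boundaries: 2014-06-25, 07-02, 07-09, 07-16
week_seq <- seq(as.Date("2014-06-25"), by = "week", length.out = 4)

# Hypothetical first-submission dates, one per learner
submissions <- tibble(
  first_submit = as.Date(c("2014-06-26", "2014-07-03", "2014-07-10", "2014-09-01"))
)

# Conditions are checked top to bottom; TRUE catches everything left over
submissions <- submissions %>%
  mutate(week = case_when(
    between(first_submit, week_seq[1], week_seq[2]) ~ 1,
    between(first_submit, week_seq[2] + 1, week_seq[3]) ~ 2,
    between(first_submit, week_seq[3] + 1, week_seq[4]) ~ 3,
    TRUE ~ 0
  ))
```

Because the first matching condition wins, the order of the lines matters when ranges share a boundary date.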
- ggplot2::scale_fill_hue has a labels parameter that allows you to change the labels of the legend.

scale_fill_hue(labels = c("High", "Low", "Medium"))
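In context, the call might look like the sketch below. The engagement data and its level codes are hypothetical. The labels are applied in the order of the legend entries, which for a character column is alphabetical, so it is worth checking that each label lands on the right level.

```r
library(ggplot2)

# Hypothetical counts per engagement level, coded with short names
engagement <- data.frame(level = c("h", "l", "m"),
                         n = c(10, 30, 20))

# Legend entries appear as h, l, m, so the labels are given in that order
p <- ggplot(engagement, aes(x = level, y = n, fill = level)) +
  geom_col() +
  scale_fill_hue(labels = c("High", "Low", "Medium"))
```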
- broom::augment is a helpful function when tidying models. According to the documentation, augment adds columns to a dataset, containing information such as fitted values, residuals or cluster assignments. All columns added to a dataset have . prefix to prevent existing columns from being overwritten.
In my case, I used augment in clustering to add a new column called cluster containing each observation's cluster number.
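A sketch of augment with k-means on simulated data (the features data frame is made up, and two clusters is an arbitrary choice). For a kmeans fit, broom names the new column .cluster, so it would need renaming to get a column called cluster.

```r
library(broom)

# Simulated two-dimensional data with two well separated groups
set.seed(42)
features <- data.frame(x = c(rnorm(20, mean = 0), rnorm(20, mean = 5)),
                       y = c(rnorm(20, mean = 0), rnorm(20, mean = 5)))

# Fit k-means, then attach each row's cluster assignment to the original data
fit <- kmeans(features, centers = 2)
clustered <- augment(fit, features)
```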
- Last but not least, the compound assignment pipe operator, %<>% (from the magrittr package), is very convenient as it saves you time and space. Less code means fewer errors.
Instead of
df <- df %>% mutate(time = c(a,b,c))
You can do
df %<>% mutate(time = c(a,b,c))
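One thing to watch for: library(tidyverse) attaches %>% but not %<>%, so magrittr has to be loaded explicitly. A runnable sketch with a made-up data frame:

```r
library(dplyr)
library(magrittr)  # provides %<>%

# Hypothetical starting data
df <- tibble(id = 1:3)

# Pipe df into mutate() and assign the result back to df in one step
df %<>% mutate(time = c(10, 20, 30))
```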