# Introduction: Seven Weeks of R

It has been seven weeks since I started my summer research in Educational Data Mining. That means seven weeks of using R to import, tidy, transform, and visualize data (Grolemund and Wickham, *R for Data Science*). While I really wanted to work in R, I ran into many obstacles, major and minor. There were many data manipulation and visualization issues that I had to deal with in order to extract insights from the data. My intent in this blog post is to share some of the tricks I learned in R to manipulate and visualize data inside the tidyverse.

# Data Manipulation and Visualization Tricks

1. Throughout the summer, the most important skill I gained was validating my code. For example, I checked whether `group_by` and `summarise` did what I intended them to do, as in this code chunk:

### Test Code

```r
library(tidyverse)

# Dummy data: four learners split across two clusters
test <- data.frame(
  anon_screen_name = c("howard", "baik", "tomatoes", "bannas"),
  cluster = c(1, 1, 2, 2)
)

# One known dropout in each cluster
test_dropout <- c("howard", "tomatoes")

# Expected: a dropout proportion of 0.5 for both clusters
test %>%
  group_by(cluster) %>%
  summarise(dropout_prop = mean(anon_screen_name %in% test_dropout))
```

```
## # A tibble: 2 x 2
##   cluster dropout_prop
##     <dbl>        <dbl>
## 1       1          0.5
## 2       2          0.5
```

- I wanted to test whether my `dplyr` code actually calculated the proportion of values in a dataframe column (`anon_screen_name`) that match a separate vector (`test_dropout`).
- The test confirmed that it did!

### Original Code

```r
bar_first <- new_clust_first_kmeans %>%
  mutate(cluster = as.character(cluster)) %>%
  group_by(cluster) %>%
  summarise(dropout_prop = mean(anon_screen_name %in% total_dropout))
```

When in doubt about my code, I learned to test the same code on a dummy dataset and see if the output matches my expectations. That way, I can be confident that my code will give me an accurate output on my original dataset.
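
One way to make this kind of check repeatable is to state the expected answer in code. The sketch below reuses the dummy data from the test above and asserts the result with base R's `stopifnot`. The expected value of 0.5 per cluster comes from working the example by hand, and the name `result` is just for illustration.

```r
library(dplyr)

test <- data.frame(
  anon_screen_name = c("howard", "baik", "tomatoes", "bannas"),
  cluster = c(1, 1, 2, 2)
)
test_dropout <- c("howard", "tomatoes")

result <- test %>%
  group_by(cluster) %>%
  summarise(dropout_prop = mean(anon_screen_name %in% test_dropout))

# Each cluster holds exactly one dropout out of two learners,
# so the script stops with an error if either proportion is not 0.5
stopifnot(isTRUE(all.equal(result$dropout_prop, c(0.5, 0.5))))
```

If the assertion passes silently, the pipeline behaves as expected on the dummy data.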

1. `dplyr::distinct()` is a super useful verb for selecting distinct / unique rows. According to the documentation, `distinct` retains only unique/distinct rows from an input tbl. This is similar to `unique.data.frame`, but considerably faster.
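
Here is a minimal sketch of `distinct()` on a made-up table (the name `enrolled` and its rows are assumptions, not my research data). It shows both dropping fully duplicated rows and keeping one row per value of a single column.

```r
library(dplyr)

# Hypothetical table in which "howard" was recorded twice
enrolled <- tibble(
  anon_screen_name = c("howard", "howard", "baik"),
  cluster = c(1, 1, 2)
)

# Drop rows that are exact duplicates across all columns
deduped <- enrolled %>% distinct()

# Or keep the first row for each value of one column, retaining the other columns
one_per_name <- enrolled %>% distinct(anon_screen_name, .keep_all = TRUE)
```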

### Table with Duplicate Rows

### Cleaned Table

1. `tidyr::replace_na()` allows you to replace missing values. After I saw the table below, I immediately turned to `replace_na()` to replace all the NA values with 0.
```r
# `df` is a placeholder for the messy table, whose columns are named "1" to "10"
df %>%
  replace_na(list("1" = 0,
                  "2" = 0,
                  "3" = 0,
                  "4" = 0,
                  "5" = 0,
                  "6" = 0,
                  "7" = 0,
                  "8" = 0,
                  "9" = 0,
                  "10" = 0))
```

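To try `replace_na()` end to end, here is a small sketch on a made-up wide table (`weekly` and its values are assumptions). It also shows a shorter alternative that I believe is equivalent for this case: applying the vector form of `replace_na()` to every numeric column with `dplyr::across()`, which requires dplyr 1.0 or later.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one column per week, NA where nothing was recorded
weekly <- tibble(
  anon_screen_name = c("howard", "baik"),
  `1` = c(3, NA),
  `2` = c(NA, 5)
)

# Name each column explicitly, as in the chunk above
cleaned <- weekly %>%
  replace_na(list(`1` = 0, `2` = 0))

# Alternative: replace NA in every numeric column at once
cleaned_all <- weekly %>%
  mutate(across(where(is.numeric), ~ replace_na(.x, 0)))
```

The explicit list is easier to audit, while the `across()` version avoids retyping column names when there are many weeks.
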
### Messy Table

### Cleaned Table

1. The `col.names` parameter in `kable` (from `knitr`, typically styled further with `kableExtra`) allows you to change the column names.
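
A minimal sketch of `col.names`, assuming a small summary table like the one from my dummy test (the name `summary_tbl` and the display names are made up):

```r
library(knitr)

# Hypothetical summary table with raw column names
summary_tbl <- data.frame(
  cluster = c("1", "2"),
  dropout_prop = c(0.5, 0.5)
)

# col.names takes one display name per column, in column order
pretty_tbl <- kable(summary_tbl,
                    format = "markdown",
                    col.names = c("Cluster", "Dropout Proportion"))
pretty_tbl
```

The underlying data frame keeps its original names; only the rendered table changes.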

### Table with Raw Column Names

### Table with Customized Column Names

1. `scales::percent_format` allows you to change the labels on your plots to percentages!

1. When drawing boxplots, I found too many outliers and wanted to somehow `jitter` just these points. The `outlier.alpha` parameter in `geom_boxplot` changes the transparency of only the outliers.

1. `dplyr::case_when` is a super useful function that replaces my old trick of nested `ifelse` statements.

### An Incomprehensible Example of ifelse

```r
# Code from another project:
odd_man = ifelse(odd_man %in% c("1-0", "2-1", "3-2", "4-3"), "one_man",
          ifelse(odd_man %in% c("2-0", "3-1", "4-2", "3-0"), "two_plus_man",
                 "all_other_shots"))
```

### A Readable Example of case_when

```r
case_when(
  between(first_submit, week_seq, week_seq) ~ 1,
  first_submit <= week_seq & first_submit >= as.Date("2014-07-02") ~ 2,
  first_submit <= week_seq & first_submit >= as.Date("2014-07-09") ~ 3,
  first_submit <= week_seq & first_submit >= as.Date("2014-07-16") ~ 4,
  first_submit <= week_seq & first_submit >= as.Date("2014-07-23") ~ 5,
  first_submit <= week_seq & first_submit >= as.Date("2014-07-30") ~ 6,
  first_submit <= week_seq & first_submit >= as.Date("2014-08-06") ~ 7,
  first_submit <= week_seq & first_submit >= as.Date("2014-08-13") ~ 8,
  first_submit <= week_seq & first_submit >= as.Date("2014-08-20") ~ 9,
  first_submit <= week_seq & first_submit >= as.Date("2014-08-27") ~ 10,
  TRUE ~ 0
)
```
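
To compare the two styles on the same logic, here is a sketch that rewrites the nested `ifelse` from the earlier example with `case_when`. The data frame `shots` and the new column `shot_type` are made up for illustration.

```r
library(dplyr)

# Hypothetical data: one row per shot, with the odd-man situation as a string
shots <- tibble(odd_man = c("1-0", "3-1", "5-5"))

shots <- shots %>%
  mutate(shot_type = case_when(
    odd_man %in% c("1-0", "2-1", "3-2", "4-3") ~ "one_man",
    odd_man %in% c("2-0", "3-1", "4-2", "3-0") ~ "two_plus_man",
    TRUE ~ "all_other_shots"  # fallback, like the final ifelse branch
  ))
```

Conditions are checked from top to bottom and the first one that is true wins, so more specific conditions should come before more general ones.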

1. `ggplot2::scale_fill_hue` has a `labels` parameter that allows you to change the labels of the legend, for example `scale_fill_hue(labels = c("High", "Low", "Medium"))`. The labels are applied in the order of the fill variable's levels.
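
The three plotting tricks in this post (percent labels, outlier transparency, and legend labels) can be combined in one plot. This is only a sketch on simulated data: the data frame `scores`, its columns, and the group codes `h`, `l`, and `m` are assumptions made up for the example.

```r
library(ggplot2)
library(scales)

set.seed(1)  # make the simulated data reproducible

# Hypothetical data: a completion rate between 0 and 1 for three groups
scores <- data.frame(
  engagement = rep(c("h", "l", "m"), each = 50),
  completion = pmin(pmax(rnorm(150, mean = 0.5, sd = 0.2), 0), 1)
)

p <- ggplot(scores, aes(x = engagement, y = completion, fill = engagement)) +
  geom_boxplot(outlier.alpha = 0.3) +                 # fade only the outlier points
  scale_y_continuous(labels = percent_format()) +     # 0.5 is shown as 50%
  scale_fill_hue(labels = c("High", "Low", "Medium")) # relabel h, l, m in level order

# Call print(p) to draw the plot
```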

1. `broom::augment` is a helpful function when tidying models. According to the documentation, `augment` adds columns to a dataset, containing information such as fitted values, residuals or cluster assignments. All columns added to a dataset have a `.` prefix to prevent existing columns from being overwritten.

In my case, I used `augment` after clustering to add a new column called `cluster` holding each row's cluster number.
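
As a rough sketch of that workflow, the example below runs k-means on the built-in `mtcars` data rather than my research data; the choice of features and of three clusters is arbitrary. `augment` adds the assignment as `.cluster`, so the final rename to `cluster` is one assumed way to end up with a column of that name.

```r
library(broom)
library(dplyr)

set.seed(42)  # k-means starts from random centers

# Two numeric features from a built-in dataset, scaled before clustering
features <- mtcars[, c("mpg", "hp")]
fit <- kmeans(scale(features), centers = 3)

# augment() appends the cluster assignment as a factor column named .cluster
clustered <- augment(fit, features)

# Rename the dot-prefixed column if a plain name is preferred
clustered <- clustered %>% rename(cluster = .cluster)
```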

1. Last but not least, the compound assignment pipe operator, `%<>%` (from the `magrittr` package), is very convenient as it saves you time and space. Less code means fewer errors.

```r
# Without the compound assignment pipe
df <- df %>% mutate(time = c(a,b,c))

# With the compound assignment pipe
df %<>% mutate(time = c(a,b,c))
```
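
A runnable sketch of the same idea, with made-up data. One thing to note is that `library(tidyverse)` attaches `%>%` but not `%<>%`, so `magrittr` has to be loaded explicitly.

```r
library(dplyr)
library(magrittr)  # provides %<>%

# Hypothetical data frame
df <- tibble(id = 1:3)

# Pipe df into mutate() and assign the result back to df in one step
df %<>% mutate(time = c(10, 20, 30))
```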