Data Visualization of Kaggle ML / Data Science Survey 2017

Python vs R

This is my first-ever data analysis on my blog, and I’m excited to show what I have worked on with the Kaggle ML and Data Science Survey 2017. For those of you not familiar with this survey: “Kaggle conducted an industry-wide survey to establish a comprehensive view of the state of data science and machine learning. The survey received over 16,000 responses and we learned a ton about who is working with data, what’s happening at the cutting edge of machine learning across industries, and how new data scientists can best break into the field.” Source: Kaggle ML and Data Science Survey 2017.

I will primarily be using the multiple choice responses. I am definitely open to any feedback on my code and any ideas for more visualizations! I want to make more visualizations of this dataset that generate better insights in the battle between Python and R.

Data Visualizations I will be walking you through:

  1. Bar chart of responses to a question on job skill importance for Python and R
  2. Bar chart of job satisfaction of code writers vs non-code writers
  3. Bar chart of work tool frequency for Python and R
  4. Density plot of compensation for recommended languages

# Load the most important package in R
library(tidyverse)

# Read in kaggle multiple choice data
kaggle_mc <- read_csv("kaggle_mc.csv")

Purpose of below graph:

I visualize how Kagglers responded to the survey question: How important do you think the below skills or certifications are in getting a data science job? The skills: Fluent in R / Python

kaggle_mc %>% 
  select(JobSkillImportancePython, JobSkillImportanceR) %>% 
  # tidy the data so that there is one column with the language and another column with the responses
  gather(coding_lang, response, JobSkillImportancePython, JobSkillImportanceR) %>% 
  # keep only rows with a response (assumed: missing values are NA)
  filter(!is.na(response)) %>% 
  # draw a bar chart of the responses segmented by programming language
  ggplot(aes(x = response)) +
  geom_bar(aes(fill = coding_lang), position = "dodge") +
  guides(fill = guide_legend(title = "Coding Languages"))

More respondents rated fluency in Python as “Necessary” for getting a data science job than rated fluency in R that way. For R, more respondents chose “Nice to have” or “Unnecessary”.
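Raw counts can be misleading if the two questions received different numbers of answers, so one way to double-check this reading is to compare the share of each response within each language. Below is a minimal sketch of that idea. The tibble `toy` and its values are made up for illustration and are not survey data; only the two column names come from the dataset.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for kaggle_mc (made-up responses, NOT survey data)
toy <- tibble(
  JobSkillImportancePython = c("Necessary", "Necessary", "Nice to have", NA),
  JobSkillImportanceR      = c("Necessary", "Nice to have", "Unnecessary", "Nice to have")
)

response_share <- toy %>%
  # one column with the language, one column with the response
  gather(coding_lang, response, JobSkillImportancePython, JobSkillImportanceR) %>%
  filter(!is.na(response)) %>%
  # count each response per language
  count(coding_lang, response) %>%
  # turn counts into shares that sum to 1 within each language
  group_by(coding_lang) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

print(response_share)
```

Mapping `share` to the y axis with `geom_col(position = "dodge")` would then give a bar chart of proportions instead of counts.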

Key Takeaway of above graph:

According to these respondents, knowing Python matters more than knowing R for getting a data science job, with R secondary in importance. I agree with this result based upon Stack Overflow Data Scientist David Robinson’s blog posts: The Incredible Growth of Python and Why is Python Growing So Quickly?

Purpose of below graph:

I look at current job satisfaction of code writers vs non-code writers.

kaggle_mc %>% 
  select(CodeWriter, JobSatisfaction) %>% 
  ggplot(aes(x = JobSatisfaction)) +
  geom_bar(aes(fill = CodeWriter), position = "dodge") +
  coord_flip() +
  xlab("Job Satisfaction Score") +
  ylab("Count") +
  guides(fill = guide_legend(title = "Code Writer?"))

Key Takeaway from above graph:

Non-coders did not respond to the job satisfaction question, so it is hard to comment on their job satisfaction. On the other hand, the job satisfaction of coders is skewed to the left, with most responses at the higher scores. Thus, coders are mostly satisfied with their current jobs.
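A quick way to confirm that one group left the question blank is to count non-missing answers per group. Here is a minimal sketch, assuming unanswered questions are stored as NA. The tibble `toy` and its values are made up for illustration; only the column names `CodeWriter` and `JobSatisfaction` come from the dataset.

```r
library(dplyr)

# Toy stand-in for kaggle_mc (made-up values, NOT survey data)
toy <- tibble(
  CodeWriter      = c("Yes", "Yes", "Yes", "No", "No"),
  JobSatisfaction = c("8", "7", NA, NA, NA)
)

satisfaction_answered <- toy %>%
  group_by(CodeWriter) %>%
  summarise(
    respondents = n(),                          # rows in each group
    answered    = sum(!is.na(JobSatisfaction)), # rows with a non-missing answer
    .groups     = "drop"
  )

print(satisfaction_answered)
```

If `answered` is 0 for a group, that group cannot contribute any bars to the chart.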

Purpose of below graph:

I compare how frequently Python and R are used as work tools.

kaggle_mc %>% 
  select(WorkToolsFrequencyPython, WorkToolsFrequencyR) %>% 
  # keep only rows where both questions were answered (assumed: missing values are NA)
  filter(!is.na(WorkToolsFrequencyPython), !is.na(WorkToolsFrequencyR)) %>% 
  # tidy the data so that we have one column with the coding language and another column with the frequency of usage
  gather(coding_lang, frequency, WorkToolsFrequencyPython, WorkToolsFrequencyR) %>% 
  # plot a bar chart of frequency segmented by coding language
  ggplot(aes(x = frequency)) +
  geom_bar(aes(fill = coding_lang), position = "dodge") +
  xlab("Frequency")

Key Takeaway:

In the “Most of the time” and “Often” responses, Python has a slight edge over R, while in the “Rarely” and “Sometimes” responses, R has the advantage. From this I infer that Python is used more than R in the workplace. This makes sense: from what I’ve read on LinkedIn, Twitter, and blog posts, Python is used more than R in industry, while R is used more in academia.
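By default ggplot2 sorts text categories alphabetically, which puts “Most of the time” before “Rarely”. Converting the column to a factor with explicit levels makes the bars run from least to most frequent. Below is a minimal sketch; the tibble `toy` is made up for illustration, and the level order in `freq_levels` is my assumption based on the four answers mentioned above.

```r
library(dplyr)

# Assumed ordering of the four answers, from least to most frequent
freq_levels <- c("Rarely", "Sometimes", "Often", "Most of the time")

# Toy stand-in for the tidied data (made-up values, NOT survey data)
toy <- tibble(
  coding_lang = c("Python", "Python", "R", "R", "R"),
  frequency   = c("Often", "Most of the time", "Rarely", "Sometimes", "Often")
)

toy_ordered <- toy %>%
  # a factor remembers the level order, and ggplot2 uses it for the axis
  mutate(frequency = factor(frequency, levels = freq_levels))

print(levels(toy_ordered$frequency))
```

The same `mutate()` line could go right after the `gather()` step, before the call to `ggplot()`.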

Purpose of below graph:

I compare compensation across recommended coding languages, which are based on the survey question: What programming language would you recommend a new data scientist learn first?

library(ggridges) # A graphing library that adds life to the ggplot

kaggle_mc_comp <- kaggle_mc %>% 
  filter(CompensationCurrency == "USD") %>%
  select(LanguageRecommendationSelect, CompensationAmount)

# remove commas in the numbers
kaggle_mc_comp[["CompensationAmount"]] <- gsub(',', '', kaggle_mc_comp$CompensationAmount)

kaggle_mc_comp %>% 
  mutate(CompensationAmountMillions = as.numeric(CompensationAmount)/1000000) %>% 
  # keep only rows with both values present (assumed: missing values are NA)
  filter(!is.na(CompensationAmountMillions), !is.na(LanguageRecommendationSelect)) %>% 
  ggplot(aes(x = CompensationAmountMillions,
             y = LanguageRecommendationSelect,
             fill = factor(LanguageRecommendationSelect))) +
  geom_density_ridges() +
  xlim(c(0, 4.0)) +
  xlab("Compensation in Millions") +
  ylab("Recommended Languages") +
  ggtitle("Compensation for Recommended Languages") +
  guides(fill = guide_legend(title = "Recommended Languages"))

Key Takeaway from above graph:

It is hard to tell the difference in height of these density plots. I may have to find another plot (boxplots?). However, what I do see is that there are some who recommend Julia and earn approximately $1 million USD. These may be outliers.
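As a possible next step, a per-language median and a boxplot are less sensitive to a few very large salaries than a density plot. Here is a minimal sketch of that alternative. The tibble `toy_comp` and its values are made up for illustration and are not survey data; only the two column names come from the dataset.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for kaggle_mc_comp (made-up values, NOT survey data)
toy_comp <- tibble(
  LanguageRecommendationSelect = c("Python", "Python", "Python", "R", "R", "R"),
  CompensationAmount           = c("80,000", "120,000", "1,000,000",
                                   "70,000", "90,000", "110,000")
)

# remove commas, then convert the text to numbers
toy_comp <- toy_comp %>%
  mutate(CompensationAmount = as.numeric(gsub(",", "", CompensationAmount)))

# the median is robust to a single very large value
median_comp <- toy_comp %>%
  group_by(LanguageRecommendationSelect) %>%
  summarise(median_usd = median(CompensationAmount), .groups = "drop")

# boxplot with one box per recommended language
comp_boxplot <- ggplot(toy_comp,
                       aes(x = LanguageRecommendationSelect, y = CompensationAmount)) +
  geom_boxplot() +
  coord_flip() +
  xlab("Recommended Languages") +
  ylab("Compensation (USD)")

print(median_comp)
```

Calling `print(comp_boxplot)` draws the plot; values far from the box show up as individual points, which makes potential outliers easy to spot.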

Summary of the four visualizations

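  • Respondents rate Python as more important than R for getting a data science job.
  • Workers who code are mostly satisfied with their jobs.
  • Python is used more frequently than R in the workplace.
  • Differences in compensation across recommended languages are hard to read from the density plot.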

I look forward to reading your comments and feedback. (If anyone knows how to change the contents of the legend, please do leave me a note.)

Thanks for reading. Merry Christmas!

Howard Baek
Biostatistics Master’s student
