Stop and Frisk: Hypothesis Testing


From New York Civil Liberties Union

Every time a police officer stops a person in NYC, the officer is supposed to fill out a form recording the details of the stop. The forms were filled out by hand and manually entered into an NYPD database until 2017, when the forms became electronic. The annual database includes nearly all of the data recorded by the police officer after a stop such as the age of the person stopped, if a person was frisked, if there was a weapon or firearm recovered, if physical force was used, and the exact location of the stop within the precinct.

In this post, I will perform an exploratory data analysis of the stop-and-frisk dataset provided by the NYPD on the New York Civil Liberties Union website. The data contains 11008 rows and 83 variables. Then, I will use hypothesis testing with the infer package to assess whether there is a statistically significant difference in police action when approaching a black New Yorker vs a non-black New Yorker.

Code is available here

Exploratory Data Analysis

  • Recorded stop durations cluster at multiples of 5. This seems like a clear example of officers rounding to the nearest 5 minutes, as it's unlikely that the actual stop durations are always multiples of 5.
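
One quick way to check for this kind of rounding is to compute the share of recorded durations that are exact multiples of 5. The sketch below uses a small made-up vector rather than the real stop_duration_minutes column, so the numbers are purely illustrative.

```r
# Hypothetical stop durations in minutes (not taken from the NYPD data)
stop_duration_minutes <- c(5, 10, 10, 15, 7, 20, 30, 5, 12, 10)

# Share of durations that are exact multiples of 5
share_multiple_of_5 <- mean(stop_duration_minutes %% 5 == 0)
share_multiple_of_5
```

If durations were recorded to the minute without rounding, we would expect roughly one in five values to be a multiple of 5, so a share far above that points to rounding.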

Understanding Police Stops with Maps

  • In this map, I’m using a couple of tricks I learned from David Robinson’s screencasts. First, to better show the higher values of stop_duration_minutes, I arrange the rows in increasing order of stop_duration_minutes. ggplot then plots the lower values (lighter colors) first and the higher values (redder colors) on top of them, which emphasizes regions with long stop durations. Second, the histogram shows that stop duration is skewed to the right; according to David, “you can go a little bit below the median, but way above”. As a result, I transform the color scale using the trans argument and set the midpoint to log10(median(value)). This way, the color scale gives more meaning to the data.
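
The two tricks can be sketched as follows. This is a minimal illustration on simulated data, not the code behind the map: the data frame stops, the coordinate columns stop_location_x and stop_location_y, and the color choices are all assumptions. Newer ggplot2 releases name the trans argument transform.

```r
library(dplyr)
library(ggplot2)

set.seed(1)

# Simulated stand-in for the stop data; the coordinate column names are assumptions
stops <- tibble(
  stop_location_x = runif(200),
  stop_location_y = runif(200),
  stop_duration_minutes = round(rexp(200, rate = 1 / 10)) + 1  # right-skewed, >= 1
)

p <- stops %>%
  # Trick 1: sort ascending so the longest stops are drawn last, on top
  arrange(stop_duration_minutes) %>%
  ggplot(aes(stop_location_x, stop_location_y, color = stop_duration_minutes)) +
  geom_point() +
  # Trick 2: log10 color scale centered on the median duration
  scale_color_gradient2(
    low = "blue", high = "red",
    trans = "log10",
    midpoint = log10(median(stops$stop_duration_minutes))
  )
```

Because the scale is transformed, the midpoint has to be given on the log10 scale as well, which is why it is log10(median(...)) rather than the median itself.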

  • We can see that suspects are predominantly white or black, with apparently more black suspects than white suspects.

Racial discrimination in frisking/searching suspects

I will perform hypothesis testing to find whether police engage in an action, specified by each of these columns, more often with black suspects than with non-black suspects in a statistically significant manner. Specifically, I will use the infer package, following the steps laid out in Modern Dive Chapter 9.

  • This faceted plot shows that the police frisk, handcuff, and use restraint and verbal instruction on black suspects more than on non-black suspects.

Using infer for hypothesis testing

  • Let’s first create the processed version of sqf by taking out (null) inside suspect_race_description and lumping suspect_race_description into two groups of race: BLACK and NON-BLACK.
sqf_testing <- sqf %>% 
  filter(suspect_race_description != "(null)") %>% 
  mutate(suspect_race_description = if_else(suspect_race_description == "BLACK HISPANIC", "BLACK",
                                                         suspect_race_description)) %>% 
  mutate(suspect_race_description = if_else(suspect_race_description != "BLACK", "NON-BLACK",
                                            suspect_race_description))

  • Then, we use specify to declare the response and explanatory variables. Let’s use frisked_flag as the response variable and suspect_race_description as the explanatory variable. The argument success = "Y" indicates that we are interested in the proportion of “Y” in the frisked_flag column.
sqf_testing %>% 
  specify(formula = frisked_flag ~ suspect_race_description, success = "Y") 

  • Now, we set the metadata required for hypothesis testing: we set null = "independence" for a two-sample (BLACK and NON-BLACK) hypothesis test.
sqf_testing %>% 
  specify(formula = frisked_flag ~ suspect_race_description, success = "Y") %>% 
  hypothesize(null = "independence") 

  • Here, we generate replicates of “shuffled” datasets assuming the null hypothesis is true.
sqf_testing %>% 
  specify(formula = frisked_flag ~ suspect_race_description, success = "Y") %>% 
  hypothesize(null = "independence") %>% 
  generate(reps = 1000, type = "permute")
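
To see what a single “shuffled” replicate looks like, here is a minimal base R sketch of the idea behind a permutation, using a made-up toy dataset rather than sqf_testing. It is an illustration of the concept, not infer’s internal implementation.

```r
set.seed(42)

# Hypothetical toy data, not the stop-and-frisk records
race    <- rep(c("BLACK", "NON-BLACK"), times = c(6, 4))
frisked <- c("Y", "Y", "Y", "Y", "N", "N", "Y", "N", "N", "N")

# One permutation replicate: shuffle one column relative to the other.
# The totals of "Y" and "N" stay the same, but any link to race is broken.
frisked_shuffled <- sample(frisked)

table(race, frisked)           # original pairing
table(race, frisked_shuffled)  # pairing under the null of independence
```

Repeating this shuffle many times (1000 in this post) gives the collection of datasets we would expect to see if frisking were unrelated to race.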

  • Now, we calculate the appropriate summary statistic, called the test statistic, for each of our 1000 shuffles. According to Modern Dive, “…since the unknown population parameter of interest is the difference in population proportions, the test statistic of interest here is the difference in sample proportions”. This gives us 1000 values of stat, and we assign this data frame to null_distribution.
null_distribution <- sqf_testing %>% 
  specify(formula = frisked_flag ~ suspect_race_description, success = "Y") %>% 
  hypothesize(null = "independence") %>% 
  generate(reps = 1000, type = "permute") %>% 
  calculate(stat = "diff in props", order = c("BLACK", "NON-BLACK"))

  • Here, we calculate the observed difference in proportions between black and non-black suspects by using the same code as above, except that we remove hypothesize and generate.
obs_diff_prop <- sqf_testing %>% 
  specify(formula = frisked_flag ~ suspect_race_description, success = "Y") %>% 
  calculate(stat = "diff in props", order = c("BLACK", "NON-BLACK"))
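
The “diff in props” statistic with order = c("BLACK", "NON-BLACK") is the proportion of “Y” among black suspects minus the proportion of “Y” among non-black suspects. A hand computation on a made-up toy dataset (not the real data) makes the arithmetic explicit:

```r
# Hypothetical toy data, not the stop-and-frisk records
race    <- rep(c("BLACK", "NON-BLACK"), times = c(6, 4))
frisked <- c("Y", "Y", "Y", "Y", "N", "N", "Y", "N", "N", "N")

prop_black     <- mean(frisked[race == "BLACK"] == "Y")      # 4 of 6 frisked
prop_non_black <- mean(frisked[race == "NON-BLACK"] == "Y")  # 1 of 4 frisked

# First group in `order` minus the second group
obs_diff <- prop_black - prop_non_black
obs_diff
```

A positive value means the action was recorded more often for the first group, which is why the later tests use direction = "right".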

  • Now, we visualize the null_distribution (values of the difference in proportions assuming that there is no racial discrimination) with a histogram and then “add what happened in real-life” with a red line and shades. Here, the shaded region is the p-value:

A p-value is the probability of obtaining a test statistic just as or more extreme than the observed test statistic assuming the null hypothesis is true.

visualize(null_distribution, bins = 10) + 
  shade_p_value(obs_stat = obs_diff_prop, direction = "right")

  • In the above graph, we clearly see that the p-value is extremely low. We calculate the exact p-value (“fraction that the null distribution is shaded”) with get_p_value.
null_distribution %>% 
  get_p_value(obs_stat = obs_diff_prop, direction = "right")
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1       0
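
A reported p-value of 0 does not mean the probability is literally zero. It means that none of the 1000 shuffled statistics was at least as large as the observed one, so the p-value is better read as “below 1/1000”. The base R sketch below, on simulated toy data, shows the idea behind a right-tailed permutation p-value and a common add-one variant that can never be exactly zero. The helper diff_in_props and all the numbers are illustrative assumptions, not part of infer.

```r
set.seed(2024)

# Hypothetical toy data, not the stop-and-frisk records
race    <- rep(c("BLACK", "NON-BLACK"), times = c(60, 40))
frisked <- c(rep(c("Y", "N"), times = c(40, 20)),   # BLACK: 40 of 60 frisked
             rep(c("Y", "N"), times = c(15, 25)))   # NON-BLACK: 15 of 40 frisked

# Difference in proportions of "Y": BLACK minus NON-BLACK
diff_in_props <- function(response, group) {
  mean(response[group == "BLACK"] == "Y") -
    mean(response[group == "NON-BLACK"] == "Y")
}

obs_stat   <- diff_in_props(frisked, race)
null_stats <- replicate(1000, diff_in_props(sample(frisked), race))

# Right-tailed p-value: share of shuffled statistics at least as large as observed
n_extreme <- sum(null_stats >= obs_stat)
p_value   <- n_extreme / length(null_stats)

# Add-one variant: counts the observed data as one more permutation
p_value_plus_one <- (n_extreme + 1) / (length(null_stats) + 1)
```

With 1000 replicates the smallest value the add-one variant can take is 1/1001, which is a more honest way to report a result like the one above.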

  • Below, we create a wrapper function that does all the steps we just covered so that we can iterate this hypothesis testing on all our desired columns.
# Wrapper Function
run_hypothesis_test <- function(response_var) {
  # Build the formula, e.g. frisked_flag ~ suspect_race_description
  f <- as.formula(
        paste(response_var, "suspect_race_description",
        sep = " ~ "))
  null_distribution <- sqf_testing  %>% 
    specify(formula = f, success = "Y") %>% 
    # "independence" for hypotheses involving two samples
    hypothesize(null = "independence") %>% 
    generate(reps = 1000, type = "permute") %>% 
    calculate(stat = "diff in props", order = c("BLACK", "NON-BLACK"))
  obs_diff_prop <- sqf_testing %>% 
    specify(formula = f, success = "Y") %>% 
    calculate(stat = "diff in props", order = c("BLACK", "NON-BLACK"))
  # Calculate P-value
  null_distribution %>% 
    get_p_value(obs_stat = obs_diff_prop, direction = "right") # Because P(BLACK) > P(NON-BLACK)
}

# Find desired variables
testing_names <- sqf %>% 
  select(contains("physical"), contains("frisked"), contains("searched")) %>% 
  names() %>% 
  set_names() # name each element after itself so .id below records the column name

results <- testing_names %>% 
  map_df(~(data.frame(p_value = run_hypothesis_test(.x))),
         .id = "response_variable")
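
A wrapper like this one has to turn a column name held in a string into a formula. The base R sketch below shows two ways to do that. The variable names come from this post, but the snippet is a standalone illustration and does not touch the data.

```r
response_var <- "frisked_flag"

# Option 1: paste the two sides together and convert the string
f <- as.formula(paste(response_var, "suspect_race_description", sep = " ~ "))

# Option 2: reformulate() builds the same formula from its parts
f2 <- reformulate("suspect_race_description", response = response_var)

f
all.vars(f)  # the response first, then the explanatory variable
```

Either formula can be passed to specify(formula = ...), which is what lets the same test run over many response columns.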

All response variables have p-values greater than 0.05 except for physical_force_restraint_used_flag and frisked_flag. For these two variables, we reject the null hypothesis at the 0.05 significance level and conclude that there is evidence to suggest that police frisk and use physical restraint on black suspects more than on non-black suspects. For the remaining variables, we fail to reject the null hypothesis.

Howard Baek
Biostatistics Master’s student
