Introduction to dplyr, using SAO data

November 18, 2017

I recently wrote a post looking at dispositions coming out of the Cook County State Attorney's Office (SAO) - that post analyzed data that the SAO recently released on how they've handled cases since 2011. All of the analysis I did was in R, and made heavy use of the dplyr library. While there was a little bit of a learning curve at first, I found dplyr to be much easier to use than other similar libraries in R.

This post explains my analysis showing the actual code used - it also serves as a "hands on" introduction to the dplyr set of tools.

Preparing our data

The data analyzed below is made up of dispositions ("outcomes") of cases handled by the SAO between 2011 and 2016. It's broken down by year, disposition, offense type, race, and the total number of defendants - e.g. one row would tell us that the SAO had 903 narcotics cases that were pled guilty involving white defendants in 2015. The full dataset can be accessed here.

df <- read.csv('State_s_Attorney_Felony_Cases_-_Disposition_Outcomes_By_Offense_Type_and_Defendant_Race.csv')
kable(head(df))
YEAR DISPOSITION OFFENSE.TYPE RACE DEFENDANT.COUNT
2011 Aggravated Battery Black 1
2011 Aggravated Battery With A Firearm Black 1
2011 Aggravated DUI 2
2011 Aggravated DUI Black 2
2011 Aggravated DUI White [Hispanic or Latino] 2
2011 Aggravated Fleeing and Eluding Black 1
unique_dispositions <- unique(df$DISPOSITION)
length(unique_dispositions)

## [1] 33

unique_dispositions[grepl('^Plea', unique_dispositions)]

## [1] Plea Of Guilty                   Plea of Guilty - Amended Charge
## [3] Plea of Guilty - Lesser Included Plea of Guilty But Mentally Ill
## 33 Levels:  BFW Case Dismissed Charge Rejected ... WOWI

As you can see from the first few rows of the table, we are working with aggregated counts of offenses, by year, offense type, disposition, and race. There are 33 different dispositions, some of which are very similar. Let's do a little preprocessing on the data to make it easier to work with. For our use case here, we're mostly interested in dispositions based on the particular offense type.

prep_disaggregated <- function(df) {
  df <- df %>%
    # Exclude blank dispositions, there are a few
    filter(DISPOSITION != '') %>%
    # Keep year, disposition, and offense type
    group_by(YEAR, DISPOSITION, OFFENSE.TYPE) %>%
    # Sum up the total number of defendants
    summarise(total_defendants=sum(DEFENDANT.COUNT)) %>%
    # Add an extra column to determine whether the particular disposition was a guilty plea. There
    # are 4 different dispositions for guilty pleas, which we can identify with the '^Plea' regular
    # expression
    mutate(
      pled_guilty = grepl('^Plea', DISPOSITION)
    )

  return(df)
}
disagg <- prep_disaggregated(df)
kable(head(disagg))
YEAR DISPOSITION OFFENSE.TYPE total_defendants pled_guilty
2011 BFW Aggravated Battery 1 FALSE
2011 BFW Aggravated DUI 20 FALSE
2011 BFW Driving With Suspended Or Revoked License 3 FALSE
2011 BFW DUI 3 FALSE
2011 BFW Identity Theft 1 FALSE
2011 BFW Narcotics 8 FALSE


mutate, group_by, summarise - getting answers out of our data

Now that we have our data prepared, we can start looking at the first question of interest: how often are plea bargains used in narcotics cases vs. other cases? The dplyr library makes this kind of analysis much easier. The below snippet uses a number of the library's features to group the data by narcotics vs. non-narcotics, sum the total number of cases of each, and break down the number of each case type by the % of such cases that were pled guilty. There are comments in-line as well.

narcotics_vs_other_plea_rates <- function(df) {
  # Expects to have a "pled_guilty" column already present
  plea_rate_by_narco <- df %>%
    # Add a boolean column to indicate whether the offense was a narcotics cases or not
    mutate(narcotics_case = OFFENSE.TYPE == 'Narcotics') %>%
    # Now group by that new column - dplyr is flexible enough to use "new" column in expressions that
    # follow
    group_by(narcotics_case) %>%
    # Now that we're grouping by the `narcotics_case` boolean, sum up the total number of cases, the number
    # of cases that were pled guilt (as indicated by the `pled_guilty` field), the the percentage of cases
    # pled guilty. Note that columns used in `summarise` can reference one another - e.g. pct_pled_guilty
    # is a function of two other columns in that same `summarise` invocation.
    summarise(
      total_cases = sum(total_defendants),
      num_pled_guilty = sum(total_defendants[pled_guilty]),
      pct_pled_guilty= num_pled_guilty / total_cases
    )
  return(plea_rate_by_narco)
}
plea_rates_by_narco <- narcotics_vs_other_plea_rates(disagg)
kable(plea_rates_by_narco)
narcotics_case total_cases num_pled_guilty pct_pled_guilty
FALSE 125918 95295 0.7568020
TRUE 101924 43866 0.4303795

Especially when you're new to dplyr (or any library), it's good to double check your calculations. To double check that the summary above makes sense, we can use some slightly simpler expressions to hand-calculate the same numbers. As a library becomes more familiar, these gut-checks may be less necessary.

# Check total number of narcotics cases
only_narcotics_cases <- disagg %>%
  filter(OFFENSE.TYPE == 'Narcotics')
sum(only_narcotics_cases$total_defendants)

## [1] 101924

# Check total number of _non_ narcotics cases
non_narcotics_cases <- disagg %>%
  filter(OFFENSE.TYPE != 'Narcotics')
sum(non_narcotics_cases$total_defendants)

## [1] 125918

# Check total number of pled-out narcotics cases
pled_out_narcotics_cases <- disagg %>%
  filter(OFFENSE.TYPE == 'Narcotics') %>%
  filter(pled_guilty)
sum(pled_out_narcotics_cases$total_defendants)

## [1] 43866

# Check total number of pled-out _non_ narcotics cases
pled_out_non_narcotics_cases <- disagg %>%
  filter(OFFENSE.TYPE != 'Narcotics') %>%
  filter(pled_guilty)
sum(pled_out_non_narcotics_cases$total_defendants)

## [1] 95295

arrange - why is the narcotics plea rate so low?

The vast majority of narcotics cases result in a guilty plea - see, e.g. this report from the Human Rights Watch. It focuses on federal drug cases, but the general trend of pleas holds at the state level as well. The explanation for those high rates is complicated, and I won't attempt to explain that here. But our analysis above suggested that the plea rate was just 43% for narcotics cases, which suggests that we're not interpreting the data incorrectly.

To investigate what's happening in general with narcotics cases, we'd like to answer the question: what are the most common dispositions in narcotics cases? What % of all cases do those dispositions represent? dplyr can help us again here, this time using arrange to order our data.

most_common_dispos <- function(df) {
  num_all_dispos <- sum(df$total_defendants)
  most_common <- df %>%
    # We're interested in individual dispositions, so group by them here
    group_by(DISPOSITION) %>%
    # Add columns for the total number of cases per disposition, and the % of all cases
    summarise(
      total_dispos = sum(total_defendants),
      pct_of_total = total_dispos / num_all_dispos
    ) %>%
    # Arrange the rows by our newly added `total_dispos` column. Using `desc` orders the rows in descending
    # order, using the `total_dispos` column.
    arrange(desc(total_dispos))

  return(most_common)
}
most_common_narcotics_dispos <- most_common_dispos(only_narcotics_cases)
kable(head(most_common_narcotics_dispos))
DISPOSITION total_dispos pct_of_total
Plea Of Guilty 43782 0.4295554
FNPC 26911 0.2640301
Nolle Prosecution 26365 0.2586731
Finding Guilty 2045 0.0200640
FNG 1735 0.0170225
Death Suggested-Cause Abated 184 0.0018053

The above data show that a large percentage of narcotics cases are being dismissed ("FNPC" means "finding of no probable cause" and "Nolle Prosecution" simply means that the prosecution decided not to proceed). I investigated why that was the case in a little bit more detail in a previous post, for those curious.

Conclusion

As someone who hasn't worked with R for a few years, the introduction of dplyr has made exploring data and running analyses much easier! I had previously used plyr for similar use cases, but found it difficult to read and understand. Aside from the funny %>% syntax of dplyr, I find it much easier to read, write, and understand.

I hope this was a helpful "tutorial" / introduction to dplyr - suggestions, questions, and other feedback is welcome! Please reach out via email or on Twitter.