Introduction to dplyr, using SAO data

Posted on November 18, 2017 in blog

I recently wrote a post looking at dispositions coming out of the Cook County State Attorney's Office (SAO) - that post analyzed data that the SAO recently released on how they've handled cases since 2011. All of the analysis I did was in R, and made heavy use of the dplyr library. While there was a little bit of a learning curve at first, I found dplyr to be much easier to use than other similar libraries in R.

This post explains my analysis showing the actual code used - it also serves as a "hands on" introduction to the dplyr set of tools.

Preparing our data

The data analyzed below is made up of dispositions ("outcomes") of cases handled by the SAO between 2011 and 2016. It's broken down by year, disposition, offense type, race, and the total number of defendants - e.g. one row would tell us that the SAO had 903 narcotics cases that were pled guilty involving white defendants in 2015. The full dataset can be accessed here.

df <- read.csv('State_s_Attorney_Felony_Cases_-_Disposition_Outcomes_By_Offense_Type_and_Defendant_Race.csv')
kable(head(df))

YEAR	OFFENSE.TYPE	RACE	DEFENDANT.COUNT
2011	Aggravated Battery	Black	1
2011	Aggravated Battery With A Firearm	Black	1
2011	Aggravated DUI		2
2011	Aggravated DUI	Black	2
2011	Aggravated DUI	White [Hispanic or Latino]	2
2011	Aggravated Fleeing and Eluding	Black	1

unique_dispositions <- unique(df$DISPOSITION)
length(unique_dispositions)

## [1] 33

unique_dispositions[grepl('^Plea', unique_dispositions)]

## [1] Plea Of Guilty                   Plea of Guilty - Amended Charge
## [3] Plea of Guilty - Lesser Included Plea of Guilty But Mentally Ill
## 33 Levels:  BFW Case Dismissed Charge Rejected ... WOWI

As you can see from the first few rows of the table, we are working with aggregated counts of offenses, by year, offense type, disposition, and race. There are 33 different dispositions, some of which are very similar. Let's do a little preprocessing on the data to make it easier to work with. For our use case here, we're mostly interested in dispositions based on the particular offense type.

prep_disaggregated <- function(df) {
  df <- df %>%
    # Exclude blank dispositions, there are a few
    filter(DISPOSITION != '') %>%
    # Keep year, disposition, and offense type
    group_by(YEAR, DISPOSITION, OFFENSE.TYPE) %>%
    # Sum up the total number of defendants
    summarise(total_defendants=sum(DEFENDANT.COUNT)) %>%
    # Add an extra column to determine whether the particular disposition was a guilty plea. There
    # are 4 different dispositions for guilty pleas, which we can identify with the '^Plea' regular
    # expression
    mutate(
      pled_guilty = grepl('^Plea', DISPOSITION)
    )

  return(df)
}
disagg <- prep_disaggregated(df)
kable(head(disagg))

YEAR	DISPOSITION	OFFENSE.TYPE	total_defendants	pled_guilty
2011	BFW	Aggravated Battery	1	FALSE
2011	BFW	Aggravated DUI	20	FALSE
2011	BFW	Driving With Suspended Or Revoked License	3	FALSE
2011	BFW	DUI	3	FALSE
2011	BFW	Identity Theft	1	FALSE
2011	BFW	Narcotics	8	FALSE

`mutate`, `group_by`, `summarise` - getting answers out of our data

Now that we have our data prepared, we can start looking at the first question of interest: how often are plea bargains used in narcotics cases vs. other cases? The dplyr library makes this kind of analysis much easier. The below snippet uses a number of the library's features to group the data by narcotics vs. non-narcotics, sum the total number of cases of each, and break down the number of each case type by the % of such cases that were pled guilty. There are comments in-line as well.

narcotics_vs_other_plea_rates <- function(df) {
  # Expects to have a "pled_guilty" column already present
  plea_rate_by_narco <- df %>%
    # Add a boolean column to indicate whether the offense was a narcotics cases or not
    mutate(narcotics_case = OFFENSE.TYPE == 'Narcotics') %>%
    # Now group by that new column - dplyr is flexible enough to use "new" column in expressions that
    # follow
    group_by(narcotics_case) %>%
    # Now that we're grouping by the `narcotics_case` boolean, sum up the total number of cases, the number
    # of cases that were pled guilt (as indicated by the `pled_guilty` field), the the percentage of cases
    # pled guilty. Note that columns used in `summarise` can reference one another - e.g. pct_pled_guilty
    # is a function of two other columns in that same `summarise` invocation.
    summarise(
      total_cases = sum(total_defendants),
      num_pled_guilty = sum(total_defendants[pled_guilty]),
      pct_pled_guilty= num_pled_guilty / total_cases
    )
  return(plea_rate_by_narco)
}
plea_rates_by_narco <- narcotics_vs_other_plea_rates(disagg)
kable(plea_rates_by_narco)

narcotics_case	total_cases	num_pled_guilty	pct_pled_guilty
FALSE	125918	95295	0.7568020
TRUE	101924	43866	0.4303795

Especially when you're new to dplyr (or any library), it's good to double check your calculations. To double check that the summary above makes sense, we can use some slightly simpler expressions to hand-calculate the same numbers. As a library becomes more familiar, these gut-checks may be less necessary.

# Check total number of narcotics cases
only_narcotics_cases <- disagg %>%
  filter(OFFENSE.TYPE == 'Narcotics')
sum(only_narcotics_cases$total_defendants)

## [1] 101924

# Check total number of _non_ narcotics cases
non_narcotics_cases <- disagg %>%
  filter(OFFENSE.TYPE != 'Narcotics')
sum(non_narcotics_cases$total_defendants)

## [1] 125918

# Check total number of pled-out narcotics cases
pled_out_narcotics_cases <- disagg %>%
  filter(OFFENSE.TYPE == 'Narcotics') %>%
  filter(pled_guilty)
sum(pled_out_narcotics_cases$total_defendants)

## [1] 43866

# Check total number of pled-out _non_ narcotics cases
pled_out_non_narcotics_cases <- disagg %>%
  filter(OFFENSE.TYPE != 'Narcotics') %>%
  filter(pled_guilty)
sum(pled_out_non_narcotics_cases$total_defendants)

## [1] 95295

`arrange` - why is the narcotics plea rate so low?

The vast majority of narcotics cases result in a guilty plea - see, e.g. this report from the Human Rights Watch. It focuses on federal drug cases, but the general trend of pleas holds at the state level as well. The explanation for those high rates is complicated, and I won't attempt to explain that here. But our analysis above suggested that the plea rate was just 43% for narcotics cases, which suggests that we're not interpreting the data incorrectly.

To investigate what's happening in general with narcotics cases, we'd like to answer the question: what are the most common dispositions in narcotics cases? What % of all cases do those dispositions represent? dplyr can help us again here, this time using arrange to order our data.

most_common_dispos <- function(df) {
  num_all_dispos <- sum(df$total_defendants)
  most_common <- df %>%
    # We're interested in individual dispositions, so group by them here
    group_by(DISPOSITION) %>%
    # Add columns for the total number of cases per disposition, and the % of all cases
    summarise(
      total_dispos = sum(total_defendants),
      pct_of_total = total_dispos / num_all_dispos
    ) %>%
    # Arrange the rows by our newly added `total_dispos` column. Using `desc` orders the rows in descending
    # order, using the `total_dispos` column.
    arrange(desc(total_dispos))

  return(most_common)
}
most_common_narcotics_dispos <- most_common_dispos(only_narcotics_cases)
kable(head(most_common_narcotics_dispos))

DISPOSITION	total_dispos	pct_of_total
Plea Of Guilty	43782	0.4295554
FNPC	26911	0.2640301
Nolle Prosecution	26365	0.2586731
Finding Guilty	2045	0.0200640
FNG	1735	0.0170225
Death Suggested-Cause Abated	184	0.0018053

The above data show that a large percentage of narcotics cases are being dismissed ("FNPC" means "finding of no probable cause" and "Nolle Prosecution" simply means that the prosecution decided not to proceed). I investigated why that was the case in a little bit more detail in a previous post, for those curious.

Conclusion

As someone who hasn't worked with R for a few years, the introduction of dplyr has made exploring data and running analyses much easier! I had previously used plyr for similar use cases, but found it difficult to read and understand. Aside from the funny %>% syntax of dplyr, I find it much easier to read, write, and understand.

I hope this was a helpful "tutorial" / introduction to dplyr - suggestions, questions, and other feedback is welcome! Please reach out via email or on Twitter.

Preparing our data

mutate, group_by, summarise - getting answers out of our data

arrange - why is the narcotics plea rate so low?

Conclusion

`mutate`, `group_by`, `summarise` - getting answers out of our data

`arrange` - why is the narcotics plea rate so low?