Introduction to dplyr, using SAO data
Posted on November 18, 2017 in blog
I recently wrote a
post looking at
dispositions coming out of the Cook County State Attorney's Office (SAO)
- that post analyzed data that the SAO recently
released
on how they've handled cases since 2011. All of the analysis I did was
in R, and made heavy use of the
dplyr
library. While there
was a little bit of a learning curve at first, I found dplyr
to be
much easier to use than other similar libraries in R.
This post explains my analysis showing the actual code used - it also
serves as a "hands on" introduction to the dplyr
set of tools.
Preparing our data
The data analyzed below is made up of dispositions ("outcomes") of cases handled by the SAO between 2011 and 2016. It's broken down by year, disposition, offense type, race, and the total number of defendants - e.g. one row would tell us that the SAO had 903 narcotics cases that were pled guilty involving white defendants in 2015. The full dataset can be accessed here.
df <- read.csv('State_s_Attorney_Felony_Cases_-_Disposition_Outcomes_By_Offense_Type_and_Defendant_Race.csv')
kable(head(df))
YEAR | DISPOSITION | OFFENSE.TYPE | RACE | DEFENDANT.COUNT |
---|---|---|---|---|
2011 | Aggravated Battery | Black | 1 | |
2011 | Aggravated Battery With A Firearm | Black | 1 | |
2011 | Aggravated DUI | 2 | ||
2011 | Aggravated DUI | Black | 2 | |
2011 | Aggravated DUI | White [Hispanic or Latino] | 2 | |
2011 | Aggravated Fleeing and Eluding | Black | 1 |
unique_dispositions <- unique(df$DISPOSITION)
length(unique_dispositions)
## [1] 33
unique_dispositions[grepl('^Plea', unique_dispositions)]
## [1] Plea Of Guilty Plea of Guilty - Amended Charge
## [3] Plea of Guilty - Lesser Included Plea of Guilty But Mentally Ill
## 33 Levels: BFW Case Dismissed Charge Rejected ... WOWI
As you can see from the first few rows of the table, we are working with aggregated counts of offenses, by year, offense type, disposition, and race. There are 33 different dispositions, some of which are very similar. Let's do a little preprocessing on the data to make it easier to work with. For our use case here, we're mostly interested in dispositions based on the particular offense type.
prep_disaggregated <- function(df) {
df <- df %>%
# Exclude blank dispositions, there are a few
filter(DISPOSITION != '') %>%
# Keep year, disposition, and offense type
group_by(YEAR, DISPOSITION, OFFENSE.TYPE) %>%
# Sum up the total number of defendants
summarise(total_defendants=sum(DEFENDANT.COUNT)) %>%
# Add an extra column to determine whether the particular disposition was a guilty plea. There
# are 4 different dispositions for guilty pleas, which we can identify with the '^Plea' regular
# expression
mutate(
pled_guilty = grepl('^Plea', DISPOSITION)
)
return(df)
}
disagg <- prep_disaggregated(df)
kable(head(disagg))
YEAR | DISPOSITION | OFFENSE.TYPE | total_defendants | pled_guilty |
---|---|---|---|---|
2011 | BFW | Aggravated Battery | 1 | FALSE |
2011 | BFW | Aggravated DUI | 20 | FALSE |
2011 | BFW | Driving With Suspended Or Revoked License | 3 | FALSE |
2011 | BFW | DUI | 3 | FALSE |
2011 | BFW | Identity Theft | 1 | FALSE |
2011 | BFW | Narcotics | 8 | FALSE |
mutate
, group_by
, summarise
- getting answers out of our data
Now that we have our data prepared, we can start looking at the first question of interest: how often are plea bargains used in narcotics cases vs. other cases? The dplyr library makes this kind of analysis much easier. The below snippet uses a number of the library's features to group the data by narcotics vs. non-narcotics, sum the total number of cases of each, and break down the number of each case type by the % of such cases that were pled guilty. There are comments in-line as well.
narcotics_vs_other_plea_rates <- function(df) {
# Expects to have a "pled_guilty" column already present
plea_rate_by_narco <- df %>%
# Add a boolean column to indicate whether the offense was a narcotics cases or not
mutate(narcotics_case = OFFENSE.TYPE == 'Narcotics') %>%
# Now group by that new column - dplyr is flexible enough to use "new" column in expressions that
# follow
group_by(narcotics_case) %>%
# Now that we're grouping by the `narcotics_case` boolean, sum up the total number of cases, the number
# of cases that were pled guilt (as indicated by the `pled_guilty` field), the the percentage of cases
# pled guilty. Note that columns used in `summarise` can reference one another - e.g. pct_pled_guilty
# is a function of two other columns in that same `summarise` invocation.
summarise(
total_cases = sum(total_defendants),
num_pled_guilty = sum(total_defendants[pled_guilty]),
pct_pled_guilty= num_pled_guilty / total_cases
)
return(plea_rate_by_narco)
}
plea_rates_by_narco <- narcotics_vs_other_plea_rates(disagg)
kable(plea_rates_by_narco)
narcotics_case | total_cases | num_pled_guilty | pct_pled_guilty |
---|---|---|---|
FALSE | 125918 | 95295 | 0.7568020 |
TRUE | 101924 | 43866 | 0.4303795 |
Especially when you're new to dplyr
(or any library), it's good to
double check your calculations. To double check that the summary above
makes sense, we can use some slightly simpler expressions to
hand-calculate the same numbers. As a library becomes more familiar,
these gut-checks may be less necessary.
# Check total number of narcotics cases
only_narcotics_cases <- disagg %>%
filter(OFFENSE.TYPE == 'Narcotics')
sum(only_narcotics_cases$total_defendants)
## [1] 101924
# Check total number of _non_ narcotics cases
non_narcotics_cases <- disagg %>%
filter(OFFENSE.TYPE != 'Narcotics')
sum(non_narcotics_cases$total_defendants)
## [1] 125918
# Check total number of pled-out narcotics cases
pled_out_narcotics_cases <- disagg %>%
filter(OFFENSE.TYPE == 'Narcotics') %>%
filter(pled_guilty)
sum(pled_out_narcotics_cases$total_defendants)
## [1] 43866
# Check total number of pled-out _non_ narcotics cases
pled_out_non_narcotics_cases <- disagg %>%
filter(OFFENSE.TYPE != 'Narcotics') %>%
filter(pled_guilty)
sum(pled_out_non_narcotics_cases$total_defendants)
## [1] 95295
arrange
- why is the narcotics plea rate so low?
The vast majority of narcotics cases result in a guilty plea - see, e.g. this report from the Human Rights Watch. It focuses on federal drug cases, but the general trend of pleas holds at the state level as well. The explanation for those high rates is complicated, and I won't attempt to explain that here. But our analysis above suggested that the plea rate was just 43% for narcotics cases, which suggests that we're not interpreting the data incorrectly.
To investigate what's happening in general with narcotics cases, we'd
like to answer the question: what are the most common dispositions in
narcotics cases? What % of all cases do those dispositions represent?
dplyr
can help us again here, this time using arrange
to order our
data.
most_common_dispos <- function(df) {
num_all_dispos <- sum(df$total_defendants)
most_common <- df %>%
# We're interested in individual dispositions, so group by them here
group_by(DISPOSITION) %>%
# Add columns for the total number of cases per disposition, and the % of all cases
summarise(
total_dispos = sum(total_defendants),
pct_of_total = total_dispos / num_all_dispos
) %>%
# Arrange the rows by our newly added `total_dispos` column. Using `desc` orders the rows in descending
# order, using the `total_dispos` column.
arrange(desc(total_dispos))
return(most_common)
}
most_common_narcotics_dispos <- most_common_dispos(only_narcotics_cases)
kable(head(most_common_narcotics_dispos))
DISPOSITION | total_dispos | pct_of_total |
---|---|---|
Plea Of Guilty | 43782 | 0.4295554 |
FNPC | 26911 | 0.2640301 |
Nolle Prosecution | 26365 | 0.2586731 |
Finding Guilty | 2045 | 0.0200640 |
FNG | 1735 | 0.0170225 |
Death Suggested-Cause Abated | 184 | 0.0018053 |
The above data show that a large percentage of narcotics cases are being dismissed ("FNPC" means "finding of no probable cause" and "Nolle Prosecution" simply means that the prosecution decided not to proceed). I investigated why that was the case in a little bit more detail in a previous post, for those curious.
Conclusion
As someone who hasn't worked with R for a few years, the introduction of
dplyr
has made exploring data and running analyses much easier! I had
previously used plyr
for similar use
cases, but found it difficult to read and understand. Aside from the
funny %>%
syntax of dplyr
, I find it much easier to read, write, and
understand.
I hope this was a helpful "tutorial" / introduction to dplyr
-
suggestions, questions, and other feedback is welcome! Please reach out
via email or on Twitter.