A while back, two of my colleagues were arguing about which is a bigger problem in the criminal justice system: bias against defendants of color or bias against poor defendants. My first inclination was to suggest we could settle the dispute if we had the right dataset. (I’m an attorney turned data scientist, so yes, that really was my first thought.^{1}) That being said, the right dataset magically appeared in a Tweet from Ben Schoenfeld.^{2}

2.2 million Virginia criminal district court cases now available for bulk download and more to come! https://t.co/Wd82wkxJn1 #opendata

— Ben Schoenfeld (@oilytheotter) March 22, 2016

What follows is the story of how I used those cases to discover what best predicts defendant outcomes: race or income. This post is not a summary of my findings, though you will find them in this article. It is a look behind the curtain of data science, a *how to* cast as *case study*. Yes, there will be a few equations. But you can safely skim over them without missing much. Just pay particular attention to the graphs.

### “Big” Data

Attorneys rely on intuition. It’s how we “know” whether a case should go to trial. But all intuition is statistical and the product of experience and observation. Unfortunately, it is also subject to an assortment of cognitive biases. “Big” Data promises to help transcend these shortcomings by checking our intuitions.^{3}

To figure out the answer to which was a bigger problem—bias against defendants of color or bias against poor defendants—I sifted through millions of Virginia court records.^{4} You have some collection of variables and the suspicion that one of them is dependent on the others. How do you test your suspicion? In one word: statistics!

For the question at hand, our data need to contain at least three types of information:

- The defendants’ race.
- The defendants’ income.
- Some consistent measure of outcomes.

With enough data, we can look to see if judgement outcomes change when race and income change. That is, we can see if there are any correlations. The outcome is called the dependent variable. Race and income are independent variables, and we call these features.

If we can get data on other factors that might affect the outcome, we want those too. Generally, the more features, the better, because we can control for the effects.^{5} For example, we should probably know something about the seriousness of a charge. Otherwise, if we find outcomes (e.g., sentences) go down as defendant incomes go up, we won’t know if this is because the courts are biased against the poor or because the well-off aren’t charged with serious crimes and never face truly bad outcomes.

### Data Wrangling & Exploration

Ben’s data are basically a set of spreadsheets. Each row is a charge, and there are some 47 columns associated with each row. Here’s what they look like:

I’m going to let you in on a secret: most of a data scientist’s time is spent cleaning and joining data, sometimes called data wrangling or data munging. It’s not glamorous, but it’s necessary, and it’s a process that requires a good sense of what you’re looking for.

Immediately, I scanned the data for race, income, seriousness, and outcome. There was a column listing the defendants’ race, but there was no column for income.^{6} Luckily the dataset included the zip codes of defendants, and since the 2006-2010 American Community Survey tabulated mean income by zip code, I could make an educated guess about a defendant’s income.^{7} I just assumed a defendant’s income was the average income of their zip code. It’s not perfect, but we don’t need to be perfect. I’ll explain why.
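Here's a minimal sketch of what that kind of join might look like in Python. The field names (`zip`, `mean_income`, `case_id`) and the income figures are invented for illustration; they are not the actual column names from Ben's data or the census tables.

```python
# Hypothetical charge rows; real data would have dozens of columns.
charges = [
    {"case_id": "A1", "race": "Black (Non-Hispanic)", "zip": "23219"},
    {"case_id": "B2", "race": "White Caucasian (Non-Hispanic)", "zip": "22903"},
    {"case_id": "C3", "race": "Hispanic", "zip": "99999"},  # no census match
]

# Made-up ZIP -> mean income lookup standing in for the census table.
mean_income_by_zip = {"23219": 41000.0, "22903": 67000.0}


def attach_income(rows, income_lookup):
    """Return the rows that have a ZIP match, each with 'mean_income' added."""
    joined = []
    for row in rows:
        income = income_lookup.get(row["zip"])
        if income is None:
            continue  # drop rows we cannot match rather than guess
        joined.append({**row, "mean_income": income})
    return joined


joined_rows = attach_income(charges, mean_income_by_zip)
for row in joined_rows:
    print(row["case_id"], row["mean_income"])
```

Dropping unmatched rows is one choice among several. You could also keep them and flag the income as missing, so long as you decide before you look at the results.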

### Creating a Model

You might not have realized it, but we’re about to build a statistical model. When evaluating if a model is useful, I like to remember two things:

- Always ask “compared to what?”
- Always remember that the output of a model should start, not end, discussion.

Right now, we’re operating in the dark. I had a guess as to whether or not a defendant’s race or income was a better predictor of outcomes, and I bet you have your own guess, but we have reasons to doubt our guesses. After we build a model, we’ll know more. We can then use the model’s output to move the conversation forward. That is the most one can ask for. Admittedly, I’ve made a lot of assumptions, and you’re welcome to disagree with them. In fact, you’re invited to improve upon them as I have shared all of my work, including computer code, over on GitHub: Class, Race, and Sex in Virginia Criminal Courts.

That’s how science works.

It’s important to note what we’re doing is modeling for insight. I don’t expect that we’ll use our model to predict the future. Rather, we’re trying to figure out how things interact. We want to know what happens to outcomes when we vary defendant demographics, specifically their race or income. The exact numbers aren’t as important as the general trends and how they compare.

### Finding Features

Next up, I had to figure out how to measure the seriousness of a case. The data listed charge types and classes (e.g., Class 1 Felony). You can find a description of what these mean here and here.^{8} Ideally, I wanted to place all crimes on a scale of seriousness. So I sorted the list of all possible combinations and numbered them from 1 to 10.^{9}
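A rough sketch of such a numbering in Python follows. The ordering is an assumption: it runs from Class 4 Misdemeanor (least serious) up to Class 1 Felony (most serious), which is one plausible reading of "sorted and numbered," not necessarily the exact mapping used for the model.

```python
# Assumed ordering, least to most serious. In Virginia, a lower class
# number means a more serious offense within each charge type.
ORDERED_CHARGES = [
    ("Misdemeanor", 4), ("Misdemeanor", 3), ("Misdemeanor", 2), ("Misdemeanor", 1),
    ("Felony", 6), ("Felony", 5), ("Felony", 4),
    ("Felony", 3), ("Felony", 2), ("Felony", 1),
]

# Map each (type, class) pair to a seriousness level from 1 to 10.
SERIOUSNESS = {combo: level for level, combo in enumerate(ORDERED_CHARGES, start=1)}


def seriousness(charge_type, charge_class):
    """Look up the 1-10 seriousness level for a charge type and class."""
    return SERIOUSNESS[(charge_type, charge_class)]


print(seriousness("Misdemeanor", 4), seriousness("Felony", 1))
```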

This is where I got caught in my first rabbit hole.

I spent a long time trying to map these 10 charge types on to some spectrum of seriousness, but sanctions are multifaceted. I tried combining the possible number of days in a sentence with the possible fine to get a single number that represented how serious a charge was. Coming from Massachusetts, I was looking for something like the ranking of seriousness levels used in our sentencing guidelines. In the end, however, I realized that my overly complicated ratings didn’t really improve the model (specifically, its R-squared, which is something we’ll discuss below).

The data included multiple outcomes, including the length of sentences, information on probation, and an accounting of fines and fees. Again, I spent a while trying to figure out how to combine these before remembering the sage’s advice: keep it simple, stupid.

Consequently, I opted to define outcome generally by what I assumed to be the most salient measure: the sentence length in days.^{10}

There was another variable relating to defendant demographics, sex, which I included.^{11} I could have looked through previous years’ data to construct a defendant’s criminal history along with a myriad of other features, but given that my question was aimed at the influence of race and income on outcomes, I was content to focus primarily on these features, along with seriousness. Consequently, there’s a lot more one could do with this data, and I’m sure that’s what Ben was hoping for when he compiled them. So please, dig in.

That being said, we’re ready to see how race, income, seriousness, and sex affect the outcomes of criminal cases.

### Best Fit Lines (Regressions)

If you’ve taken a statistics class, you’ve seen the next step coming. For those of you who became attorneys because you didn’t like math, let’s slow things down.

At the heart of modern science there is a class of tools that fall under the general name of regression analysis. You’ve probably heard of at least one of these. Here, for example, is a linear regression I ran on a subset of the VA court data.

Fundamentally, a linear regression is concerned with finding a best fit line. In the graph above, we are plotting the seriousness of a charge against the sentence a defendant received in days. Every charge in the data is plotted, and a line is drawn by a computer to minimize the distance between itself and each data point. A bunch of these points fall on top of each other. So it’s hard to get a feel for how they are distributed. To help with this we can replace all data points with a common X value with a single representative “dot.” The graph below shows the same data as the one above, but it groups data points into a single dot at its members’ center with bars to indicate where 95% of its membership falls.

Consequently, the Y-axes have different scales. In both graphs, however, we can see that as the seriousness of charges goes up, sentences go up. The lines allow us to put a hard number on how much.

To get this number, we use the equation of a line, *y = mx + b*, where *y* is the sentence, *x* is the seriousness of our charge, *m* is the slope of our line, and *b* is where our line crosses the Y-axis.

You’ll notice that the line doesn’t go through every data point. In fact, the data seem very noisy. We can see this best in the first of our graphs (Linear Regression: All Data Points). Life is messy, and by plotting every data point, we can see the variations inherent in real life. Thankfully, the seriousness of a charge does not dictate its destiny. Sometimes people are acquitted and cases are dismissed. Both of these occurrences result in a sentence of zero days, and often when there is a finding of guilt, there are extenuating circumstances, causing sentences other than the maximum. Consequently, being charged with a crime that could land you in jail for a year doesn’t mean you’re going away for a year.

There’s a measure of this noise. It is often framed as a number indicating how much of the variation in your data your model (best fit line) explains. It’s called R-squared, and in the above graphs, the model accounts for roughly 12% of the variation. That is, R-squared = 0.121689. A perfect fit with every data point falling on the line would yield an R-squared of 1. Knowing this helps us understand what’s going on. We’re used to thinking about averages, and in a way, the best fit is just telling us the average for every value along the number line.^{12} It’s worth noting, however, that our data are a tad peculiar because the seriousness of charges is always a whole number between 1 and 10. That’s why we see those nice rows.
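If you want to see the mechanics, here is a small Python sketch of a best fit line and its R-squared, written out by hand. The data points are invented, and the function names are just for illustration; a statistics package would normally do this for you.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept b for y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


def r_squared(xs, ys, m, b):
    """Share of the variation in ys that the line accounts for."""
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_residual = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_residual / ss_total


# Toy data (invented): seriousness level vs. sentence in days.
seriousness = [1, 1, 2, 3, 4, 4, 5, 6]
sentence_days = [0, 0, 0, 10, 0, 30, 90, 60]

m, b = fit_line(seriousness, sentence_days)
r2 = r_squared(seriousness, sentence_days, m, b)
print(f"slope={m:.2f}, intercept={b:.2f}, R-squared={r2:.3f}")
```

Points that sit exactly on a line give an R-squared of 1. The noisier the data, the closer it drifts toward 0.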

### Logarithms

If we look at the first plot of seriousness vs. sentences (in the graph above labeled *Linear Regression: All Data Points*), everything seems to be bunched up at the bottom of the graph, which isn’t ideal for a number of reasons. Luckily, there’s a way to deal with that. We can take the log of the data. When we do this, the name of our regression changes from *linear regression* to *log-linear* or *log-normal* regression. We won’t be able to read the number of days off our Y-axis anymore, but if we want to get that number, we could transform *log(days)* back into *days* by raising *e* to the *log(days)* power (i.e., *e ^{log(days)}*). It’s okay if you don’t understand logarithms. What’s important to know is that this trick helps un-bunch our numbers, and we can always undo this transformation if need be. One other detail: *log(0)* isn’t a finite number. Since many cases (dismissals and acquittals) have a sentence equal to zero, I’ll be taking the log of 1 + the sentence. Don’t worry about the details. Just look at these pretty graphs.
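For the curious, the transformation and its inverse take only a couple of lines in Python. This sketch uses the standard library's `math.log1p` (the log of 1 + x) and `math.expm1` (e to the x, minus 1); the sentence lengths are invented.

```python
import math

# Invented sentence lengths in days, including a dismissal (0 days).
sentences_days = [0, 1, 30, 365, 3650]

# log(1 + days): zero-day sentences map to 0 instead of blowing up.
logged = [math.log1p(days) for days in sentences_days]

# Undo the transformation: e ** value - 1 gets us back to days.
recovered = [math.expm1(value) for value in logged]

for days, value in zip(sentences_days, logged):
    print(f"{days:>5} days -> {value:.3f}")
```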

See, everything’s nice and spread out now. Note that in the *All Data Points* graph, our best fit doesn’t go through the dark patches of data points because it is brought down by our acquittals and dismissals at the bottom.

Now, you can’t always fit a line to data. Sometimes it’s a curve and sometimes there’s no pattern at all—there is no signal to grab on to. Such cases return very low R-squared values.

### Curvy Lines (Fitting Polynomials)

If you’re looking to plot something other than a line, you can add exponents to the equation of your best fit. Such equations are called higher order polynomials. A line is a first-order polynomial (i.e., *y = mx + b*) whereas a parabola (i.e., *y = ax ^{2} + bx + c*) is a second-order polynomial. You get third-order polynomials by adding an *x ^{3}* and fourth-order polynomials by adding an *x ^{4}*. What makes polynomials useful is that for every term you add you get a new bend. So we aren’t limited to fitting straight lines. For example:

Some of you are probably yelling at the screen: “Danger, Danger!” As we imagine increasingly curvy lines, what stops us from fitting a curve that goes right through the center of each of our data points? Judgment.

You would be right to be suspicious of such a fit. It’s unlikely that it would be generalizable. If you take such a model and apply it to new data, it’s likely to break. Generally speaking, if you’re going to fit curves to your data, you should have some reason other than “it fits better,” because you can always make the line fit better.
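To make the danger concrete: given n points with distinct X values, there is always a polynomial of order n - 1 that passes through every one of them. The sketch below builds that curve with Lagrange interpolation on invented numbers. It is an illustration of the idea, not something from the analysis itself.

```python
def interpolating_polynomial(xs, ys):
    """Return f(x) for the polynomial through every (x, y) pair (Lagrange form).

    Requires all xs to be distinct.
    """
    def f(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return f


# Invented, noisy data with distinct X values.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 5, 3, 20, 15, 40]

curve = interpolating_polynomial(xs, ys)

# Zero error on the points the curve was built from...
print([round(curve(x) - y, 9) for x, y in zip(xs, ys)])
# ...which says nothing about how sensible it is one step outside them.
print(curve(7))
```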

For example, it might be that charges punishable by fines are different from crimes punishable by incarceration and as you transition from one type to the other the seriousness jumps. Consequently, you wouldn’t expect a linear relationship across all charge types. Instead, you might be looking for a “hockey stick.” That being said, you should only make use of curvy fits if you have a good theoretical reason to do so.

When you aggressively fit your model, and the fit doesn’t reflect reality, we call it *overfitting.* To help avoid this temptation, we need to check our work. The process of making our fit based on data is called training, and it’s standard practice to train one’s model on a subset of data and then test one’s model against a different subset. This way you can be sure your model is generalizable and will avoid the trap of overfitting. This testing is called cross-validation, and like data wrangling, it’s an important step in the process.^{13}
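Here is one way that train-and-test loop could be sketched in Python. It uses k-fold cross-validation with a straight-line model: the data are dealt into k groups, and each group takes a turn being held out while the line is fit to the rest. The helper names and toy data are assumptions for illustration.

```python
import random


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = m*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x


def mean_squared_error(xs, ys, m, b):
    """Average squared gap between the line and the data."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k roughly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(xs, ys, k=4, seed=0):
    """Average held-out error of a straight-line fit across k folds."""
    errors = []
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        m, b = fit_line(train_x, train_y)  # train on everything else
        test_x = [xs[i] for i in fold]
        test_y = [ys[i] for i in fold]
        errors.append(mean_squared_error(test_x, test_y, m, b))  # test on the fold
    return sum(errors) / len(errors)


# Toy data (invented): a noisy upward trend.
xs = list(range(1, 13))
ys = [2 * x + 1 + (x % 3) for x in xs]
print(cross_validate(xs, ys))
```

A model that only looks good on the data it was trained on will show a much larger error here than it does on its own training set.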

### Statistically Significant?

So how do we know if a correlation is real? It turns out we can’t actually say for sure. Instead, statisticians ask: “If there is no correlation, how likely would it be for us to see these or more extreme results by chance?”

The answer to this question feature by feature is something called a P-value, and you may have heard about these in the news recently. A thoughtful explanation of their meaning is beyond the scope of this piece. However, you should know that they’re scored like golf. Low values are better than high values. This score is often used to help determine if something is *significant*.^{14}
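One way to get a feel for that question is a permutation test, sketched below in Python. To be clear, this is not how the P-values in a regression table are computed; it is a swapped-in technique that asks the same question directly. Shuffle the outcomes so any real relationship is destroyed, and count how often chance alone produces a correlation at least as strong as the one observed. The data and function names are invented.

```python
import random


def correlation(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(xs, ys, trials=2000, seed=0):
    """Estimate how often shuffled data look at least as correlated as the real data."""
    rng = random.Random(seed)
    observed = abs(correlation(xs, ys))
    shuffled = list(ys)
    at_least_as_extreme = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # break any real link between xs and ys
        if abs(correlation(xs, shuffled)) >= observed:
            at_least_as_extreme += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (trials + 1)


# Toy data (invented): a strong upward trend with a little wobble.
xs = list(range(1, 21))
ys = [3 * x + (7 * x) % 5 for x in xs]
p_value = permutation_p_value(xs, ys)
print(p_value)
```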

The P-values for seriousness in the various models above are pretty good—well below 0.05. This comes as no surprise. What we really want to know is how race, income, and sex measure up. So what does a best fit look like when you deal with more than just seriousness? What happens if we take income into account?

### Multiple Dimensions

What you’re seeing is a plot of *seriousness* and *mean income* against *log(1 + the sentence in days)*. Instead of a best fit line, we now have a best fit plane. If you look carefully, you can see that income does, in fact, correlate with outcome, except this correlation is in the opposite direction. That is, the higher your income, the lower your sentence.

Like before, we can use math to quantify our best fit. In this case, however, we use the equation of a plane (i.e., *z = ax + by + c*). Of course, just like before, we could fit curved surfaces to our data, but we can also expand the number of features we consider. As we add more features, we add more dimensions, and our best fit moves into the space of n-dimensional geometry. This can be hard to visualize, but hopefully it’s easy enough to understand. We’re just doing more of the same. If we don’t worry about curves, we’re just adding two variables with every new feature/dimension, e.g., *x’ = ax + by + cz + d*.
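As a sketch of what fitting that plane involves, the Python below solves for *a*, *b*, and *c* in *z = ax + by + c* using the least-squares normal equations. The helper names and numbers are invented (income is in thousands of dollars to keep the arithmetic tidy), and in practice a statistics library would handle this step.

```python
def solve(a, b):
    """Solve the square system a @ x = b by Gauss-Jordan elimination."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: use the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_plane(xs, ys, zs):
    """Least-squares a, b, c for z = a*x + b*y + c."""
    rows = [(x, y, 1.0) for x, y in zip(xs, ys)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xtz = [sum(r[i] * z for r, z in zip(rows, zs)) for i in range(3)]
    return solve(xtx, xtz)


# Invented data that sit exactly on a known plane, so we can check the fit.
seriousness = [1, 2, 3, 4, 5, 6]
income_thousands = [30, 80, 45, 60, 25, 90]
log_sentence = [0.5 * s - 0.01 * inc + 1.0
                for s, inc in zip(seriousness, income_thousands)]

a, b, c = fit_plane(seriousness, income_thousands, log_sentence)
print(round(a, 6), round(b, 6), round(c, 6))
```

Adding a feature just adds a column to each row and one more unknown to solve for.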

Again, our best fit is telling us something analogous to the average for every value along both axes, *seriousness* and *income*. This implicitly includes all of their combinations. Yes, high income corresponds to a lower sentence, but the seriousness of a charge matters a lot more.

It’s worth noting that our race and sex data don’t look like our other data in that they’re not part of a number scale. To address this, we convert them into collections of binary variables. For example, I have added a column to our data called Male. It’s 1 if the client was identified as male and 0 if the client was identified as female. Likewise, there are columns for each of the races defined in the data except for Caucasian. If a client was identified as Caucasian, all of these columns would be zero. That is, our model’s default race is Caucasian.
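In code, that conversion might look something like the Python sketch below. The race labels are the ones listed in the court data; the record layout (`sex` and `race` keys) is a hypothetical stand-in.

```python
# The baseline category gets no column of its own.
BASELINE_RACE = "White Caucasian (Non-Hispanic)"
RACE_COLUMNS = [
    "Black (Non-Hispanic)",
    "Hispanic",
    "Asian or Pacific Islander",
    "American Indian",
    "Other",
]


def encode(defendant):
    """Turn one defendant record into 0/1 indicator columns."""
    row = {"Male": 1 if defendant["sex"] == "Male" else 0}
    for race in RACE_COLUMNS:
        row[race] = 1 if defendant["race"] == race else 0
    return row


print(encode({"sex": "Male", "race": "Hispanic"}))
print(encode({"sex": "Female", "race": BASELINE_RACE}))  # every column is 0
```

Leaving one category out is deliberate. Every race coefficient is then read as a difference from the baseline group.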

Okay, let’s run a regression on all of our features. Yay, multi-dimensional best fit!

### Findings

The table below is a summary of the regression’s output. Remember P-values? They’re all really low (yay!). And although the R-squared isn’t great (6%), remember that life is messy.^{15} If race, income, sex, and the seriousness of a charge predicted a case’s outcome 100%, I’d be questioning what attorneys were for. So what do the rest of these numbers mean?

Well, the *coef* (short for coefficient) column tells us the slope of our best fit for a given variable.^{16} That is, the coefficients tell us how strongly and in what direction a feature (e.g., race) is correlated with the dependent variable (the sentence).^{17} So we can see that a defendant’s race is positively correlated with the sentence they receive, and their income is negatively correlated.

Here’s our model boiled down to an equation.

- *D* is our dataset of court cases joined with income data.
- *S* is the **Sentence in Days plus 1 day**.
- **Coefficients** *β*_{1} through *β*_{8} are those determined by an ordinary least squares (OLS) regression for the dataset *D* corresponding to features *x*_{1} through *x*_{8} respectively, with *β*_{0} equal to the intercept. Values of these can be found in the table above along with P-values and additional summary data. An explanation of these summary statistics can be found here.
- *ε* = some **random error** for the above OLS. For more on this, consult Forecasting From Log-Linear Regressions.
- *x*_{1} = the **seriousness** level of a charge.
- *x*_{2} = 1 if defendant is **male**, otherwise 0.
- *x*_{3} = the **mean** income of the defendant’s zip code, used as a stand-in for their income.
- *x*_{4} = 1 if defendant is **Black (Non-Hispanic)**, otherwise 0.
- *x*_{5} = 1 if defendant is **Hispanic**, otherwise 0.
- *x*_{6} = 1 if defendant is an **Asian or Pacific Islander**, otherwise 0.
- *x*_{7} = 1 if defendant is an **American Indian**, otherwise 0.
- *x*_{8} = 1 if defendant is **Other**, otherwise 0.
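Read together, these definitions suggest a log-linear equation of roughly the following form. Treat it as a reconstruction rather than the exact rendering: the symbol names are assumptions, with β₀ standing for the intercept, β₁ through β₈ for the coefficients on the eight features, ε for the random error, and *S* for the sentence in days plus 1.

```latex
\log(S) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4
        + \beta_5 x_5 + \beta_6 x_6 + \beta_7 x_7 + \beta_8 x_8 + \varepsilon
```

Equivalently, *S* is *e* raised to the right-hand side, which is why each coefficient acts on the log of the sentence rather than on the number of days directly.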

Again, those coefficients tell us how big an influence our features have.^{18}

This all tells us that for a black man in Virginia to get the same treatment as his Caucasian peer, he must earn an additional $90,000 a year.

Similar amounts hold for American Indians and Hispanics, with the offset for Asians coming in at a little less than half as much.
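The arithmetic behind that dollar figure is short enough to sketch in Python. The two coefficients are the ones quoted in this article's footnote on the calculation; the function name is just for illustration. The idea is to find the income at which the (negative) income term exactly cancels the (positive) race term.

```python
# Coefficients as quoted in the article's footnote on this calculation.
INCOME_COEF = -0.000004166  # per dollar of mean zip-code income
BLACK_COEF = 0.3763         # Black (Non-Hispanic) indicator


def offsetting_income(race_coef, income_coef):
    """Income at which income_coef * income + race_coef * 1 equals zero."""
    return -race_coef / income_coef


extra_income = offsetting_income(BLACK_COEF, INCOME_COEF)
print(f"${extra_income:,.0f}")
```

With these inputs the result lands in the neighborhood of $90,000. Swapping in the coefficient for another group gives that group's offset.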

The answer to our question seems to be that race-based bias is *pretty big*. It is also worth noting that being male isn’t helpful either.

Because the R-squared is so low, we’re not saying that being black is an insurmountable obstacle to receiving justice as a defendant. Our model only accounts for 6% of the variation we see in the data. So thankfully, other factors matter a lot more. Hopefully, these factors include the facts of a case. However, it is clear that defendants of color are in a markedly worse position than their white peers.

And yes, correlation isn’t causation. And yes, strictly speaking, we’re only talking about the Virginia Criminal Circuit Courts in 2006-2010. But my guess is we’d find similar results in other jurisdictions or times. So let’s start looking for those datasets. That being said, we know a good deal more than when we started. So I’m willing to articulate the following working theory: race matters.

If you’ll forgive my soapbox, it’s time we stop pretending race isn’t a major driver of disparities in our criminal justice system. This is not to say that the courts are full of racist people. In fact, I’m working on a more detailed analysis of this data that seems to suggest that, at some points in the course of a case, one’s race plays no significant role in determining an outcome. What we see here is the aggregate effect of many interlocking parts. Reality is complex. Good people can find themselves unwitting cogs in the machinery of institutional racism, and a system doesn’t have to have racist intentions to behave in a racist way. Unfortunately, the system examined here is not blind to race, class, or sex for that matter. Knowing this changes everything. Once you recognize the bias in a system, you have a choice: you can do something to push back, or you can accept the status quo.

### Words of Warning

I did all of my analysis with freely available tools, and there’s nothing stopping you from picking up where I left off. In fact, I hope that a few of you will look at this GitHub repo and do exactly that. However, it’s important to note that you need a solid foundation in statistics to avoid making unwarranted claims due to lack of experience.

And beware the danger zone! As Drew Conway (creator of the Venn Diagram above) points out, “It is from [that] part of the diagram that the phrase ‘lies, damned lies, and statistics’ emanates.”

That being said, there is nothing magic here. You can also discover hidden truths. My advice? Be suspicious of answers that reinforce your existing assumptions. Do your work in the open. When confidentiality allows, share both your findings and your data. Have someone check your math. Listen to feedback, and always be ready to change your mind.^{19}

*This post was updated on 6/1/16 to address commenter feedback and to correct mislabeled Y axes on the first two plots. In retrospect, the last paragraph of this article seems prescient. Shortly after publication, a commenter took me up on the offer to dig into the data and noticed that I had neglected to clean some extraneous entries from the dataset (i.e., those entries with unidentified race and sex). It amounted to two lines of missing code the consequence of which was to inflate the coefficients associated with race and sex in the model. The coefficient for sex only changed slightly. However, those associated with race came down a good deal. After correcting for this error, the original observation that a black man had to earn an additional $500,000 to be on equal footing with his white peers was amended to reflect the fact that the model now puts the dollar amount closer to $90,000. Additionally, the offset for Asian defendants turned out to be a little less than half that of Black, Native, and Hispanic defendants. Aside from the paragraphs in which these results were communicated, the table of summary statistics, and footnote 18, the remainder of this article is unchanged with the exception of the inclusion of new plots to correct the mislabeled axes and references to the associated r-squared in same.*

Featured image: “The Supreme Court” by Tim Sackton is licensed CC BY-SA 2.0. The image has been modified to include a big Data and text, with inspiration from Josh Lee.

It’s worth noting the work I’m describing here was done on my own time and in no way represents the opinions of my employer. ↩

As his site, Virginia Court Data, makes clear there’s quite a story behind this data, and my hat’s off to Ben for making them usable, despite some disagreements over the particulars of the release. See note 4. Thank you Ben. ↩

In my experience, most practitioners of data analytics are ambivalent about the term Big Data. Only a handful of institutions deal with truly big data. Google and Facebook come to mind. Most of the time when people say Big Data, what they’re really talking about is data sufficiently large for statistical analysis to be useful. Hence, the quotes. ↩

As a criminal defense attorney, I’m well aware of the complications that arise from such a dataset. As a recent ethical breach involving OkCupid data makes clear, just because you can aggregate data doesn’t mean you should. The data used here was aggregated from publicly-accessible court webpages that include defendant names. The contents of these pages are in fact public records. However, Ben has prudently hidden defendant names behind one-way hashes, and the court had enough foresight to partially obfuscate dates of birth. Unfortunately, this does not preclude de-anonymization of the data in the future. I would have opted to further obscure some of the data to make de-anonymization harder, and I shared some suggestions for further scrubbing the data with Ben. It was clear he took seriously the need to consider the unintended consequences of making this data more accessible, hence his effort to obscure defendant names. It was also clear that he felt there was a clear public interest in making these public data (in the sense that they are public records) more accessible, listing several cases where he believed obscured data would have made injustices in the system harder to discover. In response, I provided several hypotheticals illustrating the concerns behind my belief that more scrubbing was called for. We both agree that true anonymization is probably impossible, barring full homomorphic encryption, and that there is a public interest in making parts of this data easier to access than they are in the existing court interface. Where we disagree is on whether these should include all or only a subset of the data. He left open the question as to whether or not he will further scrub the data, but he made clear that he had no intent to do so in the near future. For this reason, I have not included the raw data in my supplementary materials. Consequently, if changes are made to provide additional privacy protections, these materials won’t be a weak link in the chain. ↩

When you have a lot of potential features, however, some special issues start to crop up. ↩

The Race column listed 6 categories: American Indian, Asian or Pacific Islander, Black (Non-Hispanic), Hispanic, Other, or White Caucasian (Non-Hispanic). However, I do not know what criteria the court uses to determine a defendant’s label. Also, I am aware that this appears to be a conflation of race and ethnicity, but again, these are the categories provided in the court data. You will see these labels referenced by shorthand in this article as Native, Asian, Black, Hispanic, Other, and Caucasian respectively. ↩

Assuming I limited my analysis to 2006-2010 data, which I did. This also limited me to data from the Criminal Circuit Courts. ↩

If you look at the data, you’ll also see U under the class column as well. These are unclassified charges with their own sentencing range, separate from the standard classes. ↩

You may notice when we examine the relationship between *seriousness* and *sentence* below, some Class 3 & 4 Misdemeanors are linked to jail times despite the fact that such offenses should only involve fines. See Virginia Misdemeanor Crimes by Class and Sentences. For the cases I could find, these sentences agreed with the information found on Virginia Courts’ Case Information website. So this does not appear to be an issue with Ben’s data collection. A number of possible explanations come to mind. For example, Public Intoxication is a Class 4 Misdemeanor, but subsequent offenses can result in jail time. It is also possible that there may be data entry errors (e.g., misclassifying a charge as a Class 3 Misdemeanor when the governing statute makes clear it is actually a Class 1 Misdemeanor, something I saw in the court’s data). Whatever the reasons for these potential errors, they seem to be the exception, not the rule, and without direct access to the courts’ methods and measures of quality control, I have to take the data at face value. Hopefully, the fact that we’re working with hundreds of thousands of cases means that 134 outliers, *if* they are actually errors, don’t do much to skew our results. ↩

The VA data listed a binary sex, not gender. So I am limited to such a taxonomy here. ↩

What’s really going on is something called ordinary least squares. ↩

To learn more about how I arrived at the final model below, check out this supplemental notebook. ↩

What counts as a good P-value, however, depends on context. For example, in most social science research, a P-value of less than 0.05 is considered significant, but in high-energy physics they hold out for a P-value of 0.0000003. ↩

I tested a number of different models and you can see my work in more detail over here. ↩

If you’re curious what all the other numbers mean, check out this post. ↩

Technically, the log of 1 + the sentence in days. ↩

For income’s influence to counteract that of being black, the income term and the race term must sum to zero when *x*_{4} = 1. Therefore:

-0.000004166 *x*_{3} + (0.3763)(1) = 0

*x*_{3} = 0.3763/0.000004166

*x*_{3} = $90,456.73 ↩

Thank you to Adrian Angus and William Li for your feedback. It was greatly appreciated. ↩