Uncovering Big Bias with Big Data


A while back, two of my colleagues were arguing about which is a bigger problem in the criminal justice system: bias against defendants of color or bias against poor defendants. My first inclination was to suggest we could settle the dispute if we had the right dataset. (I’m an attorney turned data scientist, so yes, that really was my first thought.1) Then, as if on cue, the right dataset appeared in a Tweet from Ben Schoenfeld.2

What follows is the story of how I used those cases to discover what best predicts defendant outcomes: race or income. This post is not a summary of my findings, though you will find them in this article. It is a look behind the curtain of data science, a how-to cast as a case study. Yes, there will be a few equations. But you can safely skim over them without missing much. Just pay particular attention to the graphs.

“Big” Data

Attorneys rely on intuition. It’s how we “know” whether a case should go to trial. But all intuition is statistical and the product of experience and observation. Unfortunately, it is also subject to an assortment of cognitive biases. “Big” Data promises to help transcend these shortcomings by checking our intuitions.3

To figure out the answer to which was a bigger problem—bias against defendants of color or bias against poor defendants—I sifted through millions of Virginia court records.4 The setup is a familiar one: you have some collection of variables and a suspicion that one of them depends on the others. How do you test your suspicion? In one word: statistics!

For the question at hand, our data need to contain at least three types of information:

  1. The defendants’ race.
  2. The defendants’ income.
  3. Some consistent measure of outcomes.

With enough data, we can look to see if judgment outcomes change when race and income change. That is, we can see if there are any correlations. The outcome is called the dependent variable. Race and income are independent variables, which we call features.

xkcd comic: Correlation

Source: “Correlation” from xkcd.

If we can get data on other factors that might affect the outcome, we want those too. Generally, the more features, the better, because we can control for their effects.5 For example, we should probably know something about the seriousness of a charge. Otherwise, if we find outcomes (e.g., sentences) go down as defendant incomes go up, we won’t know if this is because the courts are biased against the poor or because the well-off aren’t charged with serious crimes and never face truly bad outcomes.

Data Wrangling & Exploration

Ben’s data are basically a set of spreadsheets. Each row is a charge, and there are some 47 columns associated with each row. Here’s what they look like:

screen shot of spreadsheet cells

I’m going to let you in on a secret: most of a data scientist’s time is spent cleaning and joining data, sometimes called data wrangling or data munging. It’s not glamorous, but it’s necessary, and it’s a process that requires a good sense of what you’re looking for.

Immediately, I scanned the data for race, income, seriousness, and outcome. There was a column listing the defendants’ race, but there was no column for income.6 Luckily the dataset included the zip codes of defendants, and since the 2006-2010 American Community Survey tabulated mean income by zip code, I could make an educated guess about a defendant’s income.7 I just assumed a defendant’s income was the average income of their zip code. It’s not perfect, but we don’t need to be perfect. I’ll explain why.
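To make that join concrete, here is a minimal sketch of the step in pandas. The file and column names (va_cases.csv, acs_2006_2010_income_by_zip.csv, ZipCode, MeanIncome) are hypothetical stand-ins for illustration, not the names used in Ben’s files or in my repo.

    import pandas as pd

    # Hypothetical file and column names, for illustration only.
    cases = pd.read_csv("va_cases.csv")                    # one row per charge
    acs = pd.read_csv("acs_2006_2010_income_by_zip.csv")   # ZipCode, MeanIncome

    # Treat zip codes as 5-character strings so the join keys match.
    cases["ZipCode"] = cases["ZipCode"].astype(str).str.zfill(5)
    acs["ZipCode"] = acs["ZipCode"].astype(str).str.zfill(5)

    # Use the zip code's mean income as a stand-in for each defendant's income.
    cases = cases.merge(acs[["ZipCode", "MeanIncome"]], on="ZipCode", how="left")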

Creating a Model

You might not have realized it, but we’re about to build a statistical model. When evaluating if a model is useful, I like to remember two things:

  1. Always ask “compared to what?”
  2. Always remember that the output of a model should start, not end, discussion.

Right now, we’re operating in the dark. I had a guess as to whether a defendant’s race or income was the better predictor of outcomes, and I bet you have your own guess, but we have reasons to doubt our guesses. After we build a model, we’ll know more. We can then use the model’s output to move the conversation forward. That is the most one can ask for. Admittedly, I’ve made a lot of assumptions, and you’re welcome to disagree with them. In fact, you’re invited to improve upon them, as I have shared all of my work, including computer code, over on GitHub: Class, Race, and Sex in Virginia Criminal Courts.

That’s how science works.

It’s important to note what we’re doing is modeling for insight. I don’t expect that we’ll use our model to predict the future. Rather, we’re trying to figure out how things interact. We want to know what happens to outcomes when we vary defendant demographics, specifically their race or income. The exact numbers aren’t as important as the general trends and how they compare.

Finding Features

Next up, I had to figure out how to measure the seriousness of a case. The data listed charge types and classes (e.g., Class 1 Felony). You can find a description of what these mean here and here.8 Ideally, I wanted to place all crimes on a scale of seriousness. So I sorted the list of all possible type-and-class combinations and numbered them from 1 to 10.9
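In code, that seriousness scale is just a lookup over the ten standard type-and-class combinations (four misdemeanor classes plus six felony classes). The column names (ChargeType, Class) are illustrative assumptions, and this is a sketch of the idea rather than the exact mapping in my repo.

    # Hypothetical ordering from least to most serious; class U charges
    # are excluded (see note 9) and map to None here.
    seriousness_scale = {
        ("Misdemeanor", "4"): 1, ("Misdemeanor", "3"): 2,
        ("Misdemeanor", "2"): 3, ("Misdemeanor", "1"): 4,
        ("Felony", "6"): 5, ("Felony", "5"): 6, ("Felony", "4"): 7,
        ("Felony", "3"): 8, ("Felony", "2"): 9, ("Felony", "1"): 10,
    }
    cases["Seriousness"] = [
        seriousness_scale.get((t, c))
        for t, c in zip(cases["ChargeType"], cases["Class"])
    ]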

This is where I got caught in my first rabbit hole.

I spent a long time trying to map these 10 combinations onto some spectrum of seriousness, but sanctions are multifaceted. I tried combining the possible number of days in a sentence with the possible fine to get a single number that represented how serious a charge was. Coming from Massachusetts, I was looking for something like the ranking of seriousness levels used in our sentencing guidelines. In the end, however, I realized that my overly complicated ratings didn’t really improve the model (specifically, its R-squared, which is something we’ll discuss below).

The data included multiple outcomes, including the length of sentences, information on probation, and an accounting of fines and fees. Again, I spent a while trying to figure out how to combine these before remembering the sage’s advice: keep it simple, stupid.

Consequently, I opted to define outcome generally by what I assumed to be the most salient measure: the sentence length in days.10

There was another variable relating to defendant demographics, sex, which I included.11 I could have looked through previous years’ data to construct a defendant’s criminal history, along with a myriad of other features, but given that my question was aimed at the influence of race and income on outcomes, I was content to focus primarily on these features, along with seriousness. Of course, there’s a lot more one could do with this data, and I’m sure that’s what Ben was hoping for when he compiled them. So please, dig in.

With that, we’re ready to see how race, income, seriousness, and sex affect the outcomes of criminal cases.

Best Fit Lines (Regressions)

If you’ve taken a statistics class, you’ve seen the next step coming. For those of you who became attorneys because you didn’t like math, let’s slow things down.

At the heart of modern science there is a class of tools that fall under the general name of regression analysis. You’ve probably heard of at least one of these. Here, for example, is a linear regression I ran on a subset of the VA court data.

linear regression (all data points)

Fundamentally, a linear regression is concerned with finding a best fit line. In the graph above, we are plotting the seriousness of a charge against the sentence a defendant received in days. Every charge in the data is plotted, and a line is drawn by a computer to minimize the distance between itself and each data point. A bunch of these points fall on top of each other, so it’s hard to get a feel for how they are distributed. To help with this, we can replace all data points sharing a common X value with a single representative “dot.” The graph below shows the same data as the one above, but it groups the data points at each X value into a single dot at their center, with bars to indicate where 95% of them fall.

linear regression (representative 'dots')

Consequently, the Y-axes have different scales. In both graphs, however, you can see that as the seriousness of a charge goes up, the sentence goes up. The lines allow us to put a hard number on how much.

To get this number, we use the equation of a line, y = mx + b. Where y is the sentence, x is the seriousness of our charge, m is the slope of our line, and b is where our line crosses the Y-axis.
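If you want to see where m and b come from, a one-line ordinary least squares fit does the job. This is a sketch using the illustrative column names from above, not the code in the repo.

    import numpy as np

    # Fit sentence (in days) = m * seriousness + b by ordinary least squares,
    # after dropping rows with missing values.
    fit_df = cases[["Seriousness", "SentenceDays"]].dropna()
    m, b = np.polyfit(fit_df["Seriousness"], fit_df["SentenceDays"], deg=1)
    print(f"each step up in seriousness adds roughly {m:.0f} days (intercept {b:.0f})")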

You’ll notice that the line doesn’t go through every data point. In fact, the data seem very noisy. We can see this best in the first of our graphs (Linear Regression: All Data Points). Life is messy, and by plotting every data point, we can see the variations inherent in real life. Thankfully, the seriousness of a charge does not dictate its destiny. Sometimes people are acquitted and cases are dismissed. Both of these occurrences result in a sentence of zero days, and often when there is a finding of guilt, there are extenuating circumstances, causing sentences other than the maximum. Consequently, being charged with a crime that could land you in jail for a year doesn’t mean you’re going away for a year.

There’s a measure of this noise. It is often framed as a number indicating how much of the variation in your data your model (best fit line) explains. It’s called R-squared, and in the above graphs, the model accounts for roughly 12% of the variation. That is, R-squared = 0.121689. A perfect fit with every data point falling on the line would yield an R-squared of 1. Knowing this helps us understand what’s going on. We’re used to thinking about averages, and in a way, the best fit is just telling us the average for every value along the number line.12 It’s worth noting, however, that our data are a tad peculiar because the seriousness of charges is always a whole number between 1 and 10. That’s why we see those nice rows.


If we look at the first plot of seriousness vs. sentences (in the graph above labeled Linear Regression: All Data Points), everything seems to be bunched up at the bottom of the graph, which isn’t ideal for a number of reasons. Luckily, there’s a way to deal with that. We can take the log of the data. When we do this, the name of our regression changes from linear regression to log-linear or log-normal regression. We won’t be able to read the number of days off our Y-axis anymore, but if we want that number, we can transform log(days) back into days by raising e to the log(days) power (i.e., e^log(days) = days). It’s okay if you don’t understand logarithms. What’s important to know is that this trick helps un-bunch our numbers, and we can always undo the transformation if need be. One other detail: log(0) isn’t a finite number. Since many cases (dismissals and acquittals) have a sentence equal to zero, I’ll be taking the log of 1 + the sentence. Don’t worry about the details. Just look at these pretty graphs.

log-linear regression (all data points and representative 'dots')

See, everything’s nice and spread out now. Note that in the All Data Points graph, our best fit doesn’t go through the dark patches of data points because it is brought down by our acquittals and dismissals at the bottom.
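In code, that transformation is a one-liner; numpy’s log1p computes log(1 + x) directly, and expm1 undoes it. A sketch, with the same illustrative column names as before:

    import numpy as np

    # log(0) is undefined, so we work with log(1 + sentence in days).
    cases["LogSentence"] = np.log1p(cases["SentenceDays"])
    days_again = np.expm1(cases["LogSentence"])   # back to plain days if needed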

Now, you can’t always fit a line to data. Sometimes it’s a curve and sometimes there’s no pattern at all—there is no signal to grab on to. Such cases return very low R-squared values.

Curvy Lines (Fitting Polynomials)

If you’re looking to plot something other than a line, you can add exponents to the equation of your best fit. Such equations are called higher-order polynomials. A line is a first-order polynomial (i.e., y = mx + b), whereas a parabola (i.e., y = ax^2 + bx + c) is a second-order polynomial. You get a third-order polynomial by adding an x^3 term and a fourth-order polynomial by adding an x^4 term. What makes polynomials useful is that every term you add gives you a new bend. So we aren’t limited to fitting straight lines. For example:

2nd and 4th polynomial regression (representative 'dots')
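Under the hood, fitting those curvier lines is the same least-squares machinery with more terms. Here is a sketch, again with illustrative column names rather than the repo’s code:

    import numpy as np

    # Second- and fourth-order polynomial fits of log(1 + sentence) on seriousness.
    # Higher orders always fit the training data at least as well, which is
    # exactly why they deserve suspicion.
    clean = cases[["Seriousness", "SentenceDays"]].dropna()
    p2 = np.poly1d(np.polyfit(clean["Seriousness"], np.log1p(clean["SentenceDays"]), 2))
    p4 = np.poly1d(np.polyfit(clean["Seriousness"], np.log1p(clean["SentenceDays"]), 4))
    print(p2(7), p4(7))   # each fit's predicted log-sentence at seriousness 7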

Some of you are probably yelling at the screen: “Danger, Danger!” As we imagine increasingly curvy lines, what stops us from fitting a curve that goes right through the center of each of our data points?  Judgment.

You would be right to be suspicious of such a fit. It’s unlikely that it would be generalizable. If you take such a model and apply it to new data, it’s likely to break. Generally speaking, if you’re going to fit curves to your data, you should have some reason other than “it fits better,” because you can always make the line fit better.

For example, it might be that charges punishable by fines are different from crimes punishable by incarceration, and as you transition from one type to the other, the seriousness jumps. Consequently, you wouldn’t expect a linear relationship across all charge types. Instead, you might be looking for a “hockey stick.” Again, you should only make use of curvy fits if you have a good theoretical reason to do so.

When you fit your model too aggressively and the fit doesn’t reflect reality, we call it overfitting. To help avoid this temptation, we need to check our work. The process of making our fit based on data is called training, and it’s standard practice to train one’s model on a subset of data and then test it against a different subset. This way you can be more confident your model is generalizable and hasn’t fallen into the trap of overfitting. This testing is called cross-validation, and like data wrangling, it’s an important step in the process.13
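A minimal version of that train/test split looks like this, using scikit-learn (a sketch; the notebook in my repo may go about it differently):

    from sklearn.model_selection import train_test_split

    # Hold out 20% of the charges. Fit the model on `train`,
    # then judge how well it generalizes by scoring it on `test`.
    train, test = train_test_split(cases, test_size=0.2, random_state=42)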

Statistically Significant?

So how do we know if a correlation is real? It turns out we can’t actually say for sure. Instead, statisticians ask: “If there is no correlation, how likely would it be for us to see these or more extreme results by chance?”

The answer to this question, feature by feature, is something called a P-value, and you may have heard about these in the news recently. A thoughtful explanation of their meaning is beyond the scope of this piece. However, you should know that they’re scored like golf: low values are better than high values. This score is often used to help determine if something is significant.14

xkcd comic: P-Values

Featured image: “P-Values” from xkcd.

The P-values for seriousness in the various models above are pretty good—well below 0.05. This comes as no surprise. What we really want to know is how race, income, and sex measure up. So what does a best fit look like when you deal with more than just seriousness? What happens if we take income into account?

Multiple Dimensions

multiple log-linear regression (all data points)

What you’re seeing is a plot of seriousness and mean income against log(1 + the sentence in days). Instead of a best fit line, we now have a best fit plane. If you look carefully, you can see that income does, in fact, correlate with outcome, but the correlation runs in the opposite direction: the higher your income, the lower your sentence.

Like before, we can use math to quantify our best fit. In this case, however, we use the equation of a plane (i.e., z = ax + by + c). Of course, just like before, we could fit curved surfaces to our data, but we can also expand the number of features we consider. As we add more features, we add more dimensions, and our best fit moves into the space of n-dimensional geometry. This can be hard to visualize, but hopefully it’s easy enough to understand. We’re just doing more of the same. If we don’t worry about curves, we’re just adding one more term (a feature and its coefficient) with every new feature/dimension, e.g., w = ax + by + cz + d.

Again, our best fit is telling us something analogous to the average for every value along both axes, seriousness and income. This implicitly includes all of their combinations. Yes, high income corresponds to a lower sentence, but the seriousness of a charge matters a lot more.

It’s worth noting that our race and sex data don’t look like our other data in that they’re not on a number scale. To address this, we convert them into collections of binary variables. For example, I have added a column to our data called Male. It’s 1 if the defendant was identified as male and 0 if the defendant was identified as female. Likewise, there are columns for each of the races defined in the data except for Caucasian. If a defendant was identified as Caucasian, all of these columns would be zero. That is, our model’s default race is Caucasian.
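Here is a sketch of that dummy coding. The column names are the shorthand from note 6; the raw data uses longer labels (and, as the comments below show, my actual repo code handled them a bit differently).

    # One binary (dummy) column per category. Caucasian and female are the
    # defaults, so they get no column of their own. The raw data uses longer
    # labels, e.g., "Black (Non-Hispanic)"; these are simplified for illustration.
    cases["Male"] = (cases["Sex"] == "Male").astype(int)
    for label in ["Black", "Hispanic", "Asian", "Native", "Other"]:
        cases[label] = (cases["Race"] == label).astype(int)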

Okay, let’s run a regression on all of our features. Yay, multi-dimensional best fit!
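With the dummy columns in place, the whole model fits in one statsmodels formula. This is a sketch built on the illustrative column names above, not the exact call from the repo; its summary() output is the kind of table you see below.

    import numpy as np
    import statsmodels.formula.api as smf

    # OLS of log(1 + sentence) on seriousness, sex, income, and the race dummies.
    model = smf.ols(
        "np.log1p(SentenceDays) ~ Seriousness + Male + MeanIncome"
        " + Black + Hispanic + Asian + Native + Other",
        data=cases,
    ).fit()
    print(model.summary())   # coefficients, P-values, R-squared, and more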


The table below is a summary of the regression’s output. Remember P-values? They’re all really low (yay!). And although the R-squared isn’t great (6%), remember that life is messy.15 If race, income, sex, and the seriousness of a charge predicted a case’s outcome 100%, I’d be questioning what attorneys were for. So what do the rest of these numbers mean?

OLS summary statistics. R-squared 0.06...

Well, the coef (short for coefficient) column tells us the slope of our best fit for a given variable.16 That is, it tells us how big and in what direction a feature (e.g., race) is correlated with the dependent variable (the sentence).17 So we can see that a defendant’s race is positively correlated with the sentence they receive, and their income is negatively correlated.

Here’s our model boiled down to an equation.

S = e^{\beta + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \beta_7 x_7 + \beta_8 x_8 + \epsilon}

  • D is our dataset of court cases joined with income data.
  • S is the Sentence in Days plus 1 day.
  • Coefficients \beta_1 through \beta_8 are those determined by an ordinary least squares (OLS) regression for the dataset D corresponding to features x1 through x8 respectively, with \beta equal to the intercept. Values of these can be found in the table above along with P-values and additional summary data. An explanation of these summary statistics can be found here.
  • \epsilon = some random error for the above OLS. For more on this, consult Forecasting From Log-Linear Regressions.
  • x1 = the seriousness level of a charge.
  • x2 = 1 if defendant is male, otherwise 0.
  • x3 = the mean income of the defendant’s zip code, used as a stand-in for their income.
  • x4 = 1 if defendant is Black (Non-Hispanic), otherwise 0.
  • x5 = 1 if defendant is Hispanic, otherwise 0.
  • x6 = 1 if defendant is an Asian or Pacific Islander, otherwise 0.
  • x7 = 1 if defendant is an American Indian, otherwise 0.
  • x8 = 1 if defendant is Other, otherwise 0.

Again, those coefficients tell us how big an influence our features have.18
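For instance, here is the arithmetic behind the figure that follows, using the (rounded) coefficients reported in note 18 below:

    # Coefficients as reported in note 18 (rounded); the income coefficient is negative.
    beta_income = -0.000004166   # coefficient on mean income
    beta_black = 0.3763          # coefficient on the Black dummy

    # Income at which beta_income * income + beta_black * 1 = 0.
    offset = -beta_black / beta_income
    print(f"${offset:,.2f}")     # roughly $90,000 a year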

This all tells us that for a black man in Virginia to get the same treatment as his Caucasian peer, he must earn an additional $90,000 a year.

Similar amounts hold for American Indians and Hispanics, with the offset for Asians coming in at a little less than half as much. 

The answer to our question seems to be that race-based bias is pretty big. It is also worth noting that being male isn’t helpful either.

Because the R-squared is so low, we’re not saying that being black is an insurmountable obstacle to receiving justice as a defendant. Our model only accounts for 6% of the variation we see in the data. So thankfully, other factors matter a lot more. Hopefully, these factors include the facts of a case. However, it is clear that defendants of color are in a markedly worse position than their white peers.

And yes, correlation isn’t causation. And yes, strictly speaking, we’re only talking about the Virginia Criminal Circuit Courts in 2006-2010. But my guess is we’d find similar results in other jurisdictions and times. So let’s start looking for those datasets. Either way, we know a good deal more than when we started. So I’m willing to articulate the following working theory: race matters.

If you’ll forgive my soapbox, it’s time we stop pretending race isn’t a major driver of disparities in our criminal justice system. This is not to say that the courts are full of racist people. In fact, I’m working on a more detailed analysis of this data that seems to suggest that, at some points in the course of a case, one’s race plays no significant role in determining an outcome. What we see here is the aggregate effect of many interlocking parts. Reality is complex. Good people can find themselves unwitting cogs in the machinery of institutional racism, and a system doesn’t have to have racist intentions to behave in a racist way. Unfortunately, the system examined here is not blind to race, class, or sex for that matter. Knowing this changes everything. Once you recognize the bias in a system, you have a choice: you can do something to push back, or you can accept the status quo.

Words of Warning

I did all of my analysis with freely available tools, and there’s nothing stopping you from picking up where I left off. In fact, I hope that a few of you will look at this GitHub repo and do exactly that. However, it’s important to note that you need a solid foundation in statistics to avoid making unwarranted claims.

Drew Conway's Data Science Venn Diagram

Featured image: “Data Science Venn Diagram” by Drew Conway is licensed CC BY-NC.

And beware the danger zone! As Drew Conway (creator of the Venn Diagram above) points out, “It is from [that] part of the diagram that the phrase ‘lies, damned lies, and statistics’ emanates.”

That being said, there is nothing magic here. You, too, can discover hidden truths. My advice? Be suspicious of answers that reinforce your existing assumptions. Do your work in the open. When confidentiality allows, share both your findings and your data. Have someone check your math. Listen to feedback, and always be ready to change your mind.19

This post was updated on 6/1/16 to address commenter feedback and to correct mislabeled Y-axes on the first two plots. In retrospect, the last paragraph of this article seems prescient. Shortly after publication, a commenter took me up on the offer to dig into the data and noticed that I had neglected to clean some extraneous entries from the dataset (i.e., those entries with unidentified race and sex). It amounted to two lines of missing code, the consequence of which was to inflate the coefficients associated with race and sex in the model. The coefficient for sex only changed slightly. However, those associated with race came down a good deal. After correcting for this error, the original observation that a black man had to earn an additional $500,000 to be on equal footing with his white peers was amended to reflect the fact that the model now puts the dollar amount closer to $90,000. Additionally, the offset for Asian defendants turned out to be a little less than half that of Black, Native, and Hispanic defendants. Aside from the paragraphs in which these results were communicated, the table of summary statistics, and footnote 18, the remainder of this article is unchanged, except for the new plots correcting the mislabeled axes and updated references to the associated R-squared.

Featured image: “The Supreme Court” by Tim Sackton is licensed CC BY-SA 2.0. The image has been modified to include a “Big Data” graphic and text, with inspiration from Josh Lee.

  1. It’s worth noting the work I’m describing here was done on my own time and in no way represents the opinions of my employer. 

  2. As his site, Virginia Court Data, makes clear, there’s quite a story behind this data, and my hat’s off to Ben for making them usable, despite some disagreements over the particulars of the release. See note 4. Thank you, Ben. 

  3. In my experience, most practitioners of data analytics are ambivalent about the term Big Data. Only a handful of institutions deal with truly big data. Google and Facebook come to mind. Most of the time when people say Big Data, what they’re really talking about is data sufficiently large for statistical analysis to be useful. Hence, the quotes. 

  4. As a criminal defense attorney, I’m well aware of the complications that arise from such a dataset. As a recent ethical breach involving OkCupid data makes clear, just because you can aggregate data doesn’t mean you should. The data used here was aggregated from publicly-accessible court webpages that include defendant names. The contents of these pages are in fact public records. However, Ben has prudently hidden defendant names behind one-way hashes, and the court had enough foresight to partially obfuscate dates of birth. Unfortunately, this does not preclude de-anonymization of the data in the future. I would have opted to further obscure some of the data to make de-anonymization harder, and I shared some suggestions for further scrubbing the data with Ben. It was clear he took seriously the need to consider the unintended consequences of making this data more accessible, hence his effort to obscure defendant names. It was also clear that he felt there was a clear public interest in making these public data (in the sense that they are public records) more accessible, listing several cases where he believed obscured data would have made the discovery of injustices in the system harder to find. In response, I provided several hypotheticals illustrating the concerns behind my belief that more scrubbing was called for. We both agree that true anonymization is probably impossible, barring full homomorphic encryption, and that there is a public interest in making parts of this data easier to access than they are in the existing court interface. Where we disagree is on whether these should include all or only a subset of the data. He left open the question as to whether or not he will further scrub the data, but he made clear that he had no intent to do so in the near future. For this reason, I have not included the raw data in my supplementary materials. Consequently, if changes are made to provide additional privacy protections, these materials won’t be a weak link in the chain. 

  5. When you have a lot of potential features, however, some special issues start to crop up. 

  6. The Race column listed 6 categories: American Indian, Asian or Pacific Islander, Black (Non-Hispanic), Hispanic, Other, or White Caucasian (Non-Hispanic). However, I do not know what criteria the court uses to determine a defendant’s label. Also, I am aware that this appears to be a conflation of race and ethnicity, but again, these are the categories provided in the court data. You will see these labels referenced by shorthand in this article as Native, Asian, Black, Hispanic, Other, and Caucasian respectively. 

  7. Assuming I limited my analysis to 2006-2010 data, which I did. This also limited me to data from the Criminal Circuit Courts. 

  8. If you look at the data, you’ll also see U under the class column. These are unclassified charges with their own sentencing ranges, separate from the standard classes. 

  9. This excludes class U charges. See note 8

  10. You may notice when we examine the relationship between seriousness and sentence below, some Class 3 & 4 Misdemeanors are linked to jail times despite the fact that such offenses should only involve fines. See Virginia Misdemeanor Crimes by Class and Sentences. For the cases I could find, these sentences agreed with the information found on Virginia Courts’ Case Information website. So this does not appear to be an issue with Ben’s data collection. A number of possible explanations come to mind. For example, Public Intoxication is a Class 4 Misdemeanor, but subsequent offenses can result in jail time. It is also possible that there may be data entry errors (e.g., misclassifying a charge as a Class 3 Misdemeanor when the governing statute makes clear it is actually a Class 1 Misdemeanor, something I saw in the court’s data). Whatever the reasons for these potential errors, they seem to be the exception, not the rule, and without direct access to the courts’ methods and measures of quality control, I have to take the data at face value. Hopefully, the fact that we’re working with hundreds of thousands of cases means that 134 outliers, if they are actually errors, don’t do much to skew our results. 

  11. The VA data listed a binary sex, not gender. So I am limited to such a taxonomy here. 

  12. What’s really going on is something called ordinary least squares

  13. To learn more about how I arrived at the final model below, check out this supplemental notebook

  14. What counts as a good P-value, however, depends on context. For example, in most social science research, a P-value of less than 0.05 is considered significant, but in high-energy physics they hold out for a P-value of 0.0000003

  15. I tested a number of different models and you can see my work in more detail over here

  16. If you’re curious what all the other numbers mean, check out this post

  17. Technically, the log of 1 + the sentence in days. 

  18. For income’s influence to counteract that of being black, we need \beta_3 x_3 + \beta_4 x_4 = 0, with x_4 = 1:
    –0.000004166 x_3 + (0.3763)(1) = 0
    x_3 = 0.3763 / 0.000004166
    x_3 = $90,456.73 

  19. Thank you to Adrian Angus and William Li for your feedback. It was greatly appreciated. 


Current Lab Discussions
  • Johnny

    Nice article David. The lines that resonate most with me “a system doesn’t have to have racist intentions to behave in a racist way” and “be suspicious of answers that reinforce your existing assumptions.”

  • Jason

    Very cool that these data are publicly available, and I love the systematic approach that you took. However, the analysis suffers from a few deficiencies (not trying to be a jerk, just trying to help you reach the most valid conclusions). This is coming from my experience as a social scientist analyzing sentencing data.

    First, as far as my understanding of the analysis goes, the author does not account for the nested structure of the data – individual defendants are nested within zipcodes, which means that if the mean income of the zipcode is taken, then you have these income values repeated over and over for defendants living in the same zipcodes. This will tend to underestimate the standard error of the estimate, which will increase the likelihood of reporting a statistically significant association when in fact one does not exist in the population. This can be addressed by fitting a linear mixed model, which will account for the nested data structure and use the appropriate degrees of freedom for the income estimate. Alternatively, a cluster robust covariance matrix can be estimated to get the appropriate standard errors.

    Second, as the author points out, the sentence length distribution is highly skewed, with a ton of zeros if the dismissals are not dropped. Recent research published in the Journal of Quantitative Criminology (http://link.springer.com/article/10.1007/s10940-016-9283-z) (behind a paywall, sorry!) forwards the use of zero-inflated negative binomial or logit-negative binomial hurdle regression, as OLS and log-linear models produce highly biased estimates given the distribution of the data. These can be coupled with the cluster robust variance-covariance estimators for more accurate standard errors and p values.

    • Thanks Jason. These are all valid points, and I appreciate your feedback. There’s clearly a lot more that can be done with these data, and I’m the first to admit the analysis has plenty of deficiencies to go around. It is certainly quick and dirty. My hope was to start a conversation and to be as up front as I could with the deficiencies as they currently exist while introducing a general audience to basic regression analysis. With any luck, subsequent passes at the data can further refine things.

    • Joshua Greene

      What do you think about two separate models:
      (1) conviction model applied to all records
      (2) sentencing model applied only to records with a non-zero SentenceDays?

      Naively, this would help address (an element of) the skew in the dependent variable and also peek into hypotheses about different stages of the criminal justice process.

      • Joshua, you’re anticipating the “more detailed analysis” alluded to above. ;)

        Also, I’d like the opportunity of a deeper dive to explore issues like Simpson’s Paradox. You see, the thing about this post is that it’s largely a McGuffin to introduce attorneys to data analysis.

  • Smith

    Very interesting topic – it is great that all this data is so accessible upon analysis. I saw this crossposted on /r/statistics and felt obliged to respond.

    Looking at the results, something seems off. It is odd that whites (the base case) are so significantly different from every other race, while the t values for all other races would make them look quite similar. Furthermore, the suggestion that Asians receive significantly longer sentences than whites after some accounting for income on some level as well as crime and gender is contradictory to existing research that would actually suggest the opposite.

    Looking into the notebook you shared, I noticed that you (I think) unintentionally treat cases with unidentified race as caucasian and unidentified gender as male.
    For example:
    munged_df['Native'] = 0
    munged_df.loc[munged_df['Race'].str.contains('american',case=False)==True, 'Native'] = 1
    munged_df['Asian'] = 0
    munged_df.loc[munged_df['Race'].str.contains('asian',case=False)==True, 'Asian'] = 1
    munged_df['Black'] = 0
    munged_df.loc[munged_df['Race'].str.contains('black',case=False)==True, 'Black'] = 1
    munged_df['Hispanic'] = 0
    munged_df.loc[munged_df['Race'] == 'Hispanic', 'Hispanic'] = 1
    munged_df['Other'] = 0
    munged_df.loc[munged_df['Race'].str.contains('other',case=False)==True, 'Other'] = 1

    I would not make this assumption and imagine you would get very different results if these records were removed instead. As I mentioned, something seems off about the race conclusions of this model. Coincidentally, you use white as the base case for your models.

    To add to that, with statsmodels and patsy it is easy to use the categorical function and define the base case for the categorical variable instead of hand coding dummy columns. In some ways, it helps avoid similar mistakes. For example:

    ols(np.log(1+SentenceDays) ~ Seriousness + C(Race, Treatment(reference='White Caucasian (Non-Hispanic)')))

    On that note, you may also find that your model would make more sense if you treated seriousness (or the crime itself) as a categorical variable, not as an ordinal variable with a linear/log-linear/exponential relationship, as you essentially disproved in your own analysis.

    • Smith, you’re correct about the unidentified characteristics. I failed to remove entries with unidentified (null) race and gender before running the analysis, causing the model to treat these as white and male respectively. If I rerun the regression on a corrected set, the corresponding coefficients come down a good deal. Looks like I’m going to have to make a correction above. Thank you for catching this. Many eyes make all bugs shallow.

      • k g sugihara

        I agree with your response

  • Chris H

    A couple of thoughts/observations. Firstly, am I missing something or are the first couple of graphs labelled incorrectly? The text talks of seriousness against sentence, but the graphs have the y-axis labelled ‘mean income’. As such, they appear to suggest that the richer you are the more serious the crime you will commit.

    The second thought is about the seriousness thing. These data appear to be not scalar, but ordinal. That is, an offence with a seriousness of 4 can be said to be more serious than one rated 2, but it is not twice as serious. If this is the case, then I think their use in parametric tests such as OLS is a little dubious.

    • Yep. We caught that (the mislabeled axes) earlier this morning. Good eye. They were meant to be labeled “sentence in days” as referenced in the text, and I hear you on the seriousness levels. However, as you may have noticed, this article is primarily an intro to OLS for attorneys. So some of the choices I made were made with that in mind. I really wanted to show what happened when you pushed things to their limits and took a roundabout path to the final work-product, as this is closer to lived experience. So I leaned heavily on the idea of good enough (i.e., all models are wrong, some are useful; does this get us any closer to answering our question?).

  • Robert Flight

    David, very cool, and love how you essentially “narrate” the decisions you make and why you make them; we need more worked examples like this.

    One caution about the p-values. You have a *lot* of points, 220,000; I would be surprised if you didn’t get significant p-values when regressing against that many points. I would be interested to see how the fits and p-values change when using a series of bootstrapped samples from the data on the order of 200-400 points.

    • Thank you. This piece is more an introduction to regressions for lawyers than anything else, and I wanted to do something different than the usual presentation of a fully formed model with no lead up. And ya, there’s a lot of data points. You know, if you want to play around, you can find everything you need over here: http://bit.ly/VA_court_research I’d love to see what you come up with.

      • Robert Flight

        I unfortunately already have lots of projects, and things to work on, and multiple drafts of posts for my own blog that are languishing. But maybe someday ….


        • I hear you. Oh how I hear you. Thanks for the link and the feedback. We’ll see who gets to it first. It will be like a really slow race. :)

  • Emily Rosenzweig

    Very very interesting. But one question — was mean income by zipcode not available disaggregated by race? Given that we know those two variables are correlated at a national level, it seems like using average income for the defendant’s own race would be important…

    • That information wasn’t immediately available in the dataset I was working with, but that along with a number of other tacks are ripe for additional work.

  • Claudia Johnson

    Thank you! This is what we need to do to solve the Access to Justice Crisis in this country–generate hypotheses from data, then use data to test those hypotheses (part II). Thanks also for the correlation-is-not-causation reminder–a lot of highly educated people don’t get that. And yes, also thanks for the analysis on race. It matters, as you show here. Now let’s do this for other questions/issues, so we can focus on the true reasons for injustice and not just on those we believe due to our own implicit biases. Let’s debunk the myths using these approaches in a collaborative way.

  • Walt French

    Thanks for such an open, accessible description of Big Data at Work. And for getting into the basic topic. I hope to see your follow-up work.

    I wonder whether you could use your talents in translation & stat to re-state the published coefficients as standardized, beta coefficients, i.e., state them something like “how much a typical step up in [income] increases (decreases) sentence in percent terms.” Of course, income is the biggest outlier in terms of its range, and also one of the most interesting issues.

    I wonder also whether there’s enough data that separate models, or perhaps some restructuring, might help in understanding the impact of different seriousness levels. Income effects might very well be less significant in low-seriousness cases, since wealthier defendants have more reason/ability to hire big guns lawyers for the worst cases, while less important charges are obviously not worth spending as much on.

  • umabird

    Would be interesting to see a similar study on Family Court cases for both, ones that settle pre-trial and ones that go to trial. Do you know if there are any out there?

  • Bryan Scheiderer

    David, thank you for this. I would like to analyze data from my state’s justice system in the same manner. So, how does a J.D. get the math/stats background necessary? Do you have any thoughts on the data science programs offered by Coursera/Udacity? Thanks,

  • Alex

    How good of a proxy for income is ZIP code? The problem I could see is that in many areas (I’m no expert on VA) poor, mainly black zip codes aren’t the same as poor, mainly white ones. So zip code could covary with race, or the courthouses in those zip codes may have harsher courts (for example, rural poor areas compared to urban ones). Not sure how the data would play out, but I am curious.