TL;DR: If you like issue spotting, play this “game” to help AI improve access to justice by training algorithms to help connect folks with legal services.

It is a truth universally acknowledged, that the majority of those in possession of legal problems, remain in want of solutions.1 Also, ROBOTS! Ergo, we should throw AI at A2J. There is considerably less consensus, however, on how (or why exactly) this should be done. But don’t worry! There’s an app/game for that, and it lets you train artificial intelligence to help address access-to-justice issues. We’ll get to that in a minute. But first, some background.

Machine Learning & Access to Justice, Together at Last

Machine Learning, the subdiscipline within AI around which the current hype cycle revolves, is good at pattern recognition. Acquaint it with a sufficiently large number of example items, and it can “learn” to find things “like” those items hiding in the proverbial haystack. To accomplish such feats, however, we have to satisfy the machine’s need for data—BIG data. Consequently, AI’s appetite is often a limiting factor when it comes to deploying an AI solution.

Image: “The Supreme Court” by Tim Sackton is licensed CC BY-SA 2.0. The image has been modified to include “big Data,” with inspiration from Josh Lee.

Let’s consider two areas where AI’s pattern recognition might have something to offer A2J. Services like ABA’s Free Legal Answers try to match people with legal questions to lawyers offering pro bono limited representation (think free advice “calls” over email). Unfortunately, some questions go unclaimed. In part, that’s because it can be hard to match questions to attorneys with relevant expertise. If I’m a volunteer lawyer with twenty years of health law experience, I probably prefer fielding people’s health law questions while avoiding IP issues.

To get health law questions on my plate and IP questions on someone else’s, a user’s questions need to be (quickly, efficiently, and accurately) labeled and routed to the right folks. Sure, people can do this, but their time and expertise are often better deployed elsewhere, especially if there are lots of questions. Court websites try to match users with the right resources, but it’s hard to search for something when you don’t know what it’s called. After all, you don’t know what you don’t know. Complicating matters further, lawyers don’t use words like everyone else. So it can be hard to match a user’s question with a lawyer’s expertise. Wouldn’t it be great if AI’s knack for pattern recognition could spot areas of law relevant to a person’s needs based on their own words (absent legalese), then direct them to the right guide, tool, template, resource, attorney, or otherwise? That’s what we’re working towards here.

"All models are wrong, but some are useful." -George Box 1. A model's output should start, not end, discussion. 2. Always ask, "compared to what?"

I include a version of this slide in just about every data science talk I give. You can find the map background image here.

I know what you’re thinking, but we are NOT talking about a robot lawyer. When we say “AI,” think augmented intelligence, not artificial intelligence. What we’re talking about is training models to spot patterns, and it’s worth remembering the sage advice of George Box, “all models are wrong, but some are useful.” Consequently, one must always consider two things before deciding to use a model: First, does the model improve on what came before? Second, is it starting a discussion (not ending it)? Unless the data are pristine and the decision is clear-cut, a model can only inform, not make, the decision.

Something like an automated issue spotter has the potential to improve access to justice simply by making it a little easier to find legal resources. It doesn’t need to answer people’s questions. It just needs to point them in the right direction or bring them to the attention of someone in a position to help. It can get the conversation started by making an educated guess about what someone is looking for and jumping over a few mundane—but often intimidating—first steps.

But at least two problems stand between us and realizing this dream. If we’re going to map lay folks’ questions to issues using machine learning, we’re going to need a list of issues and a boatload of sample questions to train our models. As if this wasn’t enough, those examples need to be tagged or labeled with the right issues. Unfortunately, we are unaware of any appropriately-labeled public dataset. So we’ve decided to help birth one.

Who’s “we” you ask? A collaboration of Suffolk Law School’s Legal Innovation and Technology (LIT) Lab (bringing the data science) and Stanford Law School’s Legal Design Lab (bringing the design chops), with funding from The Pew Charitable Trusts.

Learned Hands: An Introduction to Our Project

Image by Margaret Hagan.

So AI can help address an A2J need but only if someone has the resources and expertise to create a taxonomy, read a bunch of text, and (correctly) label all the legal issues present. This is where you, dear reader, can help.

The Access to Justice & Legal Aid Taxonomy

Stanford’s Legal Design Lab has taken the lead on creating a taxonomy of legal help issues based on existing ones. Eventually, service providers will be able to match their offerings to the list, and AI can pair the general population’s questions with the appropriate label or tag within the taxonomy. Heck, AI could even help service providers match their resources to the taxonomy, serving as a translator on both sides. Either way, the taxonomy will provide a standard nomenclature to help coordinate A2J work across the community. Setting standards is hard, but it’s the sort of foundational work that can pay big dividends. In short, we’re building Version 1.0 and looking for your input. If that appeals to you, give this description of the work/call for input a look and make yourself heard.

Help AI Address Access to Justice

Now we just need tens of thousands of legal questions to feed the machine, and each one must be tagged with items from the taxonomy. Luckily, people publicly post their legal questions all the time. Tens of thousands are available over at r/legaladvice. The moderators and forum rules work to ensure that these posts lack personally identifying information, and all questions are posted with the expectation that they will be published to the front page of the internet, as Reddit calls itself. This makes them unique because, unlike questions posted on sites like ABA Free Legal Answers, their authors understand them to reside in an explicitly public space. Although they haven’t been mapped to our taxonomy, their public nature leaves open the possibility that an army of citizen issue spotters (that’s you) could read through them and label away.

One can download these questions using the Reddit API, but moderators at r/legaladvice were kind enough to share their own repository of nearly 75,000 questions in the hopes they could help jump-start our work. Thanks especially to Ian Pugh and Shane Lidman for facilitating our work with the Reddit Legal Advice community.

The Game: Labeling Texts

To help label our growing collection of texts, we’ve created an online game in the hope that many hands will make light work. So, of course, we call it Learned Hands.2

The game presents players with a selection of laypeople’s questions and asks them to confirm or deny the presence of issues. For example, “Do you see a Health Law issue?” We then combine these “votes” to determine whether or not an issue is present. As you can imagine, deciding when you have a final answer is one of the hard parts. After all, if you ask two lawyers for an opinion, you’ll likely get five different answers.

We decide the final answer using statistical assumptions about the breakdown of voters without requiring a fixed number of votes. Effectively, if everyone agrees on the labeling, we can call the final answer with fewer votes than if there is some disagreement. Consequently, the utility of the next vote changes based on earlier votes. We use this to order the presentation of questions and make sure that the next question someone votes on is the one that’s going to give us the most information or move us closest to finalizing a label. This means we don’t waste players’ time by showing them a bunch of undisputed issues.
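To make the idea concrete, here is a minimal sketch of one way a stopping rule without a fixed vote count could work. It is not the game's actual rule, which this post doesn't spell out. It assumes a uniform prior over the share of voters who would say yes, and the 0.95 threshold and the function names are hypothetical.

```python
from math import comb

def prob_majority_yes(yes, no):
    """Posterior probability that more than half of all voters would say yes.

    Assumes a uniform Beta(1, 1) prior, so the posterior is
    Beta(yes + 1, no + 1). For integer parameters its tail above 0.5
    equals a binomial sum, which keeps this in the standard library.
    """
    n = yes + no + 1
    return sum(comb(n, j) for j in range(yes + 1)) / 2 ** n

def label_status(yes, no, threshold=0.95):
    """Return "yes", "no", or "undecided" (threshold is a hypothetical cutoff)."""
    p = prob_majority_yes(yes, no)
    if p >= threshold:
        return "yes"
    if 1 - p >= threshold:
        return "no"
    return "undecided"

def next_question(tallies, threshold=0.95):
    """Pick the undecided question that is closest to being finalized."""
    certainty = {}
    for question, (yes, no) in tallies.items():
        if label_status(yes, no, threshold) == "undecided":
            p = prob_majority_yes(yes, no)
            certainty[question] = max(p, 1 - p)
    return max(certainty, key=certainty.get) if certainty else None

print(label_status(4, 0))  # four unanimous votes: "yes"
print(label_status(4, 1))  # one dissent among five votes: "undecided"
```

Under these assumptions, four unanimous votes are enough to call a label, while a single dissent means more votes are needed, which matches the behavior described above.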

You earn points based on how many questions you mark (with longer texts garnering more points). Players are ranked based on the points they’ve earned multiplied by their quality score, which reflects how well their markings agree with the final answers. Specifically, we’re using a measure statisticians call the F1 Score.
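For the curious, the F1 score is the harmonic mean of precision and recall. The sketch below shows the standard calculation and the points-times-quality ranking described above. The function names and the example numbers are made up for illustration, and the game's real scoring code may differ in its details.

```python
def f1_score(player_marks, final_answers):
    """F1 for one player: True means "issue present" in both lists."""
    pairs = list(zip(player_marks, final_answers))
    tp = sum(1 for mark, final in pairs if mark and final)
    fp = sum(1 for mark, final in pairs if mark and not final)
    fn = sum(1 for mark, final in pairs if not mark and final)
    if tp == 0:
        return 0.0  # no correct "yes" marks, so precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def ranking_score(points, player_marks, final_answers):
    """Points multiplied by the quality score, as described in the post."""
    return points * f1_score(player_marks, final_answers)

# Hypothetical player: one false alarm, no misses.
marks = [True, True, False, True]
finals = [True, False, False, True]
print(round(f1_score(marks, finals), 2))            # 0.8
print(round(ranking_score(100, marks, finals), 1))  # 80.0
```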

That’s right. You can compete against your colleagues for bragging rights as the best issue spotter (while training AI to help address A2J issues). After all, we’re trying to have this game go viral. Please tell all your friends! Also, it works on both your desktop and your phone.

Desktop and mobile screenshots.

Eventually, we will make different flavors of the labeled data available to researchers, developers, and entrepreneurs free of charge in the hopes that they can use the data to create useful tools in the service of A2J (for example, we may publish a set where the labels correspond to a 95% confidence level and another where the labels are just the current “best guess”). Not only could such datasets serve to help train new issue spotting models, but ideally, they could serve as a tool for benchmarking (testing) such models. See Want to improve AI for law? Let’s talk about public data and collaboration.

We’re also seeking private data sources for secure in-game labeling by users agreed upon by those providing the data (e.g., their own employees). By including more diverse datasets, we can better train the algorithms, allowing them to better recognize problems beyond those faced by Reddit users. Although we’ll be unable to publicly share labeled private data, we will be able to share the models trained on them, allowing the larger A2J community to benefit while respecting client confidence.

For the record, although this game’s design was a collaboration between the LIT and Legal Design Labs, Metin Eskili (the Legal Design Lab’s technologist) is responsible for the heavy lifting: turning our ideas into functional code. Thanks Metin.

Active Learning

We will also use a process called active learning. Basically, once we reach a critical mass of questions, we train our machine learning models on the labeled data as it comes in. We then point our models at the unlabeled questions looking for those they’re unsure of. We can then move these questions to the top of the queue. In this way, the models gain insights they need to parse “confusing” examples. Again, the idea is not to do more labeling than necessary. It just makes sense to skip those questions our algorithms are pretty sure about.
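A common way to find the questions a model is "unsure of" is uncertainty sampling: rank unlabeled items by how close the predicted probability sits to 0.5. The sketch below assumes a model that outputs a single probability per question. The question IDs and probabilities are invented, and this is an illustration of the general technique, not our production queue.

```python
def most_uncertain_first(predicted_probs):
    """Order question IDs so the least certain predictions come first.

    predicted_probs maps a question ID to the model's estimated
    probability that a given label applies (hypothetical input format).
    """
    return sorted(predicted_probs, key=lambda q: abs(predicted_probs[q] - 0.5))

# Invented model output for four unlabeled questions.
probs = {"q1": 0.97, "q2": 0.52, "q3": 0.10, "q4": 0.40}
print(most_uncertain_first(probs))  # ['q2', 'q4', 'q3', 'q1']
```

Questions like q1, where the model is already confident, sink to the bottom of the queue, so human labeling effort goes where it teaches the model the most.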

Proof of Concept

Here at Suffolk’s LIT Lab, we’ve started training algorithms on a pre-labeled private dataset. The early results are promising, or as I like to say, “not horrible.” As I’ve explained elsewhere, accuracy is often not the best measure of a model’s performance. For example, if you’re predicting something that only happens 5% of the time, your model can be 95% accurate by always guessing that it’s not going to happen. It can be hard to say what makes a good model (aside from perfection), but it’s pretty easy to spot when a model’s bad. All you have to do is play through some scenarios.3 To keep it simple, we’ll assume a binary prediction (e.g., yes or no).

If a coin flip can beat your predictions, your predictions are horrible. Your accuracy better beat 50%.

If always guessing yes or no can beat your predictions, your predictions are horrible. Your accuracy must be better than the fraction of the majority answer (like in the 95% accuracy example above).

If you’re looking for Xs and you miss most of the Xs in your sample, your predictions are horrible. So your recall has to be greater than 0.5.

If you’re looking for Xs, and less than half of the things you call Xs are actually Xs, your predictions are horrible. So your precision has to be greater than 0.5.

Using these guideposts, we know a classifier is “not horrible” when it beats both a coin flip and always guessing yes or no. If it says something is X, it better be right most of the time, and across the entire dataset, it must correctly identify more than half of the Xs present.
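The four guideposts translate directly into a short check. This is a sketch under the binary yes/no assumption above, with 1 meaning "X is present". The function names are mine, and the sample data are made up.

```python
def summarize(y_true, y_pred):
    """Compute accuracy, the always-guess-the-majority baseline, precision, and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    n = len(pairs)
    positives = tp + fn
    return {
        "accuracy": (tp + tn) / n,
        "majority_baseline": max(positives, n - positives) / n,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

def not_horrible(y_true, y_pred):
    """True only if the predictions clear all four guideposts."""
    m = summarize(y_true, y_pred)
    return (
        m["accuracy"] > 0.5                        # beats a coin flip
        and m["accuracy"] > m["majority_baseline"]  # beats always guessing yes or no
        and m["precision"] > 0.5                    # most things called X are X
        and m["recall"] > 0.5                       # most Xs are found
    )

# The 5% example: always guessing "no" is 95% accurate but still horrible.
rare = [1] + [0] * 19
print(not_horrible(rare, [0] * 20))  # False
```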

Below, I’ve included some summary statistics for one of our tentative models trained on pre-labeled private data. As you can see, it’s not horrible—accuracy beats always guessing yes or no, and precision and recall beat 0.50. There are some other nice data points in there (like AUC), but we won’t highlight those here (their descriptions are beyond the scope of this post). In the end, “not horrible” is just an extension of the idea that a model should be an improvement on what came before. In this case, “what came before” includes coin flips and always guessing yes or no.

A snapshot of private data testing results.

As you’d expect, our models are getting better with more data. So we’re really excited to see what happens when a bunch of folks start labeling. Also, it’s worth noting that we are starting with high-level labels (e.g., family law and housing). Over time, we will be including more granular labels (e.g., divorce and eviction).

How Does This All Work? (A Slightly-Technical Description)

Text classification isn’t as complicated as you might think. That’s mostly because the algorithms aren’t really reading the texts (at least not the way you do). To oversimplify a common text-classification method called bag-of-words, one creates a list of words found across all texts and then represents each document as a count of words found in that document. Each word count is treated as a dimension in a vector (think “column in a list of numbers”). After looking at all the data, one might notice that questions about divorce always have a value greater than or equal to three for the dimension associated with the word “divorce.” In other words, divorce-related questions always contain the word “divorce” at least three times. So it is possible to describe questions about divorce by referring to their vectors.

Put another way, every text whose divorce dimension is three or greater goes into the divorce category, and every other text goes into the not-divorce category. This isn’t a very realistic example, though, because document types aren’t often like Beetlejuice (say the magic word three times and they appear). Still, it is reasonable to assume there is a constellation of keywords that help define a document type. For example, maybe the chance that a question is housing related goes up when the query uses words like landlord, tenant, or roommate. Larger values across those dimensions, then, are correlated with housing questions. You can (of course) get more nuanced and start looking for n-grams (couplings of two, three, or n words) like best interest, while ignoring common words like and. But the general method remains the same: we throw the words into a bag and count them.
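Here is roughly what throwing the words into a bag and counting them looks like in code. It is a bare-bones sketch using only Python's standard library. The tiny stop-word list and the sample questions are invented for illustration, and real pipelines typically lean on a library for this step.

```python
import re
from collections import Counter

# A deliberately tiny, made-up list of common words to ignore.
STOP_WORDS = frozenset({"and", "the", "a", "my", "is", "to"})

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(texts, stop_words=STOP_WORDS):
    """Return (vocabulary, vectors), one count vector per text."""
    counts = [Counter(w for w in tokenize(t) if w not in stop_words) for t in texts]
    vocab = sorted(set().union(*counts))
    vectors = [[c[word] for word in vocab] for c in counts]
    return vocab, vectors

questions = [
    "My landlord is ignoring my roommate and my landlord",
    "Is divorce the answer to divorce",
]
vocab, vectors = bag_of_words(questions)
print(vocab)    # ['answer', 'divorce', 'ignoring', 'landlord', 'roommate']
print(vectors)  # [[0, 0, 1, 2, 1], [1, 2, 0, 0, 0]]
```

Each position in a vector is one dimension. The first question has a 2 in the landlord dimension, and the second has a 2 in the divorce dimension.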

More sophisticated approaches, like word2vec, employ different methods for converting text to vectors, but without getting too far in the weeds we can generalize the process of text classification. First, you turn texts into numbers embedded in some multi-dimensional space. Then you look for surfaces in that space that define borders between different text groupings with different labels. This, of course, relies on different text types occupying different regions in the space after they are embedded. Whether or not these groupings exist is an empirical question (which is why it’s nice to see not horrible output above). The data help us think success is an option.
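To illustrate "looking for surfaces," here is a toy example that learns a flat surface (a line, in more dimensions a plane) separating housing questions from everything else. It uses a perceptron, one of the simplest classifiers, as a stand-in; it is not the stacked models we actually use. The three dimensions (counts of landlord, tenant, and divorce) and the four training vectors are invented.

```python
def score(weights, bias, vector):
    """Signed position of a vector relative to the learned surface."""
    return sum(w * x for w, x in zip(weights, vector)) + bias

def predict(weights, bias, vector):
    """1 if the vector falls on the "housing" side of the surface, else 0."""
    return 1 if score(weights, bias, vector) > 0 else 0

def train_perceptron(vectors, labels, epochs=20):
    """Nudge the surface every time a training example lands on the wrong side."""
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for vector, label in zip(vectors, labels):
            error = label - predict(weights, bias, vector)
            if error:
                weights = [w + error * x for w, x in zip(weights, vector)]
                bias += error
    return weights, bias

# Made-up word counts: [landlord, tenant, divorce]; label 1 = housing.
vectors = [[2, 1, 0], [1, 2, 0], [0, 0, 3], [0, 1, 2]]
labels = [1, 1, 0, 0]
weights, bias = train_perceptron(vectors, labels)
print(predict(weights, bias, [1, 1, 0]))  # 1 (housing side)
print(predict(weights, bias, [0, 0, 2]))  # 0 (not housing)
```

This only works because the two groups occupy different regions of the space, which is exactly the empirical question raised above.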

Google’s Machine Learning Crash Course on Text Classification provides a good high-level introduction for those interested in the technology. Our workflow tracks with much of their description, although there are some differences. For example, we’re using over- and under-sampling for unbalanced classes and stacking various models. Don’t worry, we’ll eventually write everything up in detail. Here’s the point, though: we aren’t pushing the state of the art with these classifiers. We’re sticking with time-tested methods and producing a publicly-labeled dataset. We’d love to see this labeled dataset feeding some cutting-edge work down the road, and if you can make a compelling demonstration for how your novel method could make better predictions, we’re open to taking your model in-house and training it on our private datasets (assuming you commit to making the trained model free and publicly available). After all, many hands make light work. Tell your friends! Heck, let’s make it super simple. Just share this tweet as often as you can:

And don’t forget to play Learned Hands during your commute, over lunch, or while waiting in court.


  1. My apologies to both Jane Austen and the Legal Services Corporation’s 2017 Justice Gap Report.

  2. This is wordplay riffing on the name of an eminent American jurist, Learned Hand. I’m sorry I felt compelled to explain the joke, but here we are… in a footnote. :/  

  3. In practice, one needs to think carefully about the costs of things like false positives and false negatives. Sometimes you’ll have a preference for one over the other, but we’re not going to get that nuanced here. 
