How to Build a Law Bot

This is the final post in a series on learning to code. If you missed the first two, I suggest you start at the beginning.

If you’ve ever found yourself obsessively hitting refresh on a website, you understand the motivational power of FOMO (fear of missing out). Be it bar passage information or changes to the court calendar, sometimes you can’t help but check, and check … and check.

Recently, I hit bottom, constantly jumping between half a dozen browser tabs refreshing various election forecasts. That is, until I built a bot to outsource my obsession: @meanvoter. Today, I’ll show you how to build your own cyber sentinel and banish FOMO.[1]

As an exercise, painting by numbers makes it clear that anyone can paint. That is, if you can write your name, you have the motor control needed to put brush and paint to canvas. The trick to being a painter is in developing your eye, not only for what to paint but for how to translate this vision into strokes on the canvas. Our last exercise proved the point that you can code, in that you have all the tools necessary to make and run a program. As with painters, the trick to becoming a coder is in developing your eye, learning how to decompose a complex problem into easy-to-implement steps.

In contrast to our last exercise, we’ll be taking a closer look at the why behind some of our choices, building on the foundation of nomenclature and experience you’ve gained over the past weeks. That being said, I’ll be referencing topics and tasks from our last two posts. So you’ll want to be sure you’re caught up.

After building your first bot, we’ll review what we’ve learned and how you can continue your journey, be it as a Disruptor, Pragmatist, or Liberal Arts Major.

Scope

Here’s what we want our system to do: Report on changes to information found on one or more websites. In our example, we’ll be pulling data from two election forecast websites, combining that data to get an average and reporting it out via Twitter. This is a subset of what @meanvoter does. I’ve limited this post to a description of this subset because reproducing the full functionality of @meanvoter requires more than one needs to introduce the ideas behind building your own bot. You can, however, find the full source code for @meanvoter on its GitHub page.

What is GitHub? GitHub is an online community for people looking to share computer code. Last time, you may have noticed that gspread had a GitHub page. GitHub is really big in the open source space. In fact, when I wanted to learn how to use gspread, I started with the instructions found on GitHub.

Web Scraping

Web scraping is the process of automatically extracting information from websites, usually by transforming unstructured human-readable data into a structured machine-readable form (e.g., something we could put in a spreadsheet).

When I visit Samuel Wang’s election forecast site, I can see at a glance that he has two models predicting Clinton’s chance of winning. As of the writing of this piece, they have her chances at 84% and 91% respectively.

If you were going to watch the site for me and text me changes to the predictions, I wouldn’t have to give you many instructions. “Tell me if the predictions change” is more than enough. But absent an understanding of the English language, how is a computer program expected to do this?

If only there were a way to pick out the two-digit number following the words random drift and Bayesian. Of course, before we get to this, we should ask, “What does a program see when it visits a webpage?”

HTML

Web pages are written in Hypertext Markup Language (HTML). As a markup language, it uses tags (little bits of shorthand) to tell browsers how they should display information. You can see this page’s HTML by right clicking on this page and choosing “View Page Source” or similar. It will look something like this:

[Screenshot: the opening lines of this page’s HTML source]

You can probably start to make out what’s going on. Text contained inside angle brackets (e.g., <title>) seems to provide formatting information. Most of the time, when our program reads a web page, this raw HTML is what it sees. So it is here we need to look for information to extract. For example, if we wanted to know the title of a page (the text that appears in the browser tab or when you share on social media), we’d just have to find the text between <title> and </title>. Note: It appears again encased in header tags (i.e., <h1 class="headline entry-title">How to Build a Law Bot</h1>).

[Screenshot: the headline’s h1 tag in this page’s HTML source]

This is the title as displayed on the page. If we wanted to find the author, we need only look for the byline following the title. It’s not hard to spot. It’s the text not in brackets following the word “By” directly below the title.
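Pulling out the text between two known tags takes nothing more than ordinary string searching. Here is a minimal sketch in Python. The HTML snippet is a made-up stand-in for a real page’s source, and the function name text_between is my own invention for illustration.

```python
# A made-up scrap of HTML standing in for a real page's source.
html = """<html>
<head><title>How to Build a Law Bot</title></head>
<body><h1 class="headline entry-title">How to Build a Law Bot</h1></body>
</html>"""


def text_between(source, start, end):
    """Return the text between the first `start` marker and the next `end` marker."""
    begin = source.find(start)
    if begin == -1:
        return None  # start marker not found
    begin += len(start)
    finish = source.find(end, begin)
    if finish == -1:
        return None  # end marker not found
    return source[begin:finish]


page_title = text_between(html, "<title>", "</title>")
print(page_title)
```

This only works because we know the exact markers on either side. The h1 tag, with its class attribute, is already harder to match letter for letter.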

Now if only there were a way to tell a computer to look for patterns like this.

Regular Expressions (Regex)

Often described as find and replace on steroids, regular expressions are like magic. Instead of searching for an exact match, regular expressions (often called regex) allow you to search for patterns: say, a set of numbers followed by a space, the letters “U.S.”, another space, a second set of numbers, one more space, and a year inside a parenthetical. Where have you seen that before?[2] That’s right, SCOTUS nerds; I just described a citation to the United States Reports. Imagine the possibilities. For the record, the above regex would look like this:

(\d+ U\.S\. \d+ \(\d{4}\))
  • \d looks for numerical digits.
  • + modifies \d to indicate that we’re looking for anywhere from one to an infinite number.
  • \ is an escape character, making \., \(, and \) search for ., (, and ) respectively.
  • {4} modifies \d to make clear we’re looking for a block of four digits together.
  • The outermost parentheses define a group, and it is what the regex finds in that group that it will report back as a match.

If you were to run the above regex on a block of text with citations to the United States Reports, you would get a match for each citation. In this way, a program can parse text to find specific content.
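To see this in action outside of a web tool, here is a short sketch using Python’s built-in re module. The pattern is the one shown above; the sample sentence is something I wrote for the occasion, and findall is just one of several ways to collect matches.

```python
import re

# The citation pattern from above, written as a raw string so the
# backslashes reach the regex engine untouched.
CITATION = re.compile(r"(\d+ U\.S\. \d+ \(\d{4}\))")

text = (
    "See Marbury v. Madison, 5 U.S. 137 (1803), and "
    "Brown v. Board of Education, 347 U.S. 483 (1954)."
)

# findall returns the contents of the group for every match, in order.
citations = CITATION.findall(text)
for cite in citations:
    print(cite)
```

Each item in citations is one full citation, ready to be counted, stored, or looked up.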

Regex isn’t as straightforward as find and replace. There’s a whole vocabulary of symbols (tokens) to learn. But it’s really not that hard. David Zvenyach has a good introduction to regex, but in my opinion, the best way to learn is to play around. I suggest using Regex 101 to craft and run your own regular expressions.

What makes Regex 101 a great tool for beginners is the fact that it explains everything it’s doing with plain English in the right column. This includes a handy key of the most used tokens under QUICK REFERENCE. For a robust lesson on regular expressions, check out Pattern Matching with Regular Expressions.

Over at Regex 101, you can type a regex (e.g., (\d+ U\.S\. \d+ \(\d{4}\))) into the REGULAR EXPRESSION field at the top of the screen, place some text in the TEST STRING box, and it will find your match(es). Hint: to find multiple matches, put a g in the second text field under the REGULAR EXPRESSION heading.[3]

If you have a website in mind as your bot’s focus, cut and paste its HTML into the TEST STRING box and see if you can write a regex to pull out what matters to you. Trust me. It’s fun.

Source: Regular Expressions from xkcd.

Legality

What follows is not legal advice, just legal information. There probably isn’t a copyright claim associated with web scraping (at least in the US) as long as you aren’t lifting large bits of text from a site. After all, web scraping is the basis of all internet search, and you can’t copyright facts. However, companies continue to make novel arguments around unauthorized access, claiming that a bot scraping their site against their wishes should count as a violation of the Computer Fraud and Abuse Act (CFAA), which could make it a federal crime. So it’s worth asking the question, “Has the owner of this site tried to forbid scraping?” There are at least two places you should look: (1) the site’s robots.txt file, a text file declaring what parts of a website are off limits to robots, which is what your little bot will be; and (2) the site’s terms of service. Also, you should avoid swamping a server with a large number of visits, as this could be mistaken for a denial-of-service (DoS) attack. There’s no need for your bot to hit the metaphorical refresh button every minute.
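You don’t have to read robots.txt by eye. Python’s standard library includes a parser for it. The sketch below feeds that parser a made-up robots.txt, so nothing here touches the network; the rules, the example.com addresses, and the bot name are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt. A real one lives at the root of a site,
# e.g. https://example.com/robots.txt.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

BOT_NAME = "my-law-bot"  # hypothetical name for our bot

# can_fetch answers: may this bot visit this address?
allowed = parser.can_fetch(BOT_NAME, "https://example.com/forecast.html")
blocked = parser.can_fetch(BOT_NAME, "https://example.com/private/notes.html")
print(allowed, blocked)
```

For a live site, the parser’s set_url and read methods will fetch the file for you. Keep in mind that passing this check tells you nothing about the terms of service, which you still have to read yourself.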

Think Big

Although our current concern is with plucking individual facts from a few pages, one can build web scrapers to amass large collections of data. Earlier this year, Ben Schoenfeld set up a scraper to collect millions of Virginia court records.

I then used this data to explore questions of bias in the courts.

So feel free to think big.[4] After all, Google is built on the back of web scraping. Just make sure you think before you scrape.

Databases

If our bot is going to report on changes, it will have to remember what things looked like in the past. So we’ll need a place to store these memories: the data we extract through web scraping. This is what coders call a database. A database is just an organized collection of data. As you might imagine, there are many types of databases, but customarily data gets structured into tables.[5] So when you want a specific bit of data, you retrieve it based on its position in a table, perhaps by referencing its row and column. In fact, people often talk about databases in much the same way as they do spreadsheets.
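To make the row-and-column idea concrete, here is a tiny sketch in which a plain Python list of rows stands in for a table. This is not how gspread or any real database is called; the column names, row labels, and the second reading are made up, and the helper name cell is my own.

```python
# A list of rows standing in for a spreadsheet. The first row holds the headers.
table = [
    ["check", "random_drift", "bayesian"],
    ["first", 84, 91],
    ["second", 85, 91],  # made-up second reading
]


def cell(table, row, column_name):
    """Look up a value by row number (1 = first data row) and column header."""
    headers = table[0]
    return table[row][headers.index(column_name)]


# Compare the newest reading with the one before it.
latest = cell(table, len(table) - 1, "random_drift")
previous = cell(table, len(table) - 2, "random_drift")
changed = latest != previous
print(previous, latest, changed)
```

A bot’s memory is really just this: write down each reading, then look back one row to see whether anything moved.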

Since we already know how to make a program that talks to spreadsheets, this part should be easy. Go ahead and create a new Google Sheet to serve as a database for your bot. Once you’ve done this, be sure to grab its ID. Follow the same instructions as last time, except you can dispense with the creation of a form. Just pull the ID from the Sheet’s URL.

Twitter

Our bot will need some way to notify us of the changes, and Twitter seems as good a vehicle as any since you can set up alerts and push notifications on your phone. When SCOTUS is hearing oral arguments, I love getting updates of laugh lines in the transcripts via @LOLSCOTUS. Now if only there were some way to have our program talk directly to Twitter.

I think you know where this is going. We’ll need a Twitter account and API access.

If you already know what Twitter handle (username) you want for your bot, go ahead and create a Twitter account. Keep in mind, you can’t change your handle after creating an account, so be sure you choose a great name.[6]

For your program to interface with Twitter, it will need four bits of secret information (credentials): (1) a consumer key; (2) a consumer secret; (3) an access token; and (4) an access token secret.

As with the Google API, you’ll need to create a project/app to get your credentials. First, you’ll need to add a cell phone to your account before Twitter will let you obtain credentials. After you’ve activated your account and cell phone, visit the Application Management page and click Create New.

Complete the resulting form. Note: you can use your Twitter URL as your website (e.g. https://twitter.com/meanvoter).

After the app is created, click on its name.

Once you’re in the app, click on the Keys and Access Tokens tab.

Under Application Settings, you’ll find your consumer key and consumer secret.

And under Your Access Token, you’ll find your access token and access token secret.

Of course, we’ll need a way for Python to talk with the API. So let’s go ahead and load the Python Twitter library. Open your terminal (Mac) or command prompt (Windows). Type pip install python-twitter into the command line and hit the enter/return key.
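Here is a rough sketch of how those four credentials might reach the library. It assumes python-twitter’s Api class and its PostUpdate method behave as that library documents them, and Twitter’s developer rules may well have changed since this was written. Reading the secrets from environment variables is my suggestion for keeping them out of your code, and the variable names are hypothetical.

```python
import os

# Hypothetical environment variable names; use whatever names you like.
CREDENTIAL_VARS = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token_key": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_credentials(environ=os.environ):
    """Collect the four credentials, failing loudly if any are missing."""
    missing = [var for var in CREDENTIAL_VARS.values() if not environ.get(var)]
    if missing:
        raise RuntimeError("Missing credentials: " + ", ".join(missing))
    return {name: environ[var] for name, var in CREDENTIAL_VARS.items()}


def send_tweet(text, credentials):
    """Post `text` using python-twitter (imported here so the rest runs without it)."""
    import twitter  # installed with: pip install python-twitter

    api = twitter.Api(**credentials)
    return api.PostUpdate(text)
```

With the four variables set, calling send_tweet("Hello, world", load_credentials()) should post to your bot’s account. Treat the credentials like passwords and never paste them into anything you share.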

The Nuts & Bolts (The Code)

With what we’ve learned above plus our previous work, we should be ready to build our own bot. In this example we’ll be pulling data from the Princeton Election Consortium and Betfair’s 2016 presidential election forecasts. The Consortium’s robots.txt sets no limits on bot use, and Betfair doesn’t seem to have a robots.txt file. Nor does either site have a terms of service agreement. So there’s nothing stopping us from scraping away.

First download and open this Jupyter Notebook (right/control click to save). Unlike most programs on your computer, you can’t open the notebook by double clicking on the file. Instead, you’ll have to launch Jupyter as you have before, in the command line. Then from inside Jupyter, navigate to the file’s location and click on the notebook link in your browser.

For those of you not coding along at home, you can see a copy of the notebook on GitHub.

You may recognize part of the code from the last post (the part that talks to Google Sheets), but you’ll also notice that I’ve made use of comments (text following # marks that the computer ignores when running a file) while including short instructions before each cell.

If you follow the instructions in the notebook, you should get what you need to build your own bot. Here are the broad strokes. The program:

  • connects to your Google Sheet
  • visits each of the websites and grabs their contents
  • parses the sites’ contents, using regex to pull out the percentage chance Clinton has of winning the election
  • averages the two predictions
  • compares the current values to the last values
  • tweets out any changes to either prediction
  • tweets out the updated average
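The middle steps of that list can be sketched in a few lines. This is not the notebook’s code. The page fragments, regex patterns, site names, and remembered values below are all made up for illustration, and the sketch stops short of talking to Google Sheets or Twitter. It also assumes the average is only worth reporting when one of the predictions has moved.

```python
import re

# Made-up page fragments; the real sites' HTML will differ.
PAGES = {
    "site_a": "<p>Random drift: Clinton win probability 84%</p>",
    "site_b": "<span class='odds'>Clinton 79%</span>",
}

# One hypothetical pattern per site, each with a group around the number.
PATTERNS = {
    "site_a": r"win probability (\d+)%",
    "site_b": r"Clinton (\d+)%",
}


def extract_percent(html, pattern):
    """Pull the first number matching `pattern` out of `html`."""
    match = re.search(pattern, html)
    if match is None:
        raise ValueError("pattern not found: " + pattern)
    return int(match.group(1))


def build_messages(previous, current):
    """Compare old and new readings; return the tweets worth sending."""
    messages = []
    for name, value in current.items():
        old = previous.get(name)
        if old is not None and old != value:
            messages.append(f"{name} moved from {old}% to {value}%")
    if messages:
        average = sum(current.values()) / len(current)
        messages.append(f"Average is now {average:.1f}%")
    return messages


previous = {"site_a": 83, "site_b": 79}  # what the sheet remembered (made up)
current = {name: extract_percent(PAGES[name], PATTERNS[name]) for name in PAGES}
messages = build_messages(previous, current)
for message in messages:
    print(message)
```

The if statements are where the bot makes its decisions: no change, no tweet.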

The notebook goes into more detail. It’s mostly stuff you’ve seen before. However, it will introduce the use of if statements in Python. But don’t worry. If you’ve made it this far, you should be fine. You may have to keep at it for a while. So plow through, and again, you can hit me up for help in the comments or on Twitter @Colarusso.

Once you have your code working properly, save the notebook as a script like you did last time. Then use the task scheduling method from last time to have the bot run every few minutes.

Obviously, the expectation is that you’ll change the websites and regular expressions to make your own bot. And it should go without saying that once your new bot is up and running, you should let me know on Twitter @Colarusso. I can’t wait to share them.

It’s worth remembering that web scrapers are notoriously fragile. For example, if our target pages change their formatting, it could break a regex. So keep this in mind as a maintenance issue and in the development of your regexes.
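One way to soften the blow, sketched below, is to write a regex that tolerates small changes and to have the program notice when it finds nothing at all. The page fragments and the pattern are made up, and this is only one of many defensive habits you could adopt.

```python
import re

# Tolerates extra whitespace and an optional decimal part.
PERCENT = re.compile(r"Clinton\s+(\d+(?:\.\d+)?)\s*%")


def read_percent(html):
    """Return the percentage as a float, or None if the page no longer matches."""
    match = PERCENT.search(html)
    return float(match.group(1)) if match else None


old_layout = "<span>Clinton 84%</span>"
new_layout = "<span>Clinton   84.5 %</span>"
redesigned = "<span>Chance of a Clinton win: eighty-four percent</span>"

for page in (old_layout, new_layout, redesigned):
    value = read_percent(page)
    if value is None:
        # Better to skip a tweet than to post nonsense.
        print("No match; time to fix the regex.")
    else:
        print(value)
```

A small spacing change survives, but a full redesign does not, which is why the None check matters.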

Have fun, and congratulations on building your first bot. This accomplishment is sure to put you in the good graces of our soon-to-be robot masters.

In Review

We’ve covered a lot of information, and no one expects you to recall all of it in perfect detail. These exercises were never intended to turn you into a production coder but rather to provide you with a sense of the possible, some motivational fuel should you like to learn more, and a high-level understanding of some important concepts and tools: namely, the command line, APIs, API keys, open source, programming libraries, scripts, the schtasks & crontab task schedulers, web scraping, HTML, robots.txt, regular expressions (regex), databases, Python, Project Jupyter, Google Sheets & Forms, mail merge, mail merge rules (if-then statements for mail merge), and GitHub.

What Next?

At the start of our journey I said the roads of Disruptors, Pragmatists, and Liberal Arts Majors would quickly diverge, and I fear we’ve come to that fork in the road. By this point I suspect you have a pretty good feel for where you fall on the spectrum. If you’re tempted to say this coding stuff isn’t for you, I’d suggest trying your hand at coding lite with an automation service like IFTTT or Zapier before walking away entirely.[7]

Both of these services take advantage of existing APIs to connect and automate various web services, hiding any “coding” behind user-friendly interfaces. IFTTT (short for If This Then That) makes use of what they call recipes. For example, here’s a simple recipe I made to text me when the MA Courts are closed. See what I did there? I described a service based on API usage without having to define what an API is, and (hopefully) you understood what I meant. That’s the power of nomenclature. Also given the existence of APIs, such services should seem only natural. Suddenly, you have a better understanding of the possible, and you can start to ask what connections are missing. If you can scrape websites for data and autopopulate documents, what’s to stop you from pulling data from multiple sources when building documents, combining public data with your own? Couldn’t a regex sift through a document and give me links to all of the cited cases? It doesn’t matter if your aim is to actually build answers to these questions or simply ponder them in silence. In both cases, you are thinking like a coder. You are thinking like a coder in the same way you are thinking like a scientist when you insist on the importance of evidence.

My work here is done.

If you choose to continue down the path of building your own tools, there is no better way to learn than by doing. My suggestion: find projects you want to see in the world and figure out how to make them happen. Find and dissect projects on GitHub that you find interesting, like this Slack bot for looking up case law. Adapt them to your own purposes (as you did with our bot). Add features. Play.

Of course, that’s not the answer people want to hear when they ask, “How do I learn to code?” So I’ve collected a few would-be next steps. I haven’t used any of these myself, but they come highly recommended by other lawyers who code, and I’ve done my best to give you my impressions after limited review.

One thing they all have in common: the word free.[8]

Additional Resources

  1. Automate the Boring Stuff with Python focuses on real world applications that actually seem relevant to lawyers (esp. the chapters on Excel, Word, and PDF Documents). In this respect, it’s probably the closest spiritual cousin to this series in this list. It comes in various incarnations: free online “book” (with select videos), $25 online class, or physical book. h/t @PaulGowder
  2. Learn Python the Hard Way is a classic. When I ask a group of coding lawyers how they got started, this book always comes up. You can buy a copy for ~$30 or go with the free online version. (The free version doesn’t include the videos, but my understanding is that these are mostly just you looking over the author’s shoulder while he works through projects.) One drawback is the fact that the Hard Way doesn’t address Python 3 (the latest version of Python). Whether or not you read the book, the author’s Advice from an Old Programmer is worth your time. h/t @vdavez (among many)
  3. Codecademy’s freeish online course in Python offers a quick and approachable introduction to programming in Python. The downside is that the topics aren’t intrinsically spicy. So you’ll need to go in packing a strong dose of your own motivation. Plus, you have to pay ~$20/month to unlock quizzes and projects, hence my freeish labeling. h/t @GTeninbaum
  4. Intro to Computer Science (with Python) is a free online course. From the folks over at Udacity, this is a good option for those of you looking for a more general introduction to computer science, and they talk a little about Admiral Grace Hopper, which is always a plus in my book. I also really like Udacity’s use of quizzes and problem sets. You actually run code and test it to see if it works. h/t Bryan Scheiderer (commenter from post one)
  5. Think Python 2e is the most traditional programming book in this list. From the venerable folks over at O’Reilly, this free online book is a comprehensive introduction to Python. That being said, beginners may find it a tad dry. h/t @PaulGowder

Final Words of Warning

Be humble. Chances are you’ll make a lot of mistakes along your journey. To make sure these mistakes don’t adversely affect your clients, always have a backup plan and be sure to institute some form of quality control if you ever decide to start using your work in the real world.

Featured image: “Banksy NYC, Coney Island, Robot” by Scott Lynch is licensed CC. The image has not been modified.


  1. Lawyerly disclaimer: Assuming, that is, that the data you’re looking for is accessible using the methods described here.

  2. The first place I saw regex and citations together was in Chapter One of David Zvenyach’s Coding for Lawyers.

  3. You can use regex to replace as well, but as they say, “That’s beyond the scope of this lesson.”

  4. If you build a scraper to collect such large datasets, however, you should think carefully about how they might be used and how their collection will affect the resource(s) you are scraping. 

  5. FYI, there’s actually a debate about whether this should be the default, and in your travels, you may happen upon an argument or two regarding SQL vs. NoSQL. As with all things, the right answer is “it depends.” For our purposes, however, the SQL tabular model works just fine. 

  6. MeanVoter, get it? Mean means average, and the bot will be averaging forecasts, but mean can also be an adjective describing a person. Clever wordplay, right? 

  7. These are great resources for coders. 

  8. Though many have the option to “upgrade” for cash. 
