Pattern Recognition: Regular Expressions and You

Regular expressions (regex for short) won’t replace associates or paralegals, but they can take a lot off their plate. They sit somewhere between Control-F and the computer on Star Trek.1 They’re a bit finicky about syntax, but once you learn the “magic words,” you can ask them to read a document and return lists of almost any textual pattern imaginable.

What does that even mean? Well, it means you can ask for a list of all nine-digit numbers found in the document, specifically those with non-numeric separators after the third and fifth digits. You know, anything that looks like this:

Social Security Card

Source: Social Security Card from the Social Security Administration.

Spammers use regular expressions to search the web for “words” comprised of text followed by an @ symbol and ending in a domain name. You know, anything that looks like this:

If crosswords are more your speed, and you point regex at a dictionary, you can ask for things like “a list of seven-letter words starting with S and ending in a double F.”

And the coup de grâce: in addition to finding patterns, you can find and replace patterns. So every instance of text including a set of numbers followed by a space, the text “U.S.C.”, another space, a § mark, another space, a set of numbers, and optionally, a year inside a parenthetical, can be replaced with a link (e.g. 17 U.S.C. § 107). That’s right, you can put your Bluebook skills to use automatically creating links to statutes.2

Regular expressions take the edge off many monotonous tasks related to pattern recognition. That being said, I’m a big believer in learning by doing. So let’s have some fun finding and redacting phone numbers, citing the US Code, and creating a crossword puzzle or two. Don’t worry, you won’t have to install any software. At most, I’ll ask you to pull up a website, and maybe if you’re feeling feisty, you’ll open a Word file to play along.




Will I Really Use This? A Personnel Story

Probably. I used to work in a big bureaucratic organization with several hundred staff members. HR would put out a staff directory every quarter, but they insisted on providing it as a PDF. It had a row for every employee and “columns” indicating their names, job titles, phone numbers, and email addresses. Oh, how I wished that directory were a spreadsheet. Then I could filter rows to get mailing lists for department heads, secretaries, and the like (surprisingly, no one kept such lists). But alas, the directory was a PDF. I could, however, copy the text out of the PDF. The result wasn’t pretty, but with a little regex magic I was able to create a text file with a line for each staff member and commas between their info (e.g., David Colarusso, Cog In the Machine, 555-555-5555). I then loaded this into Excel, and presto! I had my own sortable staff directory, and never again did I have to build mailing lists by hand.

If you put in the time to learn regex, you will use regex. Trust me.

Assemble Your Tools

You’ll find support for regular expressions in many text editors and a lot of programming languages. Heck, even Word has a limited implementation. Here we’ll be discussing two flavors of regex: Perl-like3 and the Word implementation.

If you want to play along at home, open Regular Expressions 101 (regex101) and work through the examples below. Here we’ll explore Perl-like expressions. Depending on your screen size, you’ll see either one, two, or three columns. We care about the one with REGULAR EXPRESSION at the top. By default, this column is subdivided into two rows. The first is where you place your pattern to be matched (REGULAR EXPRESSION). The second is where you place the text over which the regex will search (TEST STRING).


At the far right of the first row you’ll see a set of regex options (flags). If you click on the flag icon, you’ll see a list of possible flags. By default the g (global) flag is present. Its presence means the regex will find all matches, not just the first one. Another commonly used flag is the i (case insensitive) flag. Its presence means the regex doesn’t need the case (capitalization) of text to match for your pattern to match.
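If you ever move from regex101 to code, these two flags have rough counterparts. Here is a minimal sketch using Python’s built-in re module (Python is my pick for illustration; nothing in this post depends on it). The sample sentence is made up.

```python
import re

sample = "Regex, regex, REGEX"  # made-up test string

# re.search stops at the first match, like running without the g flag.
first_only = re.search(r"regex", sample).group()

# re.findall returns every match, like the g (global) flag.
all_matches = re.findall(r"regex", sample)

# re.IGNORECASE plays the role of the i (case insensitive) flag.
any_case = re.findall(r"regex", sample, flags=re.IGNORECASE)

print(first_only)   # regex
print(all_matches)  # ['regex']
print(any_case)     # ['Regex', 'regex', 'REGEX']
```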


Finding Phone (& Social Security) Numbers

Cut and paste the following text into the TEST STRING field.4

01110100 555.867.5309 01101000 1.4142135623 01101001 987-01-6661 01110011 202.555.9355 00100000 01101001 3.1415926535897932384626433832795 01110011 00100000 666-12-4895 01100001 202-555-9355 00100000 01101000 (555) 867-5309 01101001 01100100 2.718281828459 01100100 01100101 01101110 00100000 01101101 555-867-5309 01100101 01110011 01110011 555/867-5309 01100001 01100111 01100101

This string of text is our haystack, and Social Security and phone numbers are our needles. If we know exactly what phone number we’re looking for (e.g., 555-867-5309) it’s just like control-F. All we do is place the number in the REGULAR EXPRESSION field.


When you place a character in your pattern and it matches itself, it is called a “literal character” because it literally matches itself. The digits 0-9 by themselves are literal characters and - by itself is a literal character. In addition to literal characters you can combine some characters into tokens (metacharacters) to represent a class of characters, and occasionally single characters take on such roles. For example, [0-9] matches any digit between 0 and 9. So [0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9] matches our phone number and any other phone number making use of hyphens as separators.


Now you may be thinking that [0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9] is a bit unwieldy, and I’d agree. You can modify a token by declaring how many times you would like it to reoccur. Our unwieldy friend can be rewritten as [0-9]{3}-[0-9]{3}-[0-9]{4} where {n} is a modifier and n is the number of times the preceding character/metacharacter should occur.
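To convince yourself the long and short forms really are the same pattern, here is a small sketch in Python’s re module. It assumes a shortened excerpt of the test string, just to keep things readable.

```python
import re

# A shortened excerpt of the test string from above.
haystack = (
    "01110100 555.867.5309 01101000 987-01-6661 01110011 "
    "202-555-9355 00100000 (555) 867-5309 01101101 555-867-5309 "
    "01100101 555/867-5309 01100001"
)

long_form = r"[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]"
short_form = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"

long_hits = re.findall(long_form, haystack)
short_hits = re.findall(short_form, haystack)

print(short_hits)               # ['202-555-9355', '555-867-5309']
print(long_hits == short_hits)  # True
```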


Excitingly, the above works in Word as well. However, you do have to make sure that wildcards are turned on.


But what about all those people who don’t use hyphens? How do I find phone numbers of the form 202.555.9355? Obviously, you could do a second search with period spacers, but maybe we could replace the hyphens with a wildcard. According to Word’s documentation, the question mark matches “Any single character, including space and punctuation characters.” So let’s see what happens if we try [0-9]{3}?[0-9]{3}?[0-9]{4}


Promising, but not quite right. What’s with 1415926535897932384626433832795 0111? It would seem that our wildcard matched numbers as well as hyphens and periods, leading to some issues. Again, we could do two searches, one with the hyphens as separators and one with periods, but here we start to meet the limits of what’s possible in Word. If you want to do more in Word, I recommend reading its regex documentation.

For now, we’re going to shift back to the more robust Perl-like syntax and regex101. It’s worth noting that the question mark has a different meaning in Perl-like regex (more on that in a bit). The generic single-character wildcard is a period. So [0-9]{3}.[0-9]{3}.[0-9]{4} is equivalent to the Word pattern above.


What I’d really like to do is to build a pattern that searched for hyphens or periods. We know that the period is not a literal character, that is, it doesn’t match one-to-one with other periods. When regex makes a single character into a special character you can escape it by placing a \ in front of the offending character, causing it to act like a literal character. Consequently, \. will literally match a period.
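Here is a quick sketch of the difference between a bare period and an escaped one, written with Python’s re module and a short excerpt of the test string. Notice how the wildcard version happily chews through pi.

```python
import re

# Excerpt of the test string: a phone number, pi, and some binary.
excerpt = "202.555.9355 3.1415926535897932384626433832795 01110011"

# An unescaped period is a wildcard, so digits can act as "separators".
wildcard_hits = re.findall(r"[0-9]{3}.[0-9]{3}.[0-9]{4}", excerpt)

# An escaped period only matches a literal period.
literal_hits = re.findall(r"[0-9]{3}\.[0-9]{3}\.[0-9]{4}", excerpt)

print(wildcard_hits)
# ['202.555.9355', '141592653589', '793238462643', '3832795 0111']
print(literal_hits)  # ['202.555.9355']
```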


So how do I match a period or a hyphen? We need a way of saying OR. As it happens, the | (pipe) does this. Of course, we need a way to say what parts of the pattern are included in the or comparison. For this job, we call in the parenthetical. For example, re(a|e)d matches read or reed. So let’s look for separators that are either hyphens or periods, that is, [0-9]{3}(-|\.)[0-9]{3}(-|\.)[0-9]{4}
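For the curious, here is roughly how the OR pattern behaves in Python’s re module on an excerpt of the test string. One Python-specific wrinkle (not a regex101 thing): once a pattern contains groups, re.findall hands back the groups instead of the whole match, so this sketch uses re.finditer.

```python
import re

excerpt = (
    "555.867.5309 1.4142135623 987-01-6661 202-555-9355 "
    "(555) 867-5309 555/867-5309"
)

# Hyphen OR escaped period between each block of digits.
pattern = r"[0-9]{3}(-|\.)[0-9]{3}(-|\.)[0-9]{4}"

# group(0) is the full text of each match.
hits = [m.group(0) for m in re.finditer(pattern, excerpt)]
print(hits)  # ['555.867.5309', '202-555-9355']

# The smaller example from the text: re(a|e)d matches read or reed.
words = [w for w in ["read", "reed", "rid"] if re.fullmatch(r"re(a|e)d", w)]
print(words)  # ['read', 'reed']
```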


Obviously, this means that the parentheses are not literal characters. So if we wanted to match a parenthetical we’d need to use \ to escape each parenthesis. For example, \([0-9]{3}(\) |-|\.)[0-9]{3}(-|\.)[0-9]{4} matches phone numbers where parentheses and a space are used to set off the area code.


Now we’re cookin’. So what does a question mark do in Perl-like regex? I’m glad you asked. It is a modifier like {3}, except it matches when the preceding character/metacharacter occurs zero or one time. For example, watch the first parenthesis given \(?[0-9]{3}(\) |-|\.)[0-9]{3}(-|\.)[0-9]{4}
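A small sketch of the optional parenthesis at work, again in Python’s re module with a trimmed-down excerpt of the test string:

```python
import re

excerpt = "555.867.5309 202-555-9355 (555) 867-5309 555/867-5309"

# \(? makes the opening parenthesis optional (zero or one time).
pattern = r"\(?[0-9]{3}(\) |-|\.)[0-9]{3}(-|\.)[0-9]{4}"

hits = [m.group(0) for m in re.finditer(pattern, excerpt)]
print(hits)  # ['555.867.5309', '202-555-9355', '(555) 867-5309']
```

The slash-separated number is still left out, because / isn’t one of the separators we listed.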


There are more of these modifiers. +, for example, finds a match when the preceding character/metacharacter appears somewhere between one and an infinite number of times, and * matches between zero and infinity.
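If it helps to see the three side by side, here is a tiny sketch in Python’s re module. The sample strings are made up for illustration.

```python
import re

# ? : zero or one of the preceding item
optional_u = re.findall(r"colou?r", "color colour colouur")

# + : one or more of the preceding item
digit_runs = re.findall(r"[0-9]+", "17 U.S.C. § 107")

# * : zero or more of the preceding item
one_then_zeros = re.findall(r"10*", "1 10 1000")

print(optional_u)      # ['color', 'colour']
print(digit_runs)      # ['17', '107']
print(one_then_zeros)  # ['1', '10', '1000']
```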

One of the reasons I like regex101 is that it actually tells you all of this when you mouse over a modifier, literal character, or the like.


If your screen is large enough, you can also find this info in the right column. And if you have no idea where to start, there’s a quick reference with a list of tokens.


Let’s be honest, once you work through this post, you’re going to forget which tokens stand for what. Additionally, there’s no way I can show them all to you without losing your interest. You need a reference and a cheat sheet. The Quick Reference at regex101 is such a resource, and all you have to remember is regex101’s dead simple URL. Easy peasy.

Now here’s a handy metacharacter: \d is actually equivalent to [0-9]. So we can re-write our phone number regex as \(?\d{3}(\) |-|\.)\d{3}(-|\.)\d{4}


Parentheticals are actually a little more special than I let on earlier. They define something called a group, and we’ll talk about them more below, but for the moment, I want you to know that the same modifiers that we used on characters and metacharacters work on groups. So if you want to find phone numbers with no area code, you can place the area code in a group and throw a ? after it (meaning occurrences = 0 or 1). For example, (\(?\d{3}(\) |-|\.))?\d{3}(-|\.)\d{4}
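Here is a sketch of the optional area code group in Python’s re module, using a short made-up sentence. One aside for Python users: by default Python’s \d also accepts digits from other scripts, so it is slightly broader than [0-9] unless you pass re.ASCII.

```python
import re

excerpt = "867-5309 or (555) 867-5309 or 202.555.9355"

# The outer group wraps the area code and its separator; the ? after
# it makes the whole group optional.
pattern = r"(\(?\d{3}(\) |-|\.))?\d{3}(-|\.)\d{4}"

hits = [m.group(0) for m in re.finditer(pattern, excerpt)]
print(hits)  # ['867-5309', '(555) 867-5309', '202.555.9355']
```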


So is our phone number regex complete? I don’t know. That depends on what you think counts as a valid phone number. Should we look for all dividers between numbers? If so, our wildcard example seems more correct than we first thought. Maybe we should add spaces to our list of dividers? Turns out there’s a token for that: \s. Maybe a phone number is just any string of ten digits, with or without dividers. Or not. There are a lot of possibilities. I actually hid a phone number of the form 555/555-5555 in our test string. Did you notice it?

In addition to phone numbers, I also sprinkled in some Social Security numbers.5 Can you find them?
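If you’d like to check your work, here is one possible answer, sketched in Python’s re module against an excerpt of the test string. It assumes a Social Security number always looks like three digits, two digits, and four digits separated by hyphens.

```python
import re

excerpt = (
    "01101001 987-01-6661 01110011 202.555.9355 00100000 "
    "666-12-4895 01100001 202-555-9355 00100000"
)

# Three digits, two digits, four digits, with hyphens in between.
ssn_like = re.findall(r"\d{3}-\d{2}-\d{4}", excerpt)
print(ssn_like)  # ['987-01-6661', '666-12-4895']
```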

Redaction & Linking to the US Code

Click on SUBSTITUTION below the TEST STRING field. This should reveal two new fields. The first is for you to place a “replacement value” and the second is a display of your test string with matches replaced.


If you leave the replacement value blank, you’ll notice that all of your matches are gone, replaced with nothing. Add a placeholder, and it fills in the holes.
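The same idea in code, as a rough sketch: Python’s re.sub takes a pattern, a replacement value, and the text. The sentence and the [REDACTED] placeholder are made up for illustration.

```python
import re

text = "Call 555-867-5309 or 202.555.9355 today."  # made-up sentence
pattern = r"\d{3}(-|\.)\d{3}(-|\.)\d{4}"

# An empty replacement deletes every match.
removed = re.sub(pattern, "", text)

# A placeholder fills the holes instead.
redacted = re.sub(pattern, "[REDACTED]", text)

print(removed)   # Call  or  today.
print(redacted)  # Call [REDACTED] or [REDACTED] today.
```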


To make things interesting, replace your old test string with this:6

Respondent Acuff-Rose Music, Inc., filed suit against petitioners, the members of the rap music group 2 Live Crew and their record company, claiming that 2 Live Crew’s song, “Pretty Woman,” infringed Acuff-Rose’s copyright in Roy Orbison’s rock ballad, “Oh, Pretty Woman.” The District Court granted summary judgment for 2 Live Crew, holding that its song was a parody that made fair use of the original song. See Copyright Act of 1976, 17 U.S.C. § 107. The Court of Appeals reversed and remanded, holding that the commercial nature of the parody rendered it presumptively unfair under the first of four factors relevant under § 107; that, by taking the “heart” of the original and making it the “heart” of a new work, 2 Live Crew had, qualitatively, taken too much under the third § 107 factor; and that market harm for purposes of the fourth § 107 factor had been established by a presumption attaching to commercial uses.

Using what we learned, let’s write a regex to find citations to the United States Code. This means we’re looking for a title number followed by U.S.C., the § mark, the section number, and an optional parenthetical indicating the year of the code we’re citing. \d+ will pick up a string of one or more digits, with no upper limit. Something like (\d+) U\.S\.C\. § (\d+)( \(\d{4}\))? should do the trick.
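Here is a sketch of that pattern in Python’s re module, run against one sentence from the test string, so you can see what each set of parentheses picks up.

```python
import re

text = "See Copyright Act of 1976, 17 U.S.C. § 107. The Court of Appeals reversed."
pattern = r"(\d+) U\.S\.C\. § (\d+)( \(\d{4}\))?"

match = re.search(pattern, text)

whole_cite = match.group(0)  # the entire match
title = match.group(1)       # first parenthetical
section = match.group(2)     # second parenthetical
year = match.group(3)        # optional third parenthetical

print(whole_cite)  # 17 U.S.C. § 107
print(title)       # 17
print(section)     # 107
print(year)        # None (this cite has no year)
```

Note that 1976 is not mistaken for a title number, since it isn’t followed by a space and U.S.C.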


Citation found! So what’s the deal with those groups? Well, you can use the content of your groups to construct replacement text. Each parenthetical is numbered, and you can use something called a “backreference” to get at its content. This is why I placed the title and section numbers in groups. Group one is referenced by placing \1 in the replacement field, group two by \2 and so on. Also worth noting: group zero matches the entire regex. So placing \0 in the replacement field alone produces output that looks like your input.
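Backreferences carry over to code with one caveat. In this Python sketch, \1 and \2 work as described, but Python’s re module spells group zero as \g<0> in replacement text rather than \0. The “Title …, Section …” wording is just a made-up replacement for illustration.

```python
import re

text = "See 17 U.S.C. § 107."
pattern = r"(\d+) U\.S\.C\. § (\d+)( \(\d{4}\))?"

# \1 and \2 pull in the title and section numbers.
spelled_out = re.sub(pattern, r"Title \1, Section \2", text)

# \g<0> stands for the entire match in Python replacement text.
bracketed = re.sub(pattern, r"[\g<0>]", text)

print(spelled_out)  # See Title 17, Section 107.
print(bracketed)    # See [17 U.S.C. § 107].
```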


If you’ve made it this far, chances are you know what HTML looks like. If not, the fact you’ve made it this far means you really should. Might I recommend w3schools? Either way, we’d like to replace the cite with a link. In HTML that looks like this <a href="the link">the link text</a>. So we can put the following into our replacement field to get the desired result. <a href="\1/\2">\1 U.S.C. § \2\3</a>
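A rough sketch of the link-building replacement in Python’s re module follows. BASE_URL is a placeholder I made up so the example runs; swap in whatever statute site you actually want to link to. In Python 3.5 and later, an optional group that didn’t match (here, the year) is simply replaced with nothing.

```python
import re

text = "See Copyright Act of 1976, 17 U.S.C. § 107."
pattern = r"(\d+) U\.S\.C\. § (\d+)( \(\d{4}\))?"

# BASE_URL is a hypothetical placeholder, not a real statute site.
BASE_URL = "https://example.com/uscode/"
replacement = r'<a href="' + BASE_URL + r'\1/\2">\1 U.S.C. § \2\3</a>'

linked = re.sub(pattern, replacement, text)
print(linked)
# See Copyright Act of 1976, <a href="https://example.com/uscode/17/107">17 U.S.C. § 107</a>.
```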


If I saved this output to an html file, our US Code cite would be a link to Title 17, Section 107, and our regex would happily provide links for other citations as well. If you find yourself with a burning passion to write regex for legal citations, I recommend you check out this open source project and contribute.

Extra Credit: Build & Solve Crossword Puzzles

So far we’ve worked with small snippets of text, and I’m sure you’re curious about larger documents. If you want to get away from cutting and pasting text into regex101, I suggest you find a text editor with support for more Perl-like regex or pick up a little programming. If you decide on the latter, you should start with the instructions found in my Hello World! post for downloading Project Jupyter. After that, you can follow along with this notebook on building a crossword. However, I suspect most of you are looking for something a little less involved, so what follows is a scaled-down version one can do in regex101.

Obviously, cutting and pasting the entire dictionary into the regex101 test string is a bad idea. You can, however, get away with pasting the 10,000 most-used English words. Visit this page and copy its contents into the test string at regex101.7

We’ll start with an 11 by 11 grid. This is where we’ll build our crossword puzzle. First off, I want a long word to seed my puzzle. I know that E shows up with a good deal of frequency in English. So I’d love my word to be long and have an e in it. Looking at my test string, I can see that the words are separated by pipes, and the Quick Reference tells me that \w will match any word character. So I start with (\"|\|)\w{5}e\w{5}(\"|\|) in the hope that I can find an eleven-letter word where the sixth letter is E. Instead of sifting through the test string to find highlighted words, I scan the list provided in the right column.
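Here is the same search as a small Python sketch. The word list is a tiny made-up stand-in for the real one, and I’ve added one extra group around the word itself (my tweak, not part of the pattern above) so it can be pulled out without its delimiters.

```python
import re

# A tiny made-up stand-in for the pipe-separated word list.
word_list = '"about|interesting|information|middlings|the"'

# Delimiter, five word characters, an e, five more, delimiter.
pattern = r'(\"|\|)(\w{5}e\w{5})(\"|\|)'

seeds = [m.group(2) for m in re.finditer(pattern, word_list)]
print(seeds)  # ['interesting']
```

One quirk worth knowing: each match swallows its trailing pipe, so a qualifying word that comes immediately after another match gets skipped, because its leading pipe has already been used.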


Obviously, interesting looks like a good place to start. So I place it in the middle of my grid going across. After that, I use regex to find words that fit into the remaining spaces. For example, if I want to fill in the first column going down, I know I need a word that matches (\"|\|)\w{0,5}i\w{0,5}(\"|\|) Wait, what’s the deal with {0,5}? Well, I bet you can guess. {0,5} matches between 0 and 5 times. I know, there’s so much to learn. Anyhow, middlings comes back as a nice long word matching my criteria despite my questioning how it can be among the 10,000 most used words.
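A sketch of the {0,5} search in Python’s re module, using a small made-up word list and an extra group around the word (my addition) to make it easy to extract:

```python
import re

# A tiny made-up stand-in for the pipe-separated word list.
word_list = '"about|middlings|the|information|crossword"'

# Up to five word characters, an i, then up to five more.
pattern = r'(\"|\|)(\w{0,5}i\w{0,5})(\"|\|)'

fits = [m.group(2) for m in re.finditer(pattern, word_list)]
print(fits)  # ['middlings']
```

Here information is rejected because neither of its i’s leaves five or fewer letters on both sides.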


Anywho, I work my way across the grid, making sure that various conditions are met, and eventually, I find myself with something that looks like a crossword.8 If my clues aren’t enough, you can use your own copy of the 10,000 words and regex101 to find the answers.9


Across
1. For a contract to be valid, these must meet.
2. A poison pill does this to a merger.
5. ___ of the Roses
8. Descriptor for thoughts of great intellectual heft.
9. One whose pants are on fire.
10. Worthy of curiosity and further investigation.
11. Going, going…
12. The greatest quantity.
13. Logical joinder requiring presence of both adjacent truths.
14. First line of instruction to athletes upon minor injury.
15. To eject a projectile from a gun.

Down
1. Multiple bulk goods of medium grade.
3. The state of coming into being.
4. The collection of furniture provided for the resting of people around a table.
6. A guest.
7. The most luminous.

Obviously, limiting ourselves to 10,000 words is a bit restrictive, but you get the idea. My notebook crossword is a little better since it draws from a larger set of words. If your test string includes something approaching all English words, you end up with a much better crossword builder/helper. See e.g., crossword clue solver.

Additional Reading

Hopefully, I’ve sparked your interest. There’s a lot more to regular expressions, like anchors. In addition to matching character types, you can match specific locations in a text. For example, the ^ (caret) and $ (dollar sign) match the beginning and end of a document/string. So ^regex only finds a match if regex is the first word, and regex$ only finds a match if regex is the last word. \n matches line breaks, and on and on. There’s even a really cool set of features called lookaround assertions. Regex101 is a great place to start, but if you’re really itching for more, you may want to check out the following:

Have fun.

Source: Regular Expressions from xkcd.


Featured image: “Autostereogram Tutorial Random Dot Shark”, Wikimedia Commons GNU Free Documentation License.

  1. To be fair, they’re a lot closer to Control-F. 

  2. I referenced a similar example and the inspiration of Dave Zvenyach in a prior post on building a Twitter bot. His thoughts on regex are worth the read.

  3. Named for the Perl programming language, but supported in a constellation of programming languages and apps. 

  4. For what it’s worth, the contents of this string are not random. Twitter accolades to the first person to find the hidden message(s). 

  5. Don’t worry; they make use of prefixes that are invalid for Social Security numbers. Also, did you know the Social Security Administration won’t assign the group number 666?

  6. You may recognize this as the intro to Campbell v. Acuff-Rose Music, 510 U.S. 569 (1994).

  7. This corpus is the list used to power Randall Munroe’s Simple Writer, which is a spinoff of his ten hundred most common words series, including the Up Goer Five and Thing Explainer.

  8. This is actually a pretty atrocious crossword. 2 and 14 across don’t intersect with any down words, which I am told is a big no-no, but heck, I made it in like ten minutes. That being said, here are some additional hints re: 2 and 14 respectively. Smoking does this, and the instructions containing this word are also the title of a Taylor Swift song.

  9. Also, if you don’t have much time, I give up. Here are the answers.

