daniel shiffman

Regular Expressions


WARNING: This is a woefully incomplete overview of regular expressions. It would be absurd to try to fully cover the topic in a short handout like this. Hopefully, this will provide some of the basics to get you started, but to really understand regular expressions, I implore you to read as much of Mastering Regular Expressions by Jeffrey E.F. Friedl as you have time for.

A regular expression is a sequence of characters that describes or matches a given amount of text. For example, the sequence bob, considered as a regular expression, would match any occurrence of the word “bob” inside of another text. The following is a rather rudimentary introduction to the basics of regular expressions. We could spend the entire semester studying regular expressions if we put our minds to it. Nevertheless, we’ll just have a basic introduction to them this week and learn more advanced techniques as we explore different text processing applications over the course of the semester.

A truly wonderful book written on the subject is: Mastering Regular Expressions by Jeffrey Friedl. Chapter 1, available via the Safari Network (through NYU) can be found here:

http://safari.oreilly.com/0596002890/mastregex2-CHP-1

Regular expressions (sometimes referred to as ‘regex’ for short) have both literal characters and meta characters. In bob, all three characters are literal, i.e. the ‘b’ wants to match a ‘b’, the ‘o’ an ‘o’, etc. We might also have the regular expression:

^bob

In this case, the ‘^’ is a meta character, i.e. it does not want to match the character ‘^’, but instead indicates the “beginning of a line.” In other words the regex above would find a match in:

bob goes to the park.

but would not find a match in:

jill and bob go to the park.

Here are a few common metacharacters to get us started (I’m listing them as they would appear in a Java regular expression, which may differ slightly from Perl, PHP, .NET, etc.):

Position Metacharacters:

^     beginning of line
$     end of line
\b    word boundary
\B    a non word boundary

Single Character Metacharacters:

.     any one character
\d    any digit from 0 to 9
\w    any word character (a-z, A-Z, 0-9, and underscore)
\W    any non-word character
\s    any whitespace character (space, tab, new line, form feed, carriage return)
\S    any non whitespace character

Quantifiers (each refers to the character, class, or group that precedes it):

?     appearing once or not at all
*     appearing zero or more times
+     appearing one or more times
{min,max} appearing within the specified range

Using the above, we could come up with some quick examples:

^$ –> matches beginning of line followed by end of line, i.e. match any blank line!

ing\b –> matches ‘ing’ followed by a word boundary, i.e. any time ‘ing’ appears at the end of a word!
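To see these two patterns at work, here is a minimal sketch using the Java classes covered later in this handout. The class name, method names, and sample text are made up for illustration. It also assumes the Pattern.MULTILINE flag, which is needed in Java so that ^ and $ apply to every line rather than only to the start and end of the whole input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PositionDemo {
    // Count the matches of an already compiled pattern in the text.
    static int count(Pattern p, String text) {
        Matcher m = p.matcher(text);
        int total = 0;
        while (m.find()) {
            total++;
        }
        return total;
    }

    // Blank lines: beginning of line immediately followed by end of line.
    static int countBlankLines(String text) {
        // Assumption: MULTILINE makes ^ and $ match at each line of the input.
        return count(Pattern.compile("^$", Pattern.MULTILINE), text);
    }

    // 'ing' followed by a word boundary, i.e. 'ing' at the end of a word.
    static int countIngEndings(String text) {
        return count(Pattern.compile("ing\\b"), text);
    }

    public static void main(String[] args) {
        String text = "singing in the rain\n\nbringing rings";
        System.out.println("Blank lines: " + countBlankLines(text));
        System.out.println("'ing' endings: " + countIngEndings(text));
    }
}
```

Note that ‘rings’ is not counted: its ‘ing’ is followed by an ‘s’, so there is no word boundary after it.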

Character Classes allow one to do an “or” statement amongst individual characters and are denoted by characters enclosed in brackets, i.e. [aeiou] means match any vowel. Using a “^” negates the character class, i.e. [^aeiou] means match any character not a vowel (note this isn’t just limited to letters, it really means anything at all that is not an a, e, i, o, or u.) A hyphen indicates a range of characters, such as [0-9] or [a-z].
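As a quick illustration of character classes, the sketch below replaces whatever a class matches with a marker character, so we can see exactly which characters it picked out. It uses Java’s String.replaceAll() (covered later in this handout); the class name, method name, and sample text are hypothetical.

```java
class CharClassDemo {
    // Replace every character matched by the character class with the marker.
    static String mark(String charClass, String text, String marker) {
        return text.replaceAll(charClass, marker);
    }

    public static void main(String[] args) {
        String text = "regex 101!";
        System.out.println(mark("[aeiou]", text, "*"));   // vowels only
        System.out.println(mark("[^aeiou]", text, "*"));  // everything that is not a vowel
        System.out.println(mark("[0-9]", text, "#"));     // a range: any digit
    }
}
```

The second line of output shows that the negated class also swallows the space, the digits, and the exclamation point, not just the consonants.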

Another key metacharacter is |, meaning or. This is known as the concept of Alternation.

John|Jon –> matches “John” or “Jon”

note: this regex could also be written as Joh?n, meaning match “Jon” with an optional “h” between the “o” and “n.”
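A small sketch can confirm that the two spellings behave the same way. It uses Pattern.matches(), which tests whether an entire String matches a regex; the class name, method name, and list of candidate names are made up for this example. The alternation is written without spaces around the |, since a space inside a regex is itself a literal character.

```java
import java.util.regex.Pattern;

class AlternationDemo {
    // True when the whole candidate String matches the regex.
    static boolean matchesAll(String regex, String candidate) {
        return Pattern.matches(regex, candidate);
    }

    public static void main(String[] args) {
        String[] names = { "John", "Jon", "Johhn", "Joan" };
        for (String name : names) {
            System.out.println(name
                    + "  John|Jon: " + matchesAll("John|Jon", name)
                    + "  Joh?n: " + matchesAll("Joh?n", name));
        }
    }
}
```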

Parentheses can also be used to constrain the alternation, i.e.:

(212|646|917)\d* matches 212, 646, or 917 followed by any sequence of zero or more digits (presumably to retrieve phone #’s with NYC area codes). Note this regular expression would need to be improved to take into consideration white spaces and/or punctuation.
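The limitation mentioned above is easy to see in practice. The following is a rough sketch (hypothetical class name, method name, and made-up phone numbers) that collects every match of the area code regex in a line of text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AreaCodeDemo {
    // Collect every match of the area code regex in the text.
    static List<String> findNumbers(String text) {
        Pattern p = Pattern.compile("(212|646|917)\\d*");
        Matcher m = p.matcher(text);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findNumbers("Call 2125551234 or 646-555-0000."));
    }
}
```

The first number is matched in full, but the second match stops at the area code because a hyphen is not a digit.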

Parentheses also serve the purpose of capturing groups for back-references. For example, examine the following regular expression:

\b([0-9A-Za-z]+)\s+\1\b

The first part of the expression reads: \b([0-9A-Za-z]+), meaning match any “word” containing one or more letters/digits. The next part, \s+, means any sequence of at least one white space character. The third part, \1, says match whatever you matched that was enclosed inside the first set of parentheses, i.e. ([0-9A-Za-z]+), and the final \b requires a word boundary after it. So, thinking this over, what will this regular expression match in the following line:

This is really really super super duper duper fun.  Fun!
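Once you have an answer in mind, one way to check it is to let Java do the matching. This is only a sketch, with a made-up class name and method name, using the Pattern and Matcher classes described later in this handout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackrefDemo {
    // Collect every match of the back-reference regex in the given line.
    static List<String> findDoubles(String line) {
        // \1 refers back to whatever the first set of parentheses captured
        Pattern p = Pattern.compile("\\b([0-9A-Za-z]+)\\s+\\1\\b");
        Matcher m = p.matcher(line);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String line = "This is really really super super duper duper fun.  Fun!";
        for (String match : findDoubles(line)) {
            System.out.println("[" + match + "]");
        }
    }
}
```

Pay attention to what happens with “This is” and with “fun.  Fun!” when you compare the output against your guess.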

egrep

grep is a unix command line utility that takes an input file and a regular expression and outputs the lines that contain matches for that regular expression. It’s a quick way for us to test some regexes (and we can use it on ITP’s server or on any Mac OS X machine.) As a point of history, the name comes from the form “g/re/p” which stands for “Global Regular Expression Print.” We’ll be using egrep, which allows for more sophisticated regular expression searches. (Note: the examples below use a slightly different regex “flavor” than what we will see in Java. This is something we’ll have to get used to, and will likely cause a bit of confusion. Not to worry, confusion over regular expression flavors is extremely normal. No need to seek professional help.)

The syntax is simple:

egrep -flags 'regexpattern' filename

If we want to write the output to a file (>> appends to the file; a single > would overwrite it):

egrep -flags 'regexpattern' filename >> outputfilename

%  egrep -i 'four' bible.txt
%  egrep -i 'five' bible.txt

The -i flag indicates that the match should be case-insensitive. You can find full documentation for the “egrep” command here (with full flags): http://www.unet.univie.ac.at/aix/cmds/aixcmds2/egrep.htm.

Let’s look at some other examples (special thanks to Friedl’s Mastering Regular Expressions).

Match URL’s:

%  egrep -i 'http://[^ ]*' a2z.txt

(run this with the following sample file: a2z.txt)

Match double words:

%  egrep -i '\<(\w+) +\1\>' doubletext.txt

(run this with the following sample file: doubletext.txt)

(Note: in the above example, the metacharacter \< means “start of word boundary” and \> means “end of word boundary.” This is different from the \b we’ll find in Java.)

Regular Expressions in Java

With Java 1.4, Sun introduced the java.util.regex package. Having regex support come standard with Java is a great thing, and there are many advantages to working with regexes in a robust object-oriented environment. Nevertheless, unlike with Perl (where regexes are a low-level component of the language), using regexes in Java can prove to be a bit awkward. The following will offer a brief overview of using regexes in Java. For more information I would suggest reading Chapter 8 of Mastering Regular Expressions, the book Java Regular Expressions, and the online Sun tutorial.

Making a String into a Regular Expression

Perl accepts normal strings as regular expressions, which makes life lovely. With Java, however, a regular expression is a Pattern object that is made with a String. We have to deal with Java’s own String metacharacters when putting together a String that will be used as a Regular Expression. In other words, in Java if you use a backslash in a String, it will be considered as a metacharacter, i.e.:

String newline = "\n";

To actually have a backslash in a regular expression, we need to escape it with another backslash, i.e.:

String newlineregex = "\\n";

Conceptually, it might take us a moment to wrap our heads around this distinction. Functionally, though, the solution in Java is simple: whenever you want to have a backslash in your regex, use two!
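A few lines of Java make the distinction concrete. This is a minimal sketch with hypothetical class and method names; the sample strings are made up. The last method shows a consequence of the same rule: matching one literal backslash takes four backslashes in Java source.

```java
class EscapeDemo {
    // Replace every digit with '#'. The regex \d is written "\\d" in Java source.
    static String maskDigits(String text) {
        return text.replaceAll("\\d", "#");
    }

    // Replace every literal backslash with '/'. The regex for one literal
    // backslash is \\ and each of those is doubled again in Java source.
    static String toForwardSlashes(String path) {
        return path.replaceAll("\\\\", "/");
    }

    public static void main(String[] args) {
        System.out.println("\n".length());   // String holding a single line break character
        System.out.println("\\n".length());  // String holding a backslash and the letter n
        System.out.println(maskDigits("a1b22"));
        System.out.println(toForwardSlashes("C:\\temp\\notes.txt"));
    }
}
```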

Ok, moving on to using a regex in Java, our program must import the java.util.regex package:

import java.util.regex.*;

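The classes we will use are as follows:

Pattern    a compiled regular expression
Matcher    the object that applies a Pattern to an input String and reports the matches

Our first regex program will follow this pseudo-code:

Step #1    Create the input String to be searched
Step #2    Create a String containing the regular expression
Step #3    Compile the regex String into a Pattern
Step #4    Create a Matcher from the Pattern and the input String
Step #5    If the Matcher finds a match, print it
Step #6    Otherwise, report that there was no match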

Ok, let’s take a look at the actual code:

import java.util.regex.*;

public class RegexHelloWorld {
  public static void main(String[] args) {
    String inputtext = "This is a test of regular expressions."; // Step #1
    String regex = "test";                                       // Step #2
    Pattern p = Pattern.compile(regex);                          // Step #3
    Matcher m = p.matcher(inputtext);                            // Step #4
    if (m.find()) {
      System.out.println(m.group());                             // Step #5
    } else {
      System.out.println("No match!");                           // Step #6
    }
  }
}

Note the use of the find() method, which attempts to find the next subsequence of the input sequence that matches the pattern (returning true or false based on whether it finds something), and group(), which returns the input subsequence matched by the previous match operation.

If we want to look for multiple matches, we can simply use a “while” loop instead of an “if”:

String regex = "\\b(\\w+)\\b\\W+\\1";   // Regex that matches double words
Pattern p = Pattern.compile(regex);     // Compile Regex
Matcher m = p.matcher(content);         // Create Matcher
while (m.find()) {
  System.out.println(m.group());
}
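Here is the same loop as a self-contained sketch that can be compiled and run on its own. The class name, method name, and sample sentence are made up. It also uses group(1), which returns only the text captured by the first set of parentheses, and start(), which reports where in the input the match began.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DoubleWordDemo {
    // Return each word that appears twice in a row (the text captured by group 1).
    static List<String> repeatedWords(String content) {
        String regex = "\\b(\\w+)\\b\\W+\\1";   // Regex that matches double words
        Pattern p = Pattern.compile(regex);     // Compile Regex
        Matcher m = p.matcher(content);         // Create Matcher
        List<String> words = new ArrayList<>();
        while (m.find()) {
            // group() is the whole match, group(1) just the captured word
            System.out.println("\"" + m.group() + "\" found at index " + m.start());
            words.add(m.group(1));
        }
        return words;
    }

    public static void main(String[] args) {
        String content = "It was the the best of times, it was was the worst.";
        System.out.println(repeatedWords(content));
    }
}
```

Because this regex does not end with \b, it would also report a “double word” in text like “the theory”. Adding a final \\b is one way to tighten it.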

We can also add flags when compiling the regex. For example, if we want to have a case insensitive regex:

Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);

Two flags can be added using the bitwise OR, i.e. |

Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.COMMENTS);
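To make the effect of these two flags visible, here is a small sketch (hypothetical class name, method name, and sample text). With Pattern.COMMENTS, whitespace inside the regex is ignored and anything after a # is treated as a comment, which lets us annotate the pattern itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagsDemo {
    // Count occurrences of the word "bob" regardless of case.
    static int countBob(String text) {
        // The spaces and the trailing comment are ignored because of COMMENTS
        String regex = "\\b bob \\b   # the word bob, in any mix of upper and lower case";
        Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.COMMENTS);
        Matcher m = p.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countBob("Bob met BOB and bobby."));
    }
}
```

Without the COMMENTS flag, the spaces and the # text would be treated as literal characters to match, and this pattern would find nothing in the sample sentence.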

It’s easy to see how we could improve the Flesch Index example from last week. For example, we could use a regular expression to very quickly count vowels:

String regex = "[aeiou]";
Pattern p = Pattern.compile(regex,Pattern.CASE_INSENSITIVE);
int vowelcount = 0;
Matcher m = p.matcher(content);         // Create Matcher
while (m.find()) {
  vowelcount++;
}
System.out.println("Total vowels: " + vowelcount);

Splitting with Regular Expressions

It should briefly be noted that the split function we examined last week actually takes a regular expression as an argument. An input String is split into an array wherever any part of that input String matches the regular expression. For example...

String regex = "\\W";  // Use any "non-word character" as a delimiter
String[] words = content.split(regex);
System.out.println("Total words: " + words.length);

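...is a very quick way to use regular expressions to count the # of words. (This method is not perfect by any means: two delimiters in a row, such as a comma followed by a space, produce an empty String that gets counted as a word.)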
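One possible refinement, offered only as a sketch, is to treat a whole run of non-word characters as a single delimiter and to skip any empty Strings that remain. The class name, method name, and sample sentence below are made up for illustration.

```java
class SplitDemo {
    // Count words by splitting on runs of non-word characters.
    static int countWords(String content) {
        int count = 0;
        for (String word : content.split("\\W+")) {
            // A leading delimiter (or an empty input) still yields one empty String
            if (!word.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String content = "Hello, world!  Bye.";
        System.out.println("Split on \\W:  " + content.split("\\W").length);
        System.out.println("Split on \\W+: " + countWords(content));
    }
}
```

This is still a rough count. For example, a contraction like “we’re” is split into two words because the apostrophe is a non-word character.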

Search and Replace

Running a search and replace is one of the more powerful things one can do with regular expressions. In Java, it’s simple. The String class itself has a replaceAll() method built-in. The method takes two arguments, a regex and a replacement String. Wherever there is a regex match, it is replaced with the String provided, i.e.:

String input = "Replace every time the word \"the\" appears with the word ze.";
String regex = "\\bthe\\b";  // Match "the" only as a whole word
String output = input.replaceAll(regex,"ze");

Output yields: Replace every time ze word “ze” appears with ze word ze.

The replaceAll() method is also available in the Matcher class, i.e.:

String input = "Replace every time the word \"the\" appears with the word ze.";
String regex = "\\bthe\\b";  // Match "the" only as a whole word
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(input);
String output = m.replaceAll("ze");

We can also reference the matched text using a backreference in the substitution string. A backreference to the entire match is indicated as $0. If there are capturing parentheses, you can reference specific groups as $1, $2, etc.

String input = "Anytime a sequence of one or more vowels appears, \n" +
               "we're going to double the vowels.";
String regex = "[aeiou]+";  // One or more vowels in a row
String output = input.replaceAll(regex, "$0$0");

Output yields:
Anytiimee aa seequeuencee oof oonee oor mooree vooweels aappeaears, wee’ree goioing too dououblee thee vooweels.
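The example above only uses $0. To illustrate $1 and $2, here is a small sketch (hypothetical class name, method name, and input) that captures two words separated by a comma and swaps their order in the replacement.

```java
class ReplaceGroupsDemo {
    // Turn "Last, First" into "First Last" by swapping the two captured groups.
    static String firstNameFirst(String names) {
        // $1 is the word before the comma, $2 the word after it
        return names.replaceAll("(\\w+), (\\w+)", "$2 $1");
    }

    public static void main(String[] args) {
        System.out.println(firstNameFirst("Friedl, Jeffrey; Shiffman, Daniel"));
    }
}
```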

The closing example from this week uses a regular expression to remove all HTML tags from a source file. A nice way to write regular expressions is to start with an exact text and then slowly generalize it.

Let’s start with the regular expression:

<table>

Ok, now let’s generalize it to be:

<\w\w\w\w\w>
(less than followed by 5 word characters followed by a greater than)

Well, this can be further generalized to:

<\w+>

But really we should allow for white spaces, punctuation, and other characters inside the opening and closing brackets. Basically, we want to allow for any character that is not “>”!

<[^>]*>
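To watch the pattern become more general at each step, we can test every version against a handful of tags. This is only a sketch: the class name, method name, and sample tags are made up, and the quantifiers in the last two patterns are assumed to be + and * respectively.

```java
import java.util.regex.Pattern;

class TagRegexSteps {
    // Count how many of the candidate Strings match the regex in full.
    static int countFullMatches(String regex, String[] candidates) {
        int total = 0;
        for (String candidate : candidates) {
            if (Pattern.matches(regex, candidate)) {
                total++;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        String[] tags = { "<table>", "<b>", "</p>", "<a href=\"x.html\">" };
        String[] steps = { "<table>", "<\\w\\w\\w\\w\\w>", "<\\w+>", "<[^>]*>" };
        for (String regex : steps) {
            System.out.println(regex + " matches "
                    + countFullMatches(regex, tags) + " of " + tags.length + " tags");
        }
    }
}
```

Only the final version accepts the closing tag and the tag with an attribute, since a slash, a space, and a quotation mark are not word characters.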

The code to replace this match with nothing is then:

// A Regex to match anything in between <>
// Reads as: Match a "<"
// Match zero or more characters that are not ">"
// Match a ">"
String tagregex = "<[^>]*>";
Pattern p2 = Pattern.compile(tagregex);
Matcher m2 = p2.matcher(content);
count = 0;
// Just counting all the tags first
while (m2.find()) {
  //System.out.println(m2.group());
  count++;
}
// Replace any matches with nothing
content = m2.replaceAll("");
System.out.println("Removed " + count + " other tags.");
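The snippet above relies on variables (content and count) declared elsewhere in the full program. For experimenting, here is a self-contained sketch of the same idea, with a made-up class name, method names, and sample HTML.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagStripper {
    // "<", then zero or more characters that are not ">", then ">"
    static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Count the tags in the given HTML.
    static int countTags(String html) {
        Matcher m = TAG.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Replace every tag with nothing.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b>!</p>";
        System.out.println("Removed " + countTags(html) + " tags.");
        System.out.println(stripTags(html));
    }
}
```

Keep in mind that this simple regex is not a full HTML parser. It will be fooled by a “>” inside an attribute value, and it leaves the contents of script and style blocks in place.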

Related Perl / PHP Examples

Perl version of the vowel doubler:

#!/usr/bin/perl

undef $/;      # File "slurp" mode
$stuff = <>;   # read in the first file

# double any vowel occurrences
# g -- global
# i -- case insensitive
$stuff =~ s/([aeiou]+)/$1$1/g;

print $stuff;

PHP:
Run it: http://shiffman.net/itp/classes/a2z/week02/voweldoubler.php
Source: http://shiffman.net/itp/classes/a2z/week02/voweldoubler.phps
