13. Regular Expressions#

A regular expression (often shortened to regex) is a sequence of characters that can be used to search for patterns of text. Regexes appear all over the place. This article covers some of the basic syntax.

13.1. Regex in Python#

The python standard library has the re module for working with regular expressions. re.match finds patterns in text strings.

>>> import re
>>> text = "Hello, World!"
>>> re.search("Hello", text)
<re.Match object; span=(0, 5), match='Hello'>
>>> re.search("Hello", text).span()
(0, 5)
>>> re.search("ll", text)
<re.Match object; span=(2, 4), match='ll'>

If the pattern is present in the text, re.search returns a re.Match object. You can retrieve the start and end point of the match using the span method. If the pattern is not present in the string, re.search returns None.

>>> import re
>>> text = "Hello, World!"
>>> re.search("Random", text)
>>> print(re.match("Random", text))
None

Regex provides a rich syntax for matching complex patterns. For example, + matches to one or more repetition of the preceding character.

>>> import re
>>> text = "Hello, World!"
>>> re.search("l+", "Hello, World")
<re.Match object; span=(2, 4), match='ll'>
>>> re.search("l+", "Helllo, World")
<re.Match object; span=(2, 5), match='lll'>
>>> re.search("l+", "Hellllo, World")
<re.Match object; span=(2, 6), match='llll'>

re.search does not match the second l

The second l in World is not matched because re.search only matches the first instance of the pattern. re.finall matches all instances of the pattern and returns the matches in a list.

>>> import re
>>> text = "Hello, World!"
>>> re.findall("l+", "Hello, World")
['ll', 'l']

13.2. Quantifiers#

Quantifiers specify how many instanstances of a character or group must be present for a match to be found.

Quantifier

Description

*

Matches zero or more repetitions of the preceding character.

+

Matches one or more repetitions of the preceding character.

?

Matches zero or one repetitions of the preceding character.

{n}

Matches exactly n repetitions of the preceding character.

{n,}

Matches at least n repetitions of the preceding character.

{n,m}

Matches between n and m repetitions of the preceding character.

Below are examples of each quantifier.

>>> import re
>>> text = "Hello, Mel, Ankle"
>>> re.findall("el+", "Hello, Mel, eat")
['ell', 'el']
>>> re.findall("el?", "Hello, Mel, eat")
['el', 'el', 'e']
>>> re.findall("el*", "Hello, Mel, eat")
['ell', 'el', 'e']
>>> re.findall("el{2}", "Hello, Mel, eat")
['ell']
>>> re.findall("el{1,}", "Hello, Mel, eat")
['ell', 'el']

13.3. Regex Syntax#

.

Matches any single character other than the newline character (n).

>>> bool(re.match(r"H.llo", "Hello, World!"))
True
>>> bool(re.match(r"H.llo", "Hxllo, World!"))
True
>>> bool(re.match(r"H.llo", "Hxxllo, World!"))
False
>>> bool(re.match(r"H.llo", "H\nllo, World!"))
False
?

Matches the previous regex for 0 or 1 repetitions.

>>> bool(re.match(r"He?llo", "Hello, World!"))
True
>>> bool(re.match(r"He?llo", "Hllo, World!"))
True
>>> bool(re.match(r"He?llo", "Heello, World!"))
False
>>> bool(re.match(r"He?llo", "Hxllo, World!"))
False
+

Matches the previous regex for 1 or more repetitions.

>>> bool(re.match(r"He+llo", "Hello, World!"))
True
>>> bool(re.match(r"He+llo", "Heeello, World!"))
True
>>> bool(re.match(r"He+llo", "Hllo, World!"))
False
*

Matches the previous regex for 0 or more repetitions.

>>> bool(re.match(r"He*llo", "Hello, World!"))
True
>>> bool(re.match(r"He*llo", "Heeello, World!"))
True
>>> bool(re.match(r"He*llo", "Hllo, World!"))
True
^

Matches the start of the string.

>>> bool(re.match(r"^Hello", "Hello, World!"))
True
>>> bool(re.match(r"^World", "Heeello, World!"))
False
$

Matches the end of the string.

>>> bool(re.match(r"foo$", "foobar"))
False
>>> bool(re.match(r"bar$", "foobar"))
False
>>> bool(re.match(r"foobar$", "foobar"))
True
{m,n}

Match next m to n characters to previous regex.

>>> bool(re.match(r"He{3}llo", "Hello, World!"))
False
>>> bool(re.match(r"He{3}llo", "Heeello, World!"))
True
>>> bool(re.match(r"He{2,3}llo", "Heello, World!"))
True
>>> bool(re.match(r"He{2,3}llo", "Heeello, World!"))
True
>>> bool(re.match(r"He{2,3}llo", "Heeeello, World!"))
False
[]

Used to define character sets.

>>> # Match 6 chars to chars 'f', 'o', 'b', 'a', and 'r'
>>> bool(re.match(r"[fobar]{6}", "foobar"))
True
>>> bool(re.match(r"[fobar]{6}", "fo3bar"))
False
>>> # Match 6 chars to chars 'a' to 'z'
>>> bool(re.match(r"[a-z]{6}", "foobar"))
True
>>> bool(re.match(r"[a-z]{6}", "fo3bar"))
False
>>> # Match 6 chars to chars 'a' to 'z' or '1' to '9'
>>> bool(re.match(r"[a-z1-9]{6}", "fo3bar"))
True
>>> # Match 6 chars to chars 'A' to 'Z'
>>> bool(re.match(r"[A-Z]{6}", "foobar"))
False
>>> # Match 6 chars to chars 'A' to 'Z'
>>> bool(re.match(r"[A-Z]{6}", "FOOBAR"))
True
()

Used to define groups.

>>> # Matches to 'a' or 'bc'
>>> bool(re.match(r"(a|bc)", "a"))
True
>>> bool(re.match(r"(a|bc)", "b"))
True
>>> bool(re.match(r"(a|bc)", "d"))
False

13.3.1. Grep#

Grep is a program common on most Unix-like systems. It is used for finding patterns in text. Suppose the file text.txt contains the follow text.

Line 1
Line 2
This is the final line

Here are some examples of how to use grep.

$ grep This text.txt
Line 1
Line 2
$ grep --invert-match This text.txt
This is the final line
$ grep --ignore-case this text.text
This is the final line
$ grep --file=text.txt --regexp=Line
Line 1
Line 2
$ grep --count Line text.txt
2
$ grep ^Line (1|2) file1.txt
zsh: no matches found: (1|2)
$ grep "^Line (1|2)" file1.txt
$ grep -E "^Line (1|2)" file1.txt
Line 1
Line 2

13.3.1.1. Grep Multiple Files#

Suppose we have the following directory structure.

directory
   file1.txt
   file2.txt

Suppose directory/file1.txt has the following text.

Line 1 in file1
Line 2 in file1
This is the final line in file 1

And directory/file2.txt has the following text.

Line 1 in file2
Line 2 in file2
This is the final line in file 2

We can grep patterns in multiple files.

$ grep Line directory/*.txt
file1.txt:Line 1 in file1
file1.txt:Line 2 in file1
file2.txt:Line 1 in file2
file2.txt:Line 2 in file2
$ grep Line --recursive directory
file1.txt:Line 1 in file1
file1.txt:Line 2 in file1
file2.txt:Line 1 in file2
file2.txt:Line 2 in file2