Regular Expressions in Python
Example of a raw string
- Prefixing a string literal with r makes it a raw string: backslashes are kept as-is, so the full string is printed
print(r'\tTab')
Output:
\tTab
We want the regular expression engine to interpret the backslashes in the patterns we pass in, without Python processing them as escape sequences first.
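To make the difference concrete, here is a small sketch comparing a regular and a raw string literal. The \b comparison is an added illustration, not one of the examples from this post: in a regular string \b is a backspace character, while in a raw string it reaches re as the word-boundary token.

```python
import re

regular = '\tTab'   # Python turns \t into a single tab character
raw = r'\tTab'      # the backslash and the 't' stay as two separate characters

print(len(regular))  # 4
print(len(raw))      # 5

# '\b' in a regular string is a backspace character, so this pattern
# looks for a backspace followed by 'Ha' and finds nothing.
no_raw = re.findall('\bHa', 'Ha HaHa')

# With a raw string, the regex engine receives \b and treats it as a word boundary.
with_raw = re.findall(r'\bHa', 'Ha HaHa')

print(no_raw)    # []
print(with_raw)  # ['Ha', 'Ha']
```

Writing every pattern as a raw string is a simple habit that avoids this class of surprise.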
Examples
import re
text_to_search = '''
abcdefghijklmnopqurtuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890

Ha HaHa

MetaCharacters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )

devinpowers.com

321-555-4321
123.555.1234
123*555*1234
800-555-1234
900-555-1234

Mr. Powers
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
'''
sentence = 'Start a sentence and then bring it to an end'
## Pass in a pattern ('devin')
pattern = re.compile(r'devin')
## Now let's search through our text with this pattern
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)
Output:
<re.Match object; span=(142, 147), match='devin'>
The span is the start and end index of the match; the end index is exclusive.
When we used the finditer method, it found one match for devin in our text_to_search string, running from index 142 up to (but not including) index 147.
These indexes are useful because we can plug them into Python's string slicing and get the exact match back.
match = text_to_search[142:147]
print(match)
Output:
devin
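Rather than copying the indexes by hand, they can be read off the match object itself. A minimal sketch, assuming a shorter, made-up sample string in place of text_to_search:

```python
import re

# Hypothetical sample text, shorter than text_to_search
text = 'visit devinpowers.com today'

match = re.search(r'devin', text)  # first match, or None if there is none

start, end = match.span()          # same values as match.start() and match.end()
print(start, end)                  # 6 11
print(text[start:end])             # devin
print(match.group())               # devin (the matched text, no slicing needed)
```

In practice match.group() is usually enough; the span matters when you need to know where in the text the match sits.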
What happens if there is more than one instance of the pattern? For example, if we pass in the pattern owers, finditer returns one match per occurrence.
Output:
<re.Match object; span=(148, 153), match='owers'>
<re.Match object; span=(230, 235), match='owers'>
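The code for this search is not shown above, so here is one way it might look. It uses a shortened, made-up stand-in for text_to_search, which means the indexes differ from the output above.

```python
import re

# Shortened, hypothetical stand-in for text_to_search
text = 'devinpowers.com\nMr. Powers\n'

pattern = re.compile(r'owers')

# finditer yields one Match object per non-overlapping occurrence
spans = [m.span() for m in pattern.finditer(text)]
print(spans)  # [(6, 11), (21, 26)]
```

Matching is case-sensitive by default; 'Powers' is found here only because the pattern skips the capital P.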
MetaCharacters
How do we deal with MetaCharacters?
pattern = re.compile(r'.')
Output:
<re.Match object; span=(1, 2), match='a'>
<re.Match object; span=(2, 3), match='b'>
<re.Match object; span=(3, 4), match='c'>
<re.Match object; span=(4, 5), match='d'>
<re.Match object; span=(5, 6), match='e'>
...
...
...
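These are not literal periods! An unescaped . is a metacharacter that matches any character except a newline, so it matches almost everything in the string we passed in.
To match a literal period, we need to escape it with a backslash.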
pattern = re.compile(r'devinpowers\.com')
Output:
<re.Match object; span=(142, 157), match='devinpowers.com'>
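A quick way to see what the escape changes is to run both versions against text that contains a near miss. This is a sketch with a made-up sample string; re.escape is a standard-library helper that escapes a literal string for you.

```python
import re

# Hypothetical text: one real domain and one look-alike with 'X' instead of '.'
text = 'devinpowers.com devinpowersXcom'

unescaped = re.findall(r'devinpowers.com', text)   # . matches any character
escaped = re.findall(r'devinpowers\.com', text)    # \. matches only a period

print(unescaped)  # ['devinpowers.com', 'devinpowersXcom']
print(escaped)    # ['devinpowers.com']

# re.escape builds the escaped pattern from a plain string
auto = re.compile(re.escape('devinpowers.com'))
auto_matches = auto.findall(text)
print(auto_matches)  # ['devinpowers.com']
```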
We use regular expressions to find patterns, not just literal text.
Snippets
- These building blocks can be combined to search for patterns
. - Any Character Except New Line
\d - Digit (0-9)
\D - Not a Digit (0-9)
\w - Word Character (a-z, A-Z, 0-9, _)
\W - Not a Word Character
\s - Whitespace (space, tab, newline)
\S - Not Whitespace (space, tab, newline)
\b - Word Boundary
\B - Not a Word Boundary
^ - Beginning of a String
$ - End of a String
[] - Matches Characters in brackets
[^ ] - Matches Characters NOT in brackets
| - Either Or
( ) - Group
Quantifiers:
* - 0 or More
+ - 1 or More
? - 0 or One
{3} - Exact Number
{3,4} - Range of Numbers (Minimum, Maximum)
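A few of these snippets are easier to understand when run. The sketch below tries the anchors and word boundaries on the 'Ha HaHa' line from text_to_search; the strings used for the character set and quantifier lines are made up for illustration.

```python
import re

text = 'Ha HaHa'

word_starts = re.findall(r'\bHa', text)   # 'Ha' at the start of a word
inside_word = re.findall(r'\BHa', text)   # 'Ha' that does not start a word
at_start = re.findall(r'^Ha', text)       # 'Ha' only at the beginning of the string
at_end = re.findall(r'Ha$', text)         # 'Ha' only at the end of the string

print(word_starts)  # ['Ha', 'Ha']
print(inside_word)  # ['Ha']
print(at_start)     # ['Ha']
print(at_end)       # ['Ha']

# [^b] matches any single character except 'b'
not_bat = re.findall(r'[^b]at', 'cat mat pat bat')
print(not_bat)      # ['cat', 'mat', 'pat']

# {3,4} matches runs of three or four digits
three_or_four = re.findall(r'\d{3,4}', '12 123 1234')
print(three_or_four)  # ['123', '1234']
```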
Sample Regexes:
[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+
Example
- Find all the digits in this sentence
import re
sentence = 'Start23 a sentence69 and then bring it420 to an end'
pattern = re.compile(r'\d')
matches = pattern.finditer(sentence)
for match in matches:
    print(match)
Output:
<re.Match object; span=(5, 6), match='2'>
<re.Match object; span=(6, 7), match='3'>
<re.Match object; span=(18, 19), match='6'>
<re.Match object; span=(19, 20), match='9'>
<re.Match object; span=(38, 39), match='4'>
<re.Match object; span=(39, 40), match='2'>
<re.Match object; span=(40, 41), match='0'>
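Each digit comes back as its own match because \d matches exactly one character. If whole numbers are wanted instead, one option is to add the + quantifier and use findall, which returns the matched strings directly. A small sketch on the same sentence:

```python
import re

sentence = 'Start23 a sentence69 and then bring it420 to an end'

# \d+ matches one or more consecutive digits as a single match
numbers = re.findall(r'\d+', sentence)
print(numbers)  # ['23', '69', '420']
```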
We can combine several of these snippets to search for something more structured, such as a phone number.
Example of this:
pattern = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')
Output:
<re.Match object; span=(159, 171), match='321-555-4321'>
<re.Match object; span=(172, 184), match='123.555.1234'>
<re.Match object; span=(185, 197), match='123*555*1234'>
<re.Match object; span=(198, 210), match='800-555-1234'>
<re.Match object; span=(211, 223), match='900-555-1234'>
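The . between the digit groups matches any character, which is why 123*555*1234 is also found. If only dashes and periods should count as separators, a character set can replace the dot. The sketch below uses just the phone number lines from text_to_search; the second pattern, limited to 800 and 900 numbers, is an extra illustration.

```python
import re

phones = '''
321-555-4321
123.555.1234
123*555*1234
800-555-1234
900-555-1234
'''

# Only a dash or a period is accepted between the digit groups
strict = re.compile(r'\d{3}[-.]\d{3}[-.]\d{4}')
strict_matches = strict.findall(phones)
print(strict_matches)
# ['321-555-4321', '123.555.1234', '800-555-1234', '900-555-1234']

# Only numbers that start with 800 or 900
toll = re.compile(r'[89]00[-.]\d{3}[-.]\d{4}')
toll_matches = toll.findall(phones)
print(toll_matches)
# ['800-555-1234', '900-555-1234']
```

Inside brackets the period is a literal character, so it does not need a backslash there.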
Another Example using a .txt file
- Read a .txt file (data.txt here) and search its contents
import re
# pattern = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')
pattern = re.compile(r'\d{3}.\d{3}.\d{4}')
with open('data.txt', 'r') as f:
    contents = f.read()
matches = pattern.finditer(contents)
for match in matches:
    print(match)
Output:
<re.Match object; span=(12, 24), match='615-555-7164'>
<re.Match object; span=(102, 114), match='800-555-5669'>
<re.Match object; span=(191, 203), match='560-555-5153'>
<re.Match object; span=(281, 293), match='900-555-9340'>
<re.Match object; span=(378, 390), match='714-555-7405'>
<re.Match object; span=(467, 479), match='800-555-6771'>
<re.Match object; span=(557, 569), match='783-555-4799'>
<re.Match object; span=(647, 659), match='516-555-4615'>
<re.Match object; span=(740, 752), match='127-555-1867'>
<re.Match object; span=(829, 841), match='608-555-4938'>
<re.Match object; span=(915, 927), match='568-555-6051'>
<re.Match object; span=(1003, 1015), match='292-555-1875'>
<re.Match object; span=(1091, 1103), match='900-555-3205'>
<re.Match object; span=(1180, 1192), match='614-555-1166'>
<re.Match object; span=(1269, 1281), match='530-555-2676'>
<re.Match object; span=(1355, 1367), match='470-555-2750'>
<re.Match object; span=(1439, 1451), match='800-555-6089'>
<re.Match object; span=(1526, 1538), match='880-555-8319'>
<re.Match object; span=(1614, 1626), match='777-555-8378'>
<re.Match object; span=(1697, 1709), match='998-555-7385'>
<re.Match object; span=(1790, 1802), match='800-555-7100'>
<re.Match object; span=(1874, 1886), match='903-555-8277'>
<re.Match object; span=(1962, 1974), match='196-555-5674'>
<re.Match object; span=(2051, 2063), match='900-555-5118'>
<re.Match object; span=(2135, 2147), match='905-555-1630'>
<re.Match object; span=(2216, 2228), match='203-555-3475'>
<re.Match object; span=(2300, 2312), match='884-555-8444'>
<re.Match object; span=(2387, 2399), match='904-555-8559'>
<re.Match object; span=(2475, 2487), match='889-555-7393'>
<re.Match object; span=(2562, 2574), match='195-555-2405'>
<re.Match object; span=(2647, 2659), match='321-555-9053'>
<re.Match object; span=(2734, 2746), match='133-555-1711'>
<re.Match object; span=(2826, 2838), match='900-555-5428'>
<re.Match object; span=(2915, 2927), match='760-555-7147'>
<re.Match object; span=(3012, 3024), match='391-555-6621'>
<re.Match object; span=(3103, 3115), match='932-555-7724'>
<re.Match object; span=(3192, 3204), match='609-555-7908'>
<re.Match object; span=(3284, 3296), match='800-555-8810'>
<re.Match object; span=(3372, 3384), match='149-555-7657'>
<re.Match object; span=(3452, 3464), match='130-555-9709'>
<re.Match object; span=(3535, 3547), match='143-555-9295'>
<re.Match object; span=(3624, 3636), match='903-555-9878'>
<re.Match object; span=(3714, 3726), match='574-555-3194'>
<re.Match object; span=(3802, 3814), match='496-555-7533'>
<re.Match object; span=(3887, 3899), match='210-555-3757'>
<re.Match object; span=(3971, 3983), match='900-555-9598'>
<re.Match object; span=(4056, 4068), match='866-555-9844'>
<re.Match object; span=(4140, 4152), match='669-555-7159'>
<re.Match object; span=(4225, 4237), match='152-555-7417'>
<re.Match object; span=(4317, 4329), match='893-555-9832'>
<re.Match object; span=(4407, 4419), match='217-555-7123'>
<re.Match object; span=(4498, 4510), match='786-555-6544'>
<re.Match object; span=(4588, 4600), match='780-555-2574'>
<re.Match object; span=(4676, 4688), match='926-555-8735'>
<re.Match object; span=(4762, 4774), match='895-555-3539'>
<re.Match object; span=(4859, 4871), match='874-555-3949'>
<re.Match object; span=(4945, 4957), match='800-555-2420'>
<re.Match object; span=(5034, 5046), match='936-555-6340'>
<re.Match object; span=(5123, 5135), match='372-555-9809'>
<re.Match object; span=(5210, 5222), match='890-555-5618'>
<re.Match object; span=(5292, 5304), match='670-555-3005'>
<re.Match object; span=(5382, 5394), match='509-555-5997'>
<re.Match object; span=(5475, 5487), match='721-555-5632'>
<re.Match object; span=(5566, 5578), match='900-555-3567'>
<re.Match object; span=(5656, 5668), match='147-555-6830'>
<re.Match object; span=(5745, 5757), match='582-555-3426'>
<re.Match object; span=(5830, 5842), match='400-555-1706'>
<re.Match object; span=(5921, 5933), match='525-555-1793'>
<re.Match object; span=(6011, 6023), match='317-555-6700'>
<re.Match object; span=(6099, 6111), match='974-555-8301'>
<re.Match object; span=(6189, 6201), match='800-555-3216'>
<re.Match object; span=(6273, 6285), match='746-555-4094'>
<re.Match object; span=(6360, 6372), match='922-555-1773'>
<re.Match object; span=(6445, 6457), match='711-555-4427'>
<re.Match object; span=(6530, 6542), match='355-555-1872'>
<re.Match object; span=(6619, 6631), match='852-555-6521'>
<re.Match object; span=(6711, 6723), match='691-555-5773'>
<re.Match object; span=(6803, 6815), match='332-555-5441'>
<re.Match object; span=(6889, 6901), match='900-555-7755'>
<re.Match object; span=(6971, 6983), match='379-555-3685'>
<re.Match object; span=(7061, 7073), match='127-555-9682'>
<re.Match object; span=(7152, 7164), match='789-555-7032'>
<re.Match object; span=(7243, 7255), match='783-555-5135'>
<re.Match object; span=(7336, 7348), match='315-555-6507'>
<re.Match object; span=(7427, 7439), match='481-555-5835'>
<re.Match object; span=(7515, 7527), match='365-555-8287'>
<re.Match object; span=(7607, 7619), match='911-555-7535'>
<re.Match object; span=(7693, 7705), match='681-555-2460'>
<re.Match object; span=(7779, 7791), match='274-555-9800'>
<re.Match object; span=(7864, 7876), match='800-555-1372'>
<re.Match object; span=(7953, 7965), match='300-555-7821'>
<re.Match object; span=(8043, 8055), match='133-555-3889'>
<re.Match object; span=(8129, 8141), match='705-555-6863'>
<re.Match object; span=(8218, 8230), match='215-555-9449'>
<re.Match object; span=(8309, 8321), match='988-555-6112'>
<re.Match object; span=(8395, 8407), match='623-555-3006'>
<re.Match object; span=(8479, 8491), match='192-555-4977'>
<re.Match object; span=(8564, 8576), match='178-555-4899'>
<re.Match object; span=(8648, 8660), match='952-555-3089'>
<re.Match object; span=(8741, 8753), match='900-555-6426'>
We can see it’s super easy to parse .txt files using re in Python!
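Since data.txt itself is not included in this post, here is a self-contained sketch of the same idea. It writes a small, made-up file to a temporary directory first (the names and address are invented; the two phone numbers are taken from the output above) and collects the matched text rather than printing Match objects.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical file contents; the real data.txt is not shown in this post
sample = 'Jane Doe\n615-555-7164\n173 Main St.\n\nJohn Roe\n800-555-5669\n'

pattern = re.compile(r'\d{3}.\d{3}.\d{4}')

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'data.txt'
    path.write_text(sample, encoding='utf-8')

    # Read the whole file into one string, as in the example above
    with open(path, 'r', encoding='utf-8') as f:
        contents = f.read()

# match.group() returns just the matched text instead of the whole Match object
numbers = [match.group() for match in pattern.finditer(contents)]
print(numbers)  # ['615-555-7164', '800-555-5669']
```

Reading the whole file at once is fine for small files; for very large ones, searching line by line may be a better fit.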
Groups
- A group combined with | lets one pattern match several alternatives (here Mr, Ms, or Mrs)
pattern = re.compile(r'M(r|s|rs)\.?\s[A-Z]\w*')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)
Output:
<re.Match object; span=(225, 235), match='Mr. Powers'>
<re.Match object; span=(236, 244), match='Mr Smith'>
<re.Match object; span=(245, 253), match='Ms Davis'>
<re.Match object; span=(254, 267), match='Mrs. Robinson'>
<re.Match object; span=(268, 273), match='Mr. T'>
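Groups also capture what they matched, and the captured pieces can be read back with match.group(n). The sketch below is a variation on the pattern above, not the same pattern: it spells the title out as (Mr|Ms|Mrs) and adds a second group around the surname so both parts can be pulled out separately.

```python
import re

names = '''
Mr. Powers
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
'''

# Group 1 captures the title, group 2 captures the surname
pattern = re.compile(r'(Mr|Ms|Mrs)\.?\s([A-Z]\w*)')

pairs = [(m.group(1), m.group(2)) for m in pattern.finditer(names)]
print(pairs)
# [('Mr', 'Powers'), ('Mr', 'Smith'), ('Ms', 'Davis'), ('Mrs', 'Robinson'), ('Mr', 'T')]
```

group(0), or group() with no argument, is always the entire match.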
Writing Regular Expressions for Email Example
import re
emails = '''
devinjpowers@gmail.com
powers88@msu.edu
devin-powers3@grand-rapids.net
'''
pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
matches = pattern.finditer(emails)
for match in matches:
    print(match)
Output:
<re.Match object; span=(1, 23), match='devinjpowers@gmail.com'>
<re.Match object; span=(24, 40), match='powers88@msu.edu'>
<re.Match object; span=(41, 71), match='devin-powers3@grand-rapids.net'>
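The same pattern can also be used to check a single string rather than search a block of text. A minimal sketch, assuming a hypothetical helper named looks_like_email; fullmatch succeeds only when the entire string fits the pattern. This pattern is deliberately permissive and should not be treated as a complete email validator.

```python
import re

EMAIL = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')

def looks_like_email(candidate):
    """Return True if the whole string fits the pattern (hypothetical helper)."""
    return EMAIL.fullmatch(candidate) is not None

print(looks_like_email('powers88@msu.edu'))      # True
print(looks_like_email('powers88@msu'))          # False, no dot after the domain
print(looks_like_email('see powers88@msu.edu'))  # False, extra text before the address
```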
Working with URLs
- Writing a regular expression that matches URLs and keeps only the domain name
import re
urls = '''
https://www.google.com
http://github.co
https://youtube.com
https://www.nasa.gov
http://www.devintheengineer.com
http://www.AustinPowers.com
'''
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
subbed_urls = pattern.sub(r'\2\3', urls)
print(subbed_urls)
Output:
google.com
github.co
youtube.com
nasa.gov
devintheengineer.com
AustinPowers.com
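The \2 and \3 in the replacement string refer to the second and third groups of the pattern. To see what each group holds, the groups can be printed per match. A short sketch using three of the URLs above:

```python
import re

urls = '''
https://www.google.com
http://github.co
https://youtube.com
'''

pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

for match in pattern.finditer(urls):
    # group(0) is the whole match; group(1) is None when 'www.' is absent
    print(match.group(0), match.group(1), match.group(2), match.group(3))

# Domain name and top-level domain for each URL
parts = [(m.group(2), m.group(3)) for m in pattern.finditer(urls)]
print(parts)  # [('google', '.com'), ('github', '.co'), ('youtube', '.com')]
```

Because the optional www. sits in its own group, leaving \1 out of the replacement is what drops it from the output.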