Replacing Lines of Code with 2 Little Regexs in Postgres
Greetings readers, today we're going to take a semi-break from my “doing data science in SQL” series to cover a really cool use case I just solved with regular expressions (regex) in Postgres. For those of you who have a bad taste in your mouth from earlier run-ins with regexs, this will be more use case focused and I will do my best to explain the search patterns I used.
If you've never heard of regex, there are good resources to learn more about them but I will not be giving a tutorial. I highly recommend learning about them because:
- They are available in almost all modern programming languages
- They help you solve problems that would be really awkward in code
- Most editors also let use regex for find/replace
I have loved regexs ever since I met them. I find them highly logical and easier than writing a bunch of code. I know not all devs agree with me, but to each their own.
Background
I have been doing some work with Island Conservation to help them manage data. I am moving their data off different systems into one centralized PostgreSQL database (with the PostGIS extension as well).
They are doing a study trying to use contraceptives, as opposed to kill trapping, for rodent population management. One of the devices they were using in the field was a camera trap (otherwise known as a game camera).
While the field collector (software on a phone) was giving each camera a GUID for the primary key, the humans would always use the camera name to identify it. Unfortunately, the software was not using controlled vocabularies or value checking for the camera name on the client. All the data values were extracted from the data collector website in GeoJSON format. It was then converted to PostGIS SQL insert statements using ogr2ogr and imported into a table.
The Problem
Now is when things get interesting. Because they were not using a controlled vocabulary, or even formatting requirements on the camera name field, we got all sorts of values there. Some just had “testing” strings in their names and we could ignore them. But others were supposed to follow the format “camXXX” where XXX is 3 digits. The XXX would start with 001 for the first station and then increase accordingly for new stations measured.
Here are just some of the different variations I saw:
Value in Table for camera name | Should be |
---|---|
cam058 | This is correct |
cam08 | cam008 |
cam59 | cam059 |
CAM011 | cam011 |
The examples above should demonstrate that there are two classes of problems in the data:
- Camera names are missing a 0 preceding some of numeric part of the name where the digits are less than 100
- The letters in cam are capitalized and should be lowercase
The Solution
Now my first instinct was to say “time to break out some Python and process the original file and fix this”. But my new response kicked in and I said, “Let’s see if we can do this in the database with SQL!”
I start searching the PostgreSQL doc and look what
I find! If
you search the page you can find regexp_replace
which we'll use today (by the
way, this doc page should really list all the regexp_*
functions). What this
commands does is:
- Take an input string
- Test the regex against the input string
- If the regex returns true (the pattern is found), return the replacement string
- If it returns false, return the original input string
Ok,, so that fits our requirements quite nicely, time to get writing the regex. Let me be clear again, I am not going to teach you regex today but I will try to walk you through the patterns I used.
Fixing the missing 0s
Alright let’s put one of the incorrect strings here first:
cam08
Here is the SQL I used to fix it
UPDATE camera SET cameraname = regexp_replace(cameraname, 'cam(\d{2})\s\*$',
'cam0\1' )
The SQL parts of this should look like a straightforward UPDATE statement. Let’s
look at the regexp_replace
function call. The first parameter is the field we
want to search, the second parameter is the regex we are using for the test, and
the last parameter is the substitution phrase.
Here is the regex by itself:
‘cam(\d{2})\s\*$’
Breaking it down in English:
- Look for the string “cam” - capitalization matters in regex unless you tell the command to ignore case.
- Now the
()
means we are going to want to keep whatever is matched inside the parentheses for later \d
means a digit character (0-9){2}
means we want exactly 2 digits\s
means a space character-
means 0 or more space characters$
means anchor this pattern at the end of the string (this whole pattern has to occur at the end of the string)
Let me write this as sentences:
At the end of the string look for “cam” followed by exactly 2 digits, save the two digits for later, and there may be trailing spaces.
This exactly matches our records with only 2 numbers rather 3. I also discovered the “hidden” spaces while doing this so I had to put the \s on the end.
I had to anchor to the end so that we wouldn’t match cam001. This anchor is needed because cam00, in cam001, would also match “cam followed by exactly 2 digits followed by 0 or more spaces”.
Now when we look at the third parameter, the replacement string it should look pretty straight forward except the
cam0\1
This says put “cam0” at the beginning of the replacement string. For the \1 we have to go back to step 2 above. That group we put in the () was the first group we captured in the regex. So now \1 contains the content of the first group, which in our case was the 2 digits. If we had another () in the regex that would have been \2.
With this one call we replace all the incorrect camXX values because when the regex matches we get the new string and when it doesn’t we just set the original string back to itself. Boom!
Fixing the Capital Letters
With that other regex under our belt this one should be easier to understand. If you didn’t understand the other one maybe this will help you understand it more. Again, here is our problem string:
CAM011
Again we are going to use regex replace. Here is the regex expression I came up with:
UPDATE camera SET cameraname = regexp_replace(cameraname,
'^CAM(\d{3})\s\*$', 'cam\1')
Here is the regex by itself:
'^CAM(\d{3})\s\*$'
Breaking it down in English:
^
means anchor/start at the beginning of the line- Look for the string “CAM”
()
means we are going to want to save the contents inside\d
means means a digit (0-9){3}
means we want exactly 3 digits\s
means a space-
means 0 or more spaces$
means this pattern goes to the end of the line
The sentence for this one looks like:
The beginning of the line has to start with “CAM”, followed by exactly 3 digits that we save for later, then 0 or more spaces, and finally this string has to go to the end of the line.
I don’t want to pick up a string like “This is CAM000 test”. To make sure our
pattern starts at the beginning of the line, we use ^
. We also anchor at the
end of the line to ensure the pattern we are searching for is the whole line.
And the replacement parameter is this:
'cam\1'
Since we captured 3 digits in our parentheses, that capture can go right after the ‘cam’ string to make a valid camera name.
Wrap Up
I hope you see both the power and the ease of using regular expressions in the database. Just like programming languages, you can start with simple statements and grow in your expertise over time. There are also some nice web applications to play around and test your regex.
As I said before, you can also use regular expressions in so many other places, like your programming language of choice or your favorite IDE/editor. Do you love regex as much as me? Do you have any cool examples of using them in Postgres? If so shoot us a message on our twitter account. Happy pattern matching!