Regex - an easier approach
2015 May 17
How to build more complex regex without going insane
“Regular expressions are a way to describe patterns in string data.”
- Excerpt From: Haverbeke, Marijn. “Eloquent JavaScript.” iBooks.
Defining the problem
Let's say we want to write a regex to match time-codes, so we can test whether a given time-code has a valid format.
We consider the notation hours, minutes, seconds, milliseconds, such as 01:17:23.736, where the milliseconds can be separated from the seconds by either a dot (.) or a comma (,). That is, hh:mm:ss.mmm or hh:mm:ss,mmm.
When writing a regex you'd be tempted to mash all the symbols together, test it and tweak it until it works, and get it done and over with asap. An alternative approach is to assign small patterns to variables and then combine those patterns. This makes the regex easier to read, easier to write, and most importantly easier to think about in a systematic way.
Exploring the problem
For this example I'm using JavaScript, but most languages support regex, and a pattern usually needs only minor tweaks to move from one language to another.
We know that \d matches any digit character and is equivalent to [0-9]. A . on its own matches any character, so to match a literal dot we escape it as \.
So just by looking at our time-code string 01:17:23.736 (hh:mm:ss.mmm) we can already describe it as:
/\d\d:\d\d:\d\d\.\d\d\d/
This can be refactored using the notation {number}, which defines how many times the preceding element is expected to occur. So writing \d{2} is equivalent to writing \d\d, where we expect two digits one after the other.
/\d{2}:\d{2}:\d{2}\.\d{3}/
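To make the two points above concrete, here is a small sketch comparing the patterns. The variable names are only for illustration. It shows that \d{2} accepts the same strings as \d\d, and that an unescaped . accepts any character where the dot should be, which is why escaping it as \. is the safer choice.

```javascript
// Two ways of writing "two digits": they accept the same strings.
const repeated = /\d\d:\d\d:\d\d/;
const quantified = /\d{2}:\d{2}:\d{2}/;
console.log(repeated.test("01:17:23"), quantified.test("01:17:23")); // true true

// An unescaped . matches any character, so an "x" is accepted in place of the dot.
const looseDot = /\d{2}:\d{2}:\d{2}.\d{3}/;
console.log(looseDot.test("01:17:23x736")); // true

// Escaping the dot as \. makes it match a literal dot only.
const strictDot = /\d{2}:\d{2}:\d{2}\.\d{3}/;
console.log(strictDot.test("01:17:23x736")); // false
console.log(strictDot.test("01:17:23.736")); // true
```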
Square brackets such as [xy] mean any single character that is either x or y. Most special characters, including the dot, lose their special meaning inside brackets.
So if we want to accept milliseconds separated by . as well as milliseconds separated by , we would write
/\d{2}:\d{2}:\d{2}[.,]\d{3}/
In JavaScript you would run this by assigning the regex pattern to a variable and calling the .test method on that variable, passing the string you want to test as the argument. Last but not least, console.log prints out the outcome.
const timeCode = /\d{2}:\d{2}:\d{2}[.,]\d{3}/;
console.log(timeCode.test("01:17:23.736"));
This is equivalent to skipping the variable declaration, if you want a one-liner.
console.log(/\d{2}:\d{2}:\d{2}[.,]\d{3}/.test("01:17:23.736"));
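Here is a short sketch that exercises the bracket notation with both separators. It also shows that a space typed inside the brackets counts as one of the allowed characters. The last part is an assumption about the goal: if the aim is to validate a whole string rather than find a time-code somewhere inside it, the pattern can be anchored with ^ and $. The name wholeTimeCode is made up for this example.

```javascript
const timeCode = /\d{2}:\d{2}:\d{2}[.,]\d{3}/;
console.log(timeCode.test("01:17:23.736")); // true
console.log(timeCode.test("01:17:23,736")); // true
console.log(timeCode.test("01:17:23 736")); // false

// A space inside the brackets is treated as one more allowed character.
const withSpace = /\d{2}:\d{2}:\d{2}[ . ,]\d{3}/;
console.log(withSpace.test("01:17:23 736")); // true

// Without anchors, .test succeeds if the pattern appears anywhere in the string.
console.log(timeCode.test("abc01:17:23.736xyz")); // true

// ^ and $ require the pattern to span the whole string.
const wholeTimeCode = /^\d{2}:\d{2}:\d{2}[.,]\d{3}$/;
console.log(wholeTimeCode.test("abc01:17:23.736xyz")); // false
console.log(wholeTimeCode.test("01:17:23,736")); // true
```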
A more sensible approach
Now you might have noticed that I got carried away and didn't really follow the approach I set out to show you. The code is not very readable, especially the last one-liner. I didn't define variables holding small regex patterns and combine them, which would make the result more manageable and readable, as well as more flexible if the time-code format I am looking for were to change.
So here's how you'd go about following this more sensible approach.
Once again you have your time-code string 01:17:23.736 (hh:mm:ss.mmm).
Now think about what the smallest units are. Hours, minutes, and seconds are all two digits each.
So I could start by writing a pattern to match that:
const twoDigit = /\d{2}/;
Then we see that the milliseconds are three digits.
const mmm = /\d{3}/;
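These can already be combined. A regex literal does not substitute variable names (writing twoDigit between the slashes would match the literal text "twoDigit"), so we build the combined pattern from each piece's .source using the RegExp constructor:
const timeCode = new RegExp(`${twoDigit.source}:${twoDigit.source}:${twoDigit.source}\\.${mmm.source}`);
The full example would look like: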
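// a simple readable approach to writing regex using variables.
const twoDigit = /\d{2}/;
const mmm = /\d{3}/;
// regex literals don't interpolate variables, so combine the .source strings
const timeCode = new RegExp(`${twoDigit.source}:${twoDigit.source}:${twoDigit.source}\\.${mmm.source}`);
console.log(timeCode.test("01:17:23.736"));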
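Regex literals do not substitute variable names, so the combining step needs some help to stay readable. One possible way, sketched here, is a small tagged-template helper. The rx function below is hypothetical, written for this example and not a built-in. It reads the .source of every RegExp placed in the template and keeps the text between them as raw regex syntax.

```javascript
// Hypothetical helper: builds a RegExp from a template, using the .source
// of any RegExp value and the raw (unprocessed) text between the values.
function rx(strings, ...parts) {
  const source = strings.raw.reduce((acc, chunk, i) => {
    const part = parts[i - 1];
    return acc + (part instanceof RegExp ? part.source : String(part)) + chunk;
  });
  return new RegExp(source);
}

const twoDigit = /\d{2}/;
const mmm = /\d{3}/;
const timeCode = rx`${twoDigit}:${twoDigit}:${twoDigit}[.,]${mmm}`;

console.log(timeCode.source); // \d{2}:\d{2}:\d{2}[.,]\d{3}
console.log(timeCode.test("01:17:23.736")); // true
```

Because the helper uses the raw template text, a backslash such as \d or \. can be typed once inside the template, just as in a regex literal.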
But time-codes come in pairs
But time-codes generally come in pairs, for example in subtitle files such as srt. So here is an example of taking the regex we wrote and expanding it to match a larger string.
00:00:06,500 --> 00:00:10,790
some text from a subtitle file srt
it can be usefull
“A part of a regular expression that is surrounded in parentheses counts as a single element as far as the operators following it are concerned.”
- Excerpt From: Haverbeke, Marijn. “Eloquent JavaScript.”
For this example we just consider the line with the time-codes. Let's say we have parsed the file and isolated that line into its own string.
00:00:06,500 --> 00:00:10,790
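As a rough idea of what that isolating step might look like, here is a minimal sketch. It assumes the subtitle block is already available as a string with one item per line. The names cue and timeCodeLine are made up for illustration.

```javascript
// A subtitle block as a single string, one item per line (sample text from above).
const cue = [
  "00:00:06,500 --> 00:00:10,790",
  "some text from a subtitle file srt",
  "it can be usefull",
].join("\n");

// Keep only the line that contains the arrow between the two time-codes.
const timeCodeLine = cue.split("\n").find((line) => line.includes("-->"));
console.log(timeCodeLine); // 00:00:06,500 --> 00:00:10,790
```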
Considering the code written in the previous section, we can now add to it as follows:
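// a simple readable approach to writing regex using variables.
const twoDigit = /\d{2}/;
const mmm = /\d{3}/;
// regex literals don't interpolate variables, so combine the .source strings
const timeCode = new RegExp(`${twoDigit.source}:${twoDigit.source}:${twoDigit.source},${mmm.source}`);
// parentheses make each time-code count as a single element
const twoTimeCodes = new RegExp(`(${timeCode.source}) --> (${timeCode.source})`);
console.log(twoTimeCodes.test("00:00:06,500 --> 00:00:10,790"));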
How much easier was that? We just had to add this line:
const twoTimeCodes = new RegExp(`(${timeCode.source}) --> (${timeCode.source})`);
and change which variable we run the test on to this last one.
console.log(twoTimeCodes.test("00:00:06,500 --> 00:00:10,790"));
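Wrapping each time-code in parentheses also makes it a capture group, so the same pattern can pull out the start and end time-codes as well as test the format. The sketch below assumes the comma-separated srt form. It anchors the pattern with ^ and $ so the whole line has to match, and the names start and end are only for illustration.

```javascript
const twoDigit = /\d{2}/;
const mmm = /\d{3}/;
const timeCode = new RegExp(`${twoDigit.source}:${twoDigit.source}:${twoDigit.source},${mmm.source}`);

// Each parenthesised time-code becomes a capture group.
const twoTimeCodes = new RegExp(`^(${timeCode.source}) --> (${timeCode.source})$`);

const match = twoTimeCodes.exec("00:00:06,500 --> 00:00:10,790");
// exec returns null when there is no match, so guard before reading the groups.
const start = match ? match[1] : null;
const end = match ? match[2] : null;
console.log(start, end); // 00:00:06,500 00:00:10,790
```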