Regular expressions are a behemoth of syntactical nightmares; at least, that is the opinion I have always had of them. The problem with learning them, in my opinion, is the lack of guided examples that resemble what I actually want to accomplish. Tutorials tend to be very generic explanations of the syntax, and when you try to bend them to your own purposes it all collapses on your head.
So I want to work through a scenario where I need regular expressions, and hopefully it will be helpful for someone else down the line.
This is just a very basic guide to a simple scenario, and I will not go in depth into what exactly everything does. The hope is just that it will bring a little clarity to other lost souls out there.
What
I needed to rework the logging logic of a system I am working on, since it was very basic and very outdated (timestamps didn't record milliseconds, the logs carried no severity, and there was little to no context for what was printed). With that done, I was curious about developing a small Python script that could extract useful information from the massive amounts of logs, to aid the people who actually have to read them when problems arise.
Some example situations might be as follows.
Only show entries for:
- A certain time/date interval.
- Specific processes.
- A specific process id.
So we need to be able to extract the different elements of a log entry so that the script then can utilize them in any way we might need them.
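To make the goal concrete, here is a rough sketch of the kind of filtering the script could do once each entry has been split into fields. The dictionary keys, the filter_entries helper and the sample values are assumptions made for illustration; the rest of this post deals with getting from raw log lines to dictionaries like these.

```python
from datetime import datetime

# Hypothetical, already-parsed entries; the field names are assumed here.
entries = [
    {"timestamp": datetime(2012, 1, 13, 13, 0, 0, 250000),
     "process": "MyServer", "pid": 8071,
     "entry": "Gosh! Something went wrong around here"},
    {"timestamp": datetime(2012, 1, 13, 13, 0, 0, 270000),
     "process": "otherserver", "pid": 2871,
     "entry": "MyServer was just joking."},
]


def filter_entries(entries, start=None, end=None, process=None, pid=None):
    """Return the entries that satisfy every filter that was supplied."""
    selected = []
    for item in entries:
        if start is not None and item["timestamp"] < start:
            continue
        if end is not None and item["timestamp"] > end:
            continue
        if process is not None and item["process"] != process:
            continue
        if pid is not None and item["pid"] != pid:
            continue
        selected.append(item)
    return selected


only_my_server = filter_entries(entries, process="MyServer")
by_pid = filter_entries(entries, pid=2871)
late_ones = filter_entries(entries, start=datetime(2012, 1, 13, 13, 0, 0, 260000))
```

Each filter is optional, so the same helper covers the time interval, process and process id cases from the list above.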
The log format I want to process is similar to the following:
[2012-01-13 13:00:00.250, Error < MyServer:8071 > [ContextInProcess|LocalInfo] Gosh! Something went wrong around here
[2012-01-13 13:00:00:270, Info <otherserver:2871 > [BullshitDetector] MyServer was just joking.
Go ahead and ignore it.
What we can see here is that the information is padded to help with readability (these logs are made for humans, so they need to be easy to read) and that multi-line entries are allowed as well.
So this post will be focused on parsing out the different elements of the log entries so they can be processed later.
How
Since the log entries consist of several elements, we want to extract them and give them names so we can easily use them later.
Since this is quite a daunting task for the regex-phobic, we take it one step at a time and start very simple.
First we create a simple Python script that just reads my log sample from above and runs each line through a pattern compiled with re.compile; then we can continue building from there.
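import re


def main():
    # Open the sample log; Log.txt is expected next to the script.
    log_file = open("Log.txt")
    # One named group that captures the whole line.
    com = re.compile(r'^(?P<everything>.*)', re.X)
    for line in log_file:
        match = com.match(line)
        if match:
            print(match.groupdict())
    log_file.close()


if __name__ == '__main__':
    main()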
The Python code itself should be quite straightforward.
We open Log.txt, read it one line at a time and run each line through the regular expression we set up. If it matches, we print the resulting dictionary. At the end we close the file.
One interesting thing about the Python code is re.X, which enables verbose mode for the regex compiler.
This basically means that it ignores any whitespace we insert into the expression (with some exceptions) and lets us comment the expression with #, which makes the expression readable in our code.
So this is what it would look like if we tried to structure the above expression a little bit better.
re.compile(r'''
    ^                   # from the start of the line
    (?P<everything>     # we create a group called "everything"
        .*              # the group will match anything
    )                   # end of group
    ''', re.X)
It will give the exact same result, but now it is so readable that I doubt I will have to explain the elements of this expression.
So now let's tackle the expression itself. After running the script you would get the following output.
{'everything': '[2012-01-13 13:00:00.250, Error < MyServer:8071 > [ContextInProcess|LocalInfo] Gosh! Something went wrong around here'}
{'everything': '[2012-01-13 13:00:00:270, Info <otherserver:2871 > [BullshitDetector] MyServer was just joking.'}
{'everything': 'Go ahead and ignore it.'}
We got three prints, one for every line in the file. Each print is a dictionary with only one key (everything), which contains the entire row from the log. The commented version of the expression makes it fairly clear why: the single group matches anything.
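As a small aside, the values can also be read one group at a time instead of as a whole dictionary. The sketch below uses a shortened sample line of my own instead of Log.txt so that it runs on its own, and it checks that the compact and the commented patterns agree.

```python
import re

# Shortened sample line, used here only so the sketch needs no Log.txt.
line = "[2012-01-13 13:00:00.250, Error < MyServer:8071 > [Context] Gosh!"

compact = re.compile(r'^(?P<everything>.*)', re.X)
verbose = re.compile(r'''
    ^                   # from the start of the line
    (?P<everything>     # we create a group called "everything"
        .*              # the group will match anything
    )                   # end of group
    ''', re.X)

compact_match = compact.match(line)
verbose_match = verbose.match(line)

# Both patterns capture the same text.
same_result = compact_match.groupdict() == verbose_match.groupdict()

# A named group can be read on its own, by name.
whole_line = verbose_match.group('everything')
print(same_result)
print(whole_line)
```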
So this is all fine and dandy, but it is not very usable in this manner. We would like to group the different elements in the log entries so we get a nicely formatted dictionary at the end with all the relevant keys that we would need.
We also want to ensure that the entries are correctly formatted. Suppose a multi-line log entry is ever printed which looks like this:
[2012-01-13 15:00:00.980, Error < MyServer:8071 > [ContextInProcess|LocalInfo] Interesting stuff on the next line.
[something, other < whatever: nice > [ hmm ] ok
The second line should fail to match, since we expect the first field to be a timestamp and not an arbitrary string. Then we know that this line is actually part of the last log entry we parsed, and it should be appended to it.
Since we have delimiters in the format of the log it is quite simple to extract the different groups we want.
In this next version we have just extracted the different elements of our log entry and given them appropriate names.
com = re.compile(r'''
    ^\[                 # the line starts with a [
    (?P<timestamp>      # timestamp group
        .*
    )
    ,                   # delimiter
    (?P<severity>       # severity group
        .*
    )
    <                   # delimiter
    (?P<process>        # process group
        .*
    )
    :                   # delimiter
    (?P<pid>            # process id group
        .*
    )
    >\ \[               # delimiter for "> ["
    (?P<context>        # context group
        .*
    )
    \]                  # delimiter
    (?P<entry>          # the log entry group
        .*
    )
    ''', re.X)
This gives us:
{'severity': ' Error ', 'process': ' MyServer', 'timestamp': '2012-01-13 13:00:00.250', 'pid': '8071 ', 'context': 'ContextInProcess|LocalInfo', 'entry': ' Gosh! Something went wrong around here'}
{'severity': ' Info ', 'process': 'otherserver', 'timestamp': '2012-01-13 13:00:00:270', 'pid': '2871 ', 'context': 'BullshitDetector', 'entry': ' MyServer was just joking.'}
Now we have all the groups we need, but no validation for any of the fields.
This code is also quite simple. We just specify our groups and then select our delimiters between the groups. Note that we have to escape special characters and spaces: at line 18 you see ">\ \[", which means "> [", the three-character delimiter between the process id group and the context group in our format. The results of the script have not had the excess whitespace trimmed from the elements, but that is outside the scope of this post since it is very easily handled in Python later.
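Trimming the padding afterwards could look something like the sketch below. The strip_fields helper is a made-up name for illustration, and the dictionary is copied from the first result above.

```python
def strip_fields(fields):
    """Return a copy of a groupdict with surrounding whitespace removed."""
    return dict((name, value.strip()) for name, value in fields.items())


# First result from the output above, padding included.
raw = {
    'severity': ' Error ',
    'process': ' MyServer',
    'timestamp': '2012-01-13 13:00:00.250',
    'pid': '8071 ',
    'context': 'ContextInProcess|LocalInfo',
    'entry': ' Gosh! Something went wrong around here',
}

clean = strip_fields(raw)
print(clean['severity'])  # Error
print(clean['pid'])       # 8071
```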
We could of course stop here since now we do have the data that we want in the format that we want.
However, the special case mentioned earlier about the multi-line entry would give a false positive with the code we currently have. To avoid this we can specify the expected contents of the different groups. If a line doesn't match what we expect, the expression won't return anything, and we take this as a hint that the line is a continuation of the last match we had.
So here comes a final version of our expression.
com = re.compile(r'''
    ^\[                 # the line starts with a [
    (?P<timestamp>      # timestamp group
        [0-9]{4}        # year
        -
        [0-9]{2}        # month
        -
        [0-9]{2}        # day
        \               # whitespace
        [0-9]{2}        # hour
        :
        [0-9]{2}        # minute
        :
        [0-9]{2}        # second
        [.:]            # . or : as in the sample log (a bare . matches anything)
        [0-9]{3}        # millisecond
    )
    ,                   # delimiter
    (?P<severity>       # severity group
        [A-Za-z ]+
    )
    <                   # delimiter
    (?P<process>        # process group
        [A-Za-z ]+
    )
    :                   # delimiter
    (?P<pid>            # process id group
        [0-9 ]+
    )
    >\ \[               # delimiter for "> ["
    (?P<context>        # context group
        .*
    )
    \]                  # delimiter
    (?P<entry>          # the log entry group
        .*
    )
    ''', re.X)
What we have done here is set up some rules for some of the groups (timestamp, severity, process and pid), which makes it harder to get false positives.
If we take a look at the timestamp group we see the following type of statements:
[0-9]{4}
[0-9] means that we want to match any digit from 0 to 9.
{4} means that we want exactly 4 such digits.
So the above one would indicate a year, since it has four digits.
The entire timestamp is defined with such pieces, with the appropriate delimiters in between. This does not affect the group itself, since the outer group remains unchanged; we have only changed the pattern it uses to detect a valid timestamp.
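To see the counted repetitions at work in isolation, the timestamp rules can be tried on their own. In this sketch the TIMESTAMP name and the test strings are my own. The separator before the milliseconds is written as [.:] because the sample log contains both forms, whereas a bare . would accept any character. Converting the text with datetime.strptime is an extra step that is not part of the expression.

```python
import re
from datetime import datetime

# Stand-alone version of the timestamp rules; TIMESTAMP is a name chosen here.
TIMESTAMP = re.compile(r'''
    [0-9]{4} - [0-9]{2} - [0-9]{2}      # year-month-day
    \                                   # escaped space between date and time
    [0-9]{2} : [0-9]{2} : [0-9]{2}      # hour:minute:second
    [.:]                                # separator, either . or :
    [0-9]{3}                            # milliseconds
    $                                   # nothing may follow
    ''', re.X)

good = TIMESTAMP.match('2012-01-13 13:00:00.250')
colon_form = TIMESTAMP.match('2012-01-13 13:00:00:270')
bad = TIMESTAMP.match('something')
short_year = TIMESTAMP.match('12-01-13 13:00:00.250')
wrong_separator = TIMESTAMP.match('2012-01-13 13:00:00x250')

print(good is not None)         # True
print(bad is None)              # True
print(wrong_separator is None)  # True

# Once the text is known to be well formed it can be turned into a datetime.
parsed = datetime.strptime(good.group(0), '%Y-%m-%d %H:%M:%S.%f')
print(parsed.microsecond)       # 250000
```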
In the severity group we have:
[A-Za-z ]+
[A-Za-z ] is just a variation of the one from the timestamp group. It matches any upper-case or lower-case letter from A to Z. Also notice the added space: we are not trimming whitespace, so without it the match would fail as soon as it hit the padding. Note that we do not have to escape the space here, because whitespace inside a character class is kept even in verbose mode.
+ means one or more. So if one of the fields were blank, the match would fail. If we didn't care whether they were empty, we could replace + with *, which means zero or more.
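Here are a few quick checks of these two building blocks, written as a sketch with made-up test strings. It also shows why the range is written A-Za-z and not A-z: the latter additionally covers the punctuation characters that sit between 'Z' and 'a' in ASCII, such as '[' and '_'.

```python
import re

one_or_more = re.compile(r'^[A-Za-z ]+$', re.X)
zero_or_more = re.compile(r'^[A-Za-z ]*$', re.X)

# The space inside the character class is kept, even in verbose mode.
print(one_or_more.match(' Error ') is not None)   # True

# + needs at least one character, * is happy with none.
print(one_or_more.match('') is None)              # True
print(zero_or_more.match('') is not None)         # True

# Digits are not part of the class, so this is rejected.
print(one_or_more.match('Error42') is None)       # True

# Pitfall: A-z also spans the characters [ \ ] ^ _ ` between 'Z' and 'a'.
too_wide = re.compile(r'^[A-z]+$')
print(too_wide.match('_[') is not None)           # True
```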
The context and entry groups we leave as they were since they can be very diverse and we don't want any restrictions there.
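Putting it together, the sketch below shows one way the final expression could be used to fold continuation lines into the entry before them, which is the behaviour described earlier. The LOG_LINE name, the parse_lines helper and the in-memory sample lines are assumptions for illustration, and the separator before the milliseconds is written as [.:] only because the sample log contains both forms.

```python
import re

# Condensed copy of the final expression; LOG_LINE is a name chosen here.
LOG_LINE = re.compile(r'''
    ^\[
    (?P<timestamp>
        [0-9]{4}-[0-9]{2}-[0-9]{2}      # date
        \                               # escaped space
        [0-9]{2}:[0-9]{2}:[0-9]{2}      # time
        [.:][0-9]{3}                    # milliseconds
    )
    ,
    (?P<severity> [A-Za-z ]+ )
    <
    (?P<process> [A-Za-z ]+ )
    :
    (?P<pid> [0-9 ]+ )
    >\ \[
    (?P<context> .* )
    \]
    (?P<entry> .* )
    ''', re.X)


def parse_lines(lines):
    """Parse log lines, appending unmatched lines to the previous entry."""
    entries = []
    for line in lines:
        line = line.rstrip('\n')
        match = LOG_LINE.match(line)
        if match:
            # Strip the padding that the format adds for readability.
            fields = dict((k, v.strip()) for k, v in match.groupdict().items())
            entries.append(fields)
        elif entries:
            # Not a new entry: treat it as a continuation of the last one.
            entries[-1]['entry'] += '\n' + line
        # Unmatched lines before the first entry are skipped in this sketch.
    return entries


sample = [
    "[2012-01-13 13:00:00.250, Error < MyServer:8071 > [ContextInProcess|LocalInfo] Gosh! Something went wrong around here",
    "[2012-01-13 15:00:00.980, Error < MyServer:8071 > [ContextInProcess|LocalInfo] Interesting stuff on the next line.",
    "[something, other < whatever: nice > [ hmm ] ok",
]

entries = parse_lines(sample)
print(len(entries))            # 2
print(entries[1]['entry'])
```

The tricky second line of the multi-line example no longer produces an entry of its own; it ends up appended to the entry of the line before it.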
Afterthought
With the discovery of the verbose flag for regex it sure became a lot easier to read these kinds of expressions, so that was an encouraging discovery.
There are still some things that could be done with the expression. For example the context group could be entirely optional which would impact the delimiters etc, so that might be material for a second post in the future if I end up going there.
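As a rough idea of what that could look like, the context can be wrapped, together with its brackets, in an optional non-capturing group. This is only a sketch: it assumes that an entry without context simply omits the "[...]" part and keeps the "> " delimiter, which the format described in this post does not confirm. It also narrows the context to "anything except ]", and it only handles the tail of a line to keep the example short. The TAIL name is made up.

```python
import re

# Tail of a log line, starting at the "> " that follows the process id.
TAIL = re.compile(r'''
    ^>\                         # delimiter "> "
    (?:                         # optional, non-capturing wrapper
        \[
        (?P<context> [^\]]* )   # context: anything except a closing bracket
        \]
    )?                          # the whole "[context]" part may be missing
    (?P<entry> .* )             # the log entry group
    ''', re.X)

with_context = TAIL.match('> [ContextInProcess|LocalInfo] Gosh!')
without_context = TAIL.match('> Gosh!')

print(with_context.group('context'))     # ContextInProcess|LocalInfo
print(without_context.group('context'))  # None
print(without_context.group('entry'))    # Gosh!
```

A message that itself starts with "[" would be misread as a context here, so a real version would need a stricter rule.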
My opinions of regular expressions have slightly improved after doing this, but I still find it rather crude when you want to do some more "complex" things so there is still a long way to go before I become a convert.
Let me know in the comments what you thought. Did it give you anything even though it was very basic?


