Remember that Perl's key benefit is the ability to implement almost any kind of input/output processing system you might need or conceive, without a lot of code or development time. When you are faced with massive amounts of data and a small amount of analytical time, this agility is critical. I will not be teaching regular expression syntax, but there are countless primers and resources on the web for this, and they apply almost universally to languages/interpreters other than Perl, including our favorite command line tool, grep. Consider the following code:
#!/usr/bin/perl
# UserSplit.pl
# Creates user-specific files from a single log file based on the field "User="
$logfile = $ARGV[0];
open(LOG, "<$logfile") or die "Usage: UserSplit.pl <filename>\n";
print "Processing $logfile...\n";
while (<LOG>) {
    if (/User\=(\w+)/i) {
        open(USERFILE, ">>$1.$logfile.log");
        print USERFILE $_;
        close USERFILE;
    }
}
close LOG;
It accepts a log file at the command line (e.g. "UserSplit.pl logfile.20090511"), and each time the string "User=" appears in the file it will create (and then append to) a new log file holding all the original log entries specific to that username. For example, if my username in these logs were "mworman", a file would be created named "mworman.logfile.20090511.log". This new file would contain every line of the original log file in which "User=mworman" appears. I've now accomplished multiple things at once:
- I've determined the number of unique usernames in the file, i.e. the number of new files created.
- I've created a re-usable, "greppable" file for each user, allowing me to perform calculations on any subset of them. For example, I can immediately see which users had the most/least activity based on file sizes, I can use "diff" to compare any subset, etc.
- I've ensured date synchronization between the original log file and the new set of files by re-using the date. When grinding data down from the petabyte and terabyte levels to something more manageable, this kind of thing becomes really important for maintaining your sanity as well as the integrity of your analysis.
- I can reuse this code for patterns other than "User=" simply by altering the regular expression.
It may not look like much, but this little script is very useful; it is something I wrote to separate the User fields in the logs of a Symantec appliance into a set of user-specific activity files. Other than the regular expression in the IF statement, this script is very similar to the one I posted a few weeks back. One reader correctly pointed out that that script could have been replaced with a single grep command (depending on your command-line powers this is often possible, but not always practical or wise); this script is just as simple but far more powerful and extensible for analytical purposes. Again, the matching pattern ("User=") could literally be changed on the fly to any regular expression (a generalized sketch follows the examples below), including
- IP addresses:
(\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b)
- Visa Credit Card Numbers:
(^4[0-9]{12}(?:[0-9]{3})?$)
- Valid Social Security Numbers:
(^(?!000)([0-6]\d{2}|7([0-6]\d|7[012]))([ -]?)(?!00)\d\d\3(?!0000)\d{4}$)
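To make that "on the fly" idea concrete, here is a minimal sketch of a generalized version that takes the pattern as a second command-line argument instead of hard-coding "User=". The script name (PatternSplit.pl) and the fallback behavior for patterns without a capture group are my own assumptions, not part of the original script:

#!/usr/bin/perl
# PatternSplit.pl (hypothetical) -- generalizes UserSplit.pl so the
# regular expression is supplied at the command line. The first capture
# group in the pattern becomes the output file prefix.
use strict;
use warnings;

my ($logfile, $pattern) = @ARGV;
die "Usage: PatternSplit.pl <filename> <pattern>\n"
    unless defined $logfile and defined $pattern;

open(my $log, "<", $logfile) or die "Cannot open $logfile: $!\n";
while (<$log>) {
    if (/$pattern/i) {
        my $key = defined $1 ? $1 : "match";  # fall back if no capture group
        open(my $out, ">>", "$key.$logfile.log")
            or die "Cannot open $key.$logfile.log: $!\n";
        print $out $_;
        close $out;
    }
}
close $log;

Running "PatternSplit.pl logfile.20090511 'User=(\w+)'" reproduces the original behavior, while swapping in the IP address pattern above (parentheses and all, so the address itself is captured) splits the same log by IP instead.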
What else can we do? The sky is the limit, folks. Does the input to this process have to be a static text file? No; in fact, I have a similar script (barely twice the size of this one) that scans a given list of IP addresses/hostnames and generates a text file of hostname->Windows username pairs for each system with a currently logged-on user (this uses a Perl NetBIOS module available from cpan.org, your best one-stop repository for Perl development).
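For the curious, the core of that host->user loop looks roughly like the following. This is a sketch from memory, assuming the Net::NBName module from CPAN and the convention that NetBIOS names registered with suffix <03> include the logged-on user; check the module's documentation for the exact interface:

#!/usr/bin/perl
# LoggedOn.pl (hypothetical) -- map hosts to logged-on Windows users.
# Assumes Net::NBName from CPAN; reads one IP/hostname per line on STDIN.
use strict;
use warnings;
use Net::NBName;

my $nb = Net::NBName->new;
while (my $host = <STDIN>) {
    chomp $host;
    my $status = $nb->node_status($host) or next;  # no answer? move on
    for my $rr ($status->names) {
        # <03> (Messenger service) entries include the user name;
        # note the machine name also registers <03>, so expect both.
        print "$host -> ", $rr->name, "\n" if $rr->suffix == 0x03;
    }
}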
Adding simple TCP/IP functionality to scripts like this starts to move us into the area of "banner grabbing" and network scanning, and sure enough, many popular network scanners (e.g. Nessus) began as glorified Perl scripts that iterated pattern matching across lists of network hosts or subnets.
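To show how small that first step is, here is a minimal banner-grabbing sketch using only the core IO::Socket::INET module. The script name and input format are my own; it simply prints the first line sent by services that talk first, such as SSH, SMTP, and FTP:

#!/usr/bin/perl
# BannerGrab.pl (hypothetical) -- reads "host port" pairs on STDIN and
# prints the greeting banner, if any, from each listening service.
use strict;
use warnings;
use IO::Socket::INET;

while (my $line = <STDIN>) {
    chomp $line;
    my ($host, $port) = split ' ', $line;
    next unless $host and $port;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port,
        Proto    => 'tcp',
        Timeout  => 5,
    ) or next;            # closed or filtered port: skip quietly
    # Note: Timeout covers the connect only; a production script would
    # also wrap the read in alarm() so silent services can't hang it.
    my $banner = <$sock>; # SSH, SMTP, FTP, etc. speak first
    print "$host:$port $banner" if defined $banner;
    close $sock;
}

Feed the banners through the same kind of pattern matching as the log splitter and you have the skeleton of a rudimentary network scanner.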
Once you get the basics of Perl and regular expressions down, a trip to CPAN.org will show you just how much re-usable code is out there: modules for everything from Windows to Samba to UNIX to things like LED.pm:
"Provides an interface for turning the front LED on Apple laptops running Linux on and off. The user needs to open /dev/adb and print either "ON" or "OFF" to it to turn the LED on and off, respectively."
Whether or not turning the LED on Apple laptops on and off is a forensic necessity is an exercise left to the reader. Or maybe someone now sees Perl as an instrument they can use to access and analyze devices in ways they hadn't thought possible before. The sky is the limit, folks.
A gracious thanks to GCFA Rod Caudle, who just reminded me of the BEST tool for regular expression development (which is really an art of its own) I have ever used. RegexCoach is a tool someone introduced to me years ago, and it is priceless for playing with regexps and tweaking them until you get the one you are looking for. It lets you provide test strings to ensure your regexp matches, and it will syntax-color the portions of the test string that do match, greatly speeding up development time. Having easily spent more time figuring out complex regular expressions than actually writing the Perl code wrapping them, I can't plug this utility enough, even though I'd forgotten about it for the last ten years or so. Thanks Rod!
Mike Worman, GCFA Gold #124, GCIA Gold #282, is an incident response, forensics, and information security subject matter expert for an international telecommunications carrier. He holds a BS in Computer Systems Engineering from the University of Massachusetts, an MS in Information Assurance from Norwich University, and is CISSP-certified.