[olug] awk script to separate apache log files

Adam Haeder adamh at aiminstitute.org
Fri Mar 23 18:13:19 UTC 2007


Spent some time on this and thought it would be useful to share with the
group.

If you've got a single Apache logfile that contains entries for all of your
virtual hosts, with the name of the virtual host as the first field on each
line, and you want to create a separate web log for each virtual host
(without the virtual host name) so you can run webalizer (or whatever) on
it, try this awk script:

awk '{ domain = $1; sub(/^www\./, "", domain); $1 = ""; sub(/^[ \t]+/, ""); print >> ("/tmp/logs/" domain) }' "$WEBLOGFILE"

where $WEBLOGFILE is your Apache logfile.

This will create files in /tmp/logs (the directory must already exist). Each
file is named after its virtual host (minus the leading 'www.') and contains
all the lines from the log that belong to that virtual host. The first 'sub'
strips the 'www.' from the domain, and the second 'sub' removes the leading
white space left over from assigning "" to $1. Since the script appends with
'>>', empty /tmp/logs before re-running it or you'll get duplicate lines.
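To make the individual steps easier to follow, here is the same logic spread
over several lines with comments. This is only a sketch: the temporary
directory, the three sample log lines (example.com / example.org hosts) and
the 'outdir' awk variable are made up for the demo and are not part of the
one-liner above.

```shell
#!/bin/sh
# Demo setup (hypothetical paths and data, for illustration only).
WORKDIR=$(mktemp -d)
OUTDIR="$WORKDIR/logs"            # stand-in for /tmp/logs
WEBLOGFILE="$WORKDIR/access.log"  # stand-in for the real Apache log
mkdir -p "$OUTDIR"

cat > "$WEBLOGFILE" <<'EOF'
www.example.com 192.0.2.1 - - [01/Jan/2007:00:00:01 -0600] "GET /a HTTP/1.1" 200 10
example.com 192.0.2.2 - - [01/Jan/2007:00:00:02 -0600] "GET /b HTTP/1.1" 200 20
blog.example.org 192.0.2.3 - - [01/Jan/2007:00:00:03 -0600] "GET /c HTTP/1.1" 200 30
EOF

awk -v outdir="$OUTDIR" '
{
    domain = $1                    # first field is the virtual host name
    sub(/^www\./, "", domain)      # strip a leading "www." from the copy
    $1 = ""                        # blank field 1; awk rebuilds $0 with a leading space
    sub(/^[ \t]+/, "")             # trim that leading white space from $0
    print >> (outdir "/" domain)   # append the remaining line to the per-host file
}' "$WEBLOGFILE"

ls "$OUTDIR"
```

With this sample input, the www.example.com and example.com lines should end
up together in one file, and blog.example.org gets a file of its own.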

For example, here is a snippet from one of my logs:

careerlink.com 205.188.117.65 - - [01/Jan/2007:00:00:00 -0600] "GET /cgi-bin/redirect.pl?redirecttype=apply&domain=40.adg&key=9/9/9/2&po=014666&doco=409992&redirect=http://up.aihres.com/application/index.htm?po=014666&domain=67.adg&where=outside HTTP/1.1" 302 351 "http://careerlink.com/9/9/9/2/po/014666.htm?doco=&po=014666&career=&industry=0&use=consolidated&employer=&firm=" "Mozilla/4.0 (compatible; MSIE 6.0; AOL 9.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
careerlink.com 71.223.153.112 - - [01/Jan/2007:00:00:01 -0600] "GET /cgi-bin/redirect.pl?redirecttype=apply&domain=40.adg&key=9/9/9/2&po=014676&doco=409992&redirect=http://up.aihres.com/application/index.htm?po=014676&domain=67.adg&where=outside HTTP/1.1" 302 351 "http://careerlink.com/9/9/9/2/po/014676.htm?doco=&po=014676&career=&industry=0&use=consolidated&employer=&firm=" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
www.firstdatajobs.com 70.10.175.188 - - [01/Jan/2007:00:00:01 -0600] "GET /longapp.php?req=026DE10600246 HTTP/1.1" 302 324 "http://www.careerbuilder.com/JobSeeker/ApplyOnline/ExternalApply.aspx?useframes=True&aourl=http%3a%2f%2fwww.firstdatajobs.com%2flongapp.php%3freq%3d026DE10600246&sc_cmp1=JS_JobDetails_ExtApply&Job_DID=J8C6R76Q8JR2P2NLGDC" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20060912 Netscape/8.1.2"
firstdatajobs.com 70.10.175.188 - - [01/Jan/2007:00:00:02 -0600] "GET /longapp.php?req=026DE10600246 HTTP/1.1" 302 5 "http://www.careerbuilder.com/JobSeeker/ApplyOnline/ExternalApply.aspx?useframes=True&aourl=http%3a%2f%2fwww.firstdatajobs.com%2flongapp.php%3freq%3d026DE10600246&sc_cmp1=JS_JobDetails_ExtApply&Job_DID=J8C6R76Q8JR2P2NLGDC" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20060912 Netscape/8.1.2"
nebraskapanhandle.careerlink.com 66.249.66.227 - - [01/Jan/2007:00:00:02 -0600] "GET /state/ne/city/beatrice/page41.htm HTTP/1.1" 200 8816 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
careerlink.com 68.224.162.53 - - [01/Jan/2007:00:00:04 -0600] "GET /0/5/5/1/employer.htm HTTP/1.1" 200 1848 "http://careerlink.com/state/c2/logo3.htm" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
careerlink.com 68.224.162.53 - - [01/Jan/2007:00:00:04 -0600] "GET /0/5/5/1/mast.htm HTTP/1.1" 200 5750 "http://careerlink.com/0/5/5/1/index_m.htm" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
careerlink.com 68.224.162.53 - - [01/Jan/2007:00:00:04 -0600] "GET /0/5/5/1/avantas5.jpg HTTP/1.1" 200 84444 "http://careerlink.com/0/5/5/1/index.htm" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
siouxfalls.careerlink.com 66.249.66.227 - - [01/Jan/2007:00:00:04 -0600] "GET /1/2/1/3/po/000271f.htm HTTP/1.1" 200 19749 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
careerlink.com 68.224.162.53 - - [01/Jan/2007:00:00:04 -0600] "GET /0/5/5/1/index_m.htm HTTP/1.1" 200 1398 "http://careerlink.com/0/5/5/1/employer.htm" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"

If I run that log through this awk script, I get the following in
/tmp/logs:

logs:~ # ls -al /tmp/logs/
total 24
drwxr-xr-x 2 root root 4096 2007-03-23 13:10 .
drwxrwxrwt 8 root root 4096 2007-03-23 13:09 ..
-rw-r--r-- 1 root root 1819 2007-03-23 13:10 careerlink.com
-rw-r--r-- 1 root root  826 2007-03-23 13:10 firstdatajobs.com
-rw-r--r-- 1 root root  185 2007-03-23 13:10 nebraskapanhandle.careerlink.com
-rw-r--r-- 1 root root  175 2007-03-23 13:10 siouxfalls.careerlink.com
logs:~ #

Now I can call webalizer on each of these files to get separate statistics
for each domain.

I used to do this with a bash shell script that looped through the Apache
log and used 'cut' to pull off the domain. On a 6 GB log file, that script
took almost 24 hours to run. This awk script does the same thing in 26
minutes.

--
Adam Haeder
Vice President of Information Technology
AIM Institute
adamh at aiminstitute.org
(402) 345-5025 x115
PGP Public key: http://www.haederfamily.org/pgp.html


