[olug] awk script to separate apache log files

Travis Owens openbook1441 at gmail.com
Fri Mar 23 19:35:59 UTC 2007


Adam,

Just out of curiosity: aren't these third-level domains you're hosting
individual virtual hosts? If so, why not have Apache write a separate
log for each domain?

Overly curious,
Travis

On 3/23/07, Adam Haeder <adamh at aiminstitute.org> wrote:
> Spent some time on this and thought it would be useful to share with the
> group.
>
> If you've got an Apache logfile that contains entries for every virtual
> host, with the name of the virtual host as the first field on each line,
> and you want to create a separate web log for each virtual host (without
> the virtual host name) so you can run webalizer (or whatever) on it, try
> this awk script:
>
> awk -F" " '{ domain = $1; sub(/^www\./, "", domain); $1 = ""; sub(/^[ \t]+/, ""); print >> ("/tmp/logs/" domain) }' "$WEBLOGFILE"
>
> where $WEBLOGFILE is the path to your Apache logfile.
>
> This will create files in /tmp/logs (the directory must already exist).
> Each file is named after its virtual host (minus the 'www.' prefix) and
> contains all the lines from the log that belong to that virtual host.
> The first 'sub' removes the leading 'www.' and the second 'sub' removes
> the leading white space left over from assigning "" to $1. Because the
> script appends (>>), empty /tmp/logs before re-running it on the same log.
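The same logic is easier to follow spread over several lines. This is a minimal sketch, assuming a POSIX shell and awk. The function name split_vhost_log, its two arguments, and the hostname sanity check are additions for illustration and are not part of the one-liner above:

```shell
# split_vhost_log LOGFILE OUTDIR  (hypothetical helper name)
# Splits a combined log whose first field is the virtual host name
# into one file per host under OUTDIR.
split_vhost_log() {
    logfile=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    awk -v outdir="$outdir" '
    {
        domain = $1                      # first field: virtual host name
        sub(/^www\./, "", domain)        # treat www.example.com as example.com
        # Extra guard (not in the original): skip anything that does not
        # look like a hostname, so a odd first field cannot escape outdir.
        if (domain !~ /^[A-Za-z0-9][A-Za-z0-9.-]*$/) next
        $1 = ""                          # awk rebuilds $0, leaving a leading space
        sub(/^[ \t]+/, "")               # strip that leading space from $0
        out = outdir "/" domain
        print >> out                     # append the trimmed line to the host file
    }' "$logfile"
}
```

Called as `split_vhost_log "$WEBLOGFILE" /tmp/logs`, it should behave like the one-liner, except that it creates the output directory and drops lines whose first field is not a plausible hostname.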
>
> For example, here is a snippet from one of my logs:
>
> careerlink.com 205.188.117.65 - - [01/Jan/2007:00:00:00 -0600] "GET /cgi-bin/redirect.pl?redirecttype=apply&domain=40.adg&key=9/9/9/2&po=014666&doco=409992&redirect=http://up.aihres.com/application/index.htm?po=014666&domain=67.adg&where=outside HTTP/1.1" 302 351 "http://careerlink.com/9/9/9/2/po/014666.htm?doco=&po=014666&career=&industry=0&use=consolidated&employer=&firm=" "Mozilla/4.0 (compatible; MSIE 6.0; AOL 9.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
> careerlink.com 71.223.153.112 - - [01/Jan/2007:00:00:01 -0600] "GET /cgi-bin/redirect.pl?redirecttype=apply&domain=40.adg&key=9/9/9/2&po=014676&doco=409992&redirect=http://up.aihres.com/application/index.htm?po=014676&domain=67.adg&where=outside HTTP/1.1" 302 351 "http://careerlink.com/9/9/9/2/po/014676.htm?doco=&po=014676&career=&industry=0&use=consolidated&employer=&firm=" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
> www.firstdatajobs.com 70.10.175.188 - - [01/Jan/2007:00:00:01 -0600] "GET /longapp.php?req=026DE10600246 HTTP/1.1" 302 324 "http://www.careerbuilder.com/JobSeeker/ApplyOnline/ExternalApply.aspx?useframes=True&aourl=http%3a%2f%2fwww.firstdatajobs.com%2flongapp.php%3freq%3d026DE10600246&sc_cmp1=JS_JobDetails_ExtApply&Job_DID=J8C6R76Q8JR2P2NLGDC" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20060912 Netscape/8.1.2"
> firstdatajobs.com 70.10.175.188 - - [01/Jan/2007:00:00:02 -0600] "GET /longapp.php?req=026DE10600246 HTTP/1.1" 302 5 "http://www.careerbuilder.com/JobSeeker/ApplyOnline/ExternalApply.aspx?useframes=True&aourl=http%3a%2f%2fwww.firstdatajobs.com%2flongapp.php%3freq%3d026DE10600246&sc_cmp1=JS_JobDetails_ExtApply&Job_DID=J8C6R76Q8JR2P2NLGDC" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20060912 Netscape/8.1.2"
> nebraskapanhandle.careerlink.com 66.249.66.227 - - [01/Jan/2007:00:00:02 -0600] "GET /state/ne/city/beatrice/page41.htm HTTP/1.1" 200 8816 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
> careerlink.com 68.224.162.53 - - [01/Jan/2007:00:00:04 -0600] "GET /0/5/5/1/employer.htm HTTP/1.1" 200 1848 "http://careerlink.com/state/c2/logo3.htm" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
> careerlink.com 68.224.162.53 - - [01/Jan/2007:00:00:04 -0600] "GET /0/5/5/1/mast.htm HTTP/1.1" 200 5750 "http://careerlink.com/0/5/5/1/index_m.htm" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
> careerlink.com 68.224.162.53 - - [01/Jan/2007:00:00:04 -0600] "GET /0/5/5/1/avantas5.jpg HTTP/1.1" 200 84444 "http://careerlink.com/0/5/5/1/index.htm" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
> siouxfalls.careerlink.com 66.249.66.227 - - [01/Jan/2007:00:00:04 -0600] "GET /1/2/1/3/po/000271f.htm HTTP/1.1" 200 19749 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
> careerlink.com 68.224.162.53 - - [01/Jan/2007:00:00:04 -0600] "GET /0/5/5/1/index_m.htm HTTP/1.1" 200 1398 "http://careerlink.com/0/5/5/1/employer.htm" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
>
> If I run that log through this awk script, I get the following in
> /tmp/logs:
>
> logs:~ # ls -al /tmp/logs/
> total 24
> drwxr-xr-x 2 root root 4096 2007-03-23 13:10 .
> drwxrwxrwt 8 root root 4096 2007-03-23 13:09 ..
> -rw-r--r-- 1 root root 1819 2007-03-23 13:10 careerlink.com
> -rw-r--r-- 1 root root  826 2007-03-23 13:10 firstdatajobs.com
> -rw-r--r-- 1 root root  185 2007-03-23 13:10 nebraskapanhandle.careerlink.com
> -rw-r--r-- 1 root root  175 2007-03-23 13:10 siouxfalls.careerlink.com
> logs:~ #
>
> Now I can call webalizer on each of these files to get unique metrics for
> that domain.
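One way to run a report per file is a small loop over the directory. This is only a sketch: for_each_vhost_log and vhost_report are made-up helper names, /var/www/stats is a placeholder path, and the webalizer options used (-n for the hostname, -o for the output directory) are worth checking against your local man page before relying on them:

```shell
# for_each_vhost_log DIR CMD [ARGS...]  (hypothetical helper)
# Runs CMD once per regular file in DIR, appending the virtual host name
# (the file name) and the file path as the last two arguments.
for_each_vhost_log() {
    logdir=$1
    shift
    for f in "$logdir"/*; do
        [ -f "$f" ] || continue          # skip subdirectories and an unmatched glob
        "$@" "$(basename "$f")" "$f"
    done
}

# Hypothetical wrapper: $1 is the host name, $2 the per-host log file.
# The output directory is a placeholder and is created if missing.
vhost_report() {
    mkdir -p "/var/www/stats/$1" &&
    webalizer -n "$1" -o "/var/www/stats/$1" "$2"
}
```

With both functions defined, `for_each_vhost_log /tmp/logs vhost_report` would produce one report directory per virtual host. Passing the command as an argument keeps the loop usable with other analyzers.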
>
> I was doing this with a bash shell script, looping through the Apache log
> and using 'cut' to pull off the domain. On a 6 GB log file, that script
> took almost 24 hours to run. This awk script does the same job in 26
> minutes.
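The shell script itself isn't shown, so the following is a guess at its general shape rather than the original code. The function name split_vhost_log_slow and its arguments are invented for illustration. A loop like this starts two pipelines (and two 'cut' processes) for every log line, which is most likely where the hours go on a multi-gigabyte file, whereas awk splits every line inside a single process:

```shell
# split_vhost_log_slow LOGFILE OUTDIR  (hypothetical reconstruction)
# Same result as the awk approach, but one pair of cut processes per line.
split_vhost_log_slow() {
    logfile=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    while IFS= read -r line; do
        # Each command substitution forks a subshell and runs cut once.
        domain=$(printf '%s\n' "$line" | cut -d' ' -f1)
        rest=$(printf '%s\n' "$line" | cut -d' ' -f2-)
        domain=${domain#www.}            # drop a leading "www."
        [ -n "$domain" ] || continue     # ignore blank lines
        printf '%s\n' "$rest" >> "$outdir/$domain"
    done < "$logfile"
}
```

Timing this against the awk version on a small sample of your own log is a quick way to see the per-line process overhead for yourself.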
>
> --
> Adam Haeder
> Vice President of Information Technology
> AIM Institute
> adamh at aiminstitute.org
> (402) 345-5025 x115
> PGP Public key: http://www.haederfamily.org/pgp.html
> _______________________________________________
> OLUG mailing list
> OLUG at olug.org
> http://lists.olug.org/mailman/listinfo/olug
>


-- 
Travis Owens

VISTA is just a secret codeword that Microsoft thought up which
actually stands for: Viruses, Intruders, Spy-ware, Trojans & Ad-ware


