Analyzing Website Traffic

Okay, so you spent weeks learning HTML, coding perfectly balanced pages, scanning your company logo and photos of your employees, and agonizing over every word. Finally the big day came, and it all went live on the Internet.

Now you're wondering - is anyone looking at your website? Do they like what they see? Do all your links work, or have some of the websites you linked to disappeared? The answers to all these questions (and more!) can be found in your log files. Log files are generated by your web server every time someone accesses your site, and you can use the information collected there to update and improve your website.

The organization that hosts your website should collect two kinds of data about people who access your pages. The first kind is the errors, or places where people had problems. The second kind is the file transfers, or accesses, that were made successfully from your pages. If you've set your pages up correctly, the transfers file will be orders of magnitude bigger than the errors file. (If you have a dedicated line and your own webserver, you'll have to configure the server software to collect both types of information yourself.) In the remainder of this article we'll look at the information available in these files, and how you can use it to maintain and improve your site.

Error Log File

The error log file will have entries like the following:

[Tue Feb 18 18:03:06 1997] httpd: send aborted for sjx-ca16-27.ix.netcom.com
[Fri Feb 21 17:31:14 1997] httpd: access to /usr/runtime/www_igate.rt/B/Balbes/boys.gif failed for crc13.cris.com, reason: file does not exist from -
[Sun Mar 2 23:24:13 1997] httpd: send aborted for pm105_246.promedia.net
[Mon Mar 3 09:22:43 1997] httpd: access to /usr/runtime/www_igate.rt/B/Balbes/shops.html failed for akron63.imperium.net, reason: file does not exist from -

The first field shows the date and time that access was attempted. The rest of the entry shows what was being requested, and why the access failed. In the first entry (dated Feb 18), "send aborted" indicates that the person got tired of waiting and stopped the transfer (or their connection went down). Do you have a lot of large graphics that people are not willing to wait for? If this happens often, you should cut your file sizes down. ("Designing for the Web" by Jennifer Niederst, published by O'Reilly and Associates, has excellent suggestions on how to do this.)

The other main type of error is "access to filename.html failed for foo.bar.com, reason: file does not exist", as in the entry dated Feb 21. This means the link was invalid for some reason: the filename was changed, the server on which it resides was down, etc. If the same file appears repeatedly as non-existent, there is a serious problem. First check to make sure there is not a typographical error in your HTML file, and then make sure the link is still valid by clicking on it yourself. If the server is down, wait a while and try again to make sure it's not a permanent outage. By monitoring the contents of the error file, and correcting the problems it indicates, you can keep your site current.
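To see which problems come up most often, the two kinds of error entries can be tallied automatically. The following is a minimal sketch that assumes the error log looks exactly like the sample entries above; the patterns and the function name summarize_errors are illustrative, and other servers word these messages differently.

```python
import re
from collections import Counter

# Patterns modeled on the two kinds of entries in the sample error log.
# This is an assumption about the format, not a general error-log parser.
ABORTED = re.compile(r"httpd: send aborted for (?P<host>\S+)")
MISSING = re.compile(
    r"httpd: access to (?P<path>\S+) failed for (?P<host>[^,\s]+), "
    r"reason: file does not exist"
)

def summarize_errors(lines):
    """Count aborted transfers and tally missing files by path."""
    aborted = 0
    missing = Counter()
    for line in lines:
        if ABORTED.search(line):
            aborted += 1
            continue
        match = MISSING.search(line)
        if match:
            missing[match.group("path")] += 1
    return aborted, missing

sample = [
    "[Tue Feb 18 18:03:06 1997] httpd: send aborted for sjx-ca16-27.ix.netcom.com",
    "[Fri Feb 21 17:31:14 1997] httpd: access to /usr/runtime/www_igate.rt/B/Balbes/boys.gif "
    "failed for crc13.cris.com, reason: file does not exist from -",
    "[Sun Mar 2 23:24:13 1997] httpd: send aborted for pm105_246.promedia.net",
    "[Mon Mar 3 09:22:43 1997] httpd: access to /usr/runtime/www_igate.rt/B/Balbes/shops.html "
    "failed for akron63.imperium.net, reason: file does not exist from -",
]

aborted, missing = summarize_errors(sample)
print("aborted transfers:", aborted)
for path, count in missing.most_common():
    print(count, path)
```

A file that shows up near the top of the missing list again and again is the one to check first.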

Access Log File

If you've put your site together well, the access (or transfers) file will be much larger than the errors file. Typical entries in the access file might look like the following.

gfpm01-053.mcn.net - - [08/Mar/1997:15:04:17 -0500] "GET /~Balbes/index.shtml HTTP/1.0" 200 8084 "" "Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)"
gfpm01-053.mcn.net - - [08/Mar/1997:15:04:24 -0500] "GET /~Balbes/background.gif HTTP/1.0" 200 2749 "" "Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)"

Each access by any user is recorded as a separate line. These two entries, made from the same place, with the same browser and only a few seconds apart, are almost certainly from the same person, who loaded the file "index.shtml" and its background file "background.gif". There are seven different fields of information collected about each transfer. These are:

the name (or IP address) of the computer from which the request was made,
the date and time of the request,
what the request was (normally which file they were accessing),
the response code given (a number indicating how the transfer went, like the dreaded 404),
the number of bytes transferred,
the referrer URL,
and the agent (browser) used to access the files.

(More on agent field confusion in a little bit).
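The fields listed above can be pulled out of each line with a regular expression. This is a sketch that assumes the log looks like the sample entries in this article (host, two "-" placeholders, bracketed timestamp, quoted request, status, bytes, quoted referrer, quoted agent); the names LINE and parse_line are illustrative only.

```python
import re

# One named group per field described in the article. The two "-" columns
# between the host and the timestamp are skipped.
LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of the seven fields, or None if the line does not match."""
    match = LINE.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry

sample = (
    'gfpm01-053.mcn.net - - [08/Mar/1997:15:04:17 -0500] '
    '"GET /~Balbes/index.shtml HTTP/1.0" 200 8084 "" '
    '"Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)"'
)

entry = parse_line(sample)
for name, value in entry.items():
    print(f"{name}: {value}")
```

Returning None for lines that do not match keeps a truncated or damaged entry from stopping the whole run.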

Since I know this page also has two smaller images, and this particular user did not download either of them (there were no more lines at this time from the same host), I can assume they were running their browser with graphics downloading turned off. Compare that with the following user's entries.

ppp7.millnet.net - - [08/Mar/1997:10:06:30 -0500] "GET /~Balbes/ HTTP/1.0" 200 8084 "" "Mozilla/3.01Gold (Win95; I)"
ppp7.millnet.net - - [08/Mar/1997:10:06:33 -0500] "GET /~Balbes/background.gif HTTP/1.0" 304 0 "" "Mozilla/3.01Gold (Win95; I)"
ppp7.millnet.net - - [08/Mar/1997:10:06:33 -0500] "GET /~Balbes/face.gif HTTP/1.0" 304 0 "" "Mozilla/3.01Gold (Win95; I)"
ppp7.millnet.net - - [08/Mar/1997:10:06:34 -0500] "GET /~Balbes/smhard.gif HTTP/1.0" 304 0 "" "Mozilla/3.01Gold (Win95; I)"

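This user requested not only the page and its background image, but also the two other images on the page, so they must be surfing with graphics downloading enabled. (The 304 response code with 0 bytes transferred means the browser already had a current copy of each image in its cache, so the server did not need to send it again.)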
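The same comparison can be made for every visitor at once by grouping requests by host and counting how many were for images. The sketch below assumes entries shaped like the samples above; REQUEST, files_by_host and image_count are hypothetical names chosen for illustration.

```python
import re
from collections import defaultdict

# Pulls the host, requested path, and status code out of one access-log line.
REQUEST = re.compile(
    r'^(?P<host>\S+) .*?"GET (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)

def files_by_host(lines):
    """Map each host to the list of (path, status) pairs it requested."""
    requested = defaultdict(list)
    for line in lines:
        match = REQUEST.match(line)
        if match:
            requested[match.group("host")].append(
                (match.group("path"), int(match.group("status")))
            )
    return dict(requested)

def image_count(requests):
    """Count requests for GIF files, whether sent in full (200) or cached (304)."""
    return sum(1 for path, _status in requests if path.endswith(".gif"))

IE = '"" "Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)"'
NS = '"" "Mozilla/3.01Gold (Win95; I)"'
sample = [
    'gfpm01-053.mcn.net - - [08/Mar/1997:15:04:17 -0500] "GET /~Balbes/index.shtml HTTP/1.0" 200 8084 ' + IE,
    'gfpm01-053.mcn.net - - [08/Mar/1997:15:04:24 -0500] "GET /~Balbes/background.gif HTTP/1.0" 200 2749 ' + IE,
    'ppp7.millnet.net - - [08/Mar/1997:10:06:30 -0500] "GET /~Balbes/ HTTP/1.0" 200 8084 ' + NS,
    'ppp7.millnet.net - - [08/Mar/1997:10:06:33 -0500] "GET /~Balbes/background.gif HTTP/1.0" 304 0 ' + NS,
    'ppp7.millnet.net - - [08/Mar/1997:10:06:33 -0500] "GET /~Balbes/face.gif HTTP/1.0" 304 0 ' + NS,
    'ppp7.millnet.net - - [08/Mar/1997:10:06:34 -0500] "GET /~Balbes/smhard.gif HTTP/1.0" 304 0 ' + NS,
]

by_host = files_by_host(sample)
for host, requests in by_host.items():
    print(host, "requested", image_count(requests), "image(s)")
```

A host that fetched the page but fewer images than the page contains is a candidate for a visitor browsing with graphics turned off, although a request that simply never reached the log would look the same.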

The following entries show users who found the same pages as above, but through the search engines Lycos and InfoSeek. This is confirmation that these pages are being indexed, and shows what keywords the person was searching for when they found these pages.

mason.ge.com - - [08/Mar/1997:11:57:32 -0500] "GET /~Balbes/quality.html HTTP/1.0" 200 3420 "http://www.lycos.com/cgi-bin/pursuit?first=61&part=&cat=lycos&query=SOFTWARE+TESTING" "Mozilla/3.0Gold (Win95; I)"
p.infoseek.com/Titles?col=WW&sv=IS&lk=noframes&qt=telecommuting+career" "Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)"
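The search keywords sit in the query string of the referrer URL, so they can be extracted with Python's standard URL tools. This is a sketch only: the parameter names "query" and "qt" are taken from the two referrer URLs shown above, and other search engines will use different names. The function name search_terms is illustrative.

```python
from urllib.parse import urlsplit, parse_qs

# Query-string parameters that carried the search words in the sample
# referrers. This short list is an assumption based on those two entries.
KEYWORD_PARAMS = ("query", "qt")

def search_terms(referrer):
    """Return the list of search words found in a referrer URL, if any."""
    params = parse_qs(urlsplit(referrer).query)
    for name in KEYWORD_PARAMS:
        if name in params:
            # parse_qs has already turned "+" into a space.
            return params[name][0].split()
    return []

referrer = (
    "http://www.lycos.com/cgi-bin/pursuit"
    "?first=61&part=&cat=lycos&query=SOFTWARE+TESTING"
)
terms = search_terms(referrer)
print(terms)
```

An empty referrer, like the ones in the earlier sample entries, simply yields an empty list.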

The agent field is supposed to tell you which browser was used to make the request. As the browser wars continue, this field is getting more complicated. The different browsers try to look like each other, in order to get any browser-specific code sent by the server. For example,
Mozilla/3.0Gold (Win95; I)
indicates Netscape Navigator Gold, version 3.0, running on a Windows 95 machine, while
Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)
indicates Microsoft Internet Explorer, version 3.01, running on a Windows 95 machine, but trying to look like Netscape Navigator.
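Telling the two apart means looking past the leading "Mozilla" for the "compatible; MSIE" marker. The sketch below handles only the two agent formats discussed here; the function name identify_browser is illustrative, and treating every other Mozilla string as Netscape is a simplifying assumption.

```python
import re

def identify_browser(agent):
    """Return (browser, version) for the two agent formats discussed above."""
    # Internet Explorer announces itself inside the parentheses.
    msie = re.search(r"\(compatible; MSIE ([\d.]+)", agent)
    if msie:
        return "Microsoft Internet Explorer", msie.group(1)
    # Assumption: any remaining "Mozilla/x.y" agent is Netscape Navigator.
    netscape = re.match(r"Mozilla/([\d.]+)(Gold)?", agent)
    if netscape:
        name = "Netscape Navigator Gold" if netscape.group(2) else "Netscape Navigator"
        return name, netscape.group(1)
    return "unknown", ""

for agent in (
    "Mozilla/3.0Gold (Win95; I)",
    "Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)",
):
    print(identify_browser(agent))
```

The order of the two checks matters: testing for "Mozilla" first would count every Internet Explorer visit as Netscape.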

Logfile Analysis Tools

You can manually scan through the log file, but since it can be hundreds or even thousands of lines long, that's not very practical. There are a large number of usage analysis tools out there that will automate the process for you. There is a good list of tools that gives information on platforms, cost and capabilities. My personal favorite is Analog, for several reasons. First, there are versions for Macintosh, Unix, DOS, Windows and VMS, so you can ftp your logfiles (since they are plain text) and analyze them on whichever platform you prefer. Second, it's free. Third, while the default configuration gives a great report, Analog is easily customizable to produce exactly the report you want. A few of the most useful sections of a typical report are discussed below.

You can get a quick overview of activity on your site from the summary report, as shown below.

-----------------------------------------------------------------
Analysed requests from Sat-15-Feb-1997 18:24 to Sat-08-Mar-1997 15:58 (20.9 days).
Total successful requests: 1,697 (83)
Average successful requests per day: 81 (11)
Total successful requests for pages: 672 (39)
Total failed requests: 63 (2)
Total redirected requests: 161 (15)
Number of distinct files requested: 43 (15)
Number of distinct hosts served: 295 (21)
Number of new hosts served in last 7 days: 17
Total data transferred: 17,218 kbytes (434,698 bytes)
Average data transferred per day: 843,643 bytes (62,100 bytes)
(Figures in parentheses refer to the last 7 days).
-----------------------------------------------------------------

From this summary you can see that in the 3 weeks covered by this logfile data, almost 1,700 transfers of 43 different files were made to a total of 295 different hosts.
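The headline averages follow directly from the totals and the dates at the top of the report. As a rough check, and without assuming anything about how Analog itself rounds its figures, the arithmetic can be reproduced like this:

```python
from datetime import datetime

# Figures copied from the summary report above.
start = datetime(1997, 2, 15, 18, 24)
end = datetime(1997, 3, 8, 15, 58)
successful_requests = 1697
failed_requests = 63

# Length of the period in days, then the daily average.
days = (end - start).total_seconds() / 86400
requests_per_day = successful_requests / days

# Share of failures among successful and failed requests
# (redirected requests are left out of this ratio).
failure_share = failed_requests / (successful_requests + failed_requests)

print(f"period covered: {days:.1f} days")
print(f"successful requests per day: {requests_per_day:.1f}")
print(f"failed share of these requests: {failure_share:.1%}")
```

The failure share is a quick way to watch whether broken links are becoming more or less common from one report to the next.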

Bar graphs can also be created, showing how much data was transferred each week, or at what time of day transfers are highest. The domain report will show where the traffic on your site is coming from, based on the domain name of the user's host. For slightly more specific information, you can create a host report showing how many pages were requested by each host. (Example: foo.bar.com and spam.bar.com are two different hostnames, but they both have bar.com as their domain name.) Besides knowing where the traffic is coming from, you can find out what your visitors were looking for. The request report will show which files were accessed, and how many times. The referrer report will show which other pages sent traffic to your site (by having a link pointing to one of your pages). And if you want to know what they used to access your information, the browser report shows which web browsers were used to access your site.
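The difference between a host report and a domain report comes down to how much of the hostname is kept when counting. The sketch below follows the bar.com example from this article by keeping the last two parts of each name; it is not a description of how Analog builds its reports, and the function name domain_of is illustrative.

```python
from collections import Counter

def domain_of(host):
    """Return the last two labels of a hostname, e.g. foo.bar.com -> bar.com."""
    labels = host.lower().split(".")
    if labels[-1].isdigit():
        # A numeric IP address has no domain name to report.
        return "(unresolved)"
    return ".".join(labels[-2:])

# Hostnames taken from the article's examples, one per request.
hosts = [
    "foo.bar.com",
    "spam.bar.com",
    "gfpm01-053.mcn.net",
    "gfpm01-053.mcn.net",
    "ppp7.millnet.net",
    "mason.ge.com",
]

host_report = Counter(hosts)
domain_report = Counter(domain_of(host) for host in hosts)

print("requests by host:")
for host, count in host_report.most_common():
    print(f"  {count} {host}")
print("requests by domain:")
for domain, count in domain_report.most_common():
    print(f"  {count} {domain}")
```

Keeping only two labels is too coarse for country-code names such as example.co.uk, so a real report would need a smarter rule.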

Now that you know how to access all this information, you can use it to update your site on a continuing basis. If there are particular areas that are being accessed often, you'll want to pay particular attention to keeping them up to date, and adding more information. If there are files that are rarely or never accessed, you'll want to see how you can make their existence more prominent or simplify the path users must take to get to them. In effect, you've turned all your users into testers, and the information you need to serve them better is at your fingertips. It's up to you to take advantage of it.

The author, Lisa M. Balbes, Ph.D., has been a scientific software consultant since 1992, and has been developing websites since 1994. She welcomes comments on this article.