I’ve been using my own PHP web statistics script for over a year now. Recently I noticed that some dates were missing from the reports. It turns out PHP has a limit of about 2GB on files opened with fopen (typical of 32-bit builds), even though the script reads the file line by line and never holds more than one line in memory.
The solution is to use the Linux split command to break the file into manageable pieces and process them one by one. Don’t go crazy and try to split it into 2GB pieces unless you have abundant RAM. In my experience, split with the -C option holds a whole piece in memory, so splitting into 2GB files means the process uses 2GB of RAM while doing it. Ouch!
Since I’m working with 1GB of RAM in total, I decided to go with 100MB files, which uses about 100MB of RAM during the split. I also wanted my files to have the prefix zzz_split_ (instead of the default x). “zzz” simply sorts the pieces to the end of the directory listing.
split -C 100m access_log.old zzz_split_
This command split my Apache access_log file into 30 pieces of at most 100MB each, making sure that no line is broken across two files.
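To check the "no broken lines" behaviour on something smaller than a real log, here is a minimal sketch. It assumes GNU split; the sample file, the 1k piece size and the variable names are made up for illustration and are not part of my real setup.

```shell
#!/bin/sh
# Sketch: split a small sample file on line boundaries, then verify the pieces.
# Assumes GNU split (coreutils). File names and sizes are illustrative only.
workdir=$(mktemp -d)

# Build a sample "log" of 200 short lines (roughly 5KB in total).
i=1
while [ "$i" -le 200 ]; do
    echo "line $i of the sample log"
    i=$((i + 1))
done > "$workdir/sample_log"

# -C 1k: put at most 1024 bytes of whole lines into each piece.
split -C 1k "$workdir/sample_log" "$workdir/zzz_split_"

# Rejoining the pieces in glob (alphabetical) order should reproduce the input.
cat "$workdir"/zzz_split_* > "$workdir/rejoined"
if cmp -s "$workdir/sample_log" "$workdir/rejoined"; then
    rejoin_ok=yes
else
    rejoin_ok=no
fi

# A piece that was cut mid-line would not end with a newline character.
# Command substitution strips a trailing newline, so an empty result is good.
broken=0
for piece in "$workdir"/zzz_split_*; do
    if [ -n "$(tail -c 1 "$piece")" ]; then
        broken=$((broken + 1))
    fi
done

echo "rejoined matches original: $rejoin_ok, pieces ending mid-line: $broken"
rm -rf "$workdir"
```

If the rejoined file matches and no piece ends mid-line, the same options should be safe to use on the real access_log.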
I changed my PHP script to glob the split files in the directory and process them one at a time:
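$logfiles = '/home/admin/webstats/zzz_split_*';
// glob() returns the matching paths sorted alphabetically, so the pieces
// come back in the order split wrote them (zzz_split_aa, zzz_split_ab, ...)
foreach (glob($logfiles) as $logfile) {
    // $logfile is already the full path of one piece; indexing it with [0]
    // would only give its first character
    $handle = fopen($logfile, 'r') or die("Can't open the log file");
    ...
}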
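The same piece-by-piece loop can be tried out in shell before touching the PHP script. This is only a sketch under assumptions: the fake request lines, the piece_ prefix and the 404 tally are invented for the example, and split -l is used just to keep the sample tiny.

```shell
#!/bin/sh
# Hypothetical sample data: 30 fake request lines, every third one a 404.
workdir=$(mktemp -d)
i=1
while [ "$i" -le 30 ]; do
    if [ $((i % 3)) -eq 0 ]; then status=404; else status=200; fi
    echo "GET /page$i $status"
    i=$((i + 1))
done > "$workdir/sample_log"

# Seven lines per piece, so 30 lines become 5 pieces (7+7+7+7+2).
split -l 7 "$workdir/sample_log" "$workdir/piece_"

# Visit the pieces one at a time, in the alphabetical order the glob expands
# to, and keep a running total across all of them.
pieces=0
total_404=0
for piece in "$workdir"/piece_*; do
    hits=$(grep -c ' 404$' "$piece")
    total_404=$((total_404 + hits))
    pieces=$((pieces + 1))
done

echo "pieces: $pieces, 404 lines: $total_404"
rm -rf "$workdir"
```

The point is that the running totals live outside the loop, so the result is the same as if the whole file had been read in one go.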
Here’s a (wo)man page for split
NAME
    split - split a file into pieces
SYNOPSIS
    split [OPTION] [INPUT [PREFIX]]
DESCRIPTION
    Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
    PREFIX is 'x'. With no INPUT, or when INPUT is -, read standard input.
    Mandatory arguments to long options are mandatory for short options
    too.
    -a, --suffix-length=N
        use suffixes of length N (default 2)
    -b, --bytes=SIZE
        put SIZE bytes per output file
    -C, --line-bytes=SIZE
        put at most SIZE bytes of lines per output file
    -l, --lines=NUMBER
        put NUMBER lines per output file
    --verbose
        print a diagnostic to standard error just before each output
        file is opened
    --help
        display this help and exit
    --version
        output version information and exit
    SIZE may have a multiplier suffix: b for 512, k for 1K, m for 1 Meg.
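To see what the SIZE suffixes from the man page come out to in bytes, here is a small sketch. It assumes GNU split; the throwaway files and the b_, k_ and m_ prefixes are made up for the example.

```shell
#!/bin/sh
# Sketch of the SIZE multipliers, assuming GNU split. File names are made up.
workdir=$(mktemp -d)

# Throwaway inputs filled with zero bytes: 3000 bytes and 2500 KiB.
dd if=/dev/zero of="$workdir/small" bs=1000 count=3 2>/dev/null
dd if=/dev/zero of="$workdir/large" bs=1024 count=2500 2>/dev/null

# One unit of each suffix per piece.
split -b 1b "$workdir/small" "$workdir/b_"
split -b 1k "$workdir/small" "$workdir/k_"
split -b 1m "$workdir/large" "$workdir/m_"

# Size of the first (full) piece for each suffix.
b_size=$(wc -c < "$workdir/b_aa" | tr -d ' ')
k_size=$(wc -c < "$workdir/k_aa" | tr -d ' ')
m_size=$(wc -c < "$workdir/m_aa" | tr -d ' ')

# The 100m used for the access_log works out to 100 * 1024 * 1024 bytes.
hundred_m=$((100 * 1024 * 1024))

echo "1b=$b_size 1k=$k_size 1m=$m_size 100m=$hundred_m"
rm -rf "$workdir"
```

So the 100m in my command means pieces of at most 104857600 bytes each.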