Files >> Memory
Sometimes, memory is slower than a file.
In Python, there are "file-like objects" that all respond to the same set of methods: read, write, close, that sort of thing. Some of them are files on disk, and some are in-memory buffers that implement those same methods. You would generally expect memory access to be faster than file access, but with cStringIO and csv, that doesn't appear to be the case.
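A minimal sketch of the idea (Python 2, matching the era of the post; the data and filename are illustrative): csv.reader can't tell an in-memory buffer from a real file, because both present the same interface.

```python
import csv
import cStringIO

data = "a,b,c\r\n1,2,3\r\n"

# In-memory buffer: responds to read/readline/close like a file.
buf = cStringIO.StringIO(data)
print list(csv.reader(buf))

# Actual file on disk: same interface, so csv.reader treats it identically.
f = open('example.csv', 'wb')
f.write(data)
f.close()
print list(csv.reader(open('example.csv', 'rb')))
```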
I had a 4000-line CSV file, about 700k worth, and another one twice that size. Parsing the first file took 10 seconds; the 8000-line file took 40 seconds (amd64/Ubuntu Linux/Python 2.4). Holy O(n^2) Batman! Since the csv module is just reading lines off a file (which implicitly advances the file pointer), I'm guessing that the cStringIO implementation of a file pointer is a numerical offset, and that each read traverses the string from the beginning until it gets to the right spot. It's not an actual pointer to a spot in memory.
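The slow path presumably looked roughly like this (a hedged reconstruction, not the original code; the filenames and the helper name are hypothetical):

```python
import csv
import cStringIO
import time

def parse_from_buffer(path):
    # Read the whole file into a string, wrap it in an in-memory
    # buffer, and time how long csv takes to walk it line by line.
    data = open(path, 'rb').read()
    start = time.time()
    for row in csv.reader(cStringIO.StringIO(data)):
        pass
    return time.time() - start

print parse_from_buffer('small.csv')  # ~4000 lines: 10s on that box
print parse_from_buffer('big.csv')    # ~8000 lines: 40s -- O(n^2) scaling
```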
So I ripped out this code and used a temporary file (which, on this machine, is going to be memory-backed until it gets swapped out), and the times dropped to 2 and 4 seconds respectively. Almost certainly the file pointer here is an actual pointer to a position in memory, so each movement only traverses the length of the line, not the length of the file.
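The replacement would look something like this (again a sketch under the same assumptions; `parse_via_tempfile` is a hypothetical name). tempfile.TemporaryFile hands the data to an OS-managed file, which usually stays in the page cache, so it is still effectively memory:

```python
import csv
import tempfile
import time

def parse_via_tempfile(data):
    # Spill the string to an OS-managed temp file, rewind, and parse.
    tmp = tempfile.TemporaryFile()
    tmp.write(data)
    tmp.seek(0)  # rewind to the start before csv reads it
    start = time.time()
    for row in csv.reader(tmp):
        pass
    return time.time() - start

print parse_via_tempfile(open('small.csv', 'rb').read())  # ~2s
print parse_via_tempfile(open('big.csv', 'rb').read())    # ~4s
```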
A big, counterintuitive win here, gained by ignoring the in-memory buffering that Python provides and using the buffering that the OS provides instead.