Saturday, November 14, 2009

Thinking about the file cache

So, under Linux, just about any memory that's not actively used by applications eventually gets used by the file cache. The file cache keeps a copy of data from recently-used files in memory so that you don't need to read them from disk if you need them again.

One great way to visualize this process is to install htop, add the memory counter, and configure that counter as a bar graph. The green portion of the graph represents memory actively used by applications, the blue portion represents buffers and such in use by the kernel (I'm still a tad unclear on this point; I think some shared memory mechanisms may be represented there), and the yellow portion is your file cache. The file cache holds data chosen by some heuristic I'm unfamiliar with. I might describe it as "popular" or "recently-used", but it's really up to the kernel devs.
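If you'd rather see the raw numbers than a bar graph, the kernel publishes them in /proc/meminfo. Here's a rough sketch that splits memory into the same bands. The function names are mine, and the "used" figure is a simple subtraction; htop's own accounting may well differ in the details.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value}.

    Most values are reported in kB; the unit suffix is dropped here.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_breakdown(fields):
    """Approximate the used / buffers / cached / free split, in kB."""
    total = fields["MemTotal"]
    free = fields.get("MemFree", 0)
    buffers = fields.get("Buffers", 0)
    cached = fields.get("Cached", 0)
    # Whatever is left over is treated as "used by applications".
    used = total - free - buffers - cached
    return {"used": used, "buffers": buffers, "cached": cached, "free": free}


def read_meminfo(path="/proc/meminfo"):
    """Read and break down the live numbers on a Linux machine."""
    with open(path) as handle:
        return memory_breakdown(parse_meminfo(handle.read()))
```

Calling read_meminfo() on a Linux box returns a small dict; on a machine that has been up for a while, the "cached" entry is usually the biggest one.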

My desktop machine has 8GB of RAM. That's an insane amount, by any conventional reasoning; even I'll admit that none of the applications I've run have used that much memory, 64-bit aware or not. However, again, any memory that's not being used by an application eventually gets used by the file cache, which means I eventually have about 6-7GB of file data cached in RAM. Believe me, it makes a difference when I'm not cycling through disk images tens of gigabytes long.

What if that file cache could be populated in advance? What if a filesystem could retain a snapshot of which files (or pages or sectors or blocks, however the cache organizes its data) were in the file cache at a particular time? I'm not talking about the file data itself, but pointers to that data. When the filesystem is mounted, assuming it's clean, the snapshot could be used for initially populating the filesystem cache.
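To make "pointers, not data" concrete, here's a minimal sketch of what such a snapshot might look like. It assumes something else has already told us, page by page, which parts of each file are resident (on Linux that information could come from mincore(2), which this sketch does not call). The function names, the JSON layout, and the fixed 4096-byte page size are all my own assumptions.

```python
import json

PAGE_SIZE = 4096  # assumed; real code could ask os.sysconf("SC_PAGE_SIZE")


def resident_extents(residency, page_size=PAGE_SIZE):
    """Collapse a per-page list of booleans into (offset, length) byte extents."""
    extents = []
    start = None  # index of the first page in the current resident run
    for index, resident in enumerate(residency):
        if resident and start is None:
            start = index
        elif not resident and start is not None:
            extents.append((start * page_size, (index - start) * page_size))
            start = None
    if start is not None:  # run extends to the end of the file
        extents.append((start * page_size, (len(residency) - start) * page_size))
    return extents


def build_snapshot(residency_by_path, page_size=PAGE_SIZE):
    """Serialize {path: [bool, ...]} into a JSON snapshot of cached extents.

    Only offsets and lengths are stored, never the file contents.
    """
    snapshot = {}
    for path, residency in residency_by_path.items():
        extents = resident_extents(residency, page_size)
        if extents:  # skip files with nothing cached
            snapshot[path] = extents
    return json.dumps(snapshot, sort_keys=True)
```

Storing runs of pages rather than individual pages keeps the snapshot small when whole files (or large chunks of them) are cached, which is the common case.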

At a naive level, the snapshot could be made when unmounting a write-enabled filesystem, though not when remounting to read-only. (Remounting read-only is a common failsafe response to hardware blips, and it doesn't make sense to try to commit data to a potentially failing device.) When the filesystem is next mounted, the file cache state could be restored, immediately bringing recently-used files into memory. That will increase the mount time, but in a large number of use cases, it will improve the speed of file access. You could even choose to not restore that file cache state without any worries for data integrity.
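You can approximate the restore step from userspace today, without any filesystem support. The sketch below assumes a snapshot shaped as a dict mapping each path to a list of (offset, length) pairs; that format and the function name are mine. It uses os.posix_fadvise with POSIX_FADV_WILLNEED, which is only a hint: the kernel is free to ignore it, and skipping it entirely costs nothing but speed.

```python
import os


def prefetch(snapshot):
    """Hint the kernel to read each recorded extent into the file cache.

    Returns the number of extents for which the hint was accepted.
    """
    hinted = 0
    for path, extents in snapshot.items():
        try:
            fd = os.open(path, os.O_RDONLY)
        except OSError:
            continue  # file is gone or unreadable; the snapshot is only advisory
        try:
            for offset, length in extents:
                try:
                    os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
                    hinted += 1
                except OSError:
                    pass  # a rejected hint is harmless; move on
        finally:
            os.close(fd)
    return hinted
```

Because every failure is silently skipped, a stale or partly wrong snapshot just warms less of the cache than it could have.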

More sophisticated approaches might allow the triggered switching of profiles. Let's say you use your system for web browsing as well as the occasional game, or even as a file server. You might have a different set of cache data for each purpose. Tie it to individual binaries, or even trigger it based on loading particular files, and be able to flush a large amount of data into the cache in anticipation of the workload that application has historically brought with it. Did gdm just start? Load all the GNOME pixmaps and sounds into the file cache. Did Firefox just start? Load the theme data, plugins and that stuff under ~/.mozilla-firefox.
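As a toy version of the per-binary idea, here's a sketch of a profile table keyed on the executable's name. The table contents, the directory choices, and the function name are all hypothetical; a real implementation would learn the file lists from observed behaviour rather than hard-code them.

```python
import os

# Hypothetical profile table: executable name -> directories worth warming.
PROFILES = {
    "gdm": ["/usr/share/pixmaps", "/usr/share/sounds"],
    "firefox": ["~/.mozilla-firefox"],
}


def files_for(executable, profiles=PROFILES):
    """List the regular files a profile would preload for this executable."""
    roots = profiles.get(os.path.basename(executable), [])
    found = []
    for root in roots:
        # os.walk yields nothing for a missing directory, so absent roots are skipped.
        for dirpath, _dirnames, filenames in os.walk(os.path.expanduser(root)):
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                if os.path.isfile(path):
                    found.append(path)
    return found
```

The resulting list is what you would hand to whatever does the actual cache warming when the trigger fires.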

So long as the filesystem is aware of these cache profiles, it might even be able to take advantage of some of the free space on the disk to keep copies of the cached files in a contiguous place on the block device, to speed up loading of the cached data. If the data was modified, of course, the filesystem would have to rebuild the cache block at an idle time in accordance with system energy usage policy. (For example, if you're on battery, you might only tack the modified version onto the end of the block. Or you might not rebuild the block at all until after you're wall-powered again.)
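The energy policy in that last parenthetical boils down to a small decision table. Here's one way it might be written; the function name, the three action labels, and the allow_append_on_battery switch are my own inventions for illustration.

```python
def rebuild_action(on_battery, system_idle, allow_append_on_battery=True):
    """Decide what to do with a cache block whose source data changed.

    Returns one of:
      "rebuild" - rewrite the whole contiguous block
      "append"  - tack the modified data onto the end of the block
      "defer"   - leave the block alone for now
    """
    if not on_battery:
        # On wall power, do the full rebuild, but only at an idle time.
        return "rebuild" if system_idle else "defer"
    # On battery, avoid the full rewrite: append cheaply or wait for wall power.
    return "append" if allow_append_on_battery else "defer"
```

Deferring is always safe here, since the contiguous block is only a faster copy of data that still lives in its normal place on disk.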
