Design pattern for memcached data caching


Question

It's easy to wrap optional memcached caching around your existing database queries. For example:

Old (DB-only):

function getX
    x = get from db
    return x
end

New (DB with memcache):

function getX
    x = get from memcache
    if found
      return x
    endif

    x = get from db
    set x in memcache
    return x
end

The thing is though, that's not always how you want to cache. For instance take the following two queries:

-- get all items (recordset)
SELECT * FROM items;

-- get one item (record)
SELECT * FROM items WHERE pkid = 42;

If I was to use the above pseudo-code to handle the caching, I would be storing all fields of item 42 twice. Once in the big record set and once on its own. Whereas I'd rather do something like this:

SELECT pkid FROM items;

and cache that index of PK's. Then cache each record individually as well.

So in summary, the data access strategy that will work best for the DB doesn't neatly fit the memcache strategy. Since I want the memcache layer to be optional (i.e. if memcache is down, the site still works) I kind of want to have the best of both worlds, but to do so, I'm pretty sure I'll need to maintain a lot of the queries in 2 different forms (1. fetch index, then records; and 2. fetch recordset in one query). It gets more complicated with pagination. With the DB you'd do LIMIT/OFFSET SQL queries, but with memcache you'd just fetch the index of PK's and then batch-get the relevant slice of the array.

I'm not sure how to neatly design this, does anyone have any suggestions?

Better yet, if you've come up against this yourself. How do you handle it?

1
10
11/10/2008 12:12:17 AM

If you're using a cache then, to get the most out of it, you have to accept that your data will always be stale to an extent, and that some portions of the data will be out of sync with each other. Trying to keep all the records up to date by maintaining a single copy is something best left to relational databases, so if this is the behaviour you need then you're probably better off with a powerful 64-bit DB server with a lot of RAM so it can perform its own internal caching.

If you can accept stale data (which you'll need to if real scalability is important) then one approach is to just throw the whole result set into the cache; don't worry about duplication. RAM is cheap. If you find your cache is getting full then just buy more RAM and/or cache servers. For example if you have a query that represents items 1-24 in a set filtered by conditions X and Y then use a cache key that contains all this information, and then when asked for that same search again just return the entire result set from the cache. You either get the full result set from the cache in one hit, or you go to the database.

The hardest thing is working out how much data can be stale, and how stale it can be without either (a) people noticing too much, or (b) breaking business requirements such as minimum update intervals.

This approach works well for read-mostly applications, particularly ones that have paged queries and/or a finite set of filter criteria for the data. It also means that your application works exactly the same with the cache on or off, just with 0% hit rate when the cache is off. It's the approach we take at blinkBox in almost all cases.

4
11/10/2008 2:01:42 AM

Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow
Icon