September 5th, 2009
Apparently some hardware combinations have problems when JavaScript and CSS content is served with certain cache-control headers. After a 10-month exercise in sleuthing to find what was causing a delay of up to 45 seconds for some site visitors, we finally nailed it down to the cache-control headers.
The Problem
The first reports, from one or two users, were that loading a page sometimes took approximately 45 seconds, while hundreds of other users had no delay at all. Some investigation eventually narrowed this down to users with specific router and computer combinations, and only on one ISP. Needless to say, this was completely baffling. Tests with Firebug indicated that the delay occurred before the browser even started to download the page - so it looked more and more like a networking issue. Out of several thousand users, though, there were perhaps 10-20 who had the problem - making it very difficult to diagnose.
Changing almost anything in the mix would solve it - someone would switch from Mac to PC or swap out their router, and the problem would disappear. It was stumping countless networking and web sysadmins worldwide.
How We Found It
In short, totally by accident. We were doing unrelated optimizations to our caching algorithms and, as a result, elected to change the cache-control headers used for the JS and CSS files. As soon as we changed the headers, the problem disappeared.
The Culprit
Initially we were using an "Expires:" header for cache control. Several articles and concept papers we had read indicated this was the right way to go. But for the type of control we wanted, it was actually easier to deal with the "Cache-Control:" header instead. Something in the Expires header was confusing something somewhere, because as soon as we removed the header the problem vanished.
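One reason the "Cache-Control:" header can be easier to manage: "Expires:" carries an absolute HTTP date that has to be computed and formatted in GMT for every response, while "Cache-Control: max-age" is just a relative number of seconds. Here is a minimal sketch of the difference in Python. The function names and the one-hour lifetime are illustrative assumptions, not the code the site actually ran.

```python
from email.utils import formatdate
from typing import Tuple


def expires_header(now_epoch: float, lifetime_seconds: int) -> Tuple[str, str]:
    """Build an Expires header: an absolute date, always expressed in GMT."""
    # formatdate(..., usegmt=True) produces an HTTP-style date ending in "GMT".
    return ("Expires", formatdate(now_epoch + lifetime_seconds, usegmt=True))


def cache_control_header(lifetime_seconds: int) -> Tuple[str, str]:
    """Build a Cache-Control header: a relative lifetime, no clock involved."""
    return ("Cache-Control", f"max-age={lifetime_seconds}")


if __name__ == "__main__":
    import time

    one_hour = 60 * 60  # hypothetical lifetime, for illustration only
    print("%s: %s" % expires_header(time.time(), one_hour))
    print("%s: %s" % cache_control_header(one_hour))
```

The relative form does not depend on the clocks of the server, the client, or any box in between agreeing with each other, which may or may not be related to the trouble described here.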
Best Guess
Some routers and servers along the path try to cache web content as it passes through. They might be changing headers or getting confused by specific ones. Our Expires header was specified in GMT, as it should be, so perhaps it was a time-zone issue. Whatever caused them to get confused about the state of the data, they would get stuck and leave the end user waiting for a network timeout (about 45 seconds) before passing along the data (perhaps even downloading it again to refresh the cache).
The Solution
Just removing the Expires header did it. We replaced it with a "Cache-Control" header as follows:
"Cache-Control: max-age=604800, must-revalidate"
This might seem a bit long for dynamically generated CSS and JS to be cached (604800 seconds is one week), but we have other mechanisms in place, in addition to the document headers, that change the URL of the linked files if the content changes.
I know this one was a bit technical - but I've posted it here so that anyone else struggling to figure out why their pages load in a bizarre slow pattern like this can get it sorted.
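For anyone wanting to try the same combination, here is a rough Python sketch of a week-long "Cache-Control" header paired with a URL that changes whenever the file content changes. The post does not say how the URLs are actually changed, so the content-hash query string, the function names, and the example path below are all assumptions made for illustration.

```python
import hashlib
from typing import Dict

ONE_WEEK_SECONDS = 7 * 24 * 60 * 60  # 604800, the max-age value quoted above


def static_headers() -> Dict[str, str]:
    """Headers for JS/CSS responses: Cache-Control only, no Expires header."""
    return {"Cache-Control": f"max-age={ONE_WEEK_SECONDS}, must-revalidate"}


def versioned_url(path: str, content: bytes) -> str:
    """Append a short content hash so the URL changes when the content does.

    The "?v=" parameter name is a hypothetical choice, not taken from the post.
    """
    digest = hashlib.sha1(content).hexdigest()[:8]
    return f"{path}?v={digest}"


if __name__ == "__main__":
    css = b"body { margin: 0; }"
    print(versioned_url("/css/site.css", css))  # hypothetical path
    print(static_headers())
```

Because the URL itself changes with the content, a browser or intermediate cache holding the old copy simply never gets asked for it again, which is what makes a long max-age safe for generated files.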