After 36 grueling hours of investigating the most recent server troubles, we've again reached some form of stability. However, to get there we had to disable search temporarily. We'll re-enable search as soon as we can do so without causing problems. For those interested in what was going on, here's the overview.

Over the past months, our server problems have always come down to some resource being squeezed: slow queries locking tables on the database server, a disk almost full, or CPU usage through the roof. The good thing about all of those is that they were easy to debug, because the symptoms lent themselves to particular remedies. Disk full? Get a bigger disk. CPU too high? Optimize the code to use less CPU, or buy a faster server. Slow queries? Optimize the tables, re-evaluate indexes, optimize the fetch plan, add caching, or buy a faster server. All of these problems have one thing in common: when requests are processing slowly, some resource is constrained more than usual.

This time was different. This time, no system resource was under pressure. CPU, disk I/O, queries/sec, connections, netstat, vmstat, entropy, memory: all were normal. Yet the application server was not serving requests in a timely manner. Way too many users were ending up at the "Jamming too hard" page, where requests are redirected when they time out. Moreover, it wasn't even consistent. Most requests were quick, but every few minutes request processing would seem to stop altogether. This is where it got really confusing: when request processing was slow or halted, no resource was constrained. In fact, all resources were LESS constrained during the periods of slowness.

Right. This was tricky, because when a resource isn't the problem, throwing more of that resource at the problem won't solve it. We have a very powerful application server, and it wasn't being used. We had all of these worker threads that were supposed to be serving requests, and they were all just sitting there, doing nothing.

Fast forward 24 hours. After trying countless reconfigurations, starting from scratch on multiple new servers, and analyzing all the logs and stack traces we could get our hands on, we found something promising. First, our site runs on GlassFish, a Java application server that uses the non-blocking HTTP connector Grizzly. Despite having run Grizzly with 5 acceptor threads for 6 months, we found that going down to 1 acceptor thread helped, though we still don't know why. Second, we use Apache Lucene for search, and many of the stack traces showed a large number of locked Lucene threads. This partially explained why our app wasn't using any resources: a lot of the threads were locked by Lucene, just sitting there waiting for a lock condition that may never have occurred. So we disabled Lucene.

Conclusion: the battle isn't over yet. We want to re-enable search, but we have to make sure it's behaving properly first. There is also the question of acceptor threads. With a non-blocking HTTP connector you don't need a lot of threads, and so far that single acceptor seems to be doing all right, but there doesn't seem to be any logic to why, after 6 months of running on 5 acceptors without a single acceptor-related problem, it would suddenly become one.

Thanks for hanging with us while we worked this out. Game on!
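
For anyone who wants to check their own app for the same symptom: the stock java.lang.management API exposes roughly the same information you get from a jstack thread dump, queried from inside the JVM. This is just a minimal sketch (not our actual tooling) of listing the threads stuck waiting on a lock, which is essentially the pattern we were looking for in the stack traces. The Lucene lock name in the comment is illustrative, not taken from our dumps.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Arrays;

/**
 * Minimal sketch: dump every live thread and flag the ones that are
 * BLOCKED or WAITING, along with the lock they are stuck on. This is
 * roughly what a jstack thread dump shows, queried programmatically.
 */
public class StuckThreadReport {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // true, true = also collect locked monitors and synchronizers
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            Thread.State state = info.getThreadState();
            if (state == Thread.State.BLOCKED || state == Thread.State.WAITING) {
                System.out.printf("%s is %s on %s (held by %s)%n",
                        info.getThreadName(),
                        state,
                        info.getLockName(),       // e.g. "org.apache.lucene...@1a2b3c" (illustrative)
                        info.getLockOwnerName()); // null if the thread is parked, not monitor-blocked
            }
        }
        // True deadlock cycles show up here; threads merely waiting forever
        // on a condition that never fires will not.
        long[] deadlocked = mx.findDeadlockedThreads();
        if (deadlocked != null) {
            System.out.println("Deadlocked thread ids: " + Arrays.toString(deadlocked));
        }
    }
}
```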
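
And on the "make sure search behaves before re-enabling it" front, one defensive pattern worth sketching: run each search on its own small thread pool with a deadline, so a wedged search call can hold up at most one search worker instead of a request thread. This is only a sketch under assumptions; the searchIndex() call and the SearchResult type are hypothetical stand-ins for whatever Lucene query code and result type the app actually uses, and the 2-second deadline is illustrative.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/**
 * Sketch of a guard around search: run each query on a small dedicated
 * pool and give up after a deadline, so a stuck call ties up a search
 * worker rather than an HTTP request thread.
 */
public class GuardedSearch {
    private final ExecutorService searchPool = Executors.newFixedThreadPool(4);

    public SearchResult search(final String query) {
        Future<SearchResult> pending = searchPool.submit(new Callable<SearchResult>() {
            public SearchResult call() throws Exception {
                return searchIndex(query); // hypothetical: the app's actual Lucene query code
            }
        });
        try {
            // The deadline covers both queue time and execution time.
            return pending.get(2, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            pending.cancel(true); // interrupt the worker; it may or may not oblige
            return SearchResult.unavailable();
        } catch (Exception e) {
            return SearchResult.unavailable();
        }
    }

    // Placeholders so the sketch compiles on its own.
    private SearchResult searchIndex(String query) throws Exception {
        throw new UnsupportedOperationException("wire up the real Lucene search here");
    }

    static class SearchResult {
        static SearchResult unavailable() { return new SearchResult(); }
    }
}
```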