Looks good again, sorry about that....
Sorry to say, but all inserts to our users table are queuing up and using all of our database connections right now. We're looking into why and hope to have everything back up shortly. Sorry for the inconvenience.
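For the technically curious: one common way to keep a hot write path from draining every pooled connection is to cap how many requests may do that work at once, so a burst queues briefly (or fails fast) instead of starving every other query. This is an illustrative sketch only, not how our signup path is actually written; the class names, permit count, and timeout are made up.

```java
// Illustrative only: bound the number of concurrent "insert a new user"
// requests so they can't consume the entire database connection pool.
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class BoundedUserInserts {
    // Placeholder number: leave headroom in a pool of, say, 20 connections.
    private static final Semaphore INSERT_PERMITS = new Semaphore(10);

    public interface InsertAction { void run() throws Exception; }

    public static void insertUser(InsertAction doInsert) throws Exception {
        // Wait up to 2 seconds for a permit; after that, give up rather than
        // pile yet another waiter onto an already saturated pool.
        if (!INSERT_PERMITS.tryAcquire(2, TimeUnit.SECONDS)) {
            throw new IllegalStateException("signup is busy, please retry");
        }
        try {
            doInsert.run(); // borrows a pooled connection internally
        } finally {
            INSERT_PERMITS.release();
        }
    }
}
```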
Sorry everybody, we're experiencing some slowness as queries get backed up in our database. We'll try to have this remedied quickly, but you may continue to have periods of intermittent slowness for the next few hours. We're really sorry about this and appreciate your patience.
We are excited to announce that we are a finalist in the Cnet Webware 100 competition. Thanks for all your support so far, we couldn't have done it without a great base of users like you.
We still have another round to go, so help us win by giving us a vote: http://www.cnet.com/html/ww/100/2009/poll/audio.html
Tell your friends and spread the word, JamLegend is changing the face of music gaming.
We'll see how long this lasts... we have reported the network issues to AWS, but have not heard anything back...
Looking into it again... will keep this updated.
We're back up after an instance restart. Communications Link Failures tend to be a TCP-level problem, and I've confirmed with another company running many instances on EC2 that they are seeing TCP problems across their instances. So, I think we can blame this one on Amazon.... Again, we apologize for the downtime and appreciate your patience. Game on!
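For the technically curious: a "Communications Link Failure" from MySQL Connector/J usually means the TCP socket underneath an existing connection has died, so opening a fresh connection and retrying once is often enough to ride out a transient network blip. Here's a rough sketch of that idea; it is not our actual data layer, and the JDBC URL, credentials, and query are placeholders.

```java
// Sketch: retry a query once on a brand-new connection if the first attempt
// dies with a SQLException (e.g. "Communications link failure").
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class RetryOnStaleConnection {
    private static final String URL  = "jdbc:mysql://db-host:3306/appdb"; // placeholder
    private static final String USER = "app";                             // placeholder
    private static final String PASS = "secret";                          // placeholder

    public static int countUsers() throws SQLException {
        SQLException last = null;
        for (int attempt = 0; attempt < 2; attempt++) {
            try (Connection conn = DriverManager.getConnection(URL, USER, PASS);
                 PreparedStatement ps = conn.prepareStatement("SELECT COUNT(*) FROM users");
                 ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getInt(1) : 0;
            } catch (SQLException e) {
                last = e; // a dead socket surfaces here; loop opens a fresh connection
            }
        }
        throw last; // both attempts failed, so give up and report the error
    }
}
```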
Sorry everybody, we're having some unexpected downtime due to some nasty "Communications Link Failures" when connecting to our database server from the app server. We're currently looking for the root cause and will be back as soon as possible. Thanks for your patience!
After much testing and a bit of reconfiguration, we have re-enabled search for artists, songs, and albums. It should be accurate, snappy, and (best of all) not break the site! User search is still disabled, though we hope to have a solution for that soon as well. Game on!
After 36 grueling hours of investigating the most recent server troubles, we've again reached some form of stability. However, in order to get there, we had to disable search temporarily. We'll re-enable search as soon as we can do so without causing problems.

For those interested in what was going on, here's the overview. Over the past months, our server problems had always been because some resource was being squeezed: slow queries locking tables on the database server, a disk almost full, or CPU usage running really high. The good thing about all of those is that they were easy to debug, because the symptoms lent themselves to particular remedies. Disk full? Get a bigger disk. CPU too high? Optimize the code to use less CPU, or buy a faster server. Slow queries? Optimize the tables, re-evaluate indexes, optimize the fetch plan, cache, or buy a faster server. All of these problems have one thing in common: when requests are processing slowly, some resource is constrained more than usual.

This time was different. This time, no system resource was under pressure. CPU, disk I/O, queries/sec, connections, netstat, vmstat, entropy, memory: all were normal. Yet the application server was not serving requests in a timely manner. Way too many users ended up at the "Jamming too hard" page, where requests are redirected when they time out. Moreover, it wasn't even consistent. Most requests would be quick, but every few minutes request processing would seem to stop altogether.

This is where it got really confusing: when request processing was slow or halted, no resource was constrained. In fact, all resources were LESS constrained during the periods of slowness. Right. This was tricky because when a resource isn't the problem, throwing more of that resource at the problem won't solve it. We have a very powerful application server, and it wasn't being used. We had all of these worker threads that were supposed to be serving requests, and they were all just sitting there, doing nothing.

Fast forward 24 hours. After trying countless reconfigurations, starting from scratch on multiple new servers, and analyzing all the logs and stack traces we could get our hands on, we found something promising. First, our site runs on Glassfish, a Java application server that uses the non-blocking HTTP connector Grizzly. Despite running Grizzly with 5 acceptor threads for 6 months now, we found that going down to 1 acceptor thread helped, though we still don't know why. Second, we use Apache Lucene for search, and many of the stack traces showed a large number of locked Lucene threads. This partially explained why our app wasn't using any resources: a lot of the threads were locked by Lucene and just sitting there, waiting for a lock condition that may never have occurred. So, we disabled Lucene.

Conclusion: the battle isn't over yet. We want to re-enable search, but we have to make sure it's behaving properly first. There is also the question of acceptor threads. With a non-blocking HTTP connector you don't need a lot of threads, and so far that 1 acceptor seems to be doing alright, but there doesn't seem to be any logic to why, after 6 months of running on 5 acceptors without a single acceptor-related problem, they would suddenly become one. Thanks for hanging with us while we worked this out. Game on!
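P.S. For anyone who wants to try this kind of digging themselves: the tell-tale sign in our stack traces was a pile of worker threads sitting in BLOCKED or WAITING states on the same locks. Here's a small, self-contained sketch (not the exact tooling we used; we mostly read raw thread dumps by hand) that uses the JDK's ThreadMXBean to report threads stuck waiting on a lock in the running JVM.

```java
// Sketch: programmatically spot stuck worker threads instead of eyeballing
// jstack output. This inspects the JVM it runs in; adapt via JMX for a
// remote application server.
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class StuckThreadReport {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // true, true => include locked monitors and ownable synchronizers
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            Thread.State state = info.getThreadState();
            if (state == Thread.State.BLOCKED || state == Thread.State.WAITING) {
                System.out.printf("%s is %s on %s (lock owner: %s)%n",
                        info.getThreadName(),
                        state,
                        info.getLockName(),        // the monitor/condition being waited on
                        info.getLockOwnerName());  // null if no thread currently owns it
            }
        }
    }
}
```

If most of your request-handling threads show up in that list while CPU and I/O sit idle, the bottleneck is lock contention rather than a constrained resource, which is exactly the situation described above.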