Wednesday, April 4, 2007

The value of O(1) recovery strategies...

SQL Server 2005 introduced Service Broker, a message queue platform for decoupling database services from one another. It's pretty much just standard message-based Service Oriented Architecture type fare except for one thing: it's persistent.

Not only are the messages persistent but the conversations are too. A conversation contains metadata describing a communication between services. By default, conversations are persisted in the database with an indefinite lifetime. Because they are persistent, you must be very careful to close conversations so they can be cleaned up. Somehow that slipped past us... It also happened that tempdb was filling up at a prodigious rate with internal objects associated with Service Broker. Uh oh.
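
For reference, here's a minimal sketch of the kind of target-side receive loop we should have had, with the queue name and application message type as hypothetical placeholders. The essential part is that END CONVERSATION gets called when the end-of-dialog or error message arrives, so the endpoint can be reclaimed:

    DECLARE @handle UNIQUEIDENTIFIER;
    DECLARE @message_type SYSNAME;
    DECLARE @body VARBINARY(MAX);

    BEGIN TRANSACTION;

    WAITFOR (
        RECEIVE TOP (1)
            @handle       = conversation_handle,
            @message_type = message_type_name,
            @body         = message_body
        FROM dbo.MyTargetQueue              -- hypothetical queue name
    ), TIMEOUT 5000;

    IF @handle IS NOT NULL
    BEGIN
        IF @message_type = N'//Acme/RequestMessage'   -- hypothetical application type
        BEGIN
            -- ... process @body here ...
            PRINT 'processing';
        END
        ELSE IF @message_type IN (N'http://schemas.microsoft.com/SQL/ServiceBroker/EndDialog',
                                  N'http://schemas.microsoft.com/SQL/ServiceBroker/Error')
        BEGIN
            -- Close our half of the dialog so the endpoint can be cleaned up.
            END CONVERSATION @handle;
        END
    END

    COMMIT TRANSACTION;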

Moreover, there's a non-trivial cost associated with maintaining these conversations, particularly if they pile up. Some people have recommended recycling or pooling conversations to improve performance. (Somehow having to resort to such hackery just seems broken to me for a critical piece of infrastructure like Service Broker.)

So we managed to pile up over 20 million conversations. Also, due to some strange regression involving the loss of the database's master key (no idea how that happened), we had over 12 million undeliverable messages in the transmission queue. You know it's bad when running SELECT COUNT(*) FROM sys.transmission_queue WITH (NOLOCK) or SELECT COUNT(*) FROM sys.conversation_endpoints WITH (NOLOCK) takes up to half an hour and returns an 8-digit number.

As we were running SQL Server 2005 Service Pack 1, there were only two viable options: write a loop to call END CONVERSATION ... WITH CLEANUP for each stale conversation, or back up and restore the DB without its queues. As you might expect, the loop was very slow, even when handling batches of 100 handles at a time and running multiple parallel instances side by side. You can't just TRUNCATE your queues or anything. It would have taken a day or two to clean up.
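
For the curious, the cleanup loop was roughly along these lines (in practice we pulled handles in batches and ran several copies in parallel); it just walks sys.conversation_endpoints and force-drops each handle:

    DECLARE @handle UNIQUEIDENTIFIER;

    SELECT TOP (1) @handle = conversation_handle
    FROM sys.conversation_endpoints;

    WHILE @handle IS NOT NULL
    BEGIN
        -- WITH CLEANUP drops the endpoint and its messages immediately,
        -- without going through the normal end-dialog handshake.
        END CONVERSATION @handle WITH CLEANUP;

        SET @handle = NULL;
        SELECT TOP (1) @handle = conversation_handle
        FROM sys.conversation_endpoints;
    END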

It turns out that Service Pack 2 includes a very noteworthy enhancement (though I couldn't find mention of it in the public release announcements). To completely blow away all Service Broker conversation state in 1 second, do this: ALTER DATABASE ... SET NEW_BROKER WITH ROLLBACK IMMEDIATE. It was orders of magnitude faster to upgrade to SP2, wipe the state with SET NEW_BROKER, and then carry on! (Remark: you may need to do some extra work to repair Service Broker routes after issuing this command since it will produce a new broker instance id.)
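
In concrete terms (the database name here is a placeholder), the reset amounts to:

    -- SP2 or later: discard ALL Service Broker state in the database at once.
    -- ROLLBACK IMMEDIATE kicks out open transactions so the ALTER can proceed.
    ALTER DATABASE MyAppDb
        SET NEW_BROKER
        WITH ROLLBACK IMMEDIATE;

    -- The database gets a new broker instance id, which is why routes that
    -- reference the old one need to be repaired afterwards.
    SELECT service_broker_guid
    FROM sys.databases
    WHERE name = N'MyAppDb';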

Ahhh... the benefits of constant-time recovery strategies!

I made a little checklist of things to watch for with Service Broker deployments:

  • 1. Is the DB configured correctly? Does it have Service Broker enabled? Does it have a master key? (See the configuration check sketched after this list.)
  • 2. Do conversations have maximum lifetimes? Do messages expire? Do both the initiator and the receiver call END CONVERSATION each time? Is the reclamation robust in the face of runtime errors (use try/catch)? Apply defensive programming techniques to protect the core functions and ensure resources are reclaimed in a timely fashion. (See the dialog-lifetime sketch after this list.)
  • 3. Will we receive a notification if the DB starts to blow up? Install appropriate monitors to watch the size of sys.transmission_queue, the size of sys.conversation_groups, the size of the app's queues, the size of tempdb, CPU %, disk queue length, the rate of queue message production and consumption, etc. (See the monitoring queries after this list.)
  • 4. Is there a reasonable limit to the size a queue can be expected to attain? Is the nominal rate of queue message consumption expected to be at least two or three times the rate of message production? If the consumer gets too far behind, can the producer drop traffic? If the producer gets too far ahead, can it throttle itself? If the messages are no longer timely, do they need to be queued or processed at all?
  • 5. Do the queues need to live on the same database as the data or can they be located elsewhere? Can the queues be partitioned among multiple small DBs to make them easier to repair in the event of an emergency (i.e. because we could just drop and rebuild them, or easily isolate them)?
  • 6. Can the application tolerate faults in the queues or abnormally slow response times? Tip: when an error occurs, insert an exponentially increasing delay into polling loops and other periodic tasks. Also try to avoid spamming the logs with repeated messages about the same kind of failure. If the database seems to be dead, log a message to that effect and then shut up. Retry operations silently until some change in state is observed, such as the database coming back to life. Try to provide some sensible behavior when other required components are offline.
  • 7. Are conversation group locks and transactions being used effectively to serialize the processing of messages when required? (Another engineer developed a component that was broken in this respect, so it was not guaranteed to process its messages in FIFO order, which broke the application logic.) See the conversation-group sketch after this list.
  • 8. How long should messages or conversations be kept before they should be thrown away? Is it possible they will pile up and cause problems? Consider writing a scheduled job to periodically flush stale data just in case.
  • 9. Can the processing of the request be performed entirely within the DB using automatic stored-procedure activation? If not, does your application know how to poll queues in parallel or retrieve multiple messages at once to offset the added latency? (See the activation sketch after this list.)
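
For item 1, a quick sanity check might look like this (database name is a placeholder):

    SELECT is_broker_enabled, service_broker_guid
    FROM sys.databases
    WHERE name = N'MyAppDb';

    -- Enable Service Broker if it is off.
    ALTER DATABASE MyAppDb SET ENABLE_BROKER WITH ROLLBACK IMMEDIATE;

    -- Dialog security needs a database master key; create one if it is missing.
    IF NOT EXISTS (SELECT 1 FROM sys.symmetric_keys
                   WHERE name = N'##MS_DatabaseMasterKey##')
        CREATE MASTER KEY ENCRYPTION BY PASSWORD = N'<strong password here>';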
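
For item 2, here's a sketch of an initiator that bounds the dialog's lifetime and reclaims the handle even when the send fails (the service, contract, and message type names are made up):

    DECLARE @handle UNIQUEIDENTIFIER;

    BEGIN DIALOG CONVERSATION @handle
        FROM SERVICE [//Acme/InitiatorService]
        TO SERVICE   N'//Acme/TargetService'
        ON CONTRACT  [//Acme/RequestContract]
        WITH LIFETIME = 600,       -- the dialog errors out after 10 minutes
             ENCRYPTION = OFF;

    BEGIN TRY
        SEND ON CONVERSATION @handle
            MESSAGE TYPE [//Acme/RequestMessage] (N'<request/>');
        -- The initiator still calls END CONVERSATION later, when the target's
        -- end-of-dialog message arrives back on the initiator's own queue.
    END TRY
    BEGIN CATCH
        -- Reclaim the conversation even if the SEND blows up.
        IF @handle IS NOT NULL
            END CONVERSATION @handle
                WITH ERROR = 1 DESCRIPTION = N'send failed';
    END CATCH;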
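
For item 3, the raw counts can be polled directly from a monitoring job; for example (the application queue name is a placeholder):

    SELECT COUNT(*) AS undeliverable_messages
    FROM sys.transmission_queue WITH (NOLOCK);

    SELECT COUNT(*) AS open_conversations
    FROM sys.conversation_endpoints WITH (NOLOCK);

    -- Queues can be queried like tables for monitoring purposes.
    SELECT COUNT(*) AS queued_messages
    FROM dbo.MyTargetQueue;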
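
For item 7, the serialization guarantee comes from holding the conversation group lock for the duration of a transaction, roughly like this (queue name is a placeholder):

    DECLARE @group UNIQUEIDENTIFIER;

    BEGIN TRANSACTION;

    -- Locks the next conversation group that has messages, until commit/rollback.
    WAITFOR (
        GET CONVERSATION GROUP @group FROM dbo.MyTargetQueue
    ), TIMEOUT 5000;

    IF @group IS NOT NULL
    BEGIN
        -- Drain this group's messages in order while the group lock is held.
        RECEIVE conversation_handle, message_type_name, message_body
        FROM dbo.MyTargetQueue
        WHERE conversation_group_id = @group;

        -- ... process the result set in order, calling END CONVERSATION as needed ...
    END

    COMMIT TRANSACTION;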
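
For item 9, internal activation attaches a stored procedure to the queue so the processing happens inside the database as messages arrive (the names here are made up):

    CREATE QUEUE dbo.MyTargetQueue
        WITH STATUS = ON,
             ACTIVATION (
                 STATUS = ON,
                 PROCEDURE_NAME = dbo.usp_ProcessMyTargetQueue,  -- hypothetical proc
                 MAX_QUEUE_READERS = 4,     -- up to 4 activated readers in parallel
                 EXECUTE AS OWNER
             );

    -- If polling from outside instead, pull messages in batches to offset latency:
    -- RECEIVE TOP (50) ... FROM dbo.MyTargetQueue;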