The widespread outage of state agency computer systems continues today. The Virginia Department of Motor Vehicles has issued a statement saying it will again be unable to renew driver’s licenses at its service locations today. The agency offered no insight into when it expects to resume this rather critical service. It did make certain to say that when service is restored, you can expect seriously long lines and major delays. (At least, it implied that so heavily that it amounts to an explicit announcement.)
Over at the Virginia Information Technologies Agency (VITA) they are posting updates on the progress they’re making. As of 10:00pm Sunday, 29 August 2010, they reported that they were “continuing to work to restore services to agencies.” From their update page:
Update on storage outage
10 p.m., Sunday, August 29, 2010
Throughout the weekend, teams have been working steadily and deliberately to ensure the restoration process is complete and that all data is verified following the networked storage system failure experienced last week. We appreciate the cooperation and patience of every agency and citizen affected by this issue. Below is an update regarding the substantial progress that has been made over the weekend and that we expect will continue to be made through the night:
• Successful repair to the storage system hardware is complete, and all but three or possibly four agencies out of the 26 agency systems have been restored. Agencies continue to perform verification testing.
• Progress continues, but work is not yet complete for the three or four agencies that have some of the largest and most complex databases. These databases make the restoration process extremely time consuming. The unfortunate result is the agencies will not be able to process some customer transactions until additional testing and validation are complete.
• According to the manufacturer of the storage system, the events that led to the outage appear to be unprecedented. The manufacturer reports that the system and its underlying technology have an exemplary history of reliability, industry-leading data availability of more than 99.999% and no similar failure in one billion hours of run time.
• While most issues have been resolved over the weekend, some issues may continue as the impacted systems are tested and validated. State agencies should report any issues to the VITA Customer Care Center (VCCC) at (866) 637-8482 or email@example.com. Additional staff will be available to handle any increase in call volume. Please note: E-mail should not be used to report critical issues or outages impacting an agency. To report a critical issue, please call the VCCC directly.
The outage, they say, “resulted from a damaged networked storage system,” which directly identifies the SAN in their data center as the point of failure. That makes sense, considering the presence of EMC on the troubleshooting team. The second bullet point is especially worrisome, though, from a design standpoint. About the only way repairing a damaged database takes this incredibly long is if 1) the database’s indexes were completely blown out and had to be recreated from scratch and 2) you don’t have a replicated backup available. At all. For mission-critical systems, that’s just ridiculously poor design.
The third bullet point is, frankly, verbiage from the vendor’s sales brochure. It’s completely irrelevant that the systems in place have a billion hours of history or five nines of reliability under normal circumstances. It’s absolutely beside the point that the failure occurred in a completely unprecedented way. The fact of the matter is that the design of this system could not handle the failure of the primary storage behind its databases. A continuity-of-operations plan (COOP), if done properly, takes this kind of event into account and provides for the enterprise’s ability to keep working when the primary system fails completely. COOPs aren’t rocket science. They are well-understood endeavors that are, or should be, part of any major enterprise design where mission-critical, it’s-gotta-work-every-time functions are handled.
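To put that five-nines figure in perspective, here is a rough back-of-the-envelope sketch of the arithmetic. The function names are mine, the year is taken as 365 days, and the three-day outage length is an illustrative assumption rather than a figure from VITA’s update.

```python
# Back-of-the-envelope availability arithmetic (a sketch, not an official metric).
# Assumptions: a 365-day year; the outage length below is illustrative only.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability: float) -> float:
    """Minutes of downtime per year permitted at the given availability (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def years_of_budget_consumed(outage_minutes: float, availability: float) -> float:
    """How many years' worth of downtime allowance a single outage uses up."""
    return outage_minutes / allowed_downtime_minutes(availability)


if __name__ == "__main__":
    five_nines = 0.99999
    budget = allowed_downtime_minutes(five_nines)

    # Hypothetical outage length: three full days, chosen for illustration.
    outage = 3 * 24 * 60

    print(f"Five nines allows about {budget:.2f} minutes of downtime per year")
    print(f"A 3-day outage uses about "
          f"{years_of_budget_consumed(outage, five_nines):.0f} years of that allowance")
```

Five nines works out to a little over five minutes of downtime a year, so under these assumptions a multi-day outage burns through centuries of that allowance. The brochure number describes the hardware’s track record; it says nothing about whether the overall design can survive the one failure that actually happens.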
Gov. McDonnell needs to put together an audit of the contractors’ work (Northrop Grumman and EMC both) that makes no assumptions. He should also audit VITA to find out why this situation (the lack of a backup that could handle the load) was allowed to exist. Whether this contract should have been extended is beside the immediate point, and this situation, bad as it is, is not an indictment of the practice of privatizing functions that support government operations. Northrop Grumman needs to put a design in place that keeps the government’s operations running regardless of major crashes like this one. Finding out where they fell down is the first step. Following through and making them correct the situation, under the existing contract, is the second.