Disaster Recovery ROI Minus the Disaster
By Dian Schaffhauser
The Florida Department of Education revamped its disaster recovery strategy to strengthen it and save money. But now the department is enjoying a side benefit beyond cost savings. Even if a disaster never happens, applications can be failed over for maintenance, and users can recover files they've lost.
The 40-person staff of the Education Data Center, based in Tallahassee, has a broad mission. It provides help desk support; desktop, network, and training services; and voice operations for about 1,200 employees who work at the department. It also implements rules and regulations, participates in distribution of funding to counties, and collects data from the 67 school districts and institutions of higher education within the state of Florida for state and federal reporting. Although the center doesn't host data for school districts, it requires copies of their data, including enrollment, student information, grades, and test scores.
"We act in a lot of ways like a 68th district," said Bureau Chief Ted Duncan. "We have a staff of people. We service staff in the building. We also communicate with the other districts."
The data center also helps districts set technology guidelines--by example, not by directive. When server virtualization was just beginning to enter school district technology centers, said Duncan, the data center virtualized its own systems and became a source of information for districts planning their own virtualization efforts.
The physical setting is what would be expected: a raised floor area with a bunch of servers--322 virtual servers, according to Duncan. Most run Windows Server 2003; a few run Windows Server 2008 and Windows 2000; plus, the center has a handful of Sun Solaris machines. On the application side, the center runs software from Oracle, IBM, and SAS and has a variety of financial, Web, and more specialized programs.
Disaster Recovery Concerns
The school districts are responsible for maintaining their own backups of current data using whatever means they choose, but the center handles historic records. When somebody needs to know what happened 10 years ago, explained Duncan, "where the test scores are today versus then," the center is the one that keeps that data. "In a disaster recovery sense, we can't rely on districts to provide us with what happened years ago," he said. "We've got to be able to retain it and ensure it's not lost. For us DR means absolute ability to recover as well as short-term recovery efforts."
The major concern is hurricane damage. "The site we currently occupy is not necessarily suited for what I consider a hurricane of class two or three," Duncan said. "The building isn't rated for extremely high winds, but even if the building wasn't destroyed, it wouldn't necessarily be something that could be occupied."
Up until a couple of years ago, the center's answer to that worry was to maintain a contract with a vendor that would provide servers in the event that the primary data center couldn't operate. But the contract posed limitations: "You had an opportunity to use that equipment should it be available," said Duncan. "If you weren't first in line, and they ran out of space--if you read the fine print--you didn't have anything."
Plus, he said, the agreement, which was about $120,000 a year, would only cover a small percentage of equipment the center really needed and only for 30 days. "It cost more than we could afford to do everything we wanted," Duncan added. "We were limited as to how often we could test. We didn't have any of the replication functionality we really wanted to have. It put us in a corner."
Moving off a Promise and into a Secondary Site
The devastating hurricane seasons of 2005 and 2006 drove the center to reconsider its approach. "Looking at the numbers, we realized that building our own offsite recovery center would cost less than [the current solution] and provide more functionality," the bureau chief said. "There was a lot of freedom and flexibility in this approach. It put us in the driver's seat to reduce our costs and pick and choose our systems. We could test anytime and replicate data and systems to have them on standby mode. If an issue such as a power outage or water intrusion occurred, we could run systems at our DR site and still access the necessary data and applications to be productive. We became owners at that point, and investing in the site really gave us the opportunity for [return on investment]."
Duncan and his crew put out the word through their network to find another facility that would share its space with the center in a reciprocal arrangement. Several organizations volunteered surplus space, and Duncan accepted the offer from a community college in Gainesville, a city far enough from Tallahassee that "it's unlikely a hurricane would hit both regions."
The center purchased some new equipment and also repurposed servers that were no longer needed for production to move to the DR site. It also pays for an IP connection to the offsite location, which benefits the host institution, Duncan said.
At the time of this interview, the Gainesville site had about 10 hosting systems, which run VMware's ESX hypervisor to expand the number of logical servers available. Obviously, that capacity isn't going to be able to support replicating all of the primary center's operations. But, as Duncan pointed out, "We're not going to full level strength if we had to use the site in a true disaster. But we're also not going to have the number of people on the systems that we do now."
The goal, he explained, is to carry the systems forward in the event that the primary site isn't available for an extended period. "We wouldn't be completely down. We'd have limited capacity until that time."
Currently, the offsite location maintains copies of both data and applications for the Microsoft Exchange system and a number of file servers, as well as the primary Web site for the department.
Figuring out what else is replicated over to the secondary site is "an on-going effort," said Duncan. "We rank our applications based on priority of need. There's a host of things that go into that. It boils down to this: If you have a need to have systems recovered quickly with minimal loss of data, then it sounds like you're a good candidate to be at the disaster recovery site."
In other cases, business units have the funding to support offsite backup. As Duncan pointed out, there's a cost associated with locating any equipment at the offsite location, and that cost is billed back to the units that use the site for backup. "Some business areas have more funding than others whether they have the need or not. We also work with those units."
One of the business units taking advantage of the offsite service is the Educator Certification bureau, which maintains credentials for teachers and candidates. "They have an online application that's heavily used by district as well as by teachers and new recruits," said Duncan. "We're working through that business group's needs now."
Another kind of data that requires airtight backup: student test scores. "There are very tight timelines for reporting test scores for federal purposes, and it involves several months in preparation for that effort," Duncan said. "To be down the week or so when scores need to be resolved--that would be a crisis."
Choosing a Replication Program
Originally, the Gainesville location was set up as a warm site. It had equipment, the data link, and power. But a recovery process would have required pulling data off tape. That, said Duncan, would be the longest part of recovery. "It would take hours or days to get data off tape because we had a limited number of tape devices and a lot of tape. What we needed to do was cut this timeframe down by having the data already there."
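Duncan's point about tape can be made concrete with a rough back-of-the-envelope sketch. The function name and every number below are hypothetical, chosen only for illustration; the article does not report the center's actual tape counts, drive counts, or read times.

```python
import math


def tape_restore_hours(num_tapes, num_drives, hours_per_tape):
    """Rough wall-clock estimate for reading a set of tapes.

    Assumes (hypothetically) that every tape takes the same time to read
    and that tapes are processed in batches, one tape per drive at a time.
    """
    if num_drives < 1:
        raise ValueError("at least one tape drive is required")
    batches = math.ceil(num_tapes / num_drives)
    return batches * hours_per_tape


# Hypothetical inputs: 40 tapes, 2 drives, 1.5 hours to read each tape.
hours = tape_restore_hours(num_tapes=40, num_drives=2, hours_per_tape=1.5)
print(f"Estimated tape read time: {hours:.1f} hours")
```

Under these assumptions, the read time grows with the number of tapes and shrinks only as drives are added, which is why having the data already at the secondary site removes the longest step.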
That led to the next order of business: data and system replication. To handle the replication process--getting data and applications copied from the primary site to the secondary site--the center had a couple of requirements.
"We really wanted a lot of it to be hardware-independent," Duncan explained. "We happen to have Dell EMC [storage area networks] at both sites presently, but we may not have that in the future. And we didn't want to be tied into having this particular brand of product tied to that particular brand of hardware and be locked in. So we thought we'd look at something independent of the hardware but at the same time provided system replication."
Duncan's team also decided to take advantage of the technology to gain an extra service: the ability to recover missing files for users on the fly, a task that administrators handle daily.
They evaluated software from three vendors. Cost was a consideration as well as the size of the vendor. "We wanted the attention we could get with a smaller company," Duncan said. InMage won the work.
The InMage software runs at both sites. As data are added or modified at the primary site, those changes are duplicated immediately at the secondary site. Administrators access the data on those secondary site servers through a graphical interface.
One feature in InMage Scout stands out for Duncan, he said: an estimator that will allow the administrator to model recovery point objective (the maximum amount of recent data, measured as time before a disaster, that the organization can afford to lose) and recovery time objective (the maximum amount of time allowed to restore data or a function) based on how much bandwidth is available between the sites and the change rate of data at the source site. "It has given us a lot of information to be able to plan ahead," Duncan said. "It helps us determine the size of disk space we need at the DR site." At the same time, he added, the interface for that tool is a bit more complicated than he'd like.
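The kind of arithmetic such an estimator relies on can be sketched in a few lines. This is not InMage Scout's method or interface, which the article does not describe; the names, the peak-window model, and all numbers are assumptions made for illustration, using decimal units (1 GB = 8,000 megabits).

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of how data changes at the source site."""
    daily_change_gb: float  # data changed per day
    peak_fraction: float    # share of the daily change written during the peak
    peak_hours: float       # length of the peak window, in hours


def required_mbps(gigabytes, hours):
    """Average link rate (megabits/s) needed to ship `gigabytes` in `hours`."""
    return gigabytes * 8 * 1000 / (hours * 3600)


def backlog_drain_hours(workload, link_mbps):
    """Hours needed to drain the backlog left at the end of the peak window.

    A rough proxy for how far the DR copy can trail the source. It ignores
    off-peak writes that arrive while the backlog is draining.
    """
    peak_gb = workload.daily_change_gb * workload.peak_fraction
    sent_gb = link_mbps * workload.peak_hours * 3600 / 8 / 1000
    backlog_gb = max(0.0, peak_gb - sent_gb)
    return backlog_gb * 8 * 1000 / (link_mbps * 3600)


def retention_disk_gb(workload, retention_days, overhead=1.2):
    """Disk needed at the DR site to keep `retention_days` of changes.

    `overhead` is an assumed safety factor, not a vendor figure.
    """
    return workload.daily_change_gb * retention_days * overhead


# Hypothetical workload: 90 GB of change per day, half of it in a 4-hour peak.
w = Workload(daily_change_gb=90.0, peak_fraction=0.5, peak_hours=4.0)
print(required_mbps(45.0, 4.0))        # link rate needed to keep up at peak
print(backlog_drain_hours(w, 10))      # lag behind the source on a 10 Mbps link
print(retention_disk_gb(w, 3))         # disk for three days of changes
```

With these made-up inputs, keeping up with the peak would take a 25 Mbps link, while a 10 Mbps link would leave the secondary site roughly six hours behind at the end of the peak. That is the trade-off between bandwidth, change rate, and recovery point that Duncan describes planning around.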
Eventually, Duncan said, the offsite location will be used for standard failover purposes. "While we're doing SAN maintenance here in Tallahassee, we could operate the Exchange system in Gainesville. That's what we're working towards. It's not hard to get there--just a matter of putting some other things in place."
In the meantime, Duncan has much greater confidence that the center's data is protected. And by being able to exploit the set-up for work beyond recovering from a disaster, he added, "We're getting the value of that upfront. I'm resting easier these days."