COMPUTERWORLD
EXECUTIVE BRIEFINGS: Executive Guides for Strategic Decision-Making
Disaster Recovery & High-Availability IT
Strategies for keeping your systems working 24/7 and protecting corporate data.

CONTENTS
INTRODUCTION: Continuity, Availability and Security
RECOVERY STRATEGIES: Rising From Disaster; Five Classic Mistakes; Realistic Testing; Synchronizing With Suppliers; E-mail Recovery
SAFE & SECURE: Storage Security; Long-Distance Data Replication; Advances in Tape Backup; Backing Up the Edge
EMERGING TECHNOLOGIES: Grid Storage; MAID Storage

STRATEGIC INSIGHTS FROM THE EDITORS OF COMPUTERWORLD
INTRODUCTION
Continuity, Availability and Security

Disaster recovery and data protection are like parenting — neither job is ever really finished. This report takes a wide-ranging look at high-availability IT and data protection in an effort to help you meet this never-ending — and interrelated — set of responsibilities. Obviously, the goal is to provide the IT infrastructure and data that your business needs to operate (and better yet, thrive). That means avoiding infrastructure downtime; ensuring business continuity when there is unavoidable downtime (such as in disasters); and protecting corporate data (with measures such as backup and long-distance data replication).

Consider this: Some months after the Sept. 11, 2001, terrorist attacks, the CIO of a large Wall Street law firm — located only blocks from the collapsed World Trade Center towers — talked about the tremendous outpouring of sympathy and concern from hundreds of the attorneys’ clients in the 24 hours after the disaster. Then Day 2 dawned, and the story changed. The clients who called wanted reassurances that their files were safe and that business would promptly get back on track. It was a reminder that even a major disaster has a short shelf life as an excuse in the business world. What matters most is the speed and effectiveness of recovery.
High Availability
IBM has gotten the message. The company recently launched an initiative focused on ensuring that the iSeries and its other server lines are highly available, an area of increasing interest among users who can’t afford system downtime because of round-the-clock global supply chain demands. “In many cases, our clients may be trying to do high availability but don’t have all the pieces put together to make it a truly resilient set of infrastructures,” says John Reed, the IBM executive who was recently picked to lead the development of the company’s High Availability Design Center. He says the center could lead to new products, services and business partnerships. Part of the plan involves assembling best-practices guidance and tools, according to Reed. IBM will conduct system assessments, and help users define and develop high-availability architectures and run application benchmarks, he says.

Gerald Lake, a programmer/analyst at Sovereign Specialty Chemicals Inc.’s Buffalo operations, says increasing demands from outside auditors for IT redundancy have prompted his company to improve system availability.
Causes of Human Error
IDC says 25% of IT downtime occurrences are caused by human operator error. The typical problems include the following:
• Complex or inadequate operational processes
• Lack of training
• Poor organizational structure and communications
• Overextended staff
As part of a server consolidation project, Sovereign converted an iSeries machine located at a different facility from the one that houses its primary server into a backup system, Lake says. But Lake is eyeing IBM’s plan warily. “The way IBM charges so heavily for everything, I think a lot of people are going to continue to do everything on their own,” he says.
Growing Importance
IT availability has always been an important IT function, but the stakes are even higher in an environment where IT is being used for real-time analysis and business transactions, the IT infrastructure continues to get more complex and regulators are keeping a close eye on business continuity. Some extremely IT-dependent companies have driven toward five-nines’ availability, or about five minutes of downtime per year. Most IT organizations operate at three- to four-nines’ availability, or about four to eight hours of downtime per year.

You should do your own downtime and vulnerability assessments, of course, but researchers at IDC say that, on average, IT organizations find that downtime has the following causes:
• 45% of downtime occurrences are caused by application-related failure, such as application software, database software, Web servers or middleware.
• 25% of downtime occurrences are caused by human operator error.
• 30% of downtime occurrences are caused by a hardware component failure, involving the network, a server or a desktop PC.

Preventing unplanned downtime requires a combination of remedies, including more training for IT staff and pushing vendors for technologies that are better at identifying IT service problems that can disrupt the business. “IT must recognize that downtime is caused by several factors, controllable and uncontrollable,” IDC says. “Technology process, human error, and external factors such as natural disasters, terrorist attacks and power failures present IT with difficult downtime scenarios.”

In fact, we really need to redefine the concept of disaster, says Mike Mulholland, co-founder of Evergreen Assurance Inc. in Annapolis, Md. It’s really any event that blocks access to corporate data and applications. By that definition, denial-of-service attacks and planned downtime are also IT disasters, he says.

Rules of Thumb
1. REDEFINE DISASTER. For businesses today, a disaster is any event that blocks access to corporate data and applications. It’s not just hurricanes and terrorism. It includes denial-of-service attacks and planned downtime, for example.
2. PRIORITIZE APPLICATIONS. Every company has specific applications that are critical to keeping the business running. Typical candidates for most businesses include e-mail and ERP systems. By prioritizing your applications, you can allocate your IT budget appropriately and protect what is most important to your company instead of spreading that budget across noncritical applications.
3. WEB-ENABLE APPLICATIONS. Whenever possible, mission-critical applications should be Web-enabled so employees can access them anywhere, anytime. In the event of a disaster that prevents staff from entering the office, Web-enabled applications allow employees, customers and partners to stay connected.
4. MOVE DATA 100 MILES AWAY. In preparation for regional disasters, keep your data at least 100 miles from your primary site. You should also replicate your data continuously to maintain complete data integrity.
5. AUTOMATE THE RECOVERY PROCESS. Whenever possible, automate disaster recovery processes to reduce bottlenecks and human error.
SOURCE: MIKE MULHOLLAND, EVERGREEN ASSURANCE INC., ANNAPOLIS, MD.
Three Tiers of Protection
Improving from four-nines’ availability to five-nines’ availability can be extremely expensive, and sometimes not worth the effort, IDC notes. Are you overspending on disaster recovery? It seems like a ridiculous question. Newspaper headlines throw more risks — and regulators throw more requirements — in your face almost every day. But it’s possible to overspend on disaster recovery, especially if you listen to every vendor saying you must do x, y and z to comply with the Sarbanes-Oxley Act. Tim DeLisle, managing principal at Corigelan LLC, a disaster recovery consultancy in Chicago, says the way to avoid overspending is to establish three tiers of disaster recovery based on business requirements. It begins with the CIO asking business managers which few applications are truly critical and require recovery within 24 hours to keep the business afloat. You don’t have to mirror everything. The second tier of applications, which require recovery in 48 to 72 hours, may need only inexpensive tape backup, while the third tier may need nothing at all, DeLisle says. All that the business executives and regulators really require is that you take prudent steps for business continuity. You don’t have to bankrupt the company.
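To put the “nines” shorthand in concrete terms, annual downtime is simply the year multiplied by the fraction of time you are unavailable. A quick back-of-the-envelope sketch (standard arithmetic, not figures taken from IDC):

```python
# Annual downtime implied by an availability level ("the nines").
MINUTES_PER_YEAR = 365 * 24 * 60

for label, availability in [("three nines", 0.999),
                            ("four nines", 0.9999),
                            ("five nines", 0.99999)]:
    downtime_min = MINUTES_PER_YEAR * (1 - availability)
    print(f"{label}: about {downtime_min:.0f} minutes "
          f"({downtime_min / 60:.1f} hours) of downtime per year")
```

Five nines works out to roughly five minutes a year, which is why each additional nine is so much more expensive than the last.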
User Tips
In this report we provide plenty of cost-conscious tips and insider advice from IT managers who have faced disaster and recovered.
Causes of Hardware Failure
IDC says that 30% of downtime occurrences are caused by a hardware component failure. This is typically caused by one of the following problems:
• Physical changes
• Incorrect configurations
• Overworked circuitry
• Lack of redundancy
True Costs
Be sure to analyze the true costs of IT downtime by including the following metrics:
• Revenue impact
• Reputation impact
• Financial performance (short- and long-term)
• Staffing costs
• Process impact
• Overtime
• Travel
• Stock-market impact
SOURCE: IDC, FRAMINGHAM, MASS., NOVEMBER 2004
Their experiences raise questions you should be able to answer. For starters:
• How strong is your disaster recovery documentation? What if the head of sales is the one who has to turn on the systems in the data center? “We fashion our document so anyone in the business should be able to restart an application,” says Elbert Lane, a lead software developer at Gap Inc. in earthquake-prone San Francisco.
• Which applications are really the most important ones to restore first? At most companies, it’s probably e-mail, not the SAP system or the Oracle database.
• How robust and ready are the plans at your suppliers, your outsourcers, your business partners? Who’s checking on them?
• What are your most critical access issues? Getting to the data, the systems or the people?
Disaster recovery is one test that IT can ace — without big budgets or expensive consultants. It’s a matter of commonsense planning, attention to process and doing your disaster homework.
RECOVERY STRATEGIES
Rising From Disaster

One key to keeping your business on its feet in a disaster is anticipating the sometimes cascading effects a catastrophe can have on your IT operation. Take Miami-Dade County, for example. When a hurricane hit southern Florida in 1992, the county’s data center lost power. Diesel generators had overheated when well water ran out because high winds had broken water mains and lowered the water table. IT managers later had air-cooled generators installed. One of the problems with disaster recovery, experts say, is that although most companies have plans for common scenarios — weather-related emergencies, headquarters lockouts and massive power outages — those plans aren’t regularly tested or communicated to end users. In fact, in a recent survey of 283 Computerworld readers, 81% of the respondents said their organizations have disaster recovery plans. But 71% of the respondents at companies with plans said the plans hadn’t been exercised in 2003. It takes forethought to avoid a business shutdown during a disaster. Experts and users agree that there are steps you can take to increase your chances of coming through the most common disasters unscathed.
Weather-Related Emergencies
“If you look at why facilities fail [during weather disasters], it’s all pretty predictable. They call it an act of God, and I call it an act of stupidity,” says Ken Brill, executive director of The Uptime Institute in Santa Fe, N.M. Hurricanes threaten Miami-Dade County’s data center every year from June through November, yet IT managers still struggle with getting everyone to understand the importance of disaster planning. “The challenge we always have is to make sure the staff is completely involved and we have participation,” says Ruben Lopez, director of the enterprise technology services department for the county. Miami-Dade County gives itself a 56-hour window to test its disaster recovery plan each year by cutting over to its alternate data center and restoring data. It uses the time to find deficiencies and later corrects them.
“Business continuity and disaster recovery preparedness is all about figuring out what your deficiencies are and how you’re going to fix them. It’s not about how to get an A+ on paper,” says Joe Torres, disaster recovery coordinator for Miami-Dade County. He points out that it’s not the people he’s testing during a disaster recovery exercise but the plan — “because you can’t depend on the people being available.” “You’re going to give them a book with instructions, and they need to be able to follow that,” Torres says. One step Miami-Dade has taken in that direction is to consider call-tree software that could help employees contact key managers in an emergency.

Walter Hatten, senior vice president and technical services manager at Hancock Bank in Gulfport, Miss., has focused on consolidating his server farm and creating a redundant communications network for an area of the country that gets hit or brushed by a hurricane every three and a half years. The 100-branch bank, with headquarters on the Gulf of Mexico, is consolidating 500 servers onto a Linux-based mainframe to reduce recovery time in a disaster. “Just the sheer magnitude of rebuilding 500 servers puts us at risk for not being able to do it quickly enough,” says Hatten, who chose Linux for its open standard and scalability. He says the mainframe will offer greater speed for recovery of data, reducing the amount of time it would take to restore data from days to hours.
Earthquake Law Requires IT Response
A California law that mandates earthquake-proof hospitals is sparking massive investments in IT infrastructure upgrades by health care companies in the state, starting with the hardening of data centers but also including the deployment of faster networks, wireless systems and other new technologies. For example, Sacramento-based Sutter Health expects to spend the better part of $1 billion on technology upgrades at its 26 hospitals over the next 10 years as a result of the law, CIO John Hummel says. As the not-for-profit company rebuilds some of its facilities to comply with the law, it plans to invest in new bandwidth and storage capabilities in an effort to meet processing demands well into the future. Mark Zielanzinski, CIO at El Camino Hospital in Mountain View, says his facility is building a data center and demolishing its existing one as part of an overhaul of its entire campus to meet the law’s requirements. The new data center is due to be fully operational by March 2005. In addition, the data center reconstruction prompted a server consolidation and upgrade project, Zielanzinski says. El Camino Hospital is consolidating more than 150 smaller servers onto two Unisys Corp. ES 7000 systems, each of which can support up to 32 Intel processors. A matching set of servers is being installed at a new disaster recovery site 120 miles away, in more geologically stable Sacramento. The law, known as the California Facilities Seismic Safety Act, was passed in 1994 after the Northridge earthquake struck north of Los Angeles and caused $3 billion in damage to 23 hospitals. But the measure is just now becoming an urgent matter for many health care companies, which must comply by 2008 — or 2013 if extensions are granted. The California HealthCare Association estimates that it will cost $24 billion to earthquake-proof or rebuild a total of about 2,700 hospital buildings throughout the state. IT costs could account for $2.4 billion to $3.6 billion of that, says Gerard Nussbaum, a consultant at Kurt Salmon Associates Inc. in Atlanta.

Headquarters Lockouts
Maria Herrera is chief technology officer at Patton Boggs LLP, a Washington-based law firm with 400 attorneys specializing in international trade law. Because of the firm’s proximity to the U.S. Capitol building, one constant concern is a building lockout brought on by terrorist threats, she says. Herrera has set up duplicate operating environments in several remote offices and has contracted with two disaster recovery vendors: SunGard Data Systems Inc. in Wayne, Pa., for server recovery and workstation services, and AmeriVault Corp. in Waltham, Mass., for data backup. AmeriVault recently installed its CentralControl interface on desktops and an agent on each of Patton Boggs’ servers. After completing an initial full backup of all data, AmeriVault now performs daily incremental backups of deltas, or changes, to disaster recovery centers in Waltham and Philadelphia. In an emergency, data restores can be performed remotely, even from home, by administrators using a point-and-click function on a Web portal provided by AmeriVault, or data can be shipped on tape for large restores. “Every month or couple of months, we access several documents and download them from AmeriVault to test the system,” says Herrera. During full testing, she spends 16 hours recovering full data sets. “We’re able to restore everything within the firm in about 10 hours,” she says.
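The “daily incremental backups of deltas” described above come down to copying only what has changed since the last run. A minimal sketch of that idea (illustrative only; this is not AmeriVault’s CentralControl product, and the file paths and manifest format are assumptions):

```python
import hashlib
import json
from pathlib import Path

def file_digest(path: Path) -> str:
    """SHA-256 of a file's contents, used to detect changes."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_deltas(source: Path, manifest_path: Path) -> list[Path]:
    """Return files that are new or changed since the last backup run."""
    old = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    new, changed = {}, []
    for path in source.rglob("*"):
        if path.is_file():
            d = file_digest(path)
            new[str(path)] = d
            if old.get(str(path)) != d:
                changed.append(path)  # only these deltas get copied off-site
    manifest_path.write_text(json.dumps(new, indent=2))
    return changed

# Example: deltas = find_deltas(Path("/data"), Path("backup-manifest.json"))
```

After the initial full backup, each subsequent run ships only the changed list, which is why the daily transfers stay small.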
Herrera also suggests involving all IT personnel in the disaster recovery testing process, because in an emergency, you never know who might be available to help. She has trained employees in all four satellite offices around the country on disaster recovery procedures. SunGard also has several facilities where IT personnel and lawyers can meet to continue work in the event of a headquarters lockout, Herrera says.

Officials at Mizuho Capital Markets Corp., a subsidiary of the world’s second-largest financial services firm, Mizuho Financial Group Inc. in Tokyo, say that some of the most effective disaster recovery tools are the simplest. For example, when a protest kept employees from entering the firm’s Times Square headquarters late last year, IT managers passed out laminated business cards with a directory of managers’ home phone numbers.

Doug Lilly, a senior telecommunications technologist at the Delaware Department of Technology and Information, says his agency has three data centers that support about 20,000 state employees. The department uses EMC Corp.’s Symmetrix Remote Data Facility to replicate data among the data centers. It also uses backup software from Oceanport, N.J.-based CommVault Systems Inc. as a central management tool. “If this site were bombed . . . we’d have servers running to replace them, but we’d still have to restore data from tapes,” Lilly says. “CommVault’s software transfers between 60GB and 65GB of data per hour. It would be a few hours before we got people up online.” Lilly’s IT team also keeps a copy of disaster recovery procedures at home. “Team leaders notify everyone, and we carry cell phones and BlackBerries that are on redundant networks,” he says. “It’s a pretty unified messaging platform . . . that ties data, voice, fax and video into one application. They can get hold of us anytime, anywhere.”
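Lilly’s “few hours” estimate is simple arithmetic: restore time is data volume divided by transfer rate. A rough sketch using the throughput he quotes (the 200GB data set is a made-up example, not Delaware’s actual volume):

```python
def restore_hours(data_gb: float, rate_gb_per_hour: float) -> float:
    """Hours needed to restore a data set at a given transfer rate."""
    return data_gb / rate_gb_per_hour

# At the quoted 60GB to 65GB per hour, a hypothetical 200GB restore takes:
print(f"{restore_hours(200, 65):.1f} to {restore_hours(200, 60):.1f} hours")  # 3.1 to 3.3 hours
```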
Practical Tips
• Choose vendors that are proactive and don’t require prodding to upgrade or test your disaster recovery plan.
• Don’t test people; test your disaster recovery plan. People come and go. Make the plan easy to follow and use.
• After a disaster, don’t count on employees being willing to fly to alternate work sites.
• Distribute key disaster recovery personnel across many geographic locations.
• Turn disaster recovery data centers into active work sites.
• Disaster recovery plans are living, breathing things. Keep them up to date and make sure employees are well versed in them.
• Seek vendors with plenty of longevity and geographically dispersed offices for disaster recovery.
• Make sure portals to your outsourcing vendor are dedicated or have enough bandwidth to handle multiple companies seeking fast restores.
• Make sure that not just your vendor but you understand how to back up and restore systems.
• Verify that backup tapes can restore data.
• Train and involve all IT personnel in the disaster recovery process.
Massive Power Outages
Edward Koplin, an engineer at Jack Dale Associates PC, an engineering firm in Baltimore, says a lack of disaster testing is the No. 1 cause of data center failures during a blackout.
Six Tips for Continuity Planning
Here’s a strategy for the IT department to consider when determining its role in closing the business continuity gap:
1. Make a priority list of applications to optimize the recovery process, taking into account the resources required, cash flow and time frames. Breaking down business continuity in this way gives IT a reasonable framework that affords both IT and management the opportunity to work together to determine the appropriate amount of effort to spend on closing the gap to an acceptable level.
2. Address all physical and logical vulnerabilities to reduce the probability of disaster and ensure information integrity, including building access, physical security and firewalls. IT must show management how specific vulnerability zones could affect the financial side of the business. To truly understand vulnerabilities and risks, the IT department must lead the charge in finding the answer to this question: How far off are senior management’s expectations from the reality of IT availability?
3. Validate IT availability service levels, including recovery-time objectives, recovery-point objectives, system performance, information access and delivery, network performance and monitoring, and security, to enable more effective business-unit continuity planning. This allows the IT department to determine the business’s current capability in terms of compute utility restoration and lost data, in comparison to the business’s perceived or desired baseline and the operational, logistical and financial impact of the business’s current availability vs. its desired availability.
4. Coordinate, plan, document and practice within the business units the synchronization and reproduction of lost data/transactions and manual re-entry of data, taking into consideration the organization’s needs. For example, the business continuity gap will be larger when continuous availability is absent, which would be the case for financial services companies. If only best efforts are required, as might be the case for some manufacturing companies, the gap may be the smallest. Optimum points of availability could be one of the following (see the sketch after this list):
   • Best efforts (could take days or longer)
   • Traditional recovery (hours to days)
   • Transaction protection (minutes to hours)
   • High availability (minutes)
   • Continuous availability (always up and running with minimal information loss)
5. Validate access to transportable information among all business units, including remote/alternate facilities and return to home, a WAN/LAN, information exchange via e-mail and the Web site. While validating the information and the access to that information, organizations must not overlook the secure state of the infrastructure.
6. Implement a more effective management process to support the business continuity program, paying special attention to cross-training and staff rotation, program currency and accuracy, and distribution and access.
SOURCE: MICHAEL CROY, DIRECTOR OF BUSINESS CONTINUITY SOLUTIONS, FORSYTHE SOLUTIONS GROUP INC., SKOKIE, ILL.
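The five “optimum points of availability” in tip 4 line up naturally with recovery-time objectives. A minimal sketch of that mapping (the hour thresholds are illustrative assumptions, not figures from Forsythe Solutions Group):

```python
def availability_tier(rto_hours: float) -> str:
    """Map a recovery-time objective (in hours) to one of the tiers in tip 4."""
    if rto_hours == 0:
        return "Continuous availability"
    if rto_hours <= 1:
        return "High availability"
    if rto_hours <= 8:
        return "Transaction protection"
    if rto_hours <= 72:
        return "Traditional recovery"
    return "Best efforts"

for app, rto in [("trading system", 0), ("e-mail", 1), ("ERP", 24), ("reporting", 120)]:
    print(f"{app}: {availability_tier(rto)}")
```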
Koplin suggests that companies test their diesel generators often and at full load for as long as they’re expected to be in use during a blackout. The Uptime Institute’s Brill adds to that advice: Always prepare for a blackout with at least two more generators than needed, and test them by literally pulling the plug. “I would test it for as long as I expected it to work under load. I’d do that at least every two or three years. And I would run it in the summer,” Brill says.

Jim Rittas, a security administrator responsible for networking at Mizuho, says the company can now perform full data restores after blackouts or other disasters in an hour instead of two days because it now mirrors its data to a New Jersey office that’s also an active work site. “The other thing we did was diversify our Internet connections. Internet connections now flow in and out of New York and New Jersey, where we only had one in New York before,” Rittas says.
Critical Success Factors
VIEW disaster recovery not as a tactical IT project but as a strategic IT asset. Both disaster recovery and business continuity planning are essentially event-triggered, nonstrategic insurance policies that offer no reasonable expectation of a return on investment.
TURN portions of your disaster recovery sites from a cost center into a profit center by splitting your business operations between your headquarters and the alternate data center. This eliminates the need for staff relocation and backup sites.
INTEGRATE disaster recovery IT assets and personnel into operations budgets across geographically dispersed data centers to blur the lines between what is disaster recovery and what is operational expense. In all cases, disaster recovery must be viewed as a mission-critical expense.
BUILD operational efficiency objectives into your disaster recovery strategies. Current IT wisdom dictates that IT departments use part of the capital freed up by driving down operational costs to invest in IT projects geared toward building strategic advantage. Instead, invest the resulting savings in IT resilience across critical lines of business.
SHARE the cost of disaster recovery with partners and fellow institutions. Disaster recovery isn’t a matter of competitive differentiation or advancement, but a matter of survival. It may make sense for companies to pool their assets and personnel to provide resilience capabilities for interconnected systems or collaborative technologies such as payments or check processing.
SOURCE: TOWERGROUP, NEEDHAM, MASS.
Needham, Mass.-based research firm TowerGroup recommends turning parts of disaster recovery or business continuity data centers into profit centers by going with an active/active operations model. Traditionally, companies have set up an active primary data center and unmanned backup site. An active/active model eliminates the need for IT staffers to relocate in a disaster because they’re permanently stationed at the disaster recovery site, which is also used to run active business applications. Integrating disaster recovery IT assets and personnel into operations budgets across geographically dispersed data centers will also help blur the line between disaster recovery and operations spending. It’s best to have a complete copy of your data in an alternate site at all times, “not just some of it,” says Wayne Schletter, associate director of global technology at Mizuho Capital Markets. “You don’t want to be piecing things together after something happens. You just want to be ready to go.”
RECOVERY STRATEGIES
Five Classic Mistakes

Disaster recovery is an unpleasant task. And that makes it a low-priority project in almost all companies, says Scott Lundstrom, an analyst at AMR Research Inc. “There are no users screaming over business continuity,” he says. “So given the firefighting nature of most IT organizations, [disaster recovery] never gets the resources it deserves.” Because disaster recovery takes a back seat to other IT projects, mistakes are bound to happen. We asked IT managers and other experts what’s most likely to be forgotten or overlooked in disaster recovery planning. Here are the five classics.
MISTAKE 1: Failing to do your homework. IT groups often neglect to ask users and line-of-business executives which applications they need most. This leads to faulty assumptions about disaster recovery priorities. In particular, IT tends to assume that heavy-duty enterprise applications should be restored first. In reality, the most needed applications may be much more basic — e-mail and scheduling tools such as Microsoft Outlook, for example. How do you find out? Ask the users. “The business itself needs a plan in case operations are disrupted,” says Elbert Lane, a lead software developer at San Francisco-based retailer Gap Inc. and a 20-year veteran of disaster planning at several companies. “They’ll need procedures for doing paperwork, etc., so the question is, How would they recover? That’s not just an IT issue, but a business [issue].” The lesson: IT constantly hears the term mission-critical used in reference to CRM and ERP software. But to find out which applications the users really want restored first, simply ask them.

MISTAKE 2: Thinking it’s purely an IT issue. In a crisis, the performance of the IT staff may be the least of a company’s worries. “A common assumption is that disaster recovery and business continuity are synonymous,” says Don O’Connor, CIO at Southern California Water Co., a utility based in San Dimas. “They’re not.” Even underprepared IT organizations have done some thinking about what to do when disaster strikes. But can the same be said of other groups? “In my experience, IT can respond relatively quickly,” O’Connor says. “The part that’s missing is the users.” The lesson: Company officers need to understand that rebooting systems and recovering data is just one part of the problem. Disaster recovery plans need to include line-of-business managers and end users who, in a crisis, will run the business in the midst of adversity. “Too often, continuity is something we task IT with,” Lundstrom says. “It’s really a business issue.”

MISTAKE 3: Fighting the last war. If, as the saying goes, generals are always preparing to fight the last war, too many enterprises spend their disaster recovery budgets and energy preparing for the most recent catastrophic event. While understandable, this is self-defeating; disasters are, by their nature, well-nigh impossible to predict. Recent history offers a compelling example. The Sept. 11, 2001, terrorist attacks on the World Trade Center devastated many New York-based financial services firms.
Three Tips From: Dorian Cougias, CEO of Network Frontiers LLC in San Francisco and author of The Backup Book: Disaster Recovery From Desktop to Data Center (Schaser-Vartan Books, 2003).
TIP NO. 1: Figure out how to recover from “stupid-user tricks,” such as the user who accidentally drags an empty file directory on top of a very important file directory and wipes it out, or the janitor who disregards the “Don’t touch this switch” sign. Ask your help desk staffers to list the problems they’ve dealt with in the past 12 months.
TIP NO. 2: Have a disaster recovery plan for your e-mail system, the most-used system on the network. Consider a product like the Emergency Messaging System from MessageOne Inc. in Austin.
TIP NO. 3: Make sure each employee’s daily, weekly or monthly work procedures include disaster recovery practices, just like a sailor’s duties include checking the boat’s rigging and pumps before every excursion.
Many wished they’d had nearby backup facilities, and they proceeded to build such facilities at great expense across the river in Jersey City, N.J. But Manhattan’s next major business-continuity crisis — the August 2003 blackout — took out electricity in Jersey City as well. The lesson: While it’s sensible to consider certain broad crisis categories (terrorist or hacker attacks, earthquakes, fires and so on), don’t think you can anticipate future events. Plan not for specific crises, but rather for their effects. The Gap had servers located in the World Trade Center on Sept. 11, Lane says, but “we had set them up to fail-over to backups located in the South.”
MISTAKE 4: Overlooking the people. This is another lesson from Sept. 11: Top-notch backup equipment helps only if somebody is able to use it. “Some businesses had recovery data centers in Lower Manhattan,” says Carl Claunch, an analyst at Gartner Inc. However, he says, immediately following the collapse of the World Trade Center towers, “police wouldn’t let people in. The equipment was fine, but it just sat there unused.” This can happen if a building is quarantined, an elevator stuck or a major road closed. The other part of this gotcha is the expertise of those who finally do access backup equipment. Too many companies — especially those that fudge their recovery exercises — count on IT heroics to pull them out of a crisis. However, as the Gap’s Lane says, “you never know if key personnel will be back.” The lesson: This is where strong documentation comes in. “We fashion our document so anyone in the business should be able to restart an application,” Lane says. “You should be able to have somebody from the mail room start everything up.”

MISTAKE 5: Conducting phony-baloney practice drills. “Sure, companies do testing. But because full tests are so resource-intensive, they’re scheduled in advance,” Claunch says. The result: IT workers, driven by the natural desire to ace a test, cheat. “They prepare. They collect tools, review procedures,” he says. “Then, when a real disaster hits, blooey.” This is a sticky problem for IT organizations stretched thin even before disaster planning is factored into their workloads. Lane says practices at the Gap are planned in advance. “We are a retailer; we need to support our stores” around the clock, he says. The lesson: There is no easy answer here. Everybody concedes that surprise disaster tests are more effective, but performing one in a round-the-clock, e-business environment is a massive undertaking. Claunch suggests surprise tests of one IT subgroup at a time, leaving the rest of the staff to run operations. And some businesses use auditors to make sure IT workers don’t lean on prepared information.

Sweat the Small Stuff
When a crisis hits, IT staffers seeking to maintain or restore operations are often tripped up by the most basic items. Disaster planning analysts and experts say you need to think about things like the following:
ACCESS. Who has keys or access cards for the building? How do you get in if the electrical grid is shut down? What local public-safety officials (police, fire or town officials) can you turn to for help?
COMMUNICATION. In a crisis, IT staffers may need to contact corporate officers whose names they don’t even know. An emergency “telephone tree” that includes mobile numbers is a must (see the sketch after this list).
LIGHT. At home, we’ve all felt stupid when a blackout hit and our flashlight batteries were dead. The same goes for the workplace — after all, backup generators fail, too.
PASSWORDS. Security is good, but in an emergency, even low-level staffers may need extraordinary systems access. Organizations need to put a crisis-only override in place.
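The emergency “telephone tree” in the COMMUNICATION item is just a cascade: each person reached is responsible for calling the people listed under them. A minimal sketch of that structure (the names are placeholders, and the notification step is simulated with print):

```python
# A call tree: each contact knows who they are responsible for reaching next.
call_tree = {
    "CIO": ["IT director", "Facilities manager"],
    "IT director": ["Network lead", "Help desk lead"],
    "Facilities manager": [],
    "Network lead": [],
    "Help desk lead": [],
}

def cascade(start: str) -> None:
    """Walk the tree breadth-first so everyone is contacted exactly once."""
    to_call, reached = [start], set()
    while to_call:
        person = to_call.pop(0)
        if person in reached:
            continue
        reached.add(person)
        print(f"Calling {person} (mobile number from the emergency contact card)")
        to_call.extend(call_tree.get(person, []))

cascade("CIO")
```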
RECOVERY STRATEGIES
Realistic Testing

If you want to really test your disaster recovery plan, you have to get out from behind your desk and step out into the real world. Because in the real world, the backup site lost your tapes, your emergency phone numbers are out of date, and you forgot to order Chinese food for the folks working around the clock at your off-site data center. “Unless it’s tested, it’s just a document,” says Joyce Repsher, product manager for business continuity services at Electronic Data Systems Corp. How often should you test? Several experts suggest real-world testing of an organization’s most critical systems at least once a year. In the wake of Sept. 11 and with new regulations holding executives responsible for keeping corporate data secure, organizations are doing more testing than they did 10 years ago, says Repsher. An exclusive Computerworld online survey of 224 IT managers supports that assertion, indicating that 71% had tested their disaster recovery plans in the past year. Desktop disaster recovery testing involves going through a checklist of who should do what in case of a disaster. Such walk-throughs are a necessary first step and can help you catch changes such as a new version of an application that will trigger other changes in the plan. They can also identify the most important applications, says Repsher, “before moving to the expense of a more realistic recovery test.” Companies do desktop tests at different intervals. Fluor Fernald Inc., which is handling the cleanup of a government nuclear site in Fernald, Ohio, does both desktop and physical tests of its disaster response plans every three years “or anytime there’s a significant change in our hardware configuration,” says Jan Arnett, manager of systems and administration at the division of engineering giant Fluor Corp.
What’s Critical?
Determining which systems need a live test is also critical. Fluor Fernald schedules live tests on only about 25 of its most critical applications and then tests only one server running a representative sample of these applications, says Arnett. “We feel if we can bring one server up, we can bring 10 servers up,” he says, especially since the company uses standard Intel-based servers and networking equipment. The most common form of live testing is parallel testing, says Todd Pekats, national director of storage alliances at IT services provider CompuCom Systems Inc. in Dallas. Parallel testing recovers a separate set of critical applications at a disaster recovery site without interrupting the flow of regular business. Costly and rarely done, the most realistic test is a full switch of critical systems during working hours to standby equipment, which Pekats says is appropriate only for the most critical applications.
Ditch the Script
A disaster drill isn’t much good if everyone knows what’s coming. But too many organizations script disaster tests weeks ahead of time, ship special backup files to an off-site recovery center and even make hotel reservations for the recovery staff, says John Jackson, vice president of business resilience and continuity services at IBM in Chicago. That eliminates messy but all-too-likely problems such as losing backup tapes in transit or discovering that a convention has booked all the hotel rooms in town. He advises telling the recovery staff, “We just had a disaster. . . . You can’t take anything out of the building. . . . You have to rely on the disaster recovery plan and what’s in the off-site recovery center.” That makes the test more “exciting,” he acknowledges, but it also makes it a lot more useful.
Businesses that are growing or changing quickly should test their disaster recovery plans more often, says Al Decker, executive director of security and privacy services at EDS. He cites one firm that has grown eightfold since 1999, when its disaster plan called for the recovery of critical systems in 24 hours. Today, just mounting the tapes required for those systems would take four to 10 days, he says.
Proper Balance
Deciding how realistic to make the test “is a balance between the amount of protection you want” and the cost in money, staff time and disruption, says Repsher. As an organization’s disaster recovery program matures, the tests of its recovery plans should become more challenging, adds Dan Bailey, senior manager at risk consulting firm Protiviti Inc. in Dallas. While the more realistic exercises provide more lessons about what needs improvement, he says, an organization just starting out with a rudimentary plan probably can’t handle a very challenging drill.

Never assume that everything will go as planned. That includes anything from having enough food or desks at a recovery site to having up-to-date contact numbers. Communications problems are common, but they’re easily prevented by having every staff member place a test call to everyone on their contact list, says Kevin Chenoweth, a disaster recovery administrator at Vanderbilt University Medical Center in Nashville. Also, never assume that the data on your backup tapes is current or that your recovery hardware can handle your production databases. Arnett found subtle differences in the drivers and network configuration cards on his replacement servers that forced him to load an older version of his Oracle database software to recover his data. Chenoweth or his staffers review each test with the affected business units and develop specific plans (with timelines) for fixing problems. Finally, Chenoweth says, thank everyone for their help, especially if the test kept them away from home. “If you’ve got a good relationship, they’re more likely to be responsive” to the firm’s disaster recovery needs, he says.
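“Never assume that the data on your backup tapes is current” translates directly into a verification step: after a test restore, compare what came back against checksums recorded at backup time. A minimal sketch of such a check (illustrative only; the checksum file format is an assumption):

```python
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    """Checksum of a restored file, for comparison against the backup record."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify_restore(restored_dir: Path, checksum_file: Path) -> list[str]:
    """Return files that are missing or differ from the recorded checksums."""
    problems = []
    for line in checksum_file.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        candidate = restored_dir / name
        if not candidate.exists():
            problems.append(f"missing: {name}")
        elif sha256(candidate) != expected:
            problems.append(f"corrupt or stale: {name}")
    return problems

# Example: print(verify_restore(Path("/restore"), Path("backup.sha256")))
```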
RECOVERY STRATEGIES
Synchronizing With Suppliers
Business-to-business dependencies create the opportunity for great benefits. But if a disaster strikes any company in the supply chain, the risks to all are equally great. At Ryder System Inc., customers routinely vet their supply chain partners to ensure that they meet minimum standards for robustness and security. “If they can’t make the cut, we won’t do business with them,” says Chuck Lounsbury, senior vice president of sales and marketing at the Miami-based transportation, logistics and supply chain management services company. “We don’t want to jeopardize the capabilities of all the other companies involved.” “It is a matter of working together,” adds Richard Arns, executive director of the Chicago Research & Planning Group, which spun off a post-Sept. 11 effort called the Security Board. A key lesson from the terrorist attacks, he says, is that organizations should enlarge their circle of preparedness. But that message may not be getting through. An American Management Association survey conducted last year showed a sharp increase in the number of companies with crisis plans, drills or simulations. Yet only about a third of those companies reported having ongoing and backup emergency communications plans with their suppliers. To make their operations truly disaster-resistant, IT managers should determine if business partners are ready to handle a disaster, experts say. Then they must work closely with those suppliers to achieve parity in their disaster recovery efforts and get their recovery times in sync. Here are some more tips:
TIP: Tighten SLA Language. A good starting point, says Roberta J. Witty, an analyst at Gartner Inc., is the language of the service-level agreement. SLAs are normally applied to IT providers but also offer a framework for talking about critical IT support from partners. But that’s only the beginning. Witty says IT managers should conduct an internal inventory assessment to determine which points outside the enterprise are critical to a company’s functions. They should then extend the process to suppliers.
“Have a conversation with them about what the risks are within their own supply chain,” she says. “You are outsourcing functions; maybe they are, too.” It may be worthwhile to line up backup suppliers for your outsourced services so you have more redundancy — and encourage partners to do the same, says Witty. In any case, at each step in the supply chain — including with your internal operations, your outsourcers, your suppliers and their outsourcers and suppliers — there needs to be a credible recovery plan, she says, “or their disaster will become yours.” And nothing beats testing. Whenever possible, it’s a good idea to include partners in your own tests and vice versa, Witty says.
TIP: Test ERP Connections. Jim Grogan, vice president of alliances at SunGard Data Systems Inc. in Wayne, Pa., says he’s seeing more clients embrace the ideal of the real-time enterprise. And enterprise applications, such as ERP software, that support that vision almost invariably have links outside the organization. “We encourage [clients] to do an information-availability study of their trading partners and suppliers, even if they have to foot the bill,” he says. Most worrisome to Grogan is the fact that many organizations have entrusted key business processes to software — to the point that unaided humans would have difficulty handling those functions on their own. “Even a few years ago, you could count on someone being able to get on the phone and fix things,” he says. Likewise, Grogan notes, phone communication used to be planners’ first priority. But not anymore. “Now, everyone tells us that the first thing they need to get back in business with partners is e-mail,” he says.
Gaining Trust
Ensuring disaster resistance along the supply chain is labor-intensive, but there is hope that it might get easier. Don Houser, a security architect at Nationwide Mutual Insurance Co. in Columbus, Ohio, has developed a technology called XOTA, or Extensible Organization Trust Assertion. Using XOTA, partners in a business relationship set standards for that relationship — for example, the format and security requirements for message transmissions. That information is then embedded in a digital certificate. “An organization would exchange that with their business partner, and it can then be graded for compliance with that organization’s standards or with contractual language in real time,” he says. The goal is to make it easier to set up and maintain communications and relationships among different organizations while meeting each organization’s specific needs.
Rich Mogull, an analyst at Gartner Inc. in Stamford, Conn., says XOTA seems to be a good start. However, Mogull says Gartner advocates a more automated approach to the problem. “What is really needed is a capability to automatically detect and analyze the compliance level of anyone with whom you are connecting,” he says. So far, however, the industry has taken little action in that direction, Mogull says. Nationwide wants to bring XOTA to market through a consortium, according to Houser. “We already have several major companies signed up, and we have about 90 on the sidelines waiting to come aboard,” he says. Those interested include consulting, banking, financial and health care organizations as well as “pure-play software companies,” Houser says. “We are currently building a proof of concept, and we hope to spin up the consortium by the third quarter,” adds Houser.
At a granular level, Grogan says SunGard always looks for potential single points of failure within a supply chain, such as a server, switch or cable upon which many operations depend. Companies also need to coordinate their recovery plans because for many applications, particularly ERP, “systems are connected in real time with others that may have different recovery times or different recovery points, which can complicate efforts to get back to business,” he says.
TIP: Secure Partner Communications. It’s also important to look at the security of business partner communications because glitches in that area could precipitate a disaster. Nick Brigman, vice president of strategy at RedSiren Inc., an IT security management firm in Pittsburgh, says it’s important to understand whether you’re connected to partners via a private network, a virtual private network or the Internet. One of the best ways to enhance the security of that communication is to assign “least-privileged” accounts to partners that define the nature and even the volume of expected traffic, says Brigman. This not only eliminates potentially spurious communications, but it also provides a basis for detecting abnormal activities, he says.

Finally, John Jackson, vice president of IBM Business Continuity and Recovery Services, says business-to-business dependencies make it critical for companies to “get together and do a business impact analysis to determine how their individual recovery times could be made to mesh.” “In some cases, companies find that they are doing far more than their partners, and their partners either have to catch up, or they need to consider spending less, since they won’t really get much benefit,” he says. Communication infrastructure is the key, Jackson adds. Partners, especially smaller ones, may not have the knowledge needed to ensure robust and resilient performance. And they may just need help to get there.
RECOVERY STRATEGIES

E-mail Recovery

The increasingly business-critical nature of e-mail is prompting some companies to take backup measures specifically designed to retain access to their e-mail systems in the event of a disaster. Reinsurance company Max Re Ltd. in Hamilton, Bermuda, had taken such measures before Hurricane Fabian hit the island. And online business publication Forbes.com in New York was prepared when the massive blackout struck the Northeast. But the two companies took dramatically different approaches to the problem.

MAX RE LTD.
Max Re took a bare-bones, software-centric route, using the Emergency Messaging System (EMS) backup application from Austin-based MessageOne Inc. The software enabled the company to set up backup e-mail capabilities for 52 users in only a few hours, says Kevin Lohan, vice president of technology and systems at Max Re. “Fabian came in at quite an inopportune moment,” Lohan says, noting that the company was still several months away from fully plotting its disaster recovery strategy. Max Re is setting up a disaster recovery system in its Dublin offices for redundancy. But even when it’s completed, the system will take 12 to 24 hours to go live in an emergency, Lohan says. While Max Re’s other critical business systems might be able to wait that long, e-mail has to be back up much faster, Lohan says. MessageOne EMS is Linux-based software that backs up users’ address books, contact lists and other critical information to provide instant access in an emergency if the main e-mail system goes down, says Mike Rosenfelt, a MessageOne spokesman. That data is hosted on MessageOne’s servers and can be accessed from any Internet-connected PC. The service doesn’t back up old e-mail, cutting expenses for storage and bandwidth. “It’s a life-support system until you can go to recovery,” says Rosenfelt. Pricing for EMS runs between 80 cents and $8 per user per month, depending on the number of users.

FORBES.COM
Forbes.com, meanwhile, uses Microsoft Exchange backup services from Evergreen Assurance Inc. in Annapolis, Md. Its hardware-based approach provides full backup of all old messages, as well as address books and contact lists. Evergreen uses dedicated servers that activate in 15 minutes following a service outage. These redundant e-mail servers reside in an Evergreen data center. “Our customers are demanding that they have access to both their [old e-mail] and their [current e-mail] applications,” says company founder Michael Mulholland. Michael Smith, chief technology officer at Forbes.com, says his 85 users had e-mail capability almost immediately after the blackout hit. Evergreen’s fees begin at about $5,000 monthly for 250 users and can be up to $30,000 monthly for 5,000 users.

Overkill?
But both approaches may be overkill for some users, said Mike Gotta, an analyst at Meta Group Inc. in Pleasanton, Calif. “I’m not denying that e-mail is critical communication, but so is the telephone,” Gotta says. For marketing companies or communications businesses, where “the bloodstream is information,” there’s a reasonable need, he says. But for manufacturing companies, getting factories up and running quickly is likely to be more critical, Gotta says. “I’m just not sure that I’m in the camp that I can only conduct my business if I can get my e-mail back up,” he says.
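Put side by side, the two pricing models quoted above are easiest to compare per user. A quick sketch of the arithmetic (the head counts are simply the figures named in the article, not a recommendation):

```python
def per_user_monthly(monthly_fee: float, users: int) -> float:
    """Monthly cost per mailbox for a flat-fee service."""
    return monthly_fee / users

# Evergreen's quoted entry price: about $5,000 per month for 250 users.
print(f"Evergreen entry tier: ${per_user_monthly(5000, 250):.2f} per user per month")    # $20.00
# Evergreen's quoted top end: about $30,000 per month for 5,000 users.
print(f"Evergreen top tier:   ${per_user_monthly(30000, 5000):.2f} per user per month")  # $6.00
# MessageOne EMS is quoted directly at $0.80 to $8.00 per user per month.
```

The spread reflects what is being backed up: EMS skips old mail entirely, while Evergreen keeps full message history on dedicated standby servers.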
SAFE & SECURE
Storage Security

Storage systems weren’t designed with security in mind. They started out as direct-attached, so if the host was secure, the storage was too. That’s all changed. Fibre Channel storage networks often have multiple switches and IP gateways, allowing access from a myriad of points. Compound this with poor work by systems administrators, new data security laws and recent high-profile cases of consumer information theft, and the need for improved storage security becomes urgent. But if systems administrators can’t follow the basic steps of network storage security, better tools may not help. That’s part of the reason why encryption is becoming the most widely adopted solution to the problem.

Misconfiguring logical unit number (LUN) zones and not maintaining network-access lists are two major causes of unauthorized access to storage networks, says Nancy Marrone, an analyst at The Enterprise Storage Group Inc. in Milford, Mass. Another common mistake administrators make is not bothering to change the device default password, according to Dennis Martin, an analyst at Evaluator Group Inc. in Greenwood Village, Colo. Beyond the human failings, Fibre Channel itself isn’t a secure protocol. Through it, application servers can see every device on a storage-area network (SAN). Switch zoning and LUN masking on a storage array can restrict access to devices on a SAN. Zoning segregates a network node either by hard wiring at the switch port or by creating access lists around device world-wide names (WWN). Masking hides devices on a SAN from application servers either through software code residing on each device or through intelligent storage controllers that permit only certain LUNs to be seen by a host’s operating system. According to Marrone, managing access through LUN masking works on smaller SANs but becomes cumbersome on large SANs because of the extensive configuration and maintenance.
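Zoning and LUN masking are, in effect, two layers of access control keyed on world-wide names. A toy model of the checks described above (purely conceptual; real fabrics enforce this in switch and array firmware, and the WWNs and LUN numbers here are made up):

```python
# Zones: sets of WWNs that are allowed to talk to each other across the fabric.
zones = {
    "payroll_zone": {"wwn:host-payroll", "wwn:array-port-1"},
    "mail_zone":    {"wwn:host-mail",    "wwn:array-port-2"},
}

# LUN masking on the array: which LUNs each host WWN is permitted to see.
lun_masks = {
    "wwn:host-payroll": {0, 1},
    "wwn:host-mail":    {2},
}

def can_access(host_wwn: str, target_wwn: str, lun: int) -> bool:
    """A host reaches a LUN only if zoning and masking both permit it."""
    zoned = any(host_wwn in members and target_wwn in members
                for members in zones.values())
    masked_in = lun in lun_masks.get(host_wwn, set())
    return zoned and masked_in

print(can_access("wwn:host-mail", "wwn:array-port-2", 2))  # True
print(can_access("wwn:host-mail", "wwn:array-port-1", 0))  # False: wrong zone and wrong mask
```

Marrone’s point about scale is visible even in this toy: every new host or array port means touching both tables, which is what becomes cumbersome on a large SAN.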
Encryption Makes Gains
Given these human errors and technology shortfalls, some users are turning to encryption. Michelle Butler, technical program manager for the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign, manages three SANs — two with 60TB of capacity and one with 40TB. For her, security means that data needs to be encrypted, both when it’s in transit and stored on a disk — or “at rest.” “There are some tools out there, but there are also some big gaping holes being left that so far don’t seem that interesting to hackers,” Butler says.
Top of Mind
IT managers rate the following storage topics as “extremely important” in the near future:
• Disaster recovery/business continuity: 54%
• Storage security: 52%
• Storage-area networks: 31%
• Regulatory compliance: 27%
BASE: 91 IT MANAGERS; MULTIPLE RESPONSES ALLOWED. SOURCE: COMPUTERWORLD’S IT LEADER RESEARCH PANEL, 2004
Nevertheless, the NCSA plans to buy Brocade Communications Systems Inc.’s newly released Secure Fabric operating system and Fabric Manager software. Butler says the products will allow her storage administrators to create network management access-control lists using public-key infrastructure (PKI) technology and device access-control lists based on WWN. The software also offers authentication and encryption for control information or management data on SAN devices. Examples of the necessity of encryption abound. For instance, in January, a disk drive with 176,000 insurance policies was stolen from Guelph, Ontario-based Co-operators Life Insurance Co.
California Law
In response to events like this, California adopted a new law. SB 1386 requires any company that stores information about California residents to publicly divulge any breach of security affecting that data within 48 hours. In addition, Sen. Dianne Feinstein’s (D-Calif.) office is developing a federal version of the bill — called the Database Security Breach Notification Act — that would provide similar protections to all U.S. residents. The only companies exempt from the California law and the proposed national legislation are those that encrypt data at rest.
SAN Security Glossary
FABRIC: The hardware and software that connect a network of storage devices to one another, to servers and eventually to clients.
LUN MASKING: Using the Logical Unit Number (LUN) of a storage device, or a portion of a storage device, to determine which storage resources a server or host may see.
PORT: A physical connection on a storage switch that links that switch to storage devices, servers or other switches. Many SAN security techniques limit which devices a port can connect to or the manner in which it connects to those devices.
SPOOFING: Impersonating the identity of an individual (such as a storage administrator) or of a device (such as a storage switch) to gain unauthorized access to a storage resource.
TRUSTED SWITCH: A switch within a storage network that uses a digital certificate, key or other mechanism to prove its identity.
VSAN: A virtual SAN, which functions like a zone but uses a different layer of the Fibre Channel protocol to enforce which devices in the fabric can speak to other devices.
WORLD WIDE NAME: A unique numeric identifier for a device on a storage network, such as a disk array or a switch.
ZONE: A collection of Fibre Channel device ports that are permitted to communicate with each other via a Fibre Channel fabric.
Several newly released products address concerns posed
by the recent legislation. Mississauga, Ontario-based Kasten Chase Applied Research Ltd. announced its Assurency Secure Networked Storage platform, agent-based software that provides a stripped-down PKI-based authentication and encryption for networked storage devices. The company estimates that a complete encryption system is generally 7% to 10% of the cost of a SAN.
Other Vendors
Another company getting noticed is start-up appliance vendor Decru Inc. in Redwood City, Calif., which uses proprietary software to encrypt data on the storage array but uses the IPsec protocol on the application server to encrypt data while in transit. Its DataFort security appliances work for both SANs and network-attached storage (NAS). Vormetric Inc. in Santa Clara, Calif., sells an appliance that supports SANs as well as both NAS and direct-attached storage devices and can be used to do high-speed encryption of data at the file-system level, on a file-by-file basis. And NeoScale Systems Inc. in Milpitas, Calif., sells a product called CryptoStor FC that provides wire-speed, policy-based encryption for SAN and NAS data. Although most currently available storage security technologies offer encryption, analysts say it’s important for users to make sure that the data is encrypted both at rest and while being transmitted across networks.
SAN-ity Check Cathy Gilbert at American Electric Power Inc. isn’t too worried about security on her 2-year-old storage-area network (SAN). There are “very few people in our building that would actually know what to do” to reconfigure her Fibre Channel SAN — assuming they could reach it on its inter-
nal private network, which can be administered only from a locked room, says Gilbert, a senior IT architect at the Columbus, Ohio, energy producer. She uses the built-in configuration capabilities of her EMC Corp. Symmetrix storage arrays, McData Corp. Intrepid Directors and McData Enterprise Fabric Connectivity Manager 6.0 software to control which servers can access which storage devices. But protecting SANs will become more difficult, and more important, as customers begin deploying SANs more widely, to enable the money-saving consolidation of servers, applications and data. And as more SAN traffic migrates from the relatively unknown Fibre Channel protocol to IP, it will become vulnerable to the same well-known attacks used against the Internet and corporate networks.
Future Threats
SAN security will become a larger problem as companies cut costs by forcing different departments to share storage networks, says Wayne Lam, vice president at FalconStor. In most companies, IT managers from one department don’t have the authority to manage data from other departments. But companies often need to commingle data from multiple departments on a single SAN to drive down their storage costs. “You can’t afford to have five islands of SANs,” he says. The need for more granular control over who can manage which portions of a SAN is one of the features customers ask for most frequently, says Kamy Kavianian, a product marketing director at Brocade Communications Systems Inc. in San Jose. He says customers also need the following:
• Stronger authentication to verify the identities of both administrators and devices.
• The ability to use a wider variety of methods, such as Telnet and Simple Network Management Protocol, to manage SANs.
• Encryption to protect SAN data from eavesdropping if it crosses public networks such as the Internet.
Security Tips
MAINTAIN current network-access lists.
GET up to speed on the port-zoning method your vendor uses (they’re not all the same).
CHANGE default passwords on new hardware.
DESIGN the topology of a SAN with network security administrators.
ENCRYPT data both at rest and in transit.
CONSIDER carefully how you dispose of old hard drives and backup tapes.
Identity Spoofing Authentication — the ability to prove the identity of a person or device — becomes crucial as more users are able to tap into SANs and as data from more sources is commingled in corporate storage networks. Spoofing the identity of a person, or even of a device such as a host bus adapter, is a real threat, Lam says. Spoofing the identity of a device should be impossible because manufacturers give each device a unique WWN that identifies it to other parts of the storage network, says Lam. But manufacturers deliberately let customers change the WWN through an upgrade to the firmware in the device, he says. That makes it easier, for example, for a customer to replace a switch in a storage network without having to up-
date every device that communicates with that switch’s new WWN. Many vendors are planning key-based authentication to create “trusted” administrators with the authority to manage only a subset, such as a zone, of a corporate SAN. This might be overkill in small environments such as Alloy’s, but Tajudeen says, “I could see it being an issue if you have a larger set of administrators.”
Encryption may increase in importance as more SAN data migrates from Fibre Channel to IP and as storage over IP allows data to travel farther outside the data center than is possible with Fibre Channel. “It is nice to have certain types of data encrypted,” says Tajudeen, but only if the encryption isn’t too expensive and doesn’t exact too much of a toll on performance.
Building the Business Case Storage managers must also get ready to explain the intricacies of SAN security to their less-technical peers, says John Webster, a senior analyst at Data Mobility Group Inc. in Nashua, N.H. Some pioneers looking to consolidate corporate data on SANs are facing tough questions from department heads worried about how their data will be kept separate
from data generated by other business units, and from chief security officers worried about whether the SAN will be secure from outside threats. First, “you’ve got to figure out how, or if, you can overcome” such objections, says Webster, and be prepared to defend your plan in understandable terms. “If you’re not prepared to answer them, you can be in trouble,” he says.
SAFE & SECURE
Long-Distance Data Replication
Even before the 9/11 terrorist attacks and the temporary shutdown of the nation’s airlines, IT managers were beginning to use the words disaster recovery and storage in the same sentence, especially in the financial services industry. But afterwards, the marriage of the two disciplines seemed even more urgent. The potential for disasters seemed bigger, and disaster recovery plans that called for flying tapes across the country seemed naive. The result is that the words disaster recovery are now actually driving many storage technology projects, as corporate IT managers look for ways to replicate data and send it to sites that are 10, 20 or even hundreds of miles from headquarters. Here are three technology strategies they’re using.
1
THE STORAGE SUBSYSTEMS APPROACH. When Lajuana
Earwood began looking for a new disaster recovery system, she found she was alone. “No one else had done what we were trying to do — we wanted to mirror massive amounts of data over a very long distance, about 700 miles,” she says. Earwood, director of mainframe systems at Norfolk Southern Railway Co. in Norfolk, Va., realized that the company’s business systems were too vulnerable. “At the time we
were replicating a small subset of our data in real time,” says Earwood. “But it was not really enough to carry us through in the event of a major disaster.” The data sets amounted to about 6TB of critical railroad, payroll and order entry information on two IBM mainframes — essentially the railroad’s IT hub. So Earwood sent out a bid request to all the major storage vendors; Hitachi Data Systems Corp. in Santa Clara, Calif., got the job. “IBM’s proposal would have required too much additional hardware,” says Earwood. “EMC’s solution gave only snapshots. HDS gave us the closest thing to real-time mirroring, and it required less hardware.” And she liked the price. Along the way there was one major change in direction, though. The original plan was to replicate to a site in North Bergen, N.J., about 700 miles away. “But after Sept. 11, we realized that might not be such a good idea. The logistics of transporting personnel could effectively negate all our other efforts,” Earwood says. So Norfolk Southern decided instead to use a much closer backup center in Buckhead, Ga., for the mirrored mainframe data. The sheer volume of data presented another challenge. “We wanted to put all our data in one consistency group,” says Earwood. A consistency group is a set of data that is shared by a number of critical applications, but in Norfolk’s case all of the data is shared by all of the applications. It made sense on paper, but the execution pushed the technology envelope a bit too far. “We are using HDS 9960 storage hardware, the HDS TrueCopy replication software and two OC3 network pipes,” says Earwood. “It all worked, but we kept hitting the ceiling on high-volume write transactions.”
Business Continuity Tips
DECIDE on your recovery objectives before selecting technologies and spending money.
DON’T NEGLECT the people part of business continuity. The best data replication system in the world won’t help if your people aren’t trained and in place to take advantage of it.
LEVERAGE the infrastructure you already have. For example, if you have dark fiber in place, it might be cost-effective to go with a high-end SAN and Dense Wave Division Multiplexing for data replication.
CONSIDER that if a disaster occurs and you have to use the airlines to get to a remote site, your recovery time will increase — if you can fly at all.
The solution was to split the data into three consistency groups. But Earwood isn’t giving up on the original goal. “We are looking at new hardware from HDS that can handle more volumes. This might allow us to consolidate all our data back to one consistency group,” she says. Earwood tests the system with a simulated disaster recovery almost every week. “We are almost down to a four-hour recovery time,” she says, “and we now feel that we can go to our board of directors and say that we have confidence in our disaster recovery system.”
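A rough Python sketch of the consistency-group idea follows. The volume names and group layout are invented, not Norfolk Southern's actual configuration; the point is simply that writes are queued and replicated in order within each group, so the remote copy of that group stays crash-consistent.

# Sketch of consistency groups: writes are replicated in their original
# order within a group. Names and sizes are illustrative only.
from collections import defaultdict

consistency_groups = {
    "cg1": ["railroad_ops"],     # splitting one huge group into three
    "cg2": ["payroll"],          # spreads the high-volume write load
    "cg3": ["order_entry"],
}

replication_queues = defaultdict(list)   # one ordered queue per group

def group_of(volume: str) -> str:
    return next(g for g, vols in consistency_groups.items() if volume in vols)

def record_write(volume: str, block: int, data: bytes) -> None:
    """Queue a write for replication, preserving order within its group."""
    replication_queues[group_of(volume)].append((volume, block, data))

record_write("payroll", 42, b"...")
record_write("railroad_ops", 7, b"...")
print({g: len(q) for g, q in replication_queues.items()})  # {'cg2': 1, 'cg1': 1}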
2. HOST-BASED SOFTWARE. When Chadd Warwick, operations manager at Comprehensive Software Systems Inc., a financial software development house in Golden, Colo., went shopping for a new business continuity system, he wanted something a little more flexible than the hardware-based systems from vendors such as Hitachi and Hopkinton, Mass.-based EMC Corp. He found it in Veritas Volume Replicator (VVR) software from Veritas Software Corp. in Mountain View, Calif. “We liked VVR,” says Warwick, “because it is a host-based software solution.” Because the software runs on the server instead of on the disk array, it’s independent of the storage hardware. “It meant we didn’t have to forklift a new hardware infrastructure in, which meant lots of savings for us,” says Warwick. Warwick started using VVR in November 2001 as a beta tester and decided to stick with it. “This is block-level data replication so the software doesn’t need to know anything about the applications or the data. And the hardware independence is really nice,” he says. “You can ac-
tually restore to different hardware, so, in the event of a major disaster, we could even run down to Best Buy and pick up whatever machines we could find to get us up and running quickly.” Another advantage, Warwick says, is the absence of complex, proprietary network protocols. “VVR uses standard IP,” he says. Currently, the company replicates about 400MB to 1GB of data per day — over T1 and T3 lines — from the data center in Golden to a site in downtown Denver.
The software approach is usually cheaper than subsystems products, especially for replication over long distances, says Bob Guilbert, vice president of NSI Software Inc. in Hoboken, N.J., which competes with Veritas. “The subsystems products typically require dedicated fiber, and that can get very expensive.”
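The block-level approach Warwick describes can be illustrated with a short, generic sketch: hash each block of a volume, compare against the previous pass and ship only the blocks that changed. This shows the general technique, not Veritas Volume Replicator's actual mechanism, and the block size is an assumption.

# Minimal sketch of block-level delta replication over IP: only blocks whose
# contents changed since the last pass cross the wide-area link.
import hashlib

BLOCK = 4096   # assumed block size for the illustration

def block_hashes(data: bytes) -> list:
    return [hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]

def changed_blocks(old: bytes, new: bytes) -> list:
    old_h, new_h = block_hashes(old), block_hashes(new)
    return [i for i, h in enumerate(new_h) if i >= len(old_h) or h != old_h[i]]

primary_before = b"A" * BLOCK + b"B" * BLOCK + b"C" * BLOCK
primary_after  = b"A" * BLOCK + b"X" * BLOCK + b"C" * BLOCK

print(changed_blocks(primary_before, primary_after))  # [1]: one block to send

Because the comparison happens below the file system, the receiving end can be any hardware that presents a volume of the same size, which is what makes the "restore to whatever machines we could find" scenario possible.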
3. THE HYBRID: SAN OVER IP. Shimon Wiener, like many of his peers, started looking for a better way to protect his firm’s data after the terrorist attacks of Sept. 11, 2001. Wiener is the manager of the Internet and networking department at Mivtachim, a leading provider of pension insurance in Ramat Gan, Israel. It’s a part of the world where, unfortunately, disasters aren’t a rare occurrence. The firm has two data centers: one with 600MB, in Ramat Gan, and the other with 400MB, about seven miles away in Tel Aviv. “We wanted to do a double replication,” says Wiener, “so if Tel Aviv goes down, Ramat Gan can take over, and vice versa.” But when Wiener went shopping, he was dismayed at the high prices. “We first looked at
Compaq, IBM and EMC. None of them could do this without a Fibre Channel connection, which was very expensive,” he says. Wiener finally found what he wanted from Dot Hill Systems Corp. in Carlsbad, Calif. Mivtachim wanted a storage-area network (SAN) at each site, and Dot Hill’s Axis Storage Manager software supports IP replication for SAN-based systems. Beginning in January, the pension company installed the Dot Hill SANnet 7100 hardware in Ramat Gan and a second SANnet 7100 in Tel Aviv, and then started replicating between sites. Testing lasted about four weeks. “Initially, we had some problems getting the two systems synchronized,” says Wiener, “but we had good support and now we are very satisfied. We replicate about once a day.” Although Wiener got the system for disaster recovery purposes, he says it had a significant side benefit: “The SAN also centralized our data so backups are much easier to manage.”
SAFE & SECURE
Advances in Tape Backup
Like many IT executives, Eric Eriksen, chief technology officer at New York-based Deloitte Consulting, would like tape to just go away. The added cost of managing tape backup systems, slow and unreliable restoration, cartridge inventorying and off-site storage headaches have him hoping that cheap disk drives may someday replace 50-year-old tape technology in the data center. “We only need tape for cases when we can’t restore from disk. It’s a necessary evil,” he says. Yet despite a drastic shift toward low-cost Advanced Technology Attachment disk arrays for backing up business data, there’s no end in sight to the use of tape in the data center — especially for archival storage. Administrators may complain, but tape still has an enormous installed base and remains 10 to 50 times less expensive than disk. It’s also very secure, since data stored off-line on removable media is physically inaccessible to hackers and viruses. And vendors and analysts say evolutionary advances in the basic technology of midrange tape drive systems, improvements in management tools, and the emergence of combined disk/tape subsystems are likely to answer some user complaints — and keep tape technology in data centers for at least another decade.
Bigger and Faster
Manufacturers of the three leading midrange tape drive technologies — digital linear tape (DLT), linear tape-open (LTO) and advanced intelligent tape (AIT) — are preparing significant capacity and speed improvements. Advanced drives, including SuperDLT (SDLT), SuperAIT (SAIT) and LTO Ultrium 2 (LTO-2), are the latest variations. Each uses half-inch tape and offers roughly five times the capacity and performance of standard DLT, AIT and LTO tapes. For example, DLT was developed in 1986 and the average cartridge originally held about 96MB of data. SDLT today holds 160GB. Over the next decade, SDLT will grow to about 2.5TB native capacity with 250MB/sec. throughput. LTO, which derives its name from its open architecture, could grow to 10TB native capacity by 2011. Vendors say 1TB tape cartridges could appear as early as next year. Tape manufacturers such as Quantum Corp., Certance LLC and Storage Technology Corp. expect tape to more than meet future needs. That’s a tall order, since the amount of data produced by the average enterprise is doubling every year, according to Gartner Inc. To keep up, tape media will evolve to have more than 1,000 tracks and a thickness of 6.9 microns (about as thick as cellophane). And it will also work with drives that write on both sides of the tape, says Jeff Laughlin, director of strategy for the automated tape solutions unit at StorageTek in Louisville, Colo. In contrast, StorageTek’s
current high-end tape drive, the proprietary T9940B, uses 200GB, one-sided tape that has 576 tracks and is 9 microns thick. Laughlin expects transfer rates to keep up with the larger capacity tapes as well. “There’s more money being spent on tape media research than ever before in history. You’re going to see greater transfer rates at the head interface, transfer rates of 100GB/sec., 200GB/sec.,” he says.
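A quick back-of-the-envelope calculation shows why the vendors are racing to grow cartridge capacity. The starting archive size and the roughly 32%-a-year capacity growth below are assumptions, chosen so that a 160GB SDLT cartridge reaches about 2.5TB in a decade as cited above, while the yearly doubling of data follows the Gartner figure.

# Rough check: data doubling every year outpaces cartridge capacity growth,
# so cartridge counts keep climbing. Starting values are illustrative.
data_tb = 10.0            # assumed starting archive size
cartridge_tb = 0.16       # SDLT today: 160GB native
capacity_growth = 1.32    # assumed ~32%/yr, roughly 2.5TB after ten years

for year in range(0, 11, 2):
    cartridges = data_tb / cartridge_tb
    print(f"year {year:2d}: {data_tb:8.0f} TB needs {cartridges:6.0f} cartridges")
    data_tb *= 4                        # two years of doubling per step
    cartridge_tb *= capacity_growth ** 2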
Smarter
Emerging management software that can monitor the health of tape drives, Fibre Channel switch port connections to libraries and even the tape cartridges themselves will help ensure that users are able to restore from tape, more easily manage backups and predict problems and backup failures, vendors say. Advanced Digital Information Corp. (ADIC) and Quantum, for example, have recently introduced native management software tools on their tape library and drive technology. ADIC sells all major tape cartridge technologies in its automated libraries and tape autoloaders, but Dave Uvelli, an executive director at the Redmond, Wash.-based company, says he believes cartridge formats and drive technologies are becoming irrelevant. Instead, ADIC is betting on new, intelligent tape library systems that will eventually provide detailed information on drives and tape, whether it’s related to a downed switch port, a stuck drive or a tape cartridge that’s reaching the end of its life. One example of archival intelligence is ADIC’s Scalar i2000 tape library. The Scalar i2000 is designed to eliminate the need for an external library control server. Among other things, the system can send backup failure alerts via pager or e-mail, partition a library into multiple logical libraries and perform mixed-media, performance and proactive system readiness checks. San Jose-based Quantum also introduced DLTSage, a suite of predictive and preventive diagnostic tools that run on its SDLT tape drives to help ensure that backups have completed successfully. The applications can also tell administrators when drives have reached critical thresholds for capacity and predict where and when errors may occur.
Choosing the Right Format
Will the SDLT, LTO-2 or S-AIT tape drive technology you’re using today be around tomorrow? Most likely, vendors and analysts say, although some users are finding reasons to switch from one format to another. Even with software advances in SDLT, more users are buying LTO-2 drives these days. Bob Abraham, an analyst at Freeman Reports in Ojai, Calif., says LTO-2 appeals to users because its open architecture offers a choice of vendors. Hewlett-Packard Co., IBM and Costa Mesa, Calif.-based Certance all manufacture LTO-2 products, whereas only Quantum produces DLT and SDLT drives. In July, Quantum put self-diagnosing intelligence into its SDLT drives, a move that analysts say will help boost sales. Quantum also says it has plans for at least four more incarnations of SDLT, and the vendor has 31% of the overall tape market — more than any of its competitors. But John Pearring, president of StorServer Inc. in Colorado Springs, a manufacturer that sells all three tape technologies, still gives LTO the edge. “LTO is open and makes more sense, and it’s 200GB native [vs. 160GB for the latest SDLT 320 drives],” he says. Deloitte’s Eric Eriksen says he’s looking at moving from four HP tape libraries, with eight SDLT drives each, to a single HP or ADIC Scalar 10K tape library using LTO drives for greater capacity in a smaller footprint. He says his decision isn’t being driven so much by LTO-2’s openness, but by its compression rates and speeds, which — for the moment — exceed those of SDLT. He also says that the new LTO-2 libraries are more scalable than his older system. “One of the things that’s important when we’re doing streaming across multiple tape drives is to be able to restore quickly,” he says, referring to LTO-2’s 200GB capacity and 35MB/sec. throughput. And while LTO has a capacity and performance edge over SDLT today, analysts say the two tape technologies continuously leapfrog each other in capacity and throughput, so other factors may be more important. SDLT and LTO-2 may be neck and neck in speeds and feeds, but Sony Electronics Inc.’s S-AIT leapfrogged both with the vendor’s introduction of a 500GB, 30MB/sec. drive in December — and it’s likely to remain ahead for some time, based on current SDLT and LTO road maps. S-AIT also has the edge in pricing: S-AIT tape cartridges are $80, vs. $120 for LTO-2 and $130 for SDLT. Sony intends to develop and support S-AIT through at least a sixth generation, says Stephen Baker, vice president of storage solutions at Sony in San Jose. But S-AIT’s appeal has been limited because, as with SDLT, only one manufacturer produces the drives.
Here Come the Hybrids
While disk-to-disk backup is already popular, during the coming year manufacturers plan to introduce more hybrid systems that combine disk with tape libraries in storage-area networks for faster backups and restores and easier archiving. ADIC, for example, plans to introduce a combined tape/disk library this month. “You won’t just have tape. One could imagine RAID-protected disk where I/Os from the backup job are completed at [wire] speed while the [library] robot, through management software, stages it on tape drives for archival,” says StorageTek’s Laughlin. Ultimately, however, scalability and restorability will continue to be the key criteria to take into account when selecting tape systems, says Deloitte Consulting’s Eriksen. “We’re looking for a single solution that can cover everything, regardless of the needs we have,” he adds.
SAFE & SECURE
Backing Up the Edge
Mike Lucas, IT director at Hogan & Hartson LLP, had had enough. The Washington-based law firm was paying $30,000 a month to back up data on more than 400 servers located in 27 offices worldwide and store the tapes off-site. Lucas says he couldn’t stomach the cost of buying more tape drives to back up every new print, file or application server. Along with the increasing costs, the tape-based infrastructure created administration issues, including the need to sometimes rely on nontechnical staffers to swap out tape cartridges in each remote office every night and take them off-site. Then there were the software glitches. “We’d have trouble from time to time with a tape getting hung, having to do a reboot of a server during off hours. We were at risk of not having a backup,” Lucas says, adding that retrieving tapes for restoring data in an emergency could take more than a day.
Hit or Miss
Data protection executed at remote sites is often a hit-or-miss scenario because “no one knows if the backup actually happened or if a restore can occur,” says Arun Taneja, an analyst at Taneja Group Inc. in Hopkinton, Mass.
Those frustrations led Lucas to use a remote backup strategy that brings backup data into the data center, where it can be centrally managed. Vendors offer a variety of networkbased schemes that pull data across a WAN to a central repository. These systems are simpler to manage and more cost-effective than local tape backups, analysts say. Most include software and appliances that replicate data from branch offices to the data center, where it is backed up to a disk device and/or tape library. This model eliminates the need for media handling or IT support at remote sites and offers greater security, since backup data is centralized. The increasing popularity of these systems is starting to affect sales of entry-level tape drives commonly used to back up direct-attached storage. IDC in Framingham, Mass., is forecasting a 20% decline this year as administrators increasingly decide not to back up branch servers locally.
The Options
Vendors offer several approaches to remote backup. Software such as Veritas Software Corp.’s Storage Replicator and CYA Technologies Inc.’s HotBackup first execute a complete backup of direct-attached storage on each remote server or network-at-
tached storage appliance and then move incremental or “delta” changes over the WAN to the data center. Some organizations with branch offices that host multiple servers are choosing to first consolidate backups to a local disk-backup appliance before replicating data across the WAN. The appliance can complete server backups quickly across a LAN and then stream updates over the slower WAN connection to the data center, where it can be archived to tape. For workstation backups, some storage administrators are creating virtual drives on remote end-user PCs and mapping those to a file server back in the data center. To avoid performance problems over the WAN, administrators install a local data-caching appliance that gives users access to their files at LAN speeds while updates stream in the background to the back-end appliance in the data center. Lucas contracted with DS3 Data Vaulting LLC, a service provider in Fairfax, Va., for his network backup system, which includes disk-based appliances and software from Asigra Inc. in Toronto. Asigra’s Televaulting DS-Client software runs on servers, desktops and laptops connected to each remote office LAN and automates the backup of about 3TB of compressed data from local backup appliances in 10 offices over the WAN to an AT&T data center. After completing an initial full backup, the remote appliance provides updates only for changed data blocks. It eliminates duplicate files, encrypts the data and compresses it at a 2-1 ratio before automatically sending it across the WAN on a scheduled basis. Lucas expects a two-year payback on his investment. The
initial system installation in Hogan & Hartson’s central office cost about $13,000. He has deployed the system in 10 offices to date and is continuing to roll out the technology.
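The processing steps described above — skipping duplicate files, compressing and encrypting before data crosses the WAN — can be sketched in a few lines of standard-library Python. This is a toy illustration, not Asigra's Televaulting code; in particular, the XOR step merely stands in for real encryption such as AES.

# Toy remote-backup pipeline: dedupe, compress, then scramble before sending.
import hashlib, zlib, os

seen_digests = set()     # stands in for the service's deduplication index
key = os.urandom(32)     # placeholder key; real products use managed keys

def xor_stream(data: bytes, key: bytes) -> bytes:
    # Stand-in for encryption so the sketch has no external dependencies.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def prepare_for_wan(file_bytes: bytes):
    """Return the payload to send, or None if this file is a duplicate."""
    digest = hashlib.sha256(file_bytes).hexdigest()
    if digest in seen_digests:
        return None                       # duplicate file: nothing crosses the WAN
    seen_digests.add(digest)
    return xor_stream(zlib.compress(file_bytes), key)

doc = b"quarterly occupancy report " * 200
payload = prepare_for_wan(doc)
print(len(doc), "bytes ->", len(payload), "bytes after compression")
print(prepare_for_wan(doc))               # None: the duplicate is skipped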
Caching Up
Companies such as Actona Technologies Inc. (recently acquired by Cisco Systems Inc.), Riverbed Technology Inc., Disksites Inc. and Tacit Networks Inc. use appliances at both the remote site and the central data center for global file sharing. The appliances speed up access to shared files in part by removing the overhead associated with file-serving protocols such as the Common Internet File System and Network File System. Mukesh Shah, director of network services at The Associated Merchandising Corp. (AMC) in Plainfield, N.J., is in charge of file-sharing operations among 40 remote locations in a worldwide network that includes data center hubs in Hong Kong and New Jersey. AMC uses MetaFrame server software from Fort Lauderdale, Fla.-based Citrix Systems Inc., which gives Windows XP PCs and wireless devices virtual, thin-client access to applications running on back-end servers. It also uses the New Jersey data center for global file sharing of Excel spreadsheets, Microsoft Word documents and other files. But users in Asia and Europe were waiting more than two minutes for remote files to open. The system also lacked adequate file-locking safeguards for some shared files. Users were “quite unhappy,” Shah says. Eight months ago, Shah began piloting a caching appliance from South Plainfield, N.J.-based Tacit Networks in his New Jersey data center. File-access times dropped from an average of 122 seconds to 11 seconds on first access, and the end-user wait disappeared altogether on subsequent attempts, once the file was loaded into the local appliance’s cache.
Remote Backup Systems
PROS
• Ease backup management headaches by removing tape drives from branch offices and consolidating them in the data center.
• Afford greater security by centralizing backup data.
• Eliminate the need for branch-office staff involvement in backup processes such as tape rotations.
CONS
• Experience problems with replication throughput if WAN bandwidth is insufficient.
• Require additional software, adding complexity to the backup process.
• Require upfront investment in software and possibly hardware as well, depending on the system chosen.
“Tacit has a process where you can push files to a local cache on a scheduled basis,” Shah says. “So when users go to access the file, it’s already there.” When users change and save the file back to the cache, it’s also saved on the main file-sharing server in New Jersey, where AMC staffers back it up. “All restores can be done centrally, whereas if we had to substitute the cache appliance with a file server, we’d have the complexity of backups and restores at the remote office level,” Shah says.
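A simple model of the caching behavior Shah describes is shown below: the first open of a file pays the WAN round trip and populates the branch-office cache, later opens are served locally, and a scheduled prefetch warms the cache before users ask. File names and timings are illustrative, not Tacit's implementation.

# Toy read-through cache with scheduled prefetch for a branch office.
import time

wan_store = {"forecast.xls": b"spreadsheet bytes"}   # data center file server
local_cache = {}                                     # branch-office appliance

def open_file(name: str) -> bytes:
    if name in local_cache:               # LAN-speed cache hit
        return local_cache[name]
    time.sleep(0.1)                       # stand-in for the slow WAN round trip
    local_cache[name] = wan_store[name]   # populate the cache on first access
    return local_cache[name]

def prefetch(names) -> None:
    """Scheduled push so files are already cached before users ask."""
    for n in names:
        open_file(n)

prefetch(["forecast.xls"])
start = time.time()
open_file("forecast.xls")
print(f"cached open took {time.time() - start:.4f}s")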
Outsourcing It All Overworked IT organizations that don’t have the time or resources to set up a remote backup system can consider similar offerings from service providers. Brian Asselin, IT director at Harborside Healthcare Corp., a Boston-based long-term care company, oversees operations for 55 locations and 8,500 employees, but he says he has only one IT person for each of the nine states in which facilities are located. Harborside had been using direct-attached tape backup for its remote application servers, but Asselin says ensuring that backups occurred
and performing restores were a nightmare. “Our people working in the facilities are definitely technically challenged,” he says. “Logistically, it would be impossible to restore with the people I have.” What’s more, the Health Insurance Portability and Accountability Act requires greater security around patient information than Harborside’s IT infrastructure can provide, Asselin says. “There’s just a slew of security that needs to be in place by 2005 for HIPAA,” he says. Instead of building a central data center where data could be replicated for disaster recovery purposes and further burdening his IT staff, Asselin chose service provider AmeriVault Corp. in Waltham, Mass., to host backup data storage and handle daily replication from the remote sites. AmeriVault installed its CentralControl software on Harborside’s desktops and an agent on each of its servers. After completing an initial full backup of all data, the vendor performs daily, incremental, encrypted backups over the Internet to its disaster recovery centers.
Point-and-Click Portal
In an emergency, administrators at Harborside can perform data restores, even from home, using a point-and-click application on AmeriVault’s Web portal. Alternatively, data can be shipped on tape for large restores. Asselin says AmeriVault has “processes and procedures” that are HIPAA-compliant, which relieves his staff from having to set up its own compliance program. And Asselin says he also reduced labor costs by outsourcing his remote backup and recovery architecture because “we don’t have to have people running around dedicated to the task of backup.” But while the remote backup technology made processes more efficient, the outsourcing approach wasn’t necessarily cheaper. “In terms of actual backup cost, it’s pretty much a wash. When you consider bandwidth and license payments for software, it’s pretty much even with other backup solutions,” Asselin says. Tony Asaro, an analyst at Enterprise Storage Group Inc. in Milford, Mass., says the costs of edge network backup technologies are continuing to drop, and as large companies investigate using these systems, big vendors are stepping in with new products. Asaro points to EMC Corp.’s entry-level Clariion AX100 array, which can be directly attached to its NetWin 110 NAS Gateway or bought as a preconfigured storage-area network with backup and storage management software for remote office backup. And EMC’s Legato RepliStor replication software is bundled with switches from Brocade Communications Systems Inc. “I don’t think it’s a fad,” Asaro says. “I think more people are going to adopt this technology because it’s cost-effective.”
EMERGING TECHNOLOGIES
Grid Storage
DEFINITION: Grid storage, analogous to grid computing, is a new model for deploying and managing storage distributed across multiple systems and networks, making efficient use of available storage capacity without requiring a large, centralized switching system.
What Is Grid Storage?
We routinely talk about the electrical power grid or the telephone grid, and it’s pretty clear what we mean — a large, decentralized network with massive interconnectivity and coordinated management. A grid is, in fact, a meshed network in which no single centralized switch or hub controls routing. Grids offer almost unlimited scalability in size and performance because they aren’t constrained by the need for ever-larger central switches. Grid networks thus reduce component costs and produce a reliable and resilient structure. Applying the grid concept to a computer network lets us harness available but unused resources by dynamically allocating and deallocating capacity, bandwidth and processing among numerous distributed computers. A computing grid can span locations, organizations, machine architectures and software boundaries, offering power, collaboration and information access to connected users. Universities and research facilities are using grids to build what amounts to supercomputer capability from PCs, Macintoshes and Linux boxes. After grid computing came into being, it was only a matter of time before a similar model would emerge for making use of distributed data storage. Most storage networks are built in star configurations, where all servers and storage devices are connected to a single central switch. In contrast, grid topology is built with a network of interconnected smaller switches that can scale as bandwidth increases and continue to deliver improved reliability and higher performance and connectivity.
Based on current and proposed products, it appears that a grid storage system should include the following:
Modular storage arrays: These systems are connected across a storage network using serial ATA disks. The systems can be block-oriented storage arrays or network-attached storage gateways and servers.
Common virtualization layer: Storage must be organized as a single logical pool of resources available to users.
Data redundancy and availability: Multiple copies of data should exist across nodes in the grid, creating redundant data access and availability in case of a component failure.
Common management: A single level of management across all nodes should cover the areas of data security, mobility and migration, capacity on demand, and provisioning.
Simplified platform/management architecture: Because common management is so important, the tasks involved in administration should be organized in modular fashion, allowing the autodiscovery of new nodes in the grid and automating volume and file management.
Three Basic Benefits
Applying grid topology to a storage network provides several benefits, including the following:
RELIABILITY. A well-designed grid network is extremely resilient. Rather than providing just two paths between any two nodes, the grid offers multiple paths between each storage node. This makes it easy to service and replace components in case of failure, with minimal impact on system availability or downtime.
PERFORMANCE. The same factors that lead to reliability also can improve performance. Not requiring a centralized switch with many ports eliminates a potential performance bottleneck, and applying load-balancing techniques to the multiple paths available offers consistent performance for the entire network.
SCALABILITY. It’s easy to expand a grid network using inexpensive switches with low port counts to accommodate additional servers for increased performance, bandwidth and capacity. In essence, grid storage is a way to scale out rather than up, using relatively inexpensive storage building blocks.
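The reliability argument rests on there being more than one path between any two nodes. The toy Python model below enumerates the switch paths in a small made-up mesh; losing any single switch still leaves a route, which is the property a grid fabric is designed to preserve.

# Toy model of a meshed grid fabric: several loop-free paths exist between
# any two switches, so one failure does not isolate a storage node.
links = {                      # a small, invented mesh of four switches
    "sw1": {"sw2", "sw3"},
    "sw2": {"sw1", "sw4"},
    "sw3": {"sw1", "sw4"},
    "sw4": {"sw2", "sw3"},
}

def paths(src, dst, visited=()):
    """Yield every loop-free switch path from src to dst."""
    if src == dst:
        yield visited + (dst,)
        return
    for nxt in links[src]:
        if nxt not in visited + (src,):
            yield from paths(nxt, dst, visited + (src,))

routes = list(paths("sw1", "sw4"))
print(len(routes), "paths:", routes)
# A failure of sw2 removes only one route; sw1 -> sw3 -> sw4 still works.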
Vendor Offerings The biggest player in the grid storage arena seems to be Hewlett-Packard Co., the first major storage vendor to deliver a grid storage product. HP’s StorageWorks Grid architecture stores information in numerous individual “smart cells” — standardized, modular and intelligent devices. HP’s smart cells will be smaller than the monolithic storage usually found in storage-area networks. They will be based almost entirely on low-cost, commodity hardware and provide a basic storage unit that customers can configure and change as needed to run many different tasks. According to an HP technical white paper, smart cells contain a CPU and, optionally, cache memory in addition to storage devices (disk, optical or tape drives). The cells are interconnected to form a powerful, flexible, peer-to-peer storage network. All smart cells have a set of common software installed in them, but each can be given a specific function (or personality) by loading appropriate operational software for tasks such as capacity allocation, policy and reporting, block or file serving, archiving and retrieval, or auditing and antivirus services. Administrators can change the functions of specific smart cells to deliver different types of services as business needs change. And
it’s easy to expand the grid by simply adding modules, which are automatically detected and incorporated into an appropriate domain.
Smart Cells
Smart cells are more capable than single-purpose disk arrays or tape libraries. Because the data path is completely virtualized, any smart cell can manage any I/O operation. Smart cell software maintains consistency between smart cells as well as ensures data redundancy and reliability. In the HP StorageWorks grid, all components are integrated to present a single system image for administration as a single entity. The system is designed from the ground up to be self-managing — tasks traditionally associated with storage resource management are performed by the utility itself, with no human involvement. The only time an administrator needs to know anything about individual smart cells is when failed hardware must be repaired or replaced. Even then, the single-system-image software provides fault isolation and failure identification to simplify maintenance. So far, HP has announced four StorageWorks Grid products:
• Document Capture, Retention and Retrieval, using multifunction printers, scanners and digital senders to simplify the conversion of paper-based records to digital form.
• Sharable File System, a self-contained file server that distributes files in parallel across clusters of industry-standard, Linux-based server and storage components.
• Reference Information Storage System (RISS), an all-in-one archive and retrieval system for storing, indexing and rapidly retrieving e-mail and Microsoft Office documents, available in 1TB and 4TB configurations.
• StorageWorks XP120000, a new enterprise-class disk array providing a two-tier storage architecture with single-system-image management.
Nasdaq Stock Market Inc. has deployed RISS in an effort to comply with regulations about e-mail archiving. The product also provides New York-based Nasdaq with a strategic foundation for managing its overall information life cycle. “HP’s StorageWorks Grid is a visionary architecture for an intelligent, scalable, reliable and agile storage platform,” says Joseph Zhou, a senior analyst at D.H. Brown Associates Inc. in Port Chester, N.Y. Zhou notes that the grid leverages HP’s storage and server technology expertise, as well as research done at HP Labs, to create a new storage environment that delivers long-promised capabilities in novel and very useful ways. An open question, Zhou says, is whether HP will
eventually converge its development toward a single unified grid encompassing both servers and storage.
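The smart-cell concept can be caricatured in a few lines of Python: identical modules are discovered, given a role by loading the appropriate software, and tracked by the grid as a single system. The role names and class layout below are illustrative only, not HP's architecture.

# Toy illustration of "smart cells" with loadable personalities and
# auto-discovery. Everything here is invented for the sketch.
class SmartCell:
    def __init__(self, cell_id: str):
        self.cell_id = cell_id
        self.role = None            # common base image, no personality yet

    def load_personality(self, role: str) -> None:
        self.role = role            # e.g. "block-serving", "archiving", "auditing"

class StorageGrid:
    def __init__(self):
        self.cells = []

    def discover(self, cell: SmartCell, role: str) -> None:
        """A new module is detected and assigned a role automatically."""
        cell.load_personality(role)
        self.cells.append(cell)

    def cells_with(self, role: str) -> list:
        return [c.cell_id for c in self.cells if c.role == role]

grid = StorageGrid()
grid.discover(SmartCell("cell-01"), "block-serving")
grid.discover(SmartCell("cell-02"), "archiving")
grid.discover(SmartCell("cell-03"), "block-serving")
print(grid.cells_with("block-serving"))   # ['cell-01', 'cell-03']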
Other Players Oracle Corp. has announced partnerships with several storage providers (including EMC Corp., HP and Hitachi Ltd.) to simplify storage management for customers and offer support for the new Oracle Database 10g. Oracle and its partners plan to enhance features to automate the storage administration and provisioning tasks, thereby freeing up database and storage administrators for more productive work. In Europe, New York-based Exanet Inc. has launched its ExaStore Grid Storage 2.0 software-based product built around grid computing. It clusters all storage system resources, including RAID arrays, servers and controllers and cache memory, into a single unified network-attached storage resource with a single namespace operating in a heterogeneous environment using several storage protocols. The Exanet product is aimed at applications for digital media and premedia, media streaming, oil and gas, digital video animation, medical imaging and others with performance, highvolume and high-availability requirements.
EMERGING TECHNOLOGIES
MAID Storage
In the ongoing struggle to automate and speed data backups and restores, storage administrators are increasingly turning to Advanced Technology Attachment disk subsystems. Now two vendors are pitching the idea of using specialized ATA disk backup appliances as an alternative to robotic tape autoloaders for handling large volumes of archival storage. Both are using specialized ATA disk array technology to lower the cost per gigabyte of disk-based storage and extend the life of backup disk drives, making them more attractive for archival and near-line storage. The vendors, Longmont, Colo.-based start-up Copan Systems Inc. and Santa Clara, Calif.-based Exavio Inc., claim that this new technology, dubbed MAID, for massive arrays of idle disks, is competitive with tape and offers faster and more reliable access to data. MAID systems use arrays of ATA disk drives that power down when idle in an effort to extend media life. By spinning up only when they write or read data, the arrays use less power, mitigating heat issues and allowing drives to be packed more densely into the system. Idle disk drives require about 10 seconds to spin up, but once online, they provide faster access to archived data than tape does.
Although powering up disks as needed can extend useful life, disks that remain inactive for long periods tend to develop problems spinning up. To avoid this, MAID arrays can periodically power up all drives to relubricate the mechanics, Copan says. Drives are hot-swappable, and the systems support RAID for fault tolerance. Prices range from $3 to $5 per gigabyte, depending on the configuration, the amount of redundancy and total capacity. Steve Curry, architect for storage operations at Yahoo Inc. in Sunnyvale, Calif., is considering buying Copan’s Revolution 200T MAID array to cut the use of some 350 tape drives by half. By doing so, he hopes to improve reliability. “We see [one or two tape drive] failures every day. To us, it’s not super-unreliable, but it still has mechanical properties and does break down, which requires manual intervention,” Curry says.
AT A GLANCE
Massive Arrays of Idle Disks (MAID)
WHAT IT IS: Low-cost disk-based backup and archiving appliances that power down idle disks to extend media life. Lower power requirements and less heat allow for more compact, lower-cost designs.
PROS: Faster and more reliable than tape libraries.
CONS: Cost and portability. At $3 to $5 per gigabyte, MAID still costs more than tape libraries. Disk media aren’t well suited for off-site storage.
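The trade-off MAID makes can be modeled simply: an idle drive is powered down to save power and wear, and the next access pays a spin-up delay of roughly 10 seconds. The Python sketch below uses made-up timeout values and is not Copan's or Exavio's firmware logic.

# Toy model of a MAID drive: spin down when idle, pay a spin-up penalty on
# the next access. The idle timeout is an assumption for illustration.
SPIN_UP_S = 10          # article: roughly 10 seconds to spin up
IDLE_TIMEOUT_S = 600    # assumed idle window before powering down

class MaidDrive:
    def __init__(self):
        self.spinning = False
        self.idle_s = 0

    def access(self) -> int:
        """Return the latency penalty (seconds) for this read or write."""
        penalty = 0 if self.spinning else SPIN_UP_S
        self.spinning, self.idle_s = True, 0
        return penalty

    def tick(self, seconds: int) -> None:
        """Advance time; power down if the drive has been idle long enough."""
        self.idle_s += seconds
        if self.spinning and self.idle_s >= IDLE_TIMEOUT_S:
            self.spinning = False

drive = MaidDrive()
print(drive.access())        # 10: the first access pays the spin-up cost
print(drive.access())        # 0: already spinning
drive.tick(IDLE_TIMEOUT_S)   # long idle period: the drive powers down
print(drive.access())        # 10: spin-up again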
Archiving to MAID
Today Yahoo ships archival tapes to an underground storage facility run by Boston-based Iron Mountain Inc. Curry wants to locate a MAID array at the backup facility and archive to it directly using a Fibre Channel or Fibre Channel-over-IP link. “From our calculations, it’s looking like it’s doable. We are just waiting for someone to build a product that works as advertised,” he says. Copan’s 200T, announced last month, emulates a virtual tape library. It will scale to 224TB and restore 2.4TB of data per hour — about five times faster than tape access speeds — while keeping only one in every four drives powered up and online at any one time. The basic 56TB configuration, which includes 224 7,200-rpm, 250GB Serial ATA disk drives mounted in a single rack, will ship in the third quarter and sell for $196,000, or about $3.50 per gigabyte, according to Aloke Guha, Copan’s chief technology officer. Exavio’s ExaVault array is primarily marketed as a device for near-line storage and streaming of multimedia content, although the company claims that the array can also emulate a tape backup system. ExaVault, available now, uses 300GB, 5,400-rpm parallel ATA disk drives arranged in a single rack with one controller and a Fibre Channel or Gigabit Ethernet interface. Configurations range from 3TB to 120TB. A basic unit including a controller and 3.6TB of storage is $27,700; additional modules are $6,600 per terabyte, says Kevin Hsu, Exavio’s director of marketing and product management. Despite MAID’s advantages, digital tape libraries remain the cheaper form of storage, at
about $1.25 to $4.50 per gigabyte, according to Fred Moore, president of Horison Information Strategies in Boulder, Colo. The low cost of tape and the fact that tape cartridges can be easily removed and stored off-site are the medium’s most attractive features. In contrast, the individual disk drives that make up MAID appliances are bulkier and more fragile.
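The per-gigabyte figures quoted above check out with simple arithmetic, as the short calculation below shows; it uses decimal gigabytes, which is how storage vendors typically quote capacity.

# Quick check of the quoted pricing: $196,000 for 56TB is about $3.50/GB,
# versus roughly $1.25 to $4.50 per gigabyte for tape libraries.
price = 196_000
capacity_gb = 56 * 1000          # 56TB in decimal gigabytes
print(f"${price / capacity_gb:.2f} per GB")   # prints $3.50 per GB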
Hsu acknowledges that MAID systems cost more per gigabyte than tape libraries but argues that they are less expensive to run overall. “Terabyte for terabyte, tape is cheaper than MAID. If you look at total cost of ownership . . . you have to look at robotics, manpower, replacing the tape heads, maintenance costs. MAID is cheaper,” he says. Robert Amatruda, an IDC
analyst, disagrees, saying that tape still provides a lower total cost of ownership overall. “You’re looking at a lot less money. It’s still a compelling solution,” he says. Both Exavio and Copan are developing portable versions of their systems. Copan, for example, is working on special shockproof disk enclosures that could be transported offsite. Drives would be stored
remotely in a Revolution 200T shell chassis that would spin up the drives periodically to keep them conditioned for use. But Amatruda eyes such portability designs with skepticism. “You drop some of that stuff and there could be data integrity issues,” he says. “At the end of the day, disk and tape will play a complementary role.”
© Copyright 2005, Computerworld Inc., Framingham, Mass.
See our full selection of Executive Briefings at the Computerworld Store.
https://store.computerworld.com Computerworld has Executive Briefings on many subjects including Outsourcing, Mobile & Wireless, Storage, ROI and Security.