What Are High Availability Support Services, and What Should Your Organization Look for in Using Them?
(Originally published as a feature article in the October 2001 issue of AFSMI's The Professional Journal.)
By Dennis S. Lynton
What Is High Availability?
In an ideal world, nothing would fail. However, businesses do not operate in an ideal world and, as a result, they need to take steps to ensure that their systems do not fail - and if they do, that the failures neither occur regularly, nor last too long. That is why high availability support services are needed - and as businesses increasingly operate in mission-critical modes, the need for high availability services is becoming even more important.
But what is high availability? The term high availability may generally be used interchangeably with "business-critical" or "mission-critical". Gartner Dataquest defines high availability as "IT infrastructures that have a critical impact on the businesses or environments that they serve". A broader definition includes "any IT system or process that, if stopped or degraded, negatively affects employee productivity, customer satisfaction or money-making ability."
However, according to services industry consultant William K. Pollock, president of the consulting firm Strategies for Growth℠, "in the 'real world', your organization may simply call it a system failure, energy outage, production stoppage, process halt, or some other term that really means workers cannot use their computers, people are left in the dark, food is spoiling, assembly lines are stopped, or the business is temporarily shut down - for an unknown period of time. High availability is an effective tool that can be used to protect the business from these unexpected outages."
Whatever definition of high availability is ultimately adopted by your organization, it will always include "the ability to eliminate or avoid (i.e., via error detection, correction and recovery), or minimize (i.e., via rapid restart) the occurrences - and the effects - of unscheduled outages". However, it is generally the role of the high availability support services vendor to deliver the services that address the specific business- or mission-critical needs of your organization.
It is also important to note that the "quantitative" value of high availability may differ greatly from one organization to another. For example, an organization that requires "only" 99.0% availability will still need to deal with up to 87.6 hours of downtime per year. Another organization that requires 99.99% availability will "only" have to deal with about 53 minutes of downtime per year. Even an organization that requires up to 99.999% availability will still have to deal with about 5¼ minutes of downtime per year. It is extremely easy to calculate how much downtime is associated with each level of availability - however, it is far more difficult to choose the level of availability that is appropriate for your organization.
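The downtime calculation mentioned above can be sketched in a few lines. This is only an illustration, assuming a 365-day year of continuous (24-hour) operation; the function name is hypothetical and is not taken from any vendor tool.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a 365-day year (8,760 hours)


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime, in hours, implied by an availability level."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return HOURS_PER_YEAR * (100.0 - availability_percent) / 100.0


# Print the downtime implied by several common availability levels.
for level in (90.0, 99.0, 99.9, 99.99, 99.999):
    hours = annual_downtime_hours(level)
    print(f"{level}% availability: {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

The arithmetic is the easy part; as noted above, deciding which level the business actually needs is the harder question.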
In the past, even 90% availability may have been good enough for most organizations - if the 10% downtime occurred after 5 pm, or sometime in the middle of the night. Some segments, such as medical, were among the first to realize that availability was critical at all hours of the day. The bank/financial services sector typically cites the end of the banking day (i.e., 3 pm), the trading day (i.e., 4 pm) or the business day (i.e., 5 pm) as the most critical times for availability. Fast food chains will tell you breakfast, lunch and dinner times are most critical; while a milk processing plant will tell you whenever they are processing milk is the most critical time. In today's global business environment, where many businesses are now producing goods, providing services and handling customer calls at all hours of the day - and from all over the world - availability requirements can no longer tolerate 36.5 days of downtime per year (i.e., 90% availability), let alone just under an hour (99.99% availability).
What Are High Availability Support Services?
High availability support services are what keep businesses running at their required business- or mission-critical levels. Gartner Dataquest defines high availability support services as a means "to minimize downtime via comprehensive proactive and reactive offerings". As such, high availability is much more than just "break/fix." It embraces a preventative approach that utilizes every available analog, digital, on-site and/or remote means of keeping the system up and running, accounting for all known causes of failure and, if necessary, providing the highest levels of after-the-fact diagnostics and remedial support. Even so, there are no high availability support services vendors today that will guarantee "100% availability", nor will there ever be - at least during our lifetime.
At StorageTek®, we define high availability support services as a collection of specific and comprehensive support components that, when used in conjunction, provide the customer with the highest possible levels of system uptime. The key elements of our high availability support services offering are listed below.
These services:
- Focus on rapid response, restore and resolve;
- Provide virtually instantaneous access to support experts and knowledge to minimize downtime (i.e., through both remote and on-site capabilities);
- Are based on an enhanced escalation management model (with a heightened sense of urgency);
- Comprise strong service processes, methodology and systems;
- Utilize remote support and diagnostics to facilitate rapid problem isolation; and
- Maintain a local/on-site service parts inventory.
The key components of a high availability support services offering, as illustrated in Figure 1, consist of the following:
- Operational reviews
- Change control management
- System connectivity support
- Patch management reviews
- Critical problem notification
- Preventative, high availability operational support
Remote and Proactive Monitoring Services
Remote diagnostics provide the ability to run diagnostic tests on the customer hardware remotely, thereby reducing or eliminating the need for a 24x7 on-site technician in many cases. Remote console takeover is the ability to take remote control of the customer console or system, and perform certain actions, such as accessing the system log, redirecting the console or, perhaps, even powering the system on and off. The remote fix goes well beyond detecting the problem, to performing the necessary steps to fix a diagnosed problem. This generally involves the ability to fix the problem remotely, with or without direct involvement from the customer.
Proactive Detection and Alerting
Proactive detection and alerting is the capability to automatically generate an alert if a pre-established threshold is exceeded, indicating an out-of-specification reading. This also includes device discovery and topology mapping capabilities.
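The threshold check described above might look something like the following minimal sketch. The metric names, limits and device name are hypothetical and chosen only for illustration; an actual monitoring product would define its own thresholds and alert format.

```python
from typing import Dict, List, NamedTuple


class Alert(NamedTuple):
    device: str
    metric: str
    reading: float
    limit: float


def find_alerts(device: str, readings: Dict[str, float], limits: Dict[str, float]) -> List[Alert]:
    """Return one alert per reading that exceeds its pre-established limit."""
    alerts = []
    for metric, reading in readings.items():
        limit = limits.get(metric)
        # Metrics without a configured limit are not evaluated.
        if limit is not None and reading > limit:
            alerts.append(Alert(device, metric, reading, limit))
    return alerts


# Hypothetical limits and readings, for illustration only.
limits = {"temperature_c": 55.0, "read_error_rate": 0.01}
readings = {"temperature_c": 61.5, "read_error_rate": 0.002, "fan_rpm": 4200.0}

for alert in find_alerts("tape-drive-01", readings, limits):
    print(f"ALERT {alert.device}: {alert.metric}={alert.reading} exceeds {alert.limit}")
```

In this sketch a reading that merely equals its limit does not raise an alert; whether the boundary value counts as out-of-specification is a policy choice that would be agreed with the vendor.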
Monitoring of Key Components
The monitoring of key components consists of the ability to monitor the status of key hardware components on the customer hardware, and to remotely provide proactive detection and notification of hardware degradation and failures.
Advanced Proactive Services
Advanced proactive services include many functions, such as:
- System performance monitoring
- Configuration reporting
- Auto-report to vendor (i.e., call-home)
- Identification of missing software patches
- Pre-failure alert
Online self-help includes web-enabled support documentation, self-help tools, knowledgebase and frequently asked questions (FAQs). Interactive services may also include automatic event notification, software updates and patches, and forums. To be most effective, however, any online self-help tools must be "personalized" to the user in terms of being information-specific to the customer's needs.
Critical Path and Resilience Planning (Solution Services)
What brings it all together, and ensures that the organization can successfully attain its desired level of availability, is the set of critical path and resilience planning solution services it utilizes. These planning services, provided directly by the high availability support services vendor, lay the foundation for what will ultimately define the organization's degree of high availability.
Key elements of this planning process include:
- Performing an on-site evaluation and risk assessment
- Identifying specific customer availability requirements, including a review of the systems management requirements, operating procedures, technology and support resources
- Assisting the customer in the design, implementation and performance tuning of direct-connect and Storage Area Network (SAN) solutions, including:
  - Capacity and performance planning
  - Storage-consolidation consulting
  - High availability planning
  - Backup and restore consulting
  - SAN architecture and design services
  - Disk management review
- Developing the detailed customer support plan
This is the stage in the overall process where the organization's high availability goals are set, its systems and resources are thoroughly reviewed, the support process is put into place, and the progress of the ongoing program begins to be closely monitored.
Availability Contracts (Operational Services)
The availability contract represents the commitment of both the customer organization and its high availability support services vendor to provide an environment where the desired levels of availability can be consistently attained. This is also where the specific terms of the agreement are laid out in a Service Level Agreement (SLA).
The comprehensive services contained in the SLA must address all proactive and reactive hardware, software and network support; education and training; and disaster contingency and recovery services in order to be effective. The vendor must be relied upon to create innovative and comprehensive contracts that cover all of the organization's major network components and software applications. It must also create a set of repeatable, standardized high availability support offerings that are flexible enough to meet the requirements of customized SLAs. Gartner Dataquest believes that the most successful high availability support services providers will be the ones who "seize upon the growing user desire for contracts that compensate for lost revenues due to any unplanned downtime".
A typical high availability support services SLA checklist will be defined on the basis of:
- Balanced expectations (on the part of both the provider and the customer)
- Specific user expectations and requirements
- Function, service type and locale
- Tracking capability (i.e., reports, tracking tools, feedback sessions, etc.)
- Framework for escalation, root-cause analysis and resolution
It will also be critical for both parties to agree on the line item details of the SLA before the contract is signed and implemented. This way, there can be no "he said/she said" or finger-pointing once the program is underway - both parties remain protected.
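As one illustration of the tracking item in the checklist above, achieved availability for a reporting period can be compared against the level agreed in the SLA. This is a sketch under stated assumptions: the 30-day period, the outage durations and the target levels are hypothetical, and a real SLA would spell out how outages are measured and which ones count.

```python
from typing import Iterable


def measured_availability(period_minutes: float, outage_minutes: Iterable[float]) -> float:
    """Return achieved availability (percent) over a reporting period."""
    if period_minutes <= 0:
        raise ValueError("reporting period must be positive")
    downtime = sum(outage_minutes)
    return 100.0 * (period_minutes - downtime) / period_minutes


def meets_target(period_minutes: float, outage_minutes: Iterable[float], target_percent: float) -> bool:
    """Return True if achieved availability is at or above the agreed target."""
    return measured_availability(period_minutes, outage_minutes) >= target_percent


# Hypothetical figures: a 30-day reporting period with two unscheduled outages.
PERIOD = 30 * 24 * 60  # 43,200 minutes
outages = [3.0, 1.5]   # minutes of unscheduled downtime

achieved = measured_availability(PERIOD, outages)
print(f"Achieved availability: {achieved:.4f}%")
print("Meets 99.9% target: ", meets_target(PERIOD, outages, 99.9))
print("Meets 99.99% target:", meets_target(PERIOD, outages, 99.99))
```

With these illustrative numbers, 4.5 minutes of downtime in a month satisfies a 99.9% target but falls just short of 99.99%, which is the kind of line-item detail both parties would want settled before signing.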
What Should Your Organization Look for in Using High Availability Support Services?
There are many vendor organizations that claim to be high availability support services providers, although each may define the levels of support they provide differently. To ensure that your organization only considers those vendors that have the proper resources and capabilities, there is a short checklist that may be used to evaluate the alternative options. Specific characteristics to look for include:
- Do they have a customer orientation, the required technical competence and a heterogeneous system skill set? In other words, are they more attuned to the specific customer's needs than to simply selling a pre-packaged availability solution? Do they have the requisite technical skills to provide the level of customization that is necessary? And, are their support capabilities applicable across more than just a small handful of system types and configurations?
- Do they have a strong reputation in the storage, maintenance and availability sector?
- Do they have a knowledgeable, experienced and well-trained employee base (not only in support, but also in sales and account management)?
- Do they have an extensive geographic scope of service operations (i.e., sales, account management, customer service, technical support, field engineers, etc.)? And, can they support the organization's global (i.e., domestic and international) operations with a full portfolio of availability services?
- Do they have an adequate services infrastructure to provide the organization with all of its required availability support services both today and tomorrow (i.e., do they fully understand the organization's present and emerging availability support services needs)?
- Can they provide a single point of accountability for heterogeneous system environments (i.e., multi-vendor support services, or MVSS)?
- Do they have the proper breadth of service offerings, and the ability to match the right ones to the customer organization's specific needs?
Of course, there are many other factors that will also need to be addressed before selecting the appropriate high availability support services vendor - but these are among the most important for differentiating effectively among the alternative vendor choices.
At the very least, the customer organization will want to select a high availability support services provider that (Figure 2):
- Focuses its service delivery model on remote diagnostics/remote fix, supported by on-site field engineers, technical support, and web-enabled self-help, as required;
- Offers unlimited remote support tools (i.e., reactive, proactive and self-help, etc.);
- Embraces a combined proactive/predictive service philosophy;
- Has fully integrated (i.e., MVSS) systems support capability;
- Provides totally customer-focused service offerings;
- Has a consistent and global service infrastructure and processes;
- Has both an integrated and common product/solution set; and
- Provides "best-in-class" service performance.
The Customer Support Continuum
However, simply meeting the basic desired state requirements for a high availability support services provider will not necessarily mean that the vendor is right for the customer's organization - it must also comply with the Customer Support Continuum (as illustrated in Figure 3). The Customer Support Continuum identifies the process that takes place once a customer begins to move into a high availability environment; and it does so by matching the service philosophy of the selected vendor against its own availability requirements.
The continuum first addresses the basic set of "value services" that may be defined as the "high quality, responsive maintenance support" services that normally represent the basic levels of services that are required. These are generally "reactive" in nature, and are delivered in situations well below the business-critical level. Further along the Customer Support Continuum are "enhanced services", or those services that "reduce unplanned downtime through proactive (rather than 'reactive') support and rapid resolution". Again, these services are typically delivered at a stage somewhat lower than one defined as "business-critical".
Finally, the level of "high availability support services" is the only one that goes beyond merely "reactive" and "proactive" support in terms of providing "systems availability management through personalized and predictive support". As such, it is the only level that addresses the "business-critical" support needs of the customer - it represents the pinnacle of matching the vendor's service philosophy against the business-critical nature of the customer's operations.
Bridging the Gap to High Availability
Bridging the gap to high availability is the ultimate goal of the business-critical organization. Making it happen, however, may be another thing entirely. The goal of high availability, for both the vendor and the customer, is to "drive improvements in service delivery efficiencies that result in faster problem resolution and increase customer satisfaction." This will require moving both organizations from an initial "reactive" model, through a more "proactive" stage, up to a truly high availability environment (Figure 4).
In the earliest stage of the initial "reactive" model, the customer organization generally finds itself requiring a "design for serviceability" that, if not yet implemented, may result in "days" of downtime, costing significant dollars, both per incident and cumulatively over time. Organizations with an "on-site repair" program in place may find that annual downtime is reduced to "hours", and that the costs associated with unplanned outages are reduced accordingly. Those organizations that have stepped up to a "contact center diagnosis & repair" mode find that their annual levels of downtime may be further reduced to "many minutes", resulting in still lower annual costs for downtime.
Only by moving to a "remote & proactive monitoring" mode (in the "proactive" stage of the process) can the organization see its annual downtime reduced to merely "minutes", and its associated costs decline to a much smaller level. However, by moving all the way to a "high availability" mode, and by utilizing "resilience planning & availability contracts", the organization can reduce its downtime to virtually "none", and the associated costs to virtually "nothing".
Some examples of the key components of an availability offering that provide the customer with the necessary high availability support include (but are not limited to):
Remote and Proactive Services
- 24 x 7 x 365 access to a Customer Call Center
- 24 x 7 x 365 access to a Customer Resource Center
- 24 x 7 x 365 access to an Assigned Customer CSE
- 24 x 7 x 365 Spare Parts Support (and on-site spares)
Availability Contracts (CS) & Resilience Planning (SP)
- Remote Tools (such as StorageTek's RS Toolbox™)
- Assigned Systems Engineer
- Enhanced Escalation Management
- 24 x 7 x 365 access to Level 3 Support
- Patch Management Assistance
- Firmware Updates
- Critical Patch Notification
- New Patch Notification
- Optional Network Support (e.g., SANCare™ 6000 support)
- 24 x 7 x 365 Hardware Coverage
- Hardware Response (e.g., 4-hour call time-to-repair)
- 24 x 7 x 365 Software Coverage
- Software Response (e.g., immediate, via remote, for critical problems; 2-hour target response for non-critical problems)
- Software Call-to-Resolution (e.g., 4-hour, 15-day permanent fix)
- Operational Assessments
- Customer Operational Design
- Service Level Agreement (SLA)
- System Release Documentation and Seminar
- Change Review and Planning
- Assigned System Engineer
- Operational Reviews (e.g., 1 technical review per month; and 4 delivery support per year)
Of course, there are many other components that should also be part of a high availability support services package; but the above list represents some of the key differentiating features that may ultimately distinguish one vendor from another. All of these features are available through one of StorageTek's TekCare™ High Availability Services offerings, beginning with our basic TekCare Support Services focusing on hardware and software support, a Customer Support Service Center, Customer Resource Center, Spares Support (Logistics) and Support Tools.
The TekCare Plus Extended Services offering expands upon our basic services to include multi-vendor support services (MVSS), channel services, equipment relocation (CS/PS), SVA services, VSM services and software implementation. Our top-of-the-line TekCare Solutions offering includes total Storage Area Network support (i.e., SANCare), SN6000 ADIM, data center services, enterprise back-up and recovery, tape assessment & design, and medical consulting - all necessities for the customer's business-critical environment.
Pollock agrees that "high availability is not conducive to a 'one size fits all' solution. There must be options from which customers may choose the best solution to fit their business-critical requirements. That is why the StorageTek solutions are so effective - they give the customer several options from which to choose the most appropriate solution."
Moving Toward High Availability
Once an organization determines that it requires a high availability solution for its business-critical systems, there are many areas that will need to be addressed. First, the organization will need to determine whether it has reached a stage where it can meet its own internal high availability requirements (as defined by the vendor); second, it will have to select the vendor that can best match its needs with the most appropriate and effective solution; third, it will have to identify who within the organization should have the direct responsibility to manage the program; and fourth, it will need to determine what gets covered, and with what SLA level of coverage. None of these decisions will be easy to make - and all will require a great deal of thought and consideration.
The desired level of high availability can only be attained when the proper "mix" of internal and external availability-oriented products and services is configured just right for the customer organization. Accordingly, the right "mix" of products and services will first need to be identified. For example, the combined use of StorageTek's RS Toolbox product (featuring a single point of management using one tool, vendor management, Service Level Agreements (SLAs) and complete account ownership) with the appropriate level of TekCare High Availability Services will assure the customer organization that its high availability needs will be effectively met.
High availability is not a "best effort" solution, nor is it a service for which all customers and vendors are equally qualified. It is, however, a means to ensure that there is no unplanned downtime in business-critical systems resulting from either internal or external causes. It is also a solution that requires total partnership between the customer and its vendor - at all times - for the good of the customers and end users ultimately being served.
Dennis S. Lynton is Services Marketing manager at StorageTek, a global leader in virtual storage solutions for tape automation, disk subsystems and storage networking. He may be reached at +1.303.661.4723 or via e-mail at email@example.com.