Organisations are now nearly always dependent on IT services in order to support their business practices and to provide goods and services to their customers. This increased dependency on IT means that services need to be protected from extended unavailability. This is where IT Service Continuity Management (ITSCM) fits in. It is a critical discipline that will not only maintain and preserve services but it will genuinely contribute to the very survival and continuance of an organisation.
Many organisations do not take ITSCM as seriously as perhaps they should, adopting the attitude 'it will never happen to us'. The cost of not implementing effective ITSCM is often overlooked as Senior Management do not see any tangible benefit or return on investment unless ITSCM is invoked.
Statistics indicate that an extremely high number of businesses fail within a year of a major disaster if they have not taken steps to protect themselves adequately. Where businesses are companies quoted on a stock exchange, their share prices can be adversely affected if investors believe that appropriate steps were not taken to mitigate the impact of disaster. Another significant consideration is that ITSCM helps protect the company's bottom line, may reduce insurance premiums and assists in retaining its credibility and standing.
However, ITSCM is not the complete solution and should be considered as an integral part of organisation-wide Business Continuity Management (BCM) and the resulting Business Continuity Plan (BCP). Ideally the business will drive BCM holistically and ITSCM will benefit from that. In practice it is often IT that starts the ball rolling and then the business latches on. Whichever way it starts, holistic BCM is a key element in an organisation's long-term survival strategy.
So where do you start?
ITSCM can be broken down into 4 (project) phases:
1. Business Continuity Management Initiation. The activities to be considered depend on what contingency arrangements already exist and their quality. There may be continuity plans based around manual workarounds for individual systems and/or the IT department may have contingency plans for systems it considers to be critical. This is a good starting point but effective ITSCM must support all critical business functions and ensure that available resources are focused on these, especially if money is tight.
2. Requirements and Strategy. This really covers two areas: defining requirements and then sorting out a strategy to support them. The first stage is a Business Impact Analysis (BIA), which identifies the organisation's critical business processes and the IT systems that support them. The BIA quantifies (often in money terms) the potential damage that the organisation could suffer as the result of a disruption to critical business processes. You can see from this why ITSCM cannot work alone, as all this requires detailed input and support from the business.
Once the critical business processes have been identified a Risk Assessment should be undertaken to:
- Identify risks to specific IT Service components that support critical business processes and whose failure would cause an interruption to service (such as a data centre housing key servers on a flood plain in an area where flooding is relatively frequent).
- Assess threat - 'how likely it is that a service disruption will occur' and vulnerability - 'whether, and to what extent, the organisation will be affected by the threat materialising'.
- Assess the levels of risk so that the overall risk can be calculated. If quantitative information is available this could be a measurement (percentage likelihood). Otherwise it could be qualitative, using subjective judgement (defining any assumptions and caveats) and grading risks as low, medium or high.
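The qualitative grading described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 1 to 3 scores, the cut-offs for the overall grade, and the names Risk, LEVELS, rank and register are all hypothetical and are not taken from ITIL or any particular tool. Threat and vulnerability are each graded low, medium or high and then multiplied to give an overall level of risk.

```python
from dataclasses import dataclass

# Ordinal scores for the qualitative grades (illustrative assumption, not from ITIL).
LEVELS = {"low": 1, "medium": 2, "high": 3}


@dataclass(frozen=True)
class Risk:
    component: str      # IT service component supporting a critical business process
    threat: str         # how likely a service disruption is: low, medium or high
    vulnerability: str  # how far the organisation is affected if the threat materialises

    def score(self) -> int:
        """Threat multiplied by vulnerability; possible values are 1, 2, 3, 4, 6 and 9."""
        return LEVELS[self.threat] * LEVELS[self.vulnerability]

    def grade(self) -> str:
        """Map the score back to an overall grade. The cut-offs are assumptions."""
        if self.score() >= 6:
            return "high"
        if self.score() >= 3:
            return "medium"
        return "low"


def rank(risks):
    """Return risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score(), reverse=True)


# Hypothetical register entries, for illustration only.
register = [
    Risk("Backup tapes stored on-site", threat="medium", vulnerability="medium"),
    Risk("Data centre on a flood plain", threat="high", vulnerability="high"),
    Risk("Single network circuit", threat="low", vulnerability="medium"),
]

for risk in rank(register):
    print(f"{risk.grade():<6} {risk.component}")
```

Whatever scale is chosen, the assumptions and caveats behind each grade should be recorded alongside it so that the ranking can be challenged and revised.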
Once the risks are known the next step is to consider counter-measures/risk reduction options (such as ensuring backups of data are protected [off-site storage] and proven to work or multiple routing of critical network circuits). The options will need to balance the need for timely recovery with the cost of achieving it.
This is often a sticking point as potential loss is being measured against actual/real expenditure and the outcome is often based on how the organisation perceives/manages risk, for example:
- Does it feel lucky?
- What is the impact on short term financial performance?
- Is it prepared to pay the 'BCM insurance premium' or is the short-term bottom-line more important?
- How does this square with acceptable corporate governance and legislative obligations?
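One common way to put potential loss and real expenditure on the same footing is to compare the expected annual loss with and without a counter-measure against the yearly cost of that counter-measure. The Python sketch below assumes that a yearly likelihood of disruption and a money figure for its impact (for instance from the BIA) are available. The function names and all figures are hypothetical, and the calculation deliberately ignores governance and legislative obligations, which may justify a measure even when the arithmetic does not.

```python
def expected_annual_loss(annual_probability: float, impact_cost: float) -> float:
    """Likelihood of a disruption in a year multiplied by its estimated cost."""
    if not 0.0 <= annual_probability <= 1.0:
        raise ValueError("annual_probability must be between 0 and 1")
    if impact_cost < 0:
        raise ValueError("impact_cost must not be negative")
    return annual_probability * impact_cost


def net_benefit(prob_before: float, prob_after: float,
                impact_cost: float, annual_measure_cost: float) -> float:
    """Reduction in expected annual loss minus the yearly cost of the measure.

    A positive result suggests the measure pays for itself on these
    estimates; a negative one means it is being bought as 'insurance'.
    """
    reduction = (expected_annual_loss(prob_before, impact_cost)
                 - expected_annual_loss(prob_after, impact_cost))
    return reduction - annual_measure_cost


# Illustrative figures only: a measure costing 25,000 a year that is assumed
# to cut the yearly likelihood of a 500,000 disruption from 10% to 2%.
result = net_benefit(prob_before=0.10, prob_after=0.02,
                     impact_cost=500_000, annual_measure_cost=25_000)
print(f"Net annual benefit: {result:,.0f}")
```

The estimates feeding such a calculation are usually rough, so it is best treated as a way of structuring the discussion with the business and not as a precise answer.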
So what are the Recovery Options for ITSCM?
- Do nothing - Popular in the past as it costs nothing! It is a high-risk strategy these days and is only acceptable where recovery from system failure is genuinely not needed. I personally cannot think of a system that is so disposable, but maybe they do exist!
- Manual Workarounds - Manual Workarounds can be an effective interim measure until the IT Service is resumed wherever they are practical and possible. It is up to individual business units to work out whether this is feasible. As IT is becoming part of the business infrastructure the opportunities for manual workarounds diminish as the administrative overhead can become considerable and 'catch up' difficult.
- Reciprocal Arrangements - This used to be an effective contingency option when the IT workload was essentially batch processing. The more complex environments of today make it increasingly less viable. There could be some benefits in some reciprocal arrangements such as in the off-site storage of backups and other critical information.
- Gradual Recovery/Cold Standby - This applies to organisations that do not need business processes to be restored immediately. They can function for a period of at least 72 hours, or longer, without some or all of their IT facilities. Typically this is provided as empty server room(s) equipped with power, network cabling and external comms circuits. This is then made available in a disaster situation for an organisation to install its own computer equipment. Provision of this sort is generally offered by specialist service providers and organisations then negotiate contracts with these. It is important to build into the contract where you are in the order of recovery; for example, if several organisations invoke recovery at the same time there may be insufficient resource for all of them and it could become first come, first served.
- Intermediate Recovery/Warm Standby - This applies to organisations that need to recover critical systems and services within a 24 to 72 hour period. The most common approach is to use third party recovery providers who offer shared facilities to a limited number of subscribers, thus spreading the cost. These facilities often include operational, system management and technical support. The cost is dependent on the facilities requested and how quickly the services need to be restored. Again, it is important to build into the contract where you are in the order of recovery; for example, if several organisations invoke recovery at the same time there may be insufficient resource for all of them and it could become first come, first served.
- Immediate Recovery/Hot Standby - This provides immediate restoration of critical systems and services. It is usually an extended version of Intermediate Recovery and again is typically provided by a third party recovery provider. Immediate Recovery is accompanied by the recovery of other critical business and support areas during, say, the first 24 hours following a service disruption. For even shorter timeframes, having a distributed IT infrastructure and mirroring critical systems and data across geographically separate sites could be an option for those organisations that need this.
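The timeframes quoted for the standby options above can be read as a simple decision rule. The Python sketch below is an illustration and not a definitive classification: the function name recovery_option is hypothetical, treating anything under 24 hours as a case for Immediate Recovery is an assumption, and so is the choice to place exactly 72 hours under Gradual Recovery. Manual workarounds and reciprocal arrangements are left out because the text gives no timeframe for them.

```python
from typing import Optional


def recovery_option(max_outage_hours: Optional[float]) -> str:
    """Suggest a standby option from the longest outage the business can tolerate.

    Pass None when recovery from failure is genuinely not needed.
    """
    if max_outage_hours is None:
        return "Do nothing"
    if max_outage_hours < 0:
        raise ValueError("max_outage_hours must not be negative")
    if max_outage_hours >= 72:
        # Can function for at least 72 hours without IT facilities.
        return "Gradual Recovery/Cold Standby"
    if max_outage_hours >= 24:
        # Critical systems needed back within a 24 to 72 hour period.
        return "Intermediate Recovery/Warm Standby"
    # Assumption: anything tighter than 24 hours calls for immediate recovery.
    return "Immediate Recovery/Hot Standby"


# Hypothetical tolerances for three business processes, for illustration only.
for process, hours in [("Payroll", 96), ("Order entry", 36), ("Online payments", 1)]:
    print(f"{process}: {recovery_option(hours)}")
```

In practice the tolerance would come from the BIA, and the suggested option still has to be weighed against its cost.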
Any option chosen for recovery must be tested regularly. These tests must be as realistic as possible and all elements that do not work as required must be fixed before the next test. This is definitely an action that needs to be built into the Continuous Service Improvement Programme (CSIP).
"If you don't test it you can't be sure it will work when you need it". But do not take risks with the testing as a bravado approach of switching off critical systems in production may lead to consequences that could be personally painful!
3. Implementation. Once the BCM strategy has been agreed the next step involves IT at a detailed level. This stage includes:
- Establishment of the IT and Business Recovery Organisations
- Development of IT Service Continuity Plan (part of the Business Continuity Plan)
- Development of Business Continuity implementation plans
- Implementation of any standby arrangements
- Implementation of agreed risk reduction measures
- Development of IT systems and services recovery plans
- Development of recovery procedures
- Defining and undertaking initial tests
- Defining maintenance and review procedures
4. Operational Management. Once implementation is complete all elements of BCM should be handed over to the managers and teams designated to support and operate it. Typically a manager will be designated (ITIL Service Continuity Manager) to manage the IT elements. Operational activities include:
- Education and awareness - this covers the overall organisation and the IT organisation, in detail, for service continuity activities. The objective is to ensure that all staff are aware of BCM and what they need to do in order to support it.
- Training - This is to ensure that recovery team members are capable of fulfilling their obligations to facilitate recovery.
- Review - regular review of the whole BCP is needed. For IT this is required whenever there is a significant change to any component of production systems/services. Such changes should be done through ITIL Change Management and communicated to the ITIL Service Continuity Manager so as to assess impact before implementation.
- Testing - a programme of regular testing to ensure that the business critical systems and services are tested (typically at least once a year).
- ITIL Change Management - plans need to be updated following tests and reviews and to incorporate changes so the Service Continuity Manager must be involved closely with the Change Management process.
- Assurance - this involves proving that the quality of ITSCM operation meets senior business management requirements and that the associated processes are working satisfactorily.
Benefits of IT Service Continuity Management
The benefits of ITSCM include:
- Controlled recovery of systems
- Reduction of downtime - increased continuity of service to customer
- Minimal disruption to departments' business
The costs of ITSCM include:
- Cost and time of producing the IT Service Continuity Plan
- Cost of third party service continuity provision
- Cost of software packages to support recovery
- Cost of implementation - more equipment
- Cost of maintaining the IT Service Continuity Plan
- Recurring cost of testing/reviewing the IT Service Continuity Plan
- Cost of extra staff for testing the IT Service Continuity Plan
The potential problems of ITSCM include:
- Resourcing the development and implementation (extra staff as 'back-fill' or to build and implement)
- Keeping live systems running when testing system/service recovery
- Financing - Ensuring that budgets are agreed and adhered to