Availability Management

Availability Management’s (AM) methods are part of the measuring process explained in Chapter 4, specifically its gathering, processing and analysing activities. When the information is provided to CSI in the form of a report or a presentation, it then becomes part of CSI’s gathering activity. For more details on each method, please consult the Service Design publication.

AM provides IT with the business and user perspective on how deficiencies in the infrastructure and in the underpinning processes and procedures impact the business operation. The use of business-driven metrics can demonstrate this impact in real terms and help quantify the benefits of improvement opportunities.

AM plays an important role in helping the IT support organization recognize where they can add value by exploiting technical skills and competencies in an availability context. The continual improvement technique can be used by AM to harness this technical capability. This can be used with either small groups of technical staff or a wider group within a workshop environment. The information provided by AM is made available to CSI through the Availability Management Information System (AMIS).

This section provides practical usage and details on how each AM method mentioned below can be used in various activities of CSI.

Component Failure Impact Analysis

Component Failure Impact Analysis (CFIA) identifies single points of failure, IT services at risk from failure of various Configuration Items (CIs) and the alternatives that are available should a CI fail. It should also be used to assess the existence and validity of recovery procedures for the selected CIs. The same approach can be used for a single IT service by mapping the component CIs against the vital business functions and users supported by each component.

When a single point of failure is identified, the information is provided to CSI. This information, combined with business requirements, enables CSI to make recommendations on how to address the failure.
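The CFIA idea can be illustrated with a small sketch. It assumes a simplified grid in which each IT service lists the roles it depends on and the CIs able to fill each role. A role with only one candidate CI has no alternative, so that CI is a single point of failure for the service. The service names, role names and CI names are hypothetical, and a real CFIA would draw this data from the Configuration Management System rather than a hard-coded table.

```python
from collections import defaultdict

# Hypothetical CFIA grid: for each IT service, the roles it depends on
# and the CIs able to fill each role. All names are illustrative only.
SERVICE_ROLES = {
    "email": {
        "server": ["mail-srv-1", "mail-srv-2"],
        "network": ["core-switch"],
        "storage": ["san-1"],
    },
    "erp": {
        "server": ["erp-srv-1"],
        "network": ["core-switch"],
        "storage": ["san-1", "san-2"],
    },
}


def single_points_of_failure(service_roles):
    """Return {ci: sorted services that fail if this CI alone fails}."""
    impact = defaultdict(set)
    for service, roles in service_roles.items():
        for candidates in roles.values():
            # A role that only one CI can fill has no alternative.
            if len(set(candidates)) == 1:
                impact[candidates[0]].add(service)
    return {ci: sorted(services) for ci, services in impact.items()}


spofs = single_points_of_failure(SERVICE_ROLES)
for ci in sorted(spofs):
    print(ci, "->", spofs[ci])
```

In this made-up grid the core switch is the CI whose failure affects the most services, which is the kind of finding that would be passed to CSI together with the business requirements.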

Fault Tree Analysis

Fault Tree Analysis (FTA) is a technique that is used to determine the chain of events that causes a disruption of IT services. This technique offers detailed models of availability. It represents a chain of events using Boolean algebra and notation. Essentially FTA distinguishes between four types of events: basic events, resulting events, conditional events and trigger events.

When provided to CSI, FTA information indicates which part of the infrastructure, process or service was responsible for the service disruptions. This information, combined with business requirements, enables CSI to make recommendations about how to address the fault.
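The Boolean representation used by FTA can be sketched in a few lines. The example below is a deliberate simplification: it models only basic events (leaves) and resulting events (the outputs of AND and OR gates), and leaves out conditional and trigger events. The tree itself and all event names are hypothetical.

```python
# A fault tree node is either a basic event name (a string) or a gate:
# ("AND", child, ...) or ("OR", child, ...). The tree below is illustrative.
SERVICE_DOWN = (
    "OR",
    "network_down",
    ("AND", "primary_server_down", "standby_server_down"),
    ("AND", "power_failure", "ups_failure"),
)


def occurs(node, failed):
    """Return True if the event at `node` occurs, given the set of
    basic events that have happened (`failed`)."""
    if isinstance(node, str):
        return node in failed
    gate, *children = node
    results = [occurs(child, failed) for child in children]
    if gate == "AND":
        return all(results)
    if gate == "OR":
        return any(results)
    raise ValueError(f"unknown gate: {gate}")


# Losing only the primary server is absorbed by the standby.
print(occurs(SERVICE_DOWN, {"primary_server_down"}))
# Losing both servers disrupts the service.
print(occurs(SERVICE_DOWN, {"primary_server_down", "standby_server_down"}))
```

Evaluating the tree against different combinations of basic events shows which combinations lead to the service disruption, and therefore where resilience is missing.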

Service Failure Analysis

Service Failure Analysis (SFA) is a technique designed to provide a structured approach to identify end-to-end availability improvement opportunities that deliver benefits to the user. Many of the activities involved in SFA are closely aligned with those of Problem Management. In a number of organizations these activities are performed jointly by Problem and Availability Management. SFA should attempt to identify improvement opportunities that benefit the end user. It is therefore important to take an end-to-end view of the service requirements.

CSI and SFA work hand in hand. SFA identifies the business impact of an outage on a service, system or process. This information, combined with business requirements, enables CSI to make recommendations about how to address improvement opportunities.

Technical Observation

A Technical Observation (TO) is a prearranged gathering of specialist technical support staff from within IT support. They are brought together to focus on specific aspects of IT availability. The TO’s purpose is to monitor events in real time, as they occur, with the specific aim of identifying improvement opportunities within the current IT infrastructure. The TO is best suited to delivering proactive business and end-user benefits from within the real-time IT environment. Bringing together specialist technical staff to observe specific activities and events within the IT infrastructure and operational processes creates an environment to identify improvement opportunities.

The TO gathers, processes and analyses information about the situation. Too often the TO is reactive by nature and is assembled hastily to deal with an emergency. Why wait? If the TO is included as part of the launch of a new service, system or process for example, a lot of the issues inherent to any new component would be identified and dealt with more quickly.

One of the best examples for a TO is the mission control room for a space agency. All the specialists from all aspects of the mission are gathered in one room. Space agencies don’t wait for the rocket to be launched and experience a problem before gathering specialists to monitor, observe and provide feedback. They set it up well before the actual launch and they practise the monitoring, observing and providing feedback.

Certainly, launching a rocket is very costly, but so is launching a new service, system or process. Can the business afford a catastrophic failure of a new ERP application, for example? Oh, by the way, rocket launches are often aborted seconds before the launch. Shouldn’t organizations (including yours) do the same when someone discovers a major potential flaw in a service or system? CSI starts from the beginning and includes preventing things from failing in the first place. Let’s fix the flaw before it goes into production instead of fixing the fixes (what a concept!). This information, combined with business requirements, enables CSI to make recommendations about how to address the TO’s findings.

Expanded incident lifecycle

Figure 5.7 Expanded incident lifecycle

First, let’s define a few items:

Availability Management – To optimize the capability of the IT infrastructure, services and supporting organization to deliver a cost-effective and sustained level of availability enabling the business to meet their objectives. The AM process has both a reactive and proactive nature.
Expanded Incident lifecycle – A technique to help with the technical analysis of Incidents affecting the availability of components and IT services. The Expanded Incident lifecycle is further made up of two parts: time to restore service (aka downtime) and time between failures (aka uptime). There is a diagnosis part to the Incident lifecycle as well as repair, restoration and recovery of the service.

Let’s assume that CSI has decided to improve the incident lifecycle by reducing the mean time to restore service (MTRS) and expanding the mean time between failures (MTBF).
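Before either figure can be improved it has to be measured. The sketch below shows one way to derive both from an incident log, assuming each incident is recorded as the hour at which the service failed and the hour at which it was restored. MTRS is taken as the mean downtime per incident and MTBF as the mean uptime between one restoration and the next failure. The log values are invented for illustration.

```python
from statistics import mean

# Hypothetical incident log: (failed_at_hour, restored_at_hour),
# measured in hours from the start of the reporting period.
INCIDENTS = [(10.0, 12.0), (50.0, 51.0), (100.0, 103.0)]


def mtrs(incidents):
    """Mean time to restore service: average downtime per incident."""
    return mean(restored - failed for failed, restored in incidents)


def mtbf(incidents):
    """Mean time between failures: average uptime from one restoration
    to the next failure. Needs at least two incidents."""
    ordered = sorted(incidents)
    uptimes = [
        next_failed - prev_restored
        for (_, prev_restored), (next_failed, _) in zip(ordered, ordered[1:])
    ]
    return mean(uptimes)


print("MTRS (hours):", mtrs(INCIDENTS))
print("MTBF (hours):", mtbf(INCIDENTS))
```

Tracking these two numbers over successive reporting periods gives CSI a baseline against which the effect of each improvement can be judged.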

Here is an example of how AM can assist in reducing downtime in the expanded Incident lifecycle by using many techniques:

Monitoring (detection of Incident) – By adequately monitoring for availability of vital business functions through automated monitoring tools (set at the right threshold) that record and escalate incidents, the time it takes to detect and record incidents is reduced.
Incident recording – Since one of AM’s goals is to ‘optimize the... support organization’, educating and training first-line staff as well as simplifying and/or automating Incident recording helps reduce the time it takes to record Incidents.
Investigation – Using the FTA method, AM assists in reducing the time to investigate by creating proper investigation procedures for Incident management staff. The same logic applies to the diagnosis of the Incident cause, resolution and recovery.
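
Each of the techniques above targets one phase of downtime, so it helps to know which phase costs the most. The sketch below assumes that every incident record carries a timestamp (in minutes since the failure occurred) for a handful of lifecycle milestones. It then averages the time spent between consecutive milestones. The milestone names and the sample data are hypothetical, and a real service desk tool may record different or fewer milestones.

```python
from statistics import mean

# Lifecycle milestones in order; names are illustrative.
MILESTONES = ["occurred", "detected", "recorded", "diagnosed", "repaired", "restored"]

# Hypothetical incidents: minutes after the failure at which
# each milestone was reached.
INCIDENTS = [
    {"occurred": 0, "detected": 20, "recorded": 25,
     "diagnosed": 55, "repaired": 70, "restored": 80},
    {"occurred": 0, "detected": 40, "recorded": 50,
     "diagnosed": 70, "repaired": 100, "restored": 105},
]


def mean_phase_durations(incidents):
    """Average minutes between consecutive milestones,
    keyed by the milestone that ends each phase."""
    return {
        end: mean(incident[end] - incident[start] for incident in incidents)
        for start, end in zip(MILESTONES, MILESTONES[1:])
    }


durations = mean_phase_durations(INCIDENTS)
slowest = max(durations, key=durations.get)
print(durations)
print("Largest contributor to downtime:", slowest)
```

With this sample data detection is the longest phase, which would point CSI towards the monitoring improvements described above before anything else.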

Here is an example of how AM can assist in increasing up-time in the expanded Incident lifecycle by using many techniques:
