How to Define, Measure, and Report IT Service Availability

Let’s talk about IT service availability. Service availability really matters. When services that a customer expects to be able to access aren’t available, that customer is going to be unhappy. After all, why should the customer pay for a service that isn’t there when they need it? This, of course, is why an agreed measure of service availability is so often a key performance indicator (KPI) in IT service management (ITSM).

IT staff often work very hard to see that the agreed target is met, and provide figures proving it has been met when reporting to customers. Typically, IT organizations use a percentage, such as 99.999% service availability, to do this. Unfortunately, this sometimes means that IT service organizations focus on the percentage measure and lose sight of their true goal – providing value for customers. Please keep reading to understand more, including some service availability examples.

The trouble with percentage service availability

A simple service availability definition is the percentage of time your service is available. One of the simplest ways to calculate service availability is based on two numbers, and you might remember this from your ITIL training. You agree the amount of time that the service should be available over the reporting period. This is the agreed service time (AST). You measure any downtime (DT) during that period. You take the downtime away from the agreed service time, and turn this into a service availability percentage.

Service availability calculation

Availability = 100% x (AST – DT) / AST

In terms of service availability examples, if AST is 100 hours and downtime is 2 hours, then the service availability would be:

Availability = 100% x (100 – 2) / 100 = 98%

The trouble with this is that, while this service availability calculation is easy enough to perform, and collecting the data to do it seems straightforward, it’s really not at all clear what the number you end up with is actually telling you. I’ll go into a bit more detail about this later in the blog.
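
For reference, here’s that basic calculation as a minimal Python sketch (the function name is mine, purely for illustration):

def availability_percent(agreed_service_time_hours, downtime_hours):
    """Return availability as a percentage of the agreed service time (AST)."""
    return 100.0 * (agreed_service_time_hours - downtime_hours) / agreed_service_time_hours

# The worked example above: 100 hours of AST with 2 hours of downtime.
print(availability_percent(100, 2))  # 98.0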

Worse still, from the customer’s point of view, you may be reporting that you have met agreed service availability goals, while leaving the customer totally unsatisfied.

Customers are only interested in percentage service availability insofar as it correctly identifies their ability to use IT services to support business processes – and a blanket percentage figure is probably not going to do that. A meaningful service availability report needs to be based on measurements that describe things the customer cares about, for example, the ability to send and receive emails, or to withdraw cash from ATMs.

Defining service availability targets

If you want to measure, document, and report service availability in ways that will be helpful to your organization and your customers you need to do two things. Firstly, you need to understand the context. To do this you’ll need to talk to your customers.

Secondly, you need to think very carefully about a range of practical issues: what will you measure, how will you collect your data, and how will you document and report your service availability findings.

Talking to customers

Before you do anything else, you need to work out what it is that your customers need service availability for, and what impact the loss of service availability has on them. This will allow you to agree realistic goals that consider technology, budgetary, and staffing constraints. In other words, you need to talk to your customers to ensure you understand what they need and, if necessary, to help them understand that “I want it to be available all the time” is probably going to cost more than it’s ever going to be worth.

But what exactly should you be talking to your customers about? An excellent starting point is the impact of downtime. Here are five service availability questions you should consider asking:

  • Which of your business functions are so critical that protecting them from downtime is a priority?
  • How does the length of any downtime affect your business?
  • How does the frequency of downtime affect your business?
  • What impact do service availability and downtime have on your organization’s productivity?
  • What impact do service availability and downtime have on your organization’s customers?

Critical business functions

Most IT services support several business processes; some of these are critical, while others are less important in service availability terms. For example, an ATM may support cash dispensing and statement printing. The ability to dispense cash is critical, but if the ATM can’t print statements this has a much lower impact.

You need to talk to your customers about service availability and reach an agreement on the importance of the various functions to their business. You may find it helpful to draw up a table that indicates the relative business impact of losing each of these functions. Table 1 is a service availability example:

Table 1 – Percentage degradation of service

IT function that is not available    % degradation of service
Sending email                        100%
Receiving email                      100%
Reading public folders               50%
Updating public folders              10%
Accessing shared calendars           30%
Updating shared calendars            10%

NB: Figures are not intended to add up to 100%

It’s clear from this table that the service has no value at all if it cannot send or receive emails, and that the value of the service is reduced to half its normal level if public folders cannot be read. This tells the IT organization where to focus its service availability efforts when designing and managing the email service.
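
If you capture these agreed degradation figures in a machine-readable form, you can use them to weight downtime by business impact. Here’s a rough Python sketch of one way to do that; the data structure, the outage events, and the weighting approach are illustrative only, not a standard formula:

# Agreed % degradation for each function, taken from Table 1.
DEGRADATION = {
    "sending email": 1.00,
    "receiving email": 1.00,
    "reading public folders": 0.50,
    "updating public folders": 0.10,
    "accessing shared calendars": 0.30,
    "updating shared calendars": 0.10,
}

def weighted_downtime_hours(outages):
    """Sum outage durations, each weighted by the agreed degradation for the lost function."""
    return sum(hours * DEGRADATION[function] for function, hours in outages)

# Two hypothetical outages: 1 hour with no email sending, 4 hours without public folder reads.
print(weighted_downtime_hours([("sending email", 1.0), ("reading public folders", 4.0)]))  # 3.0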

Duration and frequency of downtime

You need to find out how the customer’s business is affected by both the frequency and the duration of downtime. I’ve already mentioned that percentage service availability may not tell you enough to be of value. When a service that should be available for 100 hours has 98% service availability, that means there were two hours of downtime. But this could mean a single two-hour incident, or many shorter incidents. The relative impact of a single long incident versus many shorter ones will differ, depending on the nature of the business and the business processes involved.

For example, a billing run that takes two days to complete and must be restarted after any outage will have its service availability seriously impacted by every short outage, but one outage that lasts a long time may be less significant. On the other hand, a web-based shopping site may not be impacted by a one-minute outage, but after two hours the loss of customers could be significant. Once you know the likely impact, you’re in a much better position to put in place infrastructure, applications, and processes that will really support the customer, to devise meaningful targets, and to find ways of documenting and reporting the service availability appropriately.

Here’s a service availability example of how you could measure and document service availability to reflect the fact that the impact of downtime varies:

Table 2 – Outage duration and maximum frequency

Outage duration            Maximum frequency
Up to 2 minutes            2 events per hour
                           5 events per day
                           10 events per week
2 minutes to 30 minutes    2 events per week
                           6 events per quarter
30 minutes to 4 hours      4 events per year
4 hours to 8 hours         1 event per year

If you use a table like this when you’re discussing the frequency and duration of downtime with your customers, the numbers are likely to be much more useful than percentage service availability, and they’ll certainly be more meaningful to your customers.
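
If you record the duration of each outage, you can check the figures against agreed limits like those in Table 2. The sketch below is deliberately simplified: it only counts events per duration band over a single reporting period, and the band edges and limits are illustrative rather than taken from a real agreement.

from datetime import timedelta

# Duration bands with illustrative per-period event limits: (upper bound, label, limit).
BANDS = [
    (timedelta(minutes=2), "up to 2 minutes", 10),
    (timedelta(minutes=30), "2 minutes to 30 minutes", 2),
    (timedelta(hours=4), "30 minutes to 4 hours", 1),
    (timedelta(hours=8), "4 hours to 8 hours", 1),
]

def band_breaches(outage_durations):
    """Count outages per duration band and flag any band that exceeds its limit.

    Outages longer than the largest band (8 hours) are ignored in this sketch.
    """
    counts = {label: 0 for _, label, _ in BANDS}
    for duration in outage_durations:
        for upper_bound, label, _ in BANDS:
            if duration <= upper_bound:
                counts[label] += 1
                break
    return {label: (counts[label], limit, counts[label] > limit) for _, label, limit in BANDS}

# Three hypothetical outages in one reporting period.
outages = [timedelta(minutes=1), timedelta(minutes=1), timedelta(minutes=45)]
print(band_breaches(outages))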

Downtime and productivity

I’ve said that percentage service availability is not very useful for talking to customers about the frequency and duration of downtime. In contrast, when you discuss the impact of downtime on productivity, percentage impact can be a very useful measure indeed.

Most incidents don’t cause complete loss of service for all users. Some users may be unaffected, while others have no service at all. At one extreme, there may be a single user with a faulty PC who cannot access any services. You might class this as 100% loss of service, but this would leave IT with a totally unrealistic goal, and would not be a fair measurement of service availability.

At the opposite extreme, you might decide to say that a service is available so long as someone, somewhere, can still access it. However, you don’t need much imagination to understand how customers would feel if full service availability is being reported while many people simply can’t access the service.

One way you can quantify service availability impact is to calculate the percentage of user minutes that were lost. To do this:

  • Calculate the PotentialUserMinutes. This is the total number of users times the length of time that they work. For example, if you have 10 staff who work for 8 hours then the PotentialUserMinutes is 10 x 8 x 60 = 4800
  • Calculate the UserOutageMinutes. This is the total number of users who were not able to work, multiplied by the time they were unable to work. For example, if an incident prevented 5 people from working for 10 minutes then the UserOutageMinutes is 50.
  • Calculate the percentage service availability using a very similar formula to the one we saw earlier:

Availability = 100% x (PotentialUserMinutes – UserOutageMinutes) / PotentialUserMinutes

In this example, you would calculate the service availability as:

Availability = 100% x (4800 – 50) / 4800 = 98.96%

You can use the same technique to calculate the impact of lost service availability for IP telephony in a call center, in terms of PotentialAgentPhoneMinutes and LostAgentPhoneMinutes. For applications that deal with transactions or manufacturing, you can use a similar approach to quantify the business impact of an incident: compare the number of transactions you would have expected without downtime to the number of actual transactions, or the predicted production volume to the actual production.
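
Here’s a minimal Python sketch of the user-minutes calculation above; the same pattern applies to agent phone minutes or transaction counts (the function name is mine, for illustration only):

def user_minutes_availability(potential_user_minutes, user_outage_minutes):
    """Availability % based on lost user minutes rather than raw clock time."""
    return 100.0 * (potential_user_minutes - user_outage_minutes) / potential_user_minutes

# 10 staff x 8 hours = 4800 potential user minutes; one incident stopped
# 5 people from working for 10 minutes = 50 user outage minutes.
print(round(user_minutes_availability(4800, 50), 2))  # 98.96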

Measuring and reporting service availability

After you’ve agreed and documented your service availability targets, you need to think about practical aspects of how you can measure and report service availability. For example:

  • What will you measure?
  • How will you collect your service availability data?
  • How will you document and report your service availability findings?

Service availability metrics: what to measure

It’s essential to measure and report service availability in terms that can be compared to targets that have been agreed with customers and that are based on a shared understanding of what the customer’s service availability needs actually are. The targets need to make sense to the customer, and to ensure that the IT organization’s efforts are focused on providing support for the customer’s business needs.

Usually, the targets will form part of a service level agreement (SLA) between the IT organization and the customer. Be careful, though, that meeting the numbers in the SLA doesn’t become your goal. The numbers in the SLA are simply agreed ways of measuring; the real goal is to deliver services, including service availability, that meet your customers’ needs.

How to collect your service availability data

There are many different ways that you could collect data about service availability. Some are simple but not very accurate; others are more accurate but more expensive to implement. You may want to focus on just one approach, or you may need to combine several of them to generate your reports.

Collecting service availability data at the service desk

One way to collect service availability data is via the service desk. Service desk staff identify the business impact and duration of each incident as a routine part of managing incidents. You can use this data to identify the duration of incidents and the number of users impacted.

This approach is generally fairly inexpensive. However, it can lead to disputes about the accuracy of the service availability data.

Measuring the service availability of infrastructure and applications

This approach involves instrumenting all the components required to deliver the service and calculating service availability based on understanding how each component contributes to the end-to-end service.

This can be very effective, but it may miss subtle failures; for example, a minor database corruption could result in some users being unable to submit particular types of transaction. This method can also miss the impact of shared components. For example, one of my customers had regular downtime for their email service due to unreliable DHCP servers in their HQ, but the IT organization did not register this as email downtime in their service availability calculation.

Using dummy clients

Some organizations use dummy clients to submit known transactions from particular points on the network to check whether the service is functioning. This does actually measure end-to-end service availability. Depending on the size and complexity of the network this can be quite expensive to implement, and it can only report the service availability from the particular dummy clients. This means that subtle failures may be missed, for example if a change means that clients running a particular web browser no longer work correctly, but the dummy clients use a different browser.

Tools that support this data collection often report service performance, as well as service availability, and this can be a useful addition.
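
As a rough illustration, here’s a minimal dummy-client probe in Python. It assumes the service exposes an HTTP endpoint; the URL, timeout, and success test are all placeholders, and a real probe would submit a known business transaction and check the content of the response rather than just the status code:

import urllib.request
from datetime import datetime, timezone

# Hypothetical endpoint, used only for illustration.
SERVICE_URL = "https://example.com/health"

def probe_once(url=SERVICE_URL, timeout_seconds=5.0):
    """Return True if the service answered with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return 200 <= response.status < 300
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return False

print(datetime.now(timezone.utc).isoformat(), "available" if probe_once() else "unavailable")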

Instrumenting applications

Some organizations include code in their applications to report end-to-end service availability. This can actually measure end-to-end service availability, provided that the requirement is included early in the application design. Typically, this will include code in the client application as well as on the servers.

When this is done well it can not only collect end-to-end service availability data, but it can also identify exactly where a failure has occurred, helping to improve service availability by reducing the time needed to resolve incidents.

How to document and report your service availability findings

When you have collected service availability data, you need to consider how this should be reported to customers.

Plan your downtime

One aspect of service availability measurement and reporting that’s often overlooked is planned downtime. If you forget to factor in planned downtime when you’re working out how to report service availability, you could end up reporting service availability figures that don’t fairly reflect your service provision.

There are several ways to make sure that planned downtime doesn’t distort your service availability statistics. One is to carry out planned downtime during a specific maintenance window that’s excluded from the service availability calculations. Another is to exclude downtime that was scheduled far enough in advance; for example, some organizations don’t count downtime that was scheduled a month ahead.
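
Here’s a minimal sketch of the first approach, excluding an agreed maintenance window from both the agreed service time and the recorded downtime; the dates, window, and outage are made up for illustration:

from datetime import datetime

def overlap_hours(start_a, end_a, start_b, end_b):
    """Hours of overlap between two time intervals (zero if they don't overlap)."""
    overlap_seconds = (min(end_a, end_b) - max(start_a, start_b)).total_seconds()
    return max(overlap_seconds, 0) / 3600

# Agreed weekly maintenance window and one real outage (illustrative times).
window_start, window_end = datetime(2023, 5, 7, 2, 0), datetime(2023, 5, 7, 4, 0)
outage_start, outage_end = datetime(2023, 5, 7, 3, 0), datetime(2023, 5, 7, 6, 0)

planned_hours = (window_end - window_start).total_seconds() / 3600
ast_hours = 24 * 7 - planned_hours  # weekly 24x7 agreed service time, minus the window
unplanned_downtime_hours = ((outage_end - outage_start).total_seconds() / 3600
                            - overlap_hours(outage_start, outage_end, window_start, window_end))

availability = 100.0 * (ast_hours - unplanned_downtime_hours) / ast_hours
print(round(availability, 2))  # 98.8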

Whatever you decide to do with service availability, it’s important that your SLA clearly defines how planned downtime will be reported.

Agree on your service availability reporting period

Earlier in this blog, I talked about the limitations of percentage service availability as a useful measure. Nevertheless, it does have its uses and it continues to be widely used. So, it’s important to understand that you need to specify the time period over which calculation and reporting take place, as this can have a dramatic effect on the service availability numbers that you’ll be reporting.

For example, let’s consider an IT organization that has agreed a 24×7 service and service availability of 99%. Suppose there’s an eight-hour outage:

  • If we report service availability every week then the AST (Agreed Service Time) is 24 x 7 hours = 168 hours
  • Measured monthly the AST is (24 x 365) / 12 = 730 hours
  • Measured quarterly the AST is (24 x 365) / 4 = 2190 hours

Putting these numbers into the service availability equation gives:

  • Weekly service availability = 100% x (168 – 8) / 168 = 95.2%.
  • Monthly service availability = 100% x (730 – 8) / 730 = 98.9%
  • Quarterly service availability = 100% x (2190 – 8) / 2190 = 99.6%

Each of these is a valid figure for service availability, but only one of them shows that the target was met.
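
A small Python sketch makes the effect easy to see; it simply recalculates the same eight-hour outage against the three agreed service times above:

# Agreed service time in hours for each reporting period, from the example above.
periods = {"weekly": 24 * 7, "monthly": 24 * 365 / 12, "quarterly": 24 * 365 / 4}

# The same 8-hour outage gives a very different percentage for each period.
for name, ast_hours in periods.items():
    print(name, round(100 * (ast_hours - 8) / ast_hours, 1))
# weekly 95.2, monthly 98.9, quarterly 99.6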

And finally

Almost every IT organization that I’ve worked with measures and reports service availability. The really great IT departments work with their customers to optimize their investment and deliver levels of service availability that delight. Sadly, many IT organizations focus on the numbers in an SLA, and completely fail to meet their customers’ needs – even if they deliver the agreed numbers.

In this article, I’ve offered a number of suggestions for how you can measure and report service availability, but I haven’t discussed what you can do to help manage and improve it. This is probably even more important, but it’s a topic for another article.

It’s a long service availability article, so here are some of the key points that I’ve made within it:

  • There’s little point in telling a customer that you provided 98% service availability if you don’t understand the impact of the 2% downtime
  • Talk to your customers to make sure you understand the business impact of service availability and any downtime on them, and on their end customers
  • Think about ways to protect your customers’ critical business processes in service availability terms
  • Find ways to measure the frequency and duration of downtime, and the impact of downtime on productivity that are matched to your customers’ needs
  • Agree, document, and report service availability metrics in ways that both make sense to your customers and help you to plan
  • Use appropriate tools and instrumentation to help you measure and report service availability.

What else would you add to my service availability advice? Where do you need help, perhaps with service delivery? Please comment below.

This service availability article was originally written in 2017 and updated in 2023.


Stuart Rance
Service Management and Security Management Consultant at Optimal Service Management Ltd.

Stuart Rance is a consultant, trainer, and author with an international reputation as an expert in ITSM and information security. He was a lead architect and lead editor for ITIL 4, and the lead author for RESILIA™: Cyber Resilience Best Practice. He writes blogs and white papers for many organizations, including his own website.

Stuart is a lead examiner for ITIL, chief examiner for RESILIA, and an instructor for ITIL, CISSP, and many other topics. He develops and delivers custom training courses, and delivers presentations on many topics, for events such as itSMF conferences and for private organizations.

In addition to his day job, he is also an ITSM.tools Associate Consultant.


12 Responses

  1. Here is my example of defining service availability in business terms:
    Business – say 500 retail shops across the country; each shop has 3 cash registers to serve shoppers.
    Service name = Retail service
    Value proposition = Retail service facilitates easy, fast, and satisfying processing of customer sales in the shops. Value is created when every customer transaction is quickly processed, thereby avoiding lengthy queues at the cash registers and preventing money from walking out the door.
    Fact 1 = During peak hours, if one of the 3 registers is down, queues are longer at the remaining 2 and many customers walk out without buying because of time constraints.
    Suggested definition of service availability = During peak hours, if any of the three cash registers is down, we will consider the entire retail service down, an SLA breach is incurred, and downtime is accumulated.
    During non-peak hours, if any two of the cash registers are down, we will consider the entire retail service down, an SLA breach is incurred, and downtime is accumulated. If only one cash register is down, the retail service is considered up.
    No percentages to play around with.
    Thank you

    1. Ashok, That can certainly work if the customer is happy with it. I suspect that this customer would be better off with one more register, so that failure of a single register doesn’t cost them so much, but this depends on the cost/benefit tradeoff so is up to them.

  2. Good article Stuart, thanks for sharing.
    The challenge though is how to automate this kind of measuring and outage reporting in the era of microservices and API economy.

    Regards,
    Beno

    1. Benjamin, it certainly is a challenge. My preference is to go for something simple, rather than trying to create complex automation. Try to find a measure of business impact and apply it to all outages.

  3. The problem here is that we too often approach the measurement of services using availability as the primary indicator of performance. The reality is that services provide a means of getting things done. So it stands to reason that we should measure services on their ability to get things done rather than just how available they are for getting things done. The consumers of services want things like transaction throughput and responsiveness/speed at the times they need to use the service. Just think of your internet connection at home – when you need to use it you want bandwidth and no lags in up/download speeds. The fact that it is available when you need to use the service is a given. The performance of the service when you need it is the paramount concern. It’s the same with IT services being consumed by business customers. Don’t get me wrong, we should still measure availability, but we do need more focus on service performance measurements that have better relevance to the consumers of these services.

    1. You are absolutely right, but it is not just availability and performance that matter. One of my customers has “How quickly you responded to my ad-hoc change requests” as a KPI, because they are in an ever-changing business and that is what matters to them.

    2. Interested in your comments about transactions. Does anyone agree measuring Availability for a transaction processing service on the basis of failed v successful transactions is a valid approach? The point being that there are various cut off times throughout the day and if cut off is missed then the service isn’t being provided. Service Restoration time is kind of irrelevant as long as the service is available the next day to process transactions before the next cut off.

  4. Stuart, thanks for the article. I like table 2 but I do not necessarily agree with the number of users impacted story. First, it is often very difficult to quantify the number of users impacted. The first calculation that you stated provides no valuable information is, in fact, the undisputed metric of availability for the service in question during the reporting period. That 98% tells me more than the 98.96% that is reported when you include the number of users impacted. In fact, I often argue that the only purpose of including users impacted (an imprecise metric) in the calculation is to dilute the SLA percentage to something less severe. As an example, very recently, we had an outage of our VOIP telephone service for a few minutes. Fortunately, at the time of the outage, only 2 people were using telephony services and thus the outage was reported as a 5-minute outage but only for 2 users. I find that really hard to buy. Just by simple blind luck, our SLA reported was much better than what I would have reported it as – which is 5 minutes of outage PERIOD. The two people who were impacted were the CEO and one of his reports. Needless to say, they were not very happy and our reporting a stellar performance metric did not do anything to improve their confidence in the numbers being reported. Until there is a time where we are able to accurately and precisely measure who is impacted by a service outage, I believe teams should not try to dilute the calculation down with their guess of who is or should have been impacted … take the high road and assume everyone was impacted.
