High availability is a system design protocol and associated
implementation that ensures a specified degree of operational
continuity during a given measurement period.
Definition of availability
Availability refers to the ability of the user community to access
the system, whether to submit new work, update or alter existing
work, or collect the results of previous work. If a user cannot
access the system, it is said to be unavailable. Generally, the
term downtime is used to refer to periods when a system is unavailable.
A distinction needs to be made between planned downtime and unplanned
downtime. Typically, planned downtime is a result of maintenance
that is disruptive to system operation and usually cannot be avoided.
Planned downtime events include patches to system software that
require a reboot or system configuration changes that only take
effect upon a reboot. In general, planned downtime is the result
of some logical event. Unplanned downtime events typically
arise from some physical event, such as a hardware failure or
environmental anomaly. Examples of unplanned downtime events include
power outages, failed CPU or RAM components (or other failed
hardware components), and over-temperature shutdowns.
Many computing sites exclude planned downtime from
availability calculations, on the assumption that it has little or
no impact on the user community. By excluding planned downtime,
many systems can claim to have phenomenally high availability,
which might give the illusion of continuous availability. Systems
that exhibit truly continuous availability are rare, expensive
and carefully implemented specialty designs that eliminate any
single point of failure.
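As a rough illustration of how much this exclusion can change the
reported figure, the sketch below computes availability for one year
twice: once counting only unplanned downtime and once counting all
downtime. The function name and the downtime figures (20 minutes
unplanned, 8 hours planned) are hypothetical and chosen only for the
example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def availability_percent(downtime_minutes, period_minutes=MINUTES_PER_YEAR):
    """Return availability as a percentage of the measurement period."""
    if not 0 <= downtime_minutes <= period_minutes:
        raise ValueError("downtime must lie between 0 and the period length")
    return 100.0 * (1.0 - downtime_minutes / period_minutes)


# Hypothetical figures for one year, in minutes (not taken from any real site).
unplanned = 20
planned = 8 * 60

reported = availability_percent(unplanned)               # planned downtime excluded
experienced = availability_percent(unplanned + planned)  # all downtime counted

print(f"planned downtime excluded: {reported:.3f}%")
print(f"planned downtime included: {experienced:.3f}%")
```

With these made-up numbers the first figure clears 99.99% while the
second is barely above 99.9%, even though both describe the same year.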
Availability is usually expressed as a percentage of uptime in
a given year. (Shorter time periods can be used, but sites that
pick artificially short measurement periods may be hiding latent
problems in their systems which produce instability, leading to
unplanned downtime.) In a given year, the number of minutes of
unplanned downtime is tallied for a system; the aggregate unplanned
downtime is divided by the total number of minutes in a year (525,600
in a non-leap year), giving the fraction of downtime, usually stated
as a percentage; the complement is the percentage of uptime, which
is what is typically referred to as the availability of the system.
Common values of availability
for highly available systems are:
99.9% = 43.8 minutes/month or 8.76 hours/year of downtime
99.99% = 4.38 minutes/month or 52.6 minutes/year of downtime
99.999% = 0.44 minutes/month or 5.26 minutes/year of downtime
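The figures above can be reproduced with a short calculation. The
following is a minimal sketch, assuming a non-leap year and a month
taken as one twelfth of a year; the function name is made up for
illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def allowed_downtime_minutes(availability_percent):
    """Return (minutes per month, minutes per year) of permitted downtime."""
    downtime_fraction = 1.0 - availability_percent / 100.0
    per_year = downtime_fraction * HOURS_PER_YEAR * 60
    per_month = per_year / 12  # a month taken as one twelfth of a year
    return per_month, per_year


for target in (99.9, 99.99, 99.999):
    per_month, per_year = allowed_downtime_minutes(target)
    print(f"{target}%: {per_month:.2f} min/month, {per_year:.2f} min/year")
```

Each additional "nine" divides the permitted downtime by ten, which is
why the step from 99.99% to 99.999% is usually far more costly than
the step from 99.9% to 99.99%.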
It should be noted that uptime and availability are not synonymous.
A system can be up, but not available, as in the case of a network
outage.
Clearly, how availability is measured is subject to some degree
of interpretation. A system that has been up for 365 days in a
non-leap year might have been made unreachable by a network failure that
lasted for 9 hours during a peak usage period; the user community
will see the system as unavailable, whereas the system administrator
will claim 100% "uptime". However, by the definition of availability
given above, the system was approximately 99.897% available
(8751 hours of available time out of 8760 hours per non-leap year).
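The gap between the two views in this example can be worked through
directly. The sketch below is illustrative only; the variable and
function names are assumptions, and the 9-hour outage is the one
described above.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def percent_of_year(hours):
    """Express a number of hours as a percentage of a non-leap year."""
    return 100.0 * hours / HOURS_PER_YEAR


server_down_hours = 0      # the server itself never went down
network_outage_hours = 9   # users could not reach it during this time

# The administrator's view: only time the server was down counts.
uptime = percent_of_year(HOURS_PER_YEAR - server_down_hours)

# The users' view: any time the system could not be reached counts.
availability = percent_of_year(
    HOURS_PER_YEAR - server_down_hours - network_outage_hours
)

print(f"uptime:       {uptime:.3f}%")        # 100.000%
print(f"availability: {availability:.3f}%")  # 99.897%
```

The same year therefore yields 100% or roughly 99.897% depending on
whether the measurement is taken at the server or from the users' side
of the network.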