Last Updated:

Monitoring

Grzegorz Dostatni

Who is the audience? Who am I?

Figuring out the monitoring and reporting of new services is usually relegated to some time after thorough testing. If this is your experience, this guide will help you start with something that has a chance of being useful to you and your clients. If your experience includes careful consideration of the available software options and the choice of settings and reports, I would love to hear about it. Personally, I have been involved in only one project that included that level of attention to detail.

There are a number of reasons to consider monitoring as part of service design and build. Changing logging options often requires a service restart, making it an ITIL change subject to change management, testing and approvals. The right information in front of the right people will make future decisions much easier. Lastly, a metric added in response to an ongoing problem will not be fully useful until enough data has been collected to indicate a trend. It may also lack a critical baseline from before the problem became apparent. Capacity planning may require a full year of data before being fully useful. All of these are good reasons to have a monitoring strategy that minimizes future changes.
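To illustrate why the baseline matters, here is a minimal Python sketch that compares a new measurement against samples collected before a problem appeared. The function name deviates_from_baseline, the sample values and the threshold of three standard deviations are all assumptions made for illustration, not recommendations.

```python
from statistics import mean, stdev


def deviates_from_baseline(baseline, current, threshold=3.0):
    """Return True if `current` lies more than `threshold` standard
    deviations away from the mean of the `baseline` samples."""
    if len(baseline) < 2:
        # Without history there is nothing to compare against.
        raise ValueError("need at least two baseline samples")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any difference is a deviation.
        return current != mu
    return abs(current - mu) > threshold * sigma


# Hypothetical response times in milliseconds, recorded before the problem.
baseline_ms = [102, 98, 105, 99, 101, 97, 103]
```

Without the baseline_ms history the function has nothing to compare against, which is the situation you are in when a metric is added only after the problem has started.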

In creating this strategy, I drew inspiration from a well-established process: Hazard Analysis and Critical Control Points (HACCP), which I modified for use in IT. According to Wikipedia, HACCP is a systematic preventive approach to food safety from biological, chemical and physical hazards in production processes that can cause the finished product to be unsafe, and designs measures to reduce these risks to a safe level. As it stands it is not entirely useful for IT services, but that style of rigorous thinking can be beneficial. HACCP has seven basic principles, which we will attempt to adapt to our use: hazard analysis, critical control point identification, establishing critical limits, monitoring procedures, corrective actions, verification procedures, and record-keeping and documentation.

Hazard Analysis

Let's start by defining six types of hazards:

  • [[#Human Error]]
  • [[#Undesirable effect of change]]
  • [[#Security]]
  • [[#Capacity / Scaling]]
  • [[#Hardware failure]]
  • [[#Dependency failure]]

It may be possible to view Hardware failure as a type of Dependency failure, but in my experience it is useful to separate the two along the line of things I am responsible for and/or can fix directly.

Human Error

Human error can be the result of a lack of knowledge or of miscommunication between team members. Sometimes it is the result of overwork, or a simple oversight. It happens. While I do not know of any foolproof way of avoiding it, having some ways of monitoring for it can make the resolution faster.

Undesirable effect of change

This type of hazard is typically a human error on the part of the developers or designers. Perhaps a database index was dropped from the latest version, or an edge case was not fully considered and its full effect was not apparent during testing. The change may not be limited to this service, either. Imagine changing the storage mechanism of the file server that your database runs on. While the file service may operate correctly and within accepted parameters, the secondary effects on the database can be quite significant.

Security

No service operates in a vacuum. Software is a mess of dependencies that keeps sysadmins busy and employed. This type of hazard ranges from being the target of an active attack, to responding to a recently discovered vulnerability, to keeping up with client security requirements. The timing of these issues is very rarely under our control.

Capacity / Scaling

One of the best things that can happen to a service is that it becomes popular. It is also, possibly, one of the worst things. Trying to predict future trends and expectations early enough to effect change is difficult. I believe this is one of the reasons for the popularity of cloud services. The cloud allows capacity and scaling to be handled as part of incident resolution instead of planning. Elastic services may make sense for some services, but not all. In all cases, capacity planning may at the very least allow you to estimate future costs.
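As a rough illustration of trend-based capacity planning, the sketch below fits a straight line to daily usage samples and estimates how many days remain before a limit is reached. The function name days_until_full and the sample numbers are hypothetical, and a linear trend is itself an assumption: real growth is often seasonal or bursty, which is one reason a full year of data is so valuable.

```python
def days_until_full(usage, capacity):
    """Fit a least-squares line to daily usage samples (oldest first)
    and estimate the days remaining until `capacity` is reached.

    Returns None when usage is flat or shrinking."""
    n = len(usage)
    if n < 2:
        raise ValueError("need at least two samples to see a trend")
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(usage) / n
    # Ordinary least-squares slope and intercept.
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, usage))
    denominator = sum((x - x_mean) ** 2 for x in xs)
    slope = numerator / denominator
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    # Day index at which the fitted line crosses the capacity limit.
    full_day = (capacity - intercept) / slope
    # Count from the most recent sample; never report a negative number.
    return max(0.0, full_day - (n - 1))


# Hypothetical disk usage in GB, one sample per day, on a 200 GB volume.
disk_gb = [100, 110, 120, 130, 140]
remaining = days_until_full(disk_gb, 200)
```

With only five samples the estimate is little more than a guess. The same function becomes far more trustworthy when it is fed months of history.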

Hardware failure

Hard drives break, fans stop spinning, network cards may decide to fry themselves and memory sticks can go bad. Things happen. Virtualization can help move these issues into the Dependency failure category, and is a good option for that reason alone.

Dependency failure

Someone with a backhoe can do a lot of damage to a network or an underground chilled water supply. UPSes can explode and power strips can go bad. Web services can go offline for seemingly no reason at all. I doubt you will ever be fully confident that you are monitoring all of your dependencies. If you are, the world is likely to show you how wrong you were. Still, that is no reason not to try.
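A first, very modest step towards monitoring dependencies is to check that each one accepts network connections. The sketch below uses only the Python standard library. The function names and the hosts in EXAMPLE_DEPENDENCIES are hypothetical placeholders, and a successful TCP connection only shows that a port is open, not that the service behind it is healthy.

```python
import socket


def dependency_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within
    `timeout` seconds, False on any network error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def check_dependencies(dependencies):
    """Map each dependency name to True (reachable) or False."""
    return {
        name: dependency_reachable(host, port)
        for name, (host, port) in dependencies.items()
    }


# Hypothetical dependencies; replace with the ones your service relies on.
EXAMPLE_DEPENDENCIES = {
    "database": ("db.example.internal", 5432),
    "mail relay": ("smtp.example.internal", 25),
}
```

Calling check_dependencies(EXAMPLE_DEPENDENCIES) returns a dictionary of names and booleans that can be fed into whatever alerting you already have.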

Critical Control Point Identification

A critical control point is defined as a step at which control can be applied and is essential to prevent or eliminate a [food safety] hazard or reduce it to an acceptable level.

Consider monitoring frequency, backups and software-triggered events.
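Backups are a good example of a control point that is easy to check automatically. The sketch below treats a backup as acceptable if its file exists and was modified recently. The function name backup_is_fresh and the default limit of 26 hours (a daily backup plus some slack) are assumptions for illustration. A fresh file is also no proof that the backup can be restored.

```python
import os
import time


def backup_is_fresh(path, max_age_hours=26, now=None):
    """Return True if the file at `path` exists and was modified within
    the last `max_age_hours` hours.

    `now` is a Unix timestamp; it defaults to the current time and is
    only a parameter to make the function easy to test."""
    if now is None:
        now = time.time()
    try:
        modified = os.path.getmtime(path)
    except OSError:
        # A missing or unreadable backup counts as not fresh.
        return False
    return (now - modified) <= max_age_hours * 3600
```

How often such a check runs is its own decision: checking a daily backup every minute adds noise without adding information.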

Establishing Critical Limits

Monitoring Procedures

Corrective actions

Verification Procedures