Health check APIs, 500 status codes and media types

A status or health check resource (or endpoint, to use the more popular terminology) is a common way for a system to provide an aggregated representation of its operational status. This status representation typically includes a list with the individual system components or health check points and their individual status (e.g. database connectivity, memory usage threshold, deadlocked threads).

For instance, the popular Dropwizard Java framework already provides an out-of-the-box health check resource, located by default on the /healthcheck URI of the administration port, for this purpose.

The following is an example of such a representation: a JSON object with one field per health check.

{
    "deadlocks":{
        "healthy":true
    },
    "database":{
        "healthy":true
    }
}

It is apparently also common practice for a GET request to these resources to return a 500 status code if any of the internal components reports a problem. For instance, the Dropwizard documentation states:

If all health checks report success, a 200 OK is returned. If any fail, a 500 Internal Server Error is returned with the error messages and exception stack traces (if an exception was thrown).

In my opinion, this practice goes against the HTTP status code semantics, because the server was indeed capable of processing the request and producing a valid response with a correct resource state representation, that is, a correct representation of the system status. The fact that this status includes information about an error does not change that.
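A minimal sketch of the alternative argued for here, in Python: the handler runs every check and answers 200 whenever it manages to produce the representation, leaving any failure in the body. The name `health_response`, the check callables and the `message` field are assumptions made for illustration; they are not part of Dropwizard or any other framework.

```python
import json
from typing import Callable, Dict, Tuple


def health_response(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Run every check and return (status_code, json_body).

    The status code describes how the request was handled, not how healthy
    the system is, so it is 200 whenever the report could be produced.
    """
    report = {}
    for name, check in checks.items():
        try:
            report[name] = {"healthy": bool(check())}
        except Exception as exc:  # a check that raises is reported as unhealthy
            report[name] = {"healthy": False, "message": str(exc)}
    return 200, json.dumps(report)


def broken_database() -> bool:
    # Hypothetical check that simulates a database outage.
    raise ConnectionError("connection refused")


status, body = health_response(
    {"deadlocks": lambda: True, "database": broken_database}
)
```

With this shape, a 500 is left to mean what it normally means: the health check resource itself failed to handle the request.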

So, why is this incorrect practice used so often? I conjecture two reasons.

  • First, an incomplete knowledge of the HTTP status code semantics, which may lead to the following reasoning: if the response contains an error, then a 500 must be used.
  • Second, and perhaps more important, this practice comes in handy when using external monitoring systems (e.g. Nagios) to periodically check these statuses. Since these monitoring systems usually do not understand the health check representation, namely because each API or framework uses a different one, the easiest solution is to rely solely on the status code: 200 if everything is apparently working properly, 500 if something is not OK.

Does this difference between a 200 and a 500 matter, or are we just being pedantic here? Well, I do think it really matters: by returning a 500 status code on a correctly handled request, the status resource is hiding errors in its own behaviour. For instance, let's consider the common scenario where the status resource is implemented by a third-party provider. A failure of this provider will be indistinguishable from a failure of the system being checked, because a 500 will be returned in both cases.
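To make the distinction concrete, here is a rough sketch of how a monitoring client could tell the two failures apart once the status code is reserved for the health check resource itself. It assumes the JSON shape shown earlier (one object per check, each with a boolean `healthy` field); the function name `classify` and the three outcome labels are hypothetical.

```python
import json
from typing import Optional


def classify(status_code: int, body: Optional[str]) -> str:
    """Return 'healthy', 'unhealthy' or 'check_failed' for one poll."""
    # Anything other than a 200 means the health check resource itself
    # (or whoever provides it) failed to answer properly.
    if status_code != 200 or body is None:
        return "check_failed"
    try:
        report = json.loads(body)
        healthy = all(c["healthy"] is True for c in report.values())
    except (ValueError, KeyError, TypeError, AttributeError):
        # The body is not the representation we expected.
        return "check_failed"
    return "healthy" if healthy else "unhealthy"
```

A 200 with a failing component now maps to "unhealthy", while a 500 or an unreadable body maps to "check_failed", so the two situations can trigger different alerts.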

This example shows the consequences of the lack of effort in designing and standardizing media types. The availability of a standard media type would allow a many-to-many relation between monitoring systems and health check resources:

  • A health check resource could easily be monitored/queried by any monitoring system.
  • A monitoring system could easily inspect multiple health check resources, implemented over different technologies.

 

Also, by using a media type, the monitoring result could be much richer than “ok” vs. “not ok”.

To conclude with a call to action: we really need to create a media type to represent health check or status outcomes, possibly based on an already existing media type:

  • E.g. building upon the “application/problem+json” media type (RFC 7807), extended to represent multiple problem statuses (e.g. example).
  • E.g. building upon the “application/status+json” media type proposal.
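As a purely illustrative sketch of the first option, the Python function below turns each failing check into an object using the RFC 7807 members `type`, `title` and `detail`. Everything specific in it is an assumption: the `type` URI, the `check` extension member, the `message` input field and the idea of returning a list are hypothetical and do not come from any existing specification.

```python
from typing import Dict, List

# Hypothetical problem type URI, used only for illustration.
PROBLEM_TYPE = "https://example.org/problems/health-check-failed"


def to_problems(report: Dict[str, dict]) -> List[dict]:
    """Turn each failing check into an RFC 7807-style problem object."""
    problems = []
    for name, result in report.items():
        if result.get("healthy") is True:
            continue  # only failing checks become problems
        problems.append({
            "type": PROBLEM_TYPE,
            "title": "Health check failed",
            "detail": result.get("message", ""),
            "check": name,  # extension member (hypothetical)
        })
    return problems
```

A monitoring system that understood such a representation could report which component failed and why, instead of a bare "not ok".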

Comments are welcome.

 

 

 

2 thoughts on “Health check APIs, 500 status codes and media types”

  1. nferro

    While I mostly agree with you, I can see a third reason for using a 500 response: load balancer monitoring. You can say this is monitoring but then I’ll say it’s a very different monitoring.

    The old way of moving a node in or out of the pool was to have static VIP files whose presence would be managed by some monitoring tool. The problem here is that the load balancer is testing the monitoring tool and not the application; the advantage is that testing for the presence of a static file is quite fast and doesn’t require many resources from the load balancer. Testing a health check resource takes that monitoring component out of the equation and helps to take a node out faster than waiting for the monitoring tool to remove the VIP file. Having the status code tell the load balancer whether the node can stay in the pool is much more efficient than having the load balancer analyse the response body.

    1. pedrofelix Post author

      You’re right. I forgot to mention that health check endpoints can also be used by load balancers (LB) to decide whether a node should stay in the active pool. However, perhaps there should be two different endpoints: one for the monitoring tool, with a detailed status description, and another for the LB that just provides a stay/remove response. Even there, I would prefer this information to be provided in the response representation instead of in the status code. It can be a very simple “true/false” text response. It doesn’t even have to be JSON.
