Orchestrator failure detection and recovery: New Beginnings
Shlomi Noach
9/19/2020
Engineering · 6 min read

Orchestrator is an open source MySQL replication topology management and high availability solution. Vitess has recently integrated orchestrator as a native component of its infrastructure to achieve reliable failover, availability, and topology resolution of its clusters. This post first illustrates the core logic of orchestrator’s failure detection, and proceeds to share how the new integration adds new failure detection and recovery scenarios, making orchestrator’s operation goal-oriented.

Note: in this post we adopt the term “primary” over the term “master” in the context of MySQL replication.

Orchestrator’s holistic failure detection

Vitess and orchestrator both use MySQL’s asynchronous (async) or semi-synchronous replication. For the purposes of this post, the discussion is limited to async replication. In an async setup, we have one primary server and multiple replicas. The primary is the single writable server and the replicas are all read-only, used mainly for read scale-out, backups, etc. While MySQL offers a multi-primary (multi-writable) setup, it is commonly discouraged, and Vitess does not support it; in fact, a multi-primary setup is considered a failure scenario, as described later on.

The most critical failure scenario in an async topology is a primary outage. Either the primary server has crashed, or it is network isolated: the result is that there are no writes on the cluster, and the replicas are left hanging with no server to replicate from.

Common failure detection practices

How does one diagnose whether the primary server is healthy? A common practice is to check that port 3306 is open. More reliably, we can send a trivial query, such as SELECT 1 FROM DUAL. More reliable still is to query for actual information: a status variable, or actual data. All these techniques share a similar problem: what if the primary server doesn’t respond?

A naive conclusion is that the primary is down, kicking off a failover sequence. However, this may well be a false positive: there could be a network glitch. It is not uncommon to miss a communication packet once in a while, which is why database clients are commonly configured to retry a couple of times upon error. The common way to reduce such false positives is to run multiple checks in succession: if the primary fails a health check, try again in, say, 5 seconds, and again, and again, up to n times. If the nth check still fails, we determine the server is indeed down.
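As a concrete illustration, here is a minimal sketch of such a retry-based check in Go. The DSN, the attempt count, and the interval are illustrative choices, not a prescribed configuration.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql" // registers the "mysql" driver
)

// primaryIsHealthy implements the naive check: issue a trivial query and, on
// failure, retry up to maxAttempts times, sleeping between attempts. Only
// after the last attempt fails do we declare the primary down.
func primaryIsHealthy(db *sql.DB, maxAttempts int, interval time.Duration) bool {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		var one int
		if err := db.QueryRow("SELECT 1 FROM DUAL").Scan(&one); err == nil {
			return true // a single success is enough
		}
		if attempt < maxAttempts {
			time.Sleep(interval) // this wait is where detection time accumulates
		}
	}
	return false
}

func main() {
	// The DSN is illustrative only.
	db, err := sql.Open("mysql", "monitor:password@tcp(primary-host:3306)/")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if !primaryIsHealthy(db, 5, 5*time.Second) {
		log.Println("primary deemed down after repeated checks; kicking off failover")
	}
}
```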

Yet this approach introduces a few problems:

  • Exactly how many checks are enough?
  • Exactly what is a reasonable check interval?
  • What if the primary is really down? We have wasted n*interval seconds to double check, triple check, etc., when we could have failed over sooner.
  • What if the primary is really up, and the problem is with the network between the primary and our testing endpoint? That’s a false positive, and we failed over for nothing.

Consider the last bullet point. Some monitoring solutions run health checks from multiple endpoints, and require a quorum, an agreement of the majority of check endpoints that there is indeed a problem. This kind of setup must be used with care; the placement of the endpoints in different availability zones is critical to achieve sensible quorum results. Once that’s done, though, the triangulation is powerful and useful.
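The majority vote itself is simple. Here is a sketch, not tied to any particular monitoring product, with illustrative type and function names:

```go
package main

import "fmt"

// endpointVerdict is a single check endpoint's opinion of the primary,
// ideally reported from a distinct availability zone.
type endpointVerdict struct {
	endpoint    string
	primaryDown bool
}

// quorumSaysDown returns true only when a strict majority of endpoints
// report the primary as down, reducing single-endpoint false positives.
func quorumSaysDown(verdicts []endpointVerdict) bool {
	downVotes := 0
	for _, v := range verdicts {
		if v.primaryDown {
			downVotes++
		}
	}
	return downVotes*2 > len(verdicts)
}

func main() {
	verdicts := []endpointVerdict{
		{endpoint: "checker-az1", primaryDown: true},
		{endpoint: "checker-az2", primaryDown: false},
		{endpoint: "checker-az3", primaryDown: true},
	}
	fmt.Println("failover justified:", quorumSaysDown(verdicts)) // true: 2 of 3 agree
}
```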

Orchestrator’s approach

Orchestrator uses a different take on triangulation. It recognizes that there are more players in the field: the replicas. The replicas connect to the primary over the MySQL protocol and request the change log so as to follow in the primary’s footsteps. To evaluate a primary failure, orchestrator asks:

  • Am I failing to communicate with the primary? And,
  • Are all replicas failing to communicate with the primary?

If, for example, orchestrator is unable to reach the primary but can reach the replicas, and the replicas are all happy and confident that they can read from the primary, then orchestrator concludes there’s no failure scenario. Possibly some of the replicas themselves are unreachable: maybe a network partition or a power failure took down both the primary and a few of the replicas. orchestrator can still reach a conclusion from the state of all available replicas. It’s noteworthy that orchestrator itself runs in a highly available setup, across availability zones, and requires quorum leadership to be able to run failovers in the first place, mitigating network isolation incidents; that discussion, however, is outside the scope of this post.

Orchestrator doesn’t run check intervals and a set number of tests: a single observation is enough for it to act. Behind the scenes, orchestrator relies on the replicas themselves to retry at intervals; that’s how MySQL replication works anyhow, and orchestrator takes advantage of it.

This holistic approach, where orchestrator triangulates its own checks with the servers’ checks, results in a highly reliable detection method. Returning to our example: if orchestrator thinks the primary is down, and all the replicas say the primary is down, then a failover is justified. The replication cluster is effectively not receiving any writes, the data becomes stale, and that much is observable all the way to the users and client apps. The holistic approach further allows orchestrator to treat other scenarios: the failure of an intermediate replica (e.g. a second-level replica in a chained replication tree) is detected in exactly the same way. It also offers granularity into the failure severity: orchestrator is able to tell that the primary is seen as down while the replicas still disagree, or that the replicas think the primary is down while orchestrator can still see it.
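To make the shape of that decision concrete, here is a sketch in Go. The type names and analysis labels are illustrative simplifications, not orchestrator’s actual analysis codes.

```go
package main

import "fmt"

// replicaView is what orchestrator learns by probing a replica directly:
// could the replica be probed at all, and does it still read from the primary?
type replicaView struct {
	reachable   bool // did orchestrator manage to probe this replica?
	seesPrimary bool // does the replica report healthy replication from the primary?
}

// analysis labels are simplified stand-ins for orchestrator's analysis codes.
type analysis string

const (
	noProblem          analysis = "NoProblem"
	unreachablePrimary analysis = "UnreachablePrimary" // orchestrator can't see the primary, replicas disagree
	deadPrimary        analysis = "DeadPrimary"        // orchestrator and all reachable replicas agree
)

// analyzePrimary triangulates orchestrator's own observation with the replicas'.
func analyzePrimary(orchestratorSeesPrimary bool, replicas []replicaView) analysis {
	if orchestratorSeesPrimary {
		return noProblem
	}
	for _, r := range replicas {
		if r.reachable && r.seesPrimary {
			// Suspicious but inconclusive; see the emergency operations below.
			return unreachablePrimary
		}
	}
	return deadPrimary
}

func main() {
	replicas := []replicaView{{true, false}, {true, false}, {false, false}}
	fmt.Println(analyzePrimary(false, replicas)) // DeadPrimary: every reachable replica agrees
}
```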

Emergency detection operations

If orchestrator can’t see the primary, but can see the replicas, and they still think the primary is up, should this be the end of the story?

Not quite. We may well have an actual primary outage; it’s just that the replicas haven’t realized it yet. If we wait long enough, they will eventually report the failure, but orchestrator wishes to reduce total outage time by resolving the situation as early as possible.

Orchestrator offers a few emergency detection operations, which are meant to speed up failure detection. Examples:

  • As in the above, orchestrator can’t see the primary. Emergently probe the replicas to check what they think. Normally each server is probed once every few seconds, but orchestrator now chooses to probe sooner.
  • A first-tier replica reports it can’t see the primary. The rest of the replicas are fine, and orchestrator can see the primary. This is still very suspicious, so orchestrator runs an emergency probe on the primary. If that fails, then we’re on to something, and we fall back to the first bullet.
  • orchestrator cannot reach the primary, the replicas can all reach the primary, but lag on the replicas keeps increasing. This may be a limbo scenario caused by either a locked primary or a “too many connections” situation. The replicas are likely to hold some of the oldest connections to the primary: new connections cannot reach the primary, so to the app it seems down, but the replicas are still connected. orchestrator can analyze that and emergently kick a replication restart on all replicas. This closes and reopens the TCP connections between the replicas and the primary. On a locked primary or in a “too many connections” scenario, the replicas are expected to fail reconnecting, leading to a normal detection of a primary outage (a sketch of this restart follows the list).
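Purely as an illustration of that last bullet, restarting replication on a replica might look as follows. The SQL statements are standard MySQL; the surrounding Go plumbing and connection details are illustrative and not orchestrator’s actual code.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql" // registers the "mysql" driver
)

// emergencyRestartReplication closes and reopens a replica's replication
// connection, forcing a fresh TCP connection to the primary. On a locked
// primary or a "too many connections" condition, the reconnect is expected
// to fail, which then surfaces as a regular dead-primary detection.
// Note: STOP/START REPLICA is MySQL 8.0.22+ syntax; older versions use
// STOP/START SLAVE.
func emergencyRestartReplication(replica *sql.DB) error {
	if _, err := replica.Exec("STOP REPLICA IO_THREAD"); err != nil {
		return err
	}
	_, err := replica.Exec("START REPLICA IO_THREAD")
	return err
}

func main() {
	// The DSN is illustrative only.
	replica, err := sql.Open("mysql", "monitor:password@tcp(replica-host:3306)/")
	if err != nil {
		log.Fatal(err)
	}
	defer replica.Close()

	if err := emergencyRestartReplication(replica); err != nil {
		log.Printf("replica failed to restart replication: %v", err)
	}
}
```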

Orchestrator and your replication clusters

An important observation is that orchestrator knows what your replication clusters actually look like, but doesn’t have the metadata describing what they should look like. It doesn’t know if some standalone server should belong to this or that cluster; if the current primary server is indeed what’s advertised to your application; if you really intended to set up a multi-primary cluster. It is generic in that it allows a variety of topology layouts, as requested and used by the greater community.

Old Vitess-orchestrator integration

For the past few years, orchestrator was an external entity to Vitess. The two would collaborate over a few API calls. orchestrator did not have any Vitess awareness, and much of the integration was done through pre- and post-recovery hooks, shell scripts, and API calls. This led to known situations where Vitess and orchestrator would compete over a failover, or run operations unknown to each other, causing confusion. Clusters would end up in a split state, or in a co-primary state. The loss of a single event could cause cluster corruption.

Orchestrator as first-class citizen in Vitess

We have recently integrated orchestrator into Vitess as an integral part of the Vitess infrastructure. This is a specialized, Vitess-aware fork of orchestrator. In fact, the integrated orchestrator is able to run Vitess-native functions, such as locking shards or fetching tablet information.

The integration makes orchestrator both cluster aware and goal driven.

Cluster-awareness

MySQL itself has no concept of a replication cluster (not to be confused with InnoDB Cluster or MySQL Cluster): servers just happen to replicate from each other, and MySQL has no opinion on whether they should replicate from each other, or what the overall health and status of the replication tree is. orchestrator can share observations and opinions on the replication tree, based on what it can see. Vitess, however, has a firm opinion on what it expects. In Vitess, each MySQL server has its own vttablet, an agent of sorts. The tablet knows the identity of the MySQL server: which schema it contains, which shard it is part of, what role it assumes (primary, replica, OLAP, ...), etc. The integrated orchestrator now gets all of this MySQL metadata directly from the Vitess topology server. It knows beyond doubt that two servers belong to the same cluster, not because they happen to be connected in a replication chain, but because the metadata provided by Vitess says so. orchestrator can now look at a standalone, detached server and tell that it is, in fact, supposed to be part of some cluster.
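To give a flavor of what that metadata enables, here is a simplified sketch. The tabletRecord type and its fields are illustrative stand-ins for Vitess’s tablet records, not the actual Vitess API.

```go
package main

import "fmt"

// tabletRecord is a simplified, illustrative stand-in for the metadata Vitess
// keeps about each MySQL server: which keyspace and shard it belongs to, and
// which role it is expected to fill.
type tabletRecord struct {
	alias    string
	keyspace string
	shard    string
	role     string // "PRIMARY", "REPLICA", ...
}

// sameCluster answers the question orchestrator previously could not answer on
// its own: do these two servers belong together, regardless of whether they
// are currently connected in a replication chain?
func sameCluster(a, b tabletRecord) bool {
	return a.keyspace == b.keyspace && a.shard == b.shard
}

func main() {
	primary := tabletRecord{alias: "zone1-0001", keyspace: "commerce", shard: "-80", role: "PRIMARY"}
	detached := tabletRecord{alias: "zone1-0002", keyspace: "commerce", shard: "-80", role: "REPLICA"}

	if sameCluster(detached, primary) && detached.role == "REPLICA" {
		// Even though zone1-0002 is observed as a standalone server, the
		// metadata says it should replicate from this shard's primary.
		fmt.Printf("%s should be repointed under %s\n", detached.alias, primary.alias)
	}
}
```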

Goal-driven

This cluster awareness is a fundamental change in orchestrator’s approach, and it allows us to make orchestrator goal-driven. orchestrator’s goal is to ensure a cluster is always in a state compatible with what Vitess expects it to be. This is accomplished by introducing new failure detection modes that were not possible before, and new recovery methods that would otherwise have been too opinionated. Examples (a sketch of this reconciliation logic follows the list):

  • orchestrator observes a standalone server. According to Vitess’ topology server, that server is a REPLICA. orchestrator diagnoses this as a “replica without a primary” and proceeds to connect it to the proper replication cluster, after validating that the operation is supported GTID-wise.
  • orchestrator observes a REPLICA that is writable. Vitess does not support that setup. orchestrator makes the replica read-only.
  • Likewise, orchestrator sees that the primary is read-only. It switches it to be writable.
  • orchestrator detects a multi-primary setup (circular replication). Vitess strictly forbids this setup. orchestrator checks with the topology service which of them is marked as the true primary, then turns the other(s) into standard replicas. To emphasize the point, a multi-primary setup is considered to be a failure scenario.
  • Possibly the most intriguing scenario is where orchestrator sees a fully functional replication tree, with a writable primary and read-only replicas, but notices that Vitess thinks the primary should be one of the replicas, and that the server acting as the cluster’s primary should be a replica. This situation can result from a previous, prematurely terminated failover process. In this case, orchestrator runs a graceful takeover (or a planned reparent, in Vitess jargon) to promote the correct server as the new primary and to demote the “impersonator” primary.
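A compressed sketch of this goal-driven reconciliation follows, with illustrative state and role types rather than orchestrator’s actual code paths.

```go
package main

import "fmt"

// observedState is what orchestrator sees when probing a MySQL server.
type observedState struct {
	alias      string
	isWritable bool
	hasPrimary bool // is the server currently replicating from some primary?
}

// desiredRole is what the Vitess topology service says the server should be.
type desiredRole string

const (
	rolePrimary desiredRole = "PRIMARY"
	roleReplica desiredRole = "REPLICA"
)

// reconcile compares observation against expectation and names the repair to
// run. The repairs themselves (repointing, toggling read_only, running a
// planned reparent) are elided here.
func reconcile(obs observedState, want desiredRole) string {
	switch {
	case want == roleReplica && !obs.hasPrimary:
		return "replica without a primary: reconnect it to its cluster (GTID permitting)"
	case want == roleReplica && obs.isWritable:
		return "writable replica: set read_only"
	case want == rolePrimary && !obs.isWritable:
		return "read-only primary: set writable"
	default:
		return "state matches expectation: nothing to do"
	}
}

func main() {
	fmt.Println(reconcile(observedState{alias: "zone1-0002", isWritable: true, hasPrimary: true}, roleReplica))
}
```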

Thus, Vitess has an opinion of what the cluster should look like, and orchestrator is the operator that makes it so. It is furthermore interesting to note that orchestrator’s operations will either fail or converge to the desired state.

But, what if a primary unexpectedly fails? What server should be promoted?

Orchestrator’s promotion logic

On an unexpected failure, it is orchestrator’s job to pick and promote the most suitable server, and to advertise its identity to Vitess. The new interaction ensures this is a converging process, and that orchestrator and Vitess do not conflict with each other over who should be the primary. Orchestrator promotes a server based on multiple limiting factors: is the server configured such that it can act as a primary, e.g. does it have binary logs enabled? Does its version match the other replicas? What are the general recommendations for the specific host (metadata acquired from Vitess)? But there are also general, non-server-specific rules that dictate which promotions are possible. Do we strictly have to fail over within the same data center? The same region or availability zone? Or do we strictly have to fail over outside the data center? Do we only ever fail over onto a server configured as a semi-sync replica? And how do we reconfigure the cluster after promotion?
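As an illustration only, a candidate filter applying a few such factors might look like the sketch below. The field names and the specific policy (same data center, semi-sync required) are hypothetical choices, not orchestrator’s defaults.

```go
package main

import "fmt"

// promotionCandidate carries the per-server facts weighed when picking a new
// primary. The fields are illustrative.
type promotionCandidate struct {
	alias           string
	binlogsEnabled  bool
	version         string
	dataCenter      string
	semiSyncReplica bool
}

// eligible applies a few of the limiting factors described above: binary logs
// must be enabled and the version must match the cluster; this hypothetical
// policy also insists on staying within the failed primary's data center and
// on a semi-sync replica.
func eligible(c promotionCandidate, clusterVersion, primaryDC string) bool {
	return c.binlogsEnabled &&
		c.version == clusterVersion &&
		c.dataCenter == primaryDC &&
		c.semiSyncReplica
}

func main() {
	candidates := []promotionCandidate{
		{alias: "zone1-0002", binlogsEnabled: true, version: "8.0.21", dataCenter: "zone1", semiSyncReplica: true},
		{alias: "zone2-0003", binlogsEnabled: true, version: "8.0.21", dataCenter: "zone2", semiSyncReplica: true}, // wrong data center under this policy
	}
	for _, c := range candidates {
		if eligible(c, "8.0.21", "zone1") {
			fmt.Println("promotable:", c.alias)
		}
	}
}
```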

Previously, some of these questions were answered by configuration variables, and some by the user’s infrastructure. The new integration, however, allows the user to choose a failover and recovery policy that is described in code. Orchestrator and Vitess already support three pre-configured modes, but will also allow the user to define any arbitrary policy (within a set of rules) they may choose.

More on that in a future post.