Migrating a cluster using CockroachDB's elastic architecture

Andrew Deally
9 min read · Oct 29, 2022

This blog post presents and illustrates one mechanism for migrating a CockroachDB cluster from one set of nodes to another, using the inherent elastic nature of the cluster architecture. The cluster is "migrated" by adding the new nodes to the cluster, and then retiring the old nodes.

This approach is presented as an alternative to the traditional migration strategy of backing up the existing cluster, restoring to a new cluster, synchronizing the intervening changes with a changefeed, and then cutting over the applications.

The process outlined here allows the migration to take place without downtime, without changing the cluster’s identity, and without relying on any tools outside of CockroachDB’s native functionality, or even on any of CockroachDB’s Enterprise features.

Starting and Ending States

In our example, we start with a CockroachDB cluster deployed across 3 regions: us-east1, us-east4, us-central1. The desired end result is to have the cluster instead occupy a different (but, in this case, overlapping) set of regions: us-east1, us-south1, us-west3. The goal is to perform the migration while the cluster is live, without impacting data availability.

There are many situations where this is useful, such as moving from on-prem nodes to a cloud provider, or from one cloud platform to another.

This node map from the DB Console illustrates the current state of node distribution…

and this one represents the desired state this exercise will produce.

Note that one existing region, us-east1, will remain part of the cluster. This is not necessary for the process we’re describing to work, but we’re doing it this way because:

  • It reflects the real-world customer situation that prompted this document
  • It lets us demonstrate the cluster migration without having to concern ourselves with the issue of driving application connections to new nodes. (That would typically be handled as part of Load Balancer management, which is part of cluster system administration, but outside the scope of CockroachDB configuration.)

Overview of the Process

As illustrated above, we’re starting with 9 nodes in 3 regions. We’ll add another 6 nodes in 2 more regions, for a total of 15 nodes. Then we’ll manage the transfer of data out of the 2 regions we want to vacate, and into the 2 regions we’ve added. When we retire the 6 nodes in the vacated regions, we’ll leave the cluster with a new set of 9 nodes in a new set of 3 regions.

Here’s a high-level view of the steps we’re going to take.

  1. Verify all schema changes and upgrades have been finalized and back up the cluster
  2. Prepare all the new nodes with the appropriate CockroachDB binaries
  3. Update cluster certificates
  4. (Optional) Prevent data migrating to the new nodes as soon as they’re added
  5. Add new nodes to the cluster
  6. (Optional) Modify cluster configuration to speed up range migration
  7. Push data off the to-be-retired nodes and onto the new nodes, monitoring the rebalance
  8. Decommission nodes to be retired, one at a time

We’ll demonstrate uninterrupted operation by running the TPCC workload in the east1 region, the one region in the starting cluster that will remain in the final cluster.
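
For reference, a workload like this can be driven with CockroachDB's built-in TPCC load generator. Here's a minimal sketch, assuming a load balancer address in us-east1 and an already-initialized tpcc database; the warehouse count, duration, and connection string are placeholders:

  cockroach workload run tpcc \
    --warehouses=500 \
    --duration=2h \
    'postgresql://root@lb-us-east1:26257/tpcc?sslmode=verify-full&sslrootcert=certs/ca.crt'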

Getting Started

Note that provisioning cloud resources, configuring network connectivity, deploying cockroach binaries, and updating certificates are all well-documented procedures that lie beyond the scope of this demonstration; see the CockroachDB deployment documentation for details.

Let’s presume, for the sake of this demonstration, that those tasks have already been performed and verified. Let’s also presume that any changes to our database schemas have been completed, and that the most recent version upgrade to our existing cluster has been successfully finalized.
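
As a quick sanity check before going further, we can confirm from a SQL shell that no schema-change jobs are still running and that the version upgrade has been finalized, and then take a backup. A sketch, with the storage URI as a placeholder:

  SHOW JOBS;                     -- confirm no schema-change jobs are still running
  SHOW CLUSTER SETTING version;  -- confirm the upgrade has been finalized
  BACKUP INTO 'gs://my-backup-bucket/pre-migration?AUTH=implicit';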

Using the Data Distribution & Zone Configs page on the DB Console, we see that the system and database ranges are deployed evenly across all the nodes in the regions:

Note too that the zone configurations are all defaults. In particular, the system ranges have 5 replicas.
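
The same defaults can be confirmed from a SQL shell:

  SHOW ZONE CONFIGURATIONS;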

We’re now at step 4 in the process outlined above.

When a new node is added to a cluster, that node is immediately available to assume its share of the workload, including hosting replicas of ranges. As the cluster’s available compute and storage resources are expanded, the system will undertake to rebalance the load, sending replicas and assigning leaseholder responsibilities to the new node(s).

We want to prevent that from happening to the TPCC database, since we want to control and monitor the migration of data to the new nodes, and maintain predictable application latency during the reconfiguration process. For this, we’ll take advantage of the replica placement constraints available from CockroachDB’s zone configuration.

Our database has a replication factor of 3, and currently occupies 3 regions. If we explicitly pin one replica of each range to each of the 3 regions, CockroachDB will be unable to move any replicas into a new region when one becomes available. Let’s apply this zone configuration to our entire database:
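
Here is a sketch of what that zone configuration might look like, assuming the database is named tpcc and the nodes were started with a region locality tier matching the region names above:

  ALTER DATABASE tpcc CONFIGURE ZONE USING
    num_replicas = 3,
    constraints = '{"+region=us-east1": 1, "+region=us-east4": 1, "+region=us-central1": 1}',
    lease_preferences = '[[+region=us-east1]]';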

Since this zone configuration simply enforces the existing replica distribution (3 replicas of each database range, evenly distributed across 3 regions), we don’t expect this command to have any effect on the data distribution. We have added a lease preference, pushing leaseholders into east1, where our application is running; if there is any effect on system performance, it will be to improve read latency.

Adding new nodes

Now we can add nodes in a new region, with no concern that CockroachDB will immediately start pushing replicas to them. This is as simple as starting up each new node with the parameters necessary to have them join the cluster. Since the new nodes are not going to receive any replicas right away, we can add them to the cluster fairly quickly, with no expectation that they’ll disrupt system performance.
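
A sketch of starting one of the new nodes, with the addresses, certificate directory, and store path as placeholders; the important parts are the locality flag naming the new region and the join list pointing at existing nodes:

  cockroach start \
    --certs-dir=certs \
    --advertise-addr=<new-node-address> \
    --locality=region=us-south1,zone=us-south1-a \
    --join=<existing-node-1>,<existing-node-2>,<existing-node-3> \
    --store=/mnt/data1/cockroach \
    --background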

Note: If we were allowing new nodes to receive data immediately, we’d want to add the nodes one region at a time, and allow the system to rebalance and stabilize before moving on to the next region. There is an unresolved question as to whether it’s better to add all the nodes in a new region at once, or add the nodes in a new region one at a time at intervals. The argument for adding them all at once is that CockroachDB tries to balance the load across regions, and that a region that initially has fewer nodes than the others may be overwhelmed. However, adding a new region all at once has occasionally been observed to produce its own performance issues.

In the screenshots below, you can see the workload running in the east1 region. The new regions are added at 16:37 with no impact on the workload.

And, because we’ve constrained the database replicas to the original 3 regions, we see that the newly added nodes only hold system ranges.

Rebalance Rate Control

When we allow the system to start moving ranges to the new nodes, we’ll be enabling a flow of replicas between nodes. This is going to consume some bandwidth, and take some time. We can tune the rate at which replicas can flow from one node to another with two cluster settings:

kv.snapshot_rebalance.max_rate

kv.snapshot_recovery.max_rate

These settings specify the per-node rate limit (bytes/second) for moving replicas between nodes. We recommend that these settings always have the same value, and the default value for both is 32MiB. You can shorten the time it takes for the cluster to rebalance its replica distribution — at the cost of increasing network usage between nodes — by increasing these settings. We can’t recommend a specific value to use; that’s dependent on too many variables (individual node performance, network capacity, current workload, etc). We do advise that 128MiB is not an unusually high value to use for this, and recommend staying below 1GiB.
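
For example, to raise both limits to 128MiB (a value chosen for this exercise, not a universal recommendation):

  SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '128 MiB';
  SET CLUSTER SETTING kv.snapshot_recovery.max_rate = '128 MiB';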

When the rebalance is complete, we’ll set these back to their prior values, as they are responsible for rate-limiting the system’s normal and ongoing rebalancing activity.
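
If the prior values were the defaults, resetting the settings is enough:

  RESET CLUSTER SETTING kv.snapshot_rebalance.max_rate;
  RESET CLUSTER SETTING kv.snapshot_recovery.max_rate;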

Migrating Data

Now we can apply a new zone configuration to relocate the replicas out of east4 and into south1.
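
Continuing with the same assumed database name and locality tiers as before, the change is simply to swap the us-east4 constraint for us-south1:

  ALTER DATABASE tpcc CONFIGURE ZONE USING
    num_replicas = 3,
    constraints = '{"+region=us-east1": 1, "+region=us-south1": 1, "+region=us-central1": 1}',
    lease_preferences = '[[+region=us-east1]]';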

We can monitor the movement of replicas off old nodes and onto new ones:
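
Besides the DB Console, one way to watch the movement from a SQL shell is to count replicas per locality; a rough query against the (unversioned) crdb_internal tables, again assuming the database is named tpcc:

  SELECT locality, count(*) AS replicas
    FROM crdb_internal.ranges, unnest(replica_localities) AS locality
   WHERE database_name = 'tpcc'
   GROUP BY locality
   ORDER BY locality;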

And verify that the workload continues to run without errors:

When the system has finished complying with the new zone configuration, the east4 region has been fully evacuated of database ranges:

Next, we update the database zone configuration to push replicas out of central1 and into west3.
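
The corresponding sketch, swapping us-central1 for us-west3:

  ALTER DATABASE tpcc CONFIGURE ZONE USING
    num_replicas = 3,
    constraints = '{"+region=us-east1": 1, "+region=us-south1": 1, "+region=us-west3": 1}',
    lease_preferences = '[[+region=us-east1]]';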

Over the same duration as before, the range replicas are relocated to west3 while the application continues to run uninterrupted:

SQL executions continue at the same rate, with some minor outlier latencies.
The range metrics highlight the replica counts and their locations.
The range-movement view shows replicas relocating across nodes over time. Because we are using the default rebalancing settings, the background data movement has little impact on online application query processing; the settings can be adjusted to give the background data movement more bandwidth, but at some cost to foreground SQL processing.
The DB Console replication graphs show all of the range replication operations, including snapshot metrics that indicate how much data movement the cluster has automated in the background.
The cluster CPU profile accounts for all foreground and background processing.

At this point, our cluster consists of 15 nodes in 5 regions, but our database occupies only the 9 nodes in 3 regions that we want to persist. The remaining 6 nodes from the original cluster do not host any replicas of database ranges, so are not participating in any distributed SQL queries. And, since the application is running in a different region, and not connecting directly to any of these nodes, none of them are gateway nodes for any clients.

Retiring nodes

Next, we will remove the nodes from east4 and central1 using the cockroach node drain and decommission commands. These commands force the nodes to give up any replicas or leaseholders they host, and to complete and close out any SQL operations they are participating in. (At this point, these will only involve any system ranges that remain on the retiring nodes, since we’ve already moved all the database ranges to other nodes.)

The node status command indicates that only a few live ranges remain on the older nodes, and that they are in a good state to be removed from the cluster. We use the drain command to transfer away the remaining leases and SQL connections before decommissioning the nodes.
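
A sketch of draining one of the retiring nodes; the address is a placeholder (on recent versions the node ID can be passed instead):

  cockroach node drain --certs-dir=certs --host=<address-of-node-to-drain>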

Finally, we command the nodes to shut down and remove themselves from the cluster:
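
A sketch of the decommission step, run for one retiring node at a time; the node ID and address are placeholders:

  cockroach node decommission 4 --certs-dir=certs --host=<address-of-any-live-node>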

We monitor the operation, waiting for all the retiring nodes to register as “decommissioned”:
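
The same status can be polled from the command line:

  cockroach node status --decommission --certs-dir=certs --host=<address-of-any-live-node>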

Finalizing

Before long, our cluster arrives at the desired final state: the migration is complete, and our application continues to run without errors.

We can also return our rebalance rate cluster settings to their previous values. Compared to how this would have been handled with other databases, whether in the traditional monolithic or NewSQL distributed space, this was extremely easy. Some of the obvious pain points we avoided:

  • the reconfiguration of nodes and data was not disruptive to the application
  • the reconfiguration was accomplished as a completely online and atomic data-movement process
  • it did not require any additional tooling or deep administration expertise

I feel this is one of my better blogs, and I chalk that up to the great contributions in editing and content from Greg Goodman. Up next, Greg and I will be working through the same process using the new multi-region commands in CockroachDB 21.X.
