Fulfilling Data Domiciling requirements with CockroachDB v21.2 features
Illustrating how to work with CockroachDB controls to set data placement policies.
Building a resilient global database for multi-region applications is easy with CockroachDB’s data placement controls, which restrict all replicas of a row to the row’s home region. It takes only a few steps once your cluster is built with node locality settings.
It starts with a feature called REGIONAL BY ROW, a table locality policy that governs where each row’s data lives. For instance, if a row representing John Doe and his attributes is inserted into the customer table from the us-east-1 region, CockroachDB keeps that row’s leaseholder in us-east-1 by adding a hidden column that tracks the row’s home region. Then, once a restricted placement policy is enabled, CockroachDB ensures that all of the row’s replicas are also placed in that home region.
Now let’s walk through the process. We have a cluster deployed across AWS us-east-1 and us-west-1, plus an Azure region, westeurope.
Also notice the longitude and latitude of the nodes, and the start command, which reflects the secure settings and locality.
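The coordinates shown for each node come from CockroachDB’s system.locations table, which the DB Console uses to place nodes on the map. A sketch of how such an entry is added (the coordinate values here are illustrative, not the author’s exact ones):

```sql
-- Map a locality tier to coordinates so the DB Console can plot the nodes.
-- Latitude/longitude values below are illustrative.
UPSERT INTO system.locations ("localityKey", "localityValue", latitude, longitude)
  VALUES ('region', 'us-east-1', 37.478397, -76.453077);
```

The locality itself is supplied at node startup via the `--locality` flag (e.g. `--locality=cloud=aws,region=us-east-1,zone=us-east-1a`), and the `localityValue` here must match one of those tiers.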
Create a database and set the preferred regions for storing data
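A sketch of the database creation step, assuming a database name of `global_db` and region names matching the node `--locality` flags (your actual names may differ):

```sql
-- Create a multi-region database: the primary region holds leaseholders
-- by default, and the other regions become available as row homes.
CREATE DATABASE global_db
  PRIMARY REGION "us-east-1"
  REGIONS "us-west-1", "westeurope";
```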
And we see the database has no data or replicas yet, so let’s add a table with the intent to domicile data.
Now let’s look at how the table was created. Notice the use of LOCALITY REGIONAL BY ROW in the CREATE TABLE statement.
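A sketch of such a statement. The table and index names come from the query plan later in this writeup; the column types and the extra attribute column are assumptions for illustration:

```sql
-- REGIONAL BY ROW: CockroachDB adds a hidden crdb_region column and
-- partitions the table (and its indexes) by it.
CREATE TABLE customer_gdpr (
    c_w_id INT NOT NULL,
    c_d_id INT NOT NULL,
    c_id   INT NOT NULL,
    c_name STRING,  -- illustrative attribute column
    UNIQUE INDEX (c_w_id, c_d_id, c_id)
) LOCALITY REGIONAL BY ROW;
```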
The hidden column shown above was added automatically. It is the partitioning column that distinguishes a row’s locality mapping, or home region. Let’s look at how the REGIONAL BY ROW setting has optimized the table.
Digging into all the ranges that make up the table, we can see that every key range has 5 replicas, with the majority stored in the primary region, us-east-1. This is because the default zone survival goal places a majority of the voting (consensus) replicas in the primary region. The database can be reconfigured for region survival, which rearranges the replicas, but in this writeup we are heading to data domiciling.
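For reference, switching survival goals is a one-line change (database name assumed from the earlier sketch):

```sql
-- Region survival spreads voting replicas across regions so the database
-- tolerates the loss of an entire region; the default is SURVIVE ZONE FAILURE.
ALTER DATABASE global_db SURVIVE REGION FAILURE;
```

Note that region survival is incompatible with the restricted placement policy used below, since domiciling requires keeping all replicas in one region.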
We now have empty data structures in place for the index-organized table and its secondary indexes, along with their partitions pivoted on the crdb_region column. Any rows inserted will have all replicas placed according to the survival policy above. Next, we set a restricted placement policy, which changes the replica configuration to follow a domiciled data pattern.
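A sketch of that step (in v21.2 the placement policy sits behind a session setting; database name assumed):

```sql
-- Enable the v21.2 placement-policy feature for this session.
SET enable_multiregion_placement_policy = on;

-- Restrict replica placement: all replicas of a REGIONAL BY ROW row
-- are constrained to that row's home region.
ALTER DATABASE global_db PLACEMENT RESTRICTED;
```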
And the replicas have been reconfigured
Now let’s start entering data. Logging into a node in AWS us-east-1, I will insert 1100 rows of customer data. Then I’ll log into a node in Azure westeurope and insert 900 rows of customer data representing Western Europe.
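This is why inserting from different regions homes the rows differently: the hidden crdb_region column defaults to the gateway node’s region, so no extra work is needed on insert. A sketch (column names and values are illustrative):

```sql
-- Run from a westeurope gateway node: crdb_region is not specified, so it
-- defaults to the gateway's region and the row is homed in westeurope.
INSERT INTO customer_gdpr (c_w_id, c_d_id, c_id, c_name)
  VALUES (0, 1, 1980, 'Ana Jansen');
```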
After 1000 rows in us-east-1
After 1000 rows in westeurope, we can see above the ranges with data only in us-east-1, and below, the ranges with data in westeurope.
Let’s now look at how the data is accessed.
Knowing customer ID 1980 was inserted and homed in the westeurope region, we can see that a query run from the us-east-1 region takes 166 ms to retrieve the row, because it performs a parallel scan across all region partitions when it does not know the row’s home region.
└── • union all
    │ estimated row count: 1
    │ limit: 1
    ├── • scan
    │     estimated row count: 1 (0.05% of the table; stats collected 18 minutes ago)
    │     table: customer_gdpr@customer_gdpr_c_w_id_c_d_id_c_id_key
    │     spans: [/'us-east-1'/0/1/1980 - /'us-east-1'/0/1/1980]
    └── • scan
          estimated row count: 1 (0.05% of the table; stats collected 18 minutes ago)
          spans: [/'us-west-1'/0/1/1980 - /'us-west-1'/0/1/1980] [/'westeurope'/0/1/1980 - /'westeurope'/0/1/1980]
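A query of this shape produces the plan above when run away from the row’s home region (table and column names are taken from the plan; the key values match the spans shown):

```sql
-- The filter covers the secondary index key but not the hidden crdb_region
-- prefix, so the optimizer scans the local partition and, in parallel,
-- unions in scans of the remote partitions.
SELECT * FROM customer_gdpr
WHERE c_w_id = 0 AND c_d_id = 1 AND c_id = 1980
LIMIT 1;
```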
We can also see that the output of SHOW RANGE FROM TABLE … FOR ROW displays that all the replicas for the row are stored in region=westeurope.
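A sketch of that check against the secondary index from the plan above (the row values, including the leading crdb_region, are assumptions matching the earlier spans):

```sql
-- Locate the range holding this row's index entry; the output includes the
-- range's replica localities, which should all be region=westeurope.
SHOW RANGE FROM INDEX customer_gdpr@customer_gdpr_c_w_id_c_d_id_c_id_key
  FOR ROW ('westeurope', 0, 1, 1980);
```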
Now let’s look at a local us-east-1 customer, ID 198. We see the latency is 2 ms [the network latency from my home laptop in NJ to the node in us-east-1, Northern Virginia]. A very nice short-circuit optimization was added to ensure that a query for a key in its local region performs well, and it’s nicely written up.
We leveraged multi-region features, REGIONAL BY ROW tables coupled with restricted data placement, to achieve data domiciling. My illustration took a two-step approach to restriction, which could also have been set up as part of the database creation process in a single step. The multi-region feature knobs are extremely powerful and offer many paths for multi-region scenarios, which can be explored in the CockroachDB documentation; additional content on data domiciling can be found here.