This article was originally published on InfoQ on April 28, 2021.
At Salesforce, we required a storage system which could work with two kinds of streams, one stream for write-ahead logs and one for data. But we have competing requirements from both of the streams. The write-ahead log stream should be low latency for writes and high throughput for reads. The data stream should have high throughputs for writes, but have low random read latency. Being the pioneers in cloud computing, we also required our storage system to be cloud-aware as the requirements of availability and durability are ever more increasing. Our deployment models are designed to be immutable in order to scale to massive levels and run on commodity hardware.
Our initial research for such a storage system confronted us with the question of build versus buy.
We decided to go with an open-source storage system given all of our expectations and requirements hinge on the main business drivers of time-to-market, resources, cost etc.
After researching what open source had to offer, we settled upon two finalists: Ceph and Apache BookKeeper. With the requirement that the system be available to our customers, scale to massive levels and also be consistent as a source of truth, we needed to ensure that the system can satisfy aspects of the CAP Theorem for our use case. Let’s take a bird’s-eye view of where BookKeeper and Ceph stand in regard to the CAP Theorem (Consistency, Availability and Partition Tolerance) and our unique requirements.
While Ceph provided Consistency and Partition Tolerance, the read path can provide Availability and Partition Tolerance with unreliable reads. There’s still a lot of work required to make the write path provide Availability and Partition Tolerance. We also had to keep in mind the immutable data requirement for our deployments.
We determined Apache BookKeeper to be the clear choice for our use case. It’s close to being the CAP system we require because of its append only/immutable data store design and a highly replicated distributed log. Other key features:
Furthermore, Salesforce has always encouraged open source and the Apache BookKeeper community is an active and vibrant community.
While Apache BookKeeper has checked most of the boxes for our requirements, there are still gaps. Before we get into the gaps, let's understand what Apache BookKeeper provides.
From a durability standpoint, ledgers are replicated across an ensemble of bookies, and entries within the ledgers are also striped across the ensemble.
Writes are confirmed based on configurable write and ack quorums. This provides low write latencies and the ability to scale.
However, it proved challenging to run BookKeeper on commodity hardware in the cloud.
The data placement policies are not natively cloud-aware and do not take into consideration the underlying public cloud provider (cloud substrate). The way some users currently deploy it is by manually identifying nodes in different availability zones, logically grouping them together, and retrofitting an existing placement policy on those node groups. While this is a work-around, it does not provide any support for zone failures nor does it provide ease of use while maintaining and upgrading massive clusters.
All of the cloud substrates have notoriously seen downtime across their availability zones from time to time and the general understanding has been for the applications to design for these faults. A great example is Netflix during the Christmas of 2012 when it got affected by a Amazon Web Services Availability Zone outage. The Netflix service kept running at a limited capacity, even though the underlying public cloud infrastructure on which it depended was down.
The Internet--from websites, to apps, and even enterprise software--is mostly run on the infrastructure provided by public cloud providers. This is because public cloud infrastructure is easily scalable and, to an extent, less expensive to use and maintain. However, it is not without its faults, one of which is unavailability either at node, zone or region level. If the underlying infrastructure is unavailable, there is literally nothing the users can do. This may be due to an outage in certain machines, zones, or regions. It could even be due to network latencies increasing due to faulty hardware. So in the end, as applications running on top of this public cloud infrastructure, the onus is on us--the developers--to design with its faults in mind.
Apache BookKeeper doesn’t have a built-in answer for this potentiality, so we needed to design a fix.
Now that we have a general understanding of what the problems are, we started to think about how we could fix them in order to make Apache BookKeeper cloud-aware and match our requirements. We boiled down the gaps as follows:
Let's take a look at how we addressed these gaps.
The existing BookKeeper architecture provides each bookie with a unique identity that is assigned at the time of its first boot up. This identity is persisted in the metadata store (Zookeeper) and is available for access by other bookies or by clients.
The first step to making Apache BookKeeper cloud-aware is to make each of the bookies aware of their place in a cluster deployed in Kubernetes. We thought these cookies would be the best place for this information.
We added a field called the ‘networkLocation’ to the cookie which consists of the two main components that can locate a bookie - the availability zone and the upgrade domain. Working with Kubernetes allows us to be substrate agnostic, so we used the Kubernetes APIs to query the underlying availability zone information. We also generate the ‘upgradeDomain’ based on a formula which involves the hostname ordinal index. The upgradeDomain can be used for rolling upgrades without affecting the availability of the cluster.
All of this information is generated at boot up and persisted in the metadata store for access.
This information can now be used when generating ensembles, assigning bookies to ensembles, and deciding which bookies to replicate from or to replicate to.
Now that we have the intelligence in the client such that it can communicate with bookies in certain zones, we need to ensure that we have a data placement policy which utilizes this information. One of our unique developments is the ZoneAwareEnsemblePlacementPolicy (ZEPP). It’s a two-level hierarchical placement policy specifically designed for cloud-based deployments. ZEPP is aware of Availability Zones (AZ) and upgradeDomains (UD).
An AZ is a logical isolated data center within a region. A UD is a set of nodes in an AZ that can be brought down together with no impact to the service. It also provides heuristics to understand when a zone is down and when it’s back.
Below is a visualization of a deployment which ZEPP can use. This deployment considers the AZ and UD information from the cookies to group the bookie nodes as shown.
With these changes, we’ve been able to make Apache BookKeeper truly cloud-aware. However, it’s also important to consider costs when designing this architecture.Most cloud substrates charge for unidirectional data transfer out of their service, and charges are different when it's across availability zones. This is an important factor to consider for the BookKeeper client since it currently picks an arbitrary bookie from an ensemble to satisfy a read.
If that bookie is from another availability zone than the client, this would result in unnecessary charges. Data replication would now occur between bookies which are spread across availability zones and would result in more costs in scenarios where an availability zone goes down.
Let’s take a look at how we handled these specific scenarios.
Currently the BookKeeper client picks an arbitrary bookie from an ensemble to satisfy a read.
With our reordered reads feature, the client now picks a bookie such that it minimizes read latencies while reducing costs.
With reordered reads feature switched on the client picks bookies in the following ordering:
In this order, we satisfy the latency vs cost requirements with a decent trade-off in a system that has been running for a long time and has experienced faults.
In case of a zone down scenario, now all bookies from different ensembles would start replicating their data to bookies in currently available zones to satisfy ensemble size constraints and quorum requirements, causing a ‘thundering herd problem.’
The way we approached this problem is to first decide when a zone is actually down. Failures can be transient blips; we don’t want to start replicating terabytes of data just because of a network blip causing a zone to be unavailable. At the same time, we want to be ready for the time when there is a real failure.
Our solution has two phases:
The below diagram shows the heuristics we consider when declaring a zone down vs the zone coming back up.
HighWaterMark and LowWaterMark are two values that are computed based on the number of bookies available in a zone vs the total bookies in that zone. The thresholds for these two values are configurable so users can decide the sensitivity of failure with respect to what they deem to be a failure.
We also disable auto replication when a zone is marked as down to avoid the automatic replication of terabytes of data across zones. We added alerts in its place to alert the user of a possible zone failure. We believe an operations expert would be able to differentiate the noise from a real failure and make the call on whether to start auto replication of an entire zone.
We also provide bookie shell commands that kick-start the auto replication that had been disabled.
Apache BookKeeper is a very active open source project and has an amazingly supportive community with vibrant discussions for all of the challenges faced by the system. As it is a source of truth for many of its users, it needs to become cloud-aware.
However, such an architectural change comes with trade-offs and deciding factors at every level - availability, latency, costs, ease of deployment, and maintenance. The above considerations and changes have been battle-tested at Salesforce, and we are now able to support AZ and AZ+1 failure using Apache BookKeeper. We have already merged a few of our changes and will continue to contribute more to the community in the coming releases. These additions aim to make it easier to patch, upgrade, and restart clusters with minimal impact on consuming services.