Scalability Cheat Sheet #1 - From Basics to Replication

Scalability Basics, Vertical vs. Horizontal Scaling, and Introduction to Replication 👥

Sep 06, 2024

Photo by Darren Mosher on Unsplash - The Empire Got Replication Right

Scalability is a key non-functional requirement that can sometimes feel a bit elusive, especially when you're diving into system design for the first time. This cheat sheet pulls together essential concepts of scalability in a practical, opinionated guide.

In this first part, we'll focus on the basics of scaling and replication, setting a strong foundation for anyone looking to build or optimize scalable systems.

You'll learn about:

🏗️ Vertical vs. Horizontal Scaling: The pros and cons of scaling up versus scaling out.
👑 Leader-Based Replication: Delve into single-leader and multi-leader setups, understanding how they manage consistency and scale.
🤷‍♂️ Leaderless Replication: Explore how leaderless systems handle writes and manage conflicts through quorum reads and writes.
🔄 Replication Methods: Compare synchronous, semi-synchronous, and asynchronous replication to balance performance and consistency.

Scalability Cheat Sheet #2 - When Things go Wrong and Partitioning

Mirco

September 11, 2024

What is Scalability in the First Place? 🚀

Scalability is all about ensuring your system can handle increasing load, whether it doubles, triples, or grows even further, without a drop in performance or a steep rise in costs. It’s crucial for sustaining efficiency as demand grows.

There are two main strategies:

🏗️ Vertical Scaling

Vertical Scaling means adding more resources (CPU, RAM, disk space, etc.) to a single host. It's less favored these days due to several downsides ❌:

Cost and Performance Imbalance: Costs can rise faster than the performance gained, especially at the high end of available hardware.
Physical Limits: There's a ceiling to how much you can scale a single machine, constrained by technology and physics.
Downtime: Upgrades, like adding a new CPU, may require downtime.
Limited Flexibility: Scaling down isn't straightforward, making it hard to respond to decreased loads.

🌐 Horizontal Scaling

Horizontal Scaling involves adding more hosts and distributing the load across them. It has several advantages ✅ over vertical scaling:

Cost-Effectiveness: You can use off-the-shelf components (e.g., deploying 24 machines with 4-core CPUs instead of a single 96-core machine).
Flexibility: It can adapt to traffic increases and decreases.
Automation: Scaling can be fully automated, especially in cloud environments.

However, horizontal scaling also introduces complexity, such as managing distributed storage and computing, and handling increased network traffic.

In the age of cloud computing, horizontal scaling is often the preferred approach. The rest of this post focuses on horizontal scaling and its challenges.
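
To make "distributing the load across hosts" a bit more concrete, here is a minimal sketch of round-robin distribution. Round-robin is only one possible strategy and is chosen here for illustration; the host names and the `distribute` function are hypothetical.

```python
from itertools import cycle


def distribute(requests, hosts):
    """Assign each request to a host in round-robin order."""
    assignment = {host: [] for host in hosts}
    # cycle() repeats the host list forever; zip() stops when requests run out.
    for request, host in zip(requests, cycle(hosts)):
        assignment[host].append(request)
    return assignment


hosts = ["host-a", "host-b", "host-c"]  # hypothetical host names
load = distribute(range(9), hosts)
print(load)
# {'host-a': [0, 3, 6], 'host-b': [1, 4, 7], 'host-c': [2, 5, 8]}
```

Adding a fourth host only means extending the `hosts` list, which is the kind of flexibility scaling out is meant to provide.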

Replication 👯‍♂️: Keeping Your Data Available

Replication is a key approach in horizontal scaling, where the basic idea is to copy data across multiple nodes. This setup offers several advantages ✅:

Load Distribution: Each node only handles a fraction of the requests, balancing the workload.
Fault Tolerance: If one node fails, requests can be redirected to another, improving system resilience.
Reduced Latency: Clients can connect to geographically closer nodes, which reduces latency.
Increased Data Safety: With multiple copies, the risk of data loss decreases.

However, replication introduces new challenges ⚠️:

Increased Complexity: The overall system becomes more complex, both in terms of architecture and data management.
Network Dependencies: Many operations now rely on network communication, adding potential points of failure and latency.
Data Consistency: Keeping data synchronized across nodes is a major concern—should you strive for perfect consistency, or is eventual consistency acceptable?

Different replication approaches can be viewed through the lens of these challenges. Some setups designate specific nodes to manage consistency, leading to single-leader or multi-leader replication. Alternatively, leaderless replication takes a different approach altogether.

There’s also another axis to consider: how nodes communicate with each other. This gives rise to synchronous, semi-synchronous, and asynchronous replication models.

Leader-Based Replication 👑

Leader-based replication scales systems by designating leaders to manage write operations, which are then propagated to followers. This approach balances load, improves read performance, and enhances fault tolerance but also presents challenges related to consistency and failure management.

How It Works:

Leaders and Followers:
- The Leader 👑 handles all write operations and ensures changes are replicated to other nodes.
- Followers receive updates from the leader and typically serve read requests, helping to distribute the load.
Replication Methods:
- Synchronous Replication: Writes are only considered successful when all followers have applied them, ensuring strong consistency but potentially increasing write latency.
- Semi-Synchronous Replication: A write is successful when a subset of followers (often just one) applies the change, offering a middle ground between consistency and performance.
- Asynchronous Replication: Writes are successful as soon as the leader processes them, with followers catching up later, improving performance but risking consistency issues due to replication lag.
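
The leader/follower split described above can be sketched in a few lines. This is a toy in-memory model, not how any particular database implements it: the `Node` and `Cluster` classes are hypothetical, and propagation happens instantly inside `write`, whereas real systems typically ship a replication log over the network.

```python
class Node:
    """A replica holding a simple key-value store."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class Cluster:
    """One leader accepts all writes; followers serve reads in turn."""

    def __init__(self, leader, followers):
        self.leader = leader
        self.followers = followers
        self._next_reader = 0

    def write(self, key, value):
        # All writes go through the leader first...
        self.leader.data[key] = value
        # ...and are then propagated to every follower (instantly, in this toy).
        for follower in self.followers:
            follower.data[key] = value

    def read(self, key):
        # Spread reads across followers to take load off the leader.
        follower = self.followers[self._next_reader % len(self.followers)]
        self._next_reader += 1
        return follower.name, follower.data.get(key)


cluster = Cluster(Node("leader"), [Node("follower-1"), Node("follower-2")])
cluster.write("user:1", "Ada")
first = cluster.read("user:1")
second = cluster.read("user:1")
print(first, second)
# ('follower-1', 'Ada') ('follower-2', 'Ada')
```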

✅ Advantages of Leader-Based Replication:

Improved Scalability: Distributing read operations across multiple followers can significantly increase the system's ability to handle larger loads.
Fault Tolerance: Followers can act as backups and be promoted to leaders if the current leader fails, providing resilience against node failures.
Reduced Latency: By directing read operations to followers closer to the clients, latency can be reduced, enhancing user experience.

❌ Downsides of Leader-Based Replication:

Replication Lag: Particularly with asynchronous replication, followers may not always be in sync with the leader, leading to stale reads and inconsistencies.
Single Point of Failure: In single-leader configurations, the leader is a critical failure point, which can halt write operations until a new leader is elected.
Conflict Resolution: In multi-leader setups, concurrent writes on different leaders can lead to conflicts, requiring strategies to resolve discrepancies between nodes.
Split Brain: Network issues or failures can cause multiple nodes to incorrectly assume the leader role, resulting in data inconsistencies.

Single-Leader Replication 🥇

In this scenario, there exists exactly one leader.

✅ Advantages (compared to multi-leader):

Simplicity and Predictability: The architecture is straightforward, making it easier to implement, debug, and reason about system behavior.
Consistency Management: Having a single leader ensures a clear source of truth for data writes, simplifying consistency management.

❌ Downsides (compared to multi-leader):

Single Point of Failure: The leader is a critical component; if it goes down, writes are paused until a new leader is elected, which can cause downtime.
Scalability Limits on Writes: Since all writes funnel through a single leader, there's an upper limit to write throughput. Synchronous replication makes this worse, because every follower must confirm each write, which adds latency.

Multi-Leader Replication 🤝

Here, we can have an arbitrary number of leaders, and that number may change at any time.

✅ Advantages (compared to single-leader):

Enhanced Write Availability: Writes can be accepted by multiple leaders, reducing the dependency on any single node and increasing write availability.
Better Geographical Distribution: Leaders in different locations can handle writes closer to their respective regions, reducing latency for distributed users.
Improved Fault Tolerance: If one leader fails, other leaders can continue to handle writes, reducing the impact of node failures on the system's availability.

❌ Downsides (compared to single-leader):

Conflict Resolution Complexity: When the same data is written concurrently on different leaders, conflicts can occur, requiring complex conflict resolution strategies.
Increased Operational Complexity: Managing multiple leaders adds complexity to the system, including ensuring all leaders remain consistent and coordinating writes between them.
Split Brain Scenarios: Multi-leader setups are more susceptible to split brain issues during network partitions, where isolated leaders may independently accept writes, leading to data divergence.
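
To illustrate what a conflict resolution strategy can look like, here is a minimal sketch of "last write wins" (LWW). LWW is just one common option, picked here as an example and not something this post prescribes. The `Versioned` type and its fields are hypothetical, and the sketch assumes leaders attach comparable timestamps to their writes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: float  # assumed to be comparable across leaders
    leader_id: str    # used only to break timestamp ties deterministically


def last_write_wins(a, b):
    """Keep the write with the higher (timestamp, leader_id) pair."""
    return max(a, b, key=lambda write: (write.timestamp, write.leader_id))


# Two leaders accepted different values for the same key at nearly the same time.
eu = Versioned("alice@eu.example", timestamp=100.0, leader_id="eu")
us = Versioned("alice@us.example", timestamp=100.5, leader_id="us")

winner = last_write_wins(eu, us)
print(winner.value)
# alice@us.example
```

Note the trade-off: the losing write is silently discarded, so LWW is simple but can lose data when clocks drift or when both writes matter.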

Leader-based replication improves scalability but has drawbacks like single points of failure. Leaderless replication offers an alternative, with different trade-offs and solutions.

Leaderless Replication 🤷‍♂️

As the name suggests, in leaderless replication, there are no designated leaders—every node can accept writes, effectively making every node a leader.

How It Works:

Quorum Concept:
- Clients perform read and write operations on multiple nodes, using a concept called quorum:
- With n total nodes, clients write to w nodes and read from r nodes.
- All data is versioned with timestamps or version numbers.
- For consistency, choose w + r > n: every read set then overlaps every write set in at least one node, so reads always include the latest version.
- Example: With n = 5, w = 3, and r = 3, writing to three nodes means two might have outdated versions, but reading from three nodes ensures at least one has the current version.
- Figure: Quorum with n = 3, r = 2, and w = 2.
Failure Tolerance:
- If w < n, the system remains operational even if some nodes fail (e.g., with n = 5 and w = 3, up to two nodes can fail without impacting availability).
- Similarly, if r < n, the system can tolerate node failures during reads.
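
The quorum rule above can be checked with a small simulation. This is a sketch under simplifying assumptions: versions are plain integers, every contacted node answers, and the function names are hypothetical. It replays the n = 5, w = 3, r = 3 example and tries every possible read set.

```python
from itertools import combinations


def is_quorum(n, w, r):
    """True if every read set must overlap every write set."""
    return w + r > n


def min_overlap(n, w, r):
    """Smallest possible number of nodes that are in both sets."""
    return max(0, w + r - n)


def quorum_read(replies):
    """Given (version, value) replies, return the one with the newest version."""
    return max(replies, key=lambda reply: reply[0])


n, w, r = 5, 3, 3
nodes = [(1, "old")] * n          # all nodes start at version 1
for i in range(w):                # the write reaches only w nodes
    nodes[i] = (2, "new")

# Try every possible choice of r nodes to read from.
results = {
    quorum_read([nodes[i] for i in read_set])
    for read_set in combinations(range(n), r)
}
print(is_quorum(n, w, r), min_overlap(n, w, r), results)
# True 1 {(2, 'new')}
```

With w = 2 and r = 2 on the same five nodes, `is_quorum` returns `False`, and some read sets would only see the old version.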

✅ Advantages (compared to leader-based replication):

No Leader Failover Required: Eliminates the need for leader election and failover processes.
High Resilience: Continues functioning even with multiple node failures.

❌ Downsides (compared to leader-based replication):

Complex Conflict Resolution: Handling write conflicts is more complex since all nodes accept writes independently.
Increased Latency: Quorum reads and writes can introduce additional latency due to interactions with multiple nodes.

Leaderless replication promotes resilience with no single point of failure but complicates conflict resolution and may increase latency. It's ideal for high-availability needs but requires careful management of data consistency.

Sync vs. Async Replication 🔄: Balancing Consistency and Performance

Replication modes—synchronous, semi-synchronous, and asynchronous—affect how writes are handled across nodes, balancing consistency and performance. Choose based on whether you need immediate consistency or faster writes.

Synchronous Replication

Synchronous Replication is the simplest mode. In this setup, the leader waits until all replicas have executed the write.

✅ Advantages:

Strong Consistency: Once a write is acknowledged, every replica already contains it.
Easier Fail-over: Since all replicas are in sync, any replica can become the new leader without the risk of lost writes after a leader failure.

❌ Downsides:

Clients must wait until all replicas complete the write.
Sensitive to network latency.
One failing replica renders the system unusable for writes.
Not practical in leaderless scenarios.

Use Cases: Best for critical applications where strong consistency is paramount, like financial transactions.

Asynchronous Replication

In Asynchronous Replication, writes are considered successful once the leader executes the write.

✅ Advantages:

Lower Latency on Writes: Improves system responsiveness.
High Availability: Writes keep succeeding even if followers are slow or unavailable.

❌ Downsides:

Replication Lag: Nodes can get out of sync (we will discuss Replication Lag in part 2).
Risk of Data Loss: If a leader fails, unsynced writes may be lost.

Use Cases: Suitable for applications where write speed is critical, and occasional data staleness is acceptable, like logging, analytics, or social networks.
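
The two downsides above, replication lag and lost writes, can be made visible with a toy model. This is a sketch with a single follower and an in-memory log. The `AsyncLeader` class and its methods are hypothetical, and `replicate` stands in for whatever background process ships changes to the follower.

```python
class AsyncLeader:
    """Leader that acknowledges writes before the follower has them."""

    def __init__(self):
        self.log = []        # every write the leader has acknowledged
        self.replicated = 0  # number of log entries the follower has applied

    def write(self, entry):
        # Acknowledged immediately; no waiting for the follower.
        self.log.append(entry)

    def replicate(self, batch_size):
        # The follower catches up by at most batch_size entries.
        self.replicated = min(len(self.log), self.replicated + batch_size)

    def lag(self):
        """Number of acknowledged writes the follower has not seen yet."""
        return len(self.log) - self.replicated

    def lost_on_failover(self):
        """Writes that would vanish if the leader died right now."""
        return self.log[self.replicated:]


leader = AsyncLeader()
for i in range(1, 6):
    leader.write(f"w{i}")
leader.replicate(batch_size=3)

print(leader.lag(), leader.lost_on_failover())
# 2 ['w4', 'w5']
```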

Semi-Synchronous Replication

Semi-Synchronous Replication aims to get the best of both worlds. One follower per leader is synchronous, while the rest are updated asynchronously.

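Partial Consistency: Every acknowledged write exists on the leader and on at least one follower, which reduces the risk of lost writes if the leader fails. The remaining followers may still lag behind.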
Use Cases: Ideal for systems needing a balance between consistency and performance, such as content management systems or collaborative tools.

In summary, synchronous replication ensures strong consistency but can add latency, while asynchronous replication boosts performance with potential data lag. Choose based on whether your priority is consistency or speed.
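
One way to compare the three modes side by side is to ask how many follower acknowledgements the leader waits for before reporting success. The sketch below is a simplification: the mode names and both functions are hypothetical, and semi-synchronous is modeled as "any one follower acknowledged" instead of one designated synchronous follower.

```python
def required_acks(mode, follower_count):
    """How many follower acknowledgements a write needs in each mode."""
    if mode == "sync":
        return follower_count          # every follower must confirm
    if mode == "semi-sync":
        return min(1, follower_count)  # one follower is enough
    if mode == "async":
        return 0                       # the leader alone decides
    raise ValueError(f"unknown mode: {mode}")


def write_succeeds(mode, follower_acks):
    """follower_acks holds one bool per follower: did it confirm the write?"""
    return sum(follower_acks) >= required_acks(mode, len(follower_acks))


# Three followers, one of which is down and never acknowledges.
acks = [True, False, True]
outcome = {mode: write_succeeds(mode, acks) for mode in ("sync", "semi-sync", "async")}
print(outcome)
# {'sync': False, 'semi-sync': True, 'async': True}
```

The single failed follower blocks the synchronous write but not the other two, which mirrors the availability trade-off described in this section.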

Key Takeaways 📌

Here are the key takeaways to solidify your understanding of scalability concepts covered in this guide.

Scalability is essential for systems to handle increasing loads efficiently without performance loss or disproportionate cost increases.
Horizontal Scaling 🌐 is preferred in modern architectures, spreading the load across multiple nodes, offering flexibility and automation but adding complexity in distributed data management.
Replication 👯‍♂️ improves data availability and fault tolerance by duplicating data across multiple nodes, balancing load, and reducing latency for geographically distributed clients.
Leader-Based Replication 👑 involves leaders managing writes and followers handling reads, offering simplicity and consistency but posing risks of single points of failure and potential conflicts in multi-leader setups.
Leaderless Replication 🤷‍♂️ allows any node to handle writes, enhancing resilience but complicating conflict resolution and increasing latency through quorum-based reads and writes.
Replication Modes 🔄 (Synchronous, Semi-Synchronous, Asynchronous) balance trade-offs between consistency and performance; synchronous offers strong consistency, while asynchronous favors performance at the risk of data lag.

If you found this guide helpful, please share it and drop your thoughts in the comments below—I'd love to hear your feedback!

Thanks for reading verbosemode! This post is public so feel free to share it.

Stay tuned for Part 2, where we’ll explore replication lag ⏳, write conflicts ⚔️, failover strategies 🔧, and data partitioning 🧩.

Special thanks to Nina for her great proofreading and tips! 😊

Happy scaling, and see you in Part 2! 🚀
