Table of Contents
System Design
Return to Acing the System Design Interview (AceSysDsgnInt)
Chapter 1
“System design interview preparation is critical to your career and also benefits your company. The system design interview is a discussion between engineers about designing a software system that is typically provided over a network. GeoDNS, caching, and CDN are basic techniques for scaling our service. CI/CD tools and practices allow feature releases to be faster with fewer bugs. They also allow us to divide our users into groups and expose each group to a different version of our app for experimentation purposes. Infrastructure as Code tools like Terraform are useful automation tools for cluster management, scaling, and feature experimentation. Functional partitioning and centralization of cross-cutting concerns are key elements of system design. ETL jobs can be used to spread out the processing of traffic spikes over a longer time period, which reduces our required cluster size. Cloud hosting has many advantages. Cost is often but not always an advantage. There are also possible disadvantages such as vendor lock-in and potential privacy and security risks. Serverless is an alternative approach to services. In exchange for the cost advantage of not having to keep hosts constantly running, it imposes limited functionality.”
“System design interview preparation emphasizes critical practices that benefit both individual careers and company performance. The system design interview centers on discussions about creating scalable, reliable, and efficient software systems. Techniques like GeoDNS, caching, and CDNs improve scalability, while CI/CD methodologies ensure faster feature rollouts with fewer issues. These practices also support A/B testing, enabling experimentation with different application versions for specific user groups. Tools like Terraform, introduced in 2014, enhance automation for cluster management and scaling. Incorporating functional partitioning and addressing cross-cutting concerns are pivotal for clean architecture. ETL jobs mitigate traffic spikes, reducing cluster demands. Cloud hosting offers flexibility but involves risks like vendor lock-in. Finally, serverless computing, introduced by AWS Lambda in 2014, offers cost efficiency but limits functionality.
Infrastructure decisions, such as choosing between cloud hosting and bare metal, must account for scalability, cost, and privacy. ETL pipelines efficiently manage data, smoothing traffic demands. The trade-offs in using CDNs highlight benefits like latency reduction against concerns like security. Serverless architecture eliminates idle server costs but requires adjusted system designs to accommodate limitations. The focus on experimentation through CI/CD and A/B testing aligns system design goals with user satisfaction. Employing advanced techniques like functional partitioning ensures modularity and performance optimization. These elements underscore the multidimensional nature of system design, balancing performance, cost, and user-centric adaptability.”
https://en.wikipedia.org/wiki/Content_delivery_network
https://en.wikipedia.org/wiki/Terraform_(software)
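The user-group experimentation mentioned above (exposing each group to a different app version) is often implemented as deterministic hash bucketing. A minimal sketch, assuming hypothetical group names and a string user ID:

```python
import hashlib

def experiment_group(user_id: str, groups: list[str]) -> str:
    """Deterministically assign a user to an experiment group.

    Hashing the user ID gives a stable assignment: the same user
    always lands in the same group across requests and deploys.
    """
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % len(groups)
    return groups[bucket]

# Example: split users between the current and experimental versions.
groups = ["control", "variant"]
assignment = experiment_group("user-12345", groups)
assert experiment_group("user-12345", groups) == assignment  # stable
```

Hashing rather than random assignment ensures a user sees the same version on every visit, which keeps experiment results clean.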
Chapter 2
“Everything is a tradeoff. Low latency and high availability increase cost and complexity. Every improvement in certain aspects is a regression in others.
Be mindful of time. Clarify the important points of the discussion and focus on them.
Start the discussion by clarifying the system’s requirements and discuss possible tradeoffs in the system’s capabilities to optimize for the requirements.
The next step is to draft the API specification to satisfy the functional requirements.
Draw the connections between users and data. What data do users read and write to the system, and how is data modified as it moves between system components?
Discuss other concerns like logging, monitoring, alerting, search, and others that come up in the discussion.
After the interview, write your self-assessment to evaluate your performance and learn your areas of strength and weakness. It is a useful future reference to track your improvement.
Know what you want to achieve in the next few years and interview the company to determine if it is where you wish to invest your career.
Logging, monitoring, and alerting are critical to alert us to unexpected events quickly and provide useful information to resolve them.
Use the four golden signals and three instruments to quantify your service’s observability.
Log entries should be easy to parse, small, useful, categorized, have standardized time formats, and contain no private information.
Follow the best practices of responding to alerts, such as runbooks that are useful and easy to follow, and continuously refine your runbook and approach based on the common patterns you identify.”
System design involves trade-offs. Enhancing low latency and high availability can increase both cost and complexity. Prioritizing one aspect may lead to a regression in another, making balanced optimization essential.
https://en.wikipedia.org/wiki/Systems_design
Time management is critical during discussions. Start by clarifying the most important system requirements and focus on trade-offs to optimize the system's capabilities to meet these requirements effectively.
https://en.wikipedia.org/wiki/Software_requirements
Begin system design by defining the API specification to meet functional requirements. This step ensures that the system supports the necessary operations and interactions between its components.
https://en.wikipedia.org/wiki/API
Analyze the connections between users and data. Identify what data users read, write, and modify and how it transitions between different system components. This process is essential for understanding the system’s data flow and structure.
https://en.wikipedia.org/wiki/Data_flow
Address other key concerns, such as logging, monitoring, alerting, and search, which often arise during system design discussions. These elements provide operational insights and improve the system's reliability.
https://en.wikipedia.org/wiki/Logging_(computing)
After an interview, writing a self-assessment helps identify your strengths and areas for improvement. This reflective practice serves as a valuable reference for tracking personal progress and skill development.
https://en.wikipedia.org/wiki/Self-assessment
Consider long-term career goals when evaluating a company. Use interviews to determine if the organization aligns with your objectives and where you want to invest your career efforts.
https://en.wikipedia.org/wiki/Career_development
Implement logging, monitoring, and alerting to quickly detect unexpected events. These tools help provide actionable insights for resolving issues efficiently and maintaining system reliability.
https://en.wikipedia.org/wiki/System_monitor
The four golden signals and three instruments are key to measuring service observability. Use structured log entries that are easy to parse, small, and categorized, with standardized time formats and no sensitive information.
https://en.wikipedia.org/wiki/Observability_(software)
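A log entry meeting these guidelines (small, easy to parse, categorized, standardized timestamp, no private data) might look like this sketch; the field names are assumptions, not a fixed schema:

```python
import json
from datetime import datetime, timezone

def log_entry(category: str, message: str, **fields) -> str:
    """Build a small, parseable JSON log line with a standardized
    ISO-8601 UTC timestamp. Callers must not pass private user data."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "category": category,
        "message": message,
        **fields,
    }
    return json.dumps(entry)

line = log_entry("checkout", "payment authorized", order_id="A-1001")
parsed = json.loads(line)  # structured entries parse back trivially
assert parsed["category"] == "checkout"
```

Emitting one JSON object per line lets downstream tooling filter and aggregate entries without fragile regex parsing.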
Follow best practices for responding to alerts by maintaining actionable runbooks. Continuously refine these runbooks based on patterns observed in your alerting processes to improve their utility over time.
Chapter 3
Summary
We must discuss both the functional and non-functional requirements of a system. Do not make assumptions about the non-functional requirements. Non-functional characteristics can be traded off against each other to optimize for the non-functional requirements.
Scalability is the ability to easily adjust the system’s hardware resource usage for cost efficiency. This is almost always discussed because it is difficult or impossible to predict the amount of traffic to our system.
Availability is the percentage of time a system can accept requests and return the desired response. Most, but not all, systems require high availability, so we should clarify whether it is a requirement in our system.
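An availability target translates directly into an allowed downtime budget; a quick back-of-the-envelope sketch:

```python
def downtime_per_year(availability_pct: float) -> float:
    """Return allowed downtime in minutes per year for an
    availability target expressed as a percentage."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    return minutes_per_year * (1 - availability_pct / 100)

# "Three nines" allows roughly 525.6 minutes (~8.8 hours) per year.
print(round(downtime_per_year(99.9), 1))   # 525.6
# "Four nines" tightens that to about 52.6 minutes per year.
print(round(downtime_per_year(99.99), 1))  # 52.6
```

Framing the requirement as a downtime budget makes the cost of each extra "nine" concrete during the interview discussion.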
Fault-tolerance is the ability of a system to continue operating if some components fail and the prevention of permanent harm should downtime occur. This allows our users to continue using some features and buys time for engineers to fix the failed components.
Performance or latency is the time taken for a user’s request to the system to return a response. Users expect interactive applications to load fast and respond quickly to their input.
Consistency is defined as all nodes containing the same data at a moment in time, and when changes in data occur, all nodes must start serving the changed data at the same time. In certain systems, such as financial systems, multiple users viewing the same data must see the same values, while in other systems such as social media, it may be permissible for different users to view slightly different data at any point in time, as long as the data is eventually the same.
Eventually consistent systems trade off accuracy for lower complexity and cost.
Complexity must be minimized so the system is cheaper and easier to build and maintain. Use common techniques, such as common services, wherever applicable.
Cost discussions include minimizing complexity, cost of outages, cost of maintenance, cost of switching to other technologies, and cost of decommissioning.
Security discussions include which data must be secured and which can be unsecured, followed by using concepts such as encryption in transit and encryption at rest.
Privacy considerations include access control mechanisms and procedures, deletion or obfuscation of user data, and prevention and mitigation of data breaches.
Cloud native is an approach to system design that employs a collection of techniques to achieve common non-functional requirements.
Chapter 4
4 Scaling databases
This chapter covers
Understanding various types of storage services
Replicating databases
Aggregating events to reduce database writes
Differentiating normalization vs. denormalization
Caching frequent queries in memory
In this chapter, we discuss concepts in scaling databases, their tradeoffs, and common databases that utilize these concepts in their implementations. We consider these concepts when choosing databases for various services in our system.
Summary
“Designing a stateful service is much more complex and error-prone than a stateless service, so system designs try to keep services stateless, and use shared stateful services.” (AceSysDsgnInt)
Stateful services maintain client session information, making their design and operation inherently more complex. This complexity arises from the need to manage session persistence, data synchronization, and scalability, especially in distributed systems. For instance, a stateful service must ensure that session data is preserved across server instances or regions, introducing challenges like data replication and consistency. These challenges can increase the risk of errors, especially during scaling or system failures, making it harder to achieve high availability and low latency.
https://en.wikipedia.org/wiki/Stateful_protocol
In contrast, stateless services simplify system design by not maintaining session information. Instead, all necessary data is included in each client request, allowing any server to process the request without relying on external state management. This approach reduces coupling between components and makes horizontal scaling straightforward, as new instances can be added without redistributing state. Stateless services are particularly advantageous in microservices architectures, where each service operates independently.
https://en.wikipedia.org/wiki/Stateless_protocol
To address the challenges of stateful services, system designs often incorporate shared stateful services like databases or distributed caches to handle state management centrally. Tools such as Redis, introduced in 2009, or Apache Kafka, introduced in 2011, are commonly used to store and synchronize state across the system. This centralization reduces the complexity within individual services, allowing them to remain stateless while relying on robust external systems for state management.
https://en.wikipedia.org/wiki/Redis
https://en.wikipedia.org/wiki/Apache_Kafka
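The pattern of keeping application servers stateless while a shared service holds session state can be sketched as follows; the in-memory dict is a stand-in for an external store such as Redis, and all names are hypothetical:

```python
# Shared stateful store (stand-in for Redis): any server instance can
# read or write it, so the servers themselves hold no session state.
session_store: dict[str, dict] = {}

def handle_request(server_name: str, session_id: str, path: str) -> dict:
    """A stateless handler: everything it needs comes from the request
    and the shared store, so any server can serve any request."""
    session = session_store.setdefault(session_id, {"visits": 0})
    session["visits"] += 1
    return {"served_by": server_name, "path": path, "visits": session["visits"]}

# Two different server instances serve the same session interchangeably.
r1 = handle_request("server-a", "sess-42", "/home")
r2 = handle_request("server-b", "sess-42", "/cart")
assert r2["visits"] == 2  # state survived the switch between servers
```

Because no server owns the session, instances can be added or removed freely; the hard state-management problems are delegated to the shared store.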
“Each storage technology falls into a particular category. We should know how to distinguish these categories, which are as follows: Database, which can be SQL or NoSQL. NoSQL can be categorized into column-oriented or key-value. Document database. Graph database. File storage. Block storage. Object storage.” (AceSysDsgnInt)
Storage technologies are categorized to support different data models and use cases. Databases are a common category, divided into SQL and NoSQL databases. SQL databases like MySQL (introduced in 1995) follow a relational model, storing data in structured tables with predefined database schemas, making them ideal for transactional applications requiring strong consistency. In contrast, NoSQL databases cater to unstructured data or semi-structured data and are optimized for scalability and flexibility.
https://en.wikipedia.org/wiki/SQL
NoSQL databases further branch into subcategories like column-oriented databases and key-value databases. Column-oriented databases, such as Apache Cassandra (introduced in 2008), store data in columns rather than rows, making them efficient for analytical queries over large datasets. Key-value databases like Redis (introduced in 2009) are optimized for rapid access to data stored as key-value pairs, often used in caching and session management.
https://en.wikipedia.org/wiki/NoSQL
Document databases are another category of NoSQL databases, designed to store semi-structured data as documents. These databases, such as MongoDB (introduced in 2009), use formats like JSON to store hierarchical data, providing flexibility and ease of integration with modern application development practices.
https://en.wikipedia.org/wiki/Document-oriented_database
Graph databases like Neo4j (introduced in 2007) are specialized for handling data represented as interconnected entities or nodes. They are commonly used in applications such as social networks, fraud detection, and recommendation engines, where relationships between entities are as important as the entities themselves.
https://en.wikipedia.org/wiki/Graph_database
File storage is a more traditional storage model, where data is stored in files and accessed through directory structures. File storage is common in operating systems and is used to manage unstructured data like images, videos, and documents. It is accessible via standard protocols like NFS or SMB.
https://en.wikipedia.org/wiki/File_system
Block storage breaks data into fixed-size blocks and stores them independently, with each block having a unique identifier. Accessed through protocols like iSCSI, it is ideal for high-performance transactional workloads, as data can be distributed across multiple storage devices for better scalability and redundancy.
https://en.wikipedia.org/wiki/Block-level_storage
Object storage is designed for storing vast amounts of unstructured data by treating each piece of data as an object. Each object includes metadata and a unique identifier, making it ideal for cloud storage solutions like Amazon S3 (introduced in 2006) and Azure Blob Storage. Object storage excels in durability and scalability for applications like media storage and backups.
https://en.wikipedia.org/wiki/Object_storage
Understanding these categories enables architects to select the right storage technology based on application needs, balancing considerations like scalability, performance, and cost. Combining multiple types, such as using object storage for backups and document databases for dynamic content, is a common approach in modern system designs.
https://en.wikipedia.org/wiki/Data_storage
“Deciding how to store a service’s data involves deciding to use a database vs. another storage category.” (AceSysDsgnInt)
“There are various replication techniques to scale databases, including single-leader replication, multi-leader replication, leaderless replication, and other techniques such as HDFS replication that do not fit cleanly into these three approaches.” (AceSysDsgnInt)
“Sharding is needed if a database exceeds the storage capacity of a single host.” (AceSysDsgnInt)
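Hash-based sharding is one common way to split data that exceeds a single host; a minimal sketch, with the shard count as an assumption:

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size

def shard_for(key: str) -> int:
    """Map a row key to a shard. Hashing spreads keys evenly, but
    changing NUM_SHARDS remaps most keys, which is why production
    systems often prefer consistent hashing or range-based sharding."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

shard = shard_for("user:12345")
assert 0 <= shard < NUM_SHARDS
assert shard_for("user:12345") == shard  # deterministic routing
```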
“Database writes are expensive and difficult to scale, so we should minimize database writes wherever possible. Aggregating events helps to reduce the rate of database writes.” (AceSysDsgnInt)
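Event aggregation can be sketched as counting events in memory and flushing one aggregated write per key per interval, instead of one database write per event; the class and method names are hypothetical:

```python
from collections import Counter

class EventAggregator:
    """Buffer raw events in memory and emit one aggregated write per
    key on flush, reducing the database write rate."""

    def __init__(self):
        self.counts = Counter()

    def record(self, event_key: str) -> None:
        self.counts[event_key] += 1  # cheap in-memory update

    def flush(self) -> list[tuple[str, int]]:
        """Return aggregated rows to write to the database, then reset."""
        rows = list(self.counts.items())
        self.counts.clear()
        return rows

agg = EventAggregator()
for _ in range(1000):
    agg.record("page_view:home")   # 1,000 raw events...
rows = agg.flush()
assert rows == [("page_view:home", 1000)]  # ...become a single write
```

The trade-off is that events buffered between flushes are lost if the host fails, so the flush interval bounds both write savings and potential data loss.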
“Lambda architecture involves using parallel batch and streaming pipelines to process the same data, and realize the benefits of both approaches while allowing them to compensate for each other’s disadvantages.” (AceSysDsgnInt)
“Denormalizing is frequently used to optimize read latency and simpler SELECT queries, with tradeoffs like consistency, slower writes, more storage required, and slower index rebuilds.” (AceSysDsgnInt)
“Caching frequent queries in memory reduces average query latency.” (AceSysDsgnInt)
“Read strategies are for fast reads, trading off cache staleness.” (AceSysDsgnInt)
“Cache-aside is best for read-heavy loads, but the cached data may become stale and cache misses are slower than if the cache wasn’t present.” (AceSysDsgnInt)
“A read-through cache makes requests to the database, removing this burden from the application.” (AceSysDsgnInt)
“A write-through cache is never stale, but it is slower.” (AceSysDsgnInt)
“A write-back cache periodically flushes updated data to the database. Unlike other cache designs, it must have high availability to prevent possible data loss from outages.” (AceSysDsgnInt)
“A write-around cache has slow writes and a higher chance of cache staleness. It is suitable for situations where the cached data is unlikely to change.” (AceSysDsgnInt)
“A dedicated caching service can serve our users much better than caching on the memory of our services’ hosts.” (AceSysDsgnInt)
“Do not cache private data. Cache public data; revalidation and cache expiry time depends on how often and likely the data will change.” (AceSysDsgnInt)
“Cache invalidation strategies are different in services versus clients because we have access to the hosts in the former but not the latter.” (AceSysDsgnInt)
“Warming a cache allows the first user of the cached data to be served as quickly as subsequent users, but cache warming has many disadvantages.” (AceSysDsgnInt)
5 Distributed transactions
This chapter covers
Creating data consistency across multiple services
Using event sourcing for scalability, availability, lower cost, and consistency
Writing a change to multiple services with Change Data Capture (CDC)
Doing transactions with choreography vs. orchestration