Systems Design Study Guide

A Beginner’s Guide to Understanding How to Design, Build, and Scale a System and Why It’s Important to Know as a Software Engineer

Anna Carey
20 min readJun 26, 2020

I recently interviewed for an entry-level production engineering role at a Facebook, which included a section on Systems Design. As a software engineer, I had learned about programming languages and frameworks, OOP, data structures, algorithms, and other concepts in computer science and programming. Systems Design, however, was very new to me.

My work was cut out for me. Creating masterfully architected study guides with way too much detail was one of my favorite parts of school. (I’m a nerd.) So I took this opportunity to get back to my roots and pretend this was a college exam.

Although I did not land the job, I learned so much through the process. It became clear to me that understanding the structures and systems that underpinned my work as a developer was not only relevant but actually critical to becoming a great engineer.

Below, you will find a list of the resources, a glossary of terms, and an outlined process for thinking through a systems design interview question. (If you have any feedback or corrections, please let me know!)

Glossary

All of the terms I came across that I didn’t know or felt I didn’t understand deeply enough.

Production Engineering—So, what is it? A combination of software and systems engineering. Helps an organization run smoothly and scale efficiently. Makes sure production and infrastructure services are reliable and scaling.

CPU — Central Processing Unit — responsible for processing and executing instructions — chip that sits on a special socket on the motherboard of a device — separate from memory or graphics — “the brain” — fetch, decode, execute

RAM — Random Access Memory — your devices short term memory; “forgets” when powered off, which is why you have long term memory in a hard drive; memory that can be accessed and changed in any order

TCP port — TCP (Transmission Control Protocol) is a standard that defines how to establish and maintain a connection, so that applications can exchange data; “connection” is key to understanding TCP because it is all about establishing that connection and maintaining it until both applications have finished exchanging messages; TCP works with IP (Internet Protocol); a port is a communication endpoint and is always associated w/ an IP address and the protocol type

Shadow Testing — Technique of taking production traffic and replaying it against a newly deployed version of the service; the idea is that you are using real traffic to test on a system; it is possible to test a sampling of the traffic or it is also possible to test all of the traffic against the other system; because it is real traffic, it is a reliable way of testing; the challenge is keeping side effects separate (i.e. so when you duplicate traffic, the client doesn’t see the result as duplicate)

Deploy vs. Release — To deploy is to install the new version of a service into production, meaning that though it is ready to handle production traffic, it may not yet be getting any (so the older version may still be handling live production traffic); when the new version of a service is released, it is responsible for handling production traffic, meaning the production traffic is moved over to this released version

Cache — A component that stores data so that future requests can be served faster — the idea is that database queries are expensive so it is much less costly to get info from cache, which is especially important for read-heavy applications; “short-term memory”; i.e. the webpage you are accessing was already loaded before and stored in cache so instead of reloading the entire page, it can be pulled from cache; cache is used in almost every layer of a system but are most often useful closest to the frontend (client); 80/20 rule is best practice meaning assuming 20% of content is consuming 80% of traffic, we would cache the 20% of content; generally we should be caching the most popular content, the most recent content, and look at that by location, historical searches, ML (Machine Learning), etc.

Terabyte (TB) — unit of measure for storage; 1 TB = 1,024 Gigabytes (GB); 1GB = 1,024 Megabytes (MB); 1MB = 1,024 KB; 1KB = 1,024 bytes

Size in Bytes—Byte → KB → MB → GB → TB; A book is around 1MB, a song is around 5MB, and a basic HTML webpage might be 100KB; A server on average can hold 256GB; 10MB is a good limit for text input

SOAP vs. REST API — SOAP stands for Simple Object Access Protocol and REST stands for Representational State Transfer; SOAP is more rigid and relies on XML (Extensible Markup Language, a markup language that defines a set of rules for encoding documents); REST is more flexible and can transfer data in many different forms (including CSV, JSON) and primarily used over HTTP and makes use of HTTP methods (get, post, patch, put, delete); SOAP is better for security, transferring info in a rigid format, and stateless; REST is better for everything else

SQL vs. NoSQL:

SQL database— relational database; it stores data in rows and columns; good for data that is highly structured and when the schema is mostly decided beforehand and likely won’t change (you can change it but it requires updated the entire database and requires going offline); also good for data that needs to be ACID compliant so for sensitive info like financial or e-commerce applications; you can query the database with SQL which is really useful; scales vertically (more rows) but expensive to scale and difficult to scale over multiple servers; examples include: MySQL, Oracle, MS SQL Server, SQLite, Postgres, and MariaDB

NoSQL database— non-relational database; good for storing large amounts of data that is unstructured, distributed, and has a dynamic schema; also good if you want to use cloud storage because this requires flexibility of spreading data across multiple services; also good if you need speed and agility b/c SQL database requires more planning and down time when changes are made to the schema; easy to change the schema on the fly and doesn’t need to be consistent across different entries; querying is by collections of documents; scale horizontally meaning you can add servers easily and inexpensively to scale; examples include: MongoDB, CouchDB, Cassandra, and HBase

Encoding Types — Base36[a-z ,0–9] or Base62([A-Z, a-z, 0–9]) or Base64([A-Z, a-z, 0–9, +, /])

Hash Functions — function that maps different-lengthed pieces of data to uniformly-lengthed pieces of data, ideally in a random-like manner; cryptographic hash functions are used for security reasons so sensitive data is obscured and then stored as a hashed value; examples include SHA-256 and MD5 — you can add a salt (random bits) as well to make it more random/secure

KGS — Key Generation Service — a separate service in the system for generating keys

Consistent Hashing — Creates an easily scalable system for distributed hash systems (i.e. used for distributed cache systems) that distributes data across the system so that if a node is added or remove you barely need to reorganize and distributes them uniformly (vs. w/o consistent hashing where you may have to reorganize the entire system if one server is added and the information may be unevenly distributed); to add a new key to the system: use a hash function to change each cache server into a number in the range; when a new key is added, hash that into a number in the range too; imagine if all of the cache servers were located in a circle on the range and when you have the new key, put it in the cache server that comes up first moving clockwise; this way if a new cache server is added or removed, the keys are distributed in a consistent manner because they just move to the cache server that’s closest to the right

Distributed Cache System — Distribution of cache to multiple systems; the idea is that a distributed hash table (DHT) maps each resource to a certain server, so when you go and look for a specific resource, the hash table will tell you which server it’s on

Memcache — Free and open source software that is a ready-made distributed memory caching system; stores whatever you want in RAM; It is kind of like short-term memory for your applications; server will check in memcache and if it’s not there, it will look in the database drive and then store it in cache; it works by consolidating spare memory from different parts of the system together in one pool of memory storage space so it is available to the entire system when there is need

Cache Eviction Policy — Predicts which resources are most likely to be used again (and which ones we can remove from the cache once the server is full); common policy is Least Recently Used (LRU), so the least recently used resource would be removed from the cache first; other methods include Least Frequently Used (LFU), Random Replacement (RR) — just randomly removing a resource, and other systems that combine recency and frequency, First in First Out (FIFO), and Last in First Out (LIFO)

Load Balancer (LB) — Layer of process that balances traffic load from clients (or servers or databases) across multiple servers; when we horizontally scale, we add servers, so which client traffic goes to which server? That’s the job of the load balancer; can be situated in between any steps of the system (client to server to database) to try to evenly distribute the workload across servers; the system used is called a scheduling algorithm and round robin is a common and good choice to start with because it is simple to implement and doesn’t introduce overhead; problem is that it does not take server load into consideration; other methods include Least Connection Method and Least Bandwidth Method where a request is served to servers with the current fewest number of connections or serving the current least amount of traffic; usually you would want two LBs: Active-Passive (passive is polling the other to make sure it’s alive and if it dies, it takes over the active LB’s IP and directs traffic) or Active-Active paradigms (both active at the same time)

Latency — Time delay when transmitting or processing data; i.e. in between request and response

Throughput — # or actions executed or results per unit of time (aka bandwidth)

TPS — Transaction Processing System; a transaction is a single event that changes something; process includes collection, processing, editing, and storing the transaction

Common Error Codes — 4XX means client-side error, 5XX means server-side error; 401: unauthorized, 404: resource not found, 429: too many requests; 2XX means success — 202 means request accepted

Ingress and Egress Traffic — Ingress traffic data originating outside the network that is coming in (client → server); egress traffic is data originating inside the network that is moving out (server → client)

Object Storage — AKA object-based storage; allows for storing a lot of unstructured data in the form of objects; good for cloud-based storage because it’s easy and cheap to add more servers; method of storing files with all of their metadata together; you are not able to quickly edit one part of a file in this method but instead would need to access and update the whole unit, which is slow; examples: Amazon S3, Microsoft Azure, Google Cloud

Block Storage — Method of storage where files are separated into uniform “chunks”; good for databases and structured data; good for storage that requires frequent input/output (I/O); examples: AWS Elastic Block Storage, Microsoft Azure Premium, Google Persistent Discs

Shard — Method of splitting one dataset into different databases to allow for easier scaling as the dataset and traffic grow; data is split horizontally (vertically would mean different columns of a database are stored separately, horizontally means that rows are stored separately) so it is homogenous pieces of data that are stored across separate databases

DNS Server — Translates host names (urls) to IP addresses

Polling — When one device or program checks in with another program or device for a change in state or a particular output or piece of data at regular intervals; long polling means that the device puts out a normal polling check and if the other device doesn’t have anything new, it will keep the request open until it does indeed have the response information to share; once that information is shared, the original device immediately puts out another poll so that there is no latency in between

Epoch Time — Way of describing a point in time with a number; AKA Unix time, this represents the number of seconds elapsed since Jan 1, 1970

CDN — Content Distribution Network — an external service that provides caching for large systems by having geographically distributed servers and data centers to keep large pieces of data in cache; usually if a server makes a request to a CDN, it will serve up the content quickly if it is available or query the backend server if not and then cache it locally before sending it to the user; especially popular service for systems needing to provide video and photos

PoP — Point of Presence — physical access point where two or more networks share a connection; primarily, the access point that enables remote users to access the internet

ACID — Atomicity, Consistency, Isolation and Durability — set of properties that are used to guarantee the reliability, consistency, and accuracy of database transactions; important for e-commerce, financial applications

BASE — Basically available, soft state, eventual consistency — systems that do not need to be ACID compliant but need to guarantee availability would be BASE

Message Queue — A list of messages that is sent between applications; allows for asynchronous communication, meaning it stores messages that come in in that order until the receiving application can process them; this is really important for tasks that don’t need to occur in real time because it can remove bottlenecks by putting a task on a list for later and then once it is completed, it can update the user that it’s done later on

Request Queue — A list of requests that are sent to a server and good for time-consuming actions as they can be completed asynchronously; a client will make a request, which will be added to the queue and they will be told the action is in progress and to go on their way; once it’s done they will receive a notification that the task is completed

CAP Theorem — Consistency, Availability, Partition tolerance; states that it is impossible for a distributed system to have more than two out of three; consistency means that all nodes see the same data, and that when there is a change, all the nodes are updated before a new read of the data is allowed; availability is when a request gets a response no matter what, whether or not there is a success or failure; partition tolerance means that some messages failing does not result in the entire system failing — if there is a partition failure, should we cancel the action and decrease availability but ensure consistency or should we move forward with the action but risk inconsistency? In distributed systems, we are deciding between AP and CP (because there always might be a partition failure when things are divided)

Partitioning — Separation of servers and databases (simplest example is users with A-M last names and N-Z last names)

Hot partition — One particular partition is getting more traffic than the others

Push vs. Pull — Push architecture keeps a consistent connection open with the server and sends the message to a client as soon as it has one to send; pull model is more like polling and has the client check at consistent intervals for an update; pull has high likelihood for latency because what if you aren’t checking for new messages at the right moment or if you are checking often and receiving nothing, which is a waste of bandwidth

WebSockets — Allows simultaneous connections between a client and a server over one TCP port; good for real-time messaging between two users;

SSE — Sever-sent events — long-term connection between client and server where the server sends updates to the client whenever there is something new to share; good use cases are any real-time updates but only in one direction, like friend status updates or news updates

Scalability — Ability of a system to grow and handle increased demand; we often talk about horizontal vs. vertical scaling; horizontal scaling is when you add more servers to a system, which is more easily accomplished; vertical scaling is when you add more power to an existing server which is more difficult and requires downtime; Cassandra and MongoDB are horizontal because it is easy to add servers; MySQL is vertical because it is easy to switch from a smaller to bigger machine (even though it requires downtime)

Vertical scaling — Mo’ Resources Mo’ Money — make your machine bigger, badder (more CPU, more RAM, more memory)

Horizontal scaling — Mo’ machines (maybe cheap-o machines) but more! — now we have to decide how to distribute processes

Reliability — Probability that a system will fail in a given time period; risk of failure is often mitigated through redundancy so if one part of the system fails, another duplicate part can pick up the process; this is costly though

Availability — Amount of time a system is operational under normal circumstances; the key here is “normal circumstances”, so a system can have really high availability (meaning it’s great in normal circumstances) but it can have low reliability if a lot of incidents happen and the system goes down (because these wouldn’t be considered normal circumstances)

Efficiency — How high or low the response time or latency is and also how high or low the number of messages a system can send per unit of time

Manageability/Serviceability — How easy a system is to operate and maintain; so how easy/quick it is to repair or to update or change; a good system will have in place a way to detect early faults or failures

Fault Tolerance — Property that ensures a system continues to operate even if there is a failure; one way to combat this is redundancy in the system; gets rid of single point of failure

Map-Reduce — Programming model/pattern that facilitates the organization of big data and is characterized by splitting data up into clusters and performing a process on them concurrently so it’s more efficient; most commonly referenced with Hadoop

DOS Attack — Denial of Service Attack — cyber-attack on a web service where the attackers attempt to make the service unavailable, usually by flooding the system with traffic

Throttling — Intentional slowing or speeding of an internet service provider to prevent network congestion and being overwhelmed with traffic; Hard throttling is a strict # limit; Soft throttling allows the user to go above the limit by a certain percentage; Elastic/Dynamic throttling means the user can go above the limit if there is resources available to service the request

Index — When a database is really big and complex it is good to create an index to assist with lookup if it is a read-heavy system; an index consists of a two columns, one with the search key and the other pointing to specific rows in the database; example: Twitter search to store tweet ids on an index of words in the english language — they key is the word and the other column is a list of tweet ids containing that word (so it “points” to those rows)

RAID — Redundant Array of Independent Disks — assume you have multiple hard drives on your system — writes either partial info to different disks (striping) or mirror the information so that you are automatically creating a backup (redundancy); the goal is to improve: reliability, availability, performance, and capacity

Cookies —Small piece of data from a website that is stored on a user’s browser; some common examples are storing things in a cart or storing login information so you get automatically logged in

Replication (Master-Slave) — Replication is about making automatic copies; master database where your read and write data to; the master has one or more slave databases whose job it is to get a copy of the master database; anything a query happens on the master, it is copied to the slave; you can also use the slaves to balance read requests; Master-Master is another paradigm

Graph Database — Stores data in nodes and edges; nodes store info and edge store info about relationships; good for data with a lot of relationship info like social networks and recommendation engines

HTTP vs FTP vs SMTP — Hypertext Transfer Protocol, File Transfer Protocol, and Simple Mail Transfer Protocol allow you to transfer files between two systems; HTTP is the backbone of the web and is the protocol used by web browsers; FTP is for transferring files and data and does so by establishing separate connections to authenticate the user and transfer files; SMTP is used by email

Robots Exclusion Protocol — Standard used by websites to communicate with web crawlers; websites who are using this protocol will have a file called robots.txt in their root directory that the crawler will look for and has instructions for the crawler such as declaring parts of the site off limits or irrelevant for crawling

Some Helpful Math:

  • 2.5 million seconds per month
  • 1 request per second = 2.5 million requests per month
  • 40 requests per second = 100 million requests per month
  • 400 requests per second = 1 billion requests per month

Common Database Choices:

HBase — non-relational (NoSQL) distributed database; wide-column store; good for storing small bits of data frequently and for storing variable sized data; example: Facebook messenger

Cassandra — non-relational (NoSQL) distributed database; wide-column database meaning it allows for varying column types and names in one database; high availability and replication so there is no single point of failure; high scalability; examples: URL shortener, metadata storage, Instagram’s relationships (UserPhoto, UserFollow)

Apache Hadoop, HDSF — distributed file system; examples: video storage for YouTube

Redis — NoSQL/non-relational database key-value store; good for cache

Amazon S3 — Object storage; examples: storing content from Pastebin, storing images for Instagram,

MySQL — SQL/relational database with in-memory caching available (so if you make a query it may be slow the first time, but once it’s already loaded, the next time you make the request it will respond much faster b/c the response is stored in memory); examples: metadata for YouTube

Process for Answering a Systems Design Question

In a Systems Design question, you will usually be asked to either design or scale a given system (like how would I build a commonly used app like Instagram, a search engine, an elevator system.) The questions are intentionally broad, vague, and open-ended, so it’s up to you to as the right questions and share a structured well-thought-out answer. Here is a helpful process I outlined from reading about these kinds of interview questions:

General Questions to Ask Before Starting:

  • Will we be designing the frontend in addition to the backend?
  • Will we be handling photos and videos?
  • What scale is expected from the system? (i.e. how many users, posts, etc.)
  • How much storage is needed?
  • What network bandwidth usage is expected?

Step-by-Step Process

  1. Requirement clarification

a. Functional Requirements

  • Outline the user flows — What do users need to be able to do?
  • Should we put any limits on the usage (i.e. size of inputs)

b. Non-functional requirements

  • Use this as an opportunity to first discuss the CAP Theorem — Do we value consistency or availability more?
  • What is important for the application to run well? Does it need to be fast? Does it need to be secure? Low latency? High reliability (meaning data is not lost)? High consistency (do reads need to be reading super up-to-date information)?

c. Extended requirements

  • What other analytics would be useful? Should the service be accessible through REST APIs

2. Estimate numbers — THE MATH

a. Initial metrics: # of users, # of posts, ratio between read vs. write requests

  • If you need to estimate this yourself, think about the most-used platforms as having 1.5–2B users

b. Traffic — estimate queries per second

c. Storage (in bytes — see section on byte conversion and get familiar with size of basic data types like text, images, videos)

  • For how long?
  • Do you need to separate object storage from metadata?
  • Remember to aim for 70% of storage so accommodate for that

d. Bandwidth (multiply the storage times traffic to get # in bytes per second)

e. Memory — estimate # of bytes needed for anything we would want to cache

  • Calculate by bytes per day
  • A common framework is 80/20 rule meaning 20% of resources have half the traffic, so we would want to cache those

3. Define system interface/APIs — what will the functions/methods look like? (think CRUD)

a. Define the parameters (what type are they, are they required/optional, if optional what is the default value (none/null))

b. Define the return val and type and mention that otherwise it returns an error

c. Answer the question: how can we prevent abuse?

  • Create an “API Developer Key” — use this as an additional parameter for your methods to limit quota — how do we design this? (what could a limit number be, is that flexible, should we limit quota during a particular time frame, is the time frame flexible?)

4. Define the domain models and their attributes

a. Start with some things we know about what we need to story and the nature of the data

  • Are the things we need to store big or small?
  • Are there a lot of things we need to store?
  • Is our database read or write heavy?
  • Are there many or few relationships between what we need to store?

b. Write down the schema with domain models and attributes in tables

c. What backend storage is best?

  • Back to our conversation about Consistency vs. Availability
  • SQL vs NoSQL and example of one to use: SQL is good when consistency is required, ACID compliance, a good place to start for structured data (but point out that it doesn’t scale well); NoSQL is otherwise the good choice especially for scalability
  • Object vs. Block storage vs. File: Block or file works for when consistency is really important; Object is good when availability matters, when you have a lot of metadata to store w/ the object, and for cloud purposes

5. Basic Design

a. Draw a block diagram with 5–6 boxes

  • Client box
  • Application Server box: Think about whether we have different actions (i.e. read and write) that have very different requirements (i.e. one is MUCH slower than the other) — then we might want to have separate servers so one action doesn’t clog up a server and so we can scale the services independently
  • Database Storage box(es)
  • Always add Load Balancers between the server and client and the server and DB: It’s common to add multiple LBs in Active-Active or Active-Passive modes; Discuss method for load balancing — round robin, random, least busy, etc.
  • Client → LB → Server → LB → DB

b. Define the algorithm

6. Dive deeper into certain considerations that need more discussion — important to show decision making process with pros and cons

a. Redundancy — this is a really important consideration if you cannot lose certain pieces of data or parts of the service — to eliminate this problem we would want to story multiple copies — in the same vain, we would want copies of the same server running in case one fails to eliminate the single point of failure problem — Some concepts to mention here can be master-slave or master-master setups

b. Partitioning Data — discuss how to divide data across different servers to manage needs like scalability, easy accessibility, speed, etc. Options include:

  • Vertical partitioning — separating different columns across different databases and using joins to then access data
  • Range based partitioning — dividing data based on what letter it starts with for example or on the primary key (id) of a piece of data; can cause unequal loading of servers
  • Hash-based partitioning — take a hash of each of the objects and store using consistent hashing (see above definition)

c. Monitoring the system — What metrics should we collect to make sure things are running well and what the peak usage is of our system so we know where it is topping out currently?

d. Queuing — allowing for asynchronous work/messaging

e. Fault Tolerance — How does the system continue operating if there is a failure?

  • Redundancy of data and of servers
  • Good monitoring and error alerts if things are not going well
  • No single point of failure

f. Data Cleanups (how to delete entries if they have expired for example)

  • Cleanup Service — should be lightweight and only run when on specific intervals when traffic is low (so as not to take up too much bandwidth)

g. What extensions or other pieces of data might be useful to connect to better understand the usage of the system (i.e. for marketing purposes)

7. Define bottlenecks: general questions to ask when discussing bottlenecks:

  • Do we have enough copies of the data and different services so that if we have failures or lose servers, we can still serve users?
  • How are we monitoring performance and are we providing notifications to our developers (and to users) if things are failing or if the services is degraded?
  • When confronted with with a problem, determine what the metrics are we would look at
  • Then determine which one is the limiting factor (hardest to accommodate) and then try to estimate what that number will be
  • Then make sure the system can scale to that level
  • Testing — accommodate enough time and follow a best practice for testing (shadow, etc.)

Further Resources and Reading:

(Of course, I’ve really only just scratched the surface, but hopefully this provides some helpful info!

--

--

Anna Carey

Software engineer @Artsy. At the intersection of storytelling, art, and technology.