Will Twitter collapse? Examining the resiliency of Twitter's architecture
Let's explore the components that make Twitter work at such an impressive scale — and discuss Elon’s proposed improvements.
I really don’t mean to be alarmist with the subject line. Plenty of ink has been spilled recently about Twitter, mostly regarding Elon Musk’s management practices. That side of the story has been covered at length elsewhere. I don’t plan to get into it here.
However, what I haven’t seen discussed nearly enough is how this all connects back to Twitter’s architecture.
Recently Elon tweeted out an image, which sparked my thinking on this topic:
On the whiteboard is a high-level diagram of Twitter’s system architecture. It’s a rare transparent look under the hood of one of the world’s largest scalable systems.
Twitter serves around 250 million daily active users. According to Elon, that number is only increasing. Meanwhile, Twitter is reducing staff and cutting costs. This could be catastrophic given the size and complexity of the system, and the number of engineers required to maintain it.
In the systems engineering world, we are all wondering: will Twitter be able to stay up and continue to handle this many users — let alone actually operate FASTER by orders of magnitude, like Elon is promising?
I thought this would make a great case study in designing scalable software capable of engaging hundreds of millions of users. So let’s do a technical deep dive into Twitter’s architecture.
First, we will explore the components that make Twitter work at such an impressive scale and discuss Elon’s proposed improvements.
Then we’ll talk about what factors could destabilize this system and explore what the consequences could look like.
Breaking Down the Architectural Diagram of Twitter's Read Path
To help understand Elon's diagram, I will briefly review some of the more important pieces of technology that make up the read path.
The Timeline Mixer is Twitter’s existing service for curating the content users see in their timelines. It was created when Twitter shifted from reverse-chronological timelines to algorithmically ranked timelines in order to serve content more relevant to each user.
Any data points that make up the user experience, including ads, tweets, recommended users, and pagination cursors, are contained within the timeline view.
From what we can see, the Timeline Mixer calls on three services:
People Discovery: Provides users with recommendations on who to follow. This service looks at who you already follow and what tweets you engage with or look at. Based on all that, it tries to recommend accounts that are relevant to your interests.
Ad Mixer: Responsible for curating which ads are shown in your timeline based on a relevancy score.
Onboarding: Provides an onboarding workflow for new users to follow so they can have a positive experience from the get-go.
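To make the fan-out concrete, here is a minimal sketch of how a mixer-style service might pull candidates from several sources and merge them into a single pool that mirrors the timeline view (tweets, ads, recommended users). The function and field names are placeholders standing in for Twitter's internal RPC services, not their real interfaces.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    kind: str           # "tweet", "ad", or "who_to_follow"
    item_id: str
    score: float = 0.0  # filled in later by the ranking step

# Stand-ins for the candidate sources in the diagram; the real calls are RPCs.
def people_discovery(user_id: str) -> list[Candidate]:
    return [Candidate("who_to_follow", "user:42")]

def ad_mixer(user_id: str) -> list[Candidate]:
    return [Candidate("ad", "ad:7")]

def tweet_candidates(user_id: str) -> list[Candidate]:
    return [Candidate("tweet", "tweet:1001"), Candidate("tweet", "tweet:1002")]

def mix_timeline(user_id: str) -> list[Candidate]:
    """Fan out to the candidate sources and merge them into one pool for ranking."""
    return tweet_candidates(user_id) + ad_mixer(user_id) + people_discovery(user_id)

print(mix_timeline("user:1"))
```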
The Home Ranker works similarly to the Timeline Mixer.
You have a group of ads, tweets, and users. Out of this pool of candidates, you want to select the most relevant ones to be shown to the user.
By looking at Elon’s diagram, we can also get a general idea of how these candidate sources are generated, scored, and ranked.
If we look towards the bottom right, we can see how the Home Ranker fetches the candidates that rank the highest based on different parameters, such as which hashtags the user follows (Utag), which Spaces they follow, and which tweets they like.
Once the Prediction Service generates an overall score for a candidate based on those parameters, the result is sent back to the Home Mixer.
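As a rough illustration of that scoring step, here is a minimal sketch in which a stand-in "Prediction Service" assigns each candidate a weighted score, and candidates are sorted by it. The feature names and weights are made up for illustration; Twitter's actual model is far more sophisticated.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    item_id: str
    features: dict[str, float]  # e.g. hashtag affinity, Spaces affinity, like history
    score: float = 0.0

def predict_score(features: dict[str, float]) -> float:
    """Stand-in for the Prediction Service: a simple weighted sum of features."""
    weights = {"hashtag_affinity": 0.4, "spaces_affinity": 0.2, "like_affinity": 0.4}
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def rank(candidates: list[Candidate]) -> list[Candidate]:
    for candidate in candidates:
        candidate.score = predict_score(candidate.features)
    # The highest-scoring candidates are shown to the user first.
    return sorted(candidates, key=lambda c: c.score, reverse=True)

ranked = rank([
    Candidate("tweet:1", {"hashtag_affinity": 0.9, "like_affinity": 0.3}),
    Candidate("tweet:2", {"spaces_affinity": 0.8, "like_affinity": 0.7}),
])
print([c.item_id for c in ranked])  # tweet:1 scores higher with these toy weights
```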
Above the diagram for the candidate sources, we can see the process they use for feature hydration, which further optimizes performance. With feature hydration, you're essentially populating the candidate sources with additional data that allows the Timeline Mixer or Home Ranker to score them more accurately, and better serve relevant content to the user.
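Here is a minimal sketch of what feature hydration looks like in principle: before scoring, each candidate is enriched with extra data fetched from a feature store. The store below is a plain dictionary standing in for the real backend, and the feature names are invented for illustration.

```python
# Before scoring, each candidate is enriched ("hydrated") with extra data
# fetched from a feature store. A plain dict stands in for the real service.
FEATURE_STORE = {
    "tweet:1": {"author_follower_count": 120_000, "engagement_rate": 0.031},
    "tweet:2": {"author_follower_count": 800, "engagement_rate": 0.002},
}

def hydrate(candidates: list[dict]) -> list[dict]:
    """Attach stored features to each candidate so the ranker can score it accurately."""
    for candidate in candidates:
        candidate["features"] = FEATURE_STORE.get(candidate["id"], {})
    return candidates

print(hydrate([{"id": "tweet:1"}, {"id": "tweet:2"}]))
```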
Manhattan is Twitter’s general-purpose, real-time, distributed key-value store, and it serves as the backend store for Tweets, Twitter accounts, direct messages, and so on. Manhattan uses RocksDB as its storage engine, which is responsible for storing and retrieving data within a particular node.
Memcache is Twitter’s caching layer. Recent tweets are cached, so when the data needed to hydrate candidate sources is already available in Memcache, Twitter can serve requests much more quickly without reaching into the database layer (in this case, Manhattan).
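This read pattern is essentially the classic cache-aside approach. Here is a minimal sketch with plain dictionaries standing in for Memcache and Manhattan: check the cache first, fall back to the backing store on a miss, and warm the cache for the next reader.

```python
memcache: dict[str, str] = {}                      # hot, in-memory cache
manhattan = {"tweet:1001": "Hello from the DB"}    # durable key-value store

def get_tweet(tweet_id: str) -> str | None:
    cached = memcache.get(tweet_id)
    if cached is not None:
        return cached                              # cache hit: no database round trip
    value = manhattan.get(tweet_id)                # cache miss: read from Manhattan
    if value is not None:
        memcache[tweet_id] = value                 # warm the cache for the next reader
    return value

print(get_tweet("tweet:1001"))  # first read misses the cache and hits the store
print(get_tweet("tweet:1001"))  # second read is served straight from cache
```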
Now that we've covered the major components of the current read path and talked a bit about how Twitter achieves its impressive speed, let's look at the proposed changes to its read path.
Proposed changes to Twitter’s read path
Home Mixer: Proposed system to replace the current timeline service (TLS)
Services that Home Mixer calls:
Manhattan: Twitter's internally developed, real-time, distributed key-value store that serves as the backend store for Tweets, Twitter accounts, direct messages, and so on.
Gizmoduck: The service behind Twitter's User API, responsible for storing and serving user profiles.
Social Graph: A model of Twitter's social network, similar to Facebook's TAO, stored in FlockDB, a distributed graph database.
Tweety Pie: The service that serves Tweet objects for the Tweet API.
The read path is a critical component of Twitter's core services and is responsible for delivering the content that users view on their Twitter timelines. Not only is it where the majority of user interactions take place, it's also how users follow events in real time.
Latency sensitivity is a key factor in delivering content quickly and effectively, so Twitter relies heavily on caching to handle hundreds of petabytes of data each day. As users interact with the timeline, the data they request is retrieved and kept in memory so subsequent reads can be served straight from cache.
Why low-latency matters
Twitter's beginning as a reverse chronological "wall" of tweets has now morphed into a much more data-intensive system. As mentioned above, the mixing services generate timelines of tweets, ads, and social recommendations based on multiple ranking heuristics. Despite this, Twitter still needs to serve timelines quickly.
Alongside the typical non-functional requirements of a large social media platform (consistency, availability, reliability, and scalability), users expect new tweets to be served to their timelines in near real-time. When a breaking news story is developing, a media event is ongoing, or there's a new meme, people converge on Twitter. The phrase "live-tweet" describes this platform-unique idea of a real-time stream of consciousness covering a relevant topic.
There are several ways that Twitter streamlines this process. Cached tweets can be retrieved quickly, but the system is able to provide an even better user experience by prioritizing tweets sent by accounts that you follow. By serving tweets from accounts that you follow first, the system can appear like a digital version of a real-world social space.
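In its simplest form, that prioritization could look like the sketch below: split candidates by whether the author is someone the user follows, and serve the in-network tweets first. This is purely illustrative, not Twitter's actual logic.

```python
# Split candidates into in-network (authors the user follows) and
# out-of-network, then order the in-network tweets first.
def order_timeline(candidates: list[dict], following: set[str]) -> list[dict]:
    in_network = [c for c in candidates if c["author"] in following]
    out_of_network = [c for c in candidates if c["author"] not in following]
    return in_network + out_of_network

timeline = order_timeline(
    [{"id": "t1", "author": "alice"}, {"id": "t2", "author": "bob"}],
    following={"bob"},
)
print([t["id"] for t in timeline])  # bob's tweet is served first
```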
How will these changes improve Twitter's architecture?
The most important part of the Twitter user experience is browsing. Because of this, Twitter can make itself quite resilient. By making and storing copies of tweets, Twitter can ensure that it can always serve tweets for reading purposes. This is common across large-scale distributed systems, and not something that is new to Twitter. As one of the largest distributed social media systems in the world, Twitter is no stranger to making itself partition-tolerant and highly available.
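As a toy illustration of that idea, here is a minimal sketch of reading from replicated copies: if one replica is down, the read simply falls through to the next. Real replication involves quorums, consistency guarantees, and repair; this only shows why keeping copies keeps the read path available.

```python
# Toy illustration of replicated reads: the same tweet is stored on several
# replicas, and a read falls through to the next replica if one is unavailable.
replicas = [
    None,                                   # replica 1 is down (simulated)
    {"tweet:1001": "Still readable!"},      # replica 2
    {"tweet:1001": "Still readable!"},      # replica 3
]

def read_tweet(tweet_id: str) -> str | None:
    for replica in replicas:
        if replica is None:
            continue                        # skip unavailable replicas
        if tweet_id in replica:
            return replica[tweet_id]
    return None                             # every replica failed or lacks the key

print(read_tweet("tweet:1001"))
```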
The most data-intensive, and most expensive, service Twitter provides is building timelines. This cost is the current focus of many Twitter engineers. The old "Timeline Mixer" service is being replaced with a significantly faster and simpler "Home Mixer". In response to Twitter user @alexxubytes, Elon Musk claimed that the new Home Mixer service would be up to 10 times faster than the old system.
There is very little information publicly available as to how the new system will be faster, but Elon tweeted that the current Timeline Mixer is slow because of thousands of remote procedure calls. In the diagram, you can see that the Timeline Mixer talks to the Timeline Scorer through Thrift remote procedure calls. Thrift is an RPC and serialization framework that is known to be fairly slow because of the way it serializes and deserializes data structures.
In the past, I was able to gain 10x or even 20x improvements in speed by removing Thrift and implementing static data structures where memory can be quickly type cast. My best guess is that the Home Ranker will be key in achieving this increase in speed. With quicker protocols and potentially some other tweaks, it's definitely possible that Twitter can reach its goal of 10x faster timelines.
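To give a feel for why static layouts help, here is a small, self-contained benchmark comparing a generic text format (JSON) against a fixed binary layout that can be read back with a single unpack. It illustrates the general idea, not Twitter's actual serialization stack, and the record fields are invented.

```python
# Generic serialization re-parses every field on each read; a static binary
# layout puts fields at fixed offsets and unpacks them in one call.
import json
import struct
import timeit

record = {"tweet_id": 1_234_567_890, "author_id": 42, "score": 0.87}

# Generic text format: flexible, but every read re-parses the payload.
json_bytes = json.dumps(record).encode()

# Static layout: two unsigned 64-bit ints and a double, always at fixed offsets.
packed = struct.pack("<QQd", record["tweet_id"], record["author_id"], record["score"])

json_time = timeit.timeit(lambda: json.loads(json_bytes), number=100_000)
struct_time = timeit.timeit(lambda: struct.unpack("<QQd", packed), number=100_000)
print(f"json:   {json_time:.3f}s")
print(f"struct: {struct_time:.3f}s")  # typically several times faster
```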
This significant increase in speed will greatly improve Twitter. Speed can help in multiple ways:
The system can support significantly more concurrent users without dips in performance.
Lower computational work reduces operating costs.
I believe that much of this Twitter redesign is centered around Musk seeking to save on infrastructure costs. Earlier this month, it was reported that Musk wanted to slash Twitter infrastructure spending by 1 billion dollars annually. At first glance, this could seem like irresponsible cost-cutting. Spending significantly less on server costs could be a real problem if the system isn't prepared to run without the backup it's historically had.
That said, if Twitter is able to serve users as fast as Elon hopes, then they will be able to dramatically reduce spending.
Conclusion: The ghosts of outages past
I hope you enjoyed this Twitter architecture deep dive. Twitter is an impressive application, but the reality is that you need many talented engineers to keep such a massive system operating smoothly. There are very real risks when you attempt to implement significant changes to the architecture, all while cutting costs by $1B and attempting to keep the lights on with 80% fewer engineers.
To put it simply: Twitter is resilient, but it isn’t invincible.
One more thing to note: when systems of this scale get shut down, it is never easy to bring them back.
Systems like Twitter are built to withstand the strain of several machines going down. But if you happen to lose the wrong server at the wrong time, it could take weeks to bring the whole system back online. Think of it like fracturing a bone. The injury could happen in a split second, but it could take weeks or even months for the body to fully heal.
A few high-profile outages come to mind as case studies:
1) Atlassian’s recent outage (April 2022)
This outage lasted two weeks and affected hundreds of companies. In many ways Atlassian is still working to build back trust with users. Gergely Orosz of Pragmatic Engineer wrote about the outage and its consequences in depth here:
2) Microsoft Azure worldwide outage (February 2013)
This is a slightly more personal (and painful) example — when I was working on Azure, my team was responsible for a worldwide outage caused by expired HTTPS certificates. Once they expired, there was no easy way for us to bring the systems back without a lot of hand-holding. It took us days to install new certificates on each of these machines.
This was an intimate reminder that these systems are never designed to shut down fully.
3) Facebook’s “call the cops” outage (August 2014)
This was a big deal for us back in 2014. Today it’s hard to imagine someone actually calling the police because Facebook was down, but that’s exactly what happened. This particular outage was caused by a server error and it got plenty of news coverage.
(On a related note, one of the issues my team was working on at the time was a scalability issue known as the Thundering Herd problem. When Facebook Live was rolled out, we had to figure out how to handle extreme traffic spikes as millions of users would suddenly tune in when celebrities with massive follower counts started live streams. It’s a great design problem that I will write about in a future newsletter).
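For the curious, here is a minimal sketch of one common mitigation for the Thundering Herd problem, request coalescing: when many concurrent requests miss on the same key, only the first does the expensive fetch while the rest wait for its result. This is a general technique, not necessarily the approach we used at Facebook.

```python
import threading

cache: dict[str, str] = {}
in_flight: dict[str, threading.Event] = {}
lock = threading.Lock()

def expensive_fetch(key: str) -> str:
    return f"value-for-{key}"   # stand-in for a slow backend call

def get(key: str) -> str:
    with lock:
        if key in cache:
            return cache[key]
        event = in_flight.get(key)
        if event is None:               # first requester becomes the leader
            event = threading.Event()
            in_flight[key] = event
            leader = True
        else:
            leader = False
    if leader:
        value = expensive_fetch(key)    # only the leader does the expensive work
        with lock:
            cache[key] = value
            del in_flight[key]
        event.set()
        return value
    event.wait()                        # followers block until the leader finishes
    return cache[key]

print(get("stream:celebrity-live"))
```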
As the engineers at Twitter acutely understand, there can be significant business consequences to an outage of this scale. An outage can trigger a series of cascading failures, leading to lost advertising revenue, harmed credibility, and beyond.
The jury is still out on whether Elon and his team can pull off the improvements they are promising. Along with the rest of the systems engineering community, I will be waiting anxiously to see what happens next.
In the meantime, if you want to dive deeper into Twitter’s system design, I highly recommend Grokking Modern System Design Interview for Engineers & Managers (where we even have an entire case study dedicated to Twitter):
You can also check out Grokking the Machine Learning Interview, especially the lesson on Tweet Selection (which dives into Twitter’s feed generation algorithm):