A longer link-and-rec, but a fascinating paper from ByteDance on one of their recommendation systems.
Brief Background on Recommendation Systems
In a recommendation system, your goal is to present the most relevant content to users; this can be anything from short videos, to ads, to product listings. You have a vast pool of users and content to handle.
You begin by representing users and content through their attributes (features). Attributes can be things like the user’s location, their watch history, or who they follow. These attributes can be dense (mostly populated with values) such as “age of user account”, or sparse (mostly empty), like “has liked a video about Vacuum Fluorescent Displays.”
For each user or piece of content, you create a lengthy array of attributes and their values, with the majority being zeroes. These high-dimensional arrays are then transformed into lower-dimensional representations called embeddings. Embeddings are like a semantic hash that captures the essence of users or content, placing similar ones close to each other in the vector space.
These embeddings are fed into a deep network, which maps the embeddings together and generates recommendations. Due to the enormous amount of data involved, there might be multiple layers or tiers within the system to handle the processing efficiently.
The recommendation system then suggests content for users. After a certain period, you’ll evaluate the quality of those recommendations. The feedback loop could be short (whether the user watched the entire video) or long (if the user purchased a product from an ad). This information is used as training data for future iterations of the recommendation system, creating a continuous learning process.
Challenges
The paper describes two major problems recommendations systems have, when compared to other deep learning tasks.
Sparse, Categorical & Dynamic Data: In the recommendation world, you’re dealing with a dizzying number of dimensions—billions of users and pieces of content, and a huge number of attributes to describe them. This creates a long tail of items that appear less often. One popular fix is hashing, where you map multiple elements (users, content, ads) into a single, embedding bucket. This trick also helps tackle the “cold start” problem, such as when you’re trying to recommend something to a user you know nothing about. These hashes can cause collisions between otherwise independent entries, which can impact how effective the recommendations are.
Shifting Data Distribution: Unlike language models, where the tokens you predict based on a phrase stay pretty consistent over time, user interests in recommendation systems can be as fickle as fashion trends. Over the course of an hour a user might consume a whole bunch of woodworking videos on TikTok, but then want to switch to stand-up snippets. If a feed gets too optimized it can turn stale or, worse, bore users, leading to less engagement. So, for better recommendations, you’ve got to keep up with the user’s actions to capture their ever-changing tastes and preferences.
“Intuitively, information from a more recent history can more effectively contribute to predicting the change in a user’s behavior. To mitigate the effect of Concept Drift, serving models need to be updated from new user feedback as close to real-time as possible to reflect the latest interest of a user.”
Monolith
The Monolith recommendation architecture addresses these two concerns through use of a collisionless hash algorithm and an online-training scheme.
Cuckoo Hashing & Parameter Servers
Recommendation models tend to be absolute behemoths, way too large for a single GPU (which, nowadays, max out at around 80GB). Each training example is some variant of “User X watched Video Y”, meaning most batches of examples only touch fractions of the users and videos in the mix. To manage this, they store the model’s parameters on a set of separate parameter servers and only load the relevant bits onto the training machines. These parameter servers rely on a Cuckoo hash-based hashtable, designed to avoid collisions:
“[the system] maintains two tables 𝑇0,𝑇1 with different hash functions ℎ0(𝑥), ℎ1(𝑥), and an element would be stored in either one of them. When trying to insert an element 𝐴 into 𝑇0, it first attempts to place 𝐴 at ℎ0(𝐴); If ℎ0(𝐴) is occupied by another element 𝐵, it would evict 𝐵 from 𝑇0 and try inserting 𝐵 into 𝑇1 with the same logic. This process will be repeated until all elements stabilize, or rehash happens when insertion runs into a cycle.”
This could end up using a lot of memory, so to save space (and improve performance) they prune old IDs (old, inactive users or old pieces of content), as well as ones that have very low usage.
Online training
The model is trained in two different ways. They start with traditional batch training, sifting through all their data just once. This generates the model, which is shifted over to live parameter servers and starts making recommendations.
Next, they switch gears to online training. As user actions come in (like giving a thumbs-up to a video or watching it till the end), they merge the data with the recommendation info that prompted it, creating a fresh training example. They batch up these new examples and run a training pass, then copy the updated sparse parameters from the training parameter servers to the production one. Moving just the modified parameters makes the updates way smaller than pushing whole model. Once a day, they sync the dense parameters.
This sounds straightforward, but it hides some significant engineering. Model publishing often involves an intricate process, such transforming a model for optimal serving to ensure it responds quickly, especially when low latency is a must (like ranking and recommending cases). Serving models is also a more distributed challenge: training usually crams as much data and computing power together as possible in one place, while inference (making predictions) is best done close to the user request for performance and latency reasons. To top it off, inference is often run on entirely different machines, or ones with varying characteristics, as compared to the training devices.
ByteDance showcases their ability to manage this process at a fine-grained level, delivering updated at a high pace (around 1 minute from action to model update), making their product ultra-responsive to users’ ever-changing tastes and reactions.
Other bits
The paper closes with a very practical tip: you can likely get by with snapshotting less often. During training, individual server failures are way less isolated than the ones in more traditional “big data” workloads, and can result in a loss of a significant amount of progress. To cut down on the impact, it’s common practice to save snapshots of the model during training, to use as a checkpoint if you have to restart the process. Snapshot more frequently, and you’ll waste less time on failures—but you’ll also spend more time on the snapshotting itself.
“We also did quite some exploration in this regard and to our surprise, model quality is more robust than expected. With a 0.01% failure rate of PS machine per day, we find a model from the previous day works embarrassingly well. This is explicable by the following calculation: Suppose a model’s parameters are sharded across 1000 PS, and they snapshot every day. Given 0.01% failure rate, one of them will go down every 10 days and we lose all updates on this PS for 1 day. Assuming a DAU of 15 Million and an even distribution of user IDs on each PS, we lose 1 day’s feedback from 15000 users every 10 days. “