Blog

  • Transformer model inference optimizations

    A very practical exploration of how to get a transformer model going faster for inference:

    We in general consider the following as goals for model inference optimization:

    • Reduce the memory footprint of the model by using fewer GPU devices and less GPU memory;
    • Reduce the desired computation complexity by lowering the number of FLOPs needed;
    • Reduce the inference latency and make things run faster.

    https://lilianweng.github.io/posts/2023-01-10-inference-optimization/

  • CUDA graphs in PyTorch

    The kernel dispatch time eats a lot of performance on GPU – CUDA graphs let you chain a bunch of kernels together, and they’re now more accessible from PyTorch:

    CUDA Graphs, which made its debut in CUDA 10, let a series of CUDA kernels to be defined and encapsulated as a single unit, i.e., a graph of operations, rather than a sequence of individually-launched operations. It provides a mechanism to launch multiple GPU operations through a single CPU operation, and hence reduces the launching overheads.

    https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/

  • PyTorch Internals

    A blog post from PyTorch veteran and core maintainer Ed Yang based on a talk where he breaks down the fundamentals of PyTorch. From 2019 but still very useful in explaining the tensor mechanics, particularly including striding, which is one of those simple but very applicable concepts!

    http://blog.ezyang.com/2019/05/pytorch-internals/

  • Ekko: Low Latency Model Updates at Tencent

    This paper from Tencent last year on the architecture of their recsys, and how it enables a high degree of freshness, via low-latency model updates to deliver fresh and relevant recommendations.

    Infrastructure and Architecture

    The infrastructure of Ekko is centered around a parameter server, which stores the model weights and embeddings created during training. These parameters are then copied to replica parameter servers in data centers around the world, for inference. With a massive model size of 40TB (!) and over 4500 servers for distribution, Ekko is built to handle a lot of users and a lot of content.

    Freshness and Privacy

    Freshness is crucial in Ekko’s design, as it aims to provide the best recommendations for new content and users. The paper also calls out the need to accommodate anonymous users given the expanse of privacy regulations; one of the few times I’ve seen this explicitly mentioned. In all cases, they’re focused on learning from behavior and getting updates out as quickly as possible: seconds, rather than minutes or hours.

    Model Sharding

    The system continuously trains on new data, and divides the model into shards. Model sharding takes advantage of the fact that updates are often clustered in space (within the model) and time (updates targeting a small number of hot items). In their production systems, Tencent observes that 3.08% of the parameters are updated per hour. Individual shards within the model are versioned separately within the parameter server.

    Peer-to-Peer Distribution

    The next challenge is getting the updated weights out to the distributed replicas. Ekko’s architecture is based on a peer-to-peer distribution of model weights, taking into account the topology of its hosting environment. Each data center hosts multiple replica parameter servers, which supply model weights to GPUs for inference. Leaders in a location are elected via Zookeeper and are responsible for communication between data centers, avoiding duplicate requests contending with each other over limited between-data-center pipes.

    The replicas in Ekko communicate directly with each other, sharing lists of the most recent updates for various shards of the model parameters and exchanging updated shards. This approach results in an eventually consistent system, which has not proven to be a problem for the production workloads at Tencent. The paper also details some logic to provide prioritization of updates so the most important models get updated first.

    Moving at speed means there isn’t a lot of time to run tests and validation against the model. To maintain stability, they run baseline versions of the model at a low percentage of traffic. If the new update starts underperforming the baseline, “witness” servers trigger a rollback.

  • Sparks of AGI

    Microsoft published a paper about GPT-4, with the (literal) headline claim that it showed sparks of artificial general intelligence. The paper includes an approach to evaluate the model’s abilities, and many examples.

    Jorge I Velez put together an excellent write up on LessWrong (or his substack).

    One topic discussed is inner dialog:

    “[S]ince the training data is essentially a linear exposition of the solution, a model trained on this data has no incentive to engage in an “inner dialogue” where it revisits and critically evaluates its own suggestions and calculations. Second, the limitation to try things and backtrack is inherent to the next-word-prediction paradigm that the model operates on. It only generates the next word, and it has no mechanism to revise or modify its previous output, which makes it produce arguments “linearly”.

    In contrast, when humans answer questions, they can have an internal reasoning process before speaking or taking action. We can think about the problem and then examine our own thoughts, while the model has no ability to go back and alter what it has already output. You can envision a two-step process where a language model generates an answer that is then added to the prompt before emitting the final response; effectively this is what is happening in various chatgpt prompt chains already.

    Extending this: some popular theories of mind posit our neural processes as many separate parts vying for attention, and, at a sufficiently high level, us making choices between them. Our inner dialogue makes us capable of making decisions and critically evaluating our own suggestions and calculations: a space for those subagents to operate.

    In a slide deck, Yann LeCunn sticks to his guns re “fundamental architecture change”, namely:

    • Abandon generative models
      • in favor joint-embedding architectures
    • Abandon probabilistic model
      • in favor of energy-based models
    • Abandon contrastive methods
      • in favor of regularized methods
    • Abandon Reinforcement Learning
      • In favor of model-predictive control

    Meanwhile Karpathy is more on the “string them together” side:

    “1 GPT call is a bit like 1 thought. Stringing them together in loops creates agents that can perceive, think, and act, their goals defined in English in prompts. For feedback / learning, one path is to have a “reflect” phase that evaluates outcomes, saves rollouts to memory, loads them to prompts to few-shot on them. That is the “meta-learning” few-shot path. You can “learn” on whatever you manage to cram into the context window. “

  • Monolith: Real Time Recommendation System

    A longer link-and-rec, but a fascinating paper from ByteDance on one of their recommendation systems.

    Brief Background on Recommendation Systems

    In a recommendation system, your goal is to present the most relevant content to users; this can be anything from short videos, to ads, to product listings. You have a vast pool of users and content to handle.

    You begin by representing users and content through their attributes (features). Attributes can be things like the user’s location, their watch history, or who they follow. These attributes can be dense (mostly populated with values) such as “age of user account”, or sparse (mostly empty), like “has liked a video about Vacuum Fluorescent Displays.”

    For each user or piece of content, you create a lengthy array of attributes and their values, with the majority being zeroes. These high-dimensional arrays are then transformed into lower-dimensional representations called embeddings. Embeddings are like a semantic hash that captures the essence of users or content, placing similar ones close to each other in the vector space.

    These embeddings are fed into a deep network, which maps the embeddings together and generates recommendations. Due to the enormous amount of data involved, there might be multiple layers or tiers within the system to handle the processing efficiently.

    The recommendation system then suggests content for users. After a certain period, you’ll evaluate the quality of those recommendations. The feedback loop could be short (whether the user watched the entire video) or long (if the user purchased a product from an ad). This information is used as training data for future iterations of the recommendation system, creating a continuous learning process.

    Challenges

    The paper describes two major problems recommendations systems have, when compared to other deep learning tasks.

    Sparse, Categorical & Dynamic Data: In the recommendation world, you’re dealing with a dizzying number of dimensions—billions of users and pieces of content, and a huge number of attributes to describe them. This creates a long tail of items that appear less often. One popular fix is hashing, where you map multiple elements (users, content, ads) into a single, embedding bucket. This trick also helps tackle the “cold start” problem, such as when you’re trying to recommend something to a user you know nothing about. These hashes can cause collisions between otherwise independent entries, which can impact how effective the recommendations are.

    Shifting Data Distribution: Unlike language models, where the tokens you predict based on a phrase stay pretty consistent over time, user interests in recommendation systems can be as fickle as fashion trends. Over the course of an hour a user might consume a whole bunch of woodworking videos on TikTok, but then want to switch to stand-up snippets. If a feed gets too optimized it can turn stale or, worse, bore users, leading to less engagement. So, for better recommendations, you’ve got to keep up with the user’s actions to capture their ever-changing tastes and preferences.

    “Intuitively, information from a more recent history can more effectively contribute to predicting the change in a user’s behavior. To mitigate the effect of Concept Drift, serving models need to be updated from new user feedback as close to real-time as possible to reflect the latest interest of a user.”

    Monolith

    The Monolith recommendation architecture addresses these two concerns through use of a collisionless hash algorithm and an online-training scheme.

    Cuckoo Hashing & Parameter Servers

    Recommendation models tend to be absolute behemoths, way too large for a single GPU (which, nowadays, max out at around 80GB). Each training example is some variant of “User X watched Video Y”, meaning most batches of examples only touch fractions of the users and videos in the mix. To manage this, they store the model’s parameters on a set of separate parameter servers and only load the relevant bits onto the training machines. These parameter servers rely on a Cuckoo hash-based hashtable, designed to avoid collisions:

    “[the system] maintains two tables 𝑇0,𝑇1 with different hash functions ℎ0(𝑥), ℎ1(𝑥), and an element would be stored in either one of them. When trying to insert an element 𝐴 into 𝑇0, it first attempts to place 𝐴 at ℎ0(𝐴); If ℎ0(𝐴) is occupied by another element 𝐵, it would evict 𝐵 from 𝑇0 and try inserting 𝐵 into 𝑇1 with the same logic. This process will be repeated until all elements stabilize, or rehash happens when insertion runs into a cycle.”

    This could end up using a lot of memory, so to save space (and improve performance) they prune old IDs (old, inactive users or old pieces of content), as well as ones that have very low usage.

    Online training

    The model is trained in two different ways. They start with traditional batch training, sifting through all their data just once. This generates the model, which is shifted over to live parameter servers and starts making recommendations.

    Next, they switch gears to online training. As user actions come in (like giving a thumbs-up to a video or watching it till the end), they merge the data with the recommendation info that prompted it, creating a fresh training example. They batch up these new examples and run a training pass, then copy the updated sparse parameters from the training parameter servers to the production one. Moving just the modified parameters makes the updates way smaller than pushing whole model. Once a day, they sync the dense parameters.

    This sounds straightforward, but it hides some significant engineering. Model publishing often involves an intricate process, such transforming a model for optimal serving to ensure it responds quickly, especially when low latency is a must (like ranking and recommending cases). Serving models is also a more distributed challenge: training usually crams as much data and computing power together as possible in one place, while inference (making predictions) is best done close to the user request for performance and latency reasons. To top it off, inference is often run on entirely different machines, or ones with varying characteristics, as compared to the training devices.

    ByteDance showcases their ability to manage this process at a fine-grained level, delivering updated at a high pace (around 1 minute from action to model update), making their product ultra-responsive to users’ ever-changing tastes and reactions.

    Other bits

    The paper closes with a very practical tip: you can likely get by with snapshotting less often. During training, individual server failures are way less isolated than the ones in more traditional “big data” workloads, and can result in a loss of a significant amount of progress. To cut down on the impact, it’s common practice to save snapshots of the model during training, to use as a checkpoint if you have to restart the process. Snapshot more frequently, and you’ll waste less time on failures—but you’ll also spend more time on the snapshotting itself.

    “We also did quite some exploration in this regard and to our surprise, model quality is more robust than expected. With a 0.01% failure rate of PS machine per day, we find a model from the previous day works embarrassingly well. This is explicable by the following calculation: Suppose a model’s parameters are sharded across 1000 PS, and they snapshot every day. Given 0.01% failure rate, one of them will go down every 10 days and we lose all updates on this PS for 1 day. Assuming a DAU of 15 Million and an even distribution of user IDs on each PS, we lose 1 day’s feedback from 15000 users every 10 days. “

  • On The Factory Floor

    March 20 2023

    “On the Factory Floor: ML Engineering for Industrial-Scale Ads Recommendation Models” dives into practical lessons learned by Google from building large-scale models to predict click-through rates on ads. This incredibly dense paper is packed with practical knowledge gained through real-world experience, and I highly recommended it for anyone working on large-scale ranking and recommendation systems.

    One of the most interesting ideas in the paper is the suggestion to evaluate improvements based on gains in accuracy versus increased model complexity, and potential reductions in complexity while maintaining accuracy. Any improvement to the model accuracy could instead be reframed as a reduction in model size while holding the accuracy constant, helping keep training time from ballooning: looked at in this way some advances are bigger wins in efficiency than they are in accuracy.

    Bottlenecking is another concept discussed in the paper. Making layers in the deep network wider is a consistently effective way of improving model accuracy, but comes with a significant cost in training time. By adding narrow bottleneck layers, the costs associated with making layers wider can be mitigated while only slightly impacting accuracy. The paper also delves into automating the addition of these layers using reinforcement learning, which can lead to significant savings in training time without sacrificing accuracy.

    Data sampling is critical when dealing with massive amounts of data. The system trains over a time window, favoring more recent examples. Clicked examples are rarer, so they sample more of those and correct biases later. They also up-sample ads that are not often seen or clicked.

    Loss optimization is another practical aspect of the system. Multi-objective optimization is used to optimize ranking loss (how correctly it orders ads) as well as regular click-through rate loss. The paper also mentions using distillation with a loss from a teacher model. However, training with all losses at once can lead to instability, so they start with log loss and gradually introduce other techniques.

  • Rich Sutton’s Bitter Lesson

    Saw this referenced a few times, and hadn’t read it before. Short, and worthwhile.

    We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.