Ekko: Low Latency Model Updates at Tencent

This paper from Tencent last year on the architecture of their recsys, and how it enables a high degree of freshness, via low-latency model updates to deliver fresh and relevant recommendations.

Infrastructure and Architecture

The infrastructure of Ekko is centered around a parameter server, which stores the model weights and embeddings created during training. These parameters are then copied to replica parameter servers in data centers around the world, for inference. With a massive model size of 40TB (!) and over 4500 servers for distribution, Ekko is built to handle a lot of users and a lot of content.

Freshness and Privacy

Freshness is crucial in Ekko’s design, as it aims to provide the best recommendations for new content and users. The paper also calls out the need to accommodate anonymous users given the expanse of privacy regulations; one of the few times I’ve seen this explicitly mentioned. In all cases, they’re focused on learning from behavior and getting updates out as quickly as possible: seconds, rather than minutes or hours.

Model Sharding

The system continuously trains on new data, and divides the model into shards. Model sharding takes advantage of the fact that updates are often clustered in space (within the model) and time (updates targeting a small number of hot items). In their production systems, Tencent observes that 3.08% of the parameters are updated per hour. Individual shards within the model are versioned separately within the parameter server.

Peer-to-Peer Distribution

The next challenge is getting the updated weights out to the distributed replicas. Ekko’s architecture is based on a peer-to-peer distribution of model weights, taking into account the topology of its hosting environment. Each data center hosts multiple replica parameter servers, which supply model weights to GPUs for inference. Leaders in a location are elected via Zookeeper and are responsible for communication between data centers, avoiding duplicate requests contending with each other over limited between-data-center pipes.

The replicas in Ekko communicate directly with each other, sharing lists of the most recent updates for various shards of the model parameters and exchanging updated shards. This approach results in an eventually consistent system, which has not proven to be a problem for the production workloads at Tencent. The paper also details some logic to provide prioritization of updates so the most important models get updated first.

Moving at speed means there isn’t a lot of time to run tests and validation against the model. To maintain stability, they run baseline versions of the model at a low percentage of traffic. If the new update starts underperforming the baseline, “witness” servers trigger a rollback.

Discover more from Ian’s Blog

Subscribe now to keep reading and get access to the full archive.

Continue reading