Just heard this was accepted to a conference, but I had completely missed the preprint being released. I briefly supported some of the folks involved in this work at Meta, and I'm very glad to see it being shared publicly:
Moving Faster and Reducing Risk: Using LLMs in Release Deployment
The project predates Llama, and the paper goes through the history of using a couple of different BERT-style models and then (code)llama, modified for classification/scoring, for evaluating the riskiness of a change.
The goal was less to evaluate every change than to allow low-risk changes to go through even during code freezes in business-critical periods.
It's also a good example of practical fine-tuning for a somewhat unusual use case!
During inference, we can query the model to generate the label token. To get the risk score rather than just the label, we extract the probability of the label tokens “0” and “1” from the next token distribution. If the model is aligned well, all other tokens in the model’s vocabulary should have close to zero probability of being generated.
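The label-probability trick described above can be sketched in a few lines. This is a toy illustration, not the paper's code: the vocabulary, logits, and function name are made up, and in a real setup the logits would come from the model's next-token distribution rather than a hard-coded list.

```python
import math

def risk_score(logits, vocab, risky_label="1", safe_label="0"):
    """Turn next-token logits into a risk score by reading off the
    probabilities of the two label tokens (a toy sketch)."""
    # Softmax over the full vocabulary, with the usual max-subtraction
    # for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]

    p_risky = probs[vocab[risky_label]]
    p_safe = probs[vocab[safe_label]]

    # For a well-aligned model, nearly all probability mass should sit
    # on the two label tokens; the remainder going elsewhere is a
    # useful sanity signal.
    label_mass = p_risky + p_safe
    return p_risky, label_mass

# Hypothetical 3-token vocabulary and logits for illustration only.
vocab = {"0": 0, "1": 1, "other": 2}
logits = [2.0, 4.0, -5.0]
score, mass = risk_score(logits, vocab)
```

Here the score is the probability of the "1" (risky) token rather than a hard 0/1 decision, which is what lets the system rank changes by risk instead of just classifying them.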
Congratulations to Rui, Vijay, Nachi, Chandra and everyone on the team!