A story about validating language models
This happened some time ago, but it has some good lessons. How can you build confidence that a language model is trustworthy, that it will never output something inappropriate? Safety isn’t as clear-cut as other performance dimensions. Despite our best efforts to quantify distastefulness, in practice we test model outputs by looking at them. This is true for your pet project, and it’s true for the world’s leading AI researchers.1
The background is that we were ready to use outputs from a state-of-the-art (at the time) language model directly in our product. We used the language model to generate multi-word autocompletions that users could select to complete their sentences. This was a major step for us. Before that, ML models had been used to make decisions or to recommend a subset of actions from a manually curated list. This would be the first time we would let a language model generate nearly arbitrary content and pass it on directly to our users.
My understanding is that the product team was eager to launch, but someone put the brakes on the project over concerns that the results weren’t sufficiently validated. Everyone had high confidence that the model was generating excellent autocompletions composed of valid English text, but people worried that the autocompleter could be drawn into forming or completing racist, sexist, rude, or otherwise inappropriate content. Of course the system had a list of banned words, but inappropriate content can be built out of individually innocuous words. The modelers had spent quite some time looking at model outputs during their development process, building confidence among themselves, but this was unconvincing to other stakeholders.
Our standard is higher than having a model that isn’t generally {racist, sexist, rude}. Instead, we need the model to never say anything that could even be interpreted as such. Furthermore, our standard is even higher than the model always outputting the text that users intend to type. We don’t want the model to even interact with distasteful content: we should notice inappropriate sentence fragments and decline to suggest autocompletions. As such, this wasn’t a matter of building a better ML model on traditional performance metrics or a better autocompleter at returning the suggestions users want; there are cases where users do want to write inappropriate sentences, and we don’t want our autocompleter to be a part of that.
Colleagues regularly referred to what was historically called the newspaper test or the front page test, and which in tech circles is often called the New York Times test: how would you feel if what you wrote showed up, out of context, on the front page of a leading newspaper? Could one person with enough time on their hands find a prompt that triggered something inappropriate from the autocompleter, post a screenshot, and set off viral negative feedback? The model needs to be more than simply not racist; it has to be pristinely evasive of racism.
How can you be sure that a model will never create an inappropriate sentence?
A practical approach
One initial idea brought our way was that a group of colleagues should each spend a bit of time trying to fool the autocompleter through the product UI. While I didn’t consider this counterproductive, I did think it would be inefficient. I also thought it would have low coverage: I wasn’t expecting the people I work with to be particularly good at mimicking inappropriate behavior. Maybe they would all try to be racist in similar ways, yielding redundant coverage of the most obvious bad words while missing other ways to be inappropriate.
I pushed for a better approach. I suggested we send corpora of bad text into the autocompleter through its API, building a large sample of responses which we could then evaluate. My manager agreed, and he made another of the key contributions to the project: he found some datasets. Other teams had accumulated small datasets of bad content from users over the years, content that had been marked as inappropriate by other users and verified with human review. There was one particular corpus we used most effectively. It had the worst racist, sexist, hateful content you could imagine. And then much more you couldn’t imagine, because you don’t have the vocabulary. None of us do. Those atypical slurs are some of the key ones to catch, because they wouldn’t otherwise be obvious and they might not be in filtering lists. Even though this list contained only a few hundred samples, it was the varied corpus of awful behavior that we needed.
I built a pair of programs that together would read in a dataset, feed each sample to the autocompleter API, record the resulting suggestion, and ultimately format the results for human viewing. When performing an exercise like that you have to decide at what granularity to feed the model, whether after every character, every word, every sentence, or only after the full text sample. I built support for several of those options, and created multiple outputs accordingly. I had the program create CSV files that were organized for easy subsequent reading. Then we measured how many entries evaded the autocompleter’s filter and led to completions after (complete) inappropriate text, and we could also use human judgment to evaluate whether any of the autocompletions on text fragments led to bad content.
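To make that concrete, here is a minimal sketch of the feeding half in Python. The `get_suggestions` client and the shape of what it returns are stand-ins for the real autocompleter API, which was far more involved:

```python
import csv

def prefixes(sample: str, granularity: str):
    """Yield prefixes of a sample at the chosen granularity."""
    if granularity == "char":
        for i in range(1, len(sample) + 1):
            yield sample[:i]
    elif granularity == "word":
        words = sample.split()
        for i in range(1, len(words) + 1):
            yield " ".join(words[:i])
    else:  # "full": send only the complete sample
        yield sample

def run_corpus(samples, granularity, out_path, get_suggestions):
    """Feed each prefix to the autocompleter and record the top suggestion."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["sample", "prefix", "suggestion"])
        for sample in samples:
            for prefix in prefixes(sample, granularity):
                # get_suggestions stands in for the real API client and is
                # assumed here to return a list of suggestion strings.
                suggestions = get_suggestions(prefix)
                writer.writerow([sample, prefix, suggestions[0] if suggestions else ""])
```

Sentence-level splitting would slot in as another branch of `prefixes`; the real tooling supported several of these options and created a separate output for each.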
The programs were conceptually simple, although they were probably a bit larger than you would guess due to intricacies of the API and the nested structures that had to be packed and unpacked. One small obstacle I faced when trying to merge the code into the owning team’s code path was explaining the value of the tool at all. There was already a perfectly fine API where you could send requests and receive responses directly from a terminal, and the output structures could already be stored losslessly in a more efficient binary format. Why would anybody want to move data in and out of text formats? And was this code too simple to add to our repository?
I had to explain why it makes sense at all to optimize output formats for human reading. Yet that was a key part. The raw API output, even pretty-printed, typically covered most of an entire terminal window for each keystroke sent. This is because of the nested structures and the amount of tracking data and (helpful) model information returned. That’s great for completeness and for debugging any one API call, but it’s awful for trying to review a corpus of even moderate size.
I wanted data to be visually packed, to fit dozens of results on a monitor at a time. Not only did the raw API outputs require constant scrolling, but with each Page Down key press your eyes would have to scan to find the relevant fields because the outputs were of varying length and alignment. I wanted to extract just the most relevant fields (specifically but not exclusively: the suggested autocompletion) and put them in a tabular format. That would make reading easy, and it meant we could split up the work and cover a vastly larger and more varied corpus of test cases in a short period of time. Legibility, readability, information density, minimizing eye movement, and maximizing read speed are first-order concerns for human review.
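As a sketch of what that flattening step looks like, with made-up field names standing in for the real nested response:

```python
def to_row(prefix: str, response: dict) -> dict:
    """Pull out only the fields a reviewer needs from a nested API response."""
    # Field names here are illustrative; the real responses carried far more
    # tracking data and model metadata than is useful during review.
    suggestions = response.get("suggestions") or [{}]
    top = suggestions[0]
    return {
        "prefix": prefix,
        "suggestion": top.get("text", ""),
        "score": top.get("score", ""),
        "filter_triggered": (response.get("filter") or {}).get("triggered", ""),
    }
```

One short row per query, written out as CSV, is what lets dozens of results fit on a screen at once.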
Automating the testing and structuring the outputs for human review might have led to 100x the volume of reviewed queries. Simultaneously, this gave us much better coverage of inappropriate content by using datasets from real users rather than our own attempts at mimicking bad behavior.
What did we find?
You would think this story needs a happy ending, that the project saved us from the front page of the New York Times. Well, that’s possible, in the sense that I don’t think we ever ended up in the press or in a viral Twitter thread for rotten autosuggestions. But I can’t take much of the credit, because when we ran my program and stress tested the autocompleter on some nasty datasets, we only found a handful of gaps (and no, I won’t tell you what they were). We patched those, but they were rare enough that I don’t know if the autocompleter would have caused trouble even if we had skipped the testing. We helped, but it turned out that the measures taken by the modelers and product developers already gave strong coverage.
But at a minimum, we strengthened the product, contributed some quantifiable measures of its robustness to bad prompts, and helped make everyone comfortable with a public release. We left behind some good tools and a playbook for next time.
1. Which I’m not one of, just to be clear.