The language of probabilities
Organized by New York Times crossword editor Will Shortz, the American Crossword Puzzle Tournament (ACPT) is the oldest and biggest tournament of its kind, consisting of seven qualifying puzzles and a final playoff puzzle. This year, the event moved to a virtual format, attracting more than 1,100 contestants vying for the $3,000 grand prize.
But for the first time in ACPT history, the top scorer was not a human contestant but an artificial intelligence (AI) system known as Dr.Fill. This unprecedented computer science victory was born of a last-minute collaboration between the original system’s inventor, Dr. Matthew Ginsberg, and a team of UC Berkeley engineers in the Berkeley Natural Language Processing Group (NLP).
Berkeley in the clues
Some examples of clever Berkeley-themed clues used in past crosswords.
- They’re given out at Berkeley’s Haas School – MBAS
- Winter hrs. in Berkeley – PST
- Domain ender that UC Berkeley was one of the first schools to adopt – EDU
- Angeleno at Berkeley, say – INSTATER
Dr.Fill bested humans throughout the competition, which ranks contestants based on accuracy and speed. In the qualifying phase, the highest human score was 12,810; Dr.Fill beat that with a score of 12,825. Several contestants completed the final playoff puzzle correctly, but Dr.Fill finished in just 49 seconds, two minutes and 11 seconds faster than Tyler Hinman, the grand prize winner.
Ginsberg — an academic and businessman in Eugene, Oregon — first developed the Dr.Fill program in 2012, when it finished 141st among 650 ACPT contestants. Using a combination of techniques from machine learning and classical AI, Dr.Fill gradually improved its tournament performance each year, peaking at 11th place in 2017. This year, its core AI was augmented by a state-of-the-art neural question-answering system — called the Berkeley Crossword Solver (BCS) — from the Berkeley NLP group, led by electrical engineering and computer sciences professor Dan Klein.
“We each brought something very different and valuable to the combined effort, and I think it’s fair to say that neither of us could have won the competition without the other,” said Ginsberg. “At some level, we were both using the same ‘how likely is this answer, given the clue?’ methodology. That allowed us to integrate the two systems in a very short period before the tournament began.”
A perfect match
The partnership began with computer science Ph.D. student Nicholas Tomlin, who had been re-implementing Dr.Fill as a fun side project since the summer of 2020. “I built some modules for it, but I never managed to replicate the accuracy of Matt’s original system,” he said. Although the project seemed ambitious, it broke down into well-defined modules and connected to the interests of other Berkeley NLP students, so Tomlin pitched it as a potential project in February, and the BCS was born. The group soon included Klein, three Ph.D. students (Nicholas Tomlin, Eric Wallace and Kevin Yang), and two undergrads (Albert Xu and Eshaan Pathak).
The team reached out to Ginsberg in March and sent him an initial application programming interface just two weeks before the April competition. “We had a state-of-the-art natural language understanding and question-answering component but a pretty basic crossword handler, while Matt had the best crossword system around and a bunch of domain expertise, so it was natural to join forces,” said Klein. “As we talked, we realized that our systems were designed in a way that made it very easy to interoperate because they both speak the language of probabilities.”
According to Klein, there are two parts to solving the puzzle: first to come up with answers to the clues and then, from all the answers that might work, to find which ones fit together. “The first part is really a game of language understanding. And the second is about search and reasoning.”
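The two stages Klein describes can be sketched as a toy solver: a set of per-clue candidate answers with model probabilities (the first stage), and a search for a mutually consistent, highest-probability fill (the second). The tiny grid, the candidate answers, and their scores below are invented for illustration; this is not the actual Dr.Fill or BCS code, and a real solver would use constrained search rather than brute-force enumeration.

```python
import itertools

# Toy crossword: two slots that cross at their first letters.
# Each slot maps to candidate answers with illustrative probabilities,
# standing in for the output of a clue-answering model.
candidates = {
    "1A": {"CAT": 0.6, "COT": 0.4},
    "1D": {"CAB": 0.7, "COB": 0.3},
}
# Crossing constraint: letter 0 of 1A must equal letter 0 of 1D.
crossings = [("1A", 0, "1D", 0)]

def consistent(fill):
    """True if every crossing pair of slots agrees on its shared letter."""
    return all(fill[a][i] == fill[b][j] for a, i, b, j in crossings)

def best_fill(candidates):
    """Score every joint assignment and keep the most probable
    consistent one (brute force; real solvers search smarter)."""
    slots = list(candidates)
    best, best_score = None, 0.0
    for combo in itertools.product(*(candidates[s] for s in slots)):
        fill = dict(zip(slots, combo))
        if not consistent(fill):
            continue
        score = 1.0
        for s in slots:
            score *= candidates[s][fill[s]]
        if score > best_score:
            best, best_score = fill, score
    return best, best_score

fill, score = best_fill(candidates)
print(fill)  # the highest-probability consistent fill
```

Here all four combinations happen to be consistent, so the answer probabilities alone decide the fill; in a real grid, the crossing constraints prune most combinations, which is where the search-and-reasoning stage earns its keep.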
The BCS is a machine-learning-based system that trains a neural network model on enormous amounts of data, both from past puzzles and more generally. The team built a new question-answering system that learned how to combine general language understanding with the kinds of creative clues that show up in crosswords. Like a human, the system knows a good deal about language before it plays its first crossword, and then gets better as it trains on each puzzle. Compared to a human, the system has less world knowledge, but it has been trained on 6.5 million past crossword clues.
The BCS’ hypotheses on each clue were then given to Dr.Fill, which had expertise on crossword structure, such as how to weigh alternatives within the grid and how clever themes modify the rules for any given puzzle. Because both systems compute with probabilities, they can be mixed and matched modularly.
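Because both systems express their beliefs as probabilities over candidate answers, their outputs can be merged without either needing the other's internals. One minimal way to sketch such an interface is a weighted log-linear mixture of the two distributions; the weight, the candidate answers, and their probabilities below are assumptions for illustration, not the scheme the teams actually used.

```python
import math

def mix(p_neural, p_classical, weight=0.5):
    """Combine two probability distributions over candidate answers
    with a weighted log-linear mixture, then renormalize.
    `weight` trades off the two systems (illustrative value);
    a small floor stands in for answers one system never proposed."""
    answers = set(p_neural) | set(p_classical)
    scores = {
        a: weight * math.log(p_neural.get(a, 1e-9))
           + (1 - weight) * math.log(p_classical.get(a, 1e-9))
        for a in answers
    }
    z = sum(math.exp(s) for s in scores.values())
    return {a: math.exp(s) / z for a, s in scores.items()}

# Hypothetical outputs for the clue "Winter hrs. in Berkeley":
combined = mix({"PST": 0.8, "PDT": 0.2}, {"PST": 0.6, "MST": 0.4})
```

The point of the sketch is the interface, not the numbers: as long as each system emits a distribution over the same answer space, the combination step is a few lines of arithmetic, which is why the two codebases could interoperate on short notice.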
Klein says it’s possible, even likely, that future tournament puzzles will become harder in order to stump the improved system. But those puzzles will be trickier for humans, too.
“Humans solve crosswords by first filling in the answers that they feel confident about,” Wallace said. “While our AI system tries to do the same, it’s sometimes highly confident even when it has no clue what it’s talking about. Given this, a future goal for our research group is to develop models that are not only accurate but also ‘know what they don’t know.’”