DeepMind Is Now the Undisputed Leader in Language AI with Gopher (280B)

DeepMind, one of the brightest stars in the AI firmament, has done it again. CEO Demis Hassabis founded the startup in 2010; it was later acquired by Google for $500 million, with the mission to solve “intelligence, advance science, and benefit humanity.” The company’s unique approach to AI combines machine learning with neuroscience, among other disciplines, with the “long-term aim” of developing artificial general intelligence (AGI).
DeepMind is behind some of the most impressive AI breakthroughs in the last decade.

Between 2015 and 2017, they built a series of models intended to surpass humans in two-player games of perfect information, like chess and Go. AlphaGo, the first of the family, won against some of the best Go players in the world — cementing AI’s dominance in a game that is considered notably harder for computers than chess.

AlphaGo’s successor, AlphaGo Zero, mastered the game through self-play in less time and with less computing power. It destroyed AlphaGo 100–0, underlining the limitations of learning from human-based strategies. AlphaZero, the most popular of the three, learned to play Go, chess, and shogi, also using a self-play approach. It achieved unprecedented results and became an international sensation; together with GPT-3, it’s arguably the most famous AI system born in the deep learning revolution.

In 2019, DeepMind introduced AlphaStar, an AI capable of going toe-to-toe against the best players of StarCraft II, and the first AI to reach world-class level in an e-sport that requires “strategic capability in an imperfect information world.” The following year, AlphaFold helped solve a 50-year-old problem in biology: protein folding. Experts on the topic considered it a solution to the challenge. DeepMind made the system available for free globally in July 2021.

Given DeepMind’s unparalleled history of AI developments, it was surprising they hadn’t made an appearance in the flourishing area of large language models (LLMs). That changed a few days ago, when the company published a series of papers on the topic, revealing news that will form the basis for the next steps in the field.

In this review article, I’ll highlight the key findings and results of Gopher, a language model with 280 billion parameters that has greatly surpassed previous state-of-the-art (SOTA) models like GPT-3 and J1-Jumbo.
Gopher: The new leader in language AI

Gopher, like GPT-3, is an autoregressive transformer-based dense LLM: basically, it predicts the next word given a text history. With 280 billion parameters, it’s only rivaled in size by Nvidia’s MT-NLG (530B), developed in partnership with Microsoft.
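The autoregressive setup can be sketched with a toy next-token loop. This is a minimal illustration, not Gopher’s actual code: the vocabulary and bigram counts below are invented, and a real LLM would replace the lookup table with a transformer scoring every token in a huge vocabulary.

```python
# Toy autoregressive "language model": predicts the next token from a
# hand-made bigram count table, appends it to the context, and repeats.
# Only the decoding loop resembles what models like Gopher do.

BIGRAMS = {
    "the": {"cat": 3, "dog": 1},
    "cat": {"sat": 2, "ran": 1},
    "sat": {"down": 4},
}

def next_token(tokens):
    """Greedy prediction: the most frequent follower of the last token."""
    followers = BIGRAMS.get(tokens[-1], {})
    if not followers:
        return None
    return max(followers, key=followers.get)

def generate(prompt, max_new_tokens=3):
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        tok = next_token(tokens)
        if tok is None:
            break
        tokens.append(tok)  # feed the prediction back in: autoregression
    return " ".join(tokens)

print(generate("the"))  # "the cat sat down"
```

The key property is that each prediction is conditioned on everything generated so far, which is why these models are called autoregressive.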

The model was trained on MassiveText, which includes various sources like MassiveWeb (a compilation of web pages), C4 (Common Crawl text), Wikipedia, GitHub, books, and news articles. Together with Gopher, DeepMind built the Gopher family: a series of smaller models spanning from 44M to 7.1B parameters. All the models were trained on 300B tokens (12.8% of MassiveText) to isolate the effects of scale on their power.

Gopher’s performance was compared with that of SOTA models on 124 tasks across several disciplines: math and logic, reasoning, knowledge, science, ethics, and reading comprehension. Gopher outperformed SOTA models, including GPT-3, J1-Jumbo, and MT-NLG, in 100 out of 124 tasks (81%)!

These results consolidate Gopher as the most powerful LLM to date and DeepMind as the number one contender to the throne in language AI, and a safe bet to lead us toward the next AI breakthrough.
Aside from the overall incredible results, researchers found specific trends that repeat across tasks.

Gopher showed slight improvements over SOTA models in some benchmarks (and even did worse in a few cases), but in others the gains were plainly great. “Reasoning-heavy” tasks proved especially difficult (it’s been repeatedly suggested that LLMs have a hard time with common-sense and causal reasoning), but it shone in “knowledge-intensive” tasks.

Although generally better than previous LLMs, Gopher is still very far from human-level performance in most tasks (and also from supervised SOTA models trained specifically for the task at hand). Some examples are reading comprehension, common sense, logical reasoning, and language understanding. One anecdotal result worth mentioning is Gopher’s fact-checking skill: it did better than smaller models not by having a deeper understanding of misinformation, but by knowing more facts.

It’s blatantly clear that understanding and reasoning aren’t LLMs’ strengths:

“For a few categories of tasks (e.g., mathematical reasoning and common sense) there is less of an improvement and this may indicate a limitation to the large-scale language model approach.”
Experts have been saying this for years. LLMs are great for some tasks, but they aren’t well-suited to replicate humans’ capacity to understand the world. We’ve evolved here and have developed an extensive toolkit to navigate reality. Reading a billion books is useful, but it can’t replace being alive.

Scale vs. data

A current trend in LLM development is to design ever-larger models to reach new heights (Julien Simon from Hugging Face says “it’s starting to look like another Moore’s Law”), but no company has rigorously analyzed which variables most affect the power of these models. Is it the number of parameters? The terabytes of data? A mix of both?

DeepMind wanted to study the effects of scale (number of parameters) on model power while controlling for dataset size. They trained Gopher and the smaller models on the same amount of text from the same dataset. The main result is clear: Gopher was better in most tasks, and in more than half of them (51.2%) the performance increase was larger than 25%. Undoubtedly, scale has a decisive influence on models’ performance.
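The “more than 25% improvement in 51.2% of tasks” statistic is a count of relative gains over a baseline, which can be sketched as follows. The task names and scores here are invented placeholders; the paper reports per-benchmark numbers.

```python
# Relative performance gain of a larger model over a smaller baseline,
# and the share of tasks where that gain exceeds a threshold.
# Scores are hypothetical stand-ins for per-task benchmark results.

def relative_gain(new, old):
    return (new - old) / old

tasks = {
    "task_a": (0.50, 0.70),  # (baseline score, larger-model score)
    "task_b": (0.60, 0.62),
    "task_c": (0.40, 0.55),
}

big_gains = [
    name for name, (old, new) in tasks.items()
    if relative_gain(new, old) > 0.25
]
share = len(big_gains) / len(tasks)
print(big_gains, f"{share:.0%}")  # ['task_a', 'task_c'] 67%
```

With real per-task scores in place of the placeholders, the same tally yields the 51.2% figure DeepMind reports.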

Gopher showed the most improvement in tasks reliant on knowledge like science, technology, and humanities. This suggests “scale seems to ‘unlock’ the ability of a model to significantly improve performance on particular tasks.” On the other hand, smaller models “often perform better… than larger models” in reasoning and logical tasks.