Introduction
- Title: Stanford CS25: V4 I Jason Wei & Hyung Won Chung of OpenAI
- Overview:
This talk weaves together deep technical insights about large language models, scaling laws, and Transformer architectures with a forward-looking discussion on AI research. Early in the presentation, the speakers introduce the non-linear “emergent” behavior of models as they scale and explain how exponentially cheaper compute has become the driving force in pushing model performance. Later, they shift focus to dissecting the inner workings of Transformer architectures—comparing encoder–decoder and decoder-only models—and they conclude by questioning whether our current learning objectives (maximum likelihood) are the real bottleneck to further progress. The talk is both technical and visionary, connecting theory with practical implications for future research.
Chronological Analysis
Segment 1: Emergent Abilities and Scaling Behavior
Timestamp: 18:46
“define like an emergent ability basically as: for small models, the performance is zero, so it’s not present in this model, and then for large models you have much better than random performance”
“the small model can repeat, it can fix the quote, but it doesn’t follow the instruction, so it decides to fix the quote”
Analysis:
In this segment, the speaker introduces the notion of emergent abilities—a phenomenon where smaller models show no sign of a particular capability, but once scaled up, the performance on that task jumps well above chance. The first quote succinctly defines emergent behavior in language models by contrasting the near-zero performance of small models with the robust results seen in larger ones. The second quote, while referring to a task (such as following a multi-step instruction), underscores that merely reproducing or “fixing” an output is not enough; only models with enough capacity can correctly integrate and act on complex instructions. This discussion is pivotal because it illustrates that scaling up isn’t just about reducing loss in a linear fashion; it may also unlock qualitatively new abilities. Such insights help explain why the field is excited about scaling laws and why real‐world applications (from accurate translation to reasoning) seem to suddenly “turn on” once models reach a critical size.
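To make the quoted definition concrete, here is a minimal Python sketch of the emergence criterion: a capability counts as emergent if small models sit at (roughly) random-chance performance while a sufficiently large model scores well above it. All model scales, accuracies, and the `margin` threshold below are invented for illustration; they do not come from the talk.

```python
RANDOM_BASELINE = 0.25  # e.g., chance level on a 4-way multiple-choice task

# (parameter count, task accuracy) -- hypothetical numbers for illustration
results = [
    (1e8,  0.24),
    (1e9,  0.25),
    (1e10, 0.26),
    (1e11, 0.62),  # performance "turns on" past a critical scale
]

def is_emergent(results, baseline, margin=0.05):
    """True if all but the largest model sit near the random baseline
    while the largest model clearly exceeds it."""
    small_accs = [acc for _, acc in results[:-1]]
    largest_acc = results[-1][1]
    near_random = all(abs(acc - baseline) <= margin for acc in small_accs)
    return near_random and largest_acc > baseline + margin

print(is_emergent(results, RANDOM_BASELINE))  # True for this toy data
```

The point of the predicate is that emergence is defined relative to scale: no single model's score decides it; only the pattern across sizes does.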
Segment 2: Dominant Driving Force in AI Research – Cheaper Compute
Timestamp: 39:05
“so what this means is you get 10x more compute every five years if you spend the same amount of dollars”
“the cost of compute is going down exponentially and this associated scaling is really dominating AI research”
Analysis:
Here the focus shifts from model behavior to the economic and technological forces shaping AI research. The speaker points out that—as if by design—every five years researchers can access ten times more compute for the same cost. This exponential decrease in compute prices is not just a technical detail; it underpins the rapid scaling of models and the dramatic leaps in performance. By highlighting that cost reductions in compute drive the ability to train larger, more capable models, the segment contextualizes the emergent abilities discussed earlier. In the real world, this means that advances in hardware and efficiency aren’t secondary concerns but are central to the evolution of AI, influencing everything from research directions to practical deployments.
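The quoted trend compounds like any exponential. A tiny sketch makes the arithmetic explicit; the 10x-per-5-years factor is taken from the quote, while the base amount and the sampled years are arbitrary.

```python
def compute_per_dollar(years, base=1.0, factor=10.0, period=5.0):
    """Compute obtainable for a fixed budget after `years`, assuming a
    smooth 10x-per-5-years exponential trend (as quoted in the talk)."""
    return base * factor ** (years / period)

for years in (0, 5, 10, 20):
    print(years, compute_per_dollar(years))
# After 20 years the same budget buys 10^4 = 10,000x the compute.
```

This is why the speaker treats cheaper compute as the dominant force: two decades of the trend multiplies a fixed research budget's compute by four orders of magnitude.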
Segment 3: Transformer Architectures – Encoder–Decoder vs. Decoder-Only
Timestamp: 48:00
“starting with the encoder, so here I’m going to show you an example of machine translation, which used to be a very cool thing”
“that’s the output of this encoder”
Analysis:
This segment marks a technical deep dive into the inner workings of Transformer models. The speaker begins by illustrating the encoder part of the architecture using a machine translation example. The first quote introduces the process of taking an input sequence (e.g., an English sentence) and encoding it into dense vector representations. The second quote confirms the outcome of that process—the set of vectors that will later be used by the decoder to generate the translated output. By comparing the classic encoder–decoder setup (which relies on distinct parameters for encoding the input and then generating the output) with decoder-only approaches (which typically share parameters and use unidirectional attention), the talk sets up a discussion about the trade-offs between added structure (a stronger inductive bias) and the flexibility of a less constrained model. This analysis not only deepens our understanding of how language models function at a granular level but also ties into later discussions about scalability and the impact of inductive biases.
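The structural contrast between the two architectures shows up most clearly in their attention masks. Below is a minimal sketch (not the speaker's code) using boolean masks, where `mask[i][j]` means position `i` may attend to position `j`; sequence lengths `S` and `T` are arbitrary.

```python
# Source length S (e.g., an English sentence) and target length T
# (e.g., its translation) -- sizes chosen arbitrarily for illustration.
S, T = 4, 3

# Encoder: bidirectional self-attention -- every source token can
# attend to every other source token.
enc_mask = [[True] * S for _ in range(S)]

# Decoder self-attention: causal (unidirectional) -- position i may
# only attend to positions j <= i.
dec_mask = [[j <= i for j in range(T)] for i in range(T)]

# Decoder-only model: source and target are concatenated into one
# sequence of length S + T under a single causal mask, with shared
# parameters for both roles.
dec_only_mask = [[j <= i for j in range(S + T)] for i in range(S + T)]

def count_allowed(mask):
    """Number of permitted attention links in a mask."""
    return sum(sum(row) for row in mask)

print(count_allowed(enc_mask))       # 16: full S x S bidirectional grid
print(count_allowed(dec_mask))       # 6: lower-triangular T x T
print(count_allowed(dec_only_mask))  # 28: lower-triangular (S+T) x (S+T)
```

The counts make the inductive-bias trade-off visible: the encoder–decoder split hard-codes which tokens are "input" (seen bidirectionally) versus "output" (generated causally), while the decoder-only mask imposes one uniform causal structure on everything.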
Segment 4: Learning Objectives and the Bottleneck in Scaling
Timestamp: 73:07
“the architecture is not the bottleneck in further scaling, and I think what’s the bottleneck now is the learning objective”
“and partly why I’m interested in RLHF as one instantiation of not using this maximum likelihood, instead using a reward model as a learned objective function, which is a lot less structured”
Analysis:
In this reflective segment, the speaker challenges a common assumption by arguing that while much attention has been paid to model architecture, the true bottleneck for future progress may lie with the learning objective itself. The first quote succinctly contrasts the prevailing focus on architectural tweaks with the need to reexamine how models are trained. The second quote introduces reinforcement learning from human feedback (RLHF) as an alternative approach—a method that replaces the traditional maximum likelihood objective with a reward model that can capture a richer, less rigid notion of correctness. This discussion is significant because it reorients the debate from “How should we build our models?” to “What should our models be learning?” The implications are far-reaching: if our training objectives are too constrained, even exponentially larger models might fail to capture the nuance needed for truly robust, general intelligence.
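The contrast between the two objectives can be sketched in a few lines. This is an illustrative caricature, not the speaker's formulation: `max_likelihood_loss` scores the single reference continuation, while a (here entirely toy) reward model can rank arbitrary candidate outputs, so many different answers can score well.

```python
import math

def max_likelihood_loss(reference_token_probs):
    """Negative log-likelihood of the reference tokens: the model is
    pushed toward exactly one 'correct' continuation."""
    return -sum(math.log(p) for p in reference_token_probs)

def best_by_reward(candidates, reward_model):
    """With a learned reward function, whole outputs are ranked instead,
    so the training signal is far less rigidly tied to one target."""
    return max(candidates, key=reward_model)

# Toy stand-in for a learned reward model (purely hypothetical):
# prefers longer answers that include the word "please".
toy_reward = lambda text: len(text) + (5 if "please" in text else 0)

print(max_likelihood_loss([0.9, 0.8, 0.95]))        # small positive loss
print(best_by_reward(["no", "yes please"], toy_reward))
```

The asymmetry is the whole point of the segment: maximum likelihood needs a reference string, whereas a reward model only needs a way to judge outputs, which is why the speaker calls it "a lot less structured."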
Conclusion
Throughout the talk, the speakers take us on a journey from understanding how scaling—both in model size and in the availability of compute—can lead to emergent abilities that were once thought impossible, to a technical dissection of Transformer architectures that have revolutionized language processing. They emphasize that exponentially cheaper compute has been the key enabler behind these leaps, while also questioning whether our long‐standing reliance on maximum likelihood training may soon limit further progress. By connecting the dots between hardware economics, model architecture, and learning objectives, the presentation not only maps the evolution of AI research but also lays out the practical and theoretical challenges that lie ahead. The overall learning outcome is a unified perspective on how advances in compute and careful rethinking of training methods will continue to shape the future of AI.