You should read this book if
I think the book is particularly useful for SWEs, management consultants, finance analysts, or business strategists who want to gain a technology-focused perspective on what’s beyond the horizon of a fancier LLM or a new image generation model. If you are a robot learning research scientist and believe in The Bitter Lesson, these ideas will probably sound familiar and I suggest reading Eric’s blog instead for more technical details.
Information about where to buy the book can be found at evjang.com/book/
Eric strongly believes in The Bitter Lesson — the idea that hand-engineered, task-specific systems will eventually be beaten out on that task by more general systems that leverage the ever-increasing amount of computation and data available. We’ve seen this play out before: on image classification, AlexNet’s deep-learned features significantly outperformed 20 years of human-designed feature extractors on the same problem. On translation, deep learning translation systems significantly outperformed 40 years of human-designed parsers, syntax trees, and grammars.
But more data and more parameters are not a straight-line path to turning an ImageNet classifier or a language model into a “generally intelligent” agent; we need new problem formulations. Indeed, if we do not give these agents sufficiently challenging problems to solve, they won’t learn general skills — they will learn problem-specific hacks that are sufficient to get by on that problem. Eric points to biology, where only the minimal intelligence needed to survive in an animal’s ecological niche ever develops. E. coli doesn’t need a brain in its ecological niche, and it’s the niche of humans (social apex predators) that drove us to develop complex communication (language), social awareness, and robust problem-solving skills; if these skills didn’t enable better survival in this niche, we would not have developed them.
Based on this, Eric argues that if we want our learning systems to develop these “higher level” skills, then we need to pose a problem so challenging that it requires them for success. He crystallizes this into an “artificial life” game he calls Jungle Basketball. This hypothetical game is zero-sum between two (sets of) agents and designed to have a very high skill ceiling, forcing agents to develop and counter a large array of different skills and strategies — first by maintaining possession of the ball, then by scoring on the opponent’s hoop or by murdering their opponent. Much like evolution selects for more and more intelligent agents as they climb into higher and higher niches, this zero-sum competition should drive agents to develop more and more capabilities, eventually leading to highly competent agents with a wide variety of skills, from team coordination to long-horizon goal selection and execution.
However, such a general game is far too expensive to simulate practically, and the process requires a search over an enormous space of agent policies (and their morphologies). Instead, we should take a middle, feasible path between the “Strong” Bitter Lesson view and hand engineering everything, using our knowledge about the world to guide this exploration. We can get humans to provide large-scale expert demonstrations in a variety of video games (for free, if humans like playing those games), and use that data to guide policy learning. This will allow an agent to learn to act without having to resort to random exploration, i.e. randomly wandering a world it does not understand in the hope that it might accidentally do something reasonable and then learn from that experience.
Eric then proposes his most concrete, forward-looking research agenda, which he calls “Just Ask for Generalization”: if we train an agent on enough of these games, and provide it high quality language understanding (perhaps even with just the LLMs of today), it should be able to generalize to a new domain and new tasks just by explaining the new scenario in language. Language-conditioned systems like DALL-E exhibit the ability to make images they have never seen before using novel language input (like this review’s header image), and we should be able to do things like this for robotics (e.g. Eric’s 2021 paper, BC-Z, which does language-conditioned manipulation and exhibits some zero-shot generalization). He then extends the idea: we can just ask the robot to act as though it were “conscious”, and if it does a good enough job, it’s effectively conscious. The same applies to alignment: by talking to the robot, and eventually it talking back and engaging in a dialogue, it will come to understand what we intend without needing the current ham-fisted strategies like RLHF. These concepts are a technically watered-down version of his blog post by the same name, where he suggests training an agent to imitate all kinds of quality actors and then asking for the optimal actor, mixed with his follow-up blog post where he suggests that the structure of language is the scaffolding that allows us to ask for any behavior.
The book then somewhat abruptly pivots to the real-world practicalities of “doing AI for robots”: physical robots are hard, building a good team is important and requires good cohesion, and there are a bunch of social implications at play. There are a lot of (in my opinion) scattered points made, but to me the most interesting ones were Eric’s opinions informed by his prior experiences:
The book ends by talking about social implications, deep fakes, UBI vs a costless utopia, and AI systems that understand beauty, but these topics felt too uninteresting to mention (maybe because I read about them every day on Twitter).
There’s a lot of ink spilled about AI. Pretty much all of it is garbage, ignorant of the past and untethered from the present state of research. This book is an exception. I think it does a great job of articulating the perspective of the portion of the robot learning community that believes in The Bitter Lesson (of which I consider myself a member1).
I am totally sold that we need our agents to solve sufficiently hard problems if we want them to learn higher-level reasoning abilities2; I don’t believe that “intelligent design” of every low-level component manually glued together is how we get to AGI (this is the “Boomer Robotics” way that Waymo, Cruise, Motional, Argo, etc. follow, although they have safety and regulatory concerns that home robots do not). But I also agree that full-sending end-to-end training on Jungle Basketball (an example I love) is obviously intractable. We need a “Weak” Bitter Lesson.
Eric proposes “intelligent design” by expert demonstrations: scaling behavior cloning by getting expert demonstrations out of a video game humans play for fun3. I think this is a great idea; this data should have rich information about long-horizon task execution, including free sparse rewards (game performance) and things like retries / self-resetting behaviors. However, I think that unless we develop far more precise and immersive VR experiences, these demonstrations will have to continue to rely on current-generation VR-style “grasping by stickiness” and will lack all of the rich contact information fundamentally needed for complex manipulation4, or they will have an enormous embodiment gap between the actions in-game and in the real world.
My research takes a complementary approach: “intelligent design” by curriculum. I think the problem of high-quality prediction of the world’s dynamics is a good intermediate stepping stone toward robots that can act and respond with agility. I also believe scene flow and adjacent problems are (probably?) the right problem setting to do this in. I’ve started this thread in the AV domain because of the number of publicly available datasets; I’ve recently built a LiDAR-based learned scene flow predictor that gets better with more raw data. It could be better (a better Dynamic SLAM teacher, multi-modal inputs, and shelf-supervised semantics are all in progress), but it lays the foundation for an entirely self-supervised representation learning problem that (hopefully) makes it easier to then learn to act in the real world across dynamic tasks in robotics at large. I don’t know which bet will be more fruitful5, and this is why it’s an open problem!
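For readers unfamiliar with the problem, here is a minimal sketch of the general self-supervised scene flow idea (this is not the ZeroFlow pipeline, which distills pseudo-labels from an optimization-based teacher; the tiny model, tensor names, and training loop below are purely illustrative): predict a per-point motion vector for the sweep at time t, warp the points forward, and penalize the nearest-neighbor (Chamfer-style) distance to the sweep at time t+1.

```python
# Illustrative sketch of a self-supervised scene flow objective: warp point
# cloud P_t by a predicted per-point flow and penalize the nearest-neighbor
# distance to P_{t+1}. Not the ZeroFlow pipeline; all names are hypothetical.
import torch
import torch.nn as nn

class TinyFlowHead(nn.Module):
    """Per-point MLP that maps a 3D point to a 3D flow vector."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, pts: torch.Tensor) -> torch.Tensor:  # (N, 3) -> (N, 3)
        return self.net(pts)

def chamfer_loss(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Symmetric nearest-neighbor distance between point sets a (N, 3) and b (M, 3)."""
    d = torch.cdist(a, b)                    # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

# Toy usage: two consecutive LiDAR sweeps (random stand-ins here).
p_t, p_t1 = torch.randn(2048, 3), torch.randn(2048, 3)
model = TinyFlowHead()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    flow = model(p_t)                        # predicted per-point motion
    loss = chamfer_loss(p_t + flow, p_t1)    # warped sweep should match the next sweep
    opt.zero_grad(); loss.backward(); opt.step()
```

Real pipelines need much more than this (ego-motion compensation, outlier handling, an actual architecture), but the sketch shows why more raw data helps: the supervision comes from the sensor stream itself.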
My understanding of the “Just Ask For Generalization” agenda was significantly sharpened by reading the two blog posts. It seems Eric wants to scale up Behavior Cloning with the special sauce of conditioning on policy “competence” (to allow for the inclusion of non-expert data) and on language (to provide task descriptions). This should result in a model that allows us to “just ask for generalization”, i.e. condition the model at test time on a novel language-specified task with performance at an “expert” level. I think realistically, unless given access to an enormous number of expert trajectories, this will result in a system that can only execute low-dexterity tasks. This is still a huge win — we don’t have such a system today and the value proposition of such a system is substantial6 — but I am skeptical we can get further than low dexterity without on-policy RL or a large number of expert trajectories.
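To make the “competence + language” conditioning idea concrete, here is a minimal, entirely hypothetical sketch of how I read the proposal (this is not Eric’s actual architecture, and every module and variable name below is made up): a behavior-cloned policy consumes an observation, a frozen text embedding of the task description, and a discrete competence token derived from how well the demonstrator actually performed; at test time you “just ask” for the best competence bin on a novel instruction.

```python
# Hypothetical sketch of "competence + language" conditioned behavior cloning.
# At train time the competence token reflects how good the demonstrator was;
# at test time we simply ask for the best bin.
import torch
import torch.nn as nn

NUM_COMPETENCE_BINS = 4   # e.g. quartiles of per-trajectory game score

class ConditionedPolicy(nn.Module):
    def __init__(self, obs_dim: int, lang_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.competence_emb = nn.Embedding(NUM_COMPETENCE_BINS, 32)
        self.net = nn.Sequential(
            nn.Linear(obs_dim + lang_dim + 32, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, lang_emb, competence_bin):
        z = torch.cat([obs, lang_emb, self.competence_emb(competence_bin)], dim=-1)
        return self.net(z)

# Training step: plain behavior cloning on (obs, language, competence, action) tuples.
policy = ConditionedPolicy(obs_dim=64, lang_dim=512, act_dim=7)
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
obs = torch.randn(32, 64)              # stand-in observations
lang = torch.randn(32, 512)            # stand-in frozen text-encoder embeddings
comp = torch.randint(0, NUM_COMPETENCE_BINS, (32,))
expert_act = torch.randn(32, 7)
loss = nn.functional.mse_loss(policy(obs, lang, comp), expert_act)
opt.zero_grad(); loss.backward(); opt.step()

# "Just ask for generalization": a novel instruction embedding plus the best bin.
best_bin = torch.full((1,), NUM_COMPETENCE_BINS - 1, dtype=torch.long)
action = policy(torch.randn(1, 64), torch.randn(1, 512), best_bin)
```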
Eric’s first blog post sketches out why he thinks non-expert trajectories can allow BC to potentially reach expert performance: literature like D-REX, Hindsight Experience Replay, and Watch-Try-Learn all provide evidence that we can get improved performance out of a dataset of non-expert trajectories. However, I’m not convinced this is a path to expert-level performance. My (weakly held) hunch is that the success of these approaches has more to do with addressing issues in the learning dynamics of the policy networks than with actually providing a true mechanism to reach expert generalization. Non-linear function approximators for value or reward modeling do not do optimal Bayesian updates — the Q values for states on even a simple gridworld are a mess, even for optimal policies. I suspect that many of these methods are data-centric ways to push these function approximators to represent semi-reasonable structure in the surrounding state neighborhood; replacing a horrendous prior with a mediocre one should enable more policy robustness and better generalization, but a mediocre prior is not expert-level generalization.
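To see the “Q values are a mess” point concretely, here is a toy sketch (illustrative only, not taken from any of the cited papers): compute exact optimal Q values on a 5x5 gridworld with value iteration, fit a small MLP to half of the (state, action) pairs, and inspect the held-out error; whatever the network predicts on the unseen pairs is an artifact of the architecture and optimizer, not a principled Bayesian update.

```python
# Toy sketch: exact optimal Q values on a 5x5 gridworld via value iteration,
# then a small MLP fit to half of them. Held-out error exposes the
# approximator's "prior" on unseen (state, action) pairs.
import random
import numpy as np
import torch
import torch.nn as nn

N, GAMMA = 5, 0.95
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right
GOAL = (N - 1, N - 1)

def step(s, a):
    if s == GOAL:                                # absorbing goal state
        return s, 0.0
    ns = (min(max(s[0] + a[0], 0), N - 1), min(max(s[1] + a[1], 0), N - 1))
    return ns, (1.0 if ns == GOAL else -0.1)     # sparse goal reward, small step cost

# Tabular value iteration for the exact optimal Q function.
Q = np.zeros((N, N, len(ACTIONS)))
for _ in range(500):
    for i in range(N):
        for j in range(N):
            for k, a in enumerate(ACTIONS):
                ns, r = step((i, j), a)
                Q[i, j, k] = r + GAMMA * Q[ns].max()

# Fit a tiny MLP to a random half of the (state, action) -> Q mapping.
coords = [(i, j, k) for i in range(N) for j in range(N) for k in range(len(ACTIONS))]
random.Random(0).shuffle(coords)
train, held_out = coords[: len(coords) // 2], coords[len(coords) // 2:]

def to_xy(cs):
    x = torch.tensor([[i / N, j / N, k / len(ACTIONS)] for i, j, k in cs], dtype=torch.float32)
    y = torch.tensor([[Q[i, j, k]] for i, j, k in cs], dtype=torch.float32)
    return x, y

net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
x_tr, y_tr = to_xy(train)
for _ in range(2000):
    opt.zero_grad()
    nn.functional.mse_loss(net(x_tr), y_tr).backward()
    opt.step()

x_ho, y_ho = to_xy(held_out)
print("held-out MAE:", (net(x_ho) - y_ho).abs().mean().item())
```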
I also believe there are fundamental limitations to language as a medium for dictating control. Rowing is not a complicated sport: put the oars in the water, tug on them, take them out of the water, reset, and do it again. But I’ve rowed for several years now and I’m at the point where it’s difficult for a coach to describe in natural language what minor adjustments need to be made to improve my form. These adjustments are on the level of tens of milliseconds and have to do with things like nuanced weight transfer to minimize deceleration on the stroke recovery. Recording a video and narrating over it helps, but translating a third-person view of what I need to do differently into where exactly in the embodied trajectory I need to make the change, and then actually doing it, is hard. In practice, my coach will give me language instructions and then ride next to me in a launch and give an RL-style reward of “yes” or “no” on every stroke to hone in on the changes. I think if we want high dexterity and precision on tasks, this sort of on-policy training from reward or BC from other experts will be mandatory.
To me, the most interesting part of this section was the thinly veiled commentary of the form “at Google things were broken in XYZ way so we’re doing the opposite”. This was amusing, but I also found the insights valuable; as an example, my group has an “arm on car” style robot, and we’ve been so preoccupied with getting it to do anything useful that I’ve not really considered the space of tasks its embodiment is unable to perform. I also think the point about no bake-offs and team cohesion is well made: ultimately, personalities are going to dominate and conflict in the face of philosophical disagreements, so you can avoid them altogether by picking aligned individuals from the beginning.
Eric also talks about the importance of writing code for “Research Engineering” vs Production. Google famously over-engineers their stuff; during my SWE internships there as an undergrad I was impressed by their cathedrals of build and test infrastructure, “advanced” design patterns, and thorough code reviews surrounding their core products. But that is also bloat that prevents you from moving fast. My understanding is that, at Google Research (now Google DeepMind), no code from the outside world can be easily plugged into Google systems: everything is written in Google frameworks (TensorFlow, JAX) which are more efficient to train with than PyTorch but a nightmare to debug. There’s significantly less external ecosystem support as a result, requiring more to be written from scratch. Code bases and core services are written by software engineers using these “advanced” design patterns, which makes crawling through the codebase as a researcher to debug or hack in changes a nightmare.
I think Eric’s experienced the far end of over-engineering at Google, but I will say there’s a huge risk in under-engineering your codebase. My sophomore year of undergrad I was a talented but not particularly seasoned software engineer, and I did most of the from-scratch development of the codebase for our RoboCup Small Size League team. This system was not well engineered, and that ultimately led to many issues down the line in terms of debuggability and hackability because we had not designed it well to begin with. I’ve also worked with over-engineered codebases (many ML model zoos these days fit the bill), and I’ve tried to strike a balance with the ZeroFlow codebase that I wrote from scratch with the plan to extend it into a hackable scene flow model zoo.
I also agree that robotics system evaluation is hard, both for regression testing and performance improvements. We had this issue with our RoboCup team; to avoid regressions we had a bunch of unit tests for our CI system, but inevitably we’d run the stack once every few days and go “oh no, XYZ basic functionality doesn’t work anymore”. Currently, robotics lacks any sort of unified functional eval benchmark, and my hope is that over time we can develop the complex infrastructure to be able to evaluate head-to-head various methods on tasks we know we care about (e.g. manipulation). A pipe dream of mine is that the Biden Executive Order setting up an AI Safety organization at NIST can be somehow spun into providing evaluation infrastructure for an ongoing manipulation benchmark.
The book’s title is “AI is Good for You”, which implies it has some insights into how AI will benefit you, the reader, and presumably society more broadly. To be honest, it doesn’t. Eric points out that automation should make goods and services cheaper, but he sort of ignores the counterpoint that the other shoe might finally drop and there will be large swaths of people who have no economic value beyond competing on price with commoditized automation to perform labor (a competition that, in the limit, ends very badly for the humans involved). He’s correct that the UBI math doesn’t work out, but I didn’t see any compelling description of what happens to all these people, and I don’t believe we’re going to transition to some post-capital utopia (fully automated communism?) where you can just work whatever job you want.
To be fair, no one actually knows what’s going to happen; I certainly don’t. But in a book that has such a strong and compelling vision for where we are going technically, and a title regarding the social implications, I was left disappointed. Charitably, it feels like these ideas were included for “completeness”; uncharitably, it feels a bit like clickbait (hardly the worst use of clickbait).
The ideas in this book are gold. The writing is plain. The organization needs some work. The typesetting is awful. This reads like a good rough draft of a book: basically everything is there (I think the “Just Ask For Generalization” chapter was a bit unclear), but concepts were introduced out of order (e.g. “A-Life” was discussed several times before “Artificial Life” was mentioned) or not at all (e.g. many readers who would benefit from this book have little to no understanding of entropy from an information theory perspective, let alone a working understanding).
I think this is unfortunate because the kinds of people who would benefit from this semi-technical presentation of ideas (management consultants, finance analysts, business strategists) are the exact kind of people who are turned off by bad organization, typos, section headers at the end of pages, and tables that have one word randomly floating on the next page (this happened twice). Science communication is hard, and unfortunately it selects more for the “communication” than it does the “science”. I think this book, if seriously cleaned up, has the opportunity to speak to an important intellectual class who drive business and policy decisions that impact our lives but are woefully misinformed by smooth-talking technical hacks or ideologues who have an axe to grind.
Update Dec 4th 2023: I have been informed that these editing issues are going to be fixed! This is very exciting, as I think it will significantly grow the audience reach for the content, which is golden.
I tried to title my recent unsupervised scene flow learning pipeline paper “ZeroFlow: The Bitter Lesson meets Scene Flow” but I got shutdown.↩︎
June 14, 2022, I tweeted “For the record: attention, scale, and a sufficiently hard problem is all you need”.↩︎
I pitched to my labmates a version of Overcooked, played in first person VR, where you have to put the ingredients in the proper place by manipulating them with your tracked hands. We agreed this would be a fun game and a useful data collection rig, but none of us were enthusiastic about the idea of doing game development.↩︎
I think RT-X is close to solving pick and place, and it’s time for the grasping community to work on problems with richer contact.↩︎
At present it feels like much of the robot learning community is betting on end-to-end learning from demonstrations driven by rewards. As a datapoint, the RSS 2023 Best Student Paper went to Teach a Robot to FISH: Versatile Imitation from One Minute of Demonstrations which uses no pretraining and only one minute of expert video demonstration to learn to do basic manipulation tasks. But to me, this feels like the opposite of The Bitter Lesson.↩︎
A robot that can load / unload the dishwasher or fetch raw ingredients at half the speed of a human but at 1/100th the cost is worth it for a lot of restaurants.↩︎