Next 5 to 10 years of AI

I’m currently doing a PhD in Machine Learning. It’s a broad and fast-moving field – some of the technology has more or less converged on well-understood approaches and is already heavily used in production (e.g. CNNs for traditional object detection), some of it is still very early days (e.g. most RL approaches), and some of it sits in the middle. I want to talk about this third category, and give examples of where I think the technology is going to move from early days to production-ready over the next 5 to 10 years.

Language-Conditioned Creative Tools

Large Language Models are shockingly impressive and have made me extremely bullish on the idea that, in the limit, scale and a sufficiently hard problem are all you need. It seems that language itself has some unique properties that caused LLMs to learn to generalize so well, and the resulting embedding spaces are extremely powerful tools, even for non-language or highly multi-modal tasks. On top of these, we’ve seen the very recent development of multiple extremely impressive language-conditioned image generation systems and multiple productized code generation tools built on LLMs. I think it’s clear that, over the next five to ten years, natural language is going to be the primary human interface for working with creative tools.

These domains have an important common property – the system need not make guarantees about its behavior, because a human is still in the loop, working to exploit the strengths and avoid the weaknesses of the system. This means bleeding-edge technology can be deployed much more rapidly, and thus we’re going to see hugely accelerated improvements to these tools as companies develop the art of deploying new systems in production. In the short term, we’re going to see the development of, and ultimately convergence on, standard UX patterns for working with these systems; an early example of this is prompt engineering.

In the art space, we’re going to see a massive shift in how artists do initial ideation and rapid development – these systems will allow for quick iteration on thousands of variants of design ideas. With today’s tools, the later parts of the design process are significantly harder, requiring repeated rounds of conditional image edits. A particular challenge is camera positioning – the camera angle can only be manipulated through high-level text direction, while artists want very fine-grained control over the full camera parameters. This motivates the integration of NeRFs as a scene representation output instead of raw pixels, giving artists potentially fine-grained control of not just camera angle but details of things like lighting and depth of field. It also motivates the need for a new set of artist tools that allow for easy manipulation of objects in NeRF scenes.
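To make this concrete, here’s a minimal sketch of why an explicit scene representation gives this control: the camera becomes a plain pose matrix plus intrinsics, every parameter of which a tool can expose directly. The render call at the end is a hypothetical stand-in for an actual NeRF renderer; the rest is standard NumPy.

```python
import numpy as np

def look_at_pose(eye, target, up=np.array([0.0, 1.0, 0.0])):
    """Build a 4x4 camera-to-world matrix from an eye point and a target."""
    forward = target - eye
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, up)
    right /= np.linalg.norm(right)
    true_up = np.cross(right, forward)
    pose = np.eye(4)
    # Camera looks down -z (the usual NeRF convention).
    pose[:3, 0], pose[:3, 1], pose[:3, 2] = right, true_up, -forward
    pose[:3, 3] = eye
    return pose

# Every camera parameter is a plain number an artist tool can put on a slider,
# rather than a property hoped for via a text prompt.
pose = look_at_pose(eye=np.array([2.0, 1.5, 4.0]), target=np.zeros(3))
intrinsics = {"focal_length_mm": 35.0, "aperture_f_stop": 1.8}  # depth of field
# image = render(nerf_scene, pose, intrinsics)  # hypothetical renderer call
```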

In the video game space, I expect we’re going to see the development of truly open-ended NPC dialogue and interaction using LLMs finetuned on gameworld lore, plus neural text-to-speech engines. An early version of this is AI Dungeon, a text-only game in which the user repeatedly looks around or takes basic actions, with GPT-3 acting as a generative dungeon master.
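As a rough sketch of how lore-conditioned dialogue could be wired up – the complete function below is a stand-in for whatever completion API a game would actually use (a hosted LLM or a locally finetuned model), and the prompt format is illustrative:

```python
def complete(prompt: str, stop: list[str]) -> str:
    """Stand-in for a real LLM completion call; returns a canned line here."""
    return "Aye, the old mill? Naught but ghosts there since the flood."

def npc_reply(lore: str, npc_name: str, history: list[str], player_line: str) -> str:
    """Ground the model in gameworld lore, then let it improvise the NPC's next line."""
    prompt = (
        f"Gameworld lore:\n{lore}\n\n"
        f"You are {npc_name}. Stay in character and consistent with the lore.\n"
        + "\n".join(history)
        + f"\nPlayer: {player_line}\n{npc_name}:"
    )
    return complete(prompt, stop=["\nPlayer:"])

# The returned line would then be fed to a neural text-to-speech engine.
line = npc_reply("The village mill burned down in the great flood of year 412.",
                 "Old Maren", [], "What happened to the mill?")
print(line)
```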

In the programming space, neural code generation tools are already commercially available and will be ubiquitous within a few years. I expect that for many feature-add-style programming jobs (e.g. React component development), programmers will learn to prompt code generation tools to get large swaths of code written quickly, with their skillset shifting towards reading the generated code to make sure it is semantically correct. For jobs that are less feature-add heavy, these tools will have less of an impact on code writing, instead providing new low-level editing features such as semantic find-and-replace/refactoring, as well as higher-level functions such as automatically detecting divergence between natural language documentation and actual code behavior.
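As a sketch of the documentation-divergence idea, here’s roughly what such a check might look like – ask_llm is a hypothetical stand-in for a query against a code-trained model:

```python
import inspect

def ask_llm(prompt: str) -> str:
    """Stand-in for an LLM call; a real tool would query a code-trained model."""
    return "DIVERGES: the docstring says inclusive, but range() excludes n."

def docs_match_behavior(fn) -> str:
    """Ask whether a function's docstring still describes what its body does."""
    source = inspect.getsource(fn)
    prompt = ("Does the docstring accurately describe the behavior of this code? "
              "Answer MATCHES or DIVERGES with a one-line reason.\n\n" + source)
    return ask_llm(prompt)

def count_up_to(n):
    """Return the numbers from 1 to n inclusive."""
    return list(range(1, n))  # bug: excludes n

print(docs_match_behavior(count_up_to))
```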

Robots That Can Do Open-Ended Tasks

Again, in the limit, I think scale and a sufficiently hard problem are all you need to build generally capable agents (the Bitter Lesson). I believe that large monolithic policies are going to lead to the best general performance, but that is still far from being computationally feasible – we still lack even generally available diverse datasets to do this, due to their storage requirements, let alone compute systems that can handle training on that much data.

Instead, we’re going to build out a traditionally architected robot control stack, but with many of the classical cognitive components replaced with learned ones. This architecture avoids the compute issues mentioned above, but also allows operators to make business guarantees, by making it easier to reason about the capabilities of a given system and to repair specific poor behaviors – it’s easier to isolate and iterate on a single set of components than it is to address systematic issues in a large monolithic policy. In terms of which components will be replaced, it will be the higher-level reasoning systems first – while localization/mapping is not perfectly solved, well-tuned modern SLAM stacks are quite robust even in difficult environments. Instead of being replaced, these classical systems will be augmented by components such as learned traversability estimates and weak global signals like non-metric maps. However, the classical symbolic high-level action planning systems are likely going to be entirely replaced by queryable offboard LLM planners that orchestrate the individual policies – we still don’t know how to ground classical symbols, and with the development of LLMs it appears we won’t ever have to. Instead, we just need latent representations that a neural planner such as an LLM can understand and use to plan; the future will be about improving planning quality, especially over long horizons, plus expanding the capabilities and robustness of the underlying individual policies.
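A minimal sketch of this planner/policy split, in the spirit of systems like SayCan: plan_with_llm is a hypothetical stand-in for the actual offboard model query, and the three skills are illustrative learned policies.

```python
# Fixed library of learned low-level policies, each exposed by name.
POLICY_LIBRARY = {
    "navigate_to": lambda target: print(f"[policy] navigating to {target}"),
    "pick": lambda obj: print(f"[policy] picking up {obj}"),
    "place": lambda loc: print(f"[policy] placing object at {loc}"),
}

def plan_with_llm(task: str, skills: list[str]) -> list[tuple[str, str]]:
    """Stand-in for an offboard LLM query that returns (skill, argument) steps;
    a real system would prompt the model with the skill list and the task."""
    return [("navigate_to", "kitchen"), ("pick", "mug"),
            ("navigate_to", "desk"), ("place", "desk")]

def execute(task: str) -> None:
    for skill, arg in plan_with_llm(task, list(POLICY_LIBRARY)):
        POLICY_LIBRARY[skill](arg)  # each step runs a learned low-level policy

execute("bring the mug from the kitchen to my desk")
```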

In terms of these underlying individual policies, I expect to see significant progress from very specific, brittle capabilities to more robust, multi-objective capabilities. Today, we mostly build single-task policies – historically these tended to be hand-crafted; more recently, they are typically learned. Learned policies have traditionally come from online RL approaches (sometimes bootstrapped in sim, sometimes randomly initialized, and then tuned in the real world), but more recently we’ve developed offline approaches that can use fixed datasets collected from human teleoperation to bootstrap a policy (e.g. behavior cloning or offline RL). Where we have decent simulators, the use of simulation will dramatically increase as we further close the sim-to-real gap. For low-contact, low-sensing domains the sim-to-real gap is already almost closed (see Marco Hutter’s ICRA 2022 keynote, where he live-trained a walking gait from scratch in sim on his laptop and deployed it directly on a real robot), and it’s likely that advancements in simulation technology (e.g. differentiable simulation) will lead to significant improvements in higher-contact, higher-sensed domains. Where we don’t have decent simulators, I expect improvements in the sample efficiency of behavior cloning and offline RL will lead to better quality pretrained policies that can be few-shot finetuned on the target task.
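For concreteness, here is a minimal behavior cloning sketch – the network shape is arbitrary and the random tensors stand in for a logged teleoperation dataset:

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 32, 7  # e.g. proprioception in, joint targets out
policy = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                       nn.Linear(256, 256), nn.ReLU(),
                       nn.Linear(256, act_dim))
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

# Placeholder "dataset": a real pipeline would load logged robot trajectories.
obs = torch.randn(10_000, obs_dim)      # operator-seen observations
actions = torch.randn(10_000, act_dim)  # the operator's actions

for step in range(1_000):
    idx = torch.randint(0, len(obs), (256,))
    loss = nn.functional.mse_loss(policy(obs[idx]), actions[idx])  # imitate
    opt.zero_grad()
    loss.backward()
    opt.step()
```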

However, by default these paradigms still require a human-engineered reward signal – something external needs to tell the policy what to do, and how good a job it’s doing, in a task-specific way. By comparison, the problem of simply reaching goal states (which can encode the configuration of the environment) is entirely self-supervised, and even in offline settings this can result in a better quality policy than the one used to collect the data. This formulation is exciting because in principle it enables policies to train indefinitely on increasingly difficult tasks (reaching further and further away goals), produces policies that are arbitrarily steerable (at deploy time, simply give the policy the state you want to achieve), and can be run either in simulation or on large offline datasets. However, significantly more work is needed to make these policies robust over long horizons (50+ policy steps) and in the large, natural state spaces of most tasks, and to find ways to convert high-level goal descriptions in natural language or image form into goal states.
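A sketch of why goal reaching is self-supervised: any state a trajectory actually visited later on can be relabeled as “the goal” for an earlier step (hindsight relabeling), so training tuples come for free from any data source, with no hand-engineered reward.

```python
import random

def relabel(trajectory):
    """trajectory: list of (state, action) pairs from any data source.
    Yields (state, goal, action) tuples for training a goal-conditioned
    policy, with each goal drawn from the trajectory's own future."""
    for t in range(len(trajectory) - 1):
        state, action = trajectory[t]
        future = random.randrange(t + 1, len(trajectory))
        goal = trajectory[future][0]  # a state the data actually reached
        yield state, goal, action

# At deploy time the same policy is steered by conditioning on the desired
# state: action = policy(state, goal).
demo = [((0, 0), "right"), ((1, 0), "right"), ((2, 0), "up"), ((2, 1), None)]
for state, goal, action in relabel(demo):
    print(state, "->", goal, ":", action)
```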

I also expect to see significant improvements in vision embeddings for visual policies, enabling much better downstream planning. A ResNet pre-trained on ImageNet works, but unsurprisingly, more advanced embeddings trained on more in-distribution data with simple inductive biases lead to significant improvements in downstream policy performance. I expect that many of the advancements in the vision literature that have been left out due to compute constraints will start to contribute significant performance improvements to state-of-the-art visual RL approaches, and we may even find a way to efficiently break out of the current paradigm of training a vision embedding, freezing it, and then training a policy on top.
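For reference, a sketch of the current paradigm that last sentence describes – an off-the-shelf ImageNet-pretrained ResNet is frozen and only a policy head is trained on top (the head and dimensions are illustrative):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

encoder = resnet18(weights="IMAGENET1K_V1")
encoder.fc = nn.Identity()      # expose the 512-d embedding
for p in encoder.parameters():
    p.requires_grad = False     # frozen: gradients stop at the embedding
encoder.eval()

act_dim = 7
policy_head = nn.Sequential(nn.Linear(512, 256), nn.ReLU(),
                            nn.Linear(256, act_dim))
opt = torch.optim.Adam(policy_head.parameters(), lr=3e-4)  # head only

images = torch.randn(8, 3, 224, 224)  # placeholder camera frames
with torch.no_grad():
    z = encoder(images)               # embeddings computed without gradients
actions = policy_head(z)              # only this path is being learned
```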

As alluded to above, these policies need not be trained to perform only a single task – typically this is achieved through training with a multi-task reward or through general goal conditioning, and it results in better overall performance, but the result is still a single policy composed of a single neural network. This raises difficulties once you start running a fleet of robots, especially a heterogeneous fleet with entirely different form factors. You either have to train or adapt a new set of skills for each robot, or you have to jointly train all your policies to be competent across all of your robots, requiring significant data to update every skill with the introduction of each new robot (and possibly introducing performance regressions); neither of these is practically feasible for substantive fleets of robots. Functional composability fixes this by learning a latent task policy that is composed with a task-portable, robot-specific neural module. This leads to significantly better sample efficiency and overall performance, and will likely be necessary for supporting multiple versions of a fleet of robots.
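A sketch of what functional composability might look like in code – one latent task policy shared across the fleet, with a small robot-specific adapter per embodiment (all dimensions here are illustrative):

```python
import torch
import torch.nn as nn

LATENT = 64  # shared latent observation/action interface

class RobotModule(nn.Module):
    """Robot-specific adapter: maps its own obs/action spaces to the shared latent."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.encode = nn.Linear(obs_dim, LATENT)
        self.decode = nn.Linear(LATENT, act_dim)

# One task-level policy, reused across every embodiment.
task_policy = nn.Sequential(nn.Linear(LATENT, 256), nn.ReLU(),
                            nn.Linear(256, LATENT))

arm = RobotModule(obs_dim=24, act_dim=7)         # 7-DoF manipulator
quadruped = RobotModule(obs_dim=48, act_dim=12)  # legged platform

def act(robot: RobotModule, obs: torch.Tensor) -> torch.Tensor:
    return robot.decode(task_policy(robot.encode(obs)))

# Adding a new robot means training only its adapter against the shared task
# policy, rather than re-training every skill fleet-wide.
print(act(arm, torch.randn(1, 24)).shape)        # torch.Size([1, 7])
print(act(quadruped, torch.randn(1, 48)).shape)  # torch.Size([1, 12])
```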