A New Frontier for Generative Models: Learning from Our World
We’ve raised an $18M Series A led by EQT Ventures to train generative world-building models for film and gaming, fueled by a new source of training data.
Generative worlds for film and gaming
The best films or games transport you to epic worlds. This is why world-building—real or fictional—is at the heart of our work at Odyssey. We’re training a new generative model that enables you to generate cinematic worlds, with 3D control over scenery, characters, lighting, and motion. Once you’ve generated your world, you direct it, and then capture cinematic sequences within it. We believe an advanced generative world-building model will unlock a better way to create film, games, and more.
Fueling this generative model requires significant capital and, crucially, a novel source of training data. We're thrilled to announce we’ve closed an $18M Series A round led by EQT Ventures, with additional backing from repeat investors including GV, Air Street Capital, and others. Accelerated by this fundraise, we're equally excited to share how we’re training generative world-building models on a source of 3D data that’s all around us, but incredibly difficult to capture. Read on to learn more.
What self-driving cars taught us
Jeff and I founded Odyssey after spending a combined two decades building self-driving cars. In fact, more than 90% of our technical staff worked on advancing self-driving cars for large portions of their careers, at companies like Cruise, Wayve, Waymo, and Tesla. This experience gives us a unique perspective on the problem of world-building models.
Making a car drive at safety levels exceeding those of a human driver ultimately boils down to training models capable of safely navigating and reasoning through a dynamic 3D world. Training such models requires the large-scale collection of real-world driving interaction data, captured in high fidelity with multiple, complementary sensors. You can generate as much synthetic data as you’d like, but it doesn’t come close to the value encapsulated in the observations of millions of real-world humans interacting with one another in every possible scenario, across every city and weather condition. Models trained on these large-scale, multi-sensor, real-world datasets have now enabled fully driverless cars.
With this learning fresh in our minds when founding Odyssey, we set out to build generative models inspired by breakthrough self-driving technology, and to train them on similarly large-scale, multi-sensor, real-world datasets. However, instead of training models to navigate 3D worlds, we needed to train models to generate them from scratch.
We needed to answer a key question: how should we collect real-world data? We knew that a single self-driving car travelling down a city street at rush hour could capture hundreds of thousands of the 3D details that comprise our world—more than sufficient to power a generative model capable of building worlds. While collecting data with a car works perfectly for training a self-driving car, it is always going to be limited to wherever a car can travel. This isn't ideal for a generative model that aims to generate anything you can imagine, anywhere. It means missing out on the forests, caves, trails, beaches, glaciers, parks, and architectural masterpieces that really make our planet diverse, vivid, and alive.
To solve this, we’re collecting real-world 3D data with a form factor that has explored our planet for many thousands of years: the human body.
We’re exploring and capturing the world
To that end, we’re deploying an advanced data capture system that can collect data just about anywhere a human can reach. It’s a lightweight computer-in-a-backpack attached to incredibly high-resolution, multimodal sensors. The device weighs 25 pounds, has a long battery life, and features 6 cameras, 2 lidars, and an IMU. Combined, these sensors capture our world in 360 degrees at a detailed 13.5K resolution, with physics-accurate depth information included alongside each panoramic capture. What’s more, since a human is in precise control of the sensors, they can ensure that every angle our generative models might find interesting is captured. Think Google Street View for everywhere that cars don't drive.
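To give a rough sense of the shape of this data, here’s a minimal sketch of what a single capture record from a rig like this could look like. The field names, shapes, and units below are illustrative assumptions for explanation, not our actual schema.

```python
# A minimal, illustrative sketch of one multi-sensor capture record.
# Field names, shapes, and units are assumptions for explanation only.
from dataclasses import dataclass

import numpy as np


@dataclass
class CaptureFrame:
    timestamp_ns: int                    # capture time in nanoseconds
    camera_images: list[np.ndarray]      # 6 images, each (H, W, 3) uint8
    lidar_sweeps: list[np.ndarray]       # 2 point clouds, each (N, 3) xyz in metres
    imu_accel: np.ndarray                # (3,) accelerometer sample, m/s^2
    imu_gyro: np.ndarray                 # (3,) gyroscope sample, rad/s
    pose_world_from_rig: np.ndarray      # (4, 4) homogeneous pose of the backpack rig
    panorama_depth: np.ndarray           # per-pixel depth aligned to the 360° panorama
```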
With this device—built by our partners at Mosaic, an optical imaging leader—and powerful post-processing algorithms, we’re now able to capture the fine details that make up our world. This rich 3D data, with examples in the video below, fuels our generative models to new heights.
Pioneering a new 3D representation
Equally important as the 3D data itself is how you teach a model to learn from it. Computer vision and graphics engineers have been building high-fidelity mesh representations via photogrammetry for years. Polygon meshes are an excellent representation for modelling hard 3D surfaces, and are by far the most efficient representation for hardware-accelerated rendering. There is also an incredible ecosystem of renderers, editing tools, and more built around them. Meshes are responsible for the astounding computer graphics we have come to demand in film and games. However, mesh representations struggle to model certain things (e.g. hair, vegetation), and do not fit well within current generative machine learning architectures.
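For readers less familiar with graphics, a polygon mesh is ultimately just arrays of vertex positions and integer face indices, which is exactly why it maps so well onto GPU rasterisation and existing editing tools. A generic, minimal example:

```python
# A generic triangle mesh: the unit square on the ground plane, built from
# explicit vertex positions and integer triangle indices.
import numpy as np

vertices = np.array([
    [0.0, 0.0, 0.0],   # corner 0
    [1.0, 0.0, 0.0],   # corner 1
    [1.0, 1.0, 0.0],   # corner 2
    [0.0, 1.0, 0.0],   # corner 3
], dtype=np.float32)

faces = np.array([
    [0, 1, 2],         # first triangle
    [0, 2, 3],         # second triangle
], dtype=np.int32)     # two triangles tile the hard, flat surface
```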
Neural Radiance Fields (NeRFs) such as ZipNeRF have hugely improved the quality of photorealistic reconstruction for the problem of novel view synthesis. These methods build an implicit representation of the scene given multiple image captures. However, NeRFs do not enable editing, nor do they fit well with generative machine learning today.
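Concretely, in the original NeRF formulation (which methods like ZipNeRF build on), a neural network maps a 3D position and viewing direction to a density σ and colour c, and each pixel is rendered by volume rendering along its camera ray r(t) = o + td:

```latex
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t),\mathbf{d})\,dt,
\qquad
T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)
```

Because the scene lives entirely inside the network weights, there is no explicit surface or primitive to select and edit, which is exactly the limitation noted above.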
Gaussian Splatting solves the same problem as NeRFs, but uses explicit Gaussian primitives rather than an implicit representation. This increases both the speed and the quality of rendering. And, because splats are an explicit 3D representation, they are inherently editable (albeit with limitations). Splats excel at modelling things meshes struggle with, for example a cat’s fur. Although splats can be integrated within a generative machine learning framework, they are less effective at modelling certain surfaces, for example the large flat surfaces of a table or a spaceship. There are also limitations in how splats model light transport.
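For reference, the standard 3D Gaussian Splatting renderer composites the depth-sorted, projected Gaussians with front-to-back alpha blending, so every primitive is an explicit, addressable object in the scene:

```latex
C = \sum_{i=1}^{N} \mathbf{c}_i\,\alpha_i \prod_{j=1}^{i-1} \left(1 - \alpha_j\right)
```

Here c_i is the colour of the i-th Gaussian along the ray, and α_i comes from evaluating that Gaussian’s projected 2D footprint at the pixel, scaled by a learned opacity.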
Furthermore, neither NeRFs nor splats enable effective relighting. This limits their utility for direct editing: inserting objects or manipulating lighting conditions can’t be done without visual artifacts.
All of this tells us that something is missing: a 3D representation able to unify graphics, machine learning, learning from real-world data, and editability. At Odyssey, we’re focused on pioneering this unified way to learn 3D. We're combining the inherent editability and speed of meshes, the ability of splats to model complex photo-real appearance, and the real-world learning capabilities of NeRFs. We're developing a representation that integrates with existing 3D tooling for direct editing, fits natively into diffusion transformer pipelines, encodes both dynamic and static scenes, and scales computationally to achieve the highest visual fidelity.
Our first stop: California
Today, it's impossible to teach a car to safely drive itself on the complex streets of San Francisco or London without large volumes of high-quality, diverse, real-world 3D data. Similarly, we think it will be impossible for generative models to generate Hollywood-grade worlds that feel alive without training on a vast volume of rich, multimodal, real-world 3D data. With fresh capital, we’re now excited to be scaling up our data collection operations in California, before expanding into other states and countries. We’re early on this journey and are actively hiring researchers to work on it. If this problem and approach sound interesting to you, and you’re based in either the Bay Area or London, please get in touch.