# Transcript: 4D Volumetric Video & Gaussian Splats for XR Creators

**Date:** March 13, 2026 · 10:30 PM  
**Session:** [4D Volumetric Video & Gaussian Splats for XR Creators](/sessions/2026-03-13/pp1162279-4d-volumetric-video-gaussian-splats-for-xr-creators)

## Summary

GeniusXR's founders demonstrate their journey from 360-degree video to advanced 4D volumetric capture using Gaussian splatting technology. The session showcases how they've reduced the barrier to entry from $8 million Microsoft systems with 300 cameras down to accessible 28-camera setups (and soon, smartphone capture), while dramatically improving quality and file sizes for training, entertainment, and virtual production applications.

## Topics

`volumetric capture` · `gaussian splatting` · `neural radiance fields` · `nerf` · `4d content` · `spatial computing` · `xr production` · `virtual production` · `6dof` · `digital twins`

## Key Takeaways

1. 4D content (animated 3D with time) is the next evolution after film, CG, and AI—enabling fully spatial, walkable experiences where viewers can move around captured performances from any angle.
2. Gaussian splatting with neural radiance fields (NeRF) has revolutionized volumetric capture: what once required 300 cameras and $8M can now be done with 28 cameras (soon just a smartphone), with better quality, smaller file sizes, and streamable web delivery.
3. The technology is already being deployed for practical applications: training simulations (600+ for Canadian government), medical education (surgery viewing from impossible angles), sports (UFC), and retail operations (Circle K store planning).
4. One capture enables multiple outputs: the same volumetric scan can be used for VR/AR experiences, web-based 3D viewers, traditional video renders, and pre-visualization for film/TV production—eliminating the need to choose formats upfront.
5. The future of content creation is converging: motion capture, 3D scanning, and volumetric filming are merging into single-camera systems that can create digital twins with full movement capability, democratizing access to spatial content creation.

## Full Transcript

We really wanted to figure out how to have real 3D in VR. This is me and Rich back in 2016 with our robot and the 360 camera from Samsung. We were on the engineering team, using our 360 camera and robot to create content at demos and events. However, that content was still three DOF—three degrees of freedom—meaning you could look side to side, up and down, and rotate, but you couldn't actually walk in the content. The goal was: can I walk in the content? How can I make content that's actually real, where I can put Rich in the scene and walk around him? That would be super cool. So we decided to get into it around 2017.
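To make the three-DOF versus six-DOF distinction concrete, here is a minimal sketch. The class and function names are hypothetical and only illustrate the idea: 3DOF content responds to head rotation, while 6DOF content also responds to the viewer's position.

```python
from dataclasses import dataclass


@dataclass
class Pose3DOF:
    """Orientation only: the viewer stays pinned to the camera position."""
    yaw: float    # look side to side, in degrees
    pitch: float  # look up and down, in degrees
    roll: float   # tilt the head, in degrees


@dataclass
class Pose6DOF(Pose3DOF):
    """Orientation plus position: the viewer can walk around the subject."""
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0


def apply_head_motion(pose, d_yaw=0.0, d_pos=(0.0, 0.0, 0.0)):
    """Return a new pose; translation is ignored for 3DOF content."""
    yaw = (pose.yaw + d_yaw) % 360.0
    if isinstance(pose, Pose6DOF):
        dx, dy, dz = d_pos
        return Pose6DOF(yaw, pose.pitch, pose.roll,
                        pose.x + dx, pose.y + dy, pose.z + dz)
    return Pose3DOF(yaw, pose.pitch, pose.roll)
```

Taking a step forward changes nothing in a `Pose3DOF` viewer, which is the limitation of 360 video described above; the same step moves a `Pose6DOF` viewer through the scene.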

We started doing lots of R&D, which brought us all the way to today, where I'm going to show you some amazing content. Here's footage from three years ago where we captured with cameras arranged in a circle. You see the lights turning on and off—that's what we call cross polarization to get the right lighting. You see a frame freeze of the action, then him in motion, then him placed in a different virtual background. So we go from filming him in the studio with 80 cameras to placing him in any virtual environment. This is possible with 6DOF.

This is one of our employees who's into yoga. We captured her in the studio, removed the background, and have her in full 3D doing the motion. Then we can place her in any environment. Technically, this is virtual production from the future. This is how we believe content creation is going to work. In today's world, you need many cameras. Let me show you the studio itself. This is our studio in Fort Lauderdale, outside Nova Southeastern University. It's 80 cameras, many RX0s, and 75 lights. We're capable of filming someone or a few people with synchronized videos.

What we get out of it is 80 videos of that person doing whatever—whether it's a surgery for simulation and education, someone jumping for sports, or someone acting for a movie. This is our CTO, Richard Destiné. This is from five years ago. We're capable of placing anyone in any environment we want. We also did an AR version. I decided not to pay YouTube, so this is me with the Apple Vision Pro. We put a miniature version of me in the Vision Pro, and I walked into the studio. It's meta: we have me as a hologram in the Vision Pro as mixed reality inside the studio, and you can place me everywhere you want.

This works everywhere, any medium. It works in video, augmented reality, virtual reality—anywhere you want, any engine: Unreal Engine, Unity, Houdini, any 3D-capable engine can take this content. Here we place her in an AI-generated background. We have code that takes your text or image and creates a full 3D background where you can walk around and place anyone in there. If you use your imagination, this is the future of content.

Going from mocap to volumetric: motion capture is capturing the motion of a person. We bring someone into a studio, do an A-pose scan, put it in Unreal Engine or Unity, rig it with dots, then have an actor do motion capture. Currently we have the biggest volume in the world. Using our software to create a 3D model of our studio and the Insta360 camera from our partner in China, we're able to walk around our studio and create this 3D version. You could have an avatar walking—think how this could be a game or a virtual visit.

Our scripts and code make it so the background can move, just like the human. As for the subject: today, if I wanted the yoga person doing different movements, I would scan her, create a 3D version, then do motion capture. But the future is all of this merging. With one camera system (just a few cameras; currently we're at eight, and very soon just one), I'll be able to film you doing movements and have you as a digital twin where I could do anything I want. Which is kind of scary, really.

Let's go technical. We're here to talk about the rise of 4D. What is 4D? 4D is technically a hype buzzword for animated 3D. You might get why we call it 4D: usually, for a 3D scan, I tell the artist to do an A-pose and I scan them, but I tell them not to move. There's no time. But in 4D we add time. Now I tell the subject, yeah, you can jump, you can talk, and I can capture your eyes, mouth, and teeth, which you usually can't do.
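One way to picture "3D plus time" is as a sequence of 3D frames played back at a fixed rate. The sketch below is only an illustration with hypothetical names (`Scan3D`, `Capture4D`); it does not describe the speakers' actual file format.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float, float]


@dataclass
class Scan3D:
    """A static scan: one set of points, no time axis."""
    points: List[Point]


@dataclass
class Capture4D:
    """An animated capture: one Scan3D per frame, played back at `fps`."""
    frames: List[Scan3D]
    fps: float

    @property
    def duration(self) -> float:
        """Clip length in seconds."""
        return len(self.frames) / self.fps

    def frame_at(self, t: float) -> Scan3D:
        """Return the frame shown at time t (seconds), clamped to the clip."""
        index = int(t * self.fps)
        index = max(0, min(index, len(self.frames) - 1))
        return self.frames[index]
```

A static A-pose scan is then just a `Capture4D` with a single frame; letting the subject move means storing a new frame for every time step.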

Here's the history: film, CG (computer graphics—the rise of the first science fiction movies, Pixar showed us what you could do), then AI about 10 years ago. Gen AI came through—today you can create content with AI. Now the next iteration is spatial content. Spatial means you can actually walk into content, full immersion. Just like in the real world where I can walk around, spatial content means you can walk around the content and feel like you're inside it. That's what 4D is: it records everything, realism and time.

We have this technology called volumetric capture. When Ubisoft and EA Sports decided to scan players for NBA 2K, they'd tell them to stand still and scan them. A few years later, companies decided to put computer vision cameras on players to capture movement. Volumetric capture uses many synchronized cameras to reconstruct the performance. However, it was very expensive: Microsoft's first stage was about $8 million. The quality wasn't good, file sizes were huge, and it was extremely hard to work with.

Richard and I, back in 2017, saw this at Microsoft and said, this is not how it's going to be done. There must be another way. The breakthrough was in 2021 with the invention of NeRF—neural radiance fields. It's a different way of calculating 3D files. Now with real images and regular cameras, we can create real 3D files. The quality is insane, much better.

This is Vicky, our employee, doing yoga. The capture on the left is from 2020, processed in 2021. On the right is processed in 2023. As you can see, it's day and night. You can see her hair, her jewelry. Microsoft couldn't even do jewelry. This is extremely good quality with just 28 cameras. We went from 300 cameras to 200 to 180. You can actually see the cameras reflected in his eyes. The most important part is lighting. It's like cinema. We treat it like cinema. We don't want to reconstruct Richard—we want Richard as he is with his real movements, eyes, nose, and hair.

In 3D they redo all this stuff. They'll have your face and body shape, then spend millions doing the rest. We said no, we don't want that. Can I just film it? Why is it so complicated? You can do both lighting approaches. In this case, we did full blast lighting on Richard so with AI we could redo the lighting. But you can do practical lighting. Say you want the shadow from this side—the director was there, the actor was there, everybody liked the shot, so we're good to go.

Here I'll show you our software and how it works. This background was created with AI. We typed 'create a gym' and it created a background. Once you have this, you can import a person or subject you scanned or filmed. In this case, it's our boxer. We can position them, relight them, make them move, render videos, make a web version, export for VR. We like to say: one capture for them all. You capture either the subject or the world around you.

One of our devs is looking at the preview. Anyone familiar with previsualization? Previs is used a lot in moviemaking. This is the ultimate previs tool for today. With this camera or just my phone, I could take a video here right now, make a 3D version in 15-20 minutes, put it in my software, pre-visualize it, place my camera where I want, then show my producers and directors. This is what I want to do. Then you know what? I changed my mind. I want to put them on a mountain. Let's see what that looks like.

I imported the mountain I created with AI—took about three or four minutes. I placed my subject on the mountain. You can see the XYZ plane, so everything is in 3D, spatial. Then I can render out this video and show my team to get approval of that shot. Say we wanted to bring the boxer onto a mountain—wouldn't it be better to confirm the shot before we go to the mountain? That's what we do.

From 3GS to 3DGS: Gaussian splatting. Usually in real 3D you have polygons, triangles. In this case we have Gaussian splats. If you go back, they're these splats that represent the scene: a bunch of lines and splats that, when you look from further away, look like a real image. Here's a project with UFC. They took a video of the inside of the octagon. With our software we're able to do this. The 3D version also has colliders. I can't go through, it blocks me, because there's a mesh that says there's something there.
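To show how a cloud of soft blobs can add up to something that looks like a real image, here is a heavily simplified sketch of splat blending for one pixel. It assumes the splats have already been projected to the screen and are isotropic (round); real 3D Gaussian splatting uses anisotropic 3D Gaussians with view-dependent color. The names `Splat2D`, `splat_alpha`, and `shade_pixel` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Splat2D:
    """One splat after projection to the screen (isotropic for simplicity)."""
    center: Tuple[float, float]        # pixel coordinates
    radius: float                      # standard deviation, in pixels
    color: Tuple[float, float, float]  # RGB in [0, 1]
    opacity: float                     # peak opacity in [0, 1]
    depth: float                       # distance from the camera


def splat_alpha(s: Splat2D, px: float, py: float) -> float:
    """Opacity of the splat at a pixel: a Gaussian falloff from its center."""
    dx, dy = px - s.center[0], py - s.center[1]
    return s.opacity * math.exp(-0.5 * (dx * dx + dy * dy) / (s.radius ** 2))


def shade_pixel(splats: List[Splat2D], px: float, py: float):
    """Blend splats front to back; nearer splats hide the ones behind them."""
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light still passing through
    for s in sorted(splats, key=lambda s: s.depth):
        a = splat_alpha(s, px, py)
        for i in range(3):
            color[i] += transmittance * a * s.color[i]
        transmittance *= 1.0 - a
    return tuple(color)
```

Each splat contributes most at its center and fades smoothly outward, so thousands of overlapping splats blend into a continuous image instead of showing hard polygon edges.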

These are Gaussian splats. We started with 3D Gaussian splats—one frame, a still picture. Today we do 4D Gaussian splats. We added time. 4D is a buzzword, but we like to use buzzwords to raise funding and do all kinds of stuff. Why does it matter? It's the future of content. When 20 or 30 years ago they said 3D is going to replace a lot of stuff and people didn't believe—well, it happened. Today, this is going to replace all formats of content creation. And it's not going to happen with 80 cameras. It's going to happen with your phone that has three cameras.

There's three cameras because it takes depth. Two cameras are cinematic, and the difference between them gives us information to create 3D. The third camera is infrared; it gives us the distance. So the iPhone is already ready for that. Going back to what I talked about: Microsoft and Metastage doing it the hard way with 300 cameras creating a mesh. This 3D file was huge and you didn't know how to work with it.
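The idea that the difference between two cameras gives depth can be sketched with the standard stereo relation for two parallel, rectified pinhole cameras: depth = focal length × baseline / disparity. The function name and the numbers in the usage note are illustrative assumptions, not iPhone specifications.

```python
def depth_from_disparity(focal_px: float, baseline_m: float,
                         disparity_px: float) -> float:
    """Estimate distance to a point seen by two side-by-side cameras.

    focal_px:     focal length expressed in pixels
    baseline_m:   distance between the two camera centers, in meters
    disparity_px: horizontal shift of the point between the two images
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive; zero means infinitely far")
    return focal_px * baseline_m / disparity_px
```

With a hypothetical focal length of 1000 pixels and a 1 cm baseline, a point that shifts 5 pixels between the two views sits about 2 m away. Nearby objects shift more and distant ones shift less, which is why a small baseline limits precision at range.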

In today's world, we skip the mesh part. We don't do a 3D file anymore. We do a Gaussian splat. We capture someone, do Gaussian reconstruction, then we have neural compression that makes this file very small. From there we can bring it into software like Unreal, Unity, or any other type. The file is small enough to stream over the internet. Let me pull up a streaming link. This is a sample we did for the Canadian government. We've done over 600 simulations for training. This is on the web, fully 3D and animating. He's moving and showing how to change a tire.
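A back-of-the-envelope estimate shows why the compression step matters for streaming. The per-splat layout below (position, scale, rotation, opacity, and 48 spherical-harmonic color coefficients stored as 32-bit floats) follows the original 3D Gaussian splatting formulation. The splat count, frame rate, and clip length in the usage note are assumptions chosen for illustration, and the speakers' own format is not described in the session.

```python
# Parameters stored per splat in the original 3DGS formulation:
# 3 position + 3 scale + 4 rotation (quaternion) + 1 opacity + 48 color terms
FLOATS_PER_SPLAT = 3 + 3 + 4 + 1 + 48
BYTES_PER_FLOAT = 4  # 32-bit floats


def raw_size_bytes(splats_per_frame: int, fps: float, seconds: float) -> int:
    """Uncompressed size if every frame stores a full, independent splat set."""
    frames = int(fps * seconds)
    return splats_per_frame * FLOATS_PER_SPLAT * BYTES_PER_FLOAT * frames


def to_gigabytes(size_bytes: int) -> float:
    """Convert bytes to decimal gigabytes."""
    return size_bytes / 1e9
```

Under these assumptions, `raw_size_bytes(200_000, 30, 10)` comes to roughly 14 GB for a ten-second clip, far too large to stream, which is the gap a compression stage has to close.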

This is the future of content. I'm not saying everyone will always play with their content like a game, but certainly kids on Roblox and Fortnite won't be watching content like we watch it. There's some ADHD there, but it's true. If you look at a kid right now, they have the TV on, social on, a game, a how-to. As parents we have to be responsible and limit that. But for training purposes, this is excellent. Can anyone tell me this is not good for training, either on a tablet or in your headset?

If I put this in a headset and you wear it, you're completely undistracted from the world. I'm showing you what to do. There's audio with it, six degrees of freedom. I could walk around this gentleman, pause it, play it. At NSU in our studio in Florida, we do this for medtech. What we realized is that students often can't see what's going on if they're looking from the other side. You guys have seen how they do surgery classes? You're sitting up here, the surgery is down there, you're only seeing one angle.

It would be great to replay it and say, let me look at the other angle, let me zoom in and see what's actually going on. Can I see an impossible shot where I usually don't get to see? UFC approached us for these same reasons. One capture for them all—it's your imagination. I can do multiple shots. Let me show an example. This is Richard's brother's gym. We said hey John, why don't we scan your gym? He's in the same building as us. This is a beautiful gym, you should show everyone. While I'm here, let me show you a big difficulty: we're able to scan mirrors and glass. This was not possible before.

In our software, we can decide the quality, how many pixels we want. We'll go to the max. We have the gym on the web. It's a web link you can share with anyone, and it works just like a game. I can go in and out. But what can I do with this for a moviemaker or someone into content? I would render out a video. We've rendered a video that goes from outside the gym to inside with titles and cool fast movements. The cool part is Richard's brother, who we invited to our studio. We told him, do some movements. Then we put him back in his gym.

This is John from the studio, and we put him back into his own gym. The background is real, he's real. All these camera movements were done after, not before. I could literally do anything I want with John in the gym. I could also place John on a mountain, like I did with the yoga person or the boxer. The inside of the gym was shot with this camera, but technically it could have been done with an iPhone. The difference is this is much quicker. With this we just walked around once. With an iPhone you'd have to do three or four levels, get everything.

In this case, because it has two LiDARs and four cameras, I can easily walk around. As for the mirrors: we're not reconstructing a mesh. Traditional mesh-based 3D reconstruction doesn't see reflection at all. Reflection kills it. With this new technology, neural radiance fields, we're able to have that mirror. We actually captured the reflection and created 3D with it. For measurements, you can imagine where this could be used almost everywhere: digital twins, industries, retail. We just signed with Circle K, where we're scanning store interiors to show employees where to put specials and posters. Simple, but it works.

I have this little tool for measuring. I can zoom into this section, click there, and measure...
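A point-to-point measurement in a scanned scene can be sketched as a straight-line distance between two picked 3D points. This assumes the scan is metrically scaled and that the tool reports Euclidean distance, neither of which the session confirms; `measure` and `units_per_meter` are hypothetical names.

```python
import math
from typing import Tuple

Point = Tuple[float, float, float]


def measure(p1: Point, p2: Point, units_per_meter: float = 1.0) -> float:
    """Straight-line distance between two picked points, in meters.

    units_per_meter converts scene units to meters (1.0 if already metric).
    """
    if units_per_meter <= 0:
        raise ValueError("units_per_meter must be positive")
    return math.dist(p1, p2) / units_per_meter
```

For example, `measure((0.0, 0.0, 0.0), (3.0, 4.0, 0.0))` gives 5.0 in a scene whose units are already meters.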

---

*Source: stt · Language: en · Model: anthropic/claude-sonnet-4-5*
