Converting videos to spatial videos
February 23, 2024

Spatial videos are videos that incorporate 3D content, letting users feel present in the environment and providing a much more immersive viewing experience. Viewed on a device like the Apple Vision Pro, they look more real and make you feel like you were at the original scene. Earlier, we managed to convert 2D photos into spatial photos, as shown in the previous blog post, and we have now extended this to videos.

We found a few spatial converter implementations online, but didn't initially find any free ones. We were also pretty curious about how challenging it would be to build one of our own. So...

How do you convert 2D videos to spatial videos?

You take a regular 2D video and transform it into a stereoscopic 3D format that has been around for a while. More specifically, spatial videos on the Apple Vision Pro conform to the MV-HEVC video compression codec. For us, that means getting the depth map of each frame in the video using an off-the-shelf solution for monocular depth estimation and generating two rotated images from it to represent the left and right eye. Then, you stack the two photos on top of each other to create a new video made of stacked frames. With the new video, we leverage Mike Swanson's spatial converter to finish converting the stacked frames into the MV-HEVC format that the device understands. The device does the actual heavy lifting of partitioning the frames and sending the right frames to each eye.
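
To make that concrete, here's a minimal sketch of the core idea: shift each pixel horizontally by a disparity proportional to its depth, once per eye, then stack the two views. The helper name and the simple nearest-pixel warp are our illustration, not the exact code from our repo:

```python
import numpy as np

def stereo_pair(frame: np.ndarray, depth: np.ndarray, max_shift: int = 12) -> np.ndarray:
    """Build an over/under stereo frame from an (H, W, 3) image and an
    (H, W) depth map normalized to [0, 1], where 1.0 is closest to the camera."""
    h, w = depth.shape
    # Closer pixels get a larger horizontal offset between the two eyes.
    disparity = (depth * max_shift).astype(np.int32)
    rows = np.arange(h)[:, None]
    cols = np.tile(np.arange(w), (h, 1))
    # Sample each eye's image from horizontally shifted source columns.
    left = frame[rows, np.clip(cols - disparity, 0, w - 1)]
    right = frame[rows, np.clip(cols + disparity, 0, w - 1)]
    # Stack vertically: this is the "two photos on top of each other" frame.
    return np.vstack([left, right])
```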

The Story

We thought that creating a video would be super difficult, especially since we were unfamiliar with the MV-HEVC video format and didn't know which tools we'd need to stitch all the frames back together. We also anticipated compute being a problem: it already takes a while to generate a single spatial photo, so we expected a video to take much longer. However, the open source community is strong, and a lot of our work was offloaded to existing libraries.

This is what we started with: an iPhone video of a person skiing.

Then, we loop through each frame of the video and generate the stacked frames using a depth estimation model. We produce two images from the model's depth map, and each frame becomes two photos stacked on top of each other, like the following.

Single frame with left and right photo stacked on each other
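
In code, that loop looks roughly like the following, reusing the stereo_pair helper sketched above. The specific depth model is our choice for illustration, not necessarily what the repo ships with:

```python
import cv2
import numpy as np
from PIL import Image
from transformers import pipeline

# Any monocular depth model works; Depth Anything is one off-the-shelf option.
depth_estimator = pipeline("depth-estimation", model="LiheYoung/depth-anything-small-hf")

def stacked_frames(video_path: str):
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, bgr = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
        # The pipeline returns a PIL image of per-pixel depth.
        depth = np.asarray(depth_estimator(Image.fromarray(rgb))["depth"], dtype=np.float32)
        depth /= max(depth.max(), 1e-6)  # normalize to [0, 1]
        yield stereo_pair(rgb, depth)
    cap.release()
```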

Given the list of all these frames, we use MoviePy to produce a video. Then, we run the converter to encode it into the final spatial video!
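
Assembling the frames is only a few lines with MoviePy. The fps and filenames here are illustrative; in practice you'd match the source video's frame rate:

```python
from moviepy.editor import ImageSequenceClip

# Collect the over/under frames from the generator above.
frames = list(stacked_frames("ski.mov"))
clip = ImageSequenceClip(frames, fps=30)  # match the original video's fps
clip.write_videofile("ski_stacked.mp4", codec="libx264")
```

The resulting stacked-frame file is what we hand off to the spatial converter for the final MV-HEVC encode.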

Making it fast 🏎️

At first, we had a pretty naive implementation: we looped through each frame of the video, then ran the depth estimation to generate each of the stacked frames. This was just to see if it would work.

The initial runs took quite a while even for really short 10-second videos, so we immediately got annoyed by how slow it was and set out to improve the speed to get something usable in a more reasonable amount of time. Since we have access to multiple cores, we sped the pipeline up by parallelizing the work across all the processors in our laptops.

Our first attempt at speeding up the pipeline used Pool from the multiprocessing library. We sized the pool to the number of available CPUs, which gave us a speedup roughly equal to the number of cores. We did notice that our fans started blasting, but at least we got some good utilization out of our computers.
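
Roughly, the parallel version looks like this; the initializer trick loads the depth model once per worker process so it never has to be pickled. The structure and names are our sketch, not verbatim from the repo:

```python
from multiprocessing import Pool, cpu_count

_estimator = None

def _init_worker():
    # Load the depth model once per worker process instead of once per frame.
    global _estimator
    from transformers import pipeline
    _estimator = pipeline("depth-estimation", model="LiheYoung/depth-anything-small-hf")

def process_frame(rgb):
    import numpy as np
    from PIL import Image
    depth = np.asarray(_estimator(Image.fromarray(rgb))["depth"], dtype=np.float32)
    depth /= max(depth.max(), 1e-6)
    return stereo_pair(rgb, depth)  # the helper sketched earlier

if __name__ == "__main__":
    frames = [...]  # decoded RGB frames, as in the loop above
    with Pool(processes=cpu_count(), initializer=_init_worker) as pool:
        stacked = pool.map(process_frame, frames)  # map preserves frame order
```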

The processing still takes a while. For instance, a 12 second video with 691 frames takes roughly 5 minutes on a base M1 13" laptop, which works out to about 430 ms per frame...sad.

Future 🔮

There are still improvements to be made by utilizing GPUs, so we're planning on using something like Modal to speed up the process even more. From the published stats, it looks like we could drop depth inference to under 20 ms per frame on a V100 GPU. We'll play with this more and see where it goes!
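
We haven't built this part yet, but here is a rough sketch of the shape it might take with Modal's Python SDK as of early 2024. The GPU selection, model, and function layout are all assumptions on our part:

```python
import modal

stub = modal.Stub("spatial-video")
image = modal.Image.debian_slim().pip_install("torch", "transformers", "Pillow", "numpy")

@stub.function(gpu="any", image=image)  # request a GPU-backed container
def depth_map(rgb):
    import numpy as np
    from PIL import Image
    from transformers import pipeline

    # In a real version we'd cache the model across calls instead of reloading it.
    estimator = pipeline("depth-estimation", model="LiheYoung/depth-anything-small-hf", device=0)
    depth = np.asarray(estimator(Image.fromarray(rgb))["depth"], dtype=np.float32)
    return depth / max(depth.max(), 1e-6)

@stub.local_entrypoint()
def main():
    frames = [...]  # decoded RGB frames from the local pipeline
    # .map fans the frames out across containers and returns results in order;
    # the stereo pairs are then built locally, as before.
    stacked = [stereo_pair(rgb, d) for rgb, d in zip(frames, depth_map.map(frames))]
```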

Feel free to try this out at https://github.com/studiolanes/vision-utils if you want something off the shelf. Note that it's currently only compatible with Apple Silicon Macs.

To save you from having to clone our repo, install Poetry, and run the script, we'd love to build a way to easily upload a video and get back a spatial video. All you'll need to do is upload your asset and we'll return the transformed asset back to you. Shoot me an email at [email protected] if you want to know when it's out!