Interview with Colin Urquhart – CEO DI4D

It has been two decades since the inception of DI4D, and the facial animation industry has transformed dramatically. From early 3D scans to hyper-realistic graphics and AI, our CEO, Colin Urquhart, shares his thoughts.

How has animation developed over the past 20 years?

DI4D was founded 20 years ago, in 2003, and since then we’ve seen tremendous changes in the animation landscape. Many of these changes have been driven largely by the exponential increase in graphical power, with the processing capabilities of video game consoles improving by over 1,000 times. This has resulted in far greater graphical detail and realism. If you look at the graphical fidelity of early CGI movies, such as Final Fantasy: The Spirits Within (2001) and The Polar Express (2004), both were stunning for their time, but they were first surpassed by pre-rendered cinematics in video games and, more recently, by real-time graphics in video game engines themselves.

Performance capture, which translates the face and body motion of an actor into animation, was pioneered in the early CGI movies but has now become mainstream in video game production. Similarly, DI4D’s advanced 4D facial capture solution, which was used initially for specialized VFX in movies such as Blade Runner 2049 (2017), is now being used frequently in video games such as Call of Duty: Modern Warfare II (2022) and Call of Duty: Modern Warfare III (2023). Conversely, video game engines are now becoming so powerful and capable that they are being used increasingly in movies, e.g. for virtual production.

What lies at the heart of DI4D’s ethos?

DI4D was founded on the concept of using photogrammetry to accurately capture the 3D shape and appearance of real-life people to create their realistic “virtual clones”, or digital doubles as they have now become better known. We have subsequently extended this technology to accurately capture the facial performance of a moving actor and apply it faithfully to their digital double character. Our ethos is to capture the actor and their facial performance as faithfully as possible. By doing this, we can remove much of the subjectivity and time-consuming “polish” associated with traditional manual methods of facial animation.

Do you have any predictions for the next 10-20 years?

Graphical processing power will continue to increase exponentially over the next 20 years. It is mind-boggling to think that hardware for video games in 2043 might be 1,000 times more powerful than the current generation of video game consoles. The fidelity of graphics could become virtually indistinguishable from today’s live action. As a result, there will be even more convergence between what has traditionally been linear 2D entertainment, such as movies and television, and interactive 3D entertainment, such as video games. It is possible that all forms of entertainment, 2D or 3D, will be rendered in real time from 3D assets, opening up the potential for many new and exciting types of production.

Increasing graphical fidelity will make it even more difficult to create novel content artistically with sufficient detail and realism, so content creators may need to make even more use of 3D data captured or derived from real life. For example, it is likely that this will place even more emphasis on being able to accurately capture and faithfully reproduce the tiniest nuances of an actor’s performance, so that the animation derived from it does not fall into the uncanny valley.

How is AI affecting the animation industry more broadly?

Generative AI has certainly attracted a lot of hype recently across many industries, including the creative industries. There is no doubt that AI will continue to gain power and effectiveness for animation applications and will eventually reach a “plateau of productivity”, likely mitigating some of the challenges associated with the artistic creation of highly detailed assets. However, I believe its use will face significant IP and image-rights concerns about the source data, particularly when it comes to deriving realistic 3D human appearance and performance. DI4D uses machine learning as part of its PURE4D solution to reduce manual clean-up and increase efficiency. However, this is only ever trained on data from the specific actor whose performance is being analysed, avoiding any potential IP issues that could arise from mixing data from multiple people or projects.

Delivering a novel, emotionally engaging acting performance requires a talented human actor, usually reacting to, and engaging with, other actors in a scene. The essential human element of acting will be very challenging to replicate convincingly with AI-generated content. Therefore, we believe there will continue to be a very important role for human actors, and solutions to accurately record and faithfully reproduce their performances.

How does DI4D’s technology and service work?

We believe that the most efficient way to create highly realistic facial animation is to accurately capture the performance of an actor and then apply it faithfully onto a lifelike digital double of them. This removes the subjective step of transferring, or re-targeting, the facial animation from an actor onto a different character, which is susceptible to becoming uncanny. Therefore, most of our projects begin by acquiring super high-resolution 3D scans of all the actors, usually with a LightStage or a multi-camera photogrammetry system. These 3D scans are then re-topologized to produce a highly lifelike, but inanimate, digital double of each actor.
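To make the re-topologizing step a little more concrete, here is a deliberately simplified sketch, not DI4D’s actual tooling: it “wraps” a fixed-topology template mesh onto a raw photogrammetry scan by snapping each template vertex to its nearest scan point. Production retopology is far more sophisticated, but this shows why the resulting digital double keeps a consistent, animation-friendly topology while taking on the scanned actor’s shape.

```python
# Simplified illustration only (not DI4D's pipeline): wrap a fixed-topology
# template mesh onto a raw high-resolution scan by nearest-point projection.
import numpy as np
from scipy.spatial import cKDTree

def wrap_template(template_vertices, scan_points):
    """template_vertices: (V, 3) digital-double mesh with fixed topology.
    scan_points: (N, 3) dense points from the raw photogrammetry scan."""
    tree = cKDTree(scan_points)
    _, idx = tree.query(template_vertices)   # nearest scan point per vertex
    return scan_points[idx]                  # same topology, scanned shape
```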

The facial performance of the actors can then be captured either with our nine-camera DI4D PRO system or with a stereo head-mounted camera (HMC) system. Our DI4D PRO system comprises nine synchronized 12-megapixel machine vision cameras and can capture the facial performance of a single seated actor with very high fidelity. We process the synchronized multi-camera video data with our proprietary software tools: first we use photogrammetry to generate a 3D scan for every frame, then we accurately transfer the re-topologized scan mesh onto one frame of this 3D scan sequence and track it through the whole sequence using optical flow. The result is point cache animation, normally supplied in FBX format, that can be used to “direct drive” the re-topologized scan mesh. No facial rig is required, so the animation can faithfully re-create all of the nuanced shapes and transitions of an actor’s performance.
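As a rough illustration of the tracking step, and assuming a hypothetical optical-flow helper rather than our proprietary software, the sketch below propagates a re-topologized mesh through a per-frame scan sequence and stacks the results into a point cache.

```python
# Conceptual sketch (not DI4D's proprietary code): propagate a re-topologized
# mesh through a per-frame 3D scan sequence to build a point-cache animation.
# flow_fn is a hypothetical placeholder for dense optical-flow tracking.
import numpy as np

def propagate_mesh(anchor_vertices, scan_frames, flow_fn):
    """anchor_vertices: (V, 3) re-topologized mesh aligned to scan_frames[0].
    scan_frames: list of per-frame photogrammetry scans.
    flow_fn(prev_scan, next_scan, points) -> (V, 3) tracked vertex positions."""
    frames = [anchor_vertices]
    current = anchor_vertices
    for prev_scan, next_scan in zip(scan_frames[:-1], scan_frames[1:]):
        current = flow_fn(prev_scan, next_scan, current)  # track vertices forward
        frames.append(current)
    # Point cache: one XYZ position per vertex per frame, used to "direct
    # drive" the mesh without a facial rig (typically exported via FBX).
    return np.stack(frames)  # shape (F, V, 3)
```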

Our proprietary software can also be used to produce point cache animation from footage acquired with stereo camera HMCs. This offers the advantage of being able to capture the facial performances of multiple actors simultaneously, together with body mocap and voice recording, while they move around freely. This process, known as full performance capture, not only facilitates good acting performances but also affords a high degree of realism, because it ensures that the face animation, body animation and voice performance are exactly synchronized.
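The synchronization point can be illustrated with a small, purely hypothetical example of aligning takes from different capture streams against a shared timecode; the field names and numbers here are illustrative assumptions, not our actual data format.

```python
# Illustrative sketch: align face, body and audio takes to a shared timecode
# so that facial animation, body animation and voice stay in sync.
from dataclasses import dataclass

@dataclass
class Take:
    name: str
    start_timecode: float   # seconds from the shared sync source
    frame_rate: float       # frames per second

def frame_offset(take: Take, scene_start: float) -> int:
    """Number of frames to skip so this take starts exactly at scene_start."""
    return round((scene_start - take.start_timecode) * take.frame_rate)

face = Take("hmc_face_A", start_timecode=3600.04, frame_rate=60.0)
body = Take("mocap_body_A", start_timecode=3599.96, frame_rate=120.0)
print(frame_offset(face, scene_start=3600.10))  # e.g. 4 frames in
print(frame_offset(body, scene_start=3600.10))  # e.g. 17 frames in
```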

The limited resolution and number of cameras (two), and the small fisheye lenses used on HMCs, naturally result in lower fidelity facial animation than that derived from the higher resolution DI4D PRO system. Our new PURE4D solution addresses this by combining data acquired from both systems. First, a small amount of high-fidelity expression and phonetic pangram data for each actor is acquired with the DI4D PRO system. The acting performances are then acquired on the mocap stage using stereo HMCs. The point cache animation derived from the HMC performance of each actor is augmented with the higher fidelity DI4D PRO data. The final result is highly realistic and nuanced facial animation that remains faithful to the likeness and performance of the real actor.
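Purely as a hedged illustration of the general idea, and not the actual PURE4D algorithm, one simple way to think about augmenting lower-fidelity animation with higher-fidelity scans is to express each HMC frame as a least-squares blend of high-resolution expression shapes that share the same mesh topology.

```python
# Hedged sketch, not the PURE4D algorithm: rebuild each lower-fidelity HMC
# frame as a least-squares blend of high-fidelity DI4D PRO expression scans
# of the same actor, so the result carries the higher-resolution detail.
import numpy as np

def augment_frame(hmc_frame, pro_shapes):
    """hmc_frame: (V, 3) vertices from the HMC-derived point cache.
    pro_shapes: (K, V, 3) high-fidelity expression scans, same topology."""
    K, V, _ = pro_shapes.shape
    A = pro_shapes.reshape(K, V * 3).T          # (3V, K) basis of expressions
    b = hmc_frame.reshape(V * 3)                # target frame to approximate
    w, *_ = np.linalg.lstsq(A, b, rcond=None)   # blend weights per expression
    return (A @ w).reshape(V, 3)                # frame rebuilt from PRO shapes
```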