YITONG SUN

DeepMetricEye: Metric Depth Estimation in Periocular VR Imagery

Presented at ISMAR 2023

"DeepMetricEye reconstructs metric periocular depth from monocular VR eye-camera imagery, enabling clinically meaningful ocular measurement without adding dedicated depth hardware."

DeepMetricEye system overview for periocular VR depth estimation.

Virtual Reality (VR) technology has advanced rapidly, offering immersive experiences across applications including gaming, medicine, education, and training simulations. The recent re-imagination of the 'Digital Universe' concept has illuminated a compelling vision of globally interconnected interactions. Despite continual improvements in content quality, VR headsets still cause physiological discomfort, which substantially reduces usage duration. A significant proportion of unsatisfactory VR experiences can be attributed to digital eye strain (DES), dry eye, and visual impairment resulting from excessive artificial light stimulation from VR displays, as well as periocular swelling, increased intraocular pressure, and muscle displacement induced by the pressure exerted by the headset's face mask. However, these visual health issues have not yet received attention proportionate to the development of VR technology.

Recent VR headsets, increasingly equipped with eye-oriented monocular cameras, are designed to segment periocular feature maps, annotate the edge of the pupil, and detect gaze direction to enhance content interaction. While these methods offer a preliminary insight into eye activity during VR usage, they are insufficient for establishing connections with medical standards, for instance, the light stimulus calculation protocols and periocular condition medical guidelines, needed for meticulous visual health assessments and advanced user interaction studies. The fundamental issue lies in the inability of current methods to convert the segmented 2D relative feature annotations (such as pupil edge segmentation) into spatial metrics (pupil diameter), essential for strict standards. Proposed solutions, for instance, incorporating stereo cameras and depth cameras for metric size acquisition, present substantial challenges in terms of cost, computational power, battery life, and hardware design of VR headsets.

To convert 2D periocular feature annotations into 3D metric dimensions, we propose a framework that uses only an eye-oriented monocular camera, already present in various VR headsets, to estimate a measurable periocular depth map. The framework is built on a U-Net 3+ deep learning backbone that we re-optimised to estimate depth maps accurately while keeping processing demands light enough for VR deployment. To alleviate the difficulty in collecting facial data for training, we introduce a Dynamic Periocular Data Generation (DPDG) environment that leverages a small quantity of real facial scan data to generate thousands of synthetic periocular images and corresponding ground truth depth maps using Unreal Engine (UE) MetaHuman.
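As an illustration of how a metric depth map turns a 2D annotation into a physical size, the sketch below back-projects two pixel coordinates through a standard pinhole camera model and measures the distance between them. This is a minimal sketch under stated assumptions, not the paper's measurement procedure: the function names, the intrinsics (fx, fy, cx, cy), and the example values are all hypothetical, and a real eye camera would also need lens-distortion handling.

```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) into camera coordinates (pinhole model).

    depth is a metric depth map indexed as depth[row, col].
    fx, fy, cx, cy are camera intrinsics in pixels (assumed known).
    """
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

def metric_distance(p1, p2, depth, fx, fy, cx, cy):
    """Euclidean distance between two pixels, in the units of the depth map."""
    a = backproject(p1[0], p1[1], depth, fx, fy, cx, cy)
    b = backproject(p2[0], p2[1], depth, fx, fy, cx, cy)
    return float(np.linalg.norm(a - b))

# Hypothetical example: a flat surface 30 mm from the camera,
# with two annotated pupil-edge pixels 40 px apart on the same row.
depth_mm = np.full((200, 200), 30.0)
diameter_mm = metric_distance((80, 100), (120, 100), depth_mm,
                              fx=300.0, fy=300.0, cx=100.0, cy=100.0)
print(diameter_mm)
```

Without the depth value, the 40-pixel separation alone cannot be scaled to millimetres, which is the gap the depth map is meant to close.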

The main contributions of this study are as follows:

  • We introduce a lightweight depth estimation framework for VR headsets to reconstruct periocular depth maps. The aim is to provide features' metric size for light stimulus standards calculation and periocular condition monitoring.
  • Addressing the challenge of facial data collection, our DPDG environment, based on UE MetaHuman, generates thousands of periocular training images and depth maps from limited facial scans.
  • We evaluate our method's accuracy and usability with two tasks: 1) evaluating global precision of periocular area, and 2) assessing pupil diameter.
  • We have open-sourced the DPDG environment, the code and dataset for the depth estimation model, and all metadata from the experiments.
Pipeline from real human scan to UE MetaHuman reconstruction and real-versus-synthetic periocular image calibration.
Real-to-synthetic calibration pipeline for DPDG. A real participant is reconstructed from 3D scanning and fitted to a UE5 MetaHuman; a real VR headset periocular capture is then compared with a synthetic UE5 render. The block-wise MAE map guides iterative tuning of the UE cine-camera noise and IR point-light intensity so the synthetic data better matches the headset's monocular eye camera.
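A block-wise MAE map of the kind described in this caption could be computed roughly as follows. This is a hedged sketch only: the block size, the function name, and the assumption that both images are already aligned, single-channel, and the same size are illustrative choices, not details taken from the paper.

```python
import numpy as np

def blockwise_mae(real, synthetic, block=16):
    """Mean absolute error per non-overlapping block x block tile.

    real and synthetic are 2D arrays of identical shape (assumed aligned).
    Rows and columns that do not fill a whole block are cropped.
    """
    if real.shape != synthetic.shape:
        raise ValueError("images must have the same shape")
    h, w = real.shape
    hb, wb = h // block, w // block
    diff = np.abs(real.astype(np.float64) - synthetic.astype(np.float64))
    diff = diff[:hb * block, :wb * block]
    # Split into (hb, block, wb, block) tiles and average inside each tile.
    return diff.reshape(hb, block, wb, block).mean(axis=(1, 3))

# Hypothetical example: the synthetic render is too bright in the top-left tile only.
real_img = np.zeros((32, 32))
synth_img = np.zeros((32, 32))
synth_img[:16, :16] = 10.0
mae_map = blockwise_mae(real_img, synth_img, block=16)
print(mae_map)
```

Tiles with a high value point to image regions where the render settings still diverge from the real capture, which is the kind of signal an iterative tuning loop over noise and light intensity could use.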
Figure 2: Flowchart of the proposed depth estimation framework. a: The initial phase detects the open-eye state and gaze direction using the VR headset's API, and extracts from the video stream a sequence of periocular images consistent in open-eye position and gaze direction. b: The red channel of each extracted image is iteratively input to the depth estimation model, a lightweight, optimised U-Net 3+ variant with a 5-layer symmetrical encoder-decoder structure. The model omits shallow dense skip connections to diminish the negative impact of intricate details, such as pupils, eyelashes, and eyebrow regions, on the smooth transitions of the depth map, thereby prioritising deep semantics. The numbers indicate the depth dimensions of the tensors. c: The output depth maps undergo a two-standard-deviation outlier elimination and pixel averaging to produce d, the final periocular depth estimation.
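Step c can be sketched as below. The caption does not say whether the two-standard-deviation rule is applied per pixel or over whole maps, so this sketch assumes per-pixel statistics across the sequence of predicted maps; the function name and the example values are hypothetical.

```python
import numpy as np

def fuse_depth_maps(stack, k=2.0):
    """Fuse N predicted depth maps of shape (N, H, W) into one (H, W) map.

    Assumption: for each pixel, values further than k standard deviations
    from that pixel's mean across the sequence are discarded, and the
    remaining values are averaged.
    """
    stack = np.asarray(stack, dtype=np.float64)
    mean = stack.mean(axis=0)
    std = stack.std(axis=0)
    keep = np.abs(stack - mean) <= k * std
    kept_sum = np.where(keep, stack, 0.0).sum(axis=0)
    kept_count = keep.sum(axis=0)
    # Fall back to the plain mean if no value survives at a pixel.
    return np.where(kept_count > 0, kept_sum / np.maximum(kept_count, 1), mean)

# Hypothetical example: ten maps at 30.0, with one spurious value in one frame.
maps = np.full((10, 4, 4), 30.0)
maps[0, 0, 0] = 60.0
fused = fuse_depth_maps(maps)
print(fused[0, 0])
```

In this example the spurious value lies 27 units from a per-pixel mean of 33 with a standard deviation of 9, so it exceeds the two-standard-deviation bound and is dropped before averaging.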
Depth estimation comparison between DeepMetricEye and U-Net baseline models with ground truth, predicted depth maps, and pixel MAE surfaces.
Model comparison on a representative validation sample. DeepMetricEye is compared with four original U-Net and U-Net 3+ baselines using predicted depth maps and log-transformed pixel MAE surfaces. The paper reports that the proposed model achieves the lowest average depth error among the tested models (AbsRel 0.038 and RMSE 0.017) while keeping a 28.8M-parameter model size, with visibly lower error in measurement-critical regions around the pupil and cheekbone.
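For readers unfamiliar with the two metrics, the sketch below uses the definitions commonly used in monocular depth estimation: AbsRel as the mean of |prediction - ground truth| / ground truth, and RMSE as the root of the mean squared error. It assumes the paper follows these conventional definitions; the masking of non-positive ground-truth pixels and the toy values are illustrative choices and do not reproduce the reported numbers.

```python
import numpy as np

def abs_rel(pred, gt):
    """Mean absolute relative error over pixels with positive ground truth."""
    valid = gt > 0
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

def rmse(pred, gt):
    """Root mean squared error over pixels with positive ground truth."""
    valid = gt > 0
    return float(np.sqrt(np.mean((pred[valid] - gt[valid]) ** 2)))

# Toy depth maps (arbitrary units), purely for illustration.
gt_depth = np.array([[1.0, 2.0], [4.0, 5.0]])
pred_depth = np.array([[1.1, 1.8], [4.0, 5.5]])

print(abs_rel(pred_depth, gt_depth), rmse(pred_depth, gt_depth))
```

AbsRel is dimensionless, while RMSE carries the units of the depth map, so the two values are not directly comparable with each other.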
Dataset generation demo video
1,150 avatars generated from 120 real-human scans.
Examples of synthetic periocular MetaHuman renders and paired depth ground truth across four fields of view.
Appendix B examples from the DPDG environment. Across four headset field-of-view settings, each synthetic periocular render is paired with its depth ground truth, illustrating how headset-specific virtual cameras can generate aligned image-depth training pairs from MetaHuman avatars.
Examples comparing MetaHuman renders, depth ground truth, estimated depth maps, 3D surfaces, and pixel MAE.
Appendix C examples of estimated depth maps compared with ground truth. Each row shows the MetaHuman render, depth ground truth, DeepMetricEye estimated depth, 3D ground-truth and estimated surfaces, and pixel MAE. The examples show that most residual error concentrates around high-frequency eyelash and eyebrow details, while the broader periocular surface remains metrically coherent for downstream measurements.