Evaluating the generalizability of video-based assessment of intraoperative surgical skill in capsulorhexis

Zhiwei Gong; Bohua Wan; Jay N Paranjape; Shameema Sikder; Vishal M Patel; S Swaroop Vedula

doi:10.1007/s11548-025-03406-0

Int J Comput Assist Radiol Surg. 2025 May 22. doi: 10.1007/s11548-025-03406-0. Online ahead of print.

Authors

Zhiwei Gong^#¹, Bohua Wan^#², Jay N Paranjape³, Shameema Sikder¹,⁴, Vishal M Patel³, S Swaroop Vedula⁵

Affiliations

¹ Malone Center for Engineering in Healthcare, Johns Hopkins University, Baltimore, MD, 21218, USA.
² Department of Computer Science, Johns Hopkins University, Baltimore, MD, 21218, USA.
³ Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, 21218, USA.
⁴ Wilmer Eye Institute, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA.
⁵ Malone Center for Engineering in Healthcare, Johns Hopkins University, Baltimore, MD, 21218, USA. swaroop@jhu.edu.

^# Contributed equally.

PMID: 40405033
DOI: 10.1007/s11548-025-03406-0

Abstract

Purpose: Assessment of intraoperative surgical skill is necessary to train surgeons and certify them for practice. The generalizability of deep learning models for video-based assessment (VBA) of surgical skill has not yet been evaluated. In this work, we evaluated one unsupervised domain adaptation (UDA) method and three semi-supervised domain adaptation (SSDA) methods for improving the generalizability of models for VBA of surgical skill in capsulorhexis, training on one dataset and testing on another.

Methods: We used two datasets, D99 and Cataract-101 (publicly available), and two state-of-the-art models for capsulorhexis. The models include a convolutional neural network (CNN) to extract features from video images, followed by a long short-term memory (LSTM) network or a transformer. We augmented the CNN and the LSTM with attention modules. We estimated accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve (AUC).
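
The four metrics named above can be illustrated with a minimal sketch. It assumes a binary skill label, a model score per video, and a decision threshold of 0.5; the function name `binary_metrics`, the choice of which class counts as positive, and the toy data are all hypothetical and are not taken from the paper.

```python
def binary_metrics(y_true, y_score, threshold=0.5):
    """Accuracy, sensitivity, specificity, and AUC for binary labels.

    Assumption: label 1 is the positive class; the abstract does not say
    which skill class the authors treated as positive.
    """
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # AUC as the probability that a randomly chosen positive case scores
    # higher than a randomly chosen negative case (ties count as half).
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)

    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "auc": wins / (len(pos) * len(neg)),
    }


# Toy example: six videos with made-up labels and scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]
metrics = binary_metrics(labels, scores)
```

Accuracy, sensitivity, and specificity depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is one reason to report it alongside the other three.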

Results: Maximum mean discrepancy (MMD) did not improve the generalizability of the CNN-LSTM model but slightly improved that of the CNN-transformer model. Among the SSDA methods, Group Distributionally Robust Supervised Learning improved generalizability in most cases.
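
For readers unfamiliar with MMD, the quantity can be sketched as a distance between the feature distributions of two datasets. The example below computes a biased estimate of squared MMD with a Gaussian (RBF) kernel. The kernel choice, the `gamma` value, and the function names are assumptions made for illustration; the abstract does not describe how the authors configured or applied MMD.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel between two feature vectors (assumed kernel)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


def mmd_squared(source, target, gamma=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors.

    MMD^2 = E[k(s, s')] + E[k(t, t')] - 2 E[k(s, t)]
    """
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))


# Toy features: identical sets give zero, shifted sets give a positive value.
source_feats = [[0.0, 0.0], [1.0, 0.0]]
shifted_feats = [[5.0, 5.0], [6.0, 5.0]]
same = mmd_squared(source_feats, source_feats)
shifted = mmd_squared(source_feats, shifted_feats)
```

In unsupervised domain adaptation, a term of this kind is typically added to the training loss so that features from the labeled source dataset and the unlabeled target dataset are pushed toward similar distributions.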

Conclusion: Model performance improved with the domain adaptation methods we evaluated, but it fell short of within-dataset performance. Our results provide benchmarks on a public dataset for others to compare their methods.

Keywords: Cataract surgery; Domain adaptation; Surgical skill assessment; Transformer.

Grants and funding

R01EY03306/EY/NEI NIH HHS/United States