Review of screening retinal images for diabetic retinopathy by an artificial intelligence (AI) system demonstrated potential for significantly reducing the need for manual review with no loss in accuracy, according to a large retrospective analysis.
A deep-learning autograder (DLAG) achieved significantly higher specificity (P<0.001) as compared with consensus grading and the iGradingM automated system, at the same levels of sensitivity. At the same specificities, DLAG also achieved higher sensitivity than either of the comparator methods. Overall, DLAG demonstrated 96.58% sensitivity for observable retinal disease and 98.48% sensitivity for disease requiring a referral.
The results suggested that DLAG could reduce the need for manual grading from the estimated 50% with the iGradingM system to 43.84%, reported Joseph Mellor, PhD, of the University of Edinburgh in Scotland, and coauthors in the British Journal of Ophthalmology.
"If this system were used in NHS Scotland Diabetic Eye Service [DES], it might create economic savings by final-grading more patient screening episodes than are final-graded by the current automated system," the authors wrote of their findings. "The system described in this study may be combined with automated assignment of screening intervals to further increase the efficiency of diabetic eye screening."
Scotland has used a machine-based grading system for several years, and while more efficient than manual grading, the system has not achieved a high degree of accuracy, said Sunir Garg, MD, of the Wills Eye Institute at Thomas Jefferson University in Philadelphia and a spokesperson for the American Academy of Ophthalmology.
"This current study examined images from a large number of patients, many of whom were screened multiple times," Garg told via email. "The ability of the current system to detect diabetic eye disease is better than human graders or the automated system currently used in Scotland."
"Automated grading potentially will increase the number of patients who have their diabetic eye disease identified, as well as risk stratified," Garg continued. "Depending on where the images are acquired, this can be easier for the patients. It can then identify those at higher need of seeing an ophthalmologist."
Machine-based systems do not replace a comprehensive eye exam, which can identify other common conditions such as glaucoma, cataracts, and macular degeneration that are often not identified by these models, Garg added.
"One drawback of even this newer generation screening software is there still are a number of patients that need to get sent for manual grading," he said. "The current iterations of AI can be helpful when the image quality is high, but in cases in which the pictures were ungradable, human screeners were still important."
Currently available machine grading systems for retinal images safely identify about half of retinal screenings without the need for manual grading, Mellor and coauthors noted in their introduction. Some studies have suggested that deep learning-based systems achieve higher sensitivity and specificity for referable diabetic retinopathy, but a comparison of seven different systems showed wide variability in sensitivity and accuracy.
The iGradingM system has helped reduce the manual grading workload in Scotland's diabetes eye screening program, but AI systems offer the potential for further savings by achieving better specificity with similar sensitivity for disease detection, the authors continued.
Mellor and colleagues evaluated a deep learning-based algorithm for detecting any form of retinal disease or identifying ungradable images, comparing the performance of the DLAG system with the DES final grade, manual grading, and the iGradingM system.
The study involved participants 12 years or older in Scotland's national diabetes eye screening program during 2006 to 2016. Screening involved a single, 45-degree macula-centered photograph of each eye, plus additional photographs as necessary. Data included retinal images, quality assurance data, and routine diabetic retinopathy grades obtained from 179,944 patients from various national datasets. Not all data were included in each analysis.
The DES final grade was used as the reference, with a sensitivity of 92.80% and specificity of 90.00%. DLAG had a sensitivity of 92.97% at the same level of specificity and a specificity of 90.00% at the same level of sensitivity (P=1.000).
Individual grading resulted in a sensitivity of 95.83% and a specificity of 75.28%. At the same level of specificity, DLAG had a sensitivity of 96.23% (P=0.021), and at the same level of sensitivity, a specificity of 78.12% (P<0.001). The iGradingM system achieved a sensitivity of 92.97% and a specificity of 61.88%, compared with a DLAG sensitivity of 97.60% at the same specificity (P<0.001) and a DLAG specificity of 89.38% at the same sensitivity (P<0.001).
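For context, sensitivity and specificity are simple proportions of agreement with a reference standard, and the matched comparisons above are made by adjusting the deep-learning system's operating threshold until one metric equals the comparator's and then comparing the other. The short Python sketch below illustrates the calculation with made-up counts; it is a generic example, not the study's code or data.

# Generic illustration of sensitivity and specificity, using hypothetical
# counts (not data from the study).

def sensitivity_specificity(true_pos: int, false_neg: int,
                            true_neg: int, false_pos: int) -> tuple[float, float]:
    """Return (sensitivity, specificity) as percentages."""
    sensitivity = 100 * true_pos / (true_pos + false_neg)  # diseased screenings correctly flagged
    specificity = 100 * true_neg / (true_neg + false_pos)  # disease-free screenings correctly passed
    return sensitivity, specificity

# Hypothetical example: 930 of 1,000 diseased screenings flagged and
# 9,000 of 10,000 disease-free screenings correctly passed.
sens, spec = sensitivity_specificity(true_pos=930, false_neg=70,
                                     true_neg=9_000, false_pos=1_000)
print(f"sensitivity = {sens:.2f}%, specificity = {spec:.2f}%")  # 93.00%, 90.00%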
The authors acknowledged several limitations of the study: limited consensus grading for the subset of 744 quality assurance images, inability to account for the potential impact of a 2021 switch to biennial screening for patients with no disease, lack of accounting for multiple screening episodes for each patient, no episode-level information about mydriasis, and use of the DES final grade as the reference as opposed to consensus grading.
Disclosures
The study was supported by the Juvenile Diabetes Research Foundation U.K.
Mellor reported no relevant relationships with industry.
Primary Source
British Journal of Ophthalmology
Fleming AD, et al "Deep learning detection of diabetic retinopathy in Scotland's diabetic eye screening programme" Br J Ophthalmol 2023; DOI: 10.1136/bjo-2023-323395.