Color blind?
Artificial intelligence could improve the treatment of breast cancer, but there are worries it might worsen disparities
By Casey Ross, STAT

The great hope of artificial intelligence in breast cancer is that it can distinguish harmless lesions from those likely to become malignant. By scanning millions of pixels, AI promises to help physicians find an answer for every patient far sooner, offering them freedom from anxiety or a better chance against a deadly disease.

But the Food and Drug Administration has granted clearances to these products without requiring their makers to publicly disclose how extensively the tools have been tested on people of color, a decision that threatens to worsen already gaping disparities in breast cancer outcomes. The disease is 46 percent more likely to be fatal for Black women.

Oncologists said testing algorithms on diverse populations is essential because of variations in the way cancers manifest themselves among different groups. Black women, for instance, are more likely to develop aggressive triple-negative tumors and are often diagnosed earlier in life at more advanced stages of disease.

“These companies should disclose the datasets they’re using and the demographics that they’re using. Because if they don’t, then we essentially have to take their word for it that these new technologies apply equally,” said Salewa Oseni, a surgical oncologist at Massachusetts General Hospital. “Unfortunately, we’ve found that’s not always the case.”

A STAT investigation found that, in the public summaries filed by manufacturers, just one of 10 AI products cleared by the FDA for use in breast imaging breaks down the racial demographics of the data used to validate the algorithm’s performance. The other nine explain only that they were tested on varying numbers of scans drawn from proprietary datasets or mostly unnamed institutions.

The data used to support AI products — first to train the system, and then to test and validate its performance — are a crucial indicator of whether those products are effective for a wide range of patients. But companies often treat that information as proprietary, part of the secret recipe that sets their products apart from rivals.

The FDA is still trying to draw the line between the commercial interest in confidentiality and the public interest in disclosure, and to decide whether it can, or should, force manufacturers to be fully transparent about their datasets to ensure AI algorithms are safe and effective.

“It’s an extremely, extremely important topic for us,” said Bakul Patel, director of the FDA’s Digital Health Center of Excellence. “We want to have that next level of conversation: What should that expectation be for people to bring trustworthiness in these products?”

Beyond the agency’s January action plan, which calls for standard processes to root out algorithmic bias, the FDA issued guidance in 2017 calling on all makers of medical devices — whether traditional tools or AI-based — to publicly report the demographics of the populations used to study their products. But to date, that level of detail is not being provided in the public summaries of AI products posted to the agency’s website.

So far, just seven of 161 AI products cleared in recent years include any information about the racial composition of the datasets used to evaluate them. Even so, those products were cleared to help detect or diagnose a wide array of serious conditions, including heart disease, strokes, and respiratory illnesses.

Judging by public FDA filings, AI products cleared for use in breast imaging are being tested on relatively small groups of patients. The datasets range from 111 patients to over 1,000, in some cases from a single US site and in other cases from multiple sites around the globe.

But executives at AI companies said these data represent only a fraction of the cases used to train and test their products over many years. They said studies done at the behest of the FDA simply reinforce what they already know: Their products will perform accurately in diverse populations.

“It’s the tip of the tip of the iceberg,” Matthieu Leclerc-Chalvet, chief executive of the French AI company Therapixel, said of the 240-patient study requested by the FDA to validate its product, called MammoScreen, which received clearance in March 2020.

Prior to that study, he said, the company’s device — designed to help clinicians identify potentially cancerous lesions on mammography images — was trained on data from 14 health care sites that were selected to ensure representation from people of color. He said the tool was also tested on patients on the East and West coasts to measure its accuracy across different populations.

That study, published by the Radiological Society of North America, found that the diagnostic accuracy of radiologists improved when they were using MammoScreen. The study relied on 240 mammography images collected from a hospital in California from 2013 to 2016. The demographics of the dataset were broken down by age and level of breast density, but not by race.

A group of researchers from Massachusetts General Hospital and the Massachusetts Institute of Technology, by contrast, has decided to conduct studies validating the performance of a breast cancer risk prediction algorithm at multiple centers in the United States and around the globe before even considering commercializing the tool.

In a recent paper published in Science Translational Medicine, the researchers report the results of that testing on patients in Sweden and Taiwan, as well as its performance among Black patients in the United States. While the study found that the AI model, named Mirai, performed with similar accuracy across racial groups, the researchers are still doing more studies at other centers internationally.

Regina Barzilay, a researcher leading Mirai’s development at MIT, said the team determined that widespread validation is crucial to ensuring the model performs equally well across populations. In examining the model, the researchers found that the AI could predict a patient’s race from a mammography image alone, indicating that it was picking up on nuances in breast tissue that are relevant to assessing the risks facing different patients.
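
To make concrete what that kind of subgroup validation involves, here is a minimal sketch, in Python, of one common approach: computing a discrimination metric separately for each self-reported racial group and comparing it with the overall figure. The column names, scores, and sample sizes below are hypothetical and are not drawn from the Mirai study or from any FDA submission.

```python
# Hypothetical sketch of subgroup-stratified validation. The data are
# invented for illustration only.
import pandas as pd
from sklearn.metrics import roc_auc_score

# Each row: one patient's ground-truth label and the model's risk score.
results = pd.DataFrame({
    "race":       ["Black", "White", "Asian", "Black", "White", "Asian",
                   "Black", "White", "Asian", "Black", "White", "Asian"],
    "cancer":     [1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1],      # confirmed outcome
    "risk_score": [0.81, 0.22, 0.15, 0.64, 0.71, 0.30,
                   0.35, 0.58, 0.77, 0.18, 0.25, 0.69],       # model output in [0, 1]
})

# Overall discrimination, then discrimination within each subgroup.
overall_auc = roc_auc_score(results["cancer"], results["risk_score"])
print(f"Overall AUC: {overall_auc:.3f}")
for group, subset in results.groupby("race"):
    auc = roc_auc_score(subset["cancer"], subset["risk_score"])
    print(f"{group}: AUC = {auc:.3f} (n = {len(subset)})")
```

A real validation would use far larger samples, confidence intervals, and calibration checks in addition to discrimination, but the basic pattern of breaking results out by demographic group is the same.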

In breast cancer care, the failure to include diverse populations in research has repeatedly undermined how well products and algorithms work for people of color. Extensive research shows that such technologies can exacerbate disparities. Often, that research suggests, those inequities arise because diverse groups of people were not included in the data used to test the products before their release.

Meanwhile, multiple studies have shown that another breast cancer tool, this one used to inform screening recommendations and clinical trial protocols, underestimated the risk of breast cancer in Black patients and overestimated the risk in Asian patients.

That tool, known as the Gail model, uses a range of demographic and clinical factors to estimate a patient’s risk of developing breast cancer over five years. After studies pointed to inequities in its performance, the model was adjusted to improve its generalizability in diverse populations.
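
To illustrate what that under- and over-estimation means in practice, the sketch below computes an observed-to-expected (O/E) ratio per subgroup, a standard way the calibration of risk models is summarized: a ratio above 1 means the model predicted fewer cancers than actually occurred, and a ratio below 1 means it predicted too many. The counts are invented for illustration and do not come from any published evaluation of the Gail model.

```python
# Hypothetical observed-to-expected (O/E) calibration check by subgroup.
# All numbers are invented; they are not from any published study.

# For each group: (observed cancer cases, expected cases, where "expected"
# is the sum of the model's predicted 5-year risks across that group).
subgroups = {
    "Black patients": (150, 120.0),   # O/E > 1: risk underestimated
    "White patients": (200, 198.0),   # O/E near 1: roughly well calibrated
    "Asian patients": (80, 104.0),    # O/E < 1: risk overestimated
}

for group, (observed, expected) in subgroups.items():
    print(f"{group}: O/E = {observed / expected:.2f}")
```

Bringing those ratios closer to 1 in every subgroup is one way to think about what adjusting a model for generalizability entails.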

Still another model, known as Tyrer-Cuzick, was recently found to overestimate risk in Hispanic patients. The paper by researchers from MIT and Massachusetts General Hospital also found that their algorithm, Mirai, significantly outperformed Tyrer-Cuzick in accurately assessing risks for Black patients.

Connie Lehman, chief of breast imaging at Mass. General and a coauthor of the study, said the field has suffered for decades from a failure to include diverse groups of patients in research.

“Whether it’s AI or traditional models, we have always been complacent in developing models in European Caucasian women with breast cancer,” she said.

Evidence of inequity did not lead regulators to pull those models from the market, though, because they are the only tools available. But Lehman said the problems with them should serve as a rallying cry to the developers of AI products who are now trying to use data to improve performance.

She sees huge potential in AI. But she also sees how it could go wrong, as was the case with an earlier generation of computer-aided detection (CAD) software for breast cancer. Despite FDA clearances and government reimbursements that allowed providers to collect hundreds of millions of dollars a year for using the products, the software ultimately failed to improve care.

“Maybe we can say we learned lessons from CAD, we learned lessons from traditional risk models,” Lehman said. “Maybe we can say we’re not going to repeat those mistakes again. We are going to hold ourselves to a higher standard.”
