
On Tuesday, 27 May, the Seminari Generali IAC 2025 series continues with a new session. The guest speaker is Livia Lilli, Data Scientist at the Direzione Tecnica, ICT e Innovazione Tecnologie Sanitarie, Università Cattolica del Sacro Cuore.
The seminar, titled "Real-World Applications of Natural Language Processing in Healthcare – The MISTIC pipelines", will be held in hybrid format: in person at CNR IAC and via streaming (see the link at the bottom), broadcast on the institute's YouTube channel.
The abstract follows.
The use of Natural Language Processing (NLP) has expanded across many domains, including healthcare, where it plays a crucial role in extracting data from unstructured clinical reports to support Real-World Evidence (RWE) generation. This is especially important in oncology, where key information on disease progression, such as metastasis, is often found only in free-text Electronic Health Records (EHRs). However, processing this data remains challenging, particularly in underrepresented languages such as Italian, for which domain-specific NLP tools are limited. Additionally, adapting large language models typically requires substantial computational resources and large labeled datasets, which limits their use in real-world clinical settings.
The MISTIC pipeline (Metastases Italian Sentence Transformers Inference Classification) is a novel, lightweight NLP solution developed by Fondazione Policlinico Universitario Agostino Gemelli IRCCS in collaboration with the Istituto per le Applicazioni del Calcolo “Mauro Picone” (CNR-IAC). Designed and tested in a real-world clinical setting, MISTIC aims to identify breast cancer metastases in Italian EHRs using a few-shot learning approach that requires minimal annotated data and computational resources. The pipeline combines linguistic preprocessing techniques, such as sentence segmentation and topic filtering, with a transformer-based classifier fine-tuned on a small dataset of 550 texts.
When evaluated against alternative methods—including zero-shot BERT models, rule-based systems, and large generative language models—MISTIC demonstrates a compelling balance of accuracy, generalization, and efficiency. With an F1-score exceeding 91%, it outperforms competing approaches while maintaining full explainability and requiring no GPU infrastructure.
The project addresses a key gap in biomedical NLP by targeting Italian, an underrepresented language in clinical research. With its scalable and adaptable design, MISTIC can help hospitals streamline retrospective studies, build real-world evidence datasets, and extract meaningful insights from unstructured clinical text, demonstrating the impact of tailored NLP solutions on medical research and care.
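As a rough illustration of the preprocessing steps named in the abstract (sentence segmentation followed by topic filtering), here is a minimal Python sketch. It is not the MISTIC implementation: the regex-based splitter, the keyword list `TOPIC_KEYWORDS`, and the function names are all hypothetical, chosen only to show how a report might be reduced to candidate sentences before a classifier sees them. The transformer-based classification step is deliberately left out.

```python
import re

# Hypothetical keyword list for illustration only; the abstract does not
# describe how MISTIC actually performs topic filtering.
TOPIC_KEYWORDS = ("metastasi", "metastatico", "metastatica", "lesioni secondarie")


def split_sentences(text: str) -> list[str]:
    """Naive segmentation: split after '.', '!', '?' or ';' followed by whitespace."""
    parts = re.split(r"(?<=[.!?;])\s+", text.strip())
    return [p for p in parts if p]


def filter_by_topic(sentences: list[str], keywords=TOPIC_KEYWORDS) -> list[str]:
    """Keep only sentences that mention at least one topic keyword (case-insensitive)."""
    return [s for s in sentences if any(k in s.lower() for k in keywords)]


def candidate_sentences(report: str) -> list[str]:
    """Segment a free-text report and keep the sentences relevant to the topic."""
    return filter_by_topic(split_sentences(report))


if __name__ == "__main__":
    report = (
        "Paziente in buone condizioni generali. "
        "TC torace: presenza di metastasi polmonari multiple. "
        "Prosegue terapia ormonale."
    )
    for sentence in candidate_sentences(report):
        print(sentence)
```

In a pipeline of this kind, only the sentences returned by `candidate_sentences` would be passed on to the classifier, which keeps the input short and makes it easier to trace a prediction back to the sentence that triggered it.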