In December 2023, LUCIA’s partners from SAS-FISEVI released a first version of a dataset including prescriptions, diagnoses and procedures of 3000 synthetic lung cancer patients to speed up the development of the AI-based lung cancer risk prediction model.
Within task T2.1 “Data source identification, preparation & privacy enhancing ETL processes”, the team at the Innovation & Data Analysis Unit of Virgen Macarena University Hospital (SAS and FISEVI partners in LUCIA) has released a first version of a synthetic dataset generated from real-world data of lung cancer patients registered in the hospital electronic health record (EHR). The dataset includes demographics, prescriptions, diagnoses and procedure codes for around 3000 synthetic lung cancer patients.
The dataset has been released to enable and accelerate the development and testing of the technological infrastructure that will later support the training of the AI-based lung cancer risk prediction models. For this reason, rather than producing clinically meaningful synthetic records, in this version the SAS-FISEVI team has focused on mimicking the data model of the real-world data. This data model follows the structure of the OMOP CDM v5.4, which is currently considered the de facto standard in the EU for multi-center retrospective clinical studies. Future releases of the lung cancer synthetic dataset will be generated with advanced synthetic data generation methods such as generative adversarial networks (GANs, a type of generative AI technique), so that they not only mimic the data model but also include clinically meaningful data values. They can then also be used to assess the quality and feasibility of different AI-based lung cancer risk prediction models in testing environments.
This strategy is considered an effective way to push forward LUCIA developments while remaining fully compliant with the data privacy regulations in place, as no identifiable data has been included in the synthetic dataset. Furthermore, the methodology applied to generate this dataset has been cleared by the hospital research ethics committee. At a later stage, once access to real-world data of lung cancer patients has been cleared by the research ethics committee, LUCIA partners will be able to validate their developments with real-world information in collaboration with our team. So far, the synthetic dataset has been approved only for internal use within the LUCIA consortium to support the achievement of the project objectives. Nonetheless, it is foreseen that this dataset will comply with the FAIR guidelines by the end of the project.