Applying Natural Language Processing to real-world patient data

About the Project

Study Title: Applying Natural Language Processing to real-world patient data

Timeline of Study:

  • September 2024: Study begins with planning of the ethics application and of patient and public engagement events to guide the study set-up.
  • November 2024: Workshop with cancer survivors previously treated with radiotherapy on the use of written medical notes in cancer research, and acceptable approaches to consent and patient communication.
  • February 2025: Workshop with a subset of participants from the November workshop to co-design the patient information poster.
  • March 2025: Draft poster shared with workshop attendees and feedback received.
  • April 2025: Ethics application submitted to REC and CAG.
  • June 2025: CAG and REC review meetings.
  • August 2027: Anticipated study end.


Study funder: The project is being funded through a University of Manchester MRC Doctoral Training Partnership PhD studentship.

Study Lead: Dr Gareth Price

Study Contact:


Project Background


There are patient populations and changes in practice for which conventional clinical trial data does not exist. This means there are large sections of the population, particularly ethnic minority, deprived, and frail patients, for whom there is significant uncertainty over which treatment strategies are likely to lead to the best outcomes. Routinely collected (real-world) healthcare data, recorded digitally in Electronic Healthcare Records (EHRs), offers the opportunity to provide evidence for personalised medicine for all patients, not just those typically recruited to clinical trials.

Whilst structured data such as age and tumour stage are easy for researchers to anonymise and analyse, the vast majority of information in patients’ EHRs is recorded as written medical notes. Being able to analyse this information is a critical challenge currently limiting cancer research. The Manchester Cancer Research Centre are collaborating with the School of Computer Sciences to develop and apply Natural Language Processing techniques, powered by artificial intelligence (AI), to analyse routinely collected written cancer data at The Christie NHS Foundation Trust.

The project initially focusses on lung cancer, the leading cause of cancer mortality in the UK, and head and neck cancers, which are complex diseases to diagnose and treat, with survivors often suffering from poor quality of life.

What is natural language processing?
What is anonymised data?
What is real world data?

Patient and Public Involvement and Engagement


Written (unstructured) medical notes offer a rich source of information often excluded from real-world data analyses, but present unique challenges compared to structured data. Given the nature of written data, anonymisation methods cannot guarantee complete de-identification of personally identifiable information. This means that a critical component of undertaking research with these data is patient and public involvement in the design and implementation of the studies.

In November 2023 we hosted a 2-hour in-person focus group with 9 patient and public contributors to discuss:

  • What is important to patients and carers when working with potentially identifiable written medical note data?
      • What is important to consider with respect to this for this research project?
  • Is the national data opt-out appropriate for this research project, or is a study-specific consent process also needed?
      • Vote on the above question.
  • What sort of information is important to communicate about this research project?
      • How should this information be shared?

In the discussion, contributors prioritised balancing information availability with data privacy. Many indicated they were comfortable with less masking and anonymisation of identifiable information, provided this was appropriately controlled and authorised and was necessary for the research aims. The vote on the preferred consent approach favoured use of the national data opt-out, with two-thirds (6/9) indicating this was an appropriate mechanism. The group also indicated a preference for simple physical or digital posters in outpatient departments at The Christie that raise awareness of the project and link to more detailed web-based information.


Project outputs

“Patient and Public Involvement and Engagement on Natural Language Processing for Real World Cancer Medical Notes”, W Oyewusi, E Vasquez Osorio, G Nenadic, I MacGregor, and G Price.

Proceedings of the XXth International Conference on the use of Computers in Radiation Therapy, Lyon, France (2024)

This is an opt-out study
This means that if you were treated at The Christie for head and neck or lung cancer, data relating to your cancer treatment will be included in the study unless you tell us not to include it or you have used the national data opt-out.

What happens if my data is used?

  • If you are a current patient, nothing about your cancer treatment will change.
  • We will use only data from your medical records that relates to your cancer treatment.
  • All of your information will be stored safely, securely and confidentially on a computer system at The Christie NHS Foundation Trust, only accessible to specifically approved staff for the purposes of this project.
  • Your name and other private information held in a structured format will be removed (the result is called ‘anonymised’ data) before your data is stored in the research database for analysis by the research team.
  • All written medical notes will be run through masking software to remove identifiable information. There is a small chance that some identifiable information may not be picked up by this process.
  • At the end of the study, we will maintain the database securely for 5 years, after which it will be securely destroyed.

How do I opt out?

  • If you do not want your information to be used in this study, you must tell us. This will not change your cancer treatment.
  • This is called opting out and you can do it at any time without giving a reason.
  • It will always be your choice to opt out.


If you decide you do not want your data to be used in the study, you can either use the national data opt-out or opt out of this study specifically by contacting the study team on:


The national data opt-out is a service that allows patients to opt out of their confidential patient information being used for research and planning.


You can view or change your national data opt-out choice at any time by:

  • using the online service at
  • clicking on “Your Health” in the NHS App, and selecting “Choose if data from your health records is shared for research and planning”.


What benefits and risks are there to me of taking part in this study?

Nothing about your cancer treatment changes if you take part in this study. Therefore, there are no risks or benefits to you personally from letting us include your data. You would be helping to develop a method which may benefit patients with cancer in the future.


You can find out more about how we use your information in Manchester:


If you would like to receive a copy of the study outcomes when available, please email:

You will not get a copy of the study outcomes unless you email us. 


This project, the information on this website, and the accompanying information poster were developed in collaboration with patient contributors and VOCAL.


Version 1.1 19/04/2024

IRAS ID: 328378

Study title: Applying Natural Language Processing to real-world patient data

