Jan 27, 2022

2021 marks the second year of The Innovation Sandbox and will see us embark on several new projects, as well as continue some existing ones, with new students, new clients and new partnerships.

The Innovation Sandbox is a collaborative initiative at the confluence of industry and academia, enabling both academic research on real-world datasets and the transfer of cutting-edge machine learning research into industry applications.

Project overview: what exactly are you researching? Why is it exciting?

After a fruitful collaboration with Person Centred Software (PCS) in creating a COVID-19 tracker for use in care homes, we turned our attention to other challenges within care home environments. Infections, particularly urinary tract infections (UTIs), were at the top of the list. UTIs are a class of conditions that affect the urinary tract, including the bladder (cystitis), urethra (urethritis) or kidneys (kidney infection) (NHS).

UTIs, which we found to occur in about 20% of care home residents, are a particularly problematic form of infection for the elderly. Furthermore, of the many different infections that occur within care homes (chest infections, skin infections, etc.), approximately a third over the past year have been UTIs.

Even though UTIs are extremely common among the elderly, they become harder to detect as age increases. This may be due to a suppressed immune response, which means that no classical symptoms are observed, or, in most cases, to a range of comorbidities that hinder a clear diagnosis. Thus, an early warning system to help care home workers identify and offer pre-emptive support in treating these infections would be invaluable.

Where past studies have focussed on purely medical or physiological data, using the Mobile Care Monitoring (MCM) system from PCS, we are able to access a rich history of behavioural data too — from volume of fluid intake to sleeping schedules. Leveraging this vast record of daily activities will allow for insights to be made that were not possible in more conventional studies.

Discussions with a medical expert we are collaborating with have also confirmed that UTIs often go undiagnosed (uncomplicated UTIs will resolve themselves) or have delayed diagnosis — with very unpleasant consequences. This is particularly true in care home environments, where with an elderly population the risk of infection is much greater and general frailty may prohibit effective communication of the symptoms that residents experience.

Having a healthcare professional on board will also allow us to receive consultation on the clinical side of UTIs. Although we are approaching the project from a data science perspective, a good understanding of the medical domain is crucial for avoiding certain pitfalls and for making the best use of the data that we have available.

Ultimately, what we set out to achieve with this project is to continuously test, predict and generate insights ahead of time on the risk of a resident having contracted a UTI. In care homes, UTI risk increases with age, and it is very difficult for elderly residents to notify their carers about their issues; other symptoms may also prevent carers from realising that something is wrong. That is part of our motivation for working on this project and why we want to make it successful.

Project Challenges

Data Processing

Identifying the useful sets of records, extracting the data through the APIs and processing it into a form that can be interpreted by machine learning algorithms will definitely be the most time-consuming challenge. We expect this data processing stage to represent the biggest portion of the project, as it will be difficult to manage and find meaning in the various types of data collected by the MCM application: discrete event data, including actions and behaviours, as well as continuous physiological data such as temperature and blood O2 level. We will need to process these into a form that the model can take as input.
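To give a feel for what this processing might involve, here is a minimal sketch of turning mixed records into one feature row per resident per day. The record layout, the record type names and the choice to sum discrete events while averaging continuous readings are all assumptions made for illustration; they are not the MCM schema or our final pipeline.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical records: (resident_id, ISO timestamp, record_type, value).
# Field names and record types are illustrative, not the MCM schema.
records = [
    ("r1", "2022-01-10T08:00", "fluid_intake_ml", 200.0),
    ("r1", "2022-01-10T13:30", "fluid_intake_ml", 150.0),
    ("r1", "2022-01-10T09:00", "temperature_c", 36.8),
    ("r1", "2022-01-10T21:00", "temperature_c", 37.4),
    ("r1", "2022-01-11T08:15", "fluid_intake_ml", 100.0),
]

# Assumption: discrete events are summed per day, continuous readings averaged.
SUMMED = {"fluid_intake_ml"}


def daily_features(records):
    """Group records by (resident, day) and aggregate each record type."""
    buckets = defaultdict(list)
    for resident, timestamp, kind, value in records:
        day = datetime.fromisoformat(timestamp).date().isoformat()
        buckets[(resident, day, kind)].append(value)

    rows = defaultdict(dict)
    for (resident, day, kind), values in buckets.items():
        rows[(resident, day)][kind] = sum(values) if kind in SUMMED else mean(values)
    return dict(rows)


features = daily_features(records)
```

Days on which a record type was never logged simply lack that key, which is one place where the missing-data questions discussed later show up.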

To overcome this challenge gradually, we plan to “ease” into the actual modelling. First, we will develop a simple classifier that predicts whether or not a resident has a UTI at a given moment. This will be a tree-based model that does not account for temporal patterns and uses only continuous physiological data.
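As an illustration of the simplest possible tree-based model, the sketch below fits a decision stump (a tree with a single split) by minimising Gini impurity. It is only meant to show the idea; the actual baseline may be a deeper tree or an ensemble, and the feature values and labels are made-up toy numbers, not care home data.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def fit_stump(rows, labels):
    """Return (feature, threshold, left_label, right_label) for the best split."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted({row[f] for row in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                # Each side predicts its majority label.
                left_label = int(2 * sum(left) >= len(left))
                right_label = int(2 * sum(right) >= len(right))
                best = (score, f, t, left_label, right_label)
    if best is None:
        raise ValueError("no split possible: every feature is constant")
    return best[1:]


def predict(stump, row):
    f, t, left_label, right_label = stump
    return left_label if row[f] <= t else right_label


# Toy data (made up): (temperature in Celsius, heart rate); 1 = UTI, 0 = no UTI.
rows = [(36.6, 72), (36.8, 80), (37.0, 76), (37.9, 88), (38.2, 95), (38.4, 79)]
labels = [0, 0, 0, 1, 1, 1]

stump = fit_stump(rows, labels)
```

On this toy data the stump splits on temperature, since that feature separates the two classes cleanly while heart rate does not.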

Then, after this has been established as a baseline model, we plan to use temporal abstraction methods to convert multivariate data into a symbolic time-interval representation that can be fed into a classifier. This classification method has been shown to be effective for predicting falls in a similar domain, and hence we wish to utilise it, whilst also incorporating anomaly detection methods to further improve performance.
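One common form of temporal abstraction is state abstraction: each reading is mapped to a symbol, and consecutive readings with the same symbol are merged into a single interval. The sketch below shows this for one variable. The cutoffs are illustrative assumptions, not clinical thresholds, and the method we end up using may differ.

```python
def state_of(value, low, high):
    """Map a numeric reading to a symbolic state using two cutoffs."""
    if value < low:
        return "LOW"
    if value > high:
        return "HIGH"
    return "NORMAL"


def abstract_intervals(series, low, high):
    """Convert time-sorted (time, value) pairs into (state, start, end) intervals."""
    intervals = []
    for t, value in series:
        state = state_of(value, low, high)
        if intervals and intervals[-1][0] == state:
            # Same state as the previous reading: extend the current interval.
            intervals[-1] = (state, intervals[-1][1], t)
        else:
            intervals.append((state, t, t))
    return intervals


# Toy hourly temperature readings (made up): (hour, degrees Celsius).
temperature = [(0, 36.7), (1, 36.9), (2, 37.8), (3, 38.1), (4, 37.0)]

intervals = abstract_intervals(temperature, low=36.0, high=37.5)
# intervals == [("NORMAL", 0, 1), ("HIGH", 2, 3), ("NORMAL", 4, 4)]
```

Doing the same for every variable gives a set of symbolic intervals per resident, from which patterns such as "HIGH temperature overlapping LOW fluid intake" could be derived as classifier features.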

Data Collection and Reliability

The MCM application collects data from care homes across the UK. At the moment, we have been given access to only 4 of a total of 30 care home organisations, representing around 1800 residents. A challenge here is that we will have to continuously take into consideration how much data is available for what we want to achieve. Having a sufficient amount of quality data will be fundamental to obtaining meaningful results in the project.

Additionally, inputs into the MCM application are made solely by carers and doctors, so the residents will not be logging their own data. The challenge arises in cases such as a carer not being present during a specific incident, or a past event or action being entered incompletely in the app. Furthermore, an event may have been recorded on paper but not logged within the application, which can result in many missing data points.

Hence, after observing how reliable the data is, we need to produce a detailed analysis of it and resolve issues before proceeding to the modelling stage. The model will need to account for missing data and take appropriate measures. After the initial baseline model is built, the results may indicate a lack of performance. In this undesired scenario, we will aim to broaden our focus to general infections, rather than just UTIs, as this will expand the available dataset.
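One simple way to account for missing values is to fill each gap with the column median while adding a flag that records the value was missing, so that the model can still learn from the absence itself. The sketch below assumes hypothetical column names and toy values, and is only one of several options we may consider.

```python
from statistics import median


def impute_with_indicators(rows, columns):
    """Fill missing values (None) with the column median and add <column>_missing flags."""
    medians = {}
    for column in columns:
        observed = [row[column] for row in rows if row.get(column) is not None]
        # A column with no observed values stays None after imputation.
        medians[column] = median(observed) if observed else None

    imputed = []
    for row in rows:
        new_row = {}
        for column in columns:
            is_missing = row.get(column) is None
            new_row[column] = medians[column] if is_missing else row[column]
            new_row[column + "_missing"] = int(is_missing)
        imputed.append(new_row)
    return imputed


# Toy daily rows (made up); None marks a value that was never logged.
rows = [
    {"fluid_intake_ml": 1200, "temperature_c": 36.8},
    {"fluid_intake_ml": None, "temperature_c": 37.6},
    {"fluid_intake_ml": 900, "temperature_c": None},
]

imputed = impute_with_indicators(rows, ["fluid_intake_ml", "temperature_c"])
```

Keeping the flags matters here because a gap in the log may itself be informative, for example if it coincides with a carer being absent.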

Project Ethics

As we will be dealing with sensitive and confidential information, provisions must be made to honour the data privacy agreements in place. We have access to information pertaining to resident names, locations, age, gender and medical history; these records must be treated such that individuals cannot be identified. At the moment, the only personal information that we extract from the APIs is the age and gender of the resident, as these will be important modelling features.

Especially with age, however, outliers may make it possible to identify a unique resident. To avoid this, we plan to incorporate a form of differential privacy, where a small amount of noise is added to the data to remove this possibility, whilst maintaining a distribution similar to that of the original data.
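A minimal sketch of one way to add such noise is the Laplace mechanism: each age is first clipped to a fixed range, which also blunts outliers, and then Laplace noise with scale (range / epsilon) is added. The bounds, the epsilon value and the function names below are assumptions for illustration; we have not yet settled on the mechanism or its parameters.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatise_ages(ages, epsilon, lower=65, upper=105, seed=None):
    """Clip each age to [lower, upper], then add Laplace noise.

    The bounds and epsilon are illustrative assumptions. The noise scale is the
    width of the clipped range divided by epsilon.
    """
    rng = random.Random(seed)
    scale = (upper - lower) / epsilon
    noisy = []
    for age in ages:
        clipped = min(max(age, lower), upper)
        noisy.append(clipped + laplace_noise(scale, rng))
    return noisy


# Toy ages (made up), including an outlier that clipping pulls back to 105.
noisy_ages = privatise_ages([70, 82, 110], epsilon=1.0, seed=0)
```

There is a trade-off to keep in mind: a smaller epsilon gives stronger privacy but larger noise, so how closely the noisy ages follow the original distribution depends on the parameters chosen.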

Potential Impact

A 2020 study published by Springer on predicting falls in the elderly served as a starting point for improving early detection of incidents in the care home domain, so when we were given the possibility to follow up on that research and develop a model to track changes in behaviour to predict UTIs, we were quite excited.

The volume of data that we will be able to work with covers 2000 care homes within the UK. Hence, once we develop and apply the novel model, it could be deployed for approximately 40,000 care home residents. We see this as having a great impact on care homes.

Ultimately, we want to extend this model beyond detecting anomalies in the behaviour of elderly residents within care homes, and also apply it to general hospital environments and the general population.

What to expect next

In late January, we completed all literature reviews and background research on the domain of UTIs and on the various machine learning techniques which we could explore. Having done this, we are currently in a detailed exploratory data analysis phase, which will immediately be followed by building the aforementioned baseline classifier. This will simply predict whether or not someone has a UTI, exploring a variety of binary and multi-label classifiers. We aim to complete this before the Easter break, after which we will shift gears towards continuously predicting the risk of someone developing a UTI in a future time period. Working backwards from the project deadline in mid-June, we wish to develop the novel model by the beginning of May. Any extensions to further improve the model, or to provide guidance to carers based on its output, will be worked on if time allows.