A leading hospital standards certification company gathers roughly 50 reports a day from its on-site inspectors. Each report has to be tagged with the specific certification-standard label that the hospital must meet. There are hundreds of these labels, and on-site inspectors choose the wrong one about 30% of the time. Reports tagged with incorrect standard labels are misleading at best and useless at worst. They prevent the certification company from producing an accurate evaluation of the hospital, putting the hospital, the certification company, and patients at risk. Checking and correcting mislabeled reports is manual, time-consuming, and expensive.
A system that prompts the inspector in real time with the most likely standard labels would significantly reduce errors, quality-control oversight time, and expense.
Using about 100,000 rows of historical inspection data, we developed a machine learning model that learns the connection between the content of an inspection report and the standard label(s) tied to that report. The model works in concert with Microsoft Azure Search to provide a flexible solution that can be extended beyond the proof of concept into a system that continually learns from the findings inspectors enter and the standard labels they associate with those findings.
We evaluated three families of machine learning models from the natural language processing (NLP) domain. Each of them converts text to a numerical representation.
- A term frequency–inverse document frequency (TF-IDF) model. It weights each word by how often the word appears in a finding, discounted by how common that same word is across all the findings in the training dataset.
- A latent semantic indexing (LSI) model. This model looks across all of the findings in the training dataset, identifies relationships between words, and refines these relationships into topics. A finding is then represented as a weighted sum of the topics.
- A latent Dirichlet allocation (LDA) model. A probabilistic extension of LSI, this more sophisticated model is particularly good at finding semantically related content in a domain.
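The TF-IDF idea in the list above can be sketched in a few lines of plain Python. This is a minimal illustration, not the production model: the smoothed IDF formula, the whitespace tokenizer, the nearest-finding ranking, and every finding and label name (`MED-STORAGE`, `FIRE-SAFETY`, `INFECTION-CONTROL`) are assumptions made up for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a real pipeline would clean text further."""
    return text.lower().split()


def fit_idf(documents):
    """Compute a smoothed inverse document frequency for every word in the corpus."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))  # count each word once per document
    # Assumed smoothing: ln((1 + N) / (1 + df)) + 1, so rare words weigh more.
    return {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Represent a text as a sparse {word: weight} mapping; unseen words are dropped."""
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    return {w: (c / total) * idf[w] for w, c in counts.items() if w in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest_labels(finding, labeled_findings, idf, top_k=3):
    """Rank labels by the best similarity between the new finding and past findings."""
    query = tfidf_vector(finding, idf)
    best = {}
    for text, label in labeled_findings:
        score = cosine(query, tfidf_vector(text, idf))
        best[label] = max(best.get(label, 0.0), score)
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return [label for label, _ in ranked[:top_k]]


# Hypothetical historical findings with made-up standard labels.
history = [
    ("expired medication found in the supply cabinet", "MED-STORAGE"),
    ("medication refrigerator temperature log missing", "MED-STORAGE"),
    ("fire exit blocked by equipment in corridor", "FIRE-SAFETY"),
    ("fire extinguisher inspection tag out of date", "FIRE-SAFETY"),
    ("staff did not perform hand hygiene before patient contact", "INFECTION-CONTROL"),
]
idf = fit_idf([text for text, _ in history])
suggestions = suggest_labels("blocked fire exit near the ward", history, idf)
print(suggestions)
```

In this toy corpus the fire-related label ranks first because the new finding shares three words with a past fire-safety finding. Ranking by the single most similar past finding is only one possible design; a trained classifier over the same vectors would be another.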
We created two metrics to measure and compare the performance of the models. Based on these metrics, we chose the two models that performed significantly better than the others: a TF-IDF model and an LDA model with a vector size of 500 elements. We implemented the models on the Microsoft Azure Machine Learning (AzureML) platform.
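The two evaluation metrics are not named here. As an illustration only, one metric commonly used for label-suggestion systems is top-k accuracy: the share of reports whose true label appears among the first k suggestions, compared against what random guessing would achieve. The function names and the sample data below are hypothetical.

```python
def top_k_accuracy(ranked_suggestions, true_labels, k):
    """Fraction of reports whose true label is among the first k suggestions."""
    if not true_labels or len(ranked_suggestions) != len(true_labels):
        raise ValueError("need one non-empty suggestion list per true label")
    hits = sum(
        1 for ranked, truth in zip(ranked_suggestions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


def random_baseline(num_labels, k):
    """Expected top-k accuracy when k distinct labels are picked uniformly at random."""
    return min(k, num_labels) / num_labels


# Made-up ranked suggestions for four reports, with made-up true labels.
predictions = [
    ["FIRE-SAFETY", "MED-STORAGE", "INFECTION-CONTROL"],
    ["MED-STORAGE", "FIRE-SAFETY", "INFECTION-CONTROL"],
    ["INFECTION-CONTROL", "MED-STORAGE", "FIRE-SAFETY"],
    ["MED-STORAGE", "INFECTION-CONTROL", "FIRE-SAFETY"],
]
truth = ["FIRE-SAFETY", "FIRE-SAFETY", "INFECTION-CONTROL", "FIRE-SAFETY"]

top1 = top_k_accuracy(predictions, truth, k=1)
top2 = top_k_accuracy(predictions, truth, k=2)
baseline = random_baseline(num_labels=3, k=1)
print(top1, top2, baseline)
```

Dividing a model's score by the random baseline gives a "times better than random" figure. With hundreds of labels the baseline is very small, so that ratio can be large even for a modest absolute accuracy.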
The machine learning models perform 1,500 times better than randomly associating an inspection report with a certification-standard label. Synaptiq’s machine learning solution has cut the time it takes to complete a certification in half and improved the customer experience, as measured by a 30% increase in net promoter score.