Analytics Theories for Medical Diagnosis

Khivsara (2018) presents a number of basic analytics theories. Of these, I believe four are most relevant for medical diagnosis: clustering, association rules, regression, and textual analysis.


Association rules are nothing more than finding causal structures and patterns among objects in order to establish a logical relationship between them. They are a machine learning analog to what doctors do routinely when making diagnoses. Picture emergency room triage, where patients are sorted and prioritized based on symptoms. In place of a nurse, perhaps on particularly busy nights, a self-service kiosk could let patients select all the symptoms they are exhibiting. Those symptoms would generate potential diagnoses, the severity of which would determine each patient’s priority in the night’s queue.
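To make the idea concrete, here is a minimal sketch of how the strength of one symptom-to-diagnosis rule could be measured with support and confidence, the two standard association-rule metrics. The triage records, symptom names, and diagnoses are all invented for illustration and are not drawn from any real dataset.

```python
# Hypothetical triage records: each set holds the symptoms one patient
# selected plus the diagnosis eventually recorded (all values invented).
records = [
    {"fever", "cough", "flu"},
    {"fever", "cough", "flu"},
    {"fever", "cough", "fatigue", "flu"},
    {"fever", "cough", "pneumonia"},
    {"fever", "rash", "measles"},
    {"cough", "wheezing", "asthma"},
    {"chest_pain", "shortness_of_breath", "cardiac_event"},
    {"fever", "headache", "flu"},
]


def support(itemset, records):
    """Fraction of records that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= record for record in records) / len(records)


def confidence(antecedent, consequent, records):
    """Estimate of P(consequent | antecedent) from the records."""
    both = set(antecedent) | set(consequent)
    return support(both, records) / support(antecedent, records)


# Rule under test: {fever, cough} -> flu
rule_support = support({"fever", "cough", "flu"}, records)
rule_confidence = confidence({"fever", "cough"}, {"flu"}, records)
print(f"support={rule_support:.3f}, confidence={rule_confidence:.2f}")
# prints: support=0.375, confidence=0.75
```

In this toy data, three of the four patients who reported both fever and cough were diagnosed with flu, which is where the 0.75 confidence comes from. A real system would mine many such rules at once, for example with the Apriori algorithm, rather than checking one by hand.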

Moving a step beyond simple associations, let us examine clustering. Assume two risk factors for chronic disease (e.g., unhealthy diet and tobacco use) were quantified for a population of patients and plotted on a two-axis graph. Each point on the graph would represent one individual, positioned by diet and tobacco use. Rather than being evenly dispersed across the graph, the data points would typically fall into two or more groupings, depending upon the population. K-means clustering would assign those data points (the individuals) to different risk groups depending upon where they fell on the chart. K-means clustering is most useful in healthcare applications where similarities between patients must be quantified and cohorts established.
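The grouping step described above can be sketched in a few lines of plain Python. This is a bare-bones version of the usual K-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points), not a production implementation. The patient values, the two axes (a diet score and cigarettes per day), and the choice of starting centroids are all assumptions made up for the example.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, centroids, iterations=20):
    """Alternate assignment and centroid-update steps until stable."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            clusters[nearest(point, centroids)].append(point)
        updated = [
            # Mean of each coordinate; keep the old centroid if a cluster is empty.
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    labels = [nearest(point, centroids) for point in points]
    return centroids, labels


# Hypothetical patients: (unhealthy-diet score 0-10, cigarettes per day).
patients = [(2, 0), (3, 1), (1, 0), (2, 2), (8, 18), (9, 20), (7, 15), (8, 22)]

# Start from one patient in each apparent grouping (an arbitrary choice).
centroids, labels = kmeans(patients, centroids=[patients[0], patients[4]])
print(labels)     # [0, 0, 0, 0, 1, 1, 1, 1]
print(centroids)  # [(2.0, 0.75), (8.0, 18.75)]
```

Because the two axes here are on different scales, a real analysis would normally standardize each risk factor first so that neither one dominates the distance calculation.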

Going a step further, regression puts quantitative measures on the relationship between sets of variables and uses that relationship to predict values. In healthcare, the most common use of regression is related to healthcare costs. For diagnosis, logistic regression in particular can be helpful, because it estimates the probability of a condition from a number of known factors. Imagine a known regression equation for predicting diabetes risk based on multiple input variables.
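As a rough illustration of what such an equation might look like in practice, the sketch below applies the logistic function to a weighted sum of patient inputs. The input variables, the intercept, and every coefficient are hypothetical values I made up for the example. They do not come from any published or validated diabetes model.

```python
import math

# Hypothetical, invented coefficients; NOT from any published risk model.
INTERCEPT = -8.0
COEFFICIENTS = {
    "age_years": 0.04,
    "bmi": 0.10,
    "fasting_glucose_mg_dl": 0.03,
}


def diabetes_risk(patient):
    """Logistic regression prediction: sigmoid of the linear combination."""
    z = INTERCEPT + sum(weight * patient[name] for name, weight in COEFFICIENTS.items())
    return 1.0 / (1.0 + math.exp(-z))


# Two made-up patients for comparison.
younger_patient = {"age_years": 30, "bmi": 22, "fasting_glucose_mg_dl": 85}
older_patient = {"age_years": 60, "bmi": 35, "fasting_glucose_mg_dl": 140}

low_risk = diabetes_risk(younger_patient)
high_risk = diabetes_risk(older_patient)
print(f"younger patient: {low_risk:.2f}, older patient: {high_risk:.2f}")
```

The output is always a value between 0 and 1, which is what makes logistic regression a natural fit for yes-or-no diagnostic questions. In a real setting the coefficients would be estimated from historical patient data rather than chosen by hand.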

Finally, let us examine textual analysis. The other three theories mentioned here rely on structured data. However, structured data is only a fraction of the data collected when a patient sees a provider. The ability to utilize unstructured data, rife with context and nuance, is perhaps the biggest untapped potential in healthcare analytics. The confluence of textual analysis and natural language processing (NLP) allows unstructured data from sources such as patient records and provider dictation to become part of the picture in predictive modeling and coexist with structured data.
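To show how free text could be turned into structured inputs, here is a deliberately simple sketch that scans a clinical note for symptom phrases and checks for a nearby negation word. The note, the symptom vocabulary, and the negation list are all invented for the example. Real clinical NLP is far more sophisticated than keyword matching, so treat this only as an illustration of the idea.

```python
import re

# Hypothetical vocabulary mapping note phrases to structured feature names.
SYMPTOM_TERMS = {
    "chest pain": "chest_pain",
    "shortness of breath": "shortness_of_breath",
    "fever": "fever",
    "cough": "cough",
}
NEGATIONS = ("no", "denies", "without")


def extract_symptoms(note):
    """Return {feature: present?} for each symptom phrase found in the note."""
    findings = {}
    for sentence in re.split(r"[.;]\s*", note.lower()):
        for phrase, feature in SYMPTOM_TERMS.items():
            match = re.search(r"\b" + re.escape(phrase) + r"\b", sentence)
            if not match:
                continue
            # Treat the symptom as absent if a negation word appears
            # within the three words immediately before the phrase.
            preceding = sentence[: match.start()].split()[-3:]
            findings[feature] = not any(word in NEGATIONS for word in preceding)
    return findings


# Invented example of provider dictation.
note = (
    "Patient reports chest pain and shortness of breath since this morning. "
    "Denies fever. Mild cough."
)
features = extract_symptoms(note)
print(features)
# {'chest_pain': True, 'shortness_of_breath': True, 'fever': False, 'cough': True}
```

Once the note has been reduced to flags like these, it can sit alongside structured fields as input to the clustering and regression approaches discussed earlier.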

