Analytics Theories for Medical Diagnosis

Khivsara (2018) presents a number of basic analytics theories. Of these, I believe four are most relevant for medical diagnosis: clustering, association rules, regression, and textual analysis.

Association rule mining is nothing more than finding causal structures and patterns among objects in order to establish a logical relationship between them. It is a machine learning analog to what doctors do on a regular basis when making diagnoses. Picture an emergency room triage area, where patients are sorted and prioritized based on symptoms. In place of a nurse, perhaps on particularly busy nights, a self-service kiosk would let patients select all the symptoms they are exhibiting. Those symptoms would generate potential diagnoses, the severity of which would determine each patient’s priority in the night’s queue.
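To make the kiosk idea concrete, here is a minimal sketch of how the strength of a rule such as "these symptoms imply this diagnosis" could be scored with support and confidence, the two standard association rule measures. The triage records, symptom names, and the `rule_stats` helper are all hypothetical and invented for illustration; a real system would mine rules from a large clinical data set.

```python
# Hypothetical triage records: (set of reported symptoms, eventual diagnosis).
records = [
    ({"chest pain", "shortness of breath"}, "cardiac event"),
    ({"chest pain", "shortness of breath", "nausea"}, "cardiac event"),
    ({"chest pain", "cough"}, "pneumonia"),
    ({"fever", "cough"}, "pneumonia"),
    ({"fever", "cough", "shortness of breath"}, "pneumonia"),
    ({"nausea", "fever"}, "gastroenteritis"),
]


def rule_stats(antecedent, diagnosis, records):
    """Return (support, confidence) for the rule: antecedent symptoms -> diagnosis."""
    antecedent = set(antecedent)
    # Records in which every antecedent symptom was reported.
    n_antecedent = sum(1 for symptoms, _ in records if antecedent <= symptoms)
    # Of those, the records that also ended in the given diagnosis.
    n_both = sum(
        1 for symptoms, dx in records if antecedent <= symptoms and dx == diagnosis
    )
    support = n_both / len(records)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


support, confidence = rule_stats({"chest pain"}, "cardiac event", records)
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

A kiosk built this way could rank candidate diagnoses by confidence and then order the queue by the severity of the top-ranked candidates.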

Moving a step beyond simple associations, let us examine clustering. Assume two risk factors for chronic disease (e.g., unhealthy diet and tobacco use) were quantified for a population of patients and plotted on a two-axis graph, with each point representing one individual. Rather than being evenly dispersed across the graph, the data points would typically fall into two or more groupings, depending upon the population. K-means clustering would assign those data points (the individuals) to different risk groups according to where they fall on the chart. It is most useful in healthcare applications where similarities between patients must be quantified and cohorts established.
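The two-risk-factor picture can be sketched in a few lines of plain Python. This is a bare-bones version of the standard k-means (Lloyd's) loop, not a production implementation: the patient scores are made up, the 0 to 10 scoring scale is an assumption, and seeding the centroids with the first k points is a simplification chosen only to keep the example deterministic.

```python
import math

# Hypothetical (diet risk score, tobacco use score) pairs, each on a 0-10 scale.
patients = [(1.0, 0.5), (8.0, 9.0), (1.5, 1.0), (2.0, 0.0), (7.5, 8.0), (9.0, 8.5)]


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points seed the centroids (an assumption)."""
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


labels, centroids = kmeans(patients, k=2)
print(labels)
print(centroids)
```

Each label identifies a risk cohort, and each centroid summarizes the "typical" patient in that cohort.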

Going a step further, regression puts quantitative measures on the relationship between sets of variables and uses that relationship to predict values. In healthcare, a common use of regression is predicting healthcare costs. For diagnosis, logistic regression in particular can be helpful, because it estimates the probability of a condition from a number of known factors. Imagine a known regression equation for predicting diabetes risk based on multiple input variables.
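Such an equation might look like the sketch below. The form is the standard logistic model, where the probability is 1 / (1 + e^(-z)) and z is a weighted sum of the inputs. The intercept, the coefficients, and the choice of input variables are hypothetical values picked for illustration; they do not come from any fitted or published model and must not be used clinically.

```python
import math

# Hypothetical coefficients for illustration only; not from a fitted model.
INTERCEPT = -8.0
COEFFICIENTS = {
    "age_years": 0.04,
    "bmi": 0.10,
    "fasting_glucose_mg_dl": 0.03,
}


def diabetes_risk(patient):
    """Return a probability between 0 and 1 from the logistic equation."""
    z = INTERCEPT + sum(COEFFICIENTS[name] * patient[name] for name in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-z))


patient = {"age_years": 50, "bmi": 30, "fasting_glucose_mg_dl": 100}
print(f"estimated risk: {diabetes_risk(patient):.2f}")
```

Because each coefficient here is positive, raising any one input while holding the others fixed raises the estimated risk, which is the kind of quantified relationship regression is meant to expose.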

Finally, let us examine textual analysis. The other three theories mentioned here rely on structured data. However, structured data is only a fraction of the data collected when a patient sees a provider. The ability to utilize the unstructured data, rife with context and nuance, is perhaps the biggest untapped potential in healthcare analytics. The confluence of textual analysis and natural language processing (NLP) allows unstructured data from sources such as patient records and provider dictation to become part of the picture in predictive modeling and coexist with structured data.
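As a rough illustration of turning dictation into structured fields, the sketch below scans a note for known symptom terms and records whether each mention is affirmed or negated. The vocabulary, the negation cue list, and the `extract_symptoms` helper are hypothetical, and the negation rule (any cue word earlier in the same sentence) is deliberately crude; real clinical NLP handles context far more carefully.

```python
import re

# Hypothetical vocabulary and negation cues, chosen only for this example.
SYMPTOM_TERMS = ["chest pain", "shortness of breath", "fever", "nausea"]
NEGATIONS = ("denies", "no", "without")


def extract_symptoms(note):
    """Map each mentioned symptom term to True (affirmed) or False (negated)."""
    findings = {}
    # Split the note into rough sentences on periods and semicolons.
    for sentence in re.split(r"[.;]\s*", note.lower()):
        for term in SYMPTOM_TERMS:
            position = sentence.find(term)
            if position == -1:
                continue
            # A negation cue anywhere before the term in the sentence negates it.
            preceding_words = re.findall(r"[a-z]+", sentence[:position])
            negated = any(word in NEGATIONS for word in preceding_words)
            findings[term] = not negated
    return findings


note = "Patient reports chest pain and nausea. Denies fever; no shortness of breath."
print(extract_symptoms(note))
```

The resulting dictionary is structured data, so it can sit alongside lab values and demographics as input to the clustering and regression methods described earlier.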


EMC Services. (2018). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. Retrieved from

Healthcare.AI. (2017). Step by step to K-Means clustering. Retrieved from

HealthCatalyst. (2019). How to use text analytics in healthcare to improve outcomes. Retrieved from

Kulkarni, A. R., & Mundhe, S. D. (2017). Data mining technique: An implementation of association rule mining in healthcare. International Advanced Research Journal in Science, Engineering and Technology, 4(7), 62-65.

World Health Organization. (2005). Chronic diseases and their common risk factors. Retrieved from

Machine Learning: Supervised and Unsupervised

Supervised learning typically takes the form of classification or regression. We know the input and output variables, and try to make sense of the relationships between the two. Tembhurkar, Tugnayat, & Nagdive (2014) refer to this as Descriptive mining. Common methods include decision trees, the k-nearest neighbors (kNN) algorithm, regression, and discriminant analysis. The methods are dependent upon the type of output variable: continuous outputs call for regression methods, while discrete outputs call for classification methods.

For example, a human resources division in a large multinational company wants to determine what factors have contributed to employee attrition over the past two years. A decision tree methodology can produce a simple “if-then” map of what attributes combine and result in a separated employee. An example tree might point out that a male employee over the age of 45, working in Division X, who commutes more than 25 miles from home, has a manager 10 years or more his junior, and has been in the same unit for more than seven years is a prime candidate for attrition. Although many of the variables are continuous, a decision tree method makes the data manageable and actionable for human resources division use.
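The "if-then" map that a decision tree produces can be read as nested conditions. The sketch below hand-codes the single path described in the example rather than learning it from data; the field names (`sex`, `age`, `division`, `commute_miles`, `manager_age`, `years_in_unit`) are hypothetical, and a trained tree would choose its own splits and thresholds.

```python
def is_attrition_risk(employee):
    """Follow one root-to-leaf path of the example tree; every test must pass."""
    if employee["sex"] != "male":
        return False
    if employee["age"] <= 45:
        return False
    if employee["division"] != "X":
        return False
    if employee["commute_miles"] <= 25:
        return False
    # Manager is at least 10 years younger than the employee.
    if employee["age"] - employee["manager_age"] < 10:
        return False
    if employee["years_in_unit"] <= 7:
        return False
    return True


employee = {
    "sex": "male",
    "age": 52,
    "division": "X",
    "commute_miles": 30,
    "manager_age": 38,
    "years_in_unit": 9,
}
print(is_attrition_risk(employee))
```

Each `if` corresponds to one split in the tree, which is why continuous inputs such as age and commute distance become simple threshold checks.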

Unsupervised learning usually takes the form of clustering or association. The output variables are not known, and we rely on the system to make sense of the data with no a priori knowledge. Tembhurkar et al. (2014) refer to this as Prescriptive mining. Common methods include neural networks, anomaly detection, k-means clustering, and principal components analysis. The methods are dependent upon the type of data input: continuous variables generally suit clustering methods such as k-means, while discrete (categorical) variables generally suit association methods.

For example, a multi-level marketing company has a number of data points on its associates: units sold, associates recruited, years in the program, rewards program tier, et cetera. They know the associates can be grouped into performance categories akin to novice and expert but are unclear on both how many categories to look at and what factors are important. Principal components analysis and k-means clustering can reveal how the associates differentiate themselves based on the available variables and suggest an appropriate number of categories within which to classify them.
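One common way to let the data suggest the number of categories is the "elbow" heuristic: run k-means for several values of k and look for the point where the within-cluster sum of squares stops dropping sharply. The sketch below covers only that step and omits principal components analysis. The associate data, the `kmeans_inertia` helper, and the farthest-point seeding are assumptions made to keep the example small and deterministic.

```python
import math

# Hypothetical (units sold, associates recruited) pairs, already on similar scales.
associates = [
    (1, 1), (2, 1), (1, 2),
    (10, 10), (11, 10), (10, 11),
    (20, 1), (21, 1), (20, 2),
]


def kmeans_inertia(points, k, iterations=20):
    """Run a plain k-means and return the within-cluster sum of squares."""
    # Farthest-point seeding: start at the first point, then repeatedly add
    # the point farthest from all centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Move each centroid to its members' mean; keep it if the cluster is empty.
        centroids = [
            tuple(sum(v) / len(members) for v in zip(*members))
            if members
            else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)


inertias = {k: kmeans_inertia(associates, k) for k in range(1, 5)}
for k, inertia in inertias.items():
    print(k, round(inertia, 2))
```

With this made-up data the sum of squares falls steeply up to k = 3 and barely moves afterward, which would point to three performance categories.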


Brownlee, J. (2016, September 22). Supervised and unsupervised machine learning algorithms. Retrieved from

Soni, D. (2018, March 22). Supervised vs. Unsupervised learning – towards data science. Retrieved from

Tembhurkar, M. P., Tugnayat, R. M., & Nagdive, A. S. (2014). Overview on data mining schemes to design business intelligence framework for mobile technology. International Journal of Advanced Research in Computer Science, 5(8).