MetroMaps and T-Cubes: Beyond Gantt Charts

Martínez, Dolado, & Presedo (2010) discuss two visual modeling tools for software development and planning: the MetroMap and the T-Cube. The discussion comes in the context of greater attention being paid to the development process and its metrics, not just the software engineering itself. A concession the authors make early on is that Gantt charts are the prevalent method for project mapping in organizations, yet the research to date shows they are not effective for communication, especially when different groups are involved. Enter the MetroMap, a way of visualizing abstract, train-of-thought information that communicates both high-level and detailed information to viewers.

Image courtesy of Martínez, Dolado, & Presedo (2010)

T-Cube visualization is reminiscent of a Rubik’s Cube, utilizing the three-dimensional nature of a physical cube, the individual cubes making up the whole, and the facets (colors) on each individual cube; these correspond to tasks and their attributes. The authors illustrate these concepts with a specific software project represented in the article. Because tasks and attributes are recorded independently, they can be grouped by workgroup, type of task, module, or time.

These two methods have their strengths and weaknesses, both individually and together. At first glance, it is obvious that the MetroMap can represent many indicators at once while the T-Cube can only show one at a time. MetroMap uses a variety of icons and styles to represent information while the T-Cube uses traditional treemaps. The authors size up the tools in a simple comparison table, noting that MetroMap generally has the edge on viewing a lot of information at once.

Features and benefits are great, but how does actual use differ? Is one easier than the other in practice? The authors examined a shortest-path route to accomplish the same task in both tools and found that the MetroMap was more efficient in multiple scenarios; in all cases its actions were more basic and straightforward. Overall, either tool is more informative and effective than Gantt charts. Access to information and the ability to understand it are paramount in any planning and development exercise. These are two tools that better enable that.

Reference

Martínez, A., Dolado, J., & Presedo, C. (2010). Software project visualization using task oriented metaphors. Journal of Software Engineering and Applications, 3, 1015-1026.

Delphi Methods and Ensemble Classifiers

Ensemble classifiers are a bit like the Delphi methodology, in that they utilize multiple models (or experts) to arrive at a result that offers better predictive performance than a single model would (Dalkey & Helmer, 1963; Acharya, 2019). These are independent or parallel classifiers, implementing a majority vote amongst themselves much like the Delphi method. A variety of individual classifiers can be used, including logistic regression, nearest neighbor methods, decision trees, Bayesian analysis, or discriminant analysis. According to Dietterich (2002), ensemble classification overcomes three major problems: Statistical, Computational, and Representational. The Statistical problem arises when the hypothesis space is too large for the available data, producing multiple accurate hypotheses of which only one is chosen. The Computational problem is the algorithm’s inability to guarantee the best hypothesis. The Representational problem arises when the hypothesis space contains no good approximation of the target function.
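
To make the committee idea concrete, here is a minimal sketch using scikit-learn's VotingClassifier with hard (majority) voting over three of the base learners mentioned above; the dataset is synthetic and stands in for real features and labels.

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
# Synthetic stand-in data; replace with real features and labels.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Three independent "experts"; each votes, and the majority label wins.
ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("tree", DecisionTreeClassifier(max_depth=5)),
    ],
    voting="hard",  # hard voting = simple majority, akin to a Delphi panel consensus
)
ensemble.fit(X_train, y_train)
print("ensemble accuracy:", ensemble.score(X_test, y_test))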

Ensemble methods include bagging, boosting, and stacking. Bagging is considered a parallel or independent method; boosting and stacking are both sequential or dependent methods. Parallel methods are used when the independence between the base classifiers is advantageous, including error reduction; sequential methods are used when dependence between the classifiers is advantageous, such as correcting mislabeled examples or converting weak learners (Smolyakov, 2017).
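
A quick sketch of how the three families look in scikit-learn, assuming synthetic data and illustrative settings; the point is the contrast between the parallel and sequential methods, not the scores themselves.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
models = {
    # Parallel/independent: each tree sees a bootstrap sample; votes are combined.
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0),
    # Sequential/dependent: each new learner focuses on examples earlier learners missed.
    "boosting": AdaBoostClassifier(n_estimators=50, random_state=0),
    # Sequential/dependent: a meta-learner combines the base learners' predictions.
    "stacking": StackingClassifier(
        estimators=[("tree", DecisionTreeClassifier(max_depth=3)),
                    ("logreg", LogisticRegression(max_iter=1000))],
        final_estimator=LogisticRegression(max_iter=1000)),
}
for name, model in models.items():
    print(name, "mean accuracy:", cross_val_score(model, X, y, cv=5).mean())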

Random forests are essentially an ensemble of decision trees: they produce results from multiple trees and aggregate them, as in bagging (Liberman, 2017). The trees train on different subsets of the data and features, both randomly selected. Bias and variance errors are mitigated by way of low correlation between the models. Again, like ensemble classifiers and even Delphi-style decision-making, learners operating as a committee should outperform any of the individual learners.
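
And the random forest itself, as a sketch: bagged trees plus random feature selection at each split (max_features), which is what keeps the individual trees weakly correlated. Data and settings are again illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
# max_features="sqrt" is the per-split feature sampling that keeps trees weakly correlated.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
print("random forest mean accuracy:", cross_val_score(forest, X, y, cv=5).mean())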

References

Acharya, T. (2019). Advanced ensemble classifiers. Retrieved from https://towardsdatascience.com/advanced-ensemble-classifiers-8d7372e74e40

Connolly, T. & Begg, C. (2015).  Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). London, UK: Pearson. 

Dalkey, N., & Helmer, O. (1963). An experimental application of the Delphi method to the use of experts. Management Science, 9(3), 458-467.

Dietterich, T. G. (2000). Ensemble methods in machine learning. International workshop on multiple classifier systems (pp. 1-15). Springer Berlin Heidelberg.

Dietterich, T. G. (2002). Ensemble Learning. In The Handbook of Brain Theory and Neural Networks, Second Edition, (M.A. Arbib, Ed.), (pp. 405-408). Cambridge, MA: The MIT Press.

Liberman, N. (2017). Decision trees and random forests. Retrieved from https://towardsdatascience.com/decision-trees-and-random-forests-df0c3123f991

Smolyakov, V. (2017). Ensemble learning to improve machine learning results. Retrieved from https://blog.statsbot.co/ensemble-learning-d1dcd548e936

Tembhurkar, M. P., Tugnayat, R. M., & Nagdive, A. S. (2014). Overview on data mining schemes to design business intelligence framework for mobile technology. International Journal of Advanced Research in Computer Science, 5(8).

Thick Data and Big Data

In March 1968, Robert F. Kennedy said of the Gross National Product: “It measures neither our wit nor our courage, neither our wisdom nor our learning, neither our compassion nor our devotion to our country; it measures everything, in short, except that which makes life worthwhile.”

“What is measurable is not always what is valuable.” Wang (2016b) paraphrased Kennedy this way, referring to the index’s inability to measure the qualitative human condition. With the exponential increase in attention to Big Data as of late, the focus on speed and scale has left out things that are “sticky” or “difficult to quantify” (Wang, 2016b). This disparity reflects the traditional gap between qualitative and quantitative research. In fact, Wang found that referring to the qualitative efforts in traditional terms (e.g., ethnography) was met with enough skepticism and pushback that a new, data-friendly term had to emerge: thus thick data was born.

Image courtesy of Tricia Wang (2016b)

At first glance, thick data is not attractive by big data’s traditional standards. It is inefficient, does not scale up, and is usually not reproducible. However, when combined with big data, it fills the gaps that the quantitative measures leave open. While big data can identify patterns, it cannot explain why those patterns exist. If big data can go broad, thick data can go deep. Thick data relies on human learning and complements the findings from machine learning that big data cannot adequately contextualize. It shows the social context of observed patterns and is able to handle irreproducible complexity. It is the qualitative complement to quantitative data, the color and nuance in a black-and-white picture.

Forces against the adoption of thick data typically stem from bias against qualitative data. Again, it is messy: inefficient, sticky, complicated, and nuanced. Most of the big data world values what can be quantified and the relationships that can be mapped. As Wang (2016a) notes, quantifying is addictive, and it can be easy to throw out data that doesn’t fit a numerical value. It isn’t a zero-sum game, however; big data and thick data complement each other. But “silo culture”—the same phenomenon that disrupts data integration and wreaks havoc across enterprise data environments—threatens the symbiosis between the two (Riskope, 2017). While thick data is not an innovation in the same sense as cutting-edge artificial intelligence or new developments in IoT technology, it is an innovation in how we think about the world around us and what is important when studying that world.

References

Riskope. (2017). Big data or thick data: Two faces of a coin.  Retrieved from https://www.riskope.com/2017/05/24/big-data-or-thick-data-two-faces-of-a-coin/

Wang, T. (2016a). The human insights missing from big data.  Retrieved from https://www.ted.com/talks/tricia_wang_the_human_insights_missing_from_big_data

Wang, T. (2016b). Why big data needs thick data.  Retrieved from https://medium.com/ethnography-matters/why-big-data-needs-thick-data-b4b3e75e3d7

XML and Standardization

XML is a true double-edged sword in the data analytics world, with advantages and disadvantages not unlike those of relational databases or NoSQL. The global advantages and disadvantages inherent in XML are just as applicable in the healthcare field. For example, consider the flexibility of user-created tags on the fly: something that is both an advantage (ease of use, compatibility, expandability, et cetera) and a disadvantage (lack of standardization, potential incompatibility with user interfaces, et cetera) in the global sphere, and equally so in healthcare settings. Consider an electronic health record (EHR): different providers and points of care may add to the EHR without having to conform to the standards of other providers; that is, data from a rheumatologist may be added to the patient record with the same ease as data from a general practitioner or psychologist. The portability of the XML format means that the record can be exchanged amongst providers or networks as long as the recipient can read XML. However, this versatility comes at a price: the lack of standardization means that all tags and fields in any given record must be known prior to querying, which can be quite time-consuming.
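
As a rough illustration of that trade-off, consider a toy patient record in Python; the tag names are invented, but the point stands: adding a new provider's section is trivial, while querying still requires knowing each provider's tags in advance.

import xml.etree.ElementTree as ET
# Invented tag names: two providers contribute their own sections to one record.
record = """
<patient id="12345">
  <generalPractice><bloodPressure>120/80</bloodPressure></generalPractice>
  <rheumatology><rheumatoidFactor units="IU/mL">14</rheumatoidFactor></rheumatology>
</patient>
"""
root = ET.fromstring(record.strip())
# Adding a new section needs no shared schema, but querying requires knowing the tags.
known_paths = ["generalPractice/bloodPressure", "rheumatology/rheumatoidFactor"]
for path in known_paths:
    element = root.find(path)
    if element is not None:
        print(path, "=", element.text)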

For an analogy from a different industry, think of a consumer packaged goods (CPG) manufacturer. The CPG has its own internal master data schemas in relational databases and reserves XML for its reseller data interface, so that the different wholesalers and retailers in its network can share sales data back to the CPG in a common format. While all participants use a handful of core attributes (e.g., manufacturer SKU and long description), each wholesaler and retailer also has its own set of proprietary attributes. XML allows the different participants to feed data back to the CPG without conforming to a schema imposed across the entire retail network, and allows the CPG to glean the requisite data shared amongst all participants. However, the process requires setting up the known tags for each new participant, so that the CPG knows ahead of time which specific tags are relevant to each one.
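
A hypothetical sketch of that feed-processing step: each partner's XML shares a few core tags (sku and description here, both invented names) alongside proprietary ones, and the CPG keeps only what is common. Any partner-specific tags it does care about would still have to be registered ahead of time.

import xml.etree.ElementTree as ET
CORE_TAGS = ("sku", "description")  # attributes every participant shares (invented names)
feeds = {
    "wholesaler_a": "<sale><sku>AB-100</sku><description>Widget</description>"
                    "<regionCode>NE</regionCode></sale>",
    "retailer_b": "<sale><sku>AB-100</sku><description>Widget</description>"
                  "<loyaltyTier>gold</loyaltyTier></sale>",
}
for partner, xml_text in feeds.items():
    sale = ET.fromstring(xml_text)
    # Keep the shared core attributes; proprietary tags are simply ignored here.
    core = {tag: sale.findtext(tag) for tag in CORE_TAGS}
    print(partner, core)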

The Role of Data Brokers in Healthcare

In courses I’ve led before, we looked at the disjointed data privacy regulations in the United States and current events in data privacy (e.g., Facebook, Cambridge Analytica, personal genomics testing). The overall issue recurs in any setting: giving a single entity a large amount of data inevitably raises questions of ethics, privacy, security, and motivation.

Where healthcare data brokers are concerned, the stated goals differ by type of data. Where direct patient interaction with the data is concerned, the goal is to give patients “more control over the data” (Klugman, 2018) and perhaps bypass the clunky patient portals set up by providers. For data that is not personally identifiable, the goals can be much less altruistic, such as competing in a multi-billion-dollar market (Patientory, 2018) or contributing to health insurance discrimination (Butler, 2018). I am not naïve enough to think that all exercises in healthcare should be altruistic, and the concept of insurance has a certain modicum of discrimination at its core; however, weaponizing the data to aid in unfair practices is beyond the pale.


From a data engineering perspective, a broker in the truest sense of the word may act as a clearinghouse between providers with disparate systems, enabling the seamless transfer of patient data between those providers without putting the burden of ETL on either of them. Whereas XML formatting and other portability developments have allowed providers using different EHR systems to port patient data, a data brokerage would act as an independent party working on the patient’s behalf and handling the technical details of integrating their data across all providers and interested parties. Beyond holding the data, the broker would be responsible for ensuring each provider and biller has access to the same single source of truth on that particular patient.

This would, of course, require a data warehouse of sorts where the single source is held, and it puts the questions of security, privacy, transparency, and ethics on the broker. The broker has to make money to survive, and a business model must emerge, so it would not be immune to market forces. The aggregation of so much patient data in one place would be too great a temptation to leave unmonetized as de-identified commodities, so a secondary market would emerge and lead to the same issues cited above. Call me pessimistic, but the best predictor of future actions is past behavior, and thus far the companies holding massive amounts of data about our lives either can’t keep it secure from breaches or are perfectly happy selling it while turning a blind eye to what is done with it.

References

Butler, M. (2018). Data brokers and health insurer partnerships could result in insurance discrimination. Retrieved from https://journal.ahima.org/2018/07/24/data-brokers-and-health-insurer-partnerships-could-result-in-insurance-discrimination/

Klugman, C. (2018). Hospitals selling patient records to data brokers: A violation of patient trust and autonomy. Retrieved from http://www.bioethics.net/2018/12/hospitals-selling-patient-records-to-data-brokers-a-violation-of-patient-trust-and-autonomy/

Patientory. (2018). Data brokers have access to your information, do you? Retrieved from https://medium.com/@patientory/data-brokers-have-access-to-your-health-information-do-you-562b0584e17e

Data-in-Motion or Data-at-Rest?

Reading the available material on data-in-motion reminds me of when I first read about data lakes versus data warehouses, or NoSQL versus SQL: the urgency around the new approach, and the supposed danger of the old one, are both overblown. Put simply, data-in-motion provides real-time insights. Most of our analytics efforts across data science apply to stored data, be it years, weeks, or hours old. Working with data-in-motion means not storing it prior to analyzing it, extracting insights as the data rolls in. It is one half of a dual effort that tells the whole picture: historical data provides insight into what happened and material to train future detection, while real-time data shows what is happening right now.
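
As a toy illustration of the "analyze before you store" idea, here is a sketch that maintains a rolling average over a simulated stream and flags outliers as they arrive, with no database write required first; the generator stands in for a real feed such as a sensor or message queue.

import random
from collections import deque
def readings():
    # Stand-in for a real feed (sensor, log stream, message queue).
    while True:
        yield 20 + random.gauss(0, 1)
window = deque(maxlen=60)  # keep only the most recent 60 readings in memory
for i, value in zip(range(300), readings()):
    window.append(value)
    rolling_avg = sum(window) / len(window)
    if value > rolling_avg + 3:  # flag outliers in flight, not after the fact
        print(f"alert at reading {i}: {value:.2f} vs rolling average {rolling_avg:.2f}")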

Churchward (2018) makes a fair point here: once data is stored, it isn’t real-time by definition. But taking that argument to its logical extreme by asking whether we would like to make decisions on data three months old is a stretch. While it is true that matters such as security and intrusion detection demand real-time detection, categorically dismissing data-at-rest analytics is reckless; it vilifies practices that are the foundation of any comprehensive analytics strategy. Both data-at-rest and data-in-motion are valuable drivers of any business intelligence effort that seeks to paint a total picture of a phenomenon.

There are, of course, less frantic cases to be made for data-in-motion. Patel (2018) illustrates a critical situation on an oil drilling rig, in which information even a few minutes old can be life-threatening. In this case, written for Spotfire X, there may be some conflation of monitoring and analytics: the dashboard shown on the website and the written scenario paint a picture more of monitoring and dashboarding than of the sort of analytics we would deploy Spark or Kafka for. I don’t need a lot of processing power to tell me that a temperature sensor’s readings are increasing.

Performing real-time analytics on data-in-motion is an intensive task, requiring considerable computing resources. Scalable solutions such as Spark or Kafka are available but may eventually hit a wall. Providers such as Logtrust (2017) differentiate themselves by pointing out the potential shortfalls of those solutions and offering a single platform for both data-in-motion and data-at-rest analytics.

References

Churchward, G. (2018). Why “true” real-time data matters: The risks of processing data-at-rest rather than data-in-motion. Retrieved from https://insidebigdata.com/2018/03/22/true-real-time-data-matters-risks-processing-data-rest-rather-data-motion/

Logtrust. (2017). Real-time IoT big data-in-motion analytics.

Patel, M. (2018). A new era of analytics: Connect and visually analyze data in motion. Retrieved from https://www.tibco.com/blog/2018/12/17/a-new-era-of-analytics-connect-and-visually-analyze-data-in-motion/

Challenges of Health Informatics in the Cloud

Alghatani and Rezgui (2019) present a framework for remote patient monitoring via cloud architecture. The primary intention is to reduce disparate data sources and the walls between various data silos, improving cost effectiveness, response time, and quality of care. The cloud architecture involves the database itself, user interface(s), and artificial intelligence. This cloud is used by four primary groups: patients, hospitals, insurance companies, and controllers (system stewards).

The authors outline a number of advantages here. Telemedicine can be a great thing but has a number of barriers to overcome, not the least of which are cost, culture, political environment, and infrastructure. The cloud architecture seeks to mitigate the cost and infrastructure issues. IT resources can be extended dynamically based on need and the decentralized nature of the system allows for better scalability, flexibility, and reliability.

There are a number of challenges to be considered. The authors highlight seven:

  1. Security
  2. Data management
  3. Governance
  4. Control
  5. Reliability
  6. Availability
  7. Business continuity

An extensive discussion of data collection challenges is presented, outlining a number of possible methods for collection and synchronization. There must be an assumption that no device on this architecture will maintain constant contact with the cloud, so consistency models must be taken into consideration. One option is for each device to maintain local storage and upload to the cloud once a stable connection is available. Another option is a wireless network of its own, much like the Whispernet used by early Amazon Kindle devices. A third and final option, and the authors’ proposal, is the use of fog computing as a layer between these devices and the cloud.
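
A minimal sketch of the first option, store locally and sync when connected, is below; is_connected() and upload() are placeholders for whatever transport the real architecture would use, such as a fog node or cloud endpoint.

import queue
local_buffer = queue.Queue()
def is_connected() -> bool:
    # Placeholder: in practice, probe the fog node or cloud gateway.
    return True
def upload(batch):
    # Placeholder: push the batch to the fog layer or cloud endpoint.
    print(f"uploaded {len(batch)} readings")
def record(reading):
    # Always capture locally first; connectivity is never assumed.
    local_buffer.put(reading)
    if is_connected():
        flush()
def flush():
    batch = []
    while not local_buffer.empty():
        batch.append(local_buffer.get())
    if batch:
        upload(batch)
for beat in [72, 75, 74, 78]:
    record({"heart_rate": beat})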

Privacy is always an issue, and cloud architecture muddies the waters a bit, as there is no locked-down, on-premises server holding the personally identifiable information. Banks and hospitals have typically been the slowest to adopt cloud computing, in my experience. As Alghatani and Rezgui (2019) note, governance and control are concerns here. The Health Insurance Portability and Accountability Act (HIPAA) requires confidentiality for all individually identifiable health information; in 2013, these protections were extended to genetic information in line with the Genetic Information Nondiscrimination Act (GINA). While the rules prohibit the use of genetic information for underwriting purposes, there is no restriction on the sharing or use of genetic information that has been de-identified (National Human Genome Research Institute, 2015). De-identification is not entirely foolproof, however; there are cases in which the data can be re-identified (Rosenbaum, 2018).

References

Alghatani, K., & Rezgui, A. (2019). A cloud-based intelligent remote patient monitoring architecture. Paper presented at the International Conference on Health Informatics & Medical Systems, HIMS’19, Las Vegas, NV.

National Human Genome Research Institute. (2015). Privacy in genomics. Retrieved from https://www.genome.gov/about-genomics/policy-issues/Privacy

Rosenbaum, E. (2018). Five biggest risks of sharing your DNA with consumer genetic-testing companies. Retrieved from https://www.cnbc.com/2018/06/16/5-biggest-risks-of-sharing-dna-with-consumer-genetic-testing-companies.html

Single-Node Hadoop Installation on Ubuntu 16.04

When embarking on a new build for almost anything, I tend to use online how-to guides published by other bloggers who have encountered the specific issues involved. I recently completed a single-node Hadoop installation on a Linode Ubuntu box, after several unsuccessful attempts at a three-node setup, and am posting my steps here in case they are of help to someone attempting the same.

I relied heavily on Parth Goel’s work and this guide follows it nearly verbatim.

Part 1: Provision the server, harden, and install Java

In my case, I have a number of Linode boxes running already and added one more, running Ubuntu 16.04. I also followed the guide for securing a Linode server.

After provisioning and booting up, I completed the following steps:

Set hostname in /etc/hosts (hadoop-master)
sudo nano /etc/hosts

Add the following line:

<your server ip> hadoop-master
Set hostname in /etc/hostname (hadoop-master)
sudo nano /etc/hostname

Replace the contents with the following line:

hadoop-master
Harden per Linode recommendations

Log in to hadoop-master as root.

adduser hduser
adduser hduser sudo
sudo addgroup hadoop
sudo usermod -a -G hadoop hduser

Create an SSH key pair on your local machine (OS X for me) and copy it to the Linode box. On OS X, after creating the key pair, run the following command:

ssh-copy-id -i <key name> hduser@<server ip>

Next, disable root login, change the SSH port, restrict SSH to IPv4, and disable password login. You would be surprised at how many brute-force attacks a server is subjected to every minute. If you wind up locking yourself out, there is emergency LISH access.

sudo nano /etc/ssh/sshd_config

In the config file, you’ll change a few lines to look like this:

Port 2222
PermitRootLogin no
PasswordAuthentication no
UsePAM no

Save and exit the text editor. One last command to restrict SSH to IPv4, then restart sshd:

echo 'AddressFamily inet' | sudo tee -a /etc/ssh/sshd_config
sudo systemctl restart sshd

Next, install UFW. In my case, I have a static IP I can connect from, so I whitelisted it. Nothing else can hit that server. I suppose I could have avoided all the hardening since I was going to only whitelist from one IP address, but better safe than sorry. The last thing I want is my little sandbox being used for some DoS bot attack.

sudo apt-get install ufw
sudo ufw allow from <vpn ip>
sudo ufw enable
Install Java and reboot

This was one of the biggest issues I ran into with various online how-to guides. It was nearly impossible to get the combination of Hadoop, Ubuntu, and Java versions right, particularly when many guides relied on the Oracle JDK. This step uses the default JDK and doesn’t mess around with custom packages.

sudo apt-get update
sudo apt-get install default-jdk
sudo reboot

Part 2: Create localhost SSH access for Hadoop and install

Once this server boots up again, you will log in as hduser. The next step will create a key pair on hadoop-master. Leave the filename and prompts blank, otherwise you’ll have more work ahead of you. Note that we are adding the SSH port in our SSH command, as we changed it earlier, and will have to add this to the Hadoop environment variables.

ssh-keygen -t rsa 

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys 

chmod 0600 ~/.ssh/authorized_keys 

ssh -p 2222 localhost

Once you’ve confirmed that works, exit the SSH session and get back to your hadoop-master hduser command line. Now it’s time to install Hadoop.

cd

wget http://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz

tar xvzf hadoop-2.7.3.tar.gz

sudo mkdir -p /usr/local/hadoop

cd hadoop-2.7.3/

sudo mv * /usr/local/hadoop

sudo chown -R hduser:hadoop /usr/local/hadoop

Part 3: Hadoop configuration

Variables configuration

Here, you have to do some checking to make sure your Java library is what is expected.

update-alternatives --config java

In this case we are looking for /usr/lib/jvm/java-8-openjdk-amd64.

Edit your bashrc file first.

sudo nano ~/.bashrc

Add the following:

#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
#HADOOP VARIABLES END

Now edit your hadoop-env file.

sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh

Edit the following line to look like this:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Add this line:

export HADOOP_SSH_OPTS="-p 2222"
Hadoop XML configuration files
core-site.xml
sudo mkdir -p /app/hadoop/tmp

sudo chown hduser:hadoop /app/hadoop/tmp

sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml

Add the following.

<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop-master:54310</value>
<description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
mapred-site.xml
cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml

sudo nano /usr/local/hadoop/etc/hadoop/mapred-site.xml

Add the following.

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>hadoop-master:54311</value>
<description> The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.
</description>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
hdfs-site.xml
sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode

sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode

sudo chown -R hduser:hadoop /usr/local/hadoop_store

sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml

Add the following.

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.
</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
</configuration>
yarn-site.xml
sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml

Add the following.

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>

Part 4: Reboot and fire it up!

Reboot the server one more time (sudo reboot), log back in as hduser so the new environment variables take effect, and format HDFS.

hdfs namenode -format

Start service.

start-dfs.sh

start-yarn.sh

Test with a simple job.

hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 2 5

You will want to visit http://<server-ip>:50070/ and http://<server-ip>:8088 to see the consoles.

MongoDB and CouchDB in Healthcare Applications


Both MongoDB and CouchDB are regarded in similar fashion—as they are document databases—and have been used widely in healthcare applications. The similarity to relational database systems usually allows for an easier learning curve and integration with in-place systems. They have been tested against XML and relational databases (e.g., Freire et al., 2016) and used in conjunction with them (e.g., Groce, 2015).

With respect to electronic health record (EHR) management, Freire et al. (2016) benchmarked NoSQL approaches with millions of EHRs, including both administrative and epidemiological data points. It was noted that CouchBase is specifically designed for distributed computing, which is a strength in this case. A number of datasets were set up for benchmarking, and specific queries were written in each database’s query language to answer health-specific questions. Response times varied widely, but the XML-based solutions consistently underperformed both MySQL and CouchBase. Against MySQL, CouchBase delivered faster response times. Despite its space and indexing-time requirements, CouchBase emerged as the top performer in the test.

MongoDB may be used to supplement and scale out SQL-based deployments, as outlined by Groce (2015). In this case, MongoDB was used to cut down on latency and performance overhead at Doctoralia, a company that connects patients with medical providers. Prior to the deployment, a single SQL server in one geographic location handled all of the load. As the organization expanded to different countries and data volume increased, it became clear that a scaled approach was needed.

MongoDB allowed Doctoralia to deploy servers in each geographic location (reducing geographic latency) and to precompute queries and aggregates on those servers (reducing processing latency). This precompute process also took much of the load off the central SQL server. The distributed framework allows Doctoralia to scale hardware up or down as demand requires, and replication provides high availability with little to no downtime visible to end users. Deploying a new server to handle new load takes a matter of minutes. Doctoralia measures the MongoDB deployment in terms of speed and availability and considers it a great success.
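
As a rough sketch of the "precompute close to the user" idea, a regional MongoDB node could answer an aggregate query locally rather than shipping the work to a central SQL server; the connection string, database, and collection names below are hypothetical.

from pymongo import MongoClient
# Hypothetical regional node, database, and collection names.
client = MongoClient("mongodb://regional-node.example.com:27017")
db = client["bookings"]
# Count upcoming appointments per provider in this region only.
pipeline = [
    {"$match": {"region": "ES", "status": "scheduled"}},
    {"$group": {"_id": "$provider_id", "upcoming": {"$sum": 1}}},
    {"$sort": {"upcoming": -1}},
]
for row in db["appointments"].aggregate(pipeline):
    print(row["_id"], row["upcoming"])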

References

CouchBase (2017). NoSQL for healthcare. Retrieved from https://www.couchbase.com/solutions/nosql-for-healthcare

Freire, S. M., Teodoro, D., Wei-Kleiner, F., Sundvall, E., Karlsson, D., & Lambrix, P. (2016). Comparing the performance of NoSQL approaches for managing archetype-based electronic health record data. PLoS ONE, 11(3).

Groce, D. (2015). How MongoDB helped a healthcare firm scale horizontally. Retrieved from https://dzone.com/articles/leaf-in-the-wild-doctoralia-scales-patient-service

MongoDB. (2019). Healthcare. Retrieved from https://www.mongodb.com/industries/healthcare

Analytics Theories for Medical Diagnosis

EMC Services (2018) presents a number of basic analytics theories. Of these, I believe four are most relevant for medical diagnosis: clustering, association rules, regression, and textual analysis.


Association rules are nothing more than finding association structures and patterns between items in order to establish some sort of logical relationship. It is a machine-learning analog to what doctors do regularly in making diagnoses. Picture an emergency room triage area, where patients are sorted and prioritized based on symptoms. In place of a nurse, perhaps on particularly busy nights, a self-service kiosk would let patients select all the symptoms they are exhibiting; these symptoms would generate potential diagnoses, the severity of which would determine priority in the night’s queue.
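
A small illustration of association-rule thinking in plain Python: how often do two symptoms co-occur (support), and how often does the second appear given the first (confidence)? The triage records are invented.

from collections import Counter
from itertools import combinations
visits = [
    {"fever", "cough"},
    {"fever", "cough", "shortness of breath"},
    {"chest pain", "shortness of breath"},
    {"fever", "headache"},
    {"cough", "shortness of breath"},
]
item_counts = Counter()
pair_counts = Counter()
for visit in visits:
    item_counts.update(visit)
    pair_counts.update(combinations(sorted(visit), 2))
n = len(visits)
for (a, b), count in pair_counts.most_common(3):
    support = count / n                  # how often a and b appear together
    confidence = count / item_counts[a]  # how often b appears when a does
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")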

Moving a step beyond simple associations, let us examine clustering. Assume two risk factors for chronic disease (e.g., unhealthy diet and tobacco use) were quantified for a population of patients and plotted on a two-axis graph. A simple review of the graph would show individuals plotted along the spectra of diet and tobacco use. Rather than being evenly dispersed across the graph, the data points would fall into two or more groupings, depending upon the population. K-means clustering would classify those data points (the individuals) into different risk groups depending upon where they fall on the chart. K-means clustering is most useful in healthcare applications where similarities between patients must be quantified and cohorts established.
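
A sketch of that two-risk-factor example with scikit-learn's k-means, using made-up diet and tobacco scores:

import numpy as np
from sklearn.cluster import KMeans
# Columns: unhealthy-diet score, tobacco-use score (0-10, synthetic values).
patients = np.array([
    [1, 0], [2, 1], [1, 2], [2, 2],   # lower-risk group
    [8, 7], [9, 9], [7, 8], [8, 9],   # higher-risk group
])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(patients)
print("cluster assignments:", model.labels_)
print("cluster centers:", model.cluster_centers_)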

Going a step further and putting quantitative measures on the relationship between variables and predicted values, we have regression. Regression is all about quantifying the relationship between sets of variables and predicting outcomes. In healthcare, the most common use of regression relates to healthcare costs. As for diagnosis, logistic regression in particular can help by estimating the probability of a condition from a number of known factors. Imagine a known regression equation for predicting diabetes risk based on multiple input variables.
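
A sketch of logistic regression as a diagnostic aid, fitting on a handful of synthetic historical cases and returning a risk probability for a new patient; the feature choices are illustrative, not clinical guidance.

import numpy as np
from sklearn.linear_model import LogisticRegression
# Features: [BMI, fasting glucose]; label 1 = diabetes diagnosed (synthetic cases).
X = np.array([[22, 85], [24, 90], [27, 105], [31, 130], [35, 150], [29, 120]])
y = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression(max_iter=1000).fit(X, y)
new_patient = np.array([[30, 125]])
print("predicted diabetes risk:", model.predict_proba(new_patient)[0][1])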

Finally, let us examine textual analysis. The other three theories mentioned here rely on structured data. However, that structured data is only a fraction of the data collected when a patient sees a provider. The ability to utilize the unstructured data, rife with context and nuance, is perhaps the biggest untapped potential in healthcare analytics. The confluence of textual analysis and natural language processing (NLP) allows unstructured data from sources such as patient records and provider dictation to become part of the picture in predictive modeling and to coexist with structured data.
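
A minimal sketch of turning unstructured notes into features a model can use, via TF-IDF in scikit-learn; the notes are fabricated one-liners standing in for provider dictation.

from sklearn.feature_extraction.text import TfidfVectorizer
# Fabricated one-liners standing in for provider dictation or visit notes.
notes = [
    "patient reports persistent cough and mild fever",
    "follow-up for hypertension, blood pressure improved",
    "complains of joint pain and morning stiffness",
]
vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform(notes)
print(features.shape)  # (documents, vocabulary terms)
print(vectorizer.get_feature_names_out())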

References

EMC Services. (2018). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. Retrieved from https://bhavanakhivsara.files.wordpress.com/2018/06/data-science-and-big-data-analy-nieizv_book.pdf

Healthcare.AI. (2017). Step by step to K-Means clustering. Retrieved from https://healthcare.ai/step-step-k-means-clustering/

HealthCatalyst. (2019). How to use text analytics in healthcare to improve outcomes. Retrieved from https://www.healthcatalyst.com/how-to-use-text-analytics-in-healthcare-to-improve-outcomes

Kulkarni, A. R., & Mundhe, S. D. (2017). Data mining technique: An implementation of association rule mining in healthcare. International Advanced Research Journal in Science, Engineering and Technology, 4(7), 62-65.

World Health Organization. (2005). Chronic diseases and their common risk factors. Retrieved from https://www.who.int/chp/chronic_disease_report/media/Factsheet1.pdf