The Role of Data Brokers in Healthcare

In courses I’ve led before, we looked at the disjointed data privacy regulations in the United States and at current events in data privacy (e.g., Facebook, Cambridge Analytica, and personal genomics testing). The underlying issue recurs in any setting: giving a single entity a large amount of data inevitably raises questions of ethics, privacy, security, and motivation.

For healthcare data brokers, the stated goals differ by type of data. Where patients interact with their data directly, the goal is to give patients “more control over the data” (Klugman, 2018) and perhaps bypass the clunky patient portals set up by providers. Data that is not personally identifiable can serve much less altruistic goals, such as being a player in a multi-billion-dollar market (Patientory, 2018) or contributing to health insurance discrimination (Butler, 2018). I am not naïve enough to think that all exercises in healthcare should be altruistic, and the concept of insurance itself has a certain modicum of discrimination at its core; however, weaponizing the data to aid in unfair practices is beyond the pale.

From a data engineering perspective, a broker in the truest sense of the word may act as a clearinghouse between providers with disparate systems, enabling the seamless transfer of patient data between those providers without putting the burden of ETL (extract, transform, load) on either of them. XML formatting and other portability developments have allowed providers using different EHR systems to port patient data; a data brokerage would go further, acting as an independent party on the patient’s behalf and handling the technical details of integrating their data across all providers and interested parties. Beyond holding the data, the broker would be responsible for ensuring each provider and biller has access to the same single source of truth on that particular patient.
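
To make the clearinghouse idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the two provider export formats, the field names, the `Broker` class, and above all the assumption that both providers share a patient identifier (in practice, patient matching is a hard problem in its own right). It only illustrates how per-provider adapters could feed one canonical record without either provider writing ETL code for the other.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PatientRecord:
    """Canonical record the hypothetical broker holds for one patient."""
    patient_id: str
    name: str
    allergies: List[str] = field(default_factory=list)


# Each adapter maps one provider's export format to the canonical fields.
# Both provider formats are invented for illustration.
def from_provider_a(raw: dict) -> dict:
    return {
        "patient_id": raw["mrn"],
        "name": f'{raw["first"]} {raw["last"]}',
        "allergies": list(raw.get("allergy_list", [])),
    }


def from_provider_b(raw: dict) -> dict:
    # Provider B (hypothetical) sends allergies as a semicolon-separated string.
    parts = raw.get("allergies", "").split(";")
    return {
        "patient_id": raw["patientId"],
        "name": raw["fullName"],
        "allergies": [p.strip() for p in parts if p.strip()],
    }


class Broker:
    """Holds one canonical record per patient, fed by provider adapters."""

    def __init__(self) -> None:
        self._adapters: Dict[str, Callable[[dict], dict]] = {}
        self._store: Dict[str, PatientRecord] = {}

    def register(self, provider: str, adapter: Callable[[dict], dict]) -> None:
        self._adapters[provider] = adapter

    def ingest(self, provider: str, raw: dict) -> PatientRecord:
        fields = self._adapters[provider](raw)
        record = self._store.get(fields["patient_id"])
        if record is None:
            record = PatientRecord(fields["patient_id"], fields["name"])
            self._store[record.patient_id] = record
        # Merge allergies without duplicates, preserving first-seen order.
        for allergy in fields["allergies"]:
            if allergy not in record.allergies:
                record.allergies.append(allergy)
        return record

    def lookup(self, patient_id: str) -> PatientRecord:
        return self._store[patient_id]
```

A provider onboarded this way only has to describe its own format once; every other party reads the merged record through `lookup`.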

This would, of course, require a data warehouse of sorts to hold the single source of truth, and it puts the questions of security, privacy, transparency, and ethics on the broker. The broker has to make money to survive and a business model must emerge, so it would not be immune to market forces. The aggregation of so much patient data in one place would be too great a temptation to leave unmonetized, so a secondary market in de-identified data would emerge and lead to the same issues cited above. Call me pessimistic, but the best predictor of future actions is past behavior, and thus far the companies holding massive amounts of data about our lives either can’t keep it secure from breaches or are perfectly happy selling it while turning a blind eye to what is done with it.


Butler, M. (2018). Data brokers and health insurer partnerships could result in insurance discrimination. Retrieved from

Klugman, C. (2018). Hospitals selling patient records to data brokers: A violation of patient trust and autonomy. Retrieved from

Patientory. (2018). Data brokers have access to your information, do you? Retrieved from

MongoDB and CouchDB in Healthcare Applications

MongoDB and CouchDB are often discussed together because both are document databases, and both have been used widely in healthcare applications. Their similarity to relational database systems usually allows for an easier learning curve and integration with in-place systems. Document databases have been tested against XML and relational databases (e.g., Freire et al., 2016) and used in conjunction with them (e.g., Groce, 2015).

With respect to electronic health record (EHR) management, Freire et al. (2016) tested the performance of Couchbase, a document database related to but distinct from CouchDB, with millions of EHRs including both administrative and epidemiological data points. The authors noted that Couchbase is specifically designed for distributed computing, which is a strength in this case. A number of datasets were set up for benchmarking, and specific queries were written in each database's query language to answer health-specific questions. Response times varied widely, but the XML-based solutions consistently underperformed both MySQL and Couchbase. Against MySQL, Couchbase delivered faster response times. Despite its space and indexing time requirements, Couchbase emerged as the top performer in the test.
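
The shape of such a benchmark can be sketched in a few lines of Python. This is not the harness used by Freire et al. (2016) and it does not touch any real database: the record layout, the `make_records` and `count_diagnosis` helpers, and the sample question are all invented here, and the "query" is a plain in-memory scan. It only shows the pattern of building datasets of increasing size, asking one health-specific question of each, and recording a median response time.

```python
import time
from statistics import median
from typing import Callable, List


def make_records(n: int) -> List[dict]:
    """Build n synthetic, document-shaped records (all fields are invented)."""
    return [
        {
            "patient_id": i,
            "admin": {"region": "north" if i % 2 == 0 else "south"},
            "diagnoses": ["diabetes"] if i % 5 == 0 else ["hypertension"],
        }
        for i in range(n)
    ]


def count_diagnosis(records: List[dict], region: str, diagnosis: str) -> int:
    """Sample question: how many patients in a region carry a diagnosis?"""
    return sum(
        1
        for r in records
        if r["admin"]["region"] == region and diagnosis in r["diagnoses"]
    )


def time_query(query: Callable[[], int], repeats: int = 5) -> float:
    """Run the query several times and return the median wall-clock seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        query()
        timings.append(time.perf_counter() - start)
    return median(timings)


if __name__ == "__main__":
    for size in (1_000, 10_000, 100_000):
        records = make_records(size)
        elapsed = time_query(lambda: count_diagnosis(records, "north", "diabetes"))
        print(f"{size} records: median {elapsed:.6f} s")
```

In a real comparison, `count_diagnosis` would be replaced by the equivalent query in each system's own language, and indexing time and storage size would be measured alongside response time.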

MongoDB may be used to supplement and scale up SQL-based deployments, as outlined by Groce (2015). In this case, MongoDB was used to cut down on latency and performance overhead at Doctoralia, a company that connects patients with medical providers. Prior to the deployment, a single SQL server in one geographic location handled all the load. As the organization expanded to different countries and data volume increased, it became clear that a scaled approach was needed.

MongoDB allowed Doctoralia to deploy servers to each geographic location (reducing geographic latency) and to frontload queries and aggregates to these servers (reducing processing latency). This precompute process also took much of the load off the central SQL server. The distributed framework allows Doctoralia to scale hardware up or down as demand requires, and replication allows for high availability with little to no downtime or lack of response seen by end users. Deploying a new server to handle new load is done in a matter of minutes. Doctoralia measures the MongoDB deployment in terms of speed and availability, and considers it a great success.
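
The precompute idea can be illustrated with a small Python sketch. It is not Doctoralia's implementation and uses no MongoDB API: the sample rows, the `precompute_by_region` function, and the `RegionalReplica` class are hypothetical stand-ins. The point is only that aggregates are computed once from the central data and then served from a store close to the user, so a read never reaches the central SQL server.

```python
from collections import Counter, defaultdict
from typing import Dict, List

# Invented sample rows standing in for the central SQL store.
APPOINTMENTS: List[dict] = [
    {"country": "ES", "specialty": "cardiology"},
    {"country": "ES", "specialty": "dermatology"},
    {"country": "ES", "specialty": "cardiology"},
    {"country": "BR", "specialty": "cardiology"},
]


def precompute_by_region(rows: List[dict]) -> Dict[str, Counter]:
    """Aggregate appointment counts per specialty, grouped by country."""
    regional: Dict[str, Counter] = defaultdict(Counter)
    for row in rows:
        regional[row["country"]][row["specialty"]] += 1
    return dict(regional)


class RegionalReplica:
    """Hypothetical regional server that answers reads from precomputed data."""

    def __init__(self, country: str, aggregates: Counter) -> None:
        self.country = country
        self._aggregates = aggregates

    def appointments_for(self, specialty: str) -> int:
        # Counter returns 0 for specialties with no appointments.
        return self._aggregates[specialty]


def build_replicas(rows: List[dict]) -> Dict[str, RegionalReplica]:
    """Run the precompute step once and hand each region its own slice."""
    return {
        country: RegionalReplica(country, counts)
        for country, counts in precompute_by_region(rows).items()
    }


if __name__ == "__main__":
    replicas = build_replicas(APPOINTMENTS)
    print(replicas["ES"].appointments_for("cardiology"))
```

The trade-off in this pattern is freshness: regional answers are only as current as the last precompute run, which is acceptable for aggregates but not for data that must be read immediately after it is written.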


Couchbase. (2017). NoSQL for healthcare. Retrieved from

Freire, S. M., Teodoro, D., Wei-Kleiner, F., Sundvall, E., Karlsson, D., & Lambrix, P. (2016). Comparing the performance of NoSQL approaches for managing archetype-based electronic health record data. PLoS ONE, 11(3).

Groce, D. (2015). How MongoDB helped a healthcare firm scale horizontally. Retrieved from

MongoDB. (2019). Healthcare. Retrieved from