Latest Posts

Centralized and Distributed DBMS

Perhaps the best place to start with comparing centralized and distributed DBMS instances is the architecture itself. As the names suggest, it is mostly a matter of whether the data resides in one physical location—not necessarily logical, as multiple volumes within a single location do not qualify as a distributed DBMS—or multiple locations with an underlying controller to bring it all together. It might be compared to disk RAID options, in which data on a storage system is mirrored or striped across multiple physical drives.


We can continue the RAID analogy in discussing replication and partitioning. Much like distributed DBMS (DDBMS) architecture, RAID storage allows disks to be seamlessly duplicated for high fault tolerance or the data itself to be written across multiple disks to increase storage capacity and throughput. In RAID 0, data is striped across multiple disks; this is the equivalent of DDBMS partitioning, in which the nodes of a DDBMS store different parts of the complete database. This may be accomplished by horizontal partitioning (in which every node stores all columns, but different nodes hold different subsets of records) or vertical partitioning (in which different nodes store different columns for all records). Alternatively, in RAID 1, a disk is mirrored to another disk; this is the equivalent of replication.
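To make the three layouts concrete, here is a minimal Python sketch. The sample table, its column names, and the helper functions are all hypothetical and exist only for illustration; a real DDBMS handles placement internally.

```python
from typing import Dict, List

# Hypothetical sample table; the column names are illustrative only.
ROWS: List[Dict[str, object]] = [
    {"id": 1, "name": "Ada", "region": "EU", "balance": 120.0},
    {"id": 2, "name": "Grace", "region": "US", "balance": 75.5},
    {"id": 3, "name": "Linus", "region": "EU", "balance": 10.0},
    {"id": 4, "name": "Margaret", "region": "US", "balance": 300.0},
]


def horizontal_partition(rows, key):
    """Each node keeps all columns but only the records for one key value."""
    nodes = {}
    for row in rows:
        nodes.setdefault(row[key], []).append(dict(row))
    return nodes


def vertical_partition(rows, column_groups, primary_key="id"):
    """Each node keeps all records but only some columns (plus the key)."""
    return [
        [{col: row[col] for col in [primary_key, *cols]} for row in rows]
        for cols in column_groups
    ]


def replicate(rows, copies):
    """Each node keeps a full copy of the table, like a RAID 1 mirror."""
    return [[dict(row) for row in rows] for _ in range(copies)]


by_region = horizontal_partition(ROWS, "region")
by_columns = vertical_partition(ROWS, [["name", "region"], ["balance"]])
mirrors = replicate(ROWS, 2)
```

The vertical partitions repeat the primary key on every node so that a full record can be reassembled by joining on it.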

A common misconception with DDBMS instances involves the CAP theorem. The assumption is that while centralized DBMS (CDBMS) instances enjoy Consistency, Availability, and Partition Tolerance all at the same time (the last being a moot point when everything sits in a single location), DDBMS administrators must choose either CP, CA, or AP. It is more accurate to say that a DDBMS administrator must choose between availability and consistency only in the event of a network partition. Choosing availability may sacrifice consistency, and choosing consistency may sacrifice availability.
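The tradeoff can be sketched with a toy two-node store. The `Replica` class, its `mode` flag, and the `demo` helper below are hypothetical names invented for this illustration; no real database exposes this interface.

```python
class Replica:
    """One node of a hypothetical two-node replicated store."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.value = None
        self.partitioned = False  # True when the peer cannot be reached

    def write(self, value, peer):
        if self.partitioned:
            if self.mode == "CP":
                # Favor consistency: refuse the write rather than diverge.
                raise RuntimeError("unavailable: cannot reach peer")
            # Favor availability: accept locally and let the peer go stale.
            self.value = value
            return
        # No partition: both copies are updated together.
        self.value = value
        peer.value = value


def demo(mode):
    a, b = Replica(mode), Replica(mode)
    a.write("v1", b)                       # healthy network
    a.partitioned = b.partitioned = True   # the network splits
    try:
        a.write("v2", b)
        accepted = True
    except RuntimeError:
        accepted = False
    return accepted, a.value, b.value


cp_result = demo("CP")
ap_result = demo("AP")
```

In CP mode the second write is rejected and both nodes still agree on `v1`; in AP mode the write is accepted and the two nodes disagree until the partition heals.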

In terms of applications, a DDBMS is most appropriate for large volumes of data or for users spread across a large geographic area. A partitioned DDBMS architecture might be optimized to store specific columns on nodes local to user groups that use those columns more frequently than other user groups, even though they are not directly accessed. Geographic spread is a relevant use case due to the various network hops and latency differences that may exist between an otherwise central data center and users worldwide.

References

Connolly, T. & Begg, C. (2015).  Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). London, UK: Pearson. 

 Mehra, A. (2017). Understanding the CAP theorem. Retrieved from https://dzone.com/articles/understanding-the-cap-theorem

On DBaaS migrations

An increasing number of enterprise systems are moving to as-a-service models, reducing a company’s overhead and turning traditional facets of information technology (those that have used up both real estate and capital expenditures) into outsourced subscriptions that are managed by outside companies. Infrastructure, networking, and reporting as a service are already popular. Moving the databases themselves off a company’s property and balance sheet into a cloud architecture entails what is known as Database-as-a-Service (DBaaS) (Bonthu, Thammiraju, & Murthy, 2014). There are many factors involved in establishing the DBaaS environment and migrating the data from on-premises boxes to the cloud.

There are typically eight steps involved in moving from on-premise to cloud databases:

  1. Define the scope of migration
  2. Ensure data security
  3. Select service provider
  4. Map the data
  5. Schedule the migration
  6. Select tools for migration or develop migration scripts
  7. Test before (and after) the migration
  8. Actual data migration

 The actual migration, insofar as relational databases are concerned, typically consists of three steps:

  1. Relational schema migration – it includes the migration of tables, indexes and views.
  2. Data migration done via tools or migration scripts. The time required for data migration depends on the size of the database.
  3. Database stored programs migration – the migration of stored procedures and triggers. (Vodomin & Andročec, 2015)
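The three relational steps above can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: two in-memory SQLite databases stand in for the on-premises source and the cloud target, the `migrate` function and the sample schema are hypothetical, and SQLite has triggers but no stored procedures. A real migration would rely on the provider's tooling.

```python
import sqlite3


def migrate(source: sqlite3.Connection, target: sqlite3.Connection) -> None:
    def ddl(kind):
        # sqlite_master holds the CREATE statement for every schema object.
        return source.execute(
            "SELECT name, sql FROM sqlite_master "
            "WHERE type = ? AND sql IS NOT NULL AND name NOT LIKE 'sqlite_%'",
            (kind,),
        ).fetchall()

    # Step 1: relational schema (tables, indexes, views).
    for kind in ("table", "index", "view"):
        for _name, sql in ddl(kind):
            target.execute(sql)

    # Step 2: data, copied table by table.
    for name, _sql in ddl("table"):
        rows = source.execute(f'SELECT * FROM "{name}"').fetchall()
        if rows:
            placeholders = ", ".join("?" * len(rows[0]))
            target.executemany(
                f'INSERT INTO "{name}" VALUES ({placeholders})', rows
            )

    # Step 3: stored programs. They are created last so that triggers
    # do not fire while the data is being loaded.
    for _name, sql in ddl("trigger"):
        target.execute(sql)
    target.commit()


source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);
    CREATE TABLE audit (order_id INTEGER);
    CREATE INDEX idx_orders_total ON orders (total);
    CREATE VIEW big_orders AS SELECT * FROM orders WHERE total > 100;
    CREATE TRIGGER log_order AFTER INSERT ON orders
    BEGIN
        INSERT INTO audit (order_id) VALUES (NEW.id);
    END;
    INSERT INTO orders (total) VALUES (50.0), (150.0);
""")
target = sqlite3.connect(":memory:")
migrate(source, target)
```

Running the data step before the trigger step matters here: had the trigger existed on the target during the load, every copied order would have written a duplicate audit row.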

 The different types of cloud databases available, relational and non-relational, make for a variety of ways to migrate and a number of considerations for enterprise migration. Regardless, a one-time expenditure on migration can save countless dollars and hours of ballooning infrastructure and database sprawl. It is much easier to handle such sprawl by responding with both storage and virtual machine elasticity as opposed to investing more in onsite resources (Bonthu, Thammiraju, & Murthy, 2014). Further research in this space is warranted as the options for cloud architecture increase and companies have more options for service-based managed IT.

 References

 Bonthu, S., Thammiraju, S. D. M., & Murthy, Y. S. S. R. (2014). Study on database virtualization for database as a service (dbaas). International Journal of Advanced Research in Computer Science, 5(2), 31-34.

 Vodomin, G., & Andročec, D. (2015). Problems during database migration to the cloud. Paper presented at the Central European Conference on Information and Intelligence Systems, Varaždin, Croatia.

Integrating Agile and UX Methodology

Both Agile development and User Experience (UX) design methodologies have been popular in software design and project management spheres as of late, yet despite their popularity, there has been little effort to integrate them. Each approaches development from a different perspective: Agile is more focused on coding and project management, while UX is concerned with the usability of the product and the actual user interface. Of course, these are two sides of the same coin. A software solution is unusable without some form of user interface, and a user interface is but an empty shell without a quality software product behind it.

Ferreira, Sharp, & Robinson (2011) suggest a framework for Agile and UX integration, founded on five principles:

  1. The user should be involved in the development process; 
  2. Designers and developers must be willing to communicate and work together extremely closely; 
  3. Designers must be willing to feed the developer with prototypes and user feedback; 
  4. UCD practitioners must be given ample time to discover the basic users’ needs before any code is written; and, 
  5. Agile/UCD integration must exist within a cohesive project management framework.

The framework itself has both the typical software developers and UX designers running in parallel iterations, giving and receiving feedback in each iteration, and the UX team working one Sprint ahead of the development team. Both start with a Sprint 0 to obtain context and task analysis for the project ahead. This ultimately generates User Stories, which are then distributed across the Sprints. These Stories first go through the UX team before being delivered to the development team, so that the UX designers can begin with the User Stories and intended outcomes to produce the interface. The authors observed both design and development teams in action, and noted many areas for integration and improvement. Most notably, the authors found that UX designers had no User Stories specific to them and found it difficult to design one sprint ahead. Usually, they were either on the same Sprint as the developers or one behind.

The authors present a solid argument for integrating Agile and UX methodologies, and since the article’s publication in 2011, the idea has caught on (e.g., Gothelf, 2018). A variety of Agile- and UX-flavored methods are in use in the DevOps world at any moment, each with dedicated followers and applications.

References 

Ferreira, J., Sharp, H., & Robinson, H. (2011). User experience design and agile development: managing cooperation through articulation work. Software: Practice and Experience, 41(9), 963-974. doi:10.1002/spe.1012

Gothelf, J. (2018). Here is how UX design integrates with Agile and Scrum. Retrieved from https://medium.com/swlh/here-is-how-ux-design-integrates-with-agile-and-scrum-4f3cf8c10e24

MetroMaps and T-Cubes: Beyond Gantt Charts

Martínez, Dolado, & Presedo (2010) discuss two visual modeling tools for software development and planning, MetroMap and T-Cube. This discussion is in the context of greater attention being paid to the development process and metrics, not just the software engineering itself. A concession the authors make very early on is that Gantt charts are the prevalent method for project mapping in organizations, and that the research to date shows they are not effective for communicating, especially when different groups are involved. Enter the MetroMap, a way of visualizing abstract train-of-thought information that communicates both high-level and detailed information to viewers.

Image courtesy of Martínez, Dolado, & Presedo (2010)

T-Cube visualization is reminiscent of a Rubik’s Cube, utilizing the three-dimensional nature of a physical cube, the individual cubes making up the whole, and the facets (colors) on each individual cube. These correspond to tasks and attributes. The authors utilized a specific software set to illustrate these concepts, represented in the article. As the tasks and attributes are written independently, they can be represented by workgroup, type of task, module or time.

These two methods have their strengths and weaknesses, both individually and together. At first glance, it is obvious that the MetroMap can represent many indicators at once while the T-Cube can only show one at a time. MetroMap uses a variety of icons and styles to represent information while the T-Cube uses traditional treemaps. The authors size up the tools in a simple comparison table, noting that MetroMap generally has the edge on viewing a lot of information at once.

Features and benefits are great, but how does actual use differ? Is one easier than the other in practice? The authors examined a shortest-path route to accomplish the same task in both tools, and found that MetroMap was the most efficient in multiple scenarios. In all cases the actions were more basic and straightforward. Overall, either tool is more informative and effective than Gantt charts. Access to information and ability to understand it are paramount in any planning and development exercise. These are two tools that better enable that.

Reference

Martínez, A., Dolado, J., & Presedo, C. (2010). Software Project Visualization Using Task Oriented Metaphors. JSEA, 3, 1015-1026.

Delphi Methods and Ensemble Classifiers

Ensemble classifiers are a bit like Delphi methodology, in that they combine multiple models (or experts) to arrive at a model that offers better predictive performance than a single model would (Dalkey & Helmer, 1963; Acharya, 2019). In the simplest case these are independent, parallel classifiers whose outputs are combined by majority vote, much like the Delphi method. A variety of individual classifiers can be used, including logistic regression, nearest neighbor methods, decision trees, Bayesian analysis, or discriminant analysis. According to Dietterich (2002), ensemble classification overcomes three major problems: statistical, computational, and representational. The statistical problem arises when the hypothesis space is too large for the available data, so that multiple hypotheses fit the data equally well yet only one is chosen. The computational problem involves the algorithm’s inability to guarantee that it finds the best hypothesis. The representational problem arises when the hypothesis space contains no good approximation of the target.

Ensemble methods include bagging, boosting, and stacking. Bagging is considered a parallel or independent method; boosting and stacking are both sequential or dependent methods. Parallel methods are used when independence between the base classifiers is advantageous, for example to reduce error by averaging; sequential methods are used when dependence between the classifiers is advantageous, such as correcting mislabeled examples or converting weak learners into strong ones (Smolyakov, 2017).

Random forests are themselves an ensemble method: they train multiple decision trees and aggregate the results, extending bagging (Liberman, 2017). Each tree trains on a randomly selected subset of the data and of the features. Low correlation between the trees helps mitigate bias and variance errors. Again, like other ensemble classifiers and even Delphi method decision-making, learners operating as a committee should outperform any of the individual learners.
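As a rough illustration of the committee idea, the sketch below bags one-threshold "decision stumps" and combines them by majority vote. Everything here is an assumption made for the example: the toy one-feature dataset, the function names, and the stump learner. It is plain bagging, not a full random forest, because with a single feature there is nothing to sample at the feature level.

```python
import random
from collections import Counter


def train_stump(sample):
    """Fit a one-threshold classifier: predict `left` if x < t, else `right`."""
    labels = sorted({y for _, y in sample})
    best = None
    for t in sorted({x for x, _ in sample}):
        for left in labels:
            for right in labels:
                errors = sum((left if x < t else right) != y for x, y in sample)
                if best is None or errors < best[0]:
                    best = (errors, t, left, right)
    _, t, left, right = best
    return lambda x: left if x < t else right


def bagging(data, n_models, rng):
    """Train each model on its own bootstrap sample (drawn with replacement)."""
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]
        models.append(train_stump(sample))
    return models


def majority_vote(models, x):
    """Return the label predicted by the most models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Toy data: values below 10 belong to class 0, the rest to class 1.
data = [(x, 0) for x in range(10)] + [(x, 1) for x in range(10, 20)]
committee = bagging(data, n_models=15, rng=random.Random(0))
low_label = majority_vote(committee, 0)
high_label = majority_vote(committee, 19)
```

Each stump sees a slightly different sample, so individual thresholds vary, but the vote smooths out those differences.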

References

Acharya, T. (2019). Advanced ensemble classifiers. Retrieved from https://towardsdatascience.com/advanced-ensemble-classifiers-8d7372e74e40

Connolly, T. & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). London, UK: Pearson.

Dalkey, N., & Helmer, O. (1963). An experimental application of the Delphi method to the use of experts. Management Science, 9(3), 458-467.

Dietterich, T. G. (2000). Ensemble methods in machine learning. International workshop on multiple classifier systems (pp. 1-15). Springer Berlin Heidelberg.

Dietterich, T. G. (2002). Ensemble Learning. In The Handbook of Brain Theory and Neural Networks, Second Edition, (M.A. Arbib, Ed.), (pp. 405-408). Cambridge, MA: The MIT Press.

Liberman, N. (2017). Decision trees and random forests. Retrieved from https://towardsdatascience.com/decision-trees-and-random-forests-df0c3123f991

Smolyakov, V. (2017). Ensemble learning to improve machine learning results. Retrieved from https://blog.statsbot.co/ensemble-learning-d1dcd548e936

Tembhurkar, M. P., Tugnayat, R. M., & Nagdive, A. S. (2014). Overview on data mining schemes to design business intelligence framework for mobile technology. International Journal of Advanced Research in Computer Science, 5(8).

Decision making with Delphi

The Delphi method brings subject matter experts with a range of experiences together in multiple rounds of questioning to arrive at the strongest consensus possible on a topic or series of topics (Okoli & Pawlowski, 2004; Pulat, 2014). The first round, typically a questionnaire, is used to generate the ideas that subsequent rounds will weight and prioritize. This first round is the most qualitative of the steps. Subsequent rounds are more quantitative. According to Pulat (2014), ideas are listed and prioritized by a weighted point system with no communication between the subject matter experts. This is meant to avoid confrontation (Dalkey & Helmer, 1963). Data requested by one or more experts, or new information that an expert considers potentially relevant, can be shown to all experts (Dalkey & Helmer, 1963; Pulat, 2014).
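One way to picture a quantitative round is the tally below. The point scale, the idea names, and the `delphi_round` function are hypothetical; the sources describe a weighted point system but this particular scoring scheme is my own assumption.

```python
from collections import defaultdict


def delphi_round(ballots):
    """Sum each idea's points across experts and rank ideas by total."""
    totals = defaultdict(int)
    for points in ballots.values():
        for idea, score in points.items():
            totals[idea] += score
    # Highest total first; ties are broken alphabetically for stability.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Each expert scores the ideas independently, without seeing the others.
ballots = {
    "expert_a": {"automate testing": 5, "hire staff": 3, "outsource": 1},
    "expert_b": {"automate testing": 3, "hire staff": 5, "outsource": 1},
    "expert_c": {"automate testing": 5, "hire staff": 1, "outsource": 3},
}
ranking = delphi_round(ballots)
```

Only the aggregated ranking would be fed back to the panel for the next round, which keeps individual ballots anonymous.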

While Delphi begins with and keeps a sense of qualitative research about it, traditional forecasting relies mostly on quantitative methods, using mathematical formulations and extrapolations as its mechanical basis (Wade, 2012). Using past behavior as a predictor of future positioning, a most likely scenario is extrapolated (Wade, 2012; Wade, 2014). This scenario modeling confines planning to a formulaic process much like regression modeling. Both Delphi and traditional forecasting use quantitative methods; the difference is one of degree. A key question in deciding which method to use is what personalities are involved. Delphi methodology gives the most consideration to big personalities and potentially fragile egos, avoiding any direct confrontation or disagreements.
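For contrast, extrapolating a most likely scenario from past behavior can be as simple as fitting a straight line. The sketch below uses ordinary least squares on evenly spaced observations; the `linear_forecast` function and the sample figures are assumptions for illustration, not a method taken from the sources.

```python
def linear_forecast(history, steps_ahead):
    """Fit a least-squares line to past values and extend it forward."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two observations")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    slope = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(xs, history)
    ) / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    # The last observation sits at x = n - 1.
    return intercept + slope * (n - 1 + steps_ahead)


# Hypothetical quarterly figures that grow by 10 each period.
forecast = linear_forecast([100, 110, 120, 130], steps_ahead=2)
```

The formula has no room for expert judgment, which is exactly the gap Delphi is meant to fill.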

References

Dalkey, N., & Helmer, O. (1963). An experimental application of the Delphi method to the use of experts. Management Science, 9(3), 458-467.

Okoli, C., & Pawlowski, S. D. (2004). The Delphi method as a research tool: an example, design considerations and applications. Information & Management, 42(1), 15-29.

Pulat, B. (2014). Lean/six sigma black belt certification workshop: Body of knowledge. Creative Insights, LLC.

Wade, W. (2012). Scenario Planning: A Field Guide to the Future. John Wiley & Sons P&T. VitalSource Bookshelf Online.

Thick Data and Big Data

In March 1968, Robert F. Kennedy said, of the Gross National Product: “It measures neither our wit nor our courage, neither our wisdom nor our learning, neither our compassion nor our devotion to our country, it measures everything in short, except that which makes life worthwhile.”

“What is measurable is not always what is valuable.” Wang (2016b) paraphrased Kennedy, who was referring to GNP and its inability to measure the qualitative human condition. With the exponential increase in attention to Big Data as of late, the focus on speed and scale has left out things that are “sticky” or “difficult to quantify” (Wang, 2016b). This disparity reflects the traditional gap between qualitative and quantitative research. In fact, Wang found that referring to qualitative efforts in traditional terms (e.g., ethnography) was met with enough skepticism and pushback that a new term friendly to data jargon had to emerge, and thus the term thick data was born.

https://miro.medium.com/max/1442/1*B4UOLidQEam25fJkNeZH8A.png
Courtesy Tricia Wang

At first glance, thick data is not attractive in the traditional sense of big data. It is inefficient, does not scale up, and is usually not reproducible. However, when combined with big data, it fills the gaps that the quantitative measures leave open. While big data can identify patterns, it cannot explain why those patterns exist. If big data can go broad, thick data can go deep. Thick data relies on human learning and complements the findings from machine learning that big data cannot provide adequate context for. It shows the social context of specified patterns and is able to handle irreproducible complexity. It is the qualitative complement to quantitative data, the color and nuance to a black-and-white picture.

Forces against the adoption of thick data typically stem from bias against qualitative data. Again, it is messy…inefficient, sticky, complicated, and nuanced. Most of the big data world values what can be quantified and the relationships that can be mapped. As Wang (2016a) notes, quantifying is addictive, and it can be easy to throw out data that doesn’t fit a numerical value. It isn’t a zero-sum game, however. Big data and thick data complement each other. But “silo culture” (the same phenomenon that disrupts data integration and wreaks havoc across enterprise data environments) threatens the symbiosis between these two (Riskope, 2017). While thick data is not an innovation in the same sense as cutting-edge artificial intelligence or new developments in IoT technology, it is an innovation in how we think about the world around us and what is important when studying that world.

References

Riskope. (2017). Big data or thick data: Two faces of a coin.  Retrieved from https://www.riskope.com/2017/05/24/big-data-or-thick-data-two-faces-of-a-coin/

Wang, T. (2016a). The human insights missing from big data.  Retrieved from https://www.ted.com/talks/tricia_wang_the_human_insights_missing_from_big_data

Wang, T. (2016b). Why big data needs thick data.  Retrieved from https://medium.com/ethnography-matters/why-big-data-needs-thick-data-b4b3e75e3d7

Five thousand days of the World Wide Web

In 2007, Kevin Kelly looked back on the last 5,000 days of the World Wide Web and asked: what’s to come? Now, with years of hindsight since that talk, we ask: what next?

One thing I have to call attention to here is the latter part of the talk, in which Kelly discusses codependency and the exchange of privacy for convenience. Total personalization equals total transparency. From a development and data perspective, nothing is outlandish about that statement. But as we have seen in the social fabric over the last few years, not everyone understands or agrees with that logic. There is a demand for personalization without the transparency. I believe the watershed moment in that space will be a split between those who eschew all personalization in order to maintain privacy, and those who are determined to innovate a way to have both personalization and privacy to the degree that we expect now.

That is not my prediction for an innovation in the next two decades. For that, think back to 2012, when Google Glass was first introduced to the public. It was a product ahead of its time and failed to gain traction. Less than ten years later, Google is refining the product for a more sophisticated release and targeted audiences are paying attention. Looking ahead to 2030 and beyond, augmented reality products will be as commonplace as the personal vital signs wearable (Apple Watch) or natural language processor in the living room (Amazon Alexa). Forces working in their favor are both tangible and intangible. Augmented reality is already here, most notably in current iPhone models. This has introduced the concept in an incremental and friendly way in an existing device as opposed to a bombshell new product class. Consumers are able to experience the tangible technology on devices they are already familiar with, gain confidence, and accept the new products that push the envelope. These are a mix of technological, cultural, and social forces.

These same forces can work against adoption. The development of augmented reality now centers around headsets and devices with cameras, but what of the technologies that can project fully-functional desktops and workstations into the ephemera to be touched and manipulated as though they were physically there? The interface running Tony Stark’s lab in Iron Man is not run through Google Glass but is just simply there. Assuming these can be done, take my earlier point about transparency and privacy, and apply it to these technologies that, by definition, augment the very reality we function in. If people are uncomfortable now with the personalization/transparency tradeoff, a new device that alters how they see and interact with the world might simply be a bridge too far.

References

Diamandis, P. H. (2019). Augmented 2030: the apps, headsets, and lenses getting us there. Retrieved from https://singularityhub.com/2019/09/13/augmented-2030-the-apps-headsets-and-lenses-getting-us-there/

Kelly, K. (2007). The next 5,000 days of the web. Retrieved from https://www.ted.com/talks/kevin_kelly_the_next_5_000_days_of_the_web

When Collaborative Learning isn’t an Open-Office Plan for Kids

According to Adams Becker et al. (2017, p. 20), “the advent of educational technology is spurring more collaborative learning opportunities,” driving innovation in a symbiotic relationship that pushes development in both areas as a product of the other. Collaborative learning and the technological developments that help drive it are trends but not fads.

The confluence of collaborative learning and educational technology.

It may be easy to draw parallels between collaborative learning and open-plan offices. This corporate architecture fad of recent years does appear to be in the same spirit as collaborative learning, and the technological tools that are used in conjunction with open-plan offices (or in spite of them) do have a supporting relationship. But the similarities stop there. Open-plan offices were designed with the best intentions of creating more face-to-face interaction with colleagues but have been shown to reduce such interaction by a drastic margin, pushing employees to use alternative text-based methods of communication in light of social pressure to “look busy” amongst coworkers (James, 2019).

There is one carry-over from corporate collaboration that is fruitful in collaborative learning spaces: synchronous communication via messaging apps such as Slack. These interactions are purposeful and augment the authentic active learning students engage in with collaborative learning. Just as workers do not operate in silos within an organization, students are encouraged to engage with others in various collaborative methods. Educational research and practice reinforce these lessons learned from the corporate world, and are helpful forces driving innovation and advancement in collaborative learning practice and technology.

References

Adams Becker, S., Cummins, M., Davis, A., Freeman, A., Hall Giesinger, C., & Ananthanarayanan, V. (2017). NMC Horizon report: 2017 higher education edition. Austin, TX: The New Media Consortium.

James, G. (2019). Open-plan offices literally make you stupid, according to Harvard.  Retrieved from https://www.inc.com/geoffrey-james/open-plan-offices-literally-make-you-stupid-according-to-harvard.html

XML and Standardization

XML is a true double-edged sword in the data analytics world, with both advantages and disadvantages not unlike relational databases or NoSQL. The global advantages and disadvantages inherent in XML are just as applicable in the healthcare field. For example, consider the flexibility of creating tags on the fly, which is both an advantage (for ease of use, compatibility, expandability, et cetera) and a disadvantage (lack of standardization, potential incompatibility with user interfaces, et cetera) in the global sphere. These are equally applicable in healthcare settings. Considering an electronic health record (EHR), different providers and points of care may add to the EHR without having to conform to the standards of other providers; that is, data from a rheumatologist may be added to the patient record with the same ease as data from a general practitioner or psychologist. The portability of the XML format means that the record can be exchanged amongst providers or networks as long as the recipient can read XML. However, this versatility comes at a price: the lack of standardization means that all tags and fields in any given record must be known before it can be queried, and discovering them can be quite a time-consuming process.

Considering an analogy to a different industry, think of a consumer packaged goods (CPG) manufacturer. The CPG has its own internal master data schemas in relational databases and reserves XML for its reseller data interface, so that the different wholesalers and retail network can share sales data back to the CPG in a common format. While all participants use a handful of core attributes (e.g., manufacturer SKU and long description), each wholesaler and retailer has its own set of attributes that are proprietary. XML allows the different participants to feed data back to the CPG without conforming to a schema imposed across the entire retail network and allows the CPG to glean the requisite data shared amongst all participants. However, the process requires setting up the known tags for each new participant so that the CPG knows ahead of time what specific tags are relevant to each participant.
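A small sketch of that arrangement, using Python's standard `xml.etree.ElementTree` module, is below. The tag names (`sku`, `description`, `shelf_zone`, `loyalty_tier`, `units`), the two sample feeds, and the `read_feed` function are all invented for illustration and do not come from any real reseller interface.

```python
import xml.etree.ElementTree as ET

# Core attributes every participant is assumed to send.
CORE_TAGS = ("sku", "description")

# Two hypothetical feeds: same core tags, different proprietary tags.
FEED_A = """<sales>
  <item>
    <sku>1001</sku>
    <description>Oat cereal 500 g</description>
    <shelf_zone>A3</shelf_zone>
  </item>
</sales>"""

FEED_B = """<sales>
  <item>
    <sku>1001</sku>
    <description>Oat cereal 500 g</description>
    <loyalty_tier>gold</loyalty_tier>
    <units>12</units>
  </item>
</sales>"""


def read_feed(xml_text, known_extra_tags=()):
    """Extract the core tags plus any extra tags registered for this feed."""
    records = []
    for item in ET.fromstring(xml_text).iter("item"):
        record = {tag: item.findtext(tag) for tag in CORE_TAGS}
        for tag in known_extra_tags:
            value = item.findtext(tag)
            if value is not None:
                record[tag] = value
        records.append(record)
    return records


records_a = read_feed(FEED_A)
records_b = read_feed(FEED_B, known_extra_tags=("units",))
```

Tags that were never registered, such as `shelf_zone` or `loyalty_tier`, are silently skipped, which is the cost described above: the reader only sees what it was told to look for.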

References

Brewton, J., Yuan, X., & Akowuah, F. (2012). XML in health information systems. Paper presented at the World Congress in Computer Science, Computer Engineering, and Applied Computing, Las Vegas, NV.

Jumaa, H., Rubel, P., & Fayn, J. (2010, July 1-3). An XML-based framework for automating data exchange in healthcare. Paper presented at the 12th IEEE International Conference on e-Health Networking, Applications and Services.

Stockemer, M. (2007). How Do HL7 and XML Co-Exist in Clinical Interfacing? Retrieved from https://healthstandards.com/blog/2007/08/10/how-do-hl7-and-xml-coexist-in-clinical-