I’ve never been a fan of buzzwords. The latest source of my discomfort is the term thought leader, one of those ubiquitous yet seemingly necessary phrases in almost every professional space. That hasn’t kept me from poking fun at it, though, as I believe we should be able to laugh at ourselves and not take things too seriously.
Big Data is a buzzword. But it’s also my career.
What is the difference between regular, conventional, garden-variety data and Big Data? There’s a lot we could say here, but the key differences that come to mind for me are use, size, scope, and storage. I immediately think of two specific datasets I’ve used for teaching purposes: LendingClub and Stattleship.
LendingClub posts their loan history (anonymized, of course) for public consumption so that any audience may feed it into an engine or tool of their choice for analysis. I’ve used this dataset before to demonstrate predictive modeling and how financial institutions use it to aid decision-making in loan approvals. Stattleship is a sports data service with an API that allows access to a myriad of major league sports data. They also provide a custom wrapper to be used in R, and I’ve used these tools to teach R.
One of the primary differences between big data and conventional data is use case. Take these two datasets, for example. Their architects understand that many different users will download the data for their own reasons, and no specific use case is intended for either set. The possibilities are endless. Smaller troves of data typically come with an intended use attached, and the data is specific to that use. Not so with big data.
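To make the LendingClub example concrete, here is a minimal sketch of the kind of predictive-modeling exercise described above, written in Python rather than R for illustration. The field names and the synthetic records are hypothetical stand-ins, not the actual LendingClub schema; the point is simply how loan history can feed a model that aids approval decisions.

```python
import math
import random

random.seed(0)

def sigmoid(z):
    # Clamp to avoid floating-point overflow in exp().
    z = max(-30.0, min(30.0, z))
    return 1 / (1 + math.exp(-z))

# Hypothetical loan records: (annual income in $k, loan amount in $k, repaid?).
# In practice these would come from the downloaded LendingClub loan history.
def make_record():
    income = random.uniform(20, 150)
    amount = random.uniform(5, 40)
    # Toy assumption: repayment is likelier when income comfortably covers the loan.
    p_repaid = sigmoid(income / amount - 4)
    return income, amount, 1 if random.random() < p_repaid else 0

data = [make_record() for _ in range(2000)]

# Fit a two-feature logistic regression with plain stochastic gradient descent.
w_income, w_amount, bias = 0.0, 0.0, 0.0
lr = 0.001
for _ in range(100):
    for income, amount, repaid in data:
        pred = sigmoid(w_income * income + w_amount * amount + bias)
        err = pred - repaid
        w_income -= lr * err * income
        w_amount -= lr * err * amount
        bias -= lr * err

def prob_repaid(income, amount):
    """Model's estimated probability that a loan is repaid."""
    return sigmoid(w_income * income + w_amount * amount + bias)

# A high-income, small-loan applicant should score higher than the reverse.
print(prob_repaid(120, 10), prob_repaid(30, 35))
```

A real exercise would, of course, use many more features from the loan history and a proper train/test split; this is only the shape of the idea.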
These datasets illustrate two other key factors in big data: size and scope. Again, these datasets are not meant to answer one specific question or serve a narrow focus. Sizes often run to gigabytes or terabytes, and in many cases tip over into petabytes. The freedom to explore multiple lines of inquiry is inherent in big data sets, without any restriction on scope.
Finally, the storage and maintenance of big data is another key difference that sets it apart from conventional datasets. The trend of moving database operations offsite and using Database-as-a-Service models has enabled the growth of big data, as has the development of distributed computing and storage. Smaller conventional datasets do not require such an infrastructure and have far less impact on a company’s bottom line.