As organizations grapple with how to effectively manage ever more voluminous and varied reservoirs of big data, data lakes are increasingly viewed as a smart approach. However, while the model can deliver the flexibility and scalability lacking in traditional enterprise data management architectures, data lakes also introduce a fresh set of integration and governance challenges that can impede success.
Born from the rise of the cloud and big data technologies like Hadoop, data lakes provide a way for organizations to cost-effectively store nearly limitless amounts of structured and unstructured data from myriad sources without regard to how that data might be leveraged in the future. By its very nature and through self-service business intelligence capabilities, a data lake also encourages experimentation and data exploration by a broader set of non-business analyst users. According to a survey conducted by TDWI Research, 85 percent of respondents considered the data lake an opportunity to address the challenges they face trying to manage the data deluge with traditional relational databases. Moreover, the TDWI survey found the data lake being pursued for a variety of benefits and use cases, the most prominent being advanced analytics (49 percent) and data discovery (49 percent).
I was lamenting to my friend and fellow MVP Shamir Charania (blog|Twitter) that I didn’t have a topic for this week’s blog post, so he and his colleague suggested I write about data lakes, and specifically the Azure Data Lake.
This is what Wikipedia says:
A data lake is a system or repository of data stored in its natural format, usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video). A data swamp is a deteriorated data lake either inaccessible to its intended users or providing little value.
In my opinion, the Wikipedia definition has too many words, so let’s rewrite it:
A data lake is a repository of enterprise data stored in its original format. This may take the form of one or more of the following:
(I thought the term “data swamp” was a joke, but it’s 2018 and nothing shocks me anymore.)
If that definition of a data lake sounds like a file system, I’d agree. If it sounds like SharePoint, I’m not going to argue either.
However, the main premise of a data lake is a single point of access for all of an organization’s data, which can be effectively managed and maintained. To differentiate “data lake” from “file system,” then, we need to talk about scale. Data lakes are measured in petabytes of data.
For dinosaurs like me who still think in binary, a petabyte (referred to by some as a pebibyte) is 1,024 terabytes (tebibytes), or 1,125,899,906,842,624 bytes (yes, that’s 16 digits).
In the metric system, a petabyte is 1,000 terabytes, or 1,000,000,000,000,000 bytes.
No matter which counting system we use, a petabyte is one million billion bytes. That’s a lot of data.
Internet companies including search engines (Google, Bing), social media companies (Facebook, Twitter), and email providers (Yahoo!, Outlook.com) are managing data stores measured in petabytes. On a daily basis, these organizations handle all sorts of structured and unstructured data.
Assuming they put all their data in one repository, that could technically be thought of as a data lake. These organizations have adapted existing tools and even created new technologies to manage data of this magnitude in a field called big data.
The short version: big data is not a 100 GB SQL Server database or data warehouse. Big data is a relatively new field that came about because traditional data management tools are simply unable to deal with such large volumes of data. Even so, a single SQL Server database can allegedly be more than 500 petabytes in size, but Michael J. Swart warns us: if you’re using over 10% of what SQL Server restricts you to, you’re doing it wrong.
Big data is where we hear about processes like Google’s MapReduce. The Apache Foundation created their own open-source implementation of MapReduce called Hadoop. Later, Apache Spark was developed to solve some of the limitations inherent in the MapReduce cluster computing paradigm.
From a high level of abstraction, we can think of the Azure Data Lake as an infinitely large hard drive. It leverages the resilience, reliability, and security of Azure Storage you already know and love. Then, using Hadoop and other toolsets in the Azure environment, data can be queried, manipulated and analyzed in the same way we might do it on-premises, but leveraging the massive parallel processing of cloud computing combined with virtually limitless storage.
Note: Microsoft is not the only player in this space. Other cloud vendors like Google Compute (GC) and Amazon Web Services (AWS) offer roughly equivalent services for roughly equivalent prices.
With all of that taken into consideration, here is my new definition for “data lake”:
A data lake is a single repository for all enterprise data, in its natural format, which can be effectively managed and maintained using a number of big data technologies.