There is one thing we can say for certain about the IT and computer industry: we sure love creating new terms and categories, even when it can be argued that we don’t need them. And even when new terms might be useful, it can be difficult to come to grips with them, especially when every vendor out there grabs onto the new term to prove that its technology is relevant, or hot.
Data lake is one of these newer terms, coined for a new category of data store for us to create and manage. But I don’t think the term is well understood by many. So just what is a data lake?
Well, the “what is” definition at TechTarget says that “a data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.” This is a reasonable definition, at a high level. The key portion here is that the data is stored in its “native format.” The data is not manipulated or transformed in any meaningful way; it is simply stored and cataloged for future use.
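To make “stored in its native format” concrete, here is a minimal Python sketch of landing a file in a lake. The function name ingest_raw, the raw/<source system> directory layout and the use of a local file system are all my own assumptions for illustration; a real lake would more likely sit on HDFS or cloud object storage. The only point is that the bytes are copied unchanged, which the checksum comparison confirms.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def ingest_raw(source: Path, lake_root: Path, source_system: str) -> Path:
    """Copy a file into the lake byte for byte, with no parsing or conversion.

    The raw/<source_system>/ layout is a hypothetical convention, not a standard.
    """
    target_dir = lake_root / "raw" / source_system
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    shutil.copyfile(source, target)
    return target


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        incoming = Path(tmp) / "orders.csv"
        incoming.write_bytes(b"id,amount\n1,9.99\n")
        stored = ingest_raw(incoming, Path(tmp) / "lake", "crm")
        # Identical digests show the data was not transformed on the way in.
        print(sha256_of(incoming) == sha256_of(stored))
```

Notice what is absent: no parsing, no type conversion, no cleansing. All of that is deferred until someone actually reads the data.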
Why is this an interesting or useful thing to do? Any type of data can be stored in a data lake: structured, semi-structured and unstructured. For example, organizations can use a data lake for customer information captured from multiple sources for future analysis and aggregation. This can consist of typical numbers, characters, dates and times, as well as complex documents, text, multimedia and more. In general, the data is ingested without transformation. Data scientists can then run analytical models against the data, business analysts can augment business intelligence activities with it, and it can even be used as a long-term data archive.
Organizations are under intense pressure these days to capture any data that could be relevant to their business. The number of sources and the amount of data continue to skyrocket. So the desire to grab the data when it is available is high, but the time to organize and understand that data fully at the time of capture is not usually available.
Therein, however, lies one potential problem for the long-term viability of a data lake. Unless the data in the data lake is properly defined and documented, it can be quite difficult to extract business value from it. Perhaps there remains value for data scientists in a data lake because they possess the skills to wrangle and manipulate data. But to expect that business people will be able to use data that is just dumped into a data lake with little or no details about it is unrealistic.
You should not just treat the data lake as a dumping ground for data. It is important to have a means of understanding and managing the data that is stored in the data lake. Without a mechanism for defining, populating, accessing and managing the data in your data lakes, you will find them to be less than useful.
Populating a data lake requires knowledge of and proper tools for data integration. Because the data lake contains multiple types of data from multiple sources, it must include support for a wide array of different platforms, data types and structures, interfaces and processing capabilities.
You will also need some form of metadata management for a data lake environment to remain useful and healthy. Minimally, a data lake requires information about each type of data stored there, but also some guidance on where the data originated (that is, its provenance), the data elements it contains, the meaning of each, and how to read them. Of course, the metadata can be minimal to begin with and then fleshed out as your data scientists and analytics teams explore the data.
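As a rough illustration of that minimal metadata, the Python sketch below models one catalog entry. The class and field names (CatalogEntry, DataElement, provenance, how_to_read and so on) are hypothetical and simply mirror the items listed above; they are not taken from any particular metadata tool.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataElement:
    name: str         # the element as it appears in the raw data
    meaning: str      # what the element represents to the business
    how_to_read: str  # e.g. "ISO 8601 date string" (free-form guidance)


@dataclass
class CatalogEntry:
    dataset: str
    data_type: str    # e.g. "CSV", "JSON", "JPEG"
    provenance: str   # where the data originated
    elements: list[DataElement] = field(default_factory=list)

    def describe(self, element: DataElement) -> None:
        """Add or replace the description of one element as understanding grows."""
        self.elements = [e for e in self.elements if e.name != element.name]
        self.elements.append(element)

    def undocumented(self, observed_names: list[str]) -> list[str]:
        """Return the observed element names that nobody has described yet."""
        known = {e.name for e in self.elements}
        return [name for name in observed_names if name not in known]
```

An entry can start life with nothing more than a data type and a source, for example CatalogEntry("web_clicks", "JSON", "site clickstream export"), and then pick up element descriptions through describe as analysts explore the data. The undocumented method gives a simple way to see how much of a data set is still a mystery.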
The words of Sean Martin, CTO of Cambridge Semantics, ring true here. He said, “We see customers creating big data graveyards, dumping everything into HDFS (Hadoop Distributed File System) and hoping to do something with it down the road. But then they just lose track of what’s there.”
Planning for the ongoing viability of the data lake is important, and that means active management to ensure accessibility and understanding of what is available.
Whither the Data Warehouse?
Some people worry that data lakes will summon the demise of data marts and data warehouses. But this is absurd. A data warehouse, as defined by Bill Inmon, is “a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.”
In contrast with a data lake, where data is captured and stored with no transformation or aggregation, a data warehouse contains data transformed from multiple sources and is designed for business users. A data lake cannot serve the same purpose unless the data is modified from its “native format”… and then it stops being a data lake by definition.
To contrast the two (data lake and data warehouse), let’s take a look at how James Dixon, CTO of Pentaho and the man who coined the term data lake, describes them. He likens a data mart or warehouse to a bottle of water… it has been “cleansed, packaged and structured for easy consumption.” On the other hand, a data lake is more like a body of water in its natural state, a lake or a stream.
There are, of course, many other differences. A data warehouse contains structured data whereas a data lake can contain structured, unstructured, and semi-structured data. Data in the data lake comes from multiple sources and will have varying schemata. As such, the data lake requires schema-on-read capability – and a platform, like Hadoop, that supports such a requirement. With data from multiple, disparate sources all being stored in its native format, data lakes cannot support schema-on-write like data warehouses do.
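Schema-on-read is easier to grasp with a small example. The Python sketch below is only an illustration under my own assumptions: the sample records, the field names and the read_with_schema helper are all invented, and a real platform would do this at far greater scale. The raw lines are kept exactly as they arrived, and each reader supplies the schema it cares about at query time.

```python
import json
from datetime import date

# Raw records as captured from different (hypothetical) sources, unmodified.
RAW_LINES = [
    '{"customer": "A17", "amount": "19.90", "day": "2016-03-01"}',
    '{"customer": "B02", "amount": 5, "day": "2016-03-02", "channel": "web"}',
    '{"cust_id": "C44", "total": "7.25"}',
]

# A schema here is just a mapping of field name to converter function.
SALES_SCHEMA = {"customer": str, "amount": float, "day": date.fromisoformat}
TOTALS_SCHEMA = {"cust_id": str, "total": float}


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time, yielding only records that fit it.

    Records lacking a required field, or holding a value that cannot be
    converted, are skipped rather than rejected at load time.
    """
    for line in raw_lines:
        record = json.loads(line)
        try:
            yield {name: convert(record[name]) for name, convert in schema.items()}
        except (KeyError, ValueError, TypeError):
            continue


if __name__ == "__main__":
    for row in read_with_schema(RAW_LINES, SALES_SCHEMA):
        print(row)
    for row in read_with_schema(RAW_LINES, TOTALS_SCHEMA):
        print(row)
```

The same three raw lines satisfy two different readers, and nothing had to be rejected or reshaped when the data was captured. With schema-on-write, by contrast, the third record would have had to be mapped into the sales structure, or turned away, before it was ever stored.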
Of course, Hadoop is not the only technology that can be used for data lakes. For example, some organizations with a more cloud-focused mentality are using solutions from cloud providers like Amazon Web Services, IBM BigInsights on Cloud or Microsoft Azure to implement their data lake.
The type of storage that can be used also separates data warehouses from data lakes. With a data warehouse, performance is important, and you do not want to store data that will be queried by business professionals on slower, less costly storage devices. On the other hand, storing a data lake on such devices makes a lot of sense!
So understand the difference… and do not confuse the two.
A Place for Everything and Everything in its Place
The bottom line here is that you really do need to understand terms and their associated technologies. Even if, at first glance, similarities abound, take the time to investigate and learn before attempting to adopt any new technology.
A data lake is not a data warehouse. They are designed for different purposes, and with different goals in mind. Properly implemented, a data lake can feed your data marts and enterprise data warehouse… and it can support the analytical queries and models of your data scientists.
While proceeding with knowledge will not assure success, proceeding without knowledge surely will doom your projects to failure.
“The enterprise data lake: Better integration and deeper analytics” by Brian Stein and Alan Morrison. http://www.pwc.com/us/en/technology-forecast/2014/cloud-computing/assets/pdf/pwc-technology-forecast-data-lakes.pdf
“Pentaho, Hadoop, and Data Lakes” by James Dixon. https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/