5 Things to Know about Avoiding a Data Swamp with a Data Lake

By David Stevens posted Apr 17, 2017 08:06 AM


An estimated 70% of the time spent on analytics projects goes to identifying, cleansing, and integrating data. To rectify this situation, many organizations are considering a data lake solution. A data lake contains data from various sources. However, without proper management and governance, a data lake can quickly become a data swamp. A data swamp is unsafe to use because no one is sure where the data came from, how reliable it is, or how it should be protected. IBM proposes an enhanced data lake solution that is built with management, affordability, and governance at its core. In this article, that governed solution is called a Data Lake (capitalized).

Data without governance is a liability


1. What is a Data Lake?

A Data Lake provides credible information to subject matter experts (such as data analysts, data scientists, and business teams) so they can perform analysis activities such as investigating and understanding a particular situation, event, or activity. A Data Lake has capabilities that ensure the data is properly cataloged and protected, so subject matter experts can confidently access the data they need for their work and analysis.


2. What makes up a Data Lake?

The Data Lake is composed of three main components:


  • Data Lake Services

    These services can locate, access, prepare, transform, process, and move data in and out of the Data Lake repositories.


  • Data Lake Repositories

    The repositories provide platforms both for storing data and running analytics as close to the data as possible.


  • Information Management and Governance Fabric

    The fabric provides the engines and libraries to govern and manage the data in the Data Lake.
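
To make the split between the three components more concrete, here is a minimal sketch in Python. It is only an illustration, not IBM's implementation: the class names (CatalogEntry, DataLake), the fields, and the classification labels are all hypothetical. The repositories hold the data, the catalog stands in for the governance fabric, and the two methods play the role of Data Lake services.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CatalogEntry:
    """Governance metadata for one data set (hypothetical fields)."""
    name: str
    source_system: str     # where the data came from
    classification: str    # how it should be protected, e.g. "public"
    quality_checked: bool  # whether its reliability has been assessed


@dataclass
class DataLake:
    """Toy model: repositories hold data, the catalog holds governance metadata."""
    repositories: dict = field(default_factory=dict)  # name -> list of records
    catalog: dict = field(default_factory=dict)       # name -> CatalogEntry

    def ingest(self, entry: CatalogEntry, records: list) -> None:
        """Service: store data only together with its catalog entry."""
        self.catalog[entry.name] = entry
        self.repositories[entry.name] = list(records)

    def read(self, name: str, cleared_for: set) -> list:
        """Service: return data only if the caller is cleared for its classification."""
        entry = self.catalog[name]
        if entry.classification not in cleared_for:
            raise PermissionError(f"{name} is classified {entry.classification}")
        return self.repositories[name]
```

Because ingest always records a catalog entry alongside the data, every data set in this toy lake can answer the three questions a data swamp cannot: where the data came from, how reliable it is, and how it should be protected.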


3. Where does the data come from that feeds the Data Lake?

Much of the data in the Data Lake comes from enterprise IT systems such as business systems and business applications. Solutions that monitor activities might also be a source of data. For example, a source could be the log data on usage of the enterprise's website.


4. How do you roll out a Data Lake?

A Data Lake is a dynamic, agile environment for business teams to control and use in an interactive, self-service manner. At least two initial activities are needed to establish the governance and management framework essential to a Data Lake. One is to install the information integration and governance platform with at least one data repository. Another is to define the governance policies and related implementations for managing the data in each subject area stored in the Data Lake.


5. What are some of the key roles in the team?

Various roles are important to defining and enhancing the Data Lake. For example, the governance team enables the Data Lake to accept data on new subject areas by defining the governance policies and related data definitions. The IT team enhances the Data Lake by adding new types of repositories, new data refineries, and feeds from non-traditional sources of information. A data refinery provides the ability to move and transform data into, out of, and between the Data Lake repositories. The data refineries use the governance policies to process the data efficiently and to ensure that the policies are enforced. Another role is that of the information curators, who define new sources of information that extend the ability to create insights with the Data Lake. Business teams are critical because they add their knowledge and departmental data to the Data Lake, bringing additional perspective on the operational systems data.
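
As a rough illustration of how a data refinery might enforce a governance policy while moving data, consider the Python sketch below. Everything in it is an assumption made for the example: the refine function, the policy format (a mapping from field name to "keep" or "mask"), and the sample record are hypothetical and not part of any IBM product.

```python
def refine(records, policy):
    """Copy records, applying the action the policy names for each field."""
    refined = []
    for record in records:
        out = {}
        for key, value in record.items():
            # Hypothetical rule: a field with no policy entry is dropped.
            action = policy.get(key, "drop")
            if action == "keep":
                out[key] = value
            elif action == "mask":
                out[key] = "***"
            # Any other action, including "drop", omits the field.
        refined.append(out)
    return refined


# Policy the governance team might define for a "customer" subject area.
customer_policy = {"customer_id": "keep", "email": "mask"}

rows = [{"customer_id": 7, "email": "a@example.com", "notes": "call back"}]
print(refine(rows, customer_policy))
# prints [{'customer_id': 7, 'email': '***'}]
```

Dropping fields that have no policy is a deliberate choice in this sketch: data on a new subject area stays out of the refined output until the governance team has defined how it should be handled.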