What is a data lake? Definition, benefits, architecture and best practices

What is a data lake?

A data lake is defined as a centralized and scalable storage repository that holds large volumes of raw big data from multiple sources and systems in its native format. 

To picture a data lake, think of an actual lake: the water is raw data that flows in from many capture sources and can then flow out to serve a range of internal and customer-facing purposes. A data warehouse, by comparison, is more like a household tank: it stores cleaned water (structured data), but only for the use of one particular house.

Data lakes can be implemented using tools built in-house or third-party vendor software and services. According to MarketsandMarkets, the global data lake software and services market is expected to grow from $7.9 billion in 2019 to $20.1 billion in 2024. A number of vendors are expected to drive this growth, including Databricks, AWS, Dremio, Qubole and MongoDB. Many vendors have also started offering a so-called lakehouse, which combines the benefits of data lakes and data warehouses in a single product.

Data lakes work on the concept of load first and use later, which means the data stored in the repository doesn’t necessarily have to be used immediately for a specific purpose. It can be dumped as-is and used all together (or in parts) at a later stage as business needs arise. This flexibility, combined with the vast variety and amount of data stored, makes data lakes ideal for data experimentation as well as machine learning and advanced analytics applications within an enterprise.
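The load-first, use-later idea is often described as schema-on-read. The following is a minimal Python sketch of it, not a production design: the in-memory `raw_store` list, the function names, and the event fields are all assumptions made up for illustration.

```python
import json

# Hypothetical raw store: a plain list stands in for the lake's storage.
raw_store = []

def load_raw(payload: str) -> None:
    """Load first: keep the payload exactly as received, with no parsing."""
    raw_store.append(payload)

def read_clicks(store):
    """Use later: apply a schema at read time and keep only records that fit."""
    rows = []
    for payload in store:
        try:
            record = json.loads(payload)
        except json.JSONDecodeError:
            continue  # not JSON; it stays in the store for other uses
        if not isinstance(record, dict) or record.get("type") != "click":
            continue
        if "user" in record and "page" in record:
            rows.append({"user": str(record["user"]), "page": str(record["page"])})
    return rows

# Three very different payloads are accepted without any upfront schema.
load_raw('{"type": "click", "user": 7, "page": "/home"}')
load_raw('{"type": "sensor", "temp_c": 21.5}')
load_raw("free-form log line, not JSON")

clicks = read_clicks(raw_store)
```

Nothing is rejected at load time; the click schema is imposed only when a specific question is asked, and the other payloads remain available for later use.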

Data lake vs. data warehouse

Unlike data warehouses, which only store processed structured data (organized in rows and columns) for predefined business intelligence and reporting applications, data lakes can store data of any type without requiring a schema upfront. This could be structured data, semi-structured data, or even unstructured data such as images (.jpg) and videos (.mp4).

Key benefits and challenges

Benefits of data lake for enterprises

  • Expanded data types for storage: Because data lakes can store all data types, including those critical to advanced forms of analytics, organizations can use them to identify opportunities and actionable insights that could help improve operational efficiency, increase revenue, save money, and reduce risk.
  • Revenue growth from expanded data analytics: According to an Aberdeen survey, organizations that implemented a data lake outperformed similar companies by 9% in terms of organic revenue growth. These companies were able to perform new types of analytics on previously unused data – log files, data from click-streams, social media, and internet-connected devices – stored in the data lake.
  • Unified data from silos: Data lakes can also centralize information from disparate departmental silos, mainframes, and legacy systems, thereby offloading their individual capacity, preventing issues such as data duplication, and giving a 360-degree view to the users. Simultaneously, they keep the cost of storing data for future use on the lower side.
  • Enhanced data capture, including IoT: An organization can implement a data lake to ingest data from multiple sources, including IoT equipment sensors in factories and warehouses. These sources can be internal, customer-facing, or both. Customer-facing data helps marketing, sales and account management teams orchestrate omni-channel campaigns using the most up-to-date, unified information available for each customer, whereas internal data is used for holistic employee and finance management strategies.

Challenges of a data lake 

Over the years, cloud data lake and warehousing architectures have helped enterprises scale their data management efforts while lowering costs. However, the current set-up has some challenges, such as:

  • Lack of consistency with warehouses: Companies may often find it difficult to keep their data lake and data warehouse architecture consistent. It is not just a costly affair, but teams also need to employ continuous data engineering tactics to ETL/ELT data between the two systems. Each step can introduce failures and unwanted bugs affecting the overall data quality.
  • Vendor lock-in: Shifting large volumes of data into a centralized enterprise data warehouse (EDW) is challenging for companies, not only because of the time and resources required to execute such a task but also because this architecture creates a closed loop that causes vendor lock-in.
  • Data governance: Data in a data lake tends to be held in a variety of file-based formats, while data in a data warehouse is mostly in database format. This difference adds complexity to data governance and lineage management across the two storage types.
  • Data copies and associated costs: Keeping the same data in both a data lake and a data warehouse creates duplicate copies, each with associated costs. Moreover, commercial warehouse data in proprietary formats increases the cost of migrating data. A data lakehouse addresses these typical limitations of both data lake and data warehouse architectures by combining the best elements of each to deliver significant value for organizations.

Architecture of a data lake: 5 key components

Data lakes use a flat architecture and can have many layers depending on technical and business requirements. No two data lakes are built exactly alike. However, data generally flows through five key zones: the ingestion zone, landing zone, processing zone, refined data zone, and consumption zone.

1. Data ingestion

This component, as the name suggests, connects a data lake to external relational and nonrelational sources – such as social media platforms and wearable devices – and loads raw structured, semi-structured, and unstructured data into the platform. Ingestion is performed in batches or in real-time, but it must be noted that a user may need different technologies to ingest different types of data.
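The difference between batch and real-time ingestion can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than any vendor's API: the `lake` list stands in for storage, and the function names and sample payloads are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List

def ingest_batch(records: Iterable[bytes], sink: List[bytes]) -> int:
    """Batch mode: load a complete set of records in one operation."""
    batch = list(records)
    sink.extend(batch)
    return len(batch)

def ingest_stream(events: Iterator[bytes], sink: List[bytes], max_events: int) -> int:
    """Streaming mode: append events one at a time as they arrive."""
    count = 0
    for event in islice(events, max_events):
        sink.append(event)
        count += 1
    return count

lake: List[bytes] = []  # hypothetical stand-in for the lake's storage
nightly_export = [b"id,amount\n1,9.50\n", b"id,amount\n2,3.00\n"]
sensor_feed = iter([b'{"temp_c": 21.5}', b'{"temp_c": 21.7}', b'{"temp_c": 21.6}'])

batch_count = ingest_batch(nightly_export, lake)
stream_count = ingest_stream(sensor_feed, lake, max_events=2)
```

Both paths write raw bytes unchanged; in practice each would typically be backed by a different technology, as the paragraph above notes.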

Currently, all major cloud providers offer services that support low-latency data ingestion or are commonly used alongside it. Examples include Amazon S3, AWS Glue, Amazon Kinesis, Amazon Athena, Google Dataflow, Google BigQuery, Azure Data Factory, Azure Databricks, and Azure Functions.

2. Data landing 

Once the ingestion completes, all the data is stored as-is with metadata tags and unique identifiers in the landing zone. As per Gartner, this is usually the largest zone in a data lake today (in terms of volume) and serves as an always-available repository of detailed source data, which can be used/reused for analytic and operational use-cases as and when the need arises. The presence of raw source data also makes this zone an initial playground for data scientists and analysts, who experiment to define the purpose of the data.
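One way to picture "stored as-is with metadata tags and unique identifiers" is a raw file plus a metadata sidecar. The sketch below assumes a local directory as the landing zone; the `land` function, the file naming, and the metadata fields are hypothetical choices, not a standard.

```python
import hashlib
import json
import tempfile
import uuid
from datetime import datetime, timezone
from pathlib import Path

def land(raw: bytes, source: str, landing_dir: Path) -> str:
    """Write raw bytes unchanged plus a metadata sidecar; return the object ID."""
    object_id = uuid.uuid4().hex  # unique identifier for this object
    landing_dir.mkdir(parents=True, exist_ok=True)
    (landing_dir / f"{object_id}.raw").write_bytes(raw)
    metadata = {
        "id": object_id,
        "source": source,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "size_bytes": len(raw),
        "sha256": hashlib.sha256(raw).hexdigest(),
    }
    (landing_dir / f"{object_id}.meta.json").write_text(json.dumps(metadata))
    return object_id

# Usage: land one sensor reading in a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    object_id = land(b'{"temp_c": 21.5}', "factory-sensor", Path(tmp))
    stored = (Path(tmp) / f"{object_id}.raw").read_bytes()
```

The checksum in the sidecar lets later stages confirm that the landed copy still matches what arrived.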

3. Data processing

When the purpose(s) of the data is known, copies of it move from the landing zone to the processing stage, where refinement, optimization, aggregation, and quality standardization take place by imposing schemas. This zone makes the data analysis-worthy for various business use cases and reporting needs.

Notably, data copies are moved into this stage to ensure that the original arrival state of the data is preserved in the landing zone for future use. For instance, if new business questions or use cases arise, the source data could be explored and repurposed in different ways, without the bias of previous optimizations.
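The copy-then-process pattern can be sketched as follows. The two dictionaries standing in for the landing and processing zones, the `process` function, and the order fields are assumptions for illustration only.

```python
import json

def process(landing: dict) -> dict:
    """Build processed records from landing data without modifying the originals."""
    processed = {}
    for key, raw in landing.items():
        record = json.loads(raw)  # parsing yields a new object; raw is untouched
        # Impose a schema: fixed field names and types, extra fields dropped.
        processed[key] = {
            "customer_id": int(record["customer_id"]),
            "amount_usd": float(record["amount"]),
        }
    return processed

# Hypothetical landing zone holding one raw order exactly as it arrived.
landing_zone = {"order-1": '{"customer_id": "42", "amount": "19.50", "note": "gift"}'}
processed_zone = process(landing_zone)
```

The `note` field is dropped from the processed record but survives in the landing zone, so a future use case that needs it can still go back to the source data.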

4. Refined data zone

When the data is processed, it moves into the refined data zone, where data scientists and analysts set up their own data science and staging zones to serve as sandboxes for specific analytic projects. Here, they control the processing of the data to repurpose raw data into structures and quality states that could enable analysis or feature engineering.

5. Consumption zone

The consumption zone is the last stage of general data flow within a data lake architecture. In this layer, the results and business insights from analytic projects are made available to the targeted users, be it a technical decision-maker or a business analyst, through the analytic consumption tools and SQL and non-SQL query capabilities.

Top 6 best practices for an effective and secure data lake in 2022

1. Identify data goals

In order to prevent your data lake from becoming a data swamp, it is recommended to identify your organization’s data goals – the business outcomes – and appoint an internal or external data curator who could assess new sources/datasets and govern what goes into the data lake based on that goal. Clarity on what type of data has to be collected can help an organization dodge the problem of data redundancy, which often skews analytics.

2. Document incoming data

All incoming data should be documented as it is ingested into the lake. The documentation usually takes the form of technical metadata and business metadata, although new forms of documentation are also emerging. Without proper documentation, a data lake deteriorates into a data swamp that is difficult to use, govern, optimize and trust, and users fail to discover the data they need.
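As a rough illustration of documenting data at ingestion time, the sketch below keeps a small catalog with technical and business metadata and refuses undocumented datasets. The `CatalogEntry` fields, the `register` and `find_by_tag` helpers, and the sample dataset are hypothetical; real catalogs are considerably richer.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CatalogEntry:
    dataset: str
    file_format: str   # technical metadata
    location: str      # technical metadata
    owner: str         # business metadata
    description: str   # business metadata
    tags: List[str] = field(default_factory=list)

catalog: Dict[str, CatalogEntry] = {}

def register(entry: CatalogEntry) -> None:
    """Refuse undocumented datasets so nothing enters the lake unlabeled."""
    if not entry.owner.strip() or not entry.description.strip():
        raise ValueError(f"{entry.dataset}: owner and description are required")
    catalog[entry.dataset] = entry

def find_by_tag(tag: str) -> List[str]:
    """Let users discover datasets by tag instead of browsing raw storage."""
    return sorted(name for name, entry in catalog.items() if tag in entry.tags)

register(CatalogEntry(
    dataset="web_clicks",
    file_format="json",
    location="landing/streaming/web_clicks",
    owner="marketing",
    description="Raw click-stream events from the website",
    tags=["clickstream", "web"],
))
```

Making the owner and description mandatory is one simple design choice for keeping discovery possible as the lake grows.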

3. Maintain quick ingestion time

The ingestion process should run as quickly as possible. Eliminating upfront data improvements and transformations increases ingestion speed, as does adopting new data integration methods for pipelining and orchestration. This helps make data available as soon as possible after it is created or updated, so that some forms of reporting and analytics can operate on it.

4. Process data in moderation

The main goal of a data lake is to provide detailed source data for data exploration, discovery, and analytics. If an enterprise processes the ingested data with heavy aggregation, standardization, and transformation, then many of the details captured with the original data will get lost, defeating the whole purpose of the data lake. So, an enterprise should make sure to apply data quality remediations in moderation while processing. 

5. Focus on subzones

Individual data zones in the lake can be organized by creating internal subzones. For instance, a landing zone can have two or more subzones depending on the data source (batch/streaming). Similarly, the data science zone under the refined datasets layer can include subzones for analytics sandboxes, data laboratories, test datasets, learning data and training, while the staging zone for data warehousing may have subzones that map to data structures or subject areas in the target data warehouse (e.g., dimensions, metrics and rows for reporting tables and so on).
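A zone and subzone layout like the one described above can be expressed as a path convention. In this sketch the `lake` root, the `LAYOUT` mapping, and the `zone_path` helper are hypothetical; the subzone names are adapted from the examples in the text.

```python
from pathlib import PurePosixPath

# Hypothetical zone -> subzone layout based on the examples in the text.
LAYOUT = {
    "landing": {"batch", "streaming"},
    "data_science": {"sandboxes", "laboratories", "test_datasets", "training"},
    "staging": {"dimensions", "metrics"},
}

def zone_path(zone: str, subzone: str, dataset: str) -> PurePosixPath:
    """Return the storage path for a dataset, rejecting unknown zones or subzones."""
    if subzone not in LAYOUT.get(zone, set()):
        raise ValueError(f"unknown zone/subzone: {zone}/{subzone}")
    return PurePosixPath("lake") / zone / subzone / dataset

clicks_path = zone_path("landing", "streaming", "web_clicks")
```

Validating paths against a declared layout keeps datasets from being written to ad hoc locations that nobody can find later.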

6. Prioritize data security

Security has to be maintained across all zones of the data lake, from landing to consumption. To ensure this, ask your vendors what they are doing in four areas: user authentication, user authorization, data-in-motion encryption, and data-at-rest encryption. With these elements in place, an enterprise can keep its data lake actively and securely managed and reduce the risk of external or internal leaks (due to misconfigured permissions and other factors).

Author: admin