data lakes

Tips for Enhancing Your Data Lake Strategy

As organizations grapple with how to effectively manage ever more voluminous and varied reservoirs of big data, data lakes are increasingly viewed as a smart approach. However, while the model can deliver the flexibility and scalability lacking in traditional enterprise data management architectures, data lakes also introduce a fresh set of integration and governance challenges that can impede success.

The Power and Potential of Data Lakes

Born from the rise of the cloud and big data technologies like Hadoop, data lakes provide a way for organizations to cost-effectively store nearly limitless amounts of structured and unstructured data from myriad sources without regard to how that data might be leveraged in the future. By its very nature and through self-service business intelligence capabilities, a data lake also encourages experimentation and data exploration by a broader set of non-business analyst users. According to a survey conducted by TDWI Research, 85 percent of respondents considered the data lake an opportunity to address the challenges they face trying to manage the data deluge with traditional relational databases. Moreover, the TDWI survey found the data lake being pursued for a variety of benefits and use cases, the most prominent being advanced analytics (49 percent) and data discovery (49 percent).

Original Link

Moving Big Data to the Cloud: A Big Problem?

Digital transformation is overhauling the IT approach of many organizations and data is at the center of it all. As a result, organizations are going through a significant shift in where and how they manage, store, and process this data.

In the not-so-distant past, enterprises managed big data by processing large volumes of it on an on-premises Hadoop cluster built with a commercial distribution such as Cloudera, Hortonworks, or MapR.

Original Link

The Benefits of Building a Modern Data Architecture for Big Data Analytics

Modern data-driven companies are the best at leveraging data to anticipate customer needs and changes in the market, and to proactively make more intelligent business decisions. According to the Gartner 2018 CEO and Senior Business Executive Survey, 81 percent of CEOs have prioritized technology initiatives that enable them to acquire advanced analytics. While many companies tapping into advanced analytics are now rethinking their data architecture and beginning data lake projects, 60 percent of these projects fail to go beyond piloting and experimentation, according to Gartner. In fact, that same Gartner survey reports that only 17 percent of Hadoop deployments were in production in 2017. If companies don’t successfully modernize their data architecture now, they will end up losing customers, market share, and profits.

What Drives the Shift to a Modern Enterprise Data Architecture?

The architectures that have dominated enterprise IT in the past can no longer handle the workloads needed to move the business forward. The shift toward a modern enterprise data architecture (MEDA) is driven by seven key business drivers:

Original Link

Add Schema as Needed, Not in Advance

The first three steps of building a traditional data warehouse are to 1) gather reporting requirements, 2) identify source data, and 3) design a data model, also known as a schema, to hold the data in a predictable structure for analysis.

The big data and data lake revolutions have radically changed that approach. Now people are gathering data first and using a “come as you are” approach for the data model: basically sourcing and dumping potentially interesting data, as-is, into a big data repository or cloud file store. The analytical and reporting requirements then generally come next, as people (or machines) try to find something useful to do with the data that they have assembled, or try to use it to answer actual business questions. In this new world, the “data modeling” step is largely ignored, or deferred until later, using a “schema on read” approach.
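To make the “schema on read” idea concrete, here is a minimal Python sketch. It assumes the raw records landed in the lake as JSON lines with no model applied at write time. The field names (`user`, `amount`, `channel`), the sample records, and the helper `read_with_schema` are all hypothetical, chosen only for illustration.

```python
import json

# Raw records stored as-is; nothing was validated or modeled at write time.
RAW_LINES = [
    '{"user": "alice", "amount": "12.50", "channel": "web"}',
    '{"user": "bob", "amount": "7"}',
    '{"user": "carol", "note": "no amount recorded"}',
]

def read_with_schema(lines, schema):
    """Apply a schema only at read time.

    `schema` maps a field name to a (converter, default) pair. Fields not
    named in the schema are ignored; missing fields get the default.
    """
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, (convert, default) in schema.items():
            row[field] = convert(record[field]) if field in record else default
        rows.append(row)
    return rows

# The schema is chosen by the reader, for one particular question.
SALES_SCHEMA = {"user": (str, None), "amount": (float, 0.0)}

rows = read_with_schema(RAW_LINES, SALES_SCHEMA)
total = sum(row["amount"] for row in rows)
print(total)
```

A different question could use a different schema over the same raw lines, which is the point of deferring the modeling step.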

Original Link

Maintaining a Data Warehouse

In more traditional IT projects, when a successful system is tested, deployed and in daily operation, its developers can usually sit back and take a well-deserved rest as users come on-board, and leave ongoing maintenance to a small team of bug-fixers and providers of minor enhancements. At least until the start of the next major release cycle. Developers of today’s data warehouses have no such luxury.

The measure of success of a data warehouse is only partly defined by the number and satisfaction level of active users. The nature of creative decision-making support is that users are continuously discovering new business requirements, changing their mind about what data they need, and thus demanding new data elements and structures on a weekly or monthly basis. Indeed, in some cases, the demands may arrive daily!

This need for agility in regularly delivering new and updated data to the business through the data warehouse has long been recognized by vendors and practitioners in the space. Unfortunately, such agility has proven difficult to achieve in the past. Now, ongoing digitalization of business is driving ever higher demands for new and fresh data. Current—and, in my view, short-sighted—market thinking is that a data lake filled with every conceivable sort of raw, loosely managed data will address these needs. That approach may work for non-critical, externally sourced social media and Internet of Things data. However, it really doesn’t help with the legally-binding, historical, and (increasingly) real-time internally and externally sourced data currently delivered via the data warehouse.

Fortunately, the agile and automated characteristics of the Data Vault/data warehouse automation (DWA) approach described in the design, build, and operate phases discussed in earlier posts apply also to the maintenance phase. In fact, it may be argued that these characteristics are even more important in the maintenance phase than in the earlier ones of data warehouse development.

One explicit design point of the Data Vault data model is agility. A key differentiator between Hub, Link, and Satellite tables is that they have very different usage types and temporal characteristics. Such separation of concerns allows changes in both data requirements (frequent and driven by business needs) and data sources (less frequent, but often requiring deep “data archeology”) to be handled separately and more easily than in traditional designs. In effect, the data warehouse is structured according to good engineering principles, while the data marts flow with user needs. This structuring enables continuous iteration of agile updates to the warehouse, continuing through to the marts, by reducing or eliminating rework of existing tables when addressing new needs. For a high-level explanation of how this works, see Sanjay Pande’s excellent “Agile Data Warehousing Using the Data Vault Architecture” article.
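The separation between Hub, Link, and Satellite can be illustrated with a toy Python model. This is only a sketch: the class shapes, the sample keys, and the `describe` helper are assumptions made for illustration, not the formal Data Vault specification or anything generated by a DWA tool. It shows a new requirement being met by adding a satellite, without touching the existing structures.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Hub:
    """A business key and nothing else, e.g. a customer number."""
    business_key: str

@dataclass(frozen=True)
class Link:
    """A relationship between two hubs."""
    left: Hub
    right: Hub

@dataclass(frozen=True)
class Satellite:
    """Descriptive attributes for one hub, stamped with a load time."""
    hub: Hub
    loaded_at: datetime
    attributes: tuple  # (name, value) pairs

customer = Hub("CUST-001")
order = Hub("ORD-42")
placed = Link(customer, order)

# First iteration: one satellite holding the customer's name.
satellites = [Satellite(customer, datetime(2018, 1, 1), (("name", "Acme Ltd"),))]

# A new requirement (credit rating) arrives later. It becomes a new
# satellite; the hub, the link, and the first satellite are left untouched.
satellites.append(
    Satellite(customer, datetime(2018, 6, 1), (("credit_rating", "A"),))
)

def describe(hub, sats):
    """Merge all satellite attributes for a hub, oldest load first."""
    merged = {}
    for sat in sorted(sats, key=lambda s: s.loaded_at):
        if sat.hub == hub:
            merged.update(dict(sat.attributes))
    return merged

print(describe(customer, satellites))
```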

The engineered components and methodology of the Data Vault approach are particularly well-suited to the application of DWA tools, as we saw in the design and build phases. However, it is in the maintain phase that the advantages of DWA become even more apparent. Widespread automation is essential for agility in the maintenance phase, because it increases developer productivity, reduces cycle times, and eliminates many types of coding errors. WhereScape Data Vault Express incorporates key elements of the Data Vault approach within the structures, templates, and methodology it provides to improve a team’s capabilities to make the most of potential automation gains.

Furthermore, WhereScape’s metadata-driven approach means that all the design and development work done in preceding iterations of data warehouse/mart development is always immediately available to the developers of a subsequent iteration. This is provided through the extensive metadata that WhereScape stores in the relational database repository and makes available directly to developers of new tables and/or population procedures. This metadata plays an active role in the development and runtime processes of the data warehouse (and marts) and is thus guaranteed to be far more consistent and up-to-date than typical separate and manually maintained metadata stores such as spreadsheets or text documents.

In addition, WhereScape automatically generates documentation, which is automatically maintained, and related diagrams, including impact analysis, track back/forward, and so on. These artifacts aid in understanding and reducing the risk of future changes to the warehouse, by allowing developers to discover and avoid possible downstream impacts of any changes being considered.

Another key factor in ensuring agility and success in the maintenance phase is the ongoing and committed involvement of business people. WhereScape’s automated, templated approach to the entire design, build, and deployment process allows business users to be involved continuously and intimately during every stage of development and maintenance of the warehouse and marts.

With maintenance, we come to the end of our journey through the land of automating warehouses, marts, lakes, and vaults of data. At each step of the way, combining the use of the Data Vault approach with data warehouse automation tools simplifies technical procedures and eases the business path to data-driven decision making. WhereScape Data Vault Express represents a further major stride toward the goal of fully agile data delivery and use throughout the business. 

Original Link

Data Lakes and Swamps, Oh My

I was lamenting to my friend and fellow MVP Shamir Charania that I didn’t have a topic for this week’s blog post, so he and his colleague suggested I write about data lakes, and specifically the Azure Data Lake.

What Is a Data Lake?

This is what Wikipedia says:

A data lake is a system or repository of data stored in its natural format, usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video). A data swamp is a deteriorated data lake either inaccessible to its intended users or providing little value.

In my opinion, the Wikipedia definition has too many words, so let’s rewrite it:

A data lake is a repository of enterprise data stored in its original format. This may take the form of one or more of the following:

  • structured data from relational databases (rows and columns).
  • semi-structured data (CSV, log files, XML, JSON).
  • unstructured data (emails, documents, PDFs).
  • binary data (images, audio, video).

(I thought the term “data swamp” was a joke, but it’s 2018 and nothing shocks me anymore.)

If that definition of a data lake sounds like a file system, I’d agree. If it sounds like SharePoint, I’m not going to argue either.

However, the main premise of a data lake is a single point of access for all of an organization’s data, which can be effectively managed and maintained. To differentiate “data lake” from “file system,” then, we need to talk about scale. Data lakes are measured in petabytes of data.

Whoa, What’s a Petabyte?

For dinosaurs like me who still think in binary, a petabyte (referred to by some as a pebibyte) is 1,024 terabytes (tebibytes), or 1,125,899,906,842,624 bytes (yes, that’s 16 digits).

In the metric system, a petabyte is 1,000 terabytes, or 1,000,000,000,000,000 bytes.

No matter which counting system we use, a petabyte is roughly one million billion bytes. That’s a lot of data.
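The two definitions can be checked with a few lines of Python arithmetic. The constant names are only labels for this sketch.

```python
BINARY_PETABYTE = 2 ** 50    # 1,024 tebibytes, i.e. a pebibyte
DECIMAL_PETABYTE = 10 ** 15  # 1,000 terabytes

difference = BINARY_PETABYTE - DECIMAL_PETABYTE
ratio = BINARY_PETABYTE / DECIMAL_PETABYTE

print(f"{BINARY_PETABYTE:,}")
print(f"{DECIMAL_PETABYTE:,}")
print(f"{difference:,} bytes apart, ratio {ratio:.3f}")
```

The binary figure is about 12.6 percent larger than the decimal one, which is why “one million billion bytes” is an approximation for the binary definition and exact only for the decimal one.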

Who, What, How?

Internet companies including search engines (Google, Bing), social media companies (Facebook, Twitter), and email providers (Yahoo!) are managing data stores measured in petabytes. On a daily basis, these organizations handle all sorts of structured and unstructured data.

Assuming they put all their data in one repository, that could technically be thought of as a data lake. These organizations have adapted existing tools and even created new technologies to manage data of this magnitude in a field called big data.

The short version: big data is not a 100 GB SQL Server database or data warehouse. Big data is a relatively new field that came about because traditional data management tools are simply unable to deal with such large volumes of data. Even so, a single SQL Server database can allegedly be more than 500 petabytes in size, but Michael J. Swart warns us: if you’re using over 10% of what SQL Server restricts you to, you’re doing it wrong.

Big data is where we hear about processes like Google’s MapReduce. The Apache Foundation created their own open-source implementation of MapReduce called Hadoop. Later, Apache Spark was developed to solve some of the limitations inherent in the MapReduce cluster computing paradigm.
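For readers who have not met the MapReduce paradigm, the following Python sketch shows its three stages on a word count. It is a single-process toy and does not use Hadoop or any real cluster API; the function names and sample documents are made up for illustration.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group all emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

documents = ["big data is big", "data lakes hold data"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

In a real cluster the map and reduce calls run on many machines at once, and the shuffle moves data between them over the network.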

Hadoop and other big data technologies can be thought of as a collection of tools and languages that make analysis and processing of these data lakes more manageable. Some of these tools you’ve already heard of, like JavaScript, Python, R, .NET, and Java. Others (like U-SQL) are specific to big data.

What Is the Azure Data Lake?

From a high level of abstraction, we can think of the Azure Data Lake as an infinitely large hard drive. It leverages the resilience, reliability, and security of Azure Storage you already know and love. Then, using Hadoop and other toolsets in the Azure environment, data can be queried, manipulated and analyzed in the same way we might do it on-premises, but leveraging the massive parallel processing of cloud computing combined with virtually limitless storage.

Note: Microsoft is not the only player in this space. Other cloud vendors like Google Cloud Platform (GCP) and Amazon Web Services (AWS) offer roughly equivalent services for roughly equivalent prices.

Our New Definition

With all of that taken into consideration, here is my new definition for “data lake”:

A data lake is a single repository for all enterprise data, in its natural format, which can be effectively managed and maintained using a number of big data technologies.

Original Link

Unleash Data-Driven Decision-Making Through Agile Analytics

“It is a capital mistake to theorize before one has data.” – Arthur Conan Doyle, author of Sherlock Holmes

Despite the advice of Arthur Conan Doyle, theorizing to a greater or lesser extent is how the majority of business has been conducted until the digital age. Whether you call it gut instinct or business smarts, the ability to spot trends and anticipate demand gives companies the edge over the competition. Now the digital age is taking the guesswork out of the process. Data is redefining decision-making on every front, from operations and engineering activities to research and engagement strategies.

In fact, the data economy is already a multi-billion-dollar industry, generating employment for millions, and yet we’re only just beginning to tap its potential. It’s no accident that digital transformation is on every boardroom agenda. The secret to unlocking future prosperity in almost any business, whether established or a digital native, lies with the data.

Big Data Is Big Business

Today, the key to successful business decision-making is data engineering.

2.5 quintillion bytes of data are generated every day on the internet!

And that figure is growing. So is the desire to put it to good business use. Utilizing vast repositories for storing data, otherwise known as data lakes, is now commonplace. These differ from traditional warehousing solutions in that they aim to present the data in as “flat” a structure as possible, rather than in files and sub-folders, and in their native format as well. In other words, data lakes are primed for analytics.

Drowning in Data

Data lakes have given rise to the concept of the “enterprise data bazaar,” a useful term coined by 451 Research. In the enterprise data bazaar, or marketplace, self-service access to data combines with data governance to produce a powerful platform that enterprises can use to steer the future direction of the business. You can read more in the 451 Research report, Getting Value from the Data Lake.

Data lakes are not without their challenges. Gartner predicts that 80 percent of them are inefficient because their metadata management capabilities are ineffective.

Data Engineering Puts Disparate Data to Work With Agile Analytics

IDC’s Ritu Jyoti spells it out for enterprises, noting, “Data lakes are proving to be a highly useful data management architecture for deriving value in the DX era when deployed appropriately. However, most of the data lake deployments are failing, and organizations need to prioritize the business use case focus along with end-to-end data lake management to realize its full potential.”

When we talk to customers, the business drivers for data engineering are clear. Businesses are crying out for quick access to the right data. They need relevant reports, delivered fast. They want to be able to analyze and predict business behaviors, and then take action in an agile fashion. Data growth shows no signs of slowing, and the business insights enterprises will gain are only as good as the data they put in. As data sets grow, enterprises need to be able to quickly and easily add new sources. Finally, efficiency is a consideration since the cost of data systems, as a percentage of IT spend, continues to grow.

Extracting business value from these vast data volumes requires a rock-solid business strategy, a tried-and-tested approach, and deep technical and sector expertise. We have broken this down into four key phases for big data deployments:

  1. Assess and Qualify: First, the focus is on understanding the nature of the organization’s data, formulating its big data strategies, and building the business case.

  2. Design: Next, big data workloads and solution architecture need to be assessed and defined according to the individual needs of the organization.

  3. Develop and Operationalize: Work to develop the technical approach for deploying and managing big data on-premises or in the cloud. The approach should take into account governance, security, privacy, risk, and accountability requirements.

  4. Maintain and Support: Big data deployments are like well-oiled engines, and they need to be maintained, integrated, and operationalized with additional data, infrastructure and the latest techniques from the fields of analytics, AI, and ML.

Original Link

Transforming ETL for the Data-Driven Age

Are ETL Tools Still Relevant?

This question is facing user-centric organizations and even ETL vendors themselves. Will they be able to survive the ever-changing data landscape? Let’s first understand the genesis of ETL, which originated in the data warehousing world. It had a high learning curve for developers but it provided many benefits like distributed processing, maintainability, being somewhat UI-based instead of scripting, etc.

The changing data transformation process and terminology for the data-driven age can be summed up in the comparison below:

Reporting focused: ETL (Extract, Transform, and Load)

  • The flow was tightly coupled to how data was handled.
  • Real-time data was not considered.
  • Extract and Load: data was selected from a particular source and loaded into a different environment, such as an RDBMS.
  • Transformation: the data was transformed using ETL tools.
  • Standard ETL Processes: data quality, security, metadata management, governance, etc.

Analytics focused: Data Pipeline

  • It is loosely coupled in terms of how data is handled.
  • It includes real-time data. It can be ELT or ETL.
  • Extract and Load: no selection of data is made; the full dataset is dumped into the data lake.
  • Standardized Transformation: the data is transformed using any big data tools/technologies.
  • Ad-Hoc Transformation: self-service data prep tools are used for ad-hoc transformation.
  • Standard Data Processes: data quality, security, metadata management, governance, etc. (still relevant).

Coupling may be an old concept in programming, but it is still a relatively new one when it comes to how data is handled. As mentioned above, ETL flows are tightly coupled, whereas data pipelines are loosely coupled. The loosely coupled approach also has drawbacks, like the creation of data swamps with dark data.

Standardized transformations are still relevant, for which ETL processes can still be followed. But for totally new concepts like data self-service, old processes and practices cannot be used. Standard ETL processes like data quality, security, metadata management, and governance also remain relevant for data-driven organizations.

Data Lake Impact

Big data shook ETL, as it impacted its core value proposition. ETL should start supporting big data eco-system technologies while reinventing itself.

Below are certain ways in which ETL was impacted by big data:

  1. ETL is still relevant in environments that use a DW. Currently, DWs and data lakes complement each other by extending and improving the architecture. This may not remain the case in the future, as all new use cases are being built on data lakes.
  2. Standard transformations were implemented using ETL tools/engines for processing and an RDBMS for storage. Data lakes, by contrast, are used for both processing and storage, so they provide a single platform that is easier and cheaper to use.
  3. Data lakes extend analysis beyond standardized ETL: they enable ingestion first, followed by data prep that is oriented toward self-service and ad-hoc use, which is not possible in ETL.
  4. Data lakes were used for data landing, staging, and archives, which even an RDBMS was not able to handle as a storage solution. Thus, a rethink of how ETL tools were implemented was required.
  5. ETL was not meant to be used in an unstructured environment, but big data processes enable the storage of semi-structured and unstructured data, which makes ETL irrelevant to such types of data. ELT is the way forward for such data.

Legacy ETL approaches have started to lose relevance in the new data-driven world. As new architectures and technologies emerge due to big data, ETL tools need to support new approaches to stay relevant. The shift towards Hadoop and other open architectures meant that legacy ETL vendors were losing ground.

Reinventing ETL – Options

What are the options for vendors to stay relevant by reinventing themselves? Let’s look at them below:

1. Open Source-Based Execution

Proprietary technologies for data processing and storage are losing relevance. ETL vendors should be able to support all the open source execution engines (Spark, MR, etc.) and Hadoop storage.

2. Cloud-Centric

Being cloud-capable is not good enough; ETL tools should support cloud-native architectures alongside on-premises versions. There are new cloud-native ETL tools like SnapLogic, Informatica Cloud, and Talend Integration Cloud which provide an integration Platform-as-a-Service (iPaaS) that resolves many infrastructure challenges, though they still have some ETL limitations and are not as self-service enabled as emerging tools. Hence, more focus on self-service and ML can allow these tools to support ad-hoc work and self-learning, making them more relevant in the new age.

3. Data Prep in the Mix

ETL is a developer-focused data transformation tool, while data prep is a self-service-focused data transformation tool. As we move towards greater use of data lakes for analytics, both for ad-hoc and standard processes, ETL will start to become irrelevant as self-service becomes more pervasive. The two should merge into a single category of data transformation tools that can handle both standard and ad-hoc transformations.

4. AI/ML Focused

AI/ML is an enabler – it enhances data engineers’/developers’ ability to complete their jobs easily and quickly by automating many processes. This may include automatic suggestions for datasets, their transforms, and rules which were not previously possible. AI creates a collaboration between AI algorithms and data workers. AI learns once a suggestion is accepted and tunes the classification and transformation according to the suggestions accepted.

Thus, AI will keep on impacting many parts of the data architecture, including self-learning algorithms in data classification, data modeling, data storage, etc. ETL tools need to support AI solutions. Some vendors have started to provide some AI functionality, but it is still far from being used as the standard solution.

5. Self-Service Design Capability

ETL tools should start supporting the creation of new self-service-based design/flows by enhancing existing tools and providing new tools for such designs. This will help in creating new self-service-based use cases for organizations.

6. Real-Time Support

Real-time support should be provided via open source technologies, either through appropriate changes to the architecture of existing tools or through new tools created for this purpose. Real-time support will enable the tool to cover all the use cases of big data.

7. Big Data Quality 

There are still no ETL tools which can enhance the quality of large amounts of data. There are few which can profile big data processes, but there is no rule-based engine to support such execution. ETL vendors should focus on this critical area to be able to compete with new platform-based tools on Hadoop. Data prep can provide support to some degree, but it cannot be industrialized for executing such use cases. 

8. Matching and Merging Support on Big Data

Somewhere in the grey area of MDM and ETL – matching and merging support for the ingested data in data lakes needs to be provided. This is, again, a critical area and, by using ML technologies, this can be easily provisioned by the vendors.

9. Unified Metadata Catalog Support

A data-driven world will require organizations to have access to the catalog of all of their data. As ETL tools are already a repository of metadata, they should be able to support such a requirement which should require a catalog to be automatically populated, its data automatically categorized/tagged, and have search capabilities and Crowd/Expert ratings enabled.

10. Reusability-Centric Data Lake Design 

ETL tools should, by design, support reusable components so that a few jobs can serve such a design. This has been in the works for a long time, but more emphasis should now be placed on supporting data lake technologies.


As this data-driven age demands ever more data to provide better insights at lower cost, ETL tools need to reinvent themselves, and native technologies will form the future for such tools. ETL may be fading out in use by vendors, but the knowledge that created ETL as a category in data management still provides a base for any such data transformation activity. ETL vendors like Talend, Informatica, etc. have recognized these challenges and have created new products and enhanced existing ones, some specifically for big data and cloud.

Original Link

Azure Data Lake With U-SQL: Using C# Code Behind


In this article, we discuss using C# code behind a U-SQL script, as I’ve noticed that a lot of U-SQL scripts have C# code on the backend.

Now the question is why we are going to use this C# code, as we can create functions, stored procedures, etc., successfully in U-SQL. The answer is quite simple. We want to use the power of C# and the libraries related to it.

For example, suppose we need to create a complex scalar-valued function. Using C#, this is quite easy to do with the built-in math library functionality.

Case Study

To understand the C# code behind our data lake, we are not looking at any complex examples. Here, we have a CSV file that has data for “StudentID,” “StudentName,” “Marks1,” “Marks2,” and “Marks3.”

We are going to retrieve the information from this CSV file and write it to another output CSV file.

We are doing a little transformation work by adding “Marks1,” “Marks2,” and “Marks3” to produce a “Total Marks” value.

We are going to use C# code to create a function named GetTotalMarks. It takes three input marks and returns the total of three input marks.
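As a quick cross-check of the intended result, the same transformation can be sketched in plain Python. This is not U-SQL or the C# code-behind, only an illustration of the expected output. The sample rows are invented, the input is assumed to have no header row, and `get_total_marks` and `add_totals` are hypothetical helper names.

```python
import csv
import io

# Invented sample rows, standing in for StudentRecords.csv (no header row).
SAMPLE = "1,Asha,70,80,90\n2,Ravi,55,65,75\n"

def get_total_marks(marks_1, marks_2, marks_3):
    """Python counterpart of the GetTotalMarks function described above."""
    return marks_1 + marks_2 + marks_3

def add_totals(text):
    """Read header-less CSV rows and append a TotalMarks column."""
    out = []
    for student_id, name, m1, m2, m3 in csv.reader(io.StringIO(text)):
        marks = [int(m1), int(m2), int(m3)]
        out.append([int(student_id), name, *marks, get_total_marks(*marks)])
    return out

result = add_totals(SAMPLE)
print(result)
```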

C# Code Behind

using Microsoft.Analytics.Interfaces;
using Microsoft.Analytics.Types.Sql;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

namespace TestApplication
{
    public static class StudentRecor
    {
        // Returns the sum of the three marks for one student.
        public static Double GetTotalMarks(int marks_1, int marks_2, int marks_3)
        {
            return marks_1 + marks_2 + marks_3;
        }
    }
}

U-SQL Script

// Read the input rows. The rowset name must match the FROM clause below.
@searchlog =
    EXTRACT StudentID int,
            StudentName string,
            Marks1 int,
            Marks2 int,
            Marks3 int
    FROM "C:/Users/Joydeep/AppData/Local/USQLDataRoot/Input-1/StudentRecords.csv"
    USING Extractors.Csv();

// Add the total by calling the C# code-behind function.
@filtering =
    SELECT StudentID,
           StudentName,
           Marks1,
           Marks2,
           Marks3,
           TestApplication.StudentRecor.GetTotalMarks(Marks1, Marks2, Marks3) AS TotalMarks
    FROM @searchlog;

OUTPUT @filtering
TO "C:/Users/Joydeep/AppData/Local/USQLDataRoot/output/Output-1/StudentResult.csv"
USING Outputters.Csv();

Please look at how the function is called in the U-SQL code. The form is:

<Namespace Name>.<Class Name>.<Function Name>

[Figure: job graph]

[Figure: output file]

Hope this helps!

Original Link

HDFS Concurrent Access

Last year, I implemented a data lake. As is standard, we had to ingest data into the data lake, followed by basic processing and advanced processing.

We were using bash scripts for some portions of the data processing pipeline, where we had to copy data from the Linux folders into HDFS, followed by a few transformations in Hive.

To reduce the time taken for data load, we planned to execute these two operations in parallel – that of copying files into HDFS and that of performing the Hive transformations – ensuring that the two operations operated on separate data sets, identified by a unique key.

But, it was not to be. Though both operations executed without errors, Hive threw up errors once we started querying the transformed data.

Upon investigation, we found out that the errors were due to parallel execution. When data is being copied into HDFS (from the Linux folders), Hadoop uses temporary file names until the copy operation is complete. After the copy operation is complete, the temporary file names are removed and the actual file is available in Hadoop.

When Hive is executed in parallel (while the copy operation is in progress), Hive refers to these temporary files. Even though the temporary file names are removed from Hadoop, Hive continues to have a reference to them, causing the above-mentioned error.

Once we ensured that the HDFS copy operation and Hive transformation were not performed in parallel, our problem was solved.
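The fix amounts to running the two steps strictly one after the other. A minimal Python sketch of that ordering is shown below. It does not reproduce the original bash scripts: `run_batches` and the two callables are hypothetical stand-ins, and the batch keys are invented.

```python
def run_batches(batch_keys, copy_to_hdfs, transform_in_hive):
    """Process each batch strictly in order: copy first, transform second.

    `copy_to_hdfs` and `transform_in_hive` are caller-supplied callables
    (hypothetical stand-ins for the bash steps). Each must block until its
    work is finished, so the transform never sees a half-copied file.
    """
    log = []
    for key in batch_keys:
        copy_to_hdfs(key)
        log.append(("copied", key))
        transform_in_hive(key)
        log.append(("transformed", key))
    return log

# Stand-in steps that only record what they were asked to do.
events = []
log = run_batches(
    ["2018-01", "2018-02"],
    copy_to_hdfs=lambda key: events.append(f"copy {key}"),
    transform_in_hive=lambda key: events.append(f"hive {key}"),
)
print(events)
```

In a real pipeline the callables would wrap the actual copy and Hive commands, for example through `subprocess.run(..., check=True)`, so that a failed copy stops the run before any transformation starts.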

Original Link

Solving Architectural Dilemmas to Create Actionable Insights

The organizational trend of evolution rather than revolution is currently at its peak regarding digital transformation projects. Enterprises have adopted a convergence path called DevOps. Instead of using the classic tiered structure that groups teams according to discipline, DevOps teams integrate workers from different departments in order to enhance communication and collaboration. This paradigm promotes a faster project development lifecycle because it eliminates the interdependencies that exist when functions such as software development and IT operations are completely separate.

While this closed-loop paradigm is being implemented at the organizational level, it is still lagging in the supporting infrastructure. Complex integration workflows and slow data pipelines are still the leading architecture choices. This is due primarily to the power of inertia, which preserves old SOA concepts and slows the adoption of technologies that simplify integration and provide much faster data pipelines.

A unified data and analytics platform is needed to provide extreme data processing, fast data ingestion, and advanced analytics. InsightEdge is a converged platform that unifies microservices, analytics, and processing in an event-driven architecture that reduces TCO while increasing performance.

The Four A’s of Big Data 

As Ventana Research analyst David Menninger suggests, organizations must shift their focus from the three V’s of big data (Volume, Variety, and Velocity) to the four A’s: Analytics, Awareness, Anticipation, and Action.

Analytics is the ability to derive value from the billions of rows of data, which opens the door to the other three A’s. Awareness allows the organization to have situational or contextual awareness of the current event stream (such as with NLP). Anticipation is the ability to predict, foresee and prevent unwanted scenarios.

The final A is Action, or Actionable Insights, which means leveraging the first three A’s in real time to take preemptive measures that positively impact the business. This new paradigm and methodology – using actionable insights to assess operational behavior and opportunities – should be the clear choice for decision makers when they assess where to invest their corporate budget. The four A’s can reduce customer churn, optimize network routing, increase profits by facilitating adaptive pricing and demand forecasting, and even save lives by using predictive analytics to avoid calamities like critical equipment failures.

When it comes to building the architecture for your business intelligence application, the four A’s can be challenging to achieve. Over-provisioned and complex architectures that are based on the traditional organizational business intelligence ETL flow result in slow workloads and costly infrastructure.

Slow by Design, not Computation

Traditional enterprise architectures, including the big data Lambda/Kappa architecture, are based on comfortable, familiar Service Oriented Architecture (SOA) concepts that utilize a stack of products, where each component has a specific usage or specialty.

This lego block concept, where each component is agile and can potentially be replaced or upgraded, sounds great in theory. But deploying, managing, and monitoring these components in an architecture while expecting high performance through the workflow may not be realistic. The result can actually be the opposite due to the number of moving parts, bottlenecks, and other single points of failure.

Classic pipeline examples are a combination of message brokers, compute layers, storage layers, and analytical processing layers that exist within yet additional sets of tools, which must ultimately work together in a complex ecosystem. This agile SOA-based architecture is a double-edged sword, which can help on some levels but also be a serious hindrance.

Aside from the complexity factor, this methodology has additional downsides:

  1. High availability must be maintained within every component and throughout the entire workflow for enterprise production environments.
  2. There is a constant need for advanced knowledge about a varied, ever-growing stack of products.
  3. The learning curve for infrastructure management and monitoring is steep.
  4. Complex integration leads to costly over-provisioning.
  5. Increasing the number of components in the architecture can lead to reduced performance when executing the flow or getting insights in real time.

Using multiple solutions with polyglot persistence requires performing Extract-Transform-Load (ETL) operations using external, non-distributed classic ETL tools that are nearly obsolete. Aside from paying for extra software licenses, and possibly consultants with specialized ETL knowledge (ETL/Data Architects and Implementers), you also inject ETL complexities, bugs, limitations and faults into your system. Using this methodology can ultimately hurt performance due to network latency, serialization, synchronous dependencies and other context switches.

How can you overcome these challenges and start focusing on the four A’s instead of the V’s? Begin by reducing the number of moving parts and simplifying the workflow process.

The Benefits of a Unified Insights Platform

At GigaSpaces, we examined what makes the data pipeline so complex, which parts can be simplified, and how to make everything faster – much faster.

On its own, Apache Spark is limited to loading data from the data store, performing transformations, and persisting the transformed data back to the data store. When embedded in the InsightEdge Platform, Apache Spark can make changes to the data directly on the data grid, which reduces the need for multiple data transformations and prevents excessive data shuffling.

The only external tool needed is Kafka (or another message broker) to create a unified pipeline. This one additional component, together with the InsightEdge Platform, provides all the parts you need to derive actionable insights from your data in a single software platform.

Most of the traditional polyglot persistence methodologies, which include key/value stores, document databases, and even in-memory databases, support building a custom workflow using a variety of data store structures. However, classic multi-tier architectures require a storage layer that has multiple “tables” (main store and delta stores) along with multiple layers for hot, warm, cold, and archive/history data. These tiers are built on top of yet more tiers for durable storage and sit underneath an additional management and query tier.

The In-Memory Data Grid tier of the InsightEdge Platform can ingest and store multi-tiered datasets, which eliminates most of the layers required in the traditional table-based database paradigm. The data grid can also utilize advanced off-heap persistence that leverages its unique MemoryXtend feature, which enables scaling out into the multi-tiered range at relatively low cost.

In addition to reducing the number of tiers in both the general architecture and within the In-Memory Data Grid, GigaSpaces has successfully eliminated delta tables because the grid is the live operational store. Unlike traditional databases, the grid can handle massive workloads and processing tasks, ultimately pushing big-data store (e.g. Hadoop) asynchronous replication to the background to put the multi-petabytes in cold storage. With SQL-99 support and the full Spark API, InsightEdge’s management and analytical query tier is essentially a part of the grid, leveraging shared RDDs, DataFrames, and DataSets on the live transactional data.

Simplifying the Workflow: Out-of-the-Box High Availability

Fast data, as seen on the big data curve, is the ability to handle velocity-bound streaming from an eclectic collection of data sources. The three top requirements for these hybrid transactional/analytical processing (HTAP) intensive scenarios are:

  1. A closed-loop analytics pipeline that includes data ingestion, insights, and recommended actions at sub-second latency.
  2. The convergence of myriad data types, especially in IoT and Omni-Channel environments.
  3. Ability to handle and correlate between real-time and historical data in the same pipeline.

A Closed-Loop Analytics Pipeline

The last mile in every big-data project is the ability to improve business based on analytical results – the coveted actionable insights. Most big-data projects are so focused on building and managing the data lake that they don’t get to this final, most important step.

The first challenge in creating a unified platform is addressing the issue of bi-directional integration between the transactional and analytical data stores. Integrating these two tiers usually means having two different storage layers, unified by a query layer/engine that can pull data from both, and then run the blend and batch on a third layer.

Unifying all these layers results in a single strong, consistent transactional layer that acts as the source for all processing, querying, and storage of analytical models.

Convergence of Multiple Data Types

Another challenge is the ability to merge different data types, such as POJO, PONO, and document (JSON, XML) as well as other structured data types, with semi-structured and unstructured data types.

In addition to being able to store, process, and query these vastly different data types, the platform must be able to run analytical computations on the source using Spark APIs (RDDs, DataFrames, DataSets, and SparkSQL) as the same data is being processed in real time.

In standard big data implementations, there is a separation between the transactional processing performed by the applications and the analytical storage and stack of products. This segregation of duties – or misalignment between the two stacks – creates the classic BI (Business Intelligence) “Rear-View Mirror Architecture” paradox. In this paradox, the analytics infrastructure focuses on data accumulation and retrospective analysis rather than on actionable insights. The result is a collection of analytical insights that are of limited or even minimal value by the time the analysis is complete.

To overcome this problem, the InsightEdge Platform contains out-of-the-box mapping between the Spark API and all the different data types that can be ingested. With the whole spectrum of the Spark API available immediately, you can easily and quickly create fast data analytics on top of every data type.


To load data from the data grid, use SparkContext.gridRdd[R]. The type parameter R is a Data Grid model class. For example:

val products = sc.gridRdd[Product]()

After the Spark RDD is created, you can perform any generic Spark actions or transformations on it, such as:

val products = sc.gridRdd[Product]()
// count() is a standard Spark action
println(products.count())

To save a Spark RDD to the data grid, use the RDD.saveToGrid method. It assumes that the type parameter T of your RDD[T] is a Space class; otherwise, an exception is thrown at runtime. For example:

val rdd = sc.parallelize(1 to 1000).map(i => Product(i, "Description of product " + i, Random.nextInt(10), Random.nextBoolean()))
rdd.saveToGrid()

To query a subset of data from the data grid, use the SparkContext.gridSql[R](sqlQuery, args) method. The type parameter R is a Data Grid model class, the sqlQuery parameter is a native Datagrid SQL query, and args are arguments for the SQL query. For example, to load only products with a quantity of more than 10:

val products = sc.gridSql[Product]("quantity > 10")
// We use parameters to ease development and maintenance
val featuredProducts = sc.gridSql[Product]("quantity > ? and featuredProduct = ?", Seq(10, true))

Data Frames API

Spark RDDs are stored in the data grid as collections of objects of a certain type. You can create a Data Frame for the required type using the following syntax:

val spark: SparkSession // An existing SparkSession.
import org.insightedge.spark.implicits.all._
val df = spark.read.grid[Person]
df.show() // Displays the content of the DataFrame to stdout

Dataset API

val spark: SparkSession // An existing SparkSession.
import org.insightedge.spark.implicits.all._
val ds: Dataset[Person] = spark.read.grid[Person].as[Person]
ds.show() // Displays the content of the Dataset to stdout
// We use the dot notation to access individual fields
// and count how many people are below 60 years old
val below60 = ds.filter( p => p.age < 60).count()

Predicate Pushdown

One reason why the InsightEdge Platform is so powerful is its ability to use the Spark/Grid API to “push down” predicates to the data grid, leveraging the grid’s indexes and aggregation power transparently to the user. The workload is delegated behind the scenes between the data grid and Spark – tasks are sent to the grid to take advantage of its indexes, filtering, and superior aggregation capabilities.
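To make the idea concrete without depending on Spark or the grid, here is a toy sketch in plain Scala. ToyGrid is a hypothetical stand-in for a grid partition that keeps its rows grouped by quantity; it is not the InsightEdge API. The sketch only compares how many rows leave the store when the filter runs in the caller versus inside the store.

```scala
object PushdownSketch {
  final case class Product(id: Int, quantity: Int)

  // Hypothetical stand-in for a grid partition with an index on `quantity`.
  final class ToyGrid(rows: Seq[Product]) {
    private val byQuantity: Map[Int, Seq[Product]] = rows.groupBy(_.quantity)

    // No pushdown: every row leaves the store and the caller filters later.
    def scanAll(): Seq[Product] = rows

    // Pushdown: the store consults its index and returns only matching rows.
    def scanWhereQuantityGreaterThan(min: Int): Seq[Product] =
      byQuantity.toSeq.filter(_._1 > min).flatMap(_._2)
  }

  // 1000 products whose quantities cycle through 0..19.
  val grid = new ToyGrid((1 to 1000).map(i => Product(i, i % 20)))

  // Rows handed to the compute layer in each case.
  val withoutPushdown: Int = grid.scanAll().size
  val withPushdown: Int = grid.scanWhereQuantityGreaterThan(10).size

  // Both routes give the same answer; only the volume moved differs.
  val sameAnswer: Boolean = grid.scanAll().count(_.quantity > 10) == withPushdown
}
```

In this toy data set the pushed-down query hands over fewer than half of the rows that a full scan would, which is the effect the platform aims for at much larger scale.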

Blending Real-Time and Historical Data in the Same Pipeline

The final challenge is the need to unify and scale both real-time and historical data. Existing pipelines are built using the polyglot persistence model, which is fine in theory. However, this model has serious limitations regarding performance and management. The InsightEdge answer to this challenge is performing lazy loading and customized prefetching of all the hot and warm data to the platform so that the data is immediately available for batch queries and correlation with the incoming stream. Correlation matching and complex queries can be run very easily because there is no data shuffling back and forth between multiple storage layers and the analytical tier. The result of this approach is a live, operational data lake that doesn’t risk turning into a data swamp.

Real-Time Data Lake and Machine Learning

Operationalizing a data lake is a fundamental methodology shift that is needed to prevent your data from becoming stale, obsolete, and, finally, an enterprise data swamp. Simply building a storage layer without utilizing it correctly for both real-time and batch processing is the reason why, according to Forbes, 76% of data-driven enterprises are stuck somewhere between data accumulation and being solely reactive to the collected data.

When examining the analytics or machine learning value chain, the business impact increases exponentially with the ability to extrapolate meaningful data and finally create actionable insights.

This significant change in business analytics can be seen when the infrastructure is mature enough to allow an organization to move from (step 1) being reactive to (step 2) being predictive, and ultimately (step 3) making proactive decisions. To better understand the value of this paradigm shift, let’s examine a simple use case: a machine such as a train, which has brakes that will wear out over time. The goal is to proactively keep the train running at an optimal level, without allowing the brakes to fail while the train is in service.

The first step in achieving the above goal is to run analytics in real time in order to detect anomalies or abnormal behavior as the train is moving. When applying this idea to a business organization, metrics can be collected and analyzed in real time (also implementing classic algorithms and machine learning).

The second step is a little more complicated. The goal is not only to react when an anomaly is discovered but also to predict potential equipment or machine failure ahead of time. Being able to predict maintenance costs and plan accordingly lowers the total cost of ownership (TCO) by preventing a chain of failures that can be costly to repair. While the above example addresses only TCO, this capability can have life-saving applications for organizations in safety scenarios, such as predictive maintenance on train brakes, elevator systems, and aviation safety.

The third step is basically the “final frontier” in business analytics. We leverage the results of the previous two steps (descriptive and predictive) and apply mathematical and computational sciences to create a statistical decision tree. Action can then be taken based on this decision tree to achieve optimal results. To keep your train running at peak safety, this decision may involve purchasing a specific type of brake pad that best suits the use pattern that has been identified.
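The three steps can be sketched in a few lines of plain Scala. This is an illustration under loud assumptions: the wear limit, the linear wear model, and the action strings are all hypothetical, and a hand-written rule stands in for the statistical decision tree described above.

```scala
object BrakeMaintenance {
  // Hypothetical wear limit in millimetres; a real limit comes from the manufacturer.
  val WearLimitMm = 10.0

  // Step 1 (reactive): flag a reading that has already reached the limit.
  def isAnomalous(wearMm: Double): Boolean = wearMm >= WearLimitMm

  // Step 2 (predictive): assume wear grows linearly between inspections and
  // estimate how many more inspections fit before the limit is reached.
  def inspectionsUntilLimit(readings: Seq[Double]): Option[Int] =
    if (readings.size < 2) None
    else {
      val ratePerInspection = (readings.last - readings.head) / (readings.size - 1)
      if (ratePerInspection <= 0) None
      else Some(math.ceil((WearLimitMm - readings.last) / ratePerInspection).toInt)
    }

  // Step 3 (proactive): turn the prediction into an action. A simple rule
  // stands in here for a statistical decision tree.
  def decide(readings: Seq[Double], leadTimeInspections: Int): String =
    if (readings.lastOption.exists(isAnomalous)) "stop and replace brakes"
    else {
      inspectionsUntilLimit(readings) match {
        case Some(n) if n <= leadTimeInspections => "order replacement pads now"
        case _ => "keep monitoring"
      }
    }
}
```

For example, with wear readings of 1.0, 2.0, and 3.0 mm, the estimate is seven more inspections before the limit, so decide returns "order replacement pads now" when the supplier lead time is eight inspections and "keep monitoring" when it is three.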

Looking at a business organization, this type of data analysis can guide leaders in making much more effective decisions, which are based on strong, current statistical forecasting instead of outdated historical analysis.

Visualizing the Data

Data scientists and project managers often require quick insight into data to understand the business impact and don’t want to waste valuable time consulting their corporate IT team. InsightEdge provides multiple ways of visualizing the data stored in the XAP in-memory data grid. 

In addition to the built-in Apache Zeppelin notebook, InsightEdge contains the powerful In-Grid SQL Query feature, which includes a SQL-99 compatible JDBC driver.  This provides a means of integrating external visualization tools, such as Tableau. These tools can be connected to the grid via a third-party ODBC-JDBC gateway, in order to access the data stored in the XAP in-memory data grid and present it in a visual format.


The ability to process and analyze data in a simple, fast and transactional platform is no longer optional; it is necessary in order to handle the ever-growing data workloads, leverage new deep learning frameworks like Intel’s BigDL, and lower your organization’s TCO. While technologies for managing big data are constantly and rapidly evolving, the organizational methodologies for processing data are lagging behind.

GigaSpaces’ InsightEdge Platform breaks away from the traditional tiered approach and provides a simpler, faster workflow by consolidating the In-Memory ingestion and processing tier together with the analytical tier in a tightly coupled microservices architecture. 

Original Link

Building Data Lakes for GDPR Compliance

If there’s one key phenomenon that business leaders across all industries have latched onto in recent years, it’s the value of data. The business intelligence and analytics market continues to grow at a massive rate (Gartner forecasts the market will reach $18.3 billion in 2017) as organizations invest in the solutions that they hope will enable them to harvest the potential of that data and disrupt their industries.

But while companies continue to hoard data and invest in analytics tools that they hope will help them determine and drive additional value, the General Data Protection Regulation (GDPR) is forcing best practices in the capture, management and use of personal data.

The European Union’s GDPR stipulates stringent rules around how data must be handled. Because the regulation affects the entire data lifecycle, organizations must have an end-to-end understanding of their personal data, from collection and processing through storage and, finally, destruction.

As companies scramble to make the May 25 deadline, data governance is a key focus. But organizations cannot just think of the new regulations as a box to check. Continuous compliance is required, and most organizations are having to create new policies that will help them achieve a privacy-by-design model.

Diverse Data Assets

One of the great challenges posed in securely managing data is the rapid adoption of data analytics across businesses as it moves from an IT office function to become a core asset for business units. As a result, data often flows in many directions across the business, so it becomes difficult to understand the data about the data — such as lineage of data (where it was created and how it got there).

Organizations may have personal data in many different formats and types (both structured and unstructured) across many different locations. Under the GDPR, it will be crucial to know and manage where personal data is across their business. While no one is certain in exactly what form GDPR will be enforced, organizations will need to be able to demonstrate that their data management processes are continually in compliance with the GDPR at a moment’s notice.

With the diverse sources and banks of data that many organizations have, consolidating this data will be key to effectively managing their compliance with the GDPR. With the numerous different types of data that must be held across an organization, data lakes are a clear solution to the challenge of storing and managing disparate data.

Pool Your Data

A data lake is a storage method that holds raw data, including structured, semi-structured, and unstructured data. The structure and requirements of the data are only defined once the data is needed. Increasingly, we’re seeing data lakes used to centralize enterprise information, including personal data that originates from a variety of sources, such as sales, CX, social media, digital systems, and more.

Data lakes, which use tools like Hadoop to track data within the environment, help organizations bring all their data together in one place where it can be maintained and governed collectively. The ability to store structured, semi-structured, and unstructured data is crucial to the value of this approach for consolidating data assets, compared to data warehouses, which primarily maintain structured, processed data. Enabling organizations to discover, integrate, cleanse, and protect data that can then be shared safely is essential for effective data governance.

Beyond the view across the full expanse of the data lake, organizations can look upstream to identify where data originated before it flowed into the lake. That way, organizations can track specific data back to its source, such as the CX or marketing applications, providing end-to-end visibility across their entire data supply chain so that it can be scrutinized and identified as necessary.
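As a rough illustration of what tracking data back to its source can look like, the sketch below models a tiny lineage catalog in plain Scala. The DatasetEntry fields, dataset names, and source systems are hypothetical and chosen only for illustration; a real deployment would rely on the metadata tooling of its own platform.

```scala
object LineageSketch {
  // Hypothetical catalog entry: one dataset in the lake and where it came from.
  final case class DatasetEntry(
    name: String,
    sourceSystem: String,        // originating application, e.g. "crm"
    derivedFrom: Option[String], // parent dataset in the lake, if any
    hasPersonalData: Boolean
  )

  // Follows `derivedFrom` links upstream until a dataset with no parent is
  // reached. Assumes the catalog contains no cycles.
  def traceToSource(name: String, catalog: Map[String, DatasetEntry]): List[String] =
    catalog.get(name) match {
      case None => Nil
      case Some(e) =>
        e.name :: e.derivedFrom.map(parent => traceToSource(parent, catalog)).getOrElse(Nil)
    }

  val catalog: Map[String, DatasetEntry] = Seq(
    DatasetEntry("raw_crm_contacts", "crm", None, hasPersonalData = true),
    DatasetEntry("cleaned_contacts", "crm", Some("raw_crm_contacts"), hasPersonalData = true),
    DatasetEntry("churn_features", "crm", Some("cleaned_contacts"), hasPersonalData = true),
    DatasetEntry("web_clicks", "marketing", None, hasPersonalData = false)
  ).map(e => e.name -> e).toMap

  // Datasets that would need review for a personal-data request.
  val personalDatasets: Set[String] =
    catalog.values.filter(_.hasPersonalData).map(_.name).toSet
}
```

Calling traceToSource("churn_features", catalog) walks from the derived dataset back through cleaned_contacts to raw_crm_contacts, which is the kind of upstream view the paragraph above describes.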

This end-to-end view of personal data is crucial under the GDPR, enabling businesses to identify the quality and point of origin for all their information. Further to enabling organizations to store, manage, and identify the source of all their data, data lakes provide a cost-effective means for organizations to store all their data in one place. On the other hand, managing this large volume of data in a data warehouse has a far higher TCO.

Setting the Foundations

While data lakes currently present the best approach for data management and governance for GDPR compliance, this will not be the last stop in organizations’ journey towards innovative, efficient and compliant data management. The data storage approaches of the future will be built with consideration for the new regulatory climate and will be created to serve and adhere to the challenges it presents.

However, with the demand on organizations to create data policies and practices that will support the compliance of their future data storage and analytics endeavors, it is clear that businesses need to start refining processes and policies that will lay the foundations for compliant data innovation in the future. Being able to quickly and easily identify and access all data, with a clear understanding of its source and stewardship, is now the minimum standard for the management of personal data.

The Clock Is Ticking

Time is running out for many organizations on achieving GDPR compliance, with just weeks until its enforcement. However, companies must take a long-term view and build a data storage model that will enable them to consolidate, harmonize and identify the source of their data in compliance with the GDPR.

GDPR is bringing new dimensions to customer demand: customers now value trust and transparency and will vote with their feet. They will follow companies that are able to deliver personalized interactions while letting their customers take full control over their personal data. Ultimately, companies that establish a system of trust at the core of their customer and/or employee relationships will win in the digital economy.

Original Link

A Tale of Two Trade Shows (Gartner vs Strata)

The week of March 5th was an interesting one. Gartner and Strata had their big smackdown with competing trade shows during the same week, and, based on our attendance at both events, you could not have had two more different perspectives on the industry. Gartner held their Data & Analytics Summit in Grapevine, Texas while O’Reilly (with Cloudera) held Strata in San Jose, CA.

To give you a sense of the events if you are not familiar, Gartner’s event is mostly Gartner analysts speaking about their perspective on the market with a few customer case studies and vendor pitches mixed in. It is usually a perspective that is targeting the mainstream of the market and not looking to project too much into the future of technology. It is targeting an executive audience and is, therefore, more about the business impact of technology.

Strata, on the other hand, is much more focused on what is coming. The audience is much more technology-based, with a lot of actual implementers in the audience. Many of the presentations get into the technical detail of how someone actually achieved some specific result for a specific project. In contrast to Gartner, this event is much more about what is coming and where the future might be heading.

On one hand, at Strata, you would think that everyone already had a fully deployed Hadoop cluster and had it in production given the intense focus on machine learning and AI. Given our experience as a tech company helping mainstream companies automate their big data deployments, this is a bit of an exaggeration. Clearly, the mainstream market needs help to simplify the complexity of big data. However, that perspective was still in stark contrast with Gartner, where one analyst put forward the idea that data lakes could/should be built on Teradata. I find this point a little hard to accept given Teradata’s shrinking revenue and the continually shrinking market share of all of the traditional DW vendors. Just like a tiger can’t trade its stripes for spots, traditional DWs can’t magically become highly flexible and agile platforms that support all new kinds of semi-structured and unstructured data types.

We also noticed some progression in thinking this year. In past years, a lot of the Strata attendees shared the view that they didn’t want or need software tools or platforms to simplify away the complexity of Hadoop because they could just as easily code things by hand. I am happy to say that attitude changed quite a bit this year, with a large number of discussions about how organizations were looking for automation to reduce the complexity of developing and deploying on Hadoop.

Actually, we also noticed a lot of interest in discussing automation at Gartner as well. In fact, the concept of automation was one area of great consistency across both events. In general, however, our perspective at Infoworks was that both events were operating a bit at the extreme. Strata was looking too far in the future, while Gartner was holding on too much to the past. 

The reality is somewhere in between. Big data technologies like Hadoop are clearly the forward-looking platforms that will enable machine learning (ML) and AI as part of an analytics technology stack. Legacy data warehouses will not somehow evolve to be a different beast that can handle the new kinds of data nor will they cost-effectively deal with ML and AI. At the same time, big data needs to mature to a point where it doesn’t take an army of experts to deploy it. Focusing on ML and AI when you can’t even implement big data into a repeatable production environment for even the most basic use cases is also unrealistic. For most organizations, it is getting out over the tips of their skis.

There is simply too much venture capital investment in the big data space to believe that it will just disappear. All of that VC money will ultimately close the complexity and maturity gap, and new big data technologies will augment the existing DW systems and may even replace many of them with a more cost-effective approach. But all of this will take time.

In the meantime, if you get to attend both Strata and the Gartner conferences next year when they are not during the same week, you will get two very different, yet interesting perspectives that are at least both worth considering. 

Original Link

8 Key Takeaways From the MDM and Data Governance Summit

A few weeks ago, I had the great opportunity to attend the MDM and Data Governance Summit held in NYC. The summit was packed with information, trends, best practices, and research from the MDM Gartner Institute. Much of the information is something you don’t find in webinars or white papers on the internet. Speakers came from many different industries and brought different perspectives to the data management concepts.

I also got to meet IT and business executives who are at different points in their MDM and data governance journeys. As I reflect back on the conference, I wanted to share some of my key highlights and takeaways from the event as we all start to prepare for our IT and data strategies in 2018.

1. MDM Main Drivers

The top five main drivers of MDM are:

  1. Achieving synergies for cross-selling

  2. Compliance

  3. Customer satisfaction

  4. System integration

  5. Economies of scale for M&A

A new driver also has been considered recently — digital transformation — and MDM is at the core of digital transformation.

2. IT and Business Partnerships are More Important Than Ever

If there was one thing that everyone at the summit agreed upon, it was that the partnership of the business and IT directly impacts the success of the MDM and data governance programs more than any other factor. For one banking company, this happened naturally, as the business also understood the power of data and the issues related to it. But largely, the experience of the people involved in this partnership is that it’s an uphill battle to get the buy-in from the business. Tying the project to a specific business objective is critical in these scenarios. The bottom line is that a solid partnership between business and IT will provide the right foundation for an MDM program.

3. Managing Change Is Critical

It’s widely accepted that any MDM journey is long and takes energy and perseverance. One company made the journey easier by starting with the domain that they thought would have the most impact.

4. Data Governance Council

Only about 20% of the audience had some form of data governance council but all the case studies presented had a data governance council in place. The council was made up of both business teams and IT teams. There is no real pattern from the organizational structure perspective. An insurance company that has a hugely successful MDM implementation has the Enterprise Information Management Team as part of the Compliance Team. Another financial company had the team reporting to the COO. It all depends on how your company is organized and does business.


5. GDPR

This topic was everywhere. 50% of the audience said they were impacted by this regulation. But it looks like a lot of companies still lag way behind in terms of preparing their enterprise data for compliance. This is a serious issue, as there are fewer than 150 days to get ready. One of the speakers said that MDM is the heart of GDPR implementation.

6. Next-Generation MDM

Data-as-a-Service is something every company should aim for in the next two to three years. Also, bringing in social media and unstructured data will be key to gain actionable insights from MDM initiatives. Large enterprises have moved beyond CDI and PIM to focus on relationships and hierarchies. Cloud MDM will be in demand, but there is potential for creating more data silos as integration becomes a challenge.

7. Big Data Lakes

There are just too many technologies in the Big Data space. Therefore, solution architecture becomes key when building a Data Lake. A common implementation was to load the data from legacy systems into Hadoop without any transformation. But without metadata, the lake quickly becomes a swamp. So, to get true value from Big Data Analytics, MDM and Data Governance have to be effective and sustainable. Also, from a technology perspective, there needs to be sound integration with big data systems. My company, Talend, has been at the forefront of Big Data Integration, providing a unified platform for MDM, DQ, ESB and Data Integration.


Finally, I want to end this blog with some great quotes from the speakers:

  • “Digital transformation requires information excellence.”

  • “If you don’t know where you are, a map won’t help.”

  • “Big data + data governance = big opportunity.”

  • “Data is a precious thing and will last longer than the systems themselves.”

  • “There is no operational excellence without data excellence.”

  • “A shared solution is the best solution.”

  • “People and processes are more critical than technology.”

  • “Rules before tools.”

  • “Master data is the heart of applications and architecture.”

  • “There is no AI without IA (Information Agenda).”

As you prepare for MDM and data governance initiatives in 2018, I hope some of my takeaways will spark new ideas for you on how to have a successful journey to MDM.


How to Streamline Query Times to Handle Billions of Records

Here at Sisense, we love a challenge. So when a client comes to us and tells us they need to find a way to run queries on billions of records without this slowing them down, our ears perk up and we leap at the chance to find a solution.

In fact, that’s how we recently found ourselves testing a billion transactional records and three million dimensional records — totaling a whopping 500GB of data — with 100 concurrent users and up to 38 concurrent queries, with a total setup time of just two hours… and an average query time of 0.1 seconds!

But wait! I’m getting ahead of myself. Let’s start by talking through some of the issues that affect how fast you can query data.

How Are You Storing Your Data?

Let’s start with the obvious: data warehousing.

Typically, working with masses of data means you also need extensive data warehousing in place to handle it, alongside extract-transform-load (ETL) tools that pull data from the original source on a regular basis (extract), adjust formats and resolve conflicts to make the datasets compatible (transform), and then deliver all of this data into the analytical repository, where it’s ready for you to run queries, calculations, and trend analysis (load).

This creates a single version of the truth — a source of data that brings together all your disparate pieces into one place. While this is great, there are also some drawbacks to data warehousing.

First of all, data warehouses are highly structured, and the row-and-column schema can be overly restrictive for some forms of data. Also, the sheer volume of data quickly overloads most systems, which grind to a halt if you run queries that attempt to tap into the entire data pool.
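To make the extract, transform, and load steps described above a little more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the two in-memory source lists, the field names, the "first source wins" conflict rule, and the use of SQLite as a stand-in for the analytical repository. A real warehouse pipeline would be considerably more involved.

```python
import sqlite3
from datetime import datetime

# Two hypothetical source systems that describe orders in different formats.
CRM_ROWS = [
    {"order_id": "A-1", "amount": "19.99", "date": "2018-01-05"},
    {"order_id": "A-2", "amount": "5.00", "date": "2018-01-06"},
]
SHOP_ROWS = [
    {"id": "A-2", "total_cents": 500, "day": "06/01/2018"},  # same order as A-2 above
    {"id": "B-7", "total_cents": 1250, "day": "07/01/2018"},
]


def extract():
    """Extract: pull raw rows from each source, tagged with their origin."""
    for row in CRM_ROWS:
        yield "crm", row
    for row in SHOP_ROWS:
        yield "shop", row


def transform(tagged_rows):
    """Transform: normalize formats and resolve conflicts (first source wins)."""
    seen = {}
    for source, row in tagged_rows:
        if source == "crm":
            record = (row["order_id"], float(row["amount"]), row["date"])
        else:
            # Convert cents to currency units and DD/MM/YYYY to ISO dates.
            day = datetime.strptime(row["day"], "%d/%m/%Y").date().isoformat()
            record = (row["id"], row["total_cents"] / 100, day)
        seen.setdefault(record[0], record)  # keep the first version of each order
    return list(seen.values())


def load(records, conn):
    """Load: write the cleaned rows into the analytical repository."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id TEXT PRIMARY KEY, amount REAL, order_date TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


def run_etl(conn):
    """Run all three steps, then return (row count, total amount) as a check."""
    load(transform(extract()), conn)
    return conn.execute(
        "SELECT COUNT(*), ROUND(SUM(amount), 2) FROM orders"
    ).fetchone()


if __name__ == "__main__":
    print(run_etl(sqlite3.connect(":memory:")))
```

The duplicate order is collapsed during the transform step, so the repository ends up with one row per order no matter how many sources mention it.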

Then, there are data marts.

To help tackle the issues that come with working with huge data sets, many IT teams deploy data marts alongside their databases. These essentially siphon off access to a smaller chunk of the data, and then you select which data marts each department or user has access to. The outcome of this is that you put less pressure on your hardware, as your computer is tapping into smaller pools of data; but the flipside is that you have vastly reduced access to the organization’s total data assets in the first place.
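As a rough illustration of that trade-off, the sketch below carves a per-region data mart out of a tiny warehouse table. It assumes SQLite as a stand-in for the warehouse, and the `sales` table, its columns, and the `build_mart` helper are all hypothetical names made up for this example.

```python
import sqlite3

# A tiny stand-in for the full warehouse (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("EMEA", "widget", 10.0), ("EMEA", "gadget", 20.0), ("APAC", "widget", 7.5)],
)


def build_mart(conn, region):
    """Copy one region's rows into their own, smaller table (the data mart)."""
    name = "mart_" + region.lower()
    if not name.isidentifier():
        raise ValueError("unsafe region name: " + repr(region))
    conn.execute(f"DROP TABLE IF EXISTS {name}")
    conn.execute(f"CREATE TABLE {name} (product TEXT, amount REAL)")
    conn.execute(
        f"INSERT INTO {name} SELECT product, amount FROM sales WHERE region = ?",
        (region,),
    )
    return name


if __name__ == "__main__":
    mart = build_mart(conn, "EMEA")
    # Queries against the mart scan fewer rows, but cannot see other regions.
    print(conn.execute(f"SELECT COUNT(*), SUM(amount) FROM {mart}").fetchone())
```

Queries against the mart only ever see the EMEA rows, which is exactly the reduced pressure, and the reduced access, described above.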

At the other end of the scale, you have data lakes.

These are used to store massive amounts of unstructured data, helping to bypass some of the issues that come with using conventional data warehouses. They also make sandboxing easier, allowing you to try out different data models and transformations before you settle on a final schema for your data warehouse to avoid getting trapped into something that doesn’t work for you.

The trouble with data lakes is that, while they offer formidable capacity for storing data, you do need to have all kinds of tools in place to interface between the data lake and your data warehouse, or with your end data analytics tool if you want to skip the need for warehousing on top. Systems like this that use data lakes aren’t exactly agile, so your IT team will need to be pretty heavily involved in order to extract the insights you want.

Alternatively, you might deal with unstructured data using an unconventional data storage option.

For example, you might use a NoSQL database like MongoDB. This gives you tons of freedom in terms of the kind of data you add and store, and the way that you choose to store it. MongoDB also makes use of sharding techniques to avoid piling the pressure on your IT infrastructure, allowing for (pretty much) infinite scaling.

The downside, of course, is that the thing that makes this so great — the unstructured, NoSQL architecture — also makes it tricky to feed this data straight into a reporting tool or analytics platform. You need a way to clean up and reconcile the data first.
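For readers who haven’t come across sharding before, here is a minimal Python sketch of the general idea: each document is routed to one of several shards based on a hash of its key, so no single node has to hold everything. This is not MongoDB’s implementation (MongoDB handles routing and balancing for you); the `ShardedStore` class and its methods are hypothetical names used only for illustration.

```python
import hashlib


class ShardedStore:
    """Toy key-value store that spreads documents across in-memory shards."""

    def __init__(self, shard_count):
        self.shards = [dict() for _ in range(shard_count)]

    def _shard_for(self, key):
        # Hash the key so documents spread evenly, whatever the keys look like.
        digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def insert(self, key, document):
        self.shards[self._shard_for(key)][key] = document

    def find(self, key):
        # Only the single shard that owns the key is consulted.
        return self.shards[self._shard_for(key)].get(key)

    def shard_sizes(self):
        return [len(shard) for shard in self.shards]


if __name__ == "__main__":
    store = ShardedStore(shard_count=4)
    for i in range(1000):
        store.insert(f"user-{i}", {"n": i})
    print(store.shard_sizes())
```

A lookup by key touches one shard only, which is why adding shards lets this kind of store keep growing; the flexible, schema-free documents are also what makes reporting on top of it harder.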

What About Tools Used for Analysis?

Dynamic DBMS tools like PostgreSQL can open doors.

PostgreSQL is an open-source relational database management system that is often used for analytics and reporting. It allows you to work with an enormous variety of data types, including native data types that give you much more freedom as you come to build and manipulate a BI solution, and “array” types, which help you to aggregate query results rapidly on an ad hoc basis.

Introducing PostgreSQL into the mix can be massively helpful in bringing together your disparate strands, but again, it can’t do everything. It can’t help much with qualitative data, and as a relational database (which wasn’t built to handle Big Data) it will buckle under huge volumes of information.

You can also use R for high-end predictive analytics. Once you have a solid BI system in place, you can add another layer of awesomeness by using R to build working models for statistical analysis quickly and easily. R is incredibly versatile and allows you to move away from static reporting by programming a system for analysis that you can adapt and improve as you go.

The thing is, though, this is an add-on; it doesn’t replace your current BI or data analytics system. R is an excellent programming language that can help you generate predictive analytics fast, but you need to have a rock-solid system in place for handling and preparing data in the first place.

How to Streamline Everything

I know what you’re thinking: I said I was going to explain how to streamline your data queries to help you generate results faster, but so far, all I’ve done is dangle some potential solutions and then show you how they fall short!

That’s because I haven’t revealed the secret sauce that binds all these pieces together in perfect harmony.

As you can see, each of the tools we’ve discussed is used to fix one problem in the storage, flow, and use of data within your organization, but they don’t help with the big picture. That’s where Sisense’s Elasticube comes in. The Elasticube allows you to store data or drag it in directly from your existing stores at lightning speed, giving users unfettered access to their entire pool of data, whatever format it’s kept in (unless you choose to stagger permissions). Thanks to clever use of In-Chip Processing and a Columnar Database structure, you tap into only the data you need for the query, without restricting yourself permanently, as you would with a data mart.

You can then reconcile and harmonize this data with minimal hassle to treat all these strands as a single data source for the purpose of analysis and reporting.

Still within the Elasticube, you can map and manipulate these data sources to build your own dashboards and run your own queries at incredible speed.

Plus, using our range of custom-built connectors, you can link your Sisense Elasticube directly to MongoDB, PostgreSQL, and other DBMS tools, and you can integrate Sisense with R for even more in-depth predictive analytics.
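If you’re wondering why a columnar structure helps a query touch only the data it needs, the Python sketch below shows the general principle. It is not how the Elasticube is implemented; the sample rows and the `to_columns` and `total_by_region` helpers are hypothetical, and the sketch assumes every row has the same fields.

```python
# Row-oriented records, the way a transactional system might hand them over.
ROWS = [
    {"order_id": 1, "region": "EMEA", "amount": 10.0, "note": "gift wrap"},
    {"order_id": 2, "region": "EMEA", "amount": 20.0, "note": ""},
    {"order_id": 3, "region": "APAC", "amount": 7.5, "note": "expedite"},
]


def to_columns(rows):
    """Pivot row records into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_by_region(columns):
    """Aggregate using only the two columns the query actually needs."""
    totals = {}
    for region, amount in zip(columns["region"], columns["amount"]):
        totals[region] = totals.get(region, 0.0) + amount
    return totals


if __name__ == "__main__":
    print(total_by_region(to_columns(ROWS)))
```

The aggregation never reads `order_id` or `note`, so with wide tables and billions of rows a column layout can skip most of the stored data for a typical analytical query.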

So that’s the big secret. Using the Sisense Elasticube, I was able to set up a system in 120 minutes that could run concurrent queries on data representing one billion online purchases, from three million origins/destinations, with an average query time of 0.1 seconds and a maximum query time of just 3 seconds.

Pretty impressive, huh? Here’s what it looked like:


And here’s an example dashboard that we used to display the results in real time:


How’s that for streamlined? 
