Data Governance

Tips for Enhancing Your Data Lake Strategy

As organizations grapple with how to effectively manage ever more voluminous and varied reservoirs of big data, data lakes are increasingly viewed as a smart approach. However, while the model can deliver the flexibility and scalability lacking in traditional enterprise data management architectures, data lakes also introduce a fresh set of integration and governance challenges that can impede success.

The Power and Potential of Data Lakes

Born from the rise of the cloud and big data technologies like Hadoop, data lakes provide a way for organizations to cost-effectively store nearly limitless amounts of structured and unstructured data from myriad sources, without regard to how that data might be leveraged in the future. By its very nature, and through self-service business intelligence capabilities, a data lake also encourages experimentation and data exploration by a broader set of users beyond business analysts. According to a survey conducted by TDWI Research, 85 percent of respondents considered the data lake an opportunity to address the challenges they face trying to manage the data deluge with traditional relational databases. Moreover, the TDWI survey found the data lake being pursued for a variety of benefits and use cases, the most prominent being advanced analytics (49 percent) and data discovery (49 percent).

Building Data Lakes for GDPR Compliance

If there’s one key phenomenon that business leaders across all industries have latched onto in recent years, it’s the value of data. The business intelligence and analytics market continues to grow at a massive rate (Gartner forecasts the market will reach $18.3 billion in 2017) as organizations invest in the solutions they hope will enable them to harvest the potential of that data and disrupt their industries.

But while companies continue to hoard data and invest in analytics tools that they hope will help them determine and drive additional value, the General Data Protection Regulation (GDPR) is forcing best practices in the capture, management and use of personal data.

The European Union’s GDPR stipulates stringent rules around how data must be handled. Because the regulation impacts the entire data lifecycle, organizations must have an end-to-end understanding of their personal data, from its collection and processing through storage and, finally, its destruction.

As companies scramble to meet the May 25 deadline, data governance is a key focus. But organizations cannot treat the new regulations as a box to check: continuous compliance is required, and most organizations are having to create new policies that will help them achieve a privacy-by-design model.

Diverse Data Assets

One of the great challenges in securely managing data is the rapid adoption of data analytics across businesses, as analytics moves from an IT back-office function to become a core asset for business units. As a result, data often flows in many directions across the business, and it becomes difficult to understand the data about the data, such as its lineage (where it was created and how it got there).

Organizations may have personal data in many different formats and types (both structured and unstructured) across many different locations. Under the GDPR, it will be crucial to know and manage where personal data sits across the business. While no one is certain exactly how the GDPR will be enforced, organizations will need to be able to demonstrate, at a moment’s notice, that their data management processes are continually in compliance.

With the diverse sources and banks of data that many organizations have, consolidating this data will be key to effectively managing their compliance with the GDPR. With the numerous different types of data that must be held across an organization, data lakes are a clear solution to the challenge of storing and managing disparate data.

Pool Your Data

A data lake is a storage method that holds raw data, including structured, semi-structured, and unstructured data. The structure and requirements of the data are only defined once the data is needed. Increasingly, we’re seeing data lakes used to centralize enterprise information, including personal data that originates from a variety of sources, such as sales, CX, social media, digital systems, and more.

Data lakes, which use tools like Hadoop to track data within the environment, help organizations bring all their data together in one place, where it can be maintained and governed collectively. The ability to store structured, semi-structured, and unstructured data is crucial to the value of this approach for consolidating data assets, compared to data warehouses, which primarily maintain structured, processed data. Enabling organizations to discover, integrate, cleanse, and protect data that can then be shared safely is essential for effective data governance.
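As a rough sketch of this schema-on-read idea (the event names and fields here are invented for illustration), raw records can be written to the lake untouched, with structure imposed only when a particular consumer reads them:

```python
import io
import json

# Raw events land in the lake as-is; no schema is enforced at write time.
raw_lake = io.StringIO()
events = [
    {"source": "sales", "customer": "C-1", "amount": 120.0},
    {"source": "social", "user": "C-1", "text": "great product"},
]
for event in events:
    raw_lake.write(json.dumps(event) + "\n")

# Schema-on-read: structure is defined only when the data is needed.
def read_with_schema(lake, fields):
    lake.seek(0)
    for line in lake:
        record = json.loads(line)
        # Keep only records that satisfy this consumer's schema.
        if all(f in record for f in fields):
            yield {f: record[f] for f in fields}

# A sales analyst projects just the fields they care about.
sales_view = list(read_with_schema(raw_lake, ["customer", "amount"]))
```

The unstructured social-media event is simply skipped by this consumer while remaining in the lake, untouched, for others.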

Beyond the view across the full expanse of the data lake, organizations can look upstream to identify the sources of data from before it flowed into the lake. That way, organizations can track specific data back to its source, such as the CX or marketing applications, gaining end-to-end visibility across the entire data supply chain so that data can be scrutinized and identified as necessary.

This end-to-end view of personal data is crucial under the GDPR, enabling businesses to identify the quality and point of origin of all their information. Beyond letting organizations store, manage, and identify the source of all their data, data lakes provide a cost-effective means of keeping that data in one place; managing the same volume of data in a data warehouse carries a far higher TCO.
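One minimal way to picture this upstream tracing (the registry and dataset names are hypothetical) is a lineage map that records each dataset's immediate source and is walked until a system outside the lake is reached:

```python
# Hypothetical lineage registry: each dataset in the lake records the
# immediate upstream dataset or application it came from.
lineage = {
    "lake/customers": "crm_app",
    "lake/orders": "sales_app",
    "lake/enriched_orders": "lake/orders",
}

def trace_to_origin(dataset, registry):
    """Walk upstream links until reaching a source outside the lake."""
    seen = set()  # guard against accidental cycles in the registry
    while dataset in registry and dataset not in seen:
        seen.add(dataset)
        dataset = registry[dataset]
    return dataset

# A derived dataset traces back through the lake to its source system.
origin = trace_to_origin("lake/enriched_orders", lineage)
```

Real deployments would track lineage per record and per transformation, but the principle of following recorded upstream links is the same.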

Setting the Foundations

While data lakes currently present the best approach to data management and governance for GDPR compliance, they will not be the last stop in organizations’ journey towards innovative, efficient, and compliant data management. The data storage approaches of the future will be built with the new regulatory climate in mind and designed to meet the challenges it presents.

However, with the demand on organizations to create data policies and practices that will keep their future data storage and analytics endeavors compliant, it is clear that businesses need to start refining the processes and policies that will lay the foundations for compliant data innovation. Being able to quickly and easily identify and access all data, with a clear understanding of its source and stewardship, is now the minimum standard for the management of personal data.

The Clock Is Ticking

Time is running out for many organizations to achieve GDPR compliance, with just weeks until enforcement. However, companies must take a long-term view and build a data storage model that will enable them to consolidate, harmonize, and identify the source of their data in compliance with the GDPR.

The GDPR is adding a new dimension to customer demands: customers now value trust and transparency, and they will vote with their feet. They will follow companies that can deliver personalized interactions while letting customers take full control of their personal data. Ultimately, companies that establish a system of trust at the core of their customer and employee relationships will win in the digital economy.

Data Lake: The Central Data Store

We live in the age of data, and as per Gartner, the volume of worldwide information is growing at a minimum rate of 59% annually. Volume alone is a significant challenge to manage, and variety and velocity make it even more difficult. It is also very evident that generation of larger and larger volumes of data will continue, especially if we consider the exponential growth of the number of handheld devices and Internet-connected devices.

For organizations with systems of engagement, this is true — but for others, data volume growth is not that high. Data volume is different for different organizations. In spite of this difference, one common factor across all of them is the importance of meaningful and useful analytics for different stakeholders. With the increased use of tools across organizations for different functionalities, the task of generating meaningful and useful reports for different stakeholders is becoming more and more challenging. 

What Is the Data Lake?

Nick Heudecker, research director at Gartner, has explained the data lake:

“In broad terms, data lakes are marketed as enterprise-wide data management platforms for analyzing disparate sources of data in its native format. The idea is simple: instead of placing data in a purpose-built data store, you move it into a data lake in its original format. This eliminates the upfront costs of data ingestion, like transformation. Once data is placed into the lake, it’s available for analysis by everyone in the organization.”

Thus, a data lake helps organizations gain insight into their data by breaking down data silos. The term “data lake” was first used in 2010, and its definition and characteristics are still evolving. In general, “data lake” refers to a central repository capable of storing zettabytes of data drawn from various internal and external sources in a format as close as possible to the raw data.

Data Lake Challenges

A data lake is usually thought of as the collection and collation of all enterprise data from legacy systems and sources, data warehouses and analytics systems, third-party data, social media data, clickstream data, and anything else that might be considered useful information for the enterprise. Although the definition is interesting, is it actually possible or required for every organization?

Different organizations have different challenges and patterns of distributed data, and with such diversified scenarios, every organization has its own need for the data lake. Though the needs, patterns, sources, and architecture of the data differ, the challenges of building a central store or lake of data are the same:

  • Bringing data from different sources to a common, central pool.
  • Handling low volume but highly diversified data.
  • Storing the data in a low-cost infrastructure compared to a data warehouse or big data.
  • Near real-time synchronization of data with the central data store.
  • Traceability and governance of the central data.

Data Lake Implementation Considerations

In most cases, data lakes are deployed in the spirit of a data-as-a-service model, where the lake is treated as a centralized system of record serving other systems at enterprise scale. A localized data lake not only expands to support multiple teams but can also spawn multiple data lake instances to support larger needs. This centralized data can then be used by the different teams for their analytical needs.

With this understanding in place, it’s time to discuss what data lakes need in terms of integration and governance.

Integration Challenges

In order to deploy a data lake at the enterprise level, it needs to have certain capabilities that will allow it to be integrated within the overall data management strategy, IT applications, and data flow landscape of the organization.

  • In order to make the data of a data lake useful at a later point in time, it is very important to make sure that the lake is getting the right data at the right time. For example, a data lake may ingest monthly sales data from enterprise financial software. If the data lake takes in that data too early, it may get only a partial dataset or no data at all. This could result in inaccurate reporting down the line, leading the company in the wrong direction. Thus, the integration platform operating in the background for the population of data into the data lake should be capable of pushing data from various tools both in real-time and on-demand based on the business case.
  • Though the main purpose of the data lake is to store data, at times (based on different business cases and in order to facilitate other departments for using the data in future), some data needs to be distilled or processed before getting inserted into the data lake. Thus, the integration platform should not only have support for this but also ensure that the data processing is happening accurately and in the correct order.
  • Centralized data storage is useful only when the stored data can be extracted by all different departments for their own use. There should be a capability to integrate the data lake with other applications or downstream reporting/analytic systems. The data lake should also have support for REST APIs, which different applications can interact with to get or push their own piece of data.
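The first two points above can be sketched as a toy ingestion function (the source systems and fields are invented) that stamps every record with its origin and ingestion time, so downstream consumers can later judge whether the right data arrived at the right time:

```python
import datetime

lake = []  # stand-in for the central store

def ingest(records, source):
    """Push records from a source system into the lake, stamping each
    with its origin and an ingestion timestamp for later traceability."""
    ts = datetime.datetime.now(datetime.timezone.utc).isoformat()
    for record in records:
        lake.append({"source": source, "ingested_at": ts, "payload": record})

# One tool pushes in real time; another is pulled on demand.
ingest([{"invoice": 1, "total": 99.5}], source="finance")
ingest([{"ticket": 7, "status": "open"}], source="support")
```

A real integration platform would also validate completeness before ingestion (e.g. rejecting a partial month of sales data), but the stamping pattern is the foundation that makes such checks auditable.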

Data Lake Governance Challenges

The data lake is not only about storing data centrally and furnishing it to different departments whenever required. With more and more users beginning to use data lakes directly or through downstream applications or analytical tools, the importance of governance for data lakes increases. Data lakes create a new level of challenges and opportunities by bringing in diversified datasets from various repositories to one single repository.

The major challenge is to ensure that data governance policies and procedures exist and are enforced in the data lake. There should be a clear definition of the owner for each dataset as it enters the lake, and a well-documented policy or guideline regarding the required accessibility, completeness, consistency, and updating of each dataset.

To solve this problem, the data lake should have built-in mechanisms to track and record any manipulation of the data assets it holds.
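A minimal sketch of such a mechanism, assuming nothing more than an append-only log (the dataset and user names are hypothetical), might look like this:

```python
import datetime

audit_log = []  # append-only record of every manipulation in the lake

def record_event(dataset, user, action):
    """Append an audit entry whenever a data asset is touched."""
    audit_log.append({
        "dataset": dataset,
        "user": user,
        "action": action,
        "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })

def history(dataset):
    """Reconstruct who did what to a given dataset, in order."""
    return [e for e in audit_log if e["dataset"] == dataset]

record_event("lake/customers", "alice", "update")
record_event("lake/orders", "bob", "read")
```

In production this role is typically played by a metadata/audit service rather than an in-memory list, but the governance requirement it satisfies, that every manipulation be attributable, is the same.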

Is the Data Lake the Same for Everyone?

The implementation of a data lake is not the same for all organizations, as data volume and data collection requirements vary from organization to organization. In general, the data lake comes with the perception that data volume must reach petabytes or zettabytes or more, and that it must be implemented using a NoSQL database. In reality, that amount of data together with a NoSQL implementation may not be needed, or even possible, for every organization. The end goal of a central data store catering to all the analytical needs of the organization can begin with a SQL database and a moderate data volume.
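A sketch of this modest starting point, using SQLite as a stand-in SQL database and JSON payloads to absorb differing record shapes (the table and field names are invented):

```python
import json
import sqlite3

# A modest central store: heterogeneous records kept as JSON in SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lake (source TEXT, payload TEXT)")

def store(source, record):
    """Insert any record shape, tagged with its source system."""
    conn.execute("INSERT INTO lake VALUES (?, ?)",
                 (source, json.dumps(record)))

store("crm", {"customer": "C-1", "segment": "retail"})
store("web", {"page": "/home", "clicks": 42})

# Any department can pull back and reconstruct the records it needs.
rows = conn.execute("SELECT source, payload FROM lake").fetchall()
records = [(s, json.loads(p)) for s, p in rows]
```

The same pattern scales to a server-grade SQL database; migrating to NoSQL later remains an option once volume or variety actually demands it.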

Why the Impact of Big Data Extends Further Than You Think

C-suite executives: What is the impact of big data in your organization? If you suggest I speak to the CIO, you might be surprised when you take a closer look.

Big data is at a tipping point across many corporate disciplines, from marketing to customer support. Corporate leaders across every part of their businesses are using it to innovate and gain new insights into their existing business.

The impact of big data is spreading rapidly throughout many companies and becoming a part of their DNA. The danger, however, is that some senior managers may be unaware of its effect.

Siloed Data Can Cause Problems

Big data deployments are commonly localized. When organizations first begin experimenting with big data, they often do so with small proof-of-concept projects. Sometimes the senior management may be aware of these, but in many cases, they may be run as “skunk works” projects conducted autonomously to test a team’s assumptions.

This creates several problems for senior executives. The first is that they don’t have a unified view of the organization’s big data initiatives, which tends to keep big data from becoming a strategic focus. The biggest value of data will come when connecting the dots across an entire organization. For instance, a single view of a customer does not live in marketing or in sales—it spans the entire organization. It’s hard to encourage joined-up thinking and pursue a bigger goal when different people in the organization are fostering a piecemeal approach to big data projects.

Another problem is a lack of governance. Data can be a powerful tool, but it can also be destructive to your business if misused. If ad hoc projects are started without adequate oversight, they can lead to the inappropriate use of personal information.

Without a strategic approach to big data, executives risk developing pockets of noncompliance as well-intentioned middle managers experiment with sensitive data. All it takes is one slip, and it could leave chief executives trying to explain themselves while the company makes negative headlines.

Time to Combine Big Data Projects

By pulling these projects together into a holistic strategy, business leaders can take control and help write the company’s big data narrative. This offers several advantages. First, it gives business leaders more control over data privacy and security. By imposing best practices in an enterprise-wide policy, executives reduce the risk of rogue managers creating security gaps in the corporate infrastructure. In a world where data provenance and protection are more important than ever, this point cannot be overstated.

Gaining control over data governance will make the compliance department happy. Instead of having to find and audit a universe of undocumented big data projects, the compliance team will already be aware of them, and will be assured that the project managers drew from a data governance playbook when creating them. This should make auditing far easier and reduce the risk of noncompliance.

The other benefit of creating a governing strategy for big data projects is that the whole can become more than the sum of its individual parts. While few companies expect to pile all their information into a single, huge data ocean, an end-to-end view of these projects can nevertheless introduce opportunities for efficiency and data aggregation. Multiple projects may share information in a single data lake.

Managing data in this way can increase the variety of information available to a single big data project. For example, an ad hoc project might only draw on the sales database, limiting its results. By making unstructured data from other sources available to that project, a company could generate more accurate, insightful results.

These partnerships require a strategic vision—which, in turn, increases the impact of big data at the corporate level. A comprehensive, top-down big data strategy creates space for the executive team to forge a vision of what it wants big data to achieve.

Making Big Data a Program, Not a Project

How can corporate leaders turn things around and get to this point? The first step lies in developing a mature big data strategy. Business leaders must treat big data as a program, not a project, with the highest level of executive buy-in.

Then you should give the program the attention it deserves by turning to experts to design your big data systems around open architectures: it’s unlikely that your big data program will support just one technology or put data in only one location.

Once you have a program with well-defined compliance practices and technical specifications, your organization can document its existing siloed projects, assessing each one to see if it can be incorporated into the new framework.

Even projects that cannot be integrated will still be valuable. Architects can learn from most deployments, understanding what worked and what didn’t. They can then factor those insights into a redesign, as they build the project from scratch with new, enterprise-approved parameters.

The time to do this is now. Data-driven companies such as Uber, Facebook, and Amazon are already disrupting markets by the dozen. They are ahead of the wave. By developing a strategy that embraces the impact of big data across your organization and beyond, you can surf this wave too.

Why the Promise of Medical Data Remains Unfulfilled

I’ve written numerous times over the past few years about the power of medical data and the numerous issues surrounding it, from the new ways of collecting and analyzing it to the importance of strong data governance.

Despite the tremendous potential for delivering better care, and also better medical research, the joined-up use of medical data remains largely overlooked. A recent paper examines some of the reasons why that is.

The study found considerable variance in the IT systems across the NHS, with a continued reliance upon paper records and limited data sharing between departments. With patient medical records remaining the primary source of data around the patient, this undermines efforts to use such records both for better care and better research.

The poor use of data wasn’t confined to the NHS, however, with the study also finding that both the pharma industry and universities were not using data to its full potential. For instance, the pharma industry has a long and murky history over the selective publishing of data around trials, and even academia has been accused of similar practices in recent years.

Various well-publicized cases have led to steps to improve matters. For instance, projects such as the AllTrials campaign strive to promote the proper reporting of clinical trials. Even then, however, it’s estimated that fewer than 50% of trials are reported within two years of completion.

Better Governance

The analysis also revealed significant problems with the way data is regulated and governed. Regulations exist to ensure that patients, the public, and medical professionals are safeguarded, but they can often result in excessive caution around patient data. Consent procedures can also limit the impact of studies, especially in niche fields.

“Sometimes, relying on the need for individual consent can limit studies about groups that are difficult to reach, as well as problems such as substance misuse, and any issues seen as sensitive,” the authors say.

All of this adds up to a significant problem. The authors say that the misuse (or non-use) of data is costing lives. Equally, none of the problems outlined above stands in isolation; each is part of a wider picture of data governance in healthcare. While there are clearly good reasons why data governance is crucial, there are also many areas that could be improved.

“It can be argued that data non-use is a greater risk to well-being than data misuse. The non-use of data is a global problem and one that can be difficult to quantify. As individuals, we have a role to play in supporting the safe use of data and taking part where we are able,” the authors conclude.

What Are the Keys to Big Data?

To gather insights on the state of big data in 2018, we talked to 22 executives from 21 companies who are helping clients manage and optimize their data to drive business value. We asked them, “What are the keys to a successful big data strategy?” Here’s what they told us:

ID the Business Problem

  • Understand the problem you’re trying to solve. Frequently, that’s the complexity and scale of the datasets. We use OLAP to garner insights: rapid aggregation of large datasets or streams of data. We help take in large streams of data and ingest them so you can slice and dice to gather insights.
  • Blurring of products and toolsets around databases and big data. Take advantage of innovation in the stack. Be case-driven and focus on outcomes and results. Help customers make practical decisions.
  • Look at the requirements and the problems you are looking to solve — performance or data access — so you can make the right choice of technology.
  • A big data strategy whose sole purpose is exploring possibilities is likely to end in confusion. An efficient strategy must be driven by a pragmatic approach: first identify the business problems to solve, then validate assumptions through experiments involving internal users and customers. Optimize a legacy data warehouse by moving data to modern infrastructure and technologies more suitable for handling and processing large data volumes and for running complex algorithms and analysis; a cloud architecture can often be considered at this stage. Build common practices and an architecture framework around the concept of the data lake to quickly unleash the value of data. By doing so, companies can benefit from a robust and scalable architecture built on multiple layers to properly collect and ingest data at different paces, store large amounts of raw data in any format, protect sensitive information, manage data quality at the expected level, refine data if necessary, and derive data to quickly enable analysis and accessibility.
    • Lead the change to make information the company’s fuel by applying the right user experience (UX) to how data is delivered to employees or to the company’s digital ecosystem. New customer-oriented marketing based on big data should be used to improve and even alter current marketing practices. Care should be taken not to fall into excessive statistics and/or spurious correlations that go beyond the actionable steps customers need. The data, analytics, and insights generated by the analysts must be communicated precisely and openly to internal users. The final information should be represented in a way that connects its value to actions by the implementation team.
  • Having a clear and worthy end goal; practical implementation strategies that limit the risk in these likely huge undertakings; and choosing the right technologies and partners for the right job.
  • It’s not actually about the big data itself. Firstly, it’s about identifying the current processes or actions that are present in your business (or project). By identifying, I explicitly mean document — using BPMN (business process model and notation), UML, word processor, whiteboards, chalkboards, cameras on phones, napkins, etc. This crucial step is required to enable, if not force, all technology decisions (selection and implementation) to benefit and be driven by the needs of the business. When approached in this manner several benefits emerge that not only help identify where to employ a big data analytic approach but how to measure improvements:
    1. The individual steps needed to achieve the business goals via one or more processes are revealed to all stakeholders.
    2. The constraints (inputs, outputs, allotted time) required when executing each step are also identified. For instance, Step 1 may need to complete in 3 seconds.
    3. The data required for each step, process, and business goal is identified. Importantly, the data is identified within the scope of the business needs. (We want to avoid selecting technology [acquiring or building] for any other reason. The battlefield of business history is littered with efforts that selected technology because it was popular, cool, familiar, low or no cost, or a resume builder.)
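The documented steps, constraints, and data described above can be sketched as a small data structure (the step names, time limits, and data items are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class ProcessStep:
    """One documented step of a business process: its constraint
    (an allotted time) and the data it consumes and produces."""
    name: str
    max_seconds: float  # allotted-time constraint, e.g. "complete in 3 s"
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)

steps = [
    ProcessStep("validate order", max_seconds=3.0,
                inputs=["order"], outputs=["validated_order"]),
    ProcessStep("score risk", max_seconds=1.0,
                inputs=["validated_order"], outputs=["risk_score"]),
]

# The data required across the whole process, within business scope,
# falls straight out of the documentation.
required_data = sorted({d for s in steps for d in s.inputs})
```

With the process captured this way, each step's constraints become measurable targets for any big data technology chosen to serve it.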

Governance and Operations

  • Information governance and management. Customers don’t know where all of their data is. We identify all of the data and where it resides. We take aged data and determine how recently it has been accessed. While data is growing rapidly, there’s always far more data that is 30+ days old than active data less than 30 days old.
  • Organization is one of the largest keys to a successful big data strategy, whether that’s internal organization with clear definitions of roles for collecting, analyzing, and acting upon this data or utilizing the right platforms and tools for strategic goals. Making sure there are processes that organize data streams into usable information, and then translating that into understandable action points are challenges a lot of enterprises face.
  • There are a number of key components of a successful big data strategy, but possibly the most practical advice is to build an initial data lake (or “dataland,” as we like to call it) for data scientists to use as a playground to ideate new applications for that data. This can turn into a number of quick wins even before a more disciplined approach to big data is realized. However, at some point, data governance becomes extremely important, since, most likely, some of the data assets that end up in the data lake will contain sensitive information that must be guarded appropriately.
    • There is also the talent factor, which must be considered from the start in order to make any initiative successful. Unfortunately, you will need to identify in the organization (or otherwise hire) individuals who could be considered data scientists: a combination of a statistics/math buff with a soft spot for data who can do a reasonably good job at programming. Such candidates are not exactly growing on trees. When given the option, it’s probably more effective to grow them from existing people in your organization with domain expertise and existing data skills, building statistical acumen on top and possibly relying on existing toolsets (both open-source and commercially available) to supplement their limited programming ability.
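The aged-versus-active split mentioned in the first bullet above can be sketched as a simple scan over a hypothetical inventory of files and their last-access times (paths and dates are invented):

```python
import datetime

now = datetime.datetime(2018, 5, 1)

# Hypothetical inventory: where each data asset lives and when it
# was last accessed.
inventory = [
    {"path": "/data/sales/2018-04.csv",
     "last_access": datetime.datetime(2018, 4, 28)},
    {"path": "/data/logs/2017.tar",
     "last_access": datetime.datetime(2017, 6, 1)},
    {"path": "/data/hr/old_export.xml",
     "last_access": datetime.datetime(2018, 1, 15)},
]

# Partition the estate by the 30-day activity threshold.
cutoff = now - datetime.timedelta(days=30)
active = [f for f in inventory if f["last_access"] >= cutoff]
aged = [f for f in inventory if f["last_access"] < cutoff]
```

Aged assets are then candidates for cheaper storage tiers or review, while active data stays on fast infrastructure.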

Strategy and Structure

  • We expected specialized technologies and new opportunities for distributed computing; now we’re seeing organizations with flat IT budgets looking to take costs out of IT so they can focus on innovation. Data volumes are growing. Focus on the process versus the data. Understand that data is the enabling layer but also the obstacle to pursuing microservices, AI, and IoT. What is the data fabric strategy you are going to use to pursue new technologies — multi-cloud, hybrid processes, and microservices?
  • From my experience, it is imperative to develop a strategy incrementally, starting with key use cases that can benefit from big data technology before moving to broad, organization-wide projects. Beyond that, a successful strategy involves getting the right people in place that understand both the technology spectrum and the business goals and picking the right architecture. Data streaming technology can be a huge help in creating an architecture that connects different data silos, data lakes, etc. Also important is putting the right governance policies in place for making data accessible across an organization.
  • Pay attention to the structure and the quality of the data as you are moving it or ingesting. Have a process for maintaining the structure and the quality.
  • One of the keys to a successful big data strategy is the ability to use the necessary data efficiently across the enterprise for analytics and machine learning. This requires real-time, easy access to the data, with its lineage preserved, wherever it may be stored, without the traditional ETL process that results in multiple copies and delays in accessing the data.
  • The most successful organizations that have adopted a big data strategy reported these observations: the use of a unified, comprehensive, and flexible data management platform enables speed, reusability, and trust with far less manpower than traditionally manual and complex approaches; it also lets organizations focus manpower on business logic and business context rather than being delayed by the complexities of an ever-changing infrastructure ecosystem; and the ability to leverage AI-driven technology can guide user behavior and automate processes.

Speed of Delivery

  • Real-time analytics and fast data processing. Don’t fall into cluster sprawl. Complexity can be high. Leverage Spark to innovate on big data in a more concise way. Simplify architecture and performance. Focus on high performance so you can get answers quickly.
  • A work in progress that’s not easily solved. Connect quickly with a self-service model to empower business analysts and data scientists to be self-sufficient and independent.
  • Pull together data sources to answer questions. Don’t get stuck using a single system. Leave the data where it lives naturally and do the analytics there. If you bring the data to a data warehouse, it’s out of date. Answer questions quickly. Run different “what if” scenarios and respond to questions in real-time.


  • Backup, recovery, and protection, especially with the growth of ransomware. Data is business critical. Treat it as such.
  • The more metadata you have the more it can work for you.
  • Data operations ensuring data is moving across the enterprise while you are able to keep your finger on where it is and what it’s being used for. Operational oversight, quality, and SLAs. Big data can be difficult for companies to use. Kafka is complex and can be difficult to get started. We provide a UI that removes the requirement for upfront programming.
  • Municipalities have been collecting data for a long time. That data has been collecting dust. We provide the resources for non-technical people to clean and work with the data.

Here’s who we spoke to:

  • Emma McGrattan, S.V.P. of Engineering, Actian
  • Neena Pemmaraju, VP, Products, Alluxio, Inc.
  • Tibi Popp, Co-founder and CTO, Archive360
  • Laura Pressman, Marketing Manager, Automated Insights
  • Sébastien Vugier, SVP, Ecosystem Engagement and Vertical Solutions, Axway
  • Kostas Tzoumas, Co-founder and CEO, data Artisans
  • Shehan Akmeemana, CTO, Data Dynamics
  • Peter Smails, V.P. of Marketing and Business Development, Datos IO
  • Tomer Shiran, Founder and CEO and Kelly Stirman, CMO, Dremio
  • Ali Hodroj, Vice President Products and Strategy, GigaSpace
  • Flavio Villanustre, CISO and V.P. of Technology, HPCC Systems
  • Fangjin Yang, Co-founder and CEO,
  • Murthy Mathiprakasam, Director of Product Marketing, Informatica
  • Iran Hutchinson, Product Manager and Big Data Analytics Software/Systems Architect, InterSystems
  • Dipti Borkar, V.P. of Products, Kinetica
  • Adnan Mahmud, Founder and CEO, LiveStories
  • Jack Norris, S.V.P. Data and Applications, MapR
  • Derek Smith, Co-founder and CEO, Naveego
  • Ken Tsai, Global V.P., Head of Cloud Platform and Data Management, SAP
  • Clarke Patterson, Head of Product Marketing, StreamSets
  • Seeta Somagani, Solutions Architect, VoltDB

Original Link

8 Key Takeaways From the MDM and Data Governance Summit

A few weeks ago, I had the great opportunity to attend the MDM and Data Governance Summit held in NYC. The summit was packed with information, trends, best practices, and research from the MDM Gartner Institute. Much of the information is something you don’t find in webinars or white papers on the internet. Speakers from many different industries brought different perspectives to the data management concepts.

I also got to meet IT business executives who are at different points of their MDM and data governance journeys. As I reflect on the conference, I wanted to share some of my key highlights and takeaways from the event as we all start to prepare our IT and data strategies for 2018.

1. MDM Main Drivers

The top five main drivers of MDM are:

  1. Achieving synergies for cross-selling

  2. Compliance

  3. Customer satisfaction

  4. System integration

  5. Economies of scale for M&A

A new driver has also emerged recently — digital transformation — and MDM is at the core of digital transformation.

2. IT and Business Partnerships are More Important Than Ever

If there was one thing that everyone at the summit agreed upon, it was that the partnership of the business and IT directly impacts the success of the MDM and data governance programs more than any other factor. For one banking company, this happened naturally, as the business also understood the power of data and the issues related to it. But largely, the experience of the people involved in this partnership is that it’s an uphill battle to get the buy-in from the business. Tying the project to a specific business objective is critical in these scenarios. The bottom line is that a solid partnership between business and IT will provide the right foundation for an MDM program.

3. Managing Change Is Critical

It’s widely accepted that any MDM journey is long, and takes energy and perseverance. One company mitigated this process by starting with the domain that they thought would have the most impact.

4. Data Governance Council

Only about 20% of the audience had some form of data governance council, but all the case studies presented had a data governance council in place. The council was made up of both business and IT teams. There is no real pattern from an organizational structure perspective. An insurance company with a hugely successful MDM implementation has its Enterprise Information Management Team as part of the Compliance Team. Another financial company had the team reporting to the COO. It all depends on how your company is organized and does business.


5. GDPR

This topic was everywhere. 50% of the audience said they were impacted by this regulation. But it looks like a lot of companies still lag way behind in terms of preparing their enterprise data for compliance. This is a serious issue, as there are fewer than 150 days to get ready. One of the speakers said that MDM is the heart of GDPR implementation.

6. Next-Generation MDM

Data-as-a-Service is something every company should aim for in the next two to three years. Also, bringing in social media and unstructured data will be key to gain actionable insights from MDM initiatives. Large enterprises have moved beyond CDI and PIM to focus on relationships and hierarchies. Cloud MDM will be in demand, but there is potential for creating more data silos as integration becomes a challenge.

7. Big Data Lakes

There are just too many technologies in the Big Data space. Therefore, solution architecture becomes key when building a Data Lake. A common implementation was to load the data from legacy systems into Hadoop without any transformation. But without metadata, the lake quickly becomes a swamp. So, to get true value from Big Data Analytics, MDM and Data Governance have to be effective and sustainable. Also, from a technology perspective, there needs to be sound integration with big data systems. My company, Talend, has been at the forefront of Big Data Integration, providing a unified platform for MDM, DQ, ESB, and Data Integration.


8. Quotes From the Speakers

Finally, I want to end this blog with some great quotes from the speakers:

  • “Digital transformation requires information excellence.”

  • “If you don’t know where you are, a map won’t help.”

  • “Big data + data governance = big opportunity.”

  • “Data is a precious thing and will last longer than the systems themselves.”

  • “There is no operational excellence without data excellence.”

  • “A shared solution is the best solution.”

  • “People and processes are more critical than technology.”

  • “Rules before tools.”

  • “Master data is the heart of applications and architecture.”

  • “There is no AI without IA (Information Agenda).”

As you prepare for MDM and data governance initiatives in 2018, I hope some of my takeaways will spark new ideas for you on how to have a successful journey to MDM.

Original Link

Traditional vs. Self-Service BI: What’s the Difference?

Slow. Inflexible. Time-consuming. Does that sound like any sensible way to get users the business insights they need to do their jobs? A few years ago, this might have been the only option for business intelligence, but now there’s a fork in the road. Users can go one way with centralized BI run by their IT department, or they can strike out with modern BI solutions they can use by themselves. As you might imagine, in the traditional vs. self-service BI debate, there are pros and cons to be considered before making a choice.

Is Traditional BI the Answer?

Let’s take the traditional approach first. There are reasons for this controlled BI environment to exist. When you control the data and the BI application, you have a chance of controlling the quality of the results. IT departments concerned about quality (meaning every conscientious IT department) can make sure that data is properly prepared, stored, and secured. They can build systems that offer standardized, scalable reporting, and they can give users answers to their business questions. Of course, those answers may take a little time to materialize, especially if the IT department is busy with other projects, or if the rate of growth of data (and big data) starts to outstrip the IT department’s resources for handling it.

The Self-Service BI Alternative

So, how about the self-service approach? There are now numerous business intelligence tools available that appeal to users through their simplicity and affordability, and we’re not just talking about spreadsheets. Self-service BI has made giant strides to get to a point where users can access data from different sources, get insights from all the sources altogether, and make faster business decisions. The tools are typically intuitive and interactive (those that aren’t tend to disappear from circulation), and let users explore data beyond what the IT department has curated.

How About Both?

But perhaps representing traditional vs. self-service BI as a fork in the road is unrealistic. An organization may need both types of business intelligence. Functional reporting on daily business operations is still a common requirement, even if it is now a smaller part of the overall BI picture. Compliance reporting and dashboarding, for example, are still needed. Moreover, once they are set up, these functions often run happily with little or no intervention. Traditional BI still has a role to play in answering questions about what happened in the past, or what is happening in operations now. By comparison, for questions about the future, especially spur-of-the-moment “what if” style questions, users want more individual power and faster reaction times than traditional BI is designed to give them. In this case, self-service can be preferable.

Figuring Out That Fork in the Road

This calls to mind the advice of one expert, who said, “When you come to the fork in the road, take it!” This apparently nonsensical statement now starts to make sense. How you “take the fork in the road” and navigate between traditional vs. self-service BI will depend on several factors. Identification of suitable self-service BI use cases is one factor. Business user levels of BI understanding is another. So are data governance and the commonality (or not) of BI systems. We look at each in turn, below.

Which Self-Service BI Use Cases Make Sense?

Much of the demand for self-service BI is driven by the general use case of needing answers in a hurry. Specific examples might be:

  • A retail company wants same day answers about which products to put on special offer or how to adjust its daily online advertising.
  • A pizza chain searching for a new flavor to captivate customers is looking for “fail-fast analytics” to quickly explore test market reactions and eliminate pizza experiments that don’t find favor.
  • A construction company needs to consolidate spreadsheet data sent in by subcontractors to immediately spot any signs of rising costs or delays that could jeopardize building schedules.

How Much Understanding of Business Intelligence is Needed?

If a deep understanding of BI or data science is needed for a BI application, then that application is unlikely to be self-service. On the other hand, when a self-service tool is sufficiently intuitive and allows users to focus on business results instead of underlying technology, then end-users can work independently of developers, data specialists, and the IT department. One example is smart data visualization allowing ad hoc questions to be easily asked about any part of the data and to any depth.

How Can Self-Service BI and Data Governance Issues be Resolved?

Self-service BI cannot be at the expense of clarity or confidence in the results. Self-service tools that can use data sources directly can avoid such problems. The data sources do not change, and different users can apply the same tools to check that they get the same results. There may still be a discussion about the way the results are to be interpreted, but there should never be disagreement about the sources or the consistency of the data used to get the results.

Will You Need Totally Separate BI Hardware and Software?

Cost is a factor for most enterprises and organizations. Whereas in the past you may have had no option other than to pay separately for hardware and software, now there’s a better way. Instead of the older assembly line or legacy system approaches, enterprises can now move to a single-stack BI approach that provides BI to satisfy advanced users in the IT department, as well as non-technical business users. Once again, “When you come to the fork in the road, take it!”

The Future of Business Intelligence Is Self-Service

Even if both traditional and self-service BI will continue to coexist, constantly rising data volumes and accelerating business needs will mean that end-users will do an increasingly large part of BI for themselves. Good self-service BI applications will let users focus on their business questions without needing to build elaborate solutions, and get the insights and answers they need immediately, without having to wait for specialist IT staff to help them out.

Original Link

All You Need to Know About Data Management and Integration (DMI)

Chocolate soufflés are just like data-driven insights — they are rich, transformative, and really hard to create.

Anyone who has attempted to make a soufflé knows that in order to make it right, you must follow the recipe exactly, prepare the ingredients properly, and be methodical with every step of the process. When prepared correctly, a soufflé is magical, but if there is one small error, the whole thing ends in disaster.

Generating insights from your data works the same way. If you don’t have a standard process in place to ensure your data is accurate, usable and secure, your insights will be worthless and even misleading. But unlike the ingredients of a soufflé, managing data is extremely complex and the likelihood of making an error is high.

Leveraging the full value of your data and avoiding costly mistakes starts with a data management and integration (DMI) strategy. Simply put, DMI is a set of policies and procedures meant to provide the right people with timely access to accurate data.

With different types of data now coming from hundreds of internal and external sources, the potential for game-changing insights within your company is only matched by the nerve-racking potential for data chaos.

DMI is more important than ever, so we’ve broken down the four pillars of data management and integration to get you started.

Data Governance Is the Recipe for DMI

You can’t make a great soufflé without a clear and detailed recipe that describes exactly how to prepare the ingredients, put them together and bake the final product.

Successfully managing your data starts with a “recipe” called data governance. The role of data governance is to establish standard policies and procedures across the organization that set the parameters for managing all of your data.

A comprehensive data governance plan defines user rights, defines security policies, and monitors the technologies used to implement the various data procedures.

Data governance is the blueprint on how your company manages its data, focusing on three areas: people, policy, and technology.

Data analysts currently spend up to 80% of their time identifying and processing data before doing any analysis. Good data governance should alleviate this by providing the plan and structure for data to be secure, easily found, and shared among people with the appropriate permissions.

Quality Ingredients Always Make the End Product Better

A soufflé is actually not that hard to make — if you know how to prepare the right ingredients. For example, the most important ingredient is egg whites, but if the egg whites have any traces of egg yolk, your soufflé will fail.

Most people don’t think about the quality of their data, but, like the tainted egg whites, if your data is dirty, your insights will not be trustworthy.


Dirty data, such as data with spelling errors or data placed in the wrong field, is a big problem that often goes unnoticed. The power of your analysis is only as good as the quality of your data. Even if your data is 99% accurate, the remaining 1% can make your insight dead wrong.

Without data management and integration policies in place to ensure data quality, you will almost certainly have costly data inconsistencies. Let’s say your customer service team receives a complaint from D. Smith, while your marketing team is sending a promotion to a David Smith at the same time. You’ll look incompetent sending conflicting messages to the same person.

Beyond setting policies, a DMI program should include a data cleansing process to remedy some of these issues. Data cleansing tools, such as OpenRefine, reconcile and remove duplicate or incomplete data points.

Your data is generated across dozens of touchpoints, which increases the risk of duplication and redundancy. With a DMI approach that ensures your data is clean and ready for use, you will have the confidence to make decisions based on your analyses.
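
Tools like OpenRefine automate this reconciliation, but the core idea can be sketched in a few lines of Python (an illustration only; the field names and the similarity threshold are hypothetical): flag records whose names look alike, or that share an email address, as probable duplicates, such as the "D. Smith" / "David Smith" pair above.

```python
from difflib import SequenceMatcher

def likely_duplicates(records, threshold=0.6):
    """Pair up customer records that probably describe the same person."""
    pairs = []
    for i, a in enumerate(records):
        for b in records[i + 1:]:
            # A shared email is a strong signal; otherwise compare name similarity.
            same_email = a["email"] == b["email"]
            similarity = SequenceMatcher(
                None, a["name"].lower(), b["name"].lower()).ratio()
            if same_email or similarity >= threshold:
                pairs.append((a["name"], b["name"]))
    return pairs

customers = [
    {"name": "David Smith", "email": "d.smith@example.com"},
    {"name": "D. Smith",    "email": "d.smith@example.com"},
    {"name": "Jane Doe",    "email": "jane@example.com"},
]
print(likely_duplicates(customers))  # flags the D. Smith / David Smith pair
```

A real cleansing pass would then merge the flagged records or route them to a human for review, which is essentially what OpenRefine's clustering feature does interactively.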

Integration Tools Prepare the Data for Analysis

When you are following a recipe, it is sometimes tempting to skip the instructions and mix all the ingredients together at the same time. If you’ve done this then you know it never ends well.

Integrating your data is exactly the same. The data coming in across all your applications and legacy systems is not compatible. While you need to combine some of it to run your analysis, you can’t just throw it all together willy-nilly and expect accurate results.

Data integration is a complicated endeavor when you consider all the structured and unstructured data available to analyze. There are no one-size-fits-all approaches to data integration, but there are several tools to help automate the integration process of matching, cleaning, and preparing data for analysis.

Data integrations are complex and technical, but at the core are three steps known as ETL: extraction, transformation, and loading.

  • Data extraction is when data is collected from all of the various data sources.
  • Data transformation is the process of transforming all of your data into a compatible format, often with the help of a universal data model that defines the datasets with common properties.
  • Data loading is when the right data is loaded into the appropriate database, ready for analysis.
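
The three steps can be sketched end to end in Python. This is a minimal illustration with made-up source data, a hypothetical universal model of (name, signup_date), and an in-memory SQLite database standing in for the analytics store, not a production pipeline:

```python
import sqlite3

# Extraction: collect raw records from two incompatible (hypothetical) sources.
crm_rows = [("David Smith", "2017-11-02")]            # tuples, ISO dates
web_rows = [{"customer": "Jane Doe", "signup": "05/01/2017"}]  # dicts, US dates

# Transformation: map both sources onto one universal data model.
def from_crm(row):
    return {"name": row[0], "signup_date": row[1]}

def from_web(row):
    month, day, year = row["signup"].split("/")
    return {"name": row["customer"], "signup_date": f"{year}-{month}-{day}"}

records = [from_crm(r) for r in crm_rows] + [from_web(r) for r in web_rows]

# Loading: write the normalized records into the analytics database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (name TEXT, signup_date TEXT)")
db.executemany("INSERT INTO customers VALUES (:name, :signup_date)", records)
print(db.execute("SELECT * FROM customers ORDER BY signup_date").fetchall())
```

The transformation step is where most real-world effort goes: every source needs its own mapping onto the shared model before loading makes sense.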

You have so many sources of valuable data, but you can’t just throw them in an analytics tool and hope to make sense of it all. Your data management and integration strategy should identify the approach and tools necessary to implement a successful integration.

When the ingredients are prepared and combined properly, your chances of accurate insights and a successful soufflé go from good to great.

Protect Your Soufflé at All Costs

After painstakingly preparing your soufflé, you don’t want any unauthorized people in the kitchen to mess with it. If you turn your back for one second and someone opens the oven door, both your soufflé and your mood will be deflated.

A core pillar of DMI is data security. Not a week goes by without news of a massive data breach that compromises the personal information of millions of people. Traditionally, companies kept their data safe by securing their perimeter, but data now comes from the outside and employees access sensitive data with devices beyond the firewall.

One of the biggest data risks within your company has to do with access management. A recent survey found that 47% of companies had at least 1,000 sensitive files open to every employee. Unintentionally giving too much access to employees is a very common problem with dire repercussions.


U.S. retailer Target had a data breach in which over 40 million credit cards, including PINs, were stolen. While there were many reasons for the breach, an investigation identified that too many people had access to sensitive data, including remote contractors.

Your DMI policy should determine user access across the enterprise, but when you have hundreds or even thousands of potential users, how can you keep track of everyone?

Identity and access management (IAM) tools are designed to help you manage access to your data. You can authenticate, authorize, and track users that interact with your data at all times and set rules to comply with your data management policies.

For example, your IAM tool enables your team and contractors to access data remotely on any device by adding extra security measures such as automated logouts and multifactor authentication. Even if a device is lost or stolen, the data will be secure.
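
The rules an IAM tool enforces can be pictured with a small sketch (the roles, policy table, and timeout below are hypothetical, not any particular product's API): every request is checked against a role-based policy, every attempt is written to an audit log, and idle sessions are cut off automatically.

```python
import time

# Hypothetical role-based policy: which datasets each role may read.
POLICY = {"analyst": {"sales"}, "admin": {"sales", "hr"}}
IDLE_TIMEOUT = 15 * 60  # automated logout after 15 idle minutes

audit_log = []  # every access attempt is recorded for later review

class Session:
    def __init__(self, user, role):
        self.user, self.role = user, role
        self.last_seen = time.monotonic()

    def can_read(self, dataset):
        now = time.monotonic()
        expired = now - self.last_seen > IDLE_TIMEOUT
        self.last_seen = now
        allowed = not expired and dataset in POLICY.get(self.role, set())
        audit_log.append((self.user, dataset, allowed))
        return allowed

s = Session("dsmith", "analyst")
print(s.can_read("sales"))  # True: analysts may read sales data
print(s.can_read("hr"))     # False: hr is outside the analyst policy
```

The audit log is the piece that answers the "who is accessing your data?" question: even denied attempts leave a trace.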

To Generate Data-Driven Insights, Follow the Recipe

In the same way that a perfectly prepared soufflé can transform a dinner, data-driven insights can give your business a competitive advantage that takes your business to a new level.

When you have DMI policies and procedures in place to make sure your data is accurate, accessible, and secure, you will realize the power of your data and make decisions with confidence.

Original Link

What Is GDPR and Why Should Database Administrators Care?

You’ve no doubt heard at least something about the GDPR, the EU’s new privacy and data management law, with its greatly increased maximum fines for noncompliance and tighter definitions for acceptable uses of personal information.

If you’ve continued reading past paragraph one of any of the many articles, you’ll be aware that the law applies globally to all organizations holding EU citizens’ data — it’s not bounded by geography or jurisdiction.

Many of the articles have emphasized the increased penalties, much to the chagrin of the UK’s Information Commissioner, Elizabeth Denham, who is charged with enforcing the law in the UK (and yes, it will apply after Brexit, assuming Brexit happens).

“Heavy fines for serious breaches reflect just how important
personal data is in a 21st century world. But we intend to use
those powers proportionately and judiciously.” (Source)

But this is all C-level chatter, surely?

Yes, the boss needs to know about increased risks from incorrect data handling, but what should a DBA do — other than wait for new edicts from on high, especially when the data held by the IT department is actually a business asset and data governance is often the responsibility of multiple people or even departments?

If, like many DBAs, you’d rather spend your time applying your skills to optimizing your systems, tuning for performance and bang-per-buck (and minimizing the firefighting in between), why not just wait and see?

Well, there are a few reasons the smarter DBA should be preparing now and getting his or her estate in order ahead of the May 2018 enforcement date.

Probably the biggest is that the new law requires “privacy by design and default” throughout data handling.

Like most regulations, it’s open to interpretation, but regulators have been clear that data protection safeguards should be integrated into products and services from the earliest stage of development, with privacy always the default option.

But privacy of what, exactly? Not all data is private data.

You’ll need to do some thinking up front about this because before GDPR enforcement even begins, you’ll be expected to have audited your data and identified the two categories of personal data that require special handling.

Two categories? Yes. Standard personal data includes details like names, addresses, cookie strings, and web surfing data. “Special” personal data relates to data that is, literally, more personal like racial or ethnic origin as well as biometric and genetic data.
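
That audit can start as nothing fancier than a column-by-column inventory tagged with the category each field falls into. A minimal sketch (the column names and tags are illustrative, not legal definitions):

```python
# Hypothetical output of a data audit: column -> GDPR-relevant category.
INVENTORY = {
    "customer_name":  "standard",  # names, addresses, cookie strings, surfing data
    "cookie_id":      "standard",
    "ethnic_origin":  "special",   # the literally-more-personal category
    "genetic_marker": "special",
    "order_total":    "none",      # not personal data at all
}

def columns_needing_special_handling(inventory):
    """Every column that falls under GDPR at all."""
    return sorted(col for col, cat in inventory.items() if cat != "none")

def special_category_columns(inventory):
    """Columns needing the stricter 'special' handling."""
    return sorted(col for col, cat in inventory.items() if cat == "special")

print(columns_needing_special_handling(INVENTORY))
print(special_category_columns(INVENTORY))
```

An inventory like this, agreed with the business owners of the data, becomes the input to every later decision about masking, access, and retention.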

That said, now’s the time to ask three key questions.

  1. Where is your data? You’ll probably have more than one database, and you also need to remember legacy systems that are still in use, backups, and copies that might be used in development and testing.
  2. What exactly is your data that might be affected by the new law? You might want to take a broad-brush approach and categorize your CRM database as private, but how about web orders, purchase histories, audit logs…? You and the business owners of the data (these might need to be assigned, as well) are going to have to talk and agree on how to classify all of the data you keep.
  3. Who is accessing your data? This is probably the most crucial part of the whole exercise. If you do make copies of databases for use in development and testing, for example, you’ll need to consider masking personal data. You’ll also need to think about data access requirements so that people only have permission to view, modify, or delete personal data that is relevant to their job role, and for which appropriate consent has been obtained.
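
Masking for dev and test copies can be as simple as replacing identifying fields with consistent pseudonyms before a copy leaves production. A minimal sketch, with hypothetical column names, using hashing so the same input always maps to the same pseudonym:

```python
import hashlib

# Columns your audit flagged as personal data (hypothetical names).
PERSONAL_COLUMNS = {"name", "email"}

def mask(value):
    """Replace a personal value with a stable, irreversible pseudonym."""
    digest = hashlib.sha256(value.encode()).hexdigest()[:8]
    return f"masked-{digest}"

def mask_row(row):
    return {col: mask(val) if col in PERSONAL_COLUMNS else val
            for col, val in row.items()}

prod_row = {"name": "David Smith", "email": "d.smith@example.com",
            "order_total": 42}
test_row = mask_row(prod_row)
print(test_row["order_total"])                 # non-personal fields pass through
print(test_row["name"].startswith("masked-"))  # personal fields are pseudonymized
```

Note that plain hashing keeps join keys consistent across tables but is not strong anonymization on its own; small input spaces can be brute-forced, so real deployments add salts or use dedicated masking tools.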

Now, picture being unprepared when answering those questions for your organization. Many companies have problems with data sprawl, and you won’t be alone in being unsure where all the data is and who really needs access to it.

Is that the answer your CEO will expect? Or the new Data Protection Officer? (Oh, yes, you might be getting a new colleague with statutory responsibilities to enforce GDPR compliance.)

This isn’t all you need to prepare for, but it’s an important first step in ensuring you’re compliant with GDPR. So start the conversation, take the lead if you can, and you’ll protect yourself from impossible deadlines coming out of a “panic compliance” project.

The future you will thank you for starting your GDPR journey now.

Original Link

Containers for Enhanced Data Governance and Regulatory Compliance

How can auditors assess the use of enterprise data given today’s fragmented storage infrastructure? In short, with great difficulty!

Earlier this year, Windocks became the first container engine to incorporate database cloning. The combination of SQL Server containers with database cloning has been immediately popular for support of Dev/Test and reporting needs. A complex Terabyte-class database can be delivered in seconds, and only requires an incremental 40 MB of storage.

The combination of SQL Server containers and database clones is great for Dev/Test and reporting, but is also proving to be a huge step forward for Data Governance and Regulatory Compliance. In this article, we’ll explore how this design delivers a versioned, auditable repository of enterprise data for Audit and Compliance purposes.

Data Imaging for Enterprise Data Environments

Data images are built using Full or Differential SQL Server backups, snapshots, and SQL Server incremental log shipping, and are combined with SQL Server scripts to implement data masking during the image build. The resulting image is a full byte copy of the data in the form of Virtual Disks that can span multiple physical (or virtual) disks and large data sets.

The Virtual Disk, in turn, supports the creation of Windows “differencing disks” which are writable clones. Clones are delivered in seconds, and only require 40 MB or less of storage.
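
Conceptually, a differencing disk is a copy-on-write overlay: reads fall through to the shared base image until a block is written, and only written blocks consume new storage, which is why a clone of a terabyte image can start at tens of megabytes. A toy sketch of the idea, with Python dictionaries standing in for disk blocks (illustrative only, not the actual Windows VHD mechanics):

```python
class BaseImage:
    """The full byte copy of the data, shared read-only by every clone."""
    def __init__(self, blocks):
        self.blocks = blocks

class Clone:
    """A writable 'differencing disk' layered over a base image."""
    def __init__(self, base):
        self.base = base
        self.delta = {}  # only modified blocks are stored here

    def read(self, block_id):
        # Reads fall through to the base unless the block was overwritten.
        return self.delta.get(block_id, self.base.blocks[block_id])

    def write(self, block_id, data):
        self.delta[block_id] = data  # copy-on-write: the base is never touched

base = BaseImage({0: "schema", 1: "customer rows"})
clone = Clone(base)
clone.write(1, "customer rows (edited in dev)")

print(clone.read(0))   # unchanged blocks come from the shared base
print(clone.read(1))   # modified blocks come from the clone's delta
print(len(clone.delta))  # storage cost of the clone: just the changed blocks
```

Because the base image is immutable, many clones can share it simultaneously, and discarding a clone simply discards its delta.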

Images are built with a Dockerfile that specifies the location of backups, snapshots, or log shipping updates, along with SQL Server scripts. Windocks images support multiple databases, with source files located on the Windocks host or a network-attached file share. A Dockerfile might, for example, specify two databases located on network-attached file shares, plus SQL Server scripts for data masking.

The resulting image is versioned and auditable and supports delivery of multi-terabyte environments in seconds for Development and Test, and for reporting and BI. These data environments can now be delivered automatically or provisioned by users, for use with any SQL Server container (both Windocks and Microsoft’s), as well as with conventional SQL Server instances.

Data Imaging Enhances Data Governance and Compliance

The design as described was implemented to address the needs for delivery of data environments for Dev/Test and reporting needs, but the new Data Image repository is ideally suited for expanding data governance and regulatory compliance needs.

Privacy/Security: Security is improved with data delivered through a structured container process. Ad hoc access to enterprise data can be curtailed, with approved and auditable images used to support dev and test, as well as reporting and BI needs. Privacy is enhanced by data masking implemented during the image build. National boundaries are respected with image registries hosted in the appropriate country, as the Windocks solution runs wherever Windows servers are supported (on-premise, private, or public cloud).

Quality: The container workflow enhances data quality and consistency by supporting the use of production databases as the authoritative source of data. The workflow outlined above will soon be enhanced with native Jenkins or TeamCity server support for Continuous Integration, making this the first full-stack Jenkins Continuous Integration solution that incorporates production database support.

Access and Use: Docker containers are emerging as the defacto standard for software development and test. Containers play a prominent role in Microsoft’s strategies for Windows Server 2016 and SQL Server 2017. The approach outlined here provides organizations with an on-ramp to Docker-based workflows on Windows Server 2012 and Server 2016, with support of all editions of SQL Server 2008 onward. Not only does this workflow improve access to data (on-demand and in seconds), with the latest preferred dev and test tooling, but it also is uniquely useful for SQL Server reporting and BI purposes. This workflow also integrates with existing backup and DR system infrastructure, making it uniquely easy to add to existing systems and infrastructure.

Open: As a result of customer feedback, Windocks is also expanding support for delivery of data environments from any Storage Area Network (SAN), including NetApp, EqualLogic, and others. Copy Data Management systems will also be supported, such as those from Cohesity and Rubrik. Finally, support for MySQL, DB2, and other environments will be added as requested by customers.


Data governance should not be an afterthought for modern software development and delivery strategies. Windocks’ combination of SQL Server containers with database cloning delivers benefits for development and test, reporting and BI, and enhances data governance and policy compliance. The solution installs with existing systems and delivers Terabyte-class data environments in seconds while creating immutable, versioned and auditable images that address many data governance needs.

Explore how Windocks can enhance your data governance and delivery with a free Windocks Community Edition. Download your free evaluation of Windocks here.

Original Link