As organizations grapple with how to effectively manage ever more voluminous and varied reservoirs of big data, data lakes are increasingly viewed as a smart approach. However, while the model can deliver the flexibility and scalability lacking in traditional enterprise data management architectures, data lakes also introduce a fresh set of integration and governance challenges that can impede success.
Born from the rise of the cloud and big data technologies like Hadoop, data lakes provide a way for organizations to cost-effectively store nearly limitless amounts of structured and unstructured data from myriad sources without regard to how that data might be leveraged in the future. By its very nature and through self-service business intelligence capabilities, a data lake also encourages experimentation and data exploration by a broader set of users beyond business analysts. According to a survey conducted by TDWI Research, 85 percent of respondents considered the data lake an opportunity to address the challenges they face trying to manage the data deluge with traditional relational databases. Moreover, the TDWI survey found the data lake being pursued for a variety of benefits and use cases, the most prominent being advanced analytics (49 percent) and data discovery (49 percent).
If there’s one key phenomenon that business leaders across all industries have latched onto in recent years, it’s the value of data. The business intelligence and analytics market continues to grow at a massive rate (Gartner forecasts the market will reach $18.3 billion in 2017) as organizations invest in the solutions that they hope will enable them to harvest the potential of that data and disrupt their industries.
But while companies continue to hoard data and invest in analytics tools that they hope will help them determine and drive additional value, the General Data Protection Regulation (GDPR) is forcing best practices in the capture, management and use of personal data.
The European Union’s GDPR stipulates stringent rules around how data must be handled. Because it affects the entire data lifecycle, organizations must have an end-to-end understanding of their personal data, from its collection and processing through to its storage and, finally, its destruction.
As companies scramble to meet the May 25 deadline, data governance is a key focus. But organizations cannot treat the new regulations as just a box to check: continuous compliance is required, and most organizations are having to create new policies that will help them achieve a privacy-by-design model.
One of the great challenges in securely managing data is the rapid adoption of data analytics across businesses, as analytics moves from an IT back-office function to a core asset for business units. As a result, data often flows in many directions across the business, and it becomes difficult to understand the data about the data, such as data lineage (where data was created and how it got to where it is now).
Organizations may have personal data in many different formats and types (both structured and unstructured) across many different locations. Under the GDPR, it will be crucial to know and manage where personal data sits across the business. While no one is certain exactly how the GDPR will be enforced, organizations will need to be able to demonstrate, at a moment’s notice, that their data management processes are continually in compliance.
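As a rough sketch of where a personal-data inventory might start (the patterns and file layout below are invented and far simpler than any real discovery tool), one could scan a data store for common personal-data markers and map each kind to its locations:

```python
import re
from pathlib import Path

# Invented patterns for two kinds of personal data; a real inventory
# would use far richer detection than a pair of regexes.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

def scan_for_pii(root: Path) -> dict:
    """Map each kind of personal data to the files that contain it."""
    findings = {kind: [] for kind in PII_PATTERNS}
    for path in root.rglob("*.txt"):
        text = path.read_text(errors="ignore")
        for kind, pattern in PII_PATTERNS.items():
            if pattern.search(text):
                findings[kind].append(str(path))
    return findings

# Demo store: one text file containing an email address.
demo = Path("pii_demo")
demo.mkdir(exist_ok=True)
(demo / "note.txt").write_text("contact: jane.doe@example.com")
found = scan_for_pii(demo)
```

An inventory like this is only a starting point; the value under the GDPR comes from keeping it continuously up to date rather than running it once.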
With the diverse sources and banks of data that many organizations have, consolidating this data will be key to effectively managing their compliance with the GDPR. With the numerous different types of data that must be held across an organization, data lakes are a clear solution to the challenge of storing and managing disparate data.
A data lake is a storage method that holds raw data, including structured, semi-structured, and unstructured data. The structure and requirements of the data are only defined once the data is needed. Increasingly, we’re seeing data lakes used to centralize enterprise information, including personal data that originates from a variety of sources, such as sales, CX, social media, digital systems, and more.
Data lakes, which use tools like Hadoop to track data within the environment, help organizations bring all the data together into a data lake where it can all be maintained and governed collectively. The ability to store structured, semi-structured, and unstructured data is crucial to the value of this approach for consolidating data assets, compared to data warehouses, which primarily maintain structured, processed data. Enabling organizations to discover, integrate, cleanse, and protect data that can then be shared safely is essential for effective data governance.
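To make the schema-on-read idea concrete, here is a minimal sketch in plain Python (a local directory stands in for HDFS or object storage; all paths and field names are invented): raw data of any format is landed untouched, with a small metadata sidecar so it can be found and governed later.

```python
import json
import time
from pathlib import Path

LAKE_ROOT = Path("lake")  # hypothetical local stand-in for HDFS or object storage

def ingest(source: str, payload: bytes, fmt: str) -> Path:
    """Land raw data as-is, partitioned by source and date, with a metadata sidecar."""
    day = time.strftime("%Y-%m-%d")
    target_dir = LAKE_ROOT / "raw" / source / day
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{int(time.time() * 1000)}.{fmt}"
    target.write_bytes(payload)  # schema-on-read: no transformation at ingest time
    meta = {"source": source, "format": fmt, "ingested_at": day, "bytes": len(payload)}
    Path(str(target) + ".meta.json").write_text(json.dumps(meta))
    return target

# Structured and unstructured records land side by side, untouched.
p1 = ingest("crm", b'{"customer": "David Smith"}', "json")
p2 = ingest("support", b"Call transcript: customer reported an issue...", "txt")
```

The sidecar metadata is the governance hook: without something like it, the lake holds the bytes but loses the context needed to discover, cleanse, and protect them.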
Beyond the view across the full expanse of the data lake, organizations can look upstream to identify where data came from before it flowed into the lake. That way, organizations can trace specific data back to its source, such as the CX or marketing applications, gaining end-to-end visibility across the entire data supply chain so that data can be scrutinized and identified as necessary.
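One way to picture this upstream tracing is as a walk over recorded lineage metadata. The sketch below (dataset and application names are invented) records which upstream systems each dataset was derived from and walks back until only original sources remain:

```python
from collections import defaultdict

# Hypothetical lineage metadata: each dataset maps to the upstream
# datasets or applications it was derived from.
lineage = defaultdict(list)

def record_derivation(dataset: str, *upstreams: str) -> None:
    lineage[dataset].extend(upstreams)

def trace_to_sources(dataset: str) -> set:
    """Walk the lineage graph upstream until only original sources remain."""
    sources, frontier = set(), [dataset]
    while frontier:
        node = frontier.pop()
        parents = lineage.get(node, [])
        if parents:
            frontier.extend(parents)
        else:
            sources.add(node)  # nothing upstream: an original source
    return sources

record_derivation("lake/customers", "crm_app", "marketing_app")
record_derivation("report/churn", "lake/customers", "support_tickets")

print(sorted(trace_to_sources("report/churn")))
# → ['crm_app', 'marketing_app', 'support_tickets']
```

The same traversal run in reverse answers the GDPR-relevant question of which downstream reports a given source system feeds.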
This end-to-end view of personal data is crucial under the GDPR, enabling businesses to identify the quality and point of origin of all their information. Beyond enabling organizations to store, manage, and identify the source of all their data, data lakes are also a cost-effective means of keeping that data in one place; managing the same volume of data in a data warehouse carries a far higher TCO.
While data lakes currently present the best approach to data management and governance for GDPR compliance, they will not be the last stop in organizations’ journey towards innovative, efficient, and compliant data management. The data storage approaches of the future will be built with the new regulatory climate in mind and will be designed to meet the challenges it presents.
However, with the demand on organizations to create data policies and practices that will support the compliance of their future data storage and analytics endeavors, it is clear that businesses need to start refining processes and policies that will lay the foundations for compliant data innovation in the future. Being able to quickly and easily identify and access all data, with a clear understanding of its source and stewardship, is now the minimum standard for the management of personal data.
Time is running out for many organizations on achieving GDPR compliance, with just weeks until its enforcement. However, companies must take a long-term view and build a data storage model that will enable them to consolidate, harmonize and identify the source of their data in compliance with the GDPR.
The GDPR is also bringing new dimensions to customer demands: customers now value trust and transparency, and they will vote with their feet. They will follow companies that can deliver personalized interactions while giving customers full control over their personal data. Ultimately, companies that establish a system of trust at the core of their customer and employee relationships will win in the digital economy.
We live in the age of data, and as per Gartner, the volume of worldwide information is growing at a minimum rate of 59% annually. Volume alone is a significant challenge to manage, and variety and velocity make it even more difficult. It is also very evident that generation of larger and larger volumes of data will continue, especially if we consider the exponential growth of the number of handheld devices and Internet-connected devices.
For organizations with systems of engagement, this is true — but for others, data volume growth is not that high. Data volume is different for different organizations. In spite of this difference, one common factor across all of them is the importance of meaningful and useful analytics for different stakeholders. With the increased use of tools across organizations for different functionalities, the task of generating meaningful and useful reports for different stakeholders is becoming more and more challenging.
Nick Heudecker, research director at Gartner, has explained the data lake:
“In broad terms, data lakes are marketed as enterprise-wide data management platforms for analyzing disparate sources of data in its native format. The idea is simple: instead of placing data in a purpose-built data store, you move it into a data lake in its original format. This eliminates the upfront costs of data ingestion, like transformation. Once data is placed into the lake, it’s available for analysis by everyone in the organization.”
Thus, a data lake helps organizations gain insight into their data by breaking down data silos. The term “data lake” was first used in 2010, and its definition and characteristics are still evolving. In general, “data lake” refers to a central repository capable of storing zettabytes of data drawn from various internal and external sources in a format as close as possible to the raw data.
A data lake is usually thought of as the collection and collation of all enterprise data from legacy systems and sources, data warehouses and analytics systems, third-party data, social media data, clickstream data, and anything else that might be considered useful information for the enterprise. Although the definition is interesting, is it actually possible or required for every organization?
Different organizations have different challenges and patterns of distributed data, and with such diversified scenarios, every organization has its own need for a data lake. Though the needs, patterns, sources, and architecture of the data differ, the challenges of building a central store, or lake, of data are much the same.
In most cases, data lakes are deployed in the spirit of a data-as-a-service model, where the lake is treated as a centralized system of record serving other systems at enterprise scale. A localized data lake not only expands to support multiple teams but can also spawn multiple data lake instances to support larger needs. This centralized data can then be used by all the different teams for their analytical needs.
With this understanding in place, it’s time to discuss the various needs of data lakes in terms of integration and governance.
In order to deploy a data lake at the enterprise level, it needs to have certain capabilities that will allow it to be integrated within the overall data management strategy, IT applications, and data flow landscape of the organization.
The data lake is not only about storing data centrally and furnishing it to different departments whenever required. With more and more users beginning to use data lakes directly or through downstream applications or analytical tools, the importance of governance for data lakes increases. Data lakes create a new level of challenges and opportunities by bringing in diversified datasets from various repositories to one single repository.
The major challenge is to ensure that data governance policies and procedures exist and are enforced in the data lake. There should be a clear owner defined for each dataset as it enters the lake, and a well-documented policy or guideline covering the required accessibility, completeness, consistency, and updating of each dataset.
To solve this problem, the data lake should have built-in mechanisms to track and record any manipulation of the data assets it holds.
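As a rough illustration of such a mechanism (the names and record structure are invented, not drawn from any particular product), every manipulation of a data asset can be routed through a function that appends who did what, to which asset, and when, to an audit trail:

```python
import time

AUDIT_LOG = []  # in practice this would be an append-only store, not a list in memory

def audited(action: str, dataset: str, user: str) -> None:
    """Record who did what to which data asset, and when."""
    AUDIT_LOG.append({
        "ts": time.time(),
        "user": user,
        "action": action,
        "dataset": dataset,
    })

def update_dataset(dataset: str, user: str, new_rows: list) -> None:
    audited("update", dataset, user)
    # ... the actual write to the lake would happen here ...

update_dataset("lake/customers", "alice", [{"customer": "D. Smith"}])
```

The key design point is that the audit record is written by the access layer itself, so no manipulation of a data asset can bypass it.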
The implementation of a data lake is not the same for all organizations, as data volume and requirements of data collection vary from organization to organization. In general, a data lake comes with the perception that the data volume should be at a level of petabytes or zettabytes or even more, and needs to be implemented using a NoSQL database. In reality, this amount of data volume together with the implementation of a NoSQL DB may not be needed or may not be possible for all organizations. The end goal of having a central data store catering to all analytical needs of the organization can be started with a SQL DB and with a considerable data volume.
C-suite executives: What is the impact of big data in your organization? If you suggest I speak to the CIO, you might be surprised when you take a closer look.
Big data is at a tipping point across many corporate disciplines, from marketing to customer support. Corporate leaders across every part of their businesses are using it to innovate and gain new insights into their existing business.
The impact of big data is spreading rapidly throughout many companies and becoming a part of their DNA. The danger, however, is that some senior managers may be unaware of its effect.
Big data deployments are commonly localized. When organizations first begin experimenting with big data, they often do so with small proof-of-concept projects. Sometimes the senior management may be aware of these, but in many cases, they may be run as “skunk works” projects conducted autonomously to test a team’s assumptions.
This creates several problems for senior executives. The first is that they don’t have a unified view of the organization’s big data initiatives, which tends to keep big data from becoming a strategic focus. The biggest value of data will come when connecting the dots across an entire organization. For instance, a single view of a customer does not live in marketing or in sales—it spans the entire organization. It’s hard to encourage joined-up thinking and pursue a bigger goal when different people in the organization are fostering a piecemeal approach to big data projects.
Another problem is a lack of governance. Data can be a powerful tool, but it can also be destructive to your business if misused. If ad hoc projects are started without adequate oversight, they can lead to the inappropriate use of personal information.
Without a strategic approach to big data, executives risk developing pockets of noncompliance as well-intentioned middle managers experiment with sensitive data. All it takes is one slip, and it could leave chief executives trying to explain themselves while the company makes negative headlines.
By pulling these projects together into a holistic strategy, business leaders can take control and help write the company’s big data narrative. This offers several advantages. First, it gives business leaders more control over data privacy and security. By imposing best practices in an enterprise-wide policy, executives reduce the risk of rogue managers creating security gaps in the corporate infrastructure. In a world where data provenance and protection is more important than ever, this point cannot be overstated.
Gaining control over data governance will make the compliance department happy. Instead of having to find and audit a universe of undocumented big data projects, the compliance team will already be aware of them, and will be assured that the project managers drew from a data governance playbook when creating them. This should make auditing far easier and reduce the risk of noncompliance.
The other benefit of creating a governing strategy for big data projects is that the whole can become more than the sum of its individual parts. While few companies expect to pile all their information into a single, huge data ocean, an end-to-end view of these projects can nevertheless introduce opportunities for efficiency and data aggregation. Multiple projects may share information in a single data lake.
Managing data in this way can increase the variety of information available to a single big data project. For example, an ad hoc project might only draw on the sales database, limiting its results. By making unstructured data from other sources available to that project, a company could generate more accurate, insightful results.
These partnerships require a strategic vision—which, in turn, increases the impact of big data at the corporate level. A comprehensive, top-down big data strategy creates space for the executive team to forge a vision of what it wants big data to achieve.
How can corporate leaders turn things around and get to this point? The first step lies in developing a mature big data strategy. Business leaders must treat big data as a program, not a project, with the highest level of executive buy-in.
Then give the program the attention it deserves by turning to experts to design an open big data architecture: it’s unlikely that your big data program will support just one technology or put data in only one location.
Once you have a program with well-defined compliance practices and technical specifications, your organization can document its existing siloed projects, assessing each one to see if it can be incorporated into the new framework.
Even projects that cannot be integrated will still be valuable. Architects can learn from most deployments, understanding what worked and what didn’t. They can then factor those insights into a redesign, as they build the project from scratch with new, enterprise-approved parameters.
The time to do this is now. Data-driven companies such as Uber, Facebook, and Amazon are already disrupting markets by the dozen. They are ahead of the wave. By developing a strategy that embraces the impact of big data across your organization and beyond, you can surf this wave too.
I’ve written numerous times over the past few years about the power of medical data and the numerous issues surrounding it, from the new ways of collecting and analyzing it to the importance of strong data governance.
Despite the tremendous potential for delivering better care, and also better medical research, the joined-up use of medical data remains largely overlooked. A recent paper examines some of the reasons why that is.
The study found considerable variance in the IT systems across the NHS, with a continued reliance upon paper records and limited data sharing between departments. With patient medical records remaining the primary source of data around the patient, this undermines efforts to use such records both for better care and better research.
The poor use of data wasn’t confined to the NHS, however, with the study also finding that both the pharma industry and universities were not using data to its full potential. For instance, the pharma industry has a long and murky history over the selective publishing of data around trials, and even academia has been accused of similar practices in recent years.
Various well-publicized cases have led to steps to improve matters. For instance, projects such as the AllTrials campaign strive to promote the proper reporting of clinical trials. Even then, however, it’s estimated that fewer than 50% of trials are reported within two years of completion.
The analysis also revealed there were significant problems with the way data is regulated and governed. Existing regulations exist to ensure patients, the public, and medical professionals are safeguarded, but this can often result in excessive caution being used around patient data. Consent procedures can also limit the impact of studies, especially in niche fields.
“Sometimes, relying on the need for individual consent can limit studies about groups that are difficult to reach, as well as problems such as substance misuse, and any issues seen as sensitive,” the authors say.
All of which adds up to a significant problem. The authors say that the misuse (or non-use) of data is costing lives. Equally, none of the problems outlined above stands in isolation; each is part of a wider picture of data governance in healthcare. While there are clearly good reasons why data governance is crucial, there are also many areas that could be improved.
“It can be argued that data non-use is a greater risk to well-being than data misuse. The non-use of data is a global problem and one that can be difficult to quantify. As individuals, we have a role to play in supporting the safe use of data and taking part where we are able,” the authors conclude.
To gather insights on the state of big data in 2018, we talked to 22 executives from 21 companies who are helping clients manage and optimize their data to drive business value. We asked them, “What are the keys to a successful big data strategy?” Here’s what they told us:
Here’s who we spoke to:
A few weeks ago, I had the great opportunity to attend the MDM and Data Governance Summit held in NYC. The summit was packed with information, trends, best practices, and research from the MDM Gartner Institute. Much of the information is something you don’t find in webinars or white papers on the internet. Speakers were from many different industries who brought different perspectives to the data management concepts.
I also got to meet IT business executives who are all along different parts of their MDM and data governance journey. As I reflect back on the conference, I wanted to share some of my key highlights and takeaways from the event as we all start to prepare for our IT and data strategies in 2018.
The top five main drivers of MDM are:
Achieving synergies for cross-selling
Economies of scale for M&A
A new driver also has been considered recently — digital transformation — and MDM is at the core of digital transformation.
If there was one thing that everyone at the summit agreed upon, it was that the partnership of the business and IT directly impacts the success of the MDM and data governance programs more than any other factor. For one banking company, this happened naturally, as the business also understood the power of data and the issues related to it. But largely, the experience of the people involved in this partnership is that it’s an uphill battle to get the buy-in from the business. Tying the project to a specific business objective is critical in these scenarios. The bottom line is that a solid partnership between business and IT will provide the right foundation for an MDM program.
It’s widely accepted that any MDM journey is long, and takes energy and perseverance. One company mitigated this process by starting with the domain that they thought would have the most impact.
Only about 20% of the audience had some form of data governance council but all the case studies presented had a data governance council in place. The council was made up of both business teams and IT teams. There is no real pattern from the organizational structure perspective. An insurance company that has a hugely successful MDM implementation has the Enterprise Information Management Team as part of the Compliance Team. Another financial company had the team reporting to the COO. It all depends on how your company is organized and does business.
GDPR was everywhere at the summit. 50% of the audience said they were impacted by the regulation, but it looks like a lot of companies still lag far behind in preparing their enterprise data for compliance. This is a serious issue, as there are fewer than 150 days left to get ready. One of the speakers said that MDM is the heart of a GDPR implementation.
Data-as-a-Service is something every company should aim for in the next two to three years. Also, bringing in social media and unstructured data will be key to gain actionable insights from MDM initiatives. Large enterprises have moved beyond CDI and PIM to focus on relationships and hierarchies. Cloud MDM will be in demand, but there is potential for creating more data silos as integration becomes a challenge.
There are just too many technologies in the big data space, so solution architecture becomes key when building a data lake. A common implementation was to load data from legacy systems into Hadoop without any transformation, but without metadata, the lake quickly becomes a swamp. To get true value from big data analytics, MDM and data governance have to be effective and sustainable, and from a technology perspective, there needs to be sound integration with big data systems. My company, Talend, has been at the forefront of big data integration, providing a unified platform for MDM, DQ, ESB, and data integration.
Finally, I want to end this blog with some great quotes from the speakers:
“Digital transformation requires information excellence.”
“If you don’t know where you are, a map won’t help.”
“Big data + data governance = big opportunity.”
“Data is a precious thing and will last longer than the systems themselves.”
“There is no operational excellence without data excellence.”
“A shared solution is the best solution.”
“People and processes are more critical than technology.”
“Rules before tools.”
“Master data is the heart of applications and architecture.”
“There is no AI without IA (Information Agenda).”
As you prepare for MDM and data governance initiatives in 2018, I hope some of my takeaways will spark new ideas for you on how to have a successful journey to MDM.
Slow. Inflexible. Time-consuming. Does that sound like any sensible way to get users the business insights they need to do their jobs? A few years ago, this might have been the only option for business intelligence, but now there’s a fork in the road. Users can go one way with centralized BI run by their IT department, or they can strike out with modern BI solutions they can use by themselves. As you might imagine, in the traditional vs. self-service BI debate, there are pros and cons to be considered before making a choice.
Let’s take the traditional approach first. There are reasons for this controlled BI environment to exist. When you control the data and the BI application, you have a chance of controlling the quality of the results. IT departments concerned about quality (meaning every conscientious IT department) can make sure that data is properly prepared, stored, and secured. They can build systems that offer standardized, scalable reporting, and they can give users answers to their business questions. Of course, those answers may take a little time to materialize, especially if the IT department is busy with other projects. Or if the rate of growth of data (and big data) starts to outstrip the IT department’s resources for handling it.
So, how about the self-service approach? There are now numerous business intelligence tools available that appeal to users through their simplicity and affordability, and we’re not just talking about spreadsheets. Self-service BI has made giant strides to get to a point where users can access data from different sources, get insights from all the sources altogether, and make faster business decisions. The tools are typically intuitive and interactive (those that aren’t tend to disappear from circulation), and let users explore data beyond what the IT department has curated.
But perhaps representing traditional vs. self-service BI as a fork in the road is unrealistic. An organization may need both types of business intelligence. Functional reporting on daily business operations is still a common requirement, even if it is now a smaller part of the overall BI picture. Compliance reporting and dashboarding, for example, are still needed. Moreover, once they are set up, these functions often run happily with little or no intervention. Traditional BI still has a role to play in answering questions about what happened in the past, or about what is happening about operations now. By comparison, for questions about the future, especially spur of the moment “what if” style questions, users want more individual power and faster reaction times than traditional BI is designed to give them. In this case, self-service can be preferable.
This calls to mind the advice of one expert, who said, “When you come to the fork in the road, take it!” This apparently nonsensical statement now starts to make sense. How you “take the fork in the road” and navigate between traditional vs. self-service BI will depend on several factors. Identification of suitable self-service BI use cases is one factor. Business user levels of BI understanding is another. So are data governance and the commonality (or not) of BI systems. We look at each in turn, below.
Much of the demand for self-service BI is driven by the general use case of needing answers in a hurry.
If a deep understanding of BI or data science is needed for a BI application, then that application is unlikely to be self-service. On the other hand, when a self-service tool is sufficiently intuitive and allows users to focus on business results instead of underlying technology, then end-users can work independently of developers, data specialists, and the IT department. One example is smart data visualization allowing ad hoc questions to be easily asked about any part of the data and to any depth.
Self-service BI cannot be at the expense of clarity or confidence in the results. Self-service tools that can use data sources directly can avoid such problems. The data sources do not change, and different users can apply the same tools to check that they get the same results. There may still be a discussion about the way the results are to be interpreted, but there should never be disagreement about the sources or the consistency of the data used to get the results.
Cost is a factor for most enterprises and organizations. Whereas in the past you may have had no option other than to pay separately for hardware and software, now there’s a better way. Instead of the older assembly line or legacy system approaches, enterprises can now move to a single-stack BI approach that provides BI to satisfy advanced users in the IT department, as well as non-technical business users. Once again, “When you come to the fork in the road, take it!”
Even if both traditional and self-service BI will continue to coexist, constantly rising data volumes and accelerating business needs will mean that end-users will do an increasingly large part of BI for themselves. Good self-service BI applications will let users focus on their business questions without needing to build elaborate solutions, and get the insights and answers they need immediately, without having to wait for specialist IT staff to help them out.
Chocolate soufflés are just like data-driven insights — they are rich, transformative, and really hard to create.
Anyone who has attempted to make a soufflé knows that in order to make it right, you must follow the recipe exactly, prepare the ingredients properly, and be methodical with every step of the process. When prepared correctly, a soufflé is magical, but if there is one small error, the whole thing ends in disaster.
Generating insights from your data works the same way. If you don’t have a standard process in place to ensure your data is accurate, usable and secure, your insights will be worthless and even misleading. But unlike the ingredients of a soufflé, managing data is extremely complex and the likelihood of making an error is high.
Leveraging the full value of your data and avoiding costly mistakes starts with a data management and integration (DMI) strategy. Simply put, DMI is a set of policies and procedures meant to provide the right people with timely access to accurate data.
With different types of data now coming from hundreds of internal and external sources, the potential for game-changing insights within your company is only matched by the nerve-racking potential for data chaos.
DMI is more important than ever, so we’ve broken down the four pillars of data management and integration to get you started.
You can’t make a great soufflé without a clear and detailed recipe that describes exactly how to prepare the ingredients, put them together and bake the final product.
Successfully managing your data starts with a “recipe” called data governance. The role of data governance is to establish standard policies and procedures across the organization that set the parameters for managing all of your data.
A comprehensive data governance plan defines user rights and security policies, and governs the technologies used to implement your data procedures.
Data governance is the blueprint on how your company manages its data, focusing on three areas: people, policy, and technology.
Data analysts currently spend most of their time, up to 80%, finding and preparing data before they can do any analysis. Good data governance should alleviate this by providing the plan and structure for data to be secure, easily found, and shared with people who have the appropriate permissions.
A soufflé is actually not that hard to make — if you know how to prepare the right ingredients. For example, the most important ingredient is egg whites, but if the egg whites have any traces of egg yolk, your soufflé will fail.
Most people don’t think about the quality of their data, but, like the tainted egg whites, if your data is dirty, your insights will not be trustworthy.
Dirty data, such as data with spelling errors or data placed in the wrong field, is a big problem that often goes unnoticed. The power of your analysis is only as good as the quality of your data. Even if your data is 99% accurate, the remaining 1% can make your insight dead wrong.
Without data management and integration policies in place to ensure data quality, you will almost certainly have costly data inconsistencies. Let’s say your customer service team receives a complaint from D. Smith, while your marketing team is sending a promotion to a David Smith at the same time. You’ll look incompetent sending conflicting messages to the same person.
Beyond setting policies, a DMI program should include a data cleansing process to remedy some of these issues. Data cleansing tools, such as OpenRefine, reconcile and remove duplicate or incomplete data points.
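The D. Smith/David Smith problem above can be sketched in a few lines. This is an illustrative plain-Python sketch of one simple cleansing rule, not how OpenRefine works: the records, field names, and the choice of email as the match key are all assumptions for the example.

```python
# Illustrative data-cleansing sketch: normalize records, then collapse
# duplicates that share the same normalized email address.

records = [
    {"name": " David Smith ", "email": "D.SMITH@example.com"},
    {"name": "D. Smith",      "email": "d.smith@example.com"},
    {"name": "Jane Doe",      "email": "jane.doe@example.com"},
]

def normalize(record):
    # Trim whitespace; lowercase the email, which serves as the match key.
    return {
        "name": record["name"].strip(),
        "email": record["email"].strip().lower(),
    }

def dedupe(records):
    # Keep the first record seen for each normalized email address.
    seen = {}
    for rec in map(normalize, records):
        seen.setdefault(rec["email"], rec)
    return list(seen.values())

clean = dedupe(records)
print(clean)  # two records remain: one Smith, one Doe
```

Real-world matching is far messier (no shared key, typos in the key itself), which is why dedicated cleansing tools use fuzzy clustering rather than exact keys.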
Your data is generated across dozens of touchpoints, which increases the risk of duplication and redundancy. With a DMI approach that ensures your data is clean and ready for use, you will have the confidence to make decisions based on your analyses.
When you are following a recipe, it is sometimes tempting to skip the instructions and mix all the ingredients together at the same time. If you’ve done this then you know it never ends well.
Integrating your data is exactly the same. The data coming in across all your applications and legacy systems is not compatible. While you need to combine some of it to run your analysis, you can’t just throw it all together willy-nilly and expect accurate results.
Data integration is a complicated endeavor when you consider all the structured and unstructured data available to analyze. There are no one-size-fits-all approaches to data integration, but there are several tools to help automate the integration process of matching, cleaning, and preparing data for analysis.
Data integrations are complex and technical, but at their core are three steps known as ETL: extraction, transformation, and loading.
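The three ETL steps can be sketched in miniature. This is a toy illustration, not a real pipeline: the source lines, field layout, and in-memory target are all stand-ins invented for the example.

```python
# Minimal ETL sketch: extract rows from a raw source, transform them
# into a consistent shape, and load them into a target store.

source = [
    "2017-06-01,acme corp,1200.50",
    "2017-06-02,Globex,980.00",
]
target = []

def extract(lines):
    # Extraction: parse raw CSV-like lines into field lists.
    return [line.split(",") for line in lines]

def transform(rows):
    # Transformation: standardize casing and convert amounts to numbers.
    return [
        {"date": date, "customer": customer.title(), "amount": float(amount)}
        for date, customer, amount in rows
    ]

def load(rows, store):
    # Loading: append the cleaned rows to the target store.
    store.extend(rows)

load(transform(extract(source)), target)
print(target[0])
# {'date': '2017-06-01', 'customer': 'Acme Corp', 'amount': 1200.5}
```

Production ETL tools add scheduling, error handling, and incremental loads on top of this same basic shape.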
You have so many sources of valuable data, but you can’t just throw them in an analytics tool and hope to make sense of it all. Your data management and integration strategy should identify the approach and tools necessary to implement a successful integration.
When the ingredients are prepared and combined properly, your chances of accurate insights and a successful soufflé go from good to great.
After painstakingly preparing your soufflé, you don’t want any unauthorized people in the kitchen to mess with it. If you turn your back for one second and someone opens the oven door, both your soufflé and your mood will be deflated.
A core pillar of DMI is data security. Not a week goes by without news of a massive data breach that compromises the personal information of millions of people. Traditionally, companies kept their data safe by securing their perimeter, but data now comes from the outside and employees access sensitive data with devices beyond the firewall.
One of the biggest data risks within your company has to do with access management. A recent survey found that 47% of companies had at least 1,000 sensitive files open to every employee. Unintentionally giving too much access to employees is a very common problem with dire repercussions.
U.S. retailer Target suffered a data breach in which more than 40 million credit and debit card records, including encrypted PINs, were stolen. While there were many reasons for the breach, a post-breach review identified that too many people had access to sensitive data, including remote contractors.
Your DMI policy should determine user access across the enterprise, but when you have hundreds or even thousands of potential users, how can you keep track of everyone?
Identity and access management (IAM) tools are designed to help you manage access to your data. You can authenticate, authorize, and track users that interact with your data at all times and set rules to comply with your data management policies.
For example, your IAM tool enables your team and contractors to access data remotely on any device by adding extra security measures such as automated logouts and multifactor authentication. Even if a device is lost or stolen, the data will be secure.
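The kind of rule an IAM tool enforces can be shown with a toy role-based access check. The roles, users, and resource names below are invented for illustration; real IAM products layer authentication, auditing, and policy engines on top of this idea.

```python
# Toy role-based access control: a user may access a resource only if
# their assigned role grants it. All names here are hypothetical.

PERMISSIONS = {
    "analyst":    {"sales_reports"},
    "contractor": {"project_docs"},
    "admin":      {"sales_reports", "project_docs", "customer_pii"},
}

USERS = {
    "alice": "admin",
    "bob":   "contractor",
}

def can_access(user, resource):
    # Look up the user's role; unknown users get no access at all.
    role = USERS.get(user)
    return resource in PERMISSIONS.get(role, set())

assert can_access("alice", "customer_pii")       # admins see PII
assert not can_access("bob", "customer_pii")     # contractors do not
assert not can_access("carol", "sales_reports")  # unknown user denied
```

Centralizing the rules in one place, rather than scattering checks through applications, is what makes access auditable.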
In the same way that a perfectly prepared soufflé can transform a dinner, data-driven insights can give your business a competitive advantage that takes your business to a new level.
When you have DMI policies and procedures in place to make sure your data is accurate, accessible, and secure, you will realize the power of your data and make decisions with confidence.
You’ve no doubt heard at least something about the GDPR, the EU’s new privacy and data management law, with its greatly increased maximum fines for noncompliance and tighter definitions for acceptable uses of personal information.
If you’ve continued reading past paragraph one of any of the many articles, you’ll be aware that the law applies globally to all organizations holding EU citizens’ data — it’s not bounded by geography or jurisdiction.
Many of the articles have emphasized the increased penalties, much to the chagrin of the UK’s Information Commissioner, Elizabeth Denham, who is charged with enforcing the law in the UK (and yes, it will apply after Brexit, assuming Brexit happens).
“Heavy fines for serious breaches reflect just how important personal data is in a 21st century world. But we intend to use those powers proportionately and judiciously.” (Source)
But this is all C-level chatter, surely?
Yes, the boss needs to know about the increased risks from incorrect data handling. But what should a DBA do, other than wait for new edicts from on high? After all, the data held by the IT department is actually a business asset, and data governance is often the responsibility of multiple people or even departments.
If, like many DBAs, you’d rather spend your time applying your skills to optimizing your systems, tuning for performance and bang-per-buck (and minimizing the firefighting in between), why not just wait and see?
Well, there are a few reasons the smarter DBA should be preparing now and getting his or her estate in order ahead of the May 2018 enforcement date.
Probably the biggest is that the new law requires “privacy by design and default” throughout data handling.
Like most regulations, it’s open to interpretation, but regulators have been clear that data protection safeguards should be integrated into products and services from the earliest stage of development, with privacy always the default option.
But privacy of what, exactly? Not all data is private data.
You’ll need to do some thinking up front about this, because by the time GDPR is enforced, you’ll be expected to have audited your data and identified the two categories of personal data that require special handling.
Two categories? Yes. Standard personal data includes details like names, addresses, cookie strings, and web surfing data. “Special” personal data relates to data that is, literally, more personal like racial or ethnic origin as well as biometric and genetic data.
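A first audit pass often amounts to tagging each field in a schema against those two categories. The sketch below is purely illustrative: the field lists and schema are invented examples, and classifying real data for GDPR purposes needs legal guidance, not a lookup table.

```python
# Illustrative data-audit step: tag each field in a schema as "special",
# "standard", or "non-personal". Field lists are examples only.

SPECIAL = {"ethnicity", "biometric_id", "genetic_profile"}
STANDARD = {"name", "address", "email", "cookie_id", "browsing_history"}

def classify(field):
    if field in SPECIAL:
        return "special"      # requires the strictest handling
    if field in STANDARD:
        return "standard"     # still personal data under GDPR
    return "non-personal"

schema = ["name", "email", "ethnicity", "order_total"]
audit = {field: classify(field) for field in schema}
print(audit)
# {'name': 'standard', 'email': 'standard',
#  'ethnicity': 'special', 'order_total': 'non-personal'}
```

Even a crude inventory like this forces the conversation about where the data lives and who can see it, which is the point of the exercise.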
That said, now’s the time to ask three key questions.
Now, picture being unprepared when answering those questions for your organization. Many companies struggle with data sprawl, and you won’t be alone in being unsure where all the data is or who really needs access to it.
Is that the answer your CEO will expect? Or the new Data Protection Officer? (Oh, yes, you might be getting a new colleague with statutory responsibilities to enforce GDPR compliance.)
This isn’t all you need to prepare for, but it’s an important first step in ensuring you’re compliant with GDPR. So start the conversation, take the lead if you can, and you’ll protect yourself from impossible deadlines coming out of a “panic compliance” project.
The future you will thank you for starting your GDPR journey now.
How can auditors assess the use of enterprise data given today’s fragmented storage infrastructure? In short, with great difficulty!
Earlier this year, Windocks became the first container engine to incorporate database cloning. The combination of SQL Server containers with database cloning has been immediately popular for support of Dev/Test and reporting needs. A complex Terabyte class database can be delivered in seconds, and only requires an incremental 40 MB of storage.
The combination of SQL Server containers and database clones is great for Dev/Test and reporting, but is also proving to be a huge step forward for Data Governance and Regulatory Compliance. In this article, we’ll explore how this design delivers a versioned, auditable repository of enterprise data for Audit and Compliance purposes.
Data images are built using Full or Differential SQL Server backups, snapshots, and SQL Server incremental log shipping, and are combined with SQL Server scripts to implement data masking during the image build. The resulting image is a full byte copy of the data in the form of Virtual Disks that can span multiple physical (or virtual) disks and large data sets.
The Virtual Disk, in turn, supports the creation of Windows “differencing disks” which are writable clones. Clones are delivered in seconds, and only require 40 MB or less of storage.
Images are built with a Dockerfile that specifies the location of backups, snapshots, or log shipping updates, along with SQL Server scripts. Windocks images support multiple databases, with source files located on the Windocks host or a network attached file share. In the example below, the Dockerfile specifies two databases located on network attached file shares, plus SQL Server scripts for data masking.
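A Dockerfile along these lines illustrates the idea. The directives follow Windocks’ published Dockerfile format as best it can be reconstructed here, and the database names, share paths, and masking script are all placeholders; consult the Windocks documentation for the exact syntax:

```dockerfile
# Illustrative Windocks-style Dockerfile (names and paths are placeholders).
# Builds a cloneable image from two databases on network file shares,
# applying a data-masking script during the image build.
FROM mssql-2016
SETUPCLONING FULL customers \\fileshare\backups\customers.bak
SETUPCLONING FULL operations \\fileshare\backups\operations.bak
COPY maskPII.sql .
RUN maskPII.sql
```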
The resulting image is versioned and auditable and supports delivery of multi-terabyte environments in seconds for Development and Test, and for reporting and BI. These data environments can now be delivered automatically or provisioned by users, for use with any SQL Server container (both Windocks and Microsoft’s), as well as with conventional SQL Server instances.
The design as described was implemented to address the needs for delivery of data environments for Dev/Test and reporting needs, but the new Data Image repository is ideally suited for expanding data governance and regulatory compliance needs.
Privacy/Security: Security is improved by delivering data through a structured container process. Ad hoc access to enterprise data can be curtailed, with approved, auditable images used to support dev and test, as well as reporting and BI needs. Privacy is enhanced by data masking implemented during the image build. National boundaries are respected by hosting image registries in the appropriate country, as the Windocks solution runs wherever Windows servers are supported (on-premises, private, or public cloud).
Quality: The container workflow enhances data quality and consistency by supporting the use of production databases as the authoritative source of data. The workflow outlined above will soon be enhanced with native Jenkins or Team City server support for Continuous Integration, making this approach unique as the first full-stack Jenkins Continuous Integration solution that incorporates production database support.
Access and Use: Docker containers are emerging as the de facto standard for software development and test. Containers play a prominent role in Microsoft’s strategies for Windows Server 2016 and SQL Server 2017. The approach outlined here provides organizations with an on-ramp to Docker-based workflows on Windows Server 2012 and 2016, with support for all editions of SQL Server 2008 onward. Not only does this workflow improve access to data (on-demand and in seconds) with the latest preferred dev and test tooling, it is also uniquely useful for SQL Server reporting and BI purposes. It also integrates with existing backup and DR infrastructure, making it easy to add to existing systems.
Open: As a result of customer feedback Windocks is also expanding support for delivery of data environments from any Storage Area Network (SAN), from NetApp, EqualLogic, and others. Copy Data Management systems will also be supported, such as from Cohesity and Rubrik. Finally, support for MySQL, DB2, and other environments will also be added as requested by customers.
Data governance should not be an afterthought for modern software development and delivery strategies. Windocks’ combination of SQL Server containers with database cloning delivers benefits for development and test, reporting and BI, and enhances data governance and policy compliance. The solution installs with existing systems and delivers Terabyte-class data environments in seconds while creating immutable, versioned and auditable images that address many data governance needs.
Explore how Windocks can enhance your data governance and delivery with a free Windocks Community Edition. Download your free evaluation of Windocks here.