It was great speaking to Oksana Sokolovsky, CEO of Io-Tahoe, about the availability of a new smart data discovery solution. The new solution includes the addition of Data Catalog, which allows data owners and data stewards to use a machine learning-based smart catalog to create, maintain, and search business rules, define policies, and provide governance workflow functionality. Io-Tahoe’s data discovery capability provides complete business rule management and enrichment. It enables a business user to govern the rules and define policies for critical data elements. It allows data-driven enterprises to enhance information about data automatically, regardless of the underlying technology and build a data catalog.
“Today’s digital business is driving new requirements for data discovery,” says Stewart Bond, Director Data Integration and Integrity Software Research, IDC. “Now more than ever, enterprises are demanding effective and comprehensive access to their data — regardless of where it is retained — with a clear view into more than its metadata but its contents, as well. Io-Tahoe is delivering a solution for data discovery to empower governance and compliance with a deeper view and understanding of data and its relationships.”
“Io-Tahoe allows the organization to conduct data discovery across enterprise landscapes, ranging from databases, data warehouses, and data lakes, bringing disparate data worlds together into a common view which will lead to a universal metadata store. This enables organizations to have full insight into their data to better achieve their business goals, drive data analytics, enhance data governance, and meet regulatory demands required in advance of regulations such as GDPR.”
Increasing governance and compliance demands have created an opportunity for data discovery. According to MarketsandMarkets, the data discovery market is estimated to grow from $4.33 billion USD in 2016 to $10.66 billion USD in 2021. This is driven by the increasing importance of data-driven decision-making and self-service business intelligence (BI) tools. However, the challenge of integrating the growing number of disparate platforms, databases, data lakes, and other silos of data has prevented the comprehensive governance, and use, of enterprise data.
Io-Tahoe’s smart data discovery solution features an algorithmic approach to auto-discover rich information about data and data relationships. Its machine learning technology looks beyond metadata and at the data itself for greater insight and visibility into complex datasets across the enterprise. Built to scale, Io-Tahoe makes data available to everyone in the organization, untangling the complex maze of data relationships and enabling applications such as data science, data analytics, data governance, and data management.
The technology-agnostic solution spans silos of data and creates a centralized repository of discovered data that users can search and govern. Through self-service features, users can increase team engagement through simplified and accurate sharing of data knowledge, business rules, and reports. Users have a greater ability to analyze, visualize, and leverage business intelligence and other tools, all of which have become the foundation to power data processes.
The new version of SQL Server Management Studio (v17.5) brings with it a new feature: SQL Data Discovery and Classification. You might recall that in SSMS 17.4, the Vulnerability Assessment feature was added. So, that’s two new features in the last two releases. This is the beauty of de-coupling SSMS from the SQL Server install media. We get more features, faster. But I digress.
The SQL Data Discovery and Classification feature will seem familiar to anyone working with Dynamic Data Masking in Azure. Both features use T-SQL to parse the names of columns to identify and classify the data. (This feature is also available in the Data Migration Assistant, where you can get a list of columns that would benefit from either Dynamic Data Masking or Always Encrypted.)
The SQL Data Discovery and Classification feature will help users discover, classify, and label columns that contain sensitive data. The feature also allows for the generation of reports for auditing purposes. With GDPR less than three months away, this could be the one feature that helps your company remain compliant.
Using the Data Discovery and Classification tool is easy. Just select a database and right-click. Go to Tasks > Classify Data…. Here is an example using a test GalacticWorks database:
My GalacticWorksTest database has only one table, a copy of the AdventureWorks2012.Sales.CreditCard table. You can see the results of the scan here:
I’ve highlighted the Information Type and Sensitivity Label column headers. I want to make certain you understand these columns represent dropdown windows, allowing you to alter both as needed.
The options for Information Type are as follows: Banking, Contact Info, Credentials, Credit Card, Date Of Birth, Financial, Health, Name, National ID, Networking, SSN, Other, and [n/a]. Here’s what the drop-down looks like:
The options for Sensitivity Label are as follows: Public, General, Confidential, Confidential – GDPR, Highly Confidential, Highly Confidential – GDPR, and [n/a]. Here’s what the drop-down looks like:
Since the feature is parsing column names, we will create a new table and use non-English names. We will also use abbreviations for column names for those of you
old experienced enough to remember when abbreviations were in vogue.
OK, let’s create a new table:
CREATE TABLE [Sales].[Tarjeta]( [TarjetaCreditoID] [int] IDENTITY(1,1) NOT NULL, [TarjetaTipo] [nvarchar](50) NOT NULL, [TarjetaNumero] [nvarchar](25) NOT NULL, [TARNUM] [nvarchar](25) NOT NULL, [ExpMonth] [tinyint] NOT NULL, [ExpYear] [smallint] NOT NULL, [ModifiedDate] [datetime] NOT NULL, CONSTRAINT [PK_Tarjeta_TarjetaCreditoID] PRIMARY KEY CLUSTERED ([TarjetaCreditoID] ASC )ON [PRIMARY]) ON [PRIMARY] GO
I’ve created a table almost identical to the Sales.CreditCard table, except that I am using Spanish names for credit (credito), card tarjeta), number (numero), and type (tipo). I’ve also added a column TARNUM, an abbreviation for the TarjetaNumero column. That’s the column that would have actual credit card numbers.
We will re-run the classification again (make sure you close the first results; otherwise, you won’t get a refresh with the new table included). Also note that I don’t need data in the table for this feature to evaluate the columns. I haven’t loaded any rows into Sales.Tarjeta, and here’s the result (I’ve scrolled down to show the three new rows):
The Data Discovery and Classification tool identified three columns: ExpMonth, ExpYear, and CreditoID. However, it missed TarjetaNumero and TARNUM, which would have the actual credit card numbers. The TarjetaCreditID column has no card number, just an IDENTITY(1,1) value used for a primary key.
One last item of interest. When a column is classified, the details are stored as extended properties. Here’s an example:
You can see that the Data Discovery and Classification feature does not flag the columns I created for this test. Because the feature focuses on keywords, it’s expected behavior that columns will be skipped. There are two reasons why. The first is the fact that the use of keywords has some cultural bias. For example, SSN is flagged as a keyword for the American Social Security Number. But in the Netherlands, it’s possible to have SOFINR as a column name abbreviation for Social Fiscal Number, and SOFINR is currently not flagged.
The second reason is that the feature only supports English, and offers partial support for a handful of non-English languages (Spanish, Portuguese, French, German, and Italian). (As I’m writing this at the SQL Konferenz in Germany, I found that Personalausweis, the name for the German Identification card, is flagged.) However, the MSDN I provided earlier makes no mention of supported languages or collations. I’m hoping that the MSDN pages get updates to reflect the languages and collations that are supported, to avoid any confusion for users.
Once you have reviewed and classified your data, you will want to run a report. Using AdventureWorks2008 as an example, I will accept all 39 recommendations and click Save > View Report. Here’s the result:
The report shows that the AdventureWorks2008 database has 39 distinct columns in 19 distinct tables that have been classified. This is the information you can now hand over to your audit team.
This report is at the database level. That means you will need to roll your own solution to get the details from many databases at the same time. It should be possible to use some Powershell voodoo to extract the data. Or, better yet, fire up PowerBI and use that to build your own dashboard.
The SQL Data Discovery and Classification feature is a great first step by Microsoft to help users understand where sensitive data may exist in their enterprise. This is also a good time to remind you why having a data dictionary is important. Even with all the right tools in place, all the right people, using all the right knowledge, you will still miss a column of sensitive data at times.
That’s because life is dirty, and so’s your data. Identifying and classifying data is not an easy task. You won’t get a perfect result with a simple right-click of a mouse. It takes diligence on the part of the data professional to curate the necessary metadata for data classification. Not every data professional has the time or patience for such efforts.
But the Data Discovery and Classification tool is a great first step forward. I can’t wait to watch this feature as it grows.
Data profiling is the process of assessing data values and deriving statistics or business information about the data. It allows data scientists to validate data quality and business analysts to determine the usage of the existing data for different purposes. Kylo automatically generates profile statistics such as minimum, maximum, mean, standard deviation, variance, aggregates (count and sum), the occurrence of null values, the occurrence of uniqueness, the occurrence of missing values, the occurrence of duplicates, the occurrence of top values, and occurrence of valid and invalid values.
Once the data has been ingested, cleansed, and persisted in data lake, the business analyst searches and finds out if the data can deliver business impact. Kylo allows users to build queries to access the data so as to build data products supporting analysis and make data discovery simple. In this blog, let’s discuss automatic data profiling and search-based data discovery in Kylo.
To learn about Kylo deployment requiring knowledge on different components/technologies, refer to our previous article on Kylo setup for data lake management.
To learn more about Kylo self-service data ingest, refer to this previous article.
Kylo uses Apache Spark for data profiling, data validation, data cleansing, data wrangling, and schema detection. Kylo’s data profiling routine generates statistics for each field in an incoming dataset. Profiling is used to validate data quality. The profiling statistics can be found on the Feed Details page.
The feed ingestion using Kafka is shown in the below diagram:
Informative summaries about each field from the ingested data can be viewed under the View option on the Profile page. String (User field in the sample dataset) and numeric data type (Amount field in the sample dataset) profiling details are shown in the below diagrams:
Kylo profiling jobs automatically calculate the basic numeric field statistics such as minimum, maximum, mean, standard deviation, variance, and sum. Kylo provides basic statistics for the String field. The numeric field statistics for the Amount field are shown in the below diagram:
The basic statistics for the String field (User field) are shown in the below diagram:
Predefined standardization rules are used to manipulate data into conventional or canonical formats (i.e. dates, stripping special characters) or data protection (i.e. masking credit cards, PII, etc.). A few standardization rules applied on the ingested data are as follows:
Kylo provides an extensible Java API to develop custom validation, custom cleansing, and standardization routines as per business needs. The standardization rules applied to the User, Business, and Address fields as per the configuration is shown in the below diagram:
Kylo’s profiling window provides additional tabs such as Valid and Invalid to view both valid and invalid data after data ingestion. If validation rules fail, the data will be marked as invalid and will be shown under the Invalid tab with the reason for failure such as range validator rule violation, not considered as timestamp, and so on.
The data is ingested from Kafka. During feed creation, Kafka batch size is set as “10000” (which is the number of messages that the Kafka producer will attempt to batch before sending it to the consumer). To know more on batch size, refer to our previous article. Profiling applied on each batch data and informative summary is available on the Profile page. 68K records of data consumed from Kafka are shown in the below diagram:
Kylo uses Elasticsearch to provide the index for search features such as free-form data and metadata. It allows business analysts to decide on the required fields to be searchable and to enable index options for those fields while creating a feed. The indexed User and Business fields searchable from Global Search are shown in the below diagram:
The predefined Index Feed queries the index-enabled field data from the persisted Hive table and indexes the feed data into Elasticsearch. The Index Feed is automatically triggered as a part of the Data Ingest template. The index feed job status is highlighted in the below diagram:
If the index feed fails, a search cannot be performed on the ingested data. As “user” is a reserved word in Hive, the search functionality for User and Business fields failed due to the field name “user,” as shown in the below diagram:
To resolve this, the “user” field name is modified as “customer_name” during feed creation.
The search query to return the matched documents from the Elasticsearch is:
customer_name: “Bradley Martinez”:
The search query (Lucence search query) to search data and metadata is:
business: “JP Morgan Chase & Co”:
In this article, we discussed automatic data profiling and search-based data discovery in Kylo. We discussed a few issues faced with Index Feed and their solutions, too. Kylo uses Apache Spark for data profiling, data validation, data cleansing, data wrangling, and schema detection. It provides extensible API capabilities to build custom validator and standardizer. Kylo automatically performs data profiling and discovery in the background on performing proper setup with different technologies.