Social media provides a window into our minds in all manner of ways, with researchers using what we share online to understand a wide range of phenomena. A recent study from the First Moscow Medical University suggests that we can do something similar to gain insights into the side effects of taking medicine.
The researchers propose a system whereby complaints such as feeling a little giddy that are posted on social media can now be translated into medical terms such as vertigo.
In this article, I will do regression analysis with Oracle data mining.
Data science and machine learning are very popular today, but these subjects require extensive knowledge and application expertise. Various companies have developed products and software to help solve these problems. In Oracle, methods and algorithms for solving them are presented to users through the Oracle Data Mining infrastructure.
The most useful analytics come from data that is stored properly, categorized correctly, and mined thoroughly. To effectively store and use the data your business collects, you must first incorporate the following aspects.
Over the last half-century, data management has significantly changed the way in which data is organized to be processed by a computer. Today, data can be stored non-sequentially and still used effectively. The usefulness of proper data management has not been lost, as its principles extend far beyond how data is stored.
Data mining is the process of discovering hidden, valuable knowledge by analyzing large amounts of data, which is often stored across multiple databases.
As data mining is a very important process, it is advantageous for various industries, such as manufacturing, marketing, etc. Therefore, there’s a need for a standard data mining process. This data mining process must be reliable. Also, this process should be repeatable by business people with little to no knowledge of data science.
Data mining has been a major topic of discussion within the technology community for nearly 30 years, since Gregory Piatetsky-Shapiro first coined the term. Data mining has become more prevalent in recent years, as organizations store much larger datasets and use Hadoop-based tools to extract and sort them more easily. Data mining processes have evolved to the point that they can also extract previously undetectable patterns from datasets.
This is a concept known as sequential pattern mining. Data scientists are not only able to identify previously unrecognizable correlations between variables but can also look at chronological sequencing to infer causal relationships between different events that would otherwise be overlooked.
This is an exciting new field of focus within the big data community. However, it has introduced a new set of challenges that data scientists must overcome.
Data scientists trying to conduct sequential pattern mining often fall into the same traps as analysts running other regression analyses. The biggest mistake is confusing correlation with causation.
Data scientists must be careful not to draw inaccurate conclusions from the data they evaluate. Here are some of the biggest issues they will face.
Some patterns tell a very important story that decision-makers will need to rely on. Others have little significance to the problems they are meant to represent. Decision-makers who draw the wrong conclusions from them can make costly errors.
It is often impossible to separate the two by evaluating a pattern in isolation from other system constraints. The pattern must be looked at in context, and the right nuances need to be built into the model.
More importantly, the patterns must be pertinent to the system itself. Carl H. Mooney of Flinders University illustrates this in a series of tests he conducted for his whitepaper: Sequential Pattern Mining: Approaches and Algorithms.
“This seminal work, however, has some limitations: Given that the output was the maximal frequent sequences, some of the inferences (rules) that could be made could be construed as being of no real value. For example, a retail store would probably not be interested in knowing that a customer purchased product ‘A’ and some considerable time later purchased product ‘B.’”
Evaluating dozens of terabytes of data requires a tremendous amount of computing resources and time. It can take days, weeks or even months to properly mine patterns without the right controls in place. By the time the patterns have been extracted, it may be too late to use them.
Every system has its own limitations. While mining patterns, analysts must make sure the observations they make don't lead decision-makers to conclusions that exceed those limits.
There are a few steps that you can take to address the concerns that arise with sequential pattern mining. Here are some of the most important controls to build into the framework.
Data analysts cannot make causal inferences from data without reliable timestamps. The exact time of each event needs to be stored accurately and kept easily accessible in the datasets used for sequential pattern mining.
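As a sketch of the timestamp hygiene this requires, the snippet below (event names are hypothetical) records each event with an explicit UTC timestamp, so chronological order is unambiguous regardless of where the server lives:

```python
from datetime import datetime, timezone

def record_event(log, name):
    """Append an event with an explicit UTC timestamp so ordering survives
    server moves and timezone changes."""
    log.append({"event": name, "ts": datetime.now(timezone.utc).isoformat()})

log = []
record_event(log, "page_view")
record_event(log, "add_to_cart")

# ISO 8601 strings in UTC sort lexicographically in chronological order,
# which makes them safe keys for sequential pattern mining.
ordered = sorted(log, key=lambda e: e["ts"])
```

Storing timezone-aware ISO 8601 strings (rather than local times or bare epoch offsets from mixed sources) is one simple way to avoid the mismatched-timestamp problem described above.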
A number of pitfalls can occur if the server storing your datasets is taken off the network. The biggest issue is that new data added later may carry the wrong timestamps, which can cause massive problems for your sequential regression analysis. Check your provider's uptime history to avoid this issue.
Some sequential pattern mining methods have more of a proven track record than others. Consider using Apriori-based and pattern-growth-based approaches, as there is considerable evidence supporting their effectiveness.
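As a minimal, Apriori-flavored sketch of what these approaches count (the sequences and support threshold here are made up, and real algorithms prune the search space far more aggressively):

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(sequences, min_support):
    """Count ordered item pairs (a before b) across sequences and keep those
    meeting min_support. A toy sketch, not a full SPM implementation."""
    counts = Counter()
    for seq in sequences:
        seen = set()
        for i, j in combinations(range(len(seq)), 2):
            pair = (seq[i], seq[j])
            if pair not in seen:       # count each pair once per sequence
                seen.add(pair)
                counts[pair] += 1
    return {p: c for p, c in counts.items() if c >= min_support}

seqs = [["A", "B", "C"], ["A", "C"], ["A", "B"]]
print(frequent_pairs(seqs, 2))  # {('A', 'B'): 2, ('A', 'C'): 2}
```

The Apriori insight is that a sequence can only be frequent if all of its sub-sequences are frequent, which is what lets real implementations avoid enumerating every pair as this toy version does.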
Horizontal formatting applies to the structuring of the original data. You need to list the dependent and independent variables that will be used for your regression analysis. In many applications, this would be a customer ID, which is transformed by transactional time and other variables in a customer sequence database.
The exact structure of your horizontal formatting varies by application. The important thing is to identify the dependent variables you will be studying and make sure they are properly sorted in your dataset so they can be easily accessed for future reference.
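The horizontal format can be sketched as follows, assuming hypothetical transaction rows of (customer ID, timestamp, item) that get grouped into one time-ordered sequence per customer:

```python
from collections import defaultdict

# Hypothetical transaction rows: (customer_id, timestamp, item).
rows = [
    (2, "2023-01-05", "laptop"),
    (1, "2023-01-02", "mouse"),
    (1, "2023-01-01", "keyboard"),
    (2, "2023-02-10", "dock"),
]

# Horizontal format: one row per customer, events ordered by time.
sequences = defaultdict(list)
for cust, ts, item in sorted(rows, key=lambda r: (r[0], r[1])):
    sequences[cust].append(item)

print(dict(sequences))
# {1: ['keyboard', 'mouse'], 2: ['laptop', 'dock']}
```

Each customer's row now reads as a chronological sequence, which is the shape sequential pattern mining algorithms expect as input.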
Sequential pattern mining is a very complex field that has introduced a number of challenges for data analysts. The good news is that there are a number of methodologies that can help minimize these challenges.
Ideation, in terms of community, can be a useful tool on several fronts. It can motivate employees, helping them feel appreciated when others vote for their ideas. It can also be used to increase engagement, ensuring your developers stay in your community and continue to grow. Most importantly, it can be the driver of your product development as your community suggests new uses and features they would like to see.
Using ideation to motivate employees is something of a no-brainer. People are happy when they are heard and appreciated, leading them to work more efficiently and get projects done quickly. Developers aren’t necessarily looking for their boss’s approval, though; it’s more about their peers. A developer community is the perfect place for this because it’s the organic place for people to share their ideas and knowledge. And who knows, maybe some of these great ideas are about more than just the code they’re working on.
You may be familiar with the concept of data mining — examining large sets of data to generate new information. Ideation through community is doing this by gathering data on what users want, which means that it’s then up to you to process that data and gain information. When looking at ideation as the driver of your product development, this means that you’re letting your users not only pick the path the product goes on to grow, you’re actually letting them build the road with you. You’re going to save countless hours and dollars by using ideation rather than traditional methods of focus groups, testing markets for viability, or releasing a beta version of your product with features that you “hope” your audience wants to have.
Instead, you give your product and marketing teams a head start on that whole process by allowing them to examine what the community wants and then create value statements around those things. You can have high-level conversations about your next major release rather than having to dive into small updates and hope they fit the market.
Let your developer community help build the product they want to see next and you’ll see an increase in engagement leading to better products that people are passionate about being released fast.
Statisticians say the weirdest things. At least, that is how it can feel if you are not well-versed in statistical terminology. Statistics also has an unhelpful tendency to use words that change their meaning based on the context in which they are used. This causes a lot of heartburn and confusion for less technically minded analysts.
Hence, I decided to write an article about some interesting statistical terms that I have heard, used, and explained over the past two decades to audiences ranging from college graduates at national and international universities to corporate head honchos.
Heterogeneity: In statistics, this means that your population samples have widely varying results.
Heteroscedasticity: This refers to the circumstance in which the variability of a variable is unequal across the range of values of a second variable that predicts it. Constant variance is one of the most common assumptions of parametric analyses (e.g. linear regression), so heteroscedasticity is something you test for.
Homogeneity: This term is opposite of heterogeneity; it means that your populations and samples have similar traits. Homogeneous samples are usually small and are made of similar cases.
Microdata: Individual response data obtained from surveys and censuses. These are data points directly observed or collected from a specific unit of observation. Also known as raw data. ICPSR is an excellent resource for obtaining microdata files.
Data point or datum: Singular of data. Refers to a single point of data. Example: The amount of aviation gasoline consumed by the transportation sector in the U.S. in 2012.
Quantitative data/variables: Information that can be handled numerically. Example: Spending by US consumers on personal care products and services.
Qualitative data/variables: Information that refers to the quality of something. Ethnographic research, participant observation, open-ended interviews, etc., may collect qualitative data. Often, however, some element of the results obtained via qualitative research can be handled numerically, e.g. the number of observations or the number of interviews conducted. Example: Periods when the US was in vs. was not in a recession. The quality of being in a recession is assigned a value of 1, and not being in a recession a value of 0, which makes it possible to display as a chart.
Indicator: Typically used as a synonym for statistic; describes variables that say something about the socioeconomic environment of a society, e.g. per capita income, unemployment rate, or median years of education.
Statistic: A number that describes some characteristic or status of a variable, i.e. a count or a percentage. Example: Total non-farm job starts in August 2014.
Statistics: Numerical summaries of data that has been analyzed in some way. Example: Ranking of airlines by percentage of flights arriving on time at Huntsville International Airport in Alabama in 2013.
Time series data: Any data arranged in chronological order. Example: Gross Domestic Product of Greece, 2000-2013.
Variable: Any finding that can change or vary. Examples include anything that can be measured, such as the number of logging operations in Alabama.
Numerical variable: Usually refers to a variable whose possible values are numbers. Example: Bank prime loan rate.
Categorical variable: A variable that distinguishes among subjects by putting them in categories (e.g. gender). Also called discrete or nominal variables. Example: Female infant mortality rate of Belarus (the mortality rate is numerical; the age/gender characteristic is categorical).
Time series: A set of measures of a single variable recorded over a period of time. Example: Hourly mean earnings of civilian workers — mining management, professional, and related workers.
Alpha-beta conundrum: There are so many meanings for these two statistical terms that one can get confused in no time. So, let’s understand the meaning of these two statistics in various contexts.
Alpha error: It is the probability of a Type I error in any hypothesis test — incorrectly claiming statistical significance.
Beta error: It is the probability of a Type II error in any hypothesis test — incorrectly concluding no statistical significance. (1 — Beta is power.)
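The meaning of the alpha error can be checked by simulation: when the null hypothesis is actually true, a test run at alpha = 0.05 should reject about 5% of the time. A stdlib-only sketch (sample size and trial count are arbitrary):

```python
import random
import statistics

random.seed(0)

def one_test(n=50):
    """Draw a sample from N(0, 1), so the null (mean = 0) is true,
    then run a two-sided z-test with known sigma = 1."""
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = statistics.fmean(sample) / (1 / n ** 0.5)
    return abs(z) > 1.96  # reject at alpha = 0.05

false_positives = sum(one_test() for _ in range(2000))
rate = false_positives / 2000
# The long-run rejection rate under a true null approximates alpha (~5%).
print(rate)
```

The simulated rejection rate hovers near 0.05, which is exactly what "the probability of incorrectly claiming statistical significance" means in practice.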
In almost all textbooks and software packages, the population regression coefficients are denoted by beta. Like all population parameters, they are theoretical — we don’t know what they are. The regression coefficients we estimate from our sample are statistical estimates of those parameter values. Most parameters are denoted with Greek letters and statistics with the corresponding Latin letters.
Coefficient alpha is another, totally different use of alpha. Also known as Cronbach's alpha, it measures the reliability (internal consistency) of a scale.
I hope that you all have enjoyed reading this article and would like to share some interesting terminologies and statistical terms with me, as well!
If you work with big data or artificial intelligence at all, then you’re dealing with data science. Data science is all about extracting meaning from data — which is the whole point of having data in the first place, right?
In this post, we’ll take a look at data science from a variety of angles so that you can understand it whether you’re a data science aficionado or are putting your data science lab coat on for the first time. We’ll look at some of the most popular and helpful data science articles on DZone, some outside resources on the topic, and some DZone publications that can help you learn more.
Check out some of the top data science articles on DZone to learn about data science from a developer’s perspective, how to become a data scientist, and what data science looks like for the modern data architecture. These articles are best read in order!
An Introduction to Data Science by Shiv Shet. Let’s start with an introduction to data science. Learn from Shiv Shet how to truly leverage data science for your data analytics needs.
The Developer’s Guide to Data Science by Sander Mak. Learn why enterprise developers need to familiarize themselves with data science and see how to go from developer to data scientist.
10 Steps to Become a Data Scientist in 2018 by Vijay Laxmi. We’re still kicking off 2018, so it’s not too late to benefit from this article about how to become a data scientist this year!
Data Science for the Modern Data Architecture by Vinay Shukla. Time to get a little more advanced and consider data science through an architectural lens.
Data Science: Viable Career? (2017 and Beyond) by John Sonmez. Lastly, is data science a viable career opportunity? Check out one perspective in this video and transcript.
PS: Are you interested in contributing to DZone? Check out our brand new Bounty Board, where you can apply for specific writing prompts and win prizes!
Let’s journey outside of DZone and check out some recent news, conferences, and more that should be of interest to data science newbies and experts alike.
Data Science Tutorial for Beginners: What Is Data Science? In the first video of this series, get an in-depth introduction to data science and learn about unstructured and structured data, jobs in the data science world, and more.
The Data Science Conference. Taking place in Chicago, IL, US from March 3-4, 2018, The Data Science Conference brings together anyone who’s interested in data science, big data, data mining, machine learning, artificial intelligence, and/or predictive modeling for a few days of data science fun.
The Data Skeptic podcast. This weekly podcast offers a good introduction for those interested in data science and machine learning. Topics include predicting Alzheimer's, neuroimaging, and exponential time algorithms.
The DZone Guide to Big Data: Data Science and Advanced Analytics. Explore the critical capabilities in next-generation self-service data preparation tools and dive deep into applications and languages affiliated with big data.
Machine Learning: Patterns for Predictive Analytics. This DZone Refcard covers machine learning for predictive analytics, explains setting up testing and training data, and offers machine learning model snippets.
In this article, I will show you how to profile data with Oracle Data Mining.
Data profiling is a very important step in making sense of a dataset. Before we start a job, we need to know what kinds of columns the data contains, what min/max values they span, how many distinct values they hold, and how many null records there are; all of this directly influences how we analyze and use the information. For this reason, data profiling is an essential step to take before processing the data.
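As a sketch of what a profiling pass computes, here is a minimal Python version over a made-up table (Oracle's Explore Data node, used below, reports considerably richer statistics than this):

```python
def profile(rows, columns):
    """Compute per-column min, max, distinct count, and null count: the
    basic statistics a data-profiling step reports before modeling."""
    stats = {}
    for i, col in enumerate(columns):
        values = [r[i] for r in rows]
        non_null = [v for v in values if v is not None]
        stats[col] = {
            "min": min(non_null) if non_null else None,
            "max": max(non_null) if non_null else None,
            "distinct": len(set(non_null)),
            "nulls": values.count(None),
        }
    return stats

rows = [(1, "HR", 3500), (2, "IT", None), (3, "HR", 4200)]
print(profile(rows, ["id", "dept", "salary"]))
```

Even this toy version answers the questions above: the range of each column, its cardinality, and how many nulls you must handle before modeling.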
There are multiple methods for data profiling. You can write your own scripts to accomplish it, or you can use built-in tools. I will do it with Oracle Data Mining, one of the methods I can easily use from within Oracle SQL Developer. First, let's open the Oracle Data Mining window in Oracle SQL Developer.
We are now creating a new project and workflow through our existing ODM link.
We have created our new workflow. Now, drag and drop the components we need to profile the data from the Toolbox to the workflow.
First, let's put the Data Source component in the workflow and configure its settings. (I will point the Data Source component at the dataset to be profiled.)
I gave the dataset (HR_DATA) to the Data Source component. After selecting this dataset, column and data information about the set is listed in the grid at the bottom of the window.
Now we will drag and drop the Explore Data component onto the workflow and connect it to the Data Source component.
Now that we have our connection, let’s run the workflow.
Running the workflow may take some time, depending on the size of the data we are examining. We can follow the process through the screen. Once this is done, right-click on the Explore Data component to observe the results and call View Data.
The result screen contains a separate row for each column in the dataset. We can observe that various statistics and calculations are made for each column, and we can examine these results on this screen column by column.
In addition, the distribution of values in each column can be displayed visually via histograms by opening the pop-up window on the Statistics tab.
Through these histograms, we can easily observe what values the columns take and how those values are distributed.
As this example shows, data profiling yields useful values. I will analyze these values to make the information even more beneficial.
The columnar storage technique has proven effective in many computing scenarios and is commonly adopted by data warehousing products; in the industry, it is practically a synonym for high performance.
Yet the strategy has its weaknesses. A Google search shows that criticisms of it mainly concern data modification. There are few discussions of its application to read-only data analysis and computing, which is what the following sections address.
The idea behind columnar storage is simple. To retrieve data stored row-wise on disk, all columns must be scanned. As a result, a query involving only a few columns will retrieve a lot of irrelevant data. Plus, disk response time suffers as the head jumps between tracks picking up data. The columnar storage strategy enables the retrieval of only the useful columns, significantly reducing the amount of data to be accessed on most occasions. For big data computing in particular, disk scanning is time-consuming, and a decrease in the amount of data to be accessed greatly speeds up the overall computing process. Moreover, chances are that there are a lot of duplicate values in one column, and it's easier to compress data into a smaller size under columnar storage to enhance performance.
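A back-of-the-envelope illustration of the reduced disk access, using made-up table dimensions:

```python
# Hypothetical table: 1,000 records, each with ten 8-byte numeric columns.
n_records, n_cols, col_width = 1000, 10, 8
record_size = n_cols * col_width

# A query that reads only 2 of the 10 columns:
row_store_bytes = n_records * record_size      # row store scans whole records
col_store_bytes = n_records * col_width * 2    # column store scans 2 columns

print(row_store_bytes, col_store_bytes)  # 80000 16000
```

Reading 2 of 10 equal-width columns touches one fifth of the bytes, and column-level compression typically widens that gap further.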
According to its working principle, columnar storage increases performance by reducing disk access. But it can’t reduce the number of computations. If data is already in memory, the technique is unnecessary. Structured data processing is row-based, and storing in-memory data column-wise complicates the construction of records and hinders performance. Except for professional vector computing (which is column-based and commonly used in data mining), columnar storage isn’t the right choice for handling relational-style in-memory computing (including in-memory databases).
Columnar storage stores data column by column, making simultaneous access to multiple columns random and discontinuous; the more columns accessed, the more random the access. Random access seriously affects HDD performance due to time-consuming head jumps. When many columns are being accessed, or the total number of columns is small, performance can be even worse than with row-based storage, which reads data continuously. Concurrent tasks (and parallel processing) further exacerbate the random access problem. Though disk retrieval also suffers from head jumps under concurrency with row-based storage, the degree is much smaller: with columnar storage, one concurrent task generates multiple concurrent access requests (one per involved column), while with row-based storage, one concurrent task generates only one access request.
One solution is to increase the buffer area for storing the retrieved data to reduce the proportion of the seek time. But setting the buffer area for each column consumes a large amount of memory space if there are a lot of involved columns. The other solution is to add more hard disks and store columns on different hard disks. As columnar storage is generally applied to scenarios involving a large number of columns, the number of columns is usually far more than that of hard disks that can be installed on a machine and disk access conflict often occurs.
Yet columnar storage is friendlier to SSDs, which don't have the seek-time problem.
The commonly used indexing technique is intended to locate desired records in a large dataset according to key values. The nature of indexing is sorting: an index stores ordered key values along with the addresses of their corresponding records in the original data table. With row-based storage, the position of a record can be represented by one number. But with columnar storage, each column of a record has its own position and, in principle, each should be recorded, making the index table almost as large as the original table. Accesses become more complicated and space consumption is large; it is little better than copying the original table and sorting the copy.
You might say that we can store only the address of one column for a record and then calculate the addresses of the rest of the columns. Sizes of field values of certain data types, like strings, are unfixed, and the values of other data types, like integer and date, which generally have fixed sizes, may become unfixed thanks to compression techniques often used with columnar storage. If all field values are stored in fixed sizes, the index becomes simple and access speeds up, but the data amount increases, leading to a less cost-effective traversal, which is the operation for which the columnar storage strategy is mainly used in the first place.
A commonly used real-life method is dividing the original table into a number of data blocks. For each block, data is stored in a column-based format. The index is created on blocks to quickly locate the block where the data sits. Then, only in-block data scanning is needed.
This block-based index has lower performance than the row-based index because of the extra in-block scanning. If the original data table is ordered by index key values (usually, the index key is the original table’s primary key), it’s easy to locate the blocks (typically, only one block) holding the target data with bearable performance loss. This index type applies to scenarios where records are located according to unique key values. If the original data table is unordered by the index key, the block-based index is useless. It’s possible that the target data falls in nearly all blocks, causing a similar performance to the full table scanning.
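The block-based (zone-map-style) index described above can be sketched as follows, with made-up keys and block size:

```python
def build_block_index(sorted_keys, block_size):
    """Zone-map-style index: record the min and max key of each block."""
    index = []
    for start in range(0, len(sorted_keys), block_size):
        block = sorted_keys[start:start + block_size]
        index.append((block[0], block[-1], start // block_size))
    return index

def locate(index, key):
    """Return block numbers whose [min, max] range may contain the key."""
    return [b for lo, hi, b in index if lo <= key <= hi]

keys = list(range(100))              # table ordered by the index key
idx = build_block_index(keys, 25)    # 4 blocks of 25 keys each
print(locate(idx, 57))  # [2]: only one block needs an in-block scan
```

Because the keys are sorted, the lookup narrows to a single block; if the table were unordered by this key, nearly every block's [min, max] range could match, reproducing the full-scan behavior the text warns about.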
Multithreaded parallel processing is a must to make full use of the capabilities of multi-core CPU. Having data segmented is necessary for performing parallel processing.
There are two basic requirements for data segmentation:
Almost equal data amount in each segment (making balanced task allocation among threads).
Dynamic segmentation (because the number of threads can’t be predicted).
It’s easier to segment data with row-based storage. We can make a mark at the end of each record (or of every N records) to evenly divide data into K segments according to the number of bytes. The ending mark in each segment is the starting point of the next segment. This is unfeasible with columnar storage because the sizes of field values may be unfixed. Segmenting points within columns may not fall in the same record, resulting in mismatched data.
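The row-based byte-count segmentation described above can be sketched like this (record sizes are made up, and each record stands in for a marked row):

```python
def segment(records, k):
    """Split variable-length records into k contiguous segments of roughly
    equal total byte counts, cutting only at record boundaries."""
    total = sum(len(r) for r in records)
    target = total / k
    segments, current, acc = [], [], 0
    for r in records:
        current.append(r)
        acc += len(r)
        # Close a segment once the cumulative bytes pass the next threshold.
        if acc >= target * (len(segments) + 1) and len(segments) < k - 1:
            segments.append(current)
            current = []
    segments.append(current)
    return segments

records = [b"x" * n for n in (10, 30, 20, 20, 10, 30)]  # 120 bytes total
parts = [sum(len(r) for r in seg) for seg in segment(records, 3)]
print(parts)  # [40, 40, 40]
```

Because cuts land only on record boundaries, every segment holds whole records; columnar storage has no such per-record boundary spanning all columns, which is exactly the mismatch problem described above.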
Block-based segmentation strategy is a common solution to the columnar storage segmentation problem. The unit of segmentation is a block where data won’t be further divided to be processed in parallel. Here is a dilemma. On one hand, the number of blocks should be large enough to enable a dynamic segmentation (because you can’t get ten segments from only five blocks). As a contemporary computer generally has many CPU cores, nearly 100 blocks are needed to achieve a flexible, balanced segmentation. On the other hand, too many blocks means that the columnar data is physically divided into many discontinuous blocks, making the traversal code very complicated and causing the retrieval of useless data between two blocks.
For an HDD, there's also the seek-time problem, which becomes more serious as data is divided into more blocks. Only when the columnar data in a block occupies much more space than the buffer area used for storing retrieved data do the time spent retrieving useless data and the seek time become relatively small. This requires a sufficiently large number of records in each block. In other words, the data amount is crucial for processing data in columnar storage format in parallel. With an HDD (including a disk array containing a large group of HDDs), there should normally be at least one billion records in one table on a single computer, with the data amount reaching above 100G; parallel processing won't noticeably increase performance if the data amount isn't large enough. This is the problem faced by multidimensional analysis, for which the columnar storage strategy is otherwise particularly suitable. Besides, the size of each block is pre-determined, but neighboring blocks can't be physically combined as data is continuously added. Consequently, the number of blocks is ever-increasing, bringing management challenges that demand a special scalable space to store the index of the blocks.
These problems, however, don't negate the great advantages of columnar storage in read-only computing scenarios. But any impetuous use of the technique should be avoided. For data warehousing products, it's appropriate to let the system administrator or user decide whether columnar storage should be used, how to divide columnar data into blocks, which part of the data should be stored in columnar format, and which part needs to be handled with both row-based and columnar storage to achieve higher performance through data redundancy.
A graph database is a data management system software. The building blocks are vertices and edges. To put it in a more familiar context, a relational database is also a data management software in which the building blocks are tables. Both require loading data into the software and using a query language or APIs to access the data.
Relational databases boomed in the 1980s. Many commercial companies (e.g. Oracle, Ingres, IBM) backed the relational model (tabular organization) of data management. In that era, the main data management need was to generate reports.
Graph databases didn't show a clear advantage over relational databases until recent years, when frequent schema changes, explosive volumes of data, and real-time query response requirements made people realize the advantages of the graph model.
Commercial software companies have been backing this model for many years, including TigerGraph (formerly named GraphSQL), Neo4j, and DataStax. The technology is disrupting many areas, such as supply chain management, e-commerce recommendations, security, fraud detection, and many other areas in advanced data analytics.
Here, we discuss the major advantages of using graph databases from a data management point of view.
This means very clear, explicit semantics for each query you write. There are no hidden assumptions, unlike relational SQL, where you have to know how the tables in the FROM clause will implicitly form Cartesian products.
They have superior performance for querying related data, big or small. A graph is essentially an index data structure. It never needs to load or touch unrelated data for a given query. They’re an excellent solution for real-time big data analytical queries.
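The "never touch unrelated data" property can be illustrated with a plain adjacency-list traversal (the graph and names below are hypothetical, standing in for index-free adjacency in a real graph database):

```python
from collections import deque

# Hypothetical social graph as adjacency lists: each vertex points
# directly at its neighbors, so traversal needs no table-wide joins.
graph = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": ["erin"],
    "erin": [],
}

def neighbors_within(graph, start, hops):
    """Breadth-first traversal that touches only reachable vertices."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen - {start}

print(sorted(neighbors_within(graph, "alice", 2)))
# ['bob', 'carol', 'dave']
```

The equivalent relational query would need one self-join per hop over the whole edge table; the traversal above only ever follows edges out of vertices it has already reached.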
Graph databases solve problems that are impractical, if not impossible, for relational queries. Examples include iterative algorithms such as PageRank, gradient descent, and other data mining and machine learning algorithms. Research has shown that some graph query languages are Turing complete, meaning that you can write any algorithm in them. There are many query languages on the market with limited expressive power, though. Make sure you ask many hypothetical questions to see if a language can answer them before you lock in.
Graph databases can perform real-time updates on big data while supporting queries at the same time. This is a major drawback of existing big data management systems such as Hadoop HDFS, which was designed for data lakes, sequential scans, and appending new data (no random seeks), an architectural choice that ensures fast scan I/O over entire files. The assumption there is that any query will touch the majority of a file; graph databases only touch relevant data, so a sequential scan is not the optimization they rely on.
Graph databases offer a flexible online schema evolvement while serving your query. You can constantly add and drop new vertex or edge types or their attributes to extend or shrink your data model. It’s so convenient to manage explosive and constantly changing object types. The relational database just cannot easily adapt to this requirement, which is commonplace in the modern data management era.
Graph databases can perform group-by aggregate queries that are unimaginable in relational databases. Due to the tabular model restriction, aggregate queries in a relational database are greatly constrained by how data is grouped together. In contrast, graph models are more flexible for grouping and aggregating relevant data. See this article on the latest expressive power of aggregation for graph traversal. I don't think relational databases can do this kind of flexible aggregation on selective data points. (Disclaimer: I worked on commercial relational database kernels for a decade: Oracle, MS SQL Server, popular Apache open-source platforms, etc.)
Graph databases can combine multiple dimensions to manage big data, including time-series, demographic, and geo dimensions, with a hierarchy of granularity on each. Think of an application in which we want to segment a population by both time and geography. With a carefully designed graph schema, data scientists and business analysts can conduct virtually any analytical query on a graph database, a capability traditionally accessible only through low-level programming languages such as C++ and Java.
Graph databases make great AI infrastructure because their well-structured relational information between entities allows one to infer indirect facts and knowledge. Machine learning experts love them: they provide rich information and convenient data access that other data models can hardly match. For example, the Google Expander team has used a graph model for smart messaging technology, and Google created its Knowledge Graph to understand humans better; many more advances in knowledge inference are being built on graphs. The keys for a graph database to succeed as a real-time AI data infrastructure are:
Support for real-time updates as fresh data streams in
A highly expressive and user-friendly declarative query language to give full control to data scientists
Support for deep-link traversal (>3 hops) in real-time (sub-second), just like human neurons sending information over a neural network; deep and efficient
Scale out and scale up to manage big graphs
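To make the deep-link traversal point above concrete, here is a minimal sketch of a k-hop neighborhood expansion over an adjacency list. Python is used purely for illustration (the graph, vertex names, and hop counts are made up); a real graph database would execute this natively and in parallel.

```python
from collections import deque

def k_hop_neighbors(graph, start, k):
    """Return all vertices reachable from `start` within k hops (BFS)."""
    seen = {start}
    frontier = deque([(start, 0)])
    result = set()
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                result.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return result

# A toy 4-level chain: a deep-link (>3 hop) query reaches "e" only at depth 4.
graph = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["e"]}
print(k_hop_neighbors(graph, "a", 3))  # {'b', 'c', 'd'}
print(k_hop_neighbors(graph, "a", 4))  # {'b', 'c', 'd', 'e'}
```

The point of the sketch is the cost model: each extra hop multiplies the frontier, which is why sub-second deep-link traversal is a meaningful benchmark for a graph engine.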
In conclusion, native graph databases offer many advantages that traditional relational databases cannot work around. As with any technology replacing an older one, however, there are still obstacles to adoption. One is that there are fewer qualified graph developers in the job market than SQL developers. Another is the lack of a standard graph query language. Marketing hype and incomplete offerings have also led to subpar performance and usability, which slows graph model adoption in enterprises.
In this article, I will do market basket analysis with Oracle data mining.
Data science and machine learning are very popular today, but these subjects require extensive knowledge and application expertise. Various companies have developed products and software to address this need. In Oracle, methods and algorithms for solving these problems are presented to users through the DBMS_DATA_MINING package. With this package, we can create models such as clustering, classification, regression, anomaly detection, feature extraction, and association, evaluate how well they perform, and apply the results to our business scenarios.
The DBMS_DATA_MINING package is not installed by default on the Oracle database, so it is necessary to install it first. You can set up your database with Oracle data mining by following this link.
With the installation of the Oracle data mining package, three new dictionary views become available:
SELECT * FROM ALL_MINING_MODELS;
SELECT * FROM ALL_MINING_MODEL_SETTINGS;
SELECT * FROM ALL_MINING_MODEL_ATTRIBUTES;
ALL_MINING_MODELS contains information about all the models. ALL_MINING_MODEL_SETTINGS contains the settings used to build each model, and ALL_MINING_MODEL_ATTRIBUTES contains the columns (attributes) each model uses.
Now, let’s prepare an easily understood data set to do market basket analysis.
Let’s first examine the ONLINE_RETAIL dataset.
| Column Name | Description | Data type |
|---|---|---|
| InvoiceNo | Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter ‘c’, it indicates a cancellation. | String |
| StockCode | Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product. | String |
| Description | Product (item) name. Nominal. | String |
| Quantity | The quantities of each product (item) per transaction. Numeric. | Numeric |
| InvoiceDate | Invoice date and time. The day and time when each transaction was generated. | Date |
| UnitPrice | Unit price. Numeric, product price per unit in sterling. | Numeric |
| CustomerID | Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer. | Numeric |
| Country | Country name. Nominal, the name of the country where each customer resides. | String |
Now that we have reviewed the details of our dataset, let’s load the ONLINE_RETAIL data we downloaded into our Oracle database.
First, create the Oracle table into which we will load the dataset (CSV) that we downloaded.
CREATE TABLE ONLINE_RETAIL
(
  INVOICENO    VARCHAR2(100 BYTE),
  STOCKCODE    VARCHAR2(100 BYTE),
  DESCRIPTION  VARCHAR2(200 BYTE),
  QUANTITY     NUMBER,
  INVOICEDATE  DATE,
  UNITPRICE    NUMBER,
  CUSTOMERID   NUMBER,
  COUNTRY      VARCHAR2(100 BYTE)
);
Now that we have created our table, we will load the dataset we downloaded as CSV into the table; we have multiple methods to do this:
Using Oracle External Table
Using Oracle SQL Loader
Using SQL-PL/SQL editors (Oracle SQL Developer, Toad, PL/SQL Developer…)
I will load the dataset with my editor of choice, Toad. With Toad, you can perform the data loading process by following the menu path below.
Database > Import > Import Table Data
Toad is a paid tool, so you may not be using it. Similar import features are available in other editors, so you can easily do the same elsewhere. For example, Oracle SQL Developer is free and lets you load data in much the same way.
SELECT * FROM ONLINE_RETAIL;
We have completed the data loading process.
When we observe the data, we see the details of the purchases made by customers. Each row tells us which product was bought, in what quantity, by whom, on what date, at what price, and from which country. Examining the data more closely, we also see that one or more records can share the same InvoiceNo, so we can think of each InvoiceNo as a basket.
Now let’s observe a sample basket.
SELECT * FROM ONLINE_RETAIL WHERE INVOICENO = '536368';
Yes, we saw an example basket.
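The “invoice = basket” idea can be sketched outside the database as well. Here is a minimal Python illustration that groups rows by InvoiceNo; the rows are toy values, not the real dataset, though the column pairing (InvoiceNo, StockCode) matches the two columns our model will actually use later.

```python
from collections import defaultdict

# Each row: (InvoiceNo, StockCode) — made-up sample values.
rows = [
    ("536368", "22960"),
    ("536368", "84029G"),
    ("536369", "22960"),
    ("536369", "21730"),
]

# Group line items into baskets keyed by invoice number.
baskets = defaultdict(list)
for invoice_no, stock_code in rows:
    baskets[invoice_no].append(stock_code)

print(dict(baskets))
# {'536368': ['22960', '84029G'], '536369': ['22960', '21730']}
```

This grouping is exactly what the case_id column does for us when we build the association model below.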
Now, let’s analyze these baskets to determine which products are most often sold together. Before we start the association analysis, I will give some background on the algorithm used, for a better understanding of the subject.
The DBMS_DATA_MINING package performs association analysis with the APRIORI algorithm. To use this algorithm, we need to define some parameters. The parameters and their default values are as follows:
You can review the article at the link to better understand the algorithm parameters and what they mean. Now we will read the default model settings, create a table from them, and update it with our algorithm parameters.
CREATE TABLE SETTINGS_ASSOCIATION_RULES AS
  SELECT *
    FROM TABLE (DBMS_DATA_MINING.GET_DEFAULT_SETTINGS)
   WHERE SETTING_NAME LIKE 'ASSO_%';

BEGIN
  UPDATE SETTINGS_ASSOCIATION_RULES
     SET SETTING_VALUE = 3
   WHERE SETTING_NAME = DBMS_DATA_MINING.ASSO_MAX_RULE_LENGTH;

  UPDATE SETTINGS_ASSOCIATION_RULES
     SET SETTING_VALUE = 0.03
   WHERE SETTING_NAME = DBMS_DATA_MINING.ASSO_MIN_SUPPORT;

  UPDATE SETTINGS_ASSOCIATION_RULES
     SET SETTING_VALUE = 0.03
   WHERE SETTING_NAME = DBMS_DATA_MINING.ASSO_MIN_CONFIDENCE;

  INSERT INTO SETTINGS_ASSOCIATION_RULES
       VALUES (DBMS_DATA_MINING.ODMS_ITEM_ID_COLUMN_NAME, 'STOCKCODE');

  COMMIT;
END;
Yes, we have created the algorithm parameter table. Now we can move on to the step of creating our model.
CREATE VIEW VW_ONLINE_RETAIL AS
  SELECT INVOICENO, STOCKCODE
    FROM ONLINE_RETAIL;

BEGIN
  DBMS_DATA_MINING.CREATE_MODEL (
    model_name          => 'MD_ASSOC_ANLYSIS',
    mining_function     => DBMS_DATA_MINING.ASSOCIATION,
    data_table_name     => 'VW_ONLINE_RETAIL',
    case_id_column_name => 'INVOICENO',
    target_column_name  => NULL,
    settings_table_name => 'SETTINGS_ASSOCIATION_RULES');
END;
Now that we have our model, let’s look at its details in the dictionary views.
SELECT MODEL_NAME,
       ALGORITHM,
       COMMENTS,
       CREATION_DATE,
       MINING_FUNCTION,
       MODEL_SIZE
  FROM ALL_MINING_MODELS
 WHERE MODEL_NAME = 'MD_ASSOC_ANLYSIS';
SELECT SETTING_NAME, SETTING_VALUE
  FROM ALL_MINING_MODEL_SETTINGS
 WHERE MODEL_NAME = 'MD_ASSOC_ANLYSIS';
Let’s look at the output of the analysis.
SELECT RULE_ID,
       B.ATTRIBUTE_SUBNAME ANTECEDENT_STOCKCODE,
       C.ATTRIBUTE_SUBNAME CONSEQUENT_STOCKCODE,
       RULE_SUPPORT,
       RULE_CONFIDENCE
  FROM TABLE (DBMS_DATA_MINING.GET_ASSOCIATION_RULES ('MD_ASSOC_ANLYSIS')) A,
       TABLE (A.ANTECEDENT) B,
       TABLE (A.CONSEQUENT) C;
As a result of the parameters we passed to the algorithm, we have found and displayed the products that frequently sell together. Different analyses are possible by changing the algorithm’s parameters.
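To interpret the RULE_SUPPORT and RULE_CONFIDENCE columns in the output, recall their definitions: support is the fraction of baskets containing all items of a rule, and confidence is the support of the full rule divided by the support of its antecedent. A toy Python illustration (made-up baskets, not the real dataset):

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= b for b in baskets) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """P(consequent | antecedent) = support(A and C) / support(A)."""
    return support(baskets, antecedent | consequent) / support(baskets, antecedent)

baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Rule {bread} -> {milk}: support 2/4 = 0.5, confidence (2/4)/(3/4) = 2/3.
print(support(baskets, {"bread", "milk"}))                  # 0.5
print(round(confidence(baskets, {"bread"}, {"milk"}), 2))   # 0.67
```

This is exactly what the ASSO_MIN_SUPPORT and ASSO_MIN_CONFIDENCE settings threshold: with both set to 0.03, APRIORI discards any rule seen in fewer than 3% of baskets or with less than 3% confidence.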