Using Big Data to Spot Extremists Online Before They Post Anything Dangerous

There is a growing demand among politicians and the wider public for social networks to identify and remove hate speech from their websites. Doing so is often easier said than done, but a new study from the Massachusetts Institute of Technology highlights how extremists can potentially be identified, even before they post any threatening content.

The scale of the problem was highlighted in 2016 when Twitter revealed that it had shut down around 360,000 ISIS-related accounts. Identifying most such accounts depends on users reporting them, and there is little that can be done to prevent a user from simply creating another account.

Original Link

Reporting and Analysis With Elasticsearch

Since the popularity of NoSQL and Big Data exploded in recent years, keeping up with the latest trends in databases, search engines, and business analytics is vital for developers.

And it’s hard not to be overwhelmed by the number of solutions available on the market: Amazon CloudSearch, Elasticsearch, Swiftype, Algolia, Searchify, Solr, and others.

Original Link

Easing the Adoption of A Customer-Centric Product Development Process with Data

In search of a sort of customer-centric product development Nirvana (and the organizational tenets that allow it to flourish) known as high-tech anthropology, executives are willing to pay upwards of $20,000 to spend time with the founders of Menlo Innovations, according to an article in Forbes. The Michigan-based software design consultancy has achieved Apple-like mystique with its unique philosophy that guides both how it works and the work it completes for its clients.

In fact, according to the Forbes coverage, a full 10 percent of Menlo Innovations’ $5 to $6 million in anticipated revenue for 2018 will come from the fees it charges for tours and consulting.

Original Link

The Skills That Data Analysts Need to Master

The Basics

1. The first is Excel. This seems very simple, but, in fact, it's not. Excel can not only handle simple two-dimensional tables and complex nested tables, but also create line charts, column charts, bar charts, area charts, pie charts, radar charts, combo charts, and scatter charts.

2. Master SQL on SQL Server or Oracle. Even though you are a business analyst who can rely on IT and IT tools (such as a multi-dimensional BI analysis model), sometimes you still can't get the data you want. Learning to write nested SQL statements, including join, group by, order by, distinct, sum, count, average, and the various statistical functions, can be very helpful (a brief sketch follows this list).

3. Master visualization tools, such as BI suites like Cognos, Tableau, and FineBI; which one depends on the tools your enterprise uses (I used to use FineBI). Visualization with these tools is very convenient, especially when the analysis report can include these images. These skills will definitely attract the attention of senior leaders, as they allow them to understand, at a glance, the essence of the business. In addition, as a professional analyst, using a multi-dimensional analysis model (a Cube), you can easily and efficiently customize reports.
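As a rough illustration of the kind of nested, grouped query described in point 2, here is a minimal Java/JDBC sketch. The connection URL, credentials, and the orders and regions tables are hypothetical placeholders, not taken from the article.

import java.sql.*;

public class SalesByRegion {
    public static void main(String[] args) throws SQLException {
        // Hypothetical connection details; replace with your own server and credentials
        String url = "jdbc:sqlserver://localhost;databaseName=sales";
        String sql =
            "SELECT r.region_name, COUNT(DISTINCT o.customer_id) AS customers, SUM(o.amount) AS total " +
            "FROM orders o JOIN regions r ON o.region_id = r.id " +
            "WHERE o.amount > (SELECT AVG(amount) FROM orders) " +  // nested subquery
            "GROUP BY r.region_name " +
            "ORDER BY total DESC";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("region_name") + ": "
                        + rs.getInt("customers") + " customers, " + rs.getDouble("total"));
            }
        }
    }
}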

Original Link

Analyze Elasticsearch Data in R

You can access Elasticsearch data with pure R script and standard SQL on any machine where R and Java can be installed. You can use the CData JDBC Driver for Elasticsearch and the RJDBC package to work with remote Elasticsearch data in R. By using the CData Driver, you are leveraging a driver written for industry-proven standards to access your data in the popular, open-source R language. This article shows how to use the driver to execute SQL queries to Elasticsearch and visualize Elasticsearch data by calling standard R functions.

Install R

You can match the driver’s performance gains from multithreading and managed code by running the multithreaded Microsoft R Open or by running open R linked with the BLAS/LAPACK libraries. This article uses Microsoft R Open 3.2.3, which is preconfigured to install packages from the Jan. 1, 2016 snapshot of the CRAN repository. This snapshot ensures reproducibility.

Original Link

Forget About the 10x Developer, Focus on 3x Instead

I am going to open up a controversial subject, the 10x developer myth.

This subject has been debated by the industry for decades, so why bring it up again?

Original Link

BI Strategy Beyond Excel

In a data-driven world, it’s odd that gut instinct still plays such a big part in business decisions. But what isn’t surprising is that, not only do businesses have more data than they can handle, but they’re also using a tool never designed to gather BI (Business Intelligence) in the first place.

That tool is Excel; approximately "1 in 5 businesses are using spreadsheets as the main tool to communicate data internally," according to Bernard Marr (Forbes). Excel is a great tool for certain business requirements, but gathering and organizing corporate data isn't among them.

Original Link

Visualizing the State of the Web

Recently, the folks at dev.to conducted their State of the Web Survey, which was completed by 1,899 respondents. The results were shared online and a call to arms was put forth: State Of The Web Data – Call For Analysis! I took up the challenge to visualize this data using Kendo UI: the results are here.

In this blog post, I'll show you how I built this page. In particular, I'll focus on the two sections of the survey that I found most interesting: a set of opinion statements and a series of yes/no questions. Both were built using Kendo UI, which provides the building blocks for creating rich data visualizations. By the end, I think you'll agree that it's fun to paint outside the lines.

Original Link

Merging Django ORM With SQLAlchemy for Easier Data Analysis

Development of products with the Django framework is usually easy and straightforward: great documentation, many tools out of the box, plenty of open-source libraries, and a big community. Django ORM takes full control of the SQL layer, protecting you from mistakes and the underlying details of queries, so you can spend more time designing and building your application structure in Python code. However, sometimes such behavior may hurt; for example, when you're building a project related to data analysis. Building advanced queries with Django is not very easy; they are hard to read in Python, and it is hard to understand what is going on at the SQL level without logging or printing the generated SQL queries somewhere. Moreover, such queries may not be efficient enough, which will hit you back when you load more data into the DB to play with. At some point, you find yourself doing too much raw SQL through the Django cursor, and that is the moment when you should take a break and look at another interesting tool, one that sits right between the ORM layer and the layer of raw SQL queries.

As you can see from the title of the article, we successfully mixed Django ORM and SQLAlchemy Core together, and we're very satisfied with the results. We built an application that helps analyze data produced by EMR systems by aggregating the data into charts and tables, scoring by throughput, efficiency, and staff cost, and highlighting outliers, which allows clinics to optimize their business processes and save money.

Original Link

Customize Sorting Order With RuleBasedCollation in Solr

Short Titles of Old English Texts are made up of abbreviations, numbers, Roman numerals, and non-English letters. The built-in sort functions of programming languages do not produce the desired result. Therefore, we have an in-house Perl script, oest_sort, to sort the short titles. When indexing the Old English texts in Apache Solr, in order to sort the results in the desired order, I added a seq field that assigns a number to each short title according to the oest_sort result. However, this solution isn't ideal. Recently, I settled on RuleBasedCollation for sorting text with custom rules.

Short Title Sorting Rules

Original Link

Executive Briefing: What Is Fast Data and Why Is it Important? [Video]

from Lightbend

Streaming data systems, so-called Fast Data, promise accelerated access to information, leading to new innovations and competitive advantages. These systems, however, aren't just faster versions of Big Data; they force architectural changes to meet new demands for reliability and dynamic scalability, much like microservices.

This means new challenges for your organization. Whereas a batch job might run for hours, a stream processing application might run for weeks or months. This raises the bar for making these systems resilient against traffic spikes, hardware and network failures, and so forth. The good news is that there is a strong history of facing these demands in the world of microservices.

Original Link

Five Reasons Why You Are Not Data-Driven

This week, I spent a lot of time with companies talking about data and using data for a variety of purposes, ranging from improved decision making to machine learning and deep learning systems. All the companies I talk to have tons of data in their archives and often generate a lot of data in real time or through batched uploads.

However, although all companies claim to be data-driven in their decision processes and ways of working, practice often shows a different reality. When reflecting on my experiences with a variety of companies, I realized that there are at least five reasons why companies are not as data-driven as they think.

Original Link

Designing Better Dashboards for Your Data

The way you structure a dashboard makes a critical difference in how effective it will be. Having the right data is important, but, to the viewer, it is all about the organization and presentation of the data. In this article, we’ll take a look at a few different ways to organize a dashboard to meet the user’s requirements.

Dashboard

Original Link

Is Self-Service Big Data Possible?

By now, we all know about and are experiencing the rise in the volume of data generated and available to an organization, and the issues it can cause. There is little end in sight to this data tsunami, which is largely due to the increased variety of data from mobile, social media, and IoT sources.

So, it's no surprise that organizations find themselves drowning in data. In a recent survey from independent market research firm Vanson Bourne, up to 80 percent of respondents said they believe legacy technology is holding their organization back from taking advantage of data-driven opportunities. The same survey also found that only 50 percent of the data collected is analyzed for business insight. Couple all this with organizations needing insights from their data at a faster and faster rate, and you have a recipe for disaster or, at best, potentially lost revenue.

Original Link

Applying Bayes Theorem to a Big Data World

Bayes Theorem

One of the challenges in analyzing Big Data is, of course, its volume – there is just so much of it. Mix in high velocity, or Fast Data, and the standard analytical methodologies for making sense of it break down, becoming cumbersome and ineffective.

Machine learning techniques that self-adjust and improve over time are a cost-effective approach. Bayes, a machine learning methodology, is an effective tool for classifying or categorizing data as it streams in. It is not dependent on modeling or on managing complex rule sets.
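As a quick refresher on the underlying math, Bayes' theorem updates a prior belief P(A) with new evidence B via P(A|B) = P(B|A) P(A) / P(B). The following minimal Java sketch applies it to a made-up spam-filter example (all probabilities are invented for illustration and are not from the article):

public class BayesExample {
    public static void main(String[] args) {
        // Hypothetical numbers: prior probability that a message is spam,
        // and the likelihood of the word "free" appearing in spam vs. non-spam.
        double pSpam = 0.20;            // P(spam)
        double pWordGivenSpam = 0.60;   // P("free" | spam)
        double pWordGivenHam = 0.05;    // P("free" | not spam)

        // Total probability of the evidence: P("free")
        double pWord = pWordGivenSpam * pSpam + pWordGivenHam * (1 - pSpam);

        // Bayes' theorem: P(spam | "free") = P("free" | spam) * P(spam) / P("free")
        double pSpamGivenWord = pWordGivenSpam * pSpam / pWord;

        System.out.printf("P(spam | \"free\") = %.2f%n", pSpamGivenWord); // prints 0.75
    }
}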

Original Link

Business Intelligence Maturity Model

Scaling Up, Growth, and Digital Transformation Guy

In January 2016, I wrote an article called Innovation Maturity Model. The intention was to provide a framework, or roadmap, for executives to foster innovation among their employees. The intention of this article is the same, but focused on Business Intelligence. I am also keen to highlight the difference between being data-driven and report-driven; meaning, some companies get a report and make some decisions, but there is no experimentation and no pivoting or persevering, and hence no data-driven decisions. Gartner has a great data analytics maturity model that includes "business outcomes, people, skills, processes, data, and technologies." It is based on five levels:

  1. Unaware
  2. Opportunistic
  3. Standards
  4. Enterprise
  5. Transformative


Original Link

Evolving From Descriptive to Prescriptive Analytics, Part 5: The First Project

In our four previous posts (Part 1, Part 2, Part 3, Part 4), we talked about building a team with the proper skills and tools to succeed at data science. Now, it's time to take on our first project. The goal for this project is to obtain real-world experience and to deliver tangible value to the business. Our strategy is to get started with a project even as we're building skills and tool infrastructure, so we'll be more prepared to take on a highly visible project.

We started with three project selection criteria:

Original Link

Evolving From Descriptive to Prescriptive Analytics: Part 4, Eating the Ugly Frogs

In our previous blog posts, we discussed gaining leadership support, acquiring data science skills, and having the tools to manage your data. With this post, we’ll discuss ensuring your data scientists are productive and happy.

What activities keep data scientists happy? A recent CrowdFlower report about data scientists says that mining data for patterns, building models for data, and refining algorithms are the three favorite tasks among data scientists. Most other tasks are not nearly as interesting. We call these other tasks the ugly frogs of data science. Only 19% of data scientists can spend most of their time doing their favorite tasks, while the others spend most of their time on tasks they loathe. Wouldn’t it be nice if the data scientists could be liberated from doing what they like least and allowed to do what they enjoy most?

Original Link

Incorporating ETL Tools Into Your Data Warehousing Strategy

Managing a data warehouse isn’t just about managing a data warehouse, if we may sound so trite. There’s actually a lot more to consider. For example, how data gets into your data warehouse is a whole process unto itself — specifically, what happens to your data while it’s in motion and the forms it must take to become usable.

And that’s where ETL tools come in.

Original Link

The Idea, Part 1: SQL Queries in Pandas Scripting

It is sometimes hard to leave behind the habits you are used to. As a SQL-addicted data analyst, your mind may resist the challenging path of learning new data analysis environments and languages like Python. I have taken many online data science and Python courses; however, I still need to match each SQL query with its Python script in my mind. It is like learning a new language on top of your mother tongue. Consequently, I have decided to prepare a SQL guide for myself and others who are in the same boat as me.

Environment Preparation

In this work, I used Oracle HR example schema data. You can find information about this example data in the Oracle docs.

Original Link

Difference Between Data Science, Data Analytics, and Machine Learning

We all know that Machine Learning, Data Science, and Data Analytics are the future. There are companies that not only help businesses predict future growth and generate revenue but also find applications for data in other fields like surveys, product launches, elections, and more. Stores like Target and Amazon constantly keep track of user data in the form of transactions, which, in turn, helps them improve the user experience and deploy custom recommendations on your login page.

Well, we have discussed the trend, so let's dig a little deeper and explore their differences. Machine Learning, Data Science, and Data Analytics can't be completely separated, as they have origins in the same concepts but have just been applied differently. They all go hand-in-hand with each other, and you'll easily find overlap between them too.

Original Link

Data Silos Are the Greatest Stumbling Block to an Effective Use of Firms’ Data

Greater access to data has given business leaders real, valuable insights into the inner workings of their organizations. Those who have been ahead of the curve in utilizing the right kinds of data for the right purposes have reaped the rewards of better customer engagement, improved decision-making, and a more productive business, whilst those who have lagged behind have found themselves faced with an uphill struggle to compete.

This, however, has only been the first part of the data story. As businesses have begun to recognize the positive impact data could have on how they run their business, they’ve taken a predictable next step: they’re collecting more of it. And lots more.

Original Link

Java Says, Your Data’s Not That Big

Someone recently told me about a data analysis application written in Python. He managed five Java engineers who built the cluster management and pipeline infrastructure needed to make the analysis run in the 12 hours allotted. They used Python, he said, because it was "easy," which it was, if you ignore all the work needed to make it go fast. It seemed pretty clear to me that it could have been written in Java to run on a single machine with a much smaller staff.

One definition of "big data" is "data that is too big to fit on one machine." By that definition, what is "big data" for one language is plain old "data" for another. Java, with its efficient memory management, high performance, and multi-threading, can get a lot done on one machine. To do data science in Java, however, you need data science tools: Tablesaw is an open-source (Apache 2) Java data science platform that lets users work with data on a single machine. It's a dataframe and visualization framework. Most data science currently done in clusters could be done on a single machine using Tablesaw paired with a Java machine learning library like Smile.
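For a flavor of what single-machine exploration with Tablesaw looks like, here is a minimal sketch; the CSV file and column name are hypothetical, and the cross-tabs article later in this digest walks through a fuller example:

import tech.tablesaw.api.Table;

public class SingleMachineExample {
    public static void main(String[] args) throws Exception {
        // Load a CSV file into an in-memory dataframe
        Table orders = Table.read().csv("orders.csv");

        // Basic exploration: size, column types, and a peek at the first rows
        System.out.println(orders.rowCount() + " rows, " + orders.columnCount() + " columns");
        System.out.println(orders.structure());
        System.out.println(orders.first(5));

        // A simple aggregate over a numeric column
        System.out.println("Total amount: " + orders.numberColumn("amount").sum());
    }
}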

Original Link

Using Big Data to Increase Site Visitors

A modern business environment relies heavily on the use of technology. Technology has made big data possible, and companies that leverage this information are able to make more strategic decisions and pursue their business goals in a more reliable and efficient way. Big data is the massive amount of information that flows around the internet. These vast quantities of data hold valuable information about the market, customers, and everything else that inhabits the online world.

In fact, it's estimated that big data will reach 40 zettabytes by 2020. But why do companies analyze this data in the first place? The main reason is to extract valuable information in order to better understand the market they operate in, as well as the customers they're trying to address, so that they can deliver more relevant and personalized experiences to their audience. That being said, here are a few ways to use big data to increase site visitors.

Original Link

Ultimate List Of Big Data Examples in Real Life

Big Data is everywhere these days. In this article, I will give you some awesome real-life big data examples to demonstrate the utility of big data.

Let me start this post off with what big data is…

Original Link

Breaking Into a Data Vault

The term Data Vault evokes an image of a safe and secure place to store your most important, core data assets. A lot of engineering goes into its design and delivery to ensure it does that job. However, there also is another image—a large, steel door—that comes to mind. Vaults are designed to keep people out. Even the owners of the goods stored there must leap through some hoops to get in. In the case of the Data Vault the intention for access is the exact opposite: business people must certainly be able to get easily to the core data of the enterprise. Breaking into a Data Vault should therefore be made extremely easy.

If that odd-sounding statement makes you pause for thought, it’s meant to. (Yes, it’s you I’m talking to: the IT person responsible for the Data Vault project!) Please carefully consider how to make business people feel welcome in the Data Vault—in fact, convince them they own it—before you begin designing, building, operating, and maintaining your Data Vault. The level of comfort of business people with the Data Vault will directly determine your success in this undertaking.

Original Link

Power Is Nothing Without Control

"You can’t control (leave alone improve) what you can’t see." – Me

"Power is nothing without control." – Pirelli

Original Link

The Cold Start Problem

How do you operate a data-driven application before you have any data? This is known as the cold start problem.

We faced this problem all the time when I designed clinical trials at MD Anderson Cancer Center. We used Bayesian methods to create adaptive clinical trial designs, such as trials for determining chemotherapy dose levels. Each patient's treatment assignment would be informed by data from all patients treated previously.

Original Link

Scrum Retrospective 3: Generate Insights

This is the third post of my blog post series about the five phases of a Scrum Retrospective. In this post I cover Phase 3, Generate Insights.

If you haven’t read the previous posts in this series you can start with Phase 1, Setting the stage.

Original Link

Create Beautiful Java Visualizations With Tablesaw’s Plot.ly Wrapper

Creating visualizations is an essential part of data analysis — whether you’re just "looking around" a new dataset or verifying the results of your machine learning algorithms. Unfortunately, Java has fallen behind what the best visualization tools offer. Tablesaw’s new plotting framework provides a platform for creating visualizations in Java for the entire analysis process, from the earliest explorations to the final presentation.

The framework provides a Java wrapper around the Plot.ly open source JavaScript visualization library. Plot.ly is based on the extraordinary D3 (Data-Driven Documents) framework and is certainly among the best open-source visualization packages available in any language. Plot.ly is so good, in fact, it is widely used in languages other than JavaScript, such as Python and R, which already had solid options for visualization.

Original Link

Scrum Retrospective 2 — Gather Data

This is the second post of my blog post series about the five phases of a Scrum Retrospective. In this post, I cover the most crucial ideas for Stage 2 — Gather Data.

If you haven’t read the previous post in this series you can find it here: Stage 1 — Setting the stage.

Original Link

Java Data Analysis: Using Cross Tabs

One of the most common data analysis operations is looking at how frequently certain kinds of observations appear in a data set.  You can use cross-tabulations, also known as contingency tables, to perform this task.

Here, we show you how to compute cross-tabs easily in Java using the Tablesaw open source (Apache 2) library. Tablesaw is a data-frame and visualization library that supports one and two dimensional cross-tabs. For this, the Table class contains the methods you need.

Example

In the example below, we show the observation counts for each combination. The dataset contains polling results for President George W. Bush from before and after 9/11/2001.

// Preparation: load the data
Table table = Table.read().csv("../data/bush.csv");

// Add a column for the months from the date column so we can count by month
StringColumn month = table.dateColumn("date").month();
month.setName("month");
table.addColumns(month);

// Perform the crossTab operation
Table counts = table.xTabCounts("month", "who");

// Formatting: make the table print as integers with no decimals
// instead of the raw doubles it holds
counts.columnsOfType(ColumnType.NUMBER)
    .forEach(x -> ((NumberColumn) x).setPrintFormatter(NumberColumnFormatter.ints()));

// Print
System.out.println(counts);

The formatted output is shown below.

 Crosstab Counts: month x who
 [labels]  | fox | gallup | newsweek | time.cnn | upenn | zogby | total |
-------------------------------------------------------------------------
 APRIL     |   6 |     10 |        3 |        1 |     0 |     3 |    23 |
 AUGUST    |   3 |      8 |        2 |        1 |     0 |     2 |    16 |
 DECEMBER  |   4 |      9 |        4 |        3 |     2 |     5 |    27 |
 FEBRUARY  |   7 |      9 |        4 |        4 |     1 |     4 |    29 |
 JANUARY   |   7 |     13 |        6 |        3 |     5 |     8 |    42 |
 JULY      |   6 |      9 |        4 |        3 |     0 |     4 |    26 |
 JUNE      |   6 |     11 |        1 |        1 |     0 |     4 |    23 |
 MARCH     |   5 |     12 |        4 |        3 |     0 |     6 |    30 |
 MAY       |   4 |      9 |        5 |        3 |     0 |     1 |    22 |
 NOVEMBER  |   4 |      9 |        6 |        3 |     1 |     1 |    24 |
 OCTOBER   |   7 |     10 |        8 |        2 |     1 |     3 |    31 |
 SEPTEMBER |   5 |     10 |        8 |        3 |     0 |     4 |    30 |
 Total     |  64 |    119 |       55 |       30 |    10 |    45 |   323 |

Of course, you can dispense with the formatting if you like. It is important to note the total column on the right, which shows that 23 polls were conducted in April, 16 polls in August, etc., across all pollsters. Similarly, the column totals at the bottom show that Fox conducted 64 polls, Gallup 119, etc.

Single Variable Totals

You can get single-variable counts using the xTabCounts() method that takes only one column name argument.

Table whoCounts = table.xTabCounts("who");
// formatting and printing as above

This produces:

 Column: who
 Category | Count |
--------------------
 gallup   |   119 |
 zogby    |    45 |
 time.cnn |    30 |
 fox      |    64 |
 newsweek |    55 |
 upenn    |    10 |

Calculating Percents

You may want to see the percent of polls conducted by each pollster, rather than raw counts. The xTabPercents() method can be used for that.

Table whoPercents = table.xTabPercents("who");

Actually, "percents" is a misnomer here. The results produced are proportions in decimal format. To get percent-formatted output, we use a different NumberColumnFormatter.
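For example, applying the same formatting pattern used elsewhere in this article to the single-variable result (assuming the whoPercents table from the previous snippet):

// Format the proportions as percentages with one decimal place
whoPercents.columnsOfType(ColumnType.NUMBER)
    .forEach(x -> ((NumberColumn) x).setPrintFormatter(NumberColumnFormatter.percent(1)));
System.out.println(whoPercents);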

When you have two variables, you can display the percent that falls into each combination as shown below.

Table tablePercents = table.xTabTablePercents("month", "who");
tablePercents.columnsOfType(ColumnType.NUMBER)
    .forEach(x -> ((NumberColumn) x).setPrintFormatter(NumberColumnFormatter.percent(1)));

Because the percents are small, we updated the formatter to show a single fractional digit after the decimal point. The output can best be understood by looking at an example. Of all the polls in the dataset, 1.9 percent were conducted by Fox in April, 3.1 percent by Gallup in April, and 0.9 percent by Fox in August, as well as the others shown in the table below.

 Crosstab Table Proportions:
 [labels]  |   fox | gallup | newsweek | time.cnn | upenn | zogby |  total |
----------------------------------------------------------------------------
 APRIL     |  1.9% |   3.1% |     0.9% |     0.3% |  0.0% |  0.9% |   7.1% |
 AUGUST    |  0.9% |   2.5% |     0.6% |     0.3% |  0.0% |  0.6% |   5.0% |
 DECEMBER  |  1.2% |   2.8% |     1.2% |     0.9% |  0.6% |  1.5% |   8.4% |
 FEBRUARY  |  2.2% |   2.8% |     1.2% |     1.2% |  0.3% |  1.2% |   9.0% |
 JANUARY   |  2.2% |   4.0% |     1.9% |     0.9% |  1.5% |  2.5% |  13.0% |
 JULY      |  1.9% |   2.8% |     1.2% |     0.9% |  0.0% |  1.2% |   8.0% |
 JUNE      |  1.9% |   3.4% |     0.3% |     0.3% |  0.0% |  1.2% |   7.1% |
 MARCH     |  1.5% |   3.7% |     1.2% |     0.9% |  0.0% |  1.9% |   9.3% |
 MAY       |  1.2% |   2.8% |     1.5% |     0.9% |  0.0% |  0.3% |   6.8% |
 NOVEMBER  |  1.2% |   2.8% |     1.9% |     0.9% |  0.3% |  0.3% |   7.4% |
 OCTOBER   |  2.2% |   3.1% |     2.5% |     0.6% |  0.3% |  0.9% |   9.6% |
 SEPTEMBER |  1.5% |   3.1% |     2.5% |     0.9% |  0.0% |  1.2% |   9.3% |
 Total     | 19.8% |  36.8% |    17.0% |     9.3% |  3.1% | 13.9% | 100.0% |

As you can see, this also gives you the total percents by month and pollster.

Column Percents and Row Percents

The final option is to show column percents or row percents. We’ll start with column percents. You calculate them as shown below.

Table columnPercents = table.xTabColumnPercents("month", "who");

This produces the following table. If you look at the column for "fox," the values you see are the percentages for Fox alone: nine percent of Fox's polls were conducted in April, five percent in August, etc. Looking across the columns, on the other hand, is not very intuitive (or useful, probably) until you get to the total, which shows the average across all pollsters by month.

 Crosstab Column Proportions:
 [labels]  |  fox | gallup | newsweek | time.cnn | upenn | zogby | total |
--------------------------------------------------------------------------
 APRIL     |   9% |     8% |       5% |       3% |    0% |    7% |    7% |
 AUGUST    |   5% |     7% |       4% |       3% |    0% |    4% |    5% |
 DECEMBER  |   6% |     8% |       7% |      10% |   20% |   11% |    8% |
 FEBRUARY  |  11% |     8% |       7% |      13% |   10% |    9% |    9% |
 JANUARY   |  11% |    11% |      11% |      10% |   50% |   18% |   13% |
 JULY      |   9% |     8% |       7% |      10% |    0% |    9% |    8% |
 JUNE      |   9% |     9% |       2% |       3% |    0% |    9% |    7% |
 MARCH     |   8% |    10% |       7% |      10% |    0% |   13% |    9% |
 MAY       |   6% |     8% |       9% |      10% |    0% |    2% |    7% |
 NOVEMBER  |   6% |     8% |      11% |      10% |   10% |    2% |    7% |
 OCTOBER   |  11% |     8% |      15% |       7% |   10% |    7% |   10% |
 SEPTEMBER |   8% |     8% |      15% |      10% |    0% |    9% |    9% |
 Total     | 100% |   100% |     100% |     100% |  100% |  100% |  100% |

Here, row percents show the opposite viewpoint.

Table rowPercents = table.xTabRowPercents("month", "who");

Here, we see that, of all the polls conducted in April, Fox conducted 26 percent, Gallup conducted 43 percent, and The University of Pennsylvania conducted zero percent when rounded.

 Crosstab Row Proportions:
 [labels]  | fox | gallup | newsweek | time.cnn | upenn | zogby | total |
-------------------------------------------------------------------------
 APRIL     | 26% |    43% |      13% |       4% |    0% |   13% |  100% |
 AUGUST    | 19% |    50% |      12% |       6% |    0% |   12% |  100% |
 DECEMBER  | 15% |    33% |      15% |      11% |    7% |   19% |  100% |
 FEBRUARY  | 24% |    31% |      14% |      14% |    3% |   14% |  100% |
 JANUARY   | 17% |    31% |      14% |       7% |   12% |   19% |  100% |
 JULY      | 23% |    35% |      15% |      12% |    0% |   15% |  100% |
 JUNE      | 26% |    48% |       4% |       4% |    0% |   17% |  100% |
 MARCH     | 17% |    40% |      13% |      10% |    0% |   20% |  100% |
 MAY       | 18% |    41% |      23% |      14% |    0% |    5% |  100% |
 NOVEMBER  | 17% |    38% |      25% |      12% |    4% |    4% |  100% |
 OCTOBER   | 23% |    32% |      26% |       6% |    3% |   10% |  100% |
 SEPTEMBER | 17% |    33% |      27% |      10% |    0% |   13% |  100% |
 Total     | 20% |    37% |      17% |       9% |    3% |   14% |  100% |

And, that’s all there is to cross-tabs.

Tablesaw supports many other functions that make data analysis easy, including loading data, filtering, mapping, summarizing, cleaning, saving, and so on. Tablesaw’s visualization support is based on the plot.ly library. 

Original Link

Is it Time to Break Up With Your Spreadsheets?

There are various online web applications and offline software tools that help you get your job done: a task management tool, a collaboration tool, a CRM system, your calendar, or just pen and paper, you name it.

But let's face it: you probably have a basic Excel sheet on your desktop that you use to enter data, don't you? A quick entry for your sales lead, a quick calculation for a quote, or a mini project plan to keep your task status up to date?

Why Do You Have This Excel Sheet?

Looking back on my career, I have seen two major reasons why people replace their software with Excel or extend their software with an Excel file: either the software is too complicated to use, or the software does not offer a simple feature they need. You get frustrated and go back to the magic of Microsoft: the Excel file. Add some columns, rows, and functions, and you have it in a few minutes. Copy and paste data into the sheet and bam, there you go! Attach it to your email and share it with your customer or colleagues. Quick and straightforward. Agree?

For a quick shot, the Excel file is the best way to get things done. However, consider the downsides. In my experience, you will inevitably run into the following hangups:

Playing Excel Ping Pong

Picture it: a colleague or customer starts to add their data or comments to the sheet. They send it back to you via email, overwrite your current version on the network drive, or add an extension like "_v2" to the filename.

Imagine the procedure goes back and forth several times and you end up with "_v3," "_v4," "_v5." You keep all the versions on your local drive, and your customers and colleagues surely do the same. You don't want to delete them, as you might want to go back to an older version.

This is when the Excel file ping pong starts to cause errors: a wrong product ID, a wrong price calculation, the customer has the wrong version, and on and on.

You might now protest, "But hey, there is this change tracking feature!" Well, to be honest, who is using it? I haven't met one person in the last 15 years who uses the change tracking feature in an Excel file. You might be the exception.

Who Didn’t Close the Excel File?

Let's look at yet another disadvantage of Excel. As soon as the Excel file becomes a "tool" for your department or even the entire company, the file has to be accessible to multiple people at the same time. And then the phone rings and your colleague asks you to close the file because he urgently needs to look up a price. This happens several times a day.

Annoying? YES, totally!

The file crashes, someone accidentally moves it to another folder, data goes missing; I could list a whole bunch of failures caused by the collaborative use of an Excel file. You ask why? At one company, I even wrote a macro (yes, I did!) that automatically saves and closes the Excel sheet after a pre-defined time. Excel is not a database. It is not an application that several people can work on at the same time.

The Evolution of Excel

The time has come: the evil Excel file has to die. But what should we do? I know a car insurer that had been working with Excel sheets to capture insurance cases. For years! At the end of the day, they had to develop their own software, and you can imagine the costs of such a software project. I assume that 95% of companies run their core business on Excel. And all of them live with the above troubles: locked files, corrupt files, wrong data, and so on. What's the solution?

Turn Excel Files to Web Apps

Turn the Excel files into web apps! All of them! With no-code platforms, you can define your data structure in models, set the relationships between your models, and finally import your Excel file and match its columns with the models and fields to turn it into a working web application. And it all takes a fraction of the time.

As a web application, your evolved Excel sheet now has all the advantages of web-based software. Accessible via the Internet in your browser, it can be shared, and multiple people can work on it concurrently. There is no versioning of files and no corrupted data, and you can extend your web app over time with new calculations, fields, and so on. And it's shareable with a link instead of being attached to an email.

Original Link

Understanding Bistro Streams: Counting Clicks by Region

Introduction

Bistro Streams [1] is a light-weight stream and batch processing library that radically changes the way stream data is processed by relying on a new column-oriented approach to data processing. Although its features are focused on stream analytics, with applications in IoT and edge computing, it can also be applied to batch processing, including such tasks as data integration, data migration, extract-transform-load (ETL), and big data processing. Bistro Streams is based on novel principles in both its conceptual design and physical architecture, relying on the Bistro Engine library [2], and the technology can be described briefly as follows:

Bistro Streams does for stream analytics what column stores did for databases.

More specifically, Bistro Streams has the following major distinguishing features:

  • Bistro Streams defines its data processing logic using column operations, as opposed to the set operations used in traditional systems such as Kafka Streams and Flink. Bistro makes columns and column operations first-class elements of data modeling and data processing. In particular, Bistro Streams does not use operations that are difficult to comprehend and execute, like join, group-by, or reduce.

  • Bistro Streams also has a column-oriented physical (in-memory) representation of data. The idea itself is not new and is widely used in column stores, but it is new for stream processing. In the case of long histories (which are needed for complex analysis) and complex analytic workflows, it can provide higher performance. It is also important for running on edge devices with limited resources.

  • The third feature of Bistro Streams is how it organizes data processing in time. Bistro Streams separates (1) the logic of triggering the processes that append, evaluate, and delete data from (2) the data processing logic itself, that is, what to do during evaluation. In particular, the frequency and conditions for starting evaluations are specified using a separate API. The same applies to retention policies, where deletion time is determined independently of the data processing operations applied to the data.

  • Bistro Streams also does not distinguish between batch and stream processing. In particular, the data is internally represented in the same way for both workloads, and the difference is only in how frequently the state is updated. Therefore, it can be applied to a wider range of problems and use cases, including batch processing, data ingestion, ETL, and real-time stream processing. The data processing logic is always the same.

In this article, we provide an introduction to Bistro Streams by demonstrating its basic features using an example [3] where we need to count the number of clicks made by users from different regions.

Problem 

Let us assume we have an input stream sending asynchronous events. Each event contains a username and the number of clicks made by that user. For example, a sequence of messages could be as follows: {"User": "Alice", "Count": 3}, {"User": "Bob", "Count": 5}, {"User": "Max", "Count": 2}, and so on.

Each user originates from some region, which can be retrieved from the table of users. This means that we can convert a sequence of usernames into a sequence of their regions, like "Americas", "Americas", "Europas" for the above messages.

Our goal is to continuously monitor the number of clicks originating from each region for the latest period of time, for example, for the last 10 seconds. We want to regularly update this information, for example, every 2 seconds.

If we represent all regions as a table then the first column would store region names and the second column would store the number of clicks received from it for the last 10 seconds. For example, we could see that there were 111 clicks from Americas and 222 clicks from Europas for the last 10 seconds. However, in 2 seconds, these numbers will probably change. Of course, the period for moving aggregation (10 seconds) and the frequency of updating the result (2 seconds) can be adjusted if necessary. 

In Bistro Streams, the data processing functions are distributed between the following layers: 

  • Data state is essentially a database, that is, a number of tables and columns.

  • Data processing logic is encapsulated into table and column definitions which involve user-defined functions written in Java. Defining a table means that it will be populated from other tables and defining a column means that its values will be evaluated from other columns.

  • Actions which are submitted to the server and describe what needs to be done, for example, add new records or provide an output or evaluate the expressions.

In the following sections, we will describe how these tasks are implemented.

Solution

Define Data Schema 

Before we can process data, we need to define where it will be stored, and for that purpose we define a number of tables in the same way as we would for a database. In our case, we need three tables:

  • CLICKS: This table will store the source events, and hence it should have a column for the username and a column for the number of clicks. In addition, it will need a (derived) column which references the corresponding user record in the USERS table.

  • USERS: This table stores all users and their properties. In particular, it has columns for the user name and user region. We will find the region for each user from this table. It also needs a (derived) column which directly references records from the REGIONS table.

  • REGIONS: This table is a list of regions, and it also has a column which will store the result of the analysis, that is, the total number of clicks for some period of time. Therefore, it has two columns: the region name and the total number of clicks.

Here is the code for defining these schema elements in Bistro: 

Schema schema = new Schema("Example 4");

Table regions = schema.createTable("REGIONS");
Column regionName = schema.createColumn("Name", regions);
Column regionClicks = schema.createColumn("Clicks", regions);

Table users = schema.createTable("USERS");
Column userName = schema.createColumn("Name", users);
Column userRegion = schema.createColumn("Region", users);

Table clicks = schema.createTable("CLICKS");
Column clickTime = schema.createColumn("Time", clicks);
Column clickUser = schema.createColumn("User", clicks);
Column clickCount = schema.createColumn("Count", clicks);

Define Data Processing Logic

Data processing in Bistro Streams is performed in so-called derived columns and derived tables. These are normal columns and tables but their data is computed from the data in other columns and tables which in turn can be derived elements. 

The idea of the data processing is that for each new record in the CLICKS table, we need to find the user record in the USERS table and then, for this user, find its region in the REGIONS table. Then we can update the number of clicks for this region by increasing the current value in its Clicks column. For example, if we get a new event {"User": "Alice", "Count": 3}, then we find the record for the username "Alice" and the record for the region "Americas". Finally, we add 3 to the Clicks column of this region record, so if it had 10 clicks, it will now have 13.

First, we need to define two so-called link columns for the tables CLICKS and USERS which will reference the USERS and REGIONS tables, respectively. This can be done as follows:

// Link column: CLICKS -> USERS
Column clickUserLink = schema.createColumn("User Link", clicks, users);
clickUserLink.link(
    new Column[] { clickUser },
    userName
);

// Link column: USERS -> REGIONS
Column userRegionLink = schema.createColumn("Region Link", users, regions);
userRegionLink.project(
    new Column[] { userRegion },
    regionName
);

Note that the second column is defined using the project operation. It is equivalent to a link column except that the target record will be added if it is not found. We decided to use a project column (and not a link column) because it will automatically populate the REGIONS table from the data in the USERS table. Alternatively, we could populate the regions by loading this data from an external data source and then link to it from the table of users.

Once the tables have been connected, we can define the aggregation logic. This is done by defining a so-called accumulate column in the REGIONS table.

regionClicks.setDefaultValue(0.0);
regionClicks.accumulate(
    new ColumnPath(clickUserLink, userRegionLink),
    (a, p) -> (double) a + (double) p[0],  // Add the clicks when an event is received
    (a, p) -> (double) a - (double) p[0],  // Subtract clicks when the event is deleted
    new ColumnPath(clickCount)             // Measure to be aggregated
);

This definition means that we group all records from the CLICKS table depending on the region they belong to, and then we sum up the values in the Count column of the original events within each group.

Importantly, this is done differently than with a conventional group-by or reduce. Here we define two functions: one processes a fact when it is added to the source table (the adder), and the other processes a fact when it is removed from the source table (the remover). The intention is to keep events only for some period of time (10 seconds in our example) and then delete them. The system will not recompute the aggregate over the whole window each time we want the result. Instead, the result is continuously updated as new events are appended and old events are deleted. This is important in the case of long windows, for example, if we wanted to monitor the number of clicks received over one day.

Actions and Data Evaluation Logic 

Once the logic of data processing has been defined, we need to determine the sequence of operations with the data. In particular, it is necessary to define how data is fed into the system, when it is processed, when data is removed from the system, and how we view the result. 

In our example, we use a simulator to produce input events. It is implemented by the ClickSimulator class:

ClickSimulator simulator = new ClickSimulator(server, clicks, 1000);

Once this connector instance is started, it will generate events as described at the beginning with some random click counts and with random time delays. For each new event, it will create an ActionAdd and submit it to the server: 

Action action = new ActionAdd(this.table, record);
Task task = new Task(action, null);
this.server.submit(task);

The result of executing this action on the server is a new record in the CLICKS table. In real applications, we could subscribe to a topic on some message bus or receive events via an HTTP listener. If we now start our server, the simulator will generate new events and they will be stored in this table, which will grow continuously, and no processing will be done.

In order to trigger data processing, we define a timer. It is actually also a connector but it wakes up at regular time intervals in order to trigger some processing or perform any other actions. We want to do data processing every 2 seconds and, therefore, we configure the timer accordingly by passing this frequency as a constructor parameter:

ConnectorTimer timer = new ConnectorTimer(server,2000);

What do we want to do every 2 seconds? Our goal is to aggregate the click counts for the last 10 seconds and hence we need to ensure that we have only data for this period and not more. This can be done by submitting a standard ActionRemove and specifying that it has to remove all records older than 10 seconds:

timer.addAction(
    new ActionRemove(clicks, clickTime, Duration.ofSeconds(Example4.windowLengthSeconds))
);

If we start the server now, the input CLICKS table will no longer grow without bound; it will contain only records no older than 10 seconds. Yet no data processing will be performed.

Now we can really process data by evaluating its derived columns which is done by submitting a standard ActionEval:

timer.addAction(
    new ActionEval(schema)
);

Note that the system will process only new and deleted records by updating the totals.

Now all the totals are up-to-date and we can choose how to output this information. In our example, we simply print the numbers in the console for each region:

timer.addAction(
    x -> {
        System.out.print("=== Totals for the last " + Example4.windowLengthSeconds + " seconds: ");
        Range range = regions.getIdRange();
        for (long i = range.start; i < range.end; i++) {
            String name = (String) regionName.getValue(i);
            Double count = (Double) regionClicks.getValue(i);
            System.out.print(name + " - " + count + " clicks; ");
        }
        System.out.print("\n");
    }
);

We can now start the server and observe the running click counts for incoming events:

=== Totals for the last 10 seconds: Americas - 50.0 clicks; Europas - 33.0 clicks;
=== Totals for the last 10 seconds: Americas - 42.0 clicks; Europas - 42.0 clicks;
=== Totals for the last 10 seconds: Americas - 58.0 clicks; Europas - 28.0 clicks;
=== Totals for the last 10 seconds: Americas - 70.0 clicks; Europas - 29.0 clicks;

Normally, however, the results would be written to an output connector, which can store them in a database or send them to a message bus.
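As a rough sketch of what such an output step might look like (this is not part of the example's source code; the JDBC URL, credentials, and target table are hypothetical, and only the timer.addAction lambda form shown above is taken from the example), the same timer could push the running totals into a relational table:

timer.addAction(
    x -> {
        // Hypothetical JDBC sink; replace the URL, credentials, and table with your own
        String url = "jdbc:postgresql://localhost:5432/analytics";
        try (java.sql.Connection conn = java.sql.DriverManager.getConnection(url, "user", "password");
             java.sql.PreparedStatement stmt = conn.prepareStatement(
                     "INSERT INTO region_clicks (region, clicks) VALUES (?, ?)")) {
            Range range = regions.getIdRange();
            for (long i = range.start; i < range.end; i++) {
                stmt.setString(1, (String) regionName.getValue(i));
                stmt.setDouble(2, (Double) regionClicks.getValue(i));
                stmt.executeUpdate();
            }
        } catch (java.sql.SQLException e) {
            e.printStackTrace();
        }
    }
);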

Conclusion 

Bistro Streams uses a new data processing paradigm by relying on a column-oriented data model and in-memory storage model. This approach is especially beneficial for complex analytics and stream processing because column operations are frequently much simpler at design time and more efficient at runtime. What is important is that Bistro Streams can be used for both batch and stream analytics because the way data is processed and how frequently it is processed is under full control of the developer. In particular, there is no problem in doing stream and batch analytics simultaneously by processing fast data from an input stream and loading large quantities of data from a persistent data store. Finally, Bistro Streams is implemented as a light-weight software library which can be embedded into devices or run at the edge without the need to provision complex infrastructure [4]. This makes it a good candidate for application in IoT and other areas where near real-time response is required.

The source code of this example can be checked out from [3] and more information can be found in [1] and [2]. 

Links 

[1] Bistro Streams: https://github.com/asavinov/bistro/tree/master/server 

[2] Bistro Engine: https://github.com/asavinov/bistro/tree/master/core 

[3] Source code of the example described in the article: https://github.com/asavinov/bistro/blob/master/examples/src/main/java/org/conceptoriented/bistro/examples/server/Example4.java 

[4] Is Your Stream Processor Obese? https://dzone.com/articles/is-your-stream-processor-obese

Original Link

Bad Data: The Virus Lurking in Your Business

There’s a virus in your business. It’s not the kind you would normally associate with suspect code from the internet that inadvertently crept through your firewalls. No. It’s a lot simpler than that but potentially far more devastating. It’s poor quality data.

You don't know the data is bad until you find it's inaccurate – maybe a customer tells you it's wrong, or false negatives and false positives lead you to that conclusion. But by that time, it's most likely too late. It's been replicated and shared across multiple systems. Other data has been extrapolated from it. You may have shared it with other departments, even third parties such as business/trading partners or regulatory organizations. Hundreds, thousands, possibly millions of transactions later, you discover this. Decisions and investments have been made based on that data. Worst of all, there is no way of undoing everything that's been done.

Prevention is better than the cure – but just like humans trying to stay as healthy as possible, organizations can only do their best to minimize the risks of bad data entering or spreading like a contagious disease across their organization and the systems that consume it.

Technology alone cannot prevent bad data but it can augment users’ processes of collecting and managing that data over its lifecycle. Sometimes organizations may set targets on staff for the quantity of data being processed instead of the quality of that data.

If these factors don't convince you to focus on data quality, then the General Data Protection Regulation (GDPR) might. Data subjects have the right to know what data is stored about them, and an organization needs to satisfy that request within a set period of time. It's more than just an embarrassment if the data an organization holds is incorrect. Organizations may have to explain how that data came into their possession and, if it is inaccurate, why – particularly when dealing with personally identifiable information (PII).

Why Data Governance Is Crucial for Data Quality

Organizations need to know that their data is correct and available to the users who have a right to view and process it. Data is used to help deliver business efficiency and to drive business transformation and innovation. Organizations have a corporate responsibility to manage and protect that data to meet corporate, industry, and governmental regulations. Data Quality is a key element of Data Governance. There is a clear need to make good-quality, well-understood, governed data available to authorized users. In short, there are three key considerations for organizations:

  1. Know your data. This could mean building a 360-degree view of a particular focus area – for example, a 360-degree view of the customer. Organizations may need to gather internal data – and external data from social media, click stream, census, or other relevant sources. Data must also be accessible by all users and/or applications that need it – on-premise or across a hybrid cloud. This could mean making data globally accessible to many applications regardless of the computing platform. A common access layer, ontologies, and business glossary to help understand data elements are all key elements of what an information and governance catalog should provide.
  2. Trust your data. Well-governed data provides confidence in not just the data itself, but in the outcomes from analytics, reports and other tasks based on that data. There are two key points to data governance: First, organizations must have the ability to ensure the data is secure and adheres to compliance regulations. And second, they must have the ability to govern data so users can find and access the information themselves, at the exact time they need it.
  3. Data as a source for insight and intelligence. This means having the right skills and tools in place to surface insights, as well as the right technology to learn from the data and improve accuracy each time that data is analyzed.

More Than Just a Data Quality Platform

IBM InfoSphere Information Server (IIS) can help organizations integrate and transform data and content to deliver accurate, consistent, timely, and complete information on a single platform unified by a common metadata layer. It provides common connectivity, shared metadata, and a common execution engine to help facilitate flexible deployment on-premise, on a grid or cluster, in the cloud, or natively on Hadoop environments. It can help accelerate and automate data quality and governance initiatives by:

  • Automatically discovering data and data sources.
  • Automating rules that trigger custom DQ actions based on business events.
  • Utilizing Machine Learning for an accelerated Metadata Classification Process/Auto Tagging (discussed in more detail later).
  • Automatically classifying data – including understanding PII risk.

The Benefits of Industry Models

Getting started with data quality can be a daunting task. To help ease the initial burden and arrive at a standardized data model, industry models are available that provide pre-built content:

  • 200+ out-of-the-box Data Classes (clients can expand these).
  • 200+ out-of-the-box Data Rule Definitions (clients can expand these).
  • QualityStage Address Verification Interface (AVI) with coverage for 248+ countries.
  • Stewardship Center and Business Process Management, which help enable customized data quality exception records to be routed for notification and/or remediation.

Data Profiling and Quality – Core Capabilities

A key capability of data quality is deep data profiling and analysis to understand the content, quality, and structure of tables and files. This includes:

  • Column Analysis — min/max, frequency distributions, formats, data types, data classes.
  • Data Classification — measuring all columns against a set of pre-defined (200+) data classes and clients can expand these.
  • Data Quality Scores — all data values in all columns are measured against 10 data quality dimensions (configurable and expandable).
  • Primary Key/Multi-Column Primary Key analysis.
  • Relationship Analysis — discover and validate PK->FK relationships.
  • Overlap Analysis — measure the percent of duplicate/same values across columns, across 1 or more data sets.

This is like pulling ‘Double Duty’ because it uses the built-in Data Rules, combined with the organization’s business logic to identify exception records for statistics and/or remediation.

In addition, users can specify consistent and reusable data rules, driven by the business. Rules can be written in a language that is less technical than SQL, and a rule can be written once and applied in multiple places; for example, all SSN columns should comply with the same set of rules (a generic illustration follows below). Customers are able to run hundreds or thousands of data rules on a daily, weekly, or monthly basis.
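To make the idea of a reusable, business-driven rule concrete, here is a minimal generic sketch in plain Java. This is an illustration only, not the IIS rule language, and the SSN format check is an assumed example:

import java.util.List;
import java.util.regex.Pattern;

public class SsnFormatRule {
    // One reusable rule definition: a US SSN must look like 123-45-6789
    private static final Pattern SSN_PATTERN = Pattern.compile("\\d{3}-\\d{2}-\\d{4}");

    // Apply the same rule to any column of values and count the exception records
    public static long countExceptions(List<String> columnValues) {
        return columnValues.stream()
                .filter(v -> v == null || !SSN_PATTERN.matcher(v).matches())
                .count();
    }

    public static void main(String[] args) {
        List<String> ssnColumn = List.of("123-45-6789", "987654321", "000-12-3456");
        System.out.println("Exception records: " + countExceptions(ssnColumn)); // prints 1
    }
}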

The Role of Machine Learning and A.I. on Data Quality

One of the tasks a data scientist regularly faces is ensuring their test data is of suitable quality. I don’t need to explain the impact of developing a model that was trained on poor quality data. Machine learning can help find similarities in data across different silos and unify a variety of data. When the model is deployed, algorithms can be used to learn about the data and help improve the quality of new data as models encounter it.

Machine learning and neural networks are used in the IBM Unified Governance and Integration platform to identify probabilistic matches of multiple data records that are likely to be the same entity, even if they look different. This makes analyzing master data for quality as well as business term relationships possible and faster, which has been a major pain point for many clients.

Feedback learning is applied so if the confidence score of a match is below a certain threshold level, the system can refer the candidate data records to a human expert using the workflow. It is far more productive for those experts to deal with a small subset of weak matches than an entire dataset.

Consider a new data scientist who is given the task of developing a machine learning model to detect customer churn for a specific product or service. While they have an idea of what needs to be accomplished, they have no idea which data sets they should start with. Within IBM data governance technologies, machine learning can help the data scientist search for "customer retention" and get a graph view of all connected entities, including associated privacy information, with drill-down available to find more information about the quality and authenticity of the data.

A classification or taxonomy is a way of understanding the world by grouping and categorizing. Many organizations use the social security number (SSN) to track a customer across various investment products, for example, but they may appear in various forms such as Tax Identification Number or Employee Identification Number. Using traditional rule-based engines, it’s difficult to figure out that these three terminologies essentially refer to the same entity. In contrast, one term may also have different meanings in the same organization. Machine learning models offer a new way to train the system to describe “domains” from the data that helps find these relationships.

Traditional techniques of metadata matching and assignment are rule-based. It is important to understand that machine learning models, while able to do a better job with ambiguous data sets, are not a replacement for all the existing application rules. ML does not try to replace the existing application rules and regular expressions where they are proven to work well; rather, it augments them. This combined approach empowers users by automatically assigning terms with higher confidence, yet not ignoring domain expertise when it is required.

If you take GDPR (the General Data Protection Regulation) as an example, there are four steps to complete before GDPR terminology can be related to your business terminology and leveraged for privacy regulations:

  • Supportive content terms must be manually extracted from GDPR documents (from various articles and sections).
  • Hierarchies must be created for the key categories.
  • Supportive content terms must be matched manually with the business terms by domain experts.
  • And, finally, these supportive content terms must be mapped to the business data model.

Machine learning is used to create a neural network model that interprets certain regulations based on other similar regulations. This not only extracts the supportive content terms from a raw document but creates a well-formed taxonomy that can be more easily ingested into the governance catalog.

Looking at the Bigger Picture

IIS, while providing very detailed analysis and control of many aspects of data quality and the broader unified governance and integration platform, also provides end-to-end views of relationships, dependencies, and lineage across multiple sources of data. Users can see everything from high-level policies down to granular policies, their relationship to individual process rules, as well as the rules that apply to data and how and where they are applied to data and metadata. Put another way, users can see which rules operate on the data values contained in a given column and how they do so, which rules are governed by (described by) a governance rule, and how governance rules are driven by governance policies and sub-policies.

By applying machine learning capabilities across data quality processes and the bigger information governance solution outlined above, organizations can significantly enhance their data quality initiatives with systems that learn and become progressively smarter about data quality across the enterprise. The takeaway is that it is better to detect and prevent poor quality data from ever being used by your applications than to try to clean it up later.

For more information on data quality visit ibm.com/analytics/data-quality.

Original Link

Why You Should Use FlashText Instead of RegEx for Data Analysis

If you have done any text or data analysis, you might already be familiar with Regular Expressions (RegEx). RegEx evolved as a necessary tool for text editing. But if you are still using RegEx for all of your text processing, you may run into problems. Why? When it comes to large texts, the low efficiency of RegEx can make data analysis unacceptably slow.

In this article, we will discuss how you can use FlashText, a Python library that can be up to 100 times faster than RegEx for keyword search and replacement, to perform this kind of data analysis.

RegEx vs. FlashText

Before you proceed with your analysis, you need to clean your source data, even for the simplest text. This often includes searching and replacing keywords. For example, search the corpus for the keyword “Python,” or replace all instances of “python” with “Python.”
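For instance, a minimal RegEx-based clean-up step in Python might look like the following sketch (the sample corpus string is purely illustrative):

import re

corpus = "python is great, and python is easy to learn"

# Replace every standalone occurrence of "python" with "Python".
cleaned = re.sub(r"\bpython\b", "Python", corpus)
print(cleaned)  # Python is great, and Python is easy to learn

# Or simply search for the keyword.
print(re.findall(r"\bPython\b", cleaned))  # ['Python', 'Python']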

RegEx is an ideal tool if you only have to search and replace a few hundred keywords. However, many of these tasks involve Natural Language Processing (NLP), where you might come across tens of thousands of keywords, and RegEx can require several days to complete such a job.

Of course, you may think that you can solve the problem by parallelizing your processes; however, in practice, this solution does not make too much of a difference.

Is there any other way to attack this problem?

The creator of FlashText also faced the same problem back then. After some research that did not return any results, he decided to write a new algorithm.

Before understanding the algorithm behind it, let us take a look at the comparison chart showing the speed of FlashText in a search vs. Regular Expressions in a search.

(Chart: time taken to search keywords, RegEx vs. FlashText, as the number of keywords grows.)

You can observe from the above chart that RegEx's processing time increases nearly linearly as the number of keywords grows. FlashText, on the other hand, is barely affected by the increase in keywords.

Moving on, let us look at another chart on keyword replacement.

(Chart: time taken to replace keywords, RegEx vs. FlashText, as the number of keywords grows.)

Likewise, when the number of keywords increases, the processing time of FlashText remains almost the same. So, what is FlashText? I have explained it in the following section.

The Smarter and Faster Way of Data Cleansing – FlashText

As the name suggests, FlashText is one of the fastest ways to execute search and replace keywords. It is an open source Python library on GitHub.

When using FlashText, begin by providing a list of keywords. FlashText uses this list to build an internal Trie dictionary. You then pass it the string of text in which you want to search or replace.

For replacements, it will create a new string with the replacement keywords substituted in. For a search, it will return a list of the keywords found in the string. Either way, it iterates through your string only once.

Why Is FlashText So Fast?

To truly understand the reason behind FlashText's speed, let us consider an example. Take a sentence that is composed of three words: "I like Python." Assume that you have a corpus of four words: {Python, Java, J2EE, Ruby}.

If, for every word in the corpus, you check whether it appears in the sentence, you need to iterate over the string four times.

(Diagram: checking each corpus keyword against the sentence, one pass per keyword.)

For n words in the corpus, we need n iterations. And each step (is it in the sentence?) will take its own time. This is the logic behind RegEx matching.

There is also an alternative method that reverses the first one: for each word in the sentence, we check whether it exists in the corpus.

(Diagram: checking each word of the sentence against the corpus dictionary.)

For m words in the sentence, you have m cycles. In this situation, the time spent depends only on the number of words in the sentence, and you can quickly perform the lookup step (is it in the corpus?) using a dictionary.
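As a rough sketch of this second method (plain Python, not FlashText itself), checking each word of the sentence against a set gives constant-time lookups:

corpus = {"python", "java", "j2ee", "ruby"}
sentence = "I like Python"

# One pass over the m words of the sentence; each membership test is O(1).
found = [word for word in sentence.split() if word.lower() in corpus]
print(found)  # ['Python']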

The FlashText algorithm uses the second method. Moreover, the Aho-Corasick algorithm and the Trie data structure inspired this algorithm.

How Does FlashText Work?

First, it creates a Trie data structure from the corpus. Like the following graph:

(Diagram: the Trie dictionary built from the corpus {Python, Java, J2EE, Ruby}.)

Start and EOT (End of Term) indicate word boundaries, which can be a space, a comma, or a line return. A keyword matches only when it has boundaries on both sides; this prevents cases such as 'apple' matching inside 'pineapple.'

Take the string "I like Python" and search it character by character.

(Diagrams: character-by-character search of "I like Python" against the Trie.)

Because the algorithm searches character by character, it can skip whole words cheaply: when it reads "I" and finds no matching branch in the Trie, it jumps ahead to the next word boundary. This mechanism lets it skip every word that does not exist in the corpus.

The FlashText algorithm inspects each character of the string "I like Python" only once. Even if your dictionary contains a million keywords, this hardly affects its running time.
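To make the idea concrete, here is a simplified, illustrative sketch of a Trie-based keyword search with word boundaries. This is not FlashText's actual implementation, just the approach described above:

_end = '_eot_'  # marks End of Term in the Trie

def build_trie(keywords):
    trie = {}
    for word in keywords:
        node = trie
        for ch in word.lower():
            node = node.setdefault(ch, {})
        node[_end] = word  # store the canonical keyword at the terminal node
    return trie

def find_keywords(sentence, trie, boundaries=" ,.\n"):
    found = []
    n = len(sentence)
    for i in range(n):
        # only start matching at a word boundary
        if i == 0 or sentence[i - 1] in boundaries:
            node, j, match = trie, i, None
            while j < n and sentence[j].lower() in node:
                node = node[sentence[j].lower()]
                j += 1
                # a keyword counts only if it also ends on a boundary
                if _end in node and (j == n or sentence[j] in boundaries):
                    match = node[_end]
            if match:
                found.append(match)
    return found

corpus = {"Python", "Java", "J2EE", "Ruby"}
print(find_keywords("I like Python", build_trie(corpus)))  # ['Python']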

When Do You Need to Use FlashText?

I would suggest you use FlashText whenever the number of keywords is greater than 500.

(Chart: search time vs. number of keywords; FlashText overtakes RegEx at roughly 500 keywords.)

In terms of searching, if the number of keywords is greater than 500, FlashText will perform better than RegEx.

Additionally, RegEx can search for special patterns using characters such as ^, $, *, and \d, but FlashText does not support them.

FlashText also cannot match partial or fuzzy variants of a keyword (for example, "worddvec"), but it can match complete words ("word2vec").

Take a look at the basic usage of FlashText. Give it a try. You will observe that it is much faster than RegEx.

Below is some Python code which will help you use FlashText.

Code: Search for keywords using FlashText.

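A minimal sketch using the flashtext library's KeywordProcessor (installable with pip install flashtext; the sample keywords and clean names below are illustrative):

from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor()
# add_keyword(keyword, clean_name): the optional second argument is what gets reported.
keyword_processor.add_keyword("Big Apple", "New York")
keyword_processor.add_keyword("Bay Area")

keywords_found = keyword_processor.extract_keywords("I love Big Apple and Bay Area.")
print(keywords_found)  # ['New York', 'Bay Area']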

Code: Replace keywords using FlashText.

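And a matching replacement sketch, using the same KeywordProcessor API (again with illustrative keywords):

from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor()
keyword_processor.add_keyword("New Delhi", "NCR region")

new_sentence = keyword_processor.replace_keywords("I love New Delhi.")
print(new_sentence)  # I love NCR region.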

Original Link

Introduction to Spark With Python: PySpark for Beginners

Apache Spark is one of the most widely used frameworks when it comes to handling and working with Big Data, and Python is one of the most widely used programming languages for Data Analysis, Machine Learning, and much more. So, why not use them together? This is where Spark with Python, also known as PySpark, comes into the picture.

With an average salary of $110,000 per annum for an Apache Spark Developer, there’s no doubt that Spark is used in the industry a lot. Because of its rich library set, Python is used by the majority of Data Scientists and Analytics experts today. Integrating Python with Spark was a major gift to the community. Spark was developed in the Scala language, which is very much similar to Java. It compiles the program code into bytecode for the JVM for Spark big data processing. To support Spark with Python, the Apache Spark community released PySpark. In this Spark with Python blog, I’ll discuss the following topics.

  • Introduction to Apache Spark and its features
  • Why go for Python?
  • Setting up Spark with Python (PySpark)
  • Spark in Industry
  • PySpark SparkContext and Data Flow
  • PySpark KDD Use Case

Apache Spark is an open-source cluster-computing framework for real-time processing developed by the Apache Software Foundation. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance.

Below are some of the features of Apache Spark which gives it an edge over other frameworks:

  • Speed: It can be up to 100x faster than traditional large-scale data processing frameworks such as Hadoop MapReduce.
  • Powerful Caching: A simple programming layer provides powerful caching and disk persistence capabilities.
  • Deployment: Can be deployed through Mesos, Hadoop via Yarn, or Spark’s own cluster manager.
  • Real Time: Real-time computation and low latency because of in-memory computation.
  • Polyglot: It is one of the most important features of this framework as it can be programmed in Scala, Java, Python, and R.

Spark was designed in Scala, and Scala code can run almost 10 times faster than Python; however, Scala is faster only when the number of cores being used is small. As most analyses and processes nowadays require a large number of cores, Scala's performance advantage is not that significant.

For programmers, Python is comparatively easier to learn because of its syntax and standard libraries. Moreover, it’s a dynamically typed language, which means RDDs can hold objects of multiple types.
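For example (assuming an active SparkContext named sc, as in the PySpark shell), a single RDD can happily hold values of different types:

# Python's dynamic typing lets one RDD hold mixed types.
mixed = sc.parallelize([1, "two", 3.0, ("a", 4)])
print(mixed.collect())  # [1, 'two', 3.0, ('a', 4)]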

Although Scala has Spark MLlib, it doesn't have enough libraries and tools for Machine Learning and NLP purposes. Moreover, Scala lacks good data visualization libraries.

Setting Up Spark With Python (PySpark)

I hope you guys know how to download and install Spark. Once you've unzipped the Spark file, installed it, and added its path to your .bashrc file, you need to run source .bashrc:

export SPARK_HOME=/usr/lib/hadoop/spark-2.1.0-bin-hadoop2.7
export PATH=$PATH:/usr/lib/hadoop/spark-2.1.0-bin-hadoop2.7/bin

To open the PySpark shell, you need to type in the command ./bin/pyspark

Apache Spark, because of it’s amazing features like in-memory processing, polyglot, and fast processing is being used by many companies all around the globe for various purposes in various industries:

Yahoo! uses Apache Spark for its Machine Learning capabilities to personalize its news and web pages and also for targeted advertising. They use Spark with Python to find out what kind of news users are interested in reading and to categorize news stories, predicting which kinds of users would be interested in reading each category of news.

TripAdvisor uses Apache Spark to provide advice to millions of travelers by comparing hundreds of websites to find the best hotel prices for its customers. Reading and processing hotel reviews into a readable format is also done with the help of Apache Spark.

One of the world’s largest e-commerce platforms, Alibaba, runs some of the largest Apache Spark jobs in the world in order to analyze hundreds of petabytes of data on its e-commerce platform.

PySpark SparkContext and Data Flow

Talking about Spark with Python, working with RDDs is made possible by the library Py4j. PySpark Shell links the Python API to Spark Core and initializes the Spark Context. Spark Context is at the heart of any Spark application.

  1. Spark Context sets up internal services and establishes a connection to a Spark execution environment.
  2. The Spark Context object in driver program coordinates all the distributed processes and allows for resource allocation.
  3. Cluster Managers provide Executors, which are JVM processes that run the application's logic.
  4. The Spark Context object sends the application code to the executors.
  5. The executors then run the tasks that the Spark Context schedules for them.

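To make that flow concrete, here is a minimal sketch of creating a Spark Context yourself in a standalone script (the application name and the local master URL are illustrative; in the PySpark shell, sc already exists):

from pyspark import SparkConf, SparkContext

# Configure and create the Spark Context, the entry point of the application.
conf = SparkConf().setAppName("my-first-pyspark-app").setMaster("local[*]")
sc = SparkContext(conf=conf)

print(sc.parallelize(range(10)).count())  # 10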

PySpark KDD Use Case

Now let’s have a look at a use case: KDD’99 Cup (International Knowledge Discovery and Data Mining Tools Competition). Here we will take a fraction of the dataset because the original dataset is too big.

import urllib
f = urllib.urlretrieve ("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz", "kddcup.data_10_percent.gz")

Creating RDD:

Now we can use this file to create our RDD.

data_file = "./kddcup.data_10_percent.gz"
raw_data = sc.textFile(data_file)

Filtering

Suppose we want to count how many normal interactions we have in our dataset. We can filter our raw_data RDD as follows.

normal_raw_data = raw_data.filter(lambda x: 'normal.' in x)

Count:

Now we can count how many elements we have in the new RDD.

from time import time
t0 = time()
normal_count = normal_raw_data.count()
tt = time() - t0
print "There are {} 'normal' interactions".format(normal_count)
print "Count completed in {} seconds".format(round(tt,3))

Output:

There are 97278 'normal' interactions
Count completed in 5.951 seconds

Mapping:

In this case, we want to read our data file as a CSV formatted one. We can do this by applying a lambda function to each element in the RDD as follows. Here we will use the map() transformation and the take() action.

from pprint import pprint
csv_data = raw_data.map(lambda x: x.split(","))
t0 = time()
head_rows = csv_data.take(5)
tt = time() - t0
print "Parse completed in {} seconds".format(round(tt,3))
pprint(head_rows[0])

Output:

Parse completed in 1.715 seconds
[u'0', u'tcp', u'http', u'SF', u'181', u'5450', u'0', u'0',
.
. u'normal.']

Splitting:

Now we want to have each element in the RDD as a key-value pair where the key is the tag (e.g. normal) and the value is the whole list of elements that represents the row in the CSV formatted file. We could proceed as follows. Here we use line.split() and map().

def parse_interaction(line):
    elems = line.split(",")
    tag = elems[41]
    return (tag, elems)

key_csv_data = raw_data.map(parse_interaction)
head_rows = key_csv_data.take(5)
pprint(head_rows[0])

Output:

(u'normal.', [u'0', u'tcp', u'http', u'SF', u'181', u'5450', u'0', u'0', u'0.00', u'1.00',
.
.
.
. u'normal.'])

The Collect Action:

Here we are going to use the collect() action. It will get all the elements of the RDD into memory. For this reason, it has to be used with care when working with large RDDs.

t0 = time()
all_raw_data = raw_data.collect()
tt = time() - t0
print "Data collected in {} seconds".format(round(tt,3))

Output:

Data collected in 17.927 seconds

That took longer than any other action we used before, of course. Every Spark worker node that has a fragment of the RDD has to be coordinated in order to retrieve its part and then reduce everything together.

As a final example that will combine all the previous ones, we want to collect all the normal interactions as key-value pairs.

# get data from file
data_file = "./kddcup.data_10_percent.gz"
raw_data = sc.textFile(data_file)

# parse into key-value pairs
key_csv_data = raw_data.map(parse_interaction)

# filter normal key interactions
normal_key_interactions = key_csv_data.filter(lambda x: x[0] == "normal.")

# collect all
t0 = time()
all_normal = normal_key_interactions.collect()
tt = time() - t0
normal_count = len(all_normal)
print "Data collected in {} seconds".format(round(tt,3))
print "There are {} 'normal' interactions".format(normal_count)

Output:

Data collected in 12.485 seconds
There are 97278 normal interactions

So this is it, guys!

I hope you enjoyed this Spark with Python article. If you are reading this, congratulations! You are no longer a newbie to PySpark. Try out this simple example on your systems now.

Original Link

How to Build a Data-Driven Culture

The importance of data has been around for centuries, dating back to the times of astronomy and scientific research. Early on, we saw famous scientists and astronomers like Charles Darwin, Galileo Galilei, Marie Curie, and Nikola Tesla gather data to prove hypotheses and evaluate discoveries.

Fast forward to today, and the value of data continues to increase, with data being the single most important factor in uncovering the truth and discovering new frontiers. In a recent EY report, 81 percent of surveyed organizations say that data should be used to analyze each business decision. By using data in their analyses, organizations are better prepared to uncover opportunities or threats and to increase business performance and efficiency.

In fact, many corporations have embedded the importance of data into their core values. Amazon, for example, includes data as the backbone of many of its leadership principles. To achieve success, employees at Amazon are constantly measured against those principles and are pushed to “dive deep” into the data.

To build a data-driven culture, you must shift how your organization uses data to help meet the company’s business goals. Below are three things you should consider:

  1. Define company metrics – most successful organizations set business goals and define metrics to measure their progress and success. If a company’s goal is to improve sales revenues by 20 percent, then teams across the business should tap into its data and analyze how their department can contribute to the company’s revenue improvement. In addition to making teams accountable, these metrics also allow teams to objectively analyze how viable their business strategies are and to eliminate ambiguity.
  2. Democratize data – data has historically been in the hands of a few. If data is available beyond key stakeholders, employees are mobilized and empowered to conduct their own analysis by slicing and dicing the data. They can then contribute their findings to strengthen or invalidate their business strategies or ideas.
  3. Deliver consistent data – as organizations become larger, business processes become more complex. And analyzing the right, relevant data becomes more complicated when working with duplicative data from siloed sources and departments. By making data consistent across the organization, employees and departments have a common understanding of the data used in analyses, which leads to better cross-team collaboration.

Building a data-driven culture is not easy, but with the right organization and technology in place, along with common goals across teams, you can provide the foundation needed to get you there.

Original Link

What Is Data Warehousing?

Today’s enterprise relies on the effective collection, storage, integration, and analysis of data. These activities have moved to the heart of revenue generation, cost containment, and profit optimization. As such, it’s no surprise that the amounts of data generated – as well as the number and types of data sources – have exploded.

Data-driven companies require heavy-duty solutions for managing and analyzing large quantities of data across their organizations. These systems must be scalable, reliable, and secure enough for regulated industries, as well as flexible enough to support a wide variety of data types. These requirements go way beyond the capabilities of any traditional database. That’s where the data warehouse comes in.

Breaking it Down: What Is a Data Warehouse, Anyway?

A data warehouse is a large-capacity repository that sits on top of multiple databases. Whereas the conventional database is optimized for a single data source, such as payroll information, the data warehouse is designed to handle a variety of data sources, such as sales data, data from marketing automation, real-time transactions, SaaS applications, SDKs, APIs, and more.

There are other differences, as well. For example, single-source databases are built for speed, employing online transactional processing (OLTP) to insert and edit small transactions. However, due to their structure, they do not lend themselves to advanced analytics. In contrast, a data warehouse uses online analytical processing (OLAP), which is designed for fast, sophisticated analysis.

Databases and data warehouses do have some similarities, however. Besides the fact that they are both repositories for large amounts of data, both can be queried. And they both have the ability to store data in tables (although databases only store data in two-dimensional tables; data warehouses contain multidimensional tables with layers of columns and rows).

On-Premise and in the Cloud

Companies are increasingly moving away from on-premise data warehouses to the cloud, leveraging the cost savings and scalability managed services can provide. The architecture of these cloud-enabled data warehouses differs from that of their traditional, on-premise counterparts.

Traditional data-warehouse architecture is separated into three tiers: one for the database server that extracts data from multiple data sources, one for the OLAP server (which transforms the data), and one for the client level.

Cloud-based data warehouses are an entirely different animal. Their architecture varies tremendously among vendors. For example, Amazon’s Redshift is essentially a cloud-based representation of on-premise data warehouses. BigQuery is serverless so it manages computing resources dynamically and hides resource-management decisions from the user.

The cloud offers some distinct advantages:

  • It’s managed. Instead of hiring your own data-warehousing team, a cloud data warehouse lets you outsource the management hassle to professionals who must meet service level agreements (SLAs).
  • It outperforms on-premise data warehouses. Cloud-based solutions offer superior reliability and speed. They are generally more secure than on-premise data warehouses, making them a good choice for the enterprise.
  • It’s built for scale. Cloud-based data warehouses are elastic, so you can instantly add capacity.
  • It’s more cost-effective. With cloud, you pay for what you use. Some providers charge by throughput. Others charge per hour per node. In every case, you avoid the mammoth costs incurred by an on-premise data warehouse that runs 24 hours a day, seven days a week.

Check out this guide to selecting the right cloud-based data warehouse for your environment.

Do You Need a Data Warehouse?

Some businesses and industries require more data analysis than others. For example, Amazon uses real-time data to adjust prices three or four times a day. Insurance companies track policies, sales, claims, payroll, and more. They also use machine learning to predict fraud. Gaming companies must track and react to user behavior in real-time to enhance the player’s experience. Data warehouses make all of these activities possible.

If your organization has or does any of the following, you’re probably a good candidate for a data warehouse:

  • Multiple sources of disparate data.
  • Big-data analysis and visualization – both asynchronously and in real-time.
  • Custom report generation/ad-hoc analysis.
  • Data mining.
  • Machine learning/AI.
  • Data science.

These activities and assets require more than the traditional single-source database can provide. They require an “industrial-strength” data warehouse.

Original Link

Data Analyst vs. Data Scientist vs. Big Data Expert

With the advent of data science and big data as mainstream career options, there has been a lot of confusion about the different paths out there. Some claim data analysts will be obsolete once big data takes hold, while others claim that big data and data science are the same thing, or that one is a subset of the other. When it comes to one profile eliminating the other, only time can tell. As for the differences, a simple factual study of each can reveal the truth about them.

Data science has been around for a long time, while big data is fairly new, having originated from the former with significant changes. Data analysis leverages the techniques and software systems used in both (and vice versa with respect to techniques) but is a whole different story.

So, here is a little comparison between data analysts, data scientists, and big data experts!

Definition

  • Data Analysts: Using automated tools, they fetch segregated data and insights. They define datasets and do an extensive demographical analysis to determine business and product related strategies.
  • Data Scientists: As evident from 'scientist,' they fetch data, construct and maintain databases, clean and segregate data for various needs, and also work on data visualization and analysis.
  • Big Data Experts: They deal with a continuous and large amount of data, define parameters and datasets for analysis, and program analytical systems to provide strategic insights for businesses.

Skills Required

  • Data Analysts: Programming, Statistics and Mathematics, Machine Learning, Data Visualization and Communication, Data Wrangling and Dataset Definition.
  • Data Scientists: SAS/R/similar tools, Python, Hadoop, SQL, Restructuring Data, Database Construction and Management.
  • Big Data Experts: Mathematics and Statistics, Programming and Computer Science, Analytical Skills, Business Strategy.

Application

  • Data Analysts: Healthcare, Insurance, Travel, Administration, Gaming, Distribution Systems.
  • Data Scientists: Search Engines, Advertisements, Adaptive Algorithms, AI Systems.
  • Big Data Experts: Retail, E-commerce, Financial Services, Communication.

This makes the roles of these three profiles fairly clear. The most important difference is the application, which narrows down the industries that hire each of them, and that difference is substantial. There is certainly an overlap in skills, requirements, and even the work itself, but the overlap exists because of the common foundation all these profiles stand upon. Because of that, the profiles form a hierarchical progression.

  • As you can see, data analytics is the most basic of the three. The job of a data analyst has the widest application and is thus adopted across the most diverse range of industries. Even the educational and academic requirements are lower for a data analyst.
  • Next in line are big data jobs, which are fairly complicated and require advanced skills. Sometimes, a big data certification is a mandatory requirement to get a big data analyst job. The scope for big data jobs is increasing by the day due to the penetration of digital technologies in various industries.
  • At the top lies data science jobs. These have a diverse set of profiles under them and require certain expertise. Data science certification is mandatory to get a job. The scope for data scientists is lower compared to big data due to the different (individual) profiles that lie under the umbrella of data science.

As for the threat to data analysts from big data experts, it is important to understand that while big data analysts could take over data analysis with some modifications to their profiles, it is very unlikely. The reason is that the type of data, its processing needs, and the end goals are very different for the two profiles, which is also why they cater to different industries.

With the above information, the differences among data analysts, data scientists, and big data experts should be clear. This information can be used to make better career plans in the field of data analytics and business strategy. Informed decisions are always the best decisions one can make. Research well, analyze carefully, and success will be only a few steps away!

Original Link

Classification From Scratch, Part 4 of 8: Penalized Ridge Logistic

This is the fourth post in our series on classification from scratch, following the previous post, which was something of a detour on kernels. Today, we'll get back to the logistic model.

Formal Approach to the Problem

We’ve seen before that the classical estimation technique used to estimate the parameters of a parametric model was to use the maximum likelihood approach. More specifically:

β̂ = argmax { log L(β) } = argmax { Σ log f(y_i | x_i; β) }

The objective function here focuses (only) on the goodness of fit. But, usually, in econometrics, we believe something like non sunt multiplicanda entia sine necessitate ("entities are not to be multiplied without necessity"), the parsimony principle: simpler theories are preferable to more complex ones. So we want to penalize models that are too complex.

This is not a bad idea. It is mentioned here and there in econometrics textbooks, but usually for model choice, not inference. Usually, we estimate parameters using maximum likelihood techniques, and then we use AIC or BIC to compare two models. Recall that the Akaike (AIC) criterion is based on

AIC = -2 log L(β̂) + 2k

We have on the left a measure for the goodness of fit, and on the right, a penalty increasing with the “complexity” of the model.

Very quickly, here, the complexity is the number of variates used. I will not go into details about the concept of sparsity (and the true dimension of the problem); I recommend reading the book by Martin Wainwright, Robert Tibshirani, and Trevor Hastie on that issue. But assume that we do not make a variable selection and that we consider the regression on all covariates. Define

‖β‖ℓ0 = Σ 1(β_j ≠ 0)

for any β ∈ R^p. One might say that the AIC could be written

AIC = -2 log L(β) + 2 ‖β‖ℓ0

And actually, this will be our objective function. More specifically, we will consider the penalized objective -log L(β) + λ‖β‖ for some norm ‖·‖. I will not get back here on the motivation and the (theoretical) properties of those estimates (that will actually be discussed at the Summer School in Barcelona, in July), but in this post, I want to discuss the numerical algorithms used to solve such optimization problems, for ‖·‖ℓ2 (the Ridge regression) and for ‖·‖ℓ1 (the LASSO regression).

Normalization of the Covariates

The problem with ‖β‖ is that the norm should make sense, somehow: a small β_j has to be interpreted relative to the scale (the "dimension") of X_j. So, the first step will be to consider linear transformations of all covariates X_j to get centered and scaled variables (with unit variance):

y = myocarde$PRONO
X = myocarde[,1:7]
for(j in 1:7) X[,j] = (X[,j]-mean(X[,j]))/sd(X[,j])
X = as.matrix(X)

Ridge Regression (From Scratch)

Before running some code, recall that we want to solve something like:

β̂_λ = argmax { log L(β) - λ ‖β‖²ℓ2 }

In the case where we consider the log-likelihood of some Gaussian variable, we get the sum of the squared residuals, and we can obtain an explicit solution. But not in the context of a logistic regression.

The heuristics behind Ridge regression can be seen in the following graph. In the background, we can visualize the (two-dimensional) log-likelihood of the logistic regression, and the blue circle is the constraint we have if we rewrite the optimization problem as a constrained optimization problem:

max { log L(β) } subject to ‖β‖ℓ2 ≤ s

which can be written equivalently (it is a strictly convex problem) as

max { log L(β) - λ ‖β‖²ℓ2 }

Thus, the constrained maximum should lie in the blue disk.

LogLik = function(bbeta){
  b0 = bbeta[1]
  beta = bbeta[-1]
  sum(-y*log(1 + exp(-(b0+X%*%beta))) - (1-y)*log(1 + exp(b0+X%*%beta)))}
u = seq(-4,4,length=251)
v = outer(u,u,function(x,y) LogLik(c(1,x,y)))
image(u,u,v,col=rev(heat.colors(25)))
contour(u,u,v,add=TRUE)
u = seq(-1,1,length=251)
lines(u,sqrt(1-u^2),type="l",lwd=2,col="blue")
lines(u,-sqrt(1-u^2),type="l",lwd=2,col="blue")

Let us consider the objective function, with the following code:

PennegLogLik = function(bbeta,lambda=0){
  b0 = bbeta[1]
  beta = bbeta[-1]
  -sum(-y*log(1 + exp(-(b0+X%*%beta))) - (1-y)*log(1 + exp(b0+X%*%beta))) + lambda*sum(beta^2)
}

Why not try a standard optimization routine? In the very first post of this series, we mentioned that using optimization routines was not clever, since they rely strongly on the starting point. But here, that is not the case.

lambda = 1
beta_init = lm(PRONO~.,data=myocarde)$coefficients
vpar = matrix(NA,1000,8)
for(i in 1:1000){
vpar[i,] = optim(par = beta_init*rnorm(8,1,2), function(x) PennegLogLik(x,lambda), method = "BFGS", control = list(abstol=1e-9))$par}
par(mfrow=c(1,2))
plot(density(vpar[,2]),ylab="",xlab=names(myocarde)[1])
plot(density(vpar[,3]),ylab="",xlab=names(myocarde)[2])


Clearly, even if we change the starting point, it looks like we converge towards the same value. That could be considered as the optimum.

The code to compute the ridge estimator β̂_λ would then be:

opt_ridge = function(lambda){
beta_init = lm(PRONO~.,data=myocarde)$coefficients
logistic_opt = optim(par = beta_init*0, function(x) PennegLogLik(x,lambda), method = "BFGS", control=list(abstol=1e-9))
logistic_opt$par[-1]}

and we can visualize the evolution of β̂_λ as a function of λ:

v_lambda = c(exp(seq(-2,5,length=61)))
est_ridge = Vectorize(opt_ridge)(v_lambda)
library("RColorBrewer")
colrs = brewer.pal(7,"Set1")
plot(v_lambda,est_ridge[1,],col=colrs[1])
for(i in 2:7) lines(v_lambda,est_ridge[i,],col=colrs[i])

At least it seems to make sense: we can observe the shrinkage as λ increases (we'll get back to that later on).

Ridge, Using the Newton-Raphson Algorithm

We’ve seen that we can also use Newton Raphson to solve this problem. Without the penalty term, the algorithm was

Image title

where

Image title

and

Image title

where ∆old is the diagonal matrix with terms Pold(1-Pold) on the diagonal.

Thus

Image title

that we can also write

Image title

where z=Xßold+(X^ToldX)^-1X^T[y-pold]. Here, on the penalized problem, we can easily prove that

Image title

while

Image title

Hence

Image title

The code is then:

Y = myocarde$PRONO
X = myocarde[,1:7]
for(j in 1:7) X[,j] = (X[,j]-mean(X[,j]))/sd(X[,j])
X = as.matrix(X)
X = cbind(1,X)
colnames(X) = c("Inter",names(myocarde[,1:7]))
beta = as.matrix(lm(Y~0+X)$coefficients,ncol=1)
for(s in 1:9){
  pi = exp(X%*%beta[,s])/(1+exp(X%*%beta[,s]))
  Delta = matrix(0,nrow(X),nrow(X)); diag(Delta) = (pi*(1-pi))
  z = X%*%beta[,s] + solve(Delta)%*%(Y-pi)
  B = solve(t(X)%*%Delta%*%X+2*lambda*diag(ncol(X))) %*% (t(X)%*%Delta%*%z)
  beta = cbind(beta,B)}
beta[,8:10]
             [,1]        [,2]        [,3]
XInter 0.59619654 0.59619654 0.59619654
XFRCAR 0.09217848 0.09217848 0.09217848
XINCAR 0.77165707 0.77165707 0.77165707
XINSYS 0.69678521 0.69678521 0.69678521
XPRDIA -0.29575642 -0.29575642 -0.29575642
XPAPUL -0.23921101 -0.23921101 -0.23921101
XPVENT -0.33120792 -0.33120792 -0.33120792
XREPUL -0.84308972 -0.84308972 -0.84308972

Again, it seems that convergence is very fast.

And interestingly, with that algorithm, we can also derive the variance of the estimator

Var[β̂] = (X^T Δ X + 2λI)^{-1} X^T Δ Var[z] Δ X (X^T Δ X + 2λI)^{-1}

where

Var[z] = Δ^{-1}

The code to compute β̂_λ and its standard errors, as a function of λ, is then:

newton_ridge = function(lambda=1){
  beta = as.matrix(lm(Y~0+X)$coefficients,ncol=1)*runif(8)
  for(s in 1:20){
    pi = exp(X%*%beta[,s])/(1+exp(X%*%beta[,s]))
    Delta = matrix(0,nrow(X),nrow(X)); diag(Delta) = (pi*(1-pi))
    z = X%*%beta[,s] + solve(Delta)%*%(Y-pi)
    B = solve(t(X)%*%Delta%*%X+2*lambda*diag(ncol(X))) %*% (t(X)%*%Delta%*%z)
    beta = cbind(beta,B)}
Varz = solve(Delta)
Varb = solve(t(X)%*%Delta%*%X+2*lambda*diag(ncol(X))) %*% t(X)%*% Delta %*% Varz %*% Delta %*% X %*% solve(t(X)%*%Delta%*%X+2*lambda*diag(ncol(X)))
return(list(beta=beta[,ncol(beta)],sd=sqrt(diag(Varb))))}

We can visualize the evolution of β̂_λ (as a function of λ):

v_lambda=c(exp(seq(-2,5,length=61)))
est_ridge=Vectorize(function(x) newton_ridge(x)$beta)(v_lambda)
library("RColorBrewer")
colrs=brewer.pal(7,"Set1")
plot(v_lambda,est_ridge[1,],col=colrs[1],type="l")
for(i in 2:7) lines(v_lambda,est_ridge[i,],col=colrs[i])


and to get the evolution of the variance

v_lambda=c(exp(seq(-2,5,length=61)))
est_ridge=Vectorize(function(x) newton_ridge(x)$sd)(v_lambda)
library("RColorBrewer")
colrs=brewer.pal(7,"Set1")
plot(v_lambda,est_ridge[1,],col=colrs[1],type="l")
for(i in 2:7) lines(v_lambda,est_ridge[i,],col=colrs[i],lwd=2)


Recall that when λ = 0 (on the left of the graphs), β̂_λ is the standard (unpenalized) maximum likelihood estimator. Thus, as λ increases, (i) the bias increases (the estimates tend towards 0) and (ii) the variances decrease.

Ridge, Using glmnet

As always, there are R functions available to run a ridge regression. Let's use the glmnet function, with alpha=0:

y = myocarde$PRONO
X = myocarde[,1:7]
for(j in 1:7) X[,j] = (X[,j]-mean(X[,j]))/sd(X[,j])
X = as.matrix(X)
library(glmnet)
glm_ridge = glmnet(X, y, alpha=0)
plot(glm_ridge,xvar="lambda",col=colrs,lwd=2)

as a function of the norm

The l1 norm here, I don’t know why. I don’t know either why all graphs obtained with different optimization routines are so different… Maybe that will be for another post…

Ridge With Orthogonal Covariates

An interesting case is obtained when covariates are orthogonal. This can be obtained using a PCA of the covariates.

library(factoextra)
pca = princomp(X)
pca_X = get_pca_ind(pca)$coord

Let’s run a ridge regression on those (orthogonal) covariates

library(glmnet)
glm_ridge = glmnet(pca_X, y, alpha=0)
plot(glm_ridge,xvar="lambda",col=colrs,lwd=2)

plot(glm_ridge,col=colrs,lwd=2)

We clearly observe the shrinkage of the parameters: every component of β̂_λ moves towards 0 as λ increases.

Application

Let us try with our second set of data

df0 = df
df0$y=as.numeric(df$y)-1
plot_lambda = function(lambda){
m = apply(df0,2,mean)
s = apply(df0,2,sd)
for(j in 1:2) df0[,j] = (df0[,j]-m[j])/s[j]
reg = glmnet(cbind(df0$x1,df0$x2), df0$y==1, alpha=0,lambda=lambda)
u = seq(0,1,length=101)
p = function(x,y){
  xt = (x-m[1])/s[1]
  yt = (y-m[2])/s[2]
  predict(reg,newx=cbind(x1=xt,x2=yt),type='response')}
v = outer(u,u,p)
image(u,u,v,col=clr10,breaks=(0:10)/10)
points(df$x1,df$x2,pch=c(1,19)[1+z],cex=1.5)
contour(u,u,v,levels = .5,add=TRUE)
}

We can try various values of Lambda:

reg = glmnet(cbind(df0$x1,df0$x2), df0$y==1, alpha=0)
par(mfrow=c(1,2))
plot(reg,xvar="lambda",col=c("blue","red"),lwd=2)
abline(v=log(.2))
plot_lambda(.2)

or

reg = glmnet(cbind(df0$x1,df0$x2), df0$y==1, alpha=0)
par(mfrow=c(1,2))
plot(reg,xvar="lambda",col=c("blue","red"),lwd=2)
abline(v=log(1.2))
plot_lambda(1.2)


The next step is to change the norm of the penalty, with the l1 norm (to be continued…).

Original Link

Understanding WSO2 Stream Processor, Part 1

Streaming analytics has been one of the trending topics in the software industry for some time. With the production of billions of events through various sources, analyzing these events provides a competitive advantage for any business. The process of streaming analytics can be divided into three main steps.

  1. Collect — Collecting events from various sources.
  2. Analyze — Analyzing the events and deriving meaningful insights.
  3. Act — Taking action on the results.

WSO2 Stream Processor (WSO2 SP) is an intuitive approach to stream processing. It provides the necessary capabilities to process events and derive meaningful insights with its state of the art “Siddhi” stream processing runtime. The below figure showcases how WSO2 SP acts as a stream processing engine for various events.

Source: https://docs.wso2.com/display/SP410

With the WSO2 SP, events generated from various sources like devices, sensors, applications, and services can be received. The received events are processed in real time using the streaming SQL language “Siddhi.” Once the results are derived, those results can be published through APIs, alerts, or visualizations so that business users can act on them accordingly.

Users of WSO2 SP need to understand a set of basic concepts around the product. Let’s identify the main components which a user needs to interact with.

WSO2 Stream processor comes with built-in components to configure, run, and monitor the product. Here are the main components.

  • WSO2 SP runtime (worker) — Executes the real-time processing logic which is implemented using Siddhi streaming SQL.
  • Editor — Allows users (developers) to implement their logic using Siddhi streaming SQL and debug, deploy, and run their implementations similar to an IDE.
  • Business Rules — Allows business users to change the processing logic by simply modifying a few values stored in a simple form.
  • Job Manager — Allows the deployment and management of Siddhi applications across multiple worker nodes.
  • Portal — Provides the ability to visualize the results generated from the processing logic that was implemented.
  • Status Dashboard — Monitors multiple worker nodes in a cluster and showcases information about those nodes and the Siddhi applications deployed on them.

In addition to the above components, the diagram includes:

  • Source — Devices, apps, and services which generate events.
  • Sink — Results of the processing logic are passed into various sinks like APIs, dashboards, and notifications.

With these components, users can implement a plethora of use cases around streaming analytics and/or stream processing. The next thing you need to understand about WSO2 SP is the “Siddhi” streaming SQL language and its high-level concepts. Let’s take a look at those concepts as well.

Figure: Siddhi high-level concepts in a nutshell

The above figure depicts the concepts which need to be understood by WSO2 SP users. Except for the source and sink which we have looked through in the previous section, all the other concepts are new. Let’s have a look at these concepts one by one.

  • Event — Actual data coming from sources which are formatted according to the schema.
  • Schema — Define the format of the data which is coming with events.
  • Stream — A running (continuous) set of incoming events is considered a stream.
  • Window — Is a set of events which are selected based on the number of events (length) or a time period (duration).
  • Partition — Is a set of events which are selected based on a specific condition of data (e.g. events with the same “name” field).
  • Table — Is a static set of events which are selected based on a defined schema and can be stored in a data store.
  • Query — Is the processing logic which uses streams, tables, windows, partitions to derive meaningful data out of the incoming data events.
  • Store — Is a table stored in a persistent database for later consumption through queries for further processing or to take actions (visualizations).
  • Aggregation — Is a function (pre-defined) applied to events and produces outputs for further processing or as final results.
  • Triggers — Are used to inject events according to a given schema so that processing logic executes periodically through these events.

Now that we have a basic understanding of WSO2 SP and its main concepts, let's try to do a real streaming analysis using the product. Before doing that, we need to understand the main building block of the WSO2 SP runtime, the "Siddhi Application." It is where users configure the WSO2 SP runtime to implement their processing logic.

Figure: Siddhi application components

Within a Siddhi application, we have three main sections.

  • Source definition — This is the place to define incoming event sources and their schemas. Users can configure different transport protocols, messaging formats, etc.
  • Sink definition — This section defines the place to emit the results of the processing. Users can choose to store the events in tables, output to log files, etc.
  • Processing Logic — This section implements the actual business logic for data processing using the Siddhi streaming SQL language.

Now that you have a basic understanding of WSO2 SP and its main concepts, the next thing you can do is get your hands dirty by trying out a few examples. The tutorials section of the documentation is a good place to start.

Original Link

The Future Isn’t in Databases, but in the Data

In the past year, you may have heard me mention my certificates from the Microsoft Professional Program. One certificate was in Data Science, the other in Big Data. I’m currently working on a third certificate, this one in Artificial Intelligence.

You might be wondering why a database guy would be spending so much time on data science, analytics, and AI. Well, I’ll tell you.

The future isn’t in databases, but in the data.

Let me explain why.

Databases Are Cheap and Plentiful

Take a look at the latest DB-Engines rankings. You will find 343 distinct database systems listed, 138 of which are relational databases. I'm not sure it is a complete list, either, but it should help make my point: you have no idea which one of 343 database systems is the right one. It could be none of them. It could be all of them.

Sure, you can narrow the list of options by looking at categories. You may know you want a relational, a key-value pair, or even a graph database. Each category will have multiple options, and it will be up to you to decide which one is the right one.

Decisions are made to go with whatever is easiest. And “easiest” doesn’t always mean “best.” It just means you’ve made a decision allowing the project to move forward.

Here’s the fact I want you to understand: Data doesn’t care where or how it is stored. Neither do the people curating the data. Nobody ever stops and says, “Wait, I can’t use that, it’s stored in JSON.” If they want (or need) the data, they will take it, no matter what format it is stored in to start.

And the people curating the data don’t care about endless debates on MAXDOP and NUMA and page splits. They just want their processing to work.

And then there is this #hardtruth — It’s often easier to throw hardware at a problem than to talk to the DBA.

Technology Trends Over the Past Ten Years

Here’s a handful of technology trends over the past ten years. These trends are the main technology drivers for the rise of data analytics during this timeframe.

Business Intelligence Software

The ability to analyze and report on data has become easier with each passing year. The Undisputed King of all business analytics, Excel, is still going strong. Tableau shows no signs of slowing down. PowerBI has burst onto the scene in just the past few years. Data analytics is embedded into just about everything. You can even run R and Python through SQL Server.

Real-Time Analytics

Software such as Hadoop, Spark, and Kafka allow for real-time analytic processing. This has allowed companies to gather quality insights into data at a faster rate than ever before. What used to take weeks or months can now be done in minutes.

Data-Driven Decisions

Companies can use real-time analytics and enhanced BI reporting to build a data-driven culture. We can move away from, “Hey, I think I’m right, and I found data to prove me right” to a world of, “Hey, the data says we should make a change, so let’s make the change and not worry about who was right or wrong.” In other words, we can remove the human factor from decision making, and let the data help guide our decisions instead.

Cloud Computing

It’s easy to leverage cloud providers such as Microsoft Azure and Amazon Web Services to allocate hardware resources for our data analytic needs. Data warehousing can be achieved on a global scale with low latency and massive computing power. What once cost millions of dollars to implement can be done for a few hundred dollars and some PowerShell scripts.

Technology Trends Over the Next Ten Years

Now, let’s look at a handful of current trends. These trends will affect the data industry for the next ten years.

Predictive Analytics 

Artificial intelligence (AI), machine learning (ML), and deep learning (DL) are just starting to become mainstream. AWS is releasing DeepLens this year. Azure Machine Learning makes it easy to deploy predictive web services. Azure Machine Learning Workbench lets you build your own facial recognition program in just a few clicks. It’s never been easier to develop and deploy predictive analytic solutions.

DBA as a Service 

Every company making database software (Microsoft, AWS, Google, Oracle, etc.) is actively building automation for common DBA tasks. Performance tuning and monitoring, disaster recovery, high availability, low latency, auto-scaling based upon historical workloads, the lists go on. The current DBA role, where lonely people work in a basement rebuilding indexes, is ending one page at a time.

Serverless Functions 

Serverless functions are also hip these days. Services such as IFTTT make it easy for a user to configure an automated response to whatever trigger they define. Azure Functions and AWS Lambda are where the hipster programmers hang out, building automated processes to help administrators do more with less.

More Chatbots 

We are starting to see a rise in the number of chatbots available. It won’t be long before you are having a conversation with a chatbot playing the role of a DBA. The only way you’ll know it is a chatbot and not a DBA is because it will be a pleasant conversation for a change. Chatbots are going to put a conversation on top of the automation of the systems underneath. As new people enter the workforce, interaction with chatbots will be seen as the norm.

Summary

There is a dearth of people able to analyze data today.

Data analytics is the biggest growth opportunity I see for the next ten years. The industry needs people to help collect, curate, and analyze data.

We also need people to build data visualizations. Something more than an unreadable pie chart, but I will save that rant for a different post.

We are always going to need an administrator to help keep the lights on, but as time goes on, we will need fewer administrators. This is why I’m advocating a shift for data professionals to start learning more about data analytics.

Well, I’m not just advocating it, I’m doing it.

Original Link

Building Low-Overhead Metrics Collection for High-Performance Systems

Metrics play an integral part in providing confidence in a high-performance software system. Whether you’re dealing with a data processing framework or a web server, metrics provide insight into whether your system is performing as expected. A direct impact of capturing metrics is the performance cost to the system being measured.

When we first started development on Wallaroo, our high-throughput, low-latency, and elastic data processing framework, we were aiming to meet growing demand in the data processing landscape. Companies are relying on more and more data all while expecting to process that data as quickly as possible for, among other things, faster, better decision making, and cost reduction. Because of this, we set out to develop Wallaroo with three core principles in mind:

  1. High-Throughput.
  2. Low-Latency.
  3. Lower the infrastructure costs/needs compared to existing solutions.

Proving to our users that we were delivering on these principles was an essential part of the early development cycle of Wallaroo. How could we prove we were delivering on these principles? We needed to capture metrics. Metrics are an excellent way to spot bottlenecks in any system. Providing meaningful metrics to our users would mean they could make quicker iterations in their development cycle by being able to spot specific bottlenecks in various locations of our system. An added benefit that came from capturing these metrics was that it also sped up the Wallaroo development cycle as we were able to use them as guidelines to measure the “cost” of every feature.

In the end, what came of our metrics capturing process became a web UI that provides Wallaroo users introspection for various parts of the applications they develop. Here’s what the result looks like:

In this post, I’ll cover some of the design decisions we made regarding our metrics capturing system to maintain the high performance our system was promising.

Metrics Aren’t Free

Adding introspection via metrics to your system inherently adds an overhead. However, the impact of that overhead is determined by how you capture metrics and what information you choose to capture. Not only is there a ton of information to capture, but there are also many different ways to capture it. The biggest problem you face while determining how you capture metrics is maintaining a low overhead. A computationally heavy metrics capturing system could mean lower throughput, higher latencies, and costlier infrastructure, all things you want to avoid. When approaching this type of problem, you need to determine how much information you need to capture to provide meaningful details via your metrics while minimizing the performance impact on your system.

Information We Want to Convey With Our Metrics

A Wallaroo application broken down to its simplest form is composed of the following components:

  • Computations: Code that transforms an input of some type to an output of some type. In Wallaroo, there are stateless computations and what we call “state computations” that operate on an in-memory state.
  • Pipelines: A sequence of computations and/or state computations originating from a source and optionally terminating in a sink.

These components can be on one or more Wallaroo workers (process). By capturing metrics for these components, we give Wallaroo users a granular look into various parts of the applications they develop.

The information we want to convey to our users with the metrics we capture is the following:

  • The throughput of a specific component in Wallaroo for a given period.
  • The percentile of latencies that fell under a specific time in Wallaroo for a given period.

Our metrics capturing needs to be flexible enough to give us accurate statistics over different periods of time. Knowing what information the metrics of your system needs to convey will end up playing a major factor in how you determine to capture metrics.

Using Histograms for Our Metrics Capturing

There are many ways to capture metrics, each with its own positives and negatives. Out of the options we looked at, histograms were the most appealing because they could provide the statistics we wanted to convey while also maintaining a low overhead. For a low-level look into this decision, Nisan wrote a dedicated blog post, “Latency Histograms and Percentile Distributions In Wallaroo Performance Metrics.” I’ll cover this design choice on a higher level.

To best describe how we decided to use histograms as the metrics capturing data structure in Wallaroo, I will focus on one of Wallaroo’s components: state computations.

These were the metric statistics we wanted to provide: the throughput and latency percentiles described above, over several reporting windows.

As mentioned above, Wallaroo allows for both stateless computations and computations that operate on state. To maintain high-throughput and low-latency, we leverage some of the design principles behind Pony, an object-oriented, actor-model, capabilities-secure, high-performance programming language. To maintain high performance at scale, concurrency and parallelism are needed for our state computations to avoid acting as a bottleneck in our pipelines. Sean talked a bit about this design principle in detail in the Avoid Coordination section of our “What’s the Secret Sauce?” blog post.

Here’s an example of state partitioning in Wallaroo:

For a Word Count application, we partition our word count state into distinct state entities based on the letters of the alphabet. We ultimately set up 27 state entities in total, one for each letter plus one called “!” which will handle any “word” that doesn’t start with a letter. By partitioning, we remove the potential bottleneck of a single data structure to maintain the state of the count of all of our words.
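As a rough illustration of that routing idea, here is a minimal Python sketch (plain Python, not Wallaroo's actual API): each word is sent to one of 27 independent state entities keyed by its first letter, with "!" as the catch-all.

```python
import string
from collections import Counter

# 27 partition keys: one per letter, plus "!" for words that don't
# start with a letter (a sketch of the idea, not Wallaroo's API).
PARTITION_KEYS = list(string.ascii_lowercase) + ["!"]

# One independent state entity (here just a Counter) per partition key.
state_entities = {key: Counter() for key in PARTITION_KEYS}

def partition_key(word: str) -> str:
    """Pick the state entity responsible for this word."""
    first = word[:1].lower()
    return first if first in state_entities else "!"

def count_word(word: str) -> None:
    """Update only the partition that owns this word."""
    state_entities[partition_key(word)][word] += 1

for w in ["hello", "hello", "world", "42nd"]:
    count_word(w)

print(state_entities["h"]["hello"])  # 2
print(state_entities["!"]["42nd"])   # 1
```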

Each state entity is managed by a Step, an actor that is responsible for running the state computation and routing output messages downstream, amongst other things. Using Steps allows us to avoid coordination when updating state, but we also need to avoid coordination in our metrics capturing system. To get a full picture of how a state computation is performing, the metrics provided by each Step need to be stitched together. Here's a diagram to illustrate the type of aggregation we require across Steps:

A sorted list of values would give us the granularity we need to do the required aggregations, but it would have a much higher performance impact than we want. Another option would be to store specific summary statistics (such as the mean or median) per Step, but the need for aggregation would render these statistics useless if we want an accurate depiction of the computation that multiple Steps represent. Theo Schlossnagle wrote a great blog post explaining why bad math, like an average of averages, produces useless metrics: “The Problem with Math: Why Your Monitoring Solution is Wrong.” Existing industry research shows that histograms are excellent for gathering meaningful metrics: “How NOT to Measure Latency” by Gil Tene and “The Uphill Battle for Visibility” by Theo Schlossnagle are two great examples. We recognized that histograms would work well for us for the following reasons:

  • Low cost in both time and space compared to other metrics capturing techniques.
  • Histograms can be easily aggregated, which is necessary when metrics are not all stored in a single data structure.
  • Histograms let us answer questions like: is the 99th percentile latency below this time value?
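Here is a simplified Python sketch of that idea (not Wallaroo's implementation, whose actual bin layout is covered in Nisan's post): a fixed-bin latency histogram that is cheap to update, trivially mergeable across Steps, and able to answer questions like "what share of events completed under this bound?"

```python
class LatencyHistogram:
    """Fixed-bin latency histogram: cheap to update, mergeable, and able
    to answer 'what share of events fell at or under a given bound?'
    (A simplified sketch, not Wallaroo's actual implementation.)"""

    # Bin upper bounds in nanoseconds: ~1us, 2us, 4us, ... ~1s, plus overflow.
    BOUNDS = [2 ** i for i in range(10, 31)]

    def __init__(self):
        self.bins = [0] * (len(self.BOUNDS) + 1)

    def record(self, latency_ns: int) -> None:
        for i, bound in enumerate(self.BOUNDS):
            if latency_ns <= bound:
                self.bins[i] += 1
                return
        self.bins[-1] += 1  # overflow bin

    def merge(self, other: "LatencyHistogram") -> None:
        """Aggregate another Step's histogram into this one."""
        self.bins = [a + b for a, b in zip(self.bins, other.bins)]

    def fraction_under(self, bound_ns: int) -> float:
        """Share of recorded latencies at or below bound_ns."""
        total = sum(self.bins)
        if total == 0:
            return 0.0
        under = sum(count for count, b in zip(self.bins, self.BOUNDS)
                    if b <= bound_ns)
        return under / total

# Two Steps record independently; an external system merges their histograms.
step_a, step_b = LatencyHistogram(), LatencyHistogram()
step_a.record(1_500)        # ~1.5us
step_b.record(3_000_000)    # ~3ms
step_a.merge(step_b)
print(step_a.fraction_under(2 ** 21))  # fraction at or below ~2ms -> 0.5
```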

Minimizing the Metrics Capturing Overhead

We know what metrics information we need to capture, and we know how we want to capture it. The next step is to determine how much of this workload has to be handled by Wallaroo itself. We need to aggregate histograms to get a complete picture of components, but is it worth the resource cost to do this processing online in Wallaroo, or in any other high-performance system? The answer is generally no. An external system can take these histograms and perform the needed aggregations without impinging on the resources required by Wallaroo, freeing up the CPU and memory we would otherwise spend if Wallaroo were also responsible for aggregating its own metrics. Since metrics would have to be offloaded from Wallaroo in some fashion eventually, we decided to do this as early as possible.

Push vs. Poll

When deciding how to offload metrics information from Wallaroo to an external system, we had two options: push or poll. We ended up making the metrics system push-based for several reasons. Using Steps as an example, if we were polling each Step for its metrics, we would waste CPU resources whenever a Step had not completed any work. If we have 300 Steps and only 50 are handling the workload, polling and sending empty stats for the remaining 250 could turn our outgoing connection to the external metrics processing system into a bottleneck. By pushing only from Steps that are completing work, we save CPU resources and minimize the chance of the metrics receiver becoming a bottleneck from an overload of incoming data.
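A sketch of that push-only-when-busy idea, with hypothetical names and a flush interval chosen purely for illustration (a real Step would also flush on a timer tick rather than only when new work arrives):

```python
import time

class MetricsReporter:
    """Push-based reporting sketch (illustrative only, not Wallaroo code):
    a Step ships its local histogram to the external metrics receiver only
    when it has actually recorded work since the last flush, so idle Steps
    stay silent and never waste CPU or network on empty reports."""

    def __init__(self, step_name, histogram, send, interval_s=1.0):
        self.step_name = step_name
        self.histogram = histogram   # any object with record() and .bins
        self.send = send             # callable that ships data downstream
        self.interval_s = interval_s
        self.dirty = False
        self.last_flush = time.monotonic()

    def record(self, latency_ns):
        self.histogram.record(latency_ns)
        self.dirty = True
        self.maybe_flush()

    def maybe_flush(self):
        now = time.monotonic()
        if self.dirty and now - self.last_flush >= self.interval_s:
            self.send(self.step_name, list(self.histogram.bins))
            self.dirty = False
            self.last_flush = now

# Wiring it to the LatencyHistogram sketch above might look like:
# reporter = MetricsReporter("state-step-h", LatencyHistogram(),
#                            send=lambda name, bins: print(name, sum(bins)))
# reporter.record(1_500)
```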

Wallaroo’s Metrics in Action

In the end, it was a combination of several design choices that allowed us to capture the metrics we wanted without greatly impacting the performance of Wallaroo. Maintaining our high-throughput and low-latency principles while capturing metrics was not the most straightforward task, but we’re happy with what we’ve been able to achieve so far.

As we saw above, we ultimately ended up developing a Metrics UI to assist developers using Wallaroo. Feel free to spin up a Wallaroo application and our Metrics UI to get a feel for how we handle metrics for all of the components that compose a Wallaroo application.

In a future post, I’ll dive into how we use the metrics we receive from Wallaroo to come up with the information displayed above.

Give Wallaroo a Try

We hope that this post has piqued your interest in Wallaroo!

If you are just getting started, we recommend you try our Docker image, which allows you to get Wallaroo up and running in only a few minutes.

Thank you! We always appreciate your candid feedback (and a GitHub star)!

Original Link

Visualizations on Apache Kafka Made Easy with KSQL

Shant Hovsepian is the CTO and co-founder of Arcadia Data and is going to tell us about how to get started with processing streaming data with Confluent KSQL and visualizing it using the Arcadia Data platform.

The first Kafka Summit in London was just last month, and a popular topic at the show was KSQL. Released in production-ready form with Confluent Platform 4.1, KSQL gives Kafka users a streaming SQL engine, so they can use a SQL-like language to process and query data in Kafka. It has been getting a lot of traction since, and it is a huge step in simplifying many types of stream processing that can be run on Kafka. With the wealth of existing expertise in SQL, developers can now use those skills with KSQL to more quickly build applications for data filtering, transformation, enrichment, manipulation, and analysis of Kafka data.
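As a rough illustration of what this looks like from code, here is a hedged Python sketch that submits KSQL statements over the KSQL server's REST API. The port, endpoint path, content type, topic, and column names are assumptions for illustration only; check the KSQL documentation for your Confluent Platform version.

```python
import json
import requests  # third-party; pip install requests

# Assumed: a KSQL server on its default port 8088 and a Kafka topic named
# "pageviews" with JSON records -- both are illustrative assumptions.
KSQL_URL = "http://localhost:8088"

def run_ksql(statement):
    """Submit a KSQL statement to the server's /ksql REST endpoint.
    Verify the endpoint and content type against the KSQL REST API docs
    for your version before relying on this."""
    resp = requests.post(
        f"{KSQL_URL}/ksql",
        headers={"Content-Type": "application/json"},
        data=json.dumps({"ksql": statement, "streamsProperties": {}}),
    )
    resp.raise_for_status()
    return resp.json()

# Register a stream over the topic, then derive a filtered stream from it.
run_ksql("""
    CREATE STREAM pageviews (viewtime BIGINT, userid VARCHAR, pageid VARCHAR)
    WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');
""")
run_ksql("""
    CREATE STREAM pageviews_home AS
    SELECT * FROM pageviews WHERE pageid = 'home';
""")
```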

KSQL is a game-changer not only for application developers but also for non-technical business users. How? The SQL interface opens up access to Kafka data to analytics platforms based on SQL. Business analysts who are accustomed to non-coding, drag-and-drop interfaces can now apply their analytical skills to Kafka. So instead of continually building new analytics outputs due to evolving business requirements, IT teams can hand a comprehensive analytics interface directly to the business analysts. Analysts get a self-service environment where they can independently build dashboards and applications.

Arcadia Data is a Confluent partner that is leading the charge for integrating visual analytics and BI technology directly with KSQL. We've been working to combine our existing analytics stack with KSQL to provide a platform that requires no complicated new skills for your analysts to visualize streaming data. Just as they create semantic layers, build dashboards, and deploy analytical applications on batch data, they can now do the same on streaming data. Real-time analytics and visualizations for business users have largely been more promise than reality until now. For example, some architectures enabled visualizations for end users by staging Kafka data into a separate data store, which added latency. KSQL removes that latency to let business users see the most recent data directly in Kafka and react immediately.

We’re hearing a lot of excitement about important business uses. For example, customers are looking to optimize operations by immediately analyzing the effects of their latest applications. With the feedback loop enabled by KSQL, they are able to adjust and optimize more quickly. Others are looking to detect operations errors such as discrepancies in transaction information in real time versus end-of-day problem resolution. In some cases, the goal isn’t necessarily real-time access, but simply reducing time-to-analysis from hours to minutes.

If you want to experience this for yourself, getting started is easy. You can download binaries that get you exploring visualizations on Kafka very quickly. The binaries consist of the free desktop analytics tool, Arcadia Instant, plus Docker images that have containerized versions of Kafka and KSQL. This means you can set up a test environment in minutes that runs entirely on your desktop. You don’t have to worry about setting up a cluster of nodes or about setting up cloud instances – you can just start experimenting. As long as you have about 8 GB RAM on your system, then you have plenty of power to run this setup. There are also a few walkthroughs that will guide you through the process of setting up a visualization environment.

Getting Started With Arcadia Instant in Combination With the Confluent Platform

To get started, please follow our Get Running with KSQL guide. It will tell you how to create a working setup of Arcadia Instant and the Confluent Platform components KSQL and Kafka (plus sample data).

From there, build an example dashboard by following our guide, A Day in the Life of a Business Analyst Using Streaming Analytics. You can also watch the video version of the guide. Even if you have no prior experience with Arcadia Data, the guide will quickly get you up to speed on building a real dashboard. More information is available on our streaming visualizations web page.

In summary, if you’re an application developer who works with Kafka, or hopes to work with it soon, KSQL is the right technology choice for many types of processing you’ll need to do. And if you want to expose your analytics capabilities to a non-technical audience, you no longer have to build your own visualization outputs in something like Excel or spend a lot of time bolting on a tool designed for other environments. The best choice is to go with a proven technology like Arcadia Data running with KSQL to provide that “last mile” of streaming data capabilities to your end users. So take a look at what Confluent and Arcadia Data have done, and stay tuned for more innovations in the coming months!

If you’re interested in what KSQL can do, check out:

Original Link

Apache Storm vs WSO2 Stream Processor, Part 2

Welcome back! If you missed Part 1, you can check it out here.

5. When to Use What From a Use Case Perspective

Let’s look at 13 streaming analytics patterns, one by one, and evaluate to what extent they are applicable to Apache Storm and WSO2 Stream Processor.

Pattern 1: Preprocessing

Preprocessing is often done as a projection from one data stream to another or through filtering. Some of the potential operations include:

  • Filtering and removing some events.

  • Reshaping a stream by removing, renaming, or adding new attributes to a stream.

  • Splitting and combining attributes in a stream.

  • Transforming attributes.

When implementing this type of use case, the most important feature to consider is the programming model. Here, Storm has a competitive advantage over WSO2 SP because of its ability to use a lower-level language to specify the filtering task and the ease with which distributed processing can be specified.

Pattern 2: Alerts and Thresholds

This pattern detects a condition and generates an alert based on it (e.g., an alarm on high temperature). These alerts can be based on a simple value or on more complex conditions such as the rate of increase. To implement such alert generation scenarios, the types of operators in the stream processor query language play a major role, since we should be able to specify complex alert conditions with the query language. Here, WSO2 SP has a competitive advantage over Storm because of WSO2 SP’s use of the Siddhi complex event processing library, which can specify complex event pattern matching queries using its Streaming SQL capabilities. We call a language that enables users to write SQL-like queries over streaming data a “Streaming SQL” language.

Comparison: The powerful Siddhi query language enables users to specify complex stream processing queries quite easily with Streaming SQL. If someone only needs to write code for simple logic (such as straightforward filtering applications), Storm would be the recommended choice. However, writing Java code that handles time windows and temporal event sequence patterns is quite complicated [17], so in such scenarios WSO2 Stream Processor is a good choice compared to Storm. Furthermore, WSO2 SP’s built-in dashboard capabilities enable the creation of dashboards with alert messages.
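To make the pattern itself concrete, independent of either engine, a minimal Python sketch of a threshold-plus-rate-of-increase alert (not Siddhi or Storm code) might look like this:

```python
from collections import deque

def temperature_alerts(readings, max_temp=80.0, max_rise_per_min=5.0):
    """Yield alerts for a high absolute temperature or a too-fast rise.
    Engine-agnostic sketch of the pattern, not Siddhi/Storm code.
    `readings` is an iterable of (timestamp_seconds, temperature)."""
    window = deque()  # readings from the last 60 seconds
    for ts, temp in readings:
        if temp > max_temp:
            yield (ts, f"high temperature: {temp:.1f}")
        window.append((ts, temp))
        while window and ts - window[0][0] > 60:
            window.popleft()
        rise = temp - window[0][1]
        if rise > max_rise_per_min:
            yield (ts, f"temperature rose {rise:.1f} degrees in under a minute")

events = [(0, 70.0), (30, 73.0), (55, 79.0), (70, 85.0)]
for alert in temperature_alerts(events):
    print(alert)
```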

Pattern 3: Simple Counting and Counting With Windows

This pattern includes aggregate functions such as Min, Max, Percentiles, etc. Some can be computed without storing any data (e.g., counting the number of failed transactions), and counts are often used with a time window attached to them (e.g., the failure count over the last hour). Distributed processing capabilities are essential for certain types of stream processing applications: when the volume of data to handle reaches gigabytes per second, the stream processing system needs to scale to multiple nodes. Travel time prediction for individual vehicles in a very large, densely populated city [29] and processing of streaming image data received by large-field radio telescopes [30] are some examples of applications in this category. This kind of application may not include complicated event processing logic; rather, it has to deal with sheer volumes of data gathered from a large number of sensors.

Comparison: Apache Storm is considerably strong for such use cases since it can operate as a distributed stream processor. Although both Storm and WSO2 SP have windowing capabilities, windows in Storm have to be implemented from first principles, whereas in WSO2 SP windows are available through its Streaming SQL. Streaming SQL provides usability and portability advantages, as discussed in this talk, so overall the better option would be to use WSO2 SP.
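For reference, the underlying pattern is simple; a plain-Python sketch of counting failed transactions per one-hour tumbling window (engine-agnostic, not Storm or Siddhi code) could look like this:

```python
from collections import Counter

def failures_per_hour(events):
    """Count failed transactions per one-hour tumbling window.
    `events` is an iterable of (timestamp_seconds, status) pairs.
    A plain-Python sketch of the pattern, not Storm or Siddhi code."""
    counts = Counter()
    for ts, status in events:
        if status == "FAILED":
            window_start = ts - (ts % 3600)   # bucket by hour
            counts[window_start] += 1
    return counts

events = [(10, "OK"), (120, "FAILED"), (3590, "FAILED"), (3700, "FAILED")]
print(failures_per_hour(events))  # Counter({0: 2, 3600: 1})
```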

Pattern 4: Joining Event Streams

Combining data from two sensors, detecting the proximity of two vehicles, and combining data from a football player’s boot sensors and the ball’s sensors to track the ball’s movement among the players of a match are some examples of such use cases. Both Storm and WSO2 SP have equal capabilities for joining streams.

Pattern 5: Data Correlation, Missing Events, and Erroneous Data

In this use case, other than joining, we also need to correlate data within the same stream, because different sensors can send events at different rates, and many use cases require this fundamental operator. Some of the possible sub-use cases are:

  • Matching up two data streams that send events at different speeds.

  • Detecting a missing event in a data stream (e.g. detect a customer request that has not been responded to within one hour of its reception).

  • Detecting erroneous data.

Comparison: WSO2 SP has a competitive advantage over Storm for this use case because the Siddhi query language has out-of-the-box support for implementing all of the above, whereas with Storm custom code has to be written.
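A minimal, engine-agnostic Python sketch of the missing-event case (the event shapes and the one-hour timeout are assumptions for illustration):

```python
def unanswered_requests(events, timeout_s=3600):
    """Detect requests with no matching response within `timeout_s` seconds.
    `events` is a time-ordered iterable of (timestamp, kind, request_id)
    tuples where kind is "request" or "response". Illustrative sketch only;
    requests still pending when the stream ends are not flagged here."""
    pending = {}   # request_id -> timestamp of the unanswered request
    alerts = []
    for ts, kind, request_id in events:
        # Expire anything that has already waited too long.
        for rid, started in list(pending.items()):
            if ts - started > timeout_s:
                alerts.append((rid, started))
                del pending[rid]
        if kind == "request":
            pending[request_id] = ts
        elif kind == "response":
            pending.pop(request_id, None)
    return alerts

events = [(0, "request", "A"), (100, "request", "B"),
          (200, "response", "A"), (4000, "request", "C")]
print(unanswered_requests(events))  # [('B', 100)]
```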

Pattern 6: Interacting With Databases

The need for interacting with databases arises when we need to combine real-time data with historical data stored on disk. Some examples include:

  • When a transaction happens, looking up the customer’s age by customer ID from the customer database for use in fraud detection (enrichment).

  • Checking a transaction against blacklists and whitelists in the database.

  • Receiving input from the user.

Comparison: WSO2 SP has built-in extensions and a fine-tuned event persistence layer for interacting with databases. For example, the RDBMS event table extension enables an RDBMS such as MySQL, H2, or Oracle to be paired with the stream processing application. Similar extensions exist for accessing NoSQL databases such as HBase, Cassandra, and MongoDB. This enables out-of-the-box event stream persistence with WSO2 Stream Processor. However, with Storm, the developer has to write custom code to interact with databases [28].
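The enrichment itself is conceptually a stream-to-table lookup; here is a small Python sketch using the built-in sqlite3 module as a stand-in for a production RDBMS:

```python
import sqlite3

# Stand-in for the customer database (sqlite3 here purely for illustration).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (customer_id TEXT PRIMARY KEY, age INTEGER)")
db.executemany("INSERT INTO customers VALUES (?, ?)",
               [("c-1", 34), ("c-2", 67)])

def enrich(transactions):
    """Attach the customer's age to each incoming transaction event."""
    for txn in transactions:
        row = db.execute("SELECT age FROM customers WHERE customer_id = ?",
                         (txn["customer_id"],)).fetchone()
        yield {**txn, "age": row[0] if row else None}

stream = [{"customer_id": "c-1", "amount": 120.0},
          {"customer_id": "c-9", "amount": 15.0}]
for event in enrich(stream):
    print(event)
```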

Pattern 7: Detecting Temporal Event Sequence Patterns

Detecting a sequence of events arranged in time is a quite common streaming analytics use case. A simple example is the following credit card fraud detection scenario: a thief, having stolen a credit card, would try a small transaction to make sure it works before making a large one. Here, the small transaction followed by the large transaction is a temporal sequence of events arranged in time, and it can be detected using a regular expression written on top of an event sequence.

Comparison: WSO2 SP’s Siddhi query language has out-of-the-box support for detecting such scenarios through its temporal event pattern detection capabilities. With Storm, the user has to implement this feature from scratch.
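An engine-agnostic Python sketch of that small-then-large sequence (the dollar thresholds and the ten-minute window are illustrative assumptions):

```python
def small_then_large(transactions, small=5.0, large=500.0, within_s=600):
    """Flag cards where a small 'test' transaction is followed by a large
    one within `within_s` seconds. Thresholds are illustrative assumptions.
    `transactions` is a time-ordered iterable of (timestamp, card_id, amount)."""
    last_small = {}   # card_id -> timestamp of most recent small transaction
    for ts, card_id, amount in transactions:
        if amount >= large:
            started = last_small.get(card_id)
            if started is not None and ts - started <= within_s:
                yield (card_id, started, ts)
        elif amount <= small:
            last_small[card_id] = ts

txns = [(0, "card-1", 2.50), (300, "card-1", 899.00), (400, "card-2", 40.00)]
print(list(small_then_large(txns)))  # [('card-1', 0, 300)]
```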

Pattern 8: Tracking

Tracking corresponds to following something over space and time and detecting given conditions, for example, tracking wildlife to make sure the animals are alive and have not been sent to the wrong destinations. Both WSO2 SP and Storm can be equally applied to this use case.

Pattern 9: Trend Detection

Detecting patterns in time series data and bringing them to an operator’s attention is a common use case. Trends include rises, falls, turns, outliers, and complex trends like a triple bottom. WSO2 SP’s built-in event sequence pattern detection capabilities make trend detection noticeably easier to implement with WSO2 SP than with Storm.

Pattern 10: Running the Same Query in Batch and Real-Time Pipelines

In this scenario, we run the same query in both real-time and batch pipelines. This is also known as the Lambda Architecture; Nathan Marz, who created Apache Storm, also came up with the concept. The Lambda Architecture is an approach to building stream processing applications on top of MapReduce and Storm or similar systems [32][33]. Here, an immutable sequence of records is captured and fed into a batch system and a stream processing system in parallel. The same query is implemented twice, once in the batch system and once in the stream processing system, and the results from both are combined to produce a complete answer.

Both Storm and WSO2 SP have the capability to implement Lambda architecture.

Pattern 11: Detecting and Switching to Detailed Analysis

This scenario detects a condition that suggests an anomaly and then analyzes it further using historical data. For example, basic rules detect a suspected fraud, after which all transactions made against that credit card over a longer period are pulled from a batch pipeline and a detailed analysis is run.

Comparison: WSO2 SP has the capability of specifying complex event pattern matching queries using its Siddhi query language. Furthermore, WSO2 SP’s built-in capabilities for dealing with event stores allow it to query specific information. With Storm, these features have to be custom coded.

Pattern 12: Using an ML Model

Often we face the scenario of training a model (often a machine learning model) and then using it in the real-time pipeline to make decisions. For example, you can build a model using R and export it as PMML. Some example applications are fraud detection, segmentation, predicting the next value, and predicting churn.

Comparison: To implement such functionality, WSO2 SP’s streaming machine learning extension can be used to load a model in PMML format and make predictions. With Storm, the user has to implement the complete functionality from scratch.
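Since the exact PMML loading step depends on the runtime used, here is a deliberately generic Python sketch: a pre-trained model (any pickled object exposing a predict method, e.g. a scikit-learn classifier) scoring events as they arrive. The file name, feature names, and the alert helper are hypothetical.

```python
import pickle

def score_stream(model_path, events, feature_order):
    """Load a pre-trained model once, then score events as they arrive.
    Generic sketch: the model is any pickled object with .predict();
    swap in a PMML runtime of your choice for PMML models."""
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    for event in events:
        features = [[event[name] for name in feature_order]]
        yield event, model.predict(features)[0]

# Hypothetical wiring: the model file, feature names, and alert() helper
# below are assumptions for illustration only.
# for event, label in score_stream("fraud_model.pkl", incoming_events,
#                                  ["amount", "merchant_risk", "hour"]):
#     if label == 1:
#         alert(event)
```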

Pattern 13: Online Control

Use cases such as autopilots, self-driving vehicles, and robotics are situations where we need to control things online. These may involve problems like current-situation awareness, predicting the next value(s), and deciding on corrective actions.

Comparison: Similar to Pattern 12, implementing such online control typically requires machine learning techniques. With its streaming machine learning extension, WSO2 SP makes online control use cases considerably easier to implement than Storm does.

6. Conclusion

This two-part article conducted a side-by-side comparison of the features of Apache Storm and WSO2 Stream Processor and then discussed how these features apply to 13 streaming analytics patterns. The results of this study indicate that WSO2 Stream Processor and Apache Storm each have their own pros and cons. Table II summarizes their applicability from a use case point of view.

Table II: When to Use What?

Use Case | Apache Storm | WSO2 Stream Processor
Preprocessing | ✓ |
Alerts and Thresholds |  | ✓
Simple Counting and Counting with Windows | ✓ | ✓
Joining Event Streams | ✓ | ✓
Data Correlation, Missing Events, and Erroneous Data |  | ✓
Interacting with Databases |  | ✓
Detecting Temporal Event Sequence Patterns |  | ✓
Tracking | ✓ | ✓
Trend Detection |  | ✓
Running the Same Query in Batch and Real-time Pipelines | ✓ | ✓
Detecting and Switching to Detailed Analysis |  | ✓
Using an ML Model |  | ✓
Online Control |  | ✓

References

[1] Apache Software Foundation (2015), Apache Storm, http://storm.apache.org/

[2] Forrester (2014), The Forrester Wave™: Big Data Streaming Analytics Platforms, Q3 2014, https://www.forrester.com/report/The+Forrester+Wave+Big+Data+Streaming+Analytics+Platforms+Q3+2014/-/E-RES113442

[3] WSO2 (2018), Analytics Solutions, https://wso2.com/analytics/solutions/

[4] De Silva, R. and Dayarathna, M. (2017), Processing Streaming Human Trajectories with WSO2 CEP, https://www.infoq.com/articles/smoothing-human-trajectory-streams

[5] WSO2 (2017), Video Analytics: Technologies and Use Cases, http://wso2.com/whitepapers/innovating-with-video-analytics-technologies-and-use-cases

[6] WSO2 (2015), Fraud Detection and Prevention: A Data Analytics Approach, http://wso2.com/whitepapers/fraud-detection-and-prevention-a-data-analytics-approach

[7] WSO2 (2018), WSO2 Helps Safeguard Stock Exchange via Real-Time Data Analysis and Fraud Detection, http://wso2.com/casestudies/wso2-helps-safeguard-stock-exchange-via-real-time-data-analysis-and-fraud-detection

[8] Apache Software Foundation (2015), Trident Tutorial, http://storm.apache.org/releases/1.1.2/Trident-tutorial.html

[9] Luckham, D. (2016), Proliferation of Open Source Technology for Event Processing, http://www.complexevents.com/2016/06/15/proliferation-of-open-source-technology-for-event-processing/

[10] Zapletal, P. (2016), Comparison of Apache Stream Processing Frameworks: Part 1, http://www.cakesolutions.net/teamblogs/comparison-of-apache-stream-processing-frameworks-part-1

[11] WSO2 (2017), [WSO2Con USA 2017] Scalable Real-time Complex Event Processing at Uber, http://wso2.com/library/conference/2017/2/wso2con-usa-2017-scalable-real-time-complex-event-processing-at-uber/

[12] Apache Software Foundation (2015), Storm SQL Integration, http://storm.apache.org/releases/2.0.0-SNAPSHOT/storm-sql.html

[13] GitHub (2018), siddhi-execution-reorder, https://github.com/wso2-extensions/siddhi-execution-reorder

[14] Microsoft (2018), Stream Analytics Documentation, https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-comparison-storm

[15] Apache Software Foundation (2015), Apache Storm, http://storm.apache.org/about/multi-language.html

[16] Tsai, B. (2014), Fault Tolerant Message Processing in Storm, https://bryantsai.com/fault-tolerant-message-processing-in-storm-6b57fd303512

[17] Blogger (2015), Why We need SQL like Query Language for Realtime Streaming Analytics?, http://srinathsview.blogspot.com/2015/02/why-we-need-sql-like-query-language-for.html

[18] GitHub (2018), WSO2 Siddhi, https://github.com/wso2/siddhi

[19] Apache Software Foundation (2015), Apache Storm, http://storm.apache.org/releases/current/Fault-tolerance.html

[20] WSO2 (2018), Introduction – Stream Processor 4.0.0, https://docs.wso2.com/display/SP400/Introduction

[21] Andrade, H.C.M., Gedik, B., and Turaga, D.S. (2014), Fundamentals of Stream Processing: Application Design, Systems, and Analytics, ISBN 9781107434004, Cambridge University Press

[22] GitHub (2018), Siddhi Query Guide – Partition, https://wso2.github.io/siddhi/documentation/siddhi-4.0/#partition

[23] Apache Software Foundation (2015), Resource Aware Scheduler, http://storm.apache.org/releases/1.1.2/Resource_Aware_Scheduler_overview.html

[24] Apache Software Foundation (2015), Storm State Management, http://storm.apache.org/releases/1.1.2/State-checkpointing.html

[25] Apache Software Foundation (2018), Interface IRichSpout, http://storm.apache.org/releases/1.1.2/javadocs/org/apache/storm/topology/IRichSpout.html

[26] BigML (2013), Machine Learning From Streaming Data: Two Problems, Two Solutions, Two Concerns, and Two Lessons, https://blog.bigml.com/2013/03/12/machine-learning-from-streaming-data-two-problems-two-solutions-two-concerns-and-two-lessons/

[27] SAP (2017), Forrester Research Names SAP in Leaders Category for Streaming Analytics, https://reprints.forrester.com/#/assets/2/308/%27RES136545%27/reports

[28] Apache Software Foundation (2015), Storm JDBC Integration, http://storm.apache.org/releases/1.1.2/storm-jdbc.html

[29] Hunter, T., Das, T., Zaharia, M., Abbeel, P., and Bayen, A. M. (2013), Large-Scale Estimation in Cyberphysical Systems Using Streaming Data: A Case Study With Arterial Traffic Estimation, IEEE Transactions on Automation Science and Engineering, vol. 10, no. 4, pp. 884-898, Oct. 2013, doi: 10.1109/TASE.2013.2274523

[30] Biem, A., Elmegreen, B., Verscheure, O., Turaga, D., Andrade, H., and Cornwell, T. (2010), A Streaming Approach to Radio Astronomy Imaging, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, Dallas, TX, pp. 1654-1657

[31] Pathirage, M. (2018), Kappa Architecture, http://milinda.pathirage.org/kappa-architecture.com/

[32] MapR Technologies (2018), Lambda Architecture, https://mapr.com/developercentral/lambda-architecture/

[33] Hausenblas, M. and Bijnens, N. (2017), Lambda Architecture, http://lambda-architecture.net/

[34] Hortonworks (2018), Apache Storm Component Guide, https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_storm-component-guide/content/storm-trident-intro.html

[35] Weinberger, Y. (2015), Exactly-Once Processing with Trident – The Fake Truth, https://www.alooma.com/blog/trident-exactly-once
Original Link

Transportation’s Digital Transformation

Transportation operators have taken a huge leap in the last decade since data connectivity has become available everywhere. Their digital transformation is awe-inspiring.

A fascinating episode of NPR’s Planet Money podcast described UPS’s transformation from a package delivery company to a technology company. The signature brown trucks have become rolling computers full of sensors. Every step a driver takes and every mile she drives are tracked and analyzed by the company to increase efficiency. UPS is using every speck of information to strategize about the drivers’ tiniest decisions in an effort to optimize further, down to which pocket they use to hold their pen. One minute saved per driver per day saves UPS $14.5 million over the course of a year.

A driver’s handheld computer is equipped with GPS, tracking the driver’s every step. The truck is wired with hundreds of sensors, sending millions of data points to a central data warehouse, where they are crunched and processed. A team of data analysts then combs through the data to discover new ways to shave seconds off deliveries to increase productivity. Today, drivers make 130 deliveries per day, compared with around 90 before this digital transformation.

Kudos to UPS for recognizing that extreme optimization should not come at the expense of employee satisfaction. Since introducing this digital transformation, UPS has doubled driver wages and compensation.

Waze, the community-driven navigation app, is another prime example of how transportation has been transformed by millions of real-time data points from thousands and thousands of drivers.


Based on each vehicle’s location and speed, which are collected passively, as well as incidents actively reported by drivers (road closures, accidents, traffic, police), Waze can find the fastest route to every destination and provide an accurate ETA. The best part? Waze will reroute you if it finds a better route as traffic conditions change.

Isn’t it surprising that the transportation industry is using real-time data to optimize how they deliver value while most software delivery organizations don’t?

The Real-Time Software Delivery Map

Just like transportation operators, software delivery organizations strive to deliver customer value faster and better — better than they’ve done in the past, and better than any current or future competition.

But so often neither the business nor IT knows exactly what’s happening in real-time and when the work will be done. When will that feature or product be running in production? When will that problem be fixed? When is that new modernized system going to replace the legacy stuff causing so many problems?

Despite best intentions and a ton of hard work, for many decision makers as well as contributors, the answer is “I don’t know.” I don’t know when it can fit in the queue, I don’t know how long it’s going to wait at the work centers downstream from me, I don’t know how long it will be until it makes its way to the top of the priority list.

Why can’t we have Waze for software delivery? If huge traffic networks can create real-time optimized routes and accurately predict ETAs, why can’t we do the same for pure software delivery?


The answer is we can, it is completely within every organization’s reach. But it requires three things:

  • You need to lay down an infrastructure of connected roads and overpasses for the work to seamlessly travel without getting stuck.
  • Each work item will need to report its location and status in real time.
  • You’ll need a system capable of collecting and compiling all those data points, drawing a map of how value flows through your teams, and finding the fastest routes to production.

This is what is being referred to as Value Stream Management. We talked about it on a recent webinar.

As we all know, products don’t get created out of thin air. Each specialist contributes their part by working in a specialized tool. Portfolio managers work in PPM to create plans and assign budgets, product owners and business analysts work in requirements management to define features, developers work in Agile and issue tracking to create the design and code, test engineers work in test management to design and run tests, and so on.

The value lives in the small units of work, or “work items,” housed in those tools — your Jira, Jenkins, ALM, ServiceNow, and many, many more. Those features, stories, builds, test cases, defects, and tickets are capable of telling us their status in real-time, but often have no central “Waze” to report to. Plus, they cannot flow uninterrupted along the value creation route because roads have not been built between the tools. The work items exist in islands with no bridges.

Value Stream Management solutions, like Tasktop Integration Hub, are essentially Waze for software. They connect the tools to let the value flow seamlessly from tool to tool, team to team, specialist to specialist. They gather the individual work item statuses in real-time and reframe them in business terms — so people can see how customer value is flowing. And they communicate the overall picture of how value flows from inception till its final destination, helping organizations make adjustments to get there fastest by eliminating roadblocks, circumventing bottlenecks and rerouting items.

The Road Not Taken

IT faces no shortage of potential ways to spend its budget, what with Agile and DevOps, modernization initiatives, and new technologies like AI and machine learning. What justification is there for putting Value Stream Management at the top of that list?

Perhaps the New York City subway system can provide some answers. According to the New York Times, in New York City trains are terminally late, obstructed daily by a cascade of system failures. During the first three months of 2017, three-quarters of the subway’s lines were chronically behind schedule.

The M.T.A. doesn’t have the capability to gather real-time data from its trains. Every incident requires a lengthy post-mortem in which inspectors must try to piece together minute-by-minute accounts of what happened from a signal system dating back to the 1920s and ’30s.

Just to be on the safe side, the M.T.A. has been actively slowing down its trains since the 1990s following some fatal accidents. As a result, they’ve reduced throughput, which has increased crowding, which slows trains down even more. End-to-end running time during peak hours has increased by more than six minutes in the four years between 2012 and 2016.

The New York City subway needs to deliver more people than ever before — close to 6 million passengers a day. Yet it has infrastructure completely inadequate for the task, and as a result it’s falling behind in every conceivable metric. Now, something in the range of $110 billion is required to overhaul the subway system and it will take many, many years to do so. This may even cost Governor Cuomo his office.

The subway is so critical to New York’s economy we believe it will get bailed out and fixed. It’s lucky in that way.

However, a software organization that delivers fewer features while creating worse customer experiences? It won’t be so lucky.

Original Link

5 Steps to Regression Analysis [Infographic]

Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.


Original Link

The Virtual Assistant is the Future of Business

What makes a good user experience (UX)? Many factors come into play, including the user’s work habits, environment, and goals. An exceptional UX must meet the customer’s precise needs, applying elegance and simplicity to deliver seamless interaction. As noted by Nielsen Norman Group, a leading UX consultancy, one first must draw a distinction between UX and the user interface (UI). Using the example of a movie review website, firm co-founders Don Norman and Jakob Nielsen write: “Even if the UI for finding a film is perfect, the UX will be poor for a user who wants information about a small independent release if the underlying database only contains movies from the major studios.”

This analogy applies as well to the world of application performance management (APM) software, where the conventional UI—the aforementioned dashboard—is optimized to meet the needs of the IT professional, who can drill down to explore technical issues in greater depth. Often, however, this presentation is visual overkill for the mobile-toting business user seeking a far narrower range of insights.

For business customers, simpler is almost always better, and that’s why conversational UX is the superior choice for extracting real-time business insights from vast and varied sources of data. According to a recent study by CapTech, an IT management consulting firm, businesses are rapidly adopting conversational UX-based virtual assistants, such as chatbots, Amazon Alexa, Apple Siri, and Google Assistant. While conversational UX is beginning to appear in some APM solutions too, it has targeted the IT user and not the business. At AppDynamics we are beginning to experiment with this focus on the business user, starting with some prototypes of an “intelligent bot,” a smart virtual assistant that employs proactive reporting, on-demand interaction, and automated task execution—all combined with machine learning—to extend the capabilities of APM far beyond its traditional IT roots.

Needed: Simpler Data Analysis

Business users today favor popular, mobile-based social platforms—think Slack, Skype, Facebook Messenger and Cisco WebEx—for workplace interaction. At the same time, self-service analytics is a growing trend in the enterprise, one requiring intuitive tools that enable the user to glean insights from company data without assistance from IT or a data scientist. This self-serve approach can bring analytics “to the masses” via a familiar, intuitive UI, often on a mobile device.

The dashboard isn’t the answer here, particularly when business users want to be alerted only when key metrics are impacted. A conversational UX, voice or text, can revolutionize business by providing intelligent automation that enables users to see in real time how technology is impacting their operations and customers. By empowering companies to measure and baseline everything, conversational UX becomes a central nervous system for the entire organization. The key benefit here is proactive visibility, such as real-time alerts on sudden changes in company sales, revenue or customer churn.

A Smarter Virtual Assistant

The conversational assistant can deliver value to a much broader user base, not just IT, by providing deeper, more automated and proactive business monitoring. Think of it as a concierge service for monitoring and optimization. By sending alerts for key metrics, the virtual assistant frees users from having to log into a dashboard to view their KPIs, since their assistant does it for them.

What kinds of business services could this assistant deliver? Some examples:

  • Real-time proactive reporting: Users are alerted to KPIs they care about, such as a sudden drop-off in sales or revenue.
  • On-demand interaction: The ability to make natural language queries such as “What were total sales in the past hour?” Or, “How did this customer segment perform over the past two hours?”
  • Context-driven recommendation and task execution: The assistant might ask the user, “Total sales have declined 25% in the past hour; what would you like to do?” Suggested actions could include “Provide more information,” “Notify IT,” “Snooze alert for 30 minutes,” or other options.
  • Customer segment monitoring: When a user requests the customer journey for a particular service or application, the assistant could provide a graphical representation revealing bottlenecks or critical metrics (such as a drop in the number of conversions) that must be addressed immediately.

The true benefit of the business-focused virtual assistant is its ability to provide proactive reporting, on-demand interaction and automated task execution—all without the need for pesky logins or dashboards. It could also deliver key metrics, such as high-level KPIs, via an easy-to-use UI available on a variety of platforms, including mobile.

By incorporating machine learning capabilities, the virtual assistant could also deliver prescriptive insights that help businesses detect and prevent fraud in real-time, improve retail forecasts, create more accurate pricing models, and much more. More than simply a conversational UX, this cognitive helper would become a powerful recommendation engine enabling a natural conversation channel for business users to engage with their APM software.

The intelligent virtual assistant is smart automation that will change the way business operates. Rather than analyzing massive data sets long after they’ve gone cold, business users will soon be able to access real-time information, uncovering insights that let them see trends and changes as they happen. Here at AppDynamics, we are seeing promising results with virtual assistant prototypes, and we’re starting to work with customers on early use cases.

Original Link

Transforming Enterprise Decision-Making With Big Data Analytics

A survey conducted by NVP revealed that the increased use of big data analytics to make more informed decisions has proved noticeably successful. More than 80% of executives confirmed that their big data investments were profitable, and almost half said that their organization could measure the benefits from their projects.

While it is difficult to find such extraordinary results and optimism across business investments, big data analytics has shown how, done in the right manner, it can bring glowing results for businesses. This post will show you how big data analytics is changing the way businesses make informed decisions. In addition, you’ll understand why companies are using big data, and we’ll walk through the process so you can make more accurate and informed decisions for your business.

Why Are Organizations Harnessing the Power of Big Data to Achieve Their Goals?

There was a time when crucial business decisions were made solely based on experience and intuition. However, in the technological era, the focus has shifted to data, analytics, and logistics. Today, while designing marketing strategies that engage customers and increase conversion, decision-makers observe, analyze, and conduct in-depth research on customer behavior to get to the root causes, instead of following conventional methods that rely heavily on customer response.

Five exabytes of information were created from the dawn of civilization through 2003; today, we generate 2.5 quintillion bytes of data every day. That is a huge amount of data at the disposal of CIOs and CMOs. They can use it to gather, learn, and understand customer behavior, along with many other factors, before making important decisions. Data analytics leads to more accurate decisions and highly predictable results. According to Forbes, 53% of companies are using data analytics today, up from 17% in 2015. It helps predict future trends, the success of marketing strategies, and customer response, and it can increase conversions, among much more.

Various Stages of Big Data Analytics

Being a disruptive technology, big data analytics has inspired and directed many enterprises not only to make informed decisions but also to decode, identify, and understand information, patterns, analytics, calculations, statistics, and logistics. Utilizing it to your advantage is as much art as it is science. Let’s break down the complicated process into stages for a better understanding of data analytics.

Identify Objectives

Before stepping into data analytics, the very first step all businesses must take is to identify their objectives. Once the goal is clear, it is easier to plan, especially for the data science teams. Starting from the data gathering stage, the whole process requires performance indicators or evaluation metrics that can measure progress along the way and stop issues at an early stage. This not only ensures clarity in the remaining process but also increases the chances of success.

Data Gathering

Data gathering, one of the most important steps, requires full clarity on the objective and on the relevance of the data to that objective. To make more informed decisions, it is necessary that the gathered data is right and relevant; bad data can take you downhill and leave you with no meaningful result.

Understanding the Importance of the 3 Vs

The 3 Vs define the properties of big data: volume indicates the amount of data gathered, variety means the various types of data, and velocity is the speed at which the data is processed.

  • Define how much data is required to be measured.
  • Identify relevant data (for example, when you are designing a gaming app, you will have to categorize according to age, type of the game, and medium).
  • Look at the data from the customer’s perspective; this helps you determine, for example, how long processing should take and how to respond within your customers’ expected response times.
  • Verify data accuracy. Capturing valuable data is important; make sure that you are creating more value for your customer.

Data Preparation

Data preparation, also called data cleaning, is the process in which you give shape to your data by cleaning it, separating it into the right categories, and selecting what to keep. Turning vision into reality depends on how well you have prepared your data; ill-prepared data will take you nowhere, and no value will be derived from it.

Two key focus areas are what kind of insights are required and how you will use the data. To streamline the data analytics process and ensure you derive value from the result, it is essential that you align data preparation with your business strategy. According to a Bain report, “23% of companies surveyed have clear strategies for using analytics effectively.” Therefore, it is necessary that you identify which data and insights are significant for your business.
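For illustration, a small pandas sketch of typical preparation steps; the file and column names are hypothetical:

```python
import pandas as pd  # third-party; pip install pandas

# Hypothetical input file and column names, purely for illustration.
df = pd.read_csv("transactions.csv")

df = df.drop_duplicates()                                    # remove exact duplicates
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")  # coerce bad numbers to NaN
df = df.dropna(subset=["customer_id", "amount"])             # drop unusable rows
df["timestamp"] = pd.to_datetime(df["timestamp"])            # parse dates
df["age_group"] = pd.cut(df["customer_age"],                 # bucket into categories
                         bins=[0, 18, 35, 55, 120],
                         labels=["<18", "18-34", "35-54", "55+"])

df.to_csv("transactions_clean.csv", index=False)
```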

Implementing Tools and Models

After the lengthy collection, cleaning, and preparation of the data, statistical and analytical methods are applied to get the best insights. Out of the many tools available, data scientists should use the statistical and algorithm-deployment tools most relevant to their objectives. Choosing the right model is a thoughtful process, since the model plays the key role in producing valuable insights; it depends on your vision and the plan you intend to execute using those insights.

Turn Information Into Insights

“The goal is to turn data into information, and information into insight.” — Carly Fiorina

This stage is the heart of the data analytics process: all the information turns into insights that can be implemented in the respective plans. Insight simply means decoded information, understandable relationships derived from big data analytics. Calculated and thoughtful execution gives you measurable and actionable insights that will bring great success to your business. By applying algorithms and reasoning to the data derived from your models and tools, you can obtain valuable insights. Insight generation is highly dependent on organizing and curating data. The more accurate your insights, the easier it will be to identify and predict results and future challenges, and to deal with them efficiently.

Insights Execution

The last and most important stage is executing the derived insights within your business strategies to get the best out of your data analytics. Implementing accurate insights at the right time, in the right strategic model, is where many organizations fail.

Challenges Organizations Tend to Face Frequently

Despite being a technological invention, big data analytics is an art that, handled correctly, can drive your business to success. Although it may be the most preferable and reliable way of making important decisions, there are challenges, such as cultural barriers. When executives are used to making major strategic business decisions based on their experience and understanding of the business, it is difficult to convince them to depend on data analytics, an objective, data-driven process in which one embraces the power of data and technology. Yet aligning big data with traditional decision-making processes to create an ecosystem will allow you to create accurate insights and execute them efficiently in your current business model.

Original Link
