
Create Data Visualizations in Cognos BI With Microsoft Project Data

Access Microsoft Project data as an ODBC data source in Cognos Business Intelligence and create data visualizations in Cognos Report Studio.

You can use the CData ODBC driver for Microsoft Project to integrate Microsoft Project data with the drag-and-drop style of Cognos Report Studio. This article describes both a graphical approach to creating data visualizations, with no SQL required, and how to execute any SQL query supported by Microsoft Project.

Original Link

Clarifying Ways of Defining Jacobi Elliptic Functions Using Mathematica and SciPy

The Jacobi elliptic functions sn and cn are analogous to the trigonometric functions sine and cosine. They come up in applications such as nonlinear oscillations and conformal mapping. Unfortunately, there are multiple conventions for defining these functions. The purpose of this post is to clear up the confusion around these different conventions.

The image above is a plot of the function sn [1].
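As a quick illustration of the convention issue, here is a minimal SciPy sketch (my own example, not from the post): `scipy.special.ellipj` takes the parameter m = k² rather than the modulus k, which is a common source of exactly this confusion.

```python
# SciPy's ellipj expects the parameter m = k^2, not the modulus k.
from scipy import special

u, k = 0.5, 0.8
m = k**2                      # convert modulus to parameter

sn, cn, dn, ph = special.ellipj(u, m)

# The identities sn^2 + cn^2 = 1 and dn^2 + m*sn^2 = 1 hold
# regardless of which convention your reference uses.
print(sn**2 + cn**2)
```

Passing k where m is expected silently produces values for a different elliptic modulus, which is why checking an identity like the one above is a useful sanity test.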

Original Link

Running Your First Python Script

Since you are here, I’ll assume that you have a working Python setup along with a working Python interpreter, and are ready to run your first Python script. If not, check out my previous articles, where I walk you through each step to set up your Python environment. Here are the links to those articles:

In this article, we will be talking about:

Original Link

Data Science Project Folder Structure

Have you been looking for a project folder structure or template for storing the artifacts of your data science or machine learning project? Once a team is working on a data science project, a need arises for governance and for automating different aspects of the project with a build automation tool such as Jenkins. Thus, you need to store the artifacts in well-structured project folders. In this post, you will learn about a folder structure for data science projects with which you can store the files/artifacts of your projects.

Folder Structure of Data Science Project

The following represents the folder structure for your data science project.

Original Link

Exploring College Major and Income: A Live Data Analysis in R [Video]

I recently came up with the idea for a series of screencasts:

I’ve thought about recording a screencast of an example data analysis in #rstats. I’d do it on a dataset I’m unfamiliar with so that I can show and narrate my live thought process.

Any suggestions for interesting datasets to use?

Original Link

How We Use Kafka

Humio is a log analytics system built to run both on-prem and as a hosted offering. It is designed for "on-prem first" because, in many logging use cases, you need the privacy and security of managing your own logging solution, and because volume limitations can often be a problem in hosted scenarios.

From a software provider’s point of view, fixing issues in an on-prem solution is inherently problematic, so we have striven to keep the solution simple. To realize this goal, a Humio installation consists of only a single process per node running Humio itself, with a dependency on Kafka running nearby (we recommend deploying one Humio node per physical CPU, so a dual-socket machine typically runs two Humio nodes).

Original Link

Streams and Temp File Cleanup: Fixing a Real Production Issue

Mismanaging resources is one of the easiest ways to bring down a production system. In a development environment, it’s easy to ignore open streams and temporary files, but at production scale, they can have a hugely negative impact. Unfortunately, it’s far too easy to overlook these issues while developing, even when you’re looking out for them.

Dealing With Multiple Streams

A few months ago, I dealt with a bug in one of our production services that centered around Scala code that looked something like this:

Original Link

Spinning Up a Wallaroo Cluster Is Easy

Oh No, More Data!

Last month, we took a long-running pandas classifier and made it run faster by leveraging Wallaroo’s parallelization capabilities. This time around, we’d like to kick it up a notch and see if we can keep scaling out to meet higher demand. We’d also like to be as economical as possible: provision infrastructure as needed and de-provision it when we’re done processing.

If you don’t feel like reading the post linked above, here’s a short summary of the situation: there’s a batch job that you’re running every hour, on the hour. This job receives a CSV file and classifies each row of the file, using a Pandas-based algorithm. The run-time of the job is starting to near the one-hour mark, and there’s concern that the pipeline will break down once the input data grows past a particular point.

Original Link

Real-Time Analytics on MongoDB Data in Power BI

Power BI is expanding self-service data prep to help business analysts extract insights from big data and introducing enterprise BI platform capabilities. With recent updates, Power BI has enabled connectivity to more data sources than ever before. That said, no product is able to do everything, which is where the CData Power BI Connectors come in.

With CData, you get live connectivity to data in Power BI (meaning DirectQuery) from any of the 120+ supported sources, ranging from CRM and marketing automation to big data and NoSQL. With the CData connectors, you can access MongoDB data quickly (faster than you can with any other connector) and easily, leveraging the built-in modeling and data flattening features of the connector to create a table-like model of your schema-less data, ready to be viewed, reported and analyzed from Power BI — no code or data curation required.

Original Link

Clean Data: A Prerequisite for Business Success

A lot of businesses these days are trying to incorporate data-driven decision making into their processes, and for good reason. Making better decisions depends on having all the relevant information, and putting it to good use. But in the rush to put data to use, an important concern often falls by the wayside: ensuring that the data is high quality.

Today, we’ll take a look at a critical concern when incorporating more data into your decision-making workflow: namely, how to ensure that your data is of high enough quality to point you in the right direction. In most cases, a few simple tweaks or best practices are enough to take your inputs and transform them into a cleaned-up data set that you can rely on.

Original Link

Exactly-Once Semantics With Apache Kafka

Kafka’s exactly-once semantics were introduced in a recent release, enabling a message to be delivered exactly once to the end consumer even if the producer retries sending it.

This major release raised many eyebrows in the community, as people believed that this was not mathematically possible in distributed systems. Jay Kreps, co-founder of Confluent and co-creator of Apache Kafka, explained how it is possible and how it is achieved in Kafka in this post.

Original Link

Upload Files With Python

Python is eating the world! You will find many passionate Python programmers and just as many critics, but there’s no denying that Python is a powerful, relevant, and constantly growing force in software development today.

Python is just a language, though, and languages can’t solve business problems like workflow, architecture, and logistics; these things are up to you, the developer! The packages you choose, the architecture you implement, and the strategy you follow will all impact the success of your Python project. Let’s take a look at the logistics behind uploading a file to the cloud using Python. I’ll discuss some of the considerations every team faces when implementing a file upload and management solution and then end with a concise recipe for you to upload files with Python using Filestack’s Python SDK.
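To make the low-level logistics concrete, here is a standard-library sketch of building a multipart/form-data request body by hand (a generic illustration, not Filestack’s SDK, whose API is different):

```python
# Hand-rolled multipart/form-data encoding using only the standard library.
import uuid

def encode_multipart(field_name, filename, file_bytes):
    """Return (body, headers) for a single-file multipart upload."""
    boundary = uuid.uuid4().hex
    body = (f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
            f"Content-Type: application/octet-stream\r\n\r\n").encode()
    body += file_bytes + f"\r\n--{boundary}--\r\n".encode()
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return body, headers

body, headers = encode_multipart("file", "report.csv", b"a,b\n1,2\n")
print(headers["Content-Type"])
```

In practice an SDK hides all of this, but knowing what the request looks like on the wire helps when debugging failed uploads.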

Original Link

Posting Images With Apache NiFi 1.7 and a Custom Processor


I have been using a shell script for this since Apache NiFi did not have a good way to natively post an image to HTTP servers, such as the model server for Apache MXNet.

So I wrote a quick and dirty processor that posts an image there and gathers the headers, result body, status text, and status code and returns them to you as attributes.

Original Link

Connect to Cloudant Data in AWS Glue Jobs Using JDBC

AWS Glue is an ETL service from Amazon that allows you to easily prepare and load your data for storage and analytics. Using the PySpark module along with AWS Glue, you can create jobs that work with data over JDBC connectivity, loading the data directly into AWS data stores. In this article, we walk through uploading the CData JDBC Driver for Cloudant into an Amazon S3 bucket and creating and running an AWS Glue job to extract Cloudant data and store it in S3 as a CSV file.

Upload the CData JDBC Driver for Cloudant to an Amazon S3 Bucket

In order to work with the CData JDBC Driver for Cloudant in AWS Glue, you will need to store it (and any relevant license files) in a bucket in Amazon S3.

Original Link

Analyze Elasticsearch Data in R

You can access Elasticsearch data with pure R script and standard SQL on any machine where R and Java can be installed. You can use the CData JDBC Driver for Elasticsearch and the RJDBC package to work with remote Elasticsearch data in R. By using the CData Driver, you are leveraging a driver written for industry-proven standards to access your data in the popular, open-source R language. This article shows how to use the driver to execute SQL queries to Elasticsearch and visualize Elasticsearch data by calling standard R functions.

Install R

You can match the driver’s performance gains from multithreading and managed code by running the multithreaded Microsoft R Open or by running open R linked with the BLAS/LAPACK libraries. This article uses Microsoft R Open 3.2.3, which is preconfigured to install packages from the Jan. 1, 2016 snapshot of the CRAN repository. This snapshot ensures reproducibility.

Original Link

How to Use NumPy to Compute a Hadamard Product

The first time you see matrices, if someone asked you how to multiply two matrices together, your first idea might be to multiply every element of the first matrix by the element in the same position of the second matrix, analogous to the way you add matrices.

But that’s not usually how we multiply matrices. That notion of multiplication hardly involves the matrix structure; it treats the matrix as an ordered container of numbers, but not as a way of representing a linear transformation. Once you have a little experience with linear algebra, the customary way of multiplying matrices seems natural, and the way that may have seemed natural at first glance seems kinda strange.
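That elementwise notion is the Hadamard product, which NumPy expresses with the ordinary `*` operator; a quick sketch contrasting it with the customary matrix product:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

hadamard = A * B          # elementwise (Hadamard) product
matmul   = A @ B          # customary matrix product

print(hadamard)   # [[ 5 12] [21 32]]
print(matmul)     # [[19 22] [43 50]]
```

The `*` operator treats the matrices as containers of numbers, while `@` composes the linear transformations they represent.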

Original Link

Life Is Dirty. So Is Your Data. Get Used to It.

The internet provides everyone the ability to access data at any time, for any need. Unfortunately, it does not help guarantee that the data is valid, or clean.

In the past year, I have earned certifications in three areas: Data Science, Big Data, and Artificial Intelligence. Those studies have provided me with the opportunity to explore the world of data that exists. For example, Kaggle is a great source of data. They also offer competitions, if you are the type of person that enjoys money.

Original Link

Automatically Combining Factor Levels in R

Each time we face real applications in an applied econometrics course, we have to deal with categorical variables. And the same question arises from students: how can we automatically combine factor levels? Is there a simple R function?

I did post a few blog entries on this over the past few years, but so far, nothing satisfying. Let me write down a few lines about what could be done. And if someone wants to write a nice R function, that would be awesome. To illustrate the idea, consider the following simulated dataset:

Original Link

Tips for Enhancing Your Data Lake Strategy

As organizations grapple with how to effectively manage ever more voluminous and varied reservoirs of big data, data lakes are increasingly viewed as a smart approach. However, while the model can deliver the flexibility and scalability lacking in traditional enterprise data management architectures, data lakes also introduce a fresh set of integration and governance challenges that can impede success.

The Power and Potential of Data Lakes

Born from the rise of the cloud and big data technologies like Hadoop, data lakes provide a way for organizations to cost-effectively store nearly limitless amounts of structured and unstructured data from myriad sources without regard to how that data might be leveraged in the future. By its very nature and through self-service business intelligence capabilities, a data lake also encourages experimentation and data exploration by a broader set of non-business analyst users. According to a survey conducted by TDWI Research, 85 percent of respondents considered the data lake an opportunity to address the challenges they face trying to manage the data deluge with traditional relational databases. Moreover, the TDWI survey found the data lake being pursued for a variety of benefits and use cases, the most prominent being advanced analytics (49 percent) and data discovery (49 percent).

Original Link

Watching/Alerting on Real-Time Data in Elasticsearch Using Kibana and SentiNL

In the previous post, we set up an ELK stack and ran data analytics on application events and logs. In this post, we will discuss how you can watch real-time application events that are being persisted in the Elasticsearch index and raise alerts if the condition for the watcher is breached using SentiNL (a Kibana plugin).

A few examples of alerting for application events (see previous posts) are:

Original Link

How to Work With Avro Files

This short article describes how to transfer data from an Oracle database to S3 using the Apache Sqoop utility. The data will be stored in the Avro data format.

The data transfer was done using the following technologies:

Original Link

Empowering Enterprises With Data Discovery, Orchestration, and Delivery

It was great speaking with Nic Smith, Global Vice President of Product Marketing for Cloud Analytics at SAP SE, about the release of the SAP Data Hub, which helps build agile, data-driven pipeline applications that tap a single, logical data set representing an entire enterprise.

The data orchestration solution distills business value from all data for operational excellence and digital expansion. The data hub allows customers to build, execute, orchestrate, and govern flow-based pipelines, providing the maximum reuse of existing data developments while encompassing digital innovations. Using an innovative metadata catalog and policy management, the solution provides a trusted metadata discovery, refinement, and publishing environment. Users can easily extract more information and intelligence from highly distributed, diverse hybrid and multi-cloud environments to make real-time decisions and take immediate action without unnecessary data movement or consolidation.

Original Link

Locking in a Data Vault

So, I’m playing a little with words here. I’m certainly not advocating locking anybody or anything in a Data Vault. I want to share how you can lock in success as you design and deliver your new Data Vault. This post aims specifically to assist your development team.

Most of us are challenged by change. And developers are little different. They are typically very comfortable with a set of design approaches and tools learned in the past and it routinely frames their perspective on how to tackle the future. Combining the comfort of old ways with the tight timeframes and pressures of today’s business requests seldom leads to taking time to explore new options. As a result, it is easy for teams to be weighed down by outdated, limiting approaches to data infrastructure.

Original Link

Why Use K-Means for Time Series Data? (Part Two)

In "Why Use K-Means for Time Series Data? (Part One)," I give an overview of how to use different statistical functions and K-Means Clustering for anomaly detection for time series data. I recommend checking that out if you’re unfamiliar with either. In this post I will share:

  1. Some code showing how K-Means is used.
  2. Why you shouldn’t use K-Means for contextual time series anomaly detection.

Some Code Showing How It’s Used

I am borrowing the code and dataset for this portion from Amid Fish’s tutorial. Please take a look at it; it’s pretty awesome. In this example, I will show you how you can detect anomalies in EKG data via contextual anomaly detection with K-Means Clustering. A break in rhythmic EKG data is a type of collective anomaly, but we will analyze the anomaly with respect to the shape (or context) of the data.
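Before looking at the tutorial’s code, here is a compact NumPy-only sketch of the general technique (my own illustration, not the tutorial’s EKG code): slice the series into overlapping windows, cluster the windows with k-means fitted on anomaly-free data, and score each window by its distance to the nearest centroid.

```python
# Contextual anomaly detection on a toy rhythmic signal (a sine wave with
# an injected flat "break in rhythm"), using a tiny hand-rolled k-means.
import numpy as np

rng = np.random.default_rng(42)

def windows(series, width):
    """Slice a 1-D series into overlapping windows (the segments)."""
    return np.array([series[i:i + width] for i in range(len(series) - width + 1)])

def kmeans(X, k, iters=50):
    """Plain Lloyd's algorithm: random init from the data, then refine."""
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        centroids = np.array([X[labels == j].mean(0) if (labels == j).any()
                              else centroids[j] for j in range(k)])
    return centroids

t = np.linspace(0, 20 * np.pi, 2000)
signal = np.sin(t)
signal[1000:1080] = 0.0                    # inject a break in the rhythm

train = windows(signal[:900], 40)          # fit on anomaly-free data only
centroids = kmeans(train, k=16)

segments = windows(signal, 40)
score = np.min(np.linalg.norm(segments[:, None] - centroids[None], axis=-1), axis=1)
print(int(score.argmax()))                 # highest-scoring window index
```

The highest anomaly score lands on a window overlapping the flat region, because no centroid learned from the rhythmic training data resembles it.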

Original Link

Real-Time Data Replication Between Ignite Clusters Through Kafka

Apache Ignite, from version 1.6, provides a new way to do data processing based on Kafka Connect. Kafka Connect, a new feature introduced in Apache Kafka 0.9, enables scalable and reliable streaming of data between Apache Kafka and other data systems. It makes it easy to add new systems to your scalable and secure in-memory stream data pipelines. In this article, we are going to look at how to set up and configure the Ignite Source connector to perform data replication between Ignite clusters.

Apache Ignite, out-of-the-box, provides the Ignite-Kafka module with three different solutions (API) to achieve a robust data processing pipeline for streaming data from/to Kafka topics into Apache Ignite.

Original Link

Properties File Lookup Augmentation of Data Flow in Apache NiFi 1.7.x


A really cool technologist contacted me on LinkedIn and asked an interesting question:


Original Link

Why Use K-Means for Time Series Data? (Part One)

As an only child, I spent a lot of time by myself. Oftentimes my only respite from the extreme boredom of being by myself was daydreaming. I would meditate on objects in my environment and rotate them around in my head. I now attribute my love of jigsaw puzzles, math, and art to all the time I dedicated to visualization practice. My love for those things inspired me to try and understand more about how statistical functions and K-Means Clustering are used in anomaly detection for time series data.

In this first post, I provide high-level answers for the following questions:

Original Link

Four Free Data Analysis and Visualization Libraries for Your Project

The human brain works in such a way that visual information is better recognized and perceived than textual information. That’s why all marketers and analysts use different data visualization techniques and tools to make boring tabular data more vibrant. Their goal is to convert the raw, unstructured data to structured data and to convey its meaning to those people who are involved in the process of decision-making.

The following approach is the most common:

Original Link

[DZone Research] Python and R in Big Data and Data Science

This article is part of the Key Research Findings from the 2018 DZone Guide to Big Data: Stream Processing, Statistics, and Scalability.


For the 2018 DZone Guide to Big Data, we surveyed 540 software and data professionals to get their thoughts on various topics surrounding the field of big data and the practice of data science. In this post, we focus on the extreme popularity of the Python language in the field. 

Original Link

[DZone Research] The Three Vs of Big Data

This article is part of the Key Research Findings from the 2018 DZone Guide to Big Data: Stream Processing, Statistics, and Scalability.


For the 2018 DZone Guide to Big Data, we surveyed 540 software and data professionals to get their thoughts on various topics surrounding the field of big data and the practice of data science. In this article, we focus on how respondents told us their work is affected by the velocity, volume, and variety of data. 

Original Link

[DZone Research] Data Management and the Cloud

This article is part of the Key Research Findings from the 2018 DZone Guide to Big Data: Stream Processing, Statistics, and Scalability.


For the 2018 DZone Guide to Big Data, we surveyed 540 software and data professionals to get their thoughts on various topics surrounding the field of big data and the practice of data science. In this article, we focus in on what respondents told us about database management systems (DBMS) and using the cloud to house and analyze data sets.

Original Link

Physical Constants in Python

You can find a large collection of physical constants in scipy.constants. The most frequently used constants are available directly, and hundreds more are in a dictionary physical_constants.

The fine structure constant α is defined as a function of other physical constants:
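For instance, α = e²/(4πε₀ħc), and every constant on the right-hand side is available from the module; a minimal sketch:

```python
from scipy import constants

# Common constants are module attributes
print(constants.c)          # speed of light in m/s
print(constants.h)          # Planck constant

# Hundreds more live in the physical_constants dictionary;
# each entry is a (value, unit, uncertainty) tuple.
value, unit, uncertainty = constants.physical_constants['fine-structure constant']

# Recompute alpha = e^2 / (4 pi eps0 hbar c) from its defining constants
alpha = constants.e**2 / (4 * constants.pi * constants.epsilon_0
                          * constants.hbar * constants.c)
print(alpha)
```

The recomputed value agrees with the tabulated one, since the CODATA values in the dictionary are mutually consistent.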

Original Link

Visualizing Time Series Data With Dygraphs


This post will walk through how to visualize dynamically updating time series data that is stored in InfluxDB (a time series database) using the JavaScript graphing library Dygraphs. If you have a preference for a specific visualization library, check out these other graphing integration posts using various libraries (plotly.js, Rickshaw, Highcharts), or you can always build out a dashboard in our very own Chronograf, which is designed exclusively for InfluxDB.

Prep and Setup

To begin with, we’ll need some sample data to display on screen. For this example, I’ll be using the data generated from a separate tutorial written by DevRel Anais Dotis-Georgiou on using the Telegraf exec or tail plugins to collect Bitcoin price and volume data and see it trend over time. I’ll then query for the data in InfluxDB periodically using the HTTP API on the front-end. Let’s get started!

Original Link

Moving Big Data to the Cloud: A Big Problem?

Digital transformation is overhauling the IT approach of many organizations and data is at the center of it all. As a result, organizations are going through a significant shift in where and how they manage, store, and process this data.

To manage big data in the not so distant past, enterprises processed large volumes of data by building a Hadoop cluster on-premises using a commercial distribution such as Cloudera, Hortonworks, or MapR.

Original Link

Reactive Summit: Fast Data ”Delivers Tangible Benefits That Help Real People”

A Senior Product Manager in the Enterprise Database Market at IBM, Anson Kokkat is joining Reactive Summit with a talk called “Successfully Design, Build, and Run Fast Data Applications.” He will demonstrate how one can act on the massive data from IoT and online apps with data science, machine learning, and open source tools in an integrated platform using IBM Db2 Event Store.

In advance of his talk, we spoke to Anson about real-time analytics, fast data, the main challenges companies face when deploying Reactive and best ways to address these challenges.

Original Link

Part 2: SQL Queries in Pandas Scripting (Filtering and Joining Data)

In Part 1, I covered the preparation of a data environment (a sample of HR data) and then walked through some simple query examples over the data, comparing the Pandas library and SQL. For the examples in this article, we first need to set up the environment as described in Part 1.
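As a minimal illustration of the kind of comparison involved (a toy example, not the article’s HR dataset), here is a SQL WHERE filter and an INNER JOIN expressed with Pandas:

```python
import pandas as pd

employees = pd.DataFrame({"emp_id": [1, 2, 3],
                          "dept_id": [10, 10, 20],
                          "salary": [3000, 5000, 4500]})
departments = pd.DataFrame({"dept_id": [10, 20],
                            "dept_name": ["Sales", "IT"]})

# SELECT * FROM employees WHERE salary > 4000
high_paid = employees[employees["salary"] > 4000]

# SELECT e.*, d.dept_name FROM employees e
# JOIN departments d ON e.dept_id = d.dept_id
joined = employees.merge(departments, on="dept_id", how="inner")

print(high_paid["emp_id"].tolist())   # [2, 3]
print(joined["dept_name"].tolist())
```

Boolean indexing plays the role of WHERE, while `merge` with `how="inner"` mirrors the SQL join.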

Here, I will continue with more complex SQL queries, rewriting them with Pandas. The SQL queries have six parts, which are:

Original Link

Building a Regression Model Using Oracle Data Mining

In this article, I will do regression analysis with Oracle data mining. 

Data science and machine learning are very popular today, but these subjects require extensive knowledge and application expertise. We can approach these problems with products and software that have been developed by a number of companies. In Oracle, methods and algorithms for solving these problems are presented to users through the DBMS_DATA_MINING package.

Original Link

Interlaced Roots: Sturm’s Separation Theorem

Sturm’s separation theorem says that the zeros of independent solutions to an equation of the form

y'' + p(x)y' + q(x)y = 0

alternate. That is, between any two consecutive zeros of one solution, there is exactly one zero of the other solution. This is an important theorem because a lot of differential equations of this form come up in applications.
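A concrete numerical illustration (my own, not from the post): sin and cos are independent solutions of y'' + y = 0, and their zeros interlace.

```python
# Zeros of sin are n*pi; zeros of cos are pi/2 + n*pi.
# Between consecutive zeros of sin there is exactly one zero of cos.
import math

sin_zeros = [n * math.pi for n in range(1, 6)]
cos_zeros = [math.pi / 2 + n * math.pi for n in range(5)]

for a, b in zip(sin_zeros, sin_zeros[1:]):
    between = [z for z in cos_zeros if a < z < b]
    print(len(between))   # 1 each time
```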

Original Link

Install Python 3.7.0 on Ubuntu 18.04/Debian 9.5

In this post, we will install Python 3.7.0 on Ubuntu 18.04/Debian 9.5. This is the latest version of the Python programming language.

What’s New?

Python 3.7 gives us some new features, including:

Original Link

The Benefits of Building a Modern Data Architecture for Big Data Analytics

Modern data-driven companies are the best at leveraging data to anticipate customer needs, changes in the market, and proactively make more intelligent business decisions. According to the Gartner 2018 CEO and Senior Business Executive Survey, 81 percent of CEOs have prioritized technology initiatives that enable them to acquire advanced analytics. While many companies tapping into advanced analytics are now rethinking their data architecture and beginning data lake projects, 60 percent of these projects fail to go beyond piloting and experimentation, according to Gartner. In fact, that same Gartner survey reports that only 17 percent of Hadoop deployments were in production in 2017. If companies don’t successfully modernize their data architecture now, they will end up losing customers, market share, and profits.

What Drives the Shift to a Modern Enterprise Data Architecture?

The architectures that have dominated enterprise IT in the past can no longer handle the workloads needed to move the business forward. This shift towards a modern data architecture is driven by a set of key business drivers. There are seven key business drivers for building a modern enterprise data architecture (MEDA):

Original Link

The Levenshtein Algorithm

The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions, or substitutions) required to change one word into the other. It is named after Vladimir Levenshtein, who introduced this metric in 1965.

Levenshtein distance may also be referred to as edit distance, although it may also denote a larger family of distance metrics. It is closely related to pairwise string alignments.
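As an illustration, here is a standard dynamic-programming implementation in Python (my own sketch, not code from the article):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

The classic example: "kitten" becomes "sitting" in three edits (substitute k→s, substitute e→i, insert g).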

Original Link

Surveyor’s Formula for the Area of a Polygon

If you know the vertices of a polygon, how do you compute its area? It seems like this could be complicated, with special cases depending on whether the polygon is convex, or maybe other considerations. But as long as the polygon is "simple," i.e. the sides meet at vertices but otherwise do not intersect each other, there is a general formula for the area.

The Formula
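For vertices (x_i, y_i) listed in order around the polygon, the surveyor’s (shoelace) formula gives the area as half the absolute value of the sum of x_i·y_{i+1} − x_{i+1}·y_i, with indices taken mod n. A sketch in code (my own example):

```python
def polygon_area(vertices):
    """Surveyor's (shoelace) formula for a simple polygon.
    vertices: list of (x, y) points in order around the polygon."""
    n = len(vertices)
    s = 0.0
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]   # wrap around to the first vertex
        s += x0 * y1 - x1 * y0
    return abs(s) / 2

# Unit square: area 1
print(polygon_area([(0, 0), (1, 0), (1, 1), (0, 1)]))  # 1.0
```

Note the formula works whether the polygon is convex or not, as long as it is simple.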

Original Link

The Next Step in Stream Processing: ACID Transactions on Data Streams

The stream processing industry has been growing rapidly over the years and it is expected to reach $50 billion in revenue by the end of 2025 [1]. Since the very early days of stream processing with Apache Flink, we have always held the strong belief that stream processing is a technology that will be the new paradigm in data processing, something that we see coming to fruition as more and more modern enterprises become event-driven, real-time, and software-operated. As a result, stream processing and streaming frameworks have evolved over the years to become more robust and offer increasingly better guarantees for data and computation correctness.

Looking at the history of stream processing we can see three distinguishing steps:

Original Link

An Empirical Look at the Goldbach Conjecture

The Goldbach conjecture says that every even number bigger than 2 is the sum of two primes. I imagine Goldbach tried out his idea on numbers up to a certain point and guessed that he could keep going. He lived in the 18th century, so he would have done all his calculations by hand. What might he have done if he could have written a Python program?

Let’s start with a list of primes, say the first 100 primes. The 100th prime is p = 541. If an even number less than p is the sum of two primes, it’s the sum of two primes less than p. So by looking at the sums of pairs of primes less than p, we’ll know whether the Goldbach conjecture is true for numbers less than p. And while we’re at it, we could keep track not just of whether a number is the sum of two primes, but also how many ways it is a sum of two primes.
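Here is one way such a program might look, a minimal sketch of my own using a simple sieve (not necessarily what the post goes on to do):

```python
# Count the ways each even number below the 100th prime (541)
# can be written as a sum of two primes.
from itertools import combinations_with_replacement

def primes_up_to(n):
    """Sieve of Eratosthenes."""
    sieve = [True] * (n + 1)
    sieve[0] = sieve[1] = False
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p :: p] = [False] * len(sieve[p * p :: p])
    return [i for i, is_p in enumerate(sieve) if is_p]

primes = primes_up_to(541)           # the first 100 primes

counts = {}
for p, q in combinations_with_replacement(primes, 2):
    s = p + q
    if s < 541 and s % 2 == 0:
        counts[s] = counts.get(s, 0) + 1

# Goldbach holds below 541: every even number >= 4 has at least one representation
print(all(counts.get(n, 0) > 0 for n in range(4, 541, 2)))  # True
```

Tracking the counts, not just existence, shows the number of representations tends to grow: 10 has two (3+7 and 5+5), while larger even numbers typically have many more.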

Original Link

Half-Terabyte Benchmark: Neo4j vs. TigerGraph

Graph databases have been becoming more and more popular and are getting lots of attention.

In order to know how graph databases perform, I researched the state-of-the-art benchmarks and found that loading speed, loaded data storage, query performance, and scalability are the common benchmark features. However, those benchmarks’ testing datasets are too small, ranging from 4 MB to 30 GB. So, I decided to do my own benchmark. Let’s play with a huge dataset: half a terabyte.

Original Link

WebAssembly Cephes: Mathematical Special Functions in JavaScript


A lot of the mathematical implementations we do at NearForm depend on so-called mathematical special functions. There are no clearly defined criteria for what a special function is, but suffice it to say they are functions that often turn up in mathematics, and most of them are notoriously hard to compute. Often they are defined as the solution to a well-posed problem. However, the solution itself is not easy to compute.

JavaScript is severely lacking in implementations of these special functions, especially trusted implementations. If you are lucky, you can find a basic implementation in npm. However, the implementations often have numerous undocumented issues and only work somewhat well for a small and undocumented input range.

Original Link

How to Implement a Kafka Producer

This article deals with the ways to implement a Kafka producer.

A Kafka producer is an application that can act as a source of data in a Kafka cluster. A producer can publish messages to one or more Kafka topics.

Original Link

So Much Data, So Many Formats: A Conversion Service, Part 3

Welcome back! If you missed Part 2, check it out here.

Converting to CSV

Now that the general structure is clear, we can see where the magic happens: how we convert between a specific format and our intermediate representation. We are going to start with CSV.
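As a sketch of what such a conversion step might look like, assuming the intermediate representation is a list of dicts mapping field names to values (an assumption on my part; the series defines its own representation):

```python
# Serialize a list-of-dicts intermediate representation to CSV text.
import csv
import io

def to_csv(records):
    """records: list of dicts sharing the same keys; returns CSV text."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

rows = [{"name": "Ada", "born": 1815}, {"name": "Alan", "born": 1912}]
print(to_csv(rows))
```

Writing to an in-memory buffer keeps the converter decoupled from where the output ultimately goes (file, HTTP response, etc.).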

Original Link