Hive Metastore Configuration After Fresh Installation

For the beginners playing around in Hive, a stoppage arises with the proper configuration. After placing Hive libraries in designated folders and updating necessary environment variables, many times the first eager execution of hive fails with the exception “HiveException java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient”. That’s when Hive Metastore needs to be configured, which is pretty simple and straightforward.

There are two ways to configure Hive Metastore. We can use ‘schematool’ or directly source the hive-schema-3.1.0.mysql.sql script provided by Hive into the Metastore database.

Original Link

Use Case: How to Implement Hive Hooks to Optimize a Data Lake

Data About Data

The important difference between data lakes and data swamps is prudently organized data leads to an efficient lake while a swamp is just data that is either over-replicated or siloed by its users.  Getting the information on how the production data is being used across organization can not only be beneficial in building a well-organized data lake but it will also help data engineers to fine-tune the data pipelines or data itself.

To understand how data is consumed, we need to figure out answers to some basic questions like:

Original Link

The Application of AWS to Big Data

  • An AWS-fully integrated cloud computing network can help you build and secure big data applications. There is no hardware to procure and no need to maintain an infrastructure, so it is easy to focus on your resources to uncover new insights. With new features and capabilities introduced regularly, it provides a path to leverage the trending technologies without making long-term commitments.

  • You can build an entire data analytics application with AWS to boost your business. Scale a Hadoop cluster from zero to thousands of servers within just a few minutes, and then turn it off again when the job is completed. It includes the ability to process workloads effectively at lower cost in far less time than other tools.

  • With Amazon Web Services, you can get your big data infrastructure up and running quickly. In addition to big data web services, AWS conjointly furnishes you with an extensive plan of innovation and counseling options through our AWS Partner Network and AWS Marketplace.

  • AWS partners are providing creative and innovative data analytics solutions for other customers in the AWS Cloud. AWS gives you a way to get your big data infrastructure up and running quickly. 

  • The Data Pipeline aids in copying, enriching, transforming, and moving data as it is a data orchestration product which helps in processing substantial data workloads. It manages orchestration, scheduling, and monitoring the activities of the pipe along with logical equations required to handle failure scenarios.

  • With the information pipeline, you can read and compose information from AWS stockpiling administrations as well as your on-premise stockpiling frameworks. It bolsters a scope of information handling administrations, for example, Spark, EMR, Hive, and Pig, and can execute Linux/Unix shell commands.

  • The real-time streaming data collection for high frequency and processing AWS provides managed Kinesis services. It can be used for data streaming analysis, and for large-scale data ingestion. Amazon’s RedShift tool is designed to work with data sizes up to dozens of petabytes.

  • Machine Learning is the best choice for predictive analysis. AWS provides a way to build creative predictive models. Users are guided through the data selection process, training models, etc. by a simple wizard-based UI.

  • The service Lambda can run application code on top of the Amazon cloud infrastructure, providing developers with a great infrastructure management system. It also includes administrative and operational tasks, including scaling and resource provisioning, monitoring system health, code deployment, and application of security patches to the underlying resources.

  • AWS Elasticsearch is a primary distributed search server offering powerful search functionality over schema-free documents in real-time. Due to its nature, it is an ideal choice for performing complex queries over a massive dataset, and EC2 provides a capable platform to scale as needed. It exploits EC2’s on-request machine learning, empowering the expansion and expulsion of EC2 occasions and comparing hubs as limit and execution prerequisites change.

  • Educational institutions and other partners are using AWS and provide enterprise data warehouses and data lakes to enable self-service analytics. By incorporating data through distinct systems such as Admissions, Alumni, SIS, and Higher Educational institutions, one can provide insights that are unique and near real-time.

With AWS, we’ve got an entire end-to-end suite for large knowledge services that meet current demands, on the cloud, and with scale. Massive knowledge with AWS provides excellent solutions for each stage of the extensive knowledge lifecycle: 

The Apache Hadoop framework is available with EMR (Elastic Map Reduce) as an auto-scaling and managed service. It allows you to run Big Data workloads in the cloud with ease.

This suite of services in Mobile AWS-Big Data Analytics provides a path to find and measure the use of applications and export that data to another facility for further data analysis. Also, the AWS platform makes it a potential fit to solve these big data problems and to implement proven big data analytics on Amazon Web Services.

AWS can transform an organization’s data into valuable information with big data solutions from Amazon. It helps to convert current, archived or future application data into an asset to help your business grow. The big data tools from AWS let your teams become more productive, more comfortable with experimentation, and roll out projects sooner.

Original Link

How to Update Hive Tables the Easy Way

Historically, keeping data up-to-date in Apache Hive required custom application development that is complex, non-performant, and difficult to maintain. HDP 2.6 radically simplifies data maintenance with the introduction of SQL MERGE in Hive, complementing existing INSERT, UPDATE, and DELETE capabilities.

This article shows how to solve common data management problems, including:

  • Hive upserts, to synchronize Hive data with a source RDBMS.
  • Update the partition where data lives in Hive.
  • Selectively mask or purge data in Hive.

In a later post, we’ll show how to manage slowly-changing dimensions (SCDs) with Hive.


These SQL features are the foundation for keeping data up-to-date in Hadoop, so let’s take a quick look at them.

MERGE was standardized in SQL 2008 and is a powerful SQL statement that allows inserting, updating, and deleting data in a single statement. MERGE makes it easy to keep two systems consistent. Let’s look at the SQL specification for MERGE (slightly simplified):

MERGE INTO <target table> USING <table reference>
ON <search condition> <merge when clause>...
WHEN MATCHED [ AND <search condition> ]
THEN <merge update or delete specification>
WHEN NOT MATCHED [ AND <search condition> ]
THEN <merge insert specification>

MERGE is so powerful because you can specify as many WHEN MATCHED/WHEN NOT MATCHED clauses as you want.

In this post, we’ll also use the more familiar UPDATE statement, which looks like this:

UPDATE <target table>
SET <set clause list>
[ WHERE <search condition> ]

UPDATEcomes into play when you don’t need to mix inserts and updates together in the same statement.

Get Ready to Keep Data Fresh

With HDP 2.6, there are two things you need to do to allow your tables to be updated.

First: you need to configure your system to allow for Hive transactions. In Ambari, this just means toggling the ACID Transactions setting on.

Second: Your table must be a transactional table. That means the table must be clustered, stored as ORCFile data, and have a table property that says transactional = true. Here’s an example:

create table customer_partitioned (id int, name string, email string, state string) partitioned by (signup date) clustered by (id) into 2 buckets stored as orc tblproperties("transactional"="true");

Use Case 1: Hive Upserts

Suppose you have a source database you want to load into Hadoop to run large-scale analytics. Records in the source RDBMS are constantly being added and modified, and there’s no log to help you understand which records have changed. Instead, to keep things simple, you just do a full dump every 24 hours and update the Hadoop side to make it a mirror image of the source side.

Let’s create our managed table as follows:

create table customer_partitioned (id int, name string, email string, state string) partitioned by (signup date) clustered by (id) into 2 buckets stored as orc tblproperties("transactional"="true");

Suppose our source data at Time = 1 looks like this:

And the refreshed load at Time = 2 looks like this:

Upsert combines updates and inserts into one operation, so you don’t need to worry about whether records existing in the target table or not. MERGE is designed with this use case in mind and the basic usage is extremely simple:

merge into customer_partitioned using all_updates on = when matched then update set, state=all_updates.state when not matched then insert values(,,, all_updates.state, all_updates.signup);

Notice we use both “when matched” and “when not matched” conditions to manage updates and inserts, respectively. After the merge process, the managed table is identical to the staged table at T = 2, and all records are in their respective partitions.

Use Case 2: Update Hive Partitions

A common strategy in Hive is to partition data by date. This simplifies data loads and improves performance. Regardless of your partitioning strategy, you will occasionally have data in the wrong partition. For example, suppose customer data is supplied by a 3rd-party and includes a customer signup date. If the provider had a software bug and needed to change customer signup dates, suddenly records are in the wrong partition and need to be cleaned up.

Suppose our initial data looks like this:

And our second load looks like this:

Notice that ID 2 has the wrong Signup date at T = 1, so is in the wrong partition in the Hive table. This needs to be updated somehow so that ID 2 is removed from partition 2017-01-08 and added to 2017-01-10.

Before MERGE it was nearly impossible to manage these partition key changes. Hive’s MERGE statement doesn’t natively support updating the partition key, but here’s a trick that makes it easy anyway. We introduce a delete marker which we set any time the partition keys and UNION this with a second query that produces an extra row on-the-fly for each of these non-matching records. The code makes the process more obvious:

merge into customer_partitioned using (
-- Updates with matching partitions or net new records. select case when all_updates.signup <> customer_partitioned.signup then 1 else 0 end as delete_flag, as match_key, all_updates.* from all_updates left join customer_partitioned on = union all -- Produce new records when partitions don't match. select 0, null, all_updates.* from all_updates, customer_partitioned where = and all_updates.signup <> customer_partitioned.signup ) sub
on = sub.match_key when matched and delete_flag=1 then delete when matched and delete_flag=0 then update set, state=sub.state when not matched then insert values(,,, sub.state, sub.signup);

After the MERGE process the managed table and source table are synchronized, even though records needed to change partitions. Best of all this was done in a single operation with full atomicity and isolation.

Use Case 3: Mask or Purge Hive Data

Let’s say your security office comes to you one day and says that all data from a particular customer needs to be masked or purged. In the past you could have spent hours or days re-writing affected partitions.

Suppose our contacts table looks like this:

And we created our table as follows:

create table contacts (id int, name string, customer string, phone string) clustered by (id) into 2 buckets stored as orc tblproperties("transactional"="true");

The security office approaches us, asking for various actions:

Example: Mask all contact phone numbers from customer MaxLeads

Here we use Hive’s built-in masking capabilities (as of HDP 2.5 or Apache Hive 2.1)

update contacts set phone = mask(phone) where customer = 'MaxLeads';

Example: Purge all records from customer LeadMax

delete from contacts where customer = 'LeadMax';

Example: Purge records matching a given list of keys.

Suppose the security office gave us a CSV with certain keys and asked us to delete records matching those keys. We load the security office’s CSV into a table and get the list of keys using a subquery.

delete from contacts where id in ( select id from purge_list );


Hive’s MERGE and ACID transactions make data management in Hive simple, powerful, and compatible with existing EDW platforms that have been in use for many years. Stay tuned for the next post in this series where we show how to manage Slowly-Changing Dimensions in Hive.

Read the next blog in this series: Update Hive Tables the Easy Way Part 2

Original Link

HDFS Concurrent Access

Last year, I implemented a data lake. As is standard, we had to ingest data into the data lake, followed by basic processing and advanced processing.

We were using bash scripts for some portions of the data processing pipeline, where we had to copy data from the Linux folders, into HDFS, followed by a few transformations in Hive.

To reduce the time taken for data load, we planned to execute these two operations in parallel – that of copying files into HDFS and that of performing the Hive transformations – ensuring that the two operations operated on separate data sets, identified by a unique key.

But, it was not to be. Though both operations executed without errors, Hive threw up errors once we started querying the transformed data.

Upon investigation, we found out that the errors were due to parallel execution. When data is being copied into HDFS (from the Linux folders), Hadoop uses temporary file names until the copy operation is complete. After the copy operation is complete, the temporary file names are removed and the actual file is available in Hadoop.

When Hive is executed in parallel (while the copy operation is in progress), Hive refers to these temporary files. Even though the temporary file names are removed from Hadoop, Hive continues to have a reference to them, causing the above-mentioned error.

Once we ensured that the HDFS copy operation and Hive transformation were not performed in parallel, our problem was solved.

Original Link

Episode 162: Smart walls and dumb homes

This week Kevin and I discuss Amazon’s big security install reveal and how it made us feel. Plus, a smart home executive leaves Amazon and Facebook’s rumored smart speaker makes another appearance. China is taking surveillance even further and Kevin and I share our thoughts on the state of the smart home, and failed projects. In our news tidbits we cover a possible new SmartThings hub, a boost for ZigBee in the UK, the sale of Withings/Nokia Health, the death of a smart luggage company, and reviews for Google Assistant apps. We also answer a reader question about a connected door lock camera.

The Smart Wall research was conducted at Disney Research. The first step is building a grid of conductive materials. Later, researchers painted over it.

This week’s guest Chris Harrison, an assistant professor at Carnegie Mellon University, share his creation of a smarter wall, one that responds to touch and also recognizes electronic activity in the room. We discuss the smart wall, digital paper, how to bring context to the connected home or office, and why you may want to give up on privacy. It’s a fun episode.

Hosts: Stacey Higginbotham and Kevin Tofel
Guest: Chris Harrison, an assistant professor at Carnegie Mellon University
Sponsors: MachineQ and Twilio

  • A surprise appearance from the Wink hub
  • What happens when IoT can read your thoughts?
  • Kevin swapped hubs and is pretty unhappy about it
  • A cheap way to make connected paper
  • Go ahead, rethink you walls

Original Link

Top 5 Hadoop Courses to Learn Online

If you are learning big data, want to explore Hadoop framework, and are looking for some awesome courses, then you have come to the right place! In this article, I am going to share some of the best Hadoop courses to learn Hadoop in depth. In the last couple of articles, I have shared some big data and Apache Spark resources that have been well-received by my readers. After that, a couple of my readers emailed me and asked about some Hadoop resources, e.g. books, tutorials, and courses, that they can use to learn Hadoop better. This is the first article in a series on Hadoop. I am going to share a lot more about Hadoop and some excellent resources in coming the month, e.g. books and tutorials. BTW, If you don’t know, Hadoop is an open-source distributed computing framework for analyzing big data, and it’s been around for some time.

The classic MapReduce pattern that many companies use to process and analyze big data also runs on the Hadoop cluster. The idea of Hadoop is simple: to leverage a network of computers to process a huge amount of data by distributing them to each node and later combining individual outputs to produce the result.

Though MapReduce is one of the most popular Hadoop features, the Hadoop ecosystem is much more than that. You have HDFS, Yarn, Pig, Hive, Kafka, HBase, Spark, Knox, Ranger, Ambari, ZooKeeper, and many other big data technologies.

BTW, why Hadoop? Why you should learn Hadoop? Well, it is one of the most popular skills in the IT industry today. The average salary for a big data developer in the US is around $112,000 and goes up to an average of $160,000 in San Fransisco, as per Indeed.

There are also a lot of exciting and rewarding opportunities in the big data world and these courses will help you understand those technologies and improve your understanding of the overall Hadoop ecosystem.

Without further ado, here is my list of some of the best Hadoop courses you can take online to learn and master Hadoop.

1. The Ultimate Hands-On Hadoop Course: Tame Your Big Data!

This is the ultimate course on learning Hadoop and other big data technologies, as it covers Hadoop, MapReduce, HDFS, Spark, Hive, Pig, HBase, MongoDB, Cassandra, Flume, etc.

In this course, you will learn to design distributed systems that manage a huge amount of data using Hadoop and related technologies.

You will not only learn how to use Pig and Spark to create scripts to process data on Hadoop clusters but also how to analyze non-relational data using HBase, Cassandra, and MongoDB.

It will also teach you how to choose an appropriate data storage technology for your application and how to publish data to your Hadoop cluster using high-speed messaging solutions like Apache Kafka, Sqoop, and Flume.

You will also learn about analyzing relational data using Hive and MySQL and query data interactively using Drill, Phoenix, and Presto.

In total, it covers over 25 technologies to provide you a complete knowledge of the big data space.

2. The Building Blocks of Hadoop Course: HDFS, MapReduce, and YARN

Processing billions of records is not easy — you need to have a deep understanding of distributed computing and underlying architecture to keep things under control. And if you are using Hadoop to do that job, then this course will teach you all the things you need to know.

As the name suggests, this course focuses on the building blocks of the Hadoop framework, i.e. HDFS for storage, MapReduce for processing, and YARN for cluster management.

In this course, you will learn about Hadoop architecture and then do some hands-on work by setting up a pseudo-distributed Hadoop environment.

You will submit and monitor tasks in that environment and slowly learn how to make configuration choices for the stability, optimization, and scheduling of your distributed system.

At the end of this course, you should have a complete knowledge of how Hadoop works and its individual building blocks, i.e. HDFS, MapReduce, and YARN.

3. SQL on Hadoop: Analyzing Big Data With Hive

If you don’t what Hive is, let me give you a brief overview. Apache Hive is a data warehouse project build on top of Apache Hadoop for providing data summaries, queries, and analyses. It provides a SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop and NoSQL databases like HBase and Cassandra.

This course starts with explaining key Apache Hadoop concepts like distributed computing and MapReduce and then goes into great detail into Apache Hive.

The course presents some real-world challenges to demonstrate how Hive makes tasks easier to accomplish. In short, this is a good course to learn how to use the Hive query language to find solutions to common big data problems.

4. Big Data and Hadoop for Beginners (Hands-On!)

If you are a beginner who wants to learn everything about Hadoop and related technologies, then this is the perfect course for you.

In this course, instructor Andalib Ansari will teach you the complex architecture of Hadoop and its various components like MapReduce, YARN, Hive, and Pig for analyzing big datasets.

You will not only understand what the purpose of Hadoop is and how it works but also how to install Hadoop on your machine and learn to write your own code in Hive and Pig to process a huge amount of data.

Apart from basic stuff, you will also learn advanced concepts like designing your own data pipeline using Pig and Hive.

The course also gives you an opportunity to practice with big datasets. It is also one of the most popular Hadoop course son Udemy with over 24,805 students already enrolled and over 1,000 ratings at an average of 4.2.

5. Learn Big Data: The Hadoop Ecosystem Masterclass

This is another great course to learn big data from Udemy. In this course, instructor Edward Viaene will teach you how to process big data using batch processing.

The course is very hands-on but comes with a good amount of theory. It contains more than six hours of lectures to teach you everything you need to know about Hadoop.

You will also learn how to install and configure Hortonworks Data Platform (HDP). It provides demons that you can try out on your machine by setting up a Hadoop cluster on the virtual machine (though you need 8GB+ RAM for that).

Overall, this is a good course for anyone who is interested in how big data works and what technologies are involved, with some hands-on experience.


That’s all for some of the best courses to learn Hadoop and related technologies like Hive, HDFS, MapReduce, YARN, Pig, etc. Hadoop is one of the most popular frameworks in the big data space and a good knowledge of Hadoop will go a long way in boosting your career prospects, especially if you are interested in big data.

Thanks for reading this article. If you like these Hadoop courses, then please share with your friends and colleagues. If you have any questions or feedback, then please drop a note.

Other Programming Resources You May Like

Original Link

Aerospace Corp.’s iLab encourages out-of-the-box thinking without leaving home

Aerospace Corp. engineers used iLab time to refine a concept for self-assembling Hive satellites. (Credit: Aerospace Corp.)

This article originally appeared in the Feb. 12, 2018 issue of SpaceNews magazine.

Instead of designing satellites years before they launch to perform specific tasks, what if small multipurpose satellites were designed for a variety of jobs? And what if those satellites could be launched separately before linking in orbit to perform one mission, then reconfigured to tackle a different job?

That’s the concept behind the Aerospace Corporation’s adaptable multipurpose satellite concept, called Hive. “The great thing is you don’t need to know what it’s going to be before you launch,” said Randy Villahermosa, executive director of Aerospace’s Innovation Laboratory, or iLab.

Hive is one of more than 100 iLab projects approved for further investigation or prototyping since March 2017 when Aerospace, which operates a federally funded research and development center in El Segundo, California, for the U.S. Air Force’s Space and Missile Systems Center and the National Reconnaissance Office, converted its library into iLab to encourage what Villahermosa calls “the sort of random collisions that lead to great ideas.”

Aerospace’s iLab bears a striking resemblance to an Ikea store although the adjacent areas are furnished to encourage collaboration not to look like kitchens, offices or bedrooms. There’s a 1950s-style diner, a living room with comfortable couches, a small theater and a conference room enclosed in a glass cube.

“We create different spaces because everyone has a different comfort level for different types of environments,” Villahermosa said.

Aerospace employees, Air Force officers and visiting contractors can explore ideas in iLab. Aerospace offers its employees week-long sabbaticals where instead of performing their regular jobs, they flesh out ideas in iLab. After a week, they pitch their idea to Aerospace executives who decide whether to fund it.

“More than half the time, we’ll fund them to build a prototype,” Villahermosa said.

Aerospace oversees the subsequent technology development work but licenses promising products to commercial partners. As an example, Villahermosa points to the Micro Dosimeter that Aerospace developed and licensed to Teledyne Advanced Electronic Solutions of Lewisburg, Tennessee.

MarsDrop is one of the ideas conceived in iLab. Aerospace is working with NASA’s Jet Propulsion Laboratory in nearby Pasadena, California, to design small, flexible pods that could ride along with large Mars missions. The pods would fall through the sky, land on the red planet, deploy petals to open and perform experiments.

“You would be able to launch several on a mission to Mars,” Villahermosa said.

U.S. Air Force Secretary Heather Wilson (right) and John Stopher (second from right), director of the Principal DoD Space Advisor staff, visit iLab in December. (Credit: Aerospace Corp.)U.S. Air Force Secretary Heather Wilson (right) and John Stopher (second from right), director of the Principal DoD Space Advisor staff, visit iLab in December. (Credit: Aerospace Corp.)

Another iLab project, Launch U, began with an intriguing question. If a startup brings a cubesat or small satellite to a launch provider today, what prevents it from flying tomorrow?

An iLab team discussed the question during what it calls a “hack session” with Virgin Orbit in 2017. The group concluded cubesats were not the ideal form factor for launch vehicles. Instead, small satellites could be packed into standard Launch Units the way package-delivery giant UPS fills cargo containers before loading them on airplanes.

“With the Launch U concept, we are working on a standard that would allow us do to that,” Villahermosa said.

Carrie O’Quinn, Aerospace senior project engineer, has been working with teams in the small satellite community to develop new standards, which the team hopes to reveal in August at the Small Satellite Conference in Logan, Utah.

Aerospace’s iLab is “a very collaborative environment where I can share a challenge that I’m faced with and we can all brainstorm a solution quickly,” O’Quinn said by email. “The ability to get things done quickly and with a lot of autonomy is very empowering.”

If Launch U helps satellites reach orbit more quickly, it could pave the way for testing of Hive multipurpose satellites. Unlike cubesats, which usually function independently or as part of a constellation, Hive satellites are interlocking, reconfigurable satellites designed to join, semi-detach and climb over one another to rearrange themselves.

“The big problem with cubesats is a physics problem,” Villahermosa said. “You can only have so much aperture and power in these small form factors. But if these satellites link up and act cooperatively, we can start to imagine getting the same capability as a big satellite.”

Hive satellites could, for example, create a large optical telescope. They could even move around to change the shape of the telescope mirror.

Working in “iLab gave me the opportunity to assemble a team of subject matter experts to evaluate and ensure feasibility of the Hive concept,” said Henry Helvajian, Aerospace’s Hive project team leader. “It was an honor to be trusted to develop and deliver such a novel concept.”

Aerospace engineers are investigating the best method for linking Hive satellites, which could be equipped with synthetic aperture radars, visible and short-wave infrared cameras. They originally envisioned using electromagnets to join Hive satellites in orbit. A recent visit to JPL convinced Villahermosa the team should explore Gecko Gripper, an adhesive device whose sticking power can be turned on and off.

Whatever linking mechanism is selected, Hive satellites will be able to detach satellites that fail. “Now we have satellites that can reconfigure on orbit,” Villahermosa said. “It’s a potential solution for resiliency.”

Original Link

Monitoring Energy Usage With NiFi, Python, Hive, and a SmartPlug Device

During the holidays, it’s nice to know how much energy you are using. One small step I took was buying a low-end inexpensive TPLink Energy Monitoring plug for one device. I have been monitoring my phone charging and my Apple monitor.

Let’s read the data and do some queries in Apache Hive and Apache Spark 2 SQL.

Processing live energy feeds in the cloud:

Monitor energy from a local OSX:

If your local instance does not have access to Apache Hive, you will need to send the data via site-to-site to a remote Apache NiFi/HDF server/cluster that can.

For Apache Hive usage, please convert to Apache ORC files:

To create your new table, grab the hive.ddl:

Inside of Apache Zeppelin, we can create our table based on the above DDL. We could have also let Apache NiFi create the table for us. I like to keep my DDL with my notebook — just a personal choice.

We can then query our table in Apache Zeppelin using Apache Spark 2 SQL and Apache Hive QL.

Here are the steps you should take:

  1. Purchase an inexpensive energy monitoring plug.

  2. Connect it to a phone app via WiFi.

  3. Once configured, you can access it via Python.

  4. Install the HS100 Python library.

Shell script (


Python code (

from pyHS100 import SmartPlug, SmartBulb
#from pprint import pformat as pf
import json
import datetime
plug = SmartPlug("")
row = { }
emeterdaily = plug.get_emeter_daily(year=2017, month=12)
for k, v in emeterdaily.items(): row["hour%s" % k] = v
hwinfo = plug.hw_info
for k, v in hwinfo.items(): row["%s" % k] = v
sysinfo = plug.get_sysinfo()
for k, v in sysinfo.items(): row["%s" % k] = v
timezone = plug.timezone
for k, v in timezone.items(): row["%s" % k] = v
emetermonthly = plug.get_emeter_monthly(year=2017)
for k, v in emetermonthly.items(): row["day%s" % k] = v
realtime = plug.get_emeter_realtime()
for k, v in realtime.items(): row["%s" % k] = v
row['alias'] = plug.alias
row['time'] = plug.time.strftime('%m/%d/%Y %H:%M:%S')
row['ledon'] = plug.led
row['systemtime'] ='%m/%d/%Y %H:%M:%S')
json_string = json.dumps(row)

The code is basically a small tweak of the example code provided with the pyHS100 code. This code allows you to access the HS110 that I have. My PC and my smart meter are on the same WiFi, which can’t be 5G.

Example data:

{"hour19": 0.036, "hour20": 0.021, "hour21": 0.017, "sw_ver": "1.1.1 Build 160725 Rel.164033", "hw_ver": "1.0", "mac": "50:C7:BF:B1:95:D5", "type": "IOT.SMARTPLUGSWITCH", "hwId": "60FF6B258734EA6880E186F8C96DDC61", "fwId": "060BFEA28A8CD1E67146EB5B2B599CC8", "oemId": "FFF22CFF774A0B89F7624BFC6F50D5DE", "dev_name": "Wi-Fi Smart Plug With Energy Monitoring", "model": "HS110(US)", "deviceId": "8006ECB1D454C4428953CB2B34D9292D18A6DB0E", "alias": "Tim Spann's MiniFi Controller SmartPlug - Desk1", "icon_hash": "", "relay_state": 1, "on_time": 161599, "active_mode": "schedule", "feature": "TIM:ENE", "updating": 0, "rssi": -32, "led_off": 0, "latitude": 40.268216, "longitude": -74.529088, "index": 18, "zone_str": "(UTC-05:00) Eastern Daylight Time (US & Canada)", "tz_str": "EST5EDT,M3.2.0,M11.1.0", "dst_offset": 60, "day12": 0.074, "current": 0.04011, "voltage": 122.460974, "power": 1.8772, "total": 0.074, "time": "12/21/2017 13:21:52", "ledon": true, "systemtime": "12/21/2017 13:21:53"}

As you can see we, only get the hours and days where we had usage.  Since this is new, I don’t have them all.

I created my schema to handle all the days of a month and all the hours of a day.

We are going to have a sparse table. If I was monitoring millions of devices, I would put this in Apache HBase. I may do that later.

Let’s create an HDFS directory for loading Apache ORC files…

hdfs dfs -mkdir -p /smartPlug hdfs dfs -chmod -R 777 /smartPlug 

Table DDL:

CREATE EXTERNAL TABLE IF NOT EXISTS smartPlug (hour19 DOUBLE, hour20 DOUBLE, hour21 DOUBLE, hour22 DOUBLE, hour23 DOUBLE, hour18 DOUBLE, hour17 DOUBLE, hour16 DOUBLE, hour15 DOUBLE, hour14 DOUBLE, hour13 DOUBLE, hour12 DOUBLE, hour11 DOUBLE, hour10 DOUBLE, hour9 DOUBLE, hour8 DOUBLE, hour7 DOUBLE, hour6 DOUBLE, hour5 DOUBLE, hour4 DOUBLE, hour3 DOUBLE, hour2 DOUBLE, hour1 DOUBLE, hour0 DOUBLE, sw_ver STRING, hw_ver STRING, mac STRING, type STRING, hwId STRING, fwId STRING, oemId STRING, dev_name STRING, model STRING, deviceId STRING, alias STRING, icon_hash STRING, relay_state INT, on_time INT, feature STRING, updating INT, rssi INT, led_off INT, latitude DOUBLE, longitude DOUBLE, index INT, zone_str STRING, tz_str STRING, dst_offset INT, day31 DOUBLE, day30 DOUBLE, day29 DOUBLE, day28 DOUBLE, day27 DOUBLE, day26 DOUBLE, day25 DOUBLE, day24 DOUBLE, day23 DOUBLE, day22 DOUBLE, day21 DOUBLE, day20 DOUBLE, day19 DOUBLE, day18 DOUBLE, day17 DOUBLE, day16 DOUBLE, day15 DOUBLE, day14 DOUBLE, day13 DOUBLE, day12 DOUBLE, day11 DOUBLE, day10 DOUBLE, day9 DOUBLE, day8 DOUBLE, day7 DOUBLE, day6 DOUBLE, day5 DOUBLE, day4 DOUBLE, day3 DOUBLE, day2 DOUBLE, day1 DOUBLE, current DOUBLE, voltage DOUBLE, power DOUBLE, total DOUBLE, time STRING, ledon BOOLEAN, systemtime STRING) STORED AS ORC
LOCATION '/smartPlug' 

A simple query on some of the variables:

select `current`,voltage, power,total,time,systemtime, on_time, rssi, latitude, longitude from smartPlug 

Note that current is a special word in SQL, so we tick it.

An Apache Calcite query inside Apache NiFi:


With the Python API, I can turn it off, so I don’t monitor then. In an updated article, I will add a few smart plugs and turn them on and off based on things occurring — perhaps it’ll turn off a light when no motion is detected. We can do anything with Apache NiFi, Apache MiniFi, and Python. The API also allows for turning the green LED light on the plug on and off.  

Check it out:

The screenshots above are from the IoS version of the TPLink KASA app, which lets you configure and monitor your plug. For many people that’s good enough, but not for me.

Original Link

Ingesting RDBMS Data as New Tables Arrive in Hive

Let’s say that a company wants to know when new tables are added to a JDBC source (say, an RDBMS). Using the ListDatabaseTables processor, we can get a list of TABLE s, and also views, system tables, and other database objects, but for our purposes, we want tables with data. I have used the ngdbc.jar from SAP HANA to connect and query tables with ease.

For today’s example, I am connecting to MySQL, as I have a MySQL database available for use and modification.


mysql -u root -p test < person.sql
CREATE USER 'nifi'@'%' IDENTIFIED BY 'reallylongDifficultPassDF&^D&F^Dwird';
mysql> show tables;
| Tables_in_test |
| mock2 |
| personReg |
| visitor |
4 rows in set (0.00 sec)

I created a user to use for my JDBC Connection Pool in NiFi to read the metadata and data.

These table names will show up in NiFi in the attribute.

Step 1

ListDatabaseTables: Let’s get a list of all the tables in MySQL for the database we have chosen.

After it starts running, you can check its state, see what tables were ingested, and see the most recent timestamp (Value).

We will get back what catalog we read from, how many tables there are, and each table name and it’s full name.

HDF NiFi supports generic JDBC drivers and specific coding for Oracle, MS SQL Server 2008, and MS SQL Server 2012+.

Step 2

GenerateTableFetch using the table name returned from the list returned by the database control.

Step 3

We use extract text to get the SQL statement created by generate table fetch.

We add a new attribute, sql.

Step 4

Execute SQL with that $sql attribute.

Step 5

Convert AVRO files produced by ExecuteSQL into performant Apache ORC files:

Step 6

PutHDFS to store these ORC files in Hadoop.

I added the table name as part of the directory structure, so a new directory is created for each transferred table. Now, we have dynamic HDFS directory structure creation.

Step 7

Replace the text to build a SQL statement that will generate an external Hive table on our new ORC directory.

Step 8

PutHiveQL to execute the statement that was just dynamically created for our new table.

We now have instantly queryable Hadoop data available to Hive, SparkSQL, Zeppelin, ODBC, JDBC, and a ton of BI tools and interfaces.

Step 9

Finally, we can look at the data that we have ingested from MySQL into new Hive tables.

That was easy! The best part is that as new tables are added to MySQL, they will be auto-ingested into HDFS and Hive tables.

Future Updates

Use Hive merge capabilities to update changed data. We can also ingest to Phoenix/HBase and use the upsert DML.

Test with other databases. Tested with MySQL.

Quick tip (HANA): In NiFi, refer to tables with their full name in quotes: "SAP_"."ASDSAD.SDASD.ASDAD".

Original Link

Big Data Analytics: Interactive Queries Using Presto on Microsoft Azure Data Lake Store and Qubole

In addition to the recent announcement of the integration of Microsoft’s Azure Data Lake Store (ADLS) with Qubole Data Service (QDS), Qubole now offers interactive querying capability on ADLS with Presto. This is a major benefit for businesses that want to do interactive queries against large datasets using the same Hive metastore leveraged by ETL process on Hive and data science use cases on Spark.

ADLS is an enterprise-grade hyper-scale repository for big data workloads. It enables you to capture and process data of any size, type, and ingestion speed in one single place. ADLS implements the open-source Apache® Hadoop Distributed File System (HDFS) compatible interface. With its HDFS support, you can easily migrate your existing Hadoop and Spark datasets to the cloud without recreating your HDFS directory structure.

Presto is a distributed SQL engine designed for running interactive analytics queries. Using a memory-oriented architecture in place of HDFS storage, Presto is able to rival the performance of traditional data warehouses but at a fraction of the cost, by relying on a horizontally scalable layer of memory optimized compute nodes backed on to affordable and secure ADLS storage for your data lake. Presto can be used in place of other well known interactive open-source query engine such as Impala, Hive or traditional SQL data warehouses.

Analysts with existing skills can connect their preferred BI applications (i.e. Power BI, Tableau, Looker, Qlik, etc.) to the Qubole-managed Presto cluster via a common ODBC/JDBC driver. This makes adoption easy and maintains productivity.

ADLS vs. Azure Blob Store

ADLS is storage optimized for big data workloads of all kinds — batch, interactive, and streaming and all types, both structured and unstructured. On the other hand, Azure Blob Store is a general-purpose object store that works well for a variety of use cases and is not specially tuned for read/write accesses of big data workloads. With ADLS, there are no limits on the amount of data you can store and it is optimized for high-throughput and input/output operations per second (IOPS). ADLS also enforces HTTPS protocol for data transfer to and from the store, thereby enforcing better security.

For more details, visit here.

Highlights of Presto, ADLS, and QDS Integration

  • Use Qubole to efficiently manage your Presto clusters by automatically taking care of provisioning and de-provisioning, along with dynamic auto-scaling the size of the cluster to handle the workload at any point in time.
  • Connect your favorite BI application to a Qubole-managed Presto cluster.
  • Configure QDS accounts with ADLS credentials for seamless and transparent access to ADLS on all (Hadoop, Spark, etc.) clusters in your account.
  • Run Apache Hive, Hadoop, Spark, and Presto queries through QDS platform, which is now capable of accessing data in your ADLS.
  • Migrate data from on-premise storage to ADLS using built-in native tools (in QDS) from a diverse set of storage solutions such as Azure SQL Service, Azure SQL Data Warehouse, Microsoft SQL Server, MySQL, and more.
  • Migrate data from cloud object stores using distributed Hadoop (MapReduce) job from Azure Blob Storage to ADLS.

Getting Started

Let’s see how to get started with ADLS, a free QDS Business Edition on Azure, and using Presto and ADLS with QDS.


Sign up for Azure portal and create an ADLS account. For detailed steps, visit here.

Free QDS Business Edition on Azure

Sign up for free QDS Business Edition on Azure by visiting here. For detailed steps, visit here.

Using Presto and ADLS With QDS

For detailed steps, visit here.

Original Link

Hive and Presto Clusters With Jupyter on AWS, Azure, and Oracle

Jupyter™ Notebooks are one of the most popular IDEs of choice among Python users. Traditionally, Jupyter users work with small or sampled datasets that do not require distributed computing. However, as data volumes grow and enterprises move toward a unified data lake, powering business analytics through parallel computing frameworks such as Spark, Hive, and Presto becomes essential.

We covered connecting Jupyter with Qubole Spark cluster in the previous article. In this post, we will show how Jupyter users can leverage PyHive to run queries from Jupyter Notebooks against Qubole Hive and Presto clusters in a secure way.

The following diagram depicts a high-level architectural view of the solution.

A Jupyter Notebook that is running on your local computer will utilize the Qubole API to connect to a Qubole Spark Cluster. This will allow the notebook to execute SQL code on Presto or Hive Cluster using PyHive. Please follow the step-by-step guide below to enable this solution.

Step-by-Step Guide

  1. Follow steps in this article to connect Jupyter with Qubole Spark Cluster.

  2. Navigate to the Clusters page on Qubole and click the ellipsis on the same Spark cluster you used in the previous step. Click Edit Node Bootstrap.

  3. Add the following command to the Node bootstrap outside of any conditional code to make sure that this command runs for both, master and slave nodes.

    pip install pyhive
  4. Start or restart the Spark cluster to activate PyHive.

  5. Set Elastic IP for Master Node in the cluster configuration for both Hive and Presto clusters. This step is optional. However, it will help reconnecting to Hive and Presto clusters after their restart.

  6. On Hive cluster, enable Hive Server 2.

  7. Make sure that port 10003 on the master node of Hive cluster and port 8081 on the Presto cluster are open for access from Spark cluster. You may need to create security groups and apply them as Persistent Security Groups on the cluster configuration.

  8. Start or restart Hive and Presto clusters and take a note of the Master DNS on the Clusters page. If you configured Elastic IPs on Step 5 use them instead. Here is an example of how the Master DNS may look.

  9. Start Jupyter Notebook and open an existing or create a new PySpark notebook. Please refer to this article on details of starting Jupyter Notebook.

  10. To connect to Hive, use this sample code below. Replace <Master-Node-DNS> with the values from Step 8.

    from pyhive import hive
    hive_conn = hive.Connection(host="<Master-Node-DNS>", port=10003)
    hive_cursor.execute('SELECT * FROM your_table LIMIT 10')
    print hive_cursor.fetchone()
  11. To connect to Presto, use this sample code below. Replace <Master-Node-DNS> with the values from Step 8.

    from pyhive import presto
    presto_conn = presto.Connection(host="<Matser-Node-DNS>", port=8081)
    presto_cursor.execute('SELECT * FROM your_table LIMIT 10')
    print presto_cursor.fetchone()


Original Link

Flatten Complex Nested Parquet Files on Hadoop With Herringbone

Herringbone is a suite of tools for working with Parquet files on HDFS, and with Impala and Hive. Please visit my GitHub and this documentation for more details.


Note: You must be using a Hadoop machine; Herringbone needs a Hadoop environment.

Prerequisite: Thrift

  • Thrift 0.9.1 (must have 0.9.1, as 0.9.3 and 0.10.0 will give an error while packaging)
    • Download: $ wget
    • Unzip: $ tar -xvf thrift-0.9.1.tar.gz  
    • Configure: $ cd thrift-0.9.1.tar.gz; $ ./configure  
    • Make: $ sudo make  
    • Make install: $ sudo make install  
    • Check version: $ thrift –version  

Prerequisite: Impala

  • First, set up a Cloudera repo on your machine:
    • Go to specific folder: $ cd /etc/apt/sources.list.d/  
    • Get the Cloudera repo: $ wget 
    • Update your Ubuntu repo: $ sudo apt-get update  
  • Install Impala
    • Install Impala: $ sudo apt-get install impala  
    • Install Impala Server: $ sudo apt-get install impala-server 
    • Install Impala stat-store: $ sudo apt-get install impala-state-store 
    • Install Impala shell: $ sudo apt-get install impala-shell 
    • Verify Impala: $ impala-shell  
Starting Impala Shell without Kerberos authentication
Connected to mr-0xd7-precise1.0xdata.loc:21000
Server version: impalad version 2.6.0-cdh5.8.4 RELEASE (build 207450616f75adbe082a4c2e1145a2384da83fa6)
Welcome to the Impala shell. Press TAB twice to see a list of available commands. Copyright (c) 2012 Cloudera, Inc. All rights reserved. (Shell build version: Impala Shell v1.4.0-cdh4-INTERNAL (08fa346) built on Mon Jul 14 15:52:52 PDT 2014)

Building: Herringbone Source

  • Clone git repo: $ git clone
  • Go to folder: $ cd herringbone  
  • Using Maven to build: $ mvn package  

Here is the successful Herringbone mvn package command log for your review:

[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO] Herringbone Impala
[INFO] Herringbone Main
[INFO] Herringbone
[INFO] ------------------------------------------------------------------------
[INFO] Building Herringbone Impala 0.0.2
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Building Herringbone 0.0.1
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] Herringbone Impala ................................. SUCCESS [ 2.930 s]
[INFO] Herringbone Main ................................... SUCCESS [ 13.012 s]
[INFO] Herringbone ........................................ SUCCESS [ 0.000 s]
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 16.079 s
[INFO] Finished at: 2017-10-06T11:27:20-07:00
[INFO] Final Memory: 90M/1963M
[INFO] ------------------------------------------------------------------------

Using Herringbone

Note: You must have fields on Hadoop, not on the local file system.

Verify the file on Hadoop:

  • ~/herringbone$ hadoop fs -ls /user/avkash/file-test1.parquet 
  • -rw-r–r– 3 avkash avkash 1463376 2017-09-13 16:56 /user/avkash/file-test1.parquet
  • ~/herringbone$ bin/herringbone flatten -i /user/avkash/file-test1.parquet 
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/avkash/herringbone/herringbone-main/target/herringbone-0.0.1-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.8.4-1.cdh5.8.4.p0.5/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See for further details.
17/10/06 12:06:44 INFO client.RMProxy: Connecting to ResourceManager at mr-0xd1-precise1.0xdata.loc/
17/10/06 12:06:45 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
17/10/06 12:06:45 INFO input.FileInputFormat: Total input paths to process : 1
17/10/06 12:06:45 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
1 initial splits were generated. Max: 1.34M Min: 1.34M Avg: 1.34M
1 merged splits were generated. Max: 1.34M Min: 1.34M Avg: 1.34M
17/10/06 12:06:45 INFO mapreduce.JobSubmitter: number of splits:1
17/10/06 12:06:45 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1499294366934_0707
17/10/06 12:06:45 INFO impl.YarnClientImpl: Submitted application application_1499294366934_0707
17/10/06 12:06:46 INFO mapreduce.Job: The url to track the job: http://mr-0xd1-precise1.0xdata.loc:8088/proxy/application_1499294366934_0707/
17/10/06 12:06:46 INFO mapreduce.Job: Running job: job_1499294366934_0707
17/10/06 12:06:52 INFO mapreduce.Job: Job job_1499294366934_0707 running in uber mode : false
17/10/06 12:06:52 INFO mapreduce.Job: map 0% reduce 0%
17/10/06 12:07:22 INFO mapreduce.Job: map 100% reduce 0%

Now verify the file ~/herringbone$ hadoop fs -ls /user/avkash/file-test1.parquet-flat:

Found 2 items
-rw-r--r-- 3 avkash avkash 0 2017-10-06 12:07 /user/avkash/file-test1.parquet-flat/_SUCCESS
-rw-r--r-- 3 avkash avkash 2901311 2017-10-06 12:07 /user/avkash/file-test1.parquet-flat/part-m-00000.parquet

That’s it; enjoy!

Original Link

Create Your Own Metastore Event Listeners in Hive With Scala

Hive metastore event listeners are used to detect every single event that takes place whenever an event is executed in Hive. If you want a certain action to take place for an event, you can override MetaStorePreEventListener and provide your own implementation.

In this article, we will learn how to create our own metastore event listeners in Hive using Scala and SBT.

So let’s get started!

First, add the following dependencies in your build.sbt file:

libraryDependencies += "org.apache.hive" % "hive-exec" % "1.2.1" excludeAll ExclusionRule(organization = "org.pentaho") libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "2.7.3" libraryDependencies += "org.apache.httpcomponents" % "httpclient" % "4.3.4" libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.6.0" libraryDependencies += "org.apache.hive" % "hive-service" % "1.2.1" unmanagedJars in Compile += file("/usr/lib/hive/lib/hive-exec-1.2.1.jar") assemblyMergeStrategy in assembly := { case PathList("META-INF", xs @ _*) => MergeStrategy.discard case x => MergeStrategy.first

Now, create your first class. You can name it anything, but I named it OrcMetastoreListener. This class must extend the MetaStorePreEventListener class of Hive and take the Hadoop conf as the constructor argument:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hive.metastore.MetaStorePreEventListener
import class OrcMetastoreListener(conf: Configuration) extends MetaStorePreEventListener(conf) { override def onEvent(preEventContext: PreEventContext): Unit = { preEventContext.getEventType match { case CREATE_TABLE => val tableName = preEventContext.asInstanceOf[PreCreateTableEvent].getTable tableName.getSd.setInputFormat("") tableName.getSd.setOutputFormat("") case ALTER_TABLE => val newTableName = preEventContext.asInstanceOf[PreAlterTableEvent].getNewTable newTableName.getSd.setInputFormat("") newTableName.getSd.setOutputFormat("") case _ => //do nothing } }

The pre-event context contains all the Hive metastore events. In my case, I want all tables generated in Hive to use the Hive input format and output format — and the same thing for the altering command.

The best use case for this listener is when somebody wants to query a data source such as spark or any other data source using its own custom input format and even don’t want to alter the schema of hive table to use his custom input format

Now, let’s build a JAR from the core and use it in Hive.

First, add an SBT assembly plugin in your plugins.sbt file:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")

Now, go to your root project and build the JAR with command SBT assembly. It will build your JAR, collect your JAR, and put it in your $HIVE_HOME/lib path. Inside the $HIVE_HOME/conf  folder, add the following contents in hive-site.xml:

<configuration> <property> <name>javax.jdo.option.ConnectionURL</name> <value>jdbc:derby:metastore_db;create=true</value> <description>JDBC connect string for a JDBC metastore</description> </property> <property> <name>hive.metastore.schema.verification</name> <value>false></value> </property> <property> <name>hive.metastore.pre.event.listeners</name> <value>metastorelisteners.OrcMetastoreListener</value> </property>

Now, create a table in Hive and describe it:

Time taken: 2.742 seconds
id int Detailed Table Information Table(tableName:hivetable, dbName:default, owner:hduser,e,,, compressed:false, Time taken: 0.611 seconds, Fetched: 3 row(s)

And that’s it!

Original Link

AIR: Data Intelligence in Qubole

The proliferation of the cloud for Big Data workloads has provided many benefits and has also opened up new areas of optimization. But data teams often do not have the right information and as a result, there are several missed opportunities to:

  • Improve productivity

  • Improve performance

  • Reduce resources consumed

  • Reduce resource cost

To solve this problem, we built an intelligence service that collects data across the stack, analyzes and provides insights and recommendations to all types of personas: data consumers (analysts, data scientists, data engineers), and data service providers (admins) in an organization.

At the Data Platforms 2017 Conference, we announced the early access of data discovery and data model intelligence. The same service — AIR (Alerts, Insights, Recommendations) — is now generally available.

Data Discovery for Data Consumers

Analysts spend most of their time finding the right data for the business problem they are trying to solve. The key to increasing their productivity lies in increasing the agility and democratizing analytics by making it truly self-service. While there are many ways to provide this, data discovery helps analysts get the information about their data model intelligently and intuitively through:

  • Autosuggestion and completion:
    • Usage-based ranking: Collects usage metrics from all users and sorts suggestions intelligently based on top usage.
    • Context-aware suggestions: Knows when to suggest which object (tables, columns, joins, and filters).
  • Search and discovery:
    • Search for all available column names (filter all tables which contain a column) or table names (filter all tables which has the search word and browse through the available columns).
    • Top-down analysis provides insights into how other users query a particular table making it easy to use tables that are unfamiliar.

  • Usage insights provide insights into the most used columns, tables, joins, filters, and the most frequent users enabling analysts to be more context-aware to learn from other expert users and to ramp up on new data sets more efficiently.

  • Statistics provide a high-level summary of a particular table or column.

  • Data preview provides sample data for a particular table or column giving the analyst an overview of the data that they are going to query.

  • Data profile provides a summary of the data itself, such as for column cardinality, number of rows with unique values for a particular column, and number of nulls or zeros for a particular column.

Data Model Intelligence for Data Service Providers

Data admins, on the other hand, face a different set of problems. In addition to productivity, cost and performance of the infrastructure play a big role. One important factor is data model design. Relevant questions on data model design are:

  • Do I have the right data to make intelligent decisions?

  • Am I optimizing the data model efficiently? What is the right strategy given my specific setup?

In order to solve this problem, we built data model usage intelligence, which provides insights (for “All Tables” as well as “Hot Tables”) and recommendations to help the data services team identify opportunities for optimizing the data model (partition, sorting, changing the data format, etc.).

AIR infrastructure extracts insights and recommendations based on usage metrics such as the following.

Data Model Insights

  • Most frequently used tables, columns, and partitions.

  • Top users of specific tables, columns, and partitions. These users may be the subject matter experts on the data in these tables.

  • Distribution of I/O or compute resource consumed across users, tables, and commands.

Qubole Data Service (QDS) can report on all tables, or hot tables specifically.

All Tables

This view:

  • Summarizes total commands and average execution time.

  • Shows a graph of the volume of commands per day.

  • Shows a graph of total execution time in seconds per day.

  • Shows a histogram of the number of tables that were joined together.

  • Shows a leaderboard of the top users in terms of query count, with other data such as total commands submitted by each user, total errors, and total execution time (for all the queries submitted by this user).

Hot Tables

This view provides:

  • A pie chart showing the percentage of commands using hot tables that:
    • Use as a filter the column on which the table is partitioned
    • Use as a filter a column on which a table is not partitioned
    • Do not use any filter
  • A high percentage in the first category indicates that the table is correctly partitioned; a high percentage in the second category may mean that the table needs to be re-partitioned.
  • A pie chart showing the percentage of commands using hot tables that are in:
    • Columnar format (ORC, Avro, Parquet)
    • Non-columnar format
  • This highlights the overall efficiency of the data models and helps to determine if the data models have to be converted into a columnar format to improve query performance and reduce the overall data scans.

  • A graph showing the above analysis broken out by individual hot tables so that admins can decide which table to prioritize for repartitioning.

  • A table showing the most used hot tables and their relevant data formats.

  • A table showing the most-used columns for each hot table in ascending order.

  • A table showing the top used Join groups and the count of how many times have they been used in queries.

These insights drive recommendations to improve usability, performance, and cost, such as the following.

Data Model Recommendations

  • Identify the right partition column (based on top used predicate) for a specific table.

  • Identify tables that are not in columnar format and recommend the right data format (ORC, Parquet).

  • Recommend the right list of column(based on top used predicate) based on which a particular table can be sorted.

But this is barely the tip of the iceberg. What else is possible with this intelligence infrastructure? Can we expand this other to other areas of the QDS technology stack? Can this help analyze the workloads and answer the below questions?

  • Can the user reduce the cost for the same performance? Cost can be saved by using smaller machines or smaller clusters.

  • At what cost can a workload run faster?

  • Is my AWS Spot Node policy effective?

  • Will an heterogenous cluster help me?

In Summary

We’re committed to building data-driven products that help organizations increase the productivity of data team members and performance while reducing TCO of big data initiatives. In this blog post, we’ve introduced two such new products that are built on an infrastructure that collects data from Hive, Spark, Presto, and Hadoop clusters to provide Alerts, Recommendations, and Insights (AIR).

Original Link

Hive, Presto, and Spark SQL Engine Configuration

Execution engines like M/R, Tez, Presto, and Spark provide a set of knobs or configuration parameters that control the behavior of the execution engine. In this article, we will describe an approach to determine a good set of parameters for SQL workloads and some surprising insights that we gained in the process.

It is tricky to find a good set of parameters for a specific workload. The list of parameters is long and many of the parameters are correlated in unexpected ways. For example, in M/R, mapper memory and input split size are correlated since a good value for the memory parameter depends on the split size.

Typically, the ETL engineer will determine a set of parameters after analyzing a few important workloads. These parameters may not be optimal for all workloads. Moreover, as the queries and data change, the parameters may not be optimal over time. An automated approach that can recommend an optimal set of configuration values for each workload is the only scalable option.

The summary of our research is:

  • The optimization function should be a function of dollar cost and throughput. Admins typically focus on one of them, leading to sub-optimal configurations.
  • In general, execution engine configuration is sub-optimal in the field. In every experiment, a large fraction of the queries could be optimized by >60%. This points to the fact that manual efforts to choose optimal configuration falls short in most cases.
  • Optimal configuration can be determined by running a workload iteratively with different values but the methodology is too expensive and impractical.
  • A simple model of the execution engine provides very good recommendations for SQL workloads. The model eliminates the need for actual execution.
  • The model is generic and is applicable to all M/R, Tez, Presto, and Spark engines.
  • The model can be used to automate and provide insights and recommendations for engine configuration.

Related Work

Existing approaches to search for optimal configuration can be broadly classified into two types:

  1. Iterative execution: In this approach, jobs are executed multiple times with different configuration parameters to arrive at the optimal configuration. As the parameter space is huge, these approaches focus on techniques to converge towards a good configuration using lesser number of steps. For example, Sandeep et al. use a gradient named noisy gradient to converge to a solution and applies stochastic optimization to parameter tuning.
  2. Mathematical model: In this approach, a mathematical model is used to predict runtime/cost of jobs for a particular configuration. The search over the parameter space to find the optimal configuration can then be performed using the model, without having to actually run the job. Examples of this approach are Starfish and BigExplorer.

Qubole Study for SQL Workloads

The above methods optimize configuration from the perspective of an engine. The methods do not consider the type of workload — SQL or programmatic (M/R or Scala code). The major advantage is that the methods are generally applicable. The major disadvantage is that the number of parameters is huge. The page lists more than 100 parameters. The list makes searching the parameter space or building a model hard. SQL workloads are easier to model because there is a finite set of operators and a small set of parameters are important, as described in the section on model-based execution below.

Since a large fraction of customer workloads at Qubole are SQL queries run via Hive, Spark, and Presto, we focused on SQL Workloads.

Optimization Methodology

We explored two options to search the space of configuration values: iterative execution and model-based execution. The optimization function for both methodologies is:

Where is the number of containers launched, the following represent container memory and execution time for ith container launched for job respectively:

Image title


Image title

The product is a proxy for the cost of running a container. The sum is a proxy for the cost of running a query.

We chose this metric as it represents the memory and CPU resources consumed and correlates to both the dollar cost and throughput of the big data stack.

We focused on parameters that control parallelism and memory per task. These classes of parameters have the biggest impact on SQL workloads. The specific parameters for M/R and Spark engines is given in the table below:

Image title

Iterative Execution

In this method, we ran Hive queries with various configuration parameters and chose the best among them. We employed the following strategies to reduce the parameter space:

  1. Parameter reduction: As described above, we focused on a small set of configuration parameters.
  2. Discretization: We further discretized each parameter so that we try few values rather than all possible values for each parameter.
  3. Reduce search range: For each parameter, there could be a large range of values that are significantly bad. We limited the search to within a good range for each dimension using heuristics. We identified a range by talking to experts.
  4. Assume dimension independence: To prevent parameter space explosion due to correlation, we ignored their dependence on each other.

We implemented an iterative algorithm that searched the space of all configuration values based on these constraints. The figure above shows the steps to search optimal values for two parameters.

  1. The search space for two parameters.
  2. Discrete values are explored in both axes.
  3. Search space is restricted based on domain knowledge.
  4. Algorithm iterates through each parameter, chooses the optimal point, and then moves to the next parameter.

Experimental Results

We used the algorithm to optimize three customer Hive queries. We observed following percentage reduction over settings chosen by the DBAs:

We saw very good improvement in our cost metric. However, this method has two major disadvantages:

  1. Cost: The experiment cost $5,000. The customer had 1,000 more queries. It is possible to make the search more efficient and reduce the number of iterations. Since customers have hundreds or thousands of queries, even 10- or 50-fold reduction is not sufficient to make the approach economical.
  2. Shadow clusters and tables: For ETL queries, the approach requires shadow clusters and queries. The queries had to be reviewed multiple times to make sure production clusters and tables were not affected. The cost in terms of man-hours was also exorbitant.

Model-Based Execution

Since iterative execution is impractical at scale, we considered a model-based approach to eliminate execution of queries. We created an execution model that replicated an execution engine. The model is based on the reduced set of parameters only and is therefore relatively simpler to other approaches.

The cost model also takes statistics about data sizes and selectivities of various operators as input. There are two ways to get these statistics:

  1. Collect metrics from a previous run. This approach is suitable for ETL or reporting queries. In QDS, these metrics are available in the Qubole Data Warehouse.
  2. Statistics from the database catalog. This approach is suitable for ad hoc queries. In QDS, customers can collect these statistics by turning on Automatic Statistics Collection.

The model outputs the result of the optimization function described above.

Experimental Results

To quantify the prediction error by the model, we ran an experiment on four queries of a customer. The graph below shows the benefit predicted by our model and the actual observed benefit for these queries. The actual savings closely match the predicted savings indicating that the model is sufficiently accurate.

Key Insights to Optimize Workloads

We gained a few key insights to optimize SQL workloads through multiple experiments and trials on customer queries. These are in order of priority:

1. Container Shape Should Match the Instance Family Shape

Yarn allocates containers on two dimensions: memory and vCPU. Each container is given one vCPU and some memory. The memory/vCPU of the containers should match the memory/vCPU ratio of the machine type. Otherwise, resources are wasted!

2. Avoid Spills in Tasks

Spills are expensive because each spill leads to an extra write and read of all the data. Spills should be avoided at all costs. Spills can be avoided by providing adequate memory to each task.

3. Decrease Memory per Task to Choose a Cheaper Instance Type

On cloud platforms, machines with higher memory/CPU are more expensive for the same CPU type. Decrease the memory per task and consequently increase parallelism to choose a cheaper instance type. As long as tasks do not spill, the total work done in terms of IO, CPU, and network traffic is independent of the parallelism factor of the tasks. For example, the total data read and processed will be the same if the number of mappers is 100 or 1,000.

If a job can be tuned to avoid spills on a cheaper instance with the same compute but less memory than the original instance, then it is generally a good idea to move to a cheaper instance for saving cost without any performance degradation.

4. Beware of Secondary Effects of High Parallelism

On the other hand, parallelism cannot be increased indefinitely. There are secondary effects of increasing the number of tasks. For example, every task has to pay the cost of JVM start if applicable. Also, there is an increase in the number of communication channels. Thus, parallelism should be not be set so high that secondary effects drown the increase in performance. This limit is specific to a workload or query and cluster configuration and can be determined algorithmically.

5. For Spark, Prefer Fat Executors

This insight is specific to Spark, where there is an additional parameter of cores per executor. Given a certain number of cores per machine, we have a choice of either running many executors with fewer cores per executor (thin executors) or fewer executors with more cores per executor (fat executors). We have observed that for Spark, fat executors generally provide better performance. This is because of several reasons such as better memory utilization across cores in an executor, reduced number of replicas of broadcast tables, and less overhead due to more tasks running in the same executor process.

Stay Tuned

This automated discovery of insights uses the simple cost model for SQL workloads. The data collected through an automatic statistics collector will also be implemented for non-SQL workloads such as data science and machine learning.

If you’re interested in QDS, sign up today for a risk-free trial.


  1. Herodotou, H., Lim, H., Luo, G., Borisov, N., Dong, L., Cetin, F.B. and Babu, S., 2011, January. Starfish: A Self-Tuning System for Big Data Analytics. In Cidr (Vol. 11, No. 2011, pp. 261-272).
  2. Chao-Chun Yeh, Jiazheng Zhou, Sheng-An Chang, Xuan-Yi Lin, Yichiao Sun, Shih-Kun Huang, “BigExplorer: A configuration recommendation system for big data platform,” Technologies and Applications of Artificial Intelligence (TAAI) 2016 Conference on, pp. 228-234, 2016, ISSN 2376-6824.
  3. Sandeep Kumar, Sindhu Padakandla, Chandrashekar L, Priyank Parihar, K Gopinath, Shalabh Bhatnagar: Performance Tuning of Hadoop MapReduce: A Noisy Gradient Approach (link)

Original Link