NoSQL vs. SQL: Differences Explained

A common thread in technology — or for that matter, just about anything humans attempt over time — is evolution. It’s not unusual, then, to see things grow and adapt as the world around them changes. That very adaptation means that sometimes older variations of our favorite tools remain, while newer variations appear alongside them to better serve certain, specific purposes.

The database world is no exception. Since the days of E.F. Codd and his introduction of the relational data model, we’ve seen data stores and their requirements grow from purely transactional models with ACID guarantees (Atomicity, Consistency, Isolation, Durability) to demands for more flexibility, the ability to scale (via distributed deployment), and speed (in terms of latency and throughput). These improvements often come at the expense of what made RDBMS solutions preferable in the first place. But how do you know?

Original Link

Multiple Databases With Shared Entity Classes in Spring Boot and Java

Hello, everyone! It has been a few months since my last post; I have been busy traveling and relocating. In this post, I want to illustrate how a Spring Boot application can have multiple data sources with shared entity classes. The need for this arose in my current project, where an in-memory database was needed for high performance and a persistent database for storage.

In this blog post, I will use H2 for the in-memory database and Postgres for the persistent storage. I will set up the application and show how the entities can be passed from one data source to another.

Original Link

Fix SQL Server With One Click

Tempting headline, isn’t it? It might even seem like clickbait, but that’s not the intention.

The SQL Server default configuration is not recommended for production environments, and yet I have worked on many production environments that have been set up by people who don’t know that the default configurations are not recommended. These same people shouldn’t have to care either, because their job is to install SQL Server so that the vendor product they’ve purchased can work as required.

Original Link

Consistent Hashing

I wrote about consistent hashing in 2013 while working at Basho, where I had started a series called “Learning About Distributed Databases.” Today, after a few years (OK, after five or so years!), I’m kicking that series back off with this post on consistent hashing.

As with Riak, which I wrote about in 2013, Cassandra remains one of the core active distributed database projects today, providing an effective and reliable consistent hash ring for a clustered, distributed database system. A hash function is an algorithm that maps data of variable length to data of a fixed length. Consistent hashing is a kind of hashing that uses this mapping to assign keys to particular nodes around the ring in Cassandra. One can think of this as a kind of Dewey Decimal Classification system where the cluster nodes are the various bookshelves in the library.

Original Link

DynamoDB PrimaryKey, HashKey, SortKey (RangeKey)

Last week, I came across DynamoDB. Over the past few years, I have been fascinated by how the industry went from relational to NoSQL to NewSQL and then spread to all directions, collapsing into MySQL/Postgres, etc. The whole thing is both funny and fascinating.

From my past experience, whenever we used a KV store, we paid big time. Scalability comes with a check: you pay by losing features and you gain high performance, so you have to make a tradeoff. Economy 101.
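For readers new to the terms in the title: in DynamoDB, the HashKey (partition key) decides which partition stores an item, and the optional SortKey (RangeKey) orders items within that partition; together they form the primary key. A sketch of the table-definition parameters in the AWS SDK v2 shape (the table and attribute names here are made up for illustration):

```javascript
const params = {
  TableName: 'Orders',
  // HashKey + RangeKey together form a composite primary key.
  KeySchema: [
    { AttributeName: 'customerId', KeyType: 'HASH' },  // partition key
    { AttributeName: 'orderDate', KeyType: 'RANGE' }   // sort key
  ],
  AttributeDefinitions: [
    { AttributeName: 'customerId', AttributeType: 'S' },
    { AttributeName: 'orderDate', AttributeType: 'S' }
  ],
  BillingMode: 'PAY_PER_REQUEST'
};

console.log(params.KeySchema.length); // 2 key parts = composite primary key
```

With this schema, you can fetch all of one customer’s orders in date order with a single Query on the partition key, which is exactly the feature/performance tradeoff described above.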

Original Link

An Introduction to DBMS Types

This article will be of interest to those learning about databases who don’t have much prior knowledge of the different types or of the terminology involved. It provides a brief review and links to further information, which I hope is useful to anyone starting to work with databases. If you’re already a wizard with SQL or Neo4j, you won’t need to read any further!

If you’re still with us, let’s first introduce some terminology. A database collects and organizes data, while access to that data is typically via a “database management system” (DBMS), which manages how the data is organized within the database. This article discusses some of the ways a DBMS may organize data. It reviews the difference between relational database management systems (RDBMS) and NoSQL.

Original Link

Solving Invisible Scaling Issues with Serverless and MongoDB

Don’t follow blindly, weigh your actions carefully.

Ever since software engineering became a profession, we have been trying to serve users all around the globe. With this comes the issue of scaling and how to solve it. Often, these thoughts of scaling our software to unimaginable extents are premature and unnecessary.

This has turned into something else altogether with the rise of serverless architectures and back-end-as-a-service providers. Now we’re not facing issues of how to scale up and out, but rather how to scale our database connections without creating heavy loads.

With the reduced insight we have into the underlying infrastructure, there’s not much we can do except write sturdy, efficient code and use appropriate tools to mitigate the issue.

Or is there?

How Do Databases Work with Serverless?

With a traditional server, your app will connect to the database on startup. Quite logical, right? The first thing it does is hook up to the database via a connection string, and only when that’s done will the rest of the app initialize.

Serverless handles this a bit differently. The code actually runs for the first time only once you trigger a function, meaning you have to both initialize the database connection and interact with the database during the same function call.

Going through this process every time a function runs would be incredibly inefficient and time-consuming. This is why serverless developers utilize a technique called connection pooling: create the database connection only on the first function call and re-use it for every consecutive call. Now you’re wondering: how is this even possible?

The short answer is that a lambda function is, in essence, a tiny container. It’s created and kept warm for an extended period of time, even though it is not running all the time. Only after it has been inactive for over 15 minutes will it be terminated.

This gives us a time frame of 15 to 20 minutes where our database connection is active and ready to be used without suffering any performance loss.

Using Lambda with MongoDB Atlas

Here’s a simple code snippet for you to check out.

// db.js
const mongoose = require('mongoose')

const connection = {}

module.exports = async () => {
  if (connection.isConnected) {
    console.log('=> using existing database connection')
    return
  }
  console.log('=> using new database connection')
  const db = await mongoose.connect(process.env.DB)
  connection.isConnected = db.connections[0].readyState
}

Once you take a closer look at the code above, you can see it makes sense. At the top, we’re requiring mongoose and initializing an object called connection. There’s nothing more to it. We’ll use the connection object as a cache to store whether the database connection exists or not.

The first time the db.js file is required and invoked it will connect mongoose to the database connection string. Every consecutive call will re-use the existing connection.

Here’s what it looks like in the handler which represents our lambda function.

const connectToDatabase = require('./db')
const Model = require('./model')

module.exports.create = async (event) => {
  try {
    const db = await connectToDatabase()
    const object = await Model.create(JSON.parse(event.body))
    return {
      statusCode: 200,
      body: JSON.stringify(object)
    }
  } catch (err) {
    return {
      statusCode: err.statusCode || 500,
      headers: { 'Content-Type': 'text/plain' },
      body: 'Could not create the object.'
    }
  }
}

This simple pattern will make your lambda functions cache the database connection and speed them up significantly. Pretty cool, huh? 

All of this is amazing, but what if we hit the cap of connections our database can handle? Well, great question! Here’s a viable answer.

What about Connection Limits?

If capping your connection limit has you worried, then you might think about using a back-end-as-a-service to solve this issue. It would ideally create a pool of connections your functions would use without you having to worry about hitting the ceiling. Implementing this means the provider gives you a REST API that handles the actual database interaction, while you only consume the API.

Hardcore readers will think about creating such an API themselves to house the connection pool, or using something like GraphQL. Both of those solutions are great for whichever use case fits you best. But I’ll focus on using off-the-shelf tools to get up and running quickly.

Using Lambda with MongoDB Stitch

If you’re a sucker for MongoDB, like I am, you may want to check out their back-end-as-a-service solution called Stitch. It gives you a simple API to interact with the MongoDB driver. You just need to create a Stitch app, connect it to your already running Atlas cluster, and you’re set. In the Stitch app, make sure to enable anonymous login and create your database name and collection.

Install the mongodb-stitch npm module and reference your Stitch app ID in your code, then start hitting the APIs.

const { StitchClientFactory, BSON } = require('mongodb-stitch')
const { ObjectId } = BSON

const appId = 'notes-stitch-xwvtw'
const database = 'stitch-db'
const connection = {}

module.exports = async () => {
  if (connection.isConnected) {
    console.log('[MongoDB Stitch] Using existing connection to Stitch')
    return connection
  }
  try {
    const client = await StitchClientFactory.create(appId)
    const db = client.service('mongodb', 'mongodb-atlas').db(database)
    await client.login()
    const ownerId = client.authedId()
    console.log('[MongoDB Stitch] Created connection to Stitch')
    connection.isConnected = true
    connection.db = db
    connection.ownerId = ownerId
    connection.ObjectId = ObjectId
    return connection
  } catch (err) {
    console.error(err)
  }
}

As you can see, the pattern is very similar. We create a Stitch client connection and re-use it for every subsequent request.

The lambda function itself looks almost the same as the example above.

const connectToDatabase = require('./db')

module.exports.create = async (event) => {
  try {
    const { db } = await connectToDatabase()
    const { insertedId } = await db.collection('notes')
      .insertOne(JSON.parse(event.body))
    const addedObject = await db.collection('notes')
      .findOne({ _id: insertedId })
    return {
      statusCode: 200,
      body: JSON.stringify(addedObject)
    }
  } catch (err) {
    return {
      statusCode: err.statusCode || 500,
      headers: { 'Content-Type': 'text/plain' },
      body: 'Could not create the object.'
    }
  }
}

Seems rather similar. I could get used to it. However, Stitch has some cool features out of the box, like authentication and authorization for your client connections, which makes it really easy to secure your routes.

How to Know If It Works?

To make sure I know which connection is being used at every given time, I use Dashbird’s invocation view to check my Lambda logs.

[Image: Lambda invocation log creating a new connection]

Here you can see it’s creating a new connection on the first invocation while re-using it on consecutive calls.

[Image: Lambda invocation log using the existing connection]

The service is free for 14 days, so you can check it out if you want.


Wrapping up

In an ideal serverless world, we don’t need to worry about capping our database connection limit. However, the number of users required to hit your APIs before you reach this scaling issue is huge. The example above shows how you can mitigate the issue by using back-end-as-a-service providers. Even though Stitch is not yet mature, it is made by MongoDB, which is an amazing database. And using it with AWS Lambda is just astonishingly quick.

To check out a few projects that use both of the connection patterns shown above, jump over here:

If you want to read some of my previous serverless musings head over to my profile or join my newsletter!

Or, take a look at a few of my other articles regarding serverless:

Hope you guys and girls enjoyed reading this as much as I enjoyed writing it. Until next time, be curious and have fun.

Original Link

Short Walks — Setting up a Foreign Key Relationship in Entity Framework [Snippet]

Having had to search for this for the fiftieth time, I thought I’d document it here, so I knew where to look!

To set up a foreign key relationship in EF, the first step is to define your classes.

In this case, each Resource has a ResourceType in a simple one-to-many relationship. In the lookup table (in this case, ResourceType), define the key:

public class ResourceType
{
    [Key]
    public int Id { get; set; }

    public string Name { get; set; }
}

(You’ll need to reference System.ComponentModel.DataAnnotations.)

Then, in the main table, in this case Resource, map a field to the lookup, and then tell it how to store that in the DB:

public class Resource
{
    public int Id { get; set; }

    public int ResourceTypeId { get; set; }

    [ForeignKey("ResourceTypeId")]
    public ResourceType ResourceType { get; set; }

    public string Name { get; set; }
}

(You’ll need to reference System.ComponentModel.DataAnnotations.Schema.)

That’s it. Once you run Add-Migration, you should have a foreign key relationship set up.

Original Link

Databases in Containers

This post was originally published here.

Database containerization has emerged amid various critiques here and there. Data insecurity, specific resource requirements, and network problems are often cited as significant drawbacks of the practice. Nevertheless, container usage has been on the increase, and so, too, has the practice of containerizing databases.

Container usage is now being applied by organizations of all sizes, from small startups to huge established microservices platforms. Even prominent database players like Google, Amazon, Oracle, and Microsoft have adopted containerization. This article aims to help beginners navigate the minefield of database containerization and avoid some of the major pitfalls that can occur. Note, we are not recommending its usage, but if you feel the need, then hopefully this will help.

But what is database containerization?

Understanding Database Containerization

Database containerization packages a database, together with its operating environment, inside a container, so that it can be loaded onto a virtual machine and run independently.

Here are four factors that support the use of the database in containers.

1. Usage of the same configuration or ports for all containers

This setup eliminates some of the overhead that comes with a distributed system supporting different node types. Such a distributed system requires maintaining separate containers, each with its own configuration. Database containerization supports a single kind of configuration.

2. Resilience, resources, and storage

Containers aren’t meant to persist data inside them. In traditional database scenarios, there is often the need for data replication, or for data to be exported from a central storage system, which makes the process expensive and significantly slows performance.

Databases act like any other server-side app, except they are typically more CPU- and memory-intensive, are highly stateful, and make heavy use of storage. All of these concepts work the same in containers. On top of that, it’s possible to manage state, limit resources, and restrict network access.

3. Cluster upscale or downscale

This practice addresses the uncertainty of how successful an application will be, and the volume it will require, by improving the elasticity of its infrastructure. Database containerization accommodates application elasticity: growing when needed and shrinking back to the infrastructure actually required. Adding more nodes to clusters can help rebalance data in the background.

4. Data locality and networking

Network scaling has been a significant challenge in modern virtualized data centers. Usually, load balancers take all the traffic first and then distribute it to the application containers. The application containers then have to communicate with the databases, thereby creating more traffic. Containerization brings the database and the application a little closer together, alleviating some of the networking issues.

Efficiently Deploy Databases in Containers

Putting databases in containers comes with inherent obstacles to overcome. Databases have some fundamental properties that make them hard to containerize effectively: they must handle persistent storage of data, which is critical; they need disk space to store large amounts of data; and they require complex configuration layers, all of which limit database containerization. The practice also suffers from the need for high-throughput, low-latency networking.

If you are going to put your database in your container then it’s advisable to use the container orchestration platform Kubernetes. The StatefulSets feature of K8s was designed to overcome the very problems that occur when attempting to build and run database clusters inside containers.

If you really and truly have to go down this path, where possible try to use stable Helm Charts to help you get there. Note, though, that this doesn’t mean the deployment will be as stable as a managed service, but a lot of the heavy lifting will be done for you. K8s will build, deploy, and label your containers concurrently, and the self-healing element maintains your cluster’s health. Ensure the Chart you choose implements the database with StatefulSets and persistent volume claims (to preserve the actual data in the event of a failure).

Why StatefulSets?

  • StatefulSets run your containers and orchestrate everything together while making pods more suited to stateful applications.
  • Storage is stable and persistent.
  • StatefulSet pods each have a few unique attributes:
    • They’re all labeled with an ordinal name, which allows for stable identification.
    • Pods are built one at a time instead of all at once, which is helpful when bootstrapping a stateful system.
    • Pod rescheduling is stable and persistent.
    • Pods can be shut down gracefully when you scale down the number of replicas needed, which is very useful for databases!
  • StatefulSets mount a persistent storage volume to where your database saves its data.
  • With StatefulSets, you can use a “sidecar” container to help your main container do the necessary work.
  • Just like with Kubernetes ReplicaSets, with StatefulSets you can scale nodes easily with kubectl scale.
  • Rolling updates are ordered and graceful.
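To tie the list above together, here is a minimal sketch of what such a deployment can look like. The image, names, replica count, and storage size are all assumptions for illustration, and a production Helm Chart does considerably more:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres        # stable network identities: postgres-0, postgres-1, ...
  replicas: 2
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:10
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:        # one PersistentVolumeClaim per pod, survives rescheduling
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```

The volumeClaimTemplates section is the key part for databases: each pod gets its own persistent volume, and that volume follows the pod if it is rescheduled.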

There you have it. If you really must put your database in a container, then use Kubernetes StatefulSets to help you get there. Most importantly, though, ask yourself why you’re doing it in the first place, and whether you really need to.

Caylent offers DevOps-as-a-Service to high growth companies looking for help with microservices, containers, cloud infrastructure, and CI/CD deployments. Our managed and consulting services are a more cost-effective option than hiring in-house and we scale as your team and company grow. Check out some of the use cases and learn how we work with clients by visiting our DevOps-as-a-Service offering.

Original Link

It’s Time for a Single Property Graph Query Language [Vote Now]

The time has come to create a single, unified property graph query language.

Different languages for different products help no one. We’ve heard from the graph community that a common query language would be powerful: more developers with transferable expertise, portable queries, solutions that leverage multiple graph options, and less vendor lock-in.

One language, one skill set.

The Property Graph Space Has Grown…A Lot

Property graph technology has a big presence, from Neo4j and SAP HANA to Oracle PGX and Amazon Neptune. An international standard would accelerate the entire graph solution market, to the mutual benefit of all vendors and — more importantly — all users.

That’s why we are proposing a unified graph query language, GQL (Graph Query Language), that fuses the best of three property graph languages.

Relational Data Has SQL, and Property Graphs Need GQL

Although SQL has been fundamental for relational data, we need a declarative query language for the powerful — and distinct — property graph data model to play a similar role.

Like SQL, the new GQL needs to be an industry standard. It should work with SQL but not be confined by SQL. The result would be better choices for developers, data engineers, data scientists, CIOs, and CDOs alike.

Right now, there are three property graph query languages that are closely related. We have Cypher (from Neo4j and the openCypher community), we have PGQL (from Oracle), and we have G-CORE, a research language proposal from the Linked Data Benchmark Council [LDBC] (co-authored by world-class researchers from the Netherlands, Germany, Chile, and the U.S., and technical staff from SAP, Oracle, Capsenta, and Neo4j).

The proposed GQL (Graph Query Language) would combine the strengths of Cypher, PGQL, and G-CORE into one vendor-neutral and standardized query language for graph solutions, much like SQL is for RDBMS.

Each of these three query languages has similar data models, syntax, and semantics. Each has its merits and gaps, yet their authors share many ambitions for the next generation of graph querying, such as a composable graph query language with graph construction, views, and named graphs, and a pattern-matching facility that extends to regular path queries.

Let Your Voice Be Heard on GQL

The Neo4j team is advocating that the database industry and our users collaborate to define and standardize one language.

Bringing PGQL, G-CORE, and Cypher together, we have a running start. Two of them are industrial languages with thousands of users; combined with the enhancements of a research language, they share a common heritage of ASCII-art patterns to match, merge, and create graph models.

What matters most right now is a technically strong standard with strong backing among vendors and users. So we’re appealing for your vocal support.

Please vote now on whether we should unite to create a standard Graph Query Language (GQL), in the same manner as SQL.

Should the property graph community unite to create a standard Graph Query Language, GQL, alongside SQL?

For more information, you can read the GQL manifesto here and watch for ongoing updates.

Emil Eifrem, CEO;
Philip Rathle, VP of Products;
Alastair Green, Lead, Query Languages Standards & Research;
for the entire Neo4j team

Original Link

Data Fabric for Kubernetes Extended

Thanks to Jack Norris, S.V.P. Data and Applications at MapR Technologies, Inc., for introducing me to the MapR Data Fabric for Kubernetes, which addresses the limitations of container use by providing easy and full data access from within and across clouds and on-premise deployments. The data fabric enables stateful applications to be deployed in containers for production use cases, machine learning pipelines, and multi-tenant use cases.

A typical issue with containers is that most organizations only look at containerizing lightweight, ephemeral apps. They haven’t been able to do this with stateful apps, since it is more complex to control data access and to determine how to maintain access once the data has been moved. As data volumes scaled and databases were added from disparate locations, solutions would break down. Security of the data is also a concern, because organizations are not able to replicate authorization and access.

With a natively integrated Kubernetes volume driver, MapR provides persistent storage volumes for access to any data (from databases, files, and streams) located on-premises, across clouds, and at the edge. The data fabric’s extension to Kubernetes also provides scheduled automation for multi-tenant, containerized, and non-containerized applications located inside and outside of a MapR cluster.


“Stateful and data-driven applications can’t elegantly live in the cloud without an elegant means for persisting state and making it available, securely and robustly, to containerized microservices,” says James Kobielus, lead analyst at SiliconANGLE Wikibon. “Container technology has traditionally failed to address the data portability challenge. Ideally, developers should be able to build containerized applications that can directly access persisted data volumes of any scale. Likewise, data architectures and operations personnel should be able to ensure this data remains available to containerized apps regardless of the platforms to which those containers have been moved.”

Using the MapR Data Fabric for Kubernetes, organizations can enable a global, flexible data fabric that provides high-performance access to data as if it were local and can benefit from enterprise security protection, container high availability, snapshots, mirroring, and disaster recovery.  

“MapR provides the flexibility, elasticity, and simplicity for next-gen application deployment, eliminating concerns about how, where, and if the underlying platform can grow with your data and business needs,” says Anil Gadre, Chief Product Officer at MapR Technologies. “We provide a unique advantage for our customers by enabling them to build a data fabric that extends to disparate environments, where they can capture, store, process and analyze any type of data,” continues Anil. “Extending the data fabric to Kubernetes is a needed advancement to accelerate the deployment of Containerized applications in Enterprises while allowing them to harness value from their data.”

The new container extension also brings comprehensive cloud capabilities to the Data Fabric, including:

  • Differentiated data services within a cloud through data synchronization and integrity across availability zones to meet high availability requirements.

  • Cross-cloud data bursting to support cloud neutral deployments with the ability to optimize application processing for cost, performance, and compliance. Cross-data access support includes NFS, S3, HDFS, and ODBC. 

  • Easy on-ramp from on-premises and private cloud deployments to public cloud.

Original Link

PostgreSQL 10: a Great New Version for a Great Database

Reuven reviews the latest and most interesting features in PostgreSQL 10.

PostgreSQL has long claimed to be the most advanced open-source relational database. For those of us who have been using it for a significant amount of time, there’s no doubt that this is true; PostgreSQL has consistently demonstrated its ability to handle high loads and complex queries while providing a rich set of features and rock-solid stability.

But for all of the amazing functionality that PostgreSQL offers, there have long been gaps and holes. I’ve been in meetings with consulting clients who currently use Oracle or Microsoft SQL Server and are thinking about using PostgreSQL, who ask me about topics like partitioning or query parallelization. And for years, I’ve been forced to say to them, “Um, that’s true. PostgreSQL’s functionality in that area is still fairly weak.”

So I was quite excited when PostgreSQL 10.0 was released in October 2017, bringing with it a slew of new features and enhancements. True, some of those features still aren’t as complex or sophisticated as you might find in commercial databases. But they do demonstrate that over time, PostgreSQL is offering an amazing amount of functionality for any database, let alone an open-source project. And in almost every case, the current functionality is just the first part of a long-term roadmap that the developers will continue to follow.

In this article, I review some of the newest and most interesting features in PostgreSQL 10—not only what they can do for you now, but what you can expect to see from them in the future as well. If you haven’t yet worked with PostgreSQL, I’m guessing you’ll be impressed and amazed by what the latest version can do. Remember, all of this comes in an open-source package that is incredibly solid, often requires little or no administration, and which continues to exemplify not only high software quality, but also a high-quality open-source project and community.

PostgreSQL Basics

If you’re new to PostgreSQL, here’s a quick rundown: PostgreSQL is a client-server relational database with a large number of data types, a strong system for handling transactions, and functions covering a wide variety of tasks (from regular expressions to date calculations to string manipulation to bitwise arithmetic). You can write new functions using a number of plugin languages, most commonly PL/PgSQL, modeled loosely on Oracle’s PL/SQL, but you also can use languages like Python, JavaScript, Tcl, Ruby and R. Writing functions in one of these extension languages provides you not only with the plugin language’s syntax, but also its libraries, which means that if you use R, for example, you can run statistical analyses inside your database.

PostgreSQL’s transactions are handled using a system known as MultiVersion Concurrency Control (MVCC), which reduces the number of times the database must lock a row. This doesn’t mean that deadlocks never happen, but they tend to be rare and are relatively easy to avoid. The key thing to understand about PostgreSQL’s MVCC is that deleting a row doesn’t actually delete it, but merely marks it as deleted, indicating that it should no longer be visible after a particular transaction. When all active transaction IDs are greater than that transaction’s ID, the row’s space can be reclaimed and/or reused, a process known as “vacuuming”. This system also means that different transactions can see different versions of the same row at the same time, which reduces locks. MVCC can be a bit hard to understand, but it is part of PostgreSQL’s success, allowing you to run many transactions in parallel without worrying about who is reading from or writing to what row.
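As a tiny illustration of those visibility rules (the table and column names here are made up):

```sql
-- Inside transaction A:
BEGIN;
DELETE FROM events WHERE id = 42;   -- the row version is marked dead, not erased

-- Meanwhile, a transaction B that started before A committed
-- still sees the row with id = 42 in its own snapshot.

COMMIT;  -- once no open transaction can still see the old version...

VACUUM events;  -- ...its space can be reclaimed and reused
```

In practice, you rarely run VACUUM by hand; PostgreSQL’s autovacuum daemon does this in the background.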

The PostgreSQL project started more than 20 years ago, thanks to a merger between the “Postgres” database (created by Michael Stonebraker, then a professor at Berkeley, and an expert and pioneer in the field of databases) and the SQL query language. The database tries to follow the SQL standard to a very large degree, and the documentation indicates where commands, functions, and data types don’t follow that standard.

For two decades, the PostgreSQL “global development group” has released a new version of the database roughly every year. The development process, as you would expect from an established open-source project, is both transparent and open to new contributors. That said, a database is a very complex piece of software, and one that cannot corrupt data or go down if it’s going to continue to have users, so development tends to be evolutionary, rather than revolutionary. The developers do have a long-term roadmap, and they’ll often roll out features incrementally across versions until they’re complete. Beyond the core developers, PostgreSQL has a large and active community, and most of that community’s communication takes place on email lists.

PostgreSQL 10

Open-source projects often avoid making a big deal out of a software release. After all, just about every release of every program fixes bugs, improves performance and adds features. What does it matter if it’s called 3.5 or 2.8 or 10.0?

That said, the number of huge features in this version of PostgreSQL made it almost inevitable that it was going to be called 10.0, rather than 9.7 (following the previous version, 9.6). What is so deserving of this big, round number?

Two big and important features were the main reasons: logical replication and better table partitions. There were many other improvements, of course, but in this article, I focus on these big changes.

Before continuing, I should note that installing PostgreSQL 10 is quite easy, with ports for many operating systems—including various Linux distributions—readily available. Go to the main PostgreSQL site, and click on the link for “download”. That will provide the instructions you need to add the PostgreSQL distribution to the appropriate package repository, from which you can then download and install it. If you’re upgrading from a previous version, of course, you should be a bit more conservative, double-checking to make sure the data has been upgraded correctly.

I also should note that in the case of Ubuntu, which I’m running on my server, the number of packages available for PostgreSQL 10 is massive. It’s normal to install only the base server and client packages, but there are additional ones for some esoteric data types, foreign data wrappers, testing your queries and even such things as an internal cron system, a query preprocessor and a number of replication options. You don’t have to install all of them, and you probably won’t want to do so, but the sheer number of packages demonstrates how complex and large PostgreSQL has become through the years, and also how much it does.

Logical Replication

For years, PostgreSQL lacked a reasonable option for replication. The best you could do was take the “write-ahead logs”, binary files that described transactions and provided part of PostgreSQL’s legendary stability, and copy them to another server. Over time, this became a standard way to have a slave server, until several years ago when you could stream these write-ahead log (WAL) files to another server. Master-slave replication thus became a standard PostgreSQL feature, one used by many organizations around the world—both to distribute the load across multiple servers and to provide for a backup in the case of server failure. One machine (the master) would handle both read and write queries, while one or more other (slave) machines would handle read-only queries.

Although streaming WALs certainly worked, it was limited in a number of ways. It required that both master and slave use the same version of PostgreSQL, and that the entire server’s contents be replicated on the slave. For reasons of performance, privacy, security and maintenance, those things deterred many places from using PostgreSQL’s master-slave streaming.

So it was with great fanfare that “logical replication” was included in PostgreSQL 10. The idea behind logical replication is that a server can broadcast (“publish”) the changes that are made not using binary files, but rather a protocol that describes changes in the publishing database. Moreover, details can be published about a subset of the database; it’s not necessary to send absolutely everything from the master to every single slave.

In order to get this to work, the publishing server must create a “publication”. This describes what will be sent to subscribing servers. You can use the new CREATE PUBLICATION command to do this.

As I wrote above, replication of the WAL files meant that the entire database server (or “cluster”, in PostgreSQL terminology) needed to be replicated. In the case of logical replication, the replication is done on a per-database basis. You then can decide to create a publication that serves all tables:
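The command itself is a one-liner (the publication name here is illustrative):

```sql
-- Publish changes from every table in the current database,
-- including tables created after the publication is defined.
CREATE PUBLICATION mypub FOR ALL TABLES;
```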


Note that when you say FOR ALL TABLES, you’re indicating that you want to publish not only all of the tables that currently exist in this database, but also tables that you will create in the future. PostgreSQL is smart enough to add tables to the publication when they are created. However, the subscriber won’t know about them automatically (more on that to come).

If you want to restrict things, so that only a specific table is replicated, you can do so with this:
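For instance, assuming a People table (the table name is hypothetical, chosen to match the MyPeoplePub publication that appears in the subscription example later):

```sql
-- Publish changes from a single, already existing table.
CREATE PUBLICATION MyPeoplePub FOR TABLE People;
```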


You also can replicate more than one table:
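Something like this, with the table names again hypothetical:

```sql
-- Publish changes from several existing tables at once.
CREATE PUBLICATION MyTablesPub FOR TABLE People, Addresses;
```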


If you are publishing one or more specific tables, the tables must already exist at the time you create the publication.

The default is to publish all actions that take place on the published tables. However, a publication can specify that it’s going to publish only inserts, updates and/or deletes. All of this is configurable when the publication is created, and can be updated with the ALTER PUBLICATION command later.
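For example (publication and table names illustrative):

```sql
-- Publish only inserts and updates; deletes on the
-- publisher won't propagate to subscribers.
CREATE PUBLICATION inserts_updates FOR TABLE People
    WITH (publish = 'insert, update');

-- Narrow (or widen) the published actions later:
ALTER PUBLICATION inserts_updates SET (publish = 'insert');
```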

If you’re using the interactive “psql” shell, you can take a look at current publications with \dRp, which is short for “describe replication publications”. It’s not the easiest command to remember, but they long ago ran out of logical candidates for single-letter commands. This command will show you which publications have been defined and also what permissions they have (more on that in a moment). If you want to know which tables are included in a publication, you can use \dRp+.

Once you’ve set up the publication, you can set up a subscription with (not surprisingly) the CREATE SUBSCRIPTION command. Here, things are a bit trickier, because the data is actually arriving into the subscriber’s database, which means there might be conflicts or issues.

First and foremost, creating a subscription requires that you have a valid login (user name and password) on the publisher’s system. With that in hand, you can say:

CREATE SUBSCRIPTION mysub CONNECTION 'host=mydb user=myuser' PUBLICATION MyPeoplePub;

Notice that you use a standard PostgreSQL “connection string” to connect to the server. You can use additional options if you want, including setting the port number and the connection timeout. Because a database might have multiple publications, you have to indicate the publication name to which you want to subscribe, as indicated here. Also note that the user indicated in this connection string must have “replication” privileges in the database.

Once the subscription has been created, the data will be replicated from its current state on the publisher.

I’ve already mentioned that using the FOR ALL TABLES option with CREATE PUBLICATION means that even if and when new tables are added, they will be included as well. However, that’s not quite true for the subscriber. On the subscriber’s side, you need to indicate that there have been changes on the publisher and that you want to refresh your subscription:
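Assuming the subscription name mysub from the earlier CREATE SUBSCRIPTION example, the refresh looks like this:

```sql
-- Ask the subscriber to re-read the publication's table
-- list and start replicating any newly added tables.
ALTER SUBSCRIPTION mysub REFRESH PUBLICATION;
```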


If you’ve done any binary replication in previous PostgreSQL versions, you already can see what an improvement this is. You don’t have to worry about WALs, or about them being erased, or about getting the subscribing server up to speed, and so forth.

Now, it’s all well and good to talk about replication, but there’s always the possibility that problems will arise. For example, what happens if the incoming data violates one or more constraints? Under such circumstances, the replication will stop.

There are also a number of caveats regarding what objects are actually replicated. For example, only tables are replicated; objects such as views and sequences are not.

Table Partitioning

Let’s say you’re using PostgreSQL to keep track of invoices. You might want to have an “invoices” table, which you can query by customer ID, date, price or other factors. That’s fine, but what happens if your business becomes extremely popular, and you’re suddenly handling not dozens of customers a month, but thousands or even millions? Keeping all of that invoicing data in a single database table is going to cause problems. Not only are many of the older invoices taking up space on your primary filesystem, but your queries against the table are going to take longer than necessary, because these older rows are being scanned.

A standard solution to this problem in the database world is partitioning. You divide the table into one or more sub-tables, known as “partitions”. Each partition can exist on a different filesystem. You get the benefits of having a single table on a single database, but you also enjoy the benefits of working with smaller tables.

Such partitioning was available in previous versions of PostgreSQL, but although it worked, it was difficult to install, configure and maintain. PostgreSQL 10 added “declarative partitioning”, allowing you to indicate that a table should be broken into separate partitions, meaning that when you insert data into a partitioned table, PostgreSQL finds the appropriate partition and inserts the row there.

PostgreSQL supports two types of partitioning schemes. In both cases, you have to indicate one or more columns on which the partitioning will be done. You can partition according to “range”, in which case each partition will contain data from a range of values. A typical use case for this kind of partition would be dates, such as the invoices example above.

But, you also can partition over a “list” value, which means that you divide things according to values. For example, you might want to have a separate partition for each state in the US or perhaps just for different regions. Either way, the list will determine which partition receives the data.
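A minimal sketch of list partitioning (the table, column and values are invented for illustration):

```sql
CREATE TABLE Customers (
    id     SERIAL,
    name   TEXT NOT NULL,
    region TEXT NOT NULL
) PARTITION BY LIST (region);

-- Each partition declares the list of values it accepts.
CREATE TABLE Customers_west PARTITION OF Customers
    FOR VALUES IN ('CA', 'OR', 'WA');
CREATE TABLE Customers_east PARTITION OF Customers
    FOR VALUES IN ('NY', 'NJ', 'MA');
```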

For example, you can implement the date invoice example from above as follows. First, create an Invoices table:

postgres=# CREATE TABLE Invoices (
postgres(#     id SERIAL,
postgres(#     issued_at TIMESTAMP NOT NULL,
postgres(#     customer_name TEXT NOT NULL,
postgres(#     amount INTEGER NOT NULL,
postgres(#     product_bought TEXT NOT NULL
postgres(# ) partition by range (issued_at);

(And yes, in an actual invoice system, you would be using foreign keys to keep track of customers and products.)

Notice that at the conclusion of the CREATE TABLE command, I’ve added a “partition by range” statement, which indicates that partitions of this table will work according to ranges on issued_at, a timestamp.

But perhaps even more interesting is the fact that id, the SERIAL (that is, sequence) value, is not defined as a primary key. That’s because you cannot have a primary key on a partitioned table; that would require checking a constraint across the various partitions, which PostgreSQL cannot guarantee.

With the partitioned table in place, you now can create the individual partitions:

postgres=# CREATE TABLE issued_at_y2018m01 PARTITION OF Invoices
postgres-# FOR VALUES FROM ('2018-jan-01') to ('2018-jan-31');
CREATE TABLE
postgres=# CREATE TABLE issued_at_y2018m02 PARTITION OF Invoices
postgres-# FOR VALUES FROM ('2018-feb-01') to ('2018-feb-28');

Notice that these partitions don’t have any column definition. That’s because the columns are dictated by the partitioned table. In psql, I can ask for a description of the first partition. See Table 1 for an example of what this would look like.

Table 1. public.issued_at_y2018m01

 Column         | Type                        | Collation | Nullable | Default
----------------+-----------------------------+-----------+----------+--------------------------------------
 id             | integer                     |           | not null | nextval('invoices_id_seq'::regclass)
 issued_at      | timestamp without time zone |           | not null |
 customer_name  | text                        |           | not null |
 amount         | integer                     |           | not null |
 product_bought | text                        |           | not null |

Partition of: invoices FOR VALUES FROM ('2018-01-01 00:00:00') TO ('2018-01-31 00:00:00')

You can see from the example shown in Table 1 not only that the partition acts like a regular table, but also that it knows very well what its range of values is. See what happens if I now insert rows into the parent “invoices” table:

postgres=# insert into invoices (issued_at, customer_name, amount, product_bought)
postgres-# values ('2018-jan-15', 'Jane January', 100, 'Book');
postgres=# insert into invoices (issued_at, customer_name, amount, product_bought)
postgres-# values ('2018-jan-20', 'Jane January', 200, 'Another book');
postgres=# insert into invoices (issued_at, customer_name, amount, product_bought)
postgres-# values ('2018-feb-3', 'Fred February', 70, 'Fancy pen');
postgres=# insert into invoices (issued_at, customer_name, amount, product_bought)
postgres-# values ('2018-feb-15', 'Fred February', 60, 'Book');

So far, so good. But, now how about a query on “invoices”:

postgres=# select * from invoices;
 id |      issued_at      | customer_name | amount | product_bought
----+---------------------+---------------+--------+----------------
  3 | 2018-02-03 00:00:00 | Fred February |     70 | Fancy pen
  4 | 2018-02-15 00:00:00 | Fred February |     60 | Book
  1 | 2018-01-15 00:00:00 | Jane January  |    100 | Book
  2 | 2018-01-20 00:00:00 | Jane January  |    200 | Another book
(4 rows)

I also can, if I want, query one of the partitions directly:

postgres=# select * from issued_at_y2018m01;
 id |      issued_at      | customer_name | amount | product_bought
----+---------------------+---------------+--------+----------------
  1 | 2018-01-15 00:00:00 | Jane January  |    100 | Book
  2 | 2018-01-20 00:00:00 | Jane January  |    200 | Another book
(2 rows)

Although you don’t have to do so, it’s probably a good idea to set an index on the partition key on each of the individual partitions:

postgres=# create index on issued_at_y2018m01(issued_at);
postgres=# create index on issued_at_y2018m02(issued_at);

That will help PostgreSQL find and update the appropriate partition.

Not everything is automatic or magical here; you’ll have to add partitions yourself, and you can remove them when they’re no longer needed. But this is so much easier than it used to be, and it offers more flexibility as well. It’s no surprise that this is one of the features most touted in PostgreSQL 10.
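Adding and removing partitions is plain DDL. Continuing the invoices example above (the March partition is hypothetical):

```sql
-- Add a partition as new data starts to arrive.
CREATE TABLE issued_at_y2018m03 PARTITION OF Invoices
    FOR VALUES FROM ('2018-mar-01') TO ('2018-mar-31');

-- Detach an old partition; its rows survive as an
-- ordinary stand-alone table that can be archived.
ALTER TABLE Invoices DETACH PARTITION issued_at_y2018m01;
```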


I’ve personally been using PostgreSQL for about 20 years—and for so many years people said, “Really? That’s your preferred open-source database?” But, now a large and growing number of people are adopting and using PostgreSQL. It already was full of great features, but there’s always room to improve—and with PostgreSQL 10, there are even more reasons to prefer it over the alternatives.


To learn more about PostgreSQL, download the code, read the documentation and sign up for the community e-mail lists, go to the main PostgreSQL site.

About the Author

Reuven Lerner teaches Python, data science and Git to companies around the world. His free, weekly “better developers” email list reaches thousands of developers each week; subscribe here. Reuven lives with his wife and children in Modi’in, Israel.

Original Link

Build or Buy? — The Eternal IT Question

I have been thinking a lot about the idea of “Build or Buy” with regard to IT systems and solutions. What do I mean by “Build or Buy,” exactly?


Build: your team puts together the parts required for the solution. Common examples are building a custom database or storage system for your application. These solutions typically live in the (public or private) cloud or can be hosted on-premise.


Buy: use an “as-a-Service” solution from a public cloud provider that abstracts away the management of your IT infrastructure. You can allow the vendor to ensure the uptime and security while you focus on application development. Common examples are services like MongoDB’s Atlas or the AWS S3 service. These are easy to begin working with because they require no capital expense and are typically ready to use in minutes.

I posed the question to someone who’s talked about this subject a lot lately, Kelsey Hightower of Google.

At what point do you determine you’ve met the limits of what a platform has available in regards to scale and resources? Additionally, when do you determine that a self-hosted solution is no longer as valuable as a Platform-as-a-Service?

Why Build?

Building a solution for something such as data storage tends to be a common task for many teams working on enterprise solutions. There are a number of concerns that major organizations consider when deploying large storage arrays, and these can put pressure on a team:

  • How will we back this up?
  • Who will provide long-term maintenance?
  • Will costs remain reasonable?
  • Do we have any specific business or regulatory rules we need to be included in how we store data?

Answering these questions means planning out a long-term solution that accounts for the application lifecycle; specifically, how long will you require the app that uses this data to remain available?

Why Buy?

Service-based hosting of applications has become the choice for many businesses that want to reduce the total footprint of their IT architecture. Gone are the days of requisitioning systems from a vendor, with the negotiations and lead time required for delivery. Any business can easily use a scalable solution like MongoDB’s Atlas or AWS with just a credit card. The ability to buy has reduced the time it takes to deliver applications. This shift to “buy” has also put many other options in the hands of developers:

  • Self-service via APIs or GUI-based interfaces
  • Self-remedying failure response
  • Alerting
  • Automated processes to handle common administration tasks
  • Scalable solutions (both scaling up and down)

What’s Going To Work For Me?

These are just a few reasons why businesses and developers turn to services to host their applications. Your use case will ultimately require you to plan for all of the potential risks that either approach presents. There is a valid argument that you should buy until it becomes evident that it’s time to build, which is one more reason to avoid service vendor lock-in when selecting the technologies you build apps on.


  • Consider open formats like JSON to store your data that translate to many different languages.
  • If selecting a service, ensure that the vendor will permit you to move your data elsewhere if your situation changes (costs, competition, credits).
  • Make checklists and document the architecture of your systems regardless of their hosting for future growth and scale.
  • Only use what your team can support.

That last tip is a critical one; what do I mean? Well, don’t go for an on-premise solution if you do not have the staffing or the funds to handle hands-on support. Don’t use a cloud solution if you haven’t validated that the data you have is permitted to be within this environment.

I hope I’ve given you some things to think about when buying or building your next big IT solution. Feel free to contact me in the comments with any questions or comments.

Original Link

DevOps on Graphs: The 5-Minute Interview With Ashley Sun, Software Engineer at LendingClub [Video]

“Basically, anything you can think of in your infrastructure, whether it’s GitHub, Jenkins, AWS, load balancers, Cisco UCS, vCenter – it’s all in our graph database,” said  Ashley Sun, Software Engineer at  LendingClub.

DevOps at LendingClub is no easy feat: Due to the complexities and dependencies of their internal technology infrastructure – including a host of microservices and other applications – it would be easy for everything to spiral out of control. However, graph technology helps them manage and automate every connection and dependency from top to bottom. 

In this week’s five-minute interview (conducted at GraphConnect New York), Ashley Sun discusses how the team at LendingClub uses Neo4j to gain complete visibility into its infrastructure for deployment and release automation and cloud orchestration. The flexibility of the schema makes it easy for LendingClub to add and modify its view so that their graph database is the single up-to-date source for all queries about its release infrastructure.

Talk to us about how you use Neo4j at LendingClub.

Ashley Sun: We are using Neo4j for everything related to managing the complexities of our infrastructure. We are basically scanning all of our infrastructure and loading it all into Neo4j. We’ve written a lot of deployment and release automation, cloud orchestration, and it’s all built around Neo4j. Basically, anything you can think of in your infrastructure, whether it’s GitHub, Jenkins, Amazon Web Services (AWS), load balancers, Cisco Unified Computing System (UCS), vCenter – it’s all in our graph database.

We’re constantly scanning and refreshing this information so that at any given time, we can query our graph database and receive real-time, current information on the state of our infrastructure.

What made you choose Neo4j?

Sun: At the time, my manager was looking for a database that we could run ad-hoc queries against, something that was flexible and scalable. He actually looked at a few different graph databases and decided Neo4j was the best. 

Catch this week’s 5-Minute Interview with Ashley Sun, Software Engineer at LendingClub

What are some of the most interesting or surprising results you’ve seen while using Neo4j?

Sun: The coolest thing about Neo4j, for us, has been how flexible and easily scalable it is. If you’ve come from a background of working with the traditional SQL database where schemas have to be predefined — with Neo4j, it’s really easy to build on top of already existing nodes, already existing relationships and already existing properties. It’s really easy to modify things. Also, it’s really, really easy to query at any time using ad-hoc queries. 

We’ve been working with Neo4j for three years, and as our infrastructure has grown and as we’ve added new tools, our graph database has scaled and grown with us and just evolved with us really easily. 

Anything else you’d like to add or say?

Sun: It would be exciting for more tech companies to start using Neo4j to map out their infrastructure and maybe automate deployments and their cloud orchestration using Neo4j. I’d love to hear about how other tech companies are using Neo4j.

Original Link

10 Steps to Become a Data Scientist in 2018

The newfound love for data science in today’s computing world isn’t unjustified. Harvard Business Review has ranked data scientist as the hottest job of the coming years, and the role comes with sweet paychecks. Add to that the gap between the existing skills of most professionals and the industry-standard skillset required of a data scientist, and there is clearly a lot that comes with learning data science.

In such a scenario, what gives you a competitive edge? Here are ten steps to follow on your path to becoming a data scientist!

1. Develop Skills in Algebra, Statistics, and ML

A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician. The idea is to have just the right balance, avoiding too much or too little emphasis on either of the two.

2. Learn to Love (Big) Data

Data scientists handle a humongous volume of segregated and non-segregated data on which computations often cannot be performed using a single machine. Most of them use big data software like Hadoop, MapReduce, or Spark to achieve distributed processing. There are many online courses that can really help you learn big data at your own pace.

3. Gain a Thorough Knowledge of Databases

Given the huge amount of data generated virtually every minute, most industries employ database management software such as MySQL or Cassandra to store and analyze data. A good insight into the workings of a DBMS will surely go a long way toward securing your dream job as a data scientist.

4. Learn to Code

You cannot be a good data scientist until you learn the language in which data communicate. A well-categorized chunk of data might be screaming out its analysis; the writing may be on the wall but you can only comprehend it if you know the script. A good coder might not be a great data scientist, but a great data scientist is surely a good coder. 

5. Master Data Munging, Visualization, and Reporting

Data munging is the process of converting the raw form of data into a form that is easy to study, analyze, and visualize. The visualization of data and its presentation are an equally important set of skills, on which a data scientist relies heavily when facilitating managerial and administrative decisions using data analysis.

6. Work on Real Projects

Once you have become a good data scientist in theory, it is all about practice. Search the internet for data science projects (Google “Quandl”) and invest your time in building your own forte, along with zeroing in on the areas that still require brushing up.

7. Look for Knowledge Everywhere

A data scientist is a team player, and when you are working together with a group of like-minded people, being a keen observer always helps. Learn to develop the intuition required for analyzing data and making decisions by closely following the working habits of your peers and decide what best suits you.

8. Communication Skills

Communication skills differentiate a great data scientist from a good data scientist. More often than not, you find yourself behind closed doors explaining the findings of your data analysis to people who matter, and the ability to have your way with words will always come in handy when tackling unforeseen situations.

9. Compete

Websites such as Kaggle are a great training ground for budding data scientists as they try to find teammates and compete against one another to showcase their intuitive approaches and hone their skills. With the rising credibility of the certifications provided by such sites in the industry, these competitions are fast becoming a stage to show to companies how innovatively your mind works.

10. Stay Up-to-Date With the Data Scientist Community

Follow websites such as KDNuggets, Data Science 101, and DataTau to remain in sync with the happenings of the world of data science and gain insight regarding the types of job openings currently being offered in the field.

We hope the above list helps you take off on your data scientist ambitions and acts as a faithful companion as you steer your way ahead of everyone towards excellence.

Original Link

2018 Big Data Predictions (Part 1)

Given how fast technology is changing, we thought it would be interesting to ask IT executives to share their thoughts on the biggest surprises in 2017 and their predictions for 2018.

Here’s article one of two of what they told us about their predictions for big data and analytics in 2018. We’ll cover additional predictions for 2018 in a subsequent article.

Lucas Vogel, Founder, Endpoint Systems

  • We’re going to see a lot more “componentization” of big data components and platforms going forward. Google has executed on this brilliantly, where a significant portion of their general cloud offerings such as BigTable and Dataflow are in fact built on top of Hadoop and other big data technologies — both open-source (HBase, Beam) and proprietary (BigQuery). I think we’ll see more products pushed out this way, providing less confusion to practitioners looking to benefit from the Hadoop ecosphere of big data.

Will Hayes, CEO, Lucidworks

  • Data is being created and collected at massive scale across all industries. This past year, with the support of machine learning integration, new analytics tools are being developed to help companies glean as much valuable insight from that data as possible. I think search capabilities will continue to become an incredibly important component of analyzing this content and furthering the understanding of the user experience. In 2018, robust search tools will become the standard for companies that create and access large reserves of data and content.

Eliot Horowitz, CTO and Co-Founder, MongoDB

  • Organizations with a global reach or those in regulated fields have to keep data in certain places for legal reasons, making it a challenge to have a single logical view of this data. In many cases, the same technologies that make it easier to develop applications, like cloud infrastructure, services, and serverless backends make it challenging to even enforce data access protections.

Patrick McFadin, Vice President of Developer Relations, DataStax

  • Stream processing data will become further integrated into standard backend databases.

    More companies will embrace multi-cloud as competition heats up between cloud vendors and fear of lock-in becomes more prevalent.

    Graph database use cases will become less art and a lot more science as the technology matures.

    Data autonomy is the fear that the big cloud players will become the main driver for large digital transformation projects. More and more brands will want data autonomy in a multi-cloud world in order to compete and stay ahead. The need and urgency to meet the big cloud players head-on with data-driven applications will intensify.

    Real-time analysis of operational data will be a qualifying feature for most infrastructures so that their applications can explore emerging trends, provide timely alerts to operators and end users, and reduce the latency between the appearance of a condition and when it is visible to business owners on their dashboards. The traditional model of online data dumping nightly into an analytics data warehouse will not suffice. This will be made all the more challenging by the following.

    The continued explosion of data. The philosophy of “store everything” will lead to even faster data growth, forcing organizations to choose between 1) sacrificing real-time access, 2) throwing away data, or 3) inventing new solutions at tremendous cost. However, the requirement to compete will make options 1 and 2 tantamount to ceding ground to competitors. This is compounded by:

    A broadening of the definition of IoT. We are moving into a world where everything is generating data, and everything is consuming data. This year, more devices will become data-enabled, and first-gen IoT devices will be replaced by newer models with 5x the number of sensors. Devices that used to primarily report their state and allow a few commands will instead participate in an elaborate network of mutually coordinated behavior, all relying on a constant stream of data.

    Different sources and schemas. Ingesting all of the data and making it actionable is complicated by its different sources and schemas. Organizations seldom have the capability to dictate the shape of data from all the sources they need to harness, and kicking the can down the road by dumping it into a shapeless data lake doesn’t help. They will need to find a way to make it all queryable.

    Letting developers work more closely and naturally with data. Many technologies that facilitate managing and deriving insight from data at scale introduce impedance to the development process in the form of special-purpose languages or multiple new architectural layers.

    Cross-cloud. Public cloud infrastructure and services are a boon to organizations, but they are also a source of lock-in. Organizations will face pressure to mitigate this lock-in, so they can take advantage of regions across cloud providers or mix services offered by separate cloud providers. Perhaps data storage is most cost-effective in one, while another offers the best price/performance ratio on GPU resources. Or they might want to migrate from one provider to another.

Adnan Mahmud, CEO and Founder, LiveStories

  • Keeping the election theme, everyone will be watching if the polls get it right for the mid-term elections. We will continue to see the adoption of Smart City technologies. More sensors will get deployed and new applications will be created to take advantage of these data streams.

Mike Kail, CTO, CYBRIC

  • Big Data will balloon into “overweight data” due to technology such as IoT, autonomous vehicles, and the 4th Industrial Revolution. This will drive new startups to address the needs to rapidly process and act upon this data.

Nima Negahban, CTO and Cofounder, Kinetica

  • Beginning of the end of the traditional data warehouse. As the volume, velocity, and variety of data being generated continue to grow, and the requirements to manage and analyze this data grow at a furious pace as well, the traditional data warehouse is increasingly struggling to manage this data and analysis. While in-memory databases have helped alleviate the problem to some extent by providing better performance, data analytics workloads continue to be more and more compute-bound.

    These workloads can be up to 100x faster leveraging the latest advanced processors like GPUs; however, this means a nearly complete re-write of the traditional data warehouse. In 2018, enterprises will start to seriously re-think their traditional data warehousing approach and look at moving to next-generation databases leveraging either memory or advanced processor architectures (GPU, SIMD), or both.

Dale Kim, Senior Director, Products and Solutions, Arcadia Data

  • Artificial intelligence (AI) deserves the same treatment Hadoop and other big data technologies have received lately. If the industry is trying to balance the hype around big data-oriented products, it has to make sure not to overhype the arrival of AI. This is not to suggest that AI has no place in current and future-looking big data projects, just that we are not at a point in time yet where we can reliably turn business decision-making processes over entirely to machines. Instead, in 2018 the industry will begin to modernize BI with machine assistance rather than AI-driven tasks. Think of it as power steering versus self-driving cars. Business users will get more direction on how to gain better insights faster, as they don’t need to be told what the right insights are. We’re so enamored by the idea of AI, but the reality is it’s not ready to act on its own in the context of analyzing data for business users.

    In modernizing BI, we’ll also start to see a shift in which organizations will bring BI to the data. BI and big data have hit a bit of a brick wall. Companies have spent a lot of money on their data infrastructures, but many are left wondering why they have to wait so long for their reports. Part of the problem is that companies are capturing their data in a data lake built on a technology like Hadoop, but they are not taking full advantage of the power of the data lake. Rather than ideally moving operations to the data, businesses move data from the lake to external BI-specific environments. This process of “moving data to the compute” adds significant overhead to the analytics lifecycle and introduces trade-offs around agility, scale, and data granularity. Next year and moving forward, we’ll start to see more companies bringing the processing to the data, a core tenet of Hadoop and data lakes, with respect to their BI workloads. This will speed the time to insight and improve the ROI companies see on their big data infrastructure investments.

Don Boxley, Co-Founder and CEO, DH2i

  • In 2018, organizations will turn to Best Execution Venue (BEV) technologies to enable and speed digital transformation, laying last year’s fears to rest. Organizations will reap immediate business and technological benefit, as well as dramatic reductions in associated costs, by employing technologies that dynamically decide and then move workloads/data to the location and conditions where they can function at peak performance and efficiency, to achieve the desired outcome.

Lee Atchison, Senior Director Strategic Architecture, New Relic

  • Datasets are getting larger and more comprehensive. Accessible storage is getting larger and less expensive, and compute resources are getting bigger and less expensive. This opens up a natural opportunity for advanced artificial intelligence to be used to process and analyze that data. Finding useful trends and patterns, and detecting anomalies, are natural use cases for AI on these large datasets.

Original Link

Who Cares About NoSQL?

With all the hullabaloo with NoSQL companies going public and then losing a third of their valuation, being bought out, and some even going into administration and then being resurrected by customers — you could be forgiven for thinking the NoSQL bubble has spectacularly burst.

It’s crunch time for NoSQL companies, that’s for sure. They need to grow up. Rapidly. They need to become easier to deploy, use, and manage. But they’re working on that.

The effort to deploy NoSQL databases, though, is worth it for so many customers. I wanted to bang the drum a little for NoSQL in these troubled times, so here’s my list of how you can really benefit from using them.

Oh, and it’s a stream of thought kinda thing — these are in no particular order of preference.

1. You Could Save Millions on Oracle Coherence or Software AG Terracotta Licenses

These middleware layers are used to cache often-used data in the Java tier. They are extensively used in financial services to reduce the load on underlying databases or act as in-process caches and distributed memory shared across many machines.

They’re bloody expensive, though! Like… eye-wateringly expensive. Even for very rich banking types.

You can use a NoSQL key-value store to do a similar job — and for much less cash. An in-memory key-value store for transient data can easily be powered by the extremely lightweight Redis NoSQL database.

If you’re in the cloud, take a look at AWS DynamoDB.

These also have useful functionality for complex and custom data types, so they may even be easier to code against for some use cases.
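The cache-aside pattern these key-value stores enable can be sketched in a few lines. This is a minimal illustration, with a plain Python dict standing in for Redis or DynamoDB, and `load_from_database` as a hypothetical stand-in for a slow relational query:

```python
import time

class KeyValueCache:
    """Minimal cache-aside sketch; a dict stands in for Redis/DynamoDB."""
    def __init__(self, ttl_seconds=60):
        self.store = {}          # key -> (value, expiry timestamp)
        self.ttl = ttl_seconds

    def get(self, key, loader):
        entry = self.store.get(key)
        if entry and entry[1] > time.time():
            return entry[0]                      # cache hit: no database load
        value = loader(key)                      # cache miss: hit the database
        self.store[key] = (value, time.time() + self.ttl)
        return value

def load_from_database(key):
    # Hypothetical stand-in for an expensive query against the backing store.
    return f"row-for-{key}"

cache = KeyValueCache(ttl_seconds=30)
print(cache.get("trade:42", load_from_database))  # miss: loads from "database"
print(cache.get("trade:42", load_from_database))  # hit: served from cache
```

In a real deployment, the dict would be replaced with calls to the store's client library, but the read-through shape stays the same.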

2. Save Millions on Not Coding Around Relational Database Structural Issues

Relational databases are great… for relational data.

For structures that change rapidly or that are deeply nested, they introduce a bunch of overhead.

Imagine an XML or JSON document structure — say, a complex FpML trade document. Let’s say you want to update it whole but also want to be able to retrieve it by key fields on an ad hoc basis.

To do this in relational databases, you have to either code a special function to handle introspecting the XML data — not the fastest — or duplicate some of the fields as relational columns — not easy on storage space.

And then you’ve got to code around the issue and spend a lot of time thinking of the best way to do things. It’s just not fun or productive.

Using a document NoSQL database which can natively handle either XML (MarkLogic) or JSON (pretty much all of them — MongoDB, ArangoDB, CosmosDB, MarkLogic, etc.) data will greatly simplify storage and query of complex document structures.

And they’ll save you a boatload of development and testing time, too.

Oh, and their licenses are cheaper, and they run on commodity hardware. That’s easy math.
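The store-whole, query-by-key-field pattern described above can be sketched in plain Python. Here a dict stands in for the document database, and the trade fields are purely illustrative (not real FpML):

```python
import json
from collections import defaultdict

class DocumentStore:
    """Sketch of a document store: documents are stored whole (as JSON),
    with secondary indexes on chosen key fields for ad hoc lookup."""
    def __init__(self, indexed_fields):
        self.docs = {}                               # doc_id -> raw JSON text
        self.indexes = {f: defaultdict(set) for f in indexed_fields}

    def put(self, doc_id, doc):
        if doc_id in self.docs:                      # de-index the old version
            old = json.loads(self.docs[doc_id])
            for field, index in self.indexes.items():
                if field in old:
                    index[old[field]].discard(doc_id)
        self.docs[doc_id] = json.dumps(doc)          # update the document whole
        for field, index in self.indexes.items():
            if field in doc:
                index[doc[field]].add(doc_id)

    def find_by(self, field, value):
        return [json.loads(self.docs[i]) for i in self.indexes[field][value]]

# Hypothetical trade documents; field names are illustrative.
store = DocumentStore(indexed_fields=["counterparty", "currency"])
store.put("t1", {"counterparty": "ACME", "currency": "USD", "notional": 1_000_000})
store.put("t2", {"counterparty": "ACME", "currency": "EUR", "notional": 500_000})

print(store.find_by("counterparty", "ACME"))  # both trades, no joins required
```

A real document database indexes far more cleverly than this, but the point stands: the document goes in whole, and key fields stay queryable without duplicating them into relational columns.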

3. Save Pulling Out Your Hair Doing Complex Queries Over Thousands of Pieces of Related Data

If you’ve got ridiculously complex relationships (in data… not in your personal life…) between entities in a complex graph of information and need to traverse those relationships, you need an effective way to index that data in order to make those queries fly.

This is where SPOGI-style indexes come in for graph NoSQL databases! Good choices here are AllegroGraph (very standards-compliant), GraphDB, and Neo4j (not W3C standards compliant, but has a very nice query language of its own).

Shortest path queries are really, really computationally complex. Especially if they’re calculating the cost of the paths as they traverse them from data in the graph.

If you’re doing a lot of these queries (e.g. a sat nav-style application), you need a dedicated data store.

Only graph NoSQL databases provide that.
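To get a feel for the work a graph database does natively, here is a minimal Dijkstra shortest-path sketch over a hypothetical weighted road network — the costs are read from the edge data as the traversal proceeds, exactly the pattern described above:

```python
import heapq

def shortest_path(graph, start, goal):
    """Dijkstra's algorithm: a cost-weighted shortest-path query of the
    kind a graph database evaluates natively over its edge data."""
    queue = [(0, start, [start])]   # (cost so far, node, path taken)
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for neighbour, weight in graph.get(node, []):
            if neighbour not in seen:
                heapq.heappush(queue, (cost + weight, neighbour, path + [neighbour]))
    return None  # no route exists

# Hypothetical road network: node -> [(neighbour, travel cost)]
roads = {
    "A": [("B", 5), ("C", 2)],
    "C": [("B", 1), ("D", 7)],
    "B": [("D", 3)],
}
print(shortest_path(roads, "A", "D"))  # (6, ['A', 'C', 'B', 'D'])
```

In a relational database, each hop in this traversal would typically be another self-join; a graph store's indexes make the neighbour lookup a constant-time operation instead.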

4. If You Have a Whole Bucket Load of Data About Each Record… Sometimes

Sometimes, it’s possible for a single record to have only a few properties out of thousands of possible ones.

You may not want the overhead of defining them all up front — or you may simply not know them all up front!

You may also only want to pull some groups of properties back in one go (e.g. a summary, or one aspect of the entity), and want it to be fast.

The way that wide column stores (AKA columnar NoSQL databases, AKA BigTable clones) work makes this very efficient.

Be it Hypertable (a good commercial offering), Cassandra (AKA DataStax Enterprise), or Accumulo (good for securing individual data fields), these NoSQL databases can simplify your application and drive more performance.

They’re also easier to understand if your mind is totally stuck in the world of tables and columns!
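A toy sketch of the wide-column layout helps make this concrete: each row stores only the columns it actually has, grouped into column families that can be fetched independently. All names here are illustrative:

```python
from collections import defaultdict

class WideColumnStore:
    """Sketch of a wide-column layout: sparse rows, columns grouped into
    families that can be read independently of one another."""
    def __init__(self):
        # row key -> column family -> {column: value}
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        self.rows[row_key][family][column] = value

    def get_family(self, row_key, family):
        # Fetch one aspect of the entity (e.g. just the summary) in one go.
        return dict(self.rows[row_key][family])

store = WideColumnStore()
# A sparse record: only three of thousands of possible columns are present,
# and nothing was declared up front.
store.put("user:1", "summary", "name", "Ada")
store.put("user:1", "summary", "city", "London")
store.put("user:1", "activity", "last_login", "2018-01-05")

print(store.get_family("user:1", "summary"))  # {'name': 'Ada', 'city': 'London'}
```

Real wide-column stores add the on-disk layout that makes this fast at scale (column families stored contiguously), but the data model is exactly this shape.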

5. You May Be Indecisive or Have Every Type of Data Imaginable but Don’t Want 10 Database Products

In this instance, a hybrid NoSQL database may be for you. A variety of query types (simple key-value/name-fetch lookups, document structures, and graph queries) can all be handled by a single, true-hybrid database using one API.

These databases include MarkLogic Server or ArangoDB. Definitely try those out.

In Summary

There are a few difficulties in using NoSQL databases — but the benefits far, far outweigh them. The great thing is that you can download the above databases and try them out in minutes.

No great time overhead, and no sales droids to talk to until you think you may find them valuable. Just have a go today.

I cannot recommend enough that you open your mind and try them out. The possibilities are truly endless — not to mention, profitable for you and your employer!

Original Link

An Introduction to SQL on Hadoop and SQL off Hadoop

Initially, Apache Hadoop was seen as a platform for batch processing unstructured data. Inherently, Hadoop was a cheap way to reliably store and process lots of data, so more use cases were attracted to it.

Over time, the inexorable effects of data gravity increased the need for SQL on Hadoop, as SQL is the language of data. Initially, having any way to use SQL against the Hadoop data was the goal, but now there is an increasing requirement to connect business users with tools like Tableau to that data, and give them the performance they expect with high levels of concurrency. 

Note that to meet this requirement, it is likely that users will need to have structured data stored in Hadoop (along with the original unstructured data), as good performance is more likely if a transformation is done once rather than per-query as noted in this DZone article.

Open-Source Solutions

There are a number of open-source solutions for SQL on Hadoop, including Hive (LLAP), Impala, SparkSQL, and Presto.

As most of these products are relatively young, there are still significant improvements being made, as covered in the later section on benchmarking. So if you can live with their functionality and performance today, you can expect things to improve over the next few years.

The greatest strength of these solutions is they were written from scratch for analyzing data in Hadoop. They were intended to run on Hadoop clusters from day one, and interoperate with the growing number of data formats in that ecosystem.

The greatest weakness of these solutions is they were written from scratch for analyzing data in Hadoop. People often say good software takes 10 years, and that certainly applies to SQL products, particularly in the area of query optimization — Hadoop itself is 10 years old, but most of the SQL on Hadoop products are much younger than that; any focus on real-time, high-concurrency SQL on Hadoop is younger still.

That is why a lot of proprietary database products are built on the shoulders of giants. For example, here is a list of products that derive from PostgreSQL, including Greenplum, Netezza, ParAccel, Redshift, and Vertica. The resulting products have a great start in avoiding a lot of mistakes made in the past, particularly in areas such as query optimization.

By contrast, those developing open-source SQL on Hadoop products from scratch have to learn and solve problems that were long-since addressed in other database products.

That is why promising projects like Presto are only starting to add a cost-based optimizer in 2017, and Impala cannot handle a significant number of TPC-DS queries (which is why Impala TPC-DS benchmarks tend to show fewer than 80 queries, rather than the full 99 from the query set – more on this in the benchmarking section later).

In addition, some open-source solutions are adopted by specific Hadoop distributions. So Hortonworks uses Hive LLAP, whilst Cloudera prefers Impala. Although the projects are open-source, if you are trying to, for example, get Kudu working on Hortonworks as part of an Impala deployment, you may struggle, as seen in this Hortonworks community topic.

Proprietary Solutions

There are proprietary alternatives for using SQL to query data in Hadoop.

Many of these allow you to run what TDWI call “SQL off Hadoop,” requiring a separate platform for the SQL engine in addition to your Hadoop cluster. This is unattractive to many companies, as you have the cost of an additional platform, and the auditing concerns when moving data between different platforms. On the other hand, some might perceive benefits in isolating the SQL workload from their Hadoop cluster, and think the operational complexity and extra platform costs are a price worth paying.

Other products, such as Vertica, have an on-Hadoop offering that does not match their off Hadoop product. In Vertica’s case, they advise using the Vertica Enterprise Edition when you need “to boost performance” by enabling optimizations such as projections which aren’t available in their on-Hadoop offering.

Finally, some products such as Kognitio have been migrated from an off Hadoop product to run on Hadoop with no missing functionality or performance features.

One major impact of open source for SQL on Hadoop is that having a free-to-use product is now a basic requirement. Users expect to evaluate a fully functional version of the product with no limits on the scale or duration of that evaluation, at no cost. They accept that they will pay for support/consultancy if they decide to move into production with the product at a later date.


Benchmarking

Everyone knows that vendors are good at constructing benchmarks which suit their own needs!

However, benchmarks are available as a starting point to judge the alternative SQL on Hadoop options for functionality and performance.

These include:

  • AtScale: This was the second time AtScale benchmarked SQL on Hadoop, and one clear message was the massive improvements made by the open-source offerings they tested. This is what one would expect with relatively new products, but was still a good sign. The other finding I noted was that products had strengths and weaknesses for different queries, suggesting this is not a One Size Fits All market. It should also be noted that the benchmark used the relatively small TPC-H query set, whereas the other benchmarks listed here use the newer and more comprehensive TPC-DS query set.

  • Comcast ran a benchmarking exercise with TPC-DS queries, comparing different SQL on Hadoop products. Theirs is also the only benchmark here to compare the performance of different file formats, so it is worth reading on that basis alone. It’s also worth reading the detail on which of the TPC-DS queries were included (66 of the 99 TPC-DS queries), and the scoring mechanism for long-running/failing queries. There was also no concurrency testing (due to lack of time in their tests, given the number of combinations of product and file format that they considered), which would be essential for almost all real-world use cases.

  • Kognitio ran a TPC-DS benchmark, including all of the TPC-DS queries and concurrency testing. There is more detail on how the benchmark was run, and the per-query results, here. Although the benchmark linked here does not include Hive LLAP and Presto, there will be further Kognitio blog posts on these products — without giving everything away, Presto seems comparable to SparkSQL in terms of performance and ease-of-use, whilst LLAP is performant but not stable under concurrent load.

A number of common themes emerge from the benchmarking process:

  • Open-source products are improving significantly in terms of functionality and performance — both AtScale and Kognitio see these results (for example, Kognitio report the functionality improvements between Spark 1.6 and 2.0, and IBM fellow Berni Schiefer observed similar improvements starting with Spark 1.5).

  • Product immaturity for a number of the open-source products means they cannot run all the TPC-DS query set, either because they don’t support the required syntax or they generate runtime errors.

  • Some products (particularly SparkSQL and Presto) need significant tuning for concurrent performance. This was observed by Kognitio and in the Berni Schiefer article mentioned above.

  • Hive by itself is very slow (highlighted in particular by Comcast). Hive LLAP is a significant improvement, although AtScale still ranked it behind Presto and Impala.

Possible reasons for current open source SQL on Hadoop products not being as performant as some proprietary offerings include:

  • Overhead of starting and stopping processes for interactive workloads. To run relatively simple queries quickly, you need to reduce latency. If you have a lot of overhead for starting and stopping containers to run tasks, that is a big impediment to interactive usage, even if the actual processing is very efficient.

  • Product immaturity. See the earlier commentary on building from scratch, rather than leveraging years of experience built into existing products.

  • Evolution from batch processing. If a product like Hive starts off based on MapReduce, its developers won’t start working on incremental improvements to latency, as they won’t have any effect. Similarly, if Hive is then adopted for a lot of batch processing, there is less incentive to work on reducing latency. The Hive 2 LLAP project aims to improve matters in this area, but in benchmarks such as the AtScale one referenced earlier, it still lags behind Impala and SparkSQL.

Of course, the best benchmark is one that represents your intended workload, bearing in mind future as well as initial requirements, so you should always conduct your own functional and performance testing rather than relying on benchmarks from vendors or other third parties.

No More “One Size Fits All”

As long as you use SQL on Hadoop solutions which run on your Hadoop cluster, you can use the right tool for the job. If you already have Hive for ELT and other batch processing, but it can’t meet your needs for connecting a community of business users to data in Hadoop, you can utilize a different SQL on Hadoop solution for that task and keep your current Hive workload in place.

You no longer have to choose between betting the farm on one solution for all SQL access or having multiple hardware platforms for different use cases, with the performance and audit-trail pains of having to move data between those platforms.

As I work for a company producing an SQL on Hadoop product, I recuse myself from recommending one product. I’d expect most users to try a product from their Hadoop distribution, then look at free-to-use alternatives as and when they find use cases that product cannot handle.

Original Link

What Are the Hurdles Companies Face With Databases Today?

To gather insights on the state of databases today and their future, we spoke to 27 executives at 23 companies who are involved in the creation and maintenance of databases.

We asked these executives, “What are the most common issues you see companies having with databases?” Here’s what they told us:

Lack of Knowledge

  • The lack of education about and understanding of GPUs in the datacenter. Adoption of the cloud. Teaching enterprises about the difference in hardware technology. There are 50 to 100 different types of databases, and the customer needs to understand the benefits of each type. We provide the ability to integrate TensorFlow, Caffe, and Torch. No limits on the types of feeds, languages, and interfaces you can use. There are a lot of Oracle and SAP legacy databases with years of stored procedures that need to be unwrapped to see changes for the better with regards to visibility and greater actionability on the data. Fusing consumer IoT with industrial IoT.
  • Once we take care of the operational challenges, what’s left is understanding the data models and the data access. This requires understanding customers, how they interact with the app, and how the app interacts with the database.
  • Compromising critical business performance against speed, scale, accuracy, and cost. Go relational and you can do as much as you want; how much you’re able to explore is based on how much you’re willing to pay. In the NoSQL world, each database fills a niche. The consistency/availability/partition-tolerance (CAP) triangle, manual sharding, and scaling are the challenges people are dealing with. Customers with scars are knowledgeable. Customers who have hired a new team know the new techniques but have a hard time — they get to a prototype quickly but don’t see the roadblocks with scalability, consistency, and availability until they’re further down the road, and it takes longer and is more expensive to address.
  • Legacy is a big one as the system kind of works with some pain. How bad is the pain relative to the amount of work required to alleviate it? How do I estimate the cost, time, effort, and payoff? The cost of switching technology is often high. Evaluating fit can be challenging. Stakeholders in the company can raise political situations. Work together with IT and app development to determine the proper product fit and address any political issues.
  • When developers work with databases, they can quickly find themselves in over their heads if they try to address database issues. It’s only natural for a developer to turn back to their code as the most familiar path to resolve an issue. In most cases, the database engine will do a better job at finding the most efficient way of completing a task than you could in code — especially when it comes to things like making the results conditional on operations performed on the data.
  • The amount of data stored is inversely proportional to a company’s ability to analyze it. Companies have no clue how many copies of data they have. No idea of data lineage. Data virtualization helps codify the dependencies of data storage, saving significant storage dollars. In-memory technology requires writing to different APIs. We’re trying to introduce a combination of data virtualization and data grids to provide a consolidated view of all of the data.
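On the point above that the database engine usually beats application code at conditional operations on data: a minimal sketch using SQLite (with a purely illustrative schema) shows one conditional aggregate replacing a per-row loop in code:

```python
import sqlite3

# Illustrative schema and data: let the engine do the conditional work
# in one pass, rather than fetching every row and branching in code.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "paid", 100.0), (2, "refunded", 40.0), (3, "paid", 60.0),
])

# One conditional aggregate instead of a row-by-row loop in the application:
paid, refunded = conn.execute("""
    SELECT SUM(CASE WHEN status = 'paid'     THEN amount ELSE 0 END),
           SUM(CASE WHEN status = 'refunded' THEN amount ELSE 0 END)
    FROM orders
""").fetchone()
print(paid, refunded)  # 160.0 40.0
```

The engine evaluates the `CASE` expressions while scanning, so the result arrives in a single round trip instead of the application pulling all rows and branching on each one.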


  • We help customers find things more easily. Ingesting into a SQL Server schema meant the schema had to change for every customer. Cloud and database technology can simplify these efforts.
  • Not understanding the kind of workload appropriate for the database. People will start with the relational database they already have but don’t have a checkpoint to determine the need for a graph or key-value database. Performance struggles when using technology that’s not the best fit. It works fine on a small scale, but you realize the limitations as you grow.
  • There will always be performance, scale, resiliency, and security challenges. But increasingly, and at a tactical level, we see the shift to containers as a common issue. Moving from bare metal to containers is not your typical “lift and shift.” It’s a true modernization of the architecture. Understanding that data persistence, portability, and performance in a containerized environment are truly different. Building the right microservices architecture and a scalable, distributed server, storage, and network fabric are key to run stateful, database applications in containers.
  • Key functions are performance, scalability, availability, and security. You must consider all of these. Too often, a client will focus on short-term user problems and then have to perform major reworking.
  • Prior to using our technology, all of our customers shared they had query performance issues on joins over hundreds of millions of entities/relationships.
  • Affordability, resiliency, and inflexibility are fairly common concerns in the database market. Concerns like these continue to drive us to produce technologies and offerings that level those speed bumps and get people involved, up and running faster.


  • Unanticipated growth. Different teams using different products. Start experimenting with the functionality of the product and end up with thousands of workflows. What happens if you end up with a billion rows in this table?
  • Exploding data, the proliferation of solutions, and outsourcing quality or testing while needing to remain compliant.
  • Invested in a point solution that they’ve outgrown as their needs have changed. A consolidated database eases the patch management process. All PaaS reduce dependencies on versions but you still have 17 different libraries. Systems are more complex. Move from a database to a data platform so you can do more than just store data.
  • Many companies fail to take a critical look at the initial choice of a database solution. A database gets chosen because it is safe (what the developers know), cool (what the developers saw on Hacker News), or paid for (what the company has already licensed). Always put requirements first, and do your best to anticipate realistic scale needs.
  • A source of pain is that the production database is too large to give access to developers. You can’t test and end up with blind spots because production data is different than test data. Making SQL Server able to run on Linux helps; a local copy of SQL Server to test against is huge.
  • When developers build apps and test but not at scale — 500 GB versus hundreds of terabytes.


  • As customers deploy to next generation databases, they need automatic backup and recovery. Test and development environments need to meet the two-week DevOps release cycle. Automatically refresh data nightly to provide test and development with the data they need.

  • The need to align database changes with application changes. This stems from Conway’s Law: an organization will design systems that mirror its own communication structure. That mentality does not work anymore. Move to the cloud with a DevOps methodology. The database model hasn’t changed since 1979.
  • Modernizing the application stack of existing applications, moving from Oracle and SQL Server to the cloud. How do you manage the data tier? It needs to be modernized at the same time. As architecture and infrastructure change quickly, how do you manage databases so they can change over time?
  • It depends on whether you are working with a third-party app or building your own. If you build your own, it needs to be optimized. We help third parties with the infrastructure to help with performance and disaster recovery.
  • Larger, disparate teams needing to integrate databases into DevOps. Enable them to speak the same language as the application development team. Provide different tooling so they are able to plug the databases into the processes that exist, using the same technology. Shift the database integration process left.
  • Adapting to changing infrastructure – cloud and containers. Different use cases serve different requirements; intelligent payment processing runs 24/7/365. Understand how to make the database meet the requirements: consistency, persistence, partition tolerance. What’s the best way to make the database meet them?

What are some hurdles you see companies facing with databases today?

Here’s who we talked to:

  • Emma McGrattan, S.V.P. of Engineering, Actian
  • Zack Kendra, Principal Software Engineer, Blue Medora
  • Subra Ramesh, VP of Products and Engineering, Dataguise
  • Robert Reeves, Co-founder and CTO and Ben Gellar, VP of Marketing, Datical
  • Peter Smails, VP of Marketing and Business Development and Shalabh Goyal, Director of Product, Datos IO
  • Anders Wallgren, CTO and Avantika Mathur, Project Manager, Electric Cloud
  • Lucas Vogel, Founder, Endpoint Systems
  • Yu Xu, CEO, TigerGraph
  • Avinash Lakshman, CEO, Hedvig
  • Matthias Funke, Director, Offering Manager, Hybrid Data Management, IBM
  • Vicky Harp, Senior Product Manager, IDERA
  • Ben Bromhead, CTO, Instaclustr
  • Julie Lockner, Global Product Marketing, Data Platforms, InterSystems
  • Amit Vij, CEO and Co-founder, Kinetica
  • Anoop Dawar, V.P. Product Marketing and Management, MapR
  • Shane Johnson, Senior Director of Product Marketing, MariaDB
  • Derek Smith, CEO and Sean Cavanaugh, Director of Sales, Naveego
  • Philip Rathle, V.P. Products, Neo4j
  • Ariff Kassam, V.P. Products, NuoDB
  • William Hardie, V.P. Oracle Database Product Management, Oracle
  • Kate Duggan, Marketing Manager, Redgate Software Ltd.
  • Syed Rasheed, Director Solutions Marketing Middleware Technologies, Red Hat
  • John Hugg, Founding Engineer, VoltDB
  • Milt Reder, V.P. of Engineering, Yet Analytics

Original Link