How to Effectively Work With a Relational Database Using Java JDBC

If you don’t want to use any of the ORM frameworks to implement database queries and feel like even Spring’s JdbcTemplate isn’t right for you, try the JdbcBuilder class from the UjoTools project.

Anyone who’s ever programmed SQL queries through the JDBC library has to admit the interface isn’t very user-friendly. Maybe that’s why a whole array of libraries has emerged, varying both in the services they provide and in their degree of complexity. In this article, I’d like to show you a convenient class from the Java UjoTools library called JdbcBuilder. Its purpose is to help with assembling and executing SQL statements — nothing more, nothing less. The JdbcBuilder class doesn’t handle mapping of results to JavaBeans, doesn’t address optimization, and doesn’t provide a database connection.
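For context, here is a minimal sketch of the boilerplate that plain JDBC demands for even a trivial query — the kind of code that helpers like JdbcBuilder aim to shorten. Only the standard java.sql API is used; the connection, table, and column names are hypothetical.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public final class PlainJdbcExample {

    /** Prints user names with plain JDBC: statement building, parameter
     *  binding, and resource handling are all left to the caller. */
    static void printUserNames(Connection connection, int minAge) throws SQLException {
        String sql = "SELECT name FROM app_user WHERE age >= ? ORDER BY name";
        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setInt(1, minAge);
            try (ResultSet resultSet = statement.executeQuery()) {
                while (resultSet.next()) {
                    System.out.println(resultSet.getString("name"));
                }
            }
        }
    }
}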

Original Link

How Database Indexes Really Work

I have a growing love of databases that leads me to ask a lot of questions about how they work, and my recent obsession is database indexes.

Previously, I knew that if I wanted a particular column or field to be faster than the rest of them, I indexed it. That was as much as my brain could handle when I was first learning to code, but growing as a developer means expanding my knowledge of the fundamentals, like what exactly a database index is and why it exists.

Original Link

NoSQL vs Relational Databases (When to Use What?)

Relational databases have existed for more than 40 years now, and they work well. There are specific use cases, however, where a software professional might use a NoSQL database over a relational one. Some of those reasons are:

  • Relational databases are highly available and highly consistent, so atomic operations are straightforward and database transactions run very well on them. Hence, if you are building a typical CRUD-style website, a relational database is still a solid option.
  • The CAP theorem concerns Consistency, Availability, and Partition tolerance, and states that a distributed system cannot provide all three properties at the same time; at best, you get two of them. Relational databases provide consistency and availability but lack solid partitioning support: although they do support partitioning, core features such as joins and shared indexes make scaling them across partitions difficult and far from optimal.
  • The point above is the key reason for the existence of NoSQL databases like MongoDB or Cassandra, which provide excellent support for horizontal scalability. They give up some consistency, as most of them don’t support distributed transactions (at least not as completely as a relational database does), and they don’t offer joins in the relational sense; these trade-offs are precisely why they scale so well horizontally (i.e. by adding more machines).
  • Another reason for using NoSQL databases is their developer friendliness. Databases like MongoDB are document databases where the data is stored as JSON, which is highly compatible with most web user interfaces (read: single-page JavaScript apps) and has excellent tooling support.
  • Relational databases have a strict schema for data storage. The schema can be changed through ‘alter’ statements, but doing so impacts the existing application code, which has to be changed in line with the schema changes. NoSQL databases, on the other hand, support easy schema changes on the fly without affecting existing code.

How Do NoSQL Databases Scale Horizontally?

There are 2 key approaches:

  1. Auto-Sharding: This is the approach used by Google’s Bigtable. It assigns a range of key values to each partition, so when a value falls within a given range, the database knows which partition to look in. It is somewhat similar to bucketing in a hashtable.
  2. Consistent Hashing: The other approach is consistent hashing, used by Amazon’s DynamoDB. The machines are arranged on a hash ring, and if a machine in the ring fails, the database knows which neighboring machine on the ring to look at next for a record. A minimal sketch of this idea follows the list.
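To make the consistent hashing idea above concrete, here is a minimal, simplified sketch in Java. It is not how DynamoDB is actually implemented; the node names, the use of String.hashCode, and the omission of virtual nodes are all simplifying assumptions.

import java.util.SortedMap;
import java.util.TreeMap;

/** Toy consistent-hash ring: a key is routed to the first node whose position
 *  on the ring is greater than or equal to the key's hash, wrapping around. */
public final class ConsistentHashRing {

    private final SortedMap<Integer, String> ring = new TreeMap<>();

    void addNode(String node) {
        ring.put(hash(node), node);
    }

    void removeNode(String node) {
        ring.remove(hash(node));
    }

    /** Only keys that mapped to a removed node are relocated. */
    String nodeFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no nodes in the ring");
        }
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        Integer position = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(position);
    }

    private static int hash(String value) {
        // A real system would use a stronger hash and many virtual nodes per machine.
        return value.hashCode() & 0x7fffffff;
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing();
        ring.addNode("node-a");
        ring.addNode("node-b");
        ring.addNode("node-c");
        System.out.println("user:42 -> " + ring.nodeFor("user:42"));
        ring.removeNode("node-b"); // only keys owned by node-b move to a neighbor
        System.out.println("user:42 -> " + ring.nodeFor("user:42"));
    }
}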

Original Link

Relational Data Model: Back to the Roots

Back to the Roots

Recently, I realized that the associative/semiotic/hypergraph (R3DM/S3DM) technology framework we propose to be adopted in database management systems can be considered in many ways an extension to Codd’s relational model. I am aware that this is a big claim and certainly this is not the place or the time to lay down my arguments, but this is how it occurred to me.

I have partially implemented TRIADB technology twice, on top of two different data stores, and I noticed that the ADD and GET operations we defined were closely related to Codd’s relational algebra operations, while datasets, i.e. domains, and a user-defined type system match the sets defined in mathematics and relational theory. Coincidentally, Codd’s relational logic goes back to Aristotle, as does the cornerstone of our technology, the computational semiotic triangle. I will briefly mention that one basic difference is that both the heading set and the body tuples of the relation — in fact, everything — are transformed and uniformly represented with numerical key references. Therefore, it can also be called a reference database management system (RDBMS). All these are simply good indications. I believe we are on the right track.

The truth is, and I will quote Chris Date here, that:

If you are proposing to replace technology A with technology B, you must understand technology A and there must be some problem that technology A does not solve and that technology B does solve.

And the best person I have found to teach me an in-depth understanding of relational database technology is Chris Date himself. The following video is a clip from an excellent illuminating workshop that explains Codd’s Relational Theory for computer professionals, but most importantly, he shows what a real relational product would be like and how and why it would be so much better than what’s currently available.

Relational Model vs. Other Data Models

That said, allow me to have my doubts about whether many of the proponents of other database technologies, including those in SQL databases and those in NoSQL databases, have understood what the differences really are with respect to relational models and at what abstraction level they occur. Again, this is not the place or time to elaborate on this. Instead, I am inviting you to ponder the architectural design of modern database management systems.

You see, in practice, it is too difficult to make a very clean separation between the physical, logical, and conceptual levels of information. From an engineer’s point of view, it is hard to separate theoretical from practical purposes. Moreover, many of the NoSQL DBMSs that are in fashion are suited to solving a particular type of problem. This is why you often hear that big corporations and large companies run many different kinds of DBMSs at the back-end — not to mention the trend of marketing many DBMSs as multi-model database systems. And that also made me realize that there has to be a distinction between the problems a DBMS solves at the physical level, e.g. partitioning and availability, and those it solves at the logical-conceptual level, e.g. integrity and data modeling. Therefore, I foresee that in the future, systems will have to combine these two levels, which will somehow have to be tuned and made to work in harmony while remaining independent of each other.

Our Perspective

This is our perspective toward the architectural design of modern database management systems that fully justifies our choice of marketing TRIADB as a middleware. We are focusing on providing an efficient and effective solution at the logical and conceptual level using an existing implementation of the database physical layer. Relational modeling theory applies here, too. From what I understand, it was the implementation details at the physical level and perhaps other naive simplifications that made many depart from the original relational model. So, it’s time to go back to the roots and make some real progress.

In case you, as a reader, have the same feelings and see some truth in my writing, I would be more than happy to discuss with you the progress we are making with TRIADB and associative, semiotic, hypergraph technology and definitely exchange ideas and share some common thoughts on these database topics. Stay tuned.

Original Link

Data Profiling: A Holistic View of Data Using Neo4j

Data profiling is a widely used methodology in the relational database world to analyze the structure, contents, and metadata of a data source. Generally, data profiling consists of a series of jobs executed against the data source to collect statistics and produce informative summaries of the underlying data.

As a general rule, data evolves with time. After some years, the actual data stored and used in a database may vary significantly from what people think it is or what the database was designed for at the beginning. Data profiling helps not only to understand anomalies and assess data quality but also to discover, register, and assess enterprise metadata.

The Neo4j graph database excels at analyzing connected, high-volume, and variably structured data assets, which makes data profiling all the more important: it helps us obtain a better understanding of the data, identify hidden patterns more easily, and potentially improve query performance.

This article will share practical data profiling techniques using the Cypher graph query language.

The following are system requirements:

  • Neo4j Graph Database version 3.2.x, either Community or Enterprise Edition, on either Linux or Windows (I use Windows 10).
  • Internet browser to access the Neo4j Browser (I use Chrome).
  • A graph database. Here, I imported data from the Stack Overflow Questions dataset, which contains more than 31 million nodes, 77 million relationships, and 260 million properties. The total database size on Windows 10 is about 20 GB.

All of the Cypher scripts and outcomes — mostly screenshots — are executed inside the Neo4j Browser, unless specified otherwise.

Database Schema Analysis

Database schema analysis is usually the first step of data profiling. The simple purpose of it is to know what the data model looks like and what objects are available.

Note: Most of the scripts used in this section can be found in Neo4j Browser under Favorites > Data Profiling.

Show the Graph Data Model (metamodel)

Cypher script:

// Show what is related, and how (the meta graph model)
CALL db.schema()

The Stack Overflow database has three node labels:

  1. User
  2. Post
  3. Tag

And it has four relationship types:

  1. User POSTED Post
  2. Post HAS_TAG Tag
  3. Post is PARENT_OF Post
  4. Post ANSWER another Post

Show Existing Constraints and Indexes

Cypher script:

// Display constraints and indexes
:schema

Indexes:

ON :Post(answers) ONLINE
ON :Post(createdAt) ONLINE
ON :Post(favorites) ONLINE
ON :Post(score) ONLINE
... ...

Constraints:

ON ( post:Post ) ASSERT post.postId IS UNIQUE
ON ( tag:Tag ) ASSERT tag.tagId IS UNIQUE
ON ( user:User ) ASSERT user.userId IS UNIQUE

Indexes tell us the properties that will have the best query performance when used for matching.

Constraints tell us the properties that are unique and can be used to identify a node or relationship.

Show All Relationship Types

Cypher script:

// List relationship types
CALL db.relationshipTypes()

A list of available relationship types.

Show All Node Labels/Types

Cypher script:

// List node labels
CALL db.labels()

A list of available node labels/types.

Count All Nodes

Cypher script:

// Count all nodes
MATCH (n) RETURN count(n)

It only takes 1ms for Neo4j to count 31 million plus nodes.

Count All Relationships

Cypher script:

// Count all relationships
MATCH ()-[r]->() RETURN count(*)

Again, it only takes 1 ms for Neo4j to return the total number of relationships.

Show Data Storage Sizes

Cypher script:

// Data storage sizes
:sysinfo

Sample Data

Cypher script:

// What kind of nodes exist
// Sample some nodes, reporting on property and relationship counts per node.
MATCH (n) WHERE rand() <= 0.1
RETURN
DISTINCT labels(n),
count(*) AS SampleSize,
avg(size(keys(n))) as Avg_PropertyCount,
min(size(keys(n))) as Min_PropertyCount,
max(size(keys(n))) as Max_PropertyCount,
avg(size( (n)-[]-() ) ) as Avg_RelationshipCount,
min(size( (n)-[]-() ) ) as Min_RelationshipCount,
max(size( (n)-[]-() ) ) as Max_RelationshipCount

You may have noticed that the first line of the script, MATCH (n) WHERE rand() <= 0.1, effectively samples roughly 10% (0.1) of the total nodes. Changing this value changes the sample size (e.g. 0.01 samples 1%).

Node Analysis

Node analysis is broadly similar to the table and column analysis performed when profiling a relational database (RDBMS). Its purpose is to reveal facts about nodes and their properties.

Count Nodes by Their Labels/Types

Cypher script:

// List all node types and counts
MATCH (n) RETURN labels(n) AS NodeType, count(n) AS NumberOfNodes;

Node counting gives a clearer idea of the volume of each type of node in the database.

Property Analysis

List All Properties of a Node

Cypher script:

// List all properties of a node
MATCH (u:User) RETURN keys(u) LIMIT 1

List All Properties of a Relationship

Cypher script:

// List all properties of a relationship
MATCH ()-[t:POSTED]-() RETURN keys(t) LIMIT 1

This relationship has no properties.

Uniqueness of the Property

Cypher script:

// Calculate uniqueness of a property
MATCH (u:User) RETURN count(DISTINCT u.name) AS DistinctName, count(u.name) AS TotalUser, 100*count(DISTINCT u.name)/count(u.name) AS Uniqueness;

It seems 78% of the usernames are unique. A property with mostly unique values can be a good candidate for an ID.

Nullability of the Property

Cypher script:

// Calculate nullability of a property
MATCH (u:User) WHERE u.name IS null RETURN count(u);

There are no null values for the name property of User nodes.

Min, Max, Average, and Standard Deviation of the Values of a Property

Cypher script:

// Calculate min, max, average and standard deviation of the values of a property
MATCH (p:Post) RETURN min(p.favorites) AS MinFavorites, max(p.favorites) AS MaxFavorites, avg(p.favorites) AS AvgFavorites, stDev(p.favorites) AS StdFavorites;

Occurrences of Values of a Property

Cypher script:

// Find out most often used values for a property
MATCH (p:Post)
RETURN p.answers AS Answers, count(p.answers) AS CountOfAnswers
ORDER BY Answers ASC;

From the results, 1.17 million posts have zero answers, 4.66 million have one answer, and so on.

Node Rank (Centrality)

Importance of a User

Cypher script:

// Calculate node rank / Centrality of a node
// i.e., the relevance of a node by counting the edges from other nodes:
// in-degree, out-degree and total degree.
MATCH (u:User)
WITH u, size( (u)-[:POSTED]->() ) AS OutDepth, size( (u)<-[:POSTED]-() ) AS InDepth
ORDER BY OutDepth, InDepth
WHERE u.name STARTS WITH 'T'
RETURN u.name, min(OutDepth), max(OutDepth), min(InDepth), max(InDepth)

Calculate node rank / centrality of a node in Neo4j

For a user, max(OutDepth) represents the maximum number of posts they have submitted. When max(InDepth) is 0, it means there is no relationship ending at the User node.

The user with the highest OutDepth (i.e. the most posts) can be considered more important within the community.

Note: As this is a heavy query, make sure there is enough heap size (specified by dbms.memory.heap.max_size in the neo4j.conf file). Alternatively, use a filter to limit the scope of the query as shown in the sample, which only looks for users whose name starts with T.
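For reference, the heap settings live in the neo4j.conf file. The values below are purely illustrative (they assume a machine with plenty of RAM) and should be sized for your own hardware:

# neo4j.conf (illustrative values)
dbms.memory.heap.initial_size=8g
dbms.memory.heap.max_size=8g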

Importance of a Post

By looking at which post has the greatest number of answers, we can gauge the importance of a post and how much attention it has received.

Orphan Nodes

Cypher script:

// Orphans: nodes that have no relationship
MATCH (u:User)
WITH u, size( (u)-[:POSTED]->() ) AS posts
WHERE posts = 0
RETURN u.name, posts;

Find orphan nodes in Neo4j

These are users who have never submitted any post or answer.

Relationship Analysis

Relationship analysis focuses on relationships in a graph database. It can help us understand the completeness, integrity, and density of certain relationships between nodes.

What is unique to graph databases, compared to typical RDBMSs, is the powerful analysis available to reveal hidden knowledge in connected data. One example is finding the shortest path between two nodes; another is identifying relationship triangles.

Statistics on Relationships

Cypher script:

// Count relationships by type
MATCH (u)-[p]-()
WITH type(p) AS RelationshipName, count(p) AS RelationshipNumber
RETURN RelationshipName, RelationshipNumber;

Calculate data relationship statistics in Neo4j using Cypher

Display the total number of each relationship type in the database.

Another query to get similar results is given below; however, it takes much more time to complete:

MATCH ()-[r]->() RETURN type(r), count(*)

Find the Shortest Path Between Two Nodes

Cypher script:

// Find all shortest path between 2 nodes
MATCH path = allShortestPaths((u:User {name:"Darin Dimitrov"})-[*]-(me:User {name:"Michael Hunger"}))
RETURN path;

Learn more about data profiling using the Neo4j graph database and the APOC library

The shortest path between the two chosen users is six hops long.

The shortest path between two users — highlighted by red arrows in the diagram above — tells us the two users are connected by posts having the same tags (red nodes). These are not necessarily the only paths, as users may post to answer each other’s questions or posts, but in this case, connections through the same tag — i.e. a common area of interest — are the fastest way to connect the two users.

In a large graph like this one, it may not be viable to calculate the shortest path between any two users. However, it may be valuable to check the connectivity among the most important people or among posts having the most interest.

Triangle Detection

Cypher script:

// Triangle detection:
MATCH (u:User)-[p1:POSTED]-(x1), (u)-[p2:POSTED]-(x2), (x1)-[r1]-(x2)
WHERE x1 <> x2
RETURN u, p1, p2, x1, x2 LIMIT 10;

// Count all triangles in the graph
MATCH (u:User)-[p1:POSTED]-(x1), (u)-[p2:POSTED]-(x2), (x1)-[r1]-(x2)
WHERE x1 <> x2
RETURN count(p1);

How to detect triadic closures (triangles) in the Neo4j graph database using the Cypher query language

Triangles are another key concept in graph theory. A triangle is formed by three mutually connected nodes, whether the relationships are directed or undirected. Identifying triangles — or a lack of triangles — provides interesting insights into the underlying data asset.

Triangles are also referred to as triadic closures, as per Graph Databases, 2nd Edition (O’Reilly Media):

A triadic closure is a common property of social graphs, where we observe that if two nodes are connected via a path involving a third node, there is an increased likelihood that the two nodes will become directly connected at some point in the future.

Putting this concept into our daily life, it’s a familiar social occurrence. If we happen to be friends with two people who don’t know one another, there’s an increased chance that those two people will become direct friends at some point in the future.

By discovering the existence of triangles in a graph database, we can create more efficient queries to avoid circular traversal.

Using the APOC Library

Since Neo4j 3.0, users can implement customized functionality using Java to extend Cypher for highly complex graph algorithms. This is the so-called concept of user-defined procedures.

The APOC library is one of the most powerful and popular Neo4j libraries. It consists of many procedures (about 300 at the time of writing) that help with many different tasks in areas like data integration, graph algorithms, and data conversion — and, no surprise, it also has several functions for analyzing the metadata of a graph database.

To enable APOC in Neo4j 3.x, there are a few simple steps:

  1. Stop Neo4j service.
  2. Download and copy the most recent version of the APOC JAR file to the plugins folder under the database, i.e. graph.db\plugins
  3. Add the following line to the neo4j.conf file: dbms.security.procedures.unrestricted=apoc.*
  4. Start Neo4j service again.

The out-of-the-box functions for profiling are all under apoc.meta.*.

Below are some samples:

• CALL apoc.meta.data()

This will list all nodes and relationships as well as properties of each.

List all nodes, relationships and properties in Neo4j using the APOC library

• CALL apoc.meta.graph()

This is equivalent to CALL db.schema() (refer to the "Show the Graph Data Model" section above).

• CALL apoc.meta.stats()

This will list statistics for nodes and relationships. It also shows the cardinality of each relationship type by node label. For example, the following stats show that the INTERACTS relationship connects nodes with the label Character.

{ "labelCount": 1, "relTypeCount": 1, "propertyKeyCount": 3, "nodeCount": 107, "relCount": 352, "labels": { "Character": 107 }, "relTypes": { "(:Character)-[:INTERACTS]->()": 352, "()-[:INTERACTS]->(:Character)": 352, "()-[:INTERACTS]->()": 352 }
}
• CALL apoc.meta.schema()

This will return metadata of all node labels, relationship types, and properties.

• CALL apoc.meta.subGraph({labels:['Character'],rels:['INTERACTS']})

This is a very useful function, especially for very large graphs, as it allows you to analyze a subset of nodes and relationships (a subgraph). The complete specification looks like this:

CALL apoc.meta.subGraph({labels:[labels],rels:[rel-types],excludes:[label,rel-type,…]})

Further Discussion

There is a huge advantage to storing data as a graph. The graph data model enables much more powerful analysis of relationships over large amounts of data and unearths buried connections among vast numbers of individual data elements (nodes).

Data profiling on a graph database like Neo4j gives us a more insightful understanding of the actual data we are working with. The results can then be used for further detailed analysis, performance tuning, database schema optimization, and data migration.

As a native graph database, Neo4j provides native graph data storage and native query processing through the Cypher query language. Some of the most useful data profiling tasks can be easily done using Cypher, as shown in this article.

There are also extensions that support more complex and advanced graph analysis, for example, betweenness centrality and connected components. The Neo4j Graph Algorithms library is the one I’ve been using to perform more complex data profiling. Once installed, its functions can be called directly as part of a Cypher query and the results visualized inside the Neo4j Browser. I plan to cover more of this in coming articles.

Original Link

What Are the Major Advantages of Using a Graph Database?

A graph database is data management software whose building blocks are vertices and edges. To put it in a more familiar context, a relational database is also data management software, with tables as its building blocks. Both require loading data into the software and using a query language or APIs to access the data.

Relational databases boomed in the 1980s. Many commercial companies (e.g. Oracle, Ingres, IBM) backed the relational model (tabular organization) of data management. In that era, the main data management need was to generate reports.

Graph databases didn’t show a clear advantage over relational databases until recent years, when frequent schema changes, explosive data volumes, and real-time query response requirements made people realize the advantages of the graph model.

Commercial software companies have been backing this model for many years, including TigerGraph (formerly named GraphSQL), Neo4j, and DataStax. The technology is disrupting many areas, such as supply chain management, e-commerce recommendations, security, fraud detection, and many other areas of advanced data analytics.

Here, we discuss the major advantages of using graph databases from a data management point of view.

Object-Oriented Thinking

This means very clear, explicit semantics for each query you write. There are no hidden assumptions, unlike relational SQL, where you have to know how the tables in the FROM clause implicitly form Cartesian products.

Performance

Graph databases have superior performance for querying related data, big or small. A graph is essentially an index data structure, and it never needs to load or touch data unrelated to a given query. They’re an excellent solution for real-time big data analytical queries.

Better Problem-Solving

Graph databases solve problems that are impractical, if not impossible, to express as relational queries. Examples include iterative algorithms such as PageRank, gradient descent, and other data mining and machine learning algorithms. Research has proved that some graph query languages are Turing complete, meaning that you can write any algorithm in them. Many query languages on the market have limited expressive power, though, so make sure you ask plenty of hypothetical questions to see whether a language can answer them before you lock yourself in.

Update Real-Time Data and Support Queries Simultaneously

Graph databases can perform real-time updates on big data while supporting queries at the same time. This is a major drawback of existing big data management systems such as Hadoop HDFS, which was designed for data lakes: sequential scans and appending new data (no random seeks) are architectural choices intended to ensure fast scan I/O over entire files, on the assumption that any query will touch the majority of a file. Graph databases only touch relevant data, so sequential scanning is not the optimization they rely on.

Flexible Online Schema Environment

Graph databases offer flexible online schema evolution while serving your queries. You can constantly add and drop vertex or edge types, or their attributes, to extend or shrink your data model, which makes it convenient to manage explosive and constantly changing object types. Relational databases simply cannot adapt easily to this requirement, which is commonplace in the modern data management era.

Group by Aggregate Queries

Graph databases support group-by aggregate queries of a kind that is hard to imagine in relational databases. Due to the restrictions of the tabular model, aggregate queries on a relational database are greatly constrained by how data is grouped together. In contrast, graph models are more flexible for grouping and aggregating relevant data. See this article on the latest expressive power of aggregation for graph traversal. I don’t think relational databases can do this kind of flexible aggregation on selective data points. (Disclaimer: I have worked on commercial relational database kernels for a decade; Oracle, MS SQL Server, popular Apache open-source platforms, etc.)

Combine and Hierarchize Multiple Dimensions

Graph databases can combine multiple dimensions to manage big data, including time series, demographic, and geo dimensions, with a hierarchy of granularity on each dimension. Think about an application in which we want to segment a population based on both time and geo dimensions. With a carefully designed graph schema, data scientists and business analysts can conduct virtually any analytical query on a graph database. This capability was traditionally only accessible through low-level programming languages such as C++ and Java.

AI Infrastructure

Graph databases make great AI infrastructure because well-structured relational information between entities allows one to infer indirect facts and knowledge. Machine learning experts love them: they provide rich information and convenient data accessibility that other data models can hardly match. For example, the Google Expander team has used this for smart messaging technology. The knowledge graph was created by Google to understand humans better, and many more advances are being made on knowledge inference. The keys for a graph database to successfully serve as real-time AI data infrastructure are:

  • Support for real-time updates as fresh data streams in

  • A highly expressive and user-friendly declarative query language to give full control to data scientists

  • Support for deep-link traversal (>3 hops) in real-time (sub-second), just like human neurons sending information over a neural network; deep and efficient

  • Scale out and scale up to manage big graphs

In conclusion, we see many advantages of native graph databases that cannot be worked around by traditional relational databases. However, as with any new technology replacing an old one, there are still obstacles to adopting graph databases. One is that there are fewer qualified graph developers in the job market than SQL developers. Another is the lack of standardization of graph database query languages. There has also been a lot of marketing hype, and incomplete offerings have led to subpar performance and usability, which slows down adoption of the graph model in enterprises.

Original Link

What Goes Around Comes Around: A Brief History of Databases

What Goes Around Comes Around is a fascinating paper about the (cyclical) history of data modeling. It was written by two database experts: Joseph Hellerstein, a computer science professor at UC Berkeley, and Michael Stonebraker, founder of Ingres and Postgres and winner of the 2014 Turing award.

This article came to my attention as the first paper discussed in Readings in Database Systems (or the “Red Book,” also in part by Michael Stonebraker), linked to me by a co-worker at Grakn. The book has proven to be a very good reference point for tried-and-tested techniques and architectures in database management systems.

Written in 2005, it presents a surprisingly relevant view on the current state of databases and data models. The trends it outlines are still holding true, with models such as JSON and property graphs sitting firmly in the “semi-structured” era it describes.

The paper takes us through nine eras of data models:

  • Hierarchical (IMS): Late 1960s and 1970s
  • Network (CODASYL): 1970s
  • Relational: 1970s and early 1980s
  • Entity-Relationship: 1970s
  • Extended relational: 1980s
  • Semantic: Late 1970s and 1980s
  • Object-oriented: Late 1980s and early 1990s
  • Object-relational: Late 1980s and early 1990s
  • Semi-structured (XML): Late 1990s to the present

In reading the paper, it’s clear that there are certain lessons to be learned from each era. Certain things clearly worked and other things clearly did not, and it shows that innovations don’t always get adopted in a straightforward way. Moreover, it’s also evident that the same sorts of issues appear and re-appear and the same lessons keep being relearned.

In order to illustrate these historical patterns in database theory and innovation, the authors use the classic example of suppliers and parts from Codd:

Supplier (sno, sname, scity, sstate)
Part (pno, pname, psize, pcolor)
Supply (sno, pno, qty, price)

This is presented as a relational schema. Each line is a relation (or table), with its attributes in brackets. So, in this case, we have a set of suppliers and a set of parts. There is also a “supply” relation, indicating that a particular part is supplied by a particular supplier, with the given quantity and price.

IMS Era

IMS (Information Management System) was released in 1968. It structured data into record types with fields, similar to the relational table above. Each record type (except the root) has one parent record type. Similarly, every instance of a record type needs a parent that is an instance of the parent record type. This is called a hierarchical model because it forms a hierarchy of record types and instances.

This model is so limited that we can’t even represent our example properly!

Our options are either to make Part a child of Supplier (left), or Supplier a child of Part (right). Notice that in the former case, we end up duplicating part information if the same part is sold by different suppliers. In the latter case, we will duplicate supplier information if the same supplier sells different parts.

This redundancy is an issue in terms of storage efficiency as well as consistency. If the same data is stored in two places, special care has to be made to make sure that both pieces of data do not go out of sync.

The former case also cannot model a part that is not sold by any supplier. Similarly, the latter case cannot model a supplier who sells no parts.

IMS was queried record-at-a-time by navigating the hierarchy. For example, to find all the red parts supplied by Supplier 16:

get unique Supplier (sno = 16)
until failure: get next within parent (pcolor = red)

This reads a lot like an imperative programming language; we have step-by-step instructions, state, and control flow. We essentially have to describe explicitly the algorithm to execute in order to complete a query. Working out how to run a query quickly can be a challenge. Even something this simple may have a faster execution method in certain circumstances (for example, if we have an index to look up red parts, but not one for supplier numbers).

Additionally, IMS supported several ways to store records based on the key (a unique, ordered identifier for a record): you could store them sequentially, indexed with a B-tree, or hashed. This meant you could choose whatever would provide the best performance for your use case.

However, the storage method chosen would actually disable certain operations. For example, you could not insert a single record when they were stored sequentially. If using the hash method, then you could not use “get next,” such as in the query above. This demonstrates a lack of physical data independence — we cannot change the physical data level without also changing the logical data level. This becomes very important with databases that are maintained for a long time: business requirements change, the data stored changes and increases. At some point, a different physical data representation could be necessary to improve performance. This was a motivator behind the relational model, which we’ll discuss later.

IMS did offer a little support for logical data independence — meaning we can extend the logical data level without impacting applications built on it. It allowed exposing only a subset of record types (essentially a sub-tree of the hierarchy). This meant new record types could be added without disrupting the view of the data.

IMS has seen several extensions to its model so that it can (sort of) handle more complicated models such as our example. It is actually still in use and maintained today (the latest release was in 2015). In general, its remaining users appear to be legacy systems where the costs of switching to something more modern outweigh the benefits.

So these are the lessons from IMS:

  • Physical and logical data independence are highly desirable.
  • Tree-structured data models are very restrictive.
  • It is a challenge to provide sophisticated logical reorganizations of tree-structured data.
  • A record-at-a-time user interface forces the programmer to do manual query optimisation, and this is often hard.

CODASYL Era

Independently of IMS, CODASYL (Committee on Data Systems Languages) created several reports between 1969 and 1973 describing the network model, a more flexible model that allows a record type to have multiple parents.

In our example, we introduce two arrows: Supplies and Supplied_by. These are called set types. The arrow goes from the owner record type to the child record type. An instance of the owner type has a set of instances of the child type (this is a little distinct from a mathematical set because the set must have an owner). So in our example above, a Supplier may have any number of Supply children via the set Supplies.

This extra flexibility has allowed us to eliminate all redundancy. Additionally, it’s perfectly possible to have a Supplier without any Parts and vice-versa – they would just have no Supply children.

Like IMS, CODASYL uses a record-at-a-time query language:

find Supplier (SNO = 16)
until no-more:
    find next Supply record in Supplies
    find owner Part record in Supplied_by
    get current record
    check current record (color = red)

In this case, we have several “cursors” pointing to records — a cursor for the Supplier, a cursor for the Supply, a cursor for the Part and a cursor for the current record. When writing a CODASYL query, the programmer has to consider the location of all these different cursors in their head. The paper explains it best: “One way to think of a CODASYL programmer is that he should program looking at a wall map of the CODASYL network that is decorated with various colored pins.” As you can imagine, this isn’t terribly easy or intuitive!

An additional issue with the network model is loading the data. Building the sets of records can be time-consuming and often the order things are loaded can make a big difference in terms of performance. Working out the order things should be loaded is non-trivial.

There are still some databases floating around based on CODASYL, such as IDMS; however, they don’t seem to be widespread.

The lessons from CODASYL are:

  • Networks are more flexible than hierarchies but more complex.
  • Loading and recovering networks is more complex than hierarchies.

Relational Era

The relational model was proposed by Ted Codd in 1970, motivated in part by producing a better model than IMS for data independence. Data is structured as relations — sets of tuples. The query language is set-at-a-time, performed on entire relations using operators such as select σ and join ⋈ :

σ[sno=16 ∧ pcolor=red](Supply ⋈ Supplier ⋈ Part)

This combination of a simple model and a high-level language allows for much better logical and physical data independence: using the relational operators, we can describe logical relations in terms of physical ones (the ones that are actually stored to disc). Applications do not need to know the difference.

In contrast to the previous models discussed, the relational model took a more theoretical approach: it is (deliberately) less focused on implementation and therefore does not have a physical storage proposal. You can use whatever storage may be appropriate (whether that involves storing them unordered, indexed with a B-tree, or both).

Not only that, but this higher-level algebra allows a system to use rules of replacement to optimise the query. For example, using rules like the commutativity of ⋈ we can rewrite the query as:

σ[sno=16](Supplier) ⋈ Supply ⋈ σ[pcolor=red](Part)

The database system can determine whether this is faster using knowledge about the storage, such as any indices available and the size of the relations in question.

This new model kicked off the “great debate” between Ted Codd (supporting the relational model and mostly backed by academics) and Charlie Bachman (supporting CODASYL and mostly backed by DBMS users).

CODASYL was criticised for being simultaneously too complex and too inflexible to represent common situations. Record-at-a-time programming was also considered too hard to optimise.

CODASYL advocates responded to this criticism by introducing their own set-at-a-time query languages that would also allow better physical and logical data independence. There were also efforts to clean the model to make it easier to understand.

The relational model was criticised for being too academic: relational algebra queries were not easy to read or understand. Additionally, the relational model did not have a physical storage proposal so it was not initially clear that an efficient implementation could even be built.

Relational advocates responded to this by creating “natural” query languages based on the relational algebra such as SQL. Implementations such as System R and INGRES eventually proved that the model was viable and that automatic query optimisation could be competitive even with very skilled programmers.

So the two camps responded to the criticisms of the other and attempted to arrive at solutions. In practice, these differences didn’t matter as much as you might expect. Having previously exclusively backed IMS, IBM changed its tune in 1984 and announced dual support for IMS and DB/2 (an early relational database). DB/2 was the newer and easier-to-use technology, so naturally was selected over IMS. This advantage was enough to settle the issue.

Tangentially, this also established SQL as the relational query language. Nowadays, “relational” and “SQL” are often treated as synonymous, but at the time SQL had competition (such as QUEL). There are people of the opinion that SQL is not a good language for the relational model, such as Hugh Darwen and C. J. Date, writers of The Third Manifesto. One of their criticisms of SQL is that it is not strictly the same as the relational model. For example, SQL allows an additional NULL value for absent or unknown information. SQL relations also allow duplicate rows, meaning they are not sets.

So, our lessons:

  • Set-at-a-time languages are good, regardless of the data model, since they offer much improved physical data independence.
  • Logical data independence is easier with a simple data model than with a complex one.
  • Technical debates are usually settled by the elephants of the marketplace, and often for reasons that have little to do with the technology.
  • Query optimizers can beat all but the best record-at-a-time DBMS application programmers.

Next Time…

We’ve looked at three eras of data models: IMS, CODASYL, and relational. We’ve seen the importance of physical and logical data independence, the limitations of hierarchical models, and the advantages of set-at-a-time high-level query languages.

At Grakn, we’re building a knowledge base and attempting to follow these lessons. We have a high-level set-at-a-time query language, with built-in reasoning allowing high logical data independence. Additionally, we use a flexible entity-relationship model (which you will see in the next post) to structure our data.

Relational databases continue to dominate and remain a hugely popular data model even thirty years later. Nonetheless, more recent competition from other models such as graph databases and NoSQL has made the contemporary database landscape much more diverse and has provided a new set of tools to developers. In the next blog post, we’ll look at some of the key “post-relational” models:

  • Entity-Relationship
  • Extended relational
  • Semantic
  • Object-oriented
  • Object-relational
  • Semi-structured (XML)

Original Link

Data Modeling Guidelines for NoSQL JSON Document Databases

In this blog post, I’ll discuss how NoSQL data modeling is different from traditional relational schema data modeling, and I’ll also provide you with some guidelines for document database data modeling.

Document databases, such as MapR-DB, are sometimes called “schema-less” — but this is a misnomer. Document databases don’t require the same predefined structure as a relational database, but you do have to define the facets of how you plan to organize your data. Typically, with a NoSQL data store, you want to aggregate your data so that it can quickly be read together, instead of relying on joins. A properly designed data model can make all the difference in how your application performs. We have an anecdote at MapR where one of our solution architects worked with a customer and, in a one-hour conversation about schema design, was able to improve access performance by a factor of 1,000. These concepts matter.

Why NoSQL?

Simply put, the motivation behind NoSQL is data volume, velocity, and/or variety. MapR-DB provides for data variety with two different data models:

  1. MapR-DB as a wide column database with an Apache HBase API.
  2. MapR-DB as a document database with an Open JSON API.

MapR-DB JSON differs from other document data stores in that the row key design is the same for both models, and both can store data (columns or documents) with different access patterns in different column families under the same row key.

Relational vs. NoSQL Data Modeling

In relational design, the focus and effort are around describing the entity and its relation to other entities — the queries and indexes are designed later. With a relational database, you normalize your schema, which eliminates redundant data and makes storage efficient. Then, queries with joins bring the data back together again. However, joins cause bottlenecks on read, with data distributed across a cluster, and this model does not scale horizontally. With MapR-DB, a table is automatically partitioned across a cluster by key range, and each server is the source for a subset of a table (called a tablet). MapR-DB has a query-first schema design in which queries should be identified first, then the row key should be designed to distribute the data evenly and also to give a meaningful primary index to query by. The row document (JSON) or columns (HBase) should be designed to group data together that will be read together. With MapR-DB, you de-normalize your schema to store in one row or document what would be multiple tables with indexes in a relational world. Grouping the data by key range provides for fast reads and writes by row key.

NoSQL Data Modeling Process

It is useful to start off with Entity Relationship modeling in order to define the entities, relationships, and attributes in your application:

  • Entities: Main objects in your application
  • Attributes: Properties of the objects in your application
  • Relationships: Connections between entities, i.e. 1-1, 1-many, many-many

The E-R model can be used with your query and data access patterns to define the physical model so that data that is read together is stored together.

As a modeling example, we will use a social application similar to Reddit (Note: I do not know how Reddit is really implemented). Here are the use cases:

  • Users can post URLs to articles by category (like news, sports, etc.).
  • Users can then make comments on posts.

Some of the query requirements are:

  • Display the posts by category and date (most recent first)
  • Display the comments by post
  • Display the posts by user ID

Logical Model Example

This is an E-R diagram for our example social application:

The Entities are:

  • User, Post, Comment, Category

The relations are:

  • A User makes a post
  • A Post has comments
  • A Post belongs to a category

Relational Model Example

This is the relational model for the example social application:

  • Users are stored in the user table.
  • The posted URL is stored in the Post table with a foreign key to the user that posted it, and a foreign key to the category for the post.
  • Comments about a post are stored in the comments table with a foreign key to the post and a foreign key to the user that commented.

Normalization

In a relational database, you normalize the schema to eliminate redundancy by putting repeating information into a table of its own. In the example below, we have an order table, which has a one-to-many relationship with an order items table. The order items table has a foreign key with the ID of the corresponding order.

Denormalization

In a denormalized data store, you store in one table what would be multiple tables with indexes in a relational world. Denormalization can be thought of as a replacement for joins. Often, with NoSQL, you de-normalize or duplicate data so that data that is accessed together is stored together.

Parent-Child Relationship-Embedded Entity

Here is an example of denormalization of the SALES_ITEM schema in a Document database:

{ "_id": "123", "date": "10/10/2017", “ship_status”:”backordered” "orderitems": [ { "itemid": "4348", "price": 10.00 }, { "itemid": "5648", "price": 15.00 }]
}

If your tables exist in a one-to-many relationship, it’s possible to model it as a single document. In this example, the order and related line items are stored together and can be read together with a find on the row key (_id). This makes the reads a lot faster than joining tables together.

Note: The maximum default row size is 32 MB, and the optimal size is between 50 and 100 KB. If the embedded entities are really long, they could be bucketed by row key, or you could just store the ID pointing to the embedded entity’s own table (which would require your application to query that table as well).

Document Model Example

This is the document model for the example social application:

There are two tables in the document model compared to four in the relational:

  • User details are stored in the User table.
  • Posted URLs are stored in the Post table:
    • The row key is composed of the category and a reverse timestamp so that posts will be grouped by category with the most recent first.
    • There is a secondary index on the posted by attribute, to query by who submitted the URL.
    • Comments are embedded in the post table.

Composite Row Key Design

Row keys are the primary index for MapR-DB (MapR-DB JSON 6.0 also has secondary indexes). Data is automatically distributed, as it is written, by sorted row key range. You can include multiple data elements in a “composite” row key, which can be useful for grouping rows together and finding them by key range. For example, if you wanted to group posts by category and date, you could use a row key like "SPORTS_20131012" (if you want the most recent first, use a reverse timestamp). If you wanted to group restaurants by location, you could use a row key like "TN_NASHVL_PANCAKEPANTRY".

Another option is to add a hash prefix to the row key in order to get good distribution and still have a secondary grouping.
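As a rough sketch of these two row-key styles, here is some illustrative Java. The method names and the one-byte hash prefix are assumptions made for the example, not part of any MapR-DB API:

public final class RowKeys {

    // e.g. "SPORTS_9223370505100775807": subtracting the timestamp from
    // Long.MAX_VALUE makes the most recent posts sort first within a category.
    static String categoryReverseTimestampKey(String category, long postTimeMillis) {
        return category.toUpperCase() + "_" + (Long.MAX_VALUE - postTimeMillis);
    }

    // e.g. "a7_TN_NASHVL_PANCAKEPANTRY": a short hash prefix spreads writes
    // across the key range while keeping a secondary grouping by the natural key.
    static String hashPrefixedKey(String naturalKey) {
        int bucket = naturalKey.hashCode() & 0xff;
        return String.format("%02x_%s", bucket, naturalKey);
    }
}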

Generic Data, Event Data, and Entity-Attribute-Value

Generic data is often expressed as name-value pairs or entity-attribute-value triples. In a relational database, this is complicated to represent because every row is expected to represent an instance of a similar object. JSON allows easy variation across records. Here is an example of clinical patient event data:

patientid-timestamp, Temperature , "102"
patientid-timestamp, Coughing, "True"
patientid-timestamp, Heart Rate, "98"

This is the document model for the clinical patient event data:

The row key is the patient ID plus a time stamp. The variable event type and measurement are put into name-value pairs.
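Assuming one document per patient reading (the field layout here is my own illustration, not taken from the original article), the documents might look roughly like this:

{
  "_id": "patient001-20180423T103000",
  "Temperature": "102",
  "Coughing": "True",
  "Heart Rate": "98"
}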

Tree, Adjacency List, Graph Data

Here is an example of a tree, or adjacency list:

Here is a document model for the tree shown above (there are multiple ways to represent trees):

{ "_id": "USA", “type”:”state”, "children": ["TN",”FL] "parent": null
}
{ "_id": "TN", “type”:”state”, "children": ["Nashville”,”Memphis”] "parent": "USA”
}
{ "_id": "FL", “type”:”state”, "children": ["Miami”,”Jacksonville”] "parent": "USA”
}
{ "_id": "Nashville", “type”:”city”, "children": [] "parent": "TN”
}

Each document is a tree node, with the row key equal to the node ID. The parent field stores the parent node ID, and the children field stores an array of child node IDs. A secondary index on the parent and children fields makes it quick to find a node’s parent or children.

Inheritance Mapping

In modern object-oriented programming models, different object types can be related, for instance, by extending the same base type. In object-oriented design, these objects are considered instances of the same base type, as well as instances of their respective subtypes. It is useful to store objects in a single database table to simplify comparisons and calculations over multiple objects. But we also need to allow objects of each subtype to store their respective attributes, which may not apply to the base type or to other subtypes. This does not match a relational model but is very easy to do with a document model. Here is an example of object inheritance for store products (bike, pedal, and jersey are all types of store products):

In this online store example, the type of product is a prefix in the row key. Some of the name-value pairs differ and may be missing depending on the type of product. This makes it possible to model different product types in the same table and to easily find a group of products by type.
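A possible shape for such product documents, with hypothetical attribute names, might be:

{ "_id": "BIKE_123", "price": 999.00, "frame_size": 58, "color": "red" }
{ "_id": "PEDAL_456", "price": 29.50, "cleat_type": "SPD" }
{ "_id": "JERSEY_789", "price": 45.00, "size": "M", "material": "polyester" }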

In this blog post, you learned how document database data modeling is different from traditional relational schema modeling, and you also got some guidelines for document database data modeling.

Original Link

What Developers Need to Know About Databases

To gather insights on the state of databases today and their future, we spoke to 27 executives at 23 companies who are involved in the creation and maintenance of databases.

We asked these executives, “What skills do developers need to be proficient with databases?” Here’s what they told us:

SQL

  • Strong SQL knowledge. Understand the difference between joins — outer, inner, left, right. Know which type of database to use for which use case. Once you learn one programming language, other languages come easily. Know ML and DL libraries as well as Python. Know the different kinds of design patterns.
  • SQL is never going away. Mongo, HBase/Hadoop, and Cassandra all have SQL interfaces. SQL is still the most well-known language. Understand how SQL impacts the performance of the data store. Developers must understand the platform and the performance of the database. Understand the tools needed to analyze query performance.
  • SQL is still the language of choice for databases. Know how to translate objects in the application and how to store them. Be able and willing to move to DevOps. Let go and be open to new frameworks. Deploy through cloud containers. Think microservices, not monolithic applications.
  • For SQL relations, understand declarative languages, intersections, and joins. Get rid of if/then logic and understand the language. The database system is designed for getting data in and out well. Don’t fall back on the chosen application language. Take advantage of what the database has to offer to accelerate performance.
  • It’s important to understand SQL even though frameworks automate it. Understand how to model data appropriately to handle scale, agility, and changes; better-modeled data is better prepared for the future. Understand the characteristics of databases for different services. Know what the right solution is for the problem you are trying to solve.
  • SQL is an obvious skill for developers to learn. Every Oracle database includes application express and SQL. We support all developer frameworks so there is no relearning of tools or languages required.
  • The short answer is SQL. Also, it’s increasingly important to be able to map the real world into a schema that describes the relationship between objects/entities.

Databases

  • Understand the data model and the end goal of the application so that you can choose the right database. How will other processes and tooling work around the database to shorten the development cycle and enable the faster development of features?
  • Make sure you use the right tool for the job. Make sure the solution you are building has copious test data so you can test every scenario. You can’t have too much good test data. Keep it simple; don’t go out of scope. 
  • Have the right tool for the right job. Know multi-storage. While you may be inclined to become an expert in a particular system and data stores, make sure you understand distributed databases and fundamental system levels, problem-solving, and systems thinking in a distributed manner. 
  • A lot of information is available. Understand the tradeoffs of each solution. What tradeoffs come with transactions, durability, and consistency? In programming, anything is possible, but there are tradeoffs. As apps move faster, be able to make changes quickly. Consider how far-reaching changes are, how long they take, and what the maintenance lifecycle is. Work with application architects to understand requirements.
  • Look at the platform and tools and understand when to choose a relational database (transactions) versus a NoSQL database (reports and analytics). Relational requires more overhead in code with an object-relational mapper; no object mapper is needed with NoSQL. How badly do you need transactions?
  • For me, it’s skill sets. Databases are not just shrink-wrapped software that’s deployed and then looked after by a DBA. Databases are now part of the DevOps fabric. It takes a different skill set to get business value from them these days. You have to be part programmer, part automator, part infrastructure admin, and part database admin. That’s a unicorn these days.

Other

  • Performance. What are you doing when the table gets large? Can you write a better query to be more selective? Think about schema structure and the coupling that foreign keys introduce. How do you write microservices with keys, coupling, and structure in mind? Persisting object graphs inside an RDBMS means foreign key references from one object tree to another can cause coupling problems.
  • Basic awareness of app areas and a high level of security requirements. Knowing the regulations in any given vertical. Encryption standards. Masking standards.
  • A willingness to learn new technologies! Even if you never leave Postgres for instance, there’s a ton of stuff in there that falls outside the classical SQL model, and learning how to use it will improve your craft. But don’t be afraid to leave your comfort zone! If you see a new database that looks crazy and a distribution is available, try it out! Even if you don’t find it useful, you’ll be better for the experience. A good handle on distributed systems is going to become more of a requirement going forward. Beyond that, a critical eye and an understanding of the tradeoffs you make with every decision. Work with product management to understand the business need.
  • Don’t focus too much on the technology. Focus on the use case and the long term for apps today and the future. Understand and define the business case.
  • Find a problem and figure out how to solve it. Learn by making mistakes. I received a dump of Apache logs and was asked what we could learn from them. I did a terrible job but learned a lot. Find a problem to solve and don’t be afraid to fail. Understand the basic concepts of distributed systems – what’s possible and what isn’t. What’s easy and what isn’t.
  • Architect skills. Developers must be architects these days. Understand the big picture of mapping. Reinvent the art of re-evaluating technology. Everything changes so fast the fundamentals you learned five or 10 years ago may no longer be applicable. Something counter to your intuition could work because the underlying facts have changed. Modern networks have reduced latency from tens or hundreds of milliseconds to less than one millisecond. Bandwidth is 100-times greater. Try to find the next set of standards. There are many ways of doing things.
  • Memory is the new disk. Think memory first. Have a multi-cloud strategy. Your data grid will act as a data broker, unifying data from multiple infrastructures and multiple sources and making it highly portable.
  • Build fundamentally scalable apps with concurrency using a Scala/Spark infrastructure.
  • There are several skills crucial to database management, the least of them being coding knowledge — despite developers’ first instincts. The first is to be organized in your tracking. It’s essential to get a good baseline for the unique KPIs for your specific DBMS. Beyond the basics of querying execution time, each platform often has its own metrics you’ll want to keep an eye on for performance tuning. The second is networking. As documentation can be limited, developers often need to reach out to those colleagues with long histories of working with the database platform. They’ll know the ins and outs of the multiple generations and version updates to best remedy the issue at hand.
  • I believe for developers to be proficient with databases they need to have the capability to create efficient data/graph models and queries.

What do you think developers need to know to be proficient working with databases?

Here’s who we talked to:

  • Emma McGrattan, S.V.P. of Engineering, Actian
  • Zack Kendra, Principal Software Engineer, Blue Medora
  • Subra Ramesh, VP of Products and Engineering, Dataguise
  • Robert Reeves, Co-founder and CTO and Ben Gellar, VP of Marketing, Datical
  • Peter Smails, VP of Marketing and Business Development and Shalabh Goyal, Director of Product, Datos IO
  • Anders Wallgren, CTO and Avantika Mathur, Project Manager, Electric Cloud
  • Lucas Vogel, Founder, Endpoint Systems
  • Yu Xu, CEO, TigerGraph
  • Avinash Lakshman, CEO, Hedvig
  • Matthias Funke, Director, Offering Manager, Hybrid Data Management, IBM
  • Vicky Harp, Senior Product Manager, IDERA
  • Ben Bromhead, CTO, Instaclustr
  • Julie Lockner, Global Product Marketing, Data Platforms, InterSystems
  • Amit Vij, CEO and Co-founder, Kinetica
  • Anoop Dawar, V.P. Product Marketing and Management, MapR
  • Shane Johnson, Senior Director of Product Marketing, MariaDB
  • Derek Smith, CEO and Sean Cavanaugh, Director of Sales, Naveego
  • Philip Rathle, V.P. Products, Neo4j
  • Ariff Kassam, V.P. Products, NuoDB
  • William Hardie, V.P. Oracle Database Product Management, Oracle
  • Kate Duggan, Marketing Manager, Redgate Software Ltd.
  • Syed Rasheed, Director Solutions Marketing Middleware Technologies, Red Hat
  • John Hugg, Founding Engineer, VoltDB
  • Milt Reder, V.P. of Engineering, Yet Analytics

Original Link

Relational to JSON With APEX_JSON

APEX_JSON is a PL/SQL API included with Oracle Application Express (APEX) 5.0 that provides utilities for parsing and generating JSON. While APEX_JSON was primarily intended to be used by APEX developers, there are some hooks that can allow it to be used as a standalone PL/SQL package.

Solution

The following solution uses APEX_JSON to create the JSON that represents a department in the HR schema. APEX_JSON basically writes JSON content to a buffer. By default, the buffer used is the HTP buffer in the database, as that’s what APEX reads, but as you can see from the call to apex_json.initialize_clob_output near the top of the function, it’s possible to redirect the output to a CLOB buffer instead. Once we’ve redirected the output, we can make API calls to open and close objects and arrays and write values to them. When we’re done writing out the JSON, we call get_clob_output to get the JSON contents.

create or replace function get_dept_apex_json(
   p_dept_id in departments.department_id%type
)
   return clob
is
   cursor manager_cur (
      p_manager_id in employees.employee_id%type
   ) is
      select *
      from employees
      where employee_id = manager_cur.p_manager_id;

   l_date_format    constant varchar2(20) := 'DD-MON-YYYY';
   l_dept_rec       departments%rowtype;
   l_dept_json_clob clob;
   l_loc_rec        locations%rowtype;
   l_country_rec    countries%rowtype;
   l_manager_rec    manager_cur%rowtype;
   l_job_rec        jobs%rowtype;
begin
   apex_json.initialize_clob_output;

   select *
   into l_dept_rec
   from departments
   where department_id = get_dept_apex_json.p_dept_id;

   apex_json.open_object(); --department
   apex_json.write('id', l_dept_rec.department_id);
   apex_json.write('name', l_dept_rec.department_name);

   select *
   into l_loc_rec
   from locations
   where location_id = l_dept_rec.location_id;

   apex_json.open_object('location');
   apex_json.write('id', l_loc_rec.location_id);
   apex_json.write('streetAddress', l_loc_rec.street_address);
   apex_json.write('postalCode', l_loc_rec.postal_code);

   select *
   into l_country_rec
   from countries cou
   where cou.country_id = l_loc_rec.country_id;

   apex_json.open_object('country');
   apex_json.write('id', l_country_rec.country_id);
   apex_json.write('name', l_country_rec.country_name);
   apex_json.write('regionId', l_country_rec.region_id);
   apex_json.close_object(); --country
   apex_json.close_object(); --location

   open manager_cur(l_dept_rec.manager_id);
   fetch manager_cur into l_manager_rec;

   if manager_cur%found then
      apex_json.open_object('manager');
      apex_json.write('id', l_manager_rec.employee_id);
      apex_json.write('name', l_manager_rec.first_name || ' ' || l_manager_rec.last_name);
      apex_json.write('salary', l_manager_rec.salary);

      select *
      into l_job_rec
      from jobs job
      where job.job_id = l_manager_rec.job_id;

      apex_json.open_object('job');
      apex_json.write('id', l_job_rec.job_id);
      apex_json.write('title', l_job_rec.job_title);
      apex_json.write('minSalary', l_job_rec.min_salary);
      apex_json.write('maxSalary', l_job_rec.max_salary);
      apex_json.close_object(); --job
      apex_json.close_object(); --manager
   else
      apex_json.write('manager', '', p_write_null => true);
   end if;

   close manager_cur;

   apex_json.open_array('employees');

   for emp_rec in (
      select *
      from employees
      where department_id = l_dept_rec.department_id
   )
   loop
      apex_json.open_object(); --employee
      apex_json.write('id', emp_rec.employee_id);
      apex_json.write('name', emp_rec.first_name || ' ' || emp_rec.last_name);
      apex_json.write('isSenior', emp_rec.hire_date < to_date('01-jan-2005', 'dd-mon-yyyy'));
      apex_json.write('commissionPct', emp_rec.commission_pct, p_write_null => true);

      apex_json.open_array('jobHistory');

      for jh_rec in (
         select job_id, department_id, start_date, end_date
         from job_history
         where employee_id = emp_rec.employee_id
      )
      loop
         apex_json.open_object(); --job
         apex_json.write('id', jh_rec.job_id);
         apex_json.write('departmentId', jh_rec.department_id);
         apex_json.write('startDate', to_char(jh_rec.start_date, l_date_format));
         apex_json.write('endDate', to_char(jh_rec.end_date, l_date_format));
         apex_json.close_object(); --job
      end loop;

      apex_json.close_array(); --jobHistory
      apex_json.close_object(); --employee
   end loop;

   apex_json.close_array(); --employees
   apex_json.close_object(); --department

   l_dept_json_clob := apex_json.get_clob_output;
   apex_json.free_output;

   return l_dept_json_clob;
exception
   when others then
      if manager_cur%isopen then
         close manager_cur;
      end if;
      raise;
end get_dept_apex_json;

Output

When passed a departmentId of 10, the function returns a CLOB populated with JSON that matches the goal 100%.

{ "id": 10, "name": "Administration", "location": { "id": 1700, "streetAddress": "2004 Charade Rd", "postalCode": "98199", "country": { "id": "US", "name": "United States of America", "regionId": 2 } }, "manager": { "id": 200, "name": "Jennifer Whalen", "salary": 4400, "job": { "id": "AD_ASST", "title": "Administration Assistant", "minSalary": 3000, "maxSalary": 6000 } }, "employees": [ { "id": 200, "name": "Jennifer Whalen", "isSenior": true, "commissionPct": null, "jobHistory": [ { "id": "AD_ASST", "departmentId": 90, "startDate": "17-SEP-1995", "endDate": "17-JUN-2001" }, { "id": "AC_ACCOUNT", "departmentId": 90, "startDate": "01-JUL-2002", "endDate": "31-DEC-2006" } ] } ]
}

Summary

I really enjoyed working with APEX_JSON — it’s my new “go-to” for PL/SQL based JSON generation. APEX_JSON has a very light footprint (it’s just a single package) and it takes a minimalistic approach. Rather than compose objects as one would do with PL/JSON, you simply use the package to write JSON to a buffer.

This approach yields some performance benefits as well. In a basic test where I generated the JSON for every department in the HR schema 100 times in a loop, the APEX_JSON-based solution finished in around 3.5 seconds, whereas the PL/JSON-based solution took around 17 seconds. That makes APEX_JSON roughly 4.9 times faster than PL/JSON when it comes to generating JSON and converting it to a CLOB.

Unfortunately, APEX_JSON is only included with APEX 5.0+. Upgrading your database’s APEX instance seems a little extreme if all you want to do is work with JSON (though it is free and doesn’t take too long), but if you already have APEX 5.0, then it’s a very nice tool to be able to leverage.

Original Link

What Is Elasticsearch and How Can It Be Useful?

Products that involve e-commerce and search engines with huge databases face issues such as product information retrieval taking too long. This leads to a poor user experience and, in turn, drives away potential customers.

Lag in search is attributed to the relational database used for the product design, where the data is scattered among multiple tables, and retrieving meaningful user information requires fetching data from all of them. A relational database works comparatively slowly when it comes to huge data volumes and fetching search results through database queries. Businesses nowadays are looking for alternatives where data is stored in a way that promotes quick retrieval. This can be achieved by adopting NoSQL rather than an RDBMS for storing data. Elasticsearch (ES) is one such NoSQL distributed database. Elasticsearch relies on flexible data models to build and update visitor profiles to meet the demanding workloads and low latency required for real-time engagement.

Let’s look at what is so significant about Elasticsearch. ES is a document-oriented database designed to store, retrieve, and manage document-oriented or semi-structured data. When you use Elasticsearch, you store data as JSON documents and then query them for retrieval. It is schema-less, using sensible defaults to index the data unless you provide a mapping to suit your needs. Elasticsearch uses the Lucene StandardAnalyzer for indexing, with automatic type guessing and high precision.

Every feature of Elasticsearch is exposed as a REST API:

  1. Index API: Used to add or update a document in the index.

  2. Get API: Used to retrieve the document.

  3. Search API: Used to submit your query and get a result.

  4. Put Mapping API: Used to override default choices and define the mapping.
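As a minimal illustration of the Index and Get APIs above, here is a sketch using the JDK’s built-in HTTP client; the local URL, the customers index name, and the document ID are assumptions for the example, not defaults prescribed by Elasticsearch.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EsIndexSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Assumed local node, index name, and document ID; adjust to your cluster.
        String url = "http://localhost:9200/customers/_doc/1";
        String json = "{\"name\":\"Jennifer Whalen\",\"department\":\"Administration\"}";

        // Index API: store the JSON document; Elasticsearch derives the mapping from it.
        HttpRequest put = HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(json))
                .build();
        System.out.println(client.send(put, HttpResponse.BodyHandlers.ofString()).body());

        // Get API: retrieve the same document by ID.
        HttpRequest get = HttpRequest.newBuilder(URI.create(url)).GET().build();
        System.out.println(client.send(get, HttpResponse.BodyHandlers.ofString()).body());
    }
}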

Elasticsearch has its own query domain-specific language in which you specify the query in JSON format. You can also nest other queries based on your needs. Real-world projects require search on different fields by applying some conditions, different weights, recent documents, values of some predefined fields, and so on. All such complexity can be expressed through a single query. The query DSL is powerful and is designed to handle real-world query complexity through a single query. Elasticsearch APIs are directly related to Lucene and use the same name as Lucene operations. Query DSL also uses the Lucene TermQuery to execute it.
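To make the query DSL concrete, here is a matching search sketch against the same assumed customers index from the previous example; the match query on a name field is just one of the many query types the DSL offers.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EsSearchSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Query DSL body: a simple match query on the assumed "name" field.
        String query = "{ \"query\": { \"match\": { \"name\": \"Jennifer\" } } }";

        HttpRequest search = HttpRequest.newBuilder(
                        URI.create("http://localhost:9200/customers/_search"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(query))
                .build();

        // The response is itself JSON, with matching documents under hits.hits.
        HttpResponse<String> response = client.send(search, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}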

This figure shows how an Elasticsearch query works: Indexing and Searching in Elasticsearch.

The Basic Concepts of Elasticsearch

Let’s take a look at the basic concepts of Elasticsearch: clusters, near real-time search, indexes, nodes, shards, mapping types, and more.

Cluster

A cluster is a collection of one or more servers that together hold the entire data set and provide federated indexing and search capabilities across all of them. In relational database terms, a node corresponds to a DB instance. There can be N nodes with the same cluster name.

Near-Real-Time (NRT)

Elasticsearch is a near-real-time search platform. There is a slight delay from the time you index a document until the time it becomes searchable.

Index

An index is a collection of documents that have similar characteristics. For example, we can have one index for customer data and another for product information. An index is identified by a unique name, which is used to refer to it when performing indexing, search, update, and delete operations. In a single cluster, we can define as many indexes as we want. An index corresponds to a database or schema in an RDBMS (relational database management system): consider it a set of tables with some logical grouping. In Elasticsearch terms: index = database; type = table; document = row.

Node

A node is a single server that holds some data and participates in the cluster’s indexing and querying. A node can be configured to join a specific cluster by its cluster name. A single cluster can have as many nodes as we want. A node is simply one Elasticsearch instance. Consider it analogous to a running instance of MySQL: several MySQL instances can run on one machine on different ports, whereas with Elasticsearch, generally, one instance runs per machine. Elasticsearch uses distributed computing, so having separate machines helps, as there are more hardware resources available.

Shards

A shard is a subset of documents of an index. An index can be divided into many shards.

Mapping Type

Mapping type = database table in an RDBMS.

Elasticsearch uses document definitions that act as tables. If you PUT (“index”) a document in Elasticsearch, you will notice that it automatically tries to determine the property types. This is like inserting a JSON blob in MySQL, and then MySQL determining the number of columns and column types as it creates the database table.

Do you want to know more about what Elasticsearch is and when to use it? Some of the use cases of Elasticsearch can be found here. Elasticsearch users have delightfully diverse use cases, ranging from appending tiny log-line documents to indexing web-scale collections of large documents and maximizing indexing throughput.

Sometimes, we have more than one way to index or query documents. And with the help of Elasticsearch, we can do it better. Elasticsearch is not new, though it is evolving rapidly. Still, the core product is consistent and can help achieve faster performance with search results for your search engine.

Original Link

Relational to JSON in Oracle Database

More and more often these days, front-end developers want their data in JSON format. And why not? JSON is a simple data-interchange format that’s lightweight and easy to use. Plus, many languages now provide a means of parsing and converting JSON data into native object types. However, not all data is best persisted in JSON format. For many applications, the relational model will be the best way to store data. But can’t we have the best of both worlds? Of course!

There’s no shortage of options when it comes to generating JSON from relational data in Oracle Database. In this series, I’ll introduce you to several options, from lower-level PL/SQL-based solutions to higher-level solutions that do more than just generate and parse JSON. Here’s a list of the options I’ll be covering:

  • Lower-level:
  • Higher-level:

Which solution is right for you? That depends on your specific use case. Are you simply converting data and using it to populate another table? If so, one of the PL/SQL-based options would probably be your best choice. Are you creating an API for others to consume? If so, you’ll probably want to utilize one of the higher level options and only use the PL/SQL options when needed. Are you trying to stream JSON via AJAX in APEX? Then the APEX_JSON package is probably the way to go.

You probably see where I’m going with this. It’s not a one-size-fits-all kind of thing. I’ll provide an overview of how these options can be used to accomplish a goal (defined next). Hopefully, that will help you decide which option would work best for you for a given project.

The Goal

You’re probably familiar with, or at least aware of, the HR schema in Oracle Database: one of several sample schemas that are included for learning purposes. Well, we’re going to convert it to JSON! This is what the HR schema looks like:

You could convert the HR schema into JSON several different ways — it just depends on where you start. For example, if we started with the Employees table, we could traverse down like this:

On the other hand, we could start with the Departments table and traverse down like this:

That’s the path we’ll be traversing, from the Departments table on down. Given a department id, our goal will be to create the JSON representation of that department. We’ll follow some basic rules to keep the JSON from growing unruly for this demo:

  • No more than four or five attributes per object. This one is pretty self-explanatory. We don’t need to include every column from every table to get the point.
  • No more than three levels deep. Following one of the attribute chains in the image above, you’ll see “departments > locations > countries > regions”, which is four levels deep. Again, we don’t need to traverse down as far as we can to show how traversal works; three levels should do nicely.

In the end, we’ll be creating JSON that looks like this:

{ "id": 10, "name": "Administration", "location": { "id": 1700, "streetAddress": "2004 Charade Rd", "postalCode": "98199", "country": { "id": "US", "name": "United States of America", "regionId": 2 } }, "manager": { "id": 200, "name": "Jennifer Whalen", "salary": 4400, "job": { "id": "AD_ASST", "title": "Administration Assistant", "minSalary": 3000, "maxSalary": 6000 } }, "employees": [ { "id": 200, "name": "Jennifer Whalen", "isSenior": true, "commissionPct": null, "jobHistory": [ { "id": "AD_ASST", "departmentId": 90, "startDate": "17-SEP-1995", "endDate": "17-JUN-2001" }, { "id": "AC_ACCOUNT", "departmentId": 90, "startDate": "01-JUL-2002", "endDate": "31-DEC-2006" } ] } ]
}

There are a couple of things I want to note about this JSON. First of all, it includes all of the possible types of values you’ll find in JSON: objects, arrays, strings, numbers, booleans, and null. To get this output, I had to do a little finagling. For null values, I included the commission_pct column of the Employees table, as not all employees get a commission. Additionally, I made it a rule that each solution should be written in such a way that if a department doesn’t have a manager or employees, then those properties should be displayed with a null value.

Boolean (true/false) values are not valid data types in Oracle’s SQL engine but I wanted to include them as they are common in JSON. Oracle developers typically use another datatype to represent Boolean values, i.e. 0/1, ‘T’/’F’, or ‘Y’/’N’. But I couldn’t find any of these types of flags in the HR schema so I decided to use another business rule instead: If an employee was hired before January 1, 2005, then they should have an isSenior attribute set to true; otherwise, it should be false.

I’ll do a separate blog post on dates in JSON, as they can be tricky. The reason is that dates are not valid data types in JSON. So as with Booleans in Oracle’s SQL engine, developers must make use of other data types (number and string) to represent dates. Issues arise around selecting a date format and handling timezone conversions. To keep things simple in this series, I’ll use a non-standard, string-based date format (DD-MON-YYYY) that is meant for clients to display as a string. Later, when I do the post on dates I’ll include a link here.
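For what it’s worth, a client consuming this JSON could still turn the DD-MON-YYYY strings back into real dates; here is a minimal Java sketch, where the sample value is taken from the output shown earlier and case-insensitive parsing is the only subtlety:

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.util.Locale;

public class JsonDateSketch {
    public static void main(String[] args) {
        // Case-insensitive parser for the DD-MON-YYYY strings (e.g. "17-SEP-1995")
        // produced by TO_CHAR(..., 'DD-MON-YYYY') on the database side.
        DateTimeFormatter format = new DateTimeFormatterBuilder()
                .parseCaseInsensitive()
                .appendPattern("dd-MMM-yyyy")
                .toFormatter(Locale.ENGLISH);

        LocalDate startDate = LocalDate.parse("17-SEP-1995", format);
        System.out.println(startDate); // 1995-09-17
    }
}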

Okay, let’s generate some JSON! Here are the links from above for convenience:

Original Link

Object-Relational Mapping Pitfalls

Object-relational mapping (ORM) has become an indispensable tool to work with relational databases in Java applications. This topic, along with JPA and Hibernate, has been the subject of innumerable articles. Therefore, the present post will not enumerate again the pros and cons of using these tools but describe a real situation involving the misuse of ORM.

How Good of an Abstraction Is ORM?

To some extent, ORM has proved to be a successful tool for hiding the complexities of relational databases. Yet an abstraction is only as good as its ability to keep implementation details hidden. When those details leak through the abstraction, there are likely to be problems (see the law of leaky abstractions for more details).

For instance, everybody familiar with JPA (Java Persistence API) knows about all those annotations scattered across the classes to indicate how the objects are to be represented in the underlying database, for example, @ManyToOne, @OneToMany, @Id, and @JoinColumn.

Through those annotations, database concepts like relationships between entities, primary keys, foreign keys, join queries, etc. leak into the realm of object-oriented programming. Even worse, using an ORM efficiently requires some knowledge about how it works. For instance, when running a query you do not want to load the content of the entire database into memory, right? Yet that is what may happen if you do not configure your ORM properly to enable lazy loading (see this article for further explanation: JPA Lazy Loading).
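As a rough sketch of how those annotations look in practice, and of where lazy loading gets configured, here is a hypothetical pair of entities; the class, table, and column names are invented for illustration and are not from the project described below.

import javax.persistence.Entity;
import javax.persistence.FetchType;
import javax.persistence.Id;
import javax.persistence.JoinColumn;
import javax.persistence.ManyToOne;

@Entity
public class Employee {

    @Id
    private Long id;

    private String firstName;
    private String lastName;

    // Relational concepts (foreign key, join column) surfacing in the object model.
    // FetchType.LAZY defers loading the department until it is actually accessed,
    // so a query for employees does not drag a large object graph into memory.
    @ManyToOne(fetch = FetchType.LAZY)
    @JoinColumn(name = "department_id")
    private Department department;
}

@Entity
class Department {

    @Id
    private Long id;

    private String name;
}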

To compound the problem, let us introduce a new actor: Querydsl. As the name implies, Querydsl provides a domain-specific language to create queries based on Java objects. It can make you forget about the existence of the database, especially when compared to plain old JDBC. The snippet below shows the same lookup written first with Querydsl and then with plain JDBC:

List<Person> persons = queryFactory.selectFrom(person)
    .where(
        person.firstName.eq("John"),
        person.lastName.eq("Doe"))
    .fetch();

Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery(
    "SELECT * FROM Person WHERE firstName = 'John' AND lastName = 'Doe'");

How to Blow Up a Database With ORM

Now that all the actors are properly introduced, I can proceed to describe the issue I had to deal with and that motivated this post.

A MySQL database on production was getting overloaded every day at peak time and eventually would crash. There was some speculation about DoS attacks and/or the existence of a connection leak.

Anyway, I was asked to fix the problem. After examining the most common queries run on Production, I found the real culprit, namely, queries executing full table scans.

Here is a simplified version of the query:

select mytable.myid,
       mytable.email,
       mytable.first_name,
       mytable.last_name,
       mytable.title
  from mytable
 where lower(mytable.myid) = 'qqbwlz'

And this is the execution plan:

{ "query_block": { "select_id": 1, "table": { "table_name": "mytable", "access_type": "ALL", "rows": 1294267, "filtered": 100, "attached_condition": "(lcase(`mydatabase`.`mytable`.`myid`) = 'qqbwlz')" } }
}

As can be seen, access_type is "ALL" and the number of scanned rows is 1,294,267 (all the rows in the table). The big offender is this condition:

lower(mytable.myid)='qqbwlz'

Even though there is an index defined on the column myid, applying a function to that column in the filter prevents the query from making use of the index.
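As an aside, had changing the application code not been an option, a database-side alternative (not the fix taken in this article) would be to index the lowercased value itself, for example with a MySQL 8.0+ functional index; a minimal JDBC sketch, with an invented index name and placeholder connection details:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class FunctionalIndexSketch {
    public static void main(String[] args) throws Exception {
        // Connection details are placeholders for illustration only.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/mydatabase", "user", "password");
             Statement stmt = conn.createStatement()) {

            // MySQL 8.0.13+ functional index: lets a filter on LOWER(myid) use an index.
            stmt.execute("CREATE INDEX idx_myid_lower ON mytable ((LOWER(myid)))");
        }
    }
}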

After manually removing the “lower” function and executing the query again, this was the new execution plan:

{ "query_block": { "select_id": 1, "table": { "table_name": "mytable", "access_type": "const", "possible_keys": [ "PRIMARY" ], "key": "PRIMARY", "used_key_parts": [ "myid" ], "key_length": "32", "ref": [ "const" ], "rows": 1, "filtered": 100 } }
}

Now, access_type is const, the number of scanned rows is 1, and the query is making use of the index defined on the column myid.

What is more, the query execution time went from a few seconds to a few milliseconds.
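If you want to run this kind of check from application code rather than a database console, MySQL’s EXPLAIN FORMAT=JSON can be executed over plain JDBC; a rough sketch, assuming an already open connection:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class ExplainSketch {

    // Prints the JSON execution plan for a query; conn is an open MySQL connection.
    static void printPlan(Connection conn, String sql) throws Exception {
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("EXPLAIN FORMAT=JSON " + sql)) {
            if (rs.next()) {
                System.out.println(rs.getString(1)); // look for access_type and rows
            }
        }
    }
}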

After confirming the problem and the solution, it was time to find the way that query was being generated by the application. Below is the snippet that builds the WHERE clause taken from GitHub (the - line is the original code and the + one is the code after the fix).

public final class Predicates {

    public static BooleanExpression hasMyId(String myId) {
-       return new Entity("myTable").myId.equalsIgnoreCase(myId);
+       return new Entity("myTable").myId.eq(myId.toUpperCase());
    }
}

From the above snippet, it is clear what happened: a developer, misguided by the apparent simplicity of the code, decided to play it safe and make a case-insensitive comparison on myid. Yet the developer failed to notice that the code would be translated by Querydsl into a SQL query with the lcase function applied to myid.

It is interesting to think about what safeguards could be put in place to avoid this pitfall: no unit test or integration test can detect the problem. Only load tests can help in this situation, but for that, it is necessary to run enough iterations to insert millions of rows in the database. And at the end of the day, someone still has to access the database and check query performance.

Conclusion

Although working with abstractions is convenient, ultimately it is necessary to know what is going on behind the scenes.

In cases like the one described in this post, anyone could have made that mistake: a new developer unfamiliar with the project and completely unaware that such an innocent change could wreak havoc on the database, especially when no alert in the form of failing tests would be triggered. The only real safeguard would be to run load tests after every change, and that is time-consuming and expensive (you need to set up servers on your cloud platform of choice and generate enough traffic to hammer them). All this leads to load tests not being run as often as necessary.

Moreover, the irony is that thanks to ORM tools, there is less and less need for developers to learn about the underlying database, and as a result, programmers lack the knowledge to deal with this kind of situation.

Something similar might occur with application servers as microservices become more ubiquitous. Frameworks for developing microservices, like Spring Boot, hide the existence of application servers by embedding them into the application. I wonder if, in the future, developers will also forget how application servers work!

Original Link