Multiple Databases With Shared Entity Classes in Spring Boot and Java

Hello, everyone! It has been a few months since my last post; I have been busy traveling and relocating. In this post, I want to illustrate how a Spring Boot application can have multiple data sources with shared entity classes. The need for this arose in my current project, where an in-memory database was needed for high performance and a persistent database for storage.

In this blog post, I will use H2 for the in-memory database and Postgres for the persistent storage. I will set up the application and show how the entities can be passed from one data source to another.

Original Link

PostgreSQL Trends: Metrics and Time-Consuming Tasks

PostgreSQL, the fourth most popular database and DBMS of the Year in 2017, has exploded in popularity amongst the development and database communities across the world. Stealing market share from leaders Oracle, MySQL, and Microsoft SQL Server, PostgreSQL hosting is also highly leveraged by new businesses in exciting spaces like IoT, e-commerce, SaaS, analytics, and more.

So What’s Trending in PostgreSQL Management?

We attended PostgresOpen in San Francisco last month to uncover the latest trends from the experts themselves.

Original Link

Understanding Postgres Connection Pooling With PgBouncer

PgBouncer Internals

PgBouncer is a connection pooling service for Postgres. It has all kinds of internal limits and limited resources. Here's how it looks from the client's (say, a web application's) point of view:

  1. Client connects to PgBouncer.
  2. Client makes an SQL request/query/transaction.
  3. Gets a response.
  4. Repeat steps 2–3 as many times as needed.
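The loop above can be sketched as a toy state machine (an illustration only, not PgBouncer's actual code; the cl_* state names loosely follow the terminology PgBouncer reports in its SHOW POOLS / SHOW CLIENTS admin output):

```python
class Client:
    state = "disconnected"

class ServerPool:
    """Toy model of PgBouncer's fixed pool of server connections."""
    def __init__(self, size):
        self.free = size

    def connect(self, client):           # step 1: client connects to PgBouncer
        client.state = "cl_login"

    def query(self, client):             # steps 2-3: send a query, get a response
        if self.free > 0:
            self.free -= 1
            client.state = "cl_active"   # client holds a real server connection
            return "response"
        client.state = "cl_waiting"      # pool exhausted: the client queues
        return None

    def done(self, client):              # transaction/session finished
        self.free += 1
        client.state = "cl_idle"

pool = ServerPool(size=1)
a, b = Client(), Client()
pool.connect(a)
pool.connect(b)
print(pool.query(a))   # 'response': a grabs the only server connection
print(pool.query(b))   # None: b must wait until a releases the connection
pool.done(a)
```

The point of the sketch is the limited-resources part: once the fixed pool of server connections is exhausted, additional clients queue rather than fail.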

Here's the client connection state diagram:

Original Link

Introducing Landlord: Per Tenant Stats in Postgres in Citus

Postgres keeps getting better and better. In recent years, the Postgres community has added JSONB support, improved performance, and added so many usability enhancements. The result: you can work even more powerfully with your database. Over the past 8 years, my two favorite enhancements have been JSONB and pg_stat_statements. pg_stat_statements is a built-in extension that allows you to get high-level insights into the queries being run as well as their performance — without having to be an expert and without needing a PhD in databases.

Introducing the New Landlord Feature in Citus 7.5

With Citus 7.5, we’ve gone one step beyond the awesomeness of pg_stat_statements and Postgres with the new Landlord feature in Citus, which gives you per-tenant stats.

Original Link

Postgres Full-Text Search

Search is one of the most important features in an application, and its importance has driven the popularity of various search engines like Elasticsearch, Solr, etc. However, all of them require a separate server to be set up and data sync processes to be implemented.

Would it not be better to have a solution that minimizes overhead and allows us to focus on the main objective, which is search? Yes, and that's where Postgres Full-Text Search comes into the picture. It searches the data right where it is stored, in your tables. There is no need to set up a separate server.

Original Link

TopN for Your Postgres Database

People seem to love lists of the most popular things. I think this is true of many of us, including developers. Did you get all excited like I did and listen right away to every song on the list when Spotify released Your Top Songs 2017? When the Academy Awards were announced, did you check in on the candidates and winners? Did you pay attention to the medalists and top scoring hockey teams in the Winter Olympics?

Sometimes, this problem of finding the top of a list is referred to as the Top-K problem, or the Top "N" problem. Whether it's the top grossing sales reps or the most visited pages on your website, and whether you call it the Top K or the TopN, for most of us, there is usually something we want to know the top "N" of.

Finding the Top “N” is Not Easy

To find the top occurring item, you generally need to count through all the records. Counting the clicks in your web app, the number of times you’ve listened to a song, or the number of downloads of your project. It is all about counting. Counting, sorting, and limiting the list in Postgres is straightforward, and this works great on smaller sets of data. What if there are thousands of events? Machines these days are pretty fast, so this isn’t much of a problem. Millions is even acceptable. Billions? That may take a bit longer…

However, getting the counts of different items, sorting them and taking the top “N” of them out of your database — that can start to become much more challenging at a larger scale.
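Outside the database, the same count-sort-limit idea takes only a few lines (a sketch with made-up event data; in SQL this is the familiar GROUP BY ... ORDER BY count DESC ... LIMIT n):

```python
from collections import Counter

# Naive exact Top-N: count every event, sort by count, take the first n.
# Fine for thousands or even millions of rows; at billions, the full
# count-and-sort pass over all records is what gets expensive.
events = ["repo/a", "repo/b", "repo/a", "repo/c", "repo/a", "repo/b"]
top2 = Counter(events).most_common(2)
print(top2)   # [('repo/a', 3), ('repo/b', 2)]
```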

Even further, what if you want to materialize your top N results for smaller sets on a regular basis and run queries that combine them for further analysis? That's when the real problem starts. Calculating the Top N can be a challenge. This is why my team at Citus Data (where we build the Citus extension that scales out Postgres horizontally) is happy to announce the release of the open source TopN extension for PostgreSQL.

The inspiration for TopN came from a Citus Data customer who needed TopN-like functionality in concert with the Citus extension that scales out their Postgres database. When designing TopN, we decided to implement it as a Postgres extension, and we decided to write it in C. TopN outputs a JSONB object, which you can flexibly use for different use cases. Aggregation functions that take JSONB inputs and union them together are also included.

TopN can be used to calculate the most frequently occurring values in a column and belongs to the class of probabilistic algorithms known as sketch algorithms. Let's look further at how the TopN extension to Postgres actually works.

Map the Elements With the Respective Counts

TopN initializes a statically sized hash-map structure and aggregates the data into the map. The size can be set by a GUC called topn.number_of_counters; this variable defines the number of distinct elements we are interested in for one set. For the sake of accuracy, we allow the hash map to grow as large as 3 × number_of_counters during one aggregation. Whenever the distinct element count exceeds this number, the least frequent half of the entries is flushed and the aggregation continues.
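That flush behavior can be sketched as follows (a simplified illustration of the idea, not the extension's actual C implementation; number_of_counters stands in for the topn.number_of_counters GUC):

```python
from collections import Counter

def topn_add(counters, item, number_of_counters):
    """Add one item to the bounded counter map.

    The map may grow to 3x the configured size; on overflow, the least
    frequent half is flushed and counting continues, which is what makes
    the result an approximation rather than an exact count.
    """
    counters[item] += 1
    if len(counters) > 3 * number_of_counters:
        keep = len(counters) // 2
        counters = Counter(dict(counters.most_common(keep)))
    return counters

counters = Counter()
for item in ["a", "b", "a", "c", "a", "b", "d"]:
    counters = topn_add(counters, item, number_of_counters=1)

print(counters.most_common(1))   # [('a', 3)]
```

Because items flushed early can never regain their lost counts, heavy hitters stay accurate while the long tail is only approximated; that trade is what keeps memory bounded.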

TopN Operates Across Text Values

TopN takes its input as the text data type. If you want to make TopN work on a non-text column, you can cast your existing data type to text. If you do cast your values to text, the resulting TopN list will contain their text representations as well.

Results in JSON

Once data ingestion is done, the "top N" elements stored in the hash map are returned to you as a JSONB object containing the elements and their frequencies. Some of you may already have been aggregating and storing counts in your database; you can combine those with TopN-generated JSONBs, since the aggregate union functions let you merge results for further analysis. You can also create GIN indexes and scan the counted objects in real time.

Materializing TopNs and Continuous Analysis

To reproduce a use case similar to one we dealt with for a customer of our Citus distributed database, we picked the github_events dataset for the first week of 2018. You can download it and run the same tests yourself.


After ingesting the data and eliminating some rows where the date is null, the data size we have is:

# select pg_size_pretty(pg_total_relation_size('github_events'));
 pg_size_pretty
----------------
 7906 MB
(1 row)

The dataset includes events for 7 days. Let's assume we provide a dashboard in which users can analyze repository activity on a daily basis. We can aggregate and store the TopN elements for each day in the following way:

# create table aggregated_topns (day date, topn jsonb);
Time: 9.593 ms

# insert into aggregated_topns select date_trunc('day', created_at), topn_add_agg((repo::json)->> 'name') as topn from github_events group by 1;
Time: 34904.259 ms (00:34.904)

Here, we are calculating the top 1000 repositories on each day and inserting it into our aggregation table.

When a user is interested in the top 10 elements in the 2nd and 3rd days of the new year, we can simply union the two TopN JSONBs.

postgres=# select (topn(topn_union_agg(topn), 10)).* from aggregated_topns where day IN ('2018-01-02', '2018-01-03');
                      item                      | frequency
------------------------------------------------+-----------
 dipper-github-fra-sin-syd-nrt/test-ruby-sample |     12489
 wangshub/wechat_jump_game                      |      6402
 shenzhouzd/update                              |      6170
 SCons/scons                                    |      4593
 TheDimPause/                 |      3964
 nicopeters/sigrhtest                           |      3740
 curtclifton/                    |      3345
 CreatorB/hackerdroid                           |      3206
 dipper-github-icn-bom-cdg/test-ruby-sample     |      3126
 dotclear/dotclear                              |      2992
(10 rows)

Time: 7.750 ms

If the user is interested in top 2 for each of the first three days, it is also pretty straightforward:

postgres=# select day, (topn(topn, 2)).* from aggregated_topns where day IN ('2018-01-01', '2018-01-02', '2018-01-03');
    day     |                      item                      | frequency
------------+------------------------------------------------+-----------
 2018-01-01 | dipper-github-fra-sin-syd-nrt/test-ruby-sample |      9179
 2018-01-01 | shenzhouzd/update                              |      4543
 2018-01-02 | dipper-github-fra-sin-syd-nrt/test-ruby-sample |      7151
 2018-01-02 | SCons/scons                                    |      4593
 2018-01-03 | dipper-github-fra-sin-syd-nrt/test-ruby-sample |      5338
 2018-01-03 | CreatorB/hackerdroid                           |      3206
(6 rows)

Time: 4.037 ms

Give the New TopN Extension to PostgreSQL a Try Today

Sketch algorithms like TopN and HyperLogLog provide powerful new ways to compute valuable information more easily. Here at Citus Data, we are excited to have created the TopN extension to Postgres and to make it available to the PostgreSQL community of users and developers.

If you are thinking about computing or saving the top <anything> today, give TopN a try.

Original Link

Fun With SQL: Functions in Postgres

In our previous Fun with SQL post on the Citus Data blog, we covered window functions. Window functions are a special class of function that let you grab values across rows and then perform some logic. By jumping ahead to window functions, we missed many of the other handy functions that exist within Postgres natively. There are, in fact, several hundred built-in functions, and when needed, you can also create your own user-defined functions (UDFs) for something custom. Today, we're going to walk through just a small sampling of SQL functions that can be extremely handy in PostgreSQL.


First, arrays are a first-class datatype within Postgres. You can have an array of text or an array of numbers. Personally, I love using arrays when dealing with category tags. You can also index arrays, which can make querying extremely fast, but even if you’re not putting arrays directly into your database, you may want to build up arrays within your query.

A good example might be when you have a team feature within your application and you want to group all members that belong to that team into a single row. To do this, we can use the array_agg function, which will aggregate some columns together into a single row:

FROM teams, users
WHERE users.team_id =
GROUP BY 1, 2;

 id | name  |            array_agg
----+-------+----------------------------------
  2 | ACME  | {,,}
  1 | Citus | {,}
(2 rows)

The above by itself is pretty handy, but we can go even further. If we want to send this directly to some email client, we can parse the array into a comma-separated list by wrapping our array_agg call in array_to_string, which converts the array to a string:

SELECT,, array_to_string(array_agg(, ', ')
FROM teams, users
WHERE users.team_id =
GROUP BY 1, 2;

Now our output format is much more usable:

 id | name  | array_to_string
----+-------+------------------
  2 | ACME  | ,,
  1 | Citus | ,
(2 rows)

There is a whole slew of things you can do with arrays, but let’s shift gears to see what other categories exist.

Dealing With Time

Timestamps can be an extremely frustrating thing to work with. If you're thinking about building a calendaring app, well, just don't. Inevitably, you will have some dates and times within your database that you'll need to work with. Lucky for you, Postgres has a solid set of functions to help make your life a little easier.

One of the most common functions I use when working with time is now(). It simply gives me the current timestamp of the server. This is handy in your application when setting the last_updated field, or when first setting created_at. It can also be useful when you're comparing things, such as looking at the past week. By starting with now() and then doing some date interval math, you can easily get all users that created their accounts in the last week:

SELECT email
FROM users
WHERE created_at > now() - '7 days'::interval;

 email
-------
(3 rows)

A less commonly used one is the literal allballs. Allballs is UTC 00:00:00; it's often used in communications and military settings that follow a 24-hour clock. At Citus, we use it because UTC 00:00:00 aligns nicely with 4 or 5 pm depending on daylight saving time, which is a perfect time for happy hour on a Friday.

Shifting back to our earlier example of how many people signed up in the past week: a list of people is great, but when dealing with dates and timestamps, it's more common to aggregate them into some report. There are two ways you can approach grouping things by date. One is to extract the portions you want, and the other is to truncate the date, losing some of its specificity. Personally, I usually opt for date_trunc, as it does the job nicely. You can truncate a date in a number of ways, from year to month to day. Using date_trunc, we can see that it's simple to aggregate how many users have signed up each week:

SELECT date_trunc('week', created_at), count(*)
FROM users
GROUP BY 1
ORDER BY 1;

       date_trunc       | count
------------------------+-------
 2018-06-11 00:00:00-07 |     3
 2018-06-04 00:00:00-07 |     2
(2 rows)


Postgres has had JSON support for over 5 years now! 5 years. A number of databases have not even existed for 5 years yet. The JSON support in Postgres comes with all the rich transactional guarantees that Postgres brings, powerful indexing, and the ability to join against other data. JSON continues to get even better with each release with more built-in functionality for querying your JSON data.

The most basic JSON operator allows you to extract a value or object within the JSON:

SELECT * FROM users;

-[ RECORD 1 ]-+-------------------------------------------------------
id            | 1
email         |
team_id       | 1
created_at    | 2018-06-16 09:08:40.763473-07
location_data | {"city": "Huntsville", "state": "AL", "country": "US"}

-- Get just the state from the JSON
SELECT email, location_data->'state' FROM users;

 email | ?column?
-------+----------
       | #
       | #
       | "CA"
       | #
       | "AL"
(5 rows)

To retrieve the above as a string, swap the -> operator for ->>.

If you're using JSONB, you may have some unstructured data; another way of thinking about it is that a given key may have a value or may be missing entirely. To filter down to only the rows where state exists, we can use the ? operator. If we wanted to filter for only people from Alabama, we could use the @> operator:

SELECT email, location_data->'state' FROM users
WHERE location_data ? 'state';

 email | ?column?
-------+----------
       | "AL"
       | "CA"
(2 rows)

SELECT email, location_data->'state' FROM users
WHERE location_data @> '{"state":"AL"}'::jsonb;

 email | ?column?
-------+----------
       | "AL"
(1 row)

If You Can Dream It, There Is a SQL Function for It in Postgres

Postgres has existed for more than 20 years. Its foundation is solid, but that doesn’t mean Postgres is complete. With each new release of Postgres, we see new improvements — including all the new built-in functions to make working with your database easier. The next time you need to perform some operation on data, well, before you jump to writing logic in your application or a custom function, take a look to see if Postgres already has your answer.

Original Link

PostgreSQL Kubernetes: How to Run HA Postgres on Kubernetes

Thanks to advances in the container ecosystem recently (Kubernetes stateful sets, PVCs, etc), getting started running PostgreSQL in a container is easy. However, running PostgreSQL in production still requires a lot of forethought and planning. Here are some of the things you need to think about when running PostgreSQL on Kubernetes in production:

  • How do I automatically deploy a new PostgreSQL instance in the cloud or on-prem data center?
  • How do I failover a PostgreSQL pod to another availability zone or rack if my PostgreSQL instance goes down?
  • How do I resize my PostgreSQL volume if I am running out of space?
  • How do I snapshot and backup PostgreSQL for disaster recovery?
  • How do I test upgrades?
  • Can I take my PostgreSQL deployment and run it in any environment if needed? Whether that is AWS, GCE, Azure, VMWare, OpenStack, or bare metal?

This article will show you how you can run PostgreSQL in production on Kubernetes so you can easily answer these questions. After reading through the following steps, you will understand how to run an HA PostgreSQL cluster in production using Kubernetes.

The Essential Steps to Run HA PostgreSQL on Kubernetes

This post will walk you step-by-step through how to deploy and manage an HA PostgreSQL cluster on Kubernetes. Before getting into all that detail, let's summarize.

There are 5 basic steps to run HA PostgreSQL:

  1. Choose a multi-cloud container orchestration platform like Kubernetes
  2. Install a multi-cloud container storage solution like Portworx
  3. For a simple and efficient HA setup, run a single instance of PostgreSQL and set px replication to repl: "3"
  4. For scale-out, high-performance setups, use PostgreSQL replication to send read-only requests to secondaries, but keep px replication at repl: "3" or "2" for faster failover and a lower total number of instances at scale
  5. Optionally, set io_priority: "high" to schedule the PostgreSQL instance on a fast storage medium for better IO performance, and use a journal device with your volumes to speed up performance

Read on for more details about running an HA PostgreSQL cluster on Kubernetes. Follow this link to launch the PostgreSQL Kubernetes Katacoda tutorial.

Achieving HA with PostgreSQL

PostgreSQL can run in a single-node configuration or in a clustered configuration, using various alternative solutions for asynchronous or synchronous replication as of PostgreSQL 9.1. The preferred replication technique with PostgreSQL is the use of a Write-Ahead Log (WAL). By writing the log of actions before applying them to the database, the PostgreSQL primary's state can be replicated on any secondary by replaying the set of actions.
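The write-ahead idea can be sketched in a few lines (a conceptual illustration only; real WAL records describe low-level page changes, not key/value assignments):

```python
# Primary: append each change to the log *before* applying it to state.
wal = []

def apply_change(state, key, value):
    wal.append((key, value))     # write-ahead: log first...
    state[key] = value           # ...then mutate the database state

primary = {}
apply_change(primary, "user:1", "alice")
apply_change(primary, "user:2", "bob")
apply_change(primary, "user:1", "alice2")

# Secondary: replaying the same log, in order, reproduces the state.
secondary = {}
for key, value in wal:
    secondary[key] = value

print(secondary == primary)   # True
```

Because the log is written first, a secondary that has replayed the log up to any point holds a consistent prefix of the primary's state, which is exactly what streaming replication relies on.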

For deployments where you require replication for data protection but where a single database instance is capable of handling the read requests, a single Postgres pod with Portworx replicated volumes offers a simpler and more cost-effective solution to running HA PostgreSQL on Kubernetes.

With Portworx, each PostgreSQL primary and secondary can have its PVC synchronously replicated. This makes recovering database instances a near-zero-cost operation, which results in shorter recovery windows and higher total uptime. Our tests also show that this eliminates the performance degradation seen during the PostgreSQL recovery process when state has to be recovered from other database instances. With Portworx and Kubernetes, database instance recovery can take less than 30 seconds.

This is far less complex to manage and configure and requires ⅓ of the PostgreSQL Pods and therefore ⅓ of the CPU and Memory because Portworx is already running on your Kubernetes cluster and synchronously replicates data for all of your applications with great efficiency and scale.

Deploying PostgreSQL on Kubernetes

When deploying PostgreSQL on Kubernetes, the container image captures the database version and libraries while Kubernetes has the abstractions necessary to capture the database configurations, deployment topology, and storage configuration. There are different choices of container images for the database itself but also for running other PostgreSQL specific infrastructure and configuring them.

The Crunchy Data team has created an extensive set of containers as well as Kubernetes artifacts that we recommend taking a look at. They also created a Kubernetes Postgres Operator to help with the management of large numbers of PostgreSQL instances and clusters in a single Kubernetes environment.

Helm charts are useful to quickly get started deploying a wide variety of databases onto Kubernetes, including PostgreSQL. For our tests, we used the Patroni Helm chart to deploy a replicated 3-node cluster, and we used a PostgreSQL Helm chart to deploy a single-node PostgreSQL instance. The single-instance configuration looks something like this:

Figure 1: PostgreSQL Running on Portworx Replicated Volume

HA PostgreSQL config         | Postgres pods required                     | Volumes required
-----------------------------+--------------------------------------------+------------------
Without Portworx replication | 3                                          | 3
With Portworx replication    | 1 (1/3 the pods for the same reliability!) | 1


Table 1: Resource utilization with Portworx replication versus PostgreSQL replication

In figure 1, we define a Kubernetes StorageClass object that declaratively defines how we want to handle storage for PostgreSQL:

kind: StorageClass
  name: px-ha-sc
provisioner: kubernetes.io/portworx-volume
  repl: "3"
  io_profile: "db"
  io_priority: "high"

Along with the repl: "3", io_profile: "db", and io_priority: "high" settings, we can add Kubernetes snapshot schedules and a Kubernetes data encryption policy directly in this storage class definition. This declarative style of configuration is exactly what modern cloud-native infrastructure is all about. No more snowflakes; you can recreate whole environments from source code, including the automation of your common data management tasks.

And finally, we pass the defined storage class as a parameter to the PostgreSQL Helm chart:

helm install --name px-psql --set persistence.storageClass=px-ha-sc stable/postgresql

There are cases where running a single PostgreSQL instance is not going to cut it, and we will want to spread our reads across the secondary nodes in our PostgreSQL secondary pool. For these scenarios, you can leverage Kubernetes StatefulSets to handle all the replica configuration for you and still get all the agility benefits of Portworx by using the Portworx storage class in the same way as before. There is a Patroni incubator Helm chart available for this, and it can very easily be used with Portworx as follows:

helm repo add incubator
helm install --name px -f k8s-yaml/patroni-values.yaml incubator/patroni

The patroni-values.yaml file includes the name of the storage class as well as other deployment parameters like the number of instances in the cluster and address to the ETCD host Patroni uses to do master election. You can see the full file in the git repo for the simple REST API we create using Spring Boot and Spring JPA to test failover performance under load here:

You can choose to turn off Portworx replication in this case, but we recommend setting it to a replication factor of 2 so that when nodes go away and restart, the synchronization process is much faster. When one of the Postgres instances gets restarted, it doesn't have to rebuild the dataset from scratch, because it can start with a copy of the volume that is consistent up to the time of the crash. This helps you reduce the instance outage window, which in turn can help you reduce the total size of your cluster while keeping similar uptime guarantees. This in turn allows you to save on your total compute, memory, and network utilization.

Failover PostgreSQL Pod on Kubernetes

Now, let’s walk through a failover scenario. In the linked katacoda tutorial, we will simulate a node failure by cordoning the Kubernetes nodes where PostgreSQL is running and then deleting the PostgreSQL pod.

Once Kubernetes identifies that the pod needs to be rescheduled, it will work with Portworx's Kubernetes scheduler extender, STORK, to identify which node is best suited to host the restarted pod. In our small environment, either of the two remaining nodes will do because we have a copy of the data on all three nodes. In reality, you will likely have much larger clusters, and that's when STORK will benefit you by making sure the pod starts on a node where a copy of the data is stored locally. In the unlikely event that your pod cannot be started on one of those nodes, it will be able to start on any of the cluster nodes and access its data seamlessly through the Portworx storage fabric.

This failover should all happen within a very short time window, which is very similar to the Postgres replication configuration described above. This failover is depicted in the figure below:

Figure 2: PostgreSQL Failover

PostgreSQL Storage Operations

So, it seems that for reliability and high availability alone it would be worth running PostgreSQL on Kubernetes and Portworx, but there is a lot more that you can do. Many of the data management operations that are error-prone and time-consuming are now going to be fully automated, the same way, in any cloud environment. First, we'll show how volumes can be dynamically expanded without reconfiguring or restarting PostgreSQL, and then we will show how snapshots can be easily restored.

Resize Postgres Volume on Kubernetes

Data management tasks like these need to be predictable and automatable when it makes sense to do so. The PostgreSQL Kubernetes Katacoda tutorial embedded at the bottom of this post will show how a simple command run from your Kubernetes command line interface can expand the PostgreSQL volume with zero downtime. Since Portworx volumes are virtual and carved out of your aggregate storage pool on your cluster, we thinly provision the volumes so that the expansion doesn’t immediately require you to add more storage capacity to your cluster. For information about expanding storage capacity, view our docs.

Snapshot PostgreSQL on Kubernetes

Snapshots can be scheduled as part of your storage class definition by using the Portworx command line interface (pxctl) or taken on demand by using Stork. Stork uses the external-storage project from kubernetes-incubator to add support for snapshots. The Katacoda tutorial will show how to use Stork with this simple YAML file to create a snapshot on demand:

kind: VolumeSnapshot
  name: px-postgresql-snapshot
  namespace: default
  persistentVolumeClaimName: px-postgresql-pvc

To learn more about PostgreSQL snapshot schedules, please refer to our docs page. In the Katacoda tutorial, you will also learn how to start a new PostgreSQL instance from a snapshot for point-in-time recovery of your data.


As we’ve just seen, you can easily run an HA PostgreSQL container on Kubernetes using Portworx for replication, snapshots, backups, volume resizing, and even encryption. Depending on the size of the cluster, you can either forego PostgreSQL replication and use Portworx replication for HA, or use both PostgreSQL replication (for spreading out writes) and Portworx replication (for faster failover). For more information, explore our Katacoda tutorial for HA PostgreSQL on Kubernetes.

Original Link

Configuring Memory for Postgres

work_mem is perhaps the most confusing setting within Postgres. work_mem is a configuration parameter that determines how much memory can be used during certain operations. At its surface, the work_mem setting seems simple: after all, work_mem just specifies the amount of memory available to be used by internal sort operations and hash tables before writing data to disk. And yet, leaving work_mem unconfigured can bring on a host of issues. What is perhaps more troubling, though, is when you receive an out-of-memory error on your database and you jump in to tune work_mem, only for it to behave in an unintuitive manner.

Setting Your Default Memory

The work_mem value defaults to 4 MB in Postgres, and that's likely a bit low. This means that each Postgres activity (each join, some sorts, etc.) can consume 4 MB before it starts spilling to disk. When Postgres starts writing temp files to disk, things will obviously be much slower than in memory. You can find out if you're spilling to disk by searching for "temporary file" within your PostgreSQL logs when you have log_temp_files enabled. If you see "temporary file", it can be worth increasing your work_mem.

On Citus Cloud (our fully-managed database as a service that scales out Postgres horizontally), we automatically tune work_mem based on the overall memory available to the box. Our tuning is based on the years of experience of what we’ve seen work for a variety of production Postgres workloads, coupled with statistics to compute variations based on cluster sizing.

It’s tough to get the right value for work_mem perfect, but often, a sane default can be something like 64 MB if you’re looking for a one size fits all answer.

It’s Not Just About the Memory for Queries

Let's use an example to explore how to think about optimizing your work_mem setting.

Say you have a certain amount of memory, say 10 GB. If you have 100 running Postgres queries, and each of those queries has a 10 MB connection overhead, then 100*10 MB (1 GB) of memory is taken up by the 100 connections, which leaves you with 9GB of memory.

With 9 GB of memory remaining, say you give 90 MB to work_mem for the 100 running queries. But wait, it's not that simple. Why? Well, work_mem isn't set on a per-query basis; rather, it's set based on the number of sort/hash operations. But how many sorts, hashes, and joins happen per query? Now that is a complicated question, made more complicated if you have other processes that also consume memory, such as autovacuum.
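The budgeting above can be made concrete with a bit of arithmetic (the 10 GB, 100-connection, and 10 MB figures come from the example; the per-query operation and parallel-worker counts are made-up multipliers for illustration):

```python
# Reproducing the memory math from the example above.
total_mb = 10_000            # the 10 GB box
connections = 100
conn_overhead_mb = 10        # per-connection overhead

remaining_mb = total_mb - connections * conn_overhead_mb   # 9,000 MB left
per_query_mb = remaining_mb / connections
print(per_query_mb)          # 90.0 -- the naive "90 MB per query" figure

# But work_mem is granted per sort/hash operation, not per query.
# Assuming 3 such operations per query and 2 parallel workers each
# entitled to work_mem (both made-up multipliers), the safe value shrinks:
ops_per_query = 3
parallel_workers = 2
safe_work_mem_mb = per_query_mb / (ops_per_query * parallel_workers)
print(safe_work_mem_mb)      # 15.0
```

The takeaway: any honest work_mem budget has to divide by an estimate of concurrent sort/hash operations, not by the connection count alone.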

Let’s reserve a little for maintenance tasks and for vacuum and we’ll be okay as long as we limit our connections right? Not so fast my friend.

Postgres now has parallel queries. If you're using Citus for parallelism, you've had this for a while, but now you have it on single-node Postgres as well. What this means is that a single query can have multiple processes running and performing work. This can result in significant improvements in query speed, but each of those running processes can consume the specified amount of work_mem. With our 64 MB default and 100 connections, each connection could be running a parallel query across multiple cores, consuming far more memory than we anticipated.

More work_mem, More Problems

So, we can see that getting it perfect is a little more work than ideal. Let's go back a little and try this more simply. We can start work_mem small, at say 16 MB, and gradually increase it when we see "temporary file" in the logs. But why not give each query as much memory as it would like? If we were to say each process could consume up to 1 GB of memory, what's the harm? Well, the other extreme is that queries begin consuming too much memory: more than you have available on your box. Imagine 100 queries, each with 5 different sort operations and a few hash joins in them. It is in fact very possible to exhaust all the memory available to your database.

When you consume more memory than is available on your machine, you can start to see out of memory errors within your Postgres logs, or in worse cases, the OOM killer can start to randomly kill running processes to free up memory. An out of memory error in Postgres simply errors on the query you’re running, whereas the OOM killer in Linux begins killing running processes, which in some cases might even include Postgres itself.

When you see an out of memory error, you either want to increase the overall RAM on the machine by upgrading to a larger instance, or you want to decrease the amount of memory that work_mem uses. Yes, you read that right: when you’re out of memory, it’s often better to decrease work_mem than to increase it, since work_mem is the amount of memory each operation can consume, and too many operations are leveraging up to that much memory.

General Guidance for work_mem

While you can continually tune and tweak work_mem, a couple of broad guidelines for matching it to your workload can generally get you into a good spot:

If you have a number of short-running queries that run very frequently and perform simple lookups and joins, then maintaining a lower work_mem is ideal. In this case, you get diminishing returns by setting it significantly higher, because the extra memory simply goes unused. If your workload is relatively few active queries at a time that are doing very complex sorts and joins, then granting more memory to prevent things from spilling can give you great returns.
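For a mixed workload, you don’t have to pick one value globally: a low server-wide default can be combined with a per-transaction override for the heavy reporting queries. A sketch, with an illustrative table name:

```sql
BEGIN;
-- Give just this transaction's sorts/hashes more room;
-- SET LOCAL reverts automatically at COMMIT/ROLLBACK
SET LOCAL work_mem = '256MB';

SELECT customer_id, sum(amount)
FROM big_events          -- hypothetical reporting table
GROUP BY customer_id
ORDER BY sum(amount) DESC;
COMMIT;
```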

Happy Database Tuning

Postgres’ powerful feature set and flexibility mean you have a lot of knobs you can turn and levers you can pull in tuning it. Postgres is often used for embedded systems, time series data, OLTP, and OLAP as well. This flexibility can often mean an overwhelming set of options when tuning. On Citus Cloud, we’ve configured this to be suitable for most workloads we see. Think of it as one size fits most, and then when you need to, you’re able to customize. If you’re not running on Citus Cloud, consider leveraging pgtune to help you get to a good starting point.

Original Link

A Case for GraphQL in Enterprise

  • GraphQL supports dynamic queries and is type-safe. This reduces the number of APIs to be developed and allows enforcing compile-time checks on the data being requested by consumers.

  • It was designed to be able to seamlessly front multiple sources of data, reducing the number of complex, cross-functional API dev iterations.

  • It is backed by an active community that is constantly improving the tooling and simplifying GraphQL adoption.

APIs in Enterprise

Gartner positions APIs at the forefront of Digital Transformation or similar CIO initiatives. Sharing data-driven insights and building efficient workflows to deliver value to stakeholders largely entails building and managing APIs for applications.

GraphQL, a query language specification developed and open-sourced by Facebook, is rapidly revolutionizing the API Lifecycle Management (APLM) space. To understand how GraphQL adds value here, first, let’s take a brief look at the typical phases of APLM:

Design, Development, Testing, and Deployment are self-explanatory. The phase “Retiring” refers to the stage when an API needs an upgrade or has to be deprecated.

GraphQL and API Lifecycle Management

GraphQL is a query language for your API, and a server-side runtime for executing queries by using a type system you define for your data. GraphQL isn’t tied to any specific database or storage engine and is instead backed by your existing code and data.

Here’s a side-by-side comparison of GraphQL and REST versions of APIs for two sample use-cases:

Using GraphQL to publish a dynamically queryable and typed schema is the same as offering a strongly enforceable contract to the consumers of your API that can be validated at compile-time using client-side tooling. This reduces the scope for run-time errors! It is important to understand this and the source-agnosticism mentioned above — most of the arguments in this post to make a case for GraphQL in Enterprise will follow from them.

Let’s use the same examples as above and take a look at how API Lifecycle Management phases are affected by GraphQL:


Design

One of the major problems with REST APIs is that the fields to be returned have to be predetermined at design time, with no room for implicit flexibility. Need a new field from your REST API? It needs to go through a fresh iteration.

In the case of GraphQL, design equals establishing the scope of data that needs to be “exposed” (anything within that scope can be queried). So in our example, the schema for authors, articles, and their relationships needs to be published. Achieving this kind of flexibility without GraphQL implies dumping a superset of data for every request at the cost of performance!


Development

Authoring a GraphQL API involves a one-time effort of writing a GraphQL server and defining and implementing the GraphQL schema. The ecosystem is ripe with tools that handle most of the heavy lifting (e.g., Hasura provides a ready-to-use GraphQL server that automatically infers the schema from a database and handles the underlying implementation). If the schema remains unchanged, there is zero incremental cost to get any API that interacts with just these two tables — the API in the 2nd example comes free with the first one! This is obviously not the case with REST APIs. So, not only is the cost of a new REST API orders of magnitude higher than its GraphQL equivalent, but the cost of even a marginally different REST API remains just as high.

Moreover, GraphQL APIs can be:

  • leveraged to make a single nested query as opposed to writing and using multiple REST APIs
  • queried as needed with server-side transformations instead of having multiple variants of the same query or delegating this to clients
  • used to auto-generate client-side code

GraphQL brings a lot of other magic to the table to improve a developer’s experience of building and iterating on APIs:

Multiple data sources: An important use-case in the Enterprise space — data from different sources, especially legacy apps/DBs, is often required to be queried together. This is usually a technical and an operational nightmare (in every iteration). Here’s an anecdote from James Baxley (Apollo GraphQL) from his experience at a $64M Enterprise:

“We wanted to use the Meteor accounts system and reactive data, get financial accounts, show giving history, and show profile info from Rock, have all of our images and content stored in our CMS, and we wanted in-app search using Google’s site search API. That sentence is just as exhausting as it would’ve been to build those endpoints using REST!”

James and his team at NewSpring Church used GraphQL because a GraphQL API can be supported by multiple sources of data by stitching their schemas together. More often than not, this will be a one-time effort, saving the need for expensive cross-functional operations.

Interface for multiple data sources

In one week, a single developer from the NewSpring team was able to connect all of those databases and systems together into a single endpoint.

API/Schema discovery: What’s the equivalent of Google and 3rd party documentation inside an Enterprise? Internal documentation and wikis are not the answer — the average rating by developers for internal documentation is 2.3/5.0. With a published schema, GraphQL has API discovery or documentation implicitly built into the server with tools that take advantage of this.

A GraphQL IDE using the schema for auto-suggestions and validation

Standardised code and workflows: Working with a specification like GraphQL enforces a certain amount of standardization in codebases and processes. Considering the attrition rates in IT, this is a surprisingly underrated benefit of GraphQL.

Testing and Deployment

It follows from the above that GraphQL APIs need to be tested only when there’s a change in the schema or there’s a fresh schema (the client still needs to be tested). This also holds true for deployments. Given the imperative nature of REST APIs, every iteration is a candidate for testing and deployment. As with development, the cost of testing and deploying REST APIs is significantly more expensive compared to GraphQL.

Another area where GraphQL outshines REST is in its ability to reduce run-time errors and related testing due to its typed schema. Client-side tooling/SDKs to exploit this benefit and provide compile-time client-side query validations is the norm in the GraphQL ecosystem.

Query validation in an IDE


Retiring

API versioning is traditionally the route taken to evolve or deprecate REST APIs. However, this exercise is difficult without exact knowledge of how APIs (or more accurately, the attributes/fields in the response) are actually being used. With GraphQL and its typed schema, tracking the exact usage of all the “elements of the graph” is possible.


It is important to note that GraphQL is not a substitute for existing paradigms (REST) but a complementary one. There is plenty of balanced online literature that helps organizations make this choice and safely navigate their GraphQL adoption journey. First-time GraphQL backend developers also face a fairly steep learning curve in the early days. This and a few other rough-edges like caching, granular access-control, etc. have been identified and are being actively addressed by the community and service providers.

Enterprise GraphQL

GraphQL makes a strong case for being a crucial tool in an enterprise developer’s armory. It also helps that the backing of a vibrant open-source community brings a rich collection of tools and a plethora of managed services catering to developers and Enterprises.

It is, therefore, not surprising that GraphQL is seeing widespread adoption across a diverse set of industry verticals, from technology companies (Intuit, Coursera) and media giants (The New York Times, Condé Nast) to a cancer research hospital and a church (most of these Enterprises are also active contributors to the community). See this list of popular GraphQL users.

Original Link

Fun with SQL: Window Functions in Postgres

Today, we continue to explore all the powerful and fun things you can do with SQL. SQL is a very expressive language, and when it comes to analyzing your data, there isn’t a better option. You can see the evidence of SQL’s power in all the attempts made by NoSQL databases to recreate the capabilities of SQL. So why not just start with a SQL database that scales? (Like my favorites, Postgres and Citus.)

Today, in the latest post in our “Fun with SQL” series (earlier blog posts were about recursive CTEs, generate_series, and relocating shards on a Citus database cluster), we’re going to look at window functions in PostgreSQL. Window functions are key in various analytic and reporting use cases where you want to compare and contrast data. Window functions allow you to compare values between rows that are somehow related to the current row. Some practical uses of window functions can be:

  • Finding the first time all users performed some action
  • Finding how much each user’s bill increased or decreased from the previous month
  • Finding where all users rank for some sub-grouping

The Basic Structure of a Window Function in Postgres

Window functions within PostgreSQL have a built-in set of operators and perform their action across some specific key, but they can have two different syntaxes that express the same thing. Let’s take a look at a simple window function expressed two different ways:

The first format:

SELECT last_name, salary, department, rank() OVER ( PARTITION BY department ORDER BY salary DESC)
FROM employees;

The second format:

SELECT last_name, salary, department, rank() OVER w
FROM employees WINDOW w AS (PARTITION BY department ORDER BY salary DESC);

With the first query, we can see the window function is inlined, whereas in the second, it is broken out separately. Both of the above queries produce the same results:

 last_name | salary | department | rank
-----------+--------+------------+------
 Jones     |  45000 | Accounting |    1
 Williams  |  37000 | Accounting |    2
 Smith     |  55000 | Sales      |    1
 Adams     |  50000 | Sales      |    2
 Johnson   |  40000 | Marketing  |    1
Both of these show the last name of employees, their salary, their department — and then where they rank in terms of salary within their department. You could easily combine this with a CTE to then find only the highest paid (where rank = 1) or second-highest paid (where rank = 2) in each department.
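That CTE-plus-filter pattern might look like this, using the same employees table as above:

```sql
WITH ranked AS (
  SELECT last_name, salary, department,
         rank() OVER (PARTITION BY department ORDER BY salary DESC)
  FROM employees
)
SELECT last_name, salary, department
FROM ranked
WHERE rank = 1;  -- highest paid per department; use rank = 2 for second highest
```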

What Can You Do With Window Functions in Postgres?

Within Postgres, there are a number of window functions that each perform a different operation. You can check the PostgreSQL docs for the full list, but for now, we’ll walk through a few that are particularly interesting:

  • rank — As we saw in the earlier example, rank will show where the row ranks in order of the window order.
  • percent_rank — Want to compute the percent where the row falls within your window order? percent_rank will give you the percentage ranking based on your window; think of it as ((rank - 1) / (total rows - 1)).
  • lag — Want to do your own operation between rows? lag will give you the row value x rows before your current row. Want the value from future rows instead? You can use lead for that. A great example of this could be computing month-over-month growth.
  • ntile — Want to compute what percentile values fall in? ntile allows you to specify a percentile to group buckets into. For 4 quartiles you would use ntile(4), for percentile of each row you would use ntile(100).
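As a sketch of the lag example above, month-over-month growth could be computed against a hypothetical monthly_signups table:

```sql
-- monthly_signups is a hypothetical table: (month date, signups int)
SELECT month,
       signups,
       signups - lag(signups) OVER (ORDER BY month) AS change_from_last_month
FROM monthly_signups;
```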

Hopefully, you’ll find window functions useful. If you have questions about using them, the PostgreSQL docs are a great resource. Or, feel free to jump into our Slack channel.

Original Link

Simple Tips For PostgreSQL Query Optimization

A single query optimization tip can boost your database performance by 100x. At one point, we advised one of our customers that had a 10TB database to use a date-based multi-column index. As a result, their date range query sped up by 112x. In this post, we share five simple yet still powerful tips for PostgreSQL query optimization.

To keep it simple, we ran examples for this article on a test dataset. Although it doesn’t show the actual performance improvement, you will see that our tips solve a significant set of optimization problems and work well in real-world scenarios.

Explain Analyze

Postgres has a cool extension to the well-known ‘EXPLAIN’ command, which is called ‘EXPLAIN ANALYZE’. The difference is that ‘EXPLAIN’ shows you query cost based on collected statistics about your database, while ‘EXPLAIN ANALYZE’ actually runs the query to show the processing time for every stage.

We highly recommend you use ‘EXPLAIN ANALYZE’ because there are a lot of cases when ‘EXPLAIN’ shows a higher query cost, while the time to execute is actually less and vice versa.

Tip: The most important thing is that the ‘EXPLAIN’ command will help you to understand if a specific index is used and how.

The ability to see indexes is the first step to learning PostgreSQL query optimization.

One Index Per Query

Indexes are materialized copies of your table. They contain only specific columns of the table so you can quickly find data based on the values in these columns. Indexes in Postgres also store row identifiers or row addresses used to speed up the original table scans.

It is always a trade-off between storage space and query time, and a lot of indexes can introduce overhead for DML operations. However, when read query performance is a priority, as is the case with business analytics, it is usually a well-working approach.

Tip: Create one index per unique query for better performance.

Read on to learn how to create indexes for specific queries, including indexes that use multiple columns.

Let’s review the ‘explain analyze’ plan of the following simple query without indexes:

EXPLAIN ANALYZE SELECT line_items.product_id, SUM(line_items.price)
FROM line_items
WHERE product_id > 80
GROUP BY 1

An explain analyze returns:

HashAggregate  (cost=13.81..14.52 rows=71 width=12) (actual time=0.137..0.141 rows=20 loops=1)
  Group Key: product_id
  ->  Seq Scan on line_items  (cost=0.00..13.25 rows=112 width=8) (actual time=0.017..0.082 rows=112 loops=1)
        Filter: (product_id > 80)
        Rows Removed by Filter: 388
Planning time: 0.082 ms
Execution time: 0.187 ms

This query scans all of the line items to find a product with an ID that is greater than 80 and then sums up all the values grouped by that product ID.

Now we’ll add the index to this table:

CREATE INDEX items_product_id ON line_items(product_id)

We created a B-tree index, which contains only one column: ‘product_id’. After reading many articles about the benefits of using an index, one can expect a query boost from such an operation. Sorry, bad news.

As we need to sum up the price column in the query above, we still need to scan the original table. Depending on the table statistics, Postgres will choose to scan the original table instead of the index. The thing is, the index lacks the ‘price’ column.

We can tweak this index by adding a price column as follows:

CREATE INDEX items_product_id_price ON line_items(product_id, price)

If we re-run the ‘explain’ plan, we’ll see that our index is now used:

GroupAggregate  (cost=0.27..7.50 rows=71 width=12) (actual time=0.034..0.090 rows=20 loops=1)
  Group Key: product_id
  ->  Index Only Scan using items_product_id_price on line_items  (cost=0.27..6.23 rows=112 width=8) (actual time=0.024..0.049 rows=112 loops=1)
        Index Cond: (product_id > 80)
        Heap Fetches: 0
Planning time: 0.271 ms
Execution time: 0.136 ms

How would putting the price column first affect the PostgreSQL query optimization?

Well, we figured out that a multicolumn index is used in the previous query because we included both columns. The interesting thing is that we can use another order for these columns while defining the index:

CREATE INDEX items_product_id_price_reversed ON line_items(price, product_id)

If we re-run ‘explain analyze’, we’ll see that ‘items_product_id_price_reversed’ is not used. That’s because this index is sorted firstly on ‘price’ and then on ‘product_id’. Using this index would require a full scan of it, which is nearly equivalent to scanning the table. That’s why Postgres opts to scan the original table instead.

Tip: In a multicolumn index, put first the columns you use in filters that have the biggest number of unique values.
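If you’re unsure which filter column is most selective, you can count distinct values before choosing the column order. A quick check on the example table:

```sql
SELECT count(DISTINCT product_id) AS distinct_product_ids,
       count(DISTINCT price)      AS distinct_prices
FROM line_items;
-- put the column with more distinct values first in the multicolumn index
```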

Filters + Joins

It’s time to figure out what the best set of indexes is for a specific join query, which also has some filter conditions. Usually, you can achieve optimal results by trial and error.

Tip: As in the case of simple filtering, choose the most restrictive filtering condition and add an index for it.

Let’s consider an example:

SELECT orders.product_id, SUM(line_items.price)
FROM line_items
LEFT JOIN orders ON line_items.order_id = orders.id
WHERE line_items.created_at BETWEEN '2018-01-01' AND '2018-01-02'
GROUP BY 1

Here we have join on ‘order_id’ and filter on ‘created_at’. This way, we can create a multicolumn index that will contain ‘created_at’ in the first place and ‘order_id’ in the second:

CREATE INDEX line_items_created_at_order_id ON line_items(created_at, order_id)

We’ll get the following explain plan:

GroupAggregate  (cost=16.62..16.64 rows=1 width=12) (actual time=0.026..0.026 rows=1 loops=1)
  Group Key: orders.product_id
  ->  Sort  (cost=16.62..16.62 rows=1 width=8) (actual time=0.023..0.023 rows=1 loops=1)
        Sort Key: orders.product_id
        Sort Method: quicksort  Memory: 25kB
        ->  Nested Loop Left Join  (cost=0.56..16.61 rows=1 width=8) (actual time=0.014..0.016 rows=1 loops=1)
              ->  Index Scan using line_items_created_at_order_id on line_items  (cost=0.27..8.29 rows=1 width=8) (actual time=0.008..0.008 rows=1 loops=1)
                    Index Cond: ((created_at >= '2018-01-01 00:00:00'::timestamp without time zone) AND (created_at <= '2018-01-02 00:00:00'::timestamp without time zone))
              ->  Index Scan using orders_pkey on orders  (cost=0.29..8.30 rows=1 width=8) (actual time=0.004..0.004 rows=1 loops=1)
                    Index Cond: (line_items.order_id = id)
Planning time: 0.269 ms
Execution time: 0.065 ms

As you can see, ‘line_items_created_at_order_id’ is used to reduce scan by date condition. After that, it’s joined with orders using the ‘orders_pkey’ index scan.

Tip: Date filters are usually one of the best candidates for the first column in a multicolumn index, as they reduce the rows scanned in a predictable manner.


Our tips for PostgreSQL query optimization will help you speed up queries 10-100x for multi-GB databases. They can solve most of your performance bottlenecks in an 80/20 manner. However, it doesn’t mean you shouldn’t double check your queries with ‘EXPLAIN’ for real-world case scenarios.

Original Link

Contributing to Postgres via Patch Review

Citus is an open-source extension to Postgres that transforms Postgres into a distributed database, scaling horizontally. The fact that Citus is built on top of Postgres is a huge benefit to our users: it means that when you choose Citus, you get all the great features that are available in Postgres. And Postgres itself is an awesome database. Awesome. As a team, we value the foundation we’re built on and regularly aim to contribute back to it. We have a number of developers who have contributed to Postgres over the years, with features like watch, event triggers, and the PostgreSQL extension framework.

Recently, a few more of our engineers expressed an interest in giving back to the PostgreSQL community. In fact, it’s a common question: How can we better help the PostgreSQL project? And a common answer is reviewing patches. To help kick-start that process, we organized a session and carved out a few days just for patch review during the most recent commitfest.

How Postgres Gets Developed

To better understand what patch reviews mean to Postgres, it is best to take a step back and understand the broader development process. PostgreSQL is released on a yearly basis. While Postgres 11 is likely to ship in fall of this year, we largely know the superset of what will be released. As developers want to contribute, they start with an idea, or occasionally a full patch, as a discussion point on the mailing list. As these patches become more baked, they get put into a review process.

To help better coordinate reviews and features, there is a period known as a commitfest. Patches that have been submitted get reviewed, edited, and then eventually committed. In fact, there is a full commitfest application that you can check out to see which patches are in flight.

Reviewing PostgreSQL Patches

For each patch review, you’ll want to first pull the patch, apply it, test that it works, then review the code itself. For larger patches that may involve performance optimizations or such, of course, the work to test/review may be much heavier.

To us, it seemed like a good idea to walk through the process as a team with those who had more experience reviewing Postgres patches. So, we invited our coworkers from Canada, France, and the Netherlands to our Istanbul office to hold a Citus Commitfest, a time where everybody contributes to PostgreSQL by reviewing patches. We started with Dimitri Fontaine, a Citan and major PostgreSQL contributor, going over the process, explaining the commitfest logic, pgsql-etiquette, the importance of patch reviews and of helping committers, etc. Then, we had everybody pick at least two patches that they wanted to review and pair up on the common patches. Moreover, Dimitri and Eren did a live review of the patch they chose on screen, and everybody else contributed to the review process. This included downloading and applying the patch on the PostgreSQL master branch, compiling, running tests, checking that the patch does what is intended, reviewing the code line by line, and finally sending an e-mail to pgsql-hackers about the review we did.

Eren and Dimitri picked the patch about sending an optional message in cases of backend cancellation and termination, authored by Daniel Gustafsson. Normally, when you run pg_cancel_backend or pg_terminate_backend, the canceled/terminated backend shows a message like ERROR: canceling statement due to user request or terminating connection due to administrator command. The patch allows us to supply a message that will be appended to the original message. In other words, when we run SELECT pg_terminate_backend(pid, 'triggered failover from the console'), the message will look like terminating connection due to administrator command: triggered failover from the console in the logs. This can be helpful for communication between the app developers and DBAs or in our case of running a database as a service.

Lessons Learned Reviewing Patches

The most unfortunate part of reviewing patches for the commitfest is that we didn’t get through more. This was due to a bit of bad planning on our part… as part of team building, we took a boat ride, and most team members ended up seasick, which definitely impacted our productivity and our planned reviews.

Lesson learned: Don’t mix Postgres and boats.

Lesson learned #2: Reviewing PostgreSQL patches for the commitfest, while a small gesture, felt great. I speak not only for myself but also for our team of database developers here at Citus: we look forward to contributing to the Postgres community by doing even more patch reviews going forward.

Original Link

jOOQ, PostgreSQL, and Serverless: Java and Relational Data

Azure PostgreSQL Service — as well as MySQL Service — recently became GA, meaning many can now safely bring their databases (or create new ones) with all the benefits of a database service that easily allows provisioning, management, monitoring, scaling, back-up, and a lot more.

For developers, what really matters though is the ability to quickly spin a new PostgreSQL database instance, and quickly connect an application to it, without the need to download/install/configure PostgreSQL — although of course you can always use a local Docker image. What is really interesting in this case, of course, is that the database instance may be shared with other development team members.

If you haven’t already installed Azure CLI, here’s how:

$ brew update && brew install azure-cli

And if you still don’t have an Azure account, try these two options:

Free Azure ($250 over 30 days plus 25+ free services for 12 months)

Free Azure for Students (similar as above; no credit card verification required)

So let’s get started:

Step 1: Authenticate Through Azure CLI

$ az login

And follow the instructions.

Step 2: Create a PostgreSQL Database Using the CLI

$ az group create --name myresourcegroup --location westus
$ az extension add --name rdbms
$ az postgres server create --resource-group myresourcegroup \
    --name mydemoserver --location westus --admin-user myadmin \
    --admin-password <server_admin_password> --sku-name GP_Gen4_2 \
    --version 9.6

These three commands should take no more than 5 minutes.

Pro-tips: don’t try to set the admin-user to values like root/admin. These are reserved, and the command will eventually fail. The password also must comply with these requirements:

  • Must have digits, letters, and special characters
  • Minimum length is 8 characters

If all goes well, you should get a JSON with all the details of your newly created PostgreSQL database.

The next step is to allow connections to this database by setting a firewall rule.

Step 3: Setup Firewall Rule

Be careful with this configuration. Ideally, you should NOT open your database on the internet and only allow for applications deployed on Azure services behind the firewall. We will look into this later.

$ az postgres server firewall-rule create \ --resource-group myresourcegroup --server mydemoserver \ --name AllowAllIps --start-ip-address \ --end-ip-address

Step 4: Test Your Connection With PostgreSQL Command Line

In this other article, I share some hints on setting up command-line tools for connecting to multiple databases, including Postgres.

Before we can connect to the database, we must find the address of it.

Run the following command:

$ az postgres server show --resource-group myresourcegroup --name mydemoserver

You should see a JSON like this:

{ "administratorLogin": "myadmin", "earliestRestoreDate": null, "fullyQualifiedDomainName": "", "id": "/subscriptions/.../resourceGroups/myresourcegroup/providers/Microsoft.DBforPostgreSQL/servers/mydemoserver", "location": "westus", "name": "mydemoserver", "resourceGroup": "myresourcegroup",

Now trigger a connection to the database using the fullyQualifiedDomainName and the administratorLogin you gave during create:

$ psql \ \ \ --dbname=postgres Password for user
psql (10.3, server 9.6.6)
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-SHA384, bits: 256, compression: off)
Type "help" for help. postgres=>

Done! You have successfully created a PostgreSQL database on Azure and can now load up some data.

Step 5: Running a Sample Java Function on Azure Using jOOQ to Connect to PostgreSQL

Once on the Postgres command line, make sure you create the following table:


Before proceeding, ensure you have the latest Azure Functions runtime on your local computer. Check the install procedures here.

Example for installing Azure Functions Tools on Mac:

$ brew tap azure/functions
$ brew install azure-functions-core-tools

Now you can clone the project Azure Functions Sample for jOOQ and PostgreSQL:

$ git clone

Next, go into the project folder, copy/rename and modify the file with your database credentials and information, and run the project locally:

$ cd azure-function-jooq-postgre-sample
$ cp
$ vi local.settings.json
$ mvn clean package azure-functions:run

And finally, try to hit these two URLs, with the expected output as it follows:

$ curl http://localhost:7071/api/hello\?name\=Azure
$ curl http://localhost:7071/api/greetedPeople

If everything goes well, you should have the following output:

$ curl http://localhost:7071/api/hello\?name\=Azure
Hello, Azure.
$ curl http://localhost:7071/api/greetedPeople

If you feel brave enough, deploy this Azure Function to your environment:

$ mvn azure-functions:deploy

I hope you enjoy the code, because the very early version generated quite a Twitter storm during code review.

Original Link

Fun With SQL: generate_series in Postgres

There are times within Postgres where you may want to generate sample data, or some consistent series of records to join against for reporting. Enter the simple but handy set-returning function of Postgres: generate_series. generate_series, as the name implies, allows you to generate a set of data starting at some point, ending at another point, and optionally set the incrementing value. generate_series works on two datatypes:

  • Integers
  • Timestamps

Let’s get started with the most basic example:

SELECT * FROM generate_series(1, 5);

 generate_series
-----------------
               1
               2
               3
               4
               5
(5 rows)

So generate_series is pretty straightforward, but what interesting ways can it be used?

Generating Fake Data

By putting our generate_series inside a CTE, we can easily generate a set of numbers and then perform some operation against each value. If we want to generate some fake numbers, we can use random(), which generates a random number between 0.0 and 1.0.

WITH numbers AS (
  SELECT * FROM generate_series(1, 5)
)
SELECT generate_series * random()
FROM numbers;

     ?column?
-------------------
   0.87764338683337
  0.345125129446387
   2.10317854676396
  0.937898803502321
   1.72822773223743
(5 rows)

Pretty Weekly Reporting With Joins

Aggregating across some time dimension is a fairly common report. A good example might be new users per week. The simplest way to get this would be by leveraging the Postgres date_trunc function:

SELECT date_trunc('week', created_at), count(*)
FROM users
GROUP BY 1

The issue with the above query arises when two cases are true: first, you’re charting your data over time, and second, you have a week with no sign-ups. In the case of no sign-ups in a week, you’d simply miss the 0 on your graph, leaving a misleading impression. To smooth this out, we go back to generate_series and do an outer join on the week:

WITH range_values AS (
  SELECT date_trunc('week', min(created_at)) as minval,
         date_trunc('week', max(created_at)) as maxval
  FROM users
),
week_range AS (
  SELECT generate_series(minval, maxval, '1 week'::interval) as week
  FROM range_values
),
weekly_counts AS (
  SELECT date_trunc('week', created_at) as week,
         count(*) as ct
  FROM users
  GROUP BY 1
)
SELECT week_range.week,
       weekly_counts.ct
FROM week_range
LEFT OUTER JOIN weekly_counts on week_range.week = weekly_counts.week;
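The effect of the outer join is easiest to see outside SQL. Here is a minimal Python sketch of the same zero-filling idea (the week labels and counts are made-up sample data):

```python
from datetime import date, timedelta

# Made-up weekly signup counts; note the missing week of 2018-01-08.
weekly_counts = {date(2018, 1, 1): 4, date(2018, 1, 15): 7}

# week_range CTE: every week from the earliest to the latest signup week.
start, end = min(weekly_counts), max(weekly_counts)
week_range = []
week = start
while week <= end:
    week_range.append(week)
    week += timedelta(weeks=1)

# LEFT OUTER JOIN: every week from week_range appears exactly once;
# weeks with no signups get 0 here (the SQL version returns NULL
# unless you wrap the count in coalesce(ct, 0)).
report = [(week, weekly_counts.get(week, 0)) for week in week_range]
print(report)
```

The empty week shows up with a 0 instead of silently disappearing from the chart.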

What Other Uses Do You Have for generate_series?

Postgres has a wealth of hidden gems within it. generate_series is just one of the handy built-in features of Postgres. If you know of other novel uses for it, we’d love to hear about them @citusdata.

Original Link

When Postgres Blocks: 7 Tips for Dealing With Locks

Recently, I wrote about locking behavior in Postgres, which commands block each other, and how you can diagnose blocked commands. Of course, after the diagnosis, you may also want a cure. With Postgres, it is possible to shoot yourself in the foot, but Postgres also offers you a way to stay on target. These are some of the important dos and don’ts that we’ve seen as helpful when working with users to migrate from their single node Postgres database to Citus or when building new real-time analytics apps on Citus.

1. Never Add a Column With a Default Value

A golden rule of PostgreSQL is: when you add a column to a table in production, never specify a default.

Adding a column takes a very aggressive lock on the table, which blocks reads and writes. If you add a column with a default, PostgreSQL will rewrite the whole table to fill in the default for every row, which can take hours on large tables. In the meantime, all queries will block, so your database will be unavailable.

Don’t do this:

-- reads and writes block until it is fully rewritten (hours?)
ALTER TABLE items ADD COLUMN last_update timestamptz DEFAULT now();

Do this instead:

-- select, update, insert, and delete block until the catalog is updated (milliseconds)
ALTER TABLE items ADD COLUMN last_update timestamptz;
-- select and insert go through, some updates and deletes block while the table is rewritten
UPDATE items SET last_update = now();

Or better yet, avoid blocking updates and deletes for a long time by updating in small batches, i.e.:

do {
  numRowsUpdated = executeUpdate(
    "UPDATE items SET last_update = ? " +
    "WHERE ctid IN (SELECT ctid FROM items WHERE last_update IS NULL LIMIT 5000)",
    now);
} while (numRowsUpdated > 0);

This way, you can add and populate a new column with minimal interruption to your users.
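The batching pattern generalizes beyond JDBC. Here is a sketch of the same loop in Python against an in-memory list standing in for the items table (no real database involved, so the locking is only described in comments):

```python
def backfill_in_batches(rows, new_value, batch_size=5000):
    """Set last_update on rows where it is still None, batch_size rows
    at a time, mirroring the UPDATE ... LIMIT 5000 loop above."""
    total = 0
    while True:
        batch = [row for row in rows if row["last_update"] is None][:batch_size]
        if not batch:  # numRowsUpdated == 0: nothing left to backfill
            break
        for row in batch:
            row["last_update"] = new_value
        total += len(batch)
        # In the real version, each iteration is its own short transaction,
        # so readers and writers only ever wait on one small batch of rows.
    return total

items = [{"id": i, "last_update": None} for i in range(12000)]
print(backfill_in_batches(items, "2018-01-01"))
```

Each pass only touches rows that still need the value, so the loop terminates once the whole table is backfilled.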

2. Beware of Lock Queues; Use Lock Timeouts

Every lock in PostgreSQL has a queue. If a transaction B tries to acquire a lock that is already held by transaction A with a conflicting lock level, then transaction B will wait in the lock queue. Now, something interesting happens: if another transaction C comes in, then it will not only have to check for conflict with A, but also with transaction B, and any other transaction in the lock queue.

This means that even if your DDL command can run very quickly, it might be in a queue for a long time waiting for queries to finish, and queries that start after it will be blocked behind it.

When you can have long-running SELECT queries on a table, don’t do this:

ALTER TABLE items ADD COLUMN last_update timestamptz;

Instead, do this:

SET lock_timeout TO '2s';
ALTER TABLE items ADD COLUMN last_update timestamptz;

By setting lock_timeout, the DDL command will fail if it ends up waiting for a lock (and thus blocking queries) for more than two seconds. The downside is that your ALTER TABLE might not succeed, but you can try again later. You may want to query pg_stat_activity to see if there are long-running queries before starting the DDL command.

3. Create Indexes Concurrently

Another golden rule of PostgreSQL is: always create your indexes concurrently.

Creating an index on a large dataset can take hours or even days, and the regular CREATE INDEX command blocks all writes for the duration of the command. While it doesn’t block SELECTs, this is still pretty bad and there’s a better way: CREATE INDEX CONCURRENTLY.

Don’t do this:

-- blocks all writes
CREATE INDEX items_value_idx ON items USING GIN (value jsonb_path_ops);

Instead, do this:

-- only blocks other DDL
CREATE INDEX CONCURRENTLY items_value_idx ON items USING GIN (value jsonb_path_ops);

Creating an index concurrently does have a downside. If something goes wrong, it does not roll back, leaving an unfinished (“invalid”) index behind. If that happens, don’t worry; simply run DROP INDEX CONCURRENTLY items_value_idx and try to create it again.

4. Take Aggressive Locks as Late as Possible

When you need to run a command that acquires aggressive locks on a table, try to do it as late in the transaction as possible to allow queries to continue for as long as possible.

For example, say you want to completely replace the contents of a table. Don’t do this:

BEGIN;
-- reads and writes blocked from here:
TRUNCATE items;
-- long-running operation:
\COPY items FROM 'newdata.csv' WITH CSV
COMMIT;

Instead, load the data into a new table and then replace the old table:

BEGIN;
-- long-running operation:
\COPY items_new FROM 'newdata.csv' WITH CSV
-- reads and writes blocked from here:
DROP TABLE items;
ALTER TABLE items_new RENAME TO items;
COMMIT;

There is one problem: we didn’t block writes from the start, and the old items table might have changed by the time we drop it. To prevent that, we can explicitly take a lock on the table that blocks writes, but not reads:

BEGIN;
LOCK items IN EXCLUSIVE MODE;
...

Sometimes it’s better to take locking into your own hands.

5. Adding a Primary Key With Minimal Locking

It’s often a good idea to add a primary key to your tables — for example, when you want to use logical replication or migrate your database using Citus Warp.

Postgres makes it very easy to create a primary key using ALTER TABLE, but while the index for the primary key is being built (which can take a long time if the table is large), all queries will be blocked.

ALTER TABLE items ADD PRIMARY KEY (id); -- blocks queries for a long time

Fortunately, you can first do all the heavy lifting using CREATE UNIQUE INDEX CONCURRENTLY, and then use the unique index as a primary key, which is a fast operation.

CREATE UNIQUE INDEX CONCURRENTLY items_pk ON items (id); -- takes a long time, but doesn’t block queries
ALTER TABLE items ADD CONSTRAINT items_pk PRIMARY KEY USING INDEX items_pk; -- blocks queries, but only very briefly

Breaking primary key creation down into two steps means it has almost no impact on the user.


6. Avoid VACUUM FULL

The Postgres user experience can be a little surprising sometimes. While VACUUM FULL sounds like something you want to run to clear the dust off your database, it is rarely what you want.

VACUUM FULL rewrites the entire table to disk, which can take hours or days, and blocks all queries while doing it. While there are some valid use cases for VACUUM FULL, such as a table that used to be big but is now small and still takes up a lot of space, it is probably not your use case.

While you should aim to tune your auto-vacuum settings and use indexes to make your queries fast, you may occasionally want to run VACUUM, but not VACUUM FULL.

7. Avoid Deadlocks by Ordering Commands

If you’ve been using PostgreSQL for a while, chances are you’ve seen errors like:

ERROR: deadlock detected
DETAIL: Process 13661 waits for ShareLock on transaction 45942; blocked by process 13483.
Process 13483 waits for ShareLock on transaction 45937; blocked by process 13661.

This happens when concurrent transactions take the same locks in a different order. For example, one transaction issues the following commands.

UPDATE items SET counter = counter + 1 WHERE key = 'hello'; -- grabs lock on hello
UPDATE items SET counter = counter + 1 WHERE key = 'world'; -- blocks waiting for world

Simultaneously, another transaction might be issuing the same commands, but in a different order.

UPDATE items SET counter = counter + 1 WHERE key = 'world'; -- grabs lock on world
UPDATE items SET counter = counter + 1 WHERE key = 'hello'; -- blocks waiting for hello

If these transactions run simultaneously, chances are that they get stuck waiting for each other and never finish. Postgres will recognize this situation after a second or so and will cancel one of the transactions to let the other one finish. When this happens, you should take a look at your application to see if you can make your transactions always follow the same order. If both transactions first modify hello and then world, then the first transaction will block the second one on the hello lock before it can grab any other locks.
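The same ordering rule applies to locks taken in application code. Here is a small Python sketch (the key names follow the example above) that sorts keys before locking, so two concurrent callers can never wait on each other in a cycle:

```python
import threading

locks = {"hello": threading.Lock(), "world": threading.Lock()}
counters = {"hello": 0, "world": 0}

def increment_all(keys):
    # Acquire locks in sorted key order so that every caller requests
    # them in the same order; a wait-for cycle then becomes impossible.
    ordered = sorted(keys)
    for key in ordered:
        locks[key].acquire()
    try:
        for key in keys:
            counters[key] += 1
    finally:
        for key in reversed(ordered):
            locks[key].release()

# Two threads touching the same keys in opposite orders: still safe,
# because both end up acquiring "hello" before "world".
t1 = threading.Thread(target=increment_all, args=(["hello", "world"],))
t2 = threading.Thread(target=increment_all, args=(["world", "hello"],))
t1.start(); t2.start(); t1.join(); t2.join()
print(counters)
```

Postgres cannot reorder your statements for you, so this discipline has to live in the application, just as it does here.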

Original Link

Akka: Data Access Actors

The actor model gives us an outstanding solution for building high-scale and high-load systems. Those of you who work with Akka know that in the actor model, everything should be represented as an actor. In some sense, this axiom simplifies software development. But is it as good as it seems? In this article, I’m going to demonstrate some approaches that may be applied to actors that need to interact with a database. After all, what is the point of a high-scale, high-load system if it does not communicate with a database?


Let’s assume that we want to implement an actor for accessing a person table:

-- Postgres syntax
CREATE TABLE person (
  id serial primary key,
  full_name text not null,
  phone text,
  created_at timestamp not null
);

What operations do we want to have for the person table? Probably a standard CRUD set of operations: create, select by id, update full_name and phone, and delete by id.

When we know what we want, we can start doing some steps to implement the solution. Our goal is to answer the question: What is the right way to design a data access actor?

Approach #1

Everything starts from a case class:

case class Person(id: Int, fullName: String, phone: Option[String], createdAt: LocalDateTime)

Let’s assume that we have some database provider trait. In our case, it may be something like PostgresDB. It represents a driver to the database. That means that we can create some abstraction for data access:

trait PersonRepository {
  implicit val ec: ExecutionContext
  val db: PostgresDB

  def createPerson(fullName: String, phone: Option[String]): Future[Int]
  def getPerson(id: Int): Future[Option[Person]]
  def updatePerson(id: Int, fullName: String, phone: Option[String]): Future[Boolean]
  def deletePerson(id: Int): Future[Boolean]
}

Of course, now we have to create some realization of this trait:

class PersonRepositoryImpl(db: PostgresDB)(implicit val ec: ExecutionContext)
  extends PersonRepository {
  // implementation of the methods
}

Wait! This article is about actors, isn’t it? So why don’t we see any code related to actors? Let’s correct this somehow. The most rational idea is to define messages for the actor first:

object PersonDataActor {
  case class CreatePerson(fullName: String, phone: Option[String])
  case class GetPerson(id: Int)
  case class UpdatePerson(id: Int, fullName: String, phone: Option[String])
  case class DeletePerson(id: Int)
  // Then we can add response messages here as a reaction to the messages declared above
}

With this set of messages, we can create PersonDataActor:

class PersonDataActor(personRepo: PersonRepository) extends Actor {
  implicit val system = context.system
  implicit val ec: ExecutionContext = system.dispatcher

  override def receive: Receive = {
    case cp: CreatePerson => // corresponding function call from personRepo
    case gp: GetPerson    => // corresponding function call from personRepo
    case up: UpdatePerson => // corresponding function call from personRepo
    case dp: DeletePerson => // corresponding function call from personRepo
  }
}

Well. That’s it.

Is this approach good enough to be used in production? Well, at least it works. A more significant advantage is that we can mock personRepo for testing purposes; hence, PersonDataActor is testable.

Unfortunately, when you need more than one repository for some reason, the actor’s constructor becomes “fat”:

class PersonDataActor(personRepo: PersonRepository,
                      mobProviderRepo: MobileProviderRepository,
                      phoneBlackListRepo: PhoneBlackListRepository) extends Actor {
  //...
}

That’s how things go with approach #1.

Approach #2

I hope you have read the previous section, because I’m going to refer to it. So why don’t we pass just one parameter, PostgresDB, to the actor’s constructor? This makes the actor’s construction more elegant, because all the repositories can be initialized inside the actor:

class PersonDataActor(postgresDB: PostgresDB) extends Actor {
  val personRepo = new PersonRepositoryImpl(postgresDB)
  val mobProviderRepo = new MobileProviderRepositoryImpl(postgresDB)
  val phoneBlackListRepo = new PhoneBlackListRepositoryImpl(postgresDB)
  //...
}

Is this approach better than the first one? Actually, no, because the “elegance” of the PersonDataActor constructor gives you less than it takes away. This code is hard to test: you are not able to mock the repositories according to your test scenarios, so you will need to create an in-memory DB for each test suite run.


I tried to highlight the problems that may occur with data access actors when you design your actor system. This article is definitely just the tip of the iceberg, though. Maybe I missed something when I tried to enforce separation of concerns in the context of actors. Anyway, I’ll be really glad to read about your experience in this area.

How would you implement this data access actor?

Original Link

CLI for Indexing Data From Postgres to Elasticsearch

ElasticSearch is fantastic for indexing and filtering data. But hey, you have your data on a Postgres DB in production. How do you copy all data from Postgres to Elastic? Even better, how do you keep both the data stores in sync? Is it even possible?

I am going to answer these questions in this post. To start off, yes, it is indeed possible. We have made an awesome CLI tool called ABC, which will allow you to do this with a single command.

abc import --src_type=postgres --src_uri=<uri> <elasticsearch_uri>

That’s it. Seriously, this is all you need to sync a Postgres database to an ElasticSearch index. Here’s a video showing the process.

The Steps

The first step is to install ABC if you have not done so already. Go to the GitHub releases page for ABC and download the most recent version. It’s a single no-dependency binary, so put it anywhere you like. We recommend putting it inside a PATH directory so that it can be accessed from anywhere in the terminal.

Ensure that ABC is working by running the following command.

abc version

Now, let’s take a Postgres database and sync it to an ElasticSearch index hosted on appbase.io.

First, we are going to create a database called ‘users’.

CREATE USER test_user with password 'test_pass';
CREATE DATABASE users with OWNER test_user;
\c users;

Next, we are going to create a table called ‘users’ inside the database ‘users’.

CREATE TABLE users (
  email varchar(500) NOT NULL,
  name varchar(100) NOT NULL,
  bio text,
  PRIMARY KEY(email)
);

After that, we will add some sample data to it.

insert into users values ('', 'foo', 'bar');
insert into users values ('', 'zak', 'ceo');

The table now looks as follows:

The Postgres test source is now complete. Its URL is:


Next, we are going to create the sink ElasticSearch index. We go to and create a new app called abcpostgrestest. The complete URL to this index looks like the following.

So now, we have both the source and the sink ready. It’s time for some ABC magic. Using this source and sink, we can build the import command. It will be as follows:

abc import --src_type=postgres --src_uri="postgresql://test_user:test_pass@" ""

Once you run this command, you should see that the command will finish in some time with no errors. Now if you visit appbaseio dashboard, you can see that the data has been transferred to the target ElasticSearch index.

Voila. Everything works. The data has been transferred to ElasticSearch and that too without doing anything at all. Next, we will see how to make ABC listen to the changes in the Postgres database.

Indexing Real-Time Data Changes From Postgres

If you are using Postgres as your production database system, chances are good that your data is constantly changing. How do you sync the Elasticsearch index with all the changes?

ABC has a nifty tail mode that allows synchronizing the Postgres database in real-time to an Elasticsearch index. It uses the replication slot feature of Postgres to be able to do this.

It can be enabled by passing a --tail switch.

abc import --tail --src_type=postgres --src_uri="postgresql://test_user:test_pass@" ""

Now, run the import command again with the tail option. You might see an error with this text:

must be superuser or replication role to use replication slots

Don’t panic. This happens because you don’t have replication set up on your Postgres database, which is required for the tailing feature. You will now have to make the database user a superuser. (Read ALTER ROLE.)

psql -U postgres

After running the above command, you should be inside the psql shell. Now, run the following command (where ‘test_user’ is your database username).

ALTER USER test_user WITH SUPERUSER;

Once that is done, you need to create some replication slots. Open postgresql.conf in the text editor and change the wal_level and max_replication_slots as follows. You will have to restart your Postgres server after this. This step enables the feature of replication slots.

wal_level = logical
max_replication_slots = 1

After this, you will actually create a replication slot. Go back in the psql shell, connect to the source database, and run the following command.

\c users;
select * from pg_create_logical_replication_slot('users_replication_slot', 'test_decoding');
SELECT * FROM pg_replication_slots;

You should now see the users database in the replication slot row. This means that the replication slot is properly set up.

Now let’s try the tail again.

abc import --tail --src_type=postgres --src_uri="postgresql://test_user:test_pass@" --replication_slot="users_replication_slot" ""

Run the import again. It should work this time. 

Also, notice that the import process won’t exit when finished now. It will keep on running and will listen to the changes. Let’s add a new entry to the database and see if it reflects on the appbaseio dashboard.

\c users;
insert into users values ('', 'tony stark', 'iron man');

And yes, it works. You should see a new entry in the app dashboard when you hit reload.

Try making some more changes, and all of them will be reflected on the appbaseio based Elasticsearch cluster.

Transforming Data Before Indexing Into Elasticsearch

There are times when you don’t need the data to go as-is from source to sink. You might like to change the target type name (e.g., users to accounts), remove certain fields (e.g., bio), or create new fields. For all this, we have the transforms feature in ABC. It can be used via the transform_file parameter.

abc import --src_type=postgres --src_uri="postgresql://test_user:test_pass@" --transform_file="transform_file.js" ""

The transform_file parameter takes the path to the transform file. That file in turn contains the JavaScript transforms that should be applied to the pipeline. Let’s take the contents of transform_file.js as follows.

t.Source("source", source, "/.*/")
  .Transform(omit({"fields": ["bio"]}))
  .Save("sink", sink, "/.*/")

In the above transform, you can see that we are going to omit the bio field from the data transfer. Now when we run the new import command, we should have the following result.

As you can see, the bio field was omitted when the data reached the sink. More documentation on transform files can be found on GitHub. ABC supports lots of built-in functions like omit and even supports running custom JavaScript code as a transform. It’s a good idea to explore its documentation.
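Conceptually, the omit transform just drops the named keys from each document on its way to the sink. A hypothetical Python equivalent (this is an illustration of the idea, not ABC’s actual implementation, and the sample document is made up):

```python
def omit(fields):
    """Return a transform that drops the given fields from a document."""
    excluded = set(fields)
    def transform(doc):
        # Build a new document rather than mutating the one in the pipeline.
        return {key: value for key, value in doc.items() if key not in excluded}
    return transform

drop_bio = omit(["bio"])
doc = {"email": "zak@example.com", "name": "zak", "bio": "ceo"}
print(drop_bio(doc))
```

Every document flowing from source to sink passes through such a function, which is why the bio field never reaches Elasticsearch.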

Further Reading

ABC’s README is a good place to start if you want to learn more about the application. You can also have a look at the Postgres adaptor docs. Furthermore, you may star the repo on GitHub and watch it to stay tuned for updates.

Original Link

Diving Into Database Performance

Welcome to the first Database Deep Dive of 2018! In our last Database deep dive, we talked about database administration and all the things you need to know about it, such as what it takes to be a DBA, what’s been going on in the world of database administration, and more. This time, we’re going to talk about one of the most popular topics in the Database Zone: database performance. Get cozy and read on to get your dose of database performance knowledge from DZone and beyond.


Database Performance Perfection

Check out some of the top database performance-related articles on DZone to help you perfect the performance of your database. These articles cover a wide range of topics from the world of database performance, so hold on to your DBA hat!

  1. 6 Simple Performance Tips for SQL Statements by Mpumelelo Msimanga. Even though this article was published back in 2014, its sentiments still ring true today. This article provides a checklist to go through when you’re checking the performance of your database.

  2. Database Performance Testing With Apache JMeter by Rathnadevi Manivannan. Learn how to construct your database performance testing plan with all the most important elements, including a thread group, JDBC request, and summary report.

  3. SQL Performance Tuning: Query Optimization 101 by Bikram Sinha. Get some best practices for using SQL queries from a performance perspective, using Oracle as a primary reference.

  4. Why Be Normal? Your Database Performance Depends On It by Scott Stone. In the world of database performance, “normal” is a highly prized goal. It means that nothing extraordinary is happening that requires intervention or diagnosis.

  5. The Key to Distributed Database Performance: Scalability by Darren Perucci. Learn why a key consideration in moving to distributed databases is ensuring that your databases are scalable enough to deliver the technology’s cost and performance benefits in the first place.

PS: Are you interested in contributing to DZone? Check out our Bounty Board, where you can apply for specific writing prompts and win prizes! 

Resourceful Round-Up

We’ve scoured the web for some database performance resources that are sure to pique your interest. 

Dive Deeper Into Database Performance

DZone has Guides and Refcardz on pretty much every tech-related topic, but if you’re specifically interested in database performance, these will appeal the most to you. 

  • The DZone Guide to Databases: Speed, Scale, and Security. Advances in database technology have traditionally been lethargic. That trend has shifted recently with a need to store larger and more dynamic data. This DZone Guide is focused on how to prepare your database to run faster, scale with ease, and effectively secure your data.

  • NoSQL and Data Scalability. While not specifically about database administration, this DZone Refcard covers loads of information imperative to any DBA — such as the range of NoSQL database types available today, whether NoSQL is right for you, and several real-life use cases for using a NoSQL database.

That’s all for this month’s deep dive! What would you like to learn about next time? Let us know in the comments!

Original Link

Citus and pg_partman: Creating a Scalable Time Series Database on Postgres

Years ago, Citus used to have multiple methods for distributing data across many nodes (we actually still support both today): there was both hash-based partitioning and time-based partitioning. Over time, we found big benefits in further enhancing the features around hash-based partitioning, which enabled us to add richer SQL support, transactions, foreign keys, and more. Thus, in recent years, we put less energy into time-based partitioning. But no one stopped asking us about time partitioning, especially for fast data expiration. All that time, we were listening. We just thought it best to align our product with the path of core Postgres as opposed to branching away from it.

Postgres has had some form of time-based partitioning for years — though for many years, it was a bit kludgy and wasn’t part of core Postgres. With Postgres 10 came native time partitioning, and because Citus is an extension to Postgres, anyone using Citus gets to take advantage of time-based partitioning as well. You can now create tables that are distributed across nodes by ID and partitioned by time on disk.

We have found a few Postgres extensions that make partitioning much easier to use. The best in class for improving time partitioning is pg_partman and today, we’ll dig into getting time partitioning set up with your Citus database cluster using pg_partman.

Time-Based Partitioning Enables Fast Data Expiration and Smaller Indexes

One of the most important benefits of time-based partitioning is that it becomes very efficient to drop old data. Citus can efficiently parallelize delete operations and subsequent vacuums and can, therefore, be orders of magnitude faster than Postgres, but it still needs to read all the data it deletes. Dropping a partition is the equivalent of simply removing the files that contain the data, which is a fast operation, independent of the data size.

The other benefit comes when your queries primarily access recent data, but you still want to keep an archive. In that case, partitioning can have several performance benefits over keeping all data in one table. In particular, because each partition is indexed separately, queries only have a small index to traverse and the index is more likely to be in the cache.

Setting Up Sharding by ID and Partitioning by Time

With Postgres 10, you have a new set of commands to help you set up time partitioning. The key to get it started is to specify what column you’ll partition your table on when you set it up. Let’s create a table for data from the GitHub archive.

-- using an explicit schema is required by pg_partman
CREATE SCHEMA github;

CREATE TABLE github.events (
  event_id bigint,
  event_type text,
  event_public boolean,
  repo_id bigint,
  payload jsonb,
  repo jsonb,
  actor jsonb,
  org jsonb,
  created_at timestamp
) PARTITION BY RANGE (created_at);

You’ll notice PARTITION BY RANGE (created_at); here, we’re telling Postgres that we’re going to be partitioning this table by ranges of the created_at column.

Next we’re going to tell Citus to shard our table by repo_id. Each shard will contain a subset of all GitHub repositories.

SELECT create_distributed_table('github.events', 'repo_id');

Now we have set up a distributed partitioned table.

A partitioned table cannot contain data itself; it is more like a view across the partitions. You need to manually create the partitions and specify the time range they cover, after which you can insert data for that time range. For example:

CREATE TABLE github.events_2016 PARTITION OF github.events FOR VALUES FROM ('2016-01-01') TO ('2016-12-31');

We made sure that Citus neatly propagates all partitioning-related commands to the shards of distributed tables. When you create a new partition in a partitioned table, Citus actually creates a new distributed table with its own shards, and each shard will follow the same partitioning hierarchy.

A Better Time Partitioning User Experience: pg_partman

Native partitioning is useful, but using it becomes much more pleasant by leveraging the pg_partman extension. The core functionality of pg_partman works out-of-the-box with Citus 7.2+ when using it for native partitioning.

To install pg_partman, you’ll need to build it, then run CREATE SCHEMA partman; and CREATE EXTENSION pg_partman WITH SCHEMA partman;. Once you’ve installed pg_partman, you’ll have new functions that you can use to help with managing your time partitioning.

Now, we’re going to tell Postgres what interval we want to create our partitions at and tell it to create our initial empty partitions:

-- Partition the table into daily ranges of created_at
SELECT partman.create_parent('github.events', 'created_at', 'native', 'daily');
UPDATE partman.part_config SET infinite_time_partitions = true;

By default, create_parent creates 4 partitions in the past, 4 in the future, and 1 for the present, all based on system time. If you need to backfill older data, you can specify a p_start_partition parameter in the call to create_parent, or p_premake to make partitions for the future. See the pg_partman documentation for details.

You can now see all the partitions using \d+ github.events:

citus=> \d+ github.events
                                  Table "github.events"
    Column    |            Type             | Collation | Nullable | Default | Storage  | Stats target | Description
--------------+-----------------------------+-----------+----------+---------+----------+--------------+-------------
 event_id     | bigint                      |           |          |         | plain    |              |
 event_type   | text                        |           |          |         | extended |              |
 event_public | boolean                     |           |          |         | plain    |              |
 repo_id      | bigint                      |           |          |         | plain    |              |
 payload      | jsonb                       |           |          |         | extended |              |
 repo         | jsonb                       |           |          |         | extended |              |
 actor        | jsonb                       |           |          |         | extended |              |
 org          | jsonb                       |           |          |         | extended |              |
 created_at   | timestamp without time zone |           |          |         | plain    |              |
Partition key: RANGE (created_at)
Partitions: github.events_p2018_01_15 FOR VALUES FROM ('2018-01-15 00:00:00') TO ('2018-01-16 00:00:00'),
            github.events_p2018_01_16 FOR VALUES FROM ('2018-01-16 00:00:00') TO ('2018-01-17 00:00:00'),
            github.events_p2018_01_17 FOR VALUES FROM ('2018-01-17 00:00:00') TO ('2018-01-18 00:00:00'),
            github.events_p2018_01_18 FOR VALUES FROM ('2018-01-18 00:00:00') TO ('2018-01-19 00:00:00'),
            github.events_p2018_01_19 FOR VALUES FROM ('2018-01-19 00:00:00') TO ('2018-01-20 00:00:00'),
            github.events_p2018_01_20 FOR VALUES FROM ('2018-01-20 00:00:00') TO ('2018-01-21 00:00:00'),
            github.events_p2018_01_21 FOR VALUES FROM ('2018-01-21 00:00:00') TO ('2018-01-22 00:00:00'),
            github.events_p2018_01_22 FOR VALUES FROM ('2018-01-22 00:00:00') TO ('2018-01-23 00:00:00'),
            github.events_p2018_01_23 FOR VALUES FROM ('2018-01-23 00:00:00') TO ('2018-01-24 00:00:00')

Automate Your Table Maintenance

As time progresses, pg_partman will need to do some maintenance to create new partitions and drop old ones. All of this can be done through a single function call:

-- note: disabling analyze is recommended for native partitioning due to aggressive locks
SELECT partman.run_maintenance(p_analyze := false);

You will want to set up a periodic job to run the maintenance function. You can do this from your app such that you can easily put alarms on errors, or use pg_cron to schedule it from the database itself:

SELECT cron.schedule('@hourly', $$SELECT partman.run_maintenance(p_analyze := false)$$);

Once periodic maintenance is set up, partitioning becomes very much a zero-touch experience.

Expiring Old Data by Automatically Dropping Partitions

To configure pg_partman to drop old partitions, you can update the partman.part_config table:

UPDATE partman.part_config
SET retention_keep_table = false,
    retention = '1 month'
WHERE parent_table = 'github.events';

Now, whenever maintenance runs, partitions older than a month are automatically dropped.

When Does Time Partitioning Make Sense?

Most applications make some use of time-based data. It could be as simple as the created_at on a record, or it could be event/log data. The default approach of many is to jump to partitioning by time in all cases, but just because you have time series data doesn’t mean time partitioning is the best fit.

Time partitioning makes the most sense when you:

  1. Most frequently query a very small subset of your most recent data.
  2. Actively delete/drop older data.

Having many partitions that are read frequently can cause more overhead than it takes away, but if, for example, you need to keep a year of time series data and only regularly query the last week, using partitioning can make a lot of sense.

Be aware that native partitioning in Postgres is still quite new and has a few quirks. For example, you cannot directly create an index on a partitioned table. Instead, pg_partman lets you create a template table to define indexes for new partitions. Maintenance operations on partitioned tables will also acquire aggressive locks that can briefly stall queries. There is currently a lot of work going on within the Postgres community to resolve these issues, so expect time partitioning in Postgres to only get better.

A Distributed Relational Time Series Database: Postgres With Citus and pg_partman

Postgres with Citus is already a great database for time series data, especially for use cases such as dashboards for real-time analytics and monitoring. You can use high-performance data ingestion, create aggregation tables in parallel, and run advanced SQL queries in parallel across all your data.

Partitioning your distributed time series tables by time with pg_partman provides further optimization to reduce the cost of queries on the most recent data, time-ordered writes, and data expiration. As with any optimization, partitioning should not be applied prematurely, but pg_partman makes it very easy and the combination of scale-out and logical partitioning on top of a relational database is an immensely powerful tool for dealing with large volumes of time series data.

Original Link

Database Sharding Explained in Plain English

Sharding is one of those database topics that most developers have a distant understanding of, but the details aren’t always perfectly clear unless you’ve implemented sharding yourself. In building the Citus database (our extension to Postgres that shards the underlying database), we’ve followed a lot of the same principles you’d follow if you were manually sharding Postgres yourself. The main difference, of course, is that with Citus, we’ve done the heavy lifting to shard Postgres and make it easy to adopt, whereas if you were to shard at the application layer, there’s a good bit of work needed to re-architect your application.

I’ve found myself explaining how sharding works to many people over the past year and realized it would be useful (and maybe even interesting) to break it down in plain English.

Defining Your Partition Key (AKA Shard Key/Distribution Key)

Sharding, at its core, is splitting your data into smaller chunks spread across distinct, separate buckets. A bucket could be a table, a Postgres schema, or a different physical database. Then, as you need to continue scaling, you're able to move your shards to new physical nodes, thus improving performance.

The one step that happens in every sharding implementation is deciding what you’re going to shard or partition your data on. There are a number of trade-offs to various keys, and what is right really depends on your application. Once you determine what your sharding key will be, you’re going to ensure the sharding key is in place throughout your application. You can do this a few ways; an easy one is materializing that sharding key on all your models. By materializing or de-normalizing it, you allow your application to make fewer queries to define how it needs to route the data.
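To make the routing idea concrete, here is a minimal sketch of application-layer routing with a de-normalized shard key. All names, connection strings, and ID ranges here are hypothetical, purely for illustration; the point is that because every model carries `customer_id`, routing never needs to consult another table first:

```python
# Hypothetical application-layer shard routing. The shard key
# (customer_id) is materialized on every model, so a single lookup
# decides which database a request goes to.
SHARD_MAP = [
    (range(1, 101), "shard_db_1"),    # illustrative ranges/names only
    (range(101, 201), "shard_db_2"),
]

def connection_for(customer_id):
    """Pick the database that holds all rows for this customer."""
    for ids, conn in SHARD_MAP:
        if customer_id in ids:
            return conn
    raise LookupError(f"no shard for customer {customer_id}")

def route_query(model_row):
    # Because the shard key is de-normalized onto every model,
    # routing is one dictionary-style lookup, no joins required.
    return connection_for(model_row["customer_id"])

conn = route_query({"id": 7, "customer_id": 150, "body": "hello"})
```

With Citus, this lookup happens inside the database instead of in your application, but the principle is the same.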

When a request comes in, if you’re doing your sharding at the application layer, your application will need to determine how to route the request. The same principle applies to Citus; you just don’t have to do extra work to determine how to route the request. Rather, Citus figures that out for you. If your shard key is on the query itself, Citus sees the shard key and routes the query accordingly.

Bringing this a little more to life, here’s an example query that is sent to Citus and how Citus transforms it:

SELECT * FROM visits WHERE occurred_at >= now() - '1 week'::interval AND customer_id = 2

The above gets transformed with the Citus router executor to:

SELECT * FROM visits_007
WHERE occurred_at >= now() - '1 week'::interval AND customer_id = 2

So, there you have it. You’re done, right? You know all you need to about sharding! …well, not quite yet.

Shard Key != Shard Number

A common misconception about sharding is that when you define what your shard key is, the actual value of the shard key is the value that exists within your metadata tables that determine the routing. Let’s look at a real-world example. Say you’re building a SaaS application. You start in year 1 and get 20 customers; in year 2, you get 100 customers; and in year 3, you get 500 customers. Imagine your SaaS application is a CRM system and you decide to shard by customer because the data for each of your customers needs to be kept separate from others. Because your early-adopter customers have been using your application for a longer time, customers from years 1 and 2 are likely to have significantly more data than the new customers who onboarded in year 3. One option to define how values map to shards would be the following:

shard 1: customer ids 1-100
shard 2: customer ids 101-200
shard 3: customer ids 201-300
shard 4: customer ids 301-400
shard 5: customer ids 401-500
shard 6: customer ids 501- 

Because our older customers will have a chance of accumulating more data since they’ve been a customer for longer, you now have an uneven skew of data. With customers with IDs 1-100 (those with the most data) all living on the same shard, you can see why this may not be ideal.

Sharding by Customer Still Makes Sense… Say What?

Sharding by customer is super common — especially for multi-tenant applications — and has lots of benefits for performance and scale. But the example above highlights how you can end up with a very uneven distribution of data across shards. The solution lies in the implementation of how you shard. As I like to say, it’s just an “implementation detail.” In this case, at least, it’s more true than not.

The solution is simple: you hash each customer_id as it comes in, and you keep a table that maps which hash values are stored in which shards. Let’s say you have eight shards and you start hashing the customer_ids. The hash range of integers within Postgres runs from -2147483648 (-2^31) to 2147483647 (2^31-1). If we divide that hash range evenly, our shards cover the following values:

 shardid | min hash value | max hash value
---------+----------------+----------------
       1 |    -2147483648 |    -1610612737
       2 |    -1610612736 |    -1073741825
       3 |    -1073741824 |     -536870913
       4 |     -536870912 |             -1
       5 |              0 |      536870911
       6 |      536870912 |     1073741823
       7 |     1073741824 |     1610612735
       8 |     1610612736 |     2147483647

As our customer_ids come in, we compute the hash value for each. For customer 1, our hash comes out to -1905060026; for customer 2, it comes out to 1134484726; for customer 3, it comes out to -28094569. So, as we can see, our earlier customers from year 1 already start to get evenly distributed among shards.
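Here is a small sketch of the hash-to-shard mapping. Note that the hash values quoted in the text come from Postgres's own hash function; this sketch uses `zlib.crc32` purely as a stand-in to produce a deterministic signed 32-bit value, so the specific shard assignments will differ from Citus's:

```python
# Map a signed 32-bit hash value to one of 8 equal hash ranges,
# mirroring the shard table above. crc32 is only a stand-in for
# Postgres's real hash function.
import zlib

NUM_SHARDS = 8
HASH_MIN, HASH_MAX = -2**31, 2**31 - 1
RANGE_SIZE = 2**32 // NUM_SHARDS  # 536870912 hash values per shard

def hash_key(customer_id):
    """Hash the shard key into a signed 32-bit integer."""
    unsigned = zlib.crc32(str(customer_id).encode())
    return unsigned - 2**32 if unsigned >= 2**31 else unsigned

def shard_for(customer_id):
    """Return the 1-based shard id whose hash range contains the key."""
    return (hash_key(customer_id) - HASH_MIN) // RANGE_SIZE + 1

# Consecutive customer ids scatter across the hash space instead of
# clustering on the first shard:
placement = {c: shard_for(c) for c in range(1, 6)}
```

Because the hash scatters sequential IDs, your oldest (largest) tenants no longer pile up on the same shard.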

Shards Are Not Nodes

As we mentioned at the very beginning, briefly, shards are some distinct grouping of data. Too often, a shard is assumed to be a physical instance. In practice, there is a lot of leverage to begin with a higher number of shards than underlying instances. In the case of Citus, shards are Postgres tables, and then we run nodes that contain at least one but often many more shards. By placing multiple shards in a single node, you’re able to easily scale by moving a shard between nodes without having to break the data within the shard apart.

Let’s go back to our above example of the eight shards we have. Assuming we have two nodes, they might map something like the following:

 shardid | min hash value | max hash value | nodeid
---------+----------------+----------------+--------
       1 |    -2147483648 |    -1610612737 |      1
       2 |    -1610612736 |    -1073741825 |      2
       3 |    -1073741824 |     -536870913 |      1
       4 |     -536870912 |             -1 |      2
       5 |              0 |      536870911 |      1
       6 |      536870912 |     1073741823 |      2
       7 |     1073741824 |     1610612735 |      1
       8 |     1610612736 |     2147483647 |      2

If we wanted to scale out our setup, we could easily move half of our shards from one node to another without having to change our mapping:

 shardid | min hash value | max hash value | nodeid
---------+----------------+----------------+--------
       1 |    -2147483648 |    -1610612737 |      1
       2 |    -1610612736 |    -1073741825 |      2
       3 |    -1073741824 |     -536870913 |      3
       4 |     -536870912 |             -1 |      4
       5 |              0 |      536870911 |      1
       6 |      536870912 |     1073741823 |      2
       7 |     1073741824 |     1610612735 |      3
       8 |     1610612736 |     2147483647 |      4

Customer ID vs. Entity ID: The Same Approach Applies

Whether you’re sharding by a granular UUID or by something higher in your model hierarchy, like customer ID, the approach of hashing your shard key before you leverage it remains the same. Hashing your partition key and keeping a mapping of how things route is key to a scalable sharding approach. Within Citus, we follow the same standard principles that many others have for years when they hand-rolled their own sharding; you just don’t have to put in all that effort now. Under the covers, we automatically evaluate the hash value as it comes in, and within the catalog tables, quickly determine how to route to the appropriate table or shard without you needing to re-write any queries.

Not Sure If Sharding Is Right for You?

Sharding may not make sense in all cases, but for most, if the data model fits into a clean sharding model, then there can be a lot of gains. You can get performance boosts even on smaller datasets, and more importantly, you can rest easier at night not having to worry about hitting a ceiling as you grow.

Original Link

Building Real-Time Analytics Dashboards With Postgres and Citus

Citus scales out Postgres for a number of different use cases, both as a system of record and as a system of engagement. One use case we’re seeing implemented a lot these days is using the Citus database to power customer-facing real-time analytics dashboards, even when dealing with billions of events per day. Dashboards and pipelines are easy to handle when you’re at 10 GB of data; as you grow, even basic operations like a count of unique users require non-trivial engineering work to get performing well.

Citus is a good fit for these types of event dashboards because of its ability to ingest large amounts of data, perform rollups concurrently, mix raw unrolled-up data with pre-aggregated data, and support a large number of concurrent users. Adding all these capabilities together, the Citus extension to Postgres works well for end users where a data warehouse may not work nearly as well.

We’ve talked some here about various parts of building a real-time customer-facing dashboard, but today, we thought we’d go one step further and give you a guide for doing it end-to-end.

Ingesting Data

The first step is data ingestion. In any POC, we generally start with single-record inserts, since they are the quickest to put in place and let us start ingesting data easily at tens of thousands of events per second. Before production, however, we move to a micro-batching approach.

With the \copy Postgres bulk loading utility, we’re able to ingest millions of events per second, if needed. \copy is fully transactional and the load is distributed across all nodes with Citus, making your Citus coordinator node less of a bottleneck than you would expect.

Most event pipelines we’ve observed have some upstream process already handling events, such as Kafka or Kinesis, which makes batching easy to put in place. As for the batching process itself, you can opt to flush either on a time basis (minutely batches, every five minutes, or even hourly, depending on your requirements) or every X records. Your batches don’t have to be hundreds of thousands of events; even something like every 1,000 records can give you a nice performance boost, as you can have multiple \copy processes running in parallel.
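The batching logic itself is simple. Here is a minimal sketch (names are hypothetical) that accumulates rows and flushes a CSV buffer once a size threshold is hit; in production the `send_copy` callback would issue the actual `COPY events FROM STDIN WITH CSV` over your driver (for example, psycopg2's `copy_expert`):

```python
# Size-based micro-batching for \copy-style ingestion. A timer can also
# call flush() to bound latency for partially filled batches.
import csv
import io

class MicroBatcher:
    def __init__(self, batch_size, send_copy):
        self.batch_size = batch_size
        self.send_copy = send_copy  # callable receiving a CSV text buffer
        self.rows = []

    def add(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.rows:
            return
        buf = io.StringIO()
        csv.writer(buf).writerows(self.rows)
        buf.seek(0)
        self.send_copy(buf)  # production: COPY events FROM STDIN WITH CSV
        self.rows = []

batches = []
b = MicroBatcher(1000, lambda buf: batches.append(buf.read()))
for i in range(2500):
    b.add((i, "2023-01-01 00:00:00", i % 50, "click"))
b.flush()  # flush the 500-row remainder
```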

Structuring Your Raw Events

Your raw event table will vary based on your use case, but there are some commonalities across most. In almost all cases, you have the time of the event and some customer identifier associated with it. Typically, there will be some categorization and details about the event. These details can be broken out as columns or could also be contained within a JSONB datatype. For this example dashboard, we’ll use the following schema:

CREATE TABLE events(
    id bigint,
    timestamp timestamp,
    customer_id bigint,
    event_type varchar,
    country varchar,
    browser varchar,
    device_id bigint,
    session_id bigint
);
SELECT create_distributed_table('events','customer_id');

Rolling Up Your Data

Once you’ve got some raw data coming in, we can now start rolling up data. To do this, we’re going to first create several roll-up tables — some for five-minute intervals, some for hourly, and some for daily.

CREATE TABLE rollup_events_5min (
    customer_id bigint,
    event_type varchar,
    country varchar,
    browser varchar,
    minute timestamp,
    event_count bigint,
    device_distinct_count hll,
    session_distinct_count hll
);
CREATE UNIQUE INDEX rollup_events_5min_unique_idx ON rollup_events_5min(customer_id,event_type,country,browser,minute);
SELECT create_distributed_table('rollup_events_5min','customer_id');

CREATE TABLE rollup_events_1hr (
    customer_id bigint,
    event_type varchar,
    country varchar,
    browser varchar,
    hour timestamp,
    event_count bigint,
    device_distinct_count hll,
    session_distinct_count hll
);
CREATE UNIQUE INDEX rollup_events_1hr_unique_idx ON rollup_events_1hr(customer_id,event_type,country,browser,hour);
SELECT create_distributed_table('rollup_events_1hr','customer_id');

One thing you’ll notice in our roll-up tables is the use of the HLL (HyperLogLog) data type. HyperLogLog is a sketch algorithm that allows you to do operations over unique buckets. HyperLogLog makes it easy to find intersections, unions, etc. across various buckets, making it incredibly useful for the types of reports your dashboard may generate.
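To see why a mergeable sketch like HLL matters for roll-ups, consider what happens when you combine distinct counts across time buckets. Summing per-bucket counts overcounts, because the same device can appear in many buckets; unioning sketches does not. In this sketch, plain Python sets stand in for HLLs (exact but memory-heavy, with the same union/cardinality shape as `hll_union` and `hll_cardinality`):

```python
# Hypothetical per-5-minute buckets of device ids. Sets stand in for
# HLL sketches: both support union and cardinality, but HLL does it in
# a small fixed amount of space at the cost of a little accuracy.
buckets = {
    "12:00": {1, 2, 3},
    "12:05": {2, 3, 4},
    "12:10": {4, 5},
}

# Summing per-bucket distinct counts overcounts shared devices:
naive_sum = sum(len(devices) for devices in buckets.values())

# Unioning the sketches (like hll's "||" operator) then taking the
# cardinality gives the true distinct count across all buckets:
merged = set().union(*buckets.values())
distinct = len(merged)
```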

In the above example, we have chosen (customer_id, event_type, country, browser, minute/hour) as the dimensions on which we evaluate metrics such as event_count, device_distinct_count, session_distinct_count, etc. Based on your query workload and performance requirements, you can choose the dimensions that make sense for you. If needed, you can create multiple roll-up tables to serve different query types (just don’t go too crazy with a table for every dimension).

Ideally, you should choose dimensions for which you get suitable compression (>5-10x) compared to the raw tables. Based on our customers’ experiences, we have, at times, seen orders of magnitude in compression after roll-ups — up to 100x or 1000x.

For our roll-up query, we’re going to do an INSERT INTO... SELECT, which will run across all the nodes in our cluster and parallelize.

Note: Our raw tables and roll-up tables are sharded on the same key. In this case, the sharding key is the customer_id, which is more or less a proxy for customer/tenant ID. Other granular columns can also be chosen as shard keys depending on use case.

To perform our roll-up, we’ll create the function:

CREATE OR REPLACE FUNCTION compute_rollups_every_5min(start_time TIMESTAMP, end_time TIMESTAMP) RETURNS void LANGUAGE PLPGSQL AS $function$
BEGIN
  RAISE NOTICE 'Computing 5min rollups from % to % (excluded)', start_time, end_time;
  RAISE NOTICE 'Aggregating data into 5 min rollup table';
  EXECUTE $$
    INSERT INTO rollup_events_5min
    SELECT customer_id, event_type, country, browser,
           date_trunc('seconds', (timestamp - TIMESTAMP 'epoch') / 300) * 300 + TIMESTAMP 'epoch' AS minute,
           count(*) AS event_count,
           hll_add_agg(hll_hash_bigint(device_id)) AS device_distinct_count,
           hll_add_agg(hll_hash_bigint(session_id)) AS session_distinct_count
    FROM events
    WHERE timestamp >= $1 AND timestamp <= $2
    GROUP BY customer_id, event_type, country, browser, minute
    ON CONFLICT (customer_id, event_type, country, browser, minute) DO UPDATE
    SET event_count = excluded.event_count,
        device_distinct_count = excluded.device_distinct_count,
        session_distinct_count = excluded.session_distinct_count;$$
  USING start_time, end_time;

  RAISE NOTICE 'Aggregating/Upserting into 1 hr rollup table';
  EXECUTE $$
    INSERT INTO rollup_events_1hr
    SELECT customer_id, event_type, country, browser,
           date_trunc('hour', timestamp) AS hour,
           count(*) AS event_count,
           hll_add_agg(hll_hash_bigint(device_id)) AS device_distinct_count,
           hll_add_agg(hll_hash_bigint(session_id)) AS session_distinct_count
    FROM events
    WHERE timestamp >= $1 AND timestamp <= $2
    GROUP BY customer_id, event_type, country, browser, hour
    ON CONFLICT (customer_id, event_type, country, browser, hour) DO UPDATE
    SET event_count = rollup_events_1hr.event_count + excluded.event_count,
        device_distinct_count = rollup_events_1hr.device_distinct_count || excluded.device_distinct_count,
        session_distinct_count = rollup_events_1hr.session_distinct_count || excluded.session_distinct_count;$$
  USING start_time, end_time;
END;
$function$;

For rolling up the last five minutes of data, we can now trigger our function:

SELECT compute_rollups_every_5min((now()-interval '5 minutes')::timestamp, now()::timestamp);

Automating Your Roll-Ups

Scheduling your own background job to run and perform the roll-ups is an option. When using Citus, you can just as easily schedule a job directly in your database since the database is doing all the heavy lifting. With pg_cron, you can schedule your rollup and call the function compute_rollups_every_5min to run every five minutes on top of the raw data.

Querying the Real-Time Events Dashboard

With our system in place now ingesting data and performing roll-ups, we can get to the fun part: querying. Because all of our roll-up tables are significantly smaller than the raw data, queries against them are fast. And all the while, Citus is continually rolling up the raw data and storing it in an efficient data type (HyperLogLog) to let you compose all sorts of reports based on the data.

Let’s look at a few examples of queries you may want to run:

-- Get the total number of events and count of distinct devices in the last 5 minutes
SELECT sum(event_count), hll_cardinality(sum(device_distinct_count))
FROM rollup_events_5min
WHERE minute >= now() - interval '5 minutes' AND minute <= now() AND customer_id = 1;

-- Get the count of distinct sessions over the last week
SELECT sum(event_count), hll_cardinality(sum(session_distinct_count))
FROM rollup_events_1hr
WHERE hour >= date_trunc('day', now()) - interval '7 days' AND hour <= now() AND customer_id = 1;

-- Get the trend of app usage over the last 2 days, broken down by hour
SELECT hour, sum(event_count) event_count,
       hll_cardinality(sum(device_distinct_count)) device_count,
       hll_cardinality(sum(session_distinct_count)) session_count
FROM rollup_events_1hr
WHERE hour >= date_trunc('day', now()) - interval '2 days' AND hour <= now() AND customer_id = 1
GROUP BY hour;

Data is everywhere in this day and age, and key to being successful is being able to derive insights from your data. By making data available in real time, you’re able to create more value for your users. With Citus, you can both have real-time ingestion as well as real-time reporting with high concurrency without having to throw out SQL, which is the lingua franca when it comes to data. 

Original Link

Looking Inside Postgres at a GiST Index

In the last few posts in this series (one, two, three, and four), I showed you how to save hierarchical data into a flat database table using the Postgres LTREE extension. I explained that you represent tree nodes using path strings and how to search your tree data using special SQL operators LTREE provides.

But the real value of LTREE isn’t the operators and functions it gives you — internally, these boil down to fairly simple string operations. Instead, what makes LTREE useful is that it integrates these new operators with Postgres’s indexing code, which allows you to search for and find matching tree paths quickly. To achieve this, LTREE takes advantage of the Generalized Search Tree (GiST) project, an API that allows C developers to extend Postgres’s indexing system.

But what does the GiST API do? And what does it mean to extend Postgres’s indexing system, exactly? Read on to find out!

Searching Without an Index

Here again is the tree table I used as an example in the earlier posts in this series:

create table tree(
    id serial primary key,
    letter char,
    path ltree
);

Note that the path column uses the custom ltree data type that the LTREE extension provides. If you missed the previous posts, ltree columns represent hierarchical data by joining strings together with periods, i.e. A.B.C.D or Europe.Estonia.Tallinn.

Earlier, I used a simple tree with only seven nodes as an example. SQL operations on a small table like this will always be fast. Today, I’d like to imagine a much larger tree to explore the benefits indexes can provide. Suppose that instead I have a tree containing hundreds or thousands of records in the path column:

Now, suppose I search for a single tree node using a select statement:

select letter from tree where path = 'A.B.T.V'

Without an index on this table, Postgres has to resort to a sequence scan, which is a technical way of saying that Postgres has to iterate over all of the records in the table:

For each and every record in the table, Postgres executes a comparison p == q where p is the value of the path column for each record in the table, and q is the query, or the value I’m searching for (A.B.T.V in this example). This loop can be very slow if there are many records. Postgres has to check all of them because they can appear in any order and there’s no way to know how many matches there might be in the data ahead of time.

Searching With a B-Tree Index

Of course, there’s a simple solution to this problem. I just need to create an index on the path column:

create index tree_path_idx on tree (path);

As you probably know, executing a search using an index is much faster. If you see a performance problem with a SQL statement in your application, the first thing you should check for is a missing index. But why? Why does creating an index speed up searches, exactly? The reason is that an index is a sorted copy of the target column’s data:

By sorting the values ahead of time, Postgres can search through them much more quickly. It uses a binary search algorithm:

Postgres starts by checking the value in the middle of the index. If the stored value p is greater than the query q (p > q), it moves up and checks the value at the 25% position; if p < q, it moves down and checks the value at the 75% position. By repeatedly dividing the index into smaller and smaller pieces, Postgres only needs to search a few times before it finds the matching record or records.
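The halving procedure above is ordinary binary search. Here is a small sketch over a sorted list of path strings standing in for index entries (the paths and the search value are made up for illustration):

```python
# Binary search over a sorted list of path strings, counting probes to
# show how few comparisons are needed versus scanning every entry.
def binary_search(sorted_paths, q):
    lo, hi = 0, len(sorted_paths) - 1
    steps = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        steps += 1
        p = sorted_paths[mid]
        if p == q:
            return mid, steps
        if p > q:          # stored value too large: search the lower half
            hi = mid - 1
        else:              # stored value too small: search the upper half
            lo = mid + 1
    return -1, steps

# 64 synthetic paths; a linear scan would need up to 64 comparisons,
# binary search needs at most 7.
paths = sorted(f"A.B.{a}.{b}" for a in "CDEFGHIJ" for b in "KLMNOPQR")
idx, steps = binary_search(paths, "A.B.E.N")
```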

However, for large tables with thousands or millions of rows, Postgres can’t save all of the sorted data values in a single memory segment. Instead, Postgres indexes (and indexes inside of any relational database system) save values in a balanced tree (B-Tree):

Now, my values are saved in a series of different memory segments arranged in a tree structure. Dividing the index up into pieces allows Postgres to manage memory properly while saving possibly millions of values in the index. Note that this isn’t the tree from my LTREE dataset; B-Trees are internal Postgres data structures I don’t have access to. To learn more about the computer science behind this, read my 2014 article Discovering the Computer Science Behind Postgres Indexes.

Now, Postgres uses repeated binary searches — one for each memory segment in the B-Tree — to find a value:

Each value in the parent or root segment is really a pointer to a child segment. Postgres first searches the root segment using a binary search to find the right pointer and then jumps down to the child segment to find the actual matching records using another binary search. The algorithm is recursive; the B-Tree could contain many levels in which case the child segments would contain pointers to grandchild segments, etc.

What’s the Problem With Standard Postgres Indexes?

But there’s a serious problem here I’ve overlooked so far. Postgres’s index search code only supports certain operators. Above, I was searching using this select statement:

select letter from tree where path = 'A.B.T.V'

I was looking for records that were equal to my query: p == q. Using a B-Tree index, I could also have searched for records greater than or less than my query: p < q or p > q.

But what if I want to use the custom LTREE <@ (ancestor) operator? What if I want to execute this select statement?

select letter from tree where path <@ 'A.B.V'

As we saw in the previous posts in this series, this search will return all of the LTREE records that appear somewhere on the branch under A.B.V, that are descendants of the A.B.V tree node.

A standard Postgres index doesn’t work here. To execute this search efficiently using an index, Postgres needs to execute this comparison as it walks the B-Tree: p <@ q. But the standard Postgres index search code doesn’t support p <@ q. Instead, if I execute this search, Postgres resorts to a slow sequence scan again, even if I create an index on the ltree column.

To search tree data efficiently, we need a Postgres index that will perform p <@ q comparisons equally well as p == q and p < q comparisons. We need a GiST index!

The Generalized Search Tree (GiST) Project

Almost 20 years ago, an open-source project at UC Berkeley solved this precise problem. The Generalized Search Tree (GiST) project added an API to Postgres allowing C developers to extend the set of data types that can be used in a Postgres index.

Quoting from the project’s web page:

In the beginning, there was the B-tree. All database search trees since the B-tree have been variations on its theme. Recognizing this, we have developed a new kind of index called a Generalized Search Tree (GiST), which provides the functionality of all these trees in a single package. The GiST is an extensible data structure, which allows users to develop indices over any kind of data, supporting any lookup over that data.

GiST achieves this by adding an API to Postgres’s index system anyone can implement for their specific data type. GiST implements the general indexing and searching code but calls out to custom code at four key moments in the indexing process. Quoting from the project’s web page again, here’s a quick explanation of the four methods in the GiST API:

GiST indexes use a tree structure similar to the B-Tree we saw above. But Postgres doesn’t create the GiST index tree structure by itself; Postgres works with implementations of the GiST Union, Penalty and PickSplit API functions described above. And when you execute a SQL statement that searches for a value in a GiST index, Postgres uses the Consistent function to find the target values.

The key here is the implementor of the GiST API can decide what type of data to index and how to arrange those data values in the GiST tree. Postgres doesn’t care what the data values are or how the tree looks. Postgres simply calls Consistent any time it needs to search for a value and lets the GiST API implementor find the value.

An example would help understand this, and we have an example GiST API implementation: The LTREE extension!

Implementing the GiST API for Tree Paths

Starting in around 2001, two students at Moscow State University found the API from the GiST project and decided to use it to build indexing support for tree data. Oleg Bartunov and Teodor Sigaev, in effect, wrote a “Tree Paths Consistent” function, a “Tree Path Union” function, etc. The C code that implements this API is the LTREE extension. You can find these functions, ltree_consistent and ltree_union, among other functions, in a file called ltree_gist.c, located in the contrib/ltree directory in the Postgres source code. They also implemented the Penalty, PickSplit and various other functions related to the GiST algorithm.

I can use these custom functions on my own data simply by creating a GiST index. Returning to my LTREE example, I’ll drop my B-Tree index and create a GiST index instead:

drop index tree_path_idx;
create index tree_path_idx on tree using gist (path);

Notice the using gist keywords in the create index command. That’s all it takes; Postgres automatically finds, loads, and uses the ltree_union, ltree_picksplit, etc., functions whenever I insert a new value into the table. (It will also insert all existing records into the index immediately.) Of course, earlier, I installed the LTREE extension also.

Let’s see how this works. Suppose I add a few random tree records to my empty tree table after creating the index:

insert into tree (letter, path) values ('A', 'A.B.G.A');
insert into tree (letter, path) values ('E', 'A.B.T.E');
insert into tree (letter, path) values ('M', 'A.B.R.M');
insert into tree (letter, path) values ('F', 'A.B.E.F');
insert into tree (letter, path) values ('P', 'A.B.R.P');

To get things started, Postgres will allocate a new memory segment for the GiST index and insert my five records:

If I search now using the ancestor operator:

select count(*) from tree where path <@ 'A.B.T'

…Postgres will simply iterate over the records in the same order I inserted them and call the ltree_consistent function for each one. Here again is what the GiST API calls for the Consistent function to do:

In this case, Postgres will compare p <@ A.B.T for each of these five records:

Because the values of p, the tree page keys, are simple path strings, ltree_consistent directly compares them with A.B.T and determines immediately whether each value is a descendant tree node of A.B.T. Right now, the GiST index hasn’t provided much value; Postgres has to iterate over all the values, just like a sequence scan.

Now, suppose I start to add more and more records to my table. Postgres can fit up to 136 LTREE records into the root GiST memory segment, and index scans function the same way as a sequence scan by checking all the values.

But if I insert one more record, the 137th record doesn’t fit. At this point, Postgres has to do something different:

Now, Postgres “splits” the memory segment to make room for more values. It creates two new child memory segments and pointers to them from the parent or root segment.

What does Postgres do next? What does it place into each child segment? Postgres leaves this decision to the GiST API, to the LTREE extension, by calling the ltree_picksplit function. Here again is the API spec for PickSplit:

The ltree_picksplit function — the LTREE implementation of the GiST API — sorts the tree paths alphabetically and copies each half into one of the two new child segments. Note that GiST indexes don’t normally sort their contents; however, GiST indexes created specifically by the LTREE extension do because of the way ltree_picksplit works. We’ll see why it sorts the data in a moment.

Now, Postgres has to decide what to leave in the root segment. To do this, it calls the Union GiST API:

In this example, each of the child segments is a set S. And the ltree_union function has to return a “union” value for each child segment that describes somehow what values are present in that segment:

Oleg and Teodor decided this union value should be a pair of left/right values indicating the minimum and maximum tree branches inside of which all of the values fit alphabetically. This is why the ltree_picksplit function sorted the values. For example, because the first child segment contains the sorted values from A.B.C.B through A.B.M.Z, the left/right union becomes A and A.B.M:

Note that A.B.M is sufficient here to form a union value excluding A.B.N.X and all the following values; LTREE doesn’t need to save A.B.M.Z.

Similarly, the left/right union for the second child segment becomes A.B.N/A.B.X:
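The split-and-describe behavior described above can be sketched in a few lines. This is a simplification of what ltree_picksplit and ltree_union actually do: the real ltree_union shortens the bounds to the minimal distinguishing prefixes (A rather than A.B.C.B, as shown above), whereas this sketch just keeps the raw min/max paths:

```python
# Simplified sketch of the LTREE GiST split: sort the overflowing
# page's paths, cut the list in half, and describe each half by its
# (left, right) alphabetical bounds.
def picksplit(paths):
    """Sort paths and split them into two child pages."""
    s = sorted(paths)
    mid = len(s) // 2
    return s[:mid], s[mid:]

def union(page):
    """(left, right) bounds covering every path on the page."""
    return min(page), max(page)

# Hypothetical page contents, echoing the paths used in the article:
paths = ["A.B.C.B", "A.B.M.Z", "A.B.N.X", "A.B.X.A", "A.B.G.A", "A.B.T.E"]
left_page, right_page = picksplit(paths)
unions = [union(left_page), union(right_page)]
```

The sort is what makes the (left, right) pair a faithful description of each child page: every path on a page falls alphabetically between its bounds.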

This is what a GiST index looks like — or, what an LTREE GiST index looks like, specifically. The power of the GiST API is that anyone can use it to create a Postgres index for any type of data. Postgres will always use the same pattern: The parent index page contains a set of union values, each of which somehow describes the contents of each child index page.

For LTREE GiST indexes, Postgres saves left/right value pairs to describe the union of values that appear in each child index segment. For other types of GiST indexes, the union values could be anything. For example, a GiST index could store geographical information like latitude/longitude coordinates, or colors, or any sort of data at all. What’s important is that each union value describe the set of possible values that can appear under a certain branch of the index. And like B-Trees, this union value/child page pattern is recursive: A GiST index could hold millions of values in a tree with many pages saved in a large multi-level tree.

Searching a GiST Index

After creating this GiST index tree, searching for a value is straightforward. Postgres uses the ltree_consistent function. As an example, let’s repeat the same SQL query from above:

select count(*) from tree where path <@ 'A.B.T'

To execute this using the GiST index, Postgres iterates over the union values in the root memory segment and calls the ltree_consistent function for each one:

Now, Postgres passes each union value to ltree_consistent to calculate the p <@ q formula. The code inside of ltree_consistent then returns MAYBE if q > left and q < right. Otherwise, it returns NO.

In this example, you can see ltree_consistent finds that the query value A.B.T, or q, might (MAYBE) be located inside the second child memory segment, but not the first one.

For the first child union structure, ltree_consistent finds q > A true but q < A.B.M false. Therefore, ltree_consistent knows there can be no matches in the top child segment, so it skips down to the second union structure.

For the second child union structure, ltree_consistent finds both q > A.B.N true and q < A.B.X true. Therefore, it returns MAYBE, meaning the search continues in the lower child segment:

Note that Postgres never had to search the first child segment: The tree structure limits the comparisons necessary to only the values that might match p <@ A.B.T.

Imagine my table contained a million rows. Searches using the GiST index will still be fast because the GiST tree limits the scope of the search. Instead of executing p <@ q on every one of the million rows, Postgres only needs to run p <@ q a handful of times, on a few union records and on the child segments of the tree that contain values that might match.
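The descent logic described above can be sketched in Python. Again this is only a sketch of the idea, not ltree_consistent itself: for these single-letter labels plain string comparison matches the ordering, while real ltree compares paths label by label.

```python
# Sketch of an ltree_consistent-style check against a union pair.
def consistent(q, left, right):
    # MAYBE when the query path falls between the segment's bounds;
    # NO otherwise, so that whole segment can be skipped
    return left <= q <= right

# Union pairs for the two child segments from the example above
unions = [("A", "A.B.M"), ("A.B.N", "A.B.X")]
matches = [consistent("A.B.T", left, right) for left, right in unions]
```

With the query A.B.T, only the second segment returns MAYBE, so the first segment is never searched.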

Send Them a Postcard

Oleg Bartunov and Teodor Sigaev, the authors of the LTREE extension, explain its usage and the algorithms I detailed above on their web page. They include more examples of SQL searches on tree data, including some that use the LTREE[] data type I didn’t have time to cover in these blog posts.

But most importantly, they included this note at the bottom:

Do you save tree data in Postgres? Does your app take advantage of the LTREE extension? If so, you should send Oleg and Teodor a postcard! I just did.

Original Link

Saving a Tree in Postgres Using LTREE

In one of my last posts, I showed you how to install and enable a Postgres extension called LTREE. LTREE allows me to save, query on, and manipulate trees or hierarchical data structures using a relational database table. As we’ll see, using LTREE I can count leaves, cut off branches, and climb up and down trees easily — all using SQL right inside my application’s existing Postgres database!

But trees are natural, haphazard, branching structures with countless leaves, while database tables are man-made rectangles full of numbers and text. How can I possibly save a beautiful tree structure into an ugly, boring database table?

Path Enumeration

Let’s return to the example tree from the first post in this series:

The LTREE extension uses the path enumeration algorithm, which calls for each node in the tree to record the path from the root you would have to follow to reach that node.

For example, to find G starting from A, the root, I would move down to B, and then down again to G:

So the path to G is:

A.B.G

Here’s another example:

This time I’ve traced a path from A to D, via C. So the path of D is:

A.C.D

Saving Tree Paths Using LTREE

To use LTREE, I need to create a column to hold these paths. For my example tree, I’ll use the same table I did before, but instead of the parent_id column, I’ll use a path column:

create table tree(
    id serial primary key,
    letter char,
    path ltree
);
create index tree_path_idx on tree using gist (path);

I chose the name path; I could have picked any name here. However, notice the path column uses a Postgres data type called ltree — the LTREE extension provides this special new type. And also notice I created a special gist index on the path column; more on this later!

Next, I save the path of each tree node into the path column, encoded as a series of strings joined together by periods. For example, to save the path of G into my table, I use this insert statement:

insert into tree (letter, path) values ('G', 'A.B.G');

And to save the path to node D, I write:

insert into tree (letter, path) values ('D', 'A.C.D');

Following this pattern, I can save my entire tree using these insert statements — one for each node in my tree:

insert into tree (letter, path) values ('A', 'A');
insert into tree (letter, path) values ('B', 'A.B');
insert into tree (letter, path) values ('C', 'A.C');
insert into tree (letter, path) values ('D', 'A.C.D');
insert into tree (letter, path) values ('E', 'A.C.E');
insert into tree (letter, path) values ('F', 'A.C.F');
insert into tree (letter, path) values ('G', 'A.B.G');

The root node, A, contains the simplest path A. Its two child nodes, B and C, use paths A.B and A.C; the child nodes under C use paths A.C.D, A.C.E, etc. You get the idea.
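The path strings above follow mechanically from the parent pointers. A small Python sketch (a hypothetical helper, using the same seven-node tree) shows how each path is just the parent's path plus the node's own label:

```python
# Deriving path-enumeration strings from parent pointers
# for the example tree used throughout this post.
parents = {"A": None, "B": "A", "C": "A", "D": "C", "E": "C", "F": "C", "G": "B"}

def path(letter):
    parent = parents[letter]
    # The root's path is just its own label; every other node
    # appends its label to its parent's path with a period.
    return letter if parent is None else path(parent) + "." + letter
```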

The Ancestor Operator: @>

Now for the fun part: LTREE provides a series of new SQL operators that allow me to query and manipulate tree data structures. The most powerful of these is @>, the “ancestor” operator. It tests whether one path is an ancestor of another.

Returning to my question from the first post in this series, what if I needed to know how many children A had, recursively? That is, what if I needed to count its children, grandchildren, great-grandchildren, etc.? Earlier we saw that using a parent_id column this would require an ever-increasing number of SQL statements:

select count(*) from tree where parent_id = ID;
select count(*) from tree where parent_id in (CHILD IDs);
select count(*) from tree where parent_id in (GRANDCHILD IDs);
select count(*) from tree where parent_id in (GREAT-GRANDCHILD IDs);
select count(*) from tree where parent_id in (GREAT_GREAT-GRANDCHILD IDs); etc.

@> solves this problem for us. I can now recursively count the total number of nodes under any given parent like this:

select count(*) from tree where PARENT-PATH @> path;

In my example, this SQL would return the number of nodes, recursively, under the root node A:

select count(*) from tree where 'A' @> path;
 count
-------
     7
(1 row)

LTREE counts the parent node itself, so the total count is 7, not 6. That is, A @> A evaluates to true. Another example; this returns the count of tree nodes under and including C:

select count(*) from tree where 'A.C' @> path;
 count
-------
     4
(1 row)

Or I could have written these predicates in the opposite order using <@:

select count(*) from tree where path <@ 'A';
select count(*) from tree where path <@ 'A.C';

As you can see, the <@ and @> operators appear to treat the path column, the column I defined with the ltree data type, as simple strings. But there’s some magic going on here: the path values are not simple strings. Although I typed them in as strings, <@ and @> efficiently determine whether or not one path is an ancestor of another.

And the @> ancestor operator is just one way of using ltree columns; the LTREE extension provides a long list of powerful operators and functions! For a complete list, see here.

In my next post, I’ll explore more of these functions and show you how to perform some tree operations that I’ve found useful.

Maybe You’re Not Impressed

However, thinking about the path strings for a moment, it’s fairly obvious whether one path is an ancestor of another. For example, it’s clear that A and A.C are ancestors of A.C.D, while A.B is not. In fact, it looks like all the @> operator does is check whether the string on the left (the ancestor) is a prefix, or leading substring, of the string on the right (the descendant).
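This prefix intuition fits in a few lines of Python. This is a sketch of the @> semantics, not ltree's actual implementation; note the period guard, which keeps a label like AB from being treated as a descendant of A:

```python
# Label-aware prefix test sketching '@>' ancestor semantics.
def is_ancestor(ancestor, descendant):
    # A path is its own ancestor; otherwise the ancestor must be
    # a whole-label prefix, hence the trailing "." in the check.
    return descendant == ancestor or descendant.startswith(ancestor + ".")
```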

In fact, you might not be very impressed by LTREE, so far. The @> operator seems like a fancy way of performing a simple string operation. I could have written SQL code to determine that A is an ancestor of A.C.D myself. I probably would have used one of Postgres’s many string functions to achieve this, maybe something like this:

select count(*) from tree where strpos(path::varchar, 'A') = 1

Postgres would calculate the answer for my seven-node example tree very quickly. But to calculate this count, internally Postgres would have to iterate over all the records in my table (this is called a full table scan or sequence scan in DB jargon) and calculate the strpos function on each row. If my tree had thousands or millions of rows, then this SQL statement would take a long time to finish.

Enabling the Real Magic: Using a GiST Index With LTREE

The power of the @> operator is that it allows Postgres to search efficiently across an entire tree using an index. Saying this in a more technical way: The @> operator integrates with Postgres’s GiST index API to find and match descendant nodes. To take advantage of this technology, be sure to create a GiST index on your ltree column, for example like this:

create index tree_path_idx on tree using gist (path);

What is a “GiST” index? How does it help LTREE find and count tree nodes efficiently? Read the last post in this series to find out. There I describe the Generalized Search Index (GiST) project, explore the computer science behind GiST and look at how LTREE uses GiST to make fast tree operations inside of Postgres possible.

What’s Next?

But before we dive into LTREE’s internal implementation, first we should see what else LTREE can do. So far, I’ve shown you how to count descendant tree nodes. Tomorrow, in my next post, Manipulating Trees Using SQL and the Postgres LTREE Extension, I’ll show you how to use other LTREE’s operators and functions to work with tree data.

Original Link

Manipulating Trees Using SQL and the Postgres LTREE Extension

Previously, I used the LTREE extension to save a tree data structure in a Postgres table. After saving the tree, I used the @>, or ancestor operator, to count the number of descendant nodes on a given branch.

But that’s not all LTREE can do. Today, I’ll show you how to delete, move, and copy branches from one place to another in your tree using @> in combination with other LTREE functions. After that, in my last post in this series, I’ll look at how LTREE works under the hood and explore the computer science that makes all of this possible.

My Example Tree Again

Here’s the tree I’ve been working with during the last few blog posts:

In my last post, I saved this tree in my database using a series of insert statements:

insert into tree (letter, path) values ('A', 'A');
insert into tree (letter, path) values ('B', 'A.B');
insert into tree (letter, path) values ('C', 'A.C');
insert into tree (letter, path) values ('D', 'A.C.D');
insert into tree (letter, path) values ('E', 'A.C.E');
insert into tree (letter, path) values ('F', 'A.C.F');
insert into tree (letter, path) values ('G', 'A.B.G');

And we saw how easy it is to count the number of tree nodes in a given branch using the @> operator:

select count(*) from tree where 'A.C' @> path;

Cutting Off a Branch

But suppose I wanted to remove these nodes from the tree entirely; that is, suppose I wanted to “cut off this branch” of the tree, so to speak:

How can I do this? Simple! I just use a SQL delete statement:

delete from tree where 'A.C' @> path;

As you can see, I can use @> equally well in delete statements as in select statements.

Replanting a Branch as a New Tree

Now suppose I want to keep this branch and save it as a separate tree in my table. That is, I want two trees: the original A tree and a new tree consisting of the C branch “replanted” as a new root:

Thinking about this for a moment, moving some nodes from one location to another in my tree means I’ll need to update their path values somehow in my table; that is, I’ll need to use an update statement and not a select or delete statement. But how? Writing an update statement is easy enough, but how do I know what the new path of each tree node will be? Let’s take C as an example. Because C will become the root node of my new tree, I want to change its path from A.C to just C:

update tree set path = 'C' where path = 'A.C';

And I will want to update D, one of C’s children, in a similar way:

update tree set path = 'C.D' where path = 'A.C.D';

I could write a separate update statement for each node — just four SQL statements for my example. But imagine I had hundreds or thousands of nodes in my tree. Updating the records one SQL statement at a time would require repeated network connections from my application to Postgres, slowing down the overall operation tremendously.

Instead, I need to update the path of C and each of its descendants all in a single operation. But how can I do this? Two LTREE functions, NLEVEL() and SUBPATH(), can help.

The NLEVEL Function

First, NLEVEL. As you might guess, this returns the number of levels in a given path string:

select letter, path, nlevel(path) from tree;

 letter | path  | nlevel
--------+-------+--------
 A      | A     |      1
 B      | A.B   |      2
 C      | A.C   |      2
 D      | A.C.D |      3
 E      | A.C.E |      3
 F      | A.C.F |      3
 G      | A.B.G |      3
(7 rows)

Looking at this, it’s easy to understand what the function returns: For a root node like A, NLEVEL returns 1. For A’s child nodes, A.B and A.C, NLEVEL returns 2. For the grandchild nodes, it returns 3. It simply counts the number of levels in each path string; internally, it parses the path string for period characters.

Before we continue, consider one subtle but important point. Notice that I was able to calculate NLEVEL on all of the records in the tree table with a single SQL statement! Postgres applied the function to all of the matching paths for me. The power of LTREE’s functions is that they seamlessly integrate with SQL, harnessing and extending the power of Postgres.
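The semantics of NLEVEL are easy to mimic in Python. A minimal sketch (real ltree paths have their own parser, but for these simple paths counting labels suffices):

```python
# nlevel semantics: count the dot-separated labels in a path string
def nlevel(path):
    return len(path.split("."))
```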

The SUBPATH Function

LTREE provides another new SQL function that will also help us write a general tree path formula: SUBPATH. As you might guess, this returns a selected substring from a given path. Let’s try running it on my example tree:

select letter, subpath(path, 1) from tree;
ERROR: invalid positions
STATEMENT: select letter, subpath(path, 1) from tree;

Oops — I’ve done something wrong here. Calling subpath(path, 1) returns the portion of the path starting with offset 1. Not a character offset, but a path offset. So subpath(path, 1) drops the first level of the path, A in my tree, and returns the remaining portion of each path starting from the second path element. Internally, LTREE parses the periods for me, drops the requested number of path levels, and removes the extra leading period.

In the statement above, the error was caused by the root node in the tree: A. This path has only one level, and so LTREE returns an error in this case.
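SUBPATH's two-argument behavior, including the error on out-of-range offsets, can be sketched the same way (again a Python sketch of the semantics, not the C implementation):

```python
# subpath(path, offset) semantics: drop the first `offset` labels
def subpath(path, offset):
    labels = path.split(".")
    if not 0 <= offset < len(labels):
        # mirrors Postgres's "invalid positions" error for a
        # one-level path like 'A' with offset 1
        raise ValueError("invalid positions")
    return ".".join(labels[offset:])
```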

Let’s try using SUBPATH only on the C branch — the branch we want to move:

select letter, subpath(path, 1) from tree where path <@ 'A.C';
 letter | subpath
--------+---------
 C      | C
 D      | C.D
 E      | C.E
 F      | C.F
(4 rows)

Now, I get only four records in the result: one for C and one for each node that appears under C. And you can see that the subpath column contains the portion of the path that appears after A, for each of these four records.

And again, notice that I was able to execute the SUBPATH function on all four tree records that I wanted to — in a single operation. This time, the SUBPATH function worked in concert with the <@ operator. LTREE has made the SQL language I already know how to use even more powerful.

Moving Tree Nodes Using One UPDATE Statement

Now, let’s return to the question of moving a branch into a new tree. As this diagram shows, I want to delete C and its children from the A tree and move them to a new location:

Earlier, I considered moving the nodes using a single update statement for each:

update tree set path = 'C' where path = 'A.C';
update tree set path = 'C.D' where path = 'A.C.D';
update tree set path = 'C.E' where path = 'A.C.E';
update tree set path = 'C.F' where path = 'A.C.F';

Now that I know about SUBPATH, it’s easy to write a single SQL update statement that will move all four nodes in the C branch in one operation:

update tree set path = subpath(path, 1) where path <@ 'A.C';

I use where path <@ 'A.C' to scope the update to the C branch and I use subpath(path, 1) to remove the A root element from the path of C and each of its descendants.

I can generalize this a bit more using the NLEVEL function, also:

update tree set path = subpath(path, nlevel('A.C')-1) where path <@ 'A.C';

This follows because nlevel('A.C') = 2, and therefore nlevel('A.C')-1 = 1, the same offset we used above. Replacing A.C with BRANCH_PATH, I arrive at a general formula for “replanting” a branch as a new tree using a single SQL statement:

update tree set path = subpath(path, nlevel(BRANCH_PATH)-1) where path <@ BRANCH_PATH

…assuming nlevel(BRANCH_PATH) > 1, that is assuming the branch we want to replant isn’t already a root.
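The replant formula can be traced through in Python to check the arithmetic. A sketch under the same assumptions (whole-label prefix matching, as in the earlier examples):

```python
# Sketch of the replant formula: subpath(path, nlevel(branch)-1)
# applied to every node where path <@ branch.
def replant(paths, branch):
    drop = len(branch.split(".")) - 1   # nlevel(branch) - 1
    return [".".join(p.split(".")[drop:])
            for p in paths
            if p == branch or p.startswith(branch + ".")]
```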

The || Concatenation Operator

This seems somewhat useful, but what if I want to move a branch from one location in my tree to some other location, not necessary to the root? This is a more general problem. For example, suppose I want to move the C branch under G, like this:

To write a formula for this transformation using SQL, we need to use one more important LTREE operator: the ||  or concatenation operator. Let’s try it out with an example first:

select 'A.B.G' || path as concatenated from tree;
 concatenated
--------------
 A.B.G.A
 A.B.G.A.B
 A.B.G.A.C
 A.B.G.A.C.D
 A.B.G.A.C.E
 A.B.G.A.C.F
 A.B.G.A.B.G
(7 rows)

You can see LTREE has automatically added A.B.G, along with a period separator to each path in my table. And it has done this for all the paths in my table in a single operation.

Moving a Branch

Now, using ||, I can write a single SQL statement to move a tree branch from one location to another. First, of course, I need to scope the SQL operation to the target branch using the ancestor operator:

select 'A.B.G' || path as concatenated from tree where path <@ 'A.C';
 concatenated
---------------
 A.B.G.A.C
 A.B.G.A.C.D
 A.B.G.A.C.E
 A.B.G.A.C.F
(4 rows)

I get the same results as above, but now only for the tree nodes that I want to move.

But my next problem is that the new paths above start with A.B.G.A.C…. Instead, I want them to be A.B.G.C…. I need to remove that extra A from the new paths using the SUBPATH function:

select 'A.B.G' || subpath(path, 1) as concatenated from tree where path <@ 'A.C';
 concatenated
--------------
 A.B.G.C
 A.B.G.C.D
 A.B.G.C.E
 A.B.G.C.F
(4 rows)

And finally, converting this into an update statement:

update tree set path = 'A.B.G' || subpath(path, 1) where path <@ 'A.C'

…I have the single SQL statement I need!

And generalizing this, we arrive at a SQL formula you could use in your own Postgres database:

update tree set path = DESTINATION_PATH || subpath(path, nlevel(SOURCE_PATH)-1)
where path <@ SOURCE_PATH;
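Like the replant formula, this general move formula can be traced in Python (a sketch, again using whole-label prefix matching for the <@ scope):

```python
# Sketch of the move formula:
# DESTINATION_PATH || subpath(path, nlevel(SOURCE_PATH)-1)
def move(paths, source, dest):
    drop = len(source.split(".")) - 1   # nlevel(source) - 1
    return [dest + "." + ".".join(p.split(".")[drop:])
            for p in paths
            if p == source or p.startswith(source + ".")]
```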

Copying a Branch

One last puzzle: How can I copy a tree branch instead of moving it? I just use an insert SQL statement instead of update. Simple, right?

But how, exactly? I need to insert multiple rows — one record for each node in the branch I copy. Again, I could write a series of insert statements like this:

insert into tree (letter, path) values ('C', 'A.B.G.C');
insert into tree (letter, path) values ('D', 'A.B.G.C.D');
insert into tree (letter, path) values ('E', 'A.B.G.C.E');
insert into tree (letter, path) values ('F', 'A.B.G.C.F');

But using LTREE functions and operators, I can achieve this using a single SQL statement! I just have to write SQL that will insert the result of a select, like this:

insert into tree (letter, path) (
    select letter, 'A.B.G' || subpath(path, 1) from tree where 'A.C' @> path
);

Executing this, Postgres will first find all the nodes inside the branch I want to copy and recalculate their paths. Then, it will insert that result set into the tree as a copy, leaving my original branch unchanged!

By writing this tree-related logic using LTREE operators in SQL, I ask Postgres to do all of the hard work of manipulating and copying the path strings for me. I don’t have to write application code to keep track of these strings, and no data needs to be transmitted back and forth between my application server and the database server.

What’s Next? LTREE Internals

In my last post about LTREE, I’ll look closely at how it works internally. It’s easy enough to imagine how simple functions and operators like NLEVEL, ||, or SUBPATH work. That’s not the interesting part for me; these are shorthand for fairly simple string operations.

The special sauce that makes LTREE such a powerful tool is that it integrates with Postgres GiST indexes. By using an index, Postgres can execute any of the SQL expressions I wrote above equally fast on 7,000 records as it would on 7! How? The only way to find out is by looking inside Postgres at a GiST index.

Original Link

Installing the Postgres LTREE Extension

Hidden inside of your Postgres server is code that provides special SQL operators and functions designed to support tree operations. It’s called the LTREE extension. I’m guessing this stands for label tree. In my next post, I’ll write about some of these functions and operators: what they do and how to use them.

But first, where is the LTREE extension? How can you install and start using it? Read on to find out.

Testing Whether the LTREE Extension Is Installed

Depending on where you downloaded Postgres from and how you installed it, you may have already installed LTREE with Postgres. To find out, execute this SQL statement:

=> create extension ltree;
CREATE EXTENSION

If you see the CREATE EXTENSION message like this, then you’re all set! LTREE was already installed and you just enabled it. Skip to my next post when it’s published to find out what it can do and how to use it.

But if you see:

=> create extension ltree;
ERROR: extension "ltree" already exists

…then your Postgres server already had LTREE enabled.

FYI: The pg_available_extensions table will show you all the Postgres extensions that are available and installed on your server:

select * from pg_available_extensions;

  name   | default_version | installed_version |                     comment
---------+-----------------+-------------------+-------------------------------------------------
 ltree   | 1.1             | 1.1               | data type for hierarchical tree-like structures
 plpgsql | 1.0             | 1.0               | PL/pgSQL procedural language
(2 rows)

As you can see, ltree already appears on my server’s list. The value 1.1 for installed_version indicates that I’ve already enabled it, too. This would have been blank before running the create extension ltree command above.

I originally installed a local copy of Postgres on my Mac using Homebrew, and I was happy to find that the Homebrew Postgres formula does include steps to build and install LTREE, after building the rest of the Postgres server. But I still needed to enable it using create extension.

Using LTREE on a Shared Postgres Server

Running the create extension ltree command may fail with this error message:

=> create extension ltree;
ERROR: permission denied to create extension "ltree"
HINT: Must be superuser to create this extension.

Enabling Postgres extensions requires super-user access. If you’re using a shared Postgres server and don’t have super-user access, you’ll need to find someone who does. Or you may just need to log in to Postgres using the proper Postgres user account.

How to Install the LTREE Extension

Running the create extension command may also fail with this error message:

=> create extension ltree;
ERROR: could not open extension control file "/usr/local/pgsql/share/extension/ltree.control": No such file or directory

This error means the LTREE code isn’t even installed on your Postgres server. If you’re running on Linux and installed Postgres using a package manager, you may have to install a second package called postgresql-contrib.

If you installed Postgres from source yourself, then you will see this error message because the Postgres Makefile doesn’t compile and install LTREE by default. Don’t worry! It turns out the Postgres source tree already contains the code for LTREE and many other extensions in a subdirectory called contrib.

Compile it as follows:

$ cd /path/to/postgres-9.6.5/contrib/ltree
$ make
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -Wno-unused-command-line-argument -O2 -DLOWER_NODE -I. -I. -I../../src/include -c -o ltree_io.o ltree_io.c
etc…
$ sudo make install
/bin/sh ../../config/install-sh -c -d '/usr/local/pgsql/lib'
/bin/sh ../../config/install-sh -c -d '/usr/local/pgsql/share/extension'
/bin/sh ../../config/install-sh -c -d '/usr/local/pgsql/share/extension'
/usr/bin/install -c -m 755 '/usr/local/pgsql/lib/'
/usr/bin/install -c -m 644 ./ltree.control '/usr/local/pgsql/share/extension/'
/usr/bin/install -c -m 644 ./ltree--1.1.sql ./ltree--1.0--1.1.sql ./ltree--unpackaged--1.0.sql '/usr/local/pgsql/share/extension/'

You can see above that the install step copied the library into my Postgres server’s lib directory, /usr/local/pgsql/lib, and ran a couple of other commands as well. Now I can run the create extension ltree command as shown above. I don’t even need to restart Postgres; it will find and load the extension automatically.

Now that you have LTREE installed and enabled, you can read my next post to learn how to use it.

Original Link

Trying to Represent a Tree Structure Using Postgres

Suppose you had a hierarchical data structure in your application. How would you save it in a database? How would you represent a complex tree structure using flat rows and columns?

There are a few different, equally valid, options. In this series of blog posts, I’ll take a close look at one option that Postgres provides: the LTREE extension. If you install and enable LTREE on your Postgres server, it will add powerful SQL operators and functions that support tree operations.

But what are these new SQL operators and how do you use them? And how does LTREE actually work? What computer science does it use behind the scenes to enable fast tree operations?

I’ll be publishing a series of blog posts on the Postgres LTREE extension. I’ll get started today by trying to insert a tree structure into a Postgres table using standard SQL, and during the rest of the series, I’ll take a close look at LTREE: how to install it, how to use it, and how it works.

An Example Tree

My actual dataset was more complicated, of course, but for the sake of example, let’s suppose I needed to save this tree in a Postgres table:

There are many different tree-like data structures in computer science, but this is probably the simplest: no cycles, no ordering of child nodes, and all the child nodes are accessible moving down from a single root. Should be easy, right?

At first, I thought it would be. I started by creating a table with a parent_id foreign key column, like this:

create table tree(
    id serial primary key,
    letter char,
    parent_id integer references tree (id)
);

The idea was that each row in my table represented a single node or element of the tree, and would identify its parent using the parent_id column. My single root node, A, had no parent, so I saved it first with a NULL parent ID:

insert into tree (letter, parent_id) values ('A', null);

select * from tree;
 id | letter | parent_id
----+--------+-----------
  1 | A      |
(1 row)

And then I inserted each of its child nodes like this:

insert into tree (letter, parent_id) values ('B', 1);
insert into tree (letter, parent_id) values ('C', 1);

select * from tree;
 id | letter | parent_id
----+--------+-----------
  1 | A      |
  2 | B      |         1
  3 | C      |         1
(3 rows)

Because A has id=1, I set parent_id=1 for B and C. This is a simple example of the adjacency list algorithm: each row contains a list of its neighbors or adjacent rows. In this case, I was only recording each row’s parent. And the table is self-referencing because it contains a foreign key (parent_id) referencing another column in the same table.

I continued to fill out my tree structure with a few more insert statements:

insert into tree (letter, parent_id) values ('D', 3);
insert into tree (letter, parent_id) values ('E', 3);
insert into tree (letter, parent_id) values ('F', 3);
insert into tree (letter, parent_id) values ('G', 2);

select * from tree;
 id | letter | parent_id
----+--------+-----------
  1 | A      |
  2 | B      |         1
  3 | C      |         1
  4 | D      |         3
  5 | E      |         3
  6 | F      |         3
  7 | G      |         2
(7 rows)

Did My Postgres Tree Work?

At first glance, my data structure worked well. I could easily find the parent of E:

select parent_id from tree where letter = 'E';
 parent_id
-----------
         3
(1 row)

select letter from tree where id = 3;
 letter
--------
 C
(1 row)

And the children of C like this:

select letter from tree where parent_id = 3;
 letter
--------
 D
 E
 F
(3 rows)

Recursive Tree Operations

And it was also very easy to count how many children each node had; for example, this SQL statement returns the number of children under A:

select count(*) from tree where parent_id = 1;
 count
-------
     2
(1 row)

But what if I needed to know how many children A had, recursively? That is, what if I needed to count its children, grandchildren, great-grandchildren, etc.?

Well, first I would have to find the direct children of A:

select id from tree where parent_id = 1;
 id
----
  2
  3
(2 rows)

Then, to find the grandchildren, I would need to query for the children of the children, inserting the ID values from the previous statement:

select id from tree where parent_id in (2, 3);
 id
----
  4
  5
  6
  7
(4 rows)

And then I would add the child count with the grandchild count: 2 + 4 = 6.

My example tree ends here, so I’m done. But this doesn’t scale; suppose my tree had 10, 20, or 100 levels in it. I would have to execute repeated select statements, stepping down each level of the tree structure under the parent node:

select count(*) from tree where parent_id in (GREAT-GRANDCHILD-IDS);
select count(*) from tree where parent_id in (GREAT-GREAT-GRANDCHILD-IDS);
select count(*) from tree where parent_id in (GREAT-GREAT-GREAT-GRANDCHILD-IDS);

And so on.

In other words, I need to execute n-1 SQL statements, where n is the number of levels in the tree under the parent node, each time inserting all of the IDs returned by the previous query. And to find the total count, I would have to sum the number of IDs returned by each query along the way. Certainly not an efficient algorithm!
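The level-by-level procedure described above can be sketched in Python over an adjacency list (a sketch using the example tree's ids, mirroring the repeated "where parent_id in (...)" queries):

```python
# Level-by-level descendant counting over an adjacency list,
# one "query" per tree level, as in the SQL statements above.
children = {1: [2, 3], 2: [7], 3: [4, 5, 6]}

def count_descendants(root):
    total, frontier = 0, [root]
    while frontier:
        # each pass corresponds to one "where parent_id in (...)" query
        frontier = [c for parent in frontier for c in children.get(parent, [])]
        total += len(frontier)
    return total
```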

There Must Be a Better Way

My parent_id foreign key worked well for very simple tree operations, but not for more complex tasks, such as recursively counting nodes. If I set up my database schema differently, in a more thoughtful and complex way, can I avoid the repeated SQL calls?

Yes! There are a variety of options. One common solution is to use a nested set approach. In this design, each row contains a description of the set of other nodes that appear under it in the tree by saving the maximum and minimum id values of that subset of rows, the “nested set.” Using this scheme, querying children and grandchildren recursively becomes very easy. The drawback is that I would have to recalculate these values up and down the tree each time a new row was added.

Another solution, path enumeration, involves using a column to save the path or position of each node in the tree. This can be a powerful solution, but recursive queries and other tree operations require special support to parse and manipulate these paths.

A completely different approach would be to use a graph-oriented database such as Neo4j. These are database servers designed entirely around this problem: saving hierarchical data, or more generally networks of related data values.

But I didn’t want to leave Postgres behind; I already had a working, well-tested application. Why start over from scratch just because I added a single tree structure to my app? Why add new infrastructure and complexity to my overall system to support a single new data structure?

It turns out I didn’t have to. Postgres itself supports one of the two tree algorithms I mentioned above: path enumeration. Bundled inside of the Postgres source tree is an “extension” — an optional piece of C code you need to compile, install, and enable — that supports tree SQL operations using path enumeration. In my next post, I’ll show you how to install and use the LTREE Postgres extension.

Original Link

Store Semantic Web Triples in Cassandra

The semantic web is the next level of web searching, where data is more important and should be well-defined. The semantic web is needed for making web searches more intelligent and better attuned to user requirements. You can find some interesting points on the semantic web here.

Triples are an atomic entity in RDF. They’re composed of subject-predicate-object and used for linking the subject and object with the help of the predicate. You can find some interesting points on triples here.

RDF stands for Resource Description Framework. It is a framework for representing all information about a resource in a graph. An RDF store is used for storing triples, which are queried with SPARQL. Under the hood, Quetzal creates a set of tables and, on the basis of those tables, converts SPARQL queries into normal SQL queries.

Now, some questions arise, and one of them is: what are we doing here?

We are trying to store triples in Cassandra, much as Quetzal stores them in Postgres after creating the tables. Quetzal creates lots of tables for storing triples based on certain conditions. Tables created by Quetzal include:

  1. kb
  2. kb_basestats
  3. kb_direct_preds
  4. kb_dph
  5. kb_ds
  6. kb_dt
  7. kb_lstr
  8. kb_reverse_preds
  9. kb_rph
  10. kb_rs
  11. kb_topkstats

For more information about these tables, you can visit here.

We are trying to implement kb_direct_preds and kb_dph tables on Cassandra and fetch the object with the help of subject and predicate.

Table kb_direct_preds is used for storing the location of the predicate in the kb_dph table while kb_dph stores every value of the triples along with subject, predicate, and object.
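As a rough mental model (a deliberate simplification of Quetzal's actual schema, which hashes predicates into numbered columns), the two tables can be sketched with Python dictionaries — kb_direct_preds tells you which column to read, and kb_dph holds one row per subject:

```python
# Toy in-memory model of the two tables. kb_direct_preds maps a predicate
# to the column position where its value lives; kb_dph holds one row per
# subject with predicate values spread across those columns.
kb_direct_preds = {"name": 0, "knows": 1}
kb_dph = {
    "alice": ["Alice", "bob"],   # columns indexed via kb_direct_preds
    "bob":   ["Bob", None],
}

def fetch_object(subject, predicate):
    """Look up the predicate's column, then read it from the subject's row."""
    col = kb_direct_preds[predicate]
    return kb_dph[subject][col]

print(fetch_object("alice", "knows"))  # bob
```

This is exactly the lookup shape we implement over Cassandra: subject plus predicate in, object out.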

Ingest Data Into Cassandra

With Apache Cassandra running on its default port (9042), load the triple file:

sbt "runMain com.knoldus.TripleLoader /PATH/TO/TRIPLE/FILE"

Fetch Data From Cassandra

We have to follow these steps to fetch the data from the Cassandra table.

Run the project using the command sbt run. After running the project, hit the endpoint from your browser:


You can find more information on the Wiki of the project.

We have submitted the template for the project. You can find the template with this command:

sbt new knoldus/triple-manipulation.g8

These are the steps for storing triples into a kb_dph table with the help of kb_direct_preds.

If you have any questions, you can comment here!

This article originally appeared on the Knoldus blog.

Original Link

How Citus Executes Distributed Transactions on Postgres

Distributed transactions are one of the meanest, baddest problems in relational databases. With the release of Citus 7.1, distributed transactions are now available to all our users. In this article, we are going to describe how we built distributed transaction support into Citus by using PostgreSQL modules. But first, let’s give an overview of what a distributed transaction is.

(If this sounds familiar, that’s because we first announced distributed transactions as part of the Citus Cloud 2 announcement. The Citus Cloud announcement centered on other new useful capabilities, such as our warp feature to streamline migrations from single-node Postgres deployments to Citus Cloud, but it seems worthwhile to dedicate an entire post to distributed transactions.)

Why Are Distributed Transactions Hard?

You do a BEGIN transaction, INSERT, and UPDATE. Once you run a SELECT, you should see your changes applied against the database. But others shouldn’t until you do a COMMIT. Or you ROLLBACK, and your changes should disappear.

These transactional semantics, also known as ACID properties, are already hard to provide on a single node. So how do you provide them in a distributed database in the face of machine and network failures?

Distributed Atomic Transactions

Let’s start with a well-known transaction example. We have two bank accounts that belong to Alice and Bob, and we’re looking to transfer money from Alice’s account to Bob’s. The wire transfer needs to happen atomically. That is, both accounts should be updated together or neither account should be updated at all.

In Citus, we’re also going to execute this transaction on a distributed database. The database has three participating nodes that have PostgreSQL and the Citus extension installed. One of these nodes is the coordinator node — the coordinator holds metadata about distributed tables and their shards. The cluster also has two worker nodes, where each worker node holds four shards. In Citus, each shard is one PostgreSQL table. Shards are also transparent to the application.

In this example, we have a distributed table named accounts that’s sharded on account_id. We’re now going to transfer $100 from Alice’s to Bob’s account in four steps.

In the first and second steps, the client starts a transaction and sends two commands to update Alice and Bob’s accounts. The Citus coordinator then starts a transaction across two worker nodes and associates these two local transactions with a distributed transaction (more on why we do this later). The coordinator then pushes these two update commands to two worker nodes.

In the third and fourth steps, when the client issues a commit, the Citus coordinator initiates the 2PC protocol. First, the coordinator sends PREPARE TRANSACTION to both worker nodes. When the worker nodes vote to commit the transactions, the coordinator then sends COMMIT PREPARED. The workers then commit their local transactions, release their locks, and send acknowledgments to the coordinator. Once the coordinator receives all acks, it completes the distributed transaction.
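The commit flow above can be sketched with mock workers in Python. This is a simplified illustration of the 2PC protocol, not Citus's actual implementation — the class and function names are invented for the example:

```python
# Minimal two-phase commit sketch over mock workers.
class Worker:
    def __init__(self, name, healthy=True):
        self.name, self.healthy, self.state = name, healthy, "idle"

    def prepare(self):
        # Phase 1 (PREPARE TRANSACTION): persist the transaction and vote.
        self.state = "prepared" if self.healthy else "aborted"
        return self.healthy

    def commit_prepared(self):
        # Phase 2 (COMMIT PREPARED): make it durable, release locks, ack.
        self.state = "committed"

    def rollback_prepared(self):
        self.state = "aborted"

def two_phase_commit(workers):
    """Coordinator: commit only if every worker votes yes in phase 1."""
    if all(w.prepare() for w in workers):
        for w in workers:
            w.commit_prepared()
        return "committed"
    for w in workers:
        w.rollback_prepared()
    return "aborted"

w1, w2 = Worker("worker-1"), Worker("worker-2")
print(two_phase_commit([w1, w2]))  # committed
```

A single unhealthy worker in phase 1 flips the whole outcome to "aborted" — which is also why a permanently failed participant is so problematic, as the next section discusses.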

The 2PC algorithm’s assumptions and behavior are widely discussed in distributed systems literature. You can read more about it on Wikipedia.

Improving Built-In 2PC

2PC has one disadvantage. If the coordinator node fails permanently, worker nodes may never resolve their transactions. In particular, after a worker node agrees to a PREPARE TRANSACTION, the transaction will block until the worker receives a commit or rollback.

To overcome this disadvantage, we made two improvements to 2PC in Citus. First, Citus Cloud and Enterprise deployments usually run in high availability mode. In this mode, both the coordinator and worker nodes have streaming replicas. When any of these nodes fail, their secondaries get promoted within seconds.

Second, when the coordinator sends PREPARE commands to worker nodes, it saves this information in distributed metadata tables. Later, the coordinator node checks this metadata table during crash recovery and at regular intervals. If any transactions on the worker nodes are in an incomplete state, the coordinator recovers them.

Consistency Algorithms in Distributed Systems

Consistency is a well-established problem in distributed systems. In that context, consistency means that every read receives the most recent write or an error. There are three popular distributed consistency algorithms (each algorithm also has variants):

  1. Two-phase commit (2PC): All participating nodes need to be up.
  2. Paxos: Achieves consensus with quorum.
  3. Raft: Designed as an alternative to Paxos. Meant to be more understandable than Paxos.

The primary difference between 2PC and Paxos is that 2PC needs all participating nodes to be available to make progress.

In Citus, we looked into the 2PC algorithm built into PostgreSQL. We also developed an experimental extension called pg_paxos. We decided to use 2PC for two reasons. First, 2PC has been used in production across thousands of Postgres deployments. Second, most Citus Cloud and Enterprise deployments use streaming replication behind the covers. When a node becomes unavailable, the node’s secondary usually gets promoted within seconds. This way, Citus can have all participating nodes be available most of the time.

Distributed Deadlock Detection

Most articles on distributed transactions focus on the more visible problem around data consistency. They then discuss protocols such as 2PC or Raft. In practice, data consistency is only part of the problem. If you’re using a relational database, your application also benefits from another key feature: deadlock detection.

In PostgreSQL, when a transaction updates a row, the update grabs a lock for that row. If you have concurrent transactions that need to acquire the same set of locks, but in different orders, you run into a deadlock.

Deadlock Detection in PostgreSQL

Going back to our original wire transfer example, let’s say that Alice initiated a wire transfer to Bob, and Bob initiated a transfer to Alice around the same time. In this case, when you have session 1 and session 2 updating the same rows in different orders, you will run into a deadlock.

S1: BEGIN; -- session 1 starts transaction block
S1: UPDATE accounts SET balance = balance - '100.00' WHERE id = 'Alice'; -- S1 takes 'Alice' lock
S2: BEGIN; -- session 2 starts transaction block
S2: UPDATE accounts SET balance = balance + '100.00' WHERE id = 'Bob'; -- S2 takes 'Bob' lock
S1: UPDATE accounts SET balance = balance + '100.00' WHERE id = 'Bob'; -- S1 waits for 'Bob' lock held by S2
S2: UPDATE accounts SET balance = balance - '100.00' WHERE id = 'Alice'; -- S2 deadlocks on 'Alice' lock held by S1

In Postgres, if sessions are waiting on a lock for more than the deadlock_timeout (1s), the database will check whether those sessions are waiting for each other. In the Alice and Bob case, Postgres will, for example, abort session 2’s transaction to resolve the deadlock. Session 1 will then acquire the lock on Bob’s row and complete the transaction.
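The check Postgres performs can be pictured as cycle detection over a "wait-for" graph (who is waiting on whose lock). Here is a minimal Python sketch of that idea — an illustration, not Postgres's actual algorithm:

```python
# Wait-for graph: maps each session to the sessions it is waiting on.
def has_deadlock(waits_for):
    """Return True if the wait-for graph contains a cycle (a deadlock)."""
    def cycle_from(node, path):
        for nxt in waits_for.get(node, []):
            if nxt in path or cycle_from(nxt, path | {nxt}):
                return True
        return False
    return any(cycle_from(s, {s}) for s in waits_for)

# S1 waits on S2's 'Bob' lock; S2 waits on S1's 'Alice' lock -> cycle.
print(has_deadlock({"S1": ["S2"], "S2": ["S1"]}))  # True
print(has_deadlock({"S1": ["S2"]}))                # False
```

Once a cycle is found, breaking it only requires aborting one transaction on the cycle, which is why Postgres can pick a single victim such as session 2.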

Distributed Deadlock Detection in Citus

Now, things are trickier when you’re running Alice and Bob’s wire transfers in a distributed database. In this case, let’s say that worker node 1 has a shard that holds Alice’s account. Worker 3 has another shard that holds Bob’s account.

You initiate a wire transfer (D1) to transfer money from Alice to Bob’s account. You concurrently initiate another transfer D2 that moves money from Bob to Alice’s account. D1 and D2 will then block on each other.

In this example, PostgreSQL runs deadlock detection on the coordinator, worker 1, and worker 3. None of the nodes involved in these distributed transactions see a deadlock. If you look at the whole picture, however, there’s a deadlock. This is the situation we have to deal with to safely support distributed transactions.

To handle deadlocks that span across nodes, Citus adds a new distributed deadlock detector. This deadlock detector runs as a background worker process within the Postgres extension framework.

When a client sends a transaction to Citus, the coordinator sends the transaction to related worker nodes. Before sending the transaction, however, the coordinator also calls the function SELECT assign_distributed_transaction_id();. This call ensures that the local transaction on the worker node is associated with the distributed transaction on the coordinator.

When D1 and D2 get blocked, the distributed deadlock detector gathers all lock tables from worker nodes and builds a distributed transaction graph. The deadlock detector then associates the local transactions with the distributed ones. Once it does that, the deadlock detector sees that the transactions on worker 1 and worker 3 are actually blocked on each other.
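To see why the union step matters, here is a toy Python sketch: each worker reports only its own local wait-for edges, and only the merged graph — keyed by distributed transaction id — reveals the cycle. The edge sets below mirror the D1/D2 example above:

```python
# Each worker only sees its local edges; no single node sees a cycle.
worker1_edges = {("D2", "D1")}  # on worker 1, D2 waits on D1 (Alice's row)
worker3_edges = {("D1", "D2")}  # on worker 3, D1 waits on D2 (Bob's row)

def merge_graphs(*edge_sets):
    """Union per-node (waiter, holder) edges into one global wait-for graph."""
    graph = {}
    for edges in edge_sets:
        for waiter, holder in edges:
            graph.setdefault(waiter, []).append(holder)
    return graph

global_graph = merge_graphs(worker1_edges, worker3_edges)
print(global_graph)  # {'D2': ['D1'], 'D1': ['D2']} -- a cycle only the merged view shows
```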

The deadlock detector can then cancel D2 on the coordinator. This will also abort D2’s transactions on worker 1 and worker 3. Then, D1 can acquire the lock on Bob’s row and complete the distributed transaction. This way, Citus can implement distributed transactions on a sharded database safely.

Why PostgreSQL Provides a Solid Foundation for a Distributed Database

Distributed transactions are one of the hardest problems in relational databases. When solving a complex problem, you can either build a solution from scratch or extend an existing technology.

At Citus, we chose to make PostgreSQL a rock-solid distributed database. When implementing distributed transactions, Citus leverages the following modules:

  • Stable storage on each node with a write-ahead log. The write-ahead log is crash safe (a requirement for 2PC).
  • 2PC protocol.
  • Streaming replication for fast node recovery.
  • PostgreSQL local process and lock graphs.
  • Cancellation and rollback logic for all statements.

This way, Citus uses Postgres components that are hardened over decades so we can focus on building a relational database that seamlessly scales.

Original Link

SQL Database, Table, and Data Partitioning: When and How to Do It

Find all the SQL queries and the rest of the post at the source. When I first came across table partitioning and started searching, I realized two things. First, it is a complex operation that requires good planning. Second, in some cases it can prove extremely beneficial, while in others it can be a complete headache.

So the first question is when investing time in table partitioning is a good way to go. The typical case in which you would consider applying partitioning is the following:

“A company that maintains a large database, stores all of the data that are produced as a result of its activity. As time goes by, the velocity of the data increases more and more and queries become slower and slower as whole tables need to be scanned. But what happens in cases where there is no need for a full scan? Imagine the compilation of monthly business intelligence reports. The only data that are actually needed are those that were produced during the last month. It becomes evident that there are cases where it would be extremely helpful to be able to have control over the data that our queries take into consideration while being evaluated. ”

From what I have figured out, I would say that before creating any partitions, you should try to exhaust all other alternative options, including table indexing and revision of queries. If you conclude that the only solution is table partitioning, you will have to pay special attention to how to implement it. The benefits that such an implementation can provide are constrained by the selection of the partition key and the granularity.

Regarding the first factor, you should keep in mind that partitioning can only occur on a single column and that your queries must include that column. If, for example, you have created a partitioned table, the partition indicator must be included in a query in order to benefit from eliminating the irrelevant data. Otherwise, a full scan will be performed. For this reason, it is important to review the way in which your queries access the table in order to choose the most suitable column for partitioning.

As for the granularity, if your partitions are very large, then you won’t see any particular improvement in performance. On the other hand, very small partitions can be hard to handle. Furthermore, even with a good design, you won’t see significant improvement in performance unless you are dealing with really huge tables.

Once you have evaluated all the above concerns and concluded that table partitioning serves your needs, the benefits you are going to gain include:

  • The relative speedup of queries that require only portions of large data sets. In this case, the optimizer eliminates searching in partitions that do not have relevant information.
  • Faster data load.
  • Faster deletion of old data limited to certain partitions, if they are no longer needed.
  • Faster archival: rarely used or old data can be migrated to cheaper and slower storage media.

That being said, let’s move on and see how table partitioning can be implemented in PostgreSQL, MS SQL Server, and Google BigQuery.

Regarding the actual implementation, the main idea behind table partitioning is that we are going to create a “parent” table and a number of “children” tables that are the ones actually responsible for holding the data. The number of the children is not necessarily constant and can grow as time goes by.

Of course, creating partitions does not mean that the “global” table ceases to exist. You can still query it for events that span the whole period.

For the sake of simplicity, in this post we are going to work with a table that contains only two columns. Over this, we are going to make daily partitions. In real life, tables include many more columns, but the idea remains exactly the same.
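As a rough illustration of the parent/children split, here is a Python sketch that routes rows into hypothetical per-day child tables (the events_YYYYMMDD naming is just an assumption for the example) and shows how querying one day touches only one child:

```python
from datetime import date

# Route each row to a per-day "child table"; querying a day touches one child.
def child_table_for(day):
    return "events_" + day.strftime("%Y%m%d")

partitions = {}  # child table name -> rows

def insert(day, value):
    partitions.setdefault(child_table_for(day), []).append((day, value))

def query_day(day):
    # "Partition pruning": only the one relevant child is scanned.
    return partitions.get(child_table_for(day), [])

insert(date(2017, 11, 1), "a")
insert(date(2017, 11, 2), "b")
print(child_table_for(date(2017, 11, 1)))  # events_20171101
```

Dropping a whole day of old data is then just removing one child table — the "faster deletion" benefit from the list above.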

Original Link

A Tour of Postgres Index Types

At Citus, we spend a lot of time working with customers on data modeling, optimizing queries, and adding indexes to make things snappy. My goal is to be as available for our customers as we need to be in order to make you successful. Part of that is keeping your Citus cluster well-tuned and performant, which we take care of for you. Another part is helping you with everything you need to know about Postgres and Citus. After all, a healthy and performant database means a fast-performing app — and who wouldn’t want that? Today, we’re going to condense some of the information we’ve shared directly with customers about Postgres indexes.

Postgres has a number of index types, and each new release seems to bring another one. Each of these indexes can be useful, but which one to use depends on:

  1. Data type
  2. Underlying data within the table (sometimes)
  3. Types of lookups performed

In what follows, we’ll look at a quick survey of the index types available to you in Postgres and when you should leverage each. Before we dig in, here’s a quick glimpse of the indexes we’ll walk you through:

  • B-Tree
  • Generalized Inverted Index (GIN)
  • Generalized Search Tree (GiST)
  • Space-Partitioned GiST (SP-GiST)
  • Block Range Indexes (BRIN)
  • Hash

Now, onto the indexing!

In Postgres, a B-Tree Index Is What You Most Commonly Want

If you have a degree in Computer Science, then a B-tree index was likely the first one you learned about. A B-tree index creates a tree that will keep itself balanced and even. When it goes to look something up based on that index, it will traverse down the tree to find the key the tree is split on and then return you the data you’re looking for. Using an index is much faster than a sequential scan because it may only have to read a few pages as opposed to sequentially scanning thousands of them (when you’re returning only a few records).

If you run a standard CREATE INDEX, it creates a B-tree for you. B-tree indexes are valuable on the most common data types such as text, numbers, and timestamps. If you’re just getting started indexing your database and aren’t leveraging too many advanced Postgres features within your database, using standard B-Tree indexes is likely the path you want to take.
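The speedup comes from ordered lookups. As a rough analogy (not Postgres internals), here is a Python sketch using binary search over sorted keys — the same O(log n) probe a balanced tree gives you — instead of checking every row:

```python
import bisect

# Sorted keys stand in for B-tree leaf entries: probe in O(log n) steps
# instead of scanning all one million entries.
keys = list(range(0, 1_000_000, 2))  # even numbers only, kept sorted

def index_lookup(key):
    """Binary-search the sorted keys; return the position, or None if absent."""
    i = bisect.bisect_left(keys, key)
    return i if i < len(keys) and keys[i] == key else None

print(index_lookup(424242))  # 212121
print(index_lookup(3))       # None (odd keys are absent)
```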

GIN Indexes for Columns With Multiple Values

Generalized Inverted Indexes, commonly referred to as GIN, are most useful when you have data types that contain multiple values in a single column. From the Postgres docs:

“GIN is designed for handling cases where the items to be indexed are composite values, and the queries to be handled by the index need to search for element values that appear within the composite items. For example, the items could be documents, and the queries could be searches for documents containing specific words.”

The most common data types that fall into this bucket are:

  • hstore
  • JSONB
  • Arrays

One of the beautiful things about GIN indexes is that they are aware of the data within composite values. But because a GIN index has specific knowledge about the data structure, support for each individual type needs to be added. And as a result, not all datatypes are supported.
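The "inverted" part is easy to picture. Here is a toy Python inverted index over documents, mapping each element inside a composite value (a word, in this case) back to the rows that contain it — the same shape of lookup GIN accelerates:

```python
# Toy inverted index: word -> set of row ids containing that word.
docs = {1: "postgres gin index", 2: "btree index", 3: "gin for arrays"}

inverted = {}
for row_id, text in docs.items():
    for word in text.split():
        inverted.setdefault(word, set()).add(row_id)

print(sorted(inverted["index"]))  # [1, 2]
print(sorted(inverted["gin"]))    # [1, 3]
```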

GiST Indexes for Rows That Overlap Values

GiST indexes are most useful when you have data that can in some way overlap with the value of the same column in another row. They shine when you have, say, a geometry data type and want to see whether two polygons contain some point. In one case, a specific point may be contained within a box, while another point only exists within one polygon. The most common datatypes where you want to leverage GiST indexes are:

  • Geometry types
  • Text when dealing with full-text search

GiST indexes have some more fixed constraints around size, whereas GIN indexes can become quite large. As a result, GiST indexes are lossy. From the docs:

“A GiST index is lossy, meaning that the index might produce false matches, and it is necessary to check the actual table row to eliminate such false matches. (PostgreSQL does this automatically when needed.)”

This doesn’t mean you’ll get wrong results; it just means that Postgres has to do a little extra work to filter those false positives before giving your data back to you.

Special note: GIN and GiST indexes can often be beneficial on the same column types. GIN often boasts better lookup performance at the cost of a larger disk footprint, and vice versa for GiST. When it comes to GIN vs. GiST, there isn’t a perfect one-size-fits-all answer, but the broad rules above apply.

SP-GiST Indexes for Larger Data

Space-partitioned GiST indexes leverage space partitioning trees that came out of some research from Purdue. SP-GiST indexes are most useful when your data has a natural clustering element to it and is also not an equally balanced tree. A great example of this is phone numbers (at least U.S. ones). They follow a format of:

  • 3 digits for area code
  • 3 digits for prefix (historically related to a phone carrier’s switch)
  • 4 digits for line number

This means that you have some natural clustering around the first set of 3 digits, around the second set of 3 digits, then numbers may fan out in a more even distribution. But, with phone numbers, some area codes have a much higher saturation than others. The result may be that the tree is very unbalanced. Because of the natural clustering up front and the unequal distribution of data, data like phone numbers could make a good case for SP-GiST.
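A toy Python sketch of that clustering (with made-up phone numbers): partitioning first by area code and then by prefix produces an uneven, space-partitioned tree, with some branches far more saturated than others.

```python
from collections import defaultdict

# Two-level space partitioning: area code -> prefix -> line numbers.
tree = defaultdict(lambda: defaultdict(list))
numbers = ["415-555-0101", "415-555-0199", "415-861-0001", "212-555-0404"]

for n in numbers:
    area, prefix, line = n.split("-")
    tree[area][prefix].append(line)

print(sorted(tree["415"]["555"]))  # ['0101', '0199']
print(len(tree["212"]))            # 1 -- far less saturated than 415
```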

BRIN Indexes for Larger Data

Block range indexes can focus on some similar use cases to SP-GiST in that they’re best when there is some natural ordering to the data, and the data tends to be very large. Have a billion-record table (especially if it’s time series data)? BRIN may be able to help. If you’re querying against a large set of data that is naturally grouped together, such as data for several zip codes (which then roll up to some city), BRIN can exploit the fact that similar zip codes are located near each other on disk.

When you have very large datasets that are ordered, such as dates or zip codes, BRIN indexes allow you to skip or exclude a lot of the unnecessary data very quickly. BRIN indexes are additionally maintained as smaller indexes relative to the overall data size, making them a big help when you have a large dataset.
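The core trick is easy to sketch in Python: store only a (min, max) summary per block of rows, then prune every block whose summary cannot contain the search range. This is an illustration of the idea, not Postgres's implementation:

```python
# BRIN-style sketch: one (min, max) summary per block of naturally ordered rows.
rows = list(range(1000))  # ordered data, e.g. timestamps
BLOCK = 100
block_ranges = [(b[0], b[-1]) for b in
                (rows[i:i + BLOCK] for i in range(0, len(rows), BLOCK))]

def candidate_blocks(lo, hi):
    """Block numbers whose (min, max) summary overlaps [lo, hi]; the rest are skipped."""
    return [i for i, (mn, mx) in enumerate(block_ranges) if mx >= lo and mn <= hi]

print(candidate_blocks(250, 260))  # [2]
print(len(block_ranges))           # 10 summaries stand in for 1000 rows
```

Ten tiny summaries stand in for a thousand rows, which is why BRIN indexes stay so small relative to the data.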

Hash Indexes: Finally Crash-Safe

Hash indexes have been around for years within Postgres, but until Postgres 10, they came with a giant warning in that they were not WAL-logged. This meant if your server crashed and you failed over to a standby or recovered from archives using something like wal-g, then you’d lose that index until you recreated it. With Postgres 10, they’re now WAL-logged, so you can start to consider using them again. But the real question is: Should you?

Hash indexes at times can provide faster lookups than B-Tree indexes and can boast faster creation times as well. The big issue with them is that they’re limited to equality operators, so you need to be looking for exact matches. This makes hash indexes far less flexible than the more commonly used B-Tree indexes and something you won’t want to consider as a drop-in replacement, but rather a special case.
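As a toy illustration of that trade-off: a plain hash map gives constant-time equality lookups, but it keeps no ordering, so it cannot answer a range predicate from the index alone the way an ordered B-tree can.

```python
# Hash-style lookup: O(1) for exact matches, but no ordering to exploit.
hash_index = {f"user{i}": i for i in range(1000)}

print(hash_index["user42"])    # 42 -- exact match is fast
print("user42" in hash_index)  # True
# A range predicate (e.g. id BETWEEN x AND y) gets no help from the hash
# and would force a scan instead.
```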

Which Do You Use?

We just covered a lot — and if you’re a bit overwhelmed, you’re not alone. If all you knew before was CREATE INDEX, you’ve been using B-Tree indexes all along. And the good news is, you’re still performing as well as or better than most databases that aren’t Postgres! As you start to use more Postgres features, consider the following a cheat sheet for when to use other Postgres types:

  • B-Tree: For most datatypes and queries
  • GIN: For JSONB/hstore/arrays
  • GiST: For full-text search and geospatial datatypes
  • SP-GiST: For larger datasets with natural but uneven clustering
  • BRIN: For really large datasets that line up sequentially
  • Hash: For equality operations, and generally B-Tree is still what you want here

If you have any questions or feedback about the post, feel free to join us in our Slack channel.

Original Link