Apache Cayenne 4.0 — What It Is All About

Didn’t have much time for writing lately. Still, I felt the topic of Apache Cayenne 4.0 was important enough to say a few words, even though this is coming to you months after the final release. So here is a belated overview of the release scope and its significance. Enjoy!

As you may know, Cayenne is a Java ORM, i.e., an intermediary between your Java objects and the database that keeps the object and relational worlds in sync and lets you run queries and persist data. The current stable version of Cayenne is 4.0, released in August 2018.
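For readers who have never used Cayenne, here is a rough sketch of what a query looks like with the 4.0 API (the Artist class and the project file name are hypothetical placeholders, not taken from the article):

import org.apache.cayenne.ObjectContext;
import org.apache.cayenne.configuration.server.ServerRuntime;
import org.apache.cayenne.query.ObjectSelect;

import java.util.List;

public class CayenneSketch {
    public static void main(String[] args) {
        // Bootstrap the Cayenne runtime from a (hypothetical) project file.
        ServerRuntime runtime = ServerRuntime.builder()
                .addConfig("cayenne-project.xml")
                .build();

        // The ObjectContext keeps the object and relational worlds in sync.
        ObjectContext context = runtime.newContext();

        // Artist is a hypothetical persistent class generated from the data model.
        List<Artist> artists = ObjectSelect.query(Artist.class)
                .where(Artist.NAME.like("A%"))
                .select(context);

        artists.forEach(a -> System.out.println(a.getName()));
    }
}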

Java 11: JOIN Tables, Get Java Streams

Ever wondered how you could turn joined database tables into a Java Stream? Read this short article and find out how it is done using the Speedment Stream ORM. We will start with a Java 8 example and then look into the improvements made with Java 11.

Java 8 and JOINs

Speedment allows dynamically joined database tables to be consumed as standard Java Streams. We begin by looking at a solution for Java 8 using the Sakila example database:
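The original snippets aren’t reproduced here, but based on the join API shown in the “Java Stream ORM Now with JOINs” piece further down, a Java 8 solution looks roughly like this (a sketch; joinComponent is obtained from the Speedment application, and FilmManager, Film, Language, and Tuples come from Speedment’s generated code and runtime):

// Build a reusable Join object over the film and language tables.
Join<Tuple2<Film, Language>> join = joinComponent
    .from(FilmManager.IDENTIFIER)
    .innerJoinOn(Language.LANGUAGE_ID).equal(Film.LANGUAGE_ID)
    .build(Tuples::of);

// Consume the joined rows as a plain Java 8 Stream.
join.stream()
    .map(t2 -> t2.get0().getTitle() + " is in " + t2.get1().getName())
    .forEach(System.out::println);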

What Is Spring Data JPA?

In this article, I would like to talk about the Spring Data JPA (Java Persistence API). This library is one of the main building blocks of the Spring framework. It is also a powerful tool that you should know about if you would like to work with persistent data.

I often see that developers who use it do not see the whole picture and miss out on some of its most useful capabilities. Hence, I would like to show you the bigger picture, as well as the most meaningful tools for handling your persistent data across your application.
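To make that bigger picture concrete, here is a minimal sketch of the kind of repository interface Spring Data JPA lets you write (Product is a hypothetical entity; Spring generates the implementations of these query methods from their names):

import org.springframework.data.jpa.repository.JpaRepository;

import java.math.BigDecimal;
import java.util.List;

public interface ProductRepository extends JpaRepository<Product, Long> {

    // Derived query: the SQL is generated from the method name alone.
    List<Product> findByNameContainingIgnoreCase(String namePart);

    // Another derived query, this time filtering on a price column.
    List<Product> findByPriceLessThan(BigDecimal maxPrice);
}

Basic CRUD, paging, and sorting methods come from JpaRepository itself, so the interface above needs no implementation class at all.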

Modules of the Spring Architecture

Spring Framework Architecture

The basic idea behind the development of the Spring Framework was to make it a one-stop shop where you can integrate and use modules according to the needs of your application. This modularity comes from the architecture of Spring. There are about 20 modules in the Spring Framework, and they are used according to the nature of the application.

Below is the architecture diagram of the Spring Framework. There, you can see all the modules defined on top of the Core Container. This layered architecture contains all the necessary modules that a developer may require when developing an enterprise application. Also, the developer is free to choose or discard any module according to the application’s requirements. Due to this modular architecture, integrating the Spring Framework with other frameworks is very easy.

Debugging Java Streams With IntelliJ

Streams are very powerful and can capture the gist of your intended functionality in just a few lines. But as smooth as they are when everything works, they can be just as agonizing when they don’t behave as expected. In this tutorial, we will learn how to use IntelliJ to debug your Java Streams and gain insight into the intermediate operations of a Stream.

In this article, I will use the Sakila sample database and Speedment Stream ORM in my examples.
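The full walkthrough isn’t reproduced here, but as a rough idea, a pipeline like the one below (plain Java, hypothetical data rather than the Sakila tables) is the kind of thing you can put a breakpoint on and inspect step by step with IntelliJ’s stream-tracing support in the debugger:

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class StreamDebugDemo {
    public static void main(String[] args) {
        // Put a breakpoint on the line below and open the stream trace
        // from the debugger to see what each intermediate operation produces.
        List<String> result = Stream.of("ACADEMY DINOSAUR", "ACE GOLDFINGER", "ADAPTATION HOLES")
                .filter(title -> title.contains("A"))      // intermediate operation 1
                .map(String::toLowerCase)                  // intermediate operation 2
                .sorted()                                  // intermediate operation 3
                .collect(Collectors.toList());             // terminal operation

        System.out.println(result);
    }
}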

Using the Spring Data JPA

Spring Data JPA is not a JPA provider; it is a library/framework that adds an extra layer of abstraction on top of our JPA provider. It simply “hides” the Java Persistence API and the JPA provider behind its repository abstraction.

JPA is the Sun specification for persisting objects in enterprise applications. It is used as a replacement for complex entity beans.

Java Stream ORM Now with JOINs

Speedment is a Java Stream ORM toolkit and runtime that allows you to view database tables as standard Java Streams. Because you do not have to mix Java and SQL, the application becomes much more compact, making it faster to develop, less prone to errors, and easier to maintain. Streams are also strictly type-safe and lazily constructed so that only a minimum amount of data is pulled in from the database as elements are consumed by the streams.

Speedment 3.1.1 “Homer” now also supports dynamically joined tables to be viewed as standard Java Streams. This is a big deal when developing Java applications that explore relations between database tables.

In the examples below, I have used the open-source Sakila film database content for MySQL that you can download here. Speedment works for any major relational database type such as Oracle, MySQL, Microsoft SQL Server, PostgreSQL, DB2, MariaDB, AS400, and more.

Streaming Over a Single Table

The following code snippet will create a List of all Film objects that have a Film.RATING of “PG-13” and where the List is sorted in Film.LENGTH order:

List<Film> list = films.stream()
    .filter(Film.RATING.equal("PG-13"))
    .sorted(Film.LENGTH)
    .collect(toList());

The stream will be automatically rendered to a SQL query under the hood. If we enable Stream logging, we will see the following (prepared statement “?”-variables given as values in the end):

SELECT `film_id`,`title`,`description`,`release_year`, `language_id`,`original_language_id`, `rental_duration`,`rental_rate`, `length`,`replacement_cost`,`rating`,`special_features`, `last_update` FROM `sakila`.`film` WHERE (`rating` = ? COLLATE utf8_bin) ORDER BY `length` ASC values:[PG-13]

Thus, the advantage is that you can express your database queries using type-safe Java and then consume the result by means of standard Java Streams. You do not have to write any SQL code.

Joining Several Tables

Apart from the table “film”, the Sakila database also contains other tables. One of these is a table called “language”. Each Film entity has a foreign key to the Language spoken in the film, via a column named “language_id”.

In this example, I will show how we can create a standard Java Stream that represents a join of these two tables. This way, we can get a Java Stream of matching pairs of Film/Language entities.

Join objects are created using the JoinComponent which can be obtained like this:

// Visit https://github.com/speedment/speedment
// to see how a Speedment app is created. It is easy!
Speedment app = ...;

JoinComponent joinComponent = app.getOrThrow(JoinComponent.class);

Once we have grabbed the JoinComponent, we can start creating Join objects like this:

Join<Tuple2<Film, Language>> join = joinComponent
    .from(FilmManager.IDENTIFIER)
    .innerJoinOn(Language.LANGUAGE_ID).equal(Film.LANGUAGE_ID)
    .build(Tuples::of);

Now that we have defined our Join object, we can create the actual Java Stream:

join.stream()
    .map(t2 -> String.format(
        "The film '%s' is in %s",
        t2.get0().getTitle(),  // get0() -> Film
        t2.get1().getName()    // get1() -> Language
    ))
    .forEach(System.out::println);

This will produce the following output:

The film 'ACADEMY DINOSAUR' is in English
The film 'ACE GOLDFINGER' is in English
The film 'ADAPTATION HOLES' is in English
...

In the code above, the method t2.get0() will retrieve the first element from the tuple (a Film) whereas the method t2.get1() will retrieve the second element from the tuple (a Language). Default generic tuples are built into Speedment and thus Tuple2 is not a Guava class. Speedment does not depend on any other library. Below, you will see how you can use any class constructor for the joined tables. Again, Speedment will render SQL code automatically from Java and convert the result to a Java Stream. If we enable Stream logging, we can see exactly how the SQL code was rendered:

SELECT A.`film_id`,A.`title`,A.`description`, A.`release_year`,A.`language_id`,A.`original_language_id`, A.`rental_duration`,A.`rental_rate`,A.`length`, A.`replacement_cost`,A.`rating`,A.`special_features`, A.`last_update`, B.`language_id`,B.`name`,B.`last_update` FROM `sakila`.`film` AS A
INNER JOIN `sakila`.`language` AS B ON (B.`language_id` = A.`language_id`)

Interestingly, the Join object can be created once and be re-used over and over again to create new Streams.
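For example, the very same Join instance defined above can back several independent pipelines (a small sketch; Map and Collectors are the standard java.util classes):

// Count films per language, reusing the Join object created earlier.
Map<String, Long> filmsPerLanguage = join.stream()
    .collect(Collectors.groupingBy(
        t2 -> t2.get1().getName(),   // get1() -> Language
        Collectors.counting()));

// A second, independent Stream from the same Join object.
long englishFilms = join.stream()
    .filter(t2 -> "English".equals(t2.get1().getName()))
    .count();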

Many-to-Many Relations

The Sakila database also defines a handful of many-to-many relations. For example, the table “film_actor” contains rows that link films to actors. Each film can have multiple actors, and each actor might have appeared in multiple films. Every row in the table links a particular Film to a specific Actor. For example, if a Film features 12 Actor entities, then FilmActor contains 12 entries, all having the same film_id but different actor_ids. The purpose of this example is to create a complete list of all films and the actors appearing in them in a Java Stream. This is how we can join the three tables together:

Join<Tuple3<FilmActor, Film, Actor>> join = joinComponent
    .from(FilmActorManager.IDENTIFIER)
    .innerJoinOn(Film.FILM_ID).equal(FilmActor.FILM_ID)
    .innerJoinOn(Actor.ACTOR_ID).equal(FilmActor.ACTOR_ID)
    .build(Tuples::of);

join.stream()
    .forEach(System.out::println);

The code above will produce the following output (formatted for readability):

...
Tuple3Impl { FilmActorImpl { actorId = 137, filmId = 249, lastUpdate = 2006-02-15 05:05:03.0 }, FilmImpl { filmId = 249, title = DRACULA CRYSTAL, description =..., ActorImpl { actorId = 137, firstName = MORGAN, lastName = WILLIAMS,...}
} Tuple3Impl { FilmActorImpl { actorId = 137, filmId = 254, lastUpdate = 2006-02-15 05:05:03.0 }, FilmImpl { filmId = 254, title = DRIVER ANNIE, description = ..., ActorImpl { actorId = 137, firstName = MORGAN, lastName = WILLIAMS, ...}
} Tuple3Impl { FilmActorImpl { actorId = 137, filmId = 263, lastUpdate = 2006-02-15 05:05:03.0 }, FilmImpl { filmId = 263, title = DURHAM PANKY, description = ... }, ActorImpl { actorId = 137, firstName = MORGAN, lastName = WILLIAMS,... }
}
...

Joins With Custom Tuples

As we noticed in the example above, we have no actual use of the FilmActor object in the Stream since it is only used to link Film and Actor objects together during the Join phase.

When Join objects are built using the build() method, we can provide a custom constructor that we want to apply to the incoming entities from the database. The constructor can be of any type, so you can write your own Java objects that hold, for example, Film and Actor, or any of the columns they contain that are of interest.

In this example, I provide a (lambda) constructor that just discards the linking FilmActor objects altogether:

Join<Tuple2<Film, Actor>> join = joinComponent
    .from(FilmActorManager.IDENTIFIER)
    .innerJoinOn(Film.FILM_ID).equal(FilmActor.FILM_ID)
    .innerJoinOn(Actor.ACTOR_ID).equal(FilmActor.ACTOR_ID)
    .build((fa, f, a) -> Tuples.of(f, a));

join.stream()
    .forEach(System.out::println);

The code above will produce the following output (formatted for readability):

...
Tuple2Impl { FilmImpl { filmId = 249, title = DRACULA CRYSTAL, description = ... }, ActorImpl { actorId = 137, firstName = MORGAN, lastName = WILLIAMS, ...}
}
Tuple2Impl { FilmImpl { filmId = 254, title = DRIVER ANNIE, description = A... }, ActorImpl { actorId = 137, firstName = MORGAN, lastName = WILLIAMS,...}
}
Tuple2Impl { FilmImpl { filmId = 263, title = DURHAM PANKY, description = ... }, ActorImpl { actorId = 137, firstName = MORGAN, lastName = WILLIAMS,...}
}
...

Thus, we only get matching pairs of Film and Actor entities where there is an appearance of an actor in a film. The linking object FilmActor is never seen in the Stream.

Take it for a Spin!

Over the course of this article, you have learned how to stream over one or several database tables using Speedment.

Visit Speedment open-source on GitHub and try it out!

Read all about the new JOIN functionality in the User’s Guide.

Work With Fluent NHibernate in Core 2.0

Table of Contents

  • Introduction
  • NuGet package
  • Create Entities
    • Person Entity
    • Task Entity
  • Create Mappings
    • PersonMap
    • TaskMap
  • Create the SessionFactory Builder
  • Create Services
    • PersonService
    • TaskService
  • Add and display Table content
  • Conclusion

Introduction

NHibernate is an object-relational mapping (ORM) framework that allows you to map an object-oriented domain model to the tables of a relational database. To define the mapping, you normally write XML mapping files (.hbm.xml files); to make this easier, Fluent NHibernate provides an abstraction layer that lets you write the mappings in C# rather than XML.

Databases supported by NHibernate are:

  • SQL Server
  • SQL Server Azure
  • Oracle
  • PostgreSQL
  • MySQL
  • SQLite
  • DB2
  • Sybase Adaptive Server
  • Firebird
  • Informix

It can even support the use of OLE DB (Object Linking and Embedding) and ODBC (Open Database Connectivity).

What makes NHibernate stronger than Entity Framework is that it integrates several querying APIs: LINQ, the Criteria API (an implementation of the Query Object pattern), integration with Lucene.NET, raw SQL, and even stored procedures.

NHibernate also supports:

  • Second Level Cache (used among multiple ISessionFactory)
  • Multiple ways of ID generation such as Identity, Sequence, HiLo, GUID, Pooled, and the Native mechanism used in databases
  • Flushing properties (FlushMode properties in ISession that can take these values: Auto, Commit, or Never)
  • Lazy Loading
  • A schema generation and updating system, similar to the migrations API in Entity Framework (e.g. Code First)

When Microsoft started working on .NET Core, much of NHibernate’s functionality wasn’t supported with .NET Core as a target platform, but as of .NET Core 2.0, we can integrate both NHibernate and Fluent NHibernate.

  1. NuGet package

The NuGet package to use Fluent NHibernate:

PM> Install-Package FluentNHibernate -Version 2.1.2

  2. Create Entities

This folder will include two entities: Person and Task.

    2.1. Person Entity

public class Person
{
    public virtual int Id { get; set; }
    public virtual string Name { get; set; }
    public virtual DateTime CreationDate { get; set; }
    public virtual DateTime UpdatedDate { get; set; }
}
    2.2. Task Entity

public class Task
{
    public virtual string Id { get; set; }
    public virtual string Title { get; set; }
    public virtual string Description { get; set; }
    public virtual DateTime CreationTime { get; set; }
    public virtual TaskState State { get; set; }
    public virtual Person AssignedTo { get; set; }
    public virtual DateTime CreationDate { get; set; }
    public virtual DateTime UpdatedDate { get; set; }

    public Task()
    {
        CreationTime = DateTime.UtcNow;
        State = TaskState.Open;
    }
}

public enum TaskState : byte
{
    /// <summary>
    /// The task is open.
    /// </summary>
    Open = 0,
    /// <summary>
    /// The task is active.
    /// </summary>
    Active = 1,
    /// <summary>
    /// The task is completed.
    /// </summary>
    Completed = 2,
    /// <summary>
    /// The task is closed.
    /// </summary>
    Closed = 3
}
    
  3. Create Mappings

This folder will contain the mapping classes for the previous entities.

    3.1. PersonMap

public class PersonMap : ClassMap<Person>
{
    public PersonMap()
    {
        Id(x => x.Id);
        Map(x => x.Name);
        Map(x => x.CreationDate);
        Map(x => x.UpdatedDate);
    }
}
    
    3.2. TaskMap

public class TaskMap : ClassMap<Task>
{
    public TaskMap()
    {
        Id(x => x.Id);
        Map(x => x.CreationTime);
        Map(x => x.State);
        Map(x => x.Title);
        Map(x => x.Description);
        Map(x => x.UpdatedDate);
        Map(x => x.CreationDate);
        References(x => x.AssignedTo);
    }
}
  4. Create the SessionFactory Builder

In the SessionFactories folder, we will include the SessionFactoryBuilder class, which manages building the database schema and the SessionFactory.

public class SessionFactoryBuilder
{
    //var listOfEntityMap = typeof(M).Assembly.GetTypes().Where(t => t.GetInterfaces().Contains(typeof(M))).ToList();
    //var sessionFactory = SessionFactoryBuilder.BuildSessionFactory(dbmsTypeAsString, connectionStringName, listOfEntityMap, withLog, create, update);

    public static ISessionFactory BuildSessionFactory(string connectionStringName, bool create = false, bool update = false)
    {
        return Fluently.Configure()
            .Database(PostgreSQLConfiguration.Standard
                .ConnectionString(ConfigurationManager.ConnectionStrings[connectionStringName].ConnectionString))
            //.Mappings(m => entityMappingTypes.ForEach(e => { m.FluentMappings.Add(e); }))
            .Mappings(m => m.FluentMappings.AddFromAssemblyOf<NHibernate.Cfg.Mappings>())
            .CurrentSessionContext("call")
            .ExposeConfiguration(cfg => BuildSchema(cfg, create, update))
            .BuildSessionFactory();
    }

    /// <summary>
    /// Build the schema of the database.
    /// </summary>
    /// <param name="config">Configuration.</param>
    private static void BuildSchema(Configuration config, bool create = false, bool update = false)
    {
        if (create)
        {
            new SchemaExport(config).Create(false, true);
        }
        else
        {
            new SchemaUpdate(config).Execute(false, update);
        }
    }
}
  5. Create Services

This folder will include the different operations we can perform, such as getting all elements of an entity or adding a new element to the database. In this sample, I will include two services, one for each of our entities.

    5.1. PersonService

public class PersonService
{
    public static void GetPerson(Person person)
    {
        Console.WriteLine(person.Name);
        Console.WriteLine();
    }
}
    5.2. TaskService

public class TaskService
{
    public static void GetTaskInfo(Task task)
    {
        Console.WriteLine(task.Title);
        Console.WriteLine(task.Description);
        Console.WriteLine(task.State);
        Console.WriteLine(task.AssignedTo.Name);
        Console.WriteLine();
    }
}
    
  6. Add and display Table content

Now, we will try to display data from the database or add a new element.

In the program, we start by creating a session factory using the BuildSessionFactory method implemented in the SessionFactoryBuilder class. After that, we open a new session for every transaction. Inside it, we start a new transaction and, depending on the operation, insert new rows into the database with session.SaveOrUpdate(MyObjectToAdd); and then commit them with transaction.Commit();.

static void Main(string[] args)
{
    // create our NHibernate session factory
    string connectionStringName = "add here your connection string";
    var sessionFactory = SessionFactoryBuilder.BuildSessionFactory(connectionStringName, true, true);

    using (var session = sessionFactory.OpenSession())
    {
        // populate the database
        using (var transaction = session.BeginTransaction())
        {
            // create a couple of Persons
            var person1 = new Person { Name = "Rayen Trabelsi" };
            var person2 = new Person { Name = "Mohamed Trabelsi" };
            var person3 = new Person { Name = "Hamida Rebai" };

            // create tasks
            var task1 = new Task { Title = "Task 1", State = TaskState.Open, AssignedTo = person1 };
            var task2 = new Task { Title = "Task 2", State = TaskState.Closed, AssignedTo = person2 };
            var task3 = new Task { Title = "Task 3", State = TaskState.Closed, AssignedTo = person3 };

            // save the tasks; this saves everything else via cascading
            session.SaveOrUpdate(task1);
            session.SaveOrUpdate(task2);
            session.SaveOrUpdate(task3);
            transaction.Commit();
        }

        using (var session2 = sessionFactory.OpenSession())
        {
            // retrieve all tasks with the person each one is assigned to
            using (session2.BeginTransaction())
            {
                var tasks = session2.CreateCriteria(typeof(Task)).List<Task>();
                foreach (var task in tasks)
                {
                    TaskService.GetTaskInfo(task);
                }
            }
        }

        Console.ReadKey();
    }
}

Conclusion

This was a sample of Fluent NHibernate. In another article, I will show you a web sample that uses some of its other advantages, like the Criteria API.

Generate Dapper Queries On-The-Fly With C#

ORMs are very common when developing with .NET. According to Wikipedia:

Object-relational mapping (ORM, O/RM, and O/R mapping tool) in computer science is a programming technique for converting data between incompatible type systems using object-oriented programming languages. This creates, in effect, a “virtual object database” that can be used from within the programming language. There are both free and commercial packages available that perform object-relational mapping, although some programmers opt to construct their own ORM tools.

Dapper is a simple object mapper for .NET. It’s simple and fast. Performance is the most important thing that we can achieve with Dapper. According to their website:

Dapper is a simple object mapper for .NET and owns the title of King of Micro ORM in terms of speed and is virtually as fast as using a raw ADO.NET data reader. An ORM is an Object Relational Mapper, which is responsible for mapping between database and programming language.

What else can we do to make it easier and better? 

I am a big fan of code generator tools. They allow you to avoid rewriting a lot of code and you can just simplify automating queries and so on.

Recently, I released a Visual Studio extension called Dapper Crud Generator, which is responsible for automatically generating SELECT/INSERT/UPDATE/DELETE commands.

How does it work?

You have a solution with a model project inside, then you have several classes with properties, right?

For example:

public class Student
{
    public int Id { get; set; }
    public string Name { get; set; }
    public DateTime Birth { get; set; }
}

Then, you have a requirement for each statement. You have to write queries and you have to write a lot of code to do it.

After installing the extension, right-click on your project and click on Generate Dapper CRUD:

[Image: the Dapper CRUD Generator options dialog]

Select the options shown above and select your model. Here is the result:

public List<Course> SelectCourse()
{
    // Select
    List<Course> ret;
    using (var db = new SqlConnection(connstring))
    {
        const string sql = @"SELECT Id, Name, StudentLimit FROM [Course]";
        ret = db.Query<Course>(sql, commandType: CommandType.Text).ToList();
    }
    return ret;
}

public void InsertCourse(Course course)
{
    // Insert
    using (var db = new SqlConnection(connstring))
    {
        const string sql = @"INSERT INTO [Course] (Name, StudentLimit) VALUES (@Name, @StudentLimit)";
        db.Execute(sql, new { Name = course.Name, StudentLimit = course.StudentLimit }, commandType: CommandType.Text);
    }
}

public void UpdateCourse(Course course)
{
    // Update
    using (var db = new SqlConnection(connstring))
    {
        const string sql = @"UPDATE [Course] SET Name = @Name, StudentLimit = @StudentLimit WHERE Id = @Id";
        db.Execute(sql, new { Id = course.Id, Name = course.Name, StudentLimit = course.StudentLimit }, commandType: CommandType.Text);
    }
}

public void DeleteCourse(Course course)
{
    // Delete
    using (var db = new SqlConnection(connstring))
    {
        const string sql = @"DELETE FROM [Course] WHERE Id = @Id";
        db.Execute(sql, new { course.Id }, commandType: CommandType.Text);
    }
}

To install and use the extension, just go to Marketplace or download directly from Visual Studio (via the Tools and Extensions menu).

For the next major release, I’m planning to implement queries with complex objects.

You can find the source code here.

Hibernate Show SQL

When you are developing Spring Boot applications with database interactions, you typically use Hibernate as the Object Relational Mapping (ORM) tool.

Instead of directly coupling your code with Hibernate, often, you’d rather use Spring Data JPA, a Spring Framework project.

Spring Data JPA makes the implementation of the data access layer incredibly easy by abstracting most of the complexities involved in persisting data.

Often, when you are working with Hibernate and Spring Data JPA, you will need to see what is happening under the hood. It is very helpful to see the actual SQL statements being generated by Hibernate.

Due to the nature of the abstractions offered by Hibernate and Spring Data JPA, it’s very easy to inadvertently create n+1 queries — which is very detrimental to the performance of your application.

In this post, I’ll share a tip on how to configure Hibernate and Spring Data JPA to log executed SQL statements and used bind parameters.

The Application

For the purpose of this post, I’ve created a simple Spring Boot application. In this application, we can perform CRUD operations on a Product entity.

Here is the Product entity.

Product.java:

package guru.springframework.domain;

import javax.persistence.*;
import java.math.BigDecimal;

@Entity
public class Product {

    @Id
    @GeneratedValue(strategy = GenerationType.AUTO)
    private Integer id;

    @Version
    private Integer version;

    private String productId;
    private String description;
    private String imageUrl;
    private BigDecimal price;

    public String getDescription() { return description; }

    public void setDescription(String description) { this.description = description; }

    public Integer getVersion() { return version; }

    public void setVersion(Integer version) { this.version = version; }

    public Integer getId() { return id; }

    public void setId(Integer id) { this.id = id; }

    public String getProductId() { return productId; }

    public void setProductId(String productId) { this.productId = productId; }

    public String getImageUrl() { return imageUrl; }

    public void setImageUrl(String imageUrl) { this.imageUrl = imageUrl; }

    public BigDecimal getPrice() { return price; }

    public void setPrice(BigDecimal price) { this.price = price; }
}

Below is a JUnit test class to save and retrieve products.

If you are new to JUnit, I’d suggest checking out my JUnit series of posts.

ProductRepositoryTest.java:

package guru.springframework.repositories;

import guru.springframework.configuration.RepositoryConfiguration;
import guru.springframework.domain.Product;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;

import java.math.BigDecimal;

import static org.junit.Assert.*;

@RunWith(SpringJUnit4ClassRunner.class)
@SpringBootTest(classes = {RepositoryConfiguration.class})
public class ProductRepositoryTest {

    private ProductRepository productRepository;

    @Autowired
    public void setProductRepository(ProductRepository productRepository) {
        this.productRepository = productRepository;
    }

    @Test
    public void testSaveProduct() {
        //setup product
        Product product = new Product();
        product.setDescription("Spring Framework Guru Shirt");
        product.setPrice(new BigDecimal("18.95"));
        product.setProductId("1234");

        //save product, verify has ID value after save
        assertNull(product.getId()); //null before save
        productRepository.save(product);
        assertNotNull(product.getId()); //not null after save

        //fetch from DB
        Product fetchedProduct = productRepository.findOne(product.getId());

        //should not be null
        assertNotNull(fetchedProduct);

        //should equal
        assertEquals(product.getId(), fetchedProduct.getId());
        assertEquals(product.getDescription(), fetchedProduct.getDescription());

        //update description and save
        fetchedProduct.setDescription("New Description");
        productRepository.save(fetchedProduct);

        //get from DB, should be updated
        Product fetchedUpdatedProduct = productRepository.findOne(fetchedProduct.getId());
        assertEquals(fetchedProduct.getDescription(), fetchedUpdatedProduct.getDescription());

        //verify count of products in DB
        long productCount = productRepository.count();
        assertEquals(productCount, 1);

        //get all products, list should only have one
        Iterable<Product> products = productRepository.findAll();

        int count = 0;
        for (Product p : products) {
            count++;
        }
        assertEquals(count, 1);
    }
}

Activating Logging in Hibernate

To activate the logging of the executed SQL statements with Spring Boot, set the log level of the org.hibernate.SQL category to DEBUG.

If you wish to see the bind values, you can set the log level of org.hibernate.type.descriptor.sql to TRACE.

If you are new to logging frameworks, refer to my series on Log4J2.

Here is the logging configuration in the application.properties.

application.properties:

#show sql statement
logging.level.org.hibernate.SQL=debug

#show sql values
logging.level.org.hibernate.type.descriptor.sql=trace

Here is the log output showing the SQL statements generated by Hibernate:

org.hibernate.SQL=debug

2018-02-04 22:34:46.861 DEBUG 1065 --- [ main] org.hibernate.SQL : select product0_.id as id1_0_0_, product0_.description as descript2_0_0_, product0_.image_url as image_ur3_0_0_, product0_.price as price4_0_0_, product0_.product_id as product_5_0_0_, product0_.version as version6_0_0_ from product product0_ where product0_.id=?

org.hibernate.type.descriptor.sql=trace

2018-02-04 22:34:46.861 DEBUG 1065 --- [ main] org.hibernate.SQL : select product0_.id as id1_0_0_, product0_.description as descript2_0_0_, product0_.image_url as image_ur3_0_0_, product0_.price as price4_0_0_, product0_.product_id as product_5_0_0_, product0_.version as version6_0_0_ from product product0_ where product0_.id=?
2018-02-04 22:34:46.862 TRACE 1065 --- [ main] o.h.type.descriptor.sql.BasicBinder : binding parameter [1] as [INTEGER] - [1]
2018-02-04 22:34:46.862 TRACE 1065 --- [ main] o.h.type.descriptor.sql.BasicExtractor : extracted value ([descript2_0_0_] : [VARCHAR]) - [New Description]
2018-02-04 22:34:46.863 TRACE 1065 --- [ main] o.h.type.descriptor.sql.BasicExtractor : extracted value ([image_ur3_0_0_] : [VARCHAR]) - [https://springframework.guru/wp-content/uploads/2015/04/spring_framework_guru_shirt-rf412049699c14ba5b68bb1c09182bfa2_8nax2_512.jpg]
2018-02-04 22:34:46.863 TRACE 1065 --- [ main] o.h.type.descriptor.sql.BasicExtractor : extracted value ([price4_0_0_] : [NUMERIC]) - [18.95]
2018-02-04 22:34:46.863 TRACE 1065 --- [ main] o.h.type.descriptor.sql.BasicExtractor : extracted value ([product_5_0_0_] : [VARCHAR]) - [1234]
2018-02-04 22:34:46.863 TRACE 1065 --- [ main] o.h.type.descriptor.sql.BasicExtractor : extracted value ([version6_0_0_] : [INTEGER]) - [1]

Activating Logging With Spring Data JPA

If you are using Spring Data JPA with Hibernate as the persistence provider, add the following two lines in application.properties:

spring.jpa.show-sql=true
spring.jpa.properties.hibernate.format_sql=true

Here is the log output:

Hibernate: select product0_.id as id1_0_0_, product0_.description as descript2_0_0_, product0_.image_url as image_ur3_0_0_, product0_.price as price4_0_0_, product0_.product_id as product_5_0_0_, product0_.version as version6_0_0_ from product product0_ where product0_.id=?

Conclusion

As you can see, it’s very easy to enable the logging of SQL statements with Spring Boot and Hibernate.

Being able to see what Hibernate is actually doing with the database is very important.

Often, when I’m working on a Spring Boot project, I will enable the SQL output just as a sanity check. I may believe everything is okay. But I have, in fact, found problems which I was unaware of by examining the SQL output.

DBAs, GDPR, FDD, and Buggy Whips

I recently found myself rereading a very old blog post of mine, from the very beginning of my blog, discussing Buggy Whips. I’ll save you the long read: I was learning new tech, it made me second guess my working assumptions, and I was curious if I was manufacturing a buggy whip while watching an automobile drive by.

2008 to 2018

Well, I’m still here.

In fact, feature-driven development has disappeared from the lexicon and the project that it was introduced to took years longer than anticipated, performed horribly, and had to have a major redesign and rework to be fundamentally functional (all after I left the old organization).

So, my fears that database design was a thing of the past were just that: fears… right?

Yes and no. Here we are in 2018 and there are all sorts of anti-design development methodologies, some wildly successful, others less so, and a few with the jury still out on their success. However, it’s very clear that for some business functions and some technical requirements, unstructured or semi-structured data is absolutely the way to go. Yet, a well-structured database for reporting and analytics is still a must. Data integrity to ensure data cleanliness is still a must. Appropriate indexes, up-to-date statistics, well-written T-SQL, all still here.

Everything has changed and nothing has.

Fear, Celebration, or Something Else

So, there’s a certain amount of trepidation, if not outright fear, coming these days because of the GDPR. I’ve been writing about it quite a lot because it’s a very important topic, Redgate is actively pursuing it, and I find the topic fascinating. In many of the posts and videos I’ve put up, I’ve been pointing out places where you could hit issues, places where a data breach could occur. That could engender fear. However, I’ve also done my level best to point out that none of this is a reason for fear. In fact, I think the GDPR is a reason for celebration.

Wait. What the heck do the GDPR, fear, FDD, and all this stuff have to do with buggy whips?

Fear is the idea that you’re making buggy whips. If you are, time to stop. Right now. Get going on making automobile upholstery. However, the GDPR is not an indication that you’re making buggy whips. Just the opposite. The GDPR is demanding that appropriate, tested backups be in place. The GDPR is demanding that you have monitoring in place. The GDPR is demanding that you have a documented deployment process. The GDPR is demanding that you do not allow production data into non-production environments. The GDPR is demanding that you do the job that you know you should be doing.

In short, the GDPR is not a reason for fear — it’s a reason for celebration. There is more fairly traditional DBA work coming our way because of the GDPR. Add to this the fact that the GDPR, or something exactly like it, is spreading around the world, and we have more reason for celebration, not fear.

However, are we ready? Ahhh…

Preparation

I think here is where the issue lies. Are we prepared? And I don’t just mean you, the DBA, if you are one. I mean you, the architect, you, the developer, and you, the analyst. Are you all prepared to deal with the world where, in fact, we have security more locked-down? Where, in fact, we are held responsible for SQL Injection breaches? Where, in fact, all data processing must be documented or our organizations face major fines?

I don’t think we are. I say this because last month, heck, a couple of weeks ago, a breach involving SQL Injection was reported. For crying out loud, we’ve known how to avoid this for at least 15 years, yet we’re still doing it. We’re making fundamental, easily addressed, well-documented errors. The same errors we made 15 years ago. There’s not a single excuse for this.

Pick your favorite ORM tool that will eliminate database design and get rid of that pesky DBA. Got one? Cool. Now, tell me true, did you use the “Hello World” example that did completely ad hoc queries without validating data types and just let the text format itself on its way to the database? Yes, you did. Don’t start lying to me. In short, you just enabled SQL Injection on that brand spanking new system. The truly horrible thing is that the ORM tools I’ve worked with don’t have to do SQL Injection by default. They can work with properly validated, parameterized queries just fine. It just requires preparation and setup and knowledge and a willingness to do the right thing the right way. And yeah, you still wouldn’t have to involve the DBA most of the time if you used these tools correctly.

That said, maybe, just maybe, you do want to involve the DBA. It could be that the person who knows how to properly configure constraints on the database can help you ensure better data protection, heck, easier deletes in support of the right to be forgotten. You can engage your DBA and your architects, and your developers, and your analysts. You can take a true DevOps approach in order to prepare your systems for GDPR compliance, which, it just so happens, makes them safer, better, stronger, faster (Steve Austin).

Conclusion

I am making buggy whips. They are flipping awesome buggy whips. There’s not a single car in sight. In fact, the horse-drawn wagons I see going by frequently need a wheel, a new axle, some grease, and yeah, maybe a buggy whip to keep things going. In short, in 2008, I was worried about the death of the DBA. In 2018, I think I see a resurgence of the DBA. Let’s go make some buggy whips, everyone!

Hibernate/GORM: Solving the N+1 Problem

Many developers who work with Hibernate or any other ORM framework eventually run into the so-called N+1 problem.

Our team faced it when we were working on a project using Grails. For its ORM, Grails uses GORM, which “under the hood” contains the same old Hibernate. In case you haven’t encountered this problem yet, here’s the gist of it. Let’s say we have the following perfectly typical schema: “News – Comment(s)”.

[Diagram: a News entity with a one-to-many relationship to Comment]

There is a “News” item, and it can have several “Comments.”

If we need to get the last ten news items with their comments, based on the default settings, we will perform eleven database queries: one to get the news list and one for each news item in order to get its comments.

[Image: the eleven SQL queries generated for the news list and each news item’s comments]

The ideal situation is one where the database is on the same machine, or at least the same local network, and the number of news items is limited to ten. But more likely, the database will be located on a dedicated server and there will be 50 or more news items on the page. This can lead to an issue with the server’s performance. Several solutions to this problem can be found using Hibernate. Let’s take a quick look at them.

FetchMode.JOIN

In the mapping for the association we’re interested in, or directly when executing the query, we can set up the JOIN fetch mode. In this case, the necessary association will be fetched by the same query. This will work for one-to-one or many-to-one associations, but for one-to-many associations, we will run into certain problems. Let’s take a look at the following query:

[Image: the generated SQL query joining news and comments with LIMIT 10]

The first obvious problem is that limit 10 doesn’t work the way we need it to. Instead of returning the first ten news items, this query will return the first ten rows. The number of news items in these ten rows will depend on the number of comments. If the first news item has 10+ comments, it will be the only fetch result we get. All of this forces Hibernate to abandon the database’s native mechanisms for limiting and offsetting the fetch and to process the results on the application server side.

The second problem is less obvious: if we don’t make it clear to Hibernate that we only want unique news items, then we’re going to get a list of doubled news items (one for each comment). In order to fix this, we need to insert the Result Transformer for the criterion:

criteria.setResultTransformer(Criteria.DISTINCT_ROOT_ENTITY);
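Putting both pieces together, a query-time version of this approach might look roughly like the sketch below (hypothetical News entity and property names, classic Hibernate Criteria API):

// Ask Hibernate to fetch the "comments" association in the same query
// and collapse the duplicated root rows with the distinct transformer.
Criteria criteria = session.createCriteria(News.class)
    .setFetchMode("comments", FetchMode.JOIN)
    .addOrder(Order.desc("newsDate"))
    .setMaxResults(10);
criteria.setResultTransformer(Criteria.DISTINCT_ROOT_ENTITY);

List<News> news = criteria.list();

Keep in mind that, as described above, the maximum-results limit is then effectively applied on the application side rather than by the database.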

Even if we get rid of all these drawbacks, this method still has more serious limitations: for example, it can’t cope with the task “also get the article’s author in addition to comments.”

FetchMode.SUBSELECT

Another potential alternative is SUBSELECT. Instead of doing a JOIN, it executes an additional query for linked entities while using the original query as SUBSELECT. In the end, we get only two queries rather than eleven: one base query and one query for each association.

[Image: the two SQL queries generated with SUBSELECT fetching]

This is a great option that will also work if we need to get both comments and authors at the same time. However, it also has some limitations.

First of all, it can only be used during the mapping-description phase by using the annotation @Fetch(FetchMode.SUBSELECT).

Second, we have no way of controlling the use of this mode (unlike JOIN) at the moment the query is executed, so we have no way of knowing whether it is actually being applied or not. If another developer changes the mapping, everything could fall apart: the optimization might stop working and the original version with 11 queries could silently come back. If this happens, the connection between the mapping change and the regression will not be obvious to whoever made it.

Third (and this was the deciding factor for us), this mode is not supported by GORM, the Grails framework for working with databases that is built on top of Hibernate.

Follow this link to learn more about possible fetch strategies.
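For reference, the mapping-time configuration mentioned in the first point looks roughly like this in an annotation-based Hibernate mapping (a sketch with hypothetical entity names):

import org.hibernate.annotations.Fetch;
import org.hibernate.annotations.FetchMode;

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.OneToMany;
import java.util.HashSet;
import java.util.Set;

@Entity
public class News {

    @Id
    @GeneratedValue
    private Long id;

    // SUBSELECT can only be requested here in the mapping,
    // not at the point where an individual query is executed.
    @OneToMany(mappedBy = "news")
    @Fetch(FetchMode.SUBSELECT)
    private Set<Comment> comments = new HashSet<>();
}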

Given all of these disadvantages, our only remaining option was to arm ourselves with an IDEA and lots of free time, and really dig around in the depths of Hibernate. The result was…

The Ultimate Solution

If we fantasize a little about the perfect solution, the following version suggests itself: make the fetch we need, then load the necessary collections all at once if necessary. It’d look something like this:

Query q = session.createQuery("from News order by newDate")
q.setMaxResults(10)
List news = q.list()
BatchCollectionLoader.preloadCollections(session, news, "comments")

Now let’s switch from fantasy to reality. The result of our inquiries was the following Groovy code (it can easily be rewritten in Java if necessary):

package cv.hibernate

import groovy.transform.CompileStatic
import org.grails.datastore.gorm.GormEnhancer
import org.hibernate.HibernateException
import org.hibernate.MappingException
import org.hibernate.QueryException
import org.hibernate.engine.spi.LoadQueryInfluencers
import org.hibernate.engine.spi.SessionFactoryImplementor
import org.hibernate.engine.spi.SessionImplementor
import org.hibernate.loader.collection.BasicCollectionLoader
import org.hibernate.loader.collection.OneToManyLoader
import org.hibernate.persister.collection.QueryableCollection
import org.hibernate.persister.entity.EntityPersister
import org.hibernate.type.CollectionType
import org.hibernate.type.Type

/**
 * Date: 08/03/2017
 * Time: 15:52
 */
@CompileStatic
class BatchCollectionLoader {

    protected static QueryableCollection getQueryableCollection(
            Class entityClass, String propertyName,
            SessionFactoryImplementor factory) throws HibernateException {
        String entityName = entityClass.name
        final EntityPersister entityPersister = factory.getEntityPersister(entityName)
        final Type type = entityPersister.getPropertyType(propertyName)
        if (!type.isCollectionType()) {
            throw new MappingException(
                    "Property path [" + entityName + "." + propertyName + "] does not reference a collection"
            )
        }
        final String role = ((CollectionType) type).getRole()
        try {
            return (QueryableCollection) factory.getCollectionPersister(role)
        } catch (ClassCastException cce) {
            throw new QueryException("collection role is not queryable: " + role, cce)
        } catch (Exception e) {
            throw new QueryException("collection role not found: " + role, e)
        }
    }

    private static void preloadCollectionsInternal(SessionImplementor session, Class entityClass,
                                                   List entities, String collectionName) {
        def sf = session.factory
        def collectionPersister = getQueryableCollection(entityClass, collectionName, sf)
        def entityIds = new Serializable[entities.size()]
        int i = 0
        for (def entity : entities) {
            if (entity != null) {
                entityIds[i++] = (Serializable) entity["id"]
            }
        }
        if (i != entities.size()) {
            entityIds = Arrays.copyOf(entityIds, i)
        }
        def loader = collectionPersister.isOneToMany() ?
                new OneToManyLoader(collectionPersister, entityIds.size(), sf, LoadQueryInfluencers.NONE) :
                new BasicCollectionLoader(collectionPersister, entityIds.size(), sf, LoadQueryInfluencers.NONE)
        loader.loadCollectionBatch(session, entityIds, collectionPersister.keyType)
    }

    private static Class getEntityClass(List entities) {
        for (def entity : entities) {
            if (entity != null) {
                return entity.getClass()
            }
        }
        return null
    }

    static void preloadCollections(List entities, String collectionName) {
        Class entityClass = getEntityClass(entities)
        if (entityClass == null) {
            return
        }
        GormEnhancer.findStaticApi(entityClass).withSession { SessionImplementor session ->
            preloadCollectionsInternal(session, entityClass, entities, collectionName)
        }
    }

    static void preloadCollections(SessionImplementor session, List entities, String collectionName) {
        Class entityClass = getEntityClass(entities)
        if (entityClass == null) {
            return
        }
        preloadCollectionsInternal(session, entityClass, entities, collectionName)
    }
}

This class contains two overloaded preloadCollections methods. The first one will only work for GORM (without a session), and the second one will work in both cases.

I hope this article is useful to you and will help you write great code!

P.S. Link to GIST.

Flask 101: Adding a Database

Last time, we learned how to get Flask set up. In this article, we will learn how to add a database to our music data website. As you might recall, Flask is a micro-web-framework. That means it doesn’t come with an Object Relational Mapper (ORM) like Django does. If you want to add database interactivity, then you need to add it yourself or install an extension. I personally like SQLAlchemy, so I thought it was nice that there is a ready-made extension for adding SQLAlchemy to Flask called Flask-SQLAlchemy.

To install Flask-SQLAlchemy, you just need to use pip. Make sure that you are in your activated virtual environment that we created in the first part of this series before you run the following or you’ll end up installing the extension to your base Python instead of your virtual environment:

pip install flask-sqlalchemy

Now that we have the Flask-SQLAlchemy installed along with its dependencies, we can get started creating a database!

Creating a Database

Creating a database with SQLAlchemy is actually pretty easy. SQLAlchemy supports a couple of different ways of working with a database. My favorite is using its declarative syntax that allows you to create classes that model the database itself. So, I will use that for this example. We will be using SQLite as our backend, too; however, we could easily change that backend to something else such as MySQL or Postgres if we wanted to.

To start out, we will look at how you create the database file using just normal SQLAlchemy. Then, we will create a separate script that uses the slightly different Flask-SQLAlchemy syntax. Put the following code into a file called db_creator.py:

# db_creator.py

from sqlalchemy import create_engine, ForeignKey
from sqlalchemy import Column, Date, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship, backref

engine = create_engine('sqlite:///mymusic.db', echo=True)
Base = declarative_base()


class Artist(Base):
    __tablename__ = "artists"

    id = Column(Integer, primary_key=True)
    name = Column(String)

    def __init__(self, name):
        """"""
        self.name = name

    def __repr__(self):
        return "<Artist: {}>".format(self.name)


class Album(Base):
    """"""
    __tablename__ = "albums"

    id = Column(Integer, primary_key=True)
    title = Column(String)
    release_date = Column(Date)
    publisher = Column(String)
    media_type = Column(String)
    artist_id = Column(Integer, ForeignKey("artists.id"))
    artist = relationship("Artist", backref=backref("albums", order_by=id))

    def __init__(self, title, release_date, publisher, media_type):
        """"""
        self.title = title
        self.release_date = release_date
        self.publisher = publisher
        self.media_type = media_type


# create tables
Base.metadata.create_all(engine)

The first part of this code should look pretty familiar to anyone using Python, as all we are doing here is importing the bits and pieces we need from SQLAlchemy to make the rest of the code work. Then, we create SQLAlchemy’s engine object, which basically connects Python to the database of choice. In this case, we are connecting to SQLite and creating a file instead of creating the database in memory. We also create a “base class” that we can use to create declarative class definitions that actually define our database tables.

The next two classes define the tables we care about, namely Artist and Album. You will note that we name the table via the __tablename__ class attribute. We also create the table’s columns and set their data types to whatever we need. The Album class is a bit more complex since we set up a ForeignKey relationship with the Artist table. You can read more about how this works in my old SQLAlchemy tutorial or if you want the in-depth details, then check out the well-written documentation.

When you run the code above, you should get something like this in your terminal:

2017-12-08 18:36:43,290 INFO sqlalchemy.engine.base.Engine SELECT CAST('test plain returns' AS VARCHAR(60)) AS anon_1
2017-12-08 18:36:43,291 INFO sqlalchemy.engine.base.Engine ()
2017-12-08 18:36:43,292 INFO sqlalchemy.engine.base.Engine SELECT CAST('test unicode returns' AS VARCHAR(60)) AS anon_1
2017-12-08 18:36:43,292 INFO sqlalchemy.engine.base.Engine ()
2017-12-08 18:36:43,294 INFO sqlalchemy.engine.base.Engine PRAGMA table_info("artists")
2017-12-08 18:36:43,294 INFO sqlalchemy.engine.base.Engine ()
2017-12-08 18:36:43,295 INFO sqlalchemy.engine.base.Engine PRAGMA table_info("albums")
2017-12-08 18:36:43,295 INFO sqlalchemy.engine.base.Engine ()
2017-12-08 18:36:43,296 INFO sqlalchemy.engine.base.Engine CREATE TABLE artists ( id INTEGER NOT NULL, name VARCHAR, PRIMARY KEY (id)
) 2017-12-08 18:36:43,296 INFO sqlalchemy.engine.base.Engine ()
2017-12-08 18:36:43,315 INFO sqlalchemy.engine.base.Engine COMMIT
2017-12-08 18:36:43,316 INFO sqlalchemy.engine.base.Engine CREATE TABLE albums ( id INTEGER NOT NULL, title VARCHAR, release_date DATE, publisher VARCHAR, media_type VARCHAR, artist_id INTEGER, PRIMARY KEY (id), FOREIGN KEY(artist_id) REFERENCES artists (id)
) 2017-12-08 18:36:43,316 INFO sqlalchemy.engine.base.Engine ()
2017-12-08 18:36:43,327 INFO sqlalchemy.engine.base.Engine COMMIT

Now, let’s make all this work in Flask!

Using Flask-SQLAlchemy

The first thing we need to do when we go to use Flask-SQLAlchemy is to create a simple application script. We will call it app.py. Put the following code into this file and save it to the musicdb folder:

# app.py

from flask import Flask
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config['SQLALCHEMY_DATABASE_URI'] = 'sqlite:///mymusic.db'
app.secret_key = "flask rocks!"

db = SQLAlchemy(app)

Here, we create our Flask app object and tell it where the SQLAlchemy database file should live. We also set up a simple secret key and create a database object that allows us to integrate SQLAlchemy into Flask. Next, we need to create a models.py file and save it into the musicdb folder. Once you have that made, add the following code to it:

# models.py

from app import db


class Artist(db.Model):
    __tablename__ = "artists"

    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String)

    def __init__(self, name):
        """"""
        self.name = name

    def __repr__(self):
        return "<Artist: {}>".format(self.name)


class Album(db.Model):
    """"""
    __tablename__ = "albums"

    id = db.Column(db.Integer, primary_key=True)
    title = db.Column(db.String)
    release_date = db.Column(db.Date)
    publisher = db.Column(db.String)
    media_type = db.Column(db.String)
    artist_id = db.Column(db.Integer, db.ForeignKey("artists.id"))
    artist = db.relationship("Artist", backref=db.backref("albums", order_by=id), lazy=True)

    def __init__(self, title, release_date, publisher, media_type):
        """"""
        self.title = title
        self.release_date = release_date
        self.publisher = publisher
        self.media_type = media_type

You will note that Flask-SQLAlchemy doesn’t require all the imports that just plain SQLAlchemy required. All we need is the database object that we created in our app script. Then, we just pre-pend “db” to all the classes we used in the original SQLAlchemy code. You will also note that instead of creating a Base class, it is already pre-defined as db.Model.

Finally, we need to create a way to initialize the database. You could put this in several different places, but I ended up creating a file I dubbed db_setup.py and added the following contents:

# db_setup.py

from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker
from sqlalchemy.ext.declarative import declarative_base

engine = create_engine('sqlite:///mymusic.db', convert_unicode=True)
db_session = scoped_session(sessionmaker(autocommit=False,
                                         autoflush=False,
                                         bind=engine))
Base = declarative_base()
Base.query = db_session.query_property()


def init_db():
    import models
    Base.metadata.create_all(bind=engine)

This code will initialize the database with the tables you created in your models script. To make the initialization happen, let’s edit our test.py script from the previous article:

# test.py

from app import app
from db_setup import init_db

init_db()


@app.route('/')
def test():
    return "Welcome to Flask!"


if __name__ == '__main__':
    app.run()

Here, we just imported our app object and the init_db function. Then, we called the init_db function immediately. To run this code, all you need to do is run the following command in your terminal from within the musicdb folder:

FLASK_APP=test.py flask run

When you run this, you won’t see the SQLAlchemy output that we saw earlier. Instead, you will just see some information printed out stating that your Flask application is running. You will also find that a mymusic.db file has been created in your musicdb folder.

Wrapping Up

At this point, you now have a web application with an empty database. You can’t add anything to the database with your web application or view anything in the database. Yes, you just created something really cool, but it’s also completely useless for your users. In the next article, we will learn how to add forms to add information to our database and we will learn how to display our data, too!

Say No to Randos in Your Database

When I used my first ORM, I wondered “why didn’t they include a random() method?” It seemed like such an easy thing to add. While there are many reasons you may want to pull a record out of your database at random, you shouldn’t be using SQL’s RANDOM() function unless you’ll only be randomizing a limited number of records. In this post, we’ll examine how such a simple-looking SQL operator can cause a lot of performance pain, and a few different techniques we can use to fix it.

As you might know, I run CodeTriage, the best way to get started helping open source, and I’ve written about improving the database performance on that site:

Recently, I was running the heroku pg:outliers command to see what it looked like after some of those optimizations, and I was surprised to find I was spending 32% of database time in two queries with a RANDOM() in them.

$ heroku pg:outliers
14:52:35.890252 | 19.9% | 186,846 | 02:38:39.448613 | SELECT "repos".* FROM "repos" WHERE (repos.id not in (?,?)) ORDER BY random() LIMIT $1
08:59:35.017667 | 12.1% | 2,532,339 | 00:01:13.506894 | SELECT "users".* FROM "users" WHERE ("users"."github_access_token" IS NOT NULL) ORDER BY RANDOM() LIMIT $1

Let’s take a look at the first query to understand why it’s slow.

SELECT "repos".*
FROM "repos"
WHERE (repos.id not in (?,?))
ORDER BY random()
LIMIT $1

This query is used once a week to encourage users to sign up to “triage” issues on an open source repo if they have an account, but aren’t subscribed to help out. We send an email with 3 repo suggestions, including a random repo. That seems like a good use of RANDOM(); after all, we literally want a random result. Why is this bad?

While we’re telling Postgres to only give us one record, the ORDER BY random() LIMIT 1 doesn’t only do that. It orders all the records before returning one.

While you might think it’s doing something like Array#sample, it’s really doing Array#shuffle.first. When I wrote this code, it was pretty dang fast because I only had a few repos in the database. But now there are 2,761 repos and growing. And every time this query executes, the database must load rows for each of those repos and spend CPU power to shuffle them.

You can see another query that was doing the same thing with the user table:

=> EXPLAIN ANALYZE SELECT "users".* FROM "users" WHERE ("users"."github_access_token" IS NOT NULL) ORDER BY RANDOM() LIMIT 1;
                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1471.00..1471.01 rows=1 width=2098) (actual time=12.747..12.748 rows=1 loops=1)
   ->  Sort  (cost=1471.00..1475.24 rows=8464 width=2098) (actual time=12.745..12.745 rows=1 loops=1)
         Sort Key: (random())
         Sort Method: top-N heapsort  Memory: 26kB
         ->  Seq Scan on users  (cost=0.00..1462.54 rows=8464 width=2098) (actual time=0.013..7.327 rows=8726 loops=1)
               Filter: (github_access_token IS NOT NULL)
               Rows Removed by Filter: 13510
 Total runtime: 12.811 ms
(8 rows)

It takes almost 13ms for each execution of this relatively small query.

So if RANDOM() is bad, what do we fix it with? This is a surprisingly difficult question. It largely depends on your application and how you’re accessing the data.

In my case, I fixed the issue by generating a random ID and then pulling that record. In this instance, I know that the IDs are relatively contiguous, so I pull the highest ID, pick a random number between 1 and @@max_id, then perform a query where I’m grabbing a record >= that id.

Is it faster? Oh yeah. Here’s the same query as before with the RANDOM() replaced:

=> EXPLAIN ANALYZE SELECT "users".* FROM "users" WHERE ("users"."github_access_token" IS NOT NULL) AND id >= 55 LIMIT 1;
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.17 rows=1 width=2098) (actual time=0.009..0.009 rows=1 loops=1)
   ->  Seq Scan on users  (cost=0.00..1469.36 rows=8459 width=2098) (actual time=0.009..0.009 rows=1 loops=1)
         Filter: ((github_access_token IS NOT NULL) AND (id >= 55))
 Total runtime: 0.039 ms

We went from ~13ms to sub 1ms query execution time.

There are some pretty severe caveats here to watch out for. My implementation caches the max id, which is fine for my use cases, but it might not be for yours. It’s possible to do this entirely in SQL using something like:

WHERE /* ... */
  AND id IN (
    SELECT FLOOR(RANDOM() * (SELECT MAX(id) FROM issues)) + 1
  )

As always, benchmark your SQL queries before and after an optimization. This implementation doesn’t handle sparsely populated ID values very well, and doesn’t account for randomly selecting a max id that is greater than any available once WHERE conditions are applied. Essentially, if you were to do it “right”, you would need to apply the same WHERE conditions to the subquery for MAX(id) as to your main query.

For my cases, it’s fine if I get some failures, and I know that I’m only applying the most basic of WHERE conditions. Your needs might not be so flexible.

If you’re thinking “is there no built-in way to do this?”, it turns out there is TABLESAMPLE, which was introduced in Postgres 9.5. Thanks to @HotFusionMan for introducing it to me.

Here’s the best blog post I’ve found on using TABLESAMPLE. The downside is that it’s not “truly random” (if that matters to your application), and you cannot use it to retrieve only one result. I was able to hack around this by writing a query that table-sampled only 1% of the table. Then I used that 1% to get ids and then limited to the first record. Something like:

SELECT *
FROM repos
WHERE id IN ( SELECT id FROM repos TABLESAMPLE SYSTEM(1) /* 1 percent */ )
LIMIT 1

While this works and is much faster than ORDER BY RANDOM() for queries returning LOTS of data (thousands or tens of thousands of rows), it’s very slow for queries that have very little data.

When I was optimizing https://www.codetriage.com, I found another query that uses RANDOM(). It is being used to find open source issues for a specific repo. Due to the way issues are stored, the IDs are not very contiguous, so my previous (sampling >= a random ID) trick wouldn’t work as well. I needed a more robust way to randomize the data, and I thought perhaps TABLESAMPLE might perform better.

While some repos have thousands of issues, 50% have 27 or fewer issues. When I used the TABLESAMPLE technique for this query, it made my small queries really slow, and my previously slow queries fast. Since my numbers skew towards the small side for that query, it wasn’t a net gain, so I stuck to the original RANDOM() method.

Have you replaced RANDOM() with another more efficient technique? Let me know about it on Twitter @schneems.
