Reporting and Analysis With Elasticsearch

Since the popularity of NoSQL and Big Data exploded in recent years, keeping up with the latest trends in databases, search engines, and business analytics is vital for developers.

And it’s hard not to be overwhelmed by the number of solutions available on the market: Amazon CloudSearch, Elasticsearch, Swiftype, Algolia, Searchify, Solr, and others.

Original Link

Briefing: Google’s China search engine is coming within nine months?

A leaked transcript of a Google executives' meeting shows a different picture from Google's official statements. Original Link

RediSearch 1.4: Phonetics and Spell Check

It’s always exciting when a new version of RediSearch comes out — we just released version 1.4 (yes, we skipped 1.3 to align with a new versioning methodology). This new version has two key features which add quite a bit of smarts to querying:

  • Spell Check and Custom Dictionaries.
  • Phonetic (sound-alike) Matching.

Spell Check

Let’s first take a look at spell check. Everyone knows what spell check is from a broad perspective, but let’s examine how it works in a search engine context. It’s best to think of it as a primitive that would power a "did-you-mean" feature.
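
As a rough sketch of how this could look from redis-cli (the index name myIdx and the custom dictionary name brands are placeholders, not fixed names; see the FT.SPELLCHECK documentation for the full option list):

FT.DICTADD brands redis redisearch
FT.SPELLCHECK myIdx "rediserach" DISTANCE 2 TERMS INCLUDE brands

The reply contains suggested corrections (with scores) for each unrecognized term, which an application can surface as a did-you-mean prompt.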

Original Link

Google employees not happy about new plans for China

Google workers already proved their power after objecting to a project with the US military. Original Link

Reviewing the Bleve Search Library

Bleve is a Go search engine library, and that means that it hits a few good points with me. It is interesting, it is familiar ground, and it is in a language that I’m not too familiar with, so that is a great chance to learn some more.

I reviewed revision: 298302a511a184dbab2c401e2005c1ce9589a001

I like to start by reading from the bottom up, and in this case, the very first thing that I looked at was the storage level. Bleve uses a pluggable storage engine and currently has support for:

  • BoltDB
  • LevelDB
  • Moss
  • In memory tree

This is interesting, if only because I put BoltDB and Moss on my queue of projects to read.

The actual persistent format for Bleve is very well documented here. This makes it much easier to understand what is going on. The way Bleve uses the storage, it needs only a flat key/value view of the world plus prefix range queries; nothing else is required. Navigating the code is a bit hard for me as someone who isn’t too familiar with Go, but the interesting things start here, in scorch.go (no idea why this is called scorch, though).

[code screenshot]

We get a batch of changes and run over them, adding an _id field to the document. So far, pretty simple to figure out. The next part is interesting:

[code screenshot]

You can see that we are running in parallel here, starting the analysis work and queuing it all up. Bleve then waits for the analysis to run. I’ll dig a bit deeper into how that works in a bit. First, I want to understand how the whole batch concept works.

[code screenshot]

So, that tells us some interesting things. First, even though there is the concept of a store, there is also this idea of a segment. I’m familiar with this from Lucene, but there, it is tied very closely to the on-disk format. Before looking at the analysis, let’s look at this concept of segments.

The “zap” package, in this context, seems to refer to the encoding that is used to store the analysis results. It looks like it is running over all the results of the batch and writing them into a single binary value. This is very similar to the way Lucene works so far, although I’m still confused about the key/value store. What is happening is that after the segment is created, it is sent to prepareSegment. This eventually sends it to a Go channel that is used in the Scorch.mainLoop function (which is being run as a separate thread).

Here is the relevant code:

[code screenshot]

The last bit is the one that is handling the segment introduction, whatever that is. Note that this seems to be strongly related to the store, so hopefully, we’ll see why this is showing up here. What seems to be going on here is that there is a lot of concurrency in the process; the code spawns multiple go functions to do work. The mainLoop is just one of them. The persisterLoop is another, as is the mergerLoop. All of which sounds very much like how Lucene works.

I’m still not sure how this is all tied together. So I’m going to follow just this path for now and see what is going on with these segments. A lot of the work seems to be around managing this structure:

[code screenshot]

The segment itself is an interface with the following definition:

[code screenshot]

There are in-memory and mmap versions of this interface, it seems. So far, I’m not following the relation between the storage interface and this segments idea. I think that I’m lost here, so I’m going to go a slightly different route. Instead of seeing how Bleve writes stuff, let’s focus on how it reads. I’ll try to follow the path of a query. This path of inquiry leads me to this guy:

[code screenshot]

Again, very similar to Lucene. And the TermFieldReader is where we are probably going to get the matches for this particular term (field, value). Let’s dig into that. Indeed, following the code for this method leads to the inverted index, called upside_down in this code. I managed to find how the terms are being read, and it makes perfect sense. Exactly as expected, it does a range query and parses both key and values for the relevant values. Still not seeing why there is the need for segments.

Here is where things start to come together. Bleve uses the key/value interface to store some data that it searches on, but document values are stored in segments and are loaded directly from there on demand. At a glance, it looks like the zap encoding is used to store values in chunks. It looks like I didn’t pay attention before, but the zap format is actually documented and it is very helpful. Basically, all the per document (vs. per term/field) data is located there as well as a few other things.

I think that this is where I’ll stop. The codebase is interesting, but I now know enough to have a feeling of how things work. Some closing thoughts:

  • Really good docs.
  • I didn’t use my usual “read the project in lexical file order” to figure out things, and I had a hard time navigating the codebase because of that. Probably my lack of Go chops.
  • There seems to be a lot more concurrency for stuff that I would usually assume would be single threaded than I’m used to. I’m aware that Go has built-in concurrency primitives and that they are more commonly used there, but it seems strange to see. As a consumer of search libraries, I’m not sure that I’m happy about this. I like to control my threading behaviors.
  • It seems that a lot of the data is held in memory (mmap), but in a format that requires work to handle or in the key/value store, but again, in a format that requires work.

The problem with work is that you have to do it each and every time. I’m used to Lucene (read it once from disk and keep a cached version in memory that is very fast) or Voron, in which the data is held in memory and can be accessed with zero work.

I didn’t get to any of the core parts of the library (analysis, full-text search). This is because they aren’t likely to be that different and they are full of the storage interaction details that I just went over.

Original Link

Connecting Elasticsearch Directly to your Java EE Application

The trendy term big data comes from the 3 Vs: volume, variety, and velocity. Volume refers to the size of the data, variety refers to the diverse types of data, and velocity refers to the speed of data processing. To handle persistent big data, there are NoSQL databases that write and read data faster. But with diverse data at vast volume, finding information without significant computing power would take too much time, so a search engine is required. A search engine is a software system designed to search for information; this mechanism makes it more straightforward for users to get the information they want.

This article will cover Elasticsearch, a NoSQL document database that is also a search engine.

Elasticsearch is a NoSQL document type and a search engine based on Lucene. It provides a distributed, multi-tenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is developed in Java and is released as open-source under the terms of the Apache License. Elasticsearch is the most popular enterprise search engine followed by Apache Solr, which is also based on Lucene. It is a near-real-time search platform. What this means is there is a slight latency (normally one second) from the time you index a document until the time it becomes searchable.

Steps in a Search Engine

In Elasticsearch, the processing performed by the search engine is based on the analyzer, which is a package containing three lower-level building blocks: character filters, tokenizers, and token filters. According to the Elasticsearch documentation, the definitions are as follows (a quick way to try them out appears after the list):

  • A character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters. For instance, a character filter could be used to convert Hindu-Arabic numerals into their Arabic-Latin equivalents or to strip HTML elements from the stream.

  • A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens. For instance, a whitespace tokenizer breaks the text into tokens whenever it sees any whitespace. It would convert the text “Quick brown fox!” into the terms [Quick, brown, fox!].

  • A token filter receives the token stream and may add, remove, or change tokens. For example, a lowercase token filter converts all tokens to lowercase, a stop token filter removes common words (stop words) like the from the token stream, and a synonym token filter introduces synonyms into the token stream.
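
To see these building blocks in action before wiring them into a mapping, Elasticsearch exposes an _analyze endpoint that accepts an ad hoc tokenizer and filter chain. A minimal sketch (the filter selection here is purely illustrative):

curl -XPOST 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d'
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "asciifolding"],
  "text": "Quick brown fox!"
}'

The response lists the tokens produced ([quick, brown, fox!] in this case), which makes it easy to debug a custom analyzer.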

How to Install Elasticsearch in Docker

The first step is to install Elasticsearch. You can install it either manually or through Docker; the easiest way is with Docker, using the command below:

docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:6.2.3
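
Once the container is up, you can confirm the node is reachable (port 9200 is published by the command above):

curl http://localhost:9200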

Elasticsearch and Java EE Working Together

Eclipse JNoSQL is the bridge between these platforms (Java EE and the search engine). An important point to remember is that Elasticsearch is also a NoSQL document type, so a developer may model the application as such. To use both the standard document behavior and the Elasticsearch API, a programmer needs to use the Elasticsearch extension.

<dependency>
    <groupId>org.jnosql.artemis</groupId>
    <artifactId>elasticsearch-extension</artifactId>
    <version>0.0.5</version>
</dependency>

For this demo, we’ll create a contacts agenda for developers, each of whom has a name, an address, and, of course, the languages that they know. An address has its own fields and becomes a subdocument, that is, a document inside a document.

@Entity("developer")
public class Developer { @Id private Long id; @Column private String name; @Column private List < String > phones; @Column private List < String > languages; @Column private Address address;
} @Embeddable
public class Address { @Column private String street; @Column private String city; @Column private Integer number; }

With the model defined, let’s set the mapping. Mapping is the process of determining how a document and the fields it contains are stored and indexed. For this example, most fields are of type keyword, and those are searchable only by their exact value. There is also the languages field, which we defined as text with a custom analyzer. This custom analyzer, whitespace_analyzer, has one tokenizer, whitespace, and three filters (standard, lowercase, and asciifolding).

{ "settings": { "analysis": { "filter": { }, "analyzer": { "whitespace_analyzer": { "type": "custom", "tokenizer": "whitespace", "filter": [ "standard", "lowercase", "asciifolding" ] } } } }, "mappings": { "developer": { "properties": { "name": { "type": "keyword" }, "languages": { "type": "text", "analyzer": "whitespace_analyzer" }, "phones": { "type": "keyword" }, "address": { "properties": { "street": { "type": "text" }, "city": { "type": "text" }, "number": { "type": "integer" } } } } } }
}

With the API, the developer can do the basic operations of a document NoSQL database, at least CRUD; however, in Elasticsearch, the search engine behavior also matters and is useful. That's why there is an extension.

public class App {

    public static void main(String[] args) {
        Random random = new Random();
        Long id = random.nextLong();
        try (SeContainer container = SeContainerInitializer.newInstance().initialize()) {
            Address address = Address.builder()
                    .withCity("Salvador")
                    .withStreet("Rua Engenheiro Jose")
                    .withNumber(10).build();

            Developer developer = Developer.builder()
                    .withPhones(Arrays.asList("85 85 343435684", "55 11 123448684"))
                    .withName("Poliana Lovelace")
                    .withId(id)
                    .withAddress(address)
                    .build();

            DocumentTemplate documentTemplate = container.select(DocumentTemplate.class).get();
            Developer saved = documentTemplate.insert(developer);
            System.out.println("Developer saved" + saved);

            DocumentQuery query = select().from("developer")
                    .where("_id").eq(id).build();
            Optional<Developer> personOptional = documentTemplate.singleResult(query);
            System.out.println("Entity found: " + personOptional);
        }
    }

    private App() {}
}

With the Elasticsearch extension, the user can use QueryBuilders, a utility class for creating search queries against the database.

public class App3 {

    public static void main(String[] args) throws InterruptedException {
        try (SeContainer container = SeContainerInitializer.newInstance().initialize()) {
            Random random = new Random();
            long id = random.nextLong();
            Address address = Address.builder()
                    .withCity("São Paulo")
                    .withStreet("Av. nove de Julho 1854")
                    .withNumber(10).build();

            Developer developer = Developer.builder()
                    .withPhones(Arrays.asList("85 85 343435684", "55 11 123448684"))
                    .withName("Maria Lovelace")
                    .withId(id)
                    .withAddress(address)
                    .withLanguage("Java SE")
                    .withLanguage("Java EE")
                    .build();

            ElasticsearchTemplate template = container.select(ElasticsearchTemplate.class).get();
            Developer saved = template.insert(developer);
            System.out.println("Developer saved" + saved);

            TimeUnit.SECONDS.sleep(2L);
            TermQueryBuilder query = QueryBuilders.termQuery("phones", "85 85 343435684");
            List<Developer> people = template.search(query);
            System.out.println("Entity found from phone: " + people);

            people = template.search(QueryBuilders.termQuery("languages", "java"));
            System.out.println("Entity found from languages: " + people);
        }
    }

    private App3() {}
}

Conclusion

An intuitive way to find data is a prime requirement in an enterprise application, especially when the software handles massive volumes of data of several kinds. Elasticsearch can help the Java EE world as both a NoSQL document store and a search engine. This post covered how to join the best of these two worlds using Eclipse JNoSQL.

Original Link

Google’s Project Owl to Tackle Unverifiable Content and Fake News

While Google follows 200+ signals to rank search engine results, problems with fake news and problematic content showing up in top results are causing issues with user engagement. Google has been aware of the problem with search quality since November 2016, and as such, it has initiated Project Owl to address the issues in three ways.

First, there will be a feedback form where users can answer questions about featured snippets. Second, there will be a renewed emphasis on sorting and displaying authoritative content in the top results. Third, its policies on autocomplete suggestions may be revised to include a feedback form for reporting problematic search suggestions.

Understanding Problematic Searches

Problematic searches is the term Google uses for searches built on perceptions and concepts with no factual grounding. Content that feeds a certain perception without any basis in the factual world, such as urban myths, rumors, or any derogatory material that influences the collective mentality, is a cause for concern.

In the past, Google has dealt with search spam, poor-quality content, and piracy, but none of these fall under the category of problematic searches. Problematic searches, rather, are about generating and propagating fake news that creates a biased perception.

Previously, Google was aware of the problem, but it did not merit any priority until this happened.

Solutions

As mentioned, Project Owl proposes three solutions. A brief description of each is given below.

1. Improvement in Featured Snippets

Featured snippets are a tool that gives the user an informational gist before the search results display. They are surfaced by Google Assistant and Google Home to generate a quicker answer to the user's query. In the last few months, Google noticed that the answers in the featured snippet section were increasingly becoming problematic. Google is combating this issue with a feedback form that lets the user indicate whether the featured content is offensive, vulgar, or helpful. An example form is shown below.

[Image: featured snippets feedback form]

The feedback will be used to improve the algorithm. Users with Google Home can use the device to send feedback directly. Each piece of feedback submitted will undergo strong consideration and possible implementation.

2. Improvement in Autocomplete

Autocomplete saves time. Although it was originally designed to speed up searching, users are bombarded with unsavory beliefs and perceptions when searching for problematic topics. These autocompleted suggestions can, shockingly, steer the user away from the original search intent. A case study by The Guardian made Google realize the complexity of the problem, prompting the company to find solutions.

As with featured snippets, there is a form to report inappropriate predictions, prompting users to flag problematic autocomplete suggestions such as hateful, racist, sexist, or provocative content. Google has also changed its published policies to spell out the non-legal reasons for removing suggestions, such as pirated content, personal information, and court-ordered removals.

It is too early to tell whether this step will work; however, the Google team assures that each piece of feedback will undergo thoughtful consideration.

3. Authority Content

There will be an increased focus on identifying authoritative content and giving it higher placement in search results. A few changes have been initiated since December 2016, and lately, the team has begun to flag content that is offensive or upsetting.

It is wishful thinking to expect changes to happen and be reflected overnight. The search engine indexes trillions of pages, and correcting even half of those results will take months or years. Hence, as users, we should come forward and help Google with Project Owl through our feedback and suggestions.

Original Link

Search Engines vs. Relational Databases

Recently, we launched a new product, INVESTimate, from HomeUnion. INVESTimate uses machine learning and AI to help determine the investment potential of a residential property. It is powered by big data on 110 million homes, institutional-quality research, and on-the-ground experts with deep insight into local real estate market conditions.

Behind the scenes, there is a lot of data crunching, with data coming from more than 50 sources and key property data mapped to custom-modeled PRICE AVM, RENT AVM, and Neighborhood Investment Rating (NIR) values for more than 100M residential homes and 30,000+ neighborhoods. All this data is stitched together and indexed in the Apache Solr search engine for display in the front-end search portal.

The main purpose of this article is to discuss why we chose Solr Search Engine vs. MySQL or any relational database for storing, indexing, and retrieving data.

First, let’s understand the five key differences between search engines and relational databases. In our case, Apache Solr is our chosen search engine and MySQL is our RDBMS:

  • Transaction capability: MySQL supports the ACID properties (Atomicity, Consistency, Isolation, Durability); Solr has very little or no support for ACID.
  • Partitioning: MySQL supports horizontal partitioning and sharding; Solr supports only sharding.
  • Consistency: MySQL provides immediate consistency; Solr provides eventual consistency.
  • Keys: MySQL supports primary and foreign keys; Solr supports only a primary key.
  • Model: MySQL follows the relational model; Solr is a document store.

Now, let’s look at how and where we can efficiently use an RDBMS vs. a search engine by taking a simple use case. Let’s say an investor is looking for an investment property located in Dallas, with a budget of $150,000, next to the best schools, in a simple drill-down, wizard-type experience. This would be a perfect use case for an RDBMS-based solution, as the desired results can be presented to users as a series of fixed, structured queries on the database. First, the top-level query can select all properties within Dallas; then, it can filter properties priced at or below $150K and sort them by school ranking within each neighborhood. The investor can finally pick a property of their liking.

Now let’s take a simple use case in which search engines are very useful. Say an investor wants a “textual” search experience and simply types, “Find an investment property in the range of 150,000 with 8% yield.” As you know, Solr stores data as documents, and each document comprises multiple fields with values. The document is the unit of search and indexing. The text submitted by the user is tokenized and matched against all the documents, and results are displayed to the user based on relevancy. This allows for a better user experience: users find what they want in a fast and efficient way.
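
To make the contrast concrete, here is a rough sketch of what the two styles could look like against Solr's select handler; the collection name (properties) and the field names are assumptions for illustration, not our actual schema:

# Structured drill-down: filter queries plus a sort, mirroring the RDBMS-style wizard
curl -G "http://localhost:8983/solr/properties/select" \
  --data-urlencode "q=*:*" \
  --data-urlencode "fq=city:Dallas" \
  --data-urlencode "fq=price:[* TO 150000]" \
  --data-urlencode "sort=school_rank desc"

# Free-text query: let Solr tokenize the text and rank documents by relevancy
curl -G "http://localhost:8983/solr/properties/select" \
  --data-urlencode "q=investment property 150000 8% yield" \
  --data-urlencode "df=description"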

Technical Usage

  1. All of our 110M properties with key attributes are processed and enriched in a big data environment. The processed data gets stored and indexed in the Solr Engine on a regular basis using DIH (Data Import Handler) within 15 minutes.

  2. We use Solr as read-only for better performance. All queries coming from our INVESTimate website hit our Solr engine for most of the data required to be served.

  3. We created a Java interface using SolrJ for the front-end to interact with the Solr engine. This completely encapsulates and decouples Solr from the front-end code and allows services to scale independently.

  4. Our HomeUnion Asset Recommendation Engine (HARE) is built on top of the Solr engine for recommending properties and searching portfolios. We have used facets and boosting extensively to recommend search results to our investors (a small faceting sketch follows this list).

  5. We reduced the load on our MySQL database: 90% of our searches are served by a Solr engine hosted within AWS, thereby reducing cost and improving query performance by more than 80%.
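
As a rough illustration of the faceting mentioned in point 4, Solr can bucket matching documents by a field in the same request that returns them; the collection and field names below are assumptions:

curl -G "http://localhost:8983/solr/properties/select" \
  --data-urlencode "q=*:*" \
  --data-urlencode "fq=city:Dallas" \
  --data-urlencode "facet=true" \
  --data-urlencode "facet.field=neighborhood" \
  --data-urlencode "rows=10"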

Hopefully, this article was useful for learning some practical use cases for search engines.

Original Link

Mastering RediSearch (Part 3)

Today, we’re going to dive quite a bit deeper and make something useful with Node.js, RediSearch, and the client library we started in Part 2.

While RediSearch is a great full-text search engine, it’s much more than that and has extensive power as a secondary index for Redis. Considering this, let’s get a dataset that contains some more field-based data. I’ll be using the TMDB (the movie database) dataset. You can download this dataset from the Kaggle TMDB page. The data is formatted in CSV over two files and has a few features that we’re not going to use yet, so we’ll need to build an ingestion script.

To follow along, it would be best to get the chapter-3 branch of the GitHub repo.

Understanding the Dataset Structure

The first file is tmdb_5000_credits.csv, which contains the cast and crew information. Though cast and crew rows are modeled a bit differently, they do share some features. Initially, it’s not a very usable CSV file since two columns (cast, crew) contain JSON.

  • movie_id (column): This correlates with a movie row in the other file.
  • title (column): The title of the movie identified with the movie_id.
  • cast (column with JSON):
    • character: The name of the character.
    • cast_id: An identifier of the character across multiple movies.
    • name: The name of the actor or actress.
    • credit_id: A unique identifier of this credit.
  • crew (column with JSON):
    • department: The department or category of this role.
    • job: The name of the role on set.
    • name: The name of the crew member.
    • credit_id: A unique identifier of this credit.

Another problem with the CSV is that it’s quite huge for a file of its type: 40 MB (for comparison, the complete works of Shakespeare are just 5.5 MB). While Node.js can handle files of this size, it’s certainly not all that efficient. To ingest this data optimally, we’ll be using csv-parse, a streaming parser. This means that as the CSV file is read in, it is also parsed and events are emitted. Streamed parsing is a fitting addition to the already high-performing RediSearch.

Each row in the CSV represents a single movie, with about 4,800 movies in the file. Each movie, as you might imagine, has dozens to hundreds of cast and crew members. All in all, you can expect to index about 235,838 cast and crew members — each represented by nine fields.

The other file in the TMDB dataset is the movie data. This is quite a bit more straightforward. Each movie is a row with a few columns that contain JSON data. We can ignore those JSON data columns for this part in the series. Here is how the fields are represented:

  • movie_id (column): A unique ID for each movie.

  • budget (column): The total film budget.

  • original_language (column): ISO 639-1 version of the original language.

  • original_title (column): The original title of the film.

  • overview (column): A few sentences about the film.

  • popularity (column): The film’s popularity ranking.

  • release_date (column): The film’s release date (in YYYY-MM-DD format).

  • revenue (column): The amount of money it earned (USD).

  • runtime (column): The film’s runtime (in minutes).

  • status (column): The film’s release status (“released” or “unreleased”).

  • Ignored columns: genres, production_companies, keywords, production_countries, spoken_languages

Importing Data From TMDB

Now that we’ve explored both the fields and how to create an index, let’s move forward with creating our actual indexes and putting data into them.

Thankfully, most of the data in these files is fairly clean, but that doesn’t mean we don’t need to make adjustments. In the cast/crew file, we have the challenge of cast and crew entries having slightly different sets of data. In the data, each row represents a movie; the cast column has all the cast members, while the crew column has all the crew members. So, when we’re representing this in the schema, we’re effectively creating a union of the fields (since there is overlap). cast and crew are numeric fields that are set to “1” for each kind of credit; think of them like flags.

For the movie database, we’re going to convert release_date to a number. We’ll simply parse the date into a JavaScript timestamp and store it in a numeric field. Finally, we’ll ignore a number of fields — we’ll just compare columns’ keys to an array of columns in order to skip (ignoreFields).

From a project-structure standpoint, we may end up doing more with our fieldDefinitions later on, so we’ll store both schemas in a Node.js module. This is purely optional but is a clean pattern that reduces the likelihood of having to duplicate your code later on.

module.exports = {
  movies : function(search) {
    return [
      search.fieldDefinition.numeric('budget',true),
      //homepage is just stored, not indexed
      search.fieldDefinition.text('original_language',true,{ noStem : true }),
      search.fieldDefinition.text('original_title',true,{ weight : 4.0 }),
      search.fieldDefinition.text('overview',false),
      search.fieldDefinition.numeric('popularity'),
      search.fieldDefinition.numeric('release_date',true),
      search.fieldDefinition.numeric('revenue',true),
      search.fieldDefinition.numeric('runtime',true),
      search.fieldDefinition.text('status',true,{ noStem : true }),
      search.fieldDefinition.text('title',true,{ weight : 5.0 }),
      search.fieldDefinition.numeric('vote_average',true),
      search.fieldDefinition.numeric('vote_count',true)
    ];
  },
  castCrew : function(search) {
    return [
      search.fieldDefinition.numeric('movie_id',false),
      search.fieldDefinition.text('title',true, { noStem : true }),
      search.fieldDefinition.numeric('cast',true),
      search.fieldDefinition.numeric('crew',true),
      search.fieldDefinition.text('name', true, { noStem : true }),
      //cast only
      search.fieldDefinition.text('character', true, { noStem : true }),
      search.fieldDefinition.numeric('cast_id',false),
      //crew only
      search.fieldDefinition.text('department',true),
      search.fieldDefinition.text('job',true)
    ];
  }
};

For importing both movies and credits, we’ll be using an Async queue and the above-mentioned streaming CSV parser. These two modules work similarly and well together, but have some different terminology in their syntax. First, the CSV parser will read a chunk of data (parser.on('readable', ...)) and then read a full row at a time (while(record = parser.read()) { ... }). Each row is manipulated and readied for RediSearch (csvRecord(record)). In this function, a few fields are formatted while some are ignored, and finally, the item is pushed into the queue (q.push(...)).

Async is a very useful JavaScript library that provides a huge number of metaphors for handling asynchronous behavior. The queue implementation is pretty fun: items are pushed into a queue and are processed at a given concurrency by a single worker function defined at instantiation. The worker function has two arguments: the item to be processed and a callback. Once the callback has run, the next item is available for processing (up to the given concurrency). There is a great animation that explains it:

[Animation from the async documentation illustrating how queue items are processed]

The other feature of the queue that we’ll be using is the drain function. This function executes when there are no items left in the queue, i.e. the queue was in a working state but is no longer processing anything. It’s important to understand that a queue is never “finished”; it just becomes idle. Given that we’ll be using a streaming parser, it’s possible that RediSearch ingests faster than the CSV parser emits rows, resulting in an empty Async queue (triggering drain) before the file is done. To address this potential problem, a total counter is incremented as each record is added to the queue, a processed counter is incremented as each record is successfully indexed, and the CSV parser flips a parsed variable from “false” to “true” when it finishes. So when drain is called, we check whether parsed is true and whether the value of processed matches total. If both of these conditions are true, we know that all the values have been parsed from the CSV and that everything has been added to our RediSearch index. After you’ve successfully added an item to the index, you invoke the callback for the worker function and the queue manages the rest.

As mentioned earlier, the credits CSV is more complex, with each row in the table representing multiple cast/crew members. To manage this complexity, I’ll be using a batch. We’ll be using the same overall structure with the CSV parser and Async queue, but each row will contain multiple calls to RediSearch via the batch (one for each cast or crew member). Instead of pushing a plain object into the queue, we’ll actually push a RediSearch batch. In the worker function, we’ll call exec on it and then the callback. While in the movies CSV we’ll have a single Redis (RediSearch) command per movie, in the credits CSV we’ll have a single batch (made up of dozens of individual features) for each movie.

The two imports are different enough to warrant separate import scripts. To import the movies, you’ll run the script like this:

$ node import-tmdb-movies.node.js --connection ./your-connection-object-as.json --data ./path-to-your-movie-csv-file/tmdb_5000_movies.csv

And the credits will be imported like this:

$ node import-tmdb-credits.node.js --connection ./your-connection-object-as.json --data ./path-to-your-credits-csv-file/tmdb_5000_credits.csv

A killer feature of RediSearch is that as soon as your data is indexed, it’s available for query. True real-time stuff!

Searching the Data

Now that we’ve ingested all this data, let’s write some scripts to get it back out of RediSearch. Despite the data being quite different between the cast/crew and the movie datasets, we can write a single, short script to do the searching.

This little script will allow you to pass in a search string from the command line (--search="your search string") and also designate the database you’re searching through (--searchtype movies or --searchtype castCrew). The other command-line arguments are the connection JSON file (--connection path-to-your-connection-file.json) and optional arguments to set the offset and number of results (--offset and --resultsize, respectively).

After we instantiate the module by passing in the argv.searchtype, we’ll just need to use the search method to send the search query and options as an object to RediSearch. Our library from the last section takes care of building the arguments that will be passed through to the FT.SEARCH command.

In the callback, we get a very standard-looking error-first signature. The second argument has the results of our search, pre-parsed and formatted in a useful way: each result has its own object with two properties, docId and doc. The docId property contains the unique identifier, and doc contains the document as an object.

All we need to do is JSON.stringify the results (so console.log won’t display [Object object]) and then quit the client.

You can try it out by running the following command:

$ node search.node.js --connection ./your-connection-object-as.json --search="star" --searchtype movies --offset 0 --resultsize 1

This should return an entry about the movie Lone Star (anyone seen it? No, I didn’t think so). Now, let’s look in the cast/crew index for anything with the same movie_id:

$ node search.node.js --connection ./your-connection-object-as.json --search="@movie_id:[26748 26748]" --searchtype castCrew

This will give you the first ten items of the cast and crew for Lone Star. Looks straightforward except for the search string — why do you have to repeat 26748 twice? In this case, it’s because the movie_id field in the database is numeric and numerics can only be limited by a range.
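
The same query can be issued directly from redis-cli against whichever index key the import script created (the index name below is a placeholder):

FT.SEARCH castcrew-idx "@movie_id:[26748 26748]" LIMIT 0 10

Using the same value as both the lower and upper bound is simply how you express an exact match on a numeric field.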

Grabbing a Document

Getting a document from RediSearch is even easier. Basically, we instantiate everything the same way as we did with the search script, but we don’t need to supply any options and instead of a search string we’re getting a docId.

We just need to pass the docId to the getDoc method (abstracting the FT.GET command) along with a callback and we’re in business! The show function is the same as in the search.

This will work equally well for both cast/crew or movie documents:

$ node by-docid.node.js --connection ./your-connection-object-as.json --docid="56479e8dc3a3682614004934" --searchtype castCrew

Dropping the Index

If you try to import a CSV file twice, you’ll get an error similar to this:

if (err) { throw err; }
ReplyError: Index already exists. Drop it first!

This is because you can’t just create an index over a pre-existing one. There is a script included to quickly drop one of your indexes. You can run it like this:

$ node drop-index.node.js --connection ./your-connection-object-as.json --searchtype movies

Status and Next Steps

In this installment, we’ve covered how to parse large CSV files and import them into RediSearch efficiently with a couple of scripts tailored to their different structures. Then, we built a script to run search queries over this data, grab individual documents, and drop an index. We’ve now learned most of the steps to managing the lifecycle of a dataset.

In our next installment, we’ll build out a few more features in our library to better abstract searching and add in a few more options. Then, we’ll start building a web UI to search through our data. Stay tuned to the Redis Labs blog!

Original Link

Elasticsearch for Dummies

Have you heard about the popular open-source tool for searching and indexing that is used by giants like Wikipedia and LinkedIn? No? I’m pretty sure you’ve heard of it in passing.

I’m talking about Elasticsearch. In this blog, you’ll get to know the basics of Elasticsearch, its advantages, how to install it, and how to index documents using Elasticsearch.

What Is Elasticsearch?

Elasticsearch is an open-source, enterprise-grade search engine that can power extremely fast searches and support all data discovery applications. With Elasticsearch, we can store, search, and analyze big volumes of data quickly and in near real-time. It is generally used as the underlying search engine that powers applications that have simple/complex search features and requirements.

Advantages of Elasticsearch

  • Built on top of Lucene: Being built on top of Lucene, it offers the most powerful full-text search capabilities.

  • Document-oriented: It stores complex entities as structured JSON documents and indexes all fields by default, providing higher performance.

  • Schema-free: It stores a large quantity of semi-structured (JSON) data in a distributed fashion. It also attempts to detect the data structure and index the present data, making it search-friendly.

  • Full-text search: Elasticsearch performs linguistic searches against documents and returns the documents that match the search condition. Result relevancy for the given query is calculated using the TF/IDF algorithm.

  • RESTful API: Elasticsearch exposes a REST API, which is a lightweight way to interact with it. We can query Elasticsearch over the REST API with the Chrome plug-in Sense. Sense provides a simple user interface and has features like autocompleting Elasticsearch query syntax and copying a query as a cURL command.

Elasticsearch Terminology

  • Cluster: A collection of nodes that share data.

  • Node: A single server that is part of the cluster, stores the data, and participates in the cluster’s indexing and search capabilities.

  • Index: A collection of documents with similar characteristics. An index is roughly equivalent to a schema in an RDBMS.

  • Type: There can be multiple types within an index. For example, an e-commerce application can have used products in one type and new products in another type of the same index. One index can have multiple types, just as one database can have multiple tables.

  • Document: A basic unit of information that can be indexed; it is like a row in a table.

  • Shards and replicas: Elasticsearch indexes are divided into multiple pieces called shards, which allows the index to scale horizontally. Elasticsearch also allows us to make copies of index shards, which are called replicas.
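
As a quick example of that last point, the shard count is fixed when an index is created (the replica count can be changed later); a minimal sketch of setting both at creation time (the index name and counts are arbitrary):

curl -XPUT 'localhost:9200/products?pretty' -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  }
}'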

Use Cases

E-commerce websites use Elasticsearch to index their entire product catalog and inventory, with all the product attributes that end users can search against.

Whenever a user searches for a product on the website, the corresponding query will hit an index with millions of products and retrieve the product in near real-time.

Or, say you want to collect log or transaction data and want to analyze and mine this data to look for statistics, summarizations, or anomalies. In this case, you can index this data into Elasticsearch. Once the data is in Elasticsearch, we can visualize it with Timelion or D3.js to better understand the collected logs.

Installation

Let’s assume that you are in a Linux-based environment and that you have JDK 8 or above installed (Elasticsearch 5.x requires Java 8). Let’s get on with downloading Elasticsearch using the command below:

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.4.0.tar.gz

Then extract it:

tar -zxvf elasticsearch-5.4.0.tar.gz

Go to the folder where Elasticsearch has been installed:

cd elasticsearch-5.4.0

Start the Elasticsearch server:

bin/elasticsearch

You can access it at http://localhost:9200 on your web browser. Here, localhost denotes the host (server) and the default port of Elasticsearch is 9200.

To confirm everything is working fine, type http://localhost:9200 into your browser. You should see something like this.

{
  "name" : "90AzDAw",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "e6t_hv6eQCi280elcktrUQ",
  "version" : {
    "number" : "5.4.0",
    "build_hash" : "780f8c4",
    "build_date" : "2017-04-28T17:43:27.229Z",
    "build_snapshot" : false,
    "lucene_version" : "6.5.0"
  },
  "tagline" : "You Know, for Search"
}

Indexing Documents

Elasticsearch uses Lucene indexes to store and retrieve data. Adding data to Elasticsearch is known as indexing. While performing an indexing operation, Elasticsearch converts raw data into its internal documents. Each document is simply a set of corresponding keys and values: the keys are strings, and the values can be one of numerous data types such as strings, numbers, lists, dates, etc.

We can query Elasticsearch using the methods mentioned below:

  • cURL command

  • Using an HTTP client

  • Querying with the JSON DSL

Elasticsearch provides a REST API that we can interact with in a variety of ways through common HTTP methods like GET, POST, PUT, and DELETE, which map to the usual CRUD operations.

Now, let’s try indexing some data in our Elasticsearch instance.

curl -XPUT 'http://localhost:9200/patient/outpatient/1?pretty' -d'
{
  "name" : "John",
  "City" : "California"
}'

This command will insert the JSON document into an index named patient with the type named outpatient. 1 is the ID here. If you don’t want to provide an ID, you can POST to the type endpoint instead and Elasticsearch will generate one for you. pretty is used to pretty-print the JSON response. To replace an existing document with updated data, we just PUT it again.

By using the above method, we can insert one document at a time. In order to bulk load the data, we can use the Bulk API of Elasticsearch.

curl -XPOST 'localhost:9200/patient/outpatient/_bulk?pretty&refresh' --data-binary "@/home/ubuntu/Ex.json"

The above command loads the Ex.json file into the patient index.
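
For this to work, Ex.json must follow the Bulk API's newline-delimited format: an action line followed by a source line for each document, with a trailing newline at the end of the file. A minimal sketch of such a file (the records are made up for illustration); since the URL above already names the index and type, the action lines only need the document IDs:

{ "index" : { "_id" : "2" } }
{ "name" : "Jane", "City" : "Texas" }
{ "index" : { "_id" : "3" } }
{ "name" : "Mark", "City" : "Oregon" }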

Retrieving a Document

Retrieving a document from an index can be done with a GET request.

curl -XGET 'localhost:9200/patient/outpatient/1?pretty'

The response of this command contains the resulting JSON document under the _source field.

{
  "_index" : "patient",
  "_type" : "outpatient",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "name" : "John",
    "City" : "California"
  }
}

It returns the document with the ID 1 and some metadata about the document.

Deleting a Document

This API allows us to delete a JSON document from an index.

curl -XDELETE 'localhost:9200/patient/outpatient/1?pretty'

This command deletes the JSON document with the ID 1. In order to delete a document that matches a specific condition, we can use the _delete_by_query API.

curl -XPOST 'localhost:9200/patient/_delete_by_query?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": { "City": "California" }
  }
}'

That’s how we index a document using Elasticsearch.

In terms of both configuration and usage, Elasticsearch is quite elastic in comparison to its peers. Systems working with big data may encounter I/O bottlenecks due to data analysis and search operations. For systems like these, Elasticsearch would be the ideal choice.

Original Link

Mastering RediSearch (Part 2)

In our last installment, we started looking at RediSearch, the Redis search engine built as a module. We explored the curious nature of the keys and indexed a single document. In this segment, we’ll lay the groundwork necessary to make working with RediSearch more productive and useful in Node.js.

Welcoming RediSearch Into JavaScript

Now, we could certainly bring in all this data using the RediSearch commands directly or with the bindings, but with a large amount of data, using direct syntax becomes difficult to manage. Let’s take some time to develop a small Node.js module that will make our lives easier.

I’m a big fan of the so-called “fluent” JavaScript syntax, wherein you chain methods together so that functions are separated by dots when operating over a single object. If you’ve used jQuery, then you’ve seen this style.

$('.some-class')
  .css('color','red')
  .addClass('another-class')
  .on('click',function() { ... });

This approach will present some challenges. Firstly, we need to make sure that we can interoperate with “normal” Redis commands and still be able to use pipelining/batching (we’ll address the use of MULTI in a later installment). Also, RediSearch commands have a highly variadic syntax (for example, commands can have a small or large number of arguments). Translating this directly into JavaScript wouldn’t gain us much over the simple bindings. We can, however, leverage a handful of arguments and then supply optional arguments in the guise of function-level options objects. What I’m aiming to design looks a little like this:

const myRediSearch = rediSearch(redisClient,'index-key');
myRediSearch.createIndex([ ...fields... ],cbFn);
myRediSearch
  .add(itemUniqueId,itemDataAsObject,cbFn)
  .add(anotherItemUniqueId,anotherItemDataAsObject,addOptions,cbFn);

Overall, this is a much more idiomatic way of doing things in JavaScript and that’s important when trying to get a team up to speed, or even just to improve the development experience.

Another goal of this module is to make the results more usable. In Redis, results are returned in what is known as a “nested multi-bulk” reply. Unfortunately, this can get quite complex with RediSearch. Let’s take a look at some results returned from redis-cli:

1) (integer) 564
2) "52fe47729251416c75099985"
3) 1) "movie_id" 2) "18292" 3) "title" 4) "George Washington" 5) "department" 6) "Editing" 7) "job" 8) "Editor" 9) "name" 10) "Zene Baker" 11) "crew" 12) "1" 4) "52fe48cbc3a36847f8179cc7" 5) 1) "movie_id" 2) "55420" 3) "title" 4) "Another Earth" 5) "character" 6) "Kim Williams" 7) "cast_id" 8) "149550" 9) "name" 10) "Jordan Baker" 11) "cast" 12) "1"

So, when using node_redis, you would get nested arrays at two levels — but positions are associative (except for the first one which is the number of results). Without writing an abstraction, it’ll be a mess to use. We can abstract the results into more meaningful nested objects with an array to represent the actual results. The same query would return this type of result:

{ "results": [ { "docId": "52fe47729251416c75099985", "doc": { "movie_id": "18292", "title": "George Washington", "department": "Editing", "job": "Editor", "name": "Zene Baker", "crew": "1" } }, { "docId": "52fe48cbc3a36847f8179cc7", "doc": { "movie_id": "55420", "title": "Another Earth", "character": "Kim Williams", "cast_id": "149550", "name": "Jordan Baker", "cast": "1" } } ], "totalResults": 564, "offset": 0, "requestedResultSize": 2, "resultSize": 2
}


So, let’s get started on writing a client library to abstract RediSearch.

RediSearchClient Abstraction Components

Let’s first examine the entire “stack” of components that let you access RediSearch at a higher level.

[Your Application]
├── RediSearchClient - Abstraction
│   ├── node_redis-redisearch - Bindings to Redis module commands
└───┴── node_redis - Redis library for Node.js
    └── Redis - Data Store
        └── RediSearch - Redis Module


This is a bit confusing due to the terminology and duplication, but each layer has its own job.

node_redis-redisearch just provides the commands to node_redis, without any parsing or abstraction. node_redis just opens up the world of Redis to JavaScript. Got it? Good.

Detecting RediSearch Bindings

Since RediSearch isn’t a default part of Redis, we need to check that it is installed. We’re going to make the assumption that RediSearch is installed on the underlying Redis server. If it isn’t installed, then you’ll simply get a Redis error similar to this:

ERR unknown command 'ft.search'

Not having the bindings is a more subtle error (complaining about an undefined function), so we’ll build in a simple check for the ft_create command on the instance of the Redis client.

Creating the Client

To be able to manage multiple different indexes and potentially different clients in a way that isn’t syntactically ugly and inefficient, we’ll use a factory pattern to pass in both the client and the index key. You won’t need to pass these again. The last two arguments are optional: an options object and/or a callback.

It looks like this:

...
rediSearchBindings(redis);
let mySearch = rediSearch(client,'my-index');
//with optional options object
let mySearch = rediSearch(client,'my-index', { ... });
//with optional options object and callback.
let mySearch = rediSearch(client,'my-index', { ... }, function() { ... });
...

The callback here doesn’t actually provide an error in its arguments; it is just issued when the node_redis client is ready. It is entirely optional and provided primarily for benchmarking so you don’t start counting down the time until the connection is fully established.

Another useful feature of this function is that the first argument can optionally be the node_redis module. We’ll also automatically add in the RediSearch bindings in this case. You can designate this library to manage the creation of your client and specify other connection preferences in the options object located at clientOptions. Many scripts have specialized connection management routines so it is completely optional to pass either a client or the node_redis module.

We’ll be using similar signatures for most functions and the final two arguments are optional: an options object and a callback. Consistency is good.

Creating an Index

Creating an index in RediSearch is a one-time affair. You set up your schema prior to indexing data and then you can’t alter the schema without re-indexing the data.

As previously discussed, there are three basic types of indexes in RediSearch:

  1. Numeric

  2. Text

  3. Geo

(Note: There is a fourth type of index, the tag index, but we’ll cover that in a later installment.)

Each field can have a number of options, and this can be a lot to manage! So, let’s abstract this by returning a fieldDefinition object that has three functions: numeric, text, and geo. Seems familiar, eh?

All three methods have two required arguments, and text fields have an optional options object. They are, in this order:

  1. Field name: String
  2. Sortable: Boolean
  3. Options: Object (optional, text fields only) with two possible properties: noStem (do not stem words) and weight (sorting weight)

These methods return arrays of strings that can be used to build a RediSearch index. Let’s take a look at a few examples:

mySearch.fieldDefinition.text('companyName',true,{ noStem : true }); // -> [ 'companyName', 'TEXT', 'NOSTEM', 'SORTABLE' ]
mySearch.fieldDefinition.numeric('revenue',false); // -> [ 'revenue', 'NUMERIC' ]
mySearch.fieldDefinition.geo('location',true); // -> [ 'location', 'GEO', 'SORTABLE' ]

So, what do we do with these little functions? Of course, we use them to specify a schema.

mySearch.createIndex([
    mySearch.fieldDefinition.text('companyName',true,{ noStem : true }),
    mySearch.fieldDefinition.numeric('revenue',false),
    mySearch.fieldDefinition.geo('location',true)
  ],
  function(err) {
    /* ... do stuff after the creation of the index ... */
  }
);

This makes a clear and expressive statement on the fields in the schema. One note here: While we use an array to contain the fields, RediSearch has no concept of order in fields, so it doesn’t really matter in which order you specify fields in the array.

Adding Items to an Index

Adding the item to a RediSearch index is pretty simple. To add an item, we supply two required arguments and consider two optional arguments. The required arguments are (in order):

  1. A unique ID

  2. The data as an object

The two optional arguments follow our common signature: options and a callback. As per common Node.js patterns, the first argument of the callback is an error object (unset if no errors) and the second argument of the callback is the actual data.

myRediSearch
  .add('kyle', {
      dbofchoice       : 'redis',
      languageofchoice : 'javascript'
    },
    { score : 5 },
    function(err) {
      if (err) { throw err; }
      console.log('added!');
    }
  );

Batches (AKA Pipelines)

Batch, or “pipeline” as it’s called in the non-Node.js Redis world, is a useful structure in Redis: it allows multiple commands to be sent at a time without waiting for a reply to each command.

The batch function works pretty similarly to any batch you’d find in node_redis — you can chain them together with an exec() at the end. This does cause a conflict, though. Since “normal” node_redis allows you to batch together commands, you need to distinguish between RediSearch and non-RediSearch commands. First, you need to start a RediSearch batch using one of two methods:

Start a new batch:

let searchBatch = mySearch.batch() // a new, RediSearch enhanced batch

Or, with an existing batch:

let myBatch = client.batch();
let searchBatch = mySearch.batch(myBatch) // a batch command, perhaps already in progress

After you have created the batch, you can add normal node_redis commands to it or you can use RediSearch commands.

searchBatch
  .rediSearch.add(...)
  .hgetall(...)
  .rediSearch.add(...)

Take note of the HGETALL stuck in the middle of this chain; this is to illustrate that you can intermix abstracted RediSearch commands with ‘normal’ Redis commands. Cool, right?

As mentioned earlier, the output of RediSearch (and many Redis commands) is likely in a form that you wouldn’t use directly. FT.GET and FT.SEARCH produce interleaved field/value results that get represented as an array, for example. The idiomatic way of dealing with data like this in JavaScript is through plain objects. So, we need to do some simple parsing of the interleaved data. There are many ways to accomplish this, but the simplest way is to use a lodash chain to first chunk the array into two-length individual arrays then use the fromPairs function to convert the two-length arrays into field/values in a single object. We’ll be using this quite a bit, so we’ll contain it in the non-public function deinterleave in order to reduce repetition.

const deinterleave = function(doc) {
  // `doc` is an array like this `['fname','kyle','lname','davis']`
  return _(doc)      // Start the lodash chain with `doc`
    .chunk(2)        // `chunk` converts `doc` to `[['fname','kyle'],['lname','davis']]`
    .fromPairs()     // `fromPairs` converts paired arrays into an object `{ fname : 'kyle', lname : 'davis' }`
    .value();        // Stop the chain and return it back
}


If we didn’t need to contend with pipelines, adding these parsing functions would be a somewhat simple process of monkey patching the client. But with batches in node_redis, the results are provided both in a function-level callback and at the end of the batch, with many scripts omitting function-level callbacks and just dealing with all the results at the end. Given this, we need to make sure that the commands parse these values only when needed, but always at the end.

Additionally, this opens up a can of worms when writing our abstraction. Normal client objects and pipeline objects both need RediSearch-specific commands injected. To prevent writing two different repetitious functions, we need to have one function that can be dynamically injected. To accomplish this, the factory pattern is employed: the outer function is passed a client or pipeline object (let’s call it cObj) and then returns a function with the normal arguments. cObj can represent either a pipeline or just a node_redis client.

Thankfully, node_redis is consistent in how it handles pipelined and non-pipelined commands, so the only thing that changes is the object being chained. There are only two exceptions:

  1. In the commands that need special result parsing, we augment the pipeline object with a parser property that is itself a plain object. This contains the appropriate parsing function to be completed at the end. We need to use a plain object here rather than an array in order to avoid sparseness when the parsing is not needed.
  2. To enable chaining you need to be able to return the correct value: either the general rediSearch object for non-pipelined calls or the pipeline object itself.

These two exceptions only apply when pipelined, so we need to be able to detect pipelining. To do this, we look at the name of the constructor; that check is abstracted into the function chainer.
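To make the shape of that concrete, here is a minimal sketch of the factory plus the constructor-name check. The command dispatch is elided, the helper names are stand-ins, and checking for a constructor named 'Multi' is an assumption about node_redis internals rather than the module's exact code:

// simplified sketch, not the module's actual implementation
function chainer(cObj, rediSearchObj) {
  // batch/multi objects are built by a different constructor than the plain
  // client, so the constructor's name tells us whether we are pipelined
  return cObj.constructor.name === 'Multi' ? cObj : rediSearchObj;
}

// factory: cObj is either a node_redis client or a batch/multi object;
// the returned function is what gets attached as, say, `search`
function makeSearch(cObj, rediSearchObj, index) {
  return function(query, options, cb) {
    // ...build the FT.SEARCH arguments and issue them against cObj here...
    return chainer(cObj, rediSearchObj);   // hand back the right object for further chaining
  };
}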

Searching

In the RediSearch module, search is executed with the FT.SEARCH command, which has a ton of options. We'll abstract this into our search method. At this point, we're going to provide only the bare minimum of search capabilities — we'll pass in a search string (which can use RediSearch's extensive query language), then an optional options argument, and finally a callback. Technically the callback is optional, but it would be silly not to include it.

In our initial implementation, we’ll just make a couple of options available:

  • offset: Where to begin the result set
  • numberOfResults: The number of results to be returned

These options map directly to the RediSearch LIMIT argument (very similar to the LIMIT argument found throughout SQL implementations).
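In terms of raw arguments, these two options simply fill in the LIMIT clause. Here is a rough sketch of the argument building (the function name and defaults are illustrative, not the module's exact code):

// illustrative only: turn the search options into FT.SEARCH arguments
function buildSearchArgs(index, query, options) {
  const offset = options.offset || 0;            // default to the start of the result set
  const num    = options.numberOfResults || 10;  // assume a page size of 10 when unspecified
  return [index, query, 'LIMIT', offset, num];   // e.g. FT.SEARCH the-bard spot LIMIT 0 10
}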

The search also implements a result parser to make things a little more usable. The output object ends up looking like this:

{ "results": [ { "docId": "19995", "doc": { "budget": "237000000", "homepage": "http://www.avatarmovie.com/", "original_language": "en", "original_title": "Avatar", "overview": "In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.", "popularity": "150.437577", "release_date": "1260403200000", "revenue": "2787965087", "runtime": "162", "status": "Released", "tagline": "Enter the World of Pandora.", "title": "Avatar", "vote_average": "7.2", "vote_count": "11800" } } ], "totalResults": 1, "offset": 0, "requestedResultSize": 10, "resultSize": 1
}


The property results is an ordered array of the results (with the most relevant results at the top). Notice that each result has both the ID of the document (docId) and the fields in the document (doc). totalResults is the number of items in the index that match the query (irrespective of any limiting). requestedResultSize is the maximum number of results to be returned. resultSize is the number of results actually returned.
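For context, the raw FT.SEARCH reply is a flat array: the first element is the total match count, followed by alternating document IDs and interleaved field/value arrays. Here is a hedged sketch of how that maps onto the object above, reusing the deinterleave helper from earlier (illustrative only, not the module's exact parser):

// illustrative sketch of shaping a raw FT.SEARCH reply
function parseSearchReply(raw, offset, requestedResultSize) {
  const results = [];
  for (let i = 1; i < raw.length; i += 2) {            // raw[0] is the total count
    results.push({
      docId : raw[i],                                  // document ID
      doc   : deinterleave(raw[i + 1])                 // interleaved fields -> plain object
    });
  }
  return {
    results             : results,
    totalResults        : raw[0],
    offset              : offset,
    requestedResultSize : requestedResultSize,
    resultSize          : results.length
  };
}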

Getting a Document

In the previous section, you may have noticed the docId property. RediSearch stores each document by a unique ID that you need to specify at the time of indexing. Documents can be retrieved by searching or by directly fetching the docId using the RediSearch command FT.GET. In our abstraction, we'll call this method getDoc (get has a specific meaning in JavaScript, so it should be avoided as a method name). getDoc, like most other commands in our module, has a familiar argument signature:

  • docId is the first and only required argument. You pass in the ID of the previously indexed item.
  • options is the second argument and is optional. We aren't actually using it yet, but we'll keep it here for future expansion.
  • cb is the third argument and is technically optional; this is where you provide your callback function to get your results.

Like the search method, getDoc does some parsing to turn the document from an interleaved array into a plain JavaScript object.
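A hedged usage sketch, assuming an instance named data like the one built in the full example below:

data.getDoc(57956, {}, function(err, doc) {
  if (err) { throw err; }
  console.log(doc.play);   // the interleaved reply has already been turned into a plain object
});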

Getting Rid of an Index

One more important thing to cover before we have a minimal set of functionality: dropIndex. It's just a simple wrapper for the FT.DROP command, and it's a little different in that all it takes is a callback that fires once the index has been dropped.

Neither dropIndex nor createIndex allows for chaining, as the nature of these commands prevents them from having further chained functions.
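So the call is about as small as it gets (again assuming an instance named data):

data.dropIndex(function(err) {
  if (err) { throw err; }   // once this fires, the index has been dropped
});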

Conclusion

In this piece, we’ve discussed the creation of a limited abstraction library for RediSearch in Node.js, as well as its syntax. Reaching back to our previous piece, let’s look at the same small example to see the complete index lifecycle.

/* jshint node: true, esversion: 6 */
const
  argv = require('yargs')                    // `yargs` is a command line argument parser
    .demand('connection')                    // pass in the node_redis connection object location with '--connection'
    .argv,                                   // return it back as a plain object
  connection = require(argv.connection),     // load and parse the JSON file at `argv.connection`
  redis = require('redis'),                  // node_redis module
  rediSearch = require('./index.js'),        // rediSearch abstraction library
  data = rediSearch(redis, 'the-bard', { clientOptions : connection });
  // create an instance of the abstraction module using the index 'the-bard';
  // since we passed in the redis module instead of a client instance, it will create a client
  // instance using the options specified in the 3rd argument

data.createIndex(
  [                                                                 // create the index using the following fields
    data.fieldDefinition.text('line', true),                        // 'line' holds text values and will be sortable later on
    data.fieldDefinition.text('play', true, { noStem : true }),     // 'play' holds text values that won't be stemmed
    data.fieldDefinition.numeric('speech', true),                   // 'speech' is a numeric field that is sortable
    data.fieldDefinition.text('speaker', false, { noStem : true }), // 'speaker' is a text field that is not stemmed and not sortable
    data.fieldDefinition.text('entry', false),                      // 'entry' is a text field that is stemmed and not sortable
    data.fieldDefinition.geo('location')                            // 'location' is a geospatial index
  ],
  function(err) {                              // error-first callback, invoked after the index is created
    if (err) { throw err; }                    // handle the errors
    data.batch()                               // start a 'batch' pipeline
      .rediSearch.add(57956, {                 // index the object at the ID 57956
        entry    : 'Out, damned spot! out, I say!--One: two: why,',
        line     : '5.1.31',
        play     : 'macbeth',
        speech   : '15',
        speaker  : 'LADY MACBETH',
        location : '-3.9264,57.5243'
      })
      .rediSearch.getDoc(57956)                // get the document indexed at 57956
      .rediSearch.search('spot')               // search all fields for the term 'spot'
      .rediSearch.exec(function(err, results) {            // execute the pipeline
        if (err) { throw err; }                            // handle the errors
        console.log(JSON.stringify(results[1], null, 2));  // show the results from the second pipeline item (`getDoc`)
        console.log(JSON.stringify(results[2], null, 2));  // show the results from the third pipeline item (`search`)
        data.dropIndex(function(err) {         // drop the index and send any errors to `err`
          if (err) { throw err; }              // handle the errors
          data.client.quit();                  // `data.client` is direct access to the client created in the `rediSearch` function
        });
      });
  }
);


As you can see, this example covers all the bases, though it probably isn’t very useful in a real-world scenario. In our next installment, we’ll dig into the TMDB dataset and start playing with real data and further expanding our client library for RediSearch.

In the meantime, I suggest you take a look at the GitHub repo to see how it’s all structured.

Original Link

What Is Elasticsearch and How Can It Be Useful?

E-commerce products and search engines backed by huge databases face issues such as product information retrieval taking too long. This leads to a poor user experience and, in turn, drives away potential customers.

Search lag is often attributable to the relational database behind the product, where data is scattered among multiple tables — and retrieving meaningful information requires fetching and joining data across those tables. Relational databases are comparatively slow when dealing with huge volumes of data and fetching search results through database queries. Businesses are therefore looking for alternatives that store data in a way that promotes quick retrieval. This can be achieved by adopting NoSQL rather than an RDBMS for storing data. Elasticsearch (ES) is one such distributed NoSQL database. It relies on flexible data models to build and update visitor profiles, meeting the demanding workloads and low latency required for real-time engagement.

Let's look at what makes Elasticsearch significant. ES is a document-oriented database designed to store, retrieve, and manage document-oriented or semi-structured data. When you use Elasticsearch, you store data as JSON documents and then query them for retrieval. It is schema-less, using sensible defaults to index the data unless you provide a mapping to suit your needs. Elasticsearch uses the Lucene StandardAnalyzer for indexing, along with automatic type guessing, to deliver high precision.

Every feature of Elasticsearch is exposed as a REST API:

  1. Index API: Used to add or update a document in the index.

  2. Get API: Used to retrieve the document.

  3. Search API: Used to submit your query and get a result.

  4. Put Mapping API: Used to override default choices and define the mapping.

Elasticsearch has its own query domain-specific language (DSL) in which you specify the query in JSON format, and you can nest queries inside one another as needed. Real-world projects require searching across different fields, applying conditions and weights, favoring recent documents, filtering on values of predefined fields, and so on. The query DSL is powerful and is designed to express all of that real-world complexity through a single query. Elasticsearch APIs map directly onto Lucene and use the same names as Lucene operations; under the hood, the query DSL uses Lucene constructs such as TermQuery to execute queries.
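To make that concrete, here is a hedged sketch using the official Node.js client (@elastic/elasticsearch, 7.x-style API); the index name, document, and field names are made up for illustration:

const { Client } = require('@elastic/elasticsearch');
const client = new Client({ node: 'http://localhost:9200' });

async function demo() {
  // Index API: store a JSON document without defining a mapping up front
  await client.index({
    index: 'products',
    id: '1',
    body: { title: 'Avatar', release_year: 2009, overview: 'A paraplegic Marine is dispatched to Pandora...' }
  });

  // make the document searchable right away (see near-real-time below)
  await client.indices.refresh({ index: 'products' });

  // Search API with the query DSL: a simple match query on one field
  const { body } = await client.search({
    index: 'products',
    body: { query: { match: { title: 'avatar' } } }
  });
  console.log(body.hits.hits);
}

demo().catch(console.error);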

This figure shows how an Elasticsearch query works (figure: Indexing and Searching in Elasticsearch).

The Basic Concepts of Elasticsearch

Let’s take a look at the basic concepts of Elasticsearch: clusters, near real-time search, indexes, nodes, shards, mapping types, and more.

Cluster

A cluster is a collection of one or more servers that together hold the entire data set and provide federated indexing and search capabilities across all of them. In relational database terms, a node is comparable to a DB instance. There can be N nodes with the same cluster name.

Near-Real-Time (NRT)

Elasticsearch is a near-real-time search platform. There is a slight delay from the time you index a document until the time it becomes searchable.

Index

The index is a collection of documents that have similar characteristics. For example, we can have an index for customer data and another for product information. An index is identified by a unique name that is used to refer to it when performing indexing, search, update, and delete operations. In a single cluster, we can define as many indexes as we want. An index is roughly the equivalent of a database schema in an RDBMS (relational database management system) — consider it a set of tables with some logical grouping. In Elasticsearch terms: index = database; type = table; document = row.

Node

A node is a single server that holds some data and participates in the cluster's indexing and querying. A node can be configured to join a specific cluster by the cluster name, and a single cluster can have as many nodes as we want. A node is simply one Elasticsearch instance; consider it a running instance of MySQL. There is typically one MySQL instance per machine, each on its own port, and likewise, in Elasticsearch, generally one instance runs per machine. Elasticsearch uses distributed computing, so having separate machines helps, as that means more hardware resources.

Shards

A shard is a subset of documents of an index. An index can be divided into many shards.

Mapping Type

Mapping type = database table in an RDBMS.

Elasticsearch uses document definitions that act as tables. If you PUT (“index”) a document in Elasticsearch, you will notice that it automatically tries to determine the property types. This is like inserting a JSON blob in MySQL, and then MySQL determining the number of columns and column types as it creates the database table.
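A hedged way to see that type guessing in action is to ask for the generated mapping, using the same 7.x-style client sketched earlier (the client and index names here are assumptions):

async function showGuessedTypes() {
  // after indexing a JSON document with no explicit mapping...
  const { body } = await client.indices.getMapping({ index: 'products' });
  // ...the generated mapping shows the property types Elasticsearch guessed
  console.log(JSON.stringify(body.products.mappings.properties, null, 2));
}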

Do you want to know more about what Elasticsearch is and when to use it? Some of the use cases of Elasticsearch can be found here. Elasticsearch users have delightfully diverse use cases, ranging from appending tiny log-line documents to indexing web-scale collections of large documents and maximizing indexing throughput.

Sometimes, we have more than one way to index or query documents, and Elasticsearch helps us do it better. Elasticsearch is not new, though it is evolving rapidly. Still, the core product is consistent and can help your search engine deliver faster search results.

Original Link

Mastering RediSearch (Part 1)

I’ve been working with the RediSearch module quite a bit lately — it’s one of the more fascinating developments in the Redis ecosystem and it deserves its own series. If you’re not familiar with RediSearch and its features, you should take a look at this video.

If you’ve built an application with Redis as a primary data store, you’ve likely experienced both the elation and confusion of the native data types. When you understand the data types, you realize that much of your data fits neatly into one of them. However, many common application patterns require both indexing (“What key has x value?”) and search (“What key contains some text string?”). While these questions can be answered by leveraging the native datatypes in creative ways, the code can be complex and has speed and/or space efficiency tradeoffs. The RediSearch module fills in these blanks with few trade-offs. In this first installment, we’re going to be exploring the very basics of the module as a gentle introduction.

What Are Modules?

Modules are add-ons for your Redis server. At their most basic level, they implement new commands, but they can also implement new data types. Modules are written in systems programming languages; C/C++, Rust, and Golang have been used, but other languages are also possible. Since they’re written in compiled languages, extremely high performance is possible.

Modules are distinct from Redis scripting (Lua) in that they are first-class commands in the system and can interface with storage directly, enabling the creation of their own datatypes. The only thing that sets them apart from inbuilt commands is that module commands are namespaced by a prefix, often two letters, followed by a dot (e.g., XX.SOMECOMMAND).

Modules can be loaded either on the fly with MODULE LOAD, in the redis.conf file with loadmodule, or through the command line argument loadmodule. My personal preference is to load them via the conf file, as it ensures the module is always available and the configuration is portable.
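For the redis.conf route, that is a single directive (adjust the path to wherever the compiled module lives):

loadmodule /path/to/redisearch.so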

What Is RediSearch?

I’ve asked myself the question what isn’t RediSearch — but I’ll attempt to answer it without inverting. RediSearch is a module that provides three main features:

  • Full-text search
  • Secondary indexing
  • Suggestion/auto-complete engine

RediSearch utilizes both its own datatype and the inbuilt Redis data types. In this way, it’s more of a solution that uses Redis and also resides with Redis. That may seem confusing now, but stay with me.

Let’s evaluate each of the features from above. First, consider full-text searching. With RediSearch, you can index text that hasn’t already been processed. Let’s say that you have a list of one million client comments and you want to find all that mention “rendering.” Before RediSearch, you could certainly store those comments in Redis (in, say, a hash), but finding a specific word inside those comments was a struggle at best. Even if you managed to build your own index of words to comments (which involves splitting each comment into words at the app level), matching would need to be exact — “render,” “rendering,” and “rendered” would not match one another. Instead, by storing the data with RediSearch, you could find all the comments without having to do anything special at your application level — and it would match “rendered” to “rendering” automatically since it smartly processes both the index and the query.

Obviously, if it’s possible to do the above, it’s also possible to do it without the language processing smarts. As you start to think of this, you start to realize that RediSearch can be used as a general purpose secondary index. But it’s also possible to go beyond text matches — RediSearch can do numeric and geo indexes on a single item (termed “document”). It is possible to have multiple fields on each document, each with individual attributes.

Finally, somewhat separately, RediSearch provides a suggestion engine that can drive auto-complete-like services. This allows you to take known valid values and provide users “hints.” It’s based on a prefix model, so if a user starts to type “Hamb” the suggestion engine would provide, say, “Hamburger,” “Hambone,” and “Hamburg.” It’s important to note these suggestions aren’t integrated with the search results directly, so it’s up to your application to add or delete them from this suggestion store.
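As a quick, hedged taste of what that looks like at the command level (the key name autocomplete is made up, and the replies are omitted), the commands below add three suggestions and then ask for everything matching the prefix "Hamb":

> FT.SUGADD autocomplete "Hamburger" 1
> FT.SUGADD autocomplete "Hambone" 1
> FT.SUGADD autocomplete "Hamburg" 1
> FT.SUGGET autocomplete "Hamb"

The last call returns all three entries, ranked by their scores.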

Hands-On

As a hands-on exercise, let’s install the module:

$ git clone https://github.com/RedisLabsModules/RediSearch.git
$ cd RediSearch/src
$ make all
$ redis-cli
> MODULE LOAD ./redisearch.so

(Or install it in your redis.conf file and restart redis-server.)

After your module is loaded, go ahead and run this command in redis-cli to verify that the module is running:

> module list
1) 1) "name" 2) "ft" 3) "ver" 4) (integer) 2000

In the results of this command, you should see an entry for each module you have installed (likely just one). The name field of one of the entries should read ft (meaning full text); that's how RediSearch is identified, and it's also the command prefix. Your version number will likely be different from mine; progress on this module is moving fast.

Now that the module is up and running, it's best to start with a clean database for these exercises (FLUSHDB or a clean database/instance). To start, let's create an index and add an item:

> FT.CREATE shakespeare SCHEMA line TEXT SORTABLE play TEXT NOSTEM speech NUMERIC SORTABLE speaker TEXT NOSTEM entry TEXT location GEO

This might look a tad complicated, especially if you’re used to commands with one or two arguments. Let’s break it down:

  • FT.CREATE shakespeare: This is just the command and the “key” (more on that later).

  • SCHEMA: This indicates that the following arguments will be about the fields in the search index.

  • line TEXT SORTABLE: Here, we are creating a field named line that holds text values and will be sortable later on.

  • play TEXT NOSTEM: This is the field play, which holds text values but won’t be stemmed (i.e., “rendering” will not match “render”).

  • speech NUMERIC SORTABLE: We’re creating a field named speech that is numeric and sortable.

  • speaker TEXT NOSTEM: Just like the play field, the speaker field will hold text that will only do exact, word-for-word matches.

  • entry TEXT: This field (entry) holds text values that are processed for exact or stemmed matches.

  • location GEO: The location field holds a geographic coordinate.

See? It’s just a lot in one line, but not really complicated.

Now, let’s add a document to our index:

> FT.ADD shakespeare 57956 1 FIELDS text_entry "Out, damned spot! out, I say!--One: two: why," line "5.1.31" play macbeth speech 15 speaker "LADY MACBETH" location -3.9264,57.5243

Comparing the two commands, you might notice that FT.CREATE and FT.ADD follow a similar pattern. Let’s look at the command in more depth:

  • FT.ADD shakespeare 57956 1: We’re adding a document with an ID of 57956 to the index (shakespeare). Note that in this command the document ID is a number (just a feature of the dataset I’m using), but it can be any valid Redis key. The final argument in this section is the weight — we’ll get into this in a later part of the series, but for now, you just need to know that it can be between 0 and 1, and 1 is a good default value.

  • FIELDS …: This indicates that we’re going to specify the fields of the document in a [fieldname] [value] repeating pattern. Note that when the value is a single word or number, you don’t need quotes, but if it contains spaces or other odd characters, enclose it in quotes. The other special one is the location field, which takes a pair of coordinates (longitude,latitude).

The Curious Case of RediSearch Keys

Recall that we created an index with the key shakespeare (via the FT.CREATE command). Let’s do a quick experiment:

> TYPE shakespeare
none

Strange, right? This is where we start departing from normal Redis behavior, and you’ll start to see how RediSearch is a solution that both uses Redis and is integrated with it.

If you’re running this on a non-production database, let’s do KEYS * for debugging purposes:

> KEYS *
1) "ft:shakespeare/1"
2) "ft:shakespeare/31"
3) "idx:shakespeare"
4) "ft:shakespeare/5"
5) "ft:shakespeare/macbeth"
6) "ft:shakespeare/lady"
7) "nm:shakespeare/speech"
8) "geo:shakespeare/location"
9) "57956"

Running two commands has yielded nine keys. I want to highlight a few of these keys to fill out our understanding of what is actually going on here:

> TYPE idx:shakespeare
ft_index0

Here, we can see that RediSearch has created a key with its own datatype (ft_index0). We can’t really do much with this key directly, but it’s important to know that it exists and how it was created.

Now, let’s look at key 57956:

> TYPE 57956
hash

A hash! We can work with this — let’s look at this key directly:

> HGETALL 57956
 1) "text_entry"
 2) "Out, damned spot! out, I say!--One: two: why,"
 3) "line"
 4) "5.1.31"
 5) "play"
 6) "macbeth"
 7) "speech"
 8) "15"
 9) "speaker"
10) "LADY MACBETH"
11) "location"
12) "-3.9264,57.5243"

This should look familiar, as it’s your data from the FT.ADD command and the key is just your document ID. While it’s important to know how this is being stored, don’t manipulate this key directly with hash commands.

> TYPE nm:shakespeare/speech
numericdx

Interesting — the field speech in our dataset is a numeric index and the type is a numericdx. Again, since this is a RediSearch native datatype, we can’t manipulate this with any “normal” Redis commands.

> TYPE geo:shakespeare/location
zset

The key here gives you a hint — while the TYPE command reports a zset, that’s because Redis geohash sets are stored as ZSETs under the hood and report as such when the type is queried. That being said, let’s look at a couple of GEO commands:

> GEOHASH geo:shakespeare/location 1
1) "gfjpnxuzk40"
> GEOPOS geo:shakespeare/location 1
1) 1) "-3.92640262842178345" 2) "57.52429905544970268"

Brilliant! RediSearch has stored the coordinates in a bog-standard GEO set. But, like the hash above, don’t modify these values directly with ZSET or GEO commands.

Finally, let’s take a look at one more key:

> TYPE ft:shakespeare/lady
ft_invidx

Sharp readers might notice that the term “lady” was only indexed in a full-text field (speaker). Keys of the ft_invidx type store the textual (full-text) indexes.

Now that we know a little about how RediSearch stores our data, we can start loading more substantial information into the database and exploring querying, but that will have to wait for Part 2 of Mastering RediSearch, coming soon.

Original Link