More Than LIKE: Efficient JSON Searching With N1QL

Enterprise applications require both exact search (equality, range predicate) and pattern search (LIKE or text search). The B-Tree based indexes in databases help perform an exact search in milliseconds. Pattern search is a whole different problem. Predicates of patterns (name LIKE "%turin%") will scan the whole index — unsuitable for application use. Recognizing the need for speed, we've introduced a TOKENS() functions in Couchbase 4.6 which retuns an array of tokens from each field or document. We then index the array using the array indexing in Couchbase N1QL. This transforms the pattern search problem to a simple equality look up, improves the performance by orders of magnitude.

Introduction

Whether you’re booking a hotel in Turin or Timbuktu, Milan or Malibu, you need to search based on various criteria: location, dates, rooms, number of people, their age and more. The numerical requirements (number of rooms, number of people in the party), date ranges (check-in and checkout dates) are exact predicates. The target place to visit could be a search within a city name, region name, nearby places, or interesting landmarks.

Search image

Here is the problem.

How do you make this query run really, really quickly?
How do you get thousands of users searching and buying your product and service concurrently?

Search is a broad term. So, we divide searches into distinct types.

Exact search with one or more predicates.
Pattern search to find a word or a pattern within a larger string.
Fuzzy search — text search with a lot of text data.

Of course, all of these search operations have to come back with results in milliseconds and not seconds.

In Couchbase:

For exact searches, you can use B-tree- or skip-list-based global secondary indexes with a simple or composite index to achieve your goals.
For full-text search, which requires heavy text processing — stemming, scoring, and more — you can use Couchbase full-text search (FTS).

The hotel search problem doesn’t fall nicely into either category completely. Exact predicates require exact index; search predicates require FTS index. Using only index scans, you’ll get too much data, and you’d need filter further. On the other hand, if you search using full-text index, you’ll need to filter further for dates, rooms, etc. This second level of filtering will slow things down significantly.

Real world problems like you saw above are not always clean cut. You want the exact search for numerical and date range; you want the string search with smaller data like city, neighborhood, region.

Here is the problem. How do you make this query run really, really quickly? How do you get thousands of users searching and buying your product and service concurrently?

Let’s use sample Hotel documents in the Couchbase travel-sample data set to explore the use case.

select hotel, meta().id as dockey 
from `travel-sample` hotel 
where hotel.type = "hotel"
limit 1;

[
  {
     "dockey": "hotel_10025",
     "hotel": {
      "address": "Capstone Road, ME7 3JE",
      "alias": null,
      "checkin": null,
      "checkout": null,
      "city": "Medway",
      "country": "United Kingdom",
      "description": "40 bed summer hostel about 3 miles from Gillingham, housed in a distinctive converted Oast House in a semi-rural setting.",

Now, let’s see some common questions from your application and how these questions can be translated to queries.

1. Find all the hotel information for "Medway Youth Hostel"

* This is an exact search of the predicate (hotel.name = "Medway Youth Hostel")

Query:
​
SELECT hotel 
FROM `travel-sample` hotel
WHERE hotel.type = "hotel"
AND hotel.name = "Medway Youth Hostel";
​
Index:
​
CREATE INDEX idx_hotel_typename 
ON `travel-sample`(name) WHERE type = "hotel";

2. Find all the hotels in "Medway", "United Kingdom"

* This is an exact search with the predicate: ( hotel.country = “United Kingdom”
AND hotel.city = “Medway” )

Query:
​
SELECT hotel 
FROM `travel-sample` hotel
WHERE hotel.type = "hotel" 
AND hotel.country = "United Kingdom"
AND hotel.city = "Medway"
​
Index:
​
CREATE INDEX idx_hotel_typecountryname 

3. Find all the hotels liked (in the field public_field) by the a person with a name (first or last) Vallie

* public_likes is an array of field. It has names of the peopel who liked the hotel.

* This is a pattern search. We're looking for a first name or last name Vallie.

      "public_likes": [
        "Julius Tromp I",
        "Corrine Hilll",
        "Jaeden McKenzie",
        "Vallie Ryan",
        "Brian Kilback",
        "Lilian McLaughlin",
        "Ms. Moses Feeney",
        "Elnora Trantow"
      ]

So we use the array predicate to search through array elements. To make this search faster, we create an array index on public_likes. This index scan will still fetch all the elements from the index and apply the LIKE "%vallie%" on top of it.

Query:
​
SELECT hotel 
FROM `travel-sample` hotel
WHERE hotel.type = "hotel"
AND ANY p IN hotel.public_likes 
        SATISFIES LOWER(p) LIKE "%vallie%"
    END
​
Index:
​

4. Find all the hotels that mention architecture anywhere in the document. This anywhere -- could be in keyname (attribute name), any value — in name, description, reviews, etc.

This has a pattern search. Search for architecture anywhere in the document is a pattern search.Index: This query can use a primary index.

SELECT COUNT(1) totalcount FROM `travel-sample` hotel
WHERE ANY v WITHIN OBJECT_PAIRS(hotel) 
          SATISFIES LOWER(v)  LIKE "%architecture%"
      END
{
    "requestID": "426b993e-a214-4e5f-824e-aef2ebfd3262",
    "signature": {
        "$1": "number"
    },
    "results": [
        {

While this query works, it takes 12.13 seconds to find the 41 documents which have the word architecture somewhere in the document. Too slow and too expensive.

Here is where the TOKENS come into the picture. TOKENS() function takes a JSON document or an expression, analyzes all of name value pairs, arrays, objects and returns an array of values of basic types: numeric, strings, boolean, null.

SELECT TOKENS(hotel)
FROM `travel-sample` hotel
WHERE hotel.type = 'hotel' 
LIMIT 1;
​
[
  {
    "$1": [
      null,
      false,
      true,

Once we get this array, we can use the array indexing, to index the result of TOKENS and query them:

TOKENS in array index

CREATE INDEX idx_travel_tokens ON `travel-sample`
(DISTINCT ARRAY s for s IN TOKENS(`travel-sample`, {"case":"lower", "names": true}) END) 
​
select count(1) FROM `travel-sample` h 
WHERE ANY s IN tokens(h, {"case":"lower", "names": true}) SATISFIES s = 'architecture' END;
{
    "requestID": "dd6cfb24-ae5f-4667-a332-fc21ec48f808",
    "signature": {
        "$1": "number"
    },
    "results": [

Let’s look at the query plan. It chose the tokens right index and more importantly, the predicates were pushed correctly into index scans (reported here as spans). The appropriate index is selected and the predicate looking for architecture is pushed down to index scan.

      "~children": [
        {
          "#operator": "DistinctScan",
          "scan": {
            "#operator": "IndexScan",
            "index": "idx_travel_tokens",
            "index_id": "9f370cd078ebd2eb",
            "keyspace": "travel-sample",
            "namespace": "default",
            "spans": [
              {

Finally, let’s put all of this together to answer this question: Find all the hotels in United Kingdom which have vacancies and with a website.

We can check for (country = ‘United Kingdom’ and vacancy = true) easily via composite index.
Having a “website” needs to be defined well.
When you get information from hotels, unless you go through a ETL process to save the information in an exact field and format, you’d have this information anywhere in the hotel document.

Example:
"url": "http://www.yha.org.uk",
"Website": "http://www.yha.org.uk”
“Http”: "www.yha.org.uk"
“Http address”: "www.yha.org.uk"
“Web” : "Our website: http://www.yha.org.uk "
“Contact” : 
     “Phone” : "+44 482 492 292",
     “Web” : "http://www.yha.org.uk",
     “Mail” : "help@yha.org.uk"
   }

Problem: You’d need to search in every field name, every object, every array.

Query:
EXPLAIN 
SELECT * 
FROM `travel-sample` hotel 
WHERE country = "United Kingdom"
AND vacancy = true
AND ANY v IN TOKENS(hotel, {"case":"lower", "name":true})
        SATISFIES v = "http"
    END
AND type = "hotel"
​

In this case, we push down all of the predicates to the index scan.

The query runs in about 233 milliseconds!

SELECT * 
FROM `travel-sample` hotel 
WHERE country = "United Kingdom" 
AND vacancy = true 
AND ANY v IN TOKENS(hotel, {"case":"lower", "name":true}) SATISFIES v = "http" 
    END 
AND type = "hotel";
​
Results (partially shows for brevity):
​
{

TOKENS() Function

This function is explained in detail in Couchbase documentation. The important thing to note is that first expression passed into token can be anything: constant literal, simple JSON value, JSON key name or the whole document itself.

TOKENS (Expression, Options)

You can invoke TOKENS() simply with the expressions or with the option. The function always returns an array of tokens it extracted for the given (implied) options.

TOKENS(address)

This returns an array of values including the attribute (key) names. It retrieves the data in their basic types and returns

Options can take the following options. Each invocation of TOKENS() can choose one or more of the options, passed in as JSON.

{"name": true}: Valid values are true or false. Default is true.

{"case":"lower"}: Valid values are "upper" and "lower". Default is neither — return the case in the data.

{"specials": true}: Preserves strings with special characters such as email addresses, URLs, and hyphenated phone numbers. The default is false.

Let's use all of the options on TOKENS() to see what it produces.

Query:
​
SELECT 
TOKENS ({"info":"Name is keshav email is keshav@couchbase.com Blog is http://planetnosql.com twitter handle is @rkeshavmurthy"}, 
{"specials":true, "names":true, "case":"lower"}) AS words
​
Result:
​
[
  {
    "words": [

With names set to true, you'll get the field name, in this case info.

With specials set to true, you'll get delimited words as well as special works like keshav@couchbase.com or @rkeshavmurthy as a single word to index.

With case set to lower, all of the results will be in the lower case

Let's look at an example usage on the travel-sample data.

SELECT TOKENS(url) AS defaulttoken, 
       TOKENS(url, {"specials":true, "case":"UPPER"}) AS specialtoken
FROM `travel-sample` h WHERE h.type = 'hotel' 
LIMIT 1;
​
[
  {
    "defaulttoken": [
      "uk",
      "www",
      "http",

In this case, TOKENS() converted all of the URL data into UPPER case and added the full URL in addition to delimited words.

Use {"case":"lower"} or {"case":"upper"} to have case insensitive search. Index creation and querying can use this and other parameters in combination. As you saw in the examples, these parameters should be passed within the query predicates as well. The parameters and values have to match exactly for N1QL to pick up and use the index correctly.

Here is an example of how you can create the index and use it your applicaiton.

Index definition:
​
CREATE INDEX idx_url_upper_special on `travel-sample`(
  DISTINCT ARRAY v for v in 
       tokens(url, {"specials":true, "case":"UPPER"}) 
  END ) where type = 'hotel' ;
​
Query:
​
SELECT name, address, url
FROM `travel-sample` h

Summary

Couchbase 4.6 has introduced a new way to split the JSON document into simpler values using TOKENS(). Using the flexible array indexing in Couchbase, you can now create secondary indexes for scalar and array values. This is a high-performance alternative to the LIKE predicate. Its performance is so much more likable.

References

Couchbase : http://www.couchbase.com
N1QL: http://query.couchbase.com
Couchbase Full Text Search: http://developer.couchbase.com/documentation/server/current/fts/full-text-intro.html
SPLIT and CONQUER: https://dzone.com/articles/split-and-conquer-efficient-string-search-with-n1q
A Couchbase Index Technique for LIKE Predicates With Wildcard: https://dzone.com/articles/a-couchbase-index-technique-for-like-predicates-wi

Swami Vivekananda: The Monk That Nobody Sent to Chicago

There’s a saying in Chicago: “We don’t want nobody that nobody sent.” This was the cold reception Swami Vivekananda faced when he arrived in the windy city in July 1893, determined to attend the World Parliament of Religions that September. He belonged to no organization, carried no letter of recommendation, his countrymen were nobody, and represented an alien religion to the Western world. As the days passed, his hope of attending the parliament dwindled. With money running out and the odds stacked against him, he left the Windy City and went to Boston, praying for a glimmer of opportunity. Swamiji came to America to share India’s most profound gift: the wisdom of the Hindu sages, preserved through centuries of oral tradition and embodied by its monks. This was 1893, not 1993—India was under the British grip, its resources drained, and its spirit subdued. Swamiji’s mission was not just a cultural exchange; it was a bold step toward envisioning a future where India could re...

Planet NoSQL

Search This Blog