Skip to main content

More Than LIKE: Efficient JSON Searching With N1QL

Enterprise applications require both exact search (equality, range predicate) and pattern search (LIKE or text search). The B-Tree based indexes in databases help perform an exact search in milliseconds. Pattern search is a whole different problem. Predicates of patterns (name LIKE "%turin%") will scan the whole index — unsuitable for application use. Recognizing the need for speed, we've introduced a TOKENS() functions in Couchbase 4.6 which retuns an array of tokens from each field or document. We then index the array using the array indexing in Couchbase N1QL. This transforms the pattern search problem to a simple equality look up, improves the performance by orders of magnitude.

Introduction

Whether you’re booking a hotel in Turin or Timbuktu, Milan or Malibu, you need to search based on various criteria: location, dates, rooms, number of people, their age and more. The numerical requirements (number of rooms, number of people in the party), date ranges (check-in and checkout dates) are exact predicates. The target place to visit could be a search within a city name, region name, nearby places, or interesting landmarks.

Search image

Here is the problem.  

  • How do you make this query run really, really quickly?  

  • How do you get thousands of users searching and buying your product and service concurrently?

Search is a broad term. So, we divide searches into distinct types.

  1. Exact search with one or more predicates.

  2. Pattern search to find a word or a pattern within a larger string.

  3. Fuzzy search — text search with a lot of text data.

Of course, all of these search operations have to come back with results in milliseconds and not seconds.

In Couchbase:

  • For exact searches, you can use B-tree- or skip-list-based global secondary indexes with a simple or composite index to achieve your goals.

  • For full-text search, which requires heavy text processing — stemming, scoring, and more — you can use Couchbase full-text search (FTS).

The hotel search problem doesn’t fall nicely into either category completely. Exact predicates require exact index; search predicates require FTS index. Using only index scans, you’ll get too much data, and you’d need filter further. On the other hand, if you search using full-text index, you’ll need to filter further for dates, rooms, etc. This second level of filtering will slow things down significantly.

Real world problems like you saw above are not always clean cut. You want the exact search for numerical and date range; you want the string search with smaller data like city, neighborhood, region.

Here is the problem. How do you make this query run really, really quickly? How do you get thousands of users searching and buying your product and service concurrently?

Let’s use sample Hotel documents in the Couchbase travel-sample data set to explore the use case.

Now, let’s see some common questions from your application and how these questions can be translated to queries.

1. Find all the hotel information for "Medway Youth Hostel"

* This is an exact search of the predicate (hotel.name = "Medway Youth Hostel")

2. Find all the hotels in "Medway", "United Kingdom"

* This is an exact search with the predicate: ( hotel.country = “United Kingdom”
AND hotel.city = “Medway” )


3. Find all the hotels liked (in the field public_field) by the a person with a name (first or last) Vallie

    * public_likes is an array of field. It has names of the peopel who liked the hotel.

* This is a pattern search. We're looking for a first name or last name Vallie.


So we use the array predicate to search through array elements. To make this search faster, we create an array index on public_likes. This index scan will still fetch all the elements from the index and apply the LIKE "%vallie%" on top of it.


4. Find all the hotels that mention architecture anywhere in the document. This anywhere -- could be in keyname (attribute name), any value — in name, description, reviews, etc.

This has a pattern search. Search for architecture anywhere in the document is a pattern search.Index: This query can use a primary index.


While this query works, it takes 12.13 seconds to find the 41 documents which have the word architecture somewhere in the document. Too slow and too expensive.

Here is where the TOKENS come into the picture. TOKENS() function takes a JSON document or an expression, analyzes all of name value pairs, arrays, objects and returns an array of values of basic types: numeric, strings, boolean, null.


Once we get this array, we can use the array indexing, to index the result of TOKENS and query them:

TOKENS in array index



Let’s look at the query plan. It chose the tokens right index and more importantly, the predicates were pushed correctly into index scans (reported here as spans). The appropriate index is selected and the predicate looking for architecture is pushed down to index scan.


  1. Finally, let’s put all of this together to answer this question: Find all the hotels in United Kingdom which have vacancies and with a website.

    • We can check for (country = ‘United Kingdom’ and vacancy = true) easily via composite index.

    • Having a “website” needs to be defined well.

    • When you get information from hotels, unless you go through a ETL process to save the information in an exact field and format, you’d have this information anywhere in the hotel document.

Problem: You’d need to search in every field name, every object, every array.

In this case, we push down all of the predicates to the index scan. 

The query runs in about 233 milliseconds!


TOKENS() Function

This function is explained in detail in Couchbase documentation. The important thing to note is that first expression passed into token can be anything: constant literal, simple JSON value, JSON key name or the whole document itself.

TOKENS (Expression, Options)

You can invoke TOKENS() simply with the expressions or with the option. The function always returns an array of tokens it extracted for the given (implied) options.

TOKENS(address)  

This returns an array of values including the attribute (key) names. It retrieves the data in their basic types and returns 

Options can take the following options. Each invocation of TOKENS() can choose one or more of the options, passed in as JSON.

{"name": true}: Valid values are true or false. Default is true.

{"case":"lower"}: Valid values are "upper" and "lower".  Default is neither — return the case in the data.

{"specials": true}:  Preserves strings with special characters such as email addresses, URLs, and hyphenated phone numbers. The default is false. 

Let's use all of the options on TOKENS() to see what it produces.


With names set to true, you'll get the field name, in this case info.

With specials set to true, you'll get delimited words as well as special works like keshav@couchbase.com or @rkeshavmurthy as a single word to index. 

With case set to lower, all of the results will be in the lower case

Let's look at an example usage on the travel-sample data.

In this case, TOKENS() converted all of the URL data into UPPER case and added the full URL in addition to delimited words.

Use {"case":"lower"} or  {"case":"upper"} to have case insensitive search. Index creation and querying can use this and other parameters in combination. As you saw in the examples, these parameters should be passed within the query predicates as well. The parameters and values have to match exactly for N1QL to pick up and use the index correctly.

Here is an example of how you can create the index and use it your applicaiton.



Summary

Couchbase 4.6 has introduced a new way to split the JSON document into simpler values using TOKENS(). Using the flexible array indexing in Couchbase, you can now create secondary indexes for scalar and array values. This is a high-performance alternative to the LIKE predicate. Its performance is so much more likable.


References

  1. Couchbase  : http://www.couchbase.com

  2. N1QL:  http://query.couchbase.com

  3. Couchbase Full Text Search:  http://developer.couchbase.com/documentation/server/current/fts/full-text-intro.html

  4. SPLIT and CONQUER: https://dzone.com/articles/split-and-conquer-efficient-string-search-with-n1q

  5. A Couchbase Index Technique for LIKE Predicates With Wildcard:  https://dzone.com/articles/a-couchbase-index-technique-for-like-predicates-wi

Comments

Popular posts from this blog

Swami Vivekananda: The Monk That Nobody Sent to Chicago

  There’s a saying in Chicago: “We don’t want nobody that nobody sent.” This was the cold reception Swami Vivekananda faced when he arrived in the windy city in July 1893, determined to attend the World Parliament of Religions that September. He belonged to no organization, carried no letter of recommendation, his countrymen were nobody, and represented an alien religion to the Western world. As the days passed, his hope of attending the parliament dwindled. With money running out and the odds stacked against him, he left the Windy City and went to Boston, praying for a glimmer of opportunity.  Swamiji came to America to share India’s most profound gift: the wisdom of the Hindu sages, preserved through centuries of oral tradition and embodied by its monks. This was 1893, not 1993—India was under the British grip, its resources drained, and its spirit subdued. Swamiji’s mission was not just a cultural exchange; it was a bold step toward envisioning a future where India could re...

Why Should Databases Go Natural?

From search to CRM, applications are adopting natural language and intuitive interactions. Should databases follow? This article provides a strategic perspective. Amid the many technological evolutions in software and hardware (CISC/RISC, Internet, Cloud, and AI), one technology has endured:  Relational Database Systems   (RDBMS), aka SQL databases. For over 50 years, RDBMS has survived and thrived, overcoming many challenges. It has evolved and adopted beneficial features from emerging technologies like object-relational databases and now competes robustly with   NoSQL databases .  Today, RDBMS dominates the market, with four of the top five databases and seven of the top ten being relational. RDBMS has smartly borrowed ideas, like JSON support, from NoSQL, while NoSQL has also borrowed from RDBMS. NoSQL no longer rejects SQL. From a user perspective, all modern databases have SQL-inspired query language and a set of APIs. All applications manage the respective data...

iQ Interactive: Cool Things for Developers on Couchbase Capella iQ

  The landscape of software development is ever-evolving with the advent of new technologies. As we venture into 2023, natural language processing ( NLP ) is rapidly emerging as a pivotal aspect of programming. Unlike previous generations of tools that primarily aimed at enhancing coding productivity and code quality, the new generation of Artificial Intelligence ( GenAI ) tools, like iQ, is set to revolutionize every facet of a developer's workflow. This encompasses a wide range of activities: Reading, writing, and rewriting specifications Designing, prototyping, and coding Reviewing, refactoring, and verifying software Going through the iterative cycle of deploying, debugging, and improving the software Create a draft schema and sample data for any use case Natural language queries. Generate sample queries on a given dataset Fix the syntax error for a query Don't stop here. Let your imagination fly. Although the insights garnered from iQ are preliminary and should be treated ...