Skip to main content

Indexing JSON in Couchbase



[Reposting of my article at DZone: http://dzone.com/articles/index-first-and-query-faster]

Introduction

There are three things important in database systems:performance, performance, performance. Creating the right index, with the right keys, right order, and right expression is critical to query performance in any database system. That's true for Couchbase as well.
We've discussed data modeling for JSON and querying on JSONearlier. In this article, we'll discuss indexing options for JSON in Couchbase.
Couchbase 4.5 can create two types of indices:
  1. Standard global secondary index.
  2. Memory-Optimized global secondary index.
The standard secondary index stores uses the ForestDB storage engine to store the B-Tree index and keeps the optimal working set of data in the buffer. That means, the total size of the index can be much bigger than the amount of memory available in each index node.
A memory-optimized index uses a novel lock-free skiplist to maintain the index and keeps 100% of the index data in memory. A memory-optimized index has better latency for index scans and can also process the mutations of the data much faster.
Both standard and memory-optimized indeces implement multi-version concurrency control (MVCC) to provide consistent index scan results and high throughput. During cluster installation, choose the type of index.
The goal is to give you an overview of various indices you create in each of these services so that your queries can execute efficiently.
The goal of this article is not to describe or compare and contrast these two types of index services. It does not cover the Full Text Index (FTS), in developer preview now. Another topic not covered by this article is how the index selection is made for scans and joinsCouchbase documentation
Let's take travel-sample dataset shipped with Couchbase 4.5 to walk through this.  To try out these indices, install Couchbase 4.5. On your web console, go to Settings->Sample Buckets to install travel-sample.
Here are the various indices you can create.
Primary Index
Named primary index
Secondary index
Composite Secondary Index
Functional index
Array Index
ALL array 
ALL DISTINCT array
Partial Index
Duplicate Indices
Covering Index

Background

Couchbase is a distributed database. It supports flexible data model using JSON. Each document in a bucket will have a user-generated unique document key. This uniqueness is enforced during insertion of the data.
Here's an example document.
select meta().id, travel 
from `travel-sample` travel 
where type = 'airline' limit 1;
[
    {
        "id": "airline_10",
        "travel": {
            "callsign": "MILE-AIR",
            "country": "United States",
            "iata": "Q5",
            "icao": "MLA",
            "id": 10,
            "name": "40-Mile Air",
            "type": "airline"
            }
        }
    ]

1. Primary Index

Create the primary index on 'travel-sample';
The primary index is simply the index on the document key on the whole bucket. The Couchbase data layer enforces the uniqueness constraint on the document key. The primary index, like every other index, is maintained asynchronously. The primary index is used for full bucket scans (primary scans) when the query does not have any filters (predicates) or no other index or access path can be used.  
Here is the metadata for this index:
select * from system:indexes where name = #primary’;
"indexes": {
    "datastore_id": "http://127.0.0.1:8091",
    "id": "f6e3c75d6f396e7d",
    "index_key": [],
    "is_primary": true,
    "keyspace_id": "travel-sample",
    "name": "#primary",
    "namespace_id": "default",
    "state": "online",
    "using": "gsi"
    }

The metadata gives you additional information on the index: Where the index resides (datastore_id), its state (state) and the indexing method (using).

2. Named Primary Index

CREATE PRIMARY INDEX `def_primary` ON `travel-sample`
You can also name the primary index. The rest of the features of the primary index are the same, except the index is named. A good side effect of this is that you can have multiple primary indices in the system. Duplicate indices help with high availability as well as query load distribution throughout them.  This is true for both primary indices and secondary indices.
select meta().id as documentkey, `travel-sample` airline 
from `travel-sample` 
where type = 'airline' limit 1;
  {
    "airline": {
      "callsign": "MILE-AIR",
      "country": "United States",
      "iata": "Q5",
      "icao": "MLA",
      "id": 10,
      "name": "40-Mile Air",
      "type": "airline"
    },
    "documentkey": "airline_10"
  }

3. Secondary Index


The secondary index is an index on any key-value or document-key. This index can be any key within the document. The key can be of any time: scalar, object, or array.  The query has to use the same type of object for the query engine to exploit the index.
CREATE INDEX travel_name ON `travel-sample`(name);
name is a simple scalar value.
{    "name": "Air France"  }
CREATE INDEX travel_geo on `travel-sample`(geo);
geo is an object embedded within the document.  Example:
    "geo": {
        "alt": 12,
        "lat": 50.962097,
        "lon": 1.954764
        }

Creating indexes on keys from nested objects is straightforward.
CREATE INDEX travel_geo on `travel-sample`(geo.alt);
CREATE INDEX travel_geo on `travel-sample`(geo.lat);

Schedule is an array of objects with flight details. This indexes on the complete array.  Not exactly useful unless you're looking for the whole array.  
CREATE INDEX travel_schedule ON `travel-sample`(schedule);
Example:
"schedule": [
        {
            "day": 0,
            "flight": "AF198",
            "utc": "10:13:00"
            },
        {
            "day": 0,
            "flight": "AF547",
            "utc": "19:14:00"
            },
        {
            "day": 0,
            "flight": "AF943",
            "utc": "01:31:00"
            },
        {
            "day": 1,
            "flight": "AF356",
            "utc": "12:40:00"
            },
        {
            "day": 1,
            "flight": "AF480",
            "utc": "08:58:00"
            },
        {
            "day": 1,
            "flight": "AF250",
            "utc": "12:59:00"
            }
    ]

4. Composite Secondary Index

It's common to have queries with multiple filters (predicates). So, you want the indices with multiple keys so the indices can return only the qualified document keys. Additionally, if a query is referencing only the keys in the index, the query engine will simply answer the query from the index scan result without going to the data nodes. This is a commonly exploited performance optimization.
CREATE INDEX travel_info ON `travel-sample`(name, type, id, icoo, iata);
Each of the keys can be a simple scalar field, object, or an array. For the index filtering to be exploited, the filters have to use respective object type in the query filter.
The keys to the secondary indices can include document keys (meta().id) explicitly if you need to filter on it in the index.

5. Functional Index

It's common to have names in the database with a mix of upper and lower cases. When you need to search, "John," you want it to search for any combination of "John," "john," etc.  Here's how you do it.
CREATE INDEX travel_cxname ON `travel-sample`(LOWER(name));
Provide the search string in lowercase and the index will efficiently search for already lowercased values in the index.
EXPLAIN SELECT * FROM `travel-sample` WHERE LOWER(name) = “john”;
{
    "#operator": "IndexScan",
    "index": "travel_cxname",
    "index_id": "2f39d3b7aac6bbfe",
    "keyspace": "travel-sample",
    "namespace": "default",
    "spans": [
    {
        "Range": {
            "High": [
            "\"john\""
            ],
            "Inclusion": 3,
            "Low": [
            "\"john\""
            ]
        }
    }
    ],

You can use complex expressions in this functional index.
CREATE INDEX travel_cx1 ON `travel-sample`(LOWER(name), 
    length*width, round(salary));

6. Array Index

JSON is hierarchical. At the top level, it can have scalar fields, objects, or arrays. Each object can nest other objects and arrays. Each array can have other objects and arrays. And so on. The nesting continues.
When you have this rich structure, here's how you index a particular array, or a field within the sub-object.
Consider the array, schedule:
schedule: 
[  
    {  
        "day" : 0, 
        "special_flights" : 
        [ 
        {  
            "flight" : "AI111", "utc" : ”1:11:11"
            }, 
        {  
            "flight" : "AI222", "utc" : ”2:22:22" 
            }   
            ]  
        },
    {
        "day": 1,
        "flight": "AF552",
        "utc": "14:41:00”
        }
    ]
CREATE INDEX travel_sched ON `travel-sample`
 (ALL DISTINCT ARRAY v.day FOR v IN schedule END) 
This index key is an expression on the array to clearly reference only the elements needed to be indexed.
schedule the array we’re dereferencing into.
v is the variable we’ve implicitly declared to reference each element/object within the array: schedule
v.day refers to the element within each object of the array schedule.

The query below will exploit the array index.
EXPLAIN SELECT * FROM `travel-sample` 
WHERE ANY v IN SCHEDULE SATISFIES v.day = 2 END;
{
    "#operator": "DistinctScan",
    "scan": {
        "#operator": "IndexScan",
        "index": "travel_sched",
        "index_id": "db7018bff5f10f17",
        "keyspace": "travel-sample",
        "namespace": "default",
        "spans": [
        {
            "Range": {
                "High": [
                "2"
                ],
                "Inclusion": 3,
                "Low": [
                "2"
                ]
                }
            }
        ],
        "using": "gsi"
    }

Because the key is a generalized expression, you get the flexibility to apply additional logic and processing on the data before indexing. For example, you can create functional indexing on elements of each array.  Because you're referencing individual fields of the object or element within the array, the index creation, size, and search are efficient.
The index above stores only the distinct values within an array.  To store all elements of an array in an index, use the DISTINCT modifier to the expression.
CREATE INDEX travel_sched ON `travel-sample`
    (ALL ARRAY v.day FOR v IN schedule END)

7. Partial Index

So far, the indices we've created will create indices on the whole bucket. Because the Couchbase data model is JSON and JSON schema are flexible, an index may not contain entries to documents with absent index keys. That's expected.
Unlike relational systems, where each type of row is in a distinct table, Couchbase buckets can have documents of various types. Typically, customers include a type field to differentiate distinct types.
{
    "airline": {
        "callsign": "MILE-AIR",
        "country": "United States",
        "iata": "Q5",
        "icao": "MLA",
        "id": 10,
        "name": "40-Mile Air",
        "type": "airline"
        },
    "documentkey": "airline_10"
    }

When you want to create an index of airline documents, you can simply add the type field for the WHERE clause of the index.
CREATE INDEX travel_info ON `travel-sample`(name, id, icoo, iata)
WHERE type = 'airline';
This will create an index only on the documents that have (type = ‘airline').  In your queries, you'd need to include the filter (type = ‘airline') in addition to other filters so this index qualifies.
You can use complex predicates in the WHERE clause of the index. Various use cases to exploit partial indexes are:
  1. Partitioning a large index into multiple indices using the mod function.
  2. Partitioning a large index into multiple indices and placing each index into distinct indexer nodes.
  3. Partitioning the index based on a list of values. For example, you can have an index for each state.
  4. Simulating index range partitioning via a range filter in the WHERE clause. One thing to remember is Couchbase N1QL queries will use one partitioned index per query block. Use UNION ALL to have a query exploit multiple partitioned indices in a single query.

8. Duplicate Index

This isn't really a special type of index, but a feature of Couchbase indexing. You can create duplicate indexes with distinct names.
CREATE INDEX i1 ON `travel-sample`(LOWER(name),id, icoo) 
WHERE type = ‘airline’;
CREATE INDEX i2 ON `travel-sample`(LOWER(name),id, icoo) 
WHERE type = ‘airline’;
CREATE INDEX i3 ON `travel-sample`(LOWER(name),id, icoo) 
WHERE type = ‘airline’;

All three indices have identical keys, identical WHERE clause; Only difference is the name of the indices.  You can choose their physical location using the WITH clause of the CREATE INDEX.
During query optimization, query will choose one of the names. You see that in your plan. During query runtime, these indices are used in round-robin fashion to distribute the load. This gives you scale-out, multi-dimensional scaling, performance, and high availability. Not bad!

9. Covering Index

Index selection for a query solely depends on the filters in the WHERE clause of your query. After the index selection is made, the engine analyzes the query to see if it can be answered using only the data in the index. If it does, query engine skips retrieving the whole document. This is a performance optimization to consider while designing the indices.

Summary

Let's put together a partitioned composite functional array index now!
CREATE INDEX travel_all ON `travel-sample`(
iata, 
LOWER(name), 
UPPER(callsign),
            ALL DISTINCT ARRAY p.model FOR p IN jets END),
TO_NUMBER(rating),
            meta().id
) WHERE LOWER(country) = "united states" AND type = "airline";

References

Nitro: A Fast, Scalable In-Memory Storage Engine for NoSQL Global Secondary Index :http://vldb2016.persistent.com/industrial_track_papers.php

  • Couchbase Documentation: http://docs.couchbase.com
  • Comments

    Popular posts from this blog

    Swami Vivekananda: The Monk That Nobody Sent to Chicago

      There’s a saying in Chicago: “We don’t want nobody that nobody sent.” This was the cold reception Swami Vivekananda faced when he arrived in the windy city in July 1893, determined to attend the World Parliament of Religions that September. He belonged to no organization, carried no letter of recommendation, his countrymen were nobody, and represented an alien religion to the Western world. As the days passed, his hope of attending the parliament dwindled. With money running out and the odds stacked against him, he left the Windy City and went to Boston, praying for a glimmer of opportunity.  Swamiji came to America to share India’s most profound gift: the wisdom of the Hindu sages, preserved through centuries of oral tradition and embodied by its monks. This was 1893, not 1993—India was under the British grip, its resources drained, and its spirit subdued. Swamiji’s mission was not just a cultural exchange; it was a bold step toward envisioning a future where India could re...

    Why Should Databases Go Natural?

    From search to CRM, applications are adopting natural language and intuitive interactions. Should databases follow? This article provides a strategic perspective. Amid the many technological evolutions in software and hardware (CISC/RISC, Internet, Cloud, and AI), one technology has endured:  Relational Database Systems   (RDBMS), aka SQL databases. For over 50 years, RDBMS has survived and thrived, overcoming many challenges. It has evolved and adopted beneficial features from emerging technologies like object-relational databases and now competes robustly with   NoSQL databases .  Today, RDBMS dominates the market, with four of the top five databases and seven of the top ten being relational. RDBMS has smartly borrowed ideas, like JSON support, from NoSQL, while NoSQL has also borrowed from RDBMS. NoSQL no longer rejects SQL. From a user perspective, all modern databases have SQL-inspired query language and a set of APIs. All applications manage the respective data...

    iQ Interactive: Cool Things for Developers on Couchbase Capella iQ

      The landscape of software development is ever-evolving with the advent of new technologies. As we venture into 2023, natural language processing ( NLP ) is rapidly emerging as a pivotal aspect of programming. Unlike previous generations of tools that primarily aimed at enhancing coding productivity and code quality, the new generation of Artificial Intelligence ( GenAI ) tools, like iQ, is set to revolutionize every facet of a developer's workflow. This encompasses a wide range of activities: Reading, writing, and rewriting specifications Designing, prototyping, and coding Reviewing, refactoring, and verifying software Going through the iterative cycle of deploying, debugging, and improving the software Create a draft schema and sample data for any use case Natural language queries. Generate sample queries on a given dataset Fix the syntax error for a query Don't stop here. Let your imagination fly. Although the insights garnered from iQ are preliminary and should be treated ...