Indexing JSON in Couchbase

[Reposting of my article at DZone: http://dzone.com/articles/index-first-and-query-faster]

Introduction

There are three things important in database systems:performance, performance, performance. Creating the right index, with the right keys, right order, and right expression is critical to query performance in any database system. That's true for Couchbase as well.

We've discussed data modeling for JSON and querying on JSONearlier. In this article, we'll discuss indexing options for JSON in Couchbase.

Couchbase 4.5 can create two types of indices:

Standard global secondary index.
Memory-Optimized global secondary index.

The standard secondary index stores uses the ForestDB storage engine to store the B-Tree index and keeps the optimal working set of data in the buffer. That means, the total size of the index can be much bigger than the amount of memory available in each index node.

A memory-optimized index uses a novel lock-free skiplist to maintain the index and keeps 100% of the index data in memory. A memory-optimized index has better latency for index scans and can also process the mutations of the data much faster.

Both standard and memory-optimized indeces implement multi-version concurrency control (MVCC) to provide consistent index scan results and high throughput. During cluster installation, choose the type of index.

The goal is to give you an overview of various indices you create in each of these services so that your queries can execute efficiently.

The goal of this article is not to describe or compare and contrast these two types of index services. It does not cover the Full Text Index (FTS), in developer preview now. Another topic not covered by this article is how the index selection is made for scans and joinsCouchbase documentation

Let's take travel-sample dataset shipped with Couchbase 4.5 to walk through this. To try out these indices, install Couchbase 4.5. On your web console, go to Settings->Sample Buckets to install travel-sample.

Here are the various indices you can create.

Primary Index
Named primary index
Secondary index
Composite Secondary Index
Functional index
Array Index
ALL array 
ALL DISTINCT array
Partial Index
Duplicate Indices
Covering Index

Background

Couchbase is a distributed database. It supports flexible data model using JSON. Each document in a bucket will have a user-generated unique document key. This uniqueness is enforced during insertion of the data.

Here's an example document.

select meta().id, travel 
from `travel-sample` travel 
where type = 'airline' limit 1;
[
    {
        "id": "airline_10",
        "travel": {
            "callsign": "MILE-AIR",
            "country": "United States",
            "iata": "Q5",
            "icao": "MLA",
            "id": 10,
            "name": "40-Mile Air",
            "type": "airline"
            }
        }
    ]

1. Primary Index

Create the primary index on 'travel-sample';

The primary index is simply the index on the document key on the whole bucket. The Couchbase data layer enforces the uniqueness constraint on the document key. The primary index, like every other index, is maintained asynchronously. The primary index is used for full bucket scans (primary scans) when the query does not have any filters (predicates) or no other index or access path can be used.

Here is the metadata for this index:

select * from system:indexes where name = ‘#primary’;
"indexes": {
    "datastore_id": "http://127.0.0.1:8091",
    "id": "f6e3c75d6f396e7d",
    "index_key": [],
    "is_primary": true,
    "keyspace_id": "travel-sample",
    "name": "#primary",
    "namespace_id": "default",
    "state": "online",
    "using": "gsi"
    }

The metadata gives you additional information on the index: Where the index resides (datastore_id), its state (state) and the indexing method (using).

2. Named Primary Index

CREATE PRIMARY INDEX `def_primary` ON `travel-sample`

You can also name the primary index. The rest of the features of the primary index are the same, except the index is named. A good side effect of this is that you can have multiple primary indices in the system. Duplicate indices help with high availability as well as query load distribution throughout them. This is true for both primary indices and secondary indices.

select meta().id as documentkey, `travel-sample` airline 
from `travel-sample` 
where type = 'airline' limit 1;
  {
    "airline": {
      "callsign": "MILE-AIR",
      "country": "United States",
      "iata": "Q5",
      "icao": "MLA",
      "id": 10,
      "name": "40-Mile Air",
      "type": "airline"
    },
    "documentkey": "airline_10"
  }

3. Secondary Index

The secondary index is an index on any key-value or document-key. This index can be any key within the document. The key can be of any time: scalar, object, or array. The query has to use the same type of object for the query engine to exploit the index.

CREATE INDEX travel_name ON `travel-sample`(name);
name is a simple scalar value.
{    "name": "Air France"  }
CREATE INDEX travel_geo on `travel-sample`(geo);
geo is an object embedded within the document.  Example:
    "geo": {
        "alt": 12,
        "lat": 50.962097,
        "lon": 1.954764
        }

Creating indexes on keys from nested objects is straightforward.

CREATE INDEX travel_geo on `travel-sample`(geo.alt);
CREATE INDEX travel_geo on `travel-sample`(geo.lat);

Schedule is an array of objects with flight details. This indexes on the complete array. Not exactly useful unless you're looking for the whole array.

CREATE INDEX travel_schedule ON `travel-sample`(schedule);
Example:
"schedule": [
        {
            "day": 0,
            "flight": "AF198",
            "utc": "10:13:00"
            },
        {
            "day": 0,
            "flight": "AF547",
            "utc": "19:14:00"
            },
        {
            "day": 0,
            "flight": "AF943",
            "utc": "01:31:00"
            },
        {
            "day": 1,
            "flight": "AF356",
            "utc": "12:40:00"
            },
        {
            "day": 1,
            "flight": "AF480",
            "utc": "08:58:00"
            },
        {
            "day": 1,
            "flight": "AF250",
            "utc": "12:59:00"
            }
    ]

4. Composite Secondary Index

It's common to have queries with multiple filters (predicates). So, you want the indices with multiple keys so the indices can return only the qualified document keys. Additionally, if a query is referencing only the keys in the index, the query engine will simply answer the query from the index scan result without going to the data nodes. This is a commonly exploited performance optimization.

CREATE INDEX travel_info ON `travel-sample`(name, type, id, icoo, iata);

Each of the keys can be a simple scalar field, object, or an array. For the index filtering to be exploited, the filters have to use respective object type in the query filter.

The keys to the secondary indices can include document keys (meta().id) explicitly if you need to filter on it in the index.

5. Functional Index

It's common to have names in the database with a mix of upper and lower cases. When you need to search, "John," you want it to search for any combination of "John," "john," etc. Here's how you do it.

CREATE INDEX travel_cxname ON `travel-sample`(LOWER(name));

Provide the search string in lowercase and the index will efficiently search for already lowercased values in the index.

EXPLAIN SELECT * FROM `travel-sample` WHERE LOWER(name) = “john”;
{
    "#operator": "IndexScan",
    "index": "travel_cxname",
    "index_id": "2f39d3b7aac6bbfe",
    "keyspace": "travel-sample",
    "namespace": "default",
    "spans": [
    {
        "Range": {
            "High": [
            "\"john\""
            ],
            "Inclusion": 3,
            "Low": [
            "\"john\""
            ]
        }
    }
    ],

You can use complex expressions in this functional index.

CREATE INDEX travel_cx1 ON `travel-sample`(LOWER(name), 
    length*width, round(salary));

6. Array Index

JSON is hierarchical. At the top level, it can have scalar fields, objects, or arrays. Each object can nest other objects and arrays. Each array can have other objects and arrays. And so on. The nesting continues.

When you have this rich structure, here's how you index a particular array, or a field within the sub-object.

Consider the array, schedule:

schedule: 
[  
    {  
        "day" : 0, 
        "special_flights" : 
        [ 
        {  
            "flight" : "AI111", "utc" : ”1:11:11"
            }, 
        {  
            "flight" : "AI222", "utc" : ”2:22:22" 
            }   
            ]  
        },
    {
        "day": 1,
        "flight": "AF552",
        "utc": "14:41:00”
        }
    ]
CREATE INDEX travel_sched ON `travel-sample`
 (ALL DISTINCT ARRAY v.day FOR v IN schedule END) 
This index key is an expression on the array to clearly reference only the elements needed to be indexed.
schedule the array we’re dereferencing into.
v is the variable we’ve implicitly declared to reference each element/object within the array: schedule
v.day refers to the element within each object of the array schedule.

The query below will exploit the array index.

EXPLAIN SELECT * FROM `travel-sample` 
WHERE ANY v IN SCHEDULE SATISFIES v.day = 2 END;
{
    "#operator": "DistinctScan",
    "scan": {
        "#operator": "IndexScan",
        "index": "travel_sched",
        "index_id": "db7018bff5f10f17",
        "keyspace": "travel-sample",
        "namespace": "default",
        "spans": [
        {
            "Range": {
                "High": [
                "2"
                ],
                "Inclusion": 3,
                "Low": [
                "2"
                ]
                }
            }
        ],
        "using": "gsi"
    }

Because the key is a generalized expression, you get the flexibility to apply additional logic and processing on the data before indexing. For example, you can create functional indexing on elements of each array. Because you're referencing individual fields of the object or element within the array, the index creation, size, and search are efficient.

The index above stores only the distinct values within an array. To store all elements of an array in an index, use the DISTINCT modifier to the expression.

CREATE INDEX travel_sched ON `travel-sample`
    (ALL ARRAY v.day FOR v IN schedule END)

7. Partial Index

So far, the indices we've created will create indices on the whole bucket. Because the Couchbase data model is JSON and JSON schema are flexible, an index may not contain entries to documents with absent index keys. That's expected.

Unlike relational systems, where each type of row is in a distinct table, Couchbase buckets can have documents of various types. Typically, customers include a type field to differentiate distinct types.

{
    "airline": {
        "callsign": "MILE-AIR",
        "country": "United States",
        "iata": "Q5",
        "icao": "MLA",
        "id": 10,
        "name": "40-Mile Air",
        "type": "airline"
        },
    "documentkey": "airline_10"
    }

When you want to create an index of airline documents, you can simply add the type field for the WHERE clause of the index.

CREATE INDEX travel_info ON `travel-sample`(name, id, icoo, iata)

WHERE type = 'airline';

This will create an index only on the documents that have (type = ‘airline'). In your queries, you'd need to include the filter (type = ‘airline') in addition to other filters so this index qualifies.

You can use complex predicates in the WHERE clause of the index. Various use cases to exploit partial indexes are:

Partitioning a large index into multiple indices using the mod function.
Partitioning a large index into multiple indices and placing each index into distinct indexer nodes.
Partitioning the index based on a list of values. For example, you can have an index for each state.
Simulating index range partitioning via a range filter in the WHERE clause. One thing to remember is Couchbase N1QL queries will use one partitioned index per query block. Use UNION ALL to have a query exploit multiple partitioned indices in a single query.

8. Duplicate Index

This isn't really a special type of index, but a feature of Couchbase indexing. You can create duplicate indexes with distinct names.

CREATE INDEX i1 ON `travel-sample`(LOWER(name),id, icoo) 
WHERE type = ‘airline’;
CREATE INDEX i2 ON `travel-sample`(LOWER(name),id, icoo) 
WHERE type = ‘airline’;
CREATE INDEX i3 ON `travel-sample`(LOWER(name),id, icoo) 
WHERE type = ‘airline’;

All three indices have identical keys, identical WHERE clause; Only difference is the name of the indices. You can choose their physical location using the WITH clause of the CREATE INDEX.

During query optimization, query will choose one of the names. You see that in your plan. During query runtime, these indices are used in round-robin fashion to distribute the load. This gives you scale-out, multi-dimensional scaling, performance, and high availability. Not bad!

9. Covering Index

Index selection for a query solely depends on the filters in the WHERE clause of your query. After the index selection is made, the engine analyzes the query to see if it can be answered using only the data in the index. If it does, query engine skips retrieving the whole document. This is a performance optimization to consider while designing the indices.

Summary

Let's put together a partitioned composite functional array index now!

CREATE INDEX travel_all ON `travel-sample`(
iata, 
LOWER(name), 
UPPER(callsign),
            ALL DISTINCT ARRAY p.model FOR p IN jets END),
TO_NUMBER(rating),
            meta().id
) WHERE LOWER(country) = "united states" AND type = "airline";

References

Nitro: A Fast, Scalable In-Memory Storage Engine for NoSQL Global Secondary Index :http://vldb2016.persistent.com/industrial_track_papers.php

Couchbase: http://www.couchbase.com

Couchbase Documentation: http://docs.couchbase.com

Planet NoSQL

Search This Blog