COUNT and GROUP Faster With N1QL

[Repost from my article at DZone: https://dzone.com/articles/count-amp-group-faster-using-n1ql]


Humans have counted things for a very long time. In database applications, COUNT() is used in many contexts, and its performance affects both application performance and user experience. Keeping this in mind, Couchbase supports a generalized COUNT() and has improved its performance in the Couchbase 4.5 release.

Two Couchbase 4.5 Features

There are two features in Couchbase 4.5 that help COUNT performance.
When the query needs only the COUNT of a range of indexed data, the indexer does the counting itself. In other words, the query pushes the counting down to the index. This reduces the amount of data exchanged between the indexer and the query engine, improving query speed.
CREATE INDEX ixname ON `travel-sample` (name);
SELECT COUNT(name) 
FROM `travel-sample` 
WHERE name = 'Air Alaska';
First, when all of the predicates can be pushed down to the index scan and we know the scan won't produce any false positives, the index can simply COUNT() the qualifying values without returning the qualifying data to the query engine. This improves COUNT performance. When the COUNT() is pushed to the index scan, the optimizer denotes this by choosing the IndexCountScan method in the query plan. Qualifying queries benefit directly, without any effort from the application developer.
The predicate (name = 'Air Alaska') is pushed down to the index scan along with the COUNT() aggregate.
{
    "#operator": "IndexCountScan",
    "covers": [
        "cover ((`travel-sample`.`name`))",
        "cover ((meta(`travel-sample`).`id`))"
    ],
    "index": "ixname",
    "index_id": "ea81dd71a0a98351",
    "keyspace": "travel-sample",
    "namespace": "default",
    "spans": [
        {
            "Range": {
                "High": [
                    "\"Air Alaska\""
                ],
                "Inclusion": 3,
                "Low": [
                    "\"Air Alaska\""
                ]
            }
        }
    ]
}
Second, the MIN() aggregate is executed optimally when the query is covered and can be pushed down to the index.
SELECT MIN(name) 
FROM `travel-sample` 
WHERE name > 'Air Alaska';
For this query, the index scan finds the first name greater than "Air Alaska". Because we are interested in just one minimum value, the query plan pushes the {limit:1} parameter to the index scan. This is much faster than fetching all the qualifying values from the index scan to determine the minimum.
{
    "#operator": "IndexScan",
    "covers": [
        "cover ((`travel-sample`.`name`))",
        "cover ((meta(`travel-sample`).`id`))"
    ],
    "index": "ixname",
    "index_id": "ea81dd71a0a98351",
    "keyspace": "travel-sample",
    "limit": "1",
    "namespace": "default",
    "spans": [
    {
        "Range": {
        "Inclusion": 0,
        "Low": [
            "\"Air Alaska\""
        ]
        }
    }
    ]
}

Use Cases

Let’s walk through common use cases and see how these features work together to speed up many queries. The examples use the travel-sample dataset shipped with Couchbase Server.
With a little help from the application developer, there are some interesting additional uses. Since multiple customers are using this approach, it’s worth checking whether it can help you as well.
Let’s look at common questions and how to improve their performance with Couchbase 4.5.

Use Case 1

What’s the total number of documents in the bucket?
cbq> select count(*) from `travel-sample`;
{
    "requestID": "15549343-54ec-44d8-b62a-a4d98445b484",
    "signature": {
        "$1": "number"
    },
    "results": [
        {
            "$1": 31591
        }
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "7.724136ms",
        "executionTime": "7.693342ms",
        "resultCount": 1,
        "resultSize": 35
    }
}
Counting the whole bucket runs in a few milliseconds because the total number of documents comes directly from the bucket metadata.
{
    "#operator": "CountScan",
    "keyspace": "travel-sample",
    "namespace": "default"
}

Use Case 2

Let’s see how many document types we have in the travel-sample bucket:
> select distinct type from `travel-sample`;
{
    "requestID": "584a7785-95fe-4d4a-b72b-7e161858bf2a",
    "signature": {
        "type": "json"
    },
    "results": [
        {
            "type": "airline"
        },
        {
            "type": "route"
        },
        {
            "type": "hotel"
        },
        {
            "type": "landmark"
        },
        {
            "type": "airport"
        }
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "1.031288771s",
        "executionTime": "1.031253954s",
        "resultCount": 5,
        "resultSize": 202
    }
}
A full second for this seems quite expensive. Let’s look at the query plan.
{
    "#operator": "PrimaryScan",
    "index": "def_primary",
    "keyspace": "travel-sample",
    "namespace": "default",
    "using": "gsi"
}

To determine the distinct items, this plan scans the whole bucket and then computes the distinct types. Let’s see how we can improve this.
The MIN() optimization in Couchbase 4.5 can be exploited here. If the field is the leading key of an index, the index holds the data pre-sorted. Using this, we can write a small script that calculates the distinct values much faster.

Step 1

Get the first entry in the index.
      SELECT MIN(type) 
      FROM `travel-sample` 
      WHERE type IS NOT MISSING;
{
"results": [
        {
            "$1": "airline"
        }
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "15.050713ms",
        "executionTime": "15.023588ms",
        "resultCount": 1,
        "resultSize": 39
    }
}
The query plan shows that this is done by the index scan alone. The scan parameter "limit": "1" shows why we get the result so fast: because the index stores data in sorted order, the first item is the lowest value.
        {   
            "#operator": "IndexScan",
            "covers": [
                "cover ((`travel-sample`.`type`))",
                "cover ((meta(`travel-sample`).`id`))"
            ],  
            "index": "def_type",
            "index_id": "7cea57503ecfe0d3",
            "keyspace": "travel-sample",
            "limit": "1",
            "namespace": "default",
            "spans": [
                {   
                    "Range": {
                        "Inclusion": 0,
                        "Low": [
                            "null"
                        ]   
                    }   
                }   
            ],  
            "using": "gsi"
        }

Step 2  

Now we use the index to find the next value for the type.
Use the predicate (type > "airline"). Now you get airport:
SELECT MIN(type)
FROM `travel-sample` 
WHERE type > "airline";
{
"results": [
        {
            "$1": "airport"
        }
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "10.370383ms",
        "executionTime": "10.346036ms",
        "resultCount": 1,
        "resultSize": 39
    }
}

Repeat step 2 with the previous value in the predicate until you get NULL as the result.
A simple Python script will get the job done.
import requests

url = "http://localhost:8093/query"
s = requests.Session()
s.keep_alive = True
s.auth = ('Administrator', 'password')

# Step 1: get the lowest value of type from the index.
query = {'statement': 'SELECT MIN(type) minval FROM `travel-sample` WHERE type IS NOT MISSING;'}
r = s.post(url, data=query, stream=False, headers={'Connection': 'close'})
lastval = r.json()['results'][0]['minval']

# Step 2: repeat with (type > lastval) until no value is returned.
while lastval is not None:
    print(lastval)
    stmt = 'SELECT MIN(type) minval FROM `travel-sample` WHERE type > "' + lastval + '";'
    r = s.post(url, data={'statement': stmt}, stream=False, headers={'Connection': 'close'})
    lastval = r.json()['results'][0]['minval']

Pro Tip

  • Use a prepared statement for the SELECT statement in the while loop to improve the performance even further.
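As a sketch of that tip, you would PREPARE the parameterized statement once before the loop and then execute it by name on every iteration, so the server skips parsing and planning each time. The statement name next_type and the helper below are hypothetical, and this assumes the N1QL REST API's prepared-statement execution with named parameters (where parameter values are JSON-encoded):

```python
import json

# Prepared once, before the loop: a parameterized version of the MIN() query.
# The name "next_type" is just for illustration.
PREPARE_STMT = ('PREPARE next_type FROM '
                'SELECT MIN(type) minval FROM `travel-sample` '
                'WHERE type > $lastval;')

def execute_payload(lastval):
    """Build the request body that executes the prepared statement by name,
    passing the previous value as the named parameter $lastval
    (JSON-encoded, as the REST API expects)."""
    return {'prepared': 'next_type', '$lastval': json.dumps(lastval)}

# Inside the loop, POST this payload instead of a full statement string:
print(execute_payload('airline'))
```

On each iteration, the script sends only the small payload above; the server reuses the already-compiled plan.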

Use Case 3

Grouping works closely with counting. It’s common to group the data by type, name, and date and count them. If you have a dataset with millions of documents, the index scan itself may take minutes to scan and retrieve all the data.
Here is a simple example on travel-sample.
SELECT type, count(type) 
FROM `travel-sample`
GROUP BY type;
    "metrics": {
        "elapsedTime": "904.607551ms",
        "executionTime": "904.585043ms",
        "resultCount": 5,
        "resultSize": 321
    }
This scans the entire bucket to calculate the groups and counts. You can improve the performance by using the index scan.
SELECT type, count(type) 
FROM `travel-sample`
WHERE type IS NOT MISSING
GROUP BY type;
The query plan uses the index now.
        {
          "#operator": "IndexScan",
          "covers": [
            "cover ((`travel-sample`.`type`))",
            "cover ((meta(`travel-sample`).`id`))"
          ],
          "index": "def_type",
          "index_id": "a1ae13a30ad4408",
          "keyspace": "travel-sample",
          "namespace": "default",
 ...
This query runs faster as well.
    "metrics": {
        "elapsedTime": "212.792255ms",
        "executionTime": "212.771933ms",
        "resultCount": 5,
        "resultSize": 321
    }
Let’s see if we can make this faster.
Using the combination of MIN() and COUNT() optimization features in 4.5, we can improve this further by writing a small script.
Here is the outline.
Step 1: Get the first entry in the index for the type.
Step 2: Then, COUNT() from the data set where type = first-value.
Step 3: Now we use the index to find the next value for type.
Step 4: Repeat step 2 and 3 for all the values of type.
Here’s the actual Python script:
import requests

url = "http://localhost:8093/query"
s = requests.Session()
s.keep_alive = True
s.auth = ('Administrator', 'password')

# Step 1: get the lowest value of type from the index.
query = {'statement': 'SELECT MIN(type) minval FROM `travel-sample` WHERE type IS NOT MISSING;'}
r = s.post(url, data=query, stream=False, headers={'Connection': 'close'})
lastval = r.json()['results'][0]['minval']

while lastval is not None:
    # Step 2: COUNT() the documents for the current type (uses IndexCountScan).
    stmt = 'SELECT COUNT(type) tcount FROM `travel-sample` WHERE type = "' + lastval + '";'
    r = s.post(url, data={'statement': stmt}, stream=False, headers={'Connection': 'close'})
    tcount = r.json()['results'][0]['tcount']
    print(lastval, tcount)
    # Step 3: find the next distinct value of type.
    stmt = 'SELECT MIN(type) minval FROM `travel-sample` WHERE type > "' + lastval + '";'
    r = s.post(url, data={'statement': stmt}, stream=False, headers={'Connection': 'close'})
    lastval = r.json()['results'][0]['minval']

Let's run this!
$ time python ./group.py 
airline 187
airport 1968
hotel 917
landmark 4495
route 24024
real	0m0.372s
user	0m0.079s
sys	0m0.036s
$
This was run on travel-sample with about 31,000 documents. The Python script itself runs about 10 SELECT statements to retrieve this result. As the number of documents in the dataset grows, the script approach runs much faster than a single statement. Further, using prepared statements saves additional CPU cycles.
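To see why the statement count stays small, note that the script issues one initial MIN() query, then one COUNT() and one next-MIN() per distinct value, each served by a cheap index scan rather than a full bucket scan. A quick sketch of that arithmetic (the function name is just for illustration):

```python
def queries_issued(distinct_values):
    # one initial MIN() to find the first value, then one COUNT()
    # and one next-MIN() query for each distinct value
    return 1 + 2 * distinct_values

# travel-sample has 5 document types, so the script runs 11 queries,
# matching the "about 10 SELECT statements" above
print(queries_issued(5))  # → 11
```

The query count grows with the number of distinct values, not with the number of documents, which is why the approach scales well on large datasets with low-cardinality grouping keys.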

Conclusion

The MIN() and COUNT() optimizations in Couchbase 4.5 improve query performance. With a little additional logic, you can use these optimizations to speed up a variety of queries.
