
It's the Workload, Stupid!

As an application architect, you will eventually have to choose the database or database service to power your newest application or microservice. Selecting among relational databases was easier. The use cases were roughly divided into OLTP and OLAP (decision support), and the workload differences between the two were well known. OLTP workloads consist of short transactions touching a few random rows, expecting millisecond responses on pre-compiled queries; OLAP workloads consist of data loads and long-running queries scanning millions of rows of a fact table in a star or snowflake schema. Each had its performance benchmark and TCO well defined, measured, and audited via TPC benchmarks. You could use these numbers to approximate your workload, then check that the needs and capabilities match on other fronts, like administration.
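The OLTP/OLAP contrast above can be sketched in a few lines. This is a minimal illustration with a hypothetical `orders` table in SQLite: the OLTP request is a point lookup by primary key; the OLAP request scans every row of the fact table.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY,"
             " customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, i % 100, 10.0 * (i % 7)) for i in range(1, 10001)])

# OLTP: a short transaction touching a few random rows via the primary key.
row = conn.execute("SELECT amount FROM orders WHERE order_id = ?",
                   (4243,)).fetchone()

# OLAP: a query that scans all rows of the fact table to aggregate.
total, n = conn.execute("SELECT SUM(amount), COUNT(*) FROM orders").fetchone()
print(row[0], n)
```

The point here is access pattern, not syntax: the first query should touch one index entry and one row; the second must read the whole table, which is why the two workloads stress a database so differently.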

Then there are NoSQL databases. NoSQL databases were invented to handle the webscale performance needs of operational applications. They had to be elastic to handle the scale and to tolerate nodes going down (aka partition tolerance). That sparked innovation, creating databases for a variety of data models and use cases. There are databases for JSON, graphs, time series, and more. From Azure databases to ZODB, from Couchbase to Cassandra, from MongoDB to TiDB, from spatial to JSON databases: so many different kinds of databases. In fact, NoSQL-databases.org lists 225 databases as of November 2018.

E-commerce applications need to generate sales reports, shopping cart applications need to report outstanding shopping carts, and so on. Every application makes progress on a workflow on behalf of its user or customer. In NoSQL, these operational queries can be simple key-value operations, short-range queries, or complex search queries. This workload is the bread and butter of most businesses.
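The first two operational patterns named above can be sketched without any database at all. This is a toy in-memory model (the store, keys, and helper names are ours, purely illustrative): a key-value get/set, plus a short-range query over keys sorted by (customer, date).

```python
import bisect

# Key-value operations: get and set by exact key.
store = {}
def kv_set(key, value): store[key] = value
def kv_get(key): return store.get(key)

# Short-range query: all orders for one customer, keyed by (customer, day).
keys = sorted([("alice", d) for d in (1, 3, 5)] +
              [("bob", d) for d in (2, 4)])

def range_query(customer):
    # Binary search for the contiguous run of keys with this customer prefix.
    lo = bisect.bisect_left(keys, (customer, float("-inf")))
    hi = bisect.bisect_left(keys, (customer, float("inf")))
    return keys[lo:hi]

kv_set("cart:alice", ["toy"])
print(kv_get("cart:alice"), range_query("alice"))
```

A real document or wide-column store does the same two things with persistence, replication, and indexes, but the access shapes — exact key, contiguous key range — are what define the workload.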

The solutions for high-volume transactions and high-volume analysis are seemingly contradictory. Lookup operations and queries (e.g., searching for flights) require a very efficient lookup and a quick data flow back to the application. Milliseconds matter.

“By the way, most workloads are mixed workloads” — Larry Ellison, Founder and CTO of Oracle.

  • What are the application requests to the database?
      • Simple operations to get and set data (e.g., new customer, new order)
      • Search requests (search for products, orders, etc.)
      • Pagination requests (list a customer's orders, sorted by date)
      • Reporting workload
      • Queries from business intelligence tools
  • As the number of concurrent users increases, which of these operations need to scale?
  • What's the trigger point in the workload to add or remove database nodes?
  • What's the failover strategy?
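One way to make the questions above concrete is to write the workload down as data. This is a hypothetical sketch (the class, field names, and numbers are ours): each application operation gets a kind, a share of total traffic, and a latency SLA, so the mix can be checked and reasoned about.

```python
from dataclasses import dataclass

@dataclass
class Operation:
    name: str        # application-level request
    kind: str        # "kv", "range", "search", "report", or "bi"
    share: float     # fraction of total requests
    sla_ms: float    # latency target at peak concurrency

# Illustrative numbers for a hypothetical e-commerce workload.
workload = [
    Operation("new-order",      "kv",     0.55, 10),
    Operation("list-orders",    "range",  0.30, 50),
    Operation("product-search", "search", 0.14, 200),
    Operation("sales-report",   "report", 0.01, 60_000),
]

# The shares of a complete workload description should sum to one.
assert abs(sum(op.share for op in workload) - 1.0) < 1e-9
print(max(workload, key=lambda op: op.share).name)
```

Once the workload is written this way, questions like "which operations need to scale with concurrency?" become questions about the high-share, low-SLA rows.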
Comparing NoSQL Databases
If you showed the pending orders for a user in the app before, you still need to do that with the new database. You may change how you achieve the result, but not the result itself. That brings us back to the workload. You'll have to understand the full application workload and the database workload it generates, then measure performance and scaling. For simple use cases, the YCSB benchmark will help; for complex use cases, YCSB-JSON can be used. This exercise will require you to translate the workload into the respective database operations and measure them: each of the critical operations, as well as system characteristics like elasticity, replication speed, and failover.
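The core of a YCSB-style measurement is simple: run each critical operation many times and report latency percentiles, not just averages. A minimal sketch, where `op` is a stand-in for a real database call (the timed function here is a dummy):

```python
import statistics
import time

def measure(op, runs=200):
    """Time `op` repeatedly and return latency percentiles in milliseconds."""
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        op()
        samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    return {"p50": statistics.median(samples),
            "p95": samples[int(0.95 * len(samples)) - 1]}

# Dummy operation standing in for a database get/query call.
stats = measure(lambda: sum(range(1000)))
print(sorted(stats))
```

Tail latency (p95, p99) is usually what breaks an SLA under concurrency, which is why benchmarks report percentiles per operation type rather than one overall number.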
Database Workload
And all that you flush
All that you shard
All you hash
All that you fail
All you test
All you save
And all that you store
And all that you transact
Begin, commit or rollback
And all you do
And all that you undo
And all that you redo
And every node you add (every node you add)
And all that you search
And every key you index
And all that is run
And all that’s to change
But the base is running on the cloud

Source: BI Research

Recognizing the need to address mixed workloads, OLTP databases have added features for complex queries (e.g., hash joins, window functions), NoSQL databases have added SQL, and SQL databases have added JSON. We may be reaching the paradox of choice!

Until we have a universal solution, we need to evaluate and stitch together different products and services to support the business workload and business outcome. For example, you could run SQL Server for OLTP and Teradata for OLAP, or use Couchbase for an e-commerce application and Hadoop for machine learning. First, you'll have to drill down into the workload to understand the operations within it.

So, what’s a workload?

The article What’s Your Definition of the Workload? gives examples of database workloads from various angles. Application designers may define the workload by the application SLAs at a given concurrency. DBAs may define it by CPU, memory usage, I/O throughput, etc. The advice is to understand the resource usage and tie it to higher-level entities like queries, users, and applications.

TPC-C describes the benchmark workload and measurement in the following way: "The most frequent transaction consists of entering a new order which, on average, is comprised of ten different items. Each warehouse tries to maintain stock for the 100,000 items in the Company’s catalog and fill orders from that stock. However, in reality, one warehouse will probably not have all the parts required to fill every order. Therefore, TPC-C requires that close to ten percent of all orders must be supplied by another warehouse of the Company. Another frequent transaction consists in recording a payment received from a customer. Less frequently, operators will request the status of a previously placed order, process a batch of ten orders for delivery, or query the system for potential supply shortages by examining the level of stock at the local warehouse. A total of five types of transactions, then, are used to model this business activity. The performance metric reported by TPC-C measures the number of orders that can be fully processed per minute and is expressed in tpm-C."
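The five-transaction mix quoted above can be written down directly. A sketch, using the minimum mix shares from the TPC-C specification (Payment, Order-Status, Delivery, and Stock-Level each have required minimums; New-Order takes the remainder), with a hypothetical total throughput to show how tpmC counts only New-Order transactions:

```python
# Minimum transaction-mix shares per the TPC-C specification;
# New-Order takes whatever share remains (roughly 45%).
mix = {"Payment": 0.43, "Order-Status": 0.04,
       "Delivery": 0.04, "Stock-Level": 0.04}
mix["New-Order"] = 1.0 - sum(mix.values())

total_txn_per_min = 10_000            # hypothetical measured throughput
tpmC = total_txn_per_min * mix["New-Order"]   # only New-Orders count
print(len(mix), round(tpmC))
```

This is the key property of the metric: a system processing 10,000 mixed transactions per minute at this mix reports roughly 4,500 tpm-C, because the other four transaction types are load the system must carry but not orders it gets credit for.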

For modern applications, as an architect, you'll have to understand all of the application operations, their patterns, and their SLAs. Application performance isn't measured by a simple benchmark, but by the application's ability to perform at business scale over a long period of time. Adding the time factor means you'll have to consider the scale, SLA, and infrastructure cost both on a slow summer day and on the day 100 million customers are on your site trying to add their favorite toy to the shopping cart and buy it.

Relational databases (aka SQL databases) shared the same relational model, roughly the same data types, and similar SQL implementations. The differences were in performance, which could be measured by TPC benchmarks, and in ease of administration, which was more subjective.

Variety is the name of the game for NoSQL databases. There are specialized databases for key-value, JSON, wide-column, graph, time-series, spatial data, and more. These represent different data models and the various operations that can be done effectively in each model. Each of them has its own APIs, query languages, performance characteristics, and scaling capabilities.

The critical lesson while evaluating databases for your workload is:

The business and application requirements on the database won’t change just because you can use a new kind of database.

To summarize, the workload is described by this modified Pink Floyd lyric (with apologies to Roger Waters).

All that you cache

All that you scale

All that you share

All you create

And all that you delete

And all that is planned

And everything under the (data)base is in tune
