Database performance tuning — Simple tips and tricks

Sujit Udhane
Dec 10, 2020


Image source: https://www.skyeits.com/services/database-performance-tuning.html

1) Proactive Tuning at Server & Infrastructure level

a) DB server performance depends on -

i) RAM :-

Most of the databases perform best when the application’s working set (indexes and most frequently accessed data) fits in memory.

RAM size is the most important factor for instance sizing; other optimizations may not significantly improve the performance of the database if there is insufficient RAM.

If price/performance is a higher priority than raw performance alone, then using fast SSDs to complement smaller amounts of RAM is a viable design choice.

You should test the optimum balance for your workload and SLAs.

When the application’s working set fits in RAM, read activity from disk will be low. You can analyze this with the Query Profiling tools.

If your working set exceeds the RAM of your chosen instance size or server, consider moving to a larger instance with more memory, or partitioning (sharding) your database across multiple servers.
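
As a quick health check, here is a minimal sketch assuming PostgreSQL (other databases expose similar counters through their profiling tools); a cache hit ratio close to 1.0 suggests the working set fits in RAM:

-- Approximate buffer-cache hit ratio per database; sustained low values
-- indicate heavy read activity from disk.
SELECT datname,
       blks_hit::float / NULLIF(blks_hit + blks_read, 0) AS cache_hit_ratio
FROM pg_stat_database;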

ii) Number of CPU Cores :-

The database storage engine architecture is capable of efficiently using multiple CPU cores. Typically, a single client connection is represented by its own thread.

In addition, background worker threads perform tasks like checkpointing and cache eviction. An adequate number of CPU cores, in proportion to concurrent client connections, helps parallel execution.

Typically investing in more RAM and disk IOPS gives the highest benefit to database performance.

iii) Storage :-

The capability of the storage system creates important physical limits on the performance of a database's write operations. Many factors related to the storage system affect write performance, including random access patterns, disk caches, disk readahead, and RAID configurations.

Solid state drives (SSDs) can outperform spinning hard disks (HDDs) by 100 times or more for random workloads.

b) Single DB vs Distributed Cluster -

i) A single DB instance provides only a vertical scaling opportunity, which ultimately caps capacity at the limits of one machine.

ii) A distributed cluster provides both vertical & horizontal scaling opportunities, which helps improve scalability at the database level. (Note: write scalability remains capped if writes are restricted to a single or master instance.)

2) Database Layer performance

This is typically treated as part of application performance tuning.

The non-negotiable first step is “Know your data”.

Database-layer fine-tuning should factor in the 4 Vs (Volume/Velocity/Variety/Veracity).

a) Data modeling matters -

Understand your application’s query patterns so that you design your data model and select the appropriate indexes accordingly. Tailoring your data model to the application’s query patterns produces more efficient queries, increases the throughput of insert and update operations, and more effectively distributes your workload across a sharded cluster.

Data model design differs from case to case; e.g. a hierarchical data model suits catalog management, while a materialized view suits data analytics.

A few critical decisions can impact performance either way -

i) Relationships between entities (or documents) :-

Denormalized (flattened) or normalized (and if normalized, which normal form suits the use case).

A flattened table/collection is useful when the table/collection serves mostly READ queries and most of those queries read the relational data together; otherwise you will end up holding a larger working set in memory, ultimately hurting DB performance.

A normalized table/collection needs JOIN queries (a major performance bottleneck once the data set grows) to serve relational data.
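
A minimal sketch of the trade-off (the customers/orders tables are hypothetical, not from the original):

-- Normalized: relational reads need a JOIN at query time.
CREATE TABLE customers (id int PRIMARY KEY, name text, city text);
CREATE TABLE orders (
    id int PRIMARY KEY,
    customer_id int REFERENCES customers(id),
    total numeric
);

SELECT o.id, o.total, c.name, c.city
FROM orders o JOIN customers c ON c.id = o.customer_id;

-- Denormalized: customer fields are copied onto each order, so the read is
-- a single-table lookup, at the cost of duplicated data and larger rows.
CREATE TABLE orders_flat (
    id int PRIMARY KEY,
    customer_name text,
    customer_city text,
    total numeric
);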

ii) Nature of data :-

Tailor the data model to the nature of the data, such as text, numeric, geospatial, or time-series.

b) Analyzer, Profiler & Diagnostic tools -

Regularly review query plans with the help of tools. They offer important insights into the logical plan created by the database and how the directed acyclic graph (DAG) is generated to produce the final result, and they show which DB indexes are used during query execution. Commands like TABLE SCAN help you learn how much data lies in specific tables/collections.

E.g. databases provide commands such as EXPLAIN/ANALYZE/TABLE SCAN, which are very useful in diagnosing performance issues.

i) The Explain command/function enables you to test queries from your application, showing information about how a query will be, or was, resolved (a short example follows this list), including:

  • Which indexes were used
  • Whether the query was covered by the index or not
  • Whether an in-memory sort was performed, which indicates an index would be beneficial
  • The number of index entries scanned
  • The number of rows/documents returned, and the number read
  • How long the query took to resolve in milliseconds
  • Which alternative query plans were rejected (when using the allPlansExecution mode)
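
For example, a sketch assuming PostgreSQL (MongoDB exposes the same information through cursor.explain()):

-- ANALYZE executes the query and reports the actual plan, row counts,
-- index usage, and timing; BUFFERS adds cache/disk read statistics.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM access_log WHERE client_ip = inet '212.78.10.32';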

ii) SQL Profiler helps you locate and optimize poorly running SQL code.

Specific features and benefits of SQL Profiler include the following (a query-level sketch follows this list):

  • On-Demand Traces — You can capture SQL traces at any time by manually setting up your parameters and starting the trace.
  • Scheduled Traces — You can also specify your trace parameters in advance and schedule the trace to run at a later, more convenient time.
  • Save Traces — Execute your traces and save them for later review.
  • Trace Filters — Selectively filter SQL captures by database and by user, or capture every SQL statement sent by all users against all databases.
  • Trace Output Analyzer — A graphical table lets you quickly sort and filter queries by duration or statement, and a graphical or text based EXPLAIN plan lays out your query paths and joins.
  • Index Advisor Integration — Once you have found your slow queries and optimized them, you can also let the Index Advisor recommend the creation of underlying table indices to further improve performance.
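
Comparable data is available without a dedicated profiler; a sketch assuming PostgreSQL with the pg_stat_statements extension enabled (column names are from PostgreSQL 13+):

-- Five statements with the highest average execution time.
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 5;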

c) Indexes -

In any database, indexes support the efficient execution of queries.

Without them, the database must scan every document in a collection or table to select those that match the query statement.

If an appropriate index exists for a query, the database can use the index to limit the number of documents it must inspect.

Databases offer a broad range of index types and features with sort orders to support complex access patterns to your data.

Database indexes can be created and dropped on-demand to accommodate evolving application requirements and query patterns and can be declared on any field.

i) Use Compound (aka Composite) Indexes

Compound indexes are indexes composed of several different fields.

For example, instead of having one index on “Last name” and another on “First name”, it is typically most efficient to create an index that includes both “Last name” and “First name” if you query against both of the names.

A compound index can still be used to filter queries that specify the last name only.
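
A minimal sketch in standard SQL (the people table and columns are illustrative):

CREATE INDEX idx_people_name ON people (last_name, first_name);

-- Served by the index (both fields):
SELECT * FROM people WHERE last_name = 'Smith' AND first_name = 'Anna';

-- Also served by the index via its leftmost prefix (last_name only):
SELECT * FROM people WHERE last_name = 'Smith';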

ii) Focus on the order of fields in the Compound Index

- The first field should help reduce the number of records in the working set.

- The ESR (Equality, Sort, Range) rule can help decide the order of the fields in compound indexes; a short sketch follows these steps.

First, add those fields against which Equality queries are run.

The next fields to be indexed should reflect the Sort order of the query.

The last fields represent the Range of data to be accessed.
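
A sketch of ESR ordering (table and columns are hypothetical):

-- Query shape: equality on status, sort on created_at, range on amount.
-- ESR puts Equality first, then Sort, then Range:
CREATE INDEX idx_orders_esr ON orders (status, created_at, amount);

SELECT * FROM orders
WHERE status = 'shipped' AND amount > 100
ORDER BY created_at;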

iii) Use Covering Indexes When Possible

An index that contains all required information to resolve the query is known as a “Covering Index” — it completely covers the query.

A covering index includes all the columns the query refers to in the SELECT, JOIN, and WHERE clauses.
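
For example, assuming PostgreSQL 11+ (INCLUDE stores extra non-key columns in the index so the query can be answered by an index-only scan; other databases achieve the same by adding the column as a trailing key field):

CREATE INDEX idx_orders_customer_cover ON orders (customer_id) INCLUDE (total);

-- Resolvable entirely from the index; the table itself is not touched.
SELECT customer_id, total FROM orders WHERE customer_id = 42;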

iv) Use Caution When Considering Indexes on Low-Cardinality Fields

Queries on fields with a small number of unique values (low cardinality) can return large result sets.

Compound indexes may include fields with low cardinality, but the value of the combined fields should exhibit high cardinality.
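
For instance (hypothetical columns): an index on a status field with a handful of distinct values is of limited use on its own, but combined with a selective field the composite key becomes highly selective.

-- Low cardinality alone: rarely worth the maintenance cost.
CREATE INDEX idx_orders_status ON orders (status);

-- High combined cardinality: (status, customer_id) narrows results quickly.
CREATE INDEX idx_orders_status_customer ON orders (status, customer_id);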

v) Eliminate Unnecessary Indexes

Indexes are resource-intensive: they consume RAM and disk.

As fields are updated, associated indexes must be maintained, incurring additional CPU and disk I/O overhead. Profiling tools help you to understand index usage.
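
As an illustration, a sketch assuming PostgreSQL, where the statistics views reveal indexes that are never used:

-- Indexes with zero scans since statistics were last reset are
-- candidates for removal.
SELECT relname AS table_name, indexrelname AS index_name
FROM pg_stat_user_indexes
WHERE idx_scan = 0;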

vi) Use text search to match words inside a field

Regular indexes are useful for matching the entire value of a field. If you only want to match on a specific word in a field with a lot of text, then use a text index.
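
For example, assuming PostgreSQL full-text search (MongoDB text indexes serve the same purpose); the articles table is hypothetical:

-- A GIN index over the parsed text of the body column.
CREATE INDEX idx_articles_body ON articles
    USING GIN (to_tsvector('english', body));

-- Matches rows whose body contains the word, wherever it appears.
SELECT * FROM articles
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'performance');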

vii) Use Partial Indexes

Reduce the size and performance overhead of indexes by only including documents that will be accessed through the index.

Setting up a Partial Index to Exclude Common Values

For example,

Suppose you are storing web server access logs in a database.

Most accesses originate from the IP address range of your organization but some are from elsewhere (say, employees on dial-up connections).

If your searches by IP are primarily for outside accesses, you probably do not need to index the IP range that corresponds to your organization’s subnet.

Assume a table like this:

CREATE TABLE access_log (
    url varchar,
    client_ip inet
);

To create a partial index that suits our example, use a command such as this:

CREATE INDEX access_log_client_ip_ix ON access_log (client_ip)
WHERE NOT (client_ip > inet '192.168.100.0' AND client_ip < inet '192.168.100.255');

A typical query that can use this index would be:

SELECT * FROM access_log WHERE url = '/index.html' AND client_ip = inet '212.78.10.32';

A query that cannot use this index is:

SELECT * FROM access_log WHERE client_ip = inet '192.168.100.23';

d) Fine tuning DB queries -

The performance of database queries should be monitored regularly; in case of any aberration in key metrics, the queries should be revisited and refactored (if required).

A few critical areas require attention:

i) JOIN queries

- Type of joins (Merge/Hash/Partition-wise): understand the different types, along with the nature of the existing data and of future data.

- Join order: the smaller dataset/collection should generally come first in the join order.

ii) Aggregate queries

iii) Sorts and their order

iv) Ensure only required columns are part of the SELECT clause

v) Avoid transformations on selected columns (see the example below)
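
To illustrate the last point (a hypothetical users table): wrapping an indexed column in a function usually prevents the database from using the index.

-- Likely forces a full scan; a plain index on email cannot be used here.
SELECT * FROM users WHERE UPPER(email) = 'A@EXAMPLE.COM';

-- Alternatives: compare against the stored form, or (in PostgreSQL)
-- create an expression index that matches the transformation.
CREATE INDEX idx_users_email_upper ON users (UPPER(email));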

e) Locking -

The type and number of locks acquired and released during query execution can have a surprising effect on performance (when using a locking isolation level like the default read committed) even where no waiting or blocking occurs. There is no information in execution plans to indicate the amount of locking activity during execution, which makes it harder to spot when excessive locking is causing a performance problem.
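
Execution plans will not show it, but the lock manager can usually be inspected directly; a sketch assuming PostgreSQL:

-- Sessions currently waiting on a lock, with the lock type and mode.
SELECT a.pid, a.query, l.locktype, l.mode
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE NOT l.granted;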

f) Perform multiple steps in a single transaction -

Batch operations for UPDATE/INSERT/DELETE (or a combination of them).

Below is an example of how multiple insertions/deletions are chunked into one batch, with only one commit statement:

BEGIN;

CREATE TEMPORARY TABLE staging_table (...); -- column definitions elided in the original

INSERT INTO table1 SELECT col1, col2 FROM table2;
DELETE FROM table3 WHERE col4 = ?;
INSERT INTO table4 SELECT col1, col2 FROM table5;
DELETE FROM table5 WHERE col2 = ?;
INSERT INTO table6 SELECT col1, col2 FROM table7;

COMMIT;

g) Arranging data -

i) Sharding/Partitioning Strategies

By simply hashing a primary key value, most distributed databases randomly spray data across a cluster of nodes.

This imposes performance penalties when data is queried across nodes and adds application complexity when data needs to be localized to a specific region.

— Adopting the right sharding strategy, one that suits the application query pattern or data placement requirement, gives you much higher scalability across a diverse set of workloads.

Below are the widely adopted strategies (a small partitioning sketch follows the list):

Ranged Sharding:- Data records are partitioned across shards according to the shard key value. Records with shard key values close to one another are likely to be co-located on the same shard. This approach is well suited for applications that need to optimize range-based queries, such as co-locating data for all customers in a specific region on a specific shard.

Hashed Sharding:- Records are distributed according to an MD5 hash of the shard key value. This approach guarantees a uniform distribution of writes across shards, which is often optimal for ingesting streams of time-series and event data.

Zoned Sharding:- Provides the ability for developers to define specific rules governing data placement in a sharded cluster.
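
The same ranged/hashed ideas appear in single-node partitioning; a sketch using PostgreSQL declarative partitioning (table names are illustrative):

-- Ranged: rows are placed by key range, keeping nearby values together.
CREATE TABLE events (id bigint, created_at date, payload text)
    PARTITION BY RANGE (created_at);
CREATE TABLE events_2020 PARTITION OF events
    FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');

-- Hashed: rows are spread uniformly by a hash of the key.
CREATE TABLE clicks (id bigint, user_id bigint)
    PARTITION BY HASH (user_id);
CREATE TABLE clicks_p0 PARTITION OF clicks FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE clicks_p1 PARTITION OF clicks FOR VALUES WITH (MODULUS 2, REMAINDER 1);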

— Consider the Ordering of Your Shard Key

If you configure range-based sharding and load data sorted by the shard key, then all inserts at a given time will necessarily go to the same chunk and the same shard. This voids any benefit from adding multiple shards, as only a single shard is active at a given time.

You should design your data load such that different shard key values are inserted in parallel, into different shards. If your data is sorted in shard key order, then you can use hash-based sharding to ensure that concurrent inserts of nearby shard key values are routed to different shards.

h) Adopt the latest stable drivers -

Drivers are updated more frequently than the database, typically every several months.

Always use the most recent version of the drivers when possible, and install native extensions if available for your language. Develop a standard procedure for testing and upgrading drivers so that upgrading is naturally a part of your process.

i) Separation of READ/WRITE concerns -

If your application performs complex or long-running operations, such as reporting or ETL, you may want to isolate analytics queries from the rest of your operational workload.

By isolating different workloads, you can ensure different query types never contend for system resources, and avoid analytics queries flushing the working set from RAM. Tag the cluster nodes appropriately to achieve read and write isolation.

If you like the article, please clap for it. Also, share the article with your friends.

References -

https://www.mongodb.com/blog/post/performance-best-practices-benchmarking

https://blog.pythian.com/how-different-datatype-affect-performance/

https://www.enterprisedb.com/how-to-benchmark-postgresql-using-hammerdb-open-source-tool

https://www.percona.com/blog/2019/02/21/parallel-queries-in-postgresql/

https://www.xplenty.com/blog/7-tips-improve-etl-performance/

https://aws.amazon.com/blogs/big-data/top-8-best-practices-for-high-performance-etl-processing-using-amazon-redshift/

https://logicalread.com/sql-server-performance-tuning/#.X8SVWmgzZPY


Sujit Udhane

I am a Lead Platform Architect working in Pune, India. I have 20+ years of experience in technology, and for the last 10+ years I have been working as an architect.