How to use Composite Indexes

You might be a developer like me. You might also wander around in the deep and dark layers of relational database systems. You might've been face to face with this thing some call a composite index.

how to use composite indexes

Why are you talking in this denigrating tone?

Because in the fairly limited experience I have, I have seen some grave cases of index abuse. Of them, badly designed composite indexes - or multi-column indexes - were always the most aggravating ones.


Woah not so fast, remind me again about regular indexes

Well, remember that the slowest part of a database is its I/O. Data is stored in pages which are read as a whole. Since a hard disk takes some time to retrieve one such page, we want to read as little pages as possible (unless they are adjacent to each other, in which case multiple pages can be read at once with a relatively cheap cost).

Say we have a table containing personal data like a Name and an Age. Suppose some query wants to retrieve all data about persons named Johnny.

"Gimme all Johnnies!"

It would be silly to scan the complete set of data (and reading all pages from disk), only to find a small part of the population.

What we really want to do is create some extra pages containing information like:

"Records with Name Johnny can be found in pages 1, 32 and 57"

This resembles a lot like the index at the back of a book!

Because we would want entries like these for any name in the population, the index may occupy some precious disk space itself. Note that this should be taken with a grain of salt. While expensive high speed disks are still widely used, the price for storage is generally declining.

Also, there should be a page providing information like:

"To find names beginning with J, go to page 82.
To find names beginning with K, go to page 141"

The database can use this tree structure called an index to retrieve the result for the query above much faster, since it only has to read those three pages plus the relevant pages in which the index is stored.

Beware, when another Johnny joins the club, the index also has to be updated. Creating an index and keeping it up to date with the data it describes comes with some processing cost. This downside is part of a trade-off some are afraid of to make.

Okay, so what's the deal with Composite Indexes?

A composite index differs from a regular index because it is an index providing directions for multiple columns. For the example from above we could create an index providing directions for both Name and Age. It will look somewhat like this:

Root page of the index:
"To find names beginning with J, go to page 82."

Page 82:
"To find Johnnies go to page 9."

Page 9:
"Johnnies of age 12 can be found in page 32.
Johnnies of age 24 can be found in pages 1 and 32.
Johnnies of age 38 can be found in page 57."

It is clear that an index like this takes up somewhat more space since there is an additional level in the tree. There is also more work to be done when persons are added, removed or changed.

This seems quite handy actually

It is! Composite indexes can be very useful when the query is asking questions like:

"Gimme all 12 year old Johnnies!"

In this case, the index provides very directed information about what pages of data should be retrieved, since all variables of the query are covered by the index alone. This is called a covering index. Covering indexes are usually very performant!

Even the first query (asking for all persons named Johnny) can be executed efficiently using this index, since the index is rather selective on the column Name. It narrows down the tree structure sufficiently.

But here comes the disaster. I have seen the use of these kind of indexes here and there. In almost all of these cases the columns were added to the index in alphabetical order, meaning that the index tree also has to be navigated alphabetically. Let us think this through using the same example. When creating a composite index for Name and Age in alphabetical order, indexing will happen on the Age column first. The index pages will be something like:

Root page of the index:
"To find persons of age 11, go to page 13.
To find persons of age 12, go to pages 82.
To find persons of age 13, go to page 87.
To find persons of age 14, go to page 93.
To find persons of age 15, go to page 107."

Page 82:
"To find names beginning with J, go to page 14.
To find names beginning with K, go to page 105.
To find names beginning with L, go to page 3."

Page 14:
"To find Johnnies go to page 9."

Page 9:
"12 year old Johnnies can be found in page 32."

By indexing on age first, another level of depth is added. This is the case because Age is not nearly selective enough to find our Johnnies immediately. Indexing on Name first on the other hand (which is more selective because a name is generally more unique) will lead faster to the requested result.

It gets really bad when the first query (the one to retrieve all Johnnies regardless of Age) is executed again. Using the latter index, all pages underneath the root page need to be retrieved since the query provides no selection on Age whatsoever. This means that we still need to read more pages than we really want to!

What should I learn from this?

Composite indexes are great when used correctly! Here are some facts:

  • Composite indexes are optimal when most queries use all columns of the index. In this case we talk about covering indexes.
  • Arrange the columns from more selective to lesser selective, but always keep your queries in mind. They will do the effective selection, no matter how selective a column is.
  • Composite indexes occupy somewhat more space and require more processing power to maintain.
  • It can be useful to create multiple indexes for the columns of a composite index. Some database systems only keep track of statistics for the first column of an index. Creating multiple indexes tackles this issue and may even provide better indexes for some queries. Mind the overhead though!
  • The proof of the pudding is in the eating. A query optimizer may sometimes seem like a non-deterministic machine. Be sure to test often when improving the performance of your queries and indexes! You can use the Query Execution Plan to help you with this task. This will help you understand and direct some of the choices the database system makes.

What is your experience with Composite Indexes or indexes in general?

Do not hesitate to share your ideas and experiences on this topic!