Index in DBMS and how to use them

5 minute read

Why Indexes?

  • Indexes are an alternative way for File Organizations, we can speed up a query execution by better organizing the data in a file.

  • An Index is a copy of selected columns of data, from a table, that is designed to enable very efficient search.

  • Users cannot see the indexes, they’re just used to speed up searches/queries.

Examples of Indexing:

  • B+ Tree
  • Hash Index
  • Bitmap Index

Index Basics

• An Index speeds up searches for a subset of records, based on values in a certain (search key) field.

• By minimizing the number of disk accesses required when a query is processed.

Better Definition: An Index is a data structure that organizes records to optimize retrieval.

Hash Index

• A Has index is collection of Buckets, where bucket = primary page plus overflow pages • Buckets contain data entries

• Uses a has function h: h(r) = bucket in which (data entry for) record r belongs

Pros: Good for equality Search

Cons: Not good for range searches

B+ Trees

  • Leaf pages contain data entries, and are chained (prev & next)
  • Non-Lead pages have data entries
  • B+ tree is a (key, value) storage method in a tree like structure

More on Indexes

  • A file can have several indexes

Primary vs Secondary

  • If search key contains the primary key, it is called primary index, anyother index is called secondary index

  • If the search key contains a candidate key, it is called a unique index, it cannot return any duplicates

Examples:

Clustered Indexing vs Non-Cluster Indexing

Clustering Indexing

  • When more than two records are stored in the same file

Non-Cluster Indexing

  • Opposite of that

What to think about when creating Indexes

  • What relations should have indexes?
  • What field(s) should be the search key?
  • Should we build several or one index?

For each index, what kind of index should it be?

  • Clustered
  • Hash or Tree

Choosing Indexes

  • Attributes in WHERE clause are candidates for index keys

  • Multi-Attribute search keys should be used when a WHERE clause contains several conditions

Composite Search Keys: Search on a combination of fields. (e.g. <data,price>)

Equality Query: Constant value * date = “02-25-2021” and price = 69

Range Query: In a range, not constant * date = “02-25-2021” and price > 69

Practice working in SQL

  • Using consumer compliant database from data.gov

Lets write a Query without an index

to find the counts of the top 5 Product, State pairs in the complaints database (return the product and state as well as the count). Lets use single-line syntax and timing so we can see how long it takes.

%time %sql SELECT product, state, count(*) AS C FROM complaints GROUP BY product, state ORDER BY C DESC LIMIT 5;

Output:

CPU times: user 18.9 ms, sys: 9.48 ms, total: 28.4 ms Wall time: 48.2 ms

Product State c
None None 13451
Mortgage CA 3891
Mortgage FL 2343
Debt collection None 1654
Mortgage None 1427

Note: The time taken to execute the query without indexing

Lets write a Single Search Key Index

Syntax:

CREATE INDEX index_name
ON table_name (column_name);

Note, you may have to drop index if it exists so here’s the syntax in that case:

DROP INDEX IF EXISTS index_name;

CREATE INDEX index_name ON table(attributes)

Single-Key Index Example

%%sql
DROP INDEX IF EXISTS state_index;
CREATE INDEX state_index ON complaints(state);

Now lets use the newly created index to run the same query again

%time %sql SELECT product, state, count(*) AS C FROM complaints GROUP BY product, state ORDER BY C DESC LIMIT 5;

Output:

CPU times: user 16 ms, sys: 7.18 ms, total: 23.2 ms Wall time: 36.5 ms

Product State c
None None 13451
Mortgage CA 3891
Mortgage FL 2343
Debt collection None 1654
Mortgage None 1427

Note: The time taken to execute the query using Single-Key Index is faster than having no index at all.

Create Covering Index and Check the Time

%%sql
DROP INDEX IF EXISTS state_product_index;
CREATE INDEX state_product_index ON complaints(state, product);
%time %sql SELECT product, state, count(*) AS c FROM complaints GROUP BY product, state ORDER BY c DESC LIMIT 5;

Output

CPU times: user 16.3 ms, sys: 4.54 ms, total: 20.8 ms Wall time: 31.1 ms

Product State c
None None 13451
Mortgage CA 3891
Mortgage FL 2343
Debt collection None 1654
Mortgage None 1427

Note: The time taken to execute the query using Covering Index is faster than the previous attempts at speeding up the query.

Use EXPLAIN Query Plan

  • EXPLAIN is an operator that tells SQL to explain its query plan.

Syntax

EXPLAIN QUERY PLAN your_query_here;

Example:

%%sql
EXPLAIN QUERY PLAN
SELECT product, state, count(*) AS c
FROM complaints
GROUP BY product, state ORDER BY c DESC LIMIT 5;

Output:

id parent notused detail
8 0 0 SCAN TABLE complaints USING COVERING INDEX state_product_index
44 0 0 USE TEMP B-TREE FOR ORDER BY

More on B and B+ Index

Hash Indexes

  • Remember Bitcoin, block chain

What are Has Index?

  • A Hash Index is a collection of buckets, where the bucket is the primary page plus any overlfow pages, while the buckets contain one or more data entries

  • To implement we use a hash function; h

    • h(r) = bucket in which (data entry for) record r belongs

Pros and Cons of Hash Index

Pros: Good for equality Search (=)

Cons: Not so good for range search (Use tree indexes liek B or B+ trees instead)

Operations on Hash Indexes

  • Equality Search
    • Apply hash function on search key to locate the bucket needed
  • Deletion
  • Finds the bucket and deletes the record(s)

  • Insertion

  • Find the bucket and inserts the record, if no space, then it creates a new overflow page

Hash Functions is any function that can be used to convert numerical input into another compressed numerical value.

  • Idead that the function must be uniform: each bucket is assigned the same number of key value

Bucket Overflow

  • A bucket overflow can occur because of insufficient number of buckets or a skew in distribution of records
  • It’s handled by using overflow buckets

Static Hashing

  • In Static Hashing the number of data buckets in memory remains constant throughout

Extendible or Dynamic Hashing

  • Keeps directory of pointers to buckets
  • When overflow occurs, it reorganizes the index by doubling the directory (and not the number of buckets)

Pros and Cons

Pros

  • save some money

Cons

  • Can be expensive as shit

How it Works

Inserting

Deleting

Extra

BitMap Indexing

Updated: