Index in DBMS and how to use them
Why Indexes?
-
Indexes are an alternative way for File Organizations, we can speed up a query execution by better organizing the data in a file.
-
An Index is a copy of selected columns of data, from a table, that is designed to enable very efficient search.
-
Users cannot see the indexes, they’re just used to speed up searches/queries.
Examples of Indexing:
- B+ Tree
- Hash Index
- Bitmap Index
Index Basics
• An Index speeds up searches for a subset of records, based on values in a certain (search key) field.
• By minimizing the number of disk accesses required when a query is processed.
Better Definition: An Index is a data structure that organizes records to optimize retrieval.
Hash Index
• A Has index is collection of Buckets, where bucket = primary page plus overflow pages • Buckets contain data entries
• Uses a has function h: h(r) = bucket in which (data entry for) record r belongs
Pros: Good for equality Search
Cons: Not good for range searches
B+ Trees
- Leaf pages contain data entries, and are chained (prev & next)
- Non-Lead pages have data entries
- B+ tree is a (key, value) storage method in a tree like structure
More on Indexes
- A file can have several indexes
Primary vs Secondary
-
If search key contains the primary key, it is called primary index, anyother index is called secondary index
-
If the search key contains a candidate key, it is called a unique index, it cannot return any duplicates
Examples:
Clustered Indexing vs Non-Cluster Indexing
Clustering Indexing
- When more than two records are stored in the same file
Non-Cluster Indexing
- Opposite of that
What to think about when creating Indexes
- What relations should have indexes?
- What field(s) should be the search key?
- Should we build several or one index?
For each index, what kind of index should it be?
- Clustered
- Hash or Tree
Choosing Indexes
-
Attributes in WHERE clause are candidates for index keys
-
Multi-Attribute search keys should be used when a WHERE clause contains several conditions
Composite Search Keys: Search on a combination of fields. (e.g. <data,price>)
Equality Query: Constant value * date = “02-25-2021” and price = 69
Range Query: In a range, not constant * date = “02-25-2021” and price > 69
Practice working in SQL
- Using consumer compliant database from data.gov
Lets write a Query without an index
to find the counts of the top 5 Product, State pairs in the complaints database (return the product and state as well as the count). Lets use single-line syntax and timing so we can see how long it takes.
%time %sql SELECT product, state, count(*) AS C FROM complaints GROUP BY product, state ORDER BY C DESC LIMIT 5;
Output:
CPU times: user 18.9 ms, sys: 9.48 ms, total: 28.4 ms Wall time: 48.2 ms
Product | State | c |
---|---|---|
None | None | 13451 |
Mortgage | CA | 3891 |
Mortgage | FL | 2343 |
Debt collection | None | 1654 |
Mortgage | None | 1427 |
Note: The time taken to execute the query without indexing
Lets write a Single Search Key Index
Syntax:
CREATE INDEX index_name
ON table_name (column_name);
Note, you may have to drop index if it exists so here’s the syntax in that case:
DROP INDEX IF EXISTS index_name;
CREATE INDEX index_name ON table(attributes)
Single-Key Index Example
%%sql
DROP INDEX IF EXISTS state_index;
CREATE INDEX state_index ON complaints(state);
Now lets use the newly created index to run the same query again
%time %sql SELECT product, state, count(*) AS C FROM complaints GROUP BY product, state ORDER BY C DESC LIMIT 5;
Output:
CPU times: user 16 ms, sys: 7.18 ms, total: 23.2 ms Wall time: 36.5 ms
Product | State | c |
---|---|---|
None | None | 13451 |
Mortgage | CA | 3891 |
Mortgage | FL | 2343 |
Debt collection | None | 1654 |
Mortgage | None | 1427 |
Note: The time taken to execute the query using Single-Key Index is faster than having no index at all.
Create Covering Index and Check the Time
%%sql
DROP INDEX IF EXISTS state_product_index;
CREATE INDEX state_product_index ON complaints(state, product);
%time %sql SELECT product, state, count(*) AS c FROM complaints GROUP BY product, state ORDER BY c DESC LIMIT 5;
Output
CPU times: user 16.3 ms, sys: 4.54 ms, total: 20.8 ms Wall time: 31.1 ms
Product | State | c |
---|---|---|
None | None | 13451 |
Mortgage | CA | 3891 |
Mortgage | FL | 2343 |
Debt collection | None | 1654 |
Mortgage | None | 1427 |
Note: The time taken to execute the query using Covering Index is faster than the previous attempts at speeding up the query.
Use EXPLAIN Query Plan
- EXPLAIN is an operator that tells SQL to explain its query plan.
Syntax
EXPLAIN QUERY PLAN your_query_here;
Example:
%%sql
EXPLAIN QUERY PLAN
SELECT product, state, count(*) AS c
FROM complaints
GROUP BY product, state ORDER BY c DESC LIMIT 5;
Output:
id | parent | notused | detail |
---|---|---|---|
8 | 0 | 0 | SCAN TABLE complaints USING COVERING INDEX state_product_index |
44 | 0 | 0 | USE TEMP B-TREE FOR ORDER BY |
More on B and B+ Index
Hash Indexes
- Remember Bitcoin, block chain
What are Has Index?
-
A Hash Index is a collection of buckets, where the bucket is the primary page plus any overlfow pages, while the buckets contain one or more data entries
-
To implement we use a hash function; h
- h(r) = bucket in which (data entry for) record r belongs
Pros and Cons of Hash Index
Pros: Good for equality Search (=)
Cons: Not so good for range search (Use tree indexes liek B or B+ trees instead)
Operations on Hash Indexes
- Equality Search
- Apply hash function on search key to locate the bucket needed
- Deletion
-
Finds the bucket and deletes the record(s)
-
Insertion
- Find the bucket and inserts the record, if no space, then it creates a new overflow page
Hash Functions is any function that can be used to convert numerical input into another compressed numerical value.
- Idead that the function must be uniform: each bucket is assigned the same number of key value
Bucket Overflow
- A bucket overflow can occur because of insufficient number of buckets or a skew in distribution of records
- It’s handled by using overflow buckets
Static Hashing
- In Static Hashing the number of data buckets in memory remains constant throughout
Extendible or Dynamic Hashing
- Keeps directory of pointers to buckets
- When overflow occurs, it reorganizes the index by doubling the directory (and not the number of buckets)
Pros and Cons
Pros
- save some money
Cons
- Can be expensive as shit
How it Works
Inserting
Deleting
Extra