This documentation is related to an older version of Firebolt. For the most current documentation, see Firebolt documentation.
Using primary indexes
- How you create a primary index
- Primary indexes can’t be modified
- How to choose primary index columns
- Using partitions with primary indexes
- Primary index examples
Firebolt uses primary indexes to physically sort data into the Firebolt File Format (F3). The index also colocates similar values, which allows data to be pruned at query runtime. When you query a table, rather than scanning the whole data set, Firebolt uses the table’s index to prune the data. Unnecessary ranges of data are never loaded from disk. Firebolt reads only the relevant ranges of data to produce query results.
Primary indexes in Firebolt are a type of sparse index. Unlike a dense index that maps every search key value in a file, a sparse index is a smaller construct that holds only one entry per data block (a compressed range of rows). By using the primary index to read a much smaller and highly compressed range of data from F3 into the engine cache at query runtime, Firebolt produces query results much faster with less disk I/O.
The video below explains sparse indexing. Eldad Farkash is the CEO of Firebolt.
How you create a primary index
To define a primary index, you use the PRIMARY INDEX
clause within a CREATE TABLE
statement. Although they are optional, we strongly recommend them.
The basic syntax of a PRIMARY INDEX
clause within a CREATE TABLE
statement is shown in the example below.
CREATE [FACT|DIMENSION] TABLE <table_name> (
<colname_1> <datatype>,
<colname_2> <datatype>,
<colname_3> <datatype>,
...
)
PRIMARY INDEX <colname_1> [, <...colname_N>];
Primary indexes can’t be modified
After you create a table, you can’t modify the primary index. To change the index, you must drop the table and recreate it.
How to choose primary index columns
The columns that you choose for the primary index and the order in which you specify them are important. Use the following recommendations to guide your choices. To see these guidelines in action, see Primary index examples.
Include columns used in WHERE clauses
Include all columns that are used in query WHERE
clauses to filter query results.
Consider including columns used in GROUP BY clauses
Consider adding columns that you use in GROUP BY
statements with aggregate functions.
Order columns in the index definition by cardinality
Specify columns in order of how frequently they’re used in WHERE
clauses and in descending order of cardinality. In other words, in the first position (<colname_1>
in the syntax above) specify the column that filters results the most. Then specify remaining columns in descending order of how much they filter.
Avoid specifying a column of the highest cardinality—that is, a column that has truly unique values or the primary key—unless you use that column in query WHERE
clauses. Also avoid specifying columns of low cardinality that won’t adequately filter results.
Include as many columns as you need
The number of columns that you specify in the index won’t negatively affect query performance. Additional columns might slow down ingestion very slightly, but the benefit for flexibility and performance of analytics queries will almost certainly outweigh any impact to ingestion performance.
Consider how you alter values in WHERE clauses
The primary index isn’t effective if Firebolt can’t determine the values in the index column. If the WHERE
clause in your query contains a function that transforms the column values, Firebolt can’t use the index. Consider a table with the primary index definition shown below, where asset_id
is a TEXT
data type in a table named events_log
.
PRIMARY INDEX asset_id
In the example analytics query over the events_log
table, Firebolt can’t use the primary index with the WHERE
clause. This is because the function with asset_id
is on the left side of the comparison. To satisfy the conditions of comparison, Firebolt must read all values of asset_id
to apply the UPPER
function.
SELECT
asset_id
FROM
events_log
WHERE
UPPER(asset_id) LIKE ‘AA%’;
In contrast, Firebolt can use the primary index in the following example:
SELECT
asset_id
FROM
events_log
WHERE
asset_id LIKE ‘AAA%’;
If you know that you will use a function in a predicate ahead of time, consider creating a virtual column to store the result of the function. You can then use that virtual column in your index and queries. This is particularly useful for hashing columns.
With a star schema, include join key columns in the fact table index
If you have a star schema with a fact table referring to many dimension tables, include the join keys (the foreign key columns) in the primary index of the fact table. This helps accelerate queries because the Firebolt query planner uses join keys as a predicate.
Conversely, on the dimension table side, there is no benefit to including the join key in the dimension table primary index unless you use it as a filter on the dimension table itself.
Using partitions with primary indexes
In most cases, partitioning isn’t necessary because of the efficiency of primary indexes (and aggregating indexes). If you use partitions, the partition column is the first stage of sorting. Firebolt divides the table data into file segments according to the PARTITION BY
definition. Then, within each of those segments, Firebolt applies the primary index to prune and sort the data into even smaller data ranges as described above.
For more information, see Working with partitions.
Primary index examples
This section demonstrates different primary indexes created on a fact table, site_sales
, created with the DDL and sample values shown below.
Example fact table
The examples in this section are based on the fact table below. The table is a web log with hundreds of millions of rows. Each record stores an event_count
of each event_type
for HTML elements identified by asset_id
and the customer_id
of the visitor that initiated the event. The table is denormalized. No single column has unique values.
Table DDL
CREATE FACT TABLE events_log (
visit_date DATE,
asset_id TEXT,
customer_id TEXT NOT NULL,
event_type TEXT,
event_count INTEGER NOT NULL
)
PRIMARY INDEX <see examples below>;
Table contents (excerpt)
+------------+--------------------------------------+-------------+------------+-------------+
| visit_date | asset_id | customer_id | event_type | event_count |
+------------+--------------------------------------+-------------+------------+-------------+
| 2018-05-30 | a974ff70-3367-4460-bd2e-e26de9439469 | 78152 | click | 137 |
| 2020-11-13 | 3d58b0a0-f838-428b-8f1c-2ff30aa9b9ea | 57328 | mouseover | 104 |
| 2020-07-11 | e8c533a4-b039-44df-b3a1-61fdc3d6c21d | 44963 | mouseout | 111 |
| 2019-09-06 | 02333518-5a39-4c11-a1a7-ed5e4163cb04 | 70147 | click | 49 |
| 2019-05-04 | 83f3a7bc-f6ca-4511-a8fb-74683e2b25cc | 58458 | mouseover | 127 |
| 2021-03-19 | 1664be68-d09e-4beb-a5b7-1889ebd06cfb | 40360 | mouseout | 43 |
| 2018-05-01 | 37858981-bbea-46bf-8d85-0e9a73062137 | 47880 | mouseover | 101 |
| 2018-06-19 | c5e7882b-3639-42ee-93f0-8bf5c6e99b14 | 74728 | mouseover | 141 |
| 2018-02-28 | 97ac6f85-3bbc-4894-811f-7584304c84f9 | 84802 | mouseout | 15 |
| ... | | | | |
+------------+--------------------------------------+-------------+------------+-------------+
Cardinality of columns
A COUNT DISTINCT
query on each column returns the following. A higher number indicates higher cardinality.
+----------------+-----------------+--------------------+-----------------+
| distinct_dates | distinct_assets | distinct_customers | distinct_events |
+----------------+-----------------+--------------------+-----------------+
| 1461 | 300 | 89664 | 3 |
+----------------+-----------------+--------------------+-----------------+
Example query pattern—date-based queries
Consider the two example queries below that return values with date-based filters.
Query 1
SELECT
*
FROM
events_log
WHERE
visit_date BETWEEN '2020-01-01' AND '2020-01-02'
AND customer_id = "11386"
AND event_type = 'click'
Query 2
SELECT
count(*),
visit_date
FROM
events_log
WHERE
EXTRACT(YEAR FROM visit_date) = ‘2021’
For both queries, the best primary index is:
PRIMARY INDEX (visit_date, customer_id, event_type)
- With
visit_date
in the first position in the primary index, Firebolt sorts and compresses records most efficiently for these date-based queries. - The addition of
customer_id
in the second position andevent_type
in the third position further compresses data and accelerates query response. customer_id
is in the second position because it has higher cardinality thanevent_type
, which has only three possible values.asset_id
is not used in query filters, so it is omitted.
For query 2, you can improve performance further by partitioning the table according to year as shown in the query excerpt below.
PRIMARY INDEX (visit_date, customer_id, event_type)
PARTITION BY (EXTRACT (YEAR FROM visit_date))
Without the partition, Firebolt likely must scan across file segments to return results for the year 2021. With the partition, segments exist for each year, and Firebolt can read all results from a single segment. If the query runs on a multi-node engine, the benefit may be greater. Firebolt can avoid pulling data from multiple engine nodes for results.
Example query pattern—customer-based query
Consider the example query below that returns the sum of click
values for a particular customer.
SELECT
asset_id,
customer_id,
event_type,
SUM(event_value)
FROM
events
WHERE
customer_id = "14493"
AND event_type = 'click'
AND event_value > 0
GROUP BY
1,
2,
3;
For this query, the best primary index is:
PRIMARY INDEX (customer_id, asset_id, event_type)
customer_id
is in the first position because it has the highest cardinality and sorts and prunes data most efficiently for this query.- The addition of
asset_id
won’t accelerate this particular query, but adding it is not detrimental. - Although
event_type
has low cardinality, because it’s contained in theWHERE
clause, adding it to the primary index has some benefit.
Example—using virtual columns
Virtual columns are most often used in a primary index to:
- Accommodate functions that alter column values.
- Calculate hash values for columns that contain long strings.
A virtual-column example for a function that transforms a column value is shown below.
Step 1—create the fact table with the virtual column in the index
The example DDL below creates a fact table similar to the one earlier in this section. However, it adds the upper_customer_id
column. The table creates this virtual column to store the result of an UPPER
function that upper-cases customer_id
values during the INSERT INTO
operation.
The PRIMARY INDEX
clause uses the upper_customer_id
column because that column is used in analytics queries.
CREATE FACT TABLE events_log (
visit_date DATE,
asset_id TEXT,
customer_id TEXT NOT NULL,
event_type TEXT,
event_count INTEGER NOT NULL,
uppder_customer_id TEXT NOT NULL
)
PRIMARY INDEX visit_date, upper _customer_id;
Step 2—use the function during ingestion (INSERT INTO
statement)
INSERT INTO
events_log
SELECT
visit_date,
asset_id,
customer_id,
event_type,
event_count,
UPPER(customer_id) AS upper_customer_id
FROM
ext_tbl_events;
Step 3—query using the virtual column in predicates
The example SELECT
query below uses the virtual column to produce query results and benefits from the index.
SELECT
customer_id
FROM
events_log
WHERE
upper_customer_id LIKE ‘AAA%’;