6 Optimizing Joins with Join Groups

A join group is a user-created dictionary object that lists two columns that can be meaningfully joined.

This chapter contains the following topics:

Topics:

About In-Memory Joins
Joins are an integral part of data warehousing workloads. The IM column store enhances the performance of joins when the tables being joined are stored in memory.
About Join Groups
When the IM column store is enabled, the database can use join groups to optimize joins of tables populated in the IM column store.
Purpose of Join Groups
In certain queries, join groups eliminate the performance overhead of decompressing and hashing column values.
How Join Groups Work
In a join group, the database compresses all columns in the join group using the same common dictionary.
Creating Join Groups
Define join groups using the CREATE INMEMORY JOIN GROUP statement. Candidates are columns that are frequently paired in a join predicate, for example, a column joining a fact and dimension table.
Monitoring Join Group Usage
To determine whether queries are using the join group, you can pass the SQL ID to the DBMS_SQLTUNE.REPORT_SQL_MONITOR_XML function.

About In-Memory Joins

Joins are an integral part of data warehousing workloads. The IM column store enhances the performance of joins when the tables being joined are stored in memory.

Because of faster scan and join processing, complex multi-table joins and simple joins that use Bloom filters benefit from the IM column store. In a data warehousing environment, the most frequently-used joins involved a fact table and one or more dimension tables.

The following joins run faster when the tables are populated in the IM column store:

Joins that are amenable to using Bloom filters
Joins of multiple small dimension tables with one fact table
Joins between two tables that have a primary key-foreign key relationship

About Join Groups

When the IM column store is enabled, the database can use join groups to optimize joins of tables populated in the IM column store.

A join group is a set of columns on which a set of tables is frequently joined. The column set contains one or more columns; the table set contains one or more tables. The columns in the join group can be in the same or different tables. For example, if sales and times frequently join on the time_id column, then you might create a join group for (times(time_id), sales(time_id)). The maximum number of columns in a join group is 255.

Note:

A column cannot be a member of multiple join groups.

When you create a join group, the database invalidates the current In-Memory contents of the tables referenced in the join group. Subsequent repopulation causes the database to re-encode the IMCUs of the tables with the common dictionary. Thus, Oracle recommends that you first create the join group, and then populate the tables.

Create join groups using the CREATE INMEMORY JOIN GROUP statement. To add columns to or drop columns from a join group, use an ALTER INMEMORY JOIN GROUP statement. Drop a join group using the DROP INMEMORY JOIN GROUP statement.

Note:

In Oracle Active Data Guard, a standby database ignores join group definitions. A standby database does not use common dictionaries, and executes queries as if join groups did not exist.

Example 6-1 Creating a Join Group

This example creates a join group named deptid_jg that includes the department_id column in the hr.employees and hr.departments tables.

CREATE INMEMORY JOIN GROUP deptid_jg (hr.employees(department_id),hr.departments(department_id));

Purpose of Join Groups

In certain queries, join groups eliminate the performance overhead of decompressing and hashing column values.

Without join groups, if the optimizer uses a hash join but cannot use a Bloom filter, or if the Bloom filter does not filter rows effectively, then the database must decompress IMCUs and use an expensive hash join. To illustrate the problem, assume a star schema has a sales fact table and a vehicles dimension table. The following query joins these tables, but does not filter the output, which means that the database cannot use a Bloom filter:

SELECT v.year, v.name, s.sales_price
FROM   vehicles v, sales s
WHERE  v.name = s.name;

The following figure illustrates how the database joins the two data sets.

Figure 6-1 Hash Join without Join Group

Description of "Figure 6-1 Hash Join without Join Group"

The database performs a hash join as follows:

Scans the vehicles table, decompresses the rows that satisfy the predicate (in this case, all rows satisfy the predicate because no filters exist), and sends the rows to the hash join
Builds a hash table in the PGA based on the decompressed rows
Scans the sales table and applies any filters (in this case, the query does not specify filters)
Decompresses matching rows from the IMCUs, hashes them, and then sends them to the join
Probes the hash table using the join column, which in this case is the vehicle name

If a join group exists on the v.name and s.name columns, however, then the database can make the preceding steps more efficient, eliminating the decompression and filtering overhead. The benefits of join groups are:

The database operates on compressed data.
In a hash join based on a join group, the database uses an array instead of building a hash table.

The database stores codes for each join column value in a common dictionary. The database joins on the codes rather than on the actual column values. This technique avoids the overhead of copying row sources.
The dictionary codes are dense and have a fixed length, which makes them space efficient.
Optimizing a query with a join group is sometimes possible when it is not possible to use a Bloom filter.

How Join Groups Work

In a join group, the database compresses all columns in the join group using the same common dictionary.

This section contains the following topics:

Topics:

How a Join Group Uses a Common Dictionary
A common dictionary is a table-level, instance-specific set of dictionary codes.
How a Join Group Optimizes Scans
The key optimization is joining on common dictionary codes instead of column values, thereby avoiding the use of a hash table for the join.

How a Join Group Uses a Common Dictionary

A common dictionary is a table-level, instance-specific set of dictionary codes.

The database automatically creates a common dictionary in the IM column store when a join group is defined on the underlying columns. The common dictionary enables the join columns to share the same dictionary codes.

A common dictionary provides the following benefits:

Encodes the values in the local dictionaries with codes from the common dictionary, which provides compression and increases the cache efficiency of the IMCU
Enables joins to use dictionary codes to construct and probe the data structures used during hash joins
Enables the optimizer to obtain statistics such as cardinality, distribution of column values, and so on

The following figure illustrates a common dictionary that corresponds to a join group created on the sales.name and vehicles.name tables.

Figure 6-2 Common Dictionary for a Join Group

Description of "Figure 6-2 Common Dictionary for a Join Group"

When the database uses a common dictionary, the local dictionary for each CU does not store the original values: Audi, BMW, and so on. Instead, the local dictionary stores references to the values stored in the common dictionary. For example, the local dictionary might store 101 for Audi, 220 for BMW, and so on.

How a Join Group Optimizes Scans

The key optimization is joining on common dictionary codes instead of column values, thereby avoiding the use of a hash table for the join.

Consider the following query, which uses a join group to join vehicles and sales on the name column:

SELECT v.year, v.name, s.sales_price
FROM   vehicles v, sales s
WHERE  v.name = s.name
AND    v.name IN ('Audi', 'BMW', 'Porsche', 'VW');

The following figure illustrates how the join benefits from the common dictionary created on the join group.

Figure 6-3 Hash Join with Join Group

Description of "Figure 6-3 Hash Join with Join Group"

As illustrated in the preceding diagram, the database performs a hash join on the compressed data as follows:

Scans the vehicles table, and sends the dictionary codes (not the original column values) to the hash join: 0 (Audi), 1 (BMW), 2 (Cadillac), and so on
Builds an array of distinct common dictionary codes in the PGA
Scans the sales table and applies any filters (in this case, the filter is for German cars only)
Sends matching rows to the join in compressed format
Looks up corresponding values in the array rather than probing a hash table, thus avoiding the need to compute a hash function on the join key columns

In this example, the vehicles table has only seven rows. The vehicles.name column has the following values:

Audi
BMW
Cadillac
Ford
Porsche
Tesla
VW

The common dictionary assigns a dictionary code to each distinct value. Conceptually, the common dictionary looks as follows:

Audi     0
BMW      1
Cadillac 2
Ford     3
Porsche  4
Tesla    5
VW       6

The database scans vehicles.name, starting at the first dictionary code in the first IMCU and ending at the last code in the last IMCU. It stores a 1 for every row that matches the filter (German cars only), and 0 for every row that does not match the filter. Conceptually, the array might look as follows:

array[0]: 1
array[1]: 1
array[2]: 0
array[3]: 0
array[4]: 1
array[5]: 0
array[6]: 1

The database now scans the sales fact table. To simplify the example, assume that the sales table only has 6 rows. The database scans the rows as follows (the common dictionary code for each value is shown in parentheses):

Cadillac (2)
Cadillac (2)
BMW      (1)
Ford     (3)
Audi     (0)
Tesla    (5)

The database then proceeds through the vehicles.name array, looking for matches. If a row matches, then the database sends the matching row with its associated common dictionary code, and retrieves the corresponding column value from the vehicles.name and sales.name IMCUs:

2  -> array[2] is 0, so no join
2  -> array[2] is 0, so no join
1  -> array[1] is 1, so join
3  -> array[3] is 0, so no join
0  -> array[0] is 1, so join
5  -> array[5] is 0, so no join

Creating Join Groups

Define join groups using the CREATE INMEMORY JOIN GROUP statement. Candidates are columns that are frequently paired in a join predicate, for example, a column joining a fact and dimension table.

The CREATE INMEMORY JOIN GROUP statement immediately defines a join group, which means that its metadata is visible in the data dictionary. The database does not immediately construct the common dictionary. Rather, the database builds the common dictionary the next time that a table referenced in the join group is populated or repopulated in the IM column store.

Guidelines

Creating, modifying, or dropping a join group typically invalidates all the underlying tables referenced in the join group. Thus, Oracle recommends that you create join groups before initially populating the tables.

To create a join group:

In SQL*Plus or SQL Developer, log in to the database as a user with the necessary privileges.

Create a join group by using a statement in the following form:

CREATE INMEMORY JOIN GROUP join_group_name ( table1(col1), table2(col2) );

For example, the following statement creates a join group named sales_products_jg:

CREATE INMEMORY JOIN GROUP sales_products_jg (sales(prod_id), products(prod_id));

Optionally, view the join group definition by querying the data dictionary (sample output included):

COL JOINGROUP_NAME FORMAT a18
COL TABLE_NAME FORMAT a8
COL COLUMN_NAME FORMAT a7

SELECT JOINGROUP_NAME, TABLE_NAME, COLUMN_NAME, GD_ADDRESS 
FROM   DBA_JOINGROUPS;

JOINGROUP_NAME     TABLE_NA COLUMN_ GD_ADDRESS
------------------ -------- ------- ----------------
SALES_PRODUCTS_JG  SALES    PROD_ID 00000000A142AE50
SALES_PRODUCTS_JG  PRODUCTS PROD_ID 00000000A142AE50

Populate the tables referenced in the join group, or repopulate them if they are currently populated.

Monitoring Join Group Usage

To determine whether queries are using the join group, you can pass the SQL ID to the DBMS_SQLTUNE.REPORT_SQL_MONITOR_XML function.

From the command line, Oracle recommends querying the DBMS_SQLTUNE.REPORT_SQL_MONITOR_XML output for a SQL ID. If the query returns rows, then the database used the join group for the statement associated with this SQL ID. Otherwise, the database did not use the join group.

Prerequisites

To monitor join groups, you must meet the following prerequisites:

A join group must exist.
The columns referenced by the join group must have been populated after join group creation.
You must execute a join query that could potentially use the join group.

To monitor join group usage:

Log in to the database with the necessary privileges.
Obtain the SQL ID for the query you want to monitor.

For example, execute the query that you want to monitor, and then query V$SESSION.PREV_SQL_ID.
Use the DBMS_SQLTUNE.REPORT_SQL_MONITOR_XML.EXTRACT function to determine whether the database used the join group in the hash join.

If querying the DBMS_SQLTUNE.REPORT_SQL_MONITOR_XML.EXTRACT function output returns rows, then the database used the join group.

Example 6-3 Monitoring a Join Group

In this example, you create a join group on the prod_id columns of sh.products and sh.sales tables, and then join these tables on this column. Your goal is to determine whether the join query used the join group. You grant the sh account administrative privileges. You log in as sh, and then proceed as follows:

Create a SQL*Plus variable for the SQL ID as follows:
```
VAR B_SQLID VARCHAR2(13)
```

Apply the INMEMORY attribute to the sh.products and sh.sales tables as follows:

ALTER TABLE sales INMEMORY MEMCOMPRESS FOR QUERY;
ALTER TABLE products INMEMORY MEMCOMPRESS FOR QUERY;

Create a join group on prod_id:

CREATE INMEMORY JOIN GROUP sh_jg (products(prod_id), sales(prod_id));

Scan the tables to populate them in the IM column store:

SELECT /*+ FULL(s) */ COUNT(*) FROM sales s;
SELECT /*+ FULL(p) */ COUNT(*) FROM products p;

Execute a query that joins on the prod_id column, and then aggregates product sales:

SELECT /*+ USE_HASH(sales) LEADING(products sales) MONITOR */
       products.prod_id, SUM(sales.amount_sold)
FROM   products, sales
WHERE  products.prod_id = sales.prod_id
GROUP BY products.prod_id;

Obtain the SQL ID of the preceding aggregation query:

BEGIN
  SELECT PREV_SQL_ID 
  INTO   :B_SQLID
  FROM   V$SESSION 
  WHERE  SID=USERENV('SID');
END;

Determine whether the database used the join group:

SET LONGCHUNKSIZE 10000000 LONG 10000000
COL JOIN_GROUP_USAGE FORMAT A50

SELECT DBMS_SQLTUNE.REPORT_SQL_MONITOR_XML(sql_id=>:B_SQLID).
       EXTRACT(q'#//operation[@name='HASH JOIN']/rwsstats/stat[@id='9']#').
       GETCLOBVAL(2,2) join_group_usage
FROM DUAL;

JOIN_GROUP_USAGE
--------------------------------------------------
<stat id="9">1</stat>

The query returned rows, so the database used the join group for the statement associated with this SQL ID.