ETL 流程下的 Oracle 物化视图：机会成本方法

会员 11127135

4.00/5 (1投票)

2018 年 1 月 30 日

CPOL

6分钟阅读

8008

ETL 流程中的 Oracle 物化视图。

数据仓库领域中的维度模型规则，如果没有提供适当的报告响应时间，那就是徒劳的。

当处理数百万甚至数十亿行数据时，即使拥有非常强大的 CPU，数据库在执行数据分组或数据排序操作时也可能花费很长时间。

使查询响应时间尽可能快的一种可能方法是，在主 ETL 流程完成事实表加载后预先聚合数据，这样在用户需要时就可以预先计算出来。更准确地说，可能的数据聚合步骤甚至应该成为整个 ETL 链的一部分（通常在最后调用）。

Oracle 拥有一个很棒的功能，允许开发人员在不需要昂贵探索工具的情况下实现这一点（当然，Oracle 本身并不便宜……但据我所知，它应该是最好的数据库）。Oracle 为开发人员提供的功能称为物化视图，它就像一个由底层基本查询构建的数据快照。此功能在其他数据库引擎中也存在，即使名称不同（例如，SQL Server 有一个类似的功能称为索引视图，IBM DB2 有物化查询表）。

使用物化视图，我们最终可以采用多种方法对主事实表进行聚合，这几乎就像我们拥有较小的事实表，如下面的模式所示。

如果存在物化视图，Oracle 数据库引擎就能够重写查询，使其利用这些预先计算的持久化查询。本文将不提供有关相关 SQL 语法或如何创建它们的详细解释。这方面的信息已经很多了。物化视图创建后，可以手动刷新、提交时刷新、计划刷新等。Oracle 甚至可以建议您创建哪种类型的物化视图来加速特定查询。该软件包的名称是 DBMS_ADVISOR.QUICK_TUNE。像 https://docs.oracle.com/database/121/DWHSG/toc.htm 或这个 ETL wiki 这样的链接将为您提供与物化视图创建相关的技术信息，以及大量关于 Oracle 和通用数据仓库或 ETL 开发见解的信息。

本文将展示使用物化视图的收益证据。我将创建一个测试维度模型，包含一个事实表和两个维度。该模型将模拟前往五个欧洲城市的旅行信息，涉及时间和成本。它将存储随机生成的 1600 万行数据。

最后，将进行模拟，针对我们的维度模型运行 100 个查询。前 50 个查询是在为每个物化视图禁用查询重写选项后执行的。相应的语法如下：

alter materialized view fac_trip_agg_city_cost disable query rewrite

第二个 50 个查询是启用查询重写后执行的。查询将在迭代之间有所不同，使用随机生成的“where”子句，因为底层上下限将是随机生成的数字。这样，我们就可以最小化缓存结果的使用，并强制查询每次都执行。

因此，在接下来的步骤中，我们将创建所需的源表，然后是四个物化视图、统计表，最后执行一个模拟脚本。让我们开始吧。

步骤 1

请遵循下面的脚本（必须为使用的用户分配适当的权限，请注意这一点）- 它创建两个维度表并向所有表中插入数据。

/**************************************************************************************
creates a dummy dimension table with city names and surrogate keys
**************************************************************************************/
create table dim_city (city_sk number(5), city_name varchar2(50)) compress nologging;
create unique index ix_pk_dim_city on dim_city (city_sk) nologging pctfree 1 storage _
(buffer_pool default flash_cache default cell_flash_cache default) noparallel;
alter table dim_city add constraint ix_pk_dim_city primary key (city_sk);
insert into dim_city values(0, 'Unknown');
insert into dim_city values(1, 'Lisbon');
insert into dim_city values(2, 'London');
insert into dim_city values(3, 'Paris');
insert into dim_city values(4, 'Amsterdam');
insert into dim_city values(5, 'Brussels');
commit;
/*************************************************************************************
creates a dummy dimension table with transportation types and surrogate keys
**************************************************************************************/
create table dim_transport(transport_sk number(5), transport_name varchar2(50)) compress nologging;
create unique index ix_pk_dim_transport on dim_transport (transport_sk) nologging pctfree _
1 storage (buffer_pool default flash_cache default cell_flash_cache default) noparallel;
alter table dim_transport add constraint ix_pk_dim_transport primary key (transport_sk);
insert into dim_transport values(0, 'Unknown');
insert into dim_transport values(1, 'Plane');
insert into dim_transport values(2, 'Car');
insert into dim_transport values(3, 'Train');
insert into dim_transport values(4, 'Bus');
commit;

第二步

下一个语句将创建主事实表，采用随机数生成方法。City 和 Transport 的外键是随机生成的，表中的其他字段，如 cost 或 travel time 也是随机生成的。

/*************************************************************************************
creates a dummy fact table with 16.000.000 random data rows
the logic assigns higher values to trips made to London, Paris or Amsterdam
**************************************************************************************/
create table fac_trip
nologging
compress
noparallel as
with tbl_aux(line_id) as (select level - 1 as line_id
                         from dual
                         connect by level <= 4000),
    tbl_aux_2 as (select  rownum as fac_sk,
                          round(dbms_random.value(0, 5), 0) as city_fk,
                          round(dbms_random.value(0, 4), 0) as transport_fk,
                          date'2016-01-01' + dbms_random.value(0, 500) as dat_trip_start,
                          dbms_random.value(0, 9) as val_trip_dur_hours,
                          dbms_random.value(0, 2250) as val_trip_cost,
                          1 as qty_trip
                  from tbl_aux a
                  cross join tbl_aux b)
select cast(w.fac_sk as number(10)) as fac_sk,
      cast(w.city_fk as number(5)) as city_fk,
      cast(w.transport_fk as number(5)) as transport_fk,
      w.dat_trip_start,
      case when w.city_fk in (2, 3, 4) then w.val_trip_dur_hours * dbms_random.value(1, 1.5) _
                else w.val_trip_dur_hours end as val_trip_dur_hours,
      case when w.city_fk in (2, 3, 4) then w.val_trip_cost * dbms_random.value(1, 1.5) _
                else w.val_trip_cost end as val_trip_cost,
      cast(w.qty_trip as number(1)) as qty_trip
from tbl_aux_2 w;

步骤 3

在创建数据和所需的支持表后，让我们更新它们的底层统计信息。

/**************************************************************************************
estimates statistics for the created tables
**************************************************************************************/
execute dbms_stats.gather_table_stats(ownname => 'your_owner_here', _
tabname => 'dim_city', cascade => true, estimate_percent => dbms_stats.auto_sample_size);
execute dbms_stats.gather_table_stats(ownname => 'your_owner_here', _
tabname => 'dim_transport', cascade => true, estimate_percent => dbms_stats.auto_sample_size);
execute dbms_stats.gather_table_stats(ownname => 'your_owner_here', _
tabname => 'fac_trip', cascade => true, estimate_percent => dbms_stats.auto_sample_size);

步骤 4

现在让我们创建四个物化视图，它们将连接并聚合主事实表数据（按 city 和 transport 类型）。查询计算平均 cost 和 travel time。

/**************************************************************************************
creates four materialized views so they are used in the test script bellow the idea will be 
to run several queries against our dummy fact table with enable query rewrite and 
without enable query rewrite
**************************************************************************************/
create materialized view fac_trip_agg_city_cost
nologging
cache
build immediate
enable query rewrite as
select b.city_sk,
      b.city_name,
      avg(a.val_trip_cost) as metric_value
From fac_trip a,
    dim_city b
where a.city_fk = b.city_sk
group by b.city_sk,
        b.city_name;
create materialized view fac_trip_agg_city_duration
nologging
cache
build immediate
enable query rewrite as
select b.city_sk,
      b.city_name,
      avg(A.val_trip_dur_hours) as metric_value
From fac_trip a,
    dim_city b
where a.city_fk = b.city_sk
group by b.city_sk,
        b.city_name;
create materialized view fac_trip_agg_transp_cost
nologging
cache
build immediate
enable query rewrite as
select b.transport_sk,
      b.transport_name,
      avg(a.val_trip_cost) as metric_value
From fac_trip a,
    dim_transport b
where a.transport_fk = b.transport_sk
group by b.transport_sk,
        b.transport_name;
create materialized view fac_trip_agg_transp_duration
nologging
cache
build immediate
enable query rewrite as
select b.transport_sk,
      b.transport_name,
      avg(A.val_trip_dur_hours) as metric_value
From fac_trip a,
    dim_transport b
where a.transport_fk = b.transport_sk
group by b.transport_sk,
        b.transport_name;

步骤 5

让我们创建三个辅助表来存储模拟脚本中获得的统计数据和计时。

/**************************************************************************************
creates two tables to store the test query results
**************************************************************************************/
create table tbl_aux_stats_city
(
 iteration               number(3),
 iteration_type          varchar2(1),
 metric_name             varchar2(50),
 city_name               varchar2(50),
 metric_value            number
)
nologging
noparallel
nocache;

create table tbl_aux_stats_transport
(
 iteration               number(3),
 iteration_type          varchar2(1),
 metric_name             varchar2(50),
 transport_name          varchar2(50),
 metric_value            number
)
nologging
noparallel
nocache;

/***************************************************************************************
and a table to store the executions’ timings
***************************************************************************************/
create table tbl_aux_stats_time
(
 iteration               number(3),
 iteration_type          number(1),
 start_time              timestamp,
 end_time                timestamp
)
nologging
noparallel
nocache;

步骤 6

现在我们可以运行以下脚本。

/**************************************************************************************
simulates 100 queries against our test dimensional model
**************************************************************************************/
declare
    start_time       timestamp;
    inf_value        number(5);
    sup_value        number(5);
begin
    execute immediate 'alter session set query_rewrite_enabled = true';
    execute immediate 'truncate table tbl_aux_stats_time';
    execute immediate 'truncate table tbl_aux_stats_city';
    execute immediate 'truncate table tbl_aux_stats_transport';
    for idxm in 1..2 loop
       if idxm = 1 then
           execute immediate 'alter materialized view fac_trip_agg_city_cost disable query rewrite';
           execute immediate 'alter materialized view fac_trip_agg_city_duration disable query rewrite';
           execute immediate 'alter materialized view fac_trip_agg_transp_cost disable query rewrite';
           execute immediate 'alter materialized view fac_trip_agg_transp_duration disable query rewrite';
           --execute immediate 'alter session set query_rewrite_enabled = false';
       else
           execute immediate 'alter materialized view fac_trip_agg_city_cost enable query rewrite';
           execute immediate 'alter materialized view fac_trip_agg_city_duration enable query rewrite';
           execute immediate 'alter materialized view fac_trip_agg_transp_cost enable query rewrite';
           execute immediate 'alter materialized view fac_trip_agg_transp_duration enable query rewrite';
       end if;

       for idx in 1..50 loop
           start_time := systimestamp;

           /*******************************************************
           metrics by city
           *******************************************************/
           inf_value := round(dbms_random.value(0, 5), 0);
           sup_value := round(dbms_random.value(inf_value, 5), 0);
           insert into tbl_aux_stats_city
           select idx as iteration,
                  idxm as iteration_type,
                  'Avg Cost' as metric_name,
                  b.city_name,
                  avg(a.val_trip_cost) as metric_value
           from fac_trip a,
                dim_city b
           where a.city_fk = b.city_sk and
                 b.city_sk between inf_value and sup_value
           group by b.city_sk,
                    b.city_name
           union all
           select idx as iteration,
                  idxm as iteration_type,
                  'Avg Duration' as metric_name,
                  b.city_name,
                  avg(a.val_trip_dur_hours) as metric_value
           from fac_trip a,
                dim_city b
           where a.city_fk = b.city_sk and
                 b.city_sk between inf_value and sup_value
           group by b.city_sk,
                    b.city_name;

           /*******************************************************
           metrics by transport type
           *******************************************************/
           insert into tbl_aux_stats_transport
           select idx as iteration,
                  idxm as iteration_type,
                  'Avg Cost' as metric_name,
                  b.transport_name,
                  avg(a.val_trip_cost) as metric_value
           from fac_trip a,
                dim_transport b
           where a.transport_fk = b.transport_sk and
                 b.transport_sk between inf_value and sup_value
           group by b.transport_sk,
                    b.transport_name
           union all
           select idx as iteration,
                  idxm as iteration_type,
                  'Avg Duration' as metric_name,
                  b.transport_name,
                  avg(a.val_trip_dur_hours) as metric_value
           from fac_trip a,
                dim_transport b
           where a.transport_fk = b.transport_sk and
                 b.transport_sk between inf_value and sup_value
           group by b.transport_sk,
                    b.transport_name;

           /*******************************************************/
           insert into tbl_aux_stats_time
           values (idx, idxm, start_time, systimestamp);
           commit;
       end loop;
    end loop;
end;
/

脚本运行完成后（可能需要一段时间），您将在表 tbl_aux_stats_time 中获得每次迭代的开始和结束时间，包括使用物化视图和不使用物化视图的情况（迭代类型 1 不使用物化视图，而类型 2 使用它们）。下面的查询将提供每次迭代花费的时间（以秒为单位）以及每个时期的累积时间（请注意，对于同一时期，脚本不运行相同的查询）。

with tbl_aux as (select a.iteration,
                          a.iteration_type,
                          ((a.end_time + 0) - (a.start_time + 0)) * 24 * 60 * 60 as elapsed_time_seconds
                from  migra_arf.tbl_aux_stats_time a),
    tbl_aux_2 as (select a.iteration,
                         max(case when a.iteration_type = 1 then a.elapsed_time_seconds else null end) as elapsed_time_seconds_1,
                         max(case when a.iteration_type = 2 then a.elapsed_time_seconds else null end) as elapsed_time_seconds_2
                  from tbl_aux a
                  group by a.iteration)
select o.iteration,
      o.elapsed_time_seconds_1,
      sum(o.elapsed_time_seconds_1) over (order by o.iteration rows unbounded preceding) as elapsed_cum_time_seconds_1,
      o.elapsed_time_seconds_2,
      sum(o.elapsed_time_seconds_2) over (order by o.iteration rows unbounded preceding) as elapsed_cum_time_seconds_2
from tbl_aux_2 o
order by 1

查询结果如下所示。

将结果可视化为图表，我们可以观察到（在线性刻度上）两个累积计时之间的差异在增加。两个累积函数的差距预计会增加，因为平均而言，不使用物化视图运行查询的时间应该大约是使用物化视图运行查询时间的 three 倍。

使用对数刻度，单个时期计时之间的差异变得更加明显（尽管我们没有在同一时期比较相同的查询）。针对物化视图执行的查询通常比针对主表执行的查询花费的时间少得多。

很容易理解，随着迭代的增加，两个累积指标之间的差异也会增加。这似乎是物化视图在长期来看具有优势的经验证据。两条线之间的差异代表 CPU 成本，代表用户等待数据的时长，代表 CEO 等待重要决策支持报告的时长，或者代表在某些金融产品上快速进行投资决策的可能性。这种差异伴随着成本，因此利用这些特性是有意义的，因为通常在 ETL 过程中刷新物化视图所需的时间与如果大量报告请求针对数据库发出时所获得的收益相比，可以忽略不计。

为了帮助读者更好地理解这些优势，我在接下来的图片中展示了没有物化视图和使用物化视图时的查询执行计划。如果您注意步骤顺序，您会发现在第一张图中，Oracle 在步骤 3 和 10 中对表 FAC_TRIP 执行了全表扫描，之后，它需要对提取的数据进行分组然后排序。

如果物化视图包含所需的预计算数据并可以替代主表，那么这些步骤就不应该需要了。请查看第二张图片。

在下面的图片中，我们可以确认，通过为所有物化视图启用查询重写，显然我们会得到一个更简单的执行计划，因此数据库将回复得更快。

总而言之，使用物化视图似乎有其优势，至少在决策支持系统和 ETL 流程中是这样，因此报告响应时间可以减少和优化。

再次强调，应该进行测试，并且根据基础数据集的大小，开发人员应仔细评估刷新现有物化视图所需的时间。这些都不是教条，因此可以随意质疑它们，并始终为您的需求寻找最佳解决方案。