Monday, December 7, 2015

Oracle bigfile tablespace pre-allocation and session blocking

Oracle 11g introduced the new feature tablespace extent pre-allocation, and described as:

SMCO decides to expand the tablespace based on history, extension is split evenly across all datafiles in the tablespace which have not reached their maxsize and are still limited to 10% of the full tablespace size in one hourly SMCO wakeup.

With 11g tbs pre-extension feature, more extents are required to meet the free space projection requirement. The system will does extension in multiple of increments (NEXT), not just keeping free space within one increment.

and provided the workaround:

 1. User can set hidden parameter "_enable_space_preallocation" to 0 (default value 3) to turn off the tbs pre-extension feature.
   alter system set "_enable_space_preallocation"=0;

 2. User can set hidden parameter "_kttext_warning" to specify the percentage we'd like to pre-allocate. By default, it is 5%.
   alter system set "_kttext_warning"=5;

See MOS Notes:
   SMCO (Space Management Coordinator) For Autoextend On Datafiles And How To Disable/Enable (Doc ID 743773.1)
   AUTOEXTEND Grows To Full Size Without Reason (Doc ID 1459097.1)
   Master Note: Overview of Oracle Segment Storage (Doc ID 1491960.1)

A bigfile tablespace contains only one datafile. When a tablespace of 2T requiring a pre-allocation of 5% in default, it is 100G. On a system with disk write of 100M/S, it will take 1000 seconds to terminate this pre-allocation. That is probably why the hidden parameter is prepared to turn off the new feature.

Test


Run test with default setting (see appended Test Code):
 ALTER SYSTEM SET "_ENABLE_SPACE_PREALLOCATION" = 3; 
 exec test_run(5, 2, 1);

then look the tablespace usage statistics by:

 select d.snap_id
       ,round((tablespace_usedsize*8192/1024/1024/1024),2) used_size_gb
       ,round((tablespace_size*8192/1024/1024/1024),2)     size_gb
       ,round((tablespace_maxsize*8192/1024/1024/1024),2)  max_size_gb
       ,round(100*(tablespace_size - tablespace_usedsize)/tablespace_size, 2) percentage
 from dba_hist_tbspc_space_usage d, v$tablespace t
 where d.tablespace_id = t.ts# and t.name in ('TBS')
 order by d.rtime;

the output:

SNAP_ID USED_SIZE_GB SIZE_GB MAX_SIZE_GB PERCENTAGE
12021.1921.951003.48
12122.1923.121004.01
12223.1924.411005.02
12324.1924.691002.02
12425.1326.051003.56

The percetage is (almost) under 5%.

Verify again segment size with unix file size:

 select segment_name, sum(blocks) blocks, round(sum(bytes)/1024/1024/1024, 2) gb
       ,round(27976015872/1024/1024/1024, 2)               unix_gb
       ,round(100*(27976015872-sum(bytes))/27976015872, 2) percetage
   from dba_segments
  where segment_name in ('TESTT') group by segment_name;
 
 where 27976015872 is unix file size (ls -l tbs.dbf)

SEGMENT_NAME BLOCKS GB UNIX_GB PERCENTAGE
TESTT325632024.8426.054.65

We can see SIZE_GB(tablespace_size) of 26.05 matches well with UNIX_GB (unix file size).

Test again with pre-allocation off:
 ALTER SYSTEM SET "_ENABLE_SPACE_PREALLOCATION" = 0;  
 exec test_run(5, 2, 1);

and collect the tablespace usage statistics. The output:

SNAP_ID USED_SIZE_GB SIZE_GB MAX_SIZE_GB PERCENTAGE
13231.1931.251000.20
13332.2532.261000.02
13433.2533.281000.09
13534.3434.381000.09
13635.3235.361000.11

The percetage is under 1%.

Verify again segment size with unix file size (37968945152):

 select segment_name, sum(blocks) blocks, round(sum(bytes)/1024/1024/1024, 2) gb
       ,round(37968945152/1024/1024/1024, 2)               unix_gb
       ,round(100*(37968945152-sum(bytes))/37968945152, 2) percetage
   from dba_segments
  where segment_name in ('TESTT') group by segment_name;

SEGMENT_NAME BLOCKS GB UNIX_GB PERCENTAGE
TESTT456396834.8235.361.53

With following query, one can also estimate the Disk Write Throughput:

select count(*)                                                     cnt
      ,round(sum(delta_write_io_bytes)/(1024*1024*1024), 2)         write_gb
      ,round(sum(delta_interconnect_io_bytes)/(1024*1024*1024), 2)  write_int_gb
      ,count(*)                                                     elapsed_seconds
      ,round(sum(delta_write_io_bytes)/(1024*1024)/(count(*)*1), 2) disk_write_MB_per_second
from v$active_session_history t
where event in 'Data file init write';

The same behavior can be reproduced by simply resize datafile to a bigger size,
for instance, from current 60G to 100G:
 ALTER DATABASE DATAFILE '/home/oradata/testdb/tbs.dbf' RESIZE 100G;
and monitor file size by:
 select * from dba_data_files where file_name = '/home/oradata/testdb/tbs.dbf';

Wait Events


There are 3 blocking wait events observed among Wnnn, DBWn, and user sessions.
1. Data file init write
Wnnn or user sessions wait on this event when allocating space in datafile. It is never blocked.
When "_ENABLE_SPACE_PREALLOCATION" = 3, both Wnnn and user sessions can wait on such event.
When "_ENABLE_SPACE_PREALLOCATION" = 0, no Wnnn active; and only user sessions can wait on such event.

2. local write wait  
Wnnn or user sessions wait on this event when writing "bitmap block" (CLASS# 11). The blocker is DBWn.

3. buffer busy waits
user sessions wait on this event when reading file header block (CLASS# 13, BLOCK# 2).
The blocker is Wnnn or other user session, which are waiting on "Data file init write".

It can also happen when one Wnnn reading file header block (CLASS# 13, BLOCK# 2).
The blocker is another Wnnn or other user session, which are waiting on "Data file init write".

The user session, which is waiting on Event "buffer busy waits" when reading file header block (CLASS# 13, BLOCK# 2), can in turn block other user sessions if they are performing DML (e.g. INSERT) on the same segment. These other user sessions are waiting on Event "enq: TX - contention".

For the application of Distributed Transaction, this can cause the Error:
    ORA-02049: timeout: distributed transaction waiting for lock
because over there there is a hard constraint: distributed_lock_timeout.
If "enq: TX - contention" wait time of Distributed Transaction is more than distributed_lock_timeout, the session throws ORA-02049.
(For ORA-02049, See Blog: Redo Practice)

During pre-allocation, Wnnn or user sessions are writing to datafile till its target size (by big chunks)with locking on its file header block. The consequence is that all access to the datafile by user sessions is blocked.

For Batch processing, it could improve the whole throughput; whereas on OLTP applications, user sessions could be frozen during entire allocation. Even turning off pre-allocation by "_ENABLE_SPACE_PREALLOCATION" = 0, the allocation by user sessions also blocks other sessions because file header block lock (event "Data file init write") is an exclusive lock and serialize all operations on it.

Dumping file header block shows that it records the file size and allocation details (all number in K), which probably explains why it is a single serializing point.

 File Space Header Block:
 Header Control:
 RelFno: 1024, Unit: 8, Size: 664320, Flag: 9
 AutoExtend: YES, Increment: 1280, MaxSize: 13107200
 Initial Area: 126, Tail: 664319, First: 82240, Free: 784
 Deallocation scn: 2440207268.2059
 Header Opcode:
 Save: No Pending Op
 End dump data blocks tsn: 861 file#: 859 minblk 2 maxblk 2


truss (AIX) on Wnnn when it is on "Data file init write":

  listio64(0x0000000010000004, 0x000000000FFFFFFF, 0x00000000FFFF6140, 0x0000000000000001, ...
  listio64(0x0000000010000004, 0x000000000FFFFFFF, 0x00000000FFFF6140, 0x0000000000000001, ...
  listio64(0x0000000010000004, 0x000000000FFFFFFF, 0x00000000FFFF6140, 0x0000000000000001, ...
  aio_nwait_timeout64(0x0000000000001000, 0x0000000000000003, ...) = 0x0000000000000003
  fsync(16)                        = 0
  statx("/home/oradata/testdb/tbs.dbf", 0x0FFFFFFFFFFF7D00, 176, 010) = 0
  kopen64x("/home/oradata/testdb/tbs.dbf", 0400400000200, 037777676400, , 1152921504606846944) = 17
  close(17)

Oracle has designated a dedicated event:
 db file single write
and said:
 This event is used to wait for the writing of the file headers.

but is only observed a single once when first time to create TABLESPACE with DATAFILE.

Other Test


Curious about the jump of "_ENABLE_SPACE_PREALLOCATION" from default 3 to 0, value 1 and 2 is also tested, which are more or less like 0.

Parallel to "_kttext_warning" for tablespace, there are also 4 hidden parameters for segment allocation:

NAME DES INST_DEF_VAL
_ktslj_segext_warning segment pre-extension warning threshold in percentage 10
_ktslj_segext_warning_mbsegment pre-extension warning threshold in MB 0
_ktslj_segext_max_mb segment pre-extension max size in MB (0: unlimited) 0
_ktslj_segext_retry segment pre-extension retry 5

Test Code


drop tablespace tbs including contents and datafiles cascade constraints;

create bigfile tablespace tbs datafile
  '/home/oradata/testdb/tbs.dbf' size 10M autoextend on next 10M maxsize 100G
logging online extent management local autoallocate blocksize 8k
segment space management auto flashback on;


drop table testt cascade constraints;

create table testt(id1 number, id2 number, id3 number, id4 number, text1 varchar2(3000), text2 varchar2(4000))
tablespace tbs storage (initial 16k next 128k);
       
create index testt_ix1 on testt (id1)
tablespace tbs storage (initial 16k next 128k); 

create index testt_ix2 on testt (id2)
tablespace tbs storage (initial 16k next 128k); 

create index testt_ix3 on testt (id3)
tablespace tbs storage (initial 16k next 128k); 

create index testt_ix4 on testt (id4)
tablespace tbs storage (initial 16k next 128k); 
                                          
create or replace procedure insert_testt (p_cnt number default 1) is
begin
  for i in 1..p_cnt loop
   insert into testt select level id1, level id2, level id3, level id4
                           ,rpad('ABC', 3000, 'K') text1, rpad('XYZ', 4000, 'K') text2
     from dual connect by level <= 65000;  -- about 0.5GB
   commit;
  end loop;
end;
/

create or replace procedure insert_testt_job (p_jobs number default 2, p_cnt number default 1)
is
  l_job_id       pls_integer;
  l_running_job_cnt number := 1;
  l_start_time    number := dbms_utility.get_time;
begin
  for i in 1..p_jobs loop
   dbms_job.submit(l_job_id, 'begin insert_testt(p_cnt=> '||p_cnt||'); end;');
  end loop;
  commit;
 
  while l_running_job_cnt > 0
  loop
    dbms_lock.sleep(1);
   select count(*) into l_running_job_cnt
     from dba_jobs
    where (job = l_job_id or what like 'begin insert_testt(p_cnt=>%')
      and rownum = 1;
  end loop;
  dbms_output.put_line('Elapsed Seconds: '|| round((dbms_utility.get_time - l_start_time)/100));
end;   
/

create or replace procedure test_run (p_run number default 5, p_jobs number default 2, p_cnt number default 1) is
begin
 sys.dbms_workload_repository.create_snapshot('ALL');
 for i in 1..p_run loop
  insert_testt_job(p_jobs, p_cnt);
  sys.dbms_workload_repository.create_snapshot('ALL');
 end loop;
end;
/