Sunday, August 19, 2012

Identify the Long Running Transaction in Streams

We had a performance issue caused by a data scrub session that was executed without a commit.
The session was abruptly closed by the developer, leaving it hung, and we killed it after confirming it was the culprit. Unluckily, we killed the session only after 12 hours, which hurt our Streams latency: the update had run for more than 12 hours on the source before we killed it, so it would obviously run for a long time on the target as well before being rolled back. That meant the latency gap would keep growing, and with this being the Christmas holiday, luckily nobody complained about it. Below is the solution we followed, although we had our doubts, since we are amateurs with the Streams process.
Below is the info from the alert log:
——————————————–
Sun Dec 27 20:45:16 2009
C001: large txn detected (17372231 LCRs), xid: 0x0045.01e.0012f30b

Convert 0045.01e.0012f30b from hexadecimal to decimal in three parts; in this case:
0045     = 69
01e      = 30
0012f30b = 1241867
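If you prefer, the same conversion can be done in SQL with TO_NUMBER and a hex format mask (a quick sketch using the values from our alert log entry):
select to_number('0045','XXXX') xidusn,
       to_number('01E','XXX') xidslt,
       to_number('0012F30B','XXXXXXXX') xidsqn
from dual;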

When Capture mines a long transaction, it logs the transaction details into the v$streams_transaction view, and the capture process also writes the XID info into alertSID.log.
Now execute the following SQL:
set lines 300 pages 3000
select * from V$STREAMS_TRANSACTION;

From the output above, look at the values of the columns XIDUSN, XIDSLT, and XIDSQN.
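For example, filtering on the decimal values we calculated above (a sketch; substitute your own XID values):
select * from V$STREAMS_TRANSACTION
where xidusn = 69 and xidslt = 30 and xidsqn = 1241867;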

Sometimes, Streams may take time to mine a transaction and may not log any info about long transaction in alert.log OR in v$streams_transaction view. The v$logmnr_transaction view is really helpful and will also provide the transaction details being mined by capture’s logminer process.
Now execute the following SQL:
select xid, xidusn, xidslt, xidsqn, mining_status from v$logmnr_transaction where mining_status = 'ACTIVE';
To pick the correct xid, take the row where these three columns (xidusn, xidslt, xidsqn) match the decimal values calculated above.
e.g. 45001E000BF31200, 69, 30, 1241867, PINNED
69.30.1241867 is the transaction ID; we will use this later.
Find the sql_id of the long-running transaction using the SQL below, substituting the xid value:
select distinct sql_id from v$active_session_history where xid = '45001E000BF31200';
Find the exact SQL using the sql_id we got from the last step:
set long 10000
select sql_text from v$sql where sql_id = '76n6dpqstc5ax';
Note: The view v$active_session_history contains recent session activity. If the transaction is recent enough, it will have been captured there (provided the SQL took at least a couple of seconds to run). If the above returns no rows, we can use the SQL below instead:
select distinct session_id, session_serial#, sql_id from DBA_HIST_ACTIVE_SESS_HISTORY where xid = '45001E000BF31200';
We can also cross-verify that it is the correct transaction if we happen to remember the session ID and serial# that we killed on the source. We happened to remember them, which reassured us.
Find the exact SQL using the sql_id we got from the last step:
select sql_text from dba_hist_sqltext where sql_id = '76n6dpqstc5ax';
SQL_TEXT
——————————————————————————–
update usr.tbl set col1 = 'val', col2 = sysdate, col3 = 'value', col4 = sysdate+1

Now we have the SQL, sql_id, xid, and transaction ID. We know for sure that this transaction did not complete, since we killed it earlier on the source; so why allow it to run on the target, only to be rolled back after roughly the same time it took on the source? We would be wasting time by letting it run and might hit other issues.
Hence we simply made the capture process ignore the transaction.
On the capture side, we did the following:
prompt
prompt ++ Current Long Running Transactions ++
prompt Current transactions open for more than 20 minutes
prompt
col runlength HEAD 'Txn Open|Minutes' format 9999.99
col sid HEAD 'Session' format a13
col xid HEAD 'Transaction|ID' format a18
col terminal HEAD 'Terminal' format a10
col program HEAD 'Program' format a27 wrap
select t.inst_id, sid||','||serial# sid, xidusn||'.'||xidslot||'.'||xidsqn xid, (sysdate - start_date)*1440
runlength, terminal, program from gv$transaction t, gv$session s where t.addr = s.taddr
and (sysdate - start_date) * 1440 > 20;
The above will return rows only if the transaction is recent; in our case it was old, but since we already knew the transaction ID we went straight to the next step. Remember the values we got from the columns (xidusn, xidslt, xidsqn).
To ignore the transaction at capture, use the following steps on the source site:
execute dbms_capture_adm.stop_capture('CAPTURE_NAME');
execute dbms_capture_adm.set_parameter('CAPTURE_NAME','_ignore_transaction','69.30.1241867');
execute dbms_capture_adm.start_capture('CAPTURE_NAME');


Troubleshooting Streams Capture when status is Paused For Flow Control
Definition
This warning message typically occurs when the target site is unable to keep up with the rate of messages flowing from the source site. In 10g, flow control is automatically enabled to reduce the rate at which events are enqueued into the staging area. But this message can also indicate other problems, which are covered in this note.
If "Streams AQ: enqueue blocked due to flow control" occurs, it simply means that we cannot queue messages until the components that have fallen behind catch up. As long as you see capture moving in and out of the FLOW CONTROL status, there should not be any problem.
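A simple way to keep an eye on this is to poll the capture state every few minutes (a sketch; the STATE_CHANGED_TIME column is available from 10.2 onwards, drop it on older releases):
-- run this repeatedly; the capture should move in and out of flow control
select capture_name, state, state_changed_time from V$STREAMS_CAPTURE;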

Where and in which views to check the symptoms
a) The healthcheck script will show the capture status as ENABLED and the capture statistics as PAUSED FOR FLOW
CONTROL. Example from the SOURCE Healthcheck Script:
Capture statistics:
C001 CAPUSR_CAPTURE 03:32:25 09/13/08 193 404 PAUSED FOR FLOW CONTROL
or from
SELECT CAPTURE_NAME, STATE FROM V$STREAMS_CAPTURE;
CAPTURE_NAME STATE
————– ————————
STREAM_CAPTURE PAUSED FOR FLOW CONTROL
b) V$BUFFERED_PUBLISHERS may show TOO MANY UNBROWSED MESSAGES
Example from the SOURCE Healthcheck Script (HC):
++ BUFFERED PUBLISHERS ++
Queue Name: STREAMS_QUEUE
SENDER_NAME: STREAM_CAPTURE
Current Number of Msgs in Queue: 15521
MEMORY_USAGE: 15
UNBROWSED_MSGS: 5001
PUBLISHER_STATE: IN FLOW CONTROL: TOO MANY UNBROWSED MESSAGES
or from Flow Control Query:
COLUMN queue_schema FORMAT A12
COLUMN queue_name FORMAT A20
SELECT queue_schema, queue_name, unbrowsed_msgs, overspilled_msgs, memory_usage, publisher_state
FROM V$BUFFERED_PUBLISHERS;
NOTE: The Flow Control query will show:
* the queue owner and name
* the number of messages that have been enqueued, but not browsed
* the number of messages that have been spilled, but not browsed
* the percentage of the streams pool that is being used (or 0 if there is no streams pool)
* the state of the publisher. Possible values are:
- PUBLISHING MESSAGES (normal)
- IN FLOW CONTROL: TOO MANY UNBROWSED MESSAGES
- IN FLOW CONTROL: OVERSPILLED MESSAGES
- IN FLOW CONTROL: INSUFFICIENT MEMORY AND UNBROWSED MESSAGES
This last column will really help to determine why flow control is being enabled for a system and what can
be done to resolve the matter.

c) The propagation schedule may throw an ORA-25307 error
Example from the SOURCE Healthcheck Script (HC):
++ SCHEDULE FOR EACH PROPAGATION++
Propagation Name: STREAM_PROPAGATE
Status : Enabled
Error Message: ORA-25307: Enqueue rate too high, flow control enabled
The capture process is unable to enqueue LCRs either because of low memory or because the propagation and apply processes are consuming messages more slowly than the capture process is creating them. Common causes are:
1) INIT.ORA parameter problem
Inadequate memory, especially when STREAMS_POOL_SIZE is small or not set
2) The AQ_TM_PROCESSES parameter is set too high in 9i and 10g
In 10g, unset AQ_TM_PROCESSES; the queue monitor (QMON) is configured automatically. For more on this
parameter see Note 305662.1 - Queue Monitor Process: Architecture and Known Issues
3) Capture parameter settings have not been configured properly.
4) Publisher has enqueued a lot more messages than have been browsed.
(>1000 unbrowsed messages)
5) High latency
High latency can be due to long-running transactions, many dependent transactions, or slow capture, propagation, or apply processes. In some situations, propagation may become disabled (if the number of failures reaches 16). Also check propagation latency from dba_queue_schedules, as shown below.
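For example (a sketch of a dba_queue_schedules check; pick the columns you care about):
select schema, qname, destination, latency, failures, last_error_msg
from dba_queue_schedules;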
6) Check Apply latency
select (apply_time-applied_message_create_time)*86400 "latency_in_seconds",
(sysdate-apply_time)*86400 "seconds_since_last_apply",
to_char(applied_message_create_time,'hh24:mi:ss mm/dd/yy') "event_creation",
to_char(apply_time,'hh24:mi:ss mm/dd/yy') "apply_time"
from dba_apply_progress;
7) Long running Transactions
Check whether there are long-running transaction entries in the alert log file; they look something like:
C002: long running txn detected, xid: 0x0008.010.00084261
8) Some known bugs
a) Bug 7206332 PROPAGATION NOT HAPPENING - ASSOCIATED J00X PROCESSES STUCK (Suspended, Req'd Info not Available)
b) Bug 6054573 ENQUEUE: MEMORY FLOW CONTROL ; CAPTURE PERFORMANCE DEGRADATION ON CDC
Closed as a duplicate of Unpublished Bug 5093060 STREAMS: 5000 LCR LIMIT IS CAUSING UNNECESSARY FLOW CONTROL AT APPLY SITE
Fixed-Releases: A204 B106
c) Streams pool memory accounting can underflow, leading to capture staying in the 'PAUSED FOR FLOW CONTROL' state indefinitely.
If you see capture in 'PAUSED FOR FLOW CONTROL' even in a low-workload environment, and x$knlasg has fields such as memory_allocated_knlasg that are very large, you may have hit this issue. The information in x$knlasg is recorded in the Streams healthcheck section entitled ++ Streams Pool memory Information ++ as per Note 273674.1 for 10.2 onwards. The cause has been identified and fixed in Bug 7719668 STREAMS MEMORY ACCOUNTING INCORRECT, LEADING TO CAPTURE 'PAUSED IN FLOW CONTROL'. Fixed in version 11.2.
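To check whether you are in this situation, the fixed table can be inspected directly as SYS (a sketch; x$ tables are internal and version-dependent):
-- look for implausibly large memory accounting figures
select * from x$knlasg;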
9) The proper patches for the specific database version are not installed.
10) One-off case: when the capture process was recreated, one customer did not recreate the
associated queues and queue tables, which retained the messages from the previous capture process.
This caused CAPTURE to go into FLOW CONTROL.
11) In a scenario with multiple destinations we can get the 'pause' message when one destination is down. This is the correct behaviour. There are two common reasons for source-queue growth:
Suppose a source queue is propagating captured messages to multiple destination queues, and one or more destination databases acknowledge successful propagation of messages much more slowly than the other queues. In this case, the source queue can grow because the slower destination databases create a backlog of messages that have already been acknowledged by the faster destination databases. In such an environment, consider creating more than one capture process to capture changes at the source database. Doing so lets you use one source queue for the slower destination databases and another source queue for the faster destination databases.
Solution
Diagnostics and Resolution
1) INIT.ora changes
Increase STREAMS_POOL_SIZE on both the capture and apply sites. The recommendation is a value of at least 200MB. For example:
CONNECT / AS SYSDBA
ALTER SYSTEM SET STREAMS_POOL_SIZE=500M;
This will re-enable the capture from the 'PAUSED FOR FLOW CONTROL' status.
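Before and after resizing, it can help to see how much of the Streams pool is actually in use (a sketch using V$SGASTAT):
select pool, name, round(bytes/1024/1024) mb
from v$sgastat
where pool = 'streams pool'
order by bytes desc;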
2) Unset AQ_TM_PROCESSES

From 10g onwards, this parameter should be unset on both sites.
ALTER SYSTEM RESET AQ_TM_PROCESSES SCOPE=SPFILE SID='*';
A high AQ_TM_PROCESSES value can prevent acknowledgments from being received from the apply site.
3) Follow Section 5 "Capture Process Configuration" of Note 298877.1 - 10gR1 Streams Recommended Configuration
Example at SOURCE site:
execute dbms_capture_adm.set_parameter('capture_name','_CHECKPOINT_FREQUENCY','100');
execute dbms_capture_adm.set_parameter('capture_name','_SGA_SIZE','50');
Restart (stop and start) the apply and then the capture.
If you have a lot of memory to work with, you could even set _SGA_SIZE to something like 200.
4) If Status is IN FLOW CONTROL: TOO MANY UNBROWSED MESSAGES
This situation could be a propagation problem
a) Stop capture that is in Flow Control via DBMS_CAPTURE_ADM.STOP_CAPTURE
b) Determine the J process associated with propagation via DBA_QUEUE_SCHEDULES, ie:
select schema, qname, destination, schedule_disabled, process_name from dba_queue_schedules;
c) Stop propagation via DBMS_PROPAGATION_ADM.STOP_PROPAGATION (Set FORCE option to TRUE if needed)
d) Check DBA_QUEUE_SCHEDULES / DBA_PROPAGATION to make sure the propagation has been stopped
e) Wait a few minutes and make sure that the J process associated with the stopped propagation is gone. If not, use kill -9 to kill the process at the OS level.
f) Set JOB_QUEUE_PROCESSES to 0, wait a few minutes, make sure the CJQ process and other J processes are gone (at the OS level), then set JOB_QUEUE_PROCESSES back to its original value.
g) Use DBMS_PROPAGATION_ADM.START_PROPAGATION to start Propagation
h) Check DBA_QUEUE_SCHEDULES / DBA_PROPAGATION to make sure the propagation has been started and it has a J process associated.
i) Start CAPTURE via DBMS_CAPTURE_ADM.START_CAPTURE
j) Check the DBA_QUEUE_SCHEDULES columns TOTAL_NUMBER, TOTAL_BYTES, etc.
If propagation is happening, the numbers will be changing.
k) At SOURCE and TARGET databases, check V$BUFFERED_QUEUES.
l) In case propagation once more seems to hang and is not propagating LCRs, find the associated J process and upload the following to Support along with the HC scripts:
- errorstacks at 3-minute intervals
- 10046 and 24040 propagation level 10 traces
From SQL*Plus, as a SYSDBA user:
spool &spoolfilename
set echo on
set pagesize 1000
column destination format a25
column session_id format a10
column destination_dblink format a25
column source_queue_owner format a20
column source_queue_name format a25
select schema,qname,destination,latency,process_name,session_id,instance,message_delivery_mode
from dba_queue_schedules order by message_delivery_mode;
SELECT source_queue_owner,source_queue_name,propagation_name,
destination_dblink,queue_to_queue,status from dba_propagation order by propagation_name;
oradebug setospid <ospid_of_hung_propagation_job>
oradebug unlimit
oradebug dump errorstack 10
<< wait one minute >>
oradebug dump errorstack 10
<< wait one minute >>
oradebug dump errorstack 10
oradebug tracefile_name
oradebug Event 10046 trace name context forever, level 12
oradebug Event 24040 trace name context forever, level 10
<< wait 10-15 minutes >>
oradebug dump errorstack 10
oradebug short_stack
-- repeat the short_stack command 10-15 times in rapid succession
oradebug dump errorstack 10
oradebug Event 24040 trace name context off
oradebug Event 10046 trace name context off
exit;
If the state is IN FLOW CONTROL: TOO MANY UNBROWSED MESSAGES, then you can run the following query to determine which subscriber is not dequeuing messages or is lagging behind on the queue:
SELECT subscriber_name, cnum_msgs, total_dequeued_msg, total_spilled_msg
FROM V$BUFFERED_SUBSCRIBERS;
A comparison of the total number of messages enqueued for the subscriber (CNUM_MSGS)
and the number of messages dequeued by that subscriber (TOTAL_DEQUEUED_MSG) should indicate which subscriber has fallen behind.
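A rough sketch that computes the backlog per subscriber directly:
select subscriber_name,
       cnum_msgs - total_dequeued_msg backlog,
       total_spilled_msg
from V$BUFFERED_SUBSCRIBERS
order by backlog desc;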
5) Resolve latency of Propagation

If the propagation is disabled (broken), re-enable it manually.
To reduce the latency of propagation between jobs, set the hidden parameter _job_queue_interval = 1. This should be done as an init.ora or spfile parameter. Because it is a hidden parameter, the database must be restarted for the value to take effect. The default for _job_queue_interval is 5 seconds.
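For example (a sketch; as with any hidden parameter, set it only under the guidance of Oracle Support):
connect / as sysdba
alter system set "_job_queue_interval"=1 scope=spfile;
-- restart the database for the new value to take effect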
In some situations, propagation becomes disabled (if the number of failures reaches 16); re-enable it manually. For example, if the queue name is STREAMS_QUEUE owned by STRMADMIN and the destination database link is ORCL2.WORLD, the syntax would be:
execute dbms_aqadm.disable_propagation_schedule ('STRMADMIN.STREAMS_QUEUE','ORCL2.WORLD');
execute dbms_aqadm.enable_propagation_schedule ('STRMADMIN.STREAMS_QUEUE','ORCL2.WORLD');
execute dbms_aqadm.alter_propagation_schedule ('STRMADMIN.STREAMS_QUEUE','ORCL2.WORLD',latency=>5);
Propagation and the capture process will be resumed automatically when the target site is able to accept more messages.
6) Check Apply parameters

Set the appropriate parameters for the apply process. For 10.2, set the following apply parameters.
First disable the capture process and the apply process, then:
execute DBMS_APPLY_ADM.set_parameter('applyName','parallelism','4');
execute DBMS_APPLY_ADM.set_parameter('applyName','_dynamic_stmts','Y');
execute DBMS_APPLY_ADM.set_parameter('applyName','_hash_table_size','1000000');
execute DBMS_APPLY_ADM.set_parameter('applyName','disable_on_error','N');
execute DBMS_APPLY_ADM.set_parameter('applyName','_txn_buffer_size','40');  -- i.e. 10 x parallelism
NOTE:315666.1 – How To Set The _TXN_BUFFER_SIZE Parameter For a Streams Apply
SELECT APPLY_NAME,PARAMETER,VALUE FROM ALL_APPLY_PARAMETERS;
Then enable the apply process followed by the capture process.

7) Check for long running Transactions:
Identify the transaction ID for the long-running transaction using the following query, and ignore the
transaction if it causes high latency:
prompt
prompt ++ Current Long Running Transactions ++
prompt Current transactions open for more than 20 minutes
prompt
col runlength HEAD 'Txn Open|Minutes' format 9999.99
col sid HEAD 'Session' format a13
col xid HEAD 'Transaction|ID' format a18
col terminal HEAD 'Terminal' format a10
col program HEAD 'Program' format a27 wrap
select t.inst_id, sid||','||serial# sid,xidusn||'.'||xidslot||'.'||xidsqn xid, (sysdate - start_date )* 1440
runlength ,terminal,program from gv$transaction t, gv$session s where t.addr=s.taddr
and (sysdate - start_date) * 1440 > 20;
To ignore the transaction at capture, use the following steps on the source site:
execute dbms_capture_adm.stop_capture('CAPTURE_NAME');
execute dbms_capture_adm.set_parameter('CAPTURE_NAME','_ignore_transaction','TXN_ID');
execute dbms_capture_adm.start_capture('CAPTURE_NAME');
8) Solutions for some known bugs
ORA-25307: Enqueue rate too high, flow control enabled is reported after enqueueing 5000
user-enqueued buffered messages.
By default, flow control kicks in for user-enqueued events when there are 5000 unbrowsed messages in a
queue. The default behaviour can be changed in 10.2 by applying Patch 5093060, which allows
you to control when buffered publisher flow control is activated.
The fix is event based in 10.2.0.3 and it introduces two new events:
a) 10867: This event controls the flow control threshold value for any buffered message publisher.
The level of the event decides the threshold value.
b) 10868: This event controls the flow control threshold value for only the capture publisher (the Capture process). The level of the event decides the threshold value.
When either of the above events is not set, or is reset using the "context off" command, the threshold values
fall back to the defaults, which are:
1) Buffered publisher : 5000
2) Capture publisher : 15000
The fix for unpublished bug 5093060 is present in 10.2.0.4 onwards and can be activated by setting the
following hidden parameters:
alter system set "_capture_publisher_flow_control_threshold"=10000 scope=memory;
alter system set "_buffered_publisher_flow_control_threshold"=10000 scope=memory;
From 10.2.0.1 to 10.2.0.3, enable the flow control fix for user-enqueued messages.
connect / as sysdba
--Setting the value dynamically
alter system set events '10867 trace name context forever, level 15000';
--Setting back to the default 5000
alter system set events '10867 trace name context off';
--Setting in the spfile
alter system set event='10867 trace name context forever, level 15000' scope=spfile;
--Unsetting in the spfile
alter system reset event scope=spfile sid='*';
--Setting in a pfile
event='10867 trace name context forever, level 15000'
Setting the event in the spfile or a pfile requires a bounce of the database for it to take effect.