Sunday, August 19, 2012

Identify the Long Running Transaction in Streams

We had a performance issue caused by a data scrub session that was executed without a commit.
The session was abruptly closed by the developer, leaving it hung, and we killed it after confirming it was the culprit. Unluckily, we killed the session only after 12 hours, which hurt our Streams latency: the update had run for more than 12 hours on the source before we killed it, so it would obviously run for a long time on the target as well before being rolled back. That meant the latency gap would keep growing, and with this being the Christmas holiday, luckily nobody complained about it. Below is the solution we followed, although we had our doubts, since we are amateurs with the Streams process.
Below is the info from the alert log:
——————————————–
Sun Dec 27 20:45:16 2009
C001: large txn detected (17372231 LCRs), xid: 0x0045.01e.0012f30b

Convert 0045.01e.0012f30b from hexadecimal to decimal in three parts; in this case:
0045     = 69
01e      = 30
0012f30b = 1241867
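If you prefer, the same conversion can be done in SQL with TO_NUMBER and a hex format mask (a quick sketch using the values from our alert log entry):
select to_number('0045','XXXX') xidusn,
       to_number('01E','XXX') xidslt,
       to_number('0012F30B','XXXXXXXX') xidsqn
from dual;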

When Capture mines a long transaction, it logs the transaction details into the v$streams_transaction view, and the capture process also writes the XID info into alertSID.log.
Now execute the following SQL:
set lines 300 pages 3000
select * from V$STREAMS_TRANSACTION;

From the output above, look at the values of the columns XIDUSN, XIDSLT, and XIDSQN.
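For example, filtering on the decimal values we calculated above (a sketch; substitute your own XID values):
select * from V$STREAMS_TRANSACTION
where xidusn = 69 and xidslt = 30 and xidsqn = 1241867;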

Sometimes, Streams may take time to mine a transaction and may not log any info about long transaction in alert.log OR in v$streams_transaction view. The v$logmnr_transaction view is really helpful and will also provide the transaction details being mined by capture’s logminer process.
Now execute the following SQL:
select xid, xidusn, xidslt, xidsqn, mining_status from v$logmnr_transaction where mining_status = 'ACTIVE';
To pick the correct xid, take the row where these three columns (xidusn, xidslt, xidsqn) match the decimal values calculated above.
e.g. 45001E000BF31200, 69, 30, 1241867, PINNED
69.30.1241867 is the transaction ID; we will use this later.
Find the sql_id of the long-running transaction using the SQL below, substituting the xid value:
select distinct sql_id from v$active_session_history where xid = '45001E000BF31200';
Find the exact SQL using the sql_id we got from the last step:
set long 10000
select sql_text from v$sql where sql_id = '76n6dpqstc5ax';
Note: The view v$active_session_history contains recent session activity. If the transaction is recent enough, it will have been captured there (provided the SQL took at least a couple of seconds to run). If the above returns no rows, we can use the SQL below instead:
select distinct session_id, session_serial#, sql_id from DBA_HIST_ACTIVE_SESS_HISTORY where xid = '45001E000BF31200';
We can also cross-verify that it is the correct transaction if we happen to remember the session ID and serial# that we killed on the source. We happened to remember them, which reassured us.
Find the exact SQL using the sql_id we got from the last step:
select sql_text from dba_hist_sqltext where sql_id = '76n6dpqstc5ax';
SQL_TEXT
——————————————————————————–
update usr.tbl set col1 = 'val', col2 = sysdate, col3 = 'value', col4 = sysdate+1

Now we have the SQL, sql_id, xid, and transaction ID. We know for sure that this transaction did not complete, since we killed it earlier on the source; so why allow it to run on the target, only to be rolled back after roughly the same time it took on the source? We would be wasting time by letting it run and might hit other issues.
Hence we simply made the capture process ignore the transaction.
On the capture side, we did the following:
prompt
prompt ++ Current Long Running Transactions ++
prompt Current transactions open for more than 20 minutes
prompt
col runlength HEAD 'Txn Open|Minutes' format 9999.99
col sid HEAD 'Session' format a13
col xid HEAD 'Transaction|ID' format a18
col terminal HEAD 'Terminal' format a10
col program HEAD 'Program' format a27 wrap
select t.inst_id, sid||','||serial# sid, xidusn||'.'||xidslot||'.'||xidsqn xid, (sysdate - start_date)*1440
runlength, terminal, program from gv$transaction t, gv$session s where t.addr = s.taddr
and (sysdate - start_date) * 1440 > 20;
The above will return rows only if the transaction is recent; in our case it was old, but since we already knew the transaction ID we went straight to the next step. Remember the values we got from the columns (xidusn, xidslt, xidsqn).
To ignore the transaction at capture, use the following steps on the source site:
execute dbms_capture_adm.stop_capture('CAPTURE_NAME');
execute dbms_capture_adm.set_parameter('CAPTURE_NAME','_ignore_transaction','69.30.1241867');
execute dbms_capture_adm.start_capture('CAPTURE_NAME');


Troubleshooting Streams Capture when status is Paused For Flow Control
Definition
This warning message typically occurs when the target site is unable to keep up with the rate of messages flowing from the source site. In 10g, flow control is automatically enabled to reduce the rate at which events are enqueued into the staging area. But this message can also indicate other problems, which are covered in this note.
If "Streams AQ: enqueue blocked due to flow control" occurs, it simply means that we cannot queue messages until the components that have fallen behind catch up. As long as you see capture moving in and out of the FLOW CONTROL status, there should not be any problem.
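A simple way to keep an eye on this is to poll the capture state every few minutes (a sketch; the STATE_CHANGED_TIME column is available from 10.2 onwards, drop it on older releases):
-- run this repeatedly; the capture should move in and out of flow control
select capture_name, state, state_changed_time from V$STREAMS_CAPTURE;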

Where and in which views to check the symptoms
a) The healthcheck script will show the capture status as ENABLED and the capture statistics as PAUSED FOR FLOW
CONTROL. Example from the SOURCE Healthcheck Script:
Capture statistics:
C001 CAPUSR_CAPTURE 03:32:25 09/13/08 193 404 PAUSED FOR FLOW CONTROL
or from
SELECT CAPTURE_NAME, STATE FROM V$STREAMS_CAPTURE;
CAPTURE_NAME STATE
————– ————————
STREAM_CAPTURE PAUSED FOR FLOW CONTROL
b) V$BUFFERED_PUBLISHERS may show TOO MANY UNBROWSED MESSAGES
Example from the SOURCE Healthcheck Script (HC):
++ BUFFERED PUBLISHERS ++
Queue Name: STREAMS_QUEUE
SENDER_NAME: STREAM_CAPTURE
Current Number of Msgs in Queue: 15521
MEMORY_USAGE: 15
UNBROWSED_MSGS: 5001
PUBLISHER_STATE: IN FLOW CONTROL: TOO MANY UNBROWSED MESSAGES
or from Flow Control Query:
COLUMN queue_schema FORMAT A12
COLUMN queue_name FORMAT A20
SELECT queue_schema, queue_name, unbrowsed_msgs, overspilled_msgs, memory_usage, publisher_state
FROM V$BUFFERED_PUBLISHERS;
NOTE: The Flow Control query will show:
* the queue owner and name
* the number of messages that have been enqueued, but not browsed
* the number of messages that have been spilled, but not browsed
* the percentage of the streams pool that is being used (or 0 if there is no streams pool)
* the state of the publisher. Possible values are:
- PUBLISHING MESSAGES (normal)
- IN FLOW CONTROL: TOO MANY UNBROWSED MESSAGES
- IN FLOW CONTROL: OVERSPILLED MESSAGES
- IN FLOW CONTROL: INSUFFICIENT MEMORY AND UNBROWSED MESSAGES
This last column will really help to determine why flow control is being enabled for a system and what can
be done to resolve the matter.

c) The propagation schedule may throw an ORA-25307 error
Example from the SOURCE Healthcheck Script (HC):
++ SCHEDULE FOR EACH PROPAGATION++
Propagation Name: STREAM_PROPAGATE
Status : Enabled
Error Message: ORA-25307: Enqueue rate too high, flow control enabled
The capture process is unable to enqueue LCRs either because of low memory or because the propagation and apply processes are consuming messages more slowly than the capture process is creating them. Common causes are:
1) INIT.ORA parameter problem
Inadequate memory, especially when STREAMS_POOL_SIZE is small or not set
2) The AQ_TM_PROCESSES parameter is set too high in 9i and 10g
In 10g, unset AQ_TM_PROCESSES; the queue monitor (QMON) is configured automatically. For more on this
parameter see Note 305662.1 - Queue Monitor Process: Architecture and Known Issues
3) Capture parameter settings have not been configured properly.
4) Publisher has enqueued a lot more messages than have been browsed.
(>1000 unbrowsed messages)
5) High latency
High latency can be due to long-running transactions, many dependent transactions, or slow capture, propagation, or apply processes. In some situations, propagation may become disabled (if the number of failures reaches 16). Also check propagation latency from dba_queue_schedules, as shown below.
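For example (a sketch of a dba_queue_schedules check; pick the columns you care about):
select schema, qname, destination, latency, failures, last_error_msg
from dba_queue_schedules;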
6) Check Apply latency
select (apply_time-applied_message_create_time)*86400 "latency_in_seconds",
(sysdate-apply_time)*86400 "seconds_since_last_apply",
to_char(applied_message_create_time,'hh24:mi:ss mm/dd/yy') "event_creation",
to_char(apply_time,'hh24:mi:ss mm/dd/yy') "apply_time"
from dba_apply_progress;
7) Long running Transactions
Check whether there are long-running transaction entries in the alert log file; they look something like:
C002: long running txn detected, xid: 0x0008.010.00084261
8) Some known bugs
a) Bug 7206332 PROPAGATION NOT HAPPENING - ASSOCIATED J00X PROCESSES STUCK (Suspended, Req'd Info not Available)
b) Bug 6054573 ENQUEUE: MEMORY FLOW CONTROL ; CAPTURE PERFORMANCE DEGRADATION ON CDC
Closed as a duplicate of Unpublished Bug 5093060 STREAMS: 5000 LCR LIMIT IS CAUSING UNNECESSARY FLOW CONTROL AT APPLY SITE
Fixed-Releases: A204 B106
c) Streams pool memory accounting can underflow, leading to capture staying in the 'PAUSED FOR FLOW CONTROL' state indefinitely.
If you see capture in 'PAUSED FOR FLOW CONTROL' even in a low-workload environment, and x$knlasg has fields such as memory_allocated_knlasg that are very large, you may have hit this issue. The information in x$knlasg is recorded in the Streams healthcheck section entitled ++ Streams Pool memory Information ++ as per Note 273674.1 for 10.2 onwards. The cause has been identified and fixed in Bug 7719668 STREAMS MEMORY ACCOUNTING INCORRECT, LEADING TO CAPTURE 'PAUSED IN FLOW CONTROL'. Fixed in version 11.2.
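To check whether you are in this situation, the fixed table can be inspected directly as SYS (a sketch; x$ tables are internal and version-dependent):
-- look for implausibly large memory accounting figures
select * from x$knlasg;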
9) The proper patches for the specific database version are not installed.
10) One-off case: when the capture process was recreated, one customer did not recreate the
associated queues and queue tables, which retained the messages from the previous capture process.
This caused CAPTURE to go into FLOW CONTROL.
11) In a scenario with multiple destinations we can get the 'pause' message when one destination is down. This is the correct behaviour. There are two common reasons for source-queue growth:
Suppose a source queue is propagating captured messages to multiple destination queues, and one or more destination databases acknowledge successful propagation of messages much more slowly than the other queues. In this case, the source queue can grow because the slower destination databases create a backlog of messages that have already been acknowledged by the faster destination databases. In such an environment, consider creating more than one capture process to capture changes at the source database. Doing so lets you use one source queue for the slower destination databases and another source queue for the faster destination databases.
Solution
Diagnostics and Resolution
1) INIT.ora changes
Increase STREAMS_POOL_SIZE on both the capture and apply sites. The recommendation is a value of at least 200MB. For example:
CONNECT / AS SYSDBA
ALTER SYSTEM SET STREAMS_POOL_SIZE=500M;
This will re-enable the capture from the 'PAUSED FOR FLOW CONTROL' status.
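Before and after resizing, it can help to see how much of the Streams pool is actually in use (a sketch using V$SGASTAT):
select pool, name, round(bytes/1024/1024) mb
from v$sgastat
where pool = 'streams pool'
order by bytes desc;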
2) Unset AQ_TM_PROCESSES

From 10g onwards, this parameter should be unset on both sites.
ALTER SYSTEM RESET AQ_TM_PROCESSES SCOPE=SPFILE SID='*';
A high AQ_TM_PROCESSES value can prevent acknowledgments from being received from the apply site.
3) Follow Section 5 "Capture Process Configuration" of Note 298877.1 - 10gR1 Streams Recommended Configuration
Example at SOURCE site:
execute dbms_capture_adm.set_parameter('capture_name','_CHECKPOINT_FREQUENCY','100');
execute dbms_capture_adm.set_parameter('capture_name','_SGA_SIZE','50');
Restart (stop and start) the apply and then the capture.
If you have a lot of memory to work with, you could even set _SGA_SIZE to something like 200.
4) If Status is IN FLOW CONTROL: TOO MANY UNBROWSED MESSAGES
This situation could be a propagation problem
a) Stop capture that is in Flow Control via DBMS_CAPTURE_ADM.STOP_CAPTURE
b) Determine the J process associated with propagation via DBA_QUEUE_SCHEDULES, ie:
select schema, qname, destination, schedule_disabled, process_name from dba_queue_schedules;
c) Stop propagation via DBMS_PROPAGATION_ADM.STOP_PROPAGATION (Set FORCE option to TRUE if needed)
d) Check DBA_QUEUE_SCHEDULES / DBA_PROPAGATION to make sure the propagation has been stopped
e) Wait a few minutes and make sure that the J process associated with the stopped propagation is gone. If not, use kill -9 to kill the process at the OS level.
f) Set JOB_QUEUE_PROCESSES to 0, wait a few minutes, make sure the CJQ process and other J processes are gone (at the OS level), then set JOB_QUEUE_PROCESSES back to its original value.
g) Use DBMS_PROPAGATION_ADM.START_PROPAGATION to start Propagation
h) Check DBA_QUEUE_SCHEDULES / DBA_PROPAGATION to make sure the propagation has been started and it has a J process associated.
i) Start CAPTURE via DBMS_CAPTURE_ADM.START_CAPTURE
j) Check the DBA_QUEUE_SCHEDULES columns TOTAL_NUMBER, TOTAL_BYTES, etc.
If propagation is happening, the numbers will be changing.
k) At SOURCE and TARGET databases, check V$BUFFERED_QUEUES.
l) In case propagation once more seems to hang and is not propagating LCRs, find the associated J process and upload the following to Support along with the HC scripts:
- errorstacks at 3-minute intervals
- 10046 and 24040 propagation level 10 traces
From SQL*Plus, as a SYSDBA user:
spool &spoolfilename
set echo on
set pagesize 1000
column destination format a25
column session_id format a10
column destination_dblink format a25
column source_queue_owner format a20
column source_queue_name format a25
select schema,qname,destination,latency,process_name,session_id,instance,message_delivery_mode
from dba_queue_schedules order by message_delivery_mode;
SELECT source_queue_owner,source_queue_name,propagation_name,
destination_dblink,queue_to_queue,status from dba_propagation order by propagation_name;
oradebug setospid <ospid_of_hung_propagation_job>
oradebug unlimit
oradebug dump errorstack 10
<< wait one minute >>
oradebug dump errorstack 10
<< wait one minute >>
oradebug dump errorstack 10
oradebug tracefile_name
oradebug Event 10046 trace name context forever, level 12
oradebug Event 24040 trace name context forever, level 10
<< wait 10-15 minutes >>
oradebug dump errorstack 10
oradebug short_stack
-- repeat the short_stack command 10-15 times in rapid succession
oradebug dump errorstack 10
oradebug Event 24040 trace name context off
oradebug Event 10046 trace name context off
exit;
If the state is IN FLOW CONTROL: TOO MANY UNBROWSED MESSAGES, then you can run the following query to determine which subscriber is not dequeuing messages or is lagging behind on the queue:
SELECT subscriber_name, cnum_msgs, total_dequeued_msg, total_spilled_msg
FROM V$BUFFERED_SUBSCRIBERS;
A comparison of the total number of messages enqueued for the subscriber (CNUM_MSGS)
and the number of messages dequeued by that subscriber (TOTAL_DEQUEUED_MSG) should indicate which subscriber has fallen behind.
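A rough sketch that computes the backlog per subscriber directly:
select subscriber_name,
       cnum_msgs - total_dequeued_msg backlog,
       total_spilled_msg
from V$BUFFERED_SUBSCRIBERS
order by backlog desc;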
5) Resolve latency of Propagation

If the propagation is disabled (broken), re-enable it manually.
To reduce the latency of propagation between jobs, set the hidden parameter _job_queue_interval = 1. This should be done as an init.ora or spfile parameter. Because it is a hidden parameter, the database must be restarted for the value to take effect. The default for _job_queue_interval is 5 seconds.
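For example (a sketch; as with any hidden parameter, set it only under the guidance of Oracle Support):
connect / as sysdba
alter system set "_job_queue_interval"=1 scope=spfile;
-- restart the database for the new value to take effect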
In some situations, propagation becomes disabled (if the number of failures reaches 16); re-enable it manually. For example, if the queue name is STREAMS_QUEUE owned by STRMADMIN and the destination database link is ORCL2.WORLD, the syntax would be:
execute dbms_aqadm.disable_propagation_schedule ('STRMADMIN.STREAMS_QUEUE','ORCL2.WORLD');
execute dbms_aqadm.enable_propagation_schedule ('STRMADMIN.STREAMS_QUEUE','ORCL2.WORLD');
execute dbms_aqadm.alter_propagation_schedule ('STRMADMIN.STREAMS_QUEUE','ORCL2.WORLD',latency=>5);
Propagation and the capture process will be resumed automatically when the target site is able to accept more messages.
6) Check Apply parameters

Set the appropriate parameters for the apply process. For 10.2, set the following apply parameters.
First disable the capture process and the apply process, then:
execute DBMS_APPLY_ADM.set_parameter('applyName','parallelism','4');
execute DBMS_APPLY_ADM.set_parameter('applyName','_dynamic_stmts','Y');
execute DBMS_APPLY_ADM.set_parameter('applyName','_hash_table_size','1000000');
execute DBMS_APPLY_ADM.set_parameter('applyName','disable_on_error','N');
execute DBMS_APPLY_ADM.set_parameter('applyName','_txn_buffer_size','40');  -- i.e. 10 x parallelism
NOTE:315666.1 – How To Set The _TXN_BUFFER_SIZE Parameter For a Streams Apply
SELECT APPLY_NAME,PARAMETER,VALUE FROM ALL_APPLY_PARAMETERS;
Then enable the apply process followed by the capture process.

7) Check for long running Transactions:
Identify the transaction ID for the long-running transaction using the following query, and ignore the
transaction if it causes high latency:
prompt
prompt ++ Current Long Running Transactions ++
prompt Current transactions open for more than 20 minutes
prompt
col runlength HEAD 'Txn Open|Minutes' format 9999.99
col sid HEAD 'Session' format a13
col xid HEAD 'Transaction|ID' format a18
col terminal HEAD 'Terminal' format a10
col program HEAD 'Program' format a27 wrap
select t.inst_id, sid||','||serial# sid,xidusn||'.'||xidslot||'.'||xidsqn xid, (sysdate - start_date )* 1440
runlength ,terminal,program from gv$transaction t, gv$session s where t.addr=s.taddr
and (sysdate - start_date) * 1440 > 20;
To ignore the transaction at capture, use the following steps on the source site:
execute dbms_capture_adm.stop_capture('CAPTURE_NAME');
execute dbms_capture_adm.set_parameter('CAPTURE_NAME','_ignore_transaction','TXN_ID');
execute dbms_capture_adm.start_capture('CAPTURE_NAME');
8) Solutions for some known bugs
ORA-25307: Enqueue rate too high, flow control enabled is reported after enqueueing 5000
user-enqueued buffered messages.
By default, flow control kicks in for user-enqueued events when there are 5000 unbrowsed messages in a
queue. The default behaviour can be changed in 10.2 by applying Patch 5093060, which allows
you to control when buffered publisher flow control is activated.
The fix is event based in 10.2.0.3 and it introduces two new events:
a) 10867: This event controls the flow control threshold value for any buffered message publisher.
The level of the event decides the threshold value.
b) 10868: This event controls the flow control threshold value for only the capture publisher (the Capture process). The level of the event decides the threshold value.
When either of the above events is not set, or is reset using the "context off" command, the threshold values
fall back to the defaults, which are:
1) Buffered publisher : 5000
2) Capture publisher : 15000
The fix for unpublished bug 5093060 is present in 10.2.0.4 onwards and can be activated by setting the
following hidden parameters:
alter system set "_capture_publisher_flow_control_threshold"=10000 scope=memory;
alter system set "_buffered_publisher_flow_control_threshold"=10000 scope=memory;
From 10.2.0.1 to 10.2.0.3, enable the flow control fix for user-enqueued messages.
connect / as sysdba
--Setting the value dynamically
alter system set events '10867 trace name context forever, level 15000';
--Setting back to the default 5000
alter system set events '10867 trace name context off';
--Setting in the spfile
alter system set event='10867 trace name context forever, level 15000' scope=spfile;
--Unsetting in the spfile
alter system reset event scope=spfile sid='*';
--Setting in a pfile
event='10867 trace name context forever, level 15000'
Setting the event in the spfile or a pfile requires a bounce of the database for it to take effect.