解决一则row cache lock引起的性能故障

     收到紧急邮件:说某客户生产系统在今天早上9:30左右,生产环境整个系统操作响应很慢,查询一票单子可能要耗时半分钟,有时候登录界面也卡住很久才响应

       这是一套10.2.0.5.0的双节点RAC数据库,平时系统运行较为正常,现在突然变慢。好了,对于我这个优化菜鸟来讲,还是从AWR报告入手。下面给出分析步骤和解决办法:

1   从报告头中看到,在数据库出现性能问题的一个小时内,DB Time达到240分钟,(DB Time)/Elapsed=3.93,说明数据库应该是存在问题的。

  Snap Id Snap Time Sessions Cursors/Session
Begin Snap: 10638 05-6月 -12 09:00:00 54 5.8
End Snap: 10639 05-6月 -12 10:01:03 102 5.8
Elapsed:   61.04 (mins)    
DB Time:   240.11 (mins)    

2   从报告的Load Profile节,看到用户的每秒调用高达680次,说明肯定存在问题了。

Load Profile

  Per Second Per Transaction
Redo size: 12,939.46 6,115.60
Logical reads: 67,323.18 31,819.06
Block changes: 53.22 25.15
Physical reads: 1.02 0.48
Physical writes: 4.72 2.23
User calls: 679.70 321.25
Parses: 90.49 42.77
Hard parses: 0.35 0.16
Sorts: 1.94 0.92
Logons: 0.08 0.04
Executes: 316.75 149.71
Transactions: 2.12  

3   继续分析报告,从Top 5中看到排在第一的等待事件是row cache lock,并且该等待事件的平均等待达到2128ms。 

ROW CACHE LOCK等待事件是一个共享池相关的等待事件,是由于对于字典缓冲的访问造成的。通常直接的解决办可以通过调大共享池来解决,但是,并非在所有场景下都凑效。

 

Top 5 Timed Events

Event Waits Time(s) Avg Wait(ms) % Total Call Time Wait Class
row cache lock 2,736 5,822 2,128 40.4 Concurrency
CPU time   4,305   29.9  
gc cr block busy 2,293 2,633 1,148 18.3 Cluster
gc buffer busy 1,569 1,096 698 7.6 Cluster
enq: TX – row lock contention 2,029 998 492 6.9 Application

 

4    继续分析,发现基于时间的统计信息中,加载序列sequence的耗时排在了第二位。

Time Model Statistics

  • Total time in database user-calls (DB Time): 14406.5s
  • Statistics including the word "background" measure background process time, and so do not contribute to the DB time statistic
  • Ordered by % or DB time desc, Statistic name

 

Statistic Name Time (s) % of DB Time
sql execute elapsed time 14,188.01 98.48
sequence load elapsed time 6,900.83 47.90
DB CPU 4,304.59 29.88
PL/SQL execution elapsed time 20.64 0.14
parse time elapsed 10.25 0.07
hard parse elapsed time 6.00 0.04
PL/SQL compilation elapsed time 1.17 0.01
hard parse (sharing criteria) elapsed time 1.07 0.01
repeated bind elapsed time 0.80 0.01
hard parse (bind mismatch) elapsed time 0.64 0.00
connection management call elapsed time 0.31 0.00
failed parse elapsed time 0.00 0.00
DB time 14,406.50  
background elapsed time 2,115.75  
background cpu time 20.52  

5     看到最耗时的竟然是一条再简单不过的SQL语句,SELECT SEQ_NEWID.NEXTVAL FROM DUAL,取序列的值,竟然会如此的耗时?

SQL ordered by Elapsed Time

  • Resources reported for PL/SQL code includes the resources used by all SQL statements called by the code.
  • % Total DB Time is the Elapsed Time of the SQL statement divided into the Total Database Time multiplied by 100
  • Total DB Time (s): 14,406
  • Captured SQL account for 223.8% of Total

 

Elapsed Time (s) CPU Time (s) Executions Elap per Exec (s) % Total DB Time SQL Id SQL Module SQL Text
6,910 0 281 24.59 47.96 1gd7ancd2px8m FC.EdiService.Import.exe SELECT SEQ_NEWID.NEXTVAL FROM …

6    再看字典缓冲区的统计信息:取序列值一共287次,就失败了43.9%,看来的确是取序列值的地方出现问题,也就解释了为什么上一步骤中的分许出的那条SQL会如此耗时,因为差不多有一半的情况下都没有取到序列的值。

Dictionary Cache Stats

  • "Pct Misses" should be very low (< 2% in most cases)
  • "Final Usage" is the number of cache entries being used

 

Cache Get Requests Pct Miss Scan Reqs Pct Miss Mod Reqs Final Usage
dc_awr_control 65 1.54 0   1 1
dc_database_links 304 0.00 0   0 1
dc_global_oids 155 0.00 0   0 24
dc_histogram_data 74,704 0.25 0   0 5,612
dc_histogram_defs 71,400 0.26 0   0 4,945
dc_object_ids 29,398 0.01 0   0 1,136
dc_objects 3,912 0.23 0   0 860
dc_profiles 150 0.00 0   0 1
dc_rollback_segments 17,789 0.10 0   0 56
dc_segments 8,927 0.06 0   4 896
dc_sequences 287 43.90 0   279 3
dc_tablespace_quotas 2 50.00 0   0 2
dc_tablespaces 8,954 0.00 0   0 25
dc_usernames 1,082 0.00 0   0 8
dc_users 13,991 0.00 0   0 31
outstanding_alerts 326 77.91 0   23 54

7   到此,解决问题的基本思路已经出来了,通过将序列缓存到内存中,基本可以解决问题。通过查看生产系统上的该序列的信息,发现创建序列的语句如下:

-- Create sequence 
create sequence SEQ_NEWID
minvalue 1000
maxvalue 9999
start with 1000
increment by 1
nocache;

8   调整序列,使之cache到内存中,alter sequence SEQ_NEWID cache 3000;

9   附带,刚在生产系统上把序列cache到内存,另一同事就把RAC数据库给重启了,据他判断说是存储系统出了问题。可是,怎么会是存储出了问题呢?如果,能重现这次的性能问题该多好啊,只是没有如果!

      在处理这个问题的时候,参照了老白的文章:白鳝说WAIT EVENT之ROW CACHE LOCK