收到紧急邮件:说某客户生产系统在今天早上9:30左右,生产环境整个系统操作响应很慢,查询一票单子可能要耗时半分钟,有时候登录界面也卡住很久才响应
。
这是一套10.2.0.5.0的双节点RAC数据库,平时系统运行较为正常,现在突然变慢。好了,对于我这个优化菜鸟来讲,还是从AWR报告入手。下面给出分析步骤和解决办法:
1 从报告头中看到,在数据库出现性能问题的一个小时内,DB Time达到240分钟,(DB Time)/Elapsed=3.93,说明数据库应该是存在问题的。
Snap Id | Snap Time | Sessions | Cursors/Session | |
---|---|---|---|---|
Begin Snap: | 10638 | 05-6月 -12 09:00:00 | 54 | 5.8 |
End Snap: | 10639 | 05-6月 -12 10:01:03 | 102 | 5.8 |
Elapsed: | 61.04 (mins) | |||
DB Time: | 240.11 (mins) |
2 从报告的Load Profile节,看到用户的每秒调用高达680次,说明肯定存在问题了。
Load Profile
Per Second | Per Transaction | |
---|---|---|
Redo size: | 12,939.46 | 6,115.60 |
Logical reads: | 67,323.18 | 31,819.06 |
Block changes: | 53.22 | 25.15 |
Physical reads: | 1.02 | 0.48 |
Physical writes: | 4.72 | 2.23 |
User calls: | 679.70 | 321.25 |
Parses: | 90.49 | 42.77 |
Hard parses: | 0.35 | 0.16 |
Sorts: | 1.94 | 0.92 |
Logons: | 0.08 | 0.04 |
Executes: | 316.75 | 149.71 |
Transactions: | 2.12 |
3 继续分析报告,从Top 5中看到排在第一的等待事件是row cache lock,并且该等待事件的平均等待达到2128ms。
ROW CACHE LOCK等待事件是一个共享池相关的等待事件,是由于对于字典缓冲的访问造成的。通常直接的解决办可以通过调大共享池来解决,但是,并非在所有场景下都凑效。
Top 5 Timed Events
Event | Waits | Time(s) | Avg Wait(ms) | % Total Call Time | Wait Class |
---|---|---|---|---|---|
row cache lock | 2,736 | 5,822 | 2,128 | 40.4 | Concurrency |
CPU time | 4,305 | 29.9 | |||
gc cr block busy | 2,293 | 2,633 | 1,148 | 18.3 | Cluster |
gc buffer busy | 1,569 | 1,096 | 698 | 7.6 | Cluster |
enq: TX – row lock contention | 2,029 | 998 | 492 | 6.9 | Application |
4 继续分析,发现基于时间的统计信息中,加载序列sequence的耗时排在了第二位。
Time Model Statistics
- Total time in database user-calls (DB Time): 14406.5s
- Statistics including the word "background" measure background process time, and so do not contribute to the DB time statistic
- Ordered by % or DB time desc, Statistic name
Statistic Name | Time (s) | % of DB Time |
---|---|---|
sql execute elapsed time | 14,188.01 | 98.48 |
sequence load elapsed time | 6,900.83 | 47.90 |
DB CPU | 4,304.59 | 29.88 |
PL/SQL execution elapsed time | 20.64 | 0.14 |
parse time elapsed | 10.25 | 0.07 |
hard parse elapsed time | 6.00 | 0.04 |
PL/SQL compilation elapsed time | 1.17 | 0.01 |
hard parse (sharing criteria) elapsed time | 1.07 | 0.01 |
repeated bind elapsed time | 0.80 | 0.01 |
hard parse (bind mismatch) elapsed time | 0.64 | 0.00 |
connection management call elapsed time | 0.31 | 0.00 |
failed parse elapsed time | 0.00 | 0.00 |
DB time | 14,406.50 | |
background elapsed time | 2,115.75 | |
background cpu time | 20.52 |
5 看到最耗时的竟然是一条再简单不过的SQL语句,SELECT SEQ_NEWID.NEXTVAL FROM DUAL,取序列的值,竟然会如此的耗时?
SQL ordered by Elapsed Time
- Resources reported for PL/SQL code includes the resources used by all SQL statements called by the code.
- % Total DB Time is the Elapsed Time of the SQL statement divided into the Total Database Time multiplied by 100
- Total DB Time (s): 14,406
- Captured SQL account for 223.8% of Total
Elapsed Time (s) | CPU Time (s) | Executions | Elap per Exec (s) | % Total DB Time | SQL Id | SQL Module | SQL Text |
---|---|---|---|---|---|---|---|
6,910 | 0 | 281 | 24.59 | 47.96 | 1gd7ancd2px8m | FC.EdiService.Import.exe | SELECT SEQ_NEWID.NEXTVAL FROM … |
6 再看字典缓冲区的统计信息:取序列值一共287次,就失败了43.9%,看来的确是取序列值的地方出现问题,也就解释了为什么上一步骤中的分许出的那条SQL会如此耗时,因为差不多有一半的情况下都没有取到序列的值。
Dictionary Cache Stats
- "Pct Misses" should be very low (< 2% in most cases)
- "Final Usage" is the number of cache entries being used
Cache | Get Requests | Pct Miss | Scan Reqs | Pct Miss | Mod Reqs | Final Usage |
---|---|---|---|---|---|---|
dc_awr_control | 65 | 1.54 | 0 | 1 | 1 | |
dc_database_links | 304 | 0.00 | 0 | 0 | 1 | |
dc_global_oids | 155 | 0.00 | 0 | 0 | 24 | |
dc_histogram_data | 74,704 | 0.25 | 0 | 0 | 5,612 | |
dc_histogram_defs | 71,400 | 0.26 | 0 | 0 | 4,945 | |
dc_object_ids | 29,398 | 0.01 | 0 | 0 | 1,136 | |
dc_objects | 3,912 | 0.23 | 0 | 0 | 860 | |
dc_profiles | 150 | 0.00 | 0 | 0 | 1 | |
dc_rollback_segments | 17,789 | 0.10 | 0 | 0 | 56 | |
dc_segments | 8,927 | 0.06 | 0 | 4 | 896 | |
dc_sequences | 287 | 43.90 | 0 | 279 | 3 | |
dc_tablespace_quotas | 2 | 50.00 | 0 | 0 | 2 | |
dc_tablespaces | 8,954 | 0.00 | 0 | 0 | 25 | |
dc_usernames | 1,082 | 0.00 | 0 | 0 | 8 | |
dc_users | 13,991 | 0.00 | 0 | 0 | 31 | |
outstanding_alerts | 326 | 77.91 | 0 | 23 | 54 |
7 到此,解决问题的基本思路已经出来了,通过将序列缓存到内存中,基本可以解决问题。通过查看生产系统上的该序列的信息,发现创建序列的语句如下:
-- Create sequence create sequence SEQ_NEWID minvalue 1000 maxvalue 9999 start with 1000 increment by 1 nocache;
8 调整序列,使之cache到内存中,alter sequence SEQ_NEWID cache 3000;
9 附带,刚在生产系统上把序列cache到内存,另一同事就把RAC数据库给重启了,据他判断说是存储系统出了问题。可是,怎么会是存储出了问题呢?如果,能重现这次的性能问题该多好啊,只是没有如果!
在处理这个问题的时候,参照了老白的文章:白鳝说WAIT EVENT之ROW CACHE LOCK