Log File Sync Waits Causing Application Stalls in an Oracle RAC Environment

1. Environment

Server model: HP 580G7

Operating system: Red Hat Enterprise Linux Server release 6.5

Database version: Oracle 11.2.0.3

Cluster mode: Oracle RAC

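2. Symptoms

Shortly after 9 a.m. the customer called to report that the core business system was extremely sluggish. Clicking "read card" or submitting an operation froze the client, and the business departments could not carry out their work.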

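3. Troubleshooting

3.1 Network checks

Given the symptoms, the network layer was checked first, to rule out stalls caused by network faults such as loops or excessive send/receive errors. Routine network diagnostics ruled out the network as the cause.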

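3.2 Storage I/O checks

Disk reads and writes were tested locally on the server with the dd tool, and read/write performance was very high. A login to the storage array showed that it was running normally. This ruled out abnormal storage I/O as the cause of the front-end problem.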
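The article does not show the dd command that was used. As a rough stand-in, the following Python sketch times a sequential write followed by an fsync, which is similar in spirit to a dd write test. The function name, the default sizes, and the use of a temporary directory are illustrative assumptions, not the author's procedure.

```python
import os
import tempfile
import time


def sequential_write_mb_per_s(directory, total_mb=16, block_kb=1024):
    """Write about total_mb of zeros to a temp file in `directory` and return MB/s."""
    block = b"\0" * (block_kb * 1024)
    blocks = max(1, (total_mb * 1024) // block_kb)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            # Force the data to the device before stopping the clock,
            # otherwise only the page cache is being measured.
            os.fsync(f.fileno())
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)
    written_mb = blocks * len(block) / (1024 * 1024)
    return written_mb / elapsed


if __name__ == "__main__":
    # Hypothetical target: point this at a directory on the storage under test.
    rate = sequential_write_mb_per_s(tempfile.gettempdir())
    print(f"sequential write: {rate:.1f} MB/s")
```

A large sequential test like this measures throughput only. It says little about the latency of the small synchronous writes that LGWR issues, so a good result here does not by itself clear the redo log path.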

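3.3 Database-level checks

The database trace files were analyzed first, and no abnormal alerts were found. Then, following standard practice, AWR performance reports were collected on both nodes and analyzed. They showed a very long wait time for the event log file sync.

At the database level, the SID of the session blocking log file sync was queried. That SID was mapped to the operating system process ID (SPID), and the SPID identified the blocking process as LGWR. The problem therefore lay in the log writer itself: LGWR was blocking log file sync, so front-end transactions could not commit.

Analysis of the LGWR trace file showed that shortly after 8 a.m. LGWR had switched from post/wait mode to polling mode, and that it had stayed in polling mode ever since.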
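The trace-file check can be sketched as a small script that scans an LGWR trace for the switch messages, which begin with the string "Log file sync switching to". The timestamp header format (`*** YYYY-MM-DD HH:MM:SS.fff`) is an assumption about the trace layout, and the sample text is a hypothetical excerpt for illustration, not data from this incident.

```python
import re

SWITCH_PREFIX = "Log file sync switching to "
# Assumed timestamp header format, e.g. "*** 2024-01-01 08:12:45.123".
TIMESTAMP_RE = re.compile(r"^\*\*\* (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")


def find_mode_switches(lines):
    """Return (timestamp, mode) pairs, one per switch message, in file order."""
    switches = []
    last_timestamp = None  # most recent timestamp header seen so far
    for raw in lines:
        line = raw.strip()
        match = TIMESTAMP_RE.match(line)
        if match:
            last_timestamp = match.group(1)
        elif line.startswith(SWITCH_PREFIX):
            mode = line[len(SWITCH_PREFIX):].strip()
            switches.append((last_timestamp, mode))
    return switches


def current_mode(switches):
    """Mode after the last recorded switch; LGWR starts out in post/wait."""
    return switches[-1][1] if switches else "post/wait"


# Hypothetical excerpt, for illustration only.
SAMPLE_TRACE = """\
*** 2024-01-01 08:12:45.123
Log file sync switching to polling
"""

if __name__ == "__main__":
    found = find_mode_switches(SAMPLE_TRACE.splitlines())
    for timestamp, mode in found:
        print(timestamp, "->", mode)
    print("current mode:", current_mode(found))
```

To scan a real file, pass an open file object instead of the sample lines. If the last recorded switch is to polling and no later switch back to post/wait appears, that matches the situation described above.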

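3.4 Resolution

On My Oracle Support (MOS) I found the note Adaptive Log File Sync Optimization (Doc ID 1541136.1). According to the note, the parameter that controls the LGWR mode is _use_adaptive_log_file_sync. It was introduced in 11g R2, and in 11.2.0.3 its default value became TRUE. The official document describes it as follows:

[Screenshot: excerpt from Doc ID 1541136.1]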

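LGWR can communicate with foreground processes in two ways (post/wait and polling) to acknowledge that a commit has completed. The official description is as follows: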

There are 2 methods by which LGWR and foreground processes can communicate in order to acknowledge that a commit has completed:

   >Post/wait - traditional method available in previous Oracle releases

   LGWR explicitly posts all processes waiting for the commit to complete.

   The advantage of the post/wait method is that sessions should find out almost immediately when the redo has been flushed to disk.

   >Polling

   Foreground processes sleep and poll to see if the commit is complete.

   The advantage of this new method is to free LGWR from having to inform many processes waiting on commit to complete thereby freeing high CPU usage by the LGWR.

Initially the LGWR uses post/wait and according to an internal algorithm evaluates whether polling is better. Under high system load polling may perform better because the post/wait implementation typically does not scale well. If the system load is low, then post/wait performs well and provides better response times than polling. Oracle relies on internal statistics to determine which method should be used.  Because switching between post/wait and polling incurs an overhead, safe guards are in place in order to ensure that switches do not occur too frequently.

All switches are recorded in LGWR's trace file with a time stamp and the string "Log file sync switching to ...":

Statistics on polling are stored in v$sysstat:

SQL> select name,value from v$sysstat where name in ('redo synch poll writes','redo synch polls');

NAME                                                                  VALUE
---------------------------------------------------------------- ----------
redo synch poll writes                                                    0
redo synch polls                                                          0

In the above example we see that polling is not occurring: a value of 0 means that polling has not occurred.
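The reading of these two statistics can be sketched in Python as follows. Because v$sysstat values are cumulative since instance startup, comparing two snapshots shows whether polling is happening now rather than at some earlier point. The function names and the second snapshot's value are illustrative assumptions; the parser expects SQL*Plus-style output like the listing above.

```python
POLL_STATS = ("redo synch poll writes", "redo synch polls")


def parse_sysstat_output(text):
    """Parse 'NAME ... VALUE' rows of SQL*Plus-style output into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.rsplit(None, 1)  # split off the last whitespace-separated field
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0].strip()] = int(parts[1])
    return stats


def polling_occurred(stats):
    """True if either polling statistic is non-zero."""
    return any(stats.get(name, 0) > 0 for name in POLL_STATS)


def polling_delta(before, after):
    """Change in each polling statistic between two snapshots."""
    return {name: after.get(name, 0) - before.get(name, 0) for name in POLL_STATS}


SAMPLE_OUTPUT = """\
NAME                                  VALUE
------------------------------ ----------
redo synch poll writes                  0
redo synch polls                        0
"""

if __name__ == "__main__":
    first = parse_sysstat_output(SAMPLE_OUTPUT)
    print("polling occurred:", polling_occurred(first))
    # Hypothetical second snapshot, taken a little later.
    second = {"redo synch poll writes": 0, "redo synch polls": 120}
    print("delta:", polling_delta(first, second))
```

A non-zero delta between snapshots indicates that foreground sessions are currently polling for commit completion.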


The fix was found in the note Bug 13707904 - LGWR sometimes uses polling, sometimes post/wait (Doc ID 13707904.8):

ALTER SYSTEM SET "_use_adaptive_log_file_sync"=FALSE SCOPE=BOTH;

The official document describes it as follows:

[Screenshot: excerpt from Doc ID 13707904.8]

4. Summary

This incident was caused by a bug in Oracle 11.2.0.3. It was resolved by changing the hidden parameter _use_adaptive_log_file_sync from its default value of TRUE to FALSE.

