1. Problem Symptoms
Early this morning we received a request from the project team: since around 3:30 PM yesterday, one module of the business system had been failing to insert data. They hoped we could check the database and the server for anything abnormal and lend support.
After logging in to the database console, an initial investigation turned up the following:
Starting at 10:29 AM yesterday, the system had been issuing similar INSERT statements at a high rate, each execution taking about one second. Worse still, statements of this type had kept running non-stop all the way into this morning.
2. Problem Analysis
Let's pull out one of these statements to analyze:
INSERT
INTO t_ai_prd_history_effective_record
(
ID,
create_time,
cust_category,
cust_id,
cust_name,
company_id,
company_name,
template_no,
template_name,
template_field_identifier,
template_field_name,
submit_value
)
SELECT random_string (32),
now(),
$1,
$2,
$3,
$4,
$5,
$6,
$7,
$8,
$9,
$10
WHERE NOT EXISTS
(SELECT 1
FROM t_ai_prd_history_effective_record r
WHERE COALESCE (r.cust_category, '') = COALESCE ($11, '')
AND COALESCE (r.cust_id, '') = COALESCE ($12, '')
AND COALESCE (r.company_id, '') = COALESCE ($13, '')
AND COALESCE (r.template_no, '') = COALESCE ($14, '')
AND COALESCE (r.template_field_identifier, '') = COALESCE ($15,'')
AND COALESCE (r.submit_value, '') = COALESCE ($16, '')
)
The corresponding parameters:
params: $1 = 'MY001',
$2 = 'b8168c7513014b0c9769f3d61574833d',
$3 = 'WP MANAGEMENT SERVICES',
$4 = '1BABS7HSS5UH01FE140A000085535171',
$5 = 'KLINIK PERGIGIAN M DENTAL',
$6 = 'MYBL0303',
$7 = 'Expenses',
$8 = 'InvoiceDate',
$9 = 'InvoiceDate(发票日期)',
$10 = '20170614',
$11 = 'MY001',
$12 = 'b8168c7513014b0c9769f3d61574833d',
$13 = '1BABS7HSS5UH01FE140A000085535171',
$14 = 'MYBL0303',
$15 = 'InvoiceDate',
$16 = '20170614'
Clearly, the purpose of this SQL is to insert a row into t_ai_prd_history_effective_record conditionally: the row is inserted only if the table contains no existing row matching a condition like the following (these literal values are from another captured execution of the same statement):
COALESCE (r.cust_category, '') = COALESCE ('DZ001', '')
AND COALESCE (r.cust_id, '') = COALESCE ('b151ad4f86ab4ec5aee8c4cc377e9eb7', '')
AND COALESCE (r.company_id, '') = COALESCE ('04cb580238dc49af8bfb46e00e959a1a', '')
AND COALESCE (r.template_no, '') = COALESCE ('KJDZ0101', '')
AND COALESCE (r.template_field_identifier, '') = COALESCE ('ItemQuantity','')
AND COALESCE (r.submit_value, '') = COALESCE ('10100$__$6080$__$$__$$__$', '');
That duplicate check is embodied in the WHERE NOT EXISTS clause of the SQL.
So the problem is clear: it must lie in the NOT EXISTS subquery.
Next, let's look at the execution plan of that subquery on its own:
ai=> explain analyze SELECT 1
ai-> FROM t_ai_prd_history_effective_record r
ai-> WHERE COALESCE (r.cust_category, '') = COALESCE ('DZ001', '')
ai-> AND COALESCE (r.cust_id, '') = COALESCE ('b151ad4f86ab4ec5aee8c4cc377e9eb7', '')
ai-> AND COALESCE (r.company_id, '') = COALESCE ('04cb580238dc49af8bfb46e00e959a1a', '')
ai-> AND COALESCE (r.template_no, '') = COALESCE ('KJDZ0101', '')
ai-> AND COALESCE (r.template_field_identifier, '') = COALESCE ('ItemQuantity','')
ai-> AND COALESCE (r.submit_value, '') = COALESCE ('10100$__$6080$__$$__$$__$', '');
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on t_ai_prd_history_effective_record r (cost=0.00..203817.53 rows=1 width=0) (actual time=1266.116..1267.093 rows=1 loops=1)
Filter: (((COALESCE(cust_category, ''::character varying))::text = 'DZ001'::text) AND ((COALESCE(cust_id, ''::character varying))::text = 'b151ad4f86ab4ec5aee8c4cc377e9eb7'::text) AND ((COALESCE(company_id, ''::character varying))::text = '04cb580238dc49af8bfb46e00e959a1a'::text) AND ((COALESCE(template_no, ''::character varying))::text = 'KJDZ0101'::text) AND ((COALESCE(template_field_identifier, ''::character varying))::text = 'ItemQuantity'::text) AND (COALESCE(submit_value, ''::text) = '10100$__$6080$__$$__$$__$'::text))
Rows Removed by Filter: 3193300
Planning time: 0.129 ms
Execution time: 1267.133 ms
(5 rows)
ai=>
Row count of the t_ai_prd_history_effective_record table:
ai=> select count(*) from t_ai_prd_history_effective_record;
count
---------
3193138
(1 row)
ai=>
As expected, every run of this duplicate-existence check does a full table scan of t_ai_prd_history_effective_record, a table holding more than 3.19 million rows. Naturally the execution is slow, over one second per run, and as the table keeps growing, inserts only get slower.
You could say that every row inserted here makes the next insert slower!
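As an aside, the filter compares COALESCE expressions rather than bare columns, so an ordinary multicolumn index on the six columns would not match it; only an index built on those very expressions could, in principle, let the planner replace the sequential scan with an index scan. A hypothetical sketch (the index name is invented, and a btree key this wide may hit entry-size limits if submit_value holds long text):
-- Hypothetical expression index keyed on the same six COALESCE
-- expressions that the NOT EXISTS filter compares.
CREATE INDEX idx_her_dedup ON t_ai_prd_history_effective_record (
    (COALESCE(cust_category, '')),
    (COALESCE(cust_id, '')),
    (COALESCE(company_id, '')),
    (COALESCE(template_no, '')),
    (COALESCE(template_field_identifier, '')),
    (COALESCE(submit_value, ''))
);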
3. Solution
Having found the crux of the problem, it was time to optimize. The ultimate goal: how do we get around that NOT EXISTS subquery?
After discussing it with Lin Chaorong (林朝荣) of the BPO department, we came up with the following options:
- Could we build a unique index on the six columns that the subquery compares (see the sketch after this list)? If so, the INSERT would no longer need the WHERE clause to deduplicate: the unique constraint would do the checking, and inserting a duplicate would violate the constraint and raise an error. That seems to solve the problem, but from a business standpoint there really are a lot of duplicates, so business users would keep getting error messages in the system, which is not very friendly.
- In that case, the problem becomes even simpler. Just crudely delete the WHERE NOT EXISTS duplicate check from the SQL and insert everything, duplicates and all; then write a batch SQL job that runs at night to remove the duplicate records in one pass. Problem solved.
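For reference, option 1 would look roughly like this sketch (the index name is invented; COALESCE stays in the key expressions so that NULL and '' deduplicate exactly as the original WHERE clause treats them, and the existing duplicates would have to be cleaned out before the index could be created):
-- Hypothetical unique index for option 1: inserting a duplicate now
-- fails with "duplicate key value violates unique constraint".
CREATE UNIQUE INDEX uq_her_dedup ON t_ai_prd_history_effective_record (
    (COALESCE(cust_category, '')),
    (COALESCE(cust_id, '')),
    (COALESCE(company_id, '')),
    (COALESCE(template_no, '')),
    (COALESCE(template_field_identifier, '')),
    (COALESCE(submit_value, ''))
);
On PostgreSQL 9.5 or later, INSERT ... ON CONFLICT DO NOTHING could suppress those errors instead of surfacing them to users, if the server version in use allows it.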
Here is an example of deleting duplicate records from a PostgreSQL table:
ai=> create table t(id int primary key,name varchar(10),addr varchar(10));
CREATE TABLE
ai=> insert into t values(1,'onlyou.com','xiamen');
INSERT 0 1
ai=> insert into t values(2,'apple.com','usa');
INSERT 0 1
ai=> insert into t values(3,'apple.com','usa');
INSERT 0 1
ai=> insert into t values(4,'google','usa');
INSERT 0 1
ai=> insert into t values(5,'google','usa');
INSERT 0 1
ai=> select * from t;
id | name | addr
----+------------+--------
1 | onlyou.com | xiamen
2 | apple.com | usa
3 | apple.com | usa
4 | google | usa
5 | google | usa
(5 rows)
ai=> select t1.id from t t1 where t1.ctid<>(select max(t2.ctid) from t t2 where t1.name=t2.name and t1.addr=t2.addr);
id
----
2
4
(2 rows)
ai=> delete from t where id in(select t1.id from t t1 where t1.ctid<>(select max(t2.ctid) from t t2 where t1.name=t2.name and t1.addr=t2.addr));
DELETE 2
ai=> select * from t;
id | name | addr
----+------------+--------
1 | onlyou.com | xiamen
3 | apple.com | usa
5 | google | usa
(3 rows)
ai=>
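The correlated max(ctid) subquery above is fine for a small demo, but it re-examines the table for every row, which will not scale to millions of rows. Applied to the real table, the nightly cleanup might look more like the following sketch (assuming duplicates are defined by the same six expressions as the original WHERE NOT EXISTS, keeping the earliest row of each group):
-- Single-pass dedup with a window function: number the rows within
-- each duplicate group and delete every row after the first.
DELETE FROM t_ai_prd_history_effective_record
WHERE ctid IN (
    SELECT ctid
    FROM (SELECT ctid,
                 row_number() OVER (
                     PARTITION BY COALESCE(cust_category, ''),
                                  COALESCE(cust_id, ''),
                                  COALESCE(company_id, ''),
                                  COALESCE(template_no, ''),
                                  COALESCE(template_field_identifier, ''),
                                  COALESCE(submit_value, '')
                     ORDER BY create_time, ctid) AS rn
          FROM t_ai_prd_history_effective_record) d
    WHERE d.rn > 1
);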
4. Summary
Spotting a slow SQL statement and pinpointing where its problem lies, then optimizing it in a targeted way, is often not as hard as it looks; think it through and keep experimenting.
The above offers one line of thought, and one direction, for this kind of optimization.