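Full-Text Index
A full-text index stores each word's frequency and the records it appears in. The more common a word is across the records, the lower its weight, and an algorithm combines these weights into a relevance score. MySQL's MyISAM and InnoDB engines both support full-text search, but note:
- InnoDB has offered full-text indexes only since version 5.6.4.
- The syntax is the same, but MyISAM and InnoDB differ in implementation and algorithm, so their relevance scores are not comparable: do not compare a relevance value from an InnoDB table with one from a MyISAM table.
- MyISAM has 543 built-in stopwords (words too common to be worth indexing), while InnoDB has 36. The MySQL documentation describes how to add or remove stopwords for each engine and how to tune the full-text search configuration parameters.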
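As a rough sketch of the stopword handling mentioned above, the statements below list InnoDB's default stopwords and register a custom list. The table name my_stopwords, the schema name test and the word inserted are assumptions made for illustration. The index has to be rebuilt before a new list takes effect.

```sql
-- List InnoDB's built-in stopwords
SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;

-- Hypothetical custom stopword table: it must have a single VARCHAR column named `value`
CREATE TABLE my_stopwords (value VARCHAR(30)) ENGINE = INNODB;
INSERT INTO my_stopwords (value) VALUES ('comment');

-- Point InnoDB at the custom list, given as 'schema_name/table_name' (schema `test` is assumed)
SET GLOBAL innodb_ft_server_stopword_table = 'test/my_stopwords';

-- Rebuild the full-text index so the new stopword list is applied
ALTER TABLE TicketComment DROP INDEX TicketComment_Search;
ALTER TABLE TicketComment ADD FULLTEXT INDEX TicketComment_Search (Body);
```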
SQL
See the official MySQL documentation for details.
Setting a FULLTEXT KEY on a table
-- Full-text index
ALTER TABLE TicketComment ADD FULLTEXT INDEX TicketComment_Search (Body);
-- Composite full-text index: both columns are searched together, and a match in either column counts toward the relevance score
ALTER TABLE Ticket ADD FULLTEXT INDEX Ticket_Search (Subject, Body);
Search statements
Single-word search
mysql> SELECT * FROM `TicketComment` WHERE MATCH(`Body`) AGAINST('test');
+-----------+----------+--------+-------------------+----------------------------+
| CommentId | TicketId | UserId | Body | DateCreated |
+-----------+----------+--------+-------------------+----------------------------+
| 2 | 1 | 4 | Comment Two: test | 2018-03-07 15:40:22.631000 |
| 12 | 1 | 4 | Test | 2018-03-07 15:50:35.068000 |
+-----------+----------+--------+-------------------+----------------------------+
2 rows in set (0.04 sec)
Viewing the relevance score
mysql> SELECT *, MATCH(`Body`) AGAINST('test') AS score From TicketComment;
+-----------+----------+--------+-------------------+----------------------------+--------------------+
| CommentId | TicketId | UserId | Body | DateCreated | score |
+-----------+----------+--------+-------------------+----------------------------+--------------------+
| 1 | 1 | 4 | my comment: Hello | 2018-03-07 15:40:00.719000 | 0 |
| 2 | 1 | 4 | Comment Two: test | 2018-03-07 15:40:22.631000 | 0.6055193543434143 |
| 3 | 1 | 4 | Comment Three : 3 | 2018-03-07 15:40:51.588000 | 0 |
| 4 | 1 | 4 | Comment Four: 4 | 2018-03-07 15:41:00.622000 | 0 |
| 5 | 1 | 4 | Comment Five: 5 | 2018-03-07 15:41:09.777000 | 0 |
| 6 | 1 | 4 | Comment Six: 6 | 2018-03-07 15:41:16.899000 | 0 |
| 7 | 1 | 4 | Comment Serven: 7 | 2018-03-07 15:41:28.665000 | 0 |
| 8 | 1 | 4 | Comment 8 | 2018-03-07 15:41:37.733000 | 0 |
| 9 | 1 | 4 | Comment 9 | 2018-03-07 15:41:43.515000 | 0 |
| 10 | 1 | 4 | Comment 10 | 2018-03-07 15:41:51.349000 | 0 |
| 11 | 1 | 4 | Comment 11 | 2018-03-07 15:42:01.263000 | 0 |
| 12 | 1 | 4 | Test | 2018-03-07 15:50:35.068000 | 0.6055193543434143 |
+-----------+----------+--------+-------------------+----------------------------+--------------------+
12 rows in set (0.01 sec)
SELECT *, MATCH(`Body`) AGAINST('test Hello') AS score From TicketComment;
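# The query below is roughly equivalent to the one above, but written this way we can add up the relevance of different columns, or of different tables (using a join)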
SELECT *, (MATCH(`Body`) AGAINST('test')+ MATCH(`Body`) AGAINST('Hello')) AS score From TicketComment;
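Adding up relevance across tables might look like the following sketch. It assumes that TicketComment.TicketId refers to Ticket.TicketId and that both FULLTEXT indexes defined earlier exist. The aliases t and c are illustrative.

```sql
-- Combine the ticket's own relevance with the relevance of each of its comments
SELECT t.TicketId,
       t.Subject,
       c.CommentId,
       (MATCH(t.Subject, t.Body) AGAINST('test')
        + MATCH(c.Body) AGAINST('test')) AS score
FROM Ticket AS t
JOIN TicketComment AS c ON c.TicketId = t.TicketId
-- Keep a row if either the ticket or the comment matches
WHERE MATCH(t.Subject, t.Body) AGAINST('test')
   OR MATCH(c.Body) AGAINST('test')
ORDER BY score DESC;
```

Each MATCH() column list has to correspond exactly to a FULLTEXT index on that table, which is why the Ticket side names both Subject and Body.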
Multi-word search
mysql> select * from TicketComment where match(`Body`) against('Five Six');
+-----------+----------+--------+-----------------+----------------------------+
| CommentId | TicketId | UserId | Body | DateCreated |
+-----------+----------+--------+-----------------+----------------------------+
| 5 | 1 | 4 | Comment Five: 5 | 2018-03-07 15:41:09.777000 |
| 6 | 1 | 4 | Comment Six: 6 | 2018-03-07 15:41:16.899000 |
+-----------+----------+--------+-----------------+----------------------------+
2 rows in set (0.00 sec)
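Testing showed that the digits and the word my are not matched. my is on MyISAM's built-in stopword list, but short tokens like these are more likely skipped because they fall below the minimum token length (innodb_ft_min_token_size, default 3, for InnoDB; ft_min_word_len, default 4, for MyISAM) than because they are stopwords.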
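One way to check whether short tokens are skipped for their length rather than as stopwords is to inspect the minimum token size of each engine. This is a minimal sketch. Changing either value requires a server restart and an index rebuild.

```sql
-- InnoDB: tokens shorter than this are not indexed (default 3)
SHOW VARIABLES LIKE 'innodb_ft_min_token_size';

-- MyISAM: words shorter than this are not indexed (default 4)
SHOW VARIABLES LIKE 'ft_min_word_len';

-- InnoDB's default stopword list, to see whether a given word is actually on it
SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;
```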
Composite index
mysql> select * from Ticket where Match(`subject`,`Body`) against('hello');
+----------+--------+---------+-----------------------------------+----------------------------+
| TicketId | UserId | Subject | Body | DateCreated |
+----------+--------+---------+-----------------------------------+----------------------------+
| 1 | 3 | hello | This is the frist ticket created! | 2018-01-15 16:09:13.016000 |
+----------+--------+---------+-----------------------------------+----------------------------+
1 row in set (0.01 sec)
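Using boolean mode
In AGAINST we can specify boolean mode to express logical combinations: a word must be present, must be absent, must begin with a given prefix, and whether terms are required or optional. The available operators are listed in ft_boolean_syntax, where ft stands for fulltext (this variable applies to MyISAM; InnoDB has no equivalent setting).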
mysql> SHOW VARIABLES LIKE 'ft%';
+--------------------------+----------------+
| Variable_name | Value |
+--------------------------+----------------+
| ft_boolean_syntax | + -><()~*:""&| |
| ft_max_word_len | 84 |
| ft_min_word_len | 4 |
| ft_query_expansion_limit | 20 |
| ft_stopword_file | (built-in) |
+--------------------------+----------------+
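- + means the word must be present. For example, +apple matches only rows that contain apple. To also match words that begin with apple, such as apple123, append the wildcard: +apple*.
- (no operator) means the word is optional. For example, apple banana matches rows that contain apple or banana.
- - means the word must not be present. For example, +apple -banana matches rows that contain apple but not banana.
- > raises the word's contribution to the relevance score, so rows containing it rank higher.
- < lowers the word's contribution to the relevance score.
- ( ) groups words into subexpressions. For example, +aaa +(>bbb <ccc).
- ~ turns the word's contribution from positive to negative: a row containing the word ranks lower, but unlike - it is not excluded.
- * is the wildcard; it can only be appended to the end of a word.
- " " matches a phrase as a whole: a sentence enclosed in double quotes must match exactly and is not split into separate words.
Example: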
select * from TicketComment where match(`Body`) against('Test -two' in boolean mode);
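A few more boolean-mode queries against the same TicketComment table may help illustrate the operators. The search terms are illustrative, and the rows returned depend on the engine's stopword list and minimum token size.

```sql
-- Must contain "comment" and must not contain "five"
SELECT * FROM TicketComment
WHERE MATCH(`Body`) AGAINST('+comment -five' IN BOOLEAN MODE);

-- Prefix match: any word beginning with "com"
SELECT * FROM TicketComment
WHERE MATCH(`Body`) AGAINST('com*' IN BOOLEAN MODE);

-- Phrase match: the words must appear together in this order
SELECT * FROM TicketComment
WHERE MATCH(`Body`) AGAINST('"Comment Two"' IN BOOLEAN MODE);

-- Require "comment" plus either "five" or "six", ranking "five" higher than "six"
SELECT *, MATCH(`Body`) AGAINST('+comment +(>five <six)' IN BOOLEAN MODE) AS score
FROM TicketComment
WHERE MATCH(`Body`) AGAINST('+comment +(>five <six)' IN BOOLEAN MODE)
ORDER BY score DESC;
```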