Tuning BM25
One of the nice features of BM25 is that, unlike TF/IDF, it has two parameters that allow it to be tuned:
k1
This parameter controls how quickly an increase in term frequency results in term-frequency saturation. The default value is 1.2. Lower values result in quicker saturation, and higher values in slower saturation.
b
This parameter controls how much effect field-length normalization should have. A value of 0.0 disables normalization completely, and a value of 1.0 normalizes fully. The default is 0.75.
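To make the effect of the two parameters concrete, here is a minimal sketch of the textbook BM25 term-frequency component, with the IDF part omitted. The function name bm25_tf and the field lengths are assumptions chosen for illustration; Lucene's actual implementation differs in details such as how field lengths are encoded, so treat the numbers as indicative only.

```python
def bm25_tf(tf, k1=1.2, b=0.75, field_len=100, avg_field_len=100):
    """Textbook BM25 term-frequency component (IDF omitted).

    bm25_tf is a hypothetical helper for illustration, not an
    Elasticsearch or Lucene API.
    """
    # Length normalization: 1.0 when b == 0 or the field has average length.
    norm = 1.0 - b + b * (field_len / avg_field_len)
    # The result approaches k1 + 1 as tf grows; a lower k1 gets there sooner.
    return tf * (k1 + 1.0) / (tf + k1 * norm)


if __name__ == "__main__":
    # Effect of k1: fraction of the maximum score reached at each frequency.
    for k1 in (0.5, 1.2, 3.0):
        row = [round(bm25_tf(tf, k1=k1) / (k1 + 1.0), 2) for tf in (1, 2, 5, 10, 50)]
        print(f"k1={k1}: {row}")

    # Effect of b: the same frequency in a short field and in a long field.
    for b in (0.0, 0.75, 1.0):
        short = bm25_tf(2, b=b, field_len=50)
        long_ = bm25_tf(2, b=b, field_len=400)
        print(f"b={b}: short={short:.3f} long={long_:.3f}")
```

With b set to 0.0 the short and long fields score identically, while higher values of b increasingly favor the short field.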
The practicalities of tuning BM25 are another matter. The default values for k1 and b should be suitable for most document collections, but the optimal values really depend on the collection. Finding good values for your collection is a matter of adjusting, checking, and adjusting again.
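The adjust-and-check loop can be sketched offline as a small grid search. Everything here is an assumption made for illustration: the toy documents, the single relevance judgment, the candidate values, and the choice of reciprocal rank as the metric. In practice you would evaluate real queries against your own index rather than this simplified scoring function.

```python
from itertools import product


def bm25_tf(tf, field_len, avg_len, k1, b):
    """Textbook BM25 term-frequency component (IDF omitted)."""
    norm = 1.0 - b + b * (field_len / avg_len)
    return tf * (k1 + 1.0) / (tf + k1 * norm)


# Toy data, invented for illustration:
# (document id, frequency of the query term, field length in terms)
DOCS = [
    ("short_relevant", 2, 20),
    ("long_repetitive", 6, 400),
    ("medium", 1, 60),
]
RELEVANT = "short_relevant"  # hypothetical relevance judgment


def reciprocal_rank(k1, b):
    """Return 1 / rank of the relevant document for one (k1, b) pair."""
    avg_len = sum(length for _, _, length in DOCS) / len(DOCS)
    ranked = sorted(
        DOCS,
        key=lambda doc: bm25_tf(doc[1], doc[2], avg_len, k1, b),
        reverse=True,
    )
    position = [doc_id for doc_id, _, _ in ranked].index(RELEVANT) + 1
    return 1.0 / position


def grid_search(k1_values=(0.5, 1.2, 2.0), b_values=(0.0, 0.5, 0.75, 1.0)):
    """Adjust, check, adjust again: try each pair and keep the best one."""
    return max(product(k1_values, b_values), key=lambda pair: reciprocal_rank(*pair))


if __name__ == "__main__":
    best_k1, best_b = grid_search()
    print(f"best k1={best_k1}, b={best_b}")
```

A single judged query is far too little evidence for a real decision; the point is only the shape of the loop, which stays the same when the metric is averaged over many judged queries.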
The similarity algorithm can be set on a per-field basis. It’s just a matter of specifying the chosen algorithm in the field’s mapping:
PUT /my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "string",
          "similarity": "BM25"
        },
        "body": {
          "type": "string",
          "similarity": "default"
        }
      }
    }
  }
}
The title field uses BM25 similarity.
The body field uses the default similarity.
Currently, it is not possible to change the similarity mapping for an existing field. You would need to reindex your data in order to do that.
Configuring BM25
Configuring a similarity is much like configuring an analyzer. Custom similarities can be specified when creating an index. For instance:
PUT /my_index
{
  "settings": {
    "similarity": {
      "my_bm25": {
        "type": "BM25",
        "b": 0
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "string",
          "similarity": "my_bm25"
        },
        "body": {
          "type": "string",
          "similarity": "BM25"
        }
      }
    }
  }
}
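When trying several parameter combinations, it can help to generate the request body instead of editing it by hand. The sketch below builds a body of the same shape as the example above, setting both k1 and b. The helper name build_index_body, the similarity name tuned_bm25, and the values 1.6 and 0.5 are assumptions for illustration; the mapping syntax follows the example above and may need adapting to your Elasticsearch version.

```python
import json


def build_index_body(k1=1.2, b=0.75, similarity_name="tuned_bm25", field="title"):
    """Build an index-creation body with a custom BM25 similarity.

    Hypothetical helper: it only assembles the JSON document and does
    not talk to Elasticsearch.
    """
    if k1 < 0:
        raise ValueError("k1 must be non-negative")
    if not 0.0 <= b <= 1.0:
        raise ValueError("b must be between 0.0 and 1.0")
    return {
        "settings": {
            "similarity": {
                similarity_name: {"type": "BM25", "k1": k1, "b": b}
            }
        },
        "mappings": {
            "doc": {
                "properties": {
                    field: {"type": "string", "similarity": similarity_name}
                }
            }
        },
    }


if __name__ == "__main__":
    # Print the body to send with PUT /my_index.
    print(json.dumps(build_index_body(k1=1.6, b=0.5), indent=2))
```

Because the similarity of an existing field cannot be changed, each combination you want to compare needs its own index, created with a body like this and then populated with your data.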
Reference: https://www.elastic.co/guide/en/elasticsearch/guide/current/changing-similarities.html