Lucene-13-suggest: Lucene spell checking and similar-term suggestions (spellcheck)
Where suggest is useful

User input is unpredictable, yet when we write applications we usually want searches to run against known terms or terms in a known format, which means intervening in the search conditions the user types. When using search engines such as Baidu or Google, you often see suggestions appear as soon as a key is released, hinting at related queries you may have intended. Lucene anticipated this need during development: the suggest package it provides solves exactly these problems.
Package overview

The suggest package provides Lucene's support for auto-completion and spell checking:

- spell-checking classes live in org.apache.lucene.search.spell;
- suggestion (lookup) classes live in org.apache.lucene.search.suggest;
- analyzer-based suggestion classes live in org.apache.lucene.search.suggest.analyzing.
Getting started

A test example:
```java
package com.github.houbb.lucene.learn.chap06;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.search.suggest.FileDictionary;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.wltea.analyzer.lucene.IKAnalyzer;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Paths;

/**
 * @author binbin.hou
 * @since 1.0.0
 */
public class SuggestDemo {

    public static void main(String[] args) {
        try {
            // directory that holds the spell-check index
            Directory spellIndexDirectory = FSDirectory.open(Paths.get("suggest"));
            SpellChecker spellchecker = new SpellChecker(spellIndexDirectory);

            Analyzer analyzer = new IKAnalyzer(true);
            IndexWriterConfig config = new IndexWriterConfig(analyzer);
            config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);

            // accept every candidate; the default accuracy is 0.5
            spellchecker.setAccuracy(0f);

            //HighFrequencyDictionary dire = new HighFrequencyDictionary(reader, field, thresh)
            // index the dictionary file, one "term<TAB>weight" entry per line
            String suggestPath = "D:\\gitee2\\lucene-learn\\src\\main\\resources\\suggest.txt";
            spellchecker.indexDictionary(new FileDictionary(new FileInputStream(new File(suggestPath))), config, false);

            String[] similars = spellchecker.suggestSimilar("中国", 10);
            for (String similar : similars) {
                System.out.println(similar);
            }

            // close the checker before the directory it depends on
            spellchecker.close();
            spellIndexDirectory.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}
```
- suggest.txt

The test file contains the following entries (term and weight separated by a TAB):

```
中国人民	100
奔驰3	101
奔驰中国	102
奔驰S级	103
奔驰A级	104
奔驰C级	105
```
The corresponding output is:

```
中国人民
奔驰中国
```

Only the two entries that share n-grams with the query 中国 come back as suggestions.
How spell checking works

Lucene's spell checking is provided by org.apache.lucene.search.spell.SpellChecker.

SpellChecker uses a default accuracy of 0.5; if you need finer-grained control, set it with setAccuracy(float accuracy).

SpellChecker builds its own index over words taken from an external source. The available sources are:

- DocumentDictionary: reads the values of a given field from documents;
- FileDictionary: backed by a text file, one entry per line, the term and its weight separated by a TAB ("\t"); an entry may not contain more than two separators;
- HighFrequencyDictionary: reads the values of a term from an existing index, filtered by how often they occur;
- LuceneDictionary: also reads the values of a term from an existing index, but without checking frequency;
- PlainTextDictionary: reads words from a text file, one per line, with no separators.
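To make the two file-based formats concrete, here is a tiny parser sketch for FileDictionary-style lines. DictionaryFormats is our own illustrative class, not part of Lucene, and the weight default of "1" is an assumption for the sketch; a PlainTextDictionary line would simply be the whole word with no splitting.

```java
public class DictionaryFormats {

    // Parse one FileDictionary-style line: "term" or "term<TAB>weight".
    // Returns {term, weight}; for illustration, weight defaults to "1" when absent.
    static String[] parseFileDictionaryLine(String line) {
        String[] fields = line.split("\t");
        String term = fields[0];
        String weight = fields.length > 1 ? fields[1] : "1";
        return new String[] { term, weight };
    }

    public static void main(String[] args) {
        String[] entry = parseFileDictionaryLine("奔驰中国\t102");
        System.out.println(entry[0] + " / " + entry[1]); // prints: 奔驰中国 / 102
    }
}
```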
The indexing process works as follows:

1) the whole operation is synchronized (on an internal lock);
2) it checks whether the SpellChecker has already been closed; if so, it throws an exception with the message "Spellchecker has been closed";
3) it iterates over the words from the external source, skipping any word shorter than three characters; each remaining word is turned into a Document and indexed into the local spell index, and while building that document the word is split into detailed n-grams (the addGram method). The execution flow is shown below.
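The n-gram splitting in step 3 can be sketched in isolation. The helper below mimics what SpellChecker does internally with formGrams, together with the gram-size bounds getMin/getMax as found in the SpellChecker source; GramDemo is our own class name, a sketch rather than the real implementation.

```java
public class GramDemo {

    // gram-size lower bound, as in SpellChecker#getMin
    static int getMin(int l) {
        if (l > 5) return 3;
        if (l == 5) return 2;
        return 1;
    }

    // gram-size upper bound, as in SpellChecker#getMax
    static int getMax(int l) {
        if (l > 5) return 4;
        if (l == 5) return 3;
        return 2;
    }

    // all contiguous substrings of length ng, as in SpellChecker#formGrams
    static String[] formGrams(String text, int ng) {
        int len = text.length() - ng + 1;
        String[] grams = new String[len];
        for (int i = 0; i < len; i++) {
            grams[i] = text.substring(i, i + ng);
        }
        return grams;
    }

    public static void main(String[] args) {
        String word = "lucene"; // length 6, so grams of sizes 3..4 are indexed
        for (int ng = getMin(word.length()); ng <= getMax(word.length()); ng++) {
            System.out.println(ng + " -> " + String.join(",", formGrams(word, ng)));
        }
        // prints:
        // 3 -> luc,uce,cen,ene
        // 4 -> luce,ucen,cene
    }
}
```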
The indexDictionary source:

```java
/**
 * Indexes the data from the given {@link Dictionary}.
 * @param dict Dictionary to index
 * @param config {@link IndexWriterConfig} to use
 * @param fullMerge whether or not the spellcheck index should be fully merged
 * @throws AlreadyClosedException if the Spellchecker is already closed
 * @throws IOException If there is a low-level I/O error.
 */
public final void indexDictionary(Dictionary dict, IndexWriterConfig config, boolean fullMerge) throws IOException {
  synchronized (modifyCurrentIndexLock) {
    ensureOpen();
    final Directory dir = this.spellIndex;
    final IndexWriter writer = new IndexWriter(dir, config);
    IndexSearcher indexSearcher = obtainSearcher();
    final List<TermsEnum> termsEnums = new ArrayList<>();

    final IndexReader reader = searcher.getIndexReader();
    if (reader.maxDoc() > 0) {
      for (final LeafReaderContext ctx : reader.leaves()) {
        Terms terms = ctx.reader().terms(F_WORD);
        if (terms != null)
          termsEnums.add(terms.iterator());
      }
    }

    boolean isEmpty = termsEnums.isEmpty();

    try {
      BytesRefIterator iter = dict.getEntryIterator();
      BytesRef currentTerm;

      terms: while ((currentTerm = iter.next()) != null) {

        String word = currentTerm.utf8ToString();
        int len = word.length();
        if (len < 3) {
          continue; // too short we bail but "too long" is fine...
        }

        if (!isEmpty) {
          for (TermsEnum te : termsEnums) {
            if (te.seekExact(currentTerm)) {
              continue terms;
            }
          }
        }

        // ok index the word
        Document doc = createDocument(word, getMin(len), getMax(len));
        writer.addDocument(doc);
      }
    } finally {
      releaseSearcher(indexSearcher);
    }
    if (fullMerge) {
      writer.forceMerge(1);
    }
    // close writer
    writer.close();
    // re-open the spell index to see our own changes when the next suggestion is fetched
    swapSearcher(dir);
  }
}
```

The suggestSimilar source:
```java
/**
 * Suggest similar words (optionally restricted to a field of an index).
 *
 * As the Lucene similarity that is used to fetch the most relevant n-grammed terms
 * is not the same as the edit distance strategy used to calculate the best
 * matching spell-checked word from the hits that Lucene found, one usually has
 * to retrieve a couple of numSug's in order to get the true best match.
 *
 * I.e. if numSug == 1, don't count on that suggestion being the best one.
 * Thus, you should set this value to at least 5 for a good suggestion.
 *
 * @param word the word you want a spell check done on
 * @param numSug the number of suggested words
 * @param ir the indexReader of the user index (can be null see field param)
 * @param field the field of the user index: if field is not null, the suggested
 * words are restricted to the words present in this field.
 * @param suggestMode
 * (NOTE: if indexReader==null and/or field==null, then this is overridden with SuggestMode.SUGGEST_ALWAYS)
 * @param accuracy The minimum score a suggestion must have in order to qualify for inclusion in the results
 * @throws IOException if the underlying index throws an {@link IOException}
 * @throws AlreadyClosedException if the Spellchecker is already closed
 * @return String[] the sorted list of the suggest words with these 2 criteria:
 * first criteria: the edit distance, second criteria (only if restricted mode): the popularity
 * of the suggest words in the field of the user index
 */
public String[] suggestSimilar(String word, int numSug, IndexReader ir,
    String field, SuggestMode suggestMode, float accuracy) throws IOException {
  // obtainSearcher calls ensureOpen
  final IndexSearcher indexSearcher = obtainSearcher();
  try {
    if (ir == null || field == null) {
      suggestMode = SuggestMode.SUGGEST_ALWAYS;
    }
    if (suggestMode == SuggestMode.SUGGEST_ALWAYS) {
      ir = null;
      field = null;
    }

    final int lengthWord = word.length();

    final int freq = (ir != null && field != null) ? ir.docFreq(new Term(field, word)) : 0;
    final int goalFreq = suggestMode == SuggestMode.SUGGEST_MORE_POPULAR ? freq : 0;
    // if the word exists in the real index and we don't care for word frequency, return the word itself
    if (suggestMode == SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX && freq > 0) {
      return new String[] { word };
    }

    BooleanQuery.Builder query = new BooleanQuery.Builder();
    String[] grams;
    String key;

    for (int ng = getMin(lengthWord); ng <= getMax(lengthWord); ng++) {

      key = "gram" + ng; // form key

      grams = formGrams(word, ng); // form word into ngrams (allow dups too)

      if (grams.length == 0) {
        continue; // hmm
      }

      if (bStart > 0) { // should we boost prefixes?
        add(query, "start" + ng, grams[0], bStart); // matches start of word
      }
      if (bEnd > 0) { // should we boost suffixes
        add(query, "end" + ng, grams[grams.length - 1], bEnd); // matches end of word
      }
      for (int i = 0; i < grams.length; i++) {
        add(query, key, grams[i]);
      }
    }

    int maxHits = 10 * numSug;

    ScoreDoc[] hits = indexSearcher.search(query.build(), maxHits).scoreDocs;

    SuggestWordQueue sugQueue = new SuggestWordQueue(numSug, comparator);

    // go thru more than 'maxr' matches in case the distance filter triggers
    int stop = Math.min(hits.length, maxHits);
    SuggestWord sugWord = new SuggestWord();
    for (int i = 0; i < stop; i++) {

      sugWord.string = indexSearcher.doc(hits[i].doc).get(F_WORD); // get orig word

      // don't suggest a word for itself, that would be silly
      if (sugWord.string.equals(word)) {
        continue;
      }

      // edit distance
      sugWord.score = sd.getDistance(word, sugWord.string);
      if (sugWord.score < accuracy) {
        continue;
      }

      if (ir != null && field != null) { // use the user index
        sugWord.freq = ir.docFreq(new Term(field, sugWord.string)); // freq in the index
        // don't suggest a word that is not present in the field
        if ((suggestMode == SuggestMode.SUGGEST_MORE_POPULAR && goalFreq > sugWord.freq) || sugWord.freq < 1) {
          continue;
        }
      }
      sugQueue.insertWithOverflow(sugWord);
      if (sugQueue.size() == numSug) {
        // if queue full, maintain the minScore score
        accuracy = sugQueue.top().score;
      }
      sugWord = new SuggestWord();
    }

    // convert to array string
    String[] list = new String[sugQueue.size()];
    for (int i = sugQueue.size() - 1; i >= 0; i--) {
      list[i] = sugQueue.pop().string;
    }

    return list;
  } finally {
    releaseSearcher(indexSearcher);
  }
}
```
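In the loop above, sd.getDistance(word, sugWord.string) turns an edit distance into a similarity score in [0,1] that is compared against accuracy; the default StringDistance in SpellChecker is LevensteinDistance. Below is a self-contained approximation of that scoring: plain Levenshtein edit distance normalized by the longer string's length. LevenshteinScore is our own class name, a sketch rather than Lucene's implementation.

```java
public class LevenshteinScore {

    // plain dynamic-programming edit distance
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // normalized similarity in [0,1], in the spirit of LevensteinDistance#getDistance
    static float score(String a, String b) {
        return 1.0f - (float) editDistance(a, b) / Math.max(a.length(), b.length());
    }

    public static void main(String[] args) {
        // edit distance 3 over max length 7 gives a score of about 0.571,
        // so with the default accuracy of 0.5 this pair would still qualify
        System.out.println(score("kitten", "sitting"));
    }
}
```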