Commonly Used IndexSearcher Methods

Retrieving Document Data

IndexSearcher.doc(int docID) retrieves the stored fields of the document with the given docID from the index. The result is a Document, from which you can read any field that was stored with Field.Store.YES.

IndexSearcher.doc(int docID, StoredFieldVisitor fieldVisitor) loads only the stored fields selected by the given StoredFieldVisitor. A visitor can be constructed like this:

  [java]
StoredFieldVisitor visitor = new DocumentStoredFieldVisitor(String... fields);

IndexSearcher.doc(int docID, Set<String> fieldsToLoad) is equivalent to IndexSearcher.doc(int docID, StoredFieldVisitor fieldVisitor) above; its implementation is shown below:

  [java]
public Document doc(int docID, Set<String> fieldsToLoad) throws IOException {
  return reader.document(docID, fieldsToLoad);
}

/**
 * Like {@link #document(int)} but only loads the specified
 * fields.  Note that this is simply sugar for {@link
 * DocumentStoredFieldVisitor#DocumentStoredFieldVisitor(Set)}.
 */
public final Document document(int docID, Set<String> fieldsToLoad) throws IOException {
  final DocumentStoredFieldVisitor visitor = new DocumentStoredFieldVisitor(fieldsToLoad);
  document(docID, visitor);
  return visitor.getDocument();
}
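The three doc() overloads above can be exercised side by side. The sketch below is a minimal, self-contained example, assuming a Lucene 5.x/6.x classpath matching the quoted source; the field names content and path and their values are invented for illustration:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

import java.util.Collections;

public class DocLoadDemo {
    public static void main(String[] args) throws Exception {
        // build a tiny in-memory index with one document
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
        Document d = new Document();
        d.add(new TextField("content", "lucene in action", Field.Store.YES));
        d.add(new StringField("path", "/tmp/a.txt", Field.Store.YES));
        writer.addDocument(d);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        int docID = searcher.search(new TermQuery(new Term("content", "lucene")), 1).scoreDocs[0].doc;

        // 1) doc(int): all Field.Store.YES fields
        Document full = searcher.doc(docID);

        // 2) doc(int, StoredFieldVisitor): only the fields the visitor selects
        DocumentStoredFieldVisitor visitor = new DocumentStoredFieldVisitor("path");
        searcher.doc(docID, visitor);
        Document pathOnly = visitor.getDocument();

        // 3) doc(int, Set<String>): sugar for the visitor variant
        Document pathOnly2 = searcher.doc(docID, Collections.singleton("path"));

        System.out.println(full.get("content"));     // lucene in action
        System.out.println(pathOnly.get("content")); // null - field was not requested
        System.out.println(pathOnly2.get("path"));   // /tmp/a.txt
    }
}
```

Note that the unrequested content field comes back null from the restricted loads, which is the point of the visitor-based overloads: skipping stored fields you do not need avoids deserializing them.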

Search Methods

IndexSearcher.search(Query query, int n) returns the top n hits matching the query.

IndexSearcher.search(Query query, Collector results) feeds every hit matching the query through the given Collector; the Collector decides how hits are gathered (for example, for paging).

IndexSearcher.search(Query query, int n, Sort sort, boolean doDocScores, boolean doMaxScore) returns the top n hits matching the query under an arbitrary Sort, with the two flags controlling whether hit scores and the maximum score are computed.
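The three overloads above can be sketched against a tiny in-memory index (again assuming a Lucene 5.x/6.x classpath; the field name content and the sample texts are made up, and TotalHitCountCollector merely stands in for any custom Collector):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.RAMDirectory;

public class SearchOverloadsDemo {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
        for (String text : new String[] {"lucene in action", "lucene for search", "unrelated text"}) {
            Document d = new Document();
            d.add(new TextField("content", text, Field.Store.YES));
            writer.addDocument(d);
        }
        writer.close();

        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        Query query = new TermQuery(new Term("content", "lucene"));

        // top-n search
        TopDocs top = searcher.search(query, 10);

        // Collector-based search: here a plain hit counter
        TotalHitCountCollector counter = new TotalHitCountCollector();
        searcher.search(query, counter);

        // sorted search; with doDocScores=true the per-hit scores are filled in
        TopFieldDocs sorted = searcher.search(query, 10, Sort.RELEVANCE, true, true);

        System.out.println(top.totalHits);           // 2
        System.out.println(counter.getTotalHits());  // 2
        System.out.println(sorted.scoreDocs.length); // 2
    }
}
```

Sorting on a real field (rather than Sort.RELEVANCE) would additionally require indexing that field with doc values in these Lucene versions.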

IndexSearcher.search(Query query, CollectorManager<C, T> collectorManager) collects the results matching the query with the given CollectorManager. Its execution flow is:

First it checks whether an ExecutorService is available to run the search; without an executor, the IndexSearcher performs the whole search as a single task.

If the IndexSearcher has an executor, each thread handles a slice of the index, and a Future-based mechanism appends results to the result set as they are read; this asynchronous processing improves throughput. The implementation:

  [java]
/**
 * Lower-level search API.
 * Search all leaves using the given {@link CollectorManager}. In contrast
 * to {@link #search(Query, Collector)}, this method will use the searcher's
 * {@link ExecutorService} in order to parallelize execution of the collection
 * on the configured {@link #leafSlices}.
 * @see CollectorManager
 * @lucene.experimental
 */
public <C extends Collector, T> T search(Query query, CollectorManager<C, T> collectorManager) throws IOException {
  if (executor == null) {
    final C collector = collectorManager.newCollector();
    search(query, collector);
    return collectorManager.reduce(Collections.singletonList(collector));
  } else {
    final List<C> collectors = new ArrayList<>(leafSlices.length);
    boolean needsScores = false;
    for (int i = 0; i < leafSlices.length; ++i) {
      final C collector = collectorManager.newCollector();
      collectors.add(collector);
      needsScores |= collector.needsScores();
    }

    final Weight weight = createNormalizedWeight(query, needsScores);
    final List<Future<C>> topDocsFutures = new ArrayList<>(leafSlices.length);
    for (int i = 0; i < leafSlices.length; ++i) {
      final LeafReaderContext[] leaves = leafSlices[i].leaves;
      final C collector = collectors.get(i);
      topDocsFutures.add(executor.submit(new Callable<C>() {
        @Override
        public C call() throws Exception {
          search(Arrays.asList(leaves), weight, collector);
          return collector;
        }
      }));
    }

    final List<C> collectedCollectors = new ArrayList<>();
    // gather the results of the asynchronous tasks via their Futures
    for (Future<C> future : topDocsFutures) {
      try {
        collectedCollectors.add(future.get());
      } catch (InterruptedException e) {
        throw new ThreadInterruptedException(e);
      } catch (ExecutionException e) {
        throw new RuntimeException(e);
      }
    }

    return collectorManager.reduce(collectors);
  }
}
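A CollectorManager supplies one Collector per slice (newCollector) and merges the per-slice results (reduce). The sketch below implements a hypothetical hit-counting manager against a tiny in-memory index, assuming a Lucene 5.x/6.x classpath as in the quoted source; with an executor set, the parallel Future-based path above is exercised:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.RAMDirectory;

import java.util.Collection;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CollectorManagerDemo {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
        for (String text : new String[] {"lucene rocks", "lucene rolls", "nothing here"}) {
            Document d = new Document();
            d.add(new TextField("content", text, Field.Store.YES));
            writer.addDocument(d);
        }
        writer.close();

        // with an executor, each leaf slice is searched on its own thread
        ExecutorService executor = Executors.newFixedThreadPool(2);
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir), executor);

        CollectorManager<TotalHitCountCollector, Integer> manager =
                new CollectorManager<TotalHitCountCollector, Integer>() {
                    @Override
                    public TotalHitCountCollector newCollector() {
                        // one independent collector per slice
                        return new TotalHitCountCollector();
                    }
                    @Override
                    public Integer reduce(Collection<TotalHitCountCollector> collectors) {
                        // merge: sum the per-slice hit counts
                        int total = 0;
                        for (TotalHitCountCollector c : collectors) {
                            total += c.getTotalHits();
                        }
                        return total;
                    }
                };

        int hits = searcher.search(new TermQuery(new Term("content", "lucene")), manager);
        System.out.println(hits); // 2
        executor.shutdown();
    }
}
```

Because each slice gets its own Collector instance, the per-slice collectors never need synchronization; all cross-thread merging happens once, in reduce.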

Counting and Querying After a Given Hit

IndexSearcher.count(Query query) returns the number of documents matching the query.

IndexSearcher.searchAfter(final ScoreDoc after, Query query, int numHits) returns up to numHits hits matching the query that rank after the given after hit.

Its implementation works as follows:

First it reads the maximum number of documents in the index (limit), then checks whether after is non-null and whether its doc id exceeds limit; if it does, an IllegalArgumentException is thrown.

The number of hits to fetch is capped at the smaller of numHits and limit (requesting more hits than there are documents would otherwise cause an error).

Next it creates a CollectorManager that specifies how many TopDocs to return and the last hit of the previous page (after), and that merges the per-collector results.

Finally it calls search(query, manager) to run the query:

  [java]
/** Finds the top <code>n</code>
 * hits for <code>query</code> where all results are after a previous
 * result (<code>after</code>).
 * <p>
 * By passing the bottom result from a previous page as <code>after</code>,
 * this method can be used for efficient 'deep-paging' across potentially
 * large result sets.
 *
 * @throws BooleanQuery.TooManyClauses If a query would exceed
 *         {@link BooleanQuery#getMaxClauseCount()} clauses.
 */
public TopDocs searchAfter(ScoreDoc after, Query query, int numHits) throws IOException {
  // range check
  final int limit = Math.max(1, reader.maxDoc());
  if (after != null && after.doc >= limit) {
    throw new IllegalArgumentException("after.doc exceeds the number of documents in the reader: after.doc="
        + after.doc + " limit=" + limit);
  }
  // avoid requesting more hits than there are documents
  final int cappedNumHits = Math.min(numHits, limit);

  final CollectorManager<TopScoreDocCollector, TopDocs> manager = new CollectorManager<TopScoreDocCollector, TopDocs>() {

    @Override
    public TopScoreDocCollector newCollector() throws IOException {
      // page size plus the last element of the previous page
      return TopScoreDocCollector.create(cappedNumHits, after);
    }

    @Override
    public TopDocs reduce(Collection<TopScoreDocCollector> collectors) throws IOException {
      final TopDocs[] topDocs = new TopDocs[collectors.size()];
      int i = 0;
      for (TopScoreDocCollector collector : collectors) {
        topDocs[i++] = collector.topDocs();
      }
      // analyze and merge the per-collector results
      return TopDocs.merge(0, cappedNumHits, topDocs, true);
    }

  };
  // run the query
  return search(query, manager);
}
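Put together, count and searchAfter give a simple deep-paging loop: pass the last hit of each page as the after anchor of the next. A minimal sketch, again assuming a Lucene 5.x/6.x classpath and an invented content field:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.RAMDirectory;

public class SearchAfterDemo {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
        for (int i = 0; i < 5; i++) {
            Document d = new Document();
            d.add(new TextField("content", "lucene doc " + i, Field.Store.YES));
            writer.addDocument(d);
        }
        writer.close();

        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        Query query = new TermQuery(new Term("content", "lucene"));

        System.out.println(searcher.count(query)); // 5

        // deep paging: feed the last hit of the previous page back in as 'after'
        int pageSize = 2;
        int fetched = 0;
        ScoreDoc after = null;
        while (true) {
            TopDocs page = searcher.searchAfter(after, query, pageSize);
            if (page.scoreDocs.length == 0) {
                break; // past the last page
            }
            fetched += page.scoreDocs.length;
            after = page.scoreDocs[page.scoreDocs.length - 1];
        }
        System.out.println(fetched); // 5 - pages of 2, 2, 1
    }
}
```

Unlike fetching page n by asking for n*pageSize hits and discarding the prefix, this keeps only pageSize hits in the collector per request, which is what makes deep paging cheap.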

Source Code

A Paging Query Utility Class

  [java]
package com.github.houbb.lucene.learn.chap05;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

import java.io.File;
import java.io.IOException;
import java.nio.file.Paths;
import java.util.Set;
import java.util.concurrent.ExecutorService;

/**
 * @author binbin.hou
 * @since 1.0.0
 */
public class SearcherPageUtil {

    /**
     * Get an IndexSearcher over every index directory below a parent path
     * @param parentPath parent directory holding the index directories
     * @param service executor used to parallelize searches
     * @return
     * @throws IOException
     */
    public static IndexSearcher getIndexSearcherByParentPath(String parentPath, ExecutorService service) throws IOException {
        MultiReader reader = null;
        try {
            File[] files = new File(parentPath).listFiles();
            IndexReader[] readers = new IndexReader[files.length];
            for (int i = 0; i < files.length; i++) {
                readers[i] = DirectoryReader.open(FSDirectory.open(Paths.get(files[i].getPath())));
            }
            reader = new MultiReader(readers);
        } catch (IOException e) {
            e.printStackTrace();
        }
        return new IndexSearcher(reader, service);
    }

    /**
     * Get a DirectoryReader for the given index path
     * @param indexPath
     * @return
     * @throws IOException
     */
    public static DirectoryReader getIndexReader(String indexPath) throws IOException {
        return DirectoryReader.open(FSDirectory.open(Paths.get(indexPath)));
    }

    /**
     * Get an IndexSearcher for the given index path
     * @param indexPath
     * @param service
     * @return
     * @throws IOException
     */
    public static IndexSearcher getIndexSearcherByIndexPath(String indexPath, ExecutorService service) throws IOException {
        IndexReader reader = getIndexReader(indexPath);
        return new IndexSearcher(reader, service);
    }

    /**
     * If the index directory may have changed, use this method to obtain a fresh
     * IndexSearcher; reopening this way consumes fewer resources than opening from scratch
     * @param oldSearcher
     * @param service
     * @return
     * @throws IOException
     */
    public static IndexSearcher getIndexSearcherOpenIfChanged(IndexSearcher oldSearcher, ExecutorService service) throws IOException {
        DirectoryReader reader = (DirectoryReader) oldSearcher.getIndexReader();
        DirectoryReader newReader = DirectoryReader.openIfChanged(reader);
        // openIfChanged returns null when nothing changed; keep the old searcher then
        if (newReader == null) {
            return oldSearcher;
        }
        return new IndexSearcher(newReader, service);
    }

    /**
     * Get the full stored document for a docID
     * @param searcher
     * @param docID
     * @return
     * @throws IOException
     */
    public static Document getDefaultFullDocument(IndexSearcher searcher, int docID) throws IOException {
        return searcher.doc(docID);
    }

    /**
     * Load only the listed stored fields of the document with the given docID
     * @param searcher
     * @param docID
     * @param listField
     * @return
     * @throws IOException
     */
    public static Document getDocumentByListField(IndexSearcher searcher, int docID, Set<String> listField) throws IOException {
        return searcher.doc(docID, listField);
    }

    /**
     * Paged query
     * @param page current page number (1-based)
     * @param perPage hits per page
     * @param searcher the searcher to query with
     * @param query the query condition
     * @return
     * @throws IOException
     */
    public static TopDocs getScoreDocsByPerPage(int page, int perPage, IndexSearcher searcher, Query query) throws IOException {
        TopDocs result = null;
        if (query == null) {
            System.out.println(" Query is null return null ");
            return null;
        }
        ScoreDoc before = null;
        if (page != 1) {
            TopDocs docsBefore = searcher.search(query, (page - 1) * perPage);
            ScoreDoc[] scoreDocs = docsBefore.scoreDocs;
            if (scoreDocs.length > 0) {
                before = scoreDocs[scoreDocs.length - 1];
            }
        }
        result = searcher.searchAfter(before, query, perPage);
        return result;
    }

    public static TopDocs getScoreDocs(IndexSearcher searcher, Query query) throws IOException {
        TopDocs docs = searcher.search(query, getMaxDocId(searcher));
        return docs;
    }

    /**
     * Count of documents in the index; equivalent to counting a MatchAllDocsQuery
     * @param searcher
     * @return
     */
    public static int getMaxDocId(IndexSearcher searcher) {
        return searcher.getIndexReader().maxDoc();
    }
}

Test Code

  [java]
package com.github.houbb.lucene.learn.chap05;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * @author binbin.hou
 * @since 1.0.0
 */
public class SearcherPageUtilTest {

    public static void main(String[] args) {
        ExecutorService service = Executors.newCachedThreadPool();
        try {
            IndexSearcher searcher = SearcherPageUtil.getIndexSearcherByParentPath("index", service);
            System.out.println(SearcherPageUtil.getMaxDocId(searcher));

            Term term = new Term("content", "lucene");
            Query query = new TermQuery(term);
            TopDocs docs = SearcherPageUtil.getScoreDocsByPerPage(2, 20, searcher, query);
            ScoreDoc[] scoreDocs = docs.scoreDocs;
            System.out.println("Total hits: " + docs.totalHits);
            System.out.println("Hits on this page: " + scoreDocs.length);
            for (ScoreDoc scoreDoc : scoreDocs) {
                Document doc = SearcherPageUtil.getDefaultFullDocument(searcher, scoreDoc.doc);
                //System.out.println(doc);
            }
            System.out.println("\n\n");

            TopDocs docsAll = SearcherPageUtil.getScoreDocs(searcher, query);
            Set<String> fieldSet = new HashSet<String>();
            fieldSet.add("path");
            fieldSet.add("modified");
            // guard against the index holding fewer than 20 hits
            for (int i = 0; i < Math.min(20, docsAll.scoreDocs.length); i++) {
                Document doc = SearcherPageUtil.getDocumentByListField(searcher, docsAll.scoreDocs[i].doc, fieldSet);
                System.out.println(doc);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            service.shutdownNow();
        }
    }
}

References

一步一步跟我学习lucene(8)—lucene搜索之索引的查询原理和查询工具类(支持分页)示例