IndexSearcher 常用方法
文档信息本身
IndexSearcher.doc(int docID)
获取索引文件中的第n个索引存储的相关字段,返回为Document类型,可以据此读取document中的Field.STORE.YES的字段;
IndexSearcher.doc(int docID, StoredFieldVisitor fieldVisitor)
获取StoredFieldVisitor指定的字段的document,StoredFieldVisitor定义如下
StoredFieldVisitor visitor = new DocumentStoredFieldVisitor(String... fields);
IndexSearcher.doc(int docID, Set<String> fieldsToLoad)
此方法同上边的 IndexSearcher.doc(int docID, StoredFieldVisitor fieldVisitor)
,其实现如下图
public Document doc(int docID, Set<String> fieldsToLoad) throws IOException {
return reader.document(docID, fieldsToLoad);
}
/**
* Like {@link #document(int)} but only loads the specified
* fields. Note that this is simply sugar for {@link
* DocumentStoredFieldVisitor#DocumentStoredFieldVisitor(Set)}.
*/
public final Document document(int docID, Set<String> fieldsToLoad)
throws IOException {
final DocumentStoredFieldVisitor visitor = new DocumentStoredFieldVisitor(
fieldsToLoad);
document(docID, visitor);
return visitor.getDocument();
}
查询方法
IndexSearcher.search(Query query, int n) 查询符合query条件的前n个记录
IndexSearcher.search(Query query, Collector results) 查询符合collector的记录,collector定义了分页等信息
IndexSearcher.search(Query query, int n,Sort sort, boolean doDocScores, boolean doMaxScore) 实现任意排序的查询,同时控制是否计算hit score和max score是否被计算在内,查询前n条符合query条件的document;
IndexSearcher.search(Query query, CollectorManager<C, T> collectorManager)
利用给定的collectorManager获取符合query条件的结果,其执行流程如下:
先判断是否有ExecutorService执行查询的任务,如果没有executor,IndexSearcher会在单个任务下进行查询操作;
如果IndexSearcher有executor,则会由每个线程控制一部分索引的读取,而且查询的过程中采用的是future机制,此种方式是边读边往结果集里边追加数据,这种异步的处理机制也提升了效率,其执行过程如下:
/**
* Lower-level search API.
* Search all leaves using the given {@link CollectorManager}. In contrast
* to {@link #search(Query, Collector)}, this method will use the searcher's
* {@link ExecutorService} in order to parallelize execution of the collection
* on the configured {@link #leafSlices}.
* @see CollectorManager
* @lucene.experimental
*/
public <C extends Collector, T> T search(Query query, CollectorManager<C, T> collectorManager) throws IOException {
if (executor == null) {
final C collector = collectorManager.newCollector();
search(query, collector);
return collectorManager.reduce(Collections.singletonList(collector));
} else {
final List<C> collectors = new ArrayList<>(leafSlices.length);
boolean needsScores = false;
for (int i = 0; i < leafSlices.length; ++i) {
final C collector = collectorManager.newCollector();
collectors.add(collector);
needsScores |= collector.needsScores();
}
final Weight weight = createNormalizedWeight(query, needsScores);
final List<Future<C>> topDocsFutures = new ArrayList<>(leafSlices.length);
for (int i = 0; i < leafSlices.length; ++i) {
final LeafReaderContext[] leaves = leafSlices[i].leaves;
final C collector = collectors.get(i);
topDocsFutures.add(executor.submit(new Callable<C>() {
@Override
public C call() throws Exception {
search(Arrays.asList(leaves), weight, collector);
return collector;
}
}));
}
final List<C> collectedCollectors = new ArrayList<>();
// 通过 future 获取异步线程执行的结果
for (Future<C> future : topDocsFutures) {
try {
collectedCollectors.add(future.get());
} catch (InterruptedException e) {
throw new ThreadInterruptedException(e);
} catch (ExecutionException e) {
throw new RuntimeException(e);
}
}
return collectorManager.reduce(collectors);
}
}
查询指定条数之后的信息
IndexSearcher.count(Query query)
统计符合query条件的document个数
IndexSearcher.searchAfter(final ScoreDoc after, Query query, int numHits)
此方法会返回符合query查询条件的且在after之后的numHits条记录;
其实现原理为:
先读取当前索引文件的最大数据条数limit,然后判断after是否为空和after对应的document的下标是否超出limit的限制,如果超出的话抛出非法的参数异常;
设置读取的条数为numHits和limit中最小的(因为有超出最大条数的可能,避免超出限制而造成的异常)
接下来创建一个CollectorManager类型的对象,该对象定义了要返回的TopDocs的个数,上一页的document的结尾(after),并且对查询结果进行分析合并
最后调用search(query,manager)来查询结果
/** Finds the top <code>n</code>
* hits for <code>query</code> where all results are after a previous
* result (<code>after</code>).
* <p>
* By passing the bottom result from a previous page as <code>after</code>,
* this method can be used for efficient 'deep-paging' across potentially
* large result sets.
*
* @throws BooleanQuery.TooManyClauses If a query would exceed
* {@link BooleanQuery#getMaxClauseCount()} clauses.
*/
public TopDocs searchAfter(ScoreDoc after, Query query, int numHits) throws IOException {
// 范围校验
final int limit = Math.max(1, reader.maxDoc());
if (after != null && after.doc >= limit) {
throw new IllegalArgumentException("after.doc exceeds the number of documents in the reader: after.doc="
+ after.doc + " limit=" + limit);
}
// 避免超出最大值
final int cappedNumHits = Math.min(numHits, limit);
final CollectorManager<TopScoreDocCollector, TopDocs> manager = new CollectorManager<TopScoreDocCollector, TopDocs>() {
@Override
public TopScoreDocCollector newCollector() throws IOException {
//设置分页数量和上一页的最后元素
return TopScoreDocCollector.create(cappedNumHits, after);
}
@Override
public TopDocs reduce(Collection<TopScoreDocCollector> collectors) throws IOException {
final TopDocs[] topDocs = new TopDocs[collectors.size()];
int i = 0;
for (TopScoreDocCollector collector : collectors) {
topDocs[i++] = collector.topDocs();
}
// 结果分析+merge
return TopDocs.merge(0, cappedNumHits, topDocs, true);
}
};
// 查询
return search(query, manager);
}
源码
分页查询工具类
package com.github.houbb.lucene.learn.chap05;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import java.io.File;
import java.io.IOException;
import java.nio.file.Paths;
import java.util.Set;
import java.util.concurrent.ExecutorService;
/**
* @author binbin.hou
* @since 1.0.0
*/
public class SearcherPageUtil {
/**获取IndexSearcher对象
* @param parentPath
* @param service
* @return
* @throws IOException
*/
public static IndexSearcher getIndexSearcherByParentPath(String parentPath, ExecutorService service) throws IOException{
MultiReader reader = null;
//设置
try {
File[] files = new File(parentPath).listFiles();
IndexReader[] readers = new IndexReader[files.length];
for (int i = 0 ; i < files.length ; i ++) {
readers[i] = DirectoryReader.open(FSDirectory.open(Paths.get(files[i].getPath(), new String[0])));
}
reader = new MultiReader(readers);
} catch (IOException e) {
e.printStackTrace();
}
return new IndexSearcher(reader,service);
}
/**根据索引路径获取IndexReader
* @param indexPath
* @return
* @throws IOException
*/
public static DirectoryReader getIndexReader(String indexPath) throws IOException{
return DirectoryReader.open(FSDirectory.open(Paths.get(indexPath, new String[0])));
}
/**根据索引路径获取IndexSearcher
* @param indexPath
* @param service
* @return
* @throws IOException
*/
public static IndexSearcher getIndexSearcherByIndexPath(String indexPath,ExecutorService service) throws IOException{
IndexReader reader = getIndexReader(indexPath);
return new IndexSearcher(reader,service);
}
/**如果索引目录会有变更用此方法获取新的IndexSearcher这种方式会占用较少的资源
* @param oldSearcher
* @param service
* @return
* @throws IOException
*/
public static IndexSearcher getIndexSearcherOpenIfChanged(IndexSearcher oldSearcher,ExecutorService service) throws IOException{
DirectoryReader reader = (DirectoryReader) oldSearcher.getIndexReader();
DirectoryReader newReader = DirectoryReader.openIfChanged(reader);
return new IndexSearcher(newReader, service);
}
/**根据IndexSearcher和docID获取默认的document
* @param searcher
* @param docID
* @return
* @throws IOException
*/
public static Document getDefaultFullDocument(IndexSearcher searcher,int docID) throws IOException{
return searcher.doc(docID);
}
/**根据IndexSearcher和docID
* @param searcher
* @param docID
* @param listField
* @return
* @throws IOException
*/
public static Document getDocumentByListField(IndexSearcher searcher, int docID, Set<String> listField) throws IOException{
return searcher.doc(docID, listField);
}
/**分页查询
* @param page 当前页数
* @param perPage 每页显示条数
* @param searcher searcher查询器
* @param query 查询条件
* @return
* @throws IOException
*/
public static TopDocs getScoreDocsByPerPage(int page,int perPage,IndexSearcher searcher,Query query) throws IOException{
TopDocs result = null;
if(query == null){
System.out.println(" Query is null return null ");
return null;
}
ScoreDoc before = null;
if(page != 1){
TopDocs docsBefore = searcher.search(query, (page-1)*perPage);
ScoreDoc[] scoreDocs = docsBefore.scoreDocs;
if(scoreDocs.length > 0){
before = scoreDocs[scoreDocs.length - 1];
}
}
result = searcher.searchAfter(before, query, perPage);
return result;
}
public static TopDocs getScoreDocs(IndexSearcher searcher, Query query) throws IOException {
TopDocs docs = searcher.search(query, getMaxDocId(searcher));
return docs;
}
/**统计document的数量,此方法等同于matchAllDocsQuery查询
* @param searcher
* @return
*/
public static int getMaxDocId(IndexSearcher searcher){
return searcher.getIndexReader().maxDoc();
}
}
测试代码
package com.github.houbb.lucene.learn.chap05;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
/**
* @author binbin.hou
* @since 1.0.0
*/
public class SearcherPageUtilTest {
public static void main(String[] args) {
ExecutorService service = Executors.newCachedThreadPool();
try {
IndexSearcher searcher = SearcherPageUtil.getIndexSearcherByParentPath("index",service);
System.out.println(SearcherPageUtil.getMaxDocId(searcher));
Term term = new Term("content", "lucene");
Query query = new TermQuery(term);
TopDocs docs = SearcherPageUtil.getScoreDocsByPerPage(2, 20, searcher, query);
ScoreDoc[] scoreDocs = docs.scoreDocs;
System.out.println("所有的数据总数为:"+docs.totalHits);
System.out.println("本页查询到的总数为:"+scoreDocs.length);
for (ScoreDoc scoreDoc : scoreDocs) {
Document doc = SearcherPageUtil.getDefaultFullDocument(searcher, scoreDoc.doc);
//System.out.println(doc);
}
System.out.println("\n\n");
TopDocs docsAll = SearcherPageUtil.getScoreDocs(searcher, query);
Set<String> fieldSet = new HashSet<String>();
fieldSet.add("path");
fieldSet.add("modified");
for (int i = 0 ; i < 20 ; i ++) {
Document doc = SearcherPageUtil.getDocumentByListField(searcher, docsAll.scoreDocs[i].doc,fieldSet);
System.out.println(doc);
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}finally{
service.shutdownNow();
}
}
}