InputIterator说明
联想词
lucene的联想词是在org.apache.lucene.search.suggest包下边,提供了自动补全或者联想提示功能的支持。
InputIterator 是一个支持枚举term,weight,payload三元组的供suggester使用的接口,目前仅支持AnalyzingSuggester,FuzzySuggester andAnalyzingInfixSuggester 三种suggester支持payloads;
接口
/**
* Interface for enumerating term,weight,payload triples for suggester consumption;
* currently only {@link AnalyzingSuggester}, {@link
* FuzzySuggester} and {@link AnalyzingInfixSuggester} support payloads.
*/
public interface InputIterator extends BytesRefIterator {
/** A term's weight, higher numbers mean better suggestions. */
// 权重
public long weight();
/** An arbitrary byte[] to record per suggestion. See
* {@link LookupResult#payload} to retrieve the payload
* for each suggestion. */
// 载荷信息
// 每个suggestion对应的元数据的二进制表示,我们在传输对象的时候需要转换对象或对象的某个属性为BytesRef类型,相应的suggester调用lookup的时候会返回payloads信息;
public BytesRef payload();
/** Returns true if the iterator has payloads */
// 是否包含载荷
public boolean hasPayloads();
/**
* A term's contexts context can be used to filter suggestions.
* May return null, if suggest entries do not have any context
* */
// 上下文
// 获取某个term的contexts,用来过滤suggest的内容,如果suggest的列表为空,返回null
public Set<BytesRef> contexts();
/** Returns true if the iterator has contexts */
// 是否包含上下文
public boolean hasContexts();
/** Singleton InputIterator that iterates over 0 BytesRefs. */
public static final InputIterator EMPTY = new InputIteratorWrapper(BytesRefIterator.EMPTY);
/**
* Wraps a BytesRefIterator as a suggester InputIterator, with all weights
* set to <code>1</code> and carries no payload
*/
public static class InputIteratorWrapper implements InputIterator {
private final BytesRefIterator wrapped;
/**
* Creates a new wrapper, wrapping the specified iterator and
* specifying a weight value of <code>1</code> for all terms
* and nullifies associated payloads.
*/
public InputIteratorWrapper(BytesRefIterator wrapped) {
this.wrapped = wrapped;
}
@Override
public long weight() {
return 1;
}
@Override
public BytesRef next() throws IOException {
return wrapped.next();
}
@Override
public BytesRef payload() {
return null;
}
@Override
public boolean hasPayloads() {
return false;
}
@Override
public Set<BytesRef> contexts() {
return null;
}
@Override
public boolean hasContexts() {
return false;
}
}
}
BytesRefIterator 对应的内容也比较简单:
/**
* A simple iterator interface for {@link BytesRef} iteration.
*/
public interface BytesRefIterator {
/**
* Increments the iteration to the next {@link BytesRef} in the iterator.
* Returns the resulting {@link BytesRef} or <code>null</code> if the end of
* the iterator is reached. The returned BytesRef may be re-used across calls
* to next. After this method returns null, do not call it again: the results
* are undefined.
*
* @return the next {@link BytesRef} in the iterator or <code>null</code> if
* the end of the iterator is reached.
* @throws IOException If there is a low-level I/O error.
*/
public BytesRef next() throws IOException;
/** Singleton BytesRefIterator that iterates over 0 BytesRefs. */
public static final BytesRefIterator EMPTY = new BytesRefIterator() {
@Override
public BytesRef next() {
return null;
}
};
}
ps: 其实这就是一个简单的 Iterator 接口,类似于 list 的迭代器接口。
实现类
InputIterator 的实现类有以下几种:
BufferedInputIterator:对二进制类型的输入进行轮询;
DocumentInputIterator:从索引中被store的field中轮询;
FileIterator:从文件中每次读出单行的数据轮询,以\t进行间隔(且\t的个数最多为2个);
HighFrequencyIterator:从索引中被store的field轮询,忽略长度小于设定值的文本;
InputIteratorWrapper:遍历BytesRefIterator并且返回的内容不包含payload且weight均为1;
SortedInputIterator:二进制类型的输入轮询且按照指定的comparator算法进行排序;
Suggester查询工具Lookup类说明
此类提供了字符串的联想查询功能
Lookup类提供了一个CharSequenceComparator,此comparator主要是用来对CharSequence进行排序,按字符顺序排序;
内置LookupResult,用于返回suggest的结果,同时也是按照CharSequenceComparator进行key的排序;
内置了LookupPriorityQueue,用以存储LookupResult;
方法
LookUp 提供的方法
build(Dictionary dict) : 从指定directory进行build;
load(InputStream input) : 将InputStream转成DataInput并执行load(DataInput)方法;
store(OutputStream output) : 将OutputStream转成DataOutput并执行store(DataOutput)方法;
getCount() : 获取lookup的build的项的数量;
build(InputIterator inputIterator) : 根据指定的InputIterator构建Lookup对象;
lookup(CharSequence key, boolean onlyMorePopular, int num) :根据key查询可能的结果返回值为 List<LookupResult>
;
代码实战
maven 引入
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-suggest</artifactId>
<version>7.2.1</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-misc</artifactId>
<version>7.2.1</version>
</dependency>
代码
实体类
import java.io.Serializable;
import java.util.Arrays;
/**
* @author binbin.hou
* @since 1.0.0
*/
public class Product implements Serializable {
private String name;
private String image;
private String[] regions;
private int numberSold;
public Product(String name, String image, String[] regions, int numberSold) {
this.name = name;
this.image = image;
this.regions = regions;
this.numberSold = numberSold;
}
//getter setter toString
}
对应的迭代器:
实现 InputIterator 的所有接口。
import org.apache.lucene.search.suggest.InputIterator;
import org.apache.lucene.util.BytesRef;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UnsupportedEncodingException;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
/**
* 定义产品迭代器
*
* @author binbin.hou
* @since 1.0.0
*/
public class ProductIterator implements InputIterator {
private final Iterator<Product> productIterator;
private Product currentProduct;
ProductIterator(Iterator<Product> productIterator) {
this.productIterator = productIterator;
}
@Override
public boolean hasContexts() {
return true;
}
/**
* 是否有设置payload信息
*/
@Override
public boolean hasPayloads() {
return true;
}
public Comparator<BytesRef> getComparator() {
return null;
}
@Override
public BytesRef next() {
if (productIterator.hasNext()) {
currentProduct = productIterator.next();
try {
return new BytesRef(currentProduct.getName().getBytes("UTF8"));
} catch (UnsupportedEncodingException e) {
throw new RuntimeException("Couldn't convert to UTF-8",e);
}
} else {
return null;
}
}
@Override
public BytesRef payload() {
try {
ByteArrayOutputStream bos = new ByteArrayOutputStream();
ObjectOutputStream out = new ObjectOutputStream(bos);
out.writeObject(currentProduct);
out.close();
return new BytesRef(bos.toByteArray());
} catch (IOException e) {
throw new RuntimeException("Well that's unfortunate.");
}
}
@Override
public Set<BytesRef> contexts() {
try {
Set<BytesRef> regions = new HashSet<BytesRef>();
for (String region : currentProduct.getRegions()) {
regions.add(new BytesRef(region.getBytes("UTF8")));
}
return regions;
} catch (UnsupportedEncodingException e) {
throw new RuntimeException("Couldn't convert to UTF-8");
}
}
@Override
public long weight() {
return currentProduct.getNumberSold();
}
}
工具方法
定义 lookUp 查询方法:
private static void lookup(AnalyzingInfixSuggester suggester, String name,
String region) throws IOException {
HashSet<BytesRef> contexts = new HashSet<>();
contexts.add(new BytesRef(region.getBytes("UTF8")));
List<Lookup.LookupResult> results = suggester.lookup(name, contexts, 2, true, false);
System.out.println("-- \"" + name + "\" (" + region + "):");
for (Lookup.LookupResult result : results) {
System.out.println(result.key);
BytesRef bytesRef = result.payload;
ObjectInputStream is = new ObjectInputStream(new ByteArrayInputStream(bytesRef.bytes));
Product product = null;
try {
product = (Product)is.readObject();
} catch (ClassNotFoundException e) {
e.printStackTrace();
}
System.out.println("product-Name:" + product.getName());
System.out.println("product-regions:" + product.getRegions());
System.out.println("product-image:" + product.getImage());
System.out.println("product-numberSold:" + product.getNumberSold());
}
System.out.println();
}
测试
测试代码如下:
public static void main(String[] args) {
try {
Directory indexDir = FSDirectory.open(Paths.get("suggestPath"));
StandardAnalyzer analyzer = new StandardAnalyzer();
AnalyzingInfixSuggester suggester = new AnalyzingInfixSuggester(indexDir, analyzer);
ArrayList<Product> products = new ArrayList<>();
products.add(new Product("Electric Guitar",
"http://images.example/electric-guitar.jpg", new String[] {
"US", "CA" }, 100));
products.add(new Product("Electric Train",
"http://images.example/train.jpg", new String[] { "US",
"CA" }, 100));
products.add(new Product("Acoustic Guitar",
"http://images.example/acoustic-guitar.jpg", new String[] {
"US", "ZA" }, 80));
products.add(new Product("Guarana Soda",
"http://images.example/soda.jpg",
new String[] { "ZA", "IE" }, 130));
suggester.build(new ProductIterator(products.iterator()));
lookup(suggester, "Gu", "US");
lookup(suggester, "Gu", "ZA");
lookup(suggester, "Gui", "CA");
lookup(suggester, "Electric guit", "US");
suggester.refresh();
} catch (IOException e) {
System.err.println("Error!");
}
}
日志如下:
-- "Gu" (US):
Electric Guitar
product-Name:Electric Guitar
product-regions:[Ljava.lang.String;@1e730495
product-image:http://images.example/electric-guitar.jpg
product-numberSold:100
Acoustic Guitar
product-Name:Acoustic Guitar
product-regions:[Ljava.lang.String;@7d3a22a9
product-image:http://images.example/acoustic-guitar.jpg
product-numberSold:80
-- "Gu" (ZA):
Guarana Soda
product-Name:Guarana Soda
product-regions:[Ljava.lang.String;@1d082e88
product-image:http://images.example/soda.jpg
product-numberSold:130
Acoustic Guitar
product-Name:Acoustic Guitar
product-regions:[Ljava.lang.String;@60704c
product-image:http://images.example/acoustic-guitar.jpg
product-numberSold:80
-- "Gui" (CA):
Electric Guitar
product-Name:Electric Guitar
product-regions:[Ljava.lang.String;@6b19b79
product-image:http://images.example/electric-guitar.jpg
product-numberSold:100
-- "Electric guit" (US):
Electric Guitar
product-Name:Electric Guitar
product-regions:[Ljava.lang.String;@1a04f701
product-image:http://images.example/electric-guitar.jpg
product-numberSold:100
可以发现,这里会对内容自动补全,返回联想之后对应的结果。