Lucene Example

In Lucene, weight (score) computation is handled by the Similarity class and its subclasses.

The following is a simple Java example that shows how to use Lucene's TFIDFSimilarity to compute document weights.

Note that the example below targets Lucene 8.x.

The exact implementation may differ between Lucene versions.

First, add the Lucene dependencies:

  [xml]
<!-- Add Lucene dependencies to your project -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>8.11.1</version> <!-- Replace with the latest version -->
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>8.11.1</version> <!-- Replace with the latest version -->
</dependency>

Then, the following Java example shows how to compute document weights with Lucene's TFIDFSimilarity:

  [java]
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.similarities.ClassicSimilarity;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneTFIDFExample {

    public static void main(String[] args) throws Exception {
        // Create an in-memory index directory (RAMDirectory is deprecated in 8.x)
        Directory indexDirectory = new ByteBuffersDirectory();

        // Use the standard analyzer
        Analyzer analyzer = new StandardAnalyzer();

        // TFIDFSimilarity is abstract; ClassicSimilarity is its concrete TF-IDF implementation
        ClassicSimilarity similarity = new ClassicSimilarity();

        // Configure the IndexWriter, using the same similarity at index time
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        config.setOpenMode(OpenMode.CREATE_OR_APPEND);
        config.setSimilarity(similarity);

        // Create the IndexWriter
        IndexWriter indexWriter = new IndexWriter(indexDirectory, config);

        // Add documents to the index
        addDocument(indexWriter, "1", "Lucene is a full-text search library.");
        addDocument(indexWriter, "2", "It is widely used for information retrieval.");

        // Commit the changes
        indexWriter.commit();

        // Create the IndexSearcher and apply the TF-IDF similarity at search time
        DirectoryReader reader = DirectoryReader.open(indexWriter);
        IndexSearcher indexSearcher = new IndexSearcher(reader);
        indexSearcher.setSimilarity(similarity);

        // Query the "content" field for the term "search"
        Query query = new TermQuery(new Term("content", "search"));
        ScoreDoc[] hits = indexSearcher.search(query, 10).scoreDocs;

        // Print the results
        for (ScoreDoc hit : hits) {
            Document hitDoc = indexSearcher.doc(hit.doc);
            System.out.println("Document ID: " + hitDoc.get("id") + ", Score: " + hit.score);
        }

        // Release resources
        reader.close();
        indexWriter.close();
        indexDirectory.close();
    }

    private static void addDocument(IndexWriter indexWriter, String id, String content) throws Exception {
        Document document = new Document();
        // "id" is stored only (not searchable); "content" is analyzed and stored
        document.add(new StoredField("id", id));
        document.add(new TextField("content", content, Field.Store.YES));
        indexWriter.addDocument(document);
    }
}

In this example we use ClassicSimilarity, the concrete TF-IDF implementation of TFIDFSimilarity, together with the standard analyzer.

The addDocument method adds documents to the index.

At query time we issue a TermQuery and print each document's score. Note that how the score is computed depends on the similarity model you select.

This is only a minimal example; real applications usually need additional configuration and handling.

In a real project you may need more sophisticated analyzers, index fields, similarity models, and so on, to meet your specific requirements.
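For example, since Lucene 6.0 the default similarity is BM25Similarity rather than the classic TF-IDF model. The sketch below (a hypothetical helper class, not part of the example above) shows how the similarity could be swapped at both index and search time:

  [java]
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.BM25Similarity;

public class Bm25ConfigExample {

    /** Builds an IndexWriterConfig that scores with BM25 instead of classic TF-IDF. */
    public static IndexWriterConfig bm25Config(Analyzer analyzer) {
        // k1 and b are the standard BM25 parameters; 1.2 and 0.75 are Lucene's defaults
        BM25Similarity bm25 = new BM25Similarity(1.2f, 0.75f);
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        config.setSimilarity(bm25);
        return config;
    }

    /** Applies the same BM25 similarity at search time so scores stay consistent. */
    public static void useBm25(IndexSearcher searcher) {
        searcher.setSimilarity(new BM25Similarity(1.2f, 0.75f));
    }
}

Using the same Similarity instance for indexing and searching matters because index-time norms and search-time scoring must agree.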

Core Classes

From this introductory example we can identify the corresponding core classes:

Directory: the index directory (an in-memory directory in this example)
Analyzer: the analyzer (the standard analyzer here)
IndexWriter: writes documents to the index
Document: an indexed document
IndexSearcher: executes queries against the index
ScoreDoc: a single scored hit
Query: a query

Directory Class

Source

  [java]
public abstract class Directory implements Closeable {

  public abstract String[] listAll() throws IOException;

  public abstract void deleteFile(String name) throws IOException;

  public abstract long fileLength(String name) throws IOException;

  public abstract IndexOutput createOutput(String name, IOContext context) throws IOException;

  public abstract IndexOutput createTempOutput(String prefix, String suffix, IOContext context) throws IOException;

  public abstract void sync(Collection<String> names) throws IOException;

  public abstract void syncMetaData() throws IOException;

  public abstract void rename(String source, String dest) throws IOException;

  public abstract IndexInput openInput(String name, IOContext context) throws IOException;

  public ChecksumIndexInput openChecksumInput(String name) throws IOException {
    return new BufferedChecksumIndexInput(openInput(name, IOContext.READONCE));
  }

  public abstract Lock obtainLock(String name) throws IOException;

  public abstract void close() throws IOException;

  protected void ensureOpen() throws AlreadyClosedException {}

  public abstract Set<String> getPendingDeletions() throws IOException;
}
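The concrete implementations of this abstract class decide where the index lives. A minimal sketch, assuming Lucene 8.x (the class name and the /tmp path are illustrative), of choosing between the on-disk FSDirectory and the in-memory ByteBuffersDirectory:

  [java]
import java.nio.file.Paths;

import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class DirectoryExample {

    public static void main(String[] args) throws Exception {
        // On-disk index: FSDirectory.open picks a suitable implementation (usually MMapDirectory)
        try (Directory diskDir = FSDirectory.open(Paths.get("/tmp/lucene-index"))) {
            System.out.println("Files in on-disk index: " + diskDir.listAll().length);
        }

        // In-memory index, useful for tests (replaces the deprecated RAMDirectory)
        try (Directory memDir = new ByteBuffersDirectory()) {
            System.out.println("Files in in-memory index: " + memDir.listAll().length);
        }
    }
}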

IndexWriter

IndexReader

IndexSearcher

Analyzer

Query

Document

ScoreDoc

  [java]
public class ScoreDoc {

  /** The score of this document for the query. */
  public float score;

  /**
   * A hit document's number.
   *
   * @see StoredFields#document(int)
   */
  public int doc;

  /** Only set by {@link TopDocs#merge} */
  public int shardIndex;

  // ...
}
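ScoreDoc instances are returned by IndexSearcher.search inside a TopDocs. A minimal sketch (the helper class and method name are illustrative) of reading the fields above from a search result:

  [java]
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class ScoreDocExample {

    /** Prints each hit's internal doc number and score for the given query. */
    public static void printHits(IndexSearcher searcher, Query query) throws Exception {
        TopDocs topDocs = searcher.search(query, 10);
        System.out.println("Total hits: " + topDocs.totalHits);
        for (ScoreDoc hit : topDocs.scoreDocs) {
            // doc is the internal document number, score is the similarity score;
            // shardIndex is only meaningful when results from several shards are merged
            System.out.println("doc=" + hit.doc + " score=" + hit.score + " shardIndex=" + hit.shardIndex);
        }
    }
}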
