java sensitive-word 敏感词之 DFA 双数组实现源码学习

老马啸西风2020年1月7日大约 19 分钟

为了便于大家学习，项目开源地址如下，欢迎 fork+star 鼓励一下老马~

sensitive-word

双数组实现原理

双数组Tire树是Tire树的升级版，Tire取自英文Retrieval中的一部分，即检索树，又称作字典树或者键树。

下面简单介绍一下Tire树。

1.1 Tire树

Trie是一种高效的索引方法，它实际上是一种确定有限自动机(DFA)，在树的结构中，每一个结点对应一个DFA状态，每一个从父结点指向子结点(有向)标记的边对应一个DFA转换。

遍历从根结点开始，然后从head到tail，由关键词(本想译成键字符串，感太别扭)的每个字符来决定下一个状态，标记有相同字符的边被选中做移动。

注意每次这种移动会从关键词中消耗一个字符并走向树的下一层，如果这个关键字符串空了，并且走到了叶子结点，那么我们达到了这个关键词的出口。

如果我们被困在了一点结点，比如因为没有分枝被标记为当前我们有的字符，或是因为关键字符串在中间结点就空了，这表示关键字符串没有被trie认出来。

图1.1.1即是一颗Tire树，其由字典{AC,ACE,ACFF,AD,CD,CF,ZQ}构成。

图中R表示根节点，并不代表字符，除根节点外每一个节点都只包含一个字符。

从根节点到图中绿色节点，路径上经过的字符连接起来，为该节点对应的字符串。绿色叶节点的数字代表其在字典中的位置

说明

双数组Trie树(DoubleArrayTrie)是一种空间复杂度低的Trie树，应用于字符区间大的语言（如中文、日文等）分词领域。

双数组Trie (Double-Array Trie)结构由日本人JUN-ICHI AOE于1989年提出的，是Trie结构的压缩形式，仅用两个线性数组来表示Trie树，该结构有效结合了数字搜索树(Digital Search Tree)检索时间高效的特点和链式表示的Trie空间结构紧凑的特点。

双数组Trie的本质是一个确定有限状态自动机（DFA），每个节点代表自动机的一个状态，根据变量不同，进行状态转移，当到达结束状态或无法转移时，完成一次查询操作。

在双数组所有键中包含的字符之间的联系都是通过简单的数学加法运算表示，不仅提高了检索速度，而且省去了链式结构中使用的大量指针，节省了存储空间。

——《基于双数组 Trie 树算法的字典改进和实现》

自己的实现

接口定义

我们首先统一定义 trie-tree 的接口，和 leetcode 风格保持一致。

多一个插入列表，便于日常使用。

package com.github.houbb.trie.api;

import java.util.Collection;

/**
 * 前缀树接口
 * @since 1.0.0
 */
public interface ITrieTree {

    /**
     * 插入列表
     * @param words 列表
     */
    void insert(Collection words);

    /**
     * 插入单词
     * @param word 单词
     */
    void insert(String word);

    /**
     * 是否存在单词
     * @param word 单词
     * @return 结果
     */
    boolean search(String word);

    /**
     * 是否以 xxx 开头
     * @param prefix 开头
     * @return 结果
     * @since 1.0.0
     */
    boolean startsWith(String prefix);

}

V1-简单版本

思路

最简单的是每一个节点，都有对应的 next 数组。

优点是实现非常简单，缺点是如果是中文、日文等体系，数组的会变得非常稀疏，直接内存爆炸。

实现

刷过 leetcode 话，这个实现自然不在话下。

我们把 arraySize 作为外部可以指定，默认为 256。

package com.github.houbb.trie.impl;

/**
 * 前缀树接口
 * @since 1.0.0
 */
public class TrieTreeArray extends AbstractTrieTree {

    private class TrieNode {
        private boolean isEnd;

        private TrieNode[] next;

        public TrieNode() {
            this.isEnd = false;
            next = new TrieNode[arraySize];
        }
    }

    // 根节点
    private TrieNode root;

    /**
     * 如果只是英文小写字母 26 即可，
     * 如果是 ASCII，则 256
     * 如果是其他的，则太大了。可能会造成数组非常的稀疏，空间的浪费。
     *
     * 如何优化，可以避免这种浪费呢？使用 map 替代？
     */
    private int arraySize = 256;

    public TrieTreeArray(int arraySize) {
        this.arraySize = arraySize;
        // 初始化
        root = new TrieNode();
    }

    public TrieTreeArray() {
        this(256);
    }

    public void insert(String word) {
        TrieNode cur = root;
        for(int i = 0; i  newWordMap = new HashMap<>(8);
                newWordMap.put(IS_END, false);

                // 将新的节点放入当前 map 中
                currentMap.put(charKey, newWordMap);

                // 将新节点设置为当前节点，方便下一次节点的循环。
                currentMap = newWordMap;
            }

            // 判断是否为最后一个，添加是否结束的标识。
            if (i == size - 1) {
                currentMap.put(IS_END, true);
            }
        }
    }

    @Override
    public boolean search(String word) {
        if(MapUtil.isEmpty(map)) {
            return false;
        }

        // 每一个 char 都存在，且最后对应的是 ED
        Map currentMap = map;
        for(int i = 0; i  CharMap = new HashMap();
    ArrayList CharList = new ArrayList();
 
    public DoubleArrayTrie()
    {
        Base[1] = 1;
 
        CharMap.put(END_CHAR,1);
        CharList.add(END_CHAR);
        CharList.add(END_CHAR);
        for(int i=0;i Tail.length-Pos)
        {
            Extend_Tail();
        }
        for(int i=p; i= Base.length) Extend_Array();
                if(Base[cur_p]!= 0 || Check[cur_p]!= 0)
                {
                    flag = false;
                    break;
                }
            }
            if (flag) return i;
        }
    }
 
    private ArrayList GetChildList(int p)
    {
        ArrayList ret = new ArrayList();
        for(int i=1; i= Check.length) break;
            if(Check[Base[p]+i] == p)
            {
                ret.add(i);
            }
        }
        return ret;
    }
 
    private boolean TailContainString(int start,String s2)
    {
        for(int i=0;i= Base.length) Extend_Array();
 
            //空闲状态
            if(Base[cur_p] == 0 && Check[cur_p] == 0)
            {
                Base[cur_p] = -Pos;
                Check[cur_p] = pre_p;
                Pos = CopyToTailArray(s,i+1);
                break;
            }else
            //已存在状态
            if(Base[cur_p] > 0 && Check[cur_p] == pre_p)
            {
                pre_p = cur_p;
                continue;
            }else
            //冲突 1：遇到 Base[cur_p]小于0的，即遇到一个被压缩存到Tail中的字符串
            if(Base[cur_p]  list1 = GetChildList(pre_p);
                int toBeAdjust;
                ArrayList list = null;
                if(true)
                {
                    toBeAdjust = pre_p;
                    list = list1;
                }
 
                int origin_base = Base[toBeAdjust];
                list.add(GetCharCode(s.charAt(i)));
                int avail_base = x_check((Integer[])list.toArray(new Integer[list.size()]));
                list.remove(list.size()-1);
 
                Base[toBeAdjust] = avail_base;
                for(int j=0; j 0)
                    {
                        ArrayList subsequence = GetChildList(tmp1);
                        for(int k=0; k GetAllChildWord(int index)
    {
        ArrayList result = new ArrayList();
        if(Base[index] == 0)
        {
            result.add("");
            return result;
        }
        if(Base[index]  FindAllWords(String word)
    {
        ArrayList result = new ArrayList();
        String prefix = "";
        FindStruct fs = Find(word);
        int p = fs.p;
        if (p == -1) return result;
        if(Base[p] 0)
        {
            ArrayList r =  GetAllChildWord(p);
            for(int i=0;i words = new ArrayList();
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream("words3.dic")));
        String s;
        int num = 0;
        while((s=reader.readLine()) != null)
        {
            words.add(s);
            num ++;
        }
        DoubleArrayTrie dat = new DoubleArrayTrie();
 
        for(String word: words)
        {
            dat.Insert(word);
        }
 
        System.out.println(dat.Base.length);
        System.out.println(dat.Tail.length);
 
        Scanner sc = new Scanner(System.in);
        while(sc.hasNext())
        {
            String word = sc.next();
            System.out.println(dat.Exists(word));
            System.out.println(dat.FindAllWords(word));
        }
    }
}

开源项目-darts-java

https://github.com/komiya-atsushi/darts-java

ps: 原项目是日文，这里简单的翻译一下。

简介

Taku Kudo 的双数组 Trie C++ 实现 (*1) 已被 MURAWAKI Yugo (*2) 移植到 Java。这是 Darts 的 Java 实现，具有更像 Java 的接口和改进的性能。。

(1) Darts http://www.chasen.org/~taku/software/darts/

(2) http://nlp.ist.i.kyoto-u.ac.jp/member/murawaki/misc/index.html

原始实现的改进（Java 移植版本）

改进的界面
对字符串的 char[] 表示和大量使用数组进行了修改，以创建更易于使用的类似于 Java 的界面。
处理输入数据导致错误的情况
改善堆消耗
减少了 Trie 构造期间和之后的堆消耗（原始实现的堆消耗约为 26%）。
提高执行速度
trie 构建（快约 2.6 倍）、精确匹配搜索（快约 1.7 倍）和公共前缀搜索（快约 1.2 倍）的处理性能得到改进。

用法

请单独使用 DoubleArrayTrie.java（用这个文件完成实现）。

基准

测试数据

Ubuntu 12 的 /usr/share/dict/words (916 KB, 99171 words) 作为测试数据。

测量方法

堆消耗
调用DoubleArrayTrie#build()前后分别通过Runtime#totalMemory() / Runtime#freeMemory()测量堆消耗，取差值计算构建后的堆消耗。
从文件中读取的测试数据经过build()处理50次，计算出平均值和标准差。
exactMatchSearch() 处理时间
对于基于测试数据构建的trie，对测试数据中的所有词按顺序进行精确匹配搜索的过程进行了50次，并计算了平均值和标准差。
commonPrefixSearch() 处理时间
与 exactMatchSearch() 类似，对测试数据构建的 Trie 搜索公共前缀 50 次，对测试数据中的所有词条进行排序，并计算平均值和标准差。
在最初的实现中，接口接收 commonPrefixSearch() 的结果作为传递给相同方法的 int 数组。调用 commonPrefixSearch()。

測定結果

                              original     imploved
====================================================
ヒープ消費量 [byte]         62,287,864   16,780,160
----------------------------------------------------
build() [msec]                  165.68        64.26
  (標準偏差)                    (82.87)       (6.74)
----------------------------------------------------
exactMatchSearch() [msec]        10.88         6.24
  (標準偏差)                     (7.21)       (7.73)
----------------------------------------------------
commonPrefixSearch() [msec]      17.18        14.04
  (標準偏差)                     (4.68)       (4.75)
----------------------------------------------------

测试

实际上官方没有给出具体的使用例子。

我们把这个类封装一下。

实现

只有一个文件类，非常的优雅。

ps: 但是看起来不太舒服，这种写法 c 的风格太强烈了。

/**
 * DoubleArrayTrie: Java implementation of Darts (Double-ARray Trie System)
 * 
 * 
 * Copyright(C) 2001-2007 Taku Kudo &lt;taku@chasen.org&gt;
 * Copyright(C) 2009 MURAWAKI Yugo &lt;murawaki@nlp.kuee.kyoto-u.ac.jp&gt;
 * Copyright(C) 2012 KOMIYA Atsushi &lt;komiya.atsushi@gmail.com&gt;
 * 
 * 
 * 
 * The contents of this file may be used under the terms of either of the GNU
 * Lesser General Public License Version 2.1 or later (the "LGPL"), or the BSD
 * License (the "BSD").
 * 
 */
package darts;

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class DoubleArrayTrie {
	private final static int BUF_SIZE = 16384;
	private final static int UNIT_SIZE = 8; // size of int + int

	private static class Node {
		int code;
		int depth;
		int left;
		int right;
	};

	private int check[];
	private int base[];

	private boolean used[];
	private int size;
	private int allocSize;
	private List key;
	private int keySize;
	private int length[];
	private int value[];
	private int progress;
	private int nextCheckPos;
	// boolean no_delete_;
	int error_;

	// int (*progressfunc_) (size_t, size_t);

	// inline _resize expanded
	private int resize(int newSize) {
		int[] base2 = new int[newSize];
		int[] check2 = new int[newSize];
		boolean used2[] = new boolean[newSize];
		if (allocSize > 0) {
			System.arraycopy(base, 0, base2, 0, allocSize);
			System.arraycopy(check, 0, check2, 0, allocSize);
			System.arraycopy(used2, 0, used2, 0, allocSize);
		}

		base = base2;
		check = check2;
		used = used2;

		return allocSize = newSize;
	}

	private int fetch(Node parent, List siblings) {
		if (error_  cur) {
				error_ = -3;
				return 0;
			}

			if (cur != prev || siblings.size() == 0) {
				Node tmp_node = new Node();
				tmp_node.depth = parent.depth + 1;
				tmp_node.code = cur;
				tmp_node.left = i;
				if (siblings.size() != 0)
					siblings.get(siblings.size() - 1).right = i;

				siblings.add(tmp_node);
			}

			prev = cur;
		}

		if (siblings.size() != 0)
			siblings.get(siblings.size() - 1).right = parent.right;

		return siblings.size();
	}

	private int insert(List siblings) {
		if (error_  nextCheckPos) ? siblings.get(0).code + 1
				: nextCheckPos) - 1;
		int nonzero_num = 0;
		int first = 0;

		if (allocSize  1.0 * keySize / (progress + 1)) ? 1.05 : 1.0
						* keySize / (progress + 1);
				resize((int) (allocSize * l));
			}

			if (used[begin])
				continue;

			for (int i = 1; i = 0.95)
			nextCheckPos = pos;

		used[begin] = true;
		size = (size > begin + siblings.get(siblings.size() - 1).code + 1) ? size
				: begin + siblings.get(siblings.size() - 1).code + 1;

		for (int i = 0; i  new_siblings = new ArrayList();

			if (fetch(siblings.get(i), new_siblings) == 0) {
				base[begin + siblings.get(i).code] = (value != null) ? (-value[siblings
						.get(i).left] - 1) : (-siblings.get(i).left - 1);

				if (value != null && (-value[siblings.get(i).left] - 1) >= 0) {
					error_ = -2;
					return 0;
				}

				progress++;
				// if (progress_func_) (*progress_func_) (progress,
				// keySize);
			} else {
				int h = insert(new_siblings);
				base[begin + siblings.get(i).code] = h;
			}
		}
		return begin;
	}

	public DoubleArrayTrie() {
		check = null;
		base = null;
		used = null;
		size = 0;
		allocSize = 0;
		// no_delete_ = false;
		error_ = 0;
	}

	// no deconstructor

	// set_result omitted
	// the search methods returns (the list of) the value(s) instead
	// of (the list of) the pair(s) of value(s) and length(s)

	// set_array omitted
	// array omitted

	void clear() {
		// if (! no_delete_)
		check = null;
		base = null;
		used = null;
		allocSize = 0;
		size = 0;
		// no_delete_ = false;
	}

	public int getUnitSize() {
		return UNIT_SIZE;
	}

	public int getSize() {
		return size;
	}

	public int getTotalSize() {
		return size * UNIT_SIZE;
	}

	public int getNonzeroSize() {
		int result = 0;
		for (int i = 0; i  key) {
		return build(key, null, null, key.size());
	}

	public int build(List _key, int _length[], int _value[],
			int _keySize) {
		if (_keySize > _key.size() || _key == null)
			return 0;

		// progress_func_ = progress_func;
		key = _key;
		length = _length;
		keySize = _keySize;
		value = _value;
		progress = 0;

		resize(65536 * 32);

		base[0] = 1;
		nextCheckPos = 0;

		Node root_node = new Node();
		root_node.left = 0;
		root_node.right = keySize;
		root_node.depth = 0;

		List siblings = new ArrayList();
		fetch(root_node, siblings);
		insert(siblings);

		// size += (1 = allocSize) resize (size);

		used = null;
		key = null;

		return error_;
	}

	public void open(String fileName) throws IOException {
		File file = new File(fileName);
		size = (int) file.length() / UNIT_SIZE;
		check = new int[size];
		base = new int[size];

		DataInputStream is = null;
		try {
			is = new DataInputStream(new BufferedInputStream(
					new FileInputStream(file), BUF_SIZE));
			for (int i = 0; i  commonPrefixSearch(String key) {
		return commonPrefixSearch(key, 0, 0, 0);
	}

	public List commonPrefixSearch(String key, int pos, int len,
			int nodePos) {
		if (len  result = new ArrayList();

		char[] keyChars = key.toCharArray();

		int b = base[nodePos];
		int n;
		int p;

		for (int i = pos; i < len; i++) {
			p = b;
			n = base[p];

			if (b == check[p] && n < 0) {
				result.add(-n - 1);
			}

			p = b + (int) (keyChars[i]) + 1;
			if (b == check[p])
				b = base[p];
			else
				return result;
		}

		p = b;
		n = base[p];

		if (b == check[p] && n < 0) {
			result.add(-n - 1);
		}

		return result;
	}

	// debug
	public void dump() {
		for (int i = 0; i < size; i++) {
			System.err.println("i: " + i + " [" + base[i] + ", " + check[i]
					+ "]");
		}
	}
}