org.apache.lucene.analysis.cn.smart.hhmm
abstract class: AbstractDictionary
java.lang.Object
   org.apache.lucene.analysis.cn.smart.hhmm.AbstractDictionary

Direct Known Subclasses:
    BigramDictionary, WordDictionary

SmartChineseAnalyzer abstract dictionary implementation.

Contains methods for dealing with GB2312 encoding.

WARNING: The analyzers/smartcn analysis.cn.smart package is experimental. The APIs and file formats introduced here might change in the future, in which case the current versions will no longer be supported.

Field Summary
public static final  int GB2312_FIRST_CHAR    First Chinese character in GB2312 (15 * 94). Characters in GB2312 are arranged in a grid of 94 * 94; rows 0-14 are unassigned or hold punctuation. 
public static final  int GB2312_CHAR_NUM    Last Chinese character in GB2312 (87 * 94). Characters in GB2312 are arranged in a grid of 94 * 94; rows 88-94 are unassigned. 
public static final  int CHAR_NUM_IN_FILE    Dictionary data contains 6768 Chinese characters with frequency statistics. 
Method from org.apache.lucene.analysis.cn.smart.hhmm.AbstractDictionary Summary:
getCCByGB2312Id,   getGB2312Id,   hash1,   hash1,   hash2,   hash2
Methods from java.lang.Object:
clone,   equals,   finalize,   getClass,   hashCode,   notify,   notifyAll,   toString,   wait,   wait,   wait
Method from org.apache.lucene.analysis.cn.smart.hhmm.AbstractDictionary Detail:
 public String getCCByGB2312Id(int ccid) 

    Transcode from GB2312 ID to Unicode

    GB2312 is divided into a 94 * 94 grid, containing 7445 characters consisting of 6763 Chinese characters and 682 symbols. Some regions are unassigned (reserved).

 public short getGB2312Id(char ch) 
    Transcode from Unicode to GB2312
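
The grid arithmetic behind these two transcoding methods can be sketched as follows. This is a minimal illustration, not the class's actual code: the class name Gb2312GridSketch and its helpers are hypothetical, and it assumes the ID is a zero-based row * 94 + cell index and that the two-byte EUC-CN form adds 0xA1 to each coordinate. The toUnicode helper further assumes the running JVM ships the GB2312 charset.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of Lucene: illustrates the 94 * 94 grid arithmetic.
final class Gb2312GridSketch {
    static final int GRID_SIZE = 94;     // rows, and cells per row
    static final int BYTE_OFFSET = 0xA1; // assumption: EUC-CN bytes start at 0xA1

    /** Maps a two-byte GB2312 (EUC-CN) code to a linear ID: row * 94 + cell, both zero-based. */
    static int toId(int highByte, int lowByte) {
        int row = highByte - BYTE_OFFSET;
        int cell = lowByte - BYTE_OFFSET;
        if (row < 0 || row >= GRID_SIZE || cell < 0 || cell >= GRID_SIZE) {
            return -1; // outside the grid
        }
        return row * GRID_SIZE + cell;
    }

    /** Inverse of toId: splits a linear ID back into its two byte values. */
    static int[] toBytes(int id) {
        if (id < 0 || id >= GRID_SIZE * GRID_SIZE) {
            throw new IllegalArgumentException("ID outside the 94 * 94 grid: " + id);
        }
        return new int[] { id / GRID_SIZE + BYTE_OFFSET, id % GRID_SIZE + BYTE_OFFSET };
    }

    /** Decodes a linear ID to a Unicode string; throws if the JVM lacks the GB2312 charset. */
    static String toUnicode(int id) {
        int[] b = toBytes(id);
        return new String(new byte[] { (byte) b[0], (byte) b[1] }, Charset.forName("GB2312"));
    }

    public static void main(String[] args) {
        System.out.println(toId(0xB0, 0xA1)); // prints 1410 (15 * 94)
    }
}
```

Under these assumptions the byte pair 0xB0 0xA1 maps to ID 1410, which equals 15 * 94 and so lines up with GB2312_FIRST_CHAR.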
 public long hash1(char c) 
    32-bit FNV Hash Function
 public long hash1(char[] carray) 
    32-bit FNV Hash Function
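
For reference, a 32-bit FNV hash over a char array can be sketched as below. The descriptions above do not say which FNV variant, constants, or byte order the class uses, so this sketch assumes the standard FNV-1a form (offset basis 0x811C9DC5, prime 0x01000193) and feeds each char low byte first. Fnv1a32Sketch is a hypothetical name. The result is returned as a non-negative long, mirroring the long return type of hash1.

```java
// Hypothetical helper, not part of Lucene: standard 32-bit FNV-1a.
final class Fnv1a32Sketch {
    private static final int OFFSET_BASIS = 0x811C9DC5; // 2166136261
    private static final int PRIME = 0x01000193;        // 16777619

    /** FNV-1a over raw bytes: XOR the byte in, then multiply by the prime (mod 2^32). */
    static long hash(byte[] data) {
        int h = OFFSET_BASIS;
        for (byte b : data) {
            h = (h ^ (b & 0xFF)) * PRIME; // int arithmetic wraps at 32 bits
        }
        return h & 0xFFFFFFFFL; // expose as an unsigned 32-bit value
    }

    /** FNV-1a over chars; assumption: each char contributes its low byte, then its high byte. */
    static long hash(char[] chars) {
        int h = OFFSET_BASIS;
        for (char c : chars) {
            h = (h ^ (c & 0xFF)) * PRIME;
            h = (h ^ (c >>> 8)) * PRIME;
        }
        return h & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        // Empty input returns the offset basis unchanged.
        System.out.println(Long.toHexString(hash(new byte[0]))); // prints 811c9dc5
    }
}
```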
 public int hash2(char c) 
    djb2 hash algorithm. This algorithm (k=33) was first reported by Dan Bernstein many years ago in comp.lang.c. Another version of this algorithm (now favored by Bernstein) uses XOR: hash(i) = hash(i - 1) * 33 ^ str[i]. The magic of the number 33 (why it works better than many other constants, prime or not) has never been adequately explained.
 public int hash2(char[] carray) 
    djb2 hash algorithm. This algorithm (k=33) was first reported by Dan Bernstein many years ago in comp.lang.c. Another version of this algorithm (now favored by Bernstein) uses XOR: hash(i) = hash(i - 1) * 33 ^ str[i]. The magic of the number 33 (why it works better than many other constants, prime or not) has never been adequately explained.
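
Both djb2 variants described above fit in a few lines. The sketch below is illustrative only: Djb2Sketch is a hypothetical name, the seed 5381 is the conventional djb2 starting value rather than something stated in this documentation, and each char is fed in whole as a 16-bit value.

```java
// Hypothetical helper, not part of Lucene: the two djb2 variants with k = 33.
final class Djb2Sketch {
    private static final int SEED = 5381; // conventional djb2 seed (assumption)

    /** Additive variant: hash(i) = hash(i - 1) * 33 + str[i]. */
    static int djb2(char[] chars) {
        int hash = SEED;
        for (char c : chars) {
            hash = ((hash << 5) + hash) + c; // (hash << 5) + hash == hash * 33
        }
        return hash;
    }

    /** XOR variant: hash(i) = (hash(i - 1) * 33) ^ str[i]. */
    static int djb2Xor(char[] chars) {
        int hash = SEED;
        for (char c : chars) {
            hash = ((hash << 5) + hash) ^ c;
        }
        return hash;
    }

    public static void main(String[] args) {
        System.out.println(djb2("a".toCharArray()));    // prints 177670
        System.out.println(djb2Xor("a".toCharArray())); // prints 177604
    }
}
```

Because the int result can overflow to a negative value on longer inputs, a caller mapping it to a table slot would typically use Math.floorMod(hash, tableSize) rather than the % operator.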