java.lang.Object
org.apache.lucene.analysis.cn.smart.hhmm.AbstractDictionary
Direct Known Subclasses:
BigramDictionary, WordDictionary

abstract class AbstractDictionary extends Object
SmartChineseAnalyzer abstract dictionary implementation.

Contains methods for dealing with GB2312 encoding.

  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final int
    CHAR_NUM_IN_FILE
    Dictionary data contains 6768 Chinese characters with frequency statistics.
    static final int
    GB2312_CHAR_NUM
    Last Chinese Character in GB2312 (87 * 94).
    static final int
    GB2312_FIRST_CHAR
    First Chinese Character in GB2312 (15 * 94). Characters in GB2312 are arranged in a grid of 94 * 94, 0-14 are unassigned or punctuation.
  • Constructor Summary

    Constructors
    Constructor
    Description
    AbstractDictionary()
     
  • Method Summary

    Modifier and Type
    Method
    Description
    String
    getCCByGB2312Id(int ccid)
    Transcode from GB2312 ID to Unicode
    short
    getGB2312Id(char ch)
    Transcode from Unicode to GB2312
    long
    hash1(char c)
    32-bit FNV Hash Function
    long
    hash1(char[] carray)
    32-bit FNV Hash Function
    int
    hash2(char c)
    djb2 hash algorithm. This algorithm (k=33) was first reported by Dan Bernstein many years ago in comp.lang.c.
    int
    hash2(char[] carray)
    djb2 hash algorithm. This algorithm (k=33) was first reported by Dan Bernstein many years ago in comp.lang.c.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • GB2312_FIRST_CHAR

      public static final int GB2312_FIRST_CHAR
      First Chinese Character in GB2312 (15 * 94). Characters in GB2312 are arranged in a grid of 94 * 94, 0-14 are unassigned or punctuation.
    • GB2312_CHAR_NUM

      public static final int GB2312_CHAR_NUM
      Last Chinese Character in GB2312 (87 * 94). Characters in GB2312 are arranged in a grid of 94 * 94, 88-94 are unassigned.
    • CHAR_NUM_IN_FILE

      public static final int CHAR_NUM_IN_FILE
      Dictionary data contains 6768 Chinese characters with frequency statistics.
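
      The arithmetic behind the three constants can be checked with a short sketch. It assumes zero-based row and column numbers in the 94 * 94 grid; the class and method names (Gb2312GridSketch, gridId) are hypothetical and not part of this API.

```java
final class Gb2312GridSketch {
    // The GB2312 grid has 94 rows of 94 cells each.
    static final int GRID_SIZE = 94;

    // Zero-based linear position of a cell (assumed layout: row-major).
    static int gridId(int row, int col) {
        return row * GRID_SIZE + col;
    }

    public static void main(String[] args) {
        int first = gridId(15, 0); // 15 * 94, as in the GB2312_FIRST_CHAR description
        int end = gridId(87, 0);   // 87 * 94, as in the GB2312_CHAR_NUM description
        System.out.println(first);       // prints 1410
        System.out.println(end);         // prints 8178
        System.out.println(end - first); // prints 6768
    }
}
```

      The span between the two bounds is 72 rows of 94 cells, or 6768 positions, which is the same number given in the CHAR_NUM_IN_FILE description.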
  • Constructor Details

    • AbstractDictionary

      AbstractDictionary()
  • Method Details

    • getCCByGB2312Id

      public String getCCByGB2312Id(int ccid)
      Transcode from GB2312 ID to Unicode

      GB2312 is divided into a 94 * 94 grid, containing 7445 characters consisting of 6763 Chinese characters and 682 symbols. Some regions are unassigned (reserved).

      Parameters:
      ccid - GB2312 id
      Returns:
      Unicode String
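
      One way such a transcoding could work is sketched below. This is not necessarily how the class implements it: the sketch assumes the ID is a zero-based position in the 94 * 94 grid, that GB2312 bytes follow the EUC-CN layout (each byte offset by 0xA1), and that the JDK provides the "GB2312" charset. The names Gb2312IdToCharSketch and fromId are hypothetical.

```java
import java.nio.charset.Charset;

final class Gb2312IdToCharSketch {
    // Assumes the running JDK ships the GB2312 (EUC-CN) charset.
    private static final Charset GB2312 = Charset.forName("GB2312");

    // Maps a zero-based grid position to a one-character String.
    static String fromId(int id) {
        if (id < 0 || id >= 94 * 94) {
            return ""; // outside the grid
        }
        byte[] bytes = {
            (byte) (id / 94 + 0xA1), // row byte
            (byte) (id % 94 + 0xA1)  // column byte
        };
        // Unassigned cells decode to the charset's replacement character.
        return new String(bytes, GB2312);
    }

    public static void main(String[] args) {
        // Position 15 * 94 is the first Chinese character in the grid.
        System.out.println(fromId(15 * 94));
    }
}
```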
    • getGB2312Id

      public short getGB2312Id(char ch)
      Transcode from Unicode to GB2312
      Parameters:
      ch - input character in Unicode, or character in Basic Latin range.
      Returns:
      position in GB2312
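
      The reverse direction can be sketched the same way. The assumptions are again the EUC-CN byte layout and a JDK that provides the "GB2312" charset; returning -1 for anything that does not encode to two bytes is a choice made for this sketch, not documented behavior of the method. The names Gb2312CharToIdSketch and toId are hypothetical.

```java
import java.nio.charset.Charset;

final class Gb2312CharToIdSketch {
    // Assumes the running JDK ships the GB2312 (EUC-CN) charset.
    private static final Charset GB2312 = Charset.forName("GB2312");

    // Maps a character to its zero-based grid position, or -1 (sketch convention).
    static short toId(char ch) {
        byte[] bytes = String.valueOf(ch).getBytes(GB2312);
        if (bytes.length != 2) {
            return -1; // Basic Latin and unmappable characters encode to one byte
        }
        int row = (bytes[0] & 0xFF) - 0xA1;
        int col = (bytes[1] & 0xFF) - 0xA1;
        return (short) (row * 94 + col);
    }

    public static void main(String[] args) {
        System.out.println(toId('\u554A')); // prints 1410
        System.out.println(toId('A'));      // prints -1
    }
}
```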
    • hash1

      public long hash1(char c)
      32-bit FNV Hash Function
      Parameters:
      c - input character
      Returns:
      hashcode
    • hash1

      public long hash1(char[] carray)
      32-bit FNV Hash Function
      Parameters:
      carray - character array
      Returns:
      hashcode
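
      For orientation, a textbook 32-bit FNV-1a hash is sketched below. It is a generic illustration only: the constants and mixing steps used by hash1 may differ, and feeding each char as low byte then high byte is an assumption of this sketch. The names Fnv1a32Sketch, fnv1a32 and hash are hypothetical.

```java
final class Fnv1a32Sketch {
    private static final int OFFSET_BASIS = 0x811C9DC5; // standard FNV-1a 32-bit offset basis
    private static final int PRIME = 0x01000193;        // standard FNV 32-bit prime

    // FNV-1a over raw bytes; the result is returned as an unsigned 32-bit value.
    static long fnv1a32(byte[] data) {
        int hash = OFFSET_BASIS;
        for (byte b : data) {
            hash ^= (b & 0xFF);
            hash *= PRIME; // int overflow wraps, giving arithmetic mod 2^32
        }
        return hash & 0xFFFFFFFFL;
    }

    // Hashes a char array by splitting each char into two bytes (assumed order: low, high).
    static long hash(char[] carray) {
        byte[] data = new byte[carray.length * 2];
        for (int i = 0; i < carray.length; i++) {
            data[2 * i] = (byte) (carray[i] & 0xFF);
            data[2 * i + 1] = (byte) (carray[i] >> 8);
        }
        return fnv1a32(data);
    }
}
```

      Returning a long keeps the 32-bit result non-negative, which may be one reason a hash of this kind is declared with a long return type.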
    • hash2

      public int hash2(char c)
      djb2 hash algorithm. This algorithm (k=33) was first reported by Dan Bernstein many years ago in comp.lang.c. Another version of this algorithm (now favored by Bernstein) uses XOR: hash(i) = hash(i - 1) * 33 ^ str[i]. The magic of the number 33 (why it works better than many other constants, prime or not) has never been adequately explained.
      Parameters:
      c - character
      Returns:
      hashcode
    • hash2

      public int hash2(char[] carray)
      djb2 hash algorithm. This algorithm (k=33) was first reported by Dan Bernstein many years ago in comp.lang.c. Another version of this algorithm (now favored by Bernstein) uses XOR: hash(i) = hash(i - 1) * 33 ^ str[i]. The magic of the number 33 (why it works better than many other constants, prime or not) has never been adequately explained.
      Parameters:
      carray - character array
      Returns:
      hashcode
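
      The two djb2 variants described above can be sketched as follows. The seed 5381 is the conventional djb2 starting value and is an assumption here, as is hashing each char as a whole; hash2 may process its input differently. The names Djb2Sketch, djb2Add and djb2Xor are hypothetical.

```java
final class Djb2Sketch {
    private static final int SEED = 5381; // conventional djb2 seed (assumed)

    // Additive form: hash(i) = hash(i - 1) * 33 + str[i]
    static int djb2Add(char[] carray) {
        int hash = SEED;
        for (char c : carray) {
            hash = hash * 33 + c;
        }
        return hash;
    }

    // XOR form: hash(i) = hash(i - 1) * 33 ^ str[i]
    static int djb2Xor(char[] carray) {
        int hash = SEED;
        for (char c : carray) {
            hash = (hash * 33) ^ c;
        }
        return hash;
    }

    public static void main(String[] args) {
        char[] a = { 'a' };
        System.out.println(djb2Add(a)); // prints 177670
        System.out.println(djb2Xor(a)); // prints 177604
    }
}
```

      Multiplying by 33 is often written as (hash << 5) + hash, which gives the same result in int arithmetic.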