Chinese Information Extraction

Tianfang Yao, Department of Computer Science and Engineering, Shanghai Jiao Tong University, 1954 Hua Shan Road, Shanghai, 200030

1
Chinese Information Extraction
Tianfang Yao
Department of Computer Science and Engineering
Shanghai Jiao Tong University
1954 Hua Shan Road, Shanghai, 200030, China
2
Outline
  • Introduction
  • Word Segmentation
  • Named Entity Extraction
  • Entity Relation Extraction
  • Conclusion

3
Introduction (1)
  • Chinese Language
  • Difficulties in Chinese NLP
  • State-of-the-Art for Chinese Information
    Extraction

4
Introduction (2)
  • Chinese Language
  • Chinese is typologically different from
    English or German.
  • It has a large character set of about
    44,908 characters.
  • Although Chinese has a history of more than
    6,000 years, a complete standard grammar for
    Chinese has not yet been established.

5
Introduction (3)
  • Chinese Language
  • The form of a Chinese character is related to
    its meaning. Characters include pictographs,
    e.g. 日 (sun) and 月 (moon); self-explanatory
    characters, e.g. 上 (above) and 下 (below); and
    associative compounds, e.g. 信 (believe), a
    character made up of 人 (man) and 言 (word),
    which means a message or something that can be
    believed or trusted.
  • Chinese has many homophones, e.g.
    锣 (gong), 螺 (spiral shell), 骡 (mule), 箩 (bamboo
    basket), etc.
  • A Chinese word can be split or expanded, and
    its order can be changed, e.g. ??(take a meal) vs.
    ???????(haircut)? vs. ???

6
Introduction (4)
  • Difficulties in Chinese NLP
  • Because there are no spaces between the characters
    in a Chinese sentence, we have to segment the
    words before we can analyze the sentence structure.
  • Chinese words have no inflection, so semantic
    structure is more important than syntactic
    structure for understanding Chinese sentences.
  • The combination of Chinese words is flexible,
    changeable, succinct and implicit. Sometimes
    constituents are omitted from the sentence.
  • A Chinese sentence sometimes contains runs of
    consecutive nouns or consecutive verbs.

7
Introduction (5)
  • State-of-the-Art for Chinese Information
    Extraction
  • Knowledge engineering approaches
  • Automatically trainable approaches
  • Statistical approaches
  • Hybrid approaches

8
Word Segmentation (1)
  • Research on Automatic Chinese Word Segmentation
  • (Kaiying Liu, Computer Science Department, Shan
    Xi University, China)
  • 1. Definitions
  • Definition 1: Ambiguous Phrase of Overlap Type
  • Assume that AJB is a character string and W is a
    word list. If AJ ∈ W and JB ∈ W, then AJB is
    called an ambiguous phrase of overlap type.
  • e.g. In the string 当代表 (act as a delegate),
    both 当代 (of our time) and 代表 (delegate) are
    words, so this string is an ambiguous phrase of
    overlap type.

9
Word Segmentation (2)
  • Definition 2: Chain Length
  • The number of ambiguous strings in a phrase is
    called its chain length.
  • e.g. There is one ambiguous string in the string
    当代表, so the chain length is 1.
  • Definition 3: Ambiguous Phrase of Combination
    Type
  • Assume that AB is a character string and W is a
    word list. If A ∈ W, B ∈ W and AB ∈ W, then
    AB is called an ambiguous phrase of combination
    type.
  • e.g. In the string 个人 (individual),
    个 (quantifier), 人 (man) and 个人 are all words,
    so this string is an ambiguous phrase of
    combination type.
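
Definitions 1 and 3 can be checked mechanically against a word list. The following is a minimal sketch, not the author's implementation: the function names are hypothetical, the word list W is a toy set, and the Chinese strings are reconstructed from the English glosses on the slides (an assumption).

```python
def is_overlap_ambiguous(s, lexicon):
    """True if s can be split as A+J+B (all non-empty) with A+J and J+B in the lexicon."""
    n = len(s)
    for i in range(1, n):            # A = s[:i]
        for j in range(i + 1, n):    # J = s[i:j], B = s[j:]
            if s[:j] in lexicon and s[i:] in lexicon:
                return True
    return False


def is_combination_ambiguous(s, lexicon):
    """True if s itself is a word and also splits into two words A and B."""
    if s not in lexicon:
        return False
    return any(s[:i] in lexicon and s[i:] in lexicon for i in range(1, len(s)))


# Toy word list (assumed reconstruction of the slide examples).
W = {"当代", "代表", "个", "人", "个人"}

overlap = is_overlap_ambiguous("当代表", W)        # 当代 and 代表 are both words
combination = is_combination_ambiguous("个人", W)  # 个, 人 and 个人 are all words
```

A real system would run these checks over every substring produced by dictionary look-up rather than over hand-picked strings.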

10
Word Segmentation (3)
  • 2. Build the ambiguous phrase libraries
  • 78,000 phrases for overlap type
  • More than 3,000 phrases for combination type
  • Statistical results for overlap type
  • Their chain lengths are mostly 1 or 2 (about 95%
    of all phrases).
  • Among the ambiguous phrases like ABCD with a
    chain length of 2, 98% can be segmented
    into AB/CD.
  • The segmentation of about 82% of the ambiguous
    phrases like ABCDE with a chain length of 3
    depends on the leftmost three characters ABC.
  • False ambiguous phrases: 94%
  • Real ambiguous phrases: 6%

11
Word Segmentation (4)
  • False ambiguous phrase: it actually has only
    one segmentation result in real texts, e.g. ? (be
    given) ?? (a criticism).
  • Real ambiguous phrase: it has more than one
    applicable segmentation result.
  • Case 1: the results have almost equal occurrence
    probabilities.
  • e.g. 应用于 (apply to) can be segmented into
    应用/于 (apply to) or 应/用于 (should be used in).
  • Case 2: the phrase is mostly segmented into only
    one result in real texts.
  • e.g. ??? (have dismissed) should mostly be
    segmented into ??? (have dismissed).

12
Word Segmentation (5)
  • 3. Approaches for segmenting ambiguous phrases
    with overlap type
  • Statistics-based approach
  • A word-formation capacity library is built. It
    holds frequency information for ambiguous phrases
    AJB with a chain length of 1, that is, the
    word-formation frequencies FreqLeft(AJ),
    FreqRight(B), FreqLeft(A) and FreqRight(JB).
  • Rule 1: If FreqLeft(AJ) × FreqRight(B) >
    FreqLeft(A) × FreqRight(JB), AJB is segmented
    into AJ/B; otherwise into A/JB.
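
Rule 1 can be sketched as a simple comparison of two scores. This is a minimal sketch under stated assumptions: the operator joining the two frequencies is lost in the transcript, so the product used here is an assumption, and the function name and the frequency counts are hypothetical.

```python
def segment_overlap(a, j, b, freq_left, freq_right):
    """Choose between AJ/B and A/JB for an overlap-type phrase AJB (chain length 1).

    freq_left[x]  : how often x forms a word as the left part  (FreqLeft)
    freq_right[x] : how often x forms a word as the right part (FreqRight)
    Combining the two frequencies by multiplication is an assumption.
    """
    score_aj_b = freq_left.get(a + j, 0) * freq_right.get(b, 0)
    score_a_jb = freq_left.get(a, 0) * freq_right.get(j + b, 0)
    if score_aj_b > score_a_jb:
        return [a + j, b]
    return [a, j + b]


# Hypothetical counts, for illustration only.
freq_left = {"当代": 50, "当": 300}
freq_right = {"表": 20, "代表": 400}

result = segment_overlap("当", "代", "表", freq_left, freq_right)
```

With these made-up counts the right-hand reading wins (300 × 400 against 50 × 20), so the phrase is split as A/JB.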

13
Word Segmentation (6)
  • (Depending on the statistical results for the
    ambiguous phrase library)
  • Rule 2: An ambiguous phrase with a chain length of
    2, like ABCD, is segmented into AB/CD.
  • Rule 3: An ambiguous phrase with a chain length of
    3, like ABCDE, is first segmented into
    ABC/DE; then the front part ABC is segmented
    as an ambiguous phrase with a chain length of 1.
  • Rule 4: An ambiguous phrase with a chain length of
    4, like ABCDEF, is segmented into AB/CD/EF.

14
Word Segmentation (7)
  • Rule-based approach
  • Rule 1: If an ambiguous phrase contains a
    directional verb whose preceding word is a
    verb, the directional verb is segmented as a
    separate word. e.g. ????? (really
    embody) should be segmented into ?????, because
    ? (come up) is a directional verb and ?? is a
    verb.
  • Rule 2: If the first character of an
    ambiguous phrase is a quantifier and the word
    preceding the phrase is a numeral, the
    quantifier is segmented as a separate word. e.g.
    65??? (a high building of 65 stories) should be
    segmented into 65???, because ? is a quantifier
    and 65 is a numeral.

15
Word Segmentation (8)
  • 4. Approaches for segmenting ambiguous phrases
    with combination type
  • Statistics-based approach
  • Among all ambiguous phrases, 30% usually
    have only one segmentation result. Therefore, a
    library of 133 phrases is built. The
    structure of the database is as follows:
  • FIELD NAME   TYPE     LENGTH   EXPLANATION
  • word         char     4        AB
  • nh           number   3        number of times kept as one word (AB)
  • nf           number   3        number of times segmented into A/B
  • Assume freq = nh/(nh + nf), with thresholds a1 and
    a2, where a1 > a2. If freq > a1, AB is kept as
    one word (AB); if freq < a2, it is segmented into
    A/B.
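
The threshold test on freq can be sketched as follows. This is a minimal sketch: the function name, the default threshold values and the "undecided" outcome for the band between a2 and a1 are assumptions, and nh is read as the count of occurrences kept as one word.

```python
def decide_combination(nh, nf, a1=0.9, a2=0.1):
    """Decide how to treat a combination-type phrase AB from its counts.

    nh: times AB was kept as one word; nf: times it was split into A/B.
    a1 > a2 are thresholds; the defaults here are illustrative only.
    """
    total = nh + nf
    if total == 0:
        return "undecided"      # no evidence in the library
    freq = nh / total
    if freq > a1:
        return "keep"           # AB stays one word
    if freq < a2:
        return "split"          # AB becomes A/B
    return "undecided"          # left to other rules, e.g. POS context


mostly_whole = decide_combination(nh=95, nf=5)
mostly_split = decide_combination(nh=5, nf=95)
mixed = decide_combination(nh=50, nf=50)
```

Phrases that fall between the two thresholds are the ones a context-based rule would have to resolve.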

16
Word Segmentation (9)
  • POS-rule-based approach
  • How AB is segmented depends on the POS
    of its context words. If the word preceding
    AB is a numeral, AB is segmented into
    A/B; otherwise it is kept as AB.
  • e.g. In the
    sentence ???????? (He sleeps in his room by
    himself), AB = ??. Because ? is a numeral,
    ?? should be segmented into ?/?. But in the
    phrase ?????? (The individual interests of the
    peasantry), ?? should not be segmented.

17
Word Segmentation (10)
  • 5. System architecture

18
Word Segmentation (11)
  • 6. System test results
  • The system has been tested on a corpus
    randomly chosen from Beijing Youth, which
    contains 607 ambiguous phrases of overlap type
    and 2292 ambiguous phrases of combination type.
    The precisions are 97% and 87% respectively.

19
Named Entity Extraction (1)
  • Description of the NTU System used for MET2
  • (Hsin-Hsi Chen et al., Natural Language Processing
    Lab., Department of Computer Science and
    Information Engineering, National Taiwan
    University)
  • Processing Steps of Named Entity Extraction
  • (1) Transform Chinese texts in GB codes into
    texts in Big-5 codes
  • (2) Segment Chinese texts into a sequence of
    tokens
  • (3) Identify named people
  • (4) Identify named organizations
  • (5) Identify named locations
  • (6) Use n-gram model to identify named
    organizations/locations
  • (7) Identify the rest of named expressions
  • (8) Transform the results in Big-5 codes into
    the results in GB codes

20
Named Entity Extraction (2)
  • (1) Transform Chinese texts in GB codes into
    texts in Big-5 codes
  • The GB code is an internal code for the
    simplified Chinese character set, which is used
    in mainland China. Big-5, on the other
    hand, is an internal code for the traditional
    Chinese character set, which is used in Taiwan
    and Hong Kong.
  • e.g. simplified Chinese characters vs.
    traditional Chinese characters
  • 人工智能 (Artificial Intelligence)   人工智慧
  • 软件 (Software)   軟體
  • 报告 (Report)   報告
  • 新西兰 (New Zealand)   紐西蘭
  • The NTU system is designed for traditional
    Chinese text, but the test texts in MET2
    are in GB code, so the GB code of the test
    texts must be transformed into Big-5 code. This
    mapping is not always one-to-one; sometimes it
    is one-to-many.
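
The one-to-many problem can be illustrated with a tiny conversion table. This is a minimal sketch, not the NTU system's converter: the table S2T and both function names are hypothetical, and the table holds only a few characters. The `gb2312` and `big5` codec names are the ones shipped with Python's standard library.

```python
from itertools import product

# Tiny illustrative simplified-to-traditional table (assumption: a real
# converter needs a full table and context to pick among candidates).
S2T = {
    "发": ["發", "髮"],   # one-to-many: "emit" vs. "hair"
    "报": ["報"],
    "软": ["軟"],
}


def traditional_candidates(text):
    """Return every traditional-character string the table allows for text."""
    options = [S2T.get(ch, [ch]) for ch in text]
    return ["".join(chars) for chars in product(*options)]


def gb_to_big5_candidates(raw):
    """Decode GB-encoded bytes and return Big-5 encoded candidate strings."""
    text = raw.decode("gb2312")
    return [cand.encode("big5") for cand in traditional_candidates(text)]


one_to_one = traditional_candidates("报告")
one_to_many = traditional_candidates("发")
```

When a character has several candidates, the converter has to choose one from context, which is why the transformation step can introduce errors before segmentation even starts.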

21
Named Entity Extraction (3)
  • (2) Segment Chinese texts into a sequence of
    tokens
  • List all possible words by dictionary
    look-up, and then resolve ambiguities with
    segmentation strategies. The dictionary is
    trained from the CKIP corpus, whose articles are
    collected from Taiwan newspapers, magazines, etc.
  • (3) Identify named people
  • Chinese person names
  • Most Han Chinese surnames are a single
    character, but some are two characters.
  • Most given names are two characters, but some are
    a single character.
  • Theoretically, every character can be used in
    a name. The length of Chinese names ranges
    from 2 to 6 characters.
  • Three kinds of recognition strategies are
    adopted:
  • Name-formulation rules
  • Context clues, e.g. titles, positions,
    speech-act verbs, etc.
  • Cache

22
Named Entity Extraction (4)
  • Name-formulation rules
  • They are trained from a person name corpus in
    Taiwan, which contains 1 million Chinese names.
    Each entry contains a surname, a given name and
    the sex.
  • Possible candidates
  • Model 1. Single character for surname
  • P(C1) × P(C2) × P(C3) using the male (female)
    training table > threshold1 (threshold3), and
  • P(C2) × P(C3) using the male (female) training
    table > threshold2 (threshold4)
  • Model 2. Two characters for surname
  • P(C2) × P(C3) using the male (female) training
    table > threshold2 (threshold4)
  • Model 3. Two surnames together
  • P(C12) × P(C2) × P(C3) using the female training
    table > threshold3,
  • P(C2) × P(C3) using the female training table >
    threshold4, and
  • P(C12) × P(C2) × P(C3) using the female training
    table > P(C12) × P(C2) × P(C3) using the male
    training table
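
Model 1 can be sketched as two threshold tests per training table. This is a minimal sketch under stated assumptions: the table layout (keyed by position and character), the function names, the probabilities and the thresholds are all hypothetical, and the juxtaposed probabilities on the slide are read as products.

```python
def passes(chars, table, t_full, t_given):
    """Apply both Model 1 tests using one training table.

    chars: (C1, C2, C3), where C1 is the surname character.
    table: maps (position, character) to a probability (assumed layout).
    """
    c1, c2, c3 = chars
    p1 = table.get(("surname", c1), 0.0)
    p2 = table.get(("given1", c2), 0.0)
    p3 = table.get(("given2", c3), 0.0)
    return p1 * p2 * p3 > t_full and p2 * p3 > t_given


def is_candidate_model1(chars, male, female, th):
    """Accept if the male table passes thresholds 1 and 2,
    or the female table passes thresholds 3 and 4."""
    return (passes(chars, male, th[1], th[2])
            or passes(chars, female, th[3], th[4]))


# Hypothetical probabilities and thresholds, for illustration only.
MALE = {("surname", "王"): 0.07, ("given1", "志"): 0.02, ("given2", "明"): 0.03}
FEMALE = {("surname", "王"): 0.07, ("given1", "美"): 0.02, ("given2", "玲"): 0.03}
TH = {1: 1e-5, 2: 1e-4, 3: 1e-5, 4: 1e-4}

accepted = is_candidate_model1(("王", "志", "明"), MALE, FEMALE, TH)
rejected = is_candidate_model1(("的", "了", "是"), MALE, FEMALE, TH)
```

Models 2 and 3 would reuse the same helper with the surname test dropped or replaced by the second surname character C12.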

23
Named Entity Extraction (5)
  • Context clues, e.g. titles, positions,
    speech-act verbs, etc.
  • Titles: ?? (Dr.), ?? (Prof.), ?? (Mrs./Ms.),
    ?? (Miss), ?? (Mr.)
  • Positions: ?? (President), ?? (Director),
    ??? (General Manager)
  • Speech-act verbs: ?? (speak), ? (say), ?? (bring
    up)
  • Cache
  • The cache provides a global clue, because a
    person name may appear more than once in a
    document. The cache stores the
    identified candidates. Four cases arise
    when the cache is used:
  • (1) C1C2C3 and C1C2C4 are in the cache, and
    C1C2 is correct.
  • (2) C1C2C3 and C1C2C4 are in the cache, and
    both are correct.
  • (3) C1C2C3 and C1C2 are in the cache, and
    C1C2C3 is correct.
  • (4) C1C2C3 and C1C2 are in the cache, and
    C1C2 is correct.

24
Named Entity Extraction (6)
  • Transliterated person names
  • Transliterated person names denote foreigners.
    The length of transliterated person names is not
    restricted to 2 to 6 characters.
  • Main strategies
  • Transliterated name set
  • The transliterated names trained from MET data
    are regarded as a built-in name set.
  • Character condition
  • Two special character sets are retrieved from
    MET training data. The first character of names
    must belong to a 280-character set, and the
    remaining characters must appear in a
    411-character set. The character condition is a
    loose restriction. It should be employed with
    other clues.
  • Titles
  • Titles used with Chinese person names are also
    applicable to transliterated person names.
  • Name introducers
  • e.g. ? (be called), ?? (Her/His name is),
    ?? (respectfully call sb.)
  • Special verbs
  • e.g. ?? (issue/express/deliver),
    ?? (hint/imply)

25
Named Entity Extraction (7)
  • (4) Identify named organizations
  • The structure of organization names is more
    complex than that of person names. Basically, a
    complete organization name can be divided into
    a name and a keyword.
  • e.g. names: ??? (UN), ?? (USA), ??? (Robertson)
  • keywords: ?? (Army), ??? (Embassy),
    ??? (Foundation)
  • Some rules for recognizing organization
    names:
  • OrganizationName -> OrganizationName OrganizationNameKeyword
  • OrganizationName -> CountryName OrganizationNameKeyword
  • OrganizationName -> PersonName OrganizationNameKeyword
  • OrganizationName -> CountryName DDD OrganizationNameKeyword
  • OrganizationName -> PersonName DD OrganizationNameKeyword
  • OrganizationName -> LocationName DD OrganizationNameKeyword
  • OrganizationName -> CountryName OrganizationName
  • OrganizationName -> LocationName OrganizationName
  • where D is a content word, such as
    ?? (International), ?? (culture and education), etc.
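
The two-symbol rules above can be applied by repeatedly merging adjacent tokens. This is a minimal sketch, not the NTU system's parser: the function name is hypothetical, the input is assumed to be already tagged with entity types, English placeholders stand in for the Chinese tokens, and the rules containing D are left out because their exact form is not recoverable from the transcript.

```python
# Right-hand sides of the two-symbol rules; each reduces to OrganizationName.
RULES = {
    ("OrganizationName", "OrganizationNameKeyword"),
    ("CountryName", "OrganizationNameKeyword"),
    ("PersonName", "OrganizationNameKeyword"),
    ("CountryName", "OrganizationName"),
    ("LocationName", "OrganizationName"),
}


def reduce_org_names(tokens):
    """Merge adjacent (text, type) tokens that match a rule until none apply."""
    tokens = list(tokens)
    changed = True
    while changed:
        changed = False
        for i in range(len(tokens) - 1):
            (text1, type1), (text2, type2) = tokens[i], tokens[i + 1]
            if (type1, type2) in RULES:
                tokens[i:i + 2] = [(text1 + " " + text2, "OrganizationName")]
                changed = True
                break   # restart the scan after each merge
    return tokens


tagged = [("said", "Other"), ("USA", "CountryName"),
          ("Embassy", "OrganizationNameKeyword")]
merged = reduce_org_names(tagged)
```

Because a merged token is itself an OrganizationName, the first rule lets a name absorb a further keyword on the next pass.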

26
Named Entity Extraction (8)
  • (5) Identify named locations
  • The structure of location names is similar to
    that of organization names. The rules include:
  • LocationName -> PersonName LocationNameKeyword
  • LocationName -> LocationName LocationNameKeyword
  • Some examples of location
    keywords:
  • ? (mountain), ?? (center), ?? (highway), ?? (north
    of), ? (city)
  • Other strategies for recognizing location names
    without keywords
  • Locative verbs: ?? (come from), ?? (go to)
  • Cache
  • N-gram model: employs multiple occurrences to
    find a pattern

27
Named Entity Extraction (9)
  • (6) Use n-gram model to identify named
    organizations/locations
  • Although the cache mechanism and the n-gram model
    use the same feature, i.e. multiple occurrences,
    their concepts are totally different. For
    organization names, it is unclear when a pattern
    should be put into the cache because its left
    boundary is hard to determine.
  • In the model, the patterns are selected to meet
    the following criteria:
  • It must consist of a name and an organization
    name keyword
  • Its length must be greater than two words
  • It must not cross a sentence boundary or any
    punctuation mark
  • It must occur at least twice
28
Named Entity Extraction (10)
  • (7) Identify the rest of named expressions
  • A rule-based approach is used for the following
    named expressions:
  • Date expressions
  • DATE -> NUMBER YEAR
  • DATE -> NUMBER MTHUNIT
  • Time expressions
  • TIME -> NUMBER HUNIT
  • TIME -> TIME BSTATE
  • Monetary expressions
  • DMONEY -> MOUNIT NUMBER MOUNIT
  • DMONEY -> NUMBER MOUNIT
  • Percentage expressions
  • DPERCENT -> PERCENT NUMBER
  • DPERCENT -> NUMBER PERCENT

29
Named Entity Extraction (11)
  • (8) Transform the results in Big-5 codes into the
    results in GB codes
  • MET2 Testing Results
  • Named Entity         Recall (%)   Precision (%)
  • Person Name          91           74
  • Organization Name    78           85
  • Location Name        78           69
  • Date                 94           88
  • Time                 98           70
  • Money                98           98
  • Percent              83           98
  • F-measures: P&R 79.61, 2P&R 77.88, P&2R 81.42
30
Entity Relation Extraction (1)
  • A Trainable Method for Extracting Chinese Entity
    Names and Their Relations
  • (Yimin Zhang et al. Intel China Research Center,
    Beijing, China)
  • The process can be divided into two stages. The
    first is the learning process, in which
    several classifiers are built from the training
    data. The second is the extracting process, in
    which Chinese entity names and their relations
    are extracted using the learned classifiers. The
    learning algorithm used in the learning process
    is memory-based learning (MBL), a
    classification-based supervised learning approach.

31
Entity Relation Extraction (2)
32
Entity Relation Extraction (3)
  • The main steps for the learning process
  • (1) Prepare training data in which all noun
    phrases, entity names and relations are manually
    annotated.
  • (2) Segment, tag and partially parse
    the training data.
  • (3) Extract the training sets from the parsed
    training data. Four training sets are extracted
    for different tasks, related to Chinese person
    names, entity names, noun phrases, and relations
    between entity names in the training data
    respectively. The main features used in an
    example can be either local context features,
    e.g. dependency relations, or global context
    features, e.g. the features of a word in the whole
    document, etc.
  • (4) Use the MBL algorithm to obtain an IG-Tree for
    each of the four training sets. An IG-Tree is a
    compressed representation of the training set that
    can be processed quickly in the classification
    process.
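
The slides do not describe how an IG-Tree is built. As background, IG-Tree style compression orders features by information gain, and the following minimal sketch shows only that ranking step. The function names, the two toy features and the labels are hypothetical, and this is not the authors' learner.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(examples, labels, feature_index):
    """Entropy reduction obtained by splitting on one feature."""
    groups = defaultdict(list)
    for features, label in zip(examples, labels):
        groups[features[feature_index]].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_features(examples, labels):
    """Feature indices sorted from most to least informative."""
    indices = range(len(examples[0]))
    return sorted(indices,
                  key=lambda i: information_gain(examples, labels, i),
                  reverse=True)


# Toy data: feature 0 = "preceded by a title?", feature 1 = "sentence-initial?"
examples = [("title", "yes"), ("title", "no"), ("none", "yes"), ("none", "no")]
labels = ["name", "name", "other", "other"]
order = rank_features(examples, labels)
```

In this toy set the first feature separates the classes perfectly, so it would sit at the top of the tree and the second feature adds nothing.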

33
Entity Relation Extraction (4)
  • The main steps for the extracting process
  • Segment, tag and partially parse the
    Chinese documents.
  • Identify Chinese person names using
    PersonName-IG-Tree.
  • Identify Chinese organization names using the
    same method as the NTU system.
  • Identify other entity names using the same method
    as the NTU system.
  • Identify Chinese noun phrases (NP chunking) using
    NP-IG-Tree.
  • Use entity names and noun phrases extracted to
    perform partial parsing again to fix the parsing
    errors.
  • Use EntityName-IG-Tree to classify the noun
    phrases extracted. This step will identify entity
    names that are missed in the previous steps.
  • Use Relation-IG-Tree to identify relations
    between the extracted entity names.

34
Entity Relation Extraction (5)
  • The entity relation extracted
  • Employee-of,
  • Location-of,
  • Product-of and
  • No-relation
  • The features for this task
  • The features used in the CRYSTAL system
  • Some new features, such as the linear order
    of the entity names, the word(s) between the entity
    names, the relative position of the entity names
    (in the same sentence or in neighboring sentences), etc.
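
The three new features named above can be computed directly from token positions. This is a minimal sketch: the function name, the span format (sentence index, start, end with an exclusive end) and the feature value strings are assumptions, and the English sentence is a toy example.

```python
def relation_features(sentences, e1, e2):
    """Build order, in-between words and relative position for two entities.

    sentences: list of token lists.
    e1, e2: (sentence_index, start, end) spans, end exclusive.
    """
    (s1, b1, end1), (s2, b2, end2) = e1, e2
    order = "e1-first" if (s1, b1) < (s2, b2) else "e2-first"
    between = []
    if s1 == s2:
        position = "same-sentence"
        # tokens from the end of the earlier entity to the start of the later one
        lo, hi = (end1, b2) if b1 < b2 else (end2, b1)
        between = sentences[s1][lo:hi]
    elif abs(s1 - s2) == 1:
        position = "neighboring-sentence"
    else:
        position = "other"
    return {"order": order, "between": between, "position": position}


sentences = [["Wu", "Shihong", "joined", "TCL", "Group"]]
person = (0, 0, 2)    # "Wu Shihong"
company = (0, 3, 5)   # "TCL Group"
features = relation_features(sentences, person, company)
```

Each pair of entity names yields one such feature dictionary, which would then be passed to the relation classifier together with the CRYSTAL-style features.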

35
Entity Relation Extraction (6)
  • Example
  • The phrase ???? (Legend's President) (Note:
    Legend = Legend Holdings Limited or Legend Group,
    a famous computer company in China) in
    the subject position includes the features
  • SUBJ-Terms-??
  • SUBJ-Terms-??
  • SUBJ-Mod-Terms-??
  • SUBJ-Head-Terms-??
  • SUB-Classes-Employee
  • SUB-Mod-Classes-Organization
  • SUB-Head-Classes-Organization(should be Position)

36
Entity Relation Extraction (7)
  • Learning and extracting processes
  • For every two related entity names in the
    training data, a training example is identified
    and extracted. After all examples are extracted,
    they are fed to MBL Learner to build the
    Relation-IG-Tree.
  • The extracting process is the same as the
    learning process for extracting all pairs of
    entity names. Then the relation between every
    pair of entity names is derived by the
    Relation-IG-Tree.

37
Entity Relation Extraction (8)
  • Example 1
  • ???????????IT???????,
  • As a famous manufacturer of IT hardware devices
    in China, the Lang Chao Group
  • Company name: ???? Product name: IT????
  • Training example: Company name (??/?) Product
    name ???
  • Relation: product-of
  • Example 2
  • ?????????????????,?????TCL???????????????????????
  • Wu Shihong became the media focus once again;
    however, this time she came to Shanghai as the
    vice president of TCL Group and its IT company's
    general manager.
  • Person name: ??? Company name: TCL??
  • Training example: If a person name and a company
    name appear in neighboring sentences, and no
    other person names or company names are found in
    between, they tend to have an employee-of
    relation.
  • Relation: employee-of
38
Entity Relation Extraction (9)
  • System testing results
  • To test this approach, a manually annotated
    corpus of about 200 business news articles was
    used. All the entity names (about 500 person
    names and 300 organization names), noun phrases,
    and relations in the corpus were manually
    annotated. Ten pairs of training and test sets
    were randomly selected from the corpus, with each
    set equal in size to half of the entire corpus.
    All data sets were tested; the results are as
    follows:

  •                      Recall (%)   Precision (%)
  • Person Name          86.3         83.2
  • Organization Name    73.4         89.3
  • Employee-of          75.6         92.3
  • Product-of           56.2         87.1
  • Location-of          67.2         75.6

39
Conclusion
  • Chinese is typologically different from
    English or German. Chinese NLP poses some special
    difficulties, such as word
    segmentation.
  • There are mainly two types of ambiguous phrase in
    Chinese word segmentation: the overlap type
    and the combination type. Among overlap ambiguous
    phrases, those with chain lengths of 1 or 2
    account for about 95%. Among combination ambiguous
    phrases, 30% usually have only one possible
    segmentation. We can remove ambiguity according
    to the type of ambiguity.
  • Chinese named entities are major constituents of
    Chinese documents. We can combine different methods
    to extract them, such as character
    conditions, statistical information, titles,
    punctuation marks, organization and location
    keywords, speech-act and locative verbs, the cache
    and the n-gram model.
  • We can view the determination of Chinese entity
    relations as a classification process. In the
    learning process, several classifiers are built
    from the training data. In the extracting
    process, the relations are extracted using the
    learned classifiers. Machine learning techniques
    have been used effectively in Chinese entity
    relation extraction.