Sunday, January 20, 2008

Natural Language Processing (Part 1)

Natural Language Processing (NLP) can be divided into two categories:

    • processing written text
    • processing spoken language

Steps in NLP

Roughly, we can break the process down into the following five components:

  • Morphological Analysis: Individual words are analyzed into their components, and non-word tokens, such as punctuation, are separated from the words. For spoken language, phonetics is also considered at this phase.
  • Syntactic Analysis: Linear sequences of words are transformed into structures that show how the words relate to each other. Some word sequences may be rejected if they violate the language’s rules for how words may be combined.
  • Semantic Analysis: A mapping is made between the syntactic structures and objects in the task domain. Structures for which no such mapping is possible may be rejected.
  • Discourse Integration: The meaning of an individual sentence may depend on the sentences that precede it. In this phase, a sentence is interpreted in light of the information that precedes it. For example, in “John wanted it.”, “it” depends on the prior discourse context. Similarly, “He always had.” can only be understood using information from previous sentences.
  • Pragmatic Analysis: The structure representing what was said is reinterpreted to determine what was actually meant. For example, the sentence “Do you know the route?” should be interpreted as a request for directions rather than as a yes/no question.
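
The first two steps above can be sketched in a few lines of Python. This is only a toy illustration, not a real analyzer: the lexicon, the part-of-speech labels, and the list of allowed sentence patterns are all made up for this example, and a practical system would need far larger ones.

```python
import re

# Hypothetical toy lexicon: word -> part of speech (assumed for illustration).
LEXICON = {
    "john": "NOUN", "box": "NOUN", "pen": "NOUN",
    "it": "PRON", "wanted": "VERB", "the": "DET",
}

# Hypothetical toy grammar: the only part-of-speech sequences we accept.
GRAMMAR = {
    ("NOUN", "VERB", "PRON"),
    ("NOUN", "VERB", "NOUN"),
    ("NOUN", "VERB", "DET", "NOUN"),
}

def tokenize(text):
    """Morphological step (simplified): separate words from punctuation."""
    return re.findall(r"\w+|[^\w\s]", text)

def tag(tokens):
    """Look up each word in the lexicon; punctuation is dropped, unknown words become 'UNK'."""
    return [LEXICON.get(t.lower(), "UNK") for t in tokens if t[0].isalnum()]

def is_grammatical(text):
    """Syntactic step (simplified): reject sequences that match no grammar pattern."""
    return tuple(tag(tokenize(text))) in GRAMMAR

print(tokenize("John wanted it."))
print(is_grammatical("John wanted it."))   # accepted by the toy grammar
print(is_grammatical("Wanted John it."))   # rejected by the toy grammar
```

Matching whole sentences against a fixed set of patterns is the simplest possible stand-in for syntactic analysis; it shows the idea of rejecting ill-formed word sequences but builds no structure relating the words to each other.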

Practical Applications

We are going to look at some practical applications of natural language processing:

    • Machine Translation
    • Voice Interface for Humanoids
    • Database Access
    • Text Interpretation
      • information retrieval
      • text categorization
      • extracting data from text

Machine Translation

        • Correct translation requires an in-depth understanding of both natural languages, since the structure of expressions varies from one natural language to another.
        • Yehoshua Bar-Hillel argued in the 1960s that fully automatic, high-quality machine translation was impossible (the Bar-Hillel paradox):
        • Human analysis of a message relies to some extent on information that is not present in the words that make up the message.
        • “The pen is in the box”
        • [i.e. the writing instrument is in the container]
        • “The box is in the pen”
        • [i.e. the container is in the playpen or the pigpen]
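
The pen/box example above can be made concrete with a small sketch. The idea is that choosing the right sense of “pen” needs world knowledge (here, rough object sizes) that is not in the sentence itself. The size table and sense labels below are invented purely for illustration; they are assumptions, not data from any real lexicon.

```python
# Hypothetical world knowledge: rough typical sizes in centimetres (assumed values).
TYPICAL_SIZE_CM = {
    "box": 40,
    "pen (writing instrument)": 15,
    "pen (enclosure)": 200,
}

# The two senses of "pen" discussed in the text.
PEN_SENSES = ["pen (writing instrument)", "pen (enclosure)"]

def senses(word):
    """Return the candidate senses of a word; only 'pen' is ambiguous in this toy."""
    return PEN_SENSES if word == "pen" else [word]

def interpret(contained, container):
    """Keep the sense pairs in which the contained object is smaller than the container."""
    return [
        (inner, outer)
        for inner in senses(contained)
        for outer in senses(container)
        if TYPICAL_SIZE_CM[inner] < TYPICAL_SIZE_CM[outer]
    ]

print(interpret("pen", "box"))  # "The pen is in the box"
print(interpret("box", "pen"))  # "The box is in the pen"
```

A human reader applies this kind of size reasoning without noticing. A translation system has to be given it explicitly, for every pair of objects it might meet, which is the heart of Bar-Hillel's objection.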

        • Examples of poor machine translations would include:
          • “The spirit is willing, but the flesh is weak” was translated literally as “the vodka is strong, but the meat is rotten”
          • “Out of sight, out of mind” was translated as “Invisible, insane”
          • “hydraulic ram” was translated as “male water sheep”
          • These examples do not imply that machine translation is a waste of time.
          • Some mistakes are inevitable regardless of the quality and sophistication of the system.
          • Human translators also make mistakes.

There is a substantial start-up cost to any machine translation effort. To achieve broad coverage, translation systems should have lexicons of 20,000 to 100,000 words and grammars of 100 to 10,000 rules (depending on the choice of formalism).
