Sunday, January 20, 2008

Natural Language Processing (Part 2)

Database Access

The first major success of natural language processing!

There was hope that databases could be queried in natural language instead of with complicated data-retrieval commands. This was a pressing problem in the early 1970s, since the staff in charge of data retrieval could not keep up with users' demand for data.

The LUNAR system, built by William Woods in 1973 for the NASA Manned Spacecraft Center, was the first such interface. It was able to correctly answer 78% of questions such as: “What is the average modal plagioclase concentration for lunar samples that contain rubidium?”

  • Other examples of data retrieval systems include:
    • CHAT system
      • developed by Fernando Pereira in 1983
      • similar level of complexity to LUNAR system
      • worked on geographical databases
      • was restricted
        • question wording was very important
    • TEAM system
      • could handle a wider set of problems than CHAT
      • was still restricted and unable to handle all types of input

Text Interpretation

  • In the early 1980s, most online information was stored in databases and spreadsheets
  • Now, most online information is text: email, news, journals, articles, books, encyclopedias, reports, essays, etc.
    • there is a need to sort this information to reduce it to some comprehensible amount
  • Text interpretation has become a major field in natural language processing
    • becoming more and more important with expansion of the Internet
    • consists of:
      • information retrieval
      • text categorization
      • data extraction

Information Retrieval

        • Information Retrieval (IR) is closely related to, and here used interchangeably with, Information Extraction (IE)
        • Information retrieval systems analyze unrestricted text in order to extract specific types of information
        • IR systems do not attempt to understand all of the text in all of the documents, but they do analyze those portions of each document that contain relevant information
        • relevance is determined by pre-defined domain guidelines which must specify, as accurately as possible, exactly what types of information the system is expected to find
        • a user query is a good example of such a pre-defined domain specification
        • documents that contain relevant information are retrieved while others are ignored
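The idea of pre-defined domain guidelines can be sketched in a few lines. The patterns, documents, and threshold below are all invented for illustration; a real system's guidelines would be far richer:

```python
import re

# Hypothetical domain guideline: patterns that signal relevance
# to an "oil industry" topic area (entries invented for illustration).
DOMAIN_PATTERNS = [r"\bcrude\b", r"\bbarrels?\b", r"\bOPEC\b", r"\brefiner(y|ies)\b"]

def is_relevant(document, patterns=DOMAIN_PATTERNS, threshold=2):
    """A document counts as relevant if enough domain patterns match it."""
    hits = sum(1 for p in patterns if re.search(p, document, re.IGNORECASE))
    return hits >= threshold

docs = [
    "OPEC agreed to cut crude output by a million barrels a day.",
    "The local theater staged a new production of Hamlet.",
]
relevant = [d for d in docs if is_relevant(d)]
```

Only the first document clears the threshold; the second is ignored, just as the bullet above describes.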

Example: Commercial System (HIGHLIGHT):

It helps users find relevant information in large volumes of text and presents it in a structured fashion.

It can extract information from newswire reports for a specific topic area - such as global banking, or the oil industry - as well as current and historical financial and other data.

Although its accuracy will never match the decision-making skills of a trained human expert, HIGHLIGHT can process large amounts of text very quickly, allowing users to discover more information than even the most highly trained professional would have time to look for.

see Demo at: http://www.cgi.cam.sri.com/highlight/

It could be classified under “Extracting Data From Text”

Text Categorization

It is often desirable to sort all text into several categories

There are a number of companies that provide their subscribers access to all news on a particular industry, company, or geographic area.

    • traditionally, human experts were used to assign the categories
    • in the last few years, NLP systems have proven very accurate (correctly categorizing over 90% of the news stories)

The context in which text appears is very important, since the same word can be categorized completely differently depending on the context.

    • Example: in a dictionary, the primary definition of the word “crude” is vulgar, but in a large sample of the Wall Street Journal, “crude” refers to oil 100% of the time.
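A minimal sketch of context-driven categorization, assuming tiny hand-written category profiles (all vocabulary below is invented for illustration; real systems learn these from large corpora):

```python
# Toy category profiles: words that tend to co-occur with each
# category (invented for illustration only).
CATEGORY_WORDS = {
    "oil industry": {"crude", "oil", "barrel", "opec", "prices"},
    "banking":      {"bank", "loan", "interest", "deposit", "rates"},
}

def categorize(story):
    """Pick the category whose profile overlaps the story's words most."""
    words = set(story.lower().split())
    scores = {cat: len(words & vocab) for cat, vocab in CATEGORY_WORDS.items()}
    return max(scores, key=scores.get)

story = "crude prices rose after opec trimmed oil output"
```

Here the surrounding words ("opec", "oil", "prices") push "crude" toward the oil-industry reading, illustrating why context, not the dictionary's primary definition, decides the category.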

Extracting Data From Text

The task of data extraction is to take online text and derive from it assertions that can be put into a structured database.

Examples of data extraction systems include:

  • SCISOR system

SCISOR is able to take stock information text (such as the type released by Dow Jones News Service) and extract important stock information pertaining to:

  • events that took place
  • companies involved
  • starting share prices
  • quantity of shares that changed hands
  • effect on stock prices
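A SCISOR-style system fills slots in a structured record from free text. The newswire sentence, patterns, and field names below are all invented for illustration:

```python
import re

# A hypothetical one-sentence newswire item (invented for illustration).
report = "Acme Corp shares opened at $42.50 and 1,200,000 shares changed hands."

# Simple patterns for two of the fields such a system might fill.
price = re.search(r"opened at \$([\d.]+)", report)
volume = re.search(r"([\d,]+) shares changed hands", report)

record = {
    "company": report.split(" shares")[0],  # crude heuristic for the company name
    "opening_price": float(price.group(1)) if price else None,
    "shares_traded": int(volume.group(1).replace(",", "")) if volume else None,
}
```

The resulting dictionary is exactly the kind of assertion that can be inserted into a database table, turning unstructured text into queryable data.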

Natural Language Processing (Part 1)

Natural Language Processing (NLP) can be divided into two categories:

    • processing written text
    • processing spoken language

Steps in NLP

Roughly, we can break the process down into the following five components:

  • Morphological Analysis: Individual words are analyzed into their components, and non-word tokens, such as punctuation, are separated out. For spoken language, phonetics is considered at this phase.
  • Syntactic Analysis: Linear sequences of words are transformed into structures that show how the words relate to each other. Some word sequences may be rejected if they violate the language’s rules for how words may be combined.
  • Semantic Analysis: A mapping is made between the syntactic structures and objects in the task domain. Structures for which no such mapping is possible may be rejected.
  • Discourse Integration: The meaning of an individual sentence may depend on the sentences that precede it. In this phase, a sentence is interpreted using the information that precedes it: e.g., in “John wanted it.”, the pronoun “it” can only be resolved from the prior discourse, and a follow-up like “He always had.” requires information from previous sentences to be understood at all.
  • Pragmatic Analysis: The structure representing what was said is reinterpreted to determine what was actually meant. For example, the sentence “Do you know the route?” is usually a request to be told the route, not a literal yes/no question.
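The first two phases above can be sketched very roughly. This is a toy illustration, not a real analyzer: the lexicon is invented, and the "syntactic" step is a stand-in that merely rejects sequences containing unknown words:

```python
import re

# Tiny hand-written lexicon (entries invented for illustration).
LEXICON = {"john", "wanted", "it", "he", "always", "had"}

def morphological_analysis(sentence):
    """Separate words from non-word tokens such as punctuation."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def syntactic_analysis(tokens):
    """Toy stand-in for a parser: reject sequences with words
    that are not in the lexicon, otherwise return a structure."""
    words = [t for t in tokens if t.isalpha()]
    if any(w.lower() not in LEXICON for w in words):
        return None              # sequence rejected
    return {"tokens": tokens}    # placeholder for a real parse tree

tokens = morphological_analysis("John wanted it.")
```

A real syntactic analyzer would of course build a parse tree and check grammar rules, not just lexicon membership, but the shape of the pipeline (analyze tokens, then accept or reject structures) is the same.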

Practical Applications

We are going to look at some practical applications of natural language processing:

    • Machine Translation
    • Voice Interface for Humanoids
    • Database Access
    • Text Interpretation
      • information retrieval
      • text categorization
      • extracting data from text

Machine Translation

        • Correct translation requires an in-depth understanding of both natural languages, since the structure of expressions varies from one natural language to another
        • Yehoshua Bar-Hillel argued in the 1960s that fully automatic machine translation was impossible (the Bar-Hillel paradox):
        • human analysis of a message relies to some extent on information which is not present in the words that make up the message
        • “The pen is in the box”
        • [i.e. the writing instrument is in the container]
        • “The box is in the pen”
        • [i.e. the container is in the playpen or the pigpen]
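Bar-Hillel's point can be made concrete with a naive word-for-word translator. The dictionary below is a toy (entries invented for illustration; real French would also need agreement and contraction):

```python
from itertools import product

# Toy bilingual dictionary (invented for illustration).
DICTIONARY = {
    "the": ["le"],
    "box": ["boîte"],
    "is": ["est"],
    "in": ["dans"],
    "pen": ["stylo", "enclos"],  # writing instrument vs. enclosure
}

def translate_word_by_word(sentence):
    """Return every word-for-word translation a naive system could emit."""
    choices = [DICTIONARY[w] for w in sentence.lower().split()]
    return [" ".join(combo) for combo in product(*choices)]

candidates = translate_word_by_word("The box is in the pen")
```

The system produces two candidates, one per sense of "pen", and nothing in the words themselves tells it which to pick; choosing correctly requires exactly the world knowledge Bar-Hillel pointed to.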

        • Examples of poor machine translations would include:
          • "the spirit is strong, but the body is weak" was translated literally as "the vodka is strong but the meat is rotten”
          • "Out of sight, out of mind” was translated as "Invisible, insane”
          • "hydraulic ram” was translated as "male water sheep”
          • these examples do not imply that machine translation is a waste of time
          • some mistakes are inevitable regardless of the quality and sophistication of the system
          • one has to realize that human translators also make mistakes

There is a substantial start-up cost to any machine translation effort. To achieve broad coverage, translation systems should have lexicons of 20,000 to 100,000 words and grammars of 100 to 10,000 rules (depending on the choice of formalism).