Sunday, January 20, 2008

Natural Language Processing (Part 2)

Database Access

Database access was the first major success of natural language processing.

There was a hope that databases could be controlled with natural language instead of complicated data retrieval commands. This was a major concern in the early 1970s, since the staff in charge of data retrieval could not keep up with users' demand for data.

The LUNAR system, built by William Woods in 1973 for the NASA Manned Spacecraft Center, was the first such interface. It was able to correctly answer 78% of questions such as: “What is the average modal plagioclase concentration for lunar samples that contain rubidium?”

  • Other examples of data retrieval systems include:
    • CHAT system
      • developed by Fernando Pereira in 1983
      • similar level of complexity to LUNAR system
      • worked on geographical databases
      • was restricted in the questions it could handle
        • the exact wording of a question was very important
    • TEAM system
      • could handle a wider set of problems than CHAT
      • was still restricted and unable to handle all types of input
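
To make the idea of a restricted natural language interface concrete, here is a minimal sketch in Python. It is not how LUNAR, CHAT or TEAM actually worked: the sample table, the `answer` function and the single question template are all assumptions made up for illustration. It does show why question wording mattered so much in such systems, because anything that does not fit the template is simply rejected.

```python
import re

# Hypothetical toy "database" of lunar samples (all values are invented).
SAMPLES = [
    {"id": "S1", "plagioclase": 20.0, "elements": {"rubidium", "iron"}},
    {"id": "S2", "plagioclase": 30.0, "elements": {"iron"}},
    {"id": "S3", "plagioclase": 40.0, "elements": {"rubidium"}},
]

# Only these fields can be averaged.
NUMERIC_FIELDS = {"plagioclase"}

# One fixed question template; any other wording is not understood.
PATTERN = re.compile(
    r"what is the average (\w+) concentration "
    r"for samples that contain (\w+)\??$",
    re.IGNORECASE,
)

def answer(question):
    """Translate a restricted English question into a filter plus an average."""
    match = PATTERN.match(question.strip())
    if match is None:
        return None  # wording falls outside the template
    field = match.group(1).lower()
    element = match.group(2).lower()
    if field not in NUMERIC_FIELDS:
        return None
    values = [s[field] for s in SAMPLES if element in s["elements"]]
    if not values:
        return None
    return sum(values) / len(values)
```

Calling `answer("What is the average plagioclase concentration for samples that contain rubidium?")` averages the two rubidium-bearing samples, while a rephrased question such as "Tell me about rubidium" returns `None`.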

Text Interpretation

  • In the early 1980s, most online information was stored in databases and spreadsheets
  • Now, most online information is text: email, news, journals, articles, books, encyclopedias, reports, essays, etc.
    • there is a need to sort this information and reduce it to a manageable amount
  • Text interpretation has become a major field in natural language processing
    • becoming more and more important with the expansion of the Internet
    • consists of:
      • information retrieval
      • text categorization
      • data extraction

Information Retrieval

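        • Information Retrieval (IR) is closely related to Information Extraction (IE), and the two are treated together here, although strictly IR finds relevant documents while IE pulls specific facts out of them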
        • Information retrieval systems analyze unrestricted text in order to extract specific types of information
        • IR systems do not attempt to understand all of the text in all of the documents, but they do analyze those portions of each document that contain relevant information
        • relevance is determined by pre-defined domain guidelines which must specify, as accurately as possible, exactly what types of information the system is expected to find
        • a query is a good example of such a pre-defined guideline
        • documents that contain relevant information are retrieved while others are ignored
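
The points above can be sketched in a few lines of Python. This is only a toy, assuming that the "pre-defined guideline" is nothing more than a set of keywords; the function names `relevant_sentences` and `retrieve` are made up for this example and do not come from any real system. Only the sentences that mention a keyword are analyzed further, and documents with no such sentence are ignored.

```python
import re

def relevant_sentences(document, keywords):
    """Return the sentences of `document` that mention any keyword."""
    wanted = {k.lower() for k in keywords}
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    hits = []
    for sentence in sentences:
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        if words & wanted:
            hits.append(sentence)
    return hits

def retrieve(documents, keywords):
    """Keep documents with at least one relevant sentence; ignore the rest."""
    results = {}
    for name, text in documents.items():
        hits = relevant_sentences(text, keywords)
        if hits:
            results[name] = hits
    return results

# Usage with two invented documents and an invented guideline.
docs = {
    "a": "Oil prices rose today. The weather was mild.",
    "b": "The team won the match.",
}
found = retrieve(docs, {"oil", "bank"})
```

Here `found` contains only document "a", and only its first sentence. A real system would use far richer guidelines than a keyword set, but the shape of the task is the same.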

Example: Commercial System (HIGHLIGHT):

It helps users find relevant information in large volumes of text and presents it in a structured fashion.

It can extract information from newswire reports for a specific topic area - such as global banking, or the oil industry - as well as current and historical financial and other data.

Although its accuracy will never match the decision-making skills of a trained human expert, HIGHLIGHT can process large amounts of text very quickly, allowing users to discover more information than even the most highly trained professional would have time to look for.

see Demo at: http://www.cgi.cam.sri.com/highlight/

It could be classified under “Extracting Data From Text”

Text Categorization

It is often desirable to sort all text into several categories

There are a number of companies that provide their subscribers access to all news on a particular industry, company or geographic area

    • traditionally, human experts were used to assign the categories
    • in the last few years, NLP systems have proven very accurate (correctly categorizing over 90% of the news stories)

The context in which a text appears is very important, since the same word could be categorized completely differently depending on the context

    • Example: in a dictionary, the primary definition of the word “crude” is vulgar, but in a large sample of the Wall Street Journal, “crude” refers to oil 100% of the time.
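
As a rough illustration of text categorization, here is a minimal keyword-counting sketch in Python. It is much simpler than the systems described above, and the category vocabularies and the `categorize` function are assumptions invented for this example. Note how "crude" sits in the oil vocabulary: for a financial news feed that is the sense that matters, whatever the dictionary says.

```python
import re
from collections import Counter

# Hypothetical category vocabularies for a financial news feed.
CATEGORY_WORDS = {
    "oil": {"crude", "barrel", "opec", "refinery", "drilling"},
    "banking": {"bank", "loan", "interest", "deposit", "credit"},
}

def categorize(text):
    """Return the category whose vocabulary matches the most tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    scores = Counter()
    for category, vocabulary in CATEGORY_WORDS.items():
        scores[category] = sum(1 for token in tokens if token in vocabulary)
    best, best_score = scores.most_common(1)[0]
    # No vocabulary word found at all: leave the story uncategorized.
    return best if best_score > 0 else None
```

For example, `categorize("Crude prices rose as OPEC cut output per barrel")` picks the oil category, while a story with no vocabulary words at all is left uncategorized.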

Data Extraction

The task of data extraction is to take on-line text and derive from it some assertions that can be put into a structured database.

Examples of data extraction systems include:

  • SCISOR system

SCISOR is able to take stock information text (such as the type released by Dow Jones News Service) and extract important stock information pertaining to:

  • events that took place
  • companies involved
  • starting share prices
  • quantity of shares that changed hands
  • effect on stock prices
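
The kind of output listed above can be illustrated with a small Python sketch that turns one sentence of stock news into a structured record. This is not SCISOR's method; it assumes a single invented sentence format, and the pattern, the field names and the `extract` function are all hypothetical.

```python
import re

# Assumed sentence shape (invented for this example):
#   "<Company> shares opened at <price> and rose|fell <amount>
#    as <volume> shares changed hands"
PATTERN = re.compile(
    r"(?P<company>[A-Z][\w&]*(?: [A-Z][\w&]*)*) shares "
    r"opened at (?P<start>\d+(?:\.\d+)?) "
    r"and (?P<direction>rose|fell) (?P<change>\d+(?:\.\d+)?) "
    r"as (?P<volume>\d[\d,]*) shares changed hands"
)

def extract(text):
    """Return a dict of stock facts found in `text`, or None if no match."""
    match = PATTERN.search(text)
    if match is None:
        return None
    change = float(match.group("change"))
    if match.group("direction") == "fell":
        change = -change  # a fall is recorded as a negative effect
    return {
        "company": match.group("company"),
        "start_price": float(match.group("start")),
        "change": change,
        "volume": int(match.group("volume").replace(",", "")),
    }

# Usage with an invented news sentence.
record = extract(
    "Acme Corp shares opened at 47.50 and fell 2.50 "
    "as 1,200,000 shares changed hands."
)
```

The resulting `record` is a plain dictionary that could be inserted as a row into a database table, which is the point of data extraction: free text goes in, structured assertions come out.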
