Using the huge volume of language data created by the spread of the Web
to build a mechanism for “natural acquisition” of knowledge by computers
With the exponential spread of personal computers and the internet, society and computers have become inseparable. As you know, the "language" of computers consists of combinations of "0s" and "1s," whereas the languages of human society are complex and diverse. Professor Sadao Kurohashi of the Kyoto University Graduate School of Informatics is building a mechanism that enables computers to acquire the languages of human society (natural languages) as knowledge. Conventional machine translation required humans to feed basic knowledge into the computer in advance. Professor Kurohashi is instead developing a mechanism by which computers automatically acquire language knowledge from the huge volume of language data now available. Because human assistance is minimized, acquisition is fast and accuracy is high. The following paragraphs introduce the concept and method.
Natural language is the language used to convey or communicate thoughts and information in human society within a cultural context. The adjective "natural" distinguishes this type of language from formal or artificial languages such as those used for mathematics, programming, etc.
Computers ‘learn’ from the huge volume of data on the Web
Back in the days when little data was available from which computers could obtain knowledge, humans had to feed every grammar and translation rule into the computer, which limited how far the computer's knowledge could expand. With the emergence of the Web, however, a huge volume of language data (called a "text corpus") has become available. From such widely circulated data, knowledge of how words are used, which words are closely related, how they correspond to words of other languages, and so on can be acquired with high accuracy. Of course, the text corpus contains misuses and mistranslations. Even so, as the number of samples increases, so does the likelihood of encountering the correct usage, and therefore the probability that the computer acquires correct knowledge rises. The power of a large corpus is demonstrated particularly in the process of understanding an unknown word.
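The statistical idea above can be illustrated with a toy sketch: even when a corpus contains misuses, simple frequency counting lets the majority (correct) usage dominate. The example data and the counting approach here are illustrative assumptions, not Professor Kurohashi's actual system.

```python
from collections import Counter

# A toy "corpus" of observed collocations for one word sense.
# Most occurrences are the natural usage; a few are misuses.
corpus_observations = [
    "strong coffee", "strong coffee", "strong coffee",
    "powerful coffee",  # a less natural collocation (misuse)
    "strong coffee", "strong coffee",
]

# Counting occurrences: as samples accumulate, the correct usage
# becomes the clear majority, so the computer can trust it.
counts = Counter(corpus_observations)
best, freq = counts.most_common(1)[0]
print(best)  # the most frequent, hence most likely correct, usage
```

With a handful of samples the misuse could win by chance; with Web-scale data, the correct usage almost always dominates.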
Consider the word "Google," for example. Using conventional knowledge, the computer might translate it initially as the name of the internet search engine. However, if the computer encounters many other usages of "Google," including "without Googling it," "let's Google it," "to Google," "if we Google it" and so on, the computer 'learns' that "Google" can also be used as a verb. This is a simple illustration of "natural acquisition." R&D focuses on a program that gives computers this power of "independent learning."
A vast, structured collection of natural language sentences. In the computing world, the huge volume of language data on the Web is called the "text corpus."
Building and developing a more convenient search system
Professor Kurohashi's group is developing and providing services based on the information analysis system WISDOM (Web Information Sensibly and Discreetly Ordered and Marshaled), a project of the National Institute of Information and Communications Technology (NICT), an independent administrative agency. WISDOM (http://wisdom-nict.jp/) is available to individuals for non-commercial purposes; that is, its use is limited to research and development. Although WISDOM resembles search engines such as Yahoo! and Google, it is different in nature: conventional search engines search for sites based on input keywords, whereas WISDOM finds sites that provide the user with more useful and diversified information related to the request.
For example, suppose the search sentence "Bioethanol is good for the environment" is entered in the search window. Rather than searching with isolated keywords, the information request is made as a direct statement. Conventional search engines pick up only sites using the sentence exactly as written (disregarding minor variations, such as internal punctuation marks). In contrast, WISDOM searches for sites containing assertions or discussion relating to the sentence. WISDOM also offers convenient additional functions, including display of related keywords, information on the affiliations of each statement's originator, and color-coding of affirmative and negative statements relating to the search sentence.
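The color-coding of affirmative and negative statements can be sketched as a toy stance classifier. Real stance analysis is far more sophisticated; the cue phrases and the bioethanol statements below are illustrative assumptions, not WISDOM's actual method or output.

```python
# Toy stance labeling: count positive and negative cue phrases in a
# statement about the query and pick the majority label.
POSITIVE_CUES = ("promising", "clean", "reduces emissions", "beneficial")
NEGATIVE_CUES = ("harmful", "raises food prices", "doubtful", "wasteful")

def classify_stance(statement: str) -> str:
    text = statement.lower()
    pos = sum(cue in text for cue in POSITIVE_CUES)
    neg = sum(cue in text for cue in NEGATIVE_CUES)
    if pos > neg:
        return "affirmative"  # would be shown in one color
    if neg > pos:
        return "negative"     # would be shown in another color
    return "neutral"

print(classify_stance("Bioethanol is a promising, clean fuel."))
print(classify_stance("Bioethanol production raises food prices."))
```

A system like this lets a user see at a glance whether the Web's discussion of "Bioethanol is good for the environment" leans toward agreement or disagreement.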
This kind of information search becomes possible only with a mechanism enabling computers to “naturally acquire” knowledge from a huge text corpus.
Professor Kurohashi notes: "The more information people put onto the Web, the greater the strength of this mechanism. For computers, coming into contact with many and diverse corpora is the most effective means of naturally acquiring knowledge. In that sense, perhaps this mechanism is a natural creation of modern society."
(Kaniwa Hioki November 26, 2009)