The amount of electronically stored information in the Thai language has grown rapidly in the past few years, and the number of such documents is still increasing. This makes information extraction (IE), and in particular the extraction of keywords from Thai texts, an essential task. Thai is a non-delimited language: its written form is a string of symbols with no explicit word delimiters, so words are not naturally separated from one another. Because of this characteristic, word segmentation is a challenging task and has become an important research topic. Many word segmentation techniques have been proposed to segment Thai texts into sets of words in support of keyword extraction. However, most of these approaches require complex language analysis, or they rely on a dictionary or corpus. In this paper, we propose an alternative method for extracting important keywords. The approach looks for long and frequent substrings in the given texts rather than for individual words. As a result, it is language-independent and does not rely on a dictionary or on language analysis. We refer to this technique as frequent max substring mining, or the FM technique. Applying the FM technique to Thai texts yields a set of keywords that are frequent and highly distinct in the given texts. The set of keywords extracted by the FM technique covers all frequent substrings without information loss. The technique therefore uses less space than storing all frequent substrings, which helps it keep pace with the growth of Thai electronic information.
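To make the idea of "long and frequent substrings" concrete, the following is a minimal brute-force sketch, not the FM technique itself. It assumes that a substring is frequent when it occurs at least `min_freq` times (overlapping occurrences counted) and that a frequent max substring is a frequent substring not contained in any longer frequent substring; the function name `frequent_max_substrings` and its parameters are hypothetical and chosen for illustration only.

```python
from collections import Counter


def frequent_max_substrings(text, min_freq=2, max_len=None):
    """Return {substring: count} for frequent substrings that are maximal.

    Assumption: "maximal" means not contained in any longer frequent
    substring. This is an illustrative reading of the abstract, and the
    enumeration below is quadratic in the text length.
    """
    n = len(text)
    max_len = max_len or n

    # Count every substring of length 1..max_len (overlaps included).
    counts = Counter(
        text[i:j]
        for i in range(n)
        for j in range(i + 1, min(n, i + max_len) + 1)
    )

    # Keep substrings that occur at least min_freq times.
    frequent = {s for s, c in counts.items() if c >= min_freq}

    # Drop any frequent substring contained in a longer frequent one.
    maximal = {
        s for s in frequent
        if not any(s != t and s in t for t in frequent)
    }
    return {s: counts[s] for s in maximal}


print(frequent_max_substrings("abcabcxbc", min_freq=2))  # {'abc': 2}
```

Because the sketch works on raw character sequences, it needs no dictionary or word segmentation and can be applied to a Thai string unchanged. Under the stated assumption, every frequent substring is a substring of some returned entry, which is why only the maximal ones need to be stored.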