Department of TAOY KemGUKI

Information retrieval thesauri:

structure, purpose and development procedure

1. Thesaurus as a way of systematized representation of knowledge and a kind of ideographic dictionary.

2. Information retrieval thesauri: essence and purpose

3. Structure of the IPT

4. The procedure for the development, examination, registration and maintenance of IPT.

Bibliography

1. GOST 7.74-96. Information retrieval languages. Terms and definitions [Text]. - Introduced 1997-07-01. - Minsk: Interstate Council for Standardization, Metrology and Certification, 1997. - 34 p. (System of standards on information, librarianship and publishing) TC 191.

2. GOST 7.25-2001. Monolingual information retrieval thesaurus. Development rules, structure, composition and presentation form [Text]. - Instead of GOST 7.25-80; introduced 2002-07-01. - M.: IPK Publishing house of standards, 2001. - 16 p. MTK 191.

3. GOST 7.24-2007. Multilingual information retrieval thesaurus. Composition, structure and basic requirements for construction. - Instead of GOST 7.24-90; introduced 2008-07-01. / Interstate Council for Standardization, Metrology and Certification. - M.: Standartinform, 2008. - 7 p. (System of standards on information, librarianship and publishing)

4. Baranov, O. S. Ideographic dictionary of the Russian language / O. S. Baranov. - M.: ETS Publishing House, 1995. - 820 p.

5. Zhmailo, S. V. On the definition of the thesaurus [Text] / S. V. Zhmailo // NTI. Ser. 1. Organization and methodology of information work. - 2003. - No. 12. - P. 20-25.

6. Zhmailo, S. V. Development of modern information retrieval thesauri [Text] / S. V. Zhmailo // NTI. Ser. 1. Organization and methodology of information work. - 2004. - No. 1. - P. 23-31.

For example, the ideographic dictionary of the Russian language by O. S. Baranov (4) distinguishes 12 top-level sections, among them "order", "nature", "activity" and "culture", each of which is divided into groups, subgroups, departments and sections. All words in this dictionary are grouped by meaning into nests, each built around a concept with which the words are most often linked by species relations. Nests are in turn grouped into subsections, and so on. At present the dictionary contains 5923 nests and 7 levels of division (according to www.rifmovnik.ru/thesaurus.htm as of February 16, 2010). Here is an example of an entry from this dictionary:

178.4.7 aroma ▲ - a pleasant smell (for example, the smell of flowers, grass, hay. gentle #. intoxicating #). aromatization . . . ambre. incense.

The code of the word "aroma" reflects the ideographic classification adopted in this dictionary, in particular the assignment of this word to the category "178 - Sensations".

Thus, the terms "thesaurus", "ideographic dictionary", "thesaurus-type dictionary" first of all mean that the totality of the words of the language is presented in them in such a way that one group of words includes words that are similar in meaning. The main purpose of ideographic dictionaries is a collection of lexical units united by a common concept; this makes it easier for the reader to find the most appropriate means for adequate expression of thought and promotes active command of the language.

From the history of thesauri

JACKETS 2302

in Suits

Coat products

Sewing products

n Double-breasted jacket

Combined jacket

Sports jacket

in Packing measures

Remaining material

Waste material

Lexical note;

Ascriptors or descriptors-synonyms;

Superior descriptors;

Downstream descriptors;

Associative descriptors;

Descriptors linked by other kinds of relationships.

Within each group of LUs associated with a head descriptor by one kind of paradigmatic relationship, there must be an alphabetical order of arrangement. For example:

ALGORITHMIC LANGUAGES

with algorithmic languages

machine-oriented languages

domain-specific languages

in SOFTWARE

FORMAL LANGUAGES

n AUTOCODES

a ALGORITHMS

PROGRAMMING cf. artificial languages
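The ordering rule for a descriptor entry can be sketched in code. This is a minimal illustration, not the layout prescribed by GOST 7.25-2001: the class name, the field names and the one-letter markers (`s`, `b`, `n`, `a`) are assumptions made for the example, and the sample terms are taken from the entry above.

```python
from dataclasses import dataclass, field


@dataclass
class DescriptorEntry:
    """One dictionary entry: a head descriptor plus its relation groups.

    Field names and markers are hypothetical, chosen for this sketch.
    """
    head: str
    synonyms: list[str] = field(default_factory=list)     # ascriptors
    broader: list[str] = field(default_factory=list)      # superior descriptors
    narrower: list[str] = field(default_factory=list)     # downstream descriptors
    associative: list[str] = field(default_factory=list)  # associative descriptors

    def render(self) -> str:
        """Print the entry with each relation group sorted alphabetically."""
        lines = [self.head.upper()]
        groups = [
            ("s", [t.lower() for t in self.synonyms]),   # ascriptors in lower case
            ("b", [t.upper() for t in self.broader]),
            ("n", [t.upper() for t in self.narrower]),
            ("a", [t.upper() for t in self.associative]),
        ]
        for marker, terms in groups:
            for i, term in enumerate(sorted(terms, key=str.casefold)):
                # The marker is printed once, on the first line of its group.
                prefix = marker if i == 0 else " "
                lines.append(f"{prefix} {term}")
        return "\n".join(lines)


entry = DescriptorEntry(
    head="Algorithmic languages",
    synonyms=["machine-oriented languages", "domain-specific languages"],
    broader=["Software", "Formal languages"],
    narrower=["Autocodes"],
    associative=["Programming", "Algorithms"],
)
rendered = entry.render()
print(rendered)
```

Sorting inside `render` rather than at input time means the stored lists can be edited freely during maintenance while the printed entry always satisfies the alphabetical rule.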

An ascriptor entry consists of an ascriptor and descriptors or a combination of descriptors that replace it when processing and searching for information. Here are examples of ascriptor articles:

Alphanumeric characters

use FORMAL LANGUAGES

NATURAL LANGUAGES

see ALGORITHMIC LANGUAGES

A dictionary entry may also include:

How often the descriptor is used;

Descriptor code number;

Descriptor code according to the systematic index;

Classification indices;

Additional semantic and lexicographic marks;

foreign equivalents.

The quality of a lexico-semantic index is determined by the completeness of the lexical units included in it. Completeness is understood as the probability that any informatively meaningful word of a given subject area is included in the thesaurus. The completeness of the lexico-semantic index, and consequently of the entire thesaurus, has a significant effect on the results of indexing documents and queries.
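One rough way to put a number on completeness is to take a sample of meaningful terms from texts of the subject area and measure what share of them the thesaurus contains. The sketch below assumes this relative-frequency estimate; the function name and the sample data are hypothetical and are not taken from the standards.

```python
def estimated_completeness(sample_terms, thesaurus_terms):
    """Share of sampled subject-area terms that are present in the thesaurus.

    This is a relative-frequency estimate of the probability described above,
    so its accuracy depends on how representative the sample is.
    """
    vocabulary = {term.casefold() for term in thesaurus_terms}
    sample = [term.casefold() for term in sample_terms]
    if not sample:
        raise ValueError("the sample of terms must not be empty")
    hits = sum(term in vocabulary for term in sample)
    return hits / len(sample)


thesaurus_terms = ["AUTOCODES", "ALGORITHMS", "SOFTWARE"]
sample_terms = ["Autocodes", "algorithms", "compilers", "software"]
completeness = estimated_completeness(sample_terms, thesaurus_terms)
print(completeness)
```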

Additional parts may include systematic, permutational, hierarchical and other indexes and lists of special categories of lexical units.

A systematic index is an index in which descriptors are grouped according to the headings accepted in the IPT. A systematic index defines the thematic direction of the thesaurus, reveals its content and shows which branches of science and technology can be searched, and at what depth of detail. It is needed as part of the IPT because it gives a visual picture of the general state of terminology in a particular field of knowledge, makes it possible to build a coherent terminological model and, as far as possible, to cover all the terms and concepts that should find a place in the thesaurus. It is intended to make terms easier to find when compiling search images of documents and queries, by ordering the set of descriptors and ascriptors by subject.

The systematic index, in essence, is a classification scheme for filling the thesaurus with terminology, since it is built by ordering a set of descriptors according to subject areas.

Systematic indexes of IPT are divided into three types:

Thematic,

Categorical,

Mixed.

This division reflects the principle of constructing the classification scheme of a systematic index.

The main functions performed by the systematic index of IPT:

Use as an indexing aid that supports the search for descriptors for concepts not explicitly represented in the thesaurus (search function);

Use in the process of maintaining a thesaurus (function of maintaining IPT);

Use as the structural basis of the IPT and as a guide to its development (constructive function).

In accordance with GOST 7.25-2001 (2), when constructing a systematic index of the thematic or mixed type, its thematic part should use the rubrics of the Interstate NTI rubricator or of a specific ASNTI rubricator compatible with it. When constructing a systematic index of the categorical or mixed type, its categorical part should use the following general categories:

Names of disciplines and branches of activity;

Items, materials;

Methods, processes, operations, phenomena;

Properties, values, parameters, characteristics;

Relationships, structures, models, laws, rules, abstract concepts.

Hierarchical index. A hierarchical index is an index made up of lists of descriptors, each list starting with a descriptor that has no parent. It reflects the complete structure of hierarchical relationships in the IPT. After each descriptor, its directly subordinate descriptors are given, with their level in the hierarchy indicated by numbering or by a graphic designation of the level.

A hierarchical index is needed because the dictionary entries of the IPT do not record the entire system of subordination of concepts: doing so would significantly enlarge the lexico-semantic index. Hence there is a need for an independent section of the IPT, a hierarchical index that reflects the entire chain of subordination of descriptors down to the lowest level.
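The construction described above can be sketched as a depth-first walk over the "directly subordinate" links. This is only an illustration under stated assumptions: the function name is hypothetical, the level is shown by leading dots (one possible graphic designation), and the sample links reuse descriptors from the earlier example entry.

```python
def hierarchical_index(narrower):
    """Build a hierarchical index as a list of printable lines.

    `narrower` maps a descriptor to its directly subordinate descriptors.
    Each list starts from a descriptor that has no parent.
    """
    children = {child for kids in narrower.values() for child in kids}
    all_terms = set(narrower) | children
    roots = sorted(all_terms - children)  # descriptors without a parent
    lines = []

    def walk(term, level, path):
        lines.append("." * level + term)  # dots mark the hierarchy level
        for child in sorted(narrower.get(term, [])):
            if child not in path:  # guard against accidental cycles
                walk(child, level + 1, path | {child})

    for root in roots:
        walk(root, 0, {root})
    return lines


links = {
    "SOFTWARE": ["ALGORITHMIC LANGUAGES"],
    "FORMAL LANGUAGES": ["ALGORITHMIC LANGUAGES"],
    "ALGORITHMIC LANGUAGES": ["AUTOCODES"],
}
index_lines = hierarchical_index(links)
print("\n".join(index_lines))
```

A descriptor with two parents, such as ALGORITHMIC LANGUAGES here, is printed under each of them, so every chain of subordination is shown down to the bottom.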

A permutational index is an index that lists in alphabetical order all the individual words occurring in the phrases that denote descriptors and, for each word, gives all the descriptors that contain it. Each term therefore appears in the permutational index as many times as it contains significant words. The purpose of the permutational index is to make descriptor phrases findable by any word they contain, including words that do not stand at the beginning of a lexical unit. It also brings words with a common root together in one place.

As a rule, a permutational index is compiled automatically and usually has the form of a KWIC index (Key Word In Context), in which all meaningful words of the terms are arranged in alphabetical order. Each keyword is placed in the center of the column, which is formed by the microcontexts of the term elements, and the part of a term that does not fit is carried over to the left side of the same line:

optical quantum

arousal

electrical

with dependent excitation

Interference Generators

SERIAL GENERATORS

DC GENERATORS

DC GENERATORS
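A KWIC-style permutational index of the kind described above can be sketched as follows. The stop-word list, the function name and the sample descriptors are assumptions for illustration; the wrap-around of parts that do not fit on the line is left out to keep the sketch short.

```python
STOP_WORDS = {"with", "of", "and", "the", "in"}  # assumed non-significant words


def permuted_index(descriptors):
    """Return KWIC lines: one line per significant word of each descriptor.

    The keyword is upper-cased and aligned in a single column; the words
    before it go to the left of the column, the words after it to the right.
    """
    rows = []
    for descriptor in descriptors:
        words = descriptor.split()
        for i, word in enumerate(words):
            if word.lower() in STOP_WORDS:
                continue
            left = " ".join(words[:i])
            right = " ".join(words[i + 1:])
            rows.append((word.lower(), left, word, right))
    # Alphabetical order by keyword, then by left and right context.
    rows.sort(key=lambda r: (r[0], r[1].lower(), r[3].lower()))
    width = max((len(r[1]) for r in rows), default=0)
    return [
        f"{left:>{width}} {keyword.upper()} {right}".rstrip()
        for _, left, keyword, right in rows
    ]


kwic = permuted_index(["DC generators", "Interference generators"])
print("\n".join(kwic))
```

Each two-word descriptor yields two lines, one under each of its significant words, which matches the statement that a term appears as many times as it has significant words.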

4. The procedure for the development, examination, registration and maintenance of IPT

Currently, the procedure for the development, examination and registration of IPT is determined by two standards: GOST 7.25-2001 “Monolingual information retrieval thesaurus. Development rules, structure, composition and presentation form” and GOST 7.24-2007 “Multilingual information retrieval thesaurus. Composition, structure and basic requirements for construction”. In accordance with these standards, the functions of examination and registration of IPT are performed by national and international depository funds.

The National Depository Fund of IPT in Russian (including IPT containing the equivalents of descriptors in Russian) is located at VINITI.

There are also two international depository funds of IPT:

1) the International Depository Fund of IPT in English, including IPT containing the equivalents of descriptors in English. It is located in Canada, in Toronto, in the library of the Faculty of Information Studies at the University of Toronto (Thesaurus Clearinghouse, The Library, Faculty of Information Studies, University of Toronto, TORONTO, Canada);

2) the International Depository Fund of IPT in all languages other than English. It is located in Poland, in Warsaw, at the Institute of Scientific, Technical and Economic Information (Instytut Informacji Naukowej, Technicznej i Ekonomicznej, Clearinghouse, WARSZAWA, Poland).

The full addresses of these organizations are given in GOST 7.25-2001.

GOST 7.25-2001 and GOST 7.24-2007 define the actions of IPT developers as follows:

1. Before starting work on an IPT, the developer must apply to the appropriate national or international depository fund to find out whether thesauri on the given topic are already registered. If such thesauri exist, the possibility of adopting them in the given system is assessed. If none are found, a new IPT may be created. The entire technology for creating the IPT must strictly comply with GOST 7.25-2001 and GOST 7.24-2007.

2. A finished (developed) IPT must undergo an examination for compliance with GOST 7.25-2001. If it meets the standard, it is registered and deposited in the relevant national depository fund or in one of the international depository funds (in Toronto or Warsaw).

National depository funds disseminate information on the composition of the fund of deposited IPTs and provide them to developers of new IPTs, so that elements can be borrowed and the linguistic support of different information systems stays compatible. Thus, they perform the functions of examination, registration and storage of IPTs and of providing information about available IPTs.

many operations for the management of IPT);

The transition of AIS from independent operation to network operation (when using IPT within the framework of a single principle of their maintenance, they must be agreed).

The process of keeping the IPT up and running is called maintaining or adjusting the thesaurus. It usually includes the following:

Changing the lexical composition of the IPT: introducing new lexical units, excluding existing ones, changing the status of lexical units (turning a keyword into a descriptor and vice versa);

Change of paradigmatic relations in IPT (strengthening, weakening);

Maintaining the IPT involves the mandatory use of automation tools that allow you to quickly perform such labor-intensive operations as alphabetical sorting of the dictionary, vocabulary, checking the reciprocity and consistency of references, with the help of which paradigmatic relations are fixed in the IPT, etc.
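One of the maintenance checks named above, the reciprocity of references, is easy to automate. The sketch below assumes a simple in-memory representation (descriptor, then relation name, then a set of target descriptors); the relation names, the `INVERSE` table and the function name are hypothetical.

```python
# Each relation and the relation expected in the opposite direction (assumed names).
INVERSE = {
    "broader": "narrower",
    "narrower": "broader",
    "associative": "associative",
}


def missing_reciprocals(thesaurus):
    """Find references that are not mirrored by a back-reference.

    Returns sorted tuples (entry, relation, target) describing the reference
    that should be added to `entry` to restore reciprocity.
    """
    problems = []
    for term, relations in thesaurus.items():
        for relation, targets in relations.items():
            inverse = INVERSE[relation]
            for target in targets:
                back_refs = thesaurus.get(target, {}).get(inverse, set())
                if term not in back_refs:
                    problems.append((target, inverse, term))
    return sorted(problems)


sample = {
    "ALGORITHMIC LANGUAGES": {
        "broader": {"FORMAL LANGUAGES"},
        "narrower": {"AUTOCODES"},
    },
    "FORMAL LANGUAGES": {"narrower": {"ALGORITHMIC LANGUAGES"}},
    "AUTOCODES": {},  # the back-reference to its broader descriptor is missing
}
problems = missing_reciprocals(sample)
print(problems)
```

Running such a check after every change to the lexical composition helps keep the paradigmatic relations consistent without rereading the whole dictionary by hand.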

A thesaurus is a dictionary that records the semantic relations (synonyms, antonyms, paronyms, hyponyms, hypernyms, etc.) between lexical units. Thesauri are one of the most effective tools for describing individual subject areas.

In the past, the term thesaurus mainly designated dictionaries that represent the vocabulary of a language as completely as possible, with examples of its use in texts.

The term thesaurus is also used in information theory to refer to the totality of all the information that a subject possesses.

In psychology, the thesaurus of an individual characterizes the perception and understanding of information. Communication theory also considers the general thesaurus of a complex system, through which its elements interact.

History

One of the first thesauri is said to be the Dictionary of Synonyms by Philo of Byblos. A closer match to the term is the Amarakosha, written in Sanskrit in verse form in the 6th century. The first modern English thesaurus was created by Peter Mark Roget in 1805. It was published in 1852 and has been in use ever since.

In the 1970s, thesauri began to be actively used for information retrieval tasks. In such thesauri, words are compared with descriptors through which semantic links are established.


Increasingly, in numerous projects, books, brochures and Internet resources, one comes across the concept of "thesaurus". It can seem mysterious and intimidating, because it is much easier to say "dictionary" than to use an unfamiliar term.

Thesaurus: what is it? How is it different from a regular dictionary? We will try to examine these questions in more detail and in accessible terms.

Interpretation of the term

Initially, the concept of thesaurus was considered from the point of view of a dictionary, representing the vocabulary of the language with examples of use in the text.

Ozhegov interprets a thesaurus as a dictionary of a particular language that fully reflects the vocabulary, while Efremova considers this phenomenon from the point of view of a systematic set of data in a certain field of knowledge.

The most specific definition is used in philology, where a thesaurus is understood as a type of dictionary in which all the meanings of words are connected to one another by semantic relationships and reflect the key relationships between concepts in a particular subject area.

As we can see, it is quite difficult to answer the question: "Thesaurus: what is it?" clearly. For a narrower study of the term, let's consider the history of occurrence, types and relationships of lexical units in a dictionary of this type.

History of occurrence

The English physician Peter Mark Roget is considered the founding father of thesauri; in 1852 he systematized the vocabulary by distributing words into groups. Each group was headed by the name of a concept, followed by its synonyms for particular parts of speech, lists of related names, and references to the names of other categories. The idea of such a classification was very valuable: the dictionary was considered the most natural kind, describing the vocabulary of the language to the fullest extent, and it could also be used to find important concepts quickly. Since the first thesaurus, this type of dictionary has been repeatedly transformed; it is used in many fields of knowledge and is popular all over the world. The question "Thesaurus: what is it?" also remains relevant in many schools.

Until now, thesauri have remained the most popular way of describing knowledge in any field necessary for effective human perception.

Relationships of words in the thesaurus

The most common relationships in the classical thesaurus are:

  1. Synonymy is a relationship between words of one part of speech that are similar in lexical meaning. For example: power - fatherland, brigade - detachment, scarlet - red, etc.
  2. Antonymy is a relationship between words of one part of speech that have opposite lexical meanings. For example: silence - roar, affectionate - rude.
  3. Hypernymy (hyponymy) is the key relationship for describing nouns. A hypernym has a broad lexical meaning and expresses the generic, common name of a class (set) of objects, their properties or features. A hyponym has a narrow meaning; it names an object (attribute, property) as an element of a particular set or class. A simple example makes the relationship clear: the words "beast" and "tiger" are connected, and the common name "beast" is a hypernym in relation to the hyponym "tiger".
  4. Meronymy (partonymy) is a relationship between nouns formed on the "part - whole" principle. As an example, consider the words aircraft, landing gear, porthole. Here the common name of the vehicle is the holonym (the name of the whole), and its constituent parts are meronyms.
  5. Consequence (a relationship between verbs). For example, the words "go" and "come" are connected as a process and its consequence (result).
  6. Cause (also valid for verbs only). For example, take the words "to be ill" and "to miss": the cause can be traced, since something is missed because of health problems.

What a thesaurus is, we will see from the following example.

The bed is a device for sleeping.

[hypernym]: furniture
[meronym]: house
[synonym]: couch, bed.

This is just a classic example of the thesaurus of the Russian language, but all dictionaries of this type are built exactly on this principle.

Thesaurus functions

A thesaurus has important social, communicative, scientific and other functions.

It is:

  • a source of special knowledge in a wide or narrow subject area, a way of ordering, describing terms;
  • search tool in the information flow;
  • tool for manual analysis of documentation in search engines;
  • tool for automatic indexing of complex texts.

Types of thesauri

The variety of dictionaries requires considering not only the question: "Thesaurus: what is it?", but also paying attention to types. This will help us better understand the features of this type of dictionaries.


Conclusion

We hope that we were able to explain in accessible language what a thesaurus is. The examples make it easy to see how it differs from other dictionaries. We also touched on information retrieval thesauri, which information systems widely use for quick search and for systematizing millions of titles.

N. V. Lukashevich


B. V. Dobrov

Research Computing Center, Lomonosov Moscow State University;

ANO Center for Information Research


Keywords: thesaurus, information retrieval, automatic text processing

The vast majority of technologies that work with large text collections are based on statistical and probabilistic methods. This is because lexical resources suitable for processing text collections with linguistic methods must contain tens of thousands of dictionary entries and must have a number of important properties that need to be specially monitored during development. In this report, we consider the basic principles of developing lexical resources for automatic processing of large text collections, using the example of RuThez, a Russian-language thesaurus for computer text processing that has been under development since 1997 and is currently a hierarchical network of more than 42 thousand concepts. We describe the current state of the thesaurus based on a comparison of its lexical composition with the text corpus of the University Information System RUSSIA (www.cir.ru), which contains 400 thousand documents. Examples of using the thesaurus in various automatic text processing applications are discussed.

  1. Introduction

Currently, millions of documents have become available in electronic form, and thousands of information systems and electronic libraries have been created. At the same time, information systems that use lexical and terminological resources for searching account for only fractions of a percent. This is due to the serious difficulty of creating such linguistic resources for the automatic processing of modern collections of electronic documents.

First, these collections are usually very large, the resource must include descriptions of thousands of words and terms. Secondly, collections are a set of documents of different structure with a variety of syntactic constructions, which makes it difficult to automatically process text sentences. In addition, important information is often distributed among different sentences of the text.

All this sharply raises the question of what a linguistic resource should look like if, on the one hand, it is to be useful for automatic processing and searching in electronic collections and, on the other hand, it is to be created in a foreseeable time and maintained with relatively little effort.

In the article, we will consider the basic principles of developing lexical resources for automatic processing of large text collections. These principles will be illustrated by RuThez, a Russian-language thesaurus for computer text processing that the ANO Center for Information Research has been developing since 1997. RuThez is currently a hierarchical network of more than 42 thousand concepts, which includes more than 95 thousand Russian words, expressions and terms. We will describe the current state of the thesaurus based on a comparison of its lexical composition with the vocabulary of the text corpus of the University Information System RUSSIA, supported by the Research Computing Center of Lomonosov Moscow State University and the ANO Center for Information Research. UIS RUSSIA (www.cir.ru) contains 400,000 documents on socio-political topics (about 3 GB of texts, 200 million word usages). The article will also look at examples of using the thesaurus in various text processing applications.

  2. Principles for the development of a linguistic resource for information retrieval tasks

To ensure efficient automatic processing of electronic documents (automatic indexing, categorization, comparison of documents), it is necessary to build a basis for comparing them: a list of what is mentioned in each document. For such an index to be more effective than a word index, the lexical variety of the text (synonymy, polysemy, differences in part of speech and style) must be overcome and reduced to an invariant, the concept, which becomes the basis for comparing different texts. Thus concepts should form the basis of a linguistic resource, while language expressions (words, terms) serve only as text entries that activate the corresponding concept.
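
The reduction of words to concepts described above can be sketched as follows. This is only a minimal illustration: the mini-lexicon, the function name `concept_index` and the sample documents are assumptions made for the example, not RuThes data, and lemmatization is assumed to have been done already.

```python
from collections import Counter

# Hypothetical mini-lexicon: each text entry (lemma or phrase) points to a concept.
TEXT_ENTRY_TO_CONCEPT = {
    "forest": "FOREST",
    "woodland": "FOREST",
    "forest area": "FOREST",
    "birch": "BIRCH",
    "tree": "TREE",
}

def concept_index(lemmas, lexicon=TEXT_ENTRY_TO_CONCEPT, max_len=2):
    """Count the concepts mentioned in a list of lemmas, preferring longer phrases."""
    counts = Counter()
    i = 0
    while i < len(lemmas):
        for n in range(max_len, 0, -1):           # try the longest phrase first
            window = lemmas[i:i + n]
            phrase = " ".join(window)
            if len(window) == n and phrase in lexicon:
                counts[lexicon[phrase]] += 1
                i += n
                break
        else:
            i += 1                                # lemma is not a known text entry
    return counts

doc_a = ["birch", "grow", "in", "forest", "area"]
doc_b = ["woodland", "with", "old", "tree"]
index_a = concept_index(doc_a)
index_b = concept_index(doc_b)

# The two documents share no words, but they do share a concept.
shared = set(index_a) & set(index_b)
```

Here the two documents have no words in common, yet the concept index shows that both mention FOREST, which is the kind of comparison a word index cannot provide.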

To make it possible to compare concepts that are different but close in meaning, relations must be established between them. Traditionally, linguistic resources for automatic processing of natural language texts have used particular sets of semantic relations, such as part, source, cause, etc. However, when working with large and heterogeneous text collections, we must recognize that, at the current state of text processing technology, a computer system cannot detect these relations in text with any stability, and so cannot reliably perform the procedures that we associate with particular relations. Therefore, relations between concepts should first of all describe invariant properties that do not depend, or depend only weakly, on the topic of the particular text in which the concept is mentioned.

The main function of these relations is to answer the following question:

if it is known that a text is devoted to the discussion of C1, and C2 is connected with C1 by the relation R, can we say that the subject of the text has to do with C2? (*)

When creating a linguistic resource for automatic processing, it is important to determine which properties of the concepts C1 and C2 make it possible to establish relations between them that satisfy (*).

For example, whatever texts are written about birches, we can always say that they are texts about trees. But despite the popularity and frequent discussion of the relation tree as part of the forest, only a very small number of texts about trees are texts about forests. Note that the problem is not related to the name of the relation: a clearing is also part of the forest, and texts about clearings are texts about the forest.

The invariance of relations with respect to the range of possible topics of texts in a subject area is largely determined by properties deeper than those reflected in the names of the relations, namely their quantifier and existential properties. The quantifier properties of a relation describe whether all instances of a concept have the relation and whether the relation is preserved throughout the entire life cycle of an instance. The problem with the relation tree - forest is precisely that not every particular tree is in a forest, whereas a clearing cannot exist outside a forest.

An example of the existential properties of relations is whether the existence of the concept C2 follows from the existence of the concept C1 (for example, the existence of the concept GARAGE requires the concept AUTOMOBILE), or whether the existence of instances of C1 depends on the existence of instances of C2 (thus a particular FLOOD is inseparable from a particular instance of RIVER). If a text discusses a dependent concept, especially one that is dependent at the level of instances, this suggests that the text is also relevant to the main concept.

Consider the relationship between the concepts FOREST and TREE in detail. In fact, what is part of the concept FOREST is TREE IN THE FOREST, while there are also STANDING TREE, TREE IN THE GARDEN, etc. In any case, the subordination of the concept TREE to the concept FOREST has to be broken.

On the other hand, FOREST is a kind of SET OF TREES and does not exist without trees (the same is true of GARDEN). Thus the concept FOREST should be dependent on the concept TREE. Starting from an analysis of the needs of specific applied tasks, we came to the conclusion that it is important to describe deep properties of relations that were previously reflected only slightly in linguistic resources but are of paramount importance for automatic processing of large text collections and, possibly, for many other tasks.

At present we model the quantifier and existential properties of concepts with a set of traditional thesaurus relations, ABOVE-BELOW (66% of all relations), PART-WHOLE (30%) and ASSOCIATION (4%), in combination with a set of additional modifiers (20% of relations are marked with modifiers). Note that the PART-WHOLE and ASSOCIATION relations are interpreted according to rule (*). In total, about 160 thousand direct relations between concepts are described; taking the transitivity of relations into account, this gives more than 1,350 thousand distinct relations, that is, on average each concept is connected with about 30 others.
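
How a modest number of direct relations grows into a much larger number of relations under transitivity can be sketched as follows. The example is a simplification: the handful of links is hypothetical, and every link is treated as uniformly transitive, whereas the text indicates that modifiers refine how relations are used.

```python
from collections import defaultdict

# Hypothetical direct relations (concept, higher concept); not taken from RuThes.
DIRECT = [
    ("BIRCH", "TREE"),
    ("TREE", "PLANT"),
    ("CONIFEROUS FOREST", "FOREST"),
    ("FOREST EDGE", "FOREST"),
]

def transitive_closure(direct):
    """Return every (concept, ancestor) pair reachable through direct links."""
    parents = defaultdict(set)
    for child, parent in direct:
        parents[child].add(parent)

    closure = set()
    for start in list(parents):
        stack = list(parents[start])
        seen = set()
        while stack:                              # walk upwards from the start concept
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            closure.add((start, node))
            stack.extend(parents.get(node, ()))
    return closure

all_links = transitive_closure(DIRECT)
```

In this toy graph four direct links yield five relations in total, because BIRCH also becomes connected with PLANT. With the figures quoted in the text, 1,350 thousand relations over 42 thousand concepts gives roughly 32 relations per concept, in line with the stated average of about 30.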

  3. RuThes Thesaurus: General Structure

The RuThes Thesaurus is a hierarchical network of concepts that correspond to the meanings of individual words, text expressions or synonym series. The main elements of the thesaurus are thus concepts, language expressions, relations between a language expression and a concept, and relations between concepts.

The thesaurus brings together in a single system two kinds of knowledge: linguistic knowledge (descriptions of lexemes, idioms and their connections, traditionally regarded as lexical or semantic knowledge) and knowledge about terms and relations within subject areas, which traditionally belongs to the work of terminologists and is described in information retrieval thesauri. The subject areas described in the thesaurus include economics, legislation, finance and international relations, which are so important for everyday life that they have a significant lexical representation in traditional explanatory dictionaries. In these areas lexical and terminological knowledge are strongly interconnected and interact with each other.

Language expressions are individual lexemes (nouns, adjectives and verbs) and nominal and verbal groups. At present the thesaurus therefore does not include adverbs or function words as language expressions. Multi-word groups may include terms, idioms and lexical functions (influence).

For each language expression, the following is described:

Its ambiguity, that is, its connection with one or more concepts, which means that the language expression can serve as a text expression of each of these concepts. The assignment of a language expression to different concepts is thus an implicit indication of its ambiguity;

Its morphological composition (part of speech, number, case);

Features of writing (for example, with a capital letter), etc.

Each thesaurus concept has a unique name, a list of language expressions by which this concept can be expressed in the text, a list of relationships with other concepts.

As a unique name for a concept, one of its unambiguous textual expressions is usually chosen. But the name of the concept can also be formed by a pair of its ambiguous textual expressions - synonyms written with a comma and uniquely defining it (for example, the concept FAT, FAT). An ambiguous textual expression of the name of a concept can also be provided with a label or a shortened fragment of interpretation, for example, the concept CROWD (CLUSTER OF PEOPLE).
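
The description of concepts and language expressions given above suggests a simple data model, sketched below. The class and field names are assumptions chosen for illustration, and the second concept (TIMBER) is hypothetical; only the relation labels (BELOW, PART, ASC1, ASC2) and the FOREST examples come from the text.

```python
from dataclasses import dataclass, field

@dataclass
class TextEntry:
    """A language expression together with the properties described for it."""
    text: str
    part_of_speech: str = "noun"
    ambiguous: bool = False        # corresponds to the (M) mark in dictionary entries
    capitalized: bool = False      # a feature of writing

@dataclass
class Concept:
    """A thesaurus concept: unique name, text entries, relations to other concepts."""
    name: str
    entries: list = field(default_factory=list)      # TextEntry objects
    relations: list = field(default_factory=list)    # (relation label, concept name)

def concepts_for(text, concepts):
    """Return the names of all concepts that a text entry can express."""
    return [c.name for c in concepts if any(e.text == text for e in c.entries)]

forest = Concept(
    name="FOREST",
    entries=[TextEntry("forest", ambiguous=True), TextEntry("forest area")],
    relations=[
        ("BELOW", "CONIFEROUS FOREST"),
        ("PART", "FOREST EDGE"),
        ("ASC1", "TREE"),
        ("ASC2", "FOREST FIRE"),
    ],
)
# Hypothetical second meaning, added only to show how ambiguity appears in the model.
timber = Concept(name="TIMBER", entries=[TextEntry("forest", ambiguous=True)])

senses_of_forest = concepts_for("forest", [forest, timber])
```

In this model an entry is ambiguous exactly when `concepts_for` returns more than one concept name for it, which mirrors the implicit indication of ambiguity mentioned in the text.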

  4. Example of a dictionary entry

We have chosen as an example the dictionary entry of the concept FOREST corresponding to one of the meanings of the word forest. This dictionary entry is interesting because it includes different types of knowledge traditionally classified as lexical (semantic) knowledge and encyclopedic knowledge (knowledge about the subject area, terminology).

Synonyms for the concept FOREST (13 in total):

forest (M), forest zone, forest environment, forest, forest quarter, forest landscape, forest area, forest, forested, forest raw area, forest, array of forests.

Subordinate (BELOW) concepts with synonyms:

JUNGLE (jungle);

FOREST PARK (city garden, green area, green massif, forest park, forestry, forestry belt, park (M), park zone);

FOREST HUNTING;

DECIDUOUS FOREST (softwood forest, hardwood forest);

GROVE (oak forest);

CONIFEROUS FOREST (coniferous massif, dark coniferous forest).

Part concepts with synonyms:

BURELOM (windbreak, windfall);

FELLING (cutting area);

FOREST CULTURE (forest species, forestry culture);

FOREST LAND (lands of the forest fund; lands covered with forest; forest land, forest area; wooded land, wooded area);

FOREST (forest plantations, forest plantations, afforestation);

FOREST EDGE (edging, edging);

UNDERGROWTH (undergrowth);

PROSEKA;

DRY LAND (dry).

Here the mark (M) indicates that the text entry is ambiguous.

The concept FOREST also has other relations, the so-called dependency relations (in the current version they are called ASC 2, asymmetric association): FOREST FIRE (forest fire, fire in the forest); FOREST MANAGEMENT (forest use, use of forest fund plots); FOREST OWNERSHIP; FOREST SCIENCE (forest science). As already noted in section 2, the concept FOREST depends on the concept TREE, which is denoted in the thesaurus by the relation ASC 1.

In total, the concept FOREST is directly related to 28 other concepts and, taking the transitivity of relations into account, to 235 concepts (more than 650 text entries in total).

  5. Assessment of the current state of the RuThes Thesaurus of the Russian language

5.1. Lexical composition

Currently, more than 95 thousand language expressions are included in the thesaurus network, of which 61 thousand are single-word ones.

This volume of work forced us to decide which words and language expressions should be included in the Thesaurus. The natural wish was to see how the most frequent words of the Russian language are represented in the thesaurus. For this purpose we used the text collection of the University Information System RUSSIA (400 thousand documents). The collection contains official documents of various bodies of the Russian Federation (55 thousand documents since 1992), press materials since 1999 (the newspapers Izvestia, Nezavisimaya Gazeta, Komsomolskaya Pravda and Argumenty i Fakty, the magazine Expert and others), and materials of scientific journals (Bulletin of Moscow University, Sociological Journal). The list of lemmas included in the Thesaurus was compared with the list of the 100,000 most frequent lemmas in the text collection (frequency above 25).

Lexical markup of the list showed that, of these hundred thousand lemmas, 35 thousand are described in RuThes and only about 7 thousand more lexemes deserve to be included in the Thesaurus; the rest are lemmatized variants of various proper names. Replenishment has therefore ceased to be a priority and is carried out gradually, starting with the most frequent words. It is assumed that, once this list is essentially exhausted, the next comparison with the text collection of the information system will be performed and new lemmas with a frequency above 25 will be selected. After that the frequency threshold is to be lowered. The large number of text examples in the collection makes it possible to respond quickly to lexical novelties (for example, installation, blockbuster, beau monde, thriller) and to include them in the appropriate places in the hierarchy of the Thesaurus.
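
The comparison between the thesaurus vocabulary and the frequent lemmas of a collection can be sketched as a simple set operation. The frequency table, the function name `coverage_report` and the sample lemmas are assumptions made for illustration; in the work described here, the candidates were then marked up manually to separate lexemes worth adding from proper names.

```python
def coverage_report(corpus_freq, thesaurus_lemmas, min_freq=25):
    """Split frequent corpus lemmas into those covered by the thesaurus and the rest."""
    frequent = {lemma for lemma, freq in corpus_freq.items() if freq > min_freq}
    covered = frequent & thesaurus_lemmas
    # Candidates for inclusion, most frequent first (ties broken alphabetically).
    missing = sorted(frequent - thesaurus_lemmas,
                     key=lambda lemma: (-corpus_freq[lemma], lemma))
    return covered, missing

# Hypothetical lemma frequencies from a text collection.
corpus_freq = {
    "forest": 1200,
    "school": 3000,
    "thriller": 90,
    "blockbuster": 40,
    "rare-word": 3,
}
thesaurus_lemmas = {"forest", "school"}

covered, missing = coverage_report(corpus_freq, thesaurus_lemmas)
```

Lowering `min_freq` on a later pass corresponds to the planned reduction of the frequency threshold once the current candidate list has been worked through.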

Constant work with a current text collection provides unique opportunities to test the significance and quality of the lexical descriptions offered in dictionaries. For example, we noticed an unusually high frequency of the word Mother See (more than 400 occurrences). A check against the collection showed that the word is indeed often used as a synonym for Moscow, whereas explanatory dictionaries often mark it as obsolete. Another example of a frequently used word (more than 300 occurrences) that dictionaries mark as obsolete is the word blissful.

5.2 Description of word meanings

A comparison with the text collection shows that many of the frequent words in the collection are well represented in the Thesaurus in at least one of their meanings (usually the basic one). Finding out to what extent the range of meanings of polysemous Russian words is represented in the Thesaurus is our primary task at present.

As you know, different dictionary sources often give a different set of meanings for polysemantic words, distinguish shades of meanings, and the same type of polysemy can be described differently for different words even in the same dictionary. Therefore, the task of a consistent and representative description of the meanings of lexemes is an important task for the creators of any dictionary resource.

However, if the resource is intended for automatic processing, a balanced description of meanings becomes much more important. An excessive proliferation of meanings can leave the computer system unable to select the right one, which in turn significantly reduces the efficiency of the text processing system. Thus, one of the disadvantages of WordNet as a resource for automatic text processing is the excessive number of meanings described for some words (in WordNet 1.6: 53 meanings for run, 47 for play, etc.). These meanings are difficult to distinguish even for a human annotating texts semantically, so a computer system clearly cannot cope with choosing the appropriate meaning either. Different authors therefore propose different ways of merging meanings to improve the quality of processing.

At the same time, an opposite factor is at work: if meanings really differ in their sets of lexical links (in our case, thesaurus links), they cannot be merged into one unit (one concept), since this would also degrade the quality of automatic processing.

Consider for example the words school and church, each of which can be considered as an organization and as a building.

Each school as an organization has a building (most often one). All parts of the school building (classrooms, blackboards) are related to the school as an organization. There are no specific types of school buildings. It is therefore inappropriate to single out the school as a building as a separate concept. However, the description of the combined concept SCHOOL, as an organization and as a building, must have a specially marked relation with the concept BUILDING. To describe such relations in the Thesaurus, a mark on the relation is used: the modifier "A" ("aspect"; in automatic analysis, "confirmation" by other concepts is required for this relation to be taken into account).

SCHOOL

ABOVE EDUCATIONAL INSTITUTION

ABOVE (A) PUBLIC BUILDING

The corresponding meanings of the word church are not so close. A church as an organization can have a large number of church buildings in different locations and can also have many other buildings. A church building is closely associated with a religion and a confession, but it can change the church organization to which it belongs. The church as an organization and the church as a building have different subtypes. That is why CHURCH (ORGANIZATION) and CHURCH (BUILDING) are presented in RuThes as different concepts.

This significant divergence in thesaurus relations correlates in an interesting way with the ability of the denotata corresponding to the meanings to exist separately from each other. Thus, unlike a school building, a church building does not cease to exist, or even to be called a church, when its use changes.

The representation of meanings in the Thesaurus is constantly being verified, starting with the most frequent lemmas. For each frequent lemma we check how its meanings are described in explanatory dictionaries, which meanings are used in the collection and how they are presented in the Thesaurus. As a result, a list of 10,000 lexemes has been compiled whose ambiguity still requires either additional analysis or additional description. The list is based on the 30 thousand most frequent lemmas.

It should be noted that in the Thesaurus the problem of ambiguity is partially alleviated, because thesaurus relations can be described between the different meanings of a word; it is therefore possible to choose by default the concept that is highest in the hierarchy, since it is certainly discussed in the text. For example, the word photo has three meanings: photography as a field of activity, photography as a photograph, and photography as a photo studio:

PHOTOGRAPHY (photographing, photography, ..., photo)

PART PHOTOGRAPHIC IMAGE (a photo, photograph, photo)

PART PHOTO STUDIO (photo).

Thus, if it was not possible to determine the meaning in which the word photo is used, it is by default treated as photography (process, result or location), which is sufficient for many automatic text processing applications.
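
The default choice of the highest concept among the meanings of an ambiguous word can be sketched as follows. The table `HIGHER` and the function names are assumptions for illustration, and each concept is given at most one higher concept, which is a simplification of a real thesaurus network.

```python
# Hypothetical "higher concept" table built from the PART relations shown above.
HIGHER = {
    "PHOTOGRAPHIC IMAGE": "PHOTOGRAPHY",
    "PHOTO STUDIO": "PHOTOGRAPHY",
}

def ancestors(concept, higher):
    """Yield the concept itself and everything above it in the hierarchy."""
    seen = set()
    while concept is not None and concept not in seen:
        seen.add(concept)
        yield concept
        concept = higher.get(concept)

def default_concept(senses, higher):
    """Return the sense that lies at or above every other sense, if there is one."""
    for candidate in senses:
        if all(candidate in ancestors(sense, higher) for sense in senses):
            return candidate
    return None        # the senses are unrelated, so no safe default exists

photo_senses = ["PHOTO STUDIO", "PHOTOGRAPHY", "PHOTOGRAPHIC IMAGE"]
photo_default = default_concept(photo_senses, HIGHER)

# Senses described as separate, unrelated concepts have no default.
church_default = default_concept(["CHURCH (ORGANIZATION)", "CHURCH (BUILDING)"], {})
```

For the word photo the sketch selects PHOTOGRAPHY, while for two unrelated senses it returns `None`, so genuine disambiguation would still be needed in that case.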

  6. Application of the RuThes thesaurus for automatic text processing

Since 1995, the socio-political terminology of RuThes (the Socio-Political Thesaurus) has been actively and successfully used in various applications of automatic text processing, such as automatic conceptual indexing, automatic categorization using several rubricators, and automatic annotation of texts, including English ones. The Socio-Political Thesaurus (27,000 concepts, 62,000 text entries) is the basic search tool in the UIS RUSSIA search engine (www.cir.ru).

The entire vocabulary of the RuThes thesaurus is used in procedures for automatic categorization of texts by complex hierarchical rubricators. In the existing technology each rubric is described as a Boolean expression over terms, after which the original formula is expanded along the thesaurus hierarchy. The resulting Boolean expression may include hundreds or thousands of conjuncts and disjuncts.

Let us give as an example a fragment of the description by thesaurus concepts (and language expressions after the expansion of the formula) of the “Image of a Woman” rubric of the SOFIST 2 rubricator used by VTsIOM to classify public opinion survey questionnaires:

(WOMAN[N]

|| GIRL[N]

|| RELATIVE[L] (grandmother, granddaughter, cousin, daughter, sister-in-law, mother, stepmother, daughter-in-law, stepdaughter, ...))

(CHARACTER TRAIT[L] (thrifty, heartless, forgetful, frivolous, mocking, intolerant, sociable, ...)

|| IMAGE[E] (representation, appearance, appearance, appearance, shape, image, appearance)

|| PLEASANT[L] (..., interesting, beautiful, cute, attractive, attractive, endearing, ...)

|| UNPLEASANT[L] (unsympathetic, rude, nasty, ...)

|| VALUE[L] (revere, idolize, adore, worship, worship, ...)

|| PREFER[N]

The symbol "E" denotes full expansion along the thesaurus hierarchy, the symbol "L" denotes expansion along species relations ("BELOW"), and the symbol "N" means no expansion.
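
The expansion of a rubric formula along the thesaurus hierarchy can be sketched as follows. This is an illustrative reading only: the tables `BELOW` and `PARTS` are hypothetical, the rubric is modeled as a conjunction of OR-groups as a simplifying assumption, and "E" is taken to mean following both species and part links.

```python
# Hypothetical fragments of the thesaurus hierarchy.
BELOW = {"RELATIVE": ["MOTHER", "SISTER"], "MOTHER": ["STEPMOTHER"]}
PARTS = {"IMAGE": ["APPEARANCE"]}

def expand(concept, mode):
    """Expand one rubric term: N = no expansion, L = species links, E = all links."""
    if mode == "N":
        return {concept}
    tables = [BELOW] if mode == "L" else [BELOW, PARTS]
    result, stack = set(), [concept]
    while stack:
        node = stack.pop()
        if node in result:
            continue
        result.add(node)
        for table in tables:
            stack.extend(table.get(node, ()))
    return result

def matches(rubric, doc_concepts):
    """A document matches if every OR-group has a term whose expansion it mentions."""
    return all(
        any(expand(concept, mode) & doc_concepts for concept, mode in group)
        for group in rubric
    )

# Simplified rubric in the spirit of the fragment above.
RUBRIC = [
    [("WOMAN", "N"), ("GIRL", "N"), ("RELATIVE", "L")],
    [("CHARACTER TRAIT", "L"), ("IMAGE", "E"), ("PREFER", "N")],
]

hit = matches(RUBRIC, {"STEPMOTHER", "APPEARANCE"})
miss = matches(RUBRIC, {"STEPMOTHER"})
```

The sketch shows why the expanded formula grows so quickly: a single term marked "L" or "E" stands for every concept reachable below it, each of which brings its own text entries.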

Research is being carried out on the development of a combined technology for automatic text categorization that combines thesaurus knowledge and machine learning procedures.

The use of the thesaurus to expand queries formulated in natural language (at present only the socio-political part of the thesaurus is used, to expand terminological queries in the information retrieval system of UIS RUSSIA) and to search for answers to questions in large text collections is also being investigated.

7. Conclusion

The paper presents the basic principles of developing linguistic resources for automatic processing of large text collections. The linguistic resource that has been created, the RuThes thesaurus of the Russian language, is intended for use in such applications of automatic text processing as conceptual indexing of documents, automatic categorization by complex hierarchical rubricators, and automatic expansion of natural language queries.

This work is partially supported by the Russian Foundation for the Humanities, grant No. 00-04-00272a.


Thesaurus of Russian Language for Natural Language Processing of Large Text Collections

Natalia V. Loukachevitch, Boris V. Dobrov

Keywords: thesaurus, natural language processing, information retrieval

In our presentation we consider the main principles of developing lexical resources for automatic processing of large text collections and describe the structure of the Thesaurus of the Russian Language, which has been developed since 1997 specifically as a tool for automatic text processing. The Thesaurus is now a hierarchical net of 42 thousand concepts. We describe the current stage of development of the Thesaurus in comparison with the 100,000 most frequent lemmas of the text collection of the University Information System RUSSIA (www.cir.ru), which includes 400 thousand documents. We also consider the use of the Thesaurus in different applications of automatic text processing.

One of the new basic concepts that emerged from the development of machine methods of information processing, in particular machine translation, the search for scientific and technical information and the creation of an information model of an enterprise in automated control systems, is the concept of an information system thesaurus. The term "thesaurus" implies a body of knowledge about the outside world, the so-called thesaurus of the world T. All the concepts of the outside world expressed in natural language constitute this thesaurus, from which particular thesauri can be derived by hierarchical division, taking into account the subordination of individual concepts, or by selecting parts of the general thesaurus of the world. In information retrieval systems the thesaurus plays an important role in finding the desired document by keywords. The construction of a thesaurus is therefore a complex and responsible task. But this task can also be automated.

Classification, in its most general definition, is the division and ordering of sets. It is the distribution of objects into classes on the basis of a common feature that is inherent in these objects or phenomena and distinguishes them from the objects and phenomena that make up other classes. If necessary, each class can be divided into subclasses. The rubricator is a special kind of classification. Rubricators are therefore created on the basis of the same general provisions:
- a scientific basis for building the classification;
- reflection of the modern level of development of science;
- availability of a system of links and references, as well as a reference apparatus (RSA).

However, the rubricator is a pragmatic classification, created on the basis of information flows and the needs of specialists. This is its difference from a priori classifications such as UDC and IPC.

The main functions of classifications and, in particular, of the rubricator are the following:
- thematic differentiation of information subsystems;
- formation of information arrays according to any features;
- systematization of information materials and publications;
- current and retrospective search;
- indexing of documents and queries;
- connection with other classification schemes;
- normative functions.

Classifications are built by dividing concepts (the objects of classification) on the basis of established relationships between the features of these objects, in accordance with certain logical principles. The attribute by which the classification is made is called the basis of division. Classifications widely use the methods of deduction and induction to fix groups and classes and to identify relationships between them; this is typical for hierarchical classifications. The depth of classification (the number of hierarchy levels) may vary depending on the purpose. One of the widely used rubricators is the State Rubricator of Scientific and Technical Information (SRSTI; Russian abbreviation GRNTI).

The SRSTI rubricator is designed in such a way that it can be used jointly with other classifications such as UDC and IPC. The Universal Decimal Classification (UDC) has existed for more than 70 years, but is still unrivaled in its breadth of distribution and is used in many countries around the world. UDC covers the entire universe of knowledge and is successfully used for systematization and subsequent search for a wide variety of information sources.

In addition to UDC, the Library-Bibliographic Classification (LBC) is widely used in practice. The LBC is built on the principles of logical subordination and is a classification of applied type.
In the Russian Federation, the International Patent Classification (IPC) is used to classify inventions and to systematize domestic collections of descriptions of inventions. It is a rather complex multi-aspect classification built according to the functional-industry principle. The same technical concepts can be placed in the IPC either in special classes (according to industry) or in functional classes (according to the principle of operation). The industry principle of distributing concepts involves classifying objects according to their application in one or another historically developed branch of technology.

Comparative characteristics of the SRSTI rubricator, UDC, LBC and IPC are shown in Table 1.

Table 1
Characteristics of the SRSTI rubricator, UDC, LBC and IPC

Name | Structure | Principle of arrangement of divisions | Division scheme
... | Hierarchical | Industry | From general to specific
... | Hierarchical | Thematic | ...
... | Hierarchical | Functional-industry | From general to specific
LBC for scientific libraries | Hierarchical | Industry | From general to particular, by type


Thus, we can single out the main distinguishing features of rubricators and classifiers:
- they are of an applied nature and have a sectoral orientation;
- they are open systems that depend on the development of science and technology and on the needs and demands of specialists;
- they are inorganic systems, since objects arise and develop in the environment and enter the system from it, and elements are able to exist independently outside the system; this feature is closely related to the previous one;
- the minimum element is a concept associated with the environment, and a concept is itself a system of definitions;
- there are connections between concepts both along the "vertical" (genus-species, whole-part) and along the "horizontal" (species-species, part-part), which indicates the hierarchical nature of the systems.

Consequently, the structure and principles of organization of classifications and rubricators make it possible to automate the construction of subject-area thesauri using the deduction method. The algorithm for constructing a thesaurus using the deduction method is shown in Fig. 1.

The basis for the formation of the thesaurus is the search image of the document, the task or application for information search, filled in by the operator. Therefore, the first step is to research and analyze the application. At the first stage, the operator indicates the topic or problem of interest, possible keywords and their synonyms. As a result, we get a superficial idea of ​​the subject area.

Fig. 1. Algorithm for constructing a thesaurus using the deduction method

In addition, a thesaurus of keywords (CS) is formed using the deduction method, which requires:
- the array of keywords set by the user himself, denoted MP in Fig. 1;
- the array of keywords extracted from the search task, denoted MZ.

However, for a more complete and in-depth understanding of the subject area, we use existing rubricators and classification schemes (GRNTI, UDC, LBC, IPC). To maximize coverage of the subject area, all available ones must be viewed. The array of rubricators is denoted MR. The deduction search algorithm consists of two steps:
1. Finding generic concepts (Fig. 2);
2. Finding specific terms within generic concepts (Fig. 3).


Fig. 2. Processing a generic concept

We load the first rubricator from the array and start a loop that checks whether the keywords entered by the user are present in the rubricators. Each keyword is searched for in the rubricator and compared with a generic concept, or "nest", and then a condition is checked: does the generic concept have a link to specific terms? If there is such a link, the keyword is compared with the specific terms. If no link is found, we go to the next generic concept. When the keywords entered by the operator have been viewed, we move on to the array of keywords extracted from the task. The verification procedure is the same: we look for keywords corresponding to generic concepts and then follow their links to specific terms.


Fig. 3. Processing of specific terms

Note that within each generic concept it is important to review all available specific terms in order to obtain the fullest understanding of the problem area. The result of these actions is an array of keywords that constitutes a complete thesaurus corresponding to the information search task or to the search image of the document.
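
One possible reading of the two-step deduction search is sketched below. The names `mp`, `mz` and `mr` follow the array names used in the text; everything else is an assumption: each rubricator is modeled as a mapping from a generic concept to its specific terms, keywords are matched by exact string comparison, and the sample rubricators are invented for the example.

```python
def build_thesaurus(mp, mz, mr):
    """Collect generic concepts and specific terms related to the given keywords."""
    thesaurus = []

    def add(term):
        if term not in thesaurus:          # keep first-seen order, no duplicates
            thesaurus.append(term)

    for rubricator in mr:                  # every available rubricator is viewed
        for keywords in (mp, mz):          # user keywords first, then task keywords
            for keyword in keywords:
                for generic, specifics in rubricator.items():
                    if keyword == generic:
                        # Step 1: the keyword matches a generic concept ("nest").
                        add(generic)
                        # Step 2: review all specific terms linked to it.
                        for term in specifics:
                            add(term)
                    elif keyword in specifics:
                        # The keyword is itself a specific term of this nest.
                        add(generic)
                        add(keyword)
    return thesaurus

# Hypothetical rubricator fragments.
mr = [
    {"classification": ["rubricator", "UDC"]},
    {"information retrieval": ["thesaurus", "indexing"]},
]
mp = ["classification"]    # keywords set by the user
mz = ["thesaurus"]         # keywords extracted from the search task

result = build_thesaurus(mp, mz, mr)
```

A real implementation would need normalization of keywords and rubric names before comparison, since exact string matching finds only literal coincidences.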

On the basis of a complete set of search images of documents (let's denote it), it is possible to create branch thesauri and a single library classifier. Obviously, the complete set  itself represents the simplest thesaurus.

However, using the selection criterion
, (1)
we can build industry thesauri. In this case, the set of all branch thesauri forms a complete thesaurus
, (2)
sections of which can be hierarchically structured in accordance with the requirements of GOSTs for the main classifiers (GRNTI, UDC, LBC, IPC) or for an internal unified classifier.

Automation of the process of building a thesaurus and classification makes it possible to maximally facilitate the work of an operator working with distributed information resources.

In addition to building a thesaurus, based on the search image of a document, the proposed approach can be used for automatic document summarization and text clustering.

Abstracting of documents is one of the tasks aimed at providing expert specialists with the reliable information they need to make a management decision on the value of documents received from the Internet. Abstracting is the process of transforming documentary information that culminates in the compilation of an abstract. An abstract is a semantically adequate presentation of the main content of the primary document; it is distinguished by economy of expression and by constant linguistic and structural characteristics, and it is intended to perform various information and communication functions in the system of scientific communication. The document abstracting algorithm is shown in Fig. 4.


Fig. 4. Algorithm for summarizing documents

In general, the algorithm includes the following main steps.
1. Sentences are extracted from a document that has been downloaded from the Internet and placed in the data warehouse; the text is split at punctuation marks and the sentences are stored in an array.
2. Each sentence is divided into words at separator characters, and the words are stored in an array; each sentence has its own array.
3. For each word of each sentence, we count how many times that word occurs in the other sentences (both preceding and following). The sum of these repetition counts over all words of the sentence is the weight of the sentence.
4. The specified number of sentences with the highest weights are selected and placed in the abstract in the order in which they appear in the text.
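The four steps above can be sketched in a few lines of code. This is a minimal sketch under stated assumptions, not the authors' implementation: sentences are assumed to end with ".", "!" or "?", words are taken to be runs of letters and digits compared case-insensitively, and the function names are hypothetical.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Step 1: split the text into sentences at terminal punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def split_words(sentence: str) -> List[str]:
    """Step 2: split one sentence into lowercase words."""
    return re.findall(r"\w+", sentence.lower())


def sentence_weights(sentences: List[str]) -> List[int]:
    """Step 3: weight = occurrences of the sentence's words elsewhere."""
    words_per_sentence = [split_words(s) for s in sentences]
    weights = []
    for i, words in enumerate(words_per_sentence):
        weight = 0
        for word in words:
            # Count this word in every other sentence (before and after).
            for j, other in enumerate(words_per_sentence):
                if j != i:
                    weight += other.count(word)
        weights.append(weight)
    return weights


def summarize(text: str, n_sentences: int) -> List[str]:
    """Step 4: keep the heaviest sentences in their original order."""
    sentences = split_sentences(text)
    weights = sentence_weights(sentences)
    # Stable sort: among equal weights, earlier sentences are preferred.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: weights[i], reverse=True)
    chosen = sorted(ranked[:n_sentences])
    return [sentences[i] for i in chosen]
```

For the text "Cats chase mice. Dogs chase cats. Birds sing." the weights are 2, 2 and 0, so `summarize(text, 2)` keeps the first two sentences. In practice one would probably also exclude stop words and normalize word forms, which the description above does not mention.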

The proposed model for constructing a thesaurus and the thematic catalogs of an information system provides a theoretical basis for automating semantic search. It allows an expert specialist not only to carry out search work, but also to abstract, in an automated mode, the documents obtained by searching the distributed information systems of the Internet.

Literature:
1. Barushkova R.I. Classification schemes of scientific and technical information. Textbook. - M., 1981. - 80 p.
2. Barushkova R.I. Rubricator as a classification scheme for scientific and technical information. Methodological guide. - M., 1980. - 38 p.
3. Trusov A.V., Babarykin E.P. Evaluation of the boundaries of the area of a thematic information request in distributed information systems. Materials of the All-Russian (with international participation) conference "Information, innovations, investments", November 24-25, 2004, Perm / Perm CSTI. - Perm, 2004. - P. 76-79.
4. Yatsko V.A. Logical-linguistic problems of analysis and abstracting of scientific text. - Abakan: Publishing house of the Khakass State University, 1996. - 128 p.

