From time to time, posts and translated articles devoted to various aspects of the theory of formal languages are published on Habré. Among such publications (I will not point to specific works, so as not to offend their authors), especially among those describing language-processing software tools, inaccuracies and confusion are common. The author is inclined to believe that one of the main reasons for this unfortunate state of affairs is an insufficient understanding of the ideas underlying the theory of formal languages.

This text is intended as a popular introduction to the theory of formal languages and grammars. The theory is considered (and, it must be said, rightly so) rather complex and confusing. Students are usually bored at the lectures, and the exams do not arouse enthusiasm either. Consequently, there are not many researchers working on this subject. Suffice it to say that in the entire period from the birth of the theory of formal grammars in the mid-1950s to the present day, only two doctoral dissertations have been defended in this area. One of them was written in the late 1960s by Alexei Vladimirovich Gladkiy; the second, on the threshold of the new millennium, by Mati Pentus.

Below, two basic concepts of the theory of formal languages are described in the most accessible form: the formal language and the formal grammar. If the text proves interesting to the audience, the author solemnly promises to produce a couple more opuses in the same vein.

Formal languages

In short, a formal language is a mathematical model of a real language. A real language here means a certain way for subjects to communicate with one another. For communication, subjects use a finite set of signs (symbols), which are pronounced (or written out) in a strict temporal order, i.e. form linear sequences. Such sequences are usually called words or sentences. Thus, only the so-called communicative function of language is considered here, and it is studied using mathematical methods. Other functions of language are not studied and, therefore, are not considered.

To understand better how formal languages are studied, one must first understand the features of mathematical methods of study. According to Kolmogorov, the mathematical method, whatever it is applied to, always follows two basic principles:

  1. Generalization (abstraction). The objects of study in mathematics are special entities that exist only in mathematics and are intended to be studied by mathematicians. Mathematical objects are formed by generalizing real objects. Studying an object, the mathematician notes only some of its properties and abstracts away from the rest. Thus, the abstract mathematical object "number" may in fact stand for the number of geese in a pond or the number of molecules in a drop of water; the main thing is that geese and water molecules can be spoken of as collections. One important property follows from such an "idealization" of real objects: mathematics often operates with infinite collections, whereas in reality such collections do not exist.
  2. Rigor of reasoning. In science, it is customary to verify the truth of a piece of reasoning by comparing its results with what actually exists, i.e. by conducting experiments. In mathematics, this criterion of truth does not work. Conclusions are therefore not verified experimentally; instead, it is customary to establish their validity by strict reasoning that obeys certain rules. Such reasoning is called a proof, and proofs serve as the only way to substantiate the correctness of a statement.
Thus, to study languages by mathematical methods, one must first extract from the language those properties that seem important for study, and then define these properties strictly. The abstraction obtained in this way is called a formal language - a mathematical model of a real language. The content of a particular mathematical model depends on which properties are important for study, i.e. on what is currently singled out for examination.

A famous example of such a mathematical abstraction is the model known under a name dissonant to the Russian ear: the "bag of words". In this model, texts in natural language (i.e., one of the languages that people use in everyday communication with each other) are studied. The main object of the bag-of-words model is the word, equipped with a single attribute: the frequency of its occurrence in the original text. The model does not take into account how words stand next to each other, only how many times each word occurs in the text. The bag of words is used in text-based machine learning as one of the main objects of study.
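To make the model concrete, here is a minimal bag-of-words sketch in Python (the sample sentence is made up for illustration):

    # Bag-of-words sketch: the only attribute kept for each word
    # is its frequency; word order is discarded.
    from collections import Counter

    text = "the cat saw the dog and the dog saw the cat"
    bag = Counter(text.split())
    print(bag)   # Counter({'the': 4, 'cat': 2, 'saw': 2, 'dog': 2, 'and': 1})

    # Any permutation of the words yields exactly the same bag:
    assert bag == Counter(reversed(text.split()))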

The theory of formal languages, however, considers it important to study the laws governing the arrangement of words next to each other, i.e. the syntactic properties of texts. For this, the bag-of-words model looks too poor. Therefore, a formal language is defined as a set of sequences composed of elements of a finite alphabet. Let us define this more strictly.

An alphabet is a finite, non-empty set of elements. These elements are called symbols. To denote an alphabet we will usually use the Latin letter V, and to denote the symbols of the alphabet, the initial lowercase letters of the Latin alphabet. For example, the expression V = {a, b} denotes an alphabet of two symbols, a and b.

A string (chain) is a finite sequence of symbols. For example, abc is a string of three symbols. When writing strings symbolically, indices are often used, and the strings themselves are denoted by lowercase letters from the end of the Greek alphabet. For example, ω = a1...an is a string of n symbols. A string may be empty, i.e. contain no symbols at all; such strings are denoted by the Greek letter ε.

Finally, a formal language L over an alphabet V is an arbitrary set of strings composed of symbols of the alphabet V. Arbitrariness here means that the language may be empty, i.e. contain not a single string, or infinite, i.e. made up of an infinite number of strings. The latter is often puzzling: are there real languages that contain an infinite number of strings? Generally speaking, everything in nature is finite. But infinity is used here as the possibility of forming strings of unbounded length. For example, the language consisting of all possible variable names of the C++ programming language is infinite: variable names in C++ are not limited in length, so potentially there is an infinite number of such names. In reality, of course, very long variable names make little sense: by the time you finish reading such a name, you have already forgotten its beginning. But as a potential ability to introduce variables with names of unbounded length, this property is useful.

So, formal languages are simply sets of strings made up of symbols of some finite alphabet. But the question arises: how can a formal language be specified? If the language is finite, one can simply write out all its strings one after another (though one might wonder whether it makes sense to write out a language of, say, at least ten thousand strings). But what if the language is infinite - how is it to be specified? This is where grammars come into play.

Formal grammars

A way of specifying a language is called a grammar of that language. Thus, we call a grammar any means of specifying a language. For example, the grammar L = {aⁿbⁿ} (here n is a natural number) specifies a language L consisting of strings of the form ab, aabb, aaabbb, etc. The language L is an infinite set of strings, yet its grammar (description) consists of only about ten characters, i.e. it is finite.
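For illustration, the single structural principle of L can be turned directly into a membership test; a rough Python sketch:

    # Recognizer for L = {a^n b^n, n >= 1}: a block of a's followed
    # by an equally long block of b's.
    def in_L(s: str) -> bool:
        n = len(s) // 2
        return n > 0 and len(s) == 2 * n and s == "a" * n + "b" * n

    assert in_L("ab") and in_L("aaabbb")
    assert not in_L("aab") and not in_L("")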

The purpose of a grammar is to specify a language. This specification must necessarily be finite, otherwise no person would be able to understand the grammar. But how can a finite specification describe infinite collections? This is possible only if the structure of all strings of the language is based on uniform principles, of which there are finitely many. In the example above, the following principle serves this role: "every string of the language begins with symbols a, followed by the same number of symbols b". If a language is an infinite collection of randomly typed strings whose structure obeys no uniform principles, then obviously no grammar can be invented for it. Another question is whether such a collection can be considered a language at all. For the sake of mathematical rigor and uniformity of approach, such collections are usually considered languages.

So, the grammar of a language describes the laws of the internal structure of its strings. Such laws are usually called syntactic, so the definition of a grammar can be rephrased as a finite way of describing the syntactic patterns of a language. For practice, it is interesting to have not just grammars, but grammars that can be specified within a single approach (formalism or paradigm) - in other words, on the basis of a single language (metalanguage) for describing the grammars of all formal languages. Then one can devise an algorithm for a computer that takes as input a grammar description written in this metalanguage and does something with strings of the language.

Such paradigms for describing grammars are called syntactic theories. A formal grammar is a mathematical model of a grammar described within some syntactic theory. There are quite a few such theories. The best-known metalanguage for specifying grammars is, of course, Chomsky's generative grammars. But there are other formalisms as well; one of them, neighborhood grammars, is described below.

From an algorithmic point of view, grammars can be subdivided according to the way they specify the language. There are three main ways (types of grammars):

  • Recognizing grammars. Such a grammar is a device (algorithm) that is given a string as input and prints "Yes" if the string belongs to the language and "No" otherwise.
  • Generating grammars. This kind of device is used to produce strings of the language on demand. Figuratively speaking, when a button is pressed, some string of the language is generated.
  • Enumerating grammars. Such grammars print all the strings of the language one after another. Obviously, if a language consists of an infinite number of strings, the enumeration process will never stop, although it can, of course, be stopped forcibly at the right moment, for example when the desired string has been printed.
An interesting question concerns converting one type of grammar into another. Is it possible, given a generating grammar, to construct, say, an enumerating one? The answer is yes. It suffices to generate strings, ordering them, say, by length and by the order of symbols. In the general case, however, an enumerating grammar cannot be turned into a recognizing one. One can only use the following method. Having received a string as input, start enumerating strings and wait to see whether the enumerating grammar prints this string. If the string belongs to the language, it will eventually be printed and thereby recognized. But if the string does not belong to the language, the recognition process will continue indefinitely: the recognizer will loop. In this sense the power of recognizing grammars is less than the power of generating and enumerating ones. This should be kept in mind when comparing Chomsky's generative grammars and Turing's recognizing machines.
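This construction is easy to sketch in code. Below, the language {aⁿbⁿ} from the example above is enumerated, and a recognizer is built on top of the enumerator exactly as described; note that on a string outside the language the loop never terminates, which is the whole point:

    # Enumerating grammar for {a^n b^n}: prints the strings one after another.
    def enumerate_L():
        n = 1
        while True:
            yield "a" * n + "b" * n
            n += 1

    # The general construction of a recognizer on top of an enumerator:
    # wait until the enumerator prints the input string. If the string is
    # in the language, "Yes" is eventually returned; if it is not,
    # the loop below never terminates.
    def recognize(s):
        for chain in enumerate_L():
            if chain == s:
                return "Yes"

    print(recognize("aabb"))   # Yes
    # recognize("aab") would loop forever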

Neighborhood grammars

In the mid-1960s the Soviet mathematician Yuliy Anatolyevich Shreider proposed a simple way of describing the syntax of languages based on so-called neighborhood grammars. For each symbol of the language, a finite number of its "neighborhoods" are specified - strings containing this symbol (the center of the neighborhood) somewhere inside. A set of such neighborhoods for each symbol of the language's alphabet is called a neighborhood grammar. A string is considered to belong to the language defined by the neighborhood grammar if each symbol of this string occurs in it together with some one of its neighborhoods.

As an example, consider the language A = {a+a, a+a+a, a+a+a+a, ...}. This language is the simplest model of the language of arithmetic expressions, where the symbol "a" plays the role of numbers and the symbol "+" the role of operations. Let us compose a neighborhood grammar for this language. We first set the neighborhoods for the symbol "a". The symbol "a" can occur in strings of the language A in three syntactic contexts: at the beginning, between two "+" symbols, and at the end. To mark the beginning and end of a string we introduce the pseudo-symbol "#". Then the neighborhoods of the symbol "a" are the following: #a+, +a+, +a#. Usually, the center of a neighborhood is underlined (after all, the string may contain other occurrences of the same symbol that are not the center!); we will not do this here for lack of a simple technical means. The symbol "+" occurs only between two "a" symbols, so it is given a single neighborhood, the string a+a.

Consider the string a+a+a and check whether it belongs to the language. The first symbol "a" of the string occurs in it together with the neighborhood #a+. The second symbol, "+", occurs together with the neighborhood a+a. A similar check succeeds for the remaining symbols of the string, so the string belongs to the language, as one would expect. But, for example, the string a+aa does not belong to the language A, since its last and penultimate symbols "a" have no neighborhoods with which they occur in this string.
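The membership check just performed by hand can be sketched in a few lines of Python; the sketch assumes, for simplicity, that every neighborhood is three symbols long with its center in the middle, as is the case in the grammar above:

    # Neighborhood grammar for A = {a+a, a+a+a, ...}; "#" is the
    # pseudo-symbol marking the beginning and end of the string.
    NEIGHBORHOODS = {
        "a": ["#a+", "+a+", "+a#"],
        "+": ["a+a"],
    }

    def belongs(s: str) -> bool:
        padded = "#" + s + "#"
        for i in range(1, len(padded) - 1):
            c = padded[i]
            # each symbol must occur together with one of its neighborhoods
            if c not in NEIGHBORHOODS:
                return False
            if padded[i - 1:i + 2] not in NEIGHBORHOODS[c]:
                return False
        return True

    assert belongs("a+a+a")
    assert not belongs("a+aa")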

Not every language can be described by a neighborhood grammar. Consider, for example, a language B whose strings begin either with the symbol "0" or with the symbol "1". If a string begins with "1", the symbols "a" and "b" may follow; if it begins with "0", only the symbols "a" may follow. It is not difficult to prove that no neighborhood grammar can be invented for this language. The legitimacy of an occurrence of the symbol "b" in a string is determined by the string's first symbol. For any neighborhood grammar that tries to connect the symbols "b" and "1", one can choose a string long enough that no neighborhood of the symbol "b" reaches the beginning of the string. Then the symbol "0" can be substituted at the beginning, and the resulting string will still be accepted, which contradicts our intuitive ideas about the syntactic structure of strings of this language.

On the other hand, it is easy to build a finite-state machine that recognizes this language. This means that the class of languages described by neighborhood grammars is narrower than the class of automaton languages. Languages defined by neighborhood grammars will be called Shreider languages. Thus, in the hierarchy of languages one can single out the class of Shreider languages, a subclass of the automaton languages.
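A sketch of such a machine (written in Python rather than as a transition table): all it has to remember is the first character, which is precisely the kind of long-distance dependency a neighborhood grammar cannot express:

    # Finite automaton for language B: a string starting with "1" may
    # continue with a's and b's; one starting with "0" only with a's.
    def accepts(s: str) -> bool:
        if not s or s[0] not in "01":
            return False
        allowed = "ab" if s[0] == "1" else "a"
        return all(c in allowed for c in s[1:])

    assert accepts("1abba") and accepts("0aaa")
    assert not accepts("0ab")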

One can say that Shreider languages define one simple syntactic relation: "being nearby", the relation of immediate precedence. The relation of distant precedence (which obviously exists in language B) cannot be specified by a neighborhood grammar. But if the syntactic relations in the strings of a language are made explicit, then for the relation diagrams into which such strings turn, a neighborhood grammar can be devised.

The 21st century is a time when the possession of information is the most important competitive advantage in any field of activity. However, information will bring no benefit unless it is expressed in a language understandable to those for whom it is intended, or unless there is a translator capable of conveying its meaning to the addressee.

About 2,000 peoples live on Earth at present, and their distinguishing feature is, first of all, their language.

Along with colloquial (natural) languages, mankind has created many artificial ones, each designed to solve specific problems.

Such sign systems include formal languages, examples of which are presented below.

Definitions

First of all, let us define what a language is. The word is commonly understood to mean a sign system used to establish communication between people and to express knowledge.

The basis of most languages, both artificial and natural, is an alphabet: a set of characters used to form words and phrases.

The language is characterized by:

  • the set of characters used;
  • rules for composing “words”, “phrases” and “texts” from them;
  • a set of rules (syntactic, pragmatic, and semantic) for using the constructions built.

Characteristics of natural languages

As already mentioned, all languages are conventionally divided into artificial and natural, and there are many differences between them.

Spoken languages are natural. Their characteristics include, among others:

  • ambiguity of most words;
  • the existence of synonyms and homonyms;
  • the presence of several names for the same subject;
  • the existence of exceptions to almost all rules.

All these characteristics are the main differences between natural sign systems and formal languages. Examples of ambiguous words and statements are known to everyone: thus, depending on the context, the word "ether" can mean either a substance or radio and television broadcasting.

The main functions of spoken languages ​​are:

  • communication;
  • cognitive activity;
  • expression of emotions;
  • impact on the interlocutor (correspondent, if we are talking about correspondence).

Characteristics of artificial languages

Artificial languages are created by people for special purposes or for specific groups of people.

One of the main characteristics of artificial languages is the unambiguous definition of their vocabulary and of the rules for assigning meanings and forming expressions.

Formal languages ​​and grammars

A language, whether natural or artificial, can exist only if there is a set of definite rules, which must provide a consistent, compact and accurate representation of the relations and properties of the subject area being modeled. If these rules are strictly formulated, the language is called formal. Programming languages are examples of such sign systems, although, strictly speaking, they rather occupy an intermediate position (see below).

The scheme for constructing formal sign systems is as follows (a toy illustration is given after the list):

  • an alphabet is selected (a set of initial symbols);
  • the rules for constructing expressions (syntax) of the language are specified.
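Here is the promised toy illustration of this two-step scheme in Python; the language and its single syntactic rule are invented for the example:

    # Step 1: choose an alphabet.
    ALPHABET = {"0", "1"}

    # Step 2: specify the syntax - here, "a well-formed word is a non-empty
    # string over the alphabet with no leading zero, unless it is exactly '0'".
    def well_formed(word: str) -> bool:
        return (len(word) > 0
                and set(word) <= ALPHABET
                and (word == "0" or word[0] != "0"))

    assert well_formed("101") and well_formed("0")
    assert not well_formed("01") and not well_formed("2")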

Scope of application

Formal languages (the languages of logic, programming languages, etc.) are used in the process of scientific research. They are better suited than natural ones for representing knowledge and serve as a means of more objective and accurate exchange of information.

Formal languages include all the well-known systems of mathematical and chemical notation, Morse code, musical notation, etc.

In addition, formal programming languages are widely used. Their rapid development began in the middle of the 20th century with the advent of computer technology.

Formal logic language

At the heart of any programming language is mathematics. It, in turn, relies on the sign system of formal logic.

As a science, logic was created by Aristotle. He also developed rules for transforming statements that preserve their truth value regardless of the content of the concepts included in these statements.

Formal logic struggles with the "shortcomings" of natural languages associated with the ambiguity of certain statements. For this purpose, operations with thoughts are replaced by actions with the signs of a formal language. This eliminates uncertainty and makes it possible to establish the truth of a statement precisely.
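For instance, the truth of the statement "not (A and not A)" can be established purely by operations on signs, without any appeal to the content of A; a small Python sketch of such a check by exhaustion of truth values:

    # Truth established by formal operations on signs: the formula
    # not (A and not A) is evaluated for every possible value of A.
    def is_tautology(formula) -> bool:
        return all(formula(A) for A in (False, True))

    print(is_tautology(lambda A: not (A and not A)))   # True - the law of non-contradiction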

Features of programming languages

As already mentioned, programming languages can, with some reservations, be classified as formal.

They share many syntactic rules with the latter, and some keywords and constructions with natural languages.

To create a programming language, one must define the set of valid symbols, the set of correct programs of the language, and the meaning of each correct program. If the first task can be handled by means of formalization, in the case of the last one these approaches do not work.

The set of valid characters of a programming language consists of the characters that can be typed from the keyboard; they form the first part of the ASCII encoding table.

Grammars

Programming languages, like any others, have a grammar, i.e. a description of the way sentences are composed. Grammars are described in various ways. In the case of programming languages, the rules are given by ordered pairs of strings of symbols of two kinds: those defining syntactic constructions and those defining semantic restrictions. When specifying a grammar, the rules for constructing syntactic constructions are first set out formally, and then the semantic rules are stated in one of the natural languages.

The rules can also be written graphically, by means of special syntax diagrams. This approach was first applied in the creation of the Pascal language and later became widely used for other languages.

Classification of programming languages

At the moment there are several thousand of them, counting "dialects". They are classified as procedural and declarative: in languages of the first type, data transformation is specified by describing a sequence of actions performed on the data; in the second, by describing relations. There are other classifications as well; for example, programming languages are divided into functional, procedural, object-oriented and logical. Strictly speaking, no classification can be fully objective, since a significant portion of programming languages combine the capabilities of formal systems of several types at once, and over time the boundaries are likely to blur even more.

Now you can answer the question "What formal languages do you know?". Scientists continue to improve them in order to make it possible to solve various practical and theoretical problems that are currently considered unsolvable.

FORMALIZED (FORMAL) LANGUAGES

UNDERSTAND

A formalized (formal) language is an artificial language characterized by precise rules for constructing expressions and for understanding them.

A formal language is built in accordance with clear rules that provide a consistent, accurate and compact representation of the properties and relations of the subject area being studied (the modeled objects).

Unlike natural languages, formal languages have clearly defined rules for the semantic interpretation and syntactic transformation of the signs used; moreover, the meaning of the signs does not change depending on pragmatic circumstances (for example, on context).

Formal languages are often constructed on the basis of the language of mathematics.

Throughout the history of mathematics, symbolic designations for various objects and concepts have been widely used in it; along with these, mathematicians also freely used natural language. But at some stage in the development of science (the 17th century), the need arose for a rigorous logical analysis of mathematical judgments and for a clarification of the concept of "proof", so important for mathematics. It turned out that these problems cannot be solved without a strict formalization of mathematical theories, i.e. without presenting them in a formal language. The 20th century can be considered a century of rapid development of various formal languages.

From the point of view of computer science, the most significant formal languages are the language of logic (the language of the algebra of logic) and programming languages. They are also of great practical importance.

All formal languages are constructs created by someone; most of them are built according to the following scheme.

First of all, an alphabet is chosen: the set of source symbols from which all expressions of the language will be built. Then the syntax of the language is described, that is, the rules for constructing meaningful expressions.

Since the word "symbol" carries many meanings, the term "letter" is more often used for the signs of the alphabet. But it should be remembered that the letters of the alphabet of a formal language may be letters of natural-language alphabets, as well as brackets, special characters, etc.

From letters, according to certain rules, words and expressions can be composed.

The simplest rule is that any finite sequence of letters may be considered a word. The word is in fact the simplest information model (and it is, of course, a constructive object).

EXAMPLE 1

One of the most important alphabets from the point of view of computer science consists of the two letters "0" and "1". Any finite sequence of zeros and ones is a word in this alphabet.
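For instance, all the words of a fixed length in this alphabet are easy to list; a small Python illustration:

    # All words of length 3 over the alphabet {"0", "1"}.
    from itertools import product

    words = ["".join(w) for w in product("01", repeat=3)]
    print(words)   # ['000', '001', '010', '011', '100', '101', '110', '111']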

In logico-mathematical languages, expressions include terms and formulas.

Terms are analogous to the names of objects; their main purpose is to designate an object.

Terms primarily include subject variables and constants: expressions that serve to designate specific objects.

More complex terms are built from subject variables and constants according to certain rules, usually by means of the functions allowed in the language.

EXAMPLE 2

In logic, such functions are inversion (¬), conjunction (∧), disjunction (∨), implication (→), etc.

Examples of terms in the algebra of logic:

BUT; AB A; (AC).

In programming languages, arithmetic operations, relational operations (<, >, =, and so on), and the functions defined in the language are used for this.

Examples of terms in the Pascal programming language:

A;  prog_1;  ((A1+25) < 3*B) and (B <> 0);  2+sqrt(z*sin(b)).

Formulas are formed from terms by applying the operators allowed in the language.

EXAMPLE 3

Examples of logical formulas:

(АС)  АС = 1; x((x)(x))

Program statements can be called formulas in a programming language.

Examples of "formulas" of the Pascal programming language:

A := 2+sqrt(Z*sin(B));  if F > 3 then write(R) else R := sqr(F);

Meaningful expressions are obtained in a formal language only if certain rules for the formation, transformation and "understanding" of terms and formulas are observed. These rules include:

    construction rules for terms and formulas;

    interpretation rules for terms and formulas (the semantic aspect of the language);

    rules for inferring some formulas and terms from other formulas and terms.

For each formal language, the set of these rules must be strictly defined, and modification of any of them most often produces a new variety (dialect) of the language.

EXAMPLE 4

The Pascal statement

if F > 3 then write(R) else R := sqr(F);

is interpreted according to the following rules:

    the variable F may be only of integer or real type, and the variable R only of real type. If this is not the case, the statement is considered syntactically incorrect and will not be executed (a syntax error message is issued);

    the variables (the simplest terms) F and R must be defined beforehand, that is, the cells with these names must contain values of the corresponding types (in some versions of Pascal this rule is not included in the syntax of the language; in that case, whatever sequence of zeros and ones is contained in the cells with the given addresses is interpreted as a number of the corresponding type);

    if the value of the expression (the complex term "F > 3") following the keyword (reserved word) if is "true", then the statement following the keyword then is executed (the value of the variable R is displayed on the screen); if its value is "false", then the statement after the else keyword is executed (the square of the value of the variable F is computed and the result is placed in the cell named R).

The presence in the syntax of a formal language of rules for deriving terms and formulas makes it possible to perform isomorphic transformations of models built on the basis of this language. Thus, formal languages not only reflect (represent) one or another body of existing knowledge, but serve as a means of formalizing this knowledge, allowing new knowledge to be obtained by formal transformations. Moreover, since the transformations can proceed only by strict formal rules, the construction of models isomorphic to a given one, but yielding new knowledge, may well be automated. This possibility is widely used in computer knowledge bases, in expert systems, and in decision support systems.

Formal languages are widely used in science and technology. In scientific research and in practice, formal languages are usually used in close association with natural language, since the latter has much greater expressive power. At the same time, a formal language is a means of representing knowledge more precisely than natural language and, consequently, a means of more accurate and objective exchange of information between people.

KNOW

A formalized (formal) language is an artificial language characterized by precise rules for constructing expressions and for their interpretation (understanding).

When constructing a formal language, an alphabet is chosen and the syntax of the language is described.

Alphabet: the set of initial symbols from which all expressions of the language are built.

The expressions of a formal language are terms and formulas.

The main purpose of a term is to designate an object.

The simplest terms are subject variables and constants: expressions that serve to designate specific objects.

Compound terms are built according to certain rules by applying the functions allowed in the language to simple terms.

Formulas are formed from terms to which the operators allowed in the language are applied.

The syntax of a language - the set of rules for constructing meaningful expressions - includes:

    rules for constructing terms and formulas;

    rules for interpreting terms and formulas;

    rules for inferring some formulas and terms from other formulas and terms.

The most important formal languages are the language of logic and programming languages.

Formal languages are widely used in science and technology. They provide a more accurate and objective exchange of information between people than natural language.

Formal languages not only reflect (represent) existing knowledge but are a means of formalizing it, making it possible to obtain new knowledge through formal transformations. This possibility is widely used in computer knowledge bases, in expert systems, and in decision support systems.

BE ABLE TO

EXERCISE 1

List the letters that make up the alphabet of a programming language you know, and the rules that exist in this language for forming simple terms.

Are there reserved words in this programming language? If so, give examples of reserved and non-reserved words.

What can be considered terms and formulas in programming languages?

ANSWER. The alphabet of a programming language includes all the symbols that can be used when writing programs.

The terms of a programming language are identifiers, as well as expressions built from identifiers, constants, signs of arithmetic and logical operations, mathematical and other functions (defined in the language), and brackets.

The formulas of a programming language are the statements allowed in it: input, output, assignment, conditional, loop, etc.

TASK 2

If you studied the basics of formal logic, then:

    give examples when the formal transformation of logical formulas allows you to get new knowledge about the objects under study;

    interpret the formula: ∀x ¬(P(x) ∧ ¬P(x)) or ¬(A ∧ ¬A) = 1

ANSWER. 2) This is the law of non-contradiction, whose essence is that no statement can be true and false at the same time.

TASK 3

What is the alphabet of the decimal number system?

What is the basic rule for the formation (recording) of numbers in this positional number system?

ANSWER. Alphabet: the decimal digits, the decimal point (or comma), and the plus and minus signs. Rule: the weight of a digit depends on its position in the notation of the number.
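The rule is pure arithmetic, as a short worked example shows (Python; the digits are chosen arbitrarily):

    # Positional principle of the decimal system: the weight of a digit
    # is 10 raised to the power of its position, counted from the right.
    digits = [7, 0, 9]                                   # the number 709
    value = sum(d * 10 ** i for i, d in enumerate(reversed(digits)))
    print(value)                                         # 709 = 7*100 + 0*10 + 9*1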

TASK 4

How can the binary alphabet word “0100 1001 0100 0110” be interpreted in a programming system known to you (spaces are inserted for readability)?

ANSWER. In Pascal, these two bytes can be interpreted as the string of characters "IF", as two numbers of type byte - 73 and 70, or as one number of type integer - 18758 (with the high-order byte first).
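The same interpretations can be reproduced, for instance, in Python (the answer above assumes the high-order byte comes first; with the low-order byte first, the same two bytes read as 17993):

    # The word 01001001 01000110 interpreted in several ways.
    import struct

    data = bytes([0b01001001, 0b01000110])
    print(data.decode("ascii"))            # IF     - a string of two characters
    print(data[0], data[1])                # 73 70  - two numbers of type byte
    print(struct.unpack(">H", data)[0])    # 18758  - 16-bit integer, high byte first
    print(struct.unpack("<H", data)[0])    # 17993  - 16-bit integer, low byte first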

TASK 5

The graphical interface of the Windows system contains elements such as pictograms or icons. Can we assume that they are included in the alphabet of the user interface language of this system? Justify the answer.

EXPAND YOUR PERSPECTIVE

In formal languages, as nowhere else, the role of the sign, understood in the broad sense of the word, is great. Some aspects of the use of signs were considered earlier, but it makes sense to discuss this in more detail.

The reason for the appearance of signs is quite obvious: most of the objects of cognition and activity are not available for direct perception in the process of cognition and presentation in the process of communication.

A sign (Greek σημεῖον, Latin transcription semeion) is a material object that acts as a representative of some other object, property or relation, and is used to acquire, store, process and transmit messages (information, knowledge).

NOTE 1. Instead of the word “sign”, other concepts are used in a similar sense: “name”, “term”, “designation”.

According to the definition of one of the creators of the theory of signs (semiotics), C. S. Peirce, a sign is an element x that, for the subject, replaces some element y (the denotation) with respect to some attribute.

Accordingly, the denotation is what the sign denotes in the particular situation.

The denotation of an abstract linguistic unit (from the Latin denoto, "I designate") is the set of objects that can be called by the given sign.

NOTE 2. Instead of the word "denotation", other (synonymous) names are used in logic, most often "meaning" or "the denoted".

In turn, each sign captures certain properties of the object it denotes. The information that a sign carries about what it denotes is called the concept of the sign (from the Latin conceptus, "concept").

NOTE 3. The term “concept” has synonyms: “meaning”, “sign meaning”.

FOR EXAMPLE, in the Russian word for "animal" (zhivotnoe) we find the ancient meaning of the word zhivot: "life". Animals differ not in having a stomach, but in being alive. Thus, the concept of the sign "animal" is the concept of a living being, and the denotation is any specific living being meant in the given sign situation.

According to Peirce, all signs are divided into indexical, iconic and symbolic, by the nature of the relation between the signifier and the signified.

The indexical relation between the signifier and the signified is based on their actual, really existing connection. This class includes, for example, onomatopoeic words or the structural formulas of chemical compounds. Index signs are connected with what they designate by a causal relation (for example, wet roofs are a sign that it has rained).

The iconic relation between the signifier and the signified is, according to C. S. Peirce, "a simple generality in some property". Copy signs (iconic signs) are reproductions similar to what they designate (for example, photographs, fingerprints).

In a symbolic sign, the signifier and the signified are related "without regard to any actual connection": for example, a certain combination of sounds, letters, figures, colors or movements is assigned to some object.

For the construction of formal languages, it is this type of signs that is important (see the paragraph of the first chapter on the main thesis of formalization).

NOTE 4. Symbolic signs are sometimes simply called symbols. According to the idea of the outstanding Russian philosopher P. A. Florensky, a symbol is "a being that is greater than itself": something that manifests what is not itself, something greater than itself, and yet essentially reveals itself through it. For example, the mythical griffin, combining a lion and an eagle, is one of the symbols of Jesus Christ.

It often happens that a sign that first appeared as an iconic sign later becomes a sign-symbol.

FOR EXAMPLE, the first letter of the Phoenician alphabet was called "aleph", meaning "bull" (its shape resembles a bull's head); it was then an iconic sign. In Greek, this letter is no longer associated with a bull and becomes a sign-symbol.

As mathematical symbolism develops, iconic signs are also replaced by symbols. For example, the Roman numeral V resembled an open hand (five fingers), while the modern numeral 5 is a symbol.

The signs ♀ and ♂ denote, respectively, the planets Venus and Mars in astronomy, and the female and male sexes in biology. These signs are iconic in origin: the first is a stylized image of an ancient mirror, the second a shield with a spear.

Denotations are by no means always real-life objects or collections of such objects. Many examples of denotations that are not objects of reality are contained in Lewis Carroll's famous tale "Alice in Wonderland", which also formulates, figuratively, the principle by which such denotations arise:

"Live he did (the March Hare - author's note), but be he was not." In this light, the traditional Russian fairy-tale opening "he lived and was" (zhil-byl) does not look like a tautology at all.

The structure of a sign is described by the so-called "Frege triangle" (named after the outstanding German logician who did much for the development of the theory of formal languages). In other terminology it is called the "semantic triangle" or the Ogden-Richards triangle. It establishes the connection between the sign, the denotation of the sign and the concept of the sign.

Fig. 4.3.1. The Frege triangle

With the help of this triangle, a number of well-known language effects (sign situations) can be clarified.

1) Synonymy: a situation in which the meanings of different signs fully or partially coincide:

Fig. 4.3.2. Scheme of synonymy

2) Denotative identity: signs have the same denotation but different meanings. For example, the signs "sin 30°" and "1/2" have the same denotation, that is, they name the same real number, but their meanings are different:

Fig. 4.3.3. Scheme of denotative identity

3) Polysemy (ambiguity): a sign has more than one meaning:

Fig. 4.3.4. Scheme of polysemy

INTERESTING FACT

History reference

The first steps towards the creation of a formal language of logic were taken in antiquity. Aristotle (384-322 BC) introduced letter variables for the subjects and predicates of simple categorical statements, and the head of the Stoic school, Chrysippus (c. 281-208 BC), and his students introduced variables for statements in general. In the 17th century, R. Descartes (1596-1650) created the basis of the modern formal language of mathematics, letter algebra, and G. W. Leibniz (1646-1716) transferred Descartes' symbolism to logic. The main language of logic at that time was still natural language. Realizing the significant syntactic and semantic shortcomings of natural language (cumbersomeness, ambiguity and inaccuracy of expressions, fuzzy syntactic rules, etc.), Leibniz formulated the thesis that without the creation of a special artificial language, a "universal calculus", the further development of logic is impossible. But only at the end of the 19th century was Leibniz's idea developed, in the studies of G. Boole (1815-1864), W. S. Jevons (1835-1882), E. Schröder (1841-1902) and others: the algebra of logic appeared.

The further development of the language of logic is associated with the names of G. Peano (1858-1932) and G. Frege (1848-1925). Peano introduced a number of symbols accepted in modern mathematics, in particular "∈", "∪" and "∩", to designate, respectively, membership, union and intersection of sets. Frege built an axiomatic calculus of propositions and predicates, which contained all the basic elements of modern logical calculi.

Based on the results obtained by Frege, and using a modified Peano symbolism, B. Russell (1872-1970) and A. N. Whitehead (1861-1947), in their joint work "Principia Mathematica" (1910-1913), formulated the basic provisions of the formal language of logic.

At present, the language of logic finds important applications in computer science: in the development of programming languages, computer software, and various technical systems.

Programming languages emerged in the early 1950s. Initially, programs were created in the language of machine instructions: sequences of binary codes entered from a console into the computer for execution.

The first step in the development of programming languages was the introduction of mnemonic (from mnemonics, the art of memory) designations for commands and data, and the creation of a machine program translating these mnemonic designations into machine codes. Such a program, and with it the notation system, was called an assembler.

Each type of machine had its own assembler, and transferring programs from machine to machine was a very laborious procedure. Hence arose the idea of creating a machine-independent language. Such languages began to appear in the mid-1950s, and a program translating sentences of such a language into machine language became known as a translator.

There are several thousand programming languages and their dialects (varieties). They can be classified in different ways. Some authors divide the whole variety of programming languages into procedural and declarative. In procedural languages, data transformation is specified by describing a sequence of actions on the data; in declarative languages, primarily by describing the relations between the data themselves. According to another classification, programming languages can be divided into procedural, functional, logical and object-oriented. However, any classification is somewhat arbitrary, since, as a rule, most programming languages include features of languages of different types.
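The difference is easy to feel on a toy task; the two fragments below (both Python, chosen only for brevity) compute the same sum of squares in the two styles:

    # Procedural style: the transformation is a sequence of actions on the data.
    total = 0
    for x in [1, 2, 3, 4]:
        total += x * x

    # Declarative (functional) style: the result is described as an expression
    # over the data, without an explicit step-by-step recipe.
    total2 = sum(x * x for x in [1, 2, 3, 4])

    assert total == total2 == 30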

A special place among programming languages is occupied by the languages that serve database management systems (DBMS). Two subsystems are often distinguished in them: a data description language and a data manipulation language (also called a query language).

Programming is a whole science that makes it possible to create computer programs. It includes a huge number of different operations and algorithms that form a single whole: a programming language. So what is it, and what programming languages are there? The article provides answers, as well as an overview list of programming languages.

The history of the emergence and evolution of programming languages should be studied together with the history of computer technology, because the two are directly related. Without programming languages it would be impossible to create any program for a computer, and the creation of computers would be a meaningless exercise.

The first machine language was created in 1941 by Konrad Zuse, the builder of one of the first programmable computers. A little later, in 1943, Howard Aiken created the Mark-1 machine, capable of reading instructions at the level of machine code.

In the 1950s there was an active demand for software, and machine language could not cope with large volumes of code, so a new way of communicating with computers was created: assembler, the first mnemonic language to replace machine instructions. Over the years the list of programming languages has only grown, as the scope of computer technology has become ever broader.

Classification of programming languages

At the moment there are more than 300 programming languages. Each has its own characteristics and suits one specific kind of task. All programming languages can be divided into several groups:

  • Aspect-oriented (the main idea is the separation of functionality to increase the efficiency of program modules).
  • Structural (based on the idea of creating a hierarchical structure from individual blocks of the program).
  • Logical (based on the apparatus of mathematical logic and resolution rules).
  • Object-oriented (programming centers not on algorithms but on objects belonging to a certain class).
  • Multi-paradigm (combining several paradigms; the programmer decides which to use in each case).
  • Functional (the main elements are functions, whose values depend on the results of computations on the input data).

Programming for beginners

Many people ask: what is programming? Basically, it is a way of communicating with a computer. Thanks to programming languages, we can set specific tasks for various devices by creating special applications or programs. When studying this science at the initial stage, the most important thing is to choose suitable (interesting to you) programming languages. A list for beginners is below:

  • Basic was invented in 1964, belongs to the family of high-level languages and is used to write application programs.
  • Python ("Python") is quite easy to learn thanks to its simple, readable syntax, and it can be used to create both ordinary desktop programs and web applications.
  • Pascal ("Pascal") is one of the oldest languages (late 1960s), created for teaching students. Its modern modification has strict typing and structure, and "Pascal" is a thoroughly logical language, understandable at an intuitive level.

This is not a complete list of programming languages for beginners. There are a huge number of syntaxes that are easy to understand and will certainly be in demand in the coming years. Everyone has the right to choose independently the direction that interests them.

Beginners can speed up learning programming and its basics with special tools. The main assistant is the Visual Basic integrated development environment for programs and applications ("Visual Basic" is also a programming language, which inherited the style of the Basic language of the 1970s).

Programming language levels

All the formalized languages designed for creating and describing programs and algorithms for solving problems on computers fall into two main categories: low-level programming languages (a list is given below) and high-level ones. Let us talk about each separately.

Low-level languages are designed for creating machine instructions for processors. Their main advantage is the use of mnemonic notation: instead of a sequence of zeros and ones (in the binary number system), one remembers a meaningful abbreviated English word. The best-known low-level languages are assembler (it has several varieties, which have much in common and differ only in sets of additional directives and macros), CIL (available on the .NET platform) and Java bytecode.

High-level programming languages: list

High-level languages are designed for the convenience and efficiency of writing applications; they are the exact opposite of low-level languages. Their distinguishing feature is the presence of semantic constructions that describe the structures and algorithms of programs concisely and briefly. In low-level languages, their description in machine code would be too long and incomprehensible. High-level languages are platform-independent; instead, compilers perform the translator's function, translating the program text into elementary machine instructions.

The following programming languages are among the most used high-level ones: C, C# ("C-sharp"), Fortran, Pascal, and Java. They work with complex data structures, support string types and file I/O operations, and have the advantage of being much easier to work with thanks to readability and an understandable syntax.

Most used programming languages

In principle, you can write a program in any language. The question is whether it will work efficiently and without failures. That is why the most suitable programming languages should be chosen for solving different problems. The popularity list can be summarized as follows:

  • OOP languages: Java, C++, Python, PHP, VisualBasic and JavaScript;
  • group of structural languages: Basic, Fortran and Pascal;
  • multi-paradigm: C#, Delphi, Curry and Scala.

Scope of programs and applications

The choice of the language in which a program is written depends largely on its area of application. For example, for working with the computer hardware itself (writing drivers and supporting programs), the best option would be C or C++, which are among the main programming languages (see the list above). For developing mobile applications, including games, one should choose Java or C# ("C-sharp").

If you have not yet decided in which direction to work, we recommend starting with C or C++. They have a very clear syntax and a clear structural division into classes and functions. Besides, knowing C or C++, you can easily learn any other programming language.

The first thing that distinguishes one programming language from another is its syntax. The main purpose of syntax is to provide a notation for the exchange of information between the programmer and the compiler. However, when working out the details of a syntax, secondary criteria are often taken into account, intended to make programs easy to read, write and translate, and to make them unambiguous. While the convenience of reading and writing programs matters to the user of a programming language, the ease of translation and the absence of ambiguity matter to the translator. These goals are generally contradictory, and finding an acceptable compromise between them is one of the central tasks in the design of a programming language.

The development of a new programming language begins with the definition of its syntax. To describe the syntax of a programming language, some language is in turn needed. A language intended for describing another language is called a metalanguage; a language used for describing the syntax of a language is called a metasyntactic language. Metasyntactic languages use a special set of conventional signs, which constitutes the notation of the language.

Historically, the first metasyntactic language used in practice to describe the syntax of programming languages (in particular, Algol-60) was Backus normal form, abbreviated BNF (also read as Backus-Naur form). The main purpose of Backus forms is to present, in a concise and compact form, strictly formal and unambiguous rules for writing the basic constructions of the programming language being described.

The formal definition of the syntax of a programming language is usually called a grammar.

Formal languages ​​and grammars

Initially, the science of languages, linguistics, was reduced to the study of specific natural languages, their classification, and the elucidation of similarities and differences between them. The emergence and development of metamathematics, which studies the language of mathematics, work on the means of animal communication, and other research led in the 1930s to a significantly broader understanding of language, in which a language is any means of communication consisting of:

a sign system, i.e. a set of admissible sequences of signs;

a set of meanings of this system;

a correspondence between sequences of signs and meanings, which makes the sequences of signs "meaningful".

Signs can be letters of an alphabet, mathematical symbols, sounds, etc. Mathematical linguistics considers only those sign systems in which the signs are symbols of some alphabet and sequences of signs are texts; that is, languages are viewed as arbitrary sets of meaningful texts. The rules that define the set of texts form the syntax of the language, while the description of the set of meanings and of the correspondence between meanings and texts forms the semantics of the language. The semantics of a language depends on the nature of the objects the language describes, and the means of studying it differ for different types of languages. The syntax of a language, as it turned out, depends to a lesser extent on the purpose of the language and can be studied by methods that do not depend on its content and purpose. The mathematical apparatus for studying the syntax of languages is called the theory of formal grammars. From the point of view of syntax, a language is understood not as a means of communication, but as a set of formal objects: sequences of symbols of an alphabet. The term "formal" emphasizes that the objects and the operations on them are considered purely formally, without any meaningful interpretation. Let us reproduce the main terms and definitions of this theory.

A letter (or symbol) is a simple indivisible sign; a set of letters forms an alphabet. Alphabets are sets, so set-theoretic notation applies to them. A string is an ordered sequence of letters of an alphabet; strings will also be called words. The set of all possible strings (words) over an alphabet A is called the closure of A and is denoted A*.

The set A* is also called the iteration of the alphabet A.

If a string contains repeated letters, abbreviated notation is used, showing that the string is regarded as a product of the letters of the alphabet: for example, aaab can be written as a³b.

When converting some strings into others, the notion of a substring is used.

An alternative set of terms for letter, alphabet and string (word) is, respectively: word, dictionary and sentence. A set of strings (or sentences) is called a language. Formally, a language L over an alphabet A is an arbitrary subset of the set A*.

Thus, in this terminology, a programming language over a given alphabet A is the subset of A* that includes only those sentences which, by virtue of external information about their semantics, are considered meaningful, i.e. satisfy the syntax of the programming language.

The above definition of a formal language as an arbitrary subset of A* is general: it does not make it possible to single out, among the multitude of languages, the individual classes that are used in practice.

The systematic use of mathematical methods for describing programming languages began in the 1960s, when it was found that the Backus forms used to describe the syntax of ALGOL-60 have a strict formal justification by the means of mathematical linguistics. Since then began the history of the development and application of a formal mathematical apparatus - the theory of formal languages and grammars - for the design and construction of translators.

The Backus form describes two classes of objects: first, the basic symbols of the programming language and, second, the names of the constructions of the described language, the so-called metalinguistic variables.

Formal definition of grammar

Backus-Naur form

A grammar is defined as a quadruple G = (VT, VN, P, S), where:

VT is the set of terminal symbols (the symbols of the alphabet);

VN is the set of non-terminal symbols (symbols denoting the concepts of the language);

P is the set of rules;

S is the goal (start) symbol of the grammar, its axiom.

Consider a formal Backus description of the grammar of decimal integers.

G = ({0, 1, 2, 3, 4, 5, 6, 7, 8, 9, +, -}, {<number>, <digit>}, P, <number>)

P is the set of rules generating the lexemes of the language:

<number> ::= [ + | - ] <digit> [ <digit> ... ]

<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

An optional element within a rule is enclosed in square brackets [...].

Alternative items are indicated by a vertical list of options enclosed in curly brackets (...).

Optional alternatives are indicated by a vertical list of options enclosed in square brackets [...].

A repeating element is indicated by a list of one element (enclosed, if necessary, in curly or square brackets) followed by an ellipsis "...".

Required keywords are underlined, but optional noise words are not.
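To see that these rules really do pin down a set of strings, here is a minimal Python sketch of a recognizer for the integer grammar above; the helper name is_number is, of course, our own invention:

```python
DIGITS = set("0123456789")  # <digit> ::= 0 | 1 | ... | 9

def is_number(s):
    """Check a string against <number> ::= [+ | -] <digit> [<digit>...]."""
    if s and s[0] in "+-":  # the optional sign in square brackets
        s = s[1:]
    # at least one digit must follow, possibly repeated
    return len(s) > 0 and all(ch in DIGITS for ch in s)

for w in ["-42", "+7", "100", "", "+", "12a"]:
    print(w, is_number(w))
# -42 True, +7 True, 100 True, '' False, + False, 12a False
```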

Backus forms are a formal description of a language, and it was essentially they that prompted researchers to introduce mathematical tools for the systematic description and study of programming languages and to use this mathematical apparatus as the basis for parsing in a translator; this later grew into the various parsing methods based on formal syntactic definitions.

It should be noted that BNF does not allow one to describe context dependencies in a programming language. For example, a restriction of Pascal such as "an identifier cannot be declared twice in the same block" cannot be expressed in BNF. Restrictions of this kind are closer to the other characteristic of a language, its semantics, so other means are used here, generally called metasemantic languages. As a rule, however, BNF remains the core of these languages.
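For illustration, here is a minimal Python sketch of the kind of check that has to live outside the grammar: a pass over the declared names of a block with a symbol table. The helper and the data are hypothetical, chosen only to show the idea:

```python
def check_unique_declarations(declared_names):
    """Reject an identifier declared twice in the same block --
    a restriction that BNF cannot express (hypothetical example)."""
    seen = set()
    for name in declared_names:
        if name in seen:
            raise NameError(f"identifier '{name}' is declared twice in the same block")
        seen.add(name)

check_unique_declarations(["x", "y", "z"])   # passes silently
try:
    check_unique_declarations(["x", "y", "x"])
except NameError as e:
    print(e)  # identifier 'x' is declared twice in the same block
```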

A feature of many metalinguistic formulas is the presence of recursion, i.e. the use of the constructs being described in their own description. Recursion can be explicit or implicit. Explicit recursion occurs when a rule directly mentions the variable it defines; the repetition of digits in the number grammar above, for instance, can be written as <digits> ::= <digit> | <digit> <digits>. Implicit recursion is present when a metalinguistic variable used at some step in building a construct itself denotes that construct.

The presence of recursion makes metalinguistic formulas harder to read and understand, but it is essentially the only way a finite number of rules can describe a language containing an infinite number of strings of basic symbols. Programming languages are infinite, in the sense that infinitely many correct programs can be written in them, so a BNF description of their syntax will always contain explicit or implicit recursion.

In practice, other metalinguistic languages are also used to describe the syntax of programming languages. One purpose of their use is to eliminate a certain unnaturalness of pure BNF in representing common syntactic constructs: optional, alternative, and repeating rule elements.

The theory of formal grammars deals with the description, recognition, and processing of languages. It answers a number of applied questions, for example: can the languages of some class Z be recognized quickly and easily; does a given language belong to class Z; is there an algorithm that answers the question "does the string a belong to the language L?", and so on.

In general, there are two main ways to describe individual classes of languages:

using a generating procedure;

using a recognition procedure.

The first is given by a finite set of rules, called a grammar, that generate exactly those strings that belong to the language L.

The second is given by some abstract recognizing device (automaton).

When constructing translators, both methods are used: a grammar as the means of describing the syntax of the programming language, and an automaton as the model of the algorithm for recognizing sentences of the language, which serves as the basis for constructing the translator. Methodologically (and technologically), the grammar is constructed first, and the recognition algorithm is then built from it.

It should be noted that although a generative grammar describes the process of generating the strings of the language L(G), this description is not algorithmic: the grammar lacks one of the main properties of an algorithm, determinism, since the specific order of applying the substitution rules is not fixed. This is what makes the description of the language compact. In general, such an enumerating algorithm can be fixed in various ways, but this is not required for a precise definition of the language.

Thus, a formal grammar G potentially defines a whole set of algorithms for generating the language.
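A small Python sketch makes this nondeterminism visible: the integer grammar above is restated in recursive form (the rule table and the helper are ours, for illustration only), and each derivation step picks an applicable rule at random. Every run may produce a different string, yet only strings of the language are ever produced:

```python
import random

# The integer grammar from above, written with explicit recursion;
# the order in which rules are applied is deliberately left to chance.
RULES = {
    "<number>": [["<sign>", "<digits>"], ["<digits>"]],
    "<sign>":   [["+"], ["-"]],
    "<digits>": [["<digit>"], ["<digit>", "<digits>"]],
    "<digit>":  [[d] for d in "0123456789"],
}

def generate(symbol="<number>"):
    """Expand nonterminals by picking any applicable rule at random."""
    if symbol not in RULES:               # a terminal symbol: emit it as is
        return symbol
    rule = random.choice(RULES[symbol])   # the nondeterministic choice
    return "".join(generate(s) for s in rule)

print([generate() for _ in range(5)])  # e.g. ['7', '-30', '+5', '19', '2']
```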

The practical application of grammars is connected with solving the recognition problem. The recognition problem is solvable if there is an algorithm that, in a finite number of steps, answers the question of whether an arbitrary string over the basic vocabulary of the grammar belongs to the language generated by that grammar. If such an algorithm exists, the language is called recognizable. If, in addition, the number of steps of the recognition algorithm can be bounded by a function of the length of the string, the language is called easily recognizable. It makes no sense to speak of building a translator for an unrecognizable programming language; therefore, in practice, one considers those particular classes of generative grammars that correspond to recognizable and, in most cases, easily recognizable languages. The most important classes of such languages can be defined within the classification proposed in 1959 by the American linguist N. Chomsky (the Chomsky classification). He proposed classifying formal languages according to the type of the rules of the grammars that generate them:

Class 0. Grammars with phrase structure (unrestricted grammars). They can serve as a model of natural languages. They are the most complex and have no practical application in the construction of translators.

Class 1. Context-sensitive grammars. When generating sentences, a non-terminal symbol is replaced with its context taken into account. Grammars of this kind can serve as a basis for automated translation from one natural language to another.

Class 2. Context-free grammars. A non-terminal is replaced without regard to its context. CF-grammars play the main role in the formal study of the syntax of programming languages and in the construction of the parsing unit of a translator.

Class 3. Regular grammars. The languages of class 3 are called finite-state (automaton, regular) languages, and the grammars that generate them are called automaton grammars (A-grammars). A-grammars are used mainly at the stage of lexical analysis.

The main classes of languages ​​can be defined by classes of abstract recognizers (automata), which also form the corresponding hierarchy.

Of the four classes of grammars, context-free grammars are the most important in application to programming languages. With their help one can define a large part, although not all, of the syntactic structure of a programming language.

According to the types of grammars, languages are divided into four types:

Type 0 - languages with phrase structure. All natural languages belong to this type.

Type 1 - context-sensitive languages. Such languages and grammars are used in the analysis and translation of texts in natural languages; automated translation from one natural language to another can be performed on their basis.

Type 2 - context-free languages. Context-free languages underlie the syntactic constructs of modern programming languages.

Type 3 - regular languages. These are the most common and the most widely used in the design of computer systems. Regular sets, regular expressions, and finite automata are used to work with them, as the sketch below illustrates.
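As a small illustration of the type-3 case, the language of signed integers described earlier by a grammar can equally well be given by a regular expression; the sketch below uses Python's re module:

```python
import re

# The language of signed integers from the example above is regular,
# so it can be given by a regular expression instead of a grammar.
INTEGER = re.compile(r"[+-]?[0-9]+")  # optional sign, one or more digits

for w in ["-42", "+7", "100", "12a", ""]:
    print(w, bool(INTEGER.fullmatch(w)))
# -42 True, +7 True, 100 True, 12a False, '' False
```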

Conclusion: the classification type of a language determines which grammar can be used to construct the sentences of the language and how those sentences can be recognized.

For many programming languages there are specially formulated statements that make it possible to check whether a language belongs to a given type. Such statements are called lemmas (for example, the pumping lemmas for regular and for context-free languages).

A finite automaton's only memory is its current state, and it processes a sequence of input symbols drawn from a finite set. Mathematically, a finite automaton is described as a five-tuple

M = (V, Q, δ, q0, F), where

V = {a1, ..., an} is the input alphabet;

Q = {q0, q1, ..., qm} is the alphabet of states;

δ: Q × V → Q is the transition function;

q0 ∈ Q is the initial state of the automaton;

F ⊆ Q is the set of final states of the automaton.

A finite automaton can be represented schematically by the diagram in Figure 2.1.

Figure 2.1 - Simplified scheme of a finite automaton

The control unit (CU) reads symbols sequentially, moving from left to right. It can be in various states: at the start of its work it is in the initial state q0, and at the end it must be in one of the final states from F. Transitions from state to state occur in accordance with the transition function δ, so the operation of a finite automaton can be written as commands of the form

q a → q'

Such a command means that the automaton, being in state q and reading the symbol a, passes into state q'.

Thus, a finite automaton is a recognizer of a language.
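Here is a minimal Python sketch of such a recognizer for the signed-integer language used throughout; the state names and the jam-on-None convention are our own illustrative choices:

```python
# A deterministic finite automaton for the language of signed integers:
# the states model "nothing read yet", "sign read", "digits read".
V = set("+-0123456789")               # input alphabet
Q = {"q0", "q_sign", "q_digits"}      # alphabet of states
F = {"q_digits"}                      # set of final states

def delta(q, a):
    """The transition function δ: Q × V -> Q (None = no transition)."""
    if q == "q0" and a in "+-":
        return "q_sign"
    if q in ("q0", "q_sign", "q_digits") and a.isdigit():
        return "q_digits"
    return None

def accepts(s, q="q0"):
    for a in s:
        if a not in V:            # symbol outside the input alphabet
            return False
        q = delta(q, a)
        if q is None:             # the automaton jams: the string is rejected
            return False
    return q in F                 # accept iff we stop in a final state

for w in ["-42", "+7", "100", "+", "12a"]:
    print(w, accepts(w))
# -42 True, +7 True, 100 True, + False, 12a False
```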

The task of parsing is, given a grammar (the programming language is known), to construct a recognizer for that language.

Recognizers can be classified according to the types of their constituent components:

the reading device;

the control device;

the memory.

According to the type of reading device, recognizers can be one-way or two-way. One-way recognizers allow the reading device to move over the input symbols in only one direction; two-way recognizers allow movement in both directions.

According to the type of control devices, recognizers are:

deterministic;

non-deterministic.

A recognizer is deterministic if at each step of its operation there is exactly one configuration to which it can pass at the next step.

According to the types of memory, recognizers are:

1) without memory;

2) with limited memory;

3) with unlimited memory.

One way to describe a language recognition algorithm is to specify a recognizer.

For context-free languages, such recognizers are pushdown automata.

Pushdown (stack) memory is organized on a last-in, first-out (LIFO) basis.
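A minimal Python sketch shows the idea: the textbook context-free language of balanced brackets is recognized with a stack, which provides exactly this last-in, first-out memory (the function is illustrative, not a full formalization of a pushdown automaton):

```python
def balanced(s):
    """A pushdown-style check for the context-free language of
    balanced brackets: the stack is the last-in, first-out memory."""
    stack = []
    pairs = {")": "(", "]": "["}
    for ch in s:
        if ch in "([":
            stack.append(ch)        # push an opening bracket
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:  # pop and compare
                return False
    return not stack                # empty stack = string accepted

for w in ["([])", "(()", "([)]", ""]:
    print(repr(w), balanced(w))
# '([])' True, '((' + ')' False, '([)]' False, '' True
```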

The process of translating a source program into an object program is usually divided into several subprocesses (phases); a small sketch of the first phase follows the list. The main phases of translation are:

1) lexical analysis;

2) parsing;

3) semantic analysis;

4) synthesis of an object program.
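As a small illustration of the first phase, here is a minimal Python sketch of lexical analysis; the token classes and the tokenize helper are hypothetical, chosen only to show how source text is cut into lexemes before parsing begins:

```python
import re

# Hypothetical token classes for a toy language: numbers, identifiers,
# and a few operator symbols, with whitespace skipped between tokens.
TOKEN = re.compile(r"\s*(?:(?P<NUMBER>[+-]?[0-9]+)|(?P<IDENT>[a-z]+)|(?P<OP>[=+*]))")

def tokenize(text):
    """Cut the source text into (class, lexeme) pairs."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

print(tokenize("x = 12 + y"))
# [('IDENT', 'x'), ('OP', '='), ('NUMBER', '12'), ('OP', '+'), ('IDENT', 'y')]
```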

