Specification & Recognition of Tokens,

RGPV: Compiler Design: Unit 1


Lexical Analysis:

1.      The lexical analyzer breaks these syntaxes into a series of tokens, by removing any whitespace or comments in the source code.

2.      If the lexical analyzer finds a token invalid, it generates an error. It reads character streams from the source code, checks for legal tokens, and passes the data to the syntax analyzer when it demands.



1.      Lexemes are said to be a sequence of characters (alphanumeric) in a token.

2.      There are some predefined rules for every lexeme to be identified as a valid token.

3.      These rules are defined by grammar rules, by means of a pattern.

4.      In programming language, keywords, constants, identifiers, strings, numbers, operators and punctuations symbols can be considered as tokens.

For example, in C language, the variable declaration line

int value = 100;

contains the tokens:

int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol).



Specifications of Tokens:

Let us understand how the language theory undertakes the following terms:


Any finite set of symbols {0,1} is a set of binary alphabets, {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is a set of Hexadecimal alphabets, {a-z, A-Z} is a set of English language alphabets.


Any finite sequence of alphabets is called a string.

Special Symbols:

A typical high-level language contains the following symbols:

Arithmetic Symbols

Addition(+), Subtraction(-), Modulo(%), Multiplication(*), Division(/)


Comma(,), Semicolon(;), Dot(.), Arrow(->)



Special Assignment

+=, /=, *=, -=


==, !=, <, <=, >, >=



Location Specifier



&, &&, |, ||, !

Shift Operator

>>, >>>, <<, <<<


A language is considered as a finite set of strings over some finite set of alphabets

Longest Match Rule:

When the lexical analyzer read the source-code, it scans the code letter by letter and when it encounters a whitespace, operator symbol, or special symbols it decides that a word is completed.

For example:

int intvalue;

While scanning both lexemes till int, the lexical analyzer cannot determine whether it is a keyword int or the initials of identifier int value.


The various operations on languages are:

·        Union of two languages L and M is written as

L U M = {s | s is in L or s is in M}

·        Concatenation of two languages L and M is written as

LM = {st | s is in L and t is in M}

·        The Kleene Closure of a language L is written as

L* = Zero or more occurrence of language L.


If r and s are regular expressions denoting the languages L(r) and L(s), then

·        Union : (r)|(s) is a regular expression denoting L(r) U L(s)

·        Concatenation : (r)(s) is a regular expression denoting L(r)L(s)

·        Kleene closure : (r)* is a regular expression denoting (L(r))*

·        (r) is a regular expression denoting L(r)

Representing valid tokens of a language in regular expression:

If x is a regular expression, then:

·        x* means zero or more occurrence of x.

·        x+ means one or more occurrence of x.

·        x? means at most one occurrence of x


Representing occurrence of symbols using regular expressions:

letter = [a z] or [A Z]

digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]

sign = [ + | - ]

Representing language tokens using regular expressions:

Decimal = (sign)?(digit)+

Identifier = (letter)(letter | digit)*

The only problem left with the lexical analyzer is how to verify the validity of a regular expression used in specifying the patterns of keywords of a language. A well-accepted solution is to use finite automata for verification.

Finite Automata:

1.      Finite automata is a state machine that takes a string of symbols as input and changes its state accordingly.


2.      If the input string is successfully processed and the automata reaches its final state, it is accepted.

The mathematical model of finite automata consists of:

  • Finite set of states (Q)
  • Finite set of input symbols (Σ)
  • One Start state (q0)
  • Set of final states (qf)
  • Transition function (δ)

The transition function (δ) maps the finite set of state (Q) to a finite set of input symbols (Σ), Q × Σ Q

Finite Automata Construction

Let L(r) be a regular language recognized by some finite automata (FA).

·        States: States of FA are represented by circles. State names are written inside circles.

·        Start state: The state from where the automata starts, is known as the start state. Start state has an arrow pointed towards it.

·        Intermediate states: All intermediate states have at least two arrows; one pointing to and another pointing out from them.

·        Final state: If the input string is successfully parsed, the automata is expected to be in this state. Final state is represented by double circles.

·        Transition: The transition from one state to another state happens when a desired symbol in the input is found. Upon transition, automata can either move to the next state or stay in the same state





Stay tuned for updates..

Related topics

Professor Jayesh video tutorial

Please use contact page in this website if you find anything incorrect or you want to share more information about the topic discussed above.