Graphical Parsing Tree Drawing Tool
If you need to parse a language, or document, from Java, there are fundamentally three ways to solve the problem:
- Use an existing library supporting that specific language: for example, a library to parse XML.
- Build your own custom parser by hand.
- Use a tool or library to generate a parser: for example ANTLR, which you can use to build parsers for any language.
Use an Existing Library
The first option is the best for well-known and supported languages, like XML or HTML. A good library usually also includes an API to programmatically build and modify documents in that language. This is typically more than what you get from a basic parser. The problem is that such libraries are not so common and they support only the most common languages. In other cases, you are out of luck.
Building Your Own Custom Parser by Hand
You may need to go for the second option if you have particular needs, either in the sense that the language you need to parse cannot be parsed with traditional parser generators, or because you have specific requirements that you cannot satisfy using a typical parser generator. For instance, because you need the best possible performance or a deep integration between different components.
A Tool or Library to Generate a Parser
In every other case, the third option should be the default one, because it is the most flexible and has the shortest development time. That is why, in this article, we concentrate on the tools and libraries that correspond to this option.
Note: Text in blockquotes describing a program comes from the respective documentation.
Tools To Create Parsers
We are going to see:
- Tools that can generate parsers usable from Java (and possibly from other languages)
- Java libraries to build parsers
Tools that can be used to generate the code for a parser are called parser generators or compiler-compilers. Libraries that create parsers are known as parser combinators.
Parser generators (or parser combinators) are not trivial: You need some time to learn how to use them, and not all types of parser generators are suitable for all kinds of languages. That is why we have prepared a list of the best-known of them, with a short introduction for each of them. We are also concentrating on one target language: Java. This also means that (usually) the parser itself will be written in Java.
To list all possible tools and libraries for all languages would be kind of interesting, but not that useful. That is because there would be simply too many options, and we would all get lost in them. By concentrating on one programming language, we can provide an apples-to-apples comparison and help you choose one option for your project.
Useful Things to Know About Parsers
To make sure that this list is accessible to all programmers, we have prepared a short explanation of terms and concepts that you may encounter searching for a parser. We are not trying to give you formal explanations, but practical ones.
Structure of a Parser
A parser is usually composed of two parts: a lexer, also known as a scanner or tokenizer, and the proper parser. Not all parsers adopt this two-step schema: Some parsers do not depend on a lexer. They are called scannerless parsers.
A lexer and a parser work in sequence: The lexer scans the input and produces the matching tokens; the parser scans the tokens and produces the parsing result.
Let's look at the following example and imagine that we are trying to parse a mathematical operation.

437 + 734

The lexer scans the text and finds '4', '3', '7' and then the space. The job of the lexer is to recognize that the first characters constitute one token of type NUM. Then the lexer finds a '+' symbol, which corresponds to a second token of type PLUS, and, finally, it finds another token of type NUM.
The parser will typically combine the tokens produced by the lexer and group them.
The definitions used by lexers or parsers are called rules or productions. A lexer rule will specify that a sequence of digits corresponds to a token of type NUM, while a parser rule will specify that a sequence of tokens of type NUM, PLUS, NUM corresponds to an expression.
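To make this concrete, here is a minimal hand-written lexer in Java for the arithmetic example above. It is a sketch of ours, not code from any particular tool; the Token and TokenType names are our own invention.

```java
import java.util.ArrayList;
import java.util.List;

public class MiniLexer {
    enum TokenType { NUM, PLUS }

    record Token(TokenType type, String text) { }

    // Scan the input left to right, grouping consecutive digits into one
    // NUM token and mapping '+' to a PLUS token; whitespace is skipped.
    static List<Token> lex(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
                tokens.add(new Token(TokenType.NUM, input.substring(start, i)));
            } else if (c == '+') {
                tokens.add(new Token(TokenType.PLUS, "+"));
                i++;
            } else {
                throw new IllegalArgumentException("Unexpected character: " + c);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : lex("437 + 734")) {
            System.out.println(t.type() + " '" + t.text() + "'");
        }
    }
}
```

Running it on "437 + 734" produces the three tokens described above: NUM, PLUS, NUM. A real lexer generated by a tool would be driven by rules rather than hand-coded conditions, but the job it performs is the same.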
Scannerless parsers are different because they process directly the original text, instead of processing a list of tokens produced by a lexer.
It is now typical to find suites that can generate both a lexer and a parser. In the past, it was instead more common to combine two different tools: one to produce the lexer and one to produce the parser. This was, for example, the case of the venerable lex & yacc couple: lex produced the lexer, while yacc produced the parser.
Parse Tree and Abstract Syntax Tree
There are two terms that are related and sometimes they are used interchangeably: parse tree and Abstract Syntax Tree (AST).
Conceptually they are very similar:
- They are both trees: There is a root representing the whole piece of code parsed. Then there are smaller subtrees representing portions of code that become smaller until single tokens appear in the tree
- The difference is the level of abstraction: The parse tree contains all the tokens that appeared in the program and possibly a set of intermediate rules. The AST instead is a polished version of the parse tree, where the information that could be derived or is not important to understand the piece of code is removed
In the AST, some information is lost. For instance, comments and grouping symbols (parentheses) are not represented. Things like comments are superfluous for a program, and grouping symbols are implicitly defined by the structure of the tree.
A parse tree is a representation of the code closer to the concrete syntax. It shows many details of the implementation of the parser. For instance, usually a rule corresponds to the type of a node. Parse trees are usually transformed into ASTs by the user, with some help from the parser generator.
A graphical representation of an AST looks like this.
Sometimes you may want to start producing a parse tree and then derive from it an AST. This can make sense because the parse tree is easier to produce for the parser (it is a direct representation of the parsing process), but the AST is simpler and easier to process via the following steps (and by the following steps, we mean all the operations that you may want to execute on the tree): code validation, interpretation, compilation, etc.
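As an illustration of what an AST can look like in code, here is a hypothetical Java sketch of AST nodes for the arithmetic example; the class names are our own. Note that there is no node for parentheses, because grouping is implicit in the tree structure.

```java
public class AstDemo {
    // Common interface for all AST nodes of our toy expression language.
    interface Expression {
        int evaluate();
    }

    // A leaf node: a literal number such as 437.
    record NumberLiteral(int value) implements Expression {
        public int evaluate() { return value; }
    }

    // An inner node: an addition with two child expressions.
    // "(437 + 734)" and "437 + 734" produce the same tree:
    // the parentheses token does not survive into the AST.
    record Addition(Expression left, Expression right) implements Expression {
        public int evaluate() { return left.evaluate() + right.evaluate(); }
    }

    public static void main(String[] args) {
        // The AST for "437 + 734": an Addition root with two NumberLiteral leaves.
        Expression ast = new Addition(new NumberLiteral(437), new NumberLiteral(734));
        System.out.println(ast.evaluate()); // prints 1171
    }
}
```

Operations such as interpretation or compilation are then simple tree walks over nodes like these, which is exactly why the AST is the preferred representation for the steps that follow parsing.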
Grammar
A grammar is a formal description of a language that can be used to recognize its structure.
In simple terms, a grammar is a list of rules that specify how each construct can be composed. For example, a rule for an if statement could specify that it must start with the "if" keyword, followed by a left parenthesis, an expression, a right parenthesis, and a statement.
A rule could reference other rules or token types. In the example of the if statement, the keyword "if", the left, and the right parenthesis were token types, while the expression and statement were references to other rules.
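Written in a BNF-like notation, the if-statement rule just described might look like this (a sketch of ours, not taken from any particular grammar):

```
<if_statement> ::= "if" "(" <expression> ")" <statement>
```

Here "if", "(" and ")" stand for token types, while <expression> and <statement> are references to other rules.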
The most used format to describe grammars is the Backus-Naur Form (BNF), which also has many variants, including the Extended Backus-Naur Form. The Extended variant has the advantage of including a simple way to denote repetitions. A typical rule in a Backus-Naur grammar looks like this:
<symbol> ::= __expression__

The <symbol> is usually nonterminal, which means that it can be replaced by the group of elements on the right, __expression__. The element __expression__ could contain other nonterminal symbols or terminal ones. Terminal symbols are simply the ones that do not appear as a <symbol> anywhere in the grammar. A typical example of a terminal symbol is a string of characters, like "class".
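To show the repetition shorthand that the Extended variant adds, a number could be defined like this (our sketch; in EBNF, curly braces mark an element that may be repeated zero or more times):

```
<number> ::= <digit> { <digit> }
<digit>  ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
```

In plain BNF, the same repetition would need an explicit recursive rule such as <number> ::= <digit> | <digit> <number>.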
Left-Recursive Rules
In the context of parsers, an important feature is support for left-recursive rules. This means that a rule could start with a reference to itself. This reference could also be indirect.
Consider, for example, arithmetic operations. An addition could be described as two expression(s) separated by the plus (+) symbol, but an expression could also contain other additions.
addition       ::= expression '+' expression
multiplication ::= expression '*' expression
// an expression could be an addition or a multiplication or a number
expression     ::= addition | multiplication | // a number

This description also matches multiple additions, like 5 + 4 + 3. That is because it can be interpreted as expression (5) ('+') expression (4 + 3). And then 4 + 3 itself can be divided into its two components.
The problem is that these kinds of rules may not be usable with some parser generators. The alternative is a long chain of expressions that takes care also of the precedence of operators.
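Such a chain might look like the following sketch (ours, in a BNF-like notation where a trailing * means zero or more repetitions). Each rule handles one precedence level and refers only to the next level down, so no rule starts with a reference to itself, and multiplication automatically binds tighter than addition:

```
expression     ::= multiplication ('+' multiplication)*
multiplication ::= primary ('*' primary)*
primary        ::= NUM
```

With this formulation, 5 + 4 * 3 is parsed as 5 + (4 * 3) without any ambiguity, at the cost of a less direct description of the language.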
Some parser generators support direct left-recursive rules, but not indirect ones.
Types of Languages and Grammars
We care mostly about two types of languages that can be parsed with a parser generator: regular languages and context-free languages. We could give you the formal definition according to the Chomsky hierarchy of languages, but it would not be that useful. Let's consider some practical aspects instead.
A regular language can be defined by a series of regular expressions, while a context-free one needs something more. A simple rule of thumb is that if a grammar of a language has recursive elements, it is not a regular language. For instance, as we said elsewhere, HTML is not a regular language. In fact, most programming languages are context-free languages.
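A quick way to see this rule of thumb in action: a Java regular expression easily recognizes a flat token like a number, but matching arbitrarily nested parentheses needs a counter or recursion, which plain regular expressions cannot express. The following is a small sketch of ours:

```java
import java.util.regex.Pattern;

public class RegularVsRecursive {
    // A regular language: sequences of digits are fully described by one regex.
    static final Pattern NUMBER = Pattern.compile("\\d+");

    // A context-free construct: checking balanced parentheses requires keeping
    // track of nesting depth, something a plain regular expression cannot do
    // for arbitrary depth.
    static boolean balanced(String s) {
        int depth = 0;
        for (char c : s.toCharArray()) {
            if (c == '(') depth++;
            else if (c == ')' && --depth < 0) return false;
        }
        return depth == 0;
    }

    public static void main(String[] args) {
        System.out.println(NUMBER.matcher("437").matches()); // true
        System.out.println(balanced("((437) + (734))"));     // true
        System.out.println(balanced("((437)"));              // false
    }
}
```

This is why a lexer (which deals with flat tokens) can be built from regular expressions, while the parser proper (which deals with nested structure) needs the power of a context-free or similar grammar.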
Usually, there are regular grammars and context-free grammars that correspond respectively to regular and context-free languages. But to complicate matters, there is a relatively new (created in 2004) kind of grammar, called Parsing Expression Grammar (PEG). These grammars are as powerful as context-free grammars, but according to their authors, they describe programming languages more naturally.
The Differences Between PEG and CFG
The most important difference between PEG and CFG is that the ordering of choices is meaningful in PEG, but not in CFG. If there are many possible valid ways to parse an input, a CFG will be ambiguous and thus wrong. Instead, with PEG, the first applicable choice will be chosen, and this automatically solves some ambiguities.
Another difference is that PEG uses scannerless parsers: They do not need a separate lexer or lexical analysis phase.
Traditionally, both PEG and some CFGs have been unable to deal with left-recursive rules, but some tools have found workarounds for this: either by modifying the basic parsing algorithm, or by having the tool automatically rewrite a left-recursive rule in a non-recursive way. Either of these approaches has downsides, by making the generated parser less intelligible or by worsening its performance. However, in practical terms, the advantages of easier and quicker development outweigh the drawbacks.
Stay Tuned
That's all for Part 1, but stay close. Coming up, we'll delve into parser generators, their workflows, the various types, and some examples of them in action.
Topics:
java, parsers, grammar, combinators, tutorial
Source: https://dzone.com/articles/parsing-in-java-part-1-structures-trees-and-rules