Program 2: Parsing XML

Author: Dr. Paul Plassmann

Assigned: 9/29
Due: 10/9 by 11:55 pm (Wyatt section)
Starter Code: P2.zip

Introduction

XML, as all elementary school students know, is the initialism for Extensible Markup Language. The basic idea behind XML is to have a portable way to represent a data structure as a text file that a computer, and perhaps even a human, would be able to read. There is a specific format that an XML file must have to be a valid XML file—recognizing a “valid” XML file is the goal of this project. Given a valid XML file, there is a second step where the computer needs to recognize that the XML file is a valid instance of the data structure expected by an application (for example, an Excel file for Microsoft’s Excel program). The definition of this data structure is made by an XML Schema. In this project we will not be determining if a valid XML file belongs to an XML Schema, but knowing that schema exist might be a good thing to know for job interviews!

Background on Formal Grammars and Languages

You will definitely have a new line of fascinating conversation topics for first dates if you know the basics of formal grammars. The idea behind formal grammars is quite simple, a grammar G is defined by a set of symbols (think of ASCII or Unicode characters) and syntax rules. The grammar generates a language L(G) such that any string of the symbols in the grammar is either in the language (is valid) or not in the language (is not valid). A particularly relevant example (for this class) are those long strings of characters that we edit in .cpp and .hpp files that comprise either valid or invalid C++ programs. That is, there is grammar G for the C++ language, and the C++ compiler checks to see if your program (the long string of characters specified by the grammar included in your .cpp and .hpp files) satisfies the C++ grammar’s syntax rules. If the compiler decides that your code is in L(G) then it returns to you compiled object code; however, if it is not in L(G), then you get a syntax error, and you have to go re-edit your code and fix this error.

Let’s do a more specific example. For Project 1, there was a definition for what comprised a valid “word.” Let’s change this definition a little bit to only include lower case letters. Then, if you recall, a valid word would consist only of characters from ‘a’—‘z’ (that is, no numbers, capital letters, special characters, punctuation, etc.; thus, “don’t” and “whatz up”, “Word”, and “ace1hardware” are not allowable words). This definition can be expressed as a formal grammar. To begin, we need a set of symbols, usually given the name Σ. In this case, these are the characters that we could put in a C++ string. To have a grammar, we also need to express syntax. In a formal grammar, this is done by giving a set of production rules. For our “word” language these could be written as:


$$ \begin{align} S &\rightarrow AS \\ S &\rightarrow A \\ A &\rightarrow a\left| b \right|c|\ldots\left| y \right|\text{z} \end{align} $$

It takes a bit of work to understand what this expression means. The capital letters are called nonterminal symbols. A nonterminal symbol is simply a string of characters from the alphabet Σ that will be broken apart by the production rules into substrings and individual characters. The arrow symbol means that any nonterminal symbol on the left must be of the form on the right. When two symbols on the right are next to each other, that means that the corresponding strings are concatenated. The small letters are known as terminal symbols. These must be characters from the alphabet. In the production rules, the terminal symbols are not written with the single quotes around them like we do in a C++ program. That is, in the production rules the character ‘a’ is written as a. The vertical line | represents an “or” or productions rules; thus, A → a | b combines the rules A → a and A → b. There is a special nonterminal symbol S that is known as the start symbol.

If G is the above grammar (defined by its alphabet and its production rules) then L(G) is the set of strings that can be expressed by this grammar using the start symbol. For example, the string “wow” is in L(G) because of the derivation (the production rule used is written under the intermediate results):


$$ \begin{align*} S &\rightarrow AS &\rightarrow wS &\rightarrow wAS &\rightarrow woS &\rightarrow woA &\rightarrow wow\\ & \rightarrow (1) & \rightarrow (3) & \rightarrow (1) & \rightarrow (3) & \rightarrow (2) & \rightarrow (3) \end{align*} $$

Looking at this derivation, one can (perhaps) see how to write a C++ function to determine whether a candidate string is in L(G).

bool isWord(string str) {

  for (int i = 0; i < str.size(); i++) {
    if (!islower(str[i])) {
      return false;
    }
  }

  return true;
}

Balanced Parenthesis Grammar

Let us consider a different grammar that will be important for recognizing XML files. For a shorthand, we call this grammar BPG—the initials of Balanced Parenthesis Grammar. We will keep the alphabet the same, although we are only using the characters ‘(‘, ‘)’, ‘{‘, ‘}’, ‘[‘, and ‘]’. As before, we do not use the single quotes in the production rules as it would drive us crazier than we already are. We are going to introduce a new production rule:  S → ε. This special rule means “replace the non-terminal symbol on the left with nothing.” The rules for our BPG are the following.


$$ \begin{align} S &\rightarrow \left( S \right)\\ S &\rightarrow \{ S\}\\ S &\rightarrow \lbrack S\rbrack\\ S &\rightarrow SS\\ S &\rightarrow \varepsilon\\ \end{align} $$

The derivation below shows how the string “([]){}” can be generated by this grammar (and, hence, is in the BPG language).


$$ \begin{align*} S &\rightarrow SS &\rightarrow (S)S &\rightarrow (S)\{ S\} &\rightarrow (\left\lbrack S \right\rbrack)\{ S\} &\rightarrow (\left\lbrack S \right\rbrack)\{\} &\rightarrow (\lbrack\rbrack)\{\} \\ & \rightarrow (7) & \rightarrow (4) & \rightarrow (5) & \rightarrow (6) & \rightarrow (8) & \rightarrow (8) \end{align*} $$

The first thing that self-respecting Computer Engineer would want to do is to write a program to see if any string belongs to the BPG language. We will employ a stack data structure to do this (I am sure that you will spend lots of time in class going over how these work !). The program uses the helper function classifyChar() to determine whether a character from a string is one of three self-evident values of type BracketType : LEFT_BRACKET, RIGHT_BRACKET, or NOT_A_BRACKET. These constants would be defined in a typedef enum statement. The program also uses the helper function isSameBracket() to determine whether the left and right brackets are the same kind of bracket. Here is a program to determine if a string belongs to BPG.

bool isInBPG(string str) {

  stack<char> bracketStack;

  for (int i = 0; i < str.size(); i++) {
    BracketType bracketValue = classifyChar(str[i]);

    switch (bracketValue) {
    case LEFT_BRACKET:
      bracketStack.push(str[i]);
      break;
    case RIGHT_BRACKET:
      if (bracketStack.empty()) {
        return false;
      }
      
      char leftBracket = bracketStack.peek();
      
      bool success = isSameBracket(leftBracket, str[i]);

      if (!success) {
        return false;
      }

      bracketStack.pop();
      break;

    case NOT_A_BRACKET:

      return false;
    }
  }

  if (!bracketStack.empty())
    return false;

  return true;
}

Let’s go over a quick summary of what the program is doing when it rejects strings. The program rejects string containing brackets for a couple of reasons. First, the brackets might be of the right type, but just not be paired correctly. For example, the strings “(()”, or “())”, or “())(“. Second, the brackets might be paired, but the pairs of left and right brackets might not be of the correct type. For example, “(}” or “({[}])”. Be sure that you understand how the program rejects strings for each of these cases.

XML – The Extensible Markup Language

We need to define a valid XML file. A valid XML file is a text file (Unicode) that follows several simple rules based on the following definitions.

Markup and Content: All text in an XML document is either markup or content. Markup is any text enclosed by the angle brackets ‘<’ and ‘>’ (the markup does not include the brackets). Content is anything that is not markup. The angle brackets that define markup may not be nested, and must be in the proper order. (Note: in real XML one can also define markup as being enclosed by the characters ‘&’ and ‘;’—we are not considering this definition of markup for this project.)

Tag: A tag is a markup section (i.e., enclosed by the characters ‘<’ and ‘>’). Tags can be of three types (or a declaration) which are defined as follows:

  1. Start-tag: a start-tag is a markup with the of the form <element>.

  2. End-tag: an end-tag is markup with of the form </element>.

  3. Empty-tag: an empty-tag is <element/>.

  4. Declaration: not really a tag, but a special XML item that begins with the two characters “<?” and ends with the two characters “?>”. This is usually at the beginning of an XML document—an example would be <?xml version="1.0" encoding="UTF-8"?>. Your code will need to recognize declarations.

Tag Name: The tag name is valid if it satisfies the following rules:

  1. For start-tags and empty-tags, the tag name is the text immediately after the ‘<’ up to the first white space or the ending angle bracket ‘>’ or “/>”. For example, given the start-tag <Edgar src="edgar.jpg"> the tag name is “Edgar”.

  2. For end-tags, the tag name is the text between the delimiters. For example, “Edgar” for the end-tag </Edgar>.

  3. The name cannot contain any of the following characters (I have not put quotes around these characters): !"#$%&'()*+,/;<=>?@[\]^`{|}~. A tag name cannot contain a space character (no white space), and cannot begin with "-", ".", or a numeric digit. All other Unicode is valid.

  4. For start-tags and empty-tags there may be additional text following the tag name. This additional text are known as attributes. As stated in (1), the tag name is the text following the initial ‘<’ character up to the first white space, or the ending delimiter if there are no attributes. For example, the tag name is “Edgar” for the following start-tags and empty-tags: <Edgar>, <Edgar src="edgar.jpg">, <Edgar/>, <Edgar src="edgar.jpg” version=”1.0”/>. The attributes for the last string are src="edgar.jpg" and version=”1.0”. Note that there is a specific format for the attributes; however, we are not going to verify this for this project. The only thing that matters for this project is to extract the tag name, and make sure that it is valid per the above rules.

  5. Tag names are case-sensitive, and the start-tag and end-tag pairs must match exactly.

  6. Start-tags and end-tags define elements that must be nested (see the definition of an element and discussion on nested tag names).

Nested Tag Names: As rule (6) for the tag names states, the tag names must be correctly nested. To be nested, a start-tag and the corresponding end-tag must satisfy the BPG defined above (that is, instead of a particular set of parenthesis, replace the left parenthesis with the start-tag and replace the right parenthesis with the corresponding end-tag.) Empty-tags and declarations are not involved in this matching grammar.

Element: An element is a section of an XML document contained between matching start-tag and end-tag, or an empty tag. Any text between the element’s start-tag and end-tag is its content, which may contain both content and other elements. Elements contained within a tag in this way are called child elements. The element name is the tag name for the matching start-tag and end-tag pair that define the element, or the name of the empty-tag. An element is valid if it satisfies the above tag name rules and is properly nested with itself and other child elements.

XML Document: An XML document must begin and end with a tag (white space is permitted before the initial tag or after the final tag). The entire document must be contained in one element (a root element), although declarations are permitted before the start-tag for this root element. A document that satisfies these tag syntax rules and, for which all elements are valid, is said to be a valid XML document.

The Project

For this project we will be implementing two classes: an XML Parser, and a Stack ADT. You will be completing the implementation of both these classes, and developing tests to ensure that your implementations of these two classes is correct. The specifications of both classes are given in the header files provided with the Starter Code. You are also given a simple Bag class and Node class implementation (from the text) and a starter test file. Specific requirements for this project are given below:

  1. The Stack ADT header and implementation files are set up for the linked-based implementation of the Stack presented in the text. You must complete the link-based implementation for the Stack.

  2. The Stack linked-based implementation uses the Node data type given in the text. The Node definition and implementation is identical to that in the text and is included with the Starter Code. (The Node data type is also used in the Bag implementation given to you.)

  3. The XML specification we use is described above and the start-tag and end-tag names must satisfy the BPG. The Wikipedia page on XML also contains a good discussion, but follow the specific XML specification given above.

  4. The XMLParser class takes a string as input (for example, if you wanted to parse a file, you would read the file in as a long string, and then pass this to the instance of the XMLParser). See the mainXMLParser driver program for an example. To determine determine if the string is valid XML, there are two steps:

    1. First, we run the string through a "scanner" or "lexer" that breaks this long string into "tokens". The method that performs this process is cleverly called tokenizeInputString(). This method returns true if successful, false otherwise.

    2. Second, we try to "parse" a successfully tokenized input string (stored internally in the XMLParse class). If this tokenized string satisfies the XML grammar, the parseTokenizedInput() method returns true.

  5. Use the defined constants in the XMLParser.hpp file, as enums with type StringTokenType, for the five types of XML tokens. The five types are: START_TAG, END_TAG, EMPTY_TAG, CONTENT, and DECLARATION.

  6. The comments in the declaration files are in Doxygen format. You can automatically generate web page documentation from these header files using the Doxygen program. You can also read the documentation as html and it is contained in the /html directory in the folder that you get when you unzip the Starter Code.

You will be completing both the implementations of our ADT Stack class and XMLParser class, and you will have to implement unit tests that cover your implementations. And, as before, you will have to check your implementation against the (unknown) instructor tests on the grader site. 

Submission

Once you are satisfied with your code based on your test results and addressing the project requirements, upload a zip file containing (only) Stack.cpp, XMLParser.cpp, and XMLParser_test.cpp through Canvas at the assignment link. You should not submit the other files from the starter code, nor your build directory. There is a build target called "submission" configured by default to create this file with the correct contents in your build directory.

Grading

There are 100 points allocated to this assignment.