The World Wide Web Consortium's Extensible Markup Language (XML) 1.0 recommendation defines a data object as an XML document if it is well-formed. Additionally, a well-formed XML document may be valid if it meets certain further constraints. Well-formed XML documents conform to the following criteria:
The recommendation defines an XML document as valid if it has an associated document type (DOCTYPE) declaration and if the document complies with the constraints expressed in the declaration. The XML DOCTYPE declaration contains or points to markup declarations that provide the document grammar, essentially defining document content and structure. The declared document grammar is termed the Document Type Definition (DTD).
XML documents are validated using schemas or the DTD. To be validly constructed, the name in the document type declaration must match the element type of the root element. The document type declaration must precede the first element in the document, and only one DTD can be used with an XML document. DTDs are used to:
DTDs are divided into two subsets: internal definitions in the same document and external definitions located in a separate file. When both internal and external subsets are used, the internal subset is considered to occur before the external subset, which gives precedence to the entity and attribute-list declarations in the internal subset.
The internal subset DOCTYPE syntax is as follows.
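As a minimal sketch of the internal subset form, the markup declarations sit between square brackets inside the DOCTYPE declaration. The names `note` and `body` are hypothetical and chosen only for illustration.

```xml
<?xml version="1.0"?>
<!-- Hypothetical example: "note" and "body" are illustrative names. -->
<!-- General shape: DOCTYPE, root element name, then declarations in [ ]. -->
<!DOCTYPE note [
  <!ELEMENT note (body)>
  <!ELEMENT body (#PCDATA)>
]>
<note>
  <body>Hello</body>
</note>
```

Note that the name in the DOCTYPE declaration (`note`) matches the root element.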
External DTDs can be shared by many documents (and authors), which is a useful advantage. In return, they require all authors to use the same document grammar. Two forms of external subset DOCTYPE declarations exist: the SYSTEM location and the PUBLIC location. The SYSTEM location form is preferred. The PUBLIC location form is used when required by your application. The syntax for these two external subsets follows.
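The two external forms might look like the sketch below. The file name `note.dtd`, the public identifier, and the URL are assumptions made up for this example; a real document would carry only one of the two declarations.

```xml
<!-- SYSTEM form: the system identifier is a URI reference to the DTD file. -->
<!DOCTYPE note SYSTEM "note.dtd">

<!-- PUBLIC form: a public identifier followed by a system identifier. -->
<!DOCTYPE note PUBLIC "-//Example//DTD Note 1.0//EN" "http://www.example.com/dtd/note.dtd">
```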
Valid documents declare every element in the DTD. Element names are case sensitive and cannot contain blank spaces or XML reserved symbols. The element type declarations specify the element name and the type of content. The element structure, for validation purposes, may be constrained using element type and attribute-list declarations. An element type declaration constrains the element's content, often constraining which element types can appear as children of the element. The element order in the document can also be specified.
The element type declaration states both the name of the element and the type of content, the content-model. An element type declaration takes the following form and syntax.
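In outline, an element type declaration names the element and then gives its content model. The element name `title` below is a hypothetical placeholder.

```xml
<!-- General form: <!ELEMENT element-name content-model> -->
<!-- "title" is an illustrative name; its content model is character data. -->
<!ELEMENT title (#PCDATA)>
```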
Elements generally contain text or other elements. Element types have 'element content' when elements of that type must contain only child elements (no character data). In this case, the validity constraint includes a content model, a simple grammar governing the allowed types of the child elements and the order in which they are allowed to appear. The grammar is built on content particles, which consist of names, choice lists of content particles, or sequence lists of content particles:
DTDs define five element content types. The five element content types and their syntax follow.
ANY allows the declared elements to store any type of content. This content model does not enforce any validation rules.
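A one-line sketch of the ANY content model, using the hypothetical element name `container`:

```xml
<!-- "container" is an illustrative name; it may hold text and any declared elements. -->
<!ELEMENT container ANY>
```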
EMPTY is used for elements without content. Adding content to an empty element results in an invalid document.
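A sketch of the EMPTY content model, with a hypothetical element named `separator`. The declaration and the instance are shown together only for comparison; in practice the declaration lives in the DTD and the tag in the document body.

```xml
<!-- Declaration (in the DTD): "separator" is an illustrative name. -->
<!ELEMENT separator EMPTY>

<!-- Usage (in the document): an empty-element tag with no content. -->
<separator/>
```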
Character data elements can contain only a well-formed text string, that is, any text except symbols reserved by XML. This content model does not support child elements. The keyword #PCDATA is used to represent parsed character data (any well-formed text string).
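A character data declaration could be sketched as follows. The element name `title` and its text are hypothetical.

```xml
<!-- Declaration (in the DTD): text only, no child elements. -->
<!ELEMENT title (#PCDATA)>

<!-- Usage (in the document). -->
<title>XML Basics</title>
```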
An element type declared with element content can contain only child elements. The list of child elements is specified in the declaration, and each child element must be listed as many times as it occurs in the document.
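As an illustration, with the hypothetical names `book`, `title`, and `author`, listing `author` twice requires exactly two `author` children after the `title`:

```xml
<!-- Declarations (in the DTD). -->
<!ELEMENT book (title, author, author)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>

<!-- Usage (in the document): one title followed by two authors. -->
<book>
  <title>XML Basics</title>
  <author>First Author</author>
  <author>Second Author</author>
</book>
```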
Either a sequence or a choice of child elements can be specified. A sequence of elements follows a defined order. The choice element declaration presents a set of possible child elements. A content particle in a choice list may appear in the element content at the location where the choice list appears in the grammar. To be valid, a content particle occurring in a sequence list must appear in the element content in the order given in the list.
The syntax for a sequence uses comma separators, while the choice specification uses the vertical line character. The sequence and choice element content models can also be used together. The respective syntax of these three possible usages follows.
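The three usages could be sketched like this. All element names are hypothetical.

```xml
<!-- Sequence: "first" must be followed by "last". -->
<!ELEMENT name (first, last)>

<!-- Choice: either "phone" or "email", but not both. -->
<!ELEMENT contact (phone | email)>

<!-- Combined: a "name", then either a "phone" or an "email". -->
<!ELEMENT entry (name, (phone | email))>
```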
Modifying symbols can be applied to the content model to indicate the number of element occurrences. An optional character following a name or list governs whether the element or the content particles in the list may occur one or more, zero or more, or zero or one times. The absence of such an operator means that the element or content particle must appear exactly once. The plus sign (+) allows one or more occurrences. The asterisk (*) allows zero or more occurrences. The question mark (?) allows zero or one occurrence.
The following syntax illustrates the modifying symbols, first with a defined-order declaration that applies all three modifying symbols to individual child elements, and second with the one-or-more modifier (+) applied to an entire element declaration list by placing it after the closing parenthesis.
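A sketch of both cases, using hypothetical element names:

```xml
<!-- Per-child modifiers: exactly one title, an optional subtitle,
     one or more authors, and zero or more reviews, in that order. -->
<!ELEMENT book (title, subtitle?, author+, review*)>

<!-- Modifier on the whole list: the (first, last) pair repeats one or more times. -->
<!ELEMENT authors (first, last)+>
```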
Multiple combinations of each child element are valid when the modifiers are applied to the choice model, and the number of specified occurrences can be met with any choice and combination.
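For instance, assuming hypothetical `phone` and `email` children, a modifier on a choice list accepts any mix of the alternatives:

```xml
<!-- One or more children, each of which is either a phone or an email.
     Two phones, three emails, or an email-phone-email run are all valid. -->
<!ELEMENT contacts (phone | email)+>
```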
The Mixed content element type may contain character data, optionally interspersed with child elements. The types of the child elements may be constrained, but order and the number of occurrences of each element cannot be constrained with this element type. The same name must not appear more than once in a single mixed-content declaration.
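A mixed-content declaration might be sketched as below. The names `para`, `em`, and `strong` are hypothetical. In this form #PCDATA comes first, the alternatives are separated by vertical bars, and the group ends with an asterisk.

```xml
<!-- Declarations (in the DTD). -->
<!ELEMENT para (#PCDATA | em | strong)*>
<!ELEMENT em (#PCDATA)>
<!ELEMENT strong (#PCDATA)>

<!-- Usage (in the document): text interspersed with child elements. -->
<para>Some <em>emphasized</em> and <strong>strong</strong> text.</para>
```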
XML documents are logically structured. Each XML document contains one or more elements delimited by start-tags and end-tags or, for empty elements, by an empty-element tag. Each element has a type, identified by name, and may have a set of attribute specifications. The attribute-list declaration accomplishes the following:
Attribute-list declarations specify, using the following syntax, the name, data type, and default value (if any) of each attribute associated with a given element type.
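In outline, and with the hypothetical names `book` and `isbn`, an attribute-list declaration looks like this:

```xml
<!-- General form:
     <!ATTLIST element-name attribute-name attribute-type default-declaration> -->
<!ATTLIST book isbn CDATA #REQUIRED>
```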
When more than one attribute-list declaration is stated for a given element type, the contents of all declarations are merged. When more than one definition is provided for the same attribute of a given element type, the first declaration is binding and later declarations are ignored.
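The merging rule can be illustrated with two declarations for a hypothetical `book` element:

```xml
<!-- First declaration: isbn is required. -->
<!ATTLIST book isbn CDATA #REQUIRED>

<!-- Second declaration: lang is merged in; the repeated isbn definition is ignored. -->
<!ATTLIST book lang CDATA "en"
               isbn CDATA #IMPLIED>
```

The net effect is a `book` element with a required `isbn` attribute and a `lang` attribute that defaults to "en".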
There are three kinds of XML attribute types: a string type, a set of tokenized types, and enumerated types. The three attribute type categories provide varying degrees of control over attribute content.
String types may take any literal string as a value; they can contain blank spaces and any characters except reserved XML symbols. They are the simplest form of attribute values. String types are declared using the following syntax.
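A string-type attribute is declared with the CDATA keyword. The names `image` and `alt` below are hypothetical.

```xml
<!-- Declaration (in the DTD): an optional string attribute. -->
<!ATTLIST image alt CDATA #IMPLIED>
```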
Tokenized types are text strings that follow format and content rules; they have varying lexical and semantic constraints. The syntax follows.
There are seven tokenized types: ID, IDREF, IDREFS, ENTITY, ENTITIES, NMTOKEN, and NMTOKENS.
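A few tokenized types in use, sketched with hypothetical element and attribute names:

```xml
<!-- ID: the value must be unique within the document. -->
<!ATTLIST chapter id ID #REQUIRED>

<!-- IDREF: the value must match the ID of some element in the document. -->
<!ATTLIST xref target IDREF #REQUIRED>

<!-- NMTOKEN: the value must be a single name token (no blank spaces). -->
<!ATTLIST field name NMTOKEN #IMPLIED>
```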
Enumerated attribute types can take only one of a list of values given in the declaration. There are two kinds of enumerated attribute types, notation type and enumeration type. The enumeration type syntax follows.
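An enumeration could be sketched as follows. The names `task` and `status` and the listed values are hypothetical.

```xml
<!-- The status attribute must be one of the three listed tokens; "open" is the default. -->
<!ATTLIST task status (open | closed | pending) "open">
```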
Notation declarations provide a name for the notation (for use in entity and attribute-list declarations and in attribute specifications) and an external identifier for the notation which may allow an XML processor or its client application to locate a helper application capable of processing data in the given notation. The notation enumerated attribute type associates the value of the attribute with a notation <!NOTATION> declaration located in the DTD. Only one notation declaration can declare a given name. The syntax follows.
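A sketch of notation declarations and a notation-type attribute follows. The notation names, the system identifiers, and the `image` element are assumptions for illustration only.

```xml
<!-- Notation declarations: each name is declared once. -->
<!NOTATION gif SYSTEM "image/gif">
<!NOTATION jpeg SYSTEM "image/jpeg">

<!-- The format attribute must name one of the declared notations.
     A notation attribute cannot be declared on an EMPTY element. -->
<!ELEMENT image (#PCDATA)>
<!ATTLIST image format NOTATION (gif | jpeg) #REQUIRED>
```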
XML processors must provide applications with the name and external identifier(s) of any notation declared and referred to in an attribute value, attribute definition, or entity declaration. They may additionally resolve the external identifier into the system identifier, file name, or other information needed to allow the application to call a processor for data in the notation described.
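The final part of an attribute declaration is the attribute default. There are four possible defaults: #REQUIRED (the attribute must always be specified), #IMPLIED (the attribute is optional and no default value is provided), #FIXED followed by a value (the attribute, if specified, must have that value), and a literal default value (used when the attribute is omitted).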
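The four defaults might be declared as in this sketch, where the `item` element and its attribute names are hypothetical:

```xml
<!-- #REQUIRED: every item must specify id. -->
<!ATTLIST item id      ID    #REQUIRED>

<!-- #IMPLIED: note is optional and has no default value. -->
<!ATTLIST item note    CDATA #IMPLIED>

<!-- #FIXED: version, if specified, must be "1.0". -->
<!ATTLIST item version CDATA #FIXED "1.0">

<!-- Literal default: unit is "each" when omitted. -->
<!ATTLIST item unit    CDATA "each">
```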
© 2004 by James Q. Jacobs. All rights reserved.