Introduction
The World Wide Web Consortium's Extensible Markup Language (XML) 1.0
recommendation defines a data object as an XML document if it is well-formed.
Additionally, a well-formed XML document may be valid if it meets certain
further constraints. Well-formed XML documents conform to the following criteria:
-
the document contains one or more elements,
-
there is exactly one root element, no part of which appears in the content of any other element,
-
each of the parsed entities referenced directly or indirectly within
the document is well-formed, and
-
the elements' start-tags and end-tags nest properly within each other.
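As an illustration, the following minimal document satisfies the criteria above. The element names (memo, to, body, time) are invented for this sketch and carry no special meaning.

```xml
<?xml version="1.0"?>
<memo>
  <to>Staff</to>
  <body>The meeting moves to <time>10:00</time>.</body>
</memo>
<!-- Not well-formed, by contrast: <b><i>overlap</b></i>
     (the start-tags and end-tags do not nest properly) -->
```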
The recommendation defines an XML document as valid if it has an associated
document type (DOCTYPE) declaration and if the
document complies with the constraints expressed in the declaration.
The XML DOCTYPE declaration contains or points
to markup declarations that provide the document grammar, essentially
defining document content and structure. The declared document
grammar is termed the Document Type Definition (DTD).
XML documents are validated using schemas or the DTD. To be validly
constructed, the name in the document type declaration must match the
element type of the root element. The document type declaration
must precede the first element in the document, and only one DTD can be
used with an XML document. DTDs are used to:
-
ensure that required elements are present,
-
prevent using undefined elements,
-
enforce data structure,
-
specify element attributes and define default values,
-
describe how the parser should access non-XML or non-textual content.
Declaring a Document Type Definition
A DTD can be divided into two subsets: an internal subset declared in the
document itself and an external subset located in a separate file. When both
are used, the internal subset is treated as occurring before the external
subset, so entity and attribute-list declarations in the internal subset
take precedence.
The internal subset DOCTYPE syntax is as follows.
<!DOCTYPE root
[
declarations
]>
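A complete document using an internal subset might look like the following sketch. The memo vocabulary is hypothetical; the point is only that the name after DOCTYPE matches the root element and that the declarations sit inside the square brackets.

```xml
<?xml version="1.0"?>
<!DOCTYPE memo [
  <!ELEMENT memo (to, body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
<memo>
  <to>Staff</to>
  <body>The meeting moves to 10:00.</body>
</memo>
```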
Because an external DTD can be shared by many documents (and authors), a
single grammar can be maintained in one place, a useful advantage. In return,
all authors must use the same document
grammar. Two forms of external subset DOCTYPE
declarations exist, the SYSTEM location and
the PUBLIC location. The SYSTEM
location form is preferred. The PUBLIC location
form is used when required by your application. The syntax for the two
forms follows; in both, the bracketed internal subset is optional.
<!DOCTYPE root SYSTEM "URL"
[
declarations
]>
<!DOCTYPE root PUBLIC "identifier" "URL"
[
declarations
]>
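The sketch below shows the SYSTEM form combined with a small internal subset. The file name memo.dtd and the dept entity are assumptions made for illustration; memo.dtd is presumed to hold the element declarations for memo, to, and body.

```xml
<?xml version="1.0"?>
<!DOCTYPE memo SYSTEM "memo.dtd" [
  <!-- Internal-subset declaration: it takes precedence over any
       declaration of the same entity in memo.dtd -->
  <!ENTITY dept "Engineering">
]>
<memo>
  <to>&dept;</to>
  <body>The meeting moves to 10:00.</body>
</memo>
```

The PUBLIC form places a public identifier ahead of the system identifier. XHTML 1.0 Strict documents, for instance, use the public identifier "-//W3C//DTD XHTML 1.0 Strict//EN" followed by the URL http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd.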
Declaring Document Elements
Valid documents declare every element in the DTD. Element names
are case sensitive and cannot contain blank spaces or XML reserved symbols.
The element type declarations specify the element name and the type of
content. The element structure, for validation purposes, may be
constrained using element type and attribute-list declarations.
An element type declaration constrains the element's content, often constraining
which element types can appear as children of the element. The element
order in the document can also be specified.
The element type declaration states both the name of the element and
the type of content, the content-model. An element type declaration
takes the following form and syntax.
Element Name Type Declaration
<!ELEMENT element content-model>
Elements generally contain text or other elements. Element types
have 'element content' when elements of that type must contain only child
elements (no character data). In this case, the validity constraint
includes a content model, a simple grammar governing the allowed types
of the child elements and the order in which they are allowed to appear.
The grammar is built on content particles, which consist of names, choice
lists of content particles, or sequence lists of content particles.
DTDs define five element content types. The five element content
types and their syntax follow.
ANY allows the declared elements to store any type of content.
This content model does not enforce any validation rules.
EMPTY is used for elements without content. Adding content
to an empty element results in an invalid document.
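Neither ANY nor EMPTY takes a parenthesized content model; the keyword stands alone after the element name. A short sketch, with hypothetical element names page and rule:

```xml
<?xml version="1.0"?>
<!DOCTYPE page [
  <!-- page may hold text and any declared element, in any order -->
  <!ELEMENT page ANY>
  <!-- rule must have no content at all -->
  <!ELEMENT rule EMPTY>
]>
<page>Some text <rule/> and more text.</page>
```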
Character data elements can contain only a well-formed text string, that is,
any text except symbols reserved by XML. This content model does not
support child elements. The code #PCDATA
is employed to represent parsed-character data (any well-formed text string).
<!ELEMENT element (#PCDATA)>
Elements can contain only child elements. The list of child elements
is specified in the declaration. Unless an occurrence modifier is used, each
child element must be listed as many times as it appears in the document.
<!ELEMENT element (child_elements)>
Either a sequence or a choice of child elements can be
specified. A sequence of elements follows a defined order.
The choice element declaration presents a set of possible child elements.
A content particle in a choice list may appear in the element content
at the location where the choice list appears in the grammar.
To be valid, a content particle occurring in a sequence list must appear
in the element content in the order given in the list.
The syntax for a sequence uses comma separators, while the choice specification
uses the vertical line character. The sequence and choice element
content models can also be used together. The respective syntax
of these three possible usages follows.
<!ELEMENT element (child1, child2, ...)>
<!ELEMENT element (child1 | child2 | ...)>
<!ELEMENT element ((child1 | child2 | ...), child3, child4, ...)>
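To make the three forms concrete, here is a sketch that combines a sequence with a nested choice. The order vocabulary is invented for the example.

```xml
<?xml version="1.0"?>
<!DOCTYPE order [
  <!-- Sequence: customer first, then item, then exactly one of
       card or invoice (the nested choice) -->
  <!ELEMENT order (customer, item, (card | invoice))>
  <!ELEMENT customer (#PCDATA)>
  <!ELEMENT item (#PCDATA)>
  <!ELEMENT card (#PCDATA)>
  <!ELEMENT invoice (#PCDATA)>
]>
<order>
  <customer>A. Reader</customer>
  <item>Notebook</item>
  <invoice>INV-1</invoice>
</order>
```

Swapping the customer and item elements, or supplying both card and invoice, would make the document invalid under this content model.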
Modifying symbols can be applied to the content model to indicate
the number of element occurrences. An optional character
following a name or list governs whether the element
or the content particles in the list may occur one or more,
zero or more, or zero or one times. The absence of such
an operator means that the element or content particle must appear
exactly once. The plus sign (+) allows one or more occurrences.
The asterisk (*) allows zero or more occurrences. The question
mark (?) allows zero or one occurrence.
The following syntax illustrates the modifying symbols, first
with a defined-order declaration applying all three modifying
symbols to individual child elements, and second by applying the
one-or-more modifier (+) to an entire element
declaration list by placing it after the closing parenthesis.
<!ELEMENT element (item1?, item2+, item3*)>
<!ELEMENT element (child_elements)+>
When a modifier is applied to a choice list, any combination of the listed
child elements is valid, in any order, as long as the total number of
occurrences satisfies the modifier.
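The following sketch applies each modifier, including a modifier on a choice list. The library vocabulary is hypothetical.

```xml
<?xml version="1.0"?>
<!DOCTYPE library [
  <!-- One or more book elements -->
  <!ELEMENT library (book+)>
  <!-- Exactly one title, an optional subtitle, at least one author,
       then any mix of keyword and category elements (possibly none) -->
  <!ELEMENT book (title, subtitle?, author+, (keyword | category)*)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT subtitle (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT keyword (#PCDATA)>
  <!ELEMENT category (#PCDATA)>
]>
<library>
  <book>
    <title>XML Basics</title>
    <author>First Author</author>
    <author>Second Author</author>
    <category>Markup</category>
    <keyword>DTD</keyword>
    <category>Reference</category>
  </book>
</library>
```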
The Mixed content element type may contain character data, optionally
interspersed with child elements. The types of the child elements
may be constrained, but order and the number of occurrences of each element
cannot be constrained with this element type. The same name must
not appear more than once in a single mixed-content declaration.
<!ELEMENT element (#PCDATA | child1 | ...)*>
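A mixed-content declaration lists #PCDATA first and, when child elements are named, ends with an asterisk after the closing parenthesis. A sketch with the hypothetical elements para, em, and code:

```xml
<?xml version="1.0"?>
<!DOCTYPE para [
  <!ELEMENT para (#PCDATA | em | code)*>
  <!ELEMENT em (#PCDATA)>
  <!ELEMENT code (#PCDATA)>
]>
<para>List <code>#PCDATA</code> first, then <em>any</em> permitted child, in <em>any</em> order.</para>
```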
Attribute-List Declarations
XML documents are logically structured. Each XML document contains
one or more elements delimited by start-tags and end-tags or, for empty
elements, by an empty-element tag. Each element has a type, identified
by name, and may have a set of attribute specifications. The attribute-list
declaration accomplishes the following:
-
lists the names of all of the attributes associated with a specific
element,
-
specifies the data type of the attribute,
-
indicates whether the attribute is required or optional,
-
provides a default value for the attribute, if necessary.
Attribute-list declarations specify, using the following syntax, the
name, data type, and default value (if any) of each attribute associated
with a given element type.
<!ATTLIST element attribute_name type default>
When more than one attribute-list declaration is stated for a given element
type, the contents of all declarations are merged. When more than
one definition is provided for the same attribute of a given element type,
the first declaration is binding and later declarations are ignored.
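The merging and first-declaration rules can be seen in a small sketch. The image element and its attributes are hypothetical.

```xml
<?xml version="1.0"?>
<!DOCTYPE image [
  <!ELEMENT image EMPTY>
  <!ATTLIST image src CDATA #REQUIRED>
  <!-- Merged with the declaration above: image now has src, alt, and align -->
  <!ATTLIST image alt   CDATA #IMPLIED
                  align CDATA "left">
  <!-- Ignored: align is already declared, so its default stays "left" -->
  <!ATTLIST image align CDATA "right">
]>
<image src="logo.png"/>
```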
There are three kinds of XML attribute types: a string type,
a set of tokenized types, and enumerated types. The
three attribute type categories provide varying degrees of control over
attribute content.
String-type attributes may take any literal string as a value; they can contain
blank spaces and any characters except reserved XML symbols. They
are the simplest form of attribute values. String types are declared
using the following syntax.
<!ATTLIST element attribute_name CDATA default>
Tokenized types are text strings that follow format and content
rules; they have varying lexical and semantic constraints. The syntax
follows.
<!ATTLIST element attribute_name token default>
There are seven tokenized types:
ID. ID values uniquely identify the elements which bear
them. As a value of this type, a name can only appear once
in an XML document. No element type may have more than one ID
attribute specified. An ID attribute must have a declared default
of #IMPLIED or #REQUIRED.
<!ATTLIST element attribute ID default>
IDREF. IDREF attribute values are linked to declared ID
attributes in the document using the IDREF token. An attribute
declared as an IDREF type must have the same value as an ID attribute
of some element in the same document.
<!ATTLIST element attribute IDREF default>
IDREFS. This type creates a reference to multiple IDs, written as a
list of ID values separated by blank spaces.
ENTITY references an external file. Each entity name
must match the name of an unparsed entity declared in the DTD.
ENTITIES is a list of entity names separated by blank spaces.
NMTOKEN is a name token: a text string made up only of XML name
characters, without blank spaces or XML reserved characters. Unlike
an XML name, a name token may begin with a digit.
NMTOKENS is a list of name tokens separated by blank spaces.
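Several of the tokenized types can be seen working together in the sketch below. The team vocabulary and attribute names are invented for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE team [
  <!ELEMENT team (person+)>
  <!ELEMENT person (#PCDATA)>
  <!ATTLIST person
    id      ID      #REQUIRED
    manager IDREF   #IMPLIED
    peers   IDREFS  #IMPLIED
    badge   NMTOKEN #IMPLIED>
]>
<team>
  <!-- badge is a valid NMTOKEN even though it starts with a digit -->
  <person id="p1" badge="007">Ada</person>
  <!-- manager and peers must match id values used in this document -->
  <person id="p2" manager="p1" peers="p3">Grace</person>
  <person id="p3" manager="p1" peers="p2 p1">Alan</person>
</team>
```

Giving two person elements the same id, or pointing manager at a value that no id carries, would make the document invalid.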
Enumerated attribute types are limited to specified possible values
and notation. There are two kinds of enumerated attribute types,
notation type and enumeration type. The enumeration type syntax
follows.
<!ATTLIST element attribute (value1 | value2 | value3 | ...) default>
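For example, an enumeration can restrict an attribute to a short list of values and name one of them as the default. The task element and priority attribute here are hypothetical.

```xml
<?xml version="1.0"?>
<!DOCTYPE task [
  <!ELEMENT task (#PCDATA)>
  <!-- priority must be low, normal, or high; normal is supplied if omitted -->
  <!ATTLIST task priority (low | normal | high) "normal">
]>
<task priority="high">Review the DTD</task>
```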
Notation declarations provide a name for the notation (for use in entity
and attribute-list declarations and in attribute specifications) and an
external identifier for the notation which may allow an XML processor
or its client application to locate a helper application capable of processing
data in the given notation. The notation enumerated attribute type
associates the value of the attribute with a notation <!NOTATION>
declaration located in the DTD. Only one notation declaration can
declare a given name. The syntax follows.
<!ATTLIST element attribute NOTATION (notation1 | notation2 | ...) default>
XML processors must provide applications with the name and external identifier(s)
of any notation declared and referred to in an attribute value, attribute
definition, or entity declaration. They may additionally resolve
the external identifier into the system identifier, file name, or other
information needed to allow the application to call a processor for data
in the notation described.
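The sketch below ties together notation declarations, a notation-type attribute, and an ENTITY attribute that names an unparsed entity. The notation names, the system identifiers, and the file logo.png are assumptions made for the example.

```xml
<?xml version="1.0"?>
<!DOCTYPE figure [
  <!NOTATION png SYSTEM "image/png">
  <!NOTATION gif SYSTEM "image/gif">
  <!-- Unparsed entity: NDATA names the notation of the external file -->
  <!ENTITY logo SYSTEM "logo.png" NDATA png>
  <!ELEMENT figure (#PCDATA)>
  <!ATTLIST figure
    format NOTATION (png | gif) #REQUIRED
    source ENTITY #REQUIRED>
]>
<figure format="png" source="logo">Company logo</figure>
```

The figure element is declared with #PCDATA content rather than EMPTY because the XML 1.0 recommendation does not allow a notation-type attribute on an element declared EMPTY.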
The final part of an attribute declaration is the attribute default. There
are four possible defaults:
#REQUIRED. The attribute must always be specified for
all elements of the type in the attribute-list declaration.
<!ATTLIST element attribute type #REQUIRED>
#IMPLIED. The attribute is optional; no default value is provided.
<!ATTLIST element attribute type #IMPLIED>
"default". If the attribute is omitted from
the element, the XML parser supplies the default value.
#FIXED "default". If an attribute has a default
value declared with the #FIXED keyword, instances of that attribute
must match the default value. The #FIXED form cannot be used with
an ID attribute.
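All four defaults appear in the following sketch, which uses a hypothetical invoice element.

```xml
<?xml version="1.0"?>
<!DOCTYPE invoice [
  <!ELEMENT invoice (#PCDATA)>
  <!ATTLIST invoice
    number   CDATA #REQUIRED
    note     CDATA #IMPLIED
    currency CDATA "USD"
    version  CDATA #FIXED "1.0">
]>
<!-- number must be present; note may be left out; currency and version
     are supplied by the parser as "USD" and "1.0" when omitted -->
<invoice number="42">Total due: 10.00</invoice>
```

Writing version="2.0" on the element would make the document invalid, because a #FIXED attribute must match its declared value.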