jqjacobs.net/web/xml

XML Document Type Definitions (DTD)


Introduction

The World Wide Web Consortium's Extensible Markup Language (XML) 1.0 recommendation defines a data object as an XML document if it is well-formed.  Additionally, a well-formed XML document may be valid if it meets certain further criteria.  Well-formed XML documents conform to the following criteria:

  • the document contains one or more elements,

  • no part of the root element appears in any other element,

  • each of the parsed entities referenced directly or indirectly within the document is well-formed, and

  • the elements' start-tags and end-tags nest properly within each other.
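
As a small illustration of the criteria above, the following sketch shows a well-formed document, with a non-nesting variant kept inside a comment for comparison.  The element names (letter, greeting, b, stamp) are hypothetical and chosen only for this example.

```xml
<?xml version="1.0"?>
<!-- Well-formed: one root element, and every start-tag is closed
     inside the element that opened it. -->
<letter>
  <greeting>Dear <b>reader</b>,</greeting>
  <stamp/>
</letter>
<!-- Not well-formed, for comparison: the tags overlap instead of nesting.
     <greeting>Dear <b>reader</greeting></b> -->
```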

The recommendation defines an XML document as valid if it has an associated document type (DOCTYPE) declaration and if the document complies with the constraints expressed in the declaration.  The XML DOCTYPE declaration contains or points to markup declarations that provide the document grammar, essentially defining document content and structure.  The declared document grammar is termed the Document Type Definition (DTD).

XML documents are validated using schemas or the DTD.  To be validly constructed, the name in the document type declaration must match the element type of the root element.  The document type declaration must precede the first element in the document, and only one DTD can be used with an XML document.  DTDs are used to:

  • ensure that required elements are present,

  • prevent using undefined elements,

  • enforce data structure,

  • specify element attributes and define default values,

  • describe how the parser should access non-XML or non-textual content.


Declaring a Document Type Definition

DTDs are divided into two subsets: an internal subset, declared in the document itself, and an external subset, located in a separate file.  When both are used, the internal subset is considered to occur before the external subset, so entity and attribute-list declarations in the internal subset take precedence over those in the external subset.

The internal subset DOCTYPE syntax is as follows.

<!DOCTYPE root
[
declarations
]>
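
A minimal sketch of the internal subset form might look like the following.  The element names (note, to, body) are assumptions made for illustration; only the overall shape comes from the syntax above.

```xml
<?xml version="1.0"?>
<!-- The DOCTYPE name "note" matches the root element type. -->
<!DOCTYPE note [
  <!ELEMENT note (to, body)>
  <!ELEMENT to   (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
<note>
  <to>Reader</to>
  <body>This document is valid against its internal subset.</body>
</note>
```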

External DTDs can be shared by many documents (and authors), a useful advantage.  Of course, they require all authors to use the same document grammar.  Two forms of external subset DOCTYPE declaration exist, the SYSTEM location and the PUBLIC location.  The SYSTEM location form is preferred; the PUBLIC location form is used when required by your application.  The syntax for the two forms follows; in both, the bracketed internal subset is optional.

<!DOCTYPE root SYSTEM "URL"
[
declarations
]>

<!DOCTYPE root PUBLIC "identifier" "URL"
[
declarations
]>
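
The sketch below shows the SYSTEM form combined with an internal subset.  The file name note.dtd and its contents are hypothetical; the point is only that the internal declaration is read first, so its default would take precedence over the one assumed in the external file.

```xml
<?xml version="1.0"?>
<!-- Assumes a hypothetical note.dtd beside this file containing:
       <!ELEMENT note (#PCDATA)>
       <!ATTLIST note lang CDATA "en">
-->
<!DOCTYPE note SYSTEM "note.dtd" [
  <!-- Read before the external subset, so this default wins. -->
  <!ATTLIST note lang CDATA "fr">
]>
<note>Bonjour</note>
```

A familiar instance of the PUBLIC form, without an internal subset, is the XHTML 1.0 Strict declaration:

```xml
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
```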


Declaring Document Elements

Valid documents declare every element in the DTD.  Element names are case sensitive and cannot contain blank spaces or XML reserved symbols.  The element type declarations specify the element name and the type of content.  The element structure, for validation purposes, may be constrained using element type and attribute-list declarations.  An element type declaration constrains the element's content, often constraining which element types can appear as children of the element.  The element order in the document can also be specified.

The element type declaration states both the name of the element and the type of content, the content-model.  An element type declaration takes the following form and syntax.

Element Name Type Declaration

<!ELEMENT element content-model>

Elements generally contain text or other elements.  Element types have 'element content' when elements of that type must contain only child elements (no character data).  In this case, the validity constraint includes a content model, a simple grammar governing the allowed types of the child elements and the order in which they are allowed to appear.  The grammar is built on content particles, which consist of names, choice lists of content particles, or sequence lists of content particles.

DTDs define five element content types.  The five element content types and their syntax follow.

ANY allows the declared elements to store any type of content.  This content model does not enforce any validation rules.

<!ELEMENT element ANY>

EMPTY is used for elements without content.  Adding content to an empty element results in an invalid document.

<!ELEMENT element EMPTY>

Character data elements can contain only a well-formed text string: any text that does not use the symbols reserved by XML (such as < and &) unescaped.  This content model does not support child elements.  The keyword #PCDATA represents parsed character data (any well-formed text string).

<!ELEMENT element (#PCDATA)>
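
The three content types described so far can be seen side by side in the following sketch.  The element names (page, br, title) are hypothetical.

```xml
<?xml version="1.0"?>
<!DOCTYPE page [
  <!ELEMENT page  ANY>        <!-- any declared elements or text -->
  <!ELEMENT br    EMPTY>      <!-- no content allowed -->
  <!ELEMENT title (#PCDATA)>  <!-- text only, no child elements -->
]>
<page>
  <title>Tom &amp; Jerry</title>
  Loose text is allowed inside an ANY element.<br/>
</page>
```

Writing `<br>text</br>` here would make the document invalid, and the ampersand in the title has to be escaped as `&amp;` for the text to remain well-formed.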

Elements can contain only child elements.  The list of child elements is specified in the declaration.  A child element that occurs more than once must be listed once per occurrence or be given an occurrence modifier.

<!ELEMENT element (child_elements)>

Either a sequence or a choice of child elements can be specified.  A sequence of elements follows a defined order.  The choice element declaration presents a set of possible child elements.  A content particle in a choice list may appear in the element content at the location where the choice list appears in the grammar.  To be valid, a content particle occurring in a sequence list must appear in the element content in the order given in the list. 

The syntax for a sequence uses comma separators, while the choice specification uses the vertical line character.  The sequence and choice element content models can also be used together.  The respective syntax of these three possible usages follows.

<!ELEMENT element (child1, child2, ...)>

<!ELEMENT element (child1 | child2 | ...)>

<!ELEMENT element ((child1 | child2 | ...), child3, child4, ...)>
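
As a hedged example of combining the two models, the sketch below requires either a phone or an email element first, followed by name and city in that order.  All element names are assumptions made for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE contact [
  <!-- Either phone or email first, then name and city in that order. -->
  <!ELEMENT contact ((phone | email), name, city)>
  <!ELEMENT phone (#PCDATA)>
  <!ELEMENT email (#PCDATA)>
  <!ELEMENT name  (#PCDATA)>
  <!ELEMENT city  (#PCDATA)>
]>
<contact>
  <email>reader@example.org</email>
  <name>A. Reader</name>
  <city>Springfield</city>
</contact>
```

Swapping the name and city elements, or supplying both phone and email, would make this document invalid.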

Modifying symbols can be applied to the content model to indicate the number of element occurrences.  An optional character following a name or list respectively governs whether the element or all the content particles in the list may occur one or more, zero or more, or zero or one times.  The absence of such an operator means that the element or content particle must appear exactly once.  The plus sign (+) allows one or more occurrences.  The asterisk (*) allows zero or more occurrences.  The question mark (?) allows zero or one occurrence.

The following syntax illustrates the modifying symbols, first with a defined-order declaration applying all three modifying symbols to individual child elements, and second applying the one-or-more modifier (+) to an entire element declaration list by placing it after the closing parenthesis.

<!ELEMENT element (item1?, item2+, item3*)>

<!ELEMENT element (child_elements)+>

Multiple combinations of each child element are valid when the modifiers are applied to the choice model, and the number of specified occurrences can be met with any choice and combination.
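
The following sketch applies each modifier to a sequence and then a + modifier to a choice list.  The element names (order, note, item, coupon, size, color) are hypothetical.

```xml
<?xml version="1.0"?>
<!DOCTYPE order [
  <!-- note: optional; item: at least one; coupon: any number. -->
  <!ELEMENT order  (note?, item+, coupon*)>
  <!-- A choice list repeated one or more times, in any mix. -->
  <!ELEMENT item   (size | color)+>
  <!ELEMENT note   (#PCDATA)>
  <!ELEMENT coupon (#PCDATA)>
  <!ELEMENT size   (#PCDATA)>
  <!ELEMENT color  (#PCDATA)>
]>
<order>
  <item><color>red</color><size>M</size><color>blue</color></item>
  <item><size>L</size></item>
</order>
```

The first item mixes color and size freely because the + applies to the whole choice list; an empty item element would be invalid because at least one choice is required.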

The Mixed content element type may contain character data, optionally interspersed with child elements.  The types of the child elements may be constrained, but order and the number of occurrences of each element cannot be constrained with this element type.  The same name must not appear more than once in a single mixed-content declaration.

<!ELEMENT element (#PCDATA | child1 | ...)*>
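
A brief sketch of mixed content follows, with hypothetical element names (para, em, code).  XML 1.0 requires #PCDATA to come first in the list and requires the closing asterisk whenever child elements are named.

```xml
<?xml version="1.0"?>
<!DOCTYPE para [
  <!ELEMENT para (#PCDATA | em | code)*>
  <!ELEMENT em   (#PCDATA)>
  <!ELEMENT code (#PCDATA)>
]>
<para>Use <code>#PCDATA</code> first, then <em>any</em> mix of <em>children</em>.</para>
```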


Attribute-List Declarations

XML documents are logically structured.  Each XML document contains one or more elements delimited by start-tags and end-tags or, for empty elements, by an empty-element tag.  Each element has a type, identified by name, and may have a set of attribute specifications.  The attribute-list declaration accomplishes the following:

  • lists the names of all of the attributes associated with a specific element,

  • specifies the data type of the attribute,

  • indicates whether the attribute is required or optional,

  • provides a default value for the attribute, if necessary.

Attribute-list declarations specify, using the following syntax, the name, data type, and default value (if any) of each attribute associated with a given element type.

<!ATTLIST element attribute_name type default>

When more than one attribute-list declaration is stated for a given element type, the contents of all declarations are merged.  When more than one definition is provided for the same attribute of a given element type, the first declaration is binding and later declarations are ignored.
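
The merging rule can be sketched as follows, using a hypothetical img element.  A processor may warn about the repeated definition, but the document remains valid.

```xml
<?xml version="1.0"?>
<!DOCTYPE img [
  <!ELEMENT img EMPTY>
  <!ATTLIST img src   CDATA #REQUIRED>
  <!ATTLIST img align CDATA "left">     <!-- merged with the list above -->
  <!ATTLIST img align CDATA "right">    <!-- ignored: align already declared -->
]>
<img src="photo.png"/>
```

Under this grammar the img element carries both src and align, and the effective default for align is "left".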

There are three kinds of XML attribute types: a string type, a set of tokenized types, and enumerated types.  The three attribute type categories provide varying degrees of control over attribute content.

String type attributes may take any literal string as a value; they can contain blank spaces and any characters except reserved XML symbols.  They are the simplest form of attribute value.  String types are declared using the following syntax.

<!ATTLIST element attribute_name CDATA default>
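
For instance, a string-type attribute on a hypothetical book element could be declared and used as in this sketch.  The value contains spaces and punctuation, and the reserved ampersand is written as an entity reference.

```xml
<?xml version="1.0"?>
<!DOCTYPE book [
  <!ELEMENT book EMPTY>
  <!ATTLIST book title CDATA #REQUIRED>
]>
<book title="Dombey &amp; Son, 2nd ed."/>
```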

Tokenized types are text strings that follow format and content rules; they have varying lexical and semantic constraints.  The syntax follows.

<!ATTLIST element attribute_name token default>

There are seven tokenized types:

ID.  ID values uniquely identify the elements which bear them.  As a value of this type, a name can only appear once in an XML document.  No element type may have more than one ID attribute specified.  An ID attribute must have a declared default of #IMPLIED or #REQUIRED.

<!ATTLIST element attribute ID default>

IDREF.  IDREF attribute values are linked to declared ID attributes in the document using the IDREF token.  An attribute declared as an IDREF type must have the same value as an ID attribute of another element in the same document.

<!ATTLIST element attribute IDREF default>

IDREFS.  This type creates a reference to multiple IDs.
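
The three identifier types can be sketched together as follows.  The element and attribute names (team, person, id, manager, peers) are assumptions chosen for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE team [
  <!ELEMENT team   (person+)>
  <!ELEMENT person (#PCDATA)>
  <!ATTLIST person
    id      ID     #REQUIRED
    manager IDREF  #IMPLIED
    peers   IDREFS #IMPLIED>
]>
<team>
  <person id="p1">Ada</person>
  <person id="p2" manager="p1">Ben</person>
  <person id="p3" manager="p1" peers="p1 p2">Cy</person>
</team>
```

A validating parser would reject a second person with id="p1", or a manager value such as "p9" that matches no ID in the document.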

ENTITY references an external file.   Each entity name must match the name of an unparsed entity declared in the DTD.

ENTITIES takes a list of unparsed entity names separated by blank spaces.

NMTOKEN is a name token: a text string made up only of name characters (such as letters, digits, periods, hyphens, underscores, and colons), with no blank spaces or other XML reserved characters.  Unlike an XML name, a name token may begin with a digit.

NMTOKENS takes a list of name tokens separated by blank spaces.
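
The sketch below declares an unparsed entity and uses it through an ENTITY attribute, alongside NMTOKEN and NMTOKENS attributes.  The names (figure, logo, png), the file logo.png, and the notation's system identifier are all hypothetical.

```xml
<?xml version="1.0"?>
<!DOCTYPE figure [
  <!ELEMENT figure EMPTY>
  <!NOTATION png SYSTEM "image/png">
  <!-- An unparsed entity: external file plus NDATA notation name. -->
  <!ENTITY logo SYSTEM "logo.png" NDATA png>
  <!ATTLIST figure
    source ENTITY   #REQUIRED
    width  NMTOKEN  #IMPLIED
    tags   NMTOKENS #IMPLIED>
]>
<figure source="logo" width="120px" tags="2004 web-art v1.0"/>
```

Note that the source attribute holds the bare entity name, not an entity reference such as `&logo;`.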

Enumerated attribute types limit the attribute to a specified list of possible values.  There are two kinds of enumerated attribute types, notation type and enumeration type.  The enumeration type syntax follows.

<!ATTLIST element attribute (value1 | value2 | value3 | ...) default>
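
A minimal enumeration sketch, with a hypothetical task element whose status attribute accepts only three values and defaults to "open":

```xml
<?xml version="1.0"?>
<!DOCTYPE task [
  <!ELEMENT task (#PCDATA)>
  <!ATTLIST task status (open | done | blocked) "open">
]>
<task status="done">Write the DTD</task>
```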

Notation declarations provide a name for the notation (for use in entity and attribute-list declarations and in attribute specifications) and an external identifier for the notation which may allow an XML processor or its client application to locate a helper application capable of processing data in the given notation.  The notation enumerated attribute type associates the value of the attribute with a notation <!NOTATION> declaration located in the DTD.  Only one notation declaration can declare a given name.  The syntax follows.

<!ATTLIST element attribute NOTATION (notation1 | notation2 | ...) default>

XML processors must provide applications with the name and external identifier(s) of any notation declared and referred to in an attribute value, attribute definition, or entity declaration.  They may additionally resolve the external identifier into the system identifier, file name, or other information needed to allow the application to call a processor for data in the notation described.
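
The notation type might be used as in the following sketch.  The notation names (python, shell) and their system identifiers are assumptions for illustration; the application, not the parser, decides what to do with them.

```xml
<?xml version="1.0"?>
<!DOCTYPE code [
  <!NOTATION python SYSTEM "text/x-python">
  <!NOTATION shell  SYSTEM "text/x-sh">
  <!ELEMENT code (#PCDATA)>
  <!ATTLIST code lang NOTATION (python | shell) #REQUIRED>
]>
<code lang="python">print("hello")</code>
```

Each name in the parenthesized list must match a declared notation, and the NOTATION keyword is what distinguishes this form from a plain enumeration.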

The final part of an attribute declaration is the attribute default.  There are four possible defaults:

#REQUIRED.  The attribute must always be specified for all elements of the type in the attribute-list declaration.

<!ATTLIST element attribute type #REQUIRED>

#IMPLIED.  The attribute is optional, and no default value is provided.

<!ATTLIST element attribute type #IMPLIED>

"default"  If you omit the attribute from the element, the XML parser supplies the default value.

#FIXED "default"  If an attribute has a default value declared with the #FIXED keyword, instances of that attribute must match the default value.  The #FIXED form cannot be used with the ID attribute.
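
All four defaults appear in the sketch below, which uses a hypothetical invoice element and hypothetical attribute names.

```xml
<?xml version="1.0"?>
<!DOCTYPE invoice [
  <!ELEMENT invoice EMPTY>
  <!ATTLIST invoice
    number   CDATA #REQUIRED
    note     CDATA #IMPLIED
    currency CDATA "USD"
    version  CDATA #FIXED "1.0">
]>
<!-- currency and version are omitted, so the parser supplies "USD" and "1.0". -->
<invoice number="A-17"/>
```

Omitting number would make the document invalid, as would writing version="2.0", since a #FIXED attribute must match its declared value whenever it is specified.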


© 2004 by James Q. Jacobs. All rights reserved.
