Extensible Markup Language (XML) Basics



A markup language specifies the structure and content of a document.  Extensible Markup Language (XML) is a subset of, or restricted form of, Standard Generalized Markup Language (SGML), which was introduced in the 1980s.  XML documents are conforming SGML documents.   

Hypertext Markup Language (HTML) is the lingua franca for publishing hypertext on the World Wide Web (WWW).  It is a non-proprietary format based upon SGML.  HTML was designed to format Web page text and other elements, not to describe or catalog data.  HTML's set of tags is fixed, so it cannot be extended to meet specific needs.  Browsers also do not apply HTML consistently, so the same document can appear differently from one browser to another.

XML, because it is extensible, can be used to create a wide variety of document types.  With XML, new markup languages, called XML applications, can be created.  Many XML applications have been developed to work with specific types of documents.

The design goals for XML, as expressed in the W3C Recommendation, are:

  1. XML shall be straightforwardly usable over the Internet.
  2. XML shall support a wide variety of applications.
  3. XML shall be compatible with SGML.
  4. It shall be easy to write programs which process XML documents.
  5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
  6. XML documents should be human-legible and reasonably clear.
  7. The XML design should be prepared quickly.
  8. The design of XML shall be formal and concise.
  9. XML documents shall be easy to create.
  10. Terseness in XML markup is of minimal importance.

XML Documents

XML describes a class of data objects called XML documents.  Each XML document has both a logical and a physical structure, and the two must nest properly for the document to be well-formed.  Physically, XML documents consist of storage units called entities.  An entity may refer to other entities to include them in the document.  XML documents begin with a "root" (or document) entity.

Logically, a document is composed of declarations, elements, comments, character references, and processing instructions, each indicated by explicit markup.  XML markup encodes the document's storage layout and logical structure.  XML documents consist of three parts: the prolog, the document body, and the epilog.  The prolog and epilog are optional.

  • The prolog, which provides information about the document, consists of four parts in the order below. This order is mandatory; if it is not followed, the parser will generate an error message. Although these parts are not required, it is good form to include them.
    • XML declaration
    • Miscellaneous statements or comments
    • Document type declaration
    • Miscellaneous statements or comments
  • The document body contains the contents in a hierarchical structure.
  • The epilog contains final comments and processing instructions.
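
The three-part layout described above can be observed with a parser. The following is a minimal sketch using Python's standard xml.dom.minidom module. The sample document, its note element, and the helper name top_level_parts are hypothetical and serve only as an illustration.

```python
from xml.dom import Node, minidom

# Hypothetical sample document with a prolog, a body, and an epilog.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!-- prolog comment -->
<!DOCTYPE note [<!ELEMENT note (#PCDATA)>]>
<note>document body</note>
<!-- epilog comment -->
"""

# Readable labels for the node types that can appear at the top level.
KIND = {
    Node.COMMENT_NODE: "comment",
    Node.DOCUMENT_TYPE_NODE: "document type declaration",
    Node.ELEMENT_NODE: "root element",
    Node.PROCESSING_INSTRUCTION_NODE: "processing instruction",
}


def top_level_parts(xml_bytes):
    """Return the kinds of the top-level nodes in document order."""
    doc = minidom.parseString(xml_bytes)
    return [KIND.get(node.nodeType, "other") for node in doc.childNodes]


parts = top_level_parts(SAMPLE)
print(parts)
```

The XML declaration itself is not reported as a node; the parser consumes it before building the list.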

When present, the XML declaration must be the first line of the file; it tells the processor that the file is written using XML. The declaration can also instruct the parser about how to interpret the code. The version attribute is required, while encoding and standalone are optional. The complete syntax is:

<?xml version="version number" encoding="encoding type" standalone="yes | no" ?>

A sample declaration might look like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
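
As a rough check of the declaration rules, the sketch below feeds three inputs to Python's standard xml.etree.ElementTree parser. The note element and the helper name is_well_formed are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

DECLARATION = b'<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>'


def is_well_formed(xml_bytes):
    """Return True if the parser accepts the input, False otherwise."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        return False


# Declaration at the very start of the file: accepted.
declaration_first = is_well_formed(DECLARATION + b"<note/>")

# A comment placed before the declaration: rejected.
declaration_late = is_well_formed(b"<!-- too early -->" + DECLARATION + b"<note/>")

# Typographic (curly) quotation marks around the value: rejected.
curly_quotes = is_well_formed('<?xml version=“1.0”?><note/>'.encode("utf-8"))

print(declaration_first, declaration_late, curly_quotes)
```

The last case is a common pitfall when a declaration is pasted from a word processor: only straight double or single quotes are valid delimiters.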

Comments and other statements may be placed anywhere after the declaration, except inside other markup.  The XML comment syntax resembles HTML comment syntax: a comment opens with <!-- and closes with -->.  White space is not permitted between the markup declaration open delimiter (angle bracket and exclamation) and the comment open delimiter hyphens.  Unlike SGML-based HTML, XML also does not permit white space between the comment close delimiter (two hyphens) and the markup declaration close delimiter (angle bracket).  Warning: Because two hyphens signal the end of a comment, strings of two or more adjacent hyphens cannot be used inside comments.

<!-- insert your comment here -->

<!-- This is also a comment.
Comments can occupy more than one line. -->
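
The comment rules can be tried against a parser. This is a minimal sketch using Python's standard xml.etree.ElementTree module; the doc element and the helper name is_well_formed are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the input, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# An ordinary comment inside the root element.
plain = is_well_formed("<doc><!-- a plain comment --></doc>")

# Two adjacent hyphens inside the comment text.
double_hyphen = is_well_formed("<doc><!-- two -- hyphens --></doc>")

# White space between the exclamation mark and the hyphens.
space_in_open = is_well_formed("<doc><! -- broken open delimiter --></doc>")

# White space between the closing hyphens and the angle bracket.
space_in_close = is_well_formed("<doc><!-- broken close delimiter -- ></doc>")

print(plain, double_hyphen, space_in_open, space_in_close)
```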


XML Elements

Elements are the basic building blocks of XML files.  Element names are case sensitive.  XML supports two types of elements, closed and empty (open) elements.  A closed element consists of an opening tag, content, and a closing tag; in the closing tag, a forward slash precedes the element name, as in <Year> Tropical Year </Year>.


Elements can be nested, and all elements must be nested within a single root element.  Nested elements are termed child elements.  Elements must be nested correctly, with each child element's opening and closing tags both enclosed within its parent's opening and closing tags: <a><b></b></a> is well-formed, whereas <a><b></a></b> is not.
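
The nesting and case-sensitivity rules can be sketched with Python's standard xml.etree.ElementTree parser. The Calendar root element and the helper name is_well_formed are hypothetical, chosen only for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the input, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Child element properly enclosed by its parent.
nested = is_well_formed("<Calendar><Year>Tropical Year</Year></Calendar>")

# Closing tags in the wrong order.
overlapping = is_well_formed("<Calendar><Year>Tropical Year</Calendar></Year>")

# Opening and closing names differ only in case.
case_mismatch = is_well_formed("<Year>Tropical Year</year>")

# Two top-level elements instead of a single root.
two_roots = is_well_formed("<Year>a</Year><Year>b</Year>")

print(nested, overlapping, case_mismatch, two_roots)
```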


Empty (open) elements contain no content.  An empty or open element can be used to mark sections of the document for the processor.  Empty elements can contain attributes.  An empty element is written as a single tag in which the element name is followed by a slash, for example <Year />.
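
To a parser, the short empty form and an opening tag immediately followed by its closing tag describe the same element. The sketch below illustrates this with Python's standard xml.etree.ElementTree module; the helper name is_empty is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring("<Year />")
long_form = ET.fromstring("<Year></Year>")


def is_empty(element):
    """True when the element has no child elements and no text."""
    return len(element) == 0 and not element.text


both_empty = is_empty(short_form) and is_empty(long_form)
same_name = short_form.tag == long_form.tag

print(both_empty, same_name)
```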



XML Attributes and Values

Attributes are characteristics of elements, and each attribute has a value.  Attribute names are case sensitive.  Values must be enclosed in double (or single) quotes, as in the closed and empty element examples that follow, where the attribute is days and the value is the number in quotes.  Attribute names must begin with a letter or an underscore (Warning: names beginning with the letters xml are reserved), they cannot contain spaces, and the same name cannot appear more than once in the same tag.  Values are text strings.  They can therefore contain white space and most characters; the left angle bracket and a bare ampersand are not allowed and must be written as &lt; and &amp;.

<Year days="365.24219"> Tropical Year </Year>

<Year days="365.24219" />
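
The attribute rules can be exercised with the Year examples above. This is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the input, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


year = ET.fromstring('<Year days="365.24219"> Tropical Year </Year>')

# The value comes back as a text string, not as a number.
days = year.get("days")

# Attribute names are case sensitive, so this lookup finds nothing.
wrong_case = year.get("Days")

# Single quotes are an accepted alternative to double quotes.
single_quoted = ET.fromstring("<Year days='365.24219' />").get("days")

# The same attribute name twice in one tag is an error.
duplicate = is_well_formed('<Year days="365" days="366" />')

# Values without quotes are an error.
unquoted = is_well_formed("<Year days=365 />")

print(days, wrong_case, single_quoted, duplicate, unquoted)
```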


Character References

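XML uses the ISO/IEC 10646 (Unicode) character set, and its numeric character references work the same way as those in HTML.  A character reference begins with an ampersand and a pound symbol, followed by the character's code number in decimal (or by an x and the number in hexadecimal), and closes with a semi-colon.  The named character entity references of HTML 4, such as &copy;, are not predefined in XML; only &lt;, &gt;, &amp;, &apos;, and &quot; are, so other names must be declared before use.  For example, &#169; and &#xA9; are alternate forms that both display the copyright sign.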
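
The behavior of character references can be sketched with Python's standard xml.etree.ElementTree parser. The element name c is hypothetical, and the document is assumed to have no DTD that declares additional entities.

```python
import xml.etree.ElementTree as ET

# Decimal and hexadecimal references to the same character (U+00A9).
decimal_form = ET.fromstring("<c>&#169;</c>").text
hex_form = ET.fromstring("<c>&#xA9;</c>").text

# The HTML name for the same character is undeclared in plain XML.
try:
    ET.fromstring("<c>&copy;</c>")
    named_form_accepted = True
except ET.ParseError:
    named_form_accepted = False

print(decimal_form == hex_form, named_form_accepted)
```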





2004 by James Q. Jacobs. All rights reserved.