XML Documents | XML Elements | XML Attributes | Character References
Introduction
A markup language specifies the structure and content of a document. Extensible Markup Language (XML) is a subset, or restricted form, of Standard Generalized Markup Language (SGML), which was introduced in the 1980s. XML documents are conforming SGML documents.
Hypertext Markup Language (HTML) is the lingua franca for publishing hypertext on the World Wide Web (WWW). It is a non-proprietary format based upon SGML. HTML was designed to format Web page text and other elements, not to describe or catalog data. Its set of elements is fixed and cannot be extended to meet specific needs, and browsers do not apply HTML consistently, so the same document can appear differently in different browsers.
XML, because it is extensible, can be used to create a wide variety of document types. With XML, new markup languages, called XML applications, can be created, and many have been developed to work with specific types of documents. The design goals for XML, as expressed in the W3C Recommendation, are:
XML describes a class of data objects called XML documents. XML documents have both a logical and a physical structure, and the two must nest properly for a document to be well-formed. Physically, a document consists of storage units called entities. An entity may refer to other entities to include them in the document. Every document begins in a "root" (or document) entity. Logically, a document is composed of declarations, elements, comments, character references, and processing instructions, each indicated by explicit markup. XML markup encodes the document's storage layout and logical structure. An XML document consists of three parts: the prolog, the document body, and the epilog. The prolog and epilog are optional.
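The three-part layout can be sketched with Python's standard xml.dom.minidom module. Python and the sample document (the Calendar element and both comments) are assumptions made for illustration and are not part of the original page. The sketch lists the top-level nodes of a small document: a comment in the prolog, the single root element that forms the body, and a comment in the epilog.

```python
from xml.dom import minidom, Node

# Hypothetical sample: declaration and comment (prolog), one root element
# (body), and a trailing comment (epilog).
DOCUMENT = (
    '<?xml version="1.0"?>'
    '<!-- prolog comment -->'
    '<Calendar><Year>2000</Year></Calendar>'
    '<!-- epilog comment -->'
)

NODE_NAMES = {
    Node.COMMENT_NODE: "comment",
    Node.ELEMENT_NODE: "element",
    Node.PROCESSING_INSTRUCTION_NODE: "processing instruction",
}


def top_level_parts(xml_text):
    """Return the kinds of nodes found at the top level of the document."""
    doc = minidom.parseString(xml_text)
    return [NODE_NAMES.get(node.nodeType, "other") for node in doc.childNodes]


if __name__ == "__main__":
    print(top_level_parts(DOCUMENT))
```

The XML declaration itself is not reported as a node; the parser consumes it before building the tree.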
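The first line of code is the XML declaration, which tells the processor that the file is written in XML. Strictly, XML 1.0 makes the declaration optional, but when it is present nothing may precede it. The declaration can also tell the parser how to interpret the code, through its version, encoding, and standalone settings. The complete syntax is:
<?xml version="version number" encoding="encoding type" standalone="yes | no" ?>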
A sample declaration might look like this:
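As an illustration, the sketch below parses one possible declaration with Python's standard xml.dom.minidom module and reads back the three settings. The sample values (version 1.0, UTF-8, standalone yes) and the Calendar root element are assumptions chosen for the example, not values taken from the original page. It also checks that the parser rejects a declaration that is not at the very start of the document.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical sample declaration followed by an empty root element.
SAMPLE = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><Calendar/>'


def read_declaration(xml_bytes):
    """Return (version, encoding, standalone) as reported by the parser."""
    doc = minidom.parseString(xml_bytes)
    return doc.version, doc.encoding, doc.standalone


def is_well_formed(xml_bytes):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        minidom.parseString(xml_bytes)
        return True
    except ExpatError:
        return False


if __name__ == "__main__":
    print(read_declaration(SAMPLE))
    # A blank line before the declaration makes the document ill-formed.
    print(is_well_formed(b"\n" + SAMPLE))
```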
Comments may be placed anywhere after the declaration, outside other markup. The XML comment syntax is essentially the same as HTML comment syntax: a comment opens with <!-- and closes with -->. White space is not permitted between the markup declaration open delimiter (angle bracket and exclamation mark) and the comment open delimiter (two hyphens). HTML, following SGML, permits white space between the comment close delimiter (two hyphens) and the markup declaration close delimiter (angle bracket), but XML requires the closing --> to be written without white space. Warning: because two hyphens act as a comment delimiter, a string of two or more adjacent hyphens cannot be used inside a comment.
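The comment rules can be checked with a short sketch using Python's standard xml.etree.ElementTree parser. The sample strings are assumptions made up for the example. It shows one valid comment, one comment containing a double hyphen, and one with white space after the opening angle bracket and exclamation mark.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical samples for illustration.
GOOD = "<Year><!-- months of the year 2000 -->2000</Year>"
DOUBLE_HYPHEN = "<Year><!-- first -- second -->2000</Year>"
SPACE_AFTER_OPEN = "<Year><! -- comment -->2000</Year>"

if __name__ == "__main__":
    for label, sample in [
        ("valid comment", GOOD),
        ("double hyphen inside comment", DOUBLE_HYPHEN),
        ("white space after <!", SPACE_AFTER_OPEN),
    ]:
        print(label, "->", is_well_formed(sample))
```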
Elements are the basic building blocks of XML files. Element names are case sensitive. XML supports two types of elements, closed and empty (open) elements. Closed elements consist of both opening and closing tags. The following example presents a closed element. In the closing tag, a forward slash precedes the element name.
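A closed element can be sketched as follows, again using Python's standard xml.etree.ElementTree parser. The Month element and its content are assumptions chosen for the example. Because element names are case sensitive, the second sample, whose closing tag differs only in case, is rejected.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical closed element: opening tag, content, closing tag.
CLOSED = "<Month>January</Month>"
# The same element with a closing tag that differs only in case.
MISMATCHED_CASE = "<Month>January</month>"

month = ET.fromstring(CLOSED)

if __name__ == "__main__":
    print(month.tag, month.text)
    print(is_well_formed(MISMATCHED_CASE))
```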
Elements can be nested, and all elements must be nested within a single root element. Nested elements are termed child elements. Elements must be nested correctly, with child elements enclosed within their parent's opening and closing tags, as follows:
<Year>2000
  <Month>January</Month>
  <Month>February</Month>
</Year>
Empty (open) elements contain no content. An empty element can be used to mark sections of the document for the processor, and it can contain attributes. An empty element has the following syntax; the element name is followed by a slash.
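The nesting rules and the empty element form can be sketched with Python's standard xml.etree.ElementTree parser. The Year and Month elements come from the example in the text; the Break element and the two ill-formed samples are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


NESTED = """<Year>2000
  <Month>January</Month>
  <Month>February</Month>
</Year>"""

# Hypothetical empty element: the name is followed by a slash.
EMPTY = "<Break/>"
# Child closed after its parent: tags overlap instead of nesting.
OVERLAPPING = "<Year><Month>January</Year></Month>"
# Two top-level elements: there is no single root.
TWO_ROOTS = "<Year>2000</Year><Year>2001</Year>"

year = ET.fromstring(NESTED)
months = [child.text for child in year]  # text of each child element
empty = ET.fromstring(EMPTY)

if __name__ == "__main__":
    print(year.tag, months)
    print(empty.tag, empty.text, len(empty))
    print(is_well_formed(OVERLAPPING), is_well_formed(TWO_ROOTS))
```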
Attributes are characteristics of elements. Attribute names are case sensitive. Every attribute has a value, which must be enclosed in double (or single) quotes, as follows in the closed and empty element examples. In this example, the attribute is days and the value is the number in quotes. Attribute names must begin with an underscore or a letter (Warning: names beginning with the letters xml are reserved), they cannot contain spaces, and they cannot appear more than once in the same tag. Values are text strings. They can therefore contain white space and most characters, but not markup characters: a left angle bracket or a bare ampersand must be written as a reference.
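The attribute rules can be sketched with Python's standard xml.etree.ElementTree parser. The days attribute is named in the text; the Month element and the specific values are assumptions chosen for the example. The sketch reads the attribute from a closed and an empty element, then checks that a repeated attribute and an unquoted value are both rejected.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical samples: double quotes on a closed element,
# single quotes on an empty element.
closed = ET.fromstring('<Month days="31">January</Month>')
empty = ET.fromstring("<Month days='29'/>")

DUPLICATE = '<Month days="31" days="30"/>'  # same attribute twice
UNQUOTED = "<Month days=31/>"               # value without quotes

if __name__ == "__main__":
    print(closed.get("days"), empty.get("days"))
    # Attribute names are case sensitive, so "Days" is a different name.
    print(closed.get("Days"))
    print(is_well_formed(DUPLICATE), is_well_formed(UNQUOTED))
```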
XML supports the ISO/IEC 10646 (Unicode) character set, and uses the same numeric character reference syntax as HTML. Character entity references in HTML 4 are available online. Character reference syntax uses the ampersand and pound symbols to open the reference and the semicolon to close it. The following two examples are alternate forms to display the copyright sign.
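A minimal sketch of the two numeric forms, using Python's standard xml.etree.ElementTree parser. It assumes the two alternate forms are the decimal reference &#169; and the hexadecimal reference &#xA9;; the Notice element is hypothetical. It also shows that the HTML named reference &copy; is not predefined in XML and is rejected unless a DTD declares it.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Decimal and hexadecimal references to the same character (U+00A9).
decimal = ET.fromstring("<Notice>&#169; 2004</Notice>").text
hexadecimal = ET.fromstring("<Notice>&#xA9; 2004</Notice>").text

# Named reference borrowed from HTML, with no DTD declaring it.
NAMED = "<Notice>&copy; 2004</Notice>"

if __name__ == "__main__":
    print(decimal)
    print(hexadecimal)
    print(is_well_formed(NAMED))
```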
© 2004 by James Q. Jacobs. All rights reserved.