Published by: Sujan
Published date: 16 Jun 2021
XML (Extensible Markup Language) is a standard approved by the World Wide Web Consortium (W3C) in 1998. It is a customizable markup language that allows for complex information transactions on the Internet. Companies such as Microsoft and Netscape were early developers of XML technologies.
HTML is designed for content being sent to a browser but isn’t good for sending content to other mediums like a printer or a ticker. XML allows developers to create a custom markup language specific to their needs. Specially coded XML documents reside on a server and can be converted to HTML and read by browsers. Other clients (including future browsers that are XML-compliant) can access the XML documents directly and use the content for a variety of purposes.
XML is a markup language like HTML, but it is a common misconception that it is HTML on steroids. XML and HTML are related only through a common parent, SGML (Standard Generalized Markup Language). SGML is a meta-language: a comprehensive set of syntax rules for marking up documents and data.
When the creators of the Web needed a markup language that told browsers how to display web content, they used SGML guidelines to create HTML. HTML was designed specifically for displaying content in a browser but isn’t good for much else.
Now that the Web has matured and we are using it for more than just viewing text and images, we need to create more versatile markup languages. We could use SGML as we did when creating HTML, but SGML wasn’t designed for the Web. It is too bloated in that it has features that are unnecessary and wouldn’t be used.
Also, SGML documents themselves are too large and would unnecessarily take up much of the Web’s bandwidth. Clearly, a more portable, Web-specific version of SGML had to be created. Thus XML is SGML’s smaller cousin. XML is SGML with a reduced feature set. It is powerful enough to describe data but light enough to travel across the Web.
Another important part of XML is the Document Type Definition (DTD), which defines each tag and provides more information about each tag or the document in general. A DTD can be part of an XML file itself, but it is usually a separate file or series of files. The DTD is what turns XML from a meta-language into a true language designed for a specific task. It is a type of file, associated with SGML and XML documents, that defines how markup tags should be interpreted by the application reading the document.
The HTML Specification that defines how web pages should be displayed by a browser is one example of a DTD. Other technologies, such as the proposed multimedia standard SMIL and the proposed vector graphics standard PGML, use DTDs that were created in compliance with the XML meta-language. If you were creating recipes that could be accessed over the Web, you might create your own language called RML, or Recipe Markup Language. RML would have tags like <title> and <body>, but would also have RML-specific tags such as <ingredients>, <prep-time>, and <nutritional-information>.
These tags would be established in a DTD for the new language. The DTD imparts detailed information about what data should be found in each tag. A DTD for Recipe Markup Language might have a line like this:
<!ELEMENT ingredients ( li+, text? )>
This line declares an element called ingredients. An ingredients element can contain li elements followed by a text element. The plus sign (+) after li indicates that an “ingredients” element will have one or more “li” elements within it. The question mark after text shows that the “text” element is optional. The Recipe Markup Language DTD would also specify the “li” element:
<!ELEMENT li (#PCDATA)>
This element contains text only. An XML document doesn’t have to be associated with a DTD. You can simply mark up a document and assume the person reading your XML file already has the proper DTD or will make up their own. Because XML doesn’t require a DTD, you can turn your existing DTD-less HTML files into XML by making a few changes.
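As a rough illustration of working without a DTD, the sketch below parses a small RML document using Python's standard xml.etree.ElementTree module. That parser checks only well-formedness and does not validate against a DTD, so the content model ( li+, text? ) is checked by hand. The sample recipe, the <recipe> root element, and the helper name ingredients_match_model are assumptions made up for this example; they are not part of any real RML definition.

```python
import xml.etree.ElementTree as ET

# Hypothetical RML document. No DTD is attached, so the parser only
# checks that the markup is well-formed.
RML = """<recipe>
  <title>Pancakes</title>
  <ingredients>
    <li>flour</li>
    <li>milk</li>
    <text>Use fresh milk.</text>
  </ingredients>
</recipe>"""


def ingredients_match_model(ingredients):
    """Hand-rolled check for the content model ( li+, text? )."""
    tags = [child.tag for child in ingredients]
    # Strip one optional trailing <text> element.
    if tags and tags[-1] == "text":
        tags = tags[:-1]
    # What remains must be one or more <li> elements.
    return len(tags) >= 1 and all(tag == "li" for tag in tags)


root = ET.fromstring(RML)
ingredients = root.find("ingredients")
items = [li.text for li in ingredients.findall("li")]
model_ok = ingredients_match_model(ingredients)
print(items, model_ok)
```

A validating parser would perform this check automatically from the DTD; the manual version is shown only to make the meaning of + and ? concrete.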
Browsers will often recover from sloppily written or illegal HTML. This won’t be the case with XML documents. A client reading an XML document may be reading tags unique to that document and therefore can’t make assumptions about whether or not a tag should be closed. Every XML element must be closed.
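The strictness described above can be seen with any conforming XML parser. The following minimal sketch uses Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample strings are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, else False."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# <title> is never closed, so the parser rejects the document outright
# instead of guessing where the element ends.
unclosed = "<recipe><title>Pancakes</recipe>"
closed = "<recipe><title>Pancakes</title></recipe>"

print(is_well_formed(unclosed))
print(is_well_formed(closed))
```

Unlike a forgiving HTML browser, the parser does not attempt to repair the first document; it simply reports an error.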
As in HTML, XML tags cannot overlap. Overlapping tags look like this:
<Element1><Element2>This is content contained</Element1> in overlapping tags</Element2>
In the above example, it is unclear whether the text “This is content contained” belongs to Element1 or Element2. To avoid such confusion, an XML document cannot contain overlapping tags. The above example should be written like this:
<Element1><Element2>This is content contained</Element2></Element1> <Element2> in overlapping tags</Element2>
With this code, there is no question as to which tags or objects are contained within others.
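To make the containment explicit, the sketch below feeds both versions to Python's standard xml.etree.ElementTree parser and lists which element contains which. The <doc> wrapper is an assumption added for this example, because an XML document needs a single root element; the helper name parents_of is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Both samples are wrapped in a hypothetical <doc> root element.
overlapping = ("<doc><Element1><Element2>This is content contained"
               "</Element1> in overlapping tags</Element2></doc>")
nested = ("<doc><Element1><Element2>This is content contained"
          "</Element2></Element1>"
          "<Element2> in overlapping tags</Element2></doc>")


def parents_of(xml_text):
    """List (child tag, parent tag) pairs, grouped by parent element."""
    root = ET.fromstring(xml_text)
    return [(child.tag, parent.tag)
            for parent in root.iter()
            for child in parent]


# The overlapping version is not well-formed, so parsing fails.
try:
    parents_of(overlapping)
    overlap_rejected = False
except ET.ParseError:
    overlap_rejected = True

# The corrected version yields an unambiguous parent for every element.
pairs = parents_of(nested)
print(overlap_rejected, pairs)
```

In the corrected version the first Element2 sits inside Element1, while the second Element2 sits directly under the wrapper.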
Turning Existing HTML Documents into XML
Because HTML and XML are closely related, it isn’t difficult to make an HTML document XML-compliant. You basically have to make sure your HTML is “well-formed.”
• Replace the DOCTYPE declaration and any internal subset with the XML declaration.
Replace:
<!DOCTYPE HTML …> with:
<?xml version="1.0" standalone="yes"?>
• Change any empty elements such as <isindex>, <base>, <meta>, <img>, <br>, <hr>, or <spacer> so they end with />, for example:
<IMG SRC="this_photo.jpg" alt="Photo"/>
These elements may require some experimentation. For instance, some browsers treat </br> or </hr> the same as <br> or <hr>. Others will accept /> if there is a space before it, but not otherwise.
• Make sure that each nonempty element has a correctly matched end-tag; every <p> must have a </p>.
• Escape all markup characters (< and & should be written as &lt; and &amp;).
• Make sure all attribute values are in quotes.
• Ensure all element names match with respect to upper- and lowercase characters in both start and end tags and are consistent throughout the file.
• Ensure all attribute names are similarly in a consistent case throughout the file.
• Make sure there are no overlapping tags. Each tag should completely contain any tags within it.
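Several of the steps above (escaping markup characters, quoting attribute values, and closing empty elements) can be sketched with Python's standard library. The escape and quoteattr functions come from xml.sax.saxutils; the helper names img_tag and paragraph, the <body> wrapper, and the sample text are assumptions made up for this example, and the sketch does not cover a full HTML-to-XML conversion.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def img_tag(src, alt):
    """Build an empty <img/> element with quoted, escaped attribute values."""
    return "<img src=%s alt=%s />" % (quoteattr(src), quoteattr(alt))


def paragraph(text):
    """Wrap text in <p>...</p>, escaping markup characters in the text."""
    return "<p>%s</p>" % escape(text)


fragment = (
    "<body>"
    + paragraph("Salt & pepper, if weight < 5g")
    + img_tag("this_photo.jpg", "Photo")
    + "</body>"
)

# If the fragment were not well-formed, this call would raise ParseError.
root = ET.fromstring(fragment)
print(fragment)
print(root[0].text)
```

Parsing the result is a quick way to confirm well-formedness: the parser turns the escaped entities back into the original characters, so the text read from the <p> element matches the input.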