How to Read XML Files

From LMU BioDB 2013

To this point, you have been working with what are called “plain text” files and information — that is, information that is viewed as a simple sequence of symbols or characters (letters, numbers, punctuation, spaces, etc.), without any additional structure.

There are, however, other text “formats” that do impose a structure over the included data. One such format is called XML (short for eXtensible Markup Language). This page seeks to introduce you to this type of text information.

Overall Concept

The core idea behind XML data is that the information inside it can be thought of as an outline or tree. Our own wiki pages have outlines, in the form of either sections or bulleted lists:

  • Level 1, item 1
    • Level 2, item 1 (of level 1, item 1)
    • Level 2, item 2 (of level 1, item 1)
  • Level 1, item 2
  • Level 1, item 3
    • Level 2, item 1 (of level 1, item 3)
    • Level 2, item 2 (of level 1, item 3)
    • Level 2, item 3 (of level 1, item 3)
    • Level 2, item 4 (of level 1, item 3)

XML also captures an outline; it just looks different. Here’s an example:

 <organism key="2">
   <name type="scientific">Vibrio cholerae</name>
   <dbReference type="NCBI Taxonomy" key="3" id="666"/>
   <lineage>
     <taxon>Bacteria</taxon>
     <taxon>Proteobacteria</taxon>
     <taxon>Gammaproteobacteria</taxon>
     <taxon>Vibrionales</taxon>
     <taxon>Vibrionaceae</taxon>
     <taxon>Vibrio</taxon>
   </lineage>
 </organism>

This piece of XML breaks down, roughly, to this outline:

  • organism: key is "2"
    • name: type is "scientific", content is "Vibrio cholerae"
    • dbReference: type is "NCBI Taxonomy", key is "3", id is "666"
    • lineage
      • taxon: content is "Bacteria"
      • taxon: content is "Proteobacteria"
      • taxon: content is "Gammaproteobacteria"
      • taxon: content is "Vibrionales"
      • taxon: content is "Vibrionaceae"
      • taxon: content is "Vibrio"

Even now, you might already be seeing a pattern in terms of how the XML looks and what outline it represents. That’s one of the intentions of XML: it’s meant to strike a balance between human readability and machine readability. The “human readability” part manifests in recognizable words (“name,” “lineage,” “taxon”), while “machine readability” comes in through some special symbols and rules.
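
If you are curious how a program might recover that outline, here is a minimal sketch. It assumes Python and its built-in xml.etree.ElementTree module, neither of which this page requires; the outline function is a made-up helper for illustration only.

```python
import xml.etree.ElementTree as ET

ORGANISM_XML = """
<organism key="2">
  <name type="scientific">Vibrio cholerae</name>
  <dbReference type="NCBI Taxonomy" key="3" id="666"/>
  <lineage>
    <taxon>Bacteria</taxon>
    <taxon>Proteobacteria</taxon>
    <taxon>Gammaproteobacteria</taxon>
    <taxon>Vibrionales</taxon>
    <taxon>Vibrionaceae</taxon>
    <taxon>Vibrio</taxon>
  </lineage>
</organism>
"""


def outline(element, depth=0):
    """Return one outline line per tag, indented by nesting depth."""
    # Attributes first, in the order they appear in the tag
    parts = [f'{name} is "{value}"' for name, value in element.attrib.items()]
    # Then the text content, ignoring whitespace used only for indentation
    text = (element.text or "").strip()
    if text:
        parts.append(f'content is "{text}"')
    line = "  " * depth + "• " + element.tag
    if parts:
        line += ": " + ", ".join(parts)
    lines = [line]
    # Tags nested inside this one become the next outline level
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


outline_lines = outline(ET.fromstring(ORGANISM_XML))
print("\n".join(outline_lines))
```

The printed lines should match the bulleted outline above: the nesting of the tags is all the program needs in order to know the levels.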

Specific Parts

An XML file consists of three primary parts, each expressed in a very specific manner.

Tags

Tags serve to delineate the outlined “sections” of XML data. They serve the same function as bullets or section numbers in the more familiar outlines that we know. Whereas humans are generally capable of figuring out where a heading or bullet item starts and ends, computers need more help. Thus, tags also explicitly state when an XML “section” ends.

A tag consists of a name without spaces, “bookended” by less-than and greater-than symbols. There are three types of tags: start tags, end tags, and standalone tags.

Start tags look like this:

<tagName>

When you see a start tag, this means that a particular section has started; everything from that point up until the matching end tag belongs to that section.

End tags, then, look like this:

</tagName>

The main difference between a start and end tag is the inclusion of a slash symbol (/) right after the less-than sign. Start and end tags must pair up: that is, for every start tag, there must be a matching end tag later on. The tags are matched up based on the name in between.
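
The pairing rule is strict enough that software refuses to read XML that breaks it. The sketch below is one way to see this, assuming Python's built-in xml.etree.ElementTree module; is_well_formed is a hypothetical helper name, not part of any library.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Every start tag has a matching end tag, in the right order
print(is_well_formed("<gene><name>nutA</name></gene>"))  # True

# The end tags are swapped, so the pairs overlap instead of nesting
print(is_well_formed("<gene><name>nutA</gene></name>"))  # False

# The gene section never ends
print(is_well_formed("<gene><name>nutA</name>"))  # False
```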

The outline in the data can thus be inferred from what tags are within which tags — the technical term for this is nesting. Here’s another XML example:

 <name>5NTD_VIBCH</name>
 <protein>
   <recommendedName ref="1">
     <fullName>5'-nucleotidase</fullName>
   </recommendedName>
 </protein>
 <gene>
   <name type="primary">nutA</name>
   <name type="ordered locus">VC_2174</name>
 </gene>

Based on our description of tags so far, you can see that this particular piece of XML has three “top-level” sections: name, protein, and gene. To help humans perceive the outline, most XML files are indented wherever a tag starts inside another one. Thus, we see that name has no subsections, while protein has one subsection called recommendedName and gene has two name subsections.
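
The same reading of the outline can be sketched in code. This example assumes Python's built-in xml.etree.ElementTree module. Because a parser expects a single outermost tag, the fragment is wrapped in a made-up wrapper tag that is not part of the original data.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """
<name>5NTD_VIBCH</name>
<protein>
  <recommendedName ref="1">
    <fullName>5'-nucleotidase</fullName>
  </recommendedName>
</protein>
<gene>
  <name type="primary">nutA</name>
  <name type="ordered locus">VC_2174</name>
</gene>
"""

# <wrapper> is hypothetical; it only exists to give the parser one outermost tag
root = ET.fromstring("<wrapper>" + FRAGMENT + "</wrapper>")

# For each top-level section, list the tags nested directly inside it
subsections = {section.tag: [child.tag for child in section] for section in root}
print(subsections)
# {'name': [], 'protein': ['recommendedName'], 'gene': ['name', 'name']}
```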

Sometimes, tags need no further information than to just be there. In this case, a shortcut called a standalone tag can be used. Standalone tags look like this:

 <standaloneTagName/>

Note how the slash symbol is now right before the greater-than sign. A standalone tag is simply a shortcut for a section with nothing inside it; it means the same thing as a start tag followed right away by its end tag:

 <standaloneTagName>
 </standaloneTagName>

Hopefully you can see why this shortcut is considered pretty useful :)
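
As a quick check of that equivalence, the sketch below reads both spellings and compares what comes out. It assumes Python's built-in xml.etree.ElementTree module, which is just one of many tools that could be used here.

```python
import xml.etree.ElementTree as ET

standalone = ET.fromstring("<standaloneTagName/>")
paired = ET.fromstring("<standaloneTagName></standaloneTagName>")

# Both forms give a tag with the same name, no content, and no nested tags
for element in (standalone, paired):
    print(element.tag, element.text, len(element))
```

Both lines of output should be identical: once the file has been read, nothing distinguishes the two forms.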

Content

Tags by themselves hint at the structure of XML data, but not the actual information within. For example, here’s an XML representation for contact information:

<contact>
  <name></name>
  <email></email>
  <phone></phone>
  <address>
    <street></street>
    <city></city>
    <state></state>
    <zip></zip>
  </address>
  <birthday></birthday>
</contact>

As you can imagine, a full XML address book will contain many copies of this block; what differs from one copy to the next is the specific contact information inside. That information is referred to as the content of the tag(s). Essentially, content is any text that sits between a start tag and its matching end tag:

<contact>
  <name>Clark Kent</name>
  <email>ckent@dailyplanet.com</email>
  <phone>(555) 555-5555</phone>
  <address>
    <street>344 Clinton St., Apt. 3B</street>
    <city>Metropolis</city>
    <state>NY</state>
    <zip>12345</zip>
  </address>
  <birthday>June 1, 1938</birthday>
</contact>
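
Here is a small sketch of how a program might pull content out of this block. It assumes Python's built-in xml.etree.ElementTree module; the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

CONTACT_XML = """
<contact>
  <name>Clark Kent</name>
  <email>ckent@dailyplanet.com</email>
  <phone>(555) 555-5555</phone>
  <address>
    <street>344 Clinton St., Apt. 3B</street>
    <city>Metropolis</city>
    <state>NY</state>
    <zip>12345</zip>
  </address>
  <birthday>June 1, 1938</birthday>
</contact>
"""

contact = ET.fromstring(CONTACT_XML)

# The content of a tag is available as its text
contact_name = contact.find("name").text

# Nested tags are reached by following the outline, one level per slash
city = contact.find("address/city").text
zip_code = contact.findtext("address/zip")

print(contact_name)    # Clark Kent
print(city, zip_code)  # Metropolis 12345
```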

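Note that, in data files like these, text should not sit loose alongside other tags within the same section. XML itself permits this (the XML standard calls it mixed content), but it is poor practice for structured data, because the stray text does not belong to any named section. Avoid writing this:

<contact>
  <name>Bruce Wayne</name>
  owner of Wayne Enterprises
  <email>bwayne@wayneenterprises.com</email>
</contact>

Thus, in data-oriented XML, a tag holds either plain text content or other tags, but by convention not both.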
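
To see why loose text is troublesome in practice, consider what one parser does with the Bruce Wayne block. This sketch assumes Python's built-in xml.etree.ElementTree module. That parser does not reject the block; instead it files the loose text away as the "tail" of the tag that precedes it, where it is easy to overlook.

```python
import xml.etree.ElementTree as ET

MIXED_XML = """
<contact>
  <name>Bruce Wayne</name>
  owner of Wayne Enterprises
  <email>bwayne@wayneenterprises.com</email>
</contact>
"""

contact = ET.fromstring(MIXED_XML)
name = contact.find("name")

# The content of <name> is what you would expect
print(name.text)  # Bruce Wayne

# The loose text belongs to no tag; this parser hangs it after </name>
loose_text = name.tail.strip()
print(loose_text)  # owner of Wayne Enterprises

# Asking <contact> for its own text finds only indentation whitespace
contact_text = (contact.text or "").strip()
print(repr(contact_text))  # ''
```

Putting the text in its own tag (a hypothetical title tag, say) would give it a name and a definite place in the outline.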

Attributes

An alternative way to provide specific information in an XML file is through attributes. An attribute is a name="value" expression that is included inside a start or standalone tag:

 <phone withAreaCode="yes">(310) 338-5782</phone>
 <birthday format="mmddyyyy">01311970</birthday>

Attribute names, like tag names, cannot have spaces. Attribute values, in turn, must always be enclosed in quotes; double quotes (") are the usual choice, though single quotes (') are also allowed. An equals sign (=) sits between these components.

When should something be an attribute vs. content? There are no hard-and-fast rules. The general approach, though, is that an attribute is information about the content in the tag, while the content is, well, the information of the tag itself.
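
The sketch below illustrates that division of labor: the attribute tells a program how to interpret the content. It assumes Python's built-in xml.etree.ElementTree and datetime modules. The DATE_PATTERNS table is a made-up mapping for this example; nothing in XML defines what "mmddyyyy" means.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

phone = ET.fromstring('<phone withAreaCode="yes">(310) 338-5782</phone>')
birthday = ET.fromstring('<birthday format="mmddyyyy">01311970</birthday>')

# Attributes are looked up by name; the content is read separately
print(phone.get("withAreaCode"), phone.text)  # yes (310) 338-5782
print(phone.get("format"))                    # None, since phone has no such attribute

# Hypothetical mapping from the format attribute to a Python date pattern
DATE_PATTERNS = {"mmddyyyy": "%m%d%Y"}

# The attribute (information about the content) guides how the content is read
pattern = DATE_PATTERNS[birthday.get("format")]
birth_date = datetime.strptime(birthday.text, pattern).date()
print(birth_date)  # 1970-01-31
```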

The XML Schema

In this page, you’ve seen two types of XML examples: one that looks like it holds gene, protein, or organism information of some sort, and another one that looks like a typical address book. How do you know what the tags mean? This is where the XML schema comes in. An XML schema is a separate document that explains the tags and attributes for a particular type of XML document. We won’t go into too much detail about the XML schema at this point, but suffice it to say that such things exist, so that readers of a particular XML file have an authoritative source for what the tags and attributes within that file might mean.

When an XML document follows a particular schema, the schema is named at the top of the file, in the attributes of the outermost tag:

<uniprot xmlns="http://uniprot.org/uniprot"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://uniprot.org/uniprot http://www.uniprot.org/support/docs/uniprot.xsd">

Note that, like any other tag, this tag ends too, at the end of the XML file:

</uniprot>
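
For the curious, here is a sketch of how a program might find the schema information in that outermost tag. It assumes Python's built-in xml.etree.ElementTree module, and the document used here is deliberately empty apart from the uniprot tag itself.

```python
import xml.etree.ElementTree as ET

UNIPROT_XML = """
<uniprot xmlns="http://uniprot.org/uniprot"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://uniprot.org/uniprot http://www.uniprot.org/support/docs/uniprot.xsd">
</uniprot>
"""

root = ET.fromstring(UNIPROT_XML)

# This parser writes the xmlns value in braces in front of the tag name
print(root.tag)  # {http://uniprot.org/uniprot}uniprot

# The xsi: prefix is expanded the same way when looking up the attribute
XSI = "http://www.w3.org/2001/XMLSchema-instance"
schema_location = root.get(f"{{{XSI}}}schemaLocation")

# The value holds two items separated by a space: a namespace and a schema file
namespace, schema_url = schema_location.split()
print(namespace)   # http://uniprot.org/uniprot
print(schema_url)  # http://www.uniprot.org/support/docs/uniprot.xsd
```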

More to come on the schema; for now, it’s hoped that you can at least scan some XML information and get an idea of the outline that it provides.

The Concept’s the Thing

Recall that the L in XML stands for language—meaning that, yes, there are other languages that may be used to express the same information, in the same way that languages like English, Spanish, or Mandarin Chinese can say the same things, but with different sights and sounds. Similarly, there are other “formats” for communicating outlines. For example:

{
    "organism": {
        "key": "2",
        "name": {
            "type": "scientific",
            "text": "Vibrio cholerae"
        },
        "dbReference": {
            "type": "NCBI Taxonomy",
            "key": "3",
            "id": "666"
        },
        "lineage": [
            { "taxon": "Bacteria" },
            { "taxon": "Proteobacteria" },
            { "taxon": "Gammaproteobacteria" },
            { "taxon": "Vibrionales" },
            { "taxon": "Vibrionaceae" },
            { "taxon": "Vibrio" }
        ]
    }
}

Note how the outline you saw previously is recognizable here, even though it looks different. The point here is that these “formats” and “languages” are ultimately meant to express some idea or concept. The ultimate goal of any language or format is the accurate communication of ideas.

(and yes, the language above is real—it is JSON, short for JavaScript Object Notation)
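
To underline that the same outline survives the change of format, here is a sketch that reads the organism data from JSON. It assumes Python's built-in json module, and it writes every name in double quotes, as strict JSON requires.

```python
import json

ORGANISM_JSON = """
{
    "organism": {
        "key": "2",
        "name": { "type": "scientific", "text": "Vibrio cholerae" },
        "dbReference": { "type": "NCBI Taxonomy", "key": "3", "id": "666" },
        "lineage": [
            { "taxon": "Bacteria" },
            { "taxon": "Proteobacteria" },
            { "taxon": "Gammaproteobacteria" },
            { "taxon": "Vibrionales" },
            { "taxon": "Vibrionaceae" },
            { "taxon": "Vibrio" }
        ]
    }
}
"""

organism = json.loads(ORGANISM_JSON)["organism"]

# Walking the outline works level by level, just as with nested XML tags
scientific_name = organism["name"]["text"]
taxa = [entry["taxon"] for entry in organism["lineage"]]

print(scientific_name)  # Vibrio cholerae
print(taxa[0], taxa[-1])  # Bacteria Vibrio
```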
