How to Read JSON Files

From LMU BioDB 2017

To this point, you have been working with what are called “plain text” files and information — that is, information that is viewed as a simple sequence of symbols or characters (letters, numbers, punctuation, spaces, etc.), without any additional structure.

There are, however, other text “formats” that do impose a structure over the included data. One such format is called JSON (short for JavaScript Object Notation). This page seeks to introduce you to this type of text information.

Overall Concept

The core idea behind JSON data is that the information inside it can be thought of as an outline or tree. Our own wiki pages have outlines, in the form of either sections or bulleted lists:

  • Level 1, item 1
    • Level 2, item 1 (of level 1, item 1)
    • Level 2, item 2 (of level 1, item 1)
  • Level 1, item 2
  • Level 1, item 3
    • Level 2, item 1 (of level 1, item 3)
    • Level 2, item 2 (of level 1, item 3)
    • Level 2, item 3 (of level 1, item 3)
    • Level 2, item 4 (of level 1, item 3)

JSON also captures an outline; it just looks different. Here’s an example:

{
  "organism": {
    "key": "2",
    "name": {
      "type": "scientific",
      "text": "Vibrio cholerae"
    },
    "dbReference": {
      "type": "NCBI Taxonomy",
      "key": "3",
      "id": "666"
    },
    "lineage": [
      { "taxon": "Bacteria" },
      { "taxon": "Proteobacteria" },
      { "taxon": "Gammaproteobacteria" },
      { "taxon": "Vibrionales" },
      { "taxon": "Vibrionaceae" },
      { "taxon": "Vibrio" }
    ]
  }
}

This piece of JSON breaks down, roughly, to this outline:

  • The JSON is for a single object, represented by braces { }, that has one property, organism, which itself is an object
    • The organism object has a key property whose value is "2"
    • The name property is another object whose type is "scientific" and text is "Vibrio cholerae"
    • The dbReference property is an object whose type is "NCBI Taxonomy", key is "3", and id is "666"
    • lineage is a list of objects, indicated by the use of brackets [ ] rather than braces { }, where each object has a single taxon property…
      • taxon: "Bacteria"
      • taxon: "Proteobacteria"
      • taxon: "Gammaproteobacteria"
      • taxon: "Vibrionales"
      • taxon: "Vibrionaceae"
      • taxon: "Vibrio"

Even now, you might already be seeing a pattern in terms of how the JSON looks and what outline it represents. That’s one of the intentions of JSON: it’s meant to strike a balance between human readability and machine readability. The “human readability” part manifests in recognizable words (“name,” “lineage,” “taxon”), while “machine readability” comes in through some special symbols and rules.
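To make the correspondence between JSON and an outline concrete, here is a minimal Python sketch that parses the example and prints it back as an indented outline. It assumes the example is written as strict JSON (property names in double quotes), and the outline function is a hypothetical helper written for this page, not part of any library.

```python
import json

# The example organism record, written as strict JSON
# (property names in double quotes).
text = """
{
  "organism": {
    "key": "2",
    "name": { "type": "scientific", "text": "Vibrio cholerae" },
    "dbReference": { "type": "NCBI Taxonomy", "key": "3", "id": "666" },
    "lineage": [
      { "taxon": "Bacteria" },
      { "taxon": "Proteobacteria" },
      { "taxon": "Gammaproteobacteria" },
      { "taxon": "Vibrionales" },
      { "taxon": "Vibrionaceae" },
      { "taxon": "Vibrio" }
    ]
  }
}
"""


def outline(value, depth=0):
    """Return outline lines for a parsed JSON value (hypothetical helper)."""
    indent = "  " * depth
    lines = []
    if isinstance(value, dict):
        # A JSON object: one outline entry per property.
        for name, child in value.items():
            if isinstance(child, (dict, list)):
                lines.append(f"{indent}- {name}")
                lines.extend(outline(child, depth + 1))
            else:
                lines.append(f"{indent}- {name}: {child!r}")
    elif isinstance(value, list):
        # A JSON list: one outline entry per item, labeled by position.
        for index, item in enumerate(value):
            if isinstance(item, (dict, list)):
                lines.append(f"{indent}- item {index}")
                lines.extend(outline(item, depth + 1))
            else:
                lines.append(f"{indent}- {item!r}")
    return lines


data = json.loads(text)          # turn the JSON text into Python objects
outline_lines = outline(data)
print("\n".join(outline_lines))
```

Each level of nesting in the JSON becomes one more level of indentation in the printed outline, which is the pattern described above.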

Specific Parts

A JSON file consists of three primary parts (objects, properties, and lists), each expressed in a very specific manner. Of these parts, objects and lists can mix and match in any combination and to nearly any depth (e.g., lists within lists, objects within lists, lists within objects, objects within objects, objects within lists within objects, lists within objects within lists, etc.). Properties are strictly associated with objects.

Objects

Objects represent distinct, self-contained items or records of data. They begin with a left brace {, followed by the object’s properties (names and values), and end with a right brace }. Whereas humans are generally capable of figuring out where a piece of data starts and ends, computers need more help. Thus, every opening { has a matching closing }.

Computers don’t care about spacing within JSON, but humans can read JSON a lot more easily with proper spacing. For human consumption, the braces of an object are typically on their own lines (as above), with the properties indented by a couple of spaces from the braces.
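As a rough illustration of both points (spacing is ignored, but braces must match), the following Python sketch uses the standard json module. The variable names are made up for this example, and the data is a trimmed-down piece of the organism record.

```python
import json

# The same data twice: once with no spacing at all, once laid out for humans.
compact = '{"organism":{"key":"2","name":{"type":"scientific","text":"Vibrio cholerae"}}}'
spaced = """
{
  "organism": {
    "key": "2",
    "name": {
      "type": "scientific",
      "text": "Vibrio cholerae"
    }
  }
}
"""

# Spacing makes no difference to the parsed result.
same_data = json.loads(compact) == json.loads(spaced)
print(same_data)

# json.dumps with indent=2 re-creates the human-friendly layout.
pretty = json.dumps(json.loads(compact), indent=2)
print(pretty)

# Leaving out a closing brace, however, makes the text unreadable to the parser.
try:
    json.loads('{"organism": {"key": "2"}')
    unbalanced_ok = True
except json.JSONDecodeError as error:
    unbalanced_ok = False
    print("Could not parse:", error)
```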

Properties

Objects by themselves are actually quite meaningless; there is nothing to say about just an object in its own right. What makes an object useful are its properties: named values that are associated with the object. Each property is expressed within the braces as its name, followed by a colon :, followed by its value. Strict JSON requires the name to be enclosed in double quotes; JavaScript code and some tools that display JSON allow the quotes to be left off. Commas , separate properties from each other.

In the example above, the single main object has a single property called organism, which is itself an object. The organism object in turn has four properties: key, name, dbReference, and lineage.

Properties can be indicated via "dot notation" (where the dot is none other than a period .). Thus, if we represent the object above as some variable obj, the organism property of that object is obj.organism. The key property within that object is then obj.organism.key.
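Dot notation is how JavaScript writes property access; other languages express the same idea differently. As a sketch, assuming the example is stored as strict JSON text, Python's standard json module turns each object into a dictionary, so obj.organism.key is written with brackets and quoted names instead of dots:

```python
import json

# The example record as strict JSON text (property names in double quotes).
text = """
{
  "organism": {
    "key": "2",
    "name": { "type": "scientific", "text": "Vibrio cholerae" },
    "dbReference": { "type": "NCBI Taxonomy", "key": "3", "id": "666" },
    "lineage": [
      { "taxon": "Bacteria" }, { "taxon": "Proteobacteria" },
      { "taxon": "Gammaproteobacteria" }, { "taxon": "Vibrionales" },
      { "taxon": "Vibrionaceae" }, { "taxon": "Vibrio" }
    ]
  }
}
"""

obj = json.loads(text)

organism = obj["organism"]                      # obj.organism
key = obj["organism"]["key"]                    # obj.organism.key
name_text = obj["organism"]["name"]["text"]     # obj.organism.name.text

# The names of the organism object's properties, in file order.
property_names = list(organism)

print(key)
print(name_text)
print(property_names)
```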

Lists

Lists or arrays represent collections of objects or values. They are denoted by brackets [ ] with commas , separating the items in the list. Most lists are meant to contain items of the same type or structure, but JSON does not actually require that. Lists can have a number for the first item and an object for the next; that is permitted. But practically speaking, lists tend to hold uniformly-typed or -structured members.
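A small sketch of the "permitted but unusual" case: the list below is made up for illustration and mixes a number, an object, a string, and another list. Python's json module accepts it without complaint.

```python
import json

# A list whose items are deliberately of different types (illustrative data).
mixed = json.loads('[666, {"taxon": "Bacteria"}, "Vibrio", [1, 2]]')

# Record what kind of Python value each item became.
kinds = [type(item).__name__ for item in mixed]
print(kinds)
```

Code that reads such a list has to check what each item is before using it, which is one practical reason lists tend to be kept uniform.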

When talking about the contents of a list, it is sometimes convenient to refer to them by their position in the list (e.g., first slot, second slot, fifth slot, etc.). By convention, positions in a JSON list are counted starting at 0, and that number is referred to as the index of an item in the list. Thus, in the JSON example above, the item at index 2 of the lineage list is the object whose taxon property is "Gammaproteobacteria". To combine property notation with indexing, we enclose the index in brackets [ ] after the name of the list. Thus, continuing the example of using obj to represent the example JSON object above, the "Gammaproteobacteria" object is obj.organism.lineage[2]. Spoken in full, that means "the item at index 2 of the lineage property of the organism property of the object called obj."
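Indexing can be tried out directly. The sketch below assumes the lineage part of the example is available as strict JSON text. Python counts list positions from 0 as well, so the index carries over unchanged, while the property names are again written in brackets rather than with dots.

```python
import json

# Only the lineage portion of the example is needed here.
text = """
{
  "organism": {
    "lineage": [
      { "taxon": "Bacteria" },
      { "taxon": "Proteobacteria" },
      { "taxon": "Gammaproteobacteria" },
      { "taxon": "Vibrionales" },
      { "taxon": "Vibrionaceae" },
      { "taxon": "Vibrio" }
    ]
  }
}
"""

obj = json.loads(text)
lineage = obj["organism"]["lineage"]

third = lineage[2]                 # obj.organism.lineage[2]
print(third)                       # {'taxon': 'Gammaproteobacteria'}
print(lineage[0]["taxon"])         # index 0 is the first item: Bacteria

# Walk the whole list, showing each index next to its taxon.
for index, item in enumerate(lineage):
    print(index, item["taxon"])
```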

The Concept’s the Thing

Recall that the N in JSON stands for notation—meaning that, yes, there are other notations that may be used to express the same information, in the same way that languages like English, Spanish, or Mandarin Chinese can say the same things, but with different sights and sounds. Similarly, there are other “formats” for communicating outlines. For example:

 <organism key="2">
   <name type="scientific">Vibrio cholerae</name>
   <dbReference type="NCBI Taxonomy" key="3" id="666"/>
   <lineage>
     <taxon>Bacteria</taxon>
     <taxon>Proteobacteria</taxon>
     <taxon>Gammaproteobacteria</taxon>
     <taxon>Vibrionales</taxon>
     <taxon>Vibrionaceae</taxon>
     <taxon>Vibrio</taxon>
   </lineage>
 </organism>

Note how the outline you saw previously is recognizable here, even though it looks different. The point here is that these “formats” and “languages” are ultimately meant to express some idea or concept. The ultimate goal of any language or format is the accurate communication of ideas.

(And yes, the language above is real: it is XML, short for eXtensible Markup Language.)
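One way to check that the two notations carry the same information is to read both and compare what comes out. This is only a sketch: it assumes the JSON is written in strict form and uses Python's standard json and xml.etree.ElementTree modules, and the variable names are invented for this example.

```python
import json
import xml.etree.ElementTree as ET

json_text = """
{
  "organism": {
    "key": "2",
    "name": { "type": "scientific", "text": "Vibrio cholerae" },
    "dbReference": { "type": "NCBI Taxonomy", "key": "3", "id": "666" },
    "lineage": [
      { "taxon": "Bacteria" }, { "taxon": "Proteobacteria" },
      { "taxon": "Gammaproteobacteria" }, { "taxon": "Vibrionales" },
      { "taxon": "Vibrionaceae" }, { "taxon": "Vibrio" }
    ]
  }
}
"""

xml_text = """
<organism key="2">
  <name type="scientific">Vibrio cholerae</name>
  <dbReference type="NCBI Taxonomy" key="3" id="666"/>
  <lineage>
    <taxon>Bacteria</taxon>
    <taxon>Proteobacteria</taxon>
    <taxon>Gammaproteobacteria</taxon>
    <taxon>Vibrionales</taxon>
    <taxon>Vibrionaceae</taxon>
    <taxon>Vibrio</taxon>
  </lineage>
</organism>
"""

json_organism = json.loads(json_text)["organism"]
xml_organism = ET.fromstring(xml_text.strip())

# The organism's scientific name, read from each notation.
json_name = json_organism["name"]["text"]
xml_name = xml_organism.find("name").text

# The lineage as a plain list of taxon names, read from each notation.
json_lineage = [item["taxon"] for item in json_organism["lineage"]]
xml_lineage = [taxon.text for taxon in xml_organism.find("lineage")]

print(json_name == xml_name)
print(json_lineage == xml_lineage)
```

The code that reads each notation looks different, but the name and lineage it recovers are identical.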