6 Validation and Content references
This section explains the validation rules of EML. While most of the validation rules are expressed as constraints within the XML Schema definition files, some rules cannot be written directly into the XML Schemas or enforced by an XML parser. These additional validation rules MUST be satisfied by every EML package in order for it to be considered EML-compliant.
6.1 Validation rules
For a document to be EML-valid, all of the following constraints must hold true:
- The document MUST validate using a compliant XML Schema validating parser
- All EML documents MUST have the ‘eml’ module as the root
- A packageId attribute MUST be present on the root eml element
- All id attributes within the document MUST be unique
- Elements which contain an annotation child element MUST contain an id attribute, unless that annotation element contains a references attribute
- If an element references another using a child references element, another element with that value in its id attribute MUST exist in the document
- When references is used, the system attribute MUST have the same value in both the target and source elements, or it must be absent in both. Frequently it is absent in both.
- If an element references another using a child references element, it MUST NOT have an id attribute itself
- If an additionalMetadata element references another using a child describes element, another element with that value in its id attribute MUST exist in the document
- If a customUnit is referenced in a document, it must have a corresponding STMML unit definition in the document with a matching (???)
6.2 Validation algorithm
One reasonable algorithm for assessing these constraints without loading the XML into
a DOM structure checks the id and references fields while
parsing the document and stores their values in the identifiersHash and referencesHash data
structures, which are then used for the final consistency check. For example, in pseudocode:
- Parse the XML document using an XML Schema-compliant parser
  - If the root element is not eml, then the document is invalid
  - For each element, record whether it has an id attribute or not
    - If an element does not contain an id, but it has a child annotation element, and that child annotation does not contain a references attribute, then the document is invalid
  - For each id attribute
    - If id is not in identifiersHash then add it as the key of identifiersHash, with its system as the value
    - If id is already in identifiersHash then the document is invalid
    - If the element containing the id contains a references element as an immediate child then the document is invalid
  - For each references element
    - If the references key is not in referencesHash, then add it as a key with the system value to referencesHash
    - If the references key is in referencesHash, but the current system value does not match the value for that key, then the document is invalid
  - For each references attribute on an annotation element
    - If the references key is not in referencesHash, then add it as a key with the empty string '' value to referencesHash
  - For each describes element within an additionalMetadata element
    - If the describes key is not in referencesHash, then add it as a key with the empty string '' value to referencesHash
  - For each customUnit element
    - If the customUnit key is not in unitsHash, then add it as a key to unitsHash
  - Once document processing is complete, for each unit in unitsHash
    - If !identifiersHash.hasKey(unit) then the document is invalid
  - Once document processing is complete, for each key in referencesHash
    - If !identifiersHash.hasKey(key) OR referencesHash[key] != identifiersHash[key] then the document is invalid
  - If no validity errors are found above or by the parser, then the document is valid
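The pseudocode above could be sketched in Python roughly as follows. This is a minimal illustration, not the validator shipped with EML: the names check_eml_references and local_name are hypothetical, XML Schema validation is left out entirely (the standard library has no schema validator), and the sketch assumes that the system value of a reference is read from the references element itself.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def check_eml_references(xml_text):
    """Return a list of error messages; an empty list means every check passed."""
    errors = []
    identifiers = {}  # id value -> system value ('' when the attribute is absent)
    references = {}   # referenced id -> system value recorded at first use
    units = set()     # customUnit names that need a matching definition
    stack = []        # currently open elements, innermost last

    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        name = local_name(elem.tag)
        if event == "start":
            if not stack:  # the first element seen is the root
                if name != "eml":
                    errors.append("root element is not eml")
                if "packageId" not in elem.attrib:
                    errors.append("root element has no packageId")
            if "id" in elem.attrib:
                ident = elem.attrib["id"]
                if ident in identifiers:
                    errors.append(f"duplicate id: {ident}")
                else:
                    identifiers[ident] = elem.attrib.get("system", "")
            if name == "annotation":
                if "references" in elem.attrib:
                    references.setdefault(elem.attrib["references"], "")
                elif stack and "id" not in stack[-1].attrib:
                    errors.append("annotated element has no id")
            stack.append(elem)
            continue

        # 'end' event: the element's text is complete and its parent is on top.
        stack.pop()
        parent = stack[-1] if stack else None
        key = (elem.text or "").strip()
        if name == "references" and parent is not None:
            if "id" in parent.attrib:
                errors.append(f"element with id {parent.attrib['id']} also has references")
            # Assumption: the source-side system value sits on <references>.
            system = elem.attrib.get("system", "")
            if references.setdefault(key, system) != system:
                errors.append(f"conflicting system values for reference {key}")
        elif name == "describes" and parent is not None \
                and local_name(parent.tag) == "additionalMetadata":
            references.setdefault(key, "")
        elif name == "customUnit":
            units.add(key)
        elem.clear()  # release content that is no longer needed

    # Final consistency checks once the whole document has been seen.
    for unit in sorted(units):
        if unit not in identifiers:
            errors.append(f"customUnit {unit} has no matching definition")
    for key, system in references.items():
        if key not in identifiers or identifiers[key] != system:
            errors.append(f"reference {key} has no matching id and system")
    return errors
```

Calling check_eml_references on the text of an EML document returns an empty list when none of these checks fail. Keeping a stack of open elements is what lets the sketch look at an element's parent while streaming, without holding a full DOM.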
6.3 Content references
Each EML module, with the exception of “eml” itself, has a top level
choice between the structured content of that element or a
“references” field. This enables the reuse of content previously
defined elsewhere in the document. This allows, for example, an author to
create a single <creator id='m.jones'> element with all of its child detail,
and then reference that as <contact><references>m.jones</references></contact>
to indicate that the same person is both the creator and contact. This creates
an unambiguous linkage via the id field that the two elements refer to the same
entity, in this case a person, and avoids having to re-enter the same information
multiple times in the document. Another common location for re-use is when a single
attributeList is defined with a set of variables and their metadata, and then
that list is referenced in multiple dataTable elements to show that they are
structured identically.
The reuse of structured content is accomplished through the
use of id/references pairs. Each element that is to be reused will contain a
unique id attribute on the element. Because this identifier is guaranteed to
be unique within the EML document, any other location that wants to point at that
content can do so using the references element, as shown in the example above.
These types of references can also be used in the references attribute of
annotation elements, and in the describes element within the additionalMetadata
element.
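As an illustration of how a consumer might follow an id/references pair, the sketch below builds an index of id values and resolves a contact back to the creator it points at. The helper name resolve, the trimmed-down document, and the surName value are hypothetical and exist only for this example.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment: the contact reuses the creator's content by reference.
DOC = """<dataset id="ds.1">
  <creator id="m.jones">
    <individualName><surName>Jones</surName></individualName>
  </creator>
  <contact><references>m.jones</references></contact>
</dataset>"""


def resolve(elem, by_id):
    """Return the element that elem points to, or elem itself if it holds its own content."""
    ref = elem.find("references")
    if ref is None:
        return elem
    return by_id[ref.text.strip()]


root = ET.fromstring(DOC)
# Index every element that carries an id attribute.
by_id = {e.get("id"): e for e in root.iter() if e.get("id") is not None}
contact = resolve(root.find("contact"), by_id)
print(contact.tag, contact.findtext("individualName/surName"))  # prints "creator Jones"
```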
If an id attribute is provided for content, then that content is considered
to represent a different entity than all other elements that are defined in
the document, except for those that include its id in the references child.
This is useful to indicate, for example, that two people with similar names
(e.g., “D. Clark” and “D. Clark”) are in fact distinct individuals
(e.g., “Deborah Clark” and “David Clark”), or that two variables with the same
attributeName are in fact different variables. While it would be bad practice
to reuse attribute names like this, it does happen and EML needs to be able to document it
when it does.
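One way to picture this identity rule is to group elements by the entity they stand for: an element with an id is its own entity, and an element with a references child belongs to the entity it points at. The sketch below assumes a trimmed-down document and a hypothetical helper entity_key; elements with neither an id nor a references child are simply skipped, since the text above says nothing about them.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Two creators share a surName but carry different ids, so they are distinct.
DOC = """<dataset id="ds.1">
  <creator id="23445"><individualName><surName>Clark</surName></individualName></creator>
  <creator id="23446"><individualName><surName>Clark</surName></individualName></creator>
  <contact><references>23446</references></contact>
</dataset>"""


def entity_key(elem):
    """Return the id that identifies the entity elem describes, or None."""
    ref = elem.find("references")
    if ref is not None:
        return ref.text.strip()
    return elem.get("id")


groups = defaultdict(list)
for elem in ET.fromstring(DOC):  # direct children of dataset, in document order
    key = entity_key(elem)
    if key is not None:
        groups[key].append(elem.tag)

print(dict(groups))  # {'23445': ['creator'], '23446': ['creator', 'contact']}
```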
6.4 EML Validity Parser
Because some of these rules cannot be enforced in XML-Schema, we have
written a parser which checks the validity of the references and ids
used in a document. This parser is included with the release of
EML. To run the parser, you must have Java installed. To execute
it, change into the top-level directory of the EML release and run the
‘validate.sh’ script, passing your EML instance file as a parameter.
You can also validate your EML document from R, using the EML::validate() function.
There may also be an online
version of this parser, which
is publicly accessible. The validator will both validate your XML
document against the schema as well as check the integrity of your
references.
6.5 id and Scope Examples
Example: Invalid EML due to duplicate identifiers
<?xml version="1.0"?>
<eml:eml
    packageId="eml.1.1" system="knb"
    xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="https://eml.ecoinformatics.org/eml-2.2.0 eml.xsd">
  <dataset id="ds.1">
    <title>Sample Dataset Description</title>
    <!-- the two creators have the same id. this should be an error -->
    <creator id="23445" scope="document">
      <individualName>
        <surName>Smith</surName>
      </individualName>
    </creator>
    <creator id="23445" scope="document">
      <individualName>
        <surName>Smith</surName>
      </individualName>
    </creator>
    ...
  </dataset>
</eml:eml>
This instance document is invalid because both creator elements have the same id. No two elements can have the same string as an id.
Example: Invalid EML due to a non-existent reference
<?xml version="1.0"?>
<eml:eml
    packageId="eml.1.1" system="knb"
    xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="https://eml.ecoinformatics.org/eml-2.2.0 eml.xsd">
  <dataset id="ds.1">
    <title>Sample Dataset Description</title>
    <creator id="23445" scope="document">
      <individualName>
        <surName>Smith</surName>
      </individualName>
    </creator>
    <creator id="23446" scope="document">
      <individualName>
        <surName>Myer</surName>
      </individualName>
    </creator>
    ...
    <contact>
      <references>23447</references>
    </contact>
  </dataset>
</eml:eml>
This instance document is invalid because the contact element references
an id that does not exist. Any referenced id must exist in the document.
Example: Invalid EML due to a conflicting id attribute and a <references> element
<?xml version="1.0"?>
<eml:eml
    packageId="eml.1.1" system="knb"
    xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="https://eml.ecoinformatics.org/eml-2.2.0 eml.xsd">
  <dataset id="ds.1">
    <title>Sample Dataset Description</title>
    <creator id="23445" scope="document">
      <individualName>
        <surName>Smith</surName>
      </individualName>
    </creator>
    <creator id="23446" scope="document">
      <individualName>
        <surName>Meyer</surName>
      </individualName>
    </creator>
    ...
    <contact id="522">
      <references>23445</references>
    </contact>
  </dataset>
</eml:eml>
This instance document is invalid because the contact element both references another element and has an id itself. If an element references another element, it may not have an id. This prevents circular references.
Example: A valid EML document
<?xml version="1.0"?>
<eml:eml
    packageId="eml.1.1" system="knb"
    xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="https://eml.ecoinformatics.org/eml-2.2.0 eml.xsd">
  <dataset id="ds.1">
    <title>Sample Dataset Description</title>
    <creator id="23445" scope="document">
      <individualName>
        <surName>Smith</surName>
      </individualName>
    </creator>
    <creator id="23446" scope="document">
      <individualName>
        <surName>Smith</surName>
      </individualName>
    </creator>
    ...
    <contact>
      <references>23446</references>
    </contact>
    <contact>
      <references>23445</references>
    </contact>
  </dataset>
</eml:eml>
This instance document is valid. Each contact references one of the
creators above, and all the ids are unique. The fact that each creator has its own id
indicates that they are different people, even though they have the same
surName and there is no other distinguishing metadata.