Article: The SML file format: XML to SML

Providing files for the initialization or configuration of a framework is not a trivial task. The framework may have multiple plug-ins whose configuration options should be stored in the same files as the rest of the configuration. However, the number, structure, type and names of these files are unknown at the time of development. The plug-ins the framework will work with possibly do not even exist at this point. To still be able to provide a practical DTD or schema in advance under these restrictions, the logical solution is to fall back on the natural properties of the variables used. Stable properties, such as the data type of a variable, can be used to create a structure that is predictable because it is dictated by the syntactic properties of PHP.

Because PHP is not strictly typed, and because references and objects may by definition not be part of this configuration, the task is reduced to scalar variables and arrays (non-recursive and recursive data structures). Thus two kinds of nodes are defined. The tag "scalar" is introduced to represent scalar values. It may only contain a CDATA section, which matches a scalar context, but no other tags or elements. The tag "array" is introduced to represent array values. It may not contain CDATA sections, but tags only. These tags represent the items of the array. Since XML explicitly expects a single root element, the tag "root" is introduced. No further tags are required.

<?xml version="1.0" ?>
<!DOCTYPE root SYSTEM "example.dtd">
<root>
	<array name="a">
		<array name="0">
			<scalar name="a"><![CDATA[value]]></scalar>
			<scalar name="b"><![CDATA[value]]></scalar>
			<scalar name="c"><![CDATA[value]]></scalar>
		</array>
		<array name="1">
			<scalar name="a"><![CDATA[value]]></scalar>
			<scalar name="c"><![CDATA[value]]></scalar>
		</array>
	</array>
	<array name="b">
		<array name="0">
			<scalar name="q"><![CDATA[value]]></scalar>
			<scalar name="r" />
			<scalar name="s"><![CDATA[value]]></scalar>
			<scalar name="t"><![CDATA[value]]></scalar>
		</array>
	</array>
</root>

This example reveals some characteristics. First of all, the element "root" is completely without semantic meaning. It exists only because the syntax requires it.

The empty tags may also be considered somewhat artificial constructs. They are included, but it is not completely clear how to interpret them with regard to the variables they represent. PHP does not require variables to be defined prior to their first use. So initializing a non-existing variable with an "undefined" value, even with the constant "null", doesn't make much sense, particularly since a "null" value has no semantic value at all in a configuration file. Therefore it is defined by convention that empty tags should be avoided. Any access to empty or nonexistent tags will be converted to the boolean value "false", as implied by common PHP conventions.
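The convention for empty and nonexistent tags can be illustrated with a short sketch. It is written in Python rather than PHP purely for illustration, and the function names "to_value", "load" and "lookup" are assumptions of this sketch, not part of any existing implementation. Arrays become dictionaries, scalars become strings, the "root" element is skipped, and anything empty or missing is reported as false.

```python
import xml.etree.ElementTree as ET

# A shortened version of the example document above.
SAMPLE = """<root>
    <array name="b">
        <array name="0">
            <scalar name="q"><![CDATA[value]]></scalar>
            <scalar name="r" />
        </array>
    </array>
</root>"""


def to_value(node):
    """Convert an "array" or "scalar" element into a dict or a string."""
    if node.tag == "scalar":
        # Assumption of this sketch: an empty scalar tag maps to False.
        return node.text if node.text else False
    if node.tag == "array":
        return {child.get("name"): to_value(child) for child in node}
    raise ValueError(f"unexpected tag: {node.tag}")


def load(xml_text):
    """Read a document; the "root" element itself carries no meaning."""
    root = ET.fromstring(xml_text)
    return {child.get("name"): to_value(child) for child in root}


def lookup(config, *keys):
    """Follow a path of keys; anything empty or missing yields False."""
    current = config
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return False
        current = current[key]
    return current


config = load(SAMPLE)
print(lookup(config, "b", "0", "q"))  # value
print(lookup(config, "b", "0", "r"))  # False (empty tag)
print(lookup(config, "b", "1", "x"))  # False (nonexistent tag)
```

Since an empty tag and a missing tag give the same result, the empty tag carries no information of its own, which is the reason for avoiding it.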

Furthermore, it is foreseeable that this representation could be confusing if used to store larger amounts of data. For example, if a data field contains 50 or more items, which may also be nested and may contain large CDATA sections with a number of line breaks, readability is likely to suffer. It would therefore be an advantage if the viewer could tell from the end tag which item is being closed.

The separate marking of CDATA sections could also affect readability, especially since the purpose of this syntax may not be obvious to an inexperienced layman. In addition, the definition made earlier explicitly demands that nothing other than a CDATA section may be used at this point anyway, so this syntax adds no real value. There is one very exceptional case, namely that the CDATA section contains a closing "scalar" tag. However, this case can easily be handled by the use of entities, which makes the whole syntax obsolete.
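That entities make the CDATA marking redundant can be checked with a standard XML parser. The following is a minimal sketch using Python's xml.etree.ElementTree; the variable names and the sample value are made up for illustration. Both spellings of the same scalar are parsed and their text content is compared.

```python
import xml.etree.ElementTree as ET

# The same value written once as a CDATA section and once with entities.
with_cdata = '<scalar name="a"><![CDATA[1 < 2 </scalar> & more]]></scalar>'
with_entities = '<scalar name="a">1 &lt; 2 &lt;/scalar&gt; &amp; more</scalar>'

text_cdata = ET.fromstring(with_cdata).text
text_entities = ET.fromstring(with_entities).text

# The parser hands back identical character data in both cases.
print(text_cdata == text_entities)  # True
print(text_entities)                # 1 < 2 </scalar> & more
```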

Next, the focus turns to the tags themselves. Take, for example, the text "<array name="b">". This text may seem unnecessarily complex to a layman, since its structure already makes it obvious that this element has to be an array. The tag "scalar" could not even be used at this point, since it may not contain other tags but only CDATA. There is no extra value in stressing that this is an (empty) array. The attribute "name" is common to all tags. This derives directly from the syntactic properties of PHP, since the attribute represents the identifier of a variable in PHP. So it would be more intuitive to use these identifiers to name the tags instead of the more generic names "array" and "scalar". The same argument can be used against the tag "scalar".

In light of what has been discussed so far, the representation can be simplified.

<a>
	<0>
		<a>value</a>
		<b>value</b>
		<c>value</c>
	</0>
	<1>
		<a>value</a>
		<c>value</c>
	</1>
</a>
<b>
	<0>
		<q>value</q>
		<s>value</s>
		<t>value</t>
	</0>
</b>

Even without extensive explanation, it is apparent that both variants represent the same content. Similarly, it should hardly be necessary to discuss which of the two syntaxes is more intuitive to read.

However, because of its syntactic properties, this variant is not a well-formed XML document. For example, there is no single root element, and "0" is not even a valid name for a tag in XML. It was therefore necessary to choose a different name. This name is "SML" for "simple markup language", in allusion to "XML" ("eXtensible markup language").
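Both objections can be confirmed with any conforming XML parser. The sketch below uses Python's xml.etree.ElementTree; the helper "is_well_formed" and the shortened sample strings are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Two top-level elements: there is no single root element.
two_roots = "<a><x>value</x></a><b><q>value</q></b>"
# A tag named "0": XML names may not start with a digit.
numeric_name = "<a><0><a>value</a></0></a>"
# The original representation with "root", "array" and "scalar".
xml_variant = '<root><array name="a"><scalar name="x">value</scalar></array></root>'

print(is_well_formed(two_roots))     # False
print(is_well_formed(numeric_name))  # False
print(is_well_formed(xml_variant))   # True
```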