|
|
Developed
at the Polish-Japanese Institute of Information
Technology © Copyright by ODRA team, © Copyright by PJIIT |
|
|
ODRA – Object
Database for Rapid Application development Description and Programmer Manual |
|
|
by Krzysztof Kaczmarski and the ODRA team |
|
12. XML Importer and Exporter12.1 IntroductionAn XML document consists of three
basic elements: named nodes, text nodes and node attributes. The example
below shows them in a simple XML document. <Person
id=”0000123”>
<Name>John</Name>
<Surname>Smith</Surname>
<Phone type=”mobile”>+48-888-88-00-12</Phone>
<Info>Likes <animal>cats</animal> but not <animal>dogs</animal>
</Info> </Person> Named nodes: Person, Info, Name, Surname, Phone, animal Text nodes: John, Smith, +48-888-88-00-12, Likes, cats, but not, dogs Node
attributes: id=” The above terminology will
be used in this document. Please note the following problems with XML
structures when imported into an object database and addressed in a query
language. 1) Named
nodes may be simple, single valued nodes or complex nodes (containing other
nodes). If some nodes are optional then the type of a containing node may
vary on presence of the optional part. Consider for example the Phone node.
If the type attribute is present then phone must be a complex object, it
clearly contains two subobjects: type (mobile) and string (+48-888-88-00-12).
If type is not present then the phone object is to be perceived as a simple
type string object. 2) Similarly
an object may be seen as a complex one if an optional subobject is present.
For example, if someone does not use animal tags in the info object then the
info object may be understood as a simple string object. 3) Objects
may contain nameless subobjects. Let us consider Info node. If Info is the
name for the whole node than we have no name for “Likes” and “but
not” text nodes. Objects having no name cannot be used in queries. Some
query languages solves this problem by simply enumerating subobjects. These problems of
unpredictably changing structure of objects may result in queries that will
change semantics and results in a way hard to understand. For example let us consider
the following XML input data: <Person
id=”0000123”>
<Name>John</Name>
<Surname>Smith</Surname>
<Phone type=”mobile”>+48-888-88-00-12</Phone>
<Phone>+48-22-234-22-22</Phone> </Person> Here, the first Phone object seems to be complex
object, while the second one may be seen as a simple object. If the type
attribute is optional then the structure and semantics of the Phone objects may
change unexpectedly. Another problem appears if someone adds type attribute
to the second Phone object. It must become a complex object. Our XML importer is designed
to avoid these problems. If fact, there are two problems to be solved when
dealing with XML input: attributes and nameless object content. The XML
import procedure offers two ways of solving these problems. 12.1.1 Naming
the nameless content
As it was shown in the previous section, sometimes XML tags may
contain data that is nameless from the object database’s point of view.
What’s even more confusing sometimes character of data may change when
container object changes but the data itself is not touched. Let us recall the previous example: <Phone>+48-888-88-00-12</Phone> The
phone element may be seen as a simple string object named Phone. But, if
someone alters the Phone tag by adding an attribute: <Phone
type=”mobile”>+48-888-88-00-12</Phone> then phone becomes a complex object because it has two properties:
type and nameless value. Now a user must access the same object in different
way. Therefore XML Importer always treats all objects as complex objects.
Simple type values are always stored in subobjects named _VALUE. So the phone element before the modification is perceived as if it
contains another element named _VALUE: <Phone><_VALUE>+48-888-88-00-12</VALUE></Phone> The
second one looks similarly: <Phone
type=”mobile”><_VALUE>+48-888-88-00-12</VALUE></Phone> In both
cases a user has got the same access path to the phone number value: Phone._VALUE; 12.1.2 Accessing attributes
The
main problem with the elements’ attributes is that they must be
properly treated when exporting XML data. If an object has been created from
an attribute it should also result in the attribute creation when exported to
XML. Hence those objects must be distinguished from normal objects. XML
Importer uses two independently exclusive solutions to this problem: addition
of ‘@’ prefix to an object’s name and attaching an
annotation to imported attribute object. 12.1.3 ‘@’ attribute prefix
In this
case the phone object from the previous section will be imported as if it
looks like this: <Phone> </Phone> Please
note that now a user must use name @type to
access the attribute subobject value while accessing the main value of the
object remains the same. This
kind of import procedure may be executes using the M0 importer option. Please
refer to importing XML using CLI command line section. 12.1.4 Attribute annotations
Another way to mark object coming from attributes is to attach an
annotation. Annotations are hidden from the user and may be used and created
only by the system. However, they are recognized when an object is exported
to XML and the expected XML data format is produced. An example from the previous section is loaded as if it looks like
this: <Phone> </Phone> You may observe that all objects are treated in the same way. All
simple type values are accessed using the same uniform construct: Phone.type._VALUE; Phone._VALUE; Importing XML with annotations is
the default option for the importer when executed in the CLI command line.
Please refer to the proper section for more details. 12.1.5 XML namespacesXML elements may be equipped with
namespace information. Generally it means that an element’s name is
preceded by a namespace: <addr:Phone>+48-888-88-00-12</addr:Phone> Here Phone is the name of the object while addr is the namespace information.
Namespaces must be properly declared before use: <?xml version="1.0"
encoding="UTF-8"?>
xmlns:addr=”http://www.company.com/addressbook”> Here, xmlns:addr=”http://www.company.com/addressbook” is a namespace declaration.
Please refer to W3C XML specification for detailed information about
namespaces. The XML Importer can handle
nemespaces automatically by annotating imported objects. For each namespace
declaration and usage an annotation is created. Please refer to XML import
procedure section for more details. Since a user cannot see
annotations, namespace information so far remains invisible, that is, it
cannot be accessed using the query language. However, after exporting
namespaces may appear in the resulting XML again. 12.2 Importing XML using CLI command lineThe general structure of XML importer execution command is as follows: load “resource”
using XMLImporter(“params”) or load “resource”
using XMLImporter where resource – a path to a XML file that is to be imported params – a list of
parameters recognized by the importer, separated by: “[space] , ; \n \t
\r \f ”. For example: load
“res/xml/bookstore.xml” using XMLImporter load
“foo/myshop/bookstore.xml” using XMLImporter(“M0,
noGuessType”) When no parameters are specified
then the import procedure assumes that annotations must be used for marking
attributes and namespace information, simple type value must be automatically
guessed and references between object using id, idref attribute pairs. 12.2.1 XML importer parametersCurrently the list of implemented XML importer parameters contains: – M0 – do not use annotated object during import procedure (contrary to useMetabase). See import procedure description for details. By default this option is not used so it must be stated explicitly when needed. – noGuessType – do not perform automatic type guessing (contrary to useMetabase, using metabase will always use explicit type information). See automatic type guessing for details. By default this option is not used, so simple types are guessed. – noAutoRefs – do not perform automatic id/idref recognition and reference object creation. See automatic references for details. By default this option is not used so references between objects are created automatically. – useMetabase – import XML using type information from metabase (contrary to M0, using metabase for type inferring will always use annotated objects). Metabase may be created in any way but in most cases will be constructed by importing XSD file. 12.3 XML Import ProcedureThe import procedure is able to
deal with complex objects and attributes, resolve idref and id attribute pairs to create
references and import namespace information. Generally when an XML document
is imported into an ODRA object store,
all information found in XML is converted to appropriate ODRA objects. 12.3.1 Complex structures and attributes.Complex XML structures, simple
values and attributes are imported according to the following rules: 1. A tagged element is converted to a complex object. A tag name is used as the object's name. 2. A text inside a tagged element is stored in a simple type object named _VALUE. Type may be guessed or taken from a metabase. 3. Element's attributes are stored in subobjects: 1. In case of a simple import procedure, a created subobject is a simple type object with name preceded by ‘@’. Its type may be guessed automatically. Please note that this way of attribute importing will work only when the parameter M0 in XMLImporter is used. Character ‘@’ distinguishes attribute and normal objects. Example:
2. In case of the import using annotated objects, the created subobject is a complex object containing single simple type object named _VALUE. The subobject's name is equal to the attribute's name but an appropriate annotation is created (attribute=”true”). In this way attributes are treated in the same way as all other objects. The annotation is the only way to distinguish non-attribute and attribute objects. Example:
12.3.2 Type guessing.For some purposes (mainly
comparing values or selecting a minimal value) the XML importing procedure
tries to guess the type of imported simple type value. If it is a parseable
double then a double object is produced. If it is a parseable integer then an
integer object is produced. Otherwise a string object is produced. Please
note that this option does not use any kind of schema. For example, if an XML
file contains: <avg>-10.20</avg> then after the XML import with
automatic type guessing avg will be a double object, count will be an integer object and descr will be a string object. Type guessing may be switched off
by “noGuessType” plugin option. Type may be also assigned to an object using
metabase entries. Please refer to the proper section for more information. 12.3.3 Automatic references between elements.XML Importer may automatically create
reference objects using the following algorithm:
Automatic creation of reference objects upon id/idref attributes may be turned off by
import parameter “noAutoRefs”. 12.3.4 Namespaces.Namespace information may be also
imported but please note that SBQL has no constructs to access those
information right now. However it will be visible when an object with
namespace information will be produces as a query result. In case of a simple
import procedure (M0) all namespace declarations and prefixes are omitted.
One must use annotating import procedure to handle namespaces correctly,
since namespaces are converted to annotation objects: 1. namespace definition is converted to an annotation object: namespaceDef( prefix:String, uri:String ) 2. a single object may have many namespaceDef annotations; 3. namespace assignment creates a reference annotation namespaceRef pointing to an appropriate namespaceDef object; 4. an object may contain only one namespaceRef annotation; 5. if an object is assigned to a namespace it must contain namespaceRef annotation, even if it points to its own namespaceDef; 6. attributes may contain only single namespaceRef annotation. 12.3.5 Type inferring using metabase objects.XML Importer may use type
information taken from objects in metabase. In such a case, simple objects
and attributes will not be imported as strings nor any type guessing will be
done. Structure of imported XML objects must exactly reflect structures
described in the metabase. Type assignment do XML object may be done in two
(alternative) ways: 1. by name of an XML object, which must be exactly the same as name of declared variable, structure or typedef existing in metabase 2. by type attribute, which points to metaobject with name equal to value of the attribute. Type declaration attribute must be assigned to namespace http://www.w3.org/2001/XMLSchema-instance (other namespaces, also undefined namespace will result in omitting the type attribute). Please note that the second case
makes sense only for simple type objects because name of the XML object must
exactly fit name of variable declaration (in case of a root object) or
structure field's name declaration in (case of an object embedded in other
object). Otherwise type checking will fail. Example: <shipTo exportCode="1"
xsi:type="ipo:UKAddress"> XSD fragment: <element name="shipTo"
type="ipo:UKAddress"/>
<element name="postcode"
type="ipo:UKPostcode"/> The above XML and XSD will create
an object named shipTo with type referring to UKAddress typedef but the metabase will
also declare variable shipTo with the same type. Importing XML schema and XML commands using
types taken from the metabase: load "res/xml/personnel.xsd" using XSDImporter If one wants to infer type upon information from the
matabase, XML file must be imported with annotations, thus M0 option is
forbidden. 12.4 Examples of XML Files and Queries
12.4.1 Example 1 – Books and
Authors
The XML
file contains information about books and authors. Each book has a title,
possibly many authors, a publisher, a price, and optionally an editor. Let us
assume that the file named bib.xml
contains: <?xml version="1.0"
encoding="UTF-8"?> <bib> <book year="1994"> <title>TCP/IP
Illustrated</title>
<author><last>Stevens</last><first>W.</first></author>
<publisher>Addison-Wesley</publisher> <price>65.95</price> </book> <book year="1992"> <title>Advanced Programming in the
Unix environment</title>
<author><last>Stevens</last><first>W.</first></author>
<publisher>Addison-Wesley</publisher> <price>65.95</price> </book> <book year="2000"> <title>Data on the
Web</title>
<author><last>Abiteboul</last><first>Serge</first></author>
<author><last>Buneman</last><first>Peter</first></author>
<author><last>Suciu</last><first>Dan</first></author> <publisher>Morgan Kaufmann
Publishers</publisher> <price>39.95</price> </book> <book year="1999"> <title>The Economics of Technology
and Content for Digital
TV</title> <editor>
<last>Gerbarg</last><first>Darcy</first>
<affiliation>CITI</affiliation> </editor> <publisher>Kluwer Academic
Publishers</publisher> <price>129.95</price> </book> </bib> It may
be loaded using CLI command: load
"bib.xml" using XMLImporter The
second XML file contains books’ reviews (reviews.xml): <?xml version="1.0"
encoding="UTF-8"?> <reviews> <entry> <title>Data on the
Web</title> <price>34.95</price> <review> A very good discussion of
semi-structured database systems and XML. </review> </entry> <entry> <title>Advanced Programming in the
Unix environment</title> <price>65.95</price> <review> A clear and detailed discussion of UNIX
programming. </review> </entry> <entry> <title>TCP/IP
Illustrated</title> <price>65.95</price> <review> One of the best books on TCP/IP. </review> </entry> </reviews> It may
be loaded using CLI command: load
"reviews.xml" using XMLImporter Here
are the valid queries and corresponding results: 1. List
books published by Addison-Wesley after 1991, including their year and title. (((bib.book where
(publisher._VALUE="Addison-Wesley") and
(year._VALUE>1991)). (title as
title, year as year)) as book ) groupas bib; Result: <?xml
version="1.0" encoding="UTF-8"?> <bib> <book year="1994"> <title>TCP/IP
Illustrated</title> </book> <book year="1992"> <title>Advanced Programming in the
Unix environment</title> </book> </bib> 2. For
each book in the bibliography, list the title and authors, grouped inside a
"result" element. bib.book.((author
as author union title as title) groupas result) Results: <?xml
version="1.0" encoding="UTF-8"?> <results> <result> <author> <last>Stevens</last> <first>W.</first> </author> <title>TCP/IP
Illustrated</title> </result> <result> <author> <last>Stevens</last> <first>W.</first> </author> <title>Advanced Programming in the
Unix environment</title> </result> <result> <author> <last>Abiteboul</last> <first>Serge</first> </author> <author> <last>Buneman</last> <first>Peter</first> </author> <author> <last>Suciu</last> <first>Dan</first> </author> <title>Data on the Web</title> </result> <result> <title>The Economics of Technology
and Content for Digital </result> </results> 3. Create
a flat list of all the title-author pairs, with each pair enclosed in a
"result" element. bib.book.((title
as title, author as author) groupas result ) Result: <?xml
version="1.0" encoding="UTF-8"?> <results> <result> <title>TCP/IP
Illustrated</title> <author> <last>Stevens</last> <first>W.</first> </author> </result> <result> <title>Advanced Programming in the
Unix environment</title> <author> <last>Stevens</last> <first>W.</first> </author> </result> <result> <title>Data on the Web</title> <author> <last>Abiteboul</last> <first>Serge</first> </author> </result> <result> <title>Data on the Web</title> <author> <last>Buneman</last> <first>Peter</first> </author> </result> <result> <title>Data on the Web</title> <author> <last>Suciu</last> <first>Dan</first> </author> </result> <result/> </results> 4. For each author in the bibliography, list the author's name and the
titles of all books by that author, grouped inside a "result"
element. (unique(deref(bib.book.author)) as
anAuthor). ((anAuthor as
author union (bib.(book where anAuthor in author)).title as
title) groupas result) groupas results; Results: <?xml
version="1.0" encoding="UTF-8"?> <results> <result> <author> <last>Abiteboul</last> <first>Serge</first> </author> <title>Data on the Web</title> </result> <result> <author> <last>Buneman</last> <first>Peter</first> </author> <title>Data on the Web</title> </result> <result> <author> <last>Stevens</last> <first>W.</first> </author> <title>TCP/IP
Illustrated</title> <title>Advanced Programming in the
Unix environment</title> </result> <result> <author> <last>Suciu</last> <first>Dan</first> </author> <title>Data on the Web</title> </result> </results> 5. For
each book found at both bib.xml and reviews.xml, list the title of the book
and its price from each source. We assume that the files have been loaded as
shown previously. ((unique(bib.book.title as aBookTitle)). ((aBookTitle as title union reviews.(entry where title = aBookTitle).price as price_bstore1 union (bib.book where title=aBookTitle).price as price_bstore2) groupas
book_with_prices)) groupas results; Results: <?xml
version="1.0" encoding="UTF-8"?> <results> <book_with_prices> <title>TCP/IP
Illustrated</title>
<price_bstore1>65.95</price_bstore1> <price_bstore2>65.95</price_bstore2> </book_with_prices> <book_with_prices> <title>Advanced Programming in the
Unix environment</title>
<price_bstore1>65.95</price_bstore1>
<price_bstore2>65.95</price_bstore2> </book_with_prices> <book_with_prices> <title>Data on the Web</title>
<price_bstore1>34.95</price_bstore1>
<price_bstore2>39.95</price_bstore2> </book_with_prices> <book_with_prices> <title>The Economics of Technology
and Content for Digital <price_bstore2>129.95</price_bstore2> </book_with_prices> </results> 6. For each book that has at least one author, list the title, one or
two first authors, and an empty "et-al" element if the book has
additional authors. ((((bib.book where count(author)>0 and
count(author)<=2) as b). ((b.title as title union
b.author as author) groupas book)) union ((bib.book where count(author)>=3) as
b). (((b.title as title) union (b.author[1] as
author) union (b.author[2] as author) union (("") as
et_al)) groupas book)) groupas bib; Results: <?xml
version="1.0" encoding="UTF-8"?> <bib> <book> <title>TCP/IP
Illustrated</title> <author> <last>Stevens</last> <first>W.</first> </author> </book> <book> <title>Advanced Programming in the
Unix environment</title> <author> <last>Stevens</last> <first>W.</first> </author> </book> <book> <title>Data on the Web</title> <author> <last>Abiteboul</last> <first>Serge</first> </author> <author> <last>Buneman</last> <first>Peter</first> </author> <et-al/> </book> </bib> 7. List the titles and years of all books published by Addison-Wesley
after ((bib.book where
(publisher._VALUE="Addison-Wesley") and (year._VALUE>1991)). ((title as title, year as @year) as book) orderby
deref(book.title._VALUE)) groupas bib; Results: <?xml
version="1.0" encoding="UTF-8"?> <bib> <book year="1992"> <title>Advanced Programming in the
Unix environment</title> </book> <book year="1994"> <title>TCP/IP
Illustrated</title> </book> </bib> Let us
assume that the file contains: <?xml
version="1.0" encoding="UTF-8"?> <prices> <book> <title>Advanced Programming in the
Unix environment</title>
<source>bstore2.example.com</source> <price>65.95</price> </book> <book> <title>Advanced Programming in the
Unix environment</title>
<source>bstore1.example.com</source> <price>65.95</price> </book> <book> <title>TCP/IP
Illustrated</title>
<source>bstore2.example.com</source> <price>65.95</price> </book> <book> <title>TCP/IP
Illustrated</title>
<source>bstore1.example.com</source> <price>65.95</price> </book> <book> <title>Data on the Web</title>
<source>bstore2.example.com</source> <price>34.95</price> </book> <book> <title>Data on the Web</title>
<source>bstore1.example.com</source> <price>39.95</price> </book> </prices> It may
be loaded using CLI command: load
"prices.xml" using XMLImporter SBQL
query solving the task: (unique(deref(prices.book.title._VALUE)) as t). ((t as @title, (min((prices.book
where title._VALUE =
t).price._VALUE) as price) ) as
minprice ) groupas results; 9. For each book with an author, return the book with its title and
authors. For each book with an editor, return a reference with the book title
and the editor's affiliation. bib.(((book where exists(author)) as b). ((b.title as title union b.author as
author) groupas book) union ((book where exists(editor)) as b).(b.title as title, b.editor.affiliation as affiliation) as
reference) groupas bib; Results: <?xml
version="1.0" encoding="UTF-8"?> <bib> <book> <title>TCP/IP
Illustrated</title> <author> <last>Stevens</last> <first>W.</first> </author> </book> <book> <title>Advanced Programming in the
Unix environment</title> <author> <last>Stevens</last> <first>W.</first> </author> </book> <book> <title>Data on the Web</title> <author> <last>Abiteboul</last> <first>Serge</first> </author> <author> <last>Buneman</last> <first>Peter</first> </author> <author> <last>Suciu</last> <first>Dan</first> </author> </book> <reference> <title>The Economics of Technology
and Content for Digital <affiliation>CITI</affiliation> </reference> </bib> 10.
Find pairs of books that have different titles but the same set of authors
(possibly in a different order). bib.((book where exists(author)) as b1,
(book where exists(author)) as
b2). (b1 where ((title._VALUE <>
b2.title._VALUE) and forall(author
as a) (a in b2.author))).title as
title groupas book_pair groupas bib; Results: <?xml
version="1.0" encoding="UTF-8"?> <bib> <book-pair> <title>TCP/IP Illustrated</title> <title>Advanced Programming in the
Unix environment</title> </book-pair> </bib> 12.4.2 Example 2 – Departments and Employees
The XML
file contains two kinds of objects: employees and departments. An employee may
contain reference to a department he works in and optionally to a department
he manages. Each department contains bidirectional references to employed
employees and to the boss. Let us
assume that the file is named deptemp.xml: <?xml version="1.0"
encoding="UTF-8"?> <deptemp> <Emp
id="i1">
<name>Doe</name> <sal>2500</sal> <worksIn
idref="i17"></worksIn> <manages
idref="i17"></manages>
</Emp>
<Emp id="i5">
<name>Poe</name> <sal>2000</sal> <worksIn
idref="i22"></worksIn>
</Emp>
<Emp id="i9">
<name>Lee</name> <sal>900</sal> <address> <city> <street>Boogie</street> <house>13</house> </address> <worksIn
idref="i22"></worksIn>
</Emp>
<Emp id="i55">
<name>Bert</name> <sal>3000</sal> <address> <city> <street>Avenue</street> <house>34</house> </address> <worksIn
idref="i22"></worksIn> <manages
idref="i22"></manages>
</Emp>
<Dept id="i17">
<dname>Trade</dname> <loc> <loc> <budget>100000</budget> <employs
idref="i1"></employs> <boss
idref="i1"></boss>
</Dept>
<Dept id="i22">
<dname>Ads</dname> <loc> <budget>200000</budget> <employs
idref="i5"></employs> <employs
idref="i9"></employs> <employs
idref="i55"></employs> <boss
idref="i55"></boss>
</Dept> </deptemp> Now,
the file may be loaded using the following CLI command: load
“deptemp.xml” using XMLImporter Here
are the valid queries that may be executed: 1. Get
departments together with the average salaries of their employees: deptemp.((Dept
as d) join avg(d.employs.Emp.sal._VALUE)); deptemp.(Dept
join avg(employs.Emp.sal._VALUE)); 2. Get
name and department name for employees earning less than 2222 deptemp.(Emp where sal._VALUE <
2222).(name._VALUE, worksIn.Dept.dname._VALUE); 3. Get
names of employees working for the department managed by Bert. deptemp.(Emp where
(worksIn.Dept.boss.Emp.name._VALUE) = "Bert").name._VALUE; 4. Get
the name of Poes boss: deptemp.(Emp where name._VALUE =
"Poe").worksIn.Dept.boss.Emp.name._VALUE; 5.
Names and cities of employees working in departments managed by Bert: deptemp.(Dept
where (boss.Emp.name._VALUE) =
"Bert").employs.Emp. (name._VALUE,
((address.city._VALUE) union
("No address" where not exists(address)))); 6.Get
the minimal, average and maximal number of employees in departments: deptemp.(min(Dept.count(employs)), avg(Dept.count(employs)), max(Dept.count(employs))
); 7. For
each department get its name and the sum of salaries of employees being not
bosses: deptemp.(((Dept
as d) join ((sum(d.employs.Emp.sal._VALUE)
- (d.boss.Emp.sal._VALUE)) as s
)).(d.dname._VALUE, s)); 8. Is
it true that each department employs an employee earning the same as his/her
boss?: deptemp. forall (Dept as d) 9. For
each employee get the message containing his/her name and the percent of the
annual budget of his/her department that is consumed by his/her monthly
salary: deptemp. Emp
. ("Employee " + name._VALUE + " consumes " + 10. Get
cities hosting all departments: deptemp.(unique(deref(Dept.loc._VALUE)) as
deptcity) |
|
Last modified: June 20, 2008 |