Collections
Back to Description of SBA and SBQL.
Collections are data
structures consisting of data entities or objects of a similar structure.
Sometimes it is assumed that “similar” means “of the same
type”, however, such an assertion is vague, especially for environments
that are not typed. It is assumed that the size of a collection is vary and can
be dynamically changed at any moment. Typical collections are sets, relations
(tables), bags, sequences, lists, trees, some kinds of arrays (with open ends),
indices and perhaps others. Differences between kinds of collections concern
their structural properties, as perceived by the programmers, as well as their
behavioral properties, i.e. operations that are allowed for a particular collection kind. Usually collection are build
in a particular data structure environment, that is, their structural
properties and generic primitive operators acting on collection are predefined
of the given language and cannot be changed by application programmers. Some
languages (C++, Java) make it possible to define new kinds of collections, but
this concerns rather primitive, non-optimized and non-scalable cases. In
databases all kinds of collections are predefined and build in the data
environment.
Collections are not present in
popular programming languages, for instance, in C. In Pascal collections are
represented by the concept of file, burdened by implementation restrictions.
Lack of collections enforces the programmer to use dynamically allocated memory
(heap) to implement tasks requiring collections, with many negative consequences
such as error-prone operations on pointers, lack to typing safety and memory
leaks. Popular object oriented programming languages (Smalltalk, C++, Java)
does not introduce the collection concept explicitly, although there are some
possibilities to introduce some kinds of collections (e.g., vectors of
references). Lack of explicit collections cause many limitations, in
particular, major problems with defining nested collections (e.g. attributes
with any number of values), pointer links or binary relationships between
elements of collections, user-friendly queries acting on collections, and
others. Collections (sets of objects) are explicitly introduced in popular
analysis and design notations such as UML. However, the level of algorithmic
precision of such modeling tools is much below the level of precision required
for programming languages. Moreover, collections introduced in analysis and
design notations and tools are usually not associated with operations
(especially updating operations) that are inevitable in programming languages.
On the other hand, databases
mostly deal with collections. In particular, relational databases deal with
collections known as relations or tables. In databases collections have usually
(conceptually) unlimited size that can be millions or billions of data items.
Due to their size, collections in databases are carefully tuned both from the
point of view of their structure and organization and from the point of view of
access. In relational databases access to collections is accomplished through
the query language SQL, which is supported by a powerful query optimizer that
reduces the access time sometimes in several orders of magnitude. Due to query
optimization that is build in the SQL interpreter any other processing style
(e.g. processing a collection element-by-element) is discouraged or not
supported.
Lack of collections (or their
limited features) in programming languages is the direct reason of infamous
effect known as impedance mismatch. The
impedance mismatch causes reduction of programmer productivity and overhead in
the application program code. It is much disadvantageous for program
maintenance. It is also unaesthetic or even sloppy from the point of view of
software manufacturing tools.
Object oriented databases in
early assumptions (c.f. object-oriented database manifesto) provided no
impedance mismatch by unifying types in programming languages and databases.
Unfortunately, this tenet has remained among nice wishes. In the ODMG standard
of object-oriented databases (commonly perceived as flavous) the impedance
mismatch still persists, due to the fact that the standard provides Smalltalk,
Java and C++ as the only tools for application programming.
Kinds of collections depend on
whether we consider collections as data structures stored in data or object
stores or collections that are returned by queries. Majority of theoretical
models make no difference between these two situations, but looking on SQL we
see the fundamental difference. Inside a relational database we have only
unordered tables of tuples. Considering the output from SQL queries we can
distinguish the previous case, as well as ordered tables (sequences) returned
by SQL queries with the order by
clause. Moreover, collections returned by SQL differ from collections stored in
a database by many semantic properties. For instance, it makes a sense to
insert a new tuple to a stored collection, but it makes little sense (and is
impossible) to insert a tuple to a (volatile) collection returned by an SQL
query. Although the distinction is obvious, some strong typing professionals
and some industrial proposals (in particular, the ODMG standard) are trying to
unify the cases by a common typing system. In our opinion, this must lead to
inconsistencies. We intend to consider
collections precisely, according to their semantic nature and practical needs.
In general, we can distinguish
two kinds of collections that are structurally different:
·
Sets or bags, where the order
of elements is inessential,
·
Sequences or arrays, where the order
of elements is bearing some information.
All other kinds of collections
are derived from these two kinds by assuming some special constraints (e.g. no
duplicate elements in sets), by special physical implementation (e.g. indices
implemented as hash tables or B-trees) an/or by special set of generic
operations assigned to a particular collection kind (e.g. taking i-th element
from a collection). In particular, in the ODMG standard the following
collection kinds are proposed:
·
Sets: can contain
elements of any (but the same) type, without duplicates,
·
Bags: as before, but
duplicates are allowed,
·
Sequences: as before, but
the order of elements is essential,
·
Arrays: as before, with
some restriction on inserting and deleting elements (allowed only on the
array’s top),
·
Dictionaries: sets of pairs (t, v),
where t is a string of characters and
v is a value of any type. The
intention of dictionaries is organizing indices.
The ODMG standard makes no
difference between collection kinds stored in a database and collection kinds
returned by queries. For instance, concerning stored collections, it makes
little sense to consider bags, because every element of a bag is different due
to object identifiers. On the other hand, application of sets in practice is
rather low and can be fully substituted by bags, perhaps with an explicit
operator for removing duplicates (applicable only to bags returned as results
of queries). The above doubts we present only to realize that the number of
choices concerning collections is big and the ODMG proposal can be perceived
only as a particular composition of the choices, not the only possible, not
minimal and not optimal. The general principles that should be a guide to an
optimal choice is minimality and universality of the collection constructs, as
well as their full orthogonality.
If a particular typing system
introduces sets as a collection kind, then it assumes automatic removing
duplicates. This is required after any operation of insertion or assignment and
after any query that returns a set as its output. However, removing duplicates
is a costly operation which (as a rule) requires sorting. For this reason sets
should be used by programmers with caution. In some proposals (SQL, ODMG) the
operation of removing duplicates is explicitly in hands of the programmer (a distinct or unique clause). Existence of this statement undermines the
necessity of sets as a type constructor: it is enough to have only bags, with
the above duplicate removing operator. Note that this is just the solution of
SQL. The explicit removing duplicates operator and the set collection kind are
redundant, because declaring the result a query as a set should automatically
cause removing duplicates. In the ODMG standard there are both options, which
looks not very reasonable.
Sets, removing duplicates
operators and some operations on collections (equality, containment, etc.)
imply the necessity to have a criteria what does it mean that two elements of
the same or different collections are identical. In some cases this is not
obvious. For stored collections the issue is obvious, as
“identical” means “having the same object identifier”.
Some authors, however, consider deep comparisons, where “identical”
means “isomorphic up to object identifier”. However, we do not
think that the last point of view makes a sense for the programmers. For
collections returned by queries the question of identical elements becomes more
vague. For instance, are two structures {name:
“Smith”, salary: 2500}
and {salary: 2500, name: “Smith”} identical?
Query operators (except equality) are not able to distinguish them, however,
the doubt remains. At least, they are
probably instances of two different types, hence are not identical. In general,
the definition of identical elements may much be dependent on implementation,
representation and the assumed strong typing system.
Collections can be nested. The
kind of nesting and the number of nesting levels should not be constrained. In
Fig.51 we show nested collections, where elements of the collection Employees have nested bags Children and nested sequences Employments. Such a flexibility in
nesting collections is the consequence of orthogonality of the typing system,
the object relativism and some conceptual
closure saying that everything that could make sense for anybody should be
allowed. Such a conceptual closure is the feature of almost all programming
languages. In databases, however, it is assumed only recently (cf. ODMG and
SQL-99 standards). Relational databases have assumed the first normal form,
according to which nesting tables was impossible. This was motivated by the
necessity to simplify the conceptual model as well as by mathematical theories
that supported the relational databases concept (such as the relational
algebra). In our opinion, such limitations are not reasonable from the point of
view of conceptual modeling of business applications. Moreover, claims that
nested collections decrease the possibility to use mathematical theories are
obviously tendentious. If necessary, proper mathematical theories could be
developed for nested collections as well.

Fig.51. Nested collections
Operators and constructors
for processing collections
Operators on collections depend on whether we deal
with stored collections or collections returned by queries. They can be grouped
into two categories:
· Macroscopic
operators: arguments are
collections, the results are collections too.
· Iteration
operators: they process collections
element-by-element.
Operators can also be subdivided into functional and
updating operators. Functional operators do not change the state of an object
store, while updating operators make it possible to change the state. There are
several macroscopic operators for bags, in particular, sum, intersection,
difference, equality and inclusion of bags, removing duplicates from a bag,
changing a bag into a sequence (with no sorting), membership of an element
in a bag, the size of a bag, inserting an element to a bag, removing
an element from a bag, assignment to an element of a bag, and perhaps others.
Similarly, for sequences (and for arrays) the following operations are
possible: concatenation, intersection, difference, equality and inclusion of
sequences, first, last and i-th element of a sequence, the size of a sequence,
inserting an element into a sequence, removing an element from a sequence,
assignment to an element of a sequence, membership of an element in a sequence,
changing a sequence into a bag, removing duplicates from a sequence, sorting a
sequence according to some key, and perhaps others.
In practice, big set of
operators like above could be insufficient. Fully universal processing of
collections require a processing element-by-element by some constructs known as
iterators. Note that iterators can be nested and some nested iterator instances
may act on the same collection. Iterators are sets of methods such as “get_first”, “get_next”, “is_last?”. Each iterator instance
has a state, which is the reference to an element that is currently processed.
States of iterator must be somehow identified and stored somewhere; this could
present some non-trivial conceptual problem. Programming with iterators is
challenging, especially concerning achieving good performance during processing
large collections. Alternatively to explicit iterators, a query language,
perhaps with the full algorithmic power, could be developed. Queries can be
considered as nested iterators, however each application of an iterator
(selection, projection, join, etc.) has some meaning for the programmer. Query
languages usually adopt the above macroscopic operations on collections as
query operators. In a well-developed query languages queries are internally
optimized, thus the performance problems are much easier.
Some tradeoff between
iterators and queries is the operator for
each, which iterates along a collection. This collection, however, can be
returned by a query, for instance (SBQL):
for each ( Employee where Job = „analyst” ) as a do {
print (a.Surname);
a.Salary := a Salary + 100; };
In the ODMG standard operators
for processing collections are grouped as interfaces, which (syntactically) are
not different from normal interfaces:
interface
Collection: Object{
unsigned long cardinality();
boolean is_empty();
void
insert_element( in Object element );
boolean contains_element(
in Object element );
...};
interface Set : Collection{
Set union_with( in Set other );
Set difference_with( in Set other );
boolean is_subset_of( in Set other
);
... };
Note parameters of methods
typed as Object. Hence the above
interfaces are semantically different from normal interfaces, because they are
parameterized, in principle, by any type. Such interfaces are type polymorphic,
with parametric polymorphism which is one of difficult notions in the theory of
programming languages. In C++ terms such interfaces correspond to templates. However, templates are
parameterized by concrete types, not by the most general type Object. In effect, the strong typing
potential is much reduced [Alag97].
Contradiction of collections with object-oriented
principles
The collection concept leads
to problems with substitutability and open-close, the basic principles of
object-orientedness. Assume the following:
·
No two
different stored collections have common elements;
·
Substitutability
is supported, i.e. each member of a specialized class is also a member of a
more general class;
· Open-close is supported, i.e. a class can be closed for changes, but still open to specialization through sub-classes.
The first assumption says that collections Persons and Students are disjoint. The second assumptions requires that methods inside a class PersonClass should process a collection Students, because each student is a person. Assume that according to the third assumption the PersonClass is closed for changing. If during the development one concludes that a new class EmployeeClass of objects Employee is necessary, it would be necessary to alter the PersonClass by taking into account the collection Employees. This violates the third assumption. Similar problems arise due to multiple-inheritance and in general, with collections that store objects of different types. One of the solutions of this dilemma is the concept of dynamic object roles and dynamic inheritance described previously. In this concept each object can (conceptually) belong to many collections.
The concept of collections is
not obligatory. In XML the concept is not introduced (at least, on the level of
XML files). Instead, collections are represented as many objects having the
same name. For instance, for an XML object Book
there could be many sumbobjects Author.
However, the concept of collections is necessary for strong typing. On the
level of types, collections can be declared by special keywords (cf. in the
ODMG standard) or by cardinalities (cf. XML Schema and UML). Representing
collections by cardinalities allows for more precision in specification. For
instance, specification with cardinalities Person:
PersonType[0..*] denotes a collection
Person which can be empty, while Person:
PersonType[1..*] denotes a collection
Person which is never empty. Such specification precision is not possible in
specification Person: Bag PersonType.
The above change does not remove the problems with object-oriented principles.
Null values and collections
A lot of literature is devoted
to null values in databases, especially in the context of relational databases.
There are many reasons for null values, such as incomplete, irrelevant, delayed
or pending information. Languages such as SQL do not consider external (real
world) semantics of nulls, treating them as a technical facility that is in
hands of database designers and programmers, who can use them for any purpose,
including the use for bearing some known information. For instance, we can
imagine that for all patients of some health service the attribute Cancer is null for anybody who is not
ill on cancer, or the cancer kind for patients that are ill.
Null values in databases have
a bad fame. [Date86] contains a severe critics
of null values as the reason of many semantic anomalies in SQL. Following this
critics, nulls as a special values stored in the database should be avoided.
Instead, [Date86] proposes to use default values; however, this solution has
also disadvantages [Subi96a].
Null values were the subject
of many theoretical proposals, but none of them has found essential practical
applications. Majority of them assumed that null values are the manifestation
of incomplete (unknown) information. They describe situations when some
information exists (e.g. John for sure has some job), but we do not know
precisely its state (e.g. that John is a teacher). Unfortunately, such
assertions are wrong for majority of cases. Null values can appear for a lot of
other reasons, not related to incomplete information. Moreover, the research on null values was
reduced to some limited query languages (e.g. SPJ operators of the relational
algebra). This is in sharp contradiction with the real scope of null values,
which must be taken into account in all query operators. And even more, they
can influence the entire application programming environment, including
updating capabilities such as insert, update and delete operators and including
all features of application programming languages. Hence the presented research
devoted to null values was indeed meaningless, oriented towards a next
conference paper rather than to make a contribution to solving real problems.
Till now, there is no consistent proposal showing how to consistently involve
null values into query operators, updating capabilities and programming
languages. Solutions known from SQL are error-prone and inconsistent in many
places; see [Date86].
Majority of reasons for null
values can be found in irregularity of information that cannot be fit into a
predefined data structure format. This especially is true for the relational
model which is based on very inflexible, fixed data structures. For instance,
null values within the attribute MaidenName
in the Employees table stem from the
fact that the assumed data structure unifies males, unmarried females and
married females. In principle, the table Employees
should have no attribute MaidenName
and there should be a separate table EmployeesMarriedFemales,
which contains such an attribute. However, providing there are many such
reasons for nulls in the Employees
table, the number of such additional tables could be unreasonably high, with
negative consequences for the complexity of the database schema and the
complexity and performance of SQL queries. Null values prevent unreasonable
growth of these factors, but at the cost of extra complexity of SQL, with
options for processing nulls which are inconvenient, illogical and error-prone.
Null values behave differently than normal data values, hence a lot of special
syntax, options and special treatment, which (as practice has shown) cannot be
consistently designed.
A bit different (and indeed
reasonable) treatment of null values appeared in the concepts known as
semi-structured data, manifested in XML and related technologies. In these
concepts null values are represented by absent data. For instance, in an XML
file lack of information results in lack of a corresponding XML object
(including its tags). This idea fits well with the idea of collections that are
determined by cardinalities. If the database designer or programmer wants to
say that some object or attribute A can be null, he/she writes it type with the
cardinality [0..1]. Other rules related to such an options are the same as for
regular collections. In this way we receive the possibility of expressing nulls
in data structures with no special options related to nulls in query and
programming languages. For instance, let the attribute Job of objects Employee
will be declared as Job: JobType[0..1]. It means that the
attribute Job is optional in an object
Employee. If Job is absent in a
particular object, than it can be filled in by the operator insert. Similarly, the operator delete can nullify the attribute. To
such a collection we can use other query operators, such as quantifiers. Below
we present some examples (SBQL):
·
Find all employees with Job filled in:
Employee where exists(Job)
·
Find employees without Job:
Employee where Job ≡ Æ (≡ denotes equality of bags, Æ denotes empty bag)
·
Get all designers:
Employee where
$ Job as
j (j = “designer”)
Employee where Job ≡ bag{“designer”}
Note that the query
Employee where Job = “designer”
is typologically incorrect,
because a bag is compared with a string. However, such a query can be made more
friendly by using a special syntax, e.g.
Employee where exists Job =
“designer”
Such a query is also a bit
more complex in SQL:
select * from Employee where Job is
not null and Job = “designer” .
The advantage of this idea is
that it is completely universal concerning storing and processing nulls.
Because a null value is not introduced explicitly majority semantic problems
with null values disappear. Some problems remain, such as the definition of the
outer join operator, but they are not seem to be hard. The idea is also good to
represent absent complex objects, which could be difficult or impossible in
relational systems.
Problems with sequences
Sequences are frequently
listed as “regular” collections, similar to bags and sets. In some
cases a sequence naturally represent the order of elements in conceptual modeling.
For instance, it can be important to represent the sequence of authors of a
book and (frequently) the sequence represents the importance of particular
authors or the weight of their contribution to the book. A bit more careful
attention to sequences reveals their peculiarities, especially concerning
stored sequences. First observation is that stored sequences can be considered
as an optimization option: some collections are stored as sequences because of
some particular access to them (e.g. via their ordering indices) or because of
particular applications (e.g. making fast ordered reports). In majority of
cases sequences are supported by some attributes of their elements (e.g. the
time of inserting) that can be used to sort them. Hence we can distinguish the
following cases of sequences:
·
The programmer is responsible for inserting a new
sequence element into a proper place of the sequence according to his/her own
criteria; no sorting procedure is possible. This case is probably very rare and
can be neglected.
·
The programmer is responsible for inserting a new
sequence element into a proper place of the sequence, but there is a sorting
procedure that makes it possible to correct his/her decision. The sorting
procedure can also be called after some updates of sequence elements. In this
case a sequence can be considered as a bag supported by a sorting procedure.
·
The programmer is not responsible for inserting a new
sequence element into a proper place of the sequence. There is an automatic
rule that makes the sequence properly ordered after each operation, including
insertions and updates. In this case the ordering rule is a component of the
sequence concept. This important component is neglected in all known nowadays
approaches to sequences. We can note that this element, in a simplified form,
was the property of the very old DBTG CODASYL standard devoted to network data
model. In modern approaches the feature can be implemented by special triggers.
·
A bag can posses many stored orders, as many as the
database designer or administrator wishes. Orders are named (for
identification) and can be dynamically attached/deattached to/from a bag. The
option makes it possible to make quick ordered reports of different kinds. For
instance, employees can be ordered according to their names and simultaneously
according to their salaries. In this proposal the sequence concept does not
exist independently: it belongs to some layer on top of bags. Such a feature,
in a simplified form, was the property of the DBTG CODASYL standard. The
feature can be supported by rules assuring keeping order within all sequences
attached to a bag.
If we consider sequences as an
optimization option then it should be in hands of the database administrator
rather than the programmer. The programmer uses a sorting operator (e.g. order by in SQL) and depending of the
current ordering determined by the administrator this operator can trigger
sorting or can do nothing because the collection is already sorted. If the administrator
would decide to remove this ordering, then the sorting operator indeed will
make sorting, but only the result returned by the corresponding query. The
administrator can decide to keep many sorts for the given collection. For
instance, Person collection can be equipped by the administrator with the sort
according to persons names, persons profession, persons incomings, etc.,
depending on currently used sorting operators in applications. Hence the sorting
property is similar to indices that are also in hands of the administrator. The
application programmer does not use indices explicitly and needs not to be
aware of them. A similar approach can be implemented for different sorts of
collections. In this way the level of abstraction of end user application
programming is higher, because the concept of ordered collections is no more burdening
application programs. On the other hand, such an approach gives the database
administration much more freedom concerning tuning options for minimizing the
global processing time.
As we can see, sequences
present a bunch of options that may support important practical needs.
Development of these options in a database models and programming languages is
worth attention.
Delete operator and garbage collection
Some authors postulate
avoiding the delete operator, as it
may lead to inconsistency known as dangling
pointers. They argue that if an object is deleted, but pointers leading to
the object are not deleted (or nullified), then pointers would lead to garbage
or to a random object created within the storage space of the deleted object.
Hence – anyway – the programmer has to remove or nullify all the
pointers that lead to the object being deleted. If he or she does that,
deleting the object is inessential, as it is no more accessible via pointers;
hence it can be removed automatically by the garbage collector.
This view is obviously
inherited from programming languages such as C/C++ and Java. It assumes
specific understanding of a pointer (or a reference). In these languages
pointers have obvious physical representation; usually they are memory
addresses. Indeed, in such situations removing an object without nullifying (or
removing) pointers leading to it might cause the appearance of dangling
pointers.
This view is a nonsense for
collections that are considered at the conceptual level. First of all, delete
is a conceptual notion that is well
understood by database application programmers. Removing an object from a
collection may occur in a lot of business modeling situations. Substituting
this conceptual notion by purely physical or data representation-oriented
operations is a step back from conceptual modeling to the computer illegible
folklore. But this view is also nonsense for purely technical reasons, in
particular, the following:
·
It is quite easy to implement pointers in such a way
that the appearance of dangling pointers would be impossible, despite deletion
of objects. For instance, in Loqis and in ODRA each pointer from object A to
object B is associated with a twin backward pointer from B to A. Backward
pointers are transparent for programmers. Hence, the operation of deleting an
object is preceded by navigation according to backward pointers, nullifying or
deleting proper pointers and them, deleting the object. Additional storage
necessary for backward pointers can be disregarded for majority of practical
applications.
·
On the conceptual level it is possible to create
collections of objects with no pointers leading to them. How to remove pointers
that belong to the physical representation rather than to a programmer’s
schema?
·
If the programmer has to nullify or remove all the
pointers leading to an object, his or her knowledge on the database schema must
be much more complete that in case of removing a single object. An object of a
given class can be referenced from dozens of classes. This is not reasonable to
force the necessity of such knowledge for such a reason as deletion of an
object.
·
In shared environment parts of the entire schema can
be unavailable for a given programmer for security, privacy or other reasons.
How the programmer can nullify a pointer if he or she has no access to it?
·
If the programmer trying to remove an object forgets
about only one pointer and do not nullifies it, then this object will not be
physically removed and will cause inconsistent behaviour of some application.
For instance, if an employee object is to be removed by nullifying pointers,
but the pointer from the healthcare service is not removed, then this employee
object will be still processed by a healthcare application.
·
Pointers also can create collections, hence some
operation of deleting a pointer would be necessary anyway. Thus the question
concerns the principle of conceptual closure and relativity of objects: the
delete operation is to be available on pointers, but not available on other
kinds of objects.
The above arguments clearly
show that the above view on the delete operation is a conceptual weed arising
from transferring some reasonable argumentation to totally different
environment, where this argumentation leads to a nonsense. In databases,
especially in SQL, the delete operation has normal meaning and pragmatics and
there is no reason that object-oriented databases make in this respect any new
quality.
Last modified: May 23, 2009