Persistence
A programming entity is persistent
if it lives longer than the run of the program that created it. A persistent
entity retains its state between subsequent runs of the program. All entities
stored in databases are assumed to be persistent. Data structures of programming
languages (in particular, variables or objects) are not persistent (they are volatile), because after the program
finishes they are no longer available to subsequent runs of this or another program.
The concept of persistence is orthogonal to object-orientedness and can be
discussed in the context of any data model. However, the object-oriented
literature treats persistence with special attention.
The concept of persistence was not coined in the database domain.
Originally, the domain was based on the data
independence principle, which assumes that a database is designed,
administered, maintained, secured, catalogued, published and accessed
independently of any application programs that act on the database. Moreover,
there is usually no assumption that database applications are to be written in
a single programming language. On the contrary, data independence implicitly
assumes that a database will be available to any programming language, provided
that it implements a corresponding library (or a “driver”). Because in
databases all structures are persistent, there is no need to introduce the
concept of persistence. Similarly, for operating systems there is no need to
characterize files as “persistent”, because they exist
independently of any application program that may act on them. In this sense
database management systems, with their data independence principle, are
closer to operating systems than to programming languages. In the literature
there are several proposals to connect these two domains. This is actually done
in big DBMSs such as Oracle, which take over
(and refine) many functionalities that were traditionally on the side of
operating systems (for instance, granting access privileges).
Originally, the concept of persistence did not appear in the domain of
programming languages either. From the very beginning programming languages
worked with files, which were treated very differently from a program’s
data structures. Usually, files were created, read, updated, deleted, etc. by
special routines collected in a library available for a given programming
language, but the file itself remained external to the programming language, on the
same principle as, e.g., a keyboard or a mouse is external. An application
programmer explicitly uses these routines to perform actions on these external
resources, e.g., creating a file, recording some data (stored in program
variables) in a file or reading data from a file (into program variables). This
idea was naturally extended to databases, although with different (more
complex) options and libraries (APIs) equipped with query languages.
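The traditional, non-persistent style described above can be illustrated with a minimal Python sketch. It assumes a hypothetical counter stored in a JSON file; the function names (`save_counter`, `load_counter`, `run_once`) are invented for illustration only. The variable `counter` is volatile, and its value survives between runs only because the programmer explicitly calls library routines that write it to and read it from an external file.

```python
import json
import os


def save_counter(path, value):
    # Explicitly record the value of a program variable in an external file.
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"counter": value}, f)


def load_counter(path):
    # Explicitly read the stored value back into a program variable.
    if not os.path.exists(path):
        return 0  # first run: nothing has been stored yet
    with open(path, encoding="utf-8") as f:
        return json.load(f)["counter"]


def run_once(path):
    # One "program run": `counter` is a volatile local variable;
    # only the file outlives the run.
    counter = load_counter(path)
    counter += 1
    save_counter(path, counter)
    return counter
```

Calling `run_once` repeatedly with the same path imitates subsequent runs of a program; all persistence is handled by hand, outside the language's own variable mechanism.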
The concept of persistence was born from the marriage of databases
and programming languages. It concerns a new type of programming language
(so-called database programming languages, DBPLs) designed
specifically for building applications that act on databases. None of
the popular programming languages (Pascal, C, C++, Java, etc.) explicitly involves
the concept of persistence. To some extent the concept violates the old principle of
data independence, because it implicitly assumes that database application
programs will be written in a single programming language (or in a single
family of languages sharing the same typing and/or data representation system).
It is not certain that the software community is currently prepared for such a
“monopoly” (which apparently violates our sense of democracy, free
commercial competition, freedom of invention and the need for
diversity as a factor of progress). However, as a matter of fact, many database
applications are actually based on such a monopoly, hence the concept of
persistence is worth attention. Moreover, because of persistent procedural
entities some kind of monopoly is inevitable.
The persistence concept is based on the observation that both programming
languages and databases deal with data structures that are very similar in
conceptual models, construction, representation and typing. The only conceptual
difference is that data structures stored in databases exist
independently of the lifetime of a program run, while data structures that the run
processes disappear completely when it terminates. Hence they differ in only
one factor, which was named the “persistence status”. If so, let
this factor be separated, while all other properties and functionalities related
to data structures are unified for both cases. Obviously, this point of
view is programming-language-centric and (to some extent, as we will discuss)
contradicts the data independence principle that established the
domain of databases.
The concept of persistence does not assume a strict separation of persistent
and volatile entities. Different proposals assume various ways of unifying
their construction and functionality. The most common ideas are the
following:
·
Unified naming, scoping and binding. Programming variables and database structures are
named in the same way, they follow the same scope rules, and the binding of a
volatile variable has the same syntax as the binding of a persistent database
variable. For instance, a programmer can create a persistent variable X that is
stored in a database but accessed from a program simply as X, with no special
syntax such as procedure calls or special keywords.
·
Unified typing system. Programming variables and database structures follow the same typing system
and the same (strong, static) type checking. Because of traditional cultural
differences (programming languages deal with individual variables, while
databases deal with collections), this idea requires some combination of the two
cultures. In particular, it should be possible to create volatile collections
and persistent individual objects. Many languages (Pascal/R, DBPL, Napier88, PS-Algol, Galileo, Fibonacci, Tycoon, PJama and others) follow this idea.
·
Unified query and expression language. Traditionally, volatile structures were accessed by
programming expressions, while database structures were accessed by queries. This
subdivision is related to the previous point: expressions deal with
individual variables, while queries deal with collections. The subdivision is justified only by
historical reasons. The idea is to merge the two concepts so that there is no
difference between expressions and queries, e.g. x+y is a query on equal
terms with e.g. Employee where Salary = (x+y).
·
Integrated database programming language. The language makes no distinction between volatile and
persistent entities (except for the persistence status). It is fully based on
queries, which are used as expressions within imperative (updating) statements,
as parameters of procedures, functions and methods, and as specifications of
database abstractions (views, constraints, rules, triggers, etc.). The language
follows a unified strong typing system. Procedural abstractions written in the
language can persist, i.e. they can be stored on the side of a database server as
(stored) procedures, functions, methods, (updatable) views, triggers, etc. The
language supports orthogonal persistence, i.e. free, unlimited combination of the persistence status
with any feature of the language, including data structures, types, procedural
abstractions and database abstractions.
So far, only SBQL as implemented in ODRA fully accomplishes the last, most
complete idea of persistence. Other database programming languages and implemented
database systems always make some eclectic tradeoffs, mainly motivated
by historical (legacy) development, reluctance to make revolutionary changes,
reluctance to develop new programming languages and many unsolved
research problems (strong typing, query optimization, object-oriented updatable
views, etc.).
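The idea of a unified query and expression language listed above can be approximated, very loosely, in Python. This is only a sketch under stated assumptions: Python has no built-in orthogonal persistence, so the persistent collection is simulated with a `pickle` file, and loading it is still an explicit step. The helper `where` and the sample records are hypothetical. The point is merely that one and the same query, built from the volatile variables `x` and `y`, is applied unchanged to a volatile and to a persistent collection.

```python
import os
import pickle
import tempfile


def where(collection, predicate):
    # A single query operator usable on any collection of records,
    # regardless of where the collection came from.
    return [item for item in collection if predicate(item)]


def run_queries():
    x, y = 1000, 500  # ordinary volatile variables

    # A volatile collection that exists only during this run.
    volatile_employees = [
        {"name": "Ann", "salary": 1500},
        {"name": "Bob", "salary": 900},
    ]

    # A "persistent" collection, simulated here by a pickle file.
    path = os.path.join(tempfile.mkdtemp(), "employee.pickle")
    with open(path, "wb") as f:
        pickle.dump([{"name": "Eve", "salary": 1500},
                     {"name": "Dan", "salary": 2000}], f)
    with open(path, "rb") as f:
        persistent_employees = pickle.load(f)

    # Counterpart of: Employee where Salary = (x+y)
    def salary_is_x_plus_y(employee):
        return employee["salary"] == x + y

    return (where(volatile_employees, salary_is_x_plus_y),
            where(persistent_employees, salary_is_x_plus_y))
```

In a language with orthogonal persistence the explicit load step would disappear as well; here it remains visible and marks the limit of the analogy.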
Persistence, impedance mismatch and data independence
The concept of persistence is a consequence of attempts to avoid the impedance mismatch, i.e. the incompatibility of
data models, types, access and updating facilities, program abstractions,
maintenance, refactoring, etc. between the data structures of programming languages
and those of databases. The impedance mismatch is an inherent consequence
of the data independence principle. Hence, to some extent, the concept of persistence
is in opposition to that principle.
In the relationship between impedance mismatch and data independence there
is no ideal solution, only tradeoffs. In particular, a tradeoff is
necessary for the data independence principle. The principle was formulated at
a time when databases (especially relational databases) contained pure data
only. Current database servers, including relational database servers, store
many entities that must be written in a query and programming language. These
entities include:
·
Stored procedures and functions.
·
Triggers, constraints and (business) rules.
·
Stored classes, including methods that are defined
within these classes, inheritance, and other features of object-orientedness.
·
Database views, in particular, updatable database
views.
·
Definitions of workflow processes.
·
Definitions of wrappers, mediators, adapters,
integrators, exporters, importers and other interoperability or data
distribution facilities.
Other entities are possible and are currently being considered, such as
persistent threads, pre- and post-conditions, assertions and so on. One can imagine that these entities could be written in many
languages, but for several reasons such freedom would be disadvantageous or
unrealistic. All such languages would have to be based on the same data structures
(determined by the database model and types), and this limitation greatly reduces
the freedom. The assumption that any
programming language can be used for this purpose is unrealistic for at least
two reasons: (1) early binding assumed in popular languages (which would
exclude many database features such as views, changes in the database schema,
etc.); (2) severe problems with impedance mismatch. Hence, as a final
conclusion, for a given DBMS all such active entities should be written in a
single, integrated query and programming language that deals with persistence
as a regular option. For these reasons the development of database programming
languages and their standards makes great sense.
We also note that these (persistent) entities are prepared during the
database design phase or during database maintenance by a database server
administrator. They can be used by client applications, but are not under the
control of these applications: they are to be designed, programmed and
administered by a database designer or a database administrator. Obviously,
these persistent entities may create and maintain volatile
data during their runs, which makes the distinction between persistent and volatile entities
quite fuzzy.
Relativity of the persistence status, data sharing and transactions
The persistence concept is relatively clear if it is considered with respect to the
subdivision between main memory and magnetic discs as data storage media, or
with respect to the program life cycle. Volatile data are stored in main memory, while
persistent data are stored on discs. Volatile data are available during a run
of a program and unavailable when the program is terminated or inactive,
while persistent data are available on a disc and can be activated at any time
when required. This subdivision, however, becomes much less clear when data are stored in main memory only, which is now the
case in many modern DBMSs (including ODRA) and other data environments. In such
systems magnetic discs may not exist at all or may be used as a back-up
facility only (e.g. ODRA uses memory-mapped
files for this purpose). Such an unclear status of persistence can be observed especially in
data-intensive grid solutions or P2P (peer-to-peer) networks, where many
servers cooperate and each of them can be switched off at any moment, so that all the
data that the server supplies become unavailable. This problem with a clear
definition of the persistence concept has appeared in particular in ODRA, where
all the data are kept within abstract stores and the programmer has no
way to determine how and where the data physically reside. A very
similar problem appears in environments based on CORBA or other
transparent middleware tools.
For such environments the traditional criteria for distinguishing
volatile from persistent data make little sense. For instance, some application
A may create volatile data and then make them available to application B. From the point of view of application A these
data are volatile, and from the point of view of application B these data are
persistent. How can the sense of such a persistence concept be preserved?
One conclusion is that the persistence status is relative: some data can be volatile or persistent depending on the
application acting on them. If so, how and where is the persistence status of
data to be declared?
The concept of persistence makes sense for a single local application
when one would like to distinguish data that are available only while a program is
running from data that retain their state after the program terminates.
This situation seems to be the main case for so-called persistent programming
languages such as PJama. However, this situation is not typical for large
databases. Typically, database applications are subdivided into client and
server processes, and in this case the server keeps persistent data that are
shared among many client applications. The concept of sharing is in this case
more relevant than the concept of persistence, because in any case,
if the server process is not active then the data kept on the server are
unavailable, even if they are persistent. If the server is active, it can export to
clients not only “persistent” data but any data, including volatile
ones. The situation with the persistence status becomes even less clear
when clients and servers are hard to distinguish, as e.g. in
P2P networks. Whether persistent or volatile, data on some server will
be unavailable for external use when the server is switched off or
is down.
Confusing persistence with data sharing is a common mistake of many authors
in the context of transaction processing. If data are not shared, then
transactional semantics is inessential, regardless of whether the data are persistent
or not. If data are shared but cannot be simultaneously updated by many
processes, then transactional semantics is inessential too, regardless of
their persistence status. Transactional semantics is essential only in the
case when data are shared and can be updated by many processes, but in this
case the persistence status of the data does not matter.
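The last case can be illustrated with a small Python sketch. It assumes a purely volatile counter (nothing is ever written to disc) that is shared and updated by several threads. The class `SharedCounter` and the function `run_workers` are hypothetical names, and a mutual-exclusion lock is used only as a simple stand-in for the isolation aspect of transactions, not as a full transaction mechanism.

```python
import threading


class SharedCounter:
    """Volatile data shared by many threads; it is never stored on disc."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # The read-modify-write step must be isolated from other updaters,
        # even though the data are not persistent.
        with self._lock:
            self.value += 1


def run_workers(counter, workers=4, steps=1000):
    # Start several threads that all update the same shared counter.
    def work():
        for _ in range(steps):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

The need for isolation follows here from sharing and concurrent updating alone; making the counter persistent would neither add nor remove that need.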
For any data server, its programmer or administrator should be
equipped with facilities allowing him/her to determine which data entities are to be shared among clients and other
servers. Moreover, he/she may be allowed to determine how the data entities are to be shared. The facilities may include
access and updating rights, (database) views and specific protocols of sharing, such as
transactional semantics. During the development of the ODRA system we have
tried to solve these issues, but without deep feedback from practical
applications it is difficult to assess whether our solutions are optimal for the
majority of cases.
Persistence models
Orthogonal persistence was a feature of many prototype database
programming languages such as PS-Algol, DBPL, Napier88, Galileo and Tycoon. Popular commercial languages
such as C/C++, Java and Smalltalk do not deal with persistence at all. There
have been attempts to introduce orthogonal persistence into some popular languages such
as Java (persistent Java, or PJama). These projects, however, are too modest
with respect to query languages and are based on a limited approach to database
architecture and to the concept of persistence, as we have discussed in the previous
subsection.
For historical reasons, the idea of orthogonal
persistence is sometimes criticized within commercial communities as impractical and unnecessary. In
particular, the ODMG standard does not assume such a feature. However, because
orthogonal persistence has many advantages for new systems, it is almost
certain that it will be the eventual winner in the longer term.
Some authors claim that achieving perfect orthogonal persistence is
impossible because of features such as transaction processing, which is relevant
to persistent data but irrelevant to volatile data. As we have argued in the
previous section, the configuration of concepts, especially concerning
transactions, is more complex than simple observations suggest.
In our opinion, these doubts concerning orthogonal persistence are caused by a
particular understanding of the problem.
Persistence through reachability can be considered a supplement to
orthogonal persistence, motivated by some configurations of persistent
and volatile entities that apparently make little sense. There are a few such
situations that we want to avoid:
·
A volatile object contains a persistent object as a component.
In this case removing the volatile object will remove the persistent
object as well, hence its persistence is a fiction. Alternatively, a persistent object
stored within a volatile object could be moved somehow to another logical place,
but this would require some extra semantics that makes little practical sense.
·
A persistent object contains a volatile object as a
component. This case (implemented in the Loqis system) is imaginable and can
have practical meaning for programming (e.g. keeping temporary results of
calculations within persistent objects). However, mixing up volatile and
persistent entities within one entity may be problematic for database
maintenance, query optimization and garbage collection. Hence, it should be
avoided too.
·
A persistent object contains a pointer (a reference)
to a volatile object. This case is disadvantageous in the same way as the previous
one. If the volatile object is removed, then the persistent object will contain
a dangling pointer (i.e. a pointer leading to garbage or to an improper object).
However, there are well-recognized methods of avoiding dangling pointers, and
this case does not imply anything special.
·
A volatile object contains a pointer (a reference) to
a persistent object. This case is quite reasonable, as in many applications the
programmer may need to refer to persistent objects. Sometimes, however, this
case implies implementation problems connected with, e.g., dangling pointers, garbage
collection and transaction processing.
Indeed, persistence through reachability makes the data organization
cleaner, but it does not undermine the general idea of orthogonal
persistence. As an analogy, in every language it is possible to write some
senseless statements (e.g. an infinite loop), but they rarely influence the
construction of the language. Most frequently they are presented in manuals as
practical warnings and rules of thumb. In any case, in databases and programming
languages everything is in the hands of the designer or programmer, and he/she will
be the first victim of his/her unreasonable decisions during database design or
program writing. This does not mean that designers of database systems should
ignore the above disadvantageous situations: if they imply some additional
implementation effort, the designers are within their rights to forbid them.
The persistence through reachability principle is assumed in the ODMG
standard in the Smalltalk and Java bindings. In the C++ binding the principle does not hold due to the lack of
automatic garbage collection.
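In its usual formulation, persistence through reachability means that an object becomes persistent when it can be reached by references from a persistent root. Python's standard `pickle` module behaves in a comparable way when serializing an object graph, which gives a convenient, if simplified, illustration. The objects `root`, `leaf` and `orphan` below are hypothetical examples, and a serialized byte string stands in for a persistent store.

```python
import pickle

# A small object graph: `root` refers to `leaf`; `orphan` is not referenced.
leaf = {"name": "leaf", "child": None}
root = {"name": "root", "child": leaf}
orphan = {"name": "orphan", "child": None}

# Storing the root stores everything reachable from it, and nothing else.
stored = pickle.dumps(root)

# Restoring the root brings back the reachable `leaf` as well.
restored = pickle.loads(stored)
restored_child_name = restored["child"]["name"]
```

Only `root` is stored explicitly; `leaf` is saved because it is reachable from `root`, while `orphan` is left out because no stored object refers to it.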
Persistence through inheritance assumes that persistence is an invariant
of a class that is inherited by all subclasses and by objects that are members of
these classes. No class can contain, directly or indirectly,
both persistent and volatile objects. It is possible to create a persistent
class that is a specialization of a volatile class, but not vice versa.
The intention of this concept is clearly motivated by the physical
subdivision between persistent and volatile objects. Basically, it assumes
quite different typing and different access or updating interfaces that
inherently depend on the storage medium. This persistence model makes little
sense when the designers abstract from storage media (the system
automatically decides where persistent data are stored) or make storage media
transparent to programmers (as, e.g., in CORBA). The idea makes it impossible (or
more problematic) to write methods that act on both persistent and volatile
data. For instance, if the programmer wants to copy a persistent object to a
volatile store, he or she must create two classes, one for persistent objects
and another for volatile ones, and then write similar methods in both classes. This
is an obvious waste of resources (time, money, size and complexity of program
artifacts, etc.) that is caused by
physical features. The negative features of such an approach are obvious: more
complex schemas and programming interfaces, larger programs, harder maintenance of
applications, etc. The history of computer technologies clearly shows that such
a waste of resources sooner or later becomes critical and unacceptable. The
idea of persistence through inheritance is obviously in conflict with the idea of
orthogonal persistence and the idea of an integrated database query and
programming language. It also obviously promotes impedance mismatch and assumes
it as an inevitable rule.
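The duplication described above can be made concrete with a short Python sketch. All names are hypothetical: `PersistentBase` stands for a persistent root class of the kind that persistence through inheritance requires, and a class-level dictionary is only an in-memory stand-in for a real database store.

```python
class VolatileEmployee:
    """Employee object that lives only in program memory."""

    def __init__(self, name, salary):
        self.name = name
        self.salary = salary

    def raise_salary(self, amount):
        self.salary += amount


class PersistentBase:
    """Hypothetical persistent root: all subclasses inherit persistence."""

    store = {}  # in-memory stand-in for a database store

    def save(self):
        # Record a snapshot of the object's attributes in the store.
        PersistentBase.store[self.name] = dict(vars(self))


class PersistentEmployee(PersistentBase):
    """Persistent counterpart; it cannot reuse VolatileEmployee's code."""

    def __init__(self, name, salary):
        self.name = name
        self.salary = salary
        self.save()

    def raise_salary(self, amount):
        # Same logic as VolatileEmployee.raise_salary, written a second time.
        self.salary += amount
        self.save()


def copy_to_volatile(employee):
    # Copying to the volatile side needs the second, parallel class.
    return VolatileEmployee(employee.name, employee.salary)
```

Every method that should work on both kinds of employee has to be written twice, once per class, which is the waste of resources the text refers to.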
The advantages of this idea lie in the historical legacy. For many years
database management systems have been constructed in such a way that operations on
database structures are very different from operations on volatile
program variables. The data independence principle mentioned before was
the main catalyst of this subdivision. Current technologies, however, strongly
call for revisiting these concepts, especially in the triangle of persistence,
data independence and impedance mismatch.
Last modified: January 10, 2008