RDFAlchemy

The goal of RDFAlchemy is to give anyone who uses Python object-style API access to an RDF triplestore.

The same way that:

  • SQLAlchemy is an ORM (Object Relational Mapper) for relational database users
  • RDFAlchemy is an ORM (Object RDF Mapper) for semantic web users.

News: Trunk now includes:

  • Read/Write access for collections and containers
  • Read access to SPARQL endpoints
  • Read/Write access to Sesame2
  • Cascading delete
  • Chained descriptors and predicate range->class mapping

Related resources

Installation

RDFAlchemy is now available at the Cheeseshop. Just type:

easy_install rdfalchemy

If you don't have setuptools installed... well, you should, so go get it. Trust me.

Code

Browse dev code at http://www.openvest.com/trac/browser/rdfalchemy/trunk and see the current trunk and all history.

SVN

This is an actively developing project, so bugs come and go. Get your svn access to the trunk at:

  svn checkout http://www.openvest.com/svn/public/rdfalchemy/trunk  rdfalchemy

User Group

You can now visit rdfalchemy-dev at Google Groups.

Bugs can be reported here directly if you have an OpenID to log in.

API Docs

There are epydoc API Docs at http://www.openvest.com/public/docs/rdfalchemy/api/. You can also use the links there to browse the source, but it might not be current with the trunk.

RDFAlchemy overview

The use of persistent objects in RDFAlchemy will be as close as possible to what it would be in SQLAlchemy. Code like:

>>> c = Company.get_by(symbol = 'IBM')
>>> print c.companyName
International Business Machines Corp.

This code does not change as the user migrates from SQLAlchemy to RDFAlchemy and back, lowering the bar for adoption of RDF-based datastores.

Capabilities

  • SQLAlchemy interface
  • Caching of data reads
  • Access from multiple datastores:
  • Access to RDF triples from SQL databases through D2Rq

SQL Alchemy

SQLAlchemy was chosen over the other popular Python ORM (SQLObject) because:

  1. There appears to be a migration of some users from SQLObject to SQLAlchemy. This appears to be due in part to some of the more sophisticated SQL capabilities of SQLAlchemy.
  2. SQLAlchemy is being used in future releases of trac and Pylons, two systems in active use at Openvest.
  3. SQLAlchemy has a line of demarcation between the SQL and ORM portions of the library. RDFAlchemy similarly provides the rdflib api to SPARQL and Sesame graphs.

Note: to avoid namespace clashes, SQLAlchemy 0.4 will make more use of 'query' as the access point for methods, so that:

c = Company.get_by(symbol='IBM')
# will become
c = Company.query.get_by(symbol='IBM')

I don't much like it but RDFAlchemy will move to play copy-cat...I mean to provide a standardized API.

Descriptors

Understanding Descriptors is key to using RDFAlchemy. A descriptor binds an instance variable to the calls to the RDF backend storage.

Class definitions are simple with the RDFAlchemy descriptors. The descriptors are implemented with caching along the lines of this recipe. The predicate must be passed in.

from rdflib import Namespace
from rdfalchemy import rdfSubject, rdfSingle, rdfMultiple

ov = Namespace('http://owl.openvest.org/2005/10/Portfolio#')
vcard = Namespace("http://www.w3.org/2006/vcard/ns#")

class Company(rdfSubject):
    rdf_type = ov.Company
    symbol = rdfSingle(ov.symbol,'symbol')  #second param is optional
    cik = rdfSingle(ov.secCik)
    companyName = rdfSingle(ov.companyName)
    address = rdfSingle(vcard.adr)
    stock = rdfMultiple(ov.hasIssue)


c = Company.get_by(symbol = 'IBM')
print "%s has an SEC symbol of %s" % (c.companyName, c.cik)

  • rdfSingle returns a single literal
  • rdfMultiple returns a Python list (possibly a list of one) when the predicate:
    • appears in multiple triples, so that (s p o1) (s p o2) etc. yields [o1, o2]
    • points to an RDF Collection (rdf:List)
    • points to an RDF Container (rdf:Seq, rdf:Bag or rdf:Alt)
  • rdfList returns a list (possibly a list of one) and on save will save as an RDF Collection (rdf:List)
  • rdfContainer returns a list and on save will save as an rdf:Seq.
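To make the descriptor idea concrete, here is a minimal sketch in plain Python with no RDF library at all. It is not RDFAlchemy's implementation: the TRIPLES set, SingleSketch and SubjectSketch are hypothetical names invented for illustration, and a real backend would be an rdflib graph rather than a set of tuples. It only shows how a descriptor can turn attribute access into a backend lookup and cache the answer.

```python
# Hypothetical stand-in for a triplestore: a set of (subject, predicate, object) tuples.
TRIPLES = {
    ("urn:ibm", "ov:symbol", "IBM"),
    ("urn:ibm", "ov:companyName", "International Business Machines Corp."),
}


class SingleSketch(object):
    """Illustrative descriptor: reads one value for a predicate and caches it."""

    def __init__(self, pred):
        self.pred = pred  # the predicate must be passed in

    def __get__(self, obj, cls=None):
        if obj is None:
            return self  # accessed on the class, not on an instance
        cache = obj.__dict__.setdefault("_cache", {})
        if self.pred not in cache:
            obj.lookups += 1  # one "database call"
            matches = [o for (s, p, o) in TRIPLES
                       if s == obj.resUri and p == self.pred]
            cache[self.pred] = matches[0] if matches else None
        return cache[self.pred]


class SubjectSketch(object):
    companyName = SingleSketch("ov:companyName")

    def __init__(self, resUri):
        self.resUri = resUri
        self.lookups = 0  # counts backend reads, to show the cache working


c = SubjectSketch("urn:ibm")
first = c.companyName   # reads from TRIPLES
second = c.companyName  # served from the per-instance cache
```

The second read never touches TRIPLES, which is the whole point of caching in the descriptor rather than in the caller.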

Chained predicates

Predicates can now be chained, as in:

c = Company.get_by(symbol='IBM')
print c[vcard.adr][vcard.region]
## or
print c.address[vcard.region]

This works because the generic rdfSubject[predicate.uri] notation maps to rdfSubject.__getitem__, which endeavors to return an instance of rdfSubject.
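A rough sketch of why item access can chain, again in plain Python and not taken from the RDFAlchemy source. NodeSketch, TRIPLES and SUBJECTS are hypothetical names, and the test "is this value also a subject?" merely stands in for whatever node-type check the library really performs.

```python
# Hypothetical in-memory data: a company pointing at a blank-node address.
TRIPLES = {
    ("urn:ibm", "vcard:adr", "_:addr1"),
    ("_:addr1", "vcard:region", "NY"),
}
SUBJECTS = {s for (s, p, o) in TRIPLES}


class NodeSketch(object):
    def __init__(self, resUri):
        self.resUri = resUri

    def __getitem__(self, pred):
        for (s, p, o) in TRIPLES:
            if s == self.resUri and p == pred:
                # Wrap values that are themselves subjects so that the
                # next [predicate] lookup has something to work on.
                return NodeSketch(o) if o in SUBJECTS else o
        raise KeyError(pred)


c = NodeSketch("urn:ibm")
region = c["vcard:adr"]["vcard:region"]
```

Because the first lookup returns another wrapped node rather than a bare identifier, the second lookup is just an ordinary item access on that node.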

Chained descriptors

The __init__ functions for the descriptors now take an optional argument of range_type. If you know the rdf.type (meaning the uriref of the type) you may pass it to the descriptor's __init__.

Within the samples module, a DOAP.Project maintainer is a FOAF.Person:

from rdflib import Namespace
from rdfalchemy import rdfSubject, rdfSingle

DOAP = Namespace("http://usefulinc.com/ns/doap#")
FOAF = Namespace("http://xmlns.com/foaf/0.1/")

class Project(rdfSubject):
    rdf_type = DOAP.Project
    name = rdfSingle(DOAP.name)
    # ... some other descriptors here
    maintainer = rdfSingle(DOAP.maintainer, range_type=FOAF.Person)

from rdfalchemy.samples.foaf import Person
from rdfalchemy.orm import mapper

mapper()
# some method to find an instance
p = Project.ClassInstances().next()
p.maintainer.mbox  # maintainer comes back as a Person

Getting this mapping requires three steps:

  1. Classes must be declared with the proper rdf_type Class variable set
  2. Descriptors that return an instance of a python class should be created with the optional parameter of range_type with the same type as in step 1.
  3. Call the mapper() function from rdfalchemy.orm. This can be called later to 'remap' classes at any time.

The bindings are not created until the third step so classes and descriptors can be created in any order.
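The three steps can be sketched without any RDF library. This is only an illustration of late binding through a type registry, not the code in rdfalchemy.orm: SubjectSketch, RangeSketch, mapper_sketch and the DATA dict are all hypothetical names. Project is deliberately declared before Person to show that declaration order does not matter.

```python
# Hypothetical backend: (subject, predicate) -> object identifier.
DATA = {("urn:proj", "doap:maintainer"): "urn:phil"}


class SubjectSketch(object):
    rdf_type = None

    def __init__(self, resUri):
        self.resUri = resUri


class RangeSketch(object):
    """Illustrative descriptor that wraps its value in the class mapped to range_type."""

    def __init__(self, pred, range_type=None):
        self.pred = pred
        self.range_type = range_type
        self.range_class = SubjectSketch  # generic until mapper_sketch() runs

    def __get__(self, obj, cls=None):
        if obj is None:
            return self
        return self.range_class(DATA[(obj.resUri, self.pred)])


def all_subclasses(cls):
    for sub in cls.__subclasses__():
        yield sub
        for deeper in all_subclasses(sub):
            yield deeper


def mapper_sketch():
    """Step 3: bind every range_type to the class that declares that rdf_type."""
    classes = list(all_subclasses(SubjectSketch))
    by_type = {c.rdf_type: c for c in classes if c.rdf_type}
    for c in classes:
        for attr in vars(c).values():
            if isinstance(attr, RangeSketch) and attr.range_type in by_type:
                attr.range_class = by_type[attr.range_type]
    return by_type


class Project(SubjectSketch):                       # step 1
    rdf_type = "doap:Project"
    maintainer = RangeSketch("doap:maintainer",     # step 2
                             range_type="foaf:Person")


class Person(SubjectSketch):                        # declared after Project
    rdf_type = "foaf:Person"


before = type(Project("urn:proj").maintainer).__name__
mapper_sketch()
after = type(Project("urn:proj").maintainer).__name__
```

Before the mapper runs the maintainer is a generic subject; afterwards the same attribute access yields a Person.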

Hybrid SQL/RDF Alchemy Objects

Any Python object can respond to RDFAlchemy requests as long as it meets two requirements:

  1. That some instance object inst be able to respond to an inst.resUri call (it needs to know its URI)
  2. That there be some descriptor (like rdfSingle()) defined for the instance inst or its class type(inst)

The first requirement could be satisfied by creating some type of mixin class and inheriting from multiple base objects. Maybe I'll go there some day but the behavior of get_by would be uncertain (unless I reread the precedence rules :-). In the meantime we can assign or look up the relevant URI for the object (assignment could be defined via the D2Rq vocabulary).

From there you can assign descriptors on the fly and access your triplestore. RDF descriptors pull from the RDF triplestore via RDFAlchemy and the rest pull from the relational database via SQLAlchemy. A developer need not put all of their data in one repository.

You can mix and match SQL, rdflib and SPARQL data with little effort.

CRUD

Create

class Person(rdfSubject):
    rdf_type = FOAF.Person
    first = rdfSingle(FOAF.givenname)
    last = rdfSingle(FOAF.surname)

p1 = Person()  # creates a bnode with an [rdf:type foaf:Person] triple
p2 = Person('<http://www.openvest.com/user/phil>')  # creates a URIRef with the same triple
p3 = Person(last="Cooper", first="Philip")  # creates a bnode with 3 triples (rdf:type, foaf:surname, foaf:givenname)

Read

Reading is simply a matter of using the declared descriptors

c = Company.get_by(symbol='IBM')
print c.companyName
print c.address.region

If a descriptor is not defined for a predicate and you still want to access the value, you can use the __getitem__ dictionary-style access:

print c[ov.companyName]
print c[vcard.adr][vcard.region]

The flexibility of the item access is ok but descriptors should be used whenever possible as they are much more intelligent. They:

  • cache database calls
  • return the proper class of the returned item if orm.mapper() has been called
  • return lists correctly for collections (Lists, and Containers both)

Update

Writing to the database for rdflib is done at the time of assignment. It currently only performs set or delete operations for rdfSingle descriptors as the behavior for rdfMultiple is more ambiguous.

The basic syntax for the rdfSingle descriptors is:

ibm = Company.get_by(symbol = 'IBM')
sun = Company.get_by(symbol = 'JAVA')

## add another descriptor on the fly
Company.industry = rdfSingle(ov.yindustry,'industry')

## add an attribute (to the database)
sun.industry = 'Computer stuff'

## delete an attribute (from the database)
del ibm.industry
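Write-on-assignment falls out of the descriptor protocol. The sketch below uses only plain Python; WritableSingleSketch, CompanySketch and the TRIPLES set are hypothetical, and the real library writes to an rdflib graph instead. It shows a descriptor attached on the fly whose assignment replaces the stored triple and whose del removes it.

```python
# Hypothetical stand-in for the database.
TRIPLES = set()


class WritableSingleSketch(object):
    """Illustrative single-valued descriptor with set and delete support."""

    def __init__(self, pred):
        self.pred = pred

    def _clear(self, obj):
        stale = [t for t in TRIPLES if t[0] == obj.resUri and t[1] == self.pred]
        for t in stale:
            TRIPLES.discard(t)

    def __get__(self, obj, cls=None):
        if obj is None:
            return self
        for (s, p, o) in TRIPLES:
            if s == obj.resUri and p == self.pred:
                return o
        return None

    def __set__(self, obj, value):
        # "set" semantics: drop any old value, then write the new one immediately
        self._clear(obj)
        TRIPLES.add((obj.resUri, self.pred, value))

    def __delete__(self, obj):
        self._clear(obj)


class CompanySketch(object):
    def __init__(self, resUri):
        self.resUri = resUri


# add a descriptor after the class has been defined
CompanySketch.industry = WritableSingleSketch("ov:industry")

sun = CompanySketch("urn:sun")
sun.industry = "Computer stuff"
stored = ("urn:sun", "ov:industry", "Computer stuff") in TRIPLES
del sun.industry
remaining = len(TRIPLES)
```

For a multi-valued descriptor it is less obvious whether assignment should replace or append, which is the ambiguity mentioned above.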

Delete

To delete a record, use the remove() method. Removing an object from a graph database is more complicated than just removing the triples where the item is the subject.

def remove(self, node=None, db=None, cascade = 'bnode', bnodeCheck=True):
        """remove all triples where this rdfSubject is the subject of the triple
        db -- limit the remove operation to this graph
        node -- node to remove from the graph defaults to self
        cascade -- must be one of:
                    * none -- remove none
                    * bnode -- (default) remove all unreferenced bnodes
                    * all -- remove all unreferenced bnode(s) AND uri(s)
        bnodeCheck -- boolean 
                    * True -- (default) check bnodes and raise exception if there are
                              still references to this node
                    * False -- do not check.  This can leave orphaned object reference 
                               in triples.  Use only if you are resetting the value in
                               the same transaction
        """

The important thing to understand here is that the default behavior is to cascade the delete, recursively deleting all object nodes that are not the object of any other triples. This correctly deletes all lists and containers and things like the maintainer triples for a DOAP record or the author records of a bibliographic item.
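The cascade rule is easier to see on a toy graph. The function below is a simplified sketch over a set of string tuples, not the remove() method itself: remove_sketch is a hypothetical name, the "_:" and "urn:" prefixes are an assumed way of telling bnodes and URIs apart, and the bnodeCheck option is left out entirely.

```python
def remove_sketch(triples, node, cascade="bnode"):
    """Remove triples with `node` as subject, then cascade to orphaned objects."""
    outgoing = [t for t in triples if t[0] == node]
    for t in outgoing:
        triples.discard(t)
    for (_, _, obj) in outgoing:
        is_bnode = obj.startswith("_:")   # assumed bnode marker
        is_uri = obj.startswith("urn:")   # assumed URI marker
        eligible = ((cascade == "bnode" and is_bnode) or
                    (cascade == "all" and (is_bnode or is_uri)))
        still_referenced = any(t[2] == obj for t in triples)
        if eligible and not still_referenced:
            remove_sketch(triples, obj, cascade)


graph = {
    ("urn:proj", "doap:name", "rdfalchemy"),
    ("urn:proj", "doap:maintainer", "_:m1"),
    ("_:m1", "foaf:name", "Philip Cooper"),
    ("urn:proj", "doap:helper", "_:m2"),
    ("urn:other", "foaf:knows", "_:m2"),   # _:m2 is referenced elsewhere
    ("_:m2", "foaf:name", "Someone Else"),
}

remove_sketch(graph, "urn:proj")
```

After the call, _:m1 is gone along with the project because nothing else pointed at it, while _:m2 survives because urn:other still references it.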

Utility methods

The RDFAlchemy api is starting to grow a little bit.

In addition to the get_by which returns a single instance there is now a filter_by which returns a list of instances.
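The difference between the two is only in what comes back. A sketch over an in-memory list follows; CompanySketch and its _instances list are hypothetical, and raising LookupError when nothing matches is an assumption of this sketch rather than documented behavior.

```python
class CompanySketch(object):
    _instances = []  # hypothetical stand-in for the datastore

    def __init__(self, **attrs):
        self.__dict__.update(attrs)
        CompanySketch._instances.append(self)

    @classmethod
    def filter_by(cls, **kwargs):
        """Return every instance whose attributes match all keyword arguments."""
        return [inst for inst in cls._instances
                if all(getattr(inst, k, None) == v for k, v in kwargs.items())]

    @classmethod
    def get_by(cls, **kwargs):
        """Return one matching instance (raising on no match is assumed here)."""
        matches = cls.filter_by(**kwargs)
        if not matches:
            raise LookupError("no match for %r" % (kwargs,))
        return matches[0]


CompanySketch(symbol="IBM", industry="Computers")
CompanySketch(symbol="JAVA", industry="Computers")

one = CompanySketch.get_by(symbol="IBM")               # a single instance
many = CompanySketch.filter_by(industry="Computers")   # a list of instances
```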

For console users (you are using IPython, aren't you?) you should check out the ppo method, which dumps predicate-object pairs to the console.

There is now a create_engine utility method in the engine submodule.

There is a samples submodule where some classes like Foaf and Doap show sample usage of RDFAlchemy, and a subdirectory where some RDF schemas will be provided.

Other RDF mappers

TRAMP, from the mind of Aaron Swartz. The clean use of rdflib Namespace type mapping is carried forward in RDFAlchemy.

>>> c = Company.get_by(symbol = 'IBM')
>>> print c.companyName
International Business Machines Corp.
>>>
>>> from rdflib import Namespace
>>> ov = Namespace('http://owl.openvest.org/2005/10/Portfolio#')
>>> print ov.companyName 
http://owl.openvest.org/2005/10/Portfolio#companyName
>>> print c[ov.companyName]
International Business Machines Corp.

This provides the user with complete flexibility. Any predicate can be given using the dict style notation. The predicate values can even be determined dynamically at run time.

In Sparta, however, the Namespace prefix is brought forward into the attribute name, giving something like c.ov_companyName. I don't like this and will not carry it forward. If you know the prefix mapping and predicate name, use the TRAMP-style dict access as above. If you want pythonic dot notation access, you should use descriptors. You can even declare them after the definition of the class, as in:

Company.stockDescription = rdfSingle(ov.stockDescription,'stockDescription')
print c.stockDescription

SPARQL Endpoints

WARNING: early alpha code at work here. It provides read-only access.

Standalone use

This module can stand alone. It is not dependent on the rest of RDFAlchemy. You can use it as a drop-in replacement for many rdflib ConjunctiveGraph applications.

Ported methods include:

  • triples including derivative methods like:
    • subjects, predicates, objects
    • predicate_objects, subject_predicates etc
    • value

The following update methods will not work for SPARQL endpoints, as they are read-only (see Sesame below):

  • add and remove including derivatives like:
    • set
  • parse and load including the ability to load from a url

SELECT
  Returns a generator that yields one tuple per result.
CONSTRUCT
  Returns an rdflib ConjunctiveGraph('IOMemory') instance which can be:
  • queried through the rdflib api
  • assigned as the db element to an rdfSubject instance
  • serialized to 'n3' or 'rdf/xml'

Sesame endpoints

RDFAlchemy can provide read access to Sesame through endpoints. SELECT and CONSTRUCT methods are supported.

If you know you have a Sesame2 endpoint, use SesameGraph() rather than SPARQLGraph, as it has different capabilities.

Joseki endpoints

RDFAlchemy can provide read access to Joseki through endpoints. SELECT, CONSTRUCT, and DESCRIBE methods are supported.

The triples method works but does not currently operate as a true stream. Therefore:

db.triples((None,None,None))

will attempt to load the entire endpoint into a memory-resident graph and then iterate over the results.

Relational Data thru SPARQL

In general, if your data is in a relational database, you will probably want to use SQLAlchemy as your ORM. If, however, that data is in a relational table (yours or someone else's) across the web, and has a SPARQL mapper on top of it, RDFAlchemy becomes your tool.

D2R Server

D2R Server includes a Joseki servlet. If you deploy a D2R Server you can access your relational database tables through the web as an RDF datastore. RDFAlchemy usage looks like SQLAlchemy but now it can reach across the web into your RDBMS (postgres, mysql, oracle, db2 etc).

D2R Server is used internally at Openvest but there are other engines which should all be accessible through the RDFAlchemy SPARQL client.

Other SPARQL / SQL maps

Other active projects providing SPARQL access to relational databases are:

  • SquirrelRDF. In addition to relational databases, SquirrelRDF also supports access to LDAP directories.
  • Virtuoso, which seems to use a pretty smart rewriting algorithm and also supports Named Graphs.
  • DartQuery. DartQuery is a component of the DartGrid application framework which rewrites SPARQL queries as SQL against legacy relational databases.
  • SPASQL is an open-source module compiled into the MySQL server to give MySQL native support for RDF.

Sesame

The RDFAlchemy trunk now includes access to openrdf Sesame2 datastores. SesameGraph is a subclass of SPARQLGraph and builds on the SPARQL endpoint capabilities, adding write access via the Sesame2 HTTP protocol. Just pass the url of the Sesame2 repository endpoint and from there you can use an rdflib-type api or use the returned graph in rdfSubject as you would any rdflib database.

Standalone use

This module can stand alone. It is not dependent on the rest of RDFAlchemy. You can use it as a drop-in replacement for many rdflib ConjunctiveGraph applications.

Ported methods include:

  • triples including derivative methods like:
    • subjects, predicates, objects
    • predicate_objects, subject_predicates etc
    • value
  • add and remove including derivatives like:
    • set
  • parse and load including the ability to load from a url

from rdfalchemy.sesame2 import SesameGraph
from rdflib import Namespace, Literal

doap = Namespace('http://usefulinc.com/ns/doap#')
rdf = Namespace('http://www.w3.org/1999/02/22-rdf-syntax-ns#')

db = SesameGraph('http://localhost:8080/sesame/repositories/testdoap')
db.load('data/rdfalchemy_doap.rdf')
db.load('http://doapspace.org/doap/some_important.rdf')

# find the project named 'rdflib' and print its predicate/object pairs
project = db.value(None, doap.name, Literal('rdflib'))
for p, o in db.predicate_objects(project):
    print '%-30s = %s' % (db.qname(p), o)

RDFAlchemy use of Sesame

You can use it as you would any rdflib database. Near the head of your code, place a call like

from rdfalchemy import rdfSubject
from rdfalchemy.sesame2 import SesameGraph

rdfSubject.db = SesameGraph('http://some-place.com/repository')

Other Python SPARQL endpoints

Some of these have nice code which I hope to migrate into RDFAlchemy. For the impatient, you can check out:

Jython

Not sure if the project is ready to branch. If the Sesame2 HTTP access provided above is not enough and you need to access Sesame and/or Jena with Python, you can check out the RDFAlchemyJython page for some samples.