Tuesday, January 26, 2010

Doing a little Semantic Web programming with RDF2Go

I was actually doing a little (very little) Semantic Web programming yesterday. I did not even realize it until I was done. I was tracking down a nasty time stamp issue with some client code that uses the file/web crawling features of Aperture (1.4), which uses RDF2Go under the hood for storing file names and time stamps. Normally that is all transparent and problem-free, but I was doing something tricky (if people are paying me to do something, you can bet that there is something out of the ordinary involved). To track down the problem I needed to verify the exact file names that Aperture was tracking, and to do that I needed to access and dump the Aperture repository.

Ultimately, I solved my problem fairly easily, but seeing and understanding what was in the repository was a big help.

I won't go into all of the gory details, but some of the concepts are worth noting.

What is Aperture? According to the Aperture home page on SourceForge:

Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems (e.g. file systems, web sites, mail boxes) and the file formats (e.g. documents, images) occurring in these systems.

As I said, Aperture keeps track of those information sources using a repository based on RDF2Go. According to the RDF2Go home page:

RDF2Go is an abstraction over triple (and quad) stores. It allows developers to program against rdf2go interfaces and choose or change the implementation later easily.

Each RDF graph is stored as a model in RDF2Go. Each RDF2Go model has a context. Essentially the context is the name for the named graph that is stored as a model.

An RDF2Go repository contains one or more models, collectively referred to as a model set. In other words, a repository can hold multiple named RDF graphs.

And finally, an RDF2Go model consists of any number of statements, the actual RDF statements that make up the named RDF graph. Each RDF statement is a triple with a subject, a predicate, and an object (S, P, O). The subject and predicate are typically URIs, while the object can be either a URI or a literal value such as a date. My errant file names were stored in the subject field and the time stamps in the object field. The root path for my Aperture crawl was stored as the context or model name. Ultimately, Aperture stored two statements for each file (one a date, the other the time stamp). Iterating through the models in the model set gave me a list of the context names, which were my root file paths (sometimes file system paths, sometimes Web URLs).
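To make that layout concrete, here is a minimal sketch in plain Java. It deliberately does not use the RDF2Go API; it only mimics the shape of the data (a model set holding named models, each holding statements). The class name, the predicate URIs, and the file paths are all made up for illustration and are not what Aperture actually stores.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ModelSetSketch {
    /** One RDF statement; all three parts are kept as plain strings in this sketch. */
    public record Statement(String subject, String predicate, String object) {}

    /** The "model set": context (graph name) mapped to the statements in that graph. */
    private final Map<String, List<Statement>> models = new LinkedHashMap<>();

    /** Add a statement to the model named by the given context. */
    public void add(String context, String subject, String predicate, String object) {
        models.computeIfAbsent(context, k -> new ArrayList<>())
              .add(new Statement(subject, predicate, object));
    }

    /** The context names, which would be the crawl roots in my case. */
    public List<String> contexts() {
        return new ArrayList<>(models.keySet());
    }

    /** Render every statement as one line of text, grouped under its context. */
    public List<String> dump() {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, List<Statement>> entry : models.entrySet()) {
            lines.add("context: " + entry.getKey());
            for (Statement s : entry.getValue()) {
                lines.add("  " + s.subject() + " | " + s.predicate() + " | " + s.object());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        ModelSetSketch set = new ModelSetSketch();
        // Hypothetical crawl root, file, and predicates: two statements per file.
        set.add("file:/data/docs/", "file:/data/docs/a.txt", "urn:example:date", "2010-01-25");
        set.add("file:/data/docs/", "file:/data/docs/a.txt", "urn:example:timestamp", "1264400000000");
        set.dump().forEach(System.out::println);
    }
}
```

Dumping the real repository followed the same pattern: walk the models, print each context, then print the subject, predicate, and object of every statement in it.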

RDF2Go is not really a data repository itself, but an abstraction that can work with a variety of underlying repositories, or so-called stores.

The difference between a quad store and a triple store is that a triple store by itself represents an unnamed graph, while a quad store is capable of representing named graphs, with that fourth piece of information being the context or graph name. In practice, a lot of people use the terms interchangeably and we tend to implicitly forgive people who refer to quad stores as triple stores.
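A small sketch may help show why the fourth element matters. Again this is plain Java with made-up names, not RDF2Go code: a quad is modeled simply as a triple plus a context string, and the same triple is stored under two different (hypothetical) graph names.

```java
import java.util.HashSet;
import java.util.Set;

public class QuadVsTriple {
    /** Subject, predicate, object: all a triple store has to work with. */
    public record Triple(String subject, String predicate, String object) {}

    /** A triple plus the name of the graph (context) it belongs to. */
    public record Quad(String context, Triple triple) {}

    /** Drop the context from each quad and count the distinct triples that remain. */
    public static int distinctTriples(Set<Quad> quads) {
        Set<Triple> triples = new HashSet<>();
        for (Quad q : quads) {
            triples.add(q.triple());
        }
        return triples.size();
    }

    public static void main(String[] args) {
        // Hypothetical statement that happens to appear in two named graphs.
        Triple t = new Triple("file:/data/docs/a.txt", "urn:example:date", "2010-01-25");
        Set<Quad> quads = new HashSet<>();
        quads.add(new Quad("urn:example:graph1", t));
        quads.add(new Quad("urn:example:graph2", t));

        System.out.println("quads: " + quads.size());
        System.out.println("triples once the context is dropped: " + distinctTriples(quads));
    }
}
```

With the context, the two entries stay distinct; without it, they collapse into a single statement in one unnamed graph, and there is no way to tell which crawl root it came from.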

-- Jack Krupansky


