Monday, August 3, 2009

Reference data for the Semantic Web

If you want to construct a database of any serious data you quickly realize that you need to share as much common data as possible. A key concept is reference data, which is simply common data that is needed in a number of places within the database schema. Reference data is a tool to help you cope with complexity as well as interoperability. It allows you to leverage the extra effort spent on defining and refining the reference data, so that the rest of the database can depend on the quality and detail of that reference data without having to reinvent the wheel every time a bit of the common data is needed. Examples of reference data include names (and other info) of countries, states, and cities, named entities such as businesses or vendors, names of products and services, codings for colors, shoe sizes, any forms of units, classes of service, types of foods, types of meals, forms of payment, medical conditions and treatments, names of bones, names of animals and plants, etc. In general, any of these pieces of reference data would have at least a natural language name and description, but the most important thing is that each item has an ID or identifier that can be used in the body of the database rather than storing the natural language text repeatedly all over the database.
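To make the idea concrete, here is a minimal sketch in Python of a reference table whose items each carry an ID, a name, and a description, with the body of the data storing only the ID. The Country class, the two-letter IDs, and the orders rows are all assumptions invented for illustration, not part of any particular system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Country:
    """One reference data item: a stable ID plus its natural language text."""
    id: str
    name: str
    description: str


# The reference data, defined and refined in exactly one place.
# (Hypothetical IDs and descriptions, for illustration only.)
COUNTRIES = {
    "FR": Country("FR", "France", "Country in Western Europe"),
    "JP": Country("JP", "Japan", "Island country in East Asia"),
}

# The body of the database stores only the ID, never the repeated text.
orders = [
    {"order_id": 1, "ship_to_country": "FR"},
    {"order_id": 2, "ship_to_country": "JP"},
    {"order_id": 3, "ship_to_country": "FR"},
]


def country_name(order):
    """Resolve an order's country ID to its natural language name."""
    return COUNTRIES[order["ship_to_country"]].name


names = [country_name(order) for order in orders]
```

If the name or description of a country ever needs correcting, it changes in one place and every row that references the ID picks up the fix.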

The generic concept at work here is factoring, in which one or more models are compared, common elements are identified and extracted, and those common elements are then referenced indirectly and managed and controlled separately.

In the context of the Semantic Web, reference data includes global information which is common to many SemWeb applications. A developer may be constructing an application-specific database, but they should be able to leverage the work of others by referencing ontologies and reference data that have already been developed by the global community, such as in the context of the Linked Data Web.

The ID or identifier for a piece of reference data on the Semantic Web is of course represented as a URI, an RDF URI reference.
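As a rough sketch of what that looks like, the snippet below models RDF-style statements as plain Python tuples of (subject, predicate, object). The rdfs:label URI is the standard RDF Schema property; everything under example.org, including the shipToCountry property, is a hypothetical namespace made up for this illustration.

```python
# Standard RDF Schema property for a human-readable label.
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Hypothetical namespaces, assumed purely for illustration.
REF = "http://example.org/reference/country/"  # shared reference data
APP = "http://example.org/myapp/"              # application-specific data

# Each statement is a (subject, predicate, object) triple.
triples = {
    # Reference data: the URI is the ID, the label is the natural language name.
    (REF + "FR", RDFS_LABEL, "France"),
    # Application data refers to the country only by its URI.
    (APP + "order/1", APP + "shipToCountry", REF + "FR"),
    (APP + "order/2", APP + "shipToCountry", REF + "FR"),
}


def objects(subject, predicate):
    """Return every object stated for the given subject and predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def label_of(uri):
    """Return one label for a URI, or None if no label is stated."""
    return next(iter(objects(uri, RDFS_LABEL)), None)


country_uri = next(iter(objects(APP + "order/1", APP + "shipToCountry")))
label = label_of(country_uri)
```

The application triples never repeat the string "France"; they point at the reference URI, and the label is looked up only when a human needs to read it.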

A broad array of reference data is required as a solid foundation on which application developers can build domain-specific and application-specific databases.

Reference data is also a key to being able to match disparate databases which were developed at different times and places. Gradually, databases will begin to share reference data, but at least in the short run, databases can be merged or meshed by figuring out how their common data meshes through the mechanism of reference data.
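Here is a minimal sketch of that kind of meshing, assuming two independently developed databases that each identify countries in their own way. The row layouts, the local codes, the mapping tables, and the example.org URIs are all hypothetical; the point is only that once each local value is mapped to a shared reference ID, rows from the two databases can be matched.

```python
# Hypothetical shared reference namespace, assumed for illustration.
REF = "http://example.org/reference/country/"

# Database A identifies countries with its own three-letter codes.
db_a = [
    {"customer": "Acme", "country_code": "FRA"},
    {"customer": "Sakura", "country_code": "JPN"},
]

# Database B stores the country as free text.
db_b = [
    {"supplier": "Lumiere", "country": "France"},
    {"supplier": "Nordic", "country": "Norway"},
]

# Each database is mapped once onto the shared reference IDs.
a_to_ref = {"FRA": REF + "FR", "JPN": REF + "JP"}
b_to_ref = {"France": REF + "FR", "Norway": REF + "NO"}


def match(rows_a, rows_b):
    """Pair customers with suppliers that share a reference country ID."""
    suppliers_by_ref = {}
    for row in rows_b:
        ref = b_to_ref[row["country"]]
        suppliers_by_ref.setdefault(ref, []).append(row["supplier"])

    pairs = []
    for row in rows_a:
        ref = a_to_ref[row["country_code"]]
        for supplier in suppliers_by_ref.get(ref, []):
            pairs.append((row["customer"], supplier, ref))
    return pairs


matches = match(db_a, db_b)
```

Neither database had to change its internal representation; the only new work was the two small mapping tables onto the shared reference data.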

There can be many levels of reference data. Some data is truly global and readily shared across virtually all other databases. At the other end of the spectrum, there might be a family of applications or a niche domain, such that reference data is a useful data structuring tool, but the impact is much more limited compared to the entire global Semantic Web.

In the traditional database world there is also the concept of master reference data. In that conception, reference data would simply be common within a single database, while master reference data would be common across multiple databases. Both are useful concepts. Individual databases can certainly be structured better using factoring, and data can be more interoperable globally when factoring is done on a more global basis. In the context of the Semantic Web, I will continue to use the simpler term reference data, primarily to refer to global data factoring, but without intending to exclude factoring within individual databases. After all, some of the best global reference data might well originate within a single application before people eventually realize the global benefits.

-- Jack Krupansky

