Thursday, July 16, 2009

Reasoning, the rational, irrational, objective, subjective, and the realm of the nonrational

Reasoning is a critical capability needed to survive and thrive in the modern world. It is also a foundation of modern computing. But, in the "real" world, reasoning is not always the foundation of all thought and action. We use the term rational to characterize thinking and behavior that employs reasoning. We use the term irrational to characterize thinking and behavior that at least appears to "defy all logic" or "flies in the face of reason." In general, rational thought and action are considered good and irrational thought and behavior are considered bad.

In the context of this note I am concerned mainly with the communication of information, beliefs, observations, facts, logic, and conclusions, so even if an individual may act reasonably (possibly even by flipping a coin or reading tea leaves), the question is whether they are able to effectively communicate their thought processes and observations to the proverbial neutral observer.

Reasoning works well when we have access to an objective view of the facts, when all relevant parties can agree on the truth of the facts.

Reasoning tends to break down when individual views of the facts are subjective. If we can't agree on the truth of the facts, we are less likely to come to compatible conclusions, except maybe by chance.

There is nothing wrong with subjectivity per se, and it may be an essential quality of much of the "real" world, but it does suggest that we cannot categorize all thinking and action as either rational or irrational.

Intuition is one example of thinking that defies categorization as rational or irrational and is not clearly based on reasoning per se.

Gut feel is another example of a mental process that defies categorization as rational or irrational.

Personal preferences are commonly not guided exclusively by reasoning.

I would suggest that there is a realm of the nonrational which includes all forms of thought and behavior that might be considered reasonable by at least some neutral observers, but cannot clearly be characterized as strictly rational or irrational.

I would not categorize all aspects of religion, ethics, and aesthetics in the realm of the nonrational, but clearly the spiritual, including the existence and nature of a deity, the existence of a soul, and life before and after death would seem to fit nicely in the realm of the nonrational. Concepts such as beauty and preferred behavior and social values do not strictly flow from hard reasoning, but that does not make them implicitly unreasonable and irrational. They may have significant value to society even if we are currently unable to elucidate a formal logic for such a conclusion.

Many forms of mysticism might also reasonably be categorized as being in the realm of the nonrational, but any form of mysticism based on outright fraud should clearly go into the category of irrational. Or, maybe fraudulent mysticism should actually remain in the realm of the nonrational, but merely flagged as fraudulent, especially since "true" believers might not be inclined to accept any form of reasoning about their cherished beliefs.

I would not suggest that all forms of subjectivity should automatically be categorized as being in the realm of the nonrational. In cases where the range of subjectivity is fairly narrow and bounded, we can still reason reasonably effectively. But where the range of subjectivity is all over the map, unbounded, and unbridled, clearly reasoning is of little value.

Another general category in the realm of the nonrational consists of beliefs and claims of behavior that by their very definition and nature cannot be verified by observation or any amount of logic. Some examples are:

  • Out of body experiences
  • Communicating with the dead
  • Seeing the future
  • Recalling a past life
  • Having visions that others cannot see
  • Hearing voices in one's head that others cannot hear
  • Characterizing one's soul
  • Claiming the existence of a true soulmate
  • Claiming to have seen or done something without any credible, verifiable evidence

In a Knowledge Web, it does make sense to be able to represent both the irrational and the nonrational in addition to the clearly rational. This does highlight one of the difficulties with reasoning within the context of a Knowledge Web. One might derive a conclusion from irrational or nonrational claims, but one needs to be sure to properly categorize the result based on the strength or weakness of the claims upon which the "reasoning" is based.
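One way to picture that categorization step is a "weakest link" rule: a derived conclusion is labeled no more strongly than its weakest supporting claim. The sketch below is only an illustration under that assumption. The `Basis` enum, its ordering, and `categorize_conclusion` are hypothetical names, not part of any existing Knowledge Web design.

```python
from enum import IntEnum


class Basis(IntEnum):
    """Hypothetical categories for a claim, ordered weakest first (an assumption)."""
    IRRATIONAL = 0
    NONRATIONAL = 1
    RATIONAL = 2


def categorize_conclusion(premises):
    """Label a derived conclusion with the weakest category among its premises."""
    if not premises:
        raise ValueError("a conclusion needs at least one premise")
    return min(premises)


# A conclusion resting on one rational and one nonrational claim
# is itself flagged as nonrational under this rule.
result = categorize_conclusion([Basis.RATIONAL, Basis.NONRATIONAL])
print(result.name)
```

Other rules are possible, such as keeping the full list of premise categories alongside the conclusion rather than collapsing them to one label.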

In any case, the representation and use of the nonrational in a Knowledge Web is worthy of further consideration.

-- Jack Krupansky

Wednesday, July 15, 2009

Published my Semantic Web glossary

I updated and published more of my Semantic Web definitions.

In addition to being in a hyperlinked web (Semantic Web is a good place to start), the terms are listed alphabetically in my Semantic Web Glossary.

My own glossary is far from complete and not as readable as a traditional glossary, so if you are looking for an easy to read introductory glossary, check out Alex Genadinik's Semantic Web Glossary.

-- Jack Krupansky

Published more of my Semantic Web-related definitions

I will continue tuning and extending my Semantic Web definitions, but I did publish what I have so far:

-- Jack Krupansky

Tuesday, July 14, 2009

Published my updated Semantic Web definition

I am still working on my Semantic Web definitions, but I did publish what I have so far:

Now I need to work on definitions for:

  • Linked Data
  • Web of Linked Data
  • Web application
  • Semantic Web application
  • Linking Open Data (LOD)
  • LOD cloud
  • metadata
  • RDFa
  • microformat

-- Jack Krupansky

Monday, July 13, 2009

First draft of my Semantic Web definition

I am still working on it, but here is my initial draft of my own definition for Semantic Web:

The Semantic Web is the architecture, technologies, and implementation of the vision of the Web of data which enables data to be shared and reused across application, enterprise, and community boundaries as a hyperlinked collection of data and metadata represented as Web resources combined with RDF triple statements that describe the details, meaning, and relationships among resources in a form that can be readily processed by computer software such as Web applications and software agents in a manner meaningful to applications rather than the presentation form of the traditional Web.

It is not my intention to reinvent or re-envision the Semantic Web, but simply to come up with a reasonably concise and accurate definition since there is none available today.

I am not completely satisfied with this draft, but I think it does say everything that is needed, even if it is a bit verbose.
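To make the "RDF triple statements" part of the draft definition a little more concrete, here is a minimal sketch that models triples as plain (subject, predicate, object) tuples. It does not use an RDF library. The subject URI under example.org is hypothetical, and the choice of the FOAF `name` and `nick` properties is only an illustrative assumption.

```python
# Illustrative only: RDF-style triples as plain (subject, predicate, object) tuples.
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
PERSON = "http://example.org/people/jack"  # hypothetical resource URI

triples = [
    (PERSON, RDF_TYPE, FOAF + "Person"),
    (PERSON, FOAF + "name", "John William Krupansky"),
    (PERSON, FOAF + "nick", "Jack"),
]


def objects(graph, subject, predicate):
    """Return every object stated for the given subject and predicate."""
    return [o for s, p, o in graph if s == subject and p == predicate]


print(objects(triples, PERSON, FOAF + "nick"))
```

The point is only that each statement links a resource to a value or to another resource in a form software can process directly, rather than in a form meant for presentation.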

-- Jack Krupansky

Friday, July 10, 2009

Problems, questions, answers, issues, ideas, speculation, processes, and imagination in a Knowledge Web

Today I happened to run across this quote from Albert Einstein:

Imagination is more important than knowledge...

Well, yeah, I suppose that does make sense.

A little more Googling turned up a more complete version of that quote:

Imagination is more important than knowledge. For knowledge is limited to all we now know and understand, while imagination embraces the entire world, and all there ever will be to know and understand.

Okay, I get it.

Now, I am pondering whether imagination should have some role or position in a comprehensive Knowledge Web. Not so much as actual, real entities, but maybe as placeholders for gaps where we know that something may be missing but we do not know what the missing link actually might be. We can also make use of links to indicate uncertainties about our knowledge. And, more directly to the point of imagination, we can represent speculation for the possibility of future knowledge.

Speculation is maybe simply the midpoint between imagination and knowledge.

A conjecture is a form of speculation.

In fact, one might consider a conjecture as a slightly congealed form of imagination.

Ditto for an idea, but an idea is even less congealed than a more formalized conjecture.

Imagination is more of a mental process, which generates ideas.

I think it makes a lot of sense for a Knowledge Web to include problems, questions, answers, and issues as first-class entities in the Knowledge Web, ranking right up there with knowledge itself, in the sense that they are conceptual things that we work with. Similarly, we do in fact work with ideas, conjectures, theories, and other forms of speculation.

Imagination per se does in fact fit into a Knowledge Web as a conceptual entity in the same way as any other process, whether physical or mental, and is a conceptual thing that we can contemplate, discuss, and record.

But processes also transcend a Knowledge Web in the sense that they do have an active life of their own, distinct from pure knowledge itself.

We can also speak of a Knowledge Web as supporting or facilitating a process.

A Knowledge Web can obviously store information about the various artifacts that may be generated by a process, whether physical or mental.

Nonetheless, imagination would seem to be a very special process unlike all other processes. Most processes have at least some degree of predictability and in most cases it is that predictability that yields the most value. In contrast, imagination is highly unpredictable and it is that unpredictability that is most highly valued.

How to mesh unpredictability into a Knowledge Web is an interesting challenge.

Ultimately, we want a Knowledge Web that supports creativity, encouraging and facilitating imagination and other creative thought processes, and enabling realistic conceptualization of our ideas so that they can be carried into development and practice, as we see fit.

-- Jack Krupansky

Thursday, July 9, 2009

Nicknames, alternate names, synonyms, abbreviations, and other shortcuts

The formal names for concepts such as objects, people, places, streets, etc. can be rather inconvenient or in some cases a matter of dispute. In the real world, in natural language we use a variety of shortcuts:

  • Nicknames
  • Alternate names
  • Synonyms
  • Abbreviations
  • Full names that require context (e.g., city or town names that occur in more than one state or country)

In theory, with the Semantic Web we can have a single concept URI for each thing and then state axioms to equate the various shortcuts with their equivalent specific concept. Unfortunately, a given shortcut might be used for more than one concept. Some form of context or other form of additional detail must be supplied to disambiguate ambiguous shortcuts.

In the case of a user interface, a popup list of the choices can be provided and the user can make an explicit choice of the specific concept.

But in the case of a computational agent, the agent must supply the disambiguating data.

This also raises the question of how to store the graph that would describe what facts need to be detailed in order for a computational agent to choose between competing alternatives for a given shortcut. Sure, that could be application specific, but it would be a shame if each application were forced to invent its own mechanism. Possibly a generic context lookup mechanism (e.g., the PostScript dictionary stack metaphor) could be defined to satisfy this need.
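As a rough sketch of the dictionary stack metaphor, Python's standard `collections.ChainMap` searches a stack of dictionaries from the most recently pushed context down to the base. The place names, the concept URIs under example.org, and the `resolve` helper are all hypothetical, chosen only to illustrate the lookup.

```python
from collections import ChainMap

# Hypothetical concept URIs for two towns that share a name.
global_names = {"Springfield": "http://example.org/place/Springfield_IL"}
massachusetts = {"Springfield": "http://example.org/place/Springfield_MA"}

# Base of the stack: the default, context-free meanings.
contexts = ChainMap(global_names)

# Pushing a context is like pushing a dictionary onto the PostScript stack:
# names are looked up in the newest dictionary first.
in_massachusetts = contexts.new_child(massachusetts)


def resolve(shortcut, stack):
    """Return the concept URI for a shortcut in the given context stack, or None."""
    return stack.get(shortcut)


print(resolve("Springfield", contexts))
print(resolve("Springfield", in_massachusetts))
```

A computational agent would supply its disambiguating data by pushing the appropriate context before resolving, instead of each application inventing its own lookup scheme.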

Then there is the question of when a shortcut should be substituted with its equivalent unambiguous concept URI. Early binding has some advantages, but late binding also has some appeal. Another approach would be to carry around both, possibly in the form of a special shortcut mapping which gives the disambiguated concept for direct access but also provides the original shortcut for porting to other contexts or display, debugging, or other forms of convenience.
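The "carry around both" approach could be sketched as a small record that always keeps the original shortcut and optionally holds the resolved concept URI. This is a minimal illustration, not a proposed standard. `ShortcutMapping`, its fields, and the example URI are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ShortcutMapping:
    """Pairs the original shortcut text with its disambiguated concept URI, if known."""
    shortcut: str
    concept_uri: Optional[str] = None  # None means not yet bound (late binding)

    @property
    def is_bound(self):
        return self.concept_uri is not None

    def bind(self, concept_uri):
        """Return a bound copy; the original shortcut text is preserved."""
        return ShortcutMapping(self.shortcut, concept_uri)


# Late binding: keep the shortcut as entered, resolve it only when needed.
pending = ShortcutMapping("Jack K.")
bound = pending.bind("http://example.org/id/jack-krupansky")  # hypothetical URI
print(bound.shortcut, bound.concept_uri)
```

Early binding would simply construct the record with the URI already filled in. Either way the shortcut remains available for display, debugging, or porting to another context.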

-- Jack Krupansky

I changed my name (in Facebook)

I had not been doing much with Facebook, but since I was pondering issues with names, I decided to go in and see what I had used for my name when I had claimed my Facebook profile (whenever that was, maybe a couple of years ago.)

I had in fact claimed Jack Krupansky as my name in Facebook. No surprise there. That is how most people know me.

But the more I thought about it, I decided that I needed some way to also be findable as John W. Krupansky.

I browsed through all of the options and settings and found where Jack Krupansky was set as my "real name." Hmmm... real name. I hadn't paid attention before.

While I was thinking about whether to change my "real" name in Facebook to John William Krupansky, I browsed some more and noticed that Facebook also had an optional "Full Alternate Name." I went ahead and entered John William Krupansky as my full alternate name. Done.

Oops... I thought about it for a few more seconds and realized that I had my names backwards. I should have used John William Krupansky as my real name and Jack Krupansky as my full alternate name. That actually makes more sense. Done.

I would be more comfortable with just my middle initial when my name is used in general and then show the full spelling if someone looks at my profile, but Facebook does not give me any such option.

Unfortunately, the entire Facebook UI refers to me as John rather than Jack. Too bad they don't recognize formal and nick names and let you pick whether to default to formal or nick names. Actually, I'd rather have Facebook refer to me as Mr. Krupansky, just to make it clear what a subservient role the software really has. Facebook serves me. Facebook is not my friend.

Now that I have done all of this I realize another issue... findability in Google. My primary interest is professional in nature, so I would prefer that other professionals be able to find me as they know me, which is Jack Krupansky. But, by using John William Krupansky as my Facebook "real" name, my professional name on Facebook is not directly findable. Now I am thinking that I should set my "real" name to Jack Krupansky and my "alternate" name to John William Krupansky. But I'll think about this for more than a few seconds before changing it. Thinking... Done thinking. Changed. So, now my Facebook "real" name is back to Jack Krupansky and my "alternate" name is John William Krupansky. Logically that is backwards, but practically it should work better.

My Facebook profile is here:

Now, I need to go in and make sure I have LinkedIn set in a similar manner, if possible.

Twitter? Now there's a lost cause. Maybe they'll let me set my name properly when they figure out what they want to do in life.

Oh, and while I was at it, I found an Ivan Krupansky over in Slovakia to add as a friend. And he has a friend Jakub Krupansky (with an acute accent over the "y", which I do not know how to enter in an emailed blog post) who I also added as a friend. Whether either of them is even a distant relative is unknown. Do we really have the same last name if one uses a diacritical mark?

Now, I need to think some more about a sensible model for formal and informal names in the Semantic Web. It will be awhile before I get to the stage of addressing cultural difference in how names are used. That is all the more reason to strip the textual representations of names out of Semantic Web data and use a URI to reference the person rather than a culturally-dependent textual representation.

I need to take a look at the FOAF (Friend Of A Friend) vocabulary specification to at least use that as a starting reference point for name handling in the Semantic Web. Ditto for the Dublin Core Metadata Element Set. I do not think either will get me very far, but I at least need to cover those bases.

-- Jack Krupansky

Wednesday, July 8, 2009

What's my name? Who am I?

They seem like such simple, obvious questions: What's your name? Who are you? In the "real" world the answers are easy, and online casually they are also easy, but in a hard-core semantic sense, boy are they tough problems. Sure, there is no problem if all you are using a name for is a text label or where the context provides qualifying information, but in a general, abstract sense names and identities are very hard problems.

So, what is my name?

Casually, as you see at the bottom of my blog posts, I am Jack Krupansky. Simple enough.

But... Jack is just my nick name and not suitable for any legal documents. My driver's license and bills and credit cards and financial accounts all have my legal first name, John. So, I am "really" John Krupansky.

Actually, I almost never use John Krupansky. In formal, legal contexts, including my driver's license, bills, voter registration, etc., I always use my middle initial: W. So, legally I refer to myself as John W. Krupansky, with the period.

Actually, my driver's license says: KRUPANSKY, JOHN W, without the period.

And my credit cards say JOHN W KRUPANSKY, also without the period.

Personally, I never abbreviate my first name, but in some contexts my name could also be any of:

  • J. Krupansky
  • J Krupansky
  • J. W. Krupansky
  • J W Krupansky
  • Krupansky, J.
  • Krupansky, J
  • Krupansky, J. W.
  • Krupansky, J W

In some contexts, such as publication of a letter or comment, a publisher might abbreviate my last name as:

  • Jack K.
  • John K.
  • John W. K.

Oh, I forgot to mention that my middle W. stands for William. So my birth certificate says John William Krupansky. My passport says:


Please note that "J. Krupansky", "J Krupansky" and "J KRUPANSKY" are not necessarily my name. In some contexts the "J" is really an abbreviation for Judge. There are only two examples I know of, but they are (were) real: Judge Robert Brazil Krupansky and Judge Blanche Krupansky. They are not relatives as far as I know. They might be distant relatives, but that is not known.

Did I say that John Krupansky is my name? Well, yes, but it is not only my name. A Web search shows that there are at least two other people who "have" that name, so I cannot technically claim exclusive ownership. There is a John Krupansky from upstate NY or Kentucky and there is a John Joseph Krupansky out there somewhere.

Almost forgot, there was another John Krupansky, even before I was born, a John F. Krupansky or John Frank Krupansky, my grandfather. That may be part of the reason I became known as "Jack". The rest of the reason was that in first grade of elementary school, there were four Johns out of 20 kids.

As far as I know, there are no other John W. Krupansky's out there. But, that is not something that we can count on.

You would think that with all of the "intelligence" and horsepower in modern computers that all of these variations could be sorted out with no effort required on our part, but that is not the case. Sure, various pieces of software do have varying degrees of smarts for dealing with names, but the emphasis is on varying.

Each of the various John Krupansky's does indeed have a distinct identity (probably at least social security number, driver's license state and number, and residential address), but automatically mapping from John Krupansky or J Krupansky or Krupansky, J. to each of us is as yet an unsolved problem (in general.)

As far as I know, the Semantic Web and the various Semantic Web technologies as well as the various prototype semantic search engines do not even offer a proposed solution to this problem of mapping an informal textual name reference to a specific identity. In theory, on the Semantic Web there should be a specific concept or URI for each of us Johns or Krupansky, J., for each of our identities. In fact, the situation is so complex that even Google does not offer a name search capability that is able to deal with the simple variations I have detailed here.

Oops, I forgot another variation. Back in Europe, there was an accent on the y of Krupansky, and you can even use Google to find some of those European Krupansky's. Semantic search needs to be able to handle both the accented and unaccented forms as well as an option for whether to require the accents to match.
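Here is a minimal sketch of that kind of matching option, using Python's standard `unicodedata` module. The function names and the `require_accents` flag are hypothetical. The approach shown (decompose, then drop combining marks) is one common technique and will not cover every cultural naming convention.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks, so an accented y compares equal to a plain y."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def names_match(a, b, require_accents=False):
    """Compare two names, optionally requiring the accents to match as well."""
    if require_accents:
        # Normalize only the encoding, so precomposed and decomposed forms agree.
        a = unicodedata.normalize("NFC", a)
        b = unicodedata.normalize("NFC", b)
    else:
        a, b = strip_accents(a), strip_accents(b)
    return a == b


accented = "Krupansk\u00fd"  # y with an acute accent
print(names_match(accented, "Krupansky"))
print(names_match(accented, "Krupansky", require_accents=True))
```

With the default setting the two spellings match; with `require_accents=True` they do not.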

The good news, for me personally, is that it does not appear that there is any other Jack Krupansky out there, at least right now.

Oh, and who is Jack Krupanski? Well, it's actually me, but spelled wrong. What computer software knows that?

To some people I am Mr. John Krupansky. Is the Mr. part of my name? Good question.

Almost forgot... there are also people out there who insist that my name is jack krupansky without any capitals. In general, capitalization does not matter, but it can matter when text is being parsed to be indexed and software is attempting to recognize names.

At this stage, I think we need to consider the following for any semantic web:

  1. Ultimately, each person needs to have a unique URI that represents their identity.
  2. That identity needs to include all of the name components, such as first name, middle name, last name, suffix, title, nick name, etc. as attributes.
  3. Each of the various forms of your name needs to have its own URI. That should include misspellings, for example, Jack Krupanski. That also includes variations in titles and suffixes.
  4. There should be RDF for many-to-many mappings between name forms and the identities that share them, so that given a name form the possible identities can be examined and given an identity the possible name forms can be examined.
  5. Whether in a UI or an API, given a name form, it should be possible to examine the various name forms that might be equivalent.
  6. Have the concept of preferred name form. But there could be multiple preferred forms, such as nick name vs. legal name.
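The list above can be sketched as a small in-memory index. This is only an illustration under simplifying assumptions: name forms are kept as plain strings rather than being given their own URIs, the mappings live in dictionaries rather than RDF, and the `NameIndex` class and the identity URIs under example.org are hypothetical.

```python
from collections import defaultdict


class NameIndex:
    """Many-to-many index between textual name forms and identity URIs."""

    def __init__(self):
        self._identities_by_form = defaultdict(set)
        self._forms_by_identity = defaultdict(set)
        self._preferred = {}  # (identity URI, kind) -> preferred name form

    def add(self, name_form, identity_uri):
        """Record that a name form (including a misspelling) may refer to an identity."""
        self._identities_by_form[name_form].add(identity_uri)
        self._forms_by_identity[identity_uri].add(name_form)

    def identities_for(self, name_form):
        return set(self._identities_by_form.get(name_form, ()))

    def forms_for(self, identity_uri):
        return set(self._forms_by_identity.get(identity_uri, ()))

    def equivalent_forms(self, name_form):
        """Other name forms that share at least one identity with this one."""
        forms = set()
        for identity in self.identities_for(name_form):
            forms |= self.forms_for(identity)
        forms.discard(name_form)
        return forms

    def set_preferred(self, identity_uri, kind, name_form):
        """Mark a preferred form per kind, e.g. 'legal' vs. 'nick'."""
        if name_form not in self._forms_by_identity.get(identity_uri, ()):
            raise ValueError("unknown name form for this identity")
        self._preferred[(identity_uri, kind)] = name_form

    def preferred(self, identity_uri, kind):
        return self._preferred.get((identity_uri, kind))


# Hypothetical identity URIs for two people who share a name form.
JWK = "http://example.org/id/john-w-krupansky"
OTHER = "http://example.org/id/john-joseph-krupansky"

index = NameIndex()
for form in ("John W. Krupansky", "John Krupansky", "Jack Krupansky", "Jack Krupanski"):
    index.add(form, JWK)
for form in ("John Joseph Krupansky", "John Krupansky"):
    index.add(form, OTHER)
index.set_preferred(JWK, "legal", "John W. Krupansky")
index.set_preferred(JWK, "nick", "Jack Krupansky")

print(sorted(index.identities_for("John Krupansky")))
```

Looking up "John Krupansky" returns both identities, which is exactly the ambiguity a UI or API would then need to present for resolution.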

Back to the headline question, for any legal context I always use John W. Krupansky. But, sometimes, I actually run into a form that does not request a middle initial, so then I am John Krupansky. Even then, legal contexts tend to include one or more of social security number, driver's license state and number, and residential address. Still, it feels odd using a form of name that I know is not unique.

In non-legal contexts, such as random social networking web sites, I almost always use Jack Krupansky. I do the same for business cards as well, although I have thought of switching to using my legal first name on business cards.

My resume has John William Krupansky plus Jack Krupansky and happens to use John W. Krupansky in the copyright notice.

The other answer to the question is that I respond by asking what field format you need my name in (and whether it is for a "legal" context.) Actually, I usually respond with Jack Krupansky and then optionally revise to John if it becomes clear that it is a legal context.

In any case, I am dubious when I run into a single field such as name, author, or creator that doesn't seem to care what form a name is in. That is fine for famous names, but for everybody else it is a recipe for confusion. The solution is to require the identity URI for the person and to have a convenient UI for looking up names.

If it were up to me, I would ban simple text name fields. Or maybe not ban them but require a validation rule that checks for uniqueness and then automatically maps to the true identity.
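Such a validation rule might look roughly like the sketch below. It assumes a lookup table from name text to candidate identity URIs already exists. The table contents, the exception names, and `validate_name` are hypothetical.

```python
class UnknownName(Exception):
    """The name text does not map to any known identity."""


class AmbiguousName(Exception):
    """The name text maps to more than one identity."""


def validate_name(name, identities_by_name):
    """Accept a text name only if it maps to exactly one identity URI."""
    candidates = identities_by_name.get(name, set())
    if not candidates:
        raise UnknownName(name)
    if len(candidates) > 1:
        raise AmbiguousName(f"{name!r} matches {len(candidates)} identities")
    (identity,) = candidates
    return identity


# Hypothetical lookup table from name text to identity URIs.
known = {
    "Jack Krupansky": {"http://example.org/id/john-w-krupansky"},
    "John Krupansky": {
        "http://example.org/id/john-w-krupansky",
        "http://example.org/id/john-joseph-krupansky",
    },
}

print(validate_name("Jack Krupansky", known))
```

An ambiguous entry such as "John Krupansky" is rejected rather than silently accepted, at which point a UI could offer the lookup list of candidate identities.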

-- Jack Krupansky

Monday, July 6, 2009

Meaning and the Semantic Web

If we look simply at the term Semantic Web, we assume that it is a web that has something to do with semantics, and semantics essentially is about meaning. I think most (but not all) people can agree with that. The rub comes when we try to figure out what various factions mean by meaning. Some of the common meanings of meaning (semantics):

  • The association of type with data so as to permit a computer to understand what the data means at the level of which type a given piece of data refers to.
  • Denotation of which object is referred to by words or terms, such as in a dictionary.
  • Human-level understanding of the "meaning", potentially (or even usually) subjective, of words, terms, and statements.
  • Human-level "meaning" in a deeper, more personalized sense for an individual, how someone feels about or experiences a concept.
  • Rich knowledge as opposed to mere information or raw data, that permits the reader to infer a much wider range of truth and acceptable behavior.
  • Formal semantics of computer science used to define a domain and the operations permitted over that domain in such a way that is complete, consistent, unambiguous (accurate), and verifiable. Even that raises the question of whether a description of a domain on a computer accurately matches the real world as it exists or as we think we know it.
  • Artificial intelligence (or computational intelligence) applying formal semantics to attempt to approximate human-level understanding.
  • Simple tagging to point from a term (e.g., keyword) to an object to cue a computer program as to the intended "meaning" of a term.
  • Simple textual natural language, even if in simple HTML or simple XML can embody an incredible range of meaning, although full processing of natural language by non-human entities is still only a partially solved problem.

The question of what "semantic" means in Semantic Web now comes down to the issue of how much and what kind of meaning is embodied in the Semantic Web. Alternatively phrased, is there enough semantic meaning embodied in the so-called Semantic Web to warrant the term "semantic"? Some might contend that the existing conceptualization of the Semantic Web is too weak, while others might assert that all of the complexity of RDF is simply not needed for most contemporary applications that need to work with limited forms of "meaning." In the end (or at the beginning), the folks at W3C made a call and sincerely believed that their concept of the Semantic Web was a close enough match between what they believed was needed and what they believed could be done. Whether their views will hold up over time remains to be seen.

At a primitive, operational level the Semantic Web really is just a Web of data or a Web of Linked Data. The modifier typed is implicit in there, since that is where most of the power comes from. This operational view is not denied, and most agree with that characterization, even if they chafe or disagree with the term Semantic Web per se.

Others believe that raw XML (and related non-RDF technologies) by itself is more than sufficient to represent and manipulate the lion's share of the kinds of "meaning" that people need today in their applications. Fair enough, as far as it goes. RDF has somewhat grander goals, but many contemporary applications can do just fine with a subset of non-RDF XML-based technologies. But none of that really is a robust argument against RDF enabling a richer form of Semantic Web.

The hard-core computer scientists probably do have a point that the current RDF-based technology stack still isn't quite up to snuff to qualify as a formal semantics, but even that is not a truly robust argument against billing the RDF-based Semantic Web as a major advance in introducing semantics and meaning into the Web of Linked Data. Yes, the computer scientists can reasonably argue that we can and should do better to produce a true semantic web, but once again that is not a great argument to withhold the "semantic" label per se. Sometimes you can make better progress with your known bird in hand than spend too much effort pursuing another bird or two in the bush. Some might claim that alternative approaches are less risky, but such matters can be debated endlessly without resolution. Sometimes it is better to make rapid, informed decisions and run with them rather than to slow progress with an endless stream of second-guessed decisions. Or, who knows, maybe eventually there will be a "Version 2.0" of the Semantic Web which leapfrogs ahead of the current Semantic Web with a more robust sense of formal semantics.

Some of us would really like to see more of a Knowledge Web that goes well beyond merely linking together lots of typed data and it is not clear at all that the current RDF-based Semantic Web technology stack is indeed well-suited for that purpose, but even this is not a valid block to the use of the "semantic" label. One could also argue that a "knowledge" web needs more than "mere" semantics, including pragmatics and full-blown semiotics, but that certainly does not argue for withholding the "semantic" label.

More recently, a lot of the emphasis in the Semantic Web community is on Linked Data, Linking Open Data, and producing and populating a realistic Web of Linked Data. That is all fine and well and good, but once again does not by itself argue against the use of the "semantic" label.

My personal view is that all of these efforts are at heart attempts to increase the emphasis on meaning. Even if any given effort does not meet some impossibly high bar for the meaning of meaning, I do think it is the direction and intention of our efforts that matter. Sure, many of the current efforts focus simply on replicating basic data and information processing capabilities at Web-scale, but ultimately we are trying to get to the original Semantic Web vision of a comprehensive information infrastructure that software agents can use to automate a much broader swath of our manual tasks.

My other view is that the decision was made years ago and does have at least some valid technical and communication value, so we have more to gain by sticking with it than in jumping ship to some other term that may offer some short-term clarity but possibly at the expense of losing focus on the long-term vision.

Meanwhile, "meaning" can be found wherever it is stored, whether in RDF, RSS, XML, HTML, or raw text. Storing that meaning can be rather straightforward, but interpreting it is another story. Simple file structures have obvious advantages, but RDF is designed to be a long enough reach to give us some real intellectual leverage over non-RDF XML, but short enough reach that real applications are practical today, or at least in the not-too-distant future.

-- Jack Krupansky

Saturday, July 4, 2009

Where is the Semantic Web?

Quite a few people and organizations have been busily slaving away on the development of the Semantic Web for a number of years now, so where exactly is the Semantic Web? Not what stage of development it is at, but where do we go to find it? At a simplistic, operational level the Semantic Web is fragmented and scattered over a significant number of Web servers all around the world. If you know where to look, you can find bits and pieces here and there. The bottom line is that it is still too early in the development of the Semantic Web to think of it as one monolithic (although distributed) "thing" the way we think of the traditional World Wide Web.

In truth, the structure of the Semantic Web is really not a lot different than the existing Web. Both consist of files stored on servers that run Web server software and both are based on hyperlinks from one file to another.

But, if you did not know anything about the content of the current Web, where would you start? There actually isn't a logical answer since there is no master "root" of the Web. Sure, you could consider Google to be the place to start, but how would you even know about Google? And even Google doesn't know everything about the Web, at least in a form that a user could make any sense of.

Back in the early days of the Web (vintage 1994 or 1995) the "answer" was one of:

  • Your Web browser was pre-configured with a "home" page that had a bunch of links to interesting Web pages.
  • Somebody gave you an explicit URL which you carefully typed into the Web browser address box, or copy and pasted the URL from an email message.
  • You browsed the Yahoo "directory" of registered web sites, including its "What's New" page.
  • You used the Lycos search engine from Carnegie Mellon University to search for keywords and then browsed through the results to select a web page. Alta Vista and a number of other search engines came along, and eventually Google joined the fray.
  • Once you "landed" on one Web page you could follow links from that page to a number of other pages. Rinse and repeat and you could quickly navigate "all over" the Web. Or at least it seemed as if you were navigating "everywhere", although in actuality you were viewing only a very tiny portion of the vast Web, even in those early days.
  • Paper trade publications and even the traditional media began to review and highlight Web sites and Web resources. Eventually those publications opened shop online, publishing the text of those articles on the Web with links to those Web sites and resources that could be clicked to navigate quickly.
  • Businesses advertised their Web addresses in magazines, newspapers, TV, and even billboards, as well as business cards and brochures.
  • Gradually, a number of Web portals emerged which endeavored to provide you with dense snapshots of portions of the Web that the authors imagined that you would find useful - news, sports, weather, finance, entertainment, etc.
  • Google introduced (or at least popularized) the concept of ranking search results more highly based on popularity or the number of inbound links for each Web page. This allowed users to find higher quality and more relevant Web pages with far less effort.
  • Web advertising emerged, providing another technique for informing the user of Web pages that they might find of interest.
  • Search engines began "crawling" and indexing ever-larger portions of the total Web, making it more likely that if a Web page existed, then the user could find it if they only had the proper combination of keywords.
  • Web site content developers put an interesting amount of effort into soliciting other Web sites to exchange links to provide more paths to their sites as well as to boost their "Google juice" to get a higher ranking in Google's search results.
  • Search Engine Optimization (SEO) and Search Engine Marketing (SEM) became full-fledged "disciplines" to increase the likelihood that users would "find" targeted Web sites.
  • Web 2.0 emerged with blogs, spaces, and various social media and social networking sites and technologies which enabled mere users and a wide range of professionals to rapidly generate their own content, including links to content that they found interesting.
  • Highly specialized Web sites (including Web 2.0 sites) emerged that catered to advising users what they might find interesting, including TechMeme, TechCrunch, Digg, StumbleUpon, and Twitter.

That's a brief summary of where we are today with the traditional Web in terms of how users can view the available content and answer the question "Where is the Web?" In short, there are plenty of "arrows" pointing users to an interesting subset of the total World Wide Web.

Unfortunately, the Semantic Web does not have this kind of rich support infrastructure, yet.

Sure, you can do a search for "Semantic Web" in Google, but mostly it will get you resources that describe the Semantic Web and its technologies rather than point you to the Semantic Web itself.

There is a foundational question of the extent to which mere users would even want to know anything about the Semantic Web since it is all about data rather than the presentation that users are used to with the traditional Web. Instead, it is applications and application developers who "need to know" where the Semantic Web data resides. Still, application developers do need a lot of the kinds of tools that are available for traditional Web site developers to find what is available on the Semantic Web that they can use. The fact that the Semantic Web architecture encourages code to be able to discover resources directly only makes the problem more difficult, and more interesting.

Some might assert that the Semantic Web should be completely invisible to users, but they are promoting a view that access to data should be controlled by various gatekeepers. In contrast, the view of open data, as advocated by the Open Data Movement, is that there should be no gatekeepers to prevent access or to enforce selective filtering of the data. Over time, developers will develop better and better tools that will allow even users to manipulate complex data as directly as they desire. We aren't there yet, but the vision is there. Sure, there will still be plenty of need and demand for ever more sophisticated tools for filtering and presenting data, including so-called mashups for combining data from many sources, but the emphasis is still on transparency so that the user can still discern where the data really came from. No matter how finely or richly data is presented, users should always be able to do their own mashups and filtering of data, as they see fit. The bottom line is that users should have direct access to the data of the Semantic Web, and hence that the Semantic Web must be visible. But, Semantic Web data will also in some cases be integrated with traditional Web applications so that users may indirectly "access" the Semantic Web without being aware that the Semantic Web is being accessed or that it even exists at all.

Another model is that the Semantic Web would be more of an on-call phantom, lurking in the background, but always available to be brought to the foreground if and when the user desires. Maybe the user will generally see a more traditional Web page interface, but occasionally drill down to examine the data more closely. For example, a Web page might present a conclusion, but the user may want to see the justification or provenance for that conclusion.

Still, even if the user does occasionally wish to see actual data, in general the Semantic Web should vanish into transparent ubiquity, meaning that it is always there, always everywhere, but generally is effectively invisible. But even if that is the case, users will on occasion still want to know where the data is and how to access and use it.

Eventually, as the Semantic Web does in fact become ubiquitous, it will merge with the traditional Web so that there will once again be only one Web, but there will still be the conception of the Web of data that lies beneath the surface UI and presentation layer.

For now, how do you find out what is available on the Semantic Web? I'll summarize some of the current techniques:

  • Subscribe to various Semantic Web email lists and simply read about Semantic Web resources as they are discussed. In some cases projects are mentioned and you can visit the project web site to find out where the relevant Semantic Web data resources reside.
  • Ditto for trade journals and conference proceedings for the Semantic Web.
  • A friend or colleague emails you a link to Semantic Web data.
  • Using a data browser such as Tabulator, view a Semantic Web data source and then navigate data links much as you navigate links from a traditional Web page.
  • Check out the wiki for the more recent Linking Open Data (LOD) community project. One wiki page lists many of the known Semantic Web Linked Data datasets for the emerging Web of Linked Data. There is a nice bubble diagram that shows the various LOD datasets and their relationships. This represents the best overall view of the Semantic Web, to date.
  • People are beginning to create search engine-like "crawlers" to index the known fragments of the LOD portion of the Semantic Web as caches of the LOD cloud. For example, OpenLink Software provides this cache of the LOD cloud that supports text searches and queries.
  • There are also some experimental semantic web search engines such as Swoogle.
  • Various semantic databases, such as Freebase, are beginning to provide Linked Data interfaces.
  • Vendors are beginning to promote Semantic Web data that they are beginning to provide, either as RDF files or as so-called SPARQL endpoints.
  • Some vendors are providing access to their underlying relational databases, once again in the form of SPARQL endpoints.
  • With Linked Data, once you access one element of data you will generally have the opportunity to navigate to other, linked data, much as you would navigate the traditional Web by following links.
  • RDFa permits the embedding of Semantic Web data within HTML Web pages, so that the traditional Web and the Semantic Web can in at least some situations be co-located.
  • Google and Yahoo are in the early stages of experimenting with Semantic Web technologies, so we can expect that users will eventually be able to "find" interesting portions of the Semantic Web directly from our traditional search engines.
  • Plug-ins for traditional Web browsers are available or under development or in the research stage so that users will eventually be able to "see" the Semantic Web directly from the Web browser.
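As a rough illustration of what accessing a SPARQL endpoint involves, here is a minimal Python sketch that only builds the request URL and does not contact any server. The endpoint address and the query are hypothetical, chosen purely for illustration; the one detail taken from the SPARQL protocol is that a GET request carries the query text in a parameter named query.

```python
from urllib.parse import urlencode

# Hypothetical endpoint address, used only for illustration.
ENDPOINT = "http://example.org/sparql"

# A generic query that asks for any ten triples the endpoint holds.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def build_sparql_url(endpoint: str, query: str) -> str:
    """Return a GET URL that carries the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = build_sparql_url(ENDPOINT, QUERY)
print(url)
```

An application would then fetch that URL with an HTTP client and parse the result set it gets back; that part is left out here since it depends on the particular endpoint.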

That's what I have discovered so far and my search is only in the early stages. I am sure there are additional resources (about resources) that I have not yet discovered, and the "industry" is still in the early development stages, maybe comparable to the Web in, say, 1994, before Yahoo appeared on the scene and helped promote a user-friendly approach to promoting Web resources.

Some loose ends:

  • How does non-RDF XML-based data relate to the Semantic Web and Linked Data (Linking of Open Data)?
  • How do RSS feeds relate to the Semantic Web? RSS feeds are problematic in at least one sense: they are frequently only a severe subset of the available data, so they certainly do not provide full access to the underlying data.
  • Data in online text files and non-W3C data formats, including CSV and spreadsheet files that users can directly access from the Web. Some sort of automated translation or "adaptor module" approach is needed so that such data can be accessed as if it were in a Semantic Web format.
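To make the "adaptor module" idea in the last item a bit more concrete, here is a minimal Python sketch, under my own assumptions, that reads CSV text and presents each cell as a subject, predicate, object triple. The base namespace, the item/ and prop/ naming scheme, and the sample data are all hypothetical; a real adaptor would need a deliberate mapping from columns to an agreed vocabulary.

```python
import csv
import io

# Hypothetical namespace for the generated URIs.
BASE = "http://example.org/"


def csv_to_triples(text, key_column):
    """Yield one (subject, predicate, object) triple per non-key cell."""
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        # The key column identifies the row, so it becomes the subject URI.
        subject = BASE + "item/" + row[key_column]
        for column, value in row.items():
            if column != key_column:
                yield (subject, BASE + "prop/" + column, value)


# Made-up sample data standing in for a CSV file found on the Web.
SAMPLE = "id,name,city\n1,Alice,Boston\n2,Bob,Denver\n"

triples = list(csv_to_triples(SAMPLE, "id"))
for triple in triples:
    print(triple)
```

Each row becomes one subject and each remaining column becomes a predicate, which is about the simplest mapping possible. It treats every value as plain text and makes no attempt to link values to resources elsewhere.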

Maybe one over-simplistic answer to my question is that the Semantic Web is spread all over the place, but you just need to know where and how to look for it.

-- Jack Krupansky

Thursday, July 2, 2009

What is the LOD cloud?

LOD is an acronym for Linking Open Data, although sometimes it is less correctly referred to as Linked Open Data. A set of principles for Linked Data was espoused by Tim Berners-Lee. Unlike the traditional Web, which consists of hyperlinked HTML Web pages, Linked Data consists of hyperlinked data in the form of RDF triples. Technically, cloud usually refers to a network of servers, but sometimes it is used to refer to interconnected data, essentially a synonym for Web. The LOD cloud, or Linking Open Data cloud, is the current totality of the interconnected data produced by the Open Data Movement in the form of the W3C SWEO Linking Open Data community project. SWEO refers to the W3C Semantic Web Education and Outreach Interest Group.
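A small Python sketch may help show what "hyperlinked data in the form of RDF triples" means in practice. It models triples as plain tuples rather than using any RDF library, and every URI and dataset in it is hypothetical. The point is only that the object of a triple in one dataset can be a URI that another dataset describes, so data can be followed from source to source much as links are followed between pages.

```python
# Each triple is (subject, predicate, object). All URIs are hypothetical.
DATASET_A = [
    ("http://a.example/person/1", "http://vocab.example/name", "Alice"),
    ("http://a.example/person/1", "http://vocab.example/livesIn",
     "http://b.example/city/boston"),
]
DATASET_B = [
    ("http://b.example/city/boston", "http://vocab.example/label", "Boston"),
]
ALL = [DATASET_A, DATASET_B]


def describe(uri, datasets):
    """Return every (predicate, object) pair stated about uri."""
    return [(p, o) for ds in datasets for (s, p, o) in ds if s == uri]


def follow(uri, predicate, datasets):
    """Return the objects reached from uri via the given predicate."""
    return [o for (p, o) in describe(uri, datasets) if p == predicate]


# Start in dataset A, then follow the link into dataset B.
cities = follow("http://a.example/person/1", "http://vocab.example/livesIn", ALL)
labels = follow(cities[0], "http://vocab.example/label", ALL)
print(cities, labels)
```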

The LOD cloud is essentially the rudimentary beginning of the Web of Data or Semantic Web as envisioned by Tim Berners-Lee.

The LOD cloud is also referred to as a data commons. The intention is that all of the data in the LOD cloud is open and freely available. Usually "open" will mean that, at a minimum, the data is accessible. It is also usually expected that the data can be freely copied, but that may not always be the case, depending on the license for a particular subset (data set) for a particular source or supplier. The ultimate sense of openness is that the data may be freely edited by users, but that frequently is not the case, especially for proprietary data or data from a government agency which controls the data. It may be more appropriate to refer to the entire cloud as the Linked Data cloud, and the more open subset as the LOD cloud (Linked Open Data cloud), but such fine distinctions are generally not drawn at present.

The LOD cloud is sometimes referred to as the LOD data cloud, but clearly that is redundant (Linking Open Data data cloud).

The term LOD dataset (or LOD data set) is sometimes used to refer to a well-defined subset of the LOD cloud, such as the data for a specific application or domain or data source. The term LOD datasets (or LOD data sets) refers to some collection of specific LOD data sets, or possibly even all of the datasets in the cloud. The SWEO wiki maintains a list of the known data sets in the LOD cloud.

The concept of data being published to the LOD cloud or the Web is also known as Linked Data on the Web (LDOW) or sometimes even the Semantic Web on the Web.

Another term sometimes used for the LOD cloud is Web of Linked Data.

The term Linked Data Cloud is also sometimes used for the LOD cloud.

The term Linking Open Data on the Semantic Web is sometimes used to refer to the LOD cloud.

Sometimes the term LOD cloud is simply used to refer to the "cloud" diagram or bubble diagram that shows all of the known data sets in the LOD cloud.

For almost all intents and purposes, the LOD cloud is the Semantic Web or the Web of Data.

-- Jack Krupansky

Wednesday, July 1, 2009

Linked Data - link instance data, not just metadata

One clarification I forgot to emphasize clearly enough in my recent post on Linked Data is that the real goal of Linked Data is for a given Semantic Web application to link to instance data of other Semantic Web applications, not merely to reuse existing metadata. The goal is to aid discovery of other things by users (and their agents). Reuse of metadata such as vocabularies and schemas is a really good idea, but not sufficient to connect things into a Semantic Web.

An unfortunate side effect of using the single concept of a URI to refer to both data and metadata is that the emphasis on linking gets diffused onto both usages.

In summary, reuse of existing vocabularies and other metadata from other Semantic Web applications is good, but linking to instance data from other Semantic Web applications is what Linked Data is really trying to get at.
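The distinction can be sketched in a few lines of Python. This is only an illustration under my own assumptions: triples are plain tuples, all namespaces and URIs are hypothetical, and "belongs to this application" is approximated by a simple prefix check. A predicate drawn from someone else's namespace counts as vocabulary reuse, while an object URI that points at someone else's resource counts as a link to instance data.

```python
# Hypothetical namespace of the local application.
LOCAL = "http://myapp.example/"

# Made-up triples: (subject, predicate, object).
TRIPLES = [
    (LOCAL + "book/42", "http://vocab.example/title", "Some Title"),
    (LOCAL + "book/42", "http://vocab.example/author",
     "http://otherapp.example/person/7"),
    (LOCAL + "book/42", LOCAL + "terms/shelf", "A3"),
]


def is_uri(value):
    """Crude check for whether an object is a URI rather than a literal."""
    return value.startswith("http://") or value.startswith("https://")


def reused_vocabulary(triples, local):
    """Predicates borrowed from outside the local namespace (metadata reuse)."""
    return sorted({p for (_, p, _) in triples if not p.startswith(local)})


def instance_links(triples, local):
    """Object URIs that point at another application's instance data."""
    return sorted({o for (_, _, o) in triples
                   if is_uri(o) and not o.startswith(local)})


vocab = reused_vocabulary(TRIPLES, LOCAL)
links = instance_links(TRIPLES, LOCAL)
print(vocab)
print(links)
```

In this made-up data the application reuses two vocabulary terms but links to only one piece of instance data elsewhere. It is that second list that Linked Data is trying to grow.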

-- Jack Krupansky

Linked Data, Web of Data, and the Semantic Web

I have been wanting to write a post on the relationship of Linked Data and the Web of Data to the Semantic Web, but even now I am still struggling to get a secure handle on the distinctions between these three related concepts. Meanwhile, I stumbled across a relevant blog post by Tom Heath on the topic entitled "Linked Data? Web of Data? Semantic Web? WTF?" It's difficult to get a hard-core representative summary, but a semi-reasonable approximation is:

... in common usage Linked Data refers to the principles set out by Tim Berners-Lee in 2006.

So if we link data together using Web technologies, and according to these principles, the result is a Web of data. Personally I use the term Web of data largely interchangeably with the term Semantic Web, although not everyone in the Semantic Web world would agree with this. The precise term I use depends on the audience. With Semantic Web geeks I say Semantic Web, with others I tend to say Web of data -- it's not about rebranding, it's about using terms that make sense to your audience, and Web of data speaks to people much more clearly than Semantic Web. Similarly, Linked Data isn't about rebranding the Semantic Web, it's about clarifying its fundamentals.

Tim Berners-Lee said several times last year, in public, that "Linked Data is the Semantic Web done right" (e.g. see these slides from Linked Data Planet in New York), and who am I to argue, it's his vision.

I am still not prepared to write the definitive post on this topic, but here is the gist of my current research:

  1. W3C offers a number of Semantic Web technologies, including XML, XML Schema, RDF, RDFS, RDFa, OWL, SPARQL, XSLT, and others.
  2. The Semantic Web is the vision of the World Wide Web that utilizes the Semantic Web technologies, with RDF at its core.
  3. Any application can utilize any one or more of the Semantic Web technologies.
  4. Mere use of Semantic Web technologies does not by itself indicate that the application is a Semantic Web application.
  5. A Semantic Web application is first and foremost a Web application, typically accessible on the World Wide Web, that utilizes Semantic Web technologies, and specifically uses RDF (or RDFa) for making statements about (Semantic) Web resources.
  6. A Semantic Web application might typically include a more traditional Web application (e.g., HTML) combined with underlying Semantic Web resources.
  7. Web of Data is simply a casual synonym for the Semantic Web that emphasizes that like the original, non-Semantic Web, the Semantic Web consists of an interconnected Web of resources, but they are data resources described at their core using RDF (or RDFa) rather than merely presentation resources (HTML web pages.)
  8. Linked Data is not introducing any new technologies, but is simply a collection of principles that emphasize that the Semantic Web (or Web of Data) has much greater utility to its users when data resources tend to refer to other, somewhat related data resources that may not necessarily be directly required by the local Semantic Web application.
  9. Put simply, Linked Data enables the user (or computational user agent) to navigate between Semantic Web applications (data resources).
  10. Even a proprietary application that uses Semantic Web technologies may also utilize resources (e.g., vocabularies or schemas) from elsewhere in the Semantic Web, but the real test of whether an application is a true Semantic Web application is whether other applications in turn reference it. It is this expanding chain of referencing to produce an ever-expanding and ever more-heavily interconnected Web that gives the Semantic Web its true "webbiness", not the mere use of the underlying Semantic Web technologies by themselves.
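The "real test" in the last item, whether other applications reference a given application, can be sketched as a simple inbound-link count. This Python sketch is only an illustration under my own assumptions: the applications are hypothetical, each is identified by its host name, and the outbound links are listed by hand rather than discovered by crawling.

```python
from urllib.parse import urlsplit

# Hypothetical applications, each with the URIs its data links to.
OUTBOUND = {
    "a.example": ["http://b.example/city/boston", "http://c.example/topic/9"],
    "b.example": ["http://c.example/topic/9"],
    "c.example": [],
}


def inbound_counts(outbound):
    """Count, for each host, how many other hosts link to it."""
    counts = {host: 0 for host in outbound}
    for source, links in outbound.items():
        # Use a set so several links to the same host count only once.
        targets = {urlsplit(link).netloc for link in links}
        for target in targets:
            if target != source and target in counts:
                counts[target] += 1
    return counts


counts = inbound_counts(OUTBOUND)
print(counts)
```

By this rough measure, an application that only links outward and is never referenced by others has a count of zero, however many Semantic Web technologies it uses.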

That model is not entirely accurate, but I think it's a good start. I need to include mention of HTTP and URIs; they are not unique to the Semantic Web, but are essential.

-- Jack Krupansky