Sunday, November 1, 2009

Philosophy and Ethics of Social Reality

I just ran across an interesting conference announcement, SOCREAL 2010: Second International Workshop on Philosophy and Ethics of Social Reality. The conference summary is:

In the past two decades, a number of logics and game theoretical analyses have been proposed and combined to model various aspects of social interaction among agents including individual agents, organizations, and individuals representing organizations. The aim of SOCREAL Workshop is to bring together researchers working on diverse aspects of such interaction in logic, philosophy, ethics, computer science, cognitive science and related fields in order to share issues, ideas, techniques, and results.

Topics will include:

  • Language (or communication) as part of social reality
  • Speech acts (or communicative acts) as what shape social reality
  • Moral commitments (and conflicts) in social interaction
  • Logic and game theory as tools for studying social reality
  • (Organized) collective agency
  • Norms and normative systems
  • Social institutional facts and their dynamics

From my own perspective, presently, software agents operate at a rather primitive level with little more than basic data transfer and simple control, but eventually software agents will evolve into intelligent agents whose activity is more in the line of social behavior, including ethics and the behavior of groups and even organizations and institutions of software agents. And, of course, software agents are acting as agents for other entities, whether computational, or human. There certainly is a lot of ground to be broken. It is at least heartening that people are beginning to scratch the surface of the potential for social reality of computational entities.

Eventually, somebody will realize that these social agents are communicating in a language and that language has semantics and that there is a potential for a great semantic abyss between the various communities of social agents, as well as a vast semantic abyss between these computational agents and their real world "masters".

Great challenges and great opportunities.

-- Jack Krupansky

Interesting blog: Decentralyze - Programming the Data Cloud - by Sandro Hawke of W3C

I just ran across an interesting blog related to the Semantic Web called Decentralyze - Programming the Data Cloud by Sandro Hawke of W3C. A key theme of the blog is, as Sandro puts it, "I want computer systems to decentralize, minimizing central points of control. I don't like walled gardens or bottlenecks." My sentiment exactly.

His latest blog post, RDF 2 Wishlist, offers his thoughts on the next iteration of the W3C RDF recommendation.

-- Jack Krupansky

Thursday, September 3, 2009

Representing events, time, space, objects, and persons

Modeling of events in terms of objects, time, and space is an important aspect of representing knowledge. Event-Model-F is a formal model of events developed by Dr. Ansgar Scherp et al. and based on the foundational ontology DOLCE+DnS Ultralight (DUL). Event-Model-F provides comprehensive support for representing time, space, objects, and persons, as well as mereological, causal, and correlative relationships between events, and enables interoperability in distributed event-based systems. As the Event-Model-F web page states,

In addition, the Event-Model-F provides a flexible means for event composition, modeling event causality and event correlation, and representing different interpretations of the same event. The Event-Model-F is developed following the pattern-oriented approach of DUL, is modularized in different ontologies, and can be easily extended by domain specific ontologies.

"Mereological" simply refers to part-to-whole and part-to-part relationships for decomposing or composing a complex object or system.

-- Jack Krupansky

Monday, August 31, 2009

More thoughts on the book: Wired for Thought by Jeffrey Stibel

Previously, I gave a rather lackluster mini-review of the new book Wired for Thought: How the Brain Is Shaping the Future of the Internet by Jeffrey M. Stibel which claims that "The Internet is more than just a series of interconnected computer networks: it's the first real replication of the human brain outside the human body", but I have had a few more thoughts, in particular related to the concept of a "collective consciousness."

My main regret is that I failed to note that the World Wide Web as a whole does to a fair extent represent a dynamic snapshot of the collective consciousness of the millions of people who use the Web. Blog posts and Twitter streams do in fact give a reasonably accurate sense of the topics that are at the front of our collective minds and the tip of our collective tongues.

The Web itself does not sense or have consciousness, but users using the Web as a wall to write on and read from can convey their thoughts and reactions through the Web.

But, I think that is about as far as I feel comfortable going on this idea of the Web being analogous to the human brain.

After all, this collective consciousness is not really a consciousness per se in the way the human brain has a consciousness. There is no single voice of the collective. There is no I. There is no sense of self.

We cannot have a true dialogue with the collective.

We cannot ask a question and get an answer.

The collective does not have a personality.

You cannot have a one-to-one or one-on-one interaction with the collective.

The collective never makes a decision.

The collective does not have a responsibility. Nor does it have any obligations.

The collective does not exhibit common sense.

Nonetheless, the book does contain some interesting insights and is well worth a browse even if you do not purchase it.

-- Jack Krupansky

Monday, August 24, 2009

Sentiment vs. facts

There was an interesting article in The New York Times by Alex Wright entitled Mining the Web for Feelings, Not Facts about how companies are beginning to "mine" online social media such as blogs and social networks for consumer attitudes towards companies and their policies, products, and services. The emerging field of sentiment analysis aims at translating vague or not so vague opinions into hard data. The key thing here is that companies are much more interested in how consumers feel about companies and their policies, products, and services than in traditional hard, factual data.

In addition to how people feel, companies are also interested in identifying who are the more influential opinion holders.

Organizing and presenting all of this data is also a key challenge.

The one point I would make is that this is all fine and dandy for companies, but I think that consumers would like to access similar data and analyses.

There is obviously a lot in common between what a company and a consumer would like to do in terms of understanding sentiment towards companies and their policies, products, and services, but there are differences. In some sense, consumers may have even more intense needs and desires to seek and be at the bleeding edge of consumer trends. After all, it is the consumers who have an intense passion both for being part of the latest trends and for setting them.

The obvious difference is that consumers won't be paying an arm and a leg for expensive software and services for sentiment analysis.

Consumers already have some amount of familiarity with sentiment analysis as there is a wealth of lists of top topics, hit topics, most read stories, top keywords, ranking and sharing of preferences for web pages, etc.

My hunch is that there are probably more consumers that have a keener sense of sentiment on the Internet than your average corporate suit in traditional market intelligence.

In any case, consumers need ever-greater tools and capabilities for recording and monitoring sentiment, both as producers of sentiment and consumers of sentiment.

As we evolve an infrastructure for a true knowledge web, representation of and access to sentiment knowledge and data need to be a key focus.

-- Jack Krupansky

Book: Wired for Thought by Jeffrey Stibel

Yesterday I was browsing through the new book table at Barnes & Noble near Lincoln Center and found an interesting book entitled Wired for Thought: How the Brain Is Shaping the Future of the Internet by Jeffrey M. Stibel that informs us that "The Internet is more than just a series of interconnected computer networks: it's the first real replication of the human brain outside the human body" and that a "collective consciousness" is being created. Sounds fascinating. The Amazon blurb tells us that:

In this age of hyper competition, the Internet constitutes a powerful tool for inventing radical new business models that will leave your rivals scrambling. But as brain scientist and entrepreneur Jeffrey Stibel explains in "Wired for Thought", you have to understand its true nature. The Internet is more than just a series of interconnected computer networks: it's the first real replication of the human brain outside the human body. To leverage its power, you first need to understand how the Internet has evolved to take on similarities to the brain. This engaging and provocative book provides the answer. Stibel shows how exceptional companies are using their understanding of the Internet's brain like powers to create competitive advantage - such as building more effective Web sites, predicting consumer behavior, leveraging social media, and creating a collective consciousness.

The promise sounded truly compelling, but after five minutes of leafing through the book I was not able to isolate more than a few stray details that had any bearing on fulfilling the promise. There was too much "pop puff" which may thrill the average reader ignorant of the relevant technology, but I simply was unable to find any substantive justification for the central thesis of the book. It may in fact be there since I did not read the book cover to cover, but if it is so compelling and presumably pervasive, how could I have missed it?

Nonetheless, this book may have a solid position simply as a statement of "the state of the art", telling us not how close we are to real success, but simply where we happen to be today. Yes, we are getting closer to the mountain, but that does not automatically translate into closeness to the peak.

There is a lot that we do not yet deeply comprehend about the human brain, mind, consciousness, and intellect, so I am not sure how much mileage we can get out of comparing the Internet to the human brain. In fact, I have a hunch it might be an exercise in futility at this stage. Sure, we can paint a broad-brush picture and draw lots of fuzzy analogies, but none of that will necessarily result in true enlightenment.

By all means, browse the book yourself and make up your own mind whether it meshes with your own expertise and interest levels. The book does have a web site with chapter excerpts.

For me, I put down the book pondering the question "Where's the beef?".

Oddly, Amazon does not have a picture of the book cover, but I was able to find it on the Harvard Business School Press web site since they are the publisher. Note: I get a small commission if you buy the book by clicking on any of my links to the book on Amazon.

-- Jack Krupansky

How reliable are questions?

In a recent post I commented on our dependence on the reliability of knowledge. Now, I'll extend that inquiry to the reliability of questions themselves. You might wonder how could a question be unreliable? How could a question be false? How could a question be misleading? Good questions. That is the point.

A question is really simply an implied statement at a foundation level. The implied statement is a characterization of a quantity of information or knowledge that is desired, coupled with a request or demand or command that the requested information be provided.

So, how can a question be unreliable?

  • The person asking the question may not really need the information. In that sense, the implied statement "I need X" may be a lie.
  • The person may already have the information so that supplying the information may not be necessary. They may merely be seeking confirmation or maybe testing the other party.
  • The question may be overly broad due to poor phrasing.
  • The question may be overly narrow due to poor phrasing.
  • The question may refer to knowledge that simply does not exist.
  • The timing of the question may be inappropriate, either too early or too late to get a reasonable answer.
  • The questioned party may not be a reasonable source for the answer.
  • The question may be an imposition or unfair or disrespectful or discourteous of the questioned party.
  • The tone of the question may be inappropriate.
  • The represented need for an answer may not be appropriate.
  • The implied statement may be offensive.
  • The question may be illegal or a violation of the questioned party's rights.
  • The complexity of the answer may be far out of proportion to the expectation of the questioner.
  • The two parties may not agree on compatible interpretations of the terms used within the question.
  • The questioner may be using private interpretations of some terms without disclosing those interpretations.
  • The context may not be explicit in the question.
  • The context may be incomplete or ambiguous in the question.
  • The question may be ambiguous. Even simple English words can be ambiguous.
  • The question may really be simply a statement in the form of a question with no intent that an answer is expected.
  • The question may be rhetorical. No "answer" is expected, but the question is intended to "hang" over the interaction.
  • The question may seem simple, but have underlying complexity that the questioner or the questioned party may be unaware of.
  • The answer to a question may have a radically different context than the questioner was prepared for and that seemed implied by the question.
  • The question might be loaded so that any answer might be misleading.
  • The question might be leading so that the answer might be improperly biased.
  • The question might be designed so that the legitimate answer might indirectly mislead a third party monitoring the interaction.
  • The question might be worded in such a way that the answer might be misleading if viewed by itself without the full context of the question.
  • The statement implied by the question might be internally inconsistent.
  • The question might be intended as a distraction rather than a genuine query for information.
  • The questioner may be asking the right question but the wrong person.
  • The questioner may not have done sufficient due diligence to identify a source that can reasonably be considered reliable.

At heart, the issue is whether the questioned party or a computational system being questioned can reasonably be expected to respond with an acceptable answer. And even if the response is considered "acceptable", was the question itself reliable enough for the questioner to be able to depend on the answer (assuming the answer itself is also reliable)?

Context is essential. The questioned party may be able to infer all or part of the questioner's context, but assumptions about context can be somewhat risky and unreliable, possibly leading to an unreliable question despite the best of intentions on the part of the questioner.

Answering questions reliably certainly requires very careful attention to detail, but there is still plenty of craft if not outright art that needs to go into constructing reliable questions. There is an old saying in the data processing world, "GIGO - Garbage In, Garbage Out."

All too often, people have a false sense of confidence in the reliability of their questions which can lead to a false sense of confidence that the answers are valid for the questions they thought they were asking.

The only tried and true method I know of to even come close to assuring the reliability of a question is to ask multiple parties the same question and to ask multiple corollary questions so that the multiple answers can be examined to reinforce the most reliable answer. This doesn't even come close to avoiding all of the reliability factors I listed above, but at least it is a start.
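
As a rough illustration of comparing answers from multiple parties, here is a minimal Python sketch. The function name and the sample replies are hypothetical, and a simple majority count is only one assumed way of "reinforcing" an answer; it does nothing about the other reliability factors listed above.

```python
from collections import Counter


def most_supported_answer(answers):
    """Return the most common answer and the share of respondents who gave it."""
    if not answers:
        raise ValueError("no answers to compare")
    # Normalize trivially different spellings so they count as the same answer.
    normalized = [answer.strip().lower() for answer in answers]
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer, votes / len(normalized)


# Hypothetical replies from four parties asked the same question.
replies = ["yes", "Yes ", "no", "yes"]
best, share = most_supported_answer(replies)
print(best, share)
```

A low share would be a hint that the question itself, and not just the answers, deserves a second look.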

-- Jack Krupansky

Sunday, August 23, 2009

How reliable is knowledge?

We depend on knowledge in our daily lives. We presume that what we consider "knowledge" is true or at least highly likely to be true. But, how reliable is any of what we call "knowledge"? This raises some questions:

  • How can we know that any purported knowledge is in fact true?
  • How can we verify that any knowledge is true?
  • How can we determine how to verify any knowledge?
  • How can we have any confidence in our belief in any knowledge?
  • What can we really do when we are unsure whether any knowledge is really true?
  • What can we confidently say about the truth of any knowledge that we believe in?
  • What statements can we safely make about the reliability of any knowledge?
  • What disclaimers should we give regarding the reliability of any knowledge?
  • How certain do we need to be before we can assert that a statement or claim is in fact knowledge?

Ultimately, we need to be able to point to a piece of knowledge and ask and get the answer to one simple question: How reliable is this knowledge?

This implies that there needs to be some record of the history of asking and answering these questions for each and every bit of knowledge.

But, even with such a historical record, how reliable is any of that history and how can we even believe that any of it is reliable?

Maybe the bottom line is that every bit of knowledge is of dubious reliability, even if we do not quite express or acknowledge it.

Nonetheless, we need to have some sense of the reliability of every bit of knowledge.

Trust probably has a role. To wit, if we know who believes a bit of knowledge, we can then judge that person or institution's credibility for having good reason to believe in that knowledge.

Ultimately, we do depend on our own judgment of the veracity of any knowledge, but at least some of us know better than to trust our own judgment too far.

There are also two quite different statements any of us can make about knowledge:

  1. Do we believe and accept that a given bit of knowledge is valid?
  2. Do we have good reason for that belief?

Maybe a simple statement about why we believe in the validity of any knowledge is good enough or maybe even as good as it gets.

It would be nice for a knowledge web to have links for each bit of knowledge that say who believes it and who or what they can reference as to why they believe it. That is clearly not enough to judge the ultimate reliability of a bit of knowledge, but is surely a great start.
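
One way to picture such links is the following minimal Python sketch. The class names, fields, and sample data are all hypothetical and are not drawn from any existing knowledge web or Semantic Web vocabulary.

```python
from dataclasses import dataclass, field


@dataclass
class Belief:
    believer: str                                 # who holds the belief
    reasons: list = field(default_factory=list)   # what they cite as justification


@dataclass
class Statement:
    text: str
    beliefs: list = field(default_factory=list)

    def believers(self):
        return [b.believer for b in self.beliefs]

    def unsupported_believers(self):
        """Believers who gave no reference for why they believe the statement."""
        return [b.believer for b in self.beliefs if not b.reasons]


# Hypothetical example data.
claim = Statement("Water boils at 100 degrees Celsius at sea level.")
claim.beliefs.append(Belief("Alice", reasons=["chemistry textbook"]))
claim.beliefs.append(Belief("Bob"))
print(claim.believers())
print(claim.unsupported_believers())
```

This records who believes a statement and why, but deliberately makes no attempt to compute a reliability score from that record.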

-- Jack Krupansky

Wednesday, August 5, 2009

What is the future of the English language, especially for the Semantic Web?

Despite all of the myriad technological advances in computer hardware and software and all of the wonderful specialized computer languages, it is amazing that natural language, in particular English, is still such a dominant force in the world. This is where we are today, but what about the future?

Computer language designers and application developers are quite busily at work incrementally chipping away at more and deeper and broader niches where computer languages can supplant natural language as the preferred "tongue." Still, progress is very slow. Natural language is still the choice for expressiveness and flexibility and ease of use. That seems unlikely to change any time soon.

Low birth rates in "English-speaking" countries make it increasingly likely that fewer and fewer people will consider English to be their native tongue in the decades to come. Still, somehow, English continues to have value to "open doors" across cultures, especially in business, government, science, and engineering, and especially computer software.

The Web makes it very easy for people to use their local language, which is fine when the intended audience is local, but many Web sites either use English or have an alternate set of Web pages in English to cater to a global audience.

Then we come to the Semantic Web. In some sense the Semantic Web is a direct parallel to the traditional non-semantic Web. It is difficult to say whether data in the Semantic Web is any more global than the old Web. Maybe initially more of the efforts are for a global audience, just as they were with the traditional Web in the early days, but over time we should hope that very specialized databases and applications will be tailored heavily for local audiences. Rest assured that the Semantic Web technologies are designed for internationalization and localization.

But since globalized Semantic Web applications and code libraries, by definition, know something about the data they are processing, that implies that this "knowledge" about the data needs to be in a language-independent form.

To be truly useful, software agents, especially intelligent agents, need to access the meaning of data and to access it globally. This means, once again, that knowledge about data needs to be represented in a form that is not hidden by localized natural languages.

As things stand today, the three main tools for globalizing applications are:

  • Use of English as the "core" language.
  • Maintaining data in conceptual, language-neutral form.
  • Tools for localizing data to the native or preferred tongue of the user.

Automatic language translation is still fairly primitive and unlikely to be "solved" in the near future.

As technologies, especially the Semantic Web, are under development and dynamically evolving at a rapid pace it makes sense to focus on a single, core natural language to assure that information is communicated as rapidly and widely as possible.

But as the technology matures (maybe in another ten to twenty years), the need for such broad communication will rapidly diminish. Sure, the elite will still communicate globally, but the average practitioner will likely serve a local audience. All important documents and specifications will have been translated into all the significant natural languages. In that kind of environment the need to "work" in English will effectively vanish, much as we see today with local Web sites, blogs, and other social media. Twitter is the latest to experience this localization phenomenon.

That still leaves the case of software agents. An unsolved problem. Sure, they can be heavily localized as well, but that is not a solution per se. Maybe initially a new development in agent technology might be English-only or English-centric, but as that technology matures, it is only natural that developers seek to refine the technology to exploit localized intelligence. That may mean that such an agent is less usable at a global level, but it may not be as important at that stage. Also, agents can be programmed with split personalities so that they can still operate at a global level albeit at a somewhat lower level of capability than the more specialized localized intelligence. That also requires greater effort and discipline on the part of developers. That is less than optimal.

There is also the underlying issue that besides superficial language issues, there are also cultural differences between countries, peoples, and regions of the world. Initial Semantic Web efforts may tend to be at a fairly high level where such cultural differences are rather muted, but as Semantic Web applications become deeper and more localized, cultural differences will start moving to the forefront.

Academic and high-end commercial developers have a need and interest to present their work globally, including marketing, journal papers, and conferences. English is the norm. Semantic Web content that is not in English will tend to not be preferred in such venues. Besides, high-end developers will tend to prefer to develop internationalized content that can be localized as needed.

Global communities of developers are also becoming a new norm. This includes open source community projects where, by definition, the initial and current contributors have no idea what country or culture future contributors will be from. This once again argues against doing development in anything other than English.

All of this leads me to believe that English will continue to be the dominant natural language for advanced and emerging computer software, especially the Semantic Web, for some time to come. Nonetheless, the issue of how to fully support and exploit local natural languages will remain and increasingly become a very thorny issue.

One lingering issue with the Semantic Web is the language to be used for constructing predicate and property names. They tend to be English, so far, which is okay since the end-user should never see them, but there is no requirement that they be English and some developers who have less global interests may begin to use their local language for predicate and property labels. This introduces a whole new level of mapping and matchmaking complexity. Sure, it is solvable, but it is also unsolved and a potential problem lurking around the corner. Sure, motivated developers can manually add the necessary mapping, but it simply tends to minimize the extent to which serendipity just works without manual intervention by developers.
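
To make the manual mapping concrete, here is a small Python sketch. The predicate names, the mapping table, and the plain tuples standing in for triples are assumptions for illustration only; a real application would more likely express such mappings in the ontology itself.

```python
# Hypothetical local-language predicate names mapped to one shared name.
PREDICATE_MAP = {
    "nom": "name",       # French
    "nombre": "name",    # Spanish
    "name": "name",
}


def normalize(triple):
    """Rewrite the predicate of a (subject, predicate, object) tuple."""
    subject, predicate, obj = triple
    # Leave unknown predicates untouched so nothing is silently dropped.
    return subject, PREDICATE_MAP.get(predicate, predicate), obj


triples = [("p1", "nom", "Marie"), ("p2", "nombre", "Luis"), ("p3", "age", 30)]
normalized = [normalize(t) for t in triples]
print(normalized)
```

Every entry in the table is a piece of manual work by a developer, which is exactly the cost described above.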

All of this raises the question of how the English language itself might evolve over the coming decades. One interesting possibility is that at some point the technology "users" of English, including Semantic Web content developers, could become so much greater a force than the native speakers that they will realize that they are effectively in control of the English language and can evolve it as they see fit. It will be interesting to see how that post-English language evolves.

-- Jack Krupansky

Tuesday, August 4, 2009

OWL examples in Manchester Syntax

There is a small collection of OWL ontology reasoning examples from Manchester University in the UK. They are written in the Manchester OWL syntax, which is a more compact and easier to read frame-based syntax than the usual RDF/XML triple/axiom format of the Semantic Web. There are four example ontologies, People, Pets, Pizzas, and Sports Teams.

Here is an example of a class in the Manchester OWL syntax:

Class: man
    Annotations:
        rdfs:label "man"
    EquivalentTo:
        adult
        and male
        and person
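
For readers more at home with programming than with description logic, the "and" in the class above can be read roughly as set intersection. The Python sketch below uses hypothetical individuals and plain closed sets, so it is only an analogy: an OWL reasoner works under the open-world assumption and would not conclude that someone is not a man merely because they are missing from a list.

```python
# Hypothetical individuals; none of these names come from the example ontologies.
adult = {"alice", "bob", "carol"}
male = {"bob", "dave"}
person = {"alice", "bob", "carol", "dave"}

# "adult and male and person" corresponds to the intersection of the three classes.
man = adult & male & person
print(man)
```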

-- Jack Krupansky

Timestamping knowledge in the Semantic Web

Context is an important element of knowledge. Time is an important element of context. If we really want to understand a piece of knowledge, we need to know its context and its timing in the flow of events. Data and knowledge need to be timestamped.

There is no single time or timestamp for any piece of knowledge. Various timings include:

  • When an observation was made, when the raw observation data was captured. This may be from a hardware sensor monitoring the real physical world, a process monitoring some data stream, or even a user interface.
  • When the raw observation data was analyzed to derive the nominal observation, the nominal knowledge.
  • When the knowledge was stored.
  • When the knowledge was validated.
  • When the knowledge was published or otherwise made available.
  • When the knowledge was calculated from other knowledge.
  • When the validity of the observation is expected to expire.

In some cases, the raw observation data might be preserved and re-analyzed at a later date with "improved" analytic capabilities and the nominal knowledge re-generated. In such cases there would then be multiple pieces of knowledge for each observation, each qualified by the time of analysis or re-analysis.
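
The various timings, and the idea of several analyses of one observation, might be sketched in Python as follows. The class, its field names, and the sample values are hypothetical; this is not a proposal for how the Semantic Web should represent them.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Knowledge:
    value: str
    observed_at: datetime                    # raw data captured
    analyzed_at: datetime                    # nominal knowledge derived
    stored_at: Optional[datetime] = None
    validated_at: Optional[datetime] = None
    published_at: Optional[datetime] = None
    calculated_at: Optional[datetime] = None
    expires_at: Optional[datetime] = None    # validity expected to lapse

    def is_expired(self, now):
        return self.expires_at is not None and now >= self.expires_at


def utc(*args):
    """Build a timezone-aware UTC datetime."""
    return datetime(*args, tzinfo=timezone.utc)


observed = utc(2009, 8, 4, 12, 0)
first = Knowledge("cloudy", observed, analyzed_at=utc(2009, 8, 4, 12, 5))
# Re-analysis of the same raw data yields a second piece of knowledge.
second = Knowledge("partly cloudy", observed, analyzed_at=utc(2009, 9, 1, 9, 0),
                   expires_at=utc(2009, 9, 2, 9, 0))
latest = max([first, second], key=lambda k: k.analyzed_at)
print(latest.value, latest.is_expired(utc(2009, 9, 3)))
```

Both pieces of knowledge share one observation time but differ in analysis time, which is what lets a consumer pick the most recent analysis or compare the two.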

In some cases there may be latency between the raw sensor capture of the data and the reading of that raw data from the sensor device by the computational device that will record that sensor data. Typically that latency will be too small to matter, but for high-speed capture sequences it may be significant. Two separate timestamps may be needed. Or, a discrete timestamp for each processing step along the way.

A piece of knowledge may have been captured from multiple sources, so we need to represent the distinct sources and their distinct timings as well. Collectively they may still represent a single logical observation. An example might be a 3-D camera which is really multiple cameras.

One could also link a number of discrete but simultaneous observations, such as all cameras in a given area, so that collectively they can be considered a single super observation. That overall super observation can have its own timestamps, but there also needs to be a way to drill down to get all of the component timestamps.

The timing of capture by multiple sources may be close enough to be considered the same time, or maybe enough time had elapsed to suggest that they were different observations. Actually, they are different observations in any case, but the issue is whether they are equivalent, or more precisely equivalent in some particular sense. This concept of sense of equivalence needs to be explored more fully.

Each observation station may have its own timepiece and they may not be synchronized. One solution might be to suggest that timepiece synchronization should be a standard protocol when two or more devices are exchanging information that is time-sensitive. Maybe the local time is recorded and then a delta time is recorded for any data that is transferred between two devices.
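
The delta-time idea can be sketched in a few lines of Python. The function names and clock readings are hypothetical, and the sketch assumes both clocks are read at the same instant, ignoring transfer latency, which a real synchronization protocol would have to account for.

```python
from datetime import datetime


def clock_delta(sender_clock, receiver_clock):
    """Offset that converts sender timestamps to the receiver's clock.

    Both readings are assumed to be taken at the same instant; transfer
    latency is ignored in this sketch.
    """
    return receiver_clock - sender_clock


def to_receiver_time(sender_timestamp, delta):
    return sender_timestamp + delta


# Hypothetical readings exchanged when the two devices connect.
delta = clock_delta(datetime(2009, 8, 4, 12, 0, 0), datetime(2009, 8, 4, 12, 0, 3))
observed_on_sender = datetime(2009, 8, 4, 11, 59, 50)
print(to_receiver_time(observed_on_sender, delta))
```

Keeping the original local timestamp alongside the delta, rather than overwriting it, preserves the raw record in case the delta is later found to be wrong.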

Calculated data is especially problematic because each of the elements of data used in the calculation may have its own timestamps. The implication is that each piece of calculated data should have an element trail that references each of those pieces of knowledge used in the calculation so that they can be examined later if the data needs to be audited.
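
A minimal Python sketch of such an element trail follows. The `Fact` class, the `derive` and `audit_trail` helpers, and the sample numbers are hypothetical, chosen only to show each calculated value keeping references to its timestamped inputs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Fact:
    name: str
    value: float
    timestamp: datetime
    inputs: tuple = ()   # the facts this one was calculated from, if any


def derive(name, func, inputs, now):
    """Calculate a new fact and remember which facts it was derived from."""
    value = func(*(fact.value for fact in inputs))
    return Fact(name, value, now, tuple(inputs))


def audit_trail(fact):
    """Yield (name, timestamp) for every contributing fact, depth-first."""
    for source in fact.inputs:
        yield source.name, source.timestamp
        yield from audit_trail(source)


t0 = datetime(2009, 8, 4, 12, 0, tzinfo=timezone.utc)
distance = Fact("distance_km", 100.0, t0)
duration = Fact("duration_h", 2.0, t0 + timedelta(minutes=5))
speed = derive("speed_kmh", lambda d, h: d / h, [distance, duration],
               t0 + timedelta(minutes=10))
print(speed.value, [name for name, _ in audit_trail(speed)])
```

The calculated fact carries its own timestamp, while the trail gives access to the distinct timestamps of everything it was built from.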

Now, how all of these timestamps would be represented and stored in the Semantic Web is another matter entirely and left for further contemplation.

-- Jack Krupansky

Monday, August 3, 2009

Reference data for the Semantic Web

If you want to construct a database of any serious data you quickly realize that you need to share as much common data as possible. A key concept is reference data, which is simply common data that is needed in a number of places within the database schema. Reference data is a tool to help you cope with complexity as well as interoperability. This allows you to leverage the extra effort spent on defining and refining the reference data so that the rest of the database can depend on the quality and detail of that reference data without having to reinvent the wheel every time a bit of the common data is needed. Examples of reference data include names (and other info) of countries, states, and cities, named entities such as businesses or vendors, names of products and services, codings for colors, shoe sizes, any forms of units, classes of service, types of foods, types of meals, forms of payment, medical conditions and treatments, names of bones, names of animals and plants, etc. In general, any of these pieces of reference data would have at least a natural language name and description, but the most important thing is that each item has an ID or identifier that can be used in the body of the database rather than storing the natural language text repeatedly all over the database.
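
Here is a minimal Python sketch of the idea, using country codes as the reference data. The class names, the in-memory dictionary, and the sample customers are hypothetical; in a real database the reference data would live in its own table or dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Country:
    code: str          # the identifier used everywhere else
    name: str
    description: str


# Reference data is defined once and managed separately from application data.
COUNTRIES = {
    country.code: country
    for country in (
        Country("US", "United States", "Country in North America"),
        Country("FR", "France", "Country in Western Europe"),
    )
}


@dataclass(frozen=True)
class Customer:
    name: str
    country_code: str  # stores only the ID, not the natural language text


def country_name(customer):
    return COUNTRIES[customer.country_code].name


# Hypothetical application records.
customers = [Customer("Alice", "US"), Customer("Bruno", "FR"), Customer("Carol", "US")]
print([country_name(c) for c in customers])
```

Correcting a name or description in the reference data then takes effect for every record that refers to it by ID.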

The generic concept at work here is factoring where one or more models are compared, common elements are identified, extracted, and then the common elements are referenced indirectly and managed and controlled separately.

In the context of the Semantic Web, reference data includes global information which is common to many SemWeb applications. A developer may be constructing an application-specific database, but they should be able to leverage the work of others by referencing ontologies and reference data that have already been developed by the global community, such as in the context of the Linked Data Web.

The ID or identifier for a piece of reference data on the Semantic Web is of course represented as a URI, an RDF URI reference.
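To make the idea concrete, here is a small Python sketch of reference data keyed by URI. The URIs under example.org and the sample records are invented for illustration and do not refer to any real published dataset.

```python
# Hypothetical reference data: each item has an identifier (a made-up URI
# under the reserved example.org domain), a name, and a description.
COUNTRIES = {
    "http://example.org/ref/country/US": {
        "name": "United States",
        "description": "Country in North America",
    },
    "http://example.org/ref/country/SK": {
        "name": "Slovakia",
        "description": "Country in Central Europe",
    },
}

# Application records store only the identifier, never the repeated text.
customers = [
    {"customer": "C-001", "country": "http://example.org/ref/country/US"},
    {"customer": "C-002", "country": "http://example.org/ref/country/SK"},
    {"customer": "C-003", "country": "http://example.org/ref/country/US"},
]


def country_name(record: dict) -> str:
    """Look up the natural language name through the shared identifier."""
    return COUNTRIES[record["country"]]["name"]
```

Correcting a name or description in the reference data then takes effect everywhere the identifier is used.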

A broad array of reference data is a necessary requirement for a solid foundation on which application developers can develop domain-specific and application-specific databases.

Reference data is also a key to being able to match disparate databases which were developed at different times and places. Gradually, databases will begin to share reference data, but at least in the short-run databases can be merged or meshed by figuring out how their common data meshes through the mechanism of reference data.

There can be many levels of reference data. Some data is truly global and readily shared across virtually all other databases. At the other end of the spectrum, there might be a family of applications or a niche domain, such that reference data is a useful data structuring tool, but the impact is much more limited compared to the entire global Semantic Web.

In the traditional database world there is also the concept of master reference data. In that conception, reference data would simply be common within a single database, while master reference data would be common across multiple databases. Both are useful concepts. Individual databases can certainly be structured better using factoring and data can be globally more interoperable when factoring is done on a more global basis. In the context of the Semantic Web, I will continue to use the simpler term reference data, primarily to refer to global data factoring, but not intending to exclude factoring within individual databases. After all, some of the best global reference data might well originate within a single application before people eventually realize the global benefits.

-- Jack Krupansky

Thursday, July 16, 2009

Reasoning, the rational, irrational, objective, subjective, and the realm of the nonrational

Reasoning is a critical capability needed to survive and thrive in the modern world. It is also a foundation of modern computing. But, in the "real" world, reasoning is not always the foundation of all thought and action. We use the term rational to characterize thinking and behavior that employs reasoning. We use the term irrational to characterize thinking and behavior that at least appears to "defy all logic" or "flies in the face of reason." In general, rational thought and action are considered good and irrational thought and behavior are considered bad.

In the context of this note I am concerned mainly with the communication of information, beliefs, observations, facts, logic, and conclusions, so even if an individual may act reasonably (possibly even by flipping a coin or reading tea leaves), the question is whether they are able to effectively communicate their thought processes and observations to the proverbial neutral observer.

Reasoning works well when we have access to an objective view of the facts, when all relevant parties can agree on the truth of the facts.

Reasoning tends to break down when individual views of the facts are subjective. If we can't agree on the truth of the facts, we are less likely to come to compatible conclusions, except maybe by chance.

There is nothing wrong with subjectivity per se, and it may be an essential quality of much of the "real" world, but it does suggest that we cannot categorize all thinking and action as either rational or irrational.

Intuition is one example of thinking that defies categorization as rational or irrational and is not clearly based on reasoning per se.

Gut feel is another example of a mental process that defies categorization as rational or irrational.

Personal preferences are commonly not guided exclusively by reasoning.

I would suggest that there is a realm of the nonrational which includes all forms of thought and behavior that might be considered reasonable by at least some neutral observers, but cannot clearly be characterized as strictly rational or irrational.

I would not categorize all aspects of religion, ethics, and aesthetics in the realm of the nonrational, but clearly the spiritual, including the existence and nature of a deity, the existence of a soul, and life before and after death would seem to fit nicely in the realm of the nonrational. Concepts such as beauty and preferred behavior and social values do not strictly flow from hard reasoning, but that does not make them implicitly unreasonable and irrational. They may have significant value to society even if we are currently unable to elucidate a formal logic for such a conclusion.

Many forms of mysticism might also reasonably be categorized as being in the realm of the nonrational, but any form of mysticism based on outright fraud should clearly go into the category of irrational. Or, maybe fraudulent mysticism should actually remain in the realm of the nonrational, but merely flagged as fraudulent, especially since "true" believers might not be inclined to accept any form of reasoning about their cherished beliefs.

I would not suggest that all forms of subjectivity should automatically be categorized as being in the realm of the nonrational. In cases where the range of subjectivity is fairly narrow and bounded, we can still reason reasonably effectively. But where the range of subjectivity is all over the map, unbounded, and unbridled, clearly reasoning is of little value.

Another general category in the realm of the nonrational is beliefs and claims of behavior that by their very definition and nature cannot be verified by observation or any amount of logic. Some examples are:

  • Out of body experiences
  • Communicating with the dead
  • Seeing the future
  • Recalling a past life
  • Having visions that others cannot see
  • Hearing voices in one's head that others cannot hear
  • Characterizing one's soul
  • Claiming the existence of a true soulmate
  • Claiming to have seen or done something without any credible, verifiable evidence

In a Knowledge Web, it does make sense to be able to represent both the irrational and the nonrational in addition to the clearly rational. This does highlight one of the difficulties with reasoning within the context of a Knowledge Web. One might derive a conclusion from irrational or nonrational claims, but one needs to be sure to properly categorize the result based on the strength or weakness of the claims upon which the "reasoning" is based.

In any case, the representation and use of the nonrational in a Knowledge Web is worthy of further consideration.

-- Jack Krupansky

Wednesday, July 15, 2009

Published my Semantic Web glossary

I updated and published more of my Semantic Web definitions.

In addition to being in a hyperlinked web (Semantic Web is a good place to start), the terms are listed alphabetically in my Semantic Web Glossary.

My own glossary is far from complete and not as readable as a traditional glossary, so if you are looking for an easy to read introductory glossary, check out Alex Genadinik's Semantic Web Glossary.

-- Jack Krupansky

Published more of my Semantic Web-related definitions

I will continue tuning and extending my Semantic Web definitions, but I did publish what I have so far:

-- Jack Krupansky

Tuesday, July 14, 2009

Published my updated Semantic Web definition

I am still working on my Semantic Web definitions, but I did publish what I have so far:

Now I need to work on definitions for:

  • Linked Data
  • Web of Linked Data
  • Web application
  • Semantic Web application
  • Linking Open Data (LOD)
  • LOD cloud
  • metadata
  • RDFa
  • microformat

-- Jack Krupansky

Monday, July 13, 2009

First draft of my Semantic Web definition

I am still working on it, but here is my initial draft of my own definition for Semantic Web:

The Semantic Web is the architecture, technologies, and implementation of the vision of the Web of data which enables data to be shared and reused across application, enterprise, and community boundaries as a hyperlinked collection of data and metadata represented as Web resources combined with RDF triple statements that describe the details, meaning, and relationships among resources in a form that can be readily processed by computer software such as Web applications and software agents in a manner meaningful to applications rather than the presentation form of the traditional Web.

It is not my intention to reinvent or re-envision the Semantic Web, but simply to come up with a reasonably concise and accurate definition since there is none available today.

I am not completely satisfied with this draft, but I think it does say everything that is needed, even if it is a bit verbose.

-- Jack Krupansky

Friday, July 10, 2009

Problems, questions, answers, issues, ideas, speculation, processes, and imagination in a Knowledge Web

Today I happened to run across this quote from Albert Einstein:

Imagination is more important than knowledge...

Well, yeah, I suppose that does make sense.

A little more Googling turned up a more complete version of that quote:

Imagination is more important than knowledge. For knowledge is limited to all we now know and understand, while imagination embraces the entire world, and all there ever will be to know and understand.

Okay, I get it.

Now, I am pondering whether imagination should have some role or position in a comprehensive Knowledge Web. Not so much as actual, real entities, but maybe as placeholders for gaps where we know that something may be missing but we do not know what the missing link actually might be. We can also make use of links to indicate uncertainties about our knowledge. And, more directly to the point of imagination, we can represent speculation for the possibility of future knowledge.

Speculation is maybe simply the midpoint between imagination and knowledge.

A conjecture is a form of speculation.

In fact, one might consider a conjecture as a slightly congealed form of imagination.

Ditto for an idea, but an idea is even less congealed than a more formalized conjecture.

Imagination is more of a mental process, which generates ideas.

I think it makes a lot of sense for a Knowledge Web to include problems, questions, answers, and issues as first-class entities in the Knowledge Web, ranking right up there with knowledge itself, in the sense that they are conceptual things that we work with. Similarly, we do in fact work with ideas, conjectures, theories, and other forms of speculation.

Imagination per se does in fact fit into a Knowledge Web as a conceptual entity in the same way as any other process, whether physical or mental, and is a conceptual thing that we can contemplate, discuss, and record.

But processes also transcend a Knowledge Web in the sense that they do have an active life of their own, distinct from pure knowledge itself.

We can also speak of a Knowledge Web as supporting or facilitating a process.

A Knowledge Web can obviously store information about the various artifacts that may be generated by a process, whether physical or mental.

Nonetheless, imagination would seem to be a very special process unlike all other processes. Most processes have at least some degree of predictability and in most cases it is that predictability that yields the most value. In contrast, imagination is highly unpredictable and it is that unpredictability that is most highly valued.

How to mesh unpredictability into a Knowledge Web is an interesting challenge.

Ultimately, we want a Knowledge Web that supports creativity, encouraging and facilitating imagination and other creative thought processes, and enabling realistic conceptualization of our ideas so that they can be carried into development and practice, as we see fit.

-- Jack Krupansky

Thursday, July 9, 2009

Nicknames, alternate names, synonyms, abbreviations, and other shortcuts

The formal names for concepts such as objects, people, places, streets, etc. can be rather inconvenient or in some cases a matter of dispute. In the real world, in natural language we use a variety of shortcuts:

  • Nicknames
  • Alternate names
  • Synonyms
  • Abbreviations
  • Full names that require context (e.g., city or town names that occur in more than one state or country)

In theory, with the Semantic Web we can have a single concept URI for each thing and then state axioms to equate the various shortcuts with their equivalent specific concept. Unfortunately, a given shortcut might be used for more than one concept. Some form of context or other form of additional detail must be supplied to disambiguate ambiguous shortcuts.

In the case of a user interface, a popup list of the choices can be provided and the user can make an explicit choice of the specific concept.

But in the case of a computational agent, the agent must supply the disambiguating data.

This also raises the question of how to store the graph that would describe what facts need to be detailed in order for a computational agent to choose between competing alternatives for a given shortcut. Sure, that could be application specific, but it would be a shame if each application were forced to invent its own mechanism. Possibly a generic context lookup mechanism (e.g., the PostScript dictionary stack metaphor) could be defined to satisfy this need.
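As a rough sketch of what a dictionary-stack style of context lookup might look like, Python's standard collections.ChainMap searches a stack of mappings from the most recently pushed context down to the most general one. The place names and example.org URIs below are hypothetical, and a real mechanism would need far richer context than a simple stack.

```python
from collections import ChainMap
from typing import Optional

# Hypothetical concept URIs under the reserved example.org domain.
GLOBAL_NAMES = {
    "Paris": "http://example.org/place/Paris_France",
    "NYC": "http://example.org/place/New_York_City",
}
TEXAS_NAMES = {
    "Paris": "http://example.org/place/Paris_Texas",
}


def resolve(shortcut: str, context: ChainMap) -> Optional[str]:
    """Return the concept URI for a shortcut, or None if no context defines it."""
    return context.get(shortcut)


# The bottom of the stack is the most general context.
base_context = ChainMap(GLOBAL_NAMES)

# Pushing a narrower context shadows any shortcut that it redefines.
texas_context = base_context.new_child(TEXAS_NAMES)
```

With this stack, "Paris" resolves to the Texas concept inside texas_context and to the French one in base_context, while "NYC" resolves the same way in both. Popping a context is just texas_context.parents.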

Then there is the question of when a shortcut should be substituted with its equivalent unambiguous concept URI. Early binding has some advantages, but late binding also has some appeal. Another approach would be to carry around both, possibly in the form of a special shortcut mapping which gives the disambiguated concept for direct access but also provides the original shortcut for porting to other contexts or display, debugging, or other forms of convenience.

-- Jack Krupansky

I changed my name (in Facebook)

I had not been doing much with Facebook, but since I was pondering issues with names, I decided to go in and see what I had used for my name when I had claimed my Facebook profile (whenever that was, maybe a couple of years ago.)

I had in fact claimed Jack Krupansky as my name in Facebook. No surprise there. That is how most people know me.

But the more I thought about it, I decided that I needed some way to also be findable as John W. Krupansky.

I browsed through all of the options and settings and found where Jack Krupansky was set as my "real name." Hmmm... real name. I hadn't paid attention before.

While I was thinking about whether to change my "real" name in Facebook to John William Krupansky, I browsed some more and noticed that Facebook also had an optional "Full Alternate Name." I went ahead and entered John William Krupansky as my full alternate name. Done.

Oops... I thought about it for a few more seconds and realized that I had my names backwards. I should have used John William Krupansky as my real name and Jack Krupansky as my full alternate name. That actually makes more sense. Done.

I would be more comfortable with just my middle initial when my name is used in general and then show the full spelling if someone looks at my profile, but Facebook does not give me any such option.

Unfortunately, the entire Facebook UI refers to me as John rather than Jack. Too bad they don't recognize formal and nick names and let you pick whether to default to formal or nick names. Actually, I'd rather have Facebook refer to me as Mr. Krupansky, just to make it clear what a subservient role the software really has. Facebook serves me. Facebook is not my friend.

Now that I have done all of this I realize another issue... findability in Google. My primary interest is professional in nature, so I would prefer that other professionals be able to find me as they know me, which is Jack Krupansky. But, by using John William Krupansky as my Facebook "real" name, my professional name on Facebook is not directly findable. Now I am thinking that I should set my "real" name to Jack Krupansky and my "alternate" name to John William Krupansky. But I'll think about this for more than a few seconds before changing it. Thinking... Done thinking. Changed. So, now my Facebook "real" name is back to Jack Krupansky and my "alternate" name is John William Krupansky. Logically that is backwards, but practically it should work better.

My Facebook profile is here: http://www.facebook.com/jack.krupansky.

Now, I need to go in and make sure I have LinkedIn set in a similar manner, if possible.

Twitter? Now there's a lost cause. Maybe they'll let me set my name properly when they figure out what they want to do in life.

Oh, and while I was at it, I found an Ivan Krupansky over in Slovakia to add as a friend. And he has a friend Jakub Krupansky (with an acute accent over the "y", which I do not know how to enter in an emailed blog post) who I also added as a friend. Whether either of them is even a distant relative is unknown. Do we really have the same last name if one uses a diacritical mark?

Now, I need to think some more about a sensible model for formal and informal names in the Semantic Web. It will be a while before I get to the stage of addressing cultural differences in how names are used. That is all the more reason to strip the textual representations of names out of Semantic Web data and use a URI to reference the person rather than a culturally-dependent textual representation.

I need to take a look at the FOAF (Friend Of A Friend) vocabulary specification to at least use that as a starting reference point for name handling in the Semantic Web. Ditto for the Dublin Core Metadata Element Set. I do not think either will get me very far, but I at least need to cover those bases.

-- Jack Krupansky

Wednesday, July 8, 2009

What's my name? Who am I?

They seem like such simple, obvious questions: What's your name? Who are you? In the "real" world the answers are easy, and online casually they are also easy, but in a hard-core semantic sense, boy are they tough problems. Sure, there is no problem if all you are using a name for is a text label or where the context provides qualifying information, but in a general, abstract sense names and identities are very hard problems.

So, what is my name?

Casually, as you see at the bottom of my blog posts, I am Jack Krupansky. Simple enough.

But... Jack is just my nick name and not suitable for any legal documents. My driver's license and bills and credit cards and financial accounts all have my legal first name, John. So, I am "really" John Krupansky.

Actually, I almost never use John Krupansky. In formal, legal contexts, including my driver's license, bills, voter registration, etc., I always use my middle initial: W. So, legally I refer to myself as John W. Krupansky, with the period.

Actually, my driver's license says: KRUPANSKY, JOHN W, without the period.

And my credit cards say JOHN W KRUPANSKY, also without the period.

Personally, I never abbreviate my first name, but in some contexts my name could also be any of:

  • J. Krupansky
  • J. KRUPANSKY
  • J Krupansky
  • J KRUPANSKY
  • J. W. Krupansky
  • J. W. KRUPANSKY
  • J W Krupansky
  • J W KRUPANSKY
  • Krupansky, J.
  • KRUPANSKY, J.
  • Krupansky, J
  • KRUPANSKY, J
  • Krupansky, J. W.
  • KRUPANSKY, J. W.
  • Krupansky, J W
  • KRUPANSKY, J W

In some contexts, such as publication of a letter or comment, a publisher might abbreviate my last name as:

  • Jack K.
  • John K.
  • John W. K.

Oh, I forgot to mention that my middle W. stands for William. So my birth certificate says John William Krupansky. My passport says:

KRUPANSKY
JOHN WILLIAM

Please note that "J. Krupansky", "J Krupansky" and "J KRUPANSKY" are not necessarily my name. In some contexts the "J" is really an abbreviation for Judge. There are only two examples I know of, but they are (were) real: Judge Robert Brazil Krupansky and Judge Blanche Krupansky. They are not relatives as far as I know. They might be distant relatives, but that is not known.

Did I say that John Krupansky is my name? Well, yes, but it is not only my name. A Web search shows that there are at least two other people who "have" that name, so I cannot technically claim exclusive ownership. There is a John Krupansky from upstate NY or Kentucky and there is a John Joseph Krupansky out there somewhere.

Almost forgot, there was another John Krupansky, even before I was born, a John F. Krupansky or John Frank Krupansky, my grandfather. That may be part of the reason I became known as "Jack". The rest of the reason was that in first grade of elementary school, there were four John's out of 20 kids.

As far as I know, there are no other John W. Krupansky's out there. But, that is not something that we can count on.

You would think that with all of the "intelligence" and horsepower in modern computers that all of these variations could be sorted out with no effort required on our part, but that is not the case. Sure, various pieces of software do have varying degrees of smarts for dealing with names, but the emphasis is on varying.

Each of the various John Krupansky's does indeed have a distinct identity (probably at least social security number, driver's license state and number, and residential address), but automatically mapping from John Krupansky or J Krupansky or Krupansky, J. to each of us is as yet an unsolved problem (in general.)

As far as I know, the Semantic Web and the various Semantic Web technologies as well as the various prototype semantic search engines do not even offer a proposed solution to this problem of mapping an informal textual name reference to a specific identity. In theory, on the Semantic Web there should be a specific concept or URI for each of us Johns or Krupansky, J., for each of our identities. In fact, the situation is so complex that even Google does not offer a name search capability that is able to deal with the simple variations I have detailed here.

Oops, I forgot another variation: back in Europe, there was an accent on the y of Krupansky and you can even use Google to find some of those European Krupansky's. Semantic search needs to be able to handle both the accented and unaccented forms as well as an option for whether to require the accents to match.
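A minimal Python sketch of that matching option, using the standard unicodedata module, might look like the following. The function names and the require_accents flag are hypothetical, and ignoring capitalization is an extra assumption on my part.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks, e.g. turn an accented y into a plain y."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def names_match(a: str, b: str, require_accents: bool = False) -> bool:
    """Compare two names, optionally requiring the accents to match.

    Capitalization is ignored in both modes (an assumption of this sketch).
    """
    # Normalize first so precomposed and decomposed spellings compare equal.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    if not require_accents:
        a, b = strip_accents(a), strip_accents(b)
    return a.casefold() == b.casefold()


# "\u00fd" is the letter y with an acute accent.
accented = "Krupansk\u00fd"
plain = "Krupansky"
```

Here names_match(accented, plain) is true by default, while names_match(accented, plain, require_accents=True) is false.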

The good news, for me personally, is that it does not appear that there is any other Jack Krupansky out there, at least right now.

Oh, and who is Jack Krupanski? Well, it's actually me, but spelled wrong. What computer software knows that?

To some people I am Mr. John Krupansky. Is the Mr. part of my name? Good question.

Almost forgot... there are also people out there who insist that my name is jack krupansky without any capitals. In general, capitalization does not matter, but it can matter when text is being parsed to be indexed and software is attempting to recognize names.

At this stage, I think we need to consider the following for any semantic web:

  1. Ultimately, each person needs to have a unique URI that represents their identity.
  2. That identity needs to include all of the name components, such as first name, middle name, last name, suffix, title, nick name, etc. as attributes.
  3. Each of the various forms of your name needs to have its own URI. That should include misspellings, for example, Jack Krupanski. That also includes variations in titles and suffixes.
  4. There should be RDF for many-to-many mappings between name forms and identities, so that given a name form the possible identities can be examined and given an identity the possible name forms can be examined.
  5. Whether in a UI or an API, given a name form, it should be possible to examine the various name forms that might be equivalent.
  6. Have the concept of preferred name form. But there could be multiple preferred forms, such as nick name vs. legal name.
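
The many-to-many mapping in the list above can be sketched in Python as a pair of indexes. This is only an in-memory illustration rather than RDF. The NameIndex class and the example.org identity URIs are hypothetical, and the name forms are taken from this post.

```python
from collections import defaultdict
from typing import Set


class NameIndex:
    """Two-way, many-to-many index between name forms and identity URIs."""

    def __init__(self) -> None:
        self._ids_by_form = defaultdict(set)
        self._forms_by_id = defaultdict(set)

    def add(self, name_form: str, identity_uri: str) -> None:
        self._ids_by_form[name_form].add(identity_uri)
        self._forms_by_id[identity_uri].add(name_form)

    def identities_for(self, name_form: str) -> Set[str]:
        """All identities that are known to use this name form."""
        return set(self._ids_by_form.get(name_form, ()))

    def name_forms_for(self, identity_uri: str) -> Set[str]:
        """All name forms recorded for this identity."""
        return set(self._forms_by_id.get(identity_uri, ()))

    def equivalent_forms(self, name_form: str) -> Set[str]:
        """Other name forms that share at least one identity with this one."""
        forms = set()
        for identity in self.identities_for(name_form):
            forms |= self._forms_by_id[identity]
        forms.discard(name_form)
        return forms


# Hypothetical identity URIs: the author and one other John Krupansky.
ME = "http://example.org/person/1"
OTHER_JOHN = "http://example.org/person/2"

index = NameIndex()
for form in ("Jack Krupansky", "John W. Krupansky", "John Krupansky", "Jack Krupanski"):
    index.add(form, ME)
index.add("John Krupansky", OTHER_JOHN)
```

Looking up "John Krupansky" then returns both identities, which is exactly the ambiguity a UI or API would have to present for resolution, while the misspelled "Jack Krupanski" leads back to a single identity.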

Back to the headline question, for any legal context I always use John W. Krupansky. But, sometimes, I actually run into a form that does not request a middle initial, so then I am John Krupansky. Even then, legal contexts tend to include one or more of social security number, driver's license state and number, and residential address. Still, it feels odd using a form of name that I know is not unique.

In non-legal contexts, such as random social networking web sites, I almost always use Jack Krupansky. I do the same for business cards as well, although I have thought of switching to using my legal first name on business cards.

My resume has John William Krupansky plus Jack Krupansky and happens to use John W. Krupansky in the copyright notice.

The other answer to the question is that I respond by asking what field format you need my name in (and whether it is for a "legal" context.) Actually, I usually respond with Jack Krupansky and then optionally revise to John if it becomes clear that it is a legal context.

In any case, I am dubious when I run into a single field such as name, author, or creator that doesn't seem to care what form a name is in. That is fine for famous names, but for everybody else it is a recipe for confusion. The solution is to require the identity URI for the person and to have a convenient UI for looking up names.

If it were up to me, I would ban simple text name fields. Or maybe not ban them but require a validation rule that checks for uniqueness and then automatically maps to the true identity.

-- Jack Krupansky

Monday, July 6, 2009

Meaning and the Semantic Web

If we look simply at the term Semantic Web, we assume that it is a web that has something to do with semantics, and semantics essentially is about meaning. I think most (but not all) people can agree with that. The rub comes when we try to figure out what various factions mean by meaning. Some of the common meanings of meaning (semantics):

  • The association of type with data so as to permit a computer to understand what the data means at the level of which type a given piece of data refers to.
  • Denotation of which object is referred to by words or terms, such as in a dictionary.
  • Human-level understanding of the "meaning", potentially (or even usually) subjective, of words, terms, and statements.
  • Human-level "meaning" in a deeper, more personalized sense for an individual, how someone feels about or experiences a concept.
  • Rich knowledge as opposed to mere information or raw data, that permits the reader to infer a much wider range of truth and acceptable behavior.
  • Formal semantics of computer science used to define a domain and the operations permitted over that domain in such a way that is complete, consistent, unambiguous (accurate), and verifiable. Even that begs the question of whether a description of a domain on a computer accurately matches the real world as it exists or as we think we know it.
  • Artificial intelligence (or computational intelligence) applying formal semantics to attempt to approximate human-level understanding.
  • Simple tagging to point from a term (e.g., keyword) to an object to cue a computer program as to the intended "meaning" of a term.
  • Simple textual natural language, even if in simple HTML or simple XML, can embody an incredible range of meaning, although full processing of natural language by non-human entities is still only a partially solved problem.

The question of what "semantic" means in Semantic Web now comes down to the issue of how much and what kind of meaning is embodied in the Semantic Web. Alternatively phrased, is there enough semantic meaning embodied in the so-called Semantic Web to warrant the term "semantic"? Some might contend that the existing conceptualization of the Semantic Web is too weak, while others might assert that all of the complexity of RDF is simply not needed for most contemporary applications that need to work with limited forms of "meaning." In the end (or at the beginning), the folks at W3C made a call and sincerely believed that their concept of the Semantic Web was a close enough match between what they believed was needed and what they believed could be done. Whether their views will hold up over time remains to be seen.

At a primitive, operational level the Semantic Web really is just a Web of data or a Web of Linked Data. The modifier typed is implicit in there, since that is where most of the power comes from. This operational view is not denied, and most agree with that characterization, even if they chafe at or disagree with the term Semantic Web per se.

Others believe that raw XML (and related non-RDF technologies) by itself is more than sufficient to represent and manipulate the lion's share of the kinds of "meaning" that people need today in their applications. Fair enough, as far as it goes. RDF has somewhat grander goals, but many contemporary applications can do just fine with a subset of non-RDF XML-based technologies. But none of that really is a robust argument against RDF enabling a richer form of Semantic Web.

The hard-core computer scientists probably do have a point that the current RDF-based technology stack still isn't quite up to snuff to qualify as a formal semantics, but even that is not a truly robust argument against billing the RDF-based Semantic Web as a major advance in introducing semantics and meaning into the Web of Linked Data. Yes, the computer scientists can reasonably argue that we can and should do better to produce a true semantic web, but once again that is not a great argument to withhold the "semantic" label per se. Sometimes you can make better progress with your known bird in hand than spend too much effort pursuing another bird or two in the bush. Some might claim that alternative approaches are less risky, but such matters can be debated endlessly without resolution. Sometimes it is better to make rapid, informed decisions and run with them rather than to slow progress with an endless stream of second-guessed decisions. Or, who knows, maybe eventually there will be a "Version 2.0" of the Semantic Web which leapfrogs ahead of the current Semantic Web with a more robust sense of formal semantics.

Some of us would really like to see more of a Knowledge Web that goes well beyond merely linking together lots of typed data and it is not clear at all that the current RDF-based Semantic Web technology stack is indeed well-suited for that purpose, but even this is not a valid block to the use of the "semantic" label. One could also argue that a "knowledge" web needs more than "mere" semantics, including pragmatics and full-blown semiotics, but that certainly does not argue for withholding the "semantic" label.

More recently, a lot of the emphasis in the Semantic Web community is on Linked Data, Linking Open Data, and producing and populating a realistic Web of Linked Data. That is all fine and well and good, but once again does not by itself argue against the use of the "semantic" label.

My personal view is that all of these efforts are at heart attempts to increase the emphasis on meaning. Even if any given effort does not meet some impossibly high bar for the meaning of meaning, I do think it is the direction and intention of our efforts that matter. Sure, many of the current efforts focus simply on replicating basic data and information processing capabilities at Web-scale, but ultimately we are trying to get to the original Semantic Web vision of a comprehensive information infrastructure that software agents can use to automate a much broader swath of our manual tasks.

My other view is that the decision was made years ago and does have at least some valid technical and communication value, so we have more to gain by sticking with it than in jumping ship to some other term that may offer some short-term clarity but possibly at the expense of losing focus on the long-term vision.

Meanwhile, "meaning" can be found wherever it is stored, whether in RDF, RSS, XML, HTML, or raw text. Storing that meaning can be rather straightforward, but interpreting it is another story. Simple file structures have obvious advantages, but RDF is designed to reach far enough to give us some real intellectual leverage over non-RDF XML, yet not so far that real applications are impractical today, or at least in the not-too-distant future.

-- Jack Krupansky

Saturday, July 4, 2009

Where is the Semantic Web?

Quite a few people and organizations have been busily slaving away on the development of the Semantic Web for a number of years now, so where exactly is the Semantic Web? Not what stage of development it is at, but where do we go to find it? At a simplistic, operational level the Semantic Web is fragmented and scattered over a significant number of Web servers all around the world. If you know where to look, you can find bits and pieces here and there. The bottom line is that it is still too early in the development of the Semantic Web to think of it as one monolithic (although distributed) "thing" the way we think of the traditional World Wide Web.

In truth, the structure of the Semantic Web is really not a lot different than the existing Web. Both consist of files stored on servers that run Web server software and both are based on hyperlinks from one file to another.

But, if you did not know anything about the content of the current Web, where would you start? There actually isn't a logical answer since there is no master "root" of the Web. Sure, you could consider Google to be the place to start, but how would you even know about Google? And even Google doesn't know everything about the Web, at least not in a form that a user could make any sense of.

Back in the early days of the Web (vintage 1994 or 1995) the "answer" was one of:

  • Your Web browser was pre-configured with a "home" page that had a bunch of links to interesting Web pages.
  • Somebody gave you an explicit URL which you carefully typed into the Web browser address box, or you copied and pasted the URL from an email message.
  • You browsed the Yahoo "directory" of registered web sites, including its "What's New" page.
  • You used the Lycos search engine from Carnegie Mellon University to search for keywords and then browsed through the results to select a web page. Alta Vista, and a number of other search engines came along, and eventually Google joined the fray.
  • Once you "landed" on one Web page you could follow links from that page to a number of other pages. Rinse and repeat and you could quickly navigate "all over" the Web. Or at least it seemed as if you were navigating "everywhere", although in actuality you were viewing only a very tiny portion of the vast Web, even in those early days.
  • Paper trade publications and even the traditional media began to review and highlight Web sites and Web resources. Eventually those publications opened shop online on the Web with the text of those articles and the links to those Web sites and resources could be clicked to quickly navigate.
  • Businesses advertised their Web addresses in magazines, newspapers, TV, and even billboards, as well as business cards and brochures.
  • Gradually, a number of Web portals emerged which endeavored to provide you with dense snapshots of portions of the Web that the authors imagined that you would find useful - news, sports, weather, finance, entertainment, etc.
  • Google introduced (or at least popularized) the concept of ranking search results more highly based on popularity or the number of inbound links for each Web page. This allowed users to find higher quality and more relevant Web pages with far less effort.
  • Web advertising emerged, providing another technique for informing the user of Web pages that they might find of interest.
  • Search engines began "crawling" and indexing ever-larger portions of the total Web, making it more likely that if a Web page existed, then the user could find it if they only had the proper combination of keywords.
  • Web site content developers put an interesting amount of effort into soliciting other Web sites to exchange links to provide more paths to their sites as well as to boost their "Google juice" to get a higher ranking in Google's search results.
  • Search Engine Optimization (SEO) and Search Engine Marketing (SEM) became full-fledged "disciplines" to increase the likelihood that users would "find" targeted Web sites.
  • Web 2.0 emerged with blogs, spaces, and various social media and social networking sites and technologies which enabled mere users and a wide range of professionals to rapidly generate their own content, including links to content that they found interesting.
  • Highly specialized Web sites (including Web 2.0 sites) emerged that catered to advising users what they might find interesting, including TechMeme, TechCrunch, Digg, StumbleUpon, and Twitter.

That's a brief summary of where we are today with the traditional Web in terms of how users can view the available content and answer the question "Where is the Web?" In short, there are plenty of "arrows" pointing users to an interesting subset of the total World Wide Web.

Unfortunately, the Semantic Web does not have this kind of rich support infrastructure, yet.

Sure, you can do a search for "Semantic Web" in Google, but mostly that will get you resources that describe the Semantic Web and its technologies, not pointers to the Semantic Web itself.

There is a foundational question of the extent to which mere users would even want to know anything about the Semantic Web, since it is all about data rather than the presentation that users are used to with the traditional Web. Instead, it is applications and application developers who "need to know" where the Semantic Web data resides. Still, application developers do need a lot of the kinds of tools that are available for traditional Web site developers to find what is available on the Semantic Web that they can use. The fact that the Semantic Web architecture encourages code to be able to discover resources directly only makes the problem more difficult, and more interesting.

Some might assert that the Semantic Web should be completely invisible to users, but they are promoting a view that access to data should be controlled by various gatekeepers. In contrast, the view of open data advocates, such as the Open Data Movement, is that there should be no gatekeepers to prevent access or to enforce selective filtering of the data. Over time, developers will develop better and better tools that will allow even users to manipulate complex data as directly as they desire. We aren't there yet, but the vision is there. Sure, there will still be plenty of need and demand for ever more sophisticated tools for filtering and presenting data, including so-called mashups for combining data from many sources, but the emphasis is still on transparency so that the user can still discern where the data really came from. No matter how finely or richly data is presented, users should always be able to do their own mashups and filtering of data, as they see fit. The bottom line is that users should have direct access to the data of the Semantic Web, and hence that the Semantic Web must be visible. But, Semantic Web data will also in some cases be integrated with traditional Web applications so that users may indirectly "access" the Semantic Web without being aware that the Semantic Web is being accessed or that it even exists at all.

Another model is that the Semantic Web would be more of an on-call phantom, lurking in the background, but always available to be brought to the foreground if and when the user desires. Maybe the user will generally see a more traditional Web page interface, but occasionally drill down to examine the data more closely. For example, a Web page might present a conclusion, but the user may want to see the justification or provenance for that conclusion.

Still, even if the user does occasionally wish to see actual data, in general the Semantic Web should vanish into transparent ubiquity, meaning that it is always there, always everywhere, but generally is effectively invisible. But even if that is the case, users will on occasion still want to know where the data is and how to access and use it.

Eventually, as the Semantic Web does in fact become ubiquitous, it will in fact merge with the traditional Web so that there will once again be only one Web, but there will still be the conception of the Web of data that lies beneath the surface UI and presentation layer.

For now, how do you find out what is available on the Semantic Web? I'll summarize some of the current techniques:

  • Subscribe to various Semantic Web email lists and simply read about Semantic Web resources as they are discussed. In some cases projects are mentioned and you can visit the project web site to find out where the relevant Semantic Web data resources reside.
  • Ditto for trade journals and conference proceedings for the Semantic Web.
  • A friend or colleague emails you a link to Semantic Web data.
  • Using a data browser such as Tabulator, view a Semantic Web data source and then navigate data links much as you navigate links from a traditional Web page.
  • Check out the wiki for the more recent Linking Open Data (LOD) community project. One wiki page lists many of the known Semantic Web Linked Data datasets for the emerging Web of Linked Data. There is a nice bubble diagram that shows the various LOD datasets and their relationships. This represents the best overall view of the Semantic Web, to date.
  • People are beginning to create search engine-like "crawlers" to index the known fragments of the LOD portion of the Semantic Web as caches of the LOD cloud. For example, OpenLink Software provides this cache of the LOD cloud that supports text searches and queries.
  • There are also some experimental semantic web search engines such as Swoogle.
  • Various semantic databases, such as Freebase are beginning to provide Linked Data interfaces.
  • Vendors are beginning to promote Semantic Web data that they are beginning to provide, either as RDF files or as so-called SPARQL endpoints.
  • Some vendors are providing access to their underlying relational databases, once again in the form of SPARQL endpoints.
  • With Linked Data, once you access one element of data you will generally have the opportunity to navigate to other, linked data, much as you would navigate the traditional Web by following links.
  • RDFa permits the embedding of Semantic Web data within HTML Web pages, so that the traditional Web and the Semantic Web can in at least some situations be co-located.
  • Google and Yahoo are in the early stages of experimenting with Semantic Web technologies, so we can expect that users will eventually be able to "find" interesting portions of the Semantic Web directly from our traditional search engines.
  • Plug-ins for traditional Web browsers are available or under development or in the research stage so that users will eventually be able to "see" the Semantic Web directly from the Web browser.
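
The link-following idea in the list above can be sketched in a few lines of Python. This is only a toy illustration: the `FAKE_WEB` dictionary, the `example` URIs, and the `crawl` function are all made up for the sketch, and a real data browser or crawler would fetch each URI over HTTP and parse the RDF it returns rather than read from a dictionary.

```python
from collections import deque

# Hypothetical in-memory stand-in for the Web: each URI maps to the triples
# a server might return when that URI is looked up. All URIs are made up.
FAKE_WEB = {
    "http://a.example/alice": [
        ("http://a.example/alice", "http://vocab.example/knows", "http://b.example/bob"),
        ("http://a.example/alice", "http://vocab.example/name", "Alice"),
    ],
    "http://b.example/bob": [
        ("http://b.example/bob", "http://vocab.example/livesIn", "http://c.example/paris"),
    ],
    "http://c.example/paris": [],
}

def crawl(start, max_resources=10):
    """Breadth-first traversal that follows data links starting from one URI."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue and len(order) < max_resources:
        uri = queue.popleft()
        order.append(uri)
        for _subject, _predicate, obj in FAKE_WEB.get(uri, []):
            # Only objects that are themselves URIs can be followed;
            # plain literals such as "Alice" are dead ends.
            if obj.startswith("http://") and obj not in seen:
                seen.add(obj)
                queue.append(obj)
    return order

visited = crawl("http://a.example/alice")
for uri in visited:
    print(uri)
```

Starting from a single resource, the traversal reaches data held by two other (hypothetical) sites, which is the same "land on one page and follow links" experience described earlier for the traditional Web.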

That's what I have discovered so far and my search is only in the early stages. I am sure there are additional resources (about resources) that I have not yet discovered, and the "industry" is still in the early development stages, maybe comparable to the Web in, say, 1994, before Yahoo appeared on the scene and helped promote a user-friendly approach to promoting Web resources.

Some loose ends:

  • How does non-RDF XML-based data relate to the Semantic Web and Linked Data (Linking of Open Data)?
  • How do RSS feeds relate to the Semantic Web? RSS feeds are problematic in at least one sense: they are frequently only a severe subset of the available data, so they certainly do not provide full access to the underlying data.
  • Data in online text files and non-W3C data formats, including CSV and spreadsheet files that users can directly access from the Web. Some sort of automated translation or "adaptor module" approach is needed so that such data can be accessed as if it were in a Semantic Web format.
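
As a rough idea of what such an "adaptor module" might look like for the CSV case, here is a minimal Python sketch. The `BASE` and `VOCAB` namespaces, the function names, and the sample rows are all invented for illustration; a real adaptor would need a deliberate mapping from column names to an agreed vocabulary, plus proper escaping and datatyping of values.

```python
import csv
import io

# Hypothetical namespaces; a real adaptor would use the publisher's own URIs.
BASE = "http://example.org/item/"
VOCAB = "http://example.org/vocab#"

def csv_to_triples(text, key_column):
    """Turn each CSV row into (subject, predicate, object) triples.

    The value in key_column names the subject; every other non-empty
    column becomes one triple with the column name as the predicate.
    """
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = BASE + row[key_column]
        for column, value in row.items():
            if column != key_column and value:
                triples.append((subject, VOCAB + column, value))
    return triples

def to_ntriples(triple):
    """Render one triple in an N-Triples-like line (no escaping, literals only)."""
    s, p, o = triple
    return f'<{s}> <{p}> "{o}" .'

# Made-up sample data, purely for illustration.
sample = "id,country,population\nalpha,AA,100\nbeta,BB,200\n"
triples = csv_to_triples(sample, "id")
for t in triples:
    print(to_ntriples(t))
```

The point of the sketch is only that the translation can be mechanical; the hard part, choosing URIs and vocabulary terms that other datasets can link to, is not something the code can decide.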

Maybe one over-simplistic answer to my question is that the Semantic Web is spread all over the place, but you just need to know where and how to look for it.

-- Jack Krupansky

Thursday, July 2, 2009

What is the LOD cloud?

LOD is an acronym for Linking Open Data, although sometimes it is less correctly referred to as Linked Open Data. A set of principles for Linked Data was espoused by Tim Berners-Lee. Unlike the traditional Web which consists of hyperlinked HTML Web pages, Linked Data consists of hyperlinked data in the form of RDF triples. Technically, cloud usually refers to a network of servers, but sometimes it is used to refer to interconnected data, essentially a synonym for Web. The LOD cloud, or Linking Open Data cloud, is the current totality of the interconnected data produced by the Open Data Movement in the form of the W3C SWEO Linking Open Data community project. SWEO refers to the W3C Semantic Web Education and Outreach Interest Group.

The LOD cloud is essentially the rudimentary beginning of the Web of Data or Semantic Web as envisioned by Tim Berners-Lee.

The LOD cloud is also referred to as a data commons. The intention is that all of the data in the LOD cloud is open and freely available. Usually "open" will mean that at a minimum the data is accessible. It is also usually expected that the data can be freely copied, but that may not always be the case, depending on the license for a particular subset (data set) for a particular source or supplier. The ultimate sense of openness is that the data may be freely edited by users, but that frequently is not the case, especially for proprietary data or data from a government agency which controls the data. It may be more appropriate to refer to the entire cloud as the Linked Data cloud, and the more open subset as the LOD cloud (Linked Open Data cloud), but such fine distinctions are generally not drawn at present.

The LOD cloud is sometimes referred to as the LOD data cloud, but clearly that is redundant (Linking Open Data data cloud.)

The term LOD dataset (or LOD data set) is sometimes used to refer to a well-defined subset of the LOD cloud, such as the data for a specific application or domain or data source. The term LOD datasets (or LOD data sets) refers to some collection of specific LOD data sets, or possibly even all of the datasets in the cloud. The SWEO wiki maintains a list of the known data sets in the LOD cloud.

The concept of data being published to the LOD cloud or the Web is also known as Linked Data on the Web (LDOW) or sometimes even the Semantic Web on the Web.

Another term sometimes used for the LOD cloud is Web of Linked Data.

The term Linked Data Cloud is also sometimes used for the LOD cloud.

The term Linking Open Data on the Semantic Web is sometimes used to refer to the LOD cloud.

Sometimes the term LOD cloud is simply used to refer to the "cloud" diagram or bubble diagram that shows all of the known data sets in the LOD cloud.

For almost all intents and purposes, the LOD cloud is the Semantic Web or the Web of Data.

-- Jack Krupansky

Wednesday, July 1, 2009

Linked Data - link instance data, not just metadata

One clarification I forgot to emphasize clearly enough in my recent post on Linked Data is that the real goal of Linked Data is for a given Semantic Web application to link to instance data of other Semantic Web applications, not merely to reuse existing metadata. The goal is to aid discovery of other things by users (and their agents). Reuse of metadata such as vocabularies and schemas is a really good idea, but not sufficient to connect things into a Semantic Web.

An unfortunate side effect of using the single concept of a URI to refer to both data and metadata is that the emphasis on linking gets diffused onto both usages.

In summary, reuse of existing vocabularies and other metadata from other Semantic Web applications is good, but linking to instance data from other Semantic Web applications is what Linked Data is really trying to get at.
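
To make the distinction concrete, here is a small Python sketch that separates the two kinds of reference in a handful of triples. Everything in it is hypothetical: the hosts `app.example`, `vocab.example`, and `other.example`, the helper names, and the simplifying assumption that "belongs to another application" can be judged by comparing host names.

```python
from urllib.parse import urlsplit

# Hypothetical triples published by an application hosted at app.example.
LOCAL_HOST = "app.example"
triples = [
    # Vocabulary reuse only: the predicate comes from a shared vocabulary,
    # but the object is a plain literal.
    ("http://app.example/book/1", "http://vocab.example/terms#title", "Example Title"),
    # Instance link: the object is a resource in another application's data.
    ("http://app.example/book/1", "http://vocab.example/terms#author",
     "http://other.example/person/42"),
]

def host(term):
    """Return the host of a URI, or None for a literal value."""
    return urlsplit(term).netloc if term.startswith("http") else None

def external_instance_links(triples):
    """Triples whose object is a resource hosted by some other application."""
    return [t for t in triples if host(t[2]) not in (None, LOCAL_HOST)]

def reused_vocabularies(triples):
    """Hosts of predicates (metadata) borrowed from outside the application."""
    return sorted({host(t[1]) for t in triples if host(t[1]) != LOCAL_HOST})

print("reused vocabularies:", reused_vocabularies(triples))
print("instance links:", external_instance_links(triples))
```

Both triples reuse outside metadata, but only the second gives a user or agent somewhere new to go, which is the kind of link that Linked Data is really after.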

-- Jack Krupansky

Linked Data, Web of Data, and the Semantic Web

I have been wanting to write a post on the relationship of Linked Data and the Web of Data to the Semantic Web, but even now I am still struggling to get a secure handle on the distinctions between these three related concepts. Meanwhile, I stumbled across a relevant blog post by Tom Heath on the topic entitled "Linked Data? Web of Data? Semantic Web? WTF?" It's difficult to get a hard-core representative summary, but a semi-reasonable approximation is:

... in common usage Linked Data refers to the principles set out by Tim Berners-Lee in 2006.

So if we link data together using Web technologies, and according to these principles, the result is a Web of data. Personally I use the term Web of data largely interchangeably with the term Semantic Web, although not everyone in the Semantic Web world would agree with this. The precise term I use depends on the audience. With Semantic Web geeks I say Semantic Web, with others I tend to say Web of data -- it's not about rebranding, it's about using terms that make sense to your audience, and Web of data speaks to people much more clearly than Semantic Web. Similarly, Linked Data isn't about rebranding the Semantic Web, it's about clarifying its fundamentals.

Tim Berners-Lee said several times last year, in public, that "Linked Data is the Semantic Web done right" (e.g. see these slides from Linked Data Planet in New York), and who am I to argue, it's his vision.

I am still not prepared to write the definitive post on this topic, but here is the gist of my current research:

  1. W3C offers a number of Semantic Web technologies, including XML, XML Schema, RDF, RDFS, RDFa, OWL, SPARQL, XSLT, and others.
  2. The Semantic Web is the vision of the World Wide Web that utilizes the Semantic Web Technologies, particularly RDF as its core.
  3. Any application can utilize any one or more of the Semantic Web technologies.
  4. Mere use of Semantic Web technologies does not by itself indicate that the application is a Semantic Web application.
  5. A Semantic Web application is first and foremost a Web application, typically accessible on the World Wide Web, that utilizes Semantic Web technologies, and specifically uses RDF (or RDFa) for making statements about (Semantic) Web resources.
  6. A Semantic Web application might typically include a more traditional Web application (e.g., HTML) combined with underlying Semantic Web resources.
  7. Web of Data is simply a casual synonym for the Semantic Web that emphasizes that like the original, non-Semantic Web, the Semantic Web consists of an interconnected Web of resources, but they are data resources described at their core using RDF (or RDFa) rather than merely presentation resources (HTML web pages.)
  8. Linked Data is not introducing any new technologies, but is simply a collection of principles that emphasize that the Semantic Web (or Web of Data) has much greater utility to its users when data resources tend to refer to other, somewhat related data resources that may not necessarily be directly required by the local Semantic Web application.
  9. Put simply, Linked Data enables the user (or computational user agent) to navigate between Semantic Web applications (data resources).
  10. Even a proprietary application that uses Semantic Web technologies may also utilize resources (e.g., vocabularies or schemas) from elsewhere in the Semantic Web, but the real test of whether an application is a true Semantic Web application is whether other applications in turn reference it. It is this expanding chain of referencing to produce an ever-expanding and ever more-heavily interconnected Web that gives the Semantic Web its true "webbiness", not the mere use of the underlying Semantic Web technologies by themselves.

That model is not entirely accurate, but I think it's a good start. I need to include mention of HTTP and URIs; they are not unique to the Semantic Web, but are essential.

-- Jack Krupansky

Thursday, June 11, 2009

What is time?

Time is a critical aspect of many forms of knowledge. So, what is time? At a simplistic level, time is simply an ordering of moments at which events occur or could occur. Given any two events or moments, they either occur simultaneously, or one before the other and one after the other. Events can also partially overlap, so we technically should speak of the sub-events of when each event begins and when they end. For the purpose of this discussion it is those indivisible sub-events that are of significance.

Traditionally, we "measure" time by choosing some periodic and regular event and simply counting the occurrences of that event.

This raises the question of what an "event" is. At a simplistic level, an event is motion or propagation that is detectable. Put another way, an event is a detectable change in position or appearance. Motion includes chemical reactions, biological processes, and physical state changes. At the microscopic level motion may simply be large numbers of particles (molecules, atoms, subatomic particles) that move so that a macroscopic change is detectable. We need to use the term detectable in the theoretical sense of whether something could be detected rather than the practical sense of whether we do in fact have usable tools to accomplish the detection. Time is the fact that the change occurred (an event), separate from our observation of the change (the event). Propagation includes a field or force or "wave" of some sort that travels some distance. At a human scale we depend on propagation to sense that motion has occurred, but in those cases it is the change in motion that signifies time rather than the role of propagation in aiding us in sensing the change in motion.

Rather than using transient events, we can also look for changes in conditions and look at the interval of time over which that condition is true. Ultimately, that is the same as looking at the pair of events, one of which corresponds to the start of the condition and the other of which corresponds to the end of the condition. Obviously events of some other form must be transpiring during that interval in order for us to judge that the interval of time is occurring.

So, we use time to measure and specify when an event occurred (or a condition changed) as well as the length of the interval between two events or the duration of a condition.
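
The ordering described so far can be written down as a small Python sketch. It is only one possible encoding: the `Interval` class, the `relation` function, and the choice of arbitrary numeric "ticks" are assumptions made for the example, and intervals that merely touch at an endpoint are lumped in with "overlapping" for simplicity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    """A condition or extended event, bounded by its two sub-events."""
    start: float  # moment of the "begin" sub-event, in arbitrary ticks
    end: float    # moment of the "end" sub-event, in the same ticks

    def duration(self):
        return self.end - self.start

def relation(a, b):
    """Classify how interval a is ordered relative to interval b."""
    if a.end < b.start:
        return "before"
    if b.end < a.start:
        return "after"
    if a == b:
        return "simultaneous"
    # Anything else shares at least one moment, including touching endpoints.
    return "overlapping"

breakfast = Interval(7, 8)
commute = Interval(7.5, 9)
meeting = Interval(10, 11)

print(relation(breakfast, meeting))    # before
print(relation(meeting, breakfast))    # after
print(relation(breakfast, commute))    # overlapping
print(relation(breakfast, breakfast))  # simultaneous
print(commute.duration())              # 1.5
```

Note that the comparison works entirely on the begin and end sub-events, which is why those indivisible sub-events, rather than the extended events themselves, are the things being ordered.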

Now, what about the gaps between the occurrences of events, does time exist and "pass" (or "flow") in those gaps? If no detectable physical process can occur during the gap, I would say that time does not exist for the duration of the gap. The trivial answer is that since we can not detect the presence of the gap (i.e., two events seem to occur one right after the other), then there actually isn't any gap.

What is the smallest motion or propagation that is detectable? Don't know. Is that the true unit of time? Maybe. Since we do not know that unit, we traditionally use some larger unit, typically the second, and refer to fractions of that unit. Conceptually, we can divide that unit ever finer so that we are referring to intervals which are smaller than the smallest intervals of motion and propagation that theoretically could be detectable, or at least have been detected so far. Conceptually, we can do the math to refer to hypothetical moments of time which never existed as detectable events. But from a practical, human perspective, that is not much of an issue.

Humans also have their minds and even without observing events in the outside world can envision the passage of time. One's breathing or beating heart or blinking eyes can provide the periodic events needed to detect and measure the passage of time. Mental state changes (how fast can you think or count?) can also supply the events needed to detect the passage of time. At this human level, there is probably a cognitive "unit" of time which is the smallest interval that a typical human mind can detect. There is probably a smaller unit which would be the smallest unit of chemical and electrical and biological activity that a device could detect, but that is no longer the "human" unit of time.

At a macroscopic level, the motion of the sun, sunrise and sunset, movement and phases of the moon and stars, turns of seasons, movements of migratory animals, life cycle changes, sun spot cycles, comets, and other natural phenomena provide periodic event sequences needed to detect and measure the passage of time.

We can also measure time in a social sense in terms of ages and eras, times when particular technologies or values or modes of behavior are prevalent. Once again, events and changes are used to "measure" the rise and fall of a socially significant period of time.

And, we can also similarly measure the natural world in terms of geological and biological events and conditions that change.

Then there is time at the cosmic scale, with the birth, evolution, and death of stars, galaxies, and other cosmic structures. Millions and billions of years.

And we also have time at the sub-atomic level that is of concern to physicists, but that is a rather distinct discussion.

Ultimately, we measure all of this on a single, combined scale ranging over:

  • Fractions of seconds, down to a billionth or even a billionth of a billionth of a second
  • Seconds
  • Minutes
  • Hours
  • Days
  • Weeks
  • Months
  • Years
  • Decades
  • Centuries
  • Millennia
  • Millions of years
  • Billions of years

A future topic is: What does time mean? Does time itself have any meaning, apart from the meaning of events that we are measuring? Or, is time itself inherently, and almost by definition meaningless? Maybe not. We can certainly refer to time as an "object of discourse", which would of course have meaning. But, time itself, distinct from any events that might be occurring would seem rather meaningless. On the other hand, if we define time in terms of events occurring, maybe time is always implicitly linked to some meaning, in particular the meaning of the events that mark time. We might even go so far as to suggest that time carries meaning (i.e., time is the carrier of meaning), since without the medium of time, the events could not be transpiring and having their meaning.

-- Jack Krupansky

Wednesday, June 10, 2009

How do people relate to knowledge?

It is a rather simple but interesting question:

How do people relate to knowledge?

I do not really have a great answer right now, but it does seem that it cuts to the heart of how we represent knowledge in a way that people can make sense out of.

On the other hand, maybe the real answer is that intelligent software agents are the "way" that people "relate" to knowledge. Even so, people need a conceptual model of what and how they communicate with those software agents since the agents are merely an indirect mechanism to access the underlying knowledge. How much of that underlying structure can the agents reasonably hide?

Maybe there is a clue from the land of databases where you have an overall schema, a storage schema for the underlying data, and multiple views each of which is specialized to the needs of a distinct audience.
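
As a reminder of what that database pattern looks like in practice, here is a minimal sketch using Python's built-in `sqlite3` module. The table, the view names, and the rows are all invented for the example; the only point is that two audiences query the same stored facts through different views.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One underlying store of facts (hypothetical schema and data).
    CREATE TABLE fact (subject TEXT, property TEXT, value TEXT, source TEXT);
    INSERT INTO fact VALUES
        ('widget', 'price', '10', 'catalog'),
        ('widget', 'supplier', 'Acme', 'procurement');

    -- A simplified view for one audience: prices only, no provenance.
    CREATE VIEW shopper_view AS
        SELECT subject, property, value FROM fact WHERE property = 'price';

    -- A fuller view for another audience: everything, including the source.
    CREATE VIEW auditor_view AS
        SELECT subject, property, value, source FROM fact;
""")

shopper_rows = conn.execute("SELECT * FROM shopper_view").fetchall()
auditor_rows = conn.execute("SELECT * FROM auditor_view").fetchall()
print(shopper_rows)
print(auditor_rows)
```

Whether anything like this carries over to knowledge, as opposed to plain records, is exactly the open question; the sketch only shows the database side of the analogy.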

I am not convinced that is the answer, but maybe it is a clue to build on. Certainly the part about multiple audiences with different interests, needs, and abilities has got to be part of the solution.

-- Jack Krupansky

Monday, June 8, 2009

Concepts related to time

Granted, physicists have a very specialized conception of time, but I wanted to ponder the various roles that time plays in our own lives. Later it would be interesting to see how the physicists would weigh in on these social conceptions.

Some terms and concepts related to time in our lives:

  • Clocks - time within a day
  • Calendars - time across days, weeks, months, seasons, years
  • Measurement of time
  • Timekeeping device
  • Beginning
  • Ending
  • Expiration
  • Time remaining
  • Duration
  • Interval
  • Window of time to constrain activity
  • Time's up
  • Era
  • Eon
  • Period
  • Period of a cluster of related events
  • Period of shared values or common social characteristics
  • History
  • The past
  • The future
  • The here and now
  • Right now
  • Present 
  • Sometime
  • Sometimes
  • At times
  • All the time
  • All of time
  • Same time
  • Waiting
  • Scheduling
  • Periodic
  • Intervals
  • Continuity, continuous
  • Continual
  • Order of events
  • Before
  • After
  • Little time left
  • Synchronization
  • Recording, logging
  • Distance
  • Age
  • Too short to notice
  • Too long to notice
  • Too short to measure
  • Too long to measure
  • Next
  • Previous
  • Dependency on time
  • Addiction to time
  • Clock watching
  • Value of time
  • Monetary value of time
  • Relative value of time (personal)
  • Loss of time
  • Gaining time
  • Passage of time
  • Time is your friend
  • Time is your enemy
  • Process
  • Timing
  • Occasional
  • Special days
  • Season
  • Alignment with physical phenomena
  • Elapse of physical processes (incl. chemical, biological, and geological)
  • Forecast, prediction
  • Estimate
  • Time in fiction: time travel, time warps
  • Time in physics
  • Time as a dimension, a coordinate
  • Mental dissipation
  • Decay
  • Entropy
  • Birth
  • Growth
  • Aging
  • Death
  • Frequency
  • When
  • Impact of time
  • Wasting time
  • Saving time
  • Time delay
  • Time sharing
  • Cycles
  • Repetitions
  • Unit of time
  • Representation of time
  • Telling time
  • Communication of time
  • Timeless
  • Structure of music
  • Local time vs. universal time
  • Time zones
  • Daylight saving time
  • Father Time
  • Fatalism
  • Position in time
  • Moment
  • Verb tenses (past, present, future)
  • Passage of time
  • Passes time
  • Simultaneous
  • Instantaneous
  • Time composed of moments
  • Timeline

Some questions:

  1. Does space-time ever manifest itself in our daily lives, or do we always experience time and space as distinct dimensions?
  2. Can we define a complete ontology for time that does not depend on an ontology for space?
  3. Does time have a nature?
  4. Is time a domain?
  5. Can time ever be considered an entity? Other than the fact that a concept can be considered an entity.
  6. Can time be detected other than due to motion of objects or emission of energy?
  7. Is time in constant movement?
  8. Can time ever be frozen temporarily or permanently?
  9. Can time itself actually be observed?
  10. Can time itself actually have a rate?
  11. Can time itself cause anything?
  12. Can anything cause time itself?
  13. Whether or not spatial objects can travel in time, can time itself travel?
  14. Does time have only quantity but not quality?
  15. Is there a quantum of time?
  16. Is there a theoretical time distinct from observable time?
  17. Can time be discontinuous?
  18. Does time branch?
  19. Are there parallel timelines?
  20. If there are parallel universes, is time parallel or independent in each universe?

-- Jack Krupansky

Friday, May 29, 2009

Is good enough the enemy of vaguely better?

There is no question that with Semantic Web technologies we could produce a "better" Knowledge Web. The open question is how much better it would be. If a real user were to query a Semantic Web database how much better would the query results really be? The answer is unknown because it would all depend on the nature of the query processing infrastructure, the forms of inference and "reasoning" that are implemented, and how the database is structured. All of those aspects are continuing to evolve and as of today none of that infrastructure is in place in a form to perform queries comparable to even a simple Google query. Sure, we believe the results would be better, but that's about the strongest statement we can make today.

Just this morning I was following a discussion on a Semantic Web email list and David Huynh made the statement:

It's a case of "good enough is the enemy of vaguely better", unfortunately.

So, yes, we know that queries to the Semantic Web will be better, but our degree of specificity in how much better is clearly vague. In the face of a radically different approach that is completely unproven, it is not uncommon that "good enough" wins by default.

Interestingly, Microsoft's new Bing "decision engine" might have to deal with this same issue. Even if their results for many queries are actually "better", the question is whether overall their results are so clearly better that Google's "good enough" will not carry the day. A big unknown, but that is the nature of introducing innovation.

I myself often use a similar metaphor for the greater success of Microsoft-based PCs compared to Apple Macs: Maybe the Mac is better, but if the PC is "good enough" what does it matter if the Mac is somewhat "better"?

The real goal of a true Knowledge Web is that intelligent agents can do a lot more of our tasks for us. The risk today is that even if we succeeded in building such a knowledge web, the actual and perceived benefits relative to the costs and radical shift in mindset required to use such a web and agents might result in a similarly vague relative benefit in comparison with existing "good enough" approaches.

-- Jack Krupansky

Wednesday, May 27, 2009

What is the difference between a URI and a URL?

Anybody who has browsed the Web knows that a URL is the web address of a web page on a web site. Meanwhile, the Semantic Web is based on the URI. So, what is a URI, and how are they different? The short answer is that all URLs are by definition URIs and in the context of the Semantic Web the preferred term is URI.

Part of the answer is historic: URL (Uniform Resource Locator) is the original term for a web address, the location of a web resource or web page on a web site, but technically we should be using the newer term URI (Uniform Resource Identifier.)

Going further back in history, at one stage URI meant Universal Resource Identifier, but that usage has been superseded by Uniform Resource Identifier.

There is a little bit more to it. While all URLs are in fact URIs, some URIs are not URLs, in particular the subset known as URNs, or Uniform Resource Names. An example of a URN might be the ISBN for a book, such as "URN:ISBN:0-062-51587-X".

So:

  • A URI is either a URL or a URN.
  • Every URL is a URI.
  • Every URN is a URI.
  • A URN is never a URL.
  • A URL is never a URN.
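
The rules above can be sketched in a few lines of Python. This is only an illustration of the simplified URL-versus-URN split described in this post: the function name classify_uri is hypothetical, and treating every scheme other than "urn" as a URL is an assumption made for the sketch, not a rule taken from any specification.

```python
from urllib.parse import urlparse


def classify_uri(uri: str) -> str:
    """Classify a URI as 'URN', 'URL', or 'unknown' using only its scheme.

    Assumption for this sketch: any scheme other than 'urn' counts as a URL.
    """
    # urlparse lowercases the scheme, so "URN:" and "urn:" are treated alike.
    scheme = urlparse(uri).scheme
    if scheme == "urn":
        return "URN"
    if scheme:
        return "URL"
    # No scheme at all: not an absolute URI, so it cannot be classified.
    return "unknown"


print(classify_uri("URN:ISBN:0-062-51587-X"))        # URN
print(classify_uri("http://example.com/page.html"))  # URL
```

Both results are URIs; the function only reports which of the two kinds it is looking at.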

For HTML Web pages, it still makes sense to refer to the URL of a web page, even though URI is now the technically more precise term, since an HTML Web page URI is in fact always a URL (and vice versa).

For RDF statements, the subject, predicate, and objects of an RDF triple are by definition referred to as URIs. They may at times in fact be URLs and refer to resources which are files on Web servers, but that is not required in all cases.

If you really want to get technical, there is a discussion in IETF RFC 3305 entitled "Report from the Joint W3C/IETF URI Planning Interest Group: Uniform Resource Identifiers (URIs), URLs, and Uniform Resource Names (URNs): Clarifications and Recommendations".

-- Jack Krupansky

Tuesday, May 26, 2009

Semantic Drift

Semantic drift refers to the change in the meaning of a term or concept over time to the members of a community.

Obviously, it would be advantageous if the meaning of a term or concept did not vary over time, but reality is a force to be reckoned with.

The semantics of a term or concept can change because:

  • Changes in the real world, including people, technology, and the physical world, require updating of the meanings of terms and concepts.
  • What was considered important may no longer be considered important.
  • What was considered unimportant may no longer be considered unimportant.
  • New members of the community may have different values and requirements and need or choose to de-emphasize some aspects of the existing meaning and emphasize or add new aspects.
  • Existing members of the community may drop out and their influence on the importance of various aspects of the meanings of terms and concepts may wane. Some terms may become more strict, others looser.
  • Emergence of new or significantly different domains may borrow or modify existing semantic meanings of terms.
  • Communities can split or splinter and the new sub-communities could diverge in their interests and emphasis on the essential meanings of terms and concepts.
  • Communities can merge or overlap so that disjoint collections of terms and concepts will need to be merged and conflicting meanings for the same syntactic terms need to be resolved.
  • Bugs or other deficiencies may be discovered and "fixed."

Strictly speaking, some categories of semantic mapping are not semantic drift, but may still be informally considered as such:

  • Distinct communities may have distinct meanings for superficially identical terms or even concepts. Bridging between communities is needed.
  • Proprietary communities within a single industry or interest area may contrive meanings of their own invention for what appear to be superficially identical terms or even concepts. Standards are needed.
  • Personal and place names, especially in distinct geographic areas.

The whole point is that we need a semantic infrastructure which acknowledges and helps us cope with semantic drift and all other forms of semantic mapping.

-- Jack Krupansky

Monday, May 25, 2009

Tim Berners-Lee's dream for the Web

Just for future reference, I have reproduced here Tim Berners-Lee's brief statement of his dream for the Web, including the Semantic Web and intelligent agents, from his book Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web. He starts out Chapter 12, Mind to Mind, by saying:

I have a dream for the Web... and it has two parts.

In the first part, the Web becomes a much more powerful means for collaboration between people. I have always imagined the information space as something to which everyone has immediate and intuitive access, and not just to browse, but to create. The initial WorldWideWeb program opened with an almost blank page, ready for the jottings of the user. Robert Cailliau and I had a great time with it, not because we were looking for a lot of stuff, but because we were writing and sharing our ideas. Furthermore, the dream of people-to-people communication through shared knowledge must be possible for groups of all sizes, interacting electronically with as much ease as they do now in person.

In the second part of the dream, collaborations extend to computers. Machines become capable of analyzing all the data on the Web -- the content, links, and transactions between people and computers. A "Semantic Web," which should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy, and our daily lives will be handled by machines talking to machines, leaving humans to provide the inspiration and intuition. The intelligent "agents" people have touted for ages will finally materialize. This machine-understandable Web will come about through the implementation of a series of technical advances and social agreements that are now beginning (and which I describe in the next chapter).

Once the two-part dream is reached, the Web will be a place where the whim of a human being and the reasoning of a machine coexist in an ideal, powerful mixture.

Realizing the dream will require a lot of nitty-gritty work. The Web is far from "done." It is in only a jumbled state of construction, and no matter how grand the dream, it has to be engineered piece by piece, with many of the pieces far from glamorous.

In short, he envisioned a Semantic Web of machines talking to machines comprising a machine-understandable Web.

It is also important to recognize that the Semantic Web is part of the overall Web dream and is not intended to be completely separate from part one of the Web.

Even eight years later, the book is just as relevant to the Web of today, and the future.

-- Jack Krupansky

Conceptual distance

One of the big canyons in the Semantic Abyss is how to compare concepts and sense their similarity or differences as well as their relations to other concepts. Sometimes a user can be laser-precise as to what concept is desired, but even then the user may not be aware that other concepts may be quite similar or related in some way. Sometimes it is desirable to treat very similar concepts as virtually identical, while other times it may be desirable merely to offer the user alternatives that might meet the desired objective. In any case, the starting point is to quantify the conceptual distance between concepts. As might be expected, that is likely to be much easier said than done.

Much of the existing research relates to determining the conceptual distance of a document from query terms, also known as document relevance. Here, the objective is to compare the terms or concepts themselves to determine how close they are and which are closest.

It is not clear if any absolute conceptual distance can be determined. Usually, a relative conceptual distance for a set of concepts is all that is needed, or maybe all that is possible.

Some of the reasons for comparing conceptual distances are to determine:

  • similarity
  • relatedness
  • equivalence
  • equality (say, in a social sense)
  • same as
  • comparable
  • synonym

It may be true that any given application or even a given user of an application may have different criteria for how close the conceptual distance must be to satisfy their needs. Control over the looseness or tightness of the fit is probably also desirable.
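
As a minimal sketch of what a relative conceptual distance with an adjustable "tightness" might look like, the Python below counts the steps between two concepts in a tiny hand-made taxonomy. Everything here is an assumption for illustration: the PARENT table, the function names, and the choice of path length as the measure. Real concept matching would need far richer models than a single parent hierarchy.

```python
# Hypothetical toy taxonomy: each concept maps to its parent (None at the root).
PARENT = {
    "thing": None,
    "vehicle": "thing",
    "car": "vehicle",
    "truck": "vehicle",
    "fruit": "thing",
    "apple": "fruit",
}


def ancestors(concept):
    """Return the chain from a concept up to the root, starting with the concept itself."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def distance(a, b):
    """Count the edges on the path between two concepts via their nearest common ancestor."""
    up_a = ancestors(a)
    up_b = ancestors(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    raise ValueError("concepts share no common ancestor")


def close_enough(a, b, threshold):
    """Let the application or user decide how tight the fit must be."""
    return distance(a, b) <= threshold


print(distance("car", "truck"))         # 2
print(distance("car", "apple"))         # 4
print(close_enough("car", "truck", 2))  # True
```

The numbers are only meaningful relative to each other within this one taxonomy, which matches the point that a relative distance may be all that is needed, or all that is possible.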

A big challenge of the Semantic Web is that different developers and communities have different conceptions of the meanings of concepts. Sometimes seemingly different terms are used to refer to what are logically similar or even identical concepts. This means that we need a sophisticated level of concept matching that can transparently handle the bridging of superficial semantic gaps, as well as alert the user where semantic gaps exist that cannot be automatically bridged, so that the user can manually accept them as if they were automatically bridged.

Another problem is that superficially identical concepts may in fact be quite distinct at a deeper semantic level so that the concept matching should reject them as matches. In the alternative, the user can be alerted to these false concept matches and maybe redefine a new set of concepts to effectively bridge the perceived semantic gaps so that matching is more semantically correct.

In any case, the ability of the software to give the user excellent feedback on conceptual distance is a very important tool.

-- Jack Krupansky

Thursday, May 14, 2009

Mereology - the study of the relations between integral objects and portions of stuff

I was reading a post by Steffen Staab on the Semantic Web email list and ran across a link to a paper on Mereology, which is basically the study of the relations between complete or integral objects and the component parts that comprise the whole object as well as the relations between the parts themselves.

The Wikipedia article on Mereology tells us:

In philosophy, mereology (from the Greek μερος meros part and the ending -logy study, discussion, science) is a collection of axiomatic first-order theories dealing with parts and their respective wholes. In contrast to set theory, which takes the set-member relationship as fundamental, the core notion of mereology is the part-whole relationship. Mereology is both an application of predicate logic and a branch of formal ontology.

The Stanford Encyclopedia of Philosophy article on Mereology tells us:

Mereology (from the Greek μερος, 'part') is the theory of parthood relations: of the relations of part to whole and the relations of part to part within a whole. Its roots can be traced back to the early days of philosophy, beginning with the Presocratics and continuing throughout the writings of Plato (especially the Parmenides and the Theaetetus), Aristotle (especially the Metaphysics, but also the Physics, the Topics, and De partibus animalium), and Boethius (especially De Divisione and In Ciceronis Topica). Mereology occupies a prominent role also in the writings of medieval ontologists and scholastic philosophers such as Garland the Computist, Peter Abelard, Thomas Aquinas, Raymond Lull, Walter Burley, and Albert of Saxony, as well as in Jungius's Logica Hamburgensis (1638), Leibniz's Dissertatio de arte combinatoria (1666) and Monadology (1714), and Kant's early writings (the Gedanken of 1747 and the Monadologia physica of 1756). As a formal theory of parthood relations, however, mereology made its way into our times mainly through the work of Franz Brentano and of his pupils, especially Husserl's third Logical Investigation (1901). The latter may rightly be considered the first attempt at a thorough formulation of a theory, though in a format that makes it difficult to disentangle the analysis of mereological concepts from that of other ontologically relevant notions (such as the relation of ontological dependence). It is not until Leśniewski's Foundations of a General Theory of Manifolds (1916, in Polish) that a pure theory of part-relations was given an exact formulation. And because Leśniewski's work was largely inaccessible to non-speakers of Polish, it is only with the publication of Leonard and Goodman's The Calculus of Individuals (1940) that mereology has become a chapter of central interest for modern ontologists and metaphysicians.

This is quite heavy-duty stuff, but does show the increasing trend for the intersection of computer science and philosophy especially as we get deeper into the Semantic Web.

The original link pointed to the abstract for a paper entitled A Temporal Mereology for Distinguishing between Integral Objects and Portions of Stuff by Thomas Bittner and Maureen Donnelly. It discusses three categories of "stuff":

  • Integral objects, such as a car or computer.
  • Structured stuff, such as blood or the tissue of an organ.
  • Unstructured stuff, such as air and water that is homogeneous.

They give the example of the distinction between the liver as an integral object and liver tissue as the structured stuff that makes up the liver. The two are obviously related, but need to be treated distinctly depending on your intentions and purposes.

In the case of blood, we can refer to human blood in general, the blood of a particular human, a sample or portion of the blood of that particular human, and the "structured stuff" within that portion as it might be processed and separated into the components of red and white cells, platelets, and plasma.

This post is primarily intended as more of a bookmark for later reference, so my apologies for not giving a more concise or more detailed account of mereology.

-- Jack Krupansky

Sunday, May 10, 2009

Truth should not be hard-coded but somehow emergent

I ran across an interesting statement in a post on the W3C Semantic Web email list by Jeremy J. Carroll, Chief Product Architect of TopQuadrant in reply to a post from John Sowa, that relates to truth in the context of the Semantic Web:

... truth should not be hard-coded but somehow emergent.

I would add that there may be many competing truths on any given issue and that the user will have to choose between competing value systems that each arrived at their own versions of the truth of an issue.

Each user may have their own preferred authorities and sources from which they may choose to select the appropriate value system.

All of this may evolve over time. Authorities and sources can change their minds. Underlying data can change. Calculations can change. Rules can change. Theories can change. Experiments can be re-evaluated or examined in a new light. The world can change. Authorities and sources can come and go and their influence can wax and wane. User preferences can change.

So, you cannot capture truth at a moment and hold it forever. You need to re-execute your query to determine the truth of some assertion at the time you need it. Of course, even the Semantic Web cannot give you the real truth, but merely the modeled truth, as it emerges and as it continues to evolve, even over time, with time being a variable required for determining the truth of a proposition.
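
To make the idea of time as a variable a little more concrete, here is a rough Python sketch in which the "modeled truth" of a statement depends on when you ask and on which sources you trust. The Assertion record, the truth_as_of function, the source names, and the dates are all hypothetical; picking the most recent trusted assertion is just one simple policy among many possible ones.

```python
from datetime import date
from typing import NamedTuple, Optional


class Assertion(NamedTuple):
    statement: str
    value: bool
    source: str
    asserted_on: date


def truth_as_of(assertions, statement, when, trusted_sources) -> Optional[bool]:
    """Return the latest trusted verdict on a statement as of a given date.

    Returns None when no trusted source had said anything by that date.
    """
    candidates = [
        a for a in assertions
        if a.statement == statement
        and a.source in trusted_sources
        and a.asserted_on <= when
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda a: a.asserted_on).value


# Hypothetical data: two sources, one of which changes its mind.
assertions = [
    Assertion("the apple is red", True, "source_a", date(2009, 1, 1)),
    Assertion("the apple is red", False, "source_a", date(2009, 4, 1)),
    Assertion("the apple is red", True, "source_b", date(2009, 3, 1)),
]

print(truth_as_of(assertions, "the apple is red", date(2009, 2, 1), {"source_a"}))  # True
print(truth_as_of(assertions, "the apple is red", date(2009, 5, 1), {"source_a"}))  # False
print(truth_as_of(assertions, "the apple is red", date(2009, 5, 1), {"source_b"}))  # True
```

The same query gives different answers at different times and for different preferred authorities, which is why it has to be re-executed rather than cached forever.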

-- Jack Krupansky

Thursday, April 30, 2009

Need for a Casual Semantic Web

Current Semantic Web technologies are difficult to utilize, even by highly-skilled professionals. Unlike basic HTML where interesting Web pages and blogs can be assembled and hyperlinked with very little effort and grow at a "viral" rate, the growth of the Semantic Web is proceeding at a snail's pace. A high level of sophistication is needed to develop even basic Semantic Web content. It should not be that way. What is needed is a Casual Semantic Web where even naive consumers and low-skilled workers can rapidly put together interesting Semantic Web content.

Existing Semantic Web technologies are extremely flexible and enable very complex information structures, but consumers and low-skilled workers do not need all or even any of that complexity. They need simple constructs.

They need little more than "elements" for concepts such as:

  • Names
  • Places
  • Addresses
  • Phone numbers
  • Email addresses
  • IM ids
  • Social networking ids
  • Dates
  • Ages
  • Activities
  • Interests
  • Preferences
  • Opinions
  • Polls
  • Rankings
  • Ratings
  • Friends
  • Colleagues
  • Businesses
  • Governmental agencies
  • Non-profit institutions
  • Hospitals
  • Doctors
  • Employers
  • Employees
  • Teams
  • Team members
  • Groups
  • Associations
  • Membership
  • Travel plans
  • Children
  • Parents
  • Relatives
  • Roles
  • Lists

Of course they need convenient methods to publish their personal Semantic Web.

They need convenient Semantic Web browsing tools, although that capability may simply fold right into the traditional Web browser.

Traditional search engine and blog "crawling" technology would be sufficient to aggregate data to enable queries to correlate between users, groups, organizations, interests, etc. There would also be plenty of opportunity for specialized aggregators or mirroring or caching services to evolve, but none would have a monopoly or be able to act as gatekeepers to innovation since the underlying data would always be freely available to all.

Client apps (including for the iPhone and other mobile devices) could provide the kind of user-friendly access UI that people have come to expect from current social networks, but the "open" nature of the "networks" would provide greater flexibility and opportunity for innovation.

Users also need access control for privacy.

They also need a mechanism to manage their identity.

Elsewhere I have suggested the utility of a Data Union for storage of personal data.

The Casual Semantic Web would in fact be a step in the direction of open garden social networking in which the users are in control rather than being under the thumb of the "keepers" of current walled-garden social networks.

Users would be capable of introducing innovative social networks rather than dependent on others to provide (and control) them.

Overall, the main starting point is an extremely user-friendly vocabulary that does not require a computer science degree or advanced training just to publish relatively basic information.

-- Jack Krupansky

The Semantic Web swamp

Swamps are interesting places, but not if you are looking to make rapid progress. They are an unfortunate hybrid between dry land and open water. A land vehicle will get mired in the muck. Ditto with a water vehicle. Sure, there may be patches of dry earth here and there or pools of water here and there, but not enough of either in a connected fashion to exploit either. This is how a lot of the current Semantic Web feels to me. There is simply too much technological "muck" that slows progress.

Just yesterday (and into today) I was following an email thread on the OWL list about the relatively simple concepts of subclass and superclass, but the discussion simply goes on and on because there is no clarity in the specifications. Maybe if somebody points you to the precisely right passage it will all become clear (or maybe not), but that should not be required.

Sure, there are books and tutorials and seminars and consultants, but none of that should be required, at least for the level that the technology is at today.

It is an open question whether tools or additional layers can be built on top of the current Semantic Web technologies that are sufficient to hide the "muck" of the "swamp." I am hopeful that is the case, but there are no guarantees.

Can we "flood" the swamp to turn it into a navigable lake or sea? Maybe.

Can we "fill" in the swamp to create solid, traversable dry land with the underlying swamp as an "aquifer"? Maybe.

Besides the concerns about usability of current Semantic Web technologies, there is the larger question of whether it is so complex even at this stage that even seasoned professionals may be unable to verify that Semantic Web constructions are technically correct and valid for their intended applications and not too fragile and are readily maintainable by other than their original developers. Five or ten years from now could we end up with a knowledge crisis analogous to the current banking crisis simply because we do not know the location or magnitude of risks?

-- Jack Krupansky

Wednesday, April 29, 2009

The Quest for Computable Knowledge

Check out Stephen Wolfram's thoughts on "The Quest for Computable Knowledge" on the new Wolfram|Alpha blog. He acknowledges Leibniz's role in collection of knowledge, reasoning, and computation:

I've always been particularly struck by Gottfried Leibniz's role. He really had pretty much the whole idea of Wolfram|Alpha--300 years ago.

At the end of the 1600s he came to believe that somehow there must be a way to mechanize the resolution of all human arguments.

He imagined that one could represent human discourse using logic and mathematics. Then he imagined that one could use a machine to work out answers from this--and in fact he even built some small mechanical calculators himself.

He also realized that to provide raw material for his mechanization it would be necessary to assemble lots of knowledge. So he worked hard to get libraries constructed, and to invent systems for organizing them.

Of course there were some elements missing. But Leibniz really had the right basic idea.

-- Jack Krupansky

Tuesday, April 28, 2009

Levels of language for knowledge

Although I am still not convinced that the current Semantic Web technologies, based on RDF, are in fact the optimal foundation for a true knowledge web, I will continue to proceed on the assumption that RDF is in fact a reasonable starting point. That said, there is a real question of what exactly we can model in RDF. Maybe, in theory, we can model anything and everything in RDF, but is it really an efficient and effective "language" for the higher levels of knowledge? Even if it "works", in theory, is it really practical?

In computer programming languages we have "levels" of language:

  1. Machine code. The actual bits for the "instructions" executed by the hardware (or interpreter.)
  2. Assembly language. Mnemonic opcodes, symbolic names, macros, and other convenient features, but there is still a one-to-one relationship with machine code instructions.
  3. High-level languages. A compiler or interpreter translates declarations, expressions, "statements", functions, and classes into machine language instructions. These tend to be "procedural" languages.
  4. "4GL". User-oriented "query" languages that allow the user to interact in terms closer to the real world. These tend to be "declarative" languages -- the user says "here is what I want" and the computer figures out how to do it. Maybe even a little natural language or a structured subset.
  5. "5GL". Use of artificial intelligence, such as to infer what the user really wants. Deeper and broader support for natural language.

It may in fact be rather dangerous and counterproductive to assert that a knowledge web can be built and used based on such a hierarchy of languages, but for now it at least seems to be a reasonable conjecture to contemplate, at least until there is some clear and convincing evidence that it is a bad idea.

Using this programming language level model, RDF seems to "fit" as the assembly language level for knowledge. Names, in the form of name spaces and URIs, may be rather cryptic, but they are certainly symbolic, at least to some degree. Triples have a nice, fixed format, with three "fields" (subject, predicate, object), much like machine/assembly language "instructions."
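
The analogy can be sketched in Python: a slightly higher-level description of a thing gets "compiled" down into fixed-format triples, much as a compiler emits instructions. This is a toy illustration only. The Triple type, the expand function, and the http://example.org/ namespace are hypothetical and do not come from any RDF library, and plain strings stand in for both URIs and literal values.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical namespace used only for this sketch.
EX = "http://example.org/"


def expand(name: str, **attributes: str) -> List[Triple]:
    """'Compile' one higher-level description into low-level triples.

    Each keyword argument becomes one triple about the named subject.
    """
    subject = EX + name
    return [Triple(subject, EX + key, value) for key, value in attributes.items()]


triples = expand("apple1", color="red", location="table")
for t in triples:
    print(t.subject, t.predicate, t.obj)
```

The one-line call is what a subject matter expert might reasonably write; the triples it produces are the "assembly language" that most users should never have to see.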

Most significantly, an assembly language is a great tool for advanced, leading edge professionals, but an exceedingly poor tool for "users" such as subject matter experts who know their domain but not necessarily the nuances of the Semantic Web technologies such as RDF.

Clearly there is a need for higher-level knowledge languages. I do not have any detailed answers here and now, but this is obviously an area to think about.

I would close here by noting that we should be careful not to confuse languages and tools. Graphic interactive tools and environments will certainly be as useful in working with knowledge webs as they are in traditional computer programming, but it is still important to be clear about what level of language is being modeled directly behind the fancy graphical images. Putting a pretty GUI frontend on an RDF editor does not magically give the user the ability to converse in a 4GL. In short, a GUI frontend will be appropriate for each level of knowledge language, and the GUI may be radically different for different language levels.

-- Jack Krupansky

Tuesday, April 21, 2009

What is the unit of knowledge?

A lot of talk about knowledge, but what exactly is the unit of knowledge?

Computers have bits, bytes, words, integers, floating point, and strings, but how do we even talk about the units of knowledge?

Before continuing, I would note one interesting answer that I stumbled upon using Google. According to Dan Markovitz of TimeBack Management, a common saying around Toyota is that:

The basic unit of knowledge is a question.

That may have some utility, but begs the question as to what the unit of questions might be, leaving us with not much more than we started with.

A variation of that adage might be an axiom about units of knowledge:

The basic unit of knowledge is the most narrow and focused question that we can formulate about knowledge.

A corollary of that axiom would be:

The basic unit of knowledge is the response to the most narrow and focused question that we can formulate about knowledge.

But, I am not so sure that such an axiom must necessarily be true. A question is like a tool, a measuring and manipulation device, used to access knowledge. But in the real world it seems as if matter has an even finer structure than the finest tools we can construct for measuring and manipulating matter. On the other hand, maybe that merely means that we simply are not yet smart enough to envision such tools. In some cases, such as with subatomic particles, we use indirect tools such as particle accelerators to smash particles apart so we can observe the results. So, maybe my axiom is not so far off, for now.

WikiAnswers.com has an interesting answer:

Q: What is the smallest unit of knowledge?

A: The adjective.

That is along the lines of a thought I had, that attributes of objects may be the smallest units of knowledge.

I mostly think of knowledge as collections of statements about objects, phenomena, or beliefs.

We could say that the statement is the "unit" of knowledge, but to me a statement is more a form of knowledge, a container rather than the contents of the container. We are more interested in the units of the contents of statement "containers."

Operationally, nouns, pronouns, adjectives, verbs, adverbs, prepositions, interjections, and conjunctions (the eight parts of speech) are the basic natural language units for knowledge. Or you could say that words are the units of natural language knowledge. This is certainly true, but seems to sidestep the issue of true "knowledge" in the sense that an assembly of words can suddenly conjure up a meaning that is quite distinct from the meanings of the individual words.

A dictionary might contain all of the words used in a novel, but the real question is what is the unit of storytelling that makes a novel what it is rather than just a sequence of statements.

An operational definition from the world of the Semantic Web is the RDF statement or RDF triple which consists of a subject, predicate (or property), and object. An RDF statement can be somewhat analogous to an adjective. At least in the context of the Semantic Web, RDF triples are clearly the unit of "knowledge." But, that begs the question of whether the Semantic Web as currently envisioned is comprehensive enough to represent all knowledge.

For now, I am comfortable using the statement as the unit of basic knowledge. For example:

  • The apple is red.
  • Some apples are red.
  • Not all apples are red.
  • The apple is on the table.
  • There is no apple on the table.

Next, there are various forms of statements:

  • Existence. The fact that some object, phenomenon, or belief does or does not exist.
  • Attributes. Such as the color or location or size of an object.
  • Relationships to other objects (or phenomena or beliefs). How do the objects in the world interact.

We can also refer to such simple statements as facts. There is some appeal to suggesting that facts are the units of knowledge. Whether facts and statements are the same or dissimilar in some way is left for further consideration in the future.

An immediate question is the status of questions relative to statements. My current thesis is that questions are simply another form of statement, a kind of mirror reflection of statements:

  • Is the apple red?
  • Are all apples red?
  • Is there an apple on the table?
  • Where is the apple?

We could presume that the form of the answer or response to any question is the unit of knowledge.

Next, there is the issue of compositional structuring of statements, collections of statements that are related somehow. This is where things get, literally, interesting, since such collections of statements may in fact be the unit for storytelling, for constructing elaborate stories, including novels. These collections of statements may in fact represent a unit of meaning that is in fact far richer than the level of simple, factual statements. So, we have this issue of whether facts or story-level meaning should be our unit of knowledge.

Google has a project called knol which is billed as "a unit of knowledge". A knol is in fact a full-blown paper or essay or article, comparable to a Wikipedia article. That is a rather different usage of the term "unit." One could propose that a "unit" of knowledge is an interesting and usable package of knowledge, including books, web pages, PDF documents, magazines, movies, podcasts, blogs, blog posts, Twitter "tweets", etc. Fair enough.

Maybe my final thought, for now, is that a unit of knowledge is any form of knowledge that is usable, as is. Even a passage of text clipped out of the middle of a paragraph might be a usable unit of knowledge.

I have not answered the initial question precisely, but I think there is enough foundation to proceed without having a precise definition, for now.

-- Jack Krupansky

Cultivating knowledge vs. garbage in, garbage out

One day we will have a sufficiently rich and robust infrastructure capable of supporting the development of a true knowledge web, but will we be ready for it? Even with the proper tools in our hands, will we know how to use them effectively?

What is needed is some sense of how to cultivate knowledge so that we do not end up with vast mountain ranges of crap that suffer from GIGO (garbage in, garbage out.)

At a simplistic level we need tools, methods, and discipline for knowledge curation, but that is much easier said than done.

Further, we need a culture of knowledge that is compatible with and accepted by average consumers so that we can in fact build a vast consumer-centric knowledge web that does not depend on vast legions of human knowledge curators just to accumulate relatively simple tidbits of knowledge that consumers produce on a daily basis.

In short, we need a whole science of consumer-centric knowledge cultivation. Otherwise, we could end up producing a knowledge web that is not terribly useful relative to its promise.

-- Jack Krupansky

Monday, April 20, 2009

Software agents for virtual browsing and virtual presence

With so many places to go and so many things to see and do on the Web, it is getting almost impossible to keep up with the proliferation of interesting information out there. We need some help. A hefty productivity boost is simply not good enough. We need a lot of help. Browser add-ons, better search engines, and filtering tools are simply not enough. Unfortunately, the next few years hold more of the same.

But, longer term we should finally start to see credible advances in software agent technology which help to extend our own minds so that we can engage in virtual browsing and have a virtual presence on the Web so that we can effectively reach and touch a far broader, deeper, and richer lode of information than we can with personal browsing and our personal presence.

Twitter asks us what we are doing right now, but our online activity and presence with the aid of software agents will be a thousand or ten thousand or even a million or ten million times greater than we can personally achieve today. What is each of us interested in? How about everything?! Why not?

The gradual evolution of the W3C conception of the Semantic Web will eventually reach a critical mass where even relatively dumb software agents can finally appear to behave in a relatively intelligent manner that begins to approximate our own personal activity and personal presence on the Web.

It may take another five to ten years, but the long march in that direction is well underway.

The biggest obstacle right now is not the intelligence of an individual software agent per se, but the need to encode a rich enough density of information in the Semantic Web so that we can realistically develop intelligent software agents that can work with that data. We will also need an infrastructure that mediates between the actual data and the agents.

-- Jack Krupansky

Monday, April 13, 2009

Using Data Unions as repositories of personal data

In order to facilitate the development of open garden social networks it is necessary to have a safe place for consumers to place their personal data, not just where it can be stored and accessed, but also to control access and to provide a reliable digital identity. Many years ago I thought up a scheme I called a data union, kind of a cross between a data bank and a credit union, which would provide exactly that form of reliable and safe storage for a consumer's personal data. I finally wrote up a rough, summary description back in 2005, but I have not yet pursued the concept any further.

The intention is not so much to store a consumer's bulk data such as documents, photos, media, etc., but simply to store and control the attribute information that might be needed for online transactions and promotion of products and services, such as name, address, phone numbers, social security number, age and birth date, gender, interests, and whatever. The intention was to give the consumer great control over exactly what personal information is available to whomever.

It would be a natural extension to have a data union safety deposit box, which would be a modest amount of digital storage, maybe in the megabytes or a "few" gigabytes, sufficient for documents, valuable images, etc., but not intended for full-blown personal storage.

A data union would be an ideal repository for online digital identity credentials, or at least as a digital identity validation service. For example, the consumer could approve an entity with which they are willing to transact and then the consumer could provide a transaction code to that entity which the data union could verify.

A data union would enable the consumer to be as open and visible and transparent or as closed and hidden and secretive as they wish.
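
The approval flow described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the DataUnion class, its method names, the example consumer and entity names, and the choice of an HMAC as the "transaction code" are all hypothetical, invented here, and not part of the 2005 write-up.

```python
import hashlib
import hmac
import secrets


class DataUnion:
    """Hypothetical data union: stores attributes and validates transaction codes."""

    def __init__(self):
        self._secret = secrets.token_bytes(32)  # signing key known only to the union
        self._attributes = {}                   # consumer -> {attribute: value}
        self._grants = {}                       # (consumer, entity) -> allowed attribute names

    def store(self, consumer, **attributes):
        self._attributes.setdefault(consumer, {}).update(attributes)

    def approve(self, consumer, entity, allowed):
        """The consumer approves an entity and gets a transaction code to hand over."""
        self._grants[(consumer, entity)] = set(allowed)
        return self._code(consumer, entity)

    def _code(self, consumer, entity):
        message = f"{consumer}|{entity}".encode()
        return hmac.new(self._secret, message, hashlib.sha256).hexdigest()

    def verify(self, consumer, entity, code):
        """The data union checks a code presented by an entity."""
        if (consumer, entity) not in self._grants:
            return False
        return hmac.compare_digest(code, self._code(consumer, entity))

    def release(self, consumer, entity, code):
        """Return only the attributes the consumer chose to expose to this entity."""
        if not self.verify(consumer, entity, code):
            return {}
        allowed = self._grants[(consumer, entity)]
        data = self._attributes.get(consumer, {})
        return {k: v for k, v in data.items() if k in allowed}


union = DataUnion()
union.store("alice", name="Alice", age=34, phone="555-0100")
code = union.approve("alice", "bookshop.example", allowed=["name"])
released = union.release("alice", "bookshop.example", code)
```

Here the bookshop sees only the name, and an entity the consumer never approved gets nothing back even if it somehow obtains the code.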

-- Jack Krupansky

Sunday, April 12, 2009

Open garden social networking vs. walled gardens

I am truly tired of social networking sites that are walled gardens, requiring some form of registration and holding my personal data hostage by maintaining it behind the walls of the "walled garden." What is the alternative? Is there an alternative? No, there is no alternative currently, but in the longer term we can hope that developers and entrepreneurs will recognize that open garden networks have distinct advantages over walled gardens.

The essence of an open garden social network is that users maintain their data wherever they want as long as it can be crawled by whatever sites wish to aggregate that data. Since the data is maintained publicly, it can easily be shared by more than one social networking aggregator.

The immediate technical obstacles are that: 1) the average consumer has no obvious public location to store their data and 2) we do not have a technology and public infrastructure in place for consumers to "sign" their personal data to associate it with their digital identity.

Who knows, maybe open garden social networking will take off in another five or ten years.

One of the key benefits of open garden personal data is that it will open up vast new opportunities for innovation in open garden social media since each innovator can piggyback on the existing (in the future) public open garden infrastructure rather than need to go through the time and expense of reinventing the wheel unnecessarily for each new social networking aggregator site.

-- Jack Krupansky

Monday, April 6, 2009

Semantic input for consumers

As yet there are no perfect methods for consumers to enter semantic data. Entering free text is certainly convenient, but we just don't have "perfect" natural language processing software, yet.

The common forms of consumer data for which semantic data are desirable include:

  • email messages
  • email address books
  • blog posts
  • Twitter "tweets" and other forms of micro-blogging
  • IM instant messages
  • cell phone calls
  • text messages
  • digital camera pictures
  • transaction data, including credit card transactions and online ecommerce forms.

Unless the consumer is an "English geek", it is unlikely that they will be willing to create structured sentence diagrams to express the meaning of even simple statements.

The full range of methods for semantic input includes:

  1. Natural language processing (NLP) for text and audio.
  2. Controlled vocabularies (e.g., Structured English)
  3. Text mining.
  4. Full semantic map editing (e.g., à la sentence diagrams)
  5. Detection of object references in free text (e.g., proper names and nicknames for people, places, and things), possibly based on customizable dictionaries.
  6. Form-based input, including drop-down lists for direct selection of semantics
  7. Transaction device data (e.g., GPS location, date and time, etc.)
  8. Transaction information (e.g., online ecommerce data)
  9. Tagging
  10. Background review and re-entry by trained "semantic coder" (e.g., in an "offshore" market.)
  11. Feedback and enhance - mine consumer input for apparent concepts and ambiguity, annotate the original, and allow consumer to approve and select between alternatives and "hints"
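
To make one of these methods concrete, here is a minimal Python sketch of method 5, detection of object references in free text using a customizable dictionary. The dictionary entries, the entity identifiers, and the function name are hypothetical, made up for illustration; a real system would need far more than simple word-boundary matching.

```python
import re

# Hypothetical, user-customizable dictionary: surface form -> (entity id, type)
DICTIONARY = {
    "nyc": ("place:new-york-city", "place"),
    "new york": ("place:new-york-city", "place"),
    "bob": ("person:robert-smith", "person"),
    "the office": ("place:work", "place"),
}


def detect_references(text, dictionary=DICTIONARY):
    """Return (start offset, matched text, entity id, type) for each dictionary hit."""
    # List longer names first so a longer entry wins over a shorter one
    # whenever both could match at the same position.
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for match in pattern.finditer(text):
        entity, kind = dictionary[match.group(1).lower()]
        hits.append((match.start(), match.group(1), entity, kind))
    return hits


hits = detect_references("Met Bob in New York, then back to the office.")
```

Each hit could then be shown back to the consumer as a "hint" to approve or correct, in the spirit of method 11.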

I am sure that there are a variety of other methods, existing, proposed, or not yet imagined, but these are a starting point for discussion, as well as an illustration of how much more research and innovation are needed.

I have been trying to avoid a reliance on full-bore NLP, but the simple truth is that it may in fact be the best foundation.

-- Jack Krupansky

URI-based resource location

I have never been happy with the Semantic Web concept of associating resources with specific Web locations using URLs that specify a server location such as a domain name. The main issues:

  1. Makes it difficult to move a resource to another domain.
  2. Increases the likelihood that a server might become a performance bottleneck, especially as popularity grows and the Semantic Web begins to scale up in size dramatically (so-called "exponential growth.") Wiring in a server location simply does not scale up.
  3. Encourages ad-hoc caching. Worse, as the Semantic Web scales up it requires a dependence on ad-hoc caching.

Although some form of caching is clearly part of the solution, the main component of a solution is to switch from URL-based resource locating to URI-based resource locating.

Rather than specifying a single URL and then depending on the existing, non-Semantic Web Domain Name System (DNS) to look up the actual path to "the" server, we need a non-DNS lookup mechanism that takes one or more URIs and does more of a "keyword" lookup, treating each URI as a Semantic Web analog to a keyword, and then redirects through a caching infrastructure that is designed to meet the needs of caching resources for the Semantic Web.

A Semantic Web resource URI list might also be supplemented with various attributes, such as version number or version requirements and other attributes needed to constrain and control resource access.

The SW resource infrastructure should be able to manage multiple versions for a resource and efficient and controlled propagation of changes.

One use of multiple URIs is to control the degree of specialization of a generic resource name. A single URI would be the most general resource reference and provide the most adaptability, while adding on specialization URIs would provide access to resources that meet additional requirements. This is analogous to base and derived classes in object-oriented programming, but is not necessarily required.

One key attribute of such a resource infrastructure, besides scalable performance itself, would be that even a very small, under-powered web site could be the source host for even extremely popular Semantic Web resources, and migrating such resources to another host should be completely transparent to the "users" (user agents, UAs, or software agents) that have the URI list for the resource "wired" into their "code."
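
As a thought experiment only, the lookup described above might be sketched like this in Python. Everything here is an assumption for illustration: the Resolver and Resource names, the example URNs and host names, and the rule that a lookup returns the newest resource carrying all of the requested URIs. No such infrastructure exists for the Semantic Web today.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    uris: frozenset   # every URI "keyword" this resource answers to
    version: int
    hosts: list = field(default_factory=list)  # source host first, caches after


class Resolver:
    """Hypothetical non-DNS lookup: a list of URIs in, a resource and its hosts out."""

    def __init__(self):
        self._resources = []

    def register(self, uris, version, host):
        resource = Resource(frozenset(uris), version, [host])
        self._resources.append(resource)
        return resource

    def lookup(self, uris, min_version=0):
        """Return the newest resource matching ALL requested URIs, or None."""
        wanted = frozenset(uris)
        matches = [
            r for r in self._resources
            if wanted <= r.uris and r.version >= min_version
        ]
        return max(matches, key=lambda r: r.version, default=None)

    def migrate(self, resource, new_host):
        """Move the source host; clients keep using the same URI list."""
        resource.hosts[0] = new_host


resolver = Resolver()
general = resolver.register(["urn:example:country-codes"], 1, "tiny-host.example")
special = resolver.register(
    ["urn:example:country-codes", "urn:example:historical"], 2, "tiny-host.example"
)
found = resolver.lookup(["urn:example:country-codes", "urn:example:historical"])
resolver.migrate(special, "bigger-host.example")
```

A single URI is the most general request, and each added URI narrows the match. The agent's "wired in" URI list never changes when the resource moves to another host.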

Alas, I am not optimistic that such an architecture will soon or even ever be made available for the Semantic Web as we know it today. The change may have to wait for whatever follows the Semantic Web, or maybe even for Ray Kurzweil's Singularity.

Still, it is useful to contemplate what a proper solution might look like.

-- Jack Krupansky

Geopolitical reference data from FAO

A key issue with building a semantic web is having robust base reference data, including geopolitical data such as the countries of the world. The Food and Agriculture Organization of the United Nations (FAO) has produced an ontology for geopolitical information that:

... manages information about territories and groups, such as, the different names in English, French, Spanish, Chinese and Arabic; associated classification codes, like, UN code -- M49, ISO-3166 Alpha-2 and Alpha-3, UNDP code, GAUL code, FAOSTAT, etc; historical changes; and specific relations like "x has border with y" and "x is member of group z".

That sounds like a great start, although I have not examined the details.

-- Jack Krupansky

Monday, March 9, 2009

What is curated data?

The compound term curated data merely means data (in a loose sense, that may include even the most sophisticated computational formats for structured information, media, and even knowledge) that has been collected and organized under the supervision of one or more persons considered to be qualified to engage in such an activity. Such a person may in fact be a scientist or other expert in a relevant field, or may merely be someone acting in a clerical capacity who simply verifies that the data is "acceptable" according to the requirements of whoever has commissioned the curation of the data. The implication is that the resulting database (or data series or data set) is of high quality. The contrast is with data which may have been gathered through some automated process or using low-skilled or unskilled workers such that the quality of the data is unverified and possibly unreliable.

An example of curated data is the extensive computable data supplied with Mathematica from Wolfram Research.

-- Jack Krupansky

Sunday, March 8, 2009

The DIKW hierarchy - Data, Information, Knowledge, Wisdom

The DIKW hierarchy is a rough model for relating data, information, knowledge, and wisdom. Granted, the model lacks scientific precision and may not have a lot of functional utility, among numerous criticisms, but I personally find that it helps to clarify what level of refinement and robustness we are dealing with.

There is implicitly a lower level below the DIKW hierarchy, the signal level where we have raw sensor readings before they are formatted into data.

Data (or a data item) has little meaning directly attached to it. We have primitive data types such as integers, floating point numbers, character strings, boolean flags, etc., and we have streams of data. We speak of data values or the value of a data item.

Information takes data and applies structure and rudimentary meaning. We have records, structures, database tables, and other methods and mechanisms for organizing raw data items into somewhat abstract structures. These structures may be coupled with methods for manipulating the information and rules that constrain the information and structures or define relationships among subsets of the information. This is the meat and potatoes of computation as we know it today.

Knowledge moves towards representation of meaning that may begin to approximate human knowledge. Knowledge has a meaning structure but may or may not be based on information structures as well. We may also approximate knowledge using semi-structured information.

Wisdom corresponds to judgment in the application of knowledge, but is not yet readily achievable in a typical computational environment.
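
The lower levels of the hierarchy are easy enough to illustrate in code. The following Python sketch walks one value from signal to data to information. The 10-bit sensor count, the linear calibration, the record layout, and the plausibility rule are all assumptions chosen for the example. Knowledge and wisdom are deliberately left out, since there is no comparably simple way to show them.

```python
from dataclasses import dataclass

# Signal level: a raw sensor reading, here an assumed 10-bit ADC count.
raw_count = 512

# Data level: a bare floating point value with little meaning attached.
celsius = raw_count * (100.0 / 1023)  # hypothetical linear calibration


# Information level: structure, rudimentary meaning, and a constraining rule.
@dataclass
class TemperatureReading:
    sensor_id: str
    celsius: float

    def __post_init__(self):
        # Rule constraining the information: reject implausible values.
        if not -90.0 <= self.celsius <= 150.0:
            raise ValueError("temperature outside plausible range")


reading = TemperatureReading("kitchen-1", round(celsius, 1))
```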

This DIKW model is surely very limited and may not ultimately give us a lot of intellectual leverage, but it is a decent starting point.

I see the current Semantic Web as mostly focused at the information level, trying to give us computational power at the Web level that we currently have within individual computers and individual applications.

The hope is that once we have mastered information at the Web level, maybe then we can layer knowledge on top of that. And then maybe wisdom can be laid on top of that. Or so the fantasy goes.

-- Jack Krupansky

Wolfram Alpha - computational knowledge engine

Wolfram Research (Stephen Wolfram) is on the verge of unveiling a new project called "Alpha" which is billed as a "computational knowledge engine." It combines the computational power of Mathematica with tools to "explicitly curate all data so that it is immediately computable" to be able to "take questions people ask in natural language, and represent them in a precise form that fits into the computations one can do" and "handle all the shorthand notations that people in every possible field use." Wolfram says:

... I'm happy to say that with a mixture of many clever algorithms and heuristics, lots of linguistic discovery and linguistic curation, and what probably amount to some serious theoretical breakthroughs, we're actually managing to make it work.

He does add the caveat that:

And -- like Mathematica, or NKS -- the project will never be finished.

But he triumphantly announces that:

... I'm happy to say that we've almost reached the point where we feel we can expose the first part of it.

It's going to be a website: www.wolframalpha.com. With one simple input field that gives access to a huge system, with trillions of pieces of curated data and millions of lines of algorithms.

Having a simple Google-like search engine box is all well and good, but the real question is the extent to which the engine is "open", both in terms of programmatic API and Web Services access and integrating with external data.

How it compares with and meshes with the Semantic Web remains to be seen.

In any case, this does sound like a significant leap forward.

-- Jack Krupansky

Thursday, March 5, 2009

Check out Knoodl which facilitates community-oriented development of OWL based ontologies and RDF knowledgebases

This is mostly just a note to myself to look into Knoodl:

Knoodl facilitates community-oriented development of OWL based ontologies and RDF knowledgebases. It also serves as a semantic technology platform, offering a service based interface so that communities can build their own semantic applications using their ontologies and knowledgebases. Knoodl is a product of Revelytix, Inc. and is hosted in the Amazon EC2 cloud and is available for free.

According to their web site, Knoodl offers:

  • Cloud-based application (Amazon EC2)
  • Ontology editing
  • Ontology import/export
  • Collaboration
  • Role-based security
  • Scalable RDF store (Mulgara)
  • SPARQL Endpoints (new)
  • SPARQL query wizard (March '09)
  • Ontology guided search (March '09)
  • Graphical ontology mapping wizard (March '09)
  • User designed widgets and gadgets for viewing data (March '09)
  • User designed widgets and gadgets for entering data and submitting queries (March '09)

They tell us that:

All content in Knoodl is organized into Communities. You can browse the list of Communities by clicking on the Community menu at the top of the screen and selecting Directory. Within Communities, there are regular Wikis and there are Vocabularies. A Vocabulary is a combination of an OWL based ontology editor and a wiki. Wikitext in Knoodl is not semantic, it is there to provide users with the ability to collaborate more effectively and add rich documentation. Each Vocabulary represents an ontology. Every resource (class, property, and instance) in the ontology has its own page in the Vocabulary.

To get started, you can take the tour and see how to get started, then dive in and check out some of the example vocabularies, and see what vocabularies people have already uploaded. Better yet, register for an account, create or join a community, and start contributing!

Sounds quite interesting.

One question I have: Is the "k" in "Knoodl" enunciated or is it more like the silent "k" in "knowledge"? I am guessing that it is pronounced "noodle" rather than "ka-noodle", but who knows.

Hmmm... I wonder if any of the ontologies for vocabularies include pronunciations?! Anybody have an ontology for natural language speech, the spoken word?

-- Jack Krupansky

Monday, February 16, 2009

Refinement and expansion of terms and concepts

Terms and concepts tend to be used rather loosely when a new field of interest is fairly young or poorly understood. That is to be expected. But as people drill down into more detailed examination of concepts they tend to refine terms. Similarly, as they find commonality and pursue application of concepts they expand the range of terms.

Refinement tends to bring concepts and terms into sharper focus, sharper and narrower than the pioneers required for their primitive needs.

Expansion recognizes that concepts have a greater utility and greater variation than many pioneers may have recognized.

Refinement can also recognize that a concept or term may be relatively generic or general and that there is value in specialization or subsetting of a concept or term. The specialized concepts essentially fan in to the more general concept.

Expansion can also recognize that a concept or term can be supplemented to increase its utility for certain forms of application. The more general concepts essentially fan out to the supplemented concepts and terms.

Refinement implies a many-to-one relationship of the refinements to the general concept. Alternatively, there is a one-to-many relationship between a general concept and its refinements.

Expansion implies a one-to-many relationship from the general concept to the expanded concepts. Alternatively, there is a many-to-one relationship between expanded concepts and their general concept.
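
The two directions of these relationships can be shown with a pair of small mappings. This is just a sketch in Python, and the example terms are hypothetical, picked only to have something to map.

```python
from collections import defaultdict

# Many-to-one: each refined (or expanded) term points at its one general concept.
general_of = {
    "software agent": "agent",
    "intelligent agent": "agent",
    "mobile agent": "agent",
}

# One-to-many: invert the mapping to list each general concept's derived terms.
derived_from = defaultdict(list)
for term, general in general_of.items():
    derived_from[general].append(term)
```

The same data supports both views. Which one is "the" relationship depends only on which end you start from.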

-- Jack Krupansky

Sunday, February 15, 2009

The difference between truth and fiction is that fiction has to make sense

There was an amusing aphorism about truth and fiction in the new movie The International (with Clive Owen and Naomi Watts.) I may not have the exact wording, but it is roughly:

The difference between truth and fiction is that fiction has to make sense.

(Or maybe it was "There is a difference between truth and fiction -- fiction has to make sense.")

That sounded like it was probably a noteworthy quote from somebody, so I did a Google search. Mark Twain's name popped up a few times with various wordings. I did another search using his name and found these two quotes on BrainyQuote.com, so they are probably the definitive quotes:

It's no wonder that truth is stranger than fiction. Fiction has to make sense.

Why shouldn't truth be stranger than fiction? Fiction, after all, has to make sense.

A similar quote is attributed to Leo Rosten:

Truth is stranger than fiction; fiction has to make sense.

And a similar quote is attributed to Tom Clancy:

The difference between fiction and reality? Fiction has to make sense.

My suspicion is that the film used Clancy's version. If I ever meet Clancy, I'll ask if he "borrowed" from Twain's adage.

Finally, Alex Lane asserts that Twain's adage is "roundly refuted" by the popularity of The X-Files.

So, can we use the fact that a proposition "makes sense" as a criterion for judging truth or lie, fact or fiction? If not, what good is it for us to obsess over whether anything "makes sense"?

-- Jack Krupansky

Tracking the evolution of meaning

Even the dictionary is not completely static and engraved in stone. In addition to the appearance of new words, old words can take on new meanings and cease to necessarily connote old meanings. Over time, the editors of dictionaries try to track the evolution of the meanings of words and phrases in both written and spoken language. Even when the dictionary is quite clear and most people solidly recognize the "official" meaning of a word, there will always be outliers, renegades, and revolutionaries (evolutionaries?) who insist on redefining words to have meanings of their own choice or "context." Dictionary editors do a fairly good job of tracking and reporting the evolution of meanings of words. Enter the Semantic Web.

The Semantic Web is not about natural language per se, but there is an intention to represent or at least indicate real-world concepts using URI resources and inferences.

There was an interesting email thread on the W3C Semantic Web email list triggered by an email from Jeremy J. Carroll, Chief Product Architect at TopQuadrant with the subject line "live meaning and dead languages." Jeremy opined that:

In terms of meaning on the web, I see that the web as a place where the life world is produced, by active extensions of our linguistic apparatus. I hence have an aversion to techniques and technologies that somehow pretend that meaning on the web, and in particular the semantic web, should or could be made static and somehow lifeless. So, I have difficulty seeing the meaning of any URI as univocal or fixed or even particularly well-defined. This leads to some hesitation concerning systems of definitions and axioms built on top of such univocity.

I think this worry becomes more so as axioms and systems of axioms become more complicated. (I just about see similarities between OWL2 and the Shorter Latin Primer I had at high school).

A term which is too tightly nailed down in its relationship to other terms has been dug into an early grave. Having fixed its meaning, as our world moves on, the term will become useless.

The trick, in natural language, is that the meaning of terms is somewhat loose, and moves with the times, while still having some limits.
This looseness of definition gives rise to some misunderstandings (aka interoperability failures), but not too many, we hope.

So I wonder, as some people try to describe some part of their world with great precision, using the latest and greatest formal techniques, just how long that way of describing the world will last. Maybe there is a role in such precision in allowing us to be clear about differences of opinion --- but it doesn't seem to me to be a good foundation for building knowledge.

He tells us that his thoughts were in part inspired by his recent reading of the book Emptiness & Brightness by Don Cupitt, from which he quotes:

By language, I mean the dance of signs, the continuous process of symbolic exchange between people, the humming communication network of which the human life world consists. I mean also to invoke the vast strange and multi-dimensional world of linguistic mean-ing -- and I am hyphenating mean-ing, like be-ing, because mean-ing is a process too. We need to make this point because for so long European intellectuals studied only dead languages -- Latin, Greek and Hebrew -- and failed to grasp the way the transactions of life are carried out and the life world is produced and formed by the motion of living language.

The book is (of course) available at Amazon.

There ensued a long discussion on the email list, including this issue of the distinction or disparity of the Semantic Web and natural language. This unresolved aspect of the Semantic Web will continue to haunt the practical application of the Semantic Web until somebody comes up with a model to transcend "Web" meaning and human meaning. Meanwhile, practitioners will continue to invent all manner of contrived methods for pretending that the vast gap between the two does not exist.

My immediate reaction to Jeremy's original email, sent directly to him, was:

That is why the Semantic Web is based on URIs rather than "keywords" -- as "meaning" evolves over time, people can simply construct new URIs representing the same natural language text but with the new "meaning." Sure, there is always the problem of misuse of URI when the associated natural language text "matches" but the meaning is not aligned with the real world context, but that is always going to be true in any language system, natural or non-natural. Over time, people can gradually detect "meaning misalignment" (or even "suspected meaning misalignment") and add knowledge of the perceived misalignment, so that the perceived strength of any inferences can be reduced to reflect the ambiguity of any inferred meaning.

In summary, we have two big problems here: 1) representing real-world meaning in the Semantic Web, and 2) tracking evolution of real-world meaning in the Semantic Web.

There are at least four distinct forms of variation in meaning that need to be tracked:

  1. Meaning evolves over time, either to take the meaning in a different direction or simply to refine or expand the existing direction.
  2. Different camps or contexts have distinct interpretations.
  3. Different individuals interpret and use terms or concepts differently.
  4. Obsolete terms and concepts which have been superseded with distinct, newer terms and concepts.
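
The URI-per-meaning idea from my reply above, along with the notion of weakening inferences as misalignment is detected, can be sketched as a small Python data structure. The Meaning class, the example URNs, and especially the discount formula are assumptions for illustration. Nothing in the Semantic Web standards prescribes any of this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Meaning:
    uri: str                          # one URI per distinct sense of the term
    text: str                         # the natural language text it stands for
    supersedes: Optional[str] = None  # URI of the older sense, if any
    misalignment_reports: int = 0     # times "meaning misalignment" was flagged

    def inference_strength(self, base=1.0):
        """Discount inferences as misalignment reports pile up (assumed formula)."""
        return base / (1 + self.misalignment_reports)


old = Meaning("urn:example:web/1999", "web")
new = Meaning("urn:example:web/2009", "web", supersedes=old.uri)
new.misalignment_reports += 1  # somebody suspects the URI is being misused
```

Both senses share the same natural language text, but each has its own URI, and the newer one records which sense it supersedes. That covers items 1 and 4 in the list. Items 2 and 3 would need some notion of who is using the term.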

-- Jack Krupansky

Thursday, February 12, 2009

Sarcasm, satire, truth, and lies for semantic data mining

Although semantic data mining has a lot of potential, it is quite a minefield of tricky issues. Even if we successfully filter, say, a blog post or purported article on a Web site into succinct statements, we then have the issue of determining the veracity of those statements. That is difficult enough in its own right, and then you have sarcasm and satire, where the author and most human readers know that the statements are not the author's actual opinion, even though superficially they do appear to be explicit statements of belief by the author.

In essence, statements using sarcasm and satire are inherently "lies" in a superficial sense, but for most human readers they certainly do not betray any intention of misleading the reader.

An immediate application is for semantic data mining applications that seek to uncover brand reputation issues. For example, a sarcastic product review read only superficially would get the reputation 180 degrees wrong. A wiseacre might express lavish, albeit sarcastic, praise for a poor product that he despises or withering, albeit sarcastic, criticism of a great product that he personally admires (maybe simply to tweak the insufferably zealous fans of the product.)

Still, there is value in recognizing the sarcasm and satire, even if a particular application (brand reputation) does not need it.
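
A toy Python example shows how badly a superficial reading can go wrong. It assumes that some earlier step has already produced a surface sentiment score between -1 and 1 and a sarcasm flag for each review. Detecting the sarcasm is the genuinely hard part and is not attempted here. The scores and the function name are made up.

```python
def reputation_score(surface_sentiment, sarcastic):
    """surface_sentiment is in [-1, 1]; a sarcastic review means the opposite."""
    return -surface_sentiment if sarcastic else surface_sentiment


reviews = [
    (0.9, True),    # lavish but sarcastic praise for a despised product
    (0.8, True),    # more sarcastic praise
    (-0.7, False),  # sincere criticism
]
naive = sum(s for s, _ in reviews) / len(reviews)
adjusted = sum(reputation_score(s, flag) for s, flag in reviews) / len(reviews)
```

The naive average comes out positive while the adjusted average is strongly negative, which is exactly the 180-degree error.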

-- Jack Krupansky

Wednesday, January 28, 2009

Semantic Web challenges 2009

Here are my current thoughts about the challenges facing the Semantic Web, vintage January 2009:

  • Mind the Gap
    • Thesis: There is a dramatic semantic gap between how users think and communicate about knowledge and the mechanisms that the Semantic Web supports for organizing knowledge.
      • Superset of the semantic data mining problem
    • How to jump from comfort with natural language to comfort with the Semantic Web
    • Extent to which the user "sees" the Semantic Web as opposed to the Semantic Web simply being more power "under the hood" in a completely transparent manner
  • Mind the Gap II
    • How do we map and transition between natural language and the Semantic Web
      • How to represent natural language in the Semantic Web
        • Concepts, statements, reasoning, processes, prose passages, stories, outlines
  • Semantic search engines
    • Not just raw text, semantic inferences as well
    • No single best form for database, need open access to create specialized databases
  • Inference Broker
    • Need for inference brokers to mediate between creators of knowledge and users of knowledge. Due to:
      • Desire for privacy
      • Protection of intellectual property
      • Massive scalability requirements - divide and conquer
      • Division of labor, factoring large problems into smaller problems
  • Social structure of knowledge
    • Individuals have only some of the puzzle pieces of knowledge
    • Propositions of uncertain classification
    • Social groups aggregate and classify knowledge
  • A medium for intelligent agents
    • Software agents can act more intelligently with a richer, knowledge-centric information stream
  • Statements that are not strict, objective facts
    • Personal facts, opinions, speculation, gossip, questions
    • "Creations" - text, graphics, images, audio, video [? Separate challenge?]
    • False statements
      • May be outright lies, deceptions, misunderstandings, misstatements, changed information
  • Medical record difficulties
    • Pen on paper still most convenient for input
    • Quick human scan of paper still most convenient for browsing
    • Input decision process still far too intrusive
    • Semi/un-structured data still far too inconvenient
  • Distributed resource storage
    • Extremely diversified storage to assure timely and efficient access
    • Needs to be part of net infrastructure that is automatic and not subject to human whim and error
  • Robust personal and organization identity, as well as roles and interests
  • Authority and provenance identification and tracking
  • The cost of knowledge engineering, especially maintenance and testing
    • Who can really afford it?
  • Semantic matching challenges
    • Apparent differences that are easily bridged by a human
    • Subtle or apparently insignificant distinctions that a human would say are too significant for a match
      • For example, improper reuse of a resource for a different "meaning"
    • Incomplete matches due to cultural differences
    • Concept matches but with differences in contractual commitments (for services)
  • Distributed semantic matching/mapping services
    • Manual creation of libraries of semantic "logic" services for bridging semantic gaps - hide the details for how to get from "A" to "B"
  • Support for time and version dimensions of information

-- Jack Krupansky

Monday, January 26, 2009

More lies!

In addition to my previous list of types of false statements, also include:

  • Scams
  • Confidence games
  • Honest disagreements
  • Ideological disagreements - each side firmly believes that the other is false
  • Definitional disagreements - how terms are interpreted by different parties
  • Temporary or transient truth - a statement may in general be false but happen to be true at the moment, or in general be true but happen to be false at the moment, or its truth may simply be volatile with or without any discernible pattern
  • Time delay (latency) since truth was validated - or multiple agents get different truth values due to differences in validation latencies
  • Information overload - too many statements to verify with available resources
  • Placeholder - a temporary statement of dubious truth but with the intent of replacing it with a proper statement in the future
  • Inadvertent - author has no reason to challenge the veracity of the statement, but may simply have failed to validate the statement
  • Accidental mistake - author knew truth but entered the correct statement improperly
  • Rumor
  • Gossip
  • Innuendo
  • Passthrough or cascaded or misguided/naive transitivity - author obtained truth from another party and passed the statement along as if true without further validation.
  • Mismeasurement - no intention to mislead, but measurement of source data was faulty

-- Jack Krupansky

That is a lie!

The heart of semantics is truth, the ability to examine a proposition and determine whether it is true or false. Sometimes we may not have enough information to determine whether a given statement or network of statements is true, but sometimes claims may simply not be true in an objective sense. False claims may be unintentional or intentional. Regardless, any semantic system or semantic agent needs to be able to make judgments as to the truth of statements and propositions.

Some of the ways in which even simple statements can be false are:

  • Outright lies
  • Deceptions that hide behind some legalism
  • Misleading by artful presentation of mostly truthful information
  • Honest mistakes
  • Simple misstatements
  • Subjective truth
  • Misunderstandings
  • Confusion
  • Misinterpretation
  • Incomplete information
  • Fuzzy statistical data
  • Changed information
  • Different points of view
  • Wishful thinking
  • Conjecture and speculation based on a weak foundation
  • Semantic mismatches, contextual mismatches - true in one system of reasoning, but not necessarily true in a different system of reasoning
  • Jokes and pranks
  • Hoaxes
  • Fraud
  • Madness of crowds
  • Emperor's New Clothes syndrome
  • Folklore
  • Political dogma
  • Exaggeration
  • Paradoxes
  • Poor estimation
  • Works of fiction
  • Dramatization
  • Hypotheticals
  • News reports - it may have been "said", or reported to have been said, but is it true?

Semantic data mining in particular needs to be able to classify statements as to their truth content, not simply whether a statement is believed to be true, but what form of untruth it might be.

Semantic agents need to be able to validate the veracity of claims that they encounter.

How to do all of this? Overall, unknown at the present time, but there are lots of special cases and plenty of room for heuristics.

Maybe even a heuristic could be considered a "lie" to some extent.

-- Jack Krupansky

Saturday, January 17, 2009

Semantic Web training videos

Marco Neumann of the New York Semantic Web Meetup suggests VideoLectures.net as a good source for Semantic Web training videos. In fact, here is a link to the results of searching that site for "Semantic Web": VideoLectures.net for "Semantic Web". The videos range from events, to invited talks, keynote presentations, lectures, panels, and tutorials.

I have not watched any of these videos in any detail to recommend them, but the one entitled Introduction and Overview to the Semantic Web by James A. Hendler sounds like a good starting point, or at least as a place to get the viewpoint of one of the original "biggies" of the Semantic Web.

There is also A short Tutorial on Semantic Web by York Sure and Invited Tutorial: An Introduction to the Semantic Web by Fabio Ciravegna.

I can't wait to watch some of these videos.

-- Jack Krupansky

Semantic Web books

I recently queried Marco Neumann of the New York Semantic Web Meetup to suggest a path for learning about the Semantic Web. In addition to recommending attending the New York Semantic Web Meetup, he has suggested several books:

I personally have not checked out these books, yet, other than looking at the blurbs on Amazon, but if Marco recommends them, they must be good.

Note: I do get a tiny commission from Amazon if you buy any of these books after clicking on any of the cover images or links above that redirect to Amazon. Thanks!

-- Jack Krupansky

Semantic Web training seminars

I asked Marco Neumann of the New York Semantic Web Meetup to suggest a path for learning about the Semantic Web. He has suggested several training seminars:

  • TopQuadrant (with Jim Hendler): Getting Ready for the Semantic Web with TopBraid Suite. Price: $1,795.
  • Stanford Protege team: Protege-OWL Short Course - provides an introduction to ontology development in OWL, both from a theoretical standpoint and from a practical standpoint through hands-on use of the Protege platform. The course also emphasizes how to use OWL ontologies, and other semantic technologies like SWRL, to build semantic applications with examples from real-world use cases. Price: $1,500.
  • Wilshire Conferences: Designing and Building Business Ontologies led by Dave McComb and Simon Robe - An Intensive 4-DAY SEMINAR with Workshops and Demonstrations, on Semantically Enabling the Enterprise - An ontology is a formal description of the meaning of the information stored in a system. It resembles a conceptual model, but goes much beyond a conceptual model in that the formal definitions allow the system to infer class membership based on properties. Additionally, inference engines, running on ontologies, allow users to extract and integrate information stored in distributed systems.
    This workshop, which will contain a number of live demos, will cover practical issues in employing ontologies. Price: $2,495.

-- Jack Krupansky

Semantic Web Meetup in New York City

Even though I moved to New York City back in May 2008, I only recently bothered to check for any Meetup group for the Semantic Web. It turns out that there are 14 Semantic Web Meetup groups around the world, in 12 cities in 4 countries with 1,897 members. The Semantic Web Meetup Web page is here and lets you click on a city or search by name.

The New York Semantic Web Meetup Web page is here and is run by Marco Neumann who is the principal of KONA. The meetup's charter is:

Meet local people interested in the Semantic Web, an initiative by the W3C [http://www.w3c.org] to make the web "one giant database": The Data Web. We address technologies such as RDF, RDFS, OWL and applications that help to develop or that use ontologies, controlled vocabularies and rules systems in the enterprise and on the World Wide Web.

The Meetup has 578 members and is meeting actively, with two meetups this past week and another scheduled in two weeks (SHER: A Scalable Highly Expressive Reasoner & Semantic Web at NYU at 6:30 p.m. on Thursday, January 29, 2009.)

The New York Semantic Web Meetup also has a wiki Web site.

Alas, the recent and coming meetups run into a conflict with my schedule. The good news is that presentation slides and blog posts are available. And, of course, the meetup has an email list that you can join, as I did.

-- Jack Krupansky

Facts, opinions, secrets, gossip, speculation, and questions in the Semantic Web

Whether one is mining text for embedded semantics or offering a structured interface for directly entering semantics, a user's information needs to be properly classified if it is to be used properly in the Semantic Web. Assuming one considers user input to be a sequence or collection or graph of statements, each statement would need to be classified as one or more of:

  • Fact. A statement that is believed to be true in some objective sense.
  • Opinion. A statement that the speaker believes is likely to be true, at least for themselves, regardless of the opinions of others.
  • Secret. A personal statement that is not intended to be shared with others, except possibly on a very selective basis.
  • Gossip. A statement about others that is intended to be shared to some extent, probably without attribution as to its originator.
  • Speculation. A statement that the speaker believes might or could hypothetically be true. It is not assumed to be true, but neither is it assumed to be false. The intention is to incite at least a subtle bias in the conjectural thinking of others.
  • Question. A purely interrogatory statement, a proposition whose truth or answer is essentially unknown, but whose answer is desired by the speaker.

It seems quite clear that a useful semantic mining tool would need to be able to classify its input stream according to these qualities.

On the other hand, a tool may simply categorize to the extent that it can and correlation between similar statements from multiple sources might reveal or suggest the proper, likely, or possible classification.

There should probably be an unknown category as well.

My original motivation in coming up with this classification scheme was to think about how a user interface might assist even average users in capturing at least some aspects of the semantics of their personal information at the time it is captured. For example, to offer the user some category headings that they can click on.

An interface tool could also show the user how other users have classified the same statement. That could be the default unless the user overrides with a desired classification.
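
As a rough illustration of the scheme above, here is a minimal Python sketch. The names `Kind`, `Statement`, and `suggested_default` are hypothetical, and picking the most common choice of other users is only one assumed way to compute a default; nothing here is tied to any existing Semantic Web tool.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class Kind(Enum):
    """The categories listed above, plus the 'unknown' catch-all."""
    FACT = "fact"
    OPINION = "opinion"
    SECRET = "secret"
    GOSSIP = "gossip"
    SPECULATION = "speculation"
    QUESTION = "question"
    UNKNOWN = "unknown"


@dataclass
class Statement:
    text: str
    # A statement may be classified as one or more kinds.
    kinds: set = field(default_factory=set)

    def effective_kinds(self):
        # Until something is assigned, the statement counts as UNKNOWN.
        return self.kinds or {Kind.UNKNOWN}


def suggested_default(other_users_choices):
    """Most common classification chosen by other users (assumed policy)."""
    if not other_users_choices:
        return Kind.UNKNOWN
    return Counter(other_users_choices).most_common(1)[0][0]


# The interface would offer this as the default, which the user may override.
default = suggested_default([Kind.OPINION, Kind.FACT, Kind.OPINION])
```

A user override would simply replace `default` before it is added to the statement's `kinds`.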

-- Jack Krupansky

Exploring New Interaction Designs Made Possible by the Semantic Web

The Journal of Web Semantics has issued a call for papers for a special issue on the topic of "Exploring New Interaction Designs Made Possible by the Semantic Web." They tell us that they:

... seek papers that look at the challenges and innovate possible solutions for everyday computer users to be able to produce, publish, integrate, represent and share, on demand, information from and to heterogeneous data sources. Challenges touch on interface designs to support end-user programming for discovery and manipulation of such sources, visualization and navigation approaches for capturing, gathering and displaying and annotating data from multiple sources, and user-oriented tools to support both data publication and data exchange. The common thread among accepted papers will be their focus on such user interaction designs/solutions oriented linked web of data challenges. Papers are expected to be motivated by a user focus and methods evaluated in terms of usability to support approaches pursued.

Offering some background, they inform us that:

The current personal computing paradigm of single applications with their associated data silos may finally be on its last legs as increasing numbers move their computing off the desktop and onto the Web. In this transition, we have a significant opportunity – and requirement – to reconsider how we design interactions that take advantage of this highly linked data system. Context of when, where, what, and whom, for instance, is increasingly available from mobile networked devices and is regularly if not automatically published to social information collectors like Facebook, LinkedIn, and Twitter. Intriguingly, little of the current rich sources of information are being harvested and integrated. The opportunities such information affords, however, as sources for compelling new applications would seem to be a goldmine of possibility. Imagine applications that, by looking at one's calendar on the net, and with awareness of whom one is with and where they are, can either confirm that a scheduled meeting is taking place, or log the current meeting as a new entry for reference later. Likewise, documents shared by these participants could automatically be retrieved and available in the background for rapid access. Furthermore, on the social side, mapping current location and shared interests between participants may also recommend a new nearby location for coffee or an art exhibition that may otherwise have been missed. Larger social applications may enable not only the movement of seasonal ills like colds or flus to be tracked, but more serious outbreaks to be isolated. The above examples may be considered opportunities for more proactive personal information management applications that, by awareness of context information, can better automatically support a person's goals. In an increasingly data rich environment, the tasks may themselves change. 
We have seen how mashups have made everything from house hunting to understanding correlations between location and government funding more rapidly accessible. If, rather than being dependent upon interested programmers to create these interactive representations, we simply had access to the semantic data from a variety of publishers, and the widgets to represent the data, then we could create our own on-demand mashups to explore heterogeneous data in any way we chose. For each of these types of applications, interaction with information -- be it personal, social or public -- provides richer, faster, and potentially lighter-touch ways to build knowledge than our current interaction metaphors allow.

Finally, they pose their crucial question:

What is the bottleneck to achieving these enriched forms of interaction?

For which they propose the answer:

Fundamentally, we see the main bottleneck as a lack of tools for easy data capture, publication, representation and manipulation.

They provide a list of challenges to be addressed in the issue, including but not restricted to:

  • approaches to support integrating data that is readily published, such as RSS feeds that are only lightly structured.
  • approaches to apply behaviors to these data sources.
  • approaches to make it as easy for someone to create and to publish structured data as it is to publish a blog.
  • approaches to support easy selection of items within resources for export into structured semantic forms like RDF.
  • facilities to support the pulling in of multiple sources; for instance, a person may wish to pull together data from three organizations. Where will they gather this data? What tools will be available to explore the various sources, align them where necessary and enable multiple visualizations to be explored?
  • methods to support fluidity and acceleration for each of the above: lowering the interaction cost for gathering data sources, exploring them and presenting them; designing lightweight and rapid techniques.
  • novel input mechanisms: most structured data capture requires the use of forms. The cost of form input can inhibit that data from being captured or shared. How can we reduce the barrier to data capture?
  • evaluation methods: how do we evaluate the degree to which these new approaches are effective, useful or empowering for knowledge builders?
  • user analysis and design methods: how do we understand context and goals at every stage of the design process? What is different about designing for a highly personal, contextual, and linked environment?

In addition to traditional, full-length papers, they are also soliciting shorter papers as well as one to two page short, forward-looking more speculative papers addressing the challenges outlined above. I am tempted to submit one of the latter, possibly based on my proposal for The Consumer-Centric Knowledge Web - A Vision of Consumer Applications of Software Agent Technology - Enabling Consumer-Centric Knowledge-Based Computing. Or, maybe a stripped-down version of that vision that is more in line with the "reach" of the current, RDF-based vision of the Semantic Web.

-- Jack Krupansky

Wednesday, January 14, 2009

Text for the Semantic Web

I am having second thoughts as to whether the text for terms, definitions, descriptions, and other text belongs in RDF or should be stored externally. I am not convinced one way or the other.

On the one hand, storing text externally could make it more manageable with traditional text tools, and even searchable using traditional text-oriented search engines.

On the other hand, storing all of that text separately increases the number of resources and may be more unmanageable than embedding the text directly in RDF.

One hybrid approach would be to store the "source" for the text in traditional text documents or simpler XML files, with labels, and then have a processing step that takes an intermediate form of RDF that has the labels and substitutes the associated text. This processing might in fact simply be done using XSLT.
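
To make the hybrid idea concrete, here is a small sketch of the substitution step. It uses Python's `string.Template` rather than XSLT, purely for illustration. The label names, the `texts` dictionary standing in for the external text files, and the `expand` helper are all assumptions of mine, and the intermediate form is written as Turtle to keep it short.

```python
from string import Template

# Hypothetical external text store (label -> text); in practice this
# would be loaded from plain text documents or simple XML files.
texts = {
    "abc_label": "Agent-Based Computing",
    "abc_definition": 'Computing built around software "agents".',
}

# Intermediate Turtle that carries only labels, not the text itself.
intermediate = Template(
    'ac:agent_based_computing skos:prefLabel "$abc_label"@en;\n'
    '  skos:definition "$abc_definition"@en.\n'
)


def escape_turtle(text):
    # Escape backslashes and double quotes so the text is safe inside a
    # short Turtle string literal (newlines are not handled here).
    return text.replace("\\", "\\\\").replace('"', '\\"')


def expand(template, texts):
    # Substitute every label with its escaped text; a label missing
    # from the store raises KeyError.
    return template.substitute({k: escape_turtle(v) for k, v in texts.items()})


final_turtle = expand(intermediate, texts)
```

The text stays editable and searchable with ordinary tools, and the generated output is the only place where text and structure are combined.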

Ultimately, I might simply prefer the "simplest" approach, but sometimes simplicity is not the cheapest or most flexible and maintainable approach.

The Semantic Web is still in its infancy and its techniques and tools are still evolving, so the ones in vogue today may not be the preferred approach in the future.

-- Jack Krupansky

Tuesday, December 30, 2008

SKOS and term definitions

I had been considering the Simple Knowledge Organization System (SKOS) for representing term definitions that I currently write on HTML Web pages, but after reading about it, trying it, and reading the discussion on the SKOS mail list, I have to reconsider. SKOS is designed to publish knowledge hierarchies or thesauri, but actually is not designed to be the underlying model for them. Most of my terms are distinct and un-organized, too new and fluid to be nailed down into a fixed hierarchy. Yes, I can use SKOS to publish them, but I would be better off finding a Semantic Web scheme for actually representing them at the source.

I want to have some way to create a Semantic Web resource simply to be the semantic anchor for a concept with the possibility that there may be many terms that define that single concept and there may be many definitions for each of those terms and there may be many authors and sources for those definitions. Rather than creating a single SKOS hierarchy, I wish to work on concepts, terms, and definitions as independent entities. Yes, maybe they can be collected into an SKOS list or hierarchy eventually, but that is not my initial or even my ultimate goal.

So, these are the four concepts I am looking at:

  1. A concept "anchor" resource. May also refer to another anchored concept to refine it or to combine multiple concepts. Some "concepts" may in fact be flagged as being domains, and concepts and domains can be linked.
  2. A term text resource. Refers to an anchored concept. May also link to a domain if no anchored concept exists yet. May also be a "link" term which simply links to another term but with some changes to be applied.
  3. A term definition text resource. Refers to a term text resource, or even possibly multiple terms such as synonyms.
  4. A glossary resource. A list of term references.

Although a single term requires all of the first three (to be complete), there is no need or requirement for all three to be colocated or designed together. Terms for a minimal glossary may simply have term text and definition, without a concept resource for the term.

Term definitions can either be anchored to a specific term text resource or refer to the term by its text and some context anchor (e.g., domain.) This supports terms which have distinct definitions in distinct contexts.
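
The four kinds of resources could be sketched roughly as follows. This is only an illustration in Python, not a proposed RDF vocabulary. All class names, field names, and the example.org URIs are hypothetical, and "link" terms are left out to keep it short.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Concept:
    uri: str                      # the semantic "anchor"
    is_domain: bool = False       # some concepts are flagged as domains
    refines: List["Concept"] = field(default_factory=list)


@dataclass
class Term:
    text: str
    concept: Optional[Concept] = None   # anchored concept, if one exists yet
    domain: Optional[Concept] = None    # fallback link to a domain


@dataclass
class Definition:
    text: str
    terms: List[Term] = field(default_factory=list)  # anchored terms (e.g., synonyms)
    term_text: Optional[str] = None     # or: refer to the term by its text ...
    context: Optional[Concept] = None   # ... plus a context anchor
    author: Optional[str] = None


@dataclass
class Glossary:
    terms: List[Term] = field(default_factory=list)


def definitions_for(term_text, domain, definitions):
    """Definitions of term_text that apply within the given domain."""
    matches = []
    for d in definitions:
        anchored = any(t.text == term_text and t.domain == domain for t in d.terms)
        by_text = d.term_text == term_text and d.context == domain
        if anchored or by_text:
            matches.append(d)
    return matches


computing = Concept("http://example.org/domain/computing", is_domain=True)
real_estate = Concept("http://example.org/domain/real-estate", is_domain=True)

agent = Term("agent", domain=computing)
definitions = [
    Definition("A program that acts on behalf of a user.", terms=[agent]),
    Definition("A person licensed to arrange property sales.",
               term_text="agent", context=real_estate),
]
```

The same term text, "agent", resolves to a different definition depending on the context passed to `definitions_for`, and neither definition needed a concept anchor to exist first.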

-- Jack Krupansky

Sunday, December 28, 2008

Concepts, islands, and archipelagos

The heart of a true semantic web is the concept or a collection of concepts. The heart of the Semantic Web is the resource which is represented by a URI. A resource can be anything. It can be a document on the Web, a reference to a physical real-world object or phenomenon, or even an idea or abstract concept. The Semantic Web itself does not recognize concepts per se, only resources that users associate with concepts in their minds. So, if we want to refer to a concept in the Semantic Web, we need to assign it a resource and URI.

A given application domain, such as astronomy or auto repair or health care, would encompass a collection of concepts. The users of a given domain would need to agree on the terms to be used for the concepts and how they are mapped to resources and URIs. In other words, for a given application domain, the users share knowledge of the resource URI to be used for each concept in that domain.

Different domains may or may not have different users and they may or may not have different concepts, but the users of different domains are free to assign terms and concepts and resources and URIs differently than how they are assigned in other domains. Sometimes concepts in different domains will be distinct and separate and sometimes they will overlap or even be virtually identical.

Different organizations or groups may also have their own distinct concept and resource mappings for a given domain, so that there may be multiple mappings for the same domain concepts to different URIs.

There is no requirement that all concepts in an application domain be present in a given concept and resource mapping. Sometimes only a subset is needed. Sometimes it might be too impractical to represent all concepts.

Granted, there are clearly benefits to agreeing to share concept and resource mappings for each domain, but there are sometimes benefits to having the freedom to exercise full control over the mapping.

Ultimately, each concept and resource mapping for a given domain is essentially an island in this "sea" called the Semantic Web. If coherent and constructed well, call it an island of excellence. Anyone can visit and utilize the resources of a given island, but only to the extent that they agree to accept the concept and resource mappings.

Each island is a land to itself, but sometimes it makes sense for two or more islands to interact and define and make use of shared concepts and resources. These distinct islands may not share and agree on all of their concepts and resource mappings, but enough to make a collaboration of some sort worthwhile. We can think of these collaborating islands as an archipelago.

There may be many archipelagos in the "sea" of the Semantic Web. An unlimited number of them may also choose to share subsets of concept and resource mappings. Sometimes it will make perfect sense to have very large archipelagos, while at other times smaller island groups or even single islands may make perfect sense. Sharing concepts and resource mappings can present many valuable opportunities, but sometimes sharing can be a significant burden or maybe not even be practical at all.

But unless each island is truly "excellent", connecting them together in a network would be futile.

In short, constructing a Semantic Web means carefully mapping concepts to resources, constructing domain islands of excellence, and then interconnecting those domain islands of excellence so that collaboration is enabled and empowered.
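
One way to picture islands and archipelagos is as plain concept-to-URI tables. The sketch below is only an illustration under that assumption. The two astronomy "islands", their example.org URIs, and the helper names are all made up.

```python
# Each island is one group's mapping of domain concepts to resource URIs.
island_a = {
    "star": "http://example.org/astro-a/star",
    "planet": "http://example.org/astro-a/planet",
    "comet": "http://example.org/astro-a/comet",
}
island_b = {
    "star": "http://example.org/astro-b/Star",      # same concept, different URI
    "planet": "http://example.org/astro-a/planet",  # reuses island A's URI
}


def shared_mappings(*islands):
    """Concepts that every island maps to the same URI (the archipelago's common ground)."""
    first, *rest = islands
    return {
        concept: uri
        for concept, uri in first.items()
        if all(other.get(concept) == uri for other in rest)
    }


def conflicting_concepts(a, b):
    """Concepts both islands define but map to different URIs."""
    return sorted(c for c in a.keys() & b.keys() if a[c] != b[c])


common = shared_mappings(island_a, island_b)
conflicts = conflicting_concepts(island_a, island_b)
```

Here the two islands can collaborate on "planet" immediately, while "star" would need a bridging agreement and "comet" remains local to island A.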

-- Jack Krupansky

Monday, December 8, 2008

XML Schema definition language

I am biting the bullet and diving deep to get a full handle on the XML Schema definition language which defines the rules for constructing rules for how XML documents are constructed. In other words, if you want to define a new XML document format, you need to construct a schema for it. If you want to construct a schema, you need to understand the rules for XML Schema. There are in fact tools to simplify the task of constructing XML schemas, but for my purposes I really do need to become proficient, if not expert, in the finest level of details of XML schemas.

I am starting with the primer, XML Schema Part 0: Primer Second Edition, W3C Recommendation 28 October 2004. This is very dense material, but the primer includes lots of examples.

-- Jack Krupansky

Thursday, October 30, 2008

WordNet - a lexical database for the English language

Given my interest in glossaries and concept dictionaries, I am intrigued by WordNet from the Princeton Cognitive Science Laboratory, which bills itself as "a lexical database for the English language." The web site says:

WordNet(R) is a large lexical database of English, developed under the direction of George A. Miller. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet's structure makes it a useful tool for computational linguistics and natural language processing.

I have not dug into this enough to get a handle on whether or how they utilize RDF, but the effort does certainly seem quite interesting.

-- Jack Krupansky

Monday, October 20, 2008

RDFa - W3C recommendation for adding RDF annotations to HTML documents

The folks over at W3C have been working on a new scheme that allows RDF-like annotations to be added to HTML Web pages. W3C just announced that RDFa is now a full-fledged "Recommendation" (W3C standard) and has an updated Primer. Actually, the annotations are for XHTML documents. According to the RDFa Primer:

Today's web is built predominantly for human consumption. Even as machine-readable data begins to appear on the web, it is typically distributed in a separate file, with a separate format, and very limited correspondence between the human and machine versions. As a result, web browsers can provide only minimal assistance to humans in parsing and processing web data: browsers only see presentation information. We introduce RDFa, which provides a set of XHTML attributes to augment visual data with machine-readable hints. We show how to express simple and more complex datasets using RDFa, and in particular how to turn the existing human-visible text and links into machine-readable data without repeating content.

...

The web is a rich, distributed repository of interconnected information organized primarily for human consumption. On a typical web page, an XHTML author might specify a headline, then a smaller sub-headline, a block of italicized text, a few paragraphs of average-size text, and, finally, a few single-word links. Web browsers will follow these presentation instructions faithfully. However, only the human mind understands that the headline is, in fact, the blog post title, the sub-headline indicates the author, the italicized text is the article's publication date, and the single-word links are categorization labels. The gap between what programs and humans understand is large.

What if the browser received information on the meaning of a web page's visual elements? A dinner party announced on a blog could be easily copied to the user's calendar, an author's complete contact information to the user's address book. Users could automatically recall previously browsed articles according to categorization labels (often called tags). A photo copied and pasted from a web site to a school report would carry with it a link back to the photographer, giving her proper credit. When web data meant for humans is augmented with hints meant for computer programs, these programs become significantly more helpful, because they begin to understand the data's structure.

RDFa allows XHTML authors to do just that. Using a few simple XHTML attributes, authors can mark up human-readable data with machine-readable indicators for browsers and other programs to interpret. A web page can include markup for items as simple as the title of an article, or as complex as a user's complete social network.

RDFa benefits from the extensive power of RDF [RDF], the W3C's standard for interoperable machine-readable data. However, readers of this document are not expected to understand RDF. Readers are expected to understand at least a basic level of XHTML.

I personally have not studied RDFa yet, but I have a strong suspicion that it may be relevant to my interests.

OTOH, it may simply represent a steppingstone on the path to better things.

-- Jack Krupansky

Saturday, October 11, 2008

Hypocorism - pet name or term of endearment (or lack thereof)

Courtesy of the Merriam-Webster Word of the Day, I just learned that hypocorism is a linguist's term for pet names, including "baby talk", and endearing terms. I would include nicknames and terms of lack of endearment. Traditionally that is in the personal sense, but pet names and nicknames are all too common in computing technology and other fields. Common enough that glossaries and term definitions should include them to be able to more completely capture all references to an entity, concept, term, or topic. I would include jargon, techspeak, euphemisms, and "hacker slang" as well. A key criterion is that the term have relatively widespread usage as opposed to being used by only a very small and unknown group or single individual for their local environment.

Some common examples from computing:

  • Big Blue - IBM
  • Redmond - Microsoft
  • Mr. Softie - Microsoft
  • Microsloth - Microsoft
  • Windoze - Microsoft Windows
  • net - Internet and sometimes World Wide Web
  • web - World Wide Web
  • PC - personal computer that primarily runs the Windows operating system
  • Mac - personal computer that primarily runs the Apple Macintosh operating system
  • app - software application
  • Googleplex - the headquarters of Google
  • bare metal - a computer without an operating system
  • bit bucket - mythical destination of and euphemism for data that has been lost and destroyed

In fact, it might be interesting or amusing to have a glossary consisting only of hypocorisms in computing. There is in fact something called The Jargon Lexicon, but it is a more narrow collection of terms used primarily by "hackers."

I am not yet comfortable with using this relatively unknown 10-gallon term for such a simple concept. For now, I may stick with nickname as my preferred sobriquet for hypocorism.

Note that there should be a semantic distinction between alternate names, synonyms, acronyms, and nicknames. Another key aspect of a nickname is that its usage is quite informal.

-- Jack Krupansky

Wednesday, October 8, 2008

Simple acronyms in SKOS Turtle and RDF

It wasn't hard at all to convert my simple acronym experiment to SKOS Turtle and then use the rdf:about Validator and Converter to generate the equivalent RDF. I spent more effort formatting the "code" for this blog post!

For example, my pure XML acronym for Agent-Based Computing (ABC) was:

<Acronym>
<Term>ABC</Term>
  <CompoundTerms>
   <CompoundTerm>Agent-Based Computing</CompoundTerm>
  </CompoundTerms>
</Acronym>

And in Turtle that is:

ac:agent_based_computing rdf:type skos:Concept;
  skos:prefLabel "Agent-Based Computing"@en;
  skos:altLabel "ABC"@en.

And the XML/RDF created by the validator is:

<skos:Concept rdf:about="http://agtivity.com/xml/agent_based_computing">
  <skos:prefLabel xml:lang="en">Agent-Based Computing</skos:prefLabel>
  <skos:altLabel xml:lang="en">ABC</skos:altLabel>
</skos:Concept>

Alas, each of the three RSS definitions is a distinct SKOS concept, but that does make some sense since each of the three meanings is somewhat distinct even though they are all under the same umbrella concept. I will have to think about what it might mean to have the acronym itself be a distinct concept. Actually, there was some discussion of semantic relationships and acronyms in the primer.
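For what it is worth, the hand conversion could probably be scripted. Here is a rough Python sketch, not the procedure used for this post: `local_name` and `acronym_to_turtle` are hypothetical helpers, the "ac" prefix is assumed to be declared elsewhere, and no escaping of quotes inside labels is attempted. It emits one SKOS concept per compound term, which is why an acronym like RSS ends up as three distinct concepts.

```python
import xml.etree.ElementTree as ET

ACRONYM_XML = """
<Acronym>
  <Term>RSS</Term>
  <CompoundTerms>
    <CompoundTerm>Really Simple Syndication</CompoundTerm>
    <CompoundTerm>Rich Site Summary</CompoundTerm>
    <CompoundTerm>RDF Site Summary</CompoundTerm>
  </CompoundTerms>
</Acronym>
"""

def local_name(compound_term):
    # "Agent-Based Computing" -> "agent_based_computing"
    return compound_term.lower().replace("-", "_").replace(" ", "_")

def acronym_to_turtle(xml_text, prefix="ac"):
    """Emit one skos:Concept per compound term, with the acronym as altLabel."""
    acronym = ET.fromstring(xml_text)
    term = acronym.findtext("Term")
    blocks = []
    for compound in acronym.iter("CompoundTerm"):
        blocks.append(
            f"{prefix}:{local_name(compound.text)} rdf:type skos:Concept;\n"
            f'  skos:prefLabel "{compound.text}"@en;\n'
            f'  skos:altLabel "{term}"@en.'
        )
    return "\n".join(blocks)

turtle = acronym_to_turtle(ACRONYM_XML)
print(turtle)
```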

I have it online at http://agtivity.com/xml/acronym5.txt:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix skos: <http://www.w3.org/2008/05/skos#>.
@prefix ac: <http://agtivity.com/xml/>.
 
ac:agent_based_computing rdf:type skos:Concept;
skos:prefLabel "Agent-Based Computing"@en;
skos:altLabel "ABC"@en.
ac:resource_description_framework rdf:type skos:Concept;
skos:prefLabel "Resource Description Framework"@en;
skos:altLabel "RDF"@en.
ac:really_simple_syndication rdf:type skos:Concept;
skos:prefLabel "Really Simple Syndication"@en;
skos:altLabel "RSS"@en.
ac:rich_site_summary rdf:type skos:Concept;
skos:prefLabel "Rich Site Summary"@en;
skos:altLabel "RSS"@en.
ac:rdf_site_summary rdf:type skos:Concept;
skos:prefLabel "RDF Site Summary"@en;
skos:altLabel "RSS"@en.

The XML/RDF generated by the validator is online at http://agtivity.com/xml/acronym5.rdf:

<?xml version="1.0"?>
<rdf:RDF
    xmlns:ac="http://agtivity.com/xml/"
    xmlns:skos="http://www.w3.org/2008/05/skos#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <skos:Concept rdf:about="http://agtivity.com/xml/agent_based_computing">
    <skos:prefLabel xml:lang="en">Agent-Based Computing</skos:prefLabel>
    <skos:altLabel xml:lang="en">ABC</skos:altLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://agtivity.com/xml/resource_description_framework">
    <skos:prefLabel xml:lang="en">Resource Description Framework</skos:prefLabel>
    <skos:altLabel xml:lang="en">RDF</skos:altLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://agtivity.com/xml/really_simple_syndication">
    <skos:prefLabel xml:lang="en">Really Simple Syndication</skos:prefLabel>
    <skos:altLabel xml:lang="en">RSS</skos:altLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://agtivity.com/xml/rich_site_summary">
    <skos:prefLabel xml:lang="en">Rich Site Summary</skos:prefLabel>
    <skos:altLabel xml:lang="en">RSS</skos:altLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://agtivity.com/xml/rdf_site_summary">
    <skos:prefLabel xml:lang="en">RDF Site Summary</skos:prefLabel>
    <skos:altLabel xml:lang="en">RSS</skos:altLabel>
  </skos:Concept>
</rdf:RDF>
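To look up an acronym in this kind of RDF/XML, one has to go from the alternate label back to the concepts that carry it. A small Python sketch of that reverse lookup, using only the standard library, might look like the following; `concepts_by_alt_label` is a hypothetical helper and the embedded sample is a trimmed copy of the file above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SKOS = "http://www.w3.org/2008/05/skos#"

RDF_XML = """
<rdf:RDF
    xmlns:skos="http://www.w3.org/2008/05/skos#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <skos:Concept rdf:about="http://agtivity.com/xml/resource_description_framework">
    <skos:prefLabel xml:lang="en">Resource Description Framework</skos:prefLabel>
    <skos:altLabel xml:lang="en">RDF</skos:altLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://agtivity.com/xml/really_simple_syndication">
    <skos:prefLabel xml:lang="en">Really Simple Syndication</skos:prefLabel>
    <skos:altLabel xml:lang="en">RSS</skos:altLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://agtivity.com/xml/rich_site_summary">
    <skos:prefLabel xml:lang="en">Rich Site Summary</skos:prefLabel>
    <skos:altLabel xml:lang="en">RSS</skos:altLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://agtivity.com/xml/rdf_site_summary">
    <skos:prefLabel xml:lang="en">RDF Site Summary</skos:prefLabel>
    <skos:altLabel xml:lang="en">RSS</skos:altLabel>
  </skos:Concept>
</rdf:RDF>
"""

def concepts_by_alt_label(rdf_xml):
    """Group preferred labels by the alternate label (here, the acronym)."""
    root = ET.fromstring(rdf_xml)
    by_alt = defaultdict(list)
    for concept in root.iter(f"{{{SKOS}}}Concept"):
        pref = concept.findtext(f"{{{SKOS}}}prefLabel")
        for alt in concept.findall(f"{{{SKOS}}}altLabel"):
            by_alt[alt.text].append(pref)
    return dict(by_alt)

lookup = concepts_by_alt_label(RDF_XML)
print(lookup["RSS"])
```

A real application would more likely use an RDF library than raw XML parsing, since the same graph can be serialized as RDF/XML in several different shapes.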

-- Jack Krupansky

Converting Turtle to RDF/XML

I wanted to experiment with converting my little test acronym file to Turtle or RDF and looked around and found a tool, RDF Validator and Converter on the rdf:about Web site run by Joshua Tauberer, that can in fact convert Turtle to raw XML/RDF. I tried it out with that simple Turtle example from my last blog post:

ex:animals rdf:type skos:Concept;
skos:prefLabel "animals".

But the validator complained that the "ex:" prefix was undefined. I went back to the SKOS Primer, and found that you define the prefixes with "@prefix", and you need that for the "ex:", "rdf:", and "skos:" prefixes, like so:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix skos: <http://www.w3.org/2008/05/skos#>.
@prefix ex:<http://www.example.com/>.


ex:animals rdf:type skos:Concept;
skos:prefLabel "animals".

The validator accepted that and informs me that the equivalent XML/RDF is:

<?xml version="1.0"?>
<rdf:RDF
xmlns:skos="http://www.w3.org/2008/05/skos#"
xmlns:ex="http://www.example.com/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
 <skos:Concept rdf:about="http://www.example.com/animals">
  <skos:prefLabel>animals</skos:prefLabel>
 </skos:Concept>
</rdf:RDF>
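The undefined-prefix complaint is the sort of thing a quick pre-check can catch before pasting Turtle into a validator. What follows is a deliberately naive Python sketch, not a Turtle parser: it assumes one "@prefix" directive per line and simple double-quoted strings, and `undeclared_prefixes` is a hypothetical helper name.

```python
import re

PREFIX_NAME = r"[A-Za-z][\w-]*"

def undeclared_prefixes(turtle):
    """Return prefixes that are used in the text but never declared with @prefix."""
    declared = set(re.findall(rf"@prefix\s+({PREFIX_NAME}):", turtle))
    # Remove directives, <IRIs> and "strings" so that "http:" and the like are not counted.
    body = re.sub(r"@prefix[^\n]*", " ", turtle)
    body = re.sub(r"<[^>]*>", " ", body)
    body = re.sub(r'"[^"]*"', " ", body)
    used = set(re.findall(rf"(?<![\w@-])({PREFIX_NAME}):", body))
    return sorted(used - declared)

SNIPPET = 'ex:animals rdf:type skos:Concept;\n  skos:prefLabel "animals".'
PREFIXES = (
    "@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.\n"
    "@prefix skos: <http://www.w3.org/2008/05/skos#>.\n"
    "@prefix ex:<http://www.example.com/>.\n"
)

print(undeclared_prefixes(SNIPPET))
print(undeclared_prefixes(PREFIXES + SNIPPET))
```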

So, the core XML/RDF for that one SKOS concept is:

 <skos:Concept rdf:about="http://www.example.com/animals">
  <skos:prefLabel>animals</skos:prefLabel>
 </skos:Concept>

The validator also tells me that there are two underlying triples here, one for the "Concept" and one for the "prefLabel":

<http://www.example.com/animals> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2008/05/skos#Concept> .
<http://www.example.com/animals> <http://www.w3.org/2008/05/skos#prefLabel> "animals" .

So, it actually is not so difficult, once somebody gives you a bunch of the clues.
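To see where the two triples come from, it helps to derive them by hand from the RDF/XML. This Python sketch does so for the simplest case only, assuming typed node elements with an rdf:about attribute and literal-valued property elements; `expand` and `triples` are hypothetical helpers, and a real RDF parser handles far more than this.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_XML = """
<rdf:RDF
    xmlns:skos="http://www.w3.org/2008/05/skos#"
    xmlns:ex="http://www.example.com/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <skos:Concept rdf:about="http://www.example.com/animals">
    <skos:prefLabel>animals</skos:prefLabel>
  </skos:Concept>
</rdf:RDF>
"""

def expand(tag):
    # ElementTree spells qualified names as "{namespace}local"; the URI is namespace + local.
    namespace, local = tag[1:].split("}")
    return namespace + local

def triples(rdf_xml):
    """Return (subject, predicate, object) tuples for simple typed nodes."""
    result = []
    for node in ET.fromstring(rdf_xml):
        subject = node.get(f"{{{RDF_NS}}}about")
        # A typed node element such as <skos:Concept> implies an rdf:type triple.
        result.append((subject, RDF_NS + "type", expand(node.tag)))
        for prop in node:
            result.append((subject, expand(prop.tag), prop.text))
    return result

for triple in triples(RDF_XML):
    print(triple)
```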

Now on to converting my raw XML acronym test into Turtle.

-- Jack Krupansky

Reading the Primer for Simple Knowledge Organisation Systems (SKOS)

As background research for my little project to represent glossaries of terms and acronyms, I have started reading (studying) the primer for the W3C Simple Knowledge Organisation Systems (SKOS). The primer introduces SKOS by saying:

SKOS -- Simple Knowledge Organization System -- provides a model for expressing the basic structure and content of concept schemes such as thesauri, classification schemes, subject heading lists, taxonomies, folksonomies, and other similar types of controlled vocabulary. As an application of the Resource Description Framework (RDF), SKOS allows concepts to be composed and published on the World Wide Web, linked with data on the Web and integrated into other concept schemes.

This document is a user guide for those who would like to represent their concept scheme using SKOS.

In basic SKOS, conceptual resources (concepts) are identified with URIs, labelled with strings in one or more natural languages, documented with various types of note, semantically related to each other in informal hierarchies and association networks, and aggregated into concept schemes.

In advanced SKOS, conceptual resources can be mapped across concept schemes and grouped into labelled or ordered collections. Relationships between concept labels can be specified. Finally, the SKOS vocabulary itself can be extended to suit the needs of particular communities of practice or combined with other modelling vocabularies.

This document is a companion to the SKOS Reference, which gives the normative reference on SKOS.

In W3C terminology, a document is "normative" when it describes features in terms of what is absolutely required, what is optional, and what absolutely must not be done, while a document is "nonnormative" (or merely "descriptive") when it more casually describes or explains or gives examples for features without necessarily giving the absolute requirements and full details.

The primer tells us that:

The Simple Knowledge Organization System (SKOS) is an RDF vocabulary for representing semi-formal knowledge organization systems (KOS), such as thesauri, taxonomies, classification schemes and subject heading lists. Because SKOS is based on the Resource Description Framework (RDF) [RDF-PRIMER] these representations are machine-readable and can be exchanged between software applications and published on the World Wide Web.

SKOS has been designed to provide a low-cost migration path for porting existing organization systems to the Semantic Web. SKOS also provides a lightweight, intuitive conceptual modeling language for developing and sharing new knowledge organization systems (KOSs). It can be used on its own, or in combination with more formal languages like the Web Ontology Language (OWL) [OWL]. SKOS can also be seen as a bridging technology, providing the missing link between the rigorous logical formalism of ontology languages such as OWL and the chaotic, informal and weakly-structured world of Web-based collaboration tools, as exemplified by social tagging applications.

The aim of SKOS is not to replace original conceptual vocabularies in their initial context of use, but to allow them to be ported to a shared space, based on a simplified model, enabling wider re-use and better interoperability.

In SKOS terminology, my little acronym project would be a concept scheme. A glossary is an example of a concept scheme. An individual acronym or term would be a concept in SKOS.

The starting point of SKOS is the concept which can have one or more labels as well as documentary notes. Then you can add semantic relationships between concepts.

Just to get started, here is a simple SKOS concept written in what is called TURTLE notation (not pure XML):

ex:animals rdf:type skos:Concept;
  skos:prefLabel "animals".

This defines a concept named animals that has a preferred label of "animals." The "ex:" is simply a shorthand notation for referring to a name space where XML/RDF terms are defined.

Extending that concept a little to allow for various natural languages for the same concept:

ex:animals rdf:type skos:Concept;
  skos:prefLabel "animals"@en;
  skos:prefLabel "animaux"@fr.

Extending a little more to allow synonyms:

ex:animals rdf:type skos:Concept;
  skos:prefLabel "animals"@en;
  skos:altLabel "creatures"@en;
  skos:prefLabel "animaux"@fr;
  skos:altLabel "créatures"@fr.

SKOS is also designed to support "near-synonyms", abbreviations, and acronyms. For example:

ex:fao rdf:type skos:Concept;
  skos:prefLabel "Food and Agriculture Organization"@en;
  skos:altLabel "FAO"@en.

I am not completely happy with the idea that SKOS does not distinguish between the various forms of alternate labels, and in particular with representing acronyms.

SKOS supports notes for adding documentation, for example:

ex:documentation skos:definition
  "the process of storing and retrieving information in all fields of knowledge"@en.

A better example of a definition for a concept in SKOS:

ex:pineapples rdf:type skos:Concept;
  skos:prefLabel "pineapples"@en;
  skos:prefLabel "ananas"@fr;
  skos:definition "The fruit of plants of the family Bromeliaceae"@en;
  skos:definition "Le fruit de la plante herbacée de la famille des broméliacées"@fr.

A simple example of defining a group of concepts as a concept scheme:

ex:animalThesaurus rdf:type skos:ConceptScheme;
  dc:title "Simple animal thesaurus";
  dc:creator ex:antoineIsaac.

ex:mammals rdf:type skos:Concept;
  skos:inScheme ex:animalThesaurus.
ex:cows rdf:type skos:Concept;
  skos:broader ex:mammals;
  skos:inScheme ex:animalThesaurus.
ex:fish rdf:type skos:Concept;
  skos:inScheme ex:animalThesaurus.

That example also illustrates how a concept such as ex:cows can be defined as being a member of a broader concept such as ex:mammals.
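A program that wants every ancestor of a concept, not just its direct parent, has to follow the broader links itself. Here is a minimal Python sketch of that walk; the `BROADER` table and the `broader_transitive` function are hypothetical, and the link from ex:mammals to ex:animals is my own addition for illustration rather than something in the primer example.

```python
# Direct skos:broader links, keyed by concept. A concept may have several parents.
BROADER = {
    "ex:cows": ["ex:mammals"],
    "ex:mammals": ["ex:animals"],  # hypothetical extra link, not in the primer excerpt
}

def broader_transitive(concept, broader=BROADER):
    """Return all ancestors of a concept, nearest first, following broader links."""
    seen = []
    queue = list(broader.get(concept, []))
    while queue:
        current = queue.pop(0)
        if current not in seen:  # also guards against accidental cycles
            seen.append(current)
            queue.extend(broader.get(current, []))
    return seen

print(broader_transitive("ex:cows"))
print(broader_transitive("ex:fish"))
```

As I understand the SKOS Reference, skos:broader itself is not declared transitive, and a separate property, skos:broaderTransitive, exists for exactly this kind of inferred ancestry.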

-- Jack Krupansky

Friday, October 3, 2008

uBio - Universal Biological Indexer and Organizer

I just heard about a biology taxonomy called uBio:

About uBio project

uBio is an initiative within the science library community to join international efforts to create and utilize a comprehensive and collaborative catalog of known names of all living (and once-living) organisms. The Taxonomic Name Server (TNS) catalogs names and classifications to enable tools that can help users find information on living things using any of the names that may be related to an organism.

I have not investigated it closely to determine whether it is simply a Web Service (SOAP interface) or whether it is also available on the Semantic Web (RDF).

More about what uBio does:

Information about organisms is often linked to a name.

This can create problems in information retrieval because:

uBio is working on tools for providers of biological information that address these problems.

The uBio Taxonomic Name Server acts as a name thesaurus.

Names have many different classes of relationships that can be used to organize and retrieve information that is annotated with names. These classes are divided into two inter-connected services.

NameBank is a repository of millions of recorded biological names and facts that link those names together. [more]

ClassificationBank stores multiple classifications and taxonomic concepts that are the result of expert opinions. It extends the functionality of NameBank. [more]

All data within these components are linked to mechanisms that provide credit and attribution to experts who provide name and linkage information within the TNS. [more]

Lastly, NameBank promotes the emergence of a layered biological informatics infrastructure that allows different expert systems to share common information. This conserves scarce resources and enhances the means to support continued expert work.

A foundation for collaboration

We are currently pursuing funding to separate the two logical components of the Taxonomic Name Server into separate services.

NameBank will become a biological name server focused on serving factual nomenclatural metadata. The ClassificationBank component derive taxonomic concepts from cached NameBank records. Formalizing this division into discreet components provides us with increased collaborative opportunity by facilitating multiple taxonomic models atop a common core set of factual metadata.

Different taxonomic systems can share common facts

A common nomenclatural resource allows different information systems to address different taxonomic issues, scopes, or user communities while sharing common reference data. Collaboration eliminates duplication, increases accountable attribution of work, and provides a common interchange core. A contributor to NameBank ca

We seek to establish that such an approach is technically sound and can reduce inefficient duplication and derivation of established facts while promoting a more effective attribution pathway that can increase the reach of the taxonomic profession without compromising quality. NameBank can also enhance interoperability between different infrastructures by providing a common address space.

The difference between a vocabulary and a taxonomy is that the latter organizes the terms in a hierarchical relationship.

-- Jack Krupansky

Defining terms and glossaries for domains and projects

I am going to leap ahead and start to conceptualize some semantic building blocks that are useful for acronyms, terms, glossaries and other foundation concepts.

My starting set of core concepts are roughly:

  • Term - a single word or phrase that has a particular meaning. This may be a single term (one word or two or more hyphen-separated words) or a compound term (two or more words or hyphen-separated words.)
  • Glossary - a collection of terms relevant to a particular project or domain.
  • Domain - a field of study or area of interest. Multiple domains may overlap or intersect (à la a Venn diagram).
  • Project - a collection of subsets of domains that are under study for some purpose.
  • Abbreviation - a shortened or shorthand form of a term.
  • Acronym - a stylized abbreviation for one or more compound terms that utilizes a sequence of the initial letters or abbreviations for the individual terms of which it is composed.

A glossary may contain actual term definitions or references to terms that are defined in another XML resource document, either by themselves or within another glossary. In its simplest form, a glossary would simply be a list of term references or pointers to externally-defined terms.

Glossaries should also be nestable so that a glossary can incorporate existing glossaries in their entirety.

A glossary would be associated with zero or more domains that may or may not overlap or intersect. For example, my software_agent glossary might list the domains software_agent_technology, computing, software, and distributed_processing.

One open question is the relationship between a glossary and a vocabulary. My current thinking is that a glossary is simply a subset of terms of interest in some project or a sub-field of a domain or even of interest to an individual. A vocabulary for a domain would consist of the universe of terms that are relevant for that domain, regardless of whether those terms are collected into one or more glossaries or exist as discrete XML resources not contained in glossaries. A vocabulary may be more of a computed collection whereas a glossary might be hand-crafted for its intended specific project.

One important nuance is that sometimes an existing term is mostly relevant but needs some modification to be more completely relevant for the domain of a glossary or to simplify and customize it to be more relevant to a project or domain. In such cases we want a local override definition for the term plus a reference to one or more existing term definitions upon which the term is based or from which it is synthesized.
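To make the nesting and local-override ideas a bit more concrete, here is a rough Python sketch of one possible shape for a glossary. Everything in it is an assumption for illustration: the `Glossary` class, its field names, the "computing#agent" reference syntax, and the placeholder definition strings are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Glossary:
    name: str
    domains: list = field(default_factory=list)
    definitions: dict = field(default_factory=dict)  # term -> local definition text
    based_on: dict = field(default_factory=dict)     # term -> references the local definition draws on
    included: list = field(default_factory=list)     # nested glossaries, incorporated whole

    def lookup(self, term):
        """Return (definition, glossary name), preferring a local override."""
        if term in self.definitions:
            return self.definitions[term], self.name
        for glossary in self.included:
            found = glossary.lookup(term)
            if found is not None:
                return found
        return None

    def terms(self):
        """All terms visible through this glossary, including nested ones."""
        names = set(self.definitions)
        for glossary in self.included:
            names |= glossary.terms()
        return names

computing = Glossary(
    name="computing",
    domains=["computing"],
    definitions={
        "agent": "placeholder general definition",
        "protocol": "placeholder definition",
    },
)
software_agent = Glossary(
    name="software_agent",
    domains=["software_agent_technology", "computing", "software", "distributed_processing"],
    definitions={"agent": "placeholder definition tailored to software agents"},
    based_on={"agent": ["computing#agent"]},
    included=[computing],
)

print(software_agent.lookup("agent"))
print(software_agent.lookup("protocol"))
```

The local definition wins for "agent" while still recording what it was based on, and "protocol" falls through to the incorporated glossary.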

-- Jack Krupansky

Thursday, October 2, 2008

Defining compound terms for acronyms

So far in my little acronym experiment, I defined a compound term simply as a string which happened to be a sequence of words. I actually started a separate experiment to look into defining a mini-dictionary or glossary of words and then using URI references to those XML resources in the definition of a compound term, but I ran into some issues that I was unable to resolve, so far. I may come back to that side experiment later, but it may become moot since I think the real solution is that each compound term should itself be a discrete XML resource and the acronym resource should simply tie the acronym term to the XML resource for the compound term.

I have not figured out all of the details yet, but rather than the acronym term "ABC" be defined as the string "Agent-Based Computing" or even the sequence of references to the XML resources for the individual terms "Agent-Based" and "Computing", the definition would be a single reference to a distinct XML resource for agent-based_computing.

Similarly, the definition for the acronym term "RSS" would be a collection of three references to distinct XML resources for really_simple_syndication, rich_site_summary, and rdf_site_summary.

I have not yet worked out the details, but I think I need to construct a standalone XML schema for a compound term, or maybe have the concept of a compound term glossary which is a list of compound terms relevant to a particular domain or subdomain. So, some compound terms could be represented as a single compound term in a single XML document, or a project could collect all of its compound terms into a glossary. There are pros and cons to both approaches.

The only problem here is that it introduces a separation between the abstract compound term for an acronym and the text of the words from which the individual letters of the acronym are derived.

One solution is to include both the text definition and the XML resource reference. Or, if the text of the compound term is included in the XML resource definition for the compound term then it can be obtained indirectly.

Or maybe the process by which the text of the acronym was derived is simply historical and is not strictly needed to operate at the purely semantic level.

Another approach is to actually decompose the words of the compound term and represent them in a structure that is organized by the sequence of letters of the acronym term. This structure would be kept with the acronym even though there is also a direct reference from the acronym resource to the XML resource for the compound term.
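A sketch of what that letter-by-letter structure might look like, in Python: the `decompose` function below is hypothetical and uses a simple greedy match on first letters, so it would mis-assign letters for compound terms containing small words that happen to start with the next acronym letter.

```python
import re

def decompose(acronym, compound_term):
    """Pair each letter of the acronym with the word that supplies it.

    Words are split on whitespace and hyphens. A word that supplies no
    letter is paired with None.
    """
    words = [w for w in re.split(r"[\s-]+", compound_term) if w]
    remaining = acronym.upper()
    pairs = []
    for word in words:
        if remaining and word[0].upper() == remaining[0]:
            pairs.append((remaining[0], word))
            remaining = remaining[1:]
        else:
            pairs.append((None, word))
    if remaining:
        raise ValueError(f"letters {remaining!r} of {acronym!r} were not matched")
    return pairs

print(decompose("ABC", "Agent-Based Computing"))
print(decompose("RSS", "RDF Site Summary"))
```

In the second case the letter R is supplied by a word that is itself an acronym, which is where a reference to a separate resource for RDF would attach.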

Incidentally, I already have a lot of resources on the Web for compound terms and acronyms, but they are in text in HTML documents rather than in XML. I will give some thought as to how I might want to organize those compound terms and how to split the existing HTML into raw XML and presentation HTML that feeds off of that XML. There are links between many of my compound terms, which would mean XML resource references in the XML as well as synthesized HTML links for the presentation of the compound terms.

OTOH, it was not my intention to dive into how to solve the problem of representing full-blown term and compound term definitions at this time, but rather to tackle the simpler problem of acronyms. I need to figure out which portion of the problem to carve off to continue work on acronyms.

-- Jack Krupansky

Tuesday, September 30, 2008

Simple Knowledge Organisation Systems (SKOS)

I will continue to pursue my little acronym project, but it is possible that it will eventually simply map into a Simple Knowledge Organisation Systems (SKOS) language. Oversimplifying, SKOS provides capabilities for defining controlled vocabularies.

From the Wikipedia:

SKOS Core [16] defines the classes and properties sufficient to represent the common features found in a standard thesaurus. It is based on a concept-centric view of the vocabulary, where primitive objects are not terms, but abstract concepts represented by terms. Each SKOS concept is defined as an RDF resource. Each concept can have RDF properties attached, including:

  • one or more preferred index terms (at most one in each natural language)
  • alternative terms or synonyms
  • definitions and notes, with specification of their language.

Concepts can be organized in hierarchies using broader-narrower relationships, or linked by non-hierarchical (associative) relationships. Concepts can be gathered in concept schemes, to provide consistent and structured sets of concepts, representing whole or part of a controlled vocabulary.

These features represent the stable part of SKOS Core. Other elements of the vocabulary are still considered unstable.

Acronyms might fit in as "alternative terms", but I am not so sure about that. To me, alternative terms should be full-fledged, first-class terms and not abbreviations or acronyms.

In any case, SKOS will be on my future reading list. For now, I want to keep the acronym project as simple as possible.

-- Jack Krupansky

Friday, September 26, 2008

Adding multiple definitions for an acronym

In the real world, there may be multiple definitions of the same acronym. Sometimes they are from distinct domains and unrelated but sometimes they have evolved over time within a single domain, possibly for variations in usage or different audiences. For example, RSS is commonly accepted to stand for Really Simple Syndication, but it technically stands for Rich Site Summary or even RDF Site Summary.

There are three ways to give multiple definitions for a single acronym:

  1. Define the acronym in multiple XML documents.
  2. Place multiple acronym definitions in a single XML document.
  3. Extend the schema definition for acronym to allow multiple definitions.

Ultimately, #1 is probably best and represents the distributed nature of the Web and Semantic Web and supports definitions within distinct domains. #2 can make sense when there is some obvious connection between the definitions such as for my RSS example. #3 is a tighter way of doing #2 and also ties the multiple meanings together.

I have created a sample XML document, acronym2a.xml, that illustrates placing multiple definitions of the same acronym term in a single XML document. Here is the fragment of that document for RSS:

<Acronym>
  <Term>RSS</Term>
  <CompoundTerm>Really Simple Syndication</CompoundTerm>
</Acronym>
<Acronym>
  <Term>RSS</Term>
  <CompoundTerm>Rich Site Summary</CompoundTerm>
</Acronym>
<Acronym>
  <Term>RSS</Term>
  <CompoundTerm>RDF Site Summary</CompoundTerm>
</Acronym>

This sample document uses the same schema as my second example, acronym2.xsd.

This approach basically works, but does nothing to suggest that these "meanings" are related and requires excessive verbiage.

Next, I modified the schema to allow an arbitrary list of compound term definitions for each acronym. Unfortunately, I have not yet been able to figure out how to design such a schema that does not require an extra level of XML element to represent the list. The new scheme does work, but is a bit more wordy than I would prefer.

So, using the old schema we wrote:

<Acronym>
  <Term>RDF</Term>
  <CompoundTerm>Resource Description Framework</CompoundTerm>
</Acronym>

But with the new schema that same exact definition becomes:

<Acronym>
  <Term>RDF</Term>
  <CompoundTerms>
    <CompoundTerm>Resource Description Framework</CompoundTerm>
  </CompoundTerms>
</Acronym>

I am still hoping that I can find a way to design the schema to make that extra level of XML element grouping optional, but for now at least this approach is functional.

Anyway, the XML that combines the three RSS definitions for one acronym now becomes:

<Acronym>
  <Term>RSS</Term>
  <CompoundTerms>
    <CompoundTerm>Really Simple Syndication</CompoundTerm>
    <CompoundTerm>Rich Site Summary</CompoundTerm>
    <CompoundTerm>RDF Site Summary</CompoundTerm>
  </CompoundTerms>
</Acronym>
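Moving existing data from the one-definition-per-Acronym layout to the grouped layout is mechanical. A minimal Python sketch of that regrouping, using the standard library's ElementTree, could look like this; `group_definitions` is a hypothetical helper and the sample input is a cut-down stand-in for a real file.

```python
import xml.etree.ElementTree as ET

OLD_XML = """
<Acronyms>
  <Acronym><Term>RDF</Term><CompoundTerm>Resource Description Framework</CompoundTerm></Acronym>
  <Acronym><Term>RSS</Term><CompoundTerm>Really Simple Syndication</CompoundTerm></Acronym>
  <Acronym><Term>RSS</Term><CompoundTerm>Rich Site Summary</CompoundTerm></Acronym>
  <Acronym><Term>RSS</Term><CompoundTerm>RDF Site Summary</CompoundTerm></Acronym>
</Acronyms>
"""

def group_definitions(old_xml):
    """Merge repeated Acronym elements into one Acronym per Term with a CompoundTerms list."""
    old_root = ET.fromstring(old_xml)
    new_root = ET.Element("Acronyms")
    groups = {}  # term -> its CompoundTerms element in the new tree
    for acronym in old_root.findall("Acronym"):
        term = acronym.findtext("Term")
        if term not in groups:
            new_acronym = ET.SubElement(new_root, "Acronym")
            ET.SubElement(new_acronym, "Term").text = term
            groups[term] = ET.SubElement(new_acronym, "CompoundTerms")
        ET.SubElement(groups[term], "CompoundTerm").text = acronym.findtext("CompoundTerm")
    return new_root

new_root = group_definitions(OLD_XML)
print(ET.tostring(new_root, encoding="unicode"))
```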

This is now finally starting to look somewhat useful for structuring information, albeit at a very simple level.

One thing that immediately stands out for future work is that rather than "RDF" simply being a string, it would be preferable to actually link that first word of the third definition of RSS to the synonym definition for RDF. That would then start to have the feel of more of a "semantic" Web.

The full sample XML, acronym3.xml, is also available online:

<?xml version="1.0" encoding="utf-8"?>
<!-- Created with Liquid XML Studio 6.1.17.0 - FREE Community Edition (http://www.liquid-technologies.com) -->
<Acronyms xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="http://agtivity.com/xsd/acronym3.xsd">
  <Acronym>
    <Term>ABC</Term>
    <CompoundTerms>
      <CompoundTerm>Agent-Based Computing</CompoundTerm>
    </CompoundTerms>
  </Acronym>
  <Acronym>
    <Term>RDF</Term>
    <CompoundTerms>
      <CompoundTerm>Resource Description Framework</CompoundTerm>
    </CompoundTerms>
  </Acronym>
  <Acronym>
    <Term>RSS</Term>
    <CompoundTerms>
      <CompoundTerm>Really Simple Syndication</CompoundTerm>
      <CompoundTerm>Rich Site Summary</CompoundTerm>
      <CompoundTerm>RDF Site Summary</CompoundTerm>
    </CompoundTerms>
  </Acronym>
</Acronyms>

The full schema, acronym3.xsd, is starting to get a little verbose, but still fairly manageable:

<?xml version="1.0" encoding="utf-8" ?>
<!--Created with Liquid XML Studio 6.1.17.0 - FREE Community Edition (http://www.liquid-technologies.com)-->
<xs:schema elementFormDefault="qualified"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Acronyms" type="AcronymList" />
  <xs:complexType name="Acronym">
    <xs:all>
      <xs:element name="Term" type="xs:string" />
      <xs:element name="CompoundTerms" type="CompoundTermList" />
    </xs:all>
  </xs:complexType>
  <xs:complexType name="AcronymList">
    <xs:sequence minOccurs="0" maxOccurs="unbounded">
      <xs:element name="Acronym" type="Acronym" />
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="CompoundTermList">
    <xs:sequence minOccurs="0" maxOccurs="unbounded">
      <xs:element name="CompoundTerm" type="CompoundTerm" />
    </xs:sequence>
  </xs:complexType>
  <xs:simpleType name="CompoundTerm">
    <xs:restriction base="xs:string" />
  </xs:simpleType>
</xs:schema>

Basically, I added two defined types, a complex type named CompoundTermList that is a container for the arbitrary list of acronym definitions, and a simple type named CompoundTerm that represents a single compound term. The other change was that the second element of Acronym is now a reference to a CompoundTermList rather than being a simple string. I could have stayed with simple strings for the elements of a CompoundTermList, but I have thoughts about wanting to allow for more structure within a compound term in the future, such as "RDF" being a URI reference to the RDF synonym.
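For readers without a schema-aware tool at hand, the rules in this schema are simple enough to check by hand. The Python sketch below is not XML Schema validation, just a hand-written mirror of the structure described here: `check_acronyms` is a hypothetical function, and it assumes the document uses no namespace on its elements.

```python
import xml.etree.ElementTree as ET

def check_acronyms(xml_text):
    """Return a list of structural problems; an empty list means the shape looks right."""
    problems = []
    root = ET.fromstring(xml_text)
    if root.tag != "Acronyms":
        problems.append(f"root element is {root.tag}, expected Acronyms")
        return problems
    for number, acronym in enumerate(root, start=1):
        if acronym.tag != "Acronym":
            problems.append(f"child {number}: unexpected element {acronym.tag}")
            continue
        # xs:all means each child appears exactly once, in any order.
        tags = sorted(child.tag for child in acronym)
        if tags != ["CompoundTerms", "Term"]:
            problems.append(f"Acronym {number}: expected one Term and one CompoundTerms, found {tags}")
            continue
        for child in acronym.find("CompoundTerms"):
            if child.tag != "CompoundTerm":
                problems.append(f"Acronym {number}: unexpected element {child.tag} in CompoundTerms")
    return problems

GOOD = """
<Acronyms>
  <Acronym>
    <CompoundTerms><CompoundTerm>Agent-Based Computing</CompoundTerm></CompoundTerms>
    <Term>ABC</Term>
  </Acronym>
</Acronyms>
"""
OLD_STYLE = """
<Acronyms>
  <Acronym>
    <Term>RDF</Term>
    <CompoundTerm>Resource Description Framework</CompoundTerm>
  </Acronym>
</Acronyms>
"""

print(check_acronyms(GOOD))
print(check_acronyms(OLD_STYLE))
```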

Once again, do not despair if a lot of this seems like total gibberish -- because it is! The goal at this stage is simply to get a flavor of XML, schemas, and Semantic Web Technologies so we have a sense of footing before diving too far and deep off the deep end.

The next thing I am thinking about is to produce rudimentary term and phrase schemas so that an acronym can refer to a term as a full-fledged XML resource and so that a compound term would be a sequence of references to term resources rather than literal string values.

-- Jack Krupansky

XML tutorial info

One online source of XML tutorial material that I have found useful is W3Schools.com, especially the XML Schema Tutorial.

-- Jack Krupansky

Wednesday, September 24, 2008

Dirt simple XML schema for acronyms

Although it was not my original intent to dive into XML "code" so soon, I was feeling more than a little disoriented and felt a need to get at least some footing before delving into all of the conceptual angles. In particular, I figured that by trying out an interactive XML schema design tool I could very quickly get a small schema running without the need to master all of the nuances of XML Schema. The process did not go quite as smoothly as I had expected, but several hours later I do have two small test schemas for acronyms, as well as two test XML files based on those schemas.

Without any further ado I will present the two test XML files, but I do not intend to offer a tutorial on all of the XML angles at this time. Some stuff is obvious and some stuff may not even be explainable in even a series of blog posts. Focus on what is obvious and ignore the rest, for now.

One might wonder why I do not present the schemas first, but the simple facts are that XML schemas are somewhat cryptic and it is much simpler to have pre-visualized some sample XML text in your head before trying to make sense of the schemas. You may also be wondering why I have two schemas, but that will be clear in a moment.

All of my XML-related files will be kept on my Software Agent Web site, Agtivity.com.

The tool that I used to create the XML schemas and XML test files is Liquid XML Studio 6.1.17.0 - Free Community Edition from Liquid Technologies Limited.

So, here it is, my first test XML file for acronyms, acronym1.xml:

<?xml version="1.0" encoding="utf-8"?>
<!-- Created with Liquid XML Studio 6.1.17.0 - FREE Community Edition (http://www.liquid-technologies.com) -->
<Acronyms xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="http://agtivity.com/xsd/acronym1.xsd">
  <Acronym Term="ABC" CompoundTerm="Agent-Based Computing" />
  <Acronym Term="RDF" CompoundTerm="Resource Description Framework" />
</Acronyms>

It only has two acronyms, but it should be fairly obvious how to add more. They are completely expressed by these two lines:

  <Acronym Term="ABC" CompoundTerm="Agent-Based Computing" />
  <Acronym Term="RDF" CompoundTerm="Resource Description Framework" />

Each acronym has a term and the equivalent compound term. Pretty simple stuff, or so it would seem. In XML parlance Term and CompoundTerm are known as attributes. In this schema, each acronym has two attributes, a Term, and a CompoundTerm.
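Reading attribute-style data back out is about as simple as XML processing gets. As a small illustration, here is a Python sketch using the standard library's ElementTree; `load_acronyms` is a hypothetical helper, and the sample omits the schema reference for brevity.

```python
import xml.etree.ElementTree as ET

ACRONYM1 = """
<Acronyms>
  <Acronym Term="ABC" CompoundTerm="Agent-Based Computing" />
  <Acronym Term="RDF" CompoundTerm="Resource Description Framework" />
</Acronyms>
"""

def load_acronyms(xml_text):
    """Build a dict from each Acronym's Term attribute to its CompoundTerm attribute."""
    root = ET.fromstring(xml_text)
    return {a.get("Term"): a.get("CompoundTerm") for a in root.iter("Acronym")}

acronyms = load_acronyms(ACRONYM1)
print(acronyms["ABC"])
```

A plain dict can hold only one expansion per acronym, which is fine for data this flat.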

With this image of what the XML data actually looks like, it will be easier to make sense of the XML schema.

So, here it is, my first XML Schema for acronyms, acronym1.xsd:

<?xml version="1.0" encoding="utf-8" ?>
<!--Created with Liquid XML Studio 6.1.17.0 - FREE Community Edition (http://www.liquid-technologies.com)-->
<xs:schema elementFormDefault="qualified"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Acronyms" type="AcronymList" />
  <xs:complexType name="Acronym">
    <xs:attribute name="Term" type="xs:string" />
    <xs:attribute name="CompoundTerm" type="xs:string" />
  </xs:complexType>
  <xs:complexType name="AcronymList">
    <xs:sequence>
      <xs:element minOccurs="0" maxOccurs="unbounded"
          name="Acronym" type="Acronym" />
    </xs:sequence>
  </xs:complexType>
</xs:schema>

There is plenty of gibberish there, but the essence is that the schema defines a list of acronyms using the type complexType named AcronymList which consists of zero or more occurrences of elements of the type Acronym which is also a complexType and consists simply of two attributes which are strings, one called Term and the other called CompoundTerm.

Back in acronym1.xml, you can see that the xsi:noNamespaceSchemaLocation attribute gives the URL of the schema file, acronym1.xsd.

If you can make sense out of all of this, that is great, but at least you have been exposed to what it takes to do even something very simple in XML. Actually, it is not too bad, but it is a bit more like looking at the components and wiring inside your computer rather than simply figuring out how to use it.

But wait... we are only halfway done. I said that there were two distinct approaches to the schema and test file. The first schema defined an acronym in terms of two attributes, which is fine for very simple, unstructured data, but is too limiting for structured data. The second approach to the schema uses elements rather than attributes.

So, here it is, my second test XML file for acronyms, acronym2.xml, using elements, rather than attributes:

<?xml version="1.0" encoding="utf-8"?>
<!-- Created with Liquid XML Studio 6.1.17.0 - FREE Community Edition (http://www.liquid-technologies.com) -->
<Acronyms xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="http://agtivity.com/xsd/acronym2.xsd">
  <Acronym>
    <Term>ABC</Term>
    <CompoundTerm>Agent-Based Computing</CompoundTerm>
  </Acronym>
  <Acronym>
    <CompoundTerm>Resource Description Framework</CompoundTerm>
    <Term>RDF</Term>
  </Acronym>
</Acronyms>

The header is almost identical but points to the second schema. Th