The semantic gap between bits and knowledge
We have a wide spectrum of levels of abstraction for representing information in computers, none of which is particularly well adapted to representing human knowledge in a form that is readily comprehended by computer programs. At the low end of the spectrum we have bits, bytes, characters, text, databases, XML, and even RDF for the Semantic Web. We have specialized abstractions for specialized applications as well. Somewhere in the middle of the spectrum we have various so-called knowledge representation languages, which purport to represent knowledge, but only in a host of well-defined, limited, constrained forms that are still not representative of true human knowledge and are not directly recognizable and usable by mere mortals. Sad to say, but text for natural language is the closest form we have in computers to something that is recognizable and usable by mere mortals. Unfortunately, free-form text is not readily and easily recognizable and usable by computer programs (as a surrogate for human knowledge). So, we have a vast semantic gap between the bits of computers and the knowledge of humans.
I wish I had some graphic ability so that I could draw a fancy diagram of this spectrum of information and knowledge representation, but I don't, so I'll present the spectrum as a simple list, starting at the low end:
- Bits - zero and one, on and off.
- Characters and numbers
- Strings - sequences of characters representing individual words or identifiers
- Text - free-form sequences of strings or words, possibly even natural language prose
- Structured text - tabular lists (e.g., CSV)
- Application-specific data formats
- Big Gap #1
- Knowledge representation languages
- Big Gap #2
- Human knowledge and human language
RDF is a knowledge representation language of sorts, but it is more specialized and adapted to representing raw information than to representing humanly recognizable knowledge.
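To make the spectrum concrete, here is a small sketch showing one fact climbing several of the levels above, from bytes to an RDF-style triple. This is purely illustrative: the fact, the CSV schema, and the `ex:` vocabulary names are invented for the example, not drawn from any real ontology.

```python
# The same fact at several points on the information/knowledge spectrum.
fact_text = "Water boils at 100 degrees Celsius at sea level."

# Bits/bytes: the lowest level -- an opaque byte sequence.
as_bytes = fact_text.encode("utf-8")

# Strings: individual words, recoverable but unstructured.
as_words = fact_text.rstrip(".").split()

# Structured text: a CSV-style row with an implied schema.
as_csv = "substance,property,value,unit\nwater,boiling_point,100,celsius"

# RDF-style triple: subject, predicate, object -- machine-usable, but only
# if the vocabulary (here the invented ex:boilingPointC) is agreed on in advance.
as_triple = ("ex:Water", "ex:boilingPointC", 100)

# Each step adds structure a program can exploit, yet none of them captures
# what a person actually *knows* about boiling water.
print(as_words[0], as_triple)
```

Each representation is trivially derivable from the text, but notice that the meaning a program can act on only appears once a human has imposed a schema or vocabulary, which is exactly where the gap opens up.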
It is worth noting that there is a distinction between knowledge and communication, but that is beyond the scope of the point about bits vs. human knowledge. One example of that distinction is tacit knowledge, which is knowledge that defies straightforward communication or representation in language.
This information/knowledge spectrum layout immediately raises the question of the sub-spectrum of knowledge representation languages, a topic worthy of attention in its own right, but that too is beyond the scope of the immediate issue.
One of the most notable causes of the vast semantic gap between current knowledge representation languages and human knowledge is the issue of vocabulary definition. Computer-based systems strive to minimize and eliminate ambiguity, while human knowledge and language embrace and thrive on ambiguity. Coping with ambiguity may be the ultimate chasm to be crossed before computers can have ready access to human knowledge.
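A toy sketch of that tension, with invented word senses: where a human reader resolves "bank" from context effortlessly, a program sees only a set of candidate meanings and must either guess or carry the ambiguity along.

```python
# Invented sense inventory for illustration -- not a real lexical database.
SENSES = {
    "bank": ["financial institution", "edge of a river", "tilt of an aircraft"],
}

def interpret(word: str) -> list[str]:
    """Return every candidate sense; a rigid system would demand exactly one."""
    return SENSES.get(word, [])

candidates = interpret("bank")
print(candidates)  # multiple readings, no way to choose without context
```

The unambiguous-vocabulary stance would reject this word outright or force an arbitrary choice; a system closer to human knowledge would have to keep all three readings live until context disambiguates.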
One major problem with knowledge representation languages is that there are a lot of them, a virtual Tower of Babel of them, so that we do not have a common knowledge language that can be leveraged across all forms of knowledge and all application domains. Leverage is a very powerful tool for solving problems with computers; the lack of it is one of the most serious obstacles. Leverage can rapidly accelerate the adoption of new technology, but lack of leverage seriously retards adoption. XML and RDF were big leaps forward in leverage, but still nowhere near enough.
One open question is whether a rich-enough knowledge representation language can be built using RDF as its lower level, or whether something richer and more flexible than RDF is needed. This may hinge on what stance you take on ambiguity.