The Semantic Abyss - Plumbing the Semantic Web: Semantic gap between text and semantic markup

Sunday, July 3, 2011

Semantic gap between text and semantic markup

No matter how advanced our Semantic Web technology becomes, we still have an inherent problem, namely, the semantic gap between simple, plain text and our semantic markup. How do we correlate textual representations and semantically marked-up representations?

At the most basic level, we need to be able to correlate semantic entities with textual references to them. Sometimes that can be a simple text lookup, but often there are multiple semantic entities that have similar if not identical textual representations, especially when the textual representations are frequently shorthand notations rather than full, detailed entity references.
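To make the lookup problem concrete, here is a minimal sketch of a text-to-entity index in which one shorthand string maps to several entities. The URIs, labels, and the `build_index` and `lookup` names are all invented for illustration; nothing here is a real vocabulary or library API.

```python
from collections import defaultdict

# Hypothetical entity records: (URI, canonical label, shorthand aliases).
# Every URI and label below is made up for this sketch.
ENTITIES = [
    ("http://example.org/person/1", "Krupansky (software developer)", ["Krupansky, J."]),
    ("http://example.org/person/2", "Krupansky (state judge)", ["Krupansky, J."]),
    ("http://example.org/person/3", "Krupansky (federal judge)", ["Krupansky, J."]),
]


def build_index(entities):
    """Map each case-folded label or alias to the list of URIs that use it."""
    index = defaultdict(list)
    for uri, label, aliases in entities:
        for text in [label, *aliases]:
            index[text.casefold()].append(uri)
    return index


def lookup(index, text):
    """Return every candidate URI for a textual reference (possibly none or many)."""
    return list(index.get(text.casefold(), []))


index = build_index(ENTITIES)
print(lookup(index, "Krupansky, J."))  # three candidates: the lookup alone cannot decide
```

The lookup deliberately returns a list rather than a single URI, since a plain text match can only narrow the field, not settle it.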

Lookups are complicated by the fact that some entities have names that are raw natural language prose, so they cannot be unambiguously distinguished from simple prose. Examples include the names of bands, songs, plays, books, movies, parks, etc. As an even more complex example, a movie based on a book may have the same name as the book.

Even references to people are problematic: people use nicknames, and some people have the same name. For example, "Krupansky, J." may be a reference to me in the bibliography of a technical paper, or it may be a reference in a legal document to one of two court judges. This particular example suggests that context can aid in the identification process, but with the two judges even context can be problematic. A human can tell the two judges apart since one was at the state level and the other at the federal level, but both were in Ohio. They in fact were brother and sister, but with no apparent relation to me. How a computer would differentiate those two or even all three of us without significant guidance or hand-coded "intelligence" is an open question.

One simple identification issue is the use of articles in entity names. Technically, the Beatles are really "The Beatles" and "The" is quite significant when referring to "The Office." A lot of traditional text processing algorithms like to ignore punctuation, articles, and so-called "stop words", but increasingly these ephemera are becoming more significant. Yahoo is really "Yahoo!". And then there is the musician formerly known as "The Artist Formerly Known as Prince" with a non-textual symbol as his formal entity name. The point is that casual and even somewhat formal textual references to entities can be quite far from the pure, true, formal, literal entity identifier.
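A small sketch of why the traditional approach loses information: the first function below mimics a typical normalization step (lowercase, strip punctuation, drop stop words), while the second only folds case and whitespace. The stop-word list and both function names are assumptions made up for this example.

```python
import string

# A tiny, illustrative stop-word list; real pipelines use much longer ones.
STOP_WORDS = {"the", "a", "an", "of"}


def naive_normalize(text):
    """Lowercase, strip punctuation, and drop stop words, as many pipelines do."""
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in stripped.split() if word not in STOP_WORDS)


def light_normalize(text):
    """Fold case and collapse whitespace only, keeping articles and punctuation."""
    return " ".join(text.casefold().split())


print(naive_normalize("The Office"))  # office
print(naive_normalize("Yahoo!"))      # yahoo
print(light_normalize("The Office"))  # the office
print(light_normalize("Yahoo!"))      # yahoo!
```

With the naive version, "The Office" becomes indistinguishable from any casual mention of an office, which is exactly the distinction the formal entity name was carrying.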

References to the works of an entity or to characteristics of an entity can be similarly problematic in raw text representations. Ultimately there may be a single, hard URI for the referenced entity, but getting from raw text to URI can be a real challenge.

In some cases, even our best computational efforts may still result in ambiguous references. Then we have a really tough choice: either pick the "best" reference by some measure or heuristic, or simply represent a list of possible references. The latter works semi-well for display to a human user, such as the results from a search engine, but is somewhat problematic when a computer program is processing the results and expecting a singular result.

The good news is that in many cases just a little context can go a long way. If someone is querying about computers and software, I would have a higher probability of being a match than the judges. If someone is querying about legal cases, then Krupansky the judge(s) could be selected, although even in that case we still have an ambiguity.
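As a rough sketch of how a little context can help, the following scores each candidate by how many of its topic keywords appear in the query context and returns all top scorers. The candidate URIs, the keyword sets, and `rank_by_context` are hypothetical, and keyword overlap is just one simple heuristic chosen for illustration.

```python
# Hypothetical candidates for "Krupansky, J.", each with invented topic keywords.
CANDIDATES = {
    "http://example.org/person/software-developer": {"software", "computers", "semantic", "web"},
    "http://example.org/person/state-judge": {"court", "legal", "ohio", "state"},
    "http://example.org/person/federal-judge": {"court", "legal", "ohio", "federal"},
}


def rank_by_context(candidates, context_words):
    """Return the URIs whose keywords overlap the context the most.

    More than one URI comes back when the context cannot break the tie,
    and an empty list comes back when nothing overlaps at all.
    """
    context = {word.casefold() for word in context_words}
    scored = [(len(keywords & context), uri) for uri, keywords in candidates.items()]
    best = max((score for score, _ in scored), default=0)
    if best == 0:
        return []
    return sorted(uri for score, uri in scored if score == best)


print(rank_by_context(CANDIDATES, ["software", "bug"]))  # one candidate
print(rank_by_context(CANDIDATES, ["legal", "case"]))    # both judges: still ambiguous
```

A query about legal cases still returns both judges, so the caller has to decide whether to pick one by some further heuristic or to pass the whole list along.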

Correlating bands and songs is at least superficially a slam dunk since the mapping between bands and songs tends to be relatively sparse, but there are no guarantees and the state of the art for automated software is that some form of guarantee is needed.

Misspelling of entity names is also a problem. If you know the category of the entity, such as that it is a band or a song, then traditional spell-checking algorithms may be sufficient, but if you are just looking at a fragment of raw text with no context or category, the problem becomes much harder. A mis-phrased song or book title can look a lot like raw prose. Still, traditional phrase-matching algorithms may do reasonably well at telling you whether a fragment of text happens to match up with one or more entity names, but you could also get a lot of false positives when the user is simply making a casual statement rather than intentionally referring to a named entity. Even so, alerting the user to the possible entity reference can have at least some value even if it may not be 100% relevant. The harder problem is when there is a very large number of partial matches; then the user could well be overwhelmed rather than aided.
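Here is a minimal sketch of the easier case, where the category is known and ordinary approximate string matching is enough. It uses `difflib.get_close_matches` from the Python standard library; the sample name lists, the `suggest` function, and the 0.8 cutoff are assumptions chosen for illustration, not tuned values.

```python
import difflib

# Small, illustrative name lists per category.
NAMES_BY_CATEGORY = {
    "band": ["The Beatles", "The Rolling Stones"],
    "tv_show": ["The Office"],
}


def suggest(text, category, cutoff=0.8):
    """Return known names in the given category that closely resemble the text."""
    names = NAMES_BY_CATEGORY.get(category, [])
    by_folded = {name.casefold(): name for name in names}
    hits = difflib.get_close_matches(text.casefold(), list(by_folded), n=3, cutoff=cutoff)
    return [by_folded[hit] for hit in hits]


print(suggest("The Beetles", "band"))     # ['The Beatles']
print(suggest("The Beetles", "tv_show"))  # []
```

Restricting the search to one category is what keeps this manageable; run against every entity name at once, the same matcher would produce the flood of partial matches described above.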

A simple solution is faceting, where the user is shown not the list of all possible matches, but the categories of matches. This can dramatically reduce the amount of information to be presented to the user. The user can then drill down for more detail. Still, even this approach may result in information overload.
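A sketch of faceting over a set of matches, assuming each match already carries a category. The match data and the `facet_counts` and `drill_down` names are invented for this example.

```python
from collections import Counter

# Hypothetical matches for some text fragment: (URI, category).
MATCHES = [
    ("http://example.org/song/1", "song"),
    ("http://example.org/song/2", "song"),
    ("http://example.org/book/1", "book"),
    ("http://example.org/song/3", "song"),
    ("http://example.org/book/2", "book"),
    ("http://example.org/band/1", "band"),
]


def facet_counts(matches):
    """Summarize matches as (category, count) pairs, largest category first."""
    return Counter(category for _, category in matches).most_common()


def drill_down(matches, category):
    """Return only the matches in the category the user selected."""
    return [uri for uri, match_category in matches if match_category == category]


print(facet_counts(MATCHES))        # [('song', 3), ('book', 2), ('band', 1)]
print(drill_down(MATCHES, "book"))  # the two book URIs
```

Six matches collapse to three facets here; with thousands of matches the reduction is larger, though a very long list of categories can still overwhelm.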

Another tool is a user-generated dictionary that fills in the particular user's preference for a partial or ambiguous entity reference the first time it needs to be resolved. Not that any user would necessarily need to manually create such a dictionary. In fact, a collection of such resolution dictionaries may be automatically supplied with just a little context about the user and their tasks. One source is to find other users with similar characteristics and then offer the dictionaries of those other users as a starting point. Maybe the user could supply a list of people they "think like" or are interested in following, and that can be used to seed the user's resolution dictionary collection.
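One way such a resolution dictionary might look, as a minimal sketch: it asks the user only the first time an ambiguous reference comes up, remembers the answer, and can be seeded from another user's exported choices. The `ResolutionDictionary` class, its methods, and the `ask` callback are all hypothetical names for this illustration.

```python
class ResolutionDictionary:
    """Remembers which entity a user means by an ambiguous textual reference."""

    def __init__(self, seed=None):
        # seed: optional mapping copied from another user's dictionary.
        self._choices = dict(seed or {})

    def resolve(self, text, candidates, ask):
        """Return the remembered choice, or call ask(text, candidates) once and store it."""
        key = text.casefold()
        remembered = self._choices.get(key)
        if remembered in candidates:
            return remembered
        choice = ask(text, candidates)
        self._choices[key] = choice
        return choice

    def export(self):
        """Return a copy of the stored choices, usable as a seed for another user."""
        return dict(self._choices)


def pick_first(text, candidates):
    # Stand-in for a real prompt; a real application would ask the user.
    return candidates[0]


mine = ResolutionDictionary()
uris = ["http://example.org/person/1", "http://example.org/person/2"]
mine.resolve("Krupansky, J.", uris, pick_first)   # asks once and remembers
similar_user = ResolutionDictionary(seed=mine.export())
```

Seeding from the export of a like-minded user means the new user may never be asked about references the other user already resolved, while still being free to override them later.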

In summary, matching textual entity references embedded in raw text is an open problem. Yes, there are a lot of tools readily available that may address the problem, but more work in this area may be quite helpful. And, most importantly, bridging the semantic gap between the worlds of text and semantic entities is an important goal.

-- Jack Krupansky

14 Comments:

At October 13, 2011 at 7:33 AM , Anonymous said...: I love all your posts and blogs.
At March 27, 2015 at 11:30 PM , Sharmayne said...: I enjoyed reading your blog and learned a lot from it..
At October 24, 2015 at 10:49 PM , Chelseayhun said...: the semantic gap between the two is very different.

At October 28, 2015 at 2:20 AM , Unknown said...: Very useful information. Thanks for bringing this up!

At December 28, 2015 at 12:26 AM , Anonymous said...: Its really confusing and mind bugling too! But still i got the point. haha!

At January 6, 2016 at 7:52 PM , Unknown said...: they have very big difference

At January 9, 2016 at 4:44 PM , HannahWinslett said...: yeah i agree they have both lots of difference

At January 30, 2016 at 2:49 AM , Anonymous said...: that was really fascinating difference

At February 1, 2016 at 1:19 AM , Unknown said...: Text and semantic entities is an important goal. - yes indeed
At March 2, 2016 at 2:58 AM , Unknown said...: that was so fascinating i will share it

