Wednesday, August 5, 2009

What is the future of the English language, especially for the Semantic Web?

Despite the myriad technological advances in computer hardware and software and all of the wonderful specialized computer languages, it is amazing that natural language, in particular English, is still such a dominant force in the world. That is where we are today, but what about the future?

Computer language designers and application developers are busily at work, incrementally chipping away at ever deeper and broader niches where computer languages can supplant natural language as the preferred "tongue." Still, progress is very slow. Natural language remains the choice for expressiveness, flexibility, and ease of use, and that seems unlikely to change any time soon.

Low birth rates in "English-speaking" countries make it increasingly likely that fewer and fewer people will consider English their native tongue in the decades to come. Still, English somehow continues to "open doors" across cultures, especially in business, government, science, and engineering, and especially in computer software.

The Web makes it very easy for people to use their local language, which is fine when the intended audience is local, but many Web sites either use English or offer an alternate set of pages in English to cater to a global audience.

Then we come to the Semantic Web. In some sense the Semantic Web is a direct parallel to the traditional non-semantic Web. It is difficult to say whether data on the Semantic Web is any more global than on the old Web. Maybe initially more of the efforts are aimed at a global audience, just as they were with the traditional Web in its early days, but over time we should hope that very specialized databases and applications will be tailored heavily to local audiences. Rest assured that Semantic Web technologies are designed for internationalization and localization.

But since globalized Semantic Web applications and code libraries, by definition, know something about the data they are processing, that "knowledge" about the data needs to be in a language-independent form.

To be truly useful, software agents, especially intelligent agents, need to access the meaning of data and to access it globally. This means, once again, that knowledge about data needs to be represented in a form that is not hidden by localized natural languages.
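To make that concrete, here is a minimal sketch of what a language-independent assertion can look like, written in Python with the rdflib library (assuming it is installed); the example.org names (Paris, locatedIn, France) are invented purely for illustration and stand in for any language-neutral vocabulary:

    # A minimal sketch, assuming Python with the rdflib library installed.
    # The example.org vocabulary is hypothetical.
    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/")
    g = Graph()

    # The assertion is made entirely with URIs -- no natural language is
    # involved, so an agent anywhere can consume it without translation.
    g.add((EX.Paris, EX.locatedIn, EX.France))

    # An agent accesses the meaning of the data by matching URIs, not words.
    for country in g.objects(EX.Paris, EX.locatedIn):
        print(country)  # -> http://example.org/France

The point is simply that the "knowledge" lives in the URIs and the structure of the triples, not in any particular natural language.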

As things stand today, the three main tools for globalizing applications are:

  • Use of English as the "core" language.
  • Maintaining data in conceptual, language-neutral form.
  • Tools for localizing data to the native or preferred tongue of the user (a rough sketch of this follows below).
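Here is one rough sketch of how the second and third tools might look in practice, again in Python with rdflib and a hypothetical example.org vocabulary: the data itself stays language-neutral, human-readable labels carry language tags, and a small helper falls back to English, the "core" language, when no localized label exists.

    # A sketch, not a definitive implementation: language-neutral data plus
    # language-tagged labels, localized to the user's preferred tongue with
    # an English fallback. The example.org vocabulary is hypothetical.
    from rdflib import Graph, Namespace, Literal
    from rdflib.namespace import RDFS

    EX = Namespace("http://example.org/")
    g = Graph()
    g.add((EX.France, RDFS.label, Literal("France", lang="en")))
    g.add((EX.France, RDFS.label, Literal("Frankreich", lang="de")))

    def localized_label(graph, resource, preferred_lang, fallback_lang="en"):
        """Return the label in the preferred language, else the fallback."""
        labels = {label.language: str(label)
                  for label in graph.objects(resource, RDFS.label)}
        return labels.get(preferred_lang) or labels.get(fallback_lang)

    print(localized_label(g, EX.France, "de"))  # -> Frankreich
    print(localized_label(g, EX.France, "fr"))  # -> France (English fallback)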

Automatic language translation is still fairly primitive and unlikely to be "solved" in the near future.

While technologies, especially the Semantic Web, are still under development and evolving rapidly, it makes sense to focus on a single, core natural language to assure that information is communicated as rapidly and widely as possible.

But as the technology matures (maybe in another ten to twenty years), the need for such broad communication will rapidly diminish. Sure, the elite will still communicate globally, but the average practitioner will likely serve a local audience. All important documents and specifications will have been translated into all the significant natural languages. In that kind of environment the need to "work" in English will effectively vanish, much as we see today with local Web sites, blogs, and other social media. Twitter is the latest to experience this localization phenomenon.

That still leaves the case of software agents, an unsolved problem. Sure, they can be heavily localized as well, but that is not a solution per se. Maybe initially a new development in agent technology might be English-only or English-centric, but as that technology matures, it is only natural that developers will seek to refine it to exploit localized intelligence. That may make such an agent less usable at a global level, but global reach may not be as important at that stage. Agents could also be programmed with split personalities so that they can still operate at a global level, albeit with somewhat less capability than their more specialized localized intelligence, but that requires greater effort and discipline on the part of developers and is less than optimal.

There is also the underlying issue that, beyond superficial language differences, there are cultural differences between countries, peoples, and regions of the world. Initial Semantic Web efforts may tend to operate at a fairly high level where such cultural differences are rather muted, but as Semantic Web applications become deeper and more localized, cultural differences will move to the forefront.

Academic and high-end commercial developers have a need and an interest in presenting their work globally, including marketing, journal papers, and conferences, where English is the norm. Semantic Web content that is not in English will tend to be at a disadvantage in such venues. Besides, high-end developers will tend to prefer developing internationalized content that can be localized as needed.

Global communities of developers are also becoming a new norm. This includes open source community projects where, by definition, the initial and current contributors have no idea what country or culture future contributors will be from. This once again argues against doing development in anything other than English.

All of this leads me to believe that English will continue to be the dominant natural language for advanced and emerging computer software, especially the Semantic Web, for some time to come. Nonetheless, the issue of how to fully support and exploit local natural languages will remain and increasingly become a very thorny issue.

One lingering issue with the Semantic Web is the language used for constructing predicate and property names. So far they tend to be English, which is okay since the end user should never see them, but there is no requirement that they be English, and developers with less global interests may begin to use their local language for predicate and property names. This introduces a whole new level of mapping and matchmaking complexity. It is solvable, but it is also unsolved and a potential problem lurking around the corner. Motivated developers can manually add the necessary mappings, but the need for them reduces the extent to which serendipity just works without manual intervention by developers.
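As a hedged illustration of what such a manual mapping might look like, the sketch below declares a French-named predicate equivalent to an English-named one (both vocabularies are hypothetical) and expands a lookup across declared equivalents by hand; a real deployment would more likely lean on an OWL reasoner or a SPARQL property path rather than this hand-rolled expansion.

    # A sketch of one manual mapping strategy using owl:equivalentProperty.
    # The en/ and fr/ vocabularies are hypothetical.
    from rdflib import Graph, Namespace
    from rdflib.namespace import OWL

    EN = Namespace("http://example.org/en/")   # English-named predicates
    FR = Namespace("http://example.org/fr/")   # locally (French) named predicates

    g = Graph()
    # Mapping added manually by a motivated developer.
    g.add((FR.situeDans, OWL.equivalentProperty, EN.locatedIn))
    # Data published by a developer using the local-language predicate.
    g.add((EN.Paris, FR.situeDans, EN.France))

    def objects_via_equivalents(graph, subject, predicate):
        """Yield objects reachable via the predicate or any declared equivalent."""
        predicates = {predicate}
        predicates.update(graph.subjects(OWL.equivalentProperty, predicate))
        predicates.update(graph.objects(predicate, OWL.equivalentProperty))
        for p in predicates:
            yield from graph.objects(subject, p)

    print(list(objects_via_equivalents(g, EN.Paris, EN.locatedIn)))
    # -> [rdflib.term.URIRef('http://example.org/en/France')]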

All of this raises the question of how the English language itself might evolve over the coming decades. One interesting possibility is that at some point the technology "users" of English, including Semantic Web content developers, could become so much greater a force than the native speakers that those users realize they are effectively in control of the English language and can evolve it as they see fit. It will be interesting to see how that post-English language evolves.

-- Jack Krupansky
