SEMANTIC WEB
Submitted in partial fulfillment of the requirements for the award of the degree of
MASTER OF COMPUTER APPLICATIONS [SOFTWARE ENGINEERING]
Guide(s): MR. S.K. MALIK
Submitted By: MANIT PANWAR, MCA (SE) 1st Sem., Roll No. 001164105409
University School of Information Technology GGS Indraprastha University, Delhi – 6 (2009-2012)
CERTIFICATE
This is to certify that the Term Paper (IT-655) entitled “SEMANTIC
WEB” done by Mr. MANIT PANWAR, Roll No. 00116404509 is an authentic work carried out by him at USIT, GGSIPU under my guidance. The matter embodied in this term work has not been submitted earlier for the award of any degree or diploma to the best of my knowledge and belief.
Dated:
(Signature of the Guide) Mr. S.K. MALIK
Lecturer, USIT, GGSIPU, Delhi-6
ACKNOWLEDGEMENT
I owe a great many thanks to the many people who helped and supported me during the writing of this term paper. My deepest thanks to Mr. S.K. MALIK, the guide of my term paper, for guiding me and correcting my various documents with attention and care. He has taken pains to go through the term paper and make the necessary corrections as and when needed. I express my thanks to the Dean of USIT, GGSIPU, for extending his support. My deep sense of gratitude to Mr. Amit Prakash Singh, Teacher Incharge of the Term Paper, for his support and guidance. Thanks and appreciation to the helpful people at UIRC and the Computer Centre for their support. I would also like to thank my institution and faculty members, without whom this term paper would have remained a distant dream. I also extend my heartfelt thanks to my family, friends and well wishers.
MANIT PANWAR
ABSTRACT
This paper presents a basic analysis of the Semantic Web and of how this vision can bring a revolution to the Web, business, enterprise, AI, security systems and many other areas. It is an effort toward making the biggest basket of information and data that we call the Internet so intelligent that we can get exactly the information we want, much as a person, when asked for something, tells us exactly what was asked for, thus saving both time and effort. The Semantic Web is going to bring a new dimension to the field of information technology by offering better search facilities on the Web, and its flame is spreading through the IT industry. It is a collective effort towards building the biggest database, one from which we can retrieve information in an intelligent and meaningful manner. This paper covers various aspects of the Semantic Web: its history, an introduction, tools, and the architecture layers.
INTRODUCTION
Almost everyone uses the Internet today for their own specific purposes: they surf, they chat, they search. If you look at the Internet as a sea of data, the data can be anything and the medium is the water (the Internet). Most of the time we know what data to search for, but we do not know the exact location where it is floating. The water (the Internet) is just a medium: you get in and go searching for the data you want, but where, and how, and what exactly does our data look like?
The Semantic Web, understood in a very broad sense, will make that dead data live. The data will tell us where it is (its location) and what it looks like (the exact data we are looking for) on a simple call; in other words, the data understands us. It is smart now: it carries information about itself and can tell us what it is, why it exists and where it comes from, just as a person gives an introduction. Now, in the unending sea, I can see where my data is, and on my call it will respond.
The next question is how the machine will understand the data it is communicating with. We have made the data so smart that it can introduce itself, but to whom? Not to me: as the end user I am not interested in knowing the data, I want to use it. The entity that needs to know the data is the machine. We therefore have to make our machines understand what the data is saying, and certainly not by changing the hardware; instead, if the information the data carries can be expressed in a form the machine can directly understand, everything will work fine. So our focus is basically on the data and the information it carries. The data has gone through many changes, both in the way we think of it and in the way it is used, on its way from a dead entity that carries no information about itself to a live and smart entity that carries enough semantic information for a machine to understand (or process) it.
THE EARLY SEMANTIC WEB
The original idea of the Semantic Web was to bring machine-readable descriptions to the data and documents already on the Web, in order to improve search and data usage. The Web was, and in most cases still is, a vast set of static and dynamically generated Web pages linked together. Pages are written in HTML (Hyper Text Markup Language), a language that is useful for publishing information intended only for human consumption. Humans can read Web pages and understand them, but their inherent meaning is not available in a way that allows interpretation by computers.
The Semantic Web aims at defining ways to allow Web information to be used by computers not only for display purposes, but also for interoperability and integration between systems and applications. One way to enable machine-to-machine exchange and automated processing is to provide the information in such a way that computers can understand it. To give meaning to Web information, new standards and languages are being investigated and developed. Well-known examples include the Resource Description Framework (RDF) (RDF 2002) and the Web Ontology Language (OWL) (OWL 2004). The descriptive information made available by these languages allows the types of resources on the Web, and the relationships between resources, to be characterized individually and precisely.
Today, the Semantic Web is not only about increasing the expressiveness of Web information to enable the automatic or semi-automatic processing of Web resources and Web pages. Academia and industry have realized that the Semantic Web can facilitate the integration and interoperability of intra- and inter-business processes and systems, as well as enable the creation of global infrastructures for sharing documents and data, making searching and reusing information easier.
THE SEMANTIC WEB
A major drawback of XML is that XML documents do not convey the meaning of the data contained in the document. Exchange of XML documents over the Web is only possible if the parties participating in the exchange agree beforehand on the exact syntactical format (expressed in XML Schema) of the data. The Semantic Web allows the representation and exchange of information in a meaningful way, facilitating automated processing of descriptions on the Web. Annotations on the Semantic Web express links between information resources on the Web and connect information resources to formal terminologies; these connective structures are called ontologies.
Ontologies form the backbone of the Semantic Web; they allow machine understanding of information through the links between the information resources and the terms in the ontologies. Furthermore, ontologies facilitate interoperation between information resources through links to the same ontology or links between ontologies. The term "ontology" originates from philosophy and has been adopted in the field of Computer Science with a slightly different meaning: an ontology is a formal, explicit specification of a shared conceptualization. In the late 1990s the idea of a Semantic Web boosted interest in the development of ontologies even further. The general conviction held by the W3C is that the Semantic Web needs an ontology language that is compatible with current Web standards and is in fact layered on top of them. The language needs to be expressed in XML and, preferably, should be layered on top of RDF(S).
Ontologies and the Semantic Web
A key feature of ontologies is that, through formal, real-world semantics and consensual terminologies, they interweave human and machine understanding. This important property of ontologies facilitates the sharing and reuse of ontologies among humans, as well as among machines. A major reason for the recent increase in interest in ontologies is the development of the Semantic Web, which can be seen as knowledge management on a global scale. Tim Berners-Lee, inventor of the current World Wide Web and director of the World Wide Web Consortium (W3C), envisions the Semantic Web as the next generation of the current Web. This "next generation" will expand upon the prowess of the current Web by adding machine-readable information and automated services. In this vision, "the explicit representation of the semantics underlying data, programs, pages, and other Web resources will enable a knowledge-based Web that provides a qualitatively new level of service." Ontologies provide such an explicit representation of semantics. The combination of ontologies with the Web has the potential to overcome many of the problems in knowledge sharing and reuse and in information integration.
Ontologies interweave human and computer understanding of symbols. These symbols, also called terms and relations, can be interpreted by both humans and machines. The meaning for a human is represented by the term itself, which is usually a word in natural language, and by the semantic relationships between terms. An example of such a human-understandable relationship is a superconcept-subconcept relationship (often referred to by the term "is-a"). Such a relationship denotes the fact that one concept (the superconcept) is more general than another (the subconcept). For instance, the concept Person is more general than Student.
The figure below shows an example "is-a" hierarchy (or taxonomy), where the more general concepts are located above the more specialized concepts. Concepts describe a set of objects in the real world. For example, the concept PhD-Student aims to capture all existing PhD students. One such PhD student is Mary, who is modeled in the figure as a box and has an instance-of relation to the concept PhD-Student. This instance-of relationship means that the actual object is captured by the concept PhD-Student. And because of the formal is-a relationships between the concepts PhD-Student, Researcher, Student, and Person, Mary must also be an instance of the concepts Researcher, Student, and Person. These relationships are fairly easy for a human reader to understand and, because the meanings of the relationships are formally defined, a machine can reason with them and draw the same conclusions as a human can. Relationships that are implicitly known to humans (e.g. a human knows that every student is a person) are encoded in a formally explicit way so that they can be understood by a machine. In a sense, the machine does not gain real "understanding"; rather, the understanding of humans is encoded in such a way that a machine can process it and draw conclusions through logical reasoning.
[Figure: example "is-a" hierarchy (taxonomy) with the concepts Person, Student, Researcher and PhD-Student, and the instance Mary.]
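To make the reasoning above concrete, the following is a minimal sketch (not part of the original paper) that encodes the same hierarchy with Python's rdflib library; the namespace http://example.org/ and the exact class names are illustrative assumptions.

from rdflib import Graph, Namespace, RDF, RDFS

EX = Namespace("http://example.org/")
g = Graph()

# Concept hierarchy: PhD-Student is-a Student and is-a Researcher; both is-a Person.
g.add((EX.Student, RDFS.subClassOf, EX.Person))
g.add((EX.Researcher, RDFS.subClassOf, EX.Person))
g.add((EX.PhDStudent, RDFS.subClassOf, EX.Student))
g.add((EX.PhDStudent, RDFS.subClassOf, EX.Researcher))

# Instance-of relationship: Mary is a PhD student.
g.add((EX.Mary, RDF.type, EX.PhDStudent))

# A machine can now draw the same conclusion as a human reader:
# Mary is also a Student, a Researcher and a Person.
for cls in g.transitive_objects(EX.PhDStudent, RDFS.subClassOf):
    print(EX.Mary, "is an instance of", cls)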
The Resource Description Framework
The Resource Description Framework (RDF) is the first language developed especially for the Semantic Web. RDF was developed as a language for adding machine-readable metadata to existing data on the Web. RDF uses XML for its serialization in order to realize the layering depicted in the Semantic Web language layer cake (Fig. 3.1). RDF Schema [20] extends RDF with some basic (frame-based) ontological modeling primitives, such as classes, properties, and instances. The instance-of, subclass-of, and subproperty-of relationships have also been introduced, allowing structured class and property hierarchies. RDF has the subject-predicate-object triple, commonly written as P(S,O), as its basic data model. An object of a triple can, in turn, function as the subject of another triple, yielding a directed labeled graph, where resources (subjects and objects) correspond to nodes, and predicates correspond to edges. Furthermore, RDF allows a form of reification (a statement about a statement), which means that any RDF statement can be used as a subject in a triple.
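As a minimal sketch (not from the paper), the triple data model and reification can be written down directly with Python's rdflib; the URIs and property names below are illustrative assumptions.

from rdflib import BNode, Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/")
g = Graph()

# A plain triple, creator(page1, "Tim") in P(S,O) notation.
g.add((EX.page1, EX.creator, Literal("Tim")))

# Reification: a statement about the statement above.
stmt = BNode()
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, EX.page1))
g.add((stmt, RDF.predicate, EX.creator))
g.add((stmt, RDF.object, Literal("Tim")))
g.add((stmt, EX.statedBy, EX.page2))  # the reified statement used as a subject

print(g.serialize(format="turtle"))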
LAYER ARCHITECTURE
Tim Berners-Lee proposed a nine-layer architecture, as shown above in Figure 1. It includes Unicode, URI, XML, Namespace, XML Schema, RDF, RDF Schema, Ontology, Digital Signature, Logic, Proof and Trust.
Following is a description of the various layers:

Unicode: Unicode is a standard way of allowing computers to consistently represent and manipulate text expressed in most of the world's writing systems.

URI: A Uniform Resource Identifier (URI) is a compact string of characters used to identify or name a resource on the Internet.

XML: Extensible Markup Language (XML) is a general-purpose specification for creating custom markup languages. It is classified as an extensible language because it allows its users to define their own elements. Its primary purpose is to help information systems share structured data, particularly via the Internet.

XML Schema: An XML schema is a description of a type of XML document, typically expressed in terms of constraints on its structure.

XML Namespace: An XML namespace is a collection of names (identified by a URI) used in XML documents as element types and attribute names.

RDF: The Resource Description Framework creates metadata about a document as a single entity, i.e. the author of the document, its creation date, its type and so on.

Ontology Vocabulary: This is the main layer. It consists of a hierarchical arrangement of the important concepts in the domain and descriptions of their properties. Some basic ontology languages are OWL, DAML-ONT and DAML+OIL.

Digital Signature: Digital signatures support the notion of trust. Their purpose is (a) to digitally sign documents and (b) to allow encryption to be applied to prevent unauthorized access.

Logic: This layer provides a monotonic logic. In this layer, rules can be exported but cannot be imported.

Proof: The goal is to make content smarter, so as to make it machine-understandable.

Trust: This is the topmost layer, where the trustworthiness of information is subjectively evaluated.
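As a small sketch of what the RDF layer's document metadata looks like in practice (not from the paper; the document URI is an illustrative assumption, and Python's rdflib with the Dublin Core vocabulary is used purely for illustration):

from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DC

doc = URIRef("http://example.org/termpaper.html")  # hypothetical document
g = Graph()
g.add((doc, DC.creator, Literal("Manit Panwar")))   # author of the document
g.add((doc, DC.date, Literal("2009-08-02")))        # creation date
g.add((doc, DC.type, Literal("Term paper")))        # type of the document

print(g.serialize(format="xml"))  # RDF/XML: the RDF layer serialized on top of XML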
CHALLENGES FOR A NEW SEMANTIC WORLD
As with every technological evolution, the Semantic Web and ontologies need to promote their unique value proposition for specific target groups in order to achieve adoption. A common pitfall in studies of the Semantic Web is a limited focus on "technological perspectives" or, at the other extreme, the difficulty of communicating the underlying capacity of semantics and ontologies to meet critical real-world challenges. Some of the challenges for the Semantic Web include vastness, vagueness, uncertainty, inconsistency and deceit. Automated reasoning systems will have to deal with all of these issues in order to deliver on the promise of the Semantic Web.

Vastness: The World Wide Web contains at least 48 billion pages as of this writing (August 2, 2009). The SNOMED CT medical terminology ontology contains 370,000 class names, and existing technology has not yet been able to eliminate all semantically duplicated terms. Any automated reasoning system will have to deal with truly huge inputs.

Vagueness: These are imprecise concepts like "young" or "tall". Vagueness arises from the vagueness of user queries, of the concepts represented by content providers, of the matching of query terms to provider terms, and from trying to combine different knowledge bases with overlapping but subtly different concepts. Fuzzy logic is the most common technique for dealing with vagueness.

Uncertainty: These are precise concepts with uncertain values. For example, a patient might present a set of symptoms which correspond to a number of distinct diagnoses, each with a different probability.
Probabilistic reasoning techniques are generally employed to address uncertainty.

Inconsistency: These are logical contradictions which will inevitably arise during the development of large ontologies, and when ontologies from separate sources are combined. Deductive reasoning fails catastrophically when faced with inconsistency, because "anything follows from a contradiction". Defeasible reasoning and paraconsistent reasoning are two techniques which can be employed to deal with inconsistency.

Deceit: This is when the producer of the information is intentionally misleading the consumer of the information. Cryptographic techniques are currently utilized to ameliorate this threat.

This list of challenges is illustrative rather than exhaustive, and it focuses on the challenges to the "unifying logic" and "proof" layers of the Semantic Web. The World Wide Web Consortium (W3C) Incubator Group for Uncertainty Reasoning for the World Wide Web (URW3-XG) final report lumps these problems together under the single heading of "uncertainty". Many of the techniques mentioned here will require extensions to the Web Ontology Language (OWL), for example to annotate conditional probabilities. This is an area of active research.
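As a toy illustration of the fuzzy-logic approach to vagueness mentioned above (not from the paper; the height thresholds are arbitrary assumptions), a vague concept such as "tall" can be modeled as a degree of membership between 0 and 1 rather than a yes/no property:

def tall(height_cm: float) -> float:
    """Degree (0.0 to 1.0) to which a given height counts as 'tall'."""
    if height_cm <= 160:      # clearly not tall (assumed threshold)
        return 0.0
    if height_cm >= 190:      # clearly tall (assumed threshold)
        return 1.0
    return (height_cm - 160) / 30.0   # linear ramp in between

for h in (150, 170, 185, 195):
    print(h, "cm -> tall to degree", round(tall(h), 2))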
Semantic Web Application Areas
As a result of the pervasive and user-friendly digital technologies emerging within our information society, Web content is increasingly multiform, inconsistent and very dynamic. Such content is unsuitable for machine processing and necessitates human interpretation, with its respective costs in time and money for businesses. To remedy this, current approaches aim at abstracting away this complexity (e.g. by using ontologies) and at offering new and enriched services able to process those abstractions (e.g. by mechanized reasoning) in a fully automated way. This abstraction layer is the subject of a very dynamic activity in research, industry and standardization which is usually called the "Semantic Web". The initial application of Semantic Web technology focused on Information Retrieval (IR), where access through semantically annotated content, instead of classical (even sophisticated) statistical analysis, aimed to give far better results (in terms of precision and recall indicators). The next natural extension was to apply IR in the integration of enterprise legacy databases in order to leverage existing company information in new ways. Present research has turned to the seamless integration of heterogeneous and distributed applications and services. Some of the application areas of the Semantic Web are:

1) Knowledge Management
Knowledge is one of the key success factors for enterprises, both today and in the future. Therefore, company knowledge management has been identified as a strategic tool. However, while information technology is one of the foundational elements of KM, KM is also interdisciplinary by its nature.
In particular, it includes human resource management, enterprise organization and culture. We view KM as the management of the knowledge arising from business activities, aiming at leveraging both the use and the creation of that knowledge for two main objectives: the capitalization of corporate knowledge and durable innovation fully aligned with the strategic objectives of the organization.
The development of knowledge portals serving the needs of companies or communities is still a manual process. Ontologies and related metadata provide a promising conceptual basis for generating parts of such knowledge portals. Obviously, among others, conceptual models of the domain, of the users and of the tasks are needed. The generation of knowledge portals has to be supplemented with the semi-automated evolution of portals. As business environments and strategies change rather rapidly, KM portals have to be kept up to date. Evolution of portals should also include some mechanism to 'forget' outdated knowledge.
KM solutions based on a combination of intranet-based functionalities and mobile functionalities will be available in the very near future. Semantic Web technology is a promising approach to meeting the needs of mobile environments, such as location-aware personalization and adaptation of the presentation to the specific needs of mobile devices, i.e. the presentation of the required information at an appropriate level of granularity. Knowledge management is obviously a very promising area for exploiting Semantic Web technology. Document-based KM solutions have already reached their limits, whereas semantic technology opens the way to meeting KM requirements in the future.
2) E-Commerce
Electronic commerce is mainly based on the exchange of information between the involved stakeholders using a telecommunication infrastructure. There are two main scenarios: Business-to-Customer (B2C) and Business-to-Business (B2B). B2C applications enable service providers to promote their offers, and customers to find offers which match their demands. By providing unified access to a large collection of frequently updated offers and customers, an electronic marketplace can match the demand and supply processes within a commercial mediation environment. B2B applications have a long history of using electronic messaging to exchange information related to services previously agreed among two or more businesses.
A knowledge-based approach has the potential to significantly accelerate the penetration of electronic commerce within vertical industry sectors, by enabling interoperability at the business level and reducing the need for standardization at the technical level. This will enable services to adapt to the rapidly changing online environment. Knowledge-based applications of this kind use one or more shared ontologies to integrate heterogeneous information systems and allow common access for humans and computers. This enforces the shared ontology as the standard ontology for all participating systems, thereby removing the semantic heterogeneity from the information systems. The heterogeneity is a problem because the systems to be integrated are already operational and it is too costly to redevelop them. A linguistic ontology is sometimes used to assist in the generation of the shared ontology, or is used as a top-level ontology describing very general concepts like space, time, matter, object, event, action, etc., from which the shared ontologies can inherit.
Benefits include the integration of heterogeneous information sources, which can improve interoperability, and more effective use and reuse of knowledge resources.

3) Biosciences and Medical Applications
The medical domain is a favourite target for Semantic Web applications, just as the expert system was for Artificial Intelligence applications 20 years ago. The medical domain is very complex: medical knowledge is difficult to represent in a computer format, making the sharing of information even more difficult. Semantic Web solutions have become very promising in this context. One of the main mechanisms of the Semantic Web - resource description using annotation principles - is of major importance in the medical informatics domain, especially as regards the sharing of these resources (e.g. medical knowledge on the Web or genomic databases). Web services technology allows us to imagine some solutions to the interoperability problem, which is substantial in medical informatics.

4) Other Areas
The diverse application areas of Semantic Technologies also include the following:
• Ambient Intelligence
• Cognitive Systems
• Data Integration
• Multimedia Data Management
• Software Engineering
• Machine Learning
• eScience
• Information Extraction
• Grid Computing
• Peer-to-Peer Systems
• eGovernment
HISTORY OF SEMANTIC WEB
Following are the key milestones, year by year, in the history of the Semantic Web:

1989: Tim Berners-Lee proposed the WWW to CERN as a development project.
1991: A portable browser was available and distributed.
1994: Netscape was released as a commercial browser; Yahoo acted as a search engine; there were 2,500 web servers at that time.
1995: There were 73,500 web servers; Microsoft released Internet Explorer and the W3C was established as a standards body.
1996: The Semantic Web was initiated.
1997: The first working draft of the RDF language to define metadata was available.
1998: Tim Berners-Lee published a roadmap to the Semantic Web that included a query language, inference rules and proof validation.
1999: RDF became a W3C recommendation - a crucial step towards the Web's interoperability and functionality.
2001: The vision of the Semantic Web was broadened further to include trust.
TOOLS USED IN SEMANTIC WEB
The World Wide Web is an interesting paradox -- it's made with computers but for people. The sites you visit every day use natural language, images and page layout to present information in a way that's easy for you to understand. Even though they are central to creating and maintaining the Web, the computers themselves really can't make sense of all this information. They can't read, see relationships or make decisions like you can. The Semantic Web proposes to help computers "read" and use the Web. The big idea is pretty simple -- metadata added to Web pages can make the existing World Wide Web machine-readable. This won't bestow artificial intelligence or make computers self-aware, but it will give machines tools to find, exchange and, to a limited extent, interpret information. It's an extension of, not a replacement for, the World Wide Web. That probably sounds a little abstract, and it is. While some sites are already using Semantic Web concepts, a lot of the necessary tools are still in development. In this
article, we'll bring the concepts and tools behind the Semantic Web down to earth by applying them to a galaxy far, far away.
A Listing of a Few New Semantic Web Tools
AMALGAM: The AMALGAM (Automatic Mapping Among Lexico-Grammatical Annotation Models) project is an attempt to create a set of mapping algorithms to map between the main tagsets and phrase structure grammar schemes used in various research corpora. Software has been developed to tag text with up to 8 annotation schemes.

Amine: Amine is a multi-layer platform implemented in Java. It provides various engines and GUIs to build a wide variety of ontology-based applications, conceptual-graph-based applications, intelligent systems and multi-agent systems.

Anacubis: Anacubis is a visual analysis tool that lets its users visualize the relationships between entities in a collection of information. The visualization is rather similar to concept maps.

Exteca: The Exteca platform is an ontology-based technology written in Java for high-quality knowledge management and document categorisation. It can be used in conjunction with search engines.

Jena: Jena is a Java framework for constructing Semantic Web applications. It provides a programmatic environment for RDF, RDFS, OWL and SPARQL, and includes a rule-based inference engine. It can also be used as an RDF database via its Joseki layer. See the Jena discussion list for more information.

Pedro: Pedro is an application that creates data entry forms based on a data model written in a particular style of XML Schema. Users can enter data through the forms to create data files that conform to the schema. They can use controlled vocabularies to mark up text fields and have the application perform basic validation on field data.

Platypus Wiki: Platypus Wiki is an enhanced Wiki Wiki Web with ideas taken from the Semantic Web. It offers a simple user interface to create a wiki page plus metadata according to W3C standards. It uses RDF/RDFS and OWL to create ontologies and manage metadata.

POR: Protege+OWL+Ruby (POR) Utilities provides an ontology and a set of Ruby classes and methods to simplify the development of Protege+OWL ontology-driven applications. At the moment the project is limited to JRuby.

ASIS: ASIS (Ada Semantic Interface Specification) for GNAT on gcc. ASIS is a published international ISO standard (ISO/IEC 15291:1999). ASIS-based tools are available as well.

ATLAS: ATLAS (Architecture and Tools for Linguistic Analysis Systems) is a joint initiative of NIST, MITRE and the LDC to build a general-purpose annotation architecture and a data interchange format. The starting point is the annotation graph model, with some significant generalizations.

Swoogle: A Semantic Web search engine with 1.5 M resources.

SWOOP: A lightweight ontology editor.

Raptor: The Raptor RDF parser toolkit is a free software / open source C library that provides a set of parsers and serializers that generate Resource Description Framework (RDF) triples by parsing syntaxes, or serialize the triples into a syntax. The supported parsing syntaxes are RDF/XML, N-Triples, Turtle, RSS tag soup including Atom 1.0 and 0.3, and GRDDL for XHTML and XML. The serializing syntaxes are RDF/XML (regular and abbreviated), N-Triples, RSS 1.0, Atom 1.0 and Adobe XMP.

Rasqal: Rasqal is a C library for querying RDF, supporting the RDQL and SPARQL languages. It provides APIs for creating a query and parsing query syntax. It features pluggable triple-store source and matching interfaces, an engine for executing the queries and an API for manipulating results as bindings. It uses the Raptor RDF parser to return triples from RDF content and can alternatively work with the Redland RDF library's persistent triple stores. It is portable across many POSIX systems.
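Query libraries such as Rasqal and Jena execute SPARQL queries over RDF data. As a minimal sketch of what such a query looks like (not from the paper; Python's rdflib and the example data are used purely for illustration):

from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.Zurich, RDF.type, EX.City))
g.add((EX.Zurich, EX.population, Literal(400000)))
g.add((EX.Bern, RDF.type, EX.City))
g.add((EX.Bern, EX.population, Literal(130000)))

query = """
PREFIX ex: <http://example.org/>
SELECT ?city ?pop WHERE {
    ?city a ex:City ;
          ex:population ?pop .
}
ORDER BY DESC(?pop)
"""
for row in g.query(query):          # list cities, largest population first
    print(row.city, row.pop)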
Protégé
Protégé provides a growing user community with a suite of tools to construct domain models and knowledge-based applications with ontologies. Protégé is a free, open-source platform developed by Stanford Medical Informatics with support from:
- Defense Advanced Research Projects Agency
- National Cancer Institute
- National Institute of Standards and Technology
- National Library of Medicine
- National Science Foundation
with additional support from its affiliates:
- DaimlerChrysler
- iSOCO: Intelligent Software for the Networked Economy
At its core, Protégé implements a rich set of knowledge-modeling structures and actions that support the creation, visualization, and manipulation of ontologies in various representation formats. Protégé can be customized to provide domain-friendly support for creating knowledge models and entering data. Further, Protégé can be extended by way of a plug-in architecture and a Java-based Application Programming Interface (API) for building knowledge-based tools and applications. The Protégé platform supports two main ways of modeling ontologies:
The Protégé-Frames editor enables users to build and populate ontologies that are frame-based, in accordance with the Open Knowledge Base Connectivity protocol (OKBC). In this model, an ontology consists of a set of classes organized in a subsumption hierarchy to represent a domain's salient concepts, a set of slots associated with classes to describe their properties and relationships, and a set of instances of those classes: individual exemplars of the concepts that hold specific values for their properties.
The Protégé-OWL editor enables users to build ontologies for the Semantic Web, in particular in the W3C's Web Ontology Language (OWL). "An OWL ontology may include descriptions of classes, properties and their instances. Given such an ontology, the OWL formal semantics specifies how to derive its logical consequences, i.e. facts not literally present in the ontology, but entailed by the semantics. These entailments may be based on a single document or multiple distributed documents that have been combined using defined OWL mechanisms". Protégé ontologies can be exported into a variety of formats including RDF(S), OWL, and XML Schema.
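As a rough sketch of the kind of class / property / instance structure such an OWL ontology contains (not from the paper; Protégé itself is a graphical editor, and the namespace and names below are illustrative assumptions expressed with Python's rdflib):

from rdflib import Graph, Namespace, RDF, RDFS
from rdflib.namespace import OWL

EX = Namespace("http://example.org/university#")
g = Graph()

# Classes in a subsumption hierarchy
g.add((EX.Person, RDF.type, OWL.Class))
g.add((EX.Student, RDF.type, OWL.Class))
g.add((EX.Student, RDFS.subClassOf, EX.Person))

# A property with a domain and a range
g.add((EX.supervises, RDF.type, OWL.ObjectProperty))
g.add((EX.supervises, RDFS.domain, EX.Person))
g.add((EX.supervises, RDFS.range, EX.Student))

# An individual (instance) of one of the classes
g.add((EX.Mary, RDF.type, EX.Student))

g.serialize(destination="university.owl", format="xml")  # export as RDF/XML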
Outstanding Protégé features
Some of the particular features of Protégé, not available in many other ontology building tools, are the following:
- Automatic generation of graphical user interfaces, based on user-defined models, for acquiring domain instances
- An extensible knowledge model and architecture
- Scalability to very large knowledge bases
Semantic MediaWiki (Semantic annotation tool)
Semantic MediaWiki (SMW) is a free extension of MediaWiki – the wiki-system powering Wikipedia – that helps to search, organize, tag, browse, evaluate, and share the wiki's content. While traditional wikis contain only texts which computers can neither understand nor evaluate, SMW adds semantic annotations that bring the power of the Semantic Web to the wiki.
Introduction to Semantic MediaWiki
Wikis have become a great tool for collecting and sharing knowledge in communities. This knowledge is mostly contained within texts and multimedia files, and is thus easily accessible for human readers. But wikis get bigger and bigger, and it can be very time-consuming to look for an answer inside a wiki. As a simple example, consider the following question a user might have: « What are the hundred world-largest cities with a female mayor? » Wikipedia should be able to provide the answer: it contains all large cities, their mayors, and articles about the mayor that tell us about their gender. Yet the question is almost impossible to answer for a human, since one would have to read all articles about all large cities first! Even if the answer is found, it might not remain valid for very long. Computers can deal with large datasets much easier, yet they are not able to support us very much when seeking answers from a wiki: Even sophisticated programs cannot yet read and «understand» human-language texts unless the topic and language of the text is very restricted. The wiki's keyword search does not help either in discovering complex relationships.
Semantic MediaWiki enables wiki communities to make some of their knowledge computer-processable, e.g. to answer the above question. The hard problem for the computer is to find out what the words on a wiki page (e.g. about cities) mean. Articles contain many names, but which one is the current mayor? Humans can easily grasp the problem by looking into a language edition of Wikipedia that they do not understand (Korean is a good start unless you are fluent there). While single tokens (names, numbers, …) might be readable, it is impossible to understand their relevance in the article. Similarly, computers need some help in making sense of wiki texts. In Semantic MediaWiki, editors therefore add «hints» to the information in wiki pages. For example, someone can mark a name as being the name of the current mayor. This is done by editors who modify a page and put some special text-markup around the mayor's name. After this, computers can access this information (of course they still do not «understand» it, but they can search for it if we ask them to), and support users in many different ways.

Where SMW can help
Semantic MediaWiki introduces some additional markup into the wiki-text which allows users to add "semantic annotations" to the wiki. While this at first appears to make things more complex, it can also greatly simplify the structure of the wiki, help users to find more information in less time, and improve the overall quality and consistency of the wiki. To illustrate this, we provide some examples from the daily business of Wikipedia:

Manually generated lists. Wikipedia is full of manually edited listings such as this one. Those lists are prone to errors, since they have to be updated manually. Furthermore, the number of potentially interesting lists is huge, and it is impossible to provide all of them in acceptable quality. In SMW, lists are generated automatically like this. They are always up to date and can easily be customized to obtain further information.

Searching information. Much of Wikipedia's knowledge is hopelessly buried within millions of pages of text, and can hardly be retrieved at all. For example, at the time of this writing, there is no list of female physicists in Wikipedia.
When trying to find all women of this profession who are featured in Wikipedia, one has to resort to textual search. Obviously, this attempt is doomed to fail miserably. Note that among the first 20 results, only 5 are about people at all, and that Marie Curie is not contained in the whole result set (since "female" does not appear on her page). Again, querying in SMW easily solves this problem (in this case even without further annotation, since the existing categories suffice to find the results).

Inflationary use of categories. The need for better structuring becomes apparent from the enormous use of categories in Wikipedia. While this is generally helpful, it has also led to a number of categories that would be mere query results in SMW. For some examples, consider the categories Rivers in Buckinghamshire, Asteroids named for people, and 1620s deaths, all of which could easily be replaced by simple queries that use just a handful of annotations. Indeed, in this example Category:Rivers, Property:located in, Category:Asteroids, Category:People, Property:named after, and Property:date of death would suffice to create thousands of similar listings on the fly, and to remove hundreds of Wikipedia categories.

Inter-language consistency. Most articles in Wikipedia are linked to the corresponding pages in different languages, and this can be done for SMW's semantic annotations as well. With this knowledge, you can ask for the population of Beijing as given in the Chinese Wikipedia without reading a single word of that language. This can be exploited to detect possible inconsistencies that can then be resolved by editors. For example, the population of Edinburgh at the time of this writing is different in the English, German, and French Wikipedias.
External reuse. Some desktop tools today make use of Wikipedia's content, e.g. the media player Amarok displays articles about artists during playback. However, such reuse is limited to fetching some article for immediate reading. The program cannot exploit the information (e.g. to find songs of artists that have worked for the same label), but can only show the text in some other context. SMW leverages a wiki's knowledge to be useable outside the context of its textual article. Since semantic data can be published under a free license, it could even be shipped with software to save bandwidth and download time.
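As a small sketch of what such annotations and queries look like in SMW's wiki markup (not from the paper; the property names and the mayor's name below are made-up examples), an editor marks the mayor's name on a city's page with a property:

    Berlin is a large city. Its current mayor is [[Has mayor::Jane Doe]].

Elsewhere, an inline query can then assemble an always-up-to-date list from those annotations:

    {{#ask: [[Category:City]] [[Has mayor::+]]
     | ?Has mayor
     | ?Population
     | sort=Population
     | order=desc
    }}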
AceWiki (A Natural and Expressive Semantic Wiki)
AceWiki is a semantic wiki that is powerful and at the same time easy to use. Making use of the controlled natural language Attempto Controlled English (ACE), the formal statements of the wiki are shown in a way that looks like natural English. The use of controlled natural language makes it easy for everybody to understand the semantics of the wiki.

Introduction
AceWiki shows the formal semantics in controlled English. Thus, the users do not need to cope with complicated formal languages like RDF or OWL. Unlike in most other semantic wikis, the semantics are contained directly in the article texts and not in some form of annotations. Ontological entities like individuals, concepts, and properties are mapped one-to-one to linguistic entities like proper names, nouns, of-constructs, and verbs. In order to help users to write correct ACE sentences, AceWiki provides a predictive editor.
Design
The main goal of AceWiki is to improve knowledge aggregation and representation. AceWiki should be easier to use and understand than other semantic wikis. In addition, it should support a higher degree of expressivity. Unlike in other semantic wikis, the formal statements are not contained in “annotations” and are not considered “metadata”; they are the main content of the wiki. In order to achieve good usability and still support a high degree of expressivity, AceWiki follows three design principles: naturalness, uniformity, and strict user guidance. By naturalness we mean that the formal semantics has a direct connection to natural language. Uniformity means that only one language is used at the user-interface level. Strict user guidance, finally, means that a predictive editor ensures that only well-formed statements are created by the user. We will now discuss these three principles and show how they are achieved in AceWiki.
Naturalness
AceWiki is natural in the sense that the ontology is represented in a form that is very close to natural language. This requires a direct mapping of ontological entities to natural language words. In AceWiki, individuals are represented as proper names (e.g. “Switzerland”), concepts are represented as nouns (e.g. “country”), and roles are represented as transitive verbs (e.g. “overlaps-with”) or as of-constructs (e.g. “part of”). Using those words together with the predefined function words of ACE (e.g. “every”, “if”, “then”, “something”, “and”, “or”, “not”), we can express ontological statements as ACE sentences. Since every ACE sentence is a valid
English sentence, those ontological statements can be immediately understood by any English speaker.
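A few illustrative statements in the style of ACE, built from the kinds of words listed above (the particular facts are made-up examples, not taken from the paper):

    Switzerland is a country.
    Every country is a place.
    Zurich is a city and is a part of Switzerland.
    If something X overlaps-with something Y then Y overlaps-with X.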
Uniformity
The Semantic Web community defines three categories of languages at the logic level of the Semantic Web stack: ontology languages (e.g. OWL), rule languages (e.g. SWRL), and query languages (e.g. SPARQL). Most languages cover only one of those categories, and languages of different categories usually look very different. We claim that at the user-interface level ideally one single language should cover all those categories. In the background there might be several internal languages, but the users should need to learn only one. For many users who are not familiar with formal conceptualizations, learning one formal language is already a hard task. We should not make this learning effort harder than necessary. ACE is able to represent these different kinds of formal statements in a very natural way. In the case of queries, the distinction does not need to be made explicit: if a sentence ends with a question mark then it is clear to the user that it is a query and not an assertion. However, queries are still future work for AceWiki. AceWiki classifies declarative ACE sentences into three categories: some can be translated into OWL, others can be translated into SWRL, and finally there are ACE sentences that have no representation in OWL or SWRL at all. In ACE, this distinction is not visible, and we think that users should not have to bother about it. The only thing they need to know is that if an OWL reasoner is used, only the OWL-compliant sentences are considered.
Strict User Guidance
Learning a new formal language is normally accompanied by frequent syntax error messages from the parser. Wikis are supposed to enable easy and quick modification of content, and syntax errors can certainly be a major hindrance in this respect, especially for new users. This problem can be solved by guiding users during the creation of new statements in a strict manner. By strict we mean that the creation of syntactically incorrect sentences is simply made impossible. This can be achieved by a predictive editor that guides the user step by step and ensures syntactic correctness. Syntactic correctness can be subdivided into lexical correctness and grammatical correctness. Lexical correctness means that only the words that are defined in a certain lexicon are used. Grammatical correctness means that the grammar rules are respected. To some degree, a predictive editor could also take care of semantic correctness. It could prevent the users from adding statements that introduce inconsistency into the underlying ontology. If the verb “meets”, for example, is defined in the ontology as a relation between humans, then the predictive editor could prevent the user from writing sentences like “a man meets a car”, assuming that the ontology says that “car” is not human. AceWiki has a predictive editor that is used for the creation and modification of ACE sentences. It ensures lexical and grammatical correctness of the resulting sentences. Semantic correctness is not enforced, but the words that seem to be semantically suitable are shown first in the list. The suitable words are retrieved on the basis of the
hierarchy of concepts and roles and the domain and range restrictions of roles. For example, if a user creates the incomplete sentence “Limmat flows-through” and there is a range restriction that says “if something flows-through something Y then Y is a city” then the individuals that are known to be cities are shown first in the list.
CONCLUSION
The evolution of the Semantic Web has opened a new window in IT and a hope for better search results on the Web. Tim Berners-Lee rightly says that the Semantic Web will be the next generation of the current Web and the next IT revolution. It is based on the fundamental idea that Web resources should be annotated with “Semantic Markup” that captures information about their meaning. The Semantic Web is not far away once we understand and work on the various ways to make the current Web a more meaningful and intelligent Web. This can be achieved by knowing about the various tools, technologies, layers, etc. of the Semantic Web, which have been summarized in this paper.
References
1. en.wikipedia.org/wiki/Semantic_Web
2. www.w3.org/2001/sw/
3. http://semanticweb.org/wiki/Tools
4. semantic-mediawiki.org
5. attempto.ifi.uzh.ch/acewiki
6. http://www.w3c.com
7. http://en.wikipedia.org/wiki/Taxonomy
8. http://www.m-w.com/dictionary/taxonomy
9. Web Ontology Working Group (http://www.w3.org/2001/sw/WebOnt/)
10. http://www.w3.org/TR/owl-features/
11. The Semantic Web - Real-World Applications from Industry
12. Enabling Semantic Web Services, Springer, Nov. 2006