Scaffolding The Semantic Web (final)

  • Uploaded by: Aaron Helton
  • October 2019



Aaron B. Helton
St. Edward’s University
MCIS 6309.01
Scaffolding the Semantic Web
April 12, 2008

Abstract

Tim Berners-Lee described a vision of the World Wide Web in which information is understandable to both humans and machines, enabling machines to act on that information in ways that only humans had been able to before. In practical terms, this means being able to answer specific questions informed by disparate data sources, to provide trustworthy and meaningful connections between data, and to enable the continual reuse of data in any way that can be conceived. While a number of tools and specifications have been created to facilitate the implementation of the Semantic Web, very few practical implementations have appeared to date. Further, most of the current use cases listed on the World Wide Web Consortium's (W3C) site indicate that it is being implemented in concentrated, domain-specific ways (Herman, I., & Stephens, S., 2007). This paper demonstrates a methodology by which one can approach the creation of new Semantic Web applications for the synthesis and creation of knowledge. It uses as its case example a small subset of the United States tourism industry.

Introduction to the Semantic Web

Before exploring how best to achieve a Semantic Web implementation, one must understand what the Semantic Web is and how it can be of use. Tim Berners-Lee described the Semantic Web as a Web in which computers “become capable of analyzing all the data on the Web…machines talking to machines.” (“Semantic Web,” 2008) What this means in practical terms is that the Semantic Web consists of data that has been annotated with categorical and descriptive metadata, which allows it to be used and reused in countless forms. In theoretical terms, it represents a shift from proprietary, incompatible data formats to a unified and predictable data format (this includes the metadata) with programmatically determinable characteristics, such as categories, connections to other data, and rules describing how to process the data. In time this will allow machines to form the network paths between data points, so that one only has to ask a question to be presented with a concise and accurate answer drawn from multiple relevant data sources. Other possibilities include truly intelligent agents, or computer processes that act on behalf of humans, and the ability to take vast stores of seemingly disparate data and recombine them into new information. The net effect of this development will be greater potential for knowledge creation and discovery and, ultimately, greater reusability of data.

With some recent news reporting on the Semantic Web, one might believe the concept to be fairly new. In fact, Berners-Lee made the statement quoted above in 1999, making the Semantic Web concept almost ten years old. Despite this, only a handful of Semantic Web implementations exist. Among the young implementations are such sites as GeoNames and Twine. Until recently, even search engines, which could benefit the most from the Semantic Web, had steered clear of it; in March 2008, however, Yahoo! announced its intention to add Semantic Web capabilities to its own search engine.
(Arrington, M., 2008) In an official blog post regarding this announcement, Yahoo! Search Product Management Director Amit Kumar suggested that development and adoption of the Semantic Web have lagged because “[w]ithout a killer semantic web app for consumers, site owners have been reluctant to support standards like RDF, or even microformats.” (Kumar, A., 2008) Thus the next step toward the realization of the Semantic Web is the creation of applications that pave the way.

Components of the Semantic Web

XML, XML Schema, and RDF

XML, or Extensible Markup Language, provides the basic syntactical elements to describe the resources on the Semantic Web, but gives no indication of the meanings associated with those resources. XML Schema provides structural guidelines for the documents that describe resources. (Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E., Yergeau, F., & Cowan, J., 2006) The specification that confers meaning on resources is the Resource Description Framework, or RDF, a language for describing Semantic Web resources. RDF defines elemental attributes of resources, or the sets of attributes that belong to a resource or a class of resources. It is also through RDF that resources gain annotations about their basic relationships with other resources. Assertions in RDF are called triples, and they consist of subject-predicate-object groupings. (Manola, F., & Miller, E., 2004) For instance, consider two resources, a poem called “The Waste Land” and an author named T.S. Eliot. In an RDF triple, one could make the assertion that the poem called “The Waste Land” (subject) has the author (predicate) T.S. Eliot (object). This establishes a basic relationship between a poem and an author, both of which are resources. Further, each of these resources has a number of attributes to describe it. The poem resource has a name and any other information that might be useful in distinguishing this poem from a group of poems. The
author resource also has a name, in this case the name of a person; since person is another common resource type, additional information could be linked via the author’s person resource. In this way, RDF helps to build out a basic structure of the Semantic Web.

RDFS and OWL

Extending RDF is the RDF Schema (RDFS) specification. RDFS allows a richer ontological vocabulary to be constructed, and allows the domain and range of particular properties to be defined. While RDF is primarily concerned with elemental resources, RDFS is capable of expressing resource classes as resources in their own right. Classes and subclasses have members and are members of other classes, and RDFS describes the relationships between these classes. This imposes a hierarchy on the RDF network. The Web Ontology Language (OWL) builds on RDFS and has become the more common language for authoring ontologies; it shares the core features of RDFS while adding a richer vocabulary. It makes assertions about class membership and class relationships via axioms dictating any constraints. (McGuinness, D., & Van Harmelen, F., 2004) A reasonable assertion based on the poem, person, author example could say that “All authors are also people.” This simple statement establishes that every resource that is an author is also a person resource; thus author is a subclass of person (note that no reverse assertion can be made here; not every person is an author, which is a reasonable approach in most circumstances). It is through such assertions that the Semantic Web gains its usefulness, since arbitrary connections between resources can be explored in the pursuit of new knowledge.

SPARQL

As all of the above specifications can ultimately be reduced to an RDF document once instantiated, it is natural that there be some way to ask questions of the data represented by the
various resources and assertions. SPARQL (which stands for the recursively named SPARQL Protocol and RDF Query Language) is the query language created for this purpose. Given an ontology suited to the particular resources, one could construct a query that fills in the unknown part of an RDF triple. SPARQL ties all of the core specifications together so that meaningful information can be assembled easily and in an easily understood syntax. (Prud'Hommeaux, E., & Seaborne, A., 2008)

Putting it all together

All of this may seem to be academic so far. Indeed, the dearth of usable Semantic Web applications seems to reinforce this idea. What is missing is a methodology for applying these specifications to any data set. The challenges in implementing Semantic Web specifications mostly involve where to start. For instance, it is well and good to suggest that content creators go forth and add semantics to their content, but this does nothing for the vast troves of content already in existence. Nor does it address the very real problem that most publishing software for the Web (or, for that matter, any other delivery platform) has no semantic capability, meaning that anyone who wishes to publish semantically annotated documents to the Web must do so manually or not at all. Given no incentive on the part of content creators to add semantic annotation to their content, past, present or future, and no ready mechanism to ensure its automatic annotation, it is no wonder the Semantic Web has not materialized yet.

These challenges can be overcome. The biggest challenges are how to determine annotation for existing documents and how to ensure automatic annotation going forward. A number of projects aimed at these already exist. For conversion of existing content, MIT’s SIMILE (Semantic Interoperability of Metadata and Information in unLike Environments) project has a number of programs available. “SIMILE seeks to enhance inter-operability among
digital assets, schemata/vocabularies/ontologies, metadata, and services.” (SIMILE:About, 2007) A related set of projects, called Microformats, also aims to lower the burden of content annotation by defining a number of RDF-compatible data formats (HTML vCard and iCal specifications, for instance). Microformats are “a set of simple, open data formats built upon existing and widely adopted standards.” (About microformats, n.d.) These efforts are predicated on the idea that certain kinds of content are already annotated in standardized ways, such as when they conform to some published RFC specification (SMTP, or email, headers are a perfect example). However, these are still limited to published specifications, and they do not take into account any heuristic information about non-standard content, such as blocks of free text.

One option for adding semantic information to existing content is to explore text mining in such a way that previously unspecified details can be programmatically determined. In practice, this entails a statistical breakdown of a document’s words and of how the word distribution and terminology fit with a pre-defined set of word clusters. The goal is automatic classification. This approach was documented in 1999 by a group at the University of Wolverhampton, UK, who proposed to use the Dewey Decimal Classification (DDC) for automatic classification of Web documents, an approach that should work in all but the most specialized fields. In their own words:

"DDC is considered appropriate because it is a universal classification scheme covering all subject areas and geographically global information. It is familiar to anyone accustomed to using a library and has multilingual scope. The hierarchical nature enables the users of a search engine to refine their search from rough classifications to increasingly more accurate ones." (Jenkins, C., Jackson, M., Burden, P., & Wallis, J., 1999)

Such an automated classification system makes possible the gathering of domain-specific information about a given document, and its efficacy in this regard is limited only by the skill
with which the underlying ontology has been authored. Further, if such a system can perform classification against defined domains, it should be capable of performing such classification against any other arbitrary but well-defined ontology. Lists of domain keywords can help to establish links between documents that, while not necessarily falling within the same domain, nevertheless have some overlap between them. This fusion of knowledge from multiple domains could fuel knowledge discovery and creation.

At this point, some thoughts about feasibility are in order. Given that few or no existing publishing systems perform any semantic markup on their documents, the whole exercise may still appear to be infeasible at worst and costly at best. It is not trivial to retrofit publishing systems with automatic classifiers, nor is it trivial to have content owners retroactively process their published documents. If one were creating a semantic search appliance for use inside a company where the document volume is more manageable, such an undertaking would be possible. The proposed system, however, while relying on a tightly defined domain, is not intended for the internal use of any company. It is much more closely related to one of the large search engines, Google or Yahoo!, for instance. Therefore it makes the most sense to equip the automatic classification system with some mechanism for discovering content that it can process and to which it can add its semantic markup. Google does this by following links within documents; since this system is not a general-purpose search application, such an approach may not be appropriate. Adding a user-driven social networking approach, however, is.

Extending the Core: Wisdom of the Masses and FOAF

Social networking sites such as Digg, Facebook, LinkedIn, and a host of others have shown that large groups of people are often good at making certain kinds of decisions. Digg is probably the most exemplary and most
applicable in this case, as it shows how content can be added once and moderated by a large group of people. It is certainly possible, as with any other system, for something like this to be subverted by special interests, but such subversion appears to be the exception and not the rule. Thus the semantic search appliance should have some way for people to add information and provide input on the efficacy of any other information submitted to the site, trusting that once the normative forces of large crowds have taken hold, such information will be as reliable as possible (for additional reference, one need only turn to Wikipedia, the online encyclopedia, as a perfect example of how crowdsourcing can work well). Domain experts will soon emerge, and information that they author or add will eventually be regarded as trustworthy.

Next, the other popular component of social networks is the actual social aspect. In fact, the hallmark of social networks is the ability to build networks of people based on shared interests, shared ability, or any other ad hoc grouping. The general concept is known as Friend of a Friend, or FOAF. Such networks serve to reinforce the filters created by user submit/user moderate mechanisms. In a sense, these groupings are ad hoc ontologies of people or other resources that work to build the so-called Web of Trust. So a semantic search application should also possess social network capabilities to help flesh out the gossamer strands between resources.

The system, then, looks like this. With XML as its core syntax and SPARQL as its primary query language, all resources (people, documents, events, places, or any other arbitrary thing) are either indexed automatically or added intentionally, then processed and annotated with semantic metadata (RDF) à la various defined ontologies (OWL), and their ultimate fate is determined by user moderation (a simple vote tally is all that should be required). This automates the building of ad hoc ontologies based on formal ontologies, shared interests and the Web of Trust, and exposes domain expertise in the process via FOAF.
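
As a rough illustration of the system described above, the following Python sketch stands in for the real stack. Plain tuples play the role of RDF triples, a small loop applies the subclass rule (“All authors are also people”) that an RDFS or OWL reasoner would normally handle, a pattern matcher that treats None as a wildcard mimics the shape of a SPARQL query, and a net vote tally decides which submitted assertions are kept. The vocabulary strings, function names, and sample votes are all hypothetical, and a real implementation would presumably rely on an RDF library and a SPARQL endpoint rather than this toy store.

```python
from collections import defaultdict

# Hypothetical vocabulary; these are illustrative strings, not real RDF/OWL IRIs.
TYPE, SUBCLASS_OF, HAS_AUTHOR = "type", "subClassOf", "hasAuthor"

triples = {
    ("TheWasteLand", TYPE, "Poem"),
    ("TSEliot", TYPE, "Author"),
    ("TheWasteLand", HAS_AUTHOR, "TSEliot"),
    ("Author", SUBCLASS_OF, "Person"),  # "All authors are also people."
}


def infer_types(facts):
    """Add (x, type, Super) whenever x has type Sub and Sub is a subclass of Super."""
    facts = set(facts)
    changed = True
    while changed:  # repeat until no new triples appear, so chains of subclasses work
        changed = False
        for s, p, o in list(facts):
            if p != TYPE:
                continue
            for sub, q, sup in list(facts):
                if q == SUBCLASS_OF and sub == o and (s, TYPE, sup) not in facts:
                    facts.add((s, TYPE, sup))
                    changed = True
    return facts


def match(facts, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts like a query variable."""
    return sorted(
        t for t in facts
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def tally(votes):
    """Net score per assertion from (assertion, +1 or -1) votes."""
    scores = defaultdict(int)
    for assertion, vote in votes:
        scores[assertion] += vote
    return dict(scores)


facts = infer_types(triples)
people = [s for s, _, _ in match(facts, p=TYPE, o="Person")]

# Made-up moderation votes on two competing user-submitted assertions.
votes = [
    (("TheWasteLand", HAS_AUTHOR, "TSEliot"), +1),
    (("TheWasteLand", HAS_AUTHOR, "TSEliot"), +1),
    (("TheWasteLand", HAS_AUTHOR, "EzraPound"), +1),
    (("TheWasteLand", HAS_AUTHOR, "EzraPound"), -1),
    (("TheWasteLand", HAS_AUTHOR, "EzraPound"), -1),
]
scores = tally(votes)
accepted = [a for a, score in scores.items() if score > 0]
```

In this sketch the inferred set of people contains only the author resource, since the subclass rule runs in one direction, and only the assertion with a positive net score survives moderation. A threshold other than zero, or weighting votes by a user's standing in the Web of Trust, would be equally plausible design choices.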

Case: A Semantic Search Application for U.S. Tourism

Armed with this methodology, consider something more concrete. Figures for U.S. tourism earnings from all sources are not readily available, but the World Tourism Organization’s 2007 report on tourism estimates U.S. figures for internationally sourced tourism at some USD 85.7 billion for 2006. (Tourism Highlights 2007 Edition, 2007) Since this represents only international receipts, the total figure from all sources is undoubtedly far higher. Thus any system that seeks to improve people's ability to make connections between different kinds of tourism data should be economically worthwhile.

What kinds of questions make sense for U.S. tourism? First, and these are probably obvious, people who are seeking to travel may ask questions about where to go in the first place, where to stay when they get there, where they can eat, where they can buy supplies or simply shop, what events are happening while they will be there, and anything else that might fit their interests better. To date, there is no single application or site capable of pulling together all of this information and presenting it to someone asking these kinds of questions. The process involves multiple visits either to a search engine or to the various websites that serve the respective silos of information, and the tourist may still be left with incomplete information or a sense that he or she could have gotten a better deal.

Even asking a simple question like “What is the nearest state park to my house?” yields less than satisfactory results. It can be answered, but in multiple steps and with some guesswork. Add any other attributes to this question, and the number of steps before an answer is reached begins to increase rapidly. For instance, one might rephrase the preceding question as “What state park is nearest to my house and allows fishing and swimming, and what restaurants are within ten miles of the park?” Needless to say, answering such a question, while certainly doable via a number of websites, nevertheless
presents a challenge. And yet there is value in being able to do so, especially for restaurant owners within that radius.

The first step toward being able to answer such questions is to gather the data from its various repositories. As it happens, GeoNames provides geotagged information that can answer the question of where a state park is located (although querying to find the nearest one to another point can only be done visually on the GeoNames site at present), and Google (via its own maps service) can piece together which restaurants are near the park (again, establishing a specific radius does not seem to be possible). What is missing from easily searchable data, however, is any information regarding the services available at a given state park. As an example, the State of Texas does maintain a list of state parks (TPWD: Find a park, n.d.), and each state park has a standard listing. Also included in each listing is a link to a PDF document where the services information resides. These documents are good candidates for the automatic classification system to index, as their consistent structure makes it fairly easy to develop an ontology for the document type.

So the proposed system for this case will begin with these kinds of documents where possible, searching for instances of attributes that can provide semantic meaning to the contents. Other such documents will be sought out and added to the index, either by the application administrator or by end users. Since some sensible ontology has to be developed for each document type, and there are numerous structures to contend with, this presents the greatest challenge. This is the point at which reliance on end users becomes necessary (but see below for issues that may arise from this approach). After the process of indexing these documents has begun, the user moderation process can begin.
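
To make the gathering and querying steps above more concrete, here is a minimal Python sketch under several stated assumptions. Park listings are reduced to a name, a coordinate pair, and a line of free text; service attributes are spotted with naive keyword matching as a stand-in for the automatic classifier; and distances use the standard haversine great-circle formula. All park names, restaurant names, coordinates, and keywords are invented placeholders, not data from GeoNames, Google, or any state agency.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

# Hypothetical attribute keywords; a real ontology would be far richer.
SERVICE_KEYWORDS = {"fishing": "fishing", "swimming": "swimming", "camping": "camp"}


def extract_services(text):
    """Naive keyword spotting; it would wrongly match phrases like 'no swimming'."""
    lowered = text.lower()
    return {service for service, word in SERVICE_KEYWORDS.items() if word in lowered}


def miles_between(a, b):
    """Great-circle (haversine) distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


# Made-up listings: (name, (lat, lon), free-text services description).
listings = [
    ("Park A", (30.10, -97.30), "Fishing and swimming are permitted; boat ramp available."),
    ("Park B", (30.05, -97.20), "Fishing pier and picnic tables."),
    ("Park C", (31.00, -98.50), "Swimming area, fishing dock, and campsites."),
]
parks = [
    {"name": name, "loc": loc, "services": extract_services(text)}
    for name, loc, text in listings
]
restaurants = [
    {"name": "Diner 1", "loc": (30.12, -97.32)},
    {"name": "Diner 2", "loc": (30.60, -97.90)},
]


def nearest_park(home, wanted):
    """Closest park offering every wanted service, or None if no park qualifies."""
    candidates = [p for p in parks if wanted <= p["services"]]
    return min(candidates, key=lambda p: miles_between(home, p["loc"]), default=None)


def restaurants_near(park, radius_miles):
    """Names of restaurants within radius_miles of the given park."""
    return [
        r["name"] for r in restaurants
        if miles_between(park["loc"], r["loc"]) <= radius_miles
    ]


home = (30.00, -97.25)  # the only extra input the user has to supply
park = nearest_park(home, {"fishing", "swimming"})
nearby = restaurants_near(park, 10)
```

With these placeholder records, Park B is closer to the user's home but is filtered out because its listing mentions no swimming, so Park A is returned along with the one restaurant inside the ten-mile radius. The point of the sketch is only that once locations and service attributes live in one annotated store, the compound question becomes a single filter and a distance comparison.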
Users from a given region are often aware of attractions and events that people from outside the region are largely unaware of, so it makes sense to let them publish information about
these things. As new users discover such information, they will be allowed to agree or disagree with the information presented (it could very well be outdated) and to publish their own information with what they think is correct. In this way duplication is not necessarily prohibited, but the normative forces of the knowledgeable users will act to promote the most accurate or most useful information. As new information is constantly added, indexed, tagged, moderated, and consumed, the system will become better at answering certain kinds of questions. Especially useful to business owners, advertisers and end users is the wealth of ancillary information returned around even a simple search, including service and retail listings, restaurants, suppliers, and event listings.

Turning back to the complex question from earlier in the case, the system now has some capability to provide an answer. The question “What state park is nearest to my house and allows fishing and swimming, and what restaurants are within ten miles of the park?” can now be answered with a minimum of information provided by the end user (in this case the user need only specify, in addition to the search criteria, the location of his or her house). Further, because the system has gathered information that also has data connections nearby, options the user may not have considered can be presented along with the originally sought information. Such options give advertisers of related goods and services the opportunity to market to someone who is more likely to be looking for what they offer than a casual browser would be. As advertisers continually seek ways to narrow the focus of their marketing campaigns for cost effectiveness, a system that delivers a higher percentage of a target demographic should have wide appeal. Additionally, interesting
analytics can be gleaned from looking at the search and moderation trends, which could inform future marketing campaigns.

The final consideration for a Semantic Web application is managing user trust. Since the application in question is intended for public use, the trustworthiness of any user-submitted information cannot be fully verified. In a perfect scenario, no user would seek to fraudulently promote any submitted resource, or, if such subversion did occur, the system would be normative enough to keep such activity in check. In most cases, this should be true. Given a large enough user base, subversive activity should occur rarely and its effects should be brief. However, there may be occasions where this is not the case, especially where less well-known or less popular resources are concerned. In the case of the tourism search application, a concentrated effort to promote certain businesses (the most likely occurrence) could undermine the perception of trustworthiness that the site has built over time. Thus there should be some mechanism by which such activity can be reported and dealt with, a sort of “management by exception” approach. If advertisers are allowed to buy prominence in any listings of related information, then those advertisements should be clearly labeled as such. In this case, the key to building user trust is to create the perception that errors will be weeded out and corrected over time. Whether this means allowing collaborative modification of listing information à la Wikipedia or setting up a system to deal with iteratively corrected duplicates is left to the implementer of the system in question. Some combination of these could be used, but under no circumstances should these mechanisms and the user moderation provide misleading information to the end user.

Conclusion

For just one subset of one very large industry, this paper has shown how the Semantic Web holds great potential. From this, it is easy to see how it could be applied to virtually any domain. It will allow the drawing of knowledge from disparate data sources toward the goal of machine-understandable information, whereupon machines can take action in ways only humans have been able to. With these guidelines, new applications capable of answering a wide range of questions, discovering new knowledge, and informing knowledge creation can be developed and implemented.

References

About microformats. (n.d.). Retrieved Apr. 12, 2008, from

Arrington, M. (2008, Mar. 13). Yahoo Embraces The Semantic Web - Expect The Internet To Organize Itself In A Hurry. Retrieved Apr. 12, 2008, from

Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E., Yergeau, F., & Cowan, J. (2006, August 16). Extensible Markup Language (XML) 1.1 (Second Edition). Retrieved Apr. 14, 2008, from

Herman, I., & Stephens, S. (2007, December 4). Semantic Web Education and Outreach Interest Group Case Studies and Use Cases. Retrieved Feb. 13, 2008, from

Jenkins, C., Jackson, M., Burden, P., & Wallis, J. (1999). Automatic RDF Metadata Generation for Resource Discovery. Computer Networks: The International Journal of Computer and Telecommunications Networking, 31(11-16), 1305-1320. Retrieved Apr. 12, 2008, from 305%253AARMGFR%253E/ articles%252FAutoma.pdf.

Kumar, A. (2008, Mar. 13). Yahoo! Search Blog: The Yahoo! Search Open Ecosystem. Retrieved Apr. 12, 2008, from

Manola, F., & Miller, E. (2004, Feb. 10). RDF Primer. Retrieved Apr. 14, 2008, from

McGuinness, D., & Van Harmelen, F. (2004, Feb. 10). OWL Web Ontology Language Overview. Retrieved Apr. 14, 2008, from

Prud'Hommeaux, E., & Seaborne, A. (2008, January 15). SPARQL Query Language for RDF. Retrieved Feb. 13, 2008, from

SIMILE:About - SIMILE. (2007, Feb. 27). Retrieved Apr. 12, 2008, from

Semantic Web. (2008, Apr. 11). Retrieved Apr. 14, 2008, from

Tourism Highlights 2007 Edition. (n.d.). Retrieved Apr. 12, 2008, from
