Internet research guidelines M. Vladoiu, C. Negoita
Why should anyone use Internet research?
• an opportunity to gain an important advantage over competitors
• a wealth of information on countless topics
• access to a wide variety of services: vast information sources, electronic mail, file transfer, interest group membership, interactive collaboration, multimedia displays, and more
Issues to deal with
While doing research on the Internet, the searcher has to deal with:
• the large number of entries found
• the trustworthiness of information on the Web
• the deep Web
Large number of found entries
• the ability to reduce the number of entries found and to locate the needed information on the Internet depends on how precise the queries are and how effectively one uses search services
• poor queries return poor results; good queries return great results
• there are very effective ways to "structure" a query and use special operators to target the results you seek
Guidelines to good queries (1)
• use nouns and objects as query keywords (e.g. planet or planets) - actions (verbs), modifiers (adjectives, adverbs, predicate subjects), and conjunctions are either "thrown away" by the search engines or too variable to be useful;
• use 6 to 8 keywords in a query - more keywords, chosen at the appropriate level, can reduce the universe of possible documents returned by 99% or more;
• truncate words to pick up singular and plural versions - use the asterisk wildcard (e.g. planet*). The wildcard tells the search engine to match any characters that follow the stem, preserving keyword slots and increasing coverage by 50% or more.
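The truncation guideline can be illustrated with a minimal sketch. It assumes the engine treats `*` as "any run of characters, including none" and ignores case; real search engines differ in their wildcard rules, and the word list is purely illustrative.

```python
from fnmatch import fnmatchcase

def matches(keyword_pattern: str, word: str) -> bool:
    """Return True if `word` matches a truncated keyword such as 'planet*'."""
    # Assumption: the engine is case-insensitive, so compare in lowercase.
    return fnmatchcase(word.lower(), keyword_pattern.lower())

# Illustrative words only; "plane" is shorter than the stem and must not match.
words = ["planet", "planets", "planetary", "plane", "Planetoid"]
hits = [w for w in words if matches("planet*", w)]
print(hits)  # ['planet', 'planets', 'planetary', 'Planetoid']
```

One truncated keyword covers the singular, the plural, and derived forms, which is why it saves keyword slots for other concepts.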
Guidelines to good queries (2)
• use synonyms via the OR operator - cover the different ways a concept is likely to be described; generally avoid OR in other cases;
• combine keywords into phrases where possible - use quotes to denote phrases ("solar system"). Phrases restrict results to EXACT matches; when the terms naturally belong together, a phrase narrows and targets results many times over;
• combine 2 to 3 concepts in a query - triangulating on multiple query concepts narrows and targets results, generally by more than 100-to-1 ("solar system", "new planet*", discover* OR find);
• distinguish concepts with parentheses - nest each single query "concept" in parentheses. This is a simple way to ensure the search engines evaluate your query the way you want, from left to right, e.g. ("solar system") ("new planet*") (discover* OR find).
Guidelines to good queries (3)
• order concepts with the subject first - put the main subject first. Engines tend to rank documents more highly when they match the first terms or phrases evaluated: ("new planet*") (discover* OR find) ("solar system");
• link concepts with the AND operator - AND glues the query together. The resulting query is neither overly complicated nor deeply nested, and proper left-to-right evaluation order is ensured: ("new planet*") AND (discover* OR find) AND ("solar system");
• issue the query to a full-Boolean search engine or metasearcher - full-Boolean engines give you this control; metasearchers increase Web coverage 3- to 4-fold: ("new planet*") AND (discover* OR find) AND ("solar system").
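The guidelines above can be combined into a small query builder. This is only a sketch: the function names `format_term` and `build_query` are hypothetical, and it assumes the target engine accepts quoted phrases, parentheses, and upper-case AND/OR operators.

```python
def format_term(term: str) -> str:
    """Quote multi-word terms so the engine treats them as exact phrases."""
    return f'"{term}"' if " " in term else term

def build_query(concepts: list[list[str]]) -> str:
    """Build a Boolean query from concepts, listed with the main subject first.

    Each concept is a list of synonyms: synonyms are joined with OR,
    each concept is wrapped in parentheses, and concepts are linked with AND.
    """
    groups = []
    for synonyms in concepts:
        groups.append("(" + " OR ".join(format_term(t) for t in synonyms) + ")")
    return " AND ".join(groups)

query = build_query([
    ["new planet*"],          # main subject first
    ["discover*", "find"],    # synonyms for the action concept
    ["solar system"],
])
print(query)
# ("new planet*") AND (discover* OR find) AND ("solar system")
```

Keeping each concept as its own list makes it easy to add a synonym or reorder concepts without retyping the whole query.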
Trustworthy information on the Web (C)
• Credibility: trustworthy source, author’s credentials, evidence of quality control, known or respected authority, organizational support.
Goal: an authoritative source, one that supplies good evidence that allows you to trust it.
Trustworthy information on the Web (A)
• Accuracy: up to date, factual, detailed, exact, comprehensive; audience and purpose reflect intentions of completeness and accuracy.
Goal: a source that is correct today (not yesterday), a source that gives the whole truth.
Trustworthy information on the Web (R)
• Reasonableness: fair, balanced, objective, reasoned, no conflict of interest, absence of fallacies or slanted tone.
Goal: a source that engages the subject thoughtfully and reasonably, concerned with the truth.
Trustworthy information on the Web (S)
• Support: listed sources, contact information, available corroboration, claims supported, documentation supplied.
Goal: a source that provides convincing evidence for the claims made, a source you can triangulate (find at least two other sources that support it).
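The four criteria (C, A, R, S) can be recorded as a simple checklist. The sketch below is a deliberate simplification: the class name `SourceCheck` and its fields are hypothetical, the first three criteria are reduced to yes/no judgements made by the reader, and Support is reduced to the triangulation rule of at least two corroborating sources.

```python
from dataclasses import dataclass

@dataclass
class SourceCheck:
    """One reader's judgement of a single source (simplified checklist)."""
    credible: bool              # C: credentials, quality control, known authority
    accurate: bool              # A: up to date, factual, detailed, comprehensive
    reasonable: bool            # R: fair, balanced, no conflict of interest
    corroborating_sources: int  # S: other sources that support the claims

    def failed_criteria(self) -> list[str]:
        """Return the criteria this source does not meet."""
        failed = []
        if not self.credible:
            failed.append("Credibility")
        if not self.accurate:
            failed.append("Accuracy")
        if not self.reasonable:
            failed.append("Reasonableness")
        # Triangulation: at least two other sources should support it.
        if self.corroborating_sources < 2:
            failed.append("Support")
        return failed

check = SourceCheck(credible=True, accurate=True, reasonable=False,
                    corroborating_sources=1)
print(check.failed_criteria())  # ['Reasonableness', 'Support']
```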
DEEP WEB (1)
• searching on the Internet today can be compared to dragging a net across the surface of the ocean;
• while a great deal may be caught in the net, there is still a wealth of information that is deep and therefore missed;
• the reason is simple: most of the Web's information is buried far down on dynamically generated sites, and standard search engines never find it.
DEEP WEB (2)
• traditional search engines create their indices by spidering (crawling) surface Web pages;
• to be discovered, a page must be static and linked from other pages;
• traditional search engines cannot "see" or retrieve content in the deep Web - those pages do not exist until they are created dynamically as the result of a specific search.
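Why a crawler misses such pages can be shown on a toy link graph. This is a sketch under stated assumptions: the page names are invented, and the crawler only follows static links and never submits search forms, which is the behaviour the slide describes for traditional engines.

```python
from collections import deque

# Toy "Web": each static page and the static pages it links to (invented names).
LINKS = {
    "home.html": ["about.html", "catalog.html"],
    "about.html": ["home.html"],
    "catalog.html": [],   # offers a search form, but no links to result pages
    "orphan.html": [],    # static, but no other page links to it
}
# Result pages such as "catalog?q=planet" are generated only when a query is
# submitted, so they never appear as link targets in LINKS.

def crawl(seed: str) -> set[str]:
    """Breadth-first crawl: index every page reachable by following links."""
    indexed = {seed}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        for target in LINKS.get(page, []):
            if target not in indexed:
                indexed.add(target)
                queue.append(target)
    return indexed

index = crawl("home.html")
print(sorted(index))  # ['about.html', 'catalog.html', 'home.html']
```

The unlinked page and every dynamically generated result page stay outside the index, however long the crawl runs.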
DEEP WEB (3)
• the deep Web is qualitatively different from the surface Web. Deep Web sources store their content in searchable databases that only produce results dynamically in response to a direct request;
• public information on the deep Web is currently 400 to 600 times larger than the commonly defined World Wide Web. The deep Web contains 9,500 terabytes of information, compared to around twenty terabytes in the surface Web. More than half of the deep Web content resides in topic-specific databases.
DEEP WEB (4)
• a full 95% of the deep Web is publicly accessible information, not subject to fees or subscriptions. The total quality content of the deep Web is 1,000 to 2,000 times greater than that of the surface Web;
• a direct query is a laborious, "one at a time" way to search. BrightPlanet's search technology automates the process by issuing dozens of direct queries simultaneously using multithreading, and is thus the only search technology, so far, capable of identifying, retrieving, qualifying, classifying, and organizing both "deep" and "surface" content.
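The general idea of issuing many direct queries at once can be sketched with a thread pool. This is not BrightPlanet's technology, whose internals the slides do not describe; the in-memory `DATABASES` dictionary and the function names are hypothetical stand-ins for real searchable databases that would be queried over the network.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for deep Web databases: name -> list of records.
DATABASES = {
    "patents": ["planet rover wheel", "solar panel mount"],
    "articles": ["new planet discovered", "deep sea survey"],
    "library": ["atlas of the solar system"],
}

def direct_query(name: str, keyword: str) -> tuple[str, list[str]]:
    """Simulate one direct query; a real one would submit a search form."""
    hits = [record for record in DATABASES[name] if keyword in record]
    return name, hits

def search_all(keyword: str) -> dict[str, list[str]]:
    """Send the same query to every database at once, one thread per source."""
    with ThreadPoolExecutor(max_workers=len(DATABASES)) as pool:
        results = pool.map(lambda name: direct_query(name, keyword), DATABASES)
        return dict(results)

results = search_all("planet")
print(results)
```

Threads suit this workload because each query spends most of its time waiting for a remote source, so the waits overlap instead of adding up.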
DEEP WEB (5)
The searchable databases on the Web can be classified into 12 categories:
1. Topic Databases - subject-specific aggregations of information, such as SEC corporate filings, medical databases, patent records, etc. (topic database sites make up 54% of the deep Web); e.g. http://www.10kwizard.com/, http://www.uspto.gov/
2. Internal site - searchable databases for the internal pages of large, dynamically created sites, such as the knowledge base on the Microsoft site (13%); e.g. http://www.microsoft.com/
3. Publications - searchable databases for current and archived articles (11%); e.g. http://www.pubmedcentral.nih.gov/
4. Shopping/Auction (5%); e.g. http://www.flowerweb.nl/, http://www.locateaflowershop.com/
5. Classifieds (5%); e.g. www.canadaeast.com/
6. Portals - broader sites that include more than one of these other categories in searchable databases (3%); e.g. www.searchindia.com
7. Library - searchable internal holdings, mostly for university libraries (2%); e.g. www.lib.clemson.edu
8. Yellow and White Pages - people and business finders (2%); e.g. www.anywho.com
9. Calculators - while not strictly databases, many do include an internal data component for calculating results. Mortgage calculators, dictionary look-ups, and translators between languages are examples (2%); e.g. www.russiantranslation.ru
10. Jobs - job and resume postings (1%); e.g. http://www.medicsolve.com/
11. Message or Chat (1%); e.g. www.multidbexpress.com
12. General Search - searchable databases most often relevant to Internet search topics and information (1%); e.g. www.cyndislist.com
Conclusions
• Deep Web sites tend to be narrower, with deeper content, than conventional surface sites.
• To put these findings in perspective, one has to consider that the search engines with the largest number of Web pages indexed (such as Google) index no more than sixteen per cent of the surface Web.
• Since they miss the deep Web when they use such search engines, Internet searchers are searching only 0.03% - or one in 3,000 - of the pages available to them today.
• Clearly, simultaneous searching of multiple surface and deep Web sources is necessary when comprehensive information retrieval is needed.
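The "one in 3,000" figure can be checked with a short calculation from the numbers quoted in the slides: at most 16% of the surface Web indexed, and a deep Web 400 to 600 times larger than the surface Web. The sketch assumes that this size ratio, stated for information volume, also applies to page counts; the function name is hypothetical.

```python
def searched_share(surface_indexed: float, deep_to_surface_ratio: float) -> float:
    """Fraction of all pages (surface + deep) reachable through one engine."""
    # Take the surface Web as 1 unit; the deep Web is then `ratio` units.
    total = 1.0 + deep_to_surface_ratio
    return surface_indexed / total

for ratio in (400, 500, 600):
    share = searched_share(0.16, ratio)
    print(f"ratio {ratio}: {share:.4%} of pages, about one in {round(1 / share):,}")
```

Across the quoted range the share stays between roughly 0.03% and 0.04%, in line with the figure on the slide.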