Timeline of events related to the Deep Web
October 7, 2008
1980 Tim Berners-Lee “developed his first hypertext system, ‘Enquire,’ for his own use (although unaware of the existence of the term HyperText). With a background in text processing, real-time software and communications, Tim decided that high energy physics needed a networked hypertext system and CERN was an ideal site for the development of wide-area hypertext ideas” (CERN).
1989 Tim Berners-Lee started the WorldWideWeb project at CERN.
1992-09 Arthur Secret at CERN created the first web gateway to a relational database system (Shestakov 2008-05).
1994 Dr. Jill Ellsworth “first coined the phrase ‘invisible Web’ to refer to information content that was ‘invisible’ to conventional search engines” (Bergman 2001, citing Garcia 1996). See also the 1996 entry below.
1996 Frank Garcia (1996) claimed that Texas-based university professor Jill H. Ellsworth (d. 2002), Internet consultant for Fortune 500 companies, coined the term “Invisible Web” in 1996 to refer to websites that are not registered with any search engine. “Ellsworth is co-author with her husband, Matthew V. Ellsworth, of The Internet Business Book (John Wiley & Sons, Inc., 1994), Marketing on the Internet: Multimedia Strategies for the World Wide Web (John Wiley & Sons, Inc.), and Using CompuServe. She has also explored education on the Internet, and contributed chapters on business and education to the massive tome, The Internet Unleashed.”
[S]igns of an unsuccessful or poor site are easily identified, says Jill Ellsworth. “Without picking on any particular sites, I’ll give you a couple of characteristics. It would be a site that’s possibly reasonably designed, but they didn’t bother to register it with any of the search engines. So, no one can find them! You’re hidden. I call that the invisible Web.” Ellsworth also makes reference to the “dead Web,” which no one has visited for a long time, and which hasn’t been regularly updated (Garcia 1996).
1996-12-01 “The first commercial Deep Web tool (although they referred to it as the ‘Invisible Web’) was @1, announced December 12th, 1996 in partnership with large content providers. According to a December 12th, 1996 press release, @1 started with 5.7 terabytes of content which was estimated to be 30 times the size of the nascent World Wide Web” (“America Online to Place AT1 from PLS in Internet Search Area: New AT1 Service Allows AOL Members to Search ‘The Invisible Web’”). See Choi (2008-01-07).
1996-12-12 “Personal Library Software, Inc. (PLS), the leading supplier of search and retrieval software to the online publishing industry, ushered in the next generation of Internet search engines with the introduction of a new Internet based service, AT1 which combines the best of PLS’s search, agent and database extraction technology to offer publishers and users something they have never had before: the ability to search for content residing in “hidden” databases — those large collections of documents managed by publishers not viewable by Web spiders. AT1 also allows users to create intelligent agents to search newsgroups and websites with E-Mail notification of results (Press release).”
1997 Michael Lesk wrote an unpublished paper entitled “How much information is there in the world?”, in which he estimated that in 1997 the Library of Congress held between 20 terabytes and 3 petabytes. See Choi (2008).
1999-02 Lawrence and Giles (1999) claimed that the publicly indexable World Wide Web (PIW) contained about 800 million pages; the search engine with the largest index, Northern Light, indexed roughly 16% of the publicly indexable World Wide Web; the combined index of 11 large search engines covered (very) roughly 42% of the publicly indexable World Wide Web.
2000-03 c. 43,000–96,000 Deep Web sites existed (Bergman 2001).
2000-07-26 BrightPlanet released a study documenting the Deep Web (a massive storehouse of databases and information that was invisible to search engines in 2000), claiming that the Deep Web was 500 times larger than the indexed Web accessible by most search engines. BrightPlanet researchers also released their direct-query search technology, LexiBot™, which automatically identifies, retrieves, qualifies, and classifies content from Deep Web sites, and they listed c. 20,000 searchable Deep Web sites. Direct-query technology that can access searchable databases, unlike most search engines, implies that the Invisible Web is not really invisible, just harder to reach. See the press release “BrightPlanet Unveils the ‘Deep’ Web: 500 Times Larger than the Existing Web.”
BrightPlanet “quantified the size and relevancy of the deep Web in a study based on data collected between March 13 and 30, 2000.” Key findings (Bergman 2001):
Public information on the deep Web is currently 400 to 550 times larger than the commonly defined World Wide Web.
The deep Web contains 7,500 terabytes of information, compared to nineteen terabytes of information in the surface Web.
The deep Web contains nearly 550 billion individual documents, compared to the one billion of the surface Web.
More than 200,000 deep Web sites presently exist.
Sixty of the largest deep-Web sites collectively contain about 750 terabytes of information, sufficient by themselves to exceed the size of the surface Web forty times.
On average, deep Web sites receive fifty per cent greater monthly traffic than surface sites and are more highly linked to than surface sites; however, the typical (median) deep Web site is not well known to the Internet-searching public.
The deep Web is the largest growing category of new information on the Internet.
Deep Web sites tend to be narrower, with deeper content, than conventional surface sites.
Total quality content of the deep Web is 1,000 to 2,000 times greater than that of the surface Web.
Deep Web content is highly relevant to every information need, market, and domain.
More than half of the deep Web content resides in topic-specific databases.
A full ninety-five per cent of the deep Web is publicly accessible information, not subject to fees or subscriptions.
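The headline multipliers in these findings can be reproduced from the raw figures the study itself quotes; a quick sanity check (the figures are Bergman's, the arithmetic below is ours):

```python
# Rough sanity check of the headline multipliers, using only the raw
# figures quoted in Bergman's (2001) findings above.
deep_web_tb = 7_500      # deep Web storage, terabytes
surface_web_tb = 19      # surface Web storage, terabytes
deep_docs = 550e9        # deep Web documents
surface_docs = 1e9       # surface Web documents
top60_tb = 750           # sixty largest deep-Web sites, terabytes

print(round(deep_web_tb / surface_web_tb))   # storage ratio, ~395x
print(round(deep_docs / surface_docs))       # document ratio, 550x
print(round(top60_tb / surface_web_tb))      # top sixty vs. surface, ~39x
```

The storage ratio (~395) and the document ratio (550) roughly bracket the quoted “400 to 550 times,” and 750 terabytes is indeed about forty times the 19-terabyte surface Web.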
2000 Shestakov (2008) cites Bergman (2001) as the source for the claim that the term Deep Web was coined in 2000. Bergman distinguished the Surface Web from the Deep Web using the metaphor of surface versus deep-water fishing or trawling; the term Deep Web is preferred over Invisible Web.
2000 UC Berkeley biologist Michael Eisen, Nobel laureate Harold Varmus and Stanford biochemist Patrick Brown helped start the Public Library of Science (PLoS), a “nonprofit organization of scientists and physicians committed to making the world’s scientific and medical literature a freely available public resource,” by encouraging scientists to insist on open-access publishing models rather than being forced to sign over their (often publicly funded) research to expensive scientific journals. Wright (2004) cited Eisen, Varmus and Brown as examples of scientists who are making some areas of the Deep Web more accessible to the public.
2001 Raghavan and Garcia-Molina (2001) “presented an architectural model for a hidden-Web crawler that used key terms provided by users or collected from the query interfaces to query a Web form and crawl the deep Web resources (Choi 2008-01-07).”
2002-02 StumbleUpon began to use human crawlers, or human-based computation techniques, to uncover data on the Deep Web. Human crawlers can find relevant links that algorithmic crawlers miss (Choi 2008-01-07).
2002-12 There were c. 130,000 Deep Web sites (He, Patel, Zhang and Chang 2007; Shestakov 2008).
2003-06-01 Dorner and Curtis (2003-06-01) conducted a survey (data collected from 2002-12 through 2003-04) of librarians in New Zealand to compare common user interface software products supplied by vendors: Endeavour, ExLibris, Follett, Fretwell-Downing, Innovative Interfaces, MuseGlobal, OCLC, SIRSI, WebFeat and VTLS. MuseSearch, ENCompass, MetaLib, SingleSearch and WebFeat received the highest scores in 2003 (Dorner and Curtis 2003-06-01:2). SingleSearch was noted as having an added cost advantage for libraries since it was open access and open source (Dorner and Curtis 2003-06-01:2). In 2002–2003, successful common user interface software was expected to support formats and protocols beyond Z39.50, such as OpenURL, HTTP, SQL, XML, MARC, CrossRef, DOI, EAD, Dublin Core and Telnet (Dorner and Curtis 2003-06-01:8).
2004-04 There were c. 310,000 Deep Web sites (He, Patel, Zhang and Chang 2007; Shestakov 2008).
2004 Between 2000 and 2004 the Deep Web increased in size by 3–7 times (He, Patel, Zhang and Chang 2007; Shestakov 2008).
2004-03-02 Yahoo announced its Content Acquisition Program, a paid service offering enhanced search coverage by “unlocking” the deep Web (Wright 2004).
2005 Yahoo released Yahoo! Subscriptions, which searched a few of the Deep Web’s subscription-only web sites.
2005 Ntoulas et al. (2005) “created a hidden-Web crawler that automatically generated meaningful queries to issue against search forms. Their crawler generated promising results, but the problem is far from being solved. Since a large amount of useful data and information resides in the deep Web, search engines have begun exploring alternative methods to crawl the deep Web (Choi 2008-01-07).”
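The query-based crawling strategy Ntoulas et al. describe can be sketched in a few lines. This is a minimal illustration, not their implementation: the `fetch` hook, URLs, and seed terms are hypothetical stand-ins for a real form-submission client, and their system additionally learns new promising query terms from the pages it downloads.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href targets from anchor tags in a result page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl_hidden_web(fetch, form_url, seed_terms, max_queries=10):
    """Issue keyword queries against a search form and harvest result links.

    `fetch(url, term)` submits one query and returns the result page's HTML;
    in practice it would GET/POST the site's search form, but it is injected
    here so the strategy can be exercised without network access.
    """
    seen = set()
    for term in list(seed_terms)[:max_queries]:
        html = fetch(form_url, term)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            seen.add(urljoin(form_url, href))
    return seen
```

Each harvested URL points at a page reachable only through the form, which is exactly the content a link-following crawler never sees.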
The crawlers of the search engine Pipl can identify, interact with, and retrieve some information from the deep Web.
Deep Web “search engines like CloserLookSearch and Northern Light create specialty engines by topic to search the deep Web. Because these engines are narrow in their data focus, they are built to access specified deep Web content by topic. These engines can search dynamic or password-protected databases that are otherwise closed to search engines (Choi 2008-01-07).”
Google’s Sitemaps protocol and the Apache module mod_oai “are mechanisms that allow search engines and other interested parties to discover deep Web resources on particular Web servers. Both mechanisms allow Web servers to advertise the URLs that are accessible on them, thereby allowing automatic discovery of resources that are not directly linked to the surface Web (Choi 2008-01-07).”
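A Sitemap is just an XML list of URLs the server wants crawlers to know about. The sketch below generates a minimal one following the sitemaps.org schema; the database-record URLs are made up for illustration:

```python
from xml.etree import ElementTree as ET

def build_sitemap(urls):
    """Build a minimal Sitemap (sitemaps.org protocol) advertising URLs
    that spiders could not otherwise discover by following links."""
    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=ns)
    for loc in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
    return ET.tostring(urlset, encoding="unicode")

# A database-backed site might list its query-generated pages explicitly:
print(build_sitemap([
    "http://example.org/db/record?id=1",
    "http://example.org/db/record?id=2",
]))
```

By enumerating form- or query-generated pages in the Sitemap, a server makes otherwise “deep” resources discoverable without requiring the crawler to guess queries.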
2007-06 WorldWideScience was created to provide access to the Deep Web. When it began, it linked to 12 databases from 10 countries. It is a “science portal developed and maintained by the Office of Scientific and Technical Information (OSTI), an element of the Office of Science within the U.S. Department of Energy. The WorldWideScience Alliance, a partnership consisting of participating member countries, provides the governance structure for the WorldWideScience.org portal (RWW).”
2007-07-27 “Indiana University faculty member Javed Mostafa appeared on National Public Radio’s Science Friday and, drawing on information in a published study from the University of California, Berkeley, entitled ‘How Much Information?’, estimated that the deep web consists of about 91,000 terabytes. By contrast, the surface web, which is easily reached by search engines, is only about 167 terabytes. The Library of Congress contains about 11 terabytes, for comparison. Mostafa noted that these numbers were a bit dated and were just rough estimates (Choi 2008-01-07).”
2008-05-14 ReadWriteWeb contributor Sarah Perez listed a number of “Digital Image Resources on the Deep Web.”
2008-06 The WorldWideScience portal to the Deep Web linked to 32 national scientific databases and portals from 44 different countries (RWW).
2008 Several “Deep Web directories are under development, such as OAIster at the University of Michigan, INFOMINE at the University of California at Riverside, and DirectSearch by Gary Price, to name a few (Choi 2008-01-07).”
2008-09-22 Infovell launched its research engine for the Deep Web. “Available initially on a subscription basis, Infovell gives users access to hard to find, in-depth, expert information spanning Life Sciences, Medicines, Patents, and other reference categories with more to be added over time.” “Infovell’s research engine will be available beginning September 22 as a premium service for individual researchers and corporations who are seeking more affordable access to expert information. The Company is offering a risk-free trial through its website http://www.infovell.com. Later this year, Infovell will be beta-releasing a free version of its research engine on a limited basis for those individuals who want to search the Deep Web but don’t have the need for some of the advanced features available in the premium version.”
2009 United States “Congressional Representative John Conyers (D-MI) re-introduced a bill (HR801) that essentially would negate the National Institutes of Health (NIH) policy concerning depositing research in Open Access (OA) repositories. The bill goes further than prohibiting open access requirements, however, as the bill also prohibits government agencies from obtaining a license to publicly distribute, perform, or display such work by, for example, placing it on the Internet, and would repeal the longstanding ‘federal purpose’ doctrine, under which all federal agencies that fund the creation of a copyrighted work reserve the ‘royalty-free, nonexclusive right to reproduce, publish, or otherwise use the work’ for any federal purpose. The National Institutes of Health require NIH-funded research to be published in open-access repositories (Doctorow 2009).” HR801 would benefit for-profit science publishers and increase the challenges of making the Deep Web more accessible. See Doctorow, Cory. 2009-02-16. “Scientific publishers get a law introduced to end free publication of govt-funded research.” Boing Boing.
“Metasearch technology, also known as federated search or broadcast search, creates a portal that could allow the library to become the one-stop shop their users and potential users find so attractive (Luther 2003-10-01).”
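The broadcast pattern behind federated search is simple: send one query to every back-end in parallel, then merge and deduplicate the results. A minimal sketch, assuming stub connectors (real ones would speak Z39.50, SRU, or a vendor HTTP API):

```python
from concurrent.futures import ThreadPoolExecutor

def metasearch(query, sources):
    """Broadcast one query to several search back-ends in parallel and
    merge their results, preserving source order and dropping duplicates.

    `sources` maps a source name to a callable taking the query string and
    returning a list of hits; the callables here are injected stubs.
    """
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in sources.items()}
        merged, seen = [], set()
        for name, fut in futures.items():
            for hit in fut.result():
                if hit not in seen:          # first source to return a record wins
                    seen.add(hit)
                    merged.append((name, hit))
    return merged
```

Production metasearch tools add per-source timeouts, relevance re-ranking, and record normalization on top of this core, but the one-query, many-databases fan-out is the part that makes the portal a “one-stop shop.”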
Joo-Won Choi’s (2008-01) useful categories of Deep Web resources include:
Dynamic content: “dynamic Web pages, which are returned in response to a submitted query or accessed only through a form (especially if open-domain input elements, e.g. text fields, are used; such fields are hard to navigate without domain knowledge).”
Unlinked content: “pages which are not linked to by other pages, which may prevent Web crawling programs from accessing the content. This content is referred to as pages without backlinks or inlinks.”
Private Web: “sites that require registration and login (password-protected resources).”
Contextual Web: “pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence).”
Limited access content: “sites that limit access to their pages in a technical way (e.g., using the Robots Exclusion Standard, CAPTCHAs or HTTP headers), prohibiting search engines from browsing them and creating cached copies.”
Non-HTML/text content: “textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines.” For more see Choi (2008-01).
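The “limited access” category above includes sites that fence crawlers out via the Robots Exclusion Standard: a polite crawler consults robots.txt before fetching anything. Python's standard library can evaluate such rules directly; the rules and URLs below are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt fencing off part of a site from crawlers.
rules = """\
User-agent: *
Disallow: /private/
Allow: /public/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A compliant crawler checks each URL before fetching it.
print(rp.can_fetch("*", "http://example.org/public/page.html"))   # True
print(rp.can_fetch("*", "http://example.org/private/page.html"))  # False
```

Content behind a `Disallow` rule never enters a compliant engine's index, which is one reason it stays in the Deep Web even though it may be freely viewable in a browser.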
Webliography and Bibliography
Bergman, Michael K. 2001-09-24. “The Deep Web: Surfacing Hidden Value.” White Paper.
Bergman, Michael K. 2001. “The Deep Web: Surfacing Hidden Value.” Journal of Electronic Publishing. 7:1.
Choi, Joo-Won. 2008-01-07. “Deep Web.” KAIST.
Dorner, Daniel G.; Curtis, Anne Marie. 2003-06-01. “A comparative review of common user interface software products for libraries.” National Library of New Zealand. 67 pp.
Ellsworth, Jill H.; Ellsworth, Matthew V. 1994. The Internet Business Book. John Wiley & Sons, Inc.
Ellsworth, Jill H.; Ellsworth, Matthew V. 1997. The Internet Business Book. John Wiley & Sons, Inc.
Ellsworth, Jill H.; Ellsworth, Matthew V. 1995. Marketing on the Internet: Multimedia Strategies for the World Wide Web. John Wiley & Sons, Inc.
Ellsworth, Jill H.; Ellsworth, Matthew V. 1996. Marketing on the Internet: Multimedia Strategies for the World Wide Web. 2nd Edition. John Wiley & Sons, Inc.
Ellsworth, Jill H.; Ellsworth, Matthew V. Using CompuServe. John Wiley & Sons, Inc.
Ellsworth, Jill H. Chapters? The Internet Unleashed.
Guernsey, Lisa. 2001-01-25. “Mining the deep web with sharper shovels.” New York Times. p. G1.
Lawrence, Steve; Giles, C. Lee. 1999-07-08. “Accessibility of Information on the Web.” Nature. 400:6740:107–109. See http://www.wwwmetrics.com.
Luther, Judy. 2003-10-01. “Trumping Google? Metasearching’s Promise.” Library Journal.
PLS. 1996-12-01. “America Online to Place AT1 from PLS in Internet Search Area: New AT1 Service Allows AOL Members to Search “The Invisible Web”.” Press Release.
Shestakov, Dennis. 2008-05. “Deep Web.”
Smith, Richard. 2008-10-07. “More evidence on why we need radical reform of science publishing.” PLoS.
Wright, Alex. 2004-03-09. “In search of the deep Web: The next generation of Web search engines will do more than give you a longer list of search results. They will disrupt the information economy.” Salon.
He, Bin; Patel, Mitesh; Zhang, Zhen; Chang, Kevin Chen-Chuan. 2007. “Accessing the deep Web.” Communications of the ACM. 50:5:94–101.
Joo-Won Choi’s Bibliography:
Panagiotis Ipeirotis, Luis Gravano, and Mehran Sahami. 2001. “Probe, Count, and Classify: Categorizing Hidden-Web Databases.” Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data. pp. 67–78.
Gary Price & Chris Sherman. 2001-07. The Invisible Web: Uncovering Information Sources Search Engines Can’t See. CyberAge Books. ISBN 0-910965-51-X.
Michael K. Bergman. 2001-08. “The Deep Web: Surfacing Hidden Value.” The Journal of Electronic Publishing. 7:1.
Sriram Raghavan and Hector Garcia-Molina. 2001. “Crawling the Hidden Web.” In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB). pp. 129-138
Nigel Hamilton (2003). ”The Mechanics of a Deep Net Metasearch Engine.” 12th World Wide Web Conference poster.
Bin He and Kevin Chen-Chuan Chang. 2003. “Statistical Schema Matching across Web Query Interfaces.” In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data
Joe Barker. 2004-01. “Invisible Web: What it is, Why it exists, How to find it, and Its inherent ambiguity.” UC Berkeley – Teaching Library Internet Workshops.
Alexandros Ntoulas, Petros Zerfos, and Junghoo Cho. 2005. “Downloading Textual Hidden Web Content Through Keyword Queries.” In Proceedings of the Joint Conference on Digital Libraries (JCDL). pp 100-109.
Frank McCown, Xiaoming Liu, Michael L. Nelson, and Mohammad Zubair. 2006-03/04. “Search Engine Coverage of the OAI-PMH Corpus.” IEEE Internet Computing. 10:2:66–73.