5603: Introduction to Information Services

Web searching


Web search engine

Teoma is a search engine created by Rutgers professor Apostolos Gerasoulis and his associates. It uses a clustering algorithm to weigh relevancy ratings. Each indexed page is assigned to one or more "communities", sets of pages about the same subject. Inbound links from pages within the same community count for more than similar links from outside the community. In addition to relevance weighting, this page classification is the basis for two additional features of the search engine user interface. The "Refine - suggestions to narrow your search" list lets the user select the appropriate classification for a given request. Similarly, the "Resources - link collections from experts and enthusiasts" list presents pages that Teoma has essentially determined to be bibliographies (lists of resources) on the topic at hand.
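
The community weighting described above can be sketched roughly as follows. This is only an illustration of the idea, not Teoma's actual algorithm, which is not documented here; the function name, the weights, and the example sites are all assumptions made for the sketch.

```python
from collections import defaultdict

def community_link_scores(links, communities, topic,
                          inside_weight=1.0, outside_weight=0.25):
    """Rank pages in a topic community by weighted inbound links.

    Links from pages inside the community count more than links from
    outside it. The weights are arbitrary illustrative values.
    """
    members = communities.get(topic, set())
    scores = defaultdict(float)
    for source, target in links:
        if target not in members:
            continue  # only pages in the topic community are ranked
        weight = inside_weight if source in members else outside_weight
        scores[target] += weight
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical data: one community and a handful of (source, target) links.
communities = {"digital libraries": {"dlib.example", "sunsite.example", "cdl.example"}}
links = [
    ("dlib.example", "cdl.example"),
    ("sunsite.example", "cdl.example"),
    ("news.example", "dlib.example"),
    ("blog.example", "dlib.example"),
    ("cdl.example", "sunsite.example"),
]
ranked = community_link_scores(links, communities, "digital libraries")
```

In this toy data, one link from inside the community outweighs two links from outside it, which is the behavior the description implies.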

Simple search
Advanced search

Teoma offers an advanced search. The user interface tries to demystify the options, providing choices in lay language instead of computer or logic talk. Choices include:

Search examples
  1. popcorn energy machine: This is what I started with... I was trying to think of three unrelated terms, but of course there are lots of relationships among these words! I tried some variations to explore the syntax rules, including various quote marks, and put them into the advanced search to explore how the pulldown and options like "must have", "should not have", and "must not have" worked. Eliminating a keyword changed the characteristics of the result list dramatically.
  2. Tried a bunch of searches inspired by Dr. C. Jorgensen: running, happy, fun, sad... "conceptual" searches, hard searches in the image world. In the text world Teoma always finds something, but the results were rather scattered and the secondary search tools (Refine and Resources) were similarly unimpressive.
  3. digital library collection development: When entered as unquoted terms, this search resulted in a high quality list of digital library resources: NYPL digital library, SunSITE, Yale, IFLA, a CLIR report, D-Lib, Glasgow. The ninth entry on the list was the first I did not know. When the check box "Find this phrase" was checked, the list changed to a tighter focus on specific policy statements of organizations. Other searches: "digital library collection policy" yielded only two hits. "digital library collection management" had twelve hits and no refinements or resources. "digital library" "collection management" was overwhelming, with 4,000+ hits, and the Refine and Resources lists that it generated were worthless.
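
The lay-language options explored in the first example map onto ordinary Boolean filtering. The sketch below is a minimal illustration of that mapping, assuming "must have" means AND and "must not have" means NOT; it does not reproduce Teoma's implementation, and it leaves out "should not have" because its exact behavior is not described here. The page texts are made up.

```python
def matches(text, must_have=(), must_not_have=()):
    """True if the text contains every required word and no excluded word."""
    words = set(text.lower().split())
    has_required = all(term.lower() in words for term in must_have)
    has_excluded = any(term.lower() in words for term in must_not_have)
    return has_required and not has_excluded

def filter_pages(pages, must_have=(), must_not_have=()):
    """Keep only the pages that satisfy both kinds of constraint."""
    return [page for page in pages if matches(page, must_have, must_not_have)]

# Hypothetical page texts standing in for an index.
pages = [
    "popcorn machine rental",
    "popcorn energy drink",
    "energy machine design",
]
kept = filter_pages(pages, must_have=["popcorn"], must_not_have=["energy"])
```

Dropping or adding a single required term changes which pages survive the filter, which is consistent with the dramatic change in results noted above.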

The "Refine" and "Resources" features sometimes provide powerful enhancements to web searching, but the idea seems more useful than this implementation delivers. Google has a feature similar to "Refine" but it is not as sophisticated. Teoma lists classification titles so the user knows what community is being selected. Google's "Similar pages" feature, in contrast, is a "classification by example". You pick an individual site and see pages that are similar, but you do not know the criteria or classification system determining the similarity. Teoma lets you make fine distinctions explicitly, where in Google you need to guess. The "Resources" feature has no equivalent in Google. I used Teoma as my default search engine for four days by making it my browser home page. I found that I did not trust the results; if it was a search I really cared about, I ran the same search in Google.


Web directory

The "open directory project" is an open source directory constructed and maintained by volunteers.

Simple search
Advanced search

The open directory project offers an advanced search, albeit with very few choices. Choices include:

Search examples
  1. isearch relevance ranking: This is recursive but ineffectual, as you would guess: "No results found." I tried permutations of this phrase as well. The single term "isearch" brings up a few of the same stale results as Teoma.
  2. digital library collection development: This gives two directories and two sites. One site was irrelevant; one was an excellent find (The Digital Library Center at University of Tennessee) but not specifically on the topic of collection development in digital libraries. One directory was irrelevant (but interesting) and Reference:Libraries:Digital was close. Refining the search reveals why the original was not successful: "collection development" is not a category in dmoz. "Collection policy" is the phrase they use. However, "digital library collection policy" returns zero hits. Truncate to "digital library" and the world opens up: 7 categories, 624 sites. Exploring leads to nothing: Top: Reference: Libraries: Library and Information Science: Digital Library Development is closest, but still does not contain what I'm looking for. Explore more. Conclusion: Collection policy in digital libraries is too specialized to have a category; Reference: Libraries: Library and Information Science: Technical Services: Collection Development is the best I'll get in dmoz.
  3. directory crawl: The value of a directory is in its classification system rather than its search. I used dmoz for several reference needs. My elder son couldn't find his English-Spanish dictionary. When I searched for English Spanish dictionary, I rapidly located http://dmoz.org/Reference/Dictionaries/World_Languages/S/Spanish/English/. Trying to find it via directory crawl took much longer. Reference/Dictionaries is easy, but then you have to decide between English and World Languages. English? Wrong answer. Once in World Languages, a new interface convention shows up: an alphabet, from which you must choose the first letter of the language. No prompt, no explanation; very hard to figure out. If you guess "S" you can easily find Spanish and you are all set. However, the search engine was easier and faster.
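
The difference between crawling and searching in the third example can be sketched with a toy directory tree. This is a minimal illustration, not how dmoz search works internally: the tree holds only the categories named above, and the stemming rule is a deliberately crude assumption so that "dictionary" matches "Dictionaries".

```python
def stem(word):
    """Crude plural stripping (an assumption, not a real stemmer)."""
    word = word.lower()
    if word.endswith("ies"):
        return word[:-3] + "y"
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

def iter_paths(tree, prefix=()):
    """Yield every category path in a nested-dict directory tree."""
    for name, children in tree.items():
        path = prefix + (name,)
        yield path
        yield from iter_paths(children, path)

def search(tree, query):
    """Return paths whose category names contain every query term."""
    terms = {stem(term) for term in query.split()}
    hits = []
    for path in iter_paths(tree):
        tokens = {stem(part) for name in path for part in name.split("_")}
        if terms <= tokens:
            hits.append("/".join(path))
    return hits

# Toy tree containing only the categories mentioned in the text.
directory = {
    "Reference": {
        "Dictionaries": {
            "English": {},
            "World_Languages": {"S": {"Spanish": {"English": {}}}},
        }
    }
}
hits = search(directory, "English Spanish dictionary")
```

The single query finds Reference/Dictionaries/World_Languages/S/Spanish/English directly, while browsing to the same category takes six correct choices in a row, including the unprompted letter "S".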

I hadn't used dmoz in years and thought it would be worth another look. It forms the foundation for the Google Directory and other directories behind search engines. Since it is open, you and I can create and edit categories. I'm looking forward to seeing http://dmoz.org/Reference/Libraries/Library_and_Information_Science/Librarians/Kazmer,_Michelle. The directory structure is logical; the user interface is clean and easy to traverse. Downsides? The classification system is not very deep, so my topic (the LCSH heading is "Digital libraries--Collection development") is not included. Over the three days I was using it for this paper, the server seemed very sluggish and at times failed to respond to HTTP requests.



Metasearch service

Clusty is a new metasearch tool from Vivisimo, a company that previously sold search software components and now offers consumer-level search. It is rich in features and new ideas, works quickly, and has a nice interface. It was developed by computer scientists from Carnegie Mellon. The CEO is Raul Valdes-Perez, who has published widely on a variety of topics.

Clusty's tabbed top lets a user select from nine sources: Web+, News, Images, Shopping, Encyclopedia, Gossip, eBay, Blogs, Slashdot. Choices like "Gossip", "Blogs", and "Slashdot" differentiate this search machine from the competition. There's even a customize function that lets you create your own tab and title, with a customized set of search sources.

Simple search
Advanced search

Advanced search allows the user more control over sources, clustering, and type of content.

Search examples
  1. vivisimo relevance ranking: This time recursion yields a panoply of sources, but the algorithmic description still remains unfound. I suspect they consider it a trade secret. I've looked in trade publications and computer science literature.
  2. digital library collection development: This is the best result for this search from any of the search engines I have used. The sites found immediately are the highest quality ones: California Digital Library policy pages, SunSITE, D-Lib, American Memory. The classification choices are excellent: University, Development Policy, Science, California Digital Library, Library of Congress, Library Research, Library Resources, Conference, Framework, Issues, and (more...).
  3. defaults: The morning after the first Presidential debate, the gossip column defaults included a "Bush, Debate" category. At first I thought it was strange; shouldn't that be in the "News" tab? Then I started reading, and son of a gun! It is gossip about the debate! "News", on the other hand, had a "Kerry, Debate" category with substantive articles. Political bias in the Clustering Engine? The "Encyclopedia" tab had links to articles in Wikipedia about the 2004 debate program, along with links to the candidates' pages. Other tabs display only a blank search box.
  4. images: I'm wandering through gossip land (Paris is everywhere!) and I see this picture: ... What is Maureen Dowd doing in the gossip pages? The image search quickly shows me the source of my confusion: Melissa Etheridge or Maureen Dowd?

The nice thing about this tool is that once you find an initial category that is close to the type of information you seek, you navigate through the classification system to narrow or broaden your search. The results returned were excellent. I've been using it for several days and it seems to produce high quality results consistently. I have yet to find something that would make me want to return to Google as my default search engine.
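
The merge-then-cluster behavior that makes this navigation possible can be sketched as follows. Clusty's real clustering method is not published in anything I found, so this is only a toy stand-in that groups results by shared title words; the function names, stopword list, URLs, and titles are all hypothetical.

```python
from collections import defaultdict

STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "on", "to"}

def merge_results(*result_lists):
    """Combine (url, title) results from several sources, one entry per URL."""
    seen = {}
    for results in result_lists:
        for url, title in results:
            seen.setdefault(url, title)  # keep the first copy of each URL
    return list(seen.items())

def cluster_by_term(results, query, min_size=2):
    """Group results under title words shared by at least min_size results."""
    skip = STOPWORDS | set(query.lower().split())
    clusters = defaultdict(list)
    for url, title in results:
        for word in sorted(set(title.lower().split()) - skip):
            clusters[word].append(url)
    return {label: urls for label, urls in clusters.items() if len(urls) >= min_size}

# Hypothetical results from two underlying search sources.
source_a = [
    ("https://a.example/policy", "Digital Library Collection Policy"),
    ("https://b.example/framework", "Collection Development Framework"),
]
source_b = [
    ("https://a.example/policy", "Digital Library Collection Policy"),
    ("https://c.example/policy", "University Library Policy Statement"),
    ("https://d.example/framework", "A Framework for Digital Collections"),
]
merged = merge_results(source_a, source_b)
clusters = cluster_by_term(merged, "digital library collection development")
```

With this data the duplicate URL is collapsed and two labels survive, "policy" and "framework", each of which a user could click to narrow the list. Words from the query itself are excluded from the labels because they appear in nearly every result and would not distinguish one group from another.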

©2004 H. Richmond Ackerman