The Intellectual Web of Documents

Comparison

The third part of this assignment is to compare the Amazon intellectual web with the Web of Knowledge intellectual web in terms of four variables: structure, cyclicality, content, and denseness.

Given that I have generated and saved many Amazon-derived intellectual webs, I will draw upon several in this discussion. One cannot generalize about the nature of all Amazon webs from a particular instance, and using several lets us see some patterns that we might otherwise miss. In a similar vein, one cannot generalize about the nature of intellectual webs derived from Web of Knowledge based on a single instance. In both cases, though, thinking about the nature of the collections may offer insights that could be confirmed by much more extensive experimentation.

First, though, we have to decide what these terms mean. If we are to use them to measure our webs, they must be operationalized. We'll base this on dictionary.com definitions:

Structure

"The way in which parts are arranged or put together to form a whole; makeup: triangular in structure."

The structure of an intellectual web, then, is the physical shape of the resulting network of entities and relationships. This is straight out of topological studies in network theory. Some possibilities:

Linear: all but two entities have one incoming and one outgoing link, and the two end entities are connected by only a single link
Circular: a linear structure closed into a loop, so that every entity has one incoming and one outgoing connection
Tree topology: a structure where all but terminal nodes each have one incoming link and two or more outgoing links. With exactly two outgoing links per node, it is the special case of a binary tree.
Star: a single entity sits in the middle with a relationship to every other entity; all the rest have a relationship only with the central node.
Fully connected network: every entity has a relationship with every other entity
Hybrid topology: a network consisting of a combination of other types of topologies
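The categories above can be told apart mechanically from how many links each entity has. The following is a minimal sketch, not the method used to produce the webs in this paper: it ignores link direction (a simplification of the definitions above), the function name classify_topology is hypothetical, and it assumes every link refers to a listed node.

```python
from collections import defaultdict

def classify_topology(nodes, links):
    """Classify a network by its degree pattern, ignoring link direction.

    nodes: iterable of node labels; links: iterable of (a, b) pairs.
    """
    nodes = list(nodes)
    n = len(nodes)
    if n < 2:
        raise ValueError("need at least two nodes")

    # Drop direction and duplicates, so (a, b) and (b, a) count once.
    edges = {frozenset(pair) for pair in links if pair[0] != pair[1]}
    neighbors = defaultdict(set)
    for edge in edges:
        a, b = tuple(edge)
        neighbors[a].add(b)
        neighbors[b].add(a)
    degrees = sorted(len(neighbors[node]) for node in nodes)

    # Depth-first walk to confirm every node is reachable from the first.
    seen, stack = set(), [nodes[0]]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(neighbors[current] - seen)
    if len(seen) < n:
        return "disconnected"

    if all(d == n - 1 for d in degrees):
        return "fully connected"
    if all(d == 2 for d in degrees):
        return "circular"
    if degrees == [1, 1] + [2] * (n - 2):
        return "linear"
    if degrees == [1] * (n - 1) + [n - 1]:
        return "star"
    if len(edges) == n - 1:
        # A connected network with n - 1 links has no loops, so it is a tree.
        return "tree"
    return "hybrid"

print(classify_topology([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)]))  # linear
print(classify_topology([1, 2, 3, 4], [(1, 2), (1, 3), (1, 4)]))  # star
```

With very small webs the categories overlap (three fully connected nodes also form a circle), so the order of the checks decides the label in those cases.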

In these terms, my intellectual web from Web of Knowledge has a hybrid topology with mostly linear characteristics; seven of the ten nodes have either one or two links, which is the defining characteristic of linearity. Only nodes #2, #5, and #9 have more than two links. Borgman's article (#2) is the center of a small star topology.

The Amazon intellectual webs I have created, in contrast, are generally hybrid topologies with tendencies towards being fully connected. Consider the "high school classics" web:

CD and MD are outliers; MP and MF are in linear strands; all the rest are linked to three or more other nodes.

These patterns depend on the particular search, rather than reflecting any structure innate in either Amazon.com or Web of Knowledge. While following citations, one could easily pick a set to create any of the topologies mentioned above. From Borgman's opus with 220 citations, for instance, selecting ten papers published, say, between six and seven months prior to her paper would likely produce a star topology. Conversely, picking papers far apart in time, on subjects as disparate as possible, would likely result in a very linear topology. Following a "normal" research pattern, where one follows citations to closely related articles, one would expect a hybrid topology with a fair number of interconnections on each node.

Amazon.com webs might tend towards a slightly more natural grouping, as the books on a "similar to" list are all, by definition, similar to the original. We have seen clumping around authors and multi-volume sets, for instance. However, just as with the Web of Knowledge, it is possible to create webs that travel far and wide with few links joining the various nodes.

Cyclicality

"Of, relating to, or characterized by cycles: a cyclic pattern of weather changes.
Recurring or moving in cycles: cyclical history."

In general, the sample space of ten entities specified for each of the two intellectual webs is too small to allow development of cyclical patterns. One interesting Amazon web, the "China web", did have a barbell pattern where two groups of tightly related books were joined together through a single link. This displays the only element of cyclicality I discerned in a web of a single type of media. In personal webs that span media (starting with a musician's biography, for instance, and jumping into recordings) I noticed a similar barbell pattern several times. With larger sample sizes, one could envision this pattern recurring more frequently.

Looking at the larger picture, both the Amazon database and the citation database would tend to exhibit cyclicality over various scales. Moving from topic to topic or academic discipline to academic discipline, one envisions groups of books or citations joined together. Again, though, there are so many differences - differences in citing behavior among different academic disciplines, for instance - that it is hard to make any generally valid observation other than "things will vary". White, in "Authors as citers over time," makes an interesting observation: "Citing styles in identities differ: 'scientific-paper style' authors recite heavily, adding to core; 'bibliographic-essay style' authors are heavy on unicitations, adding to scatter; 'literature-review style' authors do both at once." (2000) The author's style, then, will also affect the patterns and cyclicality in a given Web of Knowledge web.

Content

"Something contained, as in a receptacle. Often used in the plural: the contents of my desk drawer; the contents of an aerosol can."

Assuming that the webs are the receptacles, then the entities in the webs are very similar: in a WOK web, they are lists of citations joining academic papers; in an ASE web, they are lists of links from one book to another. The content will be the same regardless of the subject matter of the individual searches, although subject matter does affect structure and density, as we have seen and will see.

Denseness

"Crowded closely together; compact"

The dictionary definition makes no sense in the context of an intellectual web, nor does the physics definition of mass divided by volume. However, if we consider link density to be the number of actual links in a network divided by the potential number of links, we find a measure of an intellectual web that could provide some sense of "linkedness" of the collection.

As described earlier, our "open citation linking" WOK web is very linear, resulting in a low density. With 11 links present and 90 possible (each of the ten entities could link to each of the other nine), it has a density of roughly .12, or 12% of all possible connections. On the other hand, the "high school literature" Amazon web has 25 links, for a density of around .28, more than double that of my WOK web.
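The arithmetic behind these figures can be sketched in a few lines. This is an illustration only; the function name link_density is hypothetical, and it assumes links are counted with direction, which is what the figure of 90 possible links among ten entities implies (10 x 9).

```python
def link_density(num_nodes: int, num_links: int, directed: bool = True) -> float:
    """Return actual links divided by potential links."""
    if num_nodes < 2:
        raise ValueError("need at least two nodes")
    # Each node can link to every other node: n * (n - 1) directed links.
    possible = num_nodes * (num_nodes - 1)
    if not directed:
        # Without direction, A-B and B-A are the same link.
        possible //= 2
    return num_links / possible

wok_density = link_density(10, 11)      # "open citation linking" WOK web
amazon_density = link_density(10, 25)   # "high school literature" Amazon web
print(f"WOK: {wok_density:.2f}, Amazon: {amazon_density:.2f}")
# prints "WOK: 0.12, Amazon: 0.28"
```

If links were counted without direction, only 45 would be possible and both densities would double, but the ratio between the two webs would stay the same.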

My experience playing with the Amazon Similarity Explorer is that webs are generally more dense in this environment than in Web of Knowledge. Again, one can create any topology (and thus, any denseness) one desires; but a typical collection of books from Amazon.com, chosen through the "Similar to" list, is going to be more dense. Intuitively, this makes sense: in WOK, the information organization is one way - one paper cites another; that cited paper, by definition, cannot link back. On Amazon.com, the opposite is true: it is extremely likely that links will be mutual, going in both directions, for books deemed "similar" by the Amazon web creators. Now, we saw earlier a case where it was user performance, not true similarity, that linked "HTML Help Authoring" and "Stupid White Men" and, in fact, the latter does not link to the former. So again, generalization is pretty difficult.
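One way to put a number on this one-way versus two-way difference is to measure what fraction of links are returned in the opposite direction. The sketch below is an assumption-laden illustration: the function name reciprocity and the single-letter node labels are hypothetical, and the two small link lists are invented examples, not data from my webs.

```python
def reciprocity(links):
    """Fraction of directed links (a, b) whose reverse (b, a) is also present."""
    link_set = set(links)
    if not link_set:
        return 0.0
    mutual = sum(1 for a, b in link_set if (b, a) in link_set)
    return mutual / len(link_set)

# Invented citation-style web: a later paper cites earlier ones, never the reverse.
citations = [("C", "B"), ("C", "A"), ("B", "A")]

# Invented "similar to" web: most links are returned, one is not.
similar_to = [("X", "Y"), ("Y", "X"), ("Y", "Z"), ("Z", "Y"), ("X", "Z")]

print(reciprocity(citations))   # 0.0
print(reciprocity(similar_to))  # 0.8
```

A citation web should score at or near zero on this measure, while a "similar to" web with mostly mutual links scores close to one, which is consistent with the higher densities seen in the Amazon webs.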

Other comparisons

A few other metrics might be considered: user friendliness, reliability, and completeness come to mind as interesting variables.

Neither ASE nor WOK is particularly user friendly. Of course, ASE is a hack, not a commercial product, but it benefits from Amazon's (usually) intelligent handling of queries. It is very tolerant of error and returns something regardless of the input. On the other hand, I laughed as I read Atkin's paper, describing how much improved WOK was over earlier versions. I would not have liked to try to use the earlier tools; the current one is still cryptic and intolerant at times.

In terms of reliability, experiences with both ASE and WOK leave questions in my mind. Two issues are open.

First, both have returned different answers to identical queries submitted about a week apart. Given that the underlying databases are constantly in a state of development, this may be inevitable; it is still unsettling not to get the same answer back when you run the same query after a relatively short time. The reference system for books is more stable than that used for citing references. WOK uses a proprietary tagging system to identify articles, while Amazon uses the standard ISBN. Going forward this will help Amazon and hinder WOK.

The second issue of reliability is an Amazon one exclusively; sometimes it returns values that just make no sense, and often the return values change over time. Recently, for instance, a keyword search on "education" was consistently returning "Harry Potter and the Order of the Phoenix (Book 5)" with no other books associated with it. During the same session I did a search on "music education" that returned a string of guitar lesson manuals. One day later, the search for "education" returned the result list from the prior day's "music education" search. I'm scratching my head wondering if it is a bug in my code or if Amazon is tailoring its AWS responses to previous queries! I suspect it is Amazon reacting to my searches: all AWS transactions include a unique identifier so it knows when a query is coming from ASE.

Completeness is an issue in any reference system. My experience with WOK points out that a citation database is only useful if it indexes the journals containing research of interest to you. Amazon covers a much larger percentage of works in print than WOK does of citations that have been made. WOK, on the other hand, could be a useful tool when used as an adjunct to other sources. It is clearly insufficient to use as a primary research tool, at least in the field of digital libraries.

Finally, WOK has a certain intellectual deceit involving its core business. Every search display shows two fields: "Cited References" and "Times Cited". A casual user will view them as equally credible and inclusive, but they are not. "Cited References" is in fact a complete list of references originating in the current paper. However, as we have seen, "Times Cited" only includes the citations in the journals included in the WOK database. Looking through the "Cited References" for the articles in my personal web, I'd estimate that more than half of the links were not live; that is, the references out from the articles were to others not included in WOK. From that I conclude that "Times Cited" has missed at least half of the follow-up citations.

Conclusions

In many ways this project was a flop:

I wanted to learn some nitty-gritty web service protocols; instead, I found a very nice API that let me interface cleanly and easily to Amazon.com's database.
I wanted to learn more about OpCit and the open citation movement; instead, I found a commercial citation database that doesn't include the literature on the open citation movement.
I wanted to build a cool tool to explore Amazon.com; instead, the "similar to" relationships returned from Amazon were mostly boring and repetitive.
I was hoping that WOK would be a new reference tool that I could use heavily in the future; instead, I ran into a hard-to-use tool that does not include the literature in which I'm interested.

But it had its good moments too:

Writing ASE was fun; I hope you enjoy playing with it a little.
I learned about some very active researchers I hadn't encountered before.
It provided a nice context for thought as I was reading the information retrieval user interface assignments.