This page briefly describes the design of the Amazon Similarity Explorer
(ASE), covering its components, architecture, and algorithms, and closes
with some restrictions and limitations.
1. Components
ASE is constructed with open-source web development tools. It uses a
MySQL database as a back-end data store for intermediate and final search
results. Application logic is written in PHP, the front-end is a standard
web browser, and the system is hosted on a FreeBSD machine running the
Apache web server.
The Amazon database is accessed through the Amazon Web Services (AWS)
application programming interface (API). Amazon's Web Services home page is
http://associates.amazon.com/exec/panama/associates/ntg/browse/-/1067662/102-8627506-9944939
The PHP modules used to access the web services were provided by Filzhut;
a full description of the Filzhut API is available at
http://hray.com/5703/a1/amazon_functions/documentation/
2. Architecture
A standard three-tier web architecture is employed: a browser front-end,
a PHP application layer, and a MySQL database.
Three PHP modules provide the programming framework. "input.php"
handles the initial capture of search criteria. "showsearch.php"
contains the logic to display archived search results. "amazon.php"
contains the bulk of the application: the code that builds both the
similarities list and the personal web.
A very simple database is used. One table stores the input parameters
and results of similarity searches, while a second stores the results
of personal web construction.
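The two-table layout described above might look something like the following sketch. The production application uses MySQL; sqlite3 stands in here for illustration, and every table and column name is a guess, not ASE's actual schema.

```python
import sqlite3

# Illustrative two-table schema: one table per stored result type.
# All identifiers are hypothetical; ASE's real MySQL schema is not published.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE similarity_search (
    id           INTEGER PRIMARY KEY,
    search_terms TEXT,     -- the user's initial search criteria
    pattern      TEXT,     -- 'deep' or 'wide'
    depth        INTEGER,  -- number of searches requested
    results      TEXT,     -- final similarities list
    dup_pct      REAL      -- overall duplication metric
);
CREATE TABLE personal_web (
    id           INTEGER PRIMARY KEY,
    isbns        TEXT,     -- the books the user selected
    density_pct  REAL      -- cross-reference density metric
);
""")
print([r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")])
```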
Common Gateway Interface (CGI) is used to pass values throughout the
application; only final calculated results are stored in the database.
3. Algorithm overview
Upon initial submission of a search set, ASE requests information from
AWS and receives a result set data structure containing an unspecified
number of entities. The first ISBN is extracted, and a "similar to"
query is sent to Amazon to get the first similarities list. The software
then loops through similarity sets, with the search pattern determined
by the user's settings (deep vs. wide, number of searches). After all
similarities are processed, the list is presented to the user. Each book
has a link that will start a new ASE search, and a checkbox to indicate
inclusion in the user's personal web. A simple overall duplication metric
is calculated (# duplicates / # comparisons * 100).
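The duplication metric above can be sketched as follows; the function name and the flat list-of-lists input are illustrative assumptions, not ASE's actual code (which is PHP).

```python
def duplication_pct(similarity_lists):
    """Percentage of similarity entries already encountered:
    # duplicates / # comparisons * 100, as described above."""
    seen = set()
    comparisons = 0
    duplicates = 0
    for sims in similarity_lists:
        for isbn in sims:
            comparisons += 1
            if isbn in seen:
                duplicates += 1
            else:
                seen.add(isbn)
    return 100.0 * duplicates / comparisons if comparisons else 0.0

# Two similarity lists sharing one ISBN: 1 duplicate in 6 comparisons.
print(duplication_pct([["a", "b", "c"], ["c", "d", "e"]]))  # ≈ 16.67
```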
Once the user has selected the books that will comprise his personal web,
the "submit" command sends the list of selected books back to the
application for cross-referencing. For each book submitted, the similarities
are retrieved from Amazon and each is compared against all the other selected
books to count how many matches are present. Again, a simple density percentage
(# hits / # tries * 100) is calculated. If fewer than ten items are
picked, the number of tries is adjusted to n * (n - 1), where n is the
number of selected books.
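The density calculation can be sketched as below. The names `web_density_pct` and `similars_of` are assumptions; the `9 * n` denominator for larger webs reflects the nine-similarity limit discussed later in the restrictions section.

```python
def web_density_pct(selected, similars_of):
    """Cross-reference density of a personal web: # hits / # tries * 100.

    `similars_of` maps each ISBN to its (up to nine) Amazon similarities.
    With fewer than ten books, each book can link to at most the n - 1
    others, so tries is adjusted to n * (n - 1); otherwise 9 per book.
    """
    n = len(selected)
    chosen = set(selected)
    hits = sum(1 for isbn in selected
               for sim in similars_of.get(isbn, [])
               if sim in chosen and sim != isbn)
    tries = n * (n - 1) if n < 10 else 9 * n
    return 100.0 * hits / tries if tries else 0.0

# Three books, each cross-referencing one other: 3 hits / (3 * 2) tries.
print(web_density_pct(["a", "b", "c"],
                      {"a": ["b"], "b": ["c"], "c": ["a"]}))  # 50.0
```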
The wide traversal algorithm was designed to move "across"
Amazon's database. My thinking was that this would result in a more sparsely
linked set of books. In fact, this algorithm consistently returns a higher
percentage of duplicates than the deep search algorithm does. (Duplicates
are cross-references to books already encountered.)
The deep traversal algorithm fully explores each similarities list returned
from Amazon.com before moving on to a new one. A full tree-traversal algorithm
would get even deeper results, but that was not implemented. This approach
returns a lower percentage of duplicates overall; by the time the search
reaches the bottom of the list, the subject matter may have changed enough
to yield a fairly fresh new list.
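The two traversal patterns can be sketched as one routine that differs only in which end of the frontier it works from: deep search expands the most recently seen list first (stack), while wide search moves across lists in arrival order (queue). `fetch_similar` stands in for the AWS "similar to" call and, like everything else here, is a hypothetical name. As in ASE itself, no record of visited books is kept.

```python
from collections import deque

def traverse(start_isbn, fetch_similar, max_requests, deep=True):
    """Sketch of ASE's two search patterns over Amazon similarity lists.

    deep=True  -> fully explore each similarities list before moving on.
    deep=False -> wide search: move "across" lists, oldest first.
    """
    frontier = deque([start_isbn])
    results = []
    requests = 0
    while frontier and requests < max_requests:
        # Stack behavior for deep search, queue behavior for wide search.
        isbn = frontier.pop() if deep else frontier.popleft()
        sims = fetch_similar(isbn)   # one "similar to" request to AWS
        requests += 1
        results.extend(sims)
        frontier.extend(sims)
    return results

# A tiny stand-in similarity graph instead of live AWS calls.
graph = {"r": ["a", "b"], "a": ["c"], "b": ["d"]}
lookup = lambda isbn: graph.get(isbn, [])
print(traverse("r", lookup, 10, deep=True))   # ['a', 'b', 'd', 'c']
print(traverse("r", lookup, 10, deep=False))  # ['a', 'b', 'c', 'd']
```

Note how the deep variant finishes branch "b" before returning to "a", while the wide variant sweeps across both branches level by level.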
4. Restrictions and areas for improvement
No attempt was made to build this to industrial strength, and there are
many subtle and not-so-subtle areas where improvements could be made:
Obvious visual and navigational gloss is missing. A fairly advanced
search screen with many parameters is offered; a simple Google-like interface
could be used instead, with common settings assumed or deduced from context.
The traversal options should include a full tree traversal, in which
every node is deeply investigated.
The existing deep traversal algorithm is subject to a trap in which the
last reference on one list points back to the top of a prior list. While
never encountered in practice, this looping would continue until the
search depth was reached.
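One simple guard against that trap would be a visited set, so that a list whose last entry points back to an earlier list is skipped rather than re-expanded. This is a sketch of the fix, not ASE's code, and all names are illustrative:

```python
def traverse_safely(start_isbn, fetch_similar, max_requests):
    """Deep traversal with cycle protection via a visited set."""
    stack = [start_isbn]
    visited = set()
    results = []
    while stack and len(visited) < max_requests:
        isbn = stack.pop()
        if isbn in visited:
            continue          # already expanded; skip instead of looping
        visited.add(isbn)
        results.extend(sims := fetch_similar(isbn))
        stack.extend(sims)
    return results

# A two-book cycle terminates immediately instead of spinning to depth.
cycle = {"a": ["b"], "b": ["a"]}
print(traverse_safely("a", lambda i: cycle.get(i, []), 100))  # ['b', 'a']
```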
Many bibliographic improvements could be made: using, for instance, the
initially returned list rather than the similarity tree for the first
book would make a big difference in how ASE works. As it stands, it is
too easy to get sidetracked by the reliance on the first search term.
None of the HTML has been validated or tested for accessibility; this
is a hack, not a product.
The data retrieved from Amazon is not always reliable; at times, the
wrong title is returned for a given ISBN, or Amazon reports that no
similarities are available for a search that should obviously return
many items ("baseball" was returning a single entry one morning, for
instance). This tends to happen on wide search patterns with large search
depths. I suspect it is due to bandwidth limitations Amazon imposes
dynamically on the service available through its Web Services API. The
specification says that a user is limited to one query per second. I make
no attempt to throttle my requests and am certainly exceeding the stated
access rate during wide search patterns, where the software moves quickly
from one similarity list to another. Deep search patterns are much slower,
as all nine "similar to" titles in a result set are processed before the
next request is made to Amazon Web Services.
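A minimal throttle that would keep requests within the stated one-query-per-second limit could look like the sketch below. ASE itself does not do this; the class and its interval are assumptions.

```python
import time

class Throttle:
    """Enforce a minimum interval between outgoing requests.

    With min_interval=1.0 this matches the AWS access-rate specification
    quoted above (one query per second)."""
    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = None

    def wait(self):
        """Sleep just long enough to honor the interval, then record now."""
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()

# Usage before each AWS call (fetch_similar is hypothetical):
#   throttle = Throttle()
#   throttle.wait()
#   result = fetch_similar(isbn)
```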
The density calculation of the Amazon personal web is misleading for
webs of more than ten documents: since my code only examines nine similarities
per book, a web of more than ten books has potential links that will
never be explored. The limit of nine similarities per book derives from
Amazon Web Services, where nine is the maximum number returned in a single
request. It is possible to request additional "pages" of data, but that
was not implemented. It is interesting to note that Amazon's website
displays "Similar to" items in a three-by-three grid; it is possible that
the output of Web Services was determined by the visual representation
of the relationship on the website.
This was an easy program to write, but it took more time than I really
wanted to spend on it. There are occasional irregularities and inconsistencies
in its operation. I have not had time to figure out whether they result
from bugs in my code, side effects of caching by the PHP library, or variability
in Amazon.com's operation (or a combination of all three!).