by Frederick Rustam
Part Three, THE NEARNESS OF YOU
Questor Institute is a new, experimental technical school where
bright-but-poor high-school graduates on full scholarships spend
two years seeking to become wizards of Internet sorcery by studying
the science and philosophy of information retrieval from textual
databases such as the World Wide Web. In Part Two, "The Delight of
Being Together," Kevin and Marylou learned about the AND logical
operator and about co-occurrence, correlation, and context. They
also studied the usefulness of natural-language phrases for the
retrieval of more than just interesting quotations.
"We've devoted several days to the AND operator because it's the most
important of the Boolean search operators. There are two reasons for
its importance. It's the one we most-often use for retrieval because
it allows us to put two or more words together to search subjects more
complex and more specific than we can express with one word. And it's
often the only operator we'll find at many websites' internal search
engines. Be thankful that some larger, more-versatile search engines
offer us three other logical operators.
"After we've constructed a complex search-subject by using some
natural-language words, we might have to use one or more additional
words to 'qualify' our complex subject by what amounts to an 'aspect'
of it, that is, a special way of looking at it---by place, time,
or bibliographic format, for example.
"When we seek a complex subject such as 'the history of stream
pollution in West Virginia by coal mining,' for example, our basic
subjects can be viewed as 'stream pollution' and 'coal mining.'
'West Virginia' and 'history' are the place and time aspects of
our complex subject. When we seek webpages which treat those two
aspects, we add the qualifying searchwords 'West Virginia' and
'history' to our search input:
<stream pollution "coal mining" "west virginia" history>
"I've used 'stream pollution' without quotes because it may also
be written as 'pollution of streams.' My two unquoted words will
pick up both forms. The word 'streams' is actually an inadequate
searchword, though; I should use the names of all the kinds of
streams which might be polluted, but for now we'll simplify this
part of our subject's complexity.... So is this the way to search
for it?... Maybe.
"'West Virginia' is a bang-on qualifier, but 'history' is tricky.
It involves dates we don't know, and which may not be found on
webpages about the subject. The qualifier, 'history,' doesn't
always appear as a textword on relevant webpages, either, because
historical treatments often don't label themselves as such. An
author may write from a historical perspective, but may not title
his work, 'A History of...'---or even use the words 'history' or
'historical' anywhere in the body of his work. The nature of a
subject's treatment in a document is the kind of slippery concept
which human concept-indexers perceive in their subject analysis,
but which Web indexing crawlers can't.
"Qualification, when we can use it, is a positive way of concentrating
on the aspects of a subject which we want, and conversely, of excluding
---'disqualifying'---other aspects of it which we don't want. Subject
qualification is a more difficult retrieval process than the complex
interaction of concrete subjects, but it's an important means of
zeroing-in on a subject which may be written about in many aspects.
You must learn how to qualify subject searchwords with aspect words
---and to abandon your aspect words if they don't retrieve well.
You'll find today's homework to be a very challenging problem in
subject qualification by place."
"Single words or short phrases must be rare or unique for us to
achieve quick and obvious retrieval success with them.... When the
English mystery writer Ellis Peters wrote her Brother Cadfael
series of mystery novels, she chose 'Cadfael' as the name of her
medieval monk-detective because it's a very rare name, even in
Wales. So if we want to retrieve webpages about Brother Cadfael,
we can just search for:
<cadfael>
"And we get only Brother Cadfael. Rarity is built into that subject,
so to speak, and we don't even have to use the word 'Brother' with
the name Cadfael to retrieve him. In fact, it's not a good idea to
search for 'Brother Cadfael' because he's been written about on the
Web often as simply 'Cadfael.' This retrieval situation illustrates
the negative face of subject qualification.
"When we search with several searchwords, and one of our words is not
a textword in a relevant document, we've 'disqualified' that document
from retrieval. In this sense, 'disqualification' is the opposite of
qualification. We use more searchwords to qualify---to specify---a
subject, but too many searchwords can 'overspecify' a search and
exclude relevant documents.
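
For readers who want to see overspecification in action, here is a
minimal Python sketch of an ANDed textword search. It assumes a tiny
invented "Web" of three pages and a simple rule that a textword is any
run of letters or digits. The page texts and the function names
(textwords, and_search) are made up for illustration and are not how
any particular search engine works.

```python
import re

def textwords(text):
    """Split text into a set of lowercase textwords (letters and digits)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def and_search(searchwords, pages):
    """Return titles of pages that contain every searchword as a textword."""
    wanted = {w.lower() for w in searchwords}
    return [title for title, text in pages.items() if wanted <= textwords(text)]

# Hypothetical miniature "Web" of three pages, invented for illustration.
pages = {
    "fan page": "Brother Cadfael is the monk-detective of Shrewsbury Abbey.",
    "tv review": "Cadfael solves another medieval mystery in this episode.",
    "abbey tour": "Visit the abbey where a brother once kept the herb garden.",
}

print(and_search(["cadfael"], pages))             # ['fan page', 'tv review']
print(and_search(["brother", "cadfael"], pages))  # ['fan page']
```

Adding the searchword 'brother' disqualifies the relevant TV review,
because that page never uses the word.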
"In one episode of the Brother Cadfael mysteries on TV, Cadfael had
to use mos teutonicus on a corpse to view its bones. Anybody
here have enough Latin to translate that phrase?... It means 'the
German practice'---of boiling a corpse in vinegar to remove its
flesh. I was curious to see if the Web had anything on this rare
subject, so I searched for:
<"mos teutonicus">
"Despite its presumed rarity, I quoted this phrase in my search. Why?"
Marylou had the answer. "Because 'mos' might be found in words as a
fragment of them, and they might falsely co-occur with 'teutonicus.'"
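
Marylou's point about fragments can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the two
sample sentences are invented, and the two matching functions
(substring_match, textword_match) are hypothetical stand-ins for an
engine that matches fragments and one that matches only whole words.

```python
import re

def substring_match(searchword, text):
    """Fragment match: the searchword may sit inside a longer word."""
    return searchword.lower() in text.lower()

def textword_match(searchword, text):
    """Whole-word match: the searchword must stand alone as a textword."""
    return searchword.lower() in re.findall(r"[a-z0-9]+", text.lower())

# Hypothetical sentences, invented for illustration.
relevant = "The mos teutonicus was used on crusaders who died abroad."
unrelated = "Teutonicus almost certainly never saw the mosaic."

for text in (relevant, unrelated):
    print(substring_match("mos", text), textword_match("mos", text))
# First line printed:  True True
# Second line printed: True False
```

In the unrelated sentence, 'mos' turns up only inside 'almost' and
'mosaic,' so a fragment-matching search would falsely pair it with
'teutonicus.'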
"Right. Be careful with searchwords like 'mos' which often appear
in text as fragments of other words.... I retrieved two items. They
were two postings from a Web discussion forum about mos teutonicus.
One posting inquired about it. A follow-up posting offered a short
bibliography of books on medieval burial practices in which that
subject was treated. In this manner, the Web can lead us to other
information sources, even when it can't provide anything substantive
about a subject. But as you know, the Web isn't all the Internet.
"Much Internet information can also be found in the 'news groups'
of the Usenet. Years of Usenet discussions have been archived and
textword-indexed by some of the search engines. But you'll probably
use this resource only if Web information proves insufficient for
your purposes. We'll deal specifically with the Usenet archives
as an information source, later, and we'll also search there for
"Somebody here will do that tonight, I'll bet," whispered Kevin,
who knew how classroom suck-ups operated. "So I gotta do it, too."
"Your reasoning escapes me," replied Marylou, haughtily. But she
made a mental note to make the search, also.
The Biggest Ocean Wave Ever Measured
"We learned in our previous 'orange juice' example that searching
for a subject with a long, quoted phrase can sometimes retrieve gold,
and at the same time greatly diminish chicken feed and garbage....
Another way to achieve pinpoint retrieval is to search for an ANDed
combination of known words which don't form a natural-language phrase
or sentence but whose words collectively profile our desired subject
and act to filter-out nonrelevant items. Note---I said 'known words,'
not 'guessed words.' We have to know enough specific facts about our
subject so that we can express those facts with their own words,
so to speak. An example of this:
I read that the largest ocean wave ever seen was measured in
1933 in the Pacific Ocean. It was 112 feet high. I wondered,
'Wow! How did they measure that one and yet survive it?' So,
with faith in the Web's great ocean of information, I searched
for a combination of known words:
<wave pacific 1933 112>
"Numbers are considered as words in text. They can be written-out
also---for example, 'nineteen thirty three'---but I guessed that
these particular numbers wouldn't be.... Make this search now,
and see what you retrieve."
As usual, Marylou was racing ahead. "Only a few result-items. The
first one is right on the money: a miniature webpage which succinctly
sets forth the whole story about this huge wave. It's a little gem."
"It is, indeed. It tells you how the Navy oiler, USS Ramapo,
measured the wave and why they survived it. The World Wide Web is
truly fruitful when we know some of the words that're almost certain
to be on a webpage somewhere. Note that the order, the 'syntax,' of
our known words can affect our search results in some engines. So
put them in their 'natural order' for searching, that is, the order
in which they would most-likely occur in text. Some search engines
use that order as one of their 'secret' techniques for ranking
A Streetcar Named Jette
"Now that we've had great success by using unique single words and
names, phrases, and topic-profiling combinations of descriptive
words, let's turn to a subject which seems quite-specifically named
but isn't all that easy to find. Many search-combinations contain
the fragmentary seeds of their own retrieval difficulty. Here's one:
I was watching CNN. They ran a brief promo for their service,
a montage of international images that showed how widespread
their newsgathering service is. In the last few seconds of this
filmed promo, there was a street scene which showed a streetcar
approaching the camera. Just before the promo ended, the trolley
came close enough to the camera for me to read the destination
sign on the front, above the windshield. It read, '94 | Jette.'
Right away, I wondered if I could search out this streetcar line
on the Web and discover the city where that portion of the promo
film was shot.
"This retrieval may seem like grasping at straws, but that's how
we sometimes find what we seek. Even three years ago, when I made
my search, the World Wide Web had become a database of astronomical
proportions. Did any computer scientist ever envision a distributed
database of such vast dimensions that a search for almost anything
conceivable retrieves something relevant?
"Back to the streetcar: I searched for the two known parts of its
<94 AND jette>
"This returned an astounding 8878 items!... It was at the sixth
item on the sixth page of search results that I found a webpage at
the website of 'Planitram,' the outfit that runs the streetcars in
Brussels, Belgium. The 94 line to the district of Jette was listed,
and some info was given about the line's 'headway'---how long we'll
have to wait for the next streetcar after we've just missed one.
Success on the sixth results page may seem like rough retrieval,
but it's better than no success or success on the twelfth page.
"Jette is a personal name, by the way. And anytime somebody named
Jette and the number 94 appear somewhere on a webpage---with 94 as
a fragment of the date '1994,' for example---we retrieve that page
in an ANDed search, even though it's not relevant to the Brussels
streetcar. There were plenty of these items in my results.
"But there's a better way of matching two pieces of data than by
simply doing an ANDed search of them. What is it?"
"Use the NEAR operator," ventured Marylou.
"Right you are. Here's another search principle: the closer the
'proximity'---nearness---of two textwords, the more likely they are
to be correlated. If we can specify that our searchwords be found
close to each other in text, we increase the chances they'll be
correlated in the retrieved webpages. The NEAR operator which does
this was used in proprietary online information systems years before
the World Wide Web was born.
"Unfortunately, many Web search engines don't offer us 'proximity
searching,' as the use of the NEAR operator is called, so we may have
to abandon our favorite engine to use another one which has it. If we
put the NEAR operator between two searchwords, we'll retrieve only
those documents in which the two words are fairly close to each other
---no more than ten words apart in one Web search engine I used to
search for '94 | Jette'.
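
The difference between AND and NEAR can be sketched in Python as
follows. This is a rough model, not any engine's actual method. It
assumes that "ten words apart" means the two words' positions in the
text differ by no more than ten, and the two sample texts and the
function names (and_match, near_match) are invented for illustration.

```python
import re

def tokens(text):
    """Lowercase textwords in reading order; numbers count as words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def and_match(word_a, word_b, text):
    """AND: both words occur somewhere in the text."""
    words = tokens(text)
    return word_a.lower() in words and word_b.lower() in words

def near_match(word_a, word_b, text, max_distance=10):
    """NEAR: some pair of occurrences lies within max_distance positions."""
    words = tokens(text)
    pos_a = [i for i, w in enumerate(words) if w == word_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == word_b.lower()]
    return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)

# Hypothetical page texts, invented for illustration.
timetable = "Line 94 runs to Jette every ten minutes."
guestbook = ("Jette signed our guestbook and wrote a long note about her trip "
             "through the old town with her two sisters. We counted 94 "
             "visitors that day.")

for name, text in (("timetable", timetable), ("guestbook", guestbook)):
    print(name, and_match("94", "jette", text), near_match("94", "jette", text))
# timetable True True
# guestbook True False
```

Both texts satisfy the ANDed search, but only the timetable, where the
two words sit three positions apart, survives the NEAR test.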
"By the way, some commercial online databases allow us to specify the
number of words in the text separating any two NEARed searchwords, and
some textual database management programs simply allow us to retrieve
any two words occurring in the same sentence or in the same paragraph
"So my revised streetcar search was:
<94 NEAR jette>
"This strategy returned only 474 items, a gross retrieval reduction
of ninety-five percent!... And the same Planitram webpage I found
in my previous AND search was now the fourth item on the first page
of results!... I looked further through these results, and I found
another page with a Planitram table which listed all their bus and
streetcar lines, including old No. 94.
"Now's the time to remind you that the Planitram table of transit
data I retrieved was indexed by a general search engine because it
was put up on a webpage in HTML format. If their data had been in
a separate, non-HTML database---even one searchable via a webpage
gateway---their data couldn't have been indexed by a Web indexing
crawler.... A familiar example of a non-Web-indexed, non-HTMLed
database is the public library's catalog, which we can search
from a page on the library's website. But we'll never find any
of that catalog's entries directly by using a general Web engine.
"Okay. Maybe you're thinking about searching my streetcar line as
a quoted phrase. Well, I did that for comparison. My search results
differed markedly from those where I used the AND or NEAR operator!
The Web is full of nasty surprises. I searched one engine for:
<"94 jette">
"It returned sixteen items. Only one chicken-feed item was relevant,
and that page was a humongous list of European 'tramcar' types and
the lines they ran on.
"The same phrase search on another engine retrieved twenty-four items.
Five of those on the first page were Planitram webpages, but none of
these had the 94 line listed, even though these items 'dropped' on a
search for the 94!" The teacher sighed, "Woe is me."
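
One reason a quoted-phrase search can differ so much from an AND or
NEAR search is that the phrase's words must be adjacent and in order.
Here is a minimal Python sketch of that rule, assuming punctuation
such as the '|' on the destination sign is ignored in indexing. The
sample texts and the function name phrase_match are invented for
illustration.

```python
import re

def tokens(text):
    """Lowercase textwords in reading order; punctuation is dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())

def phrase_match(phrase, text):
    """True if the phrase's words occur consecutively, in order, in the text."""
    want = tokens(phrase)
    have = tokens(text)
    n = len(want)
    return any(have[i:i + n] == want for i in range(len(have) - n + 1))

# Hypothetical page texts, invented for illustration.
sign_caption = "Destination sign reads 94 | Jette"
line_table = "Line 94 terminates at Jette station."

print(phrase_match("94 jette", sign_caption))  # True
print(phrase_match("94 jette", line_table))    # False
```

The line table is relevant and would be found by an AND or NEAR
search, but the phrase search misses it because other words stand
between '94' and 'Jette.'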
"After a webpage is indexed, its text may be changed, but textwords
no longer there may still be in the search engine's index file, and
they'll remain there until the page is crawled again. However, you
may be able to view the originally-indexed text if the search engine
caches a 'snapshot' of the original page and offers you that page
as an alternative to viewing the current page. This is a helpful
feature. In one engine, a link to the originally-indexed page appears
at the bottom of each item-annotation as the underlined link-word,
'Cached.' Click on that link and look for your searchwords which are
missing from the currently-retrieved version of the page.
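
The stale-index problem can be modeled with a toy engine in Python.
This is a sketch under stated assumptions: the class TinyEngine, its
methods, and the page 'lines.html' are all hypothetical, and the
engine is assumed to update its index and its cached snapshot only
when it crawls a page.

```python
import re

def textwords(text):
    """Split text into a set of lowercase textwords (letters and digits)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

class TinyEngine:
    """Hypothetical engine that indexes a page only when it crawls it."""

    def __init__(self):
        self.index = {}  # url -> textwords seen at crawl time
        self.cache = {}  # url -> snapshot of the text seen at crawl time

    def crawl(self, url, live_pages):
        text = live_pages[url]
        self.index[url] = textwords(text)
        self.cache[url] = text

    def search(self, searchwords):
        wanted = {w.lower() for w in searchwords}
        return [url for url, words in self.index.items() if wanted <= words]

# Hypothetical live page, invented for illustration.
live_pages = {"lines.html": "Streetcar lines: 92, 93, 94 Jette"}
engine = TinyEngine()
engine.crawl("lines.html", live_pages)

# The site edits its page, but the engine has not crawled it again.
live_pages["lines.html"] = "Streetcar lines: 92, 93"

hits = engine.search(["94", "jette"])
print(hits)                                     # ['lines.html']
print("94" in textwords(live_pages[hits[0]]))   # False
print(engine.cache[hits[0]])                    # Streetcar lines: 92, 93, 94 Jette
```

The page still 'drops' on the search because the index remembers the
old text, and only the cached snapshot shows where the searchwords
used to be.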
"Even a better logical operator may not much improve your retrieval
from the Web's vast universe of text if you have a tough subject to
search. Although my search for <94 NEAR jette> proved more fruitful
than <94 AND jette>, many items not relevant to my purpose were still
returned by the search engine. The principle I previously mentioned
about the relationship between textual database size and the number
of nonrelevant retrievals never fails us. And there's no database
larger than the World Wide Web.
The teacher decided to inflate his students a bit.
"Today's lesson revisits the awful reality of textword retrieval
from the Web. It's often difficult and frustrating because there're
so many possibilities for false co-occurrence among the textwords
of webpages. But the way I see it is this: some people just gotta be
skilled at Web textword retrieval, and some of those skilled people
will be Questor graduates. You. That's why you'll be in this class
for your whole two years, studying the problems of infotrieval and
learning the solutions to them, where there are any solutions.
"Before we wrap it up for today, a brief word about the relevance-
sorting of search results. With all the search engines, exactly how
this is done is mostly proprietary info. But some of the techniques
are mentioned on their Help pages. They usually count the number of
times our searchwords occur as textwords in the retrieved webpages.
And those pages which have our searchwords in their HTML metatag
'Title,' 'Keyword,' or 'Description' fields are ranked higher than
those where our words are only found in the body of the text.
"Some engines also rank the sorted items by the number of hyperlinks
to them from other webpages, on the working theory that those pages
which are most linked-to are the most relevant ones. But this ranking
is valid for an individual search only after the retrieved webpages
are first sorted by an examination of the number and position in them
of our searchwords. A webpage is usually linked-to for its main
subject, and this first has to be determined---guessed, really---by
the search engine before a page's link statistics can be used to give
it a boost upward in the results. This means that when we retrieve a
page for a minor subject on it---one that isn't the reason other
webpages link to the page---the page's link statistics are of less
value in ranking it for us.
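
Since the engines keep their real formulas private, the following
Python sketch is only a guess at the general shape of such ranking.
It assumes that a searchword found in a page's title counts three
times as much as one in the body, and that the number of inbound links
is used only to order pages whose word scores are equal. The weights,
the sample pages, and the function names (word_score, rank) are all
invented for illustration.

```python
import re

def tokens(text):
    """Lowercase textwords in reading order; numbers count as words."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Assumed weights, chosen only for illustration.
TITLE_WEIGHT = 3
BODY_WEIGHT = 1

def word_score(searchwords, page):
    """Count searchword occurrences, weighting title hits above body hits."""
    title, body = tokens(page["title"]), tokens(page["body"])
    return sum(TITLE_WEIGHT * title.count(w.lower()) +
               BODY_WEIGHT * body.count(w.lower())
               for w in searchwords)

def rank(searchwords, pages):
    """Sort pages with any hit by word score, then by inbound links."""
    hits = [p for p in pages if word_score(searchwords, p) > 0]
    return sorted(hits,
                  key=lambda p: (word_score(searchwords, p), p["links_in"]),
                  reverse=True)

# Hypothetical pages, invented for illustration.
pages = [
    {"url": "a", "title": "Tram line 94 to Jette",
     "body": "Headway on line 94 is ten minutes.", "links_in": 2},
    {"url": "b", "title": "Brussels transit news",
     "body": "Line 94 to Jette is delayed. Line 94 resumes soon.", "links_in": 40},
    {"url": "c", "title": "European tramcar types",
     "body": "The 94 uses older cars.", "links_in": 40},
    {"url": "d", "title": "Tramcar photo archive",
     "body": "Car on the 94 route.", "links_in": 5},
]

print([p["url"] for p in rank(["94", "jette"], pages)])  # ['a', 'b', 'c', 'd']
```

Page 'a' outranks the heavily linked page 'b' because its title holds
both searchwords, while the link counts decide only between 'c' and
'd,' whose word scores are tied.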
"These complex relevance-judgment techniques are, you understand,
a dumb-computer substitute for human evaluation. But they are
useful; they generally cause the more-relevant pages to bubble up
in the search results---I said 'generally.' But relevance is a
tricky, subjective attribute of text. Counting words and 'links-to'
doesn't always put what we want on the first page of search results.
"A search engine is like any other computer program: it does what we
tell it to, not what we want it to. If it were really 'intelligent,'
it would go beyond our bare statement of searchwords. It would add
some relevant terms that we didn't anticipate---and join all these
words with the correct logical operators. Someday, maybe they'll do
that.... But second guessing about searchwords can be destructive
as well as constructive, especially if it's done by 'artificial
intelligence' programming. You're here to learn how to use your minds
to retrieve, and how to use proven computer programs as tools for
doing it. Remember that the old truism, 'garbage in, garbage out,'
applies to our search strategies as well as to the documents we seek.
Don't be garbagey with your searchwords.
"Okay. Your homework assignment is to retrieve a photograph which
will allow us to judge for ourselves, but not to verify, an assertion
that's difficult to prove unless we travel to a far place and do some
historical research there. In this case, the Web will give us no
informational verification, but it will help us to draw our own
conclusions. Here's your retrieval situation:
Some say that the Paramount Pictures name and logo which were
created by the company's founder---a man from Ogden, Utah---
were inspired by Mt. Ben Lomond, a peak in the Wasatch Range
near his hometown.
"I want you to find a photograph of Mt. Ben Lomond and see how
inspirational you find it to be for the Paramount logo. You can
search any of the Web's text-and-image or image-only databases.
Those of you with the most-inspiring photos will achieve the
"This homework assignment may seem like a simple retrieval problem,
but it isn't. There are three troublesome complications.
"First, there are four Mt. Ben Lomonds in the world; I don't want
pictures of the other three. So you'll have to qualify your basic
search-name by ANDing or NEARing it with a qualifier which 'localizes'
the putative Paramount Mt. Ben Lomond and more-or-less excludes the
others. We'll learn about the exclusion of nonrelevant textwords with
the NOT operator, later. For now, you'll use the positive process of
qualification in your attempt to emphasize Utah's Mt. Ben Lomond
in your search results.
"The second complication is that you'll find pages which mention
Mt. Ben Lomond, but few of them will illustrate it for you---
chicken feed.... The third complication is that Web image databases
retrieve a lot of really weird garbage because these images must be
indexed indirectly by crawling their filenames, their webpage captions,
or nearby text. Web images are not concept-indexed by human indexers
whose minds grasp the identity and meaning of images.
"Okay. Gentlemen and gentlewomen, start your search engines!"
In the hallway, the class joker began singing,
You take the high road,
And I'll take the low road,
And I'll get Ben Lomond before ye.
Kevin scowled. "I hope that guy craps out on the homework."
"That's not a very nice sentiment," chided Marylou. "He may find
the best picture of Mt. Ben, and you may be the one to 'crap out.'"
"No way. I'll stay up all night if I have to."
"That's the spirit."
THE END OF PART THREE
Next: Part Four, "Ordinary Citizens as Scholars"
© 2002 by Frederick Rustam. Frederick Rustam is a retired civil
servant. He formerly indexed technical reports for the Department of
Defense. He writes science fiction for Web ezines as a hobby. He
studies and enjoys the Internet as a hobby.