by Frederick Rustam
Part Two, THE DELIGHT OF BEING TOGETHER
Questor Institute is a new, experimental technical school where
bright-but-poor high-school graduates on full scholarships spend
two years seeking to become wizards of Internet sorcery by studying
the science and philosophy of information retrieval from textual
databases such as the World Wide Web. In Part One, "A School for
Internet Sorcery," two students, Kevin and Marylou, have their
aptitudes tested, strike up a friendship, and attend the school's
first Assembly, where they're welcomed by the Rector in a speech
setting forth the unusual educational goals of Questor Institute.
Co-occurrence, Correlation, Context
Kevin and Marylou had arrived early and were experimenting with their
high-speed workstations when the teacher arrived. He was a tweedy man
in his forties who wore a bowtie and parted his hair conspicuously
in the middle.
"I'll be giving you search examples from my personal experiences
in Internet searching, mostly from the Web," the teacher began.
"Use your computers to search my examples as I discuss them, if you
wish, but don't forget to pay attention to what I'm saying. Okay?
Some of the examples I discuss may seem trivial, especially by
comparison with the complex subjects that professional searchers
have to retrieve. But at this stage in your instruction, I prefer
to use curiosity-satisfying examples which are easier to understand
and which'll give us some enjoyment to pursue. You'll be searching
for hard-to-find subjects you've never heard of, soon enough.
"There're many search issues for us to deal with. Some will arise
as we proceed, but I'll be unable to pursue them right then, and
I'll say, 'We'll deal with that in detail, later.' If I stopped my
flow of instruction and sidetracked us into every new search issue
which reared its ugly head, I'd subvert your learning process.
"First, let's define some basic terms. A textword is a word
used in the text of a webpage or a Usenet posting. Textwords are
copied to create index words. A searchword is a word
we formulate in our minds to make a subject search, without knowing
for certain that it exists as a textword. Practically speaking then,
a searchword is a 'probable textword,' viewed from a searcher's
perspective. Because the Web and Usenet are such immense databases,
our searchwords will almost always be found as textwords somewhere
on the Internet.
"A term you'll often see on the Internet, in literature about the
Internet, and spoken by most people is 'keyword.' This term is
overused---like the much overused term, 'homepage.' As students
seeking to be Questors, you'll mostly avoid 'keyword,' except to
understand how others use it so that you can properly communicate
with them. A keyword is, broadly, a word which is a key to finding
information. It's our searchword of choice, it's the index word which
matches our searchword, and it's the textword we seek in a document.
People use 'keyword' to refer to all those things."
The teacher smirked, anticipating his students' reactions to the
terminology that was coming.
"Three other terms I'll use are offered as jargon for us Questors.
'Gold' is a search-result item which is relevant to our search and
useful to our purpose.... 'Chicken feed' is a search result which
is technically relevant to our searchword but not useful for our
purposes. A mere mention of something we're seeking---that's the
usual textual form chicken feed takes.
"'Garbage' is a collective term for search-result items which
aren't at all relevant to our purposes, but which show up anyway.
"Let's begin our study of the AND operator with a nice sentiment:
'The Delight of Being Together.' This sentiment has served lovers
for countless generations of human existence. But it also serves
those of us who seek information from textual databases. When we put
together several searchwords, we hope to retrieve relevant text where
our words are found together in the same meaningful relationship they
were in our minds when we chose them for searching. The delight of
being together throughout the entire infotrieval process is not
easily experienced, though.
"There are three 'C's which we must understand: co-occurrence,
correlation, context. These are three fundamental realities
of retrieval using the AND operator, by far the most-often used
logical search operator. When we use several searchwords in most
search engines, our words are ANDed to each other by default---
that is, even if we don't actually type the word AND between them.
In this way, complex, more-specific search subjects are expressed
by using an increasing number of single words as building blocks,
just as natural language phrases are constructed from words.
"To illustrate these three realities, here's a retrieval situation
which sprang from one of my casual curiosities:
I heard a know-it-all radio talkshow host mention Charles
Martel, a medieval French leader, and he added as an aside,
'That's Charlemagne.' I thought he was wrong: Charles Martel
and Charlemagne (Charles the Great) weren't the same man.
How can I easily use the Web to prove my assumption?"
A student said, "Search either guy's name, and do a page-search for
the other name."
"Possible, but not quick enough. I could spend a lot of time checking
webpages about one man for a mention of the other. Let's construct a
logical word relationship before we search." He turned and wrote on
the easily-erasable whiteboard a search strategy:
<charles martel charlemagne>
"Use your computers now to make this exact search."
The students pounded on their keyboards. This is kid stuff,
thought Kevin. I know the point he's making, mused Marylou.
"When we search this way, what do we retrieve?"
"Webpages with both names on them," offered a girl, who was reading
the search results as she spoke.
Charles Martel - Wikipedia
... turned the tide of Islamic advance, and the unification of the
Frankish kingdom under Charles Martel, his son Pepin the Short,
and his grandson Charlemagne ...
www.wikipedia.org/wiki/Charles_Martel - 12k - Cached -
"Right. We've used the search engine's default AND operator to
retrieve webpages which have both names on them. We should have
searched Charles Martel's entire name as a quoted phrase---set off
with quotation marks---to retrieve his forename and surname only
when they were 'juxtaposed,' next to each other in webpage text.
But I wanted you to input his forename and surname separately to
illustrate our second 'C,' correlation.... So what do we have in
the search results for our three words?"
Marylou was ahead of the pack. "Six of the ten items on the first
page of search results state in their annotations either that Charles
Martel was Charlemagne's grandfather, or that Charlemagne was Charles
Martel's grandson. We don't even have to click on the links and read
the webpages to find that out."
"Right. These are very good search results---and there's a reason
for it. When two words are searched with the AND operator, there
has to be a co-occurrence of them on a webpage for the page to be
returned in the search results. However, two co-occurring words
aren't necessarily correlated, that is, semantically related to
each other. Our three searchwords are highly correlated
in the page's text.... Why?"
Silence. The students weren't sure how to answer this question.
"Because six webpage authors used all three of our searchwords
as textwords in the same sentence!... Notice that in each of those
six webpage annotations, each of our three searchwords is rendered
in boldface by the search engine software to show us where they are.
Also, some other words that occurred on either side of these boldfaced
words in the webpage text have been excerpted from the page to show
the 'context' of our searchwords---the way they were used in the text
of those webpages.
"If each of our three searchwords had been uncorrelated with the
others, each word's 'contextual excerpt' would have been isolated and
separated from the other two excerpts by ellipses, those three dots
which represent textwords not excerpted. You can see this in the other
result-items where our three searchwords didn't occur so close to each
other in the text. In six of the annotations, our three words are found
close together in single sentences because, on those six webpages, the
page authors wrote it that way.
"This example shows us that simple facts can be teased from the Web
by tickling it with its own words, so to speak. Knowing how to do
this is a Questor skill, a skill you'll be glad you've learned.
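The difference between co-occurrence and correlation can be shown
with a minimal Python sketch. It assumes three invented pages held
in memory; the page names, their text, and the helper functions are
hypothetical and do not reflect how any real search engine works.
"Correlated" is approximated here, very crudely, as "all searchwords
fall within one sentence."

```python
import re

# Toy pages invented for illustration; they are not real search results.
PAGES = {
    "page_a": "Charles Martel was the grandfather of Charlemagne. "
              "He ruled the Franks.",
    "page_b": "Charles Darwin sailed on the Beagle. "
              "Martel is a brand of cognac. "
              "Charlemagne was crowned in 800.",
    "page_c": "Charlemagne was crowned emperor in Rome.",
}

def words(text):
    """Lowercase a text and return its set of textwords."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def and_search(pages, query):
    """Return names of pages on which every searchword co-occurs."""
    terms = query.lower().split()
    return [name for name, text in pages.items()
            if all(t in words(text) for t in terms)]

def same_sentence(text, query):
    """True if all searchwords fall inside a single sentence."""
    terms = query.lower().split()
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return any(all(t in words(s) for t in terms) for s in sentences)

QUERY = "charles martel charlemagne"
hits = and_search(PAGES, QUERY)
correlated = [name for name in hits if same_sentence(PAGES[name], QUERY)]
print(hits)        # pages where the three words merely co-occur
print(correlated)  # pages where they share one sentence
```

In this toy collection, page_b satisfies the AND search even though
its three words have nothing to do with one another; only page_a
uses them together in one sentence.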
Sites, Pages, Indexes
"Let me ask you a question: what do we retrieve with a Web search?"
A confident student piped up, "Websites."
"No. We retrieve webpages. Believe it or not, websites don't
exist in the physical world. They're a mental construct, a way of
looking at a single webpage or a collected group of webpages. The
webpage does exist, physically, as a single file. It's the basic
retrieval-unit of Web information. Webpages, not websites, are what
are stored on servers. Even the so-called 'homepage' of a website---
the main page which may have no discrete filename, just the site's
domain name---is a single webpage file chosen to visually present
the site when we first access it by its domain name.
"How many webpages are there?... Nobody knows for certain. It's been
claimed that there are currently about 36,000,000 registered websites
with uncounted billions of pages. Some webpages don't have any text---
not even captions, just graphics. Those pages are indexable only when
their authors provide HTML 'Title,' 'Keyword,' or 'Description'
metatags, which are part of the page but which are not normally
displayed by Web browsers. We'll discuss metatags and image indexing
and retrieval, later.
"An index is a representation of the webpages it indexes. It's a
very 'deep' representation of webpages because it often contains all
the words on the pages. Yes, I said all the words, even those termed
'nonsignificant,' such as 'the,' 'of,' and 'in.'" He turned to the
whiteboard. "If you doubt this, search a general engine for:
<"the war of the pacific">
"If that search engine doesn't index the 'nonsignificant' word, 'of'
(or 'the'), it can only search for:
<war pacific>
"Then, your search results will mostly be about 'the war in
the Pacific'---World War II---and the few items about the 19th-century
war between Chile, Peru, and Bolivia will be scattered among the many
result-items. It's because the better general search engines now index
these little words that we can search for exact phrases and sentences
and retrieve them precisely.
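The effect of indexing the little words can be sketched in Python.
This is a minimal illustration under stated assumptions: the two
one-line pages, the STOPWORDS list, and both functions are invented,
and an engine that ignores 'nonsignificant' words is modeled simply
by deleting them from both the page text and the phrase.

```python
import re

STOPWORDS = {"the", "of", "in"}  # assumed list of 'nonsignificant' words

# Invented one-line pages for illustration.
PAGES = {
    "chile_1879": "The War of the Pacific set Chile against Peru and Bolivia.",
    "wwii": "The war in the Pacific ended in 1945.",
}

def tokenize(text, keep_stopwords):
    """Split text into lowercase words, optionally dropping stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if keep_stopwords:
        return tokens
    return [t for t in tokens if t not in STOPWORDS]

def phrase_search(pages, phrase, keep_stopwords):
    """Return pages whose words contain the phrase's words contiguously."""
    wanted = tokenize(phrase, keep_stopwords)
    n = len(wanted)
    hits = []
    for name, text in pages.items():
        tokens = tokenize(text, keep_stopwords)
        if any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1)):
            hits.append(name)
    return hits

full = phrase_search(PAGES, "the war of the pacific", keep_stopwords=True)
stripped = phrase_search(PAGES, "the war of the pacific", keep_stopwords=False)
print(full)      # every word indexed: only the exact phrase matches
print(stripped)  # little words dropped: both wars look alike
```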
"A webpage index represents webpages much 'deeper' than a few subject
headings represent the book they catalog. But a cataloger's subject
headings are a form of concept indexing. They're the cataloger's
conception of what a book or other textual work is about. A textword
index is just a 'deconstructed' collection of the words on a webpage,
copied from the page by a computer program called a 'crawler.'
"Indexes compiled from textwords index webpages much more deeply,
but in a much dumber way than concept indexing. Textwords supply
the raw material of retrieval; we have to supply the intelligence.
Online, we only get help from concept indexing when a webpage author
chooses to get involved in the indexing process by putting meaningful
words and phrases from his mind into his page's metatag fields.
"Okay... From our example of a highly-correlated co-occurrence
of searchwords which retrieved highly-successful search results,
we'll proceed down the garden path toward examples of searchword
co-occurrence which plunge us into morasses of chicken feed and
garbage. This is the greater reality of textword information retrieval."
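The textword index described in this lesson can be pictured with a
minimal Python sketch. It assumes a tiny in-memory collection; the
page names, their text, and the function build_index are invented
for illustration and are not a description of how any real crawler
or search engine stores its index.

```python
import re
from collections import defaultdict

# Invented pages standing in for files a crawler might have copied.
PAGES = {
    "martel.html": "Charles Martel was the grandfather of Charlemagne.",
    "charlemagne.html": "Charlemagne was crowned emperor in the year 800.",
}

def build_index(pages):
    """Map every textword, even 'the' and 'of', to the pages that use it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index

index = build_index(PAGES)
print(sorted(index["charlemagne"]))  # pages using this textword
print(sorted(index["martel"]))
```

The index knows which pages use a word, but nothing about what the
pages mean; that is the sense in which it is deep but dumb.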
Kevin and Marylou headed for the cafeteria. "I knew all that stuff.
I just didn't know it in the terms he used," declared Kevin. "So did you."
"I knew not what I knew," agreed Marylou. "Search principles do
seem more obvious when someone presents them in an organized way
and in elegant terminology such as 'chicken feed' and 'garbage.'"
"Yeah, but infotrieval is easier than we're supposed to think it is.
I've been doing it since I was a freshman."
"You are a freshman, here. And wait 'til the teacher starts giving us
tough retrieval problems to solve. We'll both feel like freshpersons."
"Hey, you aren't a wild-eyed feminist, are you?"
"Only when I have to be."
Phrases and Sentences
"Okay. We've learned how the AND operator can deliver the ball
right down the alley to the kingpin. In our Charlemagne example,
we searched for naturally-correlated textwords, and they appeared
smack-dab on the first page of our search results.
"The AND operator can be used for the retrieval of natural-
language phrases and even whole sentences, but successfully only
when the words in those search-phrases and search-sentences are
identical with those on the webpages we seek. Despite what you may
have heard about 'fuzzy logic,' we're usually dependent upon our own
searchword choices and upon elementary search logic which predates
the computer by many years.
"There's only one automatic adjustment for 'synonymy' in textword
searching. Synonymy is the problem given us when a concept can be
expressed equally by two or more ways of writing it. A convenient
adjustment for synonymy occurs when the search engine searches our
words as both 'whole words' and as 'fragments' of other words.
This handy fragment-inclusion search feature allows us to retrieve
most nouns in their plural forms by simply searching for their
singular forms. But 'fragment inclusion' can be counter-productive
if it picks up nonrelevant words when our smaller searchwords,
searched also as fragments by the search engine, happen to occur
within those nonrelevant words."
The teacher paused to allow that to sink into the open minds before
him. He guessed they were familiar with such retrieval concepts as
fragment inclusion, but he feared his "elegant" language might be
overwhelming them. The public high-schools from which they had
recently graduated did not excel in vocabulary building.
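Fragment inclusion, as the teacher describes it, can be sketched in
a few lines of Python. The word list and both function names are
invented for illustration; a real search engine's matching rules may
differ.

```python
# Invented sample of textwords found on some pages.
PAGE_WORDS = ["cat", "cats", "catalog", "concatenate", "dog"]

def whole_word_hits(textwords, searchword):
    """Match only textwords identical to the searchword."""
    return [w for w in textwords if w == searchword]

def fragment_hits(textwords, searchword):
    """Match any textword that contains the searchword as a fragment."""
    return [w for w in textwords if searchword in w]

whole = whole_word_hits(PAGE_WORDS, "cat")
fragments = fragment_hits(PAGE_WORDS, "cat")
print(whole)      # the singular form only
print(fragments)  # the plural too, plus nonrelevant words
```

Searching the fragment picks up the plural 'cats' for free, but it
also drags in 'catalog' and 'concatenate,' which is the
counter-productive side of the feature.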
"There are more kinds of synonymy and 'near-synonymy' than are
found in singular/plural variations or in word-stem variations like
'history' and 'historical.' A search for a scientific term such as
'columbium,' retrieves only those few webpages which use that older
name for 'niobium.' If we search for 'Saint' abbreviated as 'St.'
we won't retrieve pages which spell out Saint. We must be precise
with our searchword choices, and we should try word 'variants'
if at first we don't retrieve anything at all.
"I dislike to keep saying this, but later... we'll discuss the handy
process of 'stemming,' or 'wildcarding,' which some search engines
offer us to include some near-synonyms---words like 'history' and
'historical,' which mean nearly the same thing. And we'll study the
OR operator, which allows us to methodically include synonyms and
near-synonyms in our searches.
"There's a troublesome truth about ANDing: generally, the fewer
words we use, the more chicken feed and garbage we retrieve. And
the more words we use, the less likely we are to retrieve anything.
"That sounds like we have everything working against us. But it's
a principle which combines two realities. Somewhere between using
too few and using too many searchwords is where we achieve our
best retrieval. There are two exceptions to this principle.
"One: if we're searching something unique, a rare word or name,
we can use that word with great retrieval effectiveness, and we
won't be overcome with garbage.... Two: if we're searching a small
Web database, such as the past two weeks of recent news which is
archived on many news websites, then a single, fairly-uncommon word
or name usually won't return much garbage because the database is
so small that false correlations don't occur as often as they do
in large databases."
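The trade-off between too few and too many searchwords follows from
how ANDing works: each added word can only narrow the result set.
Here is a minimal Python sketch, assuming a hypothetical index in
which each searchword maps to a set of made-up page numbers.

```python
from functools import reduce

# Hypothetical index: searchword -> set of invented page numbers.
INDEX = {
    "orange": {1, 2, 3, 4, 5, 6},
    "juice": {2, 3, 4, 5, 7},
    "morning": {3, 4, 8},
    "lipid": {4},
    "cholesterol": {9},
}

def and_query(index, terms):
    """Return the pages that contain every one of the searchwords."""
    postings = [index.get(t, set()) for t in terms]
    if not postings:
        return set()
    return reduce(set.intersection, postings)

query = ["orange", "juice", "morning", "lipid", "cholesterol"]
# Result counts as the query grows from one word to all five.
counts = [len(and_query(INDEX, query[:n])) for n in range(1, len(query) + 1)]
print(counts)
```

The counts never rise as words are added. In this invented index,
the fifth word removes the last remaining page, so the longest query
retrieves nothing at all.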
Orange Juice First Thing in the Morning
"Here's an example of searching a subject by using a descriptive
natural language phrase, not just a couple of meaningful words:
Drinking a glass of orange juice first thing in the morning
may not be a good idea for older people. Why? The Web can
tell us, but not so quickly as in our Charlemagne example.
To find out, we search for... what?"
"'Orange juice first thing in the morning,'" chanted a student.
"Right. Don't be afraid to search a lengthy natural-language phrase.
Remember that we're searching Web text, indirectly. Phrases and
sentences are the stuff of natural-language text. But it's better
for us to quote our phrase or sentence than to rely on a simple
ANDing of its words by the search engine. By quoting, we request
that our words be 'juxtaposed' or 'contiguous' where they occur on
webpages." The teacher wrote a search example on the whiteboard:
<"orange juice first thing in the morning">
"Search that now and you'll retrieve about twenty webpages which might
discuss a little-known physiological phenomenon. Some weeks ago, I did
just that, and I retrieved a single item which had the answer to my
question. But yesterday, I repeated my clever search to prepare for
today's lesson, and the item that I retrieved before is now missing,
leaving me only chicken feed!" He groaned.
"That's one of the frustrations of Web searching. One day, a relevant
item is retrieved in a good, precise search---and the next month,
it's missing from the same good search. But I've decided to use the
OJ example anyway. I call this technique of searching with a long,
natural-language phrase, 'searching long.' Using that strategy, we'll
retrieve only twenty-or-so chicken feed items about drinking orange
juice first thing in the morning---all technically relevant to our
searchwords, but not relevant to our search for unhealthfulness.
"We've searched the same way some Web authors wrote their text---I
said 'some,' not 'all'---but none of the items we retrieved actually
gave us the answer I previously found. Sometimes, there are multiple
answers to a query.... So what are we left with if searching long fails?"
"Searching short?" a student said, hesitantly.
"Searching shorter. That's our only alternative in an AND search:
to cut back on our searchwords. We'll purge our long, precise phrase
of all its words except the three most-meaningful ones." He wrote:
<"orange juice" morning>
"I've quoted orange juice as a phrase to reduce false co-occurrences
of its two words. Even so, we still retrieve 92,900 items! Amazing!
Can you imagine any other database which would retrieve so much darn
stuff for those three words?!... Try this search now and see if
there's any gold in the first few pages of results."
The students searched, then examined the results while the teacher
walked about, observing them. The silence of cogitative labor was
broken by the clacking sounds of keyboards being furiously used.
"I found one!" exclaimed an eager searcher. "But it wasn't on the
first page of results. It says that the fruit sugar in orange juice
acts to 'elevate blood lipid levels'---whatever that means."
"Some might claim that a college graduate would know what it means.
But I doubt that many of them would. Lipids are fat compounds, and
elevated blood lipids can be a factor in heart attacks for those
vulnerable to them.... It helps to know that before you search.
"There's a principle at work here, and it's the first principle of
subject searching on the Web. You know it from personal experience:
the Web is such a vast database that almost any few common words
we AND together will return a flood of information, most of it
not relevant to our search intent. How can this be?" He gestured
like a sawdust evangelist.
There was only silence, as the class awaited his explanation for
this seemingly-unknowable principle.
"It's not carved into granite anywhere, but it's the big reality of
textword retrieval that the larger the database searched, the more
likely co-occurrences of searchwords will return webpages with false
correlations retrieved by those co-occurrences. This reality occurs
in offline databases, as well. Textual databases are very different
from structured databases with their discrete data fields, defined
data types, and structured query language. In natural-language text,
subject 'data' are jumbled together. We have to retrieve from that
jumble by what amounts to guesswork searching.
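One rough way to see why database size matters is a back-of-envelope
calculation in Python. This is a sketch under loud assumptions, none
of which come from the lesson: every word is taken to appear on a
fixed fraction of pages, words are taken to occur independently of
one another, and the fractions and page counts are invented.

```python
import math

def expected_chance_hits(n_pages, word_fractions):
    """Expected number of pages holding every word by chance alone.

    Assumes each word appears on a fixed fraction of pages and that
    words occur independently of one another (a simplification).
    """
    return n_pages * math.prod(word_fractions)

# Hypothetical: each of three searchwords appears on 1% of all pages.
fractions = [0.01, 0.01, 0.01]
small = expected_chance_hits(10_000, fractions)
large = expected_chance_hits(1_000_000_000, fractions)
print(small, large)
```

Under these assumptions the number of accidental co-occurrences
grows in step with the number of pages, so the same three words that
rarely collide in a small archive collide constantly in a huge one.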
"The key phrase here is 'natural language.' On the Internet, we
search for subjects which are expressed in the language of English
prose, even where the text is formatted with its words or numbers
in tables instead of in paragraphs. Tabular data in textual
databases aren't really divided-up and put into discrete 'fields'
by data type, although they may be displayed that way. A webpage
is a single field for everything on it, unless it's divided into
two or more separate display 'frames,' each with its own URL.
"There are, of course, a page's hidden-but-independently-searchable,
metatag fields. In these, the webpage author can catalog his page
with 'metadata'---that is, bibliographic data such as title, date
of preparation or modification, and some descriptive words. Later,
we'll learn how to directly search these metatags, for what they're
worth. Even when these fields are filled, however, they may be less
useful than the visible text. Webpage authors are good HTMLers but
poor catalogers. The same is true of scientists.
"Okay. Access my homework webpage to see your homework assignment:
A famous scientist once said, 'The universe is not only queerer
than we know, it's queerer than we can know.' Use either
the Web or the Usenet to discover who originally said that.
"Although this sentence is occasionally quoted by astronomers and
others today, they usually change the word 'queerer' to something
more appropriate to today's linguistic reality. I've also seen the
quote attributed to the wrong scientist. If you find such a misquote,
let us know about it.
"I'm giving you these hints to demonstrate a troublesome problem with
Internet searching: sometimes we may unknowingly use the wrong words
to express a desired subject. Wrong words---poor retrieval, or no
retrieval. Yet the Internet is so enormous that variations of a
memorable phrase or sentence may get put up there, and some of these
may provide us with clues for finding the genuine stuff.... Good luck."
"Thanks a lot," hissed Kevin, sarcastically.
"Don't you think we need luck to retrieve from the Internet?"
asked Marylou, tongue-in-cheek.
"Luck is for gamblers. The Internet isn't Las Vegas. It's a queerer
place than we can imagine."
"True. But chance on the Web does seem to favor the house."
THE END OF PART TWO
Next: Part Three, "The Nearness of You"
© 2002 by Frederick Rustam. Frederick Rustam is a retired civil
servant. He formerly indexed technical reports for the Department of
Defense. He writes science fiction for Web ezines as a hobby. He
studies and enjoys the Internet as a hobby.