The Questors

by Frederick Rustam

 

Part Two, THE DELIGHT OF BEING TOGETHER





_____________________________________________________________________

Questor Institute is a new, experimental technical school where

bright-but-poor high-school graduates on full scholarships spend

two years seeking to become wizards of Internet sorcery by studying

the science and philosophy of information retrieval from textual

databases such as the World Wide Web. In Part One, "A School for

Internet Sorcery," two students, Kevin and Marylou, have their

aptitudes tested, strike up a friendship, and attend the school's

first Assembly, where they're welcomed by the Rector in a speech

setting forth the unusual educational goals of Questor Institute.

_____________________________________________________________________

 

Co-occurrence, Correlation, Context

 

Kevin and Marylou had arrived early and were experimenting with their

high-speed workstations when the teacher arrived. He was a tweedy man

in his forties who wore a bowtie and parted his hair conspicuously

in the middle.

 

"I'll be giving you search examples from my personal experiences

in Internet searching, mostly from the Web," the teacher began.

"Use your computers to search my examples as I discuss them, if you

wish, but don't forget to pay attention to what I'm saying. Okay?

Some of the examples I discuss may seem trivial, especially by

comparison with the complex subjects that professional searchers

have to retrieve. But at this stage in your instruction, I prefer

to use curiosity-satisfying examples which are easier to understand

and which'll give us some enjoyment to pursue. You'll be searching

for hard-to-find subjects you've never heard of, soon enough.

 

"There're many search issues for us to deal with. Some will arise

as we proceed, but I'll be unable to pursue them right then, and

I'll say, 'We'll deal with that in detail, later.' If I stopped my

flow of instruction and sidetracked us into every new search issue

which reared its ugly head, I'd subvert your learning process.

 

"First, let's define some basic terms. A textword is a word

used in the text of a webpage or a Usenet posting. Textwords are

copied to create index words. A searchword is a word

we formulate in our minds to make a subject search, without knowing

for certain that it exists as a textword. Practically speaking then,

a searchword is a 'probable textword,' viewed from a searcher's

perspective. Because the Web and Usenet are such immense databases,

our searchwords will almost always be found as textwords somewhere

on the Internet.

 

"A term you'll often see on the Internet, in literature about the

Internet, and spoken by most people is 'keyword.' This term is

overused---like the much overused term, 'homepage.' As students

seeking to be Questors, you'll mostly avoid 'keyword,' except to

understand how others use it so that you can properly communicate

with them. A keyword is, broadly, a word which is a key to finding

information. It's our searchword of choice, it's the index word which

matches our searchword, and it's the textword we seek in a document.

People use 'keyword' to refer to all those things."

 

The teacher smirked in anticipation of the reactions his students

would have to his terminology.

 

"Three other terms I'll use are offered as jargon for us Questors.

'Gold' is a search-result item which is relevant to our search and

useful to our purpose.... 'Chicken feed' is a search result which

is technically relevant to our searchword but not useful for our

purposes. A mere mention of something we're seeking---that's the

usual textual form chicken feed takes.

 

"'Garbage' is a collective term for search-result items which

aren't at all relevant to our purposes, but which show up anyway.

 

"Let's begin our study of the AND operator with a nice sentiment:

'The Delight of Being Together.' This sentiment has served lovers

for countless generations of human existence. But it also serves

those of us who seek information from textual databases. When we put

together several searchwords, we hope to retrieve relevant text where

our words are found together in the same meaningful relationship they

were in our minds when we chose them for searching. The delight of

being together throughout the entire infotrieval process is not

easily experienced, though.

 

"There are three 'C's which we must understand: co-occurrence,

correlation, context. These are three fundamental realities

of retrieval using the AND operator, by far the most-often used

logical search operator. When we use several searchwords in most

search engines, our words are ANDed to each other by default---

that is, even if we don't actually type the word AND between them.

In this way, complex, more-specific search subjects are expressed

by using an increasing number of single words as building blocks,

just as natural language phrases are constructed from words.

 

"To illustrate these three realities, here's a retrieval situation

which sprang from one of my casual curiosities:

I heard a know-it-all radio talkshow host mention Charles

Martel, a medieval French leader, and he added as an aside,

'That's Charlemagne.' I thought he was wrong: Charles Martel

and Charlemagne (Charles the Great) weren't the same man.

How can I easily use the Web to prove my assumption?"

 

A student said, "Search either guy's name, and do a page-search for

the other name."

 

"Possible, but not quick enough. I could spend a lot of time checking

webpages about one man for a mention of the other. Let's construct a

logical word relationship before we search." He turned and wrote on

the easily-erasable whiteboard a search strategy:

 

<charles martel charlemagne>

 

"Use your computers now to make this exact search."

 

The students pounded on their keyboards. This is kid stuff,

thought Kevin. I know the point he's making, mused Mary Lou.

 

"When we search this way, what do we retrieve?... The results may

surprise you."

 

"Webpages with both names on them," offered a girl, who was reading

the search results as she spoke.

 

----------------------------------------------------------------------

Charles Martel - Wikipedia

... turned the tide of Islamic advance, and the unification of the

Frankish kingdom under Charles Martel, his son Pepin the Short,

and his grandson Charlemagne ...

www.wikipedia.org/wiki/Charles_Martel - 12k - Cached -

Similar pages

----------------------------------------------------------------------

 

"Right. We've used the search engine's default AND operator to

retrieve webpages which have both names on them. We should have

searched Charles Martel's entire name as a quoted phrase---set off

with quotation marks---to retrieve his forename and surname only

when they were 'juxtaposed,' next to each other in webpage text.

But I wanted you to input his forename and surname separately to

illustrate our second 'C,' correlation.... So what do we have in

the search results for our three words?"

 

Mary Lou was ahead of the pack. "Six of the ten items on the first

page of search results state in their annotations either that Charles

Martel was Charlemagne's grandfather, or that Charlemagne was Charles

Martel's grandson. We don't even have to click on the links and read

the webpages to find that out."

 

"Right. These are very good search results---and there's a reason

for it. When two words are searched with the AND operator, there

has to be a co-occurrence of them on a webpage for the page to be

returned in the search results. However, two co-occurring words

aren't necessarily correlated, that is, semantically related to

each other. Our three searchwords are highly correlated

in the page's text.... Why?"

 

Silence. The students weren't sure how to answer this question.

 

"Because six webpage authors used all three of our searchwords

as textwords in the same sentence!... Notice that in each of those

six webpage annotations, each of our three searchwords is rendered

in boldface by the search engine software to show us where they are.

Also, some other words that occurred on either side of these boldfaced

words in the webpage text have been excerpted from the page to show

the 'context' of our searchwords---the way they were used in the text

of those webpages.

 

"If each of our three searchwords had been uncorrelated with the

others, each word's 'contextual excerpt' would have been isolated and

separated from the other two excerpts by ellipses, those three dots

which represent textwords not excerpted. You can see this in the other

result-items where our three searchwords didn't occur so close to each

other in the text. In six of the annotations, our three words are found

close together in single sentences because, on those six webpages, the

page authors wrote it that way.

 

"This example shows us that simple facts can be teased from the Web

by tickling it with its own words, so to speak. Knowing how to do

this is a Questor skill, a skill you'll be glad you've learned.
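
The three 'C's can be illustrated with a minimal sketch. It assumes a toy three-page collection whose texts are invented for this example; the page names and the function names (and_search, same_sentence, context) are hypothetical and are not part of the lesson. Sharing one sentence is used here only as a rough stand-in for correlation.

```python
import re

# Hypothetical three-page "Web"; the texts are invented for illustration.
PAGES = {
    "page-a": "Charles Martel was the grandfather of Charlemagne. He led the Franks.",
    "page-b": "Charles Darwin sailed in 1831. A man named Martel runs the bakery. "
              "Charlemagne was crowned in 800.",
    "page-c": "Charlemagne was crowned emperor in 800.",
}

def words(text):
    """Return the lowercase textwords of a piece of text."""
    return re.findall(r"[a-z0-9]+", text.lower())

def and_search(pages, searchwords):
    """Co-occurrence: keep pages that contain every searchword somewhere."""
    wanted = {w.lower() for w in searchwords}
    return [name for name, text in pages.items() if wanted <= set(words(text))]

def same_sentence(text, searchwords):
    """Rough stand-in for correlation: all searchwords share one sentence."""
    wanted = {w.lower() for w in searchwords}
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return any(wanted <= set(words(s)) for s in sentences)

def context(text, searchword, width=2):
    """Contextual excerpt: the searchword plus `width` words on either side."""
    tokens = text.split()
    for i, tok in enumerate(tokens):
        if words(tok) == [searchword.lower()]:
            return " ".join(tokens[max(0, i - width): i + width + 1])
    return ""

query = ["charles", "martel", "charlemagne"]
hits = and_search(PAGES, query)                       # pages where the words co-occur
correlated = [n for n in hits if same_sentence(PAGES[n], query)]
excerpt = context(PAGES["page-a"], "martel")          # words around one searchword
```

In this toy collection, two pages satisfy the AND, but only one of them uses all three words in a single sentence. The other is the kind of false co-occurrence the teacher calls garbage.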

 

 

Sites, Pages, Indexes

 

"Let me ask you a question: what do we retrieve with a Web search

engine?"

 

A confident student piped up, "Websites."

 

"No. We retrieve webpages. Believe it or not, websites don't

exist in the physical world. They're a mental construct, a way of

looking at a single webpage or a collected group of webpages. The

webpage does exist, physically, as a single file. It's the basic

retrieval-unit of Web information. Webpages, not websites, are what

are stored on servers. Even the so-called 'homepage' of a website---

the main page which may have no discrete filename, just the site's

domain name---is a single webpage file chosen to visually present

the site when we first access it by its domain name.

 

"How many webpages are there?... Nobody knows for certain. It's been

claimed that there are currently about 36,000,000 registered websites

with uncounted billions of pages. Some webpages don't have any text---

not even captions, just graphics. Those pages are indexable only when

their authors provide HTML 'Title,' 'Keyword,' or 'Description'

metatags, which are part of the page but which are not normally

displayed by Web browsers. We'll discuss metatags and image indexing

and retrieval, later.

 

"An index is a representation of the webpages it indexes. It's a

very 'deep' representation of webpages because it often contains all

the words on the pages. Yes, I said all the words, even those termed

'nonsignificant,' such as 'the,' 'of,' and 'in.'" He turned to the

whiteboard. "If you doubt this, search a general engine for:

 

<"the war of the pacific">

 

"If that search engine doesn't index the 'nonsignificant' word, 'of'

(or 'the'), it can only search for:

 

<war pacific>

 

"Then, your search results will mostly be about 'the war in

the Pacific'---World War II---and the few items about the 19th-century

war between Chile, Peru, and Bolivia will be scattered among the many

result-items. It's because the better general search engines now index

these little words that we can search for exact phrases and sentences

and retrieve them precisely.

 

"A webpage index represents webpages much 'deeper' than a few subject

headings represent the book they catalog. But a cataloger's subject

headings are a form of concept indexing. They're the cataloger's

conception of what a book or other textual work is about. A textword

index is just a 'deconstructed' collection of the words on a webpage,

copied from the page by a computer program called a 'crawler' or

'spider.'

 

"Indexes compiled from textwords index webpages much more deeply,

but in a much dumber way than concept indexing. Textwords supply

the raw material of retrieval; we have to supply the intelligence.

Online, we only get help from concept indexing when a webpage author

chooses to get involved in the indexing process by putting meaningful

words and phrases from his mind into his page's metatag fields.

 

"Okay... From our example of a highly-correlated co-occurrence

of searchwords which retrieved highly-successful search results,

we'll proceed down the garden path toward examples of searchword

co-occurrence which plunge us into morasses of chicken feed and

garbage. This is the greater reality of textword information

retrieval."
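
The 'war of the Pacific' point can be sketched with a small positional textword index. This is only an illustration under stated assumptions: the two one-line "webpages" are invented, the stopword list is just the three words named in the lesson, and build_index and phrase_search are hypothetical names, not the workings of any real search engine.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "of", "in"}  # the 'nonsignificant' words named in the lesson

# Invented one-line "webpages" for illustration.
PAGES = {
    "chile": "Chile fought the War of the Pacific against Peru and Bolivia",
    "ww2": "Midway was a turning point of the war in the Pacific",
}

def tokenize(text):
    """Split text into lowercase textwords."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages, drop_stopwords=False):
    """Map each textword to {page: [positions]}, optionally discarding stopwords."""
    index = defaultdict(dict)
    for name, text in pages.items():
        tokens = tokenize(text)
        if drop_stopwords:
            tokens = [t for t in tokens if t not in STOPWORDS]
        for pos, tok in enumerate(tokens):
            index[tok].setdefault(name, []).append(pos)
    return index

def phrase_search(index, phrase, drop_stopwords=False):
    """Return pages where the phrase's words occur juxtaposed and in order."""
    terms = tokenize(phrase)
    if drop_stopwords:
        terms = [t for t in terms if t not in STOPWORDS]
    if not terms or any(t not in index for t in terms):
        return []
    hits = []
    for name, starts in index[terms[0]].items():
        # A hit needs every term at the expected offset from some start position.
        if any(all(p + k in index[t].get(name, []) for k, t in enumerate(terms))
               for p in starts):
            hits.append(name)
    return sorted(hits)

full = build_index(PAGES)
stripped = build_index(PAGES, drop_stopwords=True)

precise = phrase_search(full, "the war of the pacific")
blurred = phrase_search(stripped, "the war of the pacific", drop_stopwords=True)
```

With the little words indexed, the quoted phrase picks out only the page about the 19th-century war. With them discarded, the query collapses to 'war pacific' and the World War II page matches as well.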

 

 

Confidence

 

Kevin and Mary Lou headed for the cafeteria. "I knew all that stuff.

I just didn't know it in the terms he used," declared Kevin. "So did

you... right?"

 

"I knew not what I knew," agreed Mary Lou. "Search principles do

seem more obvious when someone presents them in an organized way

and in elegant terminology such as 'chicken feed' and 'garbage.'"

 

"Yeah, but infotrieval is easier than we're supposed to think it is.

I've been doing it since I was a freshman."

 

"You are a freshman, here. And wait 'til the teacher starts giving us

tough retrieval problems to solve. We'll both feel like freshpersons."

 

"Hey, you aren't a wild-eyed feminist, are you?"

 

"Only when I have to be."

 

 

Phrases and Sentences

 

"Okay. We've learned how the AND operator can deliver the ball

right down the alley to the kingpin. In our Charlemagne example,

we searched for naturally-correlated textwords, and they appeared

smack-dab on the first page of our search results.

 

"The AND operator can be used for the retrieval of

natural-language phrases and even whole sentences, but successfully only

when the words in those search-phrases and search-sentences are

identical with those on the webpages we seek. Despite what you may

have heard about 'fuzzy logic,' we're usually dependent upon our own

searchword choices and upon elementary search logic which predates

the computer by many years.

 

"There's only one automatic adjustment for 'synonymy' in textword

searching. Synonymy is the problem given us when a concept can be

expressed equally by two or more ways of writing it. A convenient

adjustment for synonymy occurs when the search engine searches our

words as both 'whole words' and as 'fragments' of other words.

This handy fragment-inclusion search feature allows us to retrieve

most nouns in their plural forms by simply searching for their

singular forms. But 'fragment inclusion' can be counter-productive

if it picks up nonrelevant words when our smaller searchwords,

searched also as fragments by the search engine, happen to occur

within those nonrelevant words."

 

The teacher paused to allow that to sink into the open minds before

him. He guessed they were familiar with such retrieval concepts as

fragment inclusion, but he feared his "elegant" language might be

overwhelming them. The public high-schools from which they had

recently graduated did not excel in vocabulary building.

 

"There are more kinds of synonymy and 'near-synonymy' than are

found in singular/plural variations or in word-stem variations like

'history' and 'historical.' A search for a scientific term such as

'columbium' retrieves only those few webpages which use that older

name for 'niobium.' If we search for 'Saint' abbreviated as 'St.',

we won't retrieve pages which spell out Saint. We must be precise

with our searchword choices, and we should try word 'variants'

if at first we don't retrieve anything at all.

 

"I hate to keep saying this, but later... we'll discuss the handy

process of 'stemming,' or 'wildcarding,' which some search engines

offer us to include some near-synonyms---words like 'history' and

'historical,' which mean nearly the same thing. And we'll study the

OR operator, which allows us to methodically include synonyms and

near-synonyms in our searches.

 

"There's a troublesome truth about ANDing: generally, the fewer

words we use, the more chicken feed and garbage we retrieve. And

the more words we use, the less likely we are to retrieve anything.

 

"That sounds like we have everything working against us. But it's

a principle which combines two realities. Somewhere between using

too few and using too many searchwords is where we achieve our

best retrieval. There are two exceptions to this principle.

 

"One: if we're searching something unique, a rare word or name,

we can use that word with great retrieval effectiveness, and we

won't be overcome with garbage.... Two: if we're searching a small

Web database, such as the past two weeks of recent news which is

archived on many news websites, then a single, fairly-uncommon word

or name usually won't return much garbage because the database is

so small that false correlations don't occur as often as they do

in large databases."
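
The fragment-inclusion feature described earlier in this section can be sketched by comparing whole-word matching with substring matching. The three snippets are invented, and whole_word_search and fragment_search are hypothetical names chosen for this illustration; real engines may implement the feature quite differently.

```python
import re

# Invented snippets for illustration.
PAGES = {
    "p1": "Our cat sleeps all day.",
    "p2": "Two cats live next door.",
    "p3": "See the spring catalog for prices.",
}

def whole_word_search(pages, searchword):
    """Match the searchword only as a complete textword."""
    pattern = re.compile(r"\b" + re.escape(searchword) + r"\b", re.IGNORECASE)
    return [name for name, text in pages.items() if pattern.search(text)]

def fragment_search(pages, searchword):
    """Fragment inclusion: match the searchword anywhere inside a textword."""
    needle = searchword.lower()
    return [name for name, text in pages.items() if needle in text.lower()]

whole = whole_word_search(PAGES, "cat")
fragment = fragment_search(PAGES, "cat")
```

Searching 'cat' as a fragment picks up the plural 'cats', which is the handy part, but it also picks up 'catalog', which is the counter-productive part.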

 

 

Orange Juice First Thing in the Morning

 

"Here's an example of searching a subject by using a descriptive

natural language phrase, not just a couple of meaningful words:

Drinking a glass of orange juice first thing in the morning

may not be a good idea for older people. Why? The Web can

tell us, but not so quickly as in our Charlemagne example.

To find out, we search for... what?"

 

"'Orange juice first thing in the morning,'" chanted a student.

 

"Right. Don't be afraid to search a lengthy natural-language phrase.

Remember that we're searching Web text, indirectly. Phrases and

sentences are the stuff of natural-language text. But it's better

for us to quote our phrase or sentence than to rely on a simple

ANDing of its words by the search engine. By quoting, we request

that our words be 'juxtaposed' or 'contiguous' where they occur on

webpages." The teacher wrote a search example on the whiteboard:

 

<"orange juice first thing in the morning">

 

"Search that now and you'll retrieve about twenty webpages which might

discuss a little-known physiological phenomenon. Some weeks ago, I did

just that, and I retrieved a single item which had the answer to my

question. But yesterday, I repeated my clever search to prepare for

today's lesson, and the item that I retrieved before is now missing,

leaving me only chicken feed!" He groaned.

 

"That's one of the frustrations of Web searching. One day, a relevant

item is retrieved in a good, precise search---and the next month,

it's missing from the same good search. But I've decided to use the

OJ example anyway. I call this technique of searching with a long,

natural-language phrase, 'searching long.' Using that strategy, we'll

retrieve only twenty-or-so chicken feed items about drinking orange

juice first thing in the morning---all technically relevant to our

searchwords, but not relevant to our search for unhealthfulness.

 

"We've searched the same way some Web authors wrote their text---I

said 'some,' not 'all'---but none of the items we retrieved actually

gave us the answer I previously found. Sometimes, there are multiple

answers to a query.... So what are we left with if searching long

doesn't produce?"

 

"Searching short?" a student said, hesitantly.

 

"Searching shorter. That's our only alternative in an AND search:

to cut back on our searchwords. We'll purge our long, precise phrase

of all its words except the three most-meaningful ones." He wrote:

 

<"orange juice" morning>

 

"I've quoted orange juice as a phrase to reduce false co-occurrences

of its two words. Even so, we still retrieve 92,900 items! Amazing!

Can you imagine any other database which would retrieve so much darn

stuff for those three words?!... Try this search now and see if

there's any gold in the first few pages of results."
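
The fall-back from 'searching long' to 'searching shorter' can be sketched as a loop over queries, longest first. Everything here is an assumption made for illustration: the three snippets are invented, the tiny query matcher only understands quoted phrases and bare words ANDed together, and is_gold is a hypothetical stand-in for the searcher's own judgment of usefulness.

```python
import re

# Invented snippets standing in for webpages.
PAGES = {
    "diet-tips": "I love orange juice first thing in the morning with toast.",
    "health-note": "Orange juice raises blood lipid levels, so skip it in the morning.",
    "paint": "We painted the fence orange one morning.",
}

def normalize(text):
    """Lowercase the text and reduce it to single-spaced textwords."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))

def matches(text, query):
    """AND together the query parts; a quoted part must occur as a phrase."""
    parts = re.findall(r'"([^"]+)"|(\S+)', query)
    padded = " " + normalize(text) + " "
    for phrase, word in parts:
        term = normalize(phrase or word)
        if " " + term + " " not in padded:
            return False
    return True

def search(pages, query):
    """Return the names of pages that satisfy the whole query."""
    return [name for name, text in pages.items() if matches(text, query)]

def search_long_then_shorter(pages, queries, is_gold):
    """Try each query in order, longest first, until one returns gold."""
    for query in queries:
        gold = [name for name in search(pages, query) if is_gold(pages[name])]
        if gold:
            return query, gold
    return None, []

LONG = '"orange juice first thing in the morning"'
SHORT = '"orange juice" morning'

used, gold = search_long_then_shorter(
    PAGES, [LONG, SHORT], is_gold=lambda text: "lipid" in text.lower()
)
```

In this toy collection the long phrase retrieves only chicken feed, so the loop falls through to the shorter query, which retrieves more pages and, among them, the useful one.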

 

The students searched, then examined the results while the teacher

walked about, observing them. The silence of cogitative labor was

broken by the clacking sounds of keyboards being furiously used.

 

"I found one!" exclaimed an eager searcher. "But it wasn't on the

first page of results. It says that the fruit sugar in orange juice

acts to 'elevate blood lipid levels'---whatever that means."

 

"Some might claim that a college graduate would know what it means.

But I doubt that many of them would. Lipids are fat compounds, and

elevated blood lipids can be a factor in heart attacks for those

vulnerable to them.... It helps to know that before you search.

 

"There's a principle at work here, and it's the first principle of

subject searching on the Web. You know it from personal experience:

the Web is such a vast database that almost any few common words

we AND together will return a flood of information, most of it

not relevant to our search intent. How can this be?" He gestured

like a sawdust evangelist.

 

There was only silence, as the class awaited his explanation for

this seemingly-unknowable principle.

 

"It's not carved into granite anywhere, but it's the big reality of

textword retrieval that the larger the database searched, the more

likely co-occurrences of searchwords will return webpages with false

correlations retrieved by those co-occurrences. This reality occurs

in offline databases, as well. Textual databases are very different

from structured databases with their discrete data fields, defined

data types, and Structured Query Language. In natural-language text,

subject 'data' are jumbled together. We have to retrieve from that

jumble by what amounts to guesswork searching.

 

"The key phrase here is 'natural language.' On the Internet, we

search for subjects which are expressed in the language of English

prose, even where the text is formatted with its words or numbers

in tables instead of in paragraphs. Tabular data in textual

databases aren't really divided-up and put into discrete 'fields'

by data type, although they may be displayed that way. A webpage

is a single field for everything on it, unless it's divided into

two or more separate display 'frames,' each with its own URL.

 

"There are, of course, a page's hidden-but-independently-searchable

metatag fields. In these, the webpage author can catalog his page

with 'metadata'---that is, bibliographic data such as title, date

of preparation or modification, and some descriptive words. Later,

we'll learn how to directly search these metatags, for what they're

worth. Even when these fields are filled, however, they may be less

useful than the visible text. Webpage authors are good HTMLers but

poor catalogers. The same is true of scientists.
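
How an indexing program might read those hidden fields can be sketched with Python's standard html.parser module. The sample page is invented, and MetaTagReader is a hypothetical helper written for this illustration. Strictly speaking, the title is an HTML element of its own rather than a meta tag, so the sketch collects it separately.

```python
from html.parser import HTMLParser

class MetaTagReader(HTMLParser):
    """Collect the <title> text and the keywords/description meta tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and name in ("keywords", "description"):
            self.fields[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text arrives as character data between <title> and </title>.
        if self._in_title:
            self.fields["title"] = self.fields.get("title", "") + data

# Invented page: graphics only, so the hidden fields are its only indexable words.
PAGE = """<html><head>
<title>Frankish Kings Gallery</title>
<meta name="keywords" content="Charles Martel, Charlemagne, Franks">
<meta name="description" content="Portraits of Frankish rulers.">
</head><body><img src="martel.jpg"></body></html>"""

reader = MetaTagReader()
reader.feed(PAGE)
fields = reader.fields
```

The resulting dictionary holds whatever the page author chose to supply, which is why the usefulness of these fields depends entirely on the author's skill as a cataloger.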

 

"Okay. Access my homework webpage to see your homework assignment:

A famous scientist once said, 'The universe is not only queerer

than we know, it's queerer than we can know.' Use either

the Web or the Usenet to discover who originally said that.

 

"Although this sentence is occasionally quoted by astronomers and

others today, they usually change the word 'queerer' to something

more appropriate to today's linguistic reality. I've also seen the

quote attributed to the wrong scientist. If you find such a misquote,

let us know about it.

 

"I'm giving you these hints to demonstrate a troublesome problem with

Internet searching: sometimes we may unknowingly use the wrong words

to express a desired subject. Wrong words---poor retrieval, or no

retrieval. Yet the Internet is so enormous that variations of a

memorable phrase or sentence may get put up there, and some of these

may provide us with clues for finding the genuine stuff.... Good luck."

 

"Thanks a lot," hissed Kevin, sarcastically.

 

"Don't you think we need luck to retrieve from the Internet?"

asked Marylou, tongue-in-cheek.

 

"Luck is for gamblers. The Internet isn't Las Vegas. It's a queerer

place than we can imagine."

 

"True. But chance on the Web does seem to favor the house."

 

 

THE END OF PART TWO

Next: Part Three, "The Nearness of You"

_______________________________________________________________________

Copyright 2002 by Frederick Rustam. Frederick Rustam is a retired civil

servant. He formerly indexed technical reports for the Department of

Defense. He writes science fiction for Web ezines as a hobby. He

studies and enjoys the Internet as a hobby.