The Questors

by Frederick Rustam

Part Three, THE NEARNESS OF YOU


_____________________________________________________________________

Questor Institute is a new, experimental technical school where

bright-but-poor high-school graduates on full scholarships spend

two years seeking to become wizards of Internet sorcery by studying

the science and philosophy of information retrieval from textual

databases such as the World Wide Web. In Part Two, "The Delight of

Being Together," Kevin and Marylou learned about the AND logical

operator and about co-occurrence, correlation, and context. They

also studied the usefulness of natural-language phrases for the

retrieval of more than just interesting quotations.

_____________________________________________________________________

Subject Qualification

"We've devoted several days to the AND operator because it's the most

important of the Boolean search operators. There are two reasons for

its importance. It's the one we most-often use for retrieval because

it allows us to put two or more words together to search subjects more

complex and more specific than we can express with one word. And it's

often the only operator we'll find at many websites' internal search

engines. Be thankful that some larger, more-versatile search engines

offer us three other logical operators.

"After we've constructed a complex search-subject by using some

natural-language words, we might have to use one or more additional

words to 'qualify' our complex subject by what amounts to an 'aspect'

of it, that is, a special way of looking at it---by place, time,

or bibliographic format, for example.

"When we seek a complex subject such as 'the history of stream

pollution in West Virginia by coal mining,' for example, our basic

subjects can be viewed as 'stream pollution' and 'coal mining.'

'West Virginia' and 'history' are the place and time aspects of

our complex subject. When we seek webpages which treat those two

aspects, we add the qualifying searchwords 'West Virginia' and

'history' to our search input:

"I've used 'stream pollution' without quotes because it may also

be written as 'pollution of streams.' My two unquoted words will

pick up both forms. The word 'streams' is actually an inadequate

searchword, though; I should use the names of all the kinds of

streams which might be polluted, but for now we'll simplify this

part of our subject's complexity.... So is this the way to search

for it?... Maybe.

"'West Virginia' is a bang-on qualifier, but 'history' is tricky.

It involves dates we don't know, and which may not be found on

webpages about the subject. The qualifier, 'history,' doesn't

always appear as a textword on relevant webpages, either, because

historical treatments often don't label themselves as such. An

author may write from a historical perspective, but may not title

his work, 'A History of...'---or even use the words 'history' or

'historical' anywhere in the body of his work. The nature of a

subject's treatment in a document is the kind of slippery concept

which human concept-indexers perceive in their subject analysis,

but which Web indexing crawlers can't.

"Qualification, when we can use it, is a positive way of concentrating

on the aspects of a subject which we want, and conversely, of excluding

---'disqualifying'---other aspects of it which we don't want. Subject

qualification is a more difficult retrieval process than the complex

interaction of concrete subjects, but it's an important means of

zeroing-in on a subject which may be written about in many aspects.

You must learn how to qualify subject searchwords with aspect words

---and to abandon your aspect words if they don't retrieve well.

You'll find today's homework to be a very challenging problem in

subject qualification by place."

Hot Vinegar

"Single words or short phrases must be rare or unique for us to

achieve quick and obvious retrieval success with them.... When

English mystery writer Ellis Peters wrote her Brother Cadfael

series of mystery novels, she chose 'Cadfael' as the name of her

medieval monk-detective because it's a very rare name, even in

Wales. So if we want to retrieve webpages about Brother Cadfael,

we can just search for:

"And we get only Brother Cadfael. Rarity is built into that subject,

so to speak, and we don't even have to use the word 'Brother' with

the name Cadfael to retrieve him. In fact, it's not a good idea to

search for 'Brother Cadfael' because he's been written about on the

Web often as simply 'Cadfael.' This retrieval situation illustrates

the negative face of subject qualification.

"When we search with several searchwords, and one of our words is not

a textword in a relevant document, we've 'disqualified' that document

from retrieval. In this sense, 'disqualification' is the opposite of

qualification. We use more searchwords to qualify---to specify---a

subject, but too many searchwords can 'overspecify' a search and

exclude relevant documents.
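
The effect of an extra searchword can be sketched in a few lines of Python. This is only a toy illustration, not how any particular search engine works: the three "pages" are invented, and the helper names tokenize and and_search are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase a text and split it into word tokens (textwords)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def and_search(docs, searchwords):
    """Return the ids of documents that contain every searchword as a textword."""
    wanted = {w.lower() for w in searchwords}
    return sorted(doc_id for doc_id, text in docs.items()
                  if wanted <= set(tokenize(text)))

# Hypothetical pages, invented for this illustration.
docs = {
    "page1": "Brother Cadfael is a medieval monk-detective.",
    "page2": "Cadfael solves another mystery.",
    "page3": "A brother and a sister visit Wales.",
}

one_word = and_search(docs, ["cadfael"])
two_words = and_search(docs, ["brother", "cadfael"])
print(one_word)   # every page that mentions Cadfael
print(two_words)  # page2 is 'disqualified' by the extra searchword
```

Under these assumptions, adding 'brother' drops a relevant page that happens to say only 'Cadfael,' which is the overspecification described above.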

"In one episode of the Brother Cadfael mysteries on TV, Cadfael had

to use mos teutonicus on a corpse to view its bones. Anybody

here have enough Latin to translate that phrase?... It means 'the

German practice'---of boiling a corpse in vinegar to remove its

flesh. I was curious to see if the Web had anything on this rare

subject, so I searched for:

<"mos teutonicus">

"Despite its presumed rarity, I quoted this phrase in my search

because... why?"

Marylou had the answer. "Because 'mos' might be found in words as a

fragment of them, and they might falsely co-occur with 'teutonicus.'"
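
Marylou's point can be sketched in Python. The sketch assumes a crude engine that matches searchwords as raw character strings, which is only one possible behavior; the two sample texts and the function names are invented for illustration.

```python
import re

def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def substring_and(text, words):
    """Crude match: each searchword may appear anywhere, even inside another word."""
    lowered = text.lower()
    return all(w.lower() in lowered for w in words)

def phrase_match(text, phrase):
    """Quoted-phrase match: the words must be whole, adjacent tokens in order."""
    tokens = tokenize(text)
    wanted = tokenize(phrase)
    n = len(wanted)
    return any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1))

# Hypothetical texts, invented for this illustration.
false_hit = "The ordo teutonicus built most of its castles in stone."
true_hit = "A posting asks about mos teutonicus and medieval burial practices."

fragment_result = substring_and(false_hit, ["mos", "teutonicus"])   # 'mos' hides in 'most'
phrase_on_false = phrase_match(false_hit, "mos teutonicus")
phrase_on_true = phrase_match(true_hit, "mos teutonicus")
print(fragment_result, phrase_on_false, phrase_on_true)
```

Here the fragment match accepts the first text because 'mos' occurs inside 'most,' while the quoted-phrase match accepts only the second.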

"Right. Be careful with searchwords like 'mos' which often appear

in text as fragments of other words.... I retrieved two items. They

were two postings from a Web discussion forum about mos teutonicus.

One posting inquired about it. A follow-up posting offered a short

bibliography of books on medieval burial practices in which that

subject was treated. In this manner, the Web can lead us to other

information sources, even when it can't provide anything substantive

about a subject. But as you know, the Web isn't all the Internet.

"Much Internet information can also be found in the 'news groups'

of the Usenet. Years of Usenet discussions have been archived and

textword-indexed by some of the search engines. But you'll probably

use this resource only if Web information proves insufficient for

your purposes. We'll deal specifically with the Usenet archives

as an information source, later, and we'll also search there for

mos teutonicus."

"Somebody here will do that tonight, I'll bet," whispered Kevin,

who knew how classroom suck-ups operated. "So I gotta do it, too."

"Your reasoning escapes me," replied Marylou, haughtily. But she

made a mental note to make the search, also.

The Biggest Ocean Wave Ever Measured

"We learned in our previous 'orange juice' example that searching

for a subject with a long, quoted phrase can sometimes retrieve gold,

and at the same time greatly diminish chicken feed and garbage....

Another way to achieve pinpoint retrieval is to search for an ANDed

combination of known words which don't form a natural-language phrase

or sentence but whose words collectively profile our desired subject

and act to filter-out nonrelevant items. Note---I said 'known words,'

not 'guessed words.' We have to know enough specific facts about our

subject so that we can express those facts with their own words,

so to speak. An example of this:

I read that the largest ocean wave ever seen was measured in

1933 in the Pacific Ocean. It was 112 feet high. I wondered,

'Wow! How did they measure that one and yet survive it?' So,

with faith in the Web's great ocean of information, I searched

for a combination of known words:

"Numbers are considered as words in text. They can be written-out

also---for example, 'nineteen thirty three'---but I guessed that

these particular numbers wouldn't be.... Make this search now,

and see what you retrieve."

As usual, Marylou was racing ahead. "Only a few result-items. The

first one is right on the money: a miniature webpage which succinctly

sets forth the whole story about this huge wave. It's a little gem."

"It is, indeed. It tells you how the Navy oiler, USS Ramapo,

measured the wave and why they survived it. The World Wide Web is

truly fruitful when we know some of the words that're almost certain

to be on a webpage somewhere. Note that the order, the 'syntax,' of

our known words can affect our search results in some engines. So

put them in their 'natural order' for searching, that is, the order

in which they would most-likely occur in text. Some search engines

use that order as one of their 'secret' techniques for ranking

search results."

A Streetcar Named Jette

"Now that we've had great success by using unique single words and

names, phrases, and topic-profiling combinations of descriptive

words, let's turn to a subject which seems quite-specifically named

but isn't all that easy to find. Many search-combinations contain

the fragmentary seeds of their own retrieval difficulty. Here's one:

I was watching CNN. They ran a brief promo for their service,

a montage of international images that showed how widespread

their newsgathering service is. In the last few seconds of this

filmed promo, there was a street scene which showed a streetcar

approaching the camera. Just before the promo ended, the trolley

came close enough to the camera for me to read the destination

sign on the front, above the windshield. It read, '94 | Jette.'

Right away, I wondered if I could search out this streetcar line

on the Web and discover the city where that portion of the promo

film was shot.

"This retrieval may seem like grasping at straws, but that's how

we sometimes find what we seek. Even three years ago, when I made

my search, the World Wide Web had become a database of astronomical

proportions. Did any computer scientist ever envision a distributed

database of such vast dimensions that a search for almost anything

conceivable retrieves something relevant?

"Back to the streetcar: I searched for the two known parts of its

destination sign:

<94 AND jette>

"This returned an astounding 8878 items!... It was at the sixth

item on the sixth page of search results that I found a webpage at

the website of 'Planitram,' the outfit that runs the streetcars in

Brussels, Belgium. The 94 line to the district of Jette was listed,

and some info was given about the line's 'headway'---how long we'll

have to wait for the next streetcar after we've just missed one.

Success on the sixth results page may seem like rough retrieval,

but it's better than no success or success on the twelfth page.

"Jette is a personal name, by the way. And anytime somebody named

Jette and the number 94 appear somewhere on a webpage---with 94 as

a fragment of the date '1994,' for example---we retrieve that page

in an ANDed search, even though it's not relevant to the Brussels

streetcar. There were plenty of these items in my results.

"But there's a better way of matching two pieces of data than by

simply doing an ANDed search of them. What is it?"

"Use the NEAR operator," ventured Marylou.

"Right you are. Here's another search principle: the closer the

'proximity'---nearness---of two textwords, the more likely they are

to be correlated. If we can specify that our searchwords be found

close to each other in text, we increase the chances they'll be

correlated in the retrieved webpages. The NEAR operator which does

this was used in proprietary online information systems years before

the World Wide Web was born.

"Unfortunately, many Web search engines don't offer us 'proximity

searching,' as the use of the NEAR operator is called, so we may have

to abandon our favorite engine to use another one which has it. If we

put the NEAR operator between two searchwords, we'll retrieve only

those documents in which the two words are fairly close to each other

---no more than ten words apart in one Web search engine I used to

search for '94 | Jette'.

"By the way, some commercial online databases allow us to specify the

number of words in the text separating any two NEARed searchwords, and

some textual database management programs simply allow us to retrieve

any two words occurring in the same sentence or in the same paragraph

of text.
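
A rough Python sketch of proximity matching follows. It assumes that 'ten words apart' means at most ten intervening words and that only whole tokens are matched (so '1994' does not count as '94'); real engines may define both differently. The sample texts and function names are invented.

```python
import re

def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def positions(tokens, word):
    """Return every index at which word occurs in tokens."""
    return [i for i, t in enumerate(tokens) if t == word]

def and_match(text, a, b):
    """AND: both words occur somewhere in the text."""
    tokens = tokenize(text)
    return a in tokens and b in tokens

def near_match(text, a, b, max_gap=10):
    """NEAR: some occurrence of a and of b has at most max_gap words between them."""
    tokens = tokenize(text)
    return any(abs(i - j) - 1 <= max_gap
               for i in positions(tokens, a)
               for j in positions(tokens, b))

# Hypothetical texts, invented for this illustration.
tram_page = "Tram line 94 runs to Jette every few minutes."
guestbook = ("Jette wrote this page about her garden and the many flowers "
             "she planted there over several long summers since she moved in "
             "94 days ago.")

results = {
    "tram_and": and_match(tram_page, "94", "jette"),
    "tram_near": near_match(tram_page, "94", "jette"),
    "guest_and": and_match(guestbook, "94", "jette"),
    "guest_near": near_match(guestbook, "94", "jette"),
}
print(results)
```

Both texts satisfy the AND search, but only the tram page satisfies NEAR, because twenty-one words separate 'Jette' from '94' in the guestbook text. The max_gap parameter stands in for the word-distance setting that some commercial databases expose.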

"So my revised streetcar search was:

<94 NEAR jette>

"This strategy returned only 474 items, a gross retrieval reduction

of ninety-five percent!... And the same Planitram webpage I found

in my previous AND search was now the fourth item on the first page

of results!... I looked further through these results, and I found

another page with a Planitram table which listed all their bus and

streetcar lines, including old No. 94.
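
The 'ninety-five percent' figure can be checked with a line of arithmetic, using the two result counts quoted above:

```python
and_hits = 8878   # items returned by <94 AND jette>
near_hits = 474   # items returned by <94 NEAR jette>

# Share of the AND results that the NEAR search eliminated.
reduction = (and_hits - near_hits) / and_hits * 100
print(f"{reduction:.1f}% fewer items")  # prints "94.7% fewer items"
```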

"Now's the time to remind you that the Planitram table of transit

data I retrieved was indexed by a general search engine because it

was put up on a webpage in HTML format. If their data had been in

a separate, non-HTML database---even one searchable via a webpage

gateway---their data couldn't have been indexed by a Web indexing

crawler.... A familiar example of a non-Web-indexed, non-HTMLed

database is the public library's catalog, which we can search

from a page on the library's website. But we'll never find any

of that catalog's entries directly by using a general Web engine.

"Okay. Maybe you're thinking about searching my streetcar line as

a quoted phrase. Well, I did that for comparison. My search results

differed markedly from those where I used the AND or NEAR operator!

The Web is full of nasty surprises. I searched one engine for:

<"94 jette">

"It returned sixteen items. Only one chicken-feed item was relevant,

and that page was a humongous list of European 'tramcar' types and

the lines they ran on.

"The same phrase search on another engine retrieved twenty-four items.

Five of those on the first page were Planitram webpages, but none of

these had the 94 line listed, even though these items 'dropped' on a

search for the 94!" The teacher sighed, "Woe is me."

"After a webpage is indexed, its text may be changed, but textwords

no longer there may still be in the search engine's index file, and

they'll remain there until the page is crawled again. However, you

may be able to view the originally-indexed text if the search engine

caches a 'snapshot' of the original page and offers you that page

as an alternative to viewing the current page. This is a helpful

feature. In one engine, a link to the originally-indexed page appears

at the bottom of each item-annotation as the underlined link-word,

'Cached.' Click on that link and look for your searchwords which are

missing from the currently-retrieved version of the page.

"Even a better logical operator may not much improve your retrieval

from the Web's vast universe of text if you have a tough subject to

search. Although my search for <94 NEAR jette> proved more fruitful

than <94 AND jette>, many items not relevant to my purpose were still

returned by the search engine. The principle I previously mentioned

about the relationship between textual database size and the number

of nonrelevant retrievals never fails us. And there's no database

larger than the World Wide Web."

The teacher decided to inflate his students a bit.

"Today's lesson revisits the awful reality of textword retrieval

from the Web. It's often difficult and frustrating because there're

so many possibilities for false co-occurrence among the textwords

of webpages. But the way I see it is this: some people just gotta be

skilled at Web textword retrieval, and some of those skilled people

will be Questor graduates. You. That's why you'll be in this class

for your whole two years, studying the problems of infotrieval and

learning the solutions to them, where there are any solutions.

"Before we wrap it up for today, a brief word about the relevance-

sorting of search results. With all the search engines, exactly how

this is done is mostly proprietary info. But some of the techniques

are mentioned on their Help pages. They usually count the number of

times our searchwords occur as textwords in the retrieved webpages.

And those pages which have our searchwords in their HTML metatag

'Title,' 'Keyword,' or 'Description' fields are ranked higher than

those where our words are only found in the body of the text.
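
The counting idea can be sketched in Python. Since the engines' actual formulas are proprietary, everything here is an assumption: the weight of 3 for a title hit is arbitrary, the pages are invented, and only the 'Title' field is modeled.

```python
import re

def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score(page, searchwords, title_weight=3):
    """Count searchword occurrences; each title hit counts title_weight times."""
    body = tokenize(page["body"])
    title = tokenize(page["title"])
    total = 0
    for word in searchwords:
        word = word.lower()
        total += body.count(word) + title_weight * title.count(word)
    return total

def rank(pages, searchwords):
    """Return urls of matching pages, highest score first (ties broken by url)."""
    scored = [(score(p, searchwords), p["url"]) for p in pages]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [url for s, url in ordered if s > 0]

# Hypothetical pages, invented for this illustration.
pages = [
    {"url": "guestbook.html", "title": "Guestbook",
     "body": "Jette signed on page 94. Thanks, Jette!"},
    {"url": "tram.html", "title": "Tram line 94 to Jette",
     "body": "Line 94 serves Jette."},
    {"url": "weather.html", "title": "Weather", "body": "Rain today."},
]

ranking = rank(pages, ["94", "jette"])
print(ranking)
```

With these made-up weights, the tram page outranks the guestbook even though the guestbook's body text repeats 'Jette' more often, because the tram page carries both searchwords in its title.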

"Some engines also rank the sorted items by the number of hyperlinks

to them from other webpages, on the working theory that those pages

which are most linked-to are the most relevant ones. But this ranking

is valid for an individual search only after the retrieved webpages

are first sorted by an examination of the number and position in them

of our searchwords. A webpage is usually linked-to for its main

subject, and this first has to be determined---guessed, really---by

the search engine before a page's link statistics can be used to give

it a boost upward in the results. This means that when we retrieve a

page for a minor subject on it---a subject which other webpages haven't

linked to the page for---the page's link statistics are of less value

in ranking it for us.
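
The two-stage ordering described above might look something like this in Python. It is a guess at the general shape, not any engine's real method: the logarithmic damping of link counts, the link_weight parameter, and all the numbers are assumptions made for illustration.

```python
import math

def rank_with_links(text_scores, inbound_links, link_weight=1.0):
    """Re-rank pages that already matched the searchwords by adding a link boost.

    text_scores: {url: score from counting searchwords} for matched pages only
    inbound_links: {url: number of other pages that link to it}
    """
    def combined(url):
        # log1p damps very large link counts; this damping is an assumption.
        return text_scores[url] + link_weight * math.log1p(inbound_links.get(url, 0))
    return sorted(text_scores, key=lambda url: (-combined(url), url))

# Hypothetical scores and link counts, invented for this illustration.
text_scores = {"a.html": 3, "b.html": 3, "c.html": 1}
inbound_links = {"a.html": 2, "b.html": 50, "c.html": 0, "d.html": 1000}

link_ranking = rank_with_links(text_scores, inbound_links)
print(link_ranking)
```

The heavily linked d.html never appears, because it did not match the searchwords in the first stage; links only reorder pages that were already retrieved.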

"These complex relevance-judgment techniques are, you understand,

a dumb-computer substitute for human evaluation. But they are

useful; they generally cause the more-relevant pages to bubble up

in the search results---I said 'generally.' But relevance is a

tricky, subjective attribute of text. Counting words and 'links-to'

doesn't always put what we want on the first page of search results.

"A search engine is like any other computer program: it does what we

tell it to, not what we want it to. If it were really 'intelligent,'

it would go beyond our bare statement of searchwords. It would add

some relevant terms that we didn't anticipate---and join all these

words with the correct logical operators. Someday, maybe they'll do

that.... But second guessing about searchwords can be destructive

as well as constructive, especially if it's done by 'artificial

intelligence' programming. You're here to learn how to use your minds

to retrieve, and how to use proven computer programs as tools for

doing it. Remember that the old truism, 'garbage in, garbage out,'

applies to our search strategies as well as to the documents we seek.

Don't be garbagey with your searchwords.

"Okay. Your homework assignment is to retrieve a photograph which

will allow us to judge for ourselves, but not to verify, an assertion

that's difficult to prove unless we travel to a far place and do some

historical research there. In this case, the Web will give us no

informational verification, but it will help us to draw our own

conclusions. Here's your retrieval situation:

Some say that the Paramount Pictures name and logo, which were

created by the company's founder---a man from Ogden, Utah---

were inspired by Mt. Ben Lomond, a peak in the Wasatch Range

near his hometown.

"I want you to find a photograph of Mt. Ben Lomond and see how

inspirational you find it to be for the Paramount logo. You can

search any of the Web's text-and-image or image-only databases.

Those of you with the most-inspiring photos will achieve the

most success.

"This homework assignment may seem like a simple retrieval problem,

but it isn't. There are three troublesome complications.

"First, there are four Mt. Ben Lomonds in the world; I don't want

pictures of the other three. So you'll have to qualify your basic

search-name by ANDing or NEARing it with a qualifier which 'localizes'

the putative Paramount Mt. Ben Lomond and more-or-less excludes the

others. We'll learn about the exclusion of nonrelevant textwords with

the NOT operator, later. For now, you'll use the positive process of

qualification in your attempt to emphasize Utah's Mt. Ben Lomond

in your search results.

"The second complication is that you'll find pages which mention

Mt. Ben Lomond, but few of them will illustrate it for you---

chicken feed.... The third complication is that Web image databases

retrieve a lot of really weird garbage because these images must be

indexed indirectly by crawling their filenames, their webpage captions,

or nearby text. Web images are not concept-indexed by human indexers

whose minds grasp the identity and meaning of images.
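
A small Python sketch shows why indirect image indexing retrieves odd results. It assumes an image is indexed only by the words in its filename, caption, and nearby text; the three image records and the function names are invented for illustration.

```python
import re

def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def image_textwords(img):
    """Collect the only words an image can be found by: filename, caption, nearby text."""
    return (set(tokenize(img["filename"]))
            | set(tokenize(img["caption"]))
            | set(tokenize(img["nearby"])))

def image_search(images, searchwords):
    """Return filenames of images whose indirect textwords include every searchword."""
    wanted = {w.lower() for w in searchwords}
    return [img["filename"] for img in images if wanted <= image_textwords(img)]

# Hypothetical image records, invented for this illustration.
images = [
    {"filename": "ben_lomond_utah.jpg", "caption": "Ben Lomond from Ogden",
     "nearby": ""},
    {"filename": "img0042.jpg", "caption": "",
     "nearby": "Ben rowed across the loch below Ben Lomond before flying home to Utah."},
    {"filename": "dsc0043.jpg", "caption": "Summit view at sunrise", "nearby": ""},
]

hits = image_search(images, ["ben", "lomond", "utah"])
print(hits)
```

The second image is retrieved purely because of the text near it, whatever the picture actually shows, while the third would be missed even if it were a fine photograph of the Utah peak.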

"Okay. Gentlemen and gentlewomen, start your search engines!"

In the hallway, the class joker began singing,

You take the high road,

And I'll take the low road,

And I'll get Ben Lomond before ye.

Kevin scowled. "I hope that guy craps out on the homework."

"That's not a very nice sentiment," chided Marylou. "He may find

the best picture of Mt. Ben, and you may be the one to 'crap out.'"

"No way. I'll stay up all night if I have to."

"That's the spirit."

THE END OF PART THREE

Next: Part Four, "Ordinary Citizens as Scholars"

_______________________________________________________________________

servant. He formerly indexed technical reports for the Department of

Defense. He writes science fiction for Web ezines as a hobby. He

studies and enjoys the Internet as a hobby.