CHAPTER 2

How the Search Engines Work

Search engines are among the most important contributors to the internet's
success. This chapter explains how they work and describes the different types
of search engine.

Crawler-Based Search Engines

Crawler-based search engines, such as Google, create their listings automatically. They "crawl"
or "spider" the web, then people search through what they have found. If
you change your web pages, crawler-based search engines eventually find these
changes, and that can affect how you are listed. Page titles, body copy and other
elements all play a role.

Human-Powered Directories

A human-powered directory, such as the Open Directory, depends on humans for its
listings. You submit a short description to the directory for your entire site,
or editors write one for sites they review. A search looks for matches only in
the descriptions submitted.

Changing your web pages has no effect on your listing. Things that are useful for improving
a listing with a search engine have nothing to do with improving a listing in
a directory. The only exception is that a good site, with good content, might
be more likely to get reviewed for free than a poor site.

"Hybrid Search Engines" Or Mixed Results

In the web's early days, a search engine presented either crawler-based
results or human-powered listings. Today, it is extremely common for both types
of results to be presented. Usually, a hybrid search engine will favour one type
of listing over another. For example, MSN Search is more likely to present human-powered
listings from LookSmart. However, it does also present crawler-based results (as
provided by Inktomi), especially for more obscure queries.

The Parts of a Crawler-Based Search Engine

Crawler-based search engines have three major elements. First is the spider, also called the
crawler. The spider visits a web page, reads it, and then follows links to other
pages within the site. This is what it means when someone refers to a site being
"spidered" or "crawled." The spider returns to the site on a regular basis, such
as every month or two, to look for changes.

Everything the spider finds goes into the second part of the search engine, the index. The
index, sometimes called the catalogue, is like a giant book containing a copy
of every web page that the spider finds. If a web page changes, then this book
is updated with new information.

Sometimes it can take a while for new pages or changes that the spider finds to be added
to the index. Thus, a web page may have been "spidered" but not yet "indexed."
Until it is indexed -- added to the index -- it is not available to those searching
with the search engine.

Search engine software is the third part of a search engine. This is the program that
sifts through the millions of pages recorded in the index to find matches to a
search and rank them in order of what it believes is most relevant.

Major Search Engines: The Same, But Different

All crawler-based search engines have the basic parts described above, but there are
differences in how these parts are tuned. That is why the same search on different
search engines often produces different results. Information on this page has
been drawn from the help pages of each search engine, along with knowledge gained
from articles, reviews, books, independent research, tips from others and additional
information received directly from the various search engines.

Now let's look more closely at how crawler-based search engines rank the listings
that they gather.

Search for anything using your favourite crawler-based search engine. Nearly instantly,
the search engine will sort through the millions of pages it knows about and present
you with ones that match your topic. The matches will even be ranked, so that
the most relevant ones come first.

Of course, the search engines don't always get it right. Non-relevant pages make
it through, and sometimes it may take a little more digging to find what you are
looking for. But, by and large, search engines do an amazing job.

Imagine walking up to a librarian and saying, 'travel.' They're going to look at you with
a blank face.

OK -- a librarian's not really going to stare at you with a vacant expression. Instead,
they're going to ask you questions to better understand what you are looking for.

Unfortunately, search engines don't have the ability to ask a few questions to focus your search,
as a librarian can. They also can't rely on judgment and past experience to rank
web pages, in the way humans can.

So, how do crawler-based search engines go about determining relevancy, when confronted
with hundreds of millions of web pages to sort through? They follow a set of rules,
known as an algorithm. Exactly how a particular search engine's algorithm works
is a closely kept trade secret. However, all major search engines follow the general
rules described in chapter 4.
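
To make the three parts of a crawler-based search engine more concrete, here is
a minimal sketch in Python. It is a toy, not how any real search engine works:
the pages, the "site/..." addresses and the function names are all invented for
illustration, and the ranking rule (count how often the search words appear,
with title words counted twice) is only an assumption standing in for the
secret algorithms mentioned above.

```python
import re
from collections import Counter, deque

# Hypothetical in-memory "web": address -> (title, body copy, outgoing links).
# These pages are invented for illustration only.
PAGES = {
    "site/home": ("Travel guides",
                  "Travel guides for budget travel in Europe",
                  ["site/paris", "site/rome"]),
    "site/paris": ("Paris travel tips",
                   "Cheap hotels and museums in Paris",
                   ["site/home"]),
    "site/rome": ("Rome on a budget",
                  "Budget food and travel passes in Rome",
                  ["site/home", "site/paris"]),
    "site/orphan": ("Unlinked page",
                    "Travel page that nothing links to",
                    []),
}


def tokenize(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def spider(start_url):
    """Part 1, the spider: visit a page, read it, then follow its links."""
    seen, queue, fetched = {start_url}, deque([start_url]), {}
    while queue:
        url = queue.popleft()
        title, body, links = PAGES[url]
        fetched[url] = (title, body)
        for link in links:
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched


def build_index(fetched):
    """Part 2, the index: for each word, record its count on each page."""
    index = {}
    for url, (title, body) in fetched.items():
        # Assumption: title words count twice, so titles outweigh body copy.
        counts = Counter(tokenize(title) * 2 + tokenize(body))
        for word, n in counts.items():
            index.setdefault(word, {})[url] = n
    return index


def search(index, query):
    """Part 3, the search software: match every query word, then rank."""
    words = tokenize(query)
    if not words:
        return []
    postings = [index.get(word, {}) for word in words]
    matches = set(postings[0]).intersection(*postings[1:])
    scores = {url: sum(p[url] for p in postings) for url in matches}
    # Highest score first; ties are broken alphabetically by address.
    return sorted(scores, key=lambda url: (-scores[url], url))


index = build_index(spider("site/home"))
results = search(index, "travel")
print(results)
```

In this sketch, a search for "travel" lists site/home first, then site/paris,
then site/rome, because that is the order of their word counts. The page
site/orphan never appears, even though it mentions travel: no other page links
to it, so this spider never finds it and it is never added to the index.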