Search engines consist of five discrete software components:
- spider : a robotic, browser-like program that downloads webpages.
- crawler : a wandering spider that automatically follows the links
found on pages.
- indexer : a blender-like program that dissects the webpages
downloaded by spiders.
- the database : a warehouse of the pages downloaded and processed.
- search engine results engine : digs search results out of the
database.
Spider:
A spider is a robotic program that downloads webpages. It works just as
your browser does when you connect to a web site and download a page;
the spider just doesn't have any visual components. You can see what a
spider sees by viewing any webpage and then selecting "view source" in
your browser.
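
As a rough illustration of the download step, here is a minimal sketch
in Python using only the standard library. The function name
fetch_page and the User-Agent string are made up for this example; a
real spider would also need error handling, politeness delays, and
robots.txt checks, none of which are shown.

```python
from urllib.request import Request, urlopen


def fetch_page(url, timeout=10.0):
    """Download one page and return its raw source as text."""
    # "example-spider/0.1" is a hypothetical name chosen for this sketch.
    request = Request(url, headers={"User-Agent": "example-spider/0.1"})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset the server declares, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_page on a page address returns the same text that "view
source" shows: markup only, with nothing rendered.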
Crawler:
As a spider downloads pages, it can strip each page apart and look for
"links". It is the crawler's job to then decide where the spider should
go next, based on those links or on a preprogrammed list of URLs.
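
The link-following idea can be sketched as below. This is an
illustration, not how any particular search engine does it: the names
LinkExtractor, extract_links, and crawl are invented here, the
breadth-first visiting order is an assumption, and the fetch argument
stands in for whatever spider function downloads a page.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs for the links found in html."""
    parser = LinkExtractor()
    parser.feed(html)
    # Resolve relative links against the page URL and drop #fragments.
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(seed_urls, fetch, max_pages=10):
    """Visit pages breadth-first, starting from a preprogrammed list."""
    frontier = deque(seed_urls)   # URLs waiting to be downloaded
    seen = set(seed_urls)         # URLs already queued, to avoid loops
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)         # hand the URL to the spider
        pages[url] = html
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

The seen set is what keeps the crawler from circling forever between
two pages that link to each other.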
Indexer:
An indexer rips a page apart into its various components and analyzes
them. Elements such as titles, headings, links, text, structural
constructs, bold, italic, and other style portions of a page are pulled
out and analyzed.
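
A small sketch of that dissection, again using Python's standard HTML
parser. The class name PageDissector, the dissect helper, and the
choice of buckets (title, headings, emphasis, links, text) are
assumptions made for this example; a production indexer would track
far more than this.

```python
from html.parser import HTMLParser


class PageDissector(HTMLParser):
    """Sort a page's text into title, headings, emphasized text, and body."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
    EMPHASIS = {"b", "strong", "i", "em"}
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = {"title": [], "headings": [], "emphasis": [],
                      "links": [], "text": []}
        self._open = []  # stack of tags that are currently open

    def handle_starttag(self, tag, attrs):
        self._open.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.parts["links"].append(href)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self._open:
            while self._open.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or self.SKIP.intersection(self._open):
            return
        if "title" in self._open:
            self.parts["title"].append(text)
            return
        self.parts["text"].append(text)
        if self.HEADINGS.intersection(self._open):
            self.parts["headings"].append(text)
        if self.EMPHASIS.intersection(self._open):
            self.parts["emphasis"].append(text)


def dissect(html):
    """Return a dict of the page's components."""
    parser = PageDissector()
    parser.feed(html)
    parser.close()
    return parser.parts
```

Text inside a heading or bold tag lands in both the general text bucket
and its style bucket, so later stages can weigh it separately.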
Database:
The database is the storage medium for all the data a search engine
downloads and analyzes. This can require huge amounts of storage space.
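
To make the "warehouse" idea concrete, here is a toy page store built
on SQLite from Python's standard library. The table layout and the
function names open_store, save_page, and load_page are invented for
illustration; real search engines use storage systems built for far
larger volumes.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a page store; the default lives only in memory."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, title TEXT, body TEXT)"
    )
    return db


def save_page(db, url, title, body):
    # INSERT OR REPLACE keeps one row per URL when a page is re-downloaded.
    db.execute(
        "INSERT OR REPLACE INTO pages (url, title, body) VALUES (?, ?, ?)",
        (url, title, body),
    )
    db.commit()


def load_page(db, url):
    """Return (title, body) for a stored URL, or None if it is unknown."""
    return db.execute(
        "SELECT title, body FROM pages WHERE url = ?", (url,)
    ).fetchone()
```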
Search Engine Results Engine:
Ah, the heart of the beast. It is the results engine's job to decide
which pages match a user's search. This is the portion of a search
engine you interact with when you perform a search. It is also the one
part we are concerned with here.
When a user types in a keyword and does a search, the search engine
decides what to return as results according to varying criteria. The
means by which it decides is called an algorithm. You may hear search
engine optimization (SEO) professionals discuss "algos" from time to
time, and this is what they are referring to.
Although search engines have changed a great deal, most still match
results to searches using criteria similar to the following:
- Title: Is the keyword present in the title?
- Domain/URL: Is the keyword present in the domain name or URL?
- Style (bold, italic, large headings): Is there some place on the
page where the keyword is used in bold, italic, or an Hx-type heading?
- Density: How many times does the keyword appear on the page? The
number of keyword occurrences in relation to the total page text is
called keyword density.
- Meta information: Although deprecated, some search engines still
read meta keywords and meta descriptions.
- Outbound links: Who does the page link to, and what keywords are in
the link text?
- Inbound links: Who else on the net has linked to this site? What
are the words of the link? These are called "off the page" criteria
because their value is not immediately controllable by the page
author.
- Insite links: What other pages on the user's site does the page
point to?
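
A few of the on-page criteria above can be sketched as a toy scoring
function. Everything specific here is an assumption for illustration:
the weights are made up, the page is represented as a plain dict with
hypothetical keys (title, url, emphasized, text), and link-based
criteria are left out entirely. No real search engine's algorithm is
being reproduced.

```python
import re


def words(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_density(keyword, text):
    """Occurrences of keyword divided by the total number of words."""
    tokens = words(text)
    if not tokens:
        return 0.0
    return tokens.count(keyword.lower()) / len(tokens)


# Made-up weights; real engines tune and guard theirs closely.
WEIGHTS = {"title": 3.0, "url": 2.0, "style": 1.5, "density": 10.0}


def score_page(keyword, page):
    """Add up the on-page signals for one keyword."""
    kw = keyword.lower()
    score = 0.0
    if kw in words(page["title"]):
        score += WEIGHTS["title"]
    if kw in page["url"].lower():
        score += WEIGHTS["url"]
    if kw in words(page["emphasized"]):  # bold, italic, heading text
        score += WEIGHTS["style"]
    score += WEIGHTS["density"] * keyword_density(kw, page["text"])
    return score


def rank(keyword, pages):
    """Return pages ordered from best match to worst."""
    return sorted(pages, key=lambda p: score_page(keyword, p), reverse=True)
```

A page that uses the keyword in its title, URL, and bold text outranks
one that only mentions it in passing, which is the kind of judgement
the criteria describe.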
As you can see, a search engine has to make many judgement calls based
upon the entire page downloaded.
That's the abbreviated version of how a search engine works.