How indexer walks through hypertext links
=========================================

When indexer tries to insert a new URL into the database, or to index an
existing one, it first checks whether this URL has a corresponding
"Server" or "Realm" command in indexer.conf. URLs without a
corresponding "Server" or "Realm" command are not indexed.
By default, URLs that are already in the database but have no Server/Realm
command are deleted from the database. This may happen, for example,
after some Server/Realm commands are removed from indexer.conf.

"Server" command
----------------

This is the main command of the indexer.conf file. It adds servers, or
parts of servers, to be indexed. The format of the Server command is:

Server [subsection] <URL> [alias]

This command also tells indexer to insert the given URL into the database
at startup.

The "Server" command has a required "URL" parameter and two optional
parameters, "subsection" and "alias". Usage of the optional alias
parameter is covered in alias.txt.

For example, the command "Server http://localhost/" allows indexing the
whole http://localhost/ server. You can also specify a path to index only
a subsection of the server: "Server http://localhost/subsection/". In both
cases indexer inserts the given URL into the database at startup.

Note that you can suppress this insertion of the URL given in a Server
command with the -q indexer command line argument. This is useful when
you have hundreds or thousands of Server commands and their URLs are
already in the database. It makes indexer start up faster.


Checking that URL matches "Server" command
------------------------------------------
indexer can check in several ways whether a URL corresponds to a
Server command. Use the optional subsection parameter to select the
checking behaviour. The subsection values are the same as the "Follow"
command arguments: page, path, site or world. If the subsection is not
specified, the current "Follow" value is used, which is "path" by default.
So the single command "Server site http://localhost/" and the combination
of "Follow site" and "Server http://localhost/" have the same effect.


  1) "path" subsection

  When indexer looks for a "Server" command corresponding to a URL, it
checks that the discovered URL starts with the URL given in the Server
command argument, with the trailing file name removed. For example, if
"Server path http://localhost/path/to/index.html" is given, all URLs
that begin with "http://localhost/path/to/" correspond to this
Server command.

  The commands

Server path http://localhost/path/to/index.html
Server path http://localhost/path/to/index
Server path http://localhost/path/to/index.cgi?q=bla
Server path http://localhost/path/to/index?q=bla

  have the same effect, except that they insert different URLs into
the database.

  2) "site" subsection

  indexer checks that the discovered URL has the same hostname as the URL
given in the Server command. For example,
"Server site http://localhost/path/to/a.html" allows indexing the whole
http://localhost/ server.

  3) "world" subsection

  If the world subsection is specified in a Server command, any URL is
considered to match this Server command. See the explanation of
"Follow world" below.

  4) "page" subsection

  This subsection matches only the single URL given in the Server argument.


  5) subsection in news:// schema

  The subsection is always treated as "site" for the news:// URL schema,
because news:// has no nested paths like ftp:// or http://.
Use "Server news://news.server.com/" to index a whole news server or,
for example, "Server news://news.server.com/udm" to index all messages
from the "udm" hierarchy.
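
The four subsection checks for http:// style URLs can be illustrated
with a small Python model. This is only a sketch of the rules as described
in this section, not indexer's actual code: the function name
matches_server is hypothetical, the news:// special case is not modelled,
and the Server URL is assumed to contain a path (at least a trailing "/").

```python
from urllib.parse import urlsplit

def matches_server(subsection: str, server_url: str, url: str) -> bool:
    """Return True if url corresponds to a Server command (simplified model)."""
    if subsection == "world":
        # Any URL is considered to match.
        return True
    if subsection == "page":
        # Only the exact URL given in the Server command matches.
        return url == server_url
    if subsection == "site":
        # The hostname must be the same as in the Server URL.
        return urlsplit(url).hostname == urlsplit(server_url).hostname
    if subsection == "path":
        # Drop the query string and the trailing file name, keep the directory.
        base = server_url.split("?", 1)[0]
        prefix = base[: base.rfind("/") + 1]
        return url.startswith(prefix)
    raise ValueError(f"unknown subsection: {subsection}")

# Both Server URLs reduce to the prefix "http://localhost/path/to/".
print(matches_server("path", "http://localhost/path/to/index.html",
                     "http://localhost/path/to/deeper/page.html"))
print(matches_server("path", "http://localhost/path/to/index.cgi?q=bla",
                     "http://localhost/elsewhere/"))
```

In this model the four "Server path" commands listed under 1) all yield
the same prefix, which is why they match the same set of URLs.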



Realm command
-------------

The Realm command is a more powerful way to describe the web area to be
indexed. The format of the Realm command is:

Realm [String|Regex] [Match|NoMatch] <URLMask> [alias]

It works almost like the "Server" command but takes a regular expression
or a string with wildcards as its argument. The Realm command has two
comparison types. String wildcards are the default comparison type. You can
use the ? and * signs in the URLMask parameter; they mean "one character"
and "any number of characters" respectively. For example, to index all
HTTP sites in the .ru domain, use this command:

Realm http://*.ru/*

The Regex comparison type takes a regular expression as its argument.
Activate it with the "Regex" keyword. For example, you can describe
everything in the .ru domain using the regex comparison type:

Realm Regex ^http://.*\.ru/

The second optional argument is the match type. Possible values are
"Match" and "NoMatch", with "Match" as the default. "Realm NoMatch" has
the reverse effect: a URL that does not match the given URLMask
corresponds to this Realm command. For example, use this command to index
everything outside the .com domain:

Realm NoMatch http://*.com/*
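
The two comparison types and the two match types can be modelled in a few
lines of Python. This is a sketch under stated assumptions, not indexer's
own matching code: the names wildcard_to_regex and matches_realm are
hypothetical, the wildcard mask is assumed to be compared against the
whole URL, and Python's regex dialect may differ in details from the one
indexer uses.

```python
import re

def wildcard_to_regex(mask: str) -> str:
    """Translate a mask with ? and * into an equivalent regular expression."""
    # Escape everything, then turn the escaped wildcards into regex syntax:
    # * means "any number of characters", ? means "one character".
    return re.escape(mask).replace(r"\*", ".*").replace(r"\?", ".")

def matches_realm(url: str, mask: str,
                  cmp_type: str = "String", match_type: str = "Match") -> bool:
    """Return True if url corresponds to a Realm command (simplified model)."""
    if cmp_type == "Regex":
        hit = re.search(mask, url) is not None
    else:
        # Assumption: a wildcard mask has to cover the whole URL.
        hit = re.fullmatch(wildcard_to_regex(mask), url) is not None
    # NoMatch reverses the result of the comparison.
    return hit if match_type == "Match" else not hit

print(matches_realm("http://www.example.ru/index.html", "http://*.ru/*"))
print(matches_realm("http://www.example.ru/", r"^http://.*\.ru/", cmp_type="Regex"))
print(matches_realm("http://www.example.org/", "http://*.com/*", match_type="NoMatch"))
```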


The optional "alias" argument allows very complicated URL rewriting,
more powerful than the other aliasing mechanisms. See alias.txt
for an explanation of "alias" argument usage. Alias works only with the
"Regex" comparison type and has no effect with the "String" type.

Realm and Follow commands
-------------------------
Because the subsection determines which part of the Server command
argument is compared with a URL, the Realm command has no similar
optional subsection parameter. It would be useless with string wildcards
and regular expressions. For the same reason the "Follow" command does
not affect the "Realm" command. Imagine that you have:

Follow path
Realm  http://localhost/*
URL    http://localhost/somepath/

If you add a URL such as http://localhost/somepath/ to the database,
either with the "URL" indexer.conf command given above or with
"indexer -i -u http://localhost/somepath/", indexer
WILL follow any URL outside the "/somepath/" directory of
localhost if there is a link to it from "/somepath/".
"Follow path" has no effect when the Realm command is used.


Using different parameters for a server and its subsections
-----------------------------------------------------------
indexer looks through "Server" and "Realm" commands in order of their
appearance. So if you want to give different parameters to, for example,
a whole server and one of its subsections, put the subsection line before
the line for the whole server. Imagine that your server has a
subdirectory that contains news articles. Those articles should be
reindexed more often than the rest of the server. The following
combination may be useful in such cases:

# Add subsection
Period 200000
Server http://servername/news/

# Add server
Period 600000
Server http://servername/

These commands give the /news/ subdirectory a different reindexing period
from the server as a whole. indexer will choose the first
"Server" record for http://servername/news/page1.html because it
matches and was given first.
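
The first-match rule can be sketched in Python as follows. This is an
illustration only: the list-of-tuples layout and the name find_period are
assumptions made for the example, and matching is simplified to a plain
prefix test.

```python
# Server records in the order they appear in indexer.conf: (URL prefix, Period).
servers = [
    ("http://servername/news/", 200000),
    ("http://servername/", 600000),
]

def find_period(url, records):
    """Return the Period of the first record whose prefix matches url."""
    for prefix, period in records:
        if url.startswith(prefix):
            return period  # the first match wins, later records are ignored
    return None  # no corresponding Server record

print(find_period("http://servername/news/page1.html", servers))
print(find_period("http://servername/about.html", servers))
```

If the two records were listed in the opposite order, the whole-server
record would match news pages first and the shorter period would never
be used.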



Default behaviour of indexer
----------------------------

By default, indexer follows links that have a corresponding Server/Realm
command in the indexer.conf file.
It also jumps between servers if both of them are present in indexer.conf,
either directly in a Server command or indirectly in a Realm command.
For example, suppose there are two Server commands:

Server http://www/
Server http://web/

When indexing http://www/page1.html indexer WILL follow the link
http://web/page2.html if it is found. Note that these
pages are on different servers, but BOTH of them have a corresponding
Server record.

If one of the Server commands is deleted, indexer will remove
all expired URLs of that server during the next reindexing.



Using "Follow world"
--------------------

The first way to change the default behaviour described above is to use
the "Follow world" indexer.conf command. indexer will walk through ANY
URL it finds and will jump between different servers. Theoretically,
it will index the whole Internet in this case, if there are no hardware
limits :-)

When the "Follow world" command is specified, indexer just adds one server
record with an empty start URL to memory while loading indexer.conf.
This empty server is used only when no other Server record with a
non-empty start URL matches.
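
The role of the empty server record can be pictured with a short Python
sketch. It is a simplified model of the lookup described above, not
indexer's implementation: the name find_server is hypothetical and
matching is reduced to a prefix test.

```python
def find_server(url, start_urls, follow_world=False):
    """Return the start URL of the matching record, or None if there is none."""
    # Records with a non-empty start URL are tried first, in config order.
    for start in start_urls:
        if start and url.startswith(start):
            return start
    # "Follow world" adds one record with an empty start URL; it is used
    # only when nothing else matched.
    if follow_world:
        return ""
    return None

configured = ["http://www/"]
print(repr(find_server("http://www/a.html", configured, follow_world=True)))
print(repr(find_server("http://other/a.html", configured, follow_world=True)))
print(repr(find_server("http://other/a.html", configured)))
```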



Using "DeleteNoServer no"
-------------------------

The second way to change the default behaviour is to use the
"DeleteNoServer no" command. With this command, URLs that are already in
the database are not deleted even if they have no corresponding
Server/Realm command. "DeleteNoServer no" is implemented by adding one
empty server, just like "Follow world". The difference between the two
commands is that with "DeleteNoServer no" indexer follows links ONLY
INSIDE servers and does not jump between different servers. This lets you
index only those servers that are already in the database, without
following links to other servers.

Example of command sequence:

DeleteNoServer no
Server http://www/
Server http://web/

While indexing http://www/page1.html indexer WILL follow the link
http://www/page2.html but WILL NOT follow the link http://web/page2.html,
because http://www/page1.html and http://web/page2.html are on different
servers.
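
The "only inside servers" rule from this example can be expressed as a
tiny Python check. This is a sketch of the behaviour described in this
section, assuming that "the same server" means the same hostname; the
function name should_follow is hypothetical.

```python
from urllib.parse import urlsplit

def should_follow(referrer, link):
    """Model of "DeleteNoServer no": follow a link only within the same server."""
    return urlsplit(referrer).hostname == urlsplit(link).hostname

print(should_follow("http://www/page1.html", "http://www/page2.html"))
print(should_follow("http://www/page1.html", "http://web/page2.html"))
```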

Note that if you delete a URL from the list in url.txt while using the
"DeleteNoServer no" scheme, indexer WILL NOT delete URLs of the same server.
Imagine that you have removed http://www/ from url.txt. To remove all URLs
of this server from the database you have to run
"indexer -C -u http://www/%".


Realm *
-------
You may notice that "Realm *" resembles "DeleteNoServer no". It has
almost the same effect. The only difference is that "Realm *" does allow
indexer to jump between servers.


Using "indexer -f <filename>"
-----------------------------

The third scheme is very useful when running "indexer -i -f url.txt". You
can maintain the list of required servers in url.txt. When a new URL is
added to url.txt, indexer will index the server of this URL during the
next startup.

If you are using "DeleteNoServer no", it does not matter whether you have
passed the root URL of the server (http://www/) or one of its internal
pages (http://www/path/to/some/page.html). indexer will index the whole
server http://www/.


