Run-time indexer configuration and usage
========================================


Configuration
-------------

First, configure mnoGoSearch. Most indexer settings are described in the
indexer.conf-dist file, which is in the etc directory of the mnoGoSearch
distribution. Other sample *.conf files are in the doc/samples directory.

To set up indexer.conf, change to the /etc directory of the mnoGoSearch
installation, copy indexer.conf-dist to indexer.conf, and edit the copy.

To configure the search frontends (search.cgi and/or search.php3), edit
the search.htm file in the /etc directory of the mnoGoSearch installation.
See doc/templates.txt for a detailed description.


Running indexer 
---------------
  Run indexer periodically (once a week, a day, an hour, and so on) to pick
up the latest modifications to your web sites. You can also add indexer to
your crontab.

Indexing options:                                                               
-----------------

  -a            reindex all documents even if not expired (may be               
                limited using -t, -u, -s, -c and -f options)                    
  -m            reindex expired documents even if not modified (may             
                be limited using -t, -u, -c and -s options)                     
  -e            index 'most expired' (oldest) documents first                   
  -o            index documents with less depth (hops value) first              
  -n n          index only n documents and exit                                 
  -c n          index only n seconds and exit                                   
  -q            quick startup (do not add Server URLs)                          
  -k            skip locking (affects MySQL and PostgreSQL only)                
                                                                                
  -i            insert new URLs (URLs to insert must be given using -u or -f)   
  -p n          sleep n seconds after each URL                                  
  -w            do not warn before clearing documents from database             
                                                                                
Subsection control options (may be combined):                                   
  -s status     limit indexer to documents matching status (HTTP Status code)   
  -t tag        limit indexer to documents matching tag                         
  -g category   limit indexer to documents matching category                    
  -u pattern    limit indexer to documents with URLs matching pattern           
                (supports SQL LIKE wildcard '%')                                
  -f filename   read URLs to be indexed/inserted/cleared from file (with -a     
                or -C option, supports SQL LIKE wildcard '%'; has no effect     
                when combined with -m option)                                   
  -f -          use STDIN instead of a file as the URL list
                                                                            
Logging options:                                                            
  -l            do not log to stdout/stderr                                 
  -v n          verbose level, 0-5                                          
                                                                            
Ispell import options:                                                      
  -L language   two-letter language code (en, ru, de, etc.)
  -A filename   ispell Affix file                                           
  -D filename   ispell Dictionary file                                      
  -d            dump to stdout instead of storing to database               
                                                                            
Misc. options:                                                              
  -C            clear database and exit                                     
  -S            print statistics and exit                                   
  -I            print referers and exit                                     
  -h,-?         print help page and exit                               


Built-in database support notes
-------------------------------
When built with built-in database support, indexer cannot reindex; it
indexes the whole site every time it is started.



SQL backend notes
-----------------
By default, when called without any command line arguments, indexer
reindexes only expired documents. You can change the expiration period with
the 'Period' indexer.conf command.
To reindex all documents regardless of whether they have expired, use the
-a option. indexer then marks all documents as expired at startup.

When retrieving documents, indexer sends the 'If-Modified-Since' HTTP
header for documents that are already stored in the database. When indexer
gets a document, it calculates the document's checksum. If the checksum is
the same as the old checksum stored in the database, indexer does not parse
the document again. The '-m' command line option prevents indexer from
sending 'If-Modified-Since' headers and makes it parse the document even if
the checksum is the same. This is useful, for example, when you have changed
your Allow/Disallow rules in indexer.conf and need to add pages that were
disallowed earlier.
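
The decision described above can be modelled in a few lines of Python. This
is only a sketch and not the indexer source code: the function names are
hypothetical, and CRC32 is an assumption that stands in for whatever
checksum indexer really computes. The 'force' argument plays the role of
the -m option.

```python
import zlib


def conditional_headers(last_modified, force=False):
    """Return the extra request headers for one document.

    last_modified is the stored modification date as an HTTP date string,
    or None for a URL that is not in the database yet.
    force=True models the -m option: no If-Modified-Since header is sent.
    """
    if force or last_modified is None:
        return {}
    return {"If-Modified-Since": last_modified}


def needs_parsing(body, stored_checksum, force=False):
    """Decide whether a fetched document has to be parsed again."""
    if force:
        # -m: parse even if the checksum has not changed
        return True
    # Assumption: CRC32 is used here only as an example checksum.
    return zlib.crc32(body) != stored_checksum
```

With force=False an unchanged document is skipped; with force=True it is
always parsed, which is what makes new Allow/Disallow rules take effect.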

If mnoGoSearch retrieves a URL and gets a redirect (HTTP status 301, 302
or 303), it indexes the URL given in the "Location:" field of the HTTP
header instead.



Subsection control with SQL backend 
-----------------------------------
indexer has the -t, -u and -s options to limit an action to only a
part of the database. -t limits by 'Tag', -u limits by URL substring
(SQL LIKE wildcards are supported), and -s limits to URLs with the
given HTTP status. All limit options in the same group are ORed,
and options in different groups are ANDed. mnoGoSearch with the built-in
database does not support subsection control.
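
The OR/AND rule can be illustrated with a small Python model. It is a
sketch under assumptions, not the way indexer builds its SQL: a "group" is
taken to mean all values given for one option letter, the document is a
plain dictionary with hypothetical keys 'url', 'tag' and 'status', and only
the '%' wildcard is translated.

```python
import re


def like_to_regex(pattern):
    """Translate a pattern with the SQL LIKE wildcard '%' into a regex."""
    parts = [re.escape(part) for part in pattern.split("%")]
    return re.compile(".*".join(parts), re.DOTALL)


def selected(doc, tags=(), url_patterns=(), statuses=()):
    """Return True if doc passes the -t / -u / -s style limits.

    Values inside one group are ORed, the groups themselves are ANDed.
    An empty group places no limit on the document.
    """
    groups = [
        (tags, lambda t: doc["tag"] == t),
        (url_patterns,
         lambda p: like_to_regex(p).fullmatch(doc["url"]) is not None),
        (statuses, lambda s: doc["status"] == s),
    ]
    return all(
        any(test(value) for value in values)
        for values, test in groups
        if values
    )
```

For example, two URL patterns together with one status select documents
that match either pattern and also have that status.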



How to clear database (SQL only)
--------------------------------
  To clear the whole database, use 'indexer -C'. You can also delete
only part of the database by using the -t, -u and -s subsection control
options.



Database statistics with SQL backend
------------------------------------
  If you run 'indexer -S', it shows database statistics, including the
counts of total and expired documents for each status. The -t, -u and -s
filters can be used in this mode too.


The status values mean the following:

0 - new URL (not indexed yet)

If the status is not 0, it is the HTTP response code. Some of the HTTP
codes are:

200 - "OK" (URL is successfully indexed)
301 - "Moved Permanently" (redirect to another URL)
302 - "Moved Temporarily" (redirect to another URL)
303 - "See Other" (redirect to another URL)
304 - "Not modified" (URL has not been modified since last indexing)
401 - "Authorization required" (use login/password for given URL)
403 - "Forbidden" (you have no access to this URL)
404 - "Not found" (there were references to URLs that do not exist)
500 - "Internal Server Error" (error in a CGI script, etc.)
503 - "Service Unavailable" (host is down, connection timed out)
504 - "Gateway Timeout" (read timeout when retrieving document)

HTTP 401 means that the URL is password protected. You can
use the AuthBasic command in indexer.conf to set login:password for
such URLs.

HTTP 404 means that one of your documents contains an incorrect reference
(a reference to a resource that does not exist).

Refer to HTTP documentation for further explanation of the
different HTTP status codes.



Link validation (SQL only)
--------------------------
When started with the -I command line argument, indexer displays
pairs of URLs and their referers. This is very useful for finding bad
links on your site. Don't forget to use the 'DeleteBad no' indexer.conf
command in this mode. You can use the subsection control options
-t, -u and -s in this mode. For example, 'indexer -I -s 404'
displays all 'Not found' URLs together with the referers where links to
those bad documents were found. By setting the relevant indexer.conf
commands and command line options, you can use mnoGoSearch specifically
for site validation. See the 'url-checker.conf' example for
this mode in the doc/samples directory of the mnoGoSearch distribution.



Parallel indexing (SQL only)
----------------------------
MySQL and PostgreSQL users can run several indexers simultaneously with
the same indexer.conf file. We have successfully tested 30 simultaneous
indexers with a MySQL database. indexer uses the MySQL and PostgreSQL
locking mechanisms to avoid double indexing of the same URL by different
copies of indexer. Parallel indexing in the same database is not
implemented for other backends yet. You can, however, use the
multi-threaded version of indexer with any SQL backend that supports
several simultaneous connections. The multi-threaded indexer version uses
its own locking mechanism.
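
Starting several copies at once can be scripted. The helper below is a
hypothetical sketch in Python, not part of mnoGoSearch; it starts a number
of identical processes and waits for all of them. It assumes that the
command you pass in (for example the indexer binary) can be found in PATH
and that the database is MySQL or PostgreSQL, as required above.

```python
import subprocess


def run_parallel(command, copies):
    """Start `copies` identical processes and wait for all of them.

    command is an argument list such as ["indexer"].
    Returns the list of exit codes, one per process.
    """
    processes = [subprocess.Popen(command) for _ in range(copies)]
    return [process.wait() for process in processes]
```

Calling run_parallel(["indexer"], 4) would start four indexers that share
the same indexer.conf file and return once all four have exited.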

It is not recommended to use the same database with different
indexer.conf files! The first process could add something that the second
then deletes, and this may never stop.

On the other hand, you may run several indexer processes with 
different databases with ANY supported SQL backend.

