How mnoGoSearch stores word information
=======================================


General storage information
---------------------------
mnoGoSearch stores only the unique words found in a document.
If a word appears several times in the same document, its weights
from the different parts of the document are combined with a
bitwise OR. This means that the number of occurrences of a word in
a document does not affect its weight. However, whether the word
appears in more important parts of the document (title, description,
etc.) is taken into account.
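
The effect of ORing the weights can be illustrated with a small Python
sketch. The section names and bit values below are assumptions chosen
for illustration, not the values mnoGoSearch actually uses.

```python
from functools import reduce
from operator import or_

# Hypothetical per-section weight bits (illustrative values only).
SECTION_WEIGHT = {"body": 1, "description": 2, "title": 4}

def word_weight(sections):
    """Combine the weights of every section a word occurs in with a bitwise OR."""
    return reduce(or_, (SECTION_WEIGHT[s] for s in sections), 0)

# Ten occurrences in the body weigh the same as a single one ...
once = word_weight(["body"])
many = word_weight(["body"] * 10)

# ... but an additional occurrence in the title raises the weight.
with_title = word_weight(["body", "title"])
```

Because OR is idempotent, repeating a section never changes the result;
only the set of sections in which the word occurs matters.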


Different modes of word storage
--------------------------------
mnoGoSearch currently supports several word storage modes:
"single", "multi", "crc" and "crc-multi". The default mode is "single".
The mode is selected with the "DBMode" command, which must be given in
both the indexer.conf and search.htm files.

Examples:
DBMode single
DBMode multi
DBMode crc
DBMode crc-multi

When compiled with the built-in database, mnoGoSearch supports only the
"single", "crc" and "crc-multi" modes. The "multi" mode is not
implemented for the built-in database.


"single" mode
-------------
When "single" is specified, all words are stored in one table (or in a
text file in the built-in database) with the structure
(url_id,word,weight), where url_id is the ID of the document, as
referenced by the rec_id field in the "url" table. The word field has a
variable-length character SQL type, varchar(32).
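
As a rough sketch of this layout, the following Python snippet builds
such a table in an in-memory SQLite database. The table name "dict",
the column types and the sample rows are assumptions made for
illustration; the table definitions shipped with mnoGoSearch remain
authoritative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed layout for "single" mode: one row per unique (document, word) pair.
conn.execute("CREATE TABLE dict (url_id INTEGER, word VARCHAR(32), weight INTEGER)")
conn.executemany(
    "INSERT INTO dict (url_id, word, weight) VALUES (?, ?, ?)",
    [(1, "search", 5), (1, "engine", 1), (2, "search", 1)],
)

# Find the documents containing a word, highest weight first.
rows = conn.execute(
    "SELECT url_id, weight FROM dict WHERE word = ? ORDER BY weight DESC",
    ("search",),
).fetchall()
```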


"multi" mode
------------
If "multi" is selected, words are distributed among 13 different tables
depending on their length. The structure of these tables is the same as
in "single" mode, but a fixed-length char type is used, which is
faster in most databases. This usually makes "multi" mode faster than
"single" mode. This mode is not implemented for the built-in database.
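
Choosing a table by word length can be sketched in Python as follows.
The table names and the length boundaries are hypothetical; the text
above only states that there are 13 tables.

```python
import bisect

# Hypothetical upper length bounds, one per table (13 in total).
LENGTH_BOUNDS = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 16, 32]

def table_for_word(word):
    """Return the (hypothetical) name of the table holding words of this length."""
    # Index of the first bound that is >= the word length.
    i = bisect.bisect_left(LENGTH_BOUNDS, len(word))
    if i == len(LENGTH_BOUNDS):
        raise ValueError("word is longer than the longest supported length")
    return "dict%d" % LENGTH_BOUNDS[i]
```

With these assumed bounds, table_for_word("search") selects the table
for words of up to 6 characters, and every word in that table fits a
fixed-length char(6) column.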


"crc" mode
----------
If "crc" mode is selected, mnoGoSearch stores 32-bit integer word IDs
calculated with the CRC32 algorithm instead of the words themselves.
This mode requires less disk space and is faster than the "single"
and "multi" modes. mnoGoSearch relies on the fact that CRC32 produces
nearly unique checksums for different words. According to our tests,
only 250 pairs of words share the same CRC in a list of about
1,600,000 unique words. Most of these pairs (>90%) contain at least one
misspelled word. Word information is stored in the structure
(url_id,word_id,weight), where word_id is the 32-bit integer ID
calculated with CRC32. This mode is recommended for big search engines.
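
A minimal sketch of the idea in Python, using the standard zlib.crc32
function. Whether mnoGoSearch uses exactly this CRC32 variant, and how
it encodes or normalizes words before hashing, is not stated here, so
treat those details as assumptions. The second function is a
birthday-style estimate of how many colliding pairs to expect if the
IDs were uniformly distributed over 2^32 values.

```python
import zlib

def word_id(word):
    """Map a word to an unsigned 32-bit ID with CRC32 (UTF-8 encoding assumed)."""
    return zlib.crc32(word.encode("utf-8")) & 0xFFFFFFFF

def expected_colliding_pairs(n_words, id_bits=32):
    """Expected number of word pairs sharing an ID, assuming uniform IDs."""
    return n_words * (n_words - 1) / 2 / 2 ** id_bits

# A row in "crc" mode then holds (url_id, word_id, weight) instead of the word.
row = (1, word_id("search"), 5)

estimate = expected_colliding_pairs(1_600_000)
```

For 1,600,000 words the estimate is roughly 300 pairs, which is of the
same order of magnitude as the 250 pairs reported above.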


"crc-multi"
-----------
When "crc-multi" mode is selected, mnoGoSearch stores CRC32 word IDs in
several tables (or binary files in the built-in database) that have the
same structure as in "crc" mode. The table is chosen by word length, as
in "multi" mode. This mode is usually the fastest and is recommended
for big search engines.


"cache"
-------
There is also a new "cache" storage mode. It is the fastest mode and
makes it possible to index and quickly search through several million
documents. See "cachemode.txt" for an explanation.


SQL structure notes
-------------------
Please note that we develop mnoGoSearch with MySQL as the backend and
often cannot test each version with all of the other supported databases.
So, if there is no table definition in the create/your_database directory,
you can find the MySQL definition for the same table and adapt it for
your backend. The MySQL table definitions are always up to date.


Additional features of non-CRC storage modes
--------------------------------------------
"single" mode, in both the SQL and the built-in database, and
"multi" mode with an SQL database support substring search.
Because "crc" and "crc-multi" do not store the words themselves
and use integer values generated by the CRC32 algorithm instead,
substring search is not possible in these modes.
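
The difference can be seen in a small Python sketch. The word list is
made up, and zlib.crc32 stands in for whatever CRC32 routine
mnoGoSearch uses.

```python
import zlib

words = ["search", "research", "engine"]

# Non-CRC modes keep the words, so a substring test is possible.
substring_hits = [w for w in words if "sear" in w]

# CRC modes keep only the IDs. A query can be matched only by hashing the
# whole query word and comparing IDs, so a word fragment finds nothing
# (barring an accidental CRC collision).
ids = {zlib.crc32(w.encode("utf-8")) for w in words}
exact_hit = zlib.crc32(b"search") in ids
fragment_hit = zlib.crc32(b"sear") in ids
```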


