Character sets
==============


Supported character sets
------------------------
mnoGoSearch supports the following character sets:

Cyrillic group:
	koi8-r, windows-1251, iso-8859-5, cp866, x-mac-cyrillic

Western group:
	iso-8859-1

Central Europe group:
	windows-1250, iso-8859-2

Arabic group:
	windows-1256 

Greek group:
	windows-1253, iso-8859-7 

Hebrew group:
	iso-8859-8, windows-1255

Baltic group:
	iso-8859-4, iso-8859-13, windows-1257

Turkish group:
	iso-8859-9, windows-1254

Recoding
--------
indexer recodes every document into the character set specified by the
"LocalCharset" indexer.conf command. Recoding is available only between
character sets of the same group, and is currently implemented for the
"Cyrillic", "Central Europe", "Greek" and "Baltic" groups. In the Hebrew
group (iso-8859-8 and windows-1255) and in the Turkish group (iso-8859-9
and windows-1254) the character sets are letter-compatible, i.e. letters
have the same codes in both, so no recoding is required for them.

Recoding between character sets of different groups, for example
from Cyrillic koi8-r into Western iso-8859-1, is never done by indexer.
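
The rule above can be sketched as a small lookup. This is only an
illustration in Python, not the indexer's actual code: the group table is
copied from this document, while the names CHARSET_GROUPS, RECODING_GROUPS,
group_of and recoding_available are hypothetical.

```python
# Charset groups as listed in "Supported character sets".
CHARSET_GROUPS = {
    "Cyrillic": {"koi8-r", "windows-1251", "iso-8859-5", "cp866",
                 "x-mac-cyrillic"},
    "Western": {"iso-8859-1"},
    "Central Europe": {"windows-1250", "iso-8859-2"},
    "Arabic": {"windows-1256"},
    "Greek": {"windows-1253", "iso-8859-7"},
    "Hebrew": {"iso-8859-8", "windows-1255"},
    "Baltic": {"iso-8859-4", "iso-8859-13", "windows-1257"},
    "Turkish": {"iso-8859-9", "windows-1254"},
}

# Groups for which this document says recoding is implemented.
RECODING_GROUPS = {"Cyrillic", "Central Europe", "Greek", "Baltic"}


def group_of(charset):
    """Return the group name of a canonical charset name, or None."""
    for group, members in CHARSET_GROUPS.items():
        if charset in members:
            return group
    return None


def recoding_available(src, dst):
    """True if src and dst share a group that has recoding implemented."""
    src_group = group_of(src)
    return (src_group is not None
            and src_group == group_of(dst)
            and src_group in RECODING_GROUPS)
```

With this table, recoding_available("koi8-r", "windows-1251") is true,
while recoding_available("koi8-r", "iso-8859-1") is false because the two
charsets belong to different groups. The Hebrew and Turkish pairs also give
false: per the text above, no recoding is needed for them.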


Character set aliases
---------------------
Web servers may report the same charset under different names.
For example, iso-8859-2, iso8859-2 and latin2 all denote the same charset.
The search engine understands the following charset name aliases:

1. Aliases for all ISO charsets (using iso-8859-2 as an example):

	iso-8859-2, iso8859-2, iso8859.2, iso-8859.2,
	iso_8859-2:1988, iso_8859-2, iso_8859.2

2. Aliases for all MS charsets (using windows-1250 as an example):

	windows-1250, cp-1250, cp1250, windows1250, x-cp1250

3. Aliases for Cyrillic koi8-r:

	koi8-r, koi8r, koi-8-r, koi8, koi-8, koi

4. Aliases for x-mac-cyrillic:
	
	x-mac-cyrillic, mac

5. Aliases for DOS cp-866 Cyrillic:
	
	cp-866, cp866, csibm866, 866, ibm866, x-cp866, x-ibm866, alt

6. Aliases for some Latin character sets:

	latin1 for iso-8859-1
	latin2 for iso-8859-2
	latin4 for iso-8859-4
	latin5 for iso-8859-9
	latin7 for iso-8859-13
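
As an illustration of how such aliases could be resolved, here is a minimal
Python sketch that builds a lookup table from the lists above. It is not
taken from the search engine's sources: build_alias_table and
canonical_charset are hypothetical names, case-insensitive matching is an
assumption, and the year-suffixed form is included only for iso-8859-2
because that is the only one this document gives.

```python
def build_alias_table():
    """Map every alias listed in this document to a canonical name."""
    table = {}

    # 1. ISO charsets: the spelling variants shown for iso-8859-2.
    for n in (1, 2, 4, 5, 7, 8, 9, 13):
        canonical = f"iso-8859-{n}"
        for alias in (f"iso-8859-{n}", f"iso8859-{n}", f"iso8859.{n}",
                      f"iso-8859.{n}", f"iso_8859-{n}", f"iso_8859.{n}"):
            table[alias] = canonical
    table["iso_8859-2:1988"] = "iso-8859-2"

    # 2. MS charsets: the spelling variants shown for windows-1250.
    for n in (1250, 1251, 1253, 1254, 1255, 1256, 1257):
        canonical = f"windows-{n}"
        for alias in (f"windows-{n}", f"cp-{n}", f"cp{n}",
                      f"windows{n}", f"x-cp{n}"):
            table[alias] = canonical

    # 3-5. Cyrillic koi8-r, x-mac-cyrillic and DOS cp866.
    for alias in ("koi8-r", "koi8r", "koi-8-r", "koi8", "koi-8", "koi"):
        table[alias] = "koi8-r"
    for alias in ("x-mac-cyrillic", "mac"):
        table[alias] = "x-mac-cyrillic"
    for alias in ("cp-866", "cp866", "csibm866", "866", "ibm866",
                  "x-cp866", "x-ibm866", "alt"):
        table[alias] = "cp866"

    # 6. Latin aliases.
    table.update({"latin1": "iso-8859-1", "latin2": "iso-8859-2",
                  "latin4": "iso-8859-4", "latin5": "iso-8859-9",
                  "latin7": "iso-8859-13"})
    return table


ALIASES = build_alias_table()


def canonical_charset(name):
    """Return the canonical charset name, or None if the alias is unknown."""
    # Lower-casing is an assumption; the document does not state it.
    return ALIASES.get(name.strip().lower())
```

For example, canonical_charset("ISO_8859-2:1988") and
canonical_charset("latin2") both resolve to "iso-8859-2", and an unlisted
name such as "utf-8" yields None.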


Document charset detection
--------------------------
indexer detects the character set of a document in this order:

1) HTTP header "Content-type: text/html; charset=xxx"
2) <META NAME="Content" CONTENT="text/html; charset=xxx">
3) Default from the "Charset" indexer.conf command (user preferences)
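
The detection order can be sketched in Python as follows. This is a
simplified illustration, not the indexer's parser: detect_charset and its
parameters are hypothetical, and the regular expressions are a rough
stand-in for real header and HTML parsing.

```python
import re

# Matches "charset=xxx", with or without quotes around the value.
CHARSET_PARAM = re.compile(r"charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)",
                           re.IGNORECASE)
# Matches a whole <META ...> tag.
META_TAG = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)


def detect_charset(content_type_header, html_text, default_charset):
    """Apply the three detection steps in order and return a charset name."""
    # 1) charset parameter of the Content-type header, if present.
    if content_type_header:
        match = CHARSET_PARAM.search(content_type_header)
        if match:
            return match.group(1).lower()

    # 2) charset parameter inside a META tag of the document.
    for tag in META_TAG.findall(html_text):
        match = CHARSET_PARAM.search(tag)
        if match:
            return match.group(1).lower()

    # 3) default configured by the "Charset" command.
    return default_charset
```

A header of "text/html; charset=koi8-r" wins over anything in the document
body; when neither the header nor a META tag names a charset, the
configured default is returned.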


Automatic charset guesser
-------------------------
There is also an automatic Cyrillic charset guesser, which is not compiled
by default. You can enable it with the "--with-charset-guesser" configure
argument. If the guesser was built at installation time, the three
detection methods described above are used only when automatic guessing
fails.


Default Language
----------------
You can set the default language for Servers with the DefaultLang
indexer.conf variable. This is useful when restricting a search
by URL language.
