Using URL aliases
=================

mnoGoSearch supports URL aliases, which make it possible to index a
site while taking its documents from another location. For example, if
you index a local web server, indexer can read pages directly from
disk without involving the web server in the indexing process. Another
example is building a search engine for a primary site while fetching
pages from its mirror. There are several ways of using aliases.


Alias indexer.conf command
--------------------------

The format of the "Alias" indexer.conf command is:

Alias <masterURL> <mirrorURL>

For example, suppose you wish to index http://search.mnogo.ru/ using the
nearest German mirror, http://www.gstammw.de/mirrors/mnoGoSearch/. Add
these lines to your indexer.conf:

Server http://search.mnogo.ru/
Alias  http://search.mnogo.ru/  http://www.gstammw.de/mirrors/mnoGoSearch/

search.cgi will display URLs from the master site http://search.mnogo.ru/,
but indexer will take the corresponding pages from the mirror site
http://www.gstammw.de/mirrors/mnoGoSearch/.

Another example: suppose you want to index everything in the udm.net
domain, and one of the servers, for example http://home.udm.net/, is
stored on the local machine in the /home/httpd/htdocs/ directory. These
commands will be useful:

Realm http://*.udm.net/
Alias http://home.udm.net/ file:/home/httpd/htdocs/

Indexer will take home.udm.net from the local disk and index the other
sites using HTTP.
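
The effect of a single Alias command can be pictured as a prefix
substitution on the URL. The Python sketch below is not mnoGoSearch code;
the function name apply_alias is hypothetical, and it assumes that the
part of the URL after the master prefix is appended unchanged to the
mirror location, as the examples above suggest.

```python
def apply_alias(url, master, mirror):
    """Return the location to fetch for url.

    If url starts with the master prefix, the prefix is swapped for the
    mirror location; otherwise the URL is returned unchanged.
    """
    if url.startswith(master):
        return mirror + url[len(master):]
    return url


# The URL shown to users stays on home.udm.net, the fetch goes to disk.
print(apply_alias("http://home.udm.net/docs/a.html",
                  "http://home.udm.net/",
                  "file:/home/httpd/htdocs/"))
# prints: file:/home/httpd/htdocs/docs/a.html

# A URL on another host in the realm is left alone and fetched over HTTP.
print(apply_alias("http://www.udm.net/index.html",
                  "http://home.udm.net/",
                  "file:/home/httpd/htdocs/"))
# prints: http://www.udm.net/index.html
```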


Different aliases for server parts
----------------------------------
Aliases are searched in the order of their appearance in indexer.conf.
So, you can create different aliases for a server and its parts:

# First, create an alias for a part of the server, for example the
# /stat/ directory, which is not under the common location:
Alias http://home.udm.net/stat/  file:/usr/local/stat/htdocs/

# Then create an alias for the rest of the server:
Alias http://home.udm.net/ file:/usr/local/apache/htdocs/

Note that if you change the order of these commands, the alias for the
/stat/ directory will never be found, because the more general alias
for http://home.udm.net/ matches first.
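
To see why the order matters, here is a minimal Python sketch of a
first-match lookup over an ordered alias list. It is an illustration
only, not the indexer's actual code; the function name resolve and the
list names are hypothetical.

```python
def resolve(url, aliases):
    """Return the fetch location using the first alias whose prefix matches."""
    for master, mirror in aliases:
        if url.startswith(master):
            return mirror + url[len(master):]
    return url  # no alias applies


# Same order as in the indexer.conf fragment: specific alias first.
specific_first = [
    ("http://home.udm.net/stat/", "file:/usr/local/stat/htdocs/"),
    ("http://home.udm.net/", "file:/usr/local/apache/htdocs/"),
]
# Reversed order: the general alias shadows the /stat/ one.
general_first = list(reversed(specific_first))

url = "http://home.udm.net/stat/index.html"
print(resolve(url, specific_first))
# prints: file:/usr/local/stat/htdocs/index.html
print(resolve(url, general_first))
# prints: file:/usr/local/apache/htdocs/stat/index.html
```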

Using alias in Server command
-----------------------------

You may specify the location used by indexer as an optional argument to
the Server command:

Server  http://home.udm.net/  file:/home/httpd/htdocs/


Using alias in Realm command
----------------------------

Aliases in the Realm command are a very powerful feature based on regular
expressions. The idea behind them is similar to how the PHP preg_replace()
function works. Aliases in the Realm command work only if the "regex"
match type is used; they do not work with the "string" match type.

Use this syntax to write Realm aliases:

Realm regex <URL_pattern> <alias_pattern>

Indexer searches the URL for a match to URL_pattern and builds a URL alias
using alias_pattern. alias_pattern may contain references of the form $n,
where n is a number in the range 0-9. Every such reference is replaced by
the text captured by the n'th parenthesized subpattern. $0 refers to the
text matched by the whole pattern. Opening parentheses are counted from
left to right (starting from 1) to obtain the number of the capturing
subpattern.

Example: your company hosts several hundred users with domains of the
form www.username.yourname.com. Every user's site is stored on disk in
"htdocs" under the user's home directory: /home/username/htdocs/.

You may write this command in indexer.conf (note that the dot '.'
character has a special meaning in regular expressions and must be escaped
with a '\' sign when a literal dot is meant):

Realm regex (http://www\.)(.*)(\.yourname\.com/)(.*)  file:/home/$2/htdocs/$4


Imagine indexer processes the "http://www.john.yourname.com/news/index.html"
page. It will build the references $0 to $4:

   $0 = 'http://www.john.yourname.com/news/index.html' (whole pattern match)
   $1 = 'http://www.'      matched by subpattern '(http://www\.)'
   $2 = 'john'             matched by subpattern '(.*)'
   $3 = '.yourname.com/'   matched by subpattern '(\.yourname\.com/)'
   $4 = 'news/index.html'  matched by subpattern '(.*)'

Then indexer will compose the alias using the $2 and $4 references:

   file:/home/john/htdocs/news/index.html

and will use the result as the location from which to fetch the document.
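
The substitution can be reproduced with Python's re module. This is only
a sketch of the idea, not the indexer's implementation: the function name
realm_alias is hypothetical, and treating a reference to a missing group
as an empty string is an assumption made for the sketch.

```python
import re


def realm_alias(url, url_pattern, alias_pattern):
    """Build an alias by expanding $0..$9 in alias_pattern.

    Returns None if url_pattern does not match the URL.
    """
    match = re.search(url_pattern, url)
    if match is None:
        return None

    def expand(ref):
        n = int(ref.group(1))
        # Assumption: a reference to a group the pattern does not have,
        # or to a group that did not take part in the match, becomes "".
        if n > match.re.groups:
            return ""
        return match.group(n) or ""

    # Replace each "$n" (a single digit) with the text of group n.
    return re.sub(r"\$(\d)", expand, alias_pattern)


pattern = r"(http://www\.)(.*)(\.yourname\.com/)(.*)"
url = "http://www.john.yourname.com/news/index.html"

print(realm_alias(url, pattern, "file:/home/$2/htdocs/$4"))
# prints: file:/home/john/htdocs/news/index.html
```

Note that the third subpattern consumes the slash after the host name,
so $4 starts with "news/" rather than "/news/".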

Using AliasProg command
-----------------------

You may also use the "AliasProg" command for aliasing purposes. AliasProg
is useful for major web hosting companies that want to index their web
space by taking documents directly from disk, without involving the web
server in the indexing process. The document layout may be too complex to
describe using an alias in the Realm command. AliasProg names an external
program that takes a URL and writes one string, the appropriate alias,
to stdout. Use $1 to pass the URL on the command line.

For example, this AliasProg command uses the 'replace' command from the
MySQL distribution to replace the URL substring "http://www.apache.org/"
with "file:/usr/local/apache/htdocs/":


AliasProg  "echo $1 | /usr/local/mysql/bin/mysql/replace http://www.apache.org/ file:/usr/local/apache/htdocs/"

You may also write your own program, however complex, to process URLs.
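
As an illustration, a custom alias program could look like the Python
sketch below. It follows the contract described above (URL as the first
command-line argument, one alias string on stdout). The names PREFIX_MAP
and alias_for are hypothetical, and echoing an unmatched URL unchanged is
an assumption that mirrors what the 'replace' example would do.

```python
#!/usr/bin/env python3
import sys

# Hypothetical mapping from URL prefixes to on-disk locations.
PREFIX_MAP = [
    ("http://www.apache.org/", "file:/usr/local/apache/htdocs/"),
]


def alias_for(url):
    """Return the alias for url, or url itself if no prefix matches."""
    for prefix, location in PREFIX_MAP:
        if url.startswith(prefix):
            return location + url[len(prefix):]
    return url


if __name__ == "__main__":
    # The URL arrives as the first argument; print exactly one line.
    if len(sys.argv) > 1:
        print(alias_for(sys.argv[1]))
```

Assuming the script is saved under a path of your choice and made
executable, it would be referenced from indexer.conf in the same way as
the command above, with $1 passed as its argument.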

