Using external parsers
======================

Since version 2.1, indexer can use external parsers to index
different file types (MIME types).

A parser is any executable program that converts one of the MIME
types to text/plain or text/html. For example, if you have PostScript
files, you can use the ps2ascii parser (filter), which reads a PostScript
file from stdin and writes ASCII text to stdout.
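
As a rough illustration of how little a parser needs to do, the sketch
below turns tab-separated data into plain text. The name tsv_to_text is
hypothetical and not part of indexer; any program that reads stdin and
writes text to stdout would serve equally well.

```shell
#!/bin/sh
# Hypothetical minimal parser: reads tab-separated data on stdin and
# writes plain text to stdout by replacing each tab with a space.
tsv_to_text() {
    tr '\t' ' '
}

# Try it by hand:
#   printf 'name\tsize\n' | tsv_to_text
```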


Supported parser types
======================

Indexer supports four types of parsers, which can:
 * read data from stdin and send result to stdout
 * read data from file  and send result to stdout
 * read data from file  and send result to file
 * read data from stdin and send result to file


How to setup parsers
====================

1. Configure mime types
-----------------------

Configure your web server to send the appropriate "Content-Type" header.
For Apache, have a look at the mime.types file; most MIME types are already
defined there.
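
One way to check what the server actually sends is to look at the
response headers by hand. The sketch below assumes curl is installed;
content_type is a hypothetical helper name and the URL in the comment is
only a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: print the Content-Type value found in HTTP
# response headers read from stdin.
content_type() {
    # Strip carriage returns, then match the header name case-insensitively.
    tr -d '\r' | awk -F': *' 'tolower($1) == "content-type" { print $2; exit }'
}

# Example (placeholder URL); curl -sI fetches only the headers:
#   curl -sI http://localhost/docs/manual.ps | content_type
```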

If you want to index local files, use the "AddType" command in indexer.conf
to associate file name extensions with their MIME types. For example:

AddType text/html *.html


2. Add parsers
--------------

Add lines with parser definitions.
Each line has the following format with three arguments:

Mime <from_mime> <to_mime> <command line>

For example, the following line defines a parser for man pages:

# Use deroff for parsing man pages ( *.man )
Mime  application/x-troff-man   text/plain   deroff

This parser will take data from stdin and write the result to stdout.


Many parsers cannot operate on stdin and require a file to read from.
In this case indexer creates a temporary file in /tmp and removes it when
the parser exits. Use the $1 macro in the parser command line to substitute
the file name. For example, the Mime command for the "catdoc" MS Word to
ASCII converter may look like this:

Mime application/msword text/plain "/usr/bin/catdoc -a $1"


If your parser writes its result to an output file, use the $2 macro. indexer
replaces $2 with a temporary file name, starts the parser, reads the result
from this temporary file, then removes it. For example:

Mime application/msword text/plain "/usr/bin/catdoc -a $1 >$2"

The parser above reads data from the first temporary file and writes the
result to the second one. Both temporary files are removed when the
parser exits. Note that the result of using this parser is exactly
the same as with the previous one, but they use different execution modes:
file->stdout and file->file respectively.
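
Before adding a Mime line to indexer.conf, it can help to try the command
line by hand. The sketch below imitates the behaviour described above; it
is not how indexer itself is implemented. try_parser is a hypothetical
helper that copies the input to a temporary file, binds $1 and $2 to
temporary file names, feeds the input on stdin as well, and prints
whatever the parser produced.

```shell
#!/bin/sh
# Hypothetical helper (not part of indexer): run a parser command line
# the way the text above describes, for manual testing.
# Usage: try_parser 'command line using $1 and/or $2' input_file
try_parser() {
    cmd=$1
    input=$2
    in_tmp=$(mktemp) || return 1
    out_tmp=$(mktemp) || { rm -f "$in_tmp"; return 1; }
    cp "$input" "$in_tmp"

    # Arguments after the command name become $1 and $2 inside the
    # command line; the input is also supplied on stdin.
    sh -c "$cmd" parser "$in_tmp" "$out_tmp" < "$input"
    status=$?

    # If the command line mentions $2, the result is in the output file.
    case $cmd in
        *'$2'*) cat "$out_tmp" ;;
    esac

    rm -f "$in_tmp" "$out_tmp"
    return $status
}

# Example, assuming catdoc is installed and sample.doc exists:
#   try_parser '/usr/bin/catdoc -a $1 >$2' sample.doc
```

Because the command line is passed to sh -c, the same helper also works
for command lines that contain pipes.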


Pipes in parser's command line:
===============================

You can use pipes in a parser's command line. For example, these lines
are useful for indexing gzipped man pages from a local disk:

AddType  application/x-gzipped-man  *.1.gz *.2.gz *.3.gz *.4.gz
Mime     application/x-gzipped-man  text/plain  "zcat | deroff"



Charsets and parsers
====================

Some parsers produce output in a charset other than the one given in the
LocalCharset command. Specify the charset to make indexer convert the
parser's output to the proper one. For example, if your catdoc is configured
to produce output in the windows-1251 charset but LocalCharset is koi8-r,
use this command for parsing MS Word documents:

Mime  application/msword  "text/plain; charset=windows-1251" "catdoc -a $1"



UDM_URL variable
================

When executing a parser, indexer creates the UDM_URL environment variable
with the URL being processed as its value. You can use this variable in
parser scripts.
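
As a small sketch of one possible use, the stdin to stdout filter below
prepends the document's URL to the text it passes through. The function
name with_source_header and the "Source:" line are illustrative
assumptions, not something indexer requires.

```shell
#!/bin/sh
# Hypothetical filter: emit the URL being indexed, then copy the
# document text from stdin to stdout unchanged.
with_source_header() {
    # Fall back to "unknown" when run outside indexer, where UDM_URL is unset.
    printf 'Source: %s\n\n' "${UDM_URL:-unknown}"
    cat
}

# Try it by hand:
#   UDM_URL=http://localhost/notes.txt; export UDM_URL
#   with_source_header < notes.txt
```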



Parser examples
===============

Nice RPM parser by Mario Lang <lang@zid.tu-graz.ac.at>
------------------------------------------------------

        /usr/local/bin/rpminfo:

#!/bin/bash
# Print RPM package metadata as a small HTML page; $1 is the package file name.
/usr/bin/rpm -q --queryformat="<html><head><title>RPM: %{NAME} %{VERSION}-%{RELEASE}(%{GROUP})</title><meta name=\"description\" content=\"%{SUMMARY}\"></head><body>%{DESCRIPTION}\n</body></html>" -p "$1"


        indexer.conf:

Mime application/x-rpm text/html "/usr/local/bin/rpminfo $1"

        Search results then show RPM information like this:

3. RPM: mysql 3.20.32a-3 (Applications/Databases) [4]
       Mysql is a SQL (Structured Query Language) database server.
       Mysql was written by Michael (monty) Widenius. See the CREDITS
       file in the distribution for more credits for mysql and related
       things....
       (application/x-rpm) 2088855 bytes


---------------------------------------------------------------------
Please feel free to contribute your scripts and parser configurations
to general@mnogosearch.org.


