HTML parser
===========


Tag parser
----------
The tag parser understands the following notations for tag parameters:

1) < ... parameter=value ...   >

2) < ... parameter="value" ... >

3) < ... parameter='value' ... >
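As an illustration only, the three notations above can be matched with a single regular expression. The following Python sketch is not the indexer's actual implementation; the names ``ATTR_RE`` and ``parse_attributes`` and the pattern itself are assumptions made for this example.

```python
import re

# Hypothetical pattern: name=value where the value is double-quoted,
# single-quoted, or unquoted (ending at whitespace, a quote, or '>').
ATTR_RE = re.compile(
    r"""([A-Za-z][-\w:]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))"""
)


def parse_attributes(tag):
    """Return a dict mapping lower-cased parameter names to their values."""
    attrs = {}
    for name, double_quoted, single_quoted, bare in ATTR_RE.findall(tag):
        # Exactly one of the three groups can be non-empty per match.
        attrs[name.lower()] = double_quoted or single_quoted or bare
    return attrs


example = parse_attributes("<a href=x.html title=\"Hello world\" class='nav'>")
```

Here ``example`` holds one entry per parameter, regardless of which quoting style the tag used.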



Special characters
------------------
The indexer understands the following special HTML characters:

1) &lt; &gt; &amp; &nbsp; &quot;

2) All SGML ISO-8859-1 entities: &auml; &uuml; and others.

3) Characters in numeric code notation, for example &#234;
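The effect of decoding these three kinds of references can be illustrated with Python's standard ``html.unescape`` function. This is only a sketch of the expected behavior, not the indexer's own decoder; note that ``html.unescape`` also accepts HTML5 entities beyond those listed above.

```python
from html import unescape

# 1) Basic markup entities
basic = unescape("&lt;b&gt; Tom &amp; Jerry &quot;quoted&quot;")

# 2) ISO-8859-1 named entities
latin1 = unescape("M&uuml;nchen, K&auml;se")

# 3) Numeric character reference (decimal code 234 is U+00EA)
numeric = unescape("f&#234;te")

# &nbsp; decodes to the non-breaking space U+00A0, not a plain space
nbsp = unescape("a&nbsp;b")
```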



Meta tags
---------
The indexer's HTML parser currently understands the following META tags.
Note that "HTTP-EQUIV" may be used instead of "NAME" in all entries.

1) <META NAME="Content-Type" Content="text/html; charset=xxxx">
This is used to detect the document character set if it is not specified
in the "Content-Type" HTTP header.

2) <META NAME="REFRESH" Content="5; URL=http://www.somewhere.com">
The URL value is inserted into the database.

3) <META NAME="Keywords" Content="xxx">

4) <META NAME="Description" Content="xxx">

5) <META NAME="Robots" Content="xxx"> with the content values ALL, NONE,
INDEX, NOINDEX, FOLLOW and NOFOLLOW.
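The META handling described above can be sketched in Python with the standard ``html.parser`` module. This is a hedged illustration, not the indexer's code: the class ``MetaCollector`` and the helpers ``charset_of``, ``refresh_url`` and ``robots_directives`` are hypothetical names, and treating the Robots value as a comma-separated list is an assumption of this example.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect META tags keyed by NAME or HTTP-EQUIV (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)  # HTMLParser lower-cases tag and attribute names
        key = a.get("name") or a.get("http-equiv")
        if key and a.get("content") is not None:
            self.meta[key.lower()] = a["content"]


def charset_of(content_type):
    """Extract the charset parameter from a Content-Type value, or None."""
    for part in content_type.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.strip().lower() == "charset":
            return value.strip().strip("\"'") or None
    return None


def refresh_url(content):
    """Extract the URL from a REFRESH value such as '5; URL=http://...'."""
    _, _, rest = content.partition(";")
    name, _, value = rest.strip().partition("=")
    return value.strip() if name.strip().lower() == "url" else None


def robots_directives(content):
    """Split a Robots value into a set of upper-cased directives
    (assumes a comma-separated list)."""
    return {d.strip().upper() for d in content.split(",") if d.strip()}


collector = MetaCollector()
collector.feed(
    '<META HTTP-EQUIV="Content-Type" Content="text/html; charset=koi8-r">'
    '<META NAME="REFRESH" Content="5; URL=http://www.somewhere.com">'
    '<META NAME="Robots" Content="NOINDEX, follow">'
)
```

After feeding the sample markup, ``collector.meta`` contains the three values, which the helper functions can then interpret.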



Links
-----

The HTML parser understands the following links:

1) <A HREF="xxx">

2) <IMG SRC="xxx">

3) <LINK HREF="xxx">

4) <FRAME SRC="xxx">

5) <AREA HREF="xxx">

6) <BASE HREF="xxx">
If the BASE HREF value is an incorrectly formed URL, the URL of the current
document is used instead to resolve relative links.
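The link extraction and the BASE HREF fallback can be sketched as follows. This Python example is an illustration under stated assumptions, not the indexer's implementation: ``LinkCollector`` and ``is_well_formed`` are hypothetical names, and "well formed" is assumed here to mean an absolute URL with both a scheme and a host.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Tag -> parameter that carries the link, as listed above.
LINK_ATTRS = {"a": "href", "img": "src", "link": "href",
              "frame": "src", "area": "href"}


def is_well_formed(url):
    """Assumed validity test: an absolute URL with a scheme and a host."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    return bool(parts.scheme and parts.netloc)


class LinkCollector(HTMLParser):
    def __init__(self, doc_url):
        super().__init__()
        self.base = doc_url  # default base is the document's own URL
        self.links = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "base":
            href = a.get("href")
            # A malformed BASE HREF is ignored, so the current URL stays.
            if href and is_well_formed(href):
                self.base = href
        elif tag in LINK_ATTRS and a.get(LINK_ATTRS[tag]):
            self.links.append(urljoin(self.base, a[LINK_ATTRS[tag]]))


collector = LinkCollector("http://example.com/dir/page.html")
collector.feed('<BASE HREF="not a url"><A HREF="a.html">x</A><IMG SRC="/i.png">')
```

Because the BASE HREF in this sample is malformed, both links are resolved against the document URL passed to the constructor.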


Comments
--------

1) Text inside the <!-- .... --> tag is recognized as an HTML comment.

2) You may use the special <!--UdmComment--> .... <!--/UdmComment-->
comment tags to exclude the text between them from indexing. This
may be useful for hiding things such as menus from indexing.

3) You may also use <NOINDEX> ... </NOINDEX> as synonyms for
<!--UdmComment--> and <!--/UdmComment-->.
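The effect of these rules on the indexed text can be approximated with two regular expressions. The Python sketch below is an assumption-laden illustration, not the indexer's parser: the names ``EXCLUDED``, ``COMMENT`` and ``strip_unindexed`` are hypothetical, and case-insensitive matching of the tags is assumed.

```python
import re

# Assumed patterns: excluded regions may span several lines and are
# matched case-insensitively.
EXCLUDED = re.compile(
    r"<!--UdmComment-->.*?<!--/UdmComment-->"
    r"|<noindex>.*?</noindex>",
    re.IGNORECASE | re.DOTALL,
)
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_unindexed(html):
    """Remove excluded regions first, then ordinary HTML comments."""
    return COMMENT.sub("", EXCLUDED.sub("", html))


page = (
    "<p>Body</p>"
    "<!--UdmComment--><ul>menu</ul><!--/UdmComment-->"
    "<NOINDEX>ads</NOINDEX>"
    "<!-- note -->end"
)
cleaned = strip_unindexed(page)
```

The excluded regions must be removed before ordinary comments; otherwise the UdmComment markers themselves would be stripped as comments and the text between them would remain.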
