Class RobotsManager
- java.lang.Object
- org.apache.manifoldcf.core.database.BaseTable
- org.apache.manifoldcf.crawler.connectors.webcrawler.RobotsManager
public class RobotsManager extends org.apache.manifoldcf.core.database.BaseTable
This class manages the database table into which robots.txt files for hosts are written. The data resides in the database and, up to a size limit, in cache. The result is a memory-limited, database-backed repository of robots files that the crawler can draw on.
Table: robotsdata
Field            Type           Description
hostname         VARCHAR(255)   Primary Key
robotsdata       BLOB
expirationtime   BIGINT
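The "memory-limited, database-backed" arrangement described above can be illustrated with a small sketch. This is not the ManifoldCF implementation: the class name `BoundedRobotsStoreSketch`, the in-memory `HashMap` standing in for the database table, and the least-recently-used eviction policy are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the ManifoldCF implementation.
class BoundedRobotsStoreSketch {
    // Stands in for the robotsdata database table (assumption).
    private final Map<String, String> backingStore = new HashMap<>();
    private final int maxCached;
    private final LinkedHashMap<String, String> cache;
    // Counts how often a read had to fall through to the backing store.
    int backingReads = 0;

    BoundedRobotsStoreSketch(int maxCached) {
        this.maxCached = maxCached;
        // accessOrder=true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                // Evict the least recently used entry once the limit is exceeded.
                return size() > BoundedRobotsStoreSketch.this.maxCached;
            }
        };
    }

    // Write through: the store always has the data, the cache may drop it later.
    void write(String host, String robotsText) {
        backingStore.put(host, robotsText);
        cache.put(host, robotsText);
    }

    // Read from cache first; on a miss, read the store and repopulate the cache.
    String read(String host) {
        String cached = cache.get(host);
        if (cached != null) {
            return cached;
        }
        backingReads++;
        String stored = backingStore.get(host);
        if (stored != null) {
            cache.put(host, stored);
        }
        return stored;
    }
}
```

The point of the sketch is only that evicting an entry from memory never loses data, because every write also lands in the backing store.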
-
Nested Class Summary
Nested Classes (Modifier and Type, Class, Description)
- protected static class RobotsManager.HostDescription: This is the object description for a robots host object.
- protected static class RobotsManager.HostExecutor: This is the executor object for locating robots host objects.
- protected static class RobotsManager.Record: This class represents a record in a robots.txt file.
- protected static class RobotsManager.RobotsCacheClass: Cache class for robots.
- protected static class RobotsManager.RobotsData: This is a cached data item.
-
Field Summary
Fields (Modifier and Type, Field)
- static java.lang.String _rcsid
- protected static java.lang.String expirationField
- protected static java.lang.String hostField
- protected static RobotsManager.RobotsCacheClass robotsCacheClass
- protected static java.lang.String robotsField
-
Constructor Summary
Constructors (Constructor, Description)
- RobotsManager(org.apache.manifoldcf.core.interfaces.IThreadContext tc, org.apache.manifoldcf.core.interfaces.IDBInterface database): Constructor.
-
Method Summary
Methods (Modifier and Type, Method, Description)
- java.lang.Boolean checkFetchAllowed(java.lang.String userAgent, java.lang.String hostName, long currentTime, java.lang.String pathString, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities): Read robots.txt data from the cache or from the database.
- void deinstall(): Uninstall the manager.
- protected static boolean doesPathMatch(java.lang.String path, int pathIndex, java.lang.String spec, int specIndex): Recursive method for matching specification to path.
- protected static boolean doesPathMatch(java.lang.String path, java.lang.String spec): Check if path matches specification.
- protected static java.lang.String getRobotsKey(java.lang.String hostName): Construct a key which represents an individual host name.
- void install(): Install the manager.
- protected static java.lang.String makeReadable(java.lang.String inputString): Convert a string from the robots file into a readable form that does NOT contain NUL characters (since PostgreSQL does not accept those).
- protected RobotsManager.RobotsData readRobotsData(java.lang.String hostName, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities): Read robots data, if it exists.
- void writeRobotsData(java.lang.String hostName, long expirationTime, java.io.InputStream data): Write robots.txt, replacing any existing row.
Methods inherited from class org.apache.manifoldcf.core.database.BaseTable
addTableIndex, analyzeTable, beginTransaction, buildConjunctionClause, constructCountClause, constructDistinctOnClause, constructDoubleCastClause, constructOffsetLimitClause, constructRegexpClause, constructSubstringClause, endTransaction, findConjunctionClauseMax, getDatabaseCacheKey, getDBInterface, getMaxInClause, getMaxOrClause, getSleepAmt, getTableIndexes, getTableName, getTableSchema, getTransactionID, getWindowedReportMaxRows, makeTableKey, noteModifications, performAddIndex, performAlter, performCommit, performCreate, performDelete, performDrop, performInsert, performModification, performQuery, performQuery, performRemoveIndex, performUpdate, prepareRowForSave, readRow, reindexTable, signalRollback, sleepFor
-
Field Detail
-
_rcsid
public static final java.lang.String _rcsid
- See Also:
- Constant Field Values
-
robotsCacheClass
protected static RobotsManager.RobotsCacheClass robotsCacheClass
-
hostField
protected static final java.lang.String hostField
- See Also:
- Constant Field Values
-
robotsField
protected static final java.lang.String robotsField
- See Also:
- Constant Field Values
-
expirationField
protected static final java.lang.String expirationField
- See Also:
- Constant Field Values
-
Constructor Detail
-
RobotsManager
public RobotsManager(org.apache.manifoldcf.core.interfaces.IThreadContext tc, org.apache.manifoldcf.core.interfaces.IDBInterface database) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Constructor. Note that a RobotsManager handle is only useful within a specific thread context, so the calling connector object logic must recreate the handle whenever the thread context changes.
- Parameters:
tc - is the thread context.
database - is the database handle.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
Method Detail
-
install
public void install() throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Install the manager.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
deinstall
public void deinstall() throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Uninstall the manager.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
checkFetchAllowed
public java.lang.Boolean checkFetchAllowed(java.lang.String userAgent, java.lang.String hostName, long currentTime, java.lang.String pathString, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Read robots.txt data from the cache or from the database, and check whether a fetch is allowed.
- Parameters:
hostName - is the host for which the data is desired.
currentTime - is the time of the check.
- Returns:
- null if the record needs to be fetched, true if fetch is allowed, false if it is not.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
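Because checkFetchAllowed returns a java.lang.Boolean that may be null, a caller has to handle three outcomes rather than two. The following is a minimal caller-side sketch. The names `FetchDecisionSketch`, `FetchDecision`, and `decide` are hypothetical and not part of ManifoldCF, and the reading of `false` as "fetch disallowed" is an assumption based on the return type.

```java
// Hypothetical caller-side sketch; none of these names exist in ManifoldCF.
class FetchDecisionSketch {
    enum FetchDecision { FETCH_ROBOTS_FIRST, ALLOWED, DISALLOWED }

    // Map the tri-state Boolean onto an explicit decision.
    static FetchDecision decide(Boolean checkResult) {
        if (checkResult == null) {
            // No usable robots record: robots.txt must be fetched and stored first.
            return FetchDecision.FETCH_ROBOTS_FIRST;
        }
        // Assumption: false means the path is disallowed for this user agent.
        return checkResult.booleanValue() ? FetchDecision.ALLOWED : FetchDecision.DISALLOWED;
    }

    public static void main(String[] args) {
        System.out.println(decide(null));
        System.out.println(decide(Boolean.TRUE));
        System.out.println(decide(Boolean.FALSE));
    }
}
```

Testing for null before anything else matters here: writing `if (result)` directly on a null Boolean would auto-unbox it and throw a NullPointerException.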
-
writeRobotsData
public void writeRobotsData(java.lang.String hostName, long expirationTime, java.io.InputStream data) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException, java.io.IOException
Write robots.txt, replacing any existing row.
- Parameters:
hostName - is the host.
expirationTime - is the time this data should expire.
data - is the robots data stream. May be null.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException
-
getRobotsKey
protected static java.lang.String getRobotsKey(java.lang.String hostName)
Construct a key which represents an individual host name.
- Parameters:
hostName - is the host name.
- Returns:
- the cache key.
-
readRobotsData
protected RobotsManager.RobotsData readRobotsData(java.lang.String hostName, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities) throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Read robots data, if it exists.
- Returns:
- the robots data if it exists, or null if it does not exist at all.
- Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
makeReadable
protected static java.lang.String makeReadable(java.lang.String inputString)
Convert a string from the robots file into a readable form that does NOT contain NUL characters (since PostgreSQL does not accept those).
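One way to produce a string free of NUL characters is to rewrite control characters in caret notation. The sketch below assumes that approach. The class name `ReadableSketch` and the method `toReadable` are hypothetical, and the actual makeReadable may transform the input differently.

```java
// Hypothetical sketch; makeReadable itself may use a different transformation.
class ReadableSketch {
    // Replace every character below U+0020 with caret notation,
    // so NUL becomes "^@" and TAB becomes "^I".
    static String toReadable(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c >= ' ') {
                sb.append(c);
            } else {
                // Adding '@' (0x40) maps control codes 0..31 onto '@', 'A'..'Z', '[' and so on.
                sb.append('^').append((char) (c + '@'));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toReadable("Disallow: /tmp\0/"));
    }
}
```

The result contains only characters at or above U+0020, so it can be stored in a text column that rejects NUL.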
-
doesPathMatch
protected static boolean doesPathMatch(java.lang.String path, java.lang.String spec)
Check if path matches specification.
-
doesPathMatch
protected static boolean doesPathMatch(java.lang.String path, int pathIndex, java.lang.String spec, int specIndex)
Recursive method for matching specification to path.
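The two overloads suggest a public entry point that delegates to an index-based recursive matcher. The sketch below shows one way such a matcher could work. It assumes common robots.txt conventions that this page does not state: a specification matches as a prefix, `*` matches any run of characters, and a trailing `$` anchors the end of the path. The class name `PathMatchSketch` and method `matches` are hypothetical.

```java
// Hypothetical sketch of recursive path matching; the wildcard rules are assumptions.
class PathMatchSketch {
    // Entry point: start both indexes at zero.
    static boolean matches(String path, String spec) {
        return matches(path, 0, spec, 0);
    }

    static boolean matches(String path, int pathIndex, String spec, int specIndex) {
        while (true) {
            if (specIndex == spec.length()) {
                // Specification exhausted: it matched as a prefix of the path.
                return true;
            }
            char s = spec.charAt(specIndex);
            if (s == '$' && specIndex == spec.length() - 1) {
                // Trailing '$' requires the path to end here.
                return pathIndex == path.length();
            }
            if (s == '*') {
                // Try every possible wildcard length, including zero characters.
                for (int next = pathIndex; next <= path.length(); next++) {
                    if (matches(path, next, spec, specIndex + 1)) {
                        return true;
                    }
                }
                return false;
            }
            if (pathIndex == path.length() || path.charAt(pathIndex) != s) {
                return false;
            }
            pathIndex++;
            specIndex++;
        }
    }

    public static void main(String[] args) {
        System.out.println(matches("/private/data.html", "/private"));
        System.out.println(matches("/docs/a.pdf", "/*.pdf$"));
    }
}
```

Literal characters are consumed in a loop and recursion is used only at a wildcard, which keeps the call depth proportional to the number of `*` characters in the specification.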
-