java.lang.Object
au.id.jericho.lib.html.Segment
public class Segment extends java.lang.Object implements Comparable, CharSequence
Represents a segment of a Source document.
The span of a segment is defined by the combination of its begin and end character positions.
public Segment(Source source, int begin, int end)
Constructs a new Segment within the specified source document with the specified begin and end character positions.
- Parameters:
source - the Source document, must not be null.
begin - the character position in the source where this segment begins.
end - the character position in the source where this segment ends.
public final char charAt(int index)
Returns the character at the specified index. This is logically equivalent to toString().charAt(index) for valid argument values 0 <= index < length(). However, because this implementation works directly on the underlying document source string, it should not be assumed that an IndexOutOfBoundsException is thrown for an invalid argument value.
- Parameters:
index - the index of the character.
- Returns:
- the character at the specified index.
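The bounds caveat can be illustrated with a small stand-in class. This is a minimal sketch, not the library's code: SegmentView and its fields are hypothetical names, and the sketch assumes that a segment simply offsets into the full source text.

```java
// Hypothetical stand-in for a segment backed by a shared source string.
// It is NOT the library's implementation; it only illustrates the caveat.
final class SegmentView implements CharSequence {
    private final String sourceText; // the whole document text
    private final int begin;         // inclusive
    private final int end;           // exclusive

    SegmentView(String sourceText, int begin, int end) {
        this.sourceText = sourceText;
        this.begin = begin;
        this.end = end;
    }

    @Override
    public int length() {
        return end - begin;
    }

    // Reads straight from the source text, so an index at or past length()
    // may return a character from outside the segment instead of throwing.
    @Override
    public char charAt(int index) {
        return sourceText.charAt(begin + index);
    }

    // Same offsetting approach, so the same caveat applies.
    @Override
    public CharSequence subSequence(int beginIndex, int endIndex) {
        return sourceText.substring(begin + beginIndex, begin + endIndex);
    }

    @Override
    public String toString() {
        return sourceText.substring(begin, end);
    }
}
```

With the text `<b>bold</b>` and the span 3 to 7, indexes 0 to 3 agree with toString().charAt(index), while index 4 silently reads the `<` that follows the segment. Callers that need strict bounds checking would have to validate the index themselves.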
public int compareTo(Object o)
Compares this Segment object to another object. If the argument is not a Segment, a ClassCastException is thrown. A segment is considered to be before another segment if its begin position is earlier, or in the case that both segments begin at the same position, its end position is earlier. Segments that begin and end at the same position are considered equal for the purposes of this comparison, even if they relate to different source documents. Note: this class has a natural ordering that is inconsistent with equals. This means that this method may return zero in some cases where calling the equals(Object) method with the same argument returns false.
- Parameters:
o - the segment to be compared
- Returns:
- a negative integer, zero, or a positive integer as this segment is before, equal to, or after the specified segment.
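The ordering rule and its inconsistency with equals can be sketched as follows. OrderedSpan is a hypothetical stand-in that models only a document name and the two positions; unlike the documented method, which takes an Object, the sketch uses a generic Comparable for brevity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Objects;

// Hypothetical stand-in; not the library's Segment class.
final class OrderedSpan implements Comparable<OrderedSpan> {
    final String document; // stands in for the Source the span belongs to
    final int begin;
    final int end;

    OrderedSpan(String document, int begin, int end) {
        this.document = document;
        this.begin = begin;
        this.end = end;
    }

    // Earlier begin sorts first; ties are broken by the earlier end.
    // The document is deliberately ignored, as the description states.
    @Override
    public int compareTo(OrderedSpan other) {
        if (begin != other.begin) {
            return Integer.compare(begin, other.begin);
        }
        return Integer.compare(end, other.end);
    }

    // Equality additionally requires the same document.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof OrderedSpan)) {
            return false;
        }
        OrderedSpan s = (OrderedSpan) o;
        return document.equals(s.document) && begin == s.begin && end == s.end;
    }

    @Override
    public int hashCode() {
        return Objects.hash(document, begin, end);
    }

    // Returns a sorted copy, leaving the input list untouched.
    static List<OrderedSpan> sorted(List<OrderedSpan> spans) {
        List<OrderedSpan> copy = new ArrayList<>(spans);
        Collections.sort(copy);
        return copy;
    }
}
```

One practical consequence: sorted collections such as TreeSet rely on compareTo alone, so two spans with the same positions in different documents would be treated as duplicates there even though equals reports them as different.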
public final boolean encloses(Segment segment)
Indicates whether this Segment encloses the specified Segment. This is the case if getBegin() <= segment.getBegin() && getEnd() >= segment.getEnd().
- Parameters:
segment - the segment to be tested for being enclosed by this segment.
- Returns:
true if this Segment encloses the specified Segment, otherwise false.
public final boolean encloses(int pos)
Indicates whether this segment encloses the specified character position in the source document. This is the case if getBegin() <= pos < getEnd().
- Parameters:
pos - the position in the Source document.
- Returns:
true if this segment encloses the specified character position in the source document, otherwise false.
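The two encloses conditions differ at the end boundary, which is easy to overlook. The sketch below restates them as plain integer predicates; EnclosesSketch is a hypothetical helper, and its parameters correspond to the getBegin() and getEnd() values described above.

```java
// Hypothetical helper restating the two documented conditions on plain ints.
final class EnclosesSketch {
    private EnclosesSketch() {
    }

    // Segment within segment: a span encloses itself, since both
    // comparisons allow equality.
    static boolean encloses(int begin, int end, int otherBegin, int otherEnd) {
        return begin <= otherBegin && end >= otherEnd;
    }

    // Position within segment: the end position is exclusive.
    static boolean encloses(int begin, int end, int pos) {
        return begin <= pos && pos < end;
    }
}
```

For a span from 2 to 8, position 2 is enclosed but position 8 is not, whereas the span 2 to 8 does enclose another span 2 to 8.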
public final boolean equals(Object object)
Compares the specified object with this Segment for equality. Returns true if and only if the specified object is also a Segment, and both segments have the same Source, and the same begin and end positions.
- Parameters:
object - the object to be compared for equality with this Segment.
- Returns:
true if the specified object is equal to this Segment, otherwise false.
public String extractText()
Extracts the text content of this segment. This method removes all of the tags from the segment and decodes the result, collapsing all white space. See the documentation of the extractText(boolean includeAttributes) method for more details. This is equivalent to calling extractText(false).
- Returns:
- the text content of this segment.
public String extractText(boolean includeAttributes)
Extracts the text content of this segment. This method removes all of the tags from the segment and decodes the result, collapsing all white space. Tags are also converted to whitespace unless they belong to an inline-level element. An exception to this is the BR element, which is also converted to whitespace despite being an inline-level element. Text inside SCRIPT and STYLE elements contained within this segment is ignored. Specifying a value of true as an argument to the includeAttributes parameter causes the values of title, alt, label, and summary attributes of normal tags to be included in the extracted text.
Example:
source segment: <div><b>O</b>ne</div><div><b>T</b><script>//a script </script>wo</div>
extracted text: One Two
Note that in version 2.1, no tags were converted to whitespace and text inside SCRIPT and STYLE elements was included. The example above produced the text "OneT//a script wo".
- Returns:
- the text content of this segment.
public List findAllCharacterReferences()
Returns a list of all CharacterReference objects that are enclosed by this segment.
- Returns:
- a list of all CharacterReference objects that are enclosed by this segment.
public List findAllComments()
Deprecated. Use findAllTags(StartTagType.COMMENT) instead.
Returns a list of all StartTag objects representing HTML comments that are enclosed by this segment. This method has been deprecated as of version 2.0 in favour of the more generic findAllTags(TagType) method.
public List findAllElements()
Returns a list of all Element objects that are enclosed by this segment. The elements returned correspond exactly with the start tags returned in the findAllStartTags() method.
public List findAllElements(String name)
Returns a list of all Element objects with the specified name that are enclosed by this segment. The elements returned correspond exactly with the start tags returned in the findAllStartTags(String name) method. Specifying a null argument to the name parameter is equivalent to findAllElements(). This method also returns elements consisting of unregistered tags if the specified name is not a valid XML tag name.
- Parameters:
name - the name of the elements to find.
public List findAllElements(StartTagType startTagType)
Returns a list of all Element objects with start tags of the specified type that are enclosed by this segment. The elements returned correspond exactly with the start tags returned in the findAllTags(TagType) method.
- Parameters:
startTagType - the type of start tags to find, must not be null.
public List findAllStartTags()
public List findAllStartTags(String name)
Returns a list of all StartTag objects with the specified name that are enclosed by this segment. See the Tag class documentation for more details about the behaviour of this method. Specifying a null argument to the name parameter is equivalent to findAllStartTags(). This method also returns unregistered tags if the specified name is not a valid XML tag name.
- Parameters:
name - the name of the start tags to find.
public List findAllStartTags(String attributeName, String value, boolean valueCaseSensitive)
Returns a list of all StartTag objects with the specified attribute name/value pair that are enclosed by this segment. See the Tag class documentation for more details about the behaviour of this method.
- Parameters:
attributeName - the attribute name (case insensitive) to search for, must not be null.
value - the value of the specified attribute to search for, must not be null.
valueCaseSensitive - specifies whether the attribute value matching is case sensitive.
public List findAllTags()
public List findAllTags(TagType tagType)
Returns a list of all Tag objects of the specified type that are enclosed by this segment. See the Tag class documentation for more details about the behaviour of this method. Specifying a null argument to the tagType parameter is equivalent to findAllTags().
- Parameters:
tagType - the type of tags to find.
public List findFormControls()
Returns a list of the FormControl objects that are enclosed by this segment.
- Returns:
- a list of the FormControl objects that are enclosed by this segment.
public FormFields findFormFields()
Returns the FormFields object representing all form fields that are enclosed by this segment. This is equivalent to new FormFields(findFormControls()).
- Returns:
- the FormFields object representing all form fields that are enclosed by this segment.
- See Also:
findFormControls()
public final List findWords()
Deprecated. No replacement.
Returns a list of Segment objects representing every word in this segment separated by white space. Note that any markup contained in this segment is regarded as normal text for the purposes of this method. This method has been deprecated as of version 2.0 as it has no discernible use.
- Returns:
- a list of Segment objects representing every word in this segment separated by white space.
public final int getBegin()
Returns the character position in the Source document at which this segment begins.
- Returns:
- the character position in the Source document at which this segment begins.
public List getChildElements()
Returns a list of the immediate children of this segment in the document element hierarchy. The returned list may include an element that extends beyond the end of this segment, as long as it begins within this segment. The objects in the list are all of type Element. See the Source.getChildElements() method for more details.
- Returns:
- a list of the immediate children of this segment in the document element hierarchy, guaranteed not null.
- See Also:
Element.getParentElement()
public String getDebugInfo()
Returns a string representation of this object useful for debugging purposes.
- Returns:
- a string representation of this object useful for debugging purposes.
public final int getEnd()
Returns the character position in the Source document immediately after the end of this segment. The character at the position specified by this property is not included in the segment.
- Returns:
- the character position in the Source document immediately after the end of this segment.
public String getSourceText()
Deprecated. Use toString() instead.
Returns the source text of this segment. This method has been deprecated as of version 2.0 as it now duplicates the functionality of the toString() method.
- Returns:
- the source text of this segment.
public final String getSourceTextNoWhitespace()
Deprecated. Use the more useful CharacterReference.decodeCollapseWhiteSpace(CharSequence) method instead.
Returns the source text of this segment without white space. All leading and trailing white space is omitted, and any sections of internal white space are replaced by a single space. This method has been deprecated as of version 2.0 as it is no longer used internally and has no practical use as a public method. It is similar to the new CharacterReference.decodeCollapseWhiteSpace(CharSequence) method, but does not decode the text after collapsing the white space.
- Returns:
- the source text of this segment without white space.
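The collapsing rule described above (trim both ends, reduce each internal run of white space to one space, no decoding) can be sketched as a single pass over the text. CollapseSketch is a hypothetical helper, not the library's code, and it assumes the six white space characters listed for isWhiteSpace(char).

```java
// Hypothetical helper; a sketch of the collapsing rule, without any decoding
// of character references.
final class CollapseSketch {
    private CollapseSketch() {
    }

    static String collapseWhiteSpace(CharSequence text) {
        StringBuilder out = new StringBuilder(text.length());
        boolean pendingSpace = false;
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (isWhiteSpace(ch)) {
                // Remember a separator only once some output exists,
                // which drops leading white space.
                pendingSpace = out.length() > 0;
            } else {
                if (pendingSpace) {
                    out.append(' ');
                    pendingSpace = false;
                }
                out.append(ch);
            }
        }
        // A trailing pending space is never appended, which drops
        // trailing white space.
        return out.toString();
    }

    // Assumed definition: the white space characters of HTML 4.01 section 9.1.
    private static boolean isWhiteSpace(char ch) {
        return ch == ' ' || ch == '\t' || ch == '\f'
                || ch == '\n' || ch == '\r' || ch == '\u200B';
    }
}
```

Because nothing is decoded, a character reference such as `&nbsp;` passes through unchanged, which matches the stated difference from decodeCollapseWhiteSpace.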
public int hashCode()
Returns a hash code value for the segment. The current implementation returns the sum of the begin and end positions, although this is not guaranteed in future versions.
- Returns:
- a hash code value for the segment.
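A sum-based hash is cheap but collides for any two spans whose positions add up to the same value. The sketch below shows that such collisions are harmless for correctness, because hash-based collections fall back on equals. HashedSpan is a hypothetical stand-in that omits the source document and compares positions only.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in that models only the begin and end positions.
final class HashedSpan {
    final int begin;
    final int end;

    HashedSpan(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof HashedSpan)) {
            return false;
        }
        HashedSpan s = (HashedSpan) o;
        return begin == s.begin && end == s.end;
    }

    // Mirrors the described current behaviour: the sum of the two positions.
    @Override
    public int hashCode() {
        return begin + end;
    }

    // Counts distinct spans by adding them to a hash set.
    static int distinctCount(HashedSpan... spans) {
        Set<HashedSpan> set = new HashSet<>();
        for (HashedSpan span : spans) {
            set.add(span);
        }
        return set.size();
    }
}
```

Spans 0 to 10 and 4 to 6 share the hash value 10 yet remain distinct set members. Since the sum is not guaranteed in future versions, code should not depend on the specific value.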
public void ignoreWhenParsing()
Causes this segment to be ignored when parsing. This method is usually used to exclude server tags or other non-HTML segments from the source text so that they do not interfere with the parsing of the surrounding HTML. This is necessary because many server tags are used as attribute values and in other places within HTML tags, and very often contain characters that prevent the parser from recognising the surrounding tag. Any tags appearing in this segment that are found before this method is called will remain in the tag cache, and so will continue to be found by the tag search methods. If this is undesirable, the Source.clearCache() method can be called to remove them from the cache. Calling the Source.fullSequentialParse() method after this method clears the cache automatically. For efficiency reasons, this method should be called on all segments that need to be ignored without calling any of the tag search methods in between.
- See Also:
Source.ignoreWhenParsing(Collection segments)
public boolean isComment()
Deprecated. Use this instanceof Tag && ((Tag)this).getTagType()==StartTagType.COMMENT instead.
Indicates whether this segment is a Tag of type StartTagType.COMMENT. This method has been deprecated as of version 2.0 as it is not a robust method of checking whether an HTML comment spans this segment.
- Returns:
true if this segment is a Tag of type StartTagType.COMMENT, otherwise false.
public final boolean isWhiteSpace()
Indicates whether this segment consists entirely of white space.
- Returns:
true if this segment consists entirely of white space, otherwise false.
public static final boolean isWhiteSpace(char ch)
Indicates whether the specified character is white space. The HTML 4.01 specification section 9.1 specifies the following white space characters:
- space (U+0020)
- tab (U+0009)
- form feed (U+000C)
- line feed (U+000A)
- carriage return (U+000D)
- zero-width space (U+200B)
Despite the explicit inclusion of the zero-width space in the HTML specification, Microsoft IE6 does not recognise it as whitespace and renders it as an unprintable character (empty square). Even zero-width spaces included using the numeric character reference &#x200B; are rendered this way.
- Parameters:
ch - the character to test.
- Returns:
true if the specified character is white space, otherwise false.
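The character list above translates directly into a membership test. The following is a minimal sketch with a hypothetical class name, not the library's code; the CharSequence overload loosely mirrors the instance isWhiteSpace() method, and its result for empty input is this sketch's own choice.

```java
// Hypothetical helper built from the white space characters listed above.
final class HtmlWhiteSpaceSketch {
    private HtmlWhiteSpaceSketch() {
    }

    static boolean isWhiteSpace(char ch) {
        switch (ch) {
            case ' ':      // space (U+0020)
            case '\t':     // tab (U+0009)
            case '\f':     // form feed (U+000C)
            case '\n':     // line feed (U+000A)
            case '\r':     // carriage return (U+000D)
            case '\u200B': // zero-width space (U+200B)
                return true;
            default:
                return false;
        }
    }

    // True if every character is white space. Returns true for empty input,
    // which is an assumption of this sketch rather than documented behaviour.
    static boolean isWhiteSpace(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            if (!isWhiteSpace(text.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

Note that this definition is narrower than java.lang.Character.isWhitespace in some respects and wider in others: the no-break space (U+00A0) is excluded here, while the zero-width space is included.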
public final int length()
Returns the length of the segment. This is defined as the number of characters between the begin and end positions.
- Returns:
- the length of the segment.
public Attributes parseAttributes()
Parses any Attributes within this segment. This method is only used in the unusual situation where attributes exist outside of a start tag. The StartTag.getAttributes() method should be used in normal situations. This is equivalent to source.parseAttributes(getBegin(),getEnd()).
- Returns:
- the Attributes within this segment, or null if too many errors occur while parsing.
public final CharSequence subSequence(int beginIndex, int endIndex)
Returns a new character sequence that is a subsequence of this sequence. This is logically equivalent to toString().subSequence(beginIndex,endIndex) for valid values of beginIndex and endIndex. However, because this implementation works directly on the underlying document source string, it should not be assumed that an IndexOutOfBoundsException is thrown for invalid argument values as described in the String.subSequence(int,int) method.
- Parameters:
beginIndex - the begin index, inclusive.
endIndex - the end index, exclusive.
- Returns:
- a new character sequence that is a subsequence of this sequence.
public String toString()
Returns the source text of this segment as a String. The returned String is newly created with every call to this method, unless this segment is itself an instance of Source. Note that before version 2.0 this returned a representation of this object useful for debugging purposes, which can now be obtained via the getDebugInfo() method.
- Returns:
- the source text of this segment as a String.