java.lang.Object
  LBJ2.parse.LineByLine
      LBJ2.nlp.SentenceSplitter

public class SentenceSplitter
extends LineByLine
Use this class to extract sentences from plain text. The user constructs an object of this class with the file name of a document written in natural English (i.e., with no annotations added and no preprocessing of any kind performed). Empty lines in the input are interpreted as paragraph boundaries.
The user can then retrieve Sentences one at a time with
the next() method, or all at once with the
splitAll() method. The returned Sentences'
start and end fields represent offsets into the
file they were extracted from. Every character in between those two
offsets inclusive, including extra spaces, newlines, etc., is included in
the Sentence as it appeared in the paragraph.
A main(String[]) method is also implemented, which applies
this class to plain text in a straightforward way.
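
To make the inclusive offsets concrete, here is a minimal sketch that does not use LBJ2 at all. The class InclusiveOffsets and its slice method are hypothetical names introduced only for illustration; the single fact taken from the description above is that both start and end are inclusive, so every character between them, extra spaces included, belongs to the sentence.

```java
final class InclusiveOffsets {
    // Hypothetical helper, not part of LBJ2: returns the characters of text
    // from index start to index end, with both ends included.
    static String slice(String text, int start, int end) {
        // String.substring excludes its end index, so add 1 to keep the
        // character that sits at the inclusive end offset.
        return text.substring(start, end + 1);
    }

    public static void main(String[] args) {
        String text = "Hi there.  Bye.";
        System.out.println(slice(text, 0, 8));   // prints "Hi there."
        System.out.println(slice(text, 11, 14)); // prints "Bye."
    }
}
```

With an exclusive end offset the second call would need 15 rather than 14, which is the usual off-by-one to watch for when mapping a Sentence back onto its source text.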
See Also: Sentence

| Field Summary | |
|---|---|
protected int |
currentOffset
Contains the offset of a paragraph currently being processed. |
protected int |
index
When the constructor taking an array argument is used, this variable keeps track of the element in the array currently being used. |
protected java.lang.String[] |
input
When the constructor taking an array argument is used, this variable stores that array. |
protected java.util.LinkedList |
sentences
Contains sentences ready to be returned to the user upon request. |
| Fields inherited from class LBJ2.parse.LineByLine |
|---|
fileName, in |
| Constructor Summary | |
|---|---|
SentenceSplitter(java.lang.String file)
Sentence splits the given file. |
|
SentenceSplitter(java.lang.String[] input)
Sentence splits the given input. |
|
| Method Summary | |
|---|---|
protected boolean |
boundary(int index,
Word word,
Word next1,
Word next2)
Determines whether the given punctuation represents the end of a sentence based on elements of the paragraph immediately surrounding the punctuation. |
protected boolean |
endsWithQuote(Word w)
Determines whether the argument ends with any of the following varieties of closing quote: ' '' ''' " '" . |
protected java.lang.String |
getParagraph()
This method is used to extract a paragraph at a time from the input. |
protected boolean |
hasStartMarker(Word w)
Determines whether the argument contains any of the following varieties of "start marker" at its beginning: an open quote, an open bracket, or a capital letter. |
protected boolean |
isClose(Word w)
Determines whether the argument represents a closing bracket or a closing quote. |
protected boolean |
isClosingBracket(Word w)
Determines whether the argument is exactly equal to any of the following varieties of closing bracket: ) } ] -RBR- . |
protected boolean |
isClosingQuote(Word w)
Determines whether the argument is exactly equal to any of the following varieties of closing quote: ' '' ''' " '" . |
protected boolean |
isHonorific(Word w)
Determines whether the argument is exactly equal to any of the honorifics listed below. |
protected boolean |
isTerminal(Word w)
Determines whether the argument is exactly equal to any of the following terminal abbreviations: Esq Jr Sr M.D Ph.D . |
protected boolean |
isTimeZone(Word w)
Determines whether the argument is a United States time zone abbreviation (AST, CST, EST, HST, MST, PST, ADT, CDT, EDT, HDT, MDT, PDT, or UTC-11). |
static void |
main(java.lang.String[] args)
Run this program on a file containing plain text, and it will write the same text to STDOUT, rearranged so that each line contains exactly one sentence. |
java.lang.Object |
next()
Retrieves the next sentence off the queue and returns it. |
protected void |
process(java.lang.String paragraph)
This method does the actual work, deciding where sentences begin and end and populating the sentences member variable. |
protected java.lang.String |
readLine()
If the constructor taking a file name as input was used, this method simply calls the method of the same name in LineByLine; otherwise,
it returns the next element of the array. |
protected boolean |
sentenceBeginner(Word word)
Simple check to see if the given word can reliably be identified as the first word of a sentence. |
Sentence[] |
splitAll()
Returns, as an array, every sentence found in the input paragraphs that have been provided so far. |
protected boolean |
startsWithOpenBracket(Word w)
Determines whether the argument starts with any of the following varieties of open bracket: ( { [ -LBR- . |
protected boolean |
startsWithOpenQuote(Word w)
Determines whether the argument starts with any of the following varieties of open quote: ` `` ``` " "` . |
protected boolean |
startsWithQuote(Word w)
Determines whether the first character of the argument is any of the three varieties of quotes: ' " `. |
| Methods inherited from class LBJ2.parse.LineByLine |
|---|
reset |
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
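protected int currentOffset
Contains the offset of a paragraph currently being processed.

protected java.util.LinkedList sentences
Contains sentences ready to be returned to the user upon request.

protected int index
When the constructor taking an array argument is used, this variable keeps track of the element in the array currently being used.

protected java.lang.String[] input
When the constructor taking an array argument is used, this variable stores that array.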
| Constructor Detail |
|---|
public SentenceSplitter(java.lang.String file)
Sentence splits the given file.
Parameters:
file - The name of the file to sentence split.

public SentenceSplitter(java.lang.String[] input)
Sentence splits the given input.
Parameters:
input - Plain text. Each element of this array represents a line, with any line termination characters removed.

| Method Detail |
|---|
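public static void main(java.lang.String[] args)
Run this program on a file containing plain text, and it will write the same text to STDOUT, rearranged so that each line contains exactly one sentence.
Usage:
java LBJ2.nlp.SentenceSplitter <file name>
Parameters:
args - The command line arguments.

protected java.lang.String readLine()
If the constructor taking a file name as input was used, this method simply calls the method of the same name in LineByLine; otherwise, it returns the next element of the array.
Overrides:
readLine in class LineByLine

protected java.lang.String getParagraph()
This method is used to extract a paragraph at a time from the input.

public java.lang.Object next()
Retrieves the next sentence off the queue and returns it.
Returns:
The next sentence, or null if there are no more sentences.

public Sentence[] splitAll()
Returns, as an array, every sentence found in the input paragraphs that have been provided so far.
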
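
The paragraph handling described for readLine() and getParagraph() can be pictured with a small standalone sketch. It is not LBJ2's implementation: the class ParagraphSketch and its paragraphs method are hypothetical, joining lines with a newline character is an assumption, and only strictly empty lines are treated as boundaries because the documentation does not say how whitespace-only lines are handled.

```java
import java.util.ArrayList;
import java.util.List;

final class ParagraphSketch {
    // Hypothetical helper, not part of LBJ2: groups an array of lines
    // (line terminators already removed) into paragraphs, using empty
    // lines as paragraph boundaries.
    static List<String> paragraphs(String[] lines) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : lines) {
            if (line.isEmpty()) {
                // An empty line closes the paragraph collected so far, if any.
                if (current.length() > 0) {
                    result.add(current.toString());
                    current.setLength(0);
                }
            } else {
                // Assumption: consecutive lines are rejoined with '\n'.
                if (current.length() > 0) {
                    current.append('\n');
                }
                current.append(line);
            }
        }
        // Flush the final paragraph when the input does not end with an empty line.
        if (current.length() > 0) {
            result.add(current.toString());
        }
        return result;
    }
}
```

For the input lines "One. Two.", "Three.", "", "", "Four." this sketch yields two paragraphs; the run of consecutive empty lines produces no empty paragraph in between.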
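protected void process(java.lang.String paragraph)
This method does the actual work, deciding where sentences begin and end and populating the sentences member variable.
Parameters:
paragraph - The paragraph to process.
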
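protected boolean boundary(int index,
Word word,
Word next1,
Word next2)
Determines whether the given punctuation represents the end of a sentence based on elements of the paragraph immediately surrounding the punctuation.
Parameters:
index - The index of the punctuation in question in its word.
word - The word containing the punctuation.
next1 - The word one after the word containing the punctuation.
next2 - The word two after the word containing the punctuation.
punctuation before

protected boolean sentenceBeginner(Word word)
Simple check to see if the given word can reliably be identified as the first word of a sentence.
Parameters:
word - The word in question.

protected boolean startsWithQuote(Word w)
Determines whether the first character of the argument is any of the three varieties of quotes: ' " `.
Parameters:
w - The word in question.
Returns:
true if and only if the first character of the argument is any of the three varieties of quotes.

protected boolean endsWithQuote(Word w)
Determines whether the argument ends with any of the following varieties of closing quote: ' '' ''' " '" .
Parameters:
w - The word in question.
Returns:
true if and only if the argument ends with any of the varieties of quotes named above.

protected boolean isClose(Word w)
Determines whether the argument represents a closing bracket or a closing quote.
Parameters:
w - The word in question.
Returns:
true if and only if the argument represents either a closing bracket or a closing quote.

protected boolean isClosingBracket(Word w)
Determines whether the argument is exactly equal to any of the following varieties of closing bracket: ) } ] -RBR- .
Parameters:
w - The word in question.
Returns:
true if and only if the argument is exactly equal to any of the above varieties of closing bracket.

protected boolean isClosingQuote(Word w)
Determines whether the argument is exactly equal to any of the following varieties of closing quote: ' '' ''' " '" .
Parameters:
w - The word in question.
Returns:
true if and only if the argument is exactly equal to any of the above varieties of closing quote.

protected boolean hasStartMarker(Word w)
Determines whether the argument contains any of the following varieties of "start marker" at its beginning: an open quote, an open bracket, or a capital letter.
Parameters:
w - The word in question.
Returns:
true if and only if the argument starts with a "start marker".

protected boolean startsWithOpenQuote(Word w)
Determines whether the argument starts with any of the following varieties of open quote: ` `` ``` " "` .
Parameters:
w - The word in question.
Returns:
true if and only if the argument starts with one of the varieties of open quote named above.

protected boolean startsWithOpenBracket(Word w)
Determines whether the argument starts with any of the following varieties of open bracket: ( { [ -LBR- .
Parameters:
w - The word in question.
Returns:
true if and only if the argument starts with any of the varieties of open bracket named above.

protected boolean isTimeZone(Word w)
Determines whether the argument is a United States time zone abbreviation (AST, CST, EST, HST, MST, PST, ADT, CDT, EDT, HDT, MDT, PDT, or UTC-11).
Parameters:
w - The word in question.
Returns:
true if and only if the argument matches any of the above time zone abbreviations.

protected boolean isTerminal(Word w)
Determines whether the argument is exactly equal to any of the following terminal abbreviations: Esq Jr Sr M.D Ph.D .
Parameters:
w - The word in question.
Returns:
true if and only if the argument matches any of the above terminal abbreviations.

protected boolean isHonorific(Word w)
Determines whether the argument is exactly equal to any of the honorifics listed below.
Parameters:
w - The word in question.
Returns:
true if and only if the argument is exactly equal to any of the honorifics listed above.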
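
The difference between the "exactly equal to" and "starts with" checks documented above can be illustrated with a standalone sketch. It operates on plain String values rather than on Word, whose accessors are not shown on this page, and the class TokenChecks is hypothetical. The literal lists are copied from the method descriptions; case-sensitive matching is an assumption, and none of this should be read as the library's actual code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class TokenChecks {
    // Lists copied from the method descriptions on this page.
    private static final Set<String> CLOSING_BRACKETS =
        new HashSet<>(Arrays.asList(")", "}", "]", "-RBR-"));
    private static final String[] OPEN_BRACKETS = {"(", "{", "[", "-LBR-"};
    private static final Set<String> TERMINALS =
        new HashSet<>(Arrays.asList("Esq", "Jr", "Sr", "M.D", "Ph.D"));
    private static final Set<String> TIME_ZONES =
        new HashSet<>(Arrays.asList("AST", "CST", "EST", "HST", "MST", "PST",
                                    "ADT", "CDT", "EDT", "HDT", "MDT", "PDT", "UTC-11"));

    // "Exactly equal" checks: the whole token must match one list entry.
    static boolean isClosingBracket(String w) { return CLOSING_BRACKETS.contains(w); }
    static boolean isTerminal(String w) { return TERMINALS.contains(w); }
    static boolean isTimeZone(String w) { return TIME_ZONES.contains(w); }

    // "Starts with" check: only a prefix of the token has to match.
    static boolean startsWithOpenBracket(String w) {
        for (String bracket : OPEN_BRACKETS) {
            if (w.startsWith(bracket)) {
                return true;
            }
        }
        return false;
    }

    // First-character check against the three quote characters ' " `.
    static boolean startsWithQuote(String w) {
        return !w.isEmpty() && "'\"`".indexOf(w.charAt(0)) >= 0;
    }
}
```

Under these assumptions, isClosingBracket(")") holds but isClosingBracket(").") does not, while startsWithOpenBracket("(see") holds because only the prefix is examined.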