LBJ2.nlp
Class SentenceSplitter

java.lang.Object
  extended by LBJ2.parse.LineByLine
      extended by LBJ2.nlp.SentenceSplitter
All Implemented Interfaces:
Parser

public class SentenceSplitter
extends LineByLine

Use this class to extract sentences from plain text. Construct an object of this class with the file name of a document written in natural English (i.e., with no annotations added and no preprocessing performed). Note that this class interprets empty lines in the input as paragraph boundaries.

The user can then retrieve Sentences one at a time with the next() method, or all at once with the splitAll() method. The returned Sentences' start and end fields represent offsets into the file they were extracted from. Every character in between those two offsets inclusive, including extra spaces, newlines, etc., is included in the Sentence as it appeared in the paragraph.
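
To make the inclusive-offset convention concrete, here is a minimal, self-contained sketch. It does not use the LBJ2 classes: OffsetSketch and extract are hypothetical names, and the offsets are plain ints standing in for a Sentence's start and end fields.

```java
final class OffsetSketch {
    // Hypothetical helper: start and end are both inclusive offsets into text,
    // mirroring the convention described for Sentence's start and end fields.
    static String extract(String text, int start, int end) {
        // String.substring takes an exclusive end index, hence the + 1.
        return text.substring(start, end + 1);
    }

    public static void main(String[] args) {
        String text = "It rained.  We stayed in.";
        System.out.println(extract(text, 0, 9));   // prints "It rained."
        System.out.println(extract(text, 12, 24)); // prints "We stayed in."
    }
}
```

The two spaces between the sentences fall outside both spans here; that choice belongs to this example only and is not a statement about where SentenceSplitter places its boundaries.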

A main(String[]) method is also implemented which applies this class to plain text in a straightforward way.

See Also:
Sentence

Field Summary
protected  int currentOffset
          Contains the offset of the paragraph currently being processed.
protected  int index
          When the constructor taking an array argument is used, this variable keeps track of the element in the array currently being used.
protected  java.lang.String[] input
          When the constructor taking an array argument is used, this variable stores that array.
protected  java.util.LinkedList sentences
          Contains sentences ready to be returned to the user upon request.
 
Fields inherited from class LBJ2.parse.LineByLine
fileName, in
 
Constructor Summary
SentenceSplitter(java.lang.String file)
          Sentence splits the given file.
SentenceSplitter(java.lang.String[] input)
          Sentence splits the given input.
 
Method Summary
protected  boolean boundary(int index, Word word, Word next1, Word next2)
          Determines whether the given punctuation represents the end of a sentence based on elements of the paragraph immediately surrounding the punctuation.
protected  boolean endsWithQuote(Word w)
          Determines whether the argument ends with any of the following varieties of closing quote: ' '' ''' " '" .
protected  java.lang.String getParagraph()
          This method is used to extract a paragraph at a time from the input.
protected  boolean hasStartMarker(Word w)
          Determines whether the argument contains any of the following varieties of "start marker" at its beginning: an open quote, an open bracket, or a capital letter.
protected  boolean isClose(Word w)
          Determines whether the argument represents a closing bracket or a closing quote.
protected  boolean isClosingBracket(Word w)
          Determines whether the argument is exactly equal to any of the following varieties of closing bracket: ) } ] -RBR- .
protected  boolean isClosingQuote(Word w)
          Determines whether the argument is exactly equal to any of the following varieties of closing quote: ' '' ''' " '" .
protected  boolean isHonorific(Word w)
          Determines whether the argument is exactly equal to any of the honorifics listed below.
protected  boolean isTerminal(Word w)
          Determines whether the argument is exactly equal to any of the following terminal abbreviations: Esq Jr Sr M.D Ph.D .
protected  boolean isTimeZone(Word w)
          Determines whether the argument is a United States time zone abbreviation (AST, CST, EST, HST, MST, PST, ADT, CDT, EDT, HDT, MDT, PDT, or UTC-11).
static void main(java.lang.String[] args)
          Run this program on a file containing plain text, and it will write the same text to STDOUT, rearranged so that each line contains exactly one sentence.
 java.lang.Object next()
          Retrieves the next sentence off the queue and returns it.
protected  void process(java.lang.String paragraph)
          This method does the actual work, deciding where sentences begin and end and populating the sentences member variable.
protected  java.lang.String readLine()
          If the constructor taking a file name as input was used, this method simply calls the method of the same name in LineByLine; otherwise, it returns the next element of the array.
protected  boolean sentenceBeginner(Word word)
          Simple check to see if the given word can reliably be identified as the first word of a sentence.
 Sentence[] splitAll()
          Retrieves, in array form, every sentence found in the input paragraphs that have been provided so far.
protected  boolean startsWithOpenBracket(Word w)
          Determines whether the argument starts with any of the following varieties of open bracket: ( { [ -LBR- .
protected  boolean startsWithOpenQuote(Word w)
          Determines whether the argument starts with any of the following varieties of open quote: ` `` ``` " "` .
protected  boolean startsWithQuote(Word w)
          Determines whether the first character of the argument is any of the three varieties of quotes: ' " `.
 
Methods inherited from class LBJ2.parse.LineByLine
reset
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

currentOffset

protected int currentOffset
Contains the offset of the paragraph currently being processed.


sentences

protected java.util.LinkedList sentences
Contains sentences ready to be returned to the user upon request.


index

protected int index
When the constructor taking an array argument is used, this variable keeps track of the element in the array currently being used.


input

protected java.lang.String[] input
When the constructor taking an array argument is used, this variable stores that array.

Constructor Detail

SentenceSplitter

public SentenceSplitter(java.lang.String file)
Sentence splits the given file.

Parameters:
file - The name of the file to sentence split.

SentenceSplitter

public SentenceSplitter(java.lang.String[] input)
Sentence splits the given input.

Parameters:
input - Plain text. Each element of this array represents a line, with any line termination characters removed.
Method Detail

main

public static void main(java.lang.String[] args)
Run this program on a file containing plain text, and it will write the same text to STDOUT, rearranged so that each line contains exactly one sentence.

Usage: java LBJ2.nlp.SentenceSplitter <file name>

Parameters:
args - The command line arguments.

readLine

protected java.lang.String readLine()
If the constructor taking a file name as input was used, this method simply calls the method of the same name in LineByLine; otherwise, it returns the next element of the array.

Overrides:
readLine in class LineByLine
Returns:
The next line of input.

getParagraph

protected java.lang.String getParagraph()
This method is used to extract a paragraph at a time from the input.

Returns:
The extracted paragraph, or a string containing only whitespace if no text remains in the input.
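
As an illustration of treating empty lines as paragraph boundaries, the following sketch groups an array of lines into paragraphs. It is not the LBJ2 implementation: ParagraphSketch and paragraphs are hypothetical names, joining lines with a newline is an assumption, and the sketch returns a list instead of signalling the end of input with a whitespace-only string.

```java
import java.util.ArrayList;
import java.util.List;

final class ParagraphSketch {
    // Hypothetical helper: groups lines into paragraphs, treating blank
    // (empty or whitespace-only) lines as paragraph boundaries.
    static List<String> paragraphs(String[] lines) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                // A blank line closes the paragraph collected so far, if any.
                if (current.length() > 0) {
                    result.add(current.toString());
                    current.setLength(0);
                }
            } else {
                // Assumption: lines within a paragraph are rejoined with '\n'.
                if (current.length() > 0) current.append('\n');
                current.append(line);
            }
        }
        if (current.length() > 0) result.add(current.toString());
        return result;
    }

    public static void main(String[] args) {
        String[] lines = {"First line.", "Second line.", "", "New paragraph."};
        System.out.println(paragraphs(lines).size()); // prints 2
    }
}
```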

next

public java.lang.Object next()
Retrieves the next sentence off the queue and returns it.

Returns:
The next sentence found or null if there are no more sentences.
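
The null-terminated retrieval contract can be sketched with a stand-in queue. QueueSketch is a hypothetical class, not part of LBJ2; it holds plain strings in a java.util.LinkedList and does not model how the real class refills its queue from further paragraphs.

```java
import java.util.LinkedList;

final class QueueSketch {
    // Stand-in for the sentences field: items ready to be handed out.
    private final LinkedList<String> sentences = new LinkedList<>();

    QueueSketch(String... ready) {
        for (String s : ready) sentences.add(s);
    }

    // Returns the next queued item, or null once nothing remains.
    Object next() {
        return sentences.isEmpty() ? null : sentences.removeFirst();
    }

    public static void main(String[] args) {
        QueueSketch q = new QueueSketch("One.", "Two.");
        // Typical caller loop: keep asking until next() returns null.
        for (Object s = q.next(); s != null; s = q.next()) {
            System.out.println(s);
        }
    }
}
```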

splitAll

public Sentence[] splitAll()
Retrieves, in array form, every sentence found in the input paragraphs that have been provided so far.

Returns:
All sentences in the input paragraphs.

process

protected void process(java.lang.String paragraph)
This method does the actual work, deciding where sentences begin and end and populating the sentences member variable.

Parameters:
paragraph - The paragraph to process.

boundary

protected boolean boundary(int index,
                           Word word,
                           Word next1,
                           Word next2)
Determines whether the given punctuation represents the end of a sentence based on elements of the paragraph immediately surrounding the punctuation.

Parameters:
index - The index of the punctuation in question in its word.
word - The word containing the punctuation.
next1 - The word one after the word containing the punctuation.
next2 - The word two after the word containing the punctuation.

sentenceBeginner

protected boolean sentenceBeginner(Word word)
Simple check to see if the given word can reliably be identified as the first word of a sentence.

Parameters:
word - The word in question.

startsWithQuote

protected boolean startsWithQuote(Word w)
Determines whether the first character of the argument is any of the three varieties of quotes: ' " `.

Parameters:
w - The word in question.
Returns:
true if and only if the first character of the argument is any of the three varieties of quotes.

endsWithQuote

protected boolean endsWithQuote(Word w)
Determines whether the argument ends with any of the following varieties of closing quote: ' '' ''' " '" .

Parameters:
w - The word in question.
Returns:
true if and only if the argument ends with any of the varieties of quotes named above.

isClose

protected boolean isClose(Word w)
Determines whether the argument represents a closing bracket or a closing quote.

Parameters:
w - The word in question.
Returns:
true if and only if the argument represents either a closing bracket or a closing quote.

isClosingBracket

protected boolean isClosingBracket(Word w)
Determines whether the argument is exactly equal to any of the following varieties of closing bracket: ) } ] -RBR- .

Parameters:
w - The word in question.
Returns:
true if and only if the argument is exactly equal to any of the above varieties of closing bracket.

isClosingQuote

protected boolean isClosingQuote(Word w)
Determines whether the argument is exactly equal to any of the following varieties of closing quote: ' '' ''' " '" .

Parameters:
w - The word in question.
Returns:
true if and only if the argument is exactly equal to any of the above varieties of closing quote.
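
The difference between the "ends with" and "exactly equal" checks described above can be sketched as follows. This is an approximation on plain strings, not the LBJ2 code: CloseSketch is a hypothetical class, the methods take a String where the real ones take a Word, and only the token lists are copied from the descriptions above.

```java
import java.util.Arrays;
import java.util.List;

final class CloseSketch {
    // Token lists copied from the method descriptions.
    private static final List<String> CLOSING_BRACKETS =
        Arrays.asList(")", "}", "]", "-RBR-");
    private static final List<String> CLOSING_QUOTES =
        Arrays.asList("'", "''", "'''", "\"", "'\"");

    // Exact-equality checks: the whole token must be a bracket or quote.
    static boolean isClosingBracket(String w) { return CLOSING_BRACKETS.contains(w); }
    static boolean isClosingQuote(String w) { return CLOSING_QUOTES.contains(w); }

    // Suffix check: the token may carry other text before the quote.
    static boolean endsWithQuote(String w) {
        for (String q : CLOSING_QUOTES) {
            if (w.endsWith(q)) return true;
        }
        return false;
    }

    // Either kind of closing token.
    static boolean isClose(String w) { return isClosingBracket(w) || isClosingQuote(w); }

    public static void main(String[] args) {
        System.out.println(isClosingQuote("said.''")); // prints false
        System.out.println(endsWithQuote("said.''"));  // prints true
        System.out.println(isClose("-RBR-"));          // prints true
    }
}
```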

hasStartMarker

protected boolean hasStartMarker(Word w)
Determines whether the argument contains any of the following varieties of "start marker" at its beginning: an open quote, an open bracket, or a capital letter.

Parameters:
w - The word in question.
Returns:
true if and only if the argument starts with a "start marker".

startsWithOpenQuote

protected boolean startsWithOpenQuote(Word w)
Determines whether the argument starts with any of the following varieties of open quote: ` `` ``` " "` .

Parameters:
w - The word in question.
Returns:
true if and only if the argument starts with one of the varieties of open quote named above.

startsWithOpenBracket

protected boolean startsWithOpenBracket(Word w)
Determines whether the argument starts with any of the following varieties of open bracket: ( { [ -LBR- .

Parameters:
w - The word in question.
Returns:
true if and only if the argument starts with any of the varieties of open bracket named above.
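
One way the three start-marker checks might compose is sketched below, again on plain strings. StartMarkerSketch and startsWithAny are hypothetical names, and using Character.isUpperCase for the capital-letter test is an assumption; only the prefix lists come from the descriptions above.

```java
final class StartMarkerSketch {
    // Prefix lists copied from the method descriptions.
    private static final String[] OPEN_QUOTES = {"`", "``", "```", "\"", "\"`"};
    private static final String[] OPEN_BRACKETS = {"(", "{", "[", "-LBR-"};

    // Hypothetical helper: true if w begins with any of the given prefixes.
    private static boolean startsWithAny(String w, String[] prefixes) {
        for (String p : prefixes) {
            if (w.startsWith(p)) return true;
        }
        return false;
    }

    static boolean startsWithOpenQuote(String w) { return startsWithAny(w, OPEN_QUOTES); }
    static boolean startsWithOpenBracket(String w) { return startsWithAny(w, OPEN_BRACKETS); }

    // A start marker is an open quote, an open bracket, or a capital letter.
    static boolean hasStartMarker(String w) {
        return startsWithOpenQuote(w)
            || startsWithOpenBracket(w)
            || (!w.isEmpty() && Character.isUpperCase(w.charAt(0)));
    }

    public static void main(String[] args) {
        System.out.println(hasStartMarker("``Hello")); // prints true
        System.out.println(hasStartMarker("(see"));    // prints true
        System.out.println(hasStartMarker("the"));     // prints false
    }
}
```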

isTimeZone

protected boolean isTimeZone(Word w)
Determines whether the argument is a United States time zone abbreviation (AST, CST, EST, HST, MST, PST, ADT, CDT, EDT, HDT, MDT, PDT, or UTC-11).

Parameters:
w - The word in question.
Returns:
true if and only if the argument matches any of the above time zone abbreviations.

isTerminal

protected boolean isTerminal(Word w)
Determines whether the argument is exactly equal to any of the following terminal abbreviations: Esq Jr Sr M.D Ph.D .

Parameters:
w - The word in question.
Returns:
true if and only if the argument matches any of the above terminal abbreviations.
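
Fixed-vocabulary checks such as isTimeZone and isTerminal amount to set membership. The sketch below assumes exact, case-sensitive matching on plain strings; AbbreviationSketch is a hypothetical class, and only the word lists are taken from the descriptions above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class AbbreviationSketch {
    // Word lists copied from the method descriptions.
    private static final Set<String> TIME_ZONES = new HashSet<>(Arrays.asList(
        "AST", "CST", "EST", "HST", "MST", "PST",
        "ADT", "CDT", "EDT", "HDT", "MDT", "PDT", "UTC-11"));
    private static final Set<String> TERMINALS = new HashSet<>(Arrays.asList(
        "Esq", "Jr", "Sr", "M.D", "Ph.D"));

    // Assumption: matching is exact and case-sensitive.
    static boolean isTimeZone(String w) { return TIME_ZONES.contains(w); }
    static boolean isTerminal(String w) { return TERMINALS.contains(w); }

    public static void main(String[] args) {
        System.out.println(isTimeZone("EST"));   // prints true
        System.out.println(isTerminal("Ph.D"));  // prints true
        System.out.println(isTerminal("Ph.D.")); // prints false
    }
}
```

The listed terminal abbreviations carry no trailing period, so this sketch rejects "Ph.D."; whether the real method strips a trailing period before comparing is not stated here.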

isHonorific

protected boolean isHonorific(Word w)
Determines whether the argument is exactly equal to any of the honorifics listed below.

Parameters:
w - The word in question.
Returns:
true if and only if the argument is exactly equal to any of the honorifics listed above.