I'm new to Lucene and I wish to remove stopwords from sentences in a large text file. Every sentence is stored on a separate line in the text file. The code I have currently is:
Tokenizer tokenizer = new StandardTokenizer(Version.LUCENE_41, new StringReader("if everyone got spam from me im extremely sorry"));
final StandardFilter standardFilter = new StandardFilter(Version.LUCENE_41, tokenizer);
final StopFilter stopFilter = new StopFilter(Version.LUCENE_41, standardFilter, sa.getStopwordSet());
final CharTermAttribute charTermAttribute = tokenizer.addAttribute(CharTermAttribute.class);
try {
    stopFilter.reset();
    while (stopFilter.incrementToken()) {
        final String token = charTermAttribute.toString();
        System.out.printf("%s ", token);
    }
    stopFilter.end();
    stopFilter.close();
} catch (IOException ex) {
    ex.printStackTrace();   // don't swallow the exception silently
}
However, as you can see, the StringReader only wraps one predefined sentence. How can I change this so the program reads in every sentence from my text file?
Thanks in advance!
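For what it's worth, here is one way this could be sketched (plain Java I/O only; the file name `sentences.txt` and the comment marking where the analysis goes are placeholders): read the file line by line, and build the StringReader/tokenizer/StopFilter chain freshly for each sentence.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class SentenceFileReader {

    // Reads every non-blank line (one sentence per line) from the given file.
    static List<String> readSentences(Path file) throws IOException {
        List<String> sentences = new ArrayList<>();
        for (String line : Files.readAllLines(file)) {
            if (!line.trim().isEmpty()) {   // skip blank lines
                sentences.add(line);
            }
        }
        return sentences;
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get("sentences.txt");   // hypothetical input file
        if (Files.exists(file)) {
            for (String sentence : readSentences(file)) {
                // Build the tokenizer/StopFilter chain here, with
                // new StringReader(sentence) instead of the hard-coded string.
                System.out.println(sentence);
            }
        }
    }
}
```

For a very large file you could switch `readAllLines` for a `BufferedReader` loop so the whole file never sits in memory at once.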
OK, here is my question. I am in an introductory Java course, so I cannot use any advanced code. I need to read in a large text file and store each paragraph as an element in an ArrayList. So I need to read in the file and split on the carriage return. What I have so far is posted below. Thanks in advance.
public static void fileReader(String x)throws FileNotFoundException{
String fileName = (x + ".txt");
File input= new File(fileName);
Scanner in =new Scanner(input);
ArrayList<String> linesInFile = new ArrayList<>();
while (in.hasNextLine()){
if ( != '/n'){ //this is where i'm losing it
String line = in.nextLine();
linesInFile.add(line);
}
}
in.close();
}
If each paragraph in the text file sits on a single line (there are no line breaks within a paragraph), then you don't have to check for "\n" at all.
while (in.hasNextLine()){
String line = in.nextLine();
linesInFile.add(line);
}
This would suffice.
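If the paragraphs can instead span several lines and are separated by blank lines, a variant along these lines (a sketch using only beginner-level constructs; the class and method names are made up) accumulates lines until it hits a blank one:

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Scanner;

public class ParagraphReader {

    // Collects paragraphs separated by blank lines into an ArrayList.
    public static ArrayList<String> readParagraphs(File input) throws FileNotFoundException {
        ArrayList<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Scanner in = new Scanner(input);
        while (in.hasNextLine()) {
            String line = in.nextLine();
            if (line.trim().isEmpty()) {        // a blank line ends the paragraph
                if (current.length() > 0) {
                    paragraphs.add(current.toString());
                    current.setLength(0);
                }
            } else {
                if (current.length() > 0) {
                    current.append(' ');        // rejoin wrapped lines with a space
                }
                current.append(line);
            }
        }
        if (current.length() > 0) {             // the last paragraph may lack a trailing blank line
            paragraphs.add(current.toString());
        }
        in.close();
        return paragraphs;
    }

    public static void main(String[] args) throws FileNotFoundException {
        if (args.length > 0) {
            for (String p : readParagraphs(new File(args[0]))) {
                System.out.println(p);
                System.out.println();
            }
        }
    }
}
```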
I am puzzled by the strange behavior of ShingleFilter in Lucene 4.6. What I would like to do is extract all possible bigrams from a sentence. So if the sentence is "this is a dog", I want "this is", "is a", "a dog".
What I see instead is:
"this this is"
"this is is"
"is is a"
"is a a"
"a a dog"
"a dog dog"
So somehow it replicates words and makes 3-grams instead of bigrams.
Here is my Java code:
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
ShingleAnalyzerWrapper shingleAnalyzer = new ShingleAnalyzerWrapper(analyzer,2,2);
String theSentence = "this is a dog";
StringReader reader = new StringReader(theSentence);
TokenStream tokenStream = shingleAnalyzer.tokenStream("content", reader);
ShingleFilter theFilter = new ShingleFilter(tokenStream);
theFilter.setOutputUnigrams(false);
CharTermAttribute charTermAttribute = theFilter.addAttribute(CharTermAttribute.class);
theFilter.reset();
while (theFilter.incrementToken()) {
System.out.println(charTermAttribute.toString());
}
theFilter.end();
theFilter.close();
Could someone please help me modify it so it actually outputs bigrams and does not replicate words? Thank you!
Natalia
I was able to produce correct results using this code:
String theSentence = "this is a dog";
StringReader reader = new StringReader(theSentence);
StandardTokenizer source = new StandardTokenizer(Version.LUCENE_46, reader);
TokenStream tokenStream = new StandardFilter(Version.LUCENE_46, source);
ShingleFilter sf = new ShingleFilter(tokenStream);
sf.setOutputUnigrams(false);
CharTermAttribute charTermAttribute = sf.addAttribute(CharTermAttribute.class);
sf.reset();
while (sf.incrementToken()) {
System.out.println(charTermAttribute.toString());
}
sf.end();
sf.close();
and I get the expected 3 bigrams: "this is", "is a", "a dog".
The problem is that you are applying two ShingleFilters. ShingleAnalyzerWrapper tacks one onto the StandardAnalyzer, and then you add another one explicitly. Since the ShingleAnalyzerWrapper uses its default behavior of outputting unigrams, you end up with the following tokens coming off that first ShingleFilter:
this
this is
is
is a
a
a dog
dog
so when the second filter comes along (this time without unigrams), it simply combines consecutive tokens among those, leading to the result you are seeing.
So either eliminate the ShingleAnalyzerWrapper or drop the ShingleFilter you add later. For instance, this should work:
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
String theSentence = "this is a dog";
StringReader reader = new StringReader(theSentence);
TokenStream tokenStream = analyzer.tokenStream("content", reader);
ShingleFilter theFilter = new ShingleFilter(tokenStream);
theFilter.setOutputUnigrams(false);
CharTermAttribute charTermAttribute = theFilter.addAttribute(CharTermAttribute.class);
theFilter.reset();
while (theFilter.incrementToken()) {
System.out.println(charTermAttribute.toString());
}
theFilter.end();
theFilter.close();
I am trying to load two files at the same time, but I also need to work through the first file (gps1) line by line. Depending on the sentence type (which I will explain below), I want to do different things with each line and then move on to the next one.
Basically, each line of gps1 falls into one of a few categories, all starting with $GP followed by characters identifying the sentence type. Some of these types carry a timestamp which I need to collect, and some do not.
File gps1File = new File(gpsFile1);
File gps2File = new File(gpsFile2);
FileReader filegps1 = new FileReader(gpsFile1);
FileReader filegps2 = new FileReader(gpsFile2);
BufferedReader buffer1 = new BufferedReader(filegps1);
BufferedReader buffer2 = new BufferedReader(filegps2);
String gps1;
String gps2;
while ((gps1 = buffer1.readLine()) != null) {
The gps1 data file looks like this:
$GPGSA,A,3,28,09,26,15,08,05,21,24,07,,,,1.6,1.0,1.3*3A
$GPRMC,151018.000,A,5225.9627,N,00401.1624,W,0.11,104.71,210214,,*14
$GPGGA,151019.000,5225.9627,N,00401.1624,W,1,09,1.0,38.9,M,51.1,M,,0000*72
$GPGSA,A,3,28,09,26,15,08,05,21,24,07,,,,1.6,1.0,1.3*3A
Thanks
I don't really understand the problem you are facing, but if you want to split up a line's contents you can use a StringTokenizer:
StringTokenizer st = new StringTokenizer(gps1, ",");
and then access the fields one by one:
while (st.hasMoreTokens()) {
    String s = st.nextToken();
}
EDIT:
NB: the first token will be your "$GPXXX" attribute
I am new to OpenNLP. I am using OpenNLP to find location names in a sentence. My input string is "Italy pardons US colonel in CIA case", but I cannot find "Italy" in the result set. How can I solve this problem? Thanks in advance!
try {
InputStream modelIn = new FileInputStream("en-token.bin");
TokenizerModel tokenModel = new TokenizerModel(modelIn);
modelIn.close();
Tokenizer tokenizer = new TokenizerME(tokenModel);
NameFinderME nameFinder =
new NameFinderME(
new TokenNameFinderModel(new FileInputStream("en-ner-location.bin")));
String documentStr = "Italy pardons US colonel in CIA case";
String tokens[] = tokenizer.tokenize(documentStr);
Span nameSpans[] = nameFinder.find(tokens);
for( int i = 0; i<nameSpans.length; i++) {
System.out.println("Span: "+nameSpans[i].toString());
}
}
catch(Exception e) {
System.out.println(e.toString());
}
OpenNLP's results depend on the data its model was trained on. The en-ner-location.bin file at SourceForge may not contain samples that make sense for your data. Also, extracting nouns or noun phrases (NNP) with a chunker or POS tagger will not be limited to locations only. So the answer to your question is: the model doesn't account for every case in your data, and this is why you don't get a hit on this particular sentence. BTW, NER is never perfect and will always return some degree of false positives and false negatives.
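Also worth checking: a Span's toString() prints token indices, not the matched text, so a hit can be easy to overlook. OpenNLP has Span.spansToStrings(spans, tokens) for recovering the words; here is a plain-Java sketch of the same idea (class and method names are mine, not OpenNLP's):

```java
import java.util.StringJoiner;

public class SpanText {

    // Joins the tokens covered by a [begin, end) token span back into text,
    // mirroring what opennlp.tools.util.Span.spansToStrings does per span.
    static String spanToText(String[] tokens, int begin, int end) {
        StringJoiner joiner = new StringJoiner(" ");
        for (int i = begin; i < end; i++) {
            joiner.add(tokens[i]);
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"Italy", "pardons", "US", "colonel", "in", "CIA", "case"};
        // A location span covering only token 0, as a NameFinderME hit would report it.
        System.out.println(spanToText(tokens, 0, 1));   // prints Italy
    }
}
```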
I'm developing a system where you feed text files to a StandardAnalyzer, and the contents of each file are then replaced by the output of the StandardAnalyzer (which tokenizes them and removes all the stop words). The code I've developed so far is:
File f = new File(path);
TokenStream stream = analyzer.tokenStream("contents",
new StringReader(readFileToString(f)));
CharTermAttribute charTermAttribute = stream.addAttribute(CharTermAttribute.class);
stream.reset();   // required before incrementToken() in Lucene 4.x
while (stream.incrementToken()) {
    String term = charTermAttribute.toString();
    System.out.print(term);
}
stream.end();
stream.close();
//Following is the readFileToString(File f) function
StringBuilder textBuilder = new StringBuilder();
String ls = System.getProperty("line.separator");
Scanner scanner = new Scanner(new FileInputStream(f));
while (scanner.hasNextLine()){
textBuilder.append(scanner.nextLine() + ls);
}
scanner.close();
return textBuilder.toString();
The readFileToString(f) above is a simple function that converts the file contents to a string.
The output I'm getting is the words with the spaces and newlines between them removed. Is there a way to preserve the original spaces and newline characters in the analyzer output, so that I can replace the original file contents with the filtered contents from the StandardAnalyzer and present them in readable form?
Tokenizers save the term position, so in theory you could look at the positions to estimate how much space there was between each pair of tokens, but they don't save the actual text that sat between the tokens. So you could get back spaces, but not newlines.
If you're comfortable with JFlex, you could modify the tokenizer to treat newlines as tokens. That's probably more work than any gain you'd get from it, though.
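That said, StandardTokenizer does populate an OffsetAttribute with each token's character offsets into the original text, so if you keep the original string around you can copy the gaps (spaces and newlines) between surviving tokens verbatim. Here is a plain-Java sketch of that idea, with a regex tokenizer and a hard-coded stop word set standing in for the analyzer:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LayoutPreservingFilter {

    // A tiny stand-in for the analyzer's stop word set.
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "the", "is", "of"));

    // Removes stop words but keeps the original whitespace (spaces, newlines)
    // by copying the gaps between tokens verbatim from the source text.
    static String removeStopWords(String text) {
        StringBuilder out = new StringBuilder();
        Matcher m = Pattern.compile("\\S+").matcher(text);
        int lastEnd = 0;
        while (m.find()) {
            out.append(text, lastEnd, m.start());   // the whitespace gap, verbatim
            if (!STOP_WORDS.contains(m.group().toLowerCase())) {
                out.append(m.group());              // keep non-stop-word tokens
            }
            lastEnd = m.end();
        }
        out.append(text, lastEnd, text.length());   // trailing whitespace, if any
        return out.toString();
    }

    public static void main(String[] args) {
        // Newlines survive; removed words leave their surrounding whitespace behind.
        System.out.println(removeStopWords("the dog\nis big"));
    }
}
```

With Lucene you would read the start/end offsets from the stream's OffsetAttribute instead of a Matcher, but the gap-copying logic is the same.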