Display full line with CharTermAttribute in Lucene - java

I have some e-books in .txt format whose first page usually lists the author, the title and so on, in the form (author: blah blah, title: blah blah, etc.).
I want Lucene to locate this information automatically and display it on screen together with the offsets of the matched line. More specifically, I want to display:
BOOK1, (1)title: blah blah(15) offset 1->15 , (16)author: blah blah(32) offset 16->32 , release date: 14/12/1923 32->50
In the code below, I pass the file I want to search to displayTokensWithFullDetails, but it only reports the matched token itself (for example the word author) and that token's offsets,
whereas I want it to show the whole title line with its overall offsets, that is, from the first character of the line to the last. Is there any way to display the whole line and the offsets corresponding to that line?
I would describe it as a command that says "Find a token in the text that matches title/author/etc. and display the whole line with its offsets."
Or perhaps there is a completely different way to do this than through the attributes of StandardAnalyzer.
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
public class Indexer
{
public static void main(String[] args) throws Exception {
File f= new File("C:/test1.txt");
displayTokensWithFullDetails(f);
}
public static void displayTokensWithFullDetails(File F) throws IOException {
TokenStream stream = new StandardAnalyzer().tokenStream("contents",new FileReader(F));
CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
PositionIncrementAttribute posIncr =
stream.addAttribute(PositionIncrementAttribute.class);
OffsetAttribute offset =
stream.addAttribute(OffsetAttribute.class);
int position = 0;
stream.reset();
while(stream.incrementToken())
{
String term1 = term.toString();
if (term1.equals("title"))
{
System.out.print("" +term1 + ":" +offset.startOffset() + "->" +offset.endOffset());
}
if (term1.equals("author"))
{
System.out.print("" +term1 + ":" +offset.startOffset() + "->" +offset.endOffset());
}
if (term1.equals("release"))
{
System.out.print("" +term1 + ":" +offset.startOffset() + "->" +offset.endOffset());
}
}
stream.close();
System.out.println();
}
}
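One possible direction, sketched here without Lucene at all: since the goal is a whole line plus its start and end offsets, the text can be scanned line by line while keeping a running character offset. The class name LineOffsetFinder, the method findLabelledLines and the label list are hypothetical names chosen for this illustration, and the sketch assumes each field sits at the start of its own line. Offsets are character offsets into the whole text, with an exclusive end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class LineOffsetFinder {

    // Labels taken from the example in the question; adjust as needed.
    static final String[] LABELS = {"title", "author", "release"};

    /**
     * Scans the text line by line and returns one entry per line that starts
     * with a known label, formatted as "<line> <start>-><end>".
     * Offsets are char offsets into the whole text; the end offset is exclusive.
     */
    static List<String> findLabelledLines(String text) {
        List<String> hits = new ArrayList<>();
        int lineStart = 0;
        while (lineStart < text.length()) {
            int newline = text.indexOf('\n', lineStart);
            int lineEnd = (newline < 0) ? text.length() : newline;
            // Exclude a trailing '\r' so Windows line endings do not skew the end offset.
            int contentEnd = lineEnd;
            if (contentEnd > lineStart && text.charAt(contentEnd - 1) == '\r') {
                contentEnd--;
            }
            String line = text.substring(lineStart, contentEnd);
            String lower = line.toLowerCase(Locale.ROOT);
            for (String label : LABELS) {
                if (lower.startsWith(label)) {
                    hits.add(line + " " + lineStart + "->" + contentEnd);
                    break;
                }
            }
            lineStart = lineEnd + 1; // continue after the newline
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "BOOK1\ntitle: blah blah\nauthor: blah blah\nrelease date: 14/12/1923\n";
        for (String hit : findLabelledLines(text)) {
            System.out.println(hit);
        }
    }
}
```

For a real e-book, the text would first have to be loaded into a String (for example from the file's bytes with the right charset). Note that the offsets count characters, not bytes, which matters for non-ASCII text.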


Java - extract from line based on regex

A small question about a Java job that extracts information from the lines of a file.
Setup: I have a file in which one line looks like this:
bla,bla42bla()bla=bla+blablaprimaryKey="(ZAPDBHV7120D41A,USA,blablablablablabla
The file contains many such lines (as described above).
In each line there are two pieces of information I am interested in: the primaryKey and the country.
In my example, ZAPDBHV7120D41A and USA.
Each line of the file contains the primaryKey exactly once and the country exactly once, separated by a comma. The pair can appear anywhere in the line (at the start, in the middle, at the end, etc.).
The primary key is a combination of capital letters [A, B, C, ... Y, Z] and digits [0, 1, 2, ... 9]. It has no predefined length.
The primary key always appears in the form primaryKey="({primaryKey},{country},
That is, the actual primaryKey comes right after the string primaryKey, equals sign, double quote, opening parenthesis, and right before a comma, the three-letter country, and another comma.
I would like to write a program that extracts all the primary keys and all the countries from the file.
Input:
bla,bla42bla()bla=bla+blablaprimaryKey="(ZAPDBHV7120D41A,USA,blablablablablabla
bla++blabla()bla=bla+blablaprimaryKey="(AA45555DBMW711DD4100,ARG,bla
[...]
Result:
The primaryKey is ZAPDBHV7120D41A
The country is USA
The primaryKey is AA45555DBMW711DD4100
The country is ARG
Therefore, I tried following:
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Pattern;
public class RegexExtract {
public static void main(String[] args) throws Exception {
final String csvFile = "my_file.txt";
try (final BufferedReader br = new BufferedReader(new FileReader(csvFile))) {
String line;
while ((line = br.readLine()) != null) {
Pattern.matches("", line); // extract primaryKey and country based on regex
String primaryKey = ""; // extract the primary from above
String country = ""; // extract the country from above
System.out.println("The primaryKey is " + primaryKey);
System.out.println("The country is " + country);
}
}
}
}
But I am having a hard time constructing the regular expression needed to match and extract.
May I ask what is the correct code in order to extract from the line based on above information?
Thank you
Explanations after the code.
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexExtract {
public static void main(String[] args) {
Path path = Paths.get("my_file.txt");
try (BufferedReader br = Files.newBufferedReader(path)) {
Pattern pattern = Pattern.compile("primaryKey=\"\\(([A-Z0-9]+),([A-Z]+)");
String line = br.readLine();
while (line != null) {
Matcher matcher = pattern.matcher(line);
if (matcher.find()) {
String primaryKey = matcher.group(1);
String country = matcher.group(2);
System.out.println("The primaryKey is " + primaryKey);
System.out.println("The country is " + country);
}
line = br.readLine();
}
}
catch (IOException xIo) {
xIo.printStackTrace();
}
}
}
Running the above code produces the following output (using the two sample lines in your question).
The primaryKey is ZAPDBHV7120D41A
The country is USA
The primaryKey is AA45555DBMW711DD4100
The country is ARG
The regular expression looks for the following [literal] string
primaryKey="(
The double quote is escaped since it is within a string literal.
The opening parenthesis is escaped because it is a metacharacter and the double backslash is required since Java does not recognize \( in a string literal.
Then the regular expression groups together the string of consecutive capital letters and digits that follow the previous literal up to (but not including) the comma.
Then there is a second group of capital letters up to the next comma.
Refer to the Regular Expressions lesson in Oracle's Java tutorials.
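As a small variation on the explanation above, here is a sketch that tightens the pattern to what the question states: the country is exactly three capital letters and is followed by a comma. It also uses named groups instead of numbered ones. The class name PrimaryKeyPattern, the method extract and the group names key and country are hypothetical, chosen only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrimaryKeyPattern {

    // primaryKey="(  then the key, a comma, a three-letter country, and a comma.
    // The {3} and the trailing comma are assumptions based on the question's description.
    static final Pattern PATTERN = Pattern.compile(
            "primaryKey=\"\\((?<key>[A-Z0-9]+),(?<country>[A-Z]{3}),");

    /** Returns {key, country}, or null when the line does not contain the pattern. */
    static String[] extract(String line) {
        Matcher m = PATTERN.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] {m.group("key"), m.group("country")};
    }

    public static void main(String[] args) {
        String line = "bla,bla42bla()bla=bla+blablaprimaryKey=\"(ZAPDBHV7120D41A,USA,blablablablablabla";
        String[] result = extract(line);
        System.out.println("The primaryKey is " + result[0]);
        System.out.println("The country is " + result[1]);
    }
}
```

Named groups make the extraction code a little easier to read if the pattern grows later, at the cost of a slightly longer regular expression.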

BreakIterator doesn't find correct sentence boundary with parenthesized "i.e." or "e.g."

In the example below, BreakIterator appears to be failing on a fairly straightforward example.
Am I using BreakIterator incorrectly, or is this just a bug?
Example class:
import java.text.BreakIterator;
import java.util.Locale;
public class BreakIteratorTest {
public static void main(String[] args) throws Exception {
String text = "Due to a problem (e.g., software bug), the server is down.";
BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
bi.setText(text);
int r = bi.preceding(30);
System.out.println("bi.preceding(30) returned " + r);
String sentence = r == BreakIterator.DONE ? text : text.substring(0, r);
System.out.println("first sentence: \"" + sentence + "\"");
}
}
Output:
$ javac BreakIteratorTest.java
$ java BreakIteratorTest
bi.preceding(30) returned 21
first sentence: "Due to a problem (e.g"
It seems like bi.preceding(30) should have returned BreakIterator.DONE instead.
JDK version 1.8.0.
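One way to narrow this down is to list every boundary the iterator reports instead of probing a single offset. The sketch below does that with first() and next(); the class name SentenceBoundaries and the method boundaries are hypothetical. No expected output is shown because the boundaries depend on the JDK's break rules. As far as I can tell from the Javadoc, offset 0 is itself a boundary, so for a text with no internal break preceding(30) would be expected to return 0 rather than BreakIterator.DONE.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceBoundaries {

    /** Collects all sentence boundaries reported for the text, from first() onwards. */
    static List<Integer> boundaries(String text, Locale locale) {
        BreakIterator bi = BreakIterator.getSentenceInstance(locale);
        bi.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int b = bi.first(); b != BreakIterator.DONE; b = bi.next()) {
            result.add(b);
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Due to a problem (e.g., software bug), the server is down.";
        List<Integer> bounds = boundaries(text, Locale.US);
        System.out.println("boundaries: " + bounds);
        // Print each segment between consecutive boundaries.
        for (int i = 0; i + 1 < bounds.size(); i++) {
            System.out.println("[" + text.substring(bounds.get(i), bounds.get(i + 1)) + "]");
        }
    }
}
```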

How to extract triples using Stanford CoreNLP package in java?

I want a code snippet that takes a sentence or a set of sentences as input and extracts the triples (subject, predicate and object) using the Stanford CoreNLP package in Java.
Are you looking for OpenIE triples, or more structured relation triples (e.g., for things like per:city_of_birth)? For the former, the OpenIE system is likely what you're looking for: https://stanfordnlp.github.io/CoreNLP/openie.html. Copying from the example there:
import edu.stanford.nlp.ie.util.RelationTriple;
import edu.stanford.nlp.simple.*;
/**
* A demo illustrating how to call the OpenIE system programmatically.
*/
public class OpenIEDemo {
public static void main(String[] args) throws Exception {
// Create a CoreNLP document
Document doc = new Document("Obama was born in Hawaii. He is our president.");
// Iterate over the sentences in the document
for (Sentence sent : doc.sentences()) {
// Iterate over the triples in the sentence
for (RelationTriple triple : sent.openieTriples()) {
// Print the triple
System.out.println(triple.confidence + "\t" +
triple.subjectLemmaGloss() + "\t" +
triple.relationLemmaGloss() + "\t" +
triple.objectLemmaGloss());
}
}
}
}
Or, using the Annotators API:
import edu.stanford.nlp.ie.util.RelationTriple;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.naturalli.NaturalLogicAnnotations;
import edu.stanford.nlp.util.CoreMap;
import java.util.Collection;
import java.util.Properties;
/**
* A demo illustrating how to call the OpenIE system programmatically.
*/
public class OpenIEDemo {
public static void main(String[] args) throws Exception {
// Create the Stanford CoreNLP pipeline
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,depparse,natlog,openie");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// Annotate an example document.
Annotation doc = new Annotation("Obama was born in Hawaii. He is our president.");
pipeline.annotate(doc);
// Loop over sentences in the document
for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
// Get the OpenIE triples for the sentence
Collection<RelationTriple> triples = sentence.get(NaturalLogicAnnotations.RelationTriplesAnnotation.class);
// Print the triples
for (RelationTriple triple : triples) {
System.out.println(triple.confidence + "\t" +
triple.subjectLemmaGloss() + "\t" +
triple.relationLemmaGloss() + "\t" +
triple.objectLemmaGloss());
}
}
}
}

Read rgb values stored in a CSV file separated by a comma delimiter

I am using a FileReader to read a CSV file. The second column of the file holds an rgb value such as rgb(255,255,255), but the columns themselves are also separated by commas. If I split on the comma delimiter, I get fragments like "rgb(255", so how do I read the whole rgb value? The code is pasted below. Thanks!
FileReader reader = new FileReader(todoTaskFile);
BufferedReader in = new BufferedReader(reader);
int columnIndex = 1;
String line;
while ((line = in.readLine()) != null) {
if (line.trim().length() != 0) {
String[] dataFields = line.split(",");
//System.out.println(dataFields[0]+dataFields[1]);
if (!taskCount.containsKey(dataFields[columnIndex])) {
taskCount.put(dataFields[columnIndex], 1);
} else {
int oldCount = taskCount.get(dataFields[columnIndex]);
taskCount.put(dataFields[columnIndex],oldCount + 1);
}
}
}
I would strongly advise against using custom methods to parse CSV input. There are dedicated libraries that do it for you.
@Ashraful Islam posted a good way to parse the value from a "cell" (I reused it), but the raw value of that "cell" should be obtained in a different way. This sketch shows how to do it using the Apache Commons CSV library.
package csvparsing;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class GetRGBFromCSV {
public static void main(String[] args) throws IOException {
Reader in = new FileReader(GetRGBFromCSV.class.getClassLoader().getResource("sample.csv").getFile());
Iterable<CSVRecord> records = CSVFormat.DEFAULT.withFirstRecordAsHeader().parse(in); // remove ".withFirstRecordAsHeader()" if the CSV file has no header row
for (CSVRecord record : records) {
String color = record.get("Color"); // use ".get(1)" to get value from second column if there's no header in csv file
System.out.println(color);
Pattern RGB_PATTERN = Pattern.compile("rgb\\((\\d{1,3}),(\\d{1,3}),(\\d{1,3})\\)", Pattern.CASE_INSENSITIVE);
Matcher m = RGB_PATTERN.matcher(color);
if (m.find()) {
Integer red = Integer.parseInt(m.group(1));
Integer green = Integer.parseInt(m.group(2));
Integer blue = Integer.parseInt(m.group(3));
System.out.println(red + " " + green + " " + blue);
}
}
}
}
Here is a valid CSV input that would probably make regex-based solutions behave unexpectedly:
Name,Color
"something","rgb(100,200,10)"
"something else","rgb(10,20,30)"
"not the value rgb(1,2,3) you are interested in","rgb(10,20,30)"
There are many cases you might forget to handle when you write a custom parser: quoted and unquoted strings, delimiters within quotes, escaped quotes within quotes, different delimiters (, or ;), multiple columns, etc. A third-party CSV parser takes care of those things for you. You shouldn't reinvent the wheel.
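To make the pitfall concrete, here is a small sketch that applies the two naive approaches to the last sample row above. The class name NaiveCsvPitfall and its two helper methods are hypothetical, written only for this demonstration; it uses nothing beyond the standard library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaiveCsvPitfall {

    static final Pattern RGB =
            Pattern.compile("rgb\\((\\d{1,3}),(\\d{1,3}),(\\d{1,3})\\)");

    /** Number of pieces produced by a plain split on commas. */
    static int naiveFieldCount(String line) {
        return line.split(",").length;
    }

    /** First rgb(...) occurrence found anywhere in the line, or null. */
    static String firstRgb(String line) {
        Matcher m = RGB.matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String row = "\"not the value rgb(1,2,3) you are interested in\",\"rgb(10,20,30)\"";
        // The row has 2 columns, but the commas inside the values inflate the count.
        System.out.println("pieces after split: " + naiveFieldCount(row));
        // The regex picks up the rgb(...) inside the first column, not the Color column.
        System.out.println("first rgb match: " + firstRgb(row));
    }
}
```

For this row the plain split yields six pieces instead of two columns, and the regex finds rgb(1,2,3) from the first column rather than the value in the Color column.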
line = "rgb(25,255,255)";
line = line.replace(")", "");
line = line.replace("rgb(", "");
String[] vals = line.split(",");
Parse the values in vals with Integer.parseInt and then you can use them.
Here is how you can do this:
Pattern RGB_PATTERN = Pattern.compile("rgb\\((\\d{1,3}),(\\d{1,3}),(\\d{1,3})\\)");
String line = "rgb(25,255,255)";
Matcher m = RGB_PATTERN.matcher(line);
if (m.find()) {
System.out.println(m.group(1));
System.out.println(m.group(2));
System.out.println(m.group(3));
}
Here:
\\d{1,3} => matches a run of 1 to 3 digits
(\\d{1,3}) => matches a run of 1 to 3 digits and captures it as a group
Since ( and ) are metacharacters, we have to escape them to match them literally.
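If pulling in a CSV library is not an option and the file is known to be simple (no quoted fields, no nested parentheses), one alternative is to split only on commas that are not inside parentheses. This is only a sketch under those assumptions; the class name RgbColumnCounter and its methods are hypothetical, and the counting part mirrors the taskCount map from the question using Map.merge.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RgbColumnCounter {

    // Split on a comma unless it is followed by a ')' with no '(' in between,
    // i.e. unless the comma sits inside a pair of parentheses.
    static String[] splitOutsideParens(String line) {
        return line.split(",(?![^()]*\\))");
    }

    /** Counts how often each value appears in the given column. */
    static Map<String, Integer> countColumn(List<String> lines, int columnIndex) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines, as in the question's code
            }
            String[] fields = splitOutsideParens(line);
            if (fields.length > columnIndex) {
                counts.merge(fields[columnIndex], 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "task1,rgb(255,255,255),done",
                "task2,rgb(0,0,0),open",
                "task3,rgb(255,255,255),open");
        System.out.println(countColumn(lines, 1));
    }
}
```

The lookahead keeps rgb(255,255,255) in one piece, but it will still misbehave on quoted fields, which is where a real CSV parser remains the safer choice.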

Read string after " " and before "(" using split in java

I have a txt file with lines like these:
1st line - 20-01-01 Abs Def est (xabcd)
2nd line - 290-01-01 Abs Def est ghj gfhj (xabcd fgjh fgjh)
3rd line - 20-1-1 Absfghfgjhgj (xabcd ghj 5676gyj)
I want to keep 3 different String arrays:
[0]20-01-01 [1]290-01-01 [2] 20-1-1
[0]Abs Def est [1]Abs Def est ghj gfhj [2] Absfghfgjhgj
[0]xabcd [1]xabcd fgjh fgjh [2] xabcd ghj 5676gyj
Using String[] array1 = myLine.split(" ") I only get the piece 20-01-01, but I also want to keep the other 2 Strings.
EDIT: I want to do this using regular expressions (the text file is large).
This is my piece of code:
Please help; I have searched but have not found anything.
Thx.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.Comparator;
import java.util.Date;
import java.util.Set;
import java.util.TreeSet;
public class Holiday implements Comparable<Date>{
Date date;
String name;
public Holiday(Date date, String name){
this.date=date;
this.name=name;
}
public static void main(String[] args) throws IOException {
FileInputStream fis = new FileInputStream(new File("c:/holidays.txt"));
InputStreamReader isr = new InputStreamReader(fis, "windows-1251");
BufferedReader br = new BufferedReader(isr);
TreeSet<Holiday> tr=new TreeSet<>();
System.out.println(br.readLine());
String myLine = null;
while ( (myLine = br.readLine()) != null)
{
String[] array1 = myLine.split(" "); //OR use this
//String array1 = myLine.split(" ")[0];//befor " " read 1-st string
//String array2 = myLine.split("")[1];
//Holiday h=new Holiday(array1, name)
//String array1 = myLine.split(" ");
// check to make sure you have valid data
// String[] array2 = array1[1].split(" ");
System.out.println(array1[0]);
}
}
@Override
public int compareTo(Date o) {
// TODO Auto-generated method stub
return 0;
}
}
Pattern p = Pattern.compile("(.*?) (.*?) (\\(.*\\))");
Matcher m = p.matcher("20-01-01 Abs Def est (abcd)");
if (!m.matches()) throw new Exception("Invalid string");
String s1 = m.group(1); // 20-01-01
String s2 = m.group(2); // Abs Def est
String s3 = m.group(3); // (abcd)
Use a StringTokenizer, which splits on whitespace (space, tab, newline, carriage return, form feed) by default.
You seem to be splitting on whitespace. Each element of the string array then contains one of the whitespace-separated substrings, which you can piece back together later via string concatenation.
For instance,
array1[0] would be 20-01-01
array1[1] would be Abs
array1[2] would be Def
so on and so forth.
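The "split, then piece back together" idea can be sketched roughly as follows. It assumes the first token is always the date and that the bracketed part starts at the first token beginning with "(". The class name SplitAndJoin and the method parse are hypothetical names for this illustration.

```java
class SplitAndJoin {

    /** Returns {date, name, text inside the brackets} for one input line. */
    static String[] parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        String date = tokens[0];
        StringBuilder name = new StringBuilder();
        StringBuilder bracketed = new StringBuilder();
        boolean inBrackets = false;
        for (int i = 1; i < tokens.length; i++) {
            if (tokens[i].startsWith("(")) {
                inBrackets = true; // everything from here on belongs to the bracketed part
            }
            StringBuilder target = inBrackets ? bracketed : name;
            if (target.length() > 0) {
                target.append(' ');
            }
            target.append(tokens[i]);
        }
        // Strip the surrounding brackets, if present.
        String inner = bracketed.toString();
        if (inner.startsWith("(")) {
            inner = inner.substring(1);
        }
        if (inner.endsWith(")")) {
            inner = inner.substring(0, inner.length() - 1);
        }
        return new String[] {date, name.toString(), inner};
    }

    public static void main(String[] args) {
        String[] parts = parse("290-01-01 Abs Def est ghj gfhj (xabcd fgjh fgjh)");
        System.out.println(parts[0]);
        System.out.println(parts[1]);
        System.out.println(parts[2]);
    }
}
```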
Another option is to use Java regular expressions, but that may only be useful if your input text file has consistent formatting and there are a lot of lines to process. They are very powerful, but require some experience.
Match the required text with a regular expression.
The regexp below requires exactly 3 words in the middle and 1 word in the brackets. Note that the sample string has four words in the middle (the extra hhh), so as written it does not match and nothing is printed; remove hhh to see the three groups.
String txt = "20-01-01 Abs Def est hhh (abcd)";
Pattern p = Pattern.compile("(\\d\\d-\\d\\d-\\d\\d) (\\w+ \\w+ \\w+) ([(](\\w)+[)])");
Matcher matcher = p.matcher(txt);
if (matcher.find()) {
String s1 = matcher.group(1);
String s2 = matcher.group(2);
String s3 = matcher.group(3);
System.out.println(s1);
System.out.println(s2);
System.out.println(s3);
}
However if you need more flexibility you may want to use code provided by Lence Java.
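For the three arrays asked for in the question, a more permissive pattern could be applied to every line and the groups collected as they are found. This is a sketch, not a definitive solution: it assumes the date is the first whitespace-free token and that the line ends with one bracketed part. The class name HolidayLineParser and the method parseAll are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HolidayLineParser {

    // group 1: date token, group 2: name (lazy), group 3: text inside the brackets
    static final Pattern LINE =
            Pattern.compile("^(\\S+)\\s+(.*?)\\s*\\((.*)\\)\\s*$");

    /** Returns {dates, names, bracketed texts}; lines that do not match are skipped. */
    static String[][] parseAll(List<String> lines) {
        List<String> dates = new ArrayList<>();
        List<String> names = new ArrayList<>();
        List<String> notes = new ArrayList<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (m.matches()) {
                dates.add(m.group(1));
                names.add(m.group(2));
                notes.add(m.group(3));
            }
        }
        return new String[][] {
            dates.toArray(new String[0]),
            names.toArray(new String[0]),
            notes.toArray(new String[0])
        };
    }

    public static void main(String[] args) {
        String[][] result = parseAll(Arrays.asList(
                "20-01-01 Abs Def est (xabcd)",
                "290-01-01 Abs Def est ghj gfhj (xabcd fgjh fgjh)",
                "20-1-1 Absfghfgjhgj (xabcd ghj 5676gyj)"));
        for (String[] column : result) {
            System.out.println(Arrays.toString(column));
        }
    }
}
```

In the question's program the lines would come from the BufferedReader loop instead of a fixed list; compiling the pattern once, outside the loop, keeps this reasonably cheap for a large file.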
