Getting original text after using Stanford NLP parser - Java

Hello people of the internet,
We're having the following problem with the Stanford NLP API:
We have a String that we want to split into a list of sentences.
First we used String sentenceString = Sentence.listToString(sentence); but listToString does not return the original text because of the tokenization. Now we are trying to use listToOriginalTextString in the following way:
private static List<String> getSentences(String text) {
    Reader reader = new StringReader(text);
    DocumentPreprocessor dp = new DocumentPreprocessor(reader);
    List<String> sentenceList = new ArrayList<String>();
    for (List<HasWord> sentence : dp) {
        String sentenceString = Sentence.listToOriginalTextString(sentence);
        sentenceList.add(sentenceString);
    }
    return sentenceList;
}
This does not work. Apparently we have to set a tokenizer option "invertible" to true, but we don't know how. How can we do this?
In general, how do you use listToOriginalTextString properly? What preparations do you need?
sincerely,
Khayet

If I understand correctly, you want to get the mapping of tokens to the original input text after tokenization. You can do it like this:
// split via PTBTokenizer (PTBLexer)
List<CoreLabel> tokens = PTBTokenizer.coreLabelFactory()
        .getTokenizer(new StringReader(text)).tokenize();
// do the sentence splitting using the Stanford sentence splitter (WordToSentenceProcessor)
WordToSentenceProcessor<CoreLabel> processor = new WordToSentenceProcessor<CoreLabel>();
List<List<CoreLabel>> splitSentences = processor.process(tokens);
// for each sentence
for (List<CoreLabel> s : splitSentences) {
    // for each word
    for (CoreLabel token : s) {
        // here you can get the token value and position, e.g.
        // token.value(), token.beginPosition(), token.endPosition()
    }
}
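The reason this works: the CoreLabel tokens produced by coreLabelFactory() record character offsets into the original input (beginPosition()/endPosition()), so the original text of any token or sentence span can be recovered with substring. A library-free sketch of that idea, with hand-made offsets standing in for what the tokenizer would record:

```java
// Sketch (plain Java, no CoreNLP dependency) of how character offsets
// recover the original text, including the original whitespace that
// tokenization would otherwise normalize away.
public class OffsetDemo {
    public static String originalSpan(String text, int begin, int end) {
        return text.substring(begin, end);
    }

    public static void main(String[] args) {
        String text = "Dr. Smith  said: \"Hello!\"";
        // Offsets as an invertible tokenizer would record them:
        int[][] offsets = { {0, 3}, {4, 9}, {11, 16} }; // "Dr.", "Smith", "said:"
        int sentenceBegin = offsets[0][0];
        int sentenceEnd = offsets[offsets.length - 1][1];
        // Recover the span exactly as it appeared in the input
        // (note the double space between "Smith" and "said:" survives):
        System.out.println(originalSpan(text, sentenceBegin, sentenceEnd));
        // -> Dr. Smith  said:
    }
}
```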

String sentenceStr = sentence.get(CoreAnnotations.TextAnnotation.class);
This gives you the original text. An example from the JSONOutputter.java file:
l2.set("id", sentence.get(CoreAnnotations.SentenceIDAnnotation.class));
l2.set("index", sentence.get(CoreAnnotations.SentenceIndexAnnotation.class));
l2.set("sentenceOriginal",sentence.get(CoreAnnotations.TextAnnotation.class));
l2.set("line", sentence.get(CoreAnnotations.LineNumberAnnotation.class));


Java: How to fill placeholders in a text with Map<String,String>?

I am working with code where I want to fill several string placeholders with other strings. This is the example text I have used to test my code:
String myStr = "Media file %s of size %s has been approved";
This is how I fill the placeholders. Since I expect to use several placeholders, I have used a Java Map<>.
Map<String, String> propMap = new HashMap<String,String>();
propMap.put("file name","20mb");
String newNotification = createNotification(propMap);
I used the following method to create the string:
public String createNotification(Map<String, String> properties) {
    String message = "";
    message = String.format(myStr, properties);
    return message;
}
How do I replace the two '%s' with "file name" and "20mb"?
That's not what a Map is intended for.
What you added is an entry "file name" -> "20mb", which basically means the property "file name" has the value "20mb". What you are trying to do with it is maintain a tuple of items.
Note that the format string has a fixed number of placeholders; you want a data structure that contains exactly the same number of items, so essentially an array or a List.
Thus, what you want to have is
public String createNotification(String[] properties) {
    assert properties.length == 2; // you might want to really check this; you will run into problems if it's false
    return String.format("file %s has size %s", (Object[]) properties);
}
If you want to create notifications for all items in a map, you need to do something like this:
Map<String, String> yourMap = // ...
for (Entry<String, String> e : yourMap.entrySet()) {
    System.out.println(createNotification(new String[] { e.getKey(), e.getValue() }));
}
Your approach to String#format is wrong.
It expects a variable number of objects to replace the placeholders as the second argument, not a map. To group them all together, you can use an array or a list.
String format = "Media file %s of size %s has been approved";
Object[] args = {"file name", "20mb"};
String newNotification = String.format(format, args);
You can simply do this formatting using varargs:
String myStr = "Media file %s of size %s has been approved";
String newNotification = createNotification(myStr, "file name", "20mb");
System.out.println(newNotification);
Pass varargs in the createNotification method; here is the code:
public static String createNotification(String myStr, String... strings) {
    // the varargs array is passed straight through as the Object... of String.format
    return String.format(myStr, (Object[]) strings);
}
Note that %s is a perfectly valid format specifier in Java's String.format (not just Python); the actual problem is that String.format expects the replacement values as separate arguments, so passing a single Map leaves the second %s without an argument.
After trying several ways I finally found a good solution. Placeholders must look like this: [placeholder].
public String createNotification(String textTemplate, Map<String, String> replacementValues) {
    Pattern pattern = Pattern.compile("\\[(.+?)\\]");
    Matcher matcher = pattern.matcher(textTemplate);
    StringBuilder builder = new StringBuilder();
    int i = 0;
    while (matcher.find()) {
        String replacement = replacementValues.get(matcher.group(1));
        builder.append(textTemplate.substring(i, matcher.start()));
        if (replacement == null) {
            builder.append(matcher.group(0)); // leave unknown placeholders untouched
        } else {
            builder.append(replacement);
        }
        i = matcher.end();
    }
    builder.append(textTemplate.substring(i));
    return builder.toString();
}
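For what it's worth, the same regex approach can be written more compactly with Matcher.appendReplacement/appendTail. A self-contained sketch (class and method names are illustrative, not from the question):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaceholderDemo {
    // Fill [name]-style placeholders from a map; unknown names are left as-is.
    static String fill(String template, Map<String, String> values) {
        Matcher m = Pattern.compile("\\[(.+?)\\]").matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = values.get(m.group(1));
            // quoteReplacement stops '$' or '\' in the value from being
            // misread as a back-reference by appendReplacement:
            m.appendReplacement(out, Matcher.quoteReplacement(
                    replacement != null ? replacement : m.group(0)));
        }
        return m.appendTail(out).toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = new HashMap<>();
        values.put("file", "video.mp4");
        values.put("size", "20mb");
        System.out.println(fill("Media file [file] of size [size] has been approved", values));
        // -> Media file video.mp4 of size 20mb has been approved
    }
}
```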

Vaadin access content of sql container

I retrieve data from a MySQL DB using Vaadin, SQLContainer and a FreeformQuery. Now I want to get all Descriptions in one String (later each string will be printed/added to a text file).
....
String text = "";
FreeformQuery subcatExtractionQuery = new FreeformQuery("select Description from customers", connectionPool);
SQLContainer s = new SQLContainer(subcatExtractionQuery);
Collection<?> c = s.getContainerPropertyIds();
for (Object o : c) {
    Property<?> p = s.getContainerProperty(o, "Description");
    text += (String) p.getValue();
}
System.out.println(text);
I get the error: java.lang.NullPointerException
The problem line is the text += ...; if I remove it, no error appears. But the query returns data!
These descriptions are Strings. I need to have all words of each string in one single list of tokens (I already have a method to create a token list from a file).
(Don't ask if this makes sense, it's just an example).
How can I access the Strings of my SQL container? Until now I only used the container as the data source of a table and didn't access the single items/strings.
I need a for loop to get each String... how does this work?
It's getItemIds instead of getContainerPropertyIds. So this works:
....
String text = "";
FreeformQuery subcatExtractionQuery = new FreeformQuery("select Description from customers", connectionPool);
SQLContainer s = new SQLContainer(subcatExtractionQuery);
Collection<?> c = s.getItemIds();
for (Object o : c) {
    Property<?> p = s.getContainerProperty(o, "Description");
    text += (String) p.getValue();
}
System.out.println(text);

How to get original codes from generated pattern in java?

Suppose I have a java.util.Set<String> of "200Y2Z", "20012Y", "200829", "200T2K", which all follow the same pattern "200$2$", where "$" is the placeholder. What is the most efficient way to get a Set of just the placeholder characters from such strings in Java?
Input: java.util.Set<String> of "200Y2Z", "20012Y", "200829", "200T2K"
Expected output: java.util.Set<String> of "YZ", "1Y", "89", "TK"
My try:
public static void getOutPut() {
    String pattern = "200$2$";
    Set<String> input = new HashSet<String>();
    Set<String> output = new HashSet<String>();
    StringBuffer out = null;
    for (String in : input) {
        out = new StringBuffer();
        // walk the PATTERN and copy the characters of the code
        // at the positions where the pattern has a '$'
        StringCharacterIterator sci = new StringCharacterIterator(pattern);
        while (sci.current() != StringCharacterIterator.DONE) {
            if (sci.current() == '$') {
                out.append(in.charAt(sci.getIndex()));
            }
            sci.next();
        }
        output.add(out.toString());
    }
    System.out.println(output);
}
It is working fine, but is there a more efficient way to achieve this? I need to do it for more than 1000K codes.
Get the indexes of the placeholder in the pattern:
int i = pattern.indexOf('$');
You'll have to iterate to obtain all the indexes:
i = pattern.indexOf('$', lastIndex + 1);
The loop and the checks are up to you.
Then use charAt with those indexes on each element of the set.
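A sketch of that suggestion (class and method names are mine, not from the question): compute the '$' positions once, then reuse them for every code. Not re-scanning the pattern per string is the main saving when processing 1000K codes.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class PatternExtract {
    // Precompute placeholder positions once, then reuse for every code.
    static Set<String> extract(String pattern, Set<String> codes) {
        // Collect the indexes of '$' in the pattern:
        int[] positions = new int[pattern.length()];
        int n = 0;
        for (int i = pattern.indexOf('$'); i >= 0; i = pattern.indexOf('$', i + 1)) {
            positions[n++] = i;
        }
        Set<String> out = new LinkedHashSet<>();
        StringBuilder sb = new StringBuilder(n);
        for (String code : codes) {
            sb.setLength(0); // reuse one builder instead of allocating per code
            for (int j = 0; j < n; j++) {
                sb.append(code.charAt(positions[j]));
            }
            out.add(sb.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> input = new LinkedHashSet<>(
                java.util.Arrays.asList("200Y2Z", "20012Y", "200829", "200T2K"));
        System.out.println(extract("200$2$", input));
        // -> [YZ, 1Y, 89, TK]
    }
}
```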

Parsing xml content line by line and extracting some values from it

How can I elegantly extract these values from the following text content? I have a long file that contains thousands of entries. I tried the XML parser and Slurper approaches, but I ran out of memory; I have only 1GB. So now I'm reading the file line by line and extracting the values. But I think there should be a better way in Java/Groovy to do this, maybe a cleaner and reusable one. (I read the content from standard input.)
1 line of Content:
<sample t="336" lt="0" ts="1406036100481" s="true" lb="txt1016.pb" rc="" rm="" tn="Thread Group 1-9" dt="" by="0"/>
My Groovy Solution:
Map<String, List<Integer>> requestSet = new HashMap<String, List<Integer>>();
String reqName;
String[] tmpData;
Integer reqTime;
System.in.eachLine() { line ->
    if (line.find("sample")) {
        tmpData = line.split(" ");
        reqTime = Integer.parseInt(tmpData[1].replaceAll('"', '').replaceAll("t=", ""));
        reqName = tmpData[5].replaceAll('"', '').replaceAll("lb=", "");
        if (requestSet.containsKey(reqName)) {
            requestSet.get(reqName).add(reqTime);
        } else {
            List<Integer> myList = new ArrayList<Integer>();
            myList.add(reqTime);
            requestSet.put(reqName, myList);
        }
    }
}
Any suggestions or code snippets to improve this?
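One possible improvement, sketched in plain Java (attribute names taken from the sample line above, everything else is illustrative): extract the t and lb attributes with regexes instead of splitting on spaces and indexing into the array, which silently breaks if the attribute order ever changes:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SampleLineParser {
    // Pull the t="..." and lb="..." attributes regardless of their position in the line.
    private static final Pattern T = Pattern.compile("\\bt=\"(\\d+)\"");
    private static final Pattern LB = Pattern.compile("\\blb=\"([^\"]*)\"");

    static void addSample(String line, Map<String, List<Integer>> requestSet) {
        if (!line.contains("<sample")) return;
        Matcher mt = T.matcher(line);
        Matcher mlb = LB.matcher(line);
        if (mt.find() && mlb.find()) {
            requestSet.computeIfAbsent(mlb.group(1), k -> new ArrayList<>())
                      .add(Integer.parseInt(mt.group(1)));
        }
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> requestSet = new HashMap<>();
        addSample("<sample t=\"336\" lt=\"0\" ts=\"1406036100481\" s=\"true\" "
                + "lb=\"txt1016.pb\" rc=\"\" rm=\"\" tn=\"Thread Group 1-9\" dt=\"\" by=\"0\"/>",
                requestSet);
        System.out.println(requestSet);
        // -> {txt1016.pb=[336]}
    }
}
```

Note the \b word boundary: it keeps t=" from also matching inside lt=" or dt=", without having to care where the attribute sits in the line.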

how to read two consecutive commas from .csv file format as unique value in java

Suppose the csv file contains
1,112,,ASIF
The following code eliminates the empty value between the two consecutive commas.
(More code is provided than is strictly required.)
String p1 = null, p2 = null;
while ((lineData = Buffreadr.readLine()) != null) {
    row = new Vector();
    int i = 0;
    StringTokenizer st = new StringTokenizer(lineData, ",");
    while (st.hasMoreTokens()) {
        row.addElement(st.nextElement());
        if (row.get(i).toString().startsWith("\"")) {
            while (!row.get(i).toString().endsWith("\"")) {
                p1 = row.get(i).toString();
                p2 = st.nextElement().toString();
                row.set(i, p1 + ", " + p2);
            }
            String CellValue = row.get(i).toString();
            CellValue = CellValue.substring(1, CellValue.length() - 1);
            row.set(i, CellValue);
            //System.out.println(" Final Cell Value : " + row.get(i).toString());
        }
        eror = row.get(i).toString();
        try {
            eror = eror.replace('\'', ' ');
            eror = eror.replace('[', ' ');
            eror = eror.replace(']', ' ');
            //System.out.println("Error " + eror);
            row.remove(i);
            row.insertElementAt(eror, i);
        } catch (Exception e) {
            System.out.println("Error exception " + eror);
        }
        i++;
    }
}
How can I read the two consecutive commas from the .csv file as a unique (empty) value in Java?
Here is an example of doing this by splitting into a String array. Changed lines are marked with comments.
// Start of your code.
row = new Vector();
int i = 0;
String[] st = lineData.split(","); // Changed
for (String s : st) { // Changed
    row.addElement(s); // Changed
    if (row.get(i).toString().startsWith("\"")) {
        while (!row.get(i).toString().endsWith("\"")) {
            p1 = row.get(i).toString();
            p2 = s.toString(); // Changed
            row.set(i, p1 + ", " + p2);
        }
        // ... rest of the code here
    }
StringTokenizer skips empty tokens. That is its documented behaviour. From the Javadoc:
StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.
Just use String.split(",") and you are done.
Just read the whole line into a string, then do string.split(",").
The resulting array should contain exactly what you are looking for...
If you need to handle "escaped" commas, you will need a regex for the split instead of a simple ",".
while ((lineData = Buffreadr.readLine()) != null) {
    String[] row = lineData.split(",");
    // Now process the array however you like; each cell in the csv is one entry in the array
}
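A small, runnable illustration of the difference. Note the limit argument: split keeps empty fields between commas, but drops trailing empty fields unless you pass -1.

```java
import java.util.Arrays;

public class CsvSplitDemo {
    public static void main(String[] args) {
        String lineData = "1,112,,ASIF";
        // split keeps the empty field between the two consecutive commas:
        System.out.println(Arrays.toString(lineData.split(",")));
        // -> [1, 112, , ASIF]

        // Trailing empty fields are dropped by default...
        System.out.println("1,112,,".split(",").length);      // 2
        // ...but kept when the limit is negative:
        System.out.println("1,112,,".split(",", -1).length);  // 4
    }
}
```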
