How can I elegantly extract these values from the following text content? I have a long file that contains thousands of entries. I tried the XML Parser and Slurper approach, but I ran out of memory; I only have 1 GB. So now I'm reading the file line by line and extracting the values. But I think there should be a better way in Java/Groovy to do this, maybe a cleaner and more reusable one. (I read the content from standard input.)
1 line of Content:
<sample t="336" lt="0" ts="1406036100481" s="true" lb="txt1016.pb" rc="" rm="" tn="Thread Group 1-9" dt="" by="0"/>
My Groovy Solution:
Map<String, List<Integer>> requestSet = new HashMap<String, List<Integer>>();
String reqName;
String[] tmpData;
Integer reqTime;
System.in.eachLine() { line ->
    if (line.find("sample")) {
        tmpData = line.split(" ");
        reqTime = Integer.parseInt(tmpData[1].replaceAll('"', '').replaceAll("t=", ""));
        reqName = tmpData[5].replaceAll('"', '').replaceAll("lb=", "");
        if (requestSet.containsKey(reqName)) {
            List<Integer> myList = requestSet.get(reqName);
            myList.add(reqTime);
            requestSet.put(reqName, myList);
        } else {
            List<Integer> myList = new ArrayList<Integer>();
            myList.add(reqTime);
            requestSet.put(reqName, myList);
        }
    }
}
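For comparison, here is a regex-based sketch of the same loop in plain Java (Java 8 or later, untested; the pattern is only a guess derived from the sample line above):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SampleTimes {
    public static void main(String[] args) throws Exception {
        // Pull t="..." and lb="..." out of each <sample .../> line with a regex
        // instead of splitting on spaces, and group the times per label.
        Pattern p = Pattern.compile(" t=\"(\\d+)\".*? lb=\"([^\"]*)\"");
        Map<String, List<Integer>> requestSet = new HashMap<>();
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            Matcher m = p.matcher(line);
            if (line.contains("<sample") && m.find()) {
                requestSet.computeIfAbsent(m.group(2), k -> new ArrayList<>())
                          .add(Integer.parseInt(m.group(1)));
            }
        }
        System.out.println(requestSet);
    }
}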
Any suggestions or code snippets that improve this?
I want to create a nested HashMap which returns the frequency of terms among multiple files. Like,
Map<String, Map<String, Integer>> wordToDocumentMap=new HashMap<>();
I have been able to return the number of times a term appears in a file.
String str = "Wikipedia is a free online encyclopedia, created and edited by volunteers around the world."; // String str stands in for the contents of a file, e.g. a.java
// The query string
String query = "edited Wikipedia volunteers";
// Split the given string and the query string on whitespace
String[] strArr = str.split("\\s+");
String[] queryArr = query.split("\\s+");
// Map to hold the frequency of each word of the query in the string
Map<String, Integer> map = new HashMap<>();
for (String q : queryArr) {
    for (String s : strArr) {
        if (q.equals(s)) {
            map.put(q, map.getOrDefault(q, 0) + 1);
        }
    }
}
// Display the map
System.out.println(map);
My code counts the frequency of each query term individually. But I want to map each query term and its frequency to the file name it came from. I have searched around the web but am finding it tough to find a solution that applies to my case. Any help would be appreciated!
I hope I'm understanding you correctly.
What you want is to be able to read in a list of files and map the file name to the map you create in the code above. So let's start with your code and let's turn it into a function:
public Map<String, Integer> createFreqMap(String str, String query) {
    // Split the given string and the query string on whitespace
    String[] strArr = str.split("\\s+");
    String[] queryArr = query.split("\\s+");
    // Map to hold the frequency of each word of the query in the string
    Map<String, Integer> map = new HashMap<>();
    for (String q : queryArr) {
        for (String s : strArr) {
            if (q.equals(s)) {
                map.put(q, map.getOrDefault(q, 0) + 1);
            }
        }
    }
    // Display the map
    System.out.println(map);
    return map;
}
OK, so now you have a nifty function that makes a map from a string and a query.
Now you're going to want to set up a system for reading in a file to a string.
There are a bunch of ways to do this. You can look here for some ways that work for different java versions: https://stackoverflow.com/a/326440/9789673
Let's go with this (assuming Java 11 or later):
String content = Files.readString(path, StandardCharsets.US_ASCII);
Where path is the path to the file you want.
Now we can put it all together:
// needs: java.nio.file.Files, java.nio.file.Path, java.nio.charset.StandardCharsets
String[] paths = {"this.txt", "that.txt"};
Map<String, Map<String, Integer>> output = new HashMap<>();
String query = "edited Wikipedia volunteers"; // String query = "hello";
for (int i = 0; i < paths.length; i++) {
    String content = Files.readString(Path.of(paths[i]), StandardCharsets.US_ASCII);
    output.put(paths[i], createFreqMap(content, query));
}
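With that in place, a lookup such as output.get("this.txt").get("Wikipedia") should give the number of times "Wikipedia" appears in this.txt (assuming the file exists and actually contains that word).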
Be gentle, this is my first time using Apache Commons CSV 1.7. I am creating a service that processes some CSV input, adds some additional information from external sources, and then writes out a CSV for ingestion into another system. I store the information I have gathered in a list of HashMap<String, String>, one map per row of the final output CSV; each HashMap holds <column name, value for that column> pairs. My problem is getting CSVPrinter to assign the values of those HashMaps to the right columns of each row. I can concatenate the values into a single string with commas between them; however, that just puts the whole string into the first column. I cannot define or hardcode the headers, since they come from a config file and may change depending on which project uses the service.
Here is some of my code:
try (BufferedWriter writer = Files.newBufferedWriter(
        Paths.get(OUTPUT + "/" + project + "/" + project + ".csv")))
{
    CSVPrinter csvPrinter = new CSVPrinter(writer,
            CSVFormat.RFC4180.withFirstRecordAsHeader());
    csvPrinter.printRecord(columnList);
    for (HashMap<String, String> row : rowCollection)
    {
        // Need to map __record__ to column -> row.key, value -> row.value for the whole map.
        csvPrinter.printRecord(__record__);
    }
    csvPrinter.flush();
}
Thanks for your assistance.
You actually have multiple concerns with your technique: how do you maintain column order, how do you print the column names, and how do you print the column values? Here are my suggestions.
Maintain column order. Do not use HashMap, because it is unordered. Instead, use LinkedHashMap, which has a "predictable iteration order" (i.e. it maintains insertion order).
Print column names. Every row in your list contains the column names in the form of its keys, but you only want to print the column names once, as the first row of output. The solution is to print the column names before you loop through the rows; get them from the first element of the list.
Print column values. The "billal GHILAS" answer demonstrates a way to print the values of each row.
Here is some code:
try (BufferedWriter writer = Files.newBufferedWriter(
        Paths.get(OUTPUT + "/" + project + "/" + project + ".csv")))
{
    CSVPrinter csvPrinter = new CSVPrinter(writer,
            CSVFormat.RFC4180.withFirstRecordAsHeader());
    // This assumes that the rowCollection will never be empty.
    // An anonymous scope block just to limit the scope of the variable names.
    {
        HashMap<String, String> firstRow = rowCollection.get(0);
        int valueIndex = 0;
        String[] valueArray = new String[firstRow.size()];
        for (String currentColumn : firstRow.keySet())
        {
            valueArray[valueIndex++] = currentColumn;
        }
        csvPrinter.printRecord(valueArray);
    }
    for (HashMap<String, String> row : rowCollection)
    {
        int valueIndex = 0;
        String[] valueArray = new String[row.size()];
        for (String currentValue : row.values())
        {
            valueArray[valueIndex++] = currentValue;
        }
        csvPrinter.printRecord(valueArray);
    }
    csvPrinter.flush();
}
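For reference, here is the row-printing loop mentioned above (it appears to be the "billal GHILAS" snippet): it builds each record in columnList order, so the values always line up with the printed header.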
for (HashMap<String, String> row : rowCollection) {
    Object[] record = new Object[columnList.size()];
    for (int i = 0; i < columnList.size(); i++) {
        record[i] = row.get(columnList.get(i));
    }
    csvPrinter.printRecord(record);
}
I wrote code to extract multiple patterns from my string which has passed through a Stanford NER parser and gives output like:
Input Sentence - Goldman profit at risk under Volcker rule
Output Sentence - Goldman profit at risk under <PERSON>Volcker</PERSON> rule
I need to extract the word Volcker and put it in the personTag map, which eventually gets printed later in the code. The code below gives me a null pointer exception on list.add(m.group(1));
I am unable to figure out why. Please help with this.
..............
HashMap<String, String> regs = new HashMap<String, String>();
regs.put("PERSON", "<PERSON>(.+?)</PERSON>");
regs.put("LOCATION", "<LOCATION>(.+?)</LOCATION>");
regs.put("TIME", "<TIME>(.+?)</TIME>");
regs.put("PERCENT", "<PERCENT>(.+?)</PERCENT>");
regs.put("MONEY", "<MONEY>(.+?)</MONEY>");
regs.put("DATE", "<DATE>(.+?)</DATE>");
for (Entry<String, String> entry : regs.entrySet())
{
    String key = entry.getKey();
    String value = entry.getValue();
    Matcher m = Pattern.compile(value).matcher(NER);
    ArrayList<String> list = null;
    while (m.find())
    {
        if (key.contains("PERSON")) {
            list.add(m.group(1));
            personTag.put(key, list);
            //System.out.println("Person Tag:" + personTag);
            roleStrings.put(SemanticRole.PERSON, personTag.toString());
        }
        else if (key.contains("LOCATION")) {
            list.add(m.group());
            locationTag.put(key, list);
            roleStrings.put(SemanticRole.LOCATION, locationTag.toString());
        }
        else if (key.contains("TIME")) {
            list.add(m.group(1));
            timeTag.put(key, list);
            roleStrings.put(SemanticRole.TIME, timeTag.toString());
        }
        else if (key.contains("DATE")) {
            list.add(m.group(1));
            timeTag.put(key, list);
            roleStrings.put(SemanticRole.TIME, timeTag.toString());
        }
    }
}
return roleStrings;
}
Never mind. I had not initialized my list, which is why I was getting the null pointer exception. This is what I had to do:
List<String> list = new ArrayList<String>();
Instead of:
List<String> list = null;
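For what it's worth, here is a small self-contained sketch of the corrected loop (it collapses the separate personTag/locationTag maps into one tags map purely for the demo; names are placeholders):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NerTagDemo {
    public static void main(String[] args) {
        String NER = "Goldman profit at risk under <PERSON>Volcker</PERSON> rule";
        Map<String, String> regs = new HashMap<>();
        regs.put("PERSON", "<PERSON>(.+?)</PERSON>");
        regs.put("LOCATION", "<LOCATION>(.+?)</LOCATION>");

        Map<String, List<String>> tags = new HashMap<>();
        for (Map.Entry<String, String> entry : regs.entrySet()) {
            Matcher m = Pattern.compile(entry.getValue()).matcher(NER);
            while (m.find()) {
                // the list is created on first use, so add() can never hit a null reference
                tags.computeIfAbsent(entry.getKey(), k -> new ArrayList<>()).add(m.group(1));
            }
        }
        System.out.println(tags); // prints {PERSON=[Volcker]}
    }
}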
Code:
Map<String, String> test = new HashMap<String, String>();
test.put("1", "erica");
test.put("2", "frog");
System.out.println(test.toString());
This code gives output as :
{1=erica, 2=frog}
I want to put this output back into a map as key-value pairs.
Any suggestions on how I can implement this?
Or is there any predefined utility class for converting the output back into a HashMap?
For me, the proper way would be to use a JSON library like Jackson, since the way a HashMap is serialized by toString() is not meant to be parsed back: specific characters like = or , inside keys or values are not escaped, which makes the output unparsable.
How to serialize a Map with Jackson?
ObjectMapper mapper = new ObjectMapper();
String result = mapper.writeValueAsString(myMap);
How to deserialize a String to get a Map with Jackson?
ObjectMapper mapper = new ObjectMapper();
Map map = mapper.readValue(contentToParse, Map.class);
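A minimal round-trip sketch (assuming the jackson-databind dependency is available; the TypeReference avoids the raw Map type):

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.HashMap;
import java.util.Map;

public class MapJsonRoundTrip {
    public static void main(String[] args) throws Exception {
        Map<String, String> test = new HashMap<>();
        test.put("1", "erica");
        test.put("2", "frog");

        ObjectMapper mapper = new ObjectMapper();
        String json = mapper.writeValueAsString(test);              // {"1":"erica","2":"frog"}
        Map<String, String> restored = mapper.readValue(
                json, new TypeReference<Map<String, String>>() {}); // back to a typed map
        System.out.println(restored);                               // {1=erica, 2=frog}
    }
}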
You can try something like this:
String content = mystring.substring(1, mystring.length() - 1); // drop the surrounding "{" and "}"
String[] tk = content.split(", |=");
Map<String, String> map = new HashMap<>();
for (int i = 0; i + 1 < tk.length; i += 2)
{
    map.put(tk[i], tk[i + 1]);
}
return map;
If you want to replicate the Java code filling the map, you may use something like this:
StringBuilder sb = new StringBuilder("Map<String, String> test = new HashMap<>();");
for (Map.Entry<?, ?> entry : test.entrySet())
{
    sb.append("\ntest.put(\"");
    sb.append(entry.getKey());
    sb.append("\", \"");
    sb.append(entry.getValue());
    sb.append("\");");
}
String string = sb.toString();
System.out.println(string);
But I agree with the comments that in many applications a format such as JSON is more appropriate for serializing a map.
Note that the above solution does not escape strings; it only works if the strings don't contain characters like " or \n. If you need to handle those cases, it becomes more complicated.
You could try the following:
String out = test.toString();
Map<String, String> newMap = new HashMap<>();
// remove the first and last "{", "}"
out = out.substring(1, out.length() - 1);
String[] newOut = out.split(", ");
for (int i = 0; i < newOut.length; i++) {
    // keyValue has size 2: cell 0 is the key, cell 1 is the value
    String[] keyValue = newOut[i].split("=");
    newMap.put(keyValue[0], keyValue[1]);
}
I haven't tested the code in Java; I just wrote it from memory. I hope it works.
Hello people of the internet,
We're having the following problem with the Stanford NLP API:
We have a String that we want to transform into a list of sentences.
First, we used String sentenceString = Sentence.listToString(sentence); but listToString does not return the original text because of the tokenization. Now we tried to use listToOriginalTextString in the following way:
private static List<String> getSentences(String text) {
    Reader reader = new StringReader(text);
    DocumentPreprocessor dp = new DocumentPreprocessor(reader);
    List<String> sentenceList = new ArrayList<String>();
    for (List<HasWord> sentence : dp) {
        String sentenceString = Sentence.listToOriginalTextString(sentence);
        sentenceList.add(sentenceString.toString());
    }
    return sentenceList;
}
This does not work. Apparently we have to set an attribute "invertible" to true, but we don't know how. How can we do this?
In general, how do you use listToOriginalTextString properly? What preparations do you need?
sincerely,
Khayet
If I understand correctly, you want to get the mapping of tokens to the original input text after tokenization. You can do it like this:
// split via PTBTokenizer (PTBLexer)
List<CoreLabel> tokens = PTBTokenizer.coreLabelFactory()
        .getTokenizer(new StringReader(text)).tokenize();
// do the processing using the Stanford sentence splitter (WordToSentenceProcessor)
WordToSentenceProcessor<CoreLabel> processor = new WordToSentenceProcessor<>();
List<List<CoreLabel>> splitSentences = processor.process(tokens);
// for each sentence
for (List<CoreLabel> s : splitSentences) {
    // for each word
    for (CoreLabel token : s) {
        // here you can get the token value and position, e.g.
        // token.value(), token.beginPosition(), token.endPosition()
    }
}
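If you specifically want to keep DocumentPreprocessor and make Sentence.listToOriginalTextString work, one option is to hand the preprocessor a tokenizer factory created with the invertible option. A sketch of the getSentences method from the question, untested and assuming a reasonably recent CoreNLP release:

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.Sentence;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.process.PTBTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

private static List<String> getSentences(String text) {
    DocumentPreprocessor dp = new DocumentPreprocessor(new StringReader(text));
    // "invertible=true" makes the tokenizer keep each token's original text and
    // surrounding whitespace, which is what listToOriginalTextString needs.
    dp.setTokenizerFactory(PTBTokenizer.factory(new CoreLabelTokenFactory(), "invertible=true"));
    List<String> sentenceList = new ArrayList<>();
    for (List<HasWord> sentence : dp) {
        sentenceList.add(Sentence.listToOriginalTextString(sentence));
    }
    return sentenceList;
}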
String sentenceStr = sentence.get(CoreAnnotations.TextAnnotation.class);
It gives you the original text. An example from the JSONOutputter.java file:
l2.set("id", sentence.get(CoreAnnotations.SentenceIDAnnotation.class));
l2.set("index", sentence.get(CoreAnnotations.SentenceIndexAnnotation.class));
l2.set("sentenceOriginal",sentence.get(CoreAnnotations.TextAnnotation.class));
l2.set("line", sentence.get(CoreAnnotations.LineNumberAnnotation.class));