Storing string into hashmap with occurrences - java

I have a method that returns a string. I want to store the individual words in a HashMap along with their number of occurrences.
public static void main(String[] args) {
String s = "{link:hagdjh, matrics:[{name:apple, value:1},{name:jeeva, value:2},{name:abc, value:0}]}";
String[] strs = s.split("matrics");
System.out.println("Substrings length:" + strs.length);
for (int i = 0; i < strs.length; i++) {
System.out.println(strs[i]);
}
}
For example, I have the string "{link:https://www.google.co.in/, matrics:[{name:apple, value:1},{name:graph, value:2},{name:abc, value:0}]}".
My HashMap should then look like this:
apple = 1
graph = 2
abc = 0
How should I proceed?
I know how to use HashMaps. My problem, in this case, is that I don't know how to parse through the given string and store the words with their number of occurrences.

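// (.*?) is reluctant, so each match stops at the first ", value:" instead of spanning several records.
String regex = "\\{name:(.*?), value:(\\d+)\\}";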
HashMap<String, Integer> link = new HashMap<>();
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(s);
while (matcher.find()){
String found = matcher.group(1);
String number = matcher.group(2);
link.put(found, Integer.parseInt(number));
}
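For reference, here is a minimal, self-contained sketch of the same regex idea. The class name NameValueParser and the method parse are made up for illustration. It assumes every record has exactly the form {name:X, value:N}, and it uses a character class instead of .* so that a match cannot run across record boundaries.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameValueParser {
    // [^,}]+ cannot cross a comma or a closing brace, so each match stays inside one record.
    private static final Pattern RECORD = Pattern.compile("\\{name:([^,}]+), value:(\\d+)\\}");

    // Returns name -> value in the order the records appear in the input.
    public static Map<String, Integer> parse(String s) {
        Map<String, Integer> result = new LinkedHashMap<>();
        Matcher m = RECORD.matcher(s);
        while (m.find()) {
            result.put(m.group(1), Integer.parseInt(m.group(2)));
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "{link:https://www.google.co.in/, matrics:[{name:apple, value:1},{name:graph, value:2},{name:abc, value:0}]}";
        System.out.println(parse(s)); // prints {apple=1, graph=2, abc=0}
    }
}
```

A LinkedHashMap is used only to keep the insertion order; a plain HashMap works just as well if the order does not matter.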

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
Map<String, Integer> map = new LinkedHashMap<String, Integer>();
Pattern pattern = Pattern.compile("matrics:\\[\\{(.*?)\\]\\}");
Matcher matcher = pattern
.matcher("{link:hagdjh, matrics:[{name:apple, value:1},{name:jeeva, value:2},{name:abc, value:0}]}");
String data = "";
if (matcher.find()) {
data = matcher.group();
}
List<String> records = new ArrayList<String>();
pattern = Pattern.compile("(?<=\\{).+?(?=\\})");
matcher = pattern.matcher(data);
while (matcher.find()) {
records.add(matcher.group());
}
for (String s : records) {
String[] parts = s.split(", ");
map.put(parts[0].substring(parts[0].indexOf(":") + 1),
Integer.parseInt(parts[1].substring(parts[1].indexOf(":") + 1)));
}
map.entrySet().forEach(entry -> {
System.out.println(entry.getKey() + " = " + entry.getValue());
});
}
}
Output:
apple = 1
jeeva = 2
abc = 0

It appears that your data is in JSON format.
If it is guaranteed to be in JSON format, you can parse it with a JSON parsing library and then analyze the matrics data in a convenient way (code follows).
If the data is not guaranteed to be in JSON format, you can use a regex to help you parse it, as in Reza soumi's answer.
import org.json.JSONObject;
import org.json.JSONArray;
import java.util.HashMap;
String s = "{link:hagdjh, matrics:[{name:apple, value:1},{name:jeeva, value:2},{name:abc, value:0}]}";
JSONObject obj = new JSONObject(s);
JSONArray matrics = obj.getJSONArray("matrics");
System.out.println(matrics);
HashMap<String, Integer> matricsHashMap = new HashMap<String, Integer>();
for (int i=0;i < matrics.length();i++){
JSONObject matric = matrics.getJSONObject(i);
System.out.println("Adding matric: " + matric + " to hash map");
String matricName = matric.getString("name");
Integer matricValue = Integer.valueOf(matric.getInt("value"));
matricsHashMap.put(matricName, matricValue);
}
System.out.println(matricsHashMap);

Try this:
import static java.lang.System.err;
import static java.lang.System.out;
import static java.util.Arrays.stream;
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.toMap;
/**
* Counting the words in a String.
*/
public class CountWordsInString
{
/*-----------*\
====** Constants **========================================================
\*-----------*/
/**
* The input string whose words are counted.
*/
public static final String INPUT = "{link:https://www.google.co.in/, matrics:[{name:apple, value:1},{name:graph, value:2},{name:abc, value:0}]}";
/*---------*\
====** Methods **==========================================================
\*---------*/
/**
* The program entry point.
*
* @param args The command line arguments.
*/
public static void main( final String... args )
{
try
{
final var result = stream( INPUT.split( "\\W+" ) )
.filter( s -> !s.isBlank() )
.filter( s -> !s.matches( "\\d*" ) )
.collect( groupingBy( s -> s ) )
.entrySet()
.stream()
.collect( toMap( k -> k.getKey(), v -> Long.valueOf( v.getValue().size() ) ) );
out.println( result.getClass() );
for( final var entry : result.entrySet() )
{
out.printf( "'%s' occurred %d times%n", entry.getKey(), entry.getValue() );
}
}
catch( final Throwable t )
{
//---* Handle any previously unhandled exceptions *----------------
t.printStackTrace( err );
}
} // main()
}
// class CountWordsInString
Admittedly, this is not the most obvious solution, but I wanted to have some fun with it, too.
The INPUT.split( "\\W+" ) call gives you the words in the string, but also the numbers and an 'empty' word at the beginning.
The 'empty' word is eliminated by the first filter() statement, the numbers by the second.
The first collect( groupingBy() ) gives you a HashMap<String,List<String>>, so I had to convert that to a HashMap<String,Long> in the following steps (basically with the second collect( toMap() )).
Maybe there is a more efficient solution, or a more elegant one, or even one that is both, but it works as expected, and I had some fun with it.
The output is:
class java.util.HashMap
'apple' occurred 1 times
'matrics' occurred 1 times
'abc' occurred 1 times
'in' occurred 1 times
'www' occurred 1 times
'name' occurred 3 times
'link' occurred 1 times
'google' occurred 1 times
'https' occurred 1 times
'co' occurred 1 times
'value' occurred 3 times
'graph' occurred 1 times
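One possible simplification of the two-step collect above is to pass Collectors.counting() as the downstream collector of groupingBy, so the counts come out directly. The sketch below is only an illustration: the class name WordCount and the method countWords are made up, and a TreeMap is chosen just to get sorted keys.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class WordCount {
    // Counts every non-numeric word in the input; the TreeMap keeps the keys sorted.
    public static Map<String, Long> countWords(String input) {
        return Arrays.stream(input.split("\\W+"))
                .filter(s -> !s.isEmpty())        // drop the empty leading token
                .filter(s -> !s.matches("\\d+"))  // drop pure numbers
                .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        String input = "{link:hagdjh, matrics:[{name:apple, value:1},{name:jeeva, value:2}]}";
        countWords(input).forEach((word, n) -> System.out.printf("'%s' occurred %d times%n", word, n));
    }
}
```

Like the original, this counts every word in the string, including keys such as name and value, so it answers "how often does each word occur" rather than "which value belongs to which name".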

Related

How to create JSON in java for pipe delimiter?

Suppose i have this structure
FirstName| Auro
LastName|Winkies
Age|26
How can I convert this into JSON? The words before the pipe delimiter | should go into the L property. The words after the pipe delimiter should be shuffled and saved in another property, R. The C property records the original position of each shuffled value: Winkies was at position 2 after the pipe delimiter, Auro at position 1, and 26 at position 3.
Is it possible to create this JSON structure in Java?
I thought I would first split on \n and then split each line on \\|.
{
"L": ["FirstName" , "LastName" , "Age"],
"R": ["Winkies" , "Auro" , "26"],
"C":["2" ,"1" , "3"]
}
Can anybody help me out with the logic?
I don't see the purpose of the "C" field, but here is a solution:
import com.fasterxml.jackson.databind.node.JsonNodeFactory;
import com.fasterxml.jackson.databind.node.ObjectNode;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
public static void main(final String[] args) {
String data = "FirstName|Auro\n" +
"LastName|Winkies\n" +
"Age|26";
List<String> l = new ArrayList<>();
List<String> r = new ArrayList<>();
ObjectNode node = JsonNodeFactory.instance.objectNode();
List<String> c = Arrays.asList("1", "2", "3");
String[] split = data.split("\n");
for (String s : split) {
int i = s.indexOf('|');
l.add(s.substring(0, i));
r.add(s.substring(i + 1, s.length()));
}
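// Store each list as a JSON array; put(..., list.toString()) would store it as one plain string.
l.forEach(node.putArray("L")::add);
r.forEach(node.putArray("R")::add);
c.forEach(node.putArray("C")::add);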
System.out.println(node);
}
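The shuffling and the "C" positions asked for in the question could be sketched as follows, using only the standard library. This is one reading of the requirement, not a definitive implementation: the class PipeToJson and its methods are hypothetical, the JSON text is assembled by hand, and it assumes the values contain no quotes or backslashes (a JSON library such as Jackson would handle escaping).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

class PipeToJson {
    // Quotes each item and joins the list into a JSON array literal.
    // Assumption: items contain no quotes or backslashes, so nothing is escaped.
    public static String jsonArray(List<String> items) {
        return items.stream().map(s -> "\"" + s + "\"").collect(Collectors.joining(", ", "[", "]"));
    }

    public static String convert(String data, Random rnd) {
        List<String> left = new ArrayList<>();
        List<String> values = new ArrayList<>();
        for (String line : data.split("\n")) {
            int bar = line.indexOf('|');
            if (bar < 0) continue; // skip lines without a delimiter
            left.add(line.substring(0, bar).trim());
            values.add(line.substring(bar + 1).trim());
        }
        // Shuffle the 1-based positions, then read the values in that shuffled order.
        List<Integer> order = new ArrayList<>();
        for (int i = 1; i <= values.size(); i++) order.add(i);
        Collections.shuffle(order, rnd);
        List<String> right = new ArrayList<>();
        List<String> positions = new ArrayList<>();
        for (int pos : order) {
            right.add(values.get(pos - 1));
            positions.add(String.valueOf(pos));
        }
        return "{\"L\": " + jsonArray(left) + ", \"R\": " + jsonArray(right)
                + ", \"C\": " + jsonArray(positions) + "}";
    }

    public static void main(String[] args) {
        System.out.println(convert("FirstName| Auro\nLastName|Winkies\nAge|26", new Random()));
    }
}
```

Shuffling the positions rather than the values keeps R and C in step: the entry at index i of C is always the original position of the entry at index i of R.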

LDA in Spark 1.3.1. Converting raw data into Term Document Matrix?

I'm trying out LDA with Spark 1.3.1 in Java and got this error:
Error: application failed with exception
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NumberFormatException: For input string: "��"
My .txt file looks like this:
put weight find difficult pull ups push ups now
blindness diseases everything eyes work perfectly except ability take light use light form images
role model kid
Dear recall saddest memory childhood
This is the code:
import scala.Tuple2;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.clustering.LDAModel;
import org.apache.spark.mllib.clustering.LDA;
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.SparkConf;
public class JavaLDA {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("LDA Example");
JavaSparkContext sc = new JavaSparkContext(conf);
// Load and parse the data
String path = "/tutorial/input/askreddit20150801.txt";
JavaRDD<String> data = sc.textFile(path);
JavaRDD<Vector> parsedData = data.map(
new Function<String, Vector>() {
public Vector call(String s) {
String[] sarray = s.trim().split(" ");
double[] values = new double[sarray.length];
for (int i = 0; i < sarray.length; i++)
values[i] = Double.parseDouble(sarray[i]);
return Vectors.dense(values);
}
}
);
// Index documents with unique IDs
JavaPairRDD<Long, Vector> corpus = JavaPairRDD.fromJavaRDD(parsedData.zipWithIndex().map(
new Function<Tuple2<Vector, Long>, Tuple2<Long, Vector>>() {
public Tuple2<Long, Vector> call(Tuple2<Vector, Long> doc_id) {
return doc_id.swap();
}
}
));
corpus.cache();
// Cluster the documents into 100 topics using LDA
LDAModel ldaModel = new LDA().setK(100).run(corpus);
// Output topics. Each is a distribution over words (matching word count vectors)
System.out.println("Learned topics (as distributions over vocab of " + ldaModel.vocabSize()
+ " words):");
Matrix topics = ldaModel.topicsMatrix();
for (int topic = 0; topic < 100; topic++) {
System.out.print("Topic " + topic + ":");
for (int word = 0; word < ldaModel.vocabSize(); word++) {
System.out.print(" " + topics.apply(word, topic));
}
System.out.println();
}
ldaModel.save(sc.sc(), "myLDAModel");
}
}
Anyone know why this happened? I'm just trying LDA Spark for the first time. Thanks.
values[i] = Double.parseDouble(sarray[i]);
Why are you trying to convert each word of your text file into a Double?
That's the answer to your issue:
http://docs.oracle.com/javase/6/docs/api/java/lang/Double.html#parseDouble%28java.lang.String%29
Your code is expecting the input file to be a bunch of lines of whitespace separated text that looks like numbers. Assuming your text is words instead:
Get a list of every word that appears in your corpus:
JavaRDD<String> words =
data.flatMap((FlatMapFunction<String, String>) s -> {
s = s.replaceAll("[^a-zA-Z ]", "");
s = s.toLowerCase();
return Arrays.asList(s.split(" "));
});
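Make a map giving each distinct word an integer index (without distinct(), repeated words would receive several indices, some of them beyond the size of the map):
Map<String, Long> vocab = words.distinct().zipWithIndex().collectAsMap();
Then instead of your parsedData doing what it's doing up there, make it look up each word, find the associated number, go to that location in an array, and add 1 to the count for that word. The line is cleaned the same way as when the vocabulary was built, so that the lookups can succeed.
JavaRDD<Vector> tokens = data.map(
(Function<String, Vector>) s -> {
// Normalize exactly as in the vocabulary step above.
String[] vals = s.replaceAll("[^a-zA-Z ]", "").toLowerCase().split(" ");
double[] idx = new double[vocab.size()];
for (String val : vals) {
Long pos = vocab.get(val);
// Skip anything that is not in the vocabulary instead of failing on null.
if (pos != null) {
idx[pos.intValue()] += 1.0;
}
}
return Vectors.dense(idx);
}
);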
This results in an RDD of vectors, where each vector is vocab.size() long, and each spot in the vector is the count of how many times that vocab word appeared in the line.
I modified this code slightly from what I'm currently using and didn't test it, so there could be errors in it. Good luck!
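To make the idea concrete without a Spark cluster, here is a minimal plain-Java sketch of the same two steps: build a vocabulary, then turn each line into a vector of word counts. The class TermCounts and its methods are hypothetical and are not part of Spark; the tokenization rule (letters only, lowercased) is borrowed from the snippet above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TermCounts {
    // Strips everything except letters and spaces, lowercases, and splits on whitespace.
    public static String[] tokenize(String line) {
        String cleaned = line.replaceAll("[^a-zA-Z ]", "").toLowerCase().trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split("\\s+");
    }

    // Assigns each distinct word the next free index, in order of first appearance.
    public static Map<String, Integer> buildVocab(List<String> lines) {
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (String line : lines) {
            for (String word : tokenize(line)) {
                vocab.putIfAbsent(word, vocab.size());
            }
        }
        return vocab;
    }

    // Returns one count vector per line; entry i is how often vocabulary word i occurs in that line.
    public static double[][] toVectors(List<String> lines, Map<String, Integer> vocab) {
        double[][] vectors = new double[lines.size()][vocab.size()];
        for (int row = 0; row < lines.size(); row++) {
            for (String word : tokenize(lines.get(row))) {
                Integer col = vocab.get(word);
                if (col != null) vectors[row][col] += 1.0;
            }
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("role model kid", "light use light form images");
        Map<String, Integer> vocab = buildVocab(lines);
        System.out.println(vocab);
        for (double[] v : toVectors(lines, vocab)) System.out.println(Arrays.toString(v));
    }
}
```

Each row of the result corresponds to one document and each column to one vocabulary word, which is the term-document layout that LDA expects as input.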

String pattern with a dollar present

I have the below data in a text file.
CS##NEWSLTR$$
RY##GLMALAW$$
VW##NWL$$
VW##GLS$$
IS##4$$
ST##NJ$$
ST##NY$$
SORTX##0050004018001$$
RC##18 No. 4 GLMALAW 1$$
CR##18 No. 4 M & A Law. 1$$
SO3##The M & A Lawyer$$
DL##April, 2014$$
TI##DUSTING OFF APPRAISAL RIGHTS: THE DEVELOPMENT OF A NEW INVESTMENT
STRATEGY$$
Here I am trying to read these values into a Java array with the code below.
package strings;
import com.sun.org.apache.xalan.internal.xsltc.runtime.BasisLibrary;
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.regex.Pattern;
/**
*
* @author u0138039
*/
public class Strings {
/**
* @param args the command line arguments
*/
public static void main(String[] args) {
Scanner inFile1 = null;
try {
inFile1 = new Scanner(new File("C:\\Users\\u0138039\\Desktop\\Adhil\\WDA.TP.GLASSER.IB.F486806.A.D140605.T.txt")).useDelimiter("$\\\\\\\\\\\\$");
} catch (FileNotFoundException ex) {
Logger.getLogger(Strings.class.getName()).log(Level.SEVERE, null, ex);
}
List<String> tokens = new ArrayList<String>();
while (inFile1.hasNext()) {
tokens.add(inFile1.nextLine());
}
String[] tokenArray = tokens.toArray(new String[0]);
for (int i = 0; i < tokenArray.length; i++) {
String s = tokenArray[i];
System.out.println("a["+i+"]" +tokenArray[i]);
}
}
}
My idea is that each record ends with $$, and that is how it should be stored in the array, but when I run the above program I get the output below.
a[0]CS##NEWSLTR$$
a[1]RY##GLMALAW$$
a[2]VW##NWL$$
a[3]VW##GLS$$
a[4]IS##4$$
a[5]ST##NJ$$
a[6]ST##NY$$
a[7]SORTX##0050004018001$$
a[8]RC##18 No. 4 GLMALAW 1$$
a[9]CR##18 No. 4 M & A Law. 1$$
a[10]SO3##The M & A Lawyer$$
a[11]DL##April, 2014$$
a[12]TI##DUSTING OFF APPRAISAL RIGHTS: THE DEVELOPMENT OF A NEW INVESTMENT
a[13] STRATEGY$$
Here a[12] and a[13] belong to the same record and should share one array index, but they are split in two.
The expected output is as follows (the closing $$ of a[12] ended up in a[13]):
a[0]CS##NEWSLTR$$
a[1]RY##GLMALAW$$
a[2]VW##NWL$$
a[3]VW##GLS$$
a[4]IS##4$$
a[5]ST##NJ$$
a[6]ST##NY$$
a[7]SORTX##0050004018001$$
a[8]RC##18 No. 4 GLMALAW 1$$
a[9]CR##18 No. 4 M & A Law. 1$$
a[10]SO3##The M & A Lawyer$$
a[11]DL##April, 2014$$
a[12]TI##DUSTING OFF APPRAISAL RIGHTS: THE DEVELOPMENT OF A NEW INVESTMENT STRATEGY$$
Please let me know where I am going wrong and how to fix it.
Thanks
Forget the useDelimiter
List<String> tokens = new ArrayList<String>();
int next = 0;
while (inFile1.hasNext()) {
String line = inFile1.nextLine();
if( next >= tokens.size() ){
tokens.add( line );
} else {
tokens.set( next, tokens.get(next) + line );
}
if( line.endsWith( "$$" ) ) next++;
}
You're issuing an inFile1.nextLine(), so naturally the strings in a[12] and a[13] would be separated.
One approach I can think of is reading the content of the file into a String object and then splitting it on "\\$\\$" (the dollar signs must be escaped because split takes a regular expression).
String s = "Hello$$World$$Sample$$";
for(String sa: s.split("\\$\\$")) {
System.out.println(sa);
}
Output:
Hello
World
Sample
But this will not include the trailing "$$", since it is consumed by the split. You can easily add it back to the end of each string, but this is just one approach.
Hope this helps.
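The split approach could be sketched like this. The class DollarRecords and the method parse are made up for illustration. The sketch assumes the whole file content is already in a String (for example read with Files.readAllBytes) and that every record is terminated by $$.

```java
import java.util.ArrayList;
import java.util.List;

class DollarRecords {
    // Splits the text into records terminated by "$$"; line breaks inside a record are removed.
    public static List<String> parse(String content) {
        List<String> records = new ArrayList<>();
        // "\\$" escapes the dollar sign, which otherwise means end-of-input in a regex.
        for (String part : content.split("\\$\\$")) {
            String record = part.replaceAll("\\R", "").trim();
            if (!record.isEmpty()) {
                records.add(record + "$$"); // put back the terminator consumed by split
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String content = "DL##April, 2014$$\nTI##A NEW INVESTMENT\n STRATEGY$$\n";
        List<String> records = parse(content);
        for (int i = 0; i < records.size(); i++) {
            System.out.println("a[" + i + "]" + records.get(i));
        }
    }
}
```

Because the line break is simply deleted, the space between the two halves of a wrapped record comes from the leading space of the continuation line, as in the sample data. Text after the last $$ would also get a terminator appended, which is why the sketch assumes every record is properly terminated.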
String partialLine = null;
while (inFile1.hasNext()) {
String line = inFile1.nextLine();
if (partialLine != null) {
line = partialLine + line;
partialLine = null;
}
if (line.endsWith("$$")) {
tokens.add(line);
} else {
partialLine = line;
}
}
if (partialLine != null) {
// Probably empty line.
}
A bit of buffering: a partial line (one missing its $$) is not added but kept in partialLine.
As you can see, even several consecutive partial lines would work.

Java string parsing to HashMap

I have an input string of the following format:
Message:id1:[label1:label2....:labelN]:id2:[label1:label2....:labelM]:id3:[label1:label2....:labelK]...
It is basically ids associated with sets of labels. There can be an arbitrary number of ids and labels associated with those ids.
I want to be able to parse this string and generate a HashMap of the form id->labels for quick look up later.
I was wondering what would be the most efficient way of parsing this message in java?
Something like this should work for you:
String str = "Message:id1:[label1:label2:labelN]:id2:[label1:label2:labelM]:id3:[label1:label2:labelK]";
Pattern p = Pattern.compile("([^:]+):\\[([^\\]]+)\\]");
Matcher m = p.matcher(str.substring(8));
Map<String, List<String>> idmap = new HashMap<String, List<String>>();
while (m.find()) {
List<String> l = new ArrayList<String>();
String[] tok = m.group(2).split(":");
for (String t: tok)
l.add(t);
idmap.put(m.group(1), l);
}
System.out.printf("IdMap %s%n", idmap);
Live Demo: http://ideone.com/EoieUt
Consider using Guava's Multimap
If you take the string you gave:
Message:id1:[label1:label2....:labelN]:id2:[label1:label2....:labelM]:id3:[label1:label2....:labelK]
If you call String.split("]") on it, you get:
Message:id1:[label1:label2....:labelN
:id2:[label1:label2....:labelM
:id3:[label1:label2....:labelK
If you loop through each of those, splitting on "\\[" (the bracket must be escaped because split takes a regular expression), you get:
Message:id1: label1:label2....:labelN
:id2: label1:label2....:labelM
:id3: label1:label2....:labelK
Then, you can parse the id name out of the first element in the String[], and the label names out of the second element, and store them in your Multimap.
If you don't want to use Guava, you can also use a Map<String, List<String>>
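The split-based steps above, with a plain Map<String, List<String>> instead of a Multimap, might look like the following sketch. The class IdLabelParser and the method parse are hypothetical names; the sketch assumes the labels themselves never contain ":", "[" or "]".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class IdLabelParser {
    // Parses "Message:id1:[a:b]:id2:[c]" into id -> list of labels.
    public static Map<String, List<String>> parse(String message) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String chunk : message.split("]")) {
            // Each chunk looks like "Message:id1:[label1:label2" or ":id2:[label1:label2".
            String[] halves = chunk.split("\\[", 2); // "[" must be escaped in a regex
            if (halves.length < 2) continue;          // no label list in this chunk
            String[] head = halves[0].split(":");
            if (head.length == 0) continue;           // nothing before the bracket
            String id = head[head.length - 1];        // the id is the last field before "["
            result.put(id, new ArrayList<>(Arrays.asList(halves[1].split(":"))));
        }
        return result;
    }

    public static void main(String[] args) {
        String str = "Message:id1:[label1:label2:labelN]:id2:[label1:label2:labelM]:id3:[label1:label2:labelK]";
        System.out.println(parse(str));
    }
}
```

Taking the last field before the bracket as the id means the leading "Message" prefix does not need special handling.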
The following code should meet your requirement.
import java.util.HashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexTest {
public static void main(String[] args){
String msg = "id1:[label1:label2]:id2:[label1:label2:label3]:id3:[label1:label2:label3:label4]";
Pattern pattern = Pattern.compile("id");
HashMap<String,String> kv = new HashMap<String,String>();
Matcher m = pattern.matcher(msg);
int prev = -1;
int next = -1;
int start = -1;
int end = -1;
String subMsg = "";
while (m.find()){
if(prev == -1){
prev = m.end();
}
else
{
next = m.end();
start = prev;
end = next;
subMsg = msg.substring(start,end);
kv.put(String.valueOf(subMsg.charAt(0)),subMsg.substring(subMsg.indexOf("["),subMsg.lastIndexOf("]")+1));
prev = next;
}
}
subMsg = msg.substring(next);
kv.put(String.valueOf(subMsg.charAt(0)),subMsg.substring(subMsg.indexOf("["),subMsg.lastIndexOf("]")+1));
System.out.println(kv);
}
}
Output : {3=[label1:label2:label3:label4], 2=[label1:label2:label3], 1=[label1:label2]}
Live Demo : http://ideone.com/HM7989

Can anyone tell me what is wrong with my Java code?

import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.Stack;
import java.util.StringTokenizer;
public class shift {
@SuppressWarnings("unchecked")
public static void main(String args[])
{
String speech = "Sentence:NounPhrase VerbPhrase:NounPhrase :Art Noun:VerbPhrase : Verb | Adverb Verb: Art : the | a : Verb :jumps | sings |: Noun:dog | cat | ";
HashMap<String, String> hashmap = new HashMap<String, String>();
String a;
StringTokenizer st = new StringTokenizer(speech,":");
while (st.hasMoreTokens()) {
String key=st.nextToken().trim();
String value=st.nextToken().trim();
StringTokenizer st1 = new StringTokenizer(value,"|");
while (st1.hasMoreTokens()) {
a=st1.nextToken().trim();
hashmap.put(key, a);
}
}
Set set = hashmap.entrySet();
Iterator ia = set.iterator();
while(ia.hasNext()) {
Map.Entry me = (Map.Entry)ia.next();
System.out.println(me.getKey()+"->"+me.getValue());
}
}
}
The output is:
Noun->cat
NounPhrase->Art Noun
Art->a
Sentence->NounPhrase VerbPhrase
Verb->sings
VerbPhrase->Adverb Verb
This code is missing some values in its output; for example, "the" and "jumps" are not shown.
Not sure I get your question fully, but keep in mind that a HashMap can only store one value per key.
If you want to store multiple verbs for the key "Verb", then you would have to declare the map using something like:
HashMap<String, Set<String>> hashmap = new HashMap<String, Set<String>>();
and store the words mapped to by "Verb" in a set.
Here is a brushed up (working) version of the code:
import java.util.*;
public class Shift {
public static void main(String args[]) {
String speech = "Sentence:NounPhrase VerbPhrase:NounPhrase :Art " +
"Noun:VerbPhrase : Verb | Adverb Verb: Art : the | " +
"a : Verb :jumps | sings |: Noun:dog | cat | ";
Map<String, Set<String>> hashmap = new HashMap<String, Set<String>>();
StringTokenizer st = new StringTokenizer(speech, ":");
while (st.hasMoreTokens()) {
String key = st.nextToken().trim();
String value = st.nextToken().trim();
StringTokenizer st1 = new StringTokenizer(value, "|");
while (st1.hasMoreTokens()) {
String a = st1.nextToken().trim();
if (!hashmap.containsKey(key))
hashmap.put(key, new HashSet<String>());
hashmap.get(key).add(a);
}
}
for (String key : hashmap.keySet())
System.out.printf("%s -> %s%n", key, hashmap.get(key));
}
}
You're overwriting the existing value when you call hashmap.put(key, a), since you're assigning a value to a key that already has a value.
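On Java 8 or later, the containsKey/put/get sequence can be collapsed into Map.computeIfAbsent, which creates the set on first use and never replaces an existing one. The following is a minimal sketch under that assumption; the class GrammarMap and the method parse are made-up names, and String.split is used instead of StringTokenizer.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class GrammarMap {
    // Parses "key:alt1 | alt2:key2:alt3" into key -> set of alternatives.
    public static Map<String, Set<String>> parse(String speech) {
        Map<String, Set<String>> rules = new LinkedHashMap<>();
        String[] fields = speech.split(":");
        // Fields alternate key, value; stop early if the last key has no value.
        for (int i = 0; i + 1 < fields.length; i += 2) {
            String key = fields[i].trim();
            for (String alt : fields[i + 1].split("\\|")) {
                String word = alt.trim();
                if (!word.isEmpty()) {
                    // computeIfAbsent creates the set on first use, so earlier values are kept.
                    rules.computeIfAbsent(key, k -> new LinkedHashSet<>()).add(word);
                }
            }
        }
        return rules;
    }

    public static void main(String[] args) {
        String speech = "Sentence:NounPhrase VerbPhrase:NounPhrase :Art Noun:VerbPhrase : Verb | Adverb Verb: Art : the | a : Verb :jumps | sings |: Noun:dog | cat | ";
        parse(speech).forEach((key, alts) -> System.out.printf("%s -> %s%n", key, alts));
    }
}
```

The bounds check in the loop also avoids the NoSuchElementException that the two consecutive nextToken() calls would throw on input with an odd number of fields.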
