Java String Parsing Without Regular Expressions


From a server, I get strings of the following form:
String x = "fixedWord1:var1 data[[fixedWord2:var2 fixedWord3:var3 data[[fixedWord4] [fixedWord5=var5 fixedWord6=var6 fixedWord7=var7]]] , [fixedWord2:var2 fixedWord3:var3 data[[fixedWord4][fixedWord5=var5 fixedWord6=var6 fixedWord7=var7]]]] fixedWord8:fixedWord8";
(Only spaces separate the groups of word-var pairs.)
Later, I want to store them in a HashMap, like myHashMap.put(fixedWord1, var1); and so on.
Problem:
Inside the first "data[......]" tag, the number of nested "data[..........]" tags is variable, and I don't know the length of the string in advance.
I don't know how to process such strings without resorting to String.split(), which is discouraged by the people who set our assignment (university).
I have searched the internet and couldn't find any websites that explain this.
It would be a great help if experienced people could give me some links to websites, or something like a "diagrammatic plan", so that I could code something.
EDIT:
I made a mistake in the string (please don't lynch me). The correct string is below (changed fixedWord7=var7 to fixedWord7=[var7]):
String x = "fixedWord1:var1 data[[fixedWord2:var2 fixedWord3:var3 data[[fixedWord4] [fixedWord5=var5 fixedWord6=var6 fixedWord7=[var7]]]] , [fixedWord2:var2 fixedWord3:var3 data[[fixedWord4][fixedWord5=var5 fixedWord6=var6 fixedWord7=[var7]]]]] fixedWord8:fixedWord8";

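I assume your string always follows the same pattern, containing "data", "[" and "]", and that the variable names and values never contain these strings.
Replace the strings "data[", "[", "]" and "," in the original string with a space. Use a space rather than an empty string, because the sample contains "][" with no space in between, and use replace() rather than replaceAll(): replaceAll() treats its first argument as a regular expression, and an unescaped "[" is not a valid pattern.
replace("data[", " ")
replace("[", " ")
etc.
Separate the string at the spaces, either with StringTokenizer or by looping through the String char by char.
You will then get a list of strings like
fixedWord1:var1
fixedWord2:var2
......
fixedWord4
fixedWord5=var5
......
Then separate each substring at ":" or "=" and put the name/value pair into the Map.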
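The steps above can also be done in a single pass over the characters, with no split() and no regular expressions. The following is only a minimal sketch: the class name PairScanner, the choice of separator characters, and the decision to ignore bare words such as "data" or "fixedWord4" are assumptions based on the sample string, not something stated in the question. Repeated keys (fixedWord2 occurs in both inner groups) simply overwrite each other here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PairScanner {

    // Separator characters, assumed from the sample string in the question.
    private static boolean isDelimiter(char c) {
        return c == ' ' || c == '[' || c == ']' || c == ',';
    }

    static Map<String, String> parse(String input) {
        Map<String, String> pairs = new LinkedHashMap<>();
        String pendingKey = null; // key whose value is bracketed, e.g. "fixedWord7=[var7]"
        int i = 0;
        while (i < input.length()) {
            // Skip a run of separators.
            while (i < input.length() && isDelimiter(input.charAt(i))) {
                i++;
            }
            int start = i;
            // Advance to the end of the current token.
            while (i < input.length() && !isDelimiter(input.charAt(i))) {
                i++;
            }
            if (start == i) {
                break; // nothing but separators remained
            }
            String token = input.substring(start, i);
            int sep = token.indexOf(':');
            if (sep < 0) {
                sep = token.indexOf('=');
            }
            if (sep > 0 && sep < token.length() - 1) {
                // Ordinary "name:value" or "name=value" token.
                pairs.put(token.substring(0, sep), token.substring(sep + 1));
                pendingKey = null;
            } else if (sep > 0) {
                // "name=" on its own: the value follows as the next token.
                pendingKey = token.substring(0, sep);
            } else if (pendingKey != null) {
                pairs.put(pendingKey, token);
                pendingKey = null;
            }
            // Bare words such as "data" or "fixedWord4" are ignored in this sketch.
        }
        return pairs;
    }

    public static void main(String[] args) {
        String x = "fixedWord1:var1 data[[fixedWord2:var2 fixedWord7=[var7]]] fixedWord8:fixedWord8";
        System.out.println(parse(x));
    }
}
```

If the nesting itself matters (for example, keeping the two inner groups apart), a depth counter on '[' and ']' or a small recursive parser would be the next step; this sketch flattens everything into one map.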

The problem is not entirely clear, but maybe something like this will work for you:
Pattern p = Pattern.compile("\\b(\\w+)[:=]\\[?(\\w+)");
Matcher m = p.matcher( x );
while( m.find() ) {
    System.out.println( "matched: " + m.group(1) + " - " + m.group(2) );
    hashMap.put ( m.group(1), m.group(2) );
}

Related

Find a duplicate word in a webpage using regular expression (clueless)

I'm trying to figure out a way to use regular expressions to find duplicate words on a webpage. I'm completely clueless and apologise in advance if I'm using the incorrect terminology.
So far I've found the following regular expressions, which work well but only on words that are consecutive (e.g. hello hello), not on words that are placed in different parts of the webpage or separated by another word (e.g. hello food hello):
\b(\w+)(\s+\1\b)*
\b(\w+(?:\s*\w*))\s+\1\b
I would be super grateful to anyone who can help. I realise I might not be in the right place, since I'm basically a noob.

Capture the first word (surrounded by word boundaries) in a group, and then backreference it later in a lookahead, after repeating optional characters in between:
\b(\w+)\b(?=.*\b\1\b)
https://regex101.com/r/TcS1UW/3
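In Java, that pattern could be applied roughly as follows. This is a sketch rather than a tested solution for real pages: the class name RepeatedWords is made up for illustration, Pattern.DOTALL is added on the assumption that the page text contains line breaks (without it, ".*" in the lookahead stops at the end of the line), and matching is case-sensitive.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RepeatedWords {

    // DOTALL lets ".*" in the lookahead cross line breaks.
    private static final Pattern REPEATED =
            Pattern.compile("\\b(\\w+)\\b(?=.*\\b\\1\\b)", Pattern.DOTALL);

    // Returns each word that occurs again somewhere later in the text.
    static Set<String> find(String text) {
        Set<String> repeated = new LinkedHashSet<>();
        Matcher m = REPEATED.matcher(text);
        while (m.find()) {
            repeated.add(m.group(1));
        }
        return repeated;
    }

    public static void main(String[] args) {
        System.out.println(find("hello food hello"));
    }
}
```

Because the lookahead rescans the rest of the text for every word, this can get slow on long pages; counting words in a map scales better.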

I would use Jsoup to get the text from the webpage. You can then keep track of the counts using a HashMap and search the map for any number of occurrences you want:
import java.util.HashMap;
import java.util.Map;
import org.jsoup.Jsoup;

String url = "https://en.wikipedia.org/wiki/Jsoup";
// get() performs the HTTP request and can throw IOException
String body = Jsoup.connect(url).get().body().text();
Map<String,Integer> counts = new HashMap<>();
// Count every space-separated token
for ( String word : body.split(" ") )
{
    counts.merge(word, 1, Integer::sum);
}
// Report every token that was seen at least twice
for ( String key : counts.keySet() )
{
    if ( counts.get(key) >= 2 )
    {
        System.out.println(key + " occurs " + counts.get(key) + " times.");
    }
}
You may need to clean up the map to get rid of some entries that aren't words, but this will get you most of the way.

Get rid of square brackets in Java

I am writing a program in SWI-Prolog and Java.
My problem is that when I print the result from Prolog, it comes back with [] and I don't want this.
The code for printing the results is
String t8 = "findDiseases(" + mylist + ",Diseases)."+ "\n";
Query q8 = new Query(t8);
// The Greek text means: "Based on the information given, you suffer from: "
Diagnosis_txt.append("Με τις δοθείσες πληροφορίες πάσχετε από: " +
        "\n" +
        "\n" +
        q8.oneSolution().get("Diseases"));
while (q8.hasMoreSolutions()) {
    Map<String, Term> s7 = q8.nextSolution();
    System.out.println("Answer is " + s7.get("Diseases"));
}
And the printed result is
Answer is '[|]'(drepanocytocis, '[|]'(drepanocytocis, '[]'))
I want to get rid of the [|] and the []. I want to print only drepanocytocis.

If you want to remove all special characters, you can do something like this:
answer = answer.replaceAll("[^a-zA-Z ]+", "").trim();
Update:
To also remove any duplicate spaces, the full solution can do something like this:
answer = answer.replaceAll("[^a-zA-Z ]+", " ")
        // remove duplicate spaces
        .replaceAll("[ ]([ ]+)", " ")
        // remove leading & trailing spaces
        .trim();
It can then be split on spaces to get the correctly sanitized answer.
However, as @andy suggested, I recommend going to the source of the data and building a proper data structure that returns exactly what you want. Post-processing should really only be used for data you have no control over, for old versions, and so on.
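Applied to the output shown in the question, the clean-up might look like the sketch below. The class and method names (TermCleaner, clean, distinctWords) are invented for illustration, and removing duplicates with a LinkedHashSet is an assumption about what "print only drepanocytocis" means, since the list contains the same name twice.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

final class TermCleaner {

    // Replaces everything except letters and spaces, then collapses and trims spaces.
    static String clean(String raw) {
        return raw.replaceAll("[^a-zA-Z ]+", " ")
                  .replaceAll("[ ]([ ]+)", " ")
                  .trim();
    }

    // Splits the cleaned text on spaces and drops repeated words, keeping first-seen order.
    static Set<String> distinctWords(String raw) {
        String cleaned = clean(raw);
        if (cleaned.isEmpty()) {
            return new LinkedHashSet<>();
        }
        return new LinkedHashSet<>(Arrays.asList(cleaned.split(" ")));
    }

    public static void main(String[] args) {
        String raw = "'[|]'(drepanocytocis, '[|]'(drepanocytocis, '[]'))";
        System.out.println(clean(raw));
        System.out.println(distinctWords(raw));
    }
}
```

Note that this only works while the disease names consist of plain ASCII letters; names with digits, underscores or accented characters would be mangled by the character class.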

Match string between multiple brackets

I have this very long JSON string. I would like to filter it and only get the data between the first pair of brackets. The problem is that I have many other brackets, so my regex pattern is not working properly.
Here is the JSON string:
String jsondata = "["
+"{"
+ "test: 63453645"
+"date: 2016-07-17"
+"{"
+ "id:534534"
+"}"
+ "blank : null"
+ "flags : null"
+ "}"
+"{"
+ "test: 543564236"
+"date: 2014-07-17"
+"{"
+ "id:6532465"
+"}"
+ "blank : null"
+ "flags : null"
+ "}"
+"]";
String pattern = "\\{[^{}]*\\}";
Pattern pr = Pattern.compile(pattern);
Matcher math = pr.matcher(jsondata);
if (math.find()) {
    System.out.println(math.group());
} else {
    System.out.println("nomatch");
}
The problem with my pattern is that it only prints up to the first } after the id:, but I want it to end at the last }, which comes after flags : null.
I also only want to print the first match, i.e. not the string after it, because that one also starts and ends with the same characters. That is why I have an if statement instead of a while loop.
Any suggestions? Thank you!
Regex with multiple brackets seems like a very difficult task. Can I match on the last string instead, i.e. from { to flags : null?

As I said in my comment, I usually use JSON-Simple.
There is a great tutorial on decoding.
It would look something like this:
// JSONValue.parse returns Object, so a cast is needed; this assumes valid JSON whose top level is an object
JSONObject obj = (JSONObject) JSONValue.parse(jsondata);
obj.get("test");
PS.
I do see some errors in your JSON data. Use jsonlint to verify that your JSON is formatted correctly.

This will grab everything after the first { up to the } that closes the first object:
String guts = jsondata.replaceAll("(?s)^.*?\\{(.*?flags : null[^}]*).*$", "$1");
The regex captures everything after the first { up to your marker text (flags : null) and all non-} characters following it.
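If relying on the marker text is too fragile, a different technique (not the regex from this answer) is to count brace depth while scanning the string. The sketch below assumes that braces never appear inside quoted values; the class and method names are made up for illustration.

```java
final class BraceScanner {

    // Returns the text from the first '{' through its matching '}', or null if there is none.
    static String firstBalancedBlock(String s) {
        int start = s.indexOf('{');
        if (start < 0) {
            return null;
        }
        int depth = 0;
        for (int i = start; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth == 0) {
                    // Back at the outermost level: this closes the first block.
                    return s.substring(start, i + 1);
                }
            }
        }
        return null; // braces were unbalanced
    }

    public static void main(String[] args) {
        System.out.println(firstBalancedBlock("[{test: 1 {id:2} flags : null}{test: 3}]"));
    }
}
```

Unlike the pattern \{[^{}]*\}, which can only match an innermost pair, the counter returns the whole first object including any nested braces.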

How to remove hard spaces with Jsoup?

I'm trying to remove hard spaces (from &nbsp; entities in the HTML). I can't remove them with .trim() or .replace(" ", ""), etc. I don't get it.
I even found a suggestion on Stack Overflow to try \\u00a0, but that didn't work either.
I tried this (since text() returns actual hard space characters, U+00A0):
System.out.println( "'"+fields.get(6).text().replace("\\u00a0", "")+"'" ); //'94,00 '
System.out.println( "'"+fields.get(6).text().replace(" ", "")+"'" ); //'94,00 '
System.out.println( "'"+fields.get(6).text().trim()+"'"); //'94,00 '
System.out.println( "'"+fields.get(6).html().replace("&nbsp;", "")+"'"); //'94,00' works
But I can't figure out why I can't remove the white space with .text().

Your first attempt was very nearly it, and you're quite right that Jsoup maps &nbsp; to U+00A0. You just don't want the double backslash in your string:
System.out.println( "'"+fields.get(6).text().replace("\u00a0", "")+"'" ); //'94,00'
// Just one ------------------------------------------^
replace doesn't use regular expressions, so you aren't trying to pass a literal backslash through to the regex level. You just want to specify character U+00A0 in the string.
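The difference between the three attempts can be checked without Jsoup at all. In this small sketch the string "94,00" followed by U+00A0 is a stand-in for what text() returns, and the class name HardSpaces is only illustrative.

```java
final class HardSpaces {

    // Removes every NO-BREAK SPACE (U+00A0) from the string.
    static String removeNbsp(String s) {
        return s.replace("\u00a0", "");
    }

    public static void main(String[] args) {
        String value = "94,00\u00a0"; // stand-in for fields.get(6).text()

        // trim() only strips characters up to U+0020, so the hard space survives.
        System.out.println(value.trim().length());

        // "\\u00a0" is the six-character text backslash-u-0-0-a-0, which never occurs in the value.
        System.out.println(value.replace("\\u00a0", "").length());

        // A single backslash yields the actual U+00A0 character, which is removed.
        System.out.println("'" + removeNbsp(value) + "'");
    }
}
```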

The question has been edited to reflect the true problem.
New answer:
The hard space, i.e. the &nbsp; entity (Unicode character NO-BREAK SPACE, U+00A0), can be represented in Java by the character \u00a0. The code therefore becomes the following, where str is the string obtained from the text() method. Strings are immutable, so the result has to be assigned back:
str = str.replace("\u00a0", "");
Old answer:
Using the Jsoup library,
import org.jsoup.parser.Parser;
String str1 = Parser.unescapeEntities("last week, Ovokerie Ogbeta", false);
String str2 = Parser.unescapeEntities("Entered &raquo; Here", false);
System.out.println(str1 + " " + str2);
Prints out:
last week, Ovokerie Ogbeta Entered » Here

Regular expression help in java

I am lost when it comes to building regex strings. I need a regular expression that does the following.
I have the following strings:
[~class:obj]
[~class|class2|more classes:obj]
[!class:obj]
[!class|class2|more classes:obj]
[?method:class]
[text]
A string can contain several of the above. An example string would be "[if] [!class:obj]".
I want to know what is between the [] and have it broken into match groups. For example, the first match group would be the symbol, if present (~, ! or ?). The next would be what comes before the :, which could be class or class|class2|etc. The last would be what is to the right of the :, stopping before the ]. There may be no : and nothing before it, just something between the [].
So, how would I go about writing this regex? And is it possible to give the match groups names so I know what was matched?
This is for a java project.

If you're sure enough of your inputs, you can probably use something like /\[(\~|\!|\?)?(?:((?:[^:\]]*?)+):)?([^\]]+?)\]/ (to translate that into Java, escape the backslashes and use quotation marks instead of forward slashes).
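On the question of naming the groups: Java 7 and later support named groups with the (?<name>...) syntax, read back with Matcher.group("name"). The sketch below uses a simplified variant of the pattern above, not a character-for-character translation, and the group names symbol, left and right as well as the class name TagParser are made up for illustration. A group that did not take part in a match is printed as an empty string here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagParser {

    // symbol: optional ~, ! or ?   left: optional text before ':'   right: the rest up to ']'
    private static final Pattern TAG = Pattern.compile(
            "\\[(?<symbol>[~!?])?(?:(?<left>[^:\\]]*):)?(?<right>[^\\]]+)\\]");

    private static String orEmpty(String s) {
        return s == null ? "" : s;
    }

    // Returns one "symbol;left;right" line per bracketed tag found in the input.
    static List<String> describe(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TAG.matcher(input);
        while (m.find()) {
            out.add(orEmpty(m.group("symbol")) + ";"
                    + orEmpty(m.group("left")) + ";"
                    + orEmpty(m.group("right")));
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : describe("[if] [!class|class2:obj]")) {
            System.out.println(line);
        }
    }
}
```

The left group still contains the whole class|class2|... list; separating the individual class names would be a second step on that group's text.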

Here are some web sites that might be helpful:
http://www.cis.upenn.edu/~matuszek/General/RegexTester/regex-tester.html
http://txt2re.com/index.php3?s=Test+test+june+2011+test&submit=Show+Matches
http://www.regexplanet.com/simple/

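I believe that this should work (note that the outer square brackets must be escaped, otherwise the whole expression is read as a character class):
/\[(.*?)(?:\|(.*?))*\]/
Also:
[a-z]*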

Try this code:
import static java.util.Arrays.asList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final Pattern
    // outerP finds each [...] block; innerP splits a block into symbol, left part and right part
    outerP = Pattern.compile("\\[.*?\\]"),
    innerP = Pattern.compile("\\[([~!?]?)([^:]*):?(.*)\\]");
for (String s : asList(
    "[~class:obj]",
    "[if][~class:obj]",
    "[~class|class2|more classes:obj]",
    "[!class:obj]",
    "[!class|class2|more classes:obj]",
    "[?method:class]",
    "[text]"))
{
    final Matcher outerM = outerP.matcher(s);
    System.out.println("Input: " + s);
    while (outerM.find()) {
        final Matcher m = innerP.matcher(outerM.group());
        if (m.matches()) System.out.println(
            m.group(1) + ";" + m.group(2) + ";" + m.group(3));
        else System.out.println("No match");
    }
}
