String split for specific element - java

I need to extract from the following string only the data between the "CHAR" tags:
Input:
<MSG><KEY>name.extObject</KEY><PARAM><CHAR>Number</CHAR><CHAR>7015:188188</CHAR></PARAM></MSG>
Expected output: Number 7015:188188
I am looking for something efficient.
Any recommendations?
Thanks

It is good practice to avoid parsing XML/HTML with regex; use a proper XML parser instead. I like jsoup, so here is an example of how it can be done with this library:
String data = "<MSG><KEY>name.extObject</KEY><PARAM><CHAR>Number</CHAR><CHAR>7015:188188</CHAR></PARAM></MSG>";
Document doc = Jsoup.parse(data, "", Parser.xmlParser());
String charText = doc.select("CHAR").text();
System.out.println(charText);
Output: Number 7015:188188
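If adding jsoup is not an option, the same idea can be sketched with the DOM parser that ships with the JDK. This is only an illustration, assuming the input is well-formed XML; the class name CharTagDomExample and the helper charValues are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class CharTagDomExample {

    // Hypothetical helper: returns the text of every <CHAR> element, in document order.
    static List<String> charValues(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("CHAR");
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            // Assumption: malformed input is treated as a caller error.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String data = "<MSG><KEY>name.extObject</KEY><PARAM><CHAR>Number</CHAR>"
                + "<CHAR>7015:188188</CHAR></PARAM></MSG>";
        System.out.println(String.join(" ", charValues(data))); // Number 7015:188188
    }
}
```

Returning a list rather than one joined string keeps the individual values available if you need them separately later.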

I think you meant to capture the content between the tags rather than split the string.
It's well known that you should NOT use a regex to parse XHTML, since you can get weird content. Still, if you want a regex, you can use one like this:
<CHAR>(.*?)<\/CHAR>
Working demo
And you can use this Java code:
String line = "<MSG><KEY>name.extObject</KEY><PARAM><CHAR>Number</CHAR><CHAR>7015:188188</CHAR></PARAM></MSG>";
Pattern pattern = Pattern.compile("<CHAR>(.*?)<\\/CHAR>");
Matcher matcher = pattern.matcher(line);
String result = "";
while (matcher.find()) {
result += matcher.group(1) + " ";
}
System.out.println(result); //Prints: Number 7015:188188
Update: as Pshemo pointed out in his comment:
/ is not a special character in the Java regex engine. You don't have to escape it.
So, you can use:
Pattern pattern = Pattern.compile("<CHAR>(.*?)</CHAR>");
Btw, I really like Pshemo's answer; it's a nice way to solve this without a regex.
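A slightly tidier variant of the regex loop, as a sketch: it collects the captured values in a list and joins them at the end, which avoids the trailing space that string concatenation in the loop leaves behind. The class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharTagRegexExample {

    // Compiled once and reused; the lazy (.*?) stops at the first closing tag.
    private static final Pattern CHAR_TAG = Pattern.compile("<CHAR>(.*?)</CHAR>");

    // Hypothetical helper: joins the content of all <CHAR> tags with single spaces.
    static String joinCharValues(String line) {
        List<String> values = new ArrayList<>();
        Matcher matcher = CHAR_TAG.matcher(line);
        while (matcher.find()) {
            values.add(matcher.group(1));
        }
        return String.join(" ", values);
    }

    public static void main(String[] args) {
        String line = "<MSG><KEY>name.extObject</KEY><PARAM><CHAR>Number</CHAR>"
                + "<CHAR>7015:188188</CHAR></PARAM></MSG>";
        System.out.println(joinCharValues(line)); // Number 7015:188188
    }
}
```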

In case you know the tag value is always some digits, optionally followed by a colon and more digits, and it is the only <CHAR> tag with such a numeric value, you may want to use this regex:
(?<=<CHAR>)\d+(?::\d+)?(?=<\/CHAR>)
Java string:
String pattern = "(?<=<CHAR>)\\d+(?::\\d+)?(?=</CHAR>)";
Sample code:
String str = "<MSG><KEY>name.extObject</KEY><PARAM><CHAR>Number</CHAR><CHAR>7015:188188</CHAR></PARAM></MSG>";
Pattern ptrn = Pattern.compile("(?<=<CHAR>)\\d+(?::\\d+)?(?=</CHAR>)");
Matcher matcher = ptrn.matcher(str);
if (matcher.find()) {
System.out.println(matcher.group(0));
}
Output:
7015:188188

String s = inputString;
String result="";
while(s.indexOf("<CHAR>") != -1)
{
result += s.substring(s.indexOf("<CHAR>") + "<CHAR>".length(), s.indexOf("</CHAR>")) + " ";
s = s.substring(s.indexOf("</CHAR>") + "</CHAR>".length());
}
//result is now the desired output
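The loop above throws a StringIndexOutOfBoundsException if an opening <CHAR> has no closing tag, and it re-creates the string on every pass. A guarded sketch of the same indexOf approach could look like this (the names are illustrative, and stopping silently at an unclosed tag is an assumption about the desired behavior):

```java
class CharTagIndexOfExample {

    // Hypothetical helper: same indexOf idea, but scanning forward with an offset.
    static String extract(String input) {
        final String open = "<CHAR>";
        final String close = "</CHAR>";
        StringBuilder result = new StringBuilder();
        int from = 0;
        while (true) {
            int start = input.indexOf(open, from);
            if (start < 0) {
                break; // no more opening tags
            }
            int end = input.indexOf(close, start + open.length());
            if (end < 0) {
                break; // unclosed tag: stop instead of throwing
            }
            if (result.length() > 0) {
                result.append(' ');
            }
            // Append only the characters between the two tags.
            result.append(input, start + open.length(), end);
            from = end + close.length();
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String input = "<MSG><KEY>name.extObject</KEY><PARAM><CHAR>Number</CHAR>"
                + "<CHAR>7015:188188</CHAR></PARAM></MSG>";
        System.out.println(extract(input)); // Number 7015:188188
    }
}
```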

The regex for that is: <CHAR>(.*?)</CHAR>
However, it is better to use an XML parser for that.

Related

Regex to get value between two colon excluding the colons

I have a string like this:
something:POST:/some/path
Now I want to take the POST alone from the string. I did this by using this regex
:([a-zA-Z]+):
But this gives me a value along with colons. ie I get this:
:POST:
but I need this
POST
My code to match the same and replace it is as follows:
String ss = "something:POST:/some/path/";
Pattern pattern = Pattern.compile(":([a-zA-Z]+):");
Matcher matcher = pattern.matcher(ss);
if (matcher.find()) {
System.out.println(matcher.group());
ss = ss.replaceFirst(":([a-zA-Z]+):", "*");
}
System.out.println(ss);
EDIT:
I've decided to use the lookahead/lookbehind regex since I did not want to use replace with colons such as :*:. This is my final solution.
String s = "something:POST:/some/path/";
String regex = "(?<=:)[a-zA-Z]+(?=:)";
Matcher matcher = Pattern.compile(regex).matcher(s);
if (matcher.find()) {
s = s.replaceFirst(matcher.group(), "*");
System.out.println("replaced: " + s);
}
else {
System.out.println("not replaced: " + s);
}
There are two approaches:
Keep your Java code, and use lookahead/lookbehind (?<=:)[a-zA-Z]+(?=:), or
Change your Java code to replace the result with ":*:"
Note: You may want to define a String constant for your regex, since you use it in different calls.
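As a sketch of the first approach with the regex held in a constant: the lookaround pattern can be handed straight to the replacement call, so the matched text never has to be passed back in as a second regex. Passing matcher.group() to replaceFirst re-interprets the matched text as a pattern and replaces its first occurrence anywhere in the string, which is not necessarily the one between the colons. The names below are made up for the example.

```java
import java.util.regex.Pattern;

class MethodMaskExample {

    // Letters with a colon on both sides; the colons are asserted, not consumed.
    private static final Pattern BETWEEN_COLONS = Pattern.compile("(?<=:)[a-zA-Z]+(?=:)");

    // Hypothetical helper: replaces the first colon-delimited word with "*".
    static String maskFirst(String s) {
        return BETWEEN_COLONS.matcher(s).replaceFirst("*");
    }

    public static void main(String[] args) {
        System.out.println(maskFirst("something:POST:/some/path/")); // something:*:/some/path/
        System.out.println(maskFirst("POSTER:POST:/p"));             // POSTER:*:/p
    }
}
```

The second call shows the difference: replacing the first occurrence of the literal text POST would have produced *ER:POST:/p instead.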
As pointed out, the regex capturing group can be used for the replacement.
The following code did it:
String ss = "something:POST:/some/path/";
Pattern pattern = Pattern.compile(":([a-zA-Z]+):");
Matcher matcher = pattern.matcher(ss);
if (matcher.find()) {
ss = ss.replaceFirst(matcher.group(1), "*");
}
System.out.println(ss);
UPDATE
Looking at your update, you only need replaceFirst:
String result = s.replaceFirst(":[a-zA-Z]+:", ":*:");
See the Java demo
When you use (?<=:)[a-zA-Z]+(?=:), the regex engine checks each location inside the string for a : before it and, once one is found, tries to match 1+ ASCII letters and then asserts that there is a : after them. With :[A-Za-z]+:, the checking only starts after the regex engine has found a : character. Then, after matching :POST:, the replacement pattern replaces the whole match. It is totally OK to hardcode colons in the replacement pattern since they are hardcoded in the regex pattern.
Original answer
You just need to access Group 1:
if (matcher.find()) {
System.out.println(matcher.group(1));
}
See Java demo
Your :([a-zA-Z]+): regex contains a capturing group (see (....) subpattern). These groups are numbered automatically: the first one has an index of 1, the second has the index of 2, etc.
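To make the numbering concrete, here is a small sketch that prints both the whole match and Group 1 for the string from the question. The class and method names are only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberingExample {

    // Returns { whole match, group 1 } for the first match, or an empty array if none.
    static String[] wholeMatchAndGroup1(String s) {
        Matcher m = Pattern.compile(":([a-zA-Z]+):").matcher(s);
        if (!m.find()) {
            return new String[0];
        }
        // group(0), the same as group(), is the whole match; group(1) is the first (...) pair.
        return new String[] { m.group(0), m.group(1) };
    }

    public static void main(String[] args) {
        String[] parts = wholeMatchAndGroup1("something:POST:/some/path/");
        System.out.println(parts[0]); // :POST:
        System.out.println(parts[1]); // POST
    }
}
```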
To replace it, use Matcher#appendReplacement():
String s = "something:POST:/some/path/";
StringBuffer result = new StringBuffer();
Matcher m = Pattern.compile(":([a-zA-Z]+):").matcher(s);
while (m.find()) {
m.appendReplacement(result, ":*:");
}
m.appendTail(result);
System.out.println(result.toString());
See another demo
This is your solution:
regex = (:)([a-zA-Z]+)(:)
And code is:
String ss = "something:POST:/some/path/";
ss = ss.replaceFirst("(:)([a-zA-Z]+)(:)", "$1*$3");
ss now contains:
something:*:/some/path/
Which I believe is what you are looking for...

Text between <pre> tag does not retain line breaks when using regex Java

this is my problem.
String pattern1 = "<pre.*?>(.+?)</pre>";
Matcher m = Pattern.compile(pattern1).matcher(html);
if(m.find()) {
String temp = m.group(1);
System.out.println(temp);
}
temp does not retain line breaks...it flows as a single line. How to keep the line breaks within temp?
You shouldn't parse HTML with regular expressions, but to fix this use the dotall modifier ...
String pattern1 = "(?s)<pre[^>]*>(.+?)</pre>";
Here the (?s) prefix forces the . to span across newline sequences.
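A quick way to see what the flag changes is to run the same pattern with and without it on input that contains a newline. This is a minimal sketch with an invented helper, firstPre, and made-up sample HTML:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PreDotallExample {

    // Hypothetical helper: content of the first <pre> element, or null if nothing matches.
    static String firstPre(String html, int flags) {
        Matcher m = Pattern.compile("<pre[^>]*>(.+?)</pre>", flags).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<pre>line one\nline two</pre>";
        // Without DOTALL the dot cannot cross the newline, so there is no match at all.
        System.out.println(firstPre(html, 0));              // null
        // With DOTALL the captured text keeps its line break.
        System.out.println(firstPre(html, Pattern.DOTALL)); // prints the two lines
    }
}
```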
Using JSoup: html parser
It's very well known that you shouldn't use regex to parse HTML content; you should use an HTML parser instead. You can see below how to do it with JSoup:
String html = "<p>lorem ipsum</p><pre>Hello World</pre><p>dolor sit amet</p>";
Document document = Jsoup.parse(html);
Elements pres = document.select("pre");
for (Element pre : pres) {
System.out.println(pre.text());
}
Pattern.DOTALL: single line compiled flag
However, if you still want to use regex, bear in mind that the dot is a wildcard that doesn't match \n unless you tell it to. You can achieve this in different ways, such as using Pattern.DOTALL:
String pattern1 = "<pre.*?>(.+?)</pre>";
Matcher m = Pattern.compile(pattern1, Pattern.DOTALL).matcher(html);
if(m.find()) {
String temp = m.group(1);
System.out.println(temp);
}
Inline Single line flag:
Or using the s flag inline in the regex like this:
String pattern1 = "(?s)<pre.*?>(.+?)</pre>";
Matcher m = Pattern.compile(pattern1).matcher(html);
if(m.find()) {
String temp = m.group(1);
System.out.println(temp);
}
Regex trick
Or you can use a regex trick that consists of using complementary sets such as [\s\S], [\d\D], [\w\W], etc., like this:
String pattern1 = "<pre.*?>([\\s\\S]+?)</pre>";
Matcher m = Pattern.compile(pattern1).matcher(html);
if(m.find()) {
String temp = m.group(1);
System.out.println(temp);
}
But as nhahtdh pointed out in his comment, this trick might hurt regex engine performance.

How to replace all matching patterns only if it does not match another pattern?

I want to replace a substring that matches a pattern, only if it does not match a different pattern. For example, in the code shown below, I want to replace all '%s' but leave ':%s' untouched.
String template1 = "Hello:%s";
String template2 = "Hello%s";
String regex = "[%s&&^[:%s]]";
String str = template1.replaceAll(regex, "");
System.out.println(str);
str = template2.replaceAll(regex, "");
System.out.println(str);
The output should be:
Hello:%s
Hello
I am missing something in my regex. Any clues? Thanks!
Use a negative lookbehind to achieve your goal:
String regex = "(?<!:)%s";
It matches %s only if there is not a : right before it.
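Applied to the two templates from the question, a minimal sketch looks like this (the class and method names are illustrative):

```java
class LookbehindReplaceExample {

    // Removes every %s that is not immediately preceded by a colon.
    static String stripPlaceholders(String template) {
        return template.replaceAll("(?<!:)%s", "");
    }

    public static void main(String[] args) {
        System.out.println(stripPlaceholders("Hello:%s")); // Hello:%s
        System.out.println(stripPlaceholders("Hello%s"));  // Hello
    }
}
```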

parsing string with substring

What is the best way to parse a string?
Example
SomeName_Some1_Name2_SomeName3
I want to extract SomeName. What is the best way to do it: with substring and calculating positions, or is there a better way?
You can match the pattern SomeName to extract it:
String str= "SomeName_Some1_Name2_SomeName3";
Pattern ptrn= Pattern.compile("SomeName");
Matcher matcher = ptrn.matcher(str);
while(matcher.find()){
System.out.println(matcher.group());
}
Split it on the underscore _ using the split() method.
Take index 0 of the array returned by the previous step.
If you know the delimiter then you can just try this:
System.out.println("SomeName_Some1_Name2_SomeName3".split("_")[0]);
See also: Javadoc of String.split()
It depends on your situation and whether you're interested in the other fields.
If you are, split the string using the _ separator.
If you just want one part of the string, I'd go for substring in combination with indexOf('_').
If you want all occurrences, you could also find all occurrences of 'SomeName' in your text.
Use regex and Pattern Matcher API to get SomeName.
Here you go:
String str = "SomeName_Some1_Name2_SomeName3";
String newStr = str.substring(0, str.indexOf("_"));
System.out.println(newStr);
Output:
SomeName
String your_String = "SomeName_Some1_Name2_SomeName3";
your_String = your_String.split("_")[0];
Log.v("log","your string "+ your_String);
String str = "SomeName_Some1_Name2_SomeName3";
String output = str.split ( "_" ) [ 0 ];
You will get SomeName as the output.
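One caveat for the substring version: if the input contains no underscore, indexOf returns -1 and substring(0, -1) throws. A guarded sketch, assuming you want the whole string back in that case (the helper name firstField is made up):

```java
class FirstFieldExample {

    // Returns everything before the first underscore, or the whole string if there is none.
    static String firstField(String s) {
        int idx = s.indexOf('_');
        return idx < 0 ? s : s.substring(0, idx);
    }

    public static void main(String[] args) {
        System.out.println(firstField("SomeName_Some1_Name2_SomeName3")); // SomeName
        System.out.println(firstField("SomeName"));                       // SomeName
    }
}
```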

how to ignore a value in a text?

I have a string like this :
EQ=ENABLED,QLPUB=50,EPRE=ENABLED
How can I ignore the value of QLPUB? I actually want to check for this string in 3000 lines, but I want to ignore the 50.
Is there any way to ignore it, for example with a Java regular expression or %s or ...?
Try this regular expression:
s = s.replaceAll("(^|,)QLPUB=[^,]*", "");
See it working online: ideone
If value of QLPUB is always numeric you can use the following regex:
^EQ=ENABLED,QLPUB=\d*,EPRE=ENABLED$
Here's an example:
String text = "EQ=ENABLED,QLPUB=502,EPRE=ENABLED";
String pattern = "^EQ=ENABLED,QLPUB=\\d*,EPRE=ENABLED$";
Pattern compiledPattern = Pattern.compile(pattern);
Matcher matcher = compiledPattern.matcher(text);
if(matcher.find()) {
System.out.println(matcher.group());
}
If the value of QLPUB can be anything except a comma, change the regex to:
^EQ=ENABLED,QLPUB=[^,]*,EPRE=ENABLED$
You could use the regex /^EQ=ENABLED,QLPUB=\d+,EPRE=ENABLED$/. In Java this would look like this:
String myString = "EQ=ENABLED,QLPUB=50,EPRE=ENABLED";
if(myString.matches("^EQ=ENABLED,QLPUB=\\d+,EPRE=ENABLED$"))
{
//your string matches regardless of the value of QLPUB
}
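Since the check has to run over a few thousand lines, it may be worth compiling the pattern once and reusing it. The sketch below assumes the lines are already in a List<String>; the class and method names are invented. Note that matches() tests the whole line, so the ^ and $ anchors can be left out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class QlpubLineCheckExample {

    // Compiled once; [^,]* accepts any QLPUB value that contains no comma.
    private static final Pattern LINE = Pattern.compile("EQ=ENABLED,QLPUB=[^,]*,EPRE=ENABLED");

    // Hypothetical helper: counts the lines that match regardless of the QLPUB value.
    static long countMatching(List<String> lines) {
        return lines.stream().filter(l -> LINE.matcher(l).matches()).count();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "EQ=ENABLED,QLPUB=50,EPRE=ENABLED",
                "EQ=ENABLED,QLPUB=7,EPRE=ENABLED",
                "EQ=DISABLED,QLPUB=50,EPRE=ENABLED");
        System.out.println(countMatching(lines)); // 2
    }
}
```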
