Java regex to parse a particular semicolon delimited param from a URL? - java

I have a URL I'm expecting like:
www.somewebsite.com/misc-session/;session-id=1FSDSF2132FSADASD13213
I want to parse out
session-id=1FSDSF2132FSADASD13213
Using a regular expression in Java, what would be the best approach to take for this?
Using a regex testing website I've experimented with some different ways, but I'm wondering which approach is the most fail-safe and still works in case the URL is actually formed like:
www.somewebsite.com/misc-session/;session-id=1FSDSF2132FSADASD13213?someExtraParam=false
or
www.somewebsite.com/misc-session/extra-path/;session-id=1FSDSF2132FSADASD13213?someExtraParam=false
I am always just looking for the value of "session-id".
EDIT:
The value of session-id is NOT limited to digits and is guaranteed to contain a combination of both letters and digits.

What is the best approach that is the most fail safe, and protected.
Well I think matching word boundary on both sides will be enough.
Regex: \bsession-id=\d+\b
Note: use \\d and \\b if the regex flavor you are using needs double escaping (as a Java string literal does).
In case session-id has characters in the range [A-Za-z0-9], use this regex.
Regex: \bsession-id=[A-Za-z0-9]+\b
Remember to include
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Try this one:
String str = "www.somewebsite.com/misc-session/;session-id=213213213";
Pattern p = Pattern.compile("(session-id=\\d+)");
Matcher m = p.matcher(str);
if (m.find()) {
    System.out.println(m.group(0));
}
Note that session-id= is always given and you are interested in the number that follows, which is represented by \d (write \\d in a Java string). The + means one or more digits.
For a detailed breakdown of the pattern, look at the description at Regex101.
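Because the question's edit says the value mixes letters and digits, a digits-only pattern stops short of the full value. The following is a minimal sketch, assuming the value consists only of ASCII letters and digits and is always introduced by ";session-id="; the class and method names are made up for illustration. It captures just the value, so a trailing query string or an extra path segment does not matter:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SessionIdExtractor {
    // Assumption: the value contains only ASCII letters and digits,
    // so the match stops at '?', '/', ';' or the end of the URL.
    private static final Pattern SESSION_ID =
            Pattern.compile(";session-id=([A-Za-z0-9]+)");

    // Returns the session-id value, or an empty Optional if the parameter is absent.
    static Optional<String> extract(String url) {
        Matcher m = SESSION_ID.matcher(url);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String url = "www.somewebsite.com/misc-session/extra-path/"
                + ";session-id=1FSDSF2132FSADASD13213?someExtraParam=false";
        // prints 1FSDSF2132FSADASD13213
        System.out.println(extract(url).orElse("not found"));
    }
}
```

Returning an Optional forces the caller to handle the case where the URL carries no session-id at all.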

Related

Regex extract string in java

I'm trying to extract a substring from a String using a regex in Java
Pattern pattern = Pattern.compile("((.|\\n)*).{4}InsurerId>\\S*.{5}InsurerId>((.|\\n)*)");
Matcher matcher = pattern.matcher(abc);
I'm trying to extract the value between
<_1:InsurerId>F2021633_V1</_1:InsurerId>
I'm not sure where I am going wrong, but I don't get any output for
if (matcher.find())
{
System.out.println(matcher.group(1));
}
You can use:
Pattern pattern = Pattern.compile("<([^:]+:InsurerId)>([^<]*)</\\1>");
Matcher matcher = pattern.matcher(abc);
if (matcher.find()) {
System.out.println(matcher.group(2));
}
You may want to use the totally awesome page http://regex101.com/ to test your regular expressions. As you can see at https://regex101.com/r/rV8uM3/1, you only have empty capturing groups, but let me explain to you what you did. :D
((.|\n)*) This matches any character or a new line, any number of times. It is a capturing group, so your first group will always be everything before <_1:InsurerId>, or an empty string. Note that a plain .* does not match new lines in Java unless the pattern is compiled with Pattern.DOTALL. You can leave this part out entirely, as it isn't part of the string you want to match; keeping it is actually a problem if you have multiple InsurerIds in your file and want to get them all.
.{4}InsurerId> This matches "InsurerId>" with any four characters in front of it, which is what you want. As the first character is probably always an opening angle bracket (and you don't want to match things like "<ExampleInsurerId>"), I'd suggest using <.{3}InsurerId> instead. This could still have some problems (<Test id="<" xInsurerId>), so if you know that the prefix is exactly "_<a digit>:", why not use <_\d:InsurerId>?
\S* matches everything except whitespace, which is probably not the best idea, as XML and similar files can be written without any spaces at all. You want everything up to the next tag, so use [^<]*, which matches everything except an opening angle bracket. You also want to read this value later, so you have to use a capturing group: ([^<]*)
.{5}InsurerId> The same thing here: use <\/.{3}InsurerId> or <\/_\d:InsurerId> (forward slashes are delimiters in some other regex implementations, so I suggest escaping them)
((.|\n)*) Again the same thing, just leave it out
The resulting Regular Expression would then be the following:
<_\d:InsurerId>([^<]*)<\/_\d:InsurerId>
And as you can see at https://regex101.com/r/mU6zZ3/1 - you have exactly one match, and it's even "F2021633_V1" :D
For Java, you have to escape the backslashes, so the resulting code would look like this:
Pattern pattern = Pattern.compile("<_\\d:InsurerId>([^<]*)<\\/_\\d:InsurerId>");
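To illustrate the point about multiple InsurerIds, here is a small sketch that loops with find() over the pattern above and collects every value. The class name, the method name and the second element in the sample input are invented for the example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class InsurerIdFinder {
    // Same expression as above; the escaped slash is optional in Java.
    private static final Pattern INSURER_ID =
            Pattern.compile("<_\\d:InsurerId>([^<]*)<\\/_\\d:InsurerId>");

    // Collects the text of every InsurerId element found in the input.
    static List<String> findAll(String xml) {
        List<String> ids = new ArrayList<>();
        Matcher matcher = INSURER_ID.matcher(xml);
        while (matcher.find()) {
            ids.add(matcher.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        // The second element is hypothetical sample data.
        String xml = "<_1:InsurerId>F2021633_V1</_1:InsurerId>\n"
                + "<_2:InsurerId>X9</_2:InsurerId>";
        // prints [F2021633_V1, X9]
        System.out.println(findAll(xml));
    }
}
```

Because there is no leading or trailing catch-all group, each call to find() resumes right after the previous match.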
If you are using Java 7 or above, you can use named groups to make the regex a little more readable (also note the \k backreference, which makes the closing tag match the opening tag):
Pattern pattern = Pattern.compile("(?:<(?<InsurancePrefix>.+)InsurerId>)(?<id>[A-Z0-9_]+)</\\k<InsurancePrefix>InsurerId>");
Matcher matcher = pattern.matcher("<_1:InsurerId>F2021633_V1</_1:InsurerId>");
if (matcher.matches()) {
System.out.println(matcher.group("id"));
}
Thanks to the backreference, matches() fails, for example, on this text
<_1:InsurerId>F2021633_V1</_2:InsurerId>
which is the correct behavior, since the opening and closing tags differ.
Javadoc has a good explanation: https://docs.oracle.com/javase/8/docs/api/
You might also consider using a different tool (an XML parser) instead of a regex, since other people have to maintain your code and a complex regex is usually difficult to understand.
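As a rough sketch of the XML parser route, the JDK's built-in DOM parser can look the element up by its local name, whatever the prefix is. This assumes the input is a complete, well-formed document in which the _1 prefix is declared; the wrapper element Policy, the namespace URI urn:example:insurer and the class name below are invented for the example:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class InsurerIdDomReader {
    // Returns the text of the first InsurerId element, or null if there is none.
    static String firstInsurerId(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // lets us ignore the "_1:" prefix
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // "*" matches any namespace; "InsurerId" is the local name.
            NodeList nodes = doc.getElementsByTagNameNS("*", "InsurerId");
            return nodes.getLength() > 0 ? nodes.item(0).getTextContent() : null;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical wrapper element and namespace URI.
        String xml = "<Policy xmlns:_1=\"urn:example:insurer\">"
                + "<_1:InsurerId>F2021633_V1</_1:InsurerId></Policy>";
        System.out.println(firstInsurerId(xml));
    }
}
```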

Regex dot operator in Java seems to always work greedy

I'm trying to fetch first paragraph content from HTML snippet... nothing easier, huh? But for some reason, .*? operator seems to work greedy:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class test
{
    public static void main(String[] args)
    {
        Pattern regex = Pattern.compile("<p(?: [^>]*)?>(.*?)</p>", Pattern.DOTALL);
        Matcher match = regex.matcher("<p class=\"baz\">foo</p> <p>bar</p>");
        System.out.println(match.matches());
        System.out.println(match.group(1));
    }
}
I expect to match just the content of the first paragraph (foo), but here is the result:
$ javac test.java && java test
true
foo</p> <p>bar
Any reason why the .*? continues to match after first </p>?
As explained by npinti in the comments, the problem is caused by calling match.matches(). This attempts to match your pattern against the entire input string. It only succeeds if the regex engine finds some way to express your whole string as an instance of your pattern. The only way to achieve this is for it to match (.*?) against foo</p> <p>bar.
There are two ways to solve this:
The easiest is to switch to match.find(). This finds the first match of your pattern within the string. Since there is no requirement for the whole string to match, the non-greedy quantifier ensures you get foo as required.
Adjust your pattern to match the whole string. I.e. "<p(?: [^>]*)?>(.*?)</p>.*".
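Both options can be compared side by side in a short sketch. The class and method names are placeholders; the patterns and the input are the ones from the question:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FirstParagraph {
    // Option 1: keep the pattern and search with find().
    static String viaFind(String html) {
        Pattern p = Pattern.compile("<p(?: [^>]*)?>(.*?)</p>", Pattern.DOTALL);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Option 2: keep matches(), but let a trailing .* absorb the rest of the input.
    static String viaMatches(String html) {
        Pattern p = Pattern.compile("<p(?: [^>]*)?>(.*?)</p>.*", Pattern.DOTALL);
        Matcher m = p.matcher(html);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<p class=\"baz\">foo</p> <p>bar</p>";
        System.out.println(viaFind(html));    // prints foo
        System.out.println(viaMatches(html)); // prints foo
    }
}
```

Note that option 2 still requires the input to start with the opening p tag, whereas find() also works when other markup precedes it.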
Inevitably, however, these "simple" plans to parse some HTML grow more and more unwieldy as requirements change. It really is quite simple to parse HTML with something like JSoup. Switch to that now and don't look back. Look how easy it is:
Document doc = Jsoup.parseBodyFragment("<p class=\"baz\">foo</p> <p>bar</p>");
Elements paragraphs = doc.getElementsByTag("p");
if (paragraphs.size() > 0) {
System.out.println(paragraphs.get(0).text());
}
Prints: foo.
Sorry for not posting this earlier; I did not have access to a Java environment.
The problem is that matches() tries to match the entire string, as if the pattern were implicitly wrapped in ^ and $. Replacing matches() with find() should fix the issue:
Pattern regex = Pattern.compile("<p(?: [^>]*)?>(.*?)</p>", Pattern.DOTALL);
Matcher match = regex.matcher("<p class=\"baz\">foo</p> <p>bar</p>");
System.out.println(match.find());
System.out.println(match.group(1));
Yields:
true
foo

Java regex for matching #<string>vs<string>

I have a string "Waiting for match #indvspak and #indvsaus" and want to match the strings "#indvspak" and "#indvsaus" separately.
I am using the following regex: (^|)#.*vs.+?\s\b. But it matches the entire string starting from the first hash sign. How can I achieve this?
I thought you wanted to match a string that starts with #, contains vs, and is not preceded by a non-space character.
"(?<!\\S)#\\S*vs\\S+"
(?<!\\S) negative look-behind asserts that the match won't be preceded by a non-space character.
Code:
String s = "Waiting for match #indvspak and #indvsaus";
Matcher m = Pattern.compile("(?<!\\S)#\\S*vs\\S+").matcher(s);
while(m.find())
{
System.out.println(m.group());
}
Output:
#indvspak
#indvsaus
You need this regex:
#[^\\s]+
it matches anything after (including) # but not spaces.
Edit:
As @AvinashRaj suggested, if you want to ensure "vs" appears in the hashtag, you should use the pattern with the negative lookbehind from that answer.
I highly recommend that you go through the String API; there are many methods that can help you with your problem.
EDITED
(copied from other answer comments)
Use this:
"(?<!\\B)#\\w+vs\\o/\S#vas\\S-[]"
Easy...

Need regex to match the given string

I need a regex to match a particular string, say 1.4.5 in the below string . My string will be like
absdfsdfsdfc1.4.5kdecsdfsdff
I have a regex which is giving [c1.4.5k] as an output. But I want to match only 1.4.5. I have tried this pattern:
[^\\W](\\d\\.\\d\\.\\d)[^\\d]
But no luck. I am using Java.
Please let me know the pattern.
If I read your expression [^\\W](\\d\\.\\d\\.\\d)[^\\d] correctly, you want a word character before the version and no digit after it. Is that correct?
For that you can use lookbehind and lookahead assertions. Those assertions do only check their condition, but they do not match, therefore that stuff is not included in the result.
(?<=\\w)(\\d\\.\\d\\.\\d)(?!\\d)
Because of that, you can remove the capturing group. You are also repeating yourself in the pattern, you can simplify that, too:
(?<=\\w)\\d(?:\\.\\d){2}(?!\\d)
Would be my pattern for that. (The ?: is a non capturing group)
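A quick way to check the lookaround version is to run it against the string from the question. This is only a sketch; the class and method names are made up:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class VersionFinder {
    // Lookbehind: a word character must precede; lookahead: no digit may follow.
    private static final Pattern VERSION =
            Pattern.compile("(?<=\\w)\\d(?:\\.\\d){2}(?!\\d)");

    // Returns the first digit.digit.digit sequence, or null if there is none.
    static String firstVersion(String text) {
        Matcher m = VERSION.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // prints 1.4.5 (the surrounding 'c' and 'k' are not part of the match)
        System.out.println(firstVersion("absdfsdfsdfc1.4.5kdecsdfsdff"));
    }
}
```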
Your requirements are vague. Do you need to match a series of exactly 3 numbers with exactly two dots?
[0-9]+\.[0-9]+\.[0-9]+
Which could be written as
([0-9]+\.){2}[0-9]+
Do you need to match x numbers, separated by x-1 dots in between?
([0-9]+\.)+[0-9]+
Use look ahead and look behind.
(?<=c)[\d\.]+(?=k)
Where c is the character that would be immediately before the 1.4.5 and k is the character immediately after 1.4.5. You can replace c and k with any regular expression that would suit your purposes
I think this one should do it : ([0-9]+\\.?)+
Regular Expression
((?<!\d)\d(?:\.\d(?!\d))+)
As a Java string:
"((?<!\\d)\\d(?:\\.\\d(?!\\d))+)"
String str = "absdfsdfsdfc**1.4.5**kdec456456.567sdfsdff22.33.55ffkidhfuh122.33.44";
// Exactly one digit, a dot, one digit, a dot, one digit
String regex = "[0-9]\\.[0-9]\\.[0-9]";
Matcher matcher = Pattern.compile(regex).matcher(str);
if (matcher.find()) {
    String version = matcher.group(0);
    System.out.println(version);
} else {
    System.out.println("no match found");
}

Java Regexp clarification

I have a string like :
<RandomText>
executeRule(x, y, z)
<MoreRandomText>
What I would like to accomplish is the following: if this executeRule string exists in the bigger text block, I would like to get its second parameter.
How could I do this?
What do you mean the bigger text block?
If you want to extract the second param from that expression, it would be something like
executeRule\(\w+,\s*(\w+),\s*\w+\)
The second param is held in capture group 1.
Keep in mind that to use this expression in Java, you need to escape the '\'. Also, I'm just assuming \w is good enough to match your params, that would depend on your particular rules.
If you need some help with actually using regexes in Java, there are many resources you can turn to, I found this tutorial to be fairly simple and it explains the basic usages:
http://www.vogella.de/articles/JavaRegularExpressions/article.html
import java.util.regex.Matcher;
import java.util.regex.Pattern;
...
Pattern p = Pattern.compile("executeRule\\(\\w+, (\\w+), \\w+\\)");
Matcher m = p.matcher(YOUR_TEXT_FROM_FILE);
while (m.find()) {
String secondArgument = m.group(1);
...process secondArgument...
}
Once this code executes secondArgument will contain the value of y. The above regular expression assumes that you expect the arguments to be composed of word characters (i.e. small and capital letters, digits and underscore).
Double backslashes are needed by Java string literal syntax, regexp engine will see single backslashes.
If you'd like to allow for whitespace in the string as it is allowed in most programming languages, you may use the following regexp:
Pattern p = Pattern.compile("executeRule\\(\\s*\\w+\\s*,\\s*(\\w+)\\s*,\\s*\\w+\\s*\\)");
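Put together as a runnable sketch (the class and method names are invented for the example, and the arguments are still assumed to be plain word characters):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ExecuteRuleParser {
    // Whitespace-tolerant version of the pattern; group 1 is the second argument.
    private static final Pattern CALL = Pattern.compile(
            "executeRule\\(\\s*\\w+\\s*,\\s*(\\w+)\\s*,\\s*\\w+\\s*\\)");

    // Returns the second argument of the first executeRule call, or null if absent.
    static String secondArgument(String text) {
        Matcher m = CALL.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "<RandomText>\nexecuteRule(x, y, z)\n<MoreRandomText>";
        System.out.println(secondArgument(text)); // prints y
    }
}
```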
