Java regexp in matcher input - java

I'm trying to get quoted strings using regexp.
String regexp = "('([^\\\\']+|\\\\([btnfr\"'\\\\]|[0-3]?[0-7]{1,2}|u[0-9a-fA-F]{4}))*'|\"([^\\\\\"]+|\\\\([btnfr\"'\\\\]|[0-3]?[0-7]{1,2}|u[0-9a-fA-F]{4}))*\")";
Pattern p = Pattern.compile(regexp);
Matcher m = p.matcher(source);
while (m.find()) {
String newElement = m.group(1);
//...
}
It works well, but if source text contains
' onkeyup="this.value = this.value.replace (/\D/, \'\')">'
program goes into eternal loop.
How can I correctly get this string?
For example, I have a text(php code):
'qty'=>'<input type="text" maxlength="3" class="qty_text" id='.$key.' value ='
The result should be
'qty'
'<input type="text" maxlength="3" class="qty_text" id='
' value ='

Your regex seems to work okay when presented with a string it matches; it's when it can't match that it goes into the endless loop. (In this case it's the \D that's causing it to choke.) But that regex is much more complicated than it needs to be; you're trying to match them, not validate them. Here's the quintessential regex for a string literal in C-style languages:
"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*"
...and the single-quoted version, for languages that support that style:
'[^'\\\r\n]*(?:\\.[^'\\\r\n]*)*'
It uses Friedl's "unrolled loop" technique for maximum efficiency. Here's the Java code for it, as generated by RegexBuddy 4:
Pattern regex = Pattern.compile(
"\"[^\"\\\\\r\n]*(?:\\\\.[^\"\\\\\r\n]*)*\"|'[^'\\\\\r\n]*(?:\\\\.[^'\\\\\r\n]*)*'"
);

Maybe I misunderstand the principle, but that looks rather trivial now that you added the example.
Consider this for instance:
String input = "'qty'=>'<input type=\"text\" maxlength=\"3\" class=\"qty_text\" id='.$key.' value ='";
String otherInput = "' onkeyup=\"this.value = this.value.replace (/\\D/, \'\')\">'";
// matching anything starting with single quote and ending with single quote
// included, reluctant quantified
Pattern p = Pattern.compile("'.+?'");
Matcher m = p.matcher(input);
while (m.find()) {
System.out.println(m.group());
}
m = p.matcher(otherInput);
System.out.println();
while (m.find()) {
System.out.println(m.group());
}
Output:
'qty'
'<input type="text" maxlength="3" class="qty_text" id='
' value ='
' onkeyup="this.value = this.value.replace (/\D/, '
')">'
See the Java Pattern documentation for more detailed explanations.

The character groups that match neither backslashes nor quotes shouldn't be followed by a +. Remove the +es to fix the hang (which was due to catastrophic backtracking).
Also, your original regex wasn't recognizing \D as a valid backslash escape - therefore the string constant in your test input containing \D wasn't being matched. If you make the rules of your regex more liberal to recognize any character immediately following a backslash as part of the string constant, it will behave the way you expect.
"('([^\\\\']|\\\\.)*'|\"([^\\\\\"]|\\\\.)*\")"

You can do it all in one line using split() with the right regex:
String[] array = source.replaceAll("^[^']+", "").split("(?<!\\G.)(?<=').*?(?='|$)");
There's a reasonable amount of regex kung fu going on here, so I'll break it down:
The delimiter is wrapped by even/odd quotes, but can not contain the quotes because split() consumes the delimiter, so a look behind (?<=') and look ahead (?=') (which are non-consuming) is used to match the quotes instead of a literal quote in the regex
a reluctant match .*? for characters between the quotes ensures that it stops at the next quote (instead of matching through to the last quote)
I added an alternate match for end of input tot he look ahead (?='|$) in case there's no trailing close quote
And saving the best for last, the regex that is key to making this all work is the negative look behind (?<!\\G.) which means "don't match on the end of the previous match" and ensures the next match advances past the end of the previous delimiter, without which you would end up with just the quote characters in your array. \G matches the end of the previous match, but also matches start of input for the first match, so it rather neatly automatically handles not matching on the first quote - thus making the delimiter wrapped in even/odd quote instead of odd/even as it would be otherwise.
To cater for the input's first character not being a quote, you need to strip off the leading characters before splitting - that's why the replaceAll() is needed
Here's some test code using your sample input:
String source = "'qty'=>'<input type=\"text\" maxlength=\"3\" class=\"qty_text\" id='.$key.' value ='";
String[] array = source.replaceAll("^[^']+", "").split("(?<!\\G.)(?<=').*?(?='|$)");
System.out.println(Arrays.toString(array));
Output:
['qty', '<input type="text" maxlength="3" class="qty_text" id=', ' value =']

Related

Using patern matcher to extract html

I have a pice of HTML:
<div class="content" itemprop="softwareVersion"> 2.3 </div>
(This is the version of my app in the play store) What i am trying to do, is get the latest version using Pattern matching.
what i have thus far for matching the pattern is:
String htmlString = "Some very long webpage string that includes the above tag"
Pattern pattern = Pattern.compile("softwareVersion\"> [^ <]*</dd");
Matcher matcher = pattern.matcher(Html);
matcher.find();
How do i now go about extractin 2.3 from the htmlString?
Using JSoup xhtml parser
It's well known that you should not parse xhtml with regex unless you know the html character set you are going to parse. You should use a xhtml parser instead like JSoup. So, you could use something like this:
String htmlString = "YOUR HTML HERE";
Document document=Jsoup.parse(htmlString);
Element element=document.select("div[itemprop=softwareVersion]").first();
System.out.println(element.text());
Regex approach
However, if you want to use regex, then you have to use capturing groups and then grab its content.
String htmlString = "Some very long webpage string that includes the above tag"
Pattern pattern = Pattern.compile("softwareVersion\"> ([^ <]*)</dd");
// ^------^ Here
Matcher matcher = pattern.matcher(htmlString);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
Try to capture it in a capture group?
("softwareVersion\"> ([^ <]*)< /dd");
Then accessing the value with matcher.group(1)
I had to tweak a few things to make this work:
String htmlString = "String that includes <div class=\"content\" itemprop=\"softwareVersion\"> 2.3 </div>";
Pattern pattern = Pattern.compile("softwareVersion\"> ([^ <]*) +</div");
Matcher matcher = pattern.matcher(htmlString);
if (matcher.find())
{
System.out.println(matcher.group(1));
}
//else??
The () in the RE make it possible to use matcher,group(1)
Try this Regex \"softwareVersion\">\s([0-9].?[0-9]?+)\s\s<\/div>:
\" matches the character " literally
softwareVersion matches the characters softwareVersion literally (case sensitive)
\" matches the character " literally
> matches the characters > literally
\s match any white space character [\r\n\t\f ]
1st Capturing group ([0-9].?[0-9]?+)
[0-9] match a single character present in the list below
0-9 a single character in the range between 0 and 9
.? matches any character (except newline)
Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
[0-9]?+ match a single character present in the list below
Quantifier: ?+ Between zero and one time, as many times as possible, without giving back [possessive]
0-9 a single character in the range between 0 and 9
\s match any white space character [\r\n\t\f ]
\s match any white space character [\r\n\t\f ]
< matches the characters < literally
\/ matches the character / literally
div> matches the characters div> literally (case sensitive)
https://regex101.com/r/kR7lC2/1
First, as comments point out, you can't parse HTML with a regex (thanks to Jeff Burka for linking to the canonical answer).
Second, since you are looking at a very limited and particular situation you can match using a capturing group to get the version.
Assuming that the div in question is not broken across lines, my strategy would be much like your posted attempt; look for the string softwareVersion and the tag close > character, optional whitespace, the version string, optional whitespace, and the closing tag.
That gives a regex like softwareVersion[^>]*>\s*([0-9.]+)\s*</
From debuggex (which needs the .* to match the leading part):
.*softwareVersion[^>]*>\s*([0-9.]+)\s*</
Debuggex Demo
This will give you the version in a capturing group, which will be matcher.group(1)
As a Java string, that's softwareVersion[^>]*>\\s*([0-9.]+)\\s*</
I omitted the div after </ because, while it's in a div now, maybe it'll be a span or something else in the future.
I went simple with [0-9.] so it can match 2.3 but also 3.0.1, however it would also match ..382.1...33 — you could make one that matches a limited or arbitrary set of n(.n)* dotted numbers if it was important.
softwareVersion[^>]*>\\s*([1-9][0-9]*(\\.[0-9]+){0,3})\\s*</ matches a version number n with zero to three .n point releases, so 3.0.2.1 but not 1.2.3.4.5

A regex that doesn't match with this character sequence

Here is my Regex, I am trying to search all special characters so that I can escape them.
(\(|\)|\[|\]|\{|\}|\?|\+|\\|\.|\$|\^|\*|\||\!|\&|\-|\#|\#|\%|\_|\"|\:|\<|\>|\/|\;|\'|\`|\~)
My problem here is, I don't want to escape some sepcial characters only when the come in a sequence
like this (.*)
So, Lets consider an example.
Sting message = "Hi, Mr.Xyz! Your account number is :- (1234567890) , (,*) &$#%#*(....))(((";
After escaping according to current regex what i get is,
Hi, Mr\.Xyz\! Your account number is \:\- \(1234567890\) , \(,\*\) \&\$\#\%\#\*\(\.\.\.\.\)\)\(\(\(
But is don't want to escape this part (.*) want to keep it as it is.
My above regex is only used for searching, So i just don't want to match with this part (.*) and my problem will be solved
Can anyone suggest regex that doesn't escape that part of the string?
See #nhahtdh for how to do this with a regex.
As an alternative, Here is a solution which does not use a regex, using Guava's CharMatcher instead:
private static final CharMatcher SPECIAL
= CharMatcher.anyOf("allspecialcharshere");
private static final String NO_ESCAPE = "(.*)";
public String doEncode(String input)
{
StringBuilder sb = new StringBuilder(input.length());
String tmp = input;
while (!tmp.isEmpty()) {
if (tmp.startsWith(NO_ESCAPE)) {
sb.append(NO_ESCAPE);
tmp = tmp.substring(NO_ESCAPE.length());
continue;
}
char c = tmp.charAt(0);
if (SPECIAL.matches(c))
sb.append('\\');
sb.append(c);
tmp = tmp.substring(1);
}
return sb.toString();
}
This answer is to demonstrate the possibility only. Using it in production code is questionable.
It is possible with Java String replaceAll function:
String input = "Hi, Mr.Xyz! Your account number is :- (1234567890) , (.*) &$#%#*(....))(((";
String output = input.replaceAll("\\G((?:[^()\\[\\]{}?+\\\\.$^*|!&##%_\":<>/;'`~-]|\\Q(.*)\\E)*+)([()\\[\\]{}?+\\\\.$^*|!&##%_\":<>/;'`~-])", "$1\\\\$2");
Result:
"Hi, Mr\.Xyz\! Your account number is \:\- \(1234567890\) , (.*) \&\$\#\%\#\*\(\.\.\.\.\)\)\(\(\("
Another test:
String input = "(.*) sdfHi test message <> >>>>><<<<f<f<,,,,<> <>(.*) sdf (.*) sdf (.*)";
Result:
"(.*) sdfHi test message \<\> \>\>\>\>\>\<\<\<\<f\<f\<,,,,\<\> \<\>(.*) sdf (.*) sdf (.*)"
Explanation
Raw regex:
\G((?:[^()\[\]{}?+\\.$^*|!&##%_":<>/;'`~-]|\Q(.*)\E)*+)([()\[\]{}?+\\.$^*|!&##%_":<>/;'`~-])
Note that \ is escaped once more when the regex is specified inside the string, and " needs to be escaped. The resulting regex in string can be seen above.
Raw replacement string:
$1\\$2
Since $ has special meaning in replacement string, and you want to keep it for $2, you need to escape the \ so that \ won't escape the $. And putting the replacement string in quoted string, you need to double up the number of \ to escape the \.
Before we dissect the monster, let's talk about the idea. We will consume non-special characters, and the sequence that we don't want to replace, and as many times as possible. The next character will either be a special character not forming sequence we don't want to replace, or is the end of the string (which means that we have found all character that needs replacing if any).
Naturally, we can think of any arbitrary string as consisting of many of the following pattern consecutively: [0 or more (non-special character or special pattern not to be replace)][special character], and the string ends with [0 or more (non-special character or special pattern not to be replace)].
replaceAll function when used with a regex without \G may find matches that are not consecutive, which can cut in the middle of the sequence not to be replaced and mess it up. \G means the boundary of last match, and can be used to make sure the next match starts from where the last match left off.
\G: Starts from last match
((?:[^()\[\]{}?+\\.$^*|!&##%_":<>/;'`~-]|\Q(.\*)\E)*+): Capture 0 or more of, the non-special character or the special pattern not to be replaced. Note that I have added the possessive qualifier + after *. This will prevent the engine from backtracking when it cannot find the special character that we specify after this.
[^()\[\]{}?+\\.$^*|!&##%_":<>/;'`~-]: Negated character class of special characters.
\Q(.*)\E: Special sequence (.*) not to be replaced, literal quoted by \Q and \E.
([()\[\]{}?+\\.$^*|!&##%_":<>/;'`~-]): Capture the single special character.
The whole regex will match string with minimum length of 1 (the special character). The first capturing group contains the parts that shouldn't be replaced, and the 2nd capturing group contains the special character that should be replaced.

How to match a string's end using a regex pattern in Java?

I want a regular expression pattern that will match with the end of a string.
I'm implementing a stemming algorithm that will remove suffixes of a word.
E.g. for a word 'Developers' it should match 's'.
I can do it using following code :
Pattern p = Pattern.compile("s");
Matcher m = p.matcher("Developers");
m.replaceAll(" "); // it will replace all 's' with ' '
I want a regular expression that will match only a string's end something like replaceLast().
You need to match "s", but only if it is the last character in a word. This is achieved with the boundary assertion $:
input.replaceAll("s$", " ");
If you enhance the regular expression, you can replace multiple suffixes with one call to replaceAll:
input.replaceAll("(ed|s)$", " ");
Use $:
Pattern p = Pattern.compile("s$");
public static void main(String[] args)
{
String message = "hi this message is a test message";
message = message.replaceAll("message$", "email");
System.out.println(message);
}
Check this,
http://docs.oracle.com/javase/tutorial/essential/regex/bounds.html
When matching a character at the end of string, mind that the $ anchor matches either the very end of string or the position before the final line break char if it is present even when the Pattern.MULTILINE option is not used.
That is why it is safer to use \z as the very end of string anchor in a Java regex.
For example:
Pattern p = Pattern.compile("s\\z");
will match s at the end of string.
See a related Whats the difference between \z and \Z in a regular expression and when and how do I use it? post.
NOTE: Do not use zero-length patterns with \z or $ after them because String.replaceAll(regex) makes the same replacement twice in that case. That is, do not use input.replaceAll("s*\\z", " ");, since you will get two spaces at the end, not one. Either use "s\\z" to replace one s, or use "s+\\z" to replace one or more.
If you still want to use replaceAll with a zero-length pattern anchored at the end of string to replace with a single occurrence of the replacement, you can use a workaround similar to the one in the How to make a regular expression for this seemingly simple case? post (writing "a regular expression that works with String replaceAll() to remove zero or more spaces from the end of a line and replace them with a single period (.)").

How can I remove all leading and trailing punctuation?

I want to remove all the leading and trailing punctuation in a string. How can I do this?
Basically, I want to preserve punctuation in between words, and I need to remove all leading and trailing punctuation.
., #, _, &, /, - are allowed if surrounded by letters
or digits
\' is allowed if preceded by a letter or digit
I tried
Pattern p = Pattern.compile("(^\\p{Punct})|(\\p{Punct}$)");
Matcher m = p.matcher(term);
boolean a = m.find();
if(a)
term=term.replaceAll("(^\\p{Punct})", "");
but it didn't work!!
Ok. So basically you want to find some pattern in your string and act if the pattern in matched.
Doing this the naiive way would be tedious. The naiive solution could involve something like
while(myString.StartsWith("." || "," || ";" || ...)
myString = myString.Substring(1);
If you wanted to do a bit more complex task, it could be even impossible to do the way i mentioned.
Thats why we use regular expressions. Its a "language" with which you can define a pattern. the computer will be able to say, if a string matches that pattern. To learn about regular expressions, just type it into google. One of the first links: http://www.codeproject.com/Articles/9099/The-30-Minute-Regex-Tutorial
As for your problem, you could try this:
myString.replaceFirst("^[^a-zA-Z]+", "")
The meaning of the regex:
the first ^ means that in this pattern, what comes next has to be at
the start of the string.
The [] define the chars. In this case, those are things that are NOT
(the second ^) letters (a-zA-Z).
The + sign means that the thing before it can be repeated and still
match the regex.
You can use a similar regex to remove trailing chars.
myString.replaceAll("[^a-zA-Z]+$", "");
the $ means "at the end of the string"
You could use a regular expression:
private static final Pattern PATTERN =
Pattern.compile("^\\p{Punct}*(.*?)\\p{Punct}*$");
public static String trimPunctuation(String s) {
Matcher m = PATTERN.matcher(s);
m.find();
return m.group(1);
}
The boundary matchers ^ and $ ensure the whole input is matched.
A dot . matches any single character.
A star * means "match the preceding thing zero or more times".
The parentheses () define a capturing group whose value is retrieved by calling Matcher.group(1).
The ? in (.*?) means you want the match to be non-greedy, otherwise the trailing punctuation would be included in the group.
Use this tutorial on patterns. You have to create a regex that matches string starting with alphabet or number and ending with alphabet or number and do inputString.matches("regex")

Need Regex Expression Advice

<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>
I know this regex expression is used to retrieve the value of src. Can anyone teach me how i should interpret this expression? stucked at it.
Explaining:
<img matches exactly the string "<img"
[^>]+ matches multiple times of everything but >, so the tag will not be closed
src matches exactly the string "src"
\\s* matches any number of whitespace characters
= matches exactly the string "="
\\s* matches any number of whitespace characters
['\"] matches the two quotes. The double quote is escaped, because otherwise it will terminate the string of the regex
([^'\"]+) mathches multiple times everything but quotes. The contents are wrapped in brackets, so that they are declared as group and can be retrieved later
['\"] matches the two quotes. The double quote is escaped, because otherwise it will terminate the string of the regex
[^>]* matches the remaining non ">" characters
> matches exactly the string ">", the closing bracket of the tag.
I would not agree this expression is a crap, just a bit complex.
EDIT Here you go some examplary code:
String str = "<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>";
String text = "<img alt=\"booo\" src=\"image.jpg\"/>";
Pattern pattern = Pattern.compile (str);
Matcher matcher = pattern.matcher (text);
if (matcher.matches ())
{
int n = matcher.groupCount ();
for (int i = 0; i <= n; ++i)
System.out.println (matcher.group (i));
}
The output is:
<img alt="booo" src="image.jpg"/>
image.jpg
So matcher.group(1) returns what you want. experiment a bit with this code.
Hi check one of the tutorials available on the net - e.g. http://www.vogella.com/articles/JavaRegularExpressions/article.html. Section 3.1 and 3.2 common matching symbols explains briefly each symbol and what it replaces as well as metacharacters. Break what you have here into smaller chunks to understand it easier. For example you have \s in two places it is a metacharacter for a whitespace character. Backslash is an escape character in Java thus you have \s instead of \s. After each of them you have a . Section 3.3 explains the quantifiers - this particular one means it occurs 0 or more times. Thus the \s means "search for a whitespace character that occurs 0 or more times". You do the same with other chunks.
Hope it helps.

Categories