Java regex for string pattern - java

I want to write a regex for this string pattern:
<Col name="SKU_UPC_NBR">85634546495</Col>
I want to fetch the value between the Col tags.
I tried the pattern below:
Pattern TAG_REGEX = Pattern.compile("<Col name='SKU_UPC_NBR'>(.+?)</col>");
Matcher matcher = TAG_REGEX.matcher(str);
The above is not matching my string and returns empty.
Please help me on this problem.

You can try:
<Col[^>]*>(.+?)<\/Col>
<Col[^>]*> will match the opening tag. [^>]* means match any character but >, so that the match ends at the first > encountered.
(.+?) means grab 1 or more characters between the opening and closing tag
<\/Col> this matches the closing tag
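For what it's worth, here is a minimal sketch of how that pattern could be used from Java. The class and method names (ColExtractor, firstColValue) are made up for illustration and are not from the question. In a Java string the / needs no escaping, so <\/Col> is written as </Col>.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the name ColExtractor is an assumption for this sketch.
final class ColExtractor {
    // [^>]* skips any attributes, (.+?) lazily captures the element text.
    private static final Pattern COL = Pattern.compile("<Col[^>]*>(.+?)</Col>");

    // Returns the text of the first Col element, or null when there is none.
    static String firstColValue(String input) {
        Matcher m = COL.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstColValue("<Col name=\"SKU_UPC_NBR\">85634546495</Col>"));
    }
}
```

Note that find() has to be called before group(1); creating the Matcher alone does not run the match.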

A regex matches exactly what you type. It does not generalize: it does not know that you treat ' and " as equivalent, and it does not ignore case (</col> vs </Col>) unless you tell it to.
The data format you've specified is open tag, space, name attribute, equals, double quote, name attr data ...
The regex format you've specified is open tag, space, name attribute, equals, single quote, name attr data ...
What you need is
Pattern TAG_REGEX = Pattern.compile("<Col name=\"SKU_UPC_NBR\">(.+?)</Col>");
NOTE: You may want to use (\d+?) instead of (.+?), since \d matches any digit; the regex is then more specific to the data you're matching and easier to read. This won't work, however, if some Col tags contain anything other than digits.
You may want to refer to this neat interactive Regex tutorial for practice with regexes.
You may also want to refer to the Java documentation for Regex patterns, which is useful when you need special characters.

Try this please:
(?<=">)\d*(?=<\/)
It will match 0 or more digits preceded by "> (quotation mark and greater-than sign) and followed by </ (less-than sign and forward slash).
You can test this here:
https://regex101.com/
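A rough Java sketch of the lookaround approach follows. The names (LookaroundExtractor, firstDigits) are hypothetical, and I use \d+ rather than \d* so that an empty match is never returned; that change is my assumption, not part of the answer above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class LookaroundExtractor {
    // (?<=">) requires "> just before the digits, (?=</) requires </ just after.
    // Neither lookaround is part of the match, so group() holds only the digits.
    private static final Pattern DIGITS = Pattern.compile("(?<=\">)\\d+(?=</)");

    // Returns the first run of digits between "> and </, or null if none is found.
    static String firstDigits(String input) {
        Matcher m = DIGITS.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstDigits("<Col name=\"SKU_UPC_NBR\">85634546495</Col>"));
    }
}
```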

Related

Regex first character not matching

I am having some Java Pattern problems. This is my pattern:
"^[\\p{L}\\p{Digit}~._-]+$"
It matches any letter of the US-ASCII, numerals, some special characters, basically anything that wouldn't scramble a URL.
What I would like is to find the first letter in a word that does not match this pattern. Basically the user sends a text as an input and I have to validate it and to throw an exception if I find an illegal character.
I tried negating this pattern, but it wouldn't compile properly. Also find() didn't help out much.
A legal input would be hello while ?hello should not be, and my exception should point out that ? is not proper.
I would prefer a suggestion using Java's Matcher, Pattern or something else from util.regex. It's not a necessity, but checking each character in the string individually is not a solution.
Edit: I came up with a better regex to match unreserved URI characters
Try this :
^[\\p{L}\\p{Digit}.'-.'_]*([^\\p{L}\\p{Digit}.'-.'_]).*$
The first non-matching character is captured in group 1.
I ran a few tests here: http://fiddle.re/gkkzm61
Explanation :
I negated your pattern, so I built this:
[^\\p{L}\\p{Digit}.'-.'_]
[^...] means every character except the ones listed between the brackets, which here are the characters of your pattern.
The pattern has 3 parts :
^[\\p{L}\\p{Digit}.'-.'_]*
Consumes the string from the first character until it meets a non-matching character
([^\\p{L}\\p{Digit}.'-.'_])
The non-matching character (negation) inside a capturing group
.*$
Any character until the end of the string.
Hope it helps you
EDIT :
The correct regex should be:
^[\\p{L}\\p{Digit}~._-]*([^\\p{L}\\p{Digit}~._-]).*$
It is the same method; I only changed the contents of the first and second parts.
I tried it and it seems to work.
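A possible way to wire the corrected regex into Java is sketched below. The class and method names (IllegalCharFinder, firstIllegal) are invented for this example, and the Pattern.DOTALL flag is my own addition so that a line break later in the input does not stop .* from reaching the end.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class IllegalCharFinder {
    // Part 1: run of allowed characters. Group 1: the first disallowed character.
    // Part 3: the rest of the input (DOTALL lets . cross line breaks; an assumption).
    private static final Pattern FIRST_ILLEGAL = Pattern.compile(
            "^[\\p{L}\\p{Digit}~._-]*([^\\p{L}\\p{Digit}~._-]).*$", Pattern.DOTALL);

    // Returns the first illegal character as a String, or null if the input is clean.
    static String firstIllegal(String input) {
        Matcher m = FIRST_ILLEGAL.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String bad = firstIllegal("?hello");
        if (bad != null) {
            System.out.println("Illegal character: " + bad);
        }
    }
}
```

The caller can then throw its own exception with the returned character in the message.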
The "^[\\p{L}\\p{Digit}.'-.'_]+$" pattern matches any string consisting of 1+ characters defined inside the character class. Note that the doubled ' and . are suspicious, and you might be unaware of the fact that '-. creates a range and matches '()*+,-.. If that is not on purpose, I think you meant to use .'_-.
To check if a string starts with a character other than the ones defined in the character class, you can negate the character class and check only the first character in the string:
if (str.matches("[^\\p{L}\\p{Digit}.'_-].*")) {
/* String starts with the disallowed character */
}
I also think you can shorten the regex to "(?U)[^\\w.'-].*". At any rate, \\p{Digit} can be replaced with \\d.
Try this one to find the first invalid character (inside a character class only a leading ^ negates, so the allowed characters are listed once after it):
Pattern negPattern = Pattern.compile(".*?([^\\p{L}\\p{Digit}~._-]+).*");
Matcher matcher = negPattern.matcher("hel?lo");
if (matcher.matches())
{
System.out.println("'" + matcher.group(1).charAt(0) + "'");
}

Using pattern matcher to extract html

I have a piece of HTML:
<div class="content" itemprop="softwareVersion"> 2.3 </div>
(This is the version of my app in the Play Store.) What I am trying to do is get the latest version using pattern matching.
What I have thus far for matching the pattern is:
String htmlString = "Some very long webpage string that includes the above tag";
Pattern pattern = Pattern.compile("softwareVersion\"> [^ <]*</dd");
Matcher matcher = pattern.matcher(htmlString);
matcher.find();
How do I now go about extracting 2.3 from the htmlString?
Using JSoup xhtml parser
It's well known that you should not parse XHTML with regex unless you know exactly what HTML you are going to parse. You should use an HTML parser such as JSoup instead. So, you could use something like this:
String htmlString = "YOUR HTML HERE";
Document document=Jsoup.parse(htmlString);
Element element=document.select("div[itemprop=softwareVersion]").first();
System.out.println(element.text());
Regex approach
However, if you want to use regex, then you have to use capturing groups and then grab its content.
String htmlString = "Some very long webpage string that includes the above tag";
Pattern pattern = Pattern.compile("softwareVersion\"> ([^ <]*) *</div");
// The parentheses around [^ <]* above form capturing group 1
Matcher matcher = pattern.matcher(htmlString);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
Try capturing it in a capture group:
("softwareVersion\"> ([^ <]*)</dd");
Then access the value with matcher.group(1).
I had to tweak a few things to make this work:
String htmlString = "String that includes <div class=\"content\" itemprop=\"softwareVersion\"> 2.3 </div>";
Pattern pattern = Pattern.compile("softwareVersion\"> ([^ <]*) +</div");
Matcher matcher = pattern.matcher(htmlString);
if (matcher.find())
{
System.out.println(matcher.group(1));
}
//else??
The () in the RE make it possible to use matcher.group(1)
Try this Regex \"softwareVersion\">\s([0-9].?[0-9]?+)\s\s<\/div>:
\" matches the character " literally
softwareVersion matches the characters softwareVersion literally (case sensitive)
\" matches the character " literally
> matches the characters > literally
\s match any white space character [\r\n\t\f ]
1st Capturing group ([0-9].?[0-9]?+)
[0-9] match a single character present in the list below
0-9 a single character in the range between 0 and 9
.? matches any character (except newline)
Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
[0-9]?+ match a single character present in the list below
Quantifier: ?+ Between zero and one time, as many times as possible, without giving back [possessive]
0-9 a single character in the range between 0 and 9
\s match any white space character [\r\n\t\f ]
\s match any white space character [\r\n\t\f ]
< matches the characters < literally
\/ matches the character / literally
div> matches the characters div> literally (case sensitive)
https://regex101.com/r/kR7lC2/1
First, as comments point out, you can't parse HTML with a regex (thanks to Jeff Burka for linking to the canonical answer).
Second, since you are looking at a very limited and particular situation you can match using a capturing group to get the version.
Assuming that the div in question is not broken across lines, my strategy would be much like your posted attempt; look for the string softwareVersion and the tag close > character, optional whitespace, the version string, optional whitespace, and the closing tag.
That gives a regex like softwareVersion[^>]*>\s*([0-9.]+)\s*</
From debuggex (which needs the .* to match the leading part):
.*softwareVersion[^>]*>\s*([0-9.]+)\s*</
Debuggex Demo
This will give you the version in a capturing group, which will be matcher.group(1)
As a Java string, that's softwareVersion[^>]*>\\s*([0-9.]+)\\s*</
I omitted the div after </ because, while it's in a div now, maybe it'll be a span or something else in the future.
I went simple with [0-9.] so it can match 2.3 but also 3.0.1, however it would also match ..382.1...33 — you could make one that matches a limited or arbitrary set of n(.n)* dotted numbers if it was important.
softwareVersion[^>]*>\\s*([1-9][0-9]*(\\.[0-9]+){0,3})\\s*</ matches a version number n with zero to three .n point releases, so 3.0.2.1 but not 1.2.3.4.5
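To tie this together, here is a small sketch using the stricter pattern. The class and method names (VersionExtractor, extract) are hypothetical, and the sample HTML is the div from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class VersionExtractor {
    // Group 1 is the version: a number followed by zero to three ".n" parts.
    private static final Pattern VERSION = Pattern.compile(
            "softwareVersion[^>]*>\\s*([1-9][0-9]*(\\.[0-9]+){0,3})\\s*</");

    // Returns the version string, or null when the page does not contain one.
    static String extract(String html) {
        Matcher m = VERSION.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<div class=\"content\" itemprop=\"softwareVersion\"> 2.3 </div>";
        System.out.println(extract(html));
    }
}
```

Because of the leading [1-9], a version such as 0.9 would not match; loosen that part if your versions can start with 0.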

Java Regex : How to return the whole word if the words ends with a specific string

Using Pattern/Matcher, I'm trying to find a regex in Java for searching in a text for table names that end with _DBF or _REP or _TABLE or _TBL and return the whole table names.
These table names may contain one or more underscores _ within the name.
For example I'd like to retrieve table names like :
abc_def_DBF
fff_aaa_aaa_dbf
AAA_REP
123_frfg_244_gegw_TABLE
etc
Could someone please propose a regex for this ?
Or would it be easier to read text line by line and use String's method endsWith() instead ?
Many thanks in advance,
GK
Regex pattern
You could use a simple regex like this:
\b(\w+(?:_DBF|_REP|_TABLE|_TBL))\b
Working demo
Java code
In Java you could use code like this:
String text = "HERE THE TEXT YOU WANT TO PARSE";
String patternStr = "\\b(\\w+(?:_DBF|_REP|_TABLE|_TBL))\\b";
Pattern pattern = Pattern.compile(patternStr, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(text);
while(matcher.find()) {
System.out.println("found: " + matcher.group(1));
}
This is the match information:
MATCH 1
1. [0-11] `abc_def_DBF`
MATCH 2
1. [28-43] `fff_aaa_aaa_dbf`
MATCH 3
1. [45-52] `AAA_REP`
MATCH 4
1. [54-77] `123_frfg_244_gegw_TABLE`
Regex pattern explanation
If you aren't familiar enough with regex to see how this pattern works, the idea is:
\b --> use word boundaries to avoid having anything like $%&abc
(\w+ --> table name can contain alphanumeric and underscore characters (\w is a shortcut for [A-Za-z0-9_])
(?:_DBF|_REP|_TABLE|_TBL)) --> must finish with any of these combinations
\b --> word boundaries again
Try this:
System.out.println("blah".matches(".*(_DBF|_REP|_TABLE|_TBL)$"));
System.out.println("blah_TBL".matches(".*(_DBF|_REP|_TABLE|_TBL)$"));
System.out.println("blah_TBL1".matches(".*(_DBF|_REP|_TABLE|_TBL)$"));
This regexp should work to match the whole word:
\w+_([Dd][Bb][Ff]|REP|TABLE)
This regexp should work to match the keywords:
_(DBF|REP|TABLE)
The _ is matched, followed by either DBF or REP or TABLE. The alternatives must share one group; written as _(DBF)|(REP)|(TABLE), the _ would belong only to the first alternative.
It is unclear to me if you wish to match _dbf (lower case). If so simply change DBF to [Dd][Bb][Ff]:
_([Dd][Bb][Ff]|REP|TABLE)
If you wish to match any more keywords just add another |abc alternative inside the group.
Of course this method works only if you know that these "keywords" will appear only once, and only at the end of the string. If you have 123_frfg_TABLE_244_gegw_TABLE for example you will match both.
A simple alternative might be this regex ".*(_DBF|_REP|_TABLE|_TBL)$" which means any string that ends in _DBF or _REP or _TABLE or _TBL.
PS: Make the regex case-insensitive, for example with Pattern.CASE_INSENSITIVE.
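As a sketch of that simple alternative, the whole-word check could look like this in Java. The names (TableNameCheck, isTableName) are invented for the example, and it assumes you have already split the text into individual words.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class TableNameCheck {
    // matches() must consume the whole word, so the suffix has to be at the very end.
    private static final Pattern SUFFIX =
            Pattern.compile(".*(_DBF|_REP|_TABLE|_TBL)$", Pattern.CASE_INSENSITIVE);

    // True when the word ends with one of the table suffixes, ignoring case.
    static boolean isTableName(String word) {
        return SUFFIX.matcher(word).matches();
    }

    public static void main(String[] args) {
        for (String word : new String[] {"abc_def_DBF", "fff_aaa_aaa_dbf", "blah", "blah_TBL1"}) {
            System.out.println(word + " -> " + isTableName(word));
        }
    }
}
```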

regex help in java

I'm trying to compare following strings with regex:
#[xyz="1","2"'"4"] ------- valid
#[xyz] ------------- valid
#[xyz="a5","4r"'"8dsa"] -- valid
#[xyz="asd"] -- invalid
#[xyz"asd"] --- invalid
#[xyz="8s"'"4"] - invalid
The valid pattern should be:
#[xyz then = sign then some chars then , then some chars then ' then some chars and finally ]. This means if there is characters after xyz then they must be in format ="XXX","XXX"'"XXX".
Or only #[xyz]. No character after xyz.
I have tried the following regex, but it did not work:
String regex = "#[xyz=\"[a-zA-z][0-9]\",\"[a-zA-z][0-9]\"'\"[a-zA-z][0-9]\"]";
Here the quotations (in part after xyz) are optional and number of characters between quotes are also not fixed and there could also be some characters before and after this pattern like asdadad #[xyz] adadad.
You can use the regex:
#\[xyz(?:="[a-zA-Z0-9]+","[a-zA-Z0-9]+"'"[a-zA-Z0-9]+")?\]
See it
Expressed as Java string it'll be:
String regex = "#\\[xyz(?:=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")?\\]";
What was wrong with your regex?
[...] defines a character class. When you want to match literal [ and ] you need to escape it by preceding with a \.
[a-zA-z][0-9] matches a single letter followed by a single digit. But you want one or more alphanumeric characters, so you need [a-zA-Z0-9]+ (note A-Z: the range A-z also matches [, \, ], ^, _ and `).
Use this:
String regex = "#\\[xyz(=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")?\\]";
When you write [a-zA-z][0-9] it expects one letter followed by one digit. You also have to escape the first and last square brackets, because square brackets have a special meaning in regexes.
Explanation:
[a-zA-Z0-9]+ means one or more alphanumeric characters (the underscore is not included).
(=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")? means that the expression in parentheses can occur once or not at all.
Since square brackets have a special meaning in regex (you used them yourself to define character classes), you need to escape them if you want to match them literally:
String regex = "#\\[xyz=\"[a-zA-Z][0-9]\",\"[a-zA-Z][0-9]\"'\"[a-zA-Z][0-9]\"\\]";
The next problem is that with [a-zA-Z][0-9] you define "first a letter, second a digit". You need to join those classes and add a quantifier:
String regex = "#\\[xyz=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\"\\]";
See it here on Regexr
there could also be some characters before and after this pattern like
asdadad #[xyz] adadad.
Regex should be:
String regex = "(.)*#\\[xyz(=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")?\\](.)*";
The first and last (.)* allow any text before and after the pattern, as you mentioned in your edit. As @ademiban said, the (=...)? part occurs once or not at all. The other mistakes are well explained in the other answers, +1 to all of them.
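To check the answers against the examples from the question, something like the following sketch could be used. The names (XyzValidator, containsValid) are made up here. It uses find() instead of wrapping the pattern in (.)*, which is one way, though not the only one, to allow surrounding text.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class XyzValidator {
    // Escaped [ and ], plus an optional non-capturing group for ="..","..'".."
    private static final Pattern XYZ = Pattern.compile(
            "#\\[xyz(?:=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")?\\]");

    // True when the text contains a valid #[xyz...] token anywhere.
    static boolean containsValid(String text) {
        return XYZ.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsValid("#[xyz=\"1\",\"2\"'\"4\"]"));   // valid form
        System.out.println(containsValid("asdadad #[xyz] adadad"));       // valid, with surrounding text
        System.out.println(containsValid("#[xyz=\"asd\"]"));              // invalid form
    }
}
```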

regular expressions using java.util.regex API- java

How can I create a regular expression to search strings with a given pattern? For example I want to search all strings that match pattern '*index.tx?'. Now this should find strings with values index.txt,mainindex.txt and somethingindex.txp.
Pattern pattern = Pattern.compile("*.html");
Matcher m = pattern.matcher("input.html");
This code is obviously not working.
You need to learn regular expression syntax. It is not the same as using wildcards. Try this:
Pattern pattern = Pattern.compile("^.*index\\.tx.$");
There is a lot of information about regular expressions here. You may find the program RegexBuddy useful while you are learning regular expressions.
The code you posted does not work because * and . have special meanings in a regex:
dot . is a special regex character. It means one instance of any character.
* means any number of occurrences of the preceding token. At the very start of a pattern there is nothing for it to repeat, so Pattern.compile("*.html") throws a PatternSyntaxException.
Therefore, .* means any number of occurrences of any character.
So you would need something like
Pattern pattern = Pattern.compile(".*\\.html.*");
The \\ is needed because we want a literal dot, and an unescaped dot is a special regex character.
This means: match a string that starts with any number of arbitrary characters, followed by a dot, followed by html, followed by anything.
* matches zero or more occurrences of the preceding token, so if you want to match zero or more of any character, use .* instead (. matches any char).
Modified regex should look something like this:
Pattern pattern = Pattern.compile("^.*\\.html$");
^ matches the start of the string
.* matches zero or more of any char
\\. matches the dot char (if not escaped it would match any char)
$ matches the end of the string
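Putting the pieces from these answers together, a minimal sketch for the original '*index.tx?' wildcard might look like this. The class and method names (WildcardExample, matchesIndex, compiles) are hypothetical.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper for illustration only.
final class WildcardExample {
    // Wildcard *index.tx? as a regex: .* for "*", \\. for the literal dot, . for "?".
    private static final Pattern INDEX = Pattern.compile("^.*index\\.tx.$");

    static boolean matchesIndex(String name) {
        return INDEX.matcher(name).matches();
    }

    // Reports whether a string is a syntactically valid Java regex.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(matchesIndex("mainindex.txt"));
        System.out.println(compiles("*.html")); // the wildcard string is not a valid regex
    }
}
```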
