I have an input string like:
abc(123:456),def(135.666:3434.777),ghi("2015-06-07T09:01:05":"2015-07-08")
Basically, it is (naive idea with regex):
[a-zA-Z0-9]+(((number)|(quoted datetime)):((number)|(quoted datetime)))(,[a-zA-Z0-9]+(((number)|(quoted datetime)):((number)|(quoted datetime))))+?
How can I make a regex pattern in Java to validate that the input string follows this pattern and then I can extract the values [a-zA-Z0-9]+ and ((number)|(quoted datetime)):((number)|(quoted datetime)) from them?
You can use
(\w+)\((\d+(?:\.\d+)?|\"[^\"]*\"):(\d+(?:\.\d+)?|\"[^\"]*\")\)
See the regex demo.
In Java, it can be declared as:
String regex = "(\\w+)\\((\\d+(?:\\.\\d+)?|\"[^\"]*\"):(\\d+(?:\\.\\d+)?|\"[^\"]*\")\\)";
Details:
(\w+) - Group 1: one or more word chars
\( - a ( char
(\d+(?:\.\d+)?|\"[^\"]*\") - Group 2: one or more digits optionally followed with . and one or more digits, or ", zero or more chars other than " and then a " char
: - a colon
(\d+(?:\.\d+)?|\"[^\"]*\") - Group 3: one or more digits optionally followed with . and one or more digits, or ", zero or more chars other than " and then a " char
\) - a ) char
I'm new to regex and have been trying to work this out on my own but I don't seem to get it working. I have an input that contains start and end flags and I want to replace a certain char, but only if it's between the flags.
So for example if the start flag is START and the end flag is END and the char i'm trying to replace is " and I would be replacing it with \"
I would say input.replaceAll(regex, '\\\"');
I tried making a regex to only match the correct " chars but so far I have only been able to get it to match all chars between the flags and not just the " chars. -> (?<=START)(.*)(?=END)
Example input:
This " is START an " example input END string ""
START This is a "" second example END
This" is "a START third example END " "
Expected output:
This " is START an \" example input END string ""
START This is a \"\" second example END
This" is "a START third example END " "
Find all characters between START and END, and for those characters replace " with \".
To achieve this, apply a replacer function to all matches of characters between START and END:
string = Pattern.compile("(?<=START).*?(?=END)").matcher(string)
.replaceAll(mr -> mr.group().replace("\"", "\\\\\""));
which produces your expected output.
Some notes on how this works.
This first step is to match all characters between START and END, which uses look arounds with a reluctant quantifier:
(?<=START).*?(?=END)
The ? after the .* changes the match from greedy (as many chars as possible while still matching) to reluctant (as few chars as possible while still matching). This prevents the middle quote in the following input from being altered:
START a"b END c"d START e"f END
A greedy quantifier will match from the first START all the way past the next END to the last END, incorrectly including c"d.
The next step is for each match to replace " with \". The full match is group 0, or just MatchResult#group. and we don't need regex for this replacement - just plain string replace is enough (and yes, replace() replaces all occurrences).
For now i've been able to solve it by creating 3 capture groups and continuously replacing the match until there are no more matches left. In this case I even had to insert a replace indentifier because replacing with " would keep the " char there and create an infinite loop. Then when there are no more matches left I replaced my identifier and i'm now getting the expected result.
I still feel like there has to be a way cleaner way to do this using only 1 replace statement...
Code that worked for me:
class Playground {
public static void main(String[ ] args) {
String input = "\"ThSTARTis is a\" te\"\"stEND \" !!!";
String regex = "(.*START.+)\"+(.*END+.*)";
while(input.matches(regex)){
input = input.replaceAll(regex, "$1---replace---$2");
}
String result = input.replace("---replace---", "\\\"");
System.out.println(result);
}
}
Output:
"ThSTARTis is a\" te\"\"stEND " !!!
I would love any suggestions as to how I could solve this in a better/cleaner way.
Another option is to make use of the \G anchor with 2 capture groups. In the replacement use the 2 capture groups followed by \"
(?:(START)(?=.*END)|\G(?!^))((?:(?!START|END)(?>\\+\"|[^\r\n\"]))*)\"
Explanation
(?: Non capture group
(START)(?=.*END) Capture group 1, match START and assert there is END to the right
| Or
\G(?!^) Assert the current position at the end of the previous match
) Close non capture group
( Capture group 2
(?: Non capture group
(?!START|END) Negative lookhead, assert not START or END directly to the right
(?>\\+\"|[^\r\n\"]) Match 1+ times \ followed by " or match any char except " or a newline
)* Close the non capture group and optionally repeat it
) Close group 2
\" Match "
See a Java regex demo and a Java demo
For example:
String regex = "(?:(START)(?=.*END)|\\G(?!^))((?:(?!START|END)(?>\\\\+\\\"|[^\\r\\n\\\"]))*)\\\"";
String string = "This \" is START an \" example input END string \"\"\n"
+ "START This is a \"\" second example END\n"
+ "This\" is \"a START third example END \" \"";
String subst = "$1$2\\\\\"";
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(string);
String result = matcher.replaceAll(subst);
System.out.println(result);
Output
This " is START an \" example input END string ""
START This is a \"\" second example END
This" is "a START third example END " "
Want to capture the string after the last slash and before the first occurrence of backward slash().
sample data:
sessionId=30a793b1-ed7e-464a-a630; Url=https://www.example.com/mybook/order/newbooking/itemSummary; sid=KJ4dgQGdhg7dDn1h0TLsqhsdfhsfhjhsdjfhjshdjfhjsfddscg139bjXZQdkbHpzf9l6wy1GdK5XZp; ,"myreferer":"https://www.example.com/mybook/order/newbooking/itemSummary/amex","Accept":"application/json, application/javascript","sessionId":"ggh76734",
targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=122;
sessionId=sfdsdfsd-ba57-4e21-a39f-34; Url=https://www.example.com/mybook/order/newbooking/itemList?id=76734¶=jhjdfhj&type=new&ordertype=kjkf&memberid=273647632&iSearch=true; sid=Q4hWgR1GpQb8xWTLpQB2yyyzmYRgXgFlJLGTc0QJyZbW; ,"myreferer":"https://www.example.com/mybook/order/newbooking/itemList/basket","Accept":"application/json, application/javascript","sessionId":"ggh76734", targetUrl=https://www.example.com/ mybook/order/newbooking/page1?id=123;
sessionId=0e1acab1-45b8-sdf3454fds-afc1-sdf435sdfds; Url=https://www.example.com/mybook/order/newbooking/; sid=hkm2gRSL2t5ScKSJKSJn3vg2sfdsfdsfdsfdsfdfdsfdsfdsfvJZkDD3ng0kYTjhNQw8mFZMn; ,"myreferer":"https://www.example.com/mybook/order/newbooking/itemList/","Accept":"application/json, application/javascript","sessionId":"ggh76734",targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=343;List item
sessionId=sfdsdfsd-ba57-4e21-a39f-34; Url=https://www.example.com/mybook/order/newbooking/itemList?id=76734¶=jhjdfhj&type=new&ordertype=kjkf&memberid=273647632&iSearch=true; sid=Q4hWgR1GpQb8xWTLpQB2yyyzmYRgXgFlJLGTc0QJyZbW; ,"myreferer":"https://www.example.com/mybook/order/newbooking/itemList/basket?id=76734¶=jhjdfhj&type=new&ordertype=kjkf", "Accept":"application/json, application/javascript","sessionId":"ggh76734", targetUrl=https://www.example.com/ mybook/order/newbooking/page1?id=123;
Expecting the below output:
amex
basket
''(empty string)
basket
Have build the below regex to capture it but its 100% accurate. It is capturing some additional part.
Regex
\bmyreferer\\\":\\\"\S+\/(.*?)\\\",
Could you please help me to improve the regex to get desired output?
You could use a negated character class with a capture group:
\bmyreferer":"[^"]+/([^/"]*)"
\bmyreferer":" Match literally preceded by a word boundary
[^"]+/ Match 1+ times any char except ", followed by a /
( Capture group 1
[^/"]* Optionally match (to also match an empty string) any char except / and "
)" Close group 1 and match "
regex demo | Java demo
Example code
String regex = "\\bmyreferer\":\"[^\"]+/([^/\"]*)\"";
String string = "sessionId=30a793b1-ed7e-464a-a630; Url=https://www.example.com/mybook/order/newbooking/itemSummary; sid=KJ4dgQGdhg7dDn1h0TLsqhsdfhsfhjhsdjfhjshdjfhjsfddscg139bjXZQdkbHpzf9l6wy1GdK5XZp; ,\"myreferer\":\"https://www.example.com/mybook/order/newbooking/itemSummary/amex\",\"Accept\":\"application/json, application/javascript\",\"sessionId\":\"ggh76734\", targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=122;\n\n"
+ "sessionId=sfdsdfsd-ba57-4e21-a39f-34; Url=https://www.example.com/mybook/order/newbooking/itemList?id=76734¶=jhjdfhj&type=new&ordertype=kjkf&memberid=273647632&iSearch=true; sid=Q4hWgR1GpQb8xWTLpQB2yyyzmYRgXgFlJLGTc0QJyZbW; ,\"myreferer\":\"https://www.example.com/mybook/order/newbooking/itemList/basket\",\"Accept\":\"application/json, application/javascript\",\"sessionId\":\"ggh76734\", targetUrl=https://www.example.com/ mybook/order/newbooking/page1?id=123;\n\n"
+ "sessionId=0e1acab1-45b8-sdf3454fds-afc1-sdf435sdfds; Url=https://www.example.com/mybook/order/newbooking/; sid=hkm2gRSL2t5ScKSJKSJn3vg2sfdsfdsfdsfdsfdfdsfdsfdsfvJZkDD3ng0kYTjhNQw8mFZMn; ,\"myreferer\":\"https://www.example.com/mybook/order/newbooking/itemList/\",\"Accept\":\"application/json, application/javascript\",\"sessionId\":\"ggh76734\",targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=343;List item";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Group 1 value: " + matcher.group(1));
}
Output
Group 1 value: amex
Group 1 value: basket
Group 1 value:
How to extract all characters from a string without the last number (if exist ) in Java, I found how to extract the last number in a string using this regex [0-9.]+$ , however I want the opposite.
Examples :
abd_12df1231 => abd_12df
abcd => abcd
abcd12a => abcd12a
abcd12a1 => abcd12a
What you might do is match from the start of the string ^ one or more word characters \w+ followed by not a digit using \D
^\w+\D
As suggested in the comments, you could expand the characters you want to match using a character class ^[\w-]+\D or if you want to match any character you could use a dot ^.+\D
If you want to remove one or more digits at the end of the string, you may use
s = s.replaceFirst("[0-9]+$", "");
See the regex demo
To also remove floats, use
s = s.replaceFirst("[0-9]*\\.?[0-9]+$", "");
See another regex demo
Details
(?s) - a Pattern.DOTALL inline modifier
^ - start of string
(.*?) - Capturing group #1: any 0+ chars other than line break chars as few as possible
\\d*\\.?\\d+ - an integer or float value
$ - end of string.
Java demo:
List<String> strs = Arrays.asList("abd_12df1231", "abcd", "abcd12a", "abcd12a1", "abcd12a1.34567");
for (String str : strs)
System.out.println(str + " => \"" + str.replaceFirst("[0-9]*\\.?[0-9]+$", "") + "\"");
Output:
abd_12df1231 => "abd_12df"
abcd => "abcd"
abcd12a => "abcd12a"
abcd12a1 => "abcd12a"
abcd12a1.34567 => "abcd12a"
To actually match a substring from start till the last number, you may use
(?s)^(.*?)\d*\.?\d+$
See the regex demo
Java code:
String s = "abc234 def1.566";
Pattern pattern = Pattern.compile("(?s)^(.*?)\\d*\\.?\\d+$");
Matcher matcher = pattern.matcher(s);
if (matcher.find()){
System.out.println(matcher.group(1));
}
With this Regex you could capture the last digit(s)
\d+$
You could save that digit and do a string.replace(lastDigit,"");
I need to build a regex that match words with these patterns:
Letters and numbers:
A35, 35A, B503X, 1ABC5
Letters and numbers separated by "-", "/", "\":
AB-10, 10-AB, A10-BA, BA-A10, etc...
I wrote this regex for it:
\b[A-Za-z]+(?=[(?<!\-|\\|\/)\d]+)[(?<!\-|\\|\/)\w]+\b|\b[0-9]+(?=[(?<!\-|\\|\/)A-Za-z]+)[(?<!\-|\\|\/)\w]+\b
It works partially, but it's match only letters or only numbers separated by symbols.
Example:
10-10, open-office, etc.
And I don't wanna this matches.
I guess that my regex is very repetitive and somewhat ugly.
But it's what I have for now.
Could anyone help me?
I'm using java/groovy.
Thanks in advance.
Interesting challenge. Here is a java program with a regex that picks out the types of "words" you are after:
import java.util.regex.*;
public class TEST {
public static void main(String[] args) {
String s = "A35, 35A, B503X, 1ABC5 " +
"AB-10, 10-AB, A10-BA, BA-A10, etc... " +
"10-10, open-office, etc.";
Pattern regex = Pattern.compile(
"# Match special word having one letter and one digit (min).\n" +
"\\b # Match first word having\n" +
"(?=[-/\\\\A-Za-z]*[0-9]) # at least one number and\n" +
"(?=[-/\\\\0-9]*[A-Za-z]) # at least one letter.\n" +
"[A-Za-z0-9]+ # Match first part of word.\n" +
"(?: # Optional extra word parts\n" +
" [-/\\\\] # separated by -, / or //\n" +
" [A-Za-z0-9]+ # Match extra word part.\n" +
")* # Zero or more extra word parts.\n" +
"\\b # Start and end on a word boundary",
Pattern.COMMENTS);
Matcher regexMatcher = regex.matcher(s);
while (regexMatcher.find()) {
System.out.print(regexMatcher.group() + ", ");
}
}
}
Here is the correct output:
A35, 35A, B503X, 1ABC5, AB-10, 10-AB, A10-BA, BA-A10,
Note that the only complex regexes which are "ugly", are those that are not properly formatted and commented!
Just use this:
([a-zA-Z]+[-\/\\]?[0-9]+|[0-9]+[-\/\\]?[a-zA-Z]+)
In Java \\ and \/ should be escaped:
([a-zA-Z]+[-\\\/\\\\]?[0-9]+|[0-9]+[-\\\/\\\\]?[a-zA-Z]+)
Excuse me to write my solution in Python, I don't know enough Java to write in Java.
pat = re.compile('(?=(?:([A-Z])|[0-9])' ## This part verifies that
'[^ ]*' ## there are at least one
'(?(1)\d|[A-Z]))' ## letter and one digit.
'('
'(?:(?<=[ ,])[A-Z0-9]|\A[A-Z0-9])' # start of second group
'[A-Z0-9-/\\\\]*'
'[A-Z0-9](?= |\Z|,)' # end of second group
')',
re.IGNORECASE) # this group 2 catches the string
.
My solution catches the desired string in the second group: ((?:(?<={ ,])[A-Z0-9]|\A[A-Z0-9])[A-Z0-9-/\\\\]*[A-Z0-9](?= |\Z|,))
.
The part before it verifies that one letter at least and one digit at least are present in the catched string:
(?(1)\d|[A-Z]) is a conditional regex that means "if group(1) catched something, then there must be a digit here, otherwise there must be a letter"
The group(1) is ([A-Z]) in (?=(?:([A-Z])|[0-9])
(?:([A-Z])|[0-9]) is a non-capturing group that matches a letter (catched) OR a digit, so when it matches a letter, the group(1) isn't empty
.
The flag re.IGNORECASE allows to treat strings with upper or lower cased letters.
.
In the second group, I am obliged to write (?:(?<=[ ,])[A-Z0-9]|\A[A-Z0-9]) because lookbehind assertions with non fixed length are not allowed. This part signifies one character that can't be '-' preceded by a blank or the head of the string.
At the opposite, (?= |\Z[,) means 'end of string or a comma or a blank after'
.
This regex supposes that the characters '-' , '/' , '\' can't be the first character or the last one of a captured string . Is it right ?
import re
pat = re.compile('(?=(?:([A-Z])|[0-9])' ## (from here) This part verifies that
'[^ ]*' # there are at least one
'(?(1)\d|[A-Z]))' ## (to here) letter and one digit.
'((?:(?<=[ ,])[A-Z0-9]|\A[A-Z0-9])'
'[A-Z0-9-/\\\\]*'
'[A-Z0-9](?= |\Z|,))',
re.IGNORECASE) # this group 2 catches the string
ch = "ALPHA13 10 ZZ 10-10 U-R open-office ,10B a10 UCS5000 -TR54 code vg4- DV-3000 SEA 300-BR gt4/ui bn\\3K"
print [ mat.group(2) for mat in pat.finditer(ch) ]
s = "A35, 35A, B503X,1ABC5 " +\
"AB-10, 10-AB, A10-BA, BA-A10, etc... " +\
"10-10, open-office, etc."
print [ mat.group(2) for mat in pat.finditer(s) ]
result
['ALPHA13', '10B', 'a10', 'UCS5000', 'DV-3000', '300-BR', 'gt4/ui', 'bn\\3K']
['A35', '35A', 'B503X', '1ABC5', 'AB-10', '10-AB', 'A10-BA', 'BA-A10']
My first pass yields
(^|\s)(?!\d+[-/\\]?\d+(\s|$))(?![A-Z]+[-/\\]?[A-Z]+(\s|$))([A-Z0-9]+[-/\\]?[A-Z0-9]+)(\s|$)
Sorry, but it's not java formatted (you'll need to edit the \ \s etc.). Also, you can't use \b b/c a word boundary is anything that is not alphanumeric and underscore, so I used \s and the start and end of the string.
This is still a bit raw
EDIT
Version 2, slightly better, but could be improved for performance by usin possessive quantifiers. It matches ABC76 AB-32 3434-F etc, but not ABC or 19\23 etc.
((?<=^)|(?<=\s))(?!\d+[-/\\]?\d+(\s|$))(?![A-Z]+[-/\\]?[A-Z]+(\s|$))([A-Z0-9]+[-/\\]?[A-Z0-9]+)((?=$)|(?=\s))
A condition (A OR NOT A) can be omited. So symbols can savely been ignored.
for (String word : "10 10-10 open-office 10B A10 UCS5000 code DV-3000 300-BR".split (" "))
if (word.matches ("(.*[A-Za-z].*[0-9])|(.*[0-9].*[A-Za-z].*)"))
// do something
You didn't mention -x4, 4x-, 4-x-, -4-x or -4-x-, I expect them all to match.
My expression looks just for something-alpha-something-digits-something, where something might be alpha, digits or symbols, and the opposite: something-alpha-something-digits-something. If something else might occur, like !#$~()[]{} and so on, it would get longer.
Tested with scala:
scala> for (word <- "10 10-10 open-office 10B A10 UCS5000 code DV-3000 300-BR".split (" ")
| if word.matches ("(.*[A-Za-z].*[0-9])|(.*[0-9].*[A-Za-z].*)")) yield word
res89: Array[java.lang.String] = Array(10B, A10, UCS5000, DV-3000, 300-BR)
Slightly modified to filter matches:
String s = "A35, 35A, B53X, 1AC5, AB-10, 10-AB, A10-BA, BA-A10, etc. -4x, 4x- -4-x- 10-10, oe-oe, etc";
Pattern pattern = java.util.regex.Pattern.compile ("\\b([^ ,]*[A-Za-z][^ ,]*[0-9])[^ ,]*|([^ ,]*[0-9][^ ,]*[A-Za-z][^ ,]*)\\b");
matcher = pattern.matcher (s);
while (matcher.find ()) { System.out.print (matcher.group () + "|") }
But I still have an error, which I don't find:
A35|35A|B53X|1AC5|AB-10|10-AB|A10-BA|BA-A10|-4x|4x|-4-x|
4x should be 4x-, and -4-x should be -4-x-.