I have a string containing n substrings in the following format which I want to match:
{varName:param1, param2, param2}
Requirements are as follows:
Only the varName (inside the curly brackets) is mandatory
No limit on the number of parameters
No restrictions on whitespace inside curly brackets apart from var and param names which must not contain whitespace
I would like to be able to capture the varName and each of the parameters separately.
I have come up with a regex that is nearly there, but not quite. Any help would be appreciated.
I'm wondering whether it would be easier to simply use String.split() judiciously, rather than battle with regexps for the above. The delimiters (colons/whitespace/commas) seem well-defined.
String s = "blah blah\n{varName:param1, param2, param2}\nblah";
Pattern p = Pattern.compile(
"\\{([a-zA-Z]+)(?:\\s*:\\s*([^,\\s]+(?:\\s*,\\s*[^,\\s]+)*))?\\}"
);
Matcher m = p.matcher(s);
if (m.find())
{
String varName = m.group(1);
String[] params = m.start(2) != -1
? m.group(2).split("[,\\s]+")
: new String[0];
System.out.printf("var: %s%n", varName);
for (String param : params)
{
System.out.printf("param: %s%n", param);
}
}
If you're holding out for a way to match the string and break out all the components with one regex, don't bother; this is as good as it gets (unless you switch to Perl 6). As for performance, I wouldn't worry about that until it becomes a problem.
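Since the input may contain several such substrings, the same match-then-split approach can be put in a loop. The following is only a sketch: the class name PlaceholderScan and the method scan are made up for illustration, and the pattern is a variant that also tolerates whitespace around the name and excludes } from parameter names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaceholderScan {
    // Group 1: the variable name. Group 2 (optional): the raw parameter list.
    private static final Pattern PLACEHOLDER = Pattern.compile(
        "\\{\\s*([a-zA-Z]+)\\s*(?::\\s*([^,\\s}]+(?:\\s*,\\s*[^,\\s}]+)*))?\\s*\\}");

    // Returns one entry per placeholder: the name followed by its parameters.
    static List<List<String>> scan(String input) {
        List<List<String>> found = new ArrayList<>();
        Matcher m = PLACEHOLDER.matcher(input);
        while (m.find()) {
            List<String> entry = new ArrayList<>();
            entry.add(m.group(1));
            if (m.group(2) != null) {
                // Split the raw list on commas, ignoring surrounding whitespace
                entry.addAll(Arrays.asList(m.group(2).split("\\s*,\\s*")));
            }
            found.add(entry);
        }
        return found;
    }

    public static void main(String[] args) {
        String s = "x {a} y { b : p1 ,p2 } z {c:q}";
        System.out.println(scan(s)); // [[a], [b, p1, p2], [c, q]]
    }
}
```

Compiling the pattern once and reusing it for every call keeps the per-input cost down to matching and splitting.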
How about a regex and a Scanner?
import java.util.Scanner;
public class Regex {
public static void main(String[] args) {
String string = "{varName: param1, param2, param2}";
Scanner scanner = new Scanner(string);
scanner.useDelimiter("[\\s{:,}]+");
System.out.println("varName: " + scanner.next());
while (scanner.hasNext()) {
System.out.println("param: " + scanner.next());
}
}
}
A quick solution in pseudocode:
string.match(/{(\w+):([\w\s,]+)}/);
varName = matches[1];
params = matches[2].split(',');
Post what you have so far.
You can test it very easily on this website:
http://www.regexplanet.com/simple/index.html
Ok I've got a solution in regex that seems to work just fine:
\{\s*([^\{\}:,\s]+)\s*(?:(?::\s*([^\{\}:,\s]+)\s*)(?:,\s*([^\{\}:,\s]+)\s*)*)?\}
Or, to keep up the pretense of being able to understand it:
name = [^\{\}:,\s]+
ws = \s*
\{ws(name)ws(?:(?::ws(name)ws)(?:,ws(name)ws)*)?\}
I wouldn't recommend it but short testing seems to indicate that it works - nice brain teaser for 3am in the morning ;)
PS: If you're comparing the split solution to this or something similar I'd be interested in hearing if there were any performance differences - I don't think regex would be especially performant.
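One caveat with a pattern of this shape in Java: a capturing group inside a repeated group only retains the text of its last iteration, so the third group cannot hand back every parameter. The sketch below shows this. The class name RepeatedGroupDemo and the method describe are hypothetical, and the name class excludes ':' here, which is an assumption made for this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedGroupDemo {
    // Assumption for this sketch: names may not contain '{', '}', ':', ',' or whitespace.
    private static final String NAME = "[^\\{\\}:,\\s]+";
    private static final Pattern P = Pattern.compile(
        "\\{\\s*(" + NAME + ")\\s*(?:(?::\\s*(" + NAME + ")\\s*)(?:,\\s*(" + NAME + ")\\s*)*)?\\}");

    // Returns "group1|group2|group3", with "null" for groups that did not participate.
    static String describe(String input) {
        Matcher m = P.matcher(input);
        if (!m.find()) {
            return "no match";
        }
        return m.group(1) + "|" + m.group(2) + "|" + m.group(3);
    }

    public static void main(String[] args) {
        // param2 is matched but not retained: group 3 holds only the last iteration
        System.out.println(describe("{varName:param1, param2, param3}")); // varName|param1|param3
        System.out.println(describe("{v}"));                              // v|null|null
    }
}
```

To collect every parameter, the usual fallback is to capture the whole parameter list in one group and split it afterwards.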
Related
Hello, I'm having trouble getting the third element of a string (F604080):
<sourceDocumentId>AX02_APF604_F604080</sourceDocumentId>
I have tried the regular expressions below and variations of them, but I can't manage to get F604080.
(?<=\w+_)\w+(?=\<)
(?<=\w+_\w+_)\w+(?=\<)
....
Any help will be appreciated.
Thanks.
You don't need lookbehind or lookahead; just use this simple regex,
.*_(\w+)
and capture group 1.
Java code:
public static void main(String[] args) {
String s = "<sourceDocumentId>AX02_APF604_F604080</sourceDocumentId>";
Pattern p = Pattern.compile(".*_(\\w+)");
Matcher m = p.matcher(s);
if (m.find()) {
System.out.println(m.group(1));
} else {
System.out.println("Didn't match");
}
}
This prints the value you wanted:
F604080
Using regex you can use something like >\w+_\w+_(\w+)<\/
String str = "<sourceDocumentId>AX02_APF604_F604080</sourceDocumentId>";
String code = null;
Matcher m = Pattern.compile(">\\w+_\\w+_(\\w+)</").matcher(str);
if (m.find()) {
code = m.group(1);
}
Simply use substring() operation
String code = str.substring(str.lastIndexOf('_') + 1, str.lastIndexOf('<'));
If you later need to parse XML with more elements, you could use something like the Java DOM XML parser, but that is not the best option here, as you have only one element.
Can you just parse the string using "_" as separator and take the 3rd element ?
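A minimal sketch of that idea, assuming the input is always a single element whose text has at least three "_"-separated parts (the class name ThirdElement and the method thirdPart are made up for illustration):

```java
public class ThirdElement {
    // Strips the surrounding tags, then splits the text on "_".
    static String thirdPart(String xml) {
        String content = xml.substring(xml.indexOf('>') + 1, xml.lastIndexOf('<'));
        String[] parts = content.split("_");
        // Return null rather than throwing when there are fewer than three parts
        return parts.length >= 3 ? parts[2] : null;
    }

    public static void main(String[] args) {
        String s = "<sourceDocumentId>AX02_APF604_F604080</sourceDocumentId>";
        System.out.println(thirdPart(s)); // F604080
    }
}
```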
Both of your regular expressions seem to be matching the given string.
Anyway you could be a little bit more specific with this one:
^(?:<\w+>)(?:\w+)_(?:\w+)_(\w+)(?:<\/\w+>)$
Be sure that the input is the string you think it is and no additional text is given after that.
This code works fine :
final String result = myString.replaceAll("<tag1>", "{").replaceAll("<tag2>", "}");
but I have to parse big files, so I'm wondering whether I can call Pattern.compile("REGEX") once, before the while loop:
Pattern p = Pattern.compile("REGEX");
while (scan.hasNextLine()) {
    final String myWorkLine = scan.nextLine();
    p.matcher(myWorkLine).replaceAll("$1"); // or other value
    ..;
}
I expect a faster result because the regex is compiled once and only once.
EDIT
I want to put (if it is possible) the replaceAll(..).replaceAll(..) model in a Pattern, and have tag1==>{, and tag2==>}.
Question: is compiling the Pattern outside the loop faster than calling replaceAll(..).replaceAll(..) inside the loop?
To answer your original question: yes, you could do that, and indeed it would be faster than your original code, if you apply the same regular expression(s) multiple times in a loop. Your loop should be rewritten like this:
Pattern p1 = Pattern.compile("REGEX1");
Pattern p2 = Pattern.compile("REGEX2");
while (scan.hasNextLine()) {
    String myWorkLine = scan.nextLine();
    myWorkLine = p1.matcher(myWorkLine).replaceAll("replacement1");
    myWorkLine = p2.matcher(myWorkLine).replaceAll("replacement2");
    ...;
}
But if you're not using regular expressions, as your first example suggests ("<tag1>"), then don't use String.replaceAll(String regex, String replacement), which has to treat its first argument as a regular expression. Instead use String.replace(CharSequence target, CharSequence replacement), which treats the target as literal text and is typically faster.
Example:
"ABAP is fun! ABAP ABAP ABAP".replace("ABAP", "Java");
See: Java Docs for String.replace
It's not nice changing your question that radically, but ok, here again an answer for your regular expression:
String s1
= "You can <bold>have nice weather</bold>, but <bold>not</bold> always!";
//EDIT: the regex was 'overengineered', and .?? should have been .*?
//String s2 = s1.replaceAll("(.*?)<bold>(.*?)</bold>(.??)", "$1{$2}$3");
String s2 = s1.replaceAll("<bold>(.*?)</bold>", "{$1}");
System.out.println(s2);
Output: You can {have nice weather}, but {not} always!
Here is the loop with this new regex, and yes, this would be faster than the original loop:
//EDIT: the regex was 'overengineered'
Pattern p = Pattern.compile("<bold>(.*?)</bold>");
while (scan.hasNextLine()) {
    String myWorkLine = scan.nextLine();
    myWorkLine = p.matcher(myWorkLine).replaceAll("{$1}");
    ...;
}
EDIT:
Here the description of Java RegEx syntax constructs
replaceAll uses regex Patterns. From the java.lang.String source code:
public String replaceAll(String regex, String replacement) {
return Pattern.compile(regex).matcher(this).replaceAll(replacement);
}
Edit1: Please stop changing what you're asking. Pick a question and stick with it.
Edit2:
If you're really sure you want to do it this way, compiling a regex outside of the loop, in the simplest case you'd need two different patterns:
Pattern tag1Pattern = Pattern.compile("<tag1>");
Pattern tag2Pattern = Pattern.compile("<tag2>");
while (scan.hasNextLine()) {
    String line = scan.nextLine();
    String modifiedLine = tag1Pattern.matcher(line).replaceAll("{");
    // Feed the result of the first replacement into the second one
    modifiedLine = tag2Pattern.matcher(modifiedLine).replaceAll("}");
    ...
}
You're still applying a pattern matcher twice per line, so if there are any performance hits, that's why.
Without knowing what your data looks like, it's hard to give you a more precise answer or better regex. Unless you've edited your question (again) while I was writing this.
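If both tags really must go through a single precompiled pattern, one option is an alternation combined with Matcher.appendReplacement, which lets the code pick the replacement per match. This is a sketch, not a measured improvement: the class name TagReplacer and the method replaceTags are hypothetical, and whether it beats two plain String.replace calls would have to be benchmarked on the real files.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagReplacer {
    // Compiled once; matches either tag in a single pass over the line.
    private static final Pattern TAGS = Pattern.compile("<tag1>|<tag2>");

    static String replaceTags(String line) {
        Matcher m = TAGS.matcher(line);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Choose the replacement based on which alternative matched
            String replacement = m.group().equals("<tag1>") ? "{" : "}";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out); // copy the text after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceTags("a <tag1>b<tag2> c")); // a {b} c
    }
}
```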
I have strings formatted similar to the one below in a Java program. I need to get the number out.
Host is up (0.0020s latency).
I need the number between the '(' and the 's' characters. E.g., I would need the 0.0020 in this example.
If you are sure it will always be the first number you could use the regular expression \d+\.\d+ (but note that the backslashes need to be escaped in Java string literals).
Try this code:
String input = "Host is up (0.0020s latency).";
Pattern pattern = Pattern.compile("\\d+\\.\\d+");
Matcher matcher = pattern.matcher(input);
if (matcher.find()) {
System.out.println(matcher.group());
}
See it working online: ideone
You could also include some of the surrounding characters in the regular expression to reduce the risk of matching the wrong number. To do exactly as you requested in the question (i.e. matching between ( and s) use this regular expression:
\((\d+\.\d+)s
See it working online: ideone
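In Java, the stricter pattern could be used roughly as follows. This is a sketch in which the class name LatencyExtractor and the method extract are made up; the number itself is read from capture group 1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LatencyExtractor {
    // "(" followed by digits, a dot, digits, then "s"; the number is group 1.
    private static final Pattern LATENCY = Pattern.compile("\\((\\d+\\.\\d+)s");

    // Returns the matched number as text, or null if the line has no such number.
    static String extract(String line) {
        Matcher m = LATENCY.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String value = extract("Host is up (0.0020s latency).");
        System.out.println(value);                     // 0.0020
        System.out.println(Double.parseDouble(value)); // 0.002
    }
}
```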
Sounds like a case for regular expressions.
You'll want to match for the decimal figure and then parse that match:
Float matchedValue;
Pattern pattern = Pattern.compile("\\d*\\.\\d+");
Matcher matcher = pattern.matcher(yourString);
boolean isfound = matcher.find();
if (isfound) {
matchedValue = Float.valueOf(matcher.group(0));
}
It depends on how "similar" you mean. You could potentially use a regular expression:
import java.math.BigDecimal;
import java.util.regex.*;
public class Test {
public static void main(String args[]) throws Exception {
Pattern pattern = Pattern.compile("[^(]*\\(([0-9]*\\.[0-9]*)s");
String text = "Host is up (0.0020s latency).";
Matcher match = pattern.matcher(text);
if (match.lookingAt())
{
String group = match.group(1);
System.out.println("Before parsing: " + group);
BigDecimal value = new BigDecimal(group);
System.out.println("Parsed: " + value);
}
else
{
System.out.println("No match");
}
}
}
Quite how specific you want to make your pattern is up to you, of course. This only checks for digits, a dot, then digits after an opening bracket and before an s. You may need to refine it to make the dot optional etc.
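As one possible refinement, the fractional part can be made optional so that whole-number latencies are accepted too. The sketch below assumes that at least one digit always precedes the dot; the class name LenientLatency and the method parse are hypothetical.

```java
import java.math.BigDecimal;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LenientLatency {
    // Digits, then an optional ".digits" part, between "(" and "s".
    private static final Pattern LATENCY = Pattern.compile("\\((\\d+(?:\\.\\d+)?)s");

    // Returns the parsed value, or null when the text contains no latency.
    static BigDecimal parse(String text) {
        Matcher m = LATENCY.matcher(text);
        return m.find() ? new BigDecimal(m.group(1)) : null;
    }

    public static void main(String[] args) {
        System.out.println(parse("Host is up (0.0020s latency).")); // 0.0020
        System.out.println(parse("Host is up (3s latency)."));      // 3
    }
}
```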
This is a great site for building regular expressions from simple to very complex. You choose the language and boom.
http://txt2re.com/
Here's a way without regex
String str = "Host is up (0.0020s latency).";
str = str.substring(str.indexOf('(')+1, str.indexOf("s l"));
System.out.println(str);
Of course, a regular expression is the best solution in this case, but in many simple cases you can also use something like:
String value = myString.substring(myString.indexOf("(") + 1, myString.lastIndexOf("s"));
double numericValue = Double.parseDouble(value);
This is not recommended because the text in myString can change.
I want to split a string such as [AO_12345678, Real Estate] into AO_12345678 and Real Estate.
How can I do this in Java using regex?
The main issue I'm facing is avoiding the "[" and "]".
Please help.
Does it really have to be regex?
if not:
String s = "[AO_12345678, Real Estate]";
String[] split = s.substring(1, s.length()-1).split(", ");
I'd go the pragmatic way:
String org = "[AO_12345678, Real Estate]";
String plain = org;
if (org.startsWith("[")) {
    if (org.endsWith("]")) {
        plain = org.substring(1, org.length() - 1);
    } else {
        plain = org.substring(1);
    }
}
// Split the stripped string, not the original, and swallow the space after the comma
String[] result = plain.split(",\\s*");
If the string is always surrounded with '[]' you can just substring it without checking.
One easy way, assuming the format of all your inputs is consistent, is to ignore regex altogether and just split it. Something like the following would work:
String[] parts = input.split(","); // parts is ["[AO_12345678", "Real Estate]"]
String firstWithoutBrace = parts[0].substring(1);
String secondWithoutBrace = parts[1].substring(0, parts[1].length() - 1);
String first = firstWithoutBrace.trim();
String second = secondWithoutBrace.trim();
Of course you can tailor this as you wish - you might want to check whether the braces are present before removing them, for example. Or you might want to keep any spaces before the comma as part of the first string. This should give you a basis to modify to your specific requirements however.
And in a simple case like this I'd much prefer code like the above to a regex that extracted the two strings - I consider the former much clearer!
You can also use StringTokenizer. Here is the code:
String str = "[AO_12345678, Real Estate]";
StringTokenizer st = new StringTokenizer(str, "[],", false);
String s1 = st.nextToken();
String s2 = st.nextToken().trim(); // trim the space left after the comma
s1=AO_12345678
s2=Real Estate
Refer to javadocs for reading about StringTokenizer
http://download.oracle.com/javase/1.4.2/docs/api/java/util/StringTokenizer.html
Another option using regular expressions (RE) capturing groups:
private static void extract(String text) {
Pattern pattern = Pattern.compile("\\[(.*),\\s*(.*)\\]");
Matcher matcher = pattern.matcher(text);
if (matcher.find()) { // or .matches for matching the whole text
String id = matcher.group(1);
String name = matcher.group(2);
// do something with id and name
System.out.printf("ID: %s%nName: %s%n", id, name);
}
}
If speed/memory is a concern, the RE can be optimized to (using Possessive quantifiers instead of Greedy ones)
"\\[([^,]*+),\\s*+([^\\]]*+)\\]"
I have a seemingly simple problem, though I am unable to get my head around it.
Let's say I have the following string: 'abcabcabcabc' and I want to get the last occurrence of 'ab'. Is there a way I can do this without looping through all the other 'ab's from the beginning of the string?
I read about anchoring the end of the string and then parsing the string with the required regular expression. I am unsure how to do this in Java (is it supported?).
Update: I guess I have caused a lot of confusion with my (over)simplified example. Let me try another one. Say I have a string like this: '12/08/2008 some_text 21/10/2008 some_more_text 15/12/2008 and_finally_some_more'. Here, I want the last date and hence I need to use regular expressions. I hope this is a better example.
Thanks,
Anirudh
Firstly, thanks for all the answers.
Here is what i tried and this worked for me:
Pattern pattern = Pattern.compile("(ab)(?!.*ab)");
Matcher matcher = pattern.matcher("abcabcabcd");
if(matcher.find()) {
System.out.println(matcher.start() + ", " + matcher.end());
}
This displays the following:
6, 8
So, to generalize: <reg_ex>(?!.*<reg_ex>) should solve this problem, where '(?!...)' is a negative lookahead asserting that the expression inside it does not match anywhere after the text matched before it.
Update: This page provides more information on 'not followed by' using regex.
This will give you the last date in group 1 of the match object.
.*(\d{2}/\d{2}/\d{4})
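A sketch of how that pattern might be used from Java (the class name LastDate and the method lastDate are made up for illustration). The greedy .* first runs to the end of the input and then backtracks, so the group ends up on the last date. Note that . does not match line terminators by default, so multi-line input would need Pattern.DOTALL.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastDate {
    // Greedy ".*" pushes the captured date as far right as possible.
    private static final Pattern LAST = Pattern.compile(".*(\\d{2}/\\d{2}/\\d{4})");

    // Returns the last dd/dd/dddd occurrence, or null if there is none.
    static String lastDate(String input) {
        Matcher m = LAST.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "12/08/2008 some_text 21/10/2008 some_more_text 15/12/2008 and_finally_some_more";
        System.out.println(lastDate(s)); // 15/12/2008
    }
}
```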
Pattern p = Pattern.compile("ab.*?$");
Matcher m = p.matcher("abcabcabcabc");
boolean b = m.matches();
I do not understand what you are trying to do. Why only the last if they are all the same? Why a regular expression and why not int pos = s.lastIndexOf(String str) ?
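For a fixed substring such as 'ab', the lastIndexOf suggestion could look like the sketch below (the class name LastOccurrence and the method lastIndex are hypothetical). String.lastIndexOf returns the index of the last occurrence, or -1 when the substring is absent.

```java
public class LastOccurrence {
    // Thin wrapper so the behaviour is easy to try out with different inputs.
    static int lastIndex(String haystack, String needle) {
        return haystack.lastIndexOf(needle);
    }

    public static void main(String[] args) {
        System.out.println(lastIndex("abcabcabcabc", "ab")); // 9
        System.out.println(lastIndex("abcabcabcabc", "xy")); // -1
    }
}
```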
For the date example, you could do this with the Pattern API and not in the regex itself. The basic idea is to get all the matches, then return the last one.
public static void main(String[] args) {
// this may be over-kill, you can replace with a much simpler but more lenient version
final String dateRegex = "\\b(0?[1-9]|[12][0-9]|3[01])[- /.](0?[1-9]|1[012])[- /.](19|20)?[0-9]{2}\\b";
final String sample = "12/08/2008 some_text 21/10/2008 some_more_text 15/12/2008 and_finally_some_more";
List<String> allMatches = getAllMatches(dateRegex, sample);
System.out.println(allMatches.get(allMatches.size() - 1));
}
private static List<String> getAllMatches(final String regex, final String input) {
final Matcher matcher = Pattern.compile(regex).matcher(input);
return new ArrayList<String>() {{
while (matcher.find())
add(input.substring(matcher.start(), matcher.end()));
}};
}