Regex to parse command line options - java

I need to parse a string into key-value pairs, where the value may be optional. Standard command line parsers are not useful, because all the ones I checked accept a String[] and not a String. Thus, I resorted to regex and, sure enough, ran into the following:
Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems.
First, the input string:
"/opt/sensu/embedded/bin/ruby /opt/sensu/embedded/bin/check-graphite-stats.rb " +
"--crit 25 --host 99.99.999.9999:8082 --period -5mins --target 'alias(scale(divideSeries(" +
"summarize(sumSeries(nonNegativeDerivative(transformNull(exclude(" +
"\\\"unknown\\\"), 0))), \\\"30d\\\", \\\"sum\\\", false),summarize(" +
...gigantuous string
\\\"sum\\\", false)), 100), \\\"3pp error rate\\\")' " +
"--unknown-ignore --warn 5"
Next, my regex:
(--(?<option>.+?)\s+(?<value>.+?(?=--))?)+?
The above almost works, but not quite.
Output:
--crit 25
--host 99.99.999.9999:8082
--period -5mins
--target 'gigantuous string'
--unknown-ignore
--warn
Why is the value of --warn not picked up?

Because you're doing a positive lookahead to the next -- at the end of the regex ((?=--)), the value of the last parameter in the string isn't picked up, as it's not followed by --. Accepting the end of the string as an alternative ((?:(?=--)|$)) and then rejecting values that start with -- (by replacing .+? with .(?:[^-].*?)?) should behave the way you want:
(--(?<option>.+?)\s+(?<value>.(?:[^-].*?)?(?:(?=--)|$))?)+?
(However, as others have mentioned, I'd be very surprised that there isn't a Java argument parsing library that would suit your use case. Even if it means writing the code to split your string into arguments yourself, it might be less brittle.)
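For what it's worth, the "split it yourself" route can be sketched without a capturing regex at all. This is only a rough sketch under a loud assumption: no value contains whitespace followed by --, otherwise the split lands in the wrong place. The class and method names (OptionSplitter, parse) are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class OptionSplitter {
    /**
     * Maps each "--name" in the string to its value, or to null when no value follows.
     * Assumption: a value never contains whitespace followed by "--".
     */
    static Map<String, String> parse(String commandLine) {
        Map<String, String> options = new LinkedHashMap<>();
        // Split at every whitespace run that is immediately followed by "--".
        String[] chunks = commandLine.trim().split("\\s+(?=--)");
        for (String chunk : chunks) {
            if (!chunk.startsWith("--")) {
                continue; // leading interpreter and script path, not an option
            }
            // Everything up to the first whitespace is the name, the rest is the value.
            String[] nameAndValue = chunk.substring(2).split("\\s+", 2);
            String value = nameAndValue.length > 1 ? nameAndValue[1].trim() : null;
            options.put(nameAndValue[0], value);
        }
        return options;
    }
}
```

With an input shaped like the one in the question, --unknown-ignore maps to null and --warn maps to 5, while --period keeps -5mins because a single dash does not trigger the split.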

Related

JSON broken when double quotes comes inside the key/value

Sample data:
{"630":{"TotalLength":"33-3/8" - 36-3/4""},"631":{"Length":"34 37 7/8"}}
We are facing a double quotes issue in a JSON response. How can we replace the double quotes that appear inside a key or value with \"? Java is the development platform.
This answer is assuming that you are not in control of creating this JSON-like string. If you can control that part, then you should be escaping properly there itself.
In this case, since parsing systematically is not an option as it's not a valid JSON yet, all I could suggest is to go through the various strings and see if you can find a pattern on which you can apply some logic and escape all the "s which prevent the string from being a valid JSON.
Here is probably a way to start:
All of the "s that need to be there for the string to be valid JSON have one or more of the characters {, :, ,, and } next to them, with or without spaces between the " and the other JSON characters.
So, if you scan the JSON-like string in Java looking for "s, then each time you encounter one, check whether it sits next to any of the above characters (with or without spaces in between). If it does, leave it as it is. If not, replace that " with a \".
Note that the above method may or may not work depending on the data in question. What I mean to convey is an approach that you may find useful if there's absolutely no way for the string to be escaped during its creation, and if these strings follow a strict pattern with respect to the unescaped "s.
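A minimal sketch of that scan, to make the idea concrete. It assumes that none of the quotes are already escaped and that no value contains a quote directly next to {, :, , or }; if either assumption is false it will produce wrong output. The names QuoteEscaper and escapeInnerQuotes are made up for illustration.

```java
class QuoteEscaper {
    /** Escapes every double quote that has no JSON structural character next to it. */
    static String escapeInnerQuotes(String jsonLike) {
        StringBuilder out = new StringBuilder(jsonLike.length() + 16);
        for (int i = 0; i < jsonLike.length(); i++) {
            char c = jsonLike.charAt(i);
            if (c == '"' && !isStructural(jsonLike, i)) {
                out.append('\\'); // quote inside a key or value: escape it
            }
            out.append(c);
        }
        return out.toString();
    }

    /** True if the nearest non-space neighbour on either side is one of { : , } */
    private static boolean isStructural(String s, int quoteIndex) {
        int left = quoteIndex - 1;
        while (left >= 0 && s.charAt(left) == ' ') {
            left--;
        }
        int right = quoteIndex + 1;
        while (right < s.length() && s.charAt(right) == ' ') {
            right++;
        }
        boolean leftHit = left >= 0 && "{:,}".indexOf(s.charAt(left)) >= 0;
        boolean rightHit = right < s.length() && "{:,}".indexOf(s.charAt(right)) >= 0;
        return leftHit || rightHit;
    }
}
```

On the sample data this escapes the quote after 33-3/8 and the first of the two quotes after 36-3/4, and leaves every other quote alone.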

Solr Query: replacing whitespace with +

The application I'm working on uses Solr to index and search entries. I've been reading a bit about the logic and syntax behind it. Currently there's a bit of code that confuses me, and I'm hoping someone can clear up why the person who wrote it did it the way they did.
trimmedSearchField = SolrQueryUtil.escapeQueryString(trimmedSearchField).replaceAll("\\s+", "+");
String qString = "+(title:" + trimmedSearchField + "^100 OR description_t:" + trimmedSearchField + "^10 " +
"OR +" + trimmedSearchField +"^1)";
I just want to draw attention to the .replaceAll method: why would we want to replace whitespace with +? My goal is to refactor a search bar a bit, and I get better results omitting the replaceAll call.
Example: two elements with the descriptions "Helen of Troy" and "Helen from Troy" respectively. With replaceAll present, searching "Helen of Troy" returns only the first element; with replaceAll removed, both appear (which is what I want).
That .replaceAll() call just encodes any run of consecutive whitespace as a single '+', which means 'required' in Lucene syntax (and in Solr).
So it makes 'trimmedSearchField' mandatory in those fields.
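To see exactly what string ends up being sent, here is a small sketch that repeats only the string handling from the question. The SolrQueryUtil.escapeQueryString call is left out because it is project-specific, and the class and method names here are made up. It shows the text that is built, not how Solr parses it.

```java
class SolrQueryStringDemo {
    /** Collapses each run of whitespace into a single '+', as the replaceAll call does. */
    static String joinWithPlus(String searchField) {
        return searchField.trim().replaceAll("\\s+", "+");
    }

    /** Repeats the string concatenation from the question for an already-joined term. */
    static String buildQuery(String term) {
        return "+(title:" + term + "^100 OR description_t:" + term + "^10 "
                + "OR +" + term + "^1)";
    }

    public static void main(String[] args) {
        String term = joinWithPlus("Helen  of\tTroy");
        System.out.println(term);             // Helen+of+Troy
        System.out.println(buildQuery(term)); // the full query string
    }
}
```

Printing the result for both variants (with and without the replaceAll) and comparing them in the Solr admin query screen with debugQuery enabled is probably the quickest way to see how each one is interpreted.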

Regular Expression to Match Number of Lines and Characters per Line

I'm trying to make sure that a string contains between 0 and 3 lines, and that for a given line that is present that it contains 0 to 100 characters. It would need to be a valid expression for JavaScript and Java. Like many people doing RegEx I'm copying from various spots on the Internet.
Working backwards I think ^.{0,100}$ gets me the "line contains 0 to 100 characters", but trying to group that as (^.{0,100}$){0,3} doesn't work.
The new line character is probably part of my problem, so I ended up with something like .{0,100}(?:\n.{0,100}){0,2} trying to say "a line of 0 to 100 characters optionally followed by 0 to 2 instances of a new line and 0 to 100 more characters", but that also failed.
Up until now I got those expressions from other people. Using an online test tool I finally monkeyed this together: ^.{0,100}(?:(?:\r\n|[\r\n]).{0,100}){0,2}$ which appears to work.
So, my question is, am I missing any pitfalls in ^.{0,100}(?:(?:\r\n|[\r\n]).{0,100}){0,2}$ given what I'm after? Furthermore, even if that does work is it the best expression to use?
I think what you have will work fine. You can make the line break part a little more compact if you want, and you don't need ^ and $ if you are using matches():
String regex = ".{0,100}(?:[\r\n]+.{0,100}){0,2}";
EDIT
After some more thought I realized the newline suggestion above will match 4 (or more) lines as long as a couple of them are empty. So we are back to your suggested example. Oh well, at least the start and end anchors can be omitted.
String regex = ".{0,100}(?:(?:\r\n|[\r\n]).{0,100}){0,2}";
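A quick way to compare the two variants is to run both through matches(). The sketch below is only an illustration; the class name and the strict/loose method names are made up. It relies on the fact that . does not match line terminators in Java by default.

```java
import java.util.Collections;
import java.util.regex.Pattern;

class LineLimitCheck {
    // At most 3 lines of at most 100 characters each; matches() anchors both ends.
    private static final Pattern STRICT =
            Pattern.compile(".{0,100}(?:(?:\\r\\n|[\\r\\n]).{0,100}){0,2}");
    // The compact variant: a whole run of line breaks counts as one separator.
    private static final Pattern LOOSE =
            Pattern.compile(".{0,100}(?:[\\r\\n]+.{0,100}){0,2}");

    static boolean strict(String s) {
        return STRICT.matcher(s).matches();
    }

    static boolean loose(String s) {
        return LOOSE.matcher(s).matches();
    }

    public static void main(String[] args) {
        String tooLong = String.join("", Collections.nCopies(101, "x"));
        System.out.println(strict("one\ntwo\nthree"));  // true
        System.out.println(strict("a\r\nb\r\nc"));      // true
        System.out.println(strict("one\n\n\nfour"));    // false: four lines
        System.out.println(loose("one\n\n\nfour"));     // true: empty lines slip through
        System.out.println(strict(tooLong));            // false: 101 characters
    }
}
```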
I'm not very good at regular expressions but would this work?
^.{0,100}\n?(.{0,100}\n)?.{0,100}?$
Again I'm still new to reg exp, so if there is an error(which is likely) please tell me.

Java Regex Engine Crashing

Regex Pattern - ([^=](\\s*[\\w-.]*)*$)
Test String - paginationInput.entriesPerPage=5
Java Regex Engine Crashing / Taking Ages (> 2mins) finding a match. This is not the case for the following test inputs:
paginationInput=5
paginationInput.entries=5
My requirement is to get hold of the String on the right-hand side of = and replace it with something. The above pattern is doing it fine except for the input mentioned above.
I want to understand why this happens and how I can optimize the regex for my requirement so as to avoid other peculiar cases.
You can use a look behind to make sure your string starts at the character after the =:
(?<=\\=)([\\s\\w\\-.]*)$
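As for why it is crashing, it's the second * around the group. The quantifiers inside the group can already match any amount of text, so repeating the group lets the engine split the same run of characters between iterations in exponentially many ways, and it tries all of them once the match fails at the =. The longer the text before the =, the worse it gets, which is why the shorter inputs still finish. I'm not sure why you need that *, since that sounds like you are asking for: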
A single character, anything but equals
Then 0 or more repeats of the following group:
Any amount of white space
Then any amount of word characters, dash, or dot
End of string
Anyway, take out that *, and it doesn't spin forever anymore, but I'd still go for the more specific regex using the look behind.
Also, I don't know how you are using this, but why did you have the $ in there? Then you can only match the last one in the string (if you have more than one). It seems like you'd be better off with a look-ahead to the new line or the end: (?=\\n|$)
[Edit]: Update per comment below.
Try this:
=\\s*(.*)$
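Putting the look-behind suggestion to work for the "replace the right-hand side" requirement could look roughly like this. It is a sketch, not a drop-in: the class and method names are made up, and it assumes the value only contains whitespace, word characters, dashes and dots, as in the character class above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ValueReplacer {
    // Everything after an '=' up to the end of the input; no nested quantifiers.
    private static final Pattern VALUE = Pattern.compile("(?<==)[\\s\\w.-]*$");

    /** Replaces the text to the right of the last '=' with the given replacement. */
    static String replaceValue(String input, String replacement) {
        // quoteReplacement keeps '$' and '\' in the replacement from being interpreted.
        return VALUE.matcher(input).replaceFirst(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // prints paginationInput.entriesPerPage=10
        System.out.println(replaceValue("paginationInput.entriesPerPage=5", "10"));
    }
}
```

Because there is a single quantifier and the look-behind pins the start of the match, the engine has very little to backtrack over, even on long keys.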

What is the effect of "*" in regular expressions?

My Java source code:
String result = "B123".replaceAll("B*","e");
System.out.println(result);
The output is:ee1e2e3e.
Why?
'*' means zero or more matches of the previous character. So each empty string will be replaced with an "e".
You probably want to use '+' instead:
replaceAll("B+", "e")
You want this for your pattern:
B+
And your code would be:
String result = "B123".replaceAll("B+","e");
System.out.println(result);
The "*" matches "zero or more", and "zero" includes the empty string between each of the other characters and at the end of the input, which is where the extra e's come from.
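If it helps to see where those matches are, this small sketch lists every match that find() reports, with its start and end offsets. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StarMatches {
    /** Returns each match of the regex in the input as "start-end:text". */
    static List<String> matches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.start() + "-" + m.end() + ":" + m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(matches("B*", "B123")); // [0-1:B, 1-1:, 2-2:, 3-3:, 4-4:]
        System.out.println(matches("B+", "B123")); // [0-1:B]
    }
}
```

B* produces five matches (the B itself plus four empty ones), and replaceAll substitutes an e for each of them, giving ee1e2e3e. B+ produces only one.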
I spent over a month at a big tech company fixing a bug with * (splat!) in regular expressions. We maintained a little-known UNIX OS. My head nearly exploded because * also matches ZERO occurrences of a character. Talk about a hard bug to understand from your own reproductions. We were double substituting in some cases. I couldn't figure out why the code was wrong, but I was able to add code that caught the special (wrong) case, prevented the double substitution, and didn't break any of the utilities that included it (including sed and awk). I was proud to have fixed this bug but, as already mentioned:
For god's sake, just use + !!!!
