Optimization of regular expression for custom key-value pairs - java

I am trying to extract some key-value pairs plus their preceding text from a large file, but the regular expression used runs very slowly, so it needs optimization.
The input consists of fairly short strings with 1 or 2 key-value pairs, like
one two three/1234==five/5678 some other text
or
one two three/1234==five/5678 some other text four/910==five/1112 more text
The (apparently suboptimal) regular expression used is
(.*?)\s*([^ /]+)\s*/\s*([\d]+)\s*==\s*([^ /]+)\s*/\s*([\d]+)\s*
(Spaces may appear in numerous areas within the string, hence the repeated \s* elements.)
Sample code to test the above:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeyValueDemo {
    public static void main(String[] args) {
        // Short form: one key-value pair group
        String text = "one two three/1234==five/5678 some other text";
        // Long form: two key-value pair groups (overrides the short form above)
        text = "one two three/1234==five/5678 some other text four/910==five/1112 more text";
        String regex = "(.*?)\\s*([^ /]+)\\s*/\\s*([\\d]+)\\s*==\\s*([^ /]+)\\s*/\\s*([\\d]+)\\s*";
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int end = 0; // end offset of the last match, used to print the trailing text
        System.out.println("--------------------------------------------------");
        while (matcher.find()) {
            System.out.println("\"" + matcher.group(1) + "\"");
            System.out.println(matcher.group(2) + " == " + matcher.group(3));
            System.out.println(matcher.group(4) + " == " + matcher.group(5));
            end = matcher.end();
            System.out.println("--------------------------------------------------");
        }
        System.out.println(text.substring(end).trim());
    }
}
The output is the key-value pairs, plus the preceding text (all extracted fields are required). For example, for the longer string, the output is:
--------------------------------------------------
"one two"
three == 1234
five == 5678
--------------------------------------------------
"some other text"
four == 910
five == 1112
--------------------------------------------------
more text
In other words, the matcher.find() method runs for 1 or 2 rounds, depending on whether the string has the short or long form (1 or 2 key-value pairs, respectively).
The problem is that the extraction speed is low and at times, depending on the variation of the input string, the find() method takes a lot of time to complete.
Is there any better form for the regular expression, to significantly speed up processing?

It's never a good idea to put (.*?) at the beginning of a regex.
First, it can be slow. Although in theory non-greedy matches can be handled efficiently (see, for example, Russ Cox's re2 implementation), many regex implementations do not handle non-greedy matches very well, especially in the case where the find operation is going to fail. I don't know whether the Java regex implementation falls into this category or not, but there's no reason to tempt fate.
Second, it's pointless. The semantics of regex searching is that the first possible match will be found, which is identical to the semantics of .*?. To get the capture (.*?), you only need the substring from the end of the previous match (or the beginning of the string) to the beginning of the current match. That's trivial, especially since you're already tracking the end of the previous match.
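As a minimal sketch of that suggestion (the class name PairExtractor and the method extract are made up for illustration, and this has not been benchmarked against the original), the leading (.*?) is dropped and the preceding text is taken with substring() between the end of the previous match and the start of the current one:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PairExtractor {
    // The question's pattern without the leading (.*?)\s* and the trailing \s*.
    private static final Pattern PAIR = Pattern.compile(
            "([^ /]+)\\s*/\\s*(\\d+)\\s*==\\s*([^ /]+)\\s*/\\s*(\\d+)");

    /** Returns one row per match: {precedingText, key1, value1, key2, value2}. */
    static List<String[]> extract(String text) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = PAIR.matcher(text);
        int end = 0; // end offset of the previous match (0 before the first match)
        while (m.find()) {
            // The text the (.*?) group used to capture.
            String preceding = text.substring(end, m.start()).trim();
            rows.add(new String[] {preceding, m.group(1), m.group(2), m.group(3), m.group(4)});
            end = m.end();
        }
        return rows;
    }

    public static void main(String[] args) {
        String text = "one two three/1234==five/5678 some other text four/910==five/1112 more text";
        for (String[] row : extract(text)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

The trailing text ("more text" in the example) is still available as text.substring(end) after the loop, exactly as in the question's code.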

How are you reading the file? If you read the file line-by-line with BufferedReader#readLine() or Scanner#nextLine(), all you need to do is add \G to the beginning of your regex. It acts like \A the first time you apply the regex, anchoring the match to the beginning of the string. If that match succeeds, the next find() will be anchored to the position where the previous match ended. If it doesn't find a match starting right there, it gives up and doesn't look for any more matches in that string.
EDIT: I'm assuming each of the sequences you want to match, whether it's one key/value pair or two, is on its own line. If you read the file one line at a time, you can run the code in your question on each line.
As for why your regex is so slow, it's because the regex engine has to make multiple match attempts--possibly hundreds of them--on every non-matching line before it gives up. It isn't smart enough to realize that if the first attempt on a given line fails, no further attempts on that line will do any good. So it bumps forward one position and tries again. And it keeps doing that for the whole line.
If you were only expecting one match per line, I would say to use a start-of-line anchor (^ in MULTILINE mode).
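A rough sketch of the \G approach, assuming each line is processed on its own (the class name AnchoredScan and the method countMatches are hypothetical). The pattern is the one from the question with \G prepended, so every match has to start where the previous one ended, and a line with no match is rejected without a scan from every later position:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchoredScan {
    // The question's pattern with \G in front: each match must begin at the
    // start of the input or exactly where the previous match ended.
    private static final Pattern ANCHORED = Pattern.compile(
            "\\G(.*?)\\s*([^ /]+)\\s*/\\s*(\\d+)\\s*==\\s*([^ /]+)\\s*/\\s*(\\d+)\\s*");

    /** Counts how many consecutive key-value groups the line contains. */
    static int countMatches(String line) {
        Matcher m = ANCHORED.matcher(line);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatches("one two three/1234==five/5678 some other text"));
        System.out.println(countMatches("a line without any key-value pairs"));
    }
}
```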

Related

Regex: symbols before and after the word should have the same amount

I want to write a regular expression that checks whether a string has the same number of asterisks before and after a word,
that is, as many asterisks on the left side as on the right side. I've come up with \*+T\*+ where T is any non-asterisk word (simplified for this question). This also matches ****hello* in "****hello****", for example, which I don't want. It should only find hello, *hello*, **hello**, ***hello***, ****hello**** in ****hello****
Is there a way in Java, or in general, to reuse a variable length more than once, something along the lines of \*{x}T\*{x}, so that both sides have the same number of * at the beginning and at the end?
You can use this regex,
(?<!\*)(\**)\b\w+\b(\1)(?!\*)
This will match a bare word like hello, or *hello* or ***hello***, but not *hello***
Explanation:
\b\w+\b --> This part matches just a word like hello
(\**) --> This part captures zero or more asterisks in group 1 so that they can be back-referenced after the word; the match succeeds only if exactly the same number of asterisks is found there.
(?<!\*) and (?!\*) --> The negative lookbehind and negative lookahead ensure that it doesn't do a partial match inside a larger string; it only finds matches that are not preceded or followed by asterisk characters.
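For illustration, here is the pattern above as a Java string literal, applied with matches() so that the whole input has to be one balanced token (the class name BalancedAsterisks and the method isBalanced are invented for this sketch):

```java
import java.util.regex.Pattern;

class BalancedAsterisks {
    // Group 1 captures the leading asterisks; \1 requires the same run after the word.
    private static final Pattern BALANCED =
            Pattern.compile("(?<!\\*)(\\**)\\b\\w+\\b(\\1)(?!\\*)");

    /** True if the whole string is a word wrapped in equally long asterisk runs. */
    static boolean isBalanced(String s) {
        return BALANCED.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"hello", "**hello**", "*hello***", "****hello*"}) {
            System.out.println(s + " -> " + isBalanced(s));
        }
    }
}
```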

Java Regex with "Joker" characters

I'm trying to write a regex that validates an input field.
What I call "joker" chars are '?' and '*'.
Here is my Java regex:
"^$|[^\\*\\s]{2,}|[^\\*\\s]{2,}[\\*\\?]|[^\\*\\s]{2,}[\\?]{1,}[^\\s\\*]*[\\*]{0,1}"
What I'm trying to match is:
Minimum 2 alpha-numeric characters (other than '?' and '*')
The '*' can only appear once, and only at the end of the string
The '?' can appear multiple times
No whitespace at all
So for example :
abcd = OK
?bcd = OK
ab?? = OK
ab* = OK
ab?* = OK
??cd = OK
*ab = NOT OK
??? = NOT OK
ab cd = NOT OK
 abcd = NOT OK (space at the beginning)
I've made the regex a bit complicated and I'm lost. Can you help me?
^(?:\?*[a-zA-Z\d]\?*){2,}\*?$
Explanation:
The regex asserts that this pattern must appear twice or more:
\?*[a-zA-Z\d]\?*
which asserts that there must be one character in the class [a-zA-Z\d], with zero or more question marks on either side of it.
Then, the regex matches \*?, which means zero or one asterisk character, at the end of the string.
Here is an alternative regex that is faster, as revo suggested in the comments:
^(?:\?*[a-zA-Z\d]){2}[a-zA-Z\d?]*\*?$
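A small sketch, not a benchmark, that puts both patterns from this answer side by side as Java string literals so they can be checked against the sample inputs from the question (the class name JokerPatterns and both method names are made up here):

```java
import java.util.regex.Pattern;

class JokerPatterns {
    // First pattern: the group with ? on both sides, repeated two or more times.
    private static final Pattern NESTED =
            Pattern.compile("^(?:\\?*[a-zA-Z\\d]\\?*){2,}\\*?$");
    // Alternative: exactly two mandatory alphanumerics, then a plain character class.
    private static final Pattern UNROLLED =
            Pattern.compile("^(?:\\?*[a-zA-Z\\d]){2}[a-zA-Z\\d?]*\\*?$");

    static boolean matchesNested(String s) {
        return NESTED.matcher(s).matches();
    }

    static boolean matchesUnrolled(String s) {
        return UNROLLED.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"abcd", "?bcd", "ab??", "ab*", "ab?*", "??cd", "*ab", "???", "ab cd", " abcd"};
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + matchesNested(s) + " / " + matchesUnrolled(s));
        }
    }
}
```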
Here you go:
^\?*\w{2,}\?*\*?(?<!\s)$
Both are described and demonstrated at Regex101.
^ is the start of the String
\?* indicates any number of initial ? characters (must be escaped)
\w{2,} at least 2 alphanumeric characters (note that \w also accepts the underscore)
\?* continues with any number of ? characters
\*? and optionally one last * character
(?<!\s) checks, using a negative look-behind, that the last character is not whitespace (the preceding parts already accept no whitespace)
$ is the end of the String
Another way to solve this problem is the look-ahead mechanism (?=subregex). It is zero-length (it resets the regex cursor to the position it had before executing subregex), so it lets the regex engine run multiple tests on the same text via the construct
(?=condition1)
(?=condition2)
(?=...)
conditionN
Note: the last condition (conditionN) is not placed in (?=...), so that the regex engine moves the cursor past the tested part (that is, "consumes" it) and can go on to test whatever follows. For that to work, conditionN must match exactly the section we want to consume (the earlier conditions have no such limitation; they may match substrings of any length, for example just the first few characters).
So now we need to think about what our conditions are.
We want to match only alphanumeric characters, ? and *, where * may appear (optionally) only at the end. We can write that as ^[a-zA-Z0-9?]*[*]?$. This also covers the no-whitespace requirement, because whitespace is not among the accepted characters.
The second requirement is "Minimum 2 alpha-numeric characters". It can be written as .*?[a-zA-Z0-9].*?[a-zA-Z0-9] or (?:.*?[a-zA-Z0-9]){2,} (if we like shorter regexes). Since that condition doesn't test the whole text but only some part of it, we can place it in a look-ahead.
The above conditions seem to cover everything we wanted, so we can combine them into a regex like this:
^(?=(?:.*?[a-zA-Z0-9]){2,})[a-zA-Z0-9?]*[*]?$
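To try the combined regex against the examples from the question, a sketch along these lines could be used (the class name JokerValidator and the method isValid are hypothetical; the pattern is the one above, unchanged):

```java
import java.util.regex.Pattern;

class JokerValidator {
    // Look-ahead: at least two alphanumerics somewhere in the input.
    // Consuming part: only alphanumerics and ?, then an optional final *.
    private static final Pattern VALID =
            Pattern.compile("^(?=(?:.*?[a-zA-Z0-9]){2,})[a-zA-Z0-9?]*[*]?$");

    static boolean isValid(String input) {
        return VALID.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"abcd", "?bcd", "ab??", "ab*", "ab?*", "??cd", "*ab", "???", "ab cd", " abcd"};
        for (String s : samples) {
            System.out.println("\"" + s + "\" = " + (isValid(s) ? "OK" : "NOT OK"));
        }
    }
}
```

Unlike the regex in the question, this version rejects the empty string; if an empty field should be accepted, that case would have to be added separately.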

How to filtrate a long string (dynamic) with regex?

I have stored the response from a web application in a string. The string contains several URLs, and it is dynamic: it could hold anything from 10 to 1000 URLs.
I work with performance engineering, but this time I have to code a plugin in Java, and I am far from an expert in programming.
The problem I have is that my response string contains a lot of gibberish that I don't need, and I don't know how to filter it out. In my print/request I only want to send the URLs.
I've come this far:
responseData = "http://xxxx-f.akamaihd.net/i/world/open/20150426/1370235-005A/EPISOD-65354-005A-016f1729028090bf_,892,144,252,360,540,1584,2700,.mp4.csmil/segment1_4_av.ts?null=" +
"#EXTINF:10.000, " +
"http://xxxxx-f.akamaihd.net/i/world/open/20150426/1370235-005A/EPISOD-65365-005A-016f1729028090bf_,892,144,252,360,540,1584,2700,.mp4.csmil/segment2_4_av.ts?null=" +
"#EXTINF:fgsgsmoregiberish, " +
"http://xxxx-f.akamaihd.net/i/world/open/20150426/1370235-005A/EPISOD-6353-005A-016f1729028090bf_,892,144,252,360,540,1584,2700,.mp4.csmil/segment2_4_av.ts?null=";
pattern = "^(http://.*\\.ts)";
pr = Pattern.compile(pattern);
math = pr.matcher(responseData);
if (math.find()) {
System.out.println(math.group());
// in this print, I get everything from the response. I only want the URLS (dynamic. could be different names, but they all start with http and end with .ts).
}
else {
System.out.println("No Math");
}
Depending on what your URLs look like, you can use this naive pattern, which works for your examples and stops before the ? (written in Java style):
\\bhttps?://[^?\\s]+
To ensure there is a .ts at the end, you can change it to:
\\bhttps?://[^?\\s]+\\.ts
or
\\bhttps?://[^?\\s]+\\.ts(?=[\\s?]|\\z)
to check that the end of the path is reached.
Note that these patterns don't deal with URLs that contain spaces between double quotes.
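A short sketch of how the last pattern could be applied in a find() loop to collect every URL. The class name TsUrlExtractor, the method extract, and the shortened example.com URLs are placeholders chosen for illustration, not data from the question:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TsUrlExtractor {
    // URL up to (not including) the query string, required to end in .ts.
    private static final Pattern TS_URL =
            Pattern.compile("\\bhttps?://[^?\\s]+\\.ts(?=[\\s?]|\\z)");

    /** Collects every matching URL in the order it appears in the response. */
    static List<String> extract(String responseData) {
        List<String> urls = new ArrayList<>();
        Matcher m = TS_URL.matcher(responseData);
        while (m.find()) { // loop with find() instead of a single if
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String responseData = "http://example.com/a/segment1_4_av.ts?null="
                + "#EXTINF:10.000, "
                + "http://example.com/a/segment2_4_av.ts?null="
                + "#EXTINF:gibberish, "
                + "http://example.com/a/segment3_4_av.ts";
        extract(responseData).forEach(System.out::println);
    }
}
```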
Just make your regex lazy with .*? instead of greedy .*, i.e.:
pr = Pattern.compile("(https?.*?\\.ts)");
Regex demo:
https://regex101.com/r/nQ5pA7/1
Regex Explanation:
(https?.*?\.ts)
Match the regex below and capture its match into backreference number 1 «(https?.*?\.ts)»
Match the character string “http” literally (case sensitive) «http»
Match the character “s” literally (case sensitive) «s?»
Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
Match any single character that is NOT a line break character (line feed, carriage return, next line, line separator, paragraph separator) «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character “.” literally «\.»
Match the character string “ts” literally (case sensitive) «ts»
Use the following regex pattern:
(((http|ftp|https):\/{2})+(([0-9a-z_-]+\.)+([a-z]{2,4})(:[0-9]+)?((\/([~0-9a-zA-Z\#\+\%#\.\/_-]+))?(\?[0-9a-zA-Z\+\%#\/&\[\];=_-]+)?)?))\b
Explanation:
an http, https or ftp scheme followed by // : ((http|ftp|https):\/{2})
the + after that group lets it repeat one or more times before the next part
one or more host labels, each ending in a . : ([0-9a-z_-]+\.)+
top-level domain : ([a-z]{2,4})
optional port, a colon followed by digits (here ? denotes zero or one time) : (:[0-9]+)?
the rest of the URL, an optional path and an optional query string : ((\/([~0-9a-zA-Z\#\+\%#\.\/_-]+))?(\?[0-9a-zA-Z\+\%#\/&\[\];=_-]+)?)?

Regex for excluding a pattern of more than 2 "==" in Java

I need to extract each piece of text that starts with a heading delimited by two or more equals signs (i.e. ==, ===, ==== etc.) and runs up to the next run of two or more equals signs, and store the pieces in an ArrayList.
Ex:
==Notes and references== {{Refli=st|35e=m}}=====Bibliography=====Text starts
Expected output:
[==Notes and references== {{Refli=st|35e=m}}, =====Bibliography=====Text starts]
I have this regex:
"==+([^==+]*)==+([^==+]*)";
The output I am getting stops at the first single =:
[==Notes and references== {{Refli, =====Bibliography=====Text starts]
[^==+]* is a character class, so it matches any run of characters other than = and +. This is not what you want.
Here, it might be easier to use something like:
"==+(.*?)==+(.*?)(?===|$)";
So that you can allow single = signs in between the multiple =.
(?===|$) is a positive lookahead ((?= ... )) and makes sure there's either two consecutive = signs ahead or there's the end of the string.
Or if you want to negate specifically the ==+ in the parts in between you can use negative lookaheads:
"==+((?:(?!==+).)*)==+((?:(?!==+).)*)";
This syntax ((?:(?!==+).)*) will check for every character and make sure it isn't a == (or more).
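As a sketch of how either pattern might be used to fill the list (the class name HeadingSplitter and the method sections are invented for this example, and it assumes the input is a single line, since . does not match line breaks by default):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeadingSplitter {
    // Heading (==...==) plus the text after it, up to the next == or the end.
    private static final Pattern SECTION =
            Pattern.compile("==+(.*?)==+(.*?)(?===|$)");

    /** Returns each whole match: heading markers, title, and trailing text. */
    static List<String> sections(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = SECTION.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "==Notes and references== {{Refli=st|35e=m}}=====Bibliography=====Text starts";
        System.out.println(sections(text));
    }
}
```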

regular expression to allow only 1 dash

I have a textbox where I get the last name of a user. How do I allow only one dash (-) in a regular expression? And it's not supposed to be in the beginning or at the end of the string.
I have this code:
Pattern p = Pattern.compile("[^a-z-']", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(name);
Try to rephrase the question in more regexy terms. Rather than "allow only one dash, and it can't be at the beginning" you could say, "the string's beginning, followed by at least one non-dash, followed by one dash, followed by at least one non-dash, followed by the string's end."
the string's beginning: ^
at least one non-dash: [^-]+
followed by one dash: -
followed by at least one non-dash: [^-]+
followed by the string's end: $
Put those all together, and there you go. If you're using this in a context that matches against the complete string (not just any substring within it), you don't need the anchors -- though it may be good to put them in anyway, in case you later use that regex in a substring-matching context and forget to add them back in.
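Put together, the pieces give ^[^-]+-[^-]+$. The sketch below wraps it in a helper; note that this pattern requires exactly one dash, so a name without any dash is rejected. The second pattern, which makes the dash part optional, is an assumption about what "allow only one dash" is meant to cover and is not part of the answer above. All class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class SingleDash {
    // Exactly one dash, with at least one non-dash character on each side.
    private static final Pattern EXACTLY_ONE = Pattern.compile("^[^-]+-[^-]+$");
    // Assumed variant: the dash and what follows it are optional (zero or one dash).
    private static final Pattern AT_MOST_ONE = Pattern.compile("^[^-]+(?:-[^-]+)?$");

    static boolean hasExactlyOneInnerDash(String s) {
        return EXACTLY_ONE.matcher(s).matches();
    }

    static boolean hasAtMostOneInnerDash(String s) {
        return AT_MOST_ONE.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"smith-jones", "smith", "-smith", "smith-", "a-b-c"}) {
            System.out.println(s + " -> " + hasExactlyOneInnerDash(s) + " / " + hasAtMostOneInnerDash(s));
        }
    }
}
```

Neither pattern restricts the other characters to letters; that check would still have to be combined with the character class from the question.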
Why not just use indexOf() in String?
String s = "last-name";
int first = s.indexOf('-');
int last = s.lastIndexOf('-');
if(first == 0 || last == s.length()-1) // Checks if a dash is at the beginning or end
System.out.println("BAD");
if(first != last) // Checks if there is more than one dash
System.out.println("BAD");
Whatever the speed difference compared with a regex, it will not be noticeable for strings as short as last names. Also, it will make debugging and future maintenance MUCH easier.
It looks like your regex represents a fragment of an invalid value, and you're presumably using Matcher.find() to find if any part of your value matches that regex. Is that correct? If so, you can change your pattern to:
Pattern p = Pattern.compile("[^a-zA-Z'-]|-.*-|^-|-$");
which will match a non-letter-non-hyphen-non-apostrophe character, or a sequence of characters that both starts and ends with hyphens (thereby detecting a value that contains two hyphens), or a leading hyphen, or a trailing hyphen.
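A minimal sketch of that idea, assuming the value is rejected as soon as find() locates any invalid fragment (the class name LastNameCheck and the method isValid are hypothetical):

```java
import java.util.regex.Pattern;

class LastNameCheck {
    // Matches any fragment that makes the value invalid: a disallowed character,
    // two hyphens (with anything between them), a leading or a trailing hyphen.
    private static final Pattern INVALID =
            Pattern.compile("[^a-zA-Z'-]|-.*-|^-|-$");

    /** A name is accepted when no invalid fragment is found anywhere in it. */
    static boolean isValid(String name) {
        return !INVALID.matcher(name).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"O'Neil-Smith", "Smith", "Smith--Jones", "-Smith", "Smith-", "Sm1th"}) {
            System.out.println(s + " -> " + (isValid(s) ? "ok" : "BAD"));
        }
    }
}
```

Because the check only looks for invalid fragments, an empty string passes; a separate length check would be needed if empty input should be rejected.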
This regex represents one or more non-hyphens, followed by a single hyphen, followed by one or more non-hyphens.
^[^\-]+\-[^\-]+$
I'm not sure if the hyphen in the middle needs to be escaped with a backslash... That probably depends on what platform you're using for regex.
Try a pattern like [a-z]-[a-z].
Pattern p = Pattern.compile("[a-z]-[a-z]");
