Regex expression for multiple patterns in 1 line - java

I am scraping a log from which I need three elements per entry. An added difficulty is that I parse the log via readLine() in my Java program, i.e. one line at a time. (If it is possible to read multiple lines when parsing, let me know :) ) NOTE: I have no control over the log output format.
There are 2 possibilities of what I must extract. Either the log is nice and gives the following
NICE FORMAT
.text.rank 0x0000000000400b8f 0x351 is_x86.o
where I must grab .text.rank , 0x0000000000400b8f , and 0x351
Now the not-so-nice case: if the name is too long, it bumps everything else to the next line, as shown below. The only thing after the first element is then one blank space followed by a newline (\n), which gets clobbered by readLine() anyway.
EVIL FORMAT : Note each line is in a separate arraylist entry.
.text.__sfmoreglue
0x0000000000401d00 0x55 /mnt/drv2homelibc_popcorn.a(lib_a-findfp.o)
Therefore what the regex actually sees is:
.text.__sfmoreglue
CORNER CASE FORMAT that also occurs within the log but I DO NOT want
*(.text.unlikely)
Finally below is my Pattern line I am currently using for the first line and pline2 is what is used on the next line when group 2 of the first line is empty.
UPDATE: The pattern below works for the NICE FORMAT and EVIL FORMAT But now pattern pline2 has no matches, even though on regex101.com it is correct. Link: https://regex101.com/r/vS7vZ3/9
UPDATE2: I fixed it, I forgot to add m2.find() once I compiled the second line with Pattern pline2. Corrected code is below.
Pattern p = Pattern.compile("^[ \\s](\\.[tex]*\\.[\\._\\-\\#a-zA-Z0-9]*)\\s*([x0-9a-f]*)[ \\s]*([x0-9a-f]*).*");
Pattern pline2 = Pattern.compile("^\\s*([x0-9a-f]*)[ \\s]*([x0-9a-f]*)\\s*[\\w\\(\\)\\.\\-]*");
To give a little background: I first match the name .text.whatever into m.group(1), then the address 0x000012345 into m.group(2), and finally the size 0xa48 into m.group(3). This all assumes the log is in the NICE format. If it is in the EVIL format, I see that group(2) is empty, so I read the next line of the log into a temp buffer and apply the second pattern pline2 to that new line.
Can someone help me with the regex?
Is there a way I can make sure my current line (or even better, just the second grouping) is either the NICE FORMAT or is empty?
As requested my java code:
//1st line pattern
Pattern p = Pattern.compile("^[ \\s](\\.[tex]*\\.[\\._\\-\\#a-zA-Z0-9]*)\\s*([x0-9a-f]*)[ \\s]*([x0-9a-f]*).*");
//conditional 2nd line pattern
Pattern pline2 = Pattern.compile("^\\s*([x0-9a-f]*)[ \\s]*([x0-9a-f]*)\\s*[\\w\\(\\)\\.\\-]*");
while((temp = br1.readLine()) != null){
Matcher m = p.matcher(temp);
while(m.find()){
System.out.println("What regex finds: m1:"+m.group(1)+"# m2:"+m.group(2)+"# m3:"+m.group(3));
if(!m.group(1).isEmpty() && m.group(2).isEmpty() && m.group(3).isEmpty()){
//means we probably hit a long symbol name and important stuff is on the next line
//save the name at least
name = m.group(1);
//read and utilize the next line
if((temp = br1.readLine()) == null){
return;
}
System.out.println("EVILline2:"+temp); //sanity check the input
System.out.println(pline2.toString()); //sanity check the regex
Matcher m2= pline2.matcher(temp);
while(m2.find()){
System.out.println("regex line2 finds: m1:"+m2.group(1));//+"# m2:"+m2.group(2));
if(m2.group(2).isEmpty()){
size = 0;
}else{
size = Long.parseLong(m2.group(2).replaceFirst("0x", ""),16);
}
addr = Long.parseLong(m2.group(1).replaceFirst("0x", ""),16);
System.out.println("#########LONG NAME: "+name+" addr:"+addr+" size:"+size);
}
}//end if
else{ // assume in NICE FORMAT
//do nice format stuff.
}//end else
}//end while
}//end outerwhile
An Aside, The output I currently get:
line: .text.c_print_results
What regex finds: m1:.text.c_print_results# m2:# m3:
EVIL FORMATline2: 0x00000000004001e6 0x231 c_print_results_x86.o
^\s*([x0-9a-f]*)[ \s]*([x0-9a-f]*)\s*[\w\(\)\.\-]*
Exception in thread "main" java.lang.IllegalStateException: No match found
at java.util.regex.Matcher.group(Matcher.java:536)
at java.util.regex.Matcher.group(Matcher.java:496)
at regexTest.regex.grabSymbolsInRange(regex.java:143)
at regexTest.regex.main(regex.java:489)

You have a few issues with your pattern.
The first is the separation of the first and second groups (that's why group 2 comes back empty).
You have 4 groups and you need 3.
After capturing your 3 values you can stop matching, so the pattern after the last group isn't necessary.
On regex101 you need the global modifier (/g) so it returns all matches. Java has no such flag; calling Matcher.find() in a loop does the same job.
So, instead of your posted regex, you can try the following (written as a Java string literal, so without a /g suffix):
(\\.[tex]*\\.[\\._\\-\\#a-zA-Z0-9]*)\\s*([x0-9a-f]*)[ \\s]+([x0-9a-f]*)
Tested on Regex101.com:
https://regex101.com/r/lM4bQ9/1
Other than that, a few suggestions:
If you know your text is going to start with text, just put it in the pattern. Don't use [tex]*, which makes the engine do extra work and also accepts strings such as .tete.
[ \s] is the same as \s.
[\._\-\#a-zA-Z0-9]*, from what I understood, is basically everything but whitespace, so why not just use [^\s]*?
With these in mind, I would suggest this pattern instead (again as a Java string literal, with no /g suffix, since Java has no global flag):
(\\.text\\.[^\\s]*)\\s*([x0-9a-f]*)\\s+([x0-9a-f]*)
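The two-line handling described in the question can be sketched roughly as follows. This is only an illustration under assumptions: the class name MapSymbolParser, the method parse, and both patterns are hypothetical and are not the asker's code. It assumes the log lines are already in a list, that addresses and sizes always start with 0x, and that a name without an address is followed by a continuation line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MapSymbolParser {
    // Name, optionally followed by address and size on the same line (NICE format).
    private static final Pattern NAME_LINE = Pattern.compile(
            "^\\s*(\\.text\\.\\S*)(?:\\s+(0x[0-9a-f]+)\\s+(0x[0-9a-f]+))?");
    // Continuation line of the EVIL format: address and size only.
    private static final Pattern CONTINUATION = Pattern.compile(
            "^\\s*(0x[0-9a-f]+)\\s+(0x[0-9a-f]+)");

    static List<String> parse(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            Matcher m = NAME_LINE.matcher(lines.get(i));
            if (!m.find()) {
                continue; // e.g. "*(.text.unlikely)" does not start with ".text."
            }
            String name = m.group(1);
            String addr = m.group(2); // null when the optional group did not match
            String size = m.group(3);
            if (addr == null && i + 1 < lines.size()) {
                // Long name: look for address and size on the next line.
                Matcher m2 = CONTINUATION.matcher(lines.get(i + 1));
                if (m2.find()) { // find() must be called before group()
                    addr = m2.group(1);
                    size = m2.group(2);
                    i++; // the continuation line has been consumed
                }
            }
            if (addr != null) {
                out.add(name + " addr=" + Long.parseLong(addr.substring(2), 16)
                        + " size=" + Long.parseLong(size.substring(2), 16));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                " .text.rank 0x0000000000400b8f 0x351 is_x86.o",
                " *(.text.unlikely)",
                " .text.__sfmoreglue ",
                " 0x0000000000401d00 0x55 /mnt/drv2homelibc_popcorn.a(lib_a-findfp.o)");
        parse(lines).forEach(System.out::println);
    }
}
```

Making the address and size an optional group means one pattern covers both the NICE line and the first line of the EVIL format, and a null group(2) signals that the next line should be consulted.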

Related

Cannot match string with regex pattern when such string is done of multiple lines

I have a string like the following:
SYBASE_OCS=OCS-12_5
SYBASE=/opt/sybase/oc12.5.1-EBF12850
//there is a newline here as well
The string at the debugger appears like this:
I am trying to match the part coming after SYBASE=, meaning I'm trying to match /opt/sybase/oc12.5.1-EBF12850.
To do that, I've written the following code:
String key = "SYBASE";
Pattern extractorPattern = Pattern.compile("^" + key + "=(.+)$");
Matcher matcher = extractorPattern.matcher(variableDefinition);
if (matcher.find()) {
return matcher.group(1);
}
The problem I'm having is that this string on 2 lines is not matched by my regex, even if the same regex seems to work fine on regex 101.
State of my tests:
If I don't have multiple lines (e.g. if I only had SYBASE=... followed by the new line), it would match
If I evaluate the expression extractorPattern.matcher("SYBASE_OCS=OCS-12_5\\nSYBASE=/opt/sybase/oc12.5.1-EBF12850\\n") (note the double backslash in front of the new line), it would match.
I have tried to use variableDefinition.replace("\n", "\\n") to what I give to the matcher(), but it doesn't match.
It seems something simple but I can't get out of it. Can anyone please help?
Note: the string in that format is returned by a shell command, I can't really change the way it gets returned.
By default, the anchors ^ and $ anchor the match to the start and end of the whole input.
In your case you would like to match the start and end of a line within the input string. To do this you'll need to change the behavior of these anchors. This can be done by using the multi line flag.
Either by specifying it as an argument to Pattern.compile:
Pattern.compile("regex", Pattern.MULTILINE)
Or by using the embedded flag expression: (?m):
Pattern.compile("(?m)^" + key + "=(.+)$");
The reason it seemed to work on regex101.com is that they add both the global and multi line flags by default.
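A small sketch of the difference the flag makes, using the input from the question. The class name MultilineAnchorDemo and the helper extract are made up for illustration, and wrapping the key in Pattern.quote is an extra precaution that the original code does not have.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultilineAnchorDemo {
    // Returns the value after "key=", or null when nothing matches.
    static String extract(String key, String input, int flags) {
        Pattern p = Pattern.compile("^" + Pattern.quote(key) + "=(.+)$", flags);
        Matcher m = p.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "SYBASE_OCS=OCS-12_5\nSYBASE=/opt/sybase/oc12.5.1-EBF12850\n";
        // Without MULTILINE, ^ only matches at the very start of the input.
        System.out.println(extract("SYBASE", input, 0));
        // With MULTILINE, ^ and $ also match at each line boundary.
        System.out.println(extract("SYBASE", input, Pattern.MULTILINE));
    }
}
```

The first call prints null because the input begins with SYBASE_OCS, and the second prints the path from the second line.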

Java pattern to check text line

I have lines of text that I read on Android, and I want to check whether a line is acceptable before the code runs on it.
Here are my lines:
[al:Vol30]
[offset:0]
[00:37.00]3
[00:38.00]2
[00:39.00]1
[00:40.00]0/
So I want to check that a line has a pattern like this: [00:37.00]3
I created a pattern with this code:
String pattern = "^[d{2}:d{2}.d{2}].";
....
//check line
if (str.matches(pattern)) {
// do something
}
However, this pattern is not correct, so all lines fail. Can someone make a suggestion?
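Try this. The square brackets must be escaped, d needs a backslash to mean "digit", and the dot between the digit groups is escaped so that it matches only a literal '.':
String pattern = "^\\[\\d{2}:\\d{2}\\.\\d{2}\\].";
Because String.matches() has to match the whole line, the trailing . allows exactly one character after the closing bracket. Use .* instead if lines such as [00:40.00]0/ should also pass.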
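As a quick check against the sample lines from the question, here is a minimal sketch. The class name LyricLineCheck and the method isTimedLine are hypothetical. It assumes any text, including none, may follow the timestamp, so it ends the pattern with .* rather than a single dot.

```java
import java.util.regex.Pattern;

public class LyricLineCheck {
    // [mm:ss.xx] followed by arbitrary text; brackets and the dot are escaped.
    private static final Pattern TIMED_LINE =
            Pattern.compile("^\\[\\d{2}:\\d{2}\\.\\d{2}\\].*");

    static boolean isTimedLine(String line) {
        // matches() requires the whole line to fit the pattern.
        return TIMED_LINE.matcher(line).matches();
    }

    public static void main(String[] args) {
        String[] lines = {"[al:Vol30]", "[offset:0]", "[00:37.00]3", "[00:40.00]0/"};
        for (String line : lines) {
            System.out.println(line + " -> " + isTimedLine(line));
        }
    }
}
```

Compiling the pattern once and reusing it avoids recompiling on every call, which String.matches() would do.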

Negative lookahead regex not working in Java

The following regex successfully works when testing here, but when I try to implement it into my Java code, it won't return a match. It uses a negative lookahead to ensure no newlines occur between MAIN LEVEL and Bedrooms. Why won't it work in Java?
regex
^\s*\bMAIN LEVEL\b\n(?:(?!\n\n)[\s\S])*\bBedrooms:\s*(.*)
Java
pattern = Pattern.compile("^\\s*\\bMAIN LEVEL\\b\\n(?:(?!\\n\\n)[\\s\\S])*\\bBedrooms:\\s*(.*)");
match = pattern.matcher(content);
if(match.find())
{
//Doesn't reach here
String bed = match.group(1);
bed = bed.trim();
}
content is just a string read from a text file, which contains the exact text shown in the demo linked above.
File file = new File("C:\\Users\\ME\\Desktop\\content.txt");
content = new Scanner(file).useDelimiter("\\Z").next();
UPDATE:
I changed my code to include a multiline modifier (?m), but it prints out "null".
pattern = Pattern.compile("(?m)^\\s*\\bMAIN LEVEL\\b\\n(?:(?!\\n\\n)[\\s\\S])*\\bBedrooms:\\s*(.*)");
match = pattern.matcher(content);
if(match.find())
{ // Still not reaching here
mainBeds=match.group(1);
mainBeds= mainBeds.trim();
}
System.out.println(mainBeds); // Prints null
The problem:
As explained in Alan Moore's answer, it's a mismatch between the format of the Line-Separators used in your file (\r\n), and what your pattern is specifying (\n):
Original code:
Pattern.compile("^\\s*\\bMAIN LEVEL\\b\\n(?:(?!\\n\\n)[\\s\\S])*\\bBedrooms:\\s*(.*)");
Note: I explain what the \r and \n represent, and the context and difference between \r\n and \n, in the second item of the "side notes" section.
The solution(s):
Most/all Java versions:
You can use \r?\n to match both formats, and this is sufficient in most cases.
Most/all Java versions:
You can use \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029] to match "Any Unicode linebreak sequence".
Java 8 and later:
You can use the Linebreak Matcher (\R). It is equivalent to the second method (above), and whenever possible (Java 8 or later), this is the recommended method.
Resulting code (3rd method):
Pattern.compile("^\\s*\\bMAIN LEVEL\\b\\R(?:(?!\\R\\R)[\\s\\S])*\\bBedrooms:\\s*(.*)");
Side notes:
You can replace \\R\\R with \\R{2}, which is more readable.
Different formats of line-breaks exist and are used in different systems because early OSs inherited the "line-break logic" from mechanical typing machines, like typewriters.
The \r in code represents a Carriage-Return, aka CR. The idea behind this is to return the typing cursor to the start of the line.
The \n in code represents a Line-Feed, aka LF. The idea behind this is to move the typing cursor to the next line.
The most common line-break formats are CR-LF (\r\n), used primarily by Windows; and LF (\n), used by most UNIX-like systems. This is the reason why "\r?\n will be sufficient in most cases", and you can reliably use it for files produced on typical desktop systems.
However, some rarer or older systems use CR alone (classic Mac OS, for example), LF-CR, or something else entirely, which is why the second method has so many characters in it. If you need the code to be compatible with every system, you will need the second or, preferably, the third method.
Here is a useful method for testing where your patterns are failing:
String content = "..."; //Replace "..." with your content.
String patternString = "..."; //Replace "..." with your pattern.
String lastPatternSuccess = "None. You suck at Regex!";
for (int i = 0; i <= patternString.length(); i++) {
try {
String patternSubstring = patternString.substring(0, i);
Pattern pattern = Pattern.compile(patternSubstring);
Matcher matcher = pattern.matcher(content);
if (matcher.find()) {
lastPatternSuccess = i + " - Pattern: " + patternSubstring + " - Match: \n" + matcher.group();
}
} catch (Exception ex) {
//Ignore and jump to next
}
}
System.out.println(lastPatternSuccess);
It's the line separators. You're looking for \n, but your file actually uses \r\n. If you're running Java 8, you can change every \\n in your code to \\R (the universal line separator). For Java 7 or earlier, use \\r?\\n.
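The mismatch can be reproduced with a small string that uses Windows line endings. This is a sketch under assumptions: the class name LineBreakDemo, the helper findBedrooms, and the sample content are invented for illustration, and the patterns are simplified versions of the one in the question that use the \r?\n variant.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineBreakDemo {
    // Returns the text captured after "Bedrooms:", or null when there is no match.
    static String findBedrooms(String regex, String content) {
        Matcher m = Pattern.compile(regex).matcher(content);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Hypothetical file content with CR-LF line endings.
        String content = "MAIN LEVEL\r\nKitchen: 1\r\nBedrooms: 3\r\n";

        // Expects a bare \n after the heading, so it cannot match CR-LF input.
        String lfOnly =
                "(?m)^\\s*MAIN LEVEL\\n(?:(?!\\n\\n)[\\s\\S])*Bedrooms:\\s*(.*)";
        // Accepts both \n and \r\n; the lookahead stops at an empty line.
        String crlfAware =
                "(?m)^\\s*MAIN LEVEL\\r?\\n(?:(?!(?:\\r?\\n){2})[\\s\\S])*Bedrooms:\\s*(.*)";

        System.out.println(findBedrooms(lfOnly, content));
        System.out.println(findBedrooms(crlfAware, content));
    }
}
```

The first call finds nothing because a \r sits between the heading and the \n, while the second captures the bedroom count.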

Use RegEx in Java to extract parameters in between parentheses

I'm writing a utility to extract the names of header files from JSPs. I have no problem reading the JSPs line by line and finding the lines I need. I am having a problem extracting the specific text needed using regex. After looking at many similar questions I'm hitting a brick wall.
An example of the String I'll be matching from within is:
<jsp:include page="<%=Pages.getString(\"MY_HEADER\")%>" flush="true"></jsp:include>
All I need is MY_HEADER for this example. Any time I have this tag:
<%=Pages.getString
I need what comes between this:
<%=Pages.getString(\" and this: )%>
Here is what I have currently (which is not working, I might add) :
String currentLine;
while ((currentLine = fileReader.readLine()) != null)
{
Pattern pattern = Pattern.compile("<%=Pages\\.getString\\(\\\\\"([^\\\\]*)");
Matcher matcher = pattern.matcher(currentLine);
while(matcher.find()) {
System.out.println(matcher.group(1).toString());
}}
I need to be able to use the Java RegEx API and regex to extract those header names.
Any help on this issue is greatly appreciated. Thanks!
EDIT:
Resolved this issue, thankfully. The tricky part was, after being given the right regex, it had to be taken into account that the String I was feeding to the regex was always going to have two " \ " characters ( (\"MY_HEADER\") ) that needed to be escaped in the pattern.
Here is what worked (thanks to the help ;-)):
Pattern pattern = Pattern.compile("<%=Pages\\.getString\\(\\\\\"([^\\\\\"]*)");
This should do the trick:
<%=Pages\\.getString\\(\\\\\"([^\\\\]*)
Yeah that's a scary number of back slashes. matcher.group(1) should return MY_HEADER. It starts at the \" and matches everything until the next \ (which I assume here will be at \")%>.)
Of course, if your target text contains a backslash (\), this will not work. But you didn't give an indication that you'd ever be looking for something like <%=Pages.getString(\"Fun!\Yay!\")%> -- where this regex would only return Fun! and ignore the rest.
EDIT
The reason your test case was failing is because you were using this test string:
String currentLine = "<%=Pages.getString(\"MY_HEADER\")%>";
This is the equivalent of reading it in from a file and seeing:
<%=Pages.getString("MY_HEADER")%>
Note the lack of any \. You need to use this instead:
String currentLine = "<%=Pages.getString(\\\"MY_HEADER\\\")%>";
Which is the equivalent of what you want.
This is test code that works:
String currentLine = "<%=Pages.getString(\\\"MY_HEADER\\\")%>";
Pattern pattern = Pattern.compile("<%=Pages\\.getString\\(\\\\\"([^\\\\]*)");
Matcher matcher = pattern.matcher(currentLine);
while(matcher.find()) {
System.out.println(matcher.group(1).toString());
}
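One possible way to cut down on the backslashes, offered as an alternative sketch and not as what the answer above does, is to let Pattern.quote escape the fixed prefix and to make the backslash before the quote optional. The class name HeaderNameExtractor and the method headerName are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeaderNameExtractor {
    // Pattern.quote escapes the literal prefix; the regex part then reads:
    //   \\?"      an optional backslash followed by a double quote
    //   ([^\\"]*) everything up to the next backslash or double quote
    private static final Pattern HEADER = Pattern.compile(
            Pattern.quote("<%=Pages.getString(") + "\\\\?\"([^\\\\\"]*)");

    static String headerName(String line) {
        Matcher m = HEADER.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Line as it appears in the file, with literal backslashes before the quotes.
        String escaped = "<%=Pages.getString(\\\"MY_HEADER\\\")%>";
        // Same tag without the backslashes.
        String plain = "<%=Pages.getString(\"MY_HEADER\")%>";
        System.out.println(headerName(escaped));
        System.out.println(headerName(plain));
    }
}
```

Because the backslash is optional, the same pattern handles both test strings discussed above.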

Replace the line containing the Regex

I have an input string containing multiple lines(demarcated by \n). I need to search for a pattern in the lines and if its found, then replace the complete line with empty string.
My code looks like this,
Pattern p = Pattern.compile("^.*##.*$");
String regex = "This is the first line \n" +
"And this is second line\n" +
"Thus is ##{xyz} should not appear \n" +
"This is 3rd line and should come\n" +
"This will not appear ##{abc}\n" +
"But this will appear\n";
Matcher m = p.matcher(regex);
System.out.println("Output: "+m.group());
I expect the response as :
Output: This is the first line
And this is second line
This is 3rd line and should come
But this will appear.
I am unable to get it, please help, me out.
Thanks,
Amit
In order to let the ^ match the start of a line and $ match the end of one, you need to enable the multi-line option. You can do that by adding (?m) in front of your regex like this: "(?m)^.*##.*$".
Also, you want to keep calling find() for as long as your regex finds a match, which can be done like this:
while(m.find()) {
System.out.println("Output: "+m.group());
}
Note the regex will match these lines (not the ones you indicated):
Thus is ##{xyz} should not appear
This will not appear ##{abc}
But if you want to replace the lines that contain ##, as the title of your post suggests, do it like this:
public class Main {
public static void main(String[] args) {
String text = "This is the first line \n" +
"And this is second line\n" +
"Thus is ##{xyz} should not appear \n" +
"This is 3rd line and should come\n" +
"This will not appear ##{abc}\n" +
"But this will appear\n";
System.out.println(text.replaceAll("(?m)^.*##.*$(\r?\n|\r)?", ""));
}
}
Edit: accounted for *nix, Windows and Mac line breaks as mentioned by PSeed.
Others mention turning on multiline mode but since Java does not default to DOTALL (single line mode) there is an easier way... just leave the ^ and $ off.
String result = regex.replaceAll( ".*##.*", "" );
Note that the issue with either this or using:
"(?m)^.*##.*$"
...is that it will leave the blank lines in. If it is a requirement to not have them then the regex will be different.
Full regex that does not leave blank lines:
String result = regex.replaceAll( ".*##.*(\r?\n|\r)?", "" );
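The difference between the two replacements can be seen on a short input. This is a minimal sketch; the class name RemoveMarkedLines and the methods blankOut and dropLines are made up for illustration.

```java
public class RemoveMarkedLines {
    // Empties every line containing "##" but keeps its line break.
    static String blankOut(String text) {
        return text.replaceAll(".*##.*", "");
    }

    // Removes every line containing "##" together with its line break.
    static String dropLines(String text) {
        return text.replaceAll(".*##.*(\\r?\\n|\\r)?", "");
    }

    public static void main(String[] args) {
        String text = "first\nhas ##{xyz} marker\nlast\n";
        System.out.print(blankOut(text));  // leaves an empty line behind
        System.out.println("---");
        System.out.print(dropLines(text)); // no empty line
    }
}
```

Since . does not match line terminators by default, .*##.* can never span more than one line, which is why no anchors are needed.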
Is there a multiline option in Java? Check the docs. There is one in C# at least, and I think that is the issue.
Take a look at the JavaDoc on the Matcher.matches() method:
boolean java.util.regex.Matcher.matches()
Attempts to match the entire input sequence against the pattern.
If the match succeeds then more information can be obtained via the start, end, and group methods.
Returns:
true if, and only if, the entire input sequence matches this matcher's pattern
Try calling the "matches" method first. This won't actually do the text replacement as noted in your post, but it will get you further.
