How to chomp previous line regex? - java

I have the following input:
-- input --
Keep this
Chomp this
ChompHere:
Anything below gets chomped
And I need the output to look like:
-- output (expected) --
Keep this
Right now I get the following based on the code below:
-- output (actual) --
Keep this
Chomp this
ASK: How can I delete the previous line of a regex match (Chomp this):
public void chompPreviousLine() {
String text = "Keep this\n"
+ "Chomp this\nChompHere:\nAnything below gets chomped";
Pattern CHOMP= Pattern.compile("^(ChompHere:(.*))$", Pattern.MULTILINE | Pattern.DOTALL);
Matcher m = CHOMP.matcher(text);
if (m.find()) {
// chomp everything below and one line above!
text = m.replaceAll("");
// but....??? how to delete the previous line ???
text = text.replaceAll("[\n]+$", ""); // delete any remaining \n
System.out.println(text);
}
}

You can modify the regex so that it also gets the previous line:
Pattern CHOMP= Pattern.compile("[^\n]+\nChompHere:(.*)", Pattern.MULTILINE | Pattern.DOTALL);
[^\n]+\n matches one or more consecutive characters that are not end-of-line characters, followed by the end-of-line itself. Since it comes right before ChompHere in the regex, it matches the complete line before ChompHere.
I have removed the outer parentheses since you don't really use groups in your algorithm; you are replacing the whole matched text anyway.
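A minimal sketch of that pattern in use, run against the input from the question. The class and method names are made up for illustration; the pattern and flags are the ones shown above.

```java
import java.util.regex.Pattern;

class ChompPreviousLine {
    // "[^\n]+\n" consumes the line before the marker; with DOTALL, ".*"
    // then runs across line breaks to the end of the input.
    private static final Pattern CHOMP =
            Pattern.compile("[^\n]+\nChompHere:(.*)", Pattern.MULTILINE | Pattern.DOTALL);

    static String chomp(String text) {
        String cut = CHOMP.matcher(text).replaceAll("");
        // Drop the line break left behind after the last kept line.
        return cut.replaceAll("\n+$", "");
    }

    public static void main(String[] args) {
        String text = "Keep this\n"
                + "Chomp this\nChompHere:\nAnything below gets chomped";
        System.out.println(chomp(text));
    }
}
```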

You could use a positive look-ahead:
^(.*)(?=ChompHere:)
Depending on whether you want the line break matched or not, you have to add it to the lookahead.
But would a simple parser not be easier for this?
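For comparison, here is a rough sketch of the non-regex route. It assumes lines are separated by \n and that the marker line starts with ChompHere:; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class ChompByLines {
    static String chomp(String text) {
        // -1 keeps trailing empty strings so no lines are silently dropped.
        List<String> lines = Arrays.asList(text.split("\n", -1));
        int marker = -1;
        for (int i = 0; i < lines.size(); i++) {
            if (lines.get(i).startsWith("ChompHere:")) {
                marker = i;
                break;
            }
        }
        if (marker < 0) {
            return text; // no marker: leave the input untouched
        }
        // Keep everything before the line that precedes the marker.
        int keep = Math.max(0, marker - 1);
        return String.join("\n", lines.subList(0, keep));
    }
}
```

This trades the compact regex for explicit control over what happens when the marker is missing or sits on the first line.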

Related

Cannot match string with regex pattern when such string is done of multiple lines

I have a string like the following:
SYBASE_OCS=OCS-12_5
SYBASE=/opt/sybase/oc12.5.1-EBF12850
//there is a newline here as well
The string at the debugger appears like this: (screenshot not preserved)
I am trying to match the part coming after SYBASE=, meaning I'm trying to match /opt/sybase/oc12.5.1-EBF12850.
To do that, I've written the following code:
String key = "SYBASE";
Pattern extractorPattern = Pattern.compile("^" + key + "=(.+)$");
Matcher matcher = extractorPattern.matcher(variableDefinition);
if (matcher.find()) {
return matcher.group(1);
}
The problem I'm having is that this two-line string is not matched by my regex, even though the same regex seems to work fine on regex101.
State of my tests:
If I don't have multiple lines (e.g. if I only had SYBASE=... followed by the new line), it would match
If I evaluate the expression extractorPattern.matcher("SYBASE_OCS=OCS-12_5\\nSYBASE=/opt/sybase/oc12.5.1-EBF12850\\n") (note the double backslash in front of the new line), it would match.
I have tried to use variableDefinition.replace("\n", "\\n") to what I give to the matcher(), but it doesn't match.
It seems something simple but I can't get out of it. Can anyone please help?
Note: the string in that format is returned by a shell command, I can't really change the way it gets returned.
By default, the anchors ^ and $ anchor the match to the start and end of the entire input.
In your case you want to match the start and end of a line within the input string. To do this you need to change the behavior of these anchors, which is done with the multiline flag.
Either by specifying it as an argument to Pattern.compile:
Pattern.compile("regex", Pattern.MULTILINE)
Or by using the embedded flag expression (?m):
Pattern.compile("(?m)^" + key + "=(.+)$");
The reason it seemed to work on regex101.com is that it adds both the global and multiline flags by default.
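Putting the flag to work on the string from the question might look like the sketch below. The class and method names are invented for illustration, and Pattern.quote is an extra precaution that is not in the original code (it only matters if the key contains regex metacharacters).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EnvValueExtractor {
    // Returns the text after "key=" on the line that starts with key, or null.
    static String valueOf(String key, String variableDefinition) {
        // MULTILINE makes ^ and $ match at line boundaries inside the input.
        Pattern extractorPattern =
                Pattern.compile("^" + Pattern.quote(key) + "=(.+)$", Pattern.MULTILINE);
        Matcher matcher = extractorPattern.matcher(variableDefinition);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "SYBASE_OCS=OCS-12_5\nSYBASE=/opt/sybase/oc12.5.1-EBF12850\n";
        System.out.println(valueOf("SYBASE", input));
    }
}
```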

Regex expression for multiple patterns in 1 line

I am scraping a log from which I need 3 elements. An added difficulty is that I am parsing the log via readLine() in my Java program, i.e. one line at a time. (If there is a way to read multiple lines when parsing, let me know :) ) NOTE: I have no control over the log output format.
There are 2 possibilities of what I must extract. Either the log is nice and gives the following
NICE FORMAT
.text.rank 0x0000000000400b8f 0x351 is_x86.o
where I must grab .text.rank , 0x0000000000400b8f , and 0x351
Now the not-so-nice case: if the name is too long, everything else is bumped to the next line, as shown below. The only thing after the first element is then a single blank space followed by a newline (\n), which readLine() strips anyway.
EVIL FORMAT : Note each line is in a separate arraylist entry.
.text.__sfmoreglue
0x0000000000401d00 0x55 /mnt/drv2homelibc_popcorn.a(lib_a-findfp.o)
Therefore what the regex actually sees is:
.text.__sfmoreglue
CORNER CASE FORMAT that also occurs within the log but I DO NOT want
*(.text.unlikely)
Finally below is my Pattern line I am currently using for the first line and pline2 is what is used on the next line when group 2 of the first line is empty.
UPDATE: The pattern below works for the NICE FORMAT and EVIL FORMAT But now pattern pline2 has no matches, even though on regex101.com it is correct. Link: https://regex101.com/r/vS7vZ3/9
UPDATE2: I fixed it, I forgot to add m2.find() once I compiled the second line with Pattern pline2. Corrected code is below.
Pattern p = Pattern.compile("^[ \\s](\\.[tex]*\\.[\\._\\-\\#a-zA-Z0-9]*)\\s*([x0-9a-f]*)[ \\s]*([x0-9a-f]*).*");
Pattern pline2 = Pattern.compile("^\\s*([x0-9a-f]*)[ \\s]*([x0-9a-f]*)\\s*[\\w\\(\\)\\.\\-]*");
To give a little background: I first match the name .text.whatever into m.group(1), then the address 0x000012345 into m.group(2), and finally the size 0xa48 into m.group(3). This all assumes the log is in the NICE format. If it is in the EVIL format, group(2) is empty, so I read the next line of the log into a temp buffer and apply the second pattern pline2 to that new line.
Can someone help me with the regex?
Is there a way I can make sure my current line (or even better, just the second grouping) is either the NICE FORMAT or is empty?
As requested my java code:
//1st line pattern
Pattern p = Pattern.compile("^[ \\s](\\.[tex]*\\.[\\._\\-\\#a-zA-Z0-9]*)\\s*([x0-9a-f]*)[ \\s]*([x0-9a-f]*).*");
//conditional 2nd line pattern
Pattern pline2 = Pattern.compile("^\\s*([x0-9a-f]*)[ \\s]*([x0-9a-f]*)\\s*[\\w\\(\\)\\.\\-]*");
while((temp = br1.readLine()) != null){
Matcher m = p.matcher(temp);
while(m.find()){
System.out.println("What regex finds: m1:"+m.group(1)+"# m2:"+m.group(2)+"# m3:"+m.group(3));
if(!m.group(1).isEmpty() && m.group(2).isEmpty() && m.group(3).isEmpty()){
//means we probably hit a long symbol name and important stuff is on the next line
//save the name at least
name = m.group(1);
//read and utilize the next line
if((temp = br1.readLine()) == null){
return;
}
System.out.println("EVILline2:"+temp); //sanity check the input
System.out.println(pline2.toString()); //sanity check the regex
Matcher m2= pline2.matcher(temp);
while(m2.find()){
System.out.println("regex line2 finds: m1:"+m2.group(1));//+"# m2:"+m2.group(2));
if(m2.group(2).isEmpty()){
size = 0;
}else{
size = Long.parseLong(m2.group(2).replaceFirst("0x", ""),16);
}
addr = Long.parseLong(m2.group(1).replaceFirst("0x", ""),16);
System.out.println("#########LONG NAME: "+name+" addr:"+addr+" size:"+size);
}
}//end if
else{ // assume in NICE FORMAT
//do nice format stuff.
}//end else
}//end while
}//end outerwhile
As an aside, the output I currently get:
line: .text.c_print_results
What regex finds: m1:.text.c_print_results# m2:# m3:
EVIL FORMATline2: 0x00000000004001e6 0x231 c_print_results_x86.o
^\s*([x0-9a-f]*)[ \s]*([x0-9a-f]*)\s*[\w\(\)\.\-]*
Exception in thread "main" java.lang.IllegalStateException: No match found
at java.util.regex.Matcher.group(Matcher.java:536)
at java.util.regex.Matcher.group(Matcher.java:496)
at regexTest.regex.grabSymbolsInRange(regex.java:143)
at regexTest.regex.main(regex.java:489)
You have a few issues with your pattern:
The first is the separation between the first and second groups (that's why group 2 is coming back empty).
You have 4 groups and you need only 3.
After capturing your 3 values you can stop matching, so the part of the pattern after the last group isn't necessary.
On regex101 you need the global flag (g) to see all matches. Java has no such modifier, so don't append /g to the pattern string (it would be matched as literal text); call find() in a loop to get successive matches instead.
So, instead of your posted regex, you can try:
(\\.[tex]*\\.[\\._\\-\\#a-zA-Z0-9]*)\\s*([x0-9a-f]*)[ \\s]+([x0-9a-f]*)
Tested on regex101.com (with the g flag set there):
https://regex101.com/r/lM4bQ9/1
Other than that, a few suggestions:
If you know your text is going to start with text, just put it in the pattern. Don't use [tex]*, which makes the engine do extra work.
[ \s] is the same as \s.
[\._\-\#a-zA-Z0-9]*, from what I understood, is basically everything but whitespace, so why not just use [^\s]*?
With these in mind, I would suggest this pattern instead:
(\\.text\\.[^\\s]*)\\s*([x0-9a-f]*)\\s+([x0-9a-f]*)
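A small sketch of how such a pattern might be used from Java for a NICE FORMAT line. The pattern is the suggested one without any trailing /g (Java has no global modifier, and those characters would be matched literally); the class, method names, and return shape are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MapLineParser {
    private static final Pattern NICE =
            Pattern.compile("(\\.text\\.[^\\s]*)\\s*([x0-9a-f]*)\\s+([x0-9a-f]*)");

    // Returns {name, address, size} as they appear in the line, or null if no match.
    static String[] parse(String line) {
        Matcher m = NICE.matcher(line);
        if (!m.find()) {          // group() is only valid after a successful find()
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    // Converts a "0x..." field to a number, as the question's code does.
    static long hex(String field) {
        return Long.parseLong(field.replaceFirst("0x", ""), 16);
    }

    public static void main(String[] args) {
        String[] f = parse(".text.rank 0x0000000000400b8f 0x351 is_x86.o");
        System.out.println(f[0] + " addr:" + hex(f[1]) + " size:" + hex(f[2]));
    }
}
```

Groups 2 and 3 can still come back empty (for example when a long name is followed only by a trailing space), so the caller still has to check for that before parsing the hex fields.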

How to match the beginning of a String in Java?

I want to match an HTML file:
If the file starts with spaces and then an end tag </sometag>, return true.
Else return false.
I used the "(\\s)*</(\\w)*>.*", but it doesn't match \n </p>\n </blockquote> ....
Thanks to Gabe's help. Gabe is correct. The . doesn't match \n by default. I need to set the DOTALL mode on.
To do it, add the (?s) to the beginning of the regex, i.e. (?s)(\\s)*</(\\w)*>.*.
You can also do this:
Pattern p = Pattern.compile("(\\s)*</(\\w)*>");
Matcher m = p.matcher(s);
return m.lookingAt();
It just checks if the string starts with the pattern, rather than checking the whole string matches the pattern.
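To make the difference concrete, here is a short sketch contrasting lookingAt() with matches() on the same pattern. The class and method names are made up; the pattern is the one from the snippet above.

```java
import java.util.regex.Pattern;

class LeadingEndTag {
    private static final Pattern END_TAG = Pattern.compile("(\\s)*</(\\w)*>");

    // lookingAt() anchors at the start but does not require the whole input to match.
    static boolean startsWithEndTag(String s) {
        return END_TAG.matcher(s).lookingAt();
    }

    // matches() requires the entire input to match the pattern.
    static boolean isOnlyAnEndTag(String s) {
        return END_TAG.matcher(s).matches();
    }

    public static void main(String[] args) {
        String html = "\n </p>\n </blockquote> ....";
        System.out.println(startsWithEndTag(html)); // true
        System.out.println(isOnlyAnEndTag(html));   // false
    }
}
```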

Java Regex - Extract link from HTML anchor

I have the following code
private String anchorRegex = "\\<\\s*?a\\s+.*?href\\s*?=\\s*?([^\\s]*?).*?\\>";
private Pattern anchorPattern = Pattern.compile(anchorRegex, Pattern.CASE_INSENSITIVE);
String content = getContentAsString();
Matcher matcher = anchorPattern.matcher(content);
while(matcher.find()) {
System.out.println(matcher.group(1));
}
The call to getContentAsString() returns the HTML content from a web page. The problem I'm having is that the only thing that gets printed in my System.out is a space. Can anyone see what's wrong with my regex?
Regex drives me crazy sometimes.
You need to delimit your capturing group from the following .*?. There are probably double quotes " around the href, so use those:
<\s*a\s+.*?href\s*=\s*"(\S*?)".*?>
Your regex contains:
([^\s]*?).*?
The ([^\s]*?) says to reluctantly find all non-whitespace characters and save them in a group. But the reluctant *? depends on the next part, which is .; any character. So the matching of the href aborts at the first possible chance and it is the .*? which matches the rest of the URL.
The regex you should be using is this:
String anchorRegex = "(?s)<\\s*a\\s+.*?href\\s*=\\s*['\"]([^\\s>]*)['\"]";
This should be able to pull out the href without too much trouble.
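A sketch of that regex in a find() loop, collecting every href it sees. The class and method names are hypothetical, the CASE_INSENSITIVE flag is carried over from the question, and the sample HTML in main is invented. Like any regex over HTML, it assumes reasonably regular markup with quoted attribute values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorHrefs {
    private static final Pattern ANCHOR = Pattern.compile(
            "(?s)<\\s*a\\s+.*?href\\s*=\\s*['\"]([^\\s>]*)['\"]",
            Pattern.CASE_INSENSITIVE);

    static List<String> hrefs(String content) {
        List<String> result = new ArrayList<>();
        Matcher matcher = ANCHOR.matcher(content);
        while (matcher.find()) {
            result.add(matcher.group(1)); // group 1 is the text between the quotes
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/one\">One</a> <a class=\"x\" href='/two'>Two</a>";
        System.out.println(hrefs(html));
    }
}
```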
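The link is in capture group 2. The first version is expanded (so whitespace in the pattern has to be ignored, e.g. with the (?x) flag) and assumes dot-all.
Escape it for a Java string literal as necessary.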
(?s)
<a
(?=\s)
(?:[^>"']|"[^"]*"|'[^']*')*? (?<=\s) href \s*=\s* (['"]) (.*?) \1
(?:".*?"|'.*?'|[^>]*?)+
>
Or, not expanded and not dot-all:
<a(?=\s)(?:[^>"']|"[^"]*"|'[^']*')*?(?<=\s)href\s*=\s*(['"])([\s\S]*?)\1(?:"[\s\S]*?"|'[\s\S]*?'|[^>]*?)+>

Replace the line containing the Regex

I have an input string containing multiple lines (demarcated by \n). I need to search for a pattern in the lines and, if it's found, replace the complete line with an empty string.
My code looks like this,
Pattern p = Pattern.compile("^.*##.*$");
String regex = "This is the first line \n" +
"And this is second line\n" +
"Thus is ##{xyz} should not appear \n" +
"This is 3rd line and should come\n" +
"This will not appear ##{abc}\n" +
"But this will appear\n";
Matcher m = p.matcher(regex);
System.out.println("Output: "+m.group());
I expect the response as :
Output: This is the first line
And this is second line
This is 3rd line and should come
But this will appear.
I am unable to get it to work; please help me out.
Thanks,
Amit
In order to let the ^ match the start of a line and $ match the end of one, you need to enable the multi-line option. You can do that by adding (?m) in front of your regex like this: "(?m)^.*##.*$".
Also, you need to call find() before group(), and keep calling it for as long as your regex finds a match, which can be done like this:
while(m.find()) {
System.out.println("Output: "+m.group());
}
Note the regex will match these lines (not the ones you indicated):
Thus is ##{xyz} should not appear
This will not appear ##{abc}
But if you want to replace the lines that contain ##, as the title of your post suggests, do it like this:
public class Main {
public static void main(String[] args) {
String text = "This is the first line \n" +
"And this is second line\n" +
"Thus is ##{xyz} should not appear \n" +
"This is 3rd line and should come\n" +
"This will not appear ##{abc}\n" +
"But this will appear\n";
System.out.println(text.replaceAll("(?m)^.*##.*$(\r?\n|\r)?", ""));
}
}
Edit: accounted for *nix, Windows and Mac line breaks as mentioned by PSeed.
Others mention turning on multiline mode but since Java does not default to DOTALL (single line mode) there is an easier way... just leave the ^ and $ off.
String result = regex.replaceAll( ".*##.*", "" );
Note that the issue with either this or using:
"(?m)^.*##.*$"
...is that it will leave the blank lines in. If it is a requirement to not have them then the regex will be different.
Full regex that does not leave blank lines:
String result = regex.replaceAll( ".*##.*(\r?\n|\r)?", "" );
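The difference between the two replacements is easiest to see side by side. This is a small sketch with made-up class and method names and a shortened input; the two regexes are the ones shown above.

```java
class DropTaggedLines {
    // Removes the text of matching lines but leaves their line breaks behind.
    static String leavingBlankLines(String text) {
        return text.replaceAll(".*##.*", "");
    }

    // Also consumes the line break that follows a matching line, if there is one.
    static String withoutBlankLines(String text) {
        return text.replaceAll(".*##.*(\r?\n|\r)?", "");
    }

    public static void main(String[] args) {
        String text = "keep 1\ndrop ##{x}\nkeep 2\n";
        System.out.print(leavingBlankLines(text)); // keep 1, an empty line, keep 2
        System.out.print(withoutBlankLines(text)); // keep 1, keep 2
    }
}
```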
Is there a multiline option in Java? Check the docs. There is one in C# at least, and I think that is the issue.
Take a look at the JavaDoc on the Matcher.matches() method:
boolean java.util.regex.Matcher.matches()
Attempts to match the entire input sequence against the pattern.
If the match succeeds then more information can be obtained via the start, end, and group methods.
Returns:
true if, and only if, the entire input sequence matches this matcher's pattern
Try calling the "matches" method first. This won't actually do the text replacement as noted in your post, but it will get you further.
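As a hedged illustration of why a match attempt has to come first: Matcher.group() throws IllegalStateException when no match has been attempted, and matches() only succeeds if the whole input fits the pattern, which a multi-line string does not here because . stops at line breaks. The class and method names below are invented for the sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNeedsMatch {
    // Calling group() on a fresh matcher fails; this reports whether it threw.
    static boolean groupWithoutFindThrows(String text) {
        Matcher m = Pattern.compile("(?m)^.*##.*$").matcher(text);
        try {
            m.group();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    // matches() tests the entire input, not a single line.
    static boolean wholeInputMatches(String text) {
        return Pattern.compile("^.*##.*$").matcher(text).matches();
    }

    // find() locates the first matching line, after which group() is safe.
    static String firstTaggedLine(String text) {
        Matcher m = Pattern.compile("(?m)^.*##.*$").matcher(text);
        return m.find() ? m.group() : null;
    }
}
```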
