How to spot a regex that contains escape characters? - java

I have the following text (string):
System.out.println(text)
..............
BLOOMINGTON, IL 61710
Page 4 of 5
8/2/2009file://C:\hjO Fhjes\hShjort_2012w211231_0323212_575.htm
Location: EAST JEFRYN, NY
..............
I need to get rid of any substring that starts with the word "Page" and ends with ".htm"
I tried the following:
Pattern patternP = Pattern.compile("(?:Page.*?)(\\n+)+htm", Pattern.DOTALL);
Matcher matcherP = patternP.matcher(filtered);
matcherP.find();
String page = matcherP.group();
text = text.replace(page, "");
But this does not remove anything, I think because of the escape characters. How can I improve it?

Your regex doesn't allow for any content between the \n and the htm. You might want to change it to
"(?:Page.*?)(\n+).+htm"
Note that I only used one \ to escape the newline. That's because \n is a Java escape sequence; you only need two \ for regex escape sequences like \\d.
*You might need to make sure that your regex implementation supports newlines like that.
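The two escaping levels can be checked directly. The sketch below is only an illustration (the class and method names are made up): it shows that a real newline character in the pattern and the regex escape \n both match a line break, while \d exists only at the regex level and therefore needs a doubled backslash in Java source.

```java
import java.util.regex.Pattern;

class NewlineEscapeDemo {
    // Returns true if the given regex finds a match anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "Page 4 of 5\n8/2/2009";
        // "\n" is resolved by the Java compiler: the regex contains a real newline character.
        System.out.println(found("5\n8", input));            // true
        // "\\n" reaches the regex engine as \n, which the engine itself reads as a newline.
        System.out.println(found("5\\n8", input));           // true
        // \d is not a Java string escape, so it has to be written as "\\d".
        System.out.println(found("\\d/\\d/\\d{4}", input));  // true
    }
}
```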

No, it is because your regex is wrong. Try this regex for your match:
Pattern.compile("Page(.+?)\\.htm", Pattern.DOTALL);
You can just call String#replaceFirst to do this in one call:
String repl = filtered.replaceFirst("(?s)Page(.+?)\\.htm", "");
Where (?s) acts as Pattern.DOTALL
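For reference, here is a small self-contained sketch of the replaceFirst approach applied to the sample text from the question. The class and method names are hypothetical, and the sample string is retyped by hand, so treat it as an illustration rather than a drop-in fix.

```java
class PageFooterStripper {
    // (?s) lets . match line breaks; .+? stops at the first ".htm" after "Page".
    static String strip(String text) {
        return text.replaceFirst("(?s)Page.+?\\.htm", "");
    }

    public static void main(String[] args) {
        String text = "BLOOMINGTON, IL 61710\n"
                + "Page 4 of 5\n"
                + "8/2/2009file://C:\\hjO Fhjes\\hShjort_2012w211231_0323212_575.htm\n"
                + "Location: EAST JEFRYN, NY";
        // Prints the first and last line with an empty line in between,
        // because the line breaks around the removed block are kept.
        System.out.println(strip(text));
    }
}
```

If the text can contain several such blocks, replaceAll with the same pattern would remove all of them instead of only the first.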

Related

How to split by nonescaped dot and by ignoring double backslash?

I need to split data on dots. Escaped dots (\.) should be ignored, and escaped backslashes (\\) have to be taken into account as well, so a dot preceded by an even number of backslashes still counts as a separator.
For example,
data1\\.d\\\\\.ata2\\\\.da\.ta3.data4
This string should be split into four substrings:
data1\\
d\\\\\.ata2\\\\
da\.ta3
data4
I cannot create a regex for that. Do you know whether it is possible?
I tried the following:
(?<!\\((\\\\){2,}))\\. - not working
I can create the following regex if the escaped backslash occurs only once:
"((?<!\\\\)\\.)|((?=([^\\\\]*((\\\\\\\\)+[^\\\\]*)))\\.)";
For example, data1\\.d\.ata2.da\.ta3.data4 is split correctly:
data1\\
d\.ata2
da\.ta3
data4
But I cannot handle a backslash that is repeated an even number of times.
Can you help me, please?
Thank you!
You may extract these strings using
(?s)(?:[^\\.]|\\.)+
Details:
(?s) - enable the Pattern.DOTALL flag so that . could match across lines
(?:[^\\.]|\\.)+ - one or more occurrences of any char other than \ and ., or a \ followed with any char.
See a Java demo:
String line = "data1\\\\.d\\.ata2.da\\.ta3.data4";
Pattern p = Pattern.compile("(?s)(?:[^\\\\.]|\\\\.)+");
Matcher m = p.matcher(line);
List<String> res = new ArrayList<>();
while (m.find()) {
    res.add(m.group());
}
System.out.println(res);
// => [data1\\, d\.ata2, da\.ta3, data4]
You may use this regex to get your matches:
(?=[^.])[^.\\]*(?:\\.[^.\\]*)*(?=\.|$)
RegEx details:
(?=[^.]): Make sure there is a non-dot character ahead
[^.\\]*: Match 0 or more of any character that is neither a . nor a \
(?:\\.[^.\\]*)*: A non-capturing group that matches a backslash followed by any character (the escaped character), followed by 0 or more characters that are neither . nor \. Match 0 or more repetitions of this group
(?=\.|$): Make sure we have a dot or end of line ahead
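In Java source every backslash of this pattern has to be doubled once more. The following is a minimal sketch of how it could be used on the harder example from the question; the class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnescapedDotSplitter {
    // Regex-level pattern: (?=[^.])[^.\\]*(?:\\.[^.\\]*)*(?=\.|$)
    private static final Pattern SEGMENT =
            Pattern.compile("(?=[^.])[^.\\\\]*(?:\\\\.[^.\\\\]*)*(?=\\.|$)");

    // Collects every segment that lies between unescaped dots.
    static List<String> split(String line) {
        List<String> parts = new ArrayList<>();
        Matcher m = SEGMENT.matcher(line);
        while (m.find()) {
            parts.add(m.group());
        }
        return parts;
    }

    public static void main(String[] args) {
        // Runtime value: data1\\.d\\\\\.ata2\\\\.da\.ta3.data4
        String line = "data1\\\\.d\\\\\\\\\\.ata2\\\\\\\\.da\\.ta3.data4";
        System.out.println(split(line));
        // => [data1\\, d\\\\\.ata2\\\\, da\.ta3, data4]
    }
}
```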

Regex to match \a574322 in Java

I have a long string looking like this: \c53\e59\c9\e28\c20140326\a4095\c8\c15\a546\c11 and I need to find expressions that start with \a and are followed by digits. For example: \a574322
And I have no idea how to build it. I can't use:
Pattern p = Pattern.compile("\\a\\d*");
because \a is a special character in regex.
When I try to group it like this:
Pattern p = Pattern.compile("(\\)(a)(\\d)*");
I get an unclosed group error even though there is an even number of brackets.
Can you help me with this?
Thank you all very much for solution.
You can use this regex:
\\\\a\\d+
This is because in Java you need to escape the \ twice: once for the String literal and a second time for the regex engine.
You have to change your regex to:
Pattern p = Pattern.compile("(\\\\a\\d+)");
The regex is:
(\\a\d+)
The idea is to escape a backslash and then also escape the backslash for \a, and match digits too.
You need 4 \.
Two tell the regex engine that this is a plain \ rather than a special character, and each of those has to be doubled again so that the Java String does not treat it as an escape either. So you need to represent it in code this way:
"\\\\a\\d*"
Which is actually the regex \\a\d*
\\(a)[0-9]+ should work (written as "\\\\(a)[0-9]+" in a Java string literal).
You can try your regexes on this page or a similar one:
http://regex101.com/
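Putting the answers together, a minimal sketch that collects every \a tag from the sample string might look like this. The class and method names are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackslashTagFinder {
    // The Java literal "\\\\a\\d+" is the regex \\a\d+ : a literal backslash, 'a', then digits.
    private static final Pattern TAG = Pattern.compile("\\\\a\\d+");

    static List<String> findTags(String input) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(input);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        // In the source each backslash of the data is doubled; at runtime the string holds single ones.
        String input = "\\c53\\e59\\c9\\e28\\c20140326\\a4095\\c8\\c15\\a546\\c11";
        System.out.println(findTags(input)); // [\a4095, \a546]
    }
}
```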

Removing repeated characters in String

I have strings like "aaaabbbccccaaddddcfggghhhh" and I want to remove repeated characters to get a string like "abcadcfgh".
A simplistic implementation for this would be:
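StringBuilder str2 = new StringBuilder();
char prevChar = '\0'; // sentinel: assumes str does not start with the NUL character
for (char c : str.toCharArray()) {
    if (c != prevChar) {
        // first character of a new run: keep it
        str2.append(c);
        prevChar = c;
    }
}
return str2.toString();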
Is it possible to have a better implementation, maybe using regex?
You can do this:
"aaaabbbccccaaddddcfggghhhh".replaceAll("(.)\\1+","$1");
The regex uses backreference and capturing groups.
The normal regex is (.)\1+ but you have to escape the backslash with another backslash in Java.
If you want the number of repeated characters that were removed:
String test = "aaaabbbccccaaddddcfggghhhh";
System.out.println(test.length() - test.replaceAll("(.)\\1+","$1").length());
With regex, you can replace (.)\1+ with the replacement string $1.
You can use Java's String.replaceAll() method to simply do this with a regular expression.
String s = "aaaabbbccccaaddddcfggghhhh";
System.out.println(s.replaceAll("(.)\\1{1,}", "$1")); //=> "abcadcfgh"
Regular expression:
(         group and capture to \1:
  .       any character except \n
)         end of \1
\1{1,}    what was matched by capture \1 (at least 1 time)
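Use the pattern /(.)(?=\1)/g and replace each match with nothing. In Java, replaceAll already applies the pattern globally, so the slashes and the g flag are dropped.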
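To compare the two regex approaches side by side, here is a small sketch with hypothetical class and method names. Both variants should give the same result for the sample string.

```java
class DuplicateRunCollapser {
    // Backreference form: replace each run of two or more equal characters with one copy.
    static String collapseWithBackreference(String s) {
        return s.replaceAll("(.)\\1+", "$1");
    }

    // Lookahead form: delete every character that is immediately followed by itself.
    static String collapseWithLookahead(String s) {
        return s.replaceAll("(.)(?=\\1)", "");
    }

    public static void main(String[] args) {
        String s = "aaaabbbccccaaddddcfggghhhh";
        System.out.println(collapseWithBackreference(s)); // abcadcfgh
        System.out.println(collapseWithLookahead(s));     // abcadcfgh
    }
}
```

Note that . does not match line terminators by default, so runs of repeated newlines are left alone by both variants.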

How to escape characters in a regular expression

When I use the following code I get an error:
Matcher matcher = pattern.matcher("/Date\(\d+\)/");
The error is :
invalid escape sequence (valid ones are \b \t \n \f \r \" \' \\ )
I have also tried to change the value in the brackets to ('/Date\(\d+\)/'); without any success.
How can I avoid this error?
You need to double-escape your \ character, like this: \\.
Otherwise your String is interpreted as if you were trying to escape (.
Same with the other round bracket and the d.
In fact it seems you are trying to initialize a Pattern here, while pattern.matcher references a text you want your Pattern to match.
Finally, note that in a Pattern, escaped characters require a double escape, as such:
\\(\\d+\\)
Also, as Rohit says, Patterns in Java do not need to be surrounded by forward slashes (/).
In fact if you initialize a Pattern like that, it will interpret your Pattern as starting and ending with literal forward slashes.
Here's a small example of what you probably want to do:
// your input text
String myText = "Date(123)";
// your Pattern initialization
Pattern p = Pattern.compile("Date\\(\\d+\\)");
// your matcher initialization
Matcher m = p.matcher(myText);
// printing the output of the match...
System.out.println(m.find());
Output:
true
Your regex is correct by itself, but in Java, the backslash character itself needs to be escaped.
Thus, this regex:
/Date\(\d+\)/
Must turn into this:
/Date\\(\\d+\\)/
One backslash is for escaping the parenthesis or d. The other one is for escaping the backslash itself.
The error message you are getting arises because Java thinks you're trying to use \( as a single escape character, like \n, or any of the other examples. However, \( is not a valid escape sequence, and so Java complains.
In addition, the logic of your code is probably incorrect. The argument to matcher should be the text to search (for example, "/Date(234)/Date(6578)/"), whereas the variable pattern should contain the pattern itself. Try this:
String textToMatch = "/Date(234)/Date(6578)/";
Pattern pattern = Pattern.compile("/Date\\(\\d+\\)/");
Matcher matcher = pattern.matcher(textToMatch);
Finally, the regex character class \d means "one single digit." If you are trying to match the literal two-character text \d, the regex would be \\d, which you would have to write as "\\\\d" in a Java string literal. However, in that case your pattern would be a constant string, and you could use textToMatch.indexOf or textToMatch.contains more easily.
To escape regex metacharacters in Java, you can also use Pattern.quote().
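A short sketch of how Pattern.quote() could be used here. The class and helper method are made up for the example, and the input string is the one from the answer above. Quoting only suits the fixed parts of a pattern, so \d+ is left unquoted.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // Searches for the given text literally; no regex metacharacters are interpreted.
    static boolean containsLiteral(String haystack, String literal) {
        return Pattern.compile(Pattern.quote(literal)).matcher(haystack).find();
    }

    public static void main(String[] args) {
        String text = "/Date(234)/Date(6578)/";
        // Quoted: the parentheses are matched as plain characters.
        System.out.println(containsLiteral(text, "Date(234)")); // true
        // Unquoted: (234) is a capturing group, so the regex looks for "Date234".
        System.out.println(Pattern.compile("Date(234)").matcher(text).find()); // false
        // Mixed: quote the fixed parts and keep \d+ as a real regex construct.
        Pattern mixed = Pattern.compile(Pattern.quote("/Date(") + "\\d+" + Pattern.quote(")/"));
        System.out.println(mixed.matcher(text).find()); // true
    }
}
```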

Java Regex lookahead takes too much time

I'm trying to create a proper regex for my problem and apparently ran into a weird issue.
Let me describe what I'm trying to do.
My goal is to remove commas from both ends of the string. E.g., the string ", ,, ,,, , , Hello, my lovely, world, ,, ," should become just "Hello, my lovely, world".
I have prepared the following regex to accomplish this:
(\w+,*? *?)+(?=(,?\W+$))
It works like a charm in regex validators, but when I run it on an Android device, matcher.find() hangs for about a minute before it finds a match.
I assume the problem is the positive lookahead I'm using, but I couldn't find any better solution than trimming the commas separately at the beginning and at the end:
output = input.replaceAll("^(,?\\W?)+", ""); //replace commas at the beginning
output = output.replaceAll("(,?\\W?)+$", ""); //replace commas at the end
Is there something I am missing about positive lookahead in Java regex? How can I retrieve the section of the string between the commas at the beginning and at the end?
You don't have to use a lookahead if you use matching groups. Try regex ^[\s,]*(.+?)[\s,]*$:
EDIT: To break it apart, ^ matches the beginning of the line, which is technically redundant if using matches() but may be useful elsewhere. [\s,]* matches zero or more whitespace characters or commas, but greedily--it will accept as many characters as possible. (.+?) matches any string of characters, but the trailing question mark instructs it to match as few characters as possible (non-greedy), and also capture the contents to "group 1" as it forms the first set of parentheses. The non-greedy match allows the final group to contain the same zero-or-more commas or whitespaces ([\s,]*). Like the ^, the final $ matches the end of the line--useful for find() but redundant for matches().
If you need it to match spaces only, replace [\s,] with [ ,].
This should work:
Pattern pattern = Pattern.compile("^[\\s,]*(.+?)[\\s,]*$");
Matcher matcher = pattern.matcher(", ,, ,,, , , Hello, my lovely, world, ,, ,");
if (!matcher.matches())
    return null;
return matcher.group(1); // "Hello, my lovely, world"
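Here is a self-contained sketch of the capturing-group approach, next to a one-pass replaceAll alternative that is not from the answer above. Class and method names are hypothetical. Neither pattern repeats a group that itself contains quantifiers, unlike (\w+,*? *?)+ in the question, which is the likely cause of the slow matching there.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommaTrimmer {
    private static final Pattern CORE = Pattern.compile("^[\\s,]*(.+?)[\\s,]*$");

    // Capturing-group approach from the answer above; returns null when nothing matches.
    static String trimWithGroup(String input) {
        Matcher matcher = CORE.matcher(input);
        return matcher.matches() ? matcher.group(1) : null;
    }

    // Alternative: delete a leading or trailing run of whitespace and commas.
    static String trimWithReplace(String input) {
        return input.replaceAll("^[\\s,]+|[\\s,]+$", "");
    }

    public static void main(String[] args) {
        String input = ", ,, ,,, , , Hello, my lovely, world, ,, ,";
        System.out.println(trimWithGroup(input));   // Hello, my lovely, world
        System.out.println(trimWithReplace(input)); // Hello, my lovely, world
    }
}
```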
