How to split by non-escaped dot while ignoring double backslash?

I need to split data by dot, but I have escaped dots (\.) that I should ignore. I should also ignore escaped backslashes (\\).
For example,
data1\\.d\\\\\.ata2\\\\.da\.ta3.data4
This string should be split into four substrings:
data1\\
d\\\\\.ata2\\\\
da\.ta3
data4
I cannot create a regex for that. Do you know whether it is possible?
I tried the following:
(?<!\\((\\\\){2,}))\\. - not working
I can create the following regex if the escaped backslash appears only once:
"((?<!\\\\)\\.)|((?=([^\\\\]*((\\\\\\\\)+[^\\\\]*)))\\.)";
For example, data1\\.d\.ata2.da\.ta3.data4 is split correctly:
data1\\
d\.ata2
da\.ta3
data4
But I cannot handle a backslash that is repeated an even number of times.
Can you help me, please?
Thank you!

You may extract these strings using
(?s)(?:[^\\.]|\\.)+
See the regex demo. Details:
(?s) - enable the Pattern.DOTALL flag so that . also matches line terminators
(?:[^\\.]|\\.)+ - one or more occurrences of any char other than \ and ., or a \ followed with any char.
See a Java demo:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String line = "data1\\\\.d\\.ata2.da\\.ta3.data4";
Pattern p = Pattern.compile("(?s)(?:[^\\\\.]|\\\\.)+");
Matcher m = p.matcher(line);
List<String> res = new ArrayList<>();
while (m.find()) {
    res.add(m.group());
}
System.out.println(res);
// => [data1\\, d\.ata2, da\.ta3, data4]
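The demo above uses the simpler input. As a minimal sketch, the same pattern can be run against the full example from the question, the one with runs of four and five backslashes. The class name EscapedDotTokens and the method tokens are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedDotTokens {
    // Same pattern as in the answer: runs of ordinary chars or backslash-escaped pairs.
    private static final Pattern TOKEN = Pattern.compile("(?s)(?:[^\\\\.]|\\\\.)+");

    // Hypothetical helper: collects every match as one token.
    static List<String> tokens(String input) {
        List<String> res = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            res.add(m.group());
        }
        return res;
    }

    public static void main(String[] args) {
        // Raw text: data1\\.d\\\\\.ata2\\\\.da\.ta3.data4
        String line = "data1\\\\.d\\\\\\\\\\.ata2\\\\\\\\.da\\.ta3.data4";
        for (String t : tokens(line)) {
            System.out.println(t);
        }
        // Prints four lines:
        // data1\\
        // d\\\\\.ata2\\\\
        // da\.ta3
        // data4
    }
}
```

Because this approach matches tokens instead of splitting, two adjacent unescaped dots produce no empty token, and a lone backslash at the very end of the input is left out of the last token.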

You may use this regex to get your matches:
(?=[^.])[^.\\]*(?:\\.[^.\\]*)*(?=\.|$)
RegEx Demo
RegEx Details:
(?=[^.]): Make sure there is a non-dot character ahead
[^.\\]*: Match 0+ of any character that is neither a . nor a \
(?:\\.[^.\\]*)*: A non-capturing group that matches a backslash followed by the escaped character, followed by 0 or more of any character that is neither a . nor a \. Match 0 or more of this group
(?=\.|$): Make sure we have a dot or the end of the line ahead
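This answer gives the regex without Java code. A minimal sketch of how it might be used with Pattern and Matcher follows. Every backslash of the regex is doubled for the Java string literal, and the class name EscapedDotLookahead and the method split are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedDotLookahead {
    // Regex from the answer: (?=[^.])[^.\\]*(?:\\.[^.\\]*)*(?=\.|$)
    private static final Pattern PART =
            Pattern.compile("(?=[^.])[^.\\\\]*(?:\\\\.[^.\\\\]*)*(?=\\.|$)");

    // Hypothetical helper: returns the parts between unescaped dots.
    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        Matcher m = PART.matcher(input);
        while (m.find()) {
            parts.add(m.group());
        }
        return parts;
    }

    public static void main(String[] args) {
        // Raw text: data1\\.d\\\\\.ata2\\\\.da\.ta3.data4
        String line = "data1\\\\.d\\\\\\\\\\.ata2\\\\\\\\.da\\.ta3.data4";
        split(line).forEach(System.out::println);
        // Prints four lines:
        // data1\\
        // d\\\\\.ata2\\\\
        // da\.ta3
        // data4
    }
}
```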

Related

How to extract and replace a String with specific format?

I have input String like;
(rm01ADS21212, 'adfffddd', rmAdssssss, '1231232131', rm2321312322)
What I want to do is find all words starting with "rm" and replace them with remove function.
(remove(01ADS21212), 'adfffddd', remove(Adssssss), '1231232131', remove(2321312322))
I am trying to use replaceAll function but I don't know how to extract parts after "rm" literal.
statement.replaceAll("\\(rm*.,", "remove($1)");
Is there any way to get these parts?

You have not captured any substring with a capturing group, so $1 has nothing to refer to.
You may use
.replaceAll("\\brm(\\w*)", "remove($1)")
See the regex demo
Details
\b - a word boundary (to start matching only at the start of a word)
rm - a literal part
(\w*) - Group 1: 0+ word chars (letters, digits or underscores)
The $1 in the replacement pattern stands for Group 1 value.
If you mean to match any chars other than a comma and whitespace after rm, use "\\brm([^\\s,]*)", see this regex demo.
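As a minimal sketch, the first replacement can be applied to the input from the question like this. The class name RmToRemove and the method rewrite are hypothetical.

```java
class RmToRemove {
    // Wraps whatever follows a word-initial "rm" in remove(...).
    static String rewrite(String statement) {
        // $1 refers to the text captured by (\w*), i.e. the part after "rm".
        return statement.replaceAll("\\brm(\\w*)", "remove($1)");
    }

    public static void main(String[] args) {
        String input = "(rm01ADS21212, 'adfffddd', rmAdssssss, '1231232131', rm2321312322)";
        System.out.println(rewrite(input));
        // (remove(01ADS21212), 'adfffddd', remove(Adssssss), '1231232131', remove(2321312322))
    }
}
```

The word boundary keeps an "rm" inside a longer word (for example "form") from being rewritten.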

Use "Replace" with an empty string.
E.g.:
string str = "(rm01ADS21212, 'adfffddd', rmAdssssss, '1231232131', rm2321312322)";
Console.WriteLine(str.Replace("rm", ""));
Output : (01ADS21212, 'adfffddd', Adssssss, '1231232131', 2321312322)

Erase any string that doesn't match a pattern using replaceall()

I need to replace ALL characters that don't follow a pattern with "".
I have strings like:
MCC-QX-1081
TEF-CO-QX-4949
SPARE-QX-4500
So far the closest I am using the following regex.
String regex = "[^QX,-,\\d]";
Using the replaceAll String method I get QX1081 and the expected result is QX-1081

You're using a character class which matches single characters, not patterns.
You want something like
String resultString = subjectString.replaceAll("^.*?(QX-\\d+)?$", "$1");
which works as long as nothing follows the QX-digits part in your strings.

Put the dash at the end of the character class: [^QX,\d-]
Then you just have to take a substring to drop the leading dash.
I don't know exactly what you expect for all strings, but to match a literal dash in a character class you must escape it or place it first or last.

You are using a character class where you have to either escape the hyphen or put it at the start or at the end like [^QX,\d-] or else you are matching a range from a comma to a comma. But changing that will give you -QX-1081 which is not the desired result.
You could match your pattern and then replace with the first capturing group $1:
^(?:[A-Z]+-)+(QX-\d+)$
In a Java string literal you have to double the backslash to match a digit: \\d
That will match:
^ Start of the string
(?:[A-Z]+-)+ Repeat 1+ times one or more uppercase characters followed by a hyphen
(QX-\d+) Capture in a group QX- followed by 1+ digits
$ End of the string
For example:
String result = "MCC-QX-1081".replaceAll("^(?:[A-Z]+-)+(QX-\\d+)$", "$1");
System.out.println(result); // QX-1081
See the Regex demo | Java demo
Note that if you are doing just 1 replacement, you could also use replaceFirst
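A minimal sketch that runs the same replacement over all three sample strings from the question. The class name QxExtractor and the method keepQx are hypothetical.

```java
class QxExtractor {
    // Keeps only the trailing QX-<digits> part when the whole string has the expected shape.
    static String keepQx(String s) {
        return s.replaceAll("^(?:[A-Z]+-)+(QX-\\d+)$", "$1");
    }

    public static void main(String[] args) {
        String[] samples = {"MCC-QX-1081", "TEF-CO-QX-4949", "SPARE-QX-4500"};
        for (String s : samples) {
            System.out.println(keepQx(s));
        }
        // QX-1081
        // QX-4949
        // QX-4500
    }
}
```

Because the pattern is anchored with ^ and $, a string that does not have this shape (for example "ABC-1234") is returned unchanged, not emptied.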

Match starting and ending character using Java Matcher class

I want to get words from a string that start with # and end with a space. I've tried using Pattern.compile("#\\s*(\\w+)"), but it doesn't include characters like ' or :.
I want the solution with only Pattern Matching method.

We can try matching using the pattern (?<=\\s|^)#\\S+, which would match any word starting with #, followed by any number of non whitespace characters.
String line = "Here is a #hashtag and here is #another has tag.";
String pattern = "(?<=\\s|^)#\\S+";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(line);
while (m.find()) {
    System.out.println(m.group(0));
}
This prints:
#hashtag
#another
Note: The above solution has an edge case: it also pulls in punctuation that appears at the end of a hashtag. If you don't want this, the regex can be rephrased to match only certain allowed characters, e.g. letters and numbers. But maybe this is not a concern for you.
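The edge case from the note can be sketched as follows. The stricter pattern, which uses \w+ in place of \S+, is an assumption about what "letters and numbers" could look like (it also allows underscores), and the names HashtagFinder, LOOSE, STRICT and find are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashtagFinder {
    // Pattern from the answer: everything up to the next whitespace.
    static final Pattern LOOSE = Pattern.compile("(?<=\\s|^)#\\S+");
    // Assumed stricter variant: word characters only, so trailing punctuation is dropped.
    static final Pattern STRICT = Pattern.compile("(?<=\\s|^)#\\w+");

    static List<String> find(Pattern p, String text) {
        List<String> tags = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        String line = "Is this a #hashtag? Yes, #another.";
        System.out.println(find(LOOSE, line));  // [#hashtag?, #another.]
        System.out.println(find(STRICT, line)); // [#hashtag, #another]
    }
}
```

The stricter variant would also cut a hashtag at ' or :, so it only fits if those characters are not wanted.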

The opposite of \s is \S, so you can use a regex like this:
#\s*(\S+)
Or for Java:
Pattern.compile("#\\s*(\\S+)")
It will capture anything that is not a white space.
See demo here.
If you want to stop on the space character and not any white space change the \S to [^ ].
The ^ inside the brackets means it will negate whatever comes after it.
Pattern.compile("#\\s*([^ ]+)")
See demo here.
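A minimal sketch comparing the two patterns. It reads capture group 1, so the leading # is not part of the result. The class name HashCapture and the method capture are hypothetical, as are the sample inputs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashCapture {
    // Collects group 1 of every match of the given regex.
    static List<String> capture(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // \S+ keeps ' and : but stops at any whitespace.
        System.out.println(capture("#\\s*(\\S+)", "#it's #time:now end")); // [it's, time:now]

        // With a tab inside the text, the two variants differ:
        String tabbed = "#tag\tnext";
        System.out.println(capture("#\\s*(\\S+)", tabbed).get(0).equals("tag"));         // true
        System.out.println(capture("#\\s*([^ ]+)", tabbed).get(0).equals("tag\tnext"));  // true
    }
}
```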

Regular expression to match escaped sequences in java

I am looking for regex to check for all escape sequences in java
\b backspace
\t horizontal tab
\n linefeed
\f form feed
\r carriage return
\" double quote
\' single quote
\\ backslash
How do I write a regex and perform validation to allow words / textarea / strings / sentences containing valid escape sequences?

This regex will match all the escape sequences you listed:
\\[btnfr"'\\]
In a Java string literal you need to double each backslash, so the code becomes:
Pattern p = Pattern.compile("\\\\[btnfr\\\"\\'\\\\]");
if (p.matcher("\\b backspace").find()) {
    System.out.println("Contains escape sequence");
}
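The code above only detects whether an escape sequence occurs. For the validation part of the question, one possible sketch is to require that every backslash in the text starts one of the listed sequences. This is an assumption about what "valid" should mean, and the class name EscapeValidator and the method isValid are hypothetical.

```java
import java.util.regex.Pattern;

class EscapeValidator {
    // Each unit is either a non-backslash char or a backslash followed by one of b t n f r " ' \
    private static final Pattern VALID =
            Pattern.compile("(?:[^\\\\]|\\\\[btnfr\"'\\\\])*");

    // True when the whole text consists only of such units.
    static boolean isValid(String text) {
        return VALID.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("tab\\there"));    // true:  raw text contains \t
        System.out.println(isValid("bad\\xescape"));  // false: \x is not in the list
        System.out.println(isValid("dangling\\"));    // false: backslash with nothing after it
    }
}
```

The inputs here contain a literal backslash followed by a letter, as they would arrive from a text area, not the control characters that the Java compiler produces from "\t".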

The following regex should meet your need:
Pattern pattern = Pattern.compile("\\\\[\\\\btnfr\'\"]");
as in
Pattern pattern = Pattern.compile("\\\\[\\\\btnfr\'\"]");
String[] strings = new String[]{"\\b","\\t","\\n","\\f","\\r","\\\'","\\\"", "\\\\"};
for (String s : strings) {
    System.out.println(s + " - " + pattern.matcher(s).matches());
}
To match a single \, you have to write four backslashes in a Java regex string.
In a Java string literal, "\\" stands for a single \.
If you pass "\\" as a regex, the regex engine sees a lone \, which is a special character that must be followed by another character to form an escape sequence.
So we need "\\\\", which reaches the regex engine as \\ and matches a single \, i.e. the string "\\".
EDIT: There is no need to escape the single quote in the regex string. So "\\\\[\\\\btnfr\'\"]" can be replaced with "\\\\[\\\\btnfr'\"]".
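The two levels of escaping can be checked directly. This is a minimal sketch in which the class name BackslashLevels and the method isSingleBackslash are hypothetical.

```java
import java.util.regex.Pattern;

class BackslashLevels {
    // The literal "\\\\" is the two-character regex \\ , which matches one backslash.
    static boolean isSingleBackslash(String s) {
        return Pattern.matches("\\\\", s);
    }

    public static void main(String[] args) {
        String oneBackslash = "\\";   // Java literal for a single backslash character
        String regexSource = "\\\\";  // two characters once compiled by javac

        System.out.println(oneBackslash.length());            // 1
        System.out.println(regexSource.length());             // 2
        System.out.println(isSingleBackslash(oneBackslash));  // true

        // Pattern.quote builds a literal pattern without manual escaping.
        System.out.println(Pattern.matches(Pattern.quote(oneBackslash), oneBackslash)); // true
    }
}
```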

You'll need DOTALL only if you rely on . to match line terminators. You might also find \s handy, as it represents all whitespace, line terminators included. E.g.:
Pattern p = Pattern.compile("([\\s\"'\\\\]+)", Pattern.DOTALL);
Matcher m = p.matcher("foo '\r\n\t bar");
assertTrue(m.find());
assertEquals(" '\r\n\t ", m.group(1));

How to spot a regex that contains escape characters?

I have the following text (string):
System.out.println(text)
..............
BLOOMINGTON, IL 61710
Page 4 of 5
8/2/2009file://C:\hjO Fhjes\hShjort_2012w211231_0323212_575.htm
Location: EAST JEFRYN, NY
..............
I need to get rid of any substring that starts with the word "Page" and ends with ".htm"
I tried the following:
Pattern patternP = Pattern.compile("(?:Page.*?)(\\n+)+htm", Pattern.DOTALL);
Matcher matcherP = patternP.matcher(filtered);
matcherP.find();
String page = matcherP.group();
text = text.replace(page, "");
But this does not filter, I think because of the escape characters. How can I improve it?

Your regex doesn't allow for any of the content between the \n and the htm. You might want to change it to
"(?:Page.*?)(\n+).+htm"
Note that I used only one \ to escape the newline. That's because \n is a Java escape sequence; you only need two \ for regex escape sequences like \\d.
*You might need to make sure that your regex implementation supports newlines like that.

No, it is because your regex is wrong. Try this regex for your match:
Pattern.compile("Page(.+?)\\.htm", Pattern.DOTALL);
You can just call String#replaceFirst to do this in one call:
String repl = filtered.replaceFirst("(?s)Page(.+?)\\.htm", "");
Where (?s) acts as Pattern.DOTALL
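A minimal sketch of the replaceFirst call applied to the sample text from the question. The class name PageBlockRemover and the method strip are hypothetical.

```java
class PageBlockRemover {
    static String strip(String text) {
        // (?s) lets . cross line breaks; .+? stops at the first ".htm" after "Page".
        return text.replaceFirst("(?s)Page(.+?)\\.htm", "");
    }

    public static void main(String[] args) {
        String text = "BLOOMINGTON, IL 61710\n"
                + "Page 4 of 5\n"
                + "8/2/2009file://C:\\hjO Fhjes\\hShjort_2012w211231_0323212_575.htm\n"
                + "Location: EAST JEFRYN, NY";
        System.out.println(strip(text));
        // BLOOMINGTON, IL 61710
        //
        // Location: EAST JEFRYN, NY
    }
}
```

The line breaks before and after the removed block stay in place, which is why an empty line remains. If the text contains several such blocks, replaceAll with the same pattern removes each of them.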
