REGEXP - how to read " character? - java

I'm using hadoop pig with regexp (REGEX_EXTRACT_ALL) - this is Java parsing.
I have a string:
"DYN_USER_ID=32753477; $Path=\"/\"; DYN_USER_CONFIRM=e6d2a0a7b7715cb10d1dca504e3c5e80; $Path=\"/\"" "Nokia6070/2.0 (03.20) Profile/MIDP-2.0 Configuration/CLDC-1.1"
I'm expecting two groups:
First: DYN_USER_ID=32753477; $Path=\"/\"; DYN_USER_CONFIRM=e6d2a0a7b7715cb10d1dca504e3c5e80; $Path=\"/\"
Second: Nokia6070/2.0 (03.20) Profile/MIDP-2.0 Configuration/CLDC-1.1
As you can see, inside the first string there is " character but with escape character \.
The simplest solution is:
"(.*)" "(.*)"
But is it the best one?

"(.*)(?<!\\\\)" "(.*)"
This uses a negative lookbehind, (?<!☀), where ☀ is some string; here it is the backslash character, written as a regex-escaped and String-escaped backslash.

Ideally, you should be using the negated character class [^"] so that it matches from the first delimiter " to the last delimiter ", but the problem is that it ignores escaped " characters. If you can have escaped " and escaped \ in your strings, it will be better if you use something like this:
"((?:\\.|[^"\\])+)" "((?:\\.|[^"\\])+)"
The group (?:\\.|[^"\\])+ matches, one or more times, either an escaped character (a backslash followed by any character) or a single character that is neither " nor \.
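As a rough Java sketch of the pattern above (the class name QuotedFields, the method split, and the shortened sample line are made up for illustration; in Pig the same regex would be passed to REGEX_EXTRACT_ALL with its own quoting rules), each backslash in the regex is doubled again for the Java string literal:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedFields {
    // Regex: "((?:\\.|[^"\\])+)" "((?:\\.|[^"\\])+)"
    // written as a Java string literal (every backslash doubled).
    private static final Pattern TWO_FIELDS = Pattern.compile(
        "\"((?:\\\\.|[^\"\\\\])+)\" \"((?:\\\\.|[^\"\\\\])+)\"");

    // Returns the two quoted fields, or null if the line does not match.
    public static String[] split(String line) {
        Matcher m = TWO_FIELDS.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        // Hypothetical, shortened version of the line from the question.
        String line = "\"a=1; $Path=\\\"/\\\"\" \"Nokia6070/2.0\"";
        String[] parts = split(line);
        System.out.println(parts[0]);
        System.out.println(parts[1]);
    }
}
```

Because the escaped quotes are consumed as part of `\\.`, the first group cannot end early at a `\"` inside the cookie string.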

Related

regex with replaceAll

I have done some searching and would like advice on this problem:
I want to replace "labels":"Webapp" with "labels":["Webapp"]
I found the regex (\"labels\"\:\")+(([a-zA-Z]|\s|\-)+)+(\") with the following substitution "labels":["$2"]
I use the method replaceAll and the Talend editor.
I write output_row.json = output_row.json.replaceAll("(\"labels\"\:\")+(([a-zA-Z]|\s|\-)+)+(\")",""labels":["$2"]"); but it doesn't work.
Detailed message: Invalid escape sequence (valid ones are \b \t \n \f \r \" \' \\ )
Then I escaped the characters, I did:
output_row.json = output_row.json.replaceAll("(\\"labels\\"\:\\")+(([a-zA-Z]|\\s|\-)+)+(\\")","\"labels\":[\"$2\"]");
But it still doesn't work.
Please could you help me?
Thanks.
Issues: don't escape - and :, since neither needs escaping here (: is not special in regex, and - is only special inside a character class).
Escape \s as \\s, and escape " as you did in your second example: \"labels\":[\"$2\"]
You can also make the regex more concise by combining \\s and - inside a character class [].
You can use (\"labels\":\")+([a-zA-Z -]+)\"
System.out.println("\"labels\":\"Webapp\""
.replaceAll("(\"labels\":\")+([a-zA-Z -]+)\""
, "\"labels\":[\"$2\"]"));

Regular expression to match escaped sequences in java

I am looking for regex to check for all escape sequences in java
\b backspace
\t horizontal tab
\n linefeed
\f form feed
\r carriage return
\" double quote
\' single quote
\\ backslash
How do I write regex and perform validation to allow words / textarea / strings / sentences containing valid escape sequences
This regex will match all your escape sequence that you have written:
\\[btnfr"'\\]
In Java you need to duplicate the backslash, the code will result as:
Pattern p = Pattern.compile("\\\\[btnfr\\\"\\'\\\\]");
if(p.matcher("\\b backspace").find()){
System.out.println("Contains escape sequence");
}
The following regex should meet your need:
Pattern pattern = Pattern.compile("\\\\[\\\\btnfr\'\"]");
as in
Pattern pattern = Pattern.compile("\\\\[\\\\btnfr\'\"]");
String[] strings = new String[]{"\\b","\\t","\\n","\\f","\\r","\\\'","\\\"", "\\\\"};
for (String s:strings) {
System.out.println(s + " - " + pattern.matcher(s).matches());
}
To match a single \, you would have to add 4 \ inside a regex string.
Considering a string, "\\" stands for a single \.
When you have "\\" as a regex string, it means a \ which is a special character in regex and it is supposed to be followed by certain other character to form an escape sequence.
In this way, we need "\\\\", to match a single \ which is equivalent to the string "\\".
EDIT: There is no need to escape the single quote in the regex string. So "\\\\[\\\\btnfr\'\"]" can be replaced with "\\\\[\\\\btnfr'\"]".
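The patterns above only find an escape sequence somewhere in the text. A minimal sketch of whole-string validation follows, assuming the goal is to reject any backslash that is not followed by one of the listed characters (the class name EscapeValidator and the method isValid are illustrative, not from the question):

```java
import java.util.regex.Pattern;

public class EscapeValidator {
    // Regex: (?:[^\\]|\\[btnfr"'\\])*
    // Each element is either a non-backslash character or a backslash
    // followed by one of the allowed escape characters.
    private static final Pattern VALID =
        Pattern.compile("(?:[^\\\\]|\\\\[btnfr\"'\\\\])*");

    // True only if every backslash in the text starts a listed escape.
    public static boolean isValid(String text) {
        return VALID.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("col1\\tcol2\\n")); // true
        System.out.println(isValid("bad \\q escape")); // false
        System.out.println(isValid("dangling \\"));    // false
    }
}
```

Using matches() rather than find() is what turns the pattern into a validator: the whole input has to be consumed by valid elements.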
You'll need DOTALL if you want . to match line terminators. You might also find \s handy, as it represents all whitespace (including line terminators, with or without DOTALL). E.g.
Pattern p = Pattern.compile("([\\s\"'\\\\]+)", Pattern.DOTALL);
Matcher m = p.matcher("foo '\r\n\t bar");
assertTrue(m.find());
assertEquals(" '\r\n\t ", m.group(1));

Regex matching quoted string but ignoring escaped quotation mark

What I want to know is how to modify following regex: \".*?\" so it will ignore escaped " character (\") so it won't end matching at \".
For example:
parameter1 = " fsfsdfsd \" " parameter2 = " fsfsfs "
I want to match:
" fsfsdfsd \" "
and
" fsfsfs "
but not
" fsfsdfsd \" " parameter2 = " fsfsfs "
etc...
Try this one:
"(?:\\"|[^"])*"
It also matches "test \" (an unterminated string), though you can probably avoid that using a lookbehind. Escape the characters as needed using \.
I usually handle this sort of task by figuring out what are the elements that can appear between quote marks. In this case, each element can be:
any character that is not \ or ";
the two-character sequence \";
a \ that is not followed by ".
You can expand this if desired, by allowing \\ to represent \, for instance, or allowing other escapes; it should be pretty simple to modify the above list.
Then the regular expression just follows the rules in the list. (Note: this is a regex, not a Java string literal.)
"(([^\\"]|\\"|\\(?!"))*)"
which means that, within the quote marks, we match zero or more of: (1) a character other than \ or " (the character class); (2) the sequence \"; (3) \ not followed by " (negative lookahead). Of course, the Java string literal looks pretty ugly:
"\"(([^\\\\\"]|\\\\\"|\\\\(?!\"))*)\""
(Note: not tested.)
You will need negative lookbehind in your regex:
(?<!\\\\)\".*?(?<!\\\\)\"
The correct regexp for matching strings between quotes is:
"([^\\"]+|\\.|\\\\)*"
but because backslashes need to be escaped in Java, the resulting expression is:
Pattern.compile("\"(?:[^\\\\\"]+|\\\\.|\\\\\\\\)*\"");
This expression matches backslash-escaped characters and backslashes themselves, for example:
... "123 \\\" 456 \\" ...
^ ^ slash literal
^
^ slash literal + escaped quote
regexp written in comments above will fail on this example
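To try the idea on the sample input from the question, here is a small sketch. It uses a slightly simplified variant of the patterns above, "(?:[^"\\]|\\.)*", which treats any backslash-plus-character pair as one unit; the class name QuotedStrings and the method findAll are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedStrings {
    // Regex: "(?:[^"\\]|\\.)*"
    // Inside the quotes: any character except " and \, or a backslash
    // followed by any single character (so \" and \\ are both consumed).
    private static final Pattern QUOTED =
        Pattern.compile("\"(?:[^\"\\\\]|\\\\.)*\"");

    // Collects every quoted string, quotes included.
    public static List<String> findAll(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input =
            "parameter1 = \" fsfsdfsd \\\" \" parameter2 = \" fsfsfs \"";
        for (String s : findAll(input)) {
            System.out.println(s);
        }
    }
}
```

Consuming `\\` as a pair is what lets a string that ends in an escaped backslash, such as the "123 \\\" 456 \\" example, close at the right quote.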

Removing backslash and newline character (occurring together) in Java

I have stream of data coming from different feeds which I need to clean up.
Data is in specific format and if some sentence spans through multiple lines it is separated using "\"(backslash), which I want to remove. \ is also present in other part of text for escaping quotes etc and I don't want to remove these backslashes. So eventually I want to remove "\\n".
I have tried following regex for removing \ and \n but it didn't work :
singleLine.replaceAll("(\\\\n|\\\\r)", "");
I am not sure what regex would work in this case.
Regex isn't really necessary for this. If I were you, I would use...
singleLine = singleLine.replace("\\\n", "");
Many people think the replace method only replaces the first occurrence, but in fact the only difference is that replaceAll uses regex, while replace simply replaces exact matches of the String.
If you do want to use regex though, I believe you have to write the backslash as \\\\ (you have to 'nullify' the escape character both in Java and in regex, so x4, not just x2).
The only other issue is that in your example you never assign the result back to singleLine; I'm not sure if you hid that, or missed that.
Edit:
Explaining the reasoning for \\\\ some more: Java requires that you write "\\" to represent one \. Regex also has a use for the \ character and requires you to do the same again. If you just write "\\" in Java, the regex parser receives a single \, which is its escape character. You need to give the regex parser two of them to escape it, so in Java you need to write "\\\\" just to match a single \.
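A short sketch comparing the two approaches, assuming the data contains a backslash immediately followed by a real line-feed character (the class name LineJoiner and both method names are hypothetical):

```java
public class LineJoiner {
    // Literal replacement: the Java literal "\\\n" is two characters,
    // a backslash followed by a line feed. No regex is involved.
    public static String joinLiteral(String s) {
        return s.replace("\\\n", "");
    }

    // Regex replacement: "\\\\\n" reaches the regex engine as \\ (a literal
    // backslash) followed by an actual line-feed character.
    public static String joinRegex(String s) {
        return s.replaceAll("\\\\\n", "");
    }

    public static void main(String[] args) {
        // A backslash-newline continuation plus escaped quotes to keep.
        String s = "first \\\nsecond, with \\\"quotes\\\" kept";
        System.out.println(joinLiteral(s));
        System.out.println(joinRegex(s));
    }
}
```

Both calls return a new String; the backslashes that escape the quotes are left alone because they are not followed by a line feed.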
You'll need 5 backslash characters for each pattern in that regexp.
Use:
singleLine.replaceAll("(\\\\\n|\\\\\r)", "");
The backslash character is an escape character both in your string and in the regexp. So to represent a literal \ in a regexp you'll need to use 4 \ characters (your regexp needs \\ to get an escaped backslash, and each of those needs to be escaped in the Java String), and then another to represent either \n or \r.
String str = "string with \\\n newline and \\\n newline ...";
String repl = str.replaceAll("(\\\\\n|\\\\\r)", "");
System.out.println("str: " + str);
System.out.println("repl: " + repl);
Output:
str: string with \
 newline and \
 newline ...
repl: string with  newline and  newline ...
You need to assign the return value to another String object, or the same object, because of String immutability.
singleLine = singleLine.replaceAll("(\\\\n|\\\\r)", "");
More info is here
Remember that Strings are immutable. This means that replaceAll() does not change the String in singleLine. You must use the return value to get the modified String. For example, you can do
singleLine = singleLine.replaceAll("(\\\\n|\\\\r)", "");

java regex: find pattern of 1 or more numbers followed by a single

I'm having a java regex problem.
how can I find pattern of 1 or more numbers followed by a single . in a string?
"^[\\d]+[\\.]$"
^ = start of string
[\\d] = any digit
+ = 1 or more occurrences
\\. = escaped dot char
$ = end of string
I think this is the answer to your question:
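String searchText = "asdgasdgasdg a121341234.sdg asdg as12..dg a1234.sdg ";
Matcher m = Pattern.compile("\\d+\\.(?!\\.)").matcher(searchText);
while (m.find()) System.out.println(m.group());
Note that String.matches() only succeeds if the whole string matches, so Matcher.find() is used here. This will find "121341234." and "1234." but not "12."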
(\\d)+\\.
\\d represents any digit
+ says one or more
Refer to this: http://www.vogella.com/articles/JavaRegularExpressions/article.html
In regex the metacharacter \d is used to represent a digit, but to write it in Java code as a regex one has to use \\d because of the double parsing performed on it.
First the string parser converts it to \d, and then the regex parser interprets it as the digit metacharacter (which is what we want).
For the "one or more" part we use the + greedy quantifier.
To represent a . we use \\. because of the double parsing scenario.
So in the end we have (\\d)+(\\.).
(\\d+)\\.
\\d is for numbers, + is for one and more, \\. is for dot. If . is written without backslash before it it matches any character.
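Putting the pieces together, a minimal sketch that searches inside a longer string. It assumes "a single ." means the dot is not immediately followed by another dot, which is expressed with a negative lookahead; the class name DigitsThenDot and the method findAll are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DigitsThenDot {
    // Regex: \d+\.(?!\.)
    // One or more digits, a literal dot, and no second dot right after it.
    private static final Pattern DIGITS_DOT = Pattern.compile("\\d+\\.(?!\\.)");

    // Returns every "digits followed by a single dot" occurrence in text.
    public static List<String> findAll(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = DIGITS_DOT.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "asdgasdgasdg a121341234.sdg asdg as12..dg a1234.sdg ";
        System.out.println(findAll(text));
    }
}
```

If the whole string must consist of digits and one dot, the anchored form ^\d+\.$ with matches() is the simpler choice.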
