How to use split for CSV while escaping \ - java

I am trying to split a csv list.
csvList=hello there, how are you, what is your name\, again
I have to use the Java split function to get the three components:
hello there
how are you
what is your name, again
I want to escape the comma that is preceded by the '\'.
Can anyone please help?
Thanks.

You can use a lookbehind regex:
String[] tok="hello there, how are you, what is your name\\, again".split(" *(?<!\\\\), *");

You can use a negative look behind like this:
input.split("\\W*(?<!\\\\),\\W*")
Really the key here is the (?<!\\\\),. This says, "Find me commas that don't have a backslash behind them."
You need 4 backslashes because in a Java string literal the backslash is itself an escape character (as in \t), so two backslashes in the source produce one backslash in the string. In a regex, the backslash is also a special character, so the regex needs two of them to mean one literal backslash. You have to escape the escape.
The \\W* says, "match 0 or more non-word characters", which covers whitespace but also punctuation. The point of that is simply to trim your results so they don't have spaces before or after them; \\s* would match whitespace only.
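For what it's worth, here is a minimal sketch of the lookbehind split applied to the string from the question. The class name CsvSplitDemo and the method splitCsv are made up for illustration, and it uses \\s* instead of \\W* so that only whitespace is trimmed. The final replace step, which turns \, back into a plain comma, is an assumption based on the expected output in the question.

```java
public class CsvSplitDemo {

    // Hypothetical helper: split on commas that are NOT preceded by a backslash,
    // trimming whitespace around each separator.
    static String[] splitCsv(String input) {
        String[] parts = input.split("\\s*(?<!\\\\),\\s*");
        for (int i = 0; i < parts.length; i++) {
            // Assumption: the escaped comma should appear as a plain comma in the result.
            parts[i] = parts[i].replace("\\,", ",");
        }
        return parts;
    }

    public static void main(String[] args) {
        String csvList = "hello there, how are you, what is your name\\, again";
        for (String part : splitCsv(csvList)) {
            System.out.println(part);
        }
        // prints:
        // hello there
        // how are you
        // what is your name, again
    }
}
```

If you want to keep the backslash in the result, just drop the replace call.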

Related

Java Scanner backslash delimiter

I am trying to use a series of delimiters for an input. It's for homework. They said that we should use the backslash (\) too. If I use it like this (it's at the end):
scanner.useDelimiter("\\;|\\:|\\?|\\~|/|\\.|,|\\<|\\>|\\`|\\[|\\]|\\{|\\}|\\(|\\)|\\!|\\#|\\#|\\$|\\%|\\^|\\&|\\-|\\_|\\+|\\'|\\=|\\*|\"|\\||\n|\t|\r|\\");
It won't work. It says unsupported escape sequence. If I add another backslash it says Illegal line end in string literal. If I add another it will escape to double backslash and that's not what I need.
I couldn't find any solution for this and that's why I'm asking. I already finished the homework using Scanner, and changing that now is not an option (there would be a lot to re-implement).
Thank you.
You should use four backslashes at the end, like:
scanner.useDelimiter("\\;|\\:| ... |\r|\\\\");
This is the way it should work. You said if you tried it would match double backslashes. Have you tried it? If you did, and it still matches double backslashes, I suspect your input is escaped too somewhere. (maybe it is a string literal somewhere in your code?)
The reason behind this is that your string is de-escaped twice. Once at compile time as every other string literal in the Java language, and once compiling the regex. That means, after the first step it is escaped once, so the regex compiler gets two backslashes \\. The regex compiler will de-escape that too (just like \r), and will match a single \ character.
If you would like to match two backslashes this way, then you have to use eight backslashes (\\\\\\\\ or \\\\{2}) in your literal. Yeah, pretty ugly.
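Here is a small sketch of the four-backslash delimiter in action. The class name BackslashDelimiterDemo and the method tokens are made up for illustration, and the delimiter list is cut down to three alternatives (semicolon, comma, backslash) to keep it readable; it is not the full list from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class BackslashDelimiterDemo {

    // Hypothetical helper: tokenize the input on ';', ',' or a single backslash.
    static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            // "\\\\" in source -> \\ in the string -> one literal backslash in the regex
            scanner.useDelimiter(";|,|\\\\");
            while (scanner.hasNext()) {
                out.add(scanner.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The literal below holds the text a;b\c,d (one backslash).
        System.out.println(tokens("a;b\\c,d")); // prints [a, b, c, d]
    }
}
```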
I think you are using the delimiter the wrong way.
There is a related topic; check it first:
How do I use a delimiter in Java Scanner?

java regex escape sequences

I was wondering about regex in Java and stumbled upon the use of backslashes. For instance, if I wanted to look for occurrences of the words "this regex" in a text, I would do something like this:
Pattern.compile("this regex");
Nonetheless, I could also do something like this:
Pattern.compile("this\\sregex");
My question is: what is the difference between the two of them? And why do I have to type the backslash twice, I mean, why isn't \s an escape sequence in Java? Thanks in advance!
\s means any whitespace character, including tab, line feed and carriage return.
Java string literals already use \ to escape special characters. To put the character \ in a string literal, you need to write "\\". However regex patterns also use \ as their escape character, and the way to put that into a string literal is to use two, because it goes through two separate escaping processes. If you read your regex pattern from a plain text file for example, you won't need double escaping.
The reason you need two backslashes is that when you enter a regex string in Java code you are actually dealing with two parsers:
The first is the Java compiler, which is converting your string literal to a Java String.
The second is the regex parser, which is interpreting your regex after it has been converted to a Java string and then passed to the regex parser when you call Pattern.compile.
So when you input "this\\sregex", it will be converted to the Java string "this\sregex" by the Java compiler. Then when you call Pattern.compile with the string, the regex compiler will interpret \s as the whitespace character class.
The difference is that \s denotes a whitespace character, which can be more than just a blank space. It can be a tab, line feed or carriage return, to name a few.
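To make the difference concrete, here is a short sketch comparing the two patterns from the question against a text with a space and a text with a tab. The class name WhitespaceEscapeDemo, the helper found, and the sample strings are made up for illustration.

```java
import java.util.regex.Pattern;

public class WhitespaceEscapeDemo {

    // Hypothetical helper: true if the regex matches somewhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String withSpace = "use this regex here";
        String withTab = "use this\tregex here";

        System.out.println(found("this regex", withSpace));   // true
        System.out.println(found("this regex", withTab));     // false: a literal space is not a tab
        System.out.println(found("this\\sregex", withSpace)); // true
        System.out.println(found("this\\sregex", withTab));   // true: \s matches any whitespace

        // After the Java compiler is done, the string holds a single backslash:
        System.out.println("this\\sregex".length()); // 11
    }
}
```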

How to change special character "\" in a text file with replace

I need to change some things in a big .rtf file. I have done this kind of replacement correctly in other files with other text, but this text contains something like "\line", and I want to change it to "\par".
I know that '\' is a special character, so I can't use a simple .replace("\line", "\par"). I tried .replace("\\line", "\\par").
Neither worked. Is there a way to do this? I can't use a simple .replace("line", "par") because some words contain "line" without the "\". I only need to change it when "line" has a "\" before it.
Strings are immutable, so you have to assign the result back:
line = line.replace("\\line", "\\par");
With replaceAll, which takes a regex, you need to escape the \ in the regex as \\. However, each of these also needs to be escaped in the string literal, so you end up with four:
replaceAll("\\\\line", "\\\\par");
4 backslashes are turned into 2 \ characters in the string during compiler parsing, and \\ is parsed by the regex engine as a single literal backslash.
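A minimal sketch of both variants side by side, assuming a made-up sample line and hypothetical names (RtfReplaceDemo, viaReplace, viaReplaceAll). Note that in replaceAll the backslash is special in the replacement string as well, which is why "\\\\par" also needs four.

```java
public class RtfReplaceDemo {

    // Plain-text replacement: only Java string-literal escaping applies.
    static String viaReplace(String line) {
        return line.replace("\\line", "\\par");
    }

    // Regex replacement: the backslash must be escaped for the regex engine too,
    // both in the pattern and in the replacement string.
    static String viaReplaceAll(String line) {
        return line.replaceAll("\\\\line", "\\\\par");
    }

    public static void main(String[] args) {
        // The literal below holds the text: first\line second outline
        String line = "first\\line second outline";

        // The result must be assigned (or used); the original string is never modified.
        System.out.println(viaReplace(line));    // prints first\par second outline
        System.out.println(viaReplaceAll(line)); // prints first\par second outline
    }
}
```

The word "outline" is left alone in both cases because its "line" has no backslash in front of it.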

Splitting into sentences Java

I want to split a text into sentences. My text contains \n character in between. I want the splitting to be done at \n and .(dot). I cannot use BreakIterator as splitting condition for it is a space followed by a period (In the text I want to split, that isn't necessary).
Example:
i am a java programmer.i like coding in java. pi is 3.14\n regex not working
Should output:
['i am a java programmer', 'i like coding in java', 'pi is 3.14', 'regex not working']
I tried a simple regex which splits on either \n or .:
[\\\\n\\.]
This isn't working, although specifying them separately works.
\\\\n
\\.
So can anyone give a regex that will split on either \n or . ?
Another problem is I don't want splitting to be done in case of decimals like 5.6.
This Java regex should do it:
"\n|((?<!\\d)\\.(?!\\d))"
Points here:
you don't need to escape \n, ever
those weird-looking things around the dot are negative lookarounds, and mean "the previous/next character must not be a digit"
This regex says: "either a newline, or a literal dot that is not preceded or followed by a digit"
FYI, you don't need to escape most special characters, such as the dot, inside a character class (between []). The exceptions are the brackets themselves, the backslash, and ^ or - where they would be read as negation or a range.
Use string.split("[\n.]") to split at \n or .
Inside character class, . has no special meaning. So there is no need for escaping .
Edit: string.split("\n|(?<!\\d)[.](?!\\d)") avoids splitting of decimal numbers.
Here, for each . a lookbehind and a lookahead check whether there is a digit on either side. The split is applied only if neither neighbor is a digit.
\n|\\.(?!\\d)|(?<!\\d)\\. avoids split for . with digits on both sides.
\n|(?<!\\d)[.](?!\\d) avoids split if any side has a digit
So what you require might be
string.split("\n|\\.(?!\\d)|(?<!\\d)\\.")
which splits something.4 but not 3.14
You need not double-escape stuff in a Java regex in the [] block:
[.\n]
should work.
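Here is a small sketch that runs two of the regexes from the answers above on the example from the question. The class name SentenceSplitDemo and the method split are made up for illustration. It assumes the \n in the text is a real newline character (not a backslash followed by n), and the trimming of each piece is an addition to match the expected output.

```java
import java.util.ArrayList;
import java.util.List;

public class SentenceSplitDemo {

    // Splits on newline, or on a dot with no digit on either side.
    static final String STRICT = "\n|(?<!\\d)\\.(?!\\d)";

    // Splits on newline, or on a dot unless it has digits on BOTH sides.
    static final String LOOSE = "\n|\\.(?!\\d)|(?<!\\d)\\.";

    // Hypothetical helper: split, trim each piece, and drop empty pieces.
    static List<String> split(String regex, String text) {
        List<String> out = new ArrayList<>();
        for (String part : text.split(regex)) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "i am a java programmer.i like coding in java. pi is 3.14\n regex not working";
        System.out.println(split(STRICT, text));
        // prints [i am a java programmer, i like coding in java, pi is 3.14, regex not working]

        System.out.println(split(LOOSE, "something.4 and 3.14"));
        // prints [something, 4 and 3.14]
    }
}
```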

Need a regular expression for field which should allow special characters, alphanumeric characters, and spaces

I am using the following regex:
[a-zA-Z0-9-#.()/%&\\s]{0,19}.
The requirement for the field is it should allow any thing and the field size should be 19.
Let me know if it needs any corrections. Any help is appreciated.
You simply need to escape the special characters. Try:
[a-zA-Z0-9\-#\.\(\)\/%&\s]{0,19}
You can test your regular expressions on http://rubular.com/
Your regex is incorrect in at least one way - if you're considering a hyphen to be a "special character", then you should put it at the beginning or end of the character class. So: [a-zA-Z0-9#.()/%&\s-]{0,19}.
Characters that are "special" within the context of the regex itself are often not parsed if they're inside a character class. So you're fine with ., ( and ). But check your parser to make sure that it understands what \s means. It might be simpler just to put a space.
Also, if your regex parser tends to delimit the regex with slashes, then you may have to escape the slash in the middle of the character class: [a-zA-Z0-9#.()\/%&\s-]{0,19}.
Just escape the dash - or put it at the beginning or at the end of the character class:
[a-zA-Z0-9\\-#.()/%&\\s]{0,19}
or
[-a-zA-Z0-9#.()/%&\\s]{0,19}
or
[a-zA-Z0-9#.()/%&\\s-]{0,19}
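As a rough sketch, this is how the last variant (hyphen at the end of the character class) could be checked in Java. The class name FieldPatternDemo, the method isValid, and the sample inputs are made up for illustration; matches() is used so that the whole field has to fit the pattern, which is an assumption about how the field is validated.

```java
import java.util.regex.Pattern;

public class FieldPatternDemo {

    // Hyphen placed last so it is a literal hyphen, not part of a range.
    private static final Pattern FIELD =
            Pattern.compile("[a-zA-Z0-9#.()/%&\\s-]{0,19}");

    // Hypothetical helper: true if the entire value consists of at most 19 allowed characters.
    static boolean isValid(String value) {
        return FIELD.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("AB-12 (x)/5%"));         // true
        System.out.println(isValid(""));                     // true: {0,19} allows empty input
        System.out.println(isValid("hello_world"));          // false: underscore is not in the class
        System.out.println(isValid("aaaaaaaaaaaaaaaaaaaa")); // false: 20 characters
    }
}
```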
