I had to export a bunch of strings to a CSV file that I opened in Excel. The strings contained '\n' and '\t', which I needed to keep in the CSV, so I did the following before exporting the data:
public static String unEscapString(String s)
{
StringBuilder sb = new StringBuilder();
for (int i = 0; i < s.length(); i++)
{
switch (s.charAt(i))
{
case '\n': sb.append("\\n"); break;
case '\t': sb.append("\\t"); break;
default: sb.append(s.charAt(i));
}
}
return sb.toString();
}
The problem is that I am now reimporting the data into Java but I can't figure out how to get the newline and tab to print correctly again. I've tried:
s.replaceAll("\\n", "\n");
but it still ignores the newlines. Help?
EDIT: Example of what I'm trying to do:
Say one string in the CSV is "foo \n bar". When I import it using Java, I want to print the same string to the console and have the newline behave correctly.
replaceAll's first argument is a regular expression. You have 2 choices. You can use plain old replace like so:
s.replace("\\n", "\n");
or you can escape the backslash for the regex parser (as written, the regex \n matches an actual newline character, not a backslash followed by an 'n'):
s.replaceAll("\\\\n", "\n");
or
s.replaceAll(Pattern.quote("\\n"), "\n");
I would opt for replace since you're not using a regular expression.
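A minimal sketch of the round trip, assuming the data was escaped with the method from the question (so only \n and \t were rewritten and backslashes themselves were left alone). The class and method names are made up for illustration. Note that String is immutable, so the result of replace has to be assigned or returned; calling it and discarding the result changes nothing.

```java
class UnescapeDemo {
    // Turns the two-character sequences backslash+n and backslash+t
    // back into real newline and tab characters.
    static String unescape(String s) {
        // replace() treats both arguments literally, so no regex escaping is needed.
        return s.replace("\\n", "\n").replace("\\t", "\t");
    }

    public static void main(String[] args) {
        String fromCsv = "foo \\n bar"; // backslash followed by 'n', as read from the file
        System.out.println(unescape(fromCsv));
    }
}
```

If the exported data can itself contain backslashes, the escaping side would need to escape those too, otherwise a literal backslash followed by 'n' cannot be told apart from an escaped newline.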
It should be
sb.append("\n");
Otherwise, you will get a '\' and a 'n' by using "\\n".
But I recommend you to use:
sb.append(System.getProperty("line.separator"));
Here System.getProperty("line.separator") gives you the line separator of the platform the program is running on, which keeps the code platform independent. Also, since Java 7 there is a method that returns the value directly: System.lineSeparator().
If you want an actual newline in the string, it should be \n, not \\n. The way you have it, it is being interpreted as a backslash and then an 'n'.
The question is pretty simple.
A CSV file looks like this:
1, "John", "John Joy"
If I want to get each column, I just use String[] splits = line.split(",");
What if the CSV file looks like this:
1, "John", "Joy, John"
So we have a comma inside a double quotes pair. The above split won't work any more, because I want "Joy, John" as a complete part.
So is there an elegant / simple algorithm to deal with this situation?
Edit:
Please do not consider it as a formal CSV parsing thing. I just use CSV as a use case where I need to split.
What I really want is NOT a proper CSV parser, instead, I just want an algorithm which can properly split a line by comma considering the double quotes.
It's better to use an existing library for this purpose instead of writing a custom implementation (unless you are doing this to learn).
CSV has some specifics that you can miss in a custom implementation, and a library is usually well tested.
You can find some good ones in the question "Can you recommend a Java library for reading (and possibly writing) CSV files?"
EDIT
I've created a method that will parse your string, but it may not work perfectly because I haven't tested it well.
Treat it as a starting point that you can improve further.
String inputString = "1, \"John\",\"Joy, John\"";
char quote = '"';
List<String> csvList = new ArrayList<String>();
boolean inQuote = false;
int lastStart = 0;
for (int i = 0; i < inputString.length(); i++) {
    char c = inputString.charAt(i);
    if (c == quote) {
        // a quote toggles the "inside quotes" state
        inQuote = !inQuote;
    } else if (c == ',' && !inQuote) {
        // a comma outside quotes ends the current field
        csvList.add(inputString.substring(lastStart, i));
        lastStart = i + 1;
    }
}
// whatever follows the last separator is the final field
csvList.add(inputString.substring(lastStart));
System.out.println(csvList);
Question for you
What if you get a string like 1, "John", ""Joy, John""
(two quotes around "Joy, John")?
// use a regexp with a Matcher
String string1 = "\"John\", \"John Joy\"";
String string2 = "\"John\", \"Joy, John\"";
Pattern pattern = Pattern.compile("\"[^\"]+\"");
Matcher matcher = pattern.matcher(string1);
System.out.println("string1: " + string1);
int start = 0;
while(matcher.find(start)){
System.out.println(matcher.group());
start = matcher.end() + 1;
if(start > string1.length())
break;
}
matcher = pattern.matcher(string2);
System.out.println("string2: " + string2);
start = 0;
while(matcher.find(start)){
System.out.println(matcher.group());
start = matcher.end() + 1;
if(start > string2.length())
break;
}
Using regular expressions is quite elegant.
Sorry, I'm not familiar with Java regex, so my example is in Lua:
(this example doesn't take into account that there may be newline chars inside quoted text, and that original quote chars would be doubled inside quoted text)
--- file.csv
1, "John", "John Joy"
2, "John", "Joy, John"
--- Lua code
for line in io.lines 'file.csv' do
print '==='
for _, s in (line..','):gmatch '%s*("?)(.-)%1%s*,' do
print(s)
end
end
--- Output
===
1
John
John Joy
===
2
John
Joy, John
You could start with the regular expression:
[^",]*|"[^"]*"
which matches either a non-quoted string not containing a comma or a quoted string. However, there are lots of questions, including:
Do you really have spaces after the commas in your input? Or, more generally, will you allow quotes which are not exactly at the first character of a field?
How do you put quotes around a field which includes a quote?
Depending on how you answer that question, you might end up with different regular expressions. (Indeed, the customary advice to use a CSV parsing library is not so much about handling the corner cases; it is about not having to think about them because you assume "standard CSV" handling, whatever that might be according to the author of the parsing library. CSV is a mess.)
One regular expression I've used with some success (although it is not CSV compatible) is:
(?:[^",]|"[^"]*")*
which is pretty similar to the first one, except that it allows any number of concatenated quoted and unquoted pieces, so that each of the following is recognized as a single field:
"John"", Mary"
John", "Mary
CSV standard would treat the first one as representing:
John", Mary -- internal quote
and treat the quotes in the second one as ordinary characters, resulting in two fields. So YMMV.
In any event, once you decide on an appropriate regex, the algorithm is simple. In pseudo-code since I'm far from a Java expert.
repeat:
    match the regex at the current position
    and append the result to the result;
    if the match fails:
        report error
    if the match goes to the end of the string:
        done
    if the next character is a ',':
        advance the position by one
    otherwise:
        report error
Depending on the regex, the two conditions under which you report an error might not be possible. Generally, the first one will trigger if the quoted field is not terminated (and you need to decide whether to allow new-lines in the quoted field -- CSV does). The second one might happen if you used the first regex I provided and then didn't immediately follow the quoted string with a comma.
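The pseudo-code above could be sketched in Java roughly as follows, using the second regex. The class and method names are invented for the example, and the returned fields keep their surrounding spaces and quotes; trimming and unquoting are left to the caller. Because this regex can match the empty string, the "match fails" branch should never fire here, and an unterminated quote shows up as the "expected a comma" error instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteAwareSplitter {
    // A field is any run of non-quote, non-comma characters and complete "..." sections.
    private static final Pattern FIELD = Pattern.compile("(?:[^\",]|\"[^\"]*\")*");

    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        int pos = 0;
        while (true) {
            // Match the regex at the current position only.
            m.region(pos, line.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("No field at position " + pos);
            }
            fields.add(m.group());
            pos = m.end();
            if (pos == line.length()) {
                return fields; // the match reached the end of the string
            }
            if (line.charAt(pos) != ',') {
                throw new IllegalArgumentException("Expected ',' at position " + pos);
            }
            pos++; // skip the separator
        }
    }

    public static void main(String[] args) {
        for (String field : split("1, \"John\", \"Joy, John\"")) {
            System.out.println("[" + field.trim() + "]");
        }
    }
}
```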
First split the string on quotes. Odd segments will have quoted content; even ones will have to be split one more time on commas. I use it on logs, where quoted text doesn't have escaped quotes, just like in this question.
boolean quoted = false;
for(String q : str.split("\"")) {
if(quoted)
System.out.println(q.trim());
else
for(String s : q.split(","))
if(!s.trim().isEmpty())
System.out.println(s.trim());
quoted = !quoted;
}
When I have a string such as:
String x = "hello\nworld";
How do I get Java to print the actual escape character (and not interpret it as an escape character) when using System.out?
For example, when calling
System.out.print(x);
I would like to see:
hello\nworld
And not:
hello
world
I would like to see the actual escape characters for debugging purposes.
Use the method "StringEscapeUtils.escapeJava" in Java lib "org.apache.commons.lang"
String x = "hello\nworld";
System.out.print(StringEscapeUtils.escapeJava(x));
One way to do this is:
public static String unEscapeString(String s){
StringBuilder sb = new StringBuilder();
for (int i=0; i<s.length(); i++)
switch (s.charAt(i)){
case '\n': sb.append("\\n"); break;
case '\t': sb.append("\\t"); break;
// ... rest of escape characters
default: sb.append(s.charAt(i));
}
return sb.toString();
}
and you run System.out.print(unEscapeString(x)).
You have to escape the backslash itself:
String x = "hello\\nworld";
System.out.println("hello \\nworld");
Java has its escape-sequence just the same as that in C.
use String x = "hello\\nworld";
Just escape the escape character.
String x = "hello\\nworld";
Try to escape the backslash like \\n
You might want to check out this method. Although this may do more than you intend. Alternatively, use String replace methods for new lines, carriage returns and tab characters. Do keep in mind that there are also such things as unicode and hex sequences.
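For debugging output, a hand-written escaper along these lines may be enough when pulling in a library is not an option. This is only a sketch: the class name is made up, the set of named escapes is a judgment call, and any remaining control characters are shown as a backslash, 'u' and four hex digits.

```java
class DebugEscaper {
    // Returns a copy of s in which control characters are shown as Java-style escapes.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\\': sb.append("\\\\"); break; // escape the backslash so output stays unambiguous
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20 || c == 0x7F) {
                        // any other control character: four-digit hex escape
                        sb.append(String.format("\\u%04X", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("hello\nworld"));
    }
}
```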
With the help of tucuxi from the existing post Java remove HTML from String without regular expressions I have built a method that will parse out any basic HTML tags from a string. Sometimes, however, the original string contains html hexadecimal characters like é (which is an accented e). I have started to add functionality which will translate these escaped characters into real characters.
You're probably asking: Why not use regular expressions? Or a third party library? Unfortunately I cannot, as I am developing on a BlackBerry platform which does not support regular expressions and I have never been able to successfully add a third party library to my project.
So, I have gotten to the point where any é is replaced with "e". My question now is, how do I add an actual 'accented e' to a string?
Here is my code:
public static String removeHTML(String synopsis) {
char[] cs = synopsis.toCharArray();
String sb = new String();
boolean tag = false;
for (int i = 0; i < cs.length; i++) {
switch (cs[i]) {
case '<':
if (!tag) {
tag = true;
break;
}
case '>':
if (tag) {
tag = false;
break;
}
case '&':
char[] copyTo = new char[7];
System.arraycopy(cs, i, copyTo, 0, 7);
String result = new String(copyTo);
if (result.equals("é")) {
sb += "e";
}
i += 7;
break;
default:
if (!tag)
sb += cs[i];
}
}
return sb.toString();
}
Thanks!
Java Strings are unicode.
sb += '\u00E9'; // lowercase e with acute accent
sb += '\u00C9'; // uppercase E with acute accent
You can print out just about any character you like in Java as it uses the Unicode character set.
To find the character you want take a look at the charts here:
http://www.unicode.org/charts/
In the Latin Supplement document you'll see all the unicode numbers for the accented characters. You should see the hex number 00E9 listed for é for example. The numbers for all Latin accented characters are in this document so you should find this pretty useful.
To use the character in a String, just use the Unicode escape sequence of \u followed by the character code like so:
System.out.print("Let's go to the caf\u00E9");
Would produce: "Let's go to the café"
Depending on which version of Java you're using, you might also find StringBuilders (or StringBuffers if you're multi-threaded) more efficient than using the + operator to concatenate Strings.
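Putting the two points together (Unicode chars and a StringBuilder), numeric character references can be decoded without regular expressions. The sketch below is an assumption about what such a helper might look like, not code from the question: the class and method names are invented, it only handles decimal and hexadecimal references that end in a semicolon, and it ignores named entities and code points above U+FFFF.

```java
class EntityDecoder {
    // Replaces numeric references such as &#233; or &#xE9; with the character itself.
    static String decodeNumericEntities(String in) {
        StringBuilder out = new StringBuilder(in.length());
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (c == '&' && i + 2 < in.length() && in.charAt(i + 1) == '#') {
                int semi = in.indexOf(';', i + 2);
                int code = (semi > 0) ? parseCode(in.substring(i + 2, semi)) : -1;
                if (code >= 0) {
                    out.append((char) code);
                    i = semi + 1; // continue after the semicolon
                    continue;
                }
            }
            out.append(c); // ordinary character, or '&' that starts no valid reference
            i++;
        }
        return out.toString();
    }

    // Parses "233" or "xE9"; returns -1 if the text is not a code in the range 0..0xFFFF.
    private static int parseCode(String body) {
        try {
            boolean hex = body.startsWith("x") || body.startsWith("X");
            int code = hex ? Integer.parseInt(body.substring(1), 16) : Integer.parseInt(body);
            return (code >= 0 && code <= 0xFFFF) ? code : -1;
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(decodeNumericEntities("Let's go to the caf&#xE9;"));
    }
}
```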
try this:
if (result.equals("é")) {
sb += (char) 233; // U+00E9; 130 is the code for this character in the old IBM PC code page, not in Unicode
}
instead of
if (result.equals("é")) {
sb += "e";
}
The thing is that you're not adding an accent to the top of the 'e' character, but rather that is a separate character all together. This site lists out the ascii codes for characters.
For a table of accented characters in Java take a look at this reference.
To decode the html part, use Apache StringEscapeUtils from Apache commons lang:
import org.apache.commons.lang.StringEscapeUtils;
...
String withCharacters = StringEscapeUtils.unescapeHtml(yourString);
See also this Stack Overflow thread:
Replace HTML codes with equivalent characters in Java
Using Java, I want to go through the lines of a text and replace all ampersand symbols (&) with the XML entity reference &amp;.
I scan the lines of the text and then each word in the text with the Scanner class. Then I use the CharacterIterator to iterate over each character of the word. However, how can I replace the character? First, Strings are immutable objects. Second, I want to replace a character (&) with several characters (&amp;). How should I approach this?
CharacterIterator it = new StringCharacterIterator(token);
for(char ch = it.first(); ch != CharacterIterator.DONE; ch = it.next()) {
if(ch == '&') {
}
}
Try using String.replace() or String.replaceAll() instead.
String my_new_str = my_str.replace("&", "&amp;");
(Both replace all occurrences; replaceAll allows use of regex.)
The simple answer is:
token = token.replace("&", "&amp;");
Despite the name as compared to replaceAll, replace does do a replaceAll, it just doesn't use a regular expression, which seems to be in order here (both from a performance and a good practice perspective - don't use regular expressions by accident as they have special character requirements which you won't be paying attention to).
If you already know this code is a performance hot spot, and that is where your question is coming from, Sean Bright's answer is probably as good as is worth considering without a concrete performance target and some performance testing. It certainly doesn't deserve the downvotes. Just use StringBuilder instead of StringBuffer unless you need the synchronization.
That being said, there is a somewhat deeper potential problem here. Escaping characters is a known problem which lots of libraries out there address. You may want to consider wrapping the data in a CDATA section in the XML, or you may prefer to use an XML library (including the one that comes with the JDK now) to actually generate the XML properly (so that it will handle the encoding).
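As one illustration of letting an XML library do the escaping, the StAX writer that ships with the JDK escapes text passed to writeCharacters. This is a sketch only: the class name, the helper method and the element name are made up, and wrapping the checked exception is just a convenience for the example.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class XmlWriterDemo {
    // Writes <name>text</name>, leaving the escaping of the text to the writer.
    static String element(String name, String text) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartElement(name);
            writer.writeCharacters(text); // '&' and '<' are escaped here
            writer.writeEndElement();
            writer.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(element("note", "Tom & Jerry <3"));
    }
}
```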
Apache also has an escaping library as part of Commons Lang.
StringBuilder s = new StringBuilder(token.length());
CharacterIterator it = new StringCharacterIterator(token);
for (char ch = it.first(); ch != CharacterIterator.DONE; ch = it.next()) {
    switch (ch) {
        case '&':
            s.append("&amp;");
            break;
        case '<':
            s.append("&lt;");
            break;
        case '>':
            s.append("&gt;");
            break;
        default:
            s.append(ch);
            break;
    }
}
token = s.toString();
You may also want to check that you're not replacing an occurrence that has already been replaced. You can use a regular expression with negative lookahead to do this.
For example:
String str = "sdasdasa&amp;adas&dasdasa";
str = str.replaceAll("&(?!amp;)", "&amp;");
This would result in the string "sdasdasa&amp;adas&amp;dasdasa".
The regex pattern "&(?!amp;)" basically says: Match any occurrence of '&' that is not followed by 'amp;'.
Just create a string that contains all of the data in question and then use String.replaceAll() like below.
String result = yourString.replaceAll("&", "&amp;");
You can use stream and flatMap to map & to &amp;
String str = "begin&end";
String newString = str.chars()
.flatMap(ch -> (ch == '&') ? "&amp;".chars() : IntStream.of(ch))
.collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
.toString();
Escaping strings can be tricky - especially if you want to take unicode into account. I suppose XML is one of the simpler formats/languages to escape but still. I would recommend taking a look at the StringEscapeUtils class in Apache Commons Lang, and its handy escapeXml method.
Try this code. You can replace any character with another given character.
Here I replace the letter 'a' with the "_" character in the given string "abcdefaa".
Output --> _bcdef__
public class Replace {
public static void replaceChar(String str,String target){
String result = str.replaceAll(target, "_");
System.out.println(result);
}
public static void main(String[] args) {
replaceChar("abcdefaa","a");
}
}
If you're using Spring you can simply call HtmlUtils.htmlEscape(String input) which will handle the '&' to '&amp;' translation.
//I think this will work, you don't have to replace on the even, it's just an example.
public void emphasize(String phrase, char ch)
{
char phraseArray[] = phrase.toCharArray();
for(int i=0; i< phrase.length(); i++)
{
if(i%2==0)// even number
{
String value = Character.toString(phraseArray[i]);
value = value.replace(value,"*");
phraseArray[i] = value.charAt(0);
}
}
}
String taskLatLng = task.getTask_latlng().replaceAll( "\\(","").replaceAll("\\)","").replaceAll("lat/lng:", "").trim();
I need to remove commas within a String only when enclosed by quotes.
example:
String a = "123, \"Anders, Jr.\", John, john.anders#company.com,A"
after replacement should be
String a = "123, Anders Jr., John, john.anders#company.com,A"
Can you please give me sample java code to do this?
Thanks much,
Lina
It also seems you need to remove the quotes, judging by your example.
You can't do that in a single regexp. You would need to match over each instance of
"[^"]*"
then strip the surrounding quotes and replace the commas. Are there any other characters which are troublesome? Can quote characters be escaped inside quotes, eg. as ‘""’?
It looks like you are trying to parse CSV. If so, regex is insufficient for the task and you should look at one of the many free Java CSV parsers.
I believe you asked for a regex trying to get an "elegant" solution, nevertheless maybe a "normal" answer is better fitted to your needs... this one gets your example perfectly, although I didn't check for border cases like two quotes together, so if you're going to use my example, check it thoroughly
boolean deleteCommas = false;
for(int i=0; i > a.length(); i++){
if(a.charAt(i)=='\"'){
a = a.substring(0, i) + a.substring(i+1, a.length());
deleteCommas = !deleteCommas;
}
if(a.charAt(i)==','&&deleteCommas){
a = a.substring(0, i) + a.substring(i+1, a.length());
}
}
There are two major problems with the accepted answer. First, the regex "(.*)\"(.*),(.*)\"(.*)" will match the whole string if it matches anything, so it will remove at most one comma and two quotation marks.
Second, there's nothing to ensure that the comma and quotes will all be part of the same field; given the input ("foo", "bar") it will return ("foo "bar). It also doesn't account for newlines or escaped quotation marks, both of which are permitted in quoted fields.
You can use regexes to parse CSV data, but it's much trickier than most people expect. But why bother fighting with it when, as bobince pointed out, there are several free CSV libraries out there for the downloading?
Should work:
s/(?<="[^"]*),(?=[^"]*")//g
s/"//g
This looks like a line from a CSV file, parsing it through any reasonable CSV library would automatically deal with this issue for you. At least by reading the quoted value into a single 'field'.
Probably grossly inefficient but it seems to work.
import java.util.regex.*;
StringBuffer ResultString = new StringBuffer();
try {
Pattern regex = Pattern.compile("(.*)\"(.*),(.*)\"(.*)", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
Matcher regexMatcher = regex.matcher(a);
while (regexMatcher.find()) {
try {
// You can vary the replacement text for each match on-the-fly
regexMatcher.appendReplacement(ResultString, "$1$2$3$4");
} catch (IllegalStateException ex) {
// appendReplacement() called without a prior successful call to find()
} catch (IllegalArgumentException ex) {
// Syntax error in the replacement text (unescaped $ signs?)
} catch (IndexOutOfBoundsException ex) {
// Non-existent backreference used the replacement text
}
}
regexMatcher.appendTail(ResultString);
} catch (PatternSyntaxException ex) {
// Syntax error in the regular expression
}
This works fine. '<' instead of '>'
boolean deleteCommas = false;
for(int i=0; i < text.length(); i++){
if(text.charAt(i)=='\''){
text = text.substring(0, i) + text.substring(i+1, text.length());
deleteCommas = !deleteCommas;
}
if(text.charAt(i)==','&&deleteCommas){
text = text.substring(0, i) + text.substring(i+1, text.length());
}
}
A simpler approach would be replacing the matches of this regular expression:
("[^",]+),([^"]+")
By this:
$1$2
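Applied in Java, that replacement could look like the sketch below. Two things here are additions rather than part of the suggestion above: the loop, which repeats the replacement because a single pass removes at most one comma per quoted field, and the final step that strips the quotes to match the expected output in the question. The pattern has no notion of where a field starts, so it can misfire on lines where unquoted text with a comma sits between two quoted fields.

```java
class QuotedCommaRemover {
    static String removeQuotedCommas(String line) {
        String previous;
        String current = line;
        do {
            previous = current;
            // Drop one comma that sits between an opening and a closing quote.
            current = previous.replaceAll("(\"[^\",]+),([^\"]+\")", "$1$2");
        } while (!current.equals(previous));
        // The expected output in the question has no quotes, so remove them last.
        return current.replace("\"", "");
    }

    public static void main(String[] args) {
        String a = "123, \"Anders, Jr.\", John, john.anders#company.com,A";
        System.out.println(removeQuotedCommas(a));
    }
}
```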
The following perl works for most cases:
open(DATA,'in/my.csv');
while(<DATA>){
if(/(,\s*|^)"[^"]*,[^"]*"(\s*,|$)/){
print "Before: $_";
while(/(,\s*|^)"[^"]*,[^"]*"(\s*,|$)/){
s/((?:^|,\s*)"[^"]*),([^"]*"(?:\s*,|$))/$1 $2/
}
print "After: $_";
}
}
It's looking for:
(comma plus optional spaces) or start of line
a quote
0 or more non-quotes
a comma
0 or more non-quotes
(optional spaces plus comma) or end of line
If found, it will then keep replacing the comma with a space until it can find no more examples.
It works because of an assumption that the opening quote will be preceded by a comma plus optional spaces (or will be at the start of the line), and the closing quote will be followed by optional spaces plus a comma, or will be the end of the line.
I'm sure there are cases where it will fail - if anyone can post 'em, I'd be keen to see them...
My answer is not a regex, but I believe it is simpler and more efficient. Change the line to a char array, then go through each char. Keep track of even or odd quote amounts. If odd amount of quotes and you have a comma, then don't add it. Should look something like this.
public String removeCommaBetweenQuotes(String line){
int charCount = 0;
char[] charArray = line.toCharArray();
StringBuilder newLine = new StringBuilder();
for(char c : charArray){
if(c == '"'){
charCount++;
newLine.append(c);
}
else if(charCount%2 == 1 && c == ','){
//do nothing
}
else{
newLine.append(c);
}
}
return newLine.toString();
}