Java write escaped characters into file - java

I'm writing to a file and I need to escape some characters like a quotation mark.
File fout = new File("output.txt");
try (FileOutputStream fos = new FileOutputStream(fout); BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(fos));) {
String insert = "quote's";
String s = "";
s += "'"+insert.replaceAll("'", "\\\'")+"'";
bw.write(s.replaceAll("\r\n", "\\\r\\\n"));
bw.newLine();
}
I'm trying to write 'quote\'s' to the file, but it keeps removing the backslash and producing 'quote's'.
I also want to write newlines into the file as escaped characters, i.e. instead of inserting a newline in the file I want to write the text \r\n.
Is this possible? I feel like I'm missing or forgetting something.

replaceAll() works with regex and accepts a special replacement syntax:
Note that backslashes (\) and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string
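In the replacement string a backslash escapes the character after it, so the backslash in your replacement is consumed and only the ' is written. You're not using a regex, so you can use the plain-text replace() instead, which treats both arguments literally. And you only need 2 backslashes at a time: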
s += "'"+insert.replace("'", "\\'")+"'";
bw.write(s.replace("\r\n", "\\r\\n"));
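A minimal, self-contained sketch of the difference. The class name EscapeDemo and its two helper methods are hypothetical and only serve the illustration; the file writing from the question is left out so that the string handling is easy to see.

```java
class EscapeDemo {
    // Wraps the input in single quotes and escapes embedded single quotes
    // with a backslash, using the literal (non-regex) replace().
    static String quote(String insert) {
        return "'" + insert.replace("'", "\\'") + "'";
    }

    // Turns each real CR LF pair into the four visible characters \r\n.
    static String escapeNewlines(String s) {
        return s.replace("\r\n", "\\r\\n");
    }

    public static void main(String[] args) {
        System.out.println(quote("quote's"));                         // 'quote\'s'
        System.out.println(escapeNewlines("line one\r\nline two"));   // line one\r\nline two
        // For comparison: replaceAll() treats the backslash in the
        // replacement as an escape character and drops it.
        System.out.println("quote's".replaceAll("'", "\\\'"));        // quote's
    }
}
```

The strings returned by these helpers can be passed to bw.write() unchanged.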

Related

How to skip invalid double quote character line in csv file using java?

I have a CSV file containing 78400 lines (25 MB).
When I read the CSV file line by line, one column in the 2nd line has an error.
It contains a backslash character.
When I read this column, all the remaining columns in the CSV file are read as a single column.
"CDE","456","6346","testdata2","MyData2","ClassB"
"ABC","123","4567\","testdata","MyData","ClassA"
"CDE","456","6346","testdata2","MyData2","ClassB"
How can I skip that line, using the line separator, in Java?
You can write a method that splits the line into fields and then checks each field for the \ character:
String line = br.readLine();
String[] words = line.split(",");
boolean escape = false;
for (String word : words) {
if (word.indexOf('\\') >= 0) escape = true; // this field contains a backslash
}
Once you have identified the escape character, you can handle that line specially, for example by skipping it.
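A simplified variation of that check as a runnable sketch, using only the standard library. The class BadLineFilter and the rule it applies (a backslash directly before a double quote marks a bad line) are assumptions made for this illustration; real data may need a stricter or looser rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BadLineFilter {
    // Assumed rule: a line is suspect if a backslash sits right before a double quote.
    static boolean hasEscapedQuote(String line) {
        return line.contains("\\\"");
    }

    // Returns only the lines that do not trip the rule above.
    static List<String> keepValid(List<String> lines) {
        List<String> valid = new ArrayList<>();
        for (String line : lines) {
            if (!hasEscapedQuote(line)) {
                valid.add(line);
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        // The three sample lines from the question; the second one is the bad one.
        List<String> lines = Arrays.asList(
            "\"CDE\",\"456\",\"6346\",\"testdata2\",\"MyData2\",\"ClassB\"",
            "\"ABC\",\"123\",\"4567\\\",\"testdata\",\"MyData\",\"ClassA\"",
            "\"CDE\",\"456\",\"6346\",\"testdata2\",\"MyData2\",\"ClassB\"");
        keepValid(lines).forEach(System.out::println);
    }
}
```

When reading from a file, the same hasEscapedQuote check can be applied to each line returned by readLine() before the line is handed to the CSV parser.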
If you are using openCSV then just define your parser with an escape character other than backslash. If you don't want an escape character you can use the ICSVParser.NULL_CHARACTER or if you are using the 3.9 version of openCSV you can use the RFC4180Parser.
RFC4180ParserBuilder rfc4180ParserBuilder = new RFC4180ParserBuilder();
ICSVParser rfc4180Parser = rfc4180ParserBuilder.build();
CSVReaderBuilder builder = new CSVReaderBuilder(sr);
CSVReader reader = builder.withCSVParser(rfc4180Parser).build();

Can't print newline character (\n) in strings returned by String.split in Java

I am writing a Java program in which a tab separated values (TSV) file containing two columns of information is read by a BufferedReader and then split into two components (which will serve as [key,value] pairs in a HashMap later in the program) using String.split("\t"). Let's say the first line of the TSV file is as follows:
Key1\tHello world\nProgramming is cool\nGoodbye
The code shown below would separate this line into "Key1" and "Hello world\nProgramming is cool\nGoodbye":
File file = new File("sample.tsv");
BufferedReader br = new BufferedReader(new FileReader(file));
String s = br.readLine();
String[] tokens = new String[2];
tokens = s.split("\t");
The problem now comes in trying to print the second string (i.e. tokens[1]).
System.out.println(tokens[1]);
The line of code above results in the second string being printed with the newline characters (\n) being ignored. In other words, this is printed...
Hello world\nProgramming is cool\nGoodbye
...instead of this...
Hello world
Programming is cool
Goodbye
If I create a new string with the same text as above and use the String.equals() method to compare the two, it returns false.
String str = "Hello world\nProgramming is cool\nGoodbye";
boolean sameString = str.equals(tokens[1]); // false
Why can't special characters in the strings returned by String.split() be printed properly?
BufferedReader.readLine() read your string as one line, as that's how it's represented in the file. The BufferedReader didn't read "\n" as the single character ASCII(10) 0x0A; it read the two characters ASCII(92) 0x5C and ASCII(110) 0x6E.
If you type the input file the way you expect to see it with your text editor, it will print the way you expect.
On a Unix-like system:
echo -e "Hello world\nProgramming is cool\nGoodbye" > InputFile.result_you_want
echo "Hello world\nProgramming is cool\nGoodbye" > InputFile.result_you_get
You could use a program like echo to convert your TSV, but then you will need to split on the actual tab character, ASCII(9) 0x09, and not on a literal "\t".
split() takes a regular expression, so escaping the tab character needs some care.
Both "\t" (a tab character in the Java string) and "\\t" (the regex escape for a tab) should do the trick there.
If this is for work, you may want to use a tool or library to avoid having to convert your file with echo.
String parsing in Java with delimeter tab "\t" using split has some suggestions there.
Searching for Java CSV APIs could be very useful. Most will let you set the delimiter character and the line-ending format.
Because, to the computer, the two-character text '\n' is not the same as the single newline character '\n'.
I think the first line of your file looks like key1, a tab, and then Hello world\nProgramming\ncool.
So split() can split on the tab, but when it comes to printing, it only shows the text
'\n' and not the actual newline character that would start a new line.
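If the file really contains the two characters \ and n, they can be converted into real line feeds after reading. A minimal sketch, assuming the columns are separated by a real tab character as in the question; the class LiteralNewlines and its helper method are hypothetical names.

```java
class LiteralNewlines {
    // Replaces the two-character sequence backslash + 'n' with a real line feed.
    static String unescapeNewlines(String s) {
        return s.replace("\\n", "\n");
    }

    public static void main(String[] args) {
        // What readLine() returns for the sample line:
        // a real tab between the columns, but literal backslash-n text in the value.
        String s = "Key1\tHello world\\nProgramming is cool\\nGoodbye";
        String[] tokens = s.split("\t");
        // Prints the value on three separate lines.
        System.out.println(unescapeNewlines(tokens[1]));
    }
}
```

After the conversion, tokens[1] compares equal to the string literal "Hello world\nProgramming is cool\nGoodbye".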

Unicode Replacement with ASCII

I have created a text file on a Windows system, where I think the default encoding is ANSI, and the contents of the file look like this:
This is\u2019 a sample text file \u2014and it can ....
I saved this file using the default Windows encoding, though other encodings such as UTF-8 and UTF-16 were also available.
Now I want to write a simple Java function to which I pass an input string and which replaces all of the Unicode escapes with the corresponding ASCII characters.
e.g :- \u2019 should be replaced with "'"
\u2014 should be replaced with "-" and so on.
Observation :
When I create a string literal like this
String s = "This is\u2019 a sample text file \u2014and it can ....";
my code works fine, but when I read the text from the file it does not work. I am aware that a Java String uses UTF-16 encoding.
Below is the code that I am using to read the input file.
FileReader fileReader = new FileReader(new File("C:\\input.txt"));
BufferedReader bufferedReader = new BufferedReader(fileReader);
String record = bufferedReader.readLine();
I also tried using the InputStream and setting the Charset to UTF-8 , but still the same result.
Replacement code :
public static String removeUTFCharacters(String data){
for(Entry<String,String> entry : utfChars.entrySet()){
data=data.replaceAll(entry.getKey(), entry.getValue());
}
return data;
}
Map :
utfChars.put("\u2019","'");
utfChars.put("\u2018","'");
utfChars.put("\u201c","\"");
utfChars.put("\u201d","\"");
utfChars.put("\u2013","-");
utfChars.put("\u2014","-");
utfChars.put("\u2212","-");
utfChars.put("\u2022","*");
Can anybody help me understand the concept and the solution to this problem?
Match the escape sequence \uXXXX with a regular expression. Then use a replacement loop to replace each occurrence of that escape sequence with the decoded value of the character.
Because Java string literals use \ to introduce escapes, the sequence \\ is used to represent \. Also, the Java regex syntax treats the sequence \u specially (to represent a Unicode escape). So the \ has to be escaped again, with an additional \\. So, in the pattern, "\\\\u" really means, "match \u in the input."
To match the numeric portion, four hexadecimal characters, use the pattern \p{XDigit}, escaping the \ with an extra \. We want to easily extract the hex number as a group, so it is enclosed in parentheses to create a capturing group. Thus, "(\\p{XDigit}{4})" in the pattern means, "match 4 hexadecimal characters in the input, and capture them."
In a loop, we search for occurrences of the pattern, replacing each occurrence with the decoded character value. The character value is decoded by parsing the hexadecimal number. Integer.parseInt(m.group(1), 16) means, "parse the group captured in the previous match as a base-16 number." Then a replacement string is created with that character. The replacement string must be escaped, or quoted, in case it is $, which has special meaning in replacement text.
String data = "This is\\u2019 a sample text file \\u2014and it can ...";
Pattern p = Pattern.compile("\\\\u(\\p{XDigit}{4})");
Matcher m = p.matcher(data);
StringBuffer buf = new StringBuffer(data.length());
while (m.find()) {
String ch = String.valueOf((char) Integer.parseInt(m.group(1), 16));
m.appendReplacement(buf, Matcher.quoteReplacement(ch));
}
m.appendTail(buf);
System.out.println(buf);
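The loop above can be combined with the ASCII substitutions from the question. The following is a sketch under assumptions: the class UnicodeToAscii and its decode method are hypothetical, the map holds only a few of the question's entries, and any character without an entry is kept as its decoded character.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeToAscii {
    // Matches a backslash, the letter u, and four captured hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u(\\p{XDigit}{4})");

    // A subset of the substitutions listed in the question.
    private static final Map<Character, String> ASCII = new HashMap<>();
    static {
        ASCII.put('\u2019', "'");
        ASCII.put('\u2018', "'");
        ASCII.put('\u201c', "\"");
        ASCII.put('\u201d', "\"");
        ASCII.put('\u2013', "-");
        ASCII.put('\u2014', "-");
    }

    static String decode(String data) {
        Matcher m = ESCAPE.matcher(data);
        StringBuffer buf = new StringBuffer(data.length());
        while (m.find()) {
            char ch = (char) Integer.parseInt(m.group(1), 16);
            // Use the ASCII stand-in when one is mapped, otherwise the decoded character.
            String replacement = ASCII.getOrDefault(ch, String.valueOf(ch));
            m.appendReplacement(buf, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(buf);
        return buf.toString();
    }

    public static void main(String[] args) {
        // Doubled backslashes: the string holds the escape text, as a line read from the file would.
        System.out.println(decode("This is\\u2019 a sample text file \\u2014and it can ..."));
    }
}
```

Looking the character up by its decoded value avoids running one replaceAll() per map entry over the whole string.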
If you can use another library, Apache Commons Text provides StringEscapeUtils:
https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html
String dirtyString = "Colocaci\\u00F3n"; // doubled backslash, so the string really contains the text \u00F3
String cleanString = StringEscapeUtils.unescapeJava(dirtyString);
//cleanString = "Colocación"

What are \xHEX characters and is there a table for them?

When reading a textfile, I read these characters, when printed out to console it outputs blanks or �:
['\x80', '\xc3', '\x94', '\x99', '\x98','\x9d', '\x9c', '\xa9', '\xa6', '\xe2']
What are these \xHEX characters? Is there a link to the table to lookup these characters?
SOLVED:
It's not an ASCII text file; it is a Unicode (UTF-8) file. That was why I was unable to get the correct characters.
For Java:
import java.io.*;
File infile = new File("/home/foo/bar.txt");
BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(infile), "UTF-8"));
String str; // holds the current line
while ((str = in.readLine()) != null) {
System.out.println(str);
}
If System.out.println prints the characters incorrectly, try:
PrintStream out = new PrintStream(System.out, true, "UTF-8");
out.println(str);
For Python, simply:
import codecs
infile = '/home/foo/bar.txt'
reader = codecs.open(infile, 'r', 'utf8')
for line in reader:
    print(line)
Here is a link to all unicode characters:
http://en.wikipedia.org/wiki/List_of_Unicode_characters
Also, if you are using Eclipse, make sure your project "Text File Encoding" is set to UTF-8.
Project->properties->resources->Text File Encoding.
I had similar problem with cyrillic characters :)
I may suggest that your text file is not really a "text file". The first two bytes form the unicode 'À' character. The rest, I guess, are non-printable characters. It seems that your file has a raw sequence of bytes, that don't have to be characters.
You've got a table here.
Please note that Java represents characters in a Unicode format (\u...). It is possible to display the number '80', but not the character representation of '\x80', on the console.
For a list, please refer to ascii characters list, like this one
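To see why the charset matters, here is a small sketch that decodes the same bytes in two ways. It assumes that the bytes \xe2, \x80 and \x99 from the question's list appeared next to each other in the file, which the question does not confirm; the class Utf8Bytes and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

class Utf8Bytes {
    // Interprets the bytes as UTF-8, where several bytes can form one character.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Interprets every byte as its own character (ISO-8859-1).
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] quote = {(byte) 0xE2, (byte) 0x80, (byte) 0x99};
        // As UTF-8 the three bytes form one character, U+2019 (right single quotation mark).
        System.out.println(decodeUtf8(quote).length());   // 1
        // As a single-byte charset they become three unrelated characters.
        System.out.println(decodeLatin1(quote).length()); // 3
    }
}
```

A reader that is given the wrong charset produces the second kind of result, which shows up on the console as blanks or replacement characters.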

Reading a text file with a Scanner in Java - Token's return character

I'm trying to read the text file below with a java.util.Scanner in a simple Java program.
0001;GUAJARA-MIRIM;RO
0002;ALTO ALEGRE DOS PARECIS;RO
0003;PORTO VELHO;RO
I read the text file using the code below:
scanner = new Scanner(filerader).useDelimiter("\\;|\\n");
while (scanner.hasNext()) {
int id= scanner.nextInt();
String name = scanner.next();
String code = scanner.next();
System.out.printf(".%s.%s.%d.\n", name, code, id);
}
The results are:
.GUAJARA-MIRIM.RO.1
.
.ALTO ALEGRE DOS PARECIS.RO.2
.
.PORTO VELHO.RO.3
.
But the third token of each line has an inconvenient '\r' character at the end (ASCII code 13). I have no idea why (I used the '.' character in the format string to make it clear where the '\r' is).
So,
Why is there a '\r' at the end of the third token?
How can I avoid it?
It is very simple to use a workaround like code.substring(0, 2), but instead I want to understand why there's a '\r' character there.
On some operating systems (notably Windows), \r\n is used as the line terminator. You are using only \n as a delimiter, so the \r is left in the token. Add \r to your delimiters as well.
To make your code a little more robust, use System.lineSeparator() to get the platform's line terminator and build the delimiters from it.
You are using a Windows file, which uses \r\n as line delimiters (aka Carriage Return Line Feed). Unix uses only \n (Line Feed).
To fix this, add \r to your scanner delimiter.
The reason why it happens has already been given. Another way to avoid it is to use scanner.nextLine() and then split the line on ; .
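A minimal sketch of the delimiter fix, reading from a string instead of a file so that it is self-contained. The class CityScanner and the parse method are hypothetical names; the pattern ";|\\r?\\n" accepts an optional \r before each \n.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class CityScanner {
    static List<String> parse(String text) {
        List<String> rows = new ArrayList<>();
        // ';' separates fields; "\r?\n" accepts both Windows and Unix line endings.
        try (Scanner scanner = new Scanner(text).useDelimiter(";|\\r?\\n")) {
            while (scanner.hasNext()) {
                int id = scanner.nextInt();
                String name = scanner.next();
                String code = scanner.next();
                rows.add(String.format(".%s.%s.%d.", name, code, id));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String windowsFile = "0001;GUAJARA-MIRIM;RO\r\n"
                + "0002;ALTO ALEGRE DOS PARECIS;RO\r\n"
                + "0003;PORTO VELHO;RO\r\n";
        parse(windowsFile).forEach(System.out::println);
    }
}
```

Because the \r is optional in the pattern, the same code reads files with Unix line endings unchanged.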
