How to clean a csv file from weird characters (e.g. SUB)? - java

I am uploading CSV files to Teradata using JDBC. Everything used to be fine, until recently I came across a CSV file that had some weird characters in it and my code failed to upload it.
I opened the CSV file in Notepad++ and the character looks like this: SUB. When I open it in Excel it looks like this: ->->
When I manually deleted those characters, everything went back to normal. I am curious: is there any way I could use Java to clean a CSV file and remove all kinds of invalid characters?

The SUB character is ASCII 26 (hex 0x1A). Back when DEC-10s ruled the earth, this was called Ctrl-Z. It is used to indicate the end of a file.
If it is indeed at the end of the file, and you read the file in using a Java InputStream (have a look at Read/convert an InputStream to a String), it will take off that terminal Ctrl-Z.
It would be quite unusual (and a problem) to have the SUB inside the CSV data, unless it were representing a binary object.
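If the stream does not drop it for you, here is a minimal sketch (the file names are hypothetical) that reads the CSV as raw bytes and strips a trailing SUB before any further processing:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class StripTrailingSub {
    public static void main(String[] args) throws IOException {
        // Hypothetical paths; point these at your actual CSV files.
        byte[] bytes = Files.readAllBytes(Paths.get("input.csv"));

        // Drop a single trailing SUB (ASCII 26 / 0x1A) if present.
        int end = bytes.length;
        if (end > 0 && bytes[end - 1] == 0x1A) {
            end--;
        }
        Files.write(Paths.get("cleaned.csv"), Arrays.copyOf(bytes, end));
    }
}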

You can try:
myString = myString.replaceAll("\\p{C}", "?");
If you want to remove the characters instead of replacing them:
myString = myString.replaceAll("\\p{C}", "");
Note that replaceAll returns a new string; it does not modify myString in place.
More here:
How can I replace non-printable Unicode characters in Java?
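Putting that together, a minimal sketch (file names are hypothetical, and the charset is an assumption; use whatever encoding your CSV is actually in) that rewrites a CSV with all control/invisible characters removed before uploading it:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class CleanCsv {
    public static void main(String[] args) throws IOException {
        List<String> cleaned = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get("input.csv"), StandardCharsets.UTF_8)) {
            // \p{C} matches the Unicode "other" categories: control, format,
            // unassigned, private use and surrogate code points.
            cleaned.add(line.replaceAll("\\p{C}", ""));
        }
        Files.write(Paths.get("cleaned.csv"), cleaned, StandardCharsets.UTF_8);
    }
}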

Related

Java Could not recognize the cedilla character in text

I have a small piece of code which needs to convert a cedilla-delimited file into a comma-separated file.
It worked fine for the normal test cases. But when it went to real time, with the file provided in a Unix environment, it could not recognize the cedilla character in the text and failed to convert the file to a proper CSV.
Could you please help me out if anyone has faced this issue?
I need to pass the delimiter from the command line arguments.
Sorry if the question is in an improper format, but I didn't receive any help elsewhere, so I posted it on Stack Overflow.
Sample Code:
line = line.replace(Character.toLowerCase(context.getConfiguration().get("input.delimiter").charAt(0)), ',');
line = line.replace(Character.toUpperCase(context.getConfiguration().get("input.delimiter").charAt(0)), ',');
It seems that the problem you are facing is that in the real environment the default character encoding differs from the one on your local development machine. When processing the CSV file you should specify which encoding the file uses.
You should check How to read a file in Java with specific character encoding?
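For example, a minimal sketch (the path, charset and delimiter value are assumptions; use the encoding the file is actually written in and the delimiter you receive on the command line) that reads the file with an explicit encoding instead of the platform default:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class CedillaToComma {
    public static void main(String[] args) throws IOException {
        char delimiter = '\u00E7'; // assumption: the cedilla delimiter ('ç') normally passed as an argument
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream("input.txt"), StandardCharsets.ISO_8859_1))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // With the encoding fixed, the delimiter character is read correctly
                // and a plain replace is enough.
                System.out.println(line.replace(delimiter, ','));
            }
        }
    }
}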

Java unexpected character parsing txt file

I am trying to divide txt files into an ArrayList of strings, and so far it works, but the first word in the file always starts with (int) 65279 and I can't even copy that character here. Also, in the GUI it looks like the second letter of the word is missing, yet at the same time it works in the console. The other words are as they should be. I am using UTF-8 format .txt files. How can I change the format in NetBeans and in the GUI made with this IDE?
U+FEFF is the byte order mark. It's used to indicate the character encoding/endianness (so you can easily tell the difference between big- and little-endian UTF-16, for example).
If it's causing you a problem, the simplest thing is just to strip it:
if (text.startsWith("\ufeff")) {
    text = text.substring(1);
}
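In context, a minimal sketch (the file name is hypothetical) that reads a UTF-8 file into a list of words and strips the BOM from the first line, since Java leaves it in place:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ReadWithoutBom {
    public static void main(String[] args) throws IOException {
        List<String> words = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream("words.txt"), StandardCharsets.UTF_8))) {
            String line;
            boolean firstLine = true;
            while ((line = reader.readLine()) != null) {
                if (firstLine && line.startsWith("\uFEFF")) {
                    line = line.substring(1); // drop the byte order mark
                }
                firstLine = false;
                for (String word : line.split("\\s+")) {
                    if (!word.isEmpty()) {
                        words.add(word);
                    }
                }
            }
        }
        System.out.println(words);
    }
}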

Symbols '' are showing when reading from text file

Using the same project and text file as here: Java.NullPointerException null (again), the program is outputting the data, but with . To put you in the picture:
This program is a telephone directory. Ignoring the first "code" block, look at the second "code" block at that link; that is the text file with the entries. The program outputs them as it should, but it prints  at the beginning of the entries read from the text file ONLY.
Any help as to how to remove it? I am using a BufferedReader with a FileReader in it.
Encoding of Text File: UTF-8
Using Java 7
Windows 7
Does the text file being read use UTF-8 with a BOM? It looks like the BOM bytes: ""
http://en.wikipedia.org/wiki/Byte_order_mark
Are you running Windows? Notepad++ should be able to convert the file. If you are using Linux and vi(m) you can use ":set nobomb".
I suppose your input file is encoded in UTF-8 with a BOM.
You can either save your input file without a BOM, or handle this in Java.
The thing one might want to do here is to use an InputStreamReader with the appropriate encoding. Sadly, that alone does not solve it: Java assumes that a UTF-8 encoded file has no BOM, so you have to handle that case manually.
A quick hack would be to check if the first three bytes of your file are 0xEF, 0xBB, 0xBF, and if they are, ignore them.
For a more sophisticated example, have a look at the UnicodeBOMInputStream class in this answer.
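A sketch of that quick hack (the file name is hypothetical), using a PushbackInputStream so the three bytes are consumed only when they really are the UTF-8 BOM:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

public class SkipUtf8Bom {
    public static void main(String[] args) throws IOException {
        PushbackInputStream in = new PushbackInputStream(new FileInputStream("phonebook.txt"), 3);
        byte[] head = new byte[3];
        int n = in.read(head, 0, 3);
        boolean isBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!isBom && n > 0) {
            in.unread(head, 0, n); // not a BOM: push the bytes back and read them normally
        }
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}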

Get filename as UTF-8? (ä,ü,ö ... is always '?')

I have to read the names of some files and put them in a list as strings. It's not so hard, I just have some problems with characters like ä, ö, ü ... they always end up as '?' in my string.
What's the problem? Well, the encoding. OK, this should be easy... that's what I thought. So I tried to use calls like:
new String(insert.getBytes("UTF-8"))
or
new String(insert.getBytes("ISO-8859-1"), "UTF-8")
because most of the files are ISO-8859-1.
It's not helping. This is my code:
...
File[] fileList = dir.listFiles();
String insert;
for (File f : fileList) {
    ...
    insert = f.getName().substring(0, f.getName().length() - 4);
    insert = insert.charAt(0) + insert.substring(1, insert.length()).toLowerCase().replaceFirst("([0-9]*(_s?(i)?(_dat)?)*$)", "").replaceFirst("_", " ");
    ...
    System.out.println("test UTF8: " + new String(insert.getBytes("UTF-8"))); // not helping
    System.out.println("test ISO, UTF8: " + new String(insert.getBytes("ISO-8859-1"), "UTF-8")); // not helping
    ...
    names.add(insert);
}
At the end there are a lot of strings with '?' characters in my list.
How do I fix the problem? And what's the best way if there are not only ISO-8859-1 files? (Let's say there are a lot of files with unknown encodings.)
Thank You!
Given the extended comments back and forth under the question, it now looks like this is either a font problem or (perhaps more likely) a filename encoding problem.
I asked Lissy to run the following commands to let us figure out what the problem is. If she is sure that the filenames contain "ä", but that character does not appear when she runs ls on them, then this will tell us whether this is a font problem or an encoding problem.
touch filenäme
ls filen*me
If this shows "filenäme" in the output of ls, then we know the problem lies with the creation/copy of the files onto this system. This could happen if the program which created the files didn't realize what the filesystem encoding was, or was too stupid to do the right thing. The convmv program is probably the best way to fix this.
convmv -f ENCODING -t utf8 -r .
The question is what the proper encoding is. Possibilities include UTF-16, cp850, or perhaps iso8859-1. convmv --list will show you the list of encodings currently known to your system. Since the command listed above only shows you what it would do, it is safe to run it several times with different encodings until you find one which works for all the files.
If this is a font problem, we'll have to look into that.
Unexpected question marks, splats, etc. in a String are a sign that something somewhere doesn't recognize a particular character when converting from one character set to another.
In your case, the problem could be occurring in a couple of places:
It could be occurring when your Java program is reading the file names from the directory (in the dir.listFiles() call).
It could be happening when you print the characters to the console stream.
In either case, the root cause is most likely a mismatch between what Java thinks the locale settings should be and the settings that the operating system and/or command shell are using.
As an experiment, try listing a directory containing the problematic file names from the command line. Do you see question marks or other splats there?
A second experiment is to modify your Java program to dump one of the problem Strings as a sequence of numbers representing the character code of each character. Do you see the character code for an ASCII/Unicode '?'?
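For that second experiment, a minimal sketch (the string is just a placeholder for one of the problem file names) that dumps each character as its numeric code:

public class DumpCharCodes {
    public static void main(String[] args) {
        String insert = "Übung_1"; // placeholder for one of the problematic names
        for (char c : insert.toCharArray()) {
            // A real 'ä' prints as U+00E4, a literal '?' as U+003F,
            // and the Unicode replacement character as U+FFFD.
            System.out.printf("U+%04X %c%n", (int) c, c);
        }
    }
}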
The encoding of the content of the file has nothing to do with the encoding of the file name itself.
You should get correct results from System.out.println(insert).
If you don't, it means that the shell has a different character encoding than the default character encoding for your system (this rarely happens; it would usually be the result of an explicit command to switch encodings in the shell).
If the file names are displayed correctly when you list the directory in the shell, I would expect them to be displayed correctly without specifying an encoding in your Java program.
If the shell is incapable of displaying the character (it is substituting the replacement character 0xFFFD (�) for these unprintable characters), there's nothing you can do from your Java application to change that. You need to change the terminal character encoding, install the right fonts, etc.; that is an operating system issue, not a Java issue.
At the same time, even if your terminal can't display the correct results, the Java program should be handling the character encodings correctly without your intervention.
The library behind the File API is figuring out the correct character encoding for your system and doing the necessary decoding into characters. Likewise, the database driver should negotiate with the database to determine the correct encoding, and do any necessary encoding into bytes on behalf of your application.
In a comment you wrote:
#mdrg: well, theres a Problem. I have to read the name of the files and then put them into a database. And there are a lot of '?' , that shouldnt be... – Lissy 27 mins ago
My guess is that the column you're inserting the filenames into specifies US-ASCII as the encoding and replaces characters outside that range with a replacement character, which in your case is the question mark.
So you have to find out the encoding for the column in your database table where you store the filenames. Various products have various syntaxes for retrieving that information.
In Java 1.6 you can use System.console() instead of System.out.println() to display accented characters on the console.
public class Test {
    public static void main(String[] args) {
        String s = "caractères français : à é \u00e9"; // \u00e9 is the Unicode escape for "é"
        System.console().writer().println(s);
    }
}
and the output is
C:\temp>java Test
caractères français : à é é

Regarding Java Split Command CSV File Parsing

I have a CSV file in the below format. I get an issue if either one of the below CSV data lines is read by the program:
"D",abc"def,"","0429"292"0","11","IJ80","Feb10_1.txt-2","FILE RECORD","05/02/2010","04/03/2010","","1","-91","",""
"D","abc"def","","04292920","11","IJ80","Feb10_1.txt-2","FILE RECORD","05/02/2010","04/03/2010","","1","-91","",""
The below split command is used to ignore the commas inside the double quotes. I got it from an earlier post and have pasted the URL that I took it from:
String items[] = line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)", 15);
System.out.println("items.length: " + items.length);
Regarding Java Split Command Parsing Csv File
The items.length is printed as 14 instead of 15. The abc"def is not recognized as an individual field, and it is incorrectly stored as
"D",abc"def in items[0]. I want it to be stored in the following way:
items[0] should be "D" and items[1] should be abc"def
The same issue happens when there is a value "abc"def". I want it to be stored as
items[0] should be "D" and items[1] should be "abc"def"
This split command does work perfectly if the double quotes inside a quoted field are escaped by doubling them (e.g. the field value is D,"abc""def",1).
How can I resolve this issue?
I think you would be much better off writing a parser for the CSV files rather than trying to use a regular expression. Once you start dealing with CSV files that have carriage returns within the lines, the regex will probably fall apart. It wouldn't take much code to write a simple while loop that goes through all the characters and splits up the data; see the sketch below. It would be a lot easier to deal with "non-standard"* CSV files such as yours with a parser than with a regex.
*I say non-standard because there isn't really an official standard for CSV, and when you're dealing with CSV files from many different systems, you see lots of weird things, like the abc"def field as shown above.
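As a rough illustration of that while-loop approach, here is a minimal sketch that walks each line once and splits on commas only when they are outside a double-quoted field. It deliberately keeps the quotes in the fields (as your expected output does) and treats a stray quote in the middle of a field, like abc"def, as a literal character. It is a heuristic tuned to data shaped like yours, not a general CSV parser:

import java.util.ArrayList;
import java.util.List;

public class SimpleCsvSplitter {

    // Split one CSV line on commas that are not inside a quoted field.
    // A field counts as quoted only if it starts with '"' right after a comma,
    // and it is closed by a '"' immediately followed by a comma or end of line.
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotedField = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"' && current.length() == 0 && !inQuotedField) {
                inQuotedField = true;           // opening quote of a quoted field
                current.append(c);
            } else if (c == '"' && inQuotedField
                    && (i + 1 == line.length() || line.charAt(i + 1) == ',')) {
                inQuotedField = false;          // closing quote just before a delimiter
                current.append(c);
            } else if (c == ',' && !inQuotedField) {
                fields.add(current.toString()); // delimiter outside quotes: end the field
                current.setLength(0);
            } else {
                current.append(c);              // everything else, including stray quotes
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        String line = "\"D\",abc\"def,\"\",\"0429\"292\"0\",\"11\",\"IJ80\"";
        List<String> items = split(line);
        System.out.println("items.length: " + items.size()); // 6 fields for this sample
        System.out.println(items.get(0));                    // "D"
        System.out.println(items.get(1));                    // abc"def
    }
}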
opencsv is a great, simple and lightweight CSV parser for Java. It will easily handle your data.
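If you go that route, a minimal opencsv sketch might look like this (the file name is hypothetical, and the exact exception types declared by readNext() depend on the opencsv version you use):

import com.opencsv.CSVReader;
import java.io.FileReader;

public class OpenCsvExample {
    public static void main(String[] args) throws Exception {
        try (CSVReader reader = new CSVReader(new FileReader("Feb10_1.csv"))) {
            String[] fields;
            while ((fields = reader.readNext()) != null) {
                System.out.println(fields.length + " fields, first = " + fields[0]);
            }
        }
    }
}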
If possible, changing your CSV format would make the solution very simple.
See the following for an overview of Delimiter Separated Values, a common format on Unix-based systems:
http://www.faqs.org/docs/artu/ch05s02.html#id2901882
opencsv is a very simple API and well suited for CSV parsing. This can also be done with Linux sed commands before processing the file in Java: if the file is not in a proper format, convert it into a properly delimited one by turning your "," separator into a pipe or another unique delimiter, so that quote characters inside field values and the column delimiter can be differentiated easily by opencsv. Use the power of Linux together with your Java code.
