reading an ISO-8859-1 encoded data from CSV - java

I have a CSV file, which I need to read and analyse. I use the methods and classes in from Apache Commons CSV.
The input file uses the regular low ASCII (0x0 -0x7f) characters. Some of fields include also line breaks.
However, in addition, some of the fields may contain characters 0xe4 and 0xe5 which need to be converted to '{' and '}' respectively. I have looked at the input file with a hex view so I am certain that it is really 0xe4 and 0xe5, and not some Unicode.
FileReader in = new FileReader(INPUT_CSV);
System.out.println(in.getEncoding());
records = CSVFormat.RFC4180.withFirstRecordAsHeader().withDelimiter('|').withQuote('#').parse(in);
The getEncoding() method says that the file is UTF-8 encoded, and I suspect this is where it goes wrong.
Then I read the records by using a loop through
for (CSVRecord record : records) {
// some analysis in here
String toProcess = record.get("TO_PROCESS"); // this is the field which may contain the 0xe4 and 0xe5
toProcess = StringUtils.replaceChars(toProcess, OPENING_BRACKET,'{');
toProcess = StringUtils.replaceChars(toProcess, CLOSING_BRACKET,'}');
}
Yet, this replacement does not work, and the output strings have a three character sequence 0xef 0xbf 0xbd instead of the brackets I was hoping to see.
Is it possible to force the ISO-8859-1 on the input? Or while reading the strings from the input file?
p.s.
Opening and closing brackets are defined as
static char OPENING_BRACKET = 228; // 'ä'
static char CLOSING_BRACKET = 229; // 'å'

Related

file reading encoding trouble

I've a file to read save, do something with its informations and then rewrite them back to another file. the problem is that the original file contains some characters from asian languages like 坂本龍一, 東京事変 and メリー (I guess they're chinese, japanese and korean). I can see them using Notepad++.
the problem is when I read them and write those things via java they get corrupted and I see weird stuff in my output file like ???????? or Жанна БичевÑ?каÑ?
I think I got something wrong with the encoding but I've no idea of which to use and how to use it.
can someone help me? here's my code:
String fileToRead= SONG_2M;
Scanner scanner = new Scanner(new File(fileToRead), "UTF-8");
while (scanner.hasNextLine()) {
String line = scanner.nextLine();
String[] songData = line.split("\t");
if (/*something*/) {
save the string in the map
}
}
scanner.close();
saveFile("coded_artist_small2.txt");
}
public void saveFile(String fileToSave) throws FileNotFoundException, UnsupportedEncodingException {
PrintWriter writer = new PrintWriter(fileToSave, "UTF-8");
for (Entry<String, Integer> entry : artistsMap.entrySet()) {
writer.println(entry.getKey() + DELIMITER + entry.getValue());
}
writer.close();
}
It is likely that your input file is not, in fact, encoded in UTF-8 (an encoding using two bytes per character satisfying the unicode standard). For instance, the character 坂 you are seeing is unicode 0x5742. If, in fact, your file is encoded in ASCII, that should be displayed as character 0x57 followed by 0x42 - i.e. 9*.
If you're unsure of your file's encoding - take a guess that it might be ASCII text. Try removing the encoding when you set up the Scanner i.e. make the second line of your code
Scanner scanner = new Scanner(new File(fileToRead));
If, in fact, you know the file is unicode, there are different encodings. See this answer for a more comprehensive unicode reader - dealing with various unicode encodings.
For your output - you need to decide how you want the file encoded : some unicode encoding (e.g. UTF-8) or as ASCII.

Why is my String returning "\ufffd\ufffdN a m e"

This is my method
public void readFile3()throws IOException
{
try
{
FileReader fr = new FileReader(Path3);
BufferedReader br = new BufferedReader(fr);
String s = br.readLine();
int a =1;
while( a != 2)
{
s = br.readLine();
a ++;
}
Storage.add(s);
br.close();
}
catch(IOException e)
{
System.out.println(e.getMessage());
}
}
For some reason I am unable to read the file which only contains this "
Name
Intel(R) Core(TM) i5-2500 CPU # 3.30GHz "
When i debug the code the String s is being returned as "\ufffd\ufffdN a m e" and i have no clue as to where those extra characters are coming from.. This is preventing me from properly reading the file.
\ufffd is the replacement character in unicode, it is used when you try to read a code that has no representation in unicode. I suppose you are on a Windows platform (or at least the file you read was created on Windows). Windows supports many formats for text files, the most common is Ansi : each character is represented but its ansi code.
But Windows can directly use UTF16, where each character is represented by its unicode code as a 16bits integer so with 2 bytes per character. Those files uses special markers (Byte Order Mark in Windows dialect) to say :
that the file is encoded with 2 (or even 4) bytes per character
the encoding is little or big endian
(Reference : Using Byte Order Marks on MSDN)
As you write after the first two replacement characters N a m e and not Name, I suppose you have an UTF16 encoded text file. Notepad can transparently edit those files (without even saying you the actual format) but other tools do have problems with those ...
The excellent vim can read files with different encodings and convert between them.
If you want to use directly this kind of file in java, you have to use the UTF-16 charset. From JaveSE 7 javadoc on Charset : UTF-16 Sixteen-bit UCS Transformation Format, byte order identified by an optional byte-order mark
You must specify the encoding when reading the file, in your case probably is UTF-16.
Reader reader = new InputStreamReader(new FileInputStream(fileName), "UTF-16");
BufferedReader br = new BufferedReader(reader);
Check the documentation for more details: InputStreamReader class.
Check to see if the file is .odt, .rtf, or something other than .txt. This may be what's causing the extra UTF-16 characters to appear. Also, make sure that (even if it is a .txt file) your file is encoded in UTF-8 characters.
Perhaps you have UTF-16 characters such as '®' in your document.

Unicode Replacement with ASCII

I have created a text file on windows system where I think default encoding style is ANSI and contents of the file looks like this :
This is\u2019 a sample text file \u2014and it can ....
I saved this file using the default encoding style of windows though there were encoding styles were also available like UTF-8,UTF-16 etc.
Now I want to write a simple java function where I will pass some input string and replace all of the unicodes with the corresponding ascii value.
e.g :- \u2019 should be replaced with "'"
\u2014 should be replaced with "-" and so on.
Observation :
When i created a string literal like this
String s = "This is\u2019 a sample text file \u2014and it can ....";
My code is working fine , but when I am reading it from the file it is not working. I am aware that in Java String uses UTF-16 encoding .
Below is the code that I am using to read the input file.
FileReader fileReader = new FileReader(new File("C:\\input.txt"));
BufferedReader bufferedReader = new BufferedReader(fileReader)
String record = bufferedReader.readLine();
I also tried using the InputStream and setting the Charset to UTF-8 , but still the same result.
Replacement code :
public static String removeUTFCharacters(String data){
for(Entry<String,String> entry : utfChars.entrySet()){
data=data.replaceAll(entry.getKey(), entry.getValue());
}
return data;
}
Map :
utfChars.put("\u2019","'");
utfChars.put("\u2018","'");
utfChars.put("\u201c","\"");
utfChars.put("\u201d","\"");
utfChars.put("\u2013","-");
utfChars.put("\u2014","-");
utfChars.put("\u2212","-");
utfChars.put("\u2022","*");
Can anybody help me in understanding the concept and solution to this problem.
Match the escape sequence \uXXXX with a regular expression. Then use a replacement loop to replace each occurrence of that escape sequence with the decoded value of the character.
Because Java string literals use \ to introduce escapes, the sequence \\ is used to represent \. Also, the Java regex syntax treats the sequence \u specially (to represent a Unicode escape). So the \ has to be escaped again, with an additonal \\. So, in the pattern, "\\\\u" really means, "match \u in the input."
To match the numeric portion, four hexadecimal characters, use the pattern \p{XDigit}, escaping the \ with an extra \. We want to easily extract the hex number as a group, so it is enclosed in parentheses to create a capturing group. Thus, "(\\p{XDigit}{4})" in the pattern means, "match 4 hexadecimal characters in the input, and capture them."
In a loop, we search for occurrences of the pattern, replacing each occurrence with the decoded character value. The character value is decoded by parsing the hexadecimal number. Integer.parseInt(m.group(1), 16) means, "parse the group captured in the previous match as a base-16 number." Then a replacement string is created with that character. The replacement string must be escaped, or quoted, in case it is $, which has special meaning in replacement text.
String data = "This is\\u2019 a sample text file \\u2014and it can ...";
Pattern p = Pattern.compile("\\\\u(\\p{XDigit}{4})");
Matcher m = p.matcher(data);
StringBuffer buf = new StringBuffer(data.length());
while (m.find()) {
String ch = String.valueOf((char) Integer.parseInt(m.group(1), 16));
m.appendReplacement(buf, Matcher.quoteReplacement(ch));
}
m.appendTail(buf);
System.out.println(buf);
If you can use another library, you can use apache commons
https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html
String dirtyString = "Colocaci\u00F3n";
String cleanString = StringEscapeUtils.unescapeJava(dirtyString);
//cleanString = "Colocación"

Replace non UTF compliant characters characters in a meaningful way rather than simply removing them

My application is malfunctioning because of the special characters in the strings any many areas.
Eg 1 : you can see the ? character that was displaying instead of ’.
Text :
The Hilton Paris La Defense hotel is located at the foot of the Grande Arche at the very heart of Europe’s largest business district and puts you in easy reach of some of Paris’ most famous attractions. Only a few minutes from the...
Screen Shot :
Eg 2 : Parser exception while parsing a XML having special characters (like ’,& etc) using AXIOM.
XMLStreamReader parser = XMLInputFactory.newInstance().createXMLStreamReader(new StringBufferInputStream(responseXML));
OMElement documentElement = new StAXOMBuilder(parser).getDocumentElement();
I found many posts to remove them when they are found.
Eg :
How to remove bad characters that are not suitable for utf8 encoding in MySQL?
remove non-UTF-8 characters from xml with declared encoding=utf-8 - Java
And I'm using following character to remove the non UTF compliant characters characters.
if (null == inString ) return null;
byte[] byteArr = inString.getBytes();
for ( int i=0; i < byteArr.length; i++ ) {
byte ch= byteArr[i];
if ( !(ch < 0x00FD && ch > 0x001F) || ch =='&' || ch=='#') {
byteArr[i]=' ';
}
}
return new String( byteArr );
But this lead to another problem of removing some informative characters like ’.
What I want to do is, I want to replace them in a meaningful way rather than simply removing them. Eg : ’ can be replaced by ', & can be replaced by 'and' etc.
Is there any standard way to do this rather than manually replacing one by one?
The javadoc for StringBufferInputStream says
Deprecated. This class does not properly convert characters into bytes. As of JDK 1.1, the preferred way to create a stream from a string is via the StringReader class.
Don't use it.
The file is read as bytes, no matter where it comes from. Never convert your data to a String if you need it as bytes in the first place.
If you're reading from a file, use a FileInputStream. (Never use FileReader, since it doesn't allow you to specify the encoding.)

Trying to read binary file as text but scanner stops at first line

I'm trying to read a binary file but my program just stops at first line..
I think it's because of the strange characters the file has..I just want to extract some directions from it. Is there a way to do this?..
public static void main(String[] args) throws IOException
{
Scanner readF = new Scanner(new File("D:\\CurrentDatabase_372.txt"));
String line = null;
String newLine = System.getProperty("line.separator");
FileWriter writeF = new FileWriter("D:\\Songs.txt");
while (readF.hasNext())
{
line = readF.nextLine();
if (line.contains("D:\\") && line.contains(".mp3"))
{
writeF.write(line.substring(line.indexOf("D:\\"), line.indexOf(".mp3") + 4) + newLine);
}
}
readF.close();
writeF.close();
}
The file starts like this:
pppppamepD:\Music\Korn\Untouchables\03 Blame.mp3pmp3pmp3pKornpMetalpKornpUntouchablespKornpUntouchables*;*KornpKornpKornUntouchables003pMetalKornUntouchables003pBlameKornUntouchables003pKornKornUntouchables003pMP3pppppCpppÀppp#ppøp·pppŸú#pdppppppòrSpUpppppp€ppªp8›qpppppppppppp,’ppÒppp’ÍpET?ppppppôpp¼}`Ñ#ãâK†¡H¤*(DppppppppppppppppuÞѤéú:M®$#]jkÝW0ÛœFµú½XVNp`w—wâÊp:ºŽwâÊpppp8Npdpp¡pp{)pppppppppppppppppyY:¸[ªA¥Bi `Û¯pppppppppppp2pppppppppppppppppppppppppppppppppppp¿ÞpAppppppp€ppp€;€?€CpCpC€H€N€S€`€e€y€~p~p~€’€«€Ê€â€Hollow LifepD:\Musica\Korn\Untouchables\04 Hollow Life.mp3pmp3pmp3pKornpMetalpKornpUntouchablespKornpUntouchables*;*KornpKornpKornUntouchables004pMetalKornUntouchables004pHollow LifeKornUntouchables004pKornKornUntouchables004pMP3pppppCpppÀHppppppøp¸pppǺxp‰ppppppòrSpUpppppp€ppªp8›qpppppppppppp,’ppÒpppŠºppppppppppôpp¼}`Ñ#ãâK†¡H¤*(DpppppppppppppppppãG#™R‚CA—®þ^bN °mbŽ‚^¨pG¦sp;5p5ÓÐùšwâÊp
)ŽwâÊpppp8Npdpp!cpp{pppppppppppppppppyY:¸[ªA¥Bi `ۯǺxp‰pppppp2pppppppppppppppppppppppppppppppppppp¿
I want to extract file directions like "D:\Music\Korn\Untouchables\03 Blame.mp3".
You cannot use a line-oriented scanner to read binary files. You have no guarantee that the binary file even has "lines" delimited by newline characters. For example, what would your scanner do if there were TWO files matching the pattern "D:\.*.mp3" with no intervening newline? You would extract everything between the first "D:\" and the last ".mp3", with all the garbage in between. Extracting file names from a non-delimited stream such as this requires a different strategy.
If i were writing this I'd use a relatively simple finite-state recognizer that processes characters one at a time. When it encounters a "d" it starts saving characters, checking each character to ensure that it matches the required pattern, ending when it sees the "3" in ".mp3". If at any point it detects a character that doesn't fit, it resets and continues looking.
EDIT: If the files to be processed are small (less than 50mb or so) you could load the entire file into memory, which would make scanning simpler.
As was said, since it is a binary file you can't use a Scanner or other character based readers. You could use a regular FileInputStream to read the actual raw bytes of the file. Java's String class has a constructor that will take an array of bytes and turn them into a string. You can then search that string for the file name(s). This may work if you just use the default character set.
String(byte[]):
http://download.oracle.com/javase/1.4.2/docs/api/java/lang/String.html
FileInputStream for reading bytes:
http://download.oracle.com/javase/tutorial/essential/io/bytestreams.html
Use hasNextLine() instead of hasNext() in the while loop check.
while (readF.hasNextLine()) {
String line = readF.nextLine();
//Your code
}

Categories