character ° encoding and visualization in txt file

I have a field in a table that contains the string "Address Pippo p.2 °".
My program reads this value and writes it to a txt file, but the output is:
"Address Pippo p.2 Â°" (the Â is unwanted)
This is a problem because the txt file is a positional (fixed-width) file.
I open the file with these Java instructions:
FileWriter fw = new FileWriter(file, true);
pw = new PrintWriter(fw);
I want to write the string without the strange characters.
Any help?
Thanks in advance

Try encoding the string into UTF-8 like this,
File file = new File("D://test.txt");
FileWriter fw = new FileWriter(file, true);
PrintWriter pw = new PrintWriter(fw);
String test = "Address Pippo p.2 °";
ByteBuffer byteBuffer = Charset.forName("UTF-8").encode(test);
test = StandardCharsets.UTF_8.decode(byteBuffer).toString();
pw.write(test);
pw.close();

Java uses Unicode. When you write text to a file, it gets encoded using a particular character encoding. If you don't specify it explicitly, it will use a "system default encoding" which is whatever is configured as default for your particular JVM instance. You need to know what encoding you've used to write the file. Then you need to use the same encoding to read and display the file content. The funny characters you are seeing are probably due to writing the file using UTF-8 and then trying to read and display it in e.g. Notepad using Windows-1252 ("ANSI") encoding.
Decide what encoding you want and stick to it for both reading and writing. To write using Windows-1252, use:
Writer w = new OutputStreamWriter(new FileOutputStream(file, true), "windows-1252");
And if you write in UTF-8, then tell Notepad that you want it to read the file in UTF-8. One way to do that is to write the character '\uFEFF' (Byte Order Mark) at the beginning of the file.
If you use UTF-8, be aware that non-ASCII characters will throw the subsequent bytes out of position. So if, for example, a telephone field must always start at byte position 200, then having a non-ASCII character in an address field before it will make the telephone field start at byte position 201 or 202. Using windows-1252 encoding you won't have this issue, but that encoding can't encode all Unicode characters.
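The byte shift described above can be checked with a small sketch. The class name FieldWidthDemo and the helper encodedLength are made up for illustration; the only assumption is that the JVM provides the windows-1252 charset, which standard JDKs do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FieldWidthDemo {
    // Number of bytes the text occupies once encoded with the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // \u00B0 is the degree sign; the escape avoids depending on the source file encoding.
        String address = "Address Pippo p.2 \u00B0";
        Charset cp1252 = Charset.forName("windows-1252");

        System.out.println("characters:   " + address.length());
        System.out.println("windows-1252: " + encodedLength(address, cp1252));
        System.out.println("UTF-8:        " + encodedLength(address, StandardCharsets.UTF_8));
    }
}
```

In windows-1252 the byte count equals the character count, while in UTF-8 the degree sign takes two bytes, so every field after it moves by one byte. For a positional file in UTF-8 you would have to pad by encoded byte length rather than by String length.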

Related

Files.readAllLines() does not read all characters correctly

I have a simple text file which includes only one character which is '≤'. Nothing else. This file has UTF-8 encoding.
When I read this file using the method Files.readAllLines(), the character is shown as a question mark '?'
try (FileWriter fw = new FileWriter(new File(file, "f.txt"));
        PrintWriter writer = new PrintWriter(fw)) {
    List<String> lines = Files.readAllLines(deProp.toPath());
    for (String line : lines) {
        System.out.println(line);
        writer.write(line);
        writer.println();
    }
}
In my example I am trying to print the line to the console and to a new file. In both cases a question mark is shown instead.
Any suggestions to solve this?

The Files.readAllLines(path) already uses UTF-8 (see the linked documentation). If you're using the Files.readAllLines(path, charset) variant, well, pass UTF-8 as the charset, of course (for example by using StandardCharsets.UTF_8).
Assuming you're using either the short version or passing UTF-8, then the error lies not with java, but with your setup.
Either the file doesn't contain ≤ in UTF-8, or you're printing it in java to a place that doesn't show such symbols (for example, because your font doesn't have it, and uses ? as the placeholder symbol for 'I do not have this symbol in my font file'; it's more usually a box symbol), or you're sending the output someplace that incorrectly presumes that what is sent is not UTF-8.
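One way to rule out the Java side is to name the charset at every step instead of relying on the platform default, which is what FileWriter and System.out use. The following is only a sketch under that assumption: the class Utf8RoundTrip and its temp file are hypothetical, and whether the console itself can display the character still depends on your terminal and font.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.PrintStream;
import java.io.UncheckedIOException;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class Utf8RoundTrip {
    // Writes the text to a temp file as UTF-8, reads it back as UTF-8,
    // and returns the first line that was read.
    static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("utf8-demo", ".txt");
            try (BufferedWriter writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                writer.write(text);
                writer.newLine();
            }
            List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
            Files.delete(file);
            return lines.get(0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String line = roundTrip("\u2264"); // \u2264 is the "less than or equal" sign
        // Wrap System.out so the bytes sent to the console are UTF-8 too.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(line);
    }
}
```

If the round trip preserves the character but the console still shows a question mark, the problem is in the terminal's encoding or font rather than in the file handling.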

The static method of the Files class, i.e.
public static List<String> readAllLines(Path path) throws IOException
reads all the lines from a file. The bytes from the file are decoded into characters using the UTF-8 charset. Invoking this method is equivalent to evaluating the expression:
Files.readAllLines(path, StandardCharsets.UTF_8)
It is possible that the file contains some garbage or bytes that are not valid UTF-8. Check the text inside the file manually once :p

Convert Hindi Output à¤à¥ à¤µà¤¿à¤¨à¤¿à¤¯à¤®à¤¨ à¤à¤§à¤¿à¤¨à¤¿à¤¯à¤®?

à¤à¥ à¤µà¤¿à¤¨à¤¿à¤¯à¤®à¤¨ à¤à¤§à¤¿à¤¨à¤¿à¤¯à¤®
This is the output I get when I translate an English sentence. Is there any way to turn it into a readable form?
The goal is to translate an English sentence to Hindi. The Hindi translation is shown correctly in the console. I need to write it to a text file.
The translated sentence is stored in the "translation" parameter, which I read with getParameter() and then try to save to the file.
String translation = request.getParameter("translation");
OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream(fileDir,true), "UTF-8");
BufferedWriter fbw = new BufferedWriter(writer);
fbw.write(translation);

This is a character encoding mismatch: the text is encoded with one charset (such as UTF-8) and decoded with another.
Make sure the character encoding of data that is returned from the request parameter is in UTF-8 encoding.
If the data is in a different encoding, you will have to use that encoding while writing to the file.
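The garbled output in the question has the typical shape of UTF-8 bytes decoded with a single-byte Latin charset. The sketch below reproduces that effect and reverses it. It assumes ISO-8859-1 as the wrong charset, because that mapping loses no bytes; the class and method names are made up for illustration. It is a diagnostic aid, not a substitute for decoding the request parameter correctly in the first place.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Simulates the bug: the text's UTF-8 bytes are wrongly decoded as ISO-8859-1.
    static String garble(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Reverses the bug: recover the original bytes, then decode them as UTF-8.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A Hindi word written with escapes so the source file encoding does not matter.
        String hindi = "\u0935\u093f\u0928\u093f\u092f\u092e\u0928";
        String garbled = garble(hindi);
        System.out.println(garbled.length() + " chars after wrong decoding, "
                + hindi.length() + " chars originally");
        System.out.println("repaired equals original: " + repair(garbled).equals(hindi));
    }
}
```

If the wrong decoding went through windows-1252 instead, a few byte values have no mapping there and are lost, so the repair may only be partial.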

Why is my String returning "\ufffd\ufffdN a m e"

This is my method
public void readFile3()throws IOException
{
try
{
FileReader fr = new FileReader(Path3);
BufferedReader br = new BufferedReader(fr);
String s = br.readLine();
int a =1;
while( a != 2)
{
s = br.readLine();
a ++;
}
Storage.add(s);
br.close();
}
catch(IOException e)
{
System.out.println(e.getMessage());
}
}
For some reason I am unable to read the file, which only contains this "
Name
Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz "
When I debug the code, the String s is returned as "\ufffd\ufffdN a m e" and I have no clue where those extra characters are coming from. This is preventing me from properly reading the file.

\ufffd is the Unicode replacement character; a decoder emits it when it meets bytes that are not valid in the charset it was told to use. I suppose you are on a Windows platform (or at least the file you read was created on Windows). Windows supports many formats for text files. The most common is ANSI, where each character is represented by its ANSI code.
But Windows can also use UTF-16 directly, where each character is represented by its Unicode code as a 16-bit integer, so with 2 bytes per character. Those files use a special marker (the Byte Order Mark, in Windows terminology) to say:
that the file is encoded with 2 (or even 4) bytes per character
whether the encoding is little or big endian
(Reference: Using Byte Order Marks on MSDN)
Since what follows the first two replacement characters is N a m e and not Name, I suppose you have a UTF-16 encoded text file. Notepad can transparently edit those files (without even telling you the actual format), but other tools do have problems with them.
The excellent vim can read files with different encodings and convert between them.
If you want to use this kind of file directly in Java, you have to use the UTF-16 charset. From the Java SE 7 javadoc on Charset: UTF-16: Sixteen-bit UCS Transformation Format, byte order identified by an optional byte-order mark

You must specify the encoding when reading the file, which in your case is probably UTF-16.
Reader reader = new InputStreamReader(new FileInputStream(fileName), "UTF-16");
BufferedReader br = new BufferedReader(reader);
Check the documentation for more details: InputStreamReader class.
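To see why the charset matters here, the following sketch builds the bytes such a file would plausibly contain (a little-endian byte order mark followed by "Name" in UTF-16LE, which is an assumption about the asker's file) and decodes them twice. The class Utf16ReadDemo and its helpers are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf16ReadDemo {
    // Bytes of a UTF-16LE text file with a byte order mark (FF FE) in front.
    static byte[] sampleBytes() {
        byte[] text = "Name\r\n".getBytes(StandardCharsets.UTF_16LE);
        byte[] data = new byte[text.length + 2];
        data[0] = (byte) 0xFF;
        data[1] = (byte) 0xFE;
        System.arraycopy(text, 0, data, 2, text.length);
        return data;
    }

    // Reads the first line of the data using the given charset.
    static String firstLine(byte[] data, Charset charset) {
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), charset))) {
            return br.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = sampleBytes();
        // Wrong charset: the BOM bytes are invalid UTF-8 and every second byte is a NUL.
        String wrong = firstLine(data, StandardCharsets.UTF_8);
        // Right charset: the BOM selects the byte order and is not part of the text.
        String right = firstLine(data, StandardCharsets.UTF_16);
        System.out.println("as UTF-8:  " + wrong.length() + " chars");
        System.out.println("as UTF-16: " + right);
    }
}
```

Decoded as UTF-8, the two BOM bytes each become \ufffd and the letters are separated by NUL characters, which matches the "\ufffd\ufffdN a m e" seen in the debugger.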

Check to see if the file is .odt, .rtf, or something other than .txt. This may be what's causing the extra UTF-16 characters to appear. Also, make sure that (even if it is a .txt file) your file is encoded in UTF-8 characters.
Perhaps you have UTF-16 characters such as '®' in your document.

Java Unicode to readable text conversion decoding

I am developing a Java application where I am consuming a web service. The web service is created using a SAP server, which encodes the data automatically in Unicode. I get a Unicode string from the web service.
"
倥䙄ㄭ㌮਍쿣ී㈊〠漠橢਍圯湩湁楳湅潣楤杮਍湥潤橢਍″‰扯൪㰊഼┊敄瑶灹⁥佐呓′†䘠湯⁴佃剕䕉⁒渠牯慭⁬慌杮䔠ൎ⼊祔数⼠潆瑮਍匯扵祴数⼠祔数റ⼊慂敳潆瑮⼠潃牵敩൲⼊慎敭⼠う㄰਍䔯据摯湩⁧′‰൒㸊ാ攊摮扯൪㐊〠漠橢਍㰼਍䰯湥瑧⁨‵‰൒㸊ാ猊牴慥൭ 䘯〰‱⸱2
"
above is the response.
I want to convert it to readable text format like String. I am using core Java.

倥䙄ㄭ㌮਍쿣ී㈊〠漠橢਍圯湩湁楳湅潣楤杮਍湥潤橢਍″‰扯൪㰊഼┊敄瑶灹⁥佐呓′†䘠湯⁴佃剕䕉⁒渠牯慭⁬慌杮䔠ൎ⼊祔数⼠潆瑮਍匯扵祴数⼠祔数റ⼊慂敳潆瑮⼠潃牵敩൲⼊慎敭⼠う㄰਍䔯据摯湩⁧′‰൒㸊ാ攊摮扯൪㐊〠漠橢਍㰼਍䰯湥瑧⁨‵‰൒㸊ാ猊牴慥൭ 䘯〰‱⸱2
That's a PDF file that has been interpreted as UTF-16LE.
You need to look at what component is receiving the response and how it's dealing with the input to stop it being decoded as UTF-16LE, but ultimately there isn't a 'readable' version of it as such, as it's a binary file. Extracting the document text out of a PDF file is a much bigger problem!
(Note: Unicode is a character set, UTF-16LE is an encoding of that set into bytes. Microsoft call the UTF-16LE encoding "Unicode" due to a historical accident, but that's misleading.)
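The diagnosis can be checked mechanically. This is a rough sketch with made-up names (PdfAsUtf16Demo, misdecode, looksLikePdf): it re-encodes the garbled String as UTF-16LE and looks for the "%PDF-" signature at the start. It is only suitable for diagnosis, because decoding binary data as text can lose bytes and the recovered array is not guaranteed to be the original file.

```java
import java.nio.charset.StandardCharsets;

public class PdfAsUtf16Demo {
    // Simulates the bug: raw file bytes wrongly decoded as UTF-16LE text.
    static String misdecode(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_16LE);
    }

    // Re-encodes the garbled text as UTF-16LE and checks for the PDF signature.
    static boolean looksLikePdf(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.UTF_16LE);
        String head = new String(raw, 0, Math.min(raw.length, 5), StandardCharsets.US_ASCII);
        return head.startsWith("%PDF-");
    }

    public static void main(String[] args) {
        byte[] header = "%PDF-1.3".getBytes(StandardCharsets.US_ASCII);
        String garbled = misdecode(header);
        System.out.println("garbled length: " + garbled.length());
        System.out.println("looks like PDF: " + looksLikePdf(garbled));
    }
}
```

The first two characters of the response in the question, U+5025 and U+4644, are exactly what the bytes of "%PDF" become under this wrong decoding.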

If you have a byte[] or an InputStream (both binary data) you can get a String or a Reader (both text) with:
final String encoding = "UTF-8"; // or "UTF-16LE" or "UTF-16BE"
byte[] b = ...;
String s = new String(b, encoding);
InputStream is = ...;
BufferedReader reader = new BufferedReader(new InputStreamReader(is, encoding));
String line;
while ((line = reader.readLine()) != null) {
    // process line
}
The reverse process uses:
byte[] b = s.getBytes(encoding);
OutputStream os = ...;
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(os, encoding));
writer.write(s);
writer.newLine();
Unicode is a numbering system for all characters. The UTF variants encode Unicode as bytes.
Your problem:
Normally, with a web service, you would already have received a String. You could write that String to a file using the Writer above, for instance, either to check it yourself with a full Unicode font or to pass the file on for a check.
You need (?) to check which UTF variant the text is in. For Asian scripts UTF-16 (little endian or big endian) is optimal. In XML the encoding would already be declared.
Addition:
FileWriter writes to a file using the default encoding (taken from the operating system on your machine). Instead use:
new OutputStreamWriter(new FileOutputStream(new File("...")), "UTF-8")
If it is a binary PDF, as @bobince said, just use a FileOutputStream on the byte[] or InputStream.

This is definitely not a valid string. This looks like mangled UTF-16.
UPDATE
Indeed @bobince is right, this is a PDF file (most probably in UTF-8 or plain ASCII) displayed as UTF-16. When displayed as UTF-8, this string indeed shows PDF source code. Good catch.

UTF-8 conversion for text obtained from the internet

ElasticSearch is a search server which accepts data only in UTF-8.
When I try to give ElasticSearch the following text
"Small businesses potentially in line for a lighter reporting load include those with an annual turnover of less than £440,000, net assets of less than £220,000 and fewer than ten employees"
through my Java application (basically, my Java application takes this text from a web page and gives it to ElasticSearch), ES complains that it can't understand £ and it fails. After filtering through the code below:
byte bytes[] = s.getBytes("ISO-8859-1");
s = new String(bytes, "UTF-8");
Here £ is converted to �
But when I copy the text to a file in my home directory using bash, it goes in fine. Any pointers will help.

You have ISO-8859-1 octets in bytes, which you then tell String to decode as if they were UTF-8. When it does that, it finds that the lone byte 0xA3 is not a legal UTF-8 sequence and replaces it with the substitution character.
To do this, you have to construct the string with the encoding it uses, then convert it to the encoding that you want. See How do I convert between ISO-8859-1 and UTF-8 in Java?.
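The effect described here is easy to reproduce in isolation. A minimal sketch, with hypothetical names (PoundDemo, wrongConversion, utf8Bytes), assuming the String already holds the correct character:

```java
import java.nio.charset.StandardCharsets;

public class PoundDemo {
    // The conversion from the question: encode as ISO-8859-1, decode as UTF-8.
    static String wrongConversion(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    // What a UTF-8 consumer actually needs: the String encoded as UTF-8 bytes.
    static byte[] utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String pound = "\u00A3"; // the pound sign, written as an escape
        String broken = wrongConversion(pound);
        System.out.println("became replacement char: " + broken.equals("\uFFFD"));

        byte[] encoded = utf8Bytes(pound);
        System.out.printf("UTF-8 bytes: %02X %02X%n", encoded[0] & 0xFF, encoded[1] & 0xFF);
    }
}
```

In ISO-8859-1 the pound sign is the single byte A3, which on its own is an invalid UTF-8 sequence. Encoded as UTF-8 it is the two bytes C2 A3, so the String only needs to be encoded once, with the charset the receiver expects.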

UTF-8 is easier than one thinks. In a String everything is Unicode characters.
Bytes/String conversion is done as follows.
(Note: Cp1252, or Windows-1252, is the Windows Latin-1 extension of ISO-8859-1; better to use that one.)
BufferedReader in = new BufferedReader(
new InputStreamReader(new FileInputStream(file), "Cp1252"));
PrintWriter out = new PrintWriter(
new OutputStreamWriter(new FileOutputStream(file), "UTF-8"));
response.setContentType("text/html; charset=UTF-8");
response.setCharacterEncoding("UTF-8");
String s = "20 \u00A3"; // Escaping
To see why Cp1252 is more suitable than ISO-8859-1:
http://en.wikipedia.org/wiki/Windows-1252

String s is a series of characters that are basically independent of any character encoding (ok, not exactly independent, but close enough for our needs now). Whatever encoding your data was in when you loaded it into a String has already been decoded. The decoding was done either using the system default encoding (which is practically ALWAYS AN ERROR, do not ever use the system default encoding, trust me, I have over 10 years of experience in dealing with bugs related to wrong default encodings) or using the encoding you explicitly specified when you loaded the data.
When you call getBytes("ISO-8859-1") on a String, you request that the String be encoded into bytes according to the ISO-8859-1 encoding.
When you create a String from a byte array, you need to specify the encoding in which the characters in the byte array are represented. You create the String as if the byte array had been encoded in UTF-8, but just above you encoded it in ISO-8859-1; that is your error.
What you want to do is:
byte bytes[] = s.getBytes("UTF-8");
s = new String(bytes, "UTF-8");
