Reading unicode txt in java


I am trying to extract data from a .txt file that was encoded in Unicode because there are accents in it (French names). Below is a portion of my code. The output of string postalCode has weird little squares in it (squareHsquare1square). My suspicion is the problem has something to do with the program treating the content as ASCII. Someone please point me in the right direction. Thanks!
Scanner in = new Scanner(new FileReader("postal_codes.txt"));
currentLine = in.nextLine();
//take first 6 char --> store as variable
postalCode = currentLine.substring(0, 5);

If you read the javadoc for FileReader, it says (emphasis mine):
The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.
In other words, you need to use:
new Scanner(new InputStreamReader(
new FileInputStream("postal_codes.txt"), StandardCharsets.UTF_8));

It sounds like an encoding issue. I'm assuming that by "encoded in Unicode" you mean "encoded in UTF-8". Try this:
Scanner in = new Scanner(
new InputStreamReader(new FileInputStream("postal_codes.txt"), "UTF-8"));
A FileReader automatically uses the default encoding for the platform. This often is not UTF-8.
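To make the effect of the charset visible, here is a minimal sketch that decodes the same bytes with different charsets. The class name DecodeDemo and the sample strings are made up for illustration. The last part assumes the file might actually be UTF-16 rather than UTF-8; in that case a single-byte decoder leaves a NUL character next to every letter, which could resemble the squares described in the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DecodeDemo {
    // Decodes raw bytes with an explicitly chosen charset,
    // which is what InputStreamReader does for a stream.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Hypothetical sample: a French name as stored in a UTF-8 file.
        byte[] utf8 = "\u00C9lise".getBytes(StandardCharsets.UTF_8);

        // Matching charset: the accented letter survives.
        System.out.println(decode(utf8, StandardCharsets.UTF_8));
        // Mismatched charset: the two bytes of the accented letter
        // are decoded as two unrelated characters.
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1));

        // Assumption: the file is UTF-16LE. A single-byte decoder then
        // yields six characters for three letters (each followed by NUL).
        byte[] utf16 = "H1A".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(decode(utf16, StandardCharsets.ISO_8859_1).length()); // 6
    }
}
```

If the output of the second line looks like the garbage you are seeing, the reader's charset probably does not match the file.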

You can use Guava's readLines method:
Files.readLines(File file, Charset charset) : List<String>
from the class
com.google.common.io.Files;

You can try this (the FileReader constructor that accepts a Charset requires Java 11 or later):
BufferedReader in = new BufferedReader(new FileReader("postal_codes.txt", StandardCharsets.UTF_8));
String content = in.readLine();
postalCode = content.substring(0, 5);

Related

ISO-8859-1 to UTF-8 in Java (runescape API)

I am trying to make a Discord bot which gets information from the RuneScape API and returns information about the user. The issue I have is when a username contains a space.
The RuneScape API returns a file in ISO-8859-1 and I try to convert it to UTF-8.
Two examples from the file: lil Jimmy and lil jessica.
The loop finds a match for jessica, but not for jimmy.
The code for getting and reading the file:
InputStream input = null;
InputStreamReader inputReader = null;
BufferedReader reader = null;
URL url = new URL("http://services.runescape.com/m=clan-hiscores/members_lite.ws?clanName=uh");
input = url.openConnection().getInputStream();
inputReader = new InputStreamReader(input, "ISO-8859-1");
reader = new BufferedReader(inputReader);
String line;
while ((line = reader.readLine()) != null) {
    String[] parts = line.split(",");
    parts[0] = new String(parts[0].getBytes("UTF-8"), "ISO-8859-1");
    if (parts[0].equals("lil Jimmy")) {System.out.println("lil Jimmy found");}
    if (parts[0].equals("lil jessica")) {System.out.println("lil jessica found");}
}
Does anyone know what I'm doing wrong? Thank you in advance for taking the time to help!
Edit 1: I've added "ISO-8859-1" to inputReader as suggested in the answers. The next step is to replace the non-breaking spaces with regular spaces.
Edit 2: The non breaking whitespace can be solved by:
parts[0] = parts[0].replaceAll("\u00a0","aaaaaaaaa");
parts[0] = parts[0].replaceAll("\u00C2","bbbbbbbbb");
parts[0] = parts[0].replaceAll("bbbbbbbbbaaaaaaaaa", " ");
The aaaaaaaaa placeholder stands for the non-breaking space and the bbbbbbbbb placeholder for the Â that ends up in front of it; the last line replaces the pair with a regular space.
Thanks everyone for helping me out!

If you want to ensure that you're reading the data correctly, use:
inputReader = new InputStreamReader(input, "ISO-8859-1");
After that, I'm not sure why you're trying to convert to UTF-8, since you're just using the text as Strings from that point on. A string itself doesn't have an encoding. (Well, in a certain sense a Java string is like UTF-16 in its internal representation, but that's a whole other can of worms you don't need to worry about here.)
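As an illustration of why the extra conversion hurts rather than helps, the sketch below applies the same round trip as the question (encode as UTF-8, decode as ISO-8859-1) to a string that was already decoded correctly. The class and method names are hypothetical; the sample name comes from the question, with the space written as U+00A0.

```java
import java.nio.charset.StandardCharsets;

final class ReencodeDemo {
    // Same round trip as in the question: String -> UTF-8 bytes -> ISO-8859-1 String.
    static String reencode(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Correctly decoded name containing a non-breaking space (U+00A0).
        String name = "lil\u00A0Jimmy";
        String mangled = reencode(name);

        System.out.println(name.length());    // 9
        System.out.println(mangled.length()); // 10: U+00A0 turned into U+00C2 U+00A0
        System.out.println(mangled.equals("lil\u00C2\u00A0Jimmy")); // true
    }
}
```

Pure ASCII names such as "lil jessica" pass through the round trip unchanged, which would explain why only some names fail to match.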

First, you are not providing the charset in your InputStreamReader, which causes it to use the default charset instead of the one it should be using, and then you are doing crazy stuff to try to fix it that you shouldn't have to do and that won't work properly.
Also, you are not closing the opened stream; you should be using try-with-resources.
It should probably look more like this:
URL url = new URL("http://services.runescape.com/m=clan-hiscores/members_lite.ws?clanName=uh");
// The reader is closed automatically at the end of the try block
try (BufferedReader reader = new BufferedReader(new InputStreamReader(url.openConnection().getInputStream(), StandardCharsets.ISO_8859_1))) {
    String line;
    while ((line = reader.readLine()) != null) {
        String[] parts = line.split(",");
        if (parts[0].equals("lil Jimmy")) {System.out.println("lil Jimmy found");}
        if (parts[0].equals("lil jessica")) {System.out.println("lil jessica found");}
    }
}

Looking at the downloaded text file:
The whitespace for "lil jessica" is a regular space (U+0020), the one for "lil Jimmy" (and most of the others as well) is a non-breaking space (U+00A0).
If you don't care about the difference between breaking and non-breaking spaces, the easiest approach is probably to replace the non-breaking space with a regular space in your input string. Something like:
// No re-encoding is needed once the reader decodes the stream as ISO-8859-1;
// converting again is what puts an Â in front of the space
parts[0] = parts[0].replace('\u00a0', ' ');
if (parts[0].equals("lil Jimmy")) {System.out.println("lil Jimmy found");}
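One way to spot such invisible differences without opening the file in a hex editor is to print the code points of a suspicious string. This is only a sketch; the class and method names are invented for illustration.

```java
import java.util.stream.Collectors;

final class CodePointDump {
    // Renders every code point as U+XXXX so that look-alike characters can be told apart.
    static String dump(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(dump("a b"));      // U+0061 U+0020 U+0062
        System.out.println(dump("a\u00A0b")); // U+0061 U+00A0 U+0062
    }
}
```

Calling dump(parts[0]) for a name that refuses to match should show whether it contains U+00A0 or some other unexpected character.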

Convert file from Cp1252 to utf -8 java

A user uploads a file encoded in Cp1252.
Since my MySQL table columns use the utf8_bin collation, I try to convert the file to UTF-8 before loading the data into the table with the LOAD DATA INFILE command.
Java source code:
OutputStream output = new FileOutputStream(destpath);
InputStream input = new FileInputStream(filepath);
BufferedReader reader = new BufferedReader(new InputStreamReader(input, "windows-1252"));
BufferedWriter writ = new BufferedWriter(new OutputStreamWriter(output, "UTF8"));
String in;
while ((in = reader.readLine()) != null) {
writ.write(in);
writ.newLine();
}
writ.flush();
writ.close();
It seems that characters are not converted correctly. The converted file has � and box symbols in multiple places. How do I convert the file to UTF-8 properly? Thanks.

One way of verifying the conversion process is to configure the charset decoder and encoder to bail out on errors instead of silently replacing the erroneous characters with special characters:
CharsetDecoder inDec=Charset.forName("windows-1252").newDecoder()
.onMalformedInput(CodingErrorAction.REPORT)
.onUnmappableCharacter(CodingErrorAction.REPORT);
CharsetEncoder outEnc=StandardCharsets.UTF_8.newEncoder()
.onMalformedInput(CodingErrorAction.REPORT)
.onUnmappableCharacter(CodingErrorAction.REPORT);
try(FileInputStream is=new FileInputStream(filepath);
BufferedReader reader=new BufferedReader(new InputStreamReader(is, inDec));
FileOutputStream fw=new FileOutputStream(destpath);
BufferedWriter out=new BufferedWriter(new OutputStreamWriter(fw, outEnc))) {
for(String in; (in = reader.readLine()) != null; ) {
out.write(in);
out.newLine();
}
}
Note that the output encoder is configured the same way only for symmetry: UTF-8 can encode every Unicode character, so it should never report an unmappable character. Keeping the configuration symmetric helps, however, once you want to reuse the same code for other conversions.
Further, note that this won’t help if the input file is in a different encoding but misinterpreting the bytes leads to valid characters. One thing to consider is whether the input encoding "windows-1252" actually meant the system’s default encoding (and whether that is really the same). If in doubt, you may use Charset.defaultCharset() instead of Charset.forName("windows-1252") when the actually intended conversion is default → UTF-8.
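The limitation described above can be seen in a small sketch. It uses ISO-8859-1 as a stand-in for any single-byte encoding (an assumption made so that only StandardCharsets constants are needed); the helper name strictDecode is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictDecodeDemo {
    // Decodes with REPORT actions; returns null if the bytes are invalid for the charset.
    static String strictDecode(byte[] bytes, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9};            // e with acute accent in ISO-8859-1
        byte[] utf8 = {(byte) 0xC3, (byte) 0xA9}; // the same letter in UTF-8

        // Detected: a lone 0xE9 is not valid UTF-8.
        System.out.println(strictDecode(latin1, StandardCharsets.UTF_8)); // null
        // Not detected: every byte is valid ISO-8859-1, so the wrong
        // interpretation silently yields two characters instead of one.
        System.out.println(strictDecode(utf8, StandardCharsets.ISO_8859_1).length()); // 2
    }
}
```

In other words, strict decoding catches UTF-8 input that is not UTF-8, but it cannot catch a wrong guess between two single-byte encodings.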

After changing file encoding Windows get it wrong

I wanted to change a file's encoding from one to another (it doesn't matter which).
But when I open the resulting file (w.txt), its content is messed up. Windows does not interpret it correctly.
Which output encoding should I use (args[1]) so that Windows Notepad displays it correctly?
import java.io.*;
import java.nio.charset.Charset;
public class Kodowanie {
public static void main(String[] args) throws IOException {
args = new String[2];
args[0] = "plik.txt";
args[1] = "ISO8859_2";
String linia, s = "";
File f = new File(args[0]), f1 = new File("w.txt");
FileInputStream fis = new FileInputStream(f);
InputStreamReader isr = new InputStreamReader(fis,
Charset.forName("UTF-8"));
BufferedReader in = new BufferedReader(isr);
FileOutputStream fos = new FileOutputStream(f1);
OutputStreamWriter osw = new OutputStreamWriter(fos,
Charset.forName(args[1]));
BufferedWriter out = new BufferedWriter(osw);
while ((linia = in.readLine()) != null) {
out.write(linia);
out.newLine();
}
out.close();
in.close();
}
}
input:
Ala
ma
Kota
output:
?Ala
ma
Kota
Why is there a '?'

The default encoding on Windows depends on the system locale; on Western European and US installations it is Cp1252 (windows-1252).

US-ASCII is a subset of Unicode (a pretty small one, by the way). You are reading a file in UTF-8 and then you write it back in US-ASCII. Thus the encoder has to make a decision when a given Unicode character cannot be expressed in the reduced 7-bit US-ASCII subset. Classically, it is replaced by a default character, like ?.
Take into account that characters in UTF-8 are multibyte in many cases, whereas US-ASCII characters are only 7 bits long. This means that all Unicode characters above code point 127 cannot be expressed in US-ASCII. That could explain the question marks that you see once the file has been converted.
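The substitution can be reproduced in a few lines. This is a sketch with invented names; it relies on String.getBytes(Charset), which replaces unmappable characters with the charset's default replacement. The last call rests on an assumption that is not confirmed by the question: if plik.txt was saved with a UTF-8 byte order mark, the decoded U+FEFF character has no US-ASCII equivalent either, which is one possible source of a single leading '?'.

```java
import java.nio.charset.StandardCharsets;

final class AsciiReplacementDemo {
    // Encodes to US-ASCII and decodes back so the encoder's substitutions become visible.
    static String throughAscii(String s) {
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Plain ASCII text is unaffected.
        System.out.println(throughAscii("Ala ma Kota")); // Ala ma Kota
        // Three non-ASCII Polish letters followed by w.
        System.out.println(throughAscii("\u017B\u00F3\u0142w")); // ???w
        // Assumed case: text that starts with a byte order mark (U+FEFF).
        System.out.println(throughAscii("\uFEFFAla")); // ?Ala
    }
}
```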
I answered a similar question, "Reading Strange Unicode Characters in Java". Perhaps it helps.
I also recommend that you read "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

Special characters from txt file

I am downloading a text file over FTP with a common FTP library.
The problem is that when I read the file into an array line by line, characters such as æøå are not read correctly. A "?" character is shown instead.
Here is my code
FileInputStream fstream = openFileInput("name of text file");
BufferedReader br = new BufferedReader(new InputStreamReader(fstream, "UTF-8"));
String strLine;
ArrayList<String> lines = new ArrayList<String>();
while ((strLine = br.readLine()) != null) {
lines.add(strLine);
}
String[] linjer = lines.toArray(new String[0]);
ArrayList<String> imei = new ArrayList<String>();
for(int o=0;o<linjer.length;o++)
{
String[] holder = linjer[o].split(" - ");
imei.add(holder[0] + " - " + holder[2]);
}
String[] imeinr = imei.toArray(new String[0]);
I have tried specifying UTF-8 in my InputStreamReader, and I have tried a UnicodeReader class, but with no success.
I am fairly new to Java, so this might just be a stupid question, but I hope you can help. :)

There is no reason to use a DataInputStream. The DataInputStream and DataOutputStream classes are used for serializing primitive Java data types ("serializing" means reading/writing data to a file). You are just reading the contents of a text file line by line, so the use of DataInputStream is unnecessary and may produce incorrect results.
FileInputStream fstream = openFileInput("name of text file");
//DataInputStream in = new DataInputStream(fstream);
BufferedReader br = new BufferedReader(new InputStreamReader(fstream, "UTF-8"));
Professional Java Programmer Tip: The foreach loop (the enhanced for loop, available since Java 5) allows the programmer to iterate through the contents of an array without needing to define a loop counter. This simplifies your code, making it easier to read and maintain over time.
for(String line : linjer){
String[] holder = line.split(" - ");
imei.add(holder[0] + " - " + holder[2]);
}
Note: Foreach loops can also be used with List objects.

I would suggest that the file may not be in UTF-8. It could be in CP1252 or something, especially if you're using Windows.
Try downloading the file and running your code on the local copy to see if that works.
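If the encoding is uncertain, one option is to try UTF-8 strictly and fall back to a single-byte charset when the bytes are not valid UTF-8. The sketch below is not a reliable detector, just an illustration; the class name and the choice of fallback are assumptions. ISO-8859-1 is used in the example because it encodes æ, ø and å with the same bytes as Cp1252.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class FallbackDecoder {
    // Tries strict UTF-8 first; if the bytes are not valid UTF-8, decodes with the fallback.
    static String decode(byte[] bytes, Charset fallback) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return new String(bytes, fallback);
        }
    }

    public static void main(String[] args) {
        String text = "\u00E6\u00F8\u00E5"; // the three letters from the question
        byte[] asLatin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] asUtf8 = text.getBytes(StandardCharsets.UTF_8);

        // Both byte sequences decode to the same three letters.
        System.out.println(decode(asLatin1, StandardCharsets.ISO_8859_1).equals(text)); // true
        System.out.println(decode(asUtf8, StandardCharsets.ISO_8859_1).equals(text));   // true
    }
}
```

To use this with the code from the question, the file would have to be read into a byte array first instead of being wrapped in an InputStreamReader.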

FTP has two transfer modes, binary and ASCII. Make sure you are using the correct mode. Look here for details: http://www.rhinosoft.com/newsletter/NewsL2008-03-18.asp

Check line for unprintable characters while reading text file

My program must read text files line by line.
The files are in UTF-8.
I am not sure that the files are correct; they may contain unprintable characters.
Is it possible to check for them without going down to the byte level?
Thanks.

Open the file with a FileInputStream, then use an InputStreamReader with the UTF-8 Charset to read characters from the stream, and use a BufferedReader to read lines, e.g. via BufferedReader#readLine, which will give you a string. Once you have the string, you can check for characters that aren't what you consider to be printable.
E.g. (without error checking), using try-with-resources (which is available in Java 7 and later):
String line;
try (
InputStream fis = new FileInputStream("the_file_name");
InputStreamReader isr = new InputStreamReader(fis, Charset.forName("UTF-8"));
BufferedReader br = new BufferedReader(isr);
) {
while ((line = br.readLine()) != null) {
// Deal with the line
}
}

While it's not hard to do this manually using BufferedReader and InputStreamReader, I'd use Guava:
List<String> lines = Files.readLines(file, Charsets.UTF_8);
You can then do whatever you like with those lines.
EDIT: Note that this will read the whole file into memory in one go. In most cases that's actually fine - and it's certainly simpler than reading it line by line, processing each line as you read it. If it's an enormous file, you may need to do it that way as per T.J. Crowder's answer.

Just found out that with the Java NIO (java.nio.file.*) you can easily write:
List<String> lines=Files.readAllLines(Paths.get("/tmp/test.csv"), StandardCharsets.UTF_8);
for(String line:lines){
System.out.println(line);
}
instead of dealing with FileInputStreams and BufferedReaders...

If you want to check whether a string has unprintable characters, you can use a regular expression:
[^\p{Print}]
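A sketch of how that expression might be applied to each line. One caveat worth knowing: by default, java.util.regex treats \p{Print} as printable ASCII only, so accented letters are reported as unprintable unless the pattern is compiled with Pattern.UNICODE_CHARACTER_CLASS. The class, constant, and method names below are invented for illustration.

```java
import java.util.regex.Pattern;

final class PrintableCheck {
    // Default behaviour: \p{Print} covers printable ASCII only.
    static final Pattern NON_PRINTABLE_ASCII = Pattern.compile("[^\\p{Print}]");

    // With UNICODE_CHARACTER_CLASS the POSIX classes follow Unicode properties.
    static final Pattern NON_PRINTABLE_UNICODE =
            Pattern.compile("[^\\p{Print}]", Pattern.UNICODE_CHARACTER_CLASS);

    // Returns true if the line contains at least one character the pattern rejects.
    static boolean hasNonPrintable(String line, Pattern pattern) {
        return pattern.matcher(line).find();
    }

    public static void main(String[] args) {
        String bell = "a\u0007b";  // contains the BEL control character
        String cafe = "caf\u00E9"; // ends with an accented letter

        System.out.println(hasNonPrintable(bell, NON_PRINTABLE_ASCII));   // true
        System.out.println(hasNonPrintable(cafe, NON_PRINTABLE_ASCII));   // true (not ASCII)
        System.out.println(hasNonPrintable(cafe, NON_PRINTABLE_UNICODE)); // false
        System.out.println(hasNonPrintable(bell, NON_PRINTABLE_UNICODE)); // true
    }
}
```

For UTF-8 files with non-English text, the Unicode-aware variant is probably the one you want.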

How about below:
FileReader fileReader = new FileReader(new File("test.txt"));
BufferedReader br = new BufferedReader(fileReader);
String line = null;
// if no more lines the readLine() returns null
while ((line = br.readLine()) != null) {
// reading lines until the end of the file
}
Source: http://devmain.blogspot.co.uk/2013/10/java-quick-way-to-read-or-write-to-file.html

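I can find the following ways to do it:
private static final String fileName = "C:/Input.txt";
public static void main(String[] args) throws IOException {
    // 1. Files.lines streams the file lazily (decoded as UTF-8); close the stream when done
    try (Stream<String> lines = Files.lines(Paths.get(fileName))) {
        String[] asArray = lines.toArray(String[]::new);
    }
    // 2. Files.readAllLines loads the whole file into a list (decoded as UTF-8)
    List<String> readAllLines = Files.readAllLines(Paths.get(fileName));
    readAllLines.forEach(s -> System.out.println(s));
    // 3. Scanner(File) uses the platform default charset
    File file = new File(fileName);
    try (Scanner scanner = new Scanner(file)) {
        while (scanner.hasNextLine()) {
            System.out.println(scanner.nextLine());
        }
    }
}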

The answer by @T.J.Crowder is for Java 6. In Java 7 the valid answer is the one by @McIntosh, though its use of Charset.forName for UTF-8 is discouraged in favor of StandardCharsets.UTF_8:
List<String> lines = Files.readAllLines(Paths.get("/tmp/test.csv"),
StandardCharsets.UTF_8);
for(String line: lines){ /* DO */ }
This closely resembles the Guava way posted by Skeet above, and of course the same caveats apply. That is, for big files (Java 7), read line by line instead:
try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
    for (String line = reader.readLine(); line != null; line = reader.readLine()) { /* process line */ }
}

If every char in the file is properly encoded in UTF-8, you won't have any problem reading it using a reader with the UTF-8 encoding. Up to you to check every char of the file and see if you consider it printable or not.
