Read a UTF-8 encoded text file from the internet in Java

I want to read an xml file from the internet. You can find it here.
The problem is that it is encoded in UTF-8 and I need to store it in a file in order to parse it later. I have already read a lot of topics about this, and here is what I came up with:
BufferedReader in;
String readLine;
try
{
in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
BufferedWriter out = new BufferedWriter(new FileWriter(file));
while ((readLine = in.readLine()) != null)
out.write(readLine+"\n");
out.close();
}
catch (UnsupportedEncodingException e)
{
e.printStackTrace();
}
catch (IOException e)
{
e.printStackTrace();
}
This code works until this line: <title>Chérie FM</title>
When I debug, I get this: <title>Ch�rie FM</title>
Obviously, there is something I fail to understand, but it seems to me that I followed the code I saw on several websites.

This file is not encoded as UTF-8, it's ISO-8859-1.
By changing your code to:
BufferedReader in;
String readLine;
try
{
in = new BufferedReader(new InputStreamReader(url.openStream(), "ISO-8859-1"));
BufferedWriter out = new BufferedWriter(new OutputStreamWriter( new FileOutputStream(file) , "UTF-8"));
while ((readLine = in.readLine()) != null)
out.write(readLine+"\n");
out.flush();
out.close();
}
catch (UnsupportedEncodingException e)
{
e.printStackTrace();
}
catch (IOException e)
{
e.printStackTrace();
}
You should have the expected result.
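To see why the original code produced the � character, here is a minimal sketch (the class and method names are made up for illustration). It encodes the title as ISO-8859-1 bytes and then decodes those bytes once with the wrong charset and once with the right one. In ISO-8859-1, é is the single byte 0xE9, which is not a valid UTF-8 sequence on its own, so a UTF-8 decoder substitutes U+FFFD.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeMismatchDemo {
    // Encodes the text as ISO-8859-1 bytes, then decodes those bytes with the given charset.
    static String latin1BytesDecodedAs(String text, Charset decodeWith) {
        byte[] raw = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, decodeWith);
    }

    public static void main(String[] args) {
        // "Chérie FM", written with an escape so the source file encoding does not matter
        String title = "Ch\u00e9rie FM";

        // Wrong charset: the lone byte 0xE9 is malformed UTF-8 and is replaced by U+FFFD
        String wrong = latin1BytesDecodedAs(title, StandardCharsets.UTF_8);
        System.out.println(wrong.equals("Ch\uFFFDrie FM"));

        // Matching charset: the text survives the round trip
        String right = latin1BytesDecodedAs(title, StandardCharsets.ISO_8859_1);
        System.out.println(right.equals(title));
    }
}
```

Both comparisons print true, which matches the symptom in the question: the reader's charset has to match the bytes the server actually sends, not the encoding you want to end up with.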

If you need to write a file in a given encoding, use FileOutputStream instead of FileWriter (FileWriter(File) uses the platform default encoding) and encode the bytes yourself.
in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
FileOutputStream out = new FileOutputStream(file);
while ((readLine = in.readLine()) != null)
out.write((readLine+"\n").getBytes("UTF-8"));
out.close();

Related

Java socket can't reply outside of the loop

I have a Java server that listens to connections from a PHP client and replies back. My problem is I can't write anything to outputStream after reading the inputStream.
while (true)
try {
clientSocket = serverSocket.accept();
clientSocket.setSoTimeout(2000);
if (!clientSocket.getInetAddress().equals(clientSocket.getLocalAddress())) {
clientSocket.close();
continue;
}
BufferedReader br = new BufferedReader(new InputStreamReader(clientSocket.getInputStream()));
BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(clientSocket.getOutputStream()));
LinkedList<String> messageFromPHP = new LinkedList<>();
String message = "";
while ((message = br.readLine()) != null)
messageFromPHP.add(message);
bw.write("test_message\n");
bw.flush();
bw.close();
br.close();
clientSocket.close();
} catch (SocketTimeoutException ex) {
} catch (IOException ex) {
ex.printStackTrace();
serverSocket.close();
}
^^ This makes both the PHP client and Java Server stuck forever. (I have added SoTimeout to the server to prevent that)
while (true)
try {
clientSocket = serverSocket.accept();
clientSocket.setSoTimeout(2000);
if (!clientSocket.getInetAddress().equals(clientSocket.getLocalAddress())) {
clientSocket.close();
continue;
}
BufferedReader br = new BufferedReader(new InputStreamReader(clientSocket.getInputStream()));
BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(clientSocket.getOutputStream()));
LinkedList<String> messageFromPHP = new LinkedList<>();
String message = "";
while ((message = br.readLine()) != null) {
messageFromPHP.add(message);
bw.write("test_message\n");
bw.flush();
}
bw.close();
br.close();
clientSocket.close();
} catch (SocketTimeoutException ex) {
} catch (IOException ex) {
ex.printStackTrace();
serverSocket.close();
}
^^ However this one works perfectly fine and I don't know why. I'm sure these while loops end because I can print messageFromPHP without a problem with both codes.
So, how can I avoid doing everything inside the readLine loop?
Edit: To make things more clear: I want to write and read like the first code. But when I'm reading the input, I can't write to output so I have to use the second code and I don't want to. I'm trying to store the input inside the messageFromPHP list and write to output after that according to the input in the list.
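The PHP side isn't shown, so this is only a guess at the cause: readLine() returns null only at end of stream, that is, once the client has closed or shut down its sending side. If the client keeps the connection open while it waits for the reply, the first loop has nothing to end it except the timeout. One common way around this is to agree on an end-of-request marker. Here is a minimal sketch, assuming (hypothetically) that the client sends an empty line after its last message. The class and method names are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class RequestReader {
    // Reads lines until an empty line (the assumed end-of-request marker) or end of stream.
    static List<String> readRequest(BufferedReader in) {
        List<String> lines = new ArrayList<>();
        try {
            String line;
            while ((line = in.readLine()) != null && !line.isEmpty()) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // A StringReader stands in for the socket input stream here
        BufferedReader in = new BufferedReader(
                new StringReader("cmd1\ncmd2\n\nnot part of this request\n"));
        System.out.println(readRequest(in)); // prints [cmd1, cmd2]
    }
}
```

With something like this, the server can collect the whole request into the list first and write its reply afterwards, outside the read loop.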

Android: BufferedReader returns wrong data after second time

I am working on an android app which uses AsyncTasks in order to get JSON data from an applications API. When I start my app, everything goes well and the app gets the right information out of the API.
I implemented ActionBar pull-to-refresh library so people can drag down my listview to refresh their data. Now my app crashes on this point.
Instead of receiving any text, my BufferedReader.readline returns strings like this.
���ĥS��Zis�8�+(m��L�ޔ�i}�l�V�8��$AI0��(YN�o�lI�,9cO�V͇� $��F���f~4r֧D4>�?4b�Տ��P#��|xK#h�����`�4#H,+Q�7��L�
Every time my app wants to receive data, a new AsyncTask is created, so I don't know why my reader returns something like that...
I hope you guys can give me any idea on how to fix this!
EDIT: This is how I get my data.
BufferedReader reader = null;
try {
reader = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
} catch (IOException e1) {
e1.printStackTrace();
}
String s = null;
String data = "";
try {
while ((s = reader.readLine()) != null)
{
data += s;
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
I just had the same issue. I found out that the returned HTML might be compressed into a GZIP format. Use something like this to check for encoding, and use the appropriate streams to decode the content:
URL urlObj = new URL(url);
URLConnection conn = urlObj.openConnection();
String encoding = conn.getContentEncoding();
InputStream is = conn.getInputStream();
InputStreamReader isr = null;
if (encoding != null && encoding.equals("gzip")) {
isr = new InputStreamReader(new GZIPInputStream(is));
} else {
isr = new InputStreamReader(is);
}
reader = new BufferedReader(isr);
And so forth.
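For what it's worth, the check above can be tried without a network connection. The sketch below is only an illustration (the class and helper names are made up): it compresses a string in memory, then reads it back through the same "gzip or not" decision. It also passes UTF-8 explicitly to the InputStreamReader, which is an assumption based on the question's code, since the snippet above falls back to the platform default.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipAwareReader {
    // Unwraps gzip when contentEncoding says so, then decodes the bytes as UTF-8.
    static String readAll(InputStream body, String contentEncoding) {
        try {
            InputStream in = "gzip".equalsIgnoreCase(contentEncoding)
                    ? new GZIPInputStream(body)
                    : body;
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(in, StandardCharsets.UTF_8))) {
                StringBuilder sb = new StringBuilder();
                char[] buf = new char[4096];
                int n;
                while ((n = reader.read(buf)) != -1) {
                    sb.append(buf, 0, n);
                }
                return sb.toString();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Test helper: gzip-compresses a string in memory.
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String json = "{\"ok\":true}";
        System.out.println(readAll(new ByteArrayInputStream(gzip(json)), "gzip"));
        System.out.println(readAll(
                new ByteArrayInputStream(json.getBytes(StandardCharsets.UTF_8)), null));
    }
}
```

Both calls print the same JSON text. In the real code, the second argument would come from conn.getContentEncoding().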
Have you tested another encoding, like BufferedReader br = new BufferedReader(new InputStreamReader(socket.getInputStream(), "UTF-8"));
You can check all the available encodings on this web page: enconding.doc

How to download/read html file via ftp url?

I am having trouble getting the html text from this html file via ftp. I use beautiful soup to read an html file via http/https but for some reason I cannot download/read from an ftp. Please help!
Here is the url.
a link
Here is my code so far.
BufferedReader reader = null;
String total = "";
String line;
String ur = "ftp://ftp.legis.state.tx.us/bills/832/billtext/html/house_resolutions/HR00001_HR00099/HR00014I.htm";
try {
URL url = new URL(ur);
URLConnection urlc = url.openConnection();
InputStream is = urlc.getInputStream(); // To download
reader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
while ((line = reader.readLine()) != null)
total += reader.readLine();
} finally {
if (reader != null)
try { reader.close();
} catch (IOException logOrIgnore) {}
}
This code works for me on Java 1.7.0_25. Notice that you were storing only every second line, because you called reader.readLine() both in the condition and in the body of the while loop.
public static void main(String[] args) throws MalformedURLException, IOException {
BufferedReader reader = null;
String total = "";
String line;
String ur = "ftp://ftp.legis.state.tx.us/bills/832/billtext/html/house_resolutions/HR00001_HR00099/HR00014I.htm";
try {
URL url = new URL(ur);
URLConnection urlc = url.openConnection();
InputStream is = urlc.getInputStream(); // To download
reader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
while ((line = reader.readLine()) != null) {
total += line;
}
} finally {
if (reader != null) {
try {
reader.close();
} catch (IOException logOrIgnore) {
}
}
}
}
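The every-second-line effect is easy to reproduce without FTP. This is just a small sketch with made-up names, using a StringReader in place of the network stream, that runs both loop shapes over the same four lines.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReadLineTwiceDemo {
    // Loop shape from the question: readLine() is called in the condition AND in the body.
    static String readSkipping(String text) {
        BufferedReader reader = new BufferedReader(new StringReader(text));
        String total = "";
        String line;
        try {
            while ((line = reader.readLine()) != null) {
                total += reader.readLine(); // consumes a second line, the first is dropped
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    // Corrected loop shape: the line read in the condition is the one that is stored.
    static String readAll(String text) {
        BufferedReader reader = new BufferedReader(new StringReader(text));
        StringBuilder total = new StringBuilder();
        String line;
        try {
            while ((line = reader.readLine()) != null) {
                total.append(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total.toString();
    }

    public static void main(String[] args) {
        String text = "a\nb\nc\nd";
        System.out.println(readSkipping(text)); // prints bd
        System.out.println(readAll(text));      // prints abcd
    }
}
```

Note that both versions drop the line terminators, just like the code above. Append "\n" yourself if you need to keep the line structure of the HTML.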
At first I thought this was related to a wrong path resolution, as discussed here, but that does not help.
I don't know exactly what is going wrong here, but I can only reproduce this error on this FTP server with the MacOS Java 1.6.0_33-b03-424. I can't reproduce it with Java 1.7.0_25, so perhaps you should check for a Java update.
Or you could use commons FTPClient to retrieve the file:
FTPClient client = new FTPClient();
client.connect("ftp.legis.state.tx.us");
client.enterLocalPassiveMode();
client.login("anonymous", "");
client.changeWorkingDirectory("bills/832/billtext/html/house_resolutions/HR00001_HR00099");
InputStream is = client.retrieveFileStream("HR00014I.htm");

Skip creating file in FileOutputStream when there is no data in Inputstream

This is a logging function which logs the error stream from the execution of an external program. Everything works fine, but I do not want to generate the log file when there is no data in the error stream. Currently it creates a zero-size file. Please help.
FileOutputStream fos = new FileOutputStream(logFile);
PrintWriter pw = new PrintWriter(fos);
Process proc = Runtime.getRuntime().exec(externalProgram);
InputStreamReader isr = new InputStreamReader(proc.getErrorStream());
BufferedReader br = new BufferedReader(isr);
String line=null;
while ( (line = br.readLine()) != null)
{
if (pw != null){
pw.println(line);
pw.flush();
}
}
Thank you.
Simply defer the creating of the FileOutputStream and PrintWriter until you need it:
PrintWriter pw = null;
Process proc = Runtime.getRuntime().exec(externalProgram);
InputStreamReader isr = new InputStreamReader(proc.getErrorStream());
BufferedReader br = new BufferedReader(isr);
String line;
while ( (line = br.readLine()) != null)
{
if (pw == null)
{
pw = new PrintWriter(new FileOutputStream(logFile));
}
pw.println(line);
pw.flush();
}
Personally I'm not a big fan of PrintWriter - the fact that it just swallows all exceptions concerns me. I'd also use OutputStreamWriter so that you can explicitly specify the encoding. Anyway, that's aside from the real question here.
The obvious thing to do is to change
FileOutputStream fos = new FileOutputStream(logFile);
PrintWriter pw = new PrintWriter(fos);
....
if (pw != null){
...
}
to
FileOutputStream rawLog = null;
try {
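PrintWriter log = null;
....
if (log == null) {
rawLog = new FileOutputStream(logFile);
// PrintWriter has no (OutputStream, String) constructor, so wrap the stream in a Writer
log = new PrintWriter(new OutputStreamWriter(rawLog, "UTF-8"));
}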
}
...
} finally {
// Thou shalt close thy resources.
// Icky null check - might want to split this using the Execute Around idiom.
if (rawLog != null) {
rawLog.close();
}
}
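Putting the two answers together, here is a minimal self-contained sketch of the "create the file on the first line" idea. It is an illustration rather than a drop-in replacement: the class and method names are made up, a plain BufferedReader stands in for the process error stream, and it uses a BufferedWriter from java.nio.file.Files (Java 7+) with an explicit UTF-8 encoding instead of PrintWriter.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class LazyLog {
    // Copies every line to logFile, creating the file only once the first line arrives.
    // Returns true if the file was written, false if the source was empty.
    static boolean copyLines(BufferedReader source, Path logFile) {
        BufferedWriter log = null;
        try {
            try {
                String line;
                while ((line = source.readLine()) != null) {
                    if (log == null) {
                        log = Files.newBufferedWriter(logFile, StandardCharsets.UTF_8);
                    }
                    log.write(line);
                    log.newLine();
                }
            } finally {
                if (log != null) {
                    log.close(); // flushes buffered text before closing the file
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return log != null;
    }

    public static void main(String[] args) {
        String tmp = System.getProperty("java.io.tmpdir");
        Path empty = new File(tmp, "lazylog-empty-" + System.nanoTime() + ".log").toPath();
        Path full = new File(tmp, "lazylog-full-" + System.nanoTime() + ".log").toPath();

        System.out.println(copyLines(new BufferedReader(new StringReader("")), empty));
        System.out.println(Files.exists(empty));
        System.out.println(copyLines(new BufferedReader(new StringReader("boom\n")), full));
        System.out.println(Files.exists(full));
    }
}
```

For an empty source no file appears at all. In the real code the reader would be the one wrapped around proc.getErrorStream().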

java: how to convert a file to utf8

I have a file that contains some non-UTF-8 characters (for example, it is encoded in ISO-8859-1), so I want to convert that file to UTF-8 (or read it as such). How can I do that?
The code looks like this:
File file = new File("some_file_with_non_utf8_characters.txt");
/* some code to convert the file to an utf8 file */
...
edit: Put an encoding example
The following code converts a file from srcEncoding to tgtEncoding:
public static void transform(File source, String srcEncoding, File target, String tgtEncoding) throws IOException {
BufferedReader br = null;
BufferedWriter bw = null;
try{
br = new BufferedReader(new InputStreamReader(new FileInputStream(source),srcEncoding));
bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(target), tgtEncoding));
char[] buffer = new char[16384];
int read;
while ((read = br.read(buffer)) != -1)
bw.write(buffer, 0, read);
} finally {
try {
if (br != null)
br.close();
} finally {
if (bw != null)
bw.close();
}
}
}
--EDIT--
Using Try-with-resources (Java 7):
public static void transform(File source, String srcEncoding, File target, String tgtEncoding) throws IOException {
try (
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(source), srcEncoding));
BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(target), tgtEncoding)); ) {
char[] buffer = new char[16384];
int read;
while ((read = br.read(buffer)) != -1)
bw.write(buffer, 0, read);
}
}
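To check what such a conversion does at the byte level, here is a small in-memory sketch (made-up names, only suitable for input that fits in memory). It decodes the bytes with the source charset and re-encodes them with the target charset, which is the same thing the reader/writer pair above does in streaming form.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class TranscodeDemo {
    // Decodes the input with the source charset, then re-encodes it with the target charset.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    // Formats bytes as space-separated hex, e.g. "C3 A9".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] latin1 = { (byte) 0xE9 }; // "é" in ISO-8859-1
        byte[] utf8 = transcode(latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(hex(utf8)); // prints C3 A9
    }
}
```

The single Latin-1 byte E9 becomes the two-byte UTF-8 sequence C3 A9, so a converted file with accented characters will be slightly larger than the original.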
String charset = "ISO-8859-1"; // or what corresponds
BufferedReader in = new BufferedReader(
new InputStreamReader (new FileInputStream(file), charset));
String line;
while( (line = in.readLine()) != null) {
....
}
There you have the text decoded. You can write it out with the symmetric Writer/OutputStream classes, using the encoding you prefer (e.g. UTF-8).
You need to know the encoding of the input file. For example, if the file is in Latin-1, you would do something like this,
FileInputStream fis = new FileInputStream("test.in");
InputStreamReader isr = new InputStreamReader(fis, "ISO-8859-1");
Reader in = new BufferedReader(isr);
FileOutputStream fos = new FileOutputStream("test.out");
OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF-8");
Writer out = new BufferedWriter(osw);
int ch;
while ((ch = in.read()) > -1) {
out.write(ch);
}
out.close();
in.close();
You only want to read it as UTF-8?
What I did recently given a similar problem is to start the JVM with -Dfile.encoding=UTF-8, and reading/printing as normal. I don't know if that is applicable in your case.
With that option:
System.out.println("á é í ó ú");
prints the characters correctly. Otherwise it prints a ? symbol.
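If changing the JVM flag is not an option, another possibility is to give the output stream an explicit encoding. This is only a sketch with made-up names: it wraps an OutputStream in a PrintStream that encodes as UTF-8, regardless of file.encoding. Whether the characters then display correctly still depends on the terminal expecting UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class Utf8PrintDemo {
    // Prints the text through a UTF-8 PrintStream and returns the bytes that were produced.
    static byte[] printAsUtf8(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, "UTF-8");
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every Java platform
            throw new IllegalStateException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // "á" (U+00E1) is encoded as two bytes in UTF-8
        System.out.println(printAsUtf8("\u00e1").length); // prints 2

        // The same wrapper around System.out, independent of -Dfile.encoding
        PrintStream console = new PrintStream(System.out, true, "UTF-8");
        console.println("\u00e1 \u00e9 \u00ed \u00f3 \u00fa");
    }
}
```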
