I am trying to download a web page with all its resources. First I download the HTML, and to be sure to keep the file formatted I use the function below.
There is an issue: I found '10' in the final file, and it turns out that 10 is the character code of LF, the line-feed escape. This breaks my JavaScript functions.
Example of the final result :
<!DOCTYPE html>10<html lang="fr">10 <head>10 <meta http-equiv="content-type" content="text/html; charset=UTF-8" />10
Can someone help me find the real issue?
public static String scanfile(File file) {
    StringBuilder sb = new StringBuilder();
    try {
        BufferedReader bufferedReader = new BufferedReader(new FileReader(file));
        while (true) {
            String readLine = bufferedReader.readLine();
            if (readLine != null) {
                sb.append(readLine);
                sb.append(System.lineSeparator());
                Log.i(TAG, sb.toString());
            } else {
                bufferedReader.close();
                return sb.toString();
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
        return null;
    }
}
There are multiple problems with your code.
Charset error
BufferedReader bufferedReader = new BufferedReader(new FileReader(file));
This will fail in subtle ways.
Files (and, for that matter, data given to you by webservers) come in bytes: a stream of numbers, each number being between 0 and 255.
So, if you are a webserver and you want to send the character ö, what byte(s) do you send?
The answer is complicated. The mapping that explains how some character is rendered in byte(s)-form is called a character set encoding (shortened to 'charset').
Anytime bytes are turned into characters or vice versa, there is always a charset involved. Always.
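To make that concrete: the two bytes 0xC3 0xB6 are the UTF-8 form of ö, but decoded with a different charset they become different characters entirely. A minimal demonstration (using java.nio.charset.StandardCharsets):
byte[] bytes = { (byte) 0xC3, (byte) 0xB6 }; // UTF-8 encoding of 'ö'
System.out.println(new String(bytes, StandardCharsets.UTF_8));      // ö
System.out.println(new String(bytes, StandardCharsets.ISO_8859_1)); // Ã¶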
So, you're reading a file (that'd be bytes), and turning it into a Reader (which is chars). Thus, charset is involved.
Which charset? The API of new FileReader(path) explains which one: "The system default". You do not want that.
Thus, this code is broken. You want one of two things:
Option 1 - write the data as is
When doing the job of querying the webserver for the data and relaying this information onto disk, you'd want to just store the bytes (after all, webserver gives bytes, and disks store bytes, that's easy), but the webserver also sends the encoding, in a header, and you need to save this separately. Because to read that 'sack of bytes', you need to know the charset to turn it into characters.
How would you do this? Well, up to you. You could for example decree that the data file starts with the name of a charset encoding (as sent via that header), then a 0 byte, and then the data, unmodified. I think you should go with option 2, however
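A minimal sketch of that layout, where charsetNameFromHeader and rawBodyBytes are assumed placeholders for the header value and the unmodified body:
try (OutputStream out = Files.newOutputStream(Paths.get("page.dat"))) {
    out.write(charsetNameFromHeader.getBytes(StandardCharsets.US_ASCII));
    out.write(0);            // the 0-byte separator
    out.write(rawBodyBytes); // the webserver's bytes, unmodified
}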
Option 2 - re-encode as UTF-8
Another, better option for text-based documents (which HTML is) is this: when reading the data, convert it to characters using the encoding that the header tells you. Then, to save it to disk, turn the chars back into bytes using UTF-8, which is a great encoding and an industry standard. That way, when reading it back, you just know it's UTF-8, period.
To read a UTF-8 text file, you do:
Files.newBufferedReader(Paths.get(file));
The reason this works is that the Files API, unlike most other APIs (and unlike FileReader, which you should never use), defaults to UTF-8 and not to the platform default. If you want, you can make it more explicit:
Files.newBufferedReader(Paths.get(file), StandardCharsets.UTF_8);
Same thing, but now it is clear in the code what's happening.
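For the write side of option 2, the same Files API does the encoding for you. A sketch, where charsetFromHeader and rawBodyBytes are assumed placeholders for the charset name the webserver sent and the bytes it gave you:
String pageText = new String(rawBodyBytes, Charset.forName(charsetFromHeader));
Files.write(Paths.get("page.html"), pageText.getBytes(StandardCharsets.UTF_8));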
Broken exception handling
} catch (IOException e) {
e.printStackTrace();
return null;
}
This is not okay: if you catch an exception, either [A] throw something else, or [B] actually handle the problem. 'Log it and keep going' is definitely not handling it. With this strategy, one error turns into a thousand things going wrong with a thousand stack traces, and all of them except the first are undesired and irrelevant. That is why this is horrible code and you should never write it this way.
The easy solution is to just put throws IOException on your scanFile method. The method inherently interacts with files; it SHOULD be throwing that. Note that your public static void main(String[] args) method can, and usually should, be declared to throws Exception.
It also makes your code simpler and shorter, yay!
Resource Management failure
A FileReader is a resource. You MUST close it, no matter what happens. You are not doing that: if .readLine() throws an exception, your code jumps to the catch handler and bufferedReader.close() is never executed.
The solution is to use the ARM (Automatic Resource Management) construct, also known as try-with-resources:
try (var br = Files.newBufferedReader(Paths.get(file), StandardCharsets.UTF_8)) {
// code goes here
}
This construct ensures that close() is invoked, regardless of how the 'code goes here' block exits. Even if it 'exits' via an exception or a return statement.
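Pulling the three fixes together, the method would look something like this sketch:
public static String scanFile(String path) throws IOException {
    StringBuilder sb = new StringBuilder();
    try (var br = Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8)) {
        String line;
        while ((line = br.readLine()) != null) {
            sb.append(line).append(System.lineSeparator());
        }
    }
    return sb.toString();
}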
The problem
Your 'read a file and print it' code is, apart from the above three items, mostly fine. The problem is that the HTML file on disk is corrupted; the error lies in the code that reads the data from the web server and saves it to disk, which you did not paste.
Specifically, System.lineSeparator() returns the actual separator string (such as \n or \r\n), not the text '10'. Thus, assuming the code you pasted really is the code you are running, if you are seeing a literal '10' show up, it is already in the HTML file on disk. It's not the read code.
Closing thoughts
More generally, the job of 'read a file on disk with a known encoding' can be done in far fewer lines of code:
public static String scanFile(String path) throws IOException {
return Files.readString(Paths.get(path));
}
You should just use the above code instead. It's simple, short, doesn't have any bugs, cannot leak resources, has proper exception handling, and will use UTF-8.
Actually, there was no problem in this function. I was mistakenly appending the 10 in another function in my code.
Related
I need to read text from a file and, for instance, print it to the console. The file is in UTF-8. It seems that I'm doing something wrong because some Russian characters are printed incorrectly. What's wrong with my code?
StringBuilder content = new StringBuilder();
try (FileChannel fChan = (FileChannel) Files.newByteChannel(Paths.get("D:/test.txt"))) {
    ByteBuffer byteBuf = ByteBuffer.allocate(16);
    Charset charset = Charset.forName("UTF-8");
    while (fChan.read(byteBuf) != -1) {
        byteBuf.flip();
        content.append(new String(byteBuf.array(), charset));
        byteBuf.clear();
    }
    System.out.println(content);
}
The result:
Здравствуйте, как поживае��е?
Это п��имер текста на русском яз��ке.ом яз�
The actual text:
Здравствуйте, как поживаете?
Это пример текста на русском языке.
UTF-8 uses a variable number of bytes per character. This gives you a boundary error: you have mixed buffer-based code with byte-array-based code, and you can't do that here. It is possible to read enough bytes to be stuck halfway into a character; you then turn your input into a byte array and convert it, which fails, because you can't convert half a character. (You are also converting the entire backing array each time, even when read() filled only part of it, which is why the tail of a previous chunk reappears in your output.)
What you really want is either to read ALL the data first and then convert the entire input, or to keep any half-characters in the byte buffer when you flip back, or, better yet, to ditch all this stuff and use code that is written to read actual characters. In general, the channel API complicates matters a ton; it's flexible, but complicated - that's how it goes.
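For reference, 'keeping the half-characters in the buffer' is what CharsetDecoder does for you. A sketch of the fixed channel loop (java.nio imports omitted, matching the snippet above):
StringBuilder content = new StringBuilder();
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
ByteBuffer byteBuf = ByteBuffer.allocate(16);
CharBuffer charBuf = CharBuffer.allocate(32);
try (FileChannel fChan = (FileChannel) Files.newByteChannel(Paths.get("D:/test.txt"))) {
    while (fChan.read(byteBuf) != -1) {
        byteBuf.flip();
        decoder.decode(byteBuf, charBuf, false); // leaves any incomplete character in byteBuf
        charBuf.flip();
        content.append(charBuf);
        charBuf.clear();
        byteBuf.compact(); // moves the leftover bytes to the front for the next read
    }
    byteBuf.flip();
    decoder.decode(byteBuf, charBuf, true); // signal end of input
    decoder.flush(charBuf);
    charBuf.flip();
    content.append(charBuf);
}
System.out.println(content);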
Unless you can explain why you need it, don't use it. Do this instead:
Path target = Paths.get("D:/test.txt");
try (var reader = Files.newBufferedReader(target)) {
// read a line at a time here. Yes, it will be UTF-8 decoded.
}
or better yet, as you apparently want to read the whole thing in one go:
Path target = Paths.get("D:/test.txt");
var content = Files.readString(target);
NB: Unlike most java methods that convert bytes to chars or vice versa, the Files API defaults to UTF-8 (instead of the useless and dangerous, untestable-bug-causing 'platform default encoding' that most java API does). That's why this last incredibly simple code is nevertheless correct.
The code below gets a byte array from an HTTP request and saves it in bytes[]; the final data is saved in message[].
I check whether it contains a header by converting it to a String. If it does, I read some information from the header, then cut the header off by saving the bytes after it to message[].
I then try to write message[] to a file using FileOutputStream. It sort of works, but only saves 10 KB of information (one iteration of the while loop); it seems to be overwriting. If I use FileOutputStream(file, true) to append, it works... once; the next time I run the program the file just gets added onto again, which isn't what I want. How do I write multiple chunks of bytes to the same file across iterations of the loop, but still overwrite the file completely when I run the program again?
byte bytes[] = new byte[10 * 1024];
while (dis.read(bytes) > 0) {
    // Set all the bytes to the message
    byte message[] = bytes;
    String string = new String(bytes, "UTF-8");
    // Does bytes contain a header?
    if (string.contains("\r\n\r\n")) {
        String theByteString[] = string.split("\r\n\r\n");
        String theHeader = theByteString[0];
        String[] lmTemp = theHeader.split("Last-Modified: ");
        String[] lm = lmTemp[1].split("\r\n");
        String lastModified = lm[0];
        // Cut off the header and save the rest of the data after it
        message = theByteString[1].getBytes("UTF-8");
        // cache
        hm.put(url, lastModified);
    }
    // Output message[] to file.
    File f = new File(hostName + path);
    f.getParentFile().mkdirs();
    f.createNewFile();
    try (FileOutputStream fos = new FileOutputStream(f)) {
        fos.write(message);
    } catch (IOException ioe) {
        ioe.printStackTrace();
    }
}
You're opening a new FileOutputStream on each iteration of the loop. Don't do that. Open it outside the loop, then loop and write as you are doing, then close at the end of the loop. (If you use a try-with-resources statement with your while loop inside it, that'll be fine.)
That's only part of the problem though - you're also doing everything else on each iteration of the loop, including checking for headers. That's going to be a real problem if the byte array you read contains part of the set of headers, or indeed part of the header separator.
Additionally, as noted by EJP, you're ignoring the return value of read apart from using it to tell whether or not you're done. You should always use the return value of read to know how much of the byte array is actually usable data.
Fundamentally, you either need to read the whole response into a byte array to start with - which is easy to do, but potentially inefficient in memory - or accept the fact that you're dealing with a stream, and write more complex code to detect the end of the headers.
Better though, IMO, would be to use an HTTP library which already understands all this header processing, so that you don't need to do it yourself. Unless you're writing a low-level HTTP library yourself, you shouldn't be dealing with low-level HTTP details, you should rely on a good library.
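Even the JDK's built-in HttpURLConnection does the header parsing for you. A sketch, reusing url, f, and hm from your code:
HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
String lastModified = conn.getHeaderField("Last-Modified"); // no manual splitting
hm.put(url, lastModified);
try (InputStream in = conn.getInputStream();           // body only, headers already consumed
     FileOutputStream fos = new FileOutputStream(f)) {
    byte[] buf = new byte[8192];
    int n;
    while ((n = in.read(buf)) > 0) {
        fos.write(buf, 0, n);
    }
}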
Open the file ahead of the loop.
NB: you need to store the result of read() in a variable, and pass that variable to new String() as the length. Otherwise you are converting junk in the buffer beyond what was actually read.
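Putting both points together, the write loop becomes something like this sketch:
try (FileOutputStream fos = new FileOutputStream(f)) { // opened once; overwrites on each program run
    byte[] bytes = new byte[10 * 1024];
    int count;
    while ((count = dis.read(bytes)) > 0) {
        String string = new String(bytes, 0, count, "UTF-8"); // convert only the bytes actually read
        // ... header detection as before, with the caveats from the other answer ...
        fos.write(bytes, 0, count);
    }
}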
There is an issue with reading the data: you read only part of the response (because at that moment not all of the data had been transferred to you yet), so obviously you write only that part.
Check this answer for how to read the full data from the InputStream:
Convert InputStream to byte array in Java
We have a requirement to pick up data from an Oracle DB table and dump it into a CSV file and a plain pipe-separated text file, then give the user a link in the application so they can view the generated CSV/text files.
As a lot of parsing was involved, we wrote a Unix shell script and call it from our Struts/J2EE application.
Earlier we were losing the Chinese and Roman characters in the generated files, and the generated files had the us-ascii charset (checked using: file -i). Later we used NLS_LANG=AMERICAN_AMERICA.AL32UTF8 and this gave us UTF-8 files.
But the characters were still gibberish, so we then tried the iconv command and converted the UTF-8 files to the UTF-16LE charset:
iconv -f utf-8 -t utf-16le $recordFile > $tempFile
This works fine for the generated text file. But with the CSV, the Chinese and Roman characters are still not correct. If we open this CSV file in Notepad, add a newline by pressing the Enter key, save it, and open it with MS-Excel, all characters come out fine, including the Chinese and Roman ones, but now the text is on a single line for each row instead of in columns.
Not sure what's going on.
Java code
PrintWriter out = servletResponse.getWriter();
servletResponse.setContentType("application/vnd.ms-excel; charset=UTF-8");
servletResponse.setCharacterEncoding("UTF-8");
servletResponse.setHeader("Content-Disposition", "attachment; filename=" + fileName.toString());
FileInputStream fileInputStream = new FileInputStream(fileLoc + fileName);
int i;
while ((i = fileInputStream.read()) != -1) {
    out.write(i);
}
fileInputStream.close();
out.close();
Please let me know if I missed any details.
Thanks to all for taking out time to go through this.
Was able to solve it. First, as mentioned by Aaron, I removed the UTF-16LE encoding to avoid future issues and encoded the files as UTF-8. Changing the PrintWriter in the Java code to an OutputStream let me see the correct characters in my text file.
The CSV was still showing garbage. It turns out we need to prepend the bytes EF BB BF at the beginning of the file, since BOM-aware software like MS-Excel needs them. Changing the Java code as below did the trick for the CSV.
OutputStream out = servletResponse.getOutputStream();
out.write(239); // 0xEF
out.write(187); // 0xBB
out.write(191); // 0xBF
FileInputStream fileInputStream = new FileInputStream(fileLoc + fileName);
int i;
while ((i = fileInputStream.read()) != -1) {
    out.write(i);
}
fileInputStream.close();
out.flush();
out.close();
As always with Unicode problems, every single step of the transformation chain must work perfectly. If you make a mistake in one place, data will be silently corrupted. There is no easy way to figure out where it happens, you have to debug the code or write unit tests.
The Java code above only works if the file actually contains UTF-8 encoded data; it doesn't "magically" figure out what's in the file and converts it to UTF-8. So if the file already contains garbage, you just slap a "this is UTF-8" label on it but it's still garbage.
That means for you that you need to create test cases which take known test data and move that through every step of the chain: Inserting into database, reading from the database, writing to CSV, writing to the text file, reading those files and download to the user.
For each step, you need to write unit tests which take a known Unicode string like abc öäü, process it, and then check the result. To make it easier to type in Java code, use "abc \u00f6\u00e4\u00fc". You may also want to add spaces at the beginning and end of the string to see whether they are properly preserved.
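As a sketch of such a test, where writeCsvField is a hypothetical stand-in for whichever step of the chain you are exercising:
@Test
public void csvStepPreservesUnicode() throws Exception {
    String known = " abc \u00f6\u00e4\u00fc ";        // leading/trailing spaces on purpose
    byte[] written = writeCsvField(known);            // hypothetical step under test
    String roundTripped = new String(written, "UTF-8");
    assertEquals(known, roundTripped);
}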
file -i doesn't help you much here since it just makes a guess what the file contains. There is no indicator (data or metadata) in a text file which says "this is UTF-8". UTF-16 supports a BOM header for this but almost no one uses UTF-16, so many tools don't support it (properly).
My team and I have a nasty problem parsing a string received from our server. The server is pretty simple socket stuff done in Qt; here is the sendData function:
void sendData(QTcpSocket *client, QString response) {
    QString text = response.toUtf8();
    QByteArray block;
    QDataStream out(&block, QIODevice::WriteOnly);
    out << (quint32)0;
    out << text;
    out.device()->seek(0);
    out << (quint32)(block.size() - sizeof(quint32));
    try {
        client->write(block);
    }
    catch(...){...
The client is in Java and is also pretty standard socket stuff. Here is where we are now, after trying many different ways of decoding the response from the server:
Socket s;
try {
    s = new Socket(URL, 1987);
    PrintWriter output = new PrintWriter(s.getOutputStream(), true);
    InputStreamReader inp = new InputStreamReader(s.getInputStream(), Charset.forName("UTF-8"));
    BufferedReader rd = new BufferedReader(inp);
    String st;
    while ((st = rd.readLine()) != null) {
        System.out.println(st);
    }...
If a connection is made with the server, it sends the string "Send Handshake", with the size of the string in bytes sent before it, as seen in the first block of code. This notifies the client that it should send authentication to the server. As of now, the string we get from the server looks like this:
������ ��������S��e��n��d�� ��H��a��n��d��s��h��a��k��e
We have used tools such as a string encode/decode tool to try to work out how the string is encoded, but it fails on every configuration.
We are out of ideas as to what encoding this is, if any, or how to fix it.
Any help would be much appreciated.
At a glance, the line where you convert the QString parameter to a Utf8 QByteArray and then back to a QString seems odd:
QString text = response.toUtf8();
When the QByteArray returned by toUtf8() is assigned to text, I think it is assumed that the QByteArray contains an ASCII (char*) buffer.
I'm pretty sure that QDataStream is intended to be used only within Qt. It provides a platform-independent way of serializing data that is then intended to be deserialized with another QDataStream somewhere else. As you noticed, it's including a lot of extra stuff besides your raw data, and that extra stuff is subject to change at the next Qt version. (This is why the documentation suggests including in your stream the version of QDataStream being used ... so it can use the correct deserialization logic.)
In other words, the extra stuff you are seeing is probably meta-data included by Qt and it is not guaranteed to be the same with the next Qt version. From the docs:
QDataStream's binary format has evolved since Qt 1.0, and is likely to continue evolving to reflect changes done in Qt. When inputting or outputting complex types, it's very important to make sure that the same version of the stream (version()) is used for reading and writing.
If you are going to another language, this isn't practical to use. If it is just text you are passing, use a well-known transport mechanism (JSON, XML, ASCII text, UTF-8, etc.) and bypass the QDataStream altogether.
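For example, if the server wrote a plain 4-byte big-endian length followed by the raw UTF-8 bytes (no QDataStream), the Java side could read it with a sketch like this:
DataInputStream in = new DataInputStream(s.getInputStream());
int len = in.readInt();            // 4-byte big-endian length prefix
byte[] payload = new byte[len];
in.readFully(payload);             // block until exactly len bytes have arrived
String message = new String(payload, "UTF-8");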
I have an embedded device which runs Java applications which can among other things serve up XHTML web pages (I could write the pages as something other than XHTML, but I'm aiming for that for now).
When a request for a web page handled by my application is received a method is called in my code with all the information on the request including an output stream to display the page.
On one of my pages I would like to display a (log) file, which can be up to 1 MB in size.
I can display this file unescaped using the following code:
final PrintWriter writer; // Is initialized to a PrintWriter writing to the output stream.
final FileInputStream fis = new FileInputStream(file);
final InputStreamReader inputStreamReader = new InputStreamReader(fis);
try {
    writer.println("<div id=\"log\" style=\"white-space: pre-wrap; word-wrap: break-word\">");
    writer.println("  <pre>");
    int length;
    char[] buffer = new char[1024];
    while ((length = inputStreamReader.read(buffer)) != -1) {
        writer.write(buffer, 0, length);
    }
    writer.println("  </pre>");
    writer.println("</div>");
} finally {
    if (inputStreamReader != null) {
        inputStreamReader.close();
    }
}
This works reasonably well, and displays the entire file within a second or two (an acceptable timeframe).
This file can (and in practice does) contain characters which are invalid in XHTML, most commonly < and >. So I need to find a way to escape these characters.
The first thing I tried was a CDATA section, but as documented here they do not display correctly in IE8.
The second thing I tried was a method like the following:
// Based on code: https://stackoverflow.com/questions/439298/best-way-to-encode-text-data-for-xml-in-java/440296#440296
// Modified to write directly to the stream to avoid creating extra objects.
private static void writeXmlEscaped(PrintWriter writer, char[] buffer, int offset, int length) {
    for (int i = offset; i < offset + length; i++) { // bound is offset + length, per the (offset, length) convention
        char ch = buffer[i];
        boolean controlCharacter = ch < 32;
        boolean unicodeButNotAscii = ch > 126;
        boolean characterWithSpecialMeaningInXML = ch == '<' || ch == '&' || ch == '>';
        if (characterWithSpecialMeaningInXML || unicodeButNotAscii || controlCharacter) {
            writer.write("&#" + (int) ch + ";");
        } else {
            writer.write(ch);
        }
    }
}
This correctly escapes the characters (I was going to expand it to escape HTML-invalid characters if needed), but the web page then takes 15+ seconds to display, and other resources on the page (images, CSS stylesheet) intermittently fail to load (I believe because the requests for them time out while the processor is pegged).
I've tried using a BufferedWriter in front of the PrintWriter as well as changing the buffer size (both for reading the file and for the BufferedWriter) in various ways, with no improvement.
Is there a way to escape all XHTML invalid characters that does not require iterating over every single character in the stream? Failing that is there a way to speed up my code enough to display these files within a couple seconds?
I'll consider reducing the size of the log files if I have to, but I was hoping to make them at least 250-500 KB in size (with 1 MB being ideal).
I already have a method to simply download the log files, but I would like to display them in browser as well for simple troubleshooting/perusal.
If there's a way to set the headers so that IE8/Firefox will simply display the file in browser as a text file I would consider that as an alternative (and have an entire page dedicated to the file with no XHTML of any kind).
EDIT:
After making the change suggested by Cameron Skinner and performance testing it looks like the escaped writing takes about 1.5-2x as long as the block-written version. It's not nothing, but I'm probably not going to be able to get a huge speedup by messing with it.
I may just need to reduce the max size of the log file.
One small change that will (well, might) significantly increase the speed is to change
writer.write("&#" + (int) ch + ";");
to
writer.write("&#");
writer.write((int)ch);
writer.write(";");
String concatenation is relatively expensive: for each concatenation expression Java allocates a temporary string builder and a temporary String, so you are generating extra garbage every time there is a character that needs replacing.
EDIT: One of the comments on another answer is highly relevant: find where the slow bit is first. I'd suggest testing logs that have no characters to be escaped and many characters to be escaped.
I think you should make the suggested change anyway because it costs you only a few seconds of your time.
You can try StringEscapeUtils from commons-lang:
StringEscapeUtils.escapeHtml(writer, string);
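To plug that into your existing read loop, each chunk can be escaped as it is written. A sketch using the commons-lang 2.x signature that takes a Writer (it throws IOException, so the enclosing method must allow that):
int length;
char[] buffer = new char[1024];
while ((length = inputStreamReader.read(buffer)) != -1) {
    StringEscapeUtils.escapeHtml(writer, new String(buffer, 0, length));
}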
One option is for you to serve up the log contents inside of an iframe hosted inside of your web page. The iframe's source could point to a URL that serves up the content as text.