The code below gets a byte array from an HTTP request and saves it in bytes[]; the final data is saved in message[].
I check whether it contains a header by converting it to a String. If it does, I read some information from the header and then cut the header off by saving the bytes after it to message[].
I then try to output message[] to a file using FileOutputStream. It partly works, but only saves 10KB of information, i.e. one iteration of the while loop (it seems to be overwriting). If I use FileOutputStream(file, true) to append the information, it works... once; the next time I run the program the new data is just added onto the existing file, which isn't what I want. How do I write multiple chunks of bytes to the same file, one per iteration, but still overwrite the file completely when I run the program again?
byte bytes[] = new byte[(10*1024)];
while (dis.read(bytes) > 0)
{
    //Set all the bytes to the message
    byte message[] = bytes;
    String string = new String(bytes, "UTF-8");
    //Does bytes contain header?
    if (string.contains("\r\n\r\n")){
        String theByteString[] = string.split("\r\n\r\n");
        String theHeader = theByteString[0];
        String[] lmTemp = theHeader.split("Last-Modified: ");
        String[] lm = lmTemp[1].split("\r\n");
        String lastModified = lm[0];
        //Cut off the header and save the rest of the data after it
        message = theByteString[1].getBytes("UTF-8");
        //cache
        hm.put(url, lastModified);
    }
    //Output message[] to file.
    File f = new File(hostName + path);
    f.getParentFile().mkdirs();
    f.createNewFile();
    try (FileOutputStream fos = new FileOutputStream(f)) {
        fos.write(message);
    } catch (IOException ioe) {
        ioe.printStackTrace();
    }
}
}
You're opening a new FileOutputStream on each iteration of the loop. Don't do that. Open it outside the loop, then loop and write as you are doing, then close at the end of the loop. (If you use a try-with-resources statement with your while loop inside it, that'll be fine.)
That's only part of the problem though - you're also doing everything else on each iteration of the loop, including checking for headers. That's going to be a real problem if the byte array you read contains part of the set of headers, or indeed part of the header separator.
Additionally, as noted by EJP, you're ignoring the return value of read apart from using it to tell whether or not you're done. You should always use the return value of read to know how much of the byte array is actually usable data.
Fundamentally, you either need to read the whole response into a byte array to start with - which is easy to do, but potentially inefficient in memory - or accept the fact that you're dealing with a stream, and write more complex code to detect the end of the headers.
Better though, IMO, would be to use an HTTP library which already understands all this header processing, so that you don't need to do it yourself. Unless you're writing a low-level HTTP library yourself, you shouldn't be dealing with low-level HTTP details, you should rely on a good library.
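A minimal sketch of just that outer structure, reusing the question's variable names (the header detection is deliberately left out here; see the caveats above about headers spanning reads):
File f = new File(hostName + path);
f.getParentFile().mkdirs();

byte[] bytes = new byte[10 * 1024];
// One FileOutputStream for the whole download: the file is truncated once per
// run, and every chunk read inside the loop is written to the same stream.
try (FileOutputStream fos = new FileOutputStream(f)) {
    int n;
    while ((n = dis.read(bytes)) > 0) {
        // only bytes[0..n) are valid for this iteration
        fos.write(bytes, 0, n);
    }
}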
Open the file ahead of the loop.
NB you need to store the result of read() in a variable, and pass that variable to new String() as the length. Otherwise you are converting junk in the buffer beyond what was actually read.
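In other words, something along these lines (a sketch):
int n = dis.read(bytes);
if (n > 0) {
    // only bytes[0..n) were just read; the rest of the buffer is leftover junk
    String string = new String(bytes, 0, n, "UTF-8");
}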
There is an issue with reading the data - you read only part of the response (because at that moment not all the data had been transferred to you yet) - so obviously you write only that part.
Check this answer for how to read the full data from the InputStream:
Convert InputStream to byte array in Java
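For reference, the usual pattern looks roughly like this (a sketch; drain the stream completely before doing any header parsing):
ByteArrayOutputStream out = new ByteArrayOutputStream();
byte[] chunk = new byte[8192];
int n;
while ((n = dis.read(chunk)) != -1) {
    out.write(chunk, 0, n);
}
byte[] all = out.toByteArray(); // now split header and body once, on the complete data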
I am trying to download a web page with all its resources. First I download the HTML, and to keep the file formatted I use the function below.
The issue is that I find "10" in the final file, which is the character code of LF (line feed), and this breaks my JavaScript functions.
Example of the final result:
<!DOCTYPE html>10<html lang="fr">10 <head>10 <meta http-equiv="content-type" content="text/html; charset=UTF-8" />10
Can someone help me find the real issue?
public static String scanfile(File file) {
    StringBuilder sb = new StringBuilder();
    try {
        BufferedReader bufferedReader = new BufferedReader(new FileReader(file));
        while (true) {
            String readLine = bufferedReader.readLine();
            if (readLine != null) {
                sb.append(readLine);
                sb.append(System.lineSeparator());
                Log.i(TAG, sb.toString());
            } else {
                bufferedReader.close();
                return sb.toString();
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
        return null;
    }
}
There are multiple problems with your code.
Charset error
BufferedReader bufferedReader = new BufferedReader(new FileReader(file));
This is going to fail in subtle ways.
Files (and, for that matter, data given to you by webservers) come in bytes: a stream of numbers, each number being between 0 and 255.
So, if you are a webserver and you want to send the character ö, what byte(s) do you send?
The answer is complicated. The mapping that explains how some character is rendered in byte(s)-form is called a character set encoding (shortened to 'charset').
Anytime bytes are turned into characters or vice versa, there is always a charset involved. Always.
So, you're reading a file (that'd be bytes), and turning it into a Reader (which is chars). Thus, charset is involved.
Which charset? The API of new FileReader(path) explains which one: "The system default". You do not want that.
Thus, this code is broken. You want one of two things:
Option 1 - write the data as is
When doing the job of querying the webserver for the data and relaying this information onto disk, you'd want to just store the bytes (after all, webserver gives bytes, and disks store bytes, that's easy), but the webserver also sends the encoding, in a header, and you need to save this separately. Because to read that 'sack of bytes', you need to know the charset to turn it into characters.
How would you do this? Well, up to you. You could for example decree that the data file starts with the name of a charset encoding (as sent via that header), then a 0 byte, and then the data, unmodified. I think you should go with option 2, however
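For completeness, a sketch of what that hypothetical format could look like (targetFile, charsetNameFromHeader and responseBodyStream are placeholder names, and InputStream.transferTo needs Java 9+):
// "<charset-name>" + a 0 byte + the raw body bytes, unmodified
try (OutputStream out = new FileOutputStream(targetFile)) {
    out.write(charsetNameFromHeader.getBytes(StandardCharsets.US_ASCII));
    out.write(0);                        // the 0 byte separating name from payload
    responseBodyStream.transferTo(out);  // copy the body bytes as-is
}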
Option 2
Another, better option for text-based documents (which HTML is), is this: When reading the data, convert it to characters, using the encoding as that header tells you. Then, to save it to disk, turn the chars back to bytes, using UTF-8, which is a great encoding and an industry standard. That way, when reading, you just know it's UTF-8, period.
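A sketch of the download-and-save side under that scheme (serverCharset, responseBodyStream and targetPath are placeholder names; Reader.transferTo needs Java 10+):
// decode with the charset the server declared, re-encode as UTF-8 on disk
try (Reader in = new InputStreamReader(responseBodyStream, serverCharset);
     Writer out = Files.newBufferedWriter(Paths.get(targetPath), StandardCharsets.UTF_8)) {
    in.transferTo(out);
}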
To read a UTF-8 text file, you do:
Files.newBufferedReader(Paths.get(file));
The reason this works is that the Files API, unlike most other APIs (and unlike FileReader, which you should never ever use), defaults to UTF-8 and not to the platform default. If you want, you can make it more readable:
Files.newBufferedReader(Paths.get(file), StandardCharsets.UTF_8);
Same thing, but now it is clear in the code what's happening.
Broken exception handling
} catch (IOException e) {
e.printStackTrace();
return null;
}
This is not okay - if you catch an exception, either [A] throw something else, or [B] handle the problem. And 'log it and keep going' is definitely not 'handling' it. Your strategy of exception handling results in 1 error resulting in a thousand things going wrong with a thousand stack traces, and all of them except the first are undesired and irrelevant, hence why this is horrible code and you should never write it this way.
The easy solution is to just put throws IOException on your scanFile method. The method inherently interacts with files, it SHOULD be throwing that. Note that your psv main(String[] args) method can, and usually should, be declared to throws Exception.
It also makes your code simpler and shorter, yay!
Resource Management failure
A FileReader is a resource. You MUST close it, no matter what happens. You are not doing that: if .readLine() throws an exception, your code jumps to the catch handler and bufferedReader.close() is never executed.
The solution is to use the ARM (Automatic Resource Management) construct:
try (var br = Files.newBufferedReader(Paths.get(file), StandardCharsets.UTF_8)) {
// code goes here
}
This construct ensures that close() is invoked, regardless of how the 'code goes here' block exits. Even if it 'exits' via an exception or a return statement.
The problem
Your 'read a file and print it' code is, apart from the three items above, mostly fine. The problem is that the HTML file on disk is corrupted; the error lies in your code that reads the data from the web server and saves it to disk. You did not paste that code.
Specifically, System.lineSeparator() returns the actual separator string, not the digits '10'. Thus, assuming the code you pasted really is the code you are running, if you are seeing an actual '10' show up, that means the HTML file on disk already has it in there. It's not the reading code.
Closing thoughts
More generally the job of 'just print a file on disk with a known encoding' can be done in far fewer lines of code:
public static String scanFile(String path) throws IOException {
return Files.readString(Paths.get(path));
}
You should just use the above code instead. It's simple, short, doesn't have any bugs, cannot leak resources, has proper exception handling, and will use UTF-8.
Actually, there is no problem in this function; I was mistakenly adding the 10 in another function in my code.
I need to read text from a file and, for instance, print it to the console. The file is in UTF-8. It seems that I'm doing something wrong because some Russian characters are printed incorrectly. What's wrong with my code?
StringBuilder content = new StringBuilder();
try (FileChannel fChan = (FileChannel) Files.newByteChannel(Paths.get("D:/test.txt"))) {
    ByteBuffer byteBuf = ByteBuffer.allocate(16);
    Charset charset = Charset.forName("UTF-8");
    while (fChan.read(byteBuf) != -1) {
        byteBuf.flip();
        content.append(new String(byteBuf.array(), charset));
        byteBuf.clear();
    }
    System.out.println(content);
}
The result:
Здравствуйте, как поживае��е?
Это п��имер текста на русском яз��ке.ом яз�
The actual text:
Здравствуйте, как поживаете?
Это пример текста на русском языке.
UTF-8 uses a variable number of bytes per character. This gives you a boundary error: You have mixed buffer-based code with byte-array based code and you can't do that here; it is possible for you to read enough bytes to be stuck halfway into a character, you then turn your input into a byte array, and convert it, which will fail, because you can't convert half a character.
What you really want is either to first read ALL the data and then convert the entire input, or, to keep any half-characters in the bytebuffer when you flip back, or better yet, ditch all this stuff and use code that is written to read actual characters. In general, using the channel API complicates matters a ton; it's flexible, but complicated - that's how it goes.
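For illustration, the "keep half-characters in the buffer" route looks roughly like this with a CharsetDecoder (note how compact() carries the undecoded tail over to the next read):
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
StringBuilder content = new StringBuilder();
try (FileChannel fChan = FileChannel.open(Paths.get("D:/test.txt"))) {
    ByteBuffer byteBuf = ByteBuffer.allocate(16);
    CharBuffer charBuf = CharBuffer.allocate(16);
    while (fChan.read(byteBuf) != -1) {
        byteBuf.flip();
        decoder.decode(byteBuf, charBuf, false); // stops before an incomplete sequence
        charBuf.flip();
        content.append(charBuf);
        charBuf.clear();
        byteBuf.compact(); // keep the undecoded tail for the next read
    }
    byteBuf.flip();
    decoder.decode(byteBuf, charBuf, true); // end of input
    decoder.flush(charBuf);
    charBuf.flip();
    content.append(charBuf);
}
System.out.println(content);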
Unless you can explain why you need it, don't use it. Do this instead:
Path target = Paths.get("D:/test.txt");
try (var reader = Files.newBufferedReader(target)) {
// read a line at a time here. Yes, it will be UTF-8 decoded.
}
or better yet, as you apparently want to read the whole thing in one go:
Path target = Paths.get("D:/test.txt");
var content = Files.readString(target);
NB: Unlike most java methods that convert bytes to chars or vice versa, the Files API defaults to UTF-8 (instead of the useless and dangerous, untestable-bug-causing 'platform default encoding' that most java API does). That's why this last incredibly simple code is nevertheless correct.
I want to send a CSV file encoded in base64 from Client to Server, in order to parse it and use the data.
I want to get the InputStream directly from the Request object and pipe it to the reader used by the CSV parser.
Is there any performance or memory gain using this method?
Can the following code achieve this? I feel like there's something missing when decoding the content.
Is a BufferedReader really needed in this example?
/* Suppose I get a Base64 encoded CSV file from the client */
String csvContent = "Column 1;Column 2;Column 3\r\nValue 1;Value 2;Value 3\r\n";
ByteArrayInputStream inputStream = new ByteArrayInputStream(Base64.encodeBase64(csvContent.getBytes()));
/* retrieving the content UPDATED */
Base64InputStream b64InputStream = new Base64InputStream(inputStream, false);
/* Parsing the CSV content */
Reader reader = new BufferedReader(
new InputStreamReader(b64InputStream));
CSVParser csvParser = new CSVParser(reader, FORMAT_EXCEL_FR);
/* printing results */
csvParser.forEach(record -> printRecord(record));
Update
I replaced the byte[] array with a Base64InputStream from org.apache.commons.codec
Probably not. A BufferedReader ... uses a buffer. It is commonly used when your data is not in Java memory yet (e.g. socket communication, reading data from a file, ...).
In your case, you are wrapping a byte[], which means that the data is already in memory. So there is no point in adding a buffer.
The javadoc describes a BufferedReader as follows:
Reads text from a character-input stream, buffering characters so as to provide for the efficient reading of characters, arrays, and lines.
Now, let's say for example you want to read the content of a file, and want to check something character by character. So you do a lot of int c = in.read(); calls. In that case, a buffered reader will actually fetch the underlying data in chunks internally.
So, basically, whenever it is more efficient to fetch data in chunks, use a BufferedReader.
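To make that concrete, the typical case is wrapping a reader over a slow source such as a file or socket (socket here is just a placeholder):
try (BufferedReader in = new BufferedReader(
        new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))) {
    int c;
    while ((c = in.read()) != -1) { // each call is served from the internal buffer
        // inspect character c
    }
}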
Update
In response to your update: no, in this case it's also not necessary to add a BufferedReader. As Holger pointed out:
It's likely that the CSVParser does that already (i.e. buffering).
I checked the source code of the CSVParser, and look what's in the constructor.
public CSVParser(final Reader reader, final CSVFormat format, final long characterOffset, final long recordNumber)
throws IOException {
...
this.lexer = new Lexer(format, new ExtendedBufferedReader(reader));
...
}
It wraps some kind of buffered reader by default. So, there's no need to add one yourself.
I came across some strange behavior when reading files in Java 8, and I'm wondering if someone can make sense of it.
Scenario:
Reading a malformed text file. By malformed I mean that it contains bytes that do not map to any Unicode code points.
The code I use to create such a file is as follows:
byte[] text = new byte[1];
char k = (char) -60;
text[0] = (byte) k;
FileUtils.writeByteArrayToFile(new File("/tmp/malformed.log"), text);
This code produces a file that contains exactly one byte, which is not part of the ASCII table (nor the extended one).
Attempting to cat this file produces the following output:
�
Which is the Unicode replacement character. This makes sense because UTF-8 needs at least two bytes to encode non-ASCII characters, but we only have one. This is the behavior I expect from my Java code as well.
Pasting some common code:
private void read(Reader reader) throws IOException {
CharBuffer buffer = CharBuffer.allocate(8910);
buffer.flip();
// move existing data to the front of the buffer
buffer.compact();
// pull in as much data as we can from the socket
int charsRead = reader.read(buffer);
// flip so the data can be consumed
buffer.flip();
ByteBuffer encode = Charset.forName("UTF-8").encode(buffer);
byte[] body = new byte[encode.remaining()];
encode.get(body);
System.out.println(new String(body));
}
Here is my first approach using nio:
FileInputStream inputStream = new FileInputStream(new File("/tmp/malformed.log"));
read(Channels.newReader(inputStream.getChannel(), "UTF-8"));
This produces the following exception:
java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.Reader.read(Reader.java:100)
Which is not what I expected, but it also kind of makes sense, because this is actually a corrupt, illegal file, and the exception is basically telling us it expected more bytes to be read.
And my second one (using regular java.io):
FileInputStream inputStream = new FileInputStream(new File("/tmp/malformed.log"));
read(new InputStreamReader(inputStream, "UTF-8"));
This does not fail and produces the exact same output as cat did:
�
Which also makes sense.
So my questions are:
What is the expected behavior from a Java Application in this scenario?
Why is there a difference between using Channels.newReader (which returns a StreamDecoder) and simply using the regular InputStreamReader? Am I doing something wrong with how I read?
Any clarifications would be much appreciated.
Thanks :)
The difference in behaviour actually goes right down to the StreamDecoder and Charset classes. The InputStreamReader gets a CharsetDecoder from StreamDecoder.forInputStreamReader(..), which does replacement on error:
StreamDecoder(InputStream in, Object lock, Charset cs) {
this(in, lock,
cs.newDecoder()
.onMalformedInput(CodingErrorAction.REPLACE)
.onUnmappableCharacter(CodingErrorAction.REPLACE));
}
while the Channels.newReader(..) creates the decoder with the default settings (i.e. report instead of replace, which results in an exception further up)
public static Reader newReader(ReadableByteChannel ch,
String csName) {
checkNotNull(csName, "csName");
return newReader(ch, Charset.forName(csName).newDecoder(), -1);
}
So they work differently, but there's no indication in documentation anywhere about the difference. This is badly documented, but I suppose they changed the functionality because you'd rather get an exception than have your data silently corrupted.
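If you do want the channel route but with the same replacement behaviour as InputStreamReader, you can pass a configured decoder to Channels.newReader yourself (a sketch, reusing the file from the question):
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE);
FileInputStream inputStream = new FileInputStream(new File("/tmp/malformed.log"));
// malformed input now yields U+FFFD instead of MalformedInputException
read(Channels.newReader(inputStream.getChannel(), decoder, -1));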
Be careful when dealing with character encodings!
I'm trying to read the response from a server using a socket and the information is UTF-8 encoded. I'm wrapping the InputStream from the socket in an InputStreamReader with the encoding set to "UTF-8".
For some reason it seems like only part of the response is read and then the reading just hangs for about a minute or two and then it finishes. If I set the encoding on the InputStreamReader to "ISO-8859-1" then I can read all of the data right away, but obviously not all of the characters are displayed correctly.
Code looks something like the following
socketConn = (SocketConnection)Connector.open(url);
InputStreamReader is = new InputStreamReader(socketConn.openInputStream(), "UTF-8");
Then I read through the headers and the content. The content is chunked and I read the line with the size of each chunk (convert to decimal from hex) to know how much to read.
I'm not understanding the difference in reading with the two encodings and the effect it can have because it works without issue with ISO-8859-1 and it works eventually with UTF-8, there is just the long delay.
It's hard to tell the reason for the delay.
You may try another way of getting the data from the network:
byte[] data = IOUtilities.streamToBytes(socketConn.openInputStream());
I believe the above should complete without delay. Then, having got the bytes from the network, you can start processing the data. Note you can always get a String from bytes representing UTF-8-encoded text:
String stringInUTF8 = new String(bytes, "UTF-8");
UPDATE: see the second comment to this post.
I was already removing the chunk sizes on the fly so I ended up doing something somewhat similar to the IOUtilities answer. Instead of using an InputStreamReader I just used the InputStream. InputStream has a read method that can fill an array of bytes, so for each chunk the code looks something like this
byte[] buf = new byte[size];
is.read(buf);
return new String(buf, "UTF-8");
This seems to work, doesn't cause any delays and I can remove the extra information about the chunks on the fly.
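One caveat worth keeping in mind: InputStream.read(byte[]) is not guaranteed to fill the whole array in a single call, so for chunks that span several packets a fill loop (or DataInputStream.readFully) is the safer sketch:
byte[] buf = new byte[size];
int off = 0;
while (off < size) {
    int n = is.read(buf, off, size - off);
    if (n == -1) {
        throw new EOFException("stream ended before the chunk was fully read");
    }
    off += n;
}
return new String(buf, "UTF-8");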