How do I get a FileInputStream from a FileItem in Java?

I am trying to avoid FileItem.getInputStream(), because it will use the wrong encoding; for that I need a FileInputStream instead. Is there any way to get a FileInputStream without using this method? Or can I transform my FileItem into a File?
if (this.strEncoding != null && !this.strEncoding.isEmpty()) {
    br = new BufferedReader(new InputStreamReader(clsFile.getInputStream(), this.strEncoding));
}
else {
    // br = ?????
}

You can try
FileItem#getString(encoding)
Returns the contents of the file item as a String, using the specified encoding.
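For instance, a minimal sketch (assuming a Commons FileUpload FileItem, here the question's clsFile, and that the item really is text):
// decodes the item's bytes with the given charset;
// throws UnsupportedEncodingException for an unknown charset name
String content = clsFile.getString(this.strEncoding);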

You can use the write method here.
File file = new File("/path/to/file");
fileItem.write(file);
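Once write(file) returns, the uploaded bytes are on disk, so a plain FileInputStream can then be opened on that file. A short sketch (note that FileItem#write is declared to throw Exception, so the caller must handle that):
// after fileItem.write(file), the file holds the raw uploaded bytes
try (FileInputStream fis = new FileInputStream(file)) {
    // read from fis as needed; no re-encoding has taken place
}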

An InputStream is binary data, bytes. It must be converted to text by giving the encoding of those bytes.
Java internally uses Unicode to represent all text scripts. For text it uses String/char/Reader/Writer; for binary data, byte[]/InputStream/OutputStream.
So you could use a bridging class, like InputStreamReader:
String encoding = "UTF-8"; // Or "Windows-1252" ...
BufferedReader in = new BufferedReader(
        new InputStreamReader(fileItem.getInputStream(),
                encoding));
Or if you read the bytes:
String s = new String(bytes, encoding);
The encoding is often an optional parameter (there is then an overloaded method without the encoding).
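Putting the pieces together, a minimal sketch (assuming the item really holds text in the chosen charset):
String encoding = "UTF-8"; // or "Windows-1252" ...
try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fileItem.getInputStream(), encoding))) {
    String line;
    while ((line = in.readLine()) != null) {
        // each line is now correctly decoded text
    }
}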

Related

Java Spring returning CSV file encoded in UTF-8 with BOM

Apparently, for Excel to open CSV files nicely, it should have the Byte Order Mark at the start. The download of the CSV is implemented by writing into the HttpServletResponse's output stream in the controller, as the data is generated during the request. I get an exception when I try to write the BOM bytes - java.io.CharConversionException: Not an ISO 8859-1 character: [] (even though the encoding I specified is UTF-8).
The controller's method in question:
@RequestMapping("/monthly/list")
public List<MonthlyDetailsItem> queryDetailsItems(
        MonthlyDetailsItemQuery query,
        @RequestParam(value = "format", required = false) String format,
        @RequestParam(value = "attachment", required = false, defaultValue = "false") Boolean attachment,
        HttpServletResponse response) throws Exception
{
    // load item list
    List<MonthlyDetailsItem> list = detailsSvc.queryMonthlyDetailsForList(query);
    // adjust format
    format = format != null ? format.toLowerCase() : "json";
    if (!Arrays.asList("json", "csv").contains(format)) format = "json";
    // modify common response headers
    response.setCharacterEncoding("UTF-8");
    if (attachment)
        response.setHeader("Content-Disposition", "attachment;filename=duomenys." + format);
    // build csv
    if ("csv".equals(format)) {
        response.setContentType("text/csv; charset=UTF-8");
        response.getOutputStream().print("\ufeff");
        response.getOutputStream().write(buildMonthlyDetailsItemCsv(list).getBytes("UTF-8"));
        return null;
    }
    return list;
}
I have just come across this same problem. The solution that works for me is to get the output stream from the response object and write to it as follows:
// first create an array for the Byte Order Mark
final byte[] bom = new byte[] { (byte) 239, (byte) 187, (byte) 191 };
try (OutputStream os = response.getOutputStream()) {
    os.write(bom);
    final PrintWriter w = new PrintWriter(new OutputStreamWriter(os, "UTF-8"));
    w.print(data);
    w.flush();
    w.close();
} catch (IOException e) {
    // log it
}
So UTF-8 is specified on the OutputStreamWriter.
As an addendum to this, I should add that the same application needs to allow users to upload files, which may or may not have BOMs. This can be dealt with by using the class org.apache.commons.io.input.BOMInputStream, then using that to construct an org.apache.commons.csv.CSVParser.
The BOMInputStream includes a method hasBOM() to detect if the file has a BOM or not.
One gotcha that I first fell into was that the hasBOM() method reads (obviously!) from the underlying stream, so the way to deal with this is to first mark the stream, then, if the test shows there is no BOM, reset the stream. The code I use for this looks like the following:
try (InputStream is = uploadFile.getInputStream();
        BufferedInputStream buffIs = new BufferedInputStream(is);
        BOMInputStream bomIn = new BOMInputStream(buffIs)) {
    buffIs.mark(LOOKAHEAD_LENGTH);
    // this should allow us to deal with CSVs with or without BOMs
    final boolean hasBOM = bomIn.hasBOM();
    final BufferedReader buffReadr = new BufferedReader(
            new InputStreamReader(hasBOM ? bomIn : buffIs, StandardCharsets.UTF_8));
    // if this stream does not have a BOM, then we must reset the stream as the test
    // for a BOM will have consumed some bytes
    if (!hasBOM) {
        buffIs.reset();
    }
    // collect the validated entity details
    final CSVParser parser = CSVParser.parse(buffReadr,
            CSVFormat.DEFAULT.withFirstRecordAsHeader());
    // Do stuff with the parser
    ...
    // Catch and clean up
Hope this helps someone.
It doesn't make much sense: the BOM is for UTF-16; there is no byte order with UTF-8. The encoding you've set with setCharacterEncoding is used for getWriter, not for getOutputStream.
UPDATE:
OK, try this:
if ("csv".equals(format)) {
response.setContentType("text/csv; charset=UTF-8");
PrintWriter out = response.getWriter();
out.print("\uFEFF");
out.print(buildMonthlyDetailsItemCsv(list));
return null;
}
I'm assuming that method buildMonthlyDetailsItemCsv returns a String.

Downloaded file by means of Java does not open

I try to open a downloaded file but it's impossible. For example, an *.mp3 does not play, and a *.torrent gives the message "is not valid bencoding". What's wrong? Any hints, please?
try (FileOutputStream fwriter = new FileOutputStream(save_location)) {
    URL url_path = new URL(url);
    URLConnection connection = url_path.openConnection();
    InputStreamReader reader = new InputStreamReader(connection.getInputStream());
    int data;
    while ((data = reader.read()) != -1)
        fwriter.write(data);
    fwriter.flush();
} catch (IOException e) {
    e.printStackTrace();
}
Here's the problem.
The code is taking an input stream (binary data) and wrapping it in a reader (text data), with the conversion of binary to text being performed using the platform's default character set decoder.
Then it is taking those characters, truncating them to bytes, and writing them as a byte stream.
A transformation of binary data from bytes to characters and back to bytes is typically1 lossy; i.e. it damages binary data. When you do it like this (without selecting a "safe" charset, and with a broken text-to-byte conversion on the back end), damage is inevitable.
For the record, here is a sketch of the correct way to copy a binary data stream:
URL url = new URL(urlString);
URLConnection connection = url.openConnection();
try (InputStream is = connection.getInputStream();
        FileOutputStream os = new FileOutputStream(save_location)) {
    byte[] data = new byte[BUFFER_SIZE];
    int nosBytesRead;
    while ((nosBytesRead = is.read(data)) != -1) {
        os.write(data, 0, nosBytesRead);
    }
}
Notes:
Does not convert from bytes to chars and back to bytes.
Reads and writes using a buffer, not one byte (or character) at a time.
Does not squash exceptions.
Opens the streams as resources so that there is no potential resource leak.
1 - But not always. If you use Latin-1 as the character encoding, and implement the conversions correctly, they won't be lossy. But this is beside the point really. For binary data you should not do an unnecessary binary -> text -> binary transformation in the first place.
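To illustrate the footnote, a small self-contained sketch (the class name is illustrative) showing that ISO-8859-1 round-trips every byte value while UTF-8 does not:
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTrip {
    public static void main(String[] args) {
        byte[] original = new byte[256];
        for (int i = 0; i < 256; i++) original[i] = (byte) i;
        // ISO-8859-1 maps every byte value to a char and back, so this is lossless
        byte[] latin1 = new String(original, StandardCharsets.ISO_8859_1)
                .getBytes(StandardCharsets.ISO_8859_1);
        // UTF-8 replaces invalid byte sequences with U+FFFD, so this is lossy
        byte[] utf8 = new String(original, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(original, latin1)); // true
        System.out.println(Arrays.equals(original, utf8));   // false
    }
}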

Read Greek characters from an xls file into Java

I am trying to read an xls file in Java and convert it to CSV. The problem is that it contains Greek characters. I have tried various methods without success.
br = new BufferedReader(new InputStreamReader(
        new FileInputStream(saveDir + "/" + fileName + ".xls"), "UTF-8"));
FileWriter writer1 = new FileWriter(saveDir + "/A" + fileName + ".csv");
byte[] bytes = thisLine.getBytes("UTF-8");
writer1.append(new String(bytes, "UTF-8"));
I used that with different encodings, like UTF-16 and windows-1253, and of course without using the bytes array. None worked. Any ideas?
Use "ISO-8859-7" instead of "UTF-8". It is for latin and greek. See documentation
InputStream in = new BufferedInputStream(new FileInputStream(new File(myfile)));
result = new Scanner(in,"ISO-8859-7").useDelimiter("\\A").next();
A Byte Order Mark (BOM) should be written at the start of the CSV file.
Can you try this code?
PrintWriter writer1 = new PrintWriter(saveDir+"/A"+fileName+".csv");
writer1.print('\ufeff');
....

Trying to Change the Encoding of a File in Java Is Doubling the Contents of the File

I have a FileOutputStream in Java that is saving the contents of UDP packets to a file. At the end of reading them, I sometimes want to convert the encoding of the file. The problem is that currently, when doing this, it just ends up doubling all the contents of the file. The only workaround I could think of would be to create a temp file with the new encoding and then save it as the original file, but this seems too hacky.
I must be just overlooking something in my code:
if (mode.equals("netascii")) {
    byte[] convert = new byte[(int) file.length()];
    FileInputStream input = new FileInputStream(file);
    input.read(convert);
    String temp = new String(convert);
    convert = Charset.forName("US-ASCII").encode(temp).array();
    fos.write(convert);
}
JOptionPane.showMessageDialog(frame, "Read Successful!");
fos.close();
}
Is there anything suspect?
Thanks in advance for any help!
The problem is that the array of bytes you've read from the InputStream will be converted as if it contains ASCII chars, which I'm assuming it does not. Specify the InputStream's encoding when converting its bytes to a String and you'll get a standard Java string.
I've assumed UTF-16 as the InputStream's encoding here:
byte[] convert = new byte[(int) file.length()];
FileInputStream input = new FileInputStream(file);
// read file bytes until EOF or until the buffer is full
int total = 0;
int r;
while (total < convert.length
        && (r = input.read(convert, total, convert.length - total)) != -1) {
    total += r;
}
String temp = new String(convert, Charset.forName("UTF-16"));
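On Java 7 and later, a simpler alternative (a sketch, not part of the original answer; uses java.nio.file.Files and java.nio.charset.StandardCharsets) is to let NIO read the whole file:
byte[] convert = Files.readAllBytes(file.toPath());
String temp = new String(convert, StandardCharsets.UTF_16);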

Special characters are not converted correctly from pdf to text

I have a set of PDF files that contain Central European characters such as č, Ď, Š, and so on. I want to convert them to text, and I have tried pdftotext and PDFBox through Apache Tika, but some of the characters are always converted incorrectly.
The strange thing is that the same character in the same text is correctly converted at some places and incorrectly at some others! An example is this pdf.
In the case of pdftotext I am using these options:
pdftotext -nopgbrk -eol dos -enc UTF-8 070612.pdf
My Tika code looks like this:
String newname = f.getCanonicalPath().replace(".pdf", ".txt");
OutputStreamWriter print = new OutputStreamWriter(new FileOutputStream(newname), Charset.forName("UTF-16"));
String fileString = "path\\to\\myfiles\\";
try {
    is = new FileInputStream(f);
    ContentHandler contenthandler = new BodyContentHandler(10 * 1024 * 1024);
    Metadata metadata = new Metadata();
    PDFParser pdfparser = new PDFParser();
    pdfparser.parse(is, contenthandler, metadata, new ParseContext());
    String outputString = contenthandler.toString();
    outputString = outputString.replace("\n", "\r\n");
    System.err.println("Writing now file " + newname);
    print.write(outputString);
} catch (Exception e) {
    e.printStackTrace();
} finally {
    if (is != null) is.close();
    print.close();
}
Edit: Forgot to mention that I am facing the same issue when converting to text from Acrobat Reader XI, as well.
Well, aside from anything else, this code will use the platform default encoding:
PrintWriter print = new PrintWriter(newname);
print.print(outputString);
print.close();
I suggest you use an OutputStreamWriter wrapping a FileOutputStream instead, and specify UTF-8 as the encoding (it can encode all of Unicode, and is generally well supported).
You should also close the writer in a finally block, and I'd probably separate the "reading" part from the "writing" part. (I'd avoid catching Exception too, but going into the details of exception handling is a bit beyond the point of this answer.)
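A minimal sketch of that structure (newname and outputString carried over from the question; this is an assumption about the shape of the fix, not the asker's exact code):
Writer print = null;
try {
    // UTF-8 can encode all of Unicode, unlike the platform default
    print = new OutputStreamWriter(new FileOutputStream(newname), "UTF-8");
    print.write(outputString);
} finally {
    // closing in finally ensures the file is released even on failure
    if (print != null) print.close();
}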
