Android (java) XmlSerializer not encoding "Windows-1252"

I'm currently working on an Android Studio project (Java) and I need to export an XML file with a specific encoding, "Windows-1252".
I've been trying to do this for a few hours (EXAMPLE A), and no matter what encoding I choose, the resulting file has the "correct" encoding declared in the first XML line ("... encoding='windows-1252'"), yet:
the chars inside the XML file are "escaped" as "&#xxx;"
opening the file with Notepad++, it detects "UTF-8" encoding (not the desired "Windows-1252")
<?xml version='1.0' encoding='windows-1252' ?><test><message>&#225;&#233;&#237;&#243;&#250;&#227;&#245;&#231;</message></test>
To make sure that the "streams" were correct, I created a new sample (EXAMPLE B) without the "XmlSerializer", and the resulting file was much better:
the chars inside the XML file are now correct (not escaped)
opening the file with Notepad++, it detects "ANSI" encoding (not the desired "Windows-1252")
<?xml version='1.0' encoding='windows-1252' ?><test><message>áéíóúãõç</message></test>
private void doDebug01() {
    File dstPath = Environment.getExternalStoragePublicDirectory(Environment.DIRECTORY_DOWNLOADS);
    try {
        //EXAMPLE A
        File dstFile = new File(dstPath, "test.xml");
        FileOutputStream dstFOS = new FileOutputStream(dstFile);
        OutputStream dstOS = new BufferedOutputStream(dstFOS);
        OutputStreamWriter dstOSW = new OutputStreamWriter(dstOS, "windows-1252");
        XmlSerializer dstXml = Xml.newSerializer();
        dstXml.setOutput(dstOSW);
        dstXml.startDocument("windows-1252", null);
        dstXml.startTag(null, "test");
        dstXml.startTag(null, "message");
        dstXml.text("áéíóúãõç");
        dstXml.endTag(null, "message");
        dstXml.endTag(null, "test");
        dstXml.endDocument();
        dstXml.flush();
        dstOSW.close();

        //EXAMPLE B
        File dstFileB = new File(dstPath, "testb.xml");
        FileOutputStream dstFOSB = new FileOutputStream(dstFileB);
        OutputStream dstOSB = new BufferedOutputStream(dstFOSB);
        OutputStreamWriter dstOSWB = new OutputStreamWriter(dstOSB, "windows-1252");
        dstOSWB.write("<?xml version='1.0' encoding='windows-1252' ?>");
        dstOSWB.write("<test>");
        dstOSWB.write("<message>áéíóúãõç</message>");
        dstOSWB.write("</test>");
        dstOSWB.flush();
        dstOSWB.close();
    }
    catch (IOException e) {
        Log.e("DEBUG", e.getMessage());
        e.printStackTrace();
    }
}
So now I'm confused and kind of stuck here with the results I've got from EXAMPLE B, because I don't know whether my problem in (A) resides in the "XmlSerializer" or in the "stream" parameters.
What am I missing in my code (A) to get an XML file with correct chars, encoded in "Windows-1252" (or at least something close to EXAMPLE B)?
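For reference, if I end up keeping the EXAMPLE B approach, here is a minimal sketch of it with proper XML escaping (escapeXmlText is just a helper name I made up, not a library call). Only &, < and > need replacing in text content, so the accented characters still go out as single Windows-1252 bytes:
private static String escapeXmlText(String s) {
    // escape only what XML requires inside text content
    return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
}

private void writeXmlWindows1252(File dstFile) throws IOException {
    Writer w = new OutputStreamWriter(new BufferedOutputStream(new FileOutputStream(dstFile)), "windows-1252");
    try {
        w.write("<?xml version='1.0' encoding='windows-1252' ?>");
        w.write("<test><message>");
        w.write(escapeXmlText("áéíóúãõç")); // accented chars pass through and are encoded as 1252 bytes
        w.write("</message></test>");
    } finally {
        w.close();
    }
}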

Related

Disable auto-escaping when creating an HTML file with PrintWriter

I don't know why this is so much harder than expected.
I'm trying to create an HTML file with Java, and it is not working. The code creates a file, but the contents are not what I inputted.
My simplified code is as follows:
File file = new File("text.html");
PrintWriter out = null;
try {
    out = new PrintWriter(file);
    out.write("<b>Hello World!</b>");
} catch (Exception e) { }
out.close();
Instead of the contents "Hello World!" in bold, the HTML file contains the escaped form "&lt;b&gt;Hello World!&lt;/b&gt;". When I open the file with TextWrangler, I see that Java has automatically escaped all my angle brackets into "&lt;" and "&gt;", which breaks all the formatting and is NOT what I want.
How do I avoid this?
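For what it's worth, java.io.PrintWriter itself never escapes anything; write() passes characters through unchanged, so whatever converted the angle brackets happened outside this code. A minimal sketch that writes the markup verbatim and pins the charset down explicitly (assuming java.nio.charset.StandardCharsets is imported):
File file = new File("text.html");
try (PrintWriter out = new PrintWriter(
        new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8))) {
    out.write("<b>Hello World!</b>"); // the file will contain exactly these characters, unescaped
}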

BeanIO - UnidentifiedRecordException when parsing UTF8 file

I have a problem when parsing a file that is encoded with UTF-8.
I have two files which are completely identical except for their encoding. (I simply copied the file and saved it with UTF-8, so the contents are identical.) One is encoded using ANSI, the other with UTF-8. The file encoded with ANSI is successfully parsed, while the other file causes BeanIO to throw an UnidentifiedRecordException when calling the BeanReader.read() method:
org.beanio.UnidentifiedRecordException: Unidentified record at line 1
I have tried to solve this by explicitly setting the encoding to UTF8 using this code:
public static BeanReader getBeanReader(File file, StreamBuilder builder) {
    StreamFactory factory = StreamFactory.newInstance();
    factory.define(builder);
    InputStream iStream;
    try {
        iStream = new FileInputStream(file);
    } catch (FileNotFoundException e) {
        throw new CustomException("Could not create BeanReader, file not found", e);
    }
    Reader reader = new InputStreamReader(iStream, StandardCharsets.UTF_8);
    return factory.createReader("reader", reader);
}
which doesn't solve the issue.
What could be the reason for this error?
As the first line is claimed erroneous, did you save the UTF-8 without BOM (that infamous zero-width space at file start)?
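If the BOM turns out to be the problem, a minimal sketch for stripping it before handing the Reader to BeanIO (skipBom is a made-up helper name, not a BeanIO feature):
// Wraps the stream and drops a leading U+FEFF, which is what a UTF-8 BOM decodes to.
static Reader skipBom(InputStream in) throws IOException {
    PushbackReader reader = new PushbackReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    int first = reader.read();
    if (first != -1 && first != '\uFEFF') {
        reader.unread(first); // no BOM present, push the character back
    }
    return reader;
}
The last line of getBeanReader above would then become factory.createReader("reader", skipBom(iStream)).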

Want to throw exception when encounter special UTF-8 characters in an XML file

I am parsing an XML file which has UTF-8 encoding.
<?xml version="1.0" encoding="UTF-8"?>
Now, our business application has a set of components which are developed by different teams and do not use the same libraries for parsing XML. My component uses JAXB, while some other component uses SAX, and so forth. When the XML file has special characters like "ä", "ë" or "é" (accented characters), JAXB parses it properly, but other components (sub-apps) cannot parse them properly and throw exceptions.
Due to business needs I cannot change the programming of the other components, but I have to put a restriction/validation in my application to make sure that the XML (data-load) file does not contain any such characters.
What is the best approach to make sure the file does not contain the above-mentioned (or similar) characters, so that I can throw an exception (or give an error) right there, before I start parsing the XML file with JAXB?
If your customer sends you an XML file with a header where the encoding does not match the file contents, you might as well give up trying to do anything meaningful with that file. - Are they really sending data where the header does not match the actual encoding? That's not XML, then. And you ought to charge them more ;-)
Simply read the file as a FileInputStream, byte by byte. If it contains a negative byte value, refuse to process it.
You can keep encoding settings like UTF-8 or ISO 8859-1, because they all have US-ASCII as a proper subset.
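A minimal sketch of that byte scan (the method name and the exception choice are mine): any byte value above 0x7F would be negative as a signed Java byte, i.e. non-ASCII.
static void assertAsciiOnly(File file) throws IOException {
    try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
        int b;
        long offset = 0;
        while ((b = in.read()) != -1) {
            if (b > 0x7F) { // read() returns 0-255, so anything above 127 is non-ASCII
                throw new IOException("Non-ASCII byte 0x" + Integer.toHexString(b) + " at offset " + offset);
            }
            offset++;
        }
    }
}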
Yes, my answer would be the same as laune mentions...
static boolean readInput() {
    boolean isValid = true;
    StringBuffer buffer = new StringBuffer();
    try {
        FileInputStream fis = new FileInputStream("test.txt");
        InputStreamReader isr = new InputStreamReader(fis);
        Reader in = new BufferedReader(isr);
        int ch;
        while ((ch = in.read()) > -1) {
            buffer.append((char) ch);
            System.out.println("ch=" + ch);
            //TODO - check the range for each character
            //according to the Wikipedia table http://en.wikipedia.org/wiki/UTF-8
            //if it's a valid UTF-8 character;
            //if it's not in range, then isValid = false;
            //and you can break here...
        }
        in.close();
        return isValid;
    }
    catch (IOException e) {
        e.printStackTrace();
        return false;
    }
}
I'm just adding a code snippet...
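If the goal of the TODO above is simply "are these bytes valid UTF-8?", an alternative to hand-checking the Wikipedia ranges is to let a CharsetDecoder report malformed input. A sketch, with a made-up method name:
static boolean isValidUtf8(Path path) throws IOException {
    // A strict decoder: malformed byte sequences raise CharacterCodingException
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
        decoder.decode(ByteBuffer.wrap(Files.readAllBytes(path)));
        return true; // decoded cleanly: valid UTF-8
    } catch (CharacterCodingException e) {
        return false; // hit an invalid byte sequence
    }
}
(This needs java.nio.ByteBuffer, java.nio.charset.*, and java.nio.file.Files/Path.)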
You should be able to wrap the XML input in a java.io.Reader in which you specify the actual encoding and then process that normally. Java will leverage the encoding specified in the XML for an InputStream, but when a Reader is used, the encoding of the Reader will be used.
Unmarshaller unmarshaller = jc.createUnmarshaller();
InputStream inputStream = new FileInputStream("input.xml");
Reader reader = new InputStreamReader(inputStream, "UTF-16");
try {
    Address address = (Address) unmarshaller.unmarshal(reader);
} finally {
    reader.close();
}

Convert Outputstream to file

Well, I'm stuck with a problem.
I need to create a PDF from an HTML source, and I did it this way:
File pdf = new File("/home/wrk/relatorio.pdf");
OutputStream out = new FileOutputStream(pdf);
InputStream input = new ByteArrayInputStream(build.toString().getBytes());//Build is a StringBuilder obj
Tidy tidy = new Tidy();
Document doc = tidy.parseDOM(input, null);
ITextRenderer renderer = new ITextRenderer();
renderer.setDocument(doc, null);
renderer.layout();
renderer.createPDF(out);
out.flush();
out.close();
Well, I'm using JSP, so I need to send this file to the user as a download, not write it on the server...
How do I turn this OutputStream output into a file in Java, without writing the file to the hard drive?
If you're using VRaptor 3.3.0+ you can use the ByteArrayDownload class. Starting with your code, you can use this:
#Path("/download-relatorio")
public Download download() {
// Everything will be stored into this OutputStream
ByteArrayOutputStream out = new ByteArrayOutputStream();
InputStream input = new ByteArrayInputStream(build.toString().getBytes());
Tidy tidy = new Tidy();
Document doc = tidy.parseDOM(input, null);
ITextRenderer renderer = new ITextRenderer();
renderer.setDocument(doc, null);
renderer.layout();
renderer.createPDF(out);
out.flush();
out.close();
// Now that you have finished, return a new ByteArrayDownload()
// The 2nd and 3rd parameters are the Content-Type and File Name
// (which will be shown to the end-user)
return new ByteArrayDownload(out.toByteArray(), "application/pdf", "Relatorio.pdf");
}
A File object does not actually hold the data but delegates all operations to the file system (see this discussion).
You could, however, create a temporary file using File.createTempFile. Also look here for a possible alternative without using a File object.
Use temporary files.
File temp = File.createTempFile(prefix, suffix);
prefix -- The prefix string defines the file's name; it must be at least three characters long.
suffix -- The suffix string defines the file's extension; if null, the suffix ".tmp" will be used.
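A short usage sketch under the question's setup (pdfBytes is a hypothetical byte[] holding the rendered PDF, e.g. taken from a ByteArrayOutputStream):
File temp = File.createTempFile("relatorio", ".pdf"); // prefix/suffix values are just examples
temp.deleteOnExit(); // ask the JVM to delete the file when it exits
try (FileOutputStream out = new FileOutputStream(temp)) {
    out.write(pdfBytes); // pdfBytes: hypothetical byte[] with the PDF contents
}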

Special characters are not converted correctly from pdf to text

I have a set of pdf files that contain central European characters such as č, Ď, Š and so on. I want to convert them to text, and I have tried pdftotext and PDFBox through Apache Tika, but some characters are always converted incorrectly.
The strange thing is that the same character in the same text is converted correctly in some places and incorrectly in others! An example is this pdf.
In the case of pdftotext I am using these options:
pdftotext -nopgbrk -eol dos -enc UTF-8 070612.pdf
My Tika code looks like this:
String newname = f.getCanonicalPath().replace(".pdf", ".txt");
OutputStreamWriter print = new OutputStreamWriter(new FileOutputStream(newname), Charset.forName("UTF-16"));
String fileString = "path\\to\\myfiles\\"; // backslashes must be escaped in Java string literals
InputStream is = null;
try {
    is = new FileInputStream(f);
    ContentHandler contenthandler = new BodyContentHandler(10 * 1024 * 1024);
    Metadata metadata = new Metadata();
    PDFParser pdfparser = new PDFParser();
    pdfparser.parse(is, contenthandler, metadata, new ParseContext());
    String outputString = contenthandler.toString();
    outputString = outputString.replace("\n", "\r\n");
    System.err.println("Writing now file " + newname);
    print.write(outputString);
} catch (Exception e) {
    e.printStackTrace();
} finally {
    if (is != null) is.close();
    print.close();
}
Edit: Forgot to mention that I am facing the same issue when converting to text from Acrobat Reader XI, as well.
Well aside from anything else, this code will use the platform default encoding:
PrintWriter print = new PrintWriter(newname);
print.print(outputString);
print.close();
I suggest you use OutputStreamWriter instead, wrapping a FileOutputStream, and specify UTF-8 as the encoding (as it can encode all of Unicode and is generally well supported).
You should also close the writer in a finally block, and I'd probably separate the "reading" part from the "writing" part. (I'd avoid catching Exception too, but going into the details of exception handling is a bit beyond the point of this answer.)
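A sketch of that suggestion, with try-with-resources standing in for the explicit finally block:
try (Writer print = new OutputStreamWriter(
        new FileOutputStream(newname), StandardCharsets.UTF_8)) {
    print.write(outputString); // UTF-8 can represent every character in the string
} // the writer is closed even if write() throws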
