BeanIO - UnidentifiedRecordException when parsing UTF8 file - java

I have a problem when parsing a file that is encoded with UTF8.
I have two files that are completely identical except for their encoding (I simply copied the file and saved it as UTF8, so the contents are the same). One is encoded as ANSI, the other as UTF8. The ANSI-encoded file is successfully parsed, while the other causes BeanIO to throw an UnidentifiedRecordException when BeanReader.read() is called:
org.beanio.UnidentifiedRecordException: Unidentified record at line 1
I have tried to solve this by explicitly setting the encoding to UTF8 using this code:
public static BeanReader getBeanReader(File file, StreamBuilder builder) {
    StreamFactory factory = StreamFactory.newInstance();
    factory.define(builder);
    InputStream iStream;
    try {
        iStream = new FileInputStream(file);
    } catch (FileNotFoundException e) {
        throw new CustomException("Could not create BeanReader, file not found", e);
    }
    Reader reader = new InputStreamReader(iStream, StandardCharsets.UTF_8);
    return factory.createReader("reader", reader);
}
which doesn't solve the issue.
What could be the reason for this error?

Since the very first line is the one reported as erroneous: did you save the UTF-8 file without a BOM (that infamous zero-width space at the start of the file)?
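If a BOM is indeed present, here is a minimal sketch of stripping it before BeanIO sees the input (plain JDK, no extra dependencies; the helper name bomFreeReader is just illustrative):

import java.io.*;
import java.nio.charset.StandardCharsets;

// Strips a leading UTF-8 BOM (decoded as U+FEFF) if present.
static Reader bomFreeReader(File file) throws IOException {
    Reader raw = new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8);
    PushbackReader reader = new PushbackReader(raw, 1);
    int first = reader.read();
    if (first != -1 && first != '\uFEFF') {
        reader.unread(first); // no BOM, push the character back
    }
    return reader;
}

The returned Reader can then be handed to factory.createReader("reader", ...) exactly as in the question.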

Related

UTF-16LE encoding and xerces2 Java

I went through a few posts, like "FileReader reads the file as a character stream" and "can be treated as whitespace if the document is handed as a stream of characters", where the answers say the input source is actually a char stream, not a byte stream.
However, the suggested solution from the first post does not seem to apply to UTF-16LE. Although I use this code:
try (final InputStream is = Files.newInputStream(filename.toPath(), StandardOpenOption.READ)) {
    DOMParser parser = new org.apache.xerces.parsers.DOMParser();
    parser.parse(new InputSource(is));
    return parser.getDocument();
} catch (final SAXParseException saxEx) {
    LOG.debug("Unable to open [{}] as InputSource.", absolutePath, saxEx);
}
I still get org.xml.sax.SAXParseException: Content is not allowed in prolog.
I looked at Files.newInputStream, and it indeed uses a ChannelInputStream, which hands over bytes, not chars. I also tried to set the encoding of the InputSource object, but with no luck.
I also checked that there are no extra chars (except the BOM) before the <?xml part.
I also want to mention that this code works just fine with UTF-8.
// Edit:
I also tried DocumentBuilderFactory.newInstance().newDocumentBuilder().parse() and XmlInputStreamReader.next(), same results.
// Edit 2:
Tried using a buffered reader. Same results:
Unexpected character '뿯' (code 49135 / 0xbfef) in prolog; expected '<'
Thanks in advance.
To get a bit farther, some info gathering:
byte[] bytes = Files.readAllBytes(filename.toPath());
String xml = new String(bytes, StandardCharsets.UTF_16LE);
if (xml.startsWith("\uFEFF")) {
    LOG.info("Has BOM and is evidently UTF_16LE");
    xml = xml.substring(1);
}
if (!xml.contains("<?xml")) {
    LOG.info("Has no XML declaration");
}
// The '?' must be escaped in the regex, and the whole string has to be
// matched for replaceFirst to act as an extractor of the encoding name.
String declaredEncoding = xml.replaceFirst("(?s).*?<\\?xml[^>]*encoding=[\"']([^\"']+)[\"'].*", "$1");
if (declaredEncoding.equals(xml)) { // no match: fall back to the XML default
    declaredEncoding = "UTF-8";
}
LOG.info("Declared as " + declaredEncoding);
try (final InputStream is = new ByteArrayInputStream(xml.getBytes(declaredEncoding))) {
    DOMParser parser = new org.apache.xerces.parsers.DOMParser();
    parser.parse(new InputSource(is));
    return parser.getDocument();
} catch (final SAXParseException saxEx) {
    LOG.debug("Unable to open [{}] as InputSource.", absolutePath, saxEx);
}
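A commonly suggested alternative (a sketch, assuming Apache commons-io is on the classpath) is to let BOMInputStream detect and strip whatever BOM is present, and then tell the InputSource which charset was found:

import org.apache.commons.io.ByteOrderMark;
import org.apache.commons.io.input.BOMInputStream;

try (BOMInputStream bomIn = new BOMInputStream(
        Files.newInputStream(filename.toPath()),
        ByteOrderMark.UTF_8, ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_16BE)) {
    InputSource source = new InputSource(bomIn); // the BOM is already consumed here
    if (bomIn.hasBOM()) {
        source.setEncoding(bomIn.getBOMCharsetName()); // e.g. "UTF-16LE"
    }
    DOMParser parser = new org.apache.xerces.parsers.DOMParser();
    parser.parse(source);
    return parser.getDocument();
}

This sidesteps the prolog errors, since the parser never sees the BOM bytes and is told the actual encoding up front.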

Android (java) XmlSerializer not encoding "Windows-1252"

I'm currently working on an Android Studio project (java) and I need to export an XML file with a specific encoding, "Windows-1252".
I've been trying to do it for a few hours (EXAMPLE A) and, no matter which encoding I choose, although the resulting file has the "correct" encoding in the first XML line ("... encoding='windows-1252'"):
the chars inside the XML file are "escaped" as "&#xxx;"
opening the file with notepad++, it detects the "UTF-8" encoding (not the desired "Windows-1252")
<?xml version='1.0' encoding='windows-1252' ?><test><message>áéíóúãõç</message></test>
To make sure that the "streams" were correct, I created a new sample (EXAMPLE B) without the "XmlSerializer", and the resulting file was much better:
the chars inside the XML file are now correct (not escaped)
opening the file with notepad++, it detects the "ANSI" encoding (not the desired "Windows-1252")
<?xml version='1.0' encoding='windows-1252' ?><test><message>áéíóúãõç</message></test>
private void doDebug01() {
    File dstPath = Environment.getExternalStoragePublicDirectory(Environment.DIRECTORY_DOWNLOADS);
    try {
        // EXAMPLE A
        File dstFile = new File(dstPath, "test.xml");
        FileOutputStream dstFOS = new FileOutputStream(dstFile);
        OutputStream dstOS = new BufferedOutputStream(dstFOS);
        OutputStreamWriter dstOSW = new OutputStreamWriter(dstOS, "windows-1252");
        XmlSerializer dstXml = Xml.newSerializer();
        dstXml.setOutput(dstOSW);
        dstXml.startDocument("windows-1252", null);
        dstXml.startTag(null, "test");
        dstXml.startTag(null, "message");
        dstXml.text("áéíóúãõç");
        dstXml.endTag(null, "message");
        dstXml.endTag(null, "test");
        dstXml.endDocument();
        dstXml.flush();
        dstOSW.close();
        // EXAMPLE B
        File dstFileB = new File(dstPath, "testb.xml");
        FileOutputStream dstFOSB = new FileOutputStream(dstFileB);
        OutputStream dstOSB = new BufferedOutputStream(dstFOSB);
        OutputStreamWriter dstOSWB = new OutputStreamWriter(dstOSB, "windows-1252");
        dstOSWB.write("<?xml version='1.0' encoding='windows-1252' ?>");
        dstOSWB.write("<test>");
        dstOSWB.write("<message>áéíóúãõç</message>");
        dstOSWB.write("</test>");
        dstOSWB.flush();
        dstOSWB.close();
    } catch (IOException e) {
        Log.e("DEBUG", e.getMessage());
        e.printStackTrace();
    }
}
So now I'm confused and kind of stuck with the results I got from EXAMPLE B, because I don't know whether my problem (A) resides in the "XmlSerializer" or in the "streams" parameters.
What am I missing in my code (A) to get an XML file with correct chars, encoded in "Windows-1252" (or at least as close to EXAMPLE B as possible)?
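One thing worth trying (a sketch, not verified on every Android version): the XmlPull XmlSerializer interface has a setOutput(OutputStream, String encoding) overload. When the serializer is handed a Writer it does not know the target encoding and escapes anything it cannot guarantee, so handing it the raw stream plus the encoding name may let it write the characters natively:

// EXAMPLE A reworked: the serializer manages the encoding itself.
File dstFile = new File(dstPath, "test.xml");
try (OutputStream dstOS = new BufferedOutputStream(new FileOutputStream(dstFile))) {
    XmlSerializer dstXml = Xml.newSerializer();
    dstXml.setOutput(dstOS, "windows-1252"); // OutputStream overload, not a Writer
    dstXml.startDocument("windows-1252", null);
    dstXml.startTag(null, "test");
    dstXml.startTag(null, "message");
    dstXml.text("áéíóúãõç");
    dstXml.endTag(null, "message");
    dstXml.endTag(null, "test");
    dstXml.endDocument();
    dstXml.flush();
}

If the platform serializer still escapes the accented characters after this change, that would point at the problem being the serializer implementation rather than the streams.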

Read and Write file in java whilst keeping the special characters

After reading and writing the file, the bullet points get replaced with the unreadable replacement character "�". Here is the code:
String str = FileUtils.readFileToString(new File(sourcePath), "UTF-8");
nextTextFile.append(redactStrings(str, redactedStrings));
FileUtils.writeStringToFile(new File(targetPath), nextTextFile.toString());
Link to sample file
generated file with funny characters
I checked it out on Windows: if the source file is encoded in UTF-8, the following code, making use of java.nio, will produce the desired output both on the console and in a file, which is then encoded in UTF-8 as well:
public static void main(String[] args) {
    Path inPath = Paths.get(sourcePath);
    Path outPath = Paths.get(targetPath);
    try {
        List<String> lines = Files.readAllLines(inPath, StandardCharsets.UTF_8);
        lines.forEach(line -> System.out.println(line));
        Files.write(outPath, lines, StandardCharsets.UTF_8);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Please note that the source file has to be encoded in UTF-8; otherwise this may throw an IOException stating something like Input length = 1. Play around with the StandardCharsets if that does not meet your requirements, or make sure the encoding of the source file is UTF-8.
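For completeness, the original commons-io calls can also be fixed by passing the charset on the write side as well (a sketch, assuming a commons-io version with the Charset overloads, 2.3 or later):

String str = FileUtils.readFileToString(new File(sourcePath), StandardCharsets.UTF_8);
nextTextFile.append(redactStrings(str, redactedStrings));
FileUtils.writeStringToFile(new File(targetPath), nextTextFile.toString(), StandardCharsets.UTF_8);

The original write call omitted the charset, so the platform default encoding was used for the output file, which is what mangles the bullet points.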

Want to throw exception when encounter special UTF-8 characters in an XML file

I am parsing an XML file which has UTF-8 encoding.
<?xml version="1.0" encoding="UTF-8"?>
Now our business application has a set of components which are developed by different teams and do not use the same libraries for parsing XML. My component uses JAXB while some other component uses SAX, and so forth. Now when the XML file has special characters like "ä", "ë" or "é" (accented characters), JAXB parses it properly, but other components (sub-apps) cannot parse them properly and throw exceptions.
Due to business needs I cannot change the programming of the other components, but I have to put a restriction/validation in my application to make sure the XML (data-load) file does not contain any such characters.
What is the best approach to make sure that the file does not contain the above-mentioned (or similar) characters, so that I can throw an exception (or give an error) right there, before I start parsing the XML file with JAXB?
If your customer sends you an XML file with a header whose encoding does not match the file contents, you might as well give up trying to do anything meaningful with that file. - Are they really sending data where the header does not match the actual encoding? That's not XML, then. And you ought to charge them more ;-)
Simply read the file as a FileInputStream, byte by byte. If it contains a negative byte value, refuse to process it.
You can keep encoding settings like UTF-8 or ISO 8859-1, because they all have US-ASCII as a proper subset.
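A minimal sketch of that byte-by-byte check (the file name is a placeholder); InputStream.read() returns each byte as a value in 0..255, so a "negative byte" shows up as a value above 127:

try (InputStream in = new BufferedInputStream(new FileInputStream("data-load.xml"))) {
    int b;
    long offset = 0;
    while ((b = in.read()) != -1) {
        if (b > 127) { // high bit set: not US-ASCII
            throw new IllegalArgumentException(
                "Non-ASCII byte 0x" + Integer.toHexString(b) + " at offset " + offset);
        }
        offset++;
    }
}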
Yes, my answer would be the same as what laune mentions...
static boolean readInput() {
    boolean isValid = true;
    StringBuffer buffer = new StringBuffer();
    try {
        FileInputStream fis = new FileInputStream("test.txt");
        InputStreamReader isr = new InputStreamReader(fis, StandardCharsets.UTF_8);
        Reader in = new BufferedReader(isr);
        int ch;
        while ((ch = in.read()) > -1) {
            buffer.append((char) ch);
            System.out.println("ch=" + ch);
            // Check the range for each character according to the Wikipedia
            // table at http://en.wikipedia.org/wiki/UTF-8 - here we simply
            // reject anything outside US-ASCII, as laune suggested.
            if (ch > 127) {
                isValid = false;
                break;
            }
        }
        in.close();
        return isValid;
    } catch (IOException e) {
        e.printStackTrace();
        return false;
    }
}
I'm just adding a code snippet...
You should be able to wrap the XML input in a java.io.Reader in which you specify the actual encoding and then process that normally. Java will leverage the encoding specified in the XML for an InputStream, but when a Reader is used, the encoding of the Reader will be used.
JAXBContext jc = JAXBContext.newInstance(Address.class);
Unmarshaller unmarshaller = jc.createUnmarshaller();
InputStream inputStream = new FileInputStream("input.xml");
Reader reader = new InputStreamReader(inputStream, "UTF-16");
try {
    Address address = (Address) unmarshaller.unmarshal(reader);
} finally {
    reader.close();
}

Special characters are not converted correctly from pdf to text

I have a set of PDF files that contain Central European characters such as č, Ď, Š and so on. I want to convert them to text, and I have tried pdftotext and PDFBox through Apache Tika, but some of them are always converted incorrectly.
The strange thing is that the same character in the same text is converted correctly in some places and incorrectly in others! An example is this pdf.
In the case of pdftotext I am using these options:
pdftotext -nopgbrk -eol dos -enc UTF-8 070612.pdf
My Tika code looks like this:
InputStream is = null;
String newname = f.getCanonicalPath().replace(".pdf", ".txt");
OutputStreamWriter print = new OutputStreamWriter(new FileOutputStream(newname), Charset.forName("UTF-16"));
String fileString = "path\\to\\myfiles\\";
try {
    is = new FileInputStream(f);
    ContentHandler contenthandler = new BodyContentHandler(10 * 1024 * 1024);
    Metadata metadata = new Metadata();
    PDFParser pdfparser = new PDFParser();
    pdfparser.parse(is, contenthandler, metadata, new ParseContext());
    String outputString = contenthandler.toString();
    outputString = outputString.replace("\n", "\r\n");
    System.err.println("Writing now file " + newname);
    print.write(outputString);
} catch (Exception e) {
    e.printStackTrace();
} finally {
    if (is != null) is.close();
    print.close();
}
Edit: Forgot to mention that I am facing the same issue when converting to text from Acrobat Reader XI, as well.
Well aside from anything else, this code will use the platform default encoding:
PrintWriter print = new PrintWriter(newname);
print.print(outputString);
print.close();
I suggest you use an OutputStreamWriter instead, wrapping a FileOutputStream, and specify UTF-8 as the encoding (as it can encode all of Unicode, and is generally well supported).
You should also close the writer in a finally block, and I'd probably separate the "reading" part from the "writing" part. (I'd avoid catching Exception too, but going into the details of exception handling is a bit beyond the point of this answer.)
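A minimal sketch of the suggested writing side (try-with-resources handles the closing, which also covers the finally-block point; the names reuse the question's variables):

try (Writer writer = new OutputStreamWriter(new FileOutputStream(newname), StandardCharsets.UTF_8)) {
    writer.write(outputString);
}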
