I have an RDF file containing some errors (probably unrecognized characters).
Is there any way to find these errors in Java?
Any XML document can declare its encoding in the header, and UTF-8 is the default. If your XML contains bytes that the SAX parser can't decode, you don't have "well-formed" XML. Another option is to tell the correct charset/encoding to the InputStreamReader you use.
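If you just need to locate the problem, a minimal sketch like the following can help. The file name data.rdf is my assumption, and exactly which exception you get for bad bytes depends on the parser; with the JDK's built-in parser it is typically a CharConversionException, while structural problems surface as SAXParseException with a line and column.

import java.io.CharConversionException;
import java.io.FileInputStream;
import java.io.InputStream;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class RdfWellFormednessCheck {

    public static void main(String[] args) throws Exception {
        // Hand the raw bytes to the parser so it decodes them itself using the
        // encoding declared in the file (UTF-8 if nothing is declared).
        try (InputStream in = new FileInputStream("data.rdf")) {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(in), new DefaultHandler());
            System.out.println("Well-formed XML");
        } catch (SAXParseException e) {
            // Structural problems: the exception carries the exact position.
            System.out.printf("Error at line %d, column %d: %s%n",
                    e.getLineNumber(), e.getColumnNumber(), e.getMessage());
        } catch (CharConversionException e) {
            // Bytes that are not valid in the declared encoding end up here.
            System.out.println("Invalid byte sequence: " + e.getMessage());
        }
    }
}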
I am using a Spring Boot REST API to upload a CSV file as a MultipartFile. The CSVFormat class of org.apache.commons.csv is used to configure the format, CSVParser parses the file, and the iterated records are stored in a MySQL database.
csvParser = CSVFormat.DEFAULT
.withDelimiter(separator)
.withIgnoreSurroundingSpaces()
.withQuote('"')
.withHeader(CsvHeaders.class)
.parse(new InputStreamReader(csvFile.getInputStream()));
The observation is that when the CSV files are uploaded with a charset of UTF-8, everything works fine. But if the CSV file is in a different encoding (ANSI etc.), German and other non-ASCII characters come out as random symbols.
For example, äößü end up as ����.
I tried the below to specify the encoding explicitly, but it did not work either.
csvParser = CSVFormat.DEFAULT
.withDelimiter(separator)
.withIgnoreSurroundingSpaces()
.withQuote('"')
.withHeader(CsvHeaders.class)
.parse(new InputStreamReader(csvFile.getInputStream(), StandardCharsets.UTF_8));
Can you please advise? Thank you so much in advance.
Calling new InputStreamReader(csvFile.getInputStream(), StandardCharsets.UTF_8) tells the CSV parser that the content of the input stream is UTF-8 encoded.
Since UTF-8 is (usually) the default encoding, this is effectively the same as using new InputStreamReader(csvFile.getInputStream()).
If I understand your question correctly, this is not what you intended. Instead, you want to automatically choose the right encoding based on the import file, right?
Unfortunately, the CSV format does not store which encoding was used.
There are libraries you can use to guess the most probable encoding based on the bytes contained in the file. While they are pretty accurate, they are still guessing, and there is no guarantee that you will get the right encoding in the end.
Depending on your use case it might be easier to just agree with the consumer on a fixed encoding (i.e. they can upload UTF-8 or ANSI, but not both).
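If you do want to guess, here is a sketch of the idea. It assumes the juniversalchardet library (org.mozilla.universalchardet) on the classpath, which is my pick and not part of your stack, and it falls back to UTF-8 when detection fails:

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

import org.mozilla.universalchardet.UniversalDetector;

public class EncodingGuessingReader {

    // Buffer the upload, guess its charset, then build the reader with the guess.
    public static InputStreamReader open(InputStream upload) throws Exception {
        byte[] bytes = upload.readAllBytes(); // Java 9+; fine for typical CSV uploads

        UniversalDetector detector = new UniversalDetector(null);
        detector.handleData(bytes, 0, bytes.length);
        detector.dataEnd();
        String guessed = detector.getDetectedCharset(); // may be null

        Charset charset = guessed != null ? Charset.forName(guessed) : StandardCharsets.UTF_8;
        return new InputStreamReader(new ByteArrayInputStream(bytes), charset);
    }
}

You would then pass that reader to CSVFormat...parse(...) instead of the plain InputStreamReader, keeping in mind that the detection is still only a heuristic.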
Try as shown below, which worked for me for the same issue:
new InputStreamReader(csvFile.getInputStream(), "UTF-8")
I'm using javax.xml.stream.XMLStreamReader to parse XML documents. Unfortunately, some of the documents I'm parsing use non-IANA encoding names, like "macroman" and "ms-ansi". For example:
<?xml version="1.0" encoding="macroman"?>
<foo />
This causes the parse to blow up with an exception:
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,42]
Message: Invalid encoding name "macroman".
Is there any way to provide a custom encoding handler to my XMLStreamReader so that I can augment it with support for the encodings I need?
You could wrap the input stream with a transformer that replaces the non-standard charset with the equivalent charset that XMLStreamReader does understand.
See Filter (search and replace) array of bytes in an InputStream
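A sketch of that idea follows. The alias table ("macroman" to MacRoman, "ms-ansi" to windows-1252) is my assumption about what those documents actually contain, and whether the parser then accepts the rewritten names depends on the charsets your JRE and parser support, so adjust the targets as needed:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class EncodingNameFixer {

    private static final Map<String, String> ALIASES = Map.of(
            "macroman", "MacRoman",
            "ms-ansi", "windows-1252");

    // Rewrites the encoding pseudo-attribute in the XML declaration so that the
    // parser sees a name it understands. ISO-8859-1 maps bytes to chars 1:1, so
    // the round trip does not alter any bytes outside the replaced name.
    public static InputStream normalizeDeclaredEncoding(InputStream in) throws IOException {
        byte[] all = in.readAllBytes();
        String text = new String(all, StandardCharsets.ISO_8859_1);
        for (Map.Entry<String, String> alias : ALIASES.entrySet()) {
            text = text.replaceFirst("encoding=\"" + alias.getKey() + "\"",
                                     "encoding=\"" + alias.getValue() + "\"");
        }
        return new ByteArrayInputStream(text.getBytes(StandardCharsets.ISO_8859_1));
    }
}

You would then create the reader with something like XMLInputFactory.newInstance().createXMLStreamReader(EncodingNameFixer.normalizeDeclaredEncoding(rawStream)).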
I'm using Apache Camel 2.17.1 to process a CSV file and I'm using Bindy in conjunction with CsvRecord to parse the file and unmarshal each line into a POJO.
The issue I'm facing is that some of the fields in the file contain special Unicode characters like "Blah ®", and these are not being parsed correctly; the String field ends up holding "Blah �" instead...
Is this a known bug and/or is there some workaround or configuration I can specify to enable these characters to be handled correctly as unicode characters?
Thanks in advance!
Check your input file's encoding. Change the charset to UTF-8 and try again.
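For example, you can state the charset explicitly on the route before Bindy unmarshals. This sketch assumes a file endpoint and a @CsvRecord POJO called MyCsvRecord, both of which are illustrative names rather than anything from your project:

import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.dataformat.bindy.csv.BindyCsvDataFormat;

public class CsvRoute extends RouteBuilder {

    @Override
    public void configure() {
        from("file:inbox?fileName=data.csv&charset=UTF-8")        // read the file as UTF-8
            .convertBodyTo(String.class, "UTF-8")                 // be explicit about the charset
            .unmarshal(new BindyCsvDataFormat(MyCsvRecord.class)) // MyCsvRecord: your @CsvRecord POJO
            .to("log:parsed");
    }
}

If the files are not actually UTF-8, replace the charset with whatever encoding they really use; the point is to decode with the correct one before Bindy sees the text.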
I have a servlet which should reply to requests with JSON {obj:XML} (meaning JSON containing an XML object inside).
The XML is encoded in UTF-8 and contains several characters like पोलैंड.
The XML is in an org.w3c.dom.Document and I am using the JSON.org library to handle JSON. When I try to print it to the ServletOutputStream, the characters are not encoded correctly. I have also tested by writing the response to a file, but the encoding is not UTF-8.
Parser.printTheDom(documentFromInputStream,byteArrayOutputStream);
OutputStreamWriter oS=new OutputStreamWriter(servletOutputStream, "UTF-8");
oS.write((jsonCallBack+"("));
oS.write(byteArrayOutputStream.toString());
oS.write(");");
I have even tried it locally (without deploying the servlet) with both the previous and the following code:
oS.write("पोलैंड");
and the result is the same.
Instead, when I print just the document, the file is well-formed XML.
oS.write((jsonCallBack+"("));
Parser.printTheDom(documentFromInputStream,oS);
oS.write(");");
Any help?
Typically, if binary data needs to be part of an XML document, it's Base64-encoded. See this question for more details. I suggest you Base64-encode the fields that can contain exotic UTF-8 characters and Base64-decode them on the client side.
See this question for two good options for Base64 encoding/decoding in Java.
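A minimal sketch with the JDK's own java.util.Base64 (Java 8+); the sample string is just an illustration:

import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64Example {

    public static void main(String[] args) {
        String xml = "<name>पोलैंड</name>";

        // Encode the UTF-8 bytes of the XML snippet so only ASCII travels in the JSON.
        String encoded = Base64.getEncoder()
                .encodeToString(xml.getBytes(StandardCharsets.UTF_8));

        // The client reverses the two steps: Base64-decode, then decode as UTF-8.
        String decoded = new String(Base64.getDecoder().decode(encoded),
                StandardCharsets.UTF_8);

        System.out.println(decoded.equals(xml)); // true
    }
}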
I have been trying to use XStreamMarshaller to generate XML output in my Java Spring project. The XML I am generating has CDATA values in the element text. I am manually creating this CDATA text in the command object like this:
f.setText("<![CDATA[cdata-text]]>");
The XStreamMarshaller generated the element (text-data below is an alias) as:
<text-data><![CDATA[cdata-text]]></text-data>
The above is what I expect. But when I do a View Source on the generated XML output, I see this for the element instead: <text-data>&lt;![CDATA[cdata-text]]&gt;</text-data>
Issue:
As you can see, the less-than and greater-than characters have been replaced by &lt; and &gt; in the View Source. I need my client to read the source and identify the CDATA section in the XML output, which it will not be able to do in the above scenario.
Is there a way I can get the XStreamMarshaller to escape special characters in the text I provided?
I have set the encoding of the Marshaller to ISO-8859-1, but that does not work either. If the above cannot be done by XStreamMarshaller, can you please suggest alternate marshallers/unmarshallers that can do this for me?
Edit: displaying my XML and View Source as suggested by Paŭlo Ebermann below:
XML View (as displayed in IE):
An invalid character was found in text content. Error processing resource 'http://localhost:8080/file-service-framework/fil...
Los t
View Source:
<service id="file-text"><text-data><![CDATA[
Los túneles a través de las montañas hacen más fácil viajar por carretera.
]]></text-data></service>
Thank you very much.
Generating CDATA sections is the task of your XML-generating library, not of its client. So you should simply have to write
f.setText("cdata-text");
and then the library can decide whether to use <![CDATA[...]]> or &lt;-escaping for its contents. It should make no difference to the receiver.
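To illustrate the point, here is a small sketch (my own example strings) showing that a parser returns the same text content for both forms:

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;

public class EscapingVsCdata {

    public static void main(String[] args) throws Exception {
        // The escaped form and the CDATA form carry exactly the same text content.
        String[] variants = { "<t>5 &lt; 6</t>", "<t><![CDATA[5 < 6]]></t>" };
        for (String xml : variants) {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            System.out.println(doc.getDocumentElement().getTextContent()); // "5 < 6" both times
        }
    }
}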
Edit:
Looking at your output, it looks right (apart from the CDATA) - here you must work on your input, as said.
If IE throws an error here, most probably you don't have declared the right encoding.
I don't really know much about the Spring framework, but the encoding used by the Marshaller should be the same as the encoding sent in either the HTTP header (Content-Type: ...;charset=...) or the <?xml version="1.0" encoding="..." ?> XML prologue (and these two should not differ from each other, either).
I would recommend UTF-8 as encoding everywhere, as this can represent all characters, not only the Latin-1 ones.
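A sketch of what that alignment could look like with Spring's XStreamMarshaller and a servlet response; the class and method names here are illustrative, not from your project:

import java.io.IOException;

import javax.servlet.http.HttpServletResponse;
import javax.xml.transform.stream.StreamResult;

import org.springframework.oxm.xstream.XStreamMarshaller;

public class XmlResponseWriter {

    // Keep the XML prologue's encoding and the HTTP Content-Type charset in sync.
    public void write(Object graph, HttpServletResponse response) throws IOException {
        XStreamMarshaller marshaller = new XStreamMarshaller();
        marshaller.setEncoding("UTF-8"); // ends up in <?xml version="1.0" encoding="UTF-8"?>

        response.setContentType("text/xml;charset=UTF-8"); // HTTP header must say the same
        marshaller.marshal(graph, new StreamResult(response.getOutputStream()));
    }
}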