How to write accented characters from XML into MarkLogic using the Java API?

I have a 20 MB XML file containing accented characters such as Ö, É, Á, and many more. The problem is that when I insert the file into MarkLogic, these characters are stored in unaccented form (O, E, A), but I want to keep the accents. How can I store the characters in accented form and read the XML back the same way? My XML file is ISO-8859-1 encoded.
Code which I have written for writing and reading:
DatabaseClient client = DatabaseClientFactory.newClient(IP, PORT,
        DATABASE_NAME, USERNAME, PWD, Authentication.DIGEST);
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLStreamReader streamReader = factory.createXMLStreamReader(new FileReader("record.xml"));
XMLDocumentManager xmlDocMgr = client.newXMLDocumentManager();
XMLStreamReaderHandle handle = new XMLStreamReaderHandle(streamReader);
xmlDocMgr.write("/" + filename, handle);
For reading XML:
XMLDocumentManager docMgr = client.newXMLDocumentManager();
DOMHandle xmlhandle = new DOMHandle();
docMgr.read("/" + filename, xmlhandle);
String doc = xmlhandle.toString();
// Note: NFD decomposition plus stripping everything outside \p{ASCII}
// removes the accents (Ö becomes O), defeating the goal of keeping them.
String data = Normalizer.normalize(doc, Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", "");
return data;
I am returning this data for display in the browser.
I am not able to find where the problem is.

If the XML file does not have an XML prolog that declares its encoding, you should specify the ISO-8859-1 encoding when reading the file, before writing it to the database (as flafoux has pointed out).
You should also specify the encoding when reading the content from the database unless the destination accepts UTF-8 encoding.
For more information, see:
http://docs.marklogic.com/guide/java/document-operations#id_11208
Hoping that helps,

You need to specify the encoding (and also switch to the constructor that takes an InputStream):
XMLStreamReader streamReader = factory.createXMLStreamReader(new FileInputStream("record.xml"), "ISO-8859-1");
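Putting both answers together, the corrected write path might look like the sketch below. It reuses the class names from the question and assumes the file really is ISO-8859-1 encoded:
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import com.marklogic.client.document.XMLDocumentManager;
import com.marklogic.client.io.XMLStreamReaderHandle;

XMLInputFactory factory = XMLInputFactory.newInstance();
// Pass the raw stream plus the real encoding; a FileReader would
// silently decode the file with the platform default charset.
XMLStreamReader streamReader =
        factory.createXMLStreamReader(new FileInputStream("record.xml"), "ISO-8859-1");
XMLDocumentManager xmlDocMgr = client.newXMLDocumentManager();
xmlDocMgr.write("/" + filename, new XMLStreamReaderHandle(streamReader));
On the read side, drop the Normalizer/replaceAll post-processing shown in the question: stripping everything outside \p{ASCII} is exactly what turns Ö into O. MarkLogic stores XML as UTF-8 internally, so the DOMHandle content can be returned to the browser unchanged as long as the HTTP response declares UTF-8.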

Related

How to set 'charset' for DatumWriter || write Avro that contains Arabic characters to HDFS

Some of the data contains values in Arabic, and when the data is written, the reader code and the hadoop fs -text command show ?? instead of the Arabic characters.
1) Writer
// avro object is provided as SpecificRecordBase
Path path = new Path(pathStr);
DatumWriter<SpecificRecord> datumWriter = new SpecificDatumWriter<>();
FileSystem fs = FileSystem.get(URI.create(hdfsUri), conf); // HDFS file system
FSDataOutputStream outputStream = fs.create(path);
DataFileWriter<SpecificRecord> dataFileWriter = new DataFileWriter<>(datumWriter);
Schema schema = getSchema(); // method to get schema
dataFileWriter.setCodec(CodecFactory.snappyCodec());
dataFileWriter.create(schema, outputStream);
dataFileWriter.append(avroObject);
dataFileWriter.close(); // flush and finish the container file
2) Reader
Configuration conf = new Configuration();
FsInput in = new FsInput(new Path(hdfsFilePathStr), conf);
DatumReader<Row> datumReader = new GenericDatumReader<>();
DataFileReader<Row> dataFileReader = new DataFileReader<>(in, datumReader);
GenericRecord outputData = (GenericRecord) dataFileReader.iterator().next();
I've tried the hadoop fs -text {filePath} command; there too, the Arabic values appear as ??.
It will be really difficult to change the format in which the data is written, because there are numerous consumers of the same file.
I also tried reading through SpecificRecordBase and am still getting ??.
Edit
Also tried these (in both reader and writer):
Configuration conf = new Configuration();
conf.set("file.encoding", StandardCharsets.UTF_16.displayName());
AND
System.setProperty("file.encoding", StandardCharsets.UTF_16.displayName());
Doesn't help.
Apparently, HDFS does not support a lot of non-English characters. To work around that, change the field from string to bytes in your Avro schema.
To convert your value from String to bytes, use:
ByteBuffer.wrap(str.getBytes(StandardCharsets.UTF_8)).
Then, while reading, to convert it back to String use:
new String(byteData.array(), StandardCharsets.UTF_8).
Rest of the code in your reader and writer stays the same.
Doing this, the hadoop fs -text command will show proper text for English characters; for non-English characters it might show gibberish, but your reader will still be able to rebuild the UTF-8 String from the ByteBuffer.
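As a sketch of the workaround (the record schema and the field name "title" are made up for illustration; the conversion calls are the ones given above):
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Doc\",\"fields\":"
      + "[{\"name\":\"title\",\"type\":\"bytes\"}]}"); // bytes instead of string

// Writing: encode the String explicitly as UTF-8 before appending.
GenericRecord record = new GenericData.Record(schema);
record.put("title", ByteBuffer.wrap("مرحبا".getBytes(StandardCharsets.UTF_8)));

// Reading: turn the same bytes back into a String.
ByteBuffer buf = (ByteBuffer) record.get("title");
String title = new String(buf.array(), StandardCharsets.UTF_8);
Because the bytes pass through Avro untouched, the encoding no longer depends on any charset setting of the JVM or of HDFS.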

Umlauts get lost on another system after encoding and decoding Base64

Given the following implementation, I face the problem that, on another system, the XML file is missing the umlauts (ä, ü, ö) compared to the original XML file. Instead of the umlauts, the replacement character (0xEF 0xBF 0xBD, U+FFFD) is inserted into the XML file.
1. Get a zip file containing an XML file with umlauts
2. Decompress the zip file
3. Encode the XML content to a Base64 payload and save it to the DB
4. Query the entity
5. Get the Base64 payload
6. Decode the Base64 content
7. The decoded Base64 content is an XML file which should contain the original umlauts
What's driving me crazy is the fact that the decoded Base64 content is missing the umlauts on the other system; instead of the umlauts I get the replacement character. On my system the same implementation works without any replacement.
The following code is just an MCVE to explain the problem; it works fine on my system, but on another system (Windows Server 2013) the umlauts are missing after decoding.
String requestUrl = "https://myserver/mypath/Message_166741.zip";
HttpGet httpget = new HttpGet(requestUrl);
HttpResponse response = httpClient.execute(httpget);
HttpEntity entity = response.getEntity();
InputStream inputStream = entity.getContent();
byte[] decompressedInputStream = decompress(inputStream);
String content = new String(decompressedInputStream, StandardCharsets.UTF_8);
String originFileName = new SimpleDateFormat("yyyyMMddHHmm'_origin.xml'").format(new Date());
String originFileNameWithPath = String.format("C:\\temp\\Tests\\%1$s", originFileName);
// File contains the expected umlauts
FileUtils.writeStringToFile(new File(originFileNameWithPath), content);
String payloadUTF8 = Base64.encodeBase64String(ZipUtils.compress(content.getBytes("UTF-8")));
String payload = Base64.encodeBase64String(ZipUtils.compress(content.getBytes()));
String payloadJavaBase64 = new String(java.util.Base64.getEncoder().encode(ZipUtils.compress(content.getBytes())));
String xmlMessageJavaBase64;
byte[] compressedBinaryJavaBase64 = java.util.Base64.getDecoder().decode(payloadJavaBase64);
byte[] decompressedBinaryJavaBase64 = ZipUtils.decompress(compressedBinaryJavaBase64);
xmlMessageJavaBase64 = new String(decompressedBinaryJavaBase64, "UTF-8");
String xmlMessageUTF8;
byte[] compressedBinaryUTF8 = java.util.Base64.getDecoder().decode(payloadUTF8);
byte[] decompressedBinaryUTF8 = ZipUtils.decompress(compressedBinaryUTF8);
xmlMessageUTF8 = new String(decompressedBinaryUTF8, "UTF-8");
String xmlMessage;
byte[] compressedBinary = java.util.Base64.getDecoder().decode(payload);
byte[] decompressedBinary = ZipUtils.decompress(compressedBinary);
xmlMessage = new String(decompressedBinary, "UTF-8");
String processedFileName = new SimpleDateFormat("yyyyMMddHHmm'_processed.xml'").format(new Date());
String processedFileNameUTF8 = new SimpleDateFormat("yyyyMMddHHmm'_processedUTF8.xml'").format(new Date());
String processedFileNameJavaBase64 = new SimpleDateFormat("yyyyMMddHHmm'_processedJavaBase64.xml'").format(new Date());
// These files do not contain the umlauts anymore.
// Instead of the umlauts a replacement character is inserted (0xEF 0xBF 0xBD (efbfbd))
String processedFileNameWithPath = String.format("C:\\temp\\Tests\\%1$s", processedFileName);
String processedFileNameWithPathUTF8 = String.format("C:\\temp\\Tests\\%1$s", processedFileNameUTF8);
String processedFileNameWithPathJavaBase64 = String.format("C:\\temp\\Tests\\%1$s", processedFileNameJavaBase64);
FileUtils.writeStringToFile(new File(processedFileNameWithPath), xmlMessage);
FileUtils.writeStringToFile(new File(processedFileNameWithPathUTF8), xmlMessageUTF8);
FileUtils.writeStringToFile(new File(processedFileNameWithPathJavaBase64), xmlMessageJavaBase64);
The three files are just for testing purposes, but I hope they illustrate the problem.
Edit
Both ways create an XML file with ü, ö, ä on my machine.
Only the WITHOUT implementation creates an XML file with ü, ö, ä on the other system; the "content" string of the WITH UTF-8 variant contains the replacement character for ü.
// WITHOUT UTF-8 IN BYTE[] => STRING CTOR
byte[] dci = decompress(inputStream);
content = new String(dci);
byte[] compressedBinary = java.util.Base64.getDecoder().decode(content);
byte[] decompressedBinary = ZipUtils.decompress(compressedBinary);
String xml = new String(decompressedBinary);
// WITH UTF-8 IN BYTE[] => STRING CTOR
byte[] dci = decompress(inputStream);
content = new String(dci, StandardCharsets.UTF_8);
byte[] compressedBinary = java.util.Base64.getDecoder().decode(content);
byte[] decompressedBinary = ZipUtils.decompress(compressedBinary);
String xml = new String(decompressedBinary, "UTF-8");
Edit #2
There also seems to be a difference between running the code inside IntelliJ and outside of it on my machine; I did not know that this makes such a huge difference. If I run the code outside of IntelliJ (java.exe -jar myjarfile), the WITH UTF-8 part replaces the ü with something Notepad++ shows as xFC. Funny: my Raspberry Pi shows both files with ü where Notepad++ on Windows shows xFC.
The whole thing confuses me and I would like to understand what the problem is, especially since the XML file declares UTF-8 in its header.
Edit #3 Final Solution
// ## SERVER
// Get ZIP from request URL
HttpGet httpget = new HttpGet(requestUrl);
HttpResponse response = httpClient.execute(httpget);
HttpEntity entity = response.getEntity();
InputStream inputStream = entity.getContent();
byte[] decompressedInputStream = decompress(inputStream);
// Produces a XML string which SHOULD contain ü, ö, ä
String xmlOfZipFileContent = new String(decompressedInputStream, StandardCharsets.UTF_8);
// Just for testing write to file
String xmlOfZipFileSavePath = String.format("C:\\temp\\Tests\\%1$s", new SimpleDateFormat("yyyyMMddHHmm'_original.xml'").format(new Date()));
FileUtils.writeStringToFile(new File(xmlOfZipFileSavePath), xmlOfZipFileContent, StandardCharsets.UTF_8);
// The payload gets stored into the DB
String payload = java.util.Base64.getEncoder().encodeToString(ZipUtils.compress(xmlOfZipFileContent.getBytes(StandardCharsets.UTF_8)));
// Store payload to db
// Client queries database and gets the payload
// payload = dbEntity.get().payload
// The following three lines is on client
byte[] compressedBinaryPayload = java.util.Base64.getDecoder().decode(payload);
byte[] decompressedBinaryPayload = ZipUtils.decompress(compressedBinaryPayload);
String xmlMessageOutOfPayload = new String(decompressedBinaryPayload, StandardCharsets.UTF_8);
String xmlOfPayloadSavePath = String.format("C:\\temp\\Tests\\%1$s", new SimpleDateFormat("yyyyMMddHHmm'_payload.xml'").format(new Date()));
FileUtils.writeStringToFile(new File(xmlOfPayloadSavePath), xmlMessageOutOfPayload, StandardCharsets.UTF_8);
If I understood correctly, your situation seems to be the following:
// Decompress data from the server; it's in ISO-8859-1 or a similar single-byte encoding
byte[] dci = decompress(inputStream);
// Data gets corrupted because of wrong charset
// This is where ü gets converted to unicode replacement character
content = new String(dci, StandardCharsets.UTF_8);
The rest of the code uses UTF-8 explicitly, but it doesn't matter, as the data has already been corrupted at this point. In the end you expect a UTF-8 encoded file.
Also because the XML file contains the UTF8 as encode in header.
That doesn't prove anything. If you treat it as just a text file, you can write it out in as many encodings as you want, and it would still claim to be UTF-8.
InputStream inputStream = entity.getContent();
byte[] decompressedInputStream = decompress(inputStream);
Fine, and it is assumed that the bytes are in UTF-8, as:
String content = new String(decompressedInputStream, StandardCharsets.UTF_8);
Should the bytes not be in UTF-8, you could try Windows Latin-1:
Charset.forName("Windows-1252")
Otherwise, decompressedInputStream can be used wherever content is converted to bytes in UTF-8.
...
The FileUtils.writeStringToFile without encoding specified uses the default platform encoding.
// File contains the expected umlauts
//FileUtils.writeStringToFile(new File(originFileNameWithPath), content);
Better is to ensure that UTF-8 is written. Either add the encoding to convert the Unicode String to bytes in UTF-8, or simply write the original bytes:
Files.write(Paths.get(originFileNameWithPath), decompressedInputStream);
Also the Base64 encoded UTF-8 bytes of the String should be used:
String payloadUTF8 = Base64.encodeBase64String(ZipUtils.compress(
        content.getBytes(StandardCharsets.UTF_8)));
String payloadJavaBase64 = new String(java.util.Base64.getEncoder().encode(
        ZipUtils.compress(content.getBytes(StandardCharsets.UTF_8))));
The standard Java SE Base64 will do, though do not use its decodeString and encodeString, as those use ISO-8859-1 (Latin-1).
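Reduced to its core, the safe round trip looks like this (a sketch; ZipUtils stands in for the asker's compression helper, and xml is the decompressed XML string):
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Encode: String -> UTF-8 bytes -> compress -> Base64 text
String payload = Base64.getEncoder()
        .encodeToString(ZipUtils.compress(xml.getBytes(StandardCharsets.UTF_8)));

// Decode: Base64 text -> decompress -> UTF-8 bytes -> String
String roundTripped = new String(
        ZipUtils.decompress(Base64.getDecoder().decode(payload)),
        StandardCharsets.UTF_8);
Because the charset is named explicitly at both conversion points, the result no longer depends on the platform default encoding of whichever machine runs the code.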

Encoding issue when reading from Google translator API and writing to properties file

I am using Google translator API to generate Arabic property file from English property file.
Making a URL connection and issuing a GET request, passing the original language, the target language, and the value to be translated:
URLConnection urlCon = null;
String urlStr = "https://www.googleapis.com/language/translate/v2";
URL url = new URL(urlStr + "?key=" + apikey + "&source=" + origlang + "&target=" + translateToLang + "&q=" + value);
// Caveat: value should be percent-encoded (e.g. URLEncoder.encode(value, "UTF-8"))
// before being appended to the query string.
urlCon = url.openConnection();
urlCon.setConnectTimeout(1000 * 60 * 5);
urlCon.setReadTimeout(1000 * 60 * 5);
urlCon.setDoInput(true);
urlCon.setDoOutput(true);
urlCon.setUseCaches(false);
((HttpURLConnection) urlCon).setRequestMethod("GET");
urlCon.setRequestProperty("Accept-Charset", "UTF-8");
Reading the response from the URL connection through an InputStreamReader, passing UTF-8 as the encoding:
BufferedReader br = new BufferedReader(new InputStreamReader(((URLConnection) urlCon).getInputStream(), "UTF-8"));
/* Reading the response line by line */
StringBuffer responseString = new StringBuffer();
String nextLine = null;
while ((nextLine = br.readLine()) != null) {
responseString.append(nextLine);
}
// if response is null or empty, throw exception
String response = responseString.toString();
Parsing the JSON received through the GSON parser:
JsonElement jelement = new JsonParser().parse(response);
JsonObject jobject = jelement.getAsJsonObject();
jobject = jobject.getAsJsonObject("data");
JsonArray jarray = jobject.getAsJsonArray("translations");
jobject = jarray.get(0).getAsJsonObject();
String result = jobject.get("translatedText").toString();
Writing the translated value to a new property file through a FileOutputStream:
FileOutputStream foutStream = new FileOutputStream(outFile);
foutStream.write(key.getBytes());
foutStream.write("=".getBytes());
foutStream.write(transByte.getBytes());
foutStream.write("\n".getBytes());
The issue is that garbled text (?????) is written to the new property file for the Arabic language.
When you call transByte.getBytes(), the Arabic translation is encoded with your platform default encoding, which will only handle Arabic if your machine is configured for UTF-8 or Arabic. Otherwise, characters will be replaced by '�' or '?' .
Create a new Properties instance, and populate it using setProperty() calls. Then when you store it, the proper escaping will be applied to your Arabic text, which is necessary because property files are encoded with ISO-8859-1 (an encoding for Western Latin characters).
Alternatively, you can store the Properties using a Writer instance that is configured with whatever encoding you choose, but the encoding isn't stored in the file itself, so you will need meta-data or a convention to set the correct encoding when reading the file again.
Finally, you can store the Properties in an XML format, which will use UTF-8 by default, or you can specify another encoding. The file itself will specify the encoding, so it's easier to use an optimal encoding for each language.
Trying to emit a file format using custom string concatenation, as you are doing, is an oft-repeated recipe for disaster. Whether it's XML, JSON, or a simple properties file, it's far too easy to overlook special cases that require escape sequences, etc. Use a library designed to emit the format instead.
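A minimal sketch of the Properties approach (key and outFile are the variables from the question, result is the translated string parsed at the GSON step; everything else is standard java.util.Properties API):
import java.io.FileOutputStream;
import java.util.Properties;

Properties props = new Properties();
props.setProperty(key, result); // escaping is handled for you

// Classic .properties format: store(OutputStream, ...) writes ISO-8859-1
// and escapes everything else as \uXXXX, so the Arabic text survives.
try (FileOutputStream out = new FileOutputStream(outFile)) {
    props.store(out, "generated translations");
}

// Or the XML format, which defaults to UTF-8 and records its own encoding:
try (FileOutputStream out = new FileOutputStream("messages_ar.xml")) {
    props.storeToXML(out, "generated translations", "UTF-8");
}
Loading with Properties.load() (or loadFromXML()) reverses the escaping, so consumers never need to guess the encoding.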

Encoding for unicode and & characters

I am trying to save the below string to my protobuf model:
STOXX®Europe 600 Food&BevNR ETF
But while printing the protomodel value it's displayed like:
STOXXÂ®Europe 600 Food&amp;BevNR ETF
I tried to encode the string to UTF-8 and also tried StringEscapeUtils.unescapeJava(str), but both failed. I'm getting this string by parsing an XML response from a server. Any ideas?
Ref: XML parser Skip invalid xml element with XmlStreamReader
Correcting the XML parsing is better than unescaping everything afterwards. The test case below shows this:
public static void main(String[] args) throws Exception {
    XMLInputFactory factory = XMLInputFactory.newInstance();
    factory.setProperty("javax.xml.stream.isCoalescing", true);
    ReaderInputStream ris = new ReaderInputStream(
            new StringReader("<tag>STOXX®Europe 600 Food&amp;BevNR ETF</tag>"),
            StandardCharsets.UTF_8);
    XMLStreamReader reader = factory.createXMLStreamReader(ris, "UTF-8");
    StringBuilder sb = new StringBuilder();
    while (reader.hasNext()) {
        reader.next();
        if (reader.hasText())
            sb.append(reader.getText());
    }
    System.out.println(sb);
}
Output:
STOXX®Europe 600 Food&BevNR ETF
Actually, I had a protobuf method available to solve this issue:
ByteString.copyFrom(StringEscapeUtils.unescapeHtml3(string), "ISO-8859-1").toStringUtf8();
Documentation of ByteString
As the text comes from XML, use:
s = StringEscapeUtils.unescapeXml(s);
This is much better than unescaping HTML, which has hundreds of named entities (&...;).
The two rubbish characters that appear instead of the ® symbol come from reading UTF-8 encoded text (multi-byte for special characters) as a single-byte encoding, maybe Latin-1.
This wrong conversion can sometimes be repaired by another conversion, but it is best to read using the UTF-8 encoding in the first place.
// Hack, just patching. Assumes Latin-1 encoding
s = new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
// Or maybe:
s = new String(s.getBytes(), StandardCharsets.UTF_8);
Better: inspect the reading code and look whether an optional encoding argument went missing: InputStreamReader, OutputStreamWriter, new String, getBytes.
Your entire problem would be solved by using an XML reader, too.

Want to throw exception when encounter special UTF-8 characters in an XML file

I am parsing an XML file which has UTF-8 encoding.
<?xml version="1.0" encoding="UTF-8"?>
Now, our business application has a set of components which are developed by different teams and do not use the same libraries for parsing XML. My component uses JAXB while some other component uses SAX, and so forth. When the XML file has special characters like "ä", "ë", or "é" (characters with umlauts or accents), JAXB parses them properly, but the other components (sub-apps) cannot and throw exceptions.
Due to business needs I cannot change the programming of the other components, but I have to add a restriction/validation to my application to make sure that the XML (data-load) file does not contain any such characters.
What is the best approach to make sure the file does not contain the above-mentioned (or similar) characters, so that I can throw an exception (or give an error) right there, before I start parsing the XML file using JAXB?
If your customer sends you an XML file with a header whose declared encoding does not match the file contents, you might as well give up trying to do anything meaningful with that file. Are they really sending data where the header does not match the actual encoding? That's not XML, then. And you ought to charge them more ;-)
Simply read the file as a FileInputStream, byte by byte. If it contains a negative byte value, refuse to process it.
You can keep encoding settings like UTF-8 or ISO 8859-1, because they all have US-ASCII as a proper subset.
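A minimal sketch of that check (a hypothetical helper; note that InputStream.read() returns 0-255, so "negative byte" translates to a value of 0x80 or above):
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

static void requireAsciiOnly(String path) throws IOException {
    try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
        int b;
        long pos = 0;
        while ((b = in.read()) != -1) {
            // Any byte >= 0x80 is outside US-ASCII, i.e. part of an
            // accented character in UTF-8, ISO-8859-1, and friends.
            if (b >= 0x80) {
                throw new IOException("Non-ASCII byte 0x"
                        + Integer.toHexString(b) + " at offset " + pos);
            }
            pos++;
        }
    }
}
This rejects the file before any XML parser sees it, regardless of which encoding the prolog declares.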
Yes, my answer would be the same as what laune mentions above...
static boolean readInput() {
    boolean isValid = true;
    StringBuffer buffer = new StringBuffer();
    try {
        FileInputStream fis = new FileInputStream("test.txt");
        InputStreamReader isr = new InputStreamReader(fis, StandardCharsets.UTF_8);
        Reader in = new BufferedReader(isr);
        int ch;
        while ((ch = in.read()) > -1) {
            buffer.append((char) ch);
            System.out.println("ch=" + ch);
            // TODO - check the range for each character according to
            // the Wikipedia table: http://en.wikipedia.org/wiki/UTF-8
            // if a character is not in the valid range, set isValid = false
            // and you can break here...
        }
        in.close();
        return isValid;
    } catch (IOException e) {
        e.printStackTrace();
        return false;
    }
}
I'm just adding a code snippet...
You should be able to wrap the XML input in a java.io.Reader in which you specify the actual encoding, and then process that normally. Java will use the encoding declared in the XML prolog when it is given an InputStream, but when a Reader is used, the encoding of the Reader takes precedence.
Unmarshaller unmarshaller = jc.createUnmarshaller();
InputStream inputStream = new FileInputStream("input.xml");
Reader reader = new InputStreamReader(inputStream, "UTF-16");
try {
    Address address = (Address) unmarshaller.unmarshal(reader);
} finally {
    reader.close();
}
