Given the following implementation, I face the problem that, on another system, the XML file is missing the umlauts (ä, ü, ö) compared to the original XML file. Instead of the umlauts, the replacement character (0xEF 0xBF 0xBD) is inserted in the XML file.
Get a zip file containing an XML file with umlauts
Decompress the zip file
Encode the XML content to a Base64 payload and save it to the DB
Query the entity
Get the Base64 payload
Decode the Base64 content
The decoded Base64 content is an XML file which should contain the original umlauts
What's driving me crazy is the fact that the decoded Base64 content is missing the umlauts on another system. Instead of the umlauts I get the replacement character. On my system the same implementation works without the replacement.
The following code is just an MCVE to explain the problem. It works fine on my system, but on another system (Windows Server 2013) the umlauts are missing after decoding.
String requestUrl = "https://myserver/mypath/Message_166741.zip";
HttpGet httpget = new HttpGet(requestUrl);
HttpResponse response = httpClient.execute(httpget);
HttpEntity entity = response.getEntity();
InputStream inputStream = entity.getContent();
byte[] decompressedInputStream = decompress(inputStream);
String content = new String(decompressedInputStream, StandardCharsets.UTF_8);
String originFileName = new SimpleDateFormat("yyyyMMddHHmm'_origin.xml'").format(new Date());
String originFileNameWithPath = String.format("C:\\temp\\Tests\\%1$s", originFileName);
// File contains the expected umlauts
FileUtils.writeStringToFile(new File(originFileNameWithPath), content);
String payloadUTF8 = Base64.encodeBase64String(ZipUtils.compress(content.getBytes("UTF-8")));
String payload = Base64.encodeBase64String(ZipUtils.compress(content.getBytes()));
String payloadJavaBase64 = new String(java.util.Base64.getEncoder().encode(ZipUtils.compress(content.getBytes())));
String xmlMessageJavaBase64;
byte[] compressedBinaryJavaBase64 = java.util.Base64.getDecoder().decode(payloadJavaBase64);
byte[] decompressedBinaryJavaBase64= ZipUtils.decompress(compressedBinaryJavaBase64);
xmlMessageJavaBase64 = new String(decompressedBinaryJavaBase64, "UTF-8");
String xmlMessageUTF8;
byte[] compressedBinaryUTF8 = java.util.Base64.getDecoder().decode(payloadUTF8);
byte[] decompressedBinaryUTF8 = ZipUtils.decompress(compressedBinaryUTF8);
xmlMessageUTF8 = new String(decompressedBinaryUTF8, "UTF-8");
String xmlMessage;
byte[] compressedBinary = java.util.Base64.getDecoder().decode(payload);
byte[] decompressedBinary = ZipUtils.decompress(compressedBinary);
xmlMessage = new String(decompressedBinary, "UTF-8");
String processedFileName = new SimpleDateFormat("yyyyMMddHHmm'_processed.xml'").format(new Date());
String processedFileNameUTF8 = new SimpleDateFormat("yyyyMMddHHmm'_processedUTF8.xml'").format(new Date());
String processedFileNameJavaBase64 = new SimpleDateFormat("yyyyMMddHHmm'_processedJavaBase64.xml'").format(new Date());
// These files do not contain the umlauts anymore.
// Instead of the umlauts a replacement character is inserted (0xEF 0xBF 0xBD (efbfbd))
String processedFileNameWithPath = String.format("C:\\temp\\Tests\\%1$s", processedFileName);
String processedFileNameWithPathUTF8 = String.format("C:\\temp\\Tests\\%1$s", processedFileNameUTF8);
String processedFileNameWithPathJavaBase64 = String.format("C:\\temp\\Tests\\%1$s", processedFileNameJavaBase64);
FileUtils.writeStringToFile(new File(processedFileNameWithPath), xmlMessage);
FileUtils.writeStringToFile(new File(processedFileNameWithPathUTF8), xmlMessageUTF8);
FileUtils.writeStringToFile(new File(processedFileNameWithPathJavaBase64), xmlMessageJavaBase64);
The three files are just for testing purposes, but I hope they illustrate the problem.
Edit
Both ways create an XML file with ü, ö, ä on my machine.
Only the WITHOUT implementation creates an XML file with ü, ö, ä on another system. The "content" string of the WITH UTF-8 variant contains the replacement character (�) for ü.
// WITHOUT UTF-8 IN BYTE[] => STRING CTOR
byte[] dci = decompress(inputStream);
content = new String(dci);
byte[] compressedBinary = java.util.Base64.getDecoder().decode(content);
byte[] decompressedBinary = ZipUtils.decompress(compressedBinary);
String xml = new String(decompressedBinary);
// WITH UTF-8 IN BYTE[] => STRING CTOR
byte[] dci = decompress(inputStream);
content = new String(dci, StandardCharsets.UTF_8);
byte[] compressedBinary = java.util.Base64.getDecoder().decode(content);
byte[] decompressedBinary = ZipUtils.decompress(compressedBinary);
String xml = new String(decompressedBinary, "UTF-8");
Edit #2
There also seems to be a difference between running the code inside IntelliJ and outside of IntelliJ on my machine. I did not know that this makes such a huge difference. So, if I run the code outside of IntelliJ (java.exe -jar myjarfile), the WITH UTF-8 part replaces the Ü with ... I don't know what. Notepad++ shows xFC. Funny: my Raspberry Pi shows both files with Ü where Notepad++ on my Windows machine shows xFC.
That whole thing confuses me and I would like to understand what the problem is, especially because the XML file declares UTF-8 as its encoding in the header.
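The following small check (not part of the MCVE) prints the JVM default charset, which can be compared between the two environments:

import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    public static void main(String[] args) {
        // IntelliJ often starts the JVM with -Dfile.encoding=UTF-8, while a
        // plain java.exe on Windows uses the ANSI code page (e.g. windows-1252),
        // which would explain the different results.
        System.out.println(Charset.defaultCharset());
        System.out.println(System.getProperty("file.encoding"));
    }
}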
Edit #3 Final Solution
// ## SERVER
// Get ZIP from request URL
HttpGet httpget = new HttpGet(requestUrl);
HttpResponse response = httpClient.execute(httpget);
HttpEntity entity = response.getEntity();
InputStream inputStream = entity.getContent();
byte[] decompressedInputStream = decompress(inputStream);
// Produces an XML string which SHOULD contain ü, ö, ä
String xmlOfZipFileContent = new String(decompressedInputStream, StandardCharsets.UTF_8);
// Just for testing write to file
String xmlOfZipFileSavePath = String.format("C:\\temp\\Tests\\%1$s", new SimpleDateFormat("yyyyMMddHHmm'_original.xml'").format(new Date()));
FileUtils.writeStringToFile(new File(xmlOfZipFileSavePath), xmlOfZipFileContent, StandardCharsets.UTF_8);
// The payload gets stored into the DB
String payload = java.util.Base64.getEncoder().encodeToString(ZipUtils.compress(xmlOfZipFileContent.getBytes(StandardCharsets.UTF_8)));
// Store payload to db
// Client queries database and gets the payload
// payload = dbEntity.get().payload
// The following three lines is on client
byte[] compressedBinaryPayload = java.util.Base64.getDecoder().decode(payload);
byte[] decompressedBinaryPayload = ZipUtils.decompress(compressedBinaryPayload);
String xmlMessageOutOfPayload = new String(decompressedBinaryPayload, StandardCharsets.UTF_8);
String xmlOfPayloadSavePath = String.format("C:\\temp\\Tests\\%1$s", new SimpleDateFormat("yyyyMMddHHmm'_payload.xml'").format(new Date()));
FileUtils.writeStringToFile(new File(xmlOfPayloadSavePath), xmlMessageOutOfPayload, StandardCharsets.UTF_8);
If I understood correctly, your situation seems to be the following:
// Decompress data from the server, it's in ISO-8859-1 or similar 1 byte encoding
byte[] dci = decompress(inputStream);
// Data gets corrupted because of wrong charset
// This is where ü gets converted to unicode replacement character
content = new String(dci, StandardCharsets.UTF_8);
The rest of the code uses UTF-8 explicitly, but it doesn't matter, as the data has already been corrupted at this point. In the end you expect a UTF-8 encoded file.
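To make the corruption concrete, here is a minimal sketch (the byte value is an example): decoding an ISO-8859-1 ü byte as UTF-8 yields exactly the replacement character from your files.

import java.nio.charset.StandardCharsets;

public class ReplacementCharDemo {
    public static void main(String[] args) {
        // 0xFC is 'ü' in ISO-8859-1, but it is not a valid byte sequence in UTF-8.
        byte[] isoBytes = {(byte) 0xFC};

        // Wrong charset: the decoder substitutes U+FFFD, which becomes
        // 0xEF 0xBF 0xBD when written back out as UTF-8.
        System.out.println(new String(isoBytes, StandardCharsets.UTF_8));      // �

        // Correct charset: the umlaut survives.
        System.out.println(new String(isoBytes, StandardCharsets.ISO_8859_1)); // ü
    }
}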
Also because the XML file declares UTF-8 as its encoding in the header.
That doesn't prove anything. If you treat it as just a text file, you can write it out in as many encodings as you want, and it would still claim to be UTF-8.
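For example, this sketch (file name is arbitrary) writes a file whose declaration claims UTF-8 while the bytes on disk are ISO-8859-1:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class MisleadingHeaderDemo {
    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root>ü</root>";

        // ü is written as the single byte 0xFC instead of the UTF-8 pair 0xC3 0xBC;
        // any parser that trusts the declaration will misread the file.
        Files.write(Paths.get("declared-utf8.xml"), xml.getBytes(StandardCharsets.ISO_8859_1));
    }
}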
InputStream inputStream = entity.getContent();
byte[] decompressedInputStream = decompress(inputStream);
Fine, and it is assumed that the bytes are in UTF-8, as:
String content = new String(decompressedInputStream, StandardCharsets.UTF_8);
Should the bytes not be in UTF-8, you could try Windows Latin-1:
Charset.forName("Windows-1252")
Otherwise decompressedInputStream can be used wherever content is converted to bytes in UTF-8.
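If you are unsure which case you have, here is a sketch of that fallback (the helper name decodeLenient is mine): try strict UTF-8 first and only fall back to Windows-1252 when the bytes are not valid UTF-8.

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class DecodeWithFallback {
    static String decodeLenient(byte[] bytes) {
        CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // Strict decode: throws instead of silently inserting U+FFFD.
            return utf8.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return new String(bytes, Charset.forName("windows-1252"));
        }
    }

    public static void main(String[] args) {
        byte[] umlaut = {(byte) 0xFC}; // 'ü' in Windows-1252, invalid as UTF-8
        System.out.println(decodeLenient(umlaut)); // ü
    }
}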
...
The FileUtils.writeStringToFile without encoding specified uses the default platform encoding.
// File contains the expected umlauts
//FileUtils.writeStringToFile(new File(originFileNameWithPath), content);
It is better to ensure that UTF-8 is written: either add the encoding to convert the Unicode String to bytes in UTF-8, or simply write the original bytes:
Files.write(Paths.get(originFileNameWithPath), decompressedInputStream);
Also the Base64 encoded UTF-8 bytes of the String should be used:
String payloadUTF8 = Base64.encodeBase64String(ZipUtils.compress(
content.getBytes(StandardCharsets.UTF_8)));
String payloadJavaBase64 = new String(java.util.Base64.getEncoder().encode(
        ZipUtils.compress(content.getBytes(StandardCharsets.UTF_8))));
The standard Java SE Base64 will do; though do not use its decodeString and encodeString, as those use ISO-8859-1 (Latin-1).
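A minimal round trip with java.util.Base64 and explicit UTF-8 on both sides (the ZipUtils compression step is left out here):

import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64RoundTrip {
    public static void main(String[] args) {
        String xml = "<root>ä ö ü</root>";

        // String -> UTF-8 bytes -> Base64 text ...
        String payload = Base64.getEncoder().encodeToString(xml.getBytes(StandardCharsets.UTF_8));

        // ... and back: Base64 text -> bytes -> String, again via UTF-8.
        String back = new String(Base64.getDecoder().decode(payload), StandardCharsets.UTF_8);

        System.out.println(back.equals(xml)); // true, the umlauts survive
    }
}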
Related
I have a file in ISO-8859-1 containing German umlauts, and I need to unmarshal it using JAXB. But first I need the content in UTF-8.
@Override
public List<Usage> convert(InputStream input) {
    try {
        InputStream inputWithNamespace = addNamespaceIfMissing(input);
        inputWithNamespace = convertFileToUtf(inputWithNamespace);
        ORDR order = xmlUnmarshaller.unmarshall(inputWithNamespace, ORDR.class);
...
I get the "file" as an InputStream. My idea was to read the file's content in UTF-8 and make another InputStream to use. This is what I've tried:
private InputStream convertFileToUtf(InputStream inputStream) throws IOException {
    byte[] bytesInIso = ByteStreams.toByteArray(inputStream);
    String stringIso = new String(bytesInIso);
    byte[] bytesInUtf = new String(bytesInIso, ISO_8859_1).getBytes(UTF_8);
    String stringUtf = new String(bytesInUtf);
    return new ByteArrayInputStream(bytesInUtf);
}
I have those two Strings to check the contents, but even just reading the ISO file gives question marks (?) where the umlauts are, and converting that to UTF-8 gives strange characters like ½ and so on.
UPDATE
byte[] bytesInIso = ByteStreams.toByteArray(inputWithNamespace);
String contentInIso = new String(bytesInIso);
byte[] bytesInUtf = new String(bytesInIso, ISO_8859_1).getBytes(UTF_8);
String contentInUtf = new String(bytesInUtf);
Verifying contentInIso prints question marks instead of the umlauts, and contentInUtf has characters like "�" instead of the umlauts.
@Override
public List<Usage> convert(InputStream input) {
    try {
        InputStream inputWithNamespace = addNamespaceIfMissing(input);
        byte[] bytesInIso = ByteStreams.toByteArray(inputWithNamespace);
        String contentInIso = new String(bytesInIso);
        byte[] bytesInUtf = new String(bytesInIso, ISO_8859_1).getBytes(UTF_8);
        String contentInUtf = new String(bytesInUtf);
        ORDR order = xmlUnmarshaller.unmarshall(inputWithNamespace, ORDR.class);
This convert method is called by another one named processUsageFile:
private void processUsageFile(File usageFile) {
    try (FileInputStream fileInputStream = new FileInputStream(usageFile)) {
        usageImporterService.importUsages(usageFile.getName(), fileInputStream, getUsageTypeValidated(usageFile.getName()));
        log.info("Usage file {} imported successfully. Moving to archive directory", usageFile.getName());
If I take the code I have written under the UPDATE statement and put it immediately after the try, the first contentInIso has question marks but contentInUtf has the umlauts. Then, going into convert, JAXB throws an exception that the file has a premature end of line.
Regarding the behaviour you are getting,
String stringIso = new String(bytesInIso);
In this step, you construct a new String by decoding the specified array of bytes using the platform's default charset.
Since this is probably not ISO_8859_1, I think the String you are looking at becomes garbled here.
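A sketch of the conversion with the charsets made explicit (assuming Guava's ByteStreams, as in your code). Returning a fresh ByteArrayInputStream also means the unmarshaller gets an unconsumed stream; passing the already-read inputWithNamespace to unmarshall() would explain the "premature end" error from the update.

import com.google.common.io.ByteStreams;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class Utf8Conversion {
    static InputStream convertFileToUtf(InputStream inputStream) throws IOException {
        // Decode with the charset the file is actually in ...
        byte[] bytesInIso = ByteStreams.toByteArray(inputStream);
        String content = new String(bytesInIso, StandardCharsets.ISO_8859_1);
        // ... and hand back a new stream of UTF-8 bytes.
        return new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8));
    }
}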
I am trying, in Java 8, to write a file in "windows-1256" encoding.
My last try was:
String win1256 = "...";
//
File file = new File("C:\\file1.txt");
OutputStreamWriter os = new OutputStreamWriter(new FileOutputStream(file), "windows-1256");
os.write(win1256);
or:
FileOutputStream out = new FileOutputStream("C:\\file1.txt");
out.write(win1256.getBytes("windows-1256"));
However, it didn't work; the output file was unreadable, showing "???...".
The change would be:
byte[] originalBytes; // Here the sequence of bytes representing the UTF-8 encoded string
Encoding enc = Encoding.GetEncoding("windows-1256");
byte[] newBytes = enc.GetBytes(Encoding.UTF8.GetString(originalBytes));
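That snippet uses the .NET Encoding API; a rough Java 8 equivalent might look like this (the Arabic sample string is mine):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Windows1256Demo {
    public static void main(String[] args) throws Exception {
        // Hypothetical input: the UTF-8 bytes of an Arabic string.
        byte[] originalBytes = "\u0645\u0631\u062D\u0628\u0627".getBytes(StandardCharsets.UTF_8);

        // Decode with the charset the bytes are really in (UTF-8),
        // then re-encode as windows-1256 before writing.
        String text = new String(originalBytes, StandardCharsets.UTF_8);
        byte[] newBytes = text.getBytes(Charset.forName("windows-1256"));

        // Note the escaped backslash; "C:\file1.txt" would contain a form feed.
        Files.write(Paths.get("C:\\file1.txt"), newBytes);
    }
}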
I am using the Google Translate API to generate an Arabic property file from an English property file.
I make a URL connection and issue a GET request to the URL, passing the original language, the translation language, and the value to be translated:
URLConnection urlCon = null;
String urlStr = "https://www.googleapis.com/language/translate/v2";
URL url = new URL(urlStr + "?key=" + apikey + "&source=" + origlang + "&target=" + translateToLang + "&q=" + value);
urlCon = url.openConnection();
urlCon.setConnectTimeout(1000 * 60 * 5);
urlCon.setReadTimeout(1000 * 60 * 5);
urlCon.setDoInput(true);
urlCon.setDoOutput(true);
urlCon.setUseCaches(false);
((HttpURLConnection) urlCon).setRequestMethod("GET");
urlCon.setRequestProperty("Accept-Charset", "UTF-8");
Reading the response from the URL connection through an InputStreamReader, passing UTF-8 as the encoding parameter:
BufferedReader br = new BufferedReader(new InputStreamReader(((URLConnection) urlCon).getInputStream(), "UTF-8"));
/* Reading the response line by line */
StringBuffer responseString = new StringBuffer();
String nextLine = null;
while ((nextLine = br.readLine()) != null) {
responseString.append(nextLine);
}
// if response is null or empty, throw exception
String response = responseString.toString();
Parsing the JSON received through the GSON parser:
JsonElement jelement = new JsonParser().parse(response);
JsonObject jobject = jelement.getAsJsonObject();
jobject = jobject.getAsJsonObject("data");
JsonArray jarray = jobject.getAsJsonArray("translations");
jobject = jarray.get(0).getAsJsonObject();
String result = jobject.get("translatedText").toString();
Writing the translated value to a new property file through a FileOutputStream:
FileOutputStream foutStream = new FileOutputStream(outFile);
foutStream.write(key.getBytes());
foutStream.write("=".getBytes());
foutStream.write(transByte.getBytes());
foutStream.write("\n".getBytes());
The issue is that I am getting garbled text (?????) written in the new property file for the Arabic language.
When you call transByte.getBytes(), the Arabic translation is encoded with your platform default encoding, which will only handle Arabic if your machine is configured for UTF-8 or Arabic. Otherwise, characters will be replaced by '�' or '?'.
Create a new Properties instance, and populate it using setProperty() calls. Then when you store it, the proper escaping will be applied to your Arabic text, which is necessary because property files are encoded with ISO-8859-1 (an encoding for Western Latin characters).
Alternatively, you can store the Properties using a Writer instance that is configured with whatever encoding you choose, but the encoding isn't stored in the file itself, so you will need meta-data or a convention to set the correct encoding when reading the file again.
Finally, you can store the Properties in an XML format, which will use UTF-8 by default, or you can specify another encoding. The file itself will specify the encoding, so it's easier to use an optimal encoding for each language.
Trying to emit a file format using custom string concatenation, as you are doing, is an oft-repeated recipe for disaster. Whether it's XML, JSON, or a simple properties file, it's far too easy to overlook special cases that require escape sequences, etc. Use a library designed to emit the format instead.
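A sketch of the Properties approach (key and value are made up):

import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.Properties;

public class ArabicProperties {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("greeting", "\u0645\u0631\u062D\u0628\u0627");

        // store() escapes non-Latin-1 characters as \uXXXX, so the file
        // stays correct no matter what the platform encoding is.
        try (OutputStream out = new FileOutputStream("messages_ar.properties")) {
            props.store(out, "generated translations");
        }

        // The XML format records its own encoding (UTF-8 by default).
        try (OutputStream out = new FileOutputStream("messages_ar.xml")) {
            props.storeToXML(out, "generated translations");
        }
    }
}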
I have a Base64 encoded String which looks like this:
UEsDBBQABgAIAAAAIQDhD46/jQEAACkGAAATAAgCW0NvbnRlbnRfVHlwZXNdLnhtbCCiBAIooAACAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA .
I decoded the String and wrote it to a Word file using FileWriter. But when I tried opening the doc file, I got an error saying the data is corrupt.
I would like to know what steps I need to follow to write the content to a Word document after decoding the data. Below is the code that went wrong:
byte[] encodedBytes = stringBase64.getBytes();
byte[] decodedBytes = Base64.decodeBase64(encodedBytes);
String decodeString = new String(decodedBytes);
FileWriter fw = new java.io.FileWriter("F:\\xxx.docx");
BufferedWriter bw = new BufferedWriter(fw);
bw.write(decodeString);
The decoded data isn't plain text data - it's just binary data. So write it with a FileOutputStream, not a FileWriter:
// If your Base64 class doesn't have a decode method taking a string,
// find a better one!
byte[] decodedBytes = Base64.decodeBase64(stringBase64);
// Note the try-with-resources block here, to close the stream automatically
try (OutputStream stream = new FileOutputStream("F:\\xxx.doc")) {
stream.write(decodedBytes);
}
Or even better:
byte[] decodedBytes = Base64.decodeBase64(stringBase64);
Files.write(Paths.get("F:\\xxx.doc"), decodedBytes);
Please have a look.
byte[] encodedBytes = /* your encoded bytes */;
// Decode data on the other side by processing the encoded data
byte[] decodedBytes = Base64.decodeBase64(encodedBytes);
String yourValue = new String(decodedBytes);
System.out.println("Decoded String is " + yourValue);
Now you can write this String into a file and process it further.
I'm trying to bind an XML file (as a byte[]) to a Java object. This is my code:
public void inputConfigXML(String xmlfile, byte[] xmlData) {
    IBindingFactory bFact = BindingDirectory.getFactory(GroupsDTO.class);
    IUnmarshallingContext uctx = bFact.createUnmarshallingContext();
    groups = (GroupsDTO) uctx.unmarshalDocument(new ByteArrayInputStream(xmlData), "UTF8");
}
The unmarshalDocument() call is giving me the exception below. What do I do?
FYI: Running as JUnit test case
The following is the stacktrace -
Error parsing document (line 1, col 1)
org.xmlpull.v1.XmlPullParserException: only whitespace content allowed before start tag and not \u0 (position: START_DOCUMENT seen \u0... #1:1)
at org.xmlpull.mxp1.MXParser.parseProlog(MXParser.java:1519)
at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1395)
at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)
at org.jibx.runtime.impl.XMLPullReaderFactory$XMLPullReader.next(XMLPullReaderFactory.java:291)
at org.jibx.runtime.impl.UnmarshallingContext.toStart(UnmarshallingContext.java:451)
at org.jibx.runtime.impl.UnmarshallingContext.unmarshalElement(UnmarshallingContext.java:2755)
at org.jibx.runtime.impl.UnmarshallingContext.unmarshalDocument(UnmarshallingContext.java:2905)
at abc.dra.DRAAPI.inputConfigXML(DRAAPI.java:31)
at abc.dra.XMLToObject_Test.test(XMLToObject_Test.java:34)
[...]
This is my code that forms the byte[]:
void test() {
    String xmlfile = "output.xml";
    File file = new File(xmlfile);
    byte[] xmlData = new byte[(int) file.length()];
    dra.inputConfigXML(xmlfile, xmlData);
}
The ByteArrayInputStream is empty:
only whitespace content allowed before start tag and not \u0
(position: START_DOCUMENT seen \u0... #1:1)
means that a \u0 byte was found as the first character within the XML.
Ensure you have content within your byte[] and that the UTF-8 doesn't start with a BOM.
I don't think the BOM is your problem here, but I have often encountered problems regarding BOMs and Java.
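If you do need to guard against it, here is a sketch for dropping a leading UTF-8 BOM (0xEF 0xBB 0xBF):

import java.util.Arrays;

class BomStripper {
    static byte[] stripUtf8Bom(byte[] data) {
        // Remove the three BOM bytes if present; some XML parsers choke on them.
        if (data.length >= 3
                && data[0] == (byte) 0xEF
                && data[1] == (byte) 0xBB
                && data[2] == (byte) 0xBF) {
            return Arrays.copyOfRange(data, 3, data.length);
        }
        return data;
    }
}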
Update
You don't fill the byte[]. You have to read the file content into the byte[]:
Read this: File to byte[] in Java
By the way: byte[] xmlData = new byte[(int) file.length()]; is bad code style, because you will run into problems with larger XML files. If they are larger than Integer.MAX_VALUE, you will read a corrupt file.
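A minimal sketch of that fix with java.nio, which actually fills the array:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadXmlBytes {
    public static void main(String[] args) throws IOException {
        // Opens, reads and closes the file; no manual length handling.
        byte[] xmlData = Files.readAllBytes(Paths.get("output.xml"));
        System.out.println(xmlData.length + " bytes read");
    }
}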
Hari,
JiBX needs characters as input. I think you have specified your encoding incorrectly. Try this code instead:
FileInputStream fis = new FileInputStream("output.xml");
InputStreamReader isr = new InputStreamReader(fis, "UTF8");
groups = (GroupsDTO) uctx.unmarshalDocument(isr);
If you must use the code you have written, I would try outputting the text to the console (System.out.println(...)) to make sure you are decoding the UTF-8 correctly.
Don
Go to the mvn repository path and delete that folder for the XML file.