Encoding for unicode and & characters

I am trying to save the string below to my protobuf model:
STOXX®Europe 600 Food&BevNR ETF
But when I print the value from the protobuf model, it is displayed like this:
STOXXÂ®Europe 600 Food&BevNR ETF
I tried encoding the string as UTF-8 and also tried StringEscapeUtils.unescapeJava(str), but neither worked. I get this string by parsing the XML response from a server. Any ideas?
Ref: XML parser Skip invalid xml element with XmlStreamReader

Fixing the XML parsing is better than having to unescape everything afterwards. Here is a test case showing this:
public static void main(String[] args) throws Exception {
    XMLInputFactory factory = XMLInputFactory.newInstance();
    factory.setProperty("javax.xml.stream.isCoalescing", true);
    ReaderInputStream ris = new ReaderInputStream(new StringReader("<tag>STOXXÂ®Europe 600 Food&BevNR ETF</tag>"));
    XMLStreamReader reader = factory.createXMLStreamReader(ris, "UTF-8");
    StringBuilder sb = new StringBuilder();
    while (reader.hasNext()) {
        reader.next();
        if (reader.hasText())
            sb.append(reader.getText());
    }
    System.out.println(sb);
}
Output:
STOXX®Europe 600 Food&BevNR ETF

Actually, I found a protobuf method that solves this issue:
ByteString.copyFrom(StringEscapeUtils.unescapeHtml3(string), "ISO-8859-1").toStringUtf8();
Documentation of ByteString

As the text comes from XML, use:
s = StringEscapeUtils.unescapeXml(s);
This is much better than unescaping HTML, which has hundreds of named entities (&...;).
The two garbage characters in place of the registered trademark sign (®) come from reading UTF-8 encoded text (which uses multiple bytes for special characters) as a single-byte encoding, probably Latin-1.
This wrong conversion might be repaired with another conversion, but it is best to read the text as UTF-8 in the first place.
// Hack, just patching. Assumes Latin-1 encoding
s = new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
// Or maybe:
s = new String(s.getBytes(), StandardCharsets.UTF_8);
Better to inspect the reading code and check whether an optional encoding argument went missing in InputStreamReader, OutputStreamWriter, new String, or getBytes.
The whole problem would also be solved by using an XML reader.
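The mis-decoding described above can be reproduced in a few lines. This is only a sketch: the class and method names are made up for illustration, and it assumes the wrong charset really was ISO-8859-1, which the question does not confirm.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Decodes UTF-8 bytes with a single-byte charset, reproducing the garbling.
    static String misread(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses the wrong decoding. Assumption: ISO-8859-1 was the charset used by mistake.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "STOXX\u00AEEurope";   // U+00AE is the registered sign
        String garbled = misread(original);      // the sign becomes U+00C2 followed by U+00AE
        System.out.println(garbled.length() - original.length()); // 1
        System.out.println(repair(garbled).equals(original));     // true
    }
}
```

The repair only round-trips because every byte value maps to a character in ISO-8859-1; with other single-byte charsets some bytes may be lost, which is one more reason to fix the reading code instead.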

Related

ZWNBSP appears when parsing CSV

I have a CSV file and I want to check that it contains all the data it should. But it looks like a ZWNBSP appears at the beginning of the first column name in the first line.
My simplified code is
@Test
void parseCsvTest() throws Exception {
    Configuration.holdBrowserOpen = true;
    ClassLoader classLoader = getClass().getClassLoader();
    try (
        InputStream inputStream = classLoader.getResourceAsStream("files/csv_example.csv");
        CSVReader reader = new CSVReader(new InputStreamReader(inputStream))
    ) {
        List<String[]> content = reader.readAll();
        var csvStrings0line = content.get(0);
        var csv1stElement = csvStrings0line[0];
        var csv1stElementShouldBe = "Timestamp";
        assertEquals(csv1stElementShouldBe, csv1stElement);
    }
}
My CSV contains
"Timestamp","Source","EventName","CountryId","Platform","AppVersion","DeviceType","OsVersion"
"2022-05-02T14:56:59.536987Z","courierapp","order_delivered_sent","643","ios","3.11.0","iPhone 11","15.4.1"
"2022-05-02T14:57:35.849328Z","courierapp","order_delivered_sent","643","ios","3.11.0","iPhone 8","15.3.1"
My test fails with
expected: <Timestamp> but was: <Timestamp>
Expected :Timestamp
Actual :Timestamp
<Click to see difference>
Clicking "see difference" shows that there is a ZWNBSP at the beginning of the actual text.
Copy-pasting my text into an online tool for displaying non-printable Unicode characters (https://www.soscisurvey.de/tools/view-chars.php) shows only CR LF at the ends of the lines, no ZWNBSPs.
So where does it come from?

It's a BOM (byte order mark) character. You can remove it yourself or use one of several other solutions (see https://stackoverflow.com/a/4897993/1420794, for instance).

That is the Unicode zero-width no-break space character (U+FEFF). When it appears at the beginning of a Unicode-encoded text file, it serves as a byte order mark. You read it to determine the encoding of the text file, and then you can safely discard it if you want. The best thing you can do is spread awareness.
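Discarding the mark by hand can be as small as the sketch below. The class and method names are hypothetical, and it assumes the CSV library returns the BOM as the first character of the first cell, as the failing assertion in the question suggests.

```java
final class BomStripper {
    private static final char BOM = '\uFEFF';

    // Returns the text without a leading U+FEFF; later occurrences are left alone.
    static String stripBom(String text) {
        if (text != null && !text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) {
        String firstCell = "\uFEFFTimestamp"; // what the reader hands back for column 1
        System.out.println(firstCell.equals("Timestamp"));           // false
        System.out.println(stripBom(firstCell).equals("Timestamp")); // true
    }
}
```

In the test from the question, this would be applied to csvStrings0line[0] before the comparison.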

Import csv issue with characters

When I import a CSV file that contains some countries, I have a problem with certain characters. They are not decoded correctly, and I get a ? mark instead of the character written in the CSV file.
Here are the countries that cause this problem: ÅLAND ISLANDS, SAINT BARTHÉLEMY, CÔTE D'IVOIRE, CURAÇAO.
Here is the code for importing the CSV file:
ICsvBeanReader beanReader = new CsvBeanReader(new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8),
new CsvPreference.Builder(CsvPreference.STANDARD_PREFERENCE).useQuoteMode(new AlwaysQuoteMode()).build());
At first I used FileReader, and there was a problem with all of these countries. Then I changed to InputStreamReader with the UTF-8 charset, and the problem was almost solved. With UTF-8, I only have a problem reading "ÅLAND ISLANDS"; the result is "?LAND ISLANDS".
I have also tried ISO_8859_1 and Windows-1252 as the charset, but the problem with "ÅLAND ISLANDS" remains.
Does anyone know which charset I should use to solve this problem?

Java's file readers do not handle the byte order mark (BOM). I suspect that is the issue.
Different versions handle it differently.
Wrap the input stream with the method below, which detects the encoding from the BOM. BOMInputStream is available in commons-io. If you don't have commons-io, you can copy the code from that library; it is around 10 to 20 lines. Hope that works.
public static InputStreamReader getInputStreamReader(InputStream inputStream) throws IOException {
    BOMInputStream bOMInputStream = new BOMInputStream(inputStream, false, ByteOrderMark.UTF_8,
            ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_16LE,
            ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE);
    ByteOrderMark bom = bOMInputStream.getBOM();
    String charsetName = bom == null ? "UTF-8" : bom.getCharsetName();
    return new InputStreamReader(bOMInputStream, charsetName);
}
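If commons-io is not available, the same idea can be sketched with the standard library alone. This is an illustration, not the commons-io implementation: the class name is made up, it works on a byte array that is already in memory, and it deliberately ignores UTF-32 (a UTF-32LE BOM starts with the same two bytes as the UTF-16LE one).

```java
import java.nio.charset.StandardCharsets;

final class BomDecoder {
    // Decodes bytes with the charset implied by a UTF-8 or UTF-16 BOM and drops the BOM.
    // Falls back to UTF-8 when no BOM is present. UTF-32 is not handled in this sketch.
    static String decode(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return new String(data, 3, data.length - 3, StandardCharsets.UTF_8);
        }
        if (startsWith(data, 0xFE, 0xFF)) {
            return new String(data, 2, data.length - 2, StandardCharsets.UTF_16BE);
        }
        if (startsWith(data, 0xFF, 0xFE)) {
            return new String(data, 2, data.length - 2, StandardCharsets.UTF_16LE);
        }
        return new String(data, StandardCharsets.UTF_8);
    }

    // Compares the leading bytes (as unsigned values) against the expected prefix.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] withBom = "\uFEFF\u00C5LAND ISLANDS".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(withBom).equals("\u00C5LAND ISLANDS")); // true
    }
}
```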

How to write accented characters from XML into MarkLogic using JavaApi?

I have a 20 MB XML file containing accented characters like Ö, É, Á, and many more. The problem is that when I insert the file into MarkLogic, these characters are saved in unaccented form, like O, E, A, but I want to store them as they are. How can I store the characters with their accents and read the XML back the same way? My XML file is ISO-8859-1 encoded.
The code I have written for writing and reading:
DatabaseClient client = DatabaseClientFactory.newClient(IP, PORT,
DATABASE_NAME, USERNAME, PWD, Authentication.DIGEST);
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLStreamReader streamReader = null;
streamReader = factory.createXMLStreamReader(new FileReader("record.xml"));
XMLDocumentManager xmlDocMgr = client.newXMLDocumentManager();
XMLStreamReaderHandle handle = new XMLStreamReaderHandle(streamReader);
xmlDocMgr.write("/" + filename, handle);
For reading XML:
XMLDocumentManager docMgr = client.newXMLDocumentManager();
DOMHandle xmlhandle = new DOMHandle();
docMgr.read("/" + filename, xmlhandle);
String doc = xmlhandle.toString();
String data = Normalizer.normalize(doc, Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", "");
return data;
I am returning the data to display in a browser.
I am not able to find where the problem is.

If the XML file does not have an XML prologue that declares its encoding, you should specify the ISO-8859-1 encoding when reading the file before writing the file to the database (as flafoux has pointed out).
You should also specify the encoding when reading the content from the database unless the destination accepts UTF-8 encoding.
For more information, see:
http://docs.marklogic.com/guide/java/document-operations#id_11208
Hoping that helps,

You need to specify the encoding (and switch to the overload that takes an InputStream):
XMLStreamReader streamReader = factory.createXMLStreamReader(new FileInputStream("record.xml"),"ISO-8859-1");
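It may also be worth checking the read path from the question, independently of the file encoding. The Normalizer expression used there decomposes each accented letter into a base letter plus a combining mark and then deletes everything outside ASCII, which on its own would turn Ö, É, Á into O, E, A. The sketch below (class name hypothetical) shows both effects side by side:

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

final class AccentCheck {
    // The same expression as in the question's read code.
    static String stripNonAscii(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // The three letters from the question as ISO-8859-1 bytes.
        byte[] latin1 = {(byte) 0xD6, (byte) 0xC9, (byte) 0xC1};
        String decoded = new String(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(decoded.equals("\u00D6\u00C9\u00C1")); // true: decoding keeps the accents
        System.out.println(stripNonAscii(decoded));               // OEA: the accents are removed here
    }
}
```

If the accents have to survive, that replaceAll step would need to go, whatever encoding is used for the write.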

Want to throw exception when encounter special UTF-8 characters in an XML file

I am parsing an XML file which has UTF-8 encoding.
<?xml version="1.0" encoding="UTF-8"?>
Our business application has a set of components which are developed by different teams and do not use the same libraries for parsing XML. My component uses JAXB, while another component uses SAX, and so forth. When the XML file has special characters like "ä", "ë" or "é" (characters with diacritics), JAXB parses it properly, but the other components (sub-apps) cannot parse them and throw an exception.
Due to business needs I cannot change the code of the other components, but I have to add a restriction/validation in my application to make sure that the XML (data-load) file does not contain any such characters.
What is the best approach to make sure the file does not contain the above-mentioned (or similar) characters, so that I can throw an exception (or report an error) before I start parsing the XML file with JAXB?

If your customer sends you an XML file with a header where the encoding does not match the file contents, you might as well give up trying to do anything meaningful with that file. Are they really sending data where the header does not match the actual encoding? That's not XML, then. And you ought to charge them more ;-)
Simply read the file as a FileInputStream, byte by byte. If it contains a negative byte value (that is, a byte of 0x80 or above), refuse to process it.
You can keep encoding settings like UTF-8 or ISO 8859-1, because they all have US-ASCII as a proper subset.
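A minimal sketch of that byte-level check follows; the class and method names are invented for illustration. Note that InputStream.read() returns values from 0 to 255, so the "negative byte" test becomes a comparison against 0x7F.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class AsciiOnlyCheck {
    // Returns the offset of the first byte above 0x7F, or -1 if the stream is pure US-ASCII.
    static long firstNonAsciiOffset(InputStream in) {
        try {
            long offset = 0;
            int b;
            while ((b = in.read()) != -1) {
                if (b > 0x7F) {
                    return offset;
                }
                offset++;
            }
            return -1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] xml = "<a>\u00E4</a>".getBytes(StandardCharsets.UTF_8);
        System.out.println(firstNonAsciiOffset(new ByteArrayInputStream(xml))); // 3
    }
}
```

For a real file, the stream would be a FileInputStream wrapped in a BufferedInputStream, and a result other than -1 would be turned into the exception the question asks for.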

Yes, my answer would be the same as what laune mentions.
static boolean readInput() {
    boolean isValid = true;
    StringBuffer buffer = new StringBuffer();
    try {
        FileInputStream fis = new FileInputStream("test.txt");
        InputStreamReader isr = new InputStreamReader(fis);
        Reader in = new BufferedReader(isr);
        int ch;
        while ((ch = in.read()) > -1) {
            buffer.append((char) ch);
            System.out.println("ch=" + ch);
            // TODO: check the range of each character
            // against the Wikipedia table http://en.wikipedia.org/wiki/UTF-8
            // to see whether it is a valid UTF-8 character;
            // if it is not in range, set isValid = false
            // and break here.
        }
        in.close();
        return isValid;
    } catch (IOException e) {
        e.printStackTrace();
        return false;
    }
}
I'm just adding a code snippet.

You should be able to wrap the XML input in a java.io.Reader in which you specify the actual encoding and then process that normally. Java will leverage the encoding specified in the XML for an InputStream, but when a Reader is used, the encoding of the Reader will be used.
Unmarshaller unmarshaller = jc.createUnmarshaller();
InputStream inputStream = new FileInputStream("input.xml");
Reader reader = new InputStreamReader(inputStream, "UTF-16");
try {
    Address address = (Address) unmarshaller.unmarshal(reader);
} finally {
    reader.close();
}

How to unescape html special characters in Java?

I have some text strings that I need to process and inside the strings there are HTML special characters. For example:
10😭😭😂😂😂😂😢😂10😭😭😂😂😂😂😢😂😂
I would like to convert those characters to utf-8.
I used org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4 but didn't have any luck. Is there an easy way to deal with this problem?

Apache commons-text library has the StringEscapeUtils class that has the unescapeHtml4() utility method.
String utf8Str = StringEscapeUtils.unescapeHtml4(htmlStr);
You may also need unescapeXml()

@Bohemian's code is correct; it works for me. Your un-encoded string is 10😭😭😂😂😂😂😢😂10😭😭😂😂😂😂😢😂😂.
I'm adding another answer instead of commenting on Bohemian's answer because there are two things that still need to be mentioned:
I copy-pasted your string into HTML code and the browser can't render your characters properly, because your string is incorrectly encoded, i.e. the high surrogate and the low surrogate of each surrogate pair have been escaped separately, instead of escaping the whole code point (it seems the original string was a UTF-16 encoded string, maybe a Java String?).
You want the string to be re-encoded to UTF-8.
Once your string has been un-encoded by StringEscapeUtils.unescapeHtml(htmlStr) (which succeeds despite the incorrect encoding), it doesn't make much sense to talk about "string encodings", as Java strings are "unaware" of encodings (they use UTF-16 internally, though).
If you need a group of bytes containing a UTF-8 encoded "string", you need to get the "raw" bytes from the String, encoded as UTF-8:
String javaStr = StringEscapeUtils.unescapeHtml(htmlStr);
byte[] rawUtf8String = javaStr.getBytes("UTF-8");
Then do whatever you need with that byte array.
If what you need is to write a UTF-8 encoded string to a file, then instead of using that byte array you need to specify the encoding when you create the java.io.Writer.
Try this code to un-encode your string (change the file path first) and then open the resulting file in any editor that supports UTF-8:
java.io.Writer approach (better):
public static void main(String[] args) throws IOException {
    String str = "10😭😭😂😂😂😂😢😂10😭😭😂😂😂😂😢😂😂";
    String javaString = StringEscapeUtils.unescapeHtml(str);
    try (Writer output = new OutputStreamWriter(
            new FileOutputStream("/path/to/testing.txt"), "UTF-8")) {
        output.write(javaString);
    }
}
java.io.OutputStream approach (if you already have a "raw string"):
public static void main(String[] args) throws IOException {
    String str = "10😭😭😂😂😂😂😢😂10😭😭😂😂😂😂😢😂😂";
    String javaString = StringEscapeUtils.unescapeHtml(str);
    try (OutputStream output = new FileOutputStream("/path/to/testing.txt")) {
        for (byte b : javaString.getBytes(Charset.forName("UTF-8"))) {
            output.write(b);
        }
    }
}
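The point about surrogates can be checked without any library. The sketch below assumes the original input contained pairs of numeric entities such as &#55357;&#56877; (the entities appear already rendered above, so these exact numbers are an assumption), and the class name is made up. It shows that two separately escaped UTF-16 code units join into one code point once they sit next to each other in a Java String, and that the code point takes four bytes in UTF-8.

```java
import java.nio.charset.StandardCharsets;

final class SurrogateDemo {
    // Builds a string from two separately escaped UTF-16 code units.
    static String fromCodeUnits(int high, int low) {
        return new String(new char[] {(char) high, (char) low});
    }

    public static void main(String[] args) {
        String s = fromCodeUnits(55357, 56877); // 0xD83D and 0xDE2D
        System.out.println(Character.isSurrogatePair(s.charAt(0), s.charAt(1))); // true
        System.out.println(Integer.toHexString(s.codePointAt(0)));               // 1f62d
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);           // 4
    }
}
```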
