How to convert utf-8 characters in utf-16 unicode - scala

Ref: https://www.branah.com/unicode-converter
I'm new to Scala and Java and am trying to write a .properties file (in a few languages such as Chinese, French, German, etc.) using Scala for internationalization. For that I'm using the following code:
// `val` is a reserved word in Scala, so the loop variable is named `value`
for ((key, value) <- jsonData.get.asInstanceOf[Map[String, String]]) {
var file: PrintWriter = null
file = new PrintWriter(filepath, "UTF-8")
prop.setProperty(key, value)
prop.store(file, "")
file.close()
}
This code works, but it writes the file as UTF-8 text like:
传播特征 设计师 考虑 测量
düşünce
which does not render properly in the browser. Instead, I want to write it as UTF-16 Unicode escapes like:
\u4f20\u64ad\u7279\u5f81 \u8bbe\u8ba1\u5e08 \u8003\u8651 \u6d4b\u91cf
\u0064\u00fc\u015f\u00fc\u006e\u0063\u0065
As per this converter: https://www.branah.com/unicode-converter
I don't have access to the client side, so I can't post that code here, but I'm sure it simply fetches data from the .properties file through AJAX and renders it in the browser.
How can I convert the text into UTF-16 Unicode escapes so that it renders properly in the browser?
Any help would be appreciated.

You can use the same code and use "UTF-16" for the Charset
// `val` is a reserved word in Scala, so the loop variable is named `value`
for ((key, value) <- jsonData.get.asInstanceOf[Map[String, String]]) {
var file: PrintWriter = null
file = new PrintWriter(filepath, "UTF-16")
prop.setProperty(key, value)
prop.store(file, "")
file.close()
}
Please check the supported java Charset
https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html

Solved it myself.
Replaced PrintWriter with FileOutputStream and it worked :)
var file: File = null
file = new File(filepath)
var fos: FileOutputStream = null
fos = new FileOutputStream(file)
prop.setProperty(key, value)
prop.store(fos, "")
fos.close()
Thanks @Jon Skeet for help.
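The reason this works is that Properties.store(OutputStream, String) writes ISO 8859-1 and turns every character outside printable ASCII into a \uXXXX escape, whereas the Writer overload writes the characters as they are. A minimal Java sketch of that behaviour follows; the class and method names (PropertiesEscapeDemo, storeToString) are made up for illustration and the sample key is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical demo class, not part of the original post.
class PropertiesEscapeDemo {

    // Stores one key/value pair through the OutputStream overload and
    // returns the text that would end up in the .properties file.
    static String storeToString(String key, String value) {
        Properties prop = new Properties();
        prop.setProperty(key, value);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            prop.store(out, null); // null: no custom comment line
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The OutputStream overload always emits ISO 8859-1 bytes.
        return new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The two Chinese characters are written as Unicode escapes in the
        // source so this file compiles regardless of the source encoding.
        System.out.print(storeToString("label", "\u4f20\u64ad"));
    }
}
```

The output also contains a timestamp comment line that store() always adds, so compare only the key=value line when checking results.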

Related

Android Resources.openRawResource() encoding issue [duplicate]

I am reading a property file which consists of a message in the UTF-8 character set.
Problem
The output is not in the appropriate format. I am using an InputStream.
The property file looks like
username=LBSUSER
password=Lbs#123
url=http://localhost:1010/soapfe/services/MessagingWS
timeout=20000
message=Spanish character are = {á é í, ó,ú ,ü, ñ, ç, å, Á, É, Í, Ó, Ú, Ü, Ñ, Ç, ¿, °, 4° año = cuarto año, €, ¢, £, ¥}
And I am reading the file like this,
Properties props = new Properties();
props.load(new FileInputStream("uinsoaptest.properties"));
String username = props.getProperty("username", "test");
String password = props.getProperty("password", "12345");
String url = props.getProperty("url", "12345");
int timeout = Integer.parseInt(props.getProperty("timeout", "8000"));
String messagetext = props.getProperty("message");
System.out.println("This is soap msg : " + messagetext);
The output of the above message is
You can see the message in the console after the line
{************************ SOAP MESSAGE TEST***********************}
I will be obliged if I can get any help reading this file properly. I can read this file with another approach but I am looking for less code modification.

Use an InputStreamReader with Properties.load(Reader reader):
FileInputStream input = new FileInputStream(new File("uinsoaptest.properties"));
props.load(new InputStreamReader(input, Charset.forName("UTF-8")));
As a method, this may resemble the following:
private Properties read( final Path file ) throws IOException {
final var properties = new Properties();
try( final var in = new InputStreamReader(
new FileInputStream( file.toFile() ), StandardCharsets.UTF_8 ) ) {
properties.load( in );
}
return properties;
}
Don't forget to close your streams. Java 7 introduced StandardCharsets.UTF_8.
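To see why the Reader overload matters, here is a small self-contained sketch that loads the same UTF-8 bytes both ways. Properties.load(InputStream) decodes bytes as ISO 8859-1, so multi-byte UTF-8 characters come out garbled, while load(Reader) uses whatever charset the reader was given. The class name PropertiesLoadDemo and the in-memory sample data are assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical demo class, not part of the original answer.
class PropertiesLoadDemo {

    // Loads via the InputStream overload, which decodes bytes as ISO 8859-1.
    static String loadFromStream(byte[] utf8Bytes, String key) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(utf8Bytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    // Loads via the Reader overload with an explicit UTF-8 decoder.
    static String loadFromReader(byte[] utf8Bytes, String key) {
        Properties props = new Properties();
        try (Reader in = new InputStreamReader(
                new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8)) {
            props.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // The n-with-tilde is written as an escape to keep the source ASCII-only.
        byte[] file = "message=a\u00f1o".getBytes(StandardCharsets.UTF_8);
        System.out.println(loadFromStream(file, "message")); // mis-decoded text
        System.out.println(loadFromReader(file, "message")); // intact text
    }
}
```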

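Use props.load(new FileReader("uinsoaptest.properties")) instead. By default FileReader uses the platform default charset, which is derived from the file.encoding system property. Set it to UTF-8 at JVM startup with the command-line parameter -Dfile.encoding=UTF-8; calling System.setProperty("file.encoding", "UTF-8") at runtime is not reliable, because the default charset is read once when the JVM starts. On Java 11 and later you can also pass the charset explicitly: new FileReader("uinsoaptest.properties", StandardCharsets.UTF_8).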

If you use the @Value annotation, you could try StringUtils.
@Value("${title}")
private String pageTitle;
public String getPageTitle() {
return StringUtils.toEncodedString(pageTitle.getBytes(Charset.forName("ISO-8859-1")), Charset.forName("UTF-8"));
}

You should specify the UTF-8 encoding when you read from your FileInputStream object. FileInputStream is a byte stream and has no constructor that takes an encoding, so wrap it in an InputStreamReader:
new InputStreamReader(new FileInputStream("uinsoaptest.properties"), "UTF-8");
If you want the JVM to read UTF-8 files by default, you can set the JAVA_TOOL_OPTIONS environment variable (or pass the option on the command line) to something like this:
-Dfile.encoding=UTF-8

If anybody comes across this problem in Kotlin, like me:
The accepted solution of @Würgspaß works here as well. The corresponding Kotlin syntax:
Instead of the usual
val properties = Properties()
filePath.toFile().inputStream().use { stream -> properties.load(stream) }
I had to use
val properties = Properties()
InputStreamReader(FileInputStream(filePath.toFile()), StandardCharsets.UTF_8).use { stream -> properties.load(stream) }
With this, special UTF-8 characters are loaded correctly from the properties file given in filePath.

How to fix "Improper Neutralization of Script-Related HTML Tags in a Web Page (Basic XSS)" in a ServletOutputStream

The following code gives the Veracode flaw "Improper Neutralization of Script-Related HTML Tags in a Web Page" on the line out.write(outByte,0,iRead):
try {
bytesImage = helper.getBlob(Integer.parseInt(id) );
ByteArrayInputStream bin = new ByteArrayInputStream(bytesImage);
ServletOutputStream out = response.getOutputStream();
outByte = new byte[bytesImage.length];
int iRead = 0;
while ((iRead = bin.read(outByte)) > 0) {
out.write(outByte,0,iRead);
}
I found a lot of similar issues here, but all of them deal with strings only. Those could be fixed with something like this:
> out.write ( ESAPI.encoder().encodeForHTML(theSimpleString) );
but for the binary OutputStream this will not work.
Any hints how to get above veracode issue solved?
Thanks
As suggested by @sinkmanu I tried to convert the bytes to String. Then applied ESAPI.encoder().encodeForHTML().
I added two conversion methods:
private static String base64Encode(byte[] bytes) {
return new BASE64Encoder().encode(bytes);
}
private static byte[] base64Decode(String s) throws IOException {
return new BASE64Decoder().decodeBuffer(s);
}
then tried with this code:
...
bytes = helper.getBlob( inId );
// 1 -> this solves Veracode issue but image is not valid anymore
String encodedString = base64Encode(bytes) ;
String safeString = ESAPI.encoder().encodeForHTML(encodedString);
safeBytes = base64Decode(safeString);
// 2 -> as written above, when i use the safe 'safeBytes' the Veracode flaw is gone but the app is not working anymore (image not ok)
// ByteArrayInputStream bin = new ByteArrayInputStream(safeBytes);
// outBytes = new byte[safeBytes.length];
// 3 -> just use the 'unsafe' bytes -> app is working but veracode flaw needs to be fixed!
ByteArrayInputStream bin = new ByteArrayInputStream(bytes);
outBytes = new byte[bytes.length];
int iRead=0;
ServletOutputStream out = response.getOutputStream();
while ((iRead = bin.read(outBytes)) > 0) {
out.write( outBytes, 0, iRead);
}
...
The above could solve the Veracode issue (when 2 is uncommented), but the image then seems to be corrupt (it cannot be processed anymore).
Any hint how I can solve the Veracode issue with the binary stream?

The solution to above is this:
String safeString = ESAPI.encoder().encodeForBase64(bytes,false);
byte[] safeBytes = ESAPI.encoder().decodeFromBase64(safeString);
In the ESAPI libs there are also methods to encode and decode from Base64. This was the solution to my problem. Above two lines do the magic for veracode and when using the "safeBytes" in the code later on everything is fine...
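The earlier attempt probably corrupted the image because encodeForHTML rewrites Base64 characters such as +, / and = as HTML entities, so the decoded bytes no longer match the original. A plain Base64 encode followed by a decode gives the bytes back unchanged. The sketch below illustrates that round trip with java.util.Base64 from the JDK, swapped in here for the ESAPI calls so the example has no external dependency; Base64RoundTripDemo and roundTrip are illustrative names.

```java
import java.util.Arrays;
import java.util.Base64;

// Hypothetical demo class; uses the JDK codec instead of ESAPI.
class Base64RoundTripDemo {

    // Encodes to Base64 text and decodes straight back, mirroring the
    // encode/decode pair from the answer.
    static byte[] roundTrip(byte[] bytes) {
        String encoded = Base64.getEncoder().encodeToString(bytes);
        return Base64.getDecoder().decode(encoded);
    }

    public static void main(String[] args) {
        // First eight bytes of a PNG file, used as sample binary data.
        byte[] image = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
        System.out.println(Arrays.equals(image, roundTrip(image)));
    }
}
```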

You can validate the file content using the following function:
ESAPI.validator().getValidFileContent()

To sanitize strings you can use encodeForHTML from the ESAPI library or StringEscapeUtils from Apache.
import static org.apache.commons.lang.StringEscapeUtils.escapeHtml;
String data = "<script>alert(document.cookie);</script>";
String escaped = escapeHtml(data);
If your data is not a String, you have to convert it to String. Also, if you are sure that the data that you have are escaped, you can ignore the warning because it is a false positive.
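For readers who want to see what such escaping does, here is a minimal hand-written sketch that replaces the five characters with special meaning in HTML. It is an illustration only, not the ESAPI or Apache implementation, and HtmlEscapeDemo is a made-up name; in real code the library functions above are the safer choice.

```java
// Hypothetical demo class; a simplified stand-in for a library escaper.
class HtmlEscapeDemo {

    // Replaces the characters that can open tags, entities, or attribute values.
    static String escapeHtml(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                case '\'': sb.append("&#39;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeHtml("<script>alert(document.cookie);</script>"));
    }
}
```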

Encoding for unicode and & characters

I am trying to save the string below to my protobuf model:
STOXX®Europe 600 Food&BevNR ETF
But while printing the protomodel value it's displayed like:
STOXXÂ®Europe 600 Food&BevNR ETF
I tried to encode the string to UTF-8 and also tried StringEscapeUtils.unescapeJava(str), but it failed. I'm getting this string by parsing the XML response from server. Any ideas ?
Ref: XML parser Skip invalid xml element with XmlStreamReader

Correcting the XML parsing should be better than needing to unescape everything. Please check below a test case showing this:
public static void main(String[] args) throws Exception {
XMLInputFactory factory = XMLInputFactory.newInstance();
factory.setProperty("javax.xml.stream.isCoalescing", true);
ReaderInputStream ris = new ReaderInputStream(new StringReader("<tag>STOXXÂ®Europe 600 Food&amp;BevNR ETF</tag>"));
XMLStreamReader reader = factory.createXMLStreamReader(ris, "UTF-8");
StringBuilder sb = new StringBuilder();
while (reader.hasNext()) {
reader.next();
if (reader.hasText())
sb.append(reader.getText());
}
System.out.println(sb);
}
Output:
STOXX®Europe 600 Food&BevNR ETF

Actually, protobuf itself has a method that solves this issue:
ByteString.copyFrom(StringEscapeUtils.unescapeHtml3(string), "ISO-8859-1").toStringUtf8();
Documentation of ByteString

As the text comes from XML use:
s = StringEscapeUtils.unescapeXml(s);
This is far better than unescaping HTML, which has hundreds of named entities &...;.
The two garbage characters in place of the registered trademark sign come from reading UTF-8 encoded text (multi-byte for special characters) as some single-byte encoding, maybe Latin-1.
This wrong conversion might be repaired with another conversion, but it would be best to read using a UTF-8 encoding in the first place.
// Hack, just patching. Assumes Latin-1 encoding
s = new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
// Or maybe:
s = new String(s.getBytes(), StandardCharsets.UTF_8);
Better to inspect the reading code and look for places where an optional encoding argument went missing: InputStreamReader, OutputStreamWriter, new String, getBytes.
Your entire problem would also be solved by using an XML reader.
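As a rough illustration of the explanation above, the following sketch first reproduces the damage (UTF-8 bytes decoded as ISO 8859-1) and then reverses it with the patch from this answer. MojibakeDemo, misread and repair are hypothetical names, and the repair only works when the wrong charset really was ISO 8859-1.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
class MojibakeDemo {

    // Simulates the bug: UTF-8 bytes decoded with a single-byte charset.
    static String misread(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
    }

    // Reverses the bug, assuming ISO 8859-1 was the charset used by mistake.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1),
                StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The registered trademark sign is written as an escape to keep
        // the source ASCII-only.
        String original = "STOXX\u00aeEurope 600";
        String garbled = misread(original);
        System.out.println(garbled);
        System.out.println(repair(garbled).equals(original));
    }
}
```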

How to write accented characters from XML into MarkLogic using JavaApi?

I have a 20 MB XML file containing accented characters such as Ö, É, Á, and many more. The problem is that when I insert the file into MarkLogic, these characters are saved in unaccented form, like O, E, A, but I want to store them unchanged. How can I store the accented characters as they are and read the XML back the same way? My XML file is ISO-8859-1 encoded.
The code I have written for writing and reading:
DatabaseClient client = DatabaseClientFactory.newClient(IP, PORT,
DATABASE_NAME, USERNAME, PWD, Authentication.DIGEST);
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLStreamReader streamReader = null;
streamReader = factory.createXMLStreamReader(new FileReader("record.xml"));
XMLDocumentManager xmlDocMgr = client.newXMLDocumentManager();
XMLStreamReaderHandle handle = new XMLStreamReaderHandle(streamReader);
xmlDocMgr.write("/" + filename, handle);
For reading XML:
XMLDocumentManager docMgr = client.newXMLDocumentManager();
DOMHandle xmlhandle = new DOMHandle();
docMgr.read("/" + filename, xmlhandle);
String doc = xmlhandle.toString();
String data = Normalizer.normalize(doc, Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", "");
return data;
I am returning the data to display it in a browser.
I am not able to find where the problem is.

If the XML file does not have an XML prologue that declares its encoding, you should specify the ISO-8859-1 encoding when reading the file before writing the file to the database (as flafoux has pointed out).
You should also specify the encoding when reading the content from the database unless the destination accepts UTF-8 encoding.
For more information, see:
http://docs.marklogic.com/guide/java/document-operations#id_11208
Hoping that helps,

You need to specify the encoding (and also switch to the overload that takes an InputStream):
XMLStreamReader streamReader = factory.createXMLStreamReader(new FileInputStream("record.xml"),"ISO-8859-1");
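Independent of the encoding advice above, it may be worth checking the last step of the reading code in the question: it decomposes the text with Normalizer.Form.NFD and then deletes every non-ASCII character, which by itself turns accented letters into plain ones even if MarkLogic stored them correctly. A small sketch of that step in isolation follows; AccentStripDemo and stripAccents are illustrative names.

```java
import java.text.Normalizer;

// Hypothetical demo class isolating one line from the question's reading code.
class AccentStripDemo {

    // NFD splits an accented letter into base letter plus combining mark;
    // the regex then removes the (non-ASCII) combining mark.
    static String stripAccents(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // O-umlaut, E-acute, A-acute written as escapes to keep the source ASCII-only.
        System.out.println(stripAccents("\u00d6\u00c9\u00c1"));
    }
}
```

If the accents should survive, returning the document string without this normalization step is one thing to try.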

How do I get an FileInputStream from FileItem in java?

I am trying to avoid FileItem's getInputStream(), because it gives me the wrong encoding; I need a FileInputStream instead. Is there any way to get a FileInputStream without using this method? Or can I turn my FileItem into a File?
if (this.strEncoding != null && !this.strEncoding.isEmpty()) {
br = new BufferedReader(new InputStreamReader(clsFile.getInputStream(), this.strEncoding));
}
else {
// br = ?????
}

You can try
FileItem#getString(encoding)
Returns the contents of the file item as a String, using the specified encoding.

You can use the write method here.
File file = new File("/path/to/file");
fileItem.write(file);

An InputStream is binary data, bytes. It must be converted to text by giving the encoding of those bytes.
Java uses Unicode internally to represent text in all scripts. For text it uses String/char/Reader/Writer.
For binary data it uses byte[], InputStream, OutputStream.
So you could use a bridging class, like InputStreamReader:
String encoding = "UTF-8"; // Or "Windows-1252" ...
BufferedReader in = new BufferedReader(
new InputStreamReader(fileItem.getInputStream(),
encoding));
Or if you read the bytes:
String s = new String(bytes, encoding);
The encoding is often an optional parameter (there is then an overloaded method without the encoding argument).
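Applied to the if/else in the question, one possible shape is a helper that always goes through InputStreamReader and falls back to a fixed charset when no encoding was supplied. This is a sketch under assumptions: the UTF-8 fallback is a guess at a sensible default, ReaderFactoryDemo and its methods are made-up names, and a plain InputStream stands in for fileItem.getInputStream().

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
class ReaderFactoryDemo {

    // Wraps the byte stream in a reader, using the given encoding when
    // present and UTF-8 (an assumed default) otherwise.
    static BufferedReader openReader(InputStream in, String encoding) {
        Charset charset = (encoding == null || encoding.isEmpty())
                ? StandardCharsets.UTF_8
                : Charset.forName(encoding);
        return new BufferedReader(new InputStreamReader(in, charset));
    }

    // Convenience wrapper that reads the first line and closes the reader.
    static String readFirstLine(InputStream in, String encoding) {
        try (BufferedReader reader = openReader(in, encoding)) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "caf" plus e-acute, written as an escape to keep the source ASCII-only.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(readFirstLine(new ByteArrayInputStream(latin1), "ISO-8859-1"));
    }
}
```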
