Convert International String to \u Codes in Java

How can I convert an international (e.g. Russian) String to \u escapes (Unicode code numbers),
e.g. \u041e\u041a for OK?

There is a JDK tool, run from the command line as follows:
native2ascii -encoding utf8 src.txt output.txt
Example :
src.txt
بسم الله الرحمن الرحيم
output.txt
\u0628\u0633\u0645 \u0627\u0644\u0644\u0647 \u0627\u0644\u0631\u062d\u0645\u0646 \u0627\u0644\u0631\u062d\u064a\u0645
If you want to use it in your Java application, you can wrap the command line like this:
String pathSrc = "./tmp/src.txt";
String pathOut = "./tmp/output.txt";
String cmdLine = "native2ascii -encoding utf8 " + new File(pathSrc).getAbsolutePath() + " " + new File(pathOut).getAbsolutePath();
Process process = Runtime.getRuntime().exec(cmdLine);
process.waitFor(); // wait until native2ascii has finished writing the output file
System.out.println("THE END");
Then read the content of the new file.

You could use escapeJavaStyleString from org.apache.commons.lang.StringEscapeUtils.

I also had this problem. I had some Portuguese text with special characters, but these characters were already in escaped Unicode form (e.g. \u00e3).
So I wanted to convert S\u00e3o to São.
I did it using Apache Commons StringEscapeUtils, as @sorin-sbarnea said.
Use the method unescapeJava, like this:
String text = "S\\u00e3o"; // doubled backslash keeps the escape as literal text
text = StringEscapeUtils.unescapeJava(text);
System.out.println("text " + text);
(There is also the method escapeJava, which does the reverse: it puts the \u escapes into the string.)
If anyone knows a solution in pure Java, please tell us.
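On the pure-Java question, one possible sketch uses java.util.regex, assuming only well-formed four-digit \u escapes need decoding. The class and method names (UnicodeUnescapeSketch, unescape) are made up for illustration. Unlike unescapeJava, it does not handle other escapes such as \n, and it does not treat a doubled backslash before the u specially.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class UnicodeUnescapeSketch {
    // Matches a backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    static String unescape(String text) {
        Matcher m = ESCAPE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Parse the four hex digits into one UTF-16 code unit.
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against '$' and backslash in the replacement.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Calling UnicodeUnescapeSketch.unescape on the eight characters S, backslash, u, 0, 0, e, 3, o should give São; text without escapes is returned unchanged.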

Here's an improved version of ArtB's answer:
StringBuilder b = new StringBuilder();
for (char c : input.toCharArray()) {
if (c >= 128)
b.append("\\u").append(String.format("%04X", (int) c));
else
b.append(c);
}
return b.toString();
This version escapes all non-ASCII chars and works correctly for low Unicode code points like Ä.

There are three parts to the answer
Get the Unicode for each character
Determine if it is in the Cyrillic Page
Convert to Hexadecimal.
To get each character you can iterate through the String using the charAt() or toCharArray() methods.
for( char c : s.toCharArray() )
The value of the char is the Unicode value.
The Cyrillic Unicode characters are any character in the following ranges:
Cyrillic: U+0400–U+04FF ( 1024 - 1279)
Cyrillic Supplement: U+0500–U+052F ( 1280 - 1327)
Cyrillic Extended-A: U+2DE0–U+2DFF (11744 - 11775)
Cyrillic Extended-B: U+A640–U+A69F (42560 - 42655)
If it is in one of these ranges, it is Cyrillic. Just perform an if check. If it is in range, convert it to four hex digits and prepend "\\u". Integer.toHexString() alone does not zero-pad (it gives 41e, not 041e), so String.format("%04x", ...) is used here. Put together it should look something like this:
final int[][] ranges = new int[][]{
    { 1024, 1279 },
    { 1280, 1327 },
    { 11744, 11775 },
    { 42560, 42655 },
};
StringBuilder b = new StringBuilder();
for( char c : s.toCharArray() ){
    int[] insideRange = null;
    for( int[] range : ranges ){
        if( range[0] <= c && c <= range[1] ){
            insideRange = range;
            break;
        }
    }
    if( insideRange != null ){
        // zero-pad to four hex digits so that U+041E becomes \u041e
        b.append( "\\u" ).append( String.format("%04x", (int) c) );
    }else{
        b.append( c );
    }
}
return b.toString();
Edit: probably should make the check c < 128 and reverse the if and the else bodies; you probably should escape everything that isn't ASCII. I was probably too literal in my reading of your question.

There's a command-line tool that ships with java called native2ascii. This converts unicode files to ASCII-escaped files. I've found that this is a necessary step for generating .properties files for localization.

In case you need this to write a .properties file, you can just add the Strings to a Properties object and then save it to a file. It will take care of the conversion.
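A minimal sketch of the Properties approach, assuming the file is written through the OutputStream overload of store. That overload writes ISO 8859-1 and escapes characters outside it, whereas store(Writer, ...) writes characters as they are. The class and method names here are hypothetical, and an in-memory stream stands in for the file.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical helper showing what Properties.store writes.
final class PropertiesEscapeSketch {
    static String storeAsText(String key, String value) {
        Properties props = new Properties();
        props.setProperty(key, value);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            // OutputStream overload: non-Latin-1 characters become backslash-u escapes.
            props.store(out, null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stored bytes are ISO 8859-1, so decode them the same way.
        return new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
    }
}
```

The returned text starts with a date comment line, followed by the key=value line with the value escaped.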

Apache commons StringEscapeUtils.escapeEcmaScript(String) returns a string with unicode characters escaped using the \u notation.
"Art of Beer 🎨 🍺" -> "Art of Beer \u1F3A8 \u1F37A"

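A note on characters outside the Basic Multilingual Plane, such as emoji: their code points (for example U+1F3A8) do not fit into four hex digits, and a Java String stores them as two UTF-16 code units (a surrogate pair). A Java-style escape therefore needs two \u sequences per such character. The following is a small sketch of that idea, with hypothetical names, and not a description of what any particular library version emits.

```java
// Hypothetical helper: escapes every non-ASCII code point as UTF-16 code units.
final class CodePointEscapeSketch {
    static String escapeNonAscii(String input) {
        StringBuilder out = new StringBuilder();
        input.codePoints().forEach(cp -> {
            if (cp < 128) {
                out.append((char) cp); // plain ASCII stays as it is
            } else {
                // Character.toChars yields one char for BMP code points
                // and a surrogate pair for supplementary ones.
                for (char unit : Character.toChars(cp)) {
                    out.append(String.format("\\u%04X", (int) unit));
                }
            }
        });
        return out.toString();
    }
}
```

With this sketch, the single code point U+1F3A8 comes out as the pair \uD83C\uDFA8.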
Just some basic Methods for that (inspired from native2ascii tool):
/**
* Encode a String like äöü to \u00e4\u00f6\u00fc
*
* @param text
* @return
*/
public String native2ascii(String text) {
if (text == null)
return text;
StringBuilder sb = new StringBuilder();
for (char ch : text.toCharArray()) {
sb.append(native2ascii(ch));
}
return sb.toString();
}
/**
* Encode a Character like ä to \u00e4
*
* @param ch
* @return
*/
public String native2ascii(char ch) {
if (ch > '\u007f') {
StringBuilder sb = new StringBuilder();
// write \udddd
sb.append("\\u");
StringBuffer hex = new StringBuffer(Integer.toHexString(ch));
hex.reverse();
int length = 4 - hex.length();
for (int j = 0; j < length; j++) {
hex.append('0');
}
for (int j = 0; j < 4; j++) {
sb.append(hex.charAt(3 - j));
}
return sb.toString();
} else {
return Character.toString(ch);
}
}

There is an open-source Java library, MgntUtils, that has a utility that converts Strings to Unicode sequences and vice versa:
String result = "Hello World";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);
The output of this code is:
\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
Hello World
The library can be found on Maven Central or on GitHub. It comes as a Maven artifact with sources and Javadoc.
Here is javadoc for the class StringUnicodeEncoderDecoder

You could probably hack it from this JavaScript code:
/* convert 🙌 to \uD83D\uDE4C */
function text_to_unicode(string) {
'use strict';
function is_whitespace(c) { return 9 === c || 10 === c || 13 === c || 32 === c; }
function left_pad(string) { return Array(4).concat(string).join('0').slice(-1 * Math.max(4, string.length)); }
string = string.split('').map(function(c){ return "\\u" + left_pad(c.charCodeAt(0).toString(16).toUpperCase()); }).join('');
return string;
}
/* convert \uD83D\uDE4C to 🙌 */
function unicode_to_text(string) {
var prefix = "\\\\u"
, regex = new RegExp(prefix + "([\\da-f]{4})","ig") // \\d: the backslash must be doubled inside a string literal
;
string = string.replace(regex, function(match, backtrace1){
return String.fromCharCode( parseInt(backtrace1, 16) )
});
return string;
}
source: iCompile - Yet Another JavaScript Unicode Encode/Decode

This kind of conversion is called Unicode decoding/unescaping (or encoding/escaping, in the other direction).
Online converters exist for it.

Related

Java code to process special characters that need to be replaced by other special characters

I am writing Java code to process a string received from a mainframe that contains special characters which need to be replaced by other special characters. My search characters are §ÄÖÜäüßö#[\]~{¦} and the replacement characters are #[\]{}~¦§ÄÖÜßäöü, so if the string has a { in it, I need to replace it with ä. An example of my input is "0.201322.05.2017LM-R{der Dopp".
My code currently is
String repChar = "§ÄÖÜäüßö#[\\\\]~{¦}#[\\\\]{}~¦§ÄÖÜßäöü";
// Split String and Convert
String repCharin = repChar.substring(0, repChar.length()/2-1);
String repCharout = repChar.substring(repChar.length()/2, repChar.length()-1);
String strblob = new String(utf8ContentIn);
// Convert
for (int j=0; j < repCharin.length();j++) {
strblob = strblob.replace(repCharin.substring(j, 1), repCharout.substring(j, 1));
}
byte [] utf8Content = strblob.getBytes();
But it generates the following error
java.lang.StringIndexOutOfBoundsException at
java.lang.String.substring(String.java:1240)
The \\ are escaped characters I only need a single \
The code
String utf8ContentIn = "0.201322.05.2017LM-R{der Dopp";
String repChar = "§ÄÖÜäüßö#[\\]~{¦}#[\\]{}~¦§ÄÖÜßäöü";
// Split String and Convert
String repCharin = repChar.substring(0, repChar.length() / 2);
String repCharout = repChar.substring(repChar.length() / 2, repChar.length());
String strblob = new String(utf8ContentIn);
String output = strblob.chars().mapToObj(c -> {
char ch = (char) c;
int index = repCharin.indexOf(c);
if (index != -1) {
ch = repCharout.charAt(index);
}
return String.valueOf(ch);
}).collect(Collectors.joining());
System.out.println(output);
will print "0.201322.05.2017LM-Räder Dopp" as you expect. Your problem here (besides the incorrect indexes when splitting) is that you should iterate over the input string instead of over your replacement characters. Otherwise you can run into a situation where you replace Ä with [ and then treat [ as a special character again and replace it a second time, with Ä.
Also, a single backslash is escaped with a single backslash, so to get \ you need \\
Hope it helps!

replacing the carriage return with white space in java

I am having the below string in a string variable in java.
rule "6"
no-loop true
when
then
String prefix = null;
prefix = "900";
String style = null;
style = "490";
String grade = null;
grade = "GL";
double basePrice = 0.0;
basePrice = 837.00;
String ruleName = null;
ruleName = "SIVM_BASE_PRICE_006
Rahul Kumar Singh";
ProductConfigurationCreator.createFact(drools, prefix, style,grade,baseprice,rulename);
end
rule "5"
no-loop true
when
then
String prefix = null;
prefix = "800";
String style = null;
style = "481";
String grade = null;
grade = "FL";
double basePrice = 0.0;
basePrice = 882.00;
String ruleName = null;
ruleName = "SIVM_BASE_PRICE_005";
ProductConfigurationCreator.createFact(drools, prefix, style,grade,baseprice,rulename);
end
I need to replace the carriage return between the "then" and "end" keywords with white space so that it becomes like the code below:
rule "6"
no-loop true
when
then
String prefix = null;
prefix = "900";
String style = null;
style = "490";
String grade = null;
grade = "GL";
double basePrice = 0.0;
basePrice = 837.00;
String ruleName = null;
ruleName = "SIVM_BASE_PRICE_006 Rahul Kumar Singh";
ProductConfigurationCreator.createFact(drools, prefix, style,grade,baseprice,rulename);
end
rule "5"
no-loop true
when
then
String prefix = null;
prefix = "800";
String style = null;
style = "481";
String grade = null;
grade = "FL";
double basePrice = 0.0;
basePrice = 882.00;
String ruleName = null;
ruleName = "SIVM_BASE_PRICE_005";
ProductConfigurationCreator.createFact(drools, prefix, style,grade,baseprice,rulename);
end
Of the two example strings above, the second is in the correct format that I need. However, in the first one, I am getting this:
ruleName = "SIVM_BASE_PRICE_006
Rahul Kumar Singh";
This particularly needs to look like this:
ruleName = "SIVM_BASE_PRICE_006 Rahul Kumar Singh";
and I also need to ensure that this doesn't affect anything else in the string.
Thus I need to replace this carriage return with a white space so that the value is on one line. This is my requirement. I tried the replace and replaceAll methods of String, but they did not work properly.
Problem:
I need to look between the strings "then" and "end", and whenever there is a carriage return between two double quotes (""), I need to replace it with white space so that the text is on one line.
Thanks
EDIT:
DRT:
template header
Prefix
Style
Product
package com.xx
import com.xx.drools.ProductConfigurationCreator;
template "ProductSetUp"
rule "Product_#{row.rowNumber}"
no-loop true
when
then
String prefix = null;
prefix = "#{Prefix}";
String style = null;
prefix = "#{Style}";
String product = null;
product = "#{Product}";
ProductConfigurationCreator.createProductFact(drools,prefix,style,product);
end
end template
The Excel sheet and DRT are for demonstration purposes only.
In the image, the Product column contains "SOFAS \rkumar shorav". This is what creates the problem. It generates the following:
product = "SOFAS
kumar shorav";
I need this like below:
product = "SOFAS kumar shorav";
Excel data: (attached image)
Instead of regex I would probably write my own formatter which will
check if cursor is inside quote
replace each \r with space
replace each \n with space, unless it was placed right after \r which means that space was already placed for that \r
write rest of characters without change.
Only possible problem is that this formatter will not care about where string is placed so if you want to format some specific part of the string you will need to provide only that part.
Code implementing such formatter can look like:
public static String format(String text){
    StringBuilder sb = new StringBuilder();
    boolean insideQuote = false;
    char previous = '\0';//to track `\r\n`
    for (char ch : text.toCharArray()) {
        if (insideQuote && (ch == '\r' || ch == '\n')) {
            if (ch == '\r' || previous != '\r') {
                sb.append(" ");//replace `\r` or a lone `\n` with space
            }
            //a `\n` right after `\r` is dropped: its space was already written
        }else {
            if (ch == '"') {
                insideQuote = !insideQuote;
            }
            sb.append(ch); //write other characters without change
        }
        previous = ch;
    }
    return sb.toString();
}
helper utility method
public static String format(File file, String encoding) throws IOException {
String text = new String(Files.readAllBytes(file.toPath()), encoding);
return format(text);
}
Usage:
String formatted = format(new File("input.txt"), "utf-8");
System.out.println(formatted);
You might say that there is a bug in org.drools.template.parser.StringCell, method
public void addValue(Map<String, Object> vars) {
vars.put(column.getName(), value);
}
Here, the value is added to the Map as a String but this does not take into account that string values are usually expanded into string literals. Therefore, an embedded newline should be converted to the escape sequence \n. You might try this patch:
public void addValue(Map<String, Object> vars) {
String h = value.replaceAll( "\n", "\\\\n" );
vars.put(column.getName(), h);
}
Take the source file, put it into a suitable subdirectory, compile it to a class file and make sure that the root directory precedes drools-templates-6.2.0.Final-sources.jar in the class path. You should then see
ruleName = "SIVM_BASE_PRICE_006\nRahul Kumar Singh";
in the generated DRL file. Obviously, this is not a space, but it is what is written in the spreadsheet cell!
I (urgently) suggest that you do not follow this approach. The reason is simply that strings are not always expanded between quotes, and then the replacement would almost certainly result in invalid code. There is simply no remedy, as the template compiler is "dumb" and does not really "know" what it is expanding.
If a String in a spreadsheet contains a line break, template expansion must render this faithfully, and break the line just there. If this produces invalid (Java) code: why was the line break entered in the first place? There is absolutely no reason not to have a space in that cell if that's what you want.
s = s.replaceAll("(?m)^([^\"]*(\"[^\"]*\")*[^\"]*\"[^\"]*)\r?\n\\s*", "$1 ");
This replaces the line ending of every line that contains an unpaired quote with a space.
^ means starting at the beginning of a line
[^\"] means any character that is not a quote
\r?\n catches both CR+LF (Windows) and LF (other systems) line endings
The captured group matches, in order:
not-quotes,
a repetition of quote, not-quotes, quote (the paired quotes),
not-quotes, quote, not-quotes, followed by the newline
Mind that this does not cover backslash-escaped quotes or escaped backslashes.
Use the "multi line" flag:
str = str.replaceAll("(?m)^\\s+", "");
The multi-line flag (?m) makes ^ and $ match start/end of each line (rather than start/end of input). \s+ means "one or more whitespace characters".
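A tiny sketch of that multi-line replacement, with a hypothetical class and method name. One detail worth knowing: \s also matches line breaks, so a blank line followed by an indented line is removed along with the indentation.

```java
// Hypothetical helper wrapping the multi-line regex shown above.
final class LeadingWhitespaceSketch {
    static String stripLeadingWhitespace(String text) {
        // (?m) lets ^ match at the start of every line, not just the input.
        return text.replaceAll("(?m)^\\s+", "");
    }
}
```

For instance, an input of two indented lines comes back with both lines starting at column zero.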

Birt Report Indian Rupee formatting

I want to format numbers in the Indian rupee/number format (basically the comma placement) in BIRT through scripting (for some conditional reasons).
If I use:
this.getStyle().numberFormat="#,##,##,##0.000";
it still adds commas after every 3 digits, as in 12,345,678.000, but I want the number formatted as 1,23,45,678.000.
Can you please advise?
EDIT: Bug with BIRT raised as : https://bugs.eclipse.org/bugs/show_bug.cgi?id=432211
EDIT: set a custom format number
Here is a possible workaround, forcing BIRT to make use of the com.ibm.icu.text.DecimalFormat class. I don't know why the Indian format is not natively supported; you could report this in the Bugzilla of the eclipse.org site.
Edit your dataset
Create a new computed column, select "String" datatype
Enter as expression: (in the first line, replace "value" with the actual name of the numeric column containing values)
var columnvalue=row["value"], customformat="#,##,##,##0.000"; //we can add here a test for conditional formatting
if (columnvalue!=null){
var symbols=Packages.com.ibm.icu.text.DecimalFormatSymbols(new Packages.java.util.Locale("en","IN"));
var formatter=Packages.com.ibm.icu.text.DecimalFormat(customformat,symbols);
var value=new Packages.java.math.BigDecimal(columnvalue.toString());
formatter.format(value);
}else{
"-"
}
Click "Preview results" in the dataset editor, a new column should be added at the end with the expected format.
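You can use NumberFormat by setting the locale to the Indian setting. Note that java.text.DecimalFormat applies only a single grouping size, so the two-digit Indian grouping needs ICU's com.ibm.icu.text.NumberFormat.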
Locale locale = new Locale("en","IN");
String str = NumberFormat.getNumberInstance(locale).format(<your number>);
That's if you are looking for JAVA code to resolve your problem.
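If ICU is not available, the grouping can also be done by hand. The following is only a sketch under that assumption, with a hypothetical class name: it keeps the last three integer digits together and groups the remaining digits in pairs.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical helper: Indian-style digit grouping without ICU.
final class IndianGroupingSketch {
    static String format(BigDecimal value, int decimals) {
        BigDecimal scaled = value.setScale(decimals, RoundingMode.HALF_UP);
        String plain = scaled.abs().toPlainString();
        int dot = plain.indexOf('.');
        String intPart = dot < 0 ? plain : plain.substring(0, dot);
        String fraction = dot < 0 ? "" : plain.substring(dot);

        StringBuilder out = new StringBuilder(scaled.signum() < 0 ? "-" : "");
        // Everything before the last three digits is grouped in pairs.
        int headLength = Math.max(intPart.length() - 3, 0);
        for (int i = 0; i < headLength; i++) {
            out.append(intPart.charAt(i));
            // A comma follows when an even number of head digits remain.
            if ((headLength - 1 - i) % 2 == 0) {
                out.append(',');
            }
        }
        out.append(intPart.substring(headLength)).append(fraction);
        return out.toString();
    }
}
```

For example, IndianGroupingSketch.format(new BigDecimal("12345678"), 3) gives 1,23,45,678.000.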
You can use the following JavaScript currency format function and call it from BIRT.
function getSouthAsianCurrencyFormat(amount)
{
var l,ftemp,temp,camount,k,adecimal;
var decimals=2;
var ptrn="##,##,###,##,##,###.##";
var ptrnLength=0;
var adecimal=0;
var counts = {};
var ch, index, len, count;
amount= Number(Math.round(amount+'e'+decimals)+'e-'+decimals);
amount=amount.toFixed( decimals );
for (index = 0, len = ptrn.length; index < len; ++index) {
ch = ptrn.charAt(index);
count = counts[ch];
counts[ch] = count ? count + 1 : 1;
}
for (ch in counts) {
if(ch=="#"){
ptrnLength=counts[ch];
console.log(ch + " count: " + ptrnLength+"("+ptrn.length+")");
console.log( "amount length: " + amount.toString().length);
//console.log("decimalLength: "+decimalLength.toString().length);
}
}
if(!counts['.']){ // no '.' in the pattern (the original used '=', an assignment, so this branch never ran)
amount=amount+".00";
}
k=ptrn.toString().length;
l=amount.toString().length;
ftemp=amount.toString();
temp="";
camount="";
if(ptrnLength<(amount.toString().length-1)) return 0;
else {
k=k-1;
l=l-1;
for(i=l;i>-1;i--){
if(ptrn.charAt(k)=="#" || ptrn.charAt(k)=="." ){
camount=ftemp.charAt(i)+camount;
}
else{
camount=ptrn.charAt(k)+camount;
k=k-1;
if(ptrn.charAt(k)=="#"){
camount=ftemp.charAt(i)+camount;
}
}
k=k-1;
}
return (camount);
}
}

i need simple transliteration in android

I've got 2 arrays, Latin and Cyrillic.
I've got a string like "мама моет раму"
and need to convert it to Latin to get "mama_moet_ramu".
I use "Arrays.asList(copyFrom).contains(cur)" to find whether a char is in the array, but I don't know how to get the position of this char in the array.
char[] copyTo = {'a','b','v','g','d','e','e','g','z','i','i','k','l','m','n','o','p','R','S','T','U','f','h','c','h','h','h',' ',' ',' ','e','u','y','_'};
char[] copyFrom = {'а','б','в','г','д','е','ё','ж','з','и','й','к','л','м','н','о','п','р','с','т','у','ф','х','ц','ч','щ','ш','ь','ы','ъ','э','ю','я',' '};
Thanks/
Initially I tried to build on the basic function you did above, but I quickly learned that a single Cyrillic character may map to MULTIPLE Latin characters - so doing a "char" replacement just doesn't do the job.
There's probably a better way to do this, but here's the function I came up with.
public static String transliterate(String srcstring) {
ArrayList<String> copyTo = new ArrayList<String>();
String cyrcodes = "";
for (int i=1040;i<=1067;i++) {
cyrcodes = cyrcodes + (char)i;
}
for (int j=1072;j<=1099;j++) {
cyrcodes = cyrcodes + (char)j;
}
// Uppercase
copyTo.add("A");
copyTo.add("B");
copyTo.add("V");
copyTo.add("G");
copyTo.add("D");
copyTo.add("E");
copyTo.add("Zh");
copyTo.add("Z");
copyTo.add("I");
copyTo.add("I");
copyTo.add("K");
copyTo.add("L");
copyTo.add("M");
copyTo.add("N");
copyTo.add("O");
copyTo.add("P");
copyTo.add("R");
copyTo.add("S");
copyTo.add("T");
copyTo.add("U");
copyTo.add("F");
copyTo.add("Kh");
copyTo.add("TS");
copyTo.add("Ch");
copyTo.add("Sh");
copyTo.add("Shch");
copyTo.add("");
copyTo.add("Y");
// lowercase
copyTo.add("a");
copyTo.add("b");
copyTo.add("v");
copyTo.add("g");
copyTo.add("d");
copyTo.add("e");
copyTo.add("zh");
copyTo.add("z");
copyTo.add("i");
copyTo.add("i");
copyTo.add("k");
copyTo.add("l");
copyTo.add("m");
copyTo.add("n");
copyTo.add("o");
copyTo.add("p");
copyTo.add("r");
copyTo.add("s");
copyTo.add("t");
copyTo.add("u");
copyTo.add("f");
copyTo.add("kh");
copyTo.add("ts");
copyTo.add("ch");
copyTo.add("sh");
copyTo.add("shch");
copyTo.add("");
copyTo.add("y");
String newstring = "";
char onechar;
int replacewith;
for (int j=0; j<srcstring.length();j++) {
onechar = srcstring.charAt(j);
replacewith = cyrcodes.indexOf((int)onechar);
if (replacewith > -1) {
newstring = newstring + copyTo.get(replacewith);
} else {
// keep the original character, not in replace list
newstring = newstring + String.valueOf(onechar);
}
}
return newstring;
}
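To get the position, use Arrays.asList(copyFrom).indexOf(cur);
it returns -1 if cur is not in the list. Note that this only works when copyFrom is an array of objects (e.g. Character[]): for a char[], Arrays.asList produces a single-element List<char[]>, so contains and indexOf never find a char.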
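Since one Cyrillic letter can map to several Latin letters, a Map<Character, String> avoids the index lookup altogether. Below is a minimal sketch with a deliberately partial table (only the letters а, е, м, о, р, т, у, ж, ш, щ and the space). The class name and the table contents are illustrative assumptions, and the Cyrillic letters are written as Unicode escapes so the source compiles regardless of file encoding.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper with a partial, illustrative transliteration table.
final class TransliterationSketch {
    // Cyrillic a, e, m, o, r, t, u, zh, sh, shch, then the space character.
    private static final String FROM =
            "\u0430\u0435\u043c\u043e\u0440\u0442\u0443\u0436\u0448\u0449 ";
    private static final String[] TO =
            {"a", "e", "m", "o", "r", "t", "u", "zh", "sh", "shch", "_"};
    private static final Map<Character, String> TABLE = new HashMap<>();

    static {
        for (int i = 0; i < FROM.length(); i++) {
            TABLE.put(FROM.charAt(i), TO[i]);
        }
    }

    static String transliterate(String source) {
        StringBuilder out = new StringBuilder();
        for (char c : source.toCharArray()) {
            String mapped = TABLE.get(c);
            // Characters without a table entry are kept unchanged.
            out.append(mapped != null ? mapped : String.valueOf(c));
        }
        return out.toString();
    }
}
```

With this table, "мама моет раму" becomes "mama_moet_ramu". A full table would list every letter, including upper case.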

How can you parse the string which has a text qualifier

How can I parse a String str = "abc, \"def,ghi\"";
such that I get the output as
String[] strs = {"abc", "\"def,ghi\""}
i.e. an array of length 2.
Should I use a regular expression, or is there a method in the Java API or any other open-source project that lets me do this?
Edited
To give context about the problem: I am reading a text file that has a list of records, one per line. Each record has a list of fields separated by a delimiter (comma or semicolon). Now I have a requirement to support a text qualifier, something like what Excel or OpenOffice supports. Suppose I have the record
abc, "def,ghi"
Here , is my delimiter and " is my text qualifier, so when I parse this string I should get two fields, abc and def,ghi, not {abc,def,ghi}.
Hope this clears my requirement.
Thanks
Shekhar
The basic algorithm is not too complicated:
public static List<String> customSplit(String input) {
List<String> elements = new ArrayList<String>();
StringBuilder elementBuilder = new StringBuilder();
boolean isQuoted = false;
for (char c : input.toCharArray()) {
if (c == '\"') {
isQuoted = !isQuoted;
// continue; // changed according to the OP comment - \" shall not be skipped
}
if (c == ',' && !isQuoted) {
elements.add(elementBuilder.toString().trim());
elementBuilder = new StringBuilder();
continue;
}
elementBuilder.append(c);
}
elements.add(elementBuilder.toString().trim());
return elements;
}
This question seems appropriate: Split a string ignoring quoted sections
Along that line, http://opencsv.sourceforge.net/ seems appropriate.
Try this -
String str = "abc, \"def,ghi\"";
String regex = "([,]) | (^[\"\\w*,\\w*\"])";
for(String s : str.split(regex)){
System.out.println(s);
}
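Try:
List<String> res = new LinkedList<String>();
// limit -1 keeps trailing empty chunks, so the parity check below is reliable
String[] chunks = str.split("\\\"", -1);
if (chunks.length % 2 == 0) {
    // Mismatched quotes!
}
for (int i = 0; i < chunks.length; i++) {
    if (i % 2 == 0) {
        // even chunks lie outside the quotes: split them on the delimiter
        res.addAll(Arrays.asList(chunks[i].split(",")));
    } else {
        // odd chunks lie between quotes: keep them as one field
        res.add(chunks[i]);
    }
}
This will only split up the portions that are not between quotes.
Splitting the unquoted chunks can leave blank or empty entries next to a quoted field, so call trim() on each piece and skip the empty ones.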
