Unescaping special characters using Java

I have been given the following value (escaped using Windows-1252):
ABC &#145 ; &#146 ; &#147 ; &#148 ; &#226 ;, &#234 ;, &#238 ;, &#244 ;, &#251 ;
(I added a space before each ; so that the exact value is displayed; in the actual value there is no space between the number and the ;.)
The value I want after unescaping is:
ABC ‘ ’ “ ” â, ê, î, ô, û
I have tried HtmlUtils.htmlUnescape(decodedString); but it did not work.
I am getting output like:
ABC â, ê, î, ô, û
The ‘ ’ “ ” characters are removed.
How can I do this in Java?

The quote characters are probably still in the string; they are just invisible when displayed. That's because in Unicode and ISO 8859-1, the code points 145 to 148 are control characters (part of the C1 range, 128 to 159) rather than visible characters.
The best solution (if the library offers it) is to pass the encoding to the unescape method.
An alternative is to call htmlUnescape first and then map the cp1252 code points to the corresponding Unicode code points, using the following code:
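// Needs java.nio.charset.Charset and java.nio.charset.StandardCharsets.
String unescapeHtmlCp1252(String input) {
    // After this call, &#145; has become the char with value 145 (U+0091).
    String nohtml = HtmlUtils.htmlUnescape(input);
    // ISO 8859-1 maps the chars U+0000..U+00FF to the same byte values.
    byte[] bytes = nohtml.getBytes(StandardCharsets.ISO_8859_1);
    // Reading those bytes as cp1252 turns byte 145 into U+2018, and so on.
    return new String(bytes, Charset.forName("cp1252"));
}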
When you step through this code with a debugger and inspect the nohtml string, you will probably see characters with the value 145, 146, and so on. This means that the characters are still there at this point.
Later, when the characters are converted into pixels by using a font, these characters do not have a definition and are therefore just ignored. But until this step, they are still there.
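The byte-level remapping described above can be tried without any HTML library. The following is a minimal sketch, assuming the string already contains the raw control characters (U+0091 and so on); the class name Cp1252Remap is made up for illustration. Any character above U+00FF would be turned into ? by the ISO 8859-1 step, so this only suits input limited to the Latin-1 range.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class Cp1252Remap {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Reinterprets chars U+0080..U+009F as the cp1252 characters with the same byte value.
    static String remap(String s) {
        // ISO 8859-1 encodes U+0000..U+00FF as the identical byte values.
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        // Decoding as cp1252 maps byte 0x91 to U+2018, 0x92 to U+2019, and so on.
        return new String(bytes, CP1252);
    }

    public static void main(String[] args) {
        String raw = "ABC \u0091 \u0092 \u00E2";
        String fixed = remap(raw);
        // Print the code of each char to make the invisible characters observable.
        for (int i = 0; i < fixed.length(); i++) {
            System.out.printf("U+%04X%n", (int) fixed.charAt(i));
        }
    }
}
```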

You can use a regular expression for that.
Pattern p = Pattern.compile("&#(\\d+);");
StringBuffer out = new StringBuffer();
String s = "ABC&#145;&#146;&#226;D";
Matcher m = p.matcher(s);
int startIdx = 0;
byte[] bytes = new byte[]{0};
while(startIdx < s.length() && m.find(startIdx)) {
if (m.start() > startIdx) {
out.append(s.substring(startIdx, m.start()));
}
// fetch the numeric value from the encoding and put it into a byte array
bytes[0] = (byte)Short.parseShort(m.group(1));
// convert the windows 1252 encoded byte array into a java string
out.append(new String(bytes,"Windows-1252"));
startIdx = m.end();
}
if (startIdx < s.length()) {
out.append(s.substring(startIdx));
}
The output / result will be something like
ABC‘’âD
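A variant of the regular-expression approach, as a sketch only: it decodes values in the range 128 to 159 through windows-1252 and treats every other value as a Unicode code point, so values above 255 are not truncated to a byte. The class name NumericEntityDecoder and the choice to leave out-of-range values untouched are assumptions made for this example.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class NumericEntityDecoder {
    private static final Pattern ENTITY = Pattern.compile("&#(\\d{1,7});");
    private static final Charset CP1252 = Charset.forName("windows-1252");

    static String decode(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int value = Integer.parseInt(m.group(1));
            String decoded;
            if (value >= 0x80 && value <= 0x9F) {
                // Windows-1252 range: decode the single byte with that charset.
                decoded = new String(new byte[] {(byte) value}, CP1252);
            } else if (Character.isValidCodePoint(value)) {
                // Everything else is taken as a Unicode code point.
                decoded = new String(Character.toChars(value));
            } else {
                // Assumption: leave values beyond U+10FFFF as they are.
                decoded = m.group();
            }
            // quoteReplacement keeps '$' and '\' in the result literal.
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("ABC&#145;&#146;&#226;D"));
    }
}
```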

Related

Convert special characters into decimal equivalents in java

Is there a java library to convert special characters into decimal equivalent?
example:
input: "©™®"
output: "& #169; & #8482; & #174;"(space after & is only for question purpose, if typed without a space decimal equivalent is converted to special character)
Thank you !
This can be simply achieved with String.format(). The representations are simply the character value as decimal, padded to 4 characters and wrapped in &#;
The only tricky part is deciding which characters are "special". Here I've assumed not digit, not whitespace and not alpha...
StringBuilder output = new StringBuilder();
String input = "Foo bar ©™® baz";
for (char each : input.toCharArray()) {
if (Character.isAlphabetic(each) || Character.isDigit(each) || Character.isWhitespace(each)) {
output.append(each);
} else {
output.append(String.format("&#%04d;", (int) each));
}
}
System.out.println(output.toString());
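The loop above works on char values, so a character outside the BMP would be emitted as two surrogate values. A possible code-point-based sketch follows; it assumes that "special" simply means non-ASCII (a different rule from the one above) and omits the zero padding. The class name DecimalEntityEncoder is made up for illustration.

```java
// Hypothetical helper, not part of any library.
final class DecimalEntityEncoder {
    // Assumption: every code point above 127 counts as "special".
    static String encode(String input) {
        StringBuilder out = new StringBuilder();
        input.codePoints().forEach(cp -> {
            if (cp < 128) {
                out.appendCodePoint(cp);               // plain ASCII stays as it is
            } else {
                out.append("&#").append(cp).append(';'); // decimal character reference
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00A9 = copyright sign, \u2122 = trade mark sign, \u00AE = registered sign
        System.out.println(encode("Foo bar \u00A9\u2122\u00AE baz"));
    }
}
```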
You just need to fetch the integer value of the character as mentioned in How do I get the decimal value of a unicode character in Java?.
As per Oracle Java doc
char: The char data type is a single 16-bit Unicode character. It has
a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or
65,535 inclusive).
Assuming your characters fall within the character range, you can just get the decimal equivalent of each character from your string.
String text = "©™®";
char[] cArr = text.toCharArray();
for (char c : cArr)
{
int value = c; // get the decimal equivalent of the character
String result = "& #" + value; // append to some format string
System.out.println(result);
}
Output:
& #169
& #8482
& #174

Replace special characters in a string with their UTF-8 encoded character java?

I want to convert only the special characters to their UTF-8 equivalent character.
For example given a String: Abcds23#$_ss, it should get converted to Abcds23353695ss.
The following is how I did the above conversion:
The UTF-8 value in hexadecimal for # is 23 and in decimal is 35. The UTF-8 value in hexadecimal for $ is 24 and in decimal is 36. The UTF-8 value in hexadecimal for _ is 5f and in decimal is 95.
I know we have the String.replaceAll(String regex, String replacement) method. But I want to replace specific characters with their specific UTF-8 equivalents.
How do I do the same in Java?
I don't know how you define "special characters", but this function should give you an idea:
public static String convert(String str)
{
StringBuilder buf = new StringBuilder();
for (int index = 0; index < str.length(); index++)
{
char ch = str.charAt(index);
if (Character.isLetterOrDigit(ch))
buf.append(ch);
else
buf.append(str.codePointAt(index));
}
return buf.toString();
}
@Test
public void test()
{
Assert.assertEquals("Abcds23353695ss", convert("Abcds23#$_ss"));
}
The following uses Java 8 or above. It keeps a Unicode code point (symbol) if it is a letter or digit in pure ASCII (< 128), and otherwise outputs the numerical value of the code point as a string.
static String convert(String str) {
int[] cps = str.codePoints()
.flatMap((cp) ->
Character.isLetterOrDigit(cp) && cp < 128
? IntStream.of(cp)
: String.valueOf(cp).codePoints())
.toArray();
return new String(cps, 0, cps.length);
}
String.codePoints() yields an IntStream, flatMap merges the per-code-point IntStreams into a single flattened stream, and toArray collects it into an array. We can then construct a new String from those code points. Entirely Unicode safe.
The conversion is not reversible without delimiters.
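To see why delimiters matter: without them, "#5" and "355" both end up as "355". The following sketch wraps each number in % and ; so that the text can be decoded again. These delimiters and the class name ReversibleEncoding are choices made for this example only, not something the question asked for.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class ReversibleEncoding {
    private static final Pattern TOKEN = Pattern.compile("%(\\d{1,7});");

    // ASCII letters and digits stay; every other code point becomes %<decimal>;
    static String encode(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp < 128 && Character.isLetterOrDigit(cp)) {
                sb.appendCodePoint(cp);
            } else {
                // '%' and ';' are themselves encoded, so tokens stay unambiguous.
                sb.append('%').append(cp).append(';');
            }
        });
        return sb.toString();
    }

    static String decode(String s) {
        Matcher m = TOKEN.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int cp = Integer.parseInt(m.group(1));
            String ch = Character.isValidCodePoint(cp)
                    ? new String(Character.toChars(cp))
                    : m.group(); // not produced by encode; left untouched
            m.appendReplacement(out, Matcher.quoteReplacement(ch));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String encoded = encode("Abcds23#$_ss");
        System.out.println(encoded);
        System.out.println(decode(encoded));
    }
}
```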
On Unicode:
Unicode numbers symbols, called code points, from 0 up to U+10FFFF (a 21-bit range, so three bytes suffice).
To encode code points as bytes there exist UTF-8 (multi-byte), UTF-16LE and UTF-16BE (sequences of 2-byte units) and UTF-32 (code points more or less as-is).
Java string constants in a .class file are stored in a modified form of UTF-8. A String is composed of UTF-16 chars, and String can give code points as above. So Java by design uses Unicode for text.
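The difference between chars, code points and encoded bytes can be checked with a small sketch like the one below. The class name EncodingSizes and the sample string are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class EncodingSizes {
    // Returns {char count, code point count, UTF-8 byte count, UTF-16BE byte count}.
    static int[] sizes(String s) {
        return new int[] {
            s.length(),
            s.codePointCount(0, s.length()),
            s.getBytes(StandardCharsets.UTF_8).length,
            s.getBytes(StandardCharsets.UTF_16BE).length
        };
    }

    public static void main(String[] args) {
        // 'a' (ASCII), U+00E9 (Latin-1 range), U+1F602 (outside the BMP, a surrogate pair)
        int[] r = sizes("a\u00E9\uD83D\uDE02");
        System.out.println("chars=" + r[0] + " codePoints=" + r[1]
                + " utf8Bytes=" + r[2] + " utf16Bytes=" + r[3]);
    }
}
```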

Cast arbitrary escaped character to int

I have a method that, at the end, takes a character array (with one element), and returns the cast of that character:
char[] first = {'a'};
return (int)first[0];
However, sometimes I have character arrays with two elements, where the first is always a "\" (i.e. it is a character array that "contains" an escaped character):
char[] second = {'\\', 'n'};
I would like to return (int)'\n', but I do not know how to convert that array into a single escaped character. I am okay checking whether or not the array is of length 1 or 2, but I really don't want to have a long switch or if/else block to go through every possible escaped character.
How about making a HashMap from the character after the backslash to the code of the escaped character? Like:
Map<Character, Integer> escapeMap = new HashMap<>();
escapeMap.put('n', 10);
Then do something like:
if (second[0] == '\\') {
    return escapeMap.get(second[1]);
}
else
{
    return (int)second[0];
}
You could use a map to store the escape sequences mappings to their corresponding characters. If you assume that the escape sequence will always be just one character with the code below 128, you could simplify the mappings to something like this:
char[] escaped = {..., '\n', ...'\t', ...}
where the character '\n' is on the (int)'n'-th position of the array.
Then you would find the escaped character just by escaped[(int)second[1]]. You just need to check the array bounds in case an invalid escape sequence is found.
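The table idea could look roughly like this. It is a sketch under the assumption that only the single-character Java escapes (\b, \t, \n, \f, \r, \", \', \\) need to be supported; the class name EscapeTable and the use of -1 for "no such escape" are choices made for this example.

```java
import java.util.Arrays;

// Hypothetical helper, not part of any library.
final class EscapeTable {
    // Index: the character after the backslash; value: the escaped character, or -1.
    private static final int[] ESCAPED = new int[128];

    static {
        Arrays.fill(ESCAPED, -1);
        ESCAPED['b'] = '\b';
        ESCAPED['t'] = '\t';
        ESCAPED['n'] = '\n';
        ESCAPED['f'] = '\f';
        ESCAPED['r'] = '\r';
        ESCAPED['"'] = '"';
        ESCAPED['\''] = '\'';
        ESCAPED['\\'] = '\\';
    }

    static int toCodePoint(char[] chars) {
        if (chars.length == 1) {
            return chars[0];
        }
        if (chars.length == 2 && chars[0] == '\\'
                && chars[1] < ESCAPED.length && ESCAPED[chars[1]] >= 0) {
            return ESCAPED[chars[1]];
        }
        throw new IllegalArgumentException("Unsupported input: " + new String(chars));
    }

    public static void main(String[] args) {
        System.out.println(toCodePoint(new char[] {'\\', 'n'})); // same as (int) '\n'
        System.out.println(toCodePoint(new char[] {'a'}));       // same as (int) 'a'
    }
}
```

On Java 15 or later, String.translateEscapes() may be an alternative worth checking, since it translates escape sequences in a string the way the compiler does for string literals.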
Here is an ugly hack. This works for me, but appears to be unreliable, buggy and time-consuming (and hence don't use this in critical parts).
char[] second = {'\\','n'};
String s = new String(second);
//write the String to an OutputStream
ByteArrayOutputStream baos = new ByteArrayOutputStream();
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(baos));
writer.write("key=" + s);
writer.close();
//load the String using Properties
ByteArrayInputStream bais = new ByteArrayInputStream(baos.toByteArray());
Properties prop = new Properties();
prop.load(bais);
baos.close();
bais.close();
//now get the character
char c = prop.getProperty("key").charAt(0);
System.out.println((int)c);
output: 10 (which is the output of System.out.println((int)'\n');)
When printing (int)'a', the decimal value of that character from the Unicode table is printed. In the case of a that would be 97. http://unicode-table.com/en/
\n is identified as the Unicode character LF, which means Line Feed or New Line. In the Unicode table that is decimal number 10, which indeed gets printed if you write: System.out.println((int)'\n');
The same goes for the escape sequences \b, \f, \r, \t, \', \" and \\, which have special meaning for the compiler and each stand for a specific character code, like LF for \n. Look them up if you want to know the details.
In that light the simplest solution would be:
char[] second = {'\\', 'n'};
if (second.length > 1) {
System.out.println((int)'\n');
} else {
System.out.println((int)second[0]);
}
and that's only if \n is the only escape sequence you encounter.

Trim a string based on the string length

I want to trim a string if the length exceeds 10 characters.
Suppose the string length is 12 (String s="abcdefghijkl"); then the new trimmed string should contain "abcdefgh..".
How can I achieve this?
s = s.substring(0, Math.min(s.length(), 10));
Using Math.min like this avoids an exception in the case where the string is already shorter than 10.
Notes:
The above does simple trimming. If you actually want to replace the last characters with three dots if the string is too long, use Apache Commons StringUtils.abbreviate; see @H6's solution. If you want to use the Unicode horizontal ellipsis character, see @Basil's solution.
For typical implementations of String, s.substring(0, s.length()) will return s rather than allocating a new String.
This may behave incorrectly (see note 1) if your String contains Unicode code points outside of the BMP, e.g. emojis. For a (more complicated) solution that works correctly for all Unicode code points, see @sibnick's solution.
1 - A Unicode code point that is not on plane 0 (the BMP) is represented as a "surrogate pair" (i.e. two char values) in the String. By ignoring this, we might trim the string to fewer than 10 code points, or (worse) truncate it in the middle of a surrogate pair. On the other hand, String.length() is not a good measure of Unicode text length, so trimming based on that property may be the wrong thing to do.
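One way to trim by code points instead of char values is String.offsetByCodePoints. The following is a minimal sketch, assuming the limit is meant as a number of code points; the class name CodePointTruncate is made up for illustration.

```java
// Hypothetical helper, not part of any library.
final class CodePointTruncate {
    // Keeps at most maxCodePoints code points and never splits a surrogate pair.
    static String truncate(String s, int maxCodePoints) {
        if (s.codePointCount(0, s.length()) <= maxCodePoints) {
            return s;
        }
        // Index of the char that follows the first maxCodePoints code points.
        int end = s.offsetByCodePoints(0, maxCodePoints);
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        // Nine letters, then U+1F602 (two chars), then more letters.
        String s = "abcdafghi\uD83D\uDE02cdefg";
        String t = truncate(s, 10);
        System.out.println(t.length() + " chars, "
                + t.codePointCount(0, t.length()) + " code points");
    }
}
```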
StringUtils.abbreviate from Apache Commons Lang library could be your friend:
StringUtils.abbreviate("abcdefg", 6) = "abc..."
StringUtils.abbreviate("abcdefg", 7) = "abcdefg"
StringUtils.abbreviate("abcdefg", 8) = "abcdefg"
StringUtils.abbreviate("abcdefg", 4) = "a..."
Commons Lang 3 even allows you to set a custom String as the replacement marker. With this you can, for example, use a single-character ellipsis.
StringUtils.abbreviate("abcdefg", "\u2026", 6) = "abcde…"
There is an Apache Commons StringUtils function which does this.
s = StringUtils.left(s, 10)
If len characters are not available, or the String is null, the String will be returned without an exception. An empty String is returned if len is negative.
StringUtils.left(null, *) = null
StringUtils.left(*, -ve) = ""
StringUtils.left("", *) = ""
StringUtils.left("abc", 0) = ""
StringUtils.left("abc", 2) = "ab"
StringUtils.left("abc", 4) = "abc"
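If the library is not available, the behaviour listed above can be approximated in plain Java. This is a sketch written from the listed examples, not the library's actual source; the class name PlainLeft is made up for illustration.

```java
// Hypothetical helper, not part of Apache Commons.
final class PlainLeft {
    static String left(String s, int len) {
        if (s == null) {
            return null;   // null in, null out
        }
        if (len < 0) {
            return "";     // negative length gives an empty string
        }
        // Never ask for more chars than the string has.
        return s.substring(0, Math.min(s.length(), len));
    }

    public static void main(String[] args) {
        System.out.println(left("abc", 2));
        System.out.println(left("abc", 4));
    }
}
```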
StringUtils.Left JavaDocs
Courtesy:Steeve McCauley
As usual, nobody cares about UTF-16 surrogate pairs. See: What are the most common non-BMP Unicode characters in actual use? Even the authors of org.apache.commons/commons-lang3 do not handle them.
You can see the difference between correct code and the usual code in this sample:
public static void main(String[] args) {
//string with FACE WITH TEARS OF JOY symbol
String s = "abcdafghi\uD83D\uDE02cdefg";
int maxWidth = 10;
System.out.println(s);
//do not care about UTF-16 surrogate pairs
System.out.println(s.substring(0, Math.min(s.length(), maxWidth)));
//correctly process UTF-16 surrogate pairs
if(s.length()>maxWidth){
int correctedMaxWidth = (Character.isLowSurrogate(s.charAt(maxWidth)))&&maxWidth>0 ? maxWidth-1 : maxWidth;
System.out.println(s.substring(0, Math.min(s.length(), correctedMaxWidth)));
}
}
Or you can just use this method in case you don't have StringUtils on hand:
public static String abbreviateString(String input, int maxLength) {
if (input.length() <= maxLength)
return input;
else
return input.substring(0, maxLength-2) + "..";
}
s = s.length() > 10 ? s.substring(0, 10) : s;
Just in case you are looking for a way to trim and keep the LAST 10 characters of a string.
s = s.substring(Math.max(s.length(),10) - 10);
tl;dr
You seem to be asking for an ellipsis (…) character in the last place, when truncating. Here is a one-liner to manipulate your input string.
String input = "abcdefghijkl";
String output = ( input.length () > 10 ) ? input.substring ( 0 , 10 - 1 ).concat ( "…" ) : input;
See this code run live at IdeOne.com.
abcdefghi…
Ternary operator
We can make a one-liner by using the ternary operator.
String input = "abcdefghijkl" ;
String output =
( input.length() > 10 ) // If too long…
?
input
.substring( 0 , 10 - 1 ) // Take just the first part, adjusting by 1 to replace that last character with an ellipsis.
.concat( "…" ) // Add the ellipsis character.
: // Or, if not too long…
input // Just return original string.
;
See this code run live at IdeOne.com.
abcdefghi…
Java streams
The Java Streams facility makes this interesting, as of Java 9 and later. Interesting, but maybe not the best approach.
We use code points rather than char values. The char type is legacy, and is limited to a subset of all possible Unicode characters.
String input = "abcdefghijkl" ;
int limit = 10 ;
String output =
input
.codePoints()
.limit( limit )
.collect( // Collect the results of processing each code point.
StringBuilder::new, // Supplier<R> supplier
StringBuilder::appendCodePoint, // ObjIntConsumer<R> accumulator
StringBuilder::append // BiConsumer<R,R> combiner
)
.toString()
;
If we had excess characters truncated, replace the last character with an ellipsis.
if ( input.length () > limit )
{
output = output.substring ( 0 , output.length () - 1 ) + "…";
}
If only I could think of a way to put together the stream line with the "if over limit, do ellipsis" part.
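One possible way to combine the stream with the ellipsis step is to decide up front whether the input is too long and shorten the limit by one in that case. This is only a sketch; the class name StreamEllipsis is made up, and the limit is assumed to count code points and to be at least 1.

```java
// Hypothetical helper, not part of any library.
final class StreamEllipsis {
    static String ellipsize(String input, int limit) {
        if (limit < 1) {
            throw new IllegalArgumentException("limit must be at least 1");
        }
        boolean tooLong = input.codePointCount(0, input.length()) > limit;
        StringBuilder sb = input
                .codePoints()
                // Leave room for the ellipsis only when something is cut off.
                .limit(tooLong ? limit - 1 : limit)
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append);
        if (tooLong) {
            sb.append('\u2026'); // HORIZONTAL ELLIPSIS
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(ellipsize("abcdefghijkl", 10));
        System.out.println(ellipsize("abc", 10));
    }
}
```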
The question is asked on Java, but it was back in 2014.
In case you use Kotlin now, it is as simple as:
yourString.take(10)
Returns a string containing the first n characters from this string, or the entire string if this string is shorter.
Documentation
str==null ? str : str.substring(0, Math.min(str.length(), 10))
or,
str==null ? "" : str.substring(0, Math.min(str.length(), 10))
Works with null.
// this is how you shorten the length of the string with ..
// add following method to your class
private String abbreviate(String s){
if(s.length() <= 10) return s;
return s.substring(0, 8) + ".." ;
}

Unicode to string conversion in Java

I am building a toy language. The syntax \#0061 is supposed to convert the given Unicode code point to a character:
String temp = yytext().substring(2);
After that I tried to prepend '\u' to the string, but I noticed that this generated an error.
I also tried "\\" + "u" + temp; but this does not do any conversion.
I am basically trying to convert a Unicode code point to a character by supplying only '0061' to a method. Help.
Strip the '#' and use Integer.parseInt("0061", 16) to convert the hex digits to an int. Then cast to a char.
(If you had implemented the lexer by hand, an alternative would be to do the conversion on the fly as your lexer matches the Unicode literal. But on rereading the question, I see that you are using a lexer generator ... good move!)
i am basically trying to convert
unicode to a character by supplying
only '0061' to a method, help.
char fromUnicode(String codePoint) {
return (char) Integer.parseInt(codePoint, 16);
}
You need to handle bad inputs and such, but that will work otherwise.
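Handling bad input could look roughly like this. It is a sketch that returns a String instead of a char so that code points above U+FFFF also work; the class name HexCodePoint and the choice of IllegalArgumentException are assumptions made for this example.

```java
// Hypothetical helper, not part of any library.
final class HexCodePoint {
    static String fromHex(String hex) {
        int codePoint;
        try {
            codePoint = Integer.parseInt(hex, 16);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Not a hexadecimal number: " + hex, e);
        }
        // Rejects negative values and anything beyond U+10FFFF.
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a Unicode code point: " + hex);
        }
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(fromHex("0061"));
    }
}
```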
You need to convert the particular codepoint to a char. You can do that with a little help of regex:
String string = "blah #0061 blah";
Matcher matcher = Pattern.compile("\\#((?i)[0-9a-f]{4})").matcher(string);
while (matcher.find()) {
int codepoint = Integer.valueOf(matcher.group(1), 16);
string = string.replaceAll(matcher.group(0), String.valueOf((char) codepoint));
}
System.out.println(string); // blah a blah
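The same replacement can also be done in a single pass with Matcher.appendReplacement, which avoids calling replaceAll once per match. This is a sketch of that alternative, assuming the same #XXXX form with exactly four hex digits; the class name HashEscapeReplacer is made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class HashEscapeReplacer {
    private static final Pattern TOKEN = Pattern.compile("#([0-9a-fA-F]{4})");

    static String replace(String input) {
        Matcher m = TOKEN.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Four hex digits always fit into a char.
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps '$' and '\' in the result literal.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replace("blah #0061 blah"));
    }
}
```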
Edit as per the comments, if it is a single token, then just do:
String string = "0061";
char c = (char) Integer.parseInt(string, 16);
System.out.println(c); // a
\uXXXX is an escape sequence. Before execution it has already been converted into the actual character value; it is not "evaluated" in any way at runtime.
What you probably want to do is define a mapping from your #XXXX syntax to Unicode code points and cast them to char.
