Base64 Encoding in Java vs HttpServerUtility.UrlTokenEncode in C#

I'm having trouble encoding a String in Java.
I have the following C# code and the string Bpz2Gjg01d7VfGfD8ZP1UA==. When I run the C# code I get:
QnB6MkdqZzAxZDdWZkdmRDhaUDFVQT090
public static void Main(string[] args)
{
string strWord = "Bpz2Gjg01d7VfGfD8ZP1UA==";
byte[] encbuff = Encoding.UTF8.GetBytes(strWord);
string strWordEncoded = HttpServerUtility.UrlTokenEncode(encbuff);
Console.WriteLine(strWordEncoded);
}
I'm trying to replicate this code in Java. In my first attempt I used the javax.xml.bind.DatatypeConverter class:
public static void main(String[] args) {
String strWord = "Bpz2Gjg01d7VfGfD8ZP1UA==";
byte[] encbuff = strWord.getBytes(StandardCharsets.UTF_8);
String strWordEncoded = DatatypeConverter.printBase64Binary(encbuff);
System.out.println(strWordEncoded);
}
But I get the following String (missing the final zero compared to the C# output):
QnB6MkdqZzAxZDdWZkdmRDhaUDFVQT09
In my second attempt I used the BouncyCastle Base64 encoder:
public static void main(String[] args) {
String strWord = "Bpz2Gjg01d7VfGfD8ZP1UA==";
byte[] encbuff = strWord.getBytes(StandardCharsets.UTF_8);
String strWordEncoded = new String(Base64.encode(encbuff));
System.out.println(strWordEncoded);
}
But I get exactly the same String as before (still missing the final zero):
QnB6MkdqZzAxZDdWZkdmRDhaUDFVQT09
Does anyone know what may be happening?

I've had a look at the .NET Framework code. UrlTokenEncode removes any trailing = padding characters from the Base64 string and appends the number of padding characters it removed, so 0, 1, or 2. That is what causes the extra 0 at the end of your string. So be aware: HttpServerUtility.UrlTokenEncode is NOT a plain Base64 encoder. It uses Convert.ToBase64String internally for the regular encoding and then post-processes the result (see my comments on the question). If you need to produce this exact string, you will have to implement the same changes in Java on top of the regular Base64 encoding.
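As a rough illustration of the scheme described above, here is a minimal sketch built on the JDK's own java.util.Base64 (Java 8+). The class and method names are made up for this example, and returning an empty string for empty input is an assumption. The URL-safe encoder already emits - and _ in place of + and /, which matches the character replacement done in the hand-translated code further down.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class UrlTokenSketch {

    // URL-safe Base64 without '=' padding, followed by one digit that records
    // how many padding characters standard Base64 would have added.
    static String urlTokenEncode(byte[] input) {
        if (input.length == 0) {
            return ""; // assumption: empty input yields an empty token
        }
        String body = Base64.getUrlEncoder().withoutPadding().encodeToString(input);
        // Padded Base64 length is a multiple of 4, so the missing count is 0, 1, or 2.
        int padCount = (4 - body.length() % 4) % 4;
        return body + padCount;
    }

    public static void main(String[] args) {
        byte[] bytes = "Bpz2Gjg01d7VfGfD8ZP1UA==".getBytes(StandardCharsets.UTF_8);
        System.out.println(urlTokenEncode(bytes)); // QnB6MkdqZzAxZDdWZkdmRDhaUDFVQT090
    }
}
```

The input here is 24 bytes, which encodes to 32 characters with no padding, hence the trailing 0.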

I found a solution based on the comments: I looked at the source code of the method in Microsoft's Reference Source.
Then I translated the C# code to Java, and it looks like this:
public static String UrlTokenEncode(byte[] input) {
try {
if (input == null) {
return null;
}
if (input.length < 1) {
return null;
}
String base64Str = null;
int endPos = 0;
char[] base64Chars = null;
base64Str = Base64.toBase64String(input);
if (base64Str == null) {
return null;
}
for (endPos = base64Str.length(); endPos > 0; endPos--) {
if (base64Str.charAt(endPos - 1) != '=') {
break;
}
}
base64Chars = new char[endPos + 1];
base64Chars[endPos] = (char) ((int) '0' + base64Str.length() - endPos);
for (int iter = 0; iter < endPos; iter++) {
char c = base64Str.charAt(iter);
switch (c) {
case '+':
base64Chars[iter] = '-';
break;
case '/':
base64Chars[iter] = '_';
break;
case '=':
base64Chars[iter] = c;
break;
default:
base64Chars[iter] = c;
break;
}
}
return new String(base64Chars);
} catch (Exception e) {
return null;
}
}
Finally I tested the method and got the desired output:
public static void main(String[] args) {
String strWord = "Bpz2Gjg01d7VfGfD8ZP1UA==";
byte[] encbuff = strWord.getBytes(StandardCharsets.UTF_8);
String strWordEncoded = UrlTokenEncode(encbuff);
System.out.println(strWordEncoded);
}
QnB6MkdqZzAxZDdWZkdmRDhaUDFVQT090

Related

Encrypt a Paragraph of Text in Java [duplicate]

I am trying to make a game where a paragraph of text is "encrypted" using a simple substitution cipher, so for example, all A's will be F's and B's will be G's and so on.
The idea is that the user/player will need to try to guess the famous quote by trying to decrypt the letters. So the screen shows them a blank space with a letter A and they have to figure out it's really an F that goes in the place within the string.
I've not got very far, basically I can manually change each letter using a for loop, but there must be an easier way.
import java.util.Scanner;
import java.util.Random;
public class cryptogram {
public static void main(String[] args) {
char[] alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ".toCharArray();
for (char i = 0; i <alphabet.length; i++) {
if (alphabet[i] == 'B') {
alphabet[i] = 'Z';
}
}
System.out.println(alphabet);
}
}

Substitution
A "substitution" workflow might look something like...
public class Main {
public static void main(String[] args) {
new Main().substitution();
}
public void substitution() {
char[] lookup = "ABCDEFGHIJKLMNOPQRSTUVWXYZ ".toCharArray();
char[] substitution = " FGHIJKLMNOPQRSTUVWXYZABCDE".toCharArray();
String text = "This is a test";
StringBuilder builder = new StringBuilder(text.length());
for (char value : text.toCharArray()) {
int index = indexOf(value, lookup);
builder.append(substitution[index]);
}
String encypted = builder.toString();
System.out.println(text);
System.out.println(encypted);
builder = new StringBuilder(text.length());
for (char value : encypted.toCharArray()) {
int index = indexOf(value, substitution);
builder.append(lookup[index]);
}
System.out.println(builder.toString());
}
protected static int indexOf(char value, char[] array) {
char check = Character.toUpperCase(value);
for (int index = 0; index < array.length; index++) {
if (check == array[index]) {
return index;
}
}
return -1;
}
}
Which will output something like...
This is a test
XLMWEMWE EXIWX
THIS IS A TEST
Now, obviously, this only supports upper-cased characters and does not support other characters like numbers or punctuation (like ! for example). The above example will also crash if the character can't be encoded, it's just an example of an idea after all 😉
A "different" approach
Now, char is a peculiar type, as it can actually be treated as an int. This has to do with how text is encoded by computers; see an ASCII table for an example.
This means that we can do arithmetic on it (+/-). Assuming that we only want to deal with "displayable" characters, this gives us a basic range of 32-126 (you could also have the extended range from 128-255, but let's keep it simple for now).
With this in hand, we could actually do something like...
public class Main {
public static void main(String[] args) {
new Main().encode();
}
private static final int MIN_RANGE = 32;
private static final int MAX_RANGE = 127;
public void encode() {
String text = "This is a test";
String encoded = encode(text, 4);
System.out.println(text);
System.out.println(encoded);
System.out.println(encode(encoded, -4));
}
protected String encode(String value, int offset) {
StringBuilder sb = new StringBuilder(value.length());
for (char c : value.toCharArray()) {
sb.append(encode(c, offset));
}
return sb.toString();
}
protected char encode(char value, int offset) {
char newValue = (char)(value + offset);
if (newValue < MIN_RANGE) {
newValue = (char)(MAX_RANGE - (MIN_RANGE - newValue));
} else if (newValue > MAX_RANGE) {
newValue = (char)((newValue - MAX_RANGE) + MIN_RANGE);
}
return newValue;
}
}
Which outputs...
This is a test
Xlmw$mw$e$xiwx
This is a test
As you can see, decoding is just a matter of passing the encoded text back through with the offset in the opposite direction. It's also easy to change the offset if you want to change the encoding.
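The same wrap-around idea can also be written with modular arithmetic. The following is only a sketch under my own assumptions: the class and method names are hypothetical, the range is exactly the printable 32-126 mentioned above, and characters outside that range are passed through unchanged.

```java
final class ShiftSketch {
    private static final int MIN = 32;               // ' '
    private static final int MAX = 126;              // '~'
    private static final int SIZE = MAX - MIN + 1;   // 95 printable characters

    // Shifts every printable ASCII character by offset, wrapping inside [MIN, MAX].
    static String shift(String text, int offset) {
        StringBuilder sb = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            if (c < MIN || c > MAX) {
                sb.append(c); // assumption: leave anything outside the range untouched
            } else {
                // floorMod keeps the result non-negative for negative offsets too
                sb.append((char) (MIN + Math.floorMod(c - MIN + offset, SIZE)));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String encoded = shift("This is a test", 4);
        System.out.println(encoded);            // Xlmw$mw$e$xiwx
        System.out.println(shift(encoded, -4)); // This is a test
    }
}
```

Because Math.floorMod never returns a negative value, the same method handles both directions without separate underflow and overflow branches.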

Problem during development of string manipulation program

I have this problem where a string, sometimes, will repeat its end and I have to remove this repetition, returning only the main string. For example:
in: sanduichuiche out: sanduiche
in: jabutiti out: jabuti
in: sol out: sol
I'm using Java and the solution I came up with is this:
public static void main(String[] args) throws IOException {
String linha;
while ((linha = in.readLine()) != null) {
String palavra = linha;
palavra = palavra.trim()
.replaceAll("\n","")
.replaceAll("\t","");
String subString;
String subStringEncontrada = "";
int palavraLength = palavra.length();
for (int i = palavraLength-1; i >= 0; i--) {
int diff = palavraLength - i;
subString = palavra.substring(i, palavraLength);
if (i-diff < 0) { break; }
if (palavra.substring(i-diff,i).contains(subString)) {
subStringEncontrada = subString;
}
}
String resultado = palavra.substring( 0, palavraLength-subStringEncontrada.length()).trim();
System.out.println(resultado);
}
out.close();
}
For some reason, when I post it to the code challenge, it says 2 of the tests failed, and I have run out of ideas about what can be wrong.
I appreciate if someone could help me out and say what I am missing on this code.

Probably you're overcomplicating.
I wrote this version in a couple minutes, you can try and see if it works for your test cases.
private static String removeEndRepetition(final String str) {
// We need to remove a possible duplicate part of a string, placed at its end.
// This means the max duplicate length is str.length / 2
for (int i = (int) Math.ceil(str.length() / 2.0); i < str.length(); i++) {
final String possibleDuplicatePart = str.substring(i);
final String precedingPart = str.substring(i - possibleDuplicatePart.length(), i);
if (possibleDuplicatePart.equals(precedingPart)) {
return str.substring(0, i);
}
}
return str;
}
public static void main(final String[] args) {
System.out.println(removeEndRepetition("sanduicheiche"));
System.out.println(removeEndRepetition("jabutiti"));
System.out.println(removeEndRepetition("sol"));
}
Which correctly prints
sanduiche
jabuti
sol
How does it work? Here is a trace for sanduicheiche:
possibleDuplicatePart   precedingPart
heiche                  anduic
eiche                   duich
iche                    iche    MATCH!
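For comparison, the same check can be expressed with a regular-expression backreference. This is an alternative sketch, not the method from the answer above: the class name is hypothetical, and it assumes single-line input because . does not match line terminators.

```java
final class EndRepetitionSketch {

    // (.+) captures a candidate block, \1 demands the same block again,
    // and \z anchors the pair at the very end of the input.
    static String removeEndRepetition(String str) {
        return str.replaceFirst("(.+)\\1\\z", "$1");
    }

    public static void main(String[] args) {
        System.out.println(removeEndRepetition("sanduicheiche")); // sanduiche
        System.out.println(removeEndRepetition("jabutiti"));      // jabuti
        System.out.println(removeEndRepetition("sol"));           // sol
    }
}
```

The regex engine tries start positions from left to right, so the first match it finds is the longest repeated tail, the same order in which the loop above tests candidates.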

Java: Display unicode chars as chars when printing string [duplicate]

I have a string with escaped Unicode characters, \uXXXX, and I want to convert it to regular Unicode letters. For example:
"\u0048\u0065\u006C\u006C\u006F World"
should become
"Hello World"
I know that when I print the first string it already shows Hello World. My problem is that I read file names from a file and then search for them. The file names in the file are escaped with Unicode encoding, and when I search for the files I can't find them, since the search looks for a file with \uXXXX in its name.

The Apache Commons Lang StringEscapeUtils.unescapeJava() can decode it properly.
import org.apache.commons.lang.StringEscapeUtils;
@Test
public void testUnescapeJava() {
String sJava="\\u0048\\u0065\\u006C\\u006C\\u006F";
System.out.println("StringEscapeUtils.unescapeJava(sJava):\n" + StringEscapeUtils.unescapeJava(sJava));
}
output:
StringEscapeUtils.unescapeJava(sJava):
Hello

Technically doing:
String myString = "\u0048\u0065\u006C\u006C\u006F World";
automatically converts it to "Hello World", so I assume you are reading the string in from some file. In order to convert it to "Hello" you'll have to parse the text into the separate Unicode digits (take the \uXXXX and just get XXXX), then call Integer.parseInt(XXXX, 16) to get the numeric value, and then cast that to char to get the actual character.
Edit: Some code to accomplish this:
String str = myString.split(" ")[0];
str = str.replace("\\","");
String[] arr = str.split("u");
String text = "";
for(int i = 1; i < arr.length; i++){
int hexVal = Integer.parseInt(arr[i], 16);
text += (char)hexVal;
}
// Text will now have Hello

You can use StringEscapeUtils from Apache Commons Lang, i.e.:
String Title = StringEscapeUtils.unescapeJava("\\u0048\\u0065\\u006C\\u006C\\u006F");

This simple method will work for most cases, but it would trip up on something like "\u005Cu0048", which should decode to the string "\u0048" but would actually decode to "H": the first pass produces "\u0048" as the working string, which the while loop then processes again.
static final String decode(final String in)
{
String working = in;
int index;
index = working.indexOf("\\u");
while(index > -1)
{
int length = working.length();
if(index > (length-6))break;
int numStart = index + 2;
int numFinish = numStart + 4;
String substring = working.substring(numStart, numFinish);
int number = Integer.parseInt(substring,16);
String stringStart = working.substring(0, index);
String stringEnd = working.substring(numFinish);
working = stringStart + ((char)number) + stringEnd;
index = working.indexOf("\\u");
}
return working;
}
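The rescanning problem described for the method above goes away if the input is walked once and the decoded characters go into a separate buffer, so decoded output is never inspected again. A minimal sketch of that idea follows; the class name is hypothetical, and copying malformed escapes through unchanged is my own assumption about the desired behaviour.

```java
final class UnicodeUnescapeSketch {

    // Single pass: each backslash-u escape is decoded exactly once and the
    // result is appended to 'out', which is never scanned again.
    static String decode(String in) {
        StringBuilder out = new StringBuilder(in.length());
        int i = 0;
        while (i < in.length()) {
            if (in.startsWith("\\u", i) && i + 6 <= in.length() && isHex(in, i + 2, i + 6)) {
                out.append((char) Integer.parseInt(in.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(in.charAt(i)); // ordinary character or malformed escape
                i++;
            }
        }
        return out.toString();
    }

    private static boolean isHex(String s, int from, int to) {
        for (int k = from; k < to; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(decode("\\u0048\\u0065\\u006C\\u006C\\u006F World")); // Hello World
        // An escaped backslash followed by "u0048" stays as six literal characters.
        System.out.println(decode("\\u005Cu0048").length()); // 6
    }
}
```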

Shorter version:
public static String unescapeJava(String escaped) {
if(escaped.indexOf("\\u")==-1)
return escaped;
String processed="";
int position=escaped.indexOf("\\u");
while(position!=-1) {
if(position!=0)
processed+=escaped.substring(0,position);
String token=escaped.substring(position+2,position+6);
escaped=escaped.substring(position+6);
processed+=(char)Integer.parseInt(token,16);
position=escaped.indexOf("\\u");
}
processed+=escaped;
return processed;
}

StringEscapeUtils from org.apache.commons.lang3 library is deprecated as of 3.6.
So you can use their new commons-text library instead:
compile 'org.apache.commons:commons-text:1.9'
OR
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-text</artifactId>
<version>1.9</version>
</dependency>
Example code:
org.apache.commons.text.StringEscapeUtils.unescapeJava(escapedString);

With Kotlin you can write your own extension function for String
fun String.unescapeUnicode() = replace("\\\\u([0-9A-Fa-f]{4})".toRegex()) {
String(Character.toChars(it.groupValues[1].toInt(radix = 16)))
}
and then
fun main() {
val originalString = "\\u0048\\u0065\\u006C\\u006C\\u006F World"
println(originalString.unescapeUnicode())
}

It's not totally clear from your question, but I'm assuming you're saying that you have a file where each line is a filename, and each filename is something like this:
\u0048\u0065\u006C\u006C\u006F
In other words, the characters in the file of filenames are \, u, 0, 0, 4, 8 and so on.
If so, what you're seeing is expected. Java only translates \uXXXX sequences in string literals in source code (and when reading in stored Properties objects). When you read the contents of your file you will have a string consisting of the characters \, u, 0, 0, 4, 8 and so on, not the string Hello.
So you will need to parse that string to extract the 0048, 0065, etc. pieces and then convert them to chars and make a string from those chars and then pass that string to the routine that opens the file.

For Java 9+, you can use the new replaceAll method of Matcher class.
private static final Pattern UNICODE_PATTERN = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");
public static String unescapeUnicode(String unescaped) {
return UNICODE_PATTERN.matcher(unescaped).replaceAll(r -> String.valueOf((char) Integer.parseInt(r.group(1), 16)));
}
public static void main(String[] args) {
String originalMessage = "\\u0048\\u0065\\u006C\\u006C\\u006F World";
String unescapedMessage = unescapeUnicode(originalMessage);
System.out.println(unescapedMessage);
}
I believe the main advantage of this approach over unescapeJava by StringEscapeUtils (besides not using an extra library) is that you can convert only the unicode characters (if you wish), since the latter converts all escaped Java characters (like \n or \t). If you prefer to convert all escaped characters the library is really the best option.

An update on the answers that suggest Apache Commons Lang's StringEscapeUtils.unescapeJava(): it has been deprecated. From its Javadoc:
Deprecated.
as of 3.6, use commons-text StringEscapeUtils instead
The replacement is Apache Commons Text's StringEscapeUtils.unescapeJava().

Just wanted to contribute my version, using regex:
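// Accept upper- and lower-case hex digits; the sample below uses 006C.
private static final String UNICODE_REGEX = "\\\\u([0-9a-fA-F]{4})";
private static final Pattern UNICODE_PATTERN = Pattern.compile(UNICODE_REGEX);
...
// Double backslashes keep the escapes as literal text, as they would be when read from a file.
String message = "\\u0048\\u0065\\u006C\\u006C\\u006F World";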
Matcher matcher = UNICODE_PATTERN.matcher(message);
StringBuffer decodedMessage = new StringBuffer();
while (matcher.find()) {
matcher.appendReplacement(
decodedMessage, String.valueOf((char) Integer.parseInt(matcher.group(1), 16)));
}
matcher.appendTail(decodedMessage);
System.out.println(decodedMessage.toString());

I wrote a performant solution that tolerates malformed escapes:
public static final String decode(final String in) {
int p1 = in.indexOf("\\u");
if (p1 < 0)
return in;
StringBuilder sb = new StringBuilder(in.length());
sb.append(in, 0, p1); // keep the text that precedes the first escape
while (true) {
int p2 = p1 + 6;
if (p2 > in.length()) {
sb.append(in.subSequence(p1, in.length()));
break;
}
try {
int c = Integer.parseInt(in.substring(p1 + 2, p1 + 6), 16);
sb.append((char) c);
p1 += 6;
} catch (Exception e) {
sb.append(in.subSequence(p1, p1 + 2));
p1 += 2;
}
int p0 = in.indexOf("\\u", p1);
if (p0 < 0) {
sb.append(in.subSequence(p1, in.length()));
break;
} else {
sb.append(in.subSequence(p1, p0));
p1 = p0;
}
}
return sb.toString();
}

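One easy way I know is to use JSONObject (org.json) and let its parser decode the escapes:
try {
// Parse myString as a JSON string literal; a put()/getString() round trip would return it unchanged.
// Assumes myString contains no unescaped double quotes.
JSONObject json = new JSONObject("{\"string\":\"" + myString + "\"}");
String converted = json.getString("string");
} catch (JSONException e) {
e.printStackTrace();
}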

Fast
fun unicodeDecode(unicode: String): String {
val stringBuffer = StringBuilder()
var i = 0
while (i < unicode.length) {
if (i + 1 < unicode.length)
if (unicode[i].toString() + unicode[i + 1].toString() == "\\u") {
val symbol = unicode.substring(i + 2, i + 6)
val c = Integer.parseInt(symbol, 16)
stringBuffer.append(c.toChar())
i += 5
} else stringBuffer.append(unicode[i])
i++
}
return stringBuffer.toString()
}

UnicodeUnescaper from Apache Commons Text does exactly what you want, and ignores any other escape sequences.
String input = "\\u0048\\u0065\\u006C\\u006C\\u006F World";
String output = new UnicodeUnescaper().translate(input);
assert("Hello World".equals(output));
assert("\u0048\u0065\u006C\u006C\u006F World".equals(output));
Where input would be the string you are reading from a file.

Try:
private static final Charset UTF_8 = Charset.forName("UTF-8");
private String forceUtf8Coding(String input) { return new String(input.getBytes(UTF_8), UTF_8); }

Actually, I wrote an open-source library that contains some utilities. One of them converts a Unicode sequence to a String and vice versa. I found it very useful. Here is a quote from the article about this library, on the Unicode converter:
Class StringUnicodeEncoderDecoder has methods that can convert a
String (in any language) into a sequence of Unicode characters and
vice versa. For example a String "Hello World" will be converted into
"\u0048\u0065\u006c\u006c\u006f\u0020 \u0057\u006f\u0072\u006c\u0064"
and may be restored back.
The article explains what utilities the library has and how to get it; the library is available as a Maven artifact or as source on GitHub, and it is very easy to use. Article: Open Source Java library with stack trace filtering, Silent String parsing Unicode converter and Version comparison

Here is my solution...
String decodedName = JwtJson.substring(startOfName, endOfName);
StringBuilder builtName = new StringBuilder();
int i = 0;
while ( i < decodedName.length() )
{
if ( decodedName.substring(i).startsWith("\\u"))
{
i=i+2;
builtName.append(Character.toChars(Integer.parseInt(decodedName.substring(i,i+4), 16)));
i=i+4;
}
else
{
builtName.append(decodedName.charAt(i));
i = i+1;
}
};

I found that many of the answers did not address the issue of "Supplementary Characters". Here is the correct way to support it. No third-party libraries, pure Java implementation.
http://www.oracle.com/us/technologies/java/supplementary-142654.html
public static String fromUnicode(String unicode) {
String str = unicode.replace("\\", "");
String[] arr = str.split("u");
StringBuffer text = new StringBuffer();
for (int i = 1; i < arr.length; i++) {
int hexVal = Integer.parseInt(arr[i], 16);
text.append(Character.toChars(hexVal));
}
return text.toString();
}
public static String toUnicode(String text) {
StringBuffer sb = new StringBuffer();
for (int i = 0; i < text.length(); i++) {
int codePoint = text.codePointAt(i);
// Skip over the second char in a surrogate pair
if (codePoint > 0xffff) {
i++;
}
String hex = Integer.toHexString(codePoint);
sb.append("\\u");
for (int j = 0; j < 4 - hex.length(); j++) {
sb.append("0");
}
sb.append(hex);
}
return sb.toString();
}
@Test
public void toUnicode() {
System.out.println(toUnicode("😊"));
System.out.println(toUnicode("🥰"));
System.out.println(toUnicode("Hello World"));
}
// output:
// \u1f60a
// \u1f970
// \u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
@Test
public void fromUnicode() {
System.out.println(fromUnicode("\\u1f60a"));
System.out.println(fromUnicode("\\u1f970"));
System.out.println(fromUnicode("\\u0048\\u0065\\u006c\\u006c\\u006f\\u0020\\u0057\\u006f\\u0072\\u006c\\u0064"));
}
// output:
// 😊
// 🥰
// Hello World
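Note that the output above uses five hex digits after \u, which differs from the Java source-code convention, where a supplementary character is written as two four-digit escapes, one per UTF-16 surrogate. As a hedged sketch of that convention (the class and method names are hypothetical), escaping every UTF-16 code unit separately round-trips supplementary characters with a fixed-width format:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SurrogateEscapeSketch {
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    // Writes one four-digit escape per UTF-16 code unit, so a supplementary
    // character becomes two escapes (high surrogate, then low surrogate).
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length() * 6);
        for (int i = 0; i < text.length(); i++) {
            sb.append(String.format("\\u%04x", (int) text.charAt(i)));
        }
        return sb.toString();
    }

    // Decodes four-digit escapes back to chars; adjacent surrogate escapes
    // recombine into the original supplementary character.
    static String unescape(String escaped) {
        StringBuilder sb = new StringBuilder(escaped.length());
        Matcher m = ESCAPE.matcher(escaped);
        int last = 0;
        while (m.find()) {
            sb.append(escaped, last, m.start());
            sb.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        sb.append(escaped, last, escaped.length());
        return sb.toString();
    }

    public static void main(String[] args) {
        String smiley = new String(Character.toChars(0x1F60A)); // U+1F60A
        String escaped = escape(smiley);
        System.out.println(escaped.length());                  // 12: two six-character escapes
        System.out.println(unescape(escaped).equals(smiley));  // true
    }
}
```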

@NominSim
There may be other characters after an escape, so the code should split each piece by length.
private String forceUtf8Coding(String str) {
str = str.replace("\\","");
String[] arr = str.split("u");
StringBuilder text = new StringBuilder();
for(int i = 1; i < arr.length; i++){
String a = arr[i];
String b = "";
if (arr[i].length() > 4){
a = arr[i].substring(0, 4);
b = arr[i].substring(4);
}
int hexVal = Integer.parseInt(a, 16);
text.append((char) hexVal).append(b);
}
return text.toString();
}

An alternative is to use chars(), available on CharSequence since Java 8. It streams the char values of the string, and any char that maps to a surrogate code point is passed through uninterpreted. Note that this only helps when the escapes appear in a string literal, which the compiler has already decoded. It can be used like this:
String myString = "\u0048\u0065\u006C\u006C\u006F World";
myString.chars().forEach(a -> System.out.print((char)a));
// would print "Hello World"

Solution for Kotlin:
val sourceContent = File("test.txt").readText(Charset.forName("windows-1251"))
val result = String(sourceContent.toByteArray())
Kotlin uses UTF-8 everywhere as default encoding.
Method toByteArray() has default argument - Charsets.UTF_8.

How to convert a string with Unicode encoding to a string of letters

I have a string with escaped Unicode characters, \uXXXX, and I want to convert it to regular Unicode letters. For example:
"\u0048\u0065\u006C\u006C\u006F World"
should become
"Hello World"
I know that when I print the first string it already shows Hello world. My problem is I read file names from a file, and then I search for them. The files names in the file are escaped with Unicode encoding, and when I search for the files, I can't find them, since it searches for a file with \uXXXX in its name.

The Apache Commons Lang StringEscapeUtils.unescapeJava() can decode it properly.
import org.apache.commons.lang.StringEscapeUtils;
#Test
public void testUnescapeJava() {
String sJava="\\u0048\\u0065\\u006C\\u006C\\u006F";
System.out.println("StringEscapeUtils.unescapeJava(sJava):\n" + StringEscapeUtils.unescapeJava(sJava));
}
output:
StringEscapeUtils.unescapeJava(sJava):
Hello

Technically doing:
String myString = "\u0048\u0065\u006C\u006C\u006F World";
automatically converts it to "Hello World", so I assume you are reading in the string from some file. In order to convert it to "Hello" you'll have to parse the text into the separate unicode digits, (take the \uXXXX and just get XXXX) then do Integer.ParseInt(XXXX, 16) to get a hex value and then case that to char to get the actual character.
Edit: Some code to accomplish this:
String str = myString.split(" ")[0];
str = str.replace("\\","");
String[] arr = str.split("u");
String text = "";
for(int i = 1; i < arr.length; i++){
int hexVal = Integer.parseInt(arr[i], 16);
text += (char)hexVal;
}
// Text will now have Hello

You can use StringEscapeUtils from Apache Commons Lang, i.e.:
String Title = StringEscapeUtils.unescapeJava("\\u0048\\u0065\\u006C\\u006C\\u006F");

This simple method will work for most cases, but would trip up over something like "u005Cu005C" which should decode to the string "\u0048" but would actually decode "H" as the first pass produces "\u0048" as the working string which then gets processed again by the while loop.
static final String decode(final String in)
{
String working = in;
int index;
index = working.indexOf("\\u");
while(index > -1)
{
int length = working.length();
if(index > (length-6))break;
int numStart = index + 2;
int numFinish = numStart + 4;
String substring = working.substring(numStart, numFinish);
int number = Integer.parseInt(substring,16);
String stringStart = working.substring(0, index);
String stringEnd = working.substring(numFinish);
working = stringStart + ((char)number) + stringEnd;
index = working.indexOf("\\u");
}
return working;
}

Shorter version:
public static String unescapeJava(String escaped) {
if(escaped.indexOf("\\u")==-1)
return escaped;
String processed="";
int position=escaped.indexOf("\\u");
while(position!=-1) {
if(position!=0)
processed+=escaped.substring(0,position);
String token=escaped.substring(position+2,position+6);
escaped=escaped.substring(position+6);
processed+=(char)Integer.parseInt(token,16);
position=escaped.indexOf("\\u");
}
processed+=escaped;
return processed;
}

StringEscapeUtils from org.apache.commons.lang3 library is deprecated as of 3.6.
So you can use their new commons-text library instead:
compile 'org.apache.commons:commons-text:1.9'
OR
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-text</artifactId>
<version>1.9</version>
</dependency>
Example code:
org.apache.commons.text.StringEscapeUtils.unescapeJava(escapedString);

With Kotlin you can write your own extension function for String
fun String.unescapeUnicode() = replace("\\\\u([0-9A-Fa-f]{4})".toRegex()) {
String(Character.toChars(it.groupValues[1].toInt(radix = 16)))
}
and then
fun main() {
val originalString = "\\u0048\\u0065\\u006C\\u006C\\u006F World"
println(originalString.unescapeUnicode())
}

It's not totally clear from your question, but I'm assuming you saying that you have a file where each line of that file is a filename. And each filename is something like this:
\u0048\u0065\u006C\u006C\u006F
In other words, the characters in the file of filenames are \, u, 0, 0, 4, 8 and so on.
If so, what you're seeing is expected. Java only translates \uXXXX sequences in string literals in source code (and when reading in stored Properties objects). When you read the contents you file you will have a string consisting of the characters \, u, 0, 0, 4, 8 and so on and not the string Hello.
So you will need to parse that string to extract the 0048, 0065, etc. pieces and then convert them to chars and make a string from those chars and then pass that string to the routine that opens the file.

For Java 9+, you can use the new replaceAll method of Matcher class.
private static final Pattern UNICODE_PATTERN = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");
public static String unescapeUnicode(String unescaped) {
return UNICODE_PATTERN.matcher(unescaped).replaceAll(r -> String.valueOf((char) Integer.parseInt(r.group(1), 16)));
}
public static void main(String[] args) {
String originalMessage = "\\u0048\\u0065\\u006C\\u006C\\u006F World";
String unescapedMessage = unescapeUnicode(originalMessage);
System.out.println(unescapedMessage);
}
I believe the main advantage of this approach over unescapeJava by StringEscapeUtils (besides not using an extra library) is that you can convert only the unicode characters (if you wish), since the latter converts all escaped Java characters (like \n or \t). If you prefer to convert all escaped characters the library is really the best option.

Updates regarding answers suggesting using The Apache Commons Lang's:
StringEscapeUtils.unescapeJava() - it was deprecated,
Deprecated.
as of 3.6, use commons-text StringEscapeUtils instead
The replacement is Apache Commons Text's StringEscapeUtils.unescapeJava()

Just wanted to contribute my version, using regex:
private static final String UNICODE_REGEX = "\\\\u([0-9a-f]{4})";
private static final Pattern UNICODE_PATTERN = Pattern.compile(UNICODE_REGEX);
...
String message = "\u0048\u0065\u006C\u006C\u006F World";
Matcher matcher = UNICODE_PATTERN.matcher(message);
StringBuffer decodedMessage = new StringBuffer();
while (matcher.find()) {
matcher.appendReplacement(
decodedMessage, String.valueOf((char) Integer.parseInt(matcher.group(1), 16)));
}
matcher.appendTail(decodedMessage);
System.out.println(decodedMessage.toString());

I wrote a performanced and error-proof solution:
public static final String decode(final String in) {
int p1 = in.indexOf("\\u");
if (p1 < 0)
return in;
StringBuilder sb = new StringBuilder();
while (true) {
int p2 = p1 + 6;
if (p2 > in.length()) {
sb.append(in.subSequence(p1, in.length()));
break;
}
try {
int c = Integer.parseInt(in.substring(p1 + 2, p1 + 6), 16);
sb.append((char) c);
p1 += 6;
} catch (Exception e) {
sb.append(in.subSequence(p1, p1 + 2));
p1 += 2;
}
int p0 = in.indexOf("\\u", p1);
if (p0 < 0) {
sb.append(in.subSequence(p1, in.length()));
break;
} else {
sb.append(in.subSequence(p1, p0));
p1 = p0;
}
}
return sb.toString();
}

one easy way i know using JsonObject:
try {
JSONObject json = new JSONObject();
json.put("string", myString);
String converted = json.getString("string");
} catch (JSONException e) {
e.printStackTrace();
}

Fast
fun unicodeDecode(unicode: String): String {
val stringBuffer = StringBuilder()
var i = 0
while (i < unicode.length) {
if (i + 1 < unicode.length)
if (unicode[i].toString() + unicode[i + 1].toString() == "\\u") {
val symbol = unicode.substring(i + 2, i + 6)
val c = Integer.parseInt(symbol, 16)
stringBuffer.append(c.toChar())
i += 5
} else stringBuffer.append(unicode[i])
i++
}
return stringBuffer.toString()
}

UnicodeUnescaper from Apache Commons Text does exactly what you want, and ignores any other escape sequences.
String input = "\\u0048\\u0065\\u006C\\u006C\\u006F World";
String output = new UnicodeUnescaper().translate(input);
assert("Hello World".equals(output));
assert("\u0048\u0065\u006C\u006C\u006F World".equals(output));
Where input would be the string you are reading from a file.

try
private static final Charset UTF_8 = Charset.forName("UTF-8");
private String forceUtf8Coding(String input) {return new String(input.getBytes(UTF_8), UTF_8))}

Actually, I wrote an Open Source library that contains some utilities. One of them is converting a Unicode sequence to String and vise-versa. I found it very useful. Here is the quote from the article about this library about Unicode converter:
Class StringUnicodeEncoderDecoder has methods that can convert a
String (in any language) into a sequence of Unicode characters and
vise-versa. For example a String "Hello World" will be converted into
"\u0048\u0065\u006c\u006c\u006f\u0020 \u0057\u006f\u0072\u006c\u0064"
and may be restored back.
Here is the link to entire article that explains what Utilities the library has and how to get the library to use it. It is available as Maven artifact or as source from Github. It is very easy to use. Open Source Java library with stack trace filtering, Silent String parsing Unicode converter and Version comparison

Here is my solution...
String decodedName = JwtJson.substring(startOfName, endOfName);
StringBuilder builtName = new StringBuilder();
int i = 0;
while ( i < decodedName.length() )
{
if ( decodedName.substring(i).startsWith("\\u"))
{
i=i+2;
builtName.append(Character.toChars(Integer.parseInt(decodedName.substring(i,i+4), 16)));
i=i+4;
}
else
{
builtName.append(decodedName.charAt(i));
i = i+1;
}
};

I found that many of the answers did not address the issue of "Supplementary Characters". Here is the correct way to support it. No third-party libraries, pure Java implementation.
http://www.oracle.com/us/technologies/java/supplementary-142654.html
public static String fromUnicode(String unicode) {
String str = unicode.replace("\\", "");
String[] arr = str.split("u");
StringBuffer text = new StringBuffer();
for (int i = 1; i < arr.length; i++) {
int hexVal = Integer.parseInt(arr[i], 16);
text.append(Character.toChars(hexVal));
}
return text.toString();
}
public static String toUnicode(String text) {
StringBuffer sb = new StringBuffer();
for (int i = 0; i < text.length(); i++) {
int codePoint = text.codePointAt(i);
// Skip over the second char in a surrogate pair
if (codePoint > 0xffff) {
i++;
}
String hex = Integer.toHexString(codePoint);
sb.append("\\u");
for (int j = 0; j < 4 - hex.length(); j++) {
sb.append("0");
}
sb.append(hex);
}
return sb.toString();
}
@Test
public void toUnicode() {
System.out.println(toUnicode("😊"));
System.out.println(toUnicode("🥰"));
System.out.println(toUnicode("Hello World"));
}
// output:
// \u1f60a
// \u1f970
// \u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
@Test
public void fromUnicode() {
System.out.println(fromUnicode("\\u1f60a"));
System.out.println(fromUnicode("\\u1f970"));
System.out.println(fromUnicode("\\u0048\\u0065\\u006c\\u006c\\u006f\\u0020\\u0057\\u006f\\u0072\\u006c\\u0064"));
}
// output:
// 😊
// 🥰
// Hello World
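One caveat about the format above: a five-digit form such as \u1f60a is this answer's own convention. Java source and JSON write a supplementary character as a surrogate pair of two four-digit escapes instead. Below is a minimal sketch of that variant, assuming the input should be escaped one UTF-16 code unit at a time; the class and method names (UnicodeEscapes, escape, unescape) are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeEscapes {
    // A backslash, the letter u, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    /** Escapes every UTF-16 code unit, so a supplementary character yields two escapes. */
    public static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length() * 6);
        for (int i = 0; i < text.length(); i++) {
            sb.append(String.format("\\u%04x", (int) text.charAt(i)));
        }
        return sb.toString();
    }

    /** Replaces each four-digit escape with its code unit and copies all other text unchanged. */
    public static String unescape(String escaped) {
        Matcher m = ESCAPE.matcher(escaped);
        StringBuilder sb = new StringBuilder();
        int last = 0;
        while (m.find()) {
            sb.append(escaped, last, m.start());
            sb.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        sb.append(escaped, last, escaped.length());
        return sb.toString();
    }

    public static void main(String[] args) {
        // Built from the code point to avoid depending on the source file encoding.
        String smiley = new String(Character.toChars(0x1F60A));
        String escaped = escape("Hi " + smiley);
        System.out.println(escaped);
        System.out.println(unescape(escaped).equals("Hi " + smiley));
    }
}
```

Because every escape has a fixed width of four hex digits, text that follows an escape (for example the "a" in "\u0048a") is never swallowed into the number.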

@NominSim
There may be other characters after an escape, so I detect the escape by its length.
private String forceUtf8Coding(String str) {
str = str.replace("\\","");
String[] arr = str.split("u");
StringBuilder text = new StringBuilder();
for(int i = 1; i < arr.length; i++){
String a = arr[i];
String b = "";
if (arr[i].length() > 4){
a = arr[i].substring(0, 4);
b = arr[i].substring(4);
}
int hexVal = Integer.parseInt(a, 16);
text.append((char) hexVal).append(b);
}
return text.toString();
}

An alternative is chars(), which every CharSequence has had since Java 8 (String got its own override in Java 9). It iterates over the char values, and any char that maps to a surrogate code point is passed through uninterpreted. Note that the \u escapes in the literal below are already translated by the compiler, so the stream only prints the resulting characters:
String myString = "\u0048\u0065\u006C\u006C\u006F World";
myString.chars().forEach(a -> System.out.print((char)a));
// would print "Hello World"
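To make the difference between char values and code points concrete, here is a small sketch that compares chars() with codePoints(), both of which come from CharSequence. The class name CodePointDemo and its helper methods are hypothetical and only serve the illustration.

```java
import java.util.stream.Collectors;

public class CodePointDemo {
    /** Number of UTF-16 code units, which is what chars() iterates over. */
    public static long countChars(String s) {
        return s.chars().count();
    }

    /** Number of Unicode code points; a valid surrogate pair counts once. */
    public static long countCodePoints(String s) {
        return s.codePoints().count();
    }

    /** Lists the code points of s in U+XXXX notation, separated by spaces. */
    public static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // "A" followed by U+1F60A, built from the code point value.
        String s = "A" + new String(Character.toChars(0x1F60A));
        System.out.println(countChars(s));
        System.out.println(countCodePoints(s));
        System.out.println(describe(s));
    }
}
```

If the goal is to print or inspect whole characters rather than UTF-16 units, codePoints() is usually the safer choice of the two.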

Solution for Kotlin:
val sourceContent = File("test.txt").readText(Charset.forName("windows-1251"))
val result = String(sourceContent.toByteArray())
Kotlin uses UTF-8 as its default encoding everywhere.
The toByteArray() method takes Charsets.UTF_8 as its default argument.

Java: Removing comments from string

I'd like to write a function that takes a string and, if it contains inline comments, removes them. I know it sounds pretty simple, but I want to make sure I'm doing this right. For example:
private String filterString(String code) {
// lets say code = "some code //comment inside"
// return the string "some code" (without the comment)
}
I thought about two ways (feel free to advise otherwise):
Iterating over the string, finding the double slashes, and using the substring method.
The regex way (I'm not so sure about it).
Can you tell me which way is best and show me how it should be done? (Please don't suggest overly advanced solutions.)
Edit: can this be done somehow with a Scanner object? (I'm using this object anyway.)

If you want a more efficient regex that really matches all types of comments, use this one:
replaceAll("(?:/\\*(?:[^*]|(?:\\*+[^*/]))*\\*+/)|(?://.*)","");
Source: http://ostermiller.org/findcomment.html
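As a rough illustration, the replaceAll call above can be wrapped in a small runnable class. The pattern is copied unchanged from the line above; the class name CommentRegex and the method removeComments are assumptions made for this sketch.

```java
import java.util.regex.Pattern;

public class CommentRegex {
    // Same pattern as in the replaceAll call above, compiled once for reuse.
    private static final Pattern COMMENT = Pattern.compile(
            "(?:/\\*(?:[^*]|(?:\\*+[^*/]))*\\*+/)|(?://.*)");

    /** Removes block comments and line comments from the given source text. */
    public static String removeComments(String code) {
        return COMMENT.matcher(code).replaceAll("");
    }

    public static void main(String[] args) {
        String code = "int a; /* block\n comment */ int b; // line comment\nint c;";
        System.out.println(removeComments(code));
    }
}
```

Keep in mind that the pattern does not know about string literals: in a line such as String u = "http://x"; everything from the // onward is treated as a comment and removed.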
EDIT:
Another solution, if you're not sure about using regex, is to design a small automaton as follows:
public static String removeComments(String code){
final int outsideComment=0;
final int insideLineComment=1;
final int insideblockComment=2;
final int insideblockComment_noNewLineYet=3; // we want to have at least one new line in the result if the block is not inline.
int currentState=outsideComment;
String endResult="";
Scanner s= new Scanner(code);
s.useDelimiter("");
while(s.hasNext()){
String c=s.next();
switch(currentState){
case outsideComment:
if(c.equals("/") && s.hasNext()){
String c2=s.next();
if(c2.equals("/"))
currentState=insideLineComment;
else if(c2.equals("*")){
currentState=insideblockComment_noNewLineYet;
}
else
endResult+=c+c2;
}
else
endResult+=c;
break;
case insideLineComment:
if(c.equals("\n")){
currentState=outsideComment;
endResult+="\n";
}
break;
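case insideblockComment_noNewLineYet:
if(c.equals("\n")){
endResult+="\n";
currentState=insideblockComment;
}
// intentional fall-through: both block-comment states look for the closing */
case insideblockComment:
// advance c itself so that "* x /" is not mistaken for the end of the comment
while(c.equals("*") && s.hasNext()){
c=s.next();
if(c.equals("/")){
currentState=outsideComment;
break;
}
}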
}
}
s.close();
return endResult;
}

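The best way to do this is to use regular expressions:
first remove the /* */ comments, then remove all // comments. For example:
private String filterString(String code) {
// (?s) lets . match newlines; .*? stops at the first closing */
String partialFiltered = code.replaceAll("(?s)/\\*.*?\\*/", "");
// without (?s), . stops at the end of the line, so only the comment is removed
String fullFiltered = partialFiltered.replaceAll("//.*", "");
return fullFiltered;
}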

Just use the replaceAll method from the String class, combined with a simple regular expression. Here's how to do it:
import java.util.*;
import java.lang.*;
class Main
{
public static void main (String[] args) throws java.lang.Exception
{
String s = "private String filterString(String code) {\n" +
" // lets say code = \"some code //comment inside\"\n" +
" // return the string \"some code\" (without the comment)\n}";
s = s.replaceAll("//.*?\n","\n");
System.out.println("s=" + s);
}
}
The key is the line:
s = s.replaceAll("//.*?\n","\n");
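The regex //.*?\n matches everything from // up to and including the next newline, which is why the replacement puts the \n back. A comment on a final line that has no trailing newline is not matched.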
And if you want to see this code in action, go here: http://www.ideone.com/e26Ve
Hope it helps!

Using a regular-expression replacement just to find the text before a constant substring is a bit much.
You can do it using indexOf() to check for the position of the comment start and substring() to get the first part, something like:
String code = "some code // comment";
int offset = code.indexOf("//");
if (-1 != offset) {
code = code.substring(0, offset);
}
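A plain indexOf("//") also cuts the line when the two slashes appear inside a string literal, such as in a URL. The following is a minimal sketch of a per-line variant that skips double-quoted strings; the class name LineCommentStripper and the method stripLineComment are hypothetical, and char literals and block comments are deliberately not handled.

```java
public class LineCommentStripper {
    /** Returns the line without a trailing // comment, ignoring // inside double-quoted strings. */
    public static String stripLineComment(String line) {
        boolean inString = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++; // skip the escaped character, e.g. an escaped quote
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '/' && i + 1 < line.length() && line.charAt(i + 1) == '/') {
                return line.substring(0, i);
            }
        }
        return line;
    }

    public static void main(String[] args) {
        System.out.println(stripLineComment("some code // comment"));
        System.out.println(stripLineComment("String u = \"http://x\"; // url"));
    }
}
```

The text before the comment is returned as-is, including any trailing space, so call trim() on the result if that matters.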

@Christian Hujer has correctly pointed out that many or all of the posted solutions fail if the comment markers occur within a string.
@Loïc Gammaitoni suggests that his automaton approach could easily be extended to handle that case. Here is that extension.
enum State { outsideComment, insideLineComment, insideblockComment, insideblockComment_noNewLineYet, insideString };
public static String removeComments(String code) {
State state = State.outsideComment;
StringBuilder result = new StringBuilder();
Scanner s = new Scanner(code);
s.useDelimiter("");
while (s.hasNext()) {
String c = s.next();
switch (state) {
case outsideComment:
if (c.equals("/") && s.hasNext()) {
String c2 = s.next();
if (c2.equals("/"))
state = State.insideLineComment;
else if (c2.equals("*")) {
state = State.insideblockComment_noNewLineYet;
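} else {
result.append(c).append(c2);
// a quote right after the slash still opens a string literal
if (c2.equals("\"")) {
state = State.insideString;
}
}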
} else {
result.append(c);
if (c.equals("\"")) {
state = State.insideString;
}
}
break;
case insideString:
result.append(c);
if (c.equals("\"")) {
state = State.outsideComment;
} else if (c.equals("\\") && s.hasNext()) {
result.append(s.next());
}
break;
case insideLineComment:
if (c.equals("\n")) {
state = State.outsideComment;
result.append("\n");
}
break;
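case insideblockComment_noNewLineYet:
if (c.equals("\n")) {
result.append("\n");
state = State.insideblockComment;
}
// intentional fall-through: both block-comment states look for the closing */
case insideblockComment:
// advance c itself so that "* x /" is not mistaken for the end of the comment
while (c.equals("*") && s.hasNext()) {
c = s.next();
if (c.equals("/")) {
state = State.outsideComment;
break;
}
}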
}
}
s.close();
return result.toString();
}

I made an open-source library for this purpose (on GitHub) called CommentRemover. It removes single-line and multi-line Java comments.
It can either remove TODOs or leave them in place.
It also supports JavaScript, HTML, CSS, Properties, JSP, and XML comments.
Here is a short snippet showing how to use it (there are two kinds of usage):
First way: InternalPath
public static void main(String[] args) throws CommentRemoverException {
// root dir is: /Users/user/Projects/MyProject
// example for startInternalPath
CommentRemover commentRemover = new CommentRemover.CommentRemoverBuilder()
.removeJava(true) // Remove Java file Comments....
.removeJavaScript(true) // Remove JavaScript file Comments....
.removeJSP(true) // etc.. goes like that
.removeTodos(false) // Do Not Touch Todos (leave them alone)
.removeSingleLines(true) // Remove single line type comments
.removeMultiLines(true) // Remove multiple type comments
.startInternalPath("src.main.app") // Starts from {rootDir}/src/main/app , leave it empty string when you want to start from root dir
.setExcludePackages(new String[]{"src.main.java.app.pattern"}) // Refers to {rootDir}/src/main/java/app/pattern and skips this directory
.build();
CommentProcessor commentProcessor = new CommentProcessor(commentRemover);
commentProcessor.start();
}
Second way: ExternalPath
public static void main(String[] args) throws CommentRemoverException {
// example for externalPath
CommentRemover commentRemover = new CommentRemover.CommentRemoverBuilder()
.removeJava(true) // Remove Java file Comments....
.removeJavaScript(true) // Remove JavaScript file Comments....
.removeJSP(true) // etc..
.removeTodos(true) // Remove todos
.removeSingleLines(false) // Do not remove single line type comments
.removeMultiLines(true) // Remove multiple type comments
.startExternalPath("/Users/user/Projects/MyOtherProject")// Give it full path for external directories
.setExcludePackages(new String[]{"src.main.java.model"}) // Refers to /Users/user/Projects/MyOtherProject/src/main/java/model and skips this directory.
.build();
CommentProcessor commentProcessor = new CommentProcessor(commentRemover);
commentProcessor.start();
}

For a Scanner, use a delimiter.
Delimiter example:
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Scanner;
public class MainClass {
public static void main(String args[]) throws IOException {
FileWriter fout = new FileWriter("test.txt");
fout.write("2, 3.4, 5,6, 7.4, 9.1, 10.5, done");
fout.close();
FileReader fin = new FileReader("test.txt"); // same name as written above; file names are case-sensitive on many systems
Scanner src = new Scanner(fin);
// Set the delimiter to a comma followed by optional spaces:
// ", *" tells Scanner to match a comma and zero or more spaces as
// the delimiter.
src.useDelimiter(", *");
// Read and print the numbers until the first non-numeric token ("done").
while (src.hasNext()) {
if (src.hasNextDouble()) {
System.out.println(src.nextDouble());
} else {
break;
}
}
fin.close();
}
}
Use a tokenizer for a normal string.
Tokenizer example:
// start with a String of space-separated words
String tags = "pizza pepperoni food cheese";
// convert each tag to a token
StringTokenizer st = new StringTokenizer(tags," ");
while ( st.hasMoreTokens() )
{
String token = (String)st.nextToken();
System.out.println(token);
}
http://www.devdaily.com/blog/post/java/java-faq-stringtokenizer-example

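It would be better if the code handled single-line and multi-line comments separately. Any suggestions?
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
public class RemovingCommentsFromFile {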
public static void main(String[] args) throws IOException {
BufferedReader fin = new BufferedReader(new FileReader("/home/pathtofilewithcomments/File"));
BufferedWriter fout = new BufferedWriter(new FileWriter("/home/result/File1"));
boolean multilinecomment = false;
boolean singlelinecomment = false;
int len,j;
String s = null;
while ((s = fin.readLine()) != null) {
StringBuilder obj = new StringBuilder(s);
len = obj.length();
for (int i = 0; i < len; i++) {
for (j = i; j < len; j++) {
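// the j + 1 < len checks keep charAt(j + 1) from running past the end of the line
if (j + 1 < len && obj.charAt(j) == '/' && obj.charAt(j + 1) == '*') {
j += 2;
multilinecomment = true;
continue;
} else if (j + 1 < len && obj.charAt(j) == '/' && obj.charAt(j + 1) == '/') {
singlelinecomment = true;
j = len;
break;
} else if (j + 1 < len && obj.charAt(j) == '*' && obj.charAt(j + 1) == '/') {
j += 2;
multilinecomment = false;
break;
} else if (multilinecomment == true)
continue;
else
break;
}
// j can overshoot len after "j += 2; continue;", so compare with >=
if (j >= len)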
{
singlelinecomment=false;
break;
}
else
i = j;
System.out.print((char)obj.charAt(i));
fout.write((char)obj.charAt(i));
}
System.out.println();
fout.write((char)10);
}
fin.close();
fout.close();
}
}

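An easy solution that doesn't remove extra parts of the code (as some of those above do):
// works for any BufferedReader; you can also iterate over a list of strings instead
StringBuilder sb = new StringBuilder();
String s;
while ((s = reader.readLine()) != null)
{
// readLine() strips the line terminator, so drop the comment and re-add the newline
sb.append(s.replaceAll("//.*", "")).append("\n");
}
// (?s) lets . span lines; .*? stops at the first closing */
String str = sb.toString().replaceAll("(?s)/\\*.*?\\*/", " ");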
