Decode and replace hex values in a string in Java

I have a string in Java which contains hex escape sequences mixed in with normal characters. It looks something like this:
String s = "Hello\xF6\xE4\xFC\xD6\xC4\xDC\xDF"
What I want is to convert the hex values to the characters they represent, so it will look like this:
"HelloöäüÖÄÜß"
Is there a way to replace all hex values with the actual character they represent?
I can achieve what I want with this, but I need one line for every character and it does not cover unexpected characters:
indexRequest = indexRequest.replace("\\xF6", "ö");
indexRequest = indexRequest.replace("\\xE4", "ä");
indexRequest = indexRequest.replace("\\xFC", "ü");
indexRequest = indexRequest.replace("\\xD6", "Ö");
indexRequest = indexRequest.replace("\\xC4", "Ä");
indexRequest = indexRequest.replace("\\xDC", "Ü");
indexRequest = indexRequest.replace("\\xDF", "ß");

import java.util.regex.Matcher;
import java.util.regex.Pattern;
public static void main(String[] args) {
String s = "Hello\\xF6\\xE4\\xFC\\xD6\\xC4\\xDC\\xDF\\xFF ";
StringBuffer sb = new StringBuffer();
Pattern p = Pattern.compile("\\\\x([0-9A-Fa-f]{2})"); // exactly two hex digits, so following text is not swallowed
Matcher m = p.matcher(s);
while (m.find()) {
int num = Integer.parseInt(m.group(1), 16); // parse the captured hex digits to an int
char decoded = (char) num; // cast int to char
// quoteReplacement keeps '$' and '\' in the decoded text from being read as replacement syntax
m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(decoded)));
}
m.appendTail(sb);
System.out.println(sb.toString());
}

I would loop through every character to find the '\', then skip one character (the 'x') and pass the next two characters to a method that decodes them.
Then just use the code by Michael Berry
here:
Convert a String of Hex into ASCII in Java
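The loop described above could look roughly like the sketch below. It assumes every escape is exactly `\x` followed by two hex digits and that the values are ISO-8859-1 (Latin-1) bytes, so each value maps directly to the Unicode character with the same number. The names `HexEscapeDecoder` and `decode` are made up for this illustration.

```java
// Sketch only: HexEscapeDecoder and decode are hypothetical names.
class HexEscapeDecoder {

    // Replaces every literal backslash-x-HH sequence (two hex digits) with the
    // character whose code is HH; everything else is copied through unchanged.
    static String decode(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            boolean looksLikeEscape = c == '\\'
                    && i + 3 < input.length()
                    && input.charAt(i + 1) == 'x'
                    && Character.digit(input.charAt(i + 2), 16) >= 0
                    && Character.digit(input.charAt(i + 3), 16) >= 0;
            if (looksLikeEscape) {
                // Assumption: the value is a Latin-1 byte, so it equals the code point.
                int value = Integer.parseInt(input.substring(i + 2, i + 4), 16);
                out.append((char) value);
                i += 4; // skip the backslash, the 'x' and both digits
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("Hello\\xF6\\xE4\\xFC\\xD6\\xC4\\xDC\\xDF"));
    }
}
```

Malformed sequences such as `\xZZ` or a trailing `\x4` are left as they are rather than raising an exception, which may or may not be what you want.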

You can use the regex [xX][0-9a-fA-F]+ to identify all the hex codes in your string, convert them to their corresponding characters using Integer.parseInt(matcher.group().substring(1), 16), and replace them in the string. Sample code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class HexToCharacter {
public static void main(String[] args) {
String s = "HelloxF6xE4xFCxD6xC4xDCxDF";
StringBuilder sb = new StringBuilder(s);
Pattern pattern = Pattern.compile("[xX][0-9a-fA-F]+");
Matcher matcher = pattern.matcher(s);
while(matcher.find()) {
int indexOfHexCode = sb.indexOf(matcher.group());
sb.replace(indexOfHexCode, indexOfHexCode+matcher.group().length(), Character.toString((char)Integer.parseInt(matcher.group().substring(1), 16)));
}
System.out.println(sb.toString());
}
}
I have tested this regex pattern with your string. If you have other test cases in mind, you may need to adjust the regex accordingly.

Related

How to make control characters visible?

I have to display a string with visible control characters like \n, \t etc.
I have tried quotations like here, and I have also tried something like
Pattern pattern = Pattern.compile("\\p{Cntrl}");
Matcher matcher = pattern.matcher(str);
String controlChar = matcher.group();
String replace = "\\" + controlChar;
result = result.replace(controlChar, replace);
but it did not work.
Alternative: Use visible characters instead of escape sequences.
To make control characters "visible", use the characters from the Unicode Control Pictures Block, i.e. map \u0000-\u001F to \u2400-\u241F, and \u007F to \u2421.
Note that this requires output to be Unicode, e.g. UTF-8, not a single-byte code page like ISO-8859-1.
private static String showControlChars(String input) {
StringBuffer buf = new StringBuffer();
Matcher m = Pattern.compile("[\u0000-\u001F\u007F]").matcher(input);
while (m.find()) {
char c = m.group().charAt(0);
m.appendReplacement(buf, Character.toString(c == '\u007F' ? '\u2421' : (char) (c + 0x2400)));
if (c == '\n') // Let's preserve newlines
buf.append(System.lineSeparator());
}
return m.appendTail(buf).toString();
}
Output using method above as input text:
␉private static String showControlChars(String input) {␍␊
␉␉StringBuffer buf = new StringBuffer();␍␊
␉␉Matcher m = Pattern.compile("[\u0000-\u001F\u007F]").matcher(input);␍␊
␉␉while (m.find()) {␍␊
␉␉␉char c = m.group().charAt(0);␍␊
␉␉␉m.appendReplacement(buf, Character.toString(c == '\u007F' ? '\u2421' : (char) (c + 0x2400)));␍␊
␉␉␉if (c == '\n')␍␊
␉␉␉␉buf.append(System.lineSeparator());␍␊
␉␉}␍␊
␉␉return m.appendTail(buf).toString();␍␊
␉}␍␊
Simply replace occurrences of '\n' with the escaped version (i.e. '\\n'), like this:
final String result = str.replace("\n", "\\n");
For example:
public static void main(final String args[]) {
final String str = "line1\nline2";
System.out.println(str);
final String result = str.replace("\n", "\\n");
System.out.println(result);
}
Will yield the output:
line1
line2
line1\nline2
just doing
result = result.replace("\\", "\\\\");
will work!!
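For the original goal of showing \n, \t and other control characters as visible escapes, the single-character replacements above can be generalized. The sketch below is one possible way to do it, not a library API: `ControlCharEscaper` and `escape` are hypothetical names, and the choice to also double existing backslashes is an assumption made so the output stays unambiguous.

```java
// Sketch only: ControlCharEscaper and escape are hypothetical names.
class ControlCharEscaper {

    // Rewrites each control character as a visible escape sequence.
    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '\n': out.append("\\n"); break;
                case '\t': out.append("\\t"); break;
                case '\r': out.append("\\r"); break;
                case '\b': out.append("\\b"); break;
                case '\f': out.append("\\f"); break;
                case '\\': out.append("\\\\"); break; // double existing backslashes
                default:
                    if (Character.isISOControl(c)) {
                        // No short form exists: emit backslash, 'u' and four hex digits.
                        out.append(String.format("\\u%04X", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("line1\nline2\tend"));
    }
}
```

Walking the string once avoids the ordering problems that chained `replace` calls can run into, for example escaping a backslash that an earlier call has just inserted.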

Replacing \\u by \u in java string

I have a string which contains normal text and Unicode in between, for example "abc\ue415abc".
I want to replace all occurrences of \\u with \u. How can I achieve this?
I used the following code but it's not working properly.
String s = "aaa\\u2022bbb\\u2014ccc";
StringBuffer buf = new StringBuffer();
Matcher m = Pattern.compile("\\\\u([0-9A-Fa-f]{4})").matcher(s);
while (m.find()) {
try {
int cp = Integer.parseInt(m.group(1), 16);
m.appendReplacement(buf, "");
buf.appendCodePoint(cp);
} catch (NumberFormatException e) {
}
}
m.appendTail(buf);
s = buf.toString();
Please help. Thanks in advance.
From API reference: http://developer.android.com/reference/java/lang/String.html#replace(java.lang.CharSequence, java.lang.CharSequence)
You can use
public String replace (CharSequence target, CharSequence replacement)
string = string.replace("\\\\u", "\\u");
or
String replacedString = string.replace("\\\\u", "\\u");
Your initial string doesn't, in fact, have any double backslashes.
String s = "aaa\\u2022bbb\\u2014ccc";
yields a string that contains aaa\u2022bbb\u2014ccc, as \\ is just java string-literal escaping for \.
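A quick way to check this is to count the backslashes that exist at run time. The snippet below is only an illustrative sketch; `BackslashLiteralDemo` and `countBackslashes` are names invented for it.

```java
// Sketch only: BackslashLiteralDemo and countBackslashes are hypothetical names.
class BackslashLiteralDemo {

    // Counts the backslash characters actually present in the runtime string.
    static int countBackslashes(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '\\') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String single = "aaa\\u2022bbb";     // two backslashes in source, one at run time
        String doubled = "aaa\\\\u2022bbb";  // four backslashes in source, two at run time
        System.out.println(single.length() + " chars, " + countBackslashes(single) + " backslash(es)");
        System.out.println(doubled.length() + " chars, " + countBackslashes(doubled) + " backslash(es)");
    }
}
```

The first string has 12 characters and one backslash, the second has 13 characters and two.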
If you want unicode characters: (StackOverflow21028089.java)
import java.util.regex.*;
class StackOverflow21028089 {
public static void main(String[] args) {
String s = "aaa\\u2022bbb\\u2014ccc";
StringBuffer buf = new StringBuffer();
Matcher m = Pattern.compile("\\\\u([0-9A-Fa-f]{4})").matcher(s);
while (m.find()) {
try {
// see example:
// http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#appendReplacement%28java.lang.StringBuffer,%20java.lang.String%29
int cp = Integer.parseInt(m.group(1), 16);
char[] chars = Character.toChars(cp);
String rep = new String(chars);
System.err.printf("Found %d which means '%s'\n", cp, rep);
m.appendReplacement(buf, rep);
} catch (NumberFormatException e) {
System.err.println("Confused: " + e);
}
}
m.appendTail(buf);
s = buf.toString();
System.out.println(s);
}
}
=>
Found 8226 which means '•'
Found 8212 which means '—'
aaa•bbb—ccc
If you want aaa\u2022bbb\u2014ccc, that's what you started with. If you meant to start with a string literal with aaa\\u2022bbb\\u2014ccc, that's this:
String s = "aaa\\\\u2022bbb\\\\u2014ccc";
and converting it to the one with single slashes can be as simple as @Overv's code:
s = s.replaceAll("\\\\u", "\\u");
though since backslash has a special meaning in regex patterns and replacements (see Matcher's docs), in addition to Java's own string-literal escaping, this should probably be:
s = s.replaceAll("\\\\\\\\u", "\\\\u");
=>
aaa\u2022bbb\u2014ccc
Try this:
s = s.replace("\\\\u", "\\u");
There is a contains method and a replace method in String. Note that replace returns a new string, so the result has to be assigned back:
String hello = "hgjgu\\udfgyud\\\\ushddsjn\\hsdfds\\\\ubjn";
if (hello.contains("\\\\u"))
hello = hello.replace("\\\\u", "\\u");
System.out.println(hello);
It will print: hgjgu\udfgyud\ushddsjn\hsdfds\ubjn

Jmeter - regex in beanshell (matcher()/pattern() ) is cutting national characters

I need to extract some words from the server response data.
Using the Regular Expression Extractor I get
<span class="snippet_word">Działalność</span> <span class="snippet_word">lecznicza</span>.</a>
From that I need just: "Działalność lecznicza"
So I wrote a program in Beanshell which should do that, but there is a problem because I get
"lecznicza lecznicza"
Here is my program:
import java.util.regex;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
String pattern = "\\w+(?=\\<)";
String co = vars.get("tresc");
int len = Integer.parseInt(vars.get("length"));
String phrase="";
StringBuffer sb = new StringBuffer();
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(co);
for(i=0; i < len ;i++){
if (m.find()){
strbuf = new StringBuffer(m.group(0));
}
else {
phrase="notfound";
}
sb.append(" ");
sb.append(strbuf);
}
phrase = sb.toString();
return phrase;
tresc is the source text from which I extract the matching words.
length tells me how many words I'm extracting.
The program works fine for phrases without national characters. That's why I think there is some problem with encoding, or somewhere here:
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(co);
but I don't know how to change my code.
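By default, \w in Java matches only ASCII word characters ([a-zA-Z_0-9]), so it stops at letters such as ł, ś and ć. To match Unicode letters in a regex, you can use \p{L}: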
String pattern = "\\p{L}+(?=\\<)";
Although for this type of work I would recommend using an XML parser, as regular expressions are unsuitable for parsing HTML/XML, as described in this post.
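To see the \p{L} pattern working outside JMeter, here is a small plain-Java sketch. It is an illustration under assumptions, not the asker's script: `SnippetWordExtractor` and `extract` are invented names, and the Polish letters are written as Unicode escapes only so the source compiles regardless of file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: SnippetWordExtractor and extract are hypothetical names.
class SnippetWordExtractor {

    // \p{L}+ matches a run of Unicode letters; the lookahead requires '<' right after it.
    private static final Pattern WORD_BEFORE_TAG = Pattern.compile("\\p{L}+(?=<)");

    // Collects every word that is immediately followed by a tag and joins them with spaces.
    static String extract(String html) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD_BEFORE_TAG.matcher(html);
        while (m.find()) {
            words.add(m.group());
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        // The first word is the Polish "Dzialalnosc" with its diacritics, written as escapes.
        String html = "<span class=\"snippet_word\">Dzia\u0142alno\u015B\u0107</span> "
                + "<span class=\"snippet_word\">lecznicza</span>.</a>";
        System.out.println(extract(html));
    }
}
```

Looping with `while (m.find())` also removes the need for a separate word count, and it avoids re-appending the previous match when `find()` fails. If digits and underscores should still count as word characters, compiling the original `\w` pattern with `Pattern.UNICODE_CHARACTER_CLASS` (Java 7 and later) is another option.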

How to replace characters in a java String?

I would like to replace a certain set of characters in a string with corresponding replacement characters in an efficient way.
For example:
String sourceCharacters = "šđćčŠĐĆČžŽ";
String targetCharacters = "sdccSDCCzZ";
String result = replaceChars("Gračišće", sourceCharacters , targetCharacters );
Assert.equals(result,"Gracisce") == true;
Is there a more efficient way than to use the replaceAll method of the String class?
My first idea was:
final String s = "Gračišće";
String sourceCharacters = "šđćčŠĐĆČžŽ";
String targetCharacters = "sdccSDCCzZ";
// preparation
final char[] sourceString = s.toCharArray();
final char result[] = new char[sourceString.length];
final char[] targetCharactersArray = targetCharacters.toCharArray();
// main work
for(int i=0,l=sourceString.length;i<l;++i)
{
final int pos = sourceCharacters.indexOf(sourceString[i]);
result[i] = pos!=-1 ? targetCharactersArray[pos] : sourceString[i];
}
// result
String resultString = new String(result);
Any ideas?
Btw, the UTF-8 characters are causing the trouble, with US_ASCII it works fine.
You can make use of java.text.Normalizer and a bit of regex to get rid of the diacritics, of which there are many more than you have collected so far.
Here's an SSCCE; copy, paste and run it on Java 6:
package com.stackoverflow.q2653739;
import java.text.Normalizer;
import java.text.Normalizer.Form;
public class Test {
public static void main(String... args) {
System.out.println(removeDiacriticalMarks("Gračišće"));
}
public static String removeDiacriticalMarks(String string) {
return Normalizer.normalize(string, Form.NFD)
.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}
}
This should yield
Gracisce
At least, it does here at Eclipse with console character encoding set to UTF-8 (Window > Preferences > General > Workspace > Text File Encoding). Ensure that the same is set in your environment as well.
As an alternative, maintain a Map<Character, Character>:
Map<Character, Character> charReplacementMap = new HashMap<Character, Character>();
charReplacementMap.put('š', 's');
charReplacementMap.put('đ', 'd');
// Put more here.
String originalString = "Gračišće";
StringBuilder builder = new StringBuilder();
for (char currentChar : originalString.toCharArray()) {
Character replacementChar = charReplacementMap.get(currentChar);
builder.append(replacementChar != null ? replacementChar : currentChar);
}
String newString = builder.toString();
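The two approaches above can also be combined. Some letters in the question's list, such as đ and Đ, have no canonical decomposition, so NFD normalization leaves them unchanged and a small lookup table is still needed for them. The following is a sketch under that assumption; `DiacriticStripper`, `strip` and `FALLBACK` are hypothetical names, and the characters are written as Unicode escapes so the source does not depend on file encoding.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

// Sketch only: DiacriticStripper, strip and FALLBACK are hypothetical names.
class DiacriticStripper {

    // Letters that NFD does not split into base letter plus combining mark.
    private static final Map<Character, String> FALLBACK = new HashMap<>();
    static {
        FALLBACK.put('\u0111', "d"); // small d with stroke
        FALLBACK.put('\u0110', "D"); // capital D with stroke
    }

    static String strip(String input) {
        // Step 1: decompose and drop the combining marks, as in the Normalizer answer.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        // Step 2: map the remaining special letters, as in the Map answer.
        StringBuilder out = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            String replacement = FALLBACK.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "Gracisce" with its diacritics, then a name containing both stroke letters.
        System.out.println(strip("Gra\u010Di\u0161\u0107e"));
        System.out.println(strip("\u0110or\u0111e"));
    }
}
```

The table only needs entries for letters the normalizer cannot handle, so it stays much shorter than a full character-to-character map.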
I'd use the replace method in a simple loop.
String sourceCharacters = "šđćčŠĐĆČžŽ";
String targetCharacters = "sdccSDCCzZ";
String s = "Gračišće";
for (int i=0 ; i<sourceCharacters.length() ; i++)
s = s.replace(sourceCharacters.charAt(i), targetCharacters.charAt(i));
System.out.println(s);

Conversion from javascript-escaped Unicode to Java Unicode

I have a query string passed in through an HTTP request that has this character in it:
%u54E6
And I'd like to generate a string that contains the actual Chinese character so I can use it in a different part of the application, I've tried using this code:
String foo = "%u54E6";
String ufoo = new String(foo.replaceAll("%u([a-zA-Z0-9]{4})", "\\" + "u$1"));
System.out.println("ufoo: " + ufoo);
Unfortunately, all I'm getting is 'u54E6' printed to the console for the value, instead of the Chinese character.
Is there an easy way to convert the original string to a Unicode character in Java?
You're trying to use \u escapes at run time. These are compile-time only. Instead, you should be able to do something like:
String foo = "%u54E6";
Pattern p = Pattern.compile("%u([0-9a-fA-F]{4})"); // hex digits only, so parseInt cannot fail
Matcher m = p.matcher(foo);
StringBuffer sb = new StringBuffer();
while (m.find()) {
m.appendReplacement(sb,
String.valueOf((char) Integer.parseInt(m.group(1), 16)));
}
m.appendTail(sb);
System.out.println(sb.toString());
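On Java 9 or later, the same idea can be written more compactly with `Matcher.replaceAll(Function)`. This is a sketch assuming that Java version is available; `PercentUDecoder` and `decode` are hypothetical names, and only the `%uXXXX` form is handled, not ordinary `%XX` escapes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: PercentUDecoder and decode are hypothetical names.
class PercentUDecoder {

    private static final Pattern PERCENT_U = Pattern.compile("%u([0-9a-fA-F]{4})");

    // Requires Java 9+ for Matcher.replaceAll(Function).
    static String decode(String input) {
        return PERCENT_U.matcher(input).replaceAll(match -> {
            // Each escape is one UTF-16 code unit; surrogate pairs arrive as two escapes.
            char unit = (char) Integer.parseInt(match.group(1), 16);
            // quoteReplacement keeps '$' and '\' from being read as replacement syntax.
            return Matcher.quoteReplacement(String.valueOf(unit));
        });
    }

    public static void main(String[] args) {
        System.out.println(decode("%u54E6"));
    }
}
```

Sequences that do not match the pattern, such as `%uZZZZ`, are passed through unchanged.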
