How to match "escape" non-printable character in a regex?

I've found a howto, http://answers.oreilly.com/topic/214-how-to-match-nonprintable-characters-with-a-regular-expression/ , but none of the codes (\e, \x1b, \x1B) work for me in Java.
EDIT
I am trying to replace the ANSI escape sequences (specifically, color sequences) of a Linux terminal command's output.
In Python the replace pattern would look like "\x1b[34;01m", which means blue bold text. This same pattern does not work in Java. I tried to replace "[34;01m" separately, and it worked, so the problem is \x1b.
And I am doing the "[" escaping using Pattern.quote().
EDIT
Map<String,String> escapeMap = new HashMap<String,String>();
escapeMap.put("\\x1b[01;34m", "</span><span style=\"color:blue;font-weight:bold\">");
FileInputStream stream = new FileInputStream(new File("/home/ch00k/gun.output"));
FileChannel fc = stream.getChannel();
MappedByteBuffer bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
String message = Charset.defaultCharset().decode(bb).toString();
stream.close();
String patternString = Pattern.quote(StringUtils.join(escapeMap.keySet(), "|"));
System.out.println(patternString);
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(message);
StringBuffer sb = new StringBuffer();
while(matcher.find()) {
matcher.appendReplacement(sb, escapeMap.get(matcher.group()));
}
matcher.appendTail(sb);
String formattedMessage = sb.toString();
System.out.println(formattedMessage);
EDIT
Here is the code I've ended up with:
import java.io.*;
import java.nio.*;
import java.nio.channels.*;
import java.nio.charset.*;
import java.util.*;
import java.util.regex.*;
import org.apache.commons.lang3.*;
class CreateMessage {
public static void message() throws IOException {
FileInputStream stream = new FileInputStream(new File("./gun.output"));
FileChannel fc = stream.getChannel();
MappedByteBuffer bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
String message = Charset.defaultCharset().decode(bb).toString();
stream.close();
Map<String,String> tokens = new HashMap<String,String>();
tokens.put("root", "nobody");
tokens.put(Pattern.quote("[01;34m"), "qwe");
String patternString = "(" + StringUtils.join(tokens.keySet(), "|") + ")";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(message);
StringBuffer sb = new StringBuffer();
while(matcher.find()) {
System.out.println(tokens.get(matcher.group()));
matcher.appendReplacement(sb, tokens.get(matcher.group()));
}
matcher.appendTail(sb);
System.out.println(sb.toString());
}
}
The file gun.output contains the output of ls -la --color=always /
Now, the problem is that I get a NullPointerException when I try to match Pattern.quote("[01;34m"). Everything matches fine except the strings that contain [, even though I quote them. The exception is the following:
Exception in thread "main" java.lang.NullPointerException
at java.util.regex.Matcher.appendReplacement(Matcher.java:699)
at org.minuteware.jgun.CreateMessage.message(CreateMessage.java:32)
at org.minuteware.jgun.Main.main(Main.java:23)
EDIT
So, according to http://java.sun.com/developer/technicalArticles/releases/1.4regex/, the escape character should be matched with "\u001B", which indeed works in my case. The problem is, if I use tokens.put("\u001B" + Pattern.quote("[01;34m"), "qwe");, I still get the above mentioned NPE.

quote() is to make a pattern that will match the input string verbatim. Your string has pattern language in it. Look at the output from quote() - you'll see that it's trying to literally find the four characters \x1b.
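A minimal sketch of one way to combine quote() with an alternation, assuming the goal is the map-driven replacement from the question. The class name AnsiToHtml and the two map entries are made up for illustration. Each key is quoted on its own and the map keeps the raw, unquoted sequence as its key, so tokens.get(matcher.group()) no longer returns null. The null lookup is the likely source of the NullPointerException in appendReplacement, since matcher.group() returns the matched text and not the \Q...\E form that was stored as the key.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class AnsiToHtml {
    // Hypothetical mapping: raw (unquoted) escape sequence -> replacement markup.
    static final Map<String, String> TOKENS = new LinkedHashMap<>();
    static {
        TOKENS.put("\u001B[01;34m", "<span style=\"color:blue;font-weight:bold\">");
        TOKENS.put("\u001B[0m", "</span>");
    }

    static String convert(String message) {
        // Quote every key separately, then join with '|' so the alternation
        // operator itself is not quoted away.
        String alternation = TOKENS.keySet().stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        Matcher matcher = Pattern.compile(alternation).matcher(message);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            // matcher.group() is the raw matched text, which is exactly the map key.
            String replacement = TOKENS.get(matcher.group());
            // quoteReplacement keeps '$' and '\' in the markup from being interpreted.
            matcher.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("\u001B[01;34mbin\u001B[0m"));
    }
}
```

Keeping the quoting out of the map keys means the same map can be used for both building the pattern and looking up the replacement.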

The ANSI escape sequences have the following form: \033[34;01m
where \033 is the ASCII escape character: 033 in octal, 1B in hex, or 27 in decimal. You need to use the following regexp:
Pattern p = Pattern.compile("\033\\[34;01m");
In a Java string literal you can write this non-printable character as an octal escape (\033) or a Unicode escape (\u001B). Java string literals have no \x escape, so the hex form only works at the regex level, written as "\\x1B".

The proper value for "escape" character in a regexp is \u001B
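As a quick check of the different spellings, the sketch below tests several regex strings against a string holding only the escape character. The class and method names (EscapeForms, matchesEsc) are invented for this example. The forms \u001B, \x1B, \033 and \e are all documented by java.util.regex.Pattern, and the last line shows why quoting the text \x1b does not match.

```java
import java.util.regex.Pattern;

final class EscapeForms {
    // The ESC character (decimal 27), written as a Unicode escape.
    static final String ESC = "\u001B";

    // True if the given regex matches a string consisting of ESC only.
    static boolean matchesEsc(String regex) {
        return Pattern.compile(regex).matcher(ESC).matches();
    }

    public static void main(String[] args) {
        // Regex-level escapes: the backslash is doubled in the Java literal.
        System.out.println(matchesEsc("\\u001B"));
        System.out.println(matchesEsc("\\x1B"));
        System.out.println(matchesEsc("\\033"));
        System.out.println(matchesEsc("\\e"));
        // String-level escape: the literal already contains the ESC character.
        System.out.println(matchesEsc("\033"));
        // Quoting the text \x1b searches for those four characters, not for ESC.
        System.out.println(matchesEsc(Pattern.quote("\\x1b")));
    }
}
```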

FWIW, I've been working on stripping ANSI color codes from colorized log4j files and this little pattern seems to do the trick for all of the cases I've come across:
Pattern.compile("(\\u001B\\[\\d+;\\d+m)+")
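The pattern above expects exactly two numeric parameters. A slightly more permissive sketch, assuming only color/style (SGR) sequences of the form ESC [ digits-and-semicolons m need to be removed, could look like this. The class name AnsiStripper is made up, and other escape sequences such as cursor movement are deliberately not handled.

```java
import java.util.regex.Pattern;

final class AnsiStripper {
    // ESC, '[', any number of digits and semicolons, then 'm'.
    // Assumption: only SGR (color/style) sequences occur in the input.
    private static final Pattern SGR = Pattern.compile("\u001B\\[[0-9;]*m");

    static String strip(String text) {
        return SGR.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("\u001B[01;34mbin\u001B[0m \u001B[31mx\u001B[m"));
    }
}
```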

Related

Decode and replace hex values in a string in Java

I have a string in Java which contains hex values among normal characters. It looks something like this:
String s = "Hello\xF6\xE4\xFC\xD6\xC4\xDC\xDF"
What I want is to convert the hex values to the characters they represent, so it will look like this:
"HelloöäüÖÄÜß"
Is there a way to replace all hex values with the actual character they represent?
I can achieve what I want with this, but I have to write one line for every character and it does not cover unexpected characters:
indexRequest = indexRequest.replace("\\xF6", "ö");
indexRequest = indexRequest.replace("\\xE4", "ä");
indexRequest = indexRequest.replace("\\xFC", "ü");
indexRequest = indexRequest.replace("\\xD6", "Ö");
indexRequest = indexRequest.replace("\\xC4", "Ä");
indexRequest = indexRequest.replace("\\xDC", "Ü");
indexRequest = indexRequest.replace("\\xDF", "ß");

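import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class HexDecoder {
    public static void main(String[] args) {
        String s = "Hello\\xF6\\xE4\\xFC\\xD6\\xC4\\xDC\\xDF\\xFF ";
        StringBuffer sb = new StringBuffer();
        // A literal backslash, 'x', then exactly two hex digits (either case)
        Pattern p = Pattern.compile("\\\\x([0-9A-Fa-f]{2})");
        Matcher m = p.matcher(s);
        while (m.find()) {
            int num = Integer.parseInt(m.group(1), 16); // parse the hex digits to an int
            char ch = (char) num; // cast the int to the character with that code
            // quoteReplacement guards against '$' and '\' in the decoded character
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(ch)));
        }
        m.appendTail(sb);
        System.out.println(sb.toString());
    }
}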

I would loop through every character to find the '\', then skip one char and call a method with the next two chars.
Then just use the code by Michael Berry
here:
Convert a String of Hex into ASCII in Java

You can use the regex [xX][0-9a-fA-F]+ to identify all the hex codes in your string, convert them to their corresponding characters using Integer.parseInt(matcher.group().substring(1), 16), and replace them in the string. Below is sample code for it:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class HexToCharacter {
public static void main(String[] args) {
String s = "HelloxF6xE4xFCxD6xC4xDCxDF";
StringBuilder sb = new StringBuilder(s);
Pattern pattern = Pattern.compile("[xX][0-9a-fA-F]+");
Matcher matcher = pattern.matcher(s);
while(matcher.find()) {
int indexOfHexCode = sb.indexOf(matcher.group());
sb.replace(indexOfHexCode, indexOfHexCode+matcher.group().length(), Character.toString((char)Integer.parseInt(matcher.group().substring(1), 16)));
}
System.out.println(sb.toString());
}
}
I have tested this regex pattern using your string. If you have other test cases in mind, you might need to change the regex accordingly.

Reading file to String in Java results in invisible characters

I'm having trouble around reading from a text file into a String in Java. I have a text file (created in Eclipse, if that matters) that contains a short amount of text -- approximately 98 characters. Reading that file to a String via several methods results in a String that is quite a bit longer -- 1621 characters. All but the relevant 98 are invisible in the debugger/console.
I've tried the following methods to load the String:
apache commons-io:
FileUtils.readFileToString(new File(path));
FileUtils.readFileToString(new File(path), "UTF-8");
byte[] b = FileUtils.readFileToByteArray(new File(path));
new String(b, "UTF-8");
byte[] b = FileUtils.readFileToByteArray(new File(path));
Charset.defaultCharset().decode(ByteBuffer.wrap(b)).toString();
NIO:
new String(Files.readAllBytes(path));
And so on.
Is there a method to strip away these control chars? Is there a way to read files to strings that doesn't have this issue?
As noted in the comments below, this behavior is due to a corrupted(?) file generated by Eclipse. I'd still be interested in hearing any strategies for trimming away control characters from Strings, though!

If you want to strip out all non-printable characters, try this
str = str.replaceAll("[^\\p{Graph}\n\r\t ]", "");
The regex matches all "invisible" characters, except ones we want to keep; in this case newline chars, tabs and spaces.
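\p{Graph} is a POSIX character class for all printable/visible characters. To negate a POSIX character class, we can use a capital P, i.e. \P{Graph} (all non-printable/invisible characters); however, we must not exclude newlines etc., so we need [^\\p{Graph}\n\r\t ] instead. Note that in Java \p{Graph} covers only US-ASCII by default, so this also strips non-ASCII letters such as ö unless the pattern is compiled with Pattern.UNICODE_CHARACTER_CLASS (or prefixed with (?U)).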
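If only control characters should go and all other text, including non-ASCII letters, should stay, a narrower sketch is to intersect the Unicode control category with "everything except the whitespace we want to keep". The class name ControlStripper is made up. \p{Cc} and the && intersection are standard java.util.regex syntax, but whether this covers every "invisible" character in a given file is an assumption, since format characters such as a byte order mark are not in \p{Cc}.

```java
final class ControlStripper {
    // \p{Cc} is the Unicode "control" category; the intersection (&&) removes
    // newline, carriage return and tab from the set of characters to delete.
    private static final String CONTROL_EXCEPT_WHITESPACE = "[\\p{Cc}&&[^\\n\\r\\t]]";

    static String strip(String text) {
        return text.replaceAll(CONTROL_EXCEPT_WHITESPACE, "");
    }

    public static void main(String[] args) {
        // \u0001 and \u001B are control characters; \u00F6 is the letter o-umlaut.
        String cleaned = strip("a\u0001b\u001Bc\td\n\u00F6");
        System.out.println(cleaned.length());
    }
}
```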

Read it line by line into a StringBuilder, and then convert it to a String:
StringBuilder sb = new StringBuilder();
BufferedReader file = new BufferedReader(new FileReader(fileName));
while (true)
{
String line = file.readLine();
if (line == null)
break;
sb.append(line+"\n");
}
file.close();
return sb.toString();

Jmeter - regex in beanshell (matcher()/pattern() ) is cutting national characters

I need to cut some words from the server response data.
Using the Regular Expression Extractor I get
<span class="snippet_word">Działalność</span> <span class="snippet_word">lecznicza</span>.</a>
From that I need just: "Działalność lecznicza"
So I wrote a program in Beanshell which should do that, but there is a problem because I get
"lecznicza lecznicza"
Here is my program:
import java.util.regex;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
String pattern = "\\w+(?=\\<)";
String co = vars.get("tresc");
int len = Integer.parseInt(vars.get("length"));
String phrase="";
StringBuffer sb = new StringBuffer();
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(co);
for(i=0; i < len ;i++){
if (m.find()){
strbuf = new StringBuffer(m.group(0));
}
else {
phrase="notfound";
}
sb.append(" ");
sb.append(strbuf);
}
phrase = sb.toString();
return phrase;
tresc is the source from which I extract the words.
length tells me how many words I'm extracting.
The program works fine for phrases without national characters. That's why I think there is some problem with encoding, or somewhere here:
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(co);
but I don't know how to change my code.

By default, \w matches only ASCII word characters ([a-zA-Z_0-9]), so it stops at letters such as ł or ś. To match Unicode letters in a regex, you can use \p{L}:
String pattern = "\\p{L}+(?=\\<)";
Although for this type of work I would recommend using an XML parser as regular expressions are completely unsuitable for parsing HTML/XML as described in this post
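For illustration, here is a small plain-Java sketch of the \p{L} pattern applied to the snippet from the question. The class and method names (WordExtractor, extractWords) are invented, and it collects every match in a list, which differs from the counter-driven loop in the Beanshell script. The Polish letters in the sample are written as Unicode escapes only so the source compiles regardless of file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordExtractor {
    // One or more Unicode letters that are immediately followed by '<'.
    private static final Pattern WORD_BEFORE_TAG = Pattern.compile("\\p{L}+(?=<)");

    static String extractWords(String html) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD_BEFORE_TAG.matcher(html);
        while (m.find()) {
            words.add(m.group());
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        // "Dzia\u0142alno\u015B\u0107" is the first word from the question.
        String html = "<span class=\"snippet_word\">Dzia\u0142alno\u015B\u0107</span> "
                + "<span class=\"snippet_word\">lecznicza</span>.</a>";
        System.out.println(extractWords(html));
    }
}
```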

How to convert into cyrillic

Good day.
I get a string like this from the server:
\u041a\u0438\u0441\u0435\u043b\u0435\u0432 \u0410\u043d\u0434\u0440\u0435\u0439
I need to convert it into a Cyrillic cp-1251 string.
How do I do it? Thank you.

If that is a literal sequence of characters that must be decoded, you'll need to start with something like this (assuming your input is in the string input):
StringBuffer decodedInput = new StringBuffer();
Matcher match = Pattern.compile("\\\\u([0-9a-fA-F]{4})| ").matcher(input);
while (match.find()) {
String character = match.group(1);
if (character == null)
decodedInput.append(match.group());
else
decodedInput.append((char)Integer.parseInt(character, 16));
}
At this point, you should have java string representation of your input in decodedInput.
If your JVM supports the Cp1251 charset (its canonical Java name is windows-1251), you can then encode the string with something like this:
Charset cp1251charset = Charset.forName("windows-1251");
ByteBuffer output = cp1251charset.encode(decodedInput.toString());
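Putting the two steps together, a hedged sketch might look like the following. The class and method names (UnicodeEscapeDecoder, decode, toCp1251) are made up, and it assumes that the charset is available under the name windows-1251 on the running JVM. Unlike the loop above, it uses appendReplacement/appendTail, so characters that are neither \uXXXX escapes nor spaces are copied through unchanged.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeEscapeDecoder {
    // A literal backslash, 'u', then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Turns the literal text \u041a into the character it names.
    static String decode(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            char ch = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(ch)));
        }
        m.appendTail(sb); // keeps everything that was not an escape
        return sb.toString();
    }

    // Assumption: the JVM provides a charset named windows-1251.
    static byte[] toCp1251(String text) {
        return text.getBytes(Charset.forName("windows-1251"));
    }

    public static void main(String[] args) {
        String decoded = decode("\\u041a\\u0438\\u0441 \\u0410");
        System.out.println(toCp1251(decoded).length + " bytes");
    }
}
```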

Conversion from javascript-escaped Unicode to Java Unicode

I have a query string passed in through an HTTP request that has this character in it:
%u54E6
And I'd like to generate a string that contains the actual Chinese character so I can use it in a different part of the application, I've tried using this code:
String foo = "%u54E6";
String ufoo = new String(foo.replaceAll("%u([a-zA-Z0-9]{4})", "\\" + "u$1"));
System.out.println("ufoo: " + ufoo);
Unfortunately, all I'm getting is 'u54E6' printed to the console for the value, instead of the Chinese character.
Is there an easy way to convert the original string to a Unicode character in Java?

You're trying to use \u escapes at run time. These are compile-time only. Instead, you should be able to do something like:
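String foo = "%u54E6";
// %u followed by exactly four hex digits
Pattern p = Pattern.compile("%u([0-9a-fA-F]{4})");
Matcher m = p.matcher(foo);
StringBuffer sb = new StringBuffer();
while (m.find()) {
    // quoteReplacement keeps a decoded '$' or '\' from being treated as a group reference
    m.appendReplacement(sb,
        Matcher.quoteReplacement(String.valueOf((char) Integer.parseInt(m.group(1), 16))));
}
m.appendTail(sb);
System.out.println(sb.toString());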
