I need to cut some words out of the server response data.
Using the Regular Expression Extractor I get:
<span class="snippet_word">Działalność</span> <span class="snippet_word">lecznicza</span>.</a>
From that I need just: "Działalność lecznicza"
So I wrote a BeanShell program that should do that, but there is a problem, because I get
"lecznicza lecznicza"
Here is my program:
import java.util.regex;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
String pattern = "\\w+(?=\\<)";
String co = vars.get("tresc");
int len = Integer.parseInt(vars.get("length"));
String phrase="";
StringBuffer sb = new StringBuffer();
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(co);
for(i=0; i < len ;i++){
if (m.find()){
strbuf = new StringBuffer(m.group(0));
}
else {
phrase="notfound";
}
sb.append(" ");
sb.append(strbuf);
}
phrase = sb.toString();
return phrase;
tresc is the source string from which I extract the words matching the pattern.
length tells me how many words I'm extracting.
The program works fine for phrases without national characters. That's why I think there is a problem with encoding, or somewhere here:
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(co);
but I don't know how to change my code.
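In Java, \w matches only ASCII word characters ([a-zA-Z_0-9]) by default, so it does not match letters such as ł, ś or ć. The only word directly followed by < that it can match is "lecznicza"; the second find() then fails, and the loop appends the stale strbuf again, which is why you get "lecznicza lecznicza". To match letters from any script, use \p{L}: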
String pattern = "\\p{L}+(?=\\<)";
For this type of work, though, I would recommend using an XML parser: regular expressions are poorly suited to parsing HTML/XML, as described in this post.
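A minimal sketch of the corrected extraction, assuming plain Java rather than a JMeter BeanShell element. The class name SnippetWords and the method extractWords are illustrative only, and the input is the sample from the question. It collects every match instead of looping a fixed number of times, so a failed find() cannot re-append the previous word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SnippetWords {
    // \p{L}+ matches a run of letters in any script; the lookahead requires '<' right after it.
    private static final Pattern WORD_BEFORE_TAG = Pattern.compile("\\p{L}+(?=<)");

    // Hypothetical helper: returns all words that directly precede a tag, joined by spaces.
    static String extractWords(String html) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD_BEFORE_TAG.matcher(html);
        while (m.find()) {
            words.add(m.group());
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only; the first word is the Polish word from the question.
        String html = "<span class=\"snippet_word\">Dzia\u0142alno\u015b\u0107</span> "
                + "<span class=\"snippet_word\">lecznicza</span>.</a>";
        System.out.println(extractWords(html));
    }
}
```

In BeanShell the same loop could read its input from vars.get("tresc"); the length variable is no longer needed because the while loop stops when there are no more matches.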
I am trying to get the URL of the first search result. So far, I have tried downloading the page as an HTML string using InputStream and AsyncTask, then reading the string and stripping out the first URL using a Java regex.
String str = result;
String regex = "\\b(https?|ftp|file)://[-a-zA-Z0-9+&##/%?=~_|!:,.;]*[-a-zA-Z0-9+&##/%=~_|]";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(str);
if (matcher.find()) {
System.out.println(matcher.group());
Toast.makeText(getBaseContext(), matcher.group(), Toast.LENGTH_LONG).show();
}
My code works very well at stripping the first URL out of an HTML file, but I have noticed that there are no URLs in the HTML file when I save it using an Android device. There must be a better way of doing this.
Use while (matcher.find()) {} instead of if (matcher.find()) {}.
If there are multiple URLs in a single line, the if version only picks up the first URL in that line and ignores any other important ones.
For example:
while((line = reader.readLine()) != null) {
Matcher matcher = pattern.matcher(line);
while(matcher.find()){
String url = matcher.group();
}
}
Your code, modified:
Pattern pattern = Pattern.compile("\\b(https?|ftp|file)://[-a-zA-Z0-9+&##/%?=~_|!:,.;]*[-a-zA-Z0-9+&##/%=~_|]");
Matcher matcher = pattern.matcher(result);
while (matcher.find()) {
String url = matcher.group();
}
I'm guessing you're trying to get the first result, though, and you're bound to see a lot of unrelated google.com URLs. I recommend using Jsoup: parsing XML/HTML with regex is strongly discouraged because it gets messy, and Jsoup takes care of all of that for you.
For example:
Document connection = Jsoup.connect("https://www.google.com/search?q=query").get();
// all results are grouped into containers using the class "g" (group)
Elements groups = connection.getElementsByClass("g");
// check if any results were found
if (groups.isEmpty()) {
System.out.println("no results found!");
return;
}
// get the first result
Element firstGroup = groups.first();
// get the href from the first link in the first result
String href = firstGroup.getElementsByTag("a").first().attr("href");
I have a string in Java which contains hex escapes mixed in with normal characters. It looks something like this:
String s = "Hello\xF6\xE4\xFC\xD6\xC4\xDC\xDF"
What I want is to convert the hex values to the characters they represent, so it will look like this:
"HelloöäüÖÄÜß"
Is there a way to replace all hex values with the actual character they represent?
I can achieve what I want with this, but I need one line for every character, and it does not cover unexpected characters:
indexRequest = indexRequest.replace("\\xF6", "ö");
indexRequest = indexRequest.replace("\\xE4", "ä");
indexRequest = indexRequest.replace("\\xFC", "ü");
indexRequest = indexRequest.replace("\\xD6", "Ö");
indexRequest = indexRequest.replace("\\xC4", "Ä");
indexRequest = indexRequest.replace("\\xDC", "Ü");
indexRequest = indexRequest.replace("\\xDF", "ß");
public static void main(String[] args) {
    String s = "Hello\\xF6\\xE4\\xFC\\xD6\\xC4\\xDC\\xDF\\xFF ";
    StringBuffer sb = new StringBuffer();
    // a literal backslash, 'x', then exactly two hex digits, so following text such as the "A" in "\xF6A" is not swallowed
    Pattern p = Pattern.compile("\\\\x([0-9A-Fa-f]{2})");
    Matcher m = p.matcher(s);
    while (m.find()) {
        int num = Integer.parseInt(m.group(1), 16); // parse the two hex digits to an int
        char ch = (char) num; // cast int to char
        // quoteReplacement stops a decoded '$' or '\' from being treated as a group reference
        m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(ch)));
    }
    m.appendTail(sb); // copy whatever follows the last match
    System.out.println(sb.toString());
}
I would loop through every character to find the '\', then skip one character (the x) and call a method with the next two characters.
Then just use the code by Michael Berry from here:
Convert a String of Hex into ASCII in Java
You can use the regex [xX][0-9a-fA-F]+ to identify all the hex codes in your string, convert each one to its corresponding character using Integer.parseInt(matcher.group().substring(1), 16), and replace it in the string. Below is sample code for it:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class HexToCharacter {
public static void main(String[] args) {
String s = "HelloxF6xE4xFCxD6xC4xDCxDF";
StringBuilder sb = new StringBuilder(s);
Pattern pattern = Pattern.compile("[xX][0-9a-fA-F]+");
Matcher matcher = pattern.matcher(s);
while(matcher.find()) {
int indexOfHexCode = sb.indexOf(matcher.group());
sb.replace(indexOfHexCode, indexOfHexCode+matcher.group().length(), Character.toString((char)Integer.parseInt(matcher.group().substring(1), 16)));
}
System.out.println(sb.toString());
}
}
I have tested this regex pattern using your string. If you have other test cases in mind, you might need to change the regex accordingly.
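As a possible alternative on Java 9 or later, the same replacement can be sketched with Matcher.replaceAll and a replacement function. The class name HexEscapeDecoder and the method decode are made up for this example. It assumes the input really contains a backslash before each x (as in the question) and that every escape is exactly two hex digits standing for a Latin-1 code point.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HexEscapeDecoder {
    // A literal backslash, 'x', then exactly two hex digits captured as group 1.
    private static final Pattern HEX_ESCAPE = Pattern.compile("\\\\x([0-9A-Fa-f]{2})");

    // Hypothetical helper: replaces each \xHH escape with the character whose code point is HH.
    static String decode(String input) {
        return HEX_ESCAPE.matcher(input).replaceAll(match -> {
            char decoded = (char) Integer.parseInt(match.group(1), 16);
            // The returned text is still treated as a replacement string, so quote '$' and '\'.
            return Matcher.quoteReplacement(String.valueOf(decoded));
        });
    }

    public static void main(String[] args) {
        System.out.println(decode("Hello\\xF6\\xE4\\xFC\\xD6\\xC4\\xDC\\xDF"));
    }
}
```

Anchoring on the backslash and fixing the length at two digits avoids two failure modes of a looser pattern: an ordinary x followed by hex-looking letters is left alone, and text that follows an escape is not absorbed into it.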
Possible Duplicate:
RegEx match open tags except XHTML self-contained tags
I have a list of 100 names to extract that lie between the tags. I need to extract just the data and not the tags, using Java regular expressions.
E.g.: I need the data "Aaron, Teb", "Abacha, Jui", "Abashidze, Harry", each on a new line.
<a class="listing" href=http://eeee/a/hank_aaron/index.html">Aaron, Teb</a><br>
<a class="listing" href=http://eeee/t/sani_abacha/index.html">Abacha, Jui</a><br>
<a class="listing" href=http://eeee/i/aslan_abashidze/index.html">Abashidze, Harry</a><br>
I wrote the following code, but it extracts the tags too. Where am I going wrong? How do I strip the tags, or is the regexp wrong?
public static void main(String[] args) throws Exception {
URL oracle = new URL("http://eeee/all/people/index.html");
BufferedReader in = new BufferedReader(new InputStreamReader(oracle.openStream()));
String input;
String REGEX = "<a class=\"listing\"[^>]*>";
while ((input = in.readLine()) != null){
Pattern p = Pattern.compile(REGEX);
Matcher m = p.matcher(input);
while(m.find()) {
System.out.println(input);
}
}
in.close();
}
Use this regexp:
(?:<a class=\"listing\"[^>]*>)([^<]*)(?:<)
Group 1 will capture the name, so print m.group(1) rather than input.
P.S. You should move Pattern p = Pattern.compile(REGEX); outside the loop.
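Putting both suggestions together, a minimal sketch might look like the following. The class name ListingNames and the method extractNames are illustrative, the non-capturing groups are dropped because they do not change what is matched, and the sample input is copied from the question instead of being read from the URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ListingNames {
    // Compiled once; group 1 is the text between the opening tag and the next '<'.
    private static final Pattern LISTING = Pattern.compile("<a class=\"listing\"[^>]*>([^<]*)<");

    // Hypothetical helper: returns the link texts of all "listing" anchors, in order.
    static List<String> extractNames(String html) {
        List<String> names = new ArrayList<>();
        Matcher m = LISTING.matcher(html);
        while (m.find()) {
            names.add(m.group(1).trim());
        }
        return names;
    }

    public static void main(String[] args) {
        String html =
                "<a class=\"listing\" href=http://eeee/a/hank_aaron/index.html\">Aaron, Teb</a><br>\n"
              + "<a class=\"listing\" href=http://eeee/t/sani_abacha/index.html\">Abacha, Jui</a><br>\n"
              + "<a class=\"listing\" href=http://eeee/i/aslan_abashidze/index.html\">Abashidze, Harry</a><br>";
        // One name per line, without the surrounding tags.
        extractNames(html).forEach(System.out::println);
    }
}
```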
I've found a how-to, http://answers.oreilly.com/topic/214-how-to-match-nonprintable-characters-with-a-regular-expression/ , but none of the codes (\e, \x1b, \x1B) work for me in Java.
EDIT
I am trying to replace the ANSI escape sequences (specifically, color sequences) of a Linux terminal command's output.
In Python the replace pattern would look like "\x1b[34;01m", which means blue bold text. This same pattern does not work in Java. I tried to replace "[34;01m" separately, and it worked, so the problem is \x1b.
And I am doing the "[" escaping using Pattern.quote().
EDIT
Map<String,String> escapeMap = new HashMap<String,String>();
escapeMap.put("\\x1b[01;34m", "</span><span style=\"color:blue;font-weight:bold\">");
FileInputStream stream = new FileInputStream(new File("/home/ch00k/gun.output"));
FileChannel fc = stream.getChannel();
MappedByteBuffer bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
String message = Charset.defaultCharset().decode(bb).toString();
stream.close();
String patternString = Pattern.quote(StringUtils.join(escapeMap.keySet(), "|"));
System.out.println(patternString);
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(message);
StringBuffer sb = new StringBuffer();
while(matcher.find()) {
matcher.appendReplacement(sb, escapeMap.get(matcher.group()));
}
matcher.appendTail(sb);
String formattedMessage = sb.toString();
System.out.println(formattedMessage);
EDIT
Here is the code I've ended up with:
import java.io.*;
import java.nio.*;
import java.nio.channels.*;
import java.nio.charset.*;
import java.util.*;
import java.util.regex.*;
import org.apache.commons.lang3.*;
class CreateMessage {
public static void message() throws IOException {
FileInputStream stream = new FileInputStream(new File("./gun.output"));
FileChannel fc = stream.getChannel();
MappedByteBuffer bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
String message = Charset.defaultCharset().decode(bb).toString();
stream.close();
Map<String,String> tokens = new HashMap<String,String>();
tokens.put("root", "nobody");
tokens.put(Pattern.quote("[01;34m"), "qwe");
String patternString = "(" + StringUtils.join(tokens.keySet(), "|") + ")";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(message);
StringBuffer sb = new StringBuffer();
while(matcher.find()) {
System.out.println(tokens.get(matcher.group()));
matcher.appendReplacement(sb, tokens.get(matcher.group()));
}
matcher.appendTail(sb);
System.out.println(sb.toString());
}
}
The file gun.output contains the output of ls -la --color=always /
Now, the problem is that I get a NullPointerException when I try to match Pattern.quote("[01;34m"). Everything matches fine except the strings that contain [, even though I quote them. The exception is the following:
Exception in thread "main" java.lang.NullPointerException
at java.util.regex.Matcher.appendReplacement(Matcher.java:699)
at org.minuteware.jgun.CreateMessage.message(CreateMessage.java:32)
at org.minuteware.jgun.Main.main(Main.java:23)
EDIT
So, according to http://java.sun.com/developer/technicalArticles/releases/1.4regex/, the escape character should be matched with "\u001B", which indeed works in my case. The problem is, if I use tokens.put("\u001B" + Pattern.quote("[01;34m"), "qwe");, I still get the above mentioned NPE.
quote() is to make a pattern that will match the input string verbatim. Your string has pattern language in it. Look at the output from quote() - you'll see that it's trying to literally find the four characters \x1b.
The ANSI escape sequences have the following form: \033[34;01m
where \033 is the ASCII character 033 (octal), which is 1B in hex or 27 in decimal. You need to use the following regexp:
Pattern p = Pattern.compile("\033\\[34;01m");
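In a Java string literal you can write a non-printable character with an octal escape (\033) or a Unicode escape (\u001B). Java string literals have no \x escape; \x1b is only understood by the regex engine, so in a pattern it has to be written as "\\x1b".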
The proper value for the "escape" character in a regexp is \u001B.
FWIW, I've been working on stripping ANSI color codes from colorized log4j files and this little pattern seems to do the trick for all of the cases I've come across:
Pattern.compile("(\\u001B\\[\\d+;\\d+m)+")
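On the NullPointerException from the question's last edits: the map key there is the quoted pattern text returned by Pattern.quote, while matcher.group() returns the text that actually matched, so tokens.get(...) yields null and appendReplacement fails. A sketch of one way around it is to key the map by the literal sequences and quote them only while building the pattern. The class name AnsiToHtml, the method replaceTokens and the HTML replacements are assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class AnsiToHtml {
    // Hypothetical helper: replaces every literal key found in message with its mapped value.
    static String replaceTokens(String message, Map<String, String> tokens) {
        if (tokens.isEmpty()) {
            return message; // an empty alternation would match the empty string everywhere
        }
        // Quote each literal key separately, then join the alternatives with '|'.
        String alternation = tokens.keySet().stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        Matcher matcher = Pattern.compile(alternation).matcher(message);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            // group() returns the matched text, which is exactly one of the unquoted map keys.
            String replacement = tokens.get(matcher.group());
            matcher.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> tokens = new LinkedHashMap<>();
        // The ESC character is written with a Unicode escape in the string literal.
        tokens.put("\u001B[01;34m", "<span style=\"color:blue;font-weight:bold\">");
        tokens.put("\u001B[0m", "</span>");
        System.out.println(replaceTokens("\u001B[01;34mbin\u001B[0m", tokens));
    }
}
```

Quoting each key on its own matters: calling Pattern.quote on the already joined string, as in the question's first edit, also quotes the | separators, so the pattern would only match the whole joined text.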
I have a query string passed in through an HTTP request that has this character in it:
%u54E6
I'd like to generate a string that contains the actual Chinese character so I can use it in a different part of the application. I've tried using this code:
String foo = "%u54E6";
String ufoo = new String(foo.replaceAll("%u([a-zA-Z0-9]{4})", "\\" + "u$1"));
System.out.println("ufoo: " + ufoo);
Unfortunately, all I'm getting is 'u54E6' printed to the console for the value, instead of the Chinese character.
Is there an easy way to convert the original string to a Unicode character in Java?
You're trying to use \u escapes at run time. These are compile-time only. Instead, you should be able to do something like:
String foo = "%u54E6";
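Pattern p = Pattern.compile("%u([0-9a-fA-F]{4})"); // four hex digits only, so parseInt cannot fail
Matcher m = p.matcher(foo);
StringBuffer sb = new StringBuffer();
while (m.find()) {
// quoteReplacement keeps a decoded '$' or '\' from being read as a group reference
m.appendReplacement(sb,
Matcher.quoteReplacement(String.valueOf((char) Integer.parseInt(m.group(1), 16))));
}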
m.appendTail(sb);
System.out.println(sb.toString());