parsing html string in java using regex - java

I need help parsing an HTML string.
String str = "<div id=\"test\" > Amrit </div><div><a href=\"#bbbb\" > Amrit </a> </div><a href=\"#cccc\" ><a href=\"#dddd\" >";
String reg = ".*(<\\s*a\\s+href\\s*=\\s*\\\"(.+?)\"\\s*>).*";
str is my sample string and reg is the regex I use to parse all the anchor tags, especially the value of href. With this regex, only the last part of the string is shown.
Pattern MY_PATTERN = Pattern.compile(reg);
Matcher m = MY_PATTERN.matcher(str);
while (m.find()) {
for(int i=0; i<m.groupCount(); i++){
String s = m.group(i);
System.out.println("->" + s);
}
}
This is the code I wrote.
What is missing?
Also, what if I want a particular occurrence of a string to be replaced? Say my URL has to change from [string]_[string] to [string]-[string]: how can I find the "_" and replace it with "-"?

Instead of parsing HTML with a regex (regexes are for regular languages, and HTML is not a regular language), use HtmlUnit:
http://htmlunit.sourceforge.net/
This may help: Options for HTML scraping?

It looks like you have one escape too many before the first quote.
This segment may fix it: "<\\s*a\\s+href\\s*=\\s*\"(.+?)\"\\s*>", but I can't say
whether the regex as a whole works.
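A minimal sketch of one way to get every href out of the sample string, assuming the goal is simply to list the values. The extra backslash before the quote is harmless, because an escaped quote still matches a quote. What makes the original regex report only the last anchor is the greedy `.*` at both ends, which lets a single match swallow the whole string. Separately, groups are numbered from 0 to `groupCount()` inclusive, so a loop with `i < m.groupCount()` never prints the last group. The class name `HrefExtractor`, its method names, and the sample value `my_page` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefExtractor {
    // No leading or trailing ".*", so each find() stops at the next anchor tag.
    private static final Pattern ANCHOR =
            Pattern.compile("<\\s*a\\s+href\\s*=\\s*\"(.+?)\"\\s*>");

    // Collects group 1 (the href value) of every anchor in the input.
    static List<String> hrefs(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    // Replaces every "_" with "-"; no regex is needed for a single character.
    static String underscoresToDashes(String s) {
        return s.replace('_', '-');
    }

    public static void main(String[] args) {
        String str = "<div id=\"test\" > Amrit </div><div><a href=\"#bbbb\" > Amrit </a> </div>"
                + "<a href=\"#cccc\" ><a href=\"#dddd\" >";
        System.out.println(hrefs(str));                     // [#bbbb, #cccc, #dddd]
        System.out.println(underscoresToDashes("my_page")); // my-page
    }
}
```

This still breaks on anchors whose first attribute is not href or whose value uses single quotes, which is the usual argument for a real HTML parser.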

I would suggest using jsoup, which is much more flexible than a regex. Sample code is below.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class ListLinks {
public static void main(String[] args) throws Exception {
String url = "http://www.umovietv.com/EntertainmentList.aspx";
Document doc = Jsoup.connect(url).get();
Elements links = doc.select("a[href]");
for (Element link : links) {
print("%s", link.attr("abs:href"));
}
}
private static void print(String msg, Object... args) {
System.out.println(String.format(msg, args));
}
}
Refer to http://jsoup.org/ for more information.

Related

jsoup parsing string value in android

I am learning jsoup. I want to parse the below script :
<script>
_cUq="1lj9lodlnq";
</script>
After parsing output : 1lj9lodlnq
Here is what I am trying:
String str = element.ownText().toString();
str = str.replace("\r","");
str = str.replace("\n","");
str = str.replace("<script>","");
str = str.replace("</script>","");
System.out.println(str);
if(str.contains("="))
split = str.split("=");
While debugging I can see that the script is stored in the element, but after assigning it to str I get "". Where am I going wrong?
You can extract the inner JavaScript with jsoup, which makes your code much easier to maintain. ownText() only returns text nodes, while jsoup keeps the contents of a script element as a data node, so use data() instead. Also, you can use a regular expression to strip all whitespace at once instead of calling String.replace() for each character.
import org.jsoup.Jsoup;
import org.junit.Test;
import static org.hamcrest.core.Is.is;
import static org.junit.Assert.assertThat;
public class JSoupSO {
@Test
public void script() {
String s = "<script>\n" +
"_cUq=\"1lj9lodlnq\";\n" +
"</script>";
// let Jsoup parse the HTML
String innerJavascript = Jsoup.parse(s).data();
// remove all whitespaces
innerJavascript = innerJavascript.replaceAll("\\s", "");
assertThat(innerJavascript, is("_cUq=\"1lj9lodlnq\";"));
}
}

Find text between two tags and replace it with the uppercase version of the same text

I am writing code that gives me the proper nouns in a sentence in uppercase. I am using an NER tagger to get tags like PERSON and LOCATION, and I want my code to output the text between the tags in uppercase. I am doing it the following way, but it's not working:
Matcher m1 = Pattern.compile("<PERSON>(.+?)</PERSON>|<LOCATION>(.+?)</LOCATION>").matcher(NER);
while(m1.find())
{ String newDecapTitle = m1.appendReplacement(sb, decapTitle.get(m1.group().toUppercase()));
........
}
Here sb is a string buffer.
To give you an example:
James murray went to Los angeles
gets parsed as
<PERSON>James murray</PERSON> went to <LOCATION>Los angeles</LOCATION>
and I want my output to be -
James Murray went to Los Angeles
.
You are giving it the whole match. Try giving it m1.group(1) (which is James murray) or m1.group(2) (which is Los angeles) instead; only one of the two is non-null for any given match. Or you can run another regex that strips all the tags from your final result (PERSON and LOCATION are tags, and Stack strips them as well).
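A minimal sketch of the appendReplacement loop from the question, assuming the wanted result is the example output (first letter of each word capitalized, tags removed). It uses a single alternation with a backreference instead of two separate groups, so the text is always in group 2. The class name `TagCapitalizer` and the helper `capitalizeWords` are hypothetical, not part of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagCapitalizer {
    // Group 1 is the tag name, group 2 the text between the tags.
    private static final Pattern TAGGED =
            Pattern.compile("<(PERSON|LOCATION)>(.+?)</\\1>");

    // Upper-cases the first character of every whitespace-separated word.
    static String capitalizeWords(String text) {
        StringBuilder out = new StringBuilder();
        boolean startOfWord = true;
        for (char c : text.toCharArray()) {
            out.append(startOfWord ? Character.toUpperCase(c) : c);
            startOfWord = Character.isWhitespace(c);
        }
        return out.toString();
    }

    // Replaces each tagged span with its capitalized text and drops the tags.
    static String capitalizeTagged(String ner) {
        Matcher m = TAGGED.matcher(ner);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // quoteReplacement keeps '$' and '\' in the text from being read as group references.
            m.appendReplacement(sb, Matcher.quoteReplacement(capitalizeWords(m.group(2))));
        }
        m.appendTail(sb); // copy whatever follows the last match
        return sb.toString();
    }

    public static void main(String[] args) {
        String ner = "<PERSON>James murray</PERSON> went to <LOCATION>Los angeles</LOCATION>";
        System.out.println(capitalizeTagged(ner)); // James Murray went to Los Angeles
    }
}
```

Note that appendReplacement returns the Matcher rather than a String, so its result cannot be assigned to a String variable as in the question.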
For the sake of future proofing, I have considered that you may use tags in the future that may be different to just <PERSON> and <LOCATION>. You can do the following to capture words between the tags that are of the form <tag></tag>:
public static void main(String[] args){
String in = "<PERSON>James murray</PERSON> went to <LOCATION>Los angeles</LOCATION>";
Matcher m1 = Pattern.compile(">(.*?)<").matcher(in);
while (m1.find()) {
for (int i = 1; i <= m1.groupCount(); i++) {
System.out.println("matched text: "+ m1.group(i));
}
}
}
Output:
matched text: James murray
matched text: went to
matched text: Los angeles
You can use this to do whatever you want with the captured words.
Another solution is to use a non-capturing group to do something like this (untested):
Matcher m1 = Pattern.compile("(?:<PERSON>|<\\/PERSON>|<LOCATION>|<\\/LOCATION>)?([\\w ]+)").matcher(in);
This will find specifically the tags and capture the groups between them. But I would recommend the first way of doing it.
Try it with jsoup and apache.commons.lang WordUtils
Example:
import org.apache.commons.lang3.text.WordUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class ExtractInfo {
public static void main (String [] args) {
String html = "<PERSON>James murray</PERSON> went to <LOCATION>Los angeles</LOCATION>";
Document doc = Jsoup.parse(html);
Elements es = doc.select("person,location");
for(Element e : es){
String eText = e.text();
e.text(replace(eText));
}
System.out.println(doc.text());
}
public static String replace(String str){
return WordUtils.capitalize(str);
}
}
//prints "James Murray went to Los Angeles"

Regex to get text from html tags (nested) - Java

Using regex, I want to be able to get the text between multiple html tags.
Here HTML is just for representation of input, I am not worried about HTML tags, just want to retrieve the content in the HTML tags(between both correct open and close tags).
For instance, the following:
Required Input:
<h1>Text 1</h1>
<h1><h2>Text 2</h2></h1>
<h1><h2>Text 3</h2>Xtra</h1>
<h1>Text 4<h1>extra</h1515></h1>
<h1><h1></h1></h1>
Required Output:
Text 1
Text 2
Text 3
None
None
Output Obtained:
Text 1
Text 2
Text 3
Text 4<h1>extra</h1515>
<h1></h1>
Regex I tried:
"<([\\S ]+)>([\\S ]+)</\\1>"
I am not getting the expected result.
My java code:
import java.io.*;
import java.util.*;
import java.text.*;
import java.math.*;
import java.util.regex.*;
public class Solution{
public static void main(String[] args){
Scanner in = new Scanner(System.in);
int testCases = Integer.parseInt(in.nextLine());
while(testCases>0){
String line = in.nextLine();
String tmp = line;
Pattern r = Pattern.compile("<([\\S ]+)>([\\S ]+)</\\1>", Pattern.MULTILINE);
Matcher m = r.matcher(line);
while(m.find()){
line = line.replaceAll(line, m.group(2));
m = r.matcher(line);
}
if(line != tmp)
System.out.println(line);
else
System.out.println("None");
testCases--;
}
}
}
As pointed out in the comments, that way lies nothing but pain. For what you're attempting to do, you would be far better off walking the DOM (Document Object Model) with something like jsoup.
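If a parser is not an option, here is a hedged sketch of a regex that produces the required output for the five sample lines in the question. The change from the original pattern is that the content group is `[^<]+`, so it can never contain another tag. The class name `TagContent` and method `contents` are made up for illustration, and this is not a general solution for nested markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagContent {
    // Group 1 is the tag name, reused by the backreference \1 in the closing tag.
    // [^<]+ keeps '<' out of the content, so group 2 never contains a tag.
    private static final Pattern TAG = Pattern.compile("<(.+)>([^<]+)</\\1>");

    // Returns the text of every correctly closed tag on the line.
    static List<String> contents(String line) {
        List<String> found = new ArrayList<>();
        Matcher m = TAG.matcher(line);
        while (m.find()) {
            found.add(m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        String[] lines = {
            "<h1>Text 1</h1>",
            "<h1><h2>Text 2</h2></h1>",
            "<h1><h2>Text 3</h2>Xtra</h1>",
            "<h1>Text 4<h1>extra</h1515></h1>",
            "<h1><h1></h1></h1>"
        };
        for (String line : lines) {
            List<String> found = contents(line);
            if (found.isEmpty()) {
                System.out.println("None");
            } else {
                found.forEach(System.out::println);
            }
        }
    }
}
```

Collecting the matches in a list also avoids the `line != tmp` comparison in the question, which compares references rather than string contents.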

Java regex get exact token value

I have a string like the one below and want to get the value of cn=ADMIN, but I don't know an efficient way to do it with a regex.
group:192.168.133.205:387/cn=ADMIN,cn=groups,dc=mi,dc=com,dc=usa
well ... like this?
package test;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexSample {
public static void main(String[] args) {
String str = "group:192.168.133.205:387/cn=ADMIN,cn=groups,dc=mi,dc=com,dc=usa";
Pattern pattern = Pattern.compile("^.*/(.*)$");
Matcher matcher = pattern.matcher(str);
if (matcher.matches()) {
String right = matcher.group(1);
String[] parts = right.split(",");
for (String part : parts) {
System.err.println("part: " + part);
}
}
}
}
Output is:
part: cn=ADMIN
part: cn=groups
part: dc=mi
part: dc=com
part: dc=usa
String bubba = "group:192.168.133.205:387/cn=ADMIN,cn=groups,dc=mi,dc=com,dc=usa";
String target = "cn=ADMIN";
for(String current: bubba.split("[/,]")){
if(current.equals(target)){
System.out.println("Got it");
}
}
A regex pattern for this:
cn=([a-zA-Z0-9]+?),
The name will be in group 1 of the matcher. You can extend the character class if the value may contain spaces etc.
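A short sketch of how that pattern could be applied with Matcher.find(), assuming only the first cn value is wanted. The class name `CnExtractor` and method `firstCn` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CnExtractor {
    // Group 1 captures the value between "cn=" and the next comma.
    private static final Pattern CN = Pattern.compile("cn=([a-zA-Z0-9]+?),");

    // Returns the first cn value, or null when the input has none.
    static String firstCn(String dn) {
        Matcher m = CN.matcher(dn);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String str = "group:192.168.133.205:387/cn=ADMIN,cn=groups,dc=mi,dc=com,dc=usa";
        System.out.println(firstCn(str)); // ADMIN
    }
}
```

find() is used rather than matches() because the pattern describes only a fragment of the string, not all of it.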

How can I extract all substring by matching a regular expression?

I want to extract the values of all src attributes in this string. How can I do that?
<p>Test
<img alt="70" width="70" height="50" src="/adminpanel/userfiles/image/1.jpg" />
Test
<img alt="70" width="70" height="50" src="/adminpanel/userfiles/image/2.jpg" />
</p>
Here you go:
String data = "<p>Test \n" +
"<img alt=\"70\" width=\"70\" height=\"50\" src=\"/adminpanel/userfiles/image/1.jpg\" />\n" +
"Test \n" +
"<img alt=\"70\" width=\"70\" height=\"50\" src=\"/adminpanel/userfiles/image/2.jpg\" />\n" +
"</p>";
Pattern p0 = Pattern.compile("src=\"([^\"]+)\"");
Matcher m = p0.matcher(data);
while (m.find())
{
System.out.printf("found: %s%n", m.group(1));
}
Most regex flavors have a shorthand for grabbing all matches, like Ruby's scan method or .NET's Matches(). Java only gained one in Java 9 (Matcher.results()); before that you always have to spell out the find() loop.
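For comparison, a minimal sketch of the same extraction written with Matcher.results(), assuming Java 9 or later. The class name `SrcExtractor` and method `srcValues` are made up for illustration.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class SrcExtractor {
    // Group 1 is the attribute value between the double quotes.
    private static final Pattern SRC = Pattern.compile("src=\"([^\"]+)\"");

    // Streams every match and keeps only the captured value.
    static List<String> srcValues(String html) {
        return SRC.matcher(html)
                .results()
                .map(r -> r.group(1))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String data = "<p>Test \n"
                + "<img alt=\"70\" width=\"70\" height=\"50\" src=\"/adminpanel/userfiles/image/1.jpg\" />\n"
                + "Test \n"
                + "<img alt=\"70\" width=\"70\" height=\"50\" src=\"/adminpanel/userfiles/image/2.jpg\" />\n"
                + "</p>";
        srcValues(data).forEach(s -> System.out.printf("found: %s%n", s));
    }
}
```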
Idea - split around the '"' char, look at each part if it contains the attribute name src and - if yes - store the next value, which is a src attribute.
String[] parts = thisString.split("\""); // splits at " char
List<String> srcAttributes = new ArrayList<String>();
boolean nextIsSrcAttrib = false;
for (String part : parts) {
    if (part.trim().endsWith("src=")) {
        // the part after this one is the attribute value
        nextIsSrcAttrib = true;
    } else if (nextIsSrcAttrib) {
        srcAttributes.add(part);
        nextIsSrcAttrib = false;
    }
}
Better idea - feed it into a usual html parser and extract the values of all src attributes from all img elements. But the above should work as an easy solution, especially in non-production code.
Sorry for not coding it (short of time).
How about:
1. (Assuming that the file size is reasonable) read the entire file into a String.
2. Split the String around "src=\"" (assume that the resulting array is called strArr).
3. Loop over the resulting array of Strings, skipping strArr[0] (the text before the first src), and store strArr[i].substring(0,strArr[i].indexOf("\" />")) in some collection of image sources.
Aviad
Since you've requested a regex implementation:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
private static String input = "....your html.....";
public static void main(String[] args) {
// [^\"]* stops at the closing quote; a greedy .* would run on to the last quote on the line
Pattern pattern = Pattern.compile("src=\"[^\"]*\"");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
System.out.println(matcher.group());
}
}
}
You may have to tweak the regex if your src attributes are not double-quoted.
