Parsing Inner <p> tags - java

I need to parse XML content and find the inner tags inside it:
<p><span>test</span></p> <p><span>test12</span></p> <p>Some text<p><span>test</span></p></p>
In the sample above, the last p tag has another p tag nested inside it. I need to find the inner p tags of a p tag. I tried the following:
public static void main(String[] args) {
String text= "<p><span>test</span></p> <p><span>test12</span></p> <p>Some text<p><span>test</span></p></p>";
Pattern pattern = Pattern.compile("<p>.*?</p>");
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
String match = matcher.group();
//System.out.println("matcher group:"+match);
if (match.lastIndexOf("<p>") > 0) {
//System.out.println("Substring:"+match.substring(match.indexOf("<p>") + "<p>".length(), match.indexOf("</p>")));
text = text.replace(match, "<p>" +match.substring(match.indexOf("<p>") + "<p>".length(), match.indexOf("</p>")).replaceAll("<p>", ""));
}
}
System.out.println("text:"+text);
}
Let me know if there is an easier way to do this.

Have a look at JAXB.
As suggested by others, don't do this manually and instead use an existing library like JAXB.
An easy to understand JAXB hello world example can be found here.
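As an illustration of the "use a parser" advice, here is a minimal sketch that uses the JDK's built-in DOM parser and XPath rather than JAXB (a swapped-in technique, not what the answer above links to). It assumes the fragment is well-formed apart from having several top-level elements, so it wraps the text in a hypothetical `<root>` element; the class and method names are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class InnerParagraphs {
    /** Returns the text content of every <p> that is nested inside another <p>. */
    static List<String> innerParagraphText(String fragment) throws Exception {
        // Hypothetical wrapper element: the fragment has several top-level <p>
        // elements, so it needs a single root to be a well-formed document.
        String xml = "<root>" + fragment + "</root>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        // "//p//p" selects every <p> that has a <p> ancestor.
        NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate("//p//p", doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String text = "<p><span>test</span></p> <p><span>test12</span></p> "
                + "<p>Some text<p><span>test</span></p></p>";
        System.out.println(innerParagraphText(text)); // prints [test]
    }
}
```

Unlike the regex attempt, this does not depend on where the first `</p>` happens to fall, because the parser tracks nesting itself.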

Related

Regex - how to find HTML <a> tag content by its class?

I need to get the content of an <a> html tag by a certain css class name.
The css class that I need find is: whtbigheader
What I have done so far is this:
content = "<A HREF='/articles/0,7340,L-4664450,00.html' CLASS='whtbigheader' style='color:#FFFFFF;' HM=1>need to get this value</A>";
Pattern p = Pattern.compile("<A.+?class\\s*?=[whtbigheader]['\"]?([^ '\"]+).*?>(.*?)</A>");
Matcher m = p.matcher(content);
if (m.find()) {
System.out.println("found");
System.out.println(m.group(1));
}
else {
System.out.println("not found");
}
The expected value is: need to get this value
More info:
I can only use regex
The content is a whole HTML String
Any ideas how to find it?
I dislike using regex for HTML parsing, so this solution might not be what the asker wants.
Using Jsoup to achieve this:
String html; // your html code
Document doc = Jsoup.parse(html);
Elements elements = doc.select(".whtbigheader"); // <-- that's it, it contains all the tags with whtbigheader as their class.
To make sure you only get <a> tags:
Elements elements=doc.select("a").select(".whtbigheader");
To get the text, you just need to loop through elements and read the text of each one:
for(Element element : elements){
System.out.println(element.text());
}
download link:
to download Jsoup 1.8.2 click here :).
A parser is the more robust way to extract information from HTML. However, in this case it is possible to use a regular expression to get what you want (assuming you are never going to have nested anchor tags; if you do have nested anchor tags then you might want to sanity-check your documents, and you will definitely need a parser).
You can use the following regex (using case insensitive flags):
"<a\\s+(?:[^>]+\\s+)?class\\s*=\\s*(?:whtbigheader(?=\\s|>)|(['\"])(?:(?:(?!\\1).)*?\\s+)*whtbigheader(?:\\s+(?:(?!\\1).)*?)*\\1)[^>]*>(.*?)</a>"
You want to extract the second group match like this:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
static final Pattern ANCHOR_PATTERN = Pattern.compile(
"<a\\s+(?:[^>]+\\s+)?class\\s*=\\s*(?:whtbigheader(?=\\s|>)|(['\"])(?:(?:(?!\\1).)*?\\s+)*whtbigheader(?:\\s+(?:(?!\\1).)*?)*\\1)[^>]*>(.*?)</a>",
Pattern.CASE_INSENSITIVE
);
public static String getAnchorContents( final String html ){
final Matcher matcher = ANCHOR_PATTERN.matcher( html );
if ( matcher.find() ){
return matcher.group(2);
}
return null;
}
public static void main( final String[] args ){
final String[] tests = {
"<a class=whtbigheader>test</a>",
"<a class=\"whtbigheader\">test</a>",
"<a class='whtbigheader'>test</a>",
"<a class =whtbigheader>test</a>",
"<a class =\"whtbigheader\">test</a>",
"<a class ='whtbigheader'>test</a>",
"<a class= whtbigheader>test</a>",
"<a class= \"whtbigheader\">test</a>",
"<a class= 'whtbigheader'>test</a>",
"<a class = whtbigheader>test</a>",
"<a class\t=\r\n\"whtbigheader\">test</a>",
"<a class =\t'whtbigheader'>test</a>",
"<a class=\"otherclass whtbigheader\">test</a>",
"<a class=\"whtbigheader otherclass\">test</a>",
"<a class=\"whtbigheader2 whtbigheader\">test</a>",
"<a class=\"otherclass whtbigheader otherotherclass\">test</a>",
"<a class=whtbigheader href=''>test</a>",
};
int successes = 0;
int failures = 0;
for ( final String test : tests )
{
final String contents = getAnchorContents( test );
if ( "test".equals( contents ) )
successes++;
else
{
System.err.println( test + " => " + contents );
failures++;
}
}
final String[] failingTests = {
"<a class=whtbigheader2>test</a>",
"<a class=awhtbigheader>test</a>",
"<a class=whtbigheader-other>test</a>",
"<a class='whtbigheader2'>test</a>",
"<a class='awhtbigheader'>test</a>",
"<a class='whtbigheader-other'>test</a>",
"<a class=otherclass whtbigheader>test</a>",
"<a class='otherclass' whtbigheader='value'>test</a>",
"<a class='otherclass' id='whtbigheader'>test</a>",
"<a><aclass='whtbigheader'>test</aclass></a>",
"<a aclass='whtbigheader'>test</a>",
"<a class='whtbigheader\"'>test</a>",
"<ab class='whtbigheader'><a>test</a></ab>",
};
for ( final String test : failingTests )
{
final String contents = getAnchorContents( test );
if ( contents == null )
successes++;
else
{
System.err.println( test + " => " + contents );
failures++;
}
}
System.out.println( "Successful tests: " + successes );
System.out.println( "Failed tests: " + failures );
}
}
Use non-capturing group instead of square brackets to match a word.
Pattern p = Pattern.compile("(?i)<A.+?class\\s*?=(['\"])?(?:whtbigheader)\\1[^>]*>(.*?)</A>");
Matcher m = p.matcher(content);
if (m.find()) {
System.out.println("found");
System.out.println(m.group(2));
}
else {
System.out.println("not found");
}
You can use the following regex:
/<a[^>]*class=\s?['"]\s?whtbigheader\s?['"][^>]*>(.*?)</a>/i
Note that if you just want the content of an a tag with a certain class, you don't need an extra regex within the tag; a[^>]*class='whtbigheader'[^>]* will do the job:
[^>]* will match anything except >
Also, you need to use the i modifier (IGNORE CASE) to ignore case.
In addition, regex is not a good or proper way to parse (?:X|H)TML documents. You may want to consider using a proper parser.
Note that if you put the regex inside a quoted string, you need to escape the quotes around the class name.

Unable to parse Multiple lined XML Message using Java "Pattern" and "Matcher"

I am unable to parse a multi-line XML message payload using Pattern.compile(regex). However, if I put the same message on a single line, it gives me the expected result. For example, if I parse
<Document> <RGOrdCust50K50F> AccName AccNo AccAddress </RGOrdCust50K50F> </Document>
it gives me the <RGOrdCust50K50F> tag value as: AccName AccNo AccAddress. But if I use multiple lines, like
<Document> <RGOrdCust50K50F>AccNo
AccName
AccAddress </RGOrdCust50K50F></Document>
it throws java.lang.IllegalStateException: No match found
The test case code I am using is below:
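import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class ParseXMLMessage {
public static void main(String[] args) {
String fldName = "RGOrdCust50K50F";
// Line breaks inside a Java string literal have to be written as \n
String message = "<?xml version=1.0 encoding=UTF-8?> <Document><RGOrdCust50K50F>1234\n"
+ "ABCD\n"
+ "LONDON,UK </RGOrdCust50K50F></Document>";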
String fldValue = getTagValue(fldName, message);
System.out.println("fldValue:"+fldValue);
}
private static String getTagValue(String tagName, String message) {
String regex = "(?<=<" + tagName + ">).*?(?=</" + tagName + ">)";
System.out.println("regex:"+regex);
Pattern pattern = Pattern.compile(regex);
System.out.println("pattern:"+pattern);
Matcher matcher = pattern.matcher(message);
System.out.println("matcher:"+matcher);
matcher.find(0);
String tagValue = null;
try {
tagValue = matcher.group();
} catch (IllegalStateException isex) {
System.out.println("No Tag/Match found " + isex.getMessage());
}
return tagValue;
}
}
As a business requirement I need to make the message multi-line, but when I do, I get the exception.
I am unable to fix this issue. Kindly suggest whether there is any issue with the regex I am using. Do I need to use '\n' in the regular expression to resolve this? Kindly assist.
If you are parsing XML, use an XML parser to do it - your REGEX will get increasingly complex and frail as you find more and more situations that it can't handle adequately.
There are a large number of mature and stable XML processing libraries. I tend to stick with what I know and jdom has a very shallow learning curve and will handle this sort of processing very easily.
The issue is caused by the '.' metacharacter. See http://docs.oracle.com/javase/tutorial/essential/regex/pre_char_classes.html
. Any character (may or may not match line terminators)
Try the following code:
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE| Pattern.DOTALL);
Check following topic: java regex string matches and multiline delimited with new line
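To make the effect of the flag concrete, here is a small sketch of the asker's getTagValue with only Pattern.DOTALL added. Pattern.MULTILINE changes how ^ and $ behave and is not needed for this regex, so it is left out here; the class name is hypothetical and the message is a shortened version of the one in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {
    /** Returns the text between <tagName> and </tagName>, or null if there is no match. */
    static String getTagValue(String tagName, String message) {
        String regex = "(?<=<" + tagName + ">).*?(?=</" + tagName + ">)";
        // DOTALL lets '.' match line terminators, so the value may span lines.
        Matcher matcher = Pattern.compile(regex, Pattern.DOTALL).matcher(message);
        // Check the result of find() instead of catching IllegalStateException.
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String message = "<Document><RGOrdCust50K50F>1234\n"
                + "ABCD\n"
                + "LONDON,UK </RGOrdCust50K50F></Document>";
        System.out.println("fldValue:" + getTagValue("RGOrdCust50K50F", message));
    }
}
```

Without the flag, the same pattern finds nothing in this message because `.` stops at the first line break.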

InnerHTML for Java

I would like JavaScript style innerHTML in Java. For instance, I want to get 'TRUE' from the string below:
String control = "<div class='myclass'>TRUE</div>";
But my pattern seems to be off as find() returns false. Ideas anyone?
Pattern pattern = Pattern.compile(">(.*?)<");
Matcher matcher = pattern.matcher(control);
if(matcher.find()) {
result = matcher.group(1);
}
get rid of the question mark:
public static void main(String[] args) {
String control = "<div class='myclass'>TRUE</div>";
Pattern pattern = Pattern.compile(">(.*)<");
Matcher matcher = pattern.matcher(control);
String result = null;
if(matcher.find()) {
result = matcher.group(1);
}
System.out.print(result);
}
BTW it would be better to learn how to use java's DOM objects and XPath classes.
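A minimal sketch of that DOM and XPath route, using only classes that ship with the JDK. It assumes the input is well-formed XML (true for the control string in the question, not for arbitrary HTML), and it returns the element's text rather than its markup; the class and method names are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class InnerTextDemo {
    /** Evaluates an XPath expression against the XML and returns its string value. */
    static String innerText(String xml, String xpath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        // The two-argument evaluate() returns the string value of the first matching node.
        return XPathFactory.newInstance().newXPath().evaluate(xpath, doc);
    }

    public static void main(String[] args) throws Exception {
        String control = "<div class='myclass'>TRUE</div>";
        System.out.println(innerText(control, "/div[@class='myclass']")); // prints TRUE
    }
}
```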
Either use jQuery or, if you really insist on doing it in Java, try using JSoup to strip out the HTML and return only the safe stuff.

Extracting a pattern from String

I have a random string from which I need to match a certain pattern and parse it out.
My string:
{"sid":"zw9cmv1pzybexi","parentId":null,"time":1373271966311,"color":"#e94d57","userId":"255863","st":"comment","type":"section","cType":"parent"},{},null,null,null,null,{"sid":"zwldv1lx4f7ovx","parentId":"zw9cmv1pzybexi","time":1373347545798,"color":"#774697","userId":"5216907","st":"comment","type":"section","cType":"child"},{},null,null,null,null,null,{"sid":"zw76w68c91mhbs","parentId":"zw9cmv1pzybexi","time":1373356224065,"color":"#774697","userId":"5216907","st":"comment","type":"section","cType":"child"},
From the above I want to parse out (using regex) all the values of the userId attribute. Can anyone help me with how to do this? It is a random string and not JSON. Can you provide a regex solution for this?
Is that a random string ? It looks like JSON to me, and if it is I would recommend a JSON parser in preference to a regexp. The right thing to do when faced with a particular language/grammar is to use the corresponding parser, rather than a (potentially) fragile regexp.
To get the user Ids, you can use this pattern:
String input = "{\"sid\":\"zw9cmv1pzybexi\",\"parentId\":null,\"time\":1373271966311,\"color\":\"#e94d57\",\"userId\":\"255863\",\"st\":\"comment\",\"type\":\"section\",\"cType\":\"parent\"},{},null,null,null,null,{\"sid\":\"zwldv1lx4f7ovx\",\"parentId\":\"zw9cmv1pzybexi\",\"time\":1373347545798,\"color\":\"#774697\",\"userId\":\"5216907\",\"st\":\"comment\",\"type\":\"section\",\"cType\":\"child\"},{},null,null,null,null,null,{\"sid\":\"zw76w68c91mhbs\",\"parentId\":\"zw9cmv1pzybexi\",\"time\":1373356224065,\"color\":\"#774697\",\"userId\":\"5216907\",\"st\":\"comment\",\"type\":\"section\",\"cType\":\"child\"},";
Pattern p = Pattern.compile("\"userId\":\"(.*?)\"");
Matcher m = p.matcher(input);
while (m.find()) {
System.out.println(m.group(1));
}
which outputs:
255863
5216907
5216907
If you want the full string "userId":"xxxx", you can use m.group(); instead of m.group(1);.
Use JSON parser instead of using Regex, your code will be much more readable and maintainable
http://json.org/java/
https://code.google.com/p/json-simple/
As others have already told you, it looks like a JSON string, but if you really want to parse this string on your own, you could use this piece of code:
final Pattern pattern = Pattern.compile("\"userId\":\"(\\d+)\"");
final Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
The matcher will match every "userId":"12345" pattern. matcher.group(1) will return each userId, 12345 in this case (matcher.group() without a parameter returns the entire match, i.e. "userId":"12345").
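The difference between the whole match and the capture group can be seen side by side in the following sketch. The input is a shortened, hypothetical version of the string from the question, and the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UserIdGroups {
    private static final Pattern USER_ID = Pattern.compile("\"userId\":\"(\\d+)\"");

    /** Collects group 0 (the whole match) or group 1 (just the digits) of every match. */
    static List<String> collect(String line, int group) {
        List<String> found = new ArrayList<>();
        Matcher matcher = USER_ID.matcher(line);
        while (matcher.find()) {
            found.add(matcher.group(group)); // group(0) is the same as group()
        }
        return found;
    }

    public static void main(String[] args) {
        String line = "{\"sid\":\"zw9cmv1pzybexi\",\"userId\":\"255863\"},{},null,"
                + "{\"sid\":\"zwldv1lx4f7ovx\",\"userId\":\"5216907\"},";
        System.out.println(collect(line, 0)); // ["userId":"255863", "userId":"5216907"]
        System.out.println(collect(line, 1)); // [255863, 5216907]
    }
}
```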
Here's the regex-code you're asking for ..
//assign subject
String subject = "{\"sid\":\"zw9cmv1pzybexi\",\"parentId\":null,\"time\":1373271966311,\"color\":\"#e94d57\",\"userId\":\"255863\",\"st\":\"comment\",\"type\":\"section\",\"cType\":\"parent\"},{},null,null,null,null,{\"sid\":\"zwldv1lx4f7ovx\",\"parentId\":\"zw9cmv1pzybexi\",\"time\":1373347545798,\"color\":\"#774697\",\"userId\":\"5216907\",\"st\":\"comment\",\"type\":\"section\",\"cType\":\"child\"},{},null,null,null,null,null,{\"sid\":\"zw76w68c91mhbs\",\"parentId\":\"zw9cmv1pzybexi\",\"time\":1373356224065,\"color\":\"#774697\",\"userId\":\"5216907\",\"st\":\"comment\",\"type\":\"section\",\"cType\":\"child\"},";
//specify pattern and matcher
Pattern pat = Pattern.compile( "userId\":\"(\\d+)", Pattern.CASE_INSENSITIVE|Pattern.DOTALL );
Matcher mat = pat.matcher( subject );
//browse all
while ( mat.find() )
{
System.out.println( "result [" + mat.group( 1 ) + "]" );
}
But OF COURSE I'd suggest solving this with a JSON parser like
http://json.org/java/
Greetings
Christopher
It's a JSON format, so you have to use a JSON Parser:
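// Assumes yourString is a valid JSON array
JSONArray array = new JSONArray(yourString);
for (int i = 0; i < array.length(); i++) {
JSONObject jo = array.optJSONObject(i); // null for entries that are not objects
if (jo != null && jo.has("userId")) {
String userId = jo.getString("userId");
}
}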
EDIT : Regex pattern
"userId"[ :]+((?=\[)\[[^]]*\]|(?=\{)\{[^\}]*\}|\"[^"]*\")
Result :
"userId" : "Some user ID (numeric or letters)"

How can I extract all substring by matching a regular expression?

I want to extract the values of all src attributes in this string. How can I do that?
<p>Test
<img alt="70" width="70" height="50" src="/adminpanel/userfiles/image/1.jpg" />
Test
<img alt="70" width="70" height="50" src="/adminpanel/userfiles/image/2.jpg" />
</p>
Here you go:
String data = "<p>Test \n" +
"<img alt=\"70\" width=\"70\" height=\"50\" src=\"/adminpanel/userfiles/image/1.jpg\" />\n" +
"Test \n" +
"<img alt=\"70\" width=\"70\" height=\"50\" src=\"/adminpanel/userfiles/image/2.jpg\" />\n" +
"</p>";
Pattern p0 = Pattern.compile("src=\"([^\"]+)\"");
Matcher m = p0.matcher(data);
while (m.find())
{
System.out.printf("found: %s%n", m.group(1));
}
Most regex flavors have a shorthand for grabbing all matches, like Ruby's scan method or .NET's Matches(), but in Java you always have to spell it out.
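On Java 9 or later, Matcher.results() gets close to such a shorthand. The following is a small sketch of a helper built on it; the class and method names are made up for the example, and on older Java versions the explicit while loop above is still the way to go.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class AllMatches {
    /** Returns the given capture group of every match, in order of appearance. */
    static List<String> allMatches(Pattern pattern, CharSequence input, int group) {
        // Matcher.results() (Java 9+) streams one MatchResult per successful match.
        return pattern.matcher(input)
                .results()
                .map(result -> result.group(group))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String data = "<img src=\"/adminpanel/userfiles/image/1.jpg\" />\n"
                + "<img src=\"/adminpanel/userfiles/image/2.jpg\" />";
        Pattern src = Pattern.compile("src=\"([^\"]+)\"");
        for (String value : allMatches(src, data, 1)) {
            System.out.printf("found: %s%n", value);
        }
    }
}
```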
Idea - split around the '"' char, look at each part if it contains the attribute name src and - if yes - store the next value, which is a src attribute.
String[] parts = thisString.split("\""); // splits at " char
List<String> srcAttributes = new ArrayList<String>();
boolean nextIsSrcAttrib = false;
for (String part:parts) {
if (part.trim().endsWith("src=")) {
nextIsSrcAttrib = true;
}
else if (nextIsSrcAttrib) {
srcAttributes.add(part);
nextIsSrcAttrib = false;
}
}
Better idea - feed it into a usual html parser and extract the values of all src attributes from all img elements. But the above should work as an easy solution, especially in non-production code.
Sorry for not coding it (short on time).
How about:
1. (Assuming the file size is reasonable) read the entire file into a String.
2. Split the String around "src=\"" (assume the resulting array is called strArr).
3. Loop over the resulting array of Strings and store strArr[i].substring(0,strArr[i].indexOf("\" />")) in some collection of image sources.
Aviad
since you've requested a regex implementation ...
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
private static String input = "....your html.....";
public static void main(String[] args) {
Pattern pattern = Pattern.compile("src=\".*\"");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
System.out.println(matcher.group());
}
}
}
You may have to tweak the regex if your src attributes are not double quoted
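One possible tweak, sketched under the assumption that values are always quoted with either single or double quotes (unquoted values are not handled). It also swaps the greedy `.*` for a lazy `.*?` closed by a backreference, because a greedy `.*` would run on to the last quote on the line whenever other quoted attributes follow src. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SrcValues {
    // Group 1 captures the opening quote, group 2 the value; \1 requires the
    // same kind of quote to close the value.
    private static final Pattern SRC =
            Pattern.compile("\\bsrc\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    /** Returns the value of every quoted src attribute found in the input. */
    static List<String> srcValues(String html) {
        List<String> values = new ArrayList<>();
        Matcher matcher = SRC.matcher(html);
        while (matcher.find()) {
            values.add(matcher.group(2));
        }
        return values;
    }

    public static void main(String[] args) {
        String html = "<img src='/image/1.jpg' alt=\"70\" /> <img alt='70' src=\"/image/2.jpg\" />";
        System.out.println(srcValues(html)); // prints [/image/1.jpg, /image/2.jpg]
    }
}
```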
