I have some HTML in which several divs share the same id. Can I extract the content of the second one?
HTML code:
<div id="test>example </div>
<div id ="test">example11</div>
I need to extract example11.
This works: (?s)<div id="test>.*<div id ="test">(.*?)</div> but I have a lot of divs with the same id, so it does not scale. Is there another way to extract the content?
I know regex is not good for HTML parsing, but I have no choice.
Try this:
(?s)<div[^>]*>.*?</div>\s*<div[^>]*>(.*?)</div>
Now you can select the first capturing group, and it's done ;)
A dirty solution for a div further down would be to repeat the pattern:
<div.*>.*</div><div.*>.*</div><div.*>.*</div><div.*>.*</div><div.*>.*</div><div.*>.*</div><div.*>.*</div><div.*>.*</div><div.*>.*</div><div.*>.*</div><div.*>(.*)</div>
I'm not proud of this one, of course. I'll think about it some more.
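Rather than repeating the pattern once per div, one option is to match a single div and step through the matches until the wanted one is reached. The following is only a minimal sketch under some assumptions: the class and method names (NthDivExtractor, nthDivContent) are made up for illustration, the divs are not nested, and the input is small enough to hold in a String.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NthDivExtractor {
    // Matches one <div ...>...</div> pair and captures its body.
    // DOTALL lets '.' span newlines; the lazy '.*?' stops at the first
    // </div>, so this only works when the divs are not nested.
    private static final Pattern DIV =
            Pattern.compile("<div[^>]*>(.*?)</div>", Pattern.DOTALL);

    /** Returns the body of the n-th div (1-based), or null if there are fewer. */
    public static String nthDivContent(String html, int n) {
        Matcher m = DIV.matcher(html);
        int seen = 0;
        while (m.find()) {
            if (++seen == n) {
                return m.group(1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "<div id=\"test\">example </div>\n"
                + "<div id =\"test\">example11</div>";
        System.out.println(nthDivContent(html, 2));
    }
}
```

Counting matches keeps the pattern short no matter which div is wanted. To restrict it to divs with a particular id, the `[^>]*` part could be tightened, at the cost of a more fragile pattern.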
Related
I have a StringBuilder object in my class that I want to display in the UI. The object contains a few HTML tags, for example <li>, <br>, etc. I would like to know how to format it so that the HTML tags are not shown as-is on screen but are converted to a readable format.
Note: I don't want to remove these tags and get plain text. Rather, if there is a <br> tag, it should break the line when the text is displayed. Also, due to project restrictions, I don't want to use any third-party library such as jsoup.
Any help would be appreciated!
How about a simple .toString().replaceAll with specific replacements? Like:
<br> = \r\n
<li> = \r\n •
...and so on..
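The replacements listed above could be chained roughly as follows. This is a sketch, not a general HTML renderer: the class name HtmlToText and the method render are hypothetical, only <br> and <li> are handled, and dropping the closing </li> is an assumption added here so that it does not show up in the output.

```java
public class HtmlToText {
    /** Converts a small, known set of tags to plain-text equivalents. */
    public static String render(CharSequence html) {
        return html.toString()
                // <br>, <br/> and <br /> become a line break
                .replaceAll("(?i)<br\\s*/?>", "\r\n")
                // an opening <li> (with or without attributes) starts a bullet line
                .replaceAll("(?i)<li(\\s[^>]*)?>", "\r\n \u2022 ")
                // the closing </li> carries no visual meaning here (assumption)
                .replaceAll("(?i)</li>", "");
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("Items:<li>first</li><li>second</li><br>Done");
        System.out.println(render(sb));
    }
}
```

Each further tag needs its own replaceAll call, so this stays manageable only while the set of tags in the StringBuilder is small and known in advance.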
How can I get the text "xxxx" and its URL using jsoup?
<div style="width:45%;float:left;border: dashed 1px #966;margin:0 10px;padding:10px;height:400px;">
<ul>
<li>xxxx</li>
<li><b>years:</b>2015</li>
<li><b>language:</b>non </li>
<li><b>color:</b>color</li>
</ul>
</div>
This is my current approach but I receive nothing:
Elements mvYearElement = doc.select("div[style*=width:45%;float:left;border: dashed.1px #966;margin:0 10px;padding:10px;height:400px;]");
The problem is probably that styles do not need to appear in any particular order. Your selector, however, fixes the order and lists a lot of styles. I would try to identify the part of the style that really discriminates this div and use only that part. Since I don't know the rest of the HTML, I can only guess what that discriminating part is. This, maybe?
Elements els = doc.select("div[style*=dashed]");
That is only a wild guess, however. But maybe it is the contents of the div that distinguish it from the others? In that case you could do something like this:
Elements els = doc.select("div[style]:has(ul)");
Or something else. If you share more of the HTML, I can be more specific.
Elements divs = doc.select("#mp-itn b a");
Here, mp-itn is the id of a div tag. What are the b and the a?
What do they signify?
I am not able to understand this. Please point me to some good tutorials on jsoup.
They refer to the link tag <a ...></a> and the bold tag <b></b>. So in that example it selects the a tags within bold tags within the tag that has an id of mp-itn.
I suggest you read the documentation before you do anything else. This is explained on the selector-syntax page.
I have HTML with CSS, and I want to find out the actual color (and other visual text attributes) of a specified piece of text in the HTML document. Can I do this with jsoup, or must I look for a real HTML engine/processor? Processing speed is one of the main factors.
I think he wants to retrieve this data in a Java program. In that case there are a few things to do:
Download the stylesheet files.
Parse the HTML and find the class attribute.
Match the .class selector in the CSS against the HTML attribute and read the specific information you want.
But beware if you want information about an HTML element that has no class attribute. In that case you need to find the XPath of the element, e.g.:
<table class="entityTable">
<tr>
<td> <input type="text" value="abcdef" /></td>
</tr>
</table>
Then you need to find an XPath like body/div/.../table/tr/td/input and match it against any CSS rules that can influence the input tag's attributes, such as:
.entityTable tr td input
{
color:red;
}
This is much more difficult, so if the HTML to parse is your own page, put a class attribute on every HTML tag. Otherwise you need to find a way to match HTML tags to CSS rules.
Cheers.
Though it is still in beta, the Cobra HTML parser has this capability.
If you need accurate information about an object in a web page, like the default border of a standard HTML table or the color of a standard link, use the Firebug extension for Firefox.
If you're doing this in an applet, you can use JavaScript to collect the information and pass it to your applet.
CSSBox is definitely what you want. It allows you to load external CSS and transform it into inline styles for every DOM element.
http://cssbox.sourceforge.net/manual/
I have the html:
<p>
click here
Welcome
</p>
And I just want to retrieve the "Welcome" part using XPath combined with the Jaxen library. The XPath I am using is:
//p/text()
When I remove the /text(), it retrieves:
click here
Welcome
With /text() added, it retrieves null.
Is there any other way to retrieve everything inside the p tag while excluding any other tags?
From the XML parser's point of view, there are multiple text nodes to choose from (Welcome and the whitespace preceding and following it), so it doesn't choose any one. You have a few options, mainly stripping the whitespace before parsing or being more specific in the query, like selecting the second text node:
//p/text()[2]
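Both options can be tried with the XPath API that ships with the JDK (javax.xml.xpath); it is used here in place of Jaxen only to keep the sketch free of dependencies. The class and method names are hypothetical, and the sample markup in main assumes that "click here" was originally wrapped in an <a> tag, which the question does not show.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ParagraphText {
    /** Option 1: be specific and ask for the second text node of the p element. */
    public static String secondTextNode(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        String value = xpath.evaluate("//p/text()[2]",
                new InputSource(new StringReader(xml)));
        return value.trim();
    }

    /** Option 2: collect every direct text node of p and skip whitespace-only ones. */
    public static String ownText(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate("//p/text()",
                new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < nodes.getLength(); i++) {
            String text = nodes.item(i).getNodeValue().trim();
            if (!text.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // The <a> element is an assumption about the original markup.
        String xml = "<p>\n<a href=\"#\">click here</a>\nWelcome\n</p>";
        System.out.println(secondTextNode(xml));
        System.out.println(ownText(xml));
    }
}
```

The index in text()[2] depends on the whitespace layout of the document: if nothing precedes the child element, the wanted text is the first text node instead. Collecting all text nodes and discarding the empty ones avoids that dependency.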