In Android development, how can I get an image URL or load an image from the HTML text shown below? I get it from HTML, and I want to get an image URL from this code:
<p style="text-align: justify;"><span style="font-size: 16px;"><img class="aligncenter wp-image-2699 size-large" src="http://www.helalsaglikliyasam.org/wp-content/uploads/2016/02/helal-parti-1-1024x258.jpg" alt="helal parti-1" width="750" height="189" srcset="http://www.helalsaglikliyasam.org/wp-content/uploads/2016/02/helal-parti-1-300x76.jpg 300w, http://www.helalsaglikliyasam.org/wp-content/uploads/2016/02/helal-parti-1-768x193.jpg 768w, http://www.helalsaglikliyasam.org/wp-content/uploads/2016/02/helal-parti-1-1024x258.jpg 1024w, http://www.helalsaglikliyasam.org/wp-content/uploads/2016/02/helal-parti-1-560x141.jpg 560w, http://www.helalsaglikliyasam.org/wp-content/uploads/2016/02/helal-parti-1.jpg 1564w" sizes="(max-width: 750px) 100vw, 750px" /></span></p>
I want to get src="http://www.helalsaglikliyasam.org/wp-content/uploads/2016/02/helal-parti-1-1024x258.jpg"
You should probably use an HTML parser -- there are several Java HTML parsing libraries that can be found with a quick search.
A quick-and-dirty way, however, would be to search the input string for the src=" declaration, like so:
int index = input.indexOf("src=\"");
String substr = input.substring(index + 5);
int endIndex = substr.indexOf("\"");
String imgUrl = substr.substring(0, endIndex);
Disclaimer: I haven't tested this, so it may have errors. It also makes a lot of assumptions which may not be true -- which is why you should use a library for this sort of thing!
Edit: Fixed one error after testing (had to use a different machine than the one I'm typing this on). It should work for you now -- but again, you should really use a library.
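For comparison, here is a regex-based variant of the quick-and-dirty approach above that returns null instead of failing when no image is present. It is only a sketch using the Java standard library: the class name ImgSrcExtractor and the method firstImgSrc are made up for illustration, and it assumes the src attribute is double-quoted, so a real HTML parser is still the safer choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
final class ImgSrcExtractor {
    // Captures the value of the src attribute of an <img ...> tag.
    // Assumes double quotes; "\\bsrc=" does not match "srcset=".
    private static final Pattern IMG_SRC =
            Pattern.compile("<img\\b[^>]*?\\bsrc=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Returns the src value of the first <img> tag, or null if none is found. */
    static String firstImgSrc(String html) {
        Matcher m = IMG_SRC.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<p><span><img class=\"aligncenter\" "
                + "src=\"http://example.com/a.jpg\" alt=\"x\" /></span></p>";
        System.out.println(firstImgSrc(html));
    }
}
```

On Android the returned URL string could then be handed to whatever image-loading code you already use.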
To create an image in HTML, just type:
<img src="www.example.com/images/this-one.png" alt="an image" />
You may nest the img tag inside a p and/or span tag as needed.
Just a note: code is much easier to read when you use external style sheets.
I would like to select the text inside the strong-tag but without the div under it...
Is there a possibility to do this with jsoup directly?
My try for the selection (doesn't work, selects the full content inside the strong-tag):
Elements selection = htmlDocument.select("strong").select("*:not(.dontwantthatclass)");
HTML:
<strong>
I want that text
<div class="dontwantthatclass">
</div>
</strong>
You are looking for the ownText() method.
String txt = htmlDocument.select("strong").first().ownText();
Have a look at the various methods jsoup offers for this: https://jsoup.org/apidocs/org/jsoup/nodes/Element.html. You can use remove(), removeChild(), etc.
One thing you can do is use regex.
Here is a sample regex that matches a start and end tag pair, as well as self-closing tags such as <br/>:
https://www.debuggex.com/r/1gmcSdz9s3MSimVQ
So you can do it like
selection.replace(/<([^ >]+)[^>]*>.*?<\/\1>|<[^\/]+\/>/ig, "");
You can further modify this regex to match most of your cases.
Another thing you can do is further process your variable using JavaScript or VBScript:
Elements selection = htmlDocument.select("strong");
jQuery code here:
var removeHTML = function(text, selector) {
var wrapped = $("<div>" + text + "</div>");
wrapped.find(selector).remove();
return wrapped.html();
}
Alongside the regular expression, you can use jsoup's ownText() method to get the text and drop the unwanted string.
I guess you're using jQuery, so you could use "innerText" property on your "strong" element:
var selection = htmlDocument.select("strong")[0].innerText;
https://jsfiddle.net/scratch_cf/8ds4uwLL/
PS: If you want to wrap the retrieved text into a "strong" tag, I think you'll have to build a new element like $('<strong>retrievedText</strong>');
I am trying to extract a specific captcha image id using the Jsoup API. The HTML image tag looks like this:
<img id="wlspispHIPBimg03256465465dsd5456" style="display: inline; width: 200px; height: 100px;" aria-hidden="true" src="https://users/hip/data/rnd=435cb60d0a6b63ef4">
This is my code to obtain the attribute id="wlspispHIPBimg03256465465dsd5456":
doc = Jsoup.connect("http://go.microsoft.com/fwlink/?LinkID=614866&clcid")
.timeout(0).get();
Elements images = doc.select("img[src~=(?i)]");
for (Element image : images) {
System.out.println(image.attr("id"));
}
The problem is that I can't get the id of the captcha image.
You need to find something in the HTML that distinguishes this img tag from every other tag in the document. That can't be deduced from the code you posted, so I'm using my imagination here:
Element imageEl = doc.select("img[src*=rnd]").first();
This exploits the fact that the image source contains "rnd" in its path. To get the best solution you'll have to inspect the page yourself. It also helps a lot to learn the CSS selectors that Jsoup supports.
I think you simply can't accomplish this using only Jsoup: the DOM is modified at runtime by JavaScript, and Jsoup does not execute it.
View also this other question.
I'm trying to scrape "text" off of a website with JSoup. I can get the text cleanly (with no formatting at all, just the text), or with all the formatting still attached (i.e. < br > along with < p > and < /p >).
However, I can't seem to get the formatted version to include < br/ > to any extent, and that's the only thing that was specifically requested to go along with the text.
For example, I can get this:
<p><br>Worldwide database</p>
and this:
Worldwide database
but I can't get this, which is my desired result:
Worldwide database<br/>
I don't see any < br / >'s while looking at the HTML code via the FireBug plugin on Firefox, so I'm wondering if that might be the issue? Or maybe there's an issue with the methods I'm using in my code to pull the text?
Anyways, here's my code:
Elements descriptionHTML = doc.select("div[jsname]"); // <-- Get access to the text w/ JSoup
String descText = descriptionHTML.text(); // <-- Get the text w/o any formatting at all
// This prints out the desired text with the <p><br> and </p>, but no <br/>
for (Element link : descriptionHTML)
{
String jsname = link.attr("jsname");
if( jsname.equals("C4s9Ed")){
System.out.println(link);
break;
}
}
I'd really appreciate any help with this issue.
Thanks,
Jack
HTML does not define a closing tag for <br> elements. XHTML however requires that the tag is marked as empty: <br />. JSoup parses both, but will print out only normal HTML (<br>).
If you use the XML parser in Jsoup, the <br> tags are not closed and so Jsoup tries to guess where to place matching closing tags </br> which are neither HTML nor XHTML compliant.
If you want to keep the line break info and strip out all other tags, I think you need to program that part outside of Jsoup. You could, for example, replace all <br> and <br /> strings with some other unique string, say "_brSplitPos_", then parse the document with JSoup, print out the text only, and replace "_brSplitPos_" with <br />:
String html = "<div>This<br>is<br />a<br>test</div>";
html = html.replaceAll("<br\\s*/?>", "_brSplitPos_"); // matches <br>, <br/> and <br />
Document docH = Jsoup.parse(html);
String onlyText = docH.text();
onlyText = onlyText.replace("_brSplitPos_", "<br />");
System.out.println(onlyText);
I have a wysiwyg editor that I can't modify that sometimes returns <p></p> which obviously looks like an empty field to the person using the wysiwyg.
So I need to add some validation on my back-end which uses java.
should be rejected
<p></p>
<p> </p>
<div><p> </p></div>
should be accepted
<p>a</p>
<div><p>a</p></div>
<p> </p>
<div><p>a</p></div>
basically as long as any element contains some content we will accept it and save it.
I am looking for libraries that I should look at and ideas for how to approach it. Thanks.
You may want to look at the jsoup library. It's pretty fast.
It takes HTML and lets you extract the text from it (see the example from their website below).
Extract attributes, text, and HTML from elements
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
String text = doc.body().text(); // "An example link"
I would advise you to do it on the client side, because this comes naturally to the browser. You need to hook into the send or "save" step of your wysiwyg editor; a lot of them support this.
The JavaScript would be:
function stripIfEmpty(html)
{
var tmp = document.createElement("DIV");
tmp.innerHTML = html;
var contentText = tmp.textContent || tmp.innerText || "";
if(contentText.trim().length === 0){
return "";
}else{
return html;
}
}
If you need to do this on the backend, then the only correct solution is to use a library that parses HTML, like jsoup, as @Dmytro Pastovenskyi showed.
If you want to use backend but allow it to be fuzzy, not strict, then you can use regex like replaceAll("\\<[^>]*>","") then trim, then check if the string is empty.
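The fuzzy regex check described above could be sketched in plain Java as follows. The class name HtmlEmptiness and the method isEffectivelyEmpty are hypothetical names chosen for this example, and the tag-stripping regex is deliberately naive (it does not understand comments, scripts, or a ">" inside attribute values).

```java
// Hypothetical validator for illustration; not part of any library.
final class HtmlEmptiness {
    /** Returns true when no visible text remains after tags are stripped. */
    static boolean isEffectivelyEmpty(String html) {
        if (html == null) {
            return true;
        }
        // Drop anything that looks like a tag, then ignore surrounding whitespace.
        String text = html.replaceAll("<[^>]*>", "");
        return text.trim().isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(isEffectivelyEmpty("<div><p> </p></div>")); // true
        System.out.println(isEffectivelyEmpty("<div><p>a</p></div>")); // false
    }
}
```

Note that an entity such as &nbsp; survives the tag stripping and therefore counts as content in this sketch; whether that is desirable depends on what the editor emits.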
You can use regular expressions (built-in to Java).
For example,
"<p>\\s*\\w+\\s*</p>"
would match a <p> tag whose content is a single run of one or more word characters (letters, digits, or underscores), optionally surrounded by whitespace.
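To see what this pattern accepts and rejects, here is a small sketch with java.util.regex. The class and method names are invented for the example; only the pattern string comes from the answer.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern itself is taken from the answer.
final class ParagraphPatternDemo {
    private static final Pattern P_WITH_WORD = Pattern.compile("<p>\\s*\\w+\\s*</p>");

    /** Returns true if the pattern is found anywhere in the given markup. */
    static boolean hasWordParagraph(String html) {
        return P_WITH_WORD.matcher(html).find();
    }

    public static void main(String[] args) {
        System.out.println(hasWordParagraph("<p>a</p>"));            // true
        System.out.println(hasWordParagraph("<div><p> </p></div>")); // false
        System.out.println(hasWordParagraph("<p>two words</p>"));    // false: \w+ cannot span the space
    }
}
```

The last case shows the main limitation: paragraphs containing several words, punctuation, or nested tags are not matched, so the pattern would need loosening before it could serve as a general validator.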
How do i get "this text" from the following html code using Jsoup?
<h2 class="link title"><a href="myhref.html">this text<img width=10
height=10 src="img.jpg" /><span class="blah">
<span>Other texts</span><span class="sometime">00:00</span></span>
</a></h2>
When I try
String s = document.select("h2.title").select("a[href]").first().text();
it returns
this textOther texts00:00
I tried to read the api for Selector in Jsoup but could not figure out much.
Also how do i get an element of class class="link title blah" (multiple classes?). Forgive me I only know both Jsoup and CSS a little.
Use Element#ownText() instead of Element#text().
String s = document.select("h2.link.title a[href]").first().ownText();
Note that you can select elements with multiple classes by concatenating the class selectors, as in h2.link.title, which selects <h2> elements that have at least both the link and title classes.