I wrote a method that inserts a div with text passed as a parameter.
Then I noticed I need to add various HTML content into that div. The current method boils down to these five basic lines:
//engine is the WebEngine object of some WebView object
Node html = engine.getDocument().getChildNodes().item(0);
Node body = html.getChildNodes().item(1);
Element e = engine.getDocument().createElement("div");
e.setTextContent(msg);
body.appendChild(e);
So here comes my question. Is there a way of parsing some HTML content into an Element object, so I can append that element to the document?
Example HTML String: <b>SomeText</b>
I solved the problem with JavaScript! I could append any HTML content with JS.
Example:
engine.executeScript("document.body.innerHTML += '<div><b>SomeText</b></div>' ");
I recently created a tool for exactly this; I hope it helps:
https://github.com/graycatdeveloper/JavaFXHtmlText
I'm new to jsoup and I am struggling to retrieve the tables with the class name verbtense and the headers Present and Past, under the div named Indicative, from this site: https://www.verbix.com/webverbix/Swedish/misslyckas
I have started off trying to do the following, but there are no results from the get go:
Document document = Jsoup.connect("https://www.verbix.com/webverbix/Swedish/misslyckas").get();
Elements tables = document.select("table[class=verbtense]"); // empty
I also tried this, but again no results:
Document document = Jsoup.connect("https://www.verbix.com/webverbix/Swedish/misslyckas").get();
Elements divs = document.select("div");
if (!divs.isEmpty()) {
    for (Element div : divs) {
        // all of these are empty
        Elements verbTenses = div.getElementsByClass("verbtense");
        Elements verbTables = div.getElementsByClass("verbtable");
        Elements tables = div.getElementsByClass("table verbtable");
    }
}
What am I doing incorrectly?
The page you are trying to scrape has dynamically generated content on the client side (with JavaScript), therefore you won't be able to extract the data using that link.
You might be able to scrape some content from the API call that this webpage makes, e.g. https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas
Inspect the browser's network console to see what the page is doing, and do the same.
The first catch is that this page loads its content asynchronously using AJAX and uses JavaScript to add the content to the DOM. You can even see the loader for a short time.
Jsoup can't parse and execute JavaScript so all you get is the initial page :(
The next step is to check what the browser is doing and where this additional content comes from. You can check it using Chrome's debugger (Ctrl + Shift + i). If you open the Network tab, filter to XHR requests only, and refresh the page, you can see two requests:
One of them returns this content: https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas
As you can see, it's JSON with HTML fragments, and this content seems to contain the verb forms you need. But here's another catch: unfortunately Jsoup can't parse JSON :( So you'll have to use another library to get the HTML fragment, and then you can parse it with Jsoup.
The general advice for downloading JSON is to ignore the content type (otherwise Jsoup will complain that it doesn't support JSON):
String json = Jsoup.connect("https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas").ignoreContentType(true).execute().body();
Then you'll have to use a JSON parsing library, for example json-simple, to obtain the HTML fragment, which you can then parse with Jsoup:
String json = Jsoup.connect(
        "https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas")
        .ignoreContentType(true).execute().body();
System.out.println(json);

// json-simple: parse the response and pull the HTML fragment out of the "p1" object
JSONObject jsonObject = (JSONObject) JSONValue.parse(json);
String htmlFragmentObtainedFromJson = (String) ((JSONObject) jsonObject.get("p1")).get("html");
Document document = Jsoup.parse(htmlFragmentObtainedFromJson);
System.out.println(document);
Now you can try your initial approach, using selectors to get what you want from the document object.
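For example, tying it back to the original question (the selector below assumes the fragment contains the verbtense tables described above), you could do something like:
// Sketch: pick out the conjugation tables from the parsed fragment.
Elements tables = document.select("table.verbtense");
for (Element table : tables) {
    System.out.println(table.select("th").text()); // headers such as Present / Past
    System.out.println(table.text());              // the verb forms themselves
}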
I usually find elements by their ID, tag, etc. But my element is text sitting directly in the body tag, with no tag of its own; how can I find it? I know it is in the body tag, but there are other elements in there too. The "text I want to find" is a PHP error being displayed, and I am hoping to catch it. I usually write WebElement x = driver.findElement(By.??); but I can't proceed because I am uncertain what to do.
Sample HTML doc
<head></head>
<body>
Text I want to find
<div>xx</div>
<div>yy</div>
</body>
The reason for the java tag is that I am using Java to write my code.
In your situation I'd use the "context item expression", i.e. the . (dot) operator. So if I write an XPath like this:
//div[contains(.,'Text To Be Searched')]
then it will find all div elements that contain the text Text To Be Searched. For your case, my answer would be:
driver.findElement(By.xpath("//body[contains(.,'Text I want to find')]"));
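If you need the stray text itself rather than the whole body element, here is a rough sketch (not from the original answer; it assumes the loose text is a direct child text node of body, as in the sample above):
// Option 1: read the first text node of <body> via JavaScript.
String looseText = (String) ((JavascriptExecutor) driver)
        .executeScript("return document.body.childNodes[0].textContent.trim();");

// Option 2: take the body's text and strip out the text of its child elements.
WebElement body = driver.findElement(By.tagName("body"));
String text = body.getText();
for (WebElement child : body.findElements(By.xpath("./*"))) {
    text = text.replace(child.getText(), "");
}
text = text.trim();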
You could wrap that text in a p tag and then write:
WebElement x = driver.findElement(By.tagName("p"));
Hi, I am using Jsoup to parse an HTML file. After parsing, I want to check whether the file contains the <html> tag. I am using the following code to check that:
htmlDom = parser.parse("<p>My First Heading</p>clk");
Elements pe = htmlDom.select("html");
System.out.println("size "+pe.size());
The output I get is "size 1" even though there is no html tag present. My guess is that this is because the html tag is not mandatory and is therefore implicit. The same goes for the head and body tags. Is there any way I can check for sure whether these tags are present in the input file?
Thank you.
It does not return 1 because the tag is implicit, but because it is present in the Document object htmlDom after you have parsed the custom HTML.
That is because Jsoup tries to conform to the HTML5 parsing rules, and thus adds missing elements and tries to fix broken document structure. I'm quite sure you would get 1 in return if you ran the following as well:
Elements pe = htmlDom.select("head");
System.out.println("size "+pe.size());
To parse the HTML without Jsoup trying to clean it or make it valid, you can instead use the included XML parser (Parser.xmlParser()), as below, which will parse the HTML as it is.
String customHtml = "<p>My First Heading</p>clk";
Document customDoc = Jsoup.parse(customHtml, "", Parser.xmlParser());
So, as opposed to your assumption in the comments of the question, this is very much possible to do with Jsoup.
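For completeness, a quick check along those lines (the expected counts assume the sample input from the question):
String customHtml = "<p>My First Heading</p>clk";
Document customDoc = Jsoup.parse(customHtml, "", Parser.xmlParser());
// With the XML parser nothing is auto-inserted, so these selectors only match
// tags that are literally present in the input.
System.out.println(customDoc.select("html").size()); // 0 -> no <html> tag in the input
System.out.println(customDoc.select("body").size()); // 0 -> no <body> tag either
System.out.println(customDoc.select("p").size());    // 1 -> the <p> tag is present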
I have to parse some HTML and remove the anchor tags, but I need to preserve the innerHTML of the anchor tags.
For example, if my html text is:
String html = "<div> <p> some text some link text </p> </div>"
Now I can parse the above HTML and select the a tags in jsoup like this:
Document doc = Jsoup.parse(inputHtml);
// this gives me all the anchor tag elements
Elements elements = doc.select("a");
and I can remove each of them with:
element.remove();
But that removes the complete anchor tag, from the opening bracket to the closing bracket, and the inner HTML is lost. How can I remove only the opening and closing tags while preserving the inner HTML?
Also, Please Note : I know there are methods to get outerHTML() and
innerHTML() from the element, but those methods only give me ways to
retrieve the text, the remove() method removes the complete html of
the tag. Is there any way in which I can only remove the outer tags
and preserve the innerHTML ?
Thanks a lot in advance and appreciate your help.
--Rajesh
Use unwrap; it preserves the inner HTML:
doc.select("a").unwrap();
Check the API docs for more info:
http://jsoup.org/apidocs/org/jsoup/select/Elements.html#unwrap%28%29
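A quick before/after to illustrate (the input string here is just an example):
Document doc = Jsoup.parse("<div><p>some text <a href=\"#\">some link text</a></p></div>");
doc.select("a").unwrap();              // drops the <a> tags but keeps their contents
System.out.println(doc.body().html()); // <div><p>some text some link text</p></div>
//                                        (modulo jsoup's pretty-printing whitespace)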
How about extracting the inner HTML first, adding it to the DOM and then removing your tags? This code is untested, but should do the trick:
Edit:
I updated the code to use replaceWith(), making the code more intuitive and probably more efficient; thanks to A.J.'s hint in the comments.
Document doc = Jsoup.parse(inputHtml);
Elements links = doc.select("a");
String baseUri = links.get(0).baseUri();
for (Element link : links) {
    Node linkText = new TextNode(link.html(), baseUri);
    // optionally wrap it in a tag instead:
    // Element linkText = doc.createElement("span");
    // linkText.html(link.html());
    link.replaceWith(linkText);
}
Instead of using a text node, you can wrap the inner html in anything you want; you might even have to, if there's not just text inside your links.
How do I get the HTML source code that was rendered by JavaScript in a webpage? How can I proceed with this, using XSL, JavaScript, or Java?
Get the entire HTML of the current page:
function getHTML() {
    var D = document,
        h = D.getElementsByTagName('html')[0],
        e;
    if (h.outerHTML) return h.outerHTML;
    e = D.createElement('div');
    e.appendChild(h.cloneNode(true));
    return e.innerHTML;
}
outerHTML is a non-standard property and thus might not be supported in some browsers (e.g., Firefox); in that case this function mimics the outerHTML feature by cloning the html node into an unattached element and reading its innerHTML property.
JavaScript provides
document.getElementsByTagName('')
You can get any tag with this call. Moreover, if you want to do any operation on a specific tag, assign an id to it; then you can use document.getElementById('') to operate on it.
These will give you the source code.
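If you are driving the page from Java, one way (a sketch assuming Selenium WebDriver, which the question does not mention explicitly) is to read the DOM after the scripts have run:
// Sketch, assuming a Selenium WebDriver session against a JS-rendered page.
WebDriver driver = new ChromeDriver();
driver.get("https://example.com");

// getPageSource() returns the page as the browser currently sees it...
String rendered = driver.getPageSource();

// ...or ask the browser directly, mirroring the getHTML() function above.
String outer = (String) ((JavascriptExecutor) driver)
        .executeScript("return document.documentElement.outerHTML;");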