Saving page content using selenium - java

I am using Selenium to gather data from a web portal. The problem is that the data is in XML format, but the URL extension is not .xml; it is displayed as .aspx because it is a .NET website. Using Selenium I can get the page source with driver.getPageSource(),
but that gives me the content wrapped in HTML. Separating the XML from the HTML is a real pain, and I have tried options such as JSoup, but there seems to be too much parsing involved.
Is there another way to make Selenium manipulate the browser? I can see that File -> Save As gives me the option to save the web page in XML format. How can I do this in Selenium? Are there any other APIs that can help me here?
Edit: My browser here is Internet Explorer.

Have you tried something like this?
String pageSource = driver.findElement(By.tagName("body")).getText();
Inspect the resulting pageSource content. If it contains only the XML, you can write it to a file using standard file operations.
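A minimal sketch of that approach, assuming the Internet Explorer driver is configured and using a placeholder URL:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.ie.InternetExplorerDriver;

public class SaveXmlFromBody {
    public static void main(String[] args) throws Exception {
        WebDriver driver = new InternetExplorerDriver();
        try {
            driver.get("https://example.com/data.aspx"); // placeholder URL
            // IE renders a raw XML response inside an HTML viewer page;
            // the text of <body> is often the XML itself
            String xml = driver.findElement(By.tagName("body")).getText();
            Files.write(Paths.get("page.xml"), xml.getBytes(StandardCharsets.UTF_8));
        } finally {
            driver.quit();
        }
    }
}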

Related

Why does my Crawler get the wrong HTML code?

I wanted to write a crawler in Java for a school exercise. The crawler code, implemented with the jsoup library, worked in the sense that my request returned some HTML code, but when I searched for a word that was clearly visible on the website, it was not found, because some divs in the crawled result were empty.
Then I realized that I was getting the same code you see when you navigate to the website and right-click -> 'view page source'.
When I compared that to right-click -> 'inspect', the code was not the same as in 'view page source'.
Is there anything I can do to get the HTML code containing the full content?
requested URL: https://app.libertex.com/?lang=deu&_ga=2.222573595.1459393376.1568209606-1642141519.1566978579&_gac=1.53153498.1566978579.CjwKCAjwzJjrBRBvEiwA867byuxkXf35eSWyL2LJhLel3PRiGsSfvU6iLb00E21dQOkogLcx_z5G6hoCQgwQAvD_BwE
You can't get the right code with jsoup, because this website loads its content dynamically.
The page loads some initial content and then executes other code to load the rest. jsoup is merely an HTML parser: it can parse whatever content it is given, but it cannot execute JavaScript or wait for external resources to load.
To scrape a website like this, you need an automated browser of some sort. I personally use Selenium in Python for crawling websites that load content dynamically.
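The same idea in Java (the language used in the other questions here) might look like this minimal sketch, assuming Firefox with geckodriver; the base URL stands in for the full one above:

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class DynamicPageFetch {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver(); // requires geckodriver on the PATH
        try {
            driver.get("https://app.libertex.com/?lang=deu");
            // Unlike jsoup, the browser executes the page's JavaScript,
            // so the source now includes the dynamically loaded divs
            String html = driver.getPageSource();
            System.out.println(html.length());
        } finally {
            driver.quit();
        }
    }
}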

How to read current source html from a webpage using Java/Perl/Python (e.g. after editing it using the js console)

I realize this looks like a duplicate question, but it's not! (As far as I know, and I've searched a lot.) For the last few days I've been trying to get the HTML content of my WhatsApp Web application, but the input stream reader provided by Java does not give me the full HTML code. The URL I'm using is just https://web.whatsapp.com/, which I suppose could be a problem, but there aren't any personal URLs as far as I'm aware. However, in developer tools I can easily access and read the DOM elements I'm interested in with the element inspector. I'm wondering if there's a way to get this source directly using Java/Perl/Python.
I'm also doing this as a learning project, so I would prefer to stay away from tools such as jsoup. Thanks!
You can use selenium.webdriver in Python. Something like:
from selenium import webdriver

# Launch a Firefox instance driven by Selenium
browser = webdriver.Firefox()
browser.get("https://web.whatsapp.com/")

# page_source reflects the DOM after the page's JavaScript has run
html = browser.page_source
If you want your own WhatsApp page, you should use Selenium to log into the site before reading page_source.

Extracting web page with jquery content data

I need to extract data from particular websites, say a comment section of a website. What I already tried is extracting the HTML text using jsoup, but since the comment section is populated by jQuery, I only get the jQuery code, not the comment text. Any suggestions to solve my problem? Thank you.
You can use HtmlUnit to render the page with all the needed content and then extract the data from the built DOM tree. HtmlUnit's documentation also covers what to do if AJAX doesn't work out of the box.
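As an illustration, a minimal HtmlUnit sketch might look like this; the URL and the .comment-text selector are placeholders for the real site and its comment markup:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class CommentScraper {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            // JavaScript is on by default, so the jQuery code that builds
            // the comment section actually runs
            HtmlPage page = webClient.getPage("https://example.com/article");
            // Give background AJAX requests time to finish
            webClient.waitForBackgroundJavaScript(5000);
            page.querySelectorAll(".comment-text")
                .forEach(node -> System.out.println(node.getTextContent()));
        }
    }
}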

Read HTML page, after javascript (java)

In my project I need to read some web pages. Usually it is pretty easy: I read the source code using Java classes, parse the output, and save the interesting data.
But sometimes it is harder, for example when reading Google pages. I think it is because of JavaScript. Do you know how to get the real web page code, I mean the code after JavaScript has run? For example, if I analyse the page using the Firebug extension for Firefox, I see exactly what I need: the JavaScript is correctly replaced by its results. Any idea how to do this in Java?
Thanks in advance
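The Selenium approach suggested in the answers above applies here as well: drive a real browser, wait for the scripts to finish, then read the source. A minimal sketch, assuming Selenium 4 with Firefox (the URL and the #search selector are illustrative assumptions):

import java.time.Duration;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class RenderedSource {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();
        try {
            driver.get("https://www.google.com/search?q=selenium");
            // Wait until an element produced by JavaScript is present
            new WebDriverWait(driver, Duration.ofSeconds(10))
                .until(ExpectedConditions.presenceOfElementLocated(By.cssSelector("#search")));
            // getPageSource() now reflects the DOM after script execution,
            // much like what Firebug shows
            System.out.println(driver.getPageSource());
        } finally {
            driver.quit();
        }
    }
}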

How to generate the PDF in CQ5.6.1 using page content in cq5

How do I generate a PDF in CQ 5.6.1 using page content?
There is a button on my site (generate PDF); on click of the button I have to generate a PDF file from the same page's content.
Please let me know whether there is an out-of-the-box PDF generator in CQ, or whether I need to get a licensed product to generate the PDF.
Thanks.
Adobe CQ is integrated with Apache FOP, a formatter able to create PDF files. There is a tutorial describing how to enable a content rewriter that provides a PDF version of the content under the .pdf extension.
However, keep in mind that this approach requires manually writing an XSLT transform able to process your page (and every component on it) and output an XSL-FO document.
In a previous project (CQ 5.5) we used https://code.google.com/p/wkhtmltopdf/ to create PDF files. It worked pretty well!
I have used PhantomJS to create a custom PDF from CQ5 pages. For example, if you do not want to display the right rail in the PDF, or you want to disable the header and footer, all of this can be achieved with the help of PhantomJS.
Create a servlet which executes a command on your server:
phantomjs <custom.js> 'page_url' nameofthepdf.pdf
Here custom.js shows or hides HTML content based on your needs.
This will work for any page, whether it comes from CQ5 or any other tool.
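A minimal sketch of such a servlet, assuming phantomjs is on the server's PATH and using placeholder paths for the rendering script and output file:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class PdfExportServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        String pageUrl = req.getParameter("url"); // URL of the page to convert
        // Run PhantomJS with the custom show/hide script; paths are placeholders
        Process p = new ProcessBuilder(
                "phantomjs", "/opt/scripts/custom.js", pageUrl, "/tmp/export.pdf")
                .redirectErrorStream(true)
                .start();
        try {
            p.waitFor();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new ServletException(e);
        }
        // Stream the generated PDF back to the client
        resp.setContentType("application/pdf");
        Files.copy(Paths.get("/tmp/export.pdf"), resp.getOutputStream());
    }
}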
