I am given a URL, and I need to fetch that page's HTML and extract the links it contains.
I thought about using a headless browser. I'm using Java, so I would like to do all of this from within a Java process.
An example site would be CNN.
So far I have tried HtmlUnit:
testCompile 'net.sourceforge.htmlunit:htmlunit:2.32'
@Test
public void htmlUnitTest() throws Exception {
    try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
        webClient.waitForBackgroundJavaScriptStartingBefore(20000);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        final HtmlPage page = webClient.getPage(URL);
        WebResponse response = page.getWebResponse();
        String content = response.getContentAsString();
        List<HtmlAnchor> anchors = page.getAnchors();
        System.out.println("anchors.size() : " + anchors.size());
        System.out.println("***********");
        System.out.println(content);
        System.out.println("***********");
        try (BufferedWriter writer = new BufferedWriter(new FileWriter("htmlUnit.txt"))) {
            writer.write(content);
        }
    }
}
But the response I get is the original HTML, not the rendered page (the JavaScript has not run, so the anchors it creates are missing in my case).
Can someone recommend another library, or tell me whether I am misusing HtmlUnit and suggest a working solution? That would be very helpful.
The waitForBackgroundJavaScriptXX methods are not options; you have to call them AFTER getPage(URL) or any other interaction like click().
One of the major differences between HtmlUnit and Selenium is the integration of all parts. In HtmlUnit the JavaScript engine is part of the implementation, so the API can query the engine's current state. As a result, the waitForBackgroundJavaScriptXX methods only wait if some JavaScript is pending; if there is none, they are no-ops.
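To make the point about pending jobs concrete, here is a toy model in plain Java. It is not HtmlUnit code: PendingJobModel, schedule and waitForJobs are names made up for this sketch, and the scheduler merely stands in for a page's background JavaScript. It only illustrates why a wait issued before anything has been loaded returns immediately, while the same wait issued after the load blocks until the work is done.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model only: these names are hypothetical and are not part of HtmlUnit.
class PendingJobModel {
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();
    private final AtomicInteger pending = new AtomicInteger();

    // Stands in for a page load that queues background script work.
    void schedule(Runnable job, long delayMillis) {
        pending.incrementAndGet();
        executor.schedule(() -> {
            try {
                job.run();
            } finally {
                pending.decrementAndGet();
            }
        }, delayMillis, TimeUnit.MILLISECONDS);
    }

    // Blocks only while jobs are pending; returns how many are still pending.
    int waitForJobs(long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (pending.get() > 0 && deadline - System.nanoTime() > 0) {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return pending.get();
    }

    void shutdown() {
        executor.shutdownNow();
    }

    public static void main(String[] args) {
        PendingJobModel model = new PendingJobModel();
        AtomicInteger anchors = new AtomicInteger();

        // Nothing has been scheduled yet, so this wait is a no-op.
        model.waitForJobs(20_000);
        System.out.println("anchors before load: " + anchors.get());

        // "Load the page": a script creates the anchors a little later.
        model.schedule(() -> anchors.set(5), 100);
        model.waitForJobs(20_000);
        System.out.println("anchors after wait: " + anchors.get());

        model.shutdown();
    }
}
```

Applied to the question, this suggests moving the waitForBackgroundJavaScriptStartingBefore(20000) call so that it comes after webClient.getPage(URL).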
I am using HtmlUnit 2.10 to build a small link validator for a website, and I use it for the crawling as well. During my research I was trying to crawl loans.xxxxxxx.com. It has 58 anchor tags and 5 link tags.
My code looks like this:
List<HtmlElement> elementsOfPage = (List<HtmlElement>) htmlPage.getElementsByTagName("link");
Iterator<HtmlElement> it = elementsOfPage.iterator();
System.out.println(elementsOfPage.size());
while (it.hasNext()) {
    HtmlElement htmlElement = it.next();
    System.out.println(htmlElement.toString());
}
I follow the same procedure for anchor tags (a). For link it shows only 3 and for anchors only 56, even though there are 5 and 58 respectively.
Some portions of the page source are commented out. I thought the web client would ignore them, but when I print the elements, some of the results actually come from the commented-out markup.
Before running the web client, I disabled applets, CSS, and JavaScript, and increased the timeout to 7 seconds.
Why does it behave this oddly?
How do you get such numbers as 58 and 5? I checked the URL you provided with HtmlUnit 2.10 and the JSoup parser. The code is (Groovy, but almost Java):
def client = new WebClient(BrowserVersion.FIREFOX_3_6)
client.setThrowExceptionOnScriptError(false);
def page = (HtmlPage)client.getPage("http://loans.bankofamerica.com/en/index.html")
def doc = Jsoup.parse(page.asXml())
println doc.select("a").size()
println doc.select("link").size()
The results are 56 and 2. But with the default user agent:
def client = new WebClient()
The results are 56 and 3! It seems the server returns different markup depending on the User-Agent string (and maybe other headers).
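If you want to see this effect in isolation, here is a small sketch using only the JDK (Java 11+). Everything in it is made up for illustration: the UserAgentMarkupDemo class, the throwaway local server, and its rule of sending one extra link tag to anything that does not look like Firefox. It is not how the real site behaves; it only shows how to request the same URL with two User-Agent values and compare what comes back.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: a local server that varies its markup by User-Agent.
class UserAgentMarkupDemo {

    // Starts a throwaway server on a free local port.
    static HttpServer startServer() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                String ua = exchange.getRequestHeaders().getFirst("User-Agent");
                // Made-up rule: non-Firefox clients get one extra <link> tag.
                String extra = (ua != null && ua.contains("Firefox"))
                        ? "" : "<link rel=\"alternate\" href=\"/m\">";
                String html = "<html><head><link rel=\"stylesheet\" href=\"/a.css\">"
                        + extra + "</head><body><a href=\"/x\">x</a></body></html>";
                byte[] body = html.getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Fetches the URL with the given User-Agent and counts "<link" occurrences.
    static int countLinkTags(String url, String userAgent) {
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                    .header("User-Agent", userAgent)
                    .build();
            String html = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString())
                    .body();
            int count = 0;
            for (int i = html.indexOf("<link"); i >= 0; i = html.indexOf("<link", i + 1)) {
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        HttpServer server = startServer();
        try {
            String url = "http://127.0.0.1:" + server.getAddress().getPort() + "/";
            System.out.println("Firefox-like UA: " + countLinkTags(url, "Mozilla/5.0 Firefox/3.6"));
            System.out.println("Other UA:        " + countLinkTags(url, "SomethingElse/1.0"));
        } finally {
            server.stop(0);
        }
    }
}
```

Against a real site you would only need the countLinkTags part. A plain substring count is crude (it would also count tags inside comments), so a proper parser such as JSoup is the better choice for real checks.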
Is it possible to teach HTMLUnit to ignore certain javascript scripts/files on a web page? Some of them are just out of my control (like jQuery) and I can't do anything with them. Warnings are annoying, for example:
[WARN] com.gargoylesoftware.htmlunit.javascript.host.html.HTMLDocument:
getElementById(script1299254732492) did a getElementByName for Internet Explorer
Actually I'm using JSFUnit and HTMLUnit works under it.
If you want to avoid exceptions because of any JavaScript errors:
webClient.setThrowExceptionOnScriptError(false);
Well, I have yet to find a way to do that, but I have found an effective workaround: try extending FalsifyingWebConnection. Look at the example code below.
public class PinConnectionWrapper extends FalsifyingWebConnection {
    public PinConnectionWrapper(WebClient webClient)
            throws IllegalArgumentException {
        super(webClient);
    }
    @Override
    public WebResponse getResponse(WebRequest request) throws IOException {
        WebResponse res = super.getResponse(request);
        if (res.getWebRequest().getUrl().toString().endsWith("/toolbar.js")) {
            return createWebResponse(res.getWebRequest(), "",
                    "application/javascript", 200, "Ok");
        }
        return res;
    }
}
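In the code above, whenever HtmlUnit requests toolbar.js, the wrapper simply returns a fake empty response. You can plug the wrapper class into HtmlUnit as below; constructing it is enough, because the WebConnectionWrapper superclass constructor installs it as the WebClient's connection.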
final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_3_6);
new PinConnectionWrapper(webClient);
Take a look at WebClient.setScriptPreProcessor. It will give you the opportunity to modify (or in your case, stop) a given script before it is executed.
Also, if it is just the warnings getting on your nerves I would suggest changing the log level.
If you are interested in ignoring all warning log entries, you can set the log level to ERROR for com.gargoylesoftware.htmlunit.javascript.host.html.HTMLDocument in the log4j.properties file (INFO would still let WARN entries through).
LogFactory.getFactory().setAttribute("org.apache.commons.logging.Log", "org.apache.commons.logging.impl.NoOpLog");
java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("org.apache.commons.httpclient").setLevel(Level.OFF);
Insert this code.
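One caveat with the java.util.logging calls: the JDK may garbage-collect a logger that nobody holds a strong reference to, and the level set on it is lost with it. Below is a minimal sketch of keeping the logger in a static field so the setting sticks. QuietHtmlUnitLogging and silence() are names invented for this example. It also assumes HtmlUnit's commons-logging output actually ends up in java.util.logging, which as far as I know is the fallback only when no Log4j is on the classpath.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper; only the logger name comes from HtmlUnit's package.
class QuietHtmlUnitLogging {
    // Strong reference, so the configured level is not lost to garbage collection.
    private static final Logger HTMLUNIT_LOGGER =
            Logger.getLogger("com.gargoylesoftware.htmlunit");

    // Turns off everything logged under com.gargoylesoftware.htmlunit.
    static Logger silence() {
        HTMLUNIT_LOGGER.setLevel(Level.OFF);
        return HTMLUNIT_LOGGER;
    }

    public static void main(String[] args) {
        Logger parent = silence();
        // Child loggers without their own level inherit the parent's level.
        Logger child = Logger.getLogger(
                "com.gargoylesoftware.htmlunit.javascript.host.html.HTMLDocument");
        System.out.println("parent level: " + parent.getLevel());
        System.out.println("child logs WARNING: " + child.isLoggable(Level.WARNING));
    }
}
```

Call silence() once, early, before the WebClient is created.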
I want to be able to insert a Java applet into a web page dynamically using a Javascript function that is called when a button is pressed. (Loading the applet on page load slows things down too much, freezes the browser, etc...) I am using the following code, which works seamlessly in FF, but fails without error messages in IE8, Safari 4, and Chrome. Does anyone have any idea why this doesn't work as expected, and how to dynamically insert an applet in a way that works in all browsers? I've tried using document.write() as suggested elsewhere, but calling that after the page has loaded results in the page being erased, so that isn't an option for me.
function createPlayer(parentElem)
{
    // The abc variable is declared and set here
    player = document.createElement('object');
    player.setAttribute("classid", "java:TunePlayer.class");
    player.setAttribute("archive", "TunePlayer.class,PlayerListener.class,abc4j.jar");
    player.setAttribute("codeType", "application/x-java-applet");
    player.id = "tuneplayer";
    player.setAttribute("width", 1);
    player.setAttribute("height", 1);
    param = document.createElement('param');
    param.name = "abc";
    param.value = abc;
    player.appendChild(param);
    param = document.createElement('param');
    param.name = "mayscript";
    param.value = true;
    player.appendChild(param);
    parentElem.appendChild(player);
}
document.write() will overwrite your entire document. If you want to keep the document and just add an applet, you'll need to append it.
var app = document.createElement('applet');
app.id= 'Java';
app.archive= 'Java.jar';
app.code= 'Java.class';
app.width = '400';
app.height = '10';
document.getElementsByTagName('body')[0].appendChild(app);
This code will add the applet as the last element of the body tag. Make sure it runs after the DOM has been processed, or you will get an error; running it from the body onload handler or jQuery's ready callback is recommended.
I would have suggested doing something like what you're doing; so I'm baffled as to why it's not working.
Here's a document that looks pretty authoritative, coming from the horse's mouth as it were. It mentions the idiosyncrasies of different browsers. You may end up needing to do different tag soups for different implementations.
But maybe there's something magic about applet/object tags that keeps them from being processed if inserted dynamically. Having no more qualified advice, I have a crazy workaround to offer you: Howzabout you present the applet on a different page, and dynamically create an IFRAME to show that page in the space your applet should occupy? IFRAMEs are a bit more consistent in syntax across browsers, and I'd be surprised if they were to fail the same way.
Maybe you should use your browser's debugging tools to look at the DOM after you swap in your applet node. Maybe it's not appearing where you think it is, or not with the structure you think you're creating. Your code looks OK to me but I'm not very experienced with dynamic applets.
There is a JavaScript library for this purpose:
http://www.java.com/js/deployJava.js
// launch the Java 2D applet on JRE version 1.6.0
// or higher with one parameter (fontSize)
<script src=
"http://www.java.com/js/deployJava.js"></script>
<script>
var attributes = {code:'java2d.Java2DemoApplet.class',
archive:'http://java.sun.com/products/plugin/1.5.0/demos/plugin/jfc/Java2D/Java2Demo.jar',
width:710, height:540} ;
var parameters = {fontSize:16} ;
var version = '1.6' ;
deployJava.runApplet(attributes, parameters, version);
</script>
I did something similar to what Beachhouse suggested. I modified the deployJava.js like this:
writeAppletTag: function(attributes, parameters) {
    ...
    // don't write directly to document anymore
    //document.write(startApplet + '\n' + params + '\n' + endApplet);
    var appletString = startApplet + '\n' + params + '\n' + endApplet;
    var divApplet = document.createElement('div');
    divApplet.id = "divApplet";
    divApplet.innerHTML = appletString;
    divApplet.style = "visibility: hidden; display: none;";
    document.body.appendChild(divApplet);
}
It worked ok on Chrome, Firefox and IE. No problems so far.
At first I tried having a div already present in my HTML file and just setting its innerHTML to the appletString, but only IE was able to detect the new applet dynamically. Inserting the whole div directly into the body works in all browsers.
Create a new applet element and append it to an existing element using appendChild.
var applet = document.createElement('applet');
applet.id = 'player';
...
var param = document.createElement('param');
...
applet.appendChild(param);
document.getElementById('existingElement').appendChild(applet);
Also, make sure the existing element is visible, meaning you haven't set css to hide it, otherwise the browser will not load the applet after using appendChild. I spent too many hours trying to figure that out.
This worked for me:
// my js code
var app = document.createElement('applet');
app.code= 'MyApplet2.class';
app.width = '400';
app.height = '10';
var p1 = document.createElement('param');
p1.name = 'sm_UnwindType';
p1.value='200';
var p2 = document.createElement('param');
p2.name = 'sm_Intraday';
p2.value='300';
app.appendChild(p1);
app.appendChild(p2);
var appDiv = document.getElementById('applet_div');
appDiv.appendChild(app);
-----html code:
<div id="applet_div"></div>
I'm just getting started with HTMLUnit and what I'm looking to do is take a webpage and extract out the raw text from it minus all the html markup.
Can htmlunit accomplish that? If so, how? Or is there another library I should be looking at?
for example if the page contains
<body><p>para1 test info</p><div><p>more stuff here</p></div>
I'd like it to output
para1 test info more stuff here
thanks
http://htmlunit.sourceforge.net/gettingStarted.html indicates that this is indeed possible.
@Test
public void homePage() throws Exception {
    final WebClient webClient = new WebClient();
    final HtmlPage page = webClient.getPage("http://htmlunit.sourceforge.net");
    assertEquals("HtmlUnit - Welcome to HtmlUnit", page.getTitleText());
    final String pageAsXml = page.asXml();
    assertTrue(pageAsXml.contains("<body class=\"composite\">"));
    final String pageAsText = page.asText();
    assertTrue(pageAsText.contains("Support for the HTTP and HTTPS protocols"));
}
NB: the page.asText() method seems to offer exactly what you are after.
Javadoc for asText (Inherited from DomNode to HtmlPage)
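Since the question also asks about other libraries: if you only need the text of static HTML and want no extra dependency, the JDK's own Swing HTML parser can do a rough job. This is a different technique from HtmlUnit's asText(), not a replacement for it. HtmlTextExtractor and extractText are names made up for this sketch. The parser is old and lenient, so treat it as a sketch for simple markup like the example in the question rather than something that will cope with every modern page.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper built on the JDK's Swing HTML parser (not HtmlUnit).
class HtmlTextExtractor {

    // Collects every text run the parser reports and joins them with spaces.
    static String extractText(String html) {
        List<String> chunks = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                String text = new String(data).trim();
                if (!text.isEmpty()) {
                    chunks.add(text);
                }
            }
        };
        try {
            // true = ignore any charset declared inside the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return String.join(" ", chunks);
    }

    public static void main(String[] args) {
        String html = "<body><p>para1 test info</p><div><p>more stuff here</p></div>";
        System.out.println(extractText(html));
    }
}
```

If the page builds its content with JavaScript, this will not see that content; HtmlUnit's asText() on the loaded page is the better fit there.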
I am trying to be able to test a website that uses javascript to render most of the HTML. With the HTMLUNIT browser how would you be able to access the html generated by the javascript? I was looking through their documentation but wasn't sure what the best approach might be.
WebClient webClient = new WebClient();
HtmlPage currentPage = webClient.getPage("some url");
String Source = currentPage.asXml();
System.out.println(Source);
This is an easy way to get back the html of the page but would you use the domNode or another way to access the html generated by the javascript?
You gotta give some time for the JavaScript to execute.
Check a sample working code below. The bucket divs aren't in the original source.
import java.io.IOException;
import java.net.MalformedURLException;
import java.util.List;
import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
public class GetPageSourceAfterJS {
    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
        java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF); /* turns off annoying HtmlUnit warnings; comment out to see them */
        WebClient webClient = new WebClient();
        String url = "http://www.futurebazaar.com/categories/Home--Living-Luggage--Travel-Airbags--Duffel-bags/cid-CU00089575.aspx";
        System.out.println("Loading page now: "+url);
        HtmlPage page = webClient.getPage(url);
        webClient.waitForBackgroundJavaScript(30 * 1000); /* waits up to 30s for background JavaScript to finish */
        String pageAsXml = page.asXml();
        System.out.println("Contains bucket? --> "+pageAsXml.contains("bucket"));
        //get divs which have a 'class' attribute of 'bucket'
        List<?> buckets = page.getByXPath("//div[@class='bucket']");
        System.out.println("Found "+buckets.size()+" 'bucket' divs.");
        //System.out.println("#FULL source after JavaScript execution:\n "+pageAsXml);
    }
}
Output:
Loading page now: http://www.futurebazaar.com/categories/Mobiles-Mobile-Phones/cid-CU00089697.aspx?Rfs=brandZZFly001PYXQcurtrayZZBrand
Contains bucket? --> true
Found 3 'bucket' divs.
HtmlUnit version used:
<dependency>
<groupId>net.sourceforge.htmlunit</groupId>
<artifactId>htmlunit</artifactId>
<version>2.12</version>
</dependency>
Assuming the issue is HTML generated by JavaScript as a result of AJAX calls, have you tried the 'AJAX does not work' section in the HtmlUnit FAQ?
There's also a section in the howtos about how to use HtmlUnit with JavaScript.
If your question isn't answered here, I think we'll need some more specifics to be able to help.