get HTML element outerHTML without 3rd-party libraries - java

At the moment, I'm using HtmlUnit and it's painfully slow (>10s):
final WebClient webClient = new WebClient();
HtmlPage page = webClient.getPage(url_with_hash_fragment);
out.println(page.getElementById(hash_fragment).asXml());
Ideally, I'd like to do this without adding additional dependencies. (I'm not using HTMLUnit otherwise.)
I tried using HTMLEditorKit but couldn't figure out how to use it for outerHTML. I'd rather not use regular expressions, but that's preferred to waiting 10s and bloating my code.

Related

HTML Unit - read from a normal string?

I want to use HtmlUnit with Java.
In all the examples I've seen, the HTML code is read from a specific website, but I want to parse the HTML source from a String instead.
Like this:
String myString = "<html> myString and Content </html>";
HtmlPage page = myString; // doesn't work, how can I do something like this?
I see only examples like this:
final WebClient webClient = new WebClient();
final HtmlPage page = webClient.getPage("http://htmlunit.sourceforge.net");
Can I also read just a table?
Like this:
String myTable = "<table><td></td></table>";
HtmlTable table = myTable; // doesn't work, how can I do something like this?
How can I do this conversion correctly?
Can anybody help me, please?
HtmlUnit isn't really designed for this use case, so it will always be a bit of a hassle to make it work. If you're not tied to HtmlUnit specifically, you might be better off using something like jsoup, which has better built-in support for parsing HTML from strings.
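For comparison, parsing from a String is jsoup's native mode of operation. A minimal sketch (the table id here is made up for illustration):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document doc = Jsoup.parse("<html><table id=\"t1\"><tr><td>cell</td></tr></table></html>");
Element table = doc.getElementById("t1");   // DOM-style lookup on the parsed string
System.out.println(table.outerHtml());      // prints the element's outer HTML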
That said, if you are tied to HtmlUnit, it's possible to make this work. For inspiration, you could look at how HtmlUnit sets up HtmlPage objects in its own test suite.
As you can see there, although there's no way to construct an HtmlPage directly from a String, you can make a MockWebConnection that'll give a canned response without involving the network. So your code could look something like this:
String html = "<html>Your html here</html>";
WebClient client = new WebClient();
MockWebConnection connection = new MockWebConnection();
connection.setDefaultResponse(html);       // every request receives this canned HTML
client.setWebConnection(connection);
HtmlPage page = client.getPage(someUrl);   // any well-formed URL works; the mock answers without touching the network
(Apologies for any errors in the above -- I'm no longer on a Java project, so I don't have a convenient way to test this right now. That said, I did spend some time on a large Java project that used roughly this technique for a lot of tests. It worked reasonably well, but it tended to be a bit fragile when we upgraded HtmlUnit. Overall, we were happier when we moved to Jsoup.)
Here is another way of doing it, similar to Collum's but a little different.
WebClient webClient = new WebClient();
URL url = new URL("http://example.com");   // base URL, used to resolve any relative links in the HTML
StringWebResponse response = new StringWebResponse("<html> myString and Content </html>", url);
HtmlPage page = HTMLParser.parseHtml(response, webClient.getCurrentWindow());
As for getting the table, it is possible. You can load the page with the method above and extract the table with the code below (assuming the table has id "table1"):
HtmlTable table = page.getHtmlElementById("table1");
You can iterate over the rows and cells with the code below:
for (final HtmlTableRow row : table.getRows()) {
    System.out.println("Found row");
    for (final HtmlTableCell cell : row.getCells()) {
        System.out.println(" Found cell: " + cell.asText());
    }
}
And you can access specific cells as in the example below:
System.out.println("Cell (1,2)=" + table.getCellAt(1,2).asText());
Please comment if you get stuck, and I may be able to help.

Render Tapestry page and get it as Stream/String resource

Is there any convenient way to dynamically render a page inside the application and then retrieve its contents as an InputStream or a String?
For example, the simplest way is:
// generate url
Link link = linkSource.createPageRenderLink("SomePageLink");
String urlAsString = link.toAbsoluteURI() + "/customParam/" + customParamValue;
// get info stream from url
HttpGet httpGet = new HttpGet(urlAsString);
httpGet.addHeader("cookie", request.getHeader("cookie"));
HttpResponse response = new DefaultHttpClient().execute(httpGet);
InputStream is = response.getEntity().getContent();
...
But it seems there must be an easier way to achieve the same result. Any ideas?
I created tapestry-offline for exactly this purpose. Please be aware of the issue here (workaround included).
It's probably best to understand your exact use case. If, for example, you are generating emails in a scheduled task, it's probably better to configure Jenkins or cron to hit a URL.
It's probably also worth mentioning the capture component from tapestry-stitch.
This is only useful in situations where you want to capture part of a page as a String during page/component render.

How to change I18N locale NOT by querystring?

This is a struts2 question.
Currently, I am using i18n for internationalization in my webapp.
Some of my JSP pages have a query string that stores request information.
For example,
http://myWebsite.com/myWebsite/myPage?productId=12345
When users try to switch the language, I rewrite the URL with JavaScript to
http://myWebsite.com/myWebsite/myPage?request_locale=zh_CN
And the query string is lost.
My URLs are used in several different forms:
http://myWebsite.com/myWebsite/myPage
http://myWebsite.com/myWebsite/myPage?productId=12345#myAnchor
http://myWebsite.com/myWebsite/myPage?productId=12345&key2=value2&key3=value3#myAnchor
http://myWebsite.com/myWebsite/myPage?productId=12345&key2=value2&key3=value3&request_locale=zh_CN
http://myWebsite.com/myWebsite/myPage?productId=12345&key2=value2&key3=value3
...
When I try to handle all these variations in JavaScript, it becomes very complicated.
Is there any good way to retain the querystrings and anchors after switching locale?
When you rewrite the URL using JS, you should do it based on the current URL.
So, if the browser is showing
http://myWebsite.com/myWebsite/myPage?productId=12345
you should rewrite it as
http://myWebsite.com/myWebsite/myPage?productId=12345&request_locale=zh_CN
To do that, use JS to read the URL currently displayed (keeping its query string and anchor) and append only the request_locale parameter. Take a look at this question to see how.

Scraping Data. Save File?

I am trying to scrape data from a website that uses JavaScript to load much of its content. Right now I am using jsoup to parse HTML pages; however, since much of the content is loaded by JavaScript, I haven't been able to parse the data I want.
How should I go about getting this javascript content? Should I first save the page then load and parse it using jSoup? If so, what should I use to load javascript content before I save? Is there an API which you would recommend that could output html?
Currently using Java.
You might be interested in checking out pjscrape (disclaimer: this is my project). It's a command-line tool using PhantomJS to allow scraping using JavaScript and jQuery in a full browser context - among other things, you can define a "ready" function for the page and wait to scrape until the function (which might check for the existence of certain DOM elements, etc) returns true.
The other option, depending on the page, is to use a console like Firebug to figure out what data is being loaded (i.e. what URLs are being retrieved by the AJAX calls on the page), and scrape the data directly from those URLs.
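If you go that second route, the fetch itself needs nothing beyond the JDK. A minimal sketch, assuming you've spotted the endpoint in Firebug's Net panel (the URL below is hypothetical):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// hypothetical endpoint discovered by watching the page's AJAX traffic
URL endpoint = new URL("http://example.com/data.json");
HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
StringBuilder body = new StringBuilder();
try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
    String line;
    while ((line = in.readLine()) != null) {
        body.append(line).append('\n');
    }
}
System.out.println(body);  // raw response (often JSON) to parse directly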
If the data are generated with JavaScript, the values are usually still present in the downloaded page, embedded in the script source.
It's simplest to parse them directly on the fly, the same way you would parse plain HTML or text.
If you cannot isolate the tokens with the jsoup API, just parse them with direct String operations, treating the page as plain text.
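For example, a minimal sketch of pulling a value out of an inline script with plain String operations (the marker "var productData = " is made up for illustration; html holds the raw page source):
String marker = "var productData = ";       // hypothetical marker in an inline <script>
int start = html.indexOf(marker);
if (start != -1) {
    start += marker.length();
    int end = html.indexOf(';', start);     // assume the value ends at the next semicolon
    String value = html.substring(start, end).trim();
    System.out.println(value);
}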
I tried using HtmlUnit; however, I found it very slow.
I ended up using the curl command line function within java which worked for my purposes.
String command = "curl " + url;
Process p = Runtime.getRuntime().exec(command);
BufferedReader stdInput = new BufferedReader(new InputStreamReader(p.getInputStream()));
String s;          // current line
String html = "";  // accumulated page source
while ((s = stdInput.readLine()) != null) {
    html = html + s + "\n";
}
return html;
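One caveat with the above: Runtime.exec with a single command string breaks if the URL contains spaces or shell metacharacters. A sketch of the same call with ProcessBuilder, which passes arguments separately:
import java.io.BufferedReader;
import java.io.InputStreamReader;

ProcessBuilder pb = new ProcessBuilder("curl", "-s", url);  // "-s" silences curl's progress output
pb.redirectErrorStream(true);                               // fold stderr into stdout
Process p = pb.start();
StringBuilder html = new StringBuilder();
try (BufferedReader in = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
    String line;
    while ((line = in.readLine()) != null) {
        html.append(line).append('\n');
    }
}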

Java getting source code from a website

I have a problem once again where I can't find the source code because it's hidden or something... When my Java program indexes the page, it finds everything but the info I need... I assume it's hidden for a reason, but is there any way around this?
It's just a bunch of tr/td tags that show up in Firebug but don't show up when viewing the page source or when I do the below:
URL url = new URL("my url");
URLConnection yc = url.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
    System.out.println(inputLine);  // loop body was cut off in the original post; printing is a placeholder
}
I really have no idea how to attempt to get the info that i need...
The reason for this behavior is probably that those tags are dynamically injected into the DOM using JavaScript, so they are not part of the initial HTML, which is all a URLConnection can fetch. They might even be created by AJAX calls. You will need a JavaScript interpreter on your server if you want to fetch those.
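HtmlUnit (which comes up in the other threads above) is one such interpreter. A minimal sketch, assuming a version of HtmlUnit recent enough to have getOptions():
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

WebClient webClient = new WebClient();
webClient.getOptions().setJavaScriptEnabled(true);  // on by default; shown for clarity
HtmlPage page = webClient.getPage("http://example.com");
webClient.waitForBackgroundJavaScript(10000);       // give AJAX calls up to 10s to finish
System.out.println(page.asXml());                   // the DOM after scripts have run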
If they don't show up in the page source, they're likely being added dynamically by JavaScript code. There's no way to get them from your server-side script short of including a JavaScript interpreter, which is rather high-overhead.
The information in the tags is presumably coming from somewhere, though. Why not track that down and grab it straight from there?
Try using jsoup:
Document doc = Jsoup.parse(new URL("http://example.com/"), 10000);  // URL is a placeholder; second argument is the timeout in ms
System.out.print(doc.toString());
(Note that jsoup does not execute JavaScript, so content injected by scripts still won't appear.)
Assuming that the issue is that the "missing" content is being injected using javascript, the following SO Question is pertinent:
What's a good tool to screen-scrape with Javascript support?
