HtmlUnit - read from a normal string? - Java

I want to use HtmlUnit for Java.
All the examples I have seen read the HTML from a specific website.
But I want to read the HTML source from a String instead.
Like this:
String myString = "<html> myString and Content </html>";
HtmlPage page = myString; // doesn't work, how can I do something like this?
I see only examples like this:
final WebClient webClient = new WebClient();
final HtmlPage page = webClient.getPage("http://htmlunit.sourceforge.net");
Can I also read just a table?
Like this:
String myTable = "<table><td></td></table>";
HtmlTable table = myTable; // doesn't work, how can I do something like this?
My question is: how can I do this conversion correctly?
Can anybody help me, please?

HtmlUnit isn't really designed for this use case, so it will always be a bit of a hassle to make it work. If you're not tied to HtmlUnit specifically, you might be better off using something like jsoup, which has better built-in support for parsing HTML from strings.
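For instance, here's roughly what that looks like with jsoup (a minimal sketch; the class name and sample markup are just illustrative):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupFromString {
    public static void main(String[] args) {
        // Parse HTML directly from a String, no WebClient and no network involved
        Document doc = Jsoup.parse("<html><table><tr><td>myString and Content</td></tr></table></html>");
        // Select the first table and print its text content
        Element table = doc.select("table").first();
        System.out.println(table.text());
    }
}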
That said, if you are tied to HtmlUnit, it's possible to make this work. For inspiration, you could look at how HtmlUnit sets up HtmlPage objects in its own test suite.
As you can see there, although there's no way to construct an HtmlPage directly from a String, you can make a MockWebConnection that'll give a canned response without involving the network. So your code could look something like this:
String html = "<html>Your html here</html>";
WebClient client = new WebClient();
MockWebConnection connection = new MockWebConnection();
connection.setDefaultResponse(html);
client.setWebConnection(connection);
HtmlPage page = client.getPage("http://localhost/test.html"); // any URL works; the mock returns the canned response
(Apologies for any errors in the above -- I'm no longer on a Java project, so I don't have a convenient way to test this right now. That said, I did spend some time on a large Java project that used roughly this technique for a lot of tests. It worked reasonably well, but it tended to be a bit fragile when we upgraded HtmlUnit. Overall, we were happier when we moved to Jsoup.)

Here is another way of doing it, similar to Collum's but a little different.
WebClient webClient = new WebClient();
URL url = new URL("http://example.com");
StringWebResponse response = new StringWebResponse("<html> myString and Content </html>", url);
HtmlPage page = HTMLParser.parseHtml(response, webClient.getCurrentWindow());
As for getting the table, it is possible. You can load the page with the method above and extract the table with the code below.
HtmlTable table = page.getHtmlElementById("table1");
You can iterate over the rows and cells with the code below:
for (final HtmlTableRow row : table.getRows()) {
    System.out.println("Found row");
    for (final HtmlTableCell cell : row.getCells()) {
        System.out.println("  Found cell: " + cell.asText());
    }
}
And you can access specific cells with the example below:
System.out.println("Cell (1,2)=" + table.getCellAt(1, 2).asText());
Please comment if you get stuck, and I may be able to help.

Related

HtmlUnit HtmlFileInput.setData() not working

I am trying to upload a file to a website using the HtmlUnit HtmlFileInput class. I have the data in a byte[] array and would like to send it up without writing it to a file first.
I'm trying:
HtmlFileInput fileInput = form.getInputByName("file");
fileInput.setData(data);
HtmlElement button = form.getInputByName("validate");
HtmlPage responsePage = button.click();
This does not work. But when I try:
HtmlFileInput fileInput = form.getInputByName("file");
fileInput.setValueAttribute("file.txt");
HtmlElement button = form.getInputByName("validate");
HtmlPage responsePage = button.click();
Everything works fine. The docs seem to indicate that setData() does exactly what I want, but it doesn't look like any of the HtmlUnit code even uses the data_ variable that is set when setData() is called. The code uses the files_ field, which is set when setValueAttribute() is called.
I noticed several old bug reports describing similar problems, and they all say the issues were fixed.
Am I trying to use setData() in a way that it shouldn't be used?
Thanks.
In short - data_ is used by getSubmitNameValuePairs() and there are also unit tests for that (e.g. com.gargoylesoftware.htmlunit.html.HtmlFileInput2Test.setValueAttributeAndSetDataDummyFile()).
The trick here is the missing rest of the file information; you have to simulate a bit more if you want your data to be uploaded. Set the value attribute (to submit a dummy file name) and also the content type, to help the server understand your data.
HtmlFileInput fileInput = form.getInputByName("file");
fileInput.setValueAttribute("dummy.txt");
fileInput.setContentType("text/csv");
fileInput.setData("My file data".getBytes());
I think I have to improve the documentation for this a bit.
If you'd like to discuss this, or if you'd like to see a quick fix, simply open an issue on GitHub.

get HTML element outerHTML without 3rd-party libraries

At the moment, I'm using HtmlUnit and it's painfully slow (>10s).
final WebClient webClient = new WebClient();
HtmlPage page = webClient.getPage(url_with_hash_fragment);
out.println(page.getElementById(hash_fragment).asXml());
Ideally, I'd like to do this without adding additional dependencies. (I'm not using HtmlUnit otherwise.)
I tried using HTMLEditorKit but couldn't figure out how to use it for outerHTML. I'd rather not use regular expressions, but that's preferred to waiting 10s and bloating my code.

Scraping Yahoo Answers with Jsoup

I am trying to scrape the results of a keyword search on Yahoo Answers, in my case "alcohol addiction." I am using Jsoup and URL modification to go through the pages of the search results. However, I am noticing that even though I put in the URL for 'Newest' results, it keeps showing 'Relevance' results, and what's worse, the results are not exactly the same as what's shown in the browser.
For instance, the URL for Newest results is:
http://answers.yahoo.com/search/search_result?p=alcohol+addiction&s=1&sort=new
And for relevant results, the URL is:
http://answers.yahoo.com/search/search_result?p=alcohol+addiction&s=1&sort=rel
And the "1" will change to 2, 3, 4, etc as you go to the next page (there are 10 results per page).
Here's what I do to scrape the page:
String urlID = "";
String end = "&sort=new";
String glob = "http://answers.yahoo.com/search/search_result?p=alcohol+addiction&s=";
Integer forumID = 0;
while(nextPageIsThere){
forumID++;
System.out.println("Now extracting the page: "+forumID);
try {
urlID = glob+forumID+end;
System.out.println(urlID);
exdoc = Jsoup.connect(urlID).get();
java.util.Date date= new java.util.Date();
} catch (IOException e) {
e.printStackTrace();
}
...
What's even more confusing is that even when I increase the page number, and the system output shows that the URL is changing to:
http://answers.yahoo.com/search/search_result?p=alcohol+addiction&s=2&sort=new
and
http://answers.yahoo.com/search/search_result?p=alcohol+addiction&s=3&sort=new
it still scrapes the same page as page 1 over and over again. I know my code is not wrong; I've been debugging it for hours. I think it may have something to do with Jsoup.connect, or with Yahoo Answers possibly blocking bots. At the same time, I don't really think that's it.
Does anyone know why this might be happening?
Jsoup works with static HTML only; it can't parse dynamic pages like this one, where content is downloaded after the page loads via an Ajax request or JavaScript modification.
Try reading the page with HtmlUnit instead; that parser has support for JavaScript-driven pages.
It has fairly good JavaScript support (which is constantly improving) and is able to work even with quite complex AJAX libraries, simulating either Firefox or Internet Explorer depending on the configuration you want to use.
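For example, a rough sketch of fetching one of the result pages with HtmlUnit (the 5-second timeout is arbitrary, and the options API differs slightly between HtmlUnit versions, so treat this as a starting point):
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class YahooAnswersFetch {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();
        webClient.getOptions().setJavaScriptEnabled(true);            // run the page's scripts
        webClient.getOptions().setThrowExceptionOnScriptError(false); // tolerate the site's broken JS
        HtmlPage page = webClient.getPage("http://answers.yahoo.com/search/search_result?p=alcohol+addiction&s=2&sort=new");
        webClient.waitForBackgroundJavaScript(5000);                  // give Ajax requests time to finish
        System.out.println(page.asXml());                             // the DOM after JavaScript has run
    }
}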

Render Tapestry page and get it as Stream/String resource

Is there any convenient way to dynamically render a page inside the application and then retrieve its contents as an InputStream or String?
For example, the simplest way is:
// generate url
Link link = linkSource.createPageRenderLink("SomePageLink");
String urlAsString = link.toAbsoluteURI() + "/customParam/" + customParamValue;
// get info stream from url
HttpGet httpGet = new HttpGet(urlAsString);
httpGet.addHeader("cookie", request.getHeader("cookie"));
HttpResponse response = new DefaultHttpClient().execute(httpGet);
InputStream is = response.getEntity().getContent();
...
But it seems there must be an easier way to achieve the same result. Any ideas?
I created tapestry-offline for exactly this purpose. Please be aware of the issue here (workaround included).
It's probably best to understand your exact use case. If, for example, you are generating emails in a scheduled task, it's probably better to configure Jenkins or cron to hit a URL.
It's probably also worth mentioning the capture component from tapestry-stitch.
It is only useful in situations where you want to capture part of a page as a String during page/component render.

How to use HTML Parser to get complete information about all tags in the HTML page

I am using HTML Parser to develop an application.
The code below is not able to get the entire set of tags in the page.
There are some tags which are missed out, and their attributes and text bodies are missed out as well.
Please help me understand why this is happening... or suggest another way.
URL url = new URL("...");
PrintWriter pw=new PrintWriter(new FileWriter("HTMLElements.txt"));
URLConnection connection = url.openConnection();
InputStream is = connection.getInputStream();
InputStreamReader isr = new InputStreamReader(is);
BufferedReader br = new BufferedReader(isr);
HTMLEditorKit htmlKit = new HTMLEditorKit();
HTMLDocument htmlDoc = (HTMLDocument)htmlKit.createDefaultDocument();
HTMLEditorKit.Parser parser = new ParserDelegator();
HTMLEditorKit.ParserCallback callback = htmlDoc.getReader(0);
parser.parse(br, callback, true);
ElementIterator iterator = new ElementIterator(htmlDoc);
Element element;
while ((element = iterator.next()) != null)
{
AttributeSet attributes = element.getAttributes();
Enumeration e=attributes.getAttributeNames();
pw.println("Element Name :"+element.getName());
while(e.hasMoreElements())
{
Object key=e.nextElement();
Object val=attributes.getAttribute(key);
int startOffset = element.getStartOffset();
int endOffset = element.getEndOffset();
int length = endOffset - startOffset;
String text=htmlDoc.getText(startOffset, length);
pw.println("Key :"+key.toString()+" Value :"+val.toString()+"\r\n"+"Text :"+text+"\r\n");
}
}
}
I am doing this fairly reliably with HTML Parser (provided that the HTML document does not change its structure). A web service with a stable API is much better, but sometimes we just do not have one.
General idea:
You first have to know which tags (div, meta, span, etc.) contain the information you want, and the attributes that identify those tags. Example:
<span class="price"> $7.95</span>
If you are looking for this "price", then you are interested in span tags with the class "price".
HTML Parser has filter-by-attribute functionality:
filter = new HasAttributeFilter("class", "price");
When you parse using a filter, you will get a list of Nodes. You can use an instanceof check to determine whether a node is of the type you are interested in; for span you'd do something like:
if (node instanceof Span) // or any other supported element.
See list of supported tags here.
An example with HTML Parser that grabs the meta tag holding a site's description:
Tag sample:
<meta name="description" content="Amazon.com: frankenstein: Books"/>
Code:
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.tags.MetaTag;

public class HTMLParserTest {
    public static void main(String... args) {
        Parser parser = new Parser();
        // <meta name="description" content="Some text about the site." />
        HasAttributeFilter filter = new HasAttributeFilter("name", "description");
        try {
            parser.setResource("http://www.youtube.com");
            NodeList list = parser.parse(filter);
            Node node = list.elementAt(0);
            if (node instanceof MetaTag) {
                MetaTag meta = (MetaTag) node;
                String description = meta.getAttribute("content");
                System.out.println(description);
                // Prints: "YouTube is a place to discover, watch, upload and share videos."
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }
}
As per the comments:
Actually I want to extract information such as product name, price, etc. of all products listed on an online shopping site such as amazon.com. How should I go about it?
Step 1: read their robots file. It's usually found at the root of the site, for example http://amazon.com/robots.txt (a small fetch sketch follows step 3 below). If the URL you're trying to access is covered by a Disallow for a User-Agent of *, then stop here. Contact them, explain in detail what you're trying to do, and ask them for ways/alternatives/webservices which can provide the information you need. Otherwise you're breaking their rules, and you risk getting blacklisted by the site and/or your ISP, or worse. If the URL is not disallowed, proceed to step 2.
Step 2: check whether the site already has a public webservice available, which would be much easier to use than parsing a whole HTML page. Using a webservice, you'll get exactly the information you're looking for in a concise format (JSON or XML) based on a simple set of parameters. Look around or contact them for details about any webservices. If there's no webservice, proceed to step 3.
Step 3: learn how HTML/CSS/JS work, learn how to work with webdeveloper tools like Firebug, and learn how to interpret the HTML/CSS/JS source you see via rightclick > View Page Source. My bet is that the site in question uses JS/Ajax to load/populate the information you'd like to gather. In that case, you'll need to use an HTML parser which is capable of parsing and executing JS as well (the one you're using doesn't do that). This isn't going to be an easy job, so I won't explain it in detail until it's entirely clear what you're trying to achieve, whether it is allowed, and whether there isn't a more-easy-to-use webservice available.
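As a small aid for step 1, here's a quick standard-library sketch that fetches and prints a site's robots.txt so you can inspect the Disallow rules by eye (the amazon.com URL is just the example from step 1):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class RobotsTxtCheck {
    public static void main(String[] args) throws Exception {
        // Fetch robots.txt from the site root and print it for manual inspection
        URL robots = new URL("http://amazon.com/robots.txt");
        BufferedReader in = new BufferedReader(new InputStreamReader(robots.openStream()));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
    }
}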
You seem to be using the Swing HTMLDocument. That may not be the smartest choice.
I believe you would get better results using, for example, NekoHTML.
Another simple library you can use is JTidy, which can clean up your HTML before parsing it.
Hope this helps.
http://sourceforge.net/projects/jtidy/
Ciao!
