Scraping websites in Java

What I am trying to do is take a list of URLs and download each URL's content (for indexing). The biggest problem is that if I encounter a link such as a Facebook event that simply redirects to the login page, I need to detect that and skip the URL. It seems as though the robots.txt file is there for this purpose. I looked into Heritrix, but it seems like way more than I need. Is there a simpler tool that will provide information about robots.txt and scrape sites accordingly?
(Also, I don't need to follow additional links and build up a deep index; I just need to index the individual pages in the list.)

You could just take the class you are interested in, i.e. http://crawler.archive.org/xref/org/archive/crawler/datamodel/Robotstxt.html
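If pulling in all of Heritrix is more than you need, a hand-rolled check is not much code either. Below is a minimal, illustrative sketch (the class name and the simplistic parsing are assumptions, not a full robots.txt implementation): it collects the Disallow rules for User-agent: * and flags login-style redirects by disabling automatic redirect following.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class UrlChecker {

    // Very rough robots.txt handling: collects Disallow paths listed under
    // "User-agent: *" and ignores Allow rules, wildcards and other agents.
    static List<String> fetchDisallowedPaths(String host) throws Exception {
        List<String> disallowed = new ArrayList<String>();
        URL robots = new URL("http://" + host + "/robots.txt");
        BufferedReader in = new BufferedReader(new InputStreamReader(robots.openStream()));
        try {
            boolean appliesToUs = false;
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.toLowerCase().startsWith("user-agent:")) {
                    appliesToUs = line.substring(11).trim().equals("*");
                } else if (appliesToUs && line.toLowerCase().startsWith("disallow:")) {
                    String path = line.substring(9).trim();
                    if (!path.isEmpty()) disallowed.add(path);
                }
            }
        } finally {
            in.close();
        }
        return disallowed;
    }

    // Detects pages (e.g. Facebook events) that bounce you to a login page,
    // by refusing to follow redirects and checking for a 3xx status.
    static boolean redirectsAway(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setInstanceFollowRedirects(false);
        return conn.getResponseCode() >= 300 && conn.getResponseCode() < 400;
    }
}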

Related

How to crawl and parse only precise data using Nutch?

I'm new to Nutch and crawling. I have installed Nutch 2.0, then crawled and indexed the data using Solr 4.5 by following some basic tutorials. Now I don't want to parse all the text content of a page; I want to customize things so that Nutch crawls the page and scrapes/fetches only the data related to addresses, because my use case is to crawl URLs and parse only the address info as text.
For example, I need to crawl and parse only the text content which has address information, email id, phone number and fax number.
How should I do this? Is there any plugin already available for this?
If I want to write a customized parser for this, can anyone help me in this regard?
Check out NUTCH-1870, a work in progress on a generic XPath plugin for Nutch; the alternative is to write a custom HtmlParseFilter that scrapes the data that you want. A good (and simple) example is the headings plugin. Keep in mind that both of these links are for the 1.x branch of Nutch while you're working with 2.x; although things differ to some degree, the logic should be portable. The other alternative is using the 1.x branch.
Based on your comment:
Since you don't know the structure of the webpage, the problem is somewhat different. Essentially you'll need to "teach" Nutch how to detect the text you want, based on some regexp or on some library that does address extraction out of plain text, like the jgeocoder library; you'll need to parse (iterate over every node of) the webpage, trying to find something that resembles an address, phone number, fax number, etc. This is similar to what the headings plugin does, except that instead of looking for addresses or phone numbers it just finds the title nodes in the HTML structure. This could be a starting point for writing a plugin that does what you want, but I don't think there is anything out of the box for this.
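For illustration, here is the kind of naive regex scan such a custom filter might run over the parsed text (the patterns are deliberately simplistic placeholders; real address detection needs something like jgeocoder):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContactExtractor {
    // Toy patterns for demonstration only; they will miss plenty of real cases.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    private static final Pattern PHONE =
            Pattern.compile("\\+?\\d[\\d\\s().-]{7,}\\d");

    public static void extract(String pageText) {
        Matcher m = EMAIL.matcher(pageText);
        while (m.find()) {
            System.out.println("email: " + m.group());
        }
        m = PHONE.matcher(pageText);
        while (m.find()) {
            System.out.println("phone/fax: " + m.group());
        }
    }
}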
Check NUTCH-978, which introduces a plugin called XPath that allows the user of Nutch to process various web pages and keep only the information the user wants, making the index more accurate and its content more flexible.

Convert Web Page to PDF or Image

I need to convert a web page [which does not have public access] to PDF or an image [preferably PNG].
The web page contains a set of charts and images. Most of the charts are populated through Ajax calls, so there is a delay between page load and chart load.
I am looking for an answer to either of these questions:
1- I found a set of snapshot APIs, but none of them support accessing my internal page. Since the web page I am trying to export is not public, I need to be authenticated. The biggest problem is that I cannot send request headers [such as session id, cookie or other variables] along with these APIs; it seems they don't support this kind of functionality.
2- I am not sure if I can do the following: log in to my web page with an HTTP client, add HTTP headers, send a GET call and get the HTML as a string, then use one of the converters to convert it to PDF. What I am not sure about is whether it's possible to get a proper PDF from the HTML string I got from the HTTP client, since resources [CSS, JS, etc.] will be missing. I want my PDF/image to look exactly as it does on the web site.
I would really appreciate it if you can help.
Thanks in advance,
ED
You're probably best off using wkhtmltopdf, which is a server-side tool and is easily installed.
There are two parameters you can use to wait for your Ajax to finish; try:
--javascript-delay to influence the time the program waits for the JavaScript to finish
--window-status to wait until window.status equals a certain value
See the extensive manual for this program here.
wkhtmltopdf generates a PDF and wkhtmltoimage generates an image, which is PNG (as you requested) by default.
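If you are calling it from Java, you can simply shell out; a minimal sketch (the URL, output file and delay are illustrative values):

import java.io.IOException;

public class PdfSnapshot {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Give the page's JavaScript/Ajax five seconds before rendering;
        // alternatively, use --window-status and have the page set
        // window.status once its charts have finished loading.
        Process p = new ProcessBuilder(
                "wkhtmltopdf",
                "--javascript-delay", "5000",
                "https://example.com/report",   // placeholder URL
                "report.pdf")
            .inheritIO()
            .start();
        System.out.println("wkhtmltopdf exited with " + p.waitFor());
    }
}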
Authentication is difficult because it involves security. Because the operation you are describing is unusual it is likely to result in all kinds of alarm bells going off. It is entirely possible to do but it is fraught, easy to get wrong and fragile in the face of security updates and code changes.
As such I'm going to suggest an alternate method which is one we often recommend for ABCpdf (on which I work). Yes we support standard authentication methods but the beauty of this approach is that it is robust and is applicable to other solutions (eg Java based) and novel authentication methods.
Typically you just want a PDF of the current page. The easiest way to do this is to snaffle the HTML. How you do that depends on your environment: for example, under ASP.NET you can obtain the HTML of the current page using the HttpResponse.Filter property or by overriding the Render method of the page.
Then you need to save this HTML to a file and present it to your solution via a 'file://' protocol URL. Now obviously at this point any relative links will be broken but this is easily fixed by dropping in a BASE tag that references the place they are located.
Generally the types of resources referenced by a server-side page are static. So if you can create a BASE tag that references the actual files rather than the web site, you will bypass any authentication for access to these resources.
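As a rough sketch of that rebasing step, assuming you have already captured the HTML as a string, you could inject the BASE tag right after the head element before saving the file (naive string handling for illustration; it would also match a <header> element):

public class HtmlRebaser {
    // Inserts a <base> tag so relative links resolve against baseHref
    // even though the document itself is loaded from a file:// URL.
    static String addBaseTag(String html, String baseHref) {
        String baseTag = "<base href=\"" + baseHref + "\">";
        int head = html.toLowerCase().indexOf("<head"); // naive match
        if (head < 0) {
            return baseTag + html;
        }
        int close = html.indexOf('>', head);
        return html.substring(0, close + 1) + baseTag + html.substring(close + 1);
    }
}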
That still leaves the AJAX based problems which are another can of worms. The render delay method is something we have supported for many years (from before AJAX was around) however it is not terribly reliable because you just don't know how long to wait.
Much better is a tighter link into the JavaScript via a callback you can use to determine if the page is loaded. I don't think ABCpdf is going to be appropriate for you since it is .NET but I would certainly encourage you to look for a Java based solution that uses this type of more sophisticated approach.

Get the contents of the link created by JavaScript

I am trying to build a very rudimentary crawler which could move through certain specific links and extract the contents from them. I am using JSoup for traversing through the links on a page and reading the required content.
However, I have hit a roadblock on one of the sites. It is a kind of news portal where users are allowed to post their own comments. I need to extract these comments. However, if there are more than 5 comments, they are spread over several pages, and the links to the subsequent pages are created by JavaScript code in the href (instead of a real link). It is something like this:
<a id="pager1_lnkPage2" href="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions("pager1$lnkPage2", "", true, "", "", false, true))">2</a>
Now I have no idea how to traverse the links generated by this JavaScript. Is there any way to get the data on the pages these links refer to? (On the face of it they do not seem to create any new link, since the URL does not change while navigating through the other pages.)
For your reference here is a link to one such page. The links to navigate through multiple pages are at the lower right corner of the page.
This is embedded in an iframe on the page with the main story.
I have also come across an interface called ScriptEngine in javax.script, but I could not understand it well enough to use it here.
Thanks
I've never used jsoup, but judging by its description (it is an HTML parser) and the fact that you are trying to somehow incorporate JavaScript into it, I'd say you chose the wrong tool for the job.
In your case I would rather go with Zombie.js (Node.js based) or Selenium. The latter may be the better choice if you want to stick with Java (Selenium has Java bindings).
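As a rough illustration of the Selenium route (the URL, the iframe index and the .comment selector are assumptions; the pager id comes from the markup above):

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

public class CommentCrawler {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();
        try {
            driver.get("http://example.com/news-story"); // placeholder URL
            // The comments sit inside an iframe, so switch into it first
            // (index 0 is an assumption; use the real name/id if known).
            driver.switchTo().frame(0);
            // Trigger the ASP.NET postback that a plain HTML parser cannot execute.
            driver.findElement(By.id("pager1_lnkPage2")).click();
            // ".comment" is an assumed selector for the comment elements.
            for (WebElement comment : driver.findElements(By.cssSelector(".comment"))) {
                System.out.println(comment.getText());
            }
        } finally {
            driver.quit();
        }
    }
}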

Display external webpage into a webpage in my application

I want to display an external webpage (exactly as it's rendered on that site) inside a page in my application, in a way that's fast and good for SEO crawlers, and I was wondering if there's a way to do that with Java EE.
If not, which is better in performance and for SEO: the XMLHttpRequest way or the iframes way?
Please advise, with sample code or a link if possible. Thanks.
Update: example website is: http://www.akhbarak.net/
If you need to display content from different pages inline, use an iframe (iframe stands for inline frame; it has nothing to do with Apple).
If you'd like to use AJAX to display pages, I would recommend colorbox.
Note that accessing pages on a different domain via AJAX is next to impossible; allowing it would be a very, very big security hole, so I would not recommend attempting it. You would have to use a proxy on your own server to fetch the page and return its HTML.
That said, using the iframe in your source code, so it is loaded with the rest of the page, seems like your best bet. Sites like Facebook and Twitter use this in embeddable "like" and "tweet" widgets so that those widgets can make requests on their own domain - that is, twitter.com or facebook.com. While managing lots of iframes isn't much fun, it is a very accepted way of doing what you want to do.
In theory, you could:
load the whole page into a PHP variable,
replace the <body> tags with <div> tags,
take out the <html> tags,
pull out the entire <head> section and put it in the encompassing page's <head>,
and replace all relative links with absolute ones (i.e. '/images' changes to 'http://example.com/images').
Would it be easy to do? Probably not. It's the only way I can think of to accomplish it so that the site appears as part of yours, though.
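In the Java EE world you could sketch the same proxy-and-rewrite idea with jsoup (connect/select/absUrl are real jsoup API; everything else here is illustrative):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PageEmbedder {
    // Fetches the external page server-side and rewrites relative URLs to
    // absolute ones, so the body can be dropped into our own page.
    static String fetchBodyWithAbsoluteLinks(String url) throws Exception {
        Document doc = Jsoup.connect(url).get();
        for (Element link : doc.select("a[href]")) {
            link.attr("href", link.absUrl("href"));
        }
        for (Element img : doc.select("img[src]")) {
            img.attr("src", img.absUrl("src"));
        }
        return doc.body().html();
    }
}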

How do I create an HTML table of files for download?

I'm in charge of updating an existing java app for an embedded device (a copier).
One of the things I want to do is create a servlet which allows the download of all the files in our sandboxed directory on the device (which will include the application log files, local caches, etc). At the moment these files are all in a single directory with no subdirectories.
Basically what I'd like to do is as follows:
Log.log
Log.log.1
Log.log.2
SomeLocalCache.txt
AnotherLocalCache.txt
where each line is a clickable link allowing download of the file.
However, my HTML experience is basically nil, and my familiarity with the Java API is still fairly rudimentary, so I'm looking for some advice on the proper way to go about it.
I've read through all the samples provided, and here's what I'm thinking.
I can create a servlet at a specified URL on the device which will call into my code. Let's call this /MyApp.
I add another link below that, let's call it /MyApp/Download.
When this address is reached in a browser, it displays the list of files.
This list will have to be created on the fly. I can create an HTML template file and put it in the res folder (this seems to be the recommended method for the device in question), but the whole list of files/links will need to be substituted in at run time. Here's an example I found using <ol>+<li> tags for the list and <a> tags for the links. I can generate that on the fly pretty easily. Is that a reasonable way to go?
e.g.
<ol>
<li>
Log.log
</li>
<!--more <li> elements-->
</ol>
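Something like this rough sketch is what I have in mind for generating the list at run time (the directory path is a placeholder):

import java.io.File;

public class FileListBuilder {
    // Builds the <ol> markup that gets substituted into the HTML template.
    static String buildList(File dir) {
        StringBuilder list = new StringBuilder("<ol>\n");
        for (File f : dir.listFiles()) {
            list.append("<li><a href=\"/MyApp/Download/")
                .append(f.getName())
                .append("\">")
                .append(f.getName())
                .append("</a></li>\n");
        }
        return list.append("</ol>").toString();
    }
}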
Clicking on an individual file will link to /MyApp/Download/File.ext which will then trigger the file download via my servlet (I've found this code which looks promising for the actual download).
The device will require users to log in before they are allowed to access the /MyApp link or any sub-links, and I can additionally require that the logged-in user be an admin before allowing file download; together that seems like sufficient security in this case (heavy security is not required for these files).
So am I missing anything big or is this a reasonable plan of engagement?
EDIT
Judging by this link (when to use UL or OL in html?), many people are going to hammer the answer and comment below, because they say it is important to put semantic information into the HTML.
My point is simply this: the only difference is that browsers will display one with bullet points (as the OP seems to want) and one with numbers (as the OP does not want). I suggest he change the HTML to the way he wants it to render, or leave it as is and make some CSS changes.
Yes, there is a semantic difference between the two... they will both still render in order, as defined here (http://www.w3.org/TR/html401/struct/lists.html). Is the HTML the place to put semantic information? I think not; the code that generated the HTML is the correct place. Your cohesion may vary.
I won't change my original comment for the sake of history.
END EDIT
Seems fine to me -- however, <ol> is not really used any more; I'd go with <ul>. Don't worry, it is still ordered as you would expect.
The reason for this is that the only difference between the two was that browsers would automatically number (render with a number before) ordered lists. However, with CSS all the rendering control can be in the CSS (including numbering) and everyone is happy.
Hardly anyone uses the auto-numbering any more. In fact, via CSS, lists can be and are used for all sorts of crazy things, including CSS menuing systems.
Here's a summary of what you need to do (a stripped-down sketch follows the list):
You can use File#listFiles() to get a File[].
You can use JSTL c:forEach to iterate over an array.
You can use HTML <ol>, <ul> or <dl> elements to display a list.
You can use HTML <a> element to display a link.
You can use a Servlet to write the InputStream of a local file to the OutputStream of the response. Remember to set at least the Content-Type, Content-Length and Content-Disposition headers.
You can make use of the request path info to pass the file identifier safely. E.g. map the servlet on /files/* and let each link point to http://example.com/files/path/to/file.ext; in the servlet you can then get /path/to/file.ext via request.getPathInfo().
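Put together, a stripped-down sketch of such a download servlet could look like this (the base directory is a placeholder, and the servlet is assumed to be mapped on /files/* in web.xml; the FileServlet linked below is a production-quality version):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class DownloadServlet extends HttpServlet {
    private static final File BASE_DIR = new File("/path/to/all/files"); // placeholder

    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        String pathInfo = request.getPathInfo(); // e.g. "/Log.log"
        if (pathInfo == null || "/".equals(pathInfo)) {
            response.sendError(HttpServletResponse.SC_NOT_FOUND);
            return;
        }
        File file = new File(BASE_DIR, pathInfo);

        // Refuse path traversal attempts and missing files.
        if (!file.getCanonicalPath().startsWith(BASE_DIR.getCanonicalPath())
                || !file.isFile()) {
            response.sendError(HttpServletResponse.SC_NOT_FOUND);
            return;
        }

        response.setContentType("application/octet-stream");
        response.setContentLength((int) file.length());
        response.setHeader("Content-Disposition",
                "attachment; filename=\"" + file.getName() + "\"");

        InputStream in = new FileInputStream(file);
        OutputStream out = response.getOutputStream();
        try {
            byte[] buffer = new byte[8192];
            int length;
            while ((length = in.read(buffer)) > 0) {
                out.write(buffer, 0, length);
            }
        } finally {
            in.close();
        }
    }
}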
A basic and solid servlet example can be found here: FileServlet. If you want to add resume and compressing capabilities, then you may find the improved FileServlet more useful.
That said, most appservers also just support directory listing by default. Tomcat, for example, supports it out of the box: you can just define another <Context> in Tomcat's server.xml with a docBase of /path/to/all/files and a context path of /files (so that it's accessible at http://example.com/files).
<Context docBase="/path/to/all/files" path="/files" />
That's basically all. No homegrown code/html/servlet needed.
