What sequence of steps does crawler4j follow to fetch data?

I'd like to learn how crawler4j works.
Does it fetch a web page, then download its content and extract it?
What about the .db and .csv files and their structures?
Generally, what sequence of steps does it follow?
I'd appreciate a descriptive answer.
Thanks

General Crawler Process
The process for a typical multi-threaded crawler is as follows:
We have a queue data structure, called the frontier. Newly discovered URLs (or start points, so-called seeds) are added to this data structure. In addition, every URL is assigned a unique ID in order to determine whether a given URL was previously visited.
Crawler threads then obtain URLs from the frontier and schedule them for later processing.
The actual processing starts:
The robots.txt for the given URL is fetched and parsed to honour exclusion criteria and be a polite web crawler (configurable).
Next, the thread checks for politeness, i.e. the time to wait before visiting the same host again.
The actual URL is visited by the crawler and the content is downloaded (this can be literally anything).
If we have HTML content, this content is parsed and potential new URLs are extracted and added to the frontier (in crawler4j this can be controlled via shouldVisit(...)).
The whole process is repeated until no new URLs are added to the frontier (a minimal crawler4j sketch illustrating these steps follows below).
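To make the steps above concrete, here is a minimal sketch of how a crawl is typically wired up with crawler4j, based on the library's documented example for the 4.x releases; package and method names may differ slightly between versions. shouldVisit(...) decides which discovered URLs go back into the frontier, visit(...) receives the downloaded and parsed content, and the crawl storage folder is where crawler4j keeps its internal frontier database (the .db files the question mentions).

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // Filter which newly discovered URLs are added to the frontier.
        return url.getURL().toLowerCase().startsWith("https://www.example.com/");
    }

    @Override
    public void visit(Page page) {
        // Called after the content has been downloaded and parsed.
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            System.out.println(page.getWebURL().getURL() + " -> "
                    + html.getOutgoingUrls().size() + " outgoing links");
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");      // frontier/docid database lives here
        config.setPolitenessDelay(1000);                 // politeness: wait 1 s between requests to the same host

        PageFetcher fetcher = new PageFetcher(config);
        RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
        CrawlController controller = new CrawlController(config, fetcher, robots);

        controller.addSeed("https://www.example.com/");  // seed URL(s) go into the frontier
        controller.start(MyCrawler.class, 4);            // 4 crawler threads
    }
}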
General (Focused) Crawler Architecture
Besides the implementation details of crawler4j, a more or less general (focused) crawler architecture (on a single server/PC) looks like this:
[Architecture diagram omitted; the original image is the answer author's own work. Please respect this by referencing the original post.]

Related

Web crawler storing visited urls in file

I'm having issues figuring out how I can store and scan large numbers of visited URLs from a web crawler. The idea is that the number of visited URLs will eventually be too large to store in memory, so I should store them in a file, but I'm wondering: doesn't this become very inefficient? If, after getting a batch of URLs, I want to check whether a URL has already been visited, do I have to check the visited file line by line and see if there is a match?
I had thought about using a cache, but the problem still remains when the URL is not found in the cache, since I would still have to check the file. Do I have to check the file line by line for every URL, or is there a better/more efficient way to do this?
A key data structure here could be a Bloom filter, and Guava provides an implementation. A Bloom filter tells you either "maybe you have visited this URL" or "you definitely have not". If the result is a maybe, you check the file to see whether the URL was already visited; otherwise you visit the URL and store it in the file as well as in the Bloom filter.
Now, to optimise the file seeks, you can hash the URL to get a fixed-size byte[] rather than a string of arbitrary length (e.g. MD5):
byte[] hash = md5(url);              // fixed-size key derived from the URL

if (bloomFilter.maybe(hash)) {       // "maybe visited": could be a false positive
    checkTheFile(hash);              // confirm against the persistent file
} else {
    visitUrl(url);                   // definitely not visited yet
    addToFile(hash);
    addToBloomFilter(hash);
}
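For reference, here is a sketch of the hashing and Bloom-filter pieces using Guava's actual API and java.security.MessageDigest; the class name VisitedUrls and the persistence helpers are illustrative, not part of any library:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

class VisitedUrls {
    // Size for the number of URLs you expect; 1% false-positive rate here.
    private final BloomFilter<byte[]> bloomFilter =
            BloomFilter.create(Funnels.byteArrayFunnel(), 1_000_000, 0.01);

    static byte[] md5(String url) throws Exception {
        return MessageDigest.getInstance("MD5").digest(url.getBytes(StandardCharsets.UTF_8));
    }

    // "false" is authoritative (definitely not visited); "true" may be a false positive,
    // so confirm against the file before skipping the URL.
    boolean maybeVisited(String url) throws Exception {
        return bloomFilter.mightContain(md5(url));
    }

    void markVisited(String url) throws Exception {
        bloomFilter.put(md5(url));   // also persist the hash to the file, as described above
    }
}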
You can use a database with the hash as the primary key to get effectively O(1) access time when you check whether a key exists, or you can implement an index yourself.
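A sketch of that database variant using plain JDBC; the table name and schema are assumptions, not something the answer prescribes:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Assumed schema: CREATE TABLE visited (hash CHAR(32) PRIMARY KEY)
static boolean alreadyVisited(Connection db, String urlHashHex) throws SQLException {
    try (PreparedStatement ps =
                 db.prepareStatement("SELECT 1 FROM visited WHERE hash = ?")) {
        ps.setString(1, urlHashHex);
        try (ResultSet rs = ps.executeQuery()) {
            return rs.next();   // indexed primary-key lookup
        }
    }
}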
What about having a file per URL? If the file exists then the URL has been crawled.
You can then get more sophisticated and have this file contain data that indicates the results of the last crawl, how long you want to wait before the next crawl, etc. (It's handy to be able to spot 404s, for example, and decide whether to try again or abandon that URL.)
With this approach, it's worth normalising the URLs so that minor differences in the URL don't result in different files.
I've used this technique in my node crawler (https://www.npmjs.com/package/node-nutch) and it allows me to store the status either in the file system (for simple crawls) or on S3 (for crawls involving more than one server).
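A small Java sketch of the file-per-URL idea (the directory layout and metadata format are illustrative; the node-nutch implementation mentioned above is not shown here):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// One file per normalised URL; the file name is the URL's hash and the body holds
// crawl metadata (status code, last-crawl time, retry policy, ...).
static boolean alreadyCrawled(Path statusDir, String urlHashHex) {
    return Files.exists(statusDir.resolve(urlHashHex));
}

static void recordCrawl(Path statusDir, String urlHashHex, String metadata) throws Exception {
    Files.createDirectories(statusDir);
    Files.write(statusDir.resolve(urlHashHex), metadata.getBytes(StandardCharsets.UTF_8));
}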

Java HTML element reading

I am trying to create a Java program that can detect changes in HTML elements on a web page. For example: http://timer.onlineclock.net/
With each passing second, the HTML elements of the clock change the source of the image they display. Is there any way, using Java, that I can EFFICIENTLY open a connection to this page and see when these elements change?
I have used HTMLUnit, but I decided it takes too long to load a page to be considered efficient enough.
The only way I know how to do it with a URL is to use a BufferedReader to read the page and then use regular expressions to parse an HTML element within the source, but this would require me to "reload" the page every time I want to see the properties of an element. Can anybody give me any suggestions on how I can detect these changes in a matter of milliseconds, without using too many network resources?
Your best bet is to learn and use JavaScript instead of server-side Java. A JavaScript program runs on the client side (i.e. the web browser), as opposed to the server side.
A typical HTML document consists of elements (e.g. text, paragraphs, list items). With JavaScript you can create timers, react to the user's events, and manipulate those elements.
http://www.w3schools.com/js/default.asp is probably a good introduction to JavaScript; I suggest you spend some time on it.
The page in question appears to be a...Javascript digital clock.
If you want the current time, try new Date();.
If you want code to be called at a constant rate, try the Timer class. You can set a task to be called every second, which is the same frequency you will get by polling the page.
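For example, with java.util.Timer (just a sketch; a ScheduledExecutorService would work equally well):

import java.util.Date;
import java.util.Timer;
import java.util.TimerTask;

public class OncePerSecond {
    public static void main(String[] args) {
        Timer timer = new Timer();
        timer.scheduleAtFixedRate(new TimerTask() {
            @Override
            public void run() {
                // Runs once per second, the same rate at which the clock page updates.
                System.out.println(new Date());
            }
        }, 0, 1000);   // no initial delay, 1000 ms period
    }
}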
If you want to use the page as an external source of time, try the Network Time Protocol. http://en.wikipedia.org/wiki/Network_Time_Protocol It will provide much lower latency and is actually designed for this purpose.

Best efficient data structure to store URIs for a web Crawler in Java

I am building a web crawler and I have implemented the parsing part. Now I want to store the obtained URIs in an efficient data structure. What should I use? I am using the Jena library for parsing.
Hash.
E.g., a URL has the form scheme://domain:port/path?query_string#fragment_id.
After parsing the URL, store its components keyed by name, for example (using java.net.URI and a HashMap):
URI u = URI.create(url);                 // parses the URL string into its components
Map<String, String> hash = new HashMap<>();
hash.put("scheme", u.getScheme());
hash.put("domain", u.getHost());
hash.put("port", String.valueOf(u.getPort()));
hash.put("path", u.getPath());
hash.put("query_string", u.getQuery());
hash.put("fragment_id", u.getFragment());
I guess you want to automatically discard duplicates so no URI is crawled twice? Then I would suggest a HashSet.
It automatically discards duplicates and the insertion complexity is still constant in the optimal case. Note that when you use your own class to represent URIs rather than the default class java.net.URI, you will have to override the hashCode() (and equals()) methods of your URI class so that they are based on the URI's text. The default implementation in Object is based on object identity, so two objects with identical content get different hash codes.
Crawlers generally use a Queue to hold the URIs still to be inspected, with an accompanying Set used to check for duplicates before inserting URIs into the queue; the URI is added to the set once it has been inspected.
If the number of your links can fit in memory, you can just use a LinkedList as the queue and a HashSet as the set. Otherwise you can use an external database for both purposes, or a queuing server (like ActiveMQ) as the queue and a database as the set.
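A minimal in-memory sketch of that queue-plus-set combination (class and method names are illustrative). It marks a URI as seen when it is enqueued; marking it only after inspection, as described above, also works if you tolerate occasional duplicates in the queue:

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

class Frontier {
    private final Queue<String> queue = new ArrayDeque<>();   // a LinkedList works too
    private final Set<String> seen = new HashSet<>();

    void add(String uri) {
        if (seen.add(uri)) {      // add() returns false if the URI was already present
            queue.add(uri);
        }
    }

    String next() {
        return queue.poll();      // null when there is nothing left to crawl
    }
}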
I would store the queue of URIs to process and the already processed URIs in Redis (http://redis.io/). Redis is a very fast, semi-persistent key-value store with native support for various data structures, including lists (the queue of URIs) and hashes (maps). That way these data structures survive a restart of your Java application. You can probably also run multiple instances of your app without too much trouble, with the instances coordinating via Redis.
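A sketch of what that could look like with the Jedis client; the key names and the choice of a Redis list plus a Redis set are my own illustration of the idea, not something the answer prescribes:

import redis.clients.jedis.Jedis;

public class RedisFrontier {
    public static void main(String[] args) {
        try (Jedis redis = new Jedis("localhost", 6379)) {
            String url = "https://example.com/";

            // Only enqueue the URL if it has not been recorded as seen before.
            if (redis.sadd("crawler:seen", url) == 1) {
                redis.lpush("crawler:frontier", url);
            }

            // A worker pops the next URL to process; null means the frontier is empty.
            String next = redis.rpop("crawler:frontier");
            System.out.println("next to crawl: " + next);
        }
    }
}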
Generally, in web crawling applications you need to manage URLs so that you can discard spider traps (sometimes called "black holes") and avoid frequent visits to the same page; you will also use the URL as a global identifier of the page's content.
The other interesting point is that it is wrong to never visit the same URL twice, because the content of a web page may change over time.
So the best way to satisfy these requirements is to use some kind of priority queue and associate each URL with the tuple {url, hash(url)}. When you get a new URL, just calculate its hash, and if your database already has a record with the same hash, give this URL a low priority and put it into the priority queue.
The web crawler asks the priority queue for the next URL to visit, so pages whose URLs have the highest priority are visited first.
You can construct your own hash function that best fits your needs (for example, remove the parameters from the URL string and calculate the hash from the rest of the string).
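A rough sketch of that priority-queue idea in Java (Java 16+ for the record syntax); the UrlEntry record, the priority values and the normalising hash function are all illustrative assumptions:

import java.util.Comparator;
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;

class PriorityFrontier {
    // The {url, hash} tuple plus a priority; a lower number means "visit sooner".
    record UrlEntry(String url, String hash, int priority) {}

    private final Set<String> knownHashes = new HashSet<>();   // stands in for the database records
    private final PriorityQueue<UrlEntry> queue =
            new PriorityQueue<>(Comparator.comparingInt(UrlEntry::priority));

    void offer(String url) {
        String hash = normalisedHash(url);                 // hash of the URL without its parameters
        int priority = knownHashes.add(hash) ? 0 : 10;     // already known -> lower priority
        queue.add(new UrlEntry(url, hash, priority));
    }

    UrlEntry next() {
        return queue.poll();                               // highest-priority URL first
    }

    private String normalisedHash(String url) {
        int q = url.indexOf('?');                          // strip query parameters, as suggested above
        return Integer.toHexString((q >= 0 ? url.substring(0, q) : url).hashCode());
    }
}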

How can I search for broken links of a website using Java?

I would like to scan some websites looking for broken links, preferably using Java. Any hints on how I can start doing this?
(I know there are some websites that do this, but I want to make my own personalized log file)
Writing a web crawler isn't as simple as just reading the static HTML; if the page uses JavaScript to modify the DOM then it gets complex. You will also need to look out for pages you've already visited, a.k.a. spider traps. If the site is pure static HTML, then go for it... but if the site uses jQuery and is large, expect it to be complex.
If your site is all static, small and has little or no JS then use the answers already listed.
Or
You could use Heritrix and then parse its crawl.log for 404s. See the Heritrix documentation on crawl.log.
Or, if you must write your own:
You could use something like HTMLUnit (it has a JavaScript engine) to load the page, then query the DOM object for links. Place each link in an "unvisited" queue, then pull links from the unvisited queue to get your next URL to load; if the page fails to load, report it.
To avoid duplicate pages (spider traps) you could hash each link and keep a hash table of visited pages (see CityHash). Before placing a link into the unvisited queue, check it against the visited hash table.
To avoid leaving your site, check that the URL is on a safe-domain list before adding it to the unvisited queue. If you want to confirm that the off-domain links are good, keep them in an offDomain queue and later load each link from this queue, e.g. with url.getContent() or a plain HTTP request, to see whether it works (faster than using HTMLUnit, and you don't need to parse the page anyway); a sketch follows.
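One lightweight way to test those off-domain links is a HEAD request via HttpURLConnection; a small sketch:

import java.net.HttpURLConnection;
import java.net.URL;

// Returns true if the link looks broken (connection failure or a 4xx/5xx status).
static boolean isBroken(String link) {
    try {
        HttpURLConnection conn = (HttpURLConnection) new URL(link).openConnection();
        conn.setRequestMethod("HEAD");         // headers only, no body download
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        return conn.getResponseCode() >= 400;  // 404, 500, ...
    } catch (Exception e) {
        return true;                           // malformed URL, unreachable host, timeout, ...
    }
}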
Write a function which recursively checks links.
Pseudo Code:
static void checkLinks(String url) {
    // Uses java.util.regex.Pattern / Matcher; fetch(url) stands in for whatever HTTP client
    // you use to download the page body. Add a visited set to avoid loops on real sites.
    try {
        String content = fetch(url);
        Matcher m = Pattern.compile("href=\"(http://.*?)\"").matcher(content);
        while (m.find()) {
            checkLinks(m.group(1));            // recurse into every discovered link
        }
    } catch (Exception e) {
        System.out.println("Link " + url + " failed");
    }
}
Depending on the links, you may have to resolve relative URLs against the current URL before passing them to the next recursion.
Load the website's front page using some HTTP client for Java.
Parse the HTML (since it's not well-formed XML you may need to clean it up first, with something like TagSoup).
For every <a> tag, get its content and attempt to connect to it.
If necessary, repeat recursively when the URL from the <a> tag belongs to your site. Make sure to store the URLs you have already processed in a map so you don't process them more than once (see the sketch below).
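A compact sketch of those steps; it uses jsoup instead of TagSoup because jsoup both fetches and cleans up real-world HTML in one call (the method and variable names are illustrative):

import java.util.HashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Recursively follow on-site links, reporting any page that fails to load.
static void check(String url, String siteRoot, Set<String> done) {
    if (!done.add(url)) return;                  // already processed this URL
    try {
        Document doc = Jsoup.connect(url).get();
        for (Element a : doc.select("a[href]")) {
            String link = a.attr("abs:href");    // href resolved to an absolute URL
            if (link.startsWith(siteRoot)) {
                check(link, siteRoot, done);     // stay on the site and recurse
            }
        }
    } catch (Exception e) {
        System.out.println("Broken or unreachable: " + url);
    }
}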

Multiple questions in Java (Validating URLs, adding applications to startup, etc.)

I have an application to build in Java, and I've got a few questions.
Is there some way to know whether the URL of a web page is real? The user enters the URL and I have to test whether it's real or not.
How can I know whether a web page has changed since a given date, or what the date of its last update is?
In Java, how can I make an application run on PC boot? The application must run from the moment the user turns on the computer.
I'm not sure what kind of application you want to build; I'll assume it's a desktop application. To check whether a URL exists, you should make an HTTP HEAD request and parse the result. HEAD can also be used to check whether the page has been modified. For an application to start when the PC boots, you have to add a registry entry under Windows; this process is explained here.
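A small sketch of such a HEAD request with HttpURLConnection; the status code answers "does it exist?" and the Last-Modified header (when the server sends one) answers "has it changed?":

import java.net.HttpURLConnection;
import java.net.URL;

public class HeadCheck {
    public static void main(String[] args) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL("https://example.com/").openConnection();
        conn.setRequestMethod("HEAD");                 // headers only, no body

        int status = conn.getResponseCode();           // 200 = exists, 404 = not found, ...
        long lastModified = conn.getLastModified();    // 0 if no Last-Modified header was sent

        System.out.println("status=" + status + ", lastModified=" + lastModified);
    }
}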
To check whether a URL is valid, you could try using a regular expression (a regex for URLs).
To know whether a web page has changed, you can take a look at the HTTP headers (reading HTTP headers in Java).
You can't make the program start automatically on boot; the user must set that up. However, you can write code to help the user register the program as a startup app; how to do this depends on the operating system.
I'm not sure what you mean by "real". If you mean "valid", then you can just construct a java.net.URL from a String and catch the resulting MalformedURLException if it's not valid. If you mean that there's actually something there, you could issue an HTTP HEAD request like Geo says, or you could just retrieve the content. HTTPUnit is particularly handy for retrieving web content.
HTTP headers may indicate when the content has changed, as nan suggested above. If you don't want to count on that, you can just retrieve the page and store it, or even better, store a hash of the page content. See DigestOutputStream for generating a hash. On a subsequent check for changes you would simply compare the new hash with the one you stored last time.
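A sketch of the hash-and-compare step; it uses MessageDigest directly rather than DigestOutputStream, since the page body is assumed to be already in memory here:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

// Hash the downloaded page body; keep the hash and compare it on the next crawl.
static byte[] contentHash(String pageBody) throws Exception {
    return MessageDigest.getInstance("SHA-256")
                        .digest(pageBody.getBytes(StandardCharsets.UTF_8));
}

static boolean hasChanged(byte[] previousHash, byte[] currentHash) {
    return !Arrays.equals(previousHash, currentHash);
}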
Nan is right about starting on boot. What OS are you targeting?
