I'm having trouble figuring out how to store and scan large numbers of visited URLs from a web crawler. The idea is that the number of visited URLs will eventually be too large to keep in memory, so I should store them in a file. But doesn't this become very inefficient? If, after getting a batch of URLs, I want to check whether a URL has already been visited, do I have to scan the visited file line by line looking for a match?
I had thought about using a cache, but the problem remains when the URL is not found in the cache and I would still have to check the file. Do I have to check the file line by line for every URL, or is there a better, more efficient way to do this?
A key data structure here could be a Bloom filter, and Guava provides an implementation. A Bloom filter tells you either "maybe you have visited this URL" or "you definitely have not". If the result is a maybe, you check the file to see whether it has really been visited (and visit it if not); otherwise you visit the URL straight away and store it in the file as well as in the Bloom filter.
Now, to optimise the file seeks, you can hash the URL to get a fixed-size byte[] rather than a string of variable length (e.g. MD5).
byte[] hash = md5(url);                        // fixed-size key, e.g. a 16-byte MD5 digest
if (bloomFilter.mightContain(hash)) {          // "maybe visited": false positives are possible
    if (!isInFile(hash)) {                     // confirm against the on-disk store
        visitUrl(url);
        addToFile(hash);
        bloomFilter.put(hash);
    }
} else {                                       // definitely not visited yet
    visitUrl(url);
    addToFile(hash);
    bloomFilter.put(hash);
}
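For completeness, a rough sketch of creating such a filter with Guava (the expected-insertion count and false-positive rate are assumptions you would tune for your crawl):
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

// Sized for roughly 10 million URL hashes with a ~1% false-positive rate (both assumptions).
BloomFilter<byte[]> bloomFilter =
        BloomFilter.create(Funnels.byteArrayFunnel(), 10_000_000, 0.01);

bloomFilter.put(hash);                               // record a visited URL's hash
boolean maybeSeen = bloomFilter.mightContain(hash);  // false => definitely never visited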
Alternatively, you can use a database with the hash as the primary key to get O(1) access time when you check whether a key exists, or you can implement an index yourself.
What about having a file per URL? If the file exists then the URL has been crawled.
You can then get more sophisticated and have this file contain data that indicates the results of the last crawl, how long you want to wait before the next crawl, etc. (It's handy to be able to spot 404s, for example, and decide whether to try again or abandon that URL.)
With this approach, it's worth normalising the URLs so that minor differences in the URL don't result in different files.
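A minimal sketch of the idea in Java (crawlStatusDir, normalize, md5Hex and crawl are hypothetical helpers, not anything from the answer):
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;

// One status file per (normalised, hashed) URL: its existence means "already crawled".
Path statusFile = crawlStatusDir.resolve(md5Hex(normalize(url)) + ".json");
if (Files.exists(statusFile)) {
    // Already crawled; the file's contents can record the last crawl result,
    // the time to wait before the next crawl, the HTTP status (e.g. 404), and so on.
} else {
    crawl(url);
    Files.writeString(statusFile, "{\"lastCrawl\": \"" + Instant.now() + "\"}");
}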
I've used this technique in my node crawler (https://www.npmjs.com/package/node-nutch) and it allows me to store the status either in the file system (for simple crawls) or on S3 (for crawls involving more than one server).
I'd like to learn how crawler4j works.
Does it fetch a web page, then download its content and extract links from it?
What about the .db and .csv files and their structure?
Generally, what sequence does it follow?
Please, I'd like a descriptive answer.
Thanks
General Crawler Process
The process for a typical multi-threaded crawler is as follows:
We have a queue data structure, called the frontier. Newly discovered URLs (or starting points, so-called seeds) are added to this data structure. In addition, every URL is assigned a unique ID in order to determine whether it was previously visited.
Crawler threads then obtain URLs from the frontier and schedule them for later processing.
The actual processing starts:
The robots.txt for the given URL is determined and parsed to honour exclusion criteria and be a polite web-crawler (configurable)
Next, the thread checks for politeness, i.e. the time to wait before visiting the same host again.
The actual URL is visited by the crawler and the content is downloaded (this can be literally anything).
If we have HTML content, this content is parsed and potential new URLs are extracted and added to the frontier (in crawler4j this can be controlled via shouldVisit(...)).
The whole process is repeated until no new URLs are added to the frontier.
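A simplified, single-threaded sketch of that loop (robotsAllows, waitForPoliteness and downloadAndExtractLinks are placeholder methods, and shouldVisit only mirrors the idea of crawler4j's shouldVisit(...) hook, not its exact signature):
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

void crawl(List<String> seeds) {
    Queue<String> frontier = new ArrayDeque<>(seeds);  // the frontier
    Set<String> seen = new HashSet<>(seeds);           // "was this URL already discovered?"

    while (!frontier.isEmpty()) {
        String url = frontier.poll();
        if (!robotsAllows(url)) continue;              // honour robots.txt
        waitForPoliteness(url);                        // delay between hits on the same host
        for (String link : downloadAndExtractLinks(url)) {   // fetch + parse the page
            if (shouldVisit(link) && seen.add(link)) {       // add() returns false for duplicates
                frontier.add(link);
            }
        }
    }
}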
General (Focused) Crawler Architecture
Besides the implementation details of crawler4j, a more or less general (focused) crawler architecture (on a single server/PC) looks like this:
Disclaimer: Image is my own work. Please respect this by referencing this post.
I have a servlet with an API that delivers images from GET requests. The servlet creates a data file of CAD commands based on the parameters of the GET request. This data file is then delivered to an image parser, which creates an image on the file system. The servlet reads the image and returns the bytes on the response.
All of the I/O and the calls to the image parser program can be very taxing; images of around 80 KB take 3,000-4,000 ms to render on a local system.
There are roughly 20 parameters that make up the GET request, and each correlates to a different portion of the image, so the number of possible image combinations is extremely large.
To alleviate the loading time, I plan to store BLOBs of rendered images in a database. If a GET request matches one previously executed, I will pull from cache. Else, I will render a new one. This does not fix "first-time" run, but will help "n+1 runs".
Any other ideas on how I can improve performance?
You can store the file on disk and the image path in the database, because database storage is usually more expensive than file system storage.
Sort the HTTP GET parameters and hash them as an index to that image record, for fast lookup by parameters.
To make sure your program doesn't crash when disk capacity runs low, you should remove unused or rarely used records:
Store a lastAccessedTime for each record, updated each time the image is requested.
Use a scheduler to check lastAccessedTime and remove records whose weight falls below a specified threshold.
You can use different strategies to calculate the weight, such as lastAccessedTime, accessedCount, image size, etc. (a sketch of such a cleanup job follows below).
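A rough sketch of such a cleanup job (the CacheRecord shape, the one-hour interval and the 30-day cutoff are all assumptions; in practice the index would be your image table on disk or in the database):
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class CacheJanitor {
    record CacheRecord(String imagePath, long lastAccessedTime, long accessedCount) {}

    private final Map<String, CacheRecord> index = new ConcurrentHashMap<>();

    void start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(this::evict, 1, 1, TimeUnit.HOURS);  // hourly sweep
    }

    private void evict() {
        long cutoff = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(30);
        // Simplest weight: drop anything not touched for 30 days. A real weight could
        // also factor in accessedCount and image size, as suggested above.
        index.entrySet().removeIf(e -> e.getValue().lastAccessedTime() < cutoff);
    }
}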
You can turn all the parameters that you feed into the rendering pipeline into a single string in a predictable way, so that you can compute a SHA-1 hash of the input, and then store the output file in a directory with the SHA-1 as the file name. That way, if you get a request with the same parameters, you just compute the hash and check whether the file is on disk: if it is, return it; otherwise send the work to the render pipeline and create the file.
If you have a lot of files, you might want to use more than one directory; look at how git divides files across directories by the first few characters of the SHA-1 for inspiration.
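A sketch of that approach (the method name cachePathFor and the "cache/" prefix are my own; the answer only describes hashing the sorted parameters and splitting directories git-style):
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Map;
import java.util.TreeMap;

// Build a canonical string from the GET parameters (sorted so the same parameters
// always produce the same hash), SHA-1 it, and derive a git-style cache path.
static String cachePathFor(Map<String, String> params) throws Exception {
    StringBuilder canonical = new StringBuilder();
    for (Map.Entry<String, String> e : new TreeMap<>(params).entrySet()) {
        canonical.append(e.getKey()).append('=').append(e.getValue()).append('&');
    }

    byte[] digest = MessageDigest.getInstance("SHA-1")
            .digest(canonical.toString().getBytes(StandardCharsets.UTF_8));

    StringBuilder hex = new StringBuilder();
    for (byte b : digest) {
        hex.append(String.format("%02x", b));
    }
    String sha1 = hex.toString();

    // e.g. "cache/ab/cdef0123...": first two characters as the directory, like git objects.
    return "cache/" + sha1.substring(0, 2) + "/" + sha1.substring(2);
}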
I use a similar setup in my app. I am not doing rendering, just storing files; the files are stored in the DB, but for performance reasons I serve them from disk, using the SHA-1 hash of the file contents as the filename / URI for the file.
I am building a web crawler and I have implemented the parsing part. Now I want to store the URIs obtained in an efficient data structure. What should I use? I am using the Jena library for parsing.
Hash.
E.g: URL: scheme://domain:port/path?query_string#fragment_id.
After parsing the URL, store its components in a map, for example:
import java.util.HashMap;
import java.util.Map;

Map<String, String> hash = new HashMap<>();
hash.put("scheme", scheme);
hash.put("domain", domain);
hash.put("port", port);
hash.put("path", path);
hash.put("query_string", query_string);
hash.put("fragment_id", fragment_id);
I guess you want to automatically discard duplicates so no URI is crawled twice? Then I would suggest a HashSet.
It automatically discards duplicates and the insertion complexity is still constant in the optimal case. Note that when you use your own class to represent URIs rather than the default class java.net.URI, you will have to override the int hashCode() method of your URI class to return a text-based hash of the URI string. The default method of Object creates a unique hash code for each object, even when the content is identical.
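For illustration, a hypothetical wrapper class with a text-based hashCode() (java.net.URI already behaves this way, so this only matters if you roll your own class):
final class CrawlUri {
    private final String value;

    CrawlUri(String value) {
        this.value = value;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof CrawlUri && value.equals(((CrawlUri) o).value);
    }

    @Override
    public int hashCode() {
        return value.hashCode();   // text-based, so identical URIs collapse in a HashSet
    }
}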
Crawlers generally use a Queue to keep the URIs still to be inspected, with an accompanying Set to check for duplicates before inserting URIs into that queue; a URI is put into the set after it has been inspected.
If your links fit in memory, you can just use a LinkedList as the queue and a HashSet as the set. Otherwise you can use an external database for both purposes, or a queuing server (like ActiveMQ) as the queue and a database as the set.
I would store the queue of URI's to process and the already processed URI's in Redis (http://redis.io/). Redis is a very fast semi-persistent key-value store with native support for various data structures including lists (the queue of URIs) and hashes (maps). That way these data structures will survive a restart of your Java application. You can also probably have multiple instances of your app running without too much trouble communicating via Redis.
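A rough sketch with the Jedis client (the choice of Jedis, the key names and the localhost connection are my assumptions; the answer only recommends Redis itself):
import redis.clients.jedis.Jedis;

// Enqueue a newly discovered URL only if Redis has never seen it before.
void enqueueIfNew(Jedis jedis, String url) {
    // SADD returns 1 only when the member was not already in the set.
    if (jedis.sadd("crawler:seen", url) == 1) {
        jedis.lpush("crawler:queue", url);
    }
}

// Pull the next URL to process (null when the queue is empty).
String nextUrl(Jedis jedis) {
    return jedis.rpop("crawler:queue");
}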
Generally, in web crawling applications you need to manage URLs to avoid spider traps (sometimes called "black holes") and to avoid visiting the same page too often; you will also use the URL as a global identifier of page content.
The other interesting point is that it can be wrong to simply never visit the same URL twice, because the content of a web page may change over time.
So the best way to satisfy these requirements is to use some kind of priority queue and associate each URL with the tuple {url, hash(url)}. When you get a new URL, just calculate its hash; if your database already has records with the same hash, give this URL a low priority and put it into the priority queue.
The web crawler asks the priority queue for the next URL to visit, so the pages whose URLs have the highest priority are visited first.
You may construct your own hash function that best fits your needs (for example, remove the parameters from the URL string and calculate the hash from the rest of the string).
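A sketch of that idea with a java.util.PriorityQueue (the UrlEntry record, the numeric priorities and normalizeAndHash are illustrative assumptions):
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.Set;

class Frontier {
    record UrlEntry(String url, String hash, int priority) {}

    // Lower number = higher priority (visited sooner).
    private final PriorityQueue<UrlEntry> queue =
            new PriorityQueue<>(Comparator.comparingInt(UrlEntry::priority));

    void offer(String url, Set<String> knownHashes) {
        String hash = normalizeAndHash(url);                   // canonicalise, then hash
        int priority = knownHashes.contains(hash) ? 10 : 0;    // revisit known content later
        queue.add(new UrlEntry(url, hash, priority));
    }

    UrlEntry next() {
        return queue.poll();                                   // highest-priority URL, or null
    }

    private String normalizeAndHash(String url) {
        // Placeholder: a real implementation would e.g. strip query parameters and hash the rest.
        return Integer.toHexString(url.hashCode());
    }
}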
I'm looking for a solution for storing unique file names for a web application using Java Servlet.
What I need is to render profile images in a web page. Since I decided to store only file names in the database, I only have to make sure the names are unique in the image folder. To achieve that, I was thinking of naming and saving the images with a string composed of the pair:
<user_id>_<hash>.<file-type>
In this manner I think I would be pretty sure there will be no collisions, since user_ids are already unique.
1) Is this solution sound?
2) What algorithm should I pick for this purpose?
I'd like to use it properly so code snippets would be very appreciated.
Thanks
You can use File#createTempFile() wherein you specify the prefix, suffix and folder. The generated filename is guaranteed to be unique.
File file = File.createTempFile("name-", ".ext", new File("/path/to/uploads"));
// ...
No, this file won't be auto-deleted on exit or something, it's just a part of temp file generation mechanism.
A really simple approach would be to append System.currentTimeMillis() to the user ID. If the user ID is unique, then it should be pretty safe.
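For example (the .png extension is just an illustrative assumption):
// e.g. "42_1700000000000.png"
String fileName = userId + "_" + System.currentTimeMillis() + ".png";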
Why attach a hash or anything to it if it's a 1-to-1 mapping? One user to one profile image: just use their ID. If it's a 1-to-many mapping, save the record in the database, get the ID of that record, and use that as the name. You might also want to convert all your images to a particular image format for consistency, so the file type wouldn't need to be stored in your database either.
BalusC's temp-file method would work perfectly with this concept.
I have an application to build in Java, and I've got some questions to put.
Is there some way to know whether the URL of a web page is real? The user enters the URL and I have to test whether it is real or not.
How can I know whether a web page has changed since a given date, or find the date of its last update?
In Java, how can I make an application run at PC boot? The application must run from the moment the user turns on the computer.
I'm not sure what kind of application you want to build; I'll assume it's a desktop application. In order to check whether a URL exists, you should make an HTTP HEAD request and parse the result. HEAD can also be used to check whether the page has been modified. In order for an application to start when the PC boots, you have to add a registry entry under Windows; this process is explained here.
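A minimal sketch of such a HEAD check with the standard library (the method name and example handling of status codes are mine):
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

static boolean urlExists(String address) throws IOException {
    HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
    conn.setRequestMethod("HEAD");
    int status = conn.getResponseCode();                        // e.g. 200 OK, 404 Not Found
    String lastModified = conn.getHeaderField("Last-Modified"); // may be null; useful for change checks
    conn.disconnect();
    return status >= 200 && status < 400;
}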
To check whether a URL is valid, you could try using a regular expression (regex for URLs).
To know whether a web page has changed, you can take a look at the HTTP headers (reading HTTP headers in Java).
You can't make the program start automatically on boot; the user must do that. However, you can write code that helps the user set the program up as a startup app; this depends on the operating system.
I'm not sure what you mean by "real". If you mean "valid", then you can just construct a java.net.URL from a String and catch the resulting MalformedURLException if it's not valid. If you mean that there's actually something there, you could issue an HTTP HEAD request like Geo says, or you could just retrieve the content. HTTPUnit is particularly handy for retrieving web content.
HTTP headers may indicate when the content has changed, as nan suggested above. If you don't want to count on that, you can just retrieve the page and store it, or even better, store a hash of the page content. See DigestOutputStream for generating a hash. On a subsequent check for changes, you would simply compare the new hash with the one you stored last time.
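A sketch of that change check, hashing the content with MessageDigest directly (SHA-256 and the method name are my choices; DigestOutputStream fits when you also want to store the stream while hashing it):
import java.io.InputStream;
import java.net.URL;
import java.security.MessageDigest;
import java.util.Arrays;

static byte[] hashPage(String address) throws Exception {
    MessageDigest digest = MessageDigest.getInstance("SHA-256");
    try (InputStream in = new URL(address).openStream()) {
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);   // feed the page content into the digest
        }
    }
    return digest.digest();
}

// Later: the page has changed iff !Arrays.equals(previousHash, hashPage(address))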
Nan is right about start on boot. What OS are you targeting?