nutch - Is there a way to get http response header fields parsed? - java

Can I get HTTP response header fields parsed with Nutch?
Is this a built-in capability that needs to be configured?
I've looked around the internet and I can't find any info about this.
Also, if I crawl the local file system, is there a way to parse a file's header fields (size, description, etc.)?

See line 144 here. You can see that HTTP response headers can be obtained, and you can use that info.
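To make that concrete, here is a minimal sketch of reading those headers from Nutch's fetched content. It assumes Nutch's Content and Metadata classes (from org.apache.nutch.protocol and org.apache.nutch.metadata) and that the HTTP response headers are stored in the per-fetch Metadata; verify the exact API against your Nutch version, since plugin interfaces have changed between releases.

```java
// Hedged sketch: assumes Nutch attaches HTTP response headers to the
// Metadata of each fetched Content object (check your Nutch version).
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.protocol.Content;

public class HeaderInspector {

    // Dumps every header/metadata field Nutch recorded for this fetch,
    // e.g. Content-Type, Last-Modified, Content-Length.
    public static void printHeaders(Content content) {
        Metadata meta = content.getMetadata();
        for (String name : meta.names()) {
            System.out.println(name + ": " + meta.get(name));
        }
    }
}
```

A custom parse or indexing filter plugin could use the same Metadata lookup to copy a header value into the index.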
For the second question:
For parsing different file types, there are plugins provided by Nutch. You will need to study the plugin for your specific file type and get going.

Related

How to do correlation in HTTP Header Manager which is producing dynamic values

In the screenshot above we are facing dynamic values in the Content-Type header, so we tried to correlate them, but we are unable to read the dynamic value from the server.
Please check the correlation we have done, and please help us with this issue.
HTTP Header Manager defines additional request headers which you send along with the request.
I don't really understand why you are trying to extract a value from a request header which you define yourself.
If you are trying to perform a file upload, or your application expects a multipart/form-data content type, you can send this header automatically by ticking the relevant checkbox in the HTTP Request sampler.
In general I would recommend recording your test scenario using the HTTP(S) Test Script Recorder; it should be smart enough to generate a relevant test plan skeleton.

Writing file content into an HTTP response

How can I write the contents of a file into an HTTP response? I would prefer to write it to an entity first, if possible. I searched for examples but unfortunately didn't find a suitable working one.
Of course it is possible. After all, that's what all web servers do when they serve you pages. Add the proper Content-Type and Content-Length (if known) headers, open your file, read it, and write it to your response.
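As a minimal sketch of that advice, here is a servlet that streams a file into the response. The file path and MIME type are placeholders, and the javax.servlet API plus Java 9+ (for InputStream.transferTo) are assumed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class FileServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        Path file = Paths.get("/tmp/report.pdf"); // hypothetical file

        // Set the headers the answer mentions before writing the body.
        resp.setContentType("application/pdf");
        resp.setContentLengthLong(Files.size(file));

        // Copy the file into the response body.
        try (InputStream in = Files.newInputStream(file);
             OutputStream out = resp.getOutputStream()) {
            in.transferTo(out);
        }
    }
}
```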

How to retrieve XML/RDF data from a dbpedia link or URL?

Recently I have been trying to learn about the Semantic Web. For a project I need to retrieve data from a given DBpedia link, e.g. http://dbpedia.org/page/Berlin. But when I retrieve data using java.net.URLConnection I get the HTML data. How can I get the XML from the same link? I know that there is a link on every DBpedia page to download the XML, but that is not what I want to do. Thanks in advance.
Note that the URI of the resource is actually http://dbpedia.org/resource/Berlin (with resource, not page). Ideally, you could request that URI with an Accept header of application/rdf+xml and get the RDF/XML representation of the resource. That's how the BBC publishes their data (e.g., see this answer), but DBpedia doesn't do that. Even if you request application/rdf+xml, you end up getting a redirect, as you can see if you try with an HTTP client (e.g., Advanced Rest Client in Chrome shows a 303 See Other redirect).
In a web browser, you likewise get redirected to the page version by a 303 See Other response code. So while requesting the resource URI with the Accept header set to application/rdf+xml ought to return the data, DBpedia doesn't play quite so nicely.
So the easiest way is to note that at the bottom of http://dbpedia.org/page/Berlin there is some text with download links:
RDF ( N-Triples N3/Turtle JSON XML )
The URL of the last link is http://dbpedia.org/data/Berlin.rdf. Thus, you can get the RDF/XML by changing page or resource to data and appending .rdf to the end of the URL. It's not the most RESTful solution, but it seems to be what's available.
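Since the question already uses java.net.URLConnection, here is a minimal sketch of fetching the RDF/XML via that rewritten data URL. The URL follows the rewriting rule above; everything else is plain JDK I/O.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class FetchRdf {

    public static void main(String[] args) throws Exception {
        // page/resource rewritten to data, with .rdf appended.
        URLConnection conn =
            new URL("http://dbpedia.org/data/Berlin.rdf").openConnection();

        // Print the RDF/XML document line by line.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(),
                                      StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```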
A good way to access DBpedia data is through SPARQL. You can use Apache Jena to run SPARQL queries against http://dbpedia.org/sparql.
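As a sketch of that approach, the following assumes Apache Jena's ARQ API; QueryExecutionFactory.sparqlService comes from older Jena releases (newer ones prefer the QueryExecution builder), and querying the dbo:abstract property is just an illustrative choice.

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

public class DbpediaSparql {

    public static void main(String[] args) {
        // Ask the public endpoint for the English abstract of Berlin.
        String query =
            "SELECT ?abs WHERE { "
          + "  <http://dbpedia.org/resource/Berlin> "
          + "  <http://dbpedia.org/ontology/abstract> ?abs . "
          + "  FILTER (lang(?abs) = 'en') }";

        try (QueryExecution qe = QueryExecutionFactory
                .sparqlService("http://dbpedia.org/sparql", query)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.getLiteral("abs").getString());
            }
        }
    }
}
```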

Download a single file with multiple connections in Java?

I'm doing a project on a Java download manager. I want to download a single file (hosted on some website) over multiple connections, just like download managers do (for example, IDM). Is this possible in Java? If yes, please help me with how I can implement it; if you have any sample code, please post it. Thank you in advance.
Here are a couple of hints. No code though.
A multi-connection download manager relies on support for the Accept-Ranges header in the HTTP 1.1 specification. Servers use this header to indicate that they support sending partial responses to the client.
HTTP clients use the Range header in the request to obtain partial responses. All partial responses carry a Content-Range header.
A multi-connection download manager makes multiple connections to a server supporting this feature. Each connection issues its own Range header for the portion it should download. The responses are then collated in the necessary order to obtain the desired file. The sizes of the ranges can be pre-calculated using an initial HTTP HEAD request, which returns the actual size of the file in the Content-Length response header; the task of downloading the file can then be split into suitable chunks.
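Although the answer above offers no code, here is a hedged sketch of one such ranged request with plain HttpURLConnection: a HEAD request reads Content-Length, then a single ranged GET fetches the first quarter. The URL is a placeholder, and a real manager would run several such GETs concurrently and write each chunk at its proper offset.

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RangedDownload {

    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com/big.iso"); // hypothetical file

        // 1. HEAD request: learn the total size and whether ranges are supported.
        HttpURLConnection head = (HttpURLConnection) url.openConnection();
        head.setRequestMethod("HEAD");
        long total = head.getContentLengthLong();
        System.out.println("Size: " + total
            + ", Accept-Ranges: " + head.getHeaderField("Accept-Ranges"));
        head.disconnect();

        // 2. Ranged GET: the server should answer 206 Partial Content.
        HttpURLConnection get = (HttpURLConnection) url.openConnection();
        get.setRequestProperty("Range", "bytes=0-" + (total / 4 - 1));
        System.out.println("Status: " + get.getResponseCode()
            + ", Content-Range: " + get.getHeaderField("Content-Range"));

        try (InputStream in = get.getInputStream()) {
            byte[] chunk = in.readAllBytes(); // Java 9+
            // A real manager would write this chunk at offset 0 of the file.
        }
    }
}
```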
I'd recommend reading about segmented downloading, thinking of a way to implement it in Java, and then asking concrete questions if you have any.

How to find out the name of the default page displayed by a webserver?

I'm downloading various files through I/O streaming in my Java application. Receiving and saving those files works well as long as I have a full URL path including the file name, but how can I find out the name of the index file (as defined, for example, in Apache's DirectoryIndex) of a domain? The HTTP headers don't provide this information, and neither does URLConnection.
Thanks a lot!
Be well,
S.
As far as I know there is no way of retrieving this information. The HTTP specification doesn't provide it, and I don't think that's a bad thing: your client requests the URL "/", and it's up to the web server how to handle that; there is no obligation to return a filename.
It's also worth pointing out (I'm sure you're aware of it, but just in case) that just because a URL looks like /somedir/somefile.html, it doesn't mean that is the actual file being served. It could be served via a proxy to another host, mod_rewrite, etc.; in other words, the name is arbitrary and doesn't necessarily bear any relation to the physical name on disk.
In short, I think your best bet would be to pick a default filename, e.g. index.html, for those cases and stick to it.
The only way out is to:
Inspect the Content-Disposition header and use it to generate the filename. If the server is serving a file, it will usually set this header. E.g., a http://server:port/DownLoadServlet URL might set this header to indicate the name "statement.pdf".
If this header is missing, use heuristics to generate a filename. This is what browsers do to generate filenames like Doc[10].pdf, Doc[12].pdf, etc.
Use the Content-Type header (if available) to guess the file extension.
A sketch of this fallback logic follows below.
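Here is a hedged sketch of that header-first, heuristic-second logic using HttpURLConnection. The Content-Disposition parsing is deliberately naive (it ignores RFC 5987 encoding and quoting edge cases), and the fallback names are arbitrary defaults.

```java
import java.net.HttpURLConnection;

public class FilenameGuess {

    // Derive a filename from the response headers of an already-connected
    // HttpURLConnection, falling back to a default when none is advertised.
    public static String guessFilename(HttpURLConnection conn) {
        // 1. Prefer an explicit Content-Disposition filename.
        String cd = conn.getHeaderField("Content-Disposition");
        if (cd != null && cd.contains("filename=")) {
            String name = cd.substring(
                cd.indexOf("filename=") + "filename=".length());
            return name.trim().replace("\"", "");
        }

        // 2. Otherwise guess an extension from Content-Type.
        String type = conn.getContentType();
        if (type != null && type.startsWith("text/html")) {
            return "index.html"; // the default suggested above
        }
        return "download.bin"; // arbitrary fallback
    }
}
```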
