Apache Nutch fetch and updatedb stages - java

I've got a question about the way Nutch obtains links to update the crawldb with.
The command in question is bin/nutch updatedb crawl/crawldb $s1
I need to write a custom parser. Before doing so I've examined Nutch's source code, and as far as I can tell I'm responsible for providing the links to update the crawldb with, by extracting them from the document and putting them in as an Outlink[] in the ParseData. At least that's what I understood from this.
Correct me if I'm wrong, because I wouldn't like my crawler to stop after the first iteration, as it wouldn't have links to update the crawldb.

Nutch uses either parse-html or parse-tika to parse your crawled URLs (usually HTML). It is in this phase that the outlinks are extracted and stored; when you execute a new iteration of the crawl, Nutch will select some of the available (extracted) links to continue crawling. You only need to write your own parser if you need to extract additional information from the page, let's say that you want all h1 titles in a separate field, for instance.
If you take a look at the crawl script (https://github.com/apache/nutch/blob/master/src/bin/crawl#L246) you'll see that the updatedb command is executed once per iteration, so if you're using parse-html or parse-tika the outlinks of an HTML document (among other data) are extracted for you automatically.
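To make that contract concrete, here is a minimal, hedged sketch of a custom Parser for the 1.x branch. The class name and the hard-coded outlink are purely illustrative, but the Parser/ParseData/Outlink types are Nutch's own; whatever extraction logic you write, updatedb gets its new URLs from the Outlink[] you put in the ParseData:

    // Illustrative only: replace the hard-coded outlink with real extraction.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.parse.*;
    import org.apache.nutch.protocol.Content;

    public class MyParser implements Parser {
      private Configuration conf;

      @Override
      public ParseResult getParse(Content content) {
        String text = new String(content.getContent()); // naive charset handling
        Outlink[] outlinks;
        try {
          // The crawldb update reads new URLs from exactly this array.
          outlinks = new Outlink[] { new Outlink("http://example.com/next", "next") };
        } catch (java.net.MalformedURLException e) {
          outlinks = new Outlink[0];
        }
        ParseData data = new ParseData(ParseStatus.STATUS_SUCCESS, "title",
            outlinks, content.getMetadata());
        return ParseResult.createParseResult(content.getUrl(),
            new ParseImpl(text, data));
      }

      @Override
      public void setConf(Configuration conf) { this.conf = conf; }

      @Override
      public Configuration getConf() { return conf; }
    }

So yes: if your parser returns an empty Outlink[], the next iteration has nothing new to fetch, which matches your reading of the source.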

Related

JMeter method call on start/end of test plan

I am looking to build a combination of JMeter extensions that satisfies a few criteria:
Open files at the beginning of the test plan/thread group. One of the files is an Excel file used for input. I have the code to read it using Apache POI, by including the Apache Tika jar in the JMeter lib folder. The input should then be used in the threads as variables, just like it's done with the CSV Data Set Config test element.
Aggregate all results at the end of the test plan/thread group to do calculations on the set of all results.
For #1, maybe it is possible to do this by extending a config element, but I haven't seen how to do this yet. I am also unsure how to mimic the behaviour of CSV Data Set Config.
For #2, the purpose is to send the final information extracted from the results to a server, so saving results to a file is not optimal. The View Results Tree and View Results in Table elements both create a report of all results, so it seems it should be possible to do this.
Edit:
How to achieve the above?
Assuming your question is 'How to achieve the above?':
For #1:
First off, I believe it is much easier/simpler (because simple is better than complex) to have the Excel/app provide a CSV file for JMeter to consume via CSV Data Set Config. I mean, write the reading logic somewhere else and have it feed JMeter's test data file. Another option would be to write a JSR223 sampler in a setUp thread group to read the Excel file and produce the CSV (see the sketch below).
But if you need it anyhow, you will need to write a custom plugin which inherits from ConfigTestElement and implements the TestBean and LoopIterationListener interfaces. A good place to start is here, and the code for CSV Data Set Config is here.
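If you go the "produce a CSV" route, the conversion itself can be as small as the hedged sketch below: plain Java using Apache POI, where the file names are placeholders. In a setUp thread group the same logic would live inside a JSR223 element rather than a main method.

    // Convert the first sheet of an .xlsx file to CSV for CSV Data Set Config.
    import java.io.*;
    import org.apache.poi.ss.usermodel.*;
    import org.apache.poi.xssf.usermodel.XSSFWorkbook;

    public class XlsxToCsv {
      public static void main(String[] args) throws Exception {
        try (Workbook wb = new XSSFWorkbook(new FileInputStream("input.xlsx"));
             PrintWriter out = new PrintWriter(new FileWriter("testdata.csv"))) {
          Sheet sheet = wb.getSheetAt(0);
          DataFormatter fmt = new DataFormatter(); // renders cells as displayed text
          for (Row row : sheet) {            // note: iteration skips blank cells
            StringBuilder line = new StringBuilder();
            for (Cell cell : row) {
              if (line.length() > 0) line.append(',');
              line.append(fmt.formatCellValue(cell));
            }
            out.println(line);
          }
        }
      }
    }

CSV Data Set Config can then point at testdata.csv as usual.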
For #2:
If you need POST the result file to server then you can use a tearDown thread group in jmeter which will pick up the file at the end of the test and do a HTTP post request using HTTP Request sampler.
Hope I gave you some direction.

How to crawl and parse only precise data using Nutch?

I'm new to Nutch and crawling. I have installed Nutch 2.0 and crawled and indexed the data using Solr 4.5 by following some basic tutorials. Now I don't want to parse all the text content of a page; I want to customize things so that Nutch crawls the page but scrapes/fetches only address-related data, because my use case is to crawl URLs and parse only address info as text.
For example, I need to crawl and parse only the text content which has address information, email id, phone number and fax number.
How should I do this? Is there any plugin already available for this?
If I want to write a customized parser for this, can anyone help me in this regard?
Check out NUTCH-1870, a work in progress on a generic XPath plugin for Nutch. The alternative is to write a custom HtmlParseFilter that scrapes the data that you want. A good (and simple) example is the headings plugin. Keep in mind that both of these links are for the 1.x branch of Nutch while you're working with 2.x; although things differ to some degree, the logic should be portable. The other alternative is using the 1.x branch.
Based on your comment:
Since you don't know the structure of the webpage, the problem is somewhat different: essentially you'll need to "teach" Nutch how to detect the text you want, either based on some regexp or using a library that does address extraction out of plain text, like the jgeocoder library. You'll need to parse (iterate on every node of) the webpage, trying to find something that resembles an address, phone number, fax number, etc. This is similar in spirit to what the headings plugin does, except that instead of looking for addresses or phone numbers it just finds the title nodes in the HTML structure. That could be a starting point for writing a plugin that does what you want, but I don't think there is anything out of the box for this.
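As a starting point, here is a hedged sketch of such an HtmlParseFilter for the 1.x branch, in the spirit of the headings plugin. The class name, the metadata key, and the phone regexp are assumptions, and for simplicity it scans the already-extracted plain text instead of walking the DOM:

    // Illustrative HtmlParseFilter: stores phone-like strings in parse metadata.
    import java.util.regex.*;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.parse.*;
    import org.apache.nutch.protocol.Content;
    import org.w3c.dom.DocumentFragment;

    public class ContactDetailsFilter implements HtmlParseFilter {
      // Crude pattern; real address/phone detection needs something stronger.
      private static final Pattern PHONE = Pattern.compile("\\+?[0-9][0-9 ()-]{6,}[0-9]");
      private Configuration conf;

      @Override
      public ParseResult filter(Content content, ParseResult parseResult,
          HTMLMetaTags metaTags, DocumentFragment doc) {
        Parse parse = parseResult.get(content.getUrl());
        Matcher m = PHONE.matcher(parse.getText());
        while (m.find()) {
          parse.getData().getParseMeta().add("phone", m.group());
        }
        return parseResult;
      }

      @Override
      public void setConf(Configuration conf) { this.conf = conf; }

      @Override
      public Configuration getConf() { return conf; }
    }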
Check [NUTCH-978], which introduces a plugin called XPath that allows the user of Nutch to process various web pages and extract only the information the user wants, therefore making the index more accurate and its content more flexible.

Using Nutch to Retrieve Page Contents

I have a very large list of seeds to be crawled (only those seeds are needed, without any deepening). How can I use Nutch to retrieve the following for the seed pages, without any indexing or integration into any other platform like Solr?
the HTML
the text content
(preferably) the out-links
Thanks
Well, there are many issues you want to address. Below are the issues with their solutions:
Limiting crawling to the seed list: enable the scoring-depth plugin and configure it to allow only 1 level of crawling (see the configuration fragment after this list).
Getting textual content: Nutch does that by default.
Getting raw HTML data: it is not possible with Nutch 1.9. You need to download Nutch from its trunk repository and build it, because storing the HTML content is scheduled for Nutch's next release (1.10).
Extracting outlinks: you can do that, but you have to write a new IndexingFilter to index the outlinks (a sketch follows this list).
Doing all of the above without Solr: you can do that. However, you have to write a new indexer that stores the extracted data in whatever format you want.
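For the first point, a possible nutch-site.xml fragment looks like the following; the plugin.includes value here is only an example, and you should verify the scoring.depth.max property name against your version's nutch-default.xml:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(html|tika)|scoring-depth</value>
    </property>
    <property>
      <name>scoring.depth.max</name>
      <value>1</value>
    </property>

For the outlinks, a hedged sketch of such an IndexingFilter against the 1.x API, where the class name and the "outlinks" field name are assumptions:

    // Adds each outlink of the parsed page as an "outlinks" document field.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.parse.Outlink;
    import org.apache.nutch.parse.Parse;

    public class OutlinksIndexingFilter implements IndexingFilter {
      private Configuration conf;

      @Override
      public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
          CrawlDatum datum, Inlinks inlinks) {
        for (Outlink outlink : parse.getData().getOutlinks()) {
          doc.add("outlinks", outlink.getToUrl());
        }
        return doc;
      }

      @Override
      public void setConf(Configuration conf) { this.conf = conf; }

      @Override
      public Configuration getConf() { return conf; }
    }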

How to grab data that is not in html source but visible from browser?

The data I want is visible in the browser, but I can't find it in the HTML source code. I suspect the data was generated by scripts, and I'd like to grab that kind of data. Is it possible using Jsoup? I'm aware Jsoup just does not execute JavaScript.
Take this page for example, I'd like to grab all the colleges and schools under Academics -> COLLEGES & SCHOOLS.
If the dom content is generated via scripts or plugins, then you really should consider a scriptable browser like phantomjs. Then you can just write some javascript to extract the data.
I didn't check your link, and I assume you're looking for a general answer not specific to any page.
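If you'd rather stay in Java, HtmlUnit (which this thread mentions further down) is an alternative to phantomjs that also executes JavaScript. A hedged sketch, where the URL and the CSS selector are placeholders you'd adapt to the actual page:

    // Loads a page, lets its scripts run, then reads the generated DOM.
    // Recent HtmlUnit versions implement AutoCloseable on WebClient.
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.DomNode;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class RenderedDomScraper {
      public static void main(String[] args) throws Exception {
        try (WebClient client = new WebClient()) {
          client.getOptions().setThrowExceptionOnScriptError(false);
          HtmlPage page = client.getPage("http://example.edu/");
          client.waitForBackgroundJavaScript(5000); // let scripts build the DOM
          for (DomNode node : page.querySelectorAll("ul.colleges li")) {
            System.out.println(node.getTextContent().trim());
          }
        }
      }
    }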

desktop application / script to interact with javascript on web application

My work has tasked me with determining the feasibility of migrating our existing in-house built change management services (web based) to a SharePoint solution. I've found everything to be easy, except I've run into the issue that each change management issue (there are several thousand) may have any number of attachment files associated with it, called through JavaScript, that need to be downloaded and put into a document library.
(ex. ... onClick="DownloadAttachment(XXXXX,'ProjectID=YYYY');return false">Attachment... ).
To keep myself from manually selecting them all, I've been looking over posts from people wanting to do something similar, and there seem to be many possible solutions, but they often seem more complicated than they need to be.
So I suppose in a nutshell I'm asking what would be the best way to approach this issue that yields some sort of desktop application or script that can interact with web pages and will let me select and organize all the attachments. (Making a purely web based app (PHP, JavaScript, Rails, etc.) is not an option for me, so I'm throwing that out there now.)
Thanks in advance.
Given a document id and project id, XXXXX and YYYY respectively in your example, figure out the URL from which the file contents can be downloaded. You can observe a few URL links in the browser and detect the pattern which your web application uses.
Use a tool like Selenium to get a list of the XXXXXs and YYYYs of the documents you need to download.
Write a bash script with wget to download the files locally and put them in the correct folders.
This is a "one off" migration, right?
Get access to your in-house application's database, and create an SQL query which pulls out rows showing the attachment names (XXXXX?) and the issue/project (YYYY?), ex:
|file_id|issue_id|file_name |
| 5| 123|Feasibility Test.xls|
Analyze the DownloadAttachment method and figure out how it generates the URL that it calls for each download.
Start a script (personally I'd go for Python) that will do the migration work.
Program the script to connect and run the SQL query, or to read a CSV file you create manually from step #1.
Program the script to use the details to determine the target-filename and the URL to download from.
Program the script to download the file from the given URL, and place it on the hard drive with the proper name. (In Python, you might use urllib.)
Hopefully that will get you as far as a bunch of files categorized by "issue" like:
issue123/Feasibility Test.xls
issue123/Billing Invoice.doc
issue456/Feasibility Test.xls
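Although the answer suggests Python, the same flow sketched in Java (to match the rest of this thread) might look like the following; the JDBC URL, the attachments table, and the download URL pattern are assumptions you'd replace with what step #2 reveals:

    // One-off migration: query attachment rows, download each file, and
    // place it under an issueNNN/ folder. Needs a JDBC driver on the classpath.
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.*;
    import java.sql.*;

    public class AttachmentMigrator {
      public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection(
                 "jdbc:mysql://localhost/changemgmt", "user", "pass");
             Statement st = db.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT file_id, issue_id, file_name FROM attachments")) {
          while (rs.next()) {
            // Reproduce whatever URL DownloadAttachment() builds (assumption).
            String url = "http://intranet/download?file=" + rs.getInt("file_id")
                + "&ProjectID=" + rs.getInt("issue_id");
            Path target = Paths.get("issue" + rs.getInt("issue_id"),
                rs.getString("file_name"));
            Files.createDirectories(target.getParent());
            try (InputStream in = new URL(url).openStream()) {
              Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            }
          }
        }
      }
    }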
Thank you everyone. I was able to get what I needed using HtmlUnit and Java: I traversed a report I made of all change items with attachments, went to each one, copied the source code, traversed that to find instances of the download method, and copied the unique IDs of each attachment, building an .xls of all items and their attachments.
