I'm new to Nutch and crawling. I have installed Nutch 2.0 and crawled and indexed the data using Solr 4.5 by following some basic tutorials. Now I don't want to parse all the text content of a page; I want to customize the process so that Nutch crawls the page but scrapes/fetches only the data related to addresses, because my use case is to crawl URLs and parse only the address info as text.
For example, I need to crawl and parse only the text content which has address information, email IDs, phone numbers and fax numbers.
How should I do this? Is there any plugin already available for this?
If I want to write a customized parser for this, can anyone help me in this regard?
Check out NUTCH-1870, a work in progress on a generic XPath plugin for Nutch; the alternative is to write a custom HtmlParseFilter that scrapes the data that you want. A good (and simple) example is the headings plugin. Keep in mind that both of these links are for the 1.x branch of Nutch, and you're working with 2.x; although things differ to some degree, the logic should be portable. The other alternative is using the 1.x branch.
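To give an idea of the shape of such a filter, here is a bare-bones sketch against the 1.x HtmlParseFilter interface, loosely modeled on the headings plugin. The class name and the "address.info" metadata key are made up, and the detection logic is a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class AddressParseFilter implements HtmlParseFilter {

  private Configuration conf;

  @Override
  public ParseResult filter(Content content, ParseResult parseResult,
                            HTMLMetaTags metaTags, DocumentFragment doc) {
    Parse parse = parseResult.get(content.getUrl());
    StringBuilder found = new StringBuilder();
    walk(doc, found); // visit every node of the parsed DOM
    if (found.length() > 0) {
      // stash the extracted text in the parse metadata under a custom key
      parse.getData().getParseMeta().set("address.info", found.toString());
    }
    return parseResult;
  }

  private void walk(Node node, StringBuilder found) {
    if (node.getNodeType() == Node.TEXT_NODE) {
      String text = node.getNodeValue();
      if (looksLikeAddress(text)) { // your detection logic goes here
        found.append(text.trim()).append('\n');
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      walk(children.item(i), found);
    }
  }

  private boolean looksLikeAddress(String text) {
    // placeholder: plug in a regexp or a library such as jgeocoder
    return text != null && text.toLowerCase().contains("street");
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}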
Based on your comment:
Since you don't know the structure of the webpage, the problem is somewhat different: essentially you'll need to "teach" Nutch how to detect the text you want, based on some regexp or using a library that does address extraction out of plain text, such as the jgeocoder library. You'll need to parse (iterate over every node of the webpage) trying to find something that resembles an address, phone number, fax number, etc. This is kind of similar to what the headings plugin does, except that instead of looking for addresses or phone numbers it just finds the title nodes in the HTML structure. This could be a starting point for writing a plugin that does what you want, but I don't think there is anything out of the box for this.
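For the detection step itself, a first cut is often just regular expressions. Here is a rough, standalone sketch; the patterns are illustrative only and will need tuning for your locale and data:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContactInfoDetector {

  // e.g. +1-555-123-4567, (555) 123 4567, 555.123.4567
  private static final Pattern PHONE =
      Pattern.compile("(\\+?\\d{1,3}[-. ])?\\(?\\d{3}\\)?[-. ]\\d{3}[-. ]\\d{4}");

  private static final Pattern EMAIL =
      Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

  public static boolean hasContactInfo(String text) {
    return PHONE.matcher(text).find() || EMAIL.matcher(text).find();
  }

  public static void main(String[] args) {
    String sample = "Call us at (555) 123-4567 or write to info@example.com";
    Matcher m = PHONE.matcher(sample);
    while (m.find()) {
      System.out.println("phone: " + m.group());
    }
    m = EMAIL.matcher(sample);
    while (m.find()) {
      System.out.println("email: " + m.group());
    }
  }
}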
Check NUTCH-978, which introduces a plugin called XPath that allows a Nutch user to process various web pages and get only the information the user wants, making the index more accurate and its content more flexible.
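To illustrate what an XPath-based plugin buys you, here is the same idea in plain Java with javax.xml.xpath, outside of Nutch. The expression and file name are invented for the example, and real-world HTML usually needs to be tidied into well-formed XHTML before a strict XML parser will accept it:

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathExtractor {
  public static void main(String[] args) throws Exception {
    // works on well-formed XHTML; raw HTML usually needs a tidying step first
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse("page.xhtml");

    XPath xpath = XPathFactory.newInstance().newXPath();
    // grab only the nodes you care about, e.g. elements marked as addresses
    NodeList nodes = (NodeList) xpath.evaluate(
        "//*[@class='address']", doc, XPathConstants.NODESET);

    for (int i = 0; i < nodes.getLength(); i++) {
      System.out.println(nodes.item(i).getTextContent().trim());
    }
  }
}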
We are trying to develop a website for tracking the progress of pipelines and assets. I want to link progress data entered from forms developed in Java to GIS. We will be capturing all the lats and longs of the pipeline stretch and the lat and long of the progress of work.
I need help in developing a shapefile dynamically based on the progress, and also in viewing that shapefile in my webpage.
The JavaScript API does have functionality for uploading shapefiles from your desktop into the browser, but you will need another tool to create those shapefiles based on the progress.
For uploading the shapefiles, see the example here: https://developers.arcgis.com/javascript/3/jssamples/portal_addshapefile.html
There may be more examples available at https://developers.arcgis.com/javascript/3/.
For creating the shapefiles dynamically, you could use ArcMap or ArcGIS Pro if you have those, or one of many Python libraries that help write shapefiles. There may even be some Java libraries to help with this as well, but I only work on the front end, so I cannot help you there.
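For what it's worth, GeoTools is one open-source Java library that can write shapefiles; below is a sketch adapted from its shapefile tutorial. The field names and output path are invented, and depending on your GeoTools version the JTS geometry classes may live under com.vividsolutions.jts instead of org.locationtech.jts:

import java.io.File;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

import org.geotools.data.DataUtilities;
import org.geotools.data.DefaultTransaction;
import org.geotools.data.Transaction;
import org.geotools.data.shapefile.ShapefileDataStore;
import org.geotools.data.shapefile.ShapefileDataStoreFactory;
import org.geotools.data.simple.SimpleFeatureStore;
import org.geotools.feature.DefaultFeatureCollection;
import org.geotools.feature.simple.SimpleFeatureBuilder;
import org.locationtech.jts.geom.Coordinate;
import org.locationtech.jts.geom.GeometryFactory;
import org.opengis.feature.simple.SimpleFeatureType;

public class ProgressShapefile {
  public static void main(String[] args) throws Exception {
    // point schema: geometry plus a status attribute, in WGS84
    SimpleFeatureType type = DataUtilities.createType(
        "Progress", "the_geom:Point:srid=4326,status:String");

    File file = new File("progress.shp");
    Map<String, Serializable> params = new HashMap<>();
    params.put("url", file.toURI().toURL());
    ShapefileDataStore store = (ShapefileDataStore)
        new ShapefileDataStoreFactory().createNewDataStore(params);
    store.createSchema(type);

    // build one feature from a lat/long captured in the progress form
    GeometryFactory gf = new GeometryFactory();
    SimpleFeatureBuilder builder = new SimpleFeatureBuilder(type);
    builder.add(gf.createPoint(new Coordinate(77.59, 12.97))); // x=lon, y=lat
    builder.add("50% complete");
    DefaultFeatureCollection features = new DefaultFeatureCollection();
    features.add(builder.buildFeature(null));

    // write the collection out inside a transaction
    Transaction tx = new DefaultTransaction("create");
    SimpleFeatureStore featureStore =
        (SimpleFeatureStore) store.getFeatureSource(store.getTypeNames()[0]);
    featureStore.setTransaction(tx);
    featureStore.addFeatures(features);
    tx.commit();
    tx.close();
  }
}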
Even for something conceptually simple, the complexity of the workflow that best suits your needs can vary depending on these questions (and probably others I'm not thinking of):
Do you absolutely need to create a shapefile for this? Why can't you just push the form data to a spatial database (e.g. PostGIS) and then return the XYs of the points, or the string of XYs for line features, etc.? (See the sketch after this list.)
Where is the source of the data & what is the format? Is it a PDF, text on an HTML page, a .csv file downloadable from a page, etc.? You may need to implement scraping (from a site) or download and update, or your data could be live streaming - these are all different workflows and you need to establish these boundaries before setting up your workflow.
If your end game is points, all you need is XYs in a table format to display in GIS software. If they are lines or polygons, it'd be a little different. Again: what output type are they, and what are you trying to do with them (e.g. import into QGIS)?
Without these answers, it doesn't make sense for anyone to suggest something that could be totally impossible for you to execute. Please answer these and think through your workflow from beginning to end and/or vice versa.
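To make the first question concrete, here is a minimal sketch of the no-shapefile route: pushing a form entry straight into PostGIS over JDBC. The table name, columns, and connection details are invented for the example; ST_SetSRID and ST_MakePoint are standard PostGIS functions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class ProgressToPostgis {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
        "jdbc:postgresql://localhost:5432/gisdb", "user", "secret")) {

      // store the form's lat/long as a point geometry in WGS84
      String sql = "INSERT INTO pipeline_progress (status, geom) "
          + "VALUES (?, ST_SetSRID(ST_MakePoint(?, ?), 4326))";
      try (PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setString(1, "50% complete");
        ps.setDouble(2, 77.59); // longitude (x)
        ps.setDouble(3, 12.97); // latitude  (y)
        ps.executeUpdate();
      }
    }
  }
}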
Cheers,
Shawn
I have a very large list of seeds to be crawled (only those seeds are needed without any deepening). How can I use Nutch to retrieve:
the HTML of
the text content of
(Preferably) the out-links of
the seed pages? (without any indexing and integration into any other platform like Solr).
Thanks
Well, there are many issues you want to address. Below are the issues with their solutions:
Limiting crawling to the seed list: enable the scoring-depth plugin and configure it to allow only 1 level of crawling (see the configuration sketch after this list).
Getting textual content: Nutch does that by default.
Getting the raw HTML data: this is not possible with Nutch 1.9. You need to download Nutch from its trunk repository and build it yourself, because storing the raw HTML content is scheduled for Nutch's next release (1.10).
Extracting outlinks: you can do that, but you have to write a new IndexingFilter to index the outlinks.
Doing all of the above without Solr: you can do that. However, you have to write a new indexer that stores the extracted data in whatever format you want.
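For the first item, the configuration would look roughly like the following in nutch-site.xml. The plugin and property names below are what I recall from the 1.x scoring-depth plugin, so verify them against your version's documentation:

<!-- enable the scoring-depth plugin alongside whatever you already use -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|scoring-depth</value>
</property>
<!-- allow only the seed level, i.e. no deepening -->
<property>
  <name>scoring.depth.max</name>
  <value>1</value>
</property>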
My work has tasked me with determining the feasibility of migrating our existing in-house-built change management services (web-based) to a SharePoint solution. I've found everything to be easy, except I've run into the issue that each change management issue (several thousand of them) may have any number of attachment files associated with it, called through JavaScript, that need to be downloaded and put into a document library.
(ex. ... onClick="DownloadAttachment(XXXXX,'ProjectID=YYYY');return false">Attachment... ).
To avoid manually selecting them all, I've been looking over posts from people wanting to do something similar; there seem to be many possible solutions, but they often seem more complicated than they need to be.
So, in a nutshell, I'm asking what would be the best way to approach this issue so that it yields some sort of desktop application or script that can interact with web pages and let me select and organize all the attachments. (A purely web-based app (PHP, JavaScript, Rails, etc.) is not an option for me, so I'm throwing that out there now.)
Thanks in advance.
Given a document ID and project ID (XXXXX and YYYY respectively in your example), figure out the URL from which the file contents can be downloaded. You can observe a few URL links in the browser and detect the pattern which your web application uses.
Use a tool like Selenium to get a list of the XXXXXs and YYYYs of the documents you need to download (a sketch follows this list).
Write a bash script with wget to download the files locally and put them in the correct folders.
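The Selenium step might look roughly like this in Java; the report URL and the CSS selector are assumptions about your application, and the output pairs can be fed to the wget script:

import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class AttachmentLister {
  public static void main(String[] args) {
    Pattern call = Pattern.compile(
        "DownloadAttachment\\((\\d+),'ProjectID=(\\d+)'\\)");
    WebDriver driver = new ChromeDriver();
    try {
      driver.get("http://intranet/app/issues.aspx");
      // find every link whose onclick handler calls DownloadAttachment
      List<WebElement> links =
          driver.findElements(By.cssSelector("a[onclick*='DownloadAttachment']"));
      for (WebElement link : links) {
        Matcher m = call.matcher(link.getAttribute("onclick"));
        if (m.find()) {
          // XXXXX = m.group(1), YYYY = m.group(2)
          System.out.println(m.group(1) + "," + m.group(2));
        }
      }
    } finally {
      driver.quit();
    }
  }
}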
This is a "one off" migration, right?
Get access to your in-house application's database, and create a SQL query which pulls out rows showing the attachment names (XXXXX?) and the issue/project (YYYY?), e.g.:
|file_id|issue_id|file_name |
| 5| 123|Feasibility Test.xls|
Analyze the DownloadAttachment method and figure out how it generates the URL that it calls for each download.
Start a script (personally I'd go for Python) that will do the migration work.
Program the script to connect and run the SQL query, or to read a CSV file you create manually from step #1.
Program the script to use the details to determine the target-filename and the URL to download from.
Program the script to download the file from the given URL, and place it on the hard drive with the proper name. (In Python, you might use urllib; a rough sketch of this step is shown after the example output below.)
Hopefully that will get you as far as a bunch of files categorized by "issue" like:
issue123/Feasibility Test.xls
issue123/Billing Invoice.doc
issue456/Feasibility Test.xls
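For reference, here is a rough sketch of steps 4-6. I suggested Python above, but the same flow in plain Java looks like this; the URL pattern, CSV layout, and folder naming are assumptions about your application, not facts:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class AttachmentMigrator {
  public static void main(String[] args) throws Exception {
    // CSV produced in step 1: file_id,issue_id,file_name
    try (BufferedReader in = new BufferedReader(new FileReader("attachments.csv"))) {
      String line;
      while ((line = in.readLine()) != null) {
        String[] f = line.split(",", 3);
        String fileId = f[0], issueId = f[1], fileName = f[2];

        // guessed URL shape reverse-engineered from DownloadAttachment()
        URL url = new URL("http://intranet/app/Download.aspx?AttachmentID="
            + fileId + "&ProjectID=" + issueId);

        Path target = Paths.get("issue" + issueId, fileName);
        Files.createDirectories(target.getParent());
        try (InputStream stream = url.openStream()) {
          Files.copy(stream, target); // fails if the file already exists
        }
        System.out.println("saved " + target);
      }
    }
  }
}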
Thank you everyone. I was able to get what I needed using HtmlUnit and Java: I traversed a report I made of all change items with attachments, went to each one, copied the source code, traversed that to find instances of the download method, copied the unique IDs of each attachment, and built an .xls of all items and their attachments.
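For anyone landing here later, the approach described above looks roughly like this with HtmlUnit; the report URL and the onclick pattern are stand-ins for the real application:

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class AttachmentIdScraper {
  public static void main(String[] args) throws Exception {
    Pattern call = Pattern.compile(
        "DownloadAttachment\\((\\d+),'ProjectID=(\\d+)'\\)");

    try (WebClient webClient = new WebClient()) {
      HtmlPage page = webClient.getPage("http://intranet/app/report.aspx");
      Matcher m = call.matcher(page.asXml()); // scan the page source
      while (m.find()) {
        System.out.println("attachment=" + m.group(1)
            + " project=" + m.group(2));
      }
    }
  }
}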
Does anybody know some open-source tools to parse HTML pages, filter the ads, JS, etc., and get the title and text? The front end of my application is based on LAMP, so I need to parse the HTML pages and store them in MySQL, then populate the front pages with that data.
I know some tools: Heritrix, Nutch. But it seems that they are crawlers.
Thanks.
Joseph
It depends on what you mean by "text" from the webpage. I did a similar thing by grabbing a webpage using the Apache HttpClient libraries and then using dom4j to look for a particular tag to extract text from. But you do, in effect, need the same type of crawler that search engines like Google use; you are emulating the basic steps they perform when they crawl a website and extract information. It would be helpful if you went into a little more detail on what kind of information you want to retrieve from the pages.
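Here is a cut-down sketch of that approach, using Apache HttpClient 4.x to fetch the page and a crude regex (rather than dom4j) to pull out the title; for production you'd want a real HTML parser, and the URL is a placeholder:

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class TitleGrabber {
  public static void main(String[] args) throws Exception {
    try (CloseableHttpClient client = HttpClients.createDefault()) {
      // fetch the raw HTML of the page
      String html = EntityUtils.toString(
          client.execute(new HttpGet("http://example.com/")).getEntity());

      Matcher m = Pattern
          .compile("<title>(.*?)</title>",
                   Pattern.CASE_INSENSITIVE | Pattern.DOTALL)
          .matcher(html);
      if (m.find()) {
        System.out.println("title: " + m.group(1).trim());
        // next step: INSERT the title and cleaned text into MySQL via JDBC
      }
    }
  }
}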
I have 20,000-50,000 entries in an Excel file. One column contains the name of a company. Ideally, I would like to search the name of that company and select the URL associated with the first result. I am aware that Google (my ideal choice) provides an AJAX Search API. However, it also has a 1000-search limit per registrant. Is there a way to get over 20,000 searches without making 20 accounts with Google, or is there an alternative engine I could use?
Any alternative ways of approaching this problem are also welcome (i.e. WhoIs look-ups).
Google AJAX Search has no such limit of 1000; Yahoo Search does. Google AJAX Search limits you to 64 results per search, but otherwise has no limit.
From Google AJAX Search API - Class Reference:
Note: The maximum number of results pages is based on the type of searcher. Local search supports 4 pages (or a maximum of 32 total results) and the other searchers (Blog, Book, Image, News, Patent, Video, and Web) support 8 pages (for a maximum total of 64 results).
Approaches that avoid using an external search service ...
Approach 1 - put the information content of the XML into a database and search using SQL/JDBC. Variations of the same using Hibernate, etc.
Approach 2 - read the XML file into an in-memory data structure (a Java collection) and do the searching programmatically. This will use a bit of memory, depending on how much information is in the XML file, but you only need to figure out how to parse/load the XML and access the collection.
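A minimal sketch of approach 2; the element and attribute names and the file name are invented for the example:

import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class InMemoryXmlSearch {
  public static void main(String[] args) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse("companies.xml");

    // load every <company name="..." url="..."/> entry into a list
    List<String[]> companies = new ArrayList<>();
    NodeList nodes = doc.getElementsByTagName("company");
    for (int i = 0; i < nodes.getLength(); i++) {
      Element e = (Element) nodes.item(i);
      companies.add(new String[] { e.getAttribute("name"), e.getAttribute("url") });
    }

    // simple case-insensitive substring search over the collection
    String query = "acme";
    for (String[] c : companies) {
      if (c[0].toLowerCase().contains(query)) {
        System.out.println(c[0] + " -> " + c[1]);
      }
    }
  }
}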
However, it would help if you explained the context in which you are trying to do this. Is it a browser plugin? The client side of a web app? The server side? A desktop application?