Using Nutch to Retrieve Page Contents - java

I have a very large list of seeds to be crawled (only those seeds are needed without any deepening). How can I use Nutch to retrieve:
the HTML of
the text content of
(Preferably) the out-links of
the seed pages? (without any indexing and integration into any other platform like Solr).
Thanks

Well, there are several issues you want to address. Below are those issues with their solutions:
Limiting crawling to the seed list: enable the scoring-depth plugin and configure it to allow only 1 level of crawling.
Getting the textual content: Nutch does that by default.
Getting the raw HTML: it is not possible with Nutch 1.9. You need to check out Nutch from its trunk repository and build it yourself, because storing the HTML content is scheduled for Nutch's next release (1.10).
Extracting outlinks: you can do that, but you have to write a new IndexingFilter to index the outlinks (see the sketch below).
Doing all of the above without Solr: you can do that, but you have to write a new indexer that stores the extracted data in whatever format you want.
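As a rough illustration of such an IndexingFilter, here is a sketch based on the Nutch 1.x plugin API; the class name and the "outlinks" field name are made up, and the exact interface signatures should be double-checked against the Nutch version you build from trunk:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.Parse;

// Illustrative filter that copies the outlinks collected at parse time
// into an "outlinks" field of the document being indexed.
public class OutlinkIndexingFilter implements IndexingFilter {

  private Configuration conf;

  @Override
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    Outlink[] outlinks = parse.getData().getOutlinks();
    if (outlinks != null) {
      for (Outlink outlink : outlinks) {
        doc.add("outlinks", outlink.getToUrl());
      }
    }
    return doc;
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}

Like any Nutch plugin, it would also need a plugin.xml descriptor and an entry in plugin.includes before Nutch picks it up.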

Related

Chromium headless PDF generation (in Java) using String instead of temp file/page

We currently use the PhantomJS executable for two things in our Java project:
Create a PDF file from a given String html we get from our database (for which we write the String to a temp file first)
Create a screenshot of a given Widget-Object (for which we have an open HTML page on the front-end)
Since PhantomJS hasn't been updated for a few years, I'm about to change it to a headless Chromium method instead, which has the options --print-to-pdf and --screenshot for options 1 and 2.
Option 2 isn't really relevant since we have a page, but for option 1 it would be nice if we could directly use the chromium command-line with the given String. Right now with PhantomJS, we convert the String to a temp file, and then use the executable to generate the actual PDF output file. I can of course do the same with the headless Chromium executable, but since I'm changing it right now anyway, it would be nice if the 'String to temp HTML file' step wouldn't be necessary for creating the output PDF file, since we already have the page in memory anyway after retrieving it from the database.
From what I've seen, the Chromium executable is usually run to convert either an HTML file to a PDF file:
chromium --headless -disable-gpu --print-to-pdf="C:/path/to/output-file.pdf" C:/path/to/input-file.html
Or an HTML page (URL) to a PDF file:
chromium --headless -disable-gpu --print-to-pdf="C:/path/to/output-file.pdf" https://www.google.com/
I couldn't really find the docs for the chrome/chromium executable (although I have been able to find the list of command options in the source code), so maybe there are more options besides these two above? (If anyone has a link to the docs, that would be great as well.)
If not, I guess I'll just use a temp file as we did before with PhantomJS.
The search terms 'chrome read stdin' would probably have brought you to this question, which explains how to read from a data URL:
chrome.exe "data:text/html;base64,PCFET0NUWVBFIGh0bWw+PGh0bWw+PGhlYWQ+PHRpdGxlPlRlc3Q8L3RpdGxlPjwvaGVhZD48Ym9keT5ZbzwvYm9keT48L2h0bWw+"
Reading input from stdin suggests you would also want to write the output to stdout ('chrome pdf to stdout'), which leads to someone trying the same thing in 2018 and running into the issue that --stdout cannot be combined with screenshot or PDF output.
And, depending on the use case, even worse: data URLs are limited to 2MB.
So if you can't guarantee the input to be less than 2MB, you might be better off using files anyway, or check whether the limitation has since been removed.
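If you do go the data-URL route, a minimal Java sketch of the idea could look like the following; it assumes a chromium binary is on the PATH (adjust to the full chrome.exe/msedge.exe path on Windows) and that, per the limits discussed above, the encoded HTML stays well under 2MB:

import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.Base64;

public class HtmlStringToPdf {

  public static void renderToPdf(String html, Path outputPdf)
      throws java.io.IOException, InterruptedException {
    // Encode the in-memory HTML as a base64 data URL (subject to the ~2MB limit).
    String dataUrl = "data:text/html;base64,"
        + Base64.getEncoder().encodeToString(html.getBytes(StandardCharsets.UTF_8));

    // "chromium" is an assumption; point this at your actual browser executable.
    Process chromium = new ProcessBuilder(
        "chromium",
        "--headless",
        "--print-to-pdf=" + outputPdf.toAbsolutePath(),
        dataUrl)
        .inheritIO()
        .start();

    int exitCode = chromium.waitFor();
    if (exitCode != 0) {
      throw new java.io.IOException("Chromium exited with code " + exitCode);
    }
  }
}

Usage would be something like renderToPdf(htmlFromDatabase, Path.of("output-file.pdf")); the data URL simply replaces the input-file path in the command lines shown in the question.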
Also, given that you specify that option 2 has a solution in serving the page directly, would that not also open up the option to do the same for option 1?
You should not need the redundant -disable-gpu (I'm unsure in which version it stopped being needed on Windows, but it was already redundant per https://chromium-review.googlesource.com/c/chromium/src/+/1161172 from 2018); however, you may want to replace it with --print-to-pdf-no-header to avoid the default headers and footers.
You're using Windows as the shell to run Chrome/MSEdge.exe, and for that reason CMD gives you a significantly smaller command-line length for passing a variable string.
Passing a Base64-encoded HTML string (on the command line or via stdin) will often be limited, for similar string-length reasons, to roughly 1.5MB of content (75% of 2MB). In special, exceptional cases that may still amount to some 4096 pages (see https://github.com/GitHubRulesOK/MyNotes/raw/master/Hanoi.htm), but the norm is usually only a few standard HTML pages.
PDF file handling requires a file system to generate the pages, hence a file-centred approach to store the decimal-based file index. So the in-memory workaround is to use a RamDrive/disk or its bytes-IO equivalent as a named FileStream object.
Working with PDF data in memory is usually still highly disk intensive, because once resources run low after the content has been processed, the system needs to draw on the disk cache to augment virtual RAM. As a result, working in memory can be just as slow as, if not slower than, using cached disk file data.
%TMP%/%TEMP% files can usually respond quicker and be very easily overwritten.
There are many other working and non-working switches bandied about the web, but the semi-official list is https://peter.sh/experiments/chromium-command-line-switches/
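If you do end up keeping the temp-file route after all, a rough Java sketch of that fallback (same assumption about the chromium binary; the temp file lands in java.io.tmpdir, i.e. %TMP%/%TEMP% on Windows) could be:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class HtmlTempFileToPdf {

  public static void renderToPdf(String html, Path outputPdf)
      throws java.io.IOException, InterruptedException {
    // Write the in-memory HTML to a temp file under java.io.tmpdir.
    Path tempHtml = Files.createTempFile("print-", ".html");
    try {
      Files.write(tempHtml, html.getBytes(StandardCharsets.UTF_8));

      Process chromium = new ProcessBuilder(
          "chromium",                       // assumption: adjust to your chrome/msedge path
          "--headless",
          "--print-to-pdf-no-header",       // optional: drop the default header/footer
          "--print-to-pdf=" + outputPdf.toAbsolutePath(),
          tempHtml.toUri().toString())      // pass the temp file as a file:// URL
          .inheritIO()
          .start();

      if (chromium.waitFor() != 0) {
        throw new java.io.IOException("Chromium exited abnormally");
      }
    } finally {
      Files.deleteIfExists(tempHtml);       // clean up the temp file
    }
  }
}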

Apache Nutch fetch and updatedb stages

I've got a question about the way Nutch obtains links to update the crawldb with.
The command in question is bin/nutch updatedb crawl/crawldb $s1
I need to write a custom parser. Before doing so I've examined Nutch's source code, and as far as I can tell I'm responsible for providing the links to update the crawldb with, by extracting them from the document and putting them in as an Outlink[] in the ParseData. At least that's what I understood from this.
Correct me if I'm wrong, because I wouldn't like my crawler to stop after the first iteration, as it wouldn't have links to update the crawldb.
Nutch uses either parse-html or parse-tika to parse your crawled URLs (usually HTML); in this phase the outlinks are extracted and stored. When you execute a new iteration of the crawl, Nutch will select some of the available (extracted) links to continue crawling. You only need to write your own parser if you want to extract additional information from the page, let's say you want all h1 titles in a separate field, for instance.
If you take a look at the crawl script (https://github.com/apache/nutch/blob/master/src/bin/crawl#L246) you'll see that the updatedb command is executed once per iteration, so if you're using parse-html or parse-tika the outlinks of an HTML document (among other things) are automatically extracted for you.
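For the "h1 titles in a separate field" example, a rough HtmlParseFilter sketch along the lines of the headings plugin could look like this (Nutch 1.x API; the class name, field name and helper usage are illustrative and should be checked against your Nutch version):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NodeWalker;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;

// Illustrative filter that walks the parsed DOM, collects <h1> text
// and stores it in the parse metadata so an indexing filter can pick it up.
public class H1ParseFilter implements HtmlParseFilter {

  private Configuration conf;

  @Override
  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    Parse parse = parseResult.get(content.getUrl());
    NodeWalker walker = new NodeWalker(doc);
    while (walker.hasNext()) {
      Node node = walker.nextNode();
      if (node.getNodeType() == Node.ELEMENT_NODE
          && "h1".equalsIgnoreCase(node.getNodeName())) {
        parse.getData().getParseMeta().add("h1", node.getTextContent().trim());
      }
    }
    return parseResult;
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}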

How to crawl and parse only precise data using Nutch?

I'm new to Nutch and crawling. I have installed Nutch 2.0 and crawled and indexed the data using Solr 4.5 by following some basic tutorials. Now I don't want to parse all the text content of a page; I want to customize it so that Nutch crawls the page and scrapes/fetches only the data related to addresses, because my use case is to crawl URLs and parse only the address info as text.
For example, I need to crawl and parse only the text content which has address information, email id, phone number and fax number.
How should I do this? Is there any plugin already available for this?
If I want to write a customized parser for this, can anyone help me in this regard?
Check out NUTCH-1870, a work in progress on a generic XPath plugin for Nutch. The alternative is to write a custom HtmlParseFilter that scrapes the data that you want; a good (and simple) example is the headings plugin. Keep in mind that both of these links are for the 1.x branch of Nutch while you're working with 2.x; although things differ to some degree, the logic should be portable. The other alternative is using the 1.x branch.
Based on your comment:
Since you don't know the structure of the webpage, the problem is somewhat different: essentially you'll need to "teach" Nutch how to detect the text you want, based on some regexp or using some library that does address extraction out of plain text (like the jgeocoder library). You'll need to parse (iterate over every node of) the webpage, trying to find something that resembles an address, phone number, fax number, etc. This is similar to what the headings plugin does, except that instead of looking for addresses or phone numbers it just finds the title nodes in the HTML structure. It could be a starting point for writing a plugin that does what you want, but I don't think there is anything out of the box for this.
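As a starting point for the regexp route, a small standalone Java sketch of the extraction step could look like the following; it is meant to run over the plain text Nutch already extracts, and the patterns are deliberately simplistic placeholders rather than real address detection:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContactExtractor {

  // Very rough placeholder patterns; real address/phone detection needs more work.
  private static final Pattern EMAIL =
      Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
  private static final Pattern PHONE =
      Pattern.compile("\\+?\\d[\\d\\s().-]{7,}\\d");

  public static List<String> extract(String pageText) {
    List<String> hits = new ArrayList<>();
    collect(EMAIL.matcher(pageText), hits);
    collect(PHONE.matcher(pageText), hits);
    return hits;
  }

  private static void collect(Matcher matcher, List<String> hits) {
    while (matcher.find()) {
      hits.add(matcher.group());
    }
  }
}

From an HtmlParseFilter you would then put the matches into the parse metadata so they end up as separate fields, much like the h1 sketch in the previous question.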
Check [NUTCH-978], which introduces a plugin called XPath that allows a Nutch user to process various web pages and extract only the information the user wants, therefore making the index more accurate and its content more flexible.

desktop application / script to interact with javascript on web application

My work has tasked me with determining the feasibility of migrating our existing in-house-built change management services (web based) to a SharePoint solution. I've found everything to be easy, except I've run into the issue that for each change management issue (several thousand) there may be any number of attachment files associated with it, called through JavaScript, that need to be downloaded and put into a document library.
(ex. ... onClick="DownloadAttachment(XXXXX,'ProjectID=YYYY');return false">Attachment... ).
To keep me from manually selecting them all I've been looking over posts of people wanting to do similar, and there seem to be many possible solutions, but they often seem more complicated than they need to be.
So I suppose in a nutshell I'm asking what would be the best way to approach this issue that yields some sort of desktop application or script that can interact with web pages and will let me select and organize all the attachments. (Making a purely web based app (php, javascript, rails, etc.) is not an option for me, so throwing that out there now).
Thanks in advance.
Given a document id and project id, XXXXX and YYYY respectively in your example, figure out the URL from which the file contents can be downloaded. You can observe a few URL links in the browser and detect the pattern which your web application uses.
Use a tool like Selenium to get a list of XXXXXs and YYYYs of documents you need to download (see the sketch below).
Write a bash script with wget to download the files locally and put them in the correct folders.
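For the Selenium step, a hedged Java sketch of collecting the IDs might look like this; the CSS selector, the onclick format, and the report URL are guesses based on the DownloadAttachment snippet in the question:

import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class AttachmentIdScraper {

  // Matches onclick="DownloadAttachment(12345,'ProjectID=678');return false"
  private static final Pattern ONCLICK =
      Pattern.compile("DownloadAttachment\\((\\d+),'ProjectID=(\\d+)'\\)");

  public static void main(String[] args) {
    WebDriver driver = new ChromeDriver();
    try {
      driver.get("http://intranet/change-management/report"); // placeholder URL
      List<WebElement> links =
          driver.findElements(By.cssSelector("a[onclick*='DownloadAttachment']"));
      for (WebElement link : links) {
        Matcher m = ONCLICK.matcher(link.getAttribute("onclick"));
        if (m.find()) {
          System.out.println(m.group(1) + "," + m.group(2)); // XXXXX,YYYY as CSV
        }
      }
    } finally {
      driver.quit();
    }
  }
}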
This is a "one off" migration, right?
Get access to your in-house application's database, and create an SQL query which pulls out rows showing the attachment names (XXXXX?) and the issue/project (YYYY?), ex:
|file_id|issue_id|file_name |
| 5| 123|Feasibility Test.xls|
Analyze the DownloadAttachment method and figure out how it generates the URL that it calls for each download.
Start a script (personally I'd go for Python) that will do the migration work.
Program the script to connect and run the SQL query, or to read a CSV file you create manually from step #1.
Program the script to use those details to determine the target filename and the URL to download from.
Program the script to download the file from the given URL and place it on the hard drive with the proper name. (In Python, you might use urllib; a Java sketch follows the example listing below.)
Hopefully that will get you as far as a bunch of files categorized by "issue" like:
issue123/Feasibility Test.xls
issue123/Billing Invoice.doc
issue456/Feasibility Test.xls
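The steps above suggest Python and urllib; since the rest of this thread leans on Java, here is a rough Java equivalent of the download-and-place step, where the URL pattern and folder layout are placeholders to be derived from the DownloadAttachment analysis:

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AttachmentDownloader {

  // Downloads one attachment and stores it as issue<issueId>/<fileName>.
  public static void download(String fileId, String issueId, String fileName)
      throws java.io.IOException {
    // Placeholder URL pattern; derive the real one from the DownloadAttachment method.
    URL url = new URL("http://intranet/change-management/Download.aspx?FileID="
        + fileId + "&ProjectID=" + issueId);

    Path target = Path.of("issue" + issueId, fileName);
    Files.createDirectories(target.getParent());

    try (InputStream in = url.openStream()) {
      Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }
  }
}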
Thank you everyone. I was able to get what I needed using HtmlUnit and Java: I traversed a report I made of all change items with attachments, went to each one, copied the source code, traversed that to find instances of the download method, copied the unique IDs of each attachment, and built an .xls of all items and their attachments.

Web Search API for 25000-50000 Entries

I have 20000-50000 entries in an Excel file. One column contains the name of a company. Ideally, I would like to search for the name of that company and, whatever the first result is, select the URL associated with it. I am aware that Google (which is my ideal choice) provides an AJAX Search API. However, it also has a limit of 1000 searches per registrant. Is there a way to get over 20000 searches without making 20 accounts with Google, or is there an alternative engine I could use?
Any alternative ways of approaching this problem are also welcome (i.e. WhoIs look-ups).
Google AJAX Search has no such limit of 1000. Yahoo Search does. Google AJAX Search limits you to getting 64 results per search but otherwise has no limit.
From Google AJAX Search API - Class Reference:
Note: The maximum number of results pages is based on the type of searcher. Local search supports 4 pages (or a maximum of 32 total results) and the other searchers (Blog, Book, Image, News, Patent, Video, and Web) support 8 pages (for a maximum total of 64 results).
Approaches that avoid using an external search service ...
Approach 1 - put the information content of the XML into a database and search using SQL/JDBC. Variations of the same using Hibernate, etc.
Approach 2 - read the XML file as an in-memory data structure as a Java collection, and do the searching programmatically. This will use a bit of memory depending on how much information is in the XML file, but you only need to figure out how to parse / load the XML, and access the collection.
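A minimal sketch of Approach 2, assuming the data has been exported to an XML file of <company name="..." url="..."/> elements (the element and attribute names are made up for illustration):

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class CompanyLookup {

  record Company(String name, String url) {}

  private final List<Company> companies = new ArrayList<>();

  // Load all <company> elements into an in-memory list.
  public CompanyLookup(File xmlFile) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().parse(xmlFile);
    NodeList nodes = doc.getElementsByTagName("company");
    for (int i = 0; i < nodes.getLength(); i++) {
      Element e = (Element) nodes.item(i);
      companies.add(new Company(e.getAttribute("name"), e.getAttribute("url")));
    }
  }

  // Return the URL of the first company whose name contains the query.
  public String findUrl(String query) {
    return companies.stream()
        .filter(c -> c.name().toLowerCase().contains(query.toLowerCase()))
        .map(Company::url)
        .findFirst()
        .orElse(null);
  }
}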
However, it would help if you explained the context in which you are trying to do this. Is it a browser plugin? The client side of a web app? The server side? A desktop application?
