Web Search API for 25000-50000 Entries - java

I have 20,000-50,000 entries in an Excel file. One column contains a company name. Ideally, I would like to search for that company name and select the URL of the first result. I am aware that Google (my ideal choice) provides an AJAX Search API. However, it also has a limit of 1,000 searches per registrant. Is there a way to run over 20,000 searches without creating 20 Google accounts, or is there an alternative engine I could use?
Any alternative ways of approaching this problem are also welcome (e.g. WhoIs look-ups).

Google AJAX Search has no such 1,000-search limit; Yahoo Search does. Google AJAX Search limits you to 64 results per search, but otherwise has no limit.
From Google AJAX Search API - Class Reference:
Note: The maximum number of results pages is based on the type of searcher. Local search supports 4 pages (or a maximum of 32 total results) and the other searchers (Blog, Book, Image, News, Patent, Video, and Web) support 8 pages (for a maximum total of 64 results).
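For illustration, here is a minimal sketch of querying the (since-retired) AJAX Search REST endpoint from Java and grabbing the raw JSON; the query parameter and the responseData.results[0].url field follow the old documentation, and the class name is a placeholder:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class FirstResultLookup {

    // Queries the legacy AJAX Search REST endpoint and returns the raw JSON.
    // The first hit's URL lives under responseData.results[0].url.
    static String search(String companyName) throws Exception {
        String query = URLEncoder.encode(companyName, "UTF-8");
        URL url = new URL("http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=" + query);
        StringBuilder json = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                json.append(line);
            }
        }
        return json.toString(); // parse with the JSON library of your choice
    }

    public static void main(String[] args) throws Exception {
        System.out.println(search("Example Corp"));
    }
}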

Approaches that avoid using an external search service ...
Approach 1 - load the information content of the XML into a database and search it using SQL/JDBC, with variations of the same using Hibernate, etc.
Approach 2 - read the XML file into an in-memory Java collection and do the searching programmatically (a sketch follows below). This will use memory in proportion to how much information is in the XML file, but you only need to figure out how to parse/load the XML and access the collection.
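A minimal sketch of Approach 2, with a hypothetical Entry class and assuming the rows have already been parsed out of the XML:

import java.util.ArrayList;
import java.util.List;

public class InMemorySearch {

    // Hypothetical holder for one row: a company name and its URL.
    static class Entry {
        final String companyName;
        final String url;
        Entry(String companyName, String url) {
            this.companyName = companyName;
            this.url = url;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    void add(Entry e) {
        entries.add(e);
    }

    // A linear scan is still fast for 50,000 entries; a HashMap keyed on a
    // normalized name would make repeated exact lookups O(1).
    Entry findByName(String name) {
        for (Entry e : entries) {
            if (e.companyName.equalsIgnoreCase(name)) {
                return e;
            }
        }
        return null;
    }
}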
However, it would help if you explained the context in which you are trying to do this. Is it a browser plugin? The client side of a web app? The server side? A desktop application?

Related

Easiest way to run Java program in the Cloud?

I'm working on a small side-project for our company that does the following:
1. PDF documents received through Office 365 Outlook are temporarily stored in OneDrive, using Power Automate
2. Text data is extracted from the PDFs using a few Java libraries
3. Based on the extracted data, an appropriate filename and filepath are created
4. The PDFs are permanently saved in OneDrive
The issue right now is that my Java program runs locally, i.e. points 2, 3 and 4 require code to run 24/7 on my PC. I'd like to transition to a cloud-based solution.
What is the easiest way to accomplish this? The solution doesn't have to be free, but shouldn't cost more than $20/mo. Our company already has an Azure subscription, though I'm not familiar yet with Azure.
What you are looking for is a solution that uses a serverless execution model. Azure Functions seems to be a possible choice here: it appears to have input bindings that respond to OneDrive files, and likewise output bindings.
The cost will depend on the number of documents, not on how long the solution is available. I assume we are talking about a small number of documents per month, so this will come out cheaper than other execution models.
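As a rough sketch, one simple wiring is an HTTP-triggered Java function that Power Automate calls whenever a new PDF lands in OneDrive. The annotations below are the standard Azure Functions Java API; the function name and request payload are assumptions:

import java.util.Optional;
import com.microsoft.azure.functions.ExecutionContext;
import com.microsoft.azure.functions.HttpMethod;
import com.microsoft.azure.functions.HttpRequestMessage;
import com.microsoft.azure.functions.HttpResponseMessage;
import com.microsoft.azure.functions.HttpStatus;
import com.microsoft.azure.functions.annotation.AuthorizationLevel;
import com.microsoft.azure.functions.annotation.FunctionName;
import com.microsoft.azure.functions.annotation.HttpTrigger;

public class ProcessPdfFunction {

    // Called by a Power Automate HTTP action with details of the new PDF.
    @FunctionName("ProcessPdf")
    public HttpResponseMessage run(
            @HttpTrigger(name = "req",
                         methods = {HttpMethod.POST},
                         authLevel = AuthorizationLevel.FUNCTION)
            HttpRequestMessage<Optional<String>> request,
            ExecutionContext context) {

        String body = request.getBody().orElse("");
        context.getLogger().info("Received PDF notification: " + body);

        // Hypothetical steps 2 and 3: extract text with your existing Java
        // libraries and derive the target filename/filepath, then hand the
        // result back to Power Automate, which moves the file in OneDrive.
        String targetPath = deriveTargetPath(body);

        return request.createResponseBuilder(HttpStatus.OK)
                      .body(targetPath)
                      .build();
    }

    private String deriveTargetPath(String extractedText) {
        return "/archive/placeholder.pdf"; // assumption: real logic goes here
    }
}

Returning the computed path to Power Automate keeps the OneDrive file handling in the flow itself, so the function needs no Graph API credentials.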

How to crawl and parse only precise data using Nutch?

I'm new to Nutch and crawling. I have installed Nutch 2.0 and have crawled and indexed data using Solr 4.5, following some basic tutorials. Now I don't want to parse all the text content of a page; I want to customize the parse so that Nutch crawls the page but scrapes/fetches only address-related data, because my use case is to crawl URLs and parse only address info as text.
For example, I need to crawl and parse only the text content that contains address information, email IDs, phone numbers and fax numbers.
How should I do this? Is there any plugin already available for this?
If I need to write a custom parser for this, can anyone help me in this regard?
Check out NUTCH-1870, a work in progress on a generic XPath plugin for Nutch. The alternative is to write a custom HtmlParseFilter that scrapes the data you want; a good (and simple) example is the headings plugin. Keep in mind that both of these links are for the 1.x branch of Nutch while you're working with 2.x; although things differ to some degree, the logic should be portable. The other alternative is to use the 1.x branch.
Based on your comment:
Since you don't know the structure of the webpage, the problem is somewhat different: essentially you'll need to "teach" Nutch how to detect the text you want, based on some regexp or using a library that does address extraction out of plain text, such as jgeocoder. You'll need to parse (iterate over every node of) the webpage, trying to find something that resembles an address, phone number, fax number, etc. This is similar to what the headings plugin does, except that it just finds the title nodes in the HTML structure instead of looking for addresses or phone numbers. It could be a starting point for writing a plugin that does what you want, but I don't think there is anything out of the box for this.
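As an illustration, a custom HtmlParseFilter along those lines could look roughly like this (a sketch for the Nutch 1.x extension point, with a hypothetical class name and a deliberately naive phone-number regexp):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

public class ContactDetailsFilter implements HtmlParseFilter {

    // Naive US-style phone pattern; real address extraction needs much more,
    // e.g. a library like jgeocoder as mentioned above.
    private static final Pattern PHONE =
        Pattern.compile("\\(?\\d{3}\\)?[-.\\s]?\\d{3}[-.\\s]?\\d{4}");

    private Configuration conf;

    @Override
    public ParseResult filter(Content content, ParseResult parseResult,
                              HTMLMetaTags metaTags, DocumentFragment doc) {
        Parse parse = parseResult.get(content.getUrl());
        if (parse == null) {
            return parseResult;
        }
        Matcher m = PHONE.matcher(parse.getText());
        StringBuilder found = new StringBuilder();
        while (m.find()) {
            found.append(m.group()).append(' ');
        }
        if (found.length() > 0) {
            // Stash the matches in the parse metadata so an indexing filter
            // can later push them into Solr.
            parse.getData().getContentMeta()
                 .set("contact.phones", found.toString().trim());
        }
        return parseResult;
    }

    @Override
    public void setConf(Configuration conf) { this.conf = conf; }

    @Override
    public Configuration getConf() { return conf; }
}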
Also check NUTCH-978, which introduces a plugin called XPath that allows the user of Nutch to process various web pages and extract only the information the user wants, making the index more accurate and its content more flexible.

How to edit native google documents programmatically? [duplicate]

I found a few rather discouraging Q&As here which said that Google documents cannot be modified programmatically through the Google Drive API - there's just an upload/download option.
I checked those similar topics :
How to programmatically manipulate native google doc files
How do you create a document in Google Docs programmatically?
As I understand it, we cannot directly download and upload native Google Doc formats. Is there any other way to solve this requirement?
Has anyone tried to trigger a Google Apps Script programmatically on a selected document - is that possible? Is it possible to start a Google Apps Script programmatically with some parameters as input?
I just need to replace a few pieces of text in native Google Docs, but I cannot use a download -> modify -> upload flow (e.g. via Word/HTML/PDF formats), as it would break the formatting of pictures, borders, etc. (customer requirement: full Google integration, no proprietary formats).
Do you have any innovative ideas or tips which would be good to explore?
We are trying to use Google Drive as a very simple templating system (~thousands of users, hundreds of Google documents), but it seems to be the wrong idea, as there are a lot of limitations along the way.
You can't use the Drive API to programmatically manage the content of a Google Document, but you can use the Document Service in Apps Script to perform text replacement and other editing:
https://developers.google.com/apps-script/service_document
We invoke a Google Apps Script, deployed on the same domain as our web app, which changes the content of documents before we download them to a proprietary format. We are just replacing a few strings, nothing complex.
This solution works, but it's a bit fragile (you have to install the Apps Script and the Google App Engine app in one domain), and we are not sure how quickly changes are propagated after the script is triggered, so we always wait a small amount of time, e.g. 10 seconds, before we try to download the modified document.
An important disadvantage is that you cannot invoke the Apps Script from localhost, so development is a bit slower, as we have to upload our app into Google App Engine each time.
Nowadays it's possible to use Java and other programming languages directly, without going through Google Apps Script, by using the Google Docs API.
It is also possible to execute Google Apps Script code from other platforms by using the Google Apps Script API, but that doesn't work with service accounts.
Notes:
There are some features available in the Google Docs user interface that aren't available in the Google Docs API.
Inserting content inside tables that have rows and columns of different sizes might be complex due to the way the indices work. Something that might help is to build the document from the bottom up.
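For example, the text replacement asked about above can be done with the Docs API Java client roughly like this (a sketch assuming an already-authorized Docs service and a hypothetical {{customer_name}} placeholder):

import java.util.Collections;
import com.google.api.services.docs.v1.Docs;
import com.google.api.services.docs.v1.model.BatchUpdateDocumentRequest;
import com.google.api.services.docs.v1.model.ReplaceAllTextRequest;
import com.google.api.services.docs.v1.model.Request;
import com.google.api.services.docs.v1.model.SubstringMatchCriteria;

public class DocTemplating {

    // Replaces a placeholder in a native Google Doc in place, with no
    // download/upload, so pictures, borders and formatting stay intact.
    static void fillTemplate(Docs docs, String documentId) throws Exception {
        Request replace = new Request().setReplaceAllText(new ReplaceAllTextRequest()
                .setContainsText(new SubstringMatchCriteria()
                        .setText("{{customer_name}}") // hypothetical placeholder
                        .setMatchCase(true))
                .setReplaceText("ACME Ltd."));

        docs.documents()
            .batchUpdate(documentId,
                         new BatchUpdateDocumentRequest()
                                 .setRequests(Collections.singletonList(replace)))
            .execute();
    }
}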

How to programmatically retrieve google results?

Now that the Google Search API has been discontinued, what is the best way to retrieve search results programmatically?
I need to get a list of the files on my web site that have been indexed by Google, so that I can write a script using that data.
What I want to do is delete everything cached under http://mysite.com/mypdfs/
i.e. search for "site:mysite.com/mypdfs" to get back a list of PDFs on mysite.com:
http://mysite.com/pdf/1.pdf
http://mysite.com/pdf/2.pdf
...
http://mysite.com/pdf/1000000.pdf
Then use WebDriver to push them through the webmaster removal tool.
Happy to pay for the privilege if required...
You'll have to set up a Custom Search engine and use the new Custom Search API. It's similar to the old deprecated Search API and returns JSON or Atom.
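A minimal sketch of calling the Custom Search JSON API from Java; the API key and search-engine ID (cx) are placeholders you get from the API console:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class CustomSearchClient {

    private static final String API_KEY = "YOUR_API_KEY";            // placeholder
    private static final String ENGINE_ID = "YOUR_SEARCH_ENGINE_ID"; // placeholder (cx)

    // Returns the raw JSON response; result URLs are under items[].link.
    static String search(String query) throws Exception {
        String q = URLEncoder.encode(query, "UTF-8");
        URL url = new URL("https://www.googleapis.com/customsearch/v1?key="
                + API_KEY + "&cx=" + ENGINE_ID + "&q=" + q);
        StringBuilder json = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                json.append(line);
            }
        }
        return json.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(search("site:mysite.com/mypdfs"));
    }
}

Note that the Custom Search API has its own daily quota, so for a large site you would page through results (the start parameter) and spread requests over time.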

Google App Engine (Java) - App upload failed due to app size limit exceeded (free account)

I am using Google App Engine for my development; my project involves around 60 PDFs that should be available for users to download.
When I try to upload the project by clicking the deploy button in Eclipse, I get an error that the app size limit was exceeded.
I just want to know: if I switch to a paid account, is the allowed application size any different?
As far as I know, it's 150 MB for now.
You should use the Blobstore service to store your PDF files, and keep the application itself only for the files needed by your application logic and presentation, not data. Here is the description of the Blobstore:
The Blobstore API allows your app to serve data objects, called blobs, that are much larger than the size allowed for objects in the Datastore service. Blobs are created by uploading a file through an HTTP request. Typically, your apps will do this by presenting a form with a file upload field to the user. When the form is submitted, the Blobstore creates a blob from the file's contents and returns an opaque reference to the blob, called a blob key, which you can later use to serve the blob.
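Serving one of the uploaded PDFs is then a few lines in a servlet (a sketch; the servlet and its blob-key request parameter are hypothetical, while the Blobstore calls are the standard App Engine Java API):

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import com.google.appengine.api.blobstore.BlobKey;
import com.google.appengine.api.blobstore.BlobstoreService;
import com.google.appengine.api.blobstore.BlobstoreServiceFactory;

public class ServePdfServlet extends HttpServlet {

    private final BlobstoreService blobstore =
            BlobstoreServiceFactory.getBlobstoreService();

    @Override
    public void doGet(HttpServletRequest req, HttpServletResponse res)
            throws IOException {
        // The blob key would have been stored when the PDF was uploaded via
        // an upload URL created with blobstore.createUploadUrl(...).
        BlobKey blobKey = new BlobKey(req.getParameter("blob-key"));
        blobstore.serve(blobKey, res); // streams the PDF; no app-size cost
    }
}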
All good advice above - try to avoid putting content like that in your code. My app hit this issue with only about 10 MB of code/images/resources; what takes up a lot of space is the GWT compilation of 15 permutations of your app.
One thing that helped me was changing my GWT JavaScript output style from Detailed to Obfuscated, resulting in much smaller code. You can also limit the number of permutations being created.
https://developers.google.com/web-toolkit/doc/1.6/FAQ_DebuggingAndCompiling#Can_I_speed_up_the_GWT_compiler?
According to http://code.google.com/intl/de/appengine/docs/quotas.html#Deployments, applications may not exceed 10 MB.
You can upload up to 10 MB of data to App Engine; see the following link:
http://code.google.com/appengine/docs/quotas.html
