How to programmatically retrieve google results? - java

Now that the Google search API has been discontinued - what is the best way to retrieve search results programmatically?
I need to get a list of the files on my website that Google has indexed, so that I can write a script using that data.
What I want to do is delete everything cached under http://mysite.com/mypdfs/
i.e.
search for "site:mysite.com/mypdfs",
to get back a list of pdfs on mysite.com:
http://mysite.com/pdf/1.pdf
http://mysite.com/pdf/2.pdf
...
http://mysite.com/pdf/1000000.pdf
etc
Then use WebDriver to push them through the webmaster removal tool.
Happy to pay for the privilege if required...

You'll have to set up a Custom Search engine and use the new Custom Search API. It's similar to the old, deprecated search API and returns JSON or Atom.
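As a rough sketch of what a request to the Custom Search JSON API looks like, the snippet below only builds the request URL and does not send it. The API key and engine ID are placeholders you would get from your own Google project and Custom Search engine, and the class and method names are made up for illustration.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Builds request URLs for the Custom Search JSON API (sketch only, no network access). */
class CustomSearchUrl {

    // Endpoint of the Custom Search JSON API.
    static final String ENDPOINT = "https://www.googleapis.com/customsearch/v1";

    /**
     * @param apiKey   API key (placeholder value in this sketch)
     * @param engineId Custom Search engine ID, sent as the "cx" parameter (placeholder)
     * @param query    search terms, e.g. "site:mysite.com/mypdfs"
     * @param start    1-based index of the first result to return
     */
    static String build(String apiKey, String engineId, String query, int start) {
        return ENDPOINT
                + "?key=" + encode(apiKey)
                + "&cx=" + encode(engineId)
                + "&q=" + encode(query)
                + "&start=" + start;
    }

    // Percent-encodes a query parameter value.
    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(build("YOUR_API_KEY", "YOUR_ENGINE_ID", "site:mysite.com/mypdfs", 1));
    }
}
```

Fetching the URL with any HTTP client and reading the JSON response would be the next step. Increasing `start` pages through the results.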

Related

How to crawl and parse only precise data using Nutch?

I'm new to Nutch and crawling. I have installed Nutch 2.0 and crawled and indexed data using Solr 4.5 by following some basic tutorials. Now I don't want to parse all the text content of a page. I want Nutch to crawl the page and extract only address-related data, because my use case is to crawl URLs and parse only the address information as text.
For example, I need to crawl and parse only the text content that contains an address, email ID, phone number or fax number.
How should I do this? Is there a plugin already available for this?
If I need to write a custom parser for this, can anyone help me with that?
Check out NUTCH-1870, a work in progress on a generic XPath plugin for Nutch. The alternative is to write a custom HtmlParseFilter that scrapes the data you want. A good (and simple) example is the headings plugin. Keep in mind that both of these links are for the 1.x branch of Nutch and you're working with 2.x. Although things differ to some degree, the logic should be portable. The other alternative is to use the 1.x branch.
Based on your comment:
Since you don't know the structure of the webpage, the problem is somewhat different. Essentially you'll need to "teach" Nutch how to detect the text you want, based on some regexp or on a library that extracts addresses from plain text, such as the jgeocoder library. You'll need to parse the page (iterating over every node) and try to find something that resembles an address, phone number, fax number, etc. This is similar to what the headings plugin does, except that instead of looking for addresses or phone numbers it just finds the title nodes in the HTML structure. This could be a starting point for writing a plugin that does what you want, but I don't think there is anything out of the box for this.
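To illustrate only the regexp part of that idea, here is a minimal sketch in plain Java. It is not a Nutch plugin and does not use the Nutch API. The two patterns are deliberately loose assumptions that would need tuning for real pages, and the class name is made up. Street addresses are not covered, since they usually need a dedicated library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Pulls email addresses and phone-like numbers out of plain text (illustrative patterns). */
class ContactExtractor {

    // Loose email pattern: local part, "@", domain, dot, top-level domain.
    static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Loose phone/fax pattern: optional "+", then digits mixed with spaces, dots, dashes, parentheses.
    static final Pattern PHONE =
            Pattern.compile("\\+?\\d[\\d\\s().-]{6,}\\d");

    /** Returns every non-overlapping match of the pattern, in order of appearance. */
    static List<String> findAll(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            hits.add(matcher.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "Contact: info@example.com, Phone: +1 (555) 010-9999, Fax: 555-010-8888.";
        System.out.println(findAll(EMAIL, text));
        System.out.println(findAll(PHONE, text));
    }
}
```

Inside a parse filter, the same matching would run over the text of each node instead of a hard-coded string.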
Check [NUTCH-978], which introduces a plugin called XPath. It lets a Nutch user process web pages and extract only the information they want, making the index more accurate and its content more flexible.

How to edit native google documents programmatically? [duplicate]

I found a few rather depressing Q&As here saying that Google documents cannot be modified programmatically through the Google Drive API; there is only an upload/download option.
I checked these similar topics:
How to programmatically manipulate native google doc files
How do you create a document in Google Docs programmatically?
As I understand it, we cannot directly download and upload native Google Doc formats. Is there any other way to meet this requirement?
Has anyone tried to trigger a Google Apps Script programmatically on a selected document? Is that possible? Is it possible to start a Google Apps Script programmatically with some input parameters?
I just need to replace a few pieces of text in native Google Docs, but I cannot use a download->modify->upload flow (e.g. with Word/HTML/PDF formats) because it would break the formatting of pictures, borders, etc. (customer requirement: full Google integration, no proprietary formats).
Do you have any innovative ideas or tips that would be worth exploring?
We are trying to use Google Drive as a very simple templating system (thousands of users, hundreds of Google documents), but it seems to be a really bad idea as there are a lot of limitations along the way.
You can't use the Drive API to programmatically manage the content of a Google Document but you can use the Document Service in Apps Script to perform text replacing and other editing:
https://developers.google.com/apps-script/service_document
We invoke a Google Apps Script, deployed on the same domain as the web app, which changes the content of the documents before we download them in a proprietary format. We are just replacing a few strings, nothing complex.
This solution works but it's a bit fragile (you have to install the Google Apps Script and the Google App Engine app in one domain). We are not sure how quickly changes propagate after the script is triggered, so we always wait a short time, e.g. 10 seconds, before we try to download the modified document.
An important disadvantage is that you cannot invoke the script from localhost, so development is a bit slower because we have to upload our app to Google App Engine each time.
Nowadays it's possible to use Java and other programming languages, without Google Apps Script, by using the Google Docs API.
It is also possible to execute Google Apps Script code from other programming platforms by using the Google Apps Script API, but it doesn't work with service accounts.
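For the text-replacement case from the question, the Docs API route might look roughly like the sketch below. It only builds the URL and JSON body of a documents.batchUpdate call containing one replaceAllText request. Sending it with an OAuth access token is left out, the class name is made up, and the hand-written JSON escaping is a simplification (a JSON library would be the safer choice).

```java
/** Builds a Docs API batchUpdate call that replaces a placeholder everywhere (sketch only). */
class ReplaceAllTextRequest {

    // REST endpoint template for documents.batchUpdate.
    static final String ENDPOINT_TEMPLATE = "https://docs.googleapis.com/v1/documents/%s:batchUpdate";

    /** URL to POST the body to, for the given document ID. */
    static String url(String documentId) {
        return String.format(ENDPOINT_TEMPLATE, documentId);
    }

    /** JSON body with a single replaceAllText request (case-sensitive match). */
    static String body(String placeholder, String replacement) {
        return "{\"requests\":[{\"replaceAllText\":{"
                + "\"containsText\":{\"text\":\"" + escape(placeholder) + "\",\"matchCase\":true},"
                + "\"replaceText\":\"" + escape(replacement) + "\"}}]}";
    }

    // Minimal escaping: backslashes and double quotes only; control characters are not handled.
    private static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        System.out.println(url("YOUR_DOCUMENT_ID"));
        System.out.println(body("{{name}}", "Alice"));
    }
}
```

Because the replacement happens inside the native document, the formatting around the placeholder stays untouched, which is what the download->modify->upload flow could not guarantee.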
Notes:
There are some features available in the Google Docs user interface that aren't available in the Google Docs API.
Inserting content inside tables that have rows and columns of different sizes might be complex because of the way the indices work. Something that might help is to build the document from bottom to top.
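The reason bottom-to-top helps can be shown with a plain string standing in for the document. This is an analogy using StringBuilder, not Docs API code, and the class name is made up. Applying the inserts from the highest index down means no insert shifts the positions that the remaining ones refer to.

```java
import java.util.Map;
import java.util.TreeMap;

/** Applies inserts from the highest index down so earlier indices stay valid (analogy only). */
class BottomUpInserts {

    /**
     * @param original        the starting text
     * @param insertsByIndex  text to insert, keyed by its index in the ORIGINAL text
     */
    static String apply(String original, Map<Integer, String> insertsByIndex) {
        StringBuilder result = new StringBuilder(original);
        // descendingMap() iterates from the largest index to the smallest.
        for (Map.Entry<Integer, String> entry : new TreeMap<>(insertsByIndex).descendingMap().entrySet()) {
            int index = entry.getKey();
            result.insert(index, entry.getValue());
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Both indices refer to positions in "ABCDEF".
        System.out.println(apply("ABCDEF", Map.of(1, "-", 4, "+"))); // prints A-BCD+EF
    }
}
```

Going top to bottom instead would put "+" one position too early, because the "-" inserted first shifts everything after it.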

using google maps in the java app

I want to use Google Maps in my Java web app. When a user visits a particular page, I want their location to be shown on the map, using their IP address as the input. Is that possible?
The map should also be able to locate a position the user enters in a text field. How can I do that?
I even downloaded the jar files and samples from http://code.google.com/p/gdata-java-client/downloads/list but they don't work because some packages are missing, such as com.google.gdata.util.
I have used Google Maps more on the JavaScript side.
I always find Google Playground to be the best tutor.
Check it out.
You'll have to use JavaScript to use the Google Maps API. Here's a sample of using HTML5 geolocation to find someone's location: http://html5demos.com/geo
For user input, you might check out using the Google Maps API Places Library which has an autocomplete function: https://developers.google.com/maps/documentation/javascript/places#places_autocomplete

Duplicate Google Spreadsheet on Demand

I've created a pretty complex Google spreadsheet. I would like a user to be able to click a button or follow a link and get a copy of this spreadsheet that they can fill in with data. I would later process this data manually.
Is there any way I can do this via a complicated link, some JavaScript, or possibly even a server-side language (e.g. Python, Java)?
Thank you,
You have a few options:
Rather than force a user to create a spreadsheet that you verify, you can email them a form to fill out with Google forms, and the answers get aggregated back on your spreadsheet.
Use the docs API to copy documents.
Use Google Apps Script to automate the process (it's essentially javascript).
Copying the document from the client side:
http://code.google.com/apis/documents/docs/3.0/developers_guide_protocol.html#CopyingDocs
Using the Java API, it would seem you'd have to export the document and then upload it:
http://code.google.com/apis/documents/docs/3.0/developers_guide_java.html

Web Search API for 25000-50000 Entries

I have 20000-50000 entries in an Excel file. One column contains the name of a company. Ideally, I would like to search for each company name and take the URL of the first result. I am aware that Google (which is my ideal choice) provides an AJAX Search API. However, it also has a limit of 1000 searches per registrant. Is there a way to run over 20000 searches without creating 20 Google accounts, or is there an alternative engine I could use?
Any alternative ways of approaching this problem are also welcome (i.e. WhoIs look-ups).
Google AJAX Search has no such limit of 1000. Yahoo Search does. Google AJAX Search limits you to getting 64 results per search but otherwise has no limit.
From Google AJAX Search API - Class Reference:
Note: The maximum number of results pages is based on the type of searcher. Local search supports 4 pages (or a maximum of 32 total results) and the other searchers (Blog, Book, Image, News, Patent, Video, and Web) support 8 pages (for a maximum total of 64 results).
Approaches that avoid using an external search service ...
Approach 1 - put the information content of the XML into a database and search using SQL/JDBC. Variations of the same using Hibernate, etc.
Approach 2 - read the XML file as an in-memory data structure as a Java collection, and do the searching programmatically. This will use a bit of memory depending on how much information is in the XML file, but you only need to figure out how to parse / load the XML, and access the collection.
However, it would help if you explained the context in which you are trying to do this. Is it a browser plugin? The client side of a web app? The server side? A desktop application?
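Approach 2 could be sketched roughly as follows. The XML layout (a company element with name and url attributes) is an assumption made up for this example, since the actual file format isn't given, and the class name is made up too. The file is parsed once into a map and lookups are then plain in-memory operations.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Loads a hypothetical XML layout into a map keyed by lower-cased company name. */
class CompanyIndex {

    /** Parses the XML text and returns name -> url; wraps parse errors in an unchecked exception. */
    static Map<String, String> load(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // "company", "name" and "url" are assumed names, not taken from the question.
            NodeList nodes = doc.getElementsByTagName("company");
            Map<String, String> urlByName = new HashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element company = (Element) nodes.item(i);
                urlByName.put(company.getAttribute("name").toLowerCase(Locale.ROOT),
                        company.getAttribute("url"));
            }
            return urlByName;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<companies>"
                + "<company name=\"Acme\" url=\"http://acme.example\"/>"
                + "<company name=\"Globex\" url=\"http://globex.example\"/>"
                + "</companies>";
        Map<String, String> index = load(xml);
        System.out.println(index.get("acme"));
    }
}
```

A HashMap gives constant-time exact lookups. Partial or fuzzy matching would need a different structure or a scan over the entries.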
