The Wayback Machine offers an API that lets you download information. There are actually multiple APIs, and after searching for a few hours I still can't manage to do the following:
Using the Wayback Machine API, I am trying to get a list of all domains indexed on 06/06/15.
I have read the documentation here
https://archive.org/help/wayback_api.php
but I can't find it...
I expected something like this to work:
http://archive.org/wayback/available?url=*&timestamp=20150606
It is not possible to do what you want (?url=*), by design. You're asking us to go through 36 terabytes of data to fish out a huge list; it's not a query that our query engine supports.
Here's a working example; check it below:
http://archive.org/wayback/available?url=http://sourceforge.net/projects/&timestamp=20131006000000
Make sure you have the correct timestamp value
These are the lines I used to generate the URLs, in Python:
url = "http://sourceforge.net/projects/"+name.rstrip()
wbm_url = 'http://archive.org/wayback/available?url='+url+'&timestamp=20131006000000'
Since 2013, there may be an answer to how to get the timestamps one would need in order to fetch a specific archived copy of a website. Look at this link:
http://web.archive.org/cdx/search/cdx?url=archive.org&limit=5&showResumeKey=true
Explained here:
https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server#advanced-usage
Then, to confirm this URL works (using Python's requests):
w = requests.get('http://archive.org/wayback/available?url=archive.org&timestamp=997121112295')
Or you can fetch the HTML directly:
w2 = requests.get('http://web.archive.org/web/20040324162136/http://www.globalgiving.org:80/')
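If you'd rather do the CDX lookup from Java instead of Python, here is a minimal standard-library sketch using the same query shown above (just an illustration; any HTTP client would do):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class CdxExample {
    public static void main(String[] args) throws Exception {
        // Same CDX query as above: first 5 captures of archive.org,
        // plus a resume key for paging through the rest.
        String cdx = "http://web.archive.org/cdx/search/cdx"
                + "?url=archive.org&limit=5&showResumeKey=true";

        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new URL(cdx).openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Each line is space-separated; the second field is the
                // 14-digit timestamp you can plug into /web/<timestamp>/<url>.
                System.out.println(line);
            }
        }
    }
}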
Currently I'm using client.getAsync(folder + "/files", liveListener) to check the contents of a folder for a specific file. At the moment it works well, but if the folder has many files in it (which it is likely to have) the result returned will be rather large, so I was wondering if there is any way to limit this?
Using the Google Drive API I can query for files of a certain MIME type, which means the results returned are greatly reduced.
Is there anything like this for the Windows Live API?
The documentation doesn't suggest so...
From here:
http://msdn.microsoft.com/en-us/library/live/hh826531.aspx
It looks like there are a variety of filtering and selection mechanisms. There isn't one explicitly for MIME types, but it does let you filter on some notion of 'type':
Get only certain types of items by using the filter parameter in the preceding code and specifying the item type: all (default), photos, videos, audio, folders, or albums. For example, to get only photos, use FOLDER_ID/files?filter=photos.
It also looks like you can sort by a variety of criteria and then select by offset, which would seem very useful for what you describe above.
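For illustration, here's roughly what the same call looks like against the Live Connect REST endpoint (a rough sketch; the folder id and access token are placeholders, and filter/limit/offset are the parameters described on that MSDN page):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class LiveFilesQuery {
    public static void main(String[] args) throws Exception {
        // Placeholders: substitute a real folder id and OAuth access token.
        String folderId = "FOLDER_ID";
        String accessToken = "ACCESS_TOKEN";

        // filter narrows the item type, limit/offset page through the results,
        // so the response stays small even for large folders.
        String url = "https://apis.live.net/v5.0/" + folderId + "/files"
                + "?filter=photos&limit=20&offset=0"
                + "&access_token=" + accessToken;

        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new URL(url).openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // JSON listing of matching items
            }
        }
    }
}

With the JavaScript client you're already using, the same query string can presumably be appended to the path you pass to getAsync.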
Introduce a nontrivial folder structure in your file storage. Folders are a natural filtering mechanism for files.
I am currently using MediaWiki's URL example to make HTTP GET requests on Android.
I am simply getting information through a URL like this:
http://en.wikipedia.org/w/api.php?format=xml&action=query&titles=Main%20Page&prop=revisions&rvprop=content
However, in this example I always need some sort of direct title and only get one result back (titles=some name here).
I know that Wikipedia has more complex search methods, explained here:
http://en.wikipedia.org/wiki/Help:Searching
I would like to offer a few "previews" of multiple Wikipedia articles per search, since what users type might not always be what they want.
Is there any way to query these special "search" results?
Any help would be appreciated.
It looks like the MediaWiki search API may be what you're after. That particular page discusses getting previews of search results.
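For example, the list=search module returns several matching articles with snippets you can show as previews. A rough sketch in plain Java (the search term is a placeholder, and on Android you'd run this off the main thread):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class WikiSearch {
    public static void main(String[] args) throws Exception {
        String term = "whatever the user typed"; // placeholder search text

        // list=search returns multiple matching articles with snippets,
        // instead of requiring an exact title like titles=... does.
        String url = "https://en.wikipedia.org/w/api.php"
                + "?action=query&list=search&format=json&srlimit=10"
                + "&srsearch=" + URLEncoder.encode(term, "UTF-8");

        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new URL(url).openStream(), StandardCharsets.UTF_8))) {
            StringBuilder json = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                json.append(line);
            }
            // Each hit under query.search has a "title" and a "snippet" you
            // can show as a preview before fetching the full article.
            System.out.println(json);
        }
    }
}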
Hi, I have a project in which I need to use Java to access the Twitter API, and I found Twitter4j easy to use and tried some samples from its site. However, I cannot find details about the query strings accepted by the Query class. Does anyone know of comprehensive info on this?
Cheers.
If by "query string" you mean the value in the query field, that's literally any text you can type into the search box on Twitter's website. There's no list of examples because it's so wide open. Just use whatever you happen to be thinking about at that particular instant in time.
The relevant JavaDoc page is where I would start (select the library version you're using), along with searching for 'Twitter4J query examples' on Google.
Is what you need not covered in this?: http://twitter4j.org/en/code-examples.html
I want to do some development in Java. I'd like to be able to access a website, say for example
www.chipotle.com
On the top right, they have a place where you can enter your zip code and it will give you all of the nearest locations. The program will just have an empty box where the user inputs their zip code, and it will query the actual Chipotle server to retrieve the nearest locations. How do I do that, and how is the data I receive stored?
This will probably be a followup question as to what methods I should use to parse the data.
Thanks!
First you need to know the parameters needed to execute the query and the URL to which these parameters should be submitted (the action attribute of the form). With that, your application will have to make an HTTP request to that URL with your own parameters (possibly only the zip code). Finally, parse the answer.
This can be done with standard Java API classes, but it won't be very robust. A better solution would be HttpClient. Here are some examples.
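As a bare-bones starting point with only the standard library (the endpoint URL and the zip parameter name below are made up for illustration; you would need to inspect the real form's action URL and field names, e.g. with your browser's developer tools):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class StoreLocatorQuery {
    public static void main(String[] args) throws Exception {
        String zip = "60601"; // the zip code the user typed in

        // Hypothetical endpoint and parameter name -- replace with the real
        // form's action URL and input names.
        String url = "http://www.example.com/locator/search"
                + "?zip=" + URLEncoder.encode(zip, "UTF-8");

        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new URL(url).openStream(), StandardCharsets.UTF_8))) {
            StringBuilder response = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                response.append(line).append('\n');
            }
            // The response is just text (HTML, XML or JSON, depending on the
            // site); what you do next is the parsing step discussed below.
            System.out.println(response);
        }
    }
}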
This will probably be a followup question as to what methods I should use to parse the data.
It very much depends on what the website actually returns.
If it returns static HTML, a regular (strict) or permissive HTML parser should be used.
If it returns dynamic HTML (i.e. HTML with embedded Javascript) you may need to use something that evaluates the Javascript as part of the content extraction process.
There may also be a web API designed for programs (like yours) to use. Such an API would typically return the results as XML or JSON so that you don't have to scrape the results out of an HTML document.
Before you go any further you should check the Terms of Service for the site. Do they say anything about what you are proposing to do?
A lot of sites DO NOT WANT people to scrape their content or provide wrappers for their services. For instance, if they get income from ads shown on their site, what you are proposing to do could divert visitors away from their site and cause a loss of potential or actual income.
If you don't respect a website's ToS, you could be on the receiving end of lawyers' letters ... or worse. In addition, they may already be using technical means to make life difficult for people who scrape their service.
I am new to IR techniques.
I am looking for a Java-based API or tool that does the following.
Download the given set of URLs
Extract the tokens
Remove the stop words
Perform Stemming
Create Inverted Index
Calculate the TF-IDF
Kindly let me know how Lucene can be helpful to me.
Regards
Yuvi
You could try the Word Vector Tool - it's been a while since the latest release, but it works fine here. It should be able to perform all of the steps you mention. I've never used the crawler part myself, however.
Actually, TF-IDF is a score given to a term in a document, rather than the whole document.
If you just want the TF-IDF of each term in a document, maybe use this method, without ever touching Lucene.
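To make that concrete, here is a small plain-Java sketch of per-term TF-IDF over an in-memory toy corpus (no Lucene involved; tokenization, stop-word removal and stemming are assumed to have already happened):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TfIdfSketch {
    public static void main(String[] args) {
        // Toy corpus; in practice these would be the tokenized, stopped and
        // stemmed documents produced by the earlier pipeline steps.
        List<String[]> docs = new ArrayList<>();
        docs.add(new String[] {"java", "index", "search", "index"});
        docs.add(new String[] {"java", "lucene", "search"});
        docs.add(new String[] {"python", "crawl"});

        // Document frequency: in how many documents does each term occur?
        Map<String, Integer> df = new HashMap<>();
        for (String[] doc : docs) {
            Set<String> seen = new HashSet<>();
            for (String term : doc) {
                if (seen.add(term)) {
                    df.merge(term, 1, Integer::sum);
                }
            }
        }

        int n = docs.size();
        for (int i = 0; i < n; i++) {
            // Term frequency within this document.
            Map<String, Integer> tf = new HashMap<>();
            for (String term : docs.get(i)) {
                tf.merge(term, 1, Integer::sum);
            }
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                // Classic weighting: tf * log(N / df). Lucene applies its own
                // variant of this idea internally when scoring.
                double tfidf = e.getValue() * Math.log((double) n / df.get(e.getKey()));
                System.out.printf("doc %d  %-8s tf-idf=%.3f%n", i, e.getKey(), tfidf);
            }
        }
    }
}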
If you want to create a search engine, you need to do a bit more (such as extracting text from the given URLs, whose corresponding documents would probably not contain raw text). If this is the case, consider using Solr.