I am currently using MediaWiki's example URL to make HTTP GET requests on Android.
I am simply getting information through a URL like this:
http://en.wikipedia.org/w/api.php?format=xml&action=query&titles=Main%20Page&prop=revisions&rvprop=content
However, with this approach I always need an exact title and only get one result back (titles=some name here).
I know that Wikipedia has more complex search methods, explained here:
http://en.wikipedia.org/wiki/Help:Searching
I would like to offer a few "previews" of multiple Wikipedia articles per search, since what users type might not always be exactly what they want.
Is there any way to query these special "search" results?
Any help would be appreciated.
It looks like the MediaWiki search API may be what you're after. That particular page discusses getting previews of search results.
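For instance, here is a minimal sketch using the API's list=search module (srsearch and srlimit are its standard parameters; the search term is just an example) that fetches several result previews in one request:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class WikiSearch {
    public static void main(String[] args) throws Exception {
        // list=search returns up to srlimit hits, each with a title and snippet
        String term = URLEncoder.encode("main page", "UTF-8");
        URL url = new URL("https://en.wikipedia.org/w/api.php?format=xml"
                + "&action=query&list=search&srlimit=5&srsearch=" + term);
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // raw XML; parse titles/snippets from it
            }
        }
    }
}

On Android you would run this off the main thread and parse the XML (or use format=json) into your list of previews.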
Let me just start by saying that this is a soft question.
I am rather new to application development, which is why I'm asking a question without presenting you with any actual code. I know the basics of Java coding, and I was wondering if anyone could enlighten me on the following topic:
Say I have an external website, Craigslist, or some other site that allows me to search through products/services/results manually by typing a query into a search box somewhere on the page. The trouble is that there is no API for this site for me to use.
However I do know that http://sfbay.craigslist.org/search/sss?query=QUERYHERE&sort=rel points me to a list of results, where QUERYHERE is replaced by what I'm looking for.
What I'm wondering here is: is it possible to store these results in an Array (or List or some form of Collection) in Java?
Is there perhaps some library or external tool that lets me specify a query, insert it into a search URL, perform the search, and fill an Array with the results?
Or is what I am describing impossible without an API?
It depends. If the website can return the result as XML or JSON (usually indicated by .xml or .json at the end of the URL), you can parse it easily with a DOM parser for XML in Java, or download and use a JSON library to parse the JSON.
Otherwise you will receive the HTML page that a user would see in a browser. You can try to parse it as XML, but it will take a lot of work to map all the fields in the HTML to get the list you want.
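As a rough sketch of the HTML route using the jsoup library (the CSS selector below is an assumption about Craigslist's markup, which changes over time; inspect the page source for the real one):

import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CraigslistSearch {
    public static void main(String[] args) throws Exception {
        String query = "bicycle"; // example query term
        Document doc = Jsoup.connect(
                "http://sfbay.craigslist.org/search/sss?sort=rel&query=" + query).get();
        List<String> titles = new ArrayList<>();
        // ".result-title" is a guess at the element wrapping each listing title
        for (Element link : doc.select(".result-title")) {
            titles.add(link.text());
        }
        titles.forEach(System.out::println);
    }
}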
The Wayback Machine offers APIs that let you download information. There are actually multiple APIs, and after searching for a few hours I really can't manage to do the following:
Using the Wayback Machine API, I am trying to get a list of all domains indexed on 06/06/15.
I have read the documentation here
https://archive.org/help/wayback_api.php
but I can't find it...
I expected something like this to work:
http://archive.org/wayback/available?url=*&timestamp=20150606
It is not possible to do what you want (?url=*), by design. You're asking us to go through 36 terabytes of data to fish out a huge list; it's not a query that our query engine supports.
Here's a working example; check it below:
http://archive.org/wayback/available?url=http://sourceforge.net/projects/&timestamp=20131006000000
Make sure you have the correct timestamp value.
These are the lines I used to generate the URLs, in Python:
url = "http://sourceforge.net/projects/" + name.rstrip()
wbm_url = 'http://archive.org/wayback/available?url=' + url + '&timestamp=20131006000000'
Since 2013 there has been a way to get the timestamps you would need in order to fetch a specific archived copy of a website; look at this link:
http://web.archive.org/cdx/search/cdx?url=archive.org&limit=5&showResumeKey=true
Explained here:
https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server#advanced-usage
Then, to confirm this URL works (using Python's requests):
import requests
w = requests.get('http://archive.org/wayback/available?url=archive.org&timestamp=997121112295')
Or you can fetch the HTML directly:
w2 = requests.get('http://web.archive.org/web/20040324162136/http://www.globalgiving.org:80/')
I want to collect domain names (crawling). I have written a simple Java application that reads an HTML page and saves the code in a text file. Now I want to parse this text in order to collect all domain names without duplicates. But I need the domain names without "http://www.": just domainname.topleveldomain, or possibly domainname.subdomain.topleveldomain, or any number of subdomains (then the collected links need to be processed the same way, collecting the links inside them, until I reach a certain number of links, say 100).
I asked about this in a previous post https://stackoverflow.com/questions/11113568/simple-efficient-java-web-crawler-to-extract-hostnames and have searched around. jsoup seems like a good solution, but I have not worked with jsoup before, so before going deeply into it I just want to ask: does it achieve what I want to do? Any other suggestions for achieving my simple crawling in a simple way are welcome.
jsoup is a Java library for working with real-world HTML. It provides
a very convenient API for extracting and manipulating data, using the
best of DOM, CSS, and jquery-like methods
So yes, you can connect to a website, extract its HTML, and parse it with jsoup.
The logic for extracting the top-level domain is "your part"; you will need to write that code yourself.
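For instance, a minimal sketch (the start URL is a placeholder) that collects de-duplicated hostnames from a page's links and strips a leading "www.":

import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HostCollector {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://example.com/").get();
        Set<String> hosts = new LinkedHashSet<>(); // a Set drops duplicates
        for (Element link : doc.select("a[href]")) {
            String href = link.attr("abs:href"); // resolve relative links
            try {
                String host = new URI(href).getHost();
                if (host != null) {
                    hosts.add(host.startsWith("www.") ? host.substring(4) : host);
                }
            } catch (Exception ignored) {
                // skip malformed URLs
            }
        }
        hosts.forEach(System.out::println);
    }
}

To crawl further, feed each collected link back through the same method until you reach your limit of, say, 100.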
Take a look at the docs for more options...
Use selector-syntax to find elements
Use DOM methods to navigate a document
Hi, I have a project in which I need to use Java to access the Twitter API, and I found Twitter4J easy to use and tried some samples from the site. However, I cannot find details about the query strings the Query class accepts. Does anyone know of comprehensive info on this?
Cheers.
If by "query string" you mean the value in the query field, that's literally any text you can type into the search box on Twitter's website. There's no list of examples because it's so wide open. Just use whatever you happen to be thinking about at that particular instant in time.
The related JavaDoc page is where I would start (select the library version you're using), plus searching for 'Twitter4J query examples' in Google.
Is what you need not covered in this?: http://twitter4j.org/en/code-examples.html
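For reference, a minimal Twitter4J search sketch (assuming Twitter4J 4.x with credentials in twitter4j.properties; the query text is just an example):

import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterFactory;

public class SearchExample {
    public static void main(String[] args) throws Exception {
        Twitter twitter = TwitterFactory.getSingleton(); // reads twitter4j.properties
        Query query = new Query("stackoverflow"); // any text you could type into the search box
        query.setCount(10); // number of results per page
        QueryResult result = twitter.search(query);
        for (Status status : result.getTweets()) {
            System.out.println("@" + status.getUser().getScreenName() + ": " + status.getText());
        }
    }
}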
I want to do some development in Java. I'd like to be able to access a website, say for example
www.chipotle.com
On the top right, they have a place where you can enter your zip code and it will give you all of the nearest locations. The program will just have an empty box for the user to enter their zip code, and it will query the actual Chipotle server to retrieve the nearest locations. How do I do that, and how is the data I receive stored?
This will probably be a followup question as to what methods I should use to parse the data.
Thanks!
First you need to know the parameters needed to execute the query and the URL to which these parameters should be submitted (the action attribute of the form). With that, your application will have to make an HTTP request to that URL with your own parameters (possibly only the zip code). Finally, parse the response.
This can be done with standard Java API classes, but it won't be very robust. A better solution would be HttpClient. Here are some examples.
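A rough sketch with Apache HttpClient 4.x (the endpoint and zip parameter here are hypothetical; find the real ones in the form's action attribute and input names):

import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class LocatorQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical URL and parameter name -- inspect the page source for the real ones
        String url = "https://www.chipotle.com/locations/search?zip=94103";
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            String body = EntityUtils.toString(client.execute(new HttpGet(url)).getEntity());
            System.out.println(body); // raw response: HTML, XML, or JSON
        }
    }
}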
This will probably be a followup question as to what methods I should use to parse the data.
It very much depends on what the website actually returns.
If it returns static HTML, a regular (strict) or permissive HTML parser should be used.
If it returns dynamic HTML (i.e. HTML with embedded JavaScript), you may need to use something that evaluates the JavaScript as part of the content extraction process.
There may also be a web API designed for programs (like yours) to use. Such an API would typically return the results as XML or JSON so that you don't have to scrape the results out of an HTML document.
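If such an API exists and returns JSON, parsing it is straightforward; here is a sketch with the org.json library, using a made-up response shape:

import org.json.JSONArray;
import org.json.JSONObject;

public class ParseLocations {
    public static void main(String[] args) {
        // Hypothetical JSON shape; a real API would document its own fields
        String body = "{\"locations\":[{\"name\":\"Market St\",\"zip\":\"94103\"}]}";
        JSONArray locations = new JSONObject(body).getJSONArray("locations");
        for (int i = 0; i < locations.length(); i++) {
            JSONObject loc = locations.getJSONObject(i);
            System.out.println(loc.getString("name") + " (" + loc.getString("zip") + ")");
        }
    }
}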
Before you go any further you should check the Terms of Service for the site. Do they say anything about what you are proposing to do?
A lot of sites DO NOT WANT people to scrape their content or provide wrappers for their services. For instance, if they get income from ads shown on their site, what you are proposing to do could result in a diversion of visitors to their site and a resulting loss of potential or actual income.
If you don't respect a website's ToS, you could be on the receiving end of lawyers' letters ... or worse. In addition, they could already be using technical measures to make life difficult for people who scrape their service.