In the project I am working on, I need to access information from explainxkcd.com, a website that explains specific xkcd comics. The information I am looking for is the explanation of a comic as a string. Explainxkcd runs on MediaWiki, the software that provides the "wiki" look and feel. MediaWiki has an API that allows you to extract information from sites that run it, and I have gone to http://www.mediawiki.org/wiki/API:Main_page to figure out how to use the API for this particular wiki, but to no avail. It seems that you can replace "index.php" in a URL with "api.php" to use the API, but when I try this for http://explainxkcd.com/9/api.php it doesn't seem to work. I guess my URL is wrong, but I don't see any information on how to find the specific URL to use for explainxkcd.com.
QUESTION:
How can I access information from a third-party wiki page in a Java program? This can be through the MediaWiki API or some other solution. If you know a good way to find the URL that can be used with MediaWiki, that would be preferred. I am just looking for a nudge in the right direction here.
Thanks
Using the same method, s/index.php/api.php/, I get a different result: http://www.explainxkcd.com/wiki/api.php, which seems to work. If a wiki is using pretty URLs (e.g. example.com/wiki/Main_Page), just click on edit, view source or history to get a URL that contains index.php.
Yes, please use the API instead of screen-scraping. You can see a few existing Java libraries for that here.
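The substitution described above can be sketched in Java as follows. This is a minimal sketch, not a full client: the class and method names are made up for illustration, the page title "Main Page" is only an example, and it assumes the wiki's MediaWiki version supports action=parse with prop=wikitext. It only builds the request URL; fetching and JSON parsing are left to whichever HTTP and JSON libraries you choose. URLEncoder.encode with a Charset argument needs Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative only.
class MediaWikiUrls {

    /** Applies the index.php -> api.php substitution to a URL that contains index.php. */
    static String apiEndpointFrom(String indexPhpUrl) {
        int i = indexPhpUrl.indexOf("index.php");
        if (i < 0) {
            throw new IllegalArgumentException("No index.php in: " + indexPhpUrl);
        }
        // Drop index.php and everything after it (query string included).
        return indexPhpUrl.substring(0, i) + "api.php";
    }

    /** Builds an action=parse request that asks for a page's wikitext as JSON. */
    static String parseUrl(String apiEndpoint, String pageTitle) {
        return apiEndpoint + "?action=parse&prop=wikitext&format=json&page="
                + URLEncoder.encode(pageTitle, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A URL of the kind you get after clicking "history" on a pretty-URL wiki.
        String api = apiEndpointFrom(
                "http://www.explainxkcd.com/wiki/index.php?title=Main_Page&action=history");
        System.out.println(parseUrl(api, "Main Page"));
    }
}
```

Opening the printed URL in a browser is a quick way to check whether the endpoint is right before writing any fetching code.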
It may sound like a naive issue but I can't find a perfect solution for this.
My solution:
Copy every single URL.
Paste it in JMeter.
Run JMeter after every build.
Surely, there must be a better solution.
I am not sure if this fits your question. When I test my website and search for broken links, I use the "xenulink" tool (Xenu's Link Sleuth). It is a bit old but still works fine for me.
Cheers
JMeter comes with an HTML Link Parser, which can be used to check for "dead" links automatically.
Spidering Example
Consider a simple example: let's say you wanted JMeter to "spider" through your site, hitting link after link parsed from the HTML returned from your server (this is not actually the most useful thing to do, but it serves as a good example). You would create a Simple Controller, and add the "HTML Link Parser" to it. Then, create an HTTP Request, and set the domain to ".*", and the path likewise. This will cause your test sample to match with any link found on the returned pages. If you wanted to restrict the spidering to a particular domain, then change the domain value to the one you want. Then, only links to that domain will be followed.
You may also find the article "How to Spider a Site with JMeter - A Tutorial" useful.
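Outside JMeter, the same spidering idea (parse the links out of a returned page, then request each one) can be sketched in plain Java. This is not how JMeter's HTML Link Parser is implemented; it is an illustrative alternative using only the JDK. The class name, the regex-based href matching, and the 5-second timeouts are assumptions made for the sketch, and a real HTML parser would be more robust than a regex.

```java
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class LinkChecker {
    // Rough href matcher for quoted attribute values.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    /**
     * Returns the absolute http(s) links found in html, resolved against baseUrl.
     * Assumes baseUrl is the full URL of the page, including a path.
     */
    static List<String> extractLinks(String baseUrl, String html) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(1).trim();
            int hash = href.indexOf('#');
            if (hash >= 0) {
                href = href.substring(0, hash); // drop the fragment
            }
            if (href.isEmpty()) {
                continue; // pure in-page anchor
            }
            try {
                URI resolved = base.resolve(href);
                String scheme = resolved.getScheme();
                if ("http".equals(scheme) || "https".equals(scheme)) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // Skip hrefs that are not valid URIs (for example, ones containing spaces).
            }
        }
        return links;
    }

    /** Sends a HEAD request; treats a 4xx/5xx status or any failure as a dead link. */
    static boolean isDead(String url) {
        try {
            HttpURLConnection c = (HttpURLConnection) URI.create(url).toURL().openConnection();
            c.setRequestMethod("HEAD");
            c.setConnectTimeout(5000);
            c.setReadTimeout(5000);
            return c.getResponseCode() >= 400;
        } catch (Exception e) {
            return true;
        }
    }
}
```

To restrict the check to one domain, as the JMeter example does with the domain field, you could filter the extracted links by host before calling isDead. Some servers reject HEAD requests, so a GET fallback may be needed in practice.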
I was asked this question during an interview and was asked to implement it. The question is as follows:
Your application
Will take the username and password for the LinkedIn profile,
On the page www.linkedin.com, use them to log in.
Simulate clicking the Profile->Edit profile menu.
Scrape that user's page in the format below and dump it to a text file. (Hint: you can use the Beautiful Soup library.)
On fetching this URL, you need to read the following information and put it in a CSV/Excel file.
Can somebody give me an idea of how to do it? It should be done using Java only.
I'd use web browser automation software like Selenium (http://www.seleniumhq.org/), which looks like it would solve this problem. You can choose any of its bindings (Java, C#, Ruby, Python, JavaScript) to implement the solution.
Take a look at this tutorial: https://www.airpair.com/selenium/posts/selenium-tutorial-with-java
This looks like a web crawler problem, and the jsoup library handles that very well.
Read up on an implementation that uses jsoup; you can then filter for the link that looks something like
https://www.linkedin.com/profile/edit?trk=nav_responsive_sub_nav_edit_profile
Here you can see the keyword edit_profile, which can be used to filter for the results we require.
A link you should follow to explore more about jsoup:
Webcrawler using JSOUP
I am attempting to use the Java/JSON examples located at this web page (which describes the introduction of JSON as a native type to DynamoDB). And while I understand the examples presented on the page, there is no place on the page showing how to go about defining the "people" table itself in Java.
I did find this link discussing this area. However, it appears to have been asked and answered PRIOR to the article above introducing the "official" version of the new API. And even reviewing the article didn't give me enough clues to figure it out myself.
BTW, I am NOT able to use the "AWS Toolkit for Eclipse", as I must use the results of this in IntelliJ with the Scala plugin, using SBT.
Any guidance on this would be greatly appreciated.
I don't want just the source code. I want the rendered page. This is an important distinction that I apparently cannot make by simply searching Google.
Does anyone know how I can get the rendered page from a URL?
This needs to be done in Java, hopefully without an extra library.
Another solution would be to use HtmlUnit, which is a "GUI-less browser for Java". It is recommended by Google for generating snapshots of AJAX-based web pages to make them crawlable.
You can try using a library that wraps a web browser, for example Berkelium. If you need it in Java, a Google search produced this Java wrapper API for Berkelium (I haven't tried it personally).
sites.google has an example of its use:
I am thinking of building the following application in Java for my college project. I know core Java. I want to know what I should read "specifically" for this project, as there is little time:
It will have an interface for entering your query. This string would be sent as a query to internet search engines, and with the help of the search engine the application would find the data (the first web page that we see; that is the data for my application for now. :) ).
I do not want to display the data. I just want the HTML file or the source code of the generated web page. Does this sound like the Common Gateway Interface? I do not know about this.
But I think it serves the same purpose. If it does, please guide me on how to implement this.
Whatever it is, please specify.
Problem 1: What should I read? I am not looking for direct help at this point. I want to implement it myself.
Problem 2: Does connecting to the internet require some JNLP knowledge too?
For example, when we search for something on Google, it shows us links to websites. I can see the source code of this generated web page. I just want this page for my application to work on.
EDIT:
I do not want to rely only on Google or any particular web server. I want my application to decide that.
Please also refer to my Problem 2.
Since I discovered that websites have Terms and Conditions, should I try to make my own crawler? Would my application then not be breaking the rules? This is important to me.
Ashish,
Here is what I would recommend.
Learn the basics of JSON from these links (Introduction, lib download).
Then look at the Google Web Search JSON API here.
Learn how to GET data from servers using the HttpClient library here.
Now what you have to do is fire a GET request for the search, read the JSON response, and parse the response using the JSON library from the first step. Then you have the search results.
Most search engines (Bing etc.) offer JSON/REST APIs, so you can do the same for other search engines.
Note: JSON APIs are normally used from JavaScript on the UI side, but since they are very easy and quick to learn, I suggested that to you. You can also explore (if time permits) the XML-based APIs.
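The GET-then-read-JSON flow above might look like the sketch below. To stay dependency-free it uses the JDK's built-in java.net.http.HttpClient (Java 11 or later) rather than a third-party HttpClient library, which is a substitution on my part. The endpoint URL, the "q" parameter name, and the class and method names are placeholders, not a real search API; take the actual URL and parameters from the documentation of whichever search API you use. Parsing the returned JSON string is left to the JSON library of your choice.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative only.
class SearchClient {
    // Placeholder endpoint, NOT a real service: replace with the documented API URL.
    static final String ENDPOINT = "https://search.example.com/api";

    /** Builds a GET request for the query, asking for a JSON response. */
    static HttpRequest buildSearchRequest(String query) {
        String q = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(ENDPOINT + "?q=" + q))
                .header("Accept", "application/json")
                .GET()
                .build();
    }

    /** Sends the request and returns the raw JSON body as a string. */
    static String fetchJson(String query) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildSearchRequest(query), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status: " + response.statusCode());
        }
        return response.body();
    }
}
```

Keeping request construction separate from sending makes it easy to point the same code at a different search engine by changing only the endpoint and parameter names.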
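import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Prints the raw HTML of a page line by line (place inside a method that throws IOException).
URL url = new URL("http://fooooo.com");
try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
    String inputLine;
    while ((inputLine = in.readLine()) != null) {
        System.out.println(inputLine);
    }
}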
That should be enough to get you started.
And yes, do check that you are not violating the usage terms of a website. Search engines don't really like you trying to access them via a program.
Many, including Google, have APIs specifically designed for this purpose.
You can do everything you want using HtmlUnit. It's like a web browser, but for Java. Check some examples on their website.
Read "Working with URLs" in the Java tutorial to get an idea of what is behind the available libraries like HtmlUnit, HttpClient, etc.
I do not want to display the data. I just want the HTML file or the source code of the generated web page.
You probably don't need the HTML either. Google provides its search results as a web service using this API. The same goes for other search engines; GIYF. You get the search results as XML, which is far easier for you to parse. Plus, the XML won't have any unwanted data like ads.