I would like to read the content of a tweet from Java, by using the public link of that tweet, e.g.:
http://twitter.com/FoodRhythms/status/461201880354271232/photo/1
I am using the same procedure used for reading content from other types of pages:
String XMLstring = "";
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null)
XMLstring += inputLine;
in.close();
However, while with other pages this works, when reading from Twitter links the returned content is empty (the BufferedReader object does not contain any line).
Any hint on this?
You are approaching the problem in an unnecessarily complicated way. You could use an HTML parser library, such as Jsoup, to fetch the URL and parse its content.
An example would be as follows:
String url = "https://twitter.com/FoodRhythms/status/461201880354271232/photo/1";
Document doc = Jsoup.connect(url).get();
Element tweetText = doc.select("p.js-tweet-text.tweet-text").first();
System.out.println(tweetText.text());
which would output
Miso-Glazed Japanese Eggplant
In a similar fashion, you can select any element you want!
Reading from Twitter requires OAuth authentication; you will need to use a Twitter library for Java to get the required data.
Hope this helps!
http://www.javacodegeeks.com/2012/03/twitter-api-on-your-java-application.html
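The problem with your code is that you are accessing Twitter via http, and the server answers with a redirect to https. HttpURLConnection does not follow redirects that switch protocol, so you end up reading the redirect response instead of the page. Using your exact code, change the URL to https and all should work fine.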
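A minimal sketch of that fix, assuming a hypothetical helper named `toHttps` (not part of the original code) and assuming the page is encoded as UTF-8:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class HttpsFetch {

    // Hypothetical helper: rewrites a plain-http URL to https and leaves
    // every other URL unchanged.
    static String toHttps(String url) {
        return url.startsWith("http://")
                ? "https://" + url.substring("http://".length())
                : url;
    }

    // Same reading loop as in the question, with a StringBuilder and
    // try-with-resources so the reader is always closed.
    // UTF-8 is an assumption about the page encoding.
    static String fetch(String url) throws IOException {
        StringBuilder content = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new URL(toHttps(url)).openStream(), StandardCharsets.UTF_8))) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                content.append(inputLine).append('\n');
            }
        }
        return content.toString();
    }
}
```

Calling `HttpsFetch.fetch(...)` with the tweet link from the question would then request the https address directly instead of relying on the redirect.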
I have written a Java applet to read in the HTML from one of our intranet systems.
My code is as follows:
public static String getOrdersInProvisioning(){
try{
URL url = new URL("https://www.internalsystem.net/system//src/order/OrderProvList.cfm");
URLConnection connection = url.openConnection();
BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
String inputLine;
String result;
StringBuilder a = new StringBuilder();
while ((inputLine = reader.readLine()) != null) {
a.append(inputLine);
a.append("\r\n");
}
reader.close();
result = a.toString();
return result;
}catch (Exception e){
return e.toString();
}
}
The idea is that I can read in the HTML source code (the same code I see when I log into the system, right-click on the page and select "View Page Source") and use the resulting string to extract things like order numbers, due dates, etc. for my applet.
I can successfully do this for SOME pages on this intranet system (the URL changes as you move from page to page) but not on others. I must be logged in to the system as a valid user for it to work.
On pages where this fails, the resulting HTML code seems to indicate my applet was redirected to some kind of log in page by JavaScript:
<SCRIPT LANGUAGE="JavaScript">
self.location='/system//src/Login.cfm?redirect=1';
</SCRIPT>
I have double-checked that I am logged into the system and that my applet is running under the correct user account. But for some reason it will only work for particular pages. Looking at the HTML source for the pages where this fails, it looks like there is a particular piece of JavaScript that I'm guessing is the cause of this redirect.
My question is, is there a way to avoid this redirection - or is this Javascript there to prevent exactly what I am trying to do?
I have tried using Jsoup with followRedirects(false) and httpConn.addRequestProperty(...) options but all to no avail.
This could depend on how security is implemented on the server. Sometimes the server checks where the request came from, so you could try setting the referrer or other appropriate headers. If you are logged in properly, it might work. For example:
String url = "https://www.internalsystem.net/system//src/order/OrderProvList.cfm";
doc = Jsoup.connect(url).referrer(url).get();
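The same idea can be tried with a plain `URLConnection`. The sketch below is only an illustration: it assumes the login state is carried in cookies copied from the browser session, and `cookieHeader` and `open` are hypothetical helpers, not part of any library.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Map;
import java.util.StringJoiner;

class HeaderSketch {

    // Hypothetical helper: joins cookie name/value pairs into the value of
    // a Cookie request header, e.g. "a=1; b=2".
    static String cookieHeader(Map<String, String> cookies) {
        StringJoiner joined = new StringJoiner("; ");
        for (Map.Entry<String, String> cookie : cookies.entrySet()) {
            joined.add(cookie.getKey() + "=" + cookie.getValue());
        }
        return joined.toString();
    }

    // Prepares a connection with Referer and Cookie headers set. Nothing is
    // sent until the caller reads from the returned connection.
    static URLConnection open(String url, Map<String, String> cookies) throws IOException {
        URLConnection connection = new URL(url).openConnection();
        connection.setRequestProperty("Referer", url);
        if (!cookies.isEmpty()) {
            connection.setRequestProperty("Cookie", cookieHeader(cookies));
        }
        return connection;
    }
}
```

Which cookies (if any) the server expects is an assumption here; the names and values would have to be taken from a logged-in browser session.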
I am doing an assignment for one of my classes.
I am supposed to write a web crawler that downloads files and images from a website up to a specified crawl depth.
I am allowed to use a third-party parsing API, so I am using Jsoup. I've also tried htmlparser. Both are nice libraries, but they are not perfect.
I used the default Java URLConnection to check the content type before processing each URL, but it becomes really slow as the number of links grows.
Question: Does anyone know of a specialized parser API for images and links?
I could start writing my own using Jsoup, but I am being lazy. Besides, why reinvent the wheel if there could be a working solution out there? Any help would be appreciated.
I need to check the content type while looping through the links, to see in an efficient way whether a link points to a file, but Jsoup does not have what I need. Here's what I have:
HttpConnection mimeConn = null;
Response mimeResponse = null;
for (Element link : links) {
    String linkurl = link.absUrl("href");
    if (!linkurl.contains("#")) {
        // Skip URLs that have already been crawled.
        if (DownloadRepository.curlExists(linkurl)) {
            continue;
        }
        mimeConn = (HttpConnection) Jsoup.connect(linkurl);
        mimeConn.ignoreContentType(true);
        mimeConn.ignoreHttpErrors(true);
        mimeResponse = (Response) mimeConn.execute();
        WebUrl webUrl = new WebUrl(linkurl, currentDepth + 1);
        // Sort the link into pages, images, or other files by content type.
        String contentType = mimeResponse.contentType();
        if (contentType.contains("html")) {
            page.addToCrawledPages(new WebPage(webUrl));
        } else if (contentType.contains("image")) {
            page.addToImages(new WebImage(webUrl));
        } else {
            page.addToFiles(new WebFile(webUrl));
        }
        DownloadRepository.addCrawledURL(linkurl);
    }
}
UPDATE
Based on Yoshi's answer, I was able to get my code to work right. Here's the link:
https://github.com/unekwu/cs_nemesis/blob/master/crawler/crawler/src/cu/cs/cpsc215/project1/parser/Parser.java
Use Jsoup; I think its API is good enough for your purpose. You can also find a good cookbook on its site.
Several steps:
Jsoup: how to get an image's absolute url?
how to download image from any web page in java
You can write your own recursive method that walks through the links on a page that contain the necessary domain name or are relative links. Use it to collect all links and find all images on those pages. Writing it yourself is not bad practice.
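A sketch of such a link walk with a depth limit is shown here. It is an iterative, queue-based variant rather than literal recursion, and it uses only the JDK. The `linksOf` parameter is a hypothetical stand-in for "fetch the page and return its absolute links", for example with Jsoup's `a[href]` selector and `absUrl("href")`.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

class CrawlSketch {

    // Visits pages breadth-first up to maxDepth, following only links that
    // contain the given domain, and returns every URL that was reached.
    static Set<String> crawl(String startUrl, String domain, int maxDepth,
                             Function<String, List<String>> linksOf) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        Deque<Integer> depths = new ArrayDeque<>();
        visited.add(startUrl);
        queue.add(startUrl);
        depths.add(0);
        while (!queue.isEmpty()) {
            String url = queue.remove();
            int depth = depths.remove();
            if (depth == maxDepth) {
                continue; // do not expand pages at the depth limit
            }
            for (String link : linksOf.apply(url)) {
                // visited.add returns false for URLs seen before.
                if (link.contains(domain) && visited.add(link)) {
                    queue.add(link);
                    depths.add(depth + 1);
                }
            }
        }
        return visited;
    }
}
```

Passing the link extraction in as a function keeps the traversal independent of the HTML library, so the visited-set and depth logic can be checked without any network access.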
You don't need to use the URLConnection class; Jsoup has a wrapper for it.
For example, a single line of code is enough to get the DOM object:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Instead of this code:
URL oracle = new URL("http://www.oracle.com/");
URLConnection yc = oracle.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(
yc.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null)
System.out.println(inputLine);
in.close();
Update 1
Try adding the following lines to your code:
Connection.Response res = Jsoup.connect("http://en.wikipedia.org/").execute();
String pageContentType = res.contentType();
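If fetching every link in full is what makes the check slow, one option is to ask only for the headers. The following is a sketch under that assumption, using the JDK's `HttpURLConnection` with a HEAD request; not every server answers HEAD correctly, and `ContentTypeSketch`, `Kind`, `classify`, and `probe` are illustrative names.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Locale;

class ContentTypeSketch {

    enum Kind { PAGE, IMAGE, FILE }

    // Maps a Content-Type header value to the three buckets used in the
    // question; a missing header is treated as a generic file.
    static Kind classify(String contentType) {
        if (contentType == null) {
            return Kind.FILE;
        }
        String type = contentType.toLowerCase(Locale.ROOT);
        if (type.contains("html")) {
            return Kind.PAGE;
        }
        if (type.startsWith("image/")) {
            return Kind.IMAGE;
        }
        return Kind.FILE;
    }

    // Sends a HEAD request so that the server returns headers only, which
    // avoids downloading the body just to learn the content type.
    static Kind probe(String url) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
        connection.setRequestMethod("HEAD");
        try {
            return classify(connection.getContentType());
        } finally {
            connection.disconnect();
        }
    }
}
```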
I've been trying to get information from a webpage, specifically this site: http://www.ncbi.nlm.nih.gov/pubmed?term=%22pulmonary%20disease%2C%20chronic%20obstructive%22%5BMesh%5D (among other similar ones). I'm using the URL and URLConnection classes to do so. I'm trying to get a certain number from the webpage: on this page, I want the total number of articles (16428).
It says this near the top of the page: "Results: 1 to 20 of 16428" and when I look at the page source manually I can find this. However, when I try to use the java connection to obtain this number from the page source, for some reason the number it gets is "863399" instead of "16428".
Code:
URL connection = new URL("http://www.ncbi.nlm.nih.gov/pubmed?term=%22pulmonary%20disease%2C%20chronic%20obstructive%22%5BMesh%5D");
URLConnection yc = connection.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
String html = "";
String inputLine;
while ((inputLine = in.readLine()) != null) html += inputLine;
in.close();
int startMarker = html.indexOf("ncbi_resultcount");
int endMarker = html.indexOf("ncbi_op");
System.out.println(html.substring(startMarker, endMarker));
When I run this code, I get:
ncbi_resultcount" content="863399" />
rather than:
ncbi_resultcount" content="16428" />
Does anyone know why this is / how I can fix it?
Thanks!
I can't reproduce your problem and I have no idea why this is happening. Perhaps the server is sniffing specific Java user agent versions. You'd then need to set the User-Agent header to something else to pretend to be a "real" web browser.
yc.setRequestProperty("User-Agent", "Mozilla");
Unrelated to the concrete problem, I'd suggest using a real HTML parser for this job, such as Jsoup. It's then as easy as:
Document document = Jsoup.connect("http://www.ncbi.nlm.nih.gov/pubmed?term=%22pulmonary%20disease%2C%20chronic%20obstructive%22%5BMesh%5D").get();
Element nbci_resultcount = document.select("meta[name=ncbi_resultcount]").first();
System.out.println(nbci_resultcount.attr("content")); // 16433
I have a problem once again where I can't find the source code because it's hidden or something. When my Java program indexes the page it finds everything but the info I need. I assume it's hidden for a reason, but is there any way around this?
It's just a bunch of tr/td tags that show up in Firebug but don't show up when viewing the page source or when I do the following:
URL url = new URL("my url");
URLConnection yc = url.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
I really have no idea how to get the info that I need.
The reason for this behavior is probably that those tags are dynamically injected into the DOM using JavaScript and are not part of the initial HTML, which is all you can fetch with a URLConnection. They might even be created using AJAX. You will need a JavaScript interpreter on your server if you want to fetch those.
If they don't show up in the page source, they're likely being added dynamically by JavaScript code. There's no way to get them from your server-side script short of including a JavaScript interpreter, which is rather high-overhead.
The information in the tags is presumably coming from somewhere, though. Why not track that down and grab it straight from there?
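As a sketch of grabbing the data straight from its source: once the browser's network panel shows which URL the page's JavaScript requests, that URL can be fetched directly. The example assumes Java 11 or later for `java.net.http`; the endpoint is whatever you find in the network panel, and the headers shown are guesses at what such a server might expect.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class DataEndpointSketch {

    // Builds a GET request for the endpoint that the page's JavaScript calls.
    // The X-Requested-With header mimics what many AJAX libraries send;
    // whether the server needs it is an assumption.
    static HttpRequest buildRequest(String endpoint) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .header("Accept", "application/json, text/html")
                .header("X-Requested-With", "XMLHttpRequest")
                .GET()
                .build();
    }

    // Sends the request and returns the raw response body as a string.
    static String fetch(String endpoint) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(endpoint), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

The response is often JSON or an HTML fragment, which is usually easier to parse than the fully rendered page.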
Try using Jsoup.
Document doc = Jsoup.parse(new URL("my url"), 10000);
System.out.print(doc.toString());
Assuming that the issue is that the "missing" content is being injected using javascript, the following SO Question is pertinent:
What's a good tool to screen-scrape with Javascript support?
My application uses XForms for the view, and XForms generates output XML containing the answers given by the user. If we include the following line
<fr:xforms-inspector xmlns:fr="http://orbeon.org/oxf/xml/form-runner"/>
in the code, we can see the generated output on the screen. So if the user types amit as the username, it also appears in the generated XML.
I want to get this generated XML in my Java class so I can save it in a database, parse it, and split its contents. I have tried the following code to get that XML, but I am not able to get the generated XML.
BufferedReader requestData = new BufferedReader(new InputStreamReader(request.getInputStream()));
StringBuffer stringBuffer = new StringBuffer();
String line;
try{
while ((line = requestData.readLine()) != null) {
stringBuffer.append(line);
}
} catch (Exception e){}
return stringBuffer.toString();
}
Please let me know what I am doing wrong.
Assuming that you'd like to have Java code inside a servlet or JSP that receives XML posted to the servlet or JSP through an XForms submission, then I would recommend you parse it using an XML parser rather than doing this by hand. Doing this with Dom4j is quite simple; for instance to get the content of the root element (assuming that all you receive is an element with some text in it):
SAXReader xmlReader = new SAXReader();
Document queryDocument = xmlReader.read(request.getInputStream());
String query = queryDocument.getRootElement().getStringValue();
And for reference, see the full source of an example this is taken from.
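If adding Dom4j is not an option, roughly the same thing can be sketched with the DOM parser that ships with the JDK. This swaps in `javax.xml.parsers` for Dom4j; `RootTextSketch` and `rootText` are illustrative names, and the sample document in `main` is made up.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

class RootTextSketch {

    // Parses the XML from the stream and returns the text content of the
    // root element. Parsing problems are rethrown as unchecked exceptions.
    static String rootText(InputStream xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document document = builder.parse(xml);
            return document.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse posted XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up stand-in for the body of the XForms submission.
        InputStream posted = new ByteArrayInputStream(
                "<username>amit</username>".getBytes(StandardCharsets.UTF_8));
        System.out.println(rootText(posted)); // prints "amit"
    }
}
```

In a servlet, `request.getInputStream()` would take the place of the in-memory stream.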