Hey, I am working on a project that retrieves data from tweets on Twitter. I am collecting tweets for certain kinds of events; some of the tweets contain links, some expanded and some shortened. I want to save the link from each tweet to my MySQL database. I have found code for expanding URLs; can someone please tell me whether this will work for every shortened URL?
for (URLEntity urle : status.getURLEntities()) {
    System.out.println(urle.getDisplayURL());
    System.out.println(urle.getExpandedURL());
}
With that code you will print the URL twice. From the Javadoc:
getDisplayURL
Returns: the display URL if mentioned URL is shorten, or null if no shorten URL was mentioned.
So, for every URLEntity, you will need to print the expanded URL if the URL is shortened:
for (URLEntity urle : status.getURLEntities()) {
    // getExpandedURL() returns a String, so test for null explicitly
    if (urle.getExpandedURL() != null) {
        System.out.println(urle.getExpandedURL());
    } else {
        System.out.println(urle.getDisplayURL());
    }
}
Or, in your case, save them to a database.
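If it helps, here is a minimal sketch of the saving step using plain JDBC. The connection details, the tweet_links table, and its url column are all hypothetical; adjust them to your own schema:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import twitter4j.Status;
import twitter4j.URLEntity;

public class LinkSaver {
    // Hypothetical connection details; replace with your own
    private static final String DB_URL = "jdbc:mysql://localhost:3306/tweets";

    public static void saveLinks(Status status) throws Exception {
        try (Connection con = DriverManager.getConnection(DB_URL, "user", "password");
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO tweet_links (url) VALUES (?)")) {
            for (URLEntity urle : status.getURLEntities()) {
                // Prefer the expanded URL; fall back to the display URL
                String link = urle.getExpandedURL() != null
                        ? urle.getExpandedURL()
                        : urle.getDisplayURL();
                ps.setString(1, link);
                ps.executeUpdate();
            }
        }
    }
}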
As per the HTML, the source code is:
{result.data}
When I request the URL, result.data is set to 100 and I can see the value 100 in the browser. However, when I execute a Java program with the same URL request, I cannot see the value that I saw in the browser.
URL url = new URL(site);
url.openConnection();
// etc.
I want to get, through a Java program, the same content that I see in the browser.
Your question is not very descriptive, but I guess you are trying to scrape data from the site.
You can use the following libraries for this task:
Jaunt (http://jaunt-api.com)
Jsoup (http://jsoup.org/cookbook/extracting-data/dom-navigation)
HTMLUnit
From what I understand, you want to do one of the following things:
1. Instead of reading the result line by line, you want to parse it as HTML so as to traverse the div(s) and other HTML tags. For this purpose I would suggest the jsoup library; see the sketch after this list.
2. When you hit the URL www.abcd.com/number=500 in a browser, it loads an empty div and fetches the data on load; you want to fetch that data using Java. For this, there must be some JS in the resulting page that fetches the data by hitting some service on page load. You will need to look in the page for the service details and, instead of hitting this URL (www.abcd.com/number=500), hit that service to get the data.
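For the first case, a minimal jsoup sketch (the URL is the one from your question; the div id "result" is a hypothetical placeholder for whatever the page actually uses):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DivScraper {
    public static void main(String[] args) throws Exception {
        // Parse the whole response as a DOM instead of reading it line by line
        Document doc = Jsoup.connect("http://www.abcd.com/number=500").get();
        // Select the div that should hold the value; adjust the selector
        Element div = doc.select("div#result").first();
        if (div != null) {
            System.out.println(div.text());
        }
    }
}

Note that this only helps in the first case; if the value is filled in by JavaScript after the page loads (the second case), jsoup will never see it, because jsoup does not execute scripts.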
From what I've found, Twitter4J seems to be the most prominent tool when it comes to Java and Twitter. I went through the code examples and the Javadoc, but I couldn't find a way to do this.
What I want to do is extract the tweet (the content of the tweet) using its URL. I tried using Jsoup and a CSS selector, but when the tweet is part of a conversation it pulls all the tweets in it. How can I do it using Twitter4J?
Input tweet URL -> Output the content of the tweet
Using jsoup is easier; you can do this:
Document doc = Jsoup.connect("the tweet url here").timeout(10*1000).get();
Element tweet = doc.select(".js-tweet-text-container").first();
Now you can use the tweet object to parse the information.
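If you want to stay with Twitter4J instead, one approach is to parse the numeric status ID out of the tweet URL (it is the last path segment, after /status/) and look the tweet up directly. A sketch, assuming your OAuth credentials are already configured in twitter4j.properties:

import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterFactory;

public class TweetFetcher {
    public static String fetchText(String tweetUrl) throws Exception {
        // e.g. https://twitter.com/user/status/1234567890 -> 1234567890
        long statusId = Long.parseLong(
                tweetUrl.substring(tweetUrl.lastIndexOf('/') + 1));
        Twitter twitter = TwitterFactory.getSingleton();
        Status status = twitter.showStatus(statusId);
        return status.getText();
    }
}

Unlike the scraping approach, showStatus returns exactly one tweet, so conversations are not a problem.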
I am trying to sort a Google Spreadsheet with the Java API but unfortunately it doesn't seem to work. The code I am using is really simple as shown in the API reference.
URL listFeedUrl = new URI(worksheet.getListFeedUrl().toString() + "?orderby=columnname").toURL();
However, this does not work. The feed returned is not sorted at all. Am I missing something? FYI the column I am trying to sort contains email addresses.
EDIT: I just realized that the problem only happens with the old version of Google Spreadsheet.
Maybe this is what happens: the query is performed against the spreadsheet XML, and the XML tags are in lower case. For example, the title of the column in my spreadsheet is "Nombre", but the XML tag is <gsx:nombre>, so [?orderby=Nombre] does not work; use [?orderby=nombre] with a lowercase "n" instead.
The correct query for this is:
URL listFeedUrl = new URI(worksheet.getListFeedUrl().toString() + "?orderby=nombre").toURL();
I am trying to scrape the results of a keyword search on Yahoo Answers, in my case "alcohol addiction". I am using Jsoup and URL modification to go through the pages of the search results. However, I am noticing that even though I put in the URL for 'Newest' results, it keeps showing 'Relevance' results, and what's worse, the results are not exactly the same as what's shown in the browser.
For instance, the URL for Newest results is:
http://answers.yahoo.com/search/search_result?p=alcohol+addiction&s=1&sort=new
And for relevant results, the URL is:
http://answers.yahoo.com/search/search_result?p=alcohol+addiction&s=1&sort=rel
And the "1" will change to 2, 3, 4, etc as you go to the next page (there are 10 results per page).
Here's what I do to scrape the page:
String urlID = "";
String end = "&sort=new";
String glob = "http://answers.yahoo.com/search/search_result?p=alcohol+addiction&s=";
Integer forumID = 0;
while (nextPageIsThere) {
    forumID++;
    System.out.println("Now extracting the page: " + forumID);
    try {
        urlID = glob + forumID + end;
        System.out.println(urlID);
        exdoc = Jsoup.connect(urlID).get();
        java.util.Date date = new java.util.Date();
    } catch (IOException e) {
        e.printStackTrace();
    }
    ...
What's even more confusing is that even when I increase the page number, so that the system output shows the URL changing to:
http://answers.yahoo.com/search/search_result?p=alcohol+addiction&s=2&sort=new
and
http://answers.yahoo.com/search/search_result?p=alcohol+addiction&s=3&sort=new
it still scrapes the same page as page 1 over and over again. I know my code is not wrong; I've been debugging it for hours. I think it has something to do with Jsoup.connect, and/or with Yahoo Answers possibly blocking bots. At the same time, I don't think that's really it.
Does anyone know why this might be happening?
Jsoup works with static HTML only; it can't parse dynamic pages like this one, where content is downloaded after the page loads via an Ajax request or JavaScript modification.
Try reading this page with HtmlUnit; this parser has support for JS pages.
It has fairly good JavaScript support (which is constantly improving) and is able to work even with quite complex AJAX libraries, simulating either Firefox or Internet Explorer depending on the configuration you want to use.
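A minimal HtmlUnit sketch of that idea, assuming a reasonably recent HtmlUnit version (where WebClient is AutoCloseable):

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class JsPageFetcher {
    public static void main(String[] args) throws Exception {
        try (WebClient client = new WebClient()) {
            // Real-world pages often have scripts that throw; don't abort on them
            client.getOptions().setThrowExceptionOnScriptError(false);
            HtmlPage page = client.getPage(
                    "http://answers.yahoo.com/search/search_result?p=alcohol+addiction&s=2&sort=new");
            // Give the page's JavaScript up to 5 seconds to finish loading content
            client.waitForBackgroundJavaScript(5000);
            System.out.println(page.asXml());
        }
    }
}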
OK, so I am starting a Bing search, then retrieving a couple of resulting URLs and using those as starting points to traverse other pages, parsing links from them and adding them to a List.
The problem I'm having is that I don't want to visit the same domain twice. I can stop it from visiting the same URL, but if a page has a link to another part of the website (such as an about page) I can't.
Currently I have a LinkedList that I add a URL to every time I parse one from the document using Jsoup, and a HashMap for storing already-visited URLs. So I have it set up in a basic "if", like this:
if (!urlsVisited.containsKey(url)) {
    urlsToVisit.add(url);
    urlsVisited.put(url, url);
}
This is in a for loop where I retrieve the links on each page (currently 4 threads handling 4 pages).
This stops it from adding the likes of "http://www.stackoverflow.com" twice, but it doesn't work if I come across "http://www.stackoverflow.com/questions/ask".
I would like to add one link from StackOverflow (for example) and then be done with that domain. Any ideas?
I'm using the Jsoup API in Java to parse the results.
You can use the URI class to parse your URLs. I also recommend using a Set<String> to store visited domains:
Set<String> urlsVisited = new HashSet<String>();
...
String domain = new URI(url).getHost();
if (!urlsVisited.contains(domain)) {
    urlsToVisit.add(url);
    urlsVisited.add(domain);
}
Use the java.net.URL class to pull the host name, and use that as the key to your urlsVisited map.
http://docs.oracle.com/javase/6/docs/api/java/net/URL.html#getHost()
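A quick sketch of that approach, keying the visited map by host instead of by full URL (reusing the names from the question; the DomainFilter class is just an illustrative wrapper):

import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

public class DomainFilter {
    private final Map<String, String> urlsVisited = new HashMap<String, String>();
    private final List<String> urlsToVisit = new LinkedList<String>();

    // Queue a URL only if its host has not been seen before
    public void offer(String url) {
        try {
            String host = new URL(url).getHost();
            if (!urlsVisited.containsKey(host)) {
                urlsToVisit.add(url);
                urlsVisited.put(host, url);
            }
        } catch (MalformedURLException e) {
            // Skip links that are not valid absolute URLs
        }
    }
}

Since you have four threads sharing these structures, you would also need to synchronize offer() or switch to concurrent collections such as ConcurrentHashMap.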