I am trying to pull all the links off a YouTube user's uploads page, but I am only getting the first 30 videos. I want to be able to get all 56 videos, and I would like to continue using jsoup if possible.
Document beInspired = Jsoup.connect("https://www.youtube.com/channel/UCaKZDEMDdQc8t6GzFj1_TDw/videos").get();
Elements links = beInspired.select("a[href]");
List<String> videos = new ArrayList<>();
for (Element link : links) {
    String href = link.attr("href");
    // Collect each watch link only once
    if (href.contains("/watch?v=") && !videos.contains(href)) {
        videos.add(href);
    }
    System.out.println(href);
}
This is the code I am currently using.
I need to determine the nesting level of each page, measured in clicks from the home page. How do I do this correctly? I understand that all the pages of the site will be fetched recursively.
The code will look like this:
// Index of URLs crawled so far
private final Set<String> links = new HashSet<>();

public void getPageLinks(String URL) {
    //4. Check if you have already crawled the URL
    //(we are intentionally not checking for duplicate content in this example)
    if (!links.contains(URL)) {
        try {
            //4. (i) If not, add it to the index
            if (links.add(URL)) {
                System.out.println(URL);
            }
            //2. Fetch the HTML code
            Document document = Jsoup.connect(URL).get();
            //3. Parse the HTML to extract links to other URLs
            Elements linksOnPage = document.select("a[href]");
            //5. For each extracted URL... go back to Step 4.
            for (Element page : linksOnPage) {
                getPageLinks(page.attr("abs:href"));
            }
        } catch (IOException e) {
            System.err.println("For '" + URL + "': " + e.getMessage());
        }
    }
}
On top of this, there still needs to be a check for whether a link points to an external site; if it does, the crawler should not follow it.
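One caveat: plain recursion crawls depth-first, so the depth at which a page is first reached can be larger than its true distance in clicks from the home page; a breadth-first crawl guarantees the minimal depth. Here is a minimal sketch under assumed names (crawlDepths, depthByUrl), using java.net.URI plus the usual Jsoup imports; treating "different host" as the definition of an external link is my assumption:

public Map<String, Integer> crawlDepths(String homeUrl) throws URISyntaxException {
    String homeHost = new URI(homeUrl).getHost();
    Map<String, Integer> depthByUrl = new HashMap<>(); // page -> clicks from home
    Deque<String> queue = new ArrayDeque<>();
    depthByUrl.put(homeUrl, 0);
    queue.add(homeUrl);
    while (!queue.isEmpty()) {
        String url = queue.poll();
        int depth = depthByUrl.get(url);
        try {
            Document document = Jsoup.connect(url).get();
            for (Element page : document.select("a[href]")) {
                String next = page.attr("abs:href");
                String nextHost = new URI(next).getHost();
                // Skip external links and pages we have already queued
                if (nextHost == null || !nextHost.equals(homeHost)) continue;
                if (depthByUrl.containsKey(next)) continue;
                depthByUrl.put(next, depth + 1); // one click further from home
                queue.add(next);
            }
        } catch (IOException | URISyntaxException e) {
            System.err.println("For '" + url + "': " + e.getMessage());
        }
    }
    return depthByUrl;
}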
I am trying to click on this game to open its page, but I get a NullPointerException every time, whatever locator I use. The link seems to sit inside an "li", so I also tried selecting it from a list, but that didn't work either.
Could anyone help me with the code to click this item?
Targeted Page Url:
https://staging-kw.games.getmo.com:/game/43321031
Search(testCase);
WebElement resultList = driver.findElement(By.xpath(testData.getParam("ResultList")));
log.info("list located ..");
List<WebElement> results = resultList.findElements(By.tagName(testData.getParam("ResultListItems")));
for (WebElement item : results) {
    String link = item.getAttribute("href");
    // getAttribute returns null when the element has no href attribute;
    // calling link.equals(null) on it is itself what throws the NPE,
    // so test with == null instead
    if (link == null || link.isEmpty()) {
        log.info("LinkIsNull");
        continue;
    }
    if (link.equals(testData.getParam("GameLink"))) {
        item.click();
    }
}
//clickLink(GameLocator, driver);
}
I got this code to work by just adding this line after the search method:
driver.findElement(By.cssSelector("li.title")).click();
How did I get it? I used Selenium IDE to record the actions, then converted the code to Java/TestNG to get the exact web element selector.
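If the click is still flaky because the results render asynchronously, an explicit wait is safer than a bare findElement. A small sketch reusing the selector found above (the 10-second timeout is an arbitrary choice):

// Wait until the recorded selector is actually clickable, then click it
WebDriverWait wait = new WebDriverWait(driver, 10);
WebElement game = wait.until(ExpectedConditions.elementToBeClickable(By.cssSelector("li.title")));
game.click();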
I am using JSoup to crawl the web and get results. I want to perform a keyword search. For example, I crawl
http://www.business-standard.com/ for the following keywords:
google hyderabad
and it should provide me with the link:
http://www.business-standard.com/article/companies/google-to-get-7-2-acres-in-hyderabad-it-corridor-for-its-campus-115051201238_1.html.
I wrote the code below, but it did not give me the results I wanted.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class App {
    public static void main(String[] args) {
        Document doc;
        try {
            doc = Jsoup.connect("http://www.business-standard.com").userAgent("Mozilla").get();
            String title = doc.title();
            System.out.println("title : " + title);
            Elements links = doc.select("a:contains(google)");
            for (Element link : links) {
                System.out.println("\nlink : " + link.attr("href"));
                System.out.println("text : " + link.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
The results are as follows:
title : India News, Latest News Headlines, BSE live, NSE Live, Stock Markets Live, Financial News, Business News & Market Analysis on Indian Economy - Business Standard News
link : /photo-gallery/current-affairs/mumbai-central-turns-into-wi-fi-zone-courtesy-google-power-2574.htm
text : Mumbai Central turns into Wi-Fi zone, courtesy Google power
link : plus.google.com/+businessstandard/posts
text : Google+
Jsoup 1.8.2
Try this URL instead:
http://www.business-standard.com/search?q=<keyword>
SAMPLE CODE
Document doc;
try {
    String keyword = "google hyderabad";
    doc = Jsoup //
            .connect("http://www.business-standard.com/search?q=" + URLEncoder.encode(keyword, "UTF-8")) //
            .userAgent("Mozilla") //
            .get();
    String title = doc.title();
    System.out.println("title : " + title);
    Elements links = doc.select("a:contains(google)");
    for (Element link : links) {
        System.out.println("\nlink : " + link.absUrl("href"));
        System.out.println("text : " + link.text());
    }
} catch (IOException e) {
    e.printStackTrace();
}
OUTPUT
The link you're looking for is in the second position.
title : Search
link : http://www.business-standard.com/article/pti-stories/google-to-invest-more-in-india-set-up-new-campus-115121600841_1.html
text : Google to invest more in India, set up new campus in Hyderabad
link : http://www.business-standard.com/article/companies/google-to-get-7-2-acres-in-hyderabad-it-corridor-for-its-campus-115051201238_1.html
text : Google to get 7.2 acres in Hyderabad IT corridor for its campus
link : http://www.business-standard.com/article/technology/swine-flu-closes-google-hyderabad-office-for-2-days-109071500023_1.html
text : Swine flu closes Google Hyderabad office for 2 days
link : http://www.business-standard.com/article/pti-stories/facebook-posts-strong-4q-as-company-closes-gap-with-google-116012800081_1.html
text : Facebook posts strong 4Q as company closes gap with Google
link : http://www.business-standard.com/article/pti-stories/r-day-bsf-camel-contingent-march-on-google-doodle-116012600104_1.html
text : R-Day: BSF camel contingent marches on Google doodle
link : http://www.business-standard.com/article/international/daimler-ceo-says-apple-google-making-progress-on-car-116012501298_1.html
text : Daimler CEO says Apple, Google making progress on car
link : https://plus.google.com/+businessstandard/posts
text : Google+
DISCUSSION
The sample code above fetches only the first results page. If you need to fetch more results, extract the next-page link (#hpcontentbox div.next-colum > a) and crawl it with Jsoup.
You'll notice there are additional parameters for the link I provided above:
itemPerPages : self-explanatory (defaults to 19)
page : the search results page index (defaults to 1 if not provided)
company-code : ?? (can be empty)
You may try passing larger values for itemPerPages in the URL (100 or more). This may reduce your crawling time.
The absUrl method is used to get absolute URLs instead of relative ones.
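Putting the two together, here is a hedged sketch of crawling every results page by following that next-page link; the selector comes from the discussion above, and itemPerPages=100 is only a guess at a workable value:

try {
    String url = "http://www.business-standard.com/search?q="
            + URLEncoder.encode("google hyderabad", "UTF-8") + "&itemPerPages=100";
    while (url != null) {
        Document page = Jsoup.connect(url).userAgent("Mozilla").get();
        for (Element link : page.select("a:contains(google)")) {
            System.out.println(link.absUrl("href") + " : " + link.text());
        }
        // Follow the pagination link; stop when there is no next page
        Element next = page.select("#hpcontentbox div.next-colum > a").first();
        url = (next != null) ? next.absUrl("href") : null;
    }
} catch (IOException e) {
    e.printStackTrace();
}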
I'm using Selenium to log into my Google account and to visit YouTube.
Now, in the video manager, I would like to get all of my video ids. I tried copying the CSS selector or XPath which the developer tools in Chrome give me, but each of them contains the video id, which makes them impossible to use like this:
List<WebElement> allVideoUrls = driver.findElements(By.cssSelector("my-selector-which-gives-all-videos-on-page"));
Note that I have to be logged in to be able to "see" unlisted or private videos as well, so that's required.
So far I have a bad implementation which sometimes fails to work for some reason. First I get all the links on the page and keep only the ones that are for editing a video. To avoid a StaleElementReferenceException, I retrieve all the links again inside the loop.
public void getVideoInformation()
{
    // Visit video manager
    driver.get("https://www.youtube.com/my_videos?o=U");
    // Wait until video list has loaded
    new WebDriverWait(driver, 10).until(ExpectedConditions
            .visibilityOfElementLocated(By
                    .cssSelector("#vm-playlist-video-list-ol")));
    // Return all links on page
    List<WebElement> allLinks = driver.findElements(By.tagName("a"));
    HashSet<String> videoLinks = new HashSet<>();
    for (int linksIndex = 0; linksIndex < allLinks.size(); linksIndex++)
    {
        String link = driver.findElements(By.tagName("a")).get(linksIndex)
                .getAttribute("href");
        try
        {
            if (link.contains("edit"))
            {
                System.out.println(link);
                // No duplicates
                videoLinks.add(link);
            }
        } catch (Exception error)
        {
            error.printStackTrace();
        }
    }
    // ...
}
I'm fine with the fact that I need to load every other page as well to get all the videos, but please help me find an efficient and reliable way of getting the video ids.
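For comparison, here is a sketch that narrows the selection to edit links in a single pass and reads each href immediately, so the elements get no chance to go stale; the video_id query-parameter name is an assumption about how those edit URLs are structured:

// Select only the anchors whose href contains "edit" and read each href right away
Pattern videoIdPattern = Pattern.compile("video_id=([\\w-]+)"); // assumed URL format
List<WebElement> editLinks = driver.findElements(By.cssSelector("a[href*='edit']"));
Set<String> videoIds = new LinkedHashSet<>();
for (WebElement editLink : editLinks) {
    String href = editLink.getAttribute("href");
    if (href == null) continue;
    Matcher m = videoIdPattern.matcher(href);
    if (m.find()) {
        videoIds.add(m.group(1)); // the set keeps out duplicates
    }
}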
I am creating an app in Java that uses jsoup to take all the information from a public website and load it in the app for people to read. I was trying the same kind of function with Facebook, but it wasn't working the same way. Does anyone have a good idea about how I should go about this?
Thanks,
Calland
public String scrapeEvents(String... args) throws Exception {
    Document doc = Jsoup.connect("http://www.facebook.com/cedarstreettimes?fref=ts").get();
    Elements elements = doc.select("div._wk");
    // Elements.toString() returns the combined outer HTML of all matches
    return elements.toString();
}
Edit: I found this link with information, but I'm a little confused about how to use it to get only the content that the specific user posts on their wall: http://developers.facebook.com/docs/getting-started/graphapi/
I had a look at the source of that page -- the thing that is tripping up the parse is that all the real content is wrapped in comments, like this:
<code class="hidden_elem" id="u_0_42"><!-- <div class="fbTimelineSection ...> --></code>
There is JS on the page that lifts that data into the real DOM, but as jsoup doesn't execute JS it stays as comments. So before extracting the content, we need to emulate that JS and "un-hide" those elements. Here's an example to get you started:
String url = "https://www.facebook.com/cedarstreettimes?fref=ts";
String ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.33 (KHTML, like Gecko) Chrome/27.0.1438.7 Safari/537.33";
Document doc = Jsoup.connect(url).userAgent(ua).timeout(10*1000).get();

// move the hidden commented out html into the DOM proper:
Elements hiddenElements = doc.select("code.hidden_elem");
for (Element hidden : hiddenElements) {
    for (Node child : hidden.childNodesCopy()) {
        if (child instanceof Comment) {
            hidden.append(((Comment) child).getData()); // comment data parsed as html
        }
    }
}

Elements articles = doc.select("div[role=article]");
for (Element article : articles) {
    if (article.select("span.userContent").size() > 0) {
        String text = article.select("span.userContent").text();
        String imgUrl = article.select("div.photo img").attr("abs:src");
        System.out.println(String.format("%s\n%s\n\n", text, imgUrl));
    }
}
That example pulls out the article text and any photo associated with it.
(It's possibly better to use the FB API than this method; I wanted to show how you can emulate little bits of JS to make a scrape work properly.)
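Since the question also links the Graph API docs, here is a minimal sketch of that route, assuming the page identifier cedarstreettimes and a placeholder access token; the /posts edge and the "message" field are as described in those docs:

public String fetchWallPostsJson() throws IOException {
    String token = "YOUR_ACCESS_TOKEN"; // placeholder: obtain one via a Facebook app
    URL api = new URL("https://graph.facebook.com/cedarstreettimes/posts?access_token=" + token);
    BufferedReader in = new BufferedReader(new InputStreamReader(api.openStream(), "UTF-8"));
    StringBuilder json = new StringBuilder();
    for (String line; (line = in.readLine()) != null; ) {
        json.append(line);
    }
    in.close();
    // Each entry of the returned "data" array carries the post text in "message"
    return json.toString();
}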