Scraping with Jsoup - Java

I need to gather data from this page: http://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_National_Pok%C3%A9dex_number. The problem is that what I need is the link for each Pokémon, so for the first one, "/wiki/Bulbasaur_(Pok%C3%A9mon)". All I need to do after that is prepend "bulbapedia.bulbagarden.net", but I don't know how to get all of these links. I've seen some examples, but none of them helped here: they used for loops to get the data inside a div, and these links don't seem to be part of any div other than the main big one.
So does anyone know how I could scrape this page?

Here's a solution:
Document doc = Jsoup.connect("http://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_National_Pok%C3%A9dex_number").get();
for (Element element : doc.select("td > span.plainlinks > a")) {
    /*
     * You can do further things here - for this example we
     * only print the absolute URL of each link.
     */
    System.out.println(element.absUrl("href"));
}
This will already give you the absolute URLs of each Pokémon link:
http://bulbapedia.bulbagarden.net/wiki/Bulbasaur_(Pok%C3%A9mon)
http://bulbapedia.bulbagarden.net/wiki/Ivysaur_(Pok%C3%A9mon)
http://bulbapedia.bulbagarden.net/wiki/Venusaur_(Pok%C3%A9mon)
http://bulbapedia.bulbagarden.net/wiki/Charmander_(Pok%C3%A9mon)
...
However, if you need the relative URL, simply replace element.absUrl("href") with element.attr("href").
Result:
/wiki/Bulbasaur_(Pok%C3%A9mon)
/wiki/Ivysaur_(Pok%C3%A9mon)
/wiki/Venusaur_(Pok%C3%A9mon)
/wiki/Charmander_(Pok%C3%A9mon)
...
For an explanation of the selector syntax, see the Jsoup Selector API. Some good examples can be found in the Jsoup Codebook.
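The question also mentions prepending the host to each relative link. Jsoup's absUrl("href") already does that for you, but if you only have the relative path, the standard library can resolve it as well. A minimal stdlib-only sketch using java.net.URI (no Jsoup needed):

```java
import java.net.URI;

public class ResolveLink {
    public static void main(String[] args) {
        // Base URL of the wiki; the relative path is what element.attr("href") would return
        URI base = URI.create("http://bulbapedia.bulbagarden.net");
        String relative = "/wiki/Bulbasaur_(Pok%C3%A9mon)";

        String absolute = base.resolve(relative).toString();
        System.out.println(absolute);
        // → http://bulbapedia.bulbagarden.net/wiki/Bulbasaur_(Pok%C3%A9mon)
    }
}
```

Note that resolve() keeps the percent-encoding intact, so the result matches what the site itself links to.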

Related

How can I extract text from a flex container?

I'm a beginner in Java and I'm attempting to extract some text from a website. However, the text is between two tags, and when I use getByXPath to extract it, I get everything except the text I need.
This is the layout of the website I'm scraping from: Website HTML Layout
The two highlighted portions are the pieces of text I actually need.
And this is the code I've got so far:
List<HtmlElement> name = (List<HtmlElement>) page.getByXPath("//ul/li/a[@class='title']");
List<HtmlElement> subText = (List<HtmlElement>) page.getByXPath("//ul/li/p[@data-af=' (Secret)']");
This, however, results in two lists:
name - which contains HtmlAnchor objects:
[HtmlAnchor[<a class="title" data-af="10" href="/a180775/daddys-home-achievement">], HtmlAnchor[<a class="title" data-af="11" href="/a180776/protector-achievement">], HtmlAnchor[<a class="title" data-af="12" href="/a180777/sinclairs-solution-achievement">]]
subText - which contains HtmlParagraph objects:
[HtmlParagraph[<p data-af=" (Secret)">], HtmlParagraph[<p data-af=" (Secret)">], HtmlParagraph[<p data-af=" (Secret)">], HtmlParagraph[<p data-af=" (Secret)">]]
URL if you want to take a look at the whole website: https://truesteamachievements.com/game/BioShock-2-Remastered/achievements
I need the lists to look something like these:
["Daddy's Home", "Protector", "Sinclair's Solution"]
["Found your way back to the ruins of Rapture.", "Defended yourself against Lamb's assault in the train station.", "Joined forces with Sinclair in Ryan Amusements."]
This is the HTML library I'm using: https://htmlunit.sourceforge.io/apidocs/overview-summary.html
I'd appreciate any help.
The simplest way is using the Stream API:
List<HTMLElement> htmlElementList = new ArrayList<>(); // obtain your list however you need
List<String> listOfTitles = htmlElementList.stream()
        .map(HTMLElement::getTitle)
        .toList();
The same, more explicitly, with a for-each loop:
List<HTMLElement> htmlElementList = new ArrayList<>();
List<String> listOfTitles = new ArrayList<>();
for (HTMLElement htmlElement : htmlElementList) {
    listOfTitles.add(htmlElement.getTitle());
}
It is not clear which library you are using to receive the elements. This example assumes the org.w3c.dom library's HTMLElement definition. Otherwise, use the appropriate method for retrieving text (instead of getTitle()), for example getText() for a Selenium WebElement, etc.
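Since the question uses HtmlUnit rather than org.w3c.dom, here is a self-contained sketch of the same mapping pattern without any HTML library. The Element record is a hypothetical stand-in; with real HtmlUnit elements you would call getTextContent() on the HtmlAnchor/HtmlParagraph objects instead:

```java
import java.util.List;

public class MapToText {
    // Hypothetical stand-in for an HTML element; with HtmlUnit you would call
    // getTextContent() on the real HtmlAnchor or HtmlParagraph instead.
    record Element(String text) {
        String getText() { return text; }
    }

    // Same shape as the answer's pipeline: map each element to its text value.
    static List<String> texts(List<Element> elements) {
        return elements.stream()
                .map(Element::getText)
                .toList();
    }

    public static void main(String[] args) {
        List<Element> names = List.of(
                new Element("Daddy's Home"),
                new Element("Protector"),
                new Element("Sinclair's Solution"));
        System.out.println(texts(names));
        // [Daddy's Home, Protector, Sinclair's Solution]
    }
}
```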

Vaadin how to print dynamic raw html using PrintUI class?

I'm following the example code, which puts the HTML string into a Label. The HTML is perfect in the browser, spans multiple pages, and so on. However, when I do a Print Preview (or Print), the printout is limited to only one page and there is a vertical scrollbar on the printout.
How do I print multiple pages and remove the scrollbar?
My code in the PrintUI class is only:
setContent(new Label(template, ContentMode.HTML));
The answer can be found at: https://vaadin.com/forum/#!/thread/3869543/3869542
You basically need to resort to pure HTML. The following code does this and fixes the issue:
private void setSizeUndefined2Print() {
    com.vaadin.ui.JavaScript.getCurrent().execute(
            "document.body.style.overflow = \"auto\";" +
            "document.body.style.height = \"auto\"");
    UI.getCurrent().setSizeUndefined();
    this.setSizeUndefined();
}
You can find more details in the above link.

How can I use the Wikipedia API to extract/parse the link I am looking for?

On Wikipedia, 95% of the links eventually lead to the Philosophy page. I am trying to write a program in Java that takes any Wikipedia link and clicks the first link (one that is not a citation/sound/extraneous link, ignoring parenthesized links as well).
For example, if you start with this URL http://en.wikipedia.org/wiki/Dutch_people, it should click Ethnic group http://en.wikipedia.org/wiki/Ethnic_group and so on until it reaches Philosophy.
You should see this: Getting_to_Philosophy
Check http://xefer.com/wikipedia (type any word) to see how it works.
I already wrote the back end that stores the data in a database with 3 columns:
Unique_URL_Id, URL_Link, Next_URL_Id, so that later on printing the whole path will be easier.
The back end works fine (if I give it just a list of links to follow). However, extracting and finding the first link is not working as it should.
Here is sample code I wrote just for extracting links from a URL using the Jsoup API:
public static void extractWikiPage(String title) throws IOException {
    Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Europe").get();

    // Get the first paragraph, where the main body content starts
    Element firstParagraph = doc.getElementsByTag("p").first();
    System.out.println(firstParagraph);

    // Collect every link inside that paragraph
    Elements links = firstParagraph.getElementsByTag("a");
    for (Element link : links) {
        System.out.println(link);
    }
}
I am just finding the first occurrence of <p>, since that's where 95% of the links to the next page start. In that paragraph I am trying to get all the links, but I need the first one that satisfies the condition I wrote above.
How can I use the Wikipedia API to extract the data I am looking for? I appreciate your help.
/w/api.php?action=query&prop=revisions&format=json&rvprop=content&rvlimit=1&rawcontinue=&titles=Dutch_people is the query that returns the wikitext for that page.
You'll have to parse that result to get the data you want. You'll be looking for the first thing inside [[double square brackets]] (probably after /\{\{Infobox(.*?)\}\}/i or something like that, to exclude links in the infobox and any maintenance tags that might be on the page) that doesn't start with "something:", to eliminate all interwiki links, categories, and file/media pages.
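That parsing step can be sketched with the standard regex library alone. A hedged, illustrative sketch (the patterns here are simplifying assumptions, not a full wikitext parser; nested templates in particular would defeat the template-stripping regex):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstWikiLink {
    // Matches [[Target]] or [[Target|display text]] and captures the target
    private static final Pattern LINK =
            Pattern.compile("\\[\\[([^\\]|]+)(?:\\|[^\\]]*)?\\]\\]");

    /** Returns the first [[...]] target that is not a "Namespace:Page" link, or null. */
    static String firstLink(String wikitext) {
        // Crude assumption: strip flat {{...}} templates (infoboxes, maintenance
        // tags) before scanning; nested templates would need a real parser.
        String stripped = wikitext.replaceAll("\\{\\{[^{}]*\\}\\}", "");
        Matcher m = LINK.matcher(stripped);
        while (m.find()) {
            String target = m.group(1).trim();
            if (!target.contains(":")) {  // skip File:, Category:, interwiki, etc.
                return target;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String sample = "{{Infobox ethnic group|name=Dutch}} The '''Dutch''' are a "
                + "[[Germanic peoples|Germanic]] [[ethnic group]] native to the [[Netherlands]].";
        System.out.println(firstLink(sample)); // Germanic peoples
    }
}
```

A real implementation would also have to honor the "ignore parenthesized links" rule from the question, which requires tracking parenthesis depth while scanning.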

How to get current browser in RFT test script

I have my webpage opened using RFT. On that page, there is a link I want to click.
For that I am using
objMap.ClickTabLink(objMap.document_eBenefitsHome(), "Upload Documentation", "Upload Documentation");
The link name on the current page is "Upload Documentation".
I know that objMap.document_eBenefitsHome() takes it back to the initial page; what can I use in its place that refers to the currently opened page?
Many thanks in advance.
There are a couple of alternatives that could solve your problem:
Open the Test Object Map; select from the map the object that represents the document document_eBenefitsHome; modify the .url property using a regular expression, so that the URLs of both pages you cited in your question match it.
Find the document object dynamically using the find method. Once the page containing the link you want to click has fully loaded, try this code to find the document: find(atDescendant(".class", "Html.HtmlDocument"), false). The false boolean value allows the find method to also search among objects that were not previously recorded.

POST Password and Email data to Google Sign-in

All- I have the code:
doc = Jsoup.connect("https://play.google.com/apps/publish/Home?dev_acc=00758402038897917238")
        .data("input#Email", "email@gmail.com")
        .data("input#Passwd", "123abcABC123")
        .post();
I got this from here: SO question, but could not figure out what is wrong. I am getting the sign-in page instead of the page displaying all my published apps. I believe the problem might lie in input#Email and input#Passwd, but I am not sure. I don't quite understand what they are supposed to refer to. So my question: how can I log in to my developer console using code similar to the one above, and what is supposed to go where input#Email and input#Passwd are?
In the post you have to use the input names, not CSS selectors:
doc = Jsoup.connect("https://play.google.com/apps/publish/Home?dev_acc=00758402038897917238")
        .data("Email", "email@gmail.com")
        .data("Passwd", "123abcABC123")
        .post();