I'm a beginner in Java and I'm attempting to extract some text from a website. The text, however, is between two tags, and when I use getByXPath to extract it I get everything except the text I need.
This is the layout of the website I'm scraping from: Website HTML Layout
The two highlighted portions are the pieces of text I actually need.
And this is the code I've got so far:
List<HtmlElement> name = (List<HtmlElement>) page.getByXPath("//ul/li/a[@class='title']");
List<HtmlElement> subText = (List<HtmlElement>) page.getByXPath("//ul/li/p[@data-af=' (Secret)']");
This however results in two lists:
name - which has HtmlAnchor objects within
[HtmlAnchor[<a class="title" data-af="10" href="/a180775/daddys-home-achievement">], HtmlAnchor[<a class="title" data-af="11" href="/a180776/protector-achievement">], HtmlAnchor[<a class="title" data-af="12" href="/a180777/sinclairs-solution-achievement">]]
subText - which has HtmlParagraph objects within.
[HtmlParagraph[<p data-af=" (Secret)">], HtmlParagraph[<p data-af=" (Secret)">], HtmlParagraph[<p data-af=" (Secret)">], HtmlParagraph[<p data-af=" (Secret)">]]
URL if you want to take a look at the whole website: https://truesteamachievements.com/game/BioShock-2-Remastered/achievements
I need the lists to look something like these:
["Daddy's Home", "Protector", "Sinclair's Solution"]
["Found your way back to the ruins of Rapture.", "Defended yourself against Lamb's assault in the train station.", "Joined forces with Sinclair in Ryan Amusements."]
This is the HTML library I'm using: https://htmlunit.sourceforge.io/apidocs/overview-summary.html
Appreciate any help.
The simplest way is to use the Stream API:
List<HTMLElement> htmlElementList = new ArrayList<>(); // obtain your list however you need
List<String> listOfTitles = htmlElementList.stream()
        .map(HTMLElement::getTitle)
        .toList(); // Java 16+; use collect(Collectors.toList()) on older versions
A more explicit version, with a for-each loop:
List<HTMLElement> htmlElementList = new ArrayList<>();
List<String> listOfTitles = new ArrayList<>();
for (HTMLElement htmlElement : htmlElementList) {
    listOfTitles.add(htmlElement.getTitle());
}
It is not clear which library you use to retrieve the elements. This example assumes HTMLElement as defined in the org.w3c.dom library. Otherwise, use the appropriate text-retrieval method instead of getTitle(), for example getText() for a Selenium WebElement.
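To illustrate the difference between selecting an element and reading its text, here is a minimal sketch that uses only the XPath support built into the JDK, not HtmlUnit. The markup in main is hypothetical and simplified (the real page is not reproduced here), and the class and method names are made up for the example. The key call is getTextContent(), which returns the text inside a node rather than the node itself; HtmlUnit's DomNode should offer a method of the same name, but check the API docs for your version.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathTextDemo {

    /** Returns the trimmed text content of every node matched by the XPath expression. */
    static List<String> extractTexts(String xml, String xpathExpression) {
        try {
            // Parse the (well-formed) markup into a DOM document.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Select the matching nodes.
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(xpathExpression, doc, XPathConstants.NODESET);
            // Read the text of each node instead of keeping the node object.
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent().trim());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + xpathExpression, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, simplified markup used only for this illustration.
        String xml = "<ul>"
                + "<li><a class=\"title\" href=\"/a180775/daddys-home-achievement\">Daddy's Home</a>"
                + "<p>Found your way back to the ruins of Rapture.</p></li>"
                + "<li><a class=\"title\" href=\"/a180776/protector-achievement\">Protector</a>"
                + "<p>Defended yourself against Lamb's assault in the train station.</p></li>"
                + "</ul>";
        System.out.println(extractTexts(xml, "//ul/li/a[@class='title']"));
        System.out.println(extractTexts(xml, "//ul/li/p"));
    }
}
```

The same pattern carries over to a stream: map each selected element to its text and collect the results into a List<String>.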
I want to automate a test using selenium java in which I need to check whether a specific text is NOT present on the entire page.
The page has many elements where this text may be present. In general, if it were a few elements, I could resolve this via standard driver.findElement(By.xpath("//somepath")).getText(). But, I want to write an efficient test that doesn't have tons of locators just for this test.
You can use the XPath selector '//*[contains(text(), "YOUR_TEXT")]' to find the text you need.
Example code in Python:
def find_text_on_the_page(text):
    selector = '//*[contains(text(), "{}")]'.format(text)
    elements = browser.find_elements_by_xpath(selector)
    # Passes when the text is present; for the absence check, assert `not elements` instead.
    assert len(elements), 'The text is not on the page'
Hope this helps.
You could locate it with an XPath selector, but I would not recommend a test like that.
Surely you know where to look for the text, or at least which part of the web page?
Here is a code sample showing how to achieve this:
List<WebElement> list = driver.findElements(By.xpath("//*[contains(text(),'" + text + "')]"));
Assert.assertTrue("Text not found!", list.size() > 0);
or
String bodyText = driver.findElement(By.tagName("body")).getText();
Assert.assertTrue(bodyText.contains(text));
or as a method:
public boolean isTextOnPagePresent(String text) {
    WebElement body = driver.findElement(By.tagName("body"));
    String bodyText = body.getText();
    return bodyText.contains(text);
}
Hope this helps, but as I mentioned try defining a smaller scope for testing.
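One caveat with building the selector by string concatenation: if the searched text contains a quote character, the resulting XPath is no longer valid. The following is a small sketch of a helper that quotes arbitrary text as an XPath 1.0 string literal; the class and method names are hypothetical and not part of Selenium. For the absence check from the question, the resulting selector could be passed to driver.findElements(By.xpath(...)) and the returned list asserted to be empty.

```java
class XPathLiteral {

    /** Quotes arbitrary text as an XPath 1.0 string literal. */
    static String quote(String text) {
        if (!text.contains("'")) {
            return "'" + text + "'";   // no single quotes: wrap in single quotes
        }
        if (!text.contains("\"")) {
            return "\"" + text + "\""; // no double quotes: wrap in double quotes
        }
        // Both quote types occur: split on ' and glue the pieces with concat().
        StringBuilder sb = new StringBuilder("concat(");
        String[] parts = text.split("'", -1);
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                sb.append(", \"'\", ");
            }
            sb.append("'").append(parts[i]).append("'");
        }
        return sb.append(")").toString();
    }

    /** Builds a selector matching any element whose own text contains the given text. */
    static String containsTextSelector(String text) {
        return "//*[contains(text(), " + quote(text) + ")]";
    }

    public static void main(String[] args) {
        System.out.println(containsTextSelector("Daddy's Home"));
    }
}
```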
I'm following the example code, which puts the HTML string into a Label. The HTML looks perfect in the browser, spans multiple pages and so on. However, when I do a Print Preview (or Print), the printout is limited to only one page and there is a vertical scrollbar on the printout.
How do I print multiple pages and remove the scrollbar?
My code in the PrintUI class is only:
setContent(new Label(template, ContentMode.HTML));
The answer can be found at: https://vaadin.com/forum/#!/thread/3869543/3869542
You basically need to resort to pure html. The following code does this and fixes the issue:
private void setSizeUndefined2Print()
{
    com.vaadin.ui.JavaScript.getCurrent().execute("document.body.style.overflow = \"auto\";" +
            "document.body.style.height = \"auto\"");
    UI.getCurrent().setSizeUndefined();
    this.setSizeUndefined();
}
You can find more details in the above link.
On Wikipedia, 95% of the links lead to the Philosophy page. I am trying to write a program in Java that takes any link on Wikipedia and clicks the first link (one that is not a citation, sound, or other extraneous link, and that also ignores parenthesized links).
For example, if you start with the URL http://en.wikipedia.org/wiki/Dutch_people, it should click Ethnic Group http://en.wikipedia.org/wiki/Ethnic_group and so on until it reaches Philosophy.
You should see this Getting_to_Philosophy
Check http://xefer.com/wikipedia (type any word) to see how it works .
I already wrote the back end, which stores the data in a database in 3 columns:
Unique_URL_Id URL_Link Next_URL_Id, so that printing the whole path later on will be easier.
The back end works fine (if I just give it a list of links to follow). However, extracting and finding the first link is not working as it should.
Here is the sample code I wrote just for extracting links from a URL using the jsoup API:
public static void extractWikiPage(String title) throws IOException {
    Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Europe").get();
    //int titles = doc.toString().indexOf("(");
    // Get the first paragraph, where the main body content starts
    String body = doc.getElementsByTag("p").first().toString();
    System.out.println(body);
    Document doc2 = Jsoup.parse(body);
    Elements href = doc2.getElementsByTag("a");
    int x = "".indexOf("");
    for (Element h : href) {
        System.out.println(h.toString());
    }
    //System.out.println(linkText);
    System.exit(1);
}
I am just finding the first occurrence of '<p>', since that's where 95% of the links to the next page start. In that paragraph, I am trying to get all the links, but I need the first one that satisfies the condition I described above.
How can I use the Wikipedia API to extract the data I am looking for? I appreciate your help.
/w/api.php?action=query&prop=revisions&format=json&rvprop=content&rvlimit=1&rawcontinue=&titles=Dutch_people is the query that returns the wikitext for that page.
You'll have to parse that result to get the data you want. Look for the first thing inside [[double square brackets]] that doesn't start with "something:", which eliminates all interwiki links, categories, and file/media pages. You will probably want to search only after /\{\{Infobox(.*?)\}\}/i or something like that, to exclude links in the infobox and any maintenance tags that might be on the page.
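The parsing step described above could be sketched roughly as follows. This is a simplification under stated assumptions, not a full wikitext parser: it strips {{...}} templates by counting brace depth (instead of the Infobox regex), takes the first [[...]] target that contains no ":" prefix, and does not handle parenthesized links or links nested inside file captions. The class and method names are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstWikiLink {

    // Captures the link target: everything after "[[" up to "]", "|" or "#".
    private static final Pattern LINK = Pattern.compile("\\[\\[([^\\]|#]+)");

    /** Removes {{...}} templates, including nested ones, by tracking brace depth. */
    static String stripTemplates(String wikitext) {
        StringBuilder out = new StringBuilder();
        int depth = 0;
        for (int i = 0; i < wikitext.length(); i++) {
            if (wikitext.startsWith("{{", i)) {
                depth++;
                i++; // skip the second brace
            } else if (depth > 0 && wikitext.startsWith("}}", i)) {
                depth--;
                i++; // skip the second brace
            } else if (depth == 0) {
                out.append(wikitext.charAt(i)); // keep only text outside templates
            }
        }
        return out.toString();
    }

    /** Returns the first link target outside templates that has no "something:" prefix. */
    static Optional<String> firstArticleLink(String wikitext) {
        Matcher m = LINK.matcher(stripTemplates(wikitext));
        while (m.find()) {
            String target = m.group(1).trim();
            if (!target.contains(":")) {
                return Optional.of(target);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up wikitext for illustration; a real page is considerably messier.
        String wikitext = "{{Infobox ethnic group|group=[[Dutch]]}}\n"
                + "[[File:Map.png|thumb]] The Dutch are an [[ethnic group]] "
                + "native to the [[Netherlands]].";
        System.out.println(firstArticleLink(wikitext).orElse("(no link found)"));
    }
}
```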
I am working with Selenium and the Java client. I am getting the HTML as a string using the method driver.getPageSource().
Can you please suggest whether there is any open-source library that converts the HTML to Java objects?
Based on the above question, I am expecting functionality like the following:
getTextBoxIds() - It will return all the text box IDs as a HashMap, with the ID as the key and the text box value as the value.
getSelectBoxIds()
getDivIds()
Note: As of now I am checking the expected data using the contains(), indexOf(), and lastIndexOf() methods.
Regards,
Vasanth D
Don't do that! Selenium does it for you (and much more).
Once you're on the page you wanted to get to, you can get all the data you need:
/** Maps IDs of all textboxes to their value attribute. */
public Map<String,String> getTextBoxIds() {
    Map<String,String> textboxIds = new HashMap<>();
    // find all textboxes
    List<WebElement> textboxes = driver.findElements(By.cssSelector("input[type='text']"));
    // map id of each textbox to its value
    for (WebElement textbox : textboxes) {
        textboxIds.put(textbox.getAttribute("id"), textbox.getAttribute("value"));
    }
    return textboxIds;
}
and so on and so forth. Look at Selenium's documentation to find out more.
Also, JavaDocs.
I need to gather data from this page http://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_National_Pok%C3%A9dex_number but the problem is that what I need is the link for each Pokémon, so for the first one, "/wiki/Bulbasaur_(Pok%C3%A9mon)" (all I need to do after that is add "bulbapedia.bulbagarden.net" in front), but I don't know how to get all of these. I've seen some examples, but I did not see anything that would have helped me here. The ones I've seen used for loops to get the data inside a div, but these links don't seem to be part of any div other than the main big one.
So does anyone know how I could scrape this page?
Here's a solution:
Document doc = Jsoup.connect("http://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_National_Pok%C3%A9dex_number").get();
for (Element element : doc.select("td > span.plainlinks > a"))
{
    /*
     * You can do further things here - for this example we
     * only print the absolute URL of each link.
     */
    System.out.println(element.absUrl("href"));
}
This already gives you the absolute URL of each Pokémon link:
http://bulbapedia.bulbagarden.net/wiki/Bulbasaur_(Pok%C3%A9mon)
http://bulbapedia.bulbagarden.net/wiki/Ivysaur_(Pok%C3%A9mon)
http://bulbapedia.bulbagarden.net/wiki/Venusaur_(Pok%C3%A9mon)
http://bulbapedia.bulbagarden.net/wiki/Charmander_(Pok%C3%A9mon)
...
However, if you need the relative URL you only have to replace element.absUrl("href") with element.attr("href").
Result:
/wiki/Bulbasaur_(Pok%C3%A9mon)
/wiki/Ivysaur_(Pok%C3%A9mon)
/wiki/Venusaur_(Pok%C3%A9mon)
/wiki/Charmander_(Pok%C3%A9mon)
...
For an explanation of this, see: Jsoup Selector API. Some good examples can be found here: Jsoup Codebook.
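If you keep the relative links and want to build the absolute ones yourself, plain java.net.URI can resolve an href against the page URL, which avoids gluing the host name on by hand. This is a minimal sketch using only the JDK; the class and method names are made up for the example, and it is not how Jsoup implements absUrl internally.

```java
import java.net.URI;

class ResolveLink {

    /** Resolves a (possibly relative) href against the URL of the page it was found on. */
    static String toAbsolute(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://bulbapedia.bulbagarden.net/wiki/"
                + "List_of_Pok%C3%A9mon_by_National_Pok%C3%A9dex_number";
        System.out.println(toAbsolute(page, "/wiki/Bulbasaur_(Pok%C3%A9mon)"));
    }
}
```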