HTML parsing library in Java

I am working on Selenium with the Java client. I am getting the HTML as a string using the method driver.getPageSource().
Can you suggest an open-source library that converts the HTML into Java objects?
Based on the above, I am expecting functionality like the following:
getTextBoxIds() - returns all the text box IDs as a HashMap, with the ID as the key and the text box value as the value.
getSelectBoxIds()
getDivIds()
Note: As of now I am checking the expected data using the contains(), indexOf(), and lastIndexOf() methods.

Don't do that! Selenium does it for you (and much more).
Once you're on the page you wanted to get to, you can get all the data you need:
/** Maps the ID of each textbox to its value attribute. */
public Map<String, String> getTextBoxIds() {
    Map<String, String> textboxIds = new HashMap<>();
    // find all textboxes
    List<WebElement> textboxes = driver.findElements(By.cssSelector("input[type='text']"));
    // map the id of each textbox to its value
    for (WebElement textbox : textboxes) {
        textboxIds.put(textbox.getAttribute("id"), textbox.getAttribute("value"));
    }
    return textboxIds;
}
and so on and so forth. Look at Selenium's documentation to find out more.
Also see the JavaDocs.
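For instance, getSelectBoxIds() could follow the same pattern. Here is a sketch (my interpretation, not the asker's spec: I assume each select box's "value" should be the text of its currently selected option, using Selenium's Select helper from org.openqa.selenium.support.ui):
/** Sketch: maps the ID of each select box to the text of its selected option. */
public Map<String, String> getSelectBoxIds() {
    Map<String, String> selectBoxIds = new HashMap<>();
    List<WebElement> selects = driver.findElements(By.tagName("select"));
    for (WebElement selectElement : selects) {
        // Select is Selenium's helper class for <select> elements
        String selectedText = new Select(selectElement).getFirstSelectedOption().getText();
        selectBoxIds.put(selectElement.getAttribute("id"), selectedText);
    }
    return selectBoxIds;
}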

Related

I have to validate the user agreement content for a mobile application using Selenium Java

I am new to Appium and I have to validate the end user agreement content to make sure it's typo-free.
Below is the logic to get the list of element texts from the current view:
List<WebElement> listOfElements = driver.findElements(AppiumBy.className("android.widget.xxx"));
for (WebElement element : listOfElements) {
    String actualContent = element.getText();
    System.out.println(actualContent);
}
After getting the actual content for each element one by one, I want to validate whether the content is typo-free.
What is the best way to approach it?
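One possible approach (an assumption on my part, not from the post): rather than spell-checking on the fly, compare the collected text against a known-good copy of the agreement stored as a test resource. A minimal sketch, where expected.txt is a hypothetical resource file holding the approved text:
// Sketch: collect the on-screen text and compare it to a known-good copy.
// "src/test/resources/expected.txt" is a hypothetical resource file.
StringBuilder actual = new StringBuilder();
for (WebElement element : listOfElements) {
    actual.append(element.getText()).append('\n');
}
String expected = new String(
        Files.readAllBytes(Paths.get("src/test/resources/expected.txt")),
        StandardCharsets.UTF_8);
// Normalize whitespace so layout differences don't fail the comparison
Assert.assertEquals(expected.replaceAll("\\s+", " ").trim(),
                    actual.toString().replaceAll("\\s+", " ").trim());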

How can I extract text from a flex container?

I'm a beginner in Java and I'm attempting to extract some text from a website. The text, however, is between two tags, and when I use getByXPath to extract it I get everything except the text I need.
This is the layout of the website I'm scraping from: Website HTML Layout
The two highlighted portions are the pieces of text I actually need.
And this is the code I've got so far:
List<HtmlElement> name = (List<HtmlElement>) page.getByXPath("//ul/li/a[@class='title']");
List<HtmlElement> subText = (List<HtmlElement>) page.getByXPath("//ul/li/p[@data-af=' (Secret)']");
This however results in two lists:
name - which has HtmlAnchor objects within
[HtmlAnchor[<a class="title" data-af="10" href="/a180775/daddys-home-achievement">], HtmlAnchor[<a class="title" data-af="11" href="/a180776/protector-achievement">], HtmlAnchor[<a class="title" data-af="12" href="/a180777/sinclairs-solution-achievement">]]
subText - which has HtmlParagraph objects within.
[HtmlParagraph[<p data-af=" (Secret)">], HtmlParagraph[<p data-af=" (Secret)">], HtmlParagraph[<p data-af=" (Secret)">], HtmlParagraph[<p data-af=" (Secret)">]]
URL if you want to take a look at the whole website: https://truesteamachievements.com/game/BioShock-2-Remastered/achievements
I need the lists to look something like these:
["Daddy's Home", "Protector", "Sinclair's Solution"]
["Found your way back to the ruins of Rapture.", "Defended yourself against Lamb's assault in the train station.", "Joined forces with Sinclair in Ryan Amusements."]
This is the HTML library I'm using: https://htmlunit.sourceforge.io/apidocs/overview-summary.html
Appreciate any help.
The simplest way is using the Stream API:
List<HTMLElement> htmlElementList = new ArrayList<>(); // get your list however you need
List<String> listOfTitles = htmlElementList.stream()
        .map(HTMLElement::getTitle)
        .toList();
For more clarity, the same thing with a for-each loop:
List<HTMLElement> htmlElementList = new ArrayList<>();
List<String> listOfTitles = new ArrayList<>();
for (HTMLElement htmlElement : htmlElementList) {
    listOfTitles.add(htmlElement.getTitle());
}
It's not clear which library you are using to obtain the elements. This example assumes HTMLElement from the org.w3c.dom package. Otherwise, use the appropriate method for getting the text (instead of getTitle()), for example getText() for Selenium's WebElement, etc.
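Since the question actually uses HtmlUnit, the same idea with HtmlUnit's own types might look like the sketch below (one assumption: asNormalizedText() exists in recent HtmlUnit versions; older versions use asText() instead):
// Sketch with HtmlUnit: extract the visible text of each matched element.
List<HtmlElement> names = (List<HtmlElement>) page.getByXPath("//ul/li/a[@class='title']");
List<String> titles = new ArrayList<>();
for (HtmlElement name : names) {
    titles.add(name.asNormalizedText()); // or asText() on older HtmlUnit versions
}
System.out.println(titles); // e.g. ["Daddy's Home", "Protector", "Sinclair's Solution"]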

What is the best strategy to verify the text of multiple elements without adding a ton of locators in a Selenium framework?

I want to automate a test using Selenium Java in which I need to check that a specific text is NOT present anywhere on the page.
The page has many elements where this text may appear. If it were only a few elements, I could resolve this via the standard driver.findElement(By.xpath("//somepath")).getText(). But I want to write an efficient test that doesn't need tons of locators just for this check.
You can use the XPath selector //*[contains(text(), "YOUR_TEXT")] to find the text you need.
Example of the code in Python:
def find_text_on_the_page(text):
    selector = '//*[contains(text(), "{}")]'.format(text)
    elements = browser.find_elements_by_xpath(selector)
    assert len(elements), 'The text is not on the page'
Hope this will help you.
You could try to locate it via an XPath selector, but I would not recommend a test like that.
Surely you know roughly where to look for the text? At least a part of the web page.
Here is a code sample showing how to achieve this:
List<WebElement> list = driver.findElements(By.xpath("//*[contains(text(),'" + text + "')]"));
Assert.assertTrue("Text not found!", list.size() > 0);
or
String bodyText = driver.findElement(By.tagName("body")).getText();
Assert.assertTrue(bodyText.contains(text));
or in a method:
public boolean isTextOnPagePresent(String text) {
    WebElement body = driver.findElement(By.tagName("body"));
    String bodyText = body.getText();
    return bodyText.contains(text);
}
Hope this helps, but as I mentioned, try defining a smaller scope for testing.
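Note that the question asks to verify the text is NOT present, so the body-text approach above would be inverted (a small sketch):
// Sketch: assert the text is absent anywhere in the page body.
String bodyText = driver.findElement(By.tagName("body")).getText();
Assert.assertFalse("Unexpected text found on the page: " + text, bodyText.contains(text));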

How can I use the Wikipedia API to extract/parse the link I am looking for?

On Wikipedia, 95% of the links eventually lead to the Philosophy page. I am trying to write a program in Java that takes any Wikipedia link and clicks the first link (one that is not a citation/sound/extraneous link, and ignoring parenthesized links).
For example, if you start with the URL http://en.wikipedia.org/wiki/Ethnic_group via http://en.wikipedia.org/wiki/Dutch_people, it should click Ethnic Group, and so on until it reaches Philosophy.
You should see this: Getting_to_Philosophy
Check http://xefer.com/wikipedia (type any word) to see how it works.
I already wrote the back end, which stores the data in a database in 3 columns:
Unique_URL_Id, URL_Link, Next_URL_Id, so that printing the whole path later will be easier.
The back end works fine (if I give it just a list of links to follow). However, extracting and finding the first link is not working as it should.
Here is sample code I wrote just for extracting from a URL using the jsoup API:
public static void extractWikiPage(String title) throws IOException {
    // use the title parameter (the original hardcoded /wiki/Europe and never used it)
    Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/" + title).get();
    // Get the first paragraph, where the main body content starts
    Element firstParagraph = doc.getElementsByTag("p").first();
    System.out.println(firstParagraph);
    // Collect every link inside that paragraph
    Elements links = firstParagraph.getElementsByTag("a");
    for (Element link : links) {
        System.out.println(link);
    }
}
I am just finding the first occurrence of <p>, since that's where 95% of the links to the next page start. Within that paragraph I get all the links, but I need the first one that satisfies the condition I wrote above.
How can I use the Wikipedia API to extract the data I am looking for? I appreciate your help.
/w/api.php?action=query&prop=revisions&format=json&rvprop=content&rvlimit=1&rawcontinue=&titles=Dutch_people is the query that returns the wikitext for that page.
You'll have to parse that result to get the data you want back. You'll be looking for the first thing inside [[double square brackets]] (probably after /\{\{Infobox(.*?)\}\}/i or something like that, to exclude links in the infobox and any maintenance tags on the page) whose target doesn't start with "something:", to eliminate all interwiki links, categories, and file/media pages.
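A rough sketch of that approach in Java (assumptions on my part: java.net.http for the request, a regex heuristic in place of a real wikitext parser, and approximate template stripping; a real implementation should parse the JSON and the wikitext properly):
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstWikiLink {
    public static void main(String[] args) throws Exception {
        // The API query quoted above, for the Dutch_people page
        String api = "https://en.wikipedia.org/w/api.php?action=query&prop=revisions"
                + "&format=json&rvprop=content&rvlimit=1&rawcontinue=&titles=Dutch_people";
        String body = HttpClient.newHttpClient()
                .send(HttpRequest.newBuilder(URI.create(api)).build(),
                      HttpResponse.BodyHandlers.ofString())
                .body();
        // Crude heuristic: strip {{...}} templates (infobox, maintenance tags),
        // then take the first [[link]] whose target has no "something:" prefix.
        String wikitext = body.replaceAll("(?s)\\{\\{.*?\\}\\}", "");
        Matcher m = Pattern.compile("\\[\\[([^\\]|#]+)").matcher(wikitext);
        while (m.find()) {
            String target = m.group(1).trim();
            if (!target.contains(":")) {   // skip File:, Category:, interwiki, etc.
                System.out.println("First link: " + target);
                break;
            }
        }
    }
}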

How do you send checkbox data in Jsoup

I am trying to post checkbox data with Jsoup and am having a little trouble. I thought that when multiple checkboxes are selected they are sent to the server as an array, but maybe that is not the case?
This is what I thought was correct:
HashMap<String, String> postData = new HashMap<>();
postData.put("checkbox", "[box1,box2,box3]");
Jsoup.connect("somesite").data(postData).post();
This does not seem to work properly. However, if I send only a single checkbox I get my expected results, leading me to believe my understanding of how checkbox form data is sent is incorrect.
This works:
postData.put("checkbox", "box2");
Maybe HashMap is the wrong type to use. According to the Jsoup documentation I could just call .data(key, value) multiple times, but I was hoping for something a little cleaner than that.
If you have multiple checkboxes, then presumably each checkbox has its own name attribute. You should then call .data(name, value) for each such name.
AFAIK there's no way to "collapse" these calls to data into a single call.
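For example, with three checked boxes that all share the name "checkbox", the repeated calls would look like this (a sketch reusing the placeholder URL from the question):
// Sketch: send one name=value pair per checked box, just as a browser would.
Document result = Jsoup.connect("somesite")
        .data("checkbox", "box1")
        .data("checkbox", "box2")
        .data("checkbox", "box3")
        .post();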
Maybe you can try something like the following?
HashMap<String, String> paramHM = new HashMap<>();
ArrayList<String> checkboxVal = new ArrayList<>();
// ... put the values from request.getParameterValues() into this ArrayList
org.jsoup.Connection jsoupConn = Jsoup.connect(web_api).data(paramHM);
// call data() once per checkbox value
for (String item : checkboxVal) {
    jsoupConn = jsoupConn.data("checkbox", item);
}
