Determine the nesting level of pages on a site - Java

I need to determine the nesting level of each page, measured in clicks from the home page. What is the right way to do this? I understand that all of the site's pages have to be fetched recursively.
The code looks like this:
// 'links' is a HashSet<String> field that stores the URLs crawled so far
public void getPageLinks(String URL) {
    // 4. Check if you have already crawled this URL
    //    (we are intentionally not checking for duplicate content in this example)
    if (!links.contains(URL)) {
        try {
            // 4. (i) If not, add it to the index
            if (links.add(URL)) {
                System.out.println(URL);
            }
            // 2. Fetch the HTML code
            Document document = Jsoup.connect(URL).get();
            // 3. Parse the HTML to extract links to other URLs
            Elements linksOnPage = document.select("a[href]");
            // 5. For each extracted URL... go back to Step 4.
            for (Element page : linksOnPage) {
                getPageLinks(page.attr("abs:href"));
            }
        } catch (IOException e) {
            System.err.println("For '" + URL + "': " + e.getMessage());
        }
    }
}
There also still needs to be a check of whether a link points to an external site; if it does, the crawler should not follow it.
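To get the nesting level itself, one option (a minimal sketch, not the original code: the depths map, the class and method names, and the host comparison for "internal link" are all assumptions) is to replace the plain recursion with a breadth-first traversal that carries the current depth along with every URL:

import java.io.IOException;
import java.net.URI;
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Sketch: breadth-first crawl that records the click depth of every internal page.
public class DepthCrawler {

    private final Map<String, Integer> depths = new HashMap<>();

    public void crawl(String homeUrl) throws Exception {
        String host = new URI(homeUrl).getHost();
        Queue<String> queue = new ArrayDeque<>();
        depths.put(homeUrl, 0);          // the home page is at depth 0
        queue.add(homeUrl);

        while (!queue.isEmpty()) {
            String url = queue.poll();
            int depth = depths.get(url);
            System.out.println(depth + "  " + url);
            try {
                Document document = Jsoup.connect(url).get();
                for (Element link : document.select("a[href]")) {
                    String next = link.attr("abs:href");
                    // skip external links and pages that were already queued
                    if (isInternal(next, host) && !depths.containsKey(next)) {
                        depths.put(next, depth + 1);   // one click further from home
                        queue.add(next);
                    }
                }
            } catch (IOException e) {
                System.err.println("For '" + url + "': " + e.getMessage());
            }
        }
    }

    private boolean isInternal(String url, String host) {
        try {
            return host.equalsIgnoreCase(new URI(url).getHost());
        } catch (Exception e) {
            return false;   // malformed URL - treat it as external
        }
    }
}

Because the traversal is breadth-first, the depth recorded for each page is the minimum number of clicks from the home page, which a depth-first recursion would not guarantee.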

Related

AngularJs page issue with selecting an element and clicking it

I have a problem with selecting and clicking an element so that the drop-down opens. Here is what I have tried up till now:
String csspath = "html body.ng-scope f:view form#wdesk.ng-pristine.ng-valid div.container div.ng-scope md-content.md-padding._md md-tabs.ng-isolate-scope.md-dynamic-height md-tabs-content-wrapper._md md-tab-content#tab-content-7._md.ng-scope.md-active.md-no-scroll div.ng-scope.ng-isolate-scope ng-include.ng-scope div.ng-scope accordion div.accordion div.accordion-group.ng-isolate-scope div.accordion-heading a.accordion-toggle.ng-binding span.ng-scope b.ng-binding";
String uxpath = "//html//body//f:view//form//div//div[2]//md-content//md-tabs//md-tabs-content-wrapper//md-tab-content[1]//div//ng-include//div//accordion//div//div[1]//div[1]//a";
String xpath2 = "/html/body/pre/span[202]/a";
String xpath = "/html/body/f:view/form/div/div[2]/md-content/md-tabs/md-tabs-content-wrapper/md-tab-content[1]/div/ng-include/div/accordion/div/div[1]/div[1]/a/span/b";

try {
    element = wait.until(ExpectedConditions.visibilityOfElementLocated(By.cssSelector(csspath)));
    locator = By.cssSelector(csspath);
    driver.findElement(locator).click();
} catch (Exception e) {
    System.out.println("Not found: csspath");
}
try {
    element = wait.until(ExpectedConditions.visibilityOfElementLocated(By.xpath(xpath)));
    locator = By.xpath(xpath);
    driver.findElement(locator).click();
} catch (Exception e) {
    System.out.println("Not found: xpath");
}
try {
    element = wait.until(ExpectedConditions.visibilityOfElementLocated(By.xpath(uxpath)));
    locator = By.xpath(uxpath);
    driver.findElement(locator).click();
} catch (Exception e) {
    System.out.println("Not found: uxpath");
}
try {
    element = wait.until(ExpectedConditions.visibilityOfElementLocated(By.xpath(xpath2)));
    locator = By.xpath(xpath2);
    driver.findElement(locator).click();
} catch (Exception e) {
    System.out.println("Not found: xpath2");
}
However, nothing has worked so far. I want to select the responsibility code and give it values.
Any insight would be really appreciated.
Thanks in advance.
The first issue (as already pointed out in the comments) is the absolute selectors you are using. Try to refactor your XPath selectors and make them relative.
The next issue is related to the AngularJS page itself. Consider Protractor, the testing framework for Angular built on top of WebDriverJS; it provides additional WebDriver-like functionality for testing Angular-based websites. Put simply, your code needs extra logic that knows when Angular elements are available for interaction.
Here is how to port some of the most useful Protractor functions to Java (and Python):
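As an illustration, a commonly used Java port of Protractor's waitForAngular looks roughly like the sketch below (assuming an AngularJS 1.x application and a driver that implements JavascriptExecutor; the class and method names are made up for this example):

import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.support.ui.ExpectedCondition;
import org.openqa.selenium.support.ui.WebDriverWait;

// Sketch: block until AngularJS has no pending $http requests.
public final class AngularWait {

    public static void waitForAngular(WebDriver driver, long timeoutSeconds) {
        ExpectedCondition<Boolean> angularIsIdle = d -> {
            Object idle = ((JavascriptExecutor) d).executeScript(
                "try { return angular.element(document.body)"
              + ".injector().get('$http').pendingRequests.length === 0; }"
              + " catch (e) { return true; }");   // no AngularJS on the page - don't block
            return Boolean.TRUE.equals(idle);
        };
        new WebDriverWait(driver, timeoutSeconds).until(angularIsIdle);
    }
}

After waitForAngular(driver, 30) returns, clicking a short relative locator such as By.cssSelector("a.accordion-toggle") tends to be far more reliable than the absolute paths above.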

How to select item in a list from search Result page using Selenium in Java?

I am trying to click on this game to open its page, but every time I get a NullPointerException, whatever locator I use. I also tried selecting from a list, as the link seems to be inside an "li", but that didn't work either.
Could anyone help me with the code to click this item?
Targeted page URL:
https://staging-kw.games.getmo.com:/game/43321031
Search(testCase);
WebElement ResultList = driver.findElement(By.xpath(testData.getParam("ResultList")));
log.info("list located ..");
List<WebElement> Results = ResultList.findElements(By.tagName(testData.getParam("ResultListItems")));
for (WebElement List : Results) {
    String Link = List.getAttribute("href");
    try {
        if (Link.equals(null)) {
            log.info("null");
        }
        if (Link.equals(testData.getParam("GameLink")) && !Link.equals("") && !Link.equals(null)) {
            List.click();
        }
    } catch (NullPointerException countryIsNull) {
        log.info("LinkIsNull");
    }
}
// clickLink(GameLocator, driver);
}
I got this code to work by just adding this line after the search method:
driver.findElement(By.cssSelector("li.title")).click();
How did I get it? I used Selenium IDE to record the actions and then converted the code to Java (TestNG) to get the exact web element selector.
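If the recorded selector ever becomes flaky, one variation (just a sketch on top of the accepted line; the 10-second timeout is an arbitrary choice) is to wrap it in an explicit wait before clicking:

import org.openqa.selenium.By;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

// Wait until the result title is clickable, then click it.
WebDriverWait wait = new WebDriverWait(driver, 10);
wait.until(ExpectedConditions.elementToBeClickable(By.cssSelector("li.title"))).click();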

Cannot download full Document using HtmlUnit and Jsoup combination (using Java)

Problem Statement:
I want to crawl this page : http://www.hongkonghomes.com/en/property/rent/the_peak/middle_gap_road/10305?page_no=1&rec_per_page=12&order=rental+desc&offset=0
Let's say I want to parse the address, which is "24, Middle Gap Road, The Peak, Hong Kong".
What I did:
I first tried to load the page with Jsoup alone, but then I noticed that the page takes some time to load. So I also plugged in HtmlUnit to wait for the page to load first.
Code I wrote:
public static void parseByHtmlUnit() throws Exception {
    String url = "http://www.hongkonghomes.com/en/property/rent/the_peak/middle_gap_road/10305?page_no=1&rec_per_page=12&order=rental+desc&offset=0";
    WebClient webClient = new WebClient(BrowserVersion.FIREFOX_38);
    webClient.waitForBackgroundJavaScriptStartingBefore(30000);
    HtmlPage page = webClient.getPage(url);
    synchronized (page) {
        page.wait(30000);
    }
    try {
        Document doc = Jsoup.parse(page.asXml());
        String address = ElementsUtil.getTextOrEmpty(doc.select(".addr"));
        System.out.println("address" + address);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Expected output:
In the console, I should get this output:
address 24, Middle Gap Road, The Peak, Hong Kong
Actual output:
address
How about this?
final Document document = Jsoup.parse(
        new URL("http://www.hongkonghomes.com/en/property/rent/the_peak/middle_gap_road/10305?page_no=1&rec_per_page=12&order=rental+desc&offset=0"),
        30000);
System.out.println(document.select(".addr").text());
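Equivalently (a stylistic variation only, not part of the original answer), the same request can be made with Jsoup's fluent connect API and the same 30-second timeout:

// Same request expressed with Jsoup's fluent API.
Document document = Jsoup.connect(
        "http://www.hongkonghomes.com/en/property/rent/the_peak/middle_gap_road/10305"
      + "?page_no=1&rec_per_page=12&order=rental+desc&offset=0")
    .timeout(30000)
    .get();
System.out.println(document.select(".addr").text());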

Check if Richtext has an attachment

This is strange - or not, as rich text is a son of a b**ch in general. I want to track whether a document (or rather its rich text items) has attachments, in order to set other fields in my backend document. I created a static Java method to compute this. The method is called from the postSaveDocument event of my datasource.
This is the method:
/**
 * Set flag fields depending on whether attachments exist or not
 *
 * @param doc  the backend document to check
 * @param post whether to save the document afterwards
 */
public static void setAttachments(final Document doc, final boolean post) {
    try {
        if (doc.hasItem("audioFile")) {
            doc.replaceItemValue("audioHasFile", "1");
        } else {
            doc.removeItem("audioHasFile");
        }
        if (doc.hasItem("audioCase")) {
            doc.replaceItemValue("audioHasCase", "1");
        } else {
            doc.removeItem("audioHasCase");
        }
        if (doc.hasItem("audioTrackliste")) {
            doc.replaceItemValue("audioHasTrackliste", "1");
        } else {
            doc.removeItem("audioHasTrackliste");
        }
        if (post) {
            doc.save();
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
The problem is: every time I add an attachment to one of the rich text items on my XPage (via a File Upload control) and save the document with the simple action, the item, e.g. "audioHasFile", is set to "1". Good!
If I then reopen the document, delete the attachment (via the File Download control's trash icon) and save the document again, the backend doesn't recognize that the attachment is gone, and the item, e.g. "audioHasFile", is not removed but still holds the value "1" that was set before.
Only if I re-open the document in my XPage (from a view panel) and save it again is the field removed, as the backend now recognizes that there is no attachment.
I know what you are thinking: the lack of an attachment doesn't mean that there is no item for it - wrong! I also tried to check the type of the rich text item via getType() == 1 (Item.ATTACHMENT) - no luck.
Info: I deliver the Document parameter via currentDocument.getDocument(true) - so I am dealing with the synchronized backend document here.
To be clear: it's not a question of testing in general but a problem of timing.
Any idea how to solve this? Thank you in advance! :)
UPDATE: This is the solution that works:
/**
 * Set flag fields depending on whether attachments exist or not
 *
 * @param doc the wrapped XSP document
 */
public static void setAttachments(final DominoDocument doc) {
    try {
        doc.replaceItemValue("audioHasFile", doc.getAttachmentList("audioFile").size() > 0 ? "1" : "");
        doc.replaceItemValue("audioHasTrackliste", doc.getAttachmentList("audioTrackliste").size() > 0 ? "1" : "");
        doc.replaceItemValue("audioHasCase", doc.getAttachmentList("audioCase").size() > 0 ? "1" : "");
        // key
        doc.replaceItemValue("audioKey", doc.getItemValueString("audioTitle").toLowerCase().replaceAll("\\s+", ""));
        doc.save();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Try to wrap your Document in a NotesXspDocument:
NotesXspDocument xspDoc;
xspDoc = com.ibm.xsp.model.domino.wrapped.DominoDocument.wrap(
        doc.getParentDatabase().getFilePath(), doc, null, null, false, null);
if (xspDoc.getAttachmentList("FieldName").size() > 0) {
    // the rich text item "FieldName" contains at least one attachment
}

Controlling the list of URL(s) to be crawled at runtime

In crawler4j we can override the method boolean shouldVisit(WebURL url) and control whether a particular URL should be crawled by returning true or false.
But can we add URLs at runtime? If yes, what are the ways to do that?
Currently I can add URLs at the beginning of the program using the addSeed(String url) method, before calling start(BasicCrawler.class, numberOfCrawlers) on the CrawlController, but if I try to add a new URL with addSeed(String url) later on, it throws an error.
Any help would be appreciated, and please let me know if any more detail about the project is required to answer the question.
You can do this.
Use public void schedule(WebURL url), a method of the Frontier class, to add URLs to the crawler frontier. For this your URL needs to be of type WebURL. To build a WebURL from a string, have a look at addSeed() (code below) from the CrawlController class to see how it converts the string URL into a WebURL.
Also, use the existing Frontier instance rather than creating a new one.
Hope this helps.
public void addSeed(String pageUrl, int docId) {
    String canonicalUrl = URLCanonicalizer.getCanonicalURL(pageUrl);
    if (canonicalUrl == null) {
        logger.error("Invalid seed URL: " + pageUrl);
        return;
    }
    if (docId < 0) {
        docId = docIdServer.getDocId(canonicalUrl);
        if (docId > 0) {
            // This URL is already seen.
            return;
        }
        docId = docIdServer.getNewDocID(canonicalUrl);
    } else {
        try {
            docIdServer.addUrlAndDocId(canonicalUrl, docId);
        } catch (Exception e) {
            logger.error("Could not add seed: " + e.getMessage());
        }
    }
    WebURL webUrl = new WebURL();
    webUrl.setURL(canonicalUrl);
    webUrl.setDocid(docId);
    webUrl.setDepth((short) 0);
    if (!robotstxtServer.allows(webUrl)) {
        logger.info("Robots.txt does not allow this seed: " + pageUrl);
    } else {
        frontier.schedule(webUrl); // method that adds a URL to the frontier at run time
    }
}
Presumably you can implement this function however you like, and have it depend on a list of URLs that should not be crawled. The implementation of shouldVisit is then going to involve asking if a given URL is in your list of forbidden URLs (or permitted URLs), and returning true or false on that basis.
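As an illustration only (a minimal sketch using the single-argument shouldVisit signature mentioned in the question; the FilteringCrawler class and the blockedUrls set are assumptions, not crawler4j API):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

// Sketch: a crawler that consults a block list which can be changed at runtime.
public class FilteringCrawler extends WebCrawler {

    // URLs added to this set, even while the crawl is running, will be skipped.
    private static final Set<String> blockedUrls = ConcurrentHashMap.newKeySet();

    public static void block(String url) {
        blockedUrls.add(url);
    }

    @Override
    public boolean shouldVisit(WebURL url) {
        return !blockedUrls.contains(url.getURL());
    }
}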
