Scraping multiple pages with jsoup

I am trying to scrape the pagination links from a GitHub repository's commit history.
I have scraped them one page at a time, but now I want to do the same thing in a loop. Any idea how I can do it? Here is the code:
String ComitUrl = "http://github.com/apple/turicreate/commits/master";
Document document2 = Jsoup.connect(ComitUrl).get();
Element pagination = document2.select("div.pagination a").get(0);
String Url1 = pagination.attr("href");
System.out.println("pagination-link1 = " + Url1);
Document document3 = Jsoup.connect(Url1).get();
Element pagination2 = document3.select("div.pagination a").get(1);
String Url2 = pagination2.attr("href");
System.out.println("pagination-link2 = " + Url2);
Document document4 = Jsoup.connect(Url2).get();
// first() returns null when the page has no disabled pagination button
Element check = document4.select("span.disabled").first();
if (check != null && check.text().equals("Older")) {
    System.out.println("No more pagination links");
} else {
    Element pagination3 = document4.select("div.pagination a").get(1);
    String Url3 = pagination3.attr("href");
    System.out.println("pagination-link3 = " + Url3);
}

Try something like the following:
// Needs: java.io.IOException, org.jsoup.Jsoup, org.jsoup.select.Elements
public static void main(String[] args) throws IOException {
    String url = "http://github.com/apple/turicreate/commits/master";
    // get the first link
    String link = Jsoup.connect(url).get().select("div.pagination a").get(0).attr("href");
    // an int just to count up the links
    int i = 1;
    System.out.println("pagination-link_" + i + "\t" + link);
    // fetch the next page once and keep its pagination links
    Elements anchors = Jsoup.connect(link).get().select("div.pagination a");
    // continue while the pagination div has more than one link in it
    while (anchors.size() > 1) {
        link = anchors.get(1).attr("href");
        System.out.println("pagination-link_" + (++i) + "\t" + link);
        anchors = Jsoup.connect(link).get().select("div.pagination a");
    }
}
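
The follow-the-next-link pattern can also be separated from the HTTP code, which makes it easy to test offline and to guard against an endless chain of pages. The sketch below is only an illustration under assumptions: `PaginationWalker`, `collect` and the `nextLink` callback are hypothetical names, and the callback stands in for whatever fetches a page and returns its "Older" link (or null when there is none).

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class PaginationWalker {
    // Follows links starting at firstLink until nextLink returns null,
    // a link repeats, or maxPages links have been collected.
    static List<String> collect(String firstLink, Function<String, String> nextLink, int maxPages) {
        Set<String> seen = new LinkedHashSet<>();
        String link = firstLink;
        // seen.add() returns false for a repeated link, which ends the loop
        while (link != null && seen.size() < maxPages && seen.add(link)) {
            link = nextLink.apply(link);
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for real pages: each key maps to its "Older" link.
        Map<String, String> older = Map.of("page1", "page2", "page2", "page3");
        System.out.println(collect("page1", older::get, 100)); // [page1, page2, page3]
    }
}
```

With jsoup, the callback would wrap the request from the answer: connect to the link, select "div.pagination a", and return the href of the last anchor, or null when only one anchor is left.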

Related

Add https to missing strings of an array?

I'm writing an app for a client who doesn't have an official API but wants the app to extract video links from his website, so I wrote the extraction logic with jsoup. Everything seems to work fine, except that some of the links don't start with https, so I'm trying to add it in front of the URL.
Here's my code:
new Thread(() -> {
    final StringBuilder jsoupStr = new StringBuilder();
    String URL = "https://example.com" + titleString
            .replaceAll(":", "")
            .replaceAll(",", "")
            .replaceAll(" ", "-")
            .toLowerCase();
    Log.d("CALLING_URL", " " + URL);
    try {
        Document doc = Jsoup.connect(URL).get();
        Element content = doc.getElementById("list-eps");
        Elements links = content.getElementsByTag("a");
        for (Element link : links) {
            jsoupStr.append("\n").append(link.attr("player-data"));
        }
    } catch (IOException e) {
        e.getMessage();
    }
    String linksStr = jsoupStr.toString().trim();
    if (!linksStr.startsWith("https://")) {
        linksStr = "https:" + linksStr;
    }
    String[] links_array = linksStr.split("\n");
    arrayList.addAll(Arrays.asList(links_array));
}).start();
The website has about 10 links per video, but some of them start with "//" instead of "https://".
This code adds the https prefix, but only to the first link that is missing it:
if (!linksStr.startsWith("https://")) {
    linksStr = "https:" + linksStr;
}

You need to iterate over your final array so that the check is applied to every link, not just to the start of the joined string.
String[] links_array = linksStr.split("\n");
for (int i = 0; i < links_array.length; i++) {
    if (!links_array[i].startsWith("https://")) {
        links_array[i] = "https:" + links_array[i];
    }
}

If this code only works for the first missing link:
if (!linksStr.startsWith("https://")) {
    linksStr = "https:" + linksStr;
}
then I believe you can use a loop to check every link.
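
As a rough sketch of that loop, the check can live in a small helper that is applied to each link after splitting. `SchemeFixer`, `withHttps` and `fixAll` are made-up names for illustration, and the sketch assumes the only broken links are the protocol-relative ones that start with "//", as described in the question.

```java
import java.util.ArrayList;
import java.util.List;

class SchemeFixer {
    // Prefix protocol-relative links ("//host/path") with "https:"; leave other links untouched.
    static String withHttps(String link) {
        return link.startsWith("//") ? "https:" + link : link;
    }

    // Split the newline-joined string and fix each link individually, skipping blank lines.
    static List<String> fixAll(String joined) {
        List<String> fixed = new ArrayList<>();
        for (String line : joined.split("\n")) {
            String link = line.trim();
            if (!link.isEmpty()) {
                fixed.add(withHttps(link));
            }
        }
        return fixed;
    }

    public static void main(String[] args) {
        System.out.println(fixAll("//cdn.example.com/a.mp4\nhttps://example.com/b.mp4"));
        // [https://cdn.example.com/a.mp4, https://example.com/b.mp4]
    }
}
```

Testing for "//" rather than for a missing "https://" avoids turning a link that already starts with "http://" into "https:http://...".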

How to add more data in java bson document with looping

I want to run this code in a loop so that the data from every iteration is saved into the document variable. How can I add more data to the document? I run into a problem as soon as the loop runs more than once. Can you give me an idea of how to do it? Thanks in advance.
private Document getProcessInstances(String status, int page, int size, String sort){
    StringBuilder url = new StringBuilder();
    Document processinstancelist = null;
    Integer totalItems = this.getTotalItems(status, page, size, sort);
    Integer totalPages = totalItems/size;
    try{
        while (page<=totalPages){
            url.append(activitiqueryhost).append("/v1/process-instances?status=").append(status).append("&page=").append(page).append("&size=").append(size).append("&sort=").append(sort);
            // System.out.println(" >>>>>>>>>> URL="+url.toString());
            ResponseEntity<String> processinstancestring = this.get(url.toString());
            // System.out.println("processinstancestring="+processinstancestring.getBody());
            Document processinstance = Document.parse(processinstancestring.getBody());
            // System.out.println(">>>>> processinstance=" + processinstance.toJson());
            // Document
            processinstancelist = (Document) processinstance.get("list");
            // System.out.println(">>>>> list=" + processinstancelist.toJson());
            System.out.println("==== datanya "+totalItems);
            System.out.println("==== total page "+totalPages);
            System.out.println("==== datanya "+page);
            page++;
        }
        return processinstancelist;
    }
    catch(Exception e){
        return null;
    }
}
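
Two things in the loop above look like probable causes, although this is a reading of the posted code and not a confirmed diagnosis: url is a single StringBuilder that keeps growing on every pass, so the second request goes to two URLs glued together, and processinstancelist is overwritten on each pass instead of being added to. A minimal sketch of the build-a-fresh-URL-and-accumulate pattern follows. `PageCollector`, `pageUrl`, `pageCount` and the `fetchPage` callback are hypothetical names, page numbers are assumed to be zero-based, and plain string lists stand in for the BSON documents so the example runs without a server.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class PageCollector {
    // Build a fresh URL for every page instead of appending to one shared StringBuilder.
    static String pageUrl(String host, String status, int page, int size, String sort) {
        return host + "/v1/process-instances?status=" + status
                + "&page=" + page + "&size=" + size + "&sort=" + sort;
    }

    // Pages needed for totalItems, rounded up so a partial last page is not dropped.
    static int pageCount(int totalItems, int size) {
        return (totalItems + size - 1) / size;
    }

    // Fetch every page and append its items to one list instead of overwriting a variable.
    static List<String> collect(String host, String status, int size, String sort,
                                int totalItems, Function<String, List<String>> fetchPage) {
        List<String> all = new ArrayList<>();
        for (int page = 0; page < pageCount(totalItems, size); page++) {
            all.addAll(fetchPage.apply(pageUrl(host, status, page, size, sort)));
        }
        return all;
    }

    public static void main(String[] args) {
        // Fake fetcher that just echoes the URL it was asked for.
        List<String> urls = collect("http://localhost", "RUNNING", 2, "id", 5, u -> List.of(u));
        urls.forEach(System.out::println);
    }
}
```

In the original method the same idea would mean calling url.setLength(0) at the top of each pass (or creating the StringBuilder inside the loop) and adding each page's "list" content to a collection that is returned at the end.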

Xpages - Passing <ahref> in Arraylist

I'm trying to add an a href link to an ArrayList. It is added to the ArrayList without any problem, but the link is broken: everything after the question mark (?) in the URL is missing from the link.
Is there anything I'm missing? Code below:
private String processUpdate(Database dbCurrent) throws NotesException {
    int intCountSuccessful = 0;
    View vwLookup = dbCurrent.getView("DocsDistribution");
    ArrayList<String> listArray = new ArrayList<String>();
    Document doc = vwLookup.getFirstDocument();
    while (doc != null) {
        String paperDistro = doc.getItemValueString("DistroRecords");
        if (paperDistro.equals("")) {
            String ref = doc.getItemValueString("ref");
            String unid = doc.getUniversalID();
            // the link generated when adding to Arraylist is broken
            listArray.add("" + ref + "");
        }
        Document tmppmDoc = vwLookup.getNextDocument(doc);
        doc.recycle();
        doc = tmppmDoc;
    }
    Collections.sort(listArray);
    String listString = "";
    for (String s : listArray) {
        listString += s + ", \t";
    }
    return listString;
}

The double quotes around the unid value are not escaped, so your URL becomes gandhi.w3schools.com/testbox.nsf/distro.xsp?documentId="+ unid + "&action=openDocument.
It is easier to read if you use String.format() and single quotes to build the a tag:
listArray.add(String.format(
        "<a href='gandhi.w3schools.com/testbox.nsf/distro.xsp?documentId=%s&action=openDocument'>%s</a>",
        unid, ref));

List of elements keeps using the first element after I nest findelement

As you can see below, I get all the rows in a list. Then, while iterating through that list, I pull web elements out of each row. However, whichever row I am on, I keep getting the web elements of the first row in the list.
System.out.println(row.getText()); //Prints correct values
System.out.println(actualFirstName.getText() + " " + actualLastName.getText()); //Prints incorrect values
The code:
private WebElement getElementRow(WebDriver driver, String expectedFirstName, String expectedLastName) throws Exception {
    List<WebElement> allRows = getAllRows(driver);
    WebElement actualFirstName;
    WebElement actualLastName;
    for (int i = 0; i < allRows.size(); i++) {
        WebElement row = allRows.get(i);
        System.out.println(row.getText());
        actualFirstName = row.findElement(firstNameLocator);
        actualLastName = row.findElement(lastNameLocator);
        System.out.println(actualFirstName.getText() + " " + actualLastName.getText());
        if (actualFirstName.getText().equals(expectedFirstName) && actualLastName.getText().equals(expectedLastName)) {
            return row;
        }
    }
    throw new Exception(expectedFirstName + " " + expectedLastName + " row not found in the list");
}
Any help would be greatly appreciated.

After reviewing my By locators further, I realized that my XPath was always referencing the first matching element in the document.
I had
private By brokerFirstNameLocator = By.xpath("//div[contains(@id,'first-name')]");
private By brokerLastNameLocator = By.xpath("//div[contains(@id,'last-name')]");
where I should have had
private By firstNameLocator = By.xpath("./div[contains(@id,'first-name')]");
private By lastNameLocator = By.xpath("./div[contains(@id,'last-name')]");
TBH I was surprised that my initial way didn't work, but it is a relative vs. absolute path issue: an expression that starts with // searches from the document root even when it is evaluated on a row element, while one that starts with ./ searches from that row.
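
The same relative vs. absolute behaviour can be reproduced without a browser. The sketch below uses the XPath engine that ships with the JDK (javax.xml.xpath) as a stand-in for Selenium, so it illustrates the XPath rule and is not Selenium code; `RelativeXPathDemo`, `fromSecondRow` and the sample markup are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class RelativeXPathDemo {
    static final String XML =
            "<rows>"
          + "<row><div id='first-name-1'>Ann</div></row>"
          + "<row><div id='first-name-2'>Bob</div></row>"
          + "</rows>";

    // Evaluates expr with the second <row> as context node and returns the text of the first match.
    static String fromSecondRow(String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node secondRow = (Node) xpath.evaluate("/rows/row[2]", doc, XPathConstants.NODE);
            return xpath.evaluate(expr, secondRow);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "//" starts at the document root, so the first row's div is found.
        System.out.println(fromSecondRow("//div[contains(@id,'first-name')]")); // Ann
        // "./" starts at the context node, so the second row's own div is found.
        System.out.println(fromSecondRow("./div[contains(@id,'first-name')]")); // Bob
    }
}
```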

How to generate XPath query matching a specific element in Jsoup?

Hi, this is my web page:
<html>
<head>
</head>
<body>
<div> text div 1</div>
<div>
<span>text of first span </span>
<span>text of second span </span>
</div>
<div> text div 3 </div>
</body>
</html>
I'm using jsoup to parse it, then I walk through all the elements in the page and build their paths:
Document doc = Jsoup.parse(new File("C:\\Users\\HC\\Desktop\\dataset\\index.html"), "UTF-8");
Elements elements = doc.body().select("*");
ArrayList<String> all = new ArrayList<>();
for (Element element : elements) {
    if (!element.ownText().isEmpty()) {
        StringBuilder path = new StringBuilder(element.nodeName());
        String value = element.ownText();
        Elements p_el = element.parents();
        for (Element el : p_el) {
            path.insert(0, el.nodeName() + '/');
        }
        all.add(path + " = " + value + "\n");
        System.out.println(path + " = " + value);
    }
}
return all;
My code gives me this result:
html/body/div = text div 1
html/body/div/span = text of first span
html/body/div/span = text of second span
html/body/div = text div 3
In fact, I want to get a result like this:
html/body/div[1] = text div 1
html/body/div[2]/span[1] = text of first span
html/body/div[2]/span[2] = text of second span
html/body/div[3] = text div 3
Could anyone please give me an idea of how to reach this result? Thanks in advance.

As asked, here is an idea.
I'm quite sure there are better ways to get the XPath for a given node, for example using XSLT as in the answer to "Generate/get xpath from XML node java".
Here is a possible solution based on your current attempt.
For each (parent) element, check whether there is more than one element with this name.
Pseudo code: if (count(el.select('../' + el.nodeName())) > 1)
If true, count the preceding siblings with the same name and add 1:
count(el.select('preceding-sibling::' + el.nodeName())) + 1
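
A runnable sketch of that pseudo code is shown below. To keep it free of external libraries it uses the DOM API bundled with the JDK (org.w3c.dom) in place of jsoup, which is an assumption of this example; jsoup elements offer comparable parent and sibling navigation. `IndexedPath`, `sameTagSiblings`, `pathOf`, `ownText` and `describe` are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class IndexedPath {
    // Counts element siblings with the same tag name, either before or after el.
    static int sameTagSiblings(Element el, boolean before) {
        int count = 0;
        Node n = before ? el.getPreviousSibling() : el.getNextSibling();
        while (n != null) {
            if (n instanceof Element && n.getNodeName().equals(el.getNodeName())) count++;
            n = before ? n.getPreviousSibling() : n.getNextSibling();
        }
        return count;
    }

    // Builds a path such as html/body/div[2]/span[1]; an index is added
    // only when the element has at least one sibling with the same tag.
    static String pathOf(Element el) {
        StringBuilder path = new StringBuilder();
        for (Node n = el; n instanceof Element; n = n.getParentNode()) {
            Element e = (Element) n;
            int before = sameTagSiblings(e, true);
            int after = sameTagSiblings(e, false);
            String step = e.getNodeName();
            if (before + after > 0) step += "[" + (before + 1) + "]";
            path.insert(0, path.length() == 0 ? step : step + "/");
        }
        return path.toString();
    }

    // Text held directly by the element (text-node children only), trimmed.
    static String ownText(Element el) {
        StringBuilder sb = new StringBuilder();
        for (Node c = el.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) sb.append(c.getNodeValue());
        }
        return sb.toString().trim();
    }

    // Returns one "path = text" line for every element that has text of its own.
    static String describe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList all = doc.getElementsByTagName("*");
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < all.getLength(); i++) {
                Element el = (Element) all.item(i);
                String text = ownText(el);
                if (!text.isEmpty()) {
                    if (out.length() > 0) out.append('\n');
                    out.append(pathOf(el)).append(" = ").append(text);
                }
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("<html><head></head><body><div> text div 1</div>"
                + "<div><span>text of first span </span><span>text of second span </span></div>"
                + "<div> text div 3 </div></body></html>"));
    }
}
```

Run on the page from the question, this prints the four lines the asker was hoping for, from html/body/div[1] through html/body/div[3].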

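This is my solution to this problem:
StringBuilder absPath = new StringBuilder();
// parents() lists ancestors from the nearest one up to the root, so walk it backwards
Elements parents = htmlElement.parents();
for (int j = parents.size() - 1; j >= 0; j--) {
    Element element = parents.get(j);
    // XPath positions are 1-based and count only siblings with the same tag,
    // whereas siblingIndex() is 0-based and also counts text nodes
    int position = 1;
    for (Element s = element.previousElementSibling(); s != null; s = s.previousElementSibling()) {
        if (s.tagName().equals(element.tagName())) {
            position++;
        }
    }
    absPath.append("/");
    absPath.append(element.tagName());
    absPath.append("[");
    absPath.append(position);
    absPath.append("]");
}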

This would be easier if you traversed the document from the root to the leaves instead of the other way round. That way you can easily group the elements by tag name and handle multiple occurrences accordingly. Here is a recursive approach:
private final List<String> path = new ArrayList<>();
private final List<String> all = new ArrayList<>();

public List<String> getAll() {
    return Collections.unmodifiableList(all);
}

public void parse(Document doc) {
    path.clear();
    all.clear();
    parse(doc.children());
}

private void parse(List<Element> elements) {
    if (elements.isEmpty()) {
        return;
    }
    // LinkedHashMap keeps the tag groups in the order they first appear in the document
    Map<String, List<Element>> grouped = elements.stream()
            .collect(Collectors.groupingBy(Element::tagName, LinkedHashMap::new, Collectors.toList()));
    for (Map.Entry<String, List<Element>> entry : grouped.entrySet()) {
        List<Element> list = entry.getValue();
        String key = entry.getKey();
        if (list.size() > 1) {
            int index = 1;
            // use paths with index
            key += "[";
            for (Element e : list) {
                path.add(key + (index++) + "]");
                handleElement(e);
                path.remove(path.size() - 1);
            }
        } else {
            // use paths without index
            path.add(key);
            handleElement(list.get(0));
            path.remove(path.size() - 1);
        }
    }
}

private void handleElement(Element e) {
    String value = e.ownText();
    if (!value.isEmpty()) {
        // add entry
        all.add(path.stream().collect(Collectors.joining("/")) + " = " + value);
    }
    // process children of element
    parse(e.children());
}

Here is the solution in Kotlin. It's correct, and it works. The other answers are wrong and caused me hours of lost work.
fun Element.xpath(): String = buildString {
    val parents = parents()
    for (j in (parents.size - 1) downTo 0) {
        val parent = parents[j]
        append("/*[")
        // elementSiblingIndex() counts element siblings only; siblingIndex() would also count text nodes
        append(parent.elementSiblingIndex() + 1)
        append(']')
    }
    append("/*[")
    append(elementSiblingIndex() + 1)
    append(']')
}
