I'm writing an app for a client who doesn't have an official API but wants the app to extract video links from his website, so I wrote the extraction logic using jsoup. Everything seems to work fine, except that some of the links don't start with https, so I'm trying to add it before the URL.
Here's my code:
new Thread(() -> {
    final StringBuilder jsoupStr = new StringBuilder();
    String URL = "https://example.com" + titleString
            .replaceAll(":", "")
            .replaceAll(",", "")
            .replaceAll(" ", "-")
            .toLowerCase();
    Log.d("CALLING_URL", " " + URL);
    try {
        Document doc = Jsoup.connect(URL).get();
        Element content = doc.getElementById("list-eps");
        Elements links = content.getElementsByTag("a");
        for (Element link : links) {
            jsoupStr.append("\n").append(link.attr("player-data"));
        }
    } catch (IOException e) {
        e.getMessage();
    }
    String linksStr = jsoupStr.toString().trim();
    if (!linksStr.startsWith("https://")) {
        linksStr = "https:" + linksStr;
    }
    String[] links_array = linksStr.split("\n");
    arrayList.addAll(Arrays.asList(links_array));
}).start();
The website contains about 10 links per video, but some links start with "//" instead of "https://". This code adds the https prefix, but only for the first link it finds missing:
if (!linksStr.startsWith("https://")) {
linksStr = "https:" + linksStr;
}
You need to iterate over your final array to apply the check to all links:
String[] links_array = linksStr.split("\n");
for (int i = 0; i < links_array.length; i++) {
    if (!links_array[i].startsWith("https://")) {
        links_array[i] = "https:" + links_array[i];
    }
}
If this code is only working for the first missing link:
if (!linksStr.startsWith("https://")) {
linksStr = "https:" + linksStr;
}
then you can use a loop to check every link.
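For example, the last two lines of the question's code could apply the prefix to every split link; a minimal sketch using the question's variables, assuming Java 8 streams are available:

// prefix every protocol-relative link ("//host/...") with "https:"
String[] links_array = Arrays.stream(linksStr.split("\n"))
        .map(link -> link.startsWith("https://") ? link : "https:" + link)
        .toArray(String[]::new);
arrayList.addAll(Arrays.asList(links_array));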
I prepared the program and wrote this code with some help, but it works for the first 10 entries and then gives me NULL values.
String url = "https://uzmanpara.milliyet.com.tr/doviz-kurlari/";
//Document doc = Jsoup.parse(url);
Document doc = null;
try {
    doc = Jsoup.connect(url).timeout(6000).get();
} catch (IOException ex) {
    Logger.getLogger(den3.class.getName()).log(Level.SEVERE, null, ex);
}

int i = 0;
String[] currencyStr = new String[11];
String[] buyStr = new String[11];
String[] sellStr = new String[11];

Elements elements = doc.select(".borsaMain > div:nth-child(2) > div:nth-child(1) > table.table-markets");
for (Element element : elements) {
    Elements curreny = element.parent().select("td:nth-child(2)");
    Elements buy = element.parent().select("td:nth-child(3)");
    Elements sell = element.parent().select("td:nth-child(4)");
    System.out.println(i);
    currencyStr[i] = curreny.text();
    buyStr[i] = buy.text();
    sellStr[i] = sell.text();
    System.out.println(String.format("%s [buy=%s, sell=%s]",
            curreny.text(), buy.text(), sell.text()));
    i++;
}

for (i = 0; i < 11; i++) {
    System.out.println("currency: " + currencyStr[i]);
    System.out.println("buy: " + buyStr[i]);
    System.out.println("sell: " + sellStr[i]);
}
Here is the code. I guess it is a connection problem, but I could not solve it. I use NetBeans; do I have to change the connection properties of NetBeans, or should I add something more to the code? Can you help me?
There's nothing wrong with the connection. Your query simply doesn't match the page structure.
Somewhere on your page there's an element with class borsaMain that has a direct child with class detL, and somewhere in the descendant tree of detL is your table. You can write this as the following CSS element selector query:
.borsaMain > .detL table
There will be two tables in the result, but I suspect you are looking for the first one.
So basically, you want something like:
Element table = doc.selectFirst(".borsaMain > .detL table");
for (Element row : table.select("tr:has(td)")) {
    // your existing loop code
}
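Put together with the arrays from the question, the whole loop could look like this (a sketch: the td positions come from the question's own selectors, and the bounds check is an added guard against the fixed-size arrays):

// take the first matching table and walk its data rows
Element table = doc.selectFirst(".borsaMain > .detL table");
int i = 0;
for (Element row : table.select("tr:has(td)")) {
    if (i >= currencyStr.length) break; // guard the fixed-size arrays
    currencyStr[i] = row.select("td:nth-child(2)").text();
    buyStr[i] = row.select("td:nth-child(3)").text();
    sellStr[i] = row.select("td:nth-child(4)").text();
    i++;
}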
I'm trying to add an href to an ArrayList, and it adds nicely, but the link is broken: everything after the question mark (?) in the URL is not included in the link.
Is there anything that I'm missing? Code below:
private String processUpdate(Database dbCurrent) throws NotesException {
    int intCountSuccessful = 0;
    View vwLookup = dbCurrent.getView("DocsDistribution");
    ArrayList<String> listArray = new ArrayList<String>();

    Document doc = vwLookup.getFirstDocument();
    while (doc != null) {
        String paperDistro = doc.getItemValueString("DistroRecords");
        if (paperDistro.equals("")) {
            String ref = doc.getItemValueString("ref");
            String unid = doc.getUniversalID();
            // the link generated when adding to the ArrayList is broken
            listArray.add("<a href=\"gandhi.w3schools.com/testbox.nsf/distro.xsp?documentId=\"+ unid + \"&action=openDocument\">" + ref + "</a>");
        }
        Document tmppmDoc = vwLookup.getNextDocument(doc);
        doc.recycle();
        doc = tmppmDoc;
    }

    Collections.sort(listArray);
    String listString = "";
    for (String s : listArray) {
        listString += s + ", \t";
    }
    return listString;
}
You have a problem with the " escaping around the unid value, due to which your URL becomes gandhi.w3schools.com/testbox.nsf/distro.xsp?documentId="+ unid + "&action=openDocument.
It would be easier to read if you use String.format() and single quotes to generate the a tag:
listArray.add(String.format(
        "<a href='gandhi.w3schools.com/testbox.nsf/distro.xsp?documentId=%s&action=openDocument'>%s</a>",
        unid, ref));
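With that format the href attribute is closed correctly. For a hypothetical unid of ABC123 and ref of Doc-1, the generated entry would be <a href='gandhi.w3schools.com/testbox.nsf/distro.xsp?documentId=ABC123&action=openDocument'>Doc-1</a>; single quotes are valid HTML attribute delimiters, so no escaping is needed inside the Java string.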
I am trying to scrape links in the pagination of GitHub repositories.
I have scraped them separately, but now I want to optimize it using a loop. Any idea how I can do it? Here is the code:
String ComitUrl = "http://github.com/apple/turicreate/commits/master";
Document document2 = Jsoup.connect(ComitUrl).get();

Element pagination = document2.select("div.pagination a").get(0);
String Url1 = pagination.attr("href");
System.out.println("pagination-link1 = " + Url1);

Document document3 = Jsoup.connect(Url1).get();
Element pagination2 = document3.select("div.pagination a").get(1);
String Url2 = pagination2.attr("href");
System.out.println("pagination-link2 = " + Url2);

Document document4 = Jsoup.connect(Url2).get();
Element check = document4.select("span.disabled").first();
if (check.text().equals("Older")) {
    System.out.println("No pagination link more");
} else {
    Element pagination3 = document4.select("div.pagination a").get(1);
    String Url3 = pagination3.attr("href");
    System.out.println("pagination-link3 = " + Url3);
}
Try something like the code given below:
public static void main(String[] args) throws IOException {
    String url = "http://github.com/apple/turicreate/commits/master";
    // get the first link
    String link = Jsoup.connect(url).get().select("div.pagination a").get(0).attr("href");
    // an int just to count up links
    int i = 1;
    System.out.println("pagination-link_" + i + "\t" + link);
    // parse the next page using the link;
    // check if the div on the next page has more than one link in it
    while (Jsoup.connect(link).get().select("div.pagination a").size() > 1) {
        link = Jsoup.connect(link).get().select("div.pagination a").get(1).attr("href");
        System.out.println("pagination-link_" + (++i) + "\t" + link);
    }
}
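One thing to note: the loop above connects to each page twice, once in the while condition and once in the body. A variant that fetches each page only once might look like this (a sketch under the same page-structure assumption, i.e. the first page has a single "Older" link and later pages have "Newer" and "Older"):

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

public class PaginationWalker {
    public static void main(String[] args) throws IOException {
        String link = "http://github.com/apple/turicreate/commits/master";
        int i = 0;
        while (true) {
            // fetch the page once and keep its pagination links
            Elements pagination = Jsoup.connect(link).get().select("div.pagination a");
            // on the first page the "Older" link is at index 0, afterwards at index 1
            int older = (i == 0) ? 0 : 1;
            if (pagination.size() <= older) {
                System.out.println("No pagination link more");
                break;
            }
            link = pagination.get(older).attr("href");
            System.out.println("pagination-link_" + (++i) + "\t" + link);
        }
    }
}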
I was using a Facebook FQL query to fetch the share count for multiple URLs using the code below, without needing any access token:
private static final String fbURL = "https://graph.facebook.com/fql?q=";
private static final String queryPrefix = "SELECT url, total_count,share_count FROM link_stat WHERE url in (";
// note: the generic types below were stripped by formatting; List<String> and
// Map<String, List<String>> are assumptions based on how the arguments are used
private void callFB(List<String> validUrlList, Map<String, List<String>> dataMap,
        long timeStamp, Double calibrationFactor) {
    try {
        StringBuilder urlString = new StringBuilder();
        System.out.println("List Size " + validUrlList.size());
        for (int i = 0; i < (validUrlList.size() - 1); i++) {
            urlString.append("\"" + validUrlList.get(i) + "\",");
        }
        urlString.append("\"" + validUrlList.get(validUrlList.size() - 1) + "\"");
        String out = getConnection(fbURL + URLEncoder.encode(
                queryPrefix + urlString.toString() + ")", "utf-8"));
        dataMap = getSocialPopularity(validUrlList.toArray(), dataMap);
        getJSON(out, dataMap, timeStamp, calibrationFactor);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
But as Facebook has now deprecated it, I am planning to use:
https://graph.facebook.com/v2.5/?ids=http://timesofindia.indiatimes.com/life-style/relationships/soul-curry/An-NRI-bride-who-was-tortured-to-hell/articleshow/50012721.cms&access_token=abc
But I could not find any code to make a batch request with it, and since I am using a page access token, what would the rate limit be?
Could you please help me with making the batch request using Java for this new version?
You will always be subject to rate limiting... If you're using the /?ids= endpoint, there's already a "batch" functionality built-in.
See
https://developers.facebook.com/docs/graph-api/using-graph-api/v2.5#multirequests
https://developers.facebook.com/docs/graph-api/advanced/rate-limiting
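For completeness, a minimal sketch of one batched call to the /?ids= endpoint from the question, using plain HttpURLConnection; the class name, method name, and token value are placeholders:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.List;

public class ShareCountFetcher {

    // placeholder; use your real page access token
    private static final String ACCESS_TOKEN = "abc";

    // returns the raw JSON response, keyed by each requested URL
    static String fetchShareCounts(List<String> urls) throws IOException {
        // the ?ids= parameter takes a comma-separated list, so one call covers the batch
        String ids = URLEncoder.encode(String.join(",", urls), "utf-8");
        URL endpoint = new URL("https://graph.facebook.com/v2.5/?ids=" + ids
                + "&access_token=" + ACCESS_TOKEN);
        HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
        StringBuilder out = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "utf-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                out.append(line);
            }
        }
        return out.toString();
    }
}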
I'm using the SWT OLE API to edit a Word document in an Eclipse RCP. I read articles about how to read properties from the active document, but now I'm facing a problem with collections like sections.
I would like to retrieve only the body section of my document, but I don't know what to do with my sections object, which is an IDispatch object. I read that the item method should be used, but I don't understand how.
I found the solution, so I'll share it with you :)
Here is sample code that lists all the paragraphs of the active document in the Word editor:
// listOfParagraphs and the getId(..) helper are defined elsewhere in the class
OleAutomation active = activeDocument.getAutomation();
if (active != null) {
    // look up the dispatch id of the "Paragraphs" property
    int[] paragraphsId = getId(active, "Paragraphs");
    if (paragraphsId.length > 0) {
        Variant vParagraphs = active.getProperty(paragraphsId[0]);
        if (vParagraphs != null) {
            OleAutomation paragraphs = vParagraphs.getAutomation();
            if (paragraphs != null) {
                // read the collection's Count property
                int[] countId = getId(paragraphs, "Count");
                if (countId.length > 0) {
                    Variant count = paragraphs.getProperty(countId[0]);
                    if (count != null) {
                        int numberOfParagraphs = count.getInt();
                        // Word collections are 1-based; dispatch id 0 is the default
                        // member (Item), so invoke(0, index) calls Item(i)
                        for (int i = 1; i <= numberOfParagraphs; i++) {
                            Variant paragraph = paragraphs.invoke(0, new Variant[] { new Variant(i) });
                            if (paragraph != null) {
                                System.out.println("paragraph " + i + " added to list!");
                                listOfParagraphs.add(paragraph);
                            }
                        }
                        return listOfParagraphs;
                    }
                }
            }
        }
    }
}
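The getId helper used above isn't shown in the snippet; a minimal sketch of it, assuming it simply wraps SWT's OleAutomation.getIDsOfNames:

// resolve a member name to its dispatch id(s); returns an empty
// array when the name is unknown, so callers can check length
private static int[] getId(OleAutomation automation, String name) {
    int[] ids = automation.getIDsOfNames(new String[] { name });
    return (ids == null) ? new int[0] : ids;
}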