How to extract all the URLs from the text in android - java

I want to get all the URLs from the given text using Patterns.WEB_URL.matcher(qrText);
What I want to do:
I am scanning a QR code, and then I want to:
open the link in a WebView if the scanned text contains a link with the word "veridoc"
show the text in a TextView if the scanned text is not a link, or is a link that does not contain the word "veridoc"
What I have tried:
private void initialize() {
if (getIntent().getStringExtra(Constants.KEY_LINK) != null) {
qrText = getIntent().getStringExtra(Constants.KEY_LINK);
webMatcher = Patterns.WEB_URL.matcher(qrText);
}
if (qrText.contains("veridoc") && webMatcher.matches()) {
//if qr text is veridoc link
Log.e("veridoc link", qrText);
setupWebView(qrText, false);
} else if (webMatcher.matches()) {
//if qr text is link other than veridoc
Log.e("link", qrText);
openInBrowser(qrText);
finish();
} else if (qrText.contains("veridoc") && webMatcher.find()) {
//if qrText contains veridoc link + other text.
String url = webMatcher.group();
if (url.contains("veridoc")) {
Log.e("veridoc link found", url);
setupWebView(url, true);
} else
showQRText(qrText);
} else {
//the qrText neither is a link nor contains any link that contains word veridoc
showQRText(qrText);
}
}
}
In the above code,
setupWebView(String strUrl, boolean isTextAndUrlBoth) setup webview and load url etc.
openInBrowser(String url) opens the provided URL in the browser.
showQRText(String text) shows the provided text in textView with formatting.
The Issue
When the text contains some text and more than 1 link, String url = webMatcher.group(); always fetches the first link in the text.
What I want
I want all the links from the text, and I want to find out which of them contain the word "veridoc". After that I would like to call the method setupWebView(url, true);.
I am using following link and text for Example
name: Something
Profession: Something
link1: https://medium.com/@rkdaftary/understanding-git-for-beginners-20d4b55cc72c
link 2: https://my.veridocglobal.com/login
Can anyone help me to find all the links present in the text?

You can call find() in a loop to step through every URL in the text, and sort the matches into two ArrayLists as you go:
Matcher webMatcher = Patterns.WEB_URL.matcher(input);
ArrayList<String> veridocLinks = new ArrayList<>();
ArrayList<String> otherLinks = new ArrayList<>();
while (webMatcher.find()){
String res = webMatcher.group();
if(res!= null) {
if(res.contains("veridoc")) veridocLinks.add(res);
else otherLinks.add(res);
}
}
Given a sample input like :
String input = "http://www.veridoc.com/1 some text http://www.veridoc.com/2 some other text http://www.othersite.com/3";
Your ArrayLists will contain:
veridocLinks : "http://www.veridoc.com/1", "http://www.veridoc.com/2"
otherLinks : "http://www.othersite.com/3"
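Applied to the flow in the question, the same loop can be used to pick the first "veridoc" link and hand it to setupWebView(url, true). The following is only a minimal sketch in plain Java: it assumes a deliberately simple URL regex in place of Android's Patterns.WEB_URL so that it can run off-device, and the names LinkSorter, findLinks and firstContaining are illustrative, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkSorter {
    // Assumption: a simplified stand-in for android.util.Patterns.WEB_URL.
    // On Android, use Patterns.WEB_URL.matcher(text) instead.
    private static final Pattern SIMPLE_URL = Pattern.compile("https?://\\S+");

    /** Returns every URL-like substring of text, in order of appearance. */
    static List<String> findLinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher matcher = SIMPLE_URL.matcher(text);
        while (matcher.find()) { // each successful find() moves on to the next match
            links.add(matcher.group());
        }
        return links;
    }

    /** Returns the first link that contains keyword, or null if there is none. */
    static String firstContaining(List<String> links, String keyword) {
        for (String link : links) {
            if (link.contains(keyword)) {
                return link;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String input = "http://www.veridoc.com/1 some text http://www.veridoc.com/2 "
                + "some other text http://www.othersite.com/3";
        List<String> links = findLinks(input);
        String veridocUrl = firstContaining(links, "veridoc");
        // In the activity, this is where setupWebView(veridocUrl, true) would be called,
        // falling back to showQRText(qrText) when veridocUrl is null.
        System.out.println(links.size() + " links, veridoc: " + veridocUrl);
    }
}
```

Returning null for "no veridoc link" keeps the sketch short; an Optional<String> would work just as well if you prefer to avoid null checks.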

Related

Need help in retrieving text of a tooltip with Selenium and Java

I am trying to get the text of a tooltip with the code snippet shown below.
String xPath = "//div[@class='tooltip-inner']/div";
we = driver.findElement(By.xpath(xPath));
if (null != we) {
Actions action = new Actions(driver);
action.moveToElement(we).moveToElement(driver.findElement(By.xpath(xPath))).click().build()
.perform();
String actualText = we.getText();
} else {
....generate an error
}
The code does not throw an error, but at the same time, text is not retrieved.
I tried to locate the /p and the /ul child elements and get their texts - but no luck either.
What am I not doing right? Any ideas?
Thanks.
-S-

How to edit a Hyperlink in a Word Document using Apache POI?

So I've been browsing around the source code / documentation for POI (specifically XWPF) and I can't seem to find anything that relates to editing a hyperlink in a .docx. I only see functionality to get the information for the currently set hyperlink. My goal is to change the hyperlink in a .docx to link to "http://yahoo.com" from "http://google.com" as an example. Any help would be greatly appreciated. Thanks!
I found a way to edit the URL of the link in an "indirect way" (copy the previous hyperlink, modify the URL, delete the previous hyperlink and add the new one to the paragraph).
Code is shown below:
private void editLinksOfParagraph(XWPFParagraph paragraph, XWPFDocument document) {
for (int rIndex = 0; rIndex < paragraph.getRuns().size(); rIndex++) {
XWPFRun run = paragraph.getRuns().get(rIndex);
if (run instanceof XWPFHyperlinkRun) {
// get the url of the link to edit it
XWPFHyperlink link = ((XWPFHyperlinkRun) run).getHyperlink(document);
String linkURL = link.getURL();
//get the xml representation of the hyperlink that includes all the information
XmlObject xmlObject = run.getCTR().copy();
linkURL += "-edited-link"; //edited url of the link, f.e add a '-edited-link' suffix
//remove the previous link from the paragraph
paragraph.removeRun(rIndex);
//add the new hyperlinked with updated url in the paragraph, in place of the previous deleted
XWPFHyperlinkRun hyperlinkRun = paragraph.insertNewHyperlinkRun(rIndex, linkURL);
hyperlinkRun.getCTR().set(xmlObject);
}
}
}
This requirement needs knowledge about how hyperlinks referring to an external reference get stored in Microsoft Word documents and how this gets represented in XWPF of Apache POI.
The XWPFHyperlinkRun is the representation of a linked text run in an IRunBody. This text run, or even multiple text runs, is wrapped in an XML object of type CTHyperlink. This contains a relation ID which points to a relation in the package relations part. This package relation contains the URI which is the hyperlink's target.
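To make this indirection concrete, the sketch below resolves a hyperlink's relation ID to its target URI using only the JDK's DOM parser. The two XML strings are hand-written, heavily simplified stand-ins for word/document.xml and word/_rels/document.xml.rels (assumptions for illustration, not taken from a real file), and the class and method names are made up. This is not how Apache POI performs the lookup internally; it only shows the relationship between the two parts.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class HyperlinkRelationDemo {
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

    // Simplified stand-in for word/document.xml: one hyperlink wrapping two text runs.
    static final String DOCUMENT_XML =
        "<w:p xmlns:w='" + W_NS + "' xmlns:r='" + R_NS + "'>"
        + "<w:hyperlink r:id='rId4'>"
        + "<w:r><w:t>example</w:t></w:r><w:r><w:t>.com</w:t></w:r>"
        + "</w:hyperlink></w:p>";

    // Simplified stand-in for word/_rels/document.xml.rels.
    static final String RELS_XML =
        "<Relationships xmlns='http://schemas.openxmlformats.org/package/2006/relationships'>"
        + "<Relationship Id='rId4' Type='" + R_NS + "/hyperlink'"
        + " Target='https://example.com' TargetMode='External'/>"
        + "</Relationships>";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for the namespace-based lookups below
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Reads the relation ID (r:id) from the first w:hyperlink element. */
    static String firstHyperlinkId(Document doc) {
        Element link = (Element) doc.getElementsByTagNameNS(W_NS, "hyperlink").item(0);
        return link.getAttributeNS(R_NS, "id");
    }

    /** Returns the Target of the relationship with the given Id, or null if absent. */
    static String targetFor(Document rels, String id) {
        NodeList list = rels.getElementsByTagNameNS("*", "Relationship");
        for (int i = 0; i < list.getLength(); i++) {
            Element rel = (Element) list.item(i);
            if (id.equals(rel.getAttribute("Id"))) {
                return rel.getAttribute("Target");
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String id = firstHyperlinkId(parse(DOCUMENT_XML));
        System.out.println(id + " -> " + targetFor(parse(RELS_XML), id));
    }
}
```

Note that the URI lives only in the relations part, so changing a link's target means changing that relationship, not the runs that carry the link text.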
Currently (Apache POI 5.2.2), XWPFHyperlinkRun provides access to an XWPFHyperlink, but this is very rudimentary. It only has getters for the Id and the URI. It provides neither access to its XWPFHyperlinkRun and its IRunBody nor a setter for the target URI in the package relations part. It does not even have internal access to its package relations part.
So, using only Apache POI classes, the only possibility currently is to delete the old XWPFHyperlinkRun and create a new one pointing to the new URI. But as the text runs also contain the text formatting, deleting them also deletes the text formatting. It would have to be copied from the old XWPFHyperlinkRun to the new one before deleting the old one. That's inconvenient.
So the rudimentary XWPFHyperlink should be extended to provide a setter for the target URI in the package relations part. A new class XWPFHyperlinkExtended could look like so:
import org.apache.poi.xwpf.usermodel.*;
import org.apache.poi.openxml4j.opc.PackageRelationship;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
/**
* Extended XWPF hyperlink class
* Provides access to its Id, URI, XWPFHyperlinkRun, IRunBody.
* Provides setting target URI in PackageRelationship.
*/
public class XWPFHyperlinkExtended {
private String id;
private String uri;
private XWPFHyperlinkRun hyperlinkRun;
private IRunBody runBody;
private PackageRelationship rel;
public XWPFHyperlinkExtended(XWPFHyperlinkRun hyperlinkRun, PackageRelationship rel) {
this.id = rel.getId();
this.uri = rel.getTargetURI().toString();
this.hyperlinkRun = hyperlinkRun;
this.runBody = hyperlinkRun.getParent();
this.rel = rel;
}
public String getId() {
return this.id;
}
public String getURI() {
return this.uri;
}
public IRunBody getIRunBody() {
return this.runBody;
}
public XWPFHyperlinkRun getHyperlinkRun() {
return this.hyperlinkRun;
}
/**
* Provides setting target URI in PackageRelationship.
* The old PackageRelationship gets removed.
* A new PackageRelationship gets added using the same Id.
*/
public void setTargetURI(String uri) {
this.runBody.getPart().getPackagePart().removeRelationship(this.getId());
this.uri = uri;
PackageRelationship rel = this.runBody.getPart().getPackagePart().addExternalRelationship(uri, XWPFRelation.HYPERLINK.getRelation(), this.getId());
this.rel = rel;
}
}
It does not extend XWPFHyperlink as this is so rudimentary it's not worth it. Furthermore after setTargetURI the String uri needs to be updated. But there is no setter in XWPFHyperlink and the field is only accessible from inside the package.
The new XWPFHyperlinkExtended can be got from XWPFHyperlinkRun like so:
/**
* If this HyperlinkRun refers to an external reference hyperlink,
* return the XWPFHyperlinkExtended object for it.
* May return null if no PackageRelationship found.
*/
/*modifiers*/ XWPFHyperlinkExtended getHyperlink(XWPFHyperlinkRun hyperlinkRun) {
try {
for (org.apache.poi.openxml4j.opc.PackageRelationship rel : hyperlinkRun.getParent().getPart().getPackagePart().getRelationshipsByType(XWPFRelation.HYPERLINK.getRelation())) {
if (rel.getId().equals(hyperlinkRun.getHyperlinkId())) {
return new XWPFHyperlinkExtended(hyperlinkRun, rel);
}
}
} catch (org.apache.poi.openxml4j.exceptions.InvalidFormatException ifex) {
// do nothing, simply do not return something
}
return null;
}
Once we have that XWPFHyperlinkExtended, we can set a new target URI using its method setTargetURI.
A further problem results from the fact that the XML object of type CTHyperlink can wrap around multiple text runs, not only one. Then multiple XWPFHyperlinkRun objects are in one CTHyperlink and point to one target URI. For example, this could look like:
... [this is a link to example.com]->https://example.com ...
This results in 6 XWPFHyperlinkRuns in one CTHyperlink linking to https://example.com.
This leads to problems when link text needs to be changed when the link target changes. The text of all the 6 text runs is the link text. So which text run shall be changed?
The best I have found is a method which sets the text of the first text run in the CTHyperlink.
/**
* Sets the text of the first text run in the CTHyperlink of this XWPFHyperlinkRun.
* Tries solving the problem when a CTHyperlink contains multiple text runs.
* Then the String value is set in first text run only. All other text runs are set empty.
*/
/*modifiers*/ void setTextInFirstRun(XWPFHyperlinkRun hyperlinkRun, String value) {
org.openxmlformats.schemas.wordprocessingml.x2006.main.CTHyperlink ctHyperlink = hyperlinkRun.getCTHyperlink();
for (int r = 0; r < ctHyperlink.getRList().size(); r++) {
org.openxmlformats.schemas.wordprocessingml.x2006.main.CTR ctR = ctHyperlink.getRList().get(r);
for (int t = 0; t < ctR.getTList().size(); t++) {
org.openxmlformats.schemas.wordprocessingml.x2006.main.CTText ctText = ctR.getTList().get(t);
if (r == 0 && t == 0) {
ctText.setStringValue(value);
} else {
ctText.setStringValue("");
}
}
}
}
There the String value is set in first text run only. All other text runs are set empty. The text formatting of the first text run remains.
The following works, but it needs some more steps to carry over the text formatting correctly:
try (var fis = new FileInputStream(fileName);
var doc = new XWPFDocument(fis)) {
var pList = doc.getParagraphs();
for (var p : pList) {
var runs = p.getRuns();
for (int i = 0; i < runs.size(); i++) {
var r = runs.get(i);
if (r instanceof XWPFHyperlinkRun) {
var run = (XWPFHyperlinkRun) r;
var link = run.getHyperlink(doc);
// To get text: link for checking
System.out.println(run.getText(0) + ": " + link.getURL());
// how i replace it
var run1 = p.insertNewHyperlinkRun(i, "http://google.com");
run1.setText(run.getText(0));
// remove the old link
p.removeRun(i + 1);
}
}
}
try (var fos = new FileOutputStream(outFileName)) {
doc.write(fos);
}
}
I'm using these libraries:
implementation 'org.apache.poi:poi:5.2.2'
implementation 'org.apache.poi:poi-ooxml:5.2.2'

Jsoup .select returns empty value but element does contains text

I'm trying to get the text of "link" tag element in this xml: http://www.istana.gov.sg/latestupdate/rss.xml
I have coded to get the first article.
URL = getResources().getString(R.string.istana_home_page_rss_xml);
// URL = "http://www.istana.gov.sg/latestupdate/rss.xml";
try {
doc = Jsoup.connect(URL).ignoreContentType(true).get();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
// retrieve the link of the article
links = doc.select("link");
// retrieve the publish date of the article
dates = doc.select("pubDate");
//retrieve the title of the article
titles = doc.select("title");
String[] article1 = new String[3];
article1[0] = links.get(1).text();
article1[1] = titles.get(1).text();
article1[2] = dates.get(0).text();
The article comes out nicely, but the link returns an empty string (every link element returns ""). The titles and dates have no problems. The link tag contains the URL as text. Does anyone know why it returns ""?
It looks like the default HTML parser does not treat <link> as a tag that can have content and automatically closes it as <link />, which means that the content of this tag is empty.
To solve this problem, use the XML parser instead of the HTML parser; it doesn't care that much about tag names.
doc = Jsoup.connect(URL)
.ignoreContentType(true)
.parser(Parser.xmlParser()) // <-- add this
.get();
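The difference is easy to reproduce with any XML-aware parser. Purely as an illustration, the sketch below uses the JDK's built-in DOM parser rather than jsoup, on a hand-written RSS fragment (the feed content and the class name RssLinkReader are made up for the example). In HTML, link is an empty element, whereas XML has no such rule, so the text inside <link> comes back intact.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RssLinkReader {
    /** Returns the text of every <link> element in the given XML, in document order. */
    static List<String> linkTexts(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("link");
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent().trim());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse feed", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample feed; not the real content of the RSS URL in the question.
        String rss = "<rss><channel>"
                + "<title>Feed</title><link>http://example.com/</link>"
                + "<item><title>First article</title>"
                + "<link>http://example.com/articles/1</link></item>"
                + "</channel></rss>";
        // Index 0 is the channel's own link and index 1 the first item's link,
        // which matches the links.get(1) used in the question.
        System.out.println(linkTexts(rss).get(1));
    }
}
```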

Extracting user details from facebook page

I am extracting details from a page which I'm administering. I tried using jsoup to extract the links then from that extract names of users but it's not working. It only shows links other than user links. I tried extracting names from this link
https://www.facebook.com/plugins/fan.php?connections=100&id=pageid
which works quite well, but the same approach does not work for this link
https://www.facebook.com/browse/?type=page_fans&page_id=
Can anyone help me? Below is the code I tried.
doc = Jsoup.connect("https://www.facebook.com/browse/?type=page_fans&page_id=mypageid").get();
Elements els = doc.getElementsByClass("fsl fwb fcb");
Elements link = doc.select("a[href]");
for (Element ele : link) {
System.out.println(ele.attr("href"));
}
Try this:
Document doc = Jsoup.connect("https://www.facebook.com/plugins/fan.php?connections=100&id=pageid").timeout(0).get();
Elements nameLinks = doc.getElementsByClass("link");
for (Element users : nameLinks) {
String name = users.attr("title");
String url = users.attr("href");
System.out.println(name + "-" + url);
}
It will print the names and URLs of all the users present on the first link given in your question.

jsoup how to reach dropdownlist

Hello everybody, I want to get the data from
http://sansoyunlari.hurriyet.com.tr/SayisalLoto/SayisalLotoSonuclari.aspx using jsoup. I can get the results, but only the latest ones. There is a dropdown list of dates on the website; how can I reach the other dates? By the way, I will move this code to Android later; for now it is written in NetBeans. I will put a dropdown list in my Android program that gets its data, and also the results, from this address.
This is the Java code I have written so far:
public static void main(String[] args) {
String adres = "http://sansoyunlari.hurriyet.com.tr/SayisalLoto/SayisalLotoSonuclari.aspx";
ArrayList<String> sayi = new ArrayList<>();
sayi.add("six");
sayi.add("five");
sayi.add("four");
sayi.add("three");
sayi.add("two");
sayi.add("one");
//Sayısal Loto
try {
Document doc = Jsoup.connect(adres).get();
Elements sonuclar = doc.select("div.hurriyet2010_so_sanstopu_no_bg");
// the first number has to be fetched manually in approach 1, because its id is spelled "resut"
Elements sonuclar1 = doc.select("span#_ctl0_ContentPlaceHolder1_lblresut"+sayi.get(sayi.size()-1));
Element numaralar = sonuclar1.first();
System.out.println(numaralar.text());
// approach 1 for getting the numbers
for (int i = sonuclar.size();i>1;i--)
{
sonuclar1 = doc.select("span#_ctl0_ContentPlaceHolder1_lblresult"+sayi.get(i-2));
Element numaralar1 = sonuclar1.first();
System.out.println(numaralar1.text());
}
// approach 2 for getting the numbers
// for(Element el : sonuclar)
// {
// System.out.println(el.text());
// }
// for the number of winners and the prize amount
for(int i = 0;i<4;i++)
{
int b = 6 -i;
System.out.println(b + " bilen kişi sayısı :");
sonuclar = doc.select("span#_ctl0_ContentPlaceHolder1_lblluckycount"+sayi.get(i));
Element el = sonuclar.first();
System.out.println(el.text());
System.out.println("Kişi başına düşen ikramiye :");
sonuclar = doc.select("span#_ctl0_ContentPlaceHolder1_lblluckyamount"+sayi.get(i));
el = sonuclar.first();
System.out.println(el.text());
}
}
catch(Exception e){
e.printStackTrace(); // report failures instead of swallowing them silently
}
}
To get the select item you should do:
Element select = doc.select("#_ctl0_ContentPlaceHolder1_ddlSayisalLotoDates").first();
Now the children of this element are the "option" items you want:
for (Element e : select.children()) {
String date = e.text();
}
edit
I looked at the HTML source. In order to get the right page you need to send a POST request to the URL "http://sansoyunlari.hurriyet.com.tr/SayisalLoto/SayisalLotoSonuclari.aspx" with the following params:
__EVENTARGUMENT = empty
__EVENTTARGET = _ctl0$ContentPlaceHolder1$ddlSayisalLotoDates
__EVENTVALIDATION = a random value that you get from the html page
__LASTFOCUS = empty
__VIEWSTATE = another random value
_ctl0:ContentPlaceHolder1:ddlSayisalLotoDates = The ID of the date you want to search (i.e. 884 for 19 Ekim 2013)
txtSearch = can be empty
As you can see, scraping an ASP.NET web page is quite annoying.
Use an application like Fiddler (or another one) to find the params you need to post (hidden inputs, session cookies, your selected input). Probably you're missing some of them.
Hope it helps.
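As a rough sketch of how such a POST body could be assembled in plain Java, the snippet below URL-encodes the fields listed above into an application/x-www-form-urlencoded string. The class and method names are illustrative only, and the __VIEWSTATE and __EVENTVALIDATION values are placeholders: the real ones have to be read from the hidden inputs of a freshly fetched page before every request.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class PostbackFormBuilder {
    /** Encodes the fields as key=value pairs joined by '&', keeping insertion order. */
    static String encodeForm(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            // URLEncoder.encode(String, Charset) requires Java 10 or newer.
            body.add(URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    /** Builds the field set for selecting one date in the dropdown. */
    static Map<String, String> postbackFields(String viewState, String eventValidation, String dateId) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("__EVENTTARGET", "_ctl0$ContentPlaceHolder1$ddlSayisalLotoDates");
        fields.put("__EVENTARGUMENT", "");
        fields.put("__LASTFOCUS", "");
        fields.put("__VIEWSTATE", viewState);             // taken from the hidden input
        fields.put("__EVENTVALIDATION", eventValidation); // taken from the hidden input
        fields.put("_ctl0:ContentPlaceHolder1:ddlSayisalLotoDates", dateId);
        fields.put("txtSearch", "");
        return fields;
    }

    public static void main(String[] args) {
        // Placeholder hidden-field values; 884 is the date ID from the example above.
        String body = encodeForm(postbackFields("VIEWSTATE_FROM_PAGE", "VALIDATION_FROM_PAGE", "884"));
        System.out.println(body);
    }
}
```

With jsoup, the same key/value pairs could instead be passed through Connection.data(key, value) before calling post(), which takes care of the encoding for you.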
