Getting data from the web using Android? - java

When using Eclipse for Java, I'm able to access data from websites and fill out online forms using Selenium. All I have to do is call WebDriver driver = new HtmlUnitDriver();, driver.get("www.google.com");, and driver.findElement(). To accomplish this, I go into the Java Build Path, open Libraries, and add the external JAR file: selenium-server-standalone-2.39.0.jar.
I'd like to do the same for Android but am having difficulty. I understand there was something called Selenium for Android, but it's no longer supported. Now there's Selendroid. While its code looks vaguely similar to what I write in Eclipse for Java (i.e., SelendroidCapabilities capa = new SelendroidCapabilities("io.selendroid.testapp:0.12.0");, WebDriver driver = new SelendroidDriver(capa);, WebElement inputField = driver.findElement(By.id("my_text_field"));), I don't think it is actually what I am looking for. I even tried to add selendroid-standalone-0.12.0-with-dependencies.jar to the Android library, and all I got back was this error in the console:
Dx warning: Ignoring InnerClasses attribute for an anonymous inner class
(org.apache.xalan.lib.sql.SecuritySupport12$8) that doesn't come with an
associated EnclosingMethod attribute. This class was probably produced by a
compiler that did not target the modern .class file format. The recommended
solution is to recompile the class from source, using an up-to-date compiler
and without specifying any "-target" type options. The consequence of ignoring
this warning is that reflective operations on this class will incorrectly
indicate that it is *not* an inner class.
So my question is: where can I go to learn about using Android to go to a web page and retrieve some data (without actually opening the web page on screen; this is strictly background stuff)? Or, what are the steps to getting data from a website via Android using identifiers such as id, name, XPath, etc.?

Use Jsoup for this. I think that's what you're looking for.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.
Download the JAR and include it in your project.
Simple example:
Document doc = Jsoup.connect("http://example.com/").get();
String title = doc.title();
Read the API docs for more info.
Also make sure to put network calls in an AsyncTask and not on the main UI thread.
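For example, a minimal sketch of wrapping the fetch in an AsyncTask (the class and method names here are placeholders, not part of jsoup):
// Needs: import android.os.AsyncTask; import org.jsoup.Jsoup;
// import org.jsoup.nodes.Document; import java.io.IOException;
private static class FetchTitleTask extends AsyncTask<String, Void, String> {
    @Override
    protected String doInBackground(String... urls) {
        try {
            // runs off the UI thread, so the network call is safe here
            Document doc = Jsoup.connect(urls[0]).get();
            return doc.title();
        } catch (IOException e) {
            return null; // log/handle as appropriate
        }
    }

    @Override
    protected void onPostExecute(String title) {
        // back on the UI thread; update your views here
    }
}

// Usage, e.g. from onCreate():
// new FetchTitleTask().execute("http://example.com/");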

I eventually found something that is exactly what I wanted: HtmlCleaner. There's a good guide here.
Download the JAR file here and include it in the project's library.
Then use the following code to get your element via its XPath:
public class Main extends Activity {
    // HTML page
    static final String URL = "https://www.yourpage.com/";
    // XPath query
    static final String XPATH = "//some/path/here";

    @Override
    public void onCreate(Bundle savedInstanceState) {
        // init view layout
        super.onCreate(savedInstanceState);
        setContentView(R.layout.main);
        // decide output (note: on a real device run this in an AsyncTask,
        // or Android will throw NetworkOnMainThreadException)
        String value = getData();
    }

    public String getData() {
        String data = "";
        try {
            // config cleaner properties
            HtmlCleaner htmlCleaner = new HtmlCleaner();
            CleanerProperties props = htmlCleaner.getProperties();
            props.setAllowHtmlInsideAttributes(false);
            props.setAllowMultiWordAttributes(true);
            props.setRecognizeUnicodeChars(true);
            props.setOmitComments(true);
            // create URL object
            URL url = new URL(URL);
            // get HTML page root node
            TagNode root = htmlCleaner.clean(url);
            // query XPath
            Object[] statsNode = root.evaluateXPath(XPATH);
            // process data if any node was found
            if (statsNode.length > 0) {
                // I already know there's only one node, so pick index 0.
                TagNode resultNode = (TagNode) statsNode[0];
                // get text data from HTML node
                data = resultNode.getText().toString();
            }
        } catch (IOException | XPatherException e) {
            // network or XPath failure; fall through to the default value
            e.printStackTrace();
        }
        // return value
        return data;
    }
}

Related

Why is my Jsoup Code not Returning the Correct Elements?

I am working on an app in Android Studio and am having some trouble web scraping with Jsoup. I have successfully connected to the webpage and returned some basic elements to test the library, but now I cannot actually get the elements I need for my app.
I am trying to get a number of elements with the "data-at" attribute. The weird thing is, a few elements with the "data-at" attribute are returned, but not the ones I am looking for. For whatever reason, my code is not extracting all of the elements that share the "data-at" attribute on the web page.
This is the URL of the webpage I am scraping:
https://express.liatoyotaofcolonie.com/inventory?f=dealer.name%3ALia%20Toyota%20of%20Colonie&f=submodel%3ACamry&f=trim%3ALE&f=year%3A2020
The method containing the web-scraping code:
@Override
protected String doInBackground(Void... params) {
    String title = "";
    Document doc;
    Log.d(TAG, queryString.toString());
    try {
        doc = Jsoup.connect(queryString.toString()).get();
        Elements content = doc.select("[data-at]");
        for (Element e : content) {
            Log.d(TAG, e.text());
        }
    } catch (IOException e) {
        Log.e(TAG, e.toString());
    }
    return title;
}
[Screenshots in the original post: the Logcat results, the element I want to retrieve, and one of the elements that is actually being retrieved.]
This is because some of the content, including the element you are looking for, is created asynchronously and is not present in the initial DOM (JavaScript ;))
When you view the source of the page you will notice there are only 17 data-at occurrences, while running document.querySelectorAll("[data-at]") in the browser returns 29 nodes.
What you are able to get with Jsoup is the static content of the page (the initial DOM). You won't be able to fetch dynamically created content, as you do not run the required JS scripts.
To overcome this, you will have to either fetch and parse the required resources manually (e.g., trace which AJAX calls the browser makes) or use a headless browser setup. Selenium + headless Chrome should be enough; see the sketch below.
The latter option will allow you to scrape ANY possible web application, including SPA apps, which is not possible using plain Jsoup.
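A minimal sketch of the headless-Chrome route (assuming selenium-java is on the classpath and a matching chromedriver is on the PATH; queryString is the URL from the question):
// Render the page in headless Chrome first, then hand the final DOM to Jsoup.
ChromeOptions options = new ChromeOptions();
options.addArguments("--headless");
WebDriver driver = new ChromeDriver(options);
try {
    driver.get(queryString.toString());
    // getPageSource() returns the DOM after the scripts have run,
    // so Jsoup now sees the dynamically created nodes too
    Document doc = Jsoup.parse(driver.getPageSource());
    Elements content = doc.select("[data-at]");
} finally {
    driver.quit();
}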
I don't quite know what to do about this, but I'm going to try one more time... The problematic lines in your code are these:
doc = Jsoup.connect(queryString.toString()).get();
Elements content = doc.select("[data-at]");
It is the queryString that you have requested: the URL points to a page that contains quite a bit of script code. The page you see rendered in the browser is not the same exact HTML that is broadcast to and received by Jsoup.
If the HTML that is broadcast contains any <SCRIPT TYPE="text/javascript"> ... </SCRIPT> tags (and the URL named in your question does), AND those <SCRIPT> tags are involved in the initial loading of the page, then Jsoup will not know anything about it... It only parses what it receives; it cannot process any dynamic content.
There are four ways that I know of to get the "post-script-loaded" version of the HTML from a dynamic web page, and I will list them here. The first is likely the most popular method (in Java) that I have heard about on Stack Overflow:
Selenium. This answer shows how the tool can run JavaScript; these are some Selenium docs, and this page has a great "first class" for using the tool to retrieve post-script-processed HTML. Again, there is no way Jsoup can retrieve HTML that is sent to the browser by script (JS/AJAX/Angular/React), since it is just a parser.
Puppeteer. This requires running a language called Node.js. Calling a simple Node.js program from Java could perhaps work, but it would be a "two language" solution. I've never used it. Here is an answer that shows getting, sort of, what you are trying to get: the HTML after the script runs.
WebView. Android programmers have a popular class called WebView (documented here), which I was recently told about (yesterday... but it has been out for years), that will execute script in a browser and return the HTML. Here is an answer that shows "JavaScript injection" to retrieve DOM tree elements from a WebView instance (which is how I was told it is done).
Splash. My favorite tool, which I don't think many people have heard of, but which has been the simplest for me. There is an API called the "Splash API"; here is their explanation of a "JavaScript rendering service". Since this is the one I have been using, I'll post a code snippet below that shows how Splash can retrieve post-script-processed HTML.
To run the Splash API (only if you have Docker available), you start a Splash server as below. These two lines are typed into a GCP (Google Cloud Platform) Shell instance, and the server starts right up without any configuration:
Pull the image:
$ sudo docker pull scrapinghub/splash
Start the container:
$ sudo docker run -it -p 8050:8050 --rm scrapinghub/splash
In your code, just prepend this String to your URLs:
"http://localhost:8050/render.html?url="
So in your code, you would use the following (instead), and the script would (more likely) load all the HTML elements that you are not finding:
String SPLASH_URL = "http://localhost:8050/render.html?url=";
doc = Jsoup.connect(SPLASH_URL + queryString.toString()).get();

How to make Jsoup wait for the complete page (skip a progress page) to load? [duplicate]

This question already has answers here:
Page content is loaded with JavaScript and Jsoup doesn't see it
(8 answers)
Closed 6 years ago.
I am trying to parse a webpage and extract data using Jsoup. But the link is dynamic and throws up a wait-for-loading page before displaying the details, so Jsoup seems to process the waiting page rather than the details page. Is there any way to make it wait till the page is fully loaded?
If some of the content is created dynamically once the page is loaded, then your best chance to parse the full content would be to use Selenium with JSoup:
WebDriver driver = new FirefoxDriver();
driver.get("http://stackoverflow.com/");
Document doc = Jsoup.parse(driver.getPageSource());
Probably, the page in question is generated by JavaScript in the browser (client-side). Jsoup does not interpret JavaScript, so you are out of luck. However, you could analyze the page loading in the network tab of the browser developer tools and find out which AJAX calls are made during page load. These calls also have URLs, and you may get all the info you need by accessing them directly. Alternatively, you can use a real browser engine to load the page. You can use a library like Selenium WebDriver for that, or the JavaFX WebKit component if you are using Java 8.
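For the direct-AJAX route, a small sketch (the endpoint URL below is a hypothetical placeholder; substitute whatever call you find in the network tab):
// Fetch a JSON endpoint found in the network tab directly with Jsoup.
String json = Jsoup.connect("https://example.com/api/details?id=123")
        .ignoreContentType(true) // Jsoup rejects non-HTML content types by default
        .execute()
        .body();
// Parse the JSON with your library of choice from here.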
I think I am just expanding luksch's answer a bit more. I am not familiar with web frameworks, so that answer looked a little difficult to understand. Since the page loads dynamically, using a parser like Jsoup is difficult: we must know that all the elements have loaded completely before attempting to parse. So instead of parsing immediately, use the WebDriver (Selenium) to check the elements' status, and once they are loaded, get the page source and parse it, or use the WebDriver itself to gather the data instead of a separate parser.
WebDriver driver = new ChromeDriver();
driver.get("<DynamicURL>");
List<WebElement> elements = null;
// poll until the dynamically created elements show up
while (elements == null) {
    elements = driver.findElements(By.className("marker"));
    if (!valuePresent(elements)) { // valuePresent() is your own readiness check
        elements = null;
    }
}
if (elements != null) {
    processElements(elements);
}
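As a side note, Selenium ships an explicit-wait helper that does this polling for you; a minimal sketch (the "marker" class name is carried over from the snippet above; this is the Selenium 3 constructor, newer versions take a Duration):
// Idiomatic alternative: WebDriverWait polls with a timeout instead of a busy loop.
WebDriverWait wait = new WebDriverWait(driver, 30); // timeout in seconds
List<WebElement> elements = wait.until(
        ExpectedConditions.presenceOfAllElementsLocatedBy(By.className("marker")));
processElements(elements);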

Convert a Jsoup "Element" into a WebDriver "WebElement"?

So I found that if you get the page source with WebDriver, you actually get the generated source of the entire DOM (not just the HTML source of the page that loaded). You can then use this String to generate a Jsoup Document. This is cool, because Jsoup is much faster than WebDriver at searching for elements, and it also has a much better API for doing so.
So, is there any way to turn a Jsoup Element into a WebDriver WebElement? I saw another post on Stack Overflow about using a method to generate an XPath from the Jsoup document, but that's not what I'm looking for, since WebDriver will still have to parse the page and use the XPath to look up the element, defeating the purpose (unless your purpose is purely to use Jsoup for its superior selector methods).
The reason I want to try to use Jsoup to find WebElements for WebDriver is that on some websites WebDriver is very, very slow (I work for a company that automates hundreds of 3rd-party websites; we have no control over these sites).
There seems to be some confusion between interactive and non-interactive tools here.
WebDriver tests are very often slow (in my experience) due to unnecessary and defensive waits and delays, improperly understood frameworks, and often being written by junior or outsourced developers - but fundamentally also because WebDriver mimics a real user's actions in 'real time' on a real browser, communicating with the browser app using an API (based on a specification) and a protocol. It's interactive.
(Less so with HtmlUnit, PhantomJS etc.)
By contrast, Jsoup is just a glorified HTTP client with extra parsing capabilities. It's non-interactive, and ultimately works off a snapshot String of data. We'd expect it to be much faster for its particular use-cases.
Clearly both are HTTP clients of a sort, and can share static web content, which is why WebDriver could pass data off for processing by Jsoup (though I've never heard of this use-case before).
However, Jsoup can never turn one of its Elements (a Java snapshot object containing some properties) into a WebDriver WebElement, which is more a kind of 'live' proxy to a real and interactive object within a program like Firefox or Chrome. (Again, less so with HtmlUnit, PhantomJS etc.)
So you need to decide whether interactivity is important to you. If it's crucial to mimic a real user, WebDriver has to 'drive' the process using a real browser.
If it's not, then you can consider the headless browsers like HtmlUnit and (especially) PhantomJS, as they will be able to execute JavaScript and update the DOM in a way that the HTTP libraries and Jsoup can't. You can then pass the output to Jsoup etc.
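For instance, a minimal sketch of the HtmlUnit route (assuming a recent HtmlUnit, where WebClient is AutoCloseable; API details vary slightly between versions):
// Render the page with HtmlUnit (which executes the JavaScript),
// then hand the resulting DOM to Jsoup for fast selecting.
static Document fetchRendered(String url) throws Exception {
    try (WebClient client = new WebClient()) {
        client.getOptions().setJavaScriptEnabled(true);
        HtmlPage page = client.getPage(url);
        return Jsoup.parse(page.asXml());
    }
}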
Potentially, if you went down the PhantomJS route, you could do all your parsing there using the JavaScript API. See: Use PhantomJS to extract html and text etc.
For a lot of people, interactivity isn't important at all, and it's quicker to drop WebDriver completely and rely on the libraries.
I know this question is incredibly old, but I'm posting this so anyone who comes across it can find this answer. The method below returns an XPath for a Jsoup Element. I translated it to Java; the original source I copied the code from is https://stackoverflow.com/a/48376038/13274510.
You can then use the XPath with WebDriver.
Edit: Code works now
public static String jsoupToXpath(Element element) {
    List<String> components = new ArrayList<>();
    // text nodes have no tag name; start from the parent element instead
    Element child = element.tagName().isEmpty() ? element.parent() : element;
    while (child.parent() != null) {
        Element parent = child.parent();
        Elements siblings = parent.children();
        String componentToAdd;
        if (siblings.size() == 1) {
            componentToAdd = child.tagName();
        } else {
            // find the 1-based position of child among same-tag siblings
            int x = 1;
            for (Element sibling : siblings) {
                if (child.tagName().equals(sibling.tagName())) {
                    if (child == sibling) {
                        break;
                    } else {
                        x++;
                    }
                }
            }
            componentToAdd = String.format("%s[%d]", child.tagName(), x);
        }
        components.add(componentToAdd);
        child = parent;
    }
    // components were collected leaf-to-root; reverse them, including index 0,
    // which holds the target element's own component
    List<String> reversedComponents = new ArrayList<>();
    for (int i = components.size() - 1; i >= 0; i--) {
        reversedComponents.add(components.get(i));
    }
    return "/" + String.join("/", reversedComponents);
}
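Usage is then a one-liner (driver being your WebDriver instance, jsoupElement the Element you found in the Jsoup snapshot):
// Locate the live element via the XPath computed from the Jsoup snapshot.
WebElement webElement = driver.findElement(By.xpath(jsoupToXpath(jsoupElement)));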

Programmatically render template area in Magnolia CMS

I am using Magnolia CMS 5.4 and I want to build a module that will render some content of a page and expose it over a REST API. The task sounds simple, but I am not sure how to approach it and/or where to start.
I want my module to generate a partial template, or an area of a template, for a given reference; let's say that is "header". I need to render the header template/area, get the HTML, and return that as a response to another system.
So my questions are: is this possible at all, and where do I start?
OK, after asking here and on the Magnolia forum without getting an answer, I dug into the source code and found a way to do it.
First of all, rendering works via different renderers, which can be JCR, plain text, or Freemarker renderers. In Magnolia these are resolved and used through RenderingEngine and its implementation, DefaultRenderingEngine. The rendering engine will allow you to render a whole page node, which is one step closer to what I am trying to achieve. So let's see how this can be done:
I'll skip some steps, but I've added a command and made it work over REST so I could see what happens when I send a request to the endpoint. The command extends BaseRepositoryCommand to allow access to the JCR repositories.
@Inject
public void setDefaultRenderingEngine(
        final RendererRegistry rendererRegistry,
        final TemplateDefinitionAssignment templateDefinitionAssignment,
        final RenderableVariationResolver variationResolver,
        final Provider<RenderingContext> renderingContextProvider
) {
    renderingEngine = new DefaultRenderingEngine(rendererRegistry, templateDefinitionAssignment,
            variationResolver, renderingContextProvider);
}
This creates your rendering engine, and from here you can start rendering nodes, with a few small gotchas. I tried injecting the rendering engine directly, but that didn't work, as all of the internals were empty/null, so I decided to grab all the constructor parameters and initialise my own instance.
Next we want to render a page node. The rendering engine is built around rendering to an HttpServletResponse and ties into the request/response flow really well, but we need to put the generated markup in a variable instead, so I've added a new implementation of the FilteringResponseOutputProvider:
public class AppendableFilteringResponseOutputProvider extends FilteringResponseOutputProvider {

    private final FilteringAppendableWrapper appendable;
    private OutputStream outputStream = new ByteArrayOutputStream();

    public AppendableFilteringResponseOutputProvider(HttpServletResponse aResponse) {
        super(aResponse);
        OutputStreamWriter writer = new OutputStreamWriter(outputStream);
        appendable = Components.newInstance(FilteringAppendableWrapper.class);
        appendable.setWrappedAppendable(writer);
    }

    @Override
    public Appendable getAppendable() throws IOException {
        return appendable;
    }

    @Override
    public OutputStream getOutputStream() throws IOException {
        ((Writer) appendable.getWrappedAppendable()).flush();
        return outputStream;
    }

    @Override
    public void setWriteEnabled(boolean writeEnabled) {
        super.setWriteEnabled(writeEnabled);
        appendable.setWriteEnabled(writeEnabled);
    }
}
The idea of the class is to expose the output stream while still preserving the FilteringAppendableWrapper, which allows us to filter the content we want to write. This is not needed in the general case; you can stick to using AppendableOnlyOutputProvider with a StringBuilder appendable and easily retrieve the entire page markup, as sketched below.
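A minimal sketch of that simpler path (assuming Magnolia's AppendableOnlyOutputProvider, which wraps any Appendable; no filtering, whole page):
// Render the whole page into a StringBuilder instead of a servlet response.
StringBuilder markup = new StringBuilder();
OutputProvider provider = new AppendableOnlyOutputProvider(markup);
renderingEngine.render(pageNode, provider);
String html = markup.toString();
For the area-filtering case, though, we need the servlet-response flavour: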
// here I needed to create a fake HttpServletResponse
OutputProvider outputProvider = new AppendableFilteringResponseOutputProvider(new FakeResponse());
Once you have the output provider, you need a page node, and since you are faking the request you need to set up the Magnolia global env to be able to retrieve the JCR node:
// populate repository and root node as those are not set for commands
super.setRepository(RepositoryConstants.WEBSITE);
super.setPath(nodePath); // this can be any existing path like: "/home/page"
Node pageNode = getJCRNode(context);
Now that we have the output provider and the node we want to render, the next thing is actually running the rendering engine:
renderingEngine.render(pageNode, outputProvider);
outputProvider.getOutputStream().toString();
And that's it, you should have your content rendered and you can use it as you wish.
Now we come to my special case, where I want to render just one area of the whole page, in this case the header. This is all handled by the same renderingEngine, but you need to add a rendering listener that overrides the writing process. First, inject it in the command:
@Inject
public void setAreaFilteringListener(final AreaFilteringListener aAreaFilteringListener) {
    areaFilteringListener = aAreaFilteringListener;
}
This is where the magic happens: the AreaFilteringListener checks whether you are currently rendering the requested area; if you are, it enables the output provider for writing, otherwise it keeps it locked and skips all unrelated areas. You need to add the listener to the rendering engine like so:
// add the area filtering listener that generates specific area HTML only
LinkedList<AbstractRenderingListener> listeners = new LinkedList<>();
listeners.add(areaFilteringListener);
renderingEngine.setListeners(listeners);
// we need to provide the exact same Response instance that the WebContext is using
// otherwise the voters against the AreaFilteringListener will skip the execution
renderingEngine.initListeners(outputProvider, MgnlContext.getWebContext().getResponse());
I hear you ask: "But where do we specify the area to be rendered?" Aha, here it comes:
// enable the area filtering listener through a global flag
MgnlContext.setAttribute(AreaFilteringListener.MGNL_AREA_PARAMETER, areaName);
MgnlContext.getAggregationState().setMainContentNode(pageNode);
The area filtering listener checks for a specific Magnolia context property, "mgnlArea"; if it is found, the listener reads its value, uses it as an area name, checks whether that area exists in the node, and then enables writing once we hit the area. This can also be used through URLs like https://demopublic.magnolia-cms.com/~mgnlArea=footer~.html, which will give you just the footer area rendered as an HTML page.
here is the full solution: http://yysource.com/2016/03/programatically-render-template-area-in-magnolia-cms/
Just use the path of the area and make an HTTP request using that URL, e.g. http://localhost:9080/magnoliaAuthor/travel/main/0.html
As far as I can see, there is no need to go through everything programmatically as you did.
Direct component rendering

How to get Layout instance's URL in Liferay? What is friendly URL base?

Suppose I have a Layout instance (in Java or JSP) and I want to get its URL.
A Layout represents a page. A page has a "friendly URL", and I can get it via the friendlyURL property.
But what about the FULL URL?
I can also get the scope group's friendly URL, where
Group scopeGroup = themeDisplay.getScopeGroup();
and obtain a shorter fragment, which is also not the full URL.
Company.getPortalURL
also does not contain everything (it does not include the port and the "/web" part).
Inside \ROOT\html\portlet\layouts_admin\layout\details.jsp I found the following code to build it:
boolean privateLayout = ((Boolean) renderRequest.getAttribute("edit_pages.jsp-privateLayout")).booleanValue();
Layout selLayout = (Layout) renderRequest.getAttribute("edit_pages.jsp-selLayout");

StringBuilder friendlyURLBase = new StringBuilder();
friendlyURLBase.append(themeDisplay.getPortalURL());

LayoutSet layoutSet = selLayout.getLayoutSet();
String virtualHostname = layoutSet.getVirtualHostname();
if (Validator.isNull(virtualHostname) || (friendlyURLBase.indexOf(virtualHostname) == -1)) {
    friendlyURLBase.append(scopeGroup.getPathFriendlyURL(privateLayout, themeDisplay));
    friendlyURLBase.append(scopeGroup.getFriendlyURL());
}
But this code is based on the odd parameters edit_pages.jsp-privateLayout and edit_pages.jsp-selLayout, which I am afraid will not be accessible in a normal portlet.
So, how do I obtain the FULL URL of a page instance?
Try this:
PortalUtil.getLayoutFullURL(layout, themeDisplay)
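A minimal usage sketch (assuming you are in a portlet or JSP where a ThemeDisplay is available; in some Liferay versions this call declares checked PortalException/SystemException, so handle or rethrow as needed):
// Build the full URL (scheme, host, port, and friendly-URL path) for the current page.
ThemeDisplay themeDisplay = (ThemeDisplay) request.getAttribute(WebKeys.THEME_DISPLAY);
Layout layout = themeDisplay.getLayout();
String fullURL = PortalUtil.getLayoutFullURL(layout, themeDisplay);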
