How to catch one specific text from html source code using Jsoup? - java

I tried the solution from:
How to extract text of paragraph from html using Jsoup?
jsoup how to extract this text
but both examples work with text that sits inside tags.
I have this unique piece of code on my html web search:
and what I need is to take the link that comes with the d.href variable.
I tried codes like:
Elements link = jSoupConnection.select(":contains(d.href)");
Elements link = jSoupConnection.select("#d.href");
Elements link = jSoupConnection.getElementsByAttributeValueContaining("d.href","google");
but so far none of them has worked.
I also searched the cookbook at http://jsoup.org/cookbook/ without success. Could anyone more experienced with Jsoup help me, please?
Thanks in advance

If your text is not inside any tag that a Jsoup selector can target, you can download the whole page (for example with Elements link = jSoupConnection.select("*");) and then process it in your application as plain text to retrieve whatever you want. If the downloaded file is too big, which was my problem, try limiting the download size; more details can be found at these links:
Limiting file size creation with java
How to limit the file size in Java
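The "treat the page as plain text" idea can be sketched with the standard library alone. This is a minimal sketch, not the asker's actual setup: the snippet from the question is not shown, so the shape `d.href = "..."` is an assumption, and `DHrefExtractor`, `readLimited`, and `findDHref` are hypothetical names. The size cap is applied by reading at most a fixed number of bytes from the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DHrefExtractor {
    // Assumed shape of the assignment: d.href = "..." or d.href = '...'.
    private static final Pattern D_HREF =
            Pattern.compile("\\bd\\.href\\s*=\\s*([\"'])(.*?)\\1");

    /** Reads at most maxBytes from the stream and decodes them as UTF-8. */
    static String readLimited(InputStream in, int maxBytes) {
        try {
            // A cut in the middle of a multi-byte character yields a replacement character.
            return new String(in.readNBytes(maxBytes), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the first value assigned to d.href, if the assumed pattern is present. */
    static Optional<String> findDHref(String pageSource) {
        Matcher m = D_HREF.matcher(pageSource);
        return m.find() ? Optional.of(m.group(2)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical page source standing in for the real download.
        String html = "<html><script>var d = document.createElement('a');"
                + " d.href = \"http://www.google.com/search?q=jsoup\";</script></html>";
        InputStream in = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
        String source = readLimited(in, 64 * 1024); // cap the amount of text processed
        System.out.println(findDHref(source).orElse("not found"));
    }
}
```

When the page is fetched with Jsoup itself, the same regex could be applied to the document's HTML string, and Connection.maxBodySize(int) is one way to cap the download.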

Related

Is image description the same as metadata extraction?

I am trying to understand my project description: One of the techniques used to search for images is the analysis of its content through the tags by first extracting “image descriptors” such as MPEG-7 descriptors, shown in the Figure below. These tags are snippets of text that describe an image’s content; the tags don’t appear on the image itself, but only in the extracted descriptors. The tags will help and tell the search engines what an image content is about.
But I have never worked with images.
I just want to know whether I need to extract the metadata from the image to get the tags,
and whether the tags contain the image content.

I don't understand this output produced by Selenium WebDriver (Java)

I'm trying to write an application for the game Path of Exile, that lists the items in my stash on trading websites automatically.
For this I have to retrieve the items in my stash from their website. For some reason the getText() function behaves very strangely on the website. I really can't see any mistake in my XPath expressions.
Example:
Here you can see a snippet from the HTML file I am working on
screenshot of browser debug tool
In the screenshot you can see that the XPath I am using selects an element with a text node. However, when I iterate over the elements and get the text with the getText() function, it returns an empty String. I really have no clue what I am doing wrong. Is the website denying me access to the field?
In case it helps I add here a screenshot of the source code for outputting the text fields
printing the text of the elements(SourceCode)
5 empty Strings as output
In your place, I would try to get the value instead of the text.
Try replacing
e.getText()
with
e.getAttribute("value")
You can also experiment with .getCssValue().

CSS Selector: Anchor text of href contains

I am currently working with Selenium and have now reached the interesting, yet incredibly difficult, world of CSS selectors.
I'm currently looking at selecting the different options of the Google toolbar. For example, when you search for something, the results page offers options to search for the same term under Images, News, Videos, etc.
I'm particularly interested in selecting the "Images" link.
I've been working on it for quite a while, and the closest I have got is the selector below:
div a.q.qs[href]
This drills down into the right sub classes, but there are 16 of them. Despite hours of aimless searching, I'm unable to complete the query with a "contains" condition on the anchor text, which is unique in the targeted sub classes.
I'm aware there is a By.linkText option in Selenium, but I'm not sure if the anchor text is unique across the entire page. Also, I really want to understand CSS selectors in general, so even if it were, I want to resolve this particular issue so I can apply it to future problems.
I'm looking for something like the below pseudo CSS selector:
div a.q.qs[href].Anchorcontains("Images")
Can anyone help?
All links have a unique parameter called tbm: its value is isch for images, so I'd go with
a[href*="tbm=isch"]
There are sometimes ways to get what you want with CSS selectors but if you want to find an element by contained text, you will have to use either link text/partial link text if it's a link or XPath for everything else.
The XPath for what you want is
//div[@id='hdtb-msb-vis']//a[.='Images']
You could use //a[.='Images'] but that returns two elements, one of which is not visible.
To break this down
// at any level
div find a DIV
[@id='hdtb-msb-vis'] that has an ID equal to 'hdtb-msb-vis'
//a that has a descendant A at any level
[.='Images'] that contains text (.) equal to 'Images'
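The breakdown above can be checked with Java's built-in XPath engine, written with XPath's @id attribute syntax. This is only a sketch: the markup in PAGE is a hypothetical, simplified stand-in for the real results page (which is HTML, not well-formed XML), and XPathBreakdownDemo and count are made-up names. It shows that the unscoped expression matches both "Images" anchors while the scoped one matches a single anchor.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathBreakdownDemo {
    // Hypothetical markup: one "Images" link in the visible bar, one in another menu.
    static final String PAGE =
            "<html><body>"
          + "<div id='hdtb-msb-vis'>"
          + "<a href='/search?q=x&amp;tbm=isch'>Images</a>"
          + "<a href='/search?q=x&amp;tbm=nws'>News</a>"
          + "</div>"
          + "<div id='other-menu'><a href='/imghp'>Images</a></div>"
          + "</body></html>";

    /** Counts the nodes in PAGE that match the given XPath expression. */
    static int count(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(PAGE)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count("//a[.='Images']"));                          // prints 2
        System.out.println(count("//div[@id='hdtb-msb-vis']//a[.='Images']")); // prints 1
    }
}
```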
If you want to explore by link text, you can write something like
int count = driver.findElements(By.linkText("Images")).size();
and print count. My guess is that it will be 2, one of which is not visible. You can use Selenium to filter this down further to only the visible link, if you want.
You are going to have the same issue with BackSlash's CSS selector answer. You can tweak it slightly and solve this problem with the CSS selector locator
#hdtb-msb-vis a[href*='tbm=isch']
Here are some links to CSS references that will help you get started: W3C reference, SauceLabs CSS Selectors tips, and Taming Advanced CSS Selectors.

Jsoup select div _ngcontent

I'm trying to use Jsoup to extract some information from a web site, but I don't know how to access the date content at the bottom of the code. I used the select command with "div", but it doesn't work. How can I do this?
Thanks!
From the image in your question, it appears that you are trying to fetch the date that follows the 'br'. A br is a line break, so there is nothing to select under it, even with a CSS selector. As a workaround, you could take the text under the "small" tag, split it, and take the second part. Inspect your DOM more closely and check for cases where this approach fails. For the limited HTML visible in the image, you can use the following:
String[] text = doc.select("div > small").text().split("\"");
System.out.println(text[1]);
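One failure case is easy to guard against: if the text contains no double quote, split returns a single-element array and text[1] throws ArrayIndexOutOfBoundsException. A minimal sketch of a guarded version follows. QuotedPart and secondPart are hypothetical names, and the sample string is invented because the real markup is only visible in the image.

```java
import java.util.Optional;

final class QuotedPart {
    /** Returns the second piece of the text split on double quotes, if there is one. */
    static Optional<String> secondPart(String text) {
        String[] parts = text.split("\"");
        if (parts.length < 2 || parts[1].trim().isEmpty()) {
            return Optional.empty(); // no quote found, or nothing usable after it
        }
        return Optional.of(parts[1].trim());
    }

    public static void main(String[] args) {
        // Invented sample; with Jsoup the input would be doc.select("div > small").text().
        System.out.println(secondPart("Published \"12/03/2018\"").orElse("no date found"));
        System.out.println(secondPart("Published 12/03/2018").orElse("no date found"));
    }
}
```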

Replace text with an image docx4j

I have a Word template. It contains the word "photo", which has to be replaced with an image. This has to be done with Docx4Java.
How do I do this?
If you are specifically looking to replace text with an image (which is not possible using docx4j, as answered above), you can replace a bookmark with an image as an alternative.
Just open your Word template, position the cursor at the desired location, choose Insert -> Bookmark, and name your bookmark.
I followed the instructions here to replace this bookmark with an image
Disclosure: I manage the docx4j project
The VariableReplace code doesn't handle images.
The best way to do this would be to use data bound content controls, specifically a picture content control pointing via XPath at a base-64 encoded image in an XML document (see Getting Started for details).
However, if you want to replace a word with an image, you can do so, but you'll have to write a bit of glue code. It is pretty straightforward.
First, find the word. You can do this using XPath or TraversalUtil (again, see Getting Started for details).
Hopefully it is in a run (w:r/w:t) by itself. If not, you'll need to split the run up so you don't replace adjacent text.
Then, add the image. See the sample ImageAdd.
I suggest you have a look at the XML created when you add an image in Word (ie save and unzip your docx, then look at document.xml). Take care that the XML representing the image is at the correct level (eg child of w:p).
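The "unzip and look at document.xml" step can also be done from Java with the standard library, since a docx file is a zip archive. This is a minimal sketch: DocxXmlDump and readDocumentXml are hypothetical names, the part is assumed to live at word/document.xml (where Word normally writes it), and the content is assumed to be UTF-8.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class DocxXmlDump {
    /** Returns the main document part of a .docx file as a string. */
    static String readDocumentXml(Path docx) {
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            // Assumed location of the main document part inside the archive.
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IllegalArgumentException("No word/document.xml in " + docx);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java DocxXmlDump <file.docx>");
            return;
        }
        System.out.println(readDocumentXml(Path.of(args[0])));
    }
}
```

Dumping the XML before and after inserting a picture in Word makes it easier to see which elements the image adds and at which level they sit.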
