Replace text on page depending on page URL - java

We are already replacing text with images on the website but have run into a little problem due to the platform we're running on - which is proprietary and provides limited access.
Our goal is to replace the price with an image, ONLY for this specific brand and all items within it.
It seems we could form some sort of expression that looks at the current URL and, if it matches, replaces the text with the image.
Is this valid thinking and if so how do I go about doing this?
Here is a link to a sample product that is within the brand 'KW Suspension':

Yeah, that shouldn't be too hard:
<script>
$(document).ready(function(){
    if ( /.*\/kw_suspension\/.*/i.test(location.href) ) {
        $(".yourprice").html("<img src='myimg.png' />");
    }
});
</script>
You can also change the regexp check; just alter what is between the slashes to fit your criteria.
EDIT
Added surrounding code and changed to regexp as suggested by OP.

You have access to location.href, which returns a string for the current window's URL, and you can use a regex test to see if the brand is in the URL. You can then replace the pricing span:
var matcher = new RegExp(/kw_suspension/);
if (matcher.test(location.href)) {
    // replaceWith swaps the span out for whatever markup you pass in
    $('#ctl00_MainContentPlaceHolder_YourPriceLabel').replaceWith('better html here');
}
The above simply checks whether kw_suspension is in the URL and, if so, replaces the span holding the price with something else.

You can use indexOf to see if the URL contains your keyword.
$(document).ready(function(){
    var urlString = location.href; // get the URL string
    if (urlString.indexOf("kw_suspension") != -1) {
        $('div.yourprice').empty().html('<img src="/path/to/image.jpg" />');
    }
});

Related

How to select text in HTML tag without a tag around it (JSoup)

I would like to select the text inside the strong-tag but without the div under it...
Is there a possibility to do this with jsoup directly?
My try for the selection (doesn't work, selects the full content inside the strong-tag):
Elements selection = htmlDocument.select("strong").select("*:not(.dontwantthatclass)");
HTML:
<strong>
I want that text
<div class="dontwantthatclass">
</div>
</strong>
You are looking for the ownText() method.
String txt = htmlDocument.select("strong").first().ownText();
Have a look at the various methods jsoup has to deal with this: https://jsoup.org/apidocs/org/jsoup/nodes/Element.html. You can use remove(), removeChild(), etc.
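For example, a rough sketch of the remove() route might look like this (assuming the usual org.jsoup imports and that the question's HTML has already been parsed into htmlDocument):
Element strong = htmlDocument.select("strong").first();
strong.select("div.dontwantthatclass").remove(); // drop the unwanted <div> from the DOM
String txt = strong.text();                      // "I want that text"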
One thing you can do is use regex.
Here is a sample regex that matches a start/end tag pair, as well as self-closing tags such as <br/>:
https://www.debuggex.com/r/1gmcSdz9s3MSimVQ
So you can do it like this (in JavaScript):
selection.replace(/<([^ >]+)[^>]*>.*?<\/\1>|<[^\/]+\/>/ig, "");
You can further modify this regex to match most of your cases.
Another thing you can do is further process your variable using JavaScript or VBScript:
Elements selection = htmlDocument.select("strong");
jQuery code:
var removeHTML = function(text, selector) {
    var wrapped = $("<div>" + text + "</div>");
    wrapped.find(selector).remove();
    return wrapped.html();
};
You can also combine regular expressions with jsoup's ownText() method to get the text and remove the unwanted string.
I guess you're using jQuery, so you could use the "innerText" property on your "strong" element:
var selection = htmlDocument.select("strong")[0].innerText;
https://jsfiddle.net/scratch_cf/8ds4uwLL/
PS: If you want to wrap the retrieved text into a "strong" tag, I think you'll have to build a new element like $('<strong>retrievedText</strong>');

How to get img url from html text

In Android development, how can I get an image URL, or load an image, from the HTML text shown below? I receive this HTML, and I want to extract the image URL from it:
<p style="text-align: justify;"><span style="font-size: 16px;"><img class="aligncenter wp-image-2699 size-large" src="http://www.helalsaglikliyasam.org/wp-content/uploads/2016/02/helal-parti-1-1024x258.jpg" alt="helal parti-1" width="750" height="189" srcset="http://www.helalsaglikliyasam.org/wp-content/uploads/2016/02/helal-parti-1-300x76.jpg 300w, http://www.helalsaglikliyasam.org/wp-content/uploads/2016/02/helal-parti-1-768x193.jpg 768w, http://www.helalsaglikliyasam.org/wp-content/uploads/2016/02/helal-parti-1-1024x258.jpg 1024w, http://www.helalsaglikliyasam.org/wp-content/uploads/2016/02/helal-parti-1-560x141.jpg 560w, http://www.helalsaglikliyasam.org/wp-content/uploads/2016/02/helal-parti-1.jpg 1564w" sizes="(max-width: 750px) 100vw, 750px" /></span></p>
I want to get src="http://www.helalsaglikliyasam.org/wp-content/uploads/2016/02/helal-parti-1-1024x258.jpg"
You should probably use an HTML parser -- there are several Java HTML parsing libraries that can be found with a quick search.
A quick and dirty way, however, would be to search the input string for the src=" declaration, like so:
int index = input.indexOf("src=\"");
String substr = input.substring(index + 5);
int endIndex = substr.indexOf("\"");
String imgUrl = substr.substring(0, endIndex);
Disclaimer: I haven't tested this, so it may have errors. It also makes a lot of assumptions which may not be true -- which is why you should use a library for this sort of thing!
Edit: Fixed one error after testing (had to use a different machine than the one I'm typing this on). It should work for you now -- but again, you should really use a library.
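For reference, a minimal sketch of the parser route with jsoup (my own choice of library; assuming the usual org.jsoup imports and that html holds the fragment from the question):
Element img = Jsoup.parse(html).select("img").first();
String imgUrl = (img != null) ? img.attr("src") : null; // the value of the src attribute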
To create an image in HTML, just type:
<img src="www.example.com/images/this-one.png" alt="an image" />
You may nest the img tag inside a p and/or span tag as needed.
Just a note: code is much easier to read when you use external style sheets.

How to validate that at least one element in a html string has content?

I have a WYSIWYG editor that I can't modify, and it sometimes returns <p></p>, which obviously looks like an empty field to the person using the editor.
So I need to add some validation on my back-end which uses java.
Should be rejected:
<p></p>
<p> </p>
<div><p> </p></div>
Should be accepted:
<p>a</p>
<div><p>a</p></div>
<p> </p>
<div><p>a</p></div>
Basically, as long as any element contains some content, we will accept it and save it.
I am looking for libraries that I should look at and ideas for how to approach it. Thanks.
You may look at the jsoup library. It's pretty fast.
It takes HTML, and you can extract the text from it (see the example from their website below).
Extract attributes, text, and HTML from elements
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
String text = doc.body().text(); // "An example link"
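Building on that, the validation itself could be a small helper along these lines (a sketch of my own, not from the jsoup docs): accept the submission only when the parsed text contains something other than whitespace.
import org.jsoup.Jsoup;

public static boolean hasContent(String html) {
    // text() returns the combined, whitespace-normalised text of the document
    return !Jsoup.parse(html).text().trim().isEmpty();
}

// hasContent("<p></p>")             -> false
// hasContent("<div><p>a</p></div>") -> true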
I would advise you to do it on the client side, because it is natural for the browser to do this. You need to hook into the send or "save" step of your WYSIWYG editor; a lot of them have this ability.
The JavaScript would be:
function stripIfEmpty(html)
{
    var tmp = document.createElement("DIV");
    tmp.innerHTML = html;
    var contentText = tmp.textContent || tmp.innerText || "";
    if (contentText.trim().length === 0) {
        return "";
    } else {
        return html;
    }
}
If you need to do this on the back end, then the only correct solution is to use a library that parses HTML, like jsoup, as @Dmytro Pastovenskyi showed above.
If you want a back-end check that is fuzzy rather than strict, you can use a regex such as replaceAll("\\<[^>]*>", ""), then trim, then check whether the string is empty.
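A rough sketch of that fuzzy check (hypothetical helper name):
static boolean hasTextContent(String html) {
    // strip anything that looks like a tag, then test whether any text is left
    return !html.replaceAll("<[^>]*>", "").trim().isEmpty();
}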
You can use regular expressions (built into Java).
For example,
"<p>\\s*\\w+\\s*</p>"
would match a <p> tag with at least 1 character of content.
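Used from Java (java.util.regex.Pattern), that might look like this sketch, with find() so the paragraph can sit anywhere in the string:
Pattern nonEmptyParagraph = Pattern.compile("<p>\\s*\\w+\\s*</p>");
boolean accepted = nonEmptyParagraph.matcher(html).find();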

Need to scrape a URL from a web page

I need to scrape a URL from a website; the URL is located within some JavaScript code.
<script type="text/javascript">
(function() {
// somewhere..
$.get("http://someurl.com?q=34343&b=343434&c=343434")...
});
</script>
I know that the URL starts with http://someurl.com?q= and it needs to have at least a second query parameter (&b=) inside, but the rest of the content is unknown.
I initially tried jsoup; however, it's not really suitable for this task. Manually fetching the page and then applying a regex pattern to it is also not preferable since the page is huge. What could I do to get the URL quickly and safely?
You can use this regex
/\$\.get\("(http:\/\/someurl\.com\?q=[\w.\-%#\/]*&b=[\w.\-%&=\/]*)"\)/g
This regex will search directly for this string:
$.get("http://someurl.com?q=
It will then allow any number of URL-valid characters to occur as the value of q.
It will then look to match
&b=
and then again any number of valid characters, followed by the closing quotation mark. I tested it with:
MATCH - $.get("http://someurl.com?q=34343&b=343434&c=343434")
MATCH - $.get("http://someurl.com?q=34343&b=13a43&k=343434&c2=something")
FAIL - $.get("http://someurl.com?q=34343&c=343434&b=343434")
FAIL - $.get("http://someurl.com?a=34343&b=343434=343434")
If you only want to return the first result, you can remove the global flag from the end:
/\$\.get\("(http:\/\/someurl\.com\?q=[\w.\-%#\/]*&b=[\w.\-%&=\/]*)"\)/
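Since the page is being processed from Java, the same expression can be applied with java.util.regex; here is a sketch under the assumption that the fetched page source is already available as a String:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static final Pattern GET_URL = Pattern.compile(
        "\\$\\.get\\(\"(http://someurl\\.com\\?q=[\\w.\\-%#/]*&b=[\\w.\\-%&=/]*)\"\\)");

static String extractUrl(String pageSource) {
    Matcher m = GET_URL.matcher(pageSource);
    return m.find() ? m.group(1) : null; // e.g. http://someurl.com?q=34343&b=343434&c=343434
}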

HtmlUnit - Convert an HtmlPage into HTML string?

I'm using HtmlUnit to generate the HTML for various pages, but right now, the best I can do to get the page into the raw HTML that the server returns is to convert the HtmlPage into an XML string.
This is somewhat annoying because the XML output is rendered by web browsers differently than the raw HTML would. Is there a way to convert an HtmlPage into raw HTML instead of XML?
Thanks!
page.asXml() will return the HTML. page.asText() returns it rendered down to just text.
I'm not 100% certain I understood the question correctly, but maybe this will address your issue:
page.getWebResponse().getContentAsString()
I think there is no direct way to get the final page as HTML.
asXml() returns the result as XML, asText() returns the extracted text content.
The best you can do is to use asXml() and "transform" it to HTML:
htmlPage.asXml().replaceFirst("<\\?xml version=\"1.0\" encoding=\"(.+)\"\\?>", "<!DOCTYPE html>")
(Of course you can apply more transformations like converting <br/> to <br> - it depends on your requirements.)
Even the related Google documentation recommends this approach (although they don't apply any transformations):
// return the snapshot
out.println(page.asXml());
I don't know of an answer short of switching on the Page type; for XmlPage and SgmlPage one must take the innerHTML of the HTML element and manually write out the attributes. Not elegant or exact (it's missing the doctype), but it works.
Page.getWebResponse().getContentAsString()
This is incorrect, as it returns the text form of the original, unrendered bytes, with no JavaScript applied. If JavaScript executes and changes things, this method will not see the changes.
page.asXml() will return the HTML. page.asText() returns it rendered down to just text.
Just to confirm: this only returns the text within text nodes and does not include the tags and their attributes. If you want the complete HTML, this is not good enough.
Maybe you want to go with something like this, instead of using the HtmlUnit framework's methods:
// url is a java.net.URL pointing at the page
try (InputStreamReader isr = new InputStreamReader(url.openConnection().getInputStream());
     BufferedReader br = new BufferedReader(isr)) {
    StringBuilder htmlSource = new StringBuilder();
    String line;
    while ((line = br.readLine()) != null) {
        htmlSource.append(line).append("\n");
    }
    return htmlSource.toString();
} catch (IOException e) {
    e.printStackTrace(); // or log/rethrow as appropriate
    return null;
}
Here is my solution that works for me:
ScriptResult scriptResult = htmlPage.executeJavaScript("document.documentElement.outerHTML;");
System.out.println(scriptResult.getJavaScriptResult().toString());
