Need to scrape a URL from a web page - java

I need to scrape a URL that is embedded in some JavaScript code on a web page.
<script type="text/javascript">
(function() {
// somewhere..
$.get("http://someurl.com?q=34343&b=343434&c=343434")...
});
</script>
I know that the url starts with http://someurl.com?q= and it needs to have at least a second query parameter (&b=) inside, but the rest of the content is unknown.
I initially tried jsoup, but it is not really suitable for this task. Manually fetching the page and then applying a regex pattern to it is also not preferable, since the page is huge. What could I do to get the URL quickly and safely?

You can use this regex
/\$\.get\("(http:\/\/someurl\.com\?q=[\w.\-%#\/]*&b=[\w.\-%&=\/]*)"\)/g
This regex will search directly for this string:
$.get("http://someurl.com?q=
It will then allow any number of URL valid characters to occur as the value of q.
It will then look to match
&b=
and then again any number of valid characters followed by the opposing quotation marks. I tested it with
MATCH - $.get("http://someurl.com?q=34343&b=343434&c=343434")
MATCH - $.get("http://someurl.com?q=34343&b=13a43&k=343434&c2=something")
FAIL - $.get("http://someurl.com?q=34343&c=343434&b=343434")
FAIL - $.get("http://someurl.com?a=34343&b=343434=343434")
If you only want to return the first result, you can remove the global flag (g) from the end:
/\$\.get\("(http:\/\/someurl\.com\?q=[\w.\-%#\/]*&b=[\w.\-%&=\/]*)"\)/
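The pattern above is written as a JavaScript regex literal, while the question asks about Java. A minimal Java sketch of the same expression follows, assuming the page text is already available as a string or a reader. The class and method names (GetUrlExtractor, findFirst, findFirstInLines) are made up for illustration, and the line-by-line variant assumes the $.get(...) call does not span several lines.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GetUrlExtractor {
    // Same expression as above as a Java string literal: no /.../ delimiters, no g flag.
    private static final Pattern GET_CALL = Pattern.compile(
            "\\$\\.get\\(\"(http://someurl\\.com\\?q=[\\w.\\-%#/]*&b=[\\w.\\-%&=/]*)\"\\)");

    // Returns the URL captured by group 1 of the first match, if there is one.
    static Optional<String> findFirst(CharSequence text) {
        Matcher m = GET_CALL.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Scans one line at a time and stops at the first hit, so a huge page
    // never has to be held in memory as a whole.
    // Assumption: the $.get(...) call sits on a single line.
    static Optional<String> findFirstInLines(BufferedReader reader) throws IOException {
        String line;
        while ((line = reader.readLine()) != null) {
            Optional<String> hit = findFirst(line);
            if (hit.isPresent()) {
                return hit;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical page fragment modelled on the snippet in the question.
        String page = "<script>\n$.get(\"http://someurl.com?q=34343&b=343434&c=343434\")\n</script>";
        BufferedReader reader = new BufferedReader(new StringReader(page));
        System.out.println(findFirstInLines(reader).orElse("no match"));
    }
}
```

In real use the BufferedReader would wrap the HTTP response stream rather than a StringReader, so reading can stop as soon as the URL is found.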

Related

Jsoup extract Hrefs from the HTML content

My problem is that I try to get the Hrefs from this site with JSoup
https://www.amazon.de/s?k=kissen&__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&ref=nb_sb_noss_2
but it does not work.
I tried to select the class from the Href like this
Elements elements = documentMainSite.select(".a-link-normal");
and after that I tried to extract the Hrefs with the following piece of code.
for (Element element : elements) {
String href = element.attributes().get("href");
}
but unfortunately it gives me nothing...
Can someone tell me where my mistake is, please?
I don't just connect to the website. I also save the hrefs in a string by extracting them with
String href = element.attributes().get("href");
After that I print the href string, but it is empty.
On the other hand, the code works with a different CSS selector, so the code itself is not the problem. It's just the CSS selector (.a-link-normal) that is probably wrong.
You won't get anything by simply connecting to the url via Jsoup.
Document document = Jsoup.connect(yourUrl).get();
String bodyText = document.getElementsByTag("body").get(0).text();
Here is the translation of the body text, which I got from the above code.
Enter the characters below We ask for your understanding and want to
be sure that you are not a bot. For best results, please use a browser
that accepts cookies. Type the characters you see in the image: Enter
characters Try another image Continue shopping Terms & Conditions
Privacy Policy © 1996-2015, Amazon.com, Inc. or its affiliates
You either need to get past the CAPTCHA or emulate a browser, for example with Selenium.

How to handle double escaping in chrome/firefox?

Here is my Java code in the JSP:
custUrl="customer.action?custId=211&custAddressId=2341";
Now the JavaScript code:
function submit() {
window.location = "<c:out value='<%=custUrl%>' />";
// here is generated javascript code
// window.location = "customer.action?custId=211&amp;custAddressId=2341"
}
Firefox and Chrome (IE does not double escape) are escaping the already escaped value (that's why I am getting the second parameter name as amp;custAddressId instead of custAddressId).
Is there any generic solution for handling double escaping in Firefox/Chrome?
UPDATE:
The bottom line is that I want c:out to escape the intended characters (which is happening), but I also want to avoid the double escaping that happens in some browsers when the data is sent to the server.
By default special characters are escaped by <c:out>. Turn escaping off as
<c:out value='<%=custUrl%>' escapeXml='false' />
Ampersand & is escaped as &amp; in XML. Here amp is short for ampersand.
This isn't a Firefox/Chrome issue because final HTML generated is the same irrespective of which browser you use to access your site. IE's HTML source viewer must have chosen to display the ampersand in its unescaped form.
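To see what the server actually emits, here is a small Java sketch of the default escaping. It is a hand-written approximation of what <c:out> does with escapeXml left on, not JSTL's own code, and the names EscapeDemo and escapeXml are made up for illustration. Because the escaped value lands inside a script block, where HTML does not decode character references, the literal text &amp; ends up in window.location.

```java
final class EscapeDemo {
    // Approximation of the default <c:out> escaping (escapeXml="true").
    // Assumption: written by hand for illustration, not copied from JSTL.
    static String escapeXml(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&#034;"); break;
                case '\'': out.append("&#039;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String custUrl = "customer.action?custId=211&custAddressId=2341";
        // Prints the value as it would appear in the generated page source.
        System.out.println(escapeXml(custUrl));
    }
}
```

With escapeXml='false' the string is written unchanged, which is what the JavaScript assignment needs here.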

c:out value behavior

I am relatively new to working with JSPs and I have a feeling I'm overlooking something simple. I have a segment that appends a key onto a URL before sending the user back to where they came from. The key is a string value, and when it consists of only numeric values (for example 12345) it works fine, but when it contains non-numerics (for example abcde) it simply appends "#" to the URL and stays on the same page.
<core:when test="${dataTransferObject.someBoolean}">
Back to Home
</core:when>
When it's a string the JavaScript will be illegal: it will think you're trying to reference a non-existent JavaScript variable. You will see an error in your JavaScript console.
Don't do any JavaScript operations; JSP is evaluated on the server side before the client sees it:
onclick="javascript:location='path/back/to/their/home.request?cachekey=<core:out value="${dataTransferObject.stringVariable}"/>';return false;"
Better yet, use JSP EL:
onclick="javascript:location='path/back/to/their/home.request?cachekey=${dataTransferObject.stringVariable}';return false;"
Also, if this is the JSTL core tag library, the canonical prefix is "c".

Replace text on page depending on page URL

We are already replacing text with images on the website but have run into a little problem due to the platform we're running on - which is proprietary and provides limited access.
Our goal is to replace the price with an image, ONLY for this specific brand and all items within it.
It seems that the way to do this is to form some sort of expression that looks at the current URL and, if it matches, replaces the text with the image.
Is this valid thinking and if so how do I go about doing this?
Here is a link to a sample product that is within the brand 'KW Suspension';
Yeah that shouldn't be too hard,
<script>
$(document).ready(function(){
if ( /.*\/kw_suspension\/.*/i.test(location.href) ) {
$(".yourprice").html("<img src='myimg.png' />");
}
});
</script>
you could also change it to check using a regexp, you just change what is within the () to fit your criteria.
EDIT
Added surrounding code and changed to regexp as suggested by OP.
You have access to location.href, which returns the current window's URL as a string, and you can use a regex match to see if the brand is in the URL. You can then replace the pricing span:
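var matcher = /kw_suspension/i; // case-insensitive match on the brand segment
if (matcher.test(location.href)) {
// replaceWith() swaps the whole span; put your replacement markup in the string
$('#ctl00_MainContentPlaceHolder_YourPriceLabel').replaceWith('better html here');
}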
The above simply checks whether kw_suspension is in the URL and, if so, replaces the price span with something else.
You can use indexOf to see if the url contains your keyword.
$(document).ready(function(){
var urlString = location.href; //get URL string
if( urlString.indexOf("kw_suspension") != -1){
$('div.yourprice').empty().html('<img src="/path/to/image.jpg" />');
}
});

How can I get HTML content, including content generated by JavaScript?

I need to fetch the contents of a web page via its URL, but the contents I get do not include the data produced by JavaScript. Can anybody help me solve this problem? For example, I want to get the BibTeX content, which is loaded by JavaScript, from this URL: http://portal.acm.org/citation.cfm?id=152610.152611&coll=DL&dl=GUIDE&CFID=111326695&CFTOKEN=18291914 How can I get content (2) from (1)?
From a quick observation, here is what I would do:
1/ Get the content of this web page: http://portal.acm.org/citation.cfm?id=152610.152611&coll=DL&dl=GUIDE&CFID=111326695&CFTOKEN=18291914
2/ Use regular expression to search for 'BibTeX' and locate the below string in the content:
<li style="list-style:disc; display:inline; margin-bottom:0px;">BibTeX</li>
3/ Use another regular expression to fish out:
exportformats.cfm?id=152611&expformat=bibtex
4/ Concatenate it to the url (make sure you decode &amp; to &):
"http://portal.acm.org/" + "exportformats.cfm?id=152611&expformat=bibtex"
5/ Capture the content you're looking for. Ultimately http://portal.acm.org/exportformats.cfm?id=152611&expformat=bibtex gives you the content.
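Steps 3 and 4 above can be sketched in Java roughly as follows. This assumes the page source has already been downloaded into a string (steps 1 and 5, the actual HTTP requests, are left out). The class name BibtexLink, the method exportUrl and the sample markup in main are made up for illustration; the real page markup may differ.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BibtexLink {
    private static final String BASE = "http://portal.acm.org/";

    // Relative export link; the match stops at a quote, whitespace or angle bracket.
    private static final Pattern EXPORT_LINK =
            Pattern.compile("exportformats\\.cfm\\?[^\"'\\s<>]*expformat=bibtex");

    // Finds the relative link, decodes &amp; to & and prepends the site root.
    static Optional<String> exportUrl(String html) {
        Matcher m = EXPORT_LINK.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        String relative = m.group().replace("&amp;", "&");
        return Optional.of(BASE + relative);
    }

    public static void main(String[] args) {
        // Hypothetical fragment; only the link itself is taken from the steps above.
        String html = "<a href=\"javascript:show('exportformats.cfm?id=152611&amp;expformat=bibtex')\">BibTeX</a>";
        System.out.println(exportUrl(html).orElse("no BibTeX link found"));
    }
}
```

Fetching the resulting URL then returns the BibTeX text directly, without running any JavaScript.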
