Render Tapestry page and get it as Stream/String resource - java

Is there any convenient way to dynamically render a page inside the application and then retrieve its contents as an InputStream or String?
For example, the simplest way is:
// generate url
Link link = linkSource.createPageRenderLink("SomePageLink");
String urlAsString = link.toAbsoluteURI() + "/customParam/" + customParamValue;
// get info stream from url
HttpGet httpGet = new HttpGet(urlAsString);
httpGet.addHeader("cookie", request.getHeader("cookie"));
HttpResponse response = new DefaultHttpClient().execute(httpGet);
InputStream is = response.getEntity().getContent();
...
But it seems there must be an easier way to achieve the same result. Any ideas?

I created tapestry-offline for exactly this purpose. Please be aware of the issue here (workaround included).
It's probably best to understand your exact use case. If, for example, you are generating emails in a scheduled task, it's probably better to configure Jenkins or cron to hit a URL.

It's probably also worth mentioning the capture component from tapestry-stitch
This is only useful in situations where you want to capture part of a page as a String during page / component render.

Related

using jsoup to parse html but not follow/fetch links

What is the "correct" way to use JSoup to parse html string or stream without fetching external data for link/img/area/iframe (and whatever other) tags? Right now I am doing something like this after I fetch a page using Apache HttpComponents:
HttpEntity entity = response.getEntity();
InputStream is = entity.getContent();
Document doc = Jsoup.parse(is, null, ""); // null charset: detect it from the document
Which actually works fine. But passing the baseUri as empty just feels wrong, because I am betting JSoup tries to use it, only to fail and move on. I only want to use JSoup as an html parser and DOM manipulation kit, not an http framework. I am also a bit worried that JSoup might try to look for ="/foo" resources in the current directory or something. What does it do with an empty string? I tried passing null as the baseUri, which would be a natural interface for doing what I want, but it dies with an IllegalStateException.
Is there a way to do this, or am I worried about nothing?
... I don't think that Jsoup does that. The URL parameter is just for the canonicalization of relative URLs; what you do with them is your responsibility. Jsoup will not by itself try to access resources.
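A quick way to convince yourself (a minimal sketch; the snippet of HTML and the example.com base are made up):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

String html = "<a href='/foo'>foo</a>";
// Empty baseUri: the raw attribute is untouched and absUrl() simply
// returns "" because there is nothing to resolve against.
Document doc = Jsoup.parse(html, "");
Element link = doc.select("a").first();
System.out.println(link.attr("href"));   // "/foo"
System.out.println(link.absUrl("href")); // ""
// With a baseUri, absUrl() canonicalizes the relative URL on demand.
Document doc2 = Jsoup.parse(html, "http://example.com/");
System.out.println(doc2.select("a").first().absUrl("href")); // "http://example.com/foo"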

Read from XML over Http

I am trying to learn how to read from an XML file (getting it from a url) over Http in Java and am pretty confused as to where I should start. I know how to parse an XML document and print the text associated with the elements to the screen and basic manipulation like that but I am trying to take it a little further.
If anyone could provide me with somewhere to start or any tips that would be much appreciated. I would be more than happy to provide more specifics if that is needed. Thanks!
It seems like you already know how to deal with XML, you're just asking how to get the XML over HTTP. This code should work.
URLConnection connection = new URL(urlThatReturnsXml).openConnection();
InputStream is = connection.getInputStream();
String responseAsString = org.apache.commons.io.IOUtils.toString(is, "UTF-8"); // explicit charset instead of the platform default
Make a java.net.URL object from the url string and call openStream() on it. You now have an InputStream to read from. That should get you going.
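If you then want to feed the XML straight into a DOM parser, a minimal sketch looks like this (the URL is a placeholder):
import java.io.InputStream;
import java.net.URL;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

InputStream is = new URL("http://example.com/feed.xml").openStream();
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(is); // parse the XML directly off the stream
is.close();
System.out.println(doc.getDocumentElement().getNodeName());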

Search for redirected website path

I have two websites, A and B. When I open website A, I am redirected to website B automatically.
Which method can I use to find out the full path of website A, i.e. the page the redirect came from?
I was trying to start with:
logger.info(request.getPathInfo());
logger.info(request.getPathTranslated());
logger.info(request.getServletPath());
logger.info(request.getLocalName());
logger.info(request.getRemoteAddr());
logger.info(request.getRemoteHost());
logger.info(request.getRequestURI());
logger.info(request.getServerName());
but none of them returns the path of website A.
For redirecting I use response.sendRedirect inside Controller.
Thanks for help.
You can try using the optional referer header:
request.getHeader("referer");
But it is important to note that this may not always be populated (specifically IE).
A better solution, if you are in control of both of the websites, is to pass the value somehow when you are doing the redirect. For example, as a GET or POST parameter.
Edit:
As suggested above, you can append query strings to your redirect URL. For example, you might try something like this:
String redirectUrl = "http://my.redirect.com/";
redirectUrl += "?referer=";
redirectUrl += URLEncoder.encode(request.getRequestURL().toString(), "UTF-8");
Then you can just pull this out of the request on the other side.
Use this as a starting point. You may need to manually append other query parameters that may not be part of the getRequestURL() output.
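On the receiving side it could look like this (a sketch; note the servlet container already decodes query parameters, so no URLDecoder call is needed):
String originalUrl = request.getParameter("referer");
if (originalUrl != null) {
    logger.info("Redirected from: " + originalUrl);
}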
None of these would get you the page that redirected you to the current page. What you can try is:
String refererPage = request.getHeader("referer");
However keep in mind that this is also browser dependent and may not always be present.

Java: How to easily check if a URL was already shortened?

If I have a general url (not restricted to twitter or google) like this:
http://t.co/y4o14bI
is there an easy way to check if this url is shortened?
In the above case, I as a human can of course see that it was shortened, but is there an automatic and elegant way?
You could do a request to the URL, look if you get redirected and if so, assume it's a shortening service. For this you'd have to read the HTTP status codes.
On the other hand, you could whitelist some URL shortening services (t.co, bit.ly, and so on) and assume all links to those domains are shortened.
Drawback of the first method is that it isn't certain, some sites use redirects internally. The drawback of the second method is that you'd have to keep adding shortening services, although only a few are used widely.
One signal may be to request the URL and see if it results in a redirect to another domain. However, without a good definition of what "shortened" means, there is no generic way.
If you know all the domains that can be used to shorten your URLs, check whether the URL starts with one of them:
String[] domains = {"bit.ly", "t.co" /* ... */};
for (String domain : domains) {
    if (url.startsWith("http://" + domain)) {
        return true;
    }
}
return false;
You can't: you will have to work from assumptions, for example:
Does "www" appear in the url?
Does the server name end with a valid top-level domain (e.g. com, edu, etc.), or does it use co.xx where xx is a valid country or organization code?
You can add more assumptions based on other url-shortening services.
You can't.
You can only keep a list of known shorteners and check whether the url starts with one of them.
You could also try checking whether the url is shorter than a given length (and contains a path/query string), but some shorteners (tinyurl, for example) may have longer urls than normal sites (aol.com).
I would prefer the list of known shorteners.
Here's what you could do in Java, Groovy and the like.
Get the url you want to test;
Open the url with HttpURLConnection;
Check the response code;
If it is a valid code, 200 for example, then you can retrieve the url string in long form from the connection object if it was shortened, or back in its original form if it wasn't.
We all love to see some code, don't we? It's crude, but hey!
String addr = "http://t.co/y4o14bI";
URL url = new URL(addr);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
if (connection.getResponseCode() == 200) {
    // redirects are followed by default, so getURL() now holds the
    // final (expanded) URL rather than the shortened one
    String longUrl = connection.getURL().toString();
    System.out.println(longUrl);
} else {
    // You decide what you want to do here!
}
Actually, you as a human can't. The only way you know that it's shortened is that it's a t.co domain. The y4o14bI could be a CMS identifier for all you know.
The best way would be to use a list of known shortener urls, and lookup against that.
And even then you would have problems. I use bit.ly with a personal domain, wtn.gd
So http://wtn.gd/random would also be a shortened URL.
You could maybe do an HTTP HEAD request and check for a 301/302?
If you request a URL like this, your HttpClient should receive an HTTP redirect instead of an HTML page. This wouldn't be proof, but at least a hint.
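A rough sketch of that idea (assuming plain HttpURLConnection; the method name is made up):
import java.net.HttpURLConnection;
import java.net.URL;

boolean looksShortened(String addr) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) new URL(addr).openConnection();
    conn.setRequestMethod("HEAD");
    conn.setInstanceFollowRedirects(false); // we want to see the redirect itself
    int code = conn.getResponseCode();
    if (code != 301 && code != 302) {
        return false;
    }
    String location = conn.getHeaderField("Location");
    if (location == null) {
        return false;
    }
    URL target = new URL(new URL(addr), location); // also resolves relative Locations
    // a redirect to a different host is a hint (not proof) of a shortener
    return !target.getHost().equalsIgnoreCase(new URL(addr).getHost());
}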
Evaluate the URL and look for some clues (a sketch of the path checks follows this list):
the path meets certain criteria:
it only has one step (i.e. no additional slashes)
it does not end with a filename extension
it is not longer than X characters (you would need to survey various URL-shortening services and adjust the upper bound for the max token length)
HttpURLConnection returns a redirect response code (e.g. 301, 302)
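For example, the path checks could be sketched like this (the threshold and names are made up for illustration):
URL u = new URL(addr);
String path = u.getPath();                    // e.g. "/y4o14bI"
boolean oneStep = path.lastIndexOf('/') == 0; // a single path segment
boolean noExtension = !path.contains(".");    // no filename extension
boolean shortToken = path.length() <= 12;     // tune against real shorteners
boolean looksLikeShortener = oneStep && noExtension && shortToken;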
I would suggest using android.util.Patterns.WEB_URL
public static List<String> findUrls(String input) {
    List<String> links = new ArrayList<>();
    Matcher m = android.util.Patterns.WEB_URL.matcher(input);
    while (m.find()) {
        String url = m.group();
        links.add(url);
    }
    return links;
}
Use an unshortening service like https://unshorten.me
They have an API as well: https://unshorten.me/api
If the URL is shortened, it will return the original URL.
If not, you will get the same one back.
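A hedged sketch of calling it (the /json/ path and the resolved_url field are based on https://unshorten.me/api; verify against the current docs):
import java.io.InputStream;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Scanner;

String shortUrl = "http://t.co/y4o14bI";
URL api = new URL("https://unshorten.me/json/" + URLEncoder.encode(shortUrl, "UTF-8"));
try (InputStream is = api.openStream();
     Scanner s = new Scanner(is, "UTF-8").useDelimiter("\\A")) {
    System.out.println(s.hasNext() ? s.next() : ""); // raw JSON; read resolved_url out of it
}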

Java getting source code from a website

I have a problem once again where I can't find the source code because it's hidden or something... When my Java program indexes the page it finds everything but the info I need... I assume it's hidden for a reason, but is there any way around this?
It's just a bunch of tr/td tags that show up in Firebug but don't show up when viewing the page source or when I do the following:
URL url = new URL("my url");
URLConnection yc = url.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
    System.out.println(inputLine); // only the initial HTML shows up here
}
in.close();
I really have no idea how to attempt to get the info that I need...
The reason for this behavior is probably that those tags are dynamically injected into the DOM using JavaScript and are not part of the initial HTML, which is what you can fetch with a URLConnection. They might even be created using AJAX. You will need a JavaScript interpreter on your server if you want to fetch those.
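One such interpreter-in-a-library is HtmlUnit (my pointer, not part of the original answer; a sketch assuming a 2.x HtmlUnit and a placeholder URL):
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

WebClient client = new WebClient();
client.getOptions().setJavaScriptEnabled(true); // on by default, shown for clarity
HtmlPage page = client.getPage("http://example.com/"); // placeholder URL
client.waitForBackgroundJavaScript(5000); // give AJAX calls time to finish
System.out.println(page.asXml()); // the DOM after the scripts have run
client.close();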
If they don't show up in the page source, they're likely being added dynamically by Javascript code. There's no way to get them from your server-side script short of including a javascript interpreter, which is rather high-overhead.
The information in the tags is presumably coming from somewhere, though. Why not track that down and grab it straight from there?
Try using Jsoup:
Document doc = Jsoup.parse(new URL("http://..."), 10000); // fetches the page and parses it (10 s timeout)
System.out.print(doc.toString());
Assuming that the issue is that the "missing" content is being injected using javascript, the following SO Question is pertinent:
What's a good tool to screen-scrape with Javascript support?
