I want to download the source of a webpage to a file (*.htm), i.e. the entire content with all HTML markup, from this URL:
http://isap.sejm.gov.pl/DetailsServlet?id=WDU20061831353
which works perfectly fine with the FileUtils.copyURLToFile method.
However, the said page also has some links, for instance one which I'm very interested in:
http://isap.sejm.gov.pl/RelatedServlet?id=WDU20061831353&type=9&isNew=true
This link works perfectly fine if I open it with a regular browser, but when I try to download it in Java by means of FileUtils, I get only a no-content page with the single message "trwa ladowanie danych" (which means: "loading data..."), and then nothing happens; the target page is never loaded.
Could anyone help me with this? From the URL I can see that the page uses Servlets -- is there a special way to download pages created with servlets?
This isn't a servlet issue - that just happens to be the technology used to implement the server, but generally clients don't need to care about that. I strongly suspect it's just that the server is responding with different data depending on the request headers (e.g. User-Agent). I see a very different response when I fetch it with curl compared to when I load it in Chrome, for example.
I suggest you experiment with curl, making a request which looks as close as possible to a request from a browser, and then fiddling until you can find out exactly which headers are involved. You might want to use Wireshark or Fiddler to make it easy to see the exact requests/responses involved.
Of course, even if you can fetch the original HTML correctly, there's still all the JavaScript - it would be entirely feasible for the HTML to contain none of the data, but for it to include JavaScript which does the actual data fetching. I don't believe that's the case for this particular page, but you may well find it happens for other pages.
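For example, here's a minimal sketch using HttpURLConnection; the header values are guesses to be refined against what you capture from a real browser:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class BrowserLikeFetch {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://isap.sejm.gov.pl/RelatedServlet?id=WDU20061831353&type=9&isNew=true");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Mimic a regular browser; adjust until the response matches what Chrome gets
        conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");
        conn.setRequestProperty("Accept", "text/html,application/xhtml+xml");
        conn.setRequestProperty("Accept-Language", "pl,en;q=0.8");
        try (InputStream in = conn.getInputStream()) {
            Files.copy(in, Paths.get("related.htm"), StandardCopyOption.REPLACE_EXISTING);
        }
    }
}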
Try using Selenium WebDriver to load the main page:

import java.util.concurrent.TimeUnit;
import org.openqa.selenium.By;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

HtmlUnitDriver driver = new HtmlUnitDriver(true); // true enables JavaScript
driver.manage().timeouts().implicitlyWait(30, TimeUnit.SECONDS);
driver.get(baseUrl);

and then navigate to the link (By.linkText is usually the right locator for an anchor):

driver.findElement(By.linkText("text of the link")).click();
UPDATE: I checked the following: if I turn off the cookies in Firefox and then try to load my page:
http://isap.sejm.gov.pl/RelatedServlet?id=WDU20061831353&type=9&isNew=true
then I get the incorrect result, just like in my Java app (i.e. the page with the "loading data" message instead of the proper content).
Now, how can I manage the cookies in Java to download this page properly?
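For reference, plain Java can manage cookies via the JVM-wide CookieManager; here's a minimal sketch, assuming FileUtils.copyURLToFile goes through HttpURLConnection (which honours the default CookieHandler):

import java.io.File;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URL;
import org.apache.commons.io.FileUtils;

public class CookieAwareDownload {
    public static void main(String[] args) throws Exception {
        // Install a default cookie store so cookies are kept between requests
        CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));
        // Visit the main page first so the server can set its session cookies
        FileUtils.copyURLToFile(
                new URL("http://isap.sejm.gov.pl/DetailsServlet?id=WDU20061831353"),
                new File("details.htm"));
        // The related page should now see the same session
        FileUtils.copyURLToFile(
                new URL("http://isap.sejm.gov.pl/RelatedServlet?id=WDU20061831353&type=9&isNew=true"),
                new File("related.htm"));
    }
}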
I have recorded a scenario in JMeter. I have a webpage which contains an iframe, which in turn loads another webpage from the same domain.
"Retrieve All Embedded Resources" is checked, but I don't want the iframe to get loaded. I have tried adding .css, .js, .png in "URLs must match" but it doesn't work.
You can stop downloading all the embedded resources in the iframe; in a way, the iframe won't get loaded.
Please note that the requested page which has the iframe embedded will still show the iframe in the HTML response, but the subsequent calls that the iframe makes to download its embedded resources can be stopped.
Here is a sample iframe example: the editor displayed on the page is in an iframe, so if you load the page, all of its resources get downloaded.
So let's try this in JMeter.
The results of this call are the same as in the developer console.
Now, block the iframe using the "URLs must match" functionality.
I peeked into the response of the earlier request and blocked the iframe using the regex pattern below:
^(nested_frames)*?
With this pattern in place, the response for the same request no longer pulls in the iframe's embedded resources.
I have uploaded the JMX file to GitHub if you want to play around with it.
Your requirement seems a little bit weird, as a well-behaved JMeter test should have the same network footprint as the real browser does (this applies to embedded resources, cookies, cache, headers, etc.), so if the real browser loads the page from the domain you're testing, the JMeter test needs to do the same.
If you still need to exclude the iframe from your JMeter test, you can "blacklist" the "another webpage" from being loaded via the "URLs must match" section of the HTTP Request sampler, like:
^((?!the-webpage-you-dont-want-here).)*$
More information: Excluding Domains from the Load Test
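To see how that negative-lookahead pattern behaves, here is a small Java sketch; the "iframe-page" fragment is a placeholder for the page you want to block:

import java.util.regex.Pattern;

public class UrlFilterDemo {
    public static void main(String[] args) {
        // Same idea as JMeter's "URLs must match": only matching URLs are downloaded.
        // The negative lookahead rejects any URL containing the blacklisted fragment.
        Pattern allowed = Pattern.compile("^((?!iframe-page).)*$");
        System.out.println(allowed.matcher("http://example.com/main.css").matches());         // true  -> fetched
        System.out.println(allowed.matcher("http://example.com/iframe-page.html").matches()); // false -> skipped
    }
}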
I need to validate a PDF report. I need to get the report embedded in the HTML. If I read that URL using:
File file = new File("url");
or
HttpWebConnection.getResponse();
it requests the URL in a separate session, hence it cannot get the file.
Does IEDriver have something like HtmlUnit's HttpWebConnection.getResponse(), or can somebody suggest an alternative?
Unfortunately it does not.
If you want to get the response code you will need a proxy. If you are using Java, then BrowserMob Proxy is what you need. You may also try making an XMLHttpRequest from JavaScript and getting the status code that way.
You could also stick to the method you are using right now (a separate request from Java) but pass the session cookie along with it (you can obtain the cookie from WebDriver).
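A minimal sketch of that last approach; pdfUrl is a placeholder for the report's address:

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.stream.Collectors;
import org.openqa.selenium.WebDriver;

public class SessionReuse {
    // Re-request a URL from plain Java while reusing the browser's session
    static int fetchWithSession(WebDriver driver, String pdfUrl) throws Exception {
        String cookieHeader = driver.manage().getCookies().stream()
                .map(c -> c.getName() + "=" + c.getValue())
                .collect(Collectors.joining("; "));
        HttpURLConnection conn = (HttpURLConnection) new URL(pdfUrl).openConnection();
        conn.setRequestProperty("Cookie", cookieHeader);
        return conn.getResponseCode(); // e.g. 200 if the report is reachable in this session
    }
}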
I was trying to crawl some website content using a combination of jsoup and Java, save the relevant details to my database, and repeat the same activity daily.
But here is the deal: when I open the website in a browser I get the rendered HTML (with all the element tags out there). The JavaScript part, when I test it, works just fine (the one which I'm supposed to use to extract the correct data).
But when I do a parse/get with jsoup (from a Java class), only the initial website is downloaded for parsing. Meaning there are some dynamic parts of the website whose data I want, but since they're rendered after the GET, asynchronously on the website, I'm unable to capture them with jsoup.
Does anybody know a way around this? Am I using the right toolset? More experienced people, I'd appreciate your advice.
You first need to check whether the website you're crawling demands any of the following to show all of its content:
Authentication with Login/Password
Some sort of session validation on HTTP headers
Cookies
Some sort of time delay to load all the contents (sites heavy on JavaScript libraries, CSS, and asynchronous data may need this).
A specific browser User-Agent
A proxy password if, for example, you're inside a corporate network security configuration.
If anything on this list is needed, you can provide that data via the parameters of your Jsoup.connect() call, as sketched below. Please refer to the official documentation:
http://jsoup.org/cookbook/input/load-document-from-url
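For instance, a minimal sketch; all values below are placeholders for whatever the target site actually requires:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupFetch {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://example.com/page")
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)") // pretend to be a browser
                .cookie("JSESSIONID", "your-session-id")                // session cookie, if required
                .header("Accept-Language", "en-US")
                .timeout(10_000)                                        // generous timeout for slow pages
                .get();
        System.out.println(doc.title());
    }
}

Note that jsoup still won't execute JavaScript, so if the data is fetched asynchronously after page load you'll need a browser-backed tool such as HtmlUnit or Selenium instead.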
I need to navigate around a site programmatically in Java, but the site doesn't change the URL when a link is clicked.
Site: http://cliqa.nana10.co.il/
On the right there is a bar with some links; click them and you will see that while the content changes, the URL doesn't. How can I achieve this mouse click on one of the links programmatically in Java? I thought about an HTTP POST, but what exactly am I going to send? An example would be much appreciated.
These links use JavaScript to trigger an AJAX request and refresh only the center of the page. Use Firebug inside Firefox to sniff the network requests and see which requests are executed on each click. Or use a programmatic web browser like HtmlUnit, which will handle JavaScript as your web browser does.
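For example, a minimal HtmlUnit sketch; the link text is a placeholder for whichever link in the bar you want to follow:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class AjaxClick {
    public static void main(String[] args) throws Exception {
        try (WebClient client = new WebClient()) {
            client.getOptions().setJavaScriptEnabled(true);
            HtmlPage page = client.getPage("http://cliqa.nana10.co.il/");
            HtmlAnchor link = page.getAnchorByText("some link text"); // placeholder link text
            HtmlPage after = link.click(); // HtmlUnit runs the JavaScript behind the link
            System.out.println(after.asXml());
        }
    }
}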
You need to look at the actual HTTP request being sent. You can do this in Chrome with the built-in Inspect Element tool, or with Firebug in Firefox (or Live HTTP Headers). I prefer to use Burp Suite's intercepting proxy to see this.
You can try Charles (http://www.charlesproxy.com/). Check the section on Java application proxy configuration at http://www.charlesproxy.com/documentation/configuration/browser-and-system-configuration/. You can inspect and change requests sent from your Java application.
I'm trying to use DefaultHttpClient to log into xbox.com. I realize that you can't be logged in without visiting http://login.live.com, so I was going to submit the form on that page and then use the cookies in any requests to xbox.com.
The problem is that requesting anything from live.com using DefaultHttpClient returns the following message:
Windows Live ID requires JavaScript to sign in. This web browser either does not support JavaScript, or scripts are being blocked.
How do I tell DefaultHttpClient to tell the server that JavaScript is available for use? I tried looking in the default options and also adding it as a parameter object, but I can't see what I've got to do.
The reason this is happening is that this line of HTML is getting parsed from live:
<noscript><meta http-equiv="Refresh" content="0; URL=http://login.live.com/jsDisabled.srf?mkt=EN-US&lc=1033"/>Windows Live ID requires JavaScript to sign in. This web browser either does not support JavaScript, or scripts are being blocked.<br /><br />To find out whether your browser supports JavaScript, or to allow scripts, see the browser's online help.</noscript>
which is used to redirect you if your client does not have JavaScript enabled (and therefore parses <noscript> tags).
You could try to use a less intelligent HTTP library which does no parsing of the content, but which instead simply does the transport and leaves the parsing to you.
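A minimal sketch of that approach with plain HttpURLConnection, which does no HTML interpretation at all (assumes Java 9+ for readAllBytes):

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RawFetch {
    public static void main(String[] args) throws Exception {
        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://login.live.com/").openConnection();
        conn.setInstanceFollowRedirects(false); // inspect the raw response yourself
        try (InputStream in = conn.getInputStream()) {
            // The <noscript> block arrives as plain text; nothing acts on its meta refresh
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}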
Use Wireshark to trace the communication using both a browser and your program, and look for the differences. It's hard to say what, exactly, live.com/xbox.com are looking for, but there is likely some AJAX-y code used to get the actual content.