I'm trying to scrape live data from 50+ dynamic webpages and need the data to be updated every 1-2 seconds. To do so, I have a Timer scheduled to fire every half second that calls the following method 50 times (once per URL):
public String fetchData(String link) {
    String data = null;
    try {
        URL url = new URL(link);
        URLConnection urlConn = url.openConnection();
        InputStreamReader inStream = new InputStreamReader(urlConn.getInputStream());
        BufferedReader buff = new BufferedReader(inStream);
        /* code that scrapes the webpage and stores the value in "data" */
        buff.close();
        inStream.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return data;
}
This method works but takes about a second per URL, or roughly 50 seconds in total. I've also tried jsoup, hoping to overcome the delay, with the following code:
public String fetchData(String link, String identifier) {
    Document doc;
    String data = null;
    try {
        doc = Jsoup.connect(link).timeout(10 * 1000).get();
        data = doc.getElementById(identifier).parent().child(0).text();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return data;
}
but have run into approximately the same processing time. Are there any faster ways to draw data from dynamic webpages simultaneously, whether through URLConnection, JSoup, or some other method?
The short answer is "use threads". Create a thread for each of the 50+ URLs that you want to scrape repeatedly.
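For example, here is a minimal sketch of that idea, assuming the fetchData method from the question and a hypothetical List<String> of URLs; a fixed thread pool lets all 50 fetches run concurrently on each timer tick:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Create the pool once and reuse it on every timer tick.
ExecutorService pool = Executors.newFixedThreadPool(50);

void fetchAll(List<String> urls) {
    for (String link : urls) {
        pool.submit(() -> {
            String data = fetchData(link); // the method from the question
            // store or process "data" here
        });
    }
}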
It will most likely make little difference whether you use URLConnection, JSoup or some other way to do the scraping. The actual bottleneck is likely to be due to:
load and performance of the server(s) you are scraping from
network bandwidth
network latency
The first of those is outside of your control (in a positive way!). The last two ... you might be able to address but only by throwing money at the problem. For example, you could pay for a better network connection / path, or pay for alternative hosting to move your scraper close to the sites you are trying to scrape.
Switching to multi-threaded scraping will ameliorate some of those bottlenecks, but not eliminate them.
But I don't think what you are doing is a good idea.
If you write something that repeatedly re-scrapes the same pages once every 1 or 2 seconds, they are going to notice. And they are going to take steps to stop you. Steps that will be difficult to deal with. Things like:
rate limiting your requests
blocking your IPs or IP range
sending you "cease and desist" letters
And if that doesn't help, maybe more serious things.
The real solution may be to get the information a more efficient way; e.g. via an API. This may cost you money too. Because (when it boils down to it) your scraping will be costing them money for either no return ... or a negative return if your activity ends up reducing real people's clicks on their site.
Let me get straight to an example to explain further.
final var socket = new java.net.ServerSocket(1234);
for (;;)
{
    try (final var client = socket.accept())
    {
        client.getOutputStream().write("HTTP/1.1 200 OK\r\n\r\n".concat(java.time.Instant.now().toString()).getBytes());
    }
}
When I now open my browser of choice (Firefox cough) I'll receive the current time and date. The question now is how I can write to that socket at a later point in time.
Hypothetical solution
Here's something I already tried, but it doesn't work at all.
final var socket = new java.net.ServerSocket(1234);
for (;;)
{
    try (final var client = socket.accept())
    {
        client.getOutputStream().write("HTTP/1.1 200 OK\r\n\r\n".concat(java.time.Instant.now().toString()).getBytes());
        client.getOutputStream().flush();
        Thread.sleep(1000L);
        client.getOutputStream().write("And another paragraph.".getBytes());
    }
}
The result is a web page loading for approximately a single second, printing out the following result (may vary due to different date and time on your end).
2019-01-19T18:19:15.607192500Z
And another paragraph.
Instead I would like to see something like this:
print out the current time and date.
wait a second without the content of the web page changing.
print out the next paragraph.
How would I go about implementing that?
Is it possible for the server to write text into a web page after it is loaded? Yes, it definitely is, but these days I suspect it is rarely done. I started web development in the 1990s and back then that was a pretty common technique. We used it to write live chat messages to browsers with no Javascript. These days Javascript is ubiquitous and powerful, so using client-side Javascript to update a page will be the best option in most cases.
That said, the technologies we used for writing server-side updates back then should still work now. I suspect the reason you don't see updates in your browser is that it doesn't know it should start displaying the page before everything is loaded. Using chunked transfer encoding, a 1990s technology still supported by modern browsers, should resolve that. It allows the server to indicate when a 'chunk' of data is complete, and browsers will generally process each chunk immediately rather than wait for all the chunks to arrive.
The easiest way to use chunked transfer encoding is to use an HTTP library like Apache HttpComponents, then wrap your output stream in the appropriate class:
final var socket = new java.net.ServerSocket(1234);
for (;;)
{
    try (final var client = socket.accept())
    {
        var rawStream = client.getOutputStream();
        // The status line and headers go over the raw stream, unchunked;
        // the Transfer-Encoding header tells the browser that chunks follow.
        rawStream.write("HTTP/1.1 200 OK\r\nTransfer-Encoding: chunked\r\n\r\n".getBytes());
        var outputStream = new ChunkedOutputStream(rawStream);
        outputStream.write(java.time.Instant.now().toString().getBytes());
        outputStream.flush();
        Thread.sleep(1000L);
        outputStream.write("And another paragraph.".getBytes());
        outputStream.finish(); // sends the terminating zero-length chunk
    }
}
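If you would rather not pull in a library, the chunk framing is simple enough to write by hand: each chunk is its length in hexadecimal, a CRLF, the data, and another CRLF; a zero-length chunk terminates the response. A minimal sketch (writeChunk is a hypothetical helper name):

// Frames one chunk per the HTTP/1.1 chunked transfer encoding.
static void writeChunk(java.io.OutputStream out, byte[] data) throws java.io.IOException {
    out.write(Integer.toHexString(data.length).getBytes()); // chunk size in hex
    out.write("\r\n".getBytes());
    out.write(data);                                        // chunk payload
    out.write("\r\n".getBytes());
    out.flush();
}
// After the last chunk, terminate the response with:
// out.write("0\r\n\r\n".getBytes());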
I have an Android app with a GAE backend. I'm encountering java.net.SocketTimeoutException, probably due to the fetch time limitations of GAE.
However, all the operation does is write a pretty simple object into the datastore and return it to the user. I'm guessing the debug overhead that Eclipse generates makes it take too long...
What would be the way to increase timeout time in such usage:
Gameendpoint.Builder builder = new Gameendpoint.Builder(AndroidHttp.newCompatibleTransport(), new JacksonFactory(), null);
builder = CloudEndpointUtils.updateBuilder(builder);
Gameendpoint endpoint = builder.build();
try {
    Game game = endpoint.createGame().execute();
} catch (Exception e) {
    e.printStackTrace();
}
Well, it was a silly mistake. The limit for such an operation is 30 seconds, which should be enough. However, inside createGame() there was an infinite loop. I have a feeling the GAE framework recognizes such a situation and throws the SocketTimeoutException before the 30 seconds actually pass.
Sockets on endpoints have a 2000 ms timeout. This is ample time if you are running short processes: a quick query (continuous queries are handled differently) or a write operation. If you overload the process and try to do too much (my issue), you will get this error. What you need to do is run a lot of different endpoint operations rather than trying to handle too much at one time. You can override the timeout with the HTTP transport if needed, but it is not advised.
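If you do decide to override it, one way (a sketch, assuming the google-http-client classes already used in the question) is to pass an HttpRequestInitializer that raises the connect and read timeouts in place of the null third argument to the Builder:

Gameendpoint.Builder builder = new Gameendpoint.Builder(
        AndroidHttp.newCompatibleTransport(), new JacksonFactory(),
        new HttpRequestInitializer() {
            public void initialize(HttpRequest request) {
                request.setConnectTimeout(60 * 1000); // milliseconds
                request.setReadTimeout(60 * 1000);
            }
        });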
I'm using jsoup in my android app but the problem is, the html source takes too much time to download. Here is my code:
long t = System.currentTimeMillis();
String url = "http://www.stackoverflow.com/";
Document doc = null;
try {
    Connection c = Jsoup.connect(url);
    doc = c.get();
    System.out.println(System.currentTimeMillis() - t);
} catch (IOException e) {
    e.printStackTrace();
}
Executing this code takes 1.265 seconds, which feels really weird because I can download the whole website (with images and all that good stuff) using a web browser in less than 0.5 seconds on the same device. Did I do something wrong? Or maybe there is a faster way to get the HTML source of a website? Thanks in advance.
Where are you running this code? On your device? If you are using an LTE/3G network, that timing wouldn't be too far off.
The other reason I can think of is that, if you are using Wifi, your wireless router may not be well positioned relative to your device.
From that code I don't see anything that could cause extra delay. 1.2 seconds may not be that bad if you don't have the host's DNS entry cached and the server is far away from you.
Also, try setting the user agent to match your browser's when comparing times. It may be that the server assigns different priorities based on the user agent; in this case you are using the default Java user agent.
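For example, with jsoup (the user-agent string below is just an illustrative placeholder; copy whatever your own browser actually sends):

Document doc = Jsoup.connect("http://www.stackoverflow.com/")
        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0")
        .timeout(10 * 1000)
        .get();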
In my app I'm using code like the following to download several images.
Is it high-performance to do it like that, or can I reuse the connection somehow?
for (int i = 0; i < 100; i++) {
    URL url = new URL("http://www.android.com/image" + i + ".jpg");
    HttpURLConnection urlConnection = (HttpURLConnection) url.openConnection();
    try {
        InputStream in = new BufferedInputStream(urlConnection.getInputStream());
        readStream(in);
    } finally {
        urlConnection.disconnect();
    }
}
You won't really get any benefit from reuse of the HttpURLConnection.
One thing that will greatly benefit your application is if you spend some time looking into Async Tasks, which will allow you to harness the power of multi threaded HTTP requests with callbacks to your main code.
See:
http://www.vogella.com/articles/AndroidPerformance/article.html
for a good example of how Async Tasks can be utilised.
A good starting point though is of course the Android Developers Blog, where they have an example for downloading an image from a server asynchronously, which will match your requirements nicely. With some adaptation you can have your application sending multiple asynchronous requests at once for good performance.
The Google article can be found at:
http://android-developers.blogspot.co.uk/2009/05/painless-threading.html
The key area to look at is:
public void onClick(View v) {
    new DownloadImageTask().execute("http://example.com/image.png");
}

private class DownloadImageTask extends AsyncTask<String, Void, Bitmap> {
    protected Bitmap doInBackground(String... urls) {
        return loadImageFromNetwork(urls[0]);
    }

    protected void onPostExecute(Bitmap result) {
        mImageView.setImageBitmap(result);
    }
}
The loadImageFromNetwork method is where the downloading takes place, and will be completely asynchronous away from your main UI thread.
As a basic example, you could modify your application to call this like so:
for (int i = 0; i < 100; i++) {
    new DownloadImageTask().execute("http://www.android.com/image" + i + ".jpg");
}
Though as an optimisation, I wouldn't throw 100 requests out at once; instead, create a threaded queue system that allows maybe 4 or 5 concurrent connections and keeps the rest coming through as others finish, e.g. by maintaining a list of pending requests to read off (see the sketch below).
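One way to get that queue behaviour without managing the pending list yourself is a fixed-size thread pool, which queues excess tasks internally. A sketch, assuming a hypothetical downloadImage method that fetches and processes one URL:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

ExecutorService pool = Executors.newFixedThreadPool(4); // at most 4 concurrent downloads

for (int i = 0; i < 100; i++) {
    final String imageUrl = "http://www.android.com/image" + i + ".jpg";
    pool.submit(new Runnable() {
        public void run() {
            downloadImage(imageUrl); // hypothetical: fetch and process one image
        }
    });
}
pool.shutdown(); // accept no new tasks; queued downloads still run to completion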
No matter how you do it, you are going to end up opening multiple connections, one to get each image. That is how any set of images is received, and there's no way to change that with HttpURLConnection anyway. So the code looks fine in that sense.
However, you could attempt to load multiple images at the same time through threading. It would be somewhat complex to implement such a scheme, but entirely possible. It would speed up the process by requesting multiple images at the same time.
I am writing a Java applet that downloads images from a web server and displays them to the user. It works fine in Java 1.6.0_3 and later, but on older versions it will completely crash the process about once every 20 page views. There are no error messages in the Java console, because the process is completely frozen. I've waited for almost 15 minutes sometimes, but it never un-freezes.
I added a debug message after every line of code, and determined that the line that is causing the crash is this: InputStream data = urlConn.getInputStream().
urlConn is a URLConnection object that is pointed at the image I want to load. I've tried every combination of options that I can think of, but nothing helps. I haven't been able to find anything in the Java bug database or the release notes for 1.6.0_3.
Has anyone encountered this problem before? Any idea how to fix it?
To determine if it really is the whole JVM process that's frozen, or something else:
(1) get a java stack dump (sigquit/ctrl-break/jstack)
(2) have another background thread doing something you can observe; does it stop?
(3) check if another process (browser/etc) can contact server during freeze? (There's a chance the real problem is server connection depletion)
Is it randomly once-in-every-20-fetches (for example, 5% of the time, sometimes the first fetch in the JVM run), or always after about 20 fetches? If the latter, it sounds like something isn't being closed properly.
If on Linux you can use 'netstat -t' or 'lsof' (with certain options or grepped to show only some lines) to see open sockets; if after each fetch, one more is open, and the count never goes down, you're not closing things properly.
If so, calling close() on the stream you get back and/or disconnect() on the HttpUrlConnection after each try may help. (There may also be more severe limits on the number of connections an applet can leave open, so you're hitting this more quickly than you would in a standalone app.)
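For example, a defensive cleanup pattern along those lines (a sketch, assuming urlConn is the URLConnection from the question):

InputStream data = null;
try {
    data = urlConn.getInputStream();
    // ... read the image bytes ...
} finally {
    if (data != null) {
        try { data.close(); } catch (IOException e) { /* ignore during cleanup */ }
    }
    if (urlConn instanceof HttpURLConnection) {
        ((HttpURLConnection) urlConn).disconnect(); // release the underlying socket
    }
}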
The fact that it 'works' in later Javas is also suggestive that some sort of automatic cleanup might be happening more effectively/regularly by finalization/GC. It's best to close things up cleanly yourself but you could also try forcing a GC/runFinalization in the earlier Javas showing the problem.
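A blunt way to test that theory on the affected JVM versions:

System.gc();              // request a collection ...
System.runFinalization(); // ... then run pending finalizers, which may close leaked connections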
I'm unsure of the cause of the problem you are facing, but I use the following code successfully for synchronously loading images from within applets (it loads from either the jar file or the server):
public Image loadImage(String imageName) {
    // get the image
    Image image = getImage(getCodeBase(), imageName);
    // wait for it to fully load
    MediaTracker tracker = new MediaTracker(this);
    tracker.addImage(image, 0);
    boolean interrupted = false;
    try {
        tracker.waitForID(0);
    } catch (InterruptedException e) {
        interrupted = true;
    }
    int status = tracker.statusID(0, false);
    if (status != MediaTracker.COMPLETE) {
        throw new RuntimeException("Failed to load " + imageName + ", interrupted:" + interrupted + ", status:" + status);
    }
    return image;
}