I'm using jsoup in my Android app, but the problem is that the HTML source takes too much time to download. Here is my code:
long t = System.currentTimeMillis();
String url = "http://www.stackoverflow.com/";
Document doc = null;
try {
Connection c = Jsoup.connect(url);
doc = c.get();
System.out.println(System.currentTimeMillis() - t);
} catch (IOException e) {
e.printStackTrace();
}
Executing this code takes 1.265 seconds, which feels really weird because I can download the whole website (with images and all that good stuff) using a web browser in less than 0.5 seconds on the same device. Did I do something wrong? Or maybe there is a faster way of getting the HTML source of a website? Thanks in advance.
Where are you running this code? On your device? If you are going over an LTE/3G network, that time wouldn't be too far off.
The other reason I can think of is that your wireless router is not well positioned relative to your device, in case you are using Wifi.
From that code I don't see anything that would cause extra delay. 1.2 seconds may not be that bad if you don't have the host's DNS entry cached and the server is far away from you.
Also, try setting the user agent to the same one your browser sends when comparing times. It may happen that the server gives different priorities based on the user agent; in this case you are using the default Java user agent.
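With jsoup that is one extra call on the connection; a rough sketch (the user agent string is only an illustration, copy whatever your actual browser reports):
// Sketch: use a browser-like user agent and an explicit timeout while timing the request.
long t = System.currentTimeMillis();
Document doc = Jsoup.connect("http://www.stackoverflow.com/")
        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)") // example value only
        .timeout(10 * 1000)
        .get();
System.out.println(System.currentTimeMillis() - t);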
Related
I'm trying to scrape live data from 50+ dynamic webpages and need the data to be updated every 1-2 seconds. To do so, I have a Timer scheduled every half second that calls the following method for each of the 50 URLs:
public double fetchData(String link) {
    String data = null;
    try {
        URL url = new URL(link);
        URLConnection urlConn = url.openConnection();
        InputStreamReader inStream = new InputStreamReader(urlConn.getInputStream());
        BufferedReader buff = new BufferedReader(inStream);
        /* code that scrapes the webpage and stores the value in "data" */
        buff.close();
        inStream.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return data == null ? Double.NaN : Double.parseDouble(data);
}
This method works but takes about a second per URL, or 50 seconds in total. I've also tried jsoup in the hope of overcoming the delay, using the following code:
public double fetchData(String link, String identifier) {
    String data = null;
    try {
        Document doc = Jsoup.connect(link).timeout(10 * 1000).get();
        data = doc.getElementById(identifier).parent().child(0).text();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return data == null ? Double.NaN : Double.parseDouble(data);
}
but have run into approximately the same processing time. Are there any faster ways to draw data from dynamic webpages simultaneously, whether through URLConnection, JSoup, or some other method?
The short answer is "use threads". Create a thread for each of the 50+ URLs that you want to scrape repeatedly.
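As a rough sketch of that with a thread pool (java.util.concurrent; the 'urls' list and your existing fetchData(link) are assumed from the question, and the pool size is just a starting point):
// Sketch: submit every fetch to a pool so the 50+ requests overlap instead of running one after another.
// get() throws checked exceptions; handle or declare them as needed.
ExecutorService pool = Executors.newFixedThreadPool(16);      // tune the pool size to taste
Map<String, Future<Double>> pending = new LinkedHashMap<>();
for (String link : urls) {                                     // 'urls' = your 50+ page addresses
    pending.put(link, pool.submit(() -> fetchData(link)));     // runs your existing fetchData concurrently
}
for (Map.Entry<String, Future<Double>> entry : pending.entrySet()) {
    double value = entry.getValue().get();                     // waits only for that particular page
    // ... update whatever holds the live value for entry.getKey() ...
}
If you need the values refreshed every second or two, keep the pool alive and resubmit the batch from your Timer instead of creating fresh threads each time.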
It will most likely make little difference whether you use URLConnection, JSoup or some other way to do the scraping. The actual bottleneck is likely to be due to:
load and performance of the server(s) you are scraping from
network bandwidth
network latency
The first of those is outside of your control (in a positive way!). The last two ... you might be able to address but only by throwing money at the problem. For example, you could pay for a better network connection / path, or pay for alternative hosting to move your scraper close to the sites you are trying to scrape.
Switching to multi-threaded scraping will ameliorate some of those bottlenecks, but not eliminate them.
But I don't think what you are doing is a good idea.
If you write something that repeatedly re-scrapes the same pages once every 1 or 2 seconds, the site operators are going to notice. And they are going to take steps to stop you, steps that will be difficult to deal with. Things like:
rate limiting your requests
blocking your IPs or IP range
sending you "cease and desist" letters
And if that doesn't help, maybe more serious things.
The real solution may be to get the information in a more efficient way, e.g. via an API. That may cost you money too, because (when it boils down to it) your scraping will be costing them money for either no return ... or a negative return if your activity ends up reducing real people's clicks on their site.
Let me get straight to an example to explain further.
final var socket = new java.net.ServerSocket(1234);
for (;;)
{
try (final var client = socket.accept())
{
client.getOutputStream().write("HTTP/1.1 200 OK\r\n\r\n".concat(java.time.Instant.now().toString()).getBytes());
}
}
When I now open my browser of choice (Firefox cough) I'll receive the current time and date. The question now is how I can write to that socket at a later point in time.
Hypothetical solution
Here's something I already tried, but it doesn't work at all.
final var socket = new java.net.ServerSocket(1234);
for (;;)
{
try (final var client = socket.accept())
{
client.getOutputStream().write("HTTP/1.1 200 OK\r\n\r\n".concat(java.time.Instant.now().toString()).getBytes());
client.getOutputStream().flush();
Thread.sleep(1000L);
client.getOutputStream().write("And another paragraph.".getBytes());
}
}
The result is a web page that loads for approximately one second and then prints the following (the exact output will vary with the date and time on your end).
2019-01-19T18:19:15.607192500Z
And another paragraph.
Instead I would like to see something like this:
print out the current time and date.
wait a second without the content of the web page changing.
print out the next paragraph.
How would I go about implementing that?
Is it possible for the server to write text into a web page after it is loaded? Yes, it definitely is, but these days I suspect it is rarely done. I started web development in the 1990s and back then that was a pretty common technique. We used it to write live chat messages to browsers with no Javascript. These days Javascript is ubiquitous and powerful, so using client-side Javascript to update a page will be the best option in most cases.
That said, the technologies we used for writing server-side updates back then should still work now. I suspect the reason you don't see incremental updates in your browser is because it doesn't know it should start displaying the page before everything is loaded. Using chunked transfer encoding, a 1990s technology still supported by modern browsers, should resolve that. It allows the server to indicate when a 'chunk' of data is complete, and browsers will generally process each chunk immediately rather than wait for all the chunks to arrive.
The easiest way to use chunked transfer encoding is to use an HTTP library like Apache HttpComponents and wrap the body of your response in the appropriate class. The status line and headers go out unchunked, with a Transfer-Encoding: chunked header announcing the encoding for the body:
final var socket = new java.net.ServerSocket(1234);
for (;;)
{
    try (final var client = socket.accept())
    {
        final var raw = client.getOutputStream();
        // The status line and headers are written unchunked and announce the chunked body.
        raw.write("HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nTransfer-Encoding: chunked\r\n\r\n".getBytes());
        final var outputStream = new ChunkedOutputStream(raw);
        outputStream.write(java.time.Instant.now().toString().getBytes());
        outputStream.flush();
        Thread.sleep(1000L);
        outputStream.write("And another paragraph.".getBytes());
        outputStream.finish(); // writes the terminating zero-length chunk
    }
}
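If you would rather not pull in a library, the chunked format is simple enough to write by hand: each chunk is its length in hexadecimal, CRLF, the bytes, CRLF, and a zero-length chunk ends the body. A rough dependency-free sketch of the same idea:
final var socket = new java.net.ServerSocket(1234);
for (;;)
{
    try (final var client = socket.accept())
    {
        final var out = client.getOutputStream();
        out.write("HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nTransfer-Encoding: chunked\r\n\r\n".getBytes());
        final var first = java.time.Instant.now().toString().getBytes();
        out.write((Integer.toHexString(first.length) + "\r\n").getBytes()); // chunk size in hex
        out.write(first);
        out.write("\r\n".getBytes());
        out.flush();                                    // the browser can render this chunk right away
        Thread.sleep(1000L);
        final var second = "And another paragraph.".getBytes();
        out.write((Integer.toHexString(second.length) + "\r\n").getBytes());
        out.write(second);
        out.write("\r\n".getBytes());
        out.write("0\r\n\r\n".getBytes());              // zero-length chunk terminates the response
    }
}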
I'm trying to write a little application that will block sites (by IP) while a browser (Chrome, IE, Firefox) is in use. It could also redirect to another site. As long as the user isn't able to use the site, I would be satisfied with the result.
The problem is that I've spent a few hours searching Google for a solution and I still can't find a good one. So far there are two options:
Use the hosts file - this would be a little problematic for my application, because I want to block a site for a period of time. If the application crashes, it won't restore the hosts file.
Use the "Windows Filtering Platform" - it's written in C++, so it would be harder for me to use. I would love to use Java. I can still call C++ from a Java application, but that still isn't a satisfying solution.
I would appreciate any help.
I think I have found a solution:
Blocking a website from access for all browsers
Well, I will try :). But still, if anybody has any better ideas, don't hesitate to answer this post :).
I did similar work some time ago; I used the hosts file to block all the entries that Spybot Search & Destroy marked as "dangerous" sites. If you want to make sure the site is freed again when the app crashes, you could use a second program or thread (I don't know how complex your application is) that checks whether the program is still running.
Microsoft has the following entry for displaying task names:
http://msdn.microsoft.com/en-us/library/windows/desktop/aa446864(v=vs.85).aspx
Maybe try this code and check whether your application is still alive (a plain-Java alternative is sketched below).
However, the user will notice a second task in his task manager!
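If you prefer to stay in Java rather than the Win32 API from that link, the watchdog could simply poll tasklist and restore the hosts file once the main program disappears. A rough sketch (the process name Blocker.exe and restoreHostsFile() are made-up placeholders; exception handling omitted for brevity):
// Sketch: poll the Windows task list until the blocker process is gone, then undo the hosts file.
boolean blockerRunning = true;
while (blockerRunning) {
    Thread.sleep(5000L);                                   // check every 5 seconds
    Process p = new ProcessBuilder("tasklist").start();
    try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
        blockerRunning = r.lines().anyMatch(line -> line.contains("Blocker.exe")); // hypothetical exe name
    }
}
restoreHostsFile();                                        // hypothetical: write the original hosts entries back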
To patch the hosts file I used this Java method, which saves the entries from a DefaultListModel:
try {
    BufferedWriter out = new BufferedWriter(new FileWriter("C:\\Windows\\System32\\drivers\\etc\\hosts"));
    for (int save = 0; save < Blocker.model.size(); save++) {
        out.write((String) Blocker.model.getElementAt(save));
        out.newLine();
    }
    out.close();
} catch (IOException fail) {
    JOptionPane.showMessageDialog(null, "Saving could not be completed",
            "About", JOptionPane.ERROR_MESSAGE);
}
I'm not sure if this will really help you; anyway, good luck with your project.
(Note that you have to run as administrator to get write access to the hosts file.)
I have recently started seeing user agents like Java/1.6.0_14 (and variations) on my site.
What does this mean? Is it a browser, a bot, or what?
This likely means someone is crawling your website using Java. This isn't much of anything to be concerned about unless you notice the crawler using large amounts of your bandwidth or not respecting your robots.txt file. Usually legitimate crawlers will take the time to create a custom user agent to make it easy to contact the operator if you have a problem, but even if they're using the default user agent, it's more than likely perfectly benign.
However, if you do notice a spike in 404 hits or lots of hits from the Java client, you're likely under attack by spammers looking for security holes in your website. If your site is built well, there's not a whole lot they can do other than burn some of your bandwidth, but if they find a security hole, they'll be sure to exploit it. Dealing with spammers properly is beyond the scope of this answer, but a scorched earth solution (which will work as a short term fix at the very least) would be to block all user agents that contain the string 'java'.
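For example, if your site happens to run on a Java servlet stack (just an assumption here), a filter along these lines would turn those requests away:
import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.*;

// Sketch: reject any request whose User-Agent contains "java" (case-insensitive).
public class JavaAgentBlockFilter implements Filter {
    public void init(FilterConfig cfg) {}
    public void destroy() {}

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        String agent = ((HttpServletRequest) req).getHeader("User-Agent");
        if (agent != null && agent.toLowerCase().contains("java")) {
            ((HttpServletResponse) res).sendError(HttpServletResponse.SC_FORBIDDEN); // the scorched earth option
            return;
        }
        chain.doFilter(req, res); // everyone else passes through
    }
}
The equivalent rule in your web server's configuration works just as well if your site isn't served by Java.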
It means your site is being accessed through the JVM on someone's machine. It could be a crawler or simply someone scraping data. You can replicate the user-agent string using the HttpURLConnection class. Here is a sample:
import java.net.*;
public class Request {
public static void main(String[] args) {
try {
URL url=new URL("http://google.ca");
HttpURLConnection con=(HttpURLConnection)url.openConnection();
con.connect();
System.out.println(con.getResponseCode());
} catch (Exception e) {
e.printStackTrace();
}
}
}
Java's HttpURLConnection class will send the JVM version information as the User-Agent header.
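If whoever runs that client wanted to be polite, one line before connect() is enough to replace the default (the header value here is only an example):
// Sketch: override the default Java/<version> User-Agent on the connection from the sample above.
con.setRequestProperty("User-Agent", "MyCrawler/1.0 (+http://example.com/contact)"); // example value
con.connect();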
I am writing a Java applet that downloads images from a web server and displays them to the user. It works fine in Java 1.6.0_3 and later, but on older versions it will completely crash the process about once every 20 page views. There are no error messages in the Java console, because the process is completely frozen. I've waited for almost 15 minutes sometimes, but it never un-freezes.
I added a debug message after every line of code, and determined that the line that is causing the crash is this: InputStream data = urlConn.getInputStream().
urlConn is a URLConnection object that is pointed at the image I want to load. I've tried every combination of options that I can think of, but nothing helps. I haven't been able to find anything in the Java bug database or the release notes for 1.6.0_3.
Has anyone encountered this problem before? Any idea how to fix it?
To determine if it really is the whole JVM process that's frozen, or something else:
(1) get a java stack dump (sigquit/ctrl-break/jstack)
(2) have another background thread doing something you can observe; does it stop?
(3) check if another process (browser/etc) can contact server during freeze? (There's a chance the real problem is server connection depletion)
Is it randomly once-in-every-20-fetches (for example, 5% of the time, sometimes the first fetch in the JVM run), or always after about 20 fetches? If the latter, it sounds like something isn't being closed properly.
If on Linux you can use 'netstat -t' or 'lsof' (with certain options or grepped to show only some lines) to see open sockets; if after each fetch, one more is open, and the count never goes down, you're not closing things properly.
If so, calling close() on the stream you get back and/or disconnect() on the HttpURLConnection after each try may help. (There may also be more severe limits on the number of connections an applet can leave open, so you're hitting this more quickly than you would in a standalone app.)
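A rough sketch of that cleanup (imageUrl stands in for whatever URL the applet is loading):
// Sketch: make sure both the stream and the connection are released after every fetch.
HttpURLConnection conn = null;
InputStream data = null;
try {
    conn = (HttpURLConnection) imageUrl.openConnection();   // 'imageUrl' is assumed from your code
    data = conn.getInputStream();
    // ... read and decode the image ...
} finally {
    if (data != null) {
        try { data.close(); } catch (IOException ignored) {}
    }
    if (conn != null) {
        conn.disconnect();                                   // releases the underlying socket
    }
}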
The fact that it 'works' in later Javas is also suggestive that some sort of automatic cleanup might be happening more effectively/regularly by finalization/GC. It's best to close things up cleanly yourself but you could also try forcing a GC/runFinalization in the earlier Javas showing the problem.
I'm unsure of the cause of the problem you are facing, but I use the following code successfully for synchronously loading images from within applets (it loads from either the jar file or the server):
public Image loadImage(String imageName) {
// get the image
Image image = getImage(getCodeBase(), imageName);
// wait for it to fully load
MediaTracker tracker = new MediaTracker(this);
tracker.addImage(image, 0);
boolean interrupted = false;
try {
tracker.waitForID(0);
} catch (InterruptedException e) {
interrupted = true;
}
int status = tracker.statusID(0, false);
if (status != MediaTracker.COMPLETE) {
throw new RuntimeException("Failed to load " + imageName + ", interrupted:" + interrupted + ", status:" + status);
}
return image;
}