I have written some code to comb through approximately 10000 web pages on a website to put together a profile of the site's user demographics. The basic approach is to read each line of a page's source code, parse out the data I want, then move on to the next page.
I am encountering an issue where, around the 650th page or so, the program goes from reading around 3 pages per second to 1 page per 10-15 seconds. It always happens at the same point in the program's execution. I began wondering if this might be a memory issue with my program and began checking each aspect of it. Eventually I stripped the program down to its basics:
Step 1) Create an array of URL objects.
Step 2) Loop through the array and open/close a buffered reader to read each line.
Step 3) Read the entire page and move onto the next line.
Even this stripped-down version slowed down at the exact same spot, so this isn't a problem with the data I am parsing or where I am storing it; it is somehow a result of this loop. Is there a memory issue in what I have written that could be causing this? Otherwise, my only guess is that I am somehow making calls to the website's servers too quickly and it is intentionally slowing me down.
Obviously this is not the best-written code, as I am new and prone to sloppy coding, but it does exactly what I want. The issue is that it slows to a crawl after about ten minutes, which won't work.
Here is the relevant code:
Array code
import java.io.IOException;
import java.net.URL;
public class UrlArrayBuild {
    private int page_count;  // number of pages
    public URL[] urlArray;   // array of webpage URLs

    public UrlArrayBuild(int page) {      // constructor
        page_count = page;                // initializes page_count
        urlArray = new URL[page_count];   // creates the URL array
    }

    protected void buildArray() throws IOException { // assigns a URL to each array element
        int count; // user number appended to the end of each page URL
        for (int i = 0; i < page_count; i++) {
            count = i * 60;                           // sets user number at end of page
            URL website = new URL("http://...." + count);
            urlArray[i] = website;                    // url address
            //System.out.println(urlArray[i]);        // debug
        }
    }

    protected URL returnArrayValue(int index) { // returns the URL at the given index
        //System.out.println(urlArray[index]);  // debug
        return urlArray[index];
    }

    protected int returnArrayLength() { // returns the length of the array
        //System.out.println(urlArray.length); // debug
        return urlArray.length;
    }
}
Reader Code
import java.net.*;
import java.io.*;
public class DataReader {

    public static void main(String[] args) throws IOException {

        UrlArrayBuild PrimaryArray = new UrlArrayBuild(9642); // creates the array object
        PrimaryArray.buildArray();                            // builds the array

        // Create and initialize variables to use in the loop
        URL website = null;
        String inputLine = null;

        // Loops through the array and reads the source code of each page
        for (int i = 0; i < PrimaryArray.returnArrayLength(); i++) {
            try {
                website = PrimaryArray.returnArrayValue(i); // acquires the url
                BufferedReader inputStream = new BufferedReader(new InputStreamReader(website.openStream())); // reads url source code
                System.out.println(PrimaryArray.returnArrayValue(i)); // prints the url; I use it as a check to monitor progress

                while ((inputLine = inputStream.readLine()) != null) {
                    if (inputLine.isEmpty()) { // checks for blank lines
                        continue;
                    } else {
                        // begin parsing code. This is currently commented out, so nothing happens here
                    }
                }
                inputStream.close();
            } finally {
                // extraneous code here, currently commented out
            }
        }
    }
}
Some delays are caused by the websites themselves, especially if they are content-rich. This might be a reason.
Parsing can also be a factor in delays. Personally, I suggest using a dedicated parsing library, which may be better optimized.
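For example, here is a minimal sketch using Jsoup, one common HTML parsing library (the URL and the CSS selector are placeholders; adapt them to the pages being scraped):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSketch {
    public static void main(String[] args) throws Exception {
        // Fetch and parse one page; Jsoup handles both the HTTP request and the HTML parsing
        Document doc = Jsoup.connect("http://example.com/users?page=1").get();

        // Select the elements that hold the data of interest (selector is a placeholder)
        for (Element row : doc.select("div.user-profile")) {
            System.out.println(row.text());
        }
    }
}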
Good luck!
Multithread the application so requests can run concurrently. Or,
Rearchitect to use asynchronous IO / HTTP requests. Netty or MINA, or possibly just raw NIO.
Both of these solutions are a lot of work, but unfortunately a sophisticated solution is required to deal with your problem. Basically, asynchronous frameworks exist to solve exactly this kind of problem.
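To illustrate the first option, here is a minimal sketch that fetches the pages concurrently with a fixed thread pool (Java 8 syntax; the pool size of 4 and the error handling are arbitrary placeholder choices, and UrlArrayBuild is the asker's class from above):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrentReader {
    public static void main(String[] args) throws Exception {
        UrlArrayBuild primaryArray = new UrlArrayBuild(9642);
        primaryArray.buildArray();

        // A small pool keeps the load on the remote server reasonable
        ExecutorService pool = Executors.newFixedThreadPool(4);

        for (int i = 0; i < primaryArray.returnArrayLength(); i++) {
            final URL website = primaryArray.returnArrayValue(i);
            pool.submit(() -> {
                try (BufferedReader in = new BufferedReader(new InputStreamReader(website.openStream()))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        // parsing would go here
                    }
                } catch (Exception e) {
                    System.err.println("Failed to read " + website + ": " + e);
                }
            });
        }

        pool.shutdown();                          // stop accepting new tasks
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for all fetches to finish
    }
}
Note that running many requests in parallel makes it more likely that the server will throttle you, so keep the pool small.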
I think that when looping through the array you could use multithreading and asynchronous Java method invocation to improve performance.
There is nothing obviously wrong with your code that would explain this. Certainly not in the code that you have shown us. Your code is not saving anything that is being read so it can't be leaking memory that way. And it shouldn't leak resources ... because if there are any I/O exceptions, the application terminates immediately.
(However, if your code did attempt to continue after I/O exceptions, then you would need to move the close() call into the finally block to avoid the socket / file descriptor leakage.)
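For reference, a sketch of that variant, based on the loop body above (in Java 7+ a try-with-resources block achieves the same thing more concisely):
BufferedReader inputStream = null;
try {
    inputStream = new BufferedReader(new InputStreamReader(website.openStream()));
    String inputLine;
    while ((inputLine = inputStream.readLine()) != null) {
        // parsing would go here
    }
} finally {
    // Always runs, even if readLine() throws, so the socket is not leaked
    if (inputStream != null) {
        inputStream.close();
    }
}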
It is most likely either a server-side or (possibly) a network issue:
Look to see if there is something unusual about the pages around the 650-page mark. Are they bigger? Do they entail extra server-side processing (meaning they will be delivered more slowly)?
Look at the server-side load (while the application is running) and its log files.
Check to see if some kind of server request throttling has been implemented; e.g. as an anti-DoS measure.
Check to see if some kind of network traffic throttling has been implemented.
Also check on the client-side resource usage. I would expect CPU usage to either stay constant, or tail off at the 650 page mark. If CPU usage increases, that would cast suspicion back onto the application.
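If you want hard numbers on the client side, a small variation of the existing loop that times each page will show exactly which pages are slow and whether the time goes into opening the connection or reading the body (the printf formatting is just an illustration):
for (int i = 0; i < PrimaryArray.returnArrayLength(); i++) {
    long start = System.nanoTime();

    URL website = PrimaryArray.returnArrayValue(i);
    BufferedReader in = new BufferedReader(new InputStreamReader(website.openStream()));
    long connected = System.nanoTime();

    while (in.readLine() != null) {
        // read only; parsing omitted for the measurement
    }
    in.close();
    long done = System.nanoTime();

    System.out.printf("page %d: connect %d ms, read %d ms%n",
            i, (connected - start) / 1000000, (done - connected) / 1000000);
}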
Related
I wrote a thread-safe class that collects input from multiple threads and uploads the result to S3 once it reaches a fixed size.
S3Exporter class
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import com.google.common.base.Preconditions;

// this class is thread safe.
public class S3Exporter {
    private static final int BUFFER_PADDING = 1000;
    private final int targetSize;
    private final ByteArrayOutputStream buf;
    private volatile boolean started;

    public S3Exporter(final int targetSize) {
        buf = new ByteArrayOutputStream(targetSize + BUFFER_PADDING);
        this.targetSize = targetSize;
        started = false;
    }

    public synchronized void start() {
        started = true;
    }

    public synchronized void end() {
        started = false;
        flush();
    }

    public synchronized void export(byte[] data) throws IOException {
        Preconditions.checkState(started, "Not started!");
        buf.write(data, 0, data.length);
        flushIfNeeded();
    }

    private void flushIfNeeded() {
        if (buf.size() >= targetSize) {
            flush();
        }
    }

    public synchronized void flush() {
        if (buf.size() > 0) {
            // upload buf to S3; this is a time-consuming operation
            buf.reset();
        }
    }
}
The client calls the export method to pass data; if an exception is thrown, the client will pass that data again later.
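For illustration, a minimal, simplified sketch of such a caller (single-threaded; the pending queue and the exporter field are hypothetical, not part of the actual client):
// Hypothetical caller: keep the payload locally and retry it on the next call
private final Queue<byte[]> pending = new ArrayDeque<byte[]>();

public void send(byte[] data) {
    pending.add(data);
    try {
        byte[] next;
        while ((next = pending.peek()) != null) {
            exporter.export(next); // may throw IOException
            pending.remove();      // drop the chunk only once it has been accepted
        }
    } catch (IOException e) {
        // leave the remaining chunks in 'pending'; they will be retried on a later call
    }
}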
To avoid losing data when restarting the application, I add a shutdown hook when creating S3Exporter object:
S3Exporter exporter = new S3Exporter(10000);
Runtime.getRuntime().addShutdownHook(new Thread(() -> exporter.end()));
My concern is that the class is not scalable; it could become a bottleneck of the system as the amount of data grows. I can think of two ways to improve the situation:
Do the time-consuming upload operation asynchronously: use an executor to upload and call ThreadPoolExecutor.awaitTermination() in the shutdown hook (a sketch of this option follows below).
Just put the data onto a LinkedBlockingQueue in the export method and use multiple threads to handle it. (This way is more scalable than the first, as I understand it.)
But then I need to do more work in the shutdown hook thread to make sure the accepted data is not lost, which I know is not a good idea. Otherwise I take the risk of losing data when restarting the application, which is the last thing I want to see.
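A rough sketch of the first option, assuming a single-threaded executor inside S3Exporter so uploads stay ordered (uploadToS3 is a placeholder for the real S3 call, and the 30-second wait is an arbitrary bound):
private final ExecutorService uploader = Executors.newSingleThreadExecutor();

public synchronized void flush() {
    if (buf.size() > 0) {
        final byte[] snapshot = buf.toByteArray(); // copy, so the upload can run after reset()
        buf.reset();
        uploader.submit(new Runnable() {
            public void run() {
                uploadToS3(snapshot); // the time-consuming S3 call, off the caller's thread
            }
        });
    }
}

// In the shutdown hook, after calling end():
uploader.shutdown();
try {
    uploader.awaitTermination(30, TimeUnit.SECONDS); // bounded wait for in-flight uploads
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}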
My question
Is my concern about scalability a real problem? (To make the question more concrete, let's say the data size is a few bytes and the TPS of calls to the export method is 500.)
If the answer to the first question is yes, what about my improvements: are they right? And how should I do the cleanup work to avoid losing data?
Scalability depends on requirements, constraints, desired service level, personal preferences, expected user growth rate, and especially money: given infinite resources, every piece of software can be scaled. You didn't mention any of these, so I guess you don't have any actual figures. In this phase, as a programmer, your job is to make a correct program that uses a predictable amount of resources.
Your program seems correct, and most of your assumptions are correct too. However, I suggest immediately storing chunks in some local persistent database (or the raw filesystem) and having a periodic job, running in a separate thread, that uploads groups of chunks to S3; also remove any shutdown hooks (you can use Camel for the boring parts). Such hooks are unreliable and should only be used as a last resort for quick, optional cleanup (optional in the sense that you must be prepared for the cleanup not to have run to completion).
Using a file instead of memory, your data can survive fatal errors, and the working memory required by your application is almost independent of the load: there is a negligible amount of extra CPU and some disk I/O, which is far cheaper than memory.
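A minimal sketch of that layout, assuming chunks are written as individual files in a spool directory and a scheduled task ships whatever it finds to S3 (the spool path, the 10-second interval, and the uploadToS3 helper are all placeholders):
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.UUID;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SpoolingExporter {
    private final File spoolDir = new File("/var/spool/s3-exporter"); // placeholder path
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public SpoolingExporter() {
        spoolDir.mkdirs();
        // Periodically upload whatever has been spooled; the data survives crashes and restarts
        scheduler.scheduleWithFixedDelay(new Runnable() {
            public void run() {
                for (File chunk : spoolDir.listFiles()) {
                    if (uploadToS3(chunk)) { // placeholder for the real S3 call
                        chunk.delete();      // remove only after a successful upload
                    }
                }
            }
        }, 10, 10, TimeUnit.SECONDS);
    }

    // Called by clients; each chunk goes straight to disk instead of staying in memory
    public void export(byte[] data) throws IOException {
        File chunk = new File(spoolDir, UUID.randomUUID().toString() + ".chunk");
        FileOutputStream out = new FileOutputStream(chunk);
        try {
            out.write(data);
        } finally {
            out.close();
        }
    }

    private boolean uploadToS3(File chunk) {
        // the actual S3 upload goes here; return true on success
        return true;
    }
}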
I am looping a method on a Thread which reads from a website (dynamically).
All the methods work perfectly, but my problem is that sometimes (3 out of 10 times) when I start the program it throws an IOException at me, even though I haven't changed the input data since the last known good execution. The exception comes from the method below:
public String readThisUrlContent() throws ExceptionHandler
{
    try {
        @SuppressWarnings("static-access")
        Document doc = Jsoup.connect(url).timeout(1000).get();
        return doc.body().text();
    } catch (IOException e) {
        throw new ExceptionHandler("IO Exception for reading the site in method setUrlContent in Url class");
    }
}
My best guess is that, since I'm reading more than one URL by looping this method, the timeout is sometimes too short (considering the internet speed etc., it sometimes doesn't work). But that's just my theory and it could be dead wrong, and even if it's correct I have no idea how to handle it.
The problem was exactly the time-to-live of the opened connection. Since I had other functions working at the same time, the program simply needed more connection time, so I increased the timeout to 5000 and also reduced the timer of a Time.Schedule call in another method, and then it worked.
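For reference, a sketch of the reworked method with the longer timeout, plus a simple retry as an extra safety net (the retry count of 3 is an arbitrary addition, not part of the original fix):
public String readThisUrlContent() throws ExceptionHandler
{
    final int attempts = 3; // arbitrary; try a few times before giving up
    for (int i = 0; i < attempts; i++) {
        try {
            Document doc = Jsoup.connect(url).timeout(5000).get(); // longer timeout
            return doc.body().text();
        } catch (IOException e) {
            // swallow and try again
        }
    }
    throw new ExceptionHandler("IO Exception for reading the site after " + attempts + " attempts");
}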
I have a situation where a large number of classes need to do (read-only) file access. This is part of a web app running on top of OSGi, so there will be a lot of concurrent access.
So I'm building an OSGi service to access the file system for all the other pieces that will need it and to provide centralized access, as this also simplifies configuration of file locations, etc.
It occurs to me that a multi-threaded approach makes the most sense along with a thread pool.
So the question is this:
If I do this and I have a service with an interface like:
FileService.getFileAsClass(class);
and the method getFileAsClass(class) looks kind of like this (this is a sketch; it may not be perfect Java code):
public <T> T getFileAsClass(Class<T> clazz) {
    Future<InputStream> classFuture = threadpool.submit(new Callable<InputStream>() {
        /* initialization block */
        {
            // any setup from configs.
        }

        /* implement Callable */
        public InputStream call() {
            InputStream stream = null; // new inputstream from file location
            boolean giveUp = false;
            while (null == stream && !giveUp) {
                // Code that tries to read in the file 4
                // times with a Thread.sleep(), then gives up.
                // This is here to make sure we aren't busy updating the file.
            }
            return stream;
        }
    });

    // once we have the file, convert it and return it.
    return InputStreamToClassConverter.<T>convert(classFuture.get());
}
Will that correctly wait until the relevant operation is done before calling InputStreamToClassConverter.convert?
This is my first time writing multithreaded Java code, so I'm not sure what to expect from some of the behavior. I don't care about the order in which threads complete, only that the file handling happens asynchronously and that, once the file pull is done, then and only then is the converter used.
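For context, a minimal standalone sketch showing that Future.get() blocks the calling thread until the submitted task has finished (a general illustration, not the asker's service):
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FutureGetDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);

        Future<String> result = pool.submit(new Callable<String>() {
            public String call() throws Exception {
                Thread.sleep(2000);      // simulate a slow file read
                return "file contents";
            }
        });

        // get() blocks here until call() has returned, or throws ExecutionException if it failed
        String contents = result.get();
        System.out.println(contents);

        pool.shutdown();
    }
}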
I am implementing REST through RESTlet. It is an amazing framework for building such a RESTful web service; it is easy to learn and its syntax is compact. However, I have found that when somebody or some program wants to access a resource, it usually takes time to print/output the XML. I use JaxbRepresentation. Here is my code:
@Override
@Get
public Representation toXml() throws IOException {
    if (this.requireAuthentication) {
        if (!this.app.authenticate(getRequest(), getResponse())) {
            return new EmptyRepresentation();
        }
    }

    // check if the representation has been requested before
    // and therefore the data is already in the cache
    Object dataInCache = this.app.getCachedData().get(getURI);
    if (dataInCache != null) {
        System.out.println("Representing from Cache");
        // this is a warning; unless we can check that dataInCache is of type T,
        // we can't get rid of this warning
        this.dataToBeRepresented = (T) dataInCache;
    } else {
        System.out.println("NOT IN CACHE");
        this.dataToBeRepresented = whenDataIsNotInCache();
        // automatically add data to cache
        this.app.getCachedData().put(getURI, this.dataToBeRepresented, cached_duration);
    }

    // now represent it (unless we previously returned the EmptyRepresentation)
    JaxbRepresentation<T> jaxb = new JaxbRepresentation<T>(dataToBeRepresented);
    jaxb.setFormattedOutput(true);
    return jaxb;
}
As you can see (and you might ask me about it), yes, I am implementing caching through Kitty-Cache. So if some XML is expensive to produce and looks like it will never change for seven decades, I cache it... I also use it for mostly static data. The maximum time a cached entry remains in memory is an hour.
Even when I cache the output, the response is sometimes unresponsive: it hangs, prints partially, and takes time before it prints the rest of the document. The XML document is accessible through a browser and also programmatically; it is accessed via GET.
What is actually the problem? I would also humbly like to hear an answer from a RESTlet developer, if possible. Thanks.
I don't know whether it is clear from the title, so I'll explain it in more depth.
First of all, the limitations: IBM Java 1.5.
This is the situation:
I have a Spring web service that receives a request with a PDF document in it. I need to put this PDF into some input directory that an AFP application (not important here) monitors. This AFP application takes that PDF, does something with it, and returns it to some output directory that I need to monitor. Monitoring the output directory could take some time, probably 30 seconds. Also, I know the exact file name that I expect to appear in the output directory. If nothing appears within 30 seconds, I would return some fault response.
Because of my poor knowledge of web services and multithreading, I don't know what possible problems I might run into.
Also, searching the internet, I see that most people recommend WatchService for directory monitoring, but that was introduced in Java 7.
Any suggestion, link, idea would be helpful.
So, the scenario is simple. In a main method, the following actions are done in order:
call the AFP service;
poll the directory for the output file;
deal with the output file.
We suppose here that outputFile is a File referring to the absolute path of the generated file; this method returns void, so adapt as needed:
// We poll every second, so...
private static final int SAMPLES = 30;

public void dealWithAFP(whatever, arguments, are, there)
    throws WhateverIsNecessary
{
    callAfpService(here);

    int i = 0;
    try {
        while (i < SAMPLES) {
            TimeUnit.SECONDS.sleep(1);
            if (outputFile.exists())
                break;
            i++;
        }
        if (!outputFile.exists())          // polled SAMPLES times without seeing the file
            throw new WhateverIsNecessary();
    } catch (InterruptedException e) {
        // Rethrow it if the method does; otherwise the minimum is to:
        Thread.currentThread().interrupt();
        throw new WhateverIsNecessary();
    }
    dealWithOutputFile(outputFile);
}