I have an application (a servlet, but that's not very important) that downloads a set of files and parses them to extract information. Up to now, I did those operations in a loop:
- fetching a new file from the Internet
- analyzing it.
A multi-threaded download manager seems like a better solution to this problem, and I would like to implement it in the fastest way possible.
Some of the downloads are dependent on others (so this set is partially ordered).
Multi-threaded programming is hard, and if I could find an API to do that I would be quite happy. I need to put a group of files (ordered) in a queue and get the first group of files that is completely downloaded.
Do you know of any library I could use to achieve that?
Regards,
Stéphane
You could do something like:
BlockingQueue<Download> queue = new LinkedBlockingQueue<Download>(); // BlockingQueue is an interface; use an implementation
ExecutorService pool = Executors.newFixedThreadPool(5);

Download obj = new Download(queue);
pool.execute(obj); // start download and place on queue once completed

Download data = queue.take(); // blocks until a completely downloaded item is available
You may have to use a different kind of queue if the speed of each download is not the same. A plain BlockingQueue implementation is first in, first out, I believe.
You may want to look into using a PriorityBlockingQueue, which will order the Download objects according to their natural ordering (their compareTo method from Comparable). See the API here for more details.
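For example, here is a minimal sketch of how the pieces could fit together; the Download class, its url/priority fields and the fetch step are hypothetical assumptions, not an existing API:

import java.util.concurrent.*;

// Hypothetical download task: fetches one file, then places itself on the
// completion queue. A PriorityBlockingQueue hands completed downloads back
// in priority order via compareTo().
class Download implements Runnable, Comparable<Download> {
    final String url;
    final int priority;                      // assumption: lower value = taken first
    private final BlockingQueue<Download> completed;

    Download(String url, int priority, BlockingQueue<Download> completed) {
        this.url = url;
        this.priority = priority;
        this.completed = completed;
    }

    public void run() {
        // ... fetch `url` and store the result somewhere ...
        completed.add(this);                 // signal that this download is done
    }

    public int compareTo(Download other) {
        return Integer.compare(priority, other.priority);
    }
}

class Downloader {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Download> completed = new PriorityBlockingQueue<Download>();
        ExecutorService pool = Executors.newFixedThreadPool(5);

        pool.execute(new Download("http://example.com/a", 1, completed));
        pool.execute(new Download("http://example.com/b", 2, completed));

        Download first = completed.take();   // blocks until something has finished
        System.out.println("First completed: " + first.url);
        pool.shutdown();
    }
}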
Hope this helps.
I'd like to learn how crawler4j works.
Does it fetch a web page, then download its content and extract data from it?
What about the .db and .csv files and their structures?
In general, what sequence does it follow?
Please, I would like a descriptive explanation.
Thanks
General Crawler Process
The process for a typical multi-threaded crawler is as follows:
We have a queue data structure, which is called the frontier. Newly discovered URLs (or start points, so-called seeds) are added to this data structure. In addition, a unique ID is assigned to every URL in order to determine whether a given URL was previously visited.
Crawler threads then obtain URLs from the frontier and schedule them for later processing.
The actual processing starts:
The robots.txt for the given URL is fetched and parsed to honour exclusion criteria and be a polite web crawler (configurable).
Next, the thread checks for politeness, i.e. the time to wait before visiting the same host again.
The actual URL is visited by the crawler and the content is downloaded (this can be literally anything).
If we have HTML content, it is parsed and potential new URLs are extracted and added to the frontier (in crawler4j this can be controlled via shouldVisit(...)).
The whole process is repeated until no new URLs are added to the frontier (a minimal sketch of this loop follows below).
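To make that sequence concrete, here is a minimal single-threaded sketch of the loop in plain Java; the fetch/extractLinks/shouldVisit helpers are placeholders I am assuming, not crawler4j API:

import java.util.*;

public class SimpleCrawler {
    public static void main(String[] args) {
        Deque<String> frontier = new ArrayDeque<String>(); // URLs still to visit
        Set<String> seen = new HashSet<String>();          // URLs already scheduled

        frontier.add("https://example.com/");              // seed
        seen.add("https://example.com/");

        while (!frontier.isEmpty()) {
            String url = frontier.poll();

            // 1. check robots.txt and the politeness delay for url's host (omitted)
            // 2. download the content
            String html = fetch(url);                      // hypothetical helper
            // 3. parse it and queue newly discovered links
            for (String link : extractLinks(html)) {       // hypothetical helper
                if (shouldVisit(link) && seen.add(link)) {
                    frontier.add(link);
                }
            }
        }
    }

    static String fetch(String url) { /* e.g. via HttpURLConnection */ return ""; }
    static List<String> extractLinks(String html) { /* e.g. with jsoup */ return Collections.emptyList(); }
    static boolean shouldVisit(String url) { return url.startsWith("https://example.com/"); }
}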
General (Focused) Crawler Architecture
Besides the implementation details of crawler4j, a more or less general (focused) crawler architecture (on a single server/PC) looks like this:
Disclaimer: Image is my own work. Please respect this by referencing this post.
In my project, the webapp provides "download CSV file" functionality based on a search by the end user. It does the following:
A file named "download.csv" is opened (not using File.createTempFile(String prefix, String suffix, File directory), but always just "download.csv"), rows of data from a SQL recordset are written to it, and then FileUtils is used to copy that file's content to the servlet's OutputStream.
The recordset is based on search criteria, like 1st Jan to 30th March.
Can this lead to a potential case where the file contains the contents of two users who submit different date ranges/other filters at the same time, so that the JVM processes the requests concurrently?
Right now we are in dev and there is very little data.
I know we can write automated tests to test this, but I wanted to know the theory.
I suggested using the OutputStream of the HTTP response (pass that to the service layer as a vanilla OutputStream and write to it directly, or wrap it in a BufferedWriter and then write to it).
The only downside is that the data will be written more slowly than with the file copy.
If there is more data in the recordset, it will take more time to iterate through it. But the total time of the request should be less? (The time to write to the output stream or to the file is the same, and you save the time to copy from the file to the servlet output stream.)
Anyone done testing around this and have test cases or solutions to share?
Well, that is a tricky question if you really want to go into the depths of both parts.
Concurrency
As you wrote, this "same name" thing could lead to a race condition if you are working on a multi-threaded system (almost all systems are nowadays). I have seen code written like this and it can cause a lot of trouble. The resulting file could contain not only lines from both searches but interleaved characters as well.
Examples:
Thread 1 wants to write: 123456789\n
Thread 2 wants to write: abcdefghi\n
The output could vary in ways like these:
1st case:
123456789
abcdefghi
2nd case:
1234abcd56789
efghi
I would definitely use at least unique names (UUID.randomUUID()) to "hot-fix" the problem.
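For instance, a minimal sketch of that hot-fix (the temp directory and name pattern are just assumptions):

import java.io.File;
import java.util.UUID;

public class UniqueExportFile {
    public static File newExportFile() {
        // One file per request, so two concurrent searches can never share a file.
        return new File(System.getProperty("java.io.tmpdir"),
                "download-" + UUID.randomUUID() + ".csv");
    }
}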
Disk IO
Disk IO is a tricky thing if you go in-depth. The speeds can vary over a wide range. In the JVM you can have blocking and non-blocking IO as well. The blocking kind waits until the data is really on the disk, while the other does some "magic" to flush the file later. There is a good read here.
TL;DR: As a rule of thumb it is better to keep things in memory (if they fit) and not bother with the disk. If you use per-request (thread) memory for that purpose, you avoid the concurrency problem as well. So in your case it could be better to rewrite the given part to use memory only and write straight to the output.
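A minimal sketch of that idea, streaming the CSV straight to the servlet response; the search/recordset iteration is only a commented placeholder assumption:

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class CsvDownloadServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        resp.setContentType("text/csv");
        resp.setHeader("Content-Disposition", "attachment; filename=\"download.csv\"");

        // Each request gets its own writer, so concurrent searches cannot mix rows.
        PrintWriter out = resp.getWriter();
        out.println("date,amount");               // header row
        // for (Row row : runSearch(req)) {       // hypothetical recordset iteration
        //     out.println(row.getDate() + "," + row.getAmount());
        // }
        out.flush();
    }
}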
Goal
I have the task of finding duplicate entries within import files, and at a later stage, duplicate entries in these import files compared to a global database. The data inside the files is personal information like name, email, address, etc. The data is not always complete, and is often spelled incorrectly.
The files will be uploaded by external users through a web form. The user needs to be notified when the process is done, and he/she has to be able to download the results.
In addition to solving this task, I need to assess the suitability of Apache Beam for it.
Possible solution
I was thinking about the following: the import files are uploaded to S3, and the pipeline will either get the file location as a pub/sub event (Kafka queue) or watch S3 (if possible) for incoming files.
Then the file is read by one PTransform and each line is pushed into a PCollection. As a side output I would update a search index (in Redis or some such). The next transform would access the search index and try to find matches. The end results (unique values, duplicate values) are written to an output file on S3, and the index is cleared for the next import.
Questions
Does this approach make sense - is it idiomatic for Beam?
Would Beam be suitable for this processing?
Any improvement suggestions for the above?
I would need to track the file name/ID to notify the user at the end. How can I move this metadata through the pipeline? Do I need to create an "envelope" object for metadata and payload, and use this object in my PCollection?
The incoming files are unbounded, but the file contents themselves are bounded. Is there a way to find out the end of the file processing in an idiomatic way?
Does this approach make sense - is it idiomatic for Beam?
This is a subjective question. In general, I would say no, this is not idiomatic for Apache Beam. Apache Beam is a framework for defining ETL pipelines. The Beam programming model has no opinions or built-in functionality for deduplicating data. Deduplication is achieved through implementation (business logic code you write) or a feature of a data store (a UNIQUE constraint, SELECT DISTINCT in SQL, or key/value storage).
Would Beam be suitable for this processing?
Yes, Beam is suitable.
Any improvement suggestions for the above?
I do not recommend writing to a search index in the middle of the pipeline. By doing this and then attempting to read the data back in the following transform, you've effectively created a cycle in the DAG. The pipeline may suffer from race conditions. It is less complex to have two separate pipelines - one to write to the search index (deduplicate) and a second one to write back to S3.
I would need to track the file name/ID to notify the user at the end. How can I move this metadata through the pipeline? Do I need to create an "envelope" object for metadata and payload, and use this object in my PCollection?
Yes, this is one approach. I believe you can get the file metadata via the ReadableFile class.
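For example, a minimal sketch reading the files via FileIO so the file name travels with every line; the bucket path is an assumption, and reading from s3:// additionally requires the Beam AWS filesystem module:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;

public class ReadWithFileName {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        p.apply(FileIO.match().filepattern("s3://my-bucket/imports/*.csv")) // assumed location
         .apply(FileIO.readMatches())
         .apply(ParDo.of(new DoFn<FileIO.ReadableFile, KV<String, String>>() {
             @ProcessElement
             public void processElement(ProcessContext c) throws Exception {
                 FileIO.ReadableFile file = c.element();
                 String fileName = file.getMetadata().resourceId().getFilename();
                 for (String line : file.readFullyAsUTF8String().split("\n")) {
                     c.output(KV.of(fileName, line)); // "envelope": file name + payload line
                 }
             }
         }));
        p.run().waitUntilFinish();
    }
}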
The incoming files are unbounded, but the file contents themselves are bounded. Is there a way to find out the end of the file processing in an idiomatic way?
I'm not sure off the top of my head, but I don't think this is possible for a pipeline executing in streaming mode.
I have a customer who FTPs a file over to our server. I have a route defined to select certain files from this directory and move them to a different directory to be processed. The problem is that the route picks the file up as soon as it sees it and doesn't wait until the FTP transfer is complete. The result is a 0-byte file in the path described in the to URI. I have tried each of the readLock options (markerFile, rename, changed, fileLock) but none have worked. I am using the Spring DSL to define my Camel routes. Here is an example of one that is not working. The Camel version is 2.10.0.
<route>
<from uri="file:pathName?initialDelay=10s&move=ARCHIVE&sortBy=ignoreCase:file:name&readLock=fileLock&readLockCheckInterval=5000&readLockTimeout=10m&filter=#FileFilter" />
<to uri="file:pathName/newDirectory/" />
</route>
Any help would be appreciated. Thanks!
Just to note: at one point this route was running on a different server and I had to FTP the file to another server that processed it. When I was using the FTP component in Camel, that route worked fine; that is, it did wait until the file was received before doing the FTP. I had the same options defined on my route. That's why I am thinking there should be a way to do it, since the FTP component uses the file component options in Camel.
I took @PeteH's suggestion #2 and did the following. I am still hoping there is another way, but this will work.
I added the following method, which returns a Date that is the current time minus x seconds:
public static Date getDateMinusSeconds(Integer seconds) {
    // Adds the given (negative) number of seconds to the current time,
    // e.g. getDateMinusSeconds(-30) returns "30 seconds ago".
    Calendar cal = Calendar.getInstance();
    cal.add(Calendar.SECOND, seconds);
    return cal.getTime();
}
Then within my filter I check whether the initial filtering is true. If it is, I compare the file's last modified date to getDateMinusSeconds(). I return false from the filter if the comparison is true.
if (filter) {
    // Skip files modified within the last 30 seconds; they may still be uploading.
    if (new Date(pathname.getLastModified()).after(DateUtil.getDateMinusSeconds(-30))) {
        return false;
    }
}
I have not done any of this in your environment, but have had this kind of problem before with FTP.
The better option of the two I can suggest is if you can get the customer to send two files. File1 is their data, File2 can be anything. They send them sequentially. You trap when File2 arrives, but all you're doing is using it as a "signal" that File1 has arrived safely.
The less good option (and this is the one we ended up implementing, because we couldn't control the files being sent) is to write your code such that you refuse to process any file until its last-modified timestamp is at least x minutes old. I think we settled on 5 minutes. This is pretty horrible, since you're essentially polling: checking, sleeping, checking, and so on.
But the problem you describe is quite well known with FTP. Like I say, I don't know whether either of these approaches will work in your environment, but certainly at a high level they're sound.
The Camel FTP component inherits its options from the file component, whose documentation describes this very thing near the top:
Beware: the JDK File IO API is a bit limited in detecting whether another application is currently writing/copying a file. The implementation can also differ depending on the OS platform. This could lead to Camel thinking the file is not locked by another process and starting to consume it. Therefore you have to do your own investigation into what suits your environment. To help with this, Camel provides different readLock options and the doneFileName option that you can use. See also the section "Consuming files from folders where others drop files directly".
To get around this problem I had my publishers put out a "done" file. This solves the problem.
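For reference, a minimal sketch of the doneFileName option in the Java DSL (the paths and done-file naming are assumptions; this goes inside a RouteBuilder's configure(), and the same option can be added to the Spring XML URI):

// Only pick up a file once the sender has also written <fileName>.done
from("file:pathName?doneFileName=${file:name}.done&move=ARCHIVE")
    .to("file:pathName/newDirectory/");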
One way to do this is to use a watcher, which triggers the job once a file is deposited, and to delay consuming the file by a significant amount of time, to be sure that its upload has finished.
from("file-watch://{{ftp.file_input}}?events=CREATE&recursive=false")
.id("FILE_WATCHER")
.log("File event: ${header.CamelFileEventType} occurred on file ${header.CamelFileName} at ${header.CamelFileLastModified}")
.delay(20000)
.to("direct:file_processor");
from("direct:file_processor")
.id("FILE_DISPATCHER")
.log("Sending To SFTP Uploader")
.to("sftp://{{ftp.user}}#{{ftp.host}}:{{ftp.port}}//upload?password={{ftp.password}}&fileName={{file_pattern}}-${date:now:yyyyMMdd-HH:mm}.csv")
.log("File sent to SFTP");
It's never too late to respond.
Hope it can help someone struggling in the deepest, creepiest places of the SFTP world...
I want to list a large number of files (10-20 thousand or so) contained in a single directory, quickly and efficiently.
I have read quite a few posts, especially over here, explaining the shortcomings of Java in achieving this, basically due to the underlying filesystem (and that Java 7 probably has some answer to it).
Some of the posts here have proposed alternatives like native calls or piping etc., and I do understand that the best possible option under normal circumstances is the Java call
- String[] sList = file.list(); which is only slightly better than file.listFiles();
Also, there was a suggestion to use multithreading (also an ExecutorService).
Well, here the issue is that I have very little practical know-how of coding the multithreaded way, so my logic is bound to be incorrect. Still, I tried it this way:
created a list of a few thread objects
ran a loop over this list, calling .start() and immediately .sleep(500)
in the thread class, overrode the run method to include the .list()
Something like this, the caller class:
String[] strList = null;

for (int i = 0; i < 5; i++) {
    ThreadLister tL = new ThreadLister(fit);
    threadList.add(tL);
}
for (int j = 0; j < threadList.size(); j++) {
    thread = threadList.get(j);
    thread.start();
    thread.sleep(500);
}
strList = thread.fileList;
and the Thread class as:
public String[] fileList;

public ThreadLister(File f) {
    this.f = f;
}

public void run() {
    fileList = f.list();
}
I might be way off here with the multithreading, I guess.
I would very much appreciate a solution to my requirement using multithreading. An added benefit is that I would learn a bit more about practical multithreading.
Query Update
Well, obviously multithreading isn't going to help me (well, I now realise it's not actually a solution). Thank you for helping me rule out threading.
So I tried:
1. FileUtils.listFiles() from Apache Commons - not much difference.
2. A native call, viz. exec("cmd /c dir /B .\\Test") - this executes fast, but then reading the stream using a while loop takes ages.
What I actually require are the filenames matching a certain filter amongst about 100k files in a single directory. So I am using something like File.list(new FilenameFilter()).
I believe the FilenameFilter brings no benefit, as the call will still go through all the files first and only then give out the filtered output.
Yes, I understand I need a different approach to storing these files. One option I can try is storing these files in multiple directories; I am yet to try this (I don't know if it will help enough) - as suggested by Boris earlier.
What else could be a better option? Will a native call on Unix, ls with a filename match, work effectively? I know it doesn't work on Windows, I mean unless we are searching in the same directory.
Kind Regards
Multi-threading is useful for listing multiple directories. However, you cannot split a single call to a single directory, and I doubt it would be much faster if you could, as the OS returns the files in any order it pleases.
The first thing to learn about multi-threading is that not all solutions will be faster or simpler just by using multiple threads.
As a completely different suggestion: did you try using the Apache Commons FileUtils?
http://commons.apache.org/io/api-release/index.html Check out the method FileUtils.listFiles().
It will list all the files in a directory. Maybe it is fast enough and optimized enough for your needs. Maybe you really don't need to reinvent the wheel and the solution is already out there?
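For example, a minimal sketch using Commons IO to list only the files matching a name pattern (the directory and the "T*" pattern are assumptions):

import java.io.File;
import java.util.Collection;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.filefilter.WildcardFileFilter;

public class ListByPattern {
    public static void main(String[] args) {
        // List only files whose names start with "T"; null = do not recurse into subdirectories.
        Collection<File> matches = FileUtils.listFiles(
                new File("/path/to/dir"),           // assumed directory
                new WildcardFileFilter("T*"),       // file name filter
                null);
        System.out.println(matches.size() + " matching files");
    }
}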
What I eventually did is:
1. As a quick fix, to get over the problem for the moment, I used a native call to write all the filenames into a temporary text file and then used a BufferedReader to read each line.
2. Wrote a utility to archive the inactive files (most of them) to some other archive location, thereby reducing the total number of files in the active directory, so that the normal list() call returns much more quickly.
3. But going forward, as a long-term solution, I will modify the way all these files are stored and create a kind of directory hierarchy in which each directory holds comparatively few files, so that list() can work very fast (a small sketch of this idea is shown below).
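For illustration, a small sketch of one way such a hierarchy could be organized (the hash bucketing scheme and paths are only assumptions, not the actual layout):

import java.io.File;

public class FileBuckets {
    // Map a file name to a subdirectory, e.g. "invoice-4711.xml" -> baseDir/a3/invoice-4711.xml,
    // so no single directory ends up holding 100k entries.
    static File bucketFor(File baseDir, String fileName) {
        String bucket = String.format("%02x", fileName.hashCode() & 0xFF); // 256 buckets
        File dir = new File(baseDir, bucket);
        dir.mkdirs();
        return new File(dir, fileName);
    }

    public static void main(String[] args) {
        System.out.println(bucketFor(new File("/data/files"), "invoice-4711.xml"));
    }
}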
One thing that came to my mind, and that I noticed while testing, is that the first call to list() takes a long time, but subsequent requests are very, very fast. That makes me believe that the JVM intelligently returns the list, which has remained on the heap. I tried a few things, like adding files to the directory or changing the File variable name, but the response was still instant. So I believe that this array sits on the heap until it is gc'ed and Java intelligently responds to the same request. <*Am I right, or is that not how it behaves? Some explanation please.*>
Because of this, I thought that if I wrote a small program to fetch this list once every day and keep a static reference to it, then this array would not be gc'ed and every request to retrieve the list would be fast. <*Again, some comments/suggestions appreciated.*>
Is there a way to configure Tomcat so that the GC collects all other non-referenced objects but not certain specified ones? Somebody told me something like this is implemented in Linux, obviously at the OS level, but I don't know whether that's true or not.
Which file system are you using? Each file system has its own limitation on the number of files/folders a directory can have (including the directory depth). So I am not sure how you could create that many, and, if they were created through some program, whether you were able to read all the files back.
As suggested above, the FilenameFilter is a post-hoc file name filter, so I am not sure it would be of any help (although you are probably creating smaller lists of files), as each listFiles() call would still fetch the complete list.
For example:
1) Say Thread 1 is capturing the list of file names starting with "T*": the listFiles() call would retrieve all of the thousands of file names and only then filter them according to the FilenameFilter criteria.
2) Thread 2, if capturing the list of file names starting with "S*", would repeat all the steps from 1).
So you end up reading the directory listing multiple times, putting more and more load on the heap, JVM native calls, the file system, etc.
If possible, the best suggestion would be to re-organize the directory structure.