I have two (Java) processes on different JVMs running repeatedly. The first one regularly finds some "information" and needs to store it somewhere. The second process regularly reads this information to handle it. The intervals are more or less random, so process 1 may find three pieces of information before process 2 reads them, or vice versa.
My approach is to write this information to text files. But I am afraid that appending and reading the text files could accidentally happen at the same time, so that I run into locking problems. On the other hand, writing a new text file for each piece of information seems like overkill.
What would be a better solution?
EDIT: I am sorry, I did not make clear: The java processes run in different JVMs. They cannot see each other directly.
You can get this to work, provided you are careful with file handling and you don't have a high update rate, e.g. more than 10 updates per second.
Note: you could do it with file renaming instead of locks.
What would be a better solution?
Just about anything. SO is not for recommending things, but in this case I could recommend just about anything without more specific requirements. I could, for example, recommend my library Chronicle Queue, because I wrote it and I am sure it could do what you want; however, there are many possible alternatives.
I am sending about one line of text every minute.
So you can write a temporary file for each message, rename it when finished. The consumer can have a directory watcher so it knows as soon as you have done this. The consumer could delete the file when done. This has an overhead but it would be less than 10 ms.
If you want to keep a record of all messages, the producer can also write to a log file.
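A minimal sketch of that producer side, assuming a shared directory (the path and file name here are placeholders):
import java.nio.file.*;
public class Producer {
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get("/tmp/messages");   // shared directory (assumption)
        Files.createDirectories(dir);
        Path tmp = dir.resolve("msg-1.tmp");
        Files.write(tmp, "one line of text".getBytes());
        // The atomic rename makes the message visible to the consumer in one step,
        // so it can never observe a half-written file.
        Files.move(tmp, dir.resolve("msg-1.txt"), StandardCopyOption.ATOMIC_MOVE);
    }
}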
I'm building a library (Java 8) that needs to listen for modifications to several files. The library needs to parse the newly added lines in the corresponding files each time a modification occurs.
The files are like event logs. So they are always appended (no deletion or overriding)
I have two questions:
Is there a way to know which lines were newly added to a file when it is modified? (Is there functionality in Java NIO to identify this?)
I've seen solutions in the NIO package (the WatchService API) that can be used as a poll-based mechanism to listen for file modifications. Is there another native solution to make it push-based, so that I don't need to keep polling at intervals?
I'm mainly looking for a native solution, but third-party suggestions are also appreciated.
Thanks
You implement "push" by running a take() loop in its own thread. On each take() you start a new thread to process the changed lines.
To determine what changed, keep track of the file length each time: the changed data starts at offset previousSize and has length newSize - previousSize.
You'll probably want some kind of event coordination to prevent starting the process again before the previous instance completes. That could happen if there are multiple changes in a very short period of time.
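A minimal sketch of both ideas combined, assuming the watched directory is a placeholder and that files are only ever appended to:
import java.io.RandomAccessFile;
import java.nio.file.*;
import java.util.HashMap;
import java.util.Map;
public class LogTailer implements Runnable {
    private final Path dir = Paths.get("/var/log/app");     // placeholder directory
    private final Map<Path, Long> sizes = new HashMap<>();  // last seen length per file
    @Override
    public void run() {
        try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
            dir.register(watcher, StandardWatchEventKinds.ENTRY_MODIFY);
            while (true) {
                WatchKey key = watcher.take();               // blocks: this is the "push"
                for (WatchEvent<?> event : key.pollEvents()) {
                    Path file = dir.resolve((Path) event.context());
                    long previousSize = sizes.getOrDefault(file, 0L);
                    long newSize = Files.size(file);
                    if (newSize > previousSize) {
                        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
                            raf.seek(previousSize);          // jump straight to the new data
                            String line;
                            while ((line = raf.readLine()) != null) {
                                System.out.println("new line: " + line);
                            }
                        }
                        sizes.put(file, newSize);
                    }
                }
                key.reset();                                 // re-arm the watch key
            }
        } catch (Exception e) {
            Thread.currentThread().interrupt();
        }
    }
}
In a real version you would hand each batch of new lines to another thread, as described above, rather than printing them.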
I have to process around 2 million text files and generate their triples.
Suppose I have a txt file xyz.txt (one of the 2 million input files); it is processed as below:
start(xyz.txt)---->module1(xyz.tpd)------>module2(xyz.adv)-------->module3(xyz.tpl)
Suggest a logic or concept so that I can process these files faster and in an optimized way on an x64 Windows system with 4 GB of RAM.
module1 (working): it parses the txt file using a .bat file in which the parser is invoked; this runs as a separate system thread, and after 15 seconds it starts parsing the next txt file, and so on.
module2 (working): it accepts a .tpd file as input and generates an .adv file.
module3 (working): it accepts an .adv file as input and generates a .tpl file (triples).
Should I start a thread per txt file, or at some other point?
I am afraid that the CPU will get stuck in context switching.
Does anyone have a better approach I could try?
Use a ThreadPoolExecutor. Tune its parameters, like the number of active threads, to suit your environment and system.
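A minimal sketch of what that tuning might look like; the numbers are starting points to profile against, not recommendations:
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
public class TunedPool {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                cores,                                      // core pool size
                cores * 2,                                  // extra threads, used only when the queue fills up
                60L, TimeUnit.SECONDS,                      // keep-alive for the extra threads
                new LinkedBlockingQueue<Runnable>(10000));  // bounded backlog of pending files
        executor.execute(new Runnable() {
            public void run() {
                System.out.println("process one file here"); // stand-in for the real work
            }
        });
        executor.shutdown();
    }
}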
Most importantly, you have to write the program, profile it, and see where the bottleneck is. It is more than probable that the disk I/O operations will be the bottleneck and no amount of multithreading will solve your problems.
In that case using two (three? four?) separate hard drives may yield more speed gain than the best multithreaded solution.
Furthermore, the general rule is that you should optimize your application only when you have working code and you really know what to optimize. Profile, profile, profile.
Taking the future multithreaded optimizations into account when writing is OK; the architecture should be flexible enough to allow for future optimizations.
There is not much told here about your hardware environment; but the basic solution would be to use a fixed-size ExecutorService, where the size would, at first, be the number of your execution units:
private static final int NR_CPUS = Runtime.getRuntime().availableProcessors();
// Then:
final ExecutorService executor = Executors.newFixedThreadPool(NR_CPUS);
Then, for each file, you can create a Runnable to process it, and submit it to the thread pool using its .execute() method.
Note that .execute() is asynchronous; if the submitted runnable cannot be run right now, it will be queued.
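A minimal sketch of that loop; inputFiles and processFile are placeholders for your own file listing and the module1 -> module2 -> module3 pipeline:
// continues from the executor created above
for (final File file : inputFiles) {
    executor.execute(new Runnable() {
        @Override
        public void run() {
            processFile(file); // hypothetical: .txt -> .tpd -> .adv -> .tpl
        }
    });
}
executor.shutdown(); // accept no new tasks; queued files still get processed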
..sounds like a typical batch application needed for data integration. Although I do not intend to throw hyperlinks at you without completely understanding your needs, you probably need a solution which works in a single VM now and which, over time, you would like to extend to multiple VMs/machines.. and maybe we are not dealing with PBs of data to start with. Try Spring Batch: not only will it solve the problem in the given context, you will also learn to structure your thoughts (think vocabulary!) to solve similar problems.
As a starting point, I would create one IO thread and a pool of CPU threads. The IO thread reads in text files and offers them to a BlockingQueue, while the CPU threads take the files from the BlockingQueue and process them.
Then profile the application to see how many CPU threads you should use to keep pace with the IO thread. You can also determine this dynamically, e.g. start with one CPU thread and start another when the size of the BlockingQueue exceeds a threshold (probably something along the lines of 20 files).
It's possible that you'll find you only need one CPU thread to keep pace with the IO thread, in which case your program is IO bound and you'll need to e.g. place the text files next to each other on disk (so that you can use sequential reads on all but the first file) or put them on separate disks in order to speed up the application. One idea is to zip the files together and read them in with a ZipInputStream - this will reduce the number of disk seeks when reading the files and will also reduce the amount of data you need to read.
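A minimal sketch of that starting point; processLines is a stand-in for the CPU-bound work, and clean shutdown (e.g. a poison pill) is left out:
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.*;
public class IoCpuPipeline {
    public static void start(final List<Path> files, int cpuThreads) {
        final BlockingQueue<List<String>> queue = new LinkedBlockingQueue<>(20);
        // IO thread: reads one file at a time and offers its lines to the queue.
        new Thread(() -> {
            try {
                for (Path file : files) {
                    queue.put(Files.readAllLines(file));
                }
            } catch (Exception e) {
                Thread.currentThread().interrupt();
            }
        }).start();
        // CPU threads: take file contents off the queue and process them.
        ExecutorService cpuPool = Executors.newFixedThreadPool(cpuThreads);
        for (int i = 0; i < cpuThreads; i++) {
            cpuPool.execute(() -> {
                try {
                    while (true) {
                        processLines(queue.take()); // hypothetical CPU-bound work
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }
    private static void processLines(List<String> lines) {
        // stand-in for the per-file computation
    }
}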
I am working on an Android app whose purpose is to download chunks (parts of a video file) from 2 servers, append them in order (into a main video file) after each one is downloaded, and finally play this video file while downloading continues.
This works well when downloading is done serially, using two different threads (one for each server) that perform the downloading. I want to know how to achieve the above with concurrent downloading instead of serial.
That is to download chunks from servers at the same time and in order. For example, for the same period of time download chunk0, chunk1 & chunk2 from server1 (which let's say is 3 times faster than server2) and chunk3 from server2, so that we totally use all the available bandwidth of the 2 servers at this period of time. And this process repeats until all chunks are downloaded.
By using threads and join, downloading is serial, as I said above. To make it concurrent, I tried removing join from each thread, but then it doesn't download the chunks in order and also downloads from only one server, not both. AsyncTask is not a solution either, as it also doesn't download chunks in order.
So, is there any way to achieve this concurrent, in-order downloading of chunks as I described above? Has anyone done something like this as a project, so as to know the answer for sure?
You may use the popular technique among download accelerators.
In general, the idea is to request chunks from each server using the Range HTTP header. (The server responds with the Accept-Ranges header when it is capable of processing the Range header accordingly.) (This blog has a good explanation of that.)
Every thread/runnable/callable has to know which chunk is its responsibility (first byte position + length?) because each one will have to write its own part of the file.
Then there will be a decision to be made, you can:
Write the file using an instance of RandomAccessFile in each thread, obviously positioning the file pointer at the first byte position of its chunk (with the seek method; see the sketch after the note below), or..
Be sure that you have a single worker thread (see Executors and submit) that is in charge of writing the bytes handed to it by each download thread. Since at the moment of writing you will use seek to move the file pointer to the correct position, there will be no overlapping errors.
NOTE: If you want to be able to start the playback when you have your first chunk, you may do it by executing that code after the first chunk thread download+write has finished.
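A minimal sketch of that approach, assuming the URL, offsets and shared RandomAccessFile are set up by whatever schedules the chunks; error handling is kept minimal:
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.net.HttpURLConnection;
import java.net.URL;
public class ChunkDownloader implements Runnable {
    private final String fileUrl;   // URL of the video on one of the servers
    private final long firstByte;   // first byte of this thread's chunk
    private final long length;      // chunk length in bytes
    private final RandomAccessFile out;
    public ChunkDownloader(String fileUrl, long firstByte, long length, RandomAccessFile out) {
        this.fileUrl = fileUrl;
        this.firstByte = firstByte;
        this.length = length;
        this.out = out;
    }
    @Override
    public void run() {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(fileUrl).openConnection();
            // Ask the server for exactly this byte range.
            conn.setRequestProperty("Range", "bytes=" + firstByte + "-" + (firstByte + length - 1));
            try (InputStream in = conn.getInputStream()) {
                byte[] buffer = new byte[8192];
                long written = 0;
                int n;
                while ((n = in.read(buffer)) != -1) {
                    synchronized (out) {                   // one writer at a time
                        out.seek(firstByte + written);     // position at this chunk's offset
                        out.write(buffer, 0, n);
                    }
                    written += n;
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Here the threads share one RandomAccessFile guarded by a lock; a per-thread instance, as in option 1, works the same way without the synchronized block.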
I have a huge line-separated text file and I want to make some calculations on each line. I need to make a multithreaded program to process it because it is the processing of each line that takes the most time to complete rather than reading each line. (the bottleneck lies in the CPU processing, rather than the IO)
There are two options I came up with:
1) Open the file from main thread, create a lock on the file handle and pass the file handle around the worker threads and then let each worker read-access the file directly
2) Create a producer / consumer setup where only the main thread has direct read-access to the file, and feeds lines to each worker thread using a shared queue
Things to know:
I am really interested in performance for this task
Each line is independent
I am working on this in C++, but I guess the issue here is a bit language-independent
Which option would you choose and why?
I would suggest the second option, since it is clearer design-wise and less complicated than the first. The first option is less scalable and requires additional communication among the threads to synchronize their progress through the file's lines. In the second option you have one dispatcher which deals with IO and hands work to the worker threads, and each computational thread is completely independent of the others, which allows you to scale. Moreover, the second option separates your logic more clearly.
If we are talking about a massively large file which needs to be processed by a large cluster, MapReduce is probably the best solution.
The framework allows you great scalability, and already handles all the dirty work of managing the workers and tolerating failures for you.
The framework is specifically designed to receive files read from a file system [originally GFS] as input.
Note that there is an open source implementation of map-reduce: Apache Hadoop
If each line is really independent and processing is much slower than reading the file, what you can do is read all the data at once and store it in an array, such that each line is an element of the array.
Then all your threads can do the processing in parallel. For example, if you have 200 lines and 4 threads, each thread could perform the calculation on 50 lines. Moreover, since this method would be embarrassingly parallel, you could easily use OpenMP for it.
I would suggest the second option because it is definitely better design-wise and would allow you to have better control over the work that the worker threads are doing.
Moreover, it would increase performance, since the inter-thread communication in that case is the minimum of the two options you described.
Another option is to memory-map the file and maintain a shared structure that properly handles mutual exclusion between the threads.
I am stuck on a serious problem. I am sending requests to a server, and each request contains a URL as its data. To explain: I have a file which contains some URLs in sequential order, and I have to read that sequential data using threads. Now the problem is that there are one hundred thousand URLs, and I have to send each URL to the server within a particular time (say 30 seconds). So I have to create threads which will complete the task in the desired time. But I have to read the file in such a way that if the first thread serves the first 100 URLs, then the second thread serves the next 100 URLs, and the other threads likewise. And I am doing this with socket programming, so there is only one port I can use at a time. How do I solve this problem? Give me a nice and simple idea, and if possible an example as well.
Thanks in Advance
Nice and simple idea (if I understand your question correctly): you can use a LinkedList as a queue. Read the URLs in from the file and put them in the list. Spawn your threads, which then pull (and remove) the next 100 URLs from the list. LinkedList is not thread-safe though, so you must synchronize access yourself.
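A minimal sketch of that queue, with the synchronization mentioned above; the send-side logic is left to the caller:
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;
public class UrlDispatcher {
    private final Queue<String> urls = new LinkedList<>();
    public UrlDispatcher(List<String> allUrls) {
        urls.addAll(allUrls); // e.g. the lines read from the file
    }
    // Each worker thread calls this to claim its next batch of 100 URLs.
    public List<String> nextBatch() {
        List<String> batch = new LinkedList<>();
        synchronized (urls) { // LinkedList is not thread-safe, so guard all access
            for (int i = 0; i < 100 && !urls.isEmpty(); i++) {
                batch.add(urls.poll());
            }
        }
        return batch;
    }
}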
One thing you could look into is the fork/join framework. The way the Java tutorials explain it: "It is designed for work that can be broken into smaller pieces recursively. The goal is to use all the available processing power to make your application wicked fast." Then all you really need to do is figure out how to break up your tasks.
http://download.oracle.com/javase/tutorial/essential/concurrency/forkjoin.html
you can find the jar for this at: http://g.oswego.edu/dl/concurrency-interest/
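A minimal sketch of how the URL work from the question could be broken up recursively; the println is a stand-in for the real send logic:
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;
public class UrlTask extends RecursiveAction {
    private static final int THRESHOLD = 100; // batch size from the question
    private final List<String> urls;
    public UrlTask(List<String> urls) {
        this.urls = urls;
    }
    @Override
    protected void compute() {
        if (urls.size() <= THRESHOLD) {
            for (String url : urls) {
                System.out.println("sending " + url); // stand-in for the real send
            }
        } else {
            int mid = urls.size() / 2; // split the work in half and recurse
            invokeAll(new UrlTask(urls.subList(0, mid)),
                      new UrlTask(urls.subList(mid, urls.size())));
        }
    }
    public static void main(String[] args) {
        List<String> urls = Arrays.asList("http://example.com/a", "http://example.com/b");
        new ForkJoinPool().invoke(new UrlTask(urls));
    }
}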