After a lot of research, I learned that String is immutable and that StringBuffer is more efficient than String if the program involves many string computations.
But my question is slightly different from these.
I have a function to which I pass a String. The string is the text of an article (roughly 3000-5000 characters). The function runs in threads; that is, it is called many times, each time with a different text. The later-stage computations in the function are very heavy. When I run my code with a large number of threads, I get the error: GC Overhead Limit Exceeded.
Since I can't reduce the computations in the later stages of the function, my question is: will it really help if I change the type of the text from String to StringBuffer? Note that I don't do any concatenation on the text string.
I have posted a small snippet of my code:
public static List<Thread> thread_starter(List<Thread> threads,String filename,ArrayList<String> prop,Logger L,Logger L1,int seq_no)
{ String text="";
if(prop.get(7).matches("txt"))
text=read_contents.read_from_txt(filename,L,L1);
else if(prop.get(7).matches("xml"))
text=read_contents.read_from_xml(filename,L,L1);
else if(prop.get(7).matches("html"))
text=read_contents.read_from_html(filename,L,L1);
else
{
System.out.println("not a valid config");
L1.info("Error : config file not properly defined for i/p file type");
}
/*TODO */
//System.out.println(text);
/*TODO CHANGES TO BE DONE HERE */
if(text.length()>0)
{
Runnable task = new MyRunnable(text,filename,prop,filename,L,L1,seq_no);
Thread worker = new Thread(task);
worker.start();
// Remember the thread for later usage
threads.add(worker);
}
else
{
main_entry_class.file_mover(filename, false);
}
return threads;
}
And I'm calling the above function repeatedly using the following code:
List<Thread> threads = new ArrayList<Thread>();
thread_count=10;
int file_pointer=0;// INTEGER POINTER VARIABLE
do
{
if(file.size()<=file_pointer)
break;
else
{ String file_name=file.get(file_pointer);
threads=thread_starter(threads,file_name,prop,L,L1,seq_no);
file_pointer++;
seq_no++;
}
}while(check_status(threads,thread_count)==true);
And the check_status function:
public static boolean check_status(List<Thread> threads,int thread_count)
{
int running = 0;
boolean flag=false;
do {
running = 0;
for (Thread thread : threads) {
if (thread.isAlive()) {
//ThreadMXBean thMxB = ManagementFactory.getThreadMXBean();
//System.out.println(thMxB.getCurrentThreadCpuTime());
running++;
}
}
if(Thread.activeCount()-1<thread_count)
{
flag=true;
break;
}
} while (running > 0);
return flag;
}
If you are getting the error GC Overhead Limit Exceeded, you may first try a moderate heap size such as -Xmx512m. Also, if you have a lot of duplicate strings, you can call String.intern() on them.
You may check this doc:
-XX:+UseConcMarkSweepGC
Check out this link to learn what the GC Overhead Limit Exceeded error is: GC overhead limit exceeded.
As the page suggests, this out-of-memory error occurs when the program spends too much time in garbage collection. So the problem is not the number of computations you do; it is the way you have implemented them. You might have a loop creating too many objects or something like that, so a StringBuffer might not help you.
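One way to act on this advice is to cap how many texts are being processed at once, instead of starting a new Thread per file. The following is only a minimal sketch under assumptions: the per-article work is modeled as a function from the text to some result, and BoundedProcessor and processAll are hypothetical names that do not appear in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: runs `work` on every text using at most `poolSize` threads.
class BoundedProcessor {
    static <R> List<R> processAll(List<String> texts, int poolSize, Function<String, R> work) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (String text : texts) {
                // Queued tasks wait for a free worker instead of each getting a new thread.
                Callable<R> task = () -> work.apply(text);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get()); // blocks until that task has finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for results", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

A call such as BoundedProcessor.processAll(texts, 10, String::length) returns the results in input order. If memory is the concern, reading each file inside the task (rather than before submitting it) would keep roughly poolSize texts alive at a time; that variation is not shown here.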
I understand that reading and writing data from multiple threads requires a good locking mechanism to avoid data races. However, consider one situation: if multiple threads write the same single value to a single variable, can this be a problem?
For example, here is my sample code:
public class Main {
public static void main(String[] args) {
final int[] a = {1};
while(true) {
new Thread(new Runnable() {
@Override
public void run() {
a[0] = 1;
assert a[0] == 1;
}
}).start();
}
}
}
I have run this program for a long time, and it looks like everything is fine. If this code can cause a problem, how can I reproduce it?
Your test case does not cover the actual problem. You test the variable's value in the same thread that wrote it, and a thread always sees its own writes, just as in any single-threaded application. The real issue with write operations is how and when the updated value is seen by the other threads.
For example, if you were to write a counter where each thread increments the value, you would run into issues. Another problem is that your test operation takes far less time than creating a thread, so the execution is practically sequential. If the threads ran longer code, several of them could access the variable at the same time. I wrote this test using Thread.sleep(), whose timing is imprecise (which is what we need):
int[] a = new int[]{0};
for(int i = 0; i < 100; i++) {
final int k = i;
new Thread(new Runnable() {
@Override
public void run() {
try {
Thread.sleep(20);
} catch(InterruptedException e) {
e.printStackTrace();
}
a[0]++;
System.out.println(a[0]);
}
}).start();
}
If you execute this code, you will see how unreliable the output is. The order of the numbers changes (they are not in ascending order), and there are duplicates and missing numbers as well. This is because a[0]++ is not atomic: each thread reads the value (possibly into a CPU cache or register rather than from shared RAM), increments its own copy, and writes it back later. Two threads can therefore read the same value, and one update overwrites the other. (The write-back does not necessarily happen right after the operation completes, to save time in case the value is needed again.)
There also might be some other mechanics in the JVM that copy the values within the RAM for threads, but I'm unaware of them.
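The thing is, mutual exclusion alone is not the whole story. Keeping threads from accessing the variable at the same time is one requirement; the other is that the updated value is visible to the next thread that accesses it. In Java, synchronized blocks and the java.util.concurrent locks provide both guarantees, whereas plain unsynchronized access provides neither.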
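For the counter case specifically, one common fix is to make the increment a single atomic operation. This is a minimal sketch using java.util.concurrent.atomic.AtomicInteger; the class and method names (SafeCounterDemo, countWithThreads) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class SafeCounterDemo {
    // Starts `threadCount` threads that each increment a shared counter once,
    // waits for all of them, and returns the final value.
    static int countWithThreads(int threadCount) {
        AtomicInteger counter = new AtomicInteger(0);
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < threadCount; i++) {
            // incrementAndGet() performs the read-modify-write as one atomic step.
            Thread t = new Thread(() -> counter.incrementAndGet());
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join(); // join() also makes the thread's writes visible to the caller
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return counter.get();
    }
}
```

Unlike the a[0]++ version, no increment should be lost here, although the order in which the threads run is still up to the scheduler.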
While learning Java concurrency I ran into this behaviour which I can't explain:
public class ThreadInterferrence implements Runnable {
public static void main(String[] args) throws InterruptedException {
Thread t = new Thread(new ThreadInterferrence());
t.start();
append("1", 50);
t.join();
System.out.println(value);
}
private static String value = "";
public void run() {
append("2", 50);
}
private static void append(String what, int times) {
for (int i = 0; i < times; ++i) {
value = value + what;
}
}
}
Why does the program generate random Strings? More importantly, why does the length of the output vary? Shouldn't it always be exactly 100 chars?
Output examples:
22222222222222222222222222222222222222222222222222
1111111111111111111111111111112121112211221111122222222222222
etc..
The reason is that you have two threads:
the main thread, which appends "1" to the shared value String;
the ThreadInterferrence thread, which appends "2" to the same value String.
It is the operating system (OS) that schedules which thread runs when, and hence you see random output. So in your case, the OS may schedule your runnable for a while, which appends 2s, and then run the main thread, which in turn appends 1s.
On the topic of your updated question (why does the length of output vary? shouldn't it always be exactly 100 chars?)
The behavior will be unpredictable, since reading the old String, building the new one, and re-assigning it is not a single atomic step. Note that Strings are immutable and you keep reassigning a new value to the variable. So what happens is: one thread gets the value, the other thread also gets the value, one thread adds a character and writes the result back, but so does the other thread, starting from the old value. Now you're losing data, because the update from one of the threads is lost.
In such a case you could use a StringBuffer which is thread-safe, or add synchronization which I'm sure you'll learn about.
[Question] More importantly why does the length of output vary?
[Answer] The variable "value" is being used by multiple threads (Main thread as well as the other thread). Hence the method which is used to change the state of the variable needs to be thread safe to control the final length. That is not the case here.
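The synchronization fix mentioned in the answers could look roughly like this. It is a sketch, not the only option: the lock object and the class name SynchronizedAppend are assumptions for illustration, and the append logic mirrors the question's code.

```java
class SynchronizedAppend {
    private static final Object LOCK = new Object();
    private static String value = "";

    private static void append(String what, int times) {
        for (int i = 0; i < times; ++i) {
            // The read of `value` and the write of the new String now happen
            // as one unit, so no update can be lost.
            synchronized (LOCK) {
                value = value + what;
            }
        }
    }

    // Runs the two appenders concurrently and returns the combined String.
    static String run() {
        synchronized (LOCK) {
            value = "";
        }
        Thread t = new Thread(() -> append("2", 50));
        t.start();
        append("1", 50);
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return value;
    }
}
```

With the lock in place the result should always contain 100 characters, but the interleaving of 1s and 2s still depends on scheduling.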
I want to do a task that I've already completed except this time using multithreading. I have to read a lot of data from a file (line by line), grab some information from each line, and then add it to a Map. The file is over a million lines long so I thought it may benefit from multithreading.
I'm not sure about my approach here since I have never used multithreading in Java before.
I want to have the main method do the reading, and then giving the line that has been read to another thread which will format a String, and then give it to another thread to put into a map.
public static void main(String[] args)
{
//Some information read from file
String line = "";
try (BufferedReader br = new BufferedReader(new FileReader("somefile.txt"))) {
while((line = br.readLine()) != null) {
// Pass line to another task
}
} catch (IOException e) {
e.printStackTrace();
}
// Here I want to get a total from B, but I'm not sure how to go about doing that
}
public class Parser extends Thread
{
private Mapper m1;
// Some reference to B
public Parser (Mapper m) {
m1 = m;
}
public void parse (String s, int i) {
// Do some work on S
String key = DoSomethingWithString(s);
m1.add(key, i);
}
}
public class Mapper extends Thread
{
private SortedMap<String, Integer> sm;
private String key;
private int value;
boolean hasNewItem;
public Mapper() {
sm = new TreeMap<String, Integer>();
hasNewItem = false;
}
public void add(String s, int i) {
hasNewItem = true;
key = s;
value = i;
}
public void run() {
while (!Thread.currentThread().isInterrupted()) {
try {
if (hasNewItem) {
// Find if street name exists in map
sm.put(key, value);
hasNewItem = false;
}
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
}
// I'm not sure how to give the Map back to main.
}
}
I'm not sure if I am taking the right approach. I also do not know how to terminate the Mapper thread and retrieve the map in the main. I will have multiple Mapper threads but I have only instantiated one in the code above.
I also just realized that my Parser class is not really a thread, but only another class, if it does not override the run() method, so I am thinking that the Parser class should be some sort of queue.
Any ideas? Thanks.
EDIT:
Thanks for all of the replies. It seems that since I/O will be the major bottleneck, there would be little efficiency benefit from parallelizing this. However, for demonstration purposes, am I on the right track? I'm still a bit bothered by not knowing how to use multithreading.
Why do you need multiple threads? You only have one disk and it can only go so fast. Multithreading almost certainly won't help in this case. And if it does, the gain will be minimal from a user's perspective. Multithreading isn't your problem. Reading from a huge file is your bottleneck.
Frequently I/O will take much longer than the in-memory tasks. We refer to such work as I/O-bound. Parallelism may have a marginal improvement at best, and can actually make things worse.
You certainly don't need a different thread to put something into a map. Unless your parsing is unusually expensive, you don't need a different thread for it either.
If you had other threads for these tasks, they might spend most of their time sitting around waiting for the next line to be read.
Even parallelizing the I/O won't necessarily help, and may hurt. Even if your CPUs support parallel threads, your hard drive might not support parallel reads.
EDIT:
All of us who commented on this assumed the task was probably I/O-bound -- because that's frequently true. However, from the comments below, this case turned out to be an exception. A better answer would have included the fourth comment below:
Measure the time it takes to read all the lines in the file without processing them. Compare to the time it takes to both read and process them. That will give you a loose upper bound on how much time you could save. This may be decreased by a new cost for thread synchronization.
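That measurement can be sketched as a small helper that times a pass over all lines with a pluggable per-line action. ReadTiming and timeLines are hypothetical names; the helper takes any Reader, so the same code can be pointed at the real file or at test input.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

class ReadTiming {
    // Reads every line from `source`, hands each one to `perLine`,
    // and returns the elapsed time in nanoseconds.
    static long timeLines(Reader source, Consumer<String> perLine) {
        long start = System.nanoTime();
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                perLine.accept(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return System.nanoTime() - start;
    }
}
```

Calling it once with a no-op action (line -> {}) and once with the real parsing gives the two numbers to compare. The operating system's file cache can make the second pass over the same file faster, so it is worth repeating both runs a few times.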
You may wish to read Amdahl's Law. Since the majority of your work is strictly serial (the IO) you will get negligible improvements by multi-threading the remainder. Certainly not worth the cost of creating watertight multi-threaded code.
Perhaps you should look for a new toy-example to parallelise.
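For the demonstration-only version the question asks about, the usual shape is a producer that puts lines on a queue and a consumer that drains it until it sees an end marker. The sketch below makes several assumptions: a single consumer, a trivial trim-and-lowercase step standing in for the real parsing, a count per key as the map value, and invented names (LinePipeline, POISON). It also shows one way to stop the worker and hand the map back to the caller.

```java
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class LinePipeline {
    // Unique marker object meaning "no more input"; compared by identity, not content.
    private static final String POISON = new String("<end-of-input>");

    static SortedMap<String, Integer> run(List<String> lines) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
        // Only the consumer touches the map until join() returns.
        SortedMap<String, Integer> counts = new TreeMap<>();

        Thread consumer = new Thread(() -> {
            try {
                for (String line = queue.take(); line != POISON; line = queue.take()) {
                    String key = line.trim().toLowerCase(); // stand-in for the real parsing
                    counts.merge(key, 1, Integer::sum);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        try {
            for (String line : lines) {
                queue.put(line); // blocks if the consumer falls behind
            }
            queue.put(POISON);   // tells the consumer to finish
            consumer.join();     // after this, the map is safe to read here
        } catch (InterruptedException e) {
            consumer.interrupt();
            Thread.currentThread().interrupt();
        }
        return counts;
    }
}
```

As the answers point out, this is unlikely to beat the single-threaded version when reading dominates; it only illustrates the hand-off and shutdown mechanics.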
I have a Java method that performs two computations over an input set: an estimated and an accurate answer. The estimate can always be computed cheaply and in reliable time. The accurate answer can sometimes be computed in acceptable time and sometimes not (not known a priori ... have to try and see).
What I want to set up is some framework where if the accurate answer takes too long (a fixed timeout), the pre-computed estimate is used instead. I figured I'd use a thread for this. The main complication is that the code for computing the accurate answer relies on an external library, and hence I cannot "inject" Interrupt support.
A standalone test-case for this problem is here, demonstrating my problem:
package test;
import java.util.Random;
public class InterruptableProcess {
public static final int TIMEOUT = 1000;
public static void main(String[] args){
for(int i=0; i<10; i++){
getAnswer();
}
}
public static double getAnswer(){
long b4 = System.currentTimeMillis();
// have an estimate pre-computed
double estimate = Math.random();
//try to get accurate answer
//can take a long time
//if longer than TIMEOUT, use estimate instead
AccurateAnswerThread t = new AccurateAnswerThread();
t.start();
try{
t.join(TIMEOUT);
} catch(InterruptedException ie){
;
}
if(!t.isFinished()){
System.err.println("Returning estimate: "+estimate+" in "+(System.currentTimeMillis()-b4)+" ms");
return estimate;
} else{
System.err.println("Returning accurate answer: "+t.getAccurateAnswer()+" in "+(System.currentTimeMillis()-b4)+" ms");
return t.getAccurateAnswer();
}
}
public static class AccurateAnswerThread extends Thread{
private boolean finished = false;
private double answer = -1;
public void run(){
//call to external, non-modifiable code
answer = accurateAnswer();
finished = true;
}
public boolean isFinished(){
return finished;
}
public double getAccurateAnswer(){
return answer;
}
// not modifiable, emulate an expensive call
// in practice, from an external library
private double accurateAnswer(){
Random r = new Random();
long b4 = System.currentTimeMillis();
long wait = r.nextInt(TIMEOUT*2);
//don't want to use .wait() since
//external code doesn't support interruption
while(b4+wait>System.currentTimeMillis()){
;
}
return Math.random();
}
}
}
This works fine outputting ...
Returning estimate: 0.21007465651836377 in 1002 ms
Returning estimate: 0.5303547292361411 in 1001 ms
Returning accurate answer: 0.008838428149438915 in 355 ms
Returning estimate: 0.7981717302567681 in 1001 ms
Returning estimate: 0.9207406241557682 in 1000 ms
Returning accurate answer: 0.0893839926072787 in 175 ms
Returning estimate: 0.7310211480220586 in 1000 ms
Returning accurate answer: 0.7296754467596422 in 530 ms
Returning estimate: 0.5880164300851529 in 1000 ms
Returning estimate: 0.38605296260291233 in 1000 ms
However, I have a very large input set (in the order of billions of items) to run my analysis over, and I'm uncertain as to how to clean up the threads that do not finish (I do not want them running in the background).
I know that various methods to destroy threads are deprecated with good reason. I also know that the typical way to stop a thread is to use interrupts. However, in this case, I don't see that I can use an interrupt since the run() method passes a single call to an external library.
How can I kill/clean-up threads in this case?
If you know enough about the external library, such as that it:
never acquires any locks;
never opens any files/network connections;
never involves any I/O whatsoever, not even logging;
then it may be safe to use Thread#stop on it. You could try it and do extensive stress testing. Any resource leaks should manifest themselves soon enough.
I'd first check whether it responds to Thread.interrupt(). Reduce your data, of course, so it doesn't run forever; if it responds to interrupt(), you're home free. If the code locks anything, or calls wait() or sleep(), it has to handle InterruptedException, and it's possible the author did the right thing. They may swallow it and continue, but it's possible they didn't.
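A quick way to run that experiment is to submit the call to an executor, wait with a timeout, and cancel with interruption on timeout. This is a sketch under assumptions: TimedAnswer and answerOrEstimate are invented names, the external call is passed in as a Callable, and cancellation only takes effect if the library code actually reacts to interruption.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimedAnswer {
    // Returns the accurate answer if it arrives within the timeout, else the estimate.
    static double answerOrEstimate(Callable<Double> accurate, double estimate, long timeoutMillis) {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        Future<Double> future = exec.submit(accurate);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; a no-op for code that ignores interrupts
            return estimate;
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt();
            return estimate;
        } catch (ExecutionException e) {
            return estimate; // the accurate computation itself failed
        } finally {
            exec.shutdownNow();
        }
    }
}
```

If the worker thread keeps running after cancel(true), which a profiler or thread dump would show, the library ignores interrupts and one of the other approaches in this thread is needed.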
While technically you can call Thread.stop(), you'll need to know everything about that code to be sure that it's safe and that you won't leak resources. However, doing that research will also show you how you could easily modify the code to check for interrupt(). You'll pretty much need the source code to audit it, which means you could just as easily do the right thing and add the checks there, with less research than it takes to establish whether it's safe to call Thread.stop().
The other option is to cause a RuntimeException in the thread. Try nulling a reference it might hold, or closing some I/O resource (socket, file handle, etc.). Modify the array of data it's walking over by changing its size, or null out the data. There's usually something you can do to make it throw an exception that is not handled, and the thread will then shut down.
Extending on the answer by chubbsondubs, if the third-party library uses some well-defined API (such as java.util.List or some library-specific API) to access the input data set, you could wrap the input data set that you pass to the third-party code with a wrapper class that will throw exceptions, e.g. in the List.get method, after a cancel flag is set.
For instance, if you pass a List to your third-party library, then it might be possible to do something along the lines of:
class CancelList<T> implements List<T> {
private final List<T> wrappedList;
private volatile boolean canceled = false;
public CancelList(List<T> wrapped) { this.wrappedList = wrapped; }
public void cancel() { this.canceled = true; }
public T get(int index) {
if (canceled) { throw new RuntimeException("Canceled!"); }
return wrappedList.get(index);
}
// Other List method implementations here...
}
public double getAnswer(List<MyType> inputList) {
CancelList<MyType> cancelList = new CancelList<MyType>(inputList);
AccurateAnswerThread t = new AccurateAnswerThread(cancelList);
t.start();
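try{
t.join(TIMEOUT);
} catch(InterruptedException ie){
Thread.currentThread().interrupt();
}
if (t.isAlive()) {
// Timed out (or interrupted) before the thread finished: make further list accesses fail
cancelList.cancel();
}
// Get the result of your calculation here...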
}
Of course, this approach depends on a few things:
You must know the third-party code well-enough to know what methods it calls that you can control through input parameters.
The third-party code would need to make frequent calls to these methods throughout the computation process (i.e. it won't work if it copies all the data at once into an internal structure and does its computation there).
Obviously this won't work if the library catches and handles runtime exceptions and continues processing.
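A compact way to get a complete wrapper without writing every List method by hand is to extend java.util.AbstractList, which only requires get and size for a read-only list. The class name CancellableList is hypothetical, and the sketch assumes the third-party code only reads from the list.

```java
import java.util.AbstractList;
import java.util.List;

// Read-only view of another list that starts throwing once cancel() is called.
class CancellableList<T> extends AbstractList<T> {
    private final List<T> wrapped;
    private volatile boolean canceled = false; // volatile so the worker thread sees the flag

    CancellableList(List<T> wrapped) {
        this.wrapped = wrapped;
    }

    void cancel() {
        canceled = true;
    }

    @Override
    public T get(int index) {
        if (canceled) {
            throw new IllegalStateException("Canceled!");
        }
        return wrapped.get(index);
    }

    @Override
    public int size() {
        return wrapped.size();
    }
}
```

AbstractList's iterator is implemented in terms of get(), so a for-each loop over the wrapper is aborted by the same check. The caveats listed above still apply.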
I am new to multithreading and I have to write a program that uses multiple threads to increase its efficiency. My first attempt produced just the opposite result. Here is what I have written:
class ThreadImpl implements Callable<ArrayList<Integer>> {
//Bloom filter instance for one of the table
BloomFilter<Integer> bloomFilterInstance = null;
// Data member for complete data access.
ArrayList< ArrayList<UserBean> > data = null;
// Store the result of the testing
ArrayList<Integer> result = null;
int tableNo;
public ThreadImpl(BloomFilter<Integer> bloomFilterInstance,
ArrayList< ArrayList<UserBean> > data, int tableNo) {
this.bloomFilterInstance = bloomFilterInstance;
this.data = data;
result = new ArrayList<Integer>(this.data.size());
this.tableNo = tableNo;
}
public ArrayList<Integer> call() {
int[] tempResult = new int[this.data.size()];
for(int i=0; i<data.size() ;++i) {
tempResult[i] = 0;
}
ArrayList<UserBean> chkDataSet = null;
for(int i=0; i<this.data.size(); ++i) {
if(i==tableNo) {
//do nothing;
} else {
chkDataSet = new ArrayList<UserBean> (data.get(i));
for(UserBean toChk: chkDataSet) {
if(bloomFilterInstance.contains(toChk.getUserId())) {
++tempResult[i];
}
}
}
this.result.add(new Integer(tempResult[i]));
}
return result;
}
}
In the above class there are two data members, data and bloomFilterInstance, and they (the references) are passed in from the main program. So there is actually only one instance of data, and all the threads access it simultaneously; each thread gets its own bloom filter from eval.bloomFilter.
The class that launches the threads is (a few irrelevant details have been left out, so assume all variables etc. are declared):
class MultithreadedVrsion {
public static void main(String[] args) {
if(args.length > 1) {
ExecutorService es = Executors.newFixedThreadPool(noOfTables);
List<Callable<ArrayList<Integer>>> threadedBloom = new ArrayList<Callable<ArrayList<Integer>>>(noOfTables);
for (int i=0; i<noOfTables; ++i) {
threadedBloom.add(new ThreadImpl(eval.bloomFilter.get(i),
eval.data, i));
}
try {
List<Future<ArrayList<Integer>>> answers = es.invokeAll(threadedBloom);
long endTime = System.currentTimeMillis();
System.out.println("using more than one thread for bloom filters: " + (endTime - startTime) + " milliseconds");
System.out.println("**Printing the results**");
for(Future<ArrayList<Integer>> element: answers) {
ArrayList<Integer> arrInt = element.get();
for(Integer i: arrInt) {
System.out.print(i.intValue());
System.out.print("\t");
}
System.out.println("");
}
} catch (Exception e) {
e.printStackTrace();
}
}
}
}
I profiled the program with JProfiler. Here (http://tinypic.com/r/wh1v8p/6) is a snapshot of the CPU threads, where red means blocked, green runnable, and yellow waiting. My problem is that the threads are running one at a time, and I do not know why.
Note: I know that this is not thread safe, but I will only be doing read operations for now and just want to analyse the raw performance gain that can be achieved; later I will implement a better version.
Can anyone please tell me what I have missed?
One possibility is that the cost of creating threads is swamping any possible performance gains from doing the computations in parallel. We can't really tell if this is a real possibility because you haven't included the relevant code in the question.
Another possibility is that you only have one processor / core available. Threads only run when there is a processor to run them. So your expectation of a speedup that is linear in the number of threads can only be achieved (in theory) if there is a free processor for each thread.
Finally, there could be memory contention due to the threads all attempting to access a shared array. If you had proper synchronization, that would potentially add further contention. (Note: I haven't tried to understand the algorithm to figure out if contention is likely in your example.)
My initial advice would be to profile your code, and see if that offers any insights.
And take a look at the way you are measuring performance to make sure that you aren't just seeing some benchmarking artefact; e.g. JVM warmup effects.
That process looks CPU bound. (no I/O, database calls, network calls, etc.) I can think of two explanations:
How many CPUs does your machine have? How many is Java allowed to use? - if the threads are competing for the same CPU, you've added coordination work and placed more demand on the same resource.
How long does the whole method take to run? For very short times, the additional work in context switching threads could overpower the actual work. The way to deal with this is to make a longer job. Also, run it a lot of times in a loop not counting the first few iterations (like a warm up, they aren't representative.)
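The warm-up advice can be sketched as a tiny timing helper: run the job a few times untimed, then average over the timed runs. WarmupTimer and averageNanos are invented names, and for serious measurements a dedicated harness such as JMH is generally a better choice than a hand-rolled loop.

```java
class WarmupTimer {
    // Runs `job` `warmup` times without timing it, then `measured` times with timing,
    // and returns the average duration of the timed runs in nanoseconds.
    static long averageNanos(Runnable job, int warmup, int measured) {
        for (int i = 0; i < warmup; i++) {
            job.run(); // lets class loading and JIT compilation happen off the clock
        }
        long total = 0;
        for (int i = 0; i < measured; i++) {
            long start = System.nanoTime();
            job.run();
            total += System.nanoTime() - start;
        }
        return total / measured;
    }
}
```

Timing the single-threaded and the multithreaded variant with the same helper, and checking Runtime.getRuntime().availableProcessors(), gives a fairer comparison than one cold run.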
Several possibilities come to mind:
There is some synchronization going on inside bloomFilterInstance's implementation (which is not given).
There is a lot of memory allocation going on, e.g., what appears to be an unnecessary copy of an ArrayList when chkDataSet is created, use of new Integer instead of Integer.valueOf. You may be running into overhead costs for memory allocation.
You may be CPU-bound (if bloomFilterInstance#contains is expensive) and threads are simply blocking for CPU instead of executing.
A profiler may help reveal the actual problem.
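To illustrate the allocation point, here is a sketch of the counting loop without the defensive ArrayList copy and without new Integer. It is deliberately simplified: tables hold plain user IDs instead of UserBean objects, an IntPredicate stands in for bloomFilterInstance.contains, and the names TableMatchCounter and countMatches are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

class TableMatchCounter {
    // For each table except `ownTable`, counts the IDs accepted by `mightContain`.
    static List<Integer> countMatches(List<List<Integer>> tables, int ownTable, IntPredicate mightContain) {
        List<Integer> result = new ArrayList<>(tables.size());
        for (int i = 0; i < tables.size(); i++) {
            int hits = 0;
            if (i != ownTable) {
                // Iterate the shared list in place; read-only access needs no copy.
                for (int userId : tables.get(i)) {
                    if (mightContain.test(userId)) {
                        hits++;
                    }
                }
            }
            result.add(Integer.valueOf(hits)); // valueOf can reuse cached Integer instances
        }
        return result;
    }
}
```

Whether this changes the profile depends on where the time actually goes; if the threads are blocked inside the bloom filter itself, removing allocations will not help much.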