Software benefits of FPGA - java

I have a question: I understand that an FPGA takes advantage of hardware parallelism and controls I/O at the hardware level to provide faster response times, but what are the software benefits of an FPGA? Which software components can be accelerated? Thanks in advance.

Both for prototyping and parallelism. Since FPGAs are cheap, they are good candidates both for industrial prototypes and for parallel systems. FPGAs consist of arrays of logic elements connected by wires. The elements contain small lookup tables and flip-flops, and FPGAs scale to thousands of lookup tables. Lookup tables and programmable wires are flexible enough to implement any logic function. Once the function is finalized, you might move to an ASIC. Xilinx and Altera are the major vendors. Personally, I use the Altera DE2 and DE2-115 boards.

You are correct about the parallelism and I/O control of an FPGA. FPGAs are just huge reconfigurable logic circuits that allow a developer to create circuits with very specific and dedicated functions, and they typically come with a very large amount of I/O compared to typical microcontrollers. Because an FPGA is basically a bunch of gates in silicon, everything you describe in your hardware description language (HDL) happens in parallel. This, combined with the ability to build custom circuits, is what gives an FPGA the ability to accelerate operations over a typical processor, even though the processor may have a much higher clock speed.

To better illustrate this point, let's say you have an algorithm to run and you want to compare an FPGA to a processor. The FPGA is clocked at 100 MHz and the processor at 3 GHz, so the processor's clock is 30 times faster. Say you design a circuit that computes the algorithm on the FPGA in 10 clock cycles, while the equivalent algorithm on the processor takes thousands of instructions to execute. This places the FPGA far ahead of the processor in terms of performance. And because of the parallel nature of FPGAs, if you design it right and the flow through the FPGA is continuous, the FPGA can finish the computation of a new result every clock cycle. Every stage in an FPGA executes concurrently, so while a single result may take 10 clock cycles end to end, at each clock cycle a different piece of 10 different results can be computed simultaneously. This is called pipelining: http://en.wikipedia.org/wiki/Pipeline_%28computing%29

The processor is not capable of this and cannot compute the different stages in parallel without taking extra time to do so. Processors are also bound, performance-wise, by their instruction set, whereas on an FPGA this can be overcome by good, application-specific circuit design. A processor is a very general design that can run any combination of instructions, so it takes longer to compute things because of its generalized nature. FPGAs also don't have the issues of moving things in and out of cache or RAM; they typically do use RAM, but in a parallel fashion that does not inhibit or bottleneck the flow of processing. It is also interesting to note that a processor can be created and implemented on an FPGA, because you can implement the circuits that compose a processor.
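To make the arithmetic above concrete, here is a back-of-envelope sketch in Java (the 3,000-instructions-per-result figure is an assumption standing in for "thousands of instructions"; real numbers depend entirely on the workload):

    // Back-of-envelope throughput comparison for the example above.
    public class FpgaVsCpu {
        public static void main(String[] args) {
            double cpuHz = 3e9;                // 3 GHz processor
            double fpgaHz = 100e6;             // 100 MHz FPGA
            double cpuInstrPerResult = 3_000;  // assumed: "thousands of instructions"
            double fpgaCyclesPerResult = 10;   // the 10-cycle circuit

            double cpuResultsPerSec = cpuHz / cpuInstrPerResult;    // ~1 million/s
            double fpgaUnpipelined = fpgaHz / fpgaCyclesPerResult;  // ~10 million/s
            double fpgaPipelined = fpgaHz; // one result per cycle once the pipeline is full

            System.out.printf("CPU:               %,.0f results/s%n", cpuResultsPerSec);
            System.out.printf("FPGA, unpipelined: %,.0f results/s%n", fpgaUnpipelined);
            System.out.printf("FPGA, pipelined:   %,.0f results/s%n", fpgaPipelined);
        }
    }

Even with a clock 30 times slower, the pipelined FPGA comes out roughly 100 times ahead on throughput in this toy comparison.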
Typically, you find FPGAs on boards alongside processors or microcontrollers to speed up very math-intensive or digital signal processing (DSP) tasks, or tasks that require a large flow of data. For example, a wireless modem that communicates via RF has to do quite a bit of DSP to pick signals out of the air and decode them, something like the receiver chip in a cell phone. There will be lots of data continually flowing in and out of the device. This is perfect for an FPGA because it can process such a large amount of data in parallel. The FPGA can then pass the decoded data off to a microcontroller to allow it to do things like display the text of a text message on a pretty touchscreen display. Note that the chip in your cell phone is not an FPGA but an ASIC (application-specific integrated circuit): a bunch of circuits stripped down to the bare minimum for maximum performance and minimal cost. However, it is easy to prototype ASICs on FPGAs, and the two are very similar. An FPGA can be wasteful because it can have a lot of on-board resources that are not needed. Generally you only move from an FPGA to an ASIC if you are going to be producing a TON of them and you know they work perfectly and will only ever have to do exactly what they currently do. Cell phone transceiver chips are perfect for this: they sell millions and millions of them, and they only have to do that one thing their entire life.
On something like a desktop processor, it is common to see the term "hardware acceleration". This generally means that an FPGA or ASIC is on board to speed up certain operations. A lot of the time this means that an ASIC (probably on the die of the processor) is included to handle operations such as floating-point math, encryption, hashing, signal processing, string processing, etc. This allows the processor to process data more efficiently by offloading operations that are known to be difficult and time consuming for a processor. The circuit on the die, ASIC, or FPGA can do the computation in parallel while the processor does something else, and then the processor gets the answer back. So the speed-up can be very large, because the processor is not bogged down with the computation and is freed up to continue processing other things while the other circuit performs the operation.

Some think of FPGAs as an alternative to processors. This is usually fundamentally wrong.
FPGAs are customizable logic. You can implement a processor in an FPGA and then run your regular software on that processor.
The key advantage with FPGAs is the flexibility to implement any kind of logic.
Think of a processor system. It might have a serial port, USB, Ethernet. What if you need another, more specialized interface which your processor system does not support? You would need to change your hardware, possibly creating a new ASIC.
With an FPGA you can implement a new interface without the need for new hardware or ASICs.
FPGAs are almost never used to replace a processor. The FPGA is used for particular tasks, such as implementing a communications interface, speeding up a specific operation, or switching high-bandwidth communication traffic. You still run your software on a CPU.

Problems with Streams in Java 8

As per this article, there are some serious flaws in the Fork/Join architecture in Java. As per my understanding, streams in Java 8 make use of the Fork/Join framework internally. We can easily turn a stream into a parallel one by using the parallel() method. But when we submit a long-running task to a parallel stream, it blocks all the threads in the pool; check this. This kind of behaviour is not acceptable for real-world applications.
My question is: what are the various considerations that I should take into account before using these constructs in high-performance applications (e.g. equity analysis, stock market tickers, etc.)?
The considerations are similar to other uses of multiple threads.
Only use multiple threads if you know they help. The aim is not to use every core you have, but to have a program which performs to your requirements.
Don't forget multi-threading comes with an overhead, and this overhead can exceed the value you get.
Multi-threading can experience large outliers. When you test performance, you should look not only at throughput (which should be better) but also at the distribution of your latencies (which is often worse in extreme cases).
For low latency, switch between threads as little as possible. If you can do everything in one thread that may be a good option.
For low latency, you don't want to play nice; instead you want to minimise jitter by doing things such as pinning busy-waiting threads to isolated cores. The more cores you isolate, the fewer general-purpose cores are left to run things like thread pools.
The streams API makes parallelism deceptively simple. As was stated before, whether using a parallel stream speeds up your application needs to be thoroughly analysed and tested in the actual runtime context. My own experience with parallel streams suggests the following (and I am sure this list is far from complete):
The cost of the operations performed on the elements of the stream, versus the cost of the parallelising machinery, determines the potential benefit of parallel streams. For example, finding the maximum in an array of doubles is so fast using a tight loop that the streams overhead is never worthwhile. As soon as the operations get more expensive, the balance starts to tip in favour of the parallel streams API (under ideal conditions, say, a multi-core machine dedicated to a single algorithm). I encourage you to experiment.
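A minimal way to run that experiment yourself (a sketch only: no JIT warm-up and a single run, so the numbers are merely indicative; use a harness such as JMH for real measurements):

    import java.util.Arrays;
    import java.util.Random;

    // Toy experiment: max of a large double[] with a tight loop versus a
    // parallel stream. For an operation this cheap the loop usually wins.
    public class MaxExperiment {
        public static void main(String[] args) {
            double[] a = new Random(42).doubles(10_000_000).toArray();

            long t0 = System.nanoTime();
            double max = Double.NEGATIVE_INFINITY;
            for (double d : a) if (d > max) max = d;
            System.out.printf("loop:   %.2f ms (max=%f)%n", (System.nanoTime() - t0) / 1e6, max);

            long t1 = System.nanoTime();
            double max2 = Arrays.stream(a).parallel().max().getAsDouble();
            System.out.printf("stream: %.2f ms (max=%f)%n", (System.nanoTime() - t1) / 1e6, max2);
        }
    }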
You need to have the time and stamina to learn the intricacies of the streams API. There are unexpected pitfalls. For example, a Spliterator can be constructed from a regular Iterator in a simple statement. Under the hood, when such a stream is split, the elements produced by the Iterator are collected into arrays. Depending on the number of elements the Iterator produces, that approach can become very, or even prohibitively, resource hungry.
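The one-liner in question looks like this (a small sketch using the standard java.util Spliterators and StreamSupport API):

    import java.util.Arrays;
    import java.util.Iterator;
    import java.util.Spliterator;
    import java.util.Spliterators;
    import java.util.stream.Stream;
    import java.util.stream.StreamSupport;

    // Wrapping an Iterator in a Spliterator: convenient, but when the stream
    // is split for parallel execution the elements are batched into arrays
    // internally, which can get expensive for huge or unbounded sources.
    public class IteratorToStream {
        public static void main(String[] args) {
            Iterator<String> it = Arrays.asList("a", "b", "c").iterator();
            Spliterator<String> sp =
                    Spliterators.spliteratorUnknownSize(it, Spliterator.ORDERED);
            Stream<String> parallel = StreamSupport.stream(sp, true); // true = parallel
            parallel.forEach(System.out::println);
        }
    }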
While the cited article makes it seem that we are completely at the mercy of Oracle, that is not entirely true. You can write your own Spliterator that splits the input into chunks specific to your situation rather than relying on the default implementation. Or, you could write your own ThreadFactory (see the method ForkJoinPool.makeCommonPool).
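Along the same lines, a widely used workaround for the shared common pool is to submit the terminal operation from inside your own ForkJoinPool; the stream's tasks then run in that pool. Note that this relies on an implementation detail of the stream machinery, not a documented contract:

    import java.util.concurrent.Callable;
    import java.util.concurrent.ForkJoinPool;
    import java.util.stream.LongStream;

    // Sketch: long-running parallel-stream work confined to a dedicated pool,
    // so it cannot starve unrelated parallel streams in the common pool.
    public class DedicatedPoolStream {
        public static void main(String[] args) throws Exception {
            ForkJoinPool pool = new ForkJoinPool(4); // isolated from commonPool()
            Callable<Long> work = () -> LongStream.rangeClosed(1, 1_000_000)
                                                  .parallel()
                                                  .map(i -> i * i) // stand-in for real work
                                                  .sum();
            long sum = pool.submit(work).get();
            System.out.println(sum);
            pool.shutdown();
        }
    }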
You need to be careful not to produce deadlocks. If the tasks executed on the elements of the stream themselves use the ForkJoinPool, a deadlock may occur. You need to learn the ForkJoinPool.ManagedBlocker API (which I find rather the opposite of easy to grasp). Technically, you are telling the ForkJoinPool that a thread is blocking, which may lead to the creation of additional threads to keep the degree of parallelism intact. The creation of extra threads is not free, of course.
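For reference, the ManagedBlocker idiom looks roughly like this (closely adapted from the pattern in the JDK javadoc; the blocking queue is just an illustrative blocking resource):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ForkJoinPool;

    // managedBlock() tells the pool a worker is about to block, so the pool
    // may activate a spare thread to preserve parallelism. Those extra
    // threads are not free.
    class QueueTaker<E> implements ForkJoinPool.ManagedBlocker {
        private final BlockingQueue<E> queue;
        private volatile E item = null;

        QueueTaker(BlockingQueue<E> queue) { this.queue = queue; }

        @Override
        public boolean block() throws InterruptedException {
            if (item == null) item = queue.take(); // the actual blocking call
            return true;                           // true = blocking is finished
        }

        @Override
        public boolean isReleasable() {
            // Already have an item, or can we get one without blocking?
            return item != null || (item = queue.poll()) != null;
        }

        public E getItem() throws InterruptedException {
            ForkJoinPool.managedBlock(this); // cooperatively block inside the pool
            return item;
        }
    }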
Just my five cents...
The point of the article (there are actually 17 points) is that the F/J framework is more of a research project than a general-purpose commercial application development framework.
Criticize the object, not the man. Doing that is most difficult when the main problem with the framework is that the architect is a professor/scientist, not an engineer/commercial developer. The PDF consolidation downloadable from the article goes further into the problem of using research standards rather than engineering standards.
Parallel streams work fine until you try to scale them. The framework uses pull technology: a request goes into a submission queue and a thread must pull it out; forked Tasks go back into the forking thread's deque and other threads must pull (steal) them out of that deque. This technique doesn't scale well. In a push technology, each Task is scattered to every thread in the system, which works much better in large-scale environments.
There are many other problems with scaling, as even Paul Sandoz from Oracle pointed out: for instance, if you have 32 cores and are doing Stream.of(s1, s2, s3, s4).flatMap(x -> x).reduce(...), then at most you will use 4 cores, because the outer stream splits into only four elements. The article points out, with downloadable software, that scaling does not work well and that the "parquential" technique is necessary to avoid stack overflows and OOMEs.
Use the parallel streams. But beware of the limitations.

What is the performance difference between a JVM method call and a remote call?

I'm gathering some data about the difference in performance between a JVM method call and a remote method call using a binary protocol (in other words, not SOAP). I am developing a framework in which a method call may be local or remote at the discretion of the framework, and I'm wondering at what point it's "worth it" to evaluate the method remotely, either on a much faster server or on a compute grid of some kind. I know that a remote call is going to be much, much slower, so I'm mostly interested in understanding the order-of-magnitude difference. Is it 10 times slower, or 100, or 1,000? Does anyone have any data on this? I'll write my own benchmarks if necessary, but I'm hoping to re-use some existing knowledge. Thanks!
Having developed a low-latency RMI implementation (~20 microseconds minimum), I can say it is still about 1000x slower than a direct call. If you use plain Java RMI (~500 microseconds minimum), it can be 25,000x slower.
NOTE: This is only a very rough estimate to give you a general idea of the difference you might see. There are many complex factors which could change these numbers dramatically. Depending on what the method does, the difference could be much lower, especially if you perform RMI to the same process; if the network is relatively slow, the difference could be much larger.
Additionally, even when there is a very large relative difference, it may be that it won't make much difference across your whole application.
To elaborate on my last comment...
Let's say you have a GUI which has to poll some data every second, and it uses a background thread to do this. Let's say that an RMI call takes 50 ms, while the alternative, a direct method call to a local copy of a distributed cache, takes 0.0005 ms. That would appear to be an enormous difference, 100,000x. However, the RMI call could simply start 50 ms earlier and still poll every second, so the difference to the user is next to nothing.
What can be much more important is whether RMI is much simpler than the alternative approach (if it's the right tool for the job).
An alternative to RMI is JMS. Which is best depends on your situation.
It's impossible to answer your question precisely. The ratio of execution times will depend on factors like:
The size / complexity of the parameters and return values that need to be serialized for the remote call.
The execution time of the method itself
The bandwidth / latency of the network connection
But in general, direct JVM method calls are very fast, and any kind of serialization coupled with the network delay of RMI is going to add significant overhead. Have a look at these numbers to get a rough estimate of the overhead:
http://surana.wordpress.com/2009/01/01/numbers-everyone-should-know/
Apart from that, you'll need to benchmark.
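If you do write your own benchmark, a minimal starting point for the in-JVM side might look like this (a sketch only; the remote side would time the same loop against an RMI stub, and a harness such as JMH will give far more trustworthy numbers):

    // Timing the local side of the comparison. Warm-up matters: without it
    // the JIT compiles the loop mid-measurement and skews the result.
    public class CallOverhead {
        interface Service { long echo(long x); }

        public static void main(String[] args) {
            Service local = x -> x; // direct in-JVM call

            for (int i = 0; i < 1_000_000; i++) local.echo(i); // JIT warm-up

            int n = 10_000_000;
            long sink = 0; // consume results so the calls are not eliminated
            long start = System.nanoTime();
            for (int i = 0; i < n; i++) sink += local.echo(i);
            double perCallNs = (System.nanoTime() - start) / (double) n;
            System.out.printf("direct call: ~%.1f ns/call (sink=%d)%n", perCallNs, sink);
        }
    }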
One piece of advice: make sure you use a really good binary serialization library (Avro, Protocol Buffers, Kryo, etc.) coupled with a decent communications framework (e.g. Netty). These tools are far better than the standard Java serialization/IO facilities, and probably better than anything you can code yourself in a reasonable amount of time.
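For illustration, a serialize/deserialize round trip with Kryo looks roughly like this (a sketch; exact API details and class-registration requirements vary between Kryo versions, so check the docs for the one you use):

    import com.esotericsoftware.kryo.Kryo;
    import com.esotericsoftware.kryo.io.Input;
    import com.esotericsoftware.kryo.io.Output;
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;

    public class KryoRoundTrip {
        static class Point { double x, y; } // example payload

        public static void main(String[] args) {
            Kryo kryo = new Kryo();
            kryo.register(Point.class); // recent Kryo versions require registration

            Point p = new Point();
            p.x = 1.0; p.y = 2.0;

            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            Output output = new Output(bytes);
            kryo.writeObject(output, p); // serialize
            output.close();

            Input input = new Input(new ByteArrayInputStream(bytes.toByteArray()));
            Point copy = kryo.readObject(input, Point.class); // deserialize
            input.close();

            System.out.println(copy.x + ", " + copy.y);
        }
    }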
No one can tell you the answer, because the decision of whether or not to distribute is not about speed. If it were, you would never make a distributed call, because it will always be slower than the same call made in-memory.
You distribute components so multiple clients can share them. If the sharing is what's important, it outweighs the speed hit.
Your break even point has to do with how valuable it is to share functionality, not method call speed.

Design a test program: increase/decrease CPU usage

I am trying to design and then write a Java program that can increase or decrease CPU usage. Here is my basic idea: write a multi-threaded program in which each thread does floating-point calculations, and increase or decrease CPU usage by adding or removing threads.
I am not sure what kinds of floating-point operations are best for this test case, especially since I am going to test on a VMware virtual machine.
You can just sum up the reciprocals of the natural numbers. Since this sum doesn't converge, the compiler will not dare to optimize it away. Just make sure that the result is somehow used in the end.
1/1 + 1/2 + 1/3 + 1/4 + 1/5 ...
This will of course occupy the floating-point unit, but not necessarily the central processing unit as a whole. So whether this approach is suitable is the main question you should ask.
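A minimal sketch of that idea in Java (the volatile field exists only so the JIT cannot prove the sum unused and discard the work):

    // Each thread sums the divergent harmonic series forever; load is scaled
    // by choosing how many threads to start.
    public class HarmonicLoad {
        static volatile double sink;

        public static void main(String[] args) {
            int threads = args.length > 0 ? Integer.parseInt(args[0]) : 2;
            for (int t = 0; t < threads; t++) {
                new Thread(() -> {
                    double sum = 0;
                    for (long i = 1; ; i++) {
                        sum += 1.0 / i;                      // never converges
                        if ((i & 0xFFFFFL) == 0) sink = sum; // publish occasionally
                    }
                }).start();
            }
        }
    }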
Simple busy loops will increase CPU usage; I am not sure whether doing FP calculations changes this significantly or achieves a more consistent load factor, even though it does exercise the FPU and not just the ALU.
While creating a similar proof-of-concept in C#, I used a fixed number of threads and changed the sleep/work durations of each thread. Bear in mind that this process isn't exact and is subject to both CPU and process throttling as well as other factors of modern preemptive operating systems. Adding VMware to the mix may also compound the observed behaviors. In degenerate cases, harmonics can form between the code designed to adjust the load and the load reported by the system.
If lower-level constructs were used (generally requiring "kernel mode" access) then a more consistent throttle could be implemented -- partially because of the ability to avoid certain [thread] preemptions :-)
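In Java, the same sleep/work duty-cycle idea might look like the following sketch (as noted above, preemption, virtualization, and timer resolution all make the achieved load approximate):

    // Each 100 ms period the thread busy-spins for loadPercent ms and sleeps
    // for the remainder, approximating loadPercent% usage of one core.
    public class DutyCycleLoad implements Runnable {
        private final int loadPercent; // 0..100

        DutyCycleLoad(int loadPercent) { this.loadPercent = loadPercent; }

        @Override
        public void run() {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    long busyUntil = System.currentTimeMillis() + loadPercent;
                    while (System.currentTimeMillis() < busyUntil) { /* busy spin */ }
                    Thread.sleep(100 - loadPercent);
                }
            } catch (InterruptedException e) {
                // interrupted: fall through and exit
            }
        }

        public static void main(String[] args) {
            new Thread(new DutyCycleLoad(60)).start(); // roughly 60% of one core
        }
    }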
Another alternative that may be looked into, with the appropriate hardware, is setting the CPU clock and then running at 100%. The current Intel Core i line of chips is very flexible this way (the CPU multiplier can be set discretely through the entire range), although access from Java may be problematic. This would be done in the host, of course, not in VMware.
Happy coding.

Normalization of speed for testing on different multicore processors

I want to calculate the run time of some simple C programs on different multi-core processors. But as we know, with the advancement of technology new processors incorporate additional techniques for faster computation, such as higher clock speeds. How can I normalize away such speed changes (i.e. filter out the effect of processor advances other than multiple cores), since I only want results based on the number of cores of the processor?
Under Linux, you can boot with the kernel command line parameter maxcpus=N to limit the machine to only N CPUs. See Documentation/kernel-parameters.txt in the kernel source for details.
Most BIOS environments also have the ability to turn off hyperthreading; depending upon your benchmarks, HT may speed up or slow down your tests; being in control of HT would be ideal.
Decide on a known set of reference hardware, run some sort of repeatable reference benchmark against this, and get a good known value to compare to. Then you can run this benchmark against other systems to figure out how to scale the values you get from your target benchmark runs.
The closer your reference benchmark is to your actual application, the more accurate the results of your scaling will be. You could use a single deterministic run of your application (single code path, perhaps averaged over multiple executions) as your reference benchmark.
If I understand you correctly, you are trying to find a measurement approach that allows you to separate the effect of scaling the number of cores from the effect of single-processor improvements. I am afraid that is not easily possible. For example, if you compare a multi-core system to one single core of that system, the correlation is non-linear, because there are shared resources such as the memory bus: if you use only one core of a multi-core system, it can use the complete memory bandwidth, while in the multi-core case the cores have to share it. Similar arguments apply to many shared resources: caches, buses, I/O capabilities, ALUs, etc.
Your issue is with the automatic scaling of core frequency based on the number of active cores at any given time. For instance, AMD Phenom 6-core chips operate at 3.4 GHz (or thereabouts), and if your application creates more than 3 threads the clock goes down to 2.8 GHz (or thereabouts). Intel, on the other hand, uses a bunch of heuristics to determine the right frequency at any given time.
However, you can always turn these settings off in the BIOS, and then the results will be comparable, differing only in clock frequency. Usually, people measure gigaflops to get comparable results.

Should I consider parallelism in statistical calculations?

We are going to implement software for various statistical analyses, in Java. The main concept is to get an array of points on a graph, then iterate through it and find some results (like the longest rising sequence and various indicators).
Problem: lots of data
Problem 2: it must also work on a client's PC, not only the server (no specific server tuning possible)
Partial solution: do the computation in the background and let the user stare at an empty screen waiting for the result :(
Question: Is there a way to increase the performance of the computation itself (lots of iterations) using parallelism? If so, please provide links to articles, samples, whatever is usable here...
The main reason to use parallel processing is the presence of a large amount of data or large computations whose parts can be performed independently of each other. For example, you can compute the factorial of 10000 with many threads by splitting the range into parts 1..1000, 1001..2000, 2001..3000, etc., processing each part, and then accumulating the results with multiplication. On the other hand, you cannot split the task of computing a big Fibonacci number, since later values depend on previous ones.
The same goes for large amounts of data. If you have collected an array of points and want to find some concrete points (bigger than some constant, the max of all) or just collect statistical information (sum of coordinates, number of occurrences), use parallel computations. If you need to collect "ongoing" information (like the longest rising sequence)... well, this is still possible, but much harder.
The difference between servers and client PCs is that client PCs don't have many cores, and parallel computation on a single core will only decrease performance, not increase it. So, do not create more threads than the user's PC has cores (the same goes for computing clusters: do not split the task into more subtasks than there are computers in the cluster).
Hadoop's MapReduce allows you to create parallel computations efficiently. You can also look for more specialized Java libraries which evaluate in parallel. For example, Parallel Colt implements high-performance concurrent algorithms for working with big matrices, and there are lots of such libraries for many data representations.
In addition to what Roman said, you should see whether the client's PC has multiple CPUs/CPU cores/hyperthreading. If there's just a single CPU with a single core and no hyperthreading, you won't benefit from parallelizing a computation. Otherwise, it depends on the nature of your computation.
If you are going to parallelize, make sure to use Java 1.5+ so that you can use the concurrency API. At runtime, determine the number of CPU cores with Runtime.getRuntime().availableProcessors(). For most tasks, you will want to create a thread pool with that many threads using Executors.newFixedThreadPool(numThreads) and submit tasks to the Executor. In order to get more specific, you will have to provide information about your particular computation, as Roman suggested.
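Putting Roman's factorial example together with this advice, a sketch might look like this (the chunk boundaries and payload are illustrative):

    import java.math.BigInteger;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Split 1..n into one chunk per core, multiply each chunk in its own
    // task, then combine the partial products.
    public class ParallelFactorial {
        public static void main(String[] args) throws Exception {
            int n = 10_000;
            int cores = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(cores);

            int chunk = (n + cores - 1) / cores;
            List<Future<BigInteger>> parts = new ArrayList<>();
            for (int lo = 1; lo <= n; lo += chunk) {
                final int from = lo;
                final int to = Math.min(lo + chunk - 1, n);
                parts.add(pool.submit(() -> {
                    BigInteger p = BigInteger.ONE;
                    for (int i = from; i <= to; i++)
                        p = p.multiply(BigInteger.valueOf(i));
                    return p;
                }));
            }

            BigInteger result = BigInteger.ONE;
            for (Future<BigInteger> part : parts)
                result = result.multiply(part.get()); // accumulate partial products

            pool.shutdown();
            System.out.println("digits in " + n + "!: " + result.toString().length());
        }
    }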
If the problem you're going to solve is naturally parallelizable, then there's a way to use multithreading to improve performance.
If there are many parts which must be computed serially (i.e. you can't compute the second part until the first part is computed), then multithreading isn't the way to go.
Describe the concrete problem and, maybe, we'll be able to provide you more help.
