Is it possible to compare two DFEVar values on Kernel side - java

I am using Maxeler, MaxIDE.
I would like to use my input stream as the output stream on the next cycle. I was hoping to decide this under an if condition, but the if condition won't allow me to compare two DFEVars. I was wondering whether it is possible. The error is:
Type mismatch: cannot convert from DFEVar to boolean

You cannot use a regular if statement to compare two DFEVars.
You should use the ternary operator instead. See point 2 below for more details.
You can find the detailed explanation in the Maxeler tutorials.
From the MaxCompiler Dataflow Programming Tutorial:
Conditionals in dataflow computing
There are three main methods of controlling conditionals that affect dataflow computation:
Global conditionals: These are typically large-scale modes of operation depending on input parameters with a relatively small number of options. If we need to select different computations based on input parameters, and these conditionals affect the dataflow portion of the design, we simply create multiple .max files, one for each case. Some applications may require certain transformations to get them into the optimal structure for supporting multiple .max files.
if (mode==1) p1(x); else p2(x);
where p1 and p2 are programs that use different .max files.
Local Conditionals: Conditionals depending on the local state of a computation.
if (a > b) x = x + 1; else x = x - 1;
These can be transformed into dataflow computation as
x = (a > b) ? (x + 1) : (x - 1);
Conditional Loops: If we do not know how long we need to iterate around a loop, we need to know a bit about the loop's behavior and typical values for the number of loop iterations. Once we know the distribution of values we can expect, a dataflow implementation pipelines the optimal number of iterations and treats each block of iterations as an action for the SLiC interface, controlled by the CPU (or some other kernel).
The ternary-if operator ( ?: ) selects between two input streams. To select between more than two streams, the control.mux method is easier to use and read than nested ternary-if statements.
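For concreteness, here is a minimal kernel sketch in MaxJ (MaxCompiler's Java dialect); the stream names, bit width, and class name are illustrative, not taken from the question:

import com.maxeler.maxcompiler.v2.kernelcompiler.Kernel;
import com.maxeler.maxcompiler.v2.kernelcompiler.KernelParameters;
import com.maxeler.maxcompiler.v2.kernelcompiler.types.base.DFEVar;

class MaxOfTwoKernel extends Kernel {
    MaxOfTwoKernel(KernelParameters parameters) {
        super(parameters);
        DFEVar a = io.input("a", dfeUInt(32));
        DFEVar b = io.input("b", dfeUInt(32));
        // (a > b) yields a 1-bit DFEVar, not a Java boolean, so a plain
        // if statement cannot branch on it; the MaxJ ternary operator
        // builds a hardware multiplexer selecting between the streams.
        DFEVar out = (a > b) ? a : b;
        io.output("out", out, dfeUInt(32));
    }
}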

Related

Best primitive type for fast number comparison?

I've got a function that does several hundred million iterations, trying to find the optimal combination of a given set of possibilities. All of my data is pre-calculated and nearly all the arithmetic is simple >= or <= comparison of these pre-calculated values.
I'm wondering if there's an advantage to using certain primitive types (int, long, double) when doing this simple comparison.
I know I could go and run a test to see which is "best", but it's also important to understand the underlying reasoning. For example, maybe int is most easily comparable because it takes up less memory, or maybe double's floating point more easily tells what power of 10 the value is, which speeds up comparison in some cases. I'm interested in knowing these basics, and a simple test wouldn't tell me that.
This is premature optimization. You should pick one data type, make an implementation based on it, and run a performance benchmark using your actual implementation, not some made-up test that compares tens of thousands of random values of a specific type.
The reason to test using your specific implementation is that there are a number of factors that have a much greater effect on the speed than the timing of raw comparisons:
Cache hit ratio - accessing memory is multiple times slower than accessing a cached value. Restructuring your loops when accessing large arrays of data could speed up your program by a large factor without changing the number of raw comparisons that your program performs.
Branch prediction - keeping the CPU pipeline going is very important. If your loops and your data are structured in a way that optimizes the number of correct branch predictions, your code runs a lot faster than code with a large number of incorrect branch predictions.
It is not possible to measure any of these metrics until you have your actual algorithm implemented. Once you have optimized the actual implementation for cache and branching, switching the underlying data type becomes a relatively easy task.
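As a rough illustration of the branch-prediction point (a hedged sketch, not a proper benchmark; real measurements should use a harness such as JMH):

import java.util.Arrays;
import java.util.Random;

public class BranchDemo {
    public static void main(String[] args) {
        int[] data = new Random(42).ints(10_000_000, 0, 256).toArray();

        long sum = 0;
        long start = System.nanoTime();
        for (int v : data) {
            if (v >= 128) sum += v; // unpredictable branch on random data
        }
        System.out.printf("unsorted: %d ms (sum=%d)%n",
                (System.nanoTime() - start) / 1_000_000, sum);

        Arrays.sort(data); // same comparisons, but now the branch is predictable
        sum = 0;
        start = System.nanoTime();
        for (int v : data) {
            if (v >= 128) sum += v;
        }
        System.out.printf("sorted:   %d ms (sum=%d)%n",
                (System.nanoTime() - start) / 1_000_000, sum);
    }
}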

The user input for calculating the tangent of a graph

I am making a program that calculates the equation for the tangent of a graph at a given point, and ideally I'd want it to work for any type of graph, e.g. 1/x, x^2, ln(x), e^x, sin, tan. I know how to work out the tangent and everything, but I just don't really know how to get the input from the user.
Would I have to have options where they choose the type of graph and then fill in the coefficients for it, e.g. "Choice 1: 1/(Ax^B). Enter the values of A and B"? Or is there a way so that the program recognises what the user types in, so that instead of entering a choice and then the values of A and B, the user can type "1/3x^2" and the program would recognise that A and B are 3 and 2 and that the graph is a 1/x graph?
This website is kind of an example of what I would like to be able to do: https://www.symbolab.com/solver/tangent-line-calculator
Thanks for any help :)
Looks like you want to evaluate the expression. In that case, you could look into Dijkstra's shunting-yard algorithm to convert the expression to postfix notation, and then evaluate it using stacks. Alternatively, you can use a library such as exp4j. There are multiple tutorials for it, but remember that you need to add operations for both binary and unary operators (binary meaning it takes two operands, while unary takes one, like sin(x)).
Then, after you evaluate the expression, you can use first principles to solve for the tangent. I have an example of this system working without exp4j on my github repository. If you go back in the commit history, you can see the implementation with exp4j as well.
Parsing a formula from user input is itself a problem much harder than calculating the tangent. If this is an assignment, see if the wording allows for the choice of the functions and their parameters, as you're suggesting, because otherwise you are going to spend 10% of the time writing code for calculating the derivative and 90% reading the function from the standard input.
If it's your own idea and you'd like to try your hand at it, a teaser is that you will likely need to design a whole class structure for different operators, constants, and the unknown. Keep a stack of mathematical operations, because in 1+2*(x+1)+3 the multiplication needs to happen before the outer additions, but after the inner one. You'll have to deal with reading non-uniform input that has a high level of freedom (in whitespace, omission of the * sign, an implicit zero before a -, etc.). Regular expressions may be of help, but be prepared for a debugging nightmare and a ton of special cases anyway.
If you're fine with restricting your users (yourself?) to valid expressions following JavaScript syntax (which your examples are not, due to the implied multiplication and the haphazard rules of precedence thereof with respect to the 1/...), and you can trust them absolutely to have no malicious intentions, see this question. You wouldn't have your expression represented as a formula internally, but you would still be able to evaluate it at different points x. Then you can approximate the derivative by (f(x+ε) - f(x)) / ε with some sufficiently small ε (but not too small either; use trial and error for convergence). Watch out for points where the function has a jump, but in principle this works, too.
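If you go the exp4j route, here is a minimal sketch of the finite-difference idea (the expression, point, and epsilon are illustrative; it assumes exp4j's ExpressionBuilder API):

import net.objecthunter.exp4j.Expression;
import net.objecthunter.exp4j.ExpressionBuilder;

public class TangentSketch {
    public static void main(String[] args) {
        String input = "1/(3*x^2)"; // user input, already in exp4j syntax
        double x0 = 2.0;            // point of tangency
        double eps = 1e-6;          // small, but not too small

        Expression f = new ExpressionBuilder(input).variable("x").build();

        double y0 = f.setVariable("x", x0).evaluate();
        double slope = (f.setVariable("x", x0 + eps).evaluate() - y0) / eps;

        // Tangent line: y = slope * (x - x0) + y0
        System.out.printf("y = %.6f * (x - %.2f) + %.6f%n", slope, x0, y0);
    }
}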

Collectors.summingInt() vs mapToInt().sum()

When you want to sum an integer value from a stream, there are two main ways of doing it:
ToIntFunction<...> mapFunc = ...
int sum = stream().collect(Collectors.summingInt(mapFunc));
int sum = stream().mapToInt(mapFunc).sum();
The first involves boxing the returned integer & unboxing it, but there's an extra step involved in the second.
Which is more efficient/clearer?
You are looking at the intersection of two otherwise distinct use cases. Using mapToInt(…) allows you to chain other IntStream operations before the terminal operation. In contrast, Collectors.summingInt(…) can be combined with other collectors, e.g. used as downstream collector in a groupingBy collector. For these use cases, there is no question about which to use.
In your special case, when you are not chaining more operations nor dealing with collectors in the first place, there is no fundamental difference between these two approaches. Still, using the one which is more readable has a point. Usually, you don’t use a collector, when there is a predefined operation on the stream doing the same. You wouldn’t use collect(Collectors.reducing(…)) when you can just use .reduce(…), would you?
Not only is mapToInt(mapFunc).sum() shorter, it also follows the usual left-to-right order for what happens conceptually: first convert to an int, then sum these ints up. I think this justifies preferring this alternative over .collect(Collectors.summingInt(mapFunc)).
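To make the two use cases concrete, a short sketch (the Order record and its accessors are made up for illustration; Java 16+ for the record syntax):

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SumDemo {
    record Order(String region, int quantity) {}

    public static void main(String[] args) {
        List<Order> orders = List.of(
                new Order("EU", 3), new Order("US", 5), new Order("EU", 2));

        // Plain sum: mapToInt().sum() reads left to right.
        int total = orders.stream().mapToInt(Order::quantity).sum();

        // As a downstream collector, summingInt is the natural fit:
        Map<String, Integer> byRegion = orders.stream()
                .collect(Collectors.groupingBy(Order::region,
                         Collectors.summingInt(Order::quantity)));

        System.out.println(total);    // 10
        System.out.println(byRegion); // {EU=5, US=5} (map order may vary)
    }
}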

Setting Spark RDD sizes: Casting long to Double inside a 10^9+ for loop, really bad idea?

(EDIT: Looking at where this question started, it really ended up in a much better place. It wound up being a nice resource on the limits of RDD sizes in Spark when set through SparkContext.parallelize() vs. the actual size limits of RDDs. Also uncovered some arguments to parallelize() not found in user docs. Look especially at zero323's comments and his accepted answer.)
Nothing new under the sun but I can't find this question already asked ... the question is about how wrong/inadvisable/improper it might be to run a cast inside a large for loop in Java.
I want to run a for loop to initialize an ArrayList before passing it to a SparkContext.parallelize() method. I have found that passing an uninitialized array to Spark can cause an empty collection error.
I have seen many posts about how floats and doubles are bad ideas as counters; I get that. It just seems like this is a bad idea too? Like there must be a better way?
numListLen will be 10^6 * 10^3 for now, maybe as large as 10^12 at some point.
List<Double> numList = new ArrayList<Double>(numListLen);
for (long i = 0; i < numListLen; i++) {
numList.add((double) i);
}
I would love to hear where specifically this code falls down and can be improved. I'm a junior-level CS student so I haven't seen all the angles yet haha. Here's a CMU page seemingly approving this approach in C using implicit casting.
Just for background, numList is going to be passed to Spark to tell it how many times to run a simulation and create a RDD with the results, like this:
JavaRDD<Double> dataSet = jsc.parallelize(numList, SLICES_AKA_PARTITIONS);
// the function will be applied to each member of dataSet
Double count = dataSet.map(new Function<Double, Double>() {...
(Actually I'd love to run this ArrayList creation through Spark, but it doesn't seem to take enough time to warrant that: 5 seconds on my i5 dual-core, but if boosted to 10^12 then ... longer.)
davidstenberg and Konstantinos Chalkias already covered problems related to using Doubles as counters, and radiodef pointed out an issue with creating objects in the loop, but at the end of the day you simply cannot allocate an ArrayList larger than Integer.MAX_VALUE. On top of that, even with 2^31 elements, this is a pretty large object, and serialization and network traffic can add a substantial overhead to your job.
There are a few ways you can handle this:
Using the SparkContext.range method:
range(start: Long, end: Long,
step: Long = 1, numSlices: Int = defaultParallelism)
Initializing the RDD using a range object. In PySpark you can use range (xrange in Python 2), in Scala a Range:
val rdd = sc.parallelize(1L to Long.MaxValue)
It requires constant memory on the driver and constant network traffic per executor (all you have to transfer is just the beginning and the end).
In Java 8, LongStream.range could work the same way, but it looks like JavaSparkContext doesn't provide the required constructors yet. If you're brave enough to deal with all the singletons and implicits, you can use Scala Range directly, and if not, you can simply write a Java-friendly wrapper.
Initializing the RDD using the emptyRDD method / a small number of seeds, and populating it using mapPartitions(WithIndex) / flatMap (a Java sketch of this approach follows this list). See for example Creating array per Executor in Spark and combine into RDD.
With a little bit of creativity you can actually generate an infinite number of elements this way (Spark FlatMap function for huge lists).
Given your particular use case, you should also take a look at mllib.random.RandomRDDs. It provides a number of useful generators from different distributions.
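A hedged Java sketch of the seed-and-expand idea from the list above (assumes Spark 2.x, where the Java flatMap function returns an Iterator; the sizes, names, and local master are illustrative):

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RangeRddSketch {
    public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext(
                new SparkConf().setAppName("range-sketch").setMaster("local[*]"));

        final long total = 1_000_000_000L; // overall size (illustrative)
        final int slices = 1000;           // one seed per partition
        final long perSlice = total / slices;

        List<Integer> seeds = new ArrayList<>(slices);
        for (int i = 0; i < slices; i++) seeds.add(i);

        // Only `slices` ints cross the network; each seed expands into its
        // share of the range locally on an executor.
        JavaRDD<Double> dataSet = jsc.parallelize(seeds, slices)
                .flatMap(i -> {
                    List<Double> part = new ArrayList<>((int) perSlice);
                    for (long j = i * perSlice; j < (i + 1) * perSlice; j++) {
                        part.add((double) j);
                    }
                    return part.iterator();
                });

        System.out.println(dataSet.count());
        jsc.stop();
    }
}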
The problem is using a double or float as the loop counter. In your case the loop counter is a long and does not suffer the same problem.
One problem with a double or float as a loop counter is that the floating-point precision will leave gaps in the series of numbers represented. It is possible to get to a place within the valid range of a floating-point number where adding one falls below the precision of the number being represented (requiring 16 digits when the floating-point format only supports 15 digits, for example). If your loop went through such a point in normal execution, it would not increment and would continue in an infinite loop.
The other problem with doubles as loop counters is the ability to compare two floating-point values. Rounding means that to compare the variables successfully you need to look at values within a range. While you might consider 1.0000000 equal to 0.999999999, your computer would not. So rounding might also make you miss the loop termination condition.
Neither of these problems occurs with your long as the loop counter. So enjoy having done it right.
Although I don't recommend the use of floating-point values (either single or double precision) as for-loop counters, in your case, where the step is not a decimal number (you use 1 as a step), everything depends on your largest expected number vs. the fraction part of the double representation (52 bits).
Still, doubles from 2^52 to 2^53 represent the integer part correctly, but after 2^53, you cannot always achieve integer-part precision.
In practice, and because your loop step is 1, you would not experience any problems up to 9,007,199,254,740,992 (2^53) if you used double as the counter, thus avoiding casting (you can't avoid the boxing from double to Double, though).
Perform a simple increment-test; you will see that 9,007,199,254,740,995 is the first false positive!
FYI: for float numbers, you are safe incrementing up to 2^24 = 16777216 (the article you provided uses the number 100000001.0f > 16777216 to present the problem).
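A tiny check of the 2^53 boundary mentioned above (plain Java, illustrative):

public class DoubleCounterLimit {
    public static void main(String[] args) {
        double limit = 9007199254740992d;           // 2^53
        System.out.println(limit + 1 == limit);     // true: the increment is lost
        System.out.println(limit - 1 + 1 == limit); // true: still exact below 2^53

        float f = 16777216f;                        // 2^24, the float equivalent
        System.out.println(f + 1 == f);             // true
    }
}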

Huffman Code Decoder Encoder In Java Source Generation

I want to create a fast Huffman code decoder in Java and therefore thought about lookup tables. Since those tables consume memory and we use Java code to navigate and access the tables, one can easily (or not) write a program / method that expresses the same table.
The problem with that approach is that I don't know what the best strategy is. I know it is a lot about what fits in the cache and branch prediction. Also, the switch-case implementation, meaning the actual assembly, is beyond me. If I have an in-memory lookup table (or a hierarchy of them), I will be able to simply jump in and out, but I doubt that for my purposes that table would fit in the cache.
Since I actually walk a tree, one could implement it as if/else statements requiring a certain number of comparisons, but each comparison would need additional binary operations.
So the following options exist:
General algorithm using in-memory lookup tables
If/else representation of the decision tree
If/else representation with small switch statements to find the correct group of symbols (same bit-pattern length) (fewer if statements, might be more code)
Switch statement representation of the code
Writing and benchmarking is quite tricky so any initial thoughts would be great.
One additional problem that comes into play is the order of bits. The most significant bit always comes first, meaning the code is stored in reverse order.
If your tree is A = 0, B = 10, C = 11, then to write BAC it would actually be 01 + 0 + 11 (plus means append).
So the codes actually have to be written in reverse order. Using the if/else or switch approach for groups, this would not be a problem, since masking out the bits is simple and reversing the bits is possible, but it would lose the idea of getting the index within the group out of the mask, since in reverse bit order add and remove have different meanings and a simple lookup is not possible.
Reversing the bits is a costly operation (I use 4-bit lookup tables), not outweighing the performance penalty of the binary operations.
But reversing the bits on the go is better suited for this and requires four operations per bit (shifting up, masking out, adding, and shifting the input down). Since I read bits ahead, all those operations will be done in registers, so they might take only a few cycles.
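The four per-bit operations described read roughly like this (a hedged sketch; names are illustrative):

// Reverses the low `bits` bits of `input`, one bit per iteration.
static int reverseBits(int input, int bits) {
    int rev = 0;
    for (int i = 0; i < bits; i++) {
        rev = (rev << 1) | (input & 1); // shift up, mask out, add
        input >>>= 1;                   // shift the input down
    }
    return rev;
}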
This way I can use switch, subtract, and if to find the right symbol group and also to return those symbols.
So finally, I need advice. Since my codes are global for language processing, they can be hardwired (i.e. be in source).
I wonder what parser generators like ANTLR use to express those decisions. Since they also seem to switch or if/else based on the input symbol, it might give me a clue.
[Updates]
I found a simplification that avoids the reverse-bit problem but still adds a cost per group. So I end up writing the bits in the order of the groups to traverse. So I will not need four modifications per bit, but four per group (different bit lengths).
For each group we have the value of its first element and its size (and therefore the value of the last element within that group).
Therefore for each group the algorithm looks like:
1. Read m bits and combine them with the value read so far.
2. Compare the value with the last value of that group: if it is smaller, it is within that group; if not, it is outside, so read the next group.
3. If it is inside the group, an array of values can be accessed, or a switch statement can be used.
This is totally generic and can be used without loops, making it efficient. Also, once the group is detected, the bit length of the code is known and the bits can be consumed from the source, since the code looks far ahead (reading from a stream). A sketch of this per-group loop follows.
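A hedged sketch of that loop for canonical Huffman codes, taking one extra bit per group (each group being one bit length; firstCode, lastCode, symbolOffset, and the BitReader helper are assumed precomputed / hypothetical):

class GroupDecoder {
    interface BitReader { int readBit(); } // hypothetical: next bit, MSB first

    static int decodeSymbol(BitReader in, int[] firstCode, int[] lastCode,
                            int[] symbolOffset, int[] symbols, int maxLen) {
        int value = 0;
        for (int len = 1; len <= maxLen; len++) {
            value = (value << 1) | in.readBit(); // step 1: extend the value read so far
            if (value <= lastCode[len]) {        // step 2: inside this group?
                // step 3: index into the group's contiguous symbol range
                return symbols[symbolOffset[len] + (value - firstCode[len])];
            }
        }
        throw new IllegalStateException("invalid code");
    }
}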
[Update 2]
To access the actual value, one could use a single big array of elements, grouped by group. Since the probability decreases from group to group, it is very likely that a significant part fits in the L2 or L1 cache, speeding up access here.
Or one uses switch statements.
[Update 3]
Depending on the cases of a switch, the compiler generates either a tableswitch or a lookupswitch. The lookupswitch has a complexity of O(log n) and stores key / jump-offset pairs, which is not preferable. Therefore checking for groups is better suited to if/else.
The tableswitch itself uses only a table of jump offsets, and it only takes a subtract, compare, access, and jump to reach the destination; then it executes a return of a constant value.
Therefore a table access looks more promising. Also, to avoid an unnecessary jump, each group might contain the logic to access and return the group's symbol table. Storing everything in one big table is promising, since it might be an int or a short per symbol, and my codes often have only 1000 to 4000 symbols at most, which makes short sufficient.
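For reference, a small pair of methods showing the two bytecode forms (contiguous labels compile to a tableswitch, sparse labels to a lookupswitch; verifiable with javap -c):

class SwitchForms {
    static int dense(int x) {           // contiguous labels: tableswitch
        switch (x) {
            case 0:  return 10;
            case 1:  return 20;
            case 2:  return 30;
            default: return -1;
        }
    }

    static int sparse(int x) {          // sparse labels: lookupswitch
        switch (x) {
            case 1:       return 10;
            case 1000:    return 20;
            case 1000000: return 30;
            default:      return -1;
        }
    }
}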
I will check if the 1 - pattern gives me the opportunity to store and access the masks in a better way, allowing for binary-searching the correct group instead of advancing in O(n), and it might even avoid any shift operations at all during the processing.
I couldn't make sense of most of what you wrote in your (long) question, but there is a simple approach.
We'll start with a single table. Let's say your longest Huffman code is 15 bits. (In fact, deflate limits the size of its Huffman codes to 15 bits.) Then construct a table with 32768 entries, where each entry is the number of bits in the next code, and the symbol for that code. For codes less than 15 bits, there is more than one entry in the table for the same code. E.g. if the code is 1001011 (7 bits) for the symbol 'C', then all of the indexes of the table xxxxxxxx1001011 have the same thing. Those entries all have {7, 'C'}.
Then you get 15 bits from the stream, and look up the next code in the table. You remove the number of bits from that table entry, and use the resulting symbol. Now you get as many bits from the stream as you need to have 15, and repeat. So if you used 7 bits, then get 8 more to get back to 15 and look up the next code.
The next subtlety is that if your Huffman code changes often, you might end up spending more time filling up that large table for each new Huffman code than you spend actually decoding. To avoid that, you can make a two-level table which has, say, a 9-bit lookup (512 entries) for the first portion of the code. If the code is 9-bits or less, then you proceed as above. That will be the most common case, since shorter codes are more frequent (that being the whole point of Huffman coding). If the table entry says that there are 10 or more bits in the code (and you don't know yet how much more), then you consume the first nine bits and go to a second-level table for those initial nine bits pointed to by the entry in the first table, that has entries for the remaining six bits (64 entries). That resolves the remainder of the code and so tells you how many more bits to consume and what the symbol is. This approach can greatly reduce the time spent filling tables, and is very nearly as fast since short codes are more common. This is the approach used by inflate in zlib.
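A hedged sketch of the single-table scheme in Java (toy sizes; it assumes, as in deflate, that the first-received bit sits in the least significant position of the bit buffer):

public class TableDecode {
    // Entry layout: (codeLength << 16) | symbol.
    static int[] buildTable(int[] codes, int[] lengths, int maxBits) {
        int[] table = new int[1 << maxBits];
        for (int sym = 0; sym < codes.length; sym++) {
            int len = lengths[sym];
            // Replicate the entry at every index whose low `len` bits equal
            // this code; the remaining high bits are "don't care" bits.
            for (int idx = codes[sym]; idx < table.length; idx += (1 << len)) {
                table[idx] = (len << 16) | sym;
            }
        }
        return table;
    }

    public static void main(String[] args) {
        // Toy code from the question, A = 0, B = 10, C = 11 (MSB first),
        // stored with the first bit in the LSB: A -> 0, B -> 01, C -> 11.
        int[] table = buildTable(new int[] {0b0, 0b01, 0b11}, new int[] {1, 2, 2}, 2);

        int buffer = 0b11001; // the message BAC, first-received bit in the LSB
        for (int i = 0; i < 3; i++) {
            int entry = table[buffer & 0b11]; // peek maxBits bits
            System.out.println((char) ('A' + (entry & 0xFFFF)));
            buffer >>>= entry >>> 16;         // consume only the code's true length
        }
        // prints B, A, C
    }
}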
In the end it was quite simple. I support almost all solutions now. One can test every symbol group (same bit length), use a lookup table (10 bit + 10 bit + 10 bit (just tables of 10 bits; symbol count + 1 is the reference to those tables)), and generate Java (and if needed JavaScript, but currently I use GWT to translate it).
I even use long reads and shift operations to reduce the accesses to the binary information. This way the code gets more efficient, since I only support a maximum bit size (20 bits (so a table of a table), which makes 2^20 symbols and therefore at most a million).
For the ordering I use a generator for the bit masks just using shift operations, with no requirement of reversing bit orders or such.
The table lookups can also be expressed in Java, storing the tables as arrays of arrays (it's interesting how big the Java files can be without the compiler complaining).
Also I found it interesting that, since comparing expresses an ordering (a partial order, I guess), one can sort the symbols and, instead of mapping the symbols, map the comparison index. By comparing two indices one can simply sort streams of codes without touching much. By also storing the first or first two comparison indices (16 or 32 bit), one can efficiently sort, and therefore binary search, compressed strings using the same Huffman code, which makes it ideal for compressing strings in a certain language.
