I have a problem to solve using FSTs.
Basically, I'm building a morphological parser, and at the moment I have to work with large transducers. Performance is the big issue here.
Recently I worked in C++ on other projects where performance matters, but now I'm considering Java, because of Java's benefits and because Java is getting better.
I studied some comparisons between Java and C++, but I cannot decide which language I should use for this specific problem, because it depends on the library in use.
I can't find much information about Java libraries, so my question is: are there any open source Java libraries with good performance, like the RWTH FSA Toolkit, which an article I read describes as the fastest C++ library?
Thanks all.
What are the "benefits" of Java, for your purposes? What specific problem does that platform solve that you need? What is the performance constraint you must consider? Were the "comparisons" fair, because Java is actually extremely difficult to benchmark. So is C++, but you can at least get some algorithmic boundary guarantees from STL.
I suggest you look at OpenFst and the AT&T finite-state transducer tools. There are others out there, but I think your worry about Java puts the cart before the horse-- focus on what solves your problem well.
Good luck!
http://jautomata.sourceforge.net/ and http://www.cs.duke.edu/csed/jflap/ are Java-based finite state machine libraries, although I don't have experience using them, so I cannot comment on their efficiency.
I'm one of the developers of the morfologik-stemming library. It's pure Java and its performance is very good, both when you build the automaton and when you use it. We use it for morphological analysis in LanguageTool.
The problem here is the minimum size of your objects in Java. In C++, without virtual methods and run-time type identification, your objects weigh exactly as much as their content, and the time your automata spend manipulating memory has a big impact on performance.
I think that should be the main reason for choosing C++ over Java.
OpenFST is a C++ finite state transducer framework that is really comprehensive. Some people from CMU ported it to Java for use in their natural language processing.
A blog post series describing it.
The code is located on svn.
Update:
I ported it to Java here
Lucene has an excellent FST implementation that is easy to use and high performance; it is what lets query engines like Elasticsearch and Solr deliver very fast, sub-second term-based queries. Let me take an example:
import com.google.common.base.Preconditions;
import org.apache.lucene.store.ByteArrayDataInput;
import org.apache.lucene.store.DataInput;
import org.apache.lucene.store.GrowableByteArrayDataOutput;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRefBuilder;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;
import java.io.IOException;
public class T {

    private final String inputValues[] = {"cat", "dog", "dogs"};
    private final long outputValues[] = {5, 7, 12};

    // https://lucene.apache.org/core/8_4_0/core/org/apache/lucene/util/fst/package-summary.html
    public static void main(String[] args) throws IOException {
        T t = new T();
        FST<Long> fst = t.buildFSTInMemory();
        System.out.println(String.format("memory used for fst is %d bytes", fst.ramBytesUsed()));
        t.searchFST(fst);
        byte[] bytes = t.serialize(fst);
        System.out.println(String.format("length of serialized fst is %d bytes", bytes.length));
        fst = t.deserialize(bytes);
        t.searchFST(fst);
    }

    private FST<Long> buildFSTInMemory() throws IOException {
        // Input values (keys). These must be provided to Builder in Unicode sorted order!
        // Use Collections.sort() to sort inputValues first.
        PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
        Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE1, outputs);
        BytesRef scratchBytes = new BytesRef();
        IntsRefBuilder scratchInts = new IntsRefBuilder();
        for (int i = 0; i < inputValues.length; i++) {
            // scratchBytes.copyChars(inputValues[i]);
            // Filling the BytesRef manually works here because the inputs are ASCII;
            // for arbitrary Unicode, convert the String to UTF-8 bytes explicitly.
            scratchBytes.bytes = inputValues[i].getBytes();
            scratchBytes.offset = 0;
            scratchBytes.length = inputValues[i].length();
            builder.add(Util.toIntsRef(scratchBytes, scratchInts), outputValues[i]);
        }
        FST<Long> fst = builder.finish();
        return fst;
    }

    private FST<Long> deserialize(byte[] bytes) throws IOException {
        DataInput in = new ByteArrayDataInput(bytes);
        PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
        FST<Long> fst = new FST<Long>(in, outputs);
        return fst;
    }

    private byte[] serialize(FST<Long> fst) throws IOException {
        final int capacity = 32;
        GrowableByteArrayDataOutput out = new GrowableByteArrayDataOutput(capacity);
        fst.save(out);
        return out.getBytes();
    }

    private void searchFST(FST<Long> fst) throws IOException {
        for (int i = 0; i < inputValues.length; i++) {
            Long value = Util.get(fst, new BytesRef(inputValues[i]));
            Preconditions.checkState(value == outputValues[i], "fatal error");
        }
    }
}
Related
I am having some trouble using the Groovy TemplateEngines in Java without running into OOM. When creating a lot of different templates, it seems to me that a lot of scripts are created on the heap, which are then never garbage collected.
I use Java 8. When running this code with -Xmx32M, about 3000 iterations are possible. After that an OOM error is thrown.
Here is my code:
import groovy.text.SimpleTemplateEngine;
import groovy.text.Template;
import groovy.text.TemplateEngine;
import java.util.HashMap;
import java.util.Map;
public class Test {
    public static void main(String[] args) throws Exception {
        String groovy = "XX-${i}";
        for (int i = 0; i < (1000000000); i++) {
            TemplateEngine e = new SimpleTemplateEngine();
            Template t = e.createTemplate(groovy);
            Map<String, Object> binding = new HashMap<>();
            binding.put("i", i);
            String res = t.make(binding).toString();
            if (i % 100 == 0) {
                System.out.println("->" + res);
            }
        }
    }
}
I also tried different variations and ClassLoaders, but in essence the results are always the same. As I can't find any current issues about this, I guess I am missing something.
Could anyone help to enlighten me?
Tino
Here is your problem https://bugs.openjdk.java.net/browse/JDK-8037342.
Each time the parser runs, it creates a new unique class named after the number of parses done so far. For instance, after a while the class names look like
groovy.runtime.metaclass.SimpleTemplateScript4237MetaClass
groovy.runtime.metaclass.SimpleTemplateScript4238MetaClass
After a while the ClassLoader's parallelLockMap will fill the heap and nothing is eligible to be GC'd. It's sort of like an OOM PermGen error.
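If the template text itself does not change between iterations, one workaround is to parse it once and reuse the resulting Template with different bindings, so only a single script class is ever generated. A minimal sketch based on the test code above (the class name is made up for illustration):
import groovy.text.SimpleTemplateEngine;
import groovy.text.Template;

import java.util.HashMap;
import java.util.Map;

public class ReuseTemplate {
    public static void main(String[] args) throws Exception {
        // Parse the template once: only one SimpleTemplateScript class is generated,
        // and it is reused for every binding instead of growing the heap per iteration.
        Template t = new SimpleTemplateEngine().createTemplate("XX-${i}");
        for (int i = 0; i < 1000000000; i++) {
            Map<String, Object> binding = new HashMap<>();
            binding.put("i", i);
            String res = t.make(binding).toString();
            if (i % 100 == 0) {
                System.out.println("->" + res);
            }
        }
    }
}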
Use Apache Commons Text. It is a fast and efficient alternative to SimpleTemplateEngine:
Map<String, Object> binding = new HashMap<>();   // template variables, e.g. binding.put("i", 42)
String templateString = "XX-${i}";
StrSubstitutor sb = new StrSubstitutor(binding); // org.apache.commons.text.StrSubstitutor
String value = sb.replace(templateString);       // "XX-42"
I had been struggling with that problem for a while, and I came up with this workaround.
Just call clear after running your script.
https://gist.github.com/jpozorio/38f26120e6346dfd74cecd7a147028aa
I have a method that returns the average of a property over a number of model objects:
List<Activity> activities = ...;
double effortSum = 0;
double effortCount = 0;
activities.stream().forEach(a -> {
    double effort = a.getEffort();
    if (effort != Activity.NULL) {
        effortCount++;      // <- compilation error, local variable
        effortSum += effort; // <- compilation error, local variable
    }
});
But, the above attempt doesn't compile, as noted. The only solution that's coming to me is using an AtomicReference to a Double object, but that seems crufty, and adds a large amount of confusion to what should be a simple operation. (Or adding Guava and gaining AtomicDouble, but the same conclusion is reached.)
Is there a "best practice" strategy for modifying local variables using the new Java 8 loops?
Relevant code for Activity:
public class Activity {
    public static final double NULL = Double.MIN_VALUE;

    private double effort = NULL;

    public void setEffort(double effort) { this.effort = effort; }
    public double getEffort() { return this.effort; }
    ...
}
Is there a "best practice" strategy for modifying local variables using the new Java 8 loops?
Yes: don't. You can modify their properties (though it's still a bad idea), but you cannot modify the variables themselves; you can only refer to local variables from inside a lambda if they are final or effectively final. (AtomicDouble is indeed one solution; another is a double[1] that just serves as a holder.)
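For completeness, here is a minimal sketch of that single-element-array holder, using the Activity type from the question; it compiles because the array references themselves stay effectively final (not recommended over the stream solutions below):
double[] effortSum = {0.0};   // single-element holders: the array references are effectively final,
long[] effortCount = {0};     // but their contents can still be mutated inside the lambda
activities.forEach(a -> {
    double effort = a.getEffort();
    if (effort != Activity.NULL) {
        effortSum[0] += effort;
        effortCount[0]++;
    }
});
double average = effortCount[0] == 0 ? 0.0 : effortSum[0] / effortCount[0];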
The correct way of implementing the "average" operation here is
activities.stream()
          .mapToDouble(Activity::getEffort)
          .filter(effort -> effort != Activity.NULL)
          .average()
          .getAsDouble();
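One caveat about the snippet above: average() returns an OptionalDouble, so if it is possible that every effort gets filtered out, getAsDouble() will throw. A guarded variant:
double avg = activities.stream()
                       .mapToDouble(Activity::getEffort)
                       .filter(effort -> effort != Activity.NULL)
                       .average()
                       .orElse(0.0); // avoids NoSuchElementException when everything is filtered out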
In your case, there is a solution that is more functional: just compute the summary statistics of the stream, from which you can grab the number of elements that passed the filter and their sum:
DoubleSummaryStatistics stats =
        activities.stream()
                  .mapToDouble(Activity::getEffort)
                  .filter(e -> e != Activity.NULL)
                  .summaryStatistics();

long effortCount = stats.getCount();
double effortSum = stats.getSum();
Is there a "best practice" strategy for modifying local variables
using the new Java 8 loops?
Don't try to do that. I think the main issue is that people try to translate their code to the new Java 8 features in an imperative way (like in your question), and then you run into trouble.
Try first to see whether you can express the solution functionally, which is what the Stream API aims for, I believe.
I am prototyping an interface to our application to allow other people to use Python; our application is written in Java. I would like to pass some of our data from the Java app to the Python code, but I am unsure how to pass an object to Python. I have done a simple Java-to-Python function call with simple parameters using Jython and found it very useful for what I am trying to do. Given the class below, how can I use it in Python/Jython as an input to a function or class?
public class TestObject
{
    private double[] values;
    private int length;
    private int anotherVariable;

    //getters, setters
}
One solution: you could use some sort of messaging system, queue, or broker to serialize/deserialize or pass messages between Python and Java. Then create workers/producers/consumers to put work on the queues to be processed in Python or Java.
Also, for inspiration, consider checking out https://www.py4j.org/ ; Py4J is used heavily by PySpark and Hadoop-related projects.
To answer your question more immediately, here is an example using json-simple:
import org.json.simple.JSONArray;
import org.json.simple.JSONObject;

import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.StringWriter;
import java.io.Writer;

public class TestObject
{
    private double[] values;
    private int length;
    private int anotherVariable;
    private boolean someBool;
    private String someString;

    //getters, setters

    @SuppressWarnings("unchecked")
    public String toJSON() throws IOException {
        JSONObject obj = new JSONObject();
        // json-simple has no direct support for double[], so copy the values into a JSONArray
        JSONArray vals = new JSONArray();
        for (double v : this.values) {
            vals.add(v);
        }
        obj.put("values", vals);
        obj.put("length", this.length);
        obj.put("bool_val", this.someBool);
        obj.put("string_key", this.someString);
        StringWriter out = new StringWriter();
        obj.writeJSONString(out);
        return out.toString();
    }

    public void writeObject() throws IOException {
        try (Writer writer = new BufferedWriter(
                new OutputStreamWriter(
                        new FileOutputStream("anObject.json"), "utf-8"))) {
            writer.write(this.toJSON());
        }
    }

    public void setObject() {
        values = new double[] {100.134};
        length = 12;
        anotherVariable = 15;
        someString = "spam";
    }
}
And in Python:
import json

class DoStuffWithObject(object):
    def __init__(self, obj):
        self.obj = obj
        self.changeObj()
        self.writeObj()

    def changeObj(self):
        self.obj['values'] = 100.134
        self.obj['length'] = 12
        self.obj['anotherVariable'] = 15
        self.obj['someString'] = "spam"

    def writeObj(self):
        ''' write back to file '''
        with open('anObject.json', 'w') as f:
            json.dump(self.obj, f)

    def someOtherMethod(self, s):
        ''' do something else '''
        print('hello {}'.format(s))

with open('anObject.json', 'r') as f:
    obj = json.loads(f.read())

# print out obj['values'], obj['bool_val'], ...
for key in obj:
    print(key, obj[key])

aThing = DoStuffWithObject(obj)
aThing.someOtherMethod('there')
And then in Java, read the object back. There are existing solutions implementing this idea (JSON-RPC, XML-RPC, and variants). Depending on your needs, you may also want to consider using something like MongoDB ( http://docs.mongodb.org/ecosystem/drivers/java/ ), the benefit being that Mongo speaks JSON natively.
See:
https://spring.io/guides/gs/messaging-reactor/
http://spring.io/guides/gs/messaging-rabbitmq/
http://spring.io/guides/gs/scheduling-tasks/
Celery-like Java projects:
Jedis
RabbitMQ
ZeroMQ
A more comprehensive list of queues:
http://queues.io/
Resources referenced:
http://www.oracle.com/technetwork/articles/java/json-1973242.html
How do I create a file and write to it in Java?
https://code.google.com/p/json-simple/wiki/EncodingExamples
Agree with the answer below. I think the bottom line is that Python and Java are separate interpreter environments. You therefore shouldn't expect to transfer "an object" from one to the other, and you shouldn't expect to "call methods." But it is reasonable to pass data from one to the other by serializing and de-serializing it through some intermediate data format (e.g. JSON), as you would do with any other program.
In some environments, such as Microsoft Windows, it's possible that a technology like OLE (dot-Net) might be usable to link the environments together "actively," where the various systems implement and provide OLE objects. But I don't have any personal experience with whether, or how, this might be done.
Therefore, the safest thing to do is to treat them as "records" and to use serialization techniques on both sides. (Or, if you get very adventurous, run (say) Java in a child process.) An "adventurous" design could get out of hand very quickly, with little return on investment.
You need to turn the Python file into an exe using py2exe; refer to this link: https://www.youtube.com/watch?v=kyoGfnLm4LA. Then invoke the program from Java and pass arguments.
Please refer to this link for the details:
Calling fortran90 exe program from java is not executing
I tried to distribute a calculation using hadoop.
I am using Sequence input and output files, and custom Writables.
The input is a list of triangles, maximum size 2 MB, but it can also be smaller, around 50 kB.
The intermediate values and the output are a map(int,double) in the custom Writable.
Is this the bottleneck?
The problem is that the calculation is much slower than the version without Hadoop. Also, increasing the nodes from 2 to 10 doesn't speed up the process.
One possibility is that I don't get enough mappers because of the small input size.
I made tests changing mapreduce.input.fileinputformat.split.maxsize, but it just got worse, not better.
I am using Hadoop 2.2.0 locally and on Amazon Elastic MapReduce.
Did I overlook something? Or is this just the kind of task that should be done without Hadoop? (It's my first time using MapReduce.)
Would you like to see code parts?
Thank you.
public void map(IntWritable triangleIndex, TriangleWritable triangle, Context context) throws IOException, InterruptedException {
    StationWritable[] stations = kernel.newton(triangle.getPoints());
    if (stations != null) {
        for (StationWritable station : stations) {
            context.write(new IntWritable(station.getId()), station);
        }
    }
}

class TriangleWritable implements Writable {

    private final float[] points = new float[9];

    @Override
    public void write(DataOutput d) throws IOException {
        for (int i = 0; i < 9; i++) {
            d.writeFloat(points[i]);
        }
    }

    @Override
    public void readFields(DataInput di) throws IOException {
        for (int i = 0; i < 9; i++) {
            points[i] = di.readFloat();
        }
    }
}

public class StationWritable implements Writable {

    private int id;
    private final TIntDoubleHashMap values = new TIntDoubleHashMap();

    // Hadoop instantiates Writables reflectively during deserialization,
    // so a no-argument constructor is required alongside the convenience one.
    StationWritable() {
    }

    StationWritable(int iz) {
        this.id = iz;
    }

    @Override
    public void write(DataOutput d) throws IOException {
        d.writeInt(id);
        d.writeInt(values.size());
        TIntDoubleIterator iterator = values.iterator();
        while (iterator.hasNext()) {
            iterator.advance();
            d.writeInt(iterator.key());
            d.writeDouble(iterator.value());
        }
    }

    @Override
    public void readFields(DataInput di) throws IOException {
        id = di.readInt();
        int count = di.readInt();
        for (int i = 0; i < count; i++) {
            values.put(di.readInt(), di.readDouble());
        }
    }
}
You won't get any benefit from Hadoop with only 2 MB of data. Hadoop is all about big data. Distributing the 2 MB to your 10 nodes costs more time than just doing the job on a single node. The real benefit starts with a high number of nodes and huge data.
If the processing is really that complex, you should be able to realize a benefit from using Hadoop.
The common issue with small files is that Hadoop will run a single Java process per file, which creates overhead from having to start many processes and slows down the job. In your case this does not sound like it applies. More likely you have the opposite problem: only one mapper is trying to process your input, and it doesn't matter how big your cluster is at that point. Using the input split sounds like the right approach, but because your use case is specialized and deviates significantly from the norm, you may need to tweak a number of components to get the best performance.
So you should be able to get the benefits you are seeking from Hadoop Map Reduce, but it will probably take significant tuning and custom Input handling.
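As a starting point for that tuning, here is a hedged sketch of how the split-size cap is typically wired into the job driver; the classes and methods are the stock Hadoop 2.x API, while the driver class name and the 64 KB figure are purely illustrative:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

public class TriangleJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "triangle-calculation");
        job.setInputFormatClass(SequenceFileInputFormat.class);
        // Cap each input split at 64 KB so a ~2 MB sequence file is spread over ~30 map
        // tasks instead of a single mapper; the value is illustrative and must be tuned.
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024);
        // ... set mapper/reducer classes, key/value types, input/output paths, then submit.
    }
}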
That said, seldom (never?) will MapReduce be faster than a purpose-built solution. It is a generic tool whose value is that it can be used to distribute and solve many diverse problems without the need to write a purpose-built solution for each.
So in the end I figured out a way to keep the intermediate values only in memory instead of storing them in Writables. This way it is faster.
But still, a non-Hadoop solution is the best in this use case.
I have an interface DataSeries with a method
int[] getRawData();
For various reasons (primarily because I'm using this with MATLAB, and MATLAB handles int[] well) I need to return an array rather than a List.
I don't want my implementing classes to return their internal int[] array because it is mutable. What is the most efficient way to copy an int[] array (sizes in the 1000-1000000 length range)? Is it clone()?
The only alternative is Arrays#copyOf() (which uses System#arraycopy() under the hood).
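For illustration, a minimal sketch of the defensive copy in an implementing class (the class and field names are made up; either copying form shown in the comment works):
import java.util.Arrays;

public class SampledSeries implements DataSeries {
    private final int[] rawData;

    public SampledSeries(int[] rawData) {
        // Copy on the way in as well, so the caller cannot mutate our internal state later.
        this.rawData = rawData.clone();
    }

    @Override
    public int[] getRawData() {
        // Either form works; Arrays.copyOf delegates to System.arraycopy internally.
        return Arrays.copyOf(rawData, rawData.length);
    }
}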
Just test it.
package com.stackoverflow.q2830456;
import java.util.Arrays;
import java.util.Random;
public class Test {

    public static void main(String[] args) throws Exception {
        Random random = new Random();
        int[] ints = new int[100000];
        for (int i = 0; i < ints.length; ints[i++] = random.nextInt());

        long st = System.currentTimeMillis();
        test1(ints);
        System.out.println(System.currentTimeMillis() - st);

        st = System.currentTimeMillis();
        test2(ints);
        System.out.println(System.currentTimeMillis() - st);
    }

    static void test1(int[] ints) {
        for (int i = 0; i < ints.length; i++) {
            ints.clone();
        }
    }

    static void test2(int[] ints) {
        for (int i = 0; i < ints.length; i++) {
            Arrays.copyOf(ints, ints.length);
        }
    }
}
20203
20131
and when test1() and test2() are swapped:
20157
20275
The difference is negligible. I'd say just go for clone(), since it is more readable, and Arrays#copyOf() requires Java 6 or newer.
Note: actual results may depend on the platform and JVM used; this was tested on a Dell Latitude E5500 with an Intel P8400, 4 GB PC2-6400 RAM, WinXP, JDK 1.6.0_17-b04.
No one ever solved their app's performance problems by going through and changing arraycopy() calls to clone() or vice versa.
There is no one definitive answer to this question. It isn't just that it might be different on different VMs, versions, operating systems and hardware: it really is different.
I benchmarked it anyway, on a very recent OpenJDK (on a recent Ubuntu), and found that arraycopy is much faster. So is this my answer to you? No! Because if it proves to be true, there's a bug in the intrinsification of Arrays.copyOf, and that bug will likely get fixed, so this information is only transient.
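If you do want to measure this yourself today, a harness such as JMH avoids the classic pitfalls (no warmup, dead-code elimination) that plague hand-rolled timing loops. A minimal sketch, assuming the JMH annotations are on the classpath:
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

import java.util.Arrays;
import java.util.Random;

@State(Scope.Thread)
public class CopyBenchmark {

    private int[] ints;

    @Setup
    public void setup() {
        Random random = new Random(42);
        ints = new int[100000];
        for (int i = 0; i < ints.length; i++) {
            ints[i] = random.nextInt();
        }
    }

    @Benchmark
    public int[] cloneCopy() {
        return ints.clone();          // returning the result keeps dead-code elimination at bay
    }

    @Benchmark
    public int[] arraysCopyOf() {
        return Arrays.copyOf(ints, ints.length);
    }
}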
http://www.javapractices.com/topic/TopicAction.do?Id=3
The numbers will likely be different depending on your specs, but it seems clone() is the best choice.