I'd like to know if it's possible (and with which tooling) to do typesafe i18n in Java. Maybe that's not clear, so here are some details, assuming we use something based on MessageFormat.
1) Translate using typesafe parameters
I'd like to avoid having an interface like String translate(Object key, Object... values) where the values are untyped. It should be impossible to call it with a parameter of the wrong type.
Note I'm fine specifying the typing of all the keys. The solution I'm looking for should be scalable and should not increase the backend startup time significantly.
2) It should be known at compile time which keys are still used
I don't want my translation keys base to be like many websites' CSS, growing and growing forever and everybody being frightened to remove keys because we don't know easily if they are still useful or not.
In JS/React land there is babel-plugin-react-intl, which permits extracting at compile time the translation keys that are still found in the code. Then we can diff these keys against our translation backend/SaaS and delete the unused keys automatically. Is there anything close to that experience in Java land?
I'm looking for:
any tricks you have that could make i18n more manageable in Java regarding these 2 problems
current tooling that might help me solve the problem
hints on how to implement something custom if tooling does not exist
Also, is an enum suitable for storing a huge fixed list of translation keys?
Translation keys are an open-ended domain; for a closed domain an enum would do.
Relying on enums or constant lists tends to produce an ever-growing collection of different enums and constants classes.
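If the key set is treated as a closed, managed domain, one way to get typed parameters is a facade with one method per key, usually generated from the bundle at build time rather than written by hand. A minimal hand-written sketch along those lines, assuming a MessageFormat-backed ResourceBundle; the class name, bundle name, keys and method signatures are all invented for illustration:

import java.text.MessageFormat;
import java.util.Locale;
import java.util.ResourceBundle;

// Typed facade over a MessageFormat-based bundle: one method per key.
// In a real project this class would typically be generated from the bundle
// at build time, so a stale key shows up as a method with no callers.
public final class Messages {

    private final ResourceBundle bundle;

    public Messages(Locale locale) {
        this.bundle = ResourceBundle.getBundle("messages", locale);
    }

    // key: user.greeting = Hello {0}, you have {1,number,integer} new messages
    public String userGreeting(String userName, int newMessageCount) {
        return format("user.greeting", userName, newMessageCount);
    }

    // key: order.shipped = Your order {0} was shipped on {1,date,long}
    public String orderShipped(String orderId, java.util.Date shippedOn) {
        return format("order.shipped", orderId, shippedOn);
    }

    private String format(String key, Object... args) {
        return new MessageFormat(bundle.getString(key), bundle.getLocale()).format(args);
    }
}

Because each key is an ordinary method, the compiler checks the parameter types at every call site, and an IDE or static-analysis run can list methods with no callers, which gives a rough compile-time answer to the "which keys are still used" question.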
And then there is the very important perspective of the translation business: you would want at least one glossary (terms that do not need per-occurrence translation), structurally identical phrases grouped together, and perhaps comments on ambiguous terms and usages (button vs. menu). This can reduce time costs and improve quality. There are also things like online help.
Up till now XML, such as simple DocBook or translation-memory formats (TMX/XLIFF/...), has been sufficient for that, and we built the tooling, including different forms of evaluation, ourselves.
I hope a more professional answer will be given, but mine might shed some light on the desired functionality:
translation-centric: as that needs the most work.
version control: some text lists are involved.
checking tools: what you mentioned, plus integrity, missing keys, near-identical entries.
I am trying to use a DNNRegressor model in a Java realtime context; unfortunately this requires a garbage-free implementation. It doesn't look like TensorFlow Lite offers a GC-free implementation. The path of least resistance would be to extract the weights and re-implement the NN manually. Has anyone tried extracting the weights from a regression model and implementing the regression manually, and if so, could you describe any pitfalls?
Thanks!
I am not quite sure if your conclusion
The path of least resistance would be to extract the weights and re-implement the NN manually.
is actually true. It sounds to me like you want to use the trained model in an Android mobile application. I personally do not know much about that, but I am sure there are efficient ways to do exactly that.
However, assuming you actually need to extract the weights, there are multiple ways to do this.
One straightforward way to do this is to implement the exact network you want yourself with TensorFlow's low-level API instead of using the canned DNNRegressor class (which is deprecated, by the way). That might sound unnecessarily complex, but it is actually quite easy and has the upside of you being in full control.
A general way to get all trainable variables is to use TensorFlow's trainable_variables method.
Or maybe this might help you.
In terms of pitfalls I don't really believe there are any. At the end of the day you are just storing a bunch of floats. You should probably make sure to use an appropriate file format like HDF5 and sufficient float precision.
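If you do end up re-implementing the network by hand in Java, the forward pass of a small fully connected regressor is only a few nested loops and can be written so that nothing is allocated after construction, which addresses the garbage-free requirement from the question. A minimal sketch, assuming dense layers with ReLU on the hidden layers (the DNNRegressor default) and a single linear output; the weight layout and class name are just illustrative choices:

// Allocation-free forward pass for a small fully connected regression network
// whose weights were exported from the trained model as plain float arrays.
public final class DenseRegressor {

    private final float[][][] weights; // weights[layer][out][in]
    private final float[][] biases;    // biases[layer][out]
    private final float[][] buffers;   // preallocated activations per layer

    public DenseRegressor(float[][][] weights, float[][] biases) {
        this.weights = weights;
        this.biases = biases;
        this.buffers = new float[weights.length][];
        for (int l = 0; l < weights.length; l++) {
            buffers[l] = new float[weights[l].length];
        }
    }

    // Returns the single regression output; creates no garbage in the steady state.
    public float predict(float[] input) {
        float[] in = input;
        for (int l = 0; l < weights.length; l++) {
            float[] out = buffers[l];
            for (int o = 0; o < out.length; o++) {
                float sum = biases[l][o];
                float[] w = weights[l][o];
                for (int i = 0; i < w.length; i++) {
                    sum += w[i] * in[i];
                }
                // ReLU on hidden layers, identity on the final (output) layer
                out[o] = (l < weights.length - 1 && sum < 0f) ? 0f : sum;
            }
            in = out;
        }
        return in[0];
    }
}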
Write a program with the following objective -
be able to identify whether a word/phrase represents a thing/product. For example -
1) "A glove comprising at least an index finger receptacle, a middle finger receptacle.." <-Be able to identify glove as a thing/product.
2) "In a window regulator, especially for automobiles, in which the window is connected to a drive..." <- be able to identify regulator as a thing.
Doing this tells me that the text is talking about a thing/product. By contrast, the following text talks about a process instead of a thing/product: "An extrusion coating process for the production of flexible packaging films of nylon coated substrates consisting of the steps of..."
I have millions of such texts; hence, doing this manually is not feasible. So far, using NLTK + Python, I have been able to identify some specific cases which use very similar keywords. But I have not been able to do the same with the kinds mentioned in the examples above. Any help will be appreciated!
What you want to do is actually pretty difficult. It is a sort of (very specific) semantic labelling task. The possible solutions are:
create your own labelling algorithm, create training data, test, eval and finally tag your data
use an existing knowledge base (lexicon) to extract semantic labels for each target word
The first option is a complex research project in itself. Do it if you have the time and resources.
The second option will only give you the labels that are available in the knowledge base, and these might not match your wishes. I would give it a try with Python, NLTK and WordNet (an interface is already available); you might be able to use synset hypernyms for your problem.
This task is called the named entity recognition (NER) problem.
EDIT: There is no clean definition of NER in the NLP community, so one could say this is not an NER task but an instance of the more general sequence labeling problem. Either way, there is still no tool that can do this out of the box.
Out of the box, Stanford NLP can only recognize the following types:
Recognizes named (PERSON, LOCATION, ORGANIZATION, MISC), numerical
(MONEY, NUMBER, ORDINAL, PERCENT), and temporal (DATE, TIME, DURATION,
SET) entities
so it is not suitable for solving this task. There are some commercial solutions that can possibly do the job; they can be readily found by googling "product name named entity recognition", and some of them offer free trial plans. I don't know of any free, ready-to-deploy solution.
Of course, you can create your own model by hand-annotating about 1000 or so sentences containing product names and training a classifier such as a Conditional Random Field (CRF) classifier with some basic features (here is a documentation page that explains how to do that with Stanford NLP). This solution should work reasonably well, although it won't be perfect of course (no system is perfect, but some solutions are better than others).
EDIT: This is a complex task per se, but not that complex unless you want state-of-the-art results. You can create a reasonably good model in just 2-3 days. Here is an example step-by-step set of instructions for doing this with an open-source tool:
Download CRF++ and look at the provided examples; they are in a simple text format
Annotate your data in a similar manner
a OTHER
glove PRODUCT
comprising OTHER
...
and so on.
Split your annotated data into two files: train (80%) and dev (20%)
Use the following baseline template features (paste into the template file)
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]
Run
crf_learn template train.txt model
crf_test -m model dev.txt > result.txt
Look at result.txt: one column will contain your hand-labeled data and the other the machine-predicted labels. You can then compare these, compute accuracy, etc. After that you can feed new unlabeled data into crf_test and get your labels.
As I said, this won't be perfect, but I would be very surprised if it weren't reasonably good (I actually solved a very similar task not long ago), and it is certainly better than just using a few keywords/templates.
ENDNOTE: this ignores many things and some best practices in solving such tasks, won't be good enough for academic research, and is not 100% guaranteed to work, but it is still useful for this and many similar problems as a relatively quick solution.
I have a very large list of Strings (ArrayList myList) and I want to remove duplicated items from this list very fast. I copied the items into a HashMap; that's the best approach I found, but it is still not fast enough.
I have found something about writing code in native languages and using it in an Android app. Can we remove the duplicates from the list using a native language? Is there any function written in assembly language that can do this faster than Java can?
If not, is there a function that can just compare two strings faster than Java can?
To answer the question: it is possible to program in C for Android using the NDK, and as the way from C to assembler is rather short, it may be possible in assembler as well. And while Java performance is currently rather good, the claim that no language could ever check an array for duplicates faster seems to me somewhat of an overestimation.
However, switching between languages is complex, and for such a trivial task you may lose performance just by accessing your array at the JNI level.
It may be more reasonable to rethink the algorithm. For instance:
If you just need to iterate over the list but must have it ordered, use a LinkedHashSet. This will prevent duplicate items from the beginning.
If you have a lot of duplicates, the removal operation may be too expensive because big parts of the array may be moved many times. Try setting the items to be removed to null instead, and then recreate the array from scratch, skipping nulls. A sketch of both ideas follows below.
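A minimal sketch of both ideas; the class and method names are arbitrary, and which variant wins depends on the list size and the duplicate ratio, so it is worth measuring both before reaching for JNI:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public final class Dedup {

    // Keeps the first occurrence of each string and preserves insertion order.
    static List<String> withLinkedHashSet(List<String> input) {
        return new ArrayList<>(new LinkedHashSet<>(input));
    }

    // Marks duplicates as null in place (no shifting), then rebuilds the list once.
    static List<String> withNullMarking(ArrayList<String> input) {
        Set<String> seen = new HashSet<>(input.size());
        for (int i = 0; i < input.size(); i++) {
            if (!seen.add(input.get(i))) {
                input.set(i, null); // O(1), later elements are not moved
            }
        }
        List<String> result = new ArrayList<>(seen.size());
        for (String s : input) {
            if (s != null) {
                result.add(s);
            }
        }
        return result;
    }
}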
is there any function written by assembly language that can do this faster than java can do?
Does such a function already exist? I don't know ... and I don't know how one would find it if it did.
Could you write such a function? Maybe ... in theory.
Assume that there is a function that performs this task as fast as theoretically possible (in some context).
No matter what language that function is written in, it should be possible to find out what machine code the function compiles (or assembles) to.
Having done that, you can turn that machine code into assembler ... giving an assembly language function that performs the task with maximum performance.
And since such an assembler program can exist (in theory), a sufficiently smart / skilled / patient human being could (in theory) write it ... from scratch!
But the problem is that you would need to be a really good assembly programmer (with a really good understanding of the algorithms involved) to be able to pull this off. And the kicker is that there is no guarantee that the existing Java implementation (when compiled using a good JIT compiler) won't be almost as fast.
The reason I'm being pessimistic here is that implementing an efficient hash table in an HLL (like Java) is hard enough for most people. Achieving the same thing in assembly language is going to be orders of magnitude harder. (That's rhetorical. You can't really quantify difficulty like that ...)
if not, is there a function that can just compare two strings faster than java can do ?
I don't see how this will help much. If you are using a HashSet properly, then String comparison should not be the performance bottleneck for your problem. Not even if your ratio of duplicates is high.
Where do you get and store your list of strings? Maybe using SQLite or something like CQEngine to store and manage the data would be better?
I am developing a financial manager in my free time with Java and a Swing GUI. When the user adds a new entry, he is prompted to fill in: money amount, date, comment and section (e.g. Car, Salary, Computer, Food, ...)
The sections are created "on the fly". When the user enters a new section, it will be added to the section JComboBox for further selection. The other point is that the comments could be in different languages, so the list of hard-coded words and synonyms would be enormous.
So, my question is: is it possible to analyse the comment (e.g. "Fuel", "Car service", "Lunch at **") and preselect a fitting Section?
My first thought was, do it with a neural network and learn from the input, if the user selects another section.
But my problem is, I don't know how to start at all. I tried "encog" with Eclipse and did some tutorials (XOR, ...). But all of them only use doubles as input/output.
Could anyone give me a hint on how to start, or any other possible solution for this?
Here is a runnable JAR (current development state, requires Java 7) and the SourceForge page
Forget about neural networks. This is a highly technical and specialized field of artificial intelligence, which is probably not suitable for your problem and requires solid expertise. Besides, there are a lot of simpler and better solutions for your problem.
First obvious solution: build a list of words and synonyms for all your sections and parse for these synonyms. You can then collect comments online for synonym analysis, or parse the comments/sections provided by your users to statistically detect relations between words, etc.
There is an infinite number of possible solutions, ranging from the simplest to the most overkill. Now you need to decide whether this feature of your system is critical (prefilling? probably not, then)... and what any development effort will bring you. One hour of work could bring you an 80%-satisfying feature, while aiming for 90% would cost one week of work. Is it really worth it?
Go for the simplest solution and tackle the real challenge of any dev project: delivering. Once your app is delivered, then you can always go back and improve as needed.
// paramInput is the comment string entered by the user
String myString = paramInput.toUpperCase(); // no need for new String(...); normalize case for matching
if (myString.contains("FUEL")) {
    // do the fuel functionality, e.g. preselect the matching section
}
In a simple app, if you will only have some specific sections, you can take the string from the comment, check whether it contains certain keywords, and change the value of the Section accordingly.
If you have a lot of categories, I would use something like Apache Lucene, where you could index all the categories with their names and potential keywords/phrases that might appear in a user's description. Then you could simply run the description through Lucene and use the top-matched category as a "best guess".
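A rough sketch of that idea with an in-memory Lucene index; the SectionGuesser class and its field names are invented, and RAMDirectory is used for brevity (newer Lucene versions replace it with ByteBuffersDirectory, and exact signatures vary between versions):

import java.io.IOException;
import java.util.Map;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class SectionGuesser {

    private final Directory index = new RAMDirectory();
    private final StandardAnalyzer analyzer = new StandardAnalyzer();

    // Index each section together with the keywords/phrases known to belong to it.
    public void indexSections(Map<String, String> sectionToKeywords) throws IOException {
        try (IndexWriter writer = new IndexWriter(index, new IndexWriterConfig(analyzer))) {
            for (Map.Entry<String, String> e : sectionToKeywords.entrySet()) {
                Document doc = new Document();
                doc.add(new StringField("section", e.getKey(), Field.Store.YES));
                doc.add(new TextField("keywords", e.getKey() + " " + e.getValue(), Field.Store.NO));
                writer.addDocument(doc);
            }
        }
    }

    // Returns the best-matching section for a free-text comment, or null if nothing matches.
    public String guessSection(String comment) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(index)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("keywords", analyzer).parse(QueryParser.escape(comment));
            ScoreDoc[] hits = searcher.search(query, 1).scoreDocs;
            return hits.length == 0 ? null : searcher.doc(hits[0].doc).get("section");
        }
    }
}

The score of the top hit could also serve as a confidence threshold, so the combo box is only preselected when the match is reasonably strong.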
P.S. Neural network inputs and outputs will typically be doubles or floats with a value between 0 and 1. As for how to implement the String matching, I wouldn't even know where to start.
It seems to me that following will do:
hard word statistics
maybe a stemming class (English/Spanish) which reduces a word like "lunches" to "lunch".
a list of the most frequent non-words (the, at, a, for, ...)
The best fit is a linear problem, so in theory a fit for a neural net, but why not go straight for the numerical best fit.
A machine learning algorithm such as an Artificial Neural Network doesn't seem like the best solution here. ANNs can be used for multi-class classification (i.e. 'which of the provided pre-trained classes does the input belong to?', not just 'does the input represent an X?'), which fits your use case. The problem is that they are supervised learning methods, and as such you need to provide a list of pairs of keywords and classes (Sections) that spans every possible input your users will provide. This is impossible, and in practice ANNs are re-trained when more data is available to produce better results and create a more accurate decision boundary / representation of the function that maps inputs to outputs. This also assumes that you know all possible classes before you start and that each of those classes has training input values that you provide.
The issue is that the input to your ANN (a list of characters or a numerical hash of the string) provides no context by which to classify. There is no higher-level information provided that describes the word's meaning. This means that a different word that hashes to a numerically close value can be misclassified if there was insufficient training data.
(As maclema said, the output from an ANN will always be floats with each value representing proximity to a class - or a class with a level of uncertainty.)
A better solution would be to employ some kind of word-relation or synonym graph. A Bag of words model might be useful here.
Edit: In light of your comment that you don't know the Sections beforehand, an easy solution to program would be to provide a list of keywords in a file that gets updated as people use the program. Simply storing a mapping of provided comments -> Sections, which you will already have in your database, would allow you to filter out non-keywords (and, or, the, ...). One option is to then find the list of Sections that the typed keywords belong to, suggest several Sections, and let the user pick one. The feedback you get from user selections would enable improvements of the suggestions in the future. Another would be to calculate a Bayesian probability - the probability that this word belongs to Section X given the previously stored mappings - for all keywords and Sections, and either take the modal Section or normalise over each unique keyword and take the mean. The probabilities will of course need to be updated as you gather more information; perhaps this could be done with every new addition in a background thread. A rough sketch of the Bayesian variant follows below.
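This is only a sketch of that Bayesian idea, assuming the comment -> Section history is already available from the database; the class name, tokenisation and add-one smoothing are placeholder choices, and class priors are ignored:

import java.util.HashMap;
import java.util.Map;

public class SectionSuggester {

    // section -> (keyword -> how often it appeared in comments filed under that section)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    private final Map<String, Integer> totals = new HashMap<>();

    // Called whenever the user files a comment under a section.
    public void learn(String comment, String section) {
        Map<String, Integer> words = counts.computeIfAbsent(section, s -> new HashMap<>());
        for (String w : tokenize(comment)) {
            if (w.isEmpty()) continue;
            words.merge(w, 1, Integer::sum);
            totals.merge(section, 1, Integer::sum);
        }
    }

    // Returns the section whose stored keywords best explain the comment, or null if untrained.
    public String suggest(String comment) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String section : counts.keySet()) {
            int total = totals.getOrDefault(section, 1);
            double score = 0.0;
            for (String w : tokenize(comment)) {
                if (w.isEmpty()) continue;
                int c = counts.get(section).getOrDefault(w, 0);
                score += Math.log((c + 1.0) / (total + 1.0)); // add-one smoothing
            }
            if (score > bestScore) {
                bestScore = score;
                best = section;
            }
        }
        return best;
    }

    private static String[] tokenize(String text) {
        return text.toLowerCase().split("\\W+");
    }
}

Stop-word filtering (the non-keyword list mentioned above) and stemming would slot naturally into tokenize, and the counts can be updated in a background thread whenever a new entry is stored.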
I am looking for a program or library in Java capable of finding non-random properties of a byte sequence. Something that, when given a huge file, runs some statistical tests and reports whether the data shows any regularities.
I know three such programs, but not in Java. I tried all of them, but they don't really seem to work for me (which is quite surprising as one of them is by NIST). The oldest of them, diehard, works fine, but it's a bit hard to use.
As some of the commenters have stated, this is really an expert mathematics problem. The simplest explanation I could find for you is:
Run Tests for Non-randomness
Autocorrelation
It's interesting, but as it uses 'heads or tails' to simplify its example, you'll find you need to go much deeper to apply the same theory to encryption / cryptography etc - but it's a good start.
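As a concrete starting point, here is a small, self-contained sketch of a lag-k autocorrelation check over a byte sequence. It is only one of the many tests that suites like diehard or the NIST battery run, and the interpretation is the simplified one: coefficients near zero at every tested lag are what random data should produce, while a clearly non-zero value hints at a repeating structure.

// Simple lag-k autocorrelation of a byte sequence (bytes taken as unsigned 0..255).
public final class Autocorrelation {

    static double at(byte[] data, int lag) {
        double mean = 0;
        for (byte b : data) {
            mean += (b & 0xFF);
        }
        mean /= data.length;

        double num = 0, den = 0;
        for (int i = 0; i < data.length; i++) {
            double d = (data[i] & 0xFF) - mean;
            den += d * d;
            if (i + lag < data.length) {
                num += d * ((data[i + lag] & 0xFF) - mean);
            }
        }
        return den == 0 ? 0 : num / den;
    }

    public static void main(String[] args) throws Exception {
        byte[] data = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(args[0]));
        for (int lag = 1; lag <= 8; lag++) {
            System.out.printf("lag %d: %+.4f%n", lag, at(data, lag));
        }
    }
}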
Another approach would be to use fuzzy logic. You can extract fuzzy associative rules from sets of data. Those rules are basically implications of the form:
if A then B, interpreted for example "if 01101 (is present) then 1111 (will follow)"
Googling "fuzzy data mining"/"extracting fuzzy associative rules" should yield you more than enough results.
Your problem domain is quite huge, actually, since this is what data/text mining is all about. That, and statistical & combinatorial analysis, just to name a few.
About a program that does that - take a look at this.
Not so much an answer to your question as to your comment that "any observable pattern is bad", which got me thinking that randomness isn't really the problem but rather observable patterns, and to tackle that problem surely you need observers. So, in short, just set up a website and crowdsource it.
Some examples of this technique applied to colour naming: http://blog.xkcd.com/2010/05/03/color-survey-results/ and http://www.hpl.hp.com/personal/Nathan_Moroney/color-name-hpl.html