Localization of text in Android - java

What is the solution or best practice to display localized text strings in Android?
For example:
The English version text: "You have 1 message" and "You have 3 messages".
Note that the choice between "message" and "messages" is determined by the integer count.
If this were localized into another language, the number might need to be inserted at the beginning or the end of the sentence, not necessarily in the middle.
Further, for languages like Japanese it could be better to use the full-width "３" to display the number as part of the sentence.
That means that even if I manage all localized text in a strings file, I would still need some kind of logic to compute the final displayed text.
What is the best practice?
Any library I could use?

I would recommend looking into an i18n lib that has a mature ecosystem, e.g. i18next.
There is an Android lib too: i18next-android.
It has good support for multiple plural forms: i18next-android#multiple-plural-forms
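For comparison, Android's built-in quantity strings (plurals resources) also cover the plural-form part without a third-party lib. A minimal sketch, with a hypothetical resource name message_count:

    // In res/values/strings.xml (resource name is hypothetical):
    //   <plurals name="message_count">
    //       <item quantity="one">You have %d message</item>
    //       <item quantity="other">You have %d messages</item>
    //   </plurals>
    // A localized copy (e.g. res/values-ja/strings.xml) can move the
    // placeholder anywhere in the sentence.

    // In an Activity or anything else with a Context:
    int count = 3;
    String text = getResources().getQuantityString(
            R.plurals.message_count, count, count);
    // getQuantityString picks the plural form for the current locale and
    // formats the count into it, wherever the placeholder sits.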
Further, you should not only consider that you have to instrument your code (i18n) to get your app/website translated. You should think about the process too: how will you handle continuous localization, how will you keep track of progress, etc.
For a translation management system you might, e.g., have a look at locize. It plays well with all JSON-based i18n frameworks, has a very simple API at its core, and provides a lot more than traditional systems.

Related

Extract tags or relevant keywords from text

I need to extract relevant keywords or concepts, similar to AlchemyAPI's concept tagging method.
I would like to know if there's any tool that can provide something similar to the "concept tagging" of text or classification, not just stemming words or regex matching.
A standalone solution is preferable in my case, as I have a lot of data and I quickly hit the rate limits of Yahoo Term Extraction and AlchemyAPI.
E.g.
Input:
With that said Its the democratic publics decision on whether they agree or disagree
Output:
Decision making
This is called text classification.
Here is a 5-part video series on doing what you need with a tool called RapidMiner:
http://vancouverdata.blogspot.ca/2010/11/text-analytics-with-rapidminer-loading.html

Stack overflow for java regex

I am getting a StackOverflowError when validating a large string field in a CSV with a regex.
Regex:
(?![^\",][^,]*\")(\"(\"\"|[^\"])*\"|[^\",]*),[0-9]*
Target string:
"The Nuvi 1450LMT is a portable global positioning system receiver from Garmin that offers a step-up from the company's standard 1450 and 1450T models. Including free lifetime map and traffic updates, this model can be updated once every three months to ensure to the most-up-to-date location information. A built-in FM signal transmitter can provide up-to-the-minute traffic information concerning accidents, construction and other forms of road blockage, providing users with sufficient time to select an alternate route. A back-lighted 5-inch touchscreen TFT display is included that provides clear visual instruction, complete with "Lane Assist" technology that provides virtual first-person instruction on precisely what lanes to use. Comprehensive "City Navigator" maps are included for Canada, the US and Mexico, with two and three-dimensional support and over 6 million user-selected points of interest. Pedestrian navigation is also fully supported on the 1450LMT, with the "CityXplorer" service offering bus, rail, tram and other public transportation information for a wide variety of major cities. Fuel-effecient routes can be determined with the "EcoRoute" mode, while "HotFix" predictive satellite technology helps to maintain the most accurate locational information even when signal is temporarily lost. Photo navigation is supported through Garmin's "Photo Connect" service, and additional car marker and narration voices can be downloaded via the "Garmin Garage" website. Features 5-inch backlit TFT color touchscreen Free lifetime traffic updates Free maps MicroSD card support Voice prompts Lane assist function Auto Re-route Route avoidance FM traffic compatibility EcoRoute routing Custom Points Of Interest Garmin garage car marker and voice customization",9
Can someone help optimize it?
Can it be optimized using possessive quantifiers?
I think the best advice would be to not try to use regexes to parse CSV files. However you formulate the regex, there is the possibility of an unbounded number of branch points ... and hence a stack overflow for pathological input strings.
A better approach is to select and use a decent CSV library for Java. Check the answers to this Question:
Can you recommend a Java library for reading (and possibly writing) CSV files?
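For illustration, a minimal sketch using Apache Commons CSV, one such library; the dependency and the shortened sample row are assumptions, not from the question:

    import java.io.Reader;
    import java.io.StringReader;
    import org.apache.commons.csv.CSVFormat;
    import org.apache.commons.csv.CSVRecord;

    public class CsvFieldDemo {
        public static void main(String[] args) throws Exception {
            // The parser handles quoted fields, doubled ("") quote escapes
            // and embedded commas, so no hand-written regex is needed.
            Reader in = new StringReader("\"A 5-inch, \"\"backlit\"\" display\",9\n");
            for (CSVRecord record : CSVFormat.DEFAULT.parse(in)) {
                String description = record.get(0);
                int count = Integer.parseInt(record.get(1));
                System.out.println(description + " -> " + count);
            }
        }
    }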
You can make that error go away by adding a few plus signs:
"(?![^\",][^,]*\")(\"(\"\"|[^\"]+)*\"|[^\",]+),[0-9]+"
(plus signs added after [^\"], after [^\",], and after [0-9])
Note that those are just regular plus signs, not possessive modifiers. The second and third plus signs replaced asterisks, but it's the first one that makes the real difference. That [^\"]+ is what consumes most of the text, and it was doing so one character at a time before I added that plus sign.
But it still won't match; it will just fail more quickly. That regex is for matching CSV fields with properly escaped quotes, and if I understand you correctly, your problem is that they're not escaped. That's a much more challenging problem, but I wonder if you really need to deal with those inner quotes at all. Won't this work?
".*?",\d+
...or as a Java string literal:
"\".*?\",\\d+"
Or are you trying to correct the string by escaping the quotes yourself?

Finding UnicodeBlock set for a given Locale

I'm currently trying to figure out how to get a Character.UnicodeBlock set for a given Locale.
Different languages need different sets of characters.
What I'm exactly trying to achieve is having a String containing every character needed to write in a specific language. I can then use this String to precompute a set of OpenGL textures from a TrueType font file, so I can easily write any text in any language.
Precaching every single character and ending up with around 1,000,000 textures is of course not an option.
Does anyone have an idea? Or does anyone see a flaw in this procedure?
It's not as simple as that. Text in most European languages can often be written with a simple set of precomposed Unicode characters, but for many more complex scripts you need to handle composing characters. This starts fairly easily with combining accents for Western alphabets, progresses through Arabic letters that are context-sensitive (they have different shapes depending on whether they are first, last, or in the middle of a word), and ends with the utter madness that is found in many Indic scripts.
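To make the combining-character point concrete, a small Java sketch: the same visible glyph can be one precomposed code point or a base letter plus a combining mark, so a texture set keyed by individual code points already misses renderable text.

    import java.text.Normalizer;

    public class Combining {
        public static void main(String[] args) {
            String precomposed = "\u00E9"; // é as a single code point
            String combining = "e\u0301";  // 'e' + combining acute accent
            System.out.println(precomposed.equals(combining)); // false
            // NFC normalization recomposes the pair into U+00E9.
            System.out.println(Normalizer.normalize(combining,
                    Normalizer.Form.NFC).equals(precomposed)); // true
        }
    }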
The Unicode Standard has chapters about the intricacies involved in rendering the various scripts it can encode. Just sample, for example, the description of Tibetan early in chapter 10, and if that doesn't scare you away, flip back to Devanagari in chapter 9. You will quickly drop your ambition of being able to "write text in any language". Doing so correctly requires specialized rendering software, written by experts deeply familiar with the scripts in question.

What is a fast and unsupervised way of checking quality of pdf-extracted text?

I am working on a somewhat large corpus with articles numbering in the tens of thousands. I am currently using PDFBox to extract text with varying success, and I am looking for a way to programmatically check each file to see whether the extraction was moderately successful or not. I'm currently thinking of running a spellchecker on each of them, but the language can differ; I am not yet sure which languages I'm dealing with. Natural language detection with scores may also be an idea.
Oh, and any method also has to play nice with Java, be fast and relatively quick to integrate.
Try an automatically learning spell checker. That's not as scary as it sounds: Start with a big dictionary containing all the words you're likely to encounter. This can be from several languages.
When scanning a PDF, allow for a certain number of unknown words (say 5%). If any of these words are repeated often enough (say 5 times), add them to the dictionary. If the PDF contains more than 5% unknown words, it's very likely something that couldn't be processed.
The scanner will learn over time, allowing you to reduce the number of allowed unknown words if that should be necessary. If that is too much hassle, a very big dictionary should work well, too.
If you don't have a dictionary, manually process a couple of documents and have the scanner learn. After a dozen files or so, your new dictionary should be large enough to give a reasonable baseline.
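A minimal sketch of that check (the 5% and 5-repeat thresholds are the answer's own examples; everything else here is an assumption):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Locale;
    import java.util.Map;
    import java.util.Set;

    class ExtractionChecker {
        private final Set<String> dictionary = new HashSet<>();

        /** Returns true if the extracted word list looks like real text. */
        boolean looksOk(List<String> words) {
            Map<String, Integer> unknownCounts = new HashMap<>();
            int unknown = 0;
            for (String w : words) {
                String norm = w.toLowerCase(Locale.ROOT);
                if (!dictionary.contains(norm)) {
                    unknown++;
                    unknownCounts.merge(norm, 1, Integer::sum);
                }
            }
            // Unknown words repeated often enough are probably real: learn them.
            unknownCounts.forEach((w, n) -> { if (n >= 5) dictionary.add(w); });
            // More than 5% unknown words: extraction likely failed.
            return !words.isEmpty() && (double) unknown / words.size() <= 0.05;
        }
    }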
Of course no method will be perfect.
There are usually two classes of text extraction problems:
1 - Nothing gets extracted.
This can be because you've got a scanned document or something is invalid in the PDF.
Usually easy to detect; you should not need complicated code to check for those.
2 - You get garbage.
Most of the time this is because the PDF file is weirdly encoded.
This can be because of a homemade encoding that is not properly declared, or because the PDF author needed characters not recognized by PDF (for example, the Turkish S with cedilla was missing for some time from the Adobe Glyph List: you could not create a correctly encoded file with it inside, so you had to cheat to get it visually onto the page).
I use an n-gram based method to detect the languages of PDF files based on the extracted text (with different technologies, but the idea is the same). Files where the language is not recognized are usually good suspects for a problem...
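A minimal sketch of the n-gram idea (character trigrams compared against per-language reference profiles; the overlap score and thresholds are assumptions, not the answerer's actual method):

    import java.util.HashMap;
    import java.util.Locale;
    import java.util.Map;

    class TrigramLanguageCheck {
        /** Character-trigram counts for a text. */
        static Map<String, Integer> profile(String text) {
            Map<String, Integer> counts = new HashMap<>();
            String norm = text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
            for (int i = 0; i + 3 <= norm.length(); i++) {
                counts.merge(norm.substring(i, i + 3), 1, Integer::sum);
            }
            return counts;
        }

        /** Crude similarity: fraction of the document's trigrams that also
            occur in a language's reference profile. Garbled extractions
            score low against every language. */
        static double overlap(Map<String, Integer> doc, Map<String, Integer> ref) {
            if (doc.isEmpty()) return 0.0;
            long shared = doc.keySet().stream().filter(ref::containsKey).count();
            return (double) shared / doc.size();
        }
    }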
As for spellchecking, I suppose it will give you tons of false positives, especially if you have multiple languages!
You could just run the corpus against a list of stop words (the most frequent words, which search engines ignore, like "and" and "the"), but then you obviously need stop-word lists for all possible/probable languages first.
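A minimal sketch of that heuristic (the 10% threshold is an assumption):

    import java.util.List;
    import java.util.Locale;
    import java.util.Set;

    class StopWordCheck {
        /** Real text in a known language contains plenty of its function
            words; garbled extractions rarely do. */
        static boolean hitsStopWords(List<String> tokens, Set<String> stopWords) {
            if (tokens.isEmpty()) return false;
            long hits = tokens.stream()
                    .filter(t -> stopWords.contains(t.toLowerCase(Locale.ROOT)))
                    .count();
            return (double) hits / tokens.size() > 0.10;
        }
    }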

Natural language parsing, practical example

I am looking to use a natural language parsing library for a simple chat bot. I can get the part-of-speech tags, but I always wonder: what do you do with the POS tags? If I know the parts of speech, what then?
I guess it would help with the responses. But what data structures and architecture could I use?
A part-of-speech tagger assigns labels to the words in the input text. For example, the popular Penn Treebank tagset has some 40 labels, such as "plural noun", "comparative adjective", "past tense verb", etc. The tagger also resolves some ambiguity. For example, many English word forms can be either nouns or verbs, but in the context of other words, their part of speech is unambiguous.
So, having annotated your text with POS tags you can answer questions like: how many nouns do I have?, how many sentences do not contain a verb?, etc.
For a chatbot, you obviously need much more than that. You need to figure out the subjects and objects in the text, and which verb (predicate) they attach to; you need to resolve anaphora (which individual a "he" or "she" points to), determine the scope of negation and quantifiers (e.g. every, more than 3), etc.
Ideally, you need to map your input text into some logical representation (such as first-order logic), which would let you bring in reasoning to determine whether two sentences are equivalent in meaning, or stand in an entailment relationship, etc.
While a POS-tagger would map the sentence
Mary likes no man who owns a cat.
to such a structure
Mary/NNP likes/VBZ no/DT man/NN who/WP owns/VBZ a/DT cat/NN ./.
you would rather need something like this:
SubClassOf(
    ObjectIntersectionOf(
        Class(:man)
        ObjectSomeValuesFrom(
            ObjectProperty(:own)
            Class(:cat)
        )
    )
    ObjectComplementOf(
        ObjectSomeValuesFrom(
            ObjectInverseOf(ObjectProperty(:like))
            ObjectOneOf(
                NamedIndividual(:Mary)
            )
        )
    )
)
Of course, while POS-taggers get precision and recall values close to 100%, more complex automatic processing will perform much worse.
A good Java library for NLP is LingPipe. It doesn't, however, go much beyond POS-tagging, chunking, and named entity recognition.
Natural language processing is wide and deep, with roots going back at least to the 60s. You could start reading up on computational linguistics in general, natural language generation, generative grammars, Markov chains, chatterbots and so forth.
Wikipedia has a short list of libraries which I assume you might have seen. Java doesn't have a long tradition in NLP, though I haven't looked at the Stanford libraries.
I doubt you'll get very impressive results without diving fairly deeply into linguistics and grammar. Not everybody's favourite school subject (or so I've heard reported -- loved'em meself!).
I'll skip a lot of details and keep this simple. Part-of-speech tagging helps you to create a parse tree out of a sentence. Once you have this, you try to make out the meaning as unambiguously as possible. The result of this parsing step will greatly aid you in framing a suitable response for your chatterbot.
Once you have part of speech tags you can extract, for example, all nouns, so you know roughly what things or objects someone is talking about.
To give you an example:
Someone says "you can open a new window." When you have the POS tags you know they are not talking about a can (as in container, jar etc., which would even make sense in the context of open), but a window. You'll also know that open is a verb.
With this information, your chat bot can generate a much better reply that will have nothing to do with can openers etc.
Note: You don't need a parser to get POS tags. A simple POS tagger is enough. A parser will give you even more information (e.g. what is the subject, what the object of the sentence?)
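A minimal sketch of noun extraction with a plain POS tagger, here Apache OpenNLP (the library choice and the pretrained model path are assumptions; the question doesn't name a tagger):

    import java.io.FileInputStream;
    import java.util.ArrayList;
    import java.util.List;
    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSTaggerME;

    public class NounExtractor {
        public static void main(String[] args) throws Exception {
            // Pretrained English model from the OpenNLP download page.
            POSModel model = new POSModel(new FileInputStream("en-pos-maxent.bin"));
            POSTaggerME tagger = new POSTaggerME(model);

            String[] tokens = {"you", "can", "open", "a", "new", "window"};
            String[] tags = tagger.tag(tokens); // e.g. PRP, MD, VB, DT, JJ, NN

            List<String> nouns = new ArrayList<>();
            for (int i = 0; i < tokens.length; i++) {
                if (tags[i].startsWith("NN")) { // Penn Treebank noun tags
                    nouns.add(tokens[i]);
                }
            }
            // Prints [window]: "can" is tagged as a modal verb in this
            // context, not as a noun, which is exactly the point above.
            System.out.println(nouns);
        }
    }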
