I have numbers like 32, 33, 33.1, 33.2, 34, 34.1, 35, 35.1, 35.2, 35.3, 35.4, 36 and so on. Is it possible that, if I change 32 to 50, all the following numbers change accordingly, i.e. to 50, 51, 51.1, 51.2, 52, 52.1, 53, 53.1, 53.2, 53.3, 53.4, 54, perhaps using a regex pattern or some Java code?
Based on the Excel tag, and assuming the numbers are in different cells: type 18 (the difference between 50 and 32) into a spare cell and copy it, then select the numbers and use Paste Special with the Add operation.
You will need to code this in Java; arithmetic like this is impractical with regexes alone.
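If you go the Java route, a minimal sketch might look like the following; it assumes each number is a string whose integer part should be shifted by a fixed offset (the class and method names are just for illustration):

    import java.util.*;

    // Sketch only: shifts the integer part of version-like numbers such as "33.1"
    // by a fixed offset, keeping the ".1" style suffix untouched.
    public class NumberShifter {

        static List<String> shift(List<String> numbers, int offset) {
            List<String> result = new ArrayList<>();
            for (String n : numbers) {
                int dot = n.indexOf('.');
                String major = (dot < 0) ? n : n.substring(0, dot);
                String suffix = (dot < 0) ? "" : n.substring(dot);   // ".1", ".2", ...
                result.add((Integer.parseInt(major) + offset) + suffix);
            }
            return result;
        }

        public static void main(String[] args) {
            List<String> in = Arrays.asList("32", "33", "33.1", "33.2", "34", "34.1");
            System.out.println(shift(in, 50 - 32));   // [50, 51, 51.1, 51.2, 52, 52.1]
        }
    }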
I have text extracted from an image using OCR. Some of the words in the text are not recognized correctly, for example:
'DRDER 0F OFF1CE RESTAURAUT, QNE THO...'
As you can see, some characters are optically easy to mistake for others: 1 -> I, O -> D -> Q, H -> W, U -> N and so on.
Question: Apart from standard algorithms like Levenshtein distance, is there a Java or Python library implementing an OCR-specific algorithm that can compare words against a predefined dictionary and give a score, taking likely OCR character mix-ups into account?
I don't know of anything OCR-specific, but you might be able to make this work with Biopython, because the basic problem of comparing one string to another using a matrix that scores each character's similarity to every other character is very common in bioinformatics. We call it a sequence alignment problem.
Have a look at the pairwise2 module that Biopython provides; you would be able to compare each input word against each dictionary word with pairwise2.align.globaldx, using a dict that has all the pairwise character similarities. There are also functions in there for scoring deleted/inserted characters.
Computing the pairwise character similarities would be something you'd have to do yourself, maybe by rendering each character in your chosen font and comparing the images, or maybe manually by just rating which characters look similar to you. You could also have a look at this other SO answer where characters are broken into classes based on the presence/absence of strokes.
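Since Biopython is Python, here is a rough Java sketch of the same underlying idea: a global (Needleman-Wunsch style) alignment score driven by a hand-made character-similarity table. The confusion pairs and scores below are invented purely for illustration:

    import java.util.*;

    public class OcrAlign {

        // Toy confusion pairs; a real table would come from font rendering or manual rating.
        static final Set<String> CONFUSABLE = new HashSet<>(
                Arrays.asList("1I", "0O", "OD", "OQ", "HW", "UN"));

        static final int GAP = -2;   // cost of an inserted/deleted character

        static int score(char a, char b) {
            if (a == b) return 2;                                  // exact match
            if (CONFUSABLE.contains("" + a + b)
                    || CONFUSABLE.contains("" + b + a)) return 1;  // visually similar
            return -1;                                             // unrelated characters
        }

        // Plain Needleman-Wunsch global alignment score between two words.
        static int align(String s, String t) {
            int[][] dp = new int[s.length() + 1][t.length() + 1];
            for (int i = 1; i <= s.length(); i++) dp[i][0] = i * GAP;
            for (int j = 1; j <= t.length(); j++) dp[0][j] = j * GAP;
            for (int i = 1; i <= s.length(); i++)
                for (int j = 1; j <= t.length(); j++)
                    dp[i][j] = Math.max(
                            dp[i - 1][j - 1] + score(s.charAt(i - 1), t.charAt(j - 1)),
                            Math.max(dp[i - 1][j] + GAP, dp[i][j - 1] + GAP));
            return dp[s.length()][t.length()];
        }

        public static void main(String[] args) {
            System.out.println(align("0FF1CE", "OFFICE"));   // 10: scores like a near-match
            System.out.println(align("0FF1CE", "ORANGE"));   // much lower
        }
    }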
If you want something better than O(input * dictionary), you'd have to switch from brute force comparison to some kind of seed-match-based algorithm. If you assume that you'll always have a 2-character perfect match for example, you can index your dictionary by which words contain each length-2 string, and only compare the input words against the dictionary words that share a length-2 string with them.
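And a sketch of that seed-match idea in Java, indexing the dictionary by its length-2 substrings; the dictionary contents here are placeholders:

    import java.util.*;

    public class BigramIndex {

        // Maps each 2-character substring to the dictionary words containing it.
        private final Map<String, Set<String>> index = new HashMap<>();

        BigramIndex(Collection<String> dictionary) {
            for (String word : dictionary)
                for (int i = 0; i + 2 <= word.length(); i++)
                    index.computeIfAbsent(word.substring(i, i + 2), k -> new HashSet<>()).add(word);
        }

        // Candidate dictionary words that share at least one bigram with the input.
        Set<String> candidates(String input) {
            Set<String> result = new HashSet<>();
            for (int i = 0; i + 2 <= input.length(); i++)
                result.addAll(index.getOrDefault(input.substring(i, i + 2), Collections.emptySet()));
            return result;
        }

        public static void main(String[] args) {
            BigramIndex idx = new BigramIndex(Arrays.asList("ORDER", "OFFICE", "ONE", "TWO"));
            System.out.println(idx.candidates("0FF1CE"));   // [OFFICE], shares "FF" and "CE"
        }
    }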
I want to make a calculator. I don't know how to split the string and calculate the result. Is there an algorithm or some easy way to get the result? I have already searched for it, but the results only cover infix expressions.
Using an External Library
I Googled the term "android library for calculating math expressions in strings" and the first result was: http://mathparser.org/
If you go to that link and scroll down a bit, it actually shows an animation of how it works. There also seem to be plenty of tutorials.
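If you go with that library (mXparser), usage is roughly along these lines; double-check the project's current documentation for the exact package and class names:

    // Assumes the mXparser dependency from http://mathparser.org/ is on the classpath.
    import org.mariuszgromada.math.mxparser.Expression;

    public class LibraryCalc {
        public static void main(String[] args) {
            Expression e = new Expression("2*2+6/3+6");
            System.out.println(e.calculate());   // 12.0
        }
    }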
Creating your own Algorithm
While using an external library certainly seems the best option in this case, if you were required to develop the algorithm yourself, the use of BODMAS (or PEMDAS, depending on where you are from) would be critical. In your case, however, brackets and orders would not be needed.
Your algorithm could first iterate over the string for every case of division, i.e. wherever the character '/' is found, then multiplication '*', then addition '+' and finally subtraction '-'.
For each case, the operands are the digits to the left and right of the operator; replace that immediate expression with its result in the overall expression.
So the steps of your recursive style algorithm may look like the following:
2*2+6/3+6
2*2+2+6
4+2+6
6+6
12
Some things you'll need to be able to do:
Convert a character to the expected integer (hint: subtract '0', i.e. 48)
Consider if your operands will be longer than one character e.g. 10 (look into substrings)
Possibly think about performance improvement e.g. 4+2+6 could be calculated in one step rather than two
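Here is a rough Java sketch of the approach described above: scan for each operator in turn, evaluate each immediate "number operator number" expression, and splice the result back into the string. It only handles non-negative integers and the four operators, with no error handling:

    import java.util.regex.*;

    public class SimpleCalc {

        // Evaluates expressions containing only non-negative integers and / * + -.
        static String evaluate(String expr) {
            for (char op : new char[] {'/', '*', '+', '-'}) {
                Pattern p = Pattern.compile("(\\d+)" + Pattern.quote(String.valueOf(op)) + "(\\d+)");
                Matcher m = p.matcher(expr);
                while (m.find()) {
                    long a = Long.parseLong(m.group(1));
                    long b = Long.parseLong(m.group(2));
                    long r;
                    if (op == '/')      r = a / b;
                    else if (op == '*') r = a * b;
                    else if (op == '+') r = a + b;
                    else                r = a - b;
                    expr = m.replaceFirst(Long.toString(r));   // replace "a op b" with its result
                    m = p.matcher(expr);                       // rescan the updated expression
                }
            }
            return expr;
        }

        public static void main(String[] args) {
            System.out.println(evaluate("2*2+6/3+6"));   // 12
        }
    }

Note that processing all additions before all subtractions (and all divisions before all multiplications) can misorder same-precedence operations such as 2-5+3; a fuller implementation would handle * and /, then + and -, scanning left to right.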
I'm having trouble searching for the right terms here to solve the below problem; I'm sure it's a done thing, I just can't find the right terms to express the problem!
I'm basically trying to create a classifier that will take word-comparison outputs (e.g. some outputs from Levenshtein distances) and decide whether the words are sufficiently different. An important input would probably be something like a Soundex comparison. The trouble I'm having is creating the training set for the algorithm (an SVM in this case). I have a long list of names and I need to mutate them a bit (based on similar sounds within the word).
E.g. John and Jon would be a mutation to make, and I could label this in the test set as being equivalent. John and Johann have sufficiently different sound and letter distance to be considered different.
So what I'm asking for is essentially a phoneme variation generator, but one that retains English spelling structure.
Even a simple substitution might suffice, e.g. "f" could (sometimes) be replaced by "ph". I'm doing this in Java, so any tips in that direction would be great too! Thanks.
EDIT
This is the closest I've come across so far: http://www.isi.edu/natural-language/people/hovy/papers/07IJCAI-spelling-variants.pdf
I'm just thinking aloud.
Rule-based: Apply a rule-based system where you could use standard substitution rules such as 'ph' for 'f', and insertion rules such as inserting an 'h' between a vowel and a consonant.
Character n-gram alignment:
Use a word alignment tool such as Giza++ to align character n-grams from parallel corpora such as Europarl. I guess you would be able to find interesting word spelling variations such as "house", "haus" etc. You can play with various values of n.
Bootstrapping character n-gram alignment with rule-based: You might also want to use a combination of the two, in which you could, in principle, boost the probabilities of some alignments by using a set of external rules and heuristics.
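For the rule-based option, a minimal Java sketch might look like this; the substitution rules below are purely illustrative, and a real system would need a curated rule set:

    import java.util.*;

    public class VariantGenerator {

        // Illustrative spelling-variation rules only.
        static final Map<String, String> RULES = new LinkedHashMap<>();
        static {
            RULES.put("ph", "f");
            RULES.put("f", "ph");
            RULES.put("oh", "o");
            RULES.put("ks", "x");
        }

        // Generates one variant per applicable rule (first occurrence only).
        static Set<String> variants(String name) {
            Set<String> out = new LinkedHashSet<>();
            String lower = name.toLowerCase();
            for (Map.Entry<String, String> rule : RULES.entrySet()) {
                int i = lower.indexOf(rule.getKey());
                if (i >= 0)
                    out.add(lower.substring(0, i) + rule.getValue()
                            + lower.substring(i + rule.getKey().length()));
            }
            return out;
        }

        public static void main(String[] args) {
            System.out.println(variants("John"));      // [jon]
            System.out.println(variants("Stephen"));   // [stefen]
        }
    }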
I am converting single Chinese characters into Roman letters (pinyin) using the pinyin4j package in Java. However, this often yields multiple pinyin forms for one character (the same character has different pronunciations). Say character C1 converts to 2 pinyin forms, p1 and p2, and character C2 converts to 3 pinyin forms, q1, q2, q3.
When I combine C1C2 into a word, that yields 2*3 = 6 combinations. Usually only one of these is a real word. I want to check these combinations against a lexicon text file I built, where each line starts with a word (\w) that is a lexical entry (so, for instance, only p1q2 out of the 6 combinations is found in the lexicon). I'm thinking about reading the lexicon file into a HashSet, but I'm not sure how best to implement the whole process. Any suggestions?
A HashSet seems quite alright. If the lexicon is extra large and you have to be super fast, consider using a trie data structure. There is, however, no implementation in the Java standard library.
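A sketch of the whole flow, assuming the lexicon has one entry at the start of each line; the file name, entry format, and pinyin candidates are placeholders:

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.*;
    import java.util.stream.*;

    public class PinyinLookup {

        // Reads the first whitespace-separated token of each line as a lexical entry.
        static Set<String> loadLexicon(Path file) throws IOException {
            try (Stream<String> lines = Files.lines(file)) {
                return lines.map(l -> l.split("\\s+")[0])
                            .collect(Collectors.toCollection(HashSet::new));
            }
        }

        // Cross-product of the candidate pinyin for each character, kept only if in the lexicon.
        static List<String> validCombinations(List<String[]> candidates, Set<String> lexicon) {
            List<String> combos = new ArrayList<>();
            combos.add("");
            for (String[] options : candidates) {
                List<String> next = new ArrayList<>();
                for (String prefix : combos)
                    for (String option : options)
                        next.add(prefix + option);
                combos = next;
            }
            combos.retainAll(lexicon);
            return combos;
        }

        public static void main(String[] args) throws IOException {
            Set<String> lexicon = loadLexicon(Paths.get("lexicon.txt"));
            List<String[]> candidates = Arrays.asList(
                    new String[] {"p1", "p2"},            // pinyin readings for C1 (placeholders)
                    new String[] {"q1", "q2", "q3"});     // pinyin readings for C2 (placeholders)
            System.out.println(validCombinations(candidates, lexicon));
        }
    }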
I have created a regular expression like this:
^[0-9][0-9][A-Z][A-Z][a-z]_([0-9]{1,10})_([0-9]{1,11})_([0-9]{1,11})$
It should match values ranging from 01BRa_1_1_1 to 99BRz_9999999999_99999999999_99999999999.
My problem is that I need to exclude the value 0 from each of the _number_number_number fields so that they start from 1.
I have been trying different expressions but can't find the right one.
If someone knows how to solve this, any help would be appreciated. Thanks.
The goal is to eliminate 0_0_0, as well as 00_00_00 and 000_000_000, and any situation where 0 is the first number, so the first valid combination for those three fields would be 1_1_1.
I am using this in Java (to reply to one comment), but I don't see the relevance of that; more or less, this is just a Pattern.
Resolved with this:
^[0-9][0-9][A-Z][A-Z][a-z]_([1-9][0-9]{0,9})_([1-9][0-9]{0,10})_([1-9][0-9]{0,10})$
If your goal is to eliminate values equal to 0 (0, 00, 000, etc.), then an expression like this might work:
^[0-9][0-9][A-Z][A-Z][a-z]_(?!0+_)([0-9]{1,10})_(?!0+_)([0-9]{1,11})_(?!0+$)([0-9]{1,11})$
Of course, this will depend on your regex engine supporting variable-length zero-width assertions (aka "lookahead"). It would help to know which flavor you are using. (From the regex tooltip: "Please also include a tag specifying the programming language or tool you are using.")
If your goal is to eliminate anything starting with 0 (0, 01, 001, etc.), then an expression like this might work:
^[0-9][0-9][A-Z][A-Z][a-z]_([1-9][0-9]{0,10})_([1-9][0-9]{0,10})_([1-9][0-9]{0,10})$
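In Java, a quick way to verify the resolved pattern from the question is a small test like the following (test values taken from the question):

    import java.util.regex.Pattern;

    public class IdCheck {

        // The resolved pattern: each numeric field must start with 1-9.
        static final Pattern ID = Pattern.compile(
                "^[0-9][0-9][A-Z][A-Z][a-z]_([1-9][0-9]{0,9})_([1-9][0-9]{0,10})_([1-9][0-9]{0,10})$");

        public static void main(String[] args) {
            System.out.println(ID.matcher("01BRa_1_1_1").matches());   // true
            System.out.println(ID.matcher("01BRa_0_1_1").matches());   // false: numeric field starts with 0
            System.out.println(ID.matcher("99BRz_9999999999_99999999999_99999999999").matches());   // true
        }
    }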