Interpret sentence and convert into their corresponding format - java

These are some input formats and their corresponding outputs:
1. 7 years 10 months ---> YRS:7 MNHS:10
2. 7 kgs 10 grms ---> KGS:7 GRMS:10
3. 7 kilograms 10 grams ---> KGS:7 GRMS:10
4. 7 thousand 9 hundred ---> 7900
5. seven years ten months ---> YRS:7 MNHS:10
6. seven kgs ten grms ---> KGS:7 GRMS:10
7. triple seven double five ---> 77755
I wrote a separate module for each of these, storing the lookup information in HashMaps, and they work fine.
Now I need to write one main module that takes a sentence (utterance) as input and replaces every such substring with its corresponding output.
For example,
Input: Dial number triple eight triple four three nine eight.
Output: Dial number 888444398.
and many such utterances.
My questions:
I used a number of HashMaps in the smaller modules to store the meaning of keys, e.g. triple means 3 times, double means 2 times, and so on. The limitation is that whenever I need to support something new, I have to add an entry to a HashMap. Please suggest a better technique for this.
I am also unsure how, in the main module, to extract the useful substrings shown in the examples above from a given utterance. Please suggest a good technique for this too.
Project language: Java.

You should look at the Illinois Quantifier package:
http://cogcomp.cs.illinois.edu/page/software_view/Quantifier
http://cogcomp.cs.illinois.edu/demo/quantities/results.php

You might want to use some kind of formal grammar parser. Just designing the grammar can clarify the problem. In the simplest case your grammar could look like this:
STRING -> "" | STRING MEASUREMENT | STRING NUMBER | STRING WORD
MEASUREMENT -> NUMBER UNITS
UNITS -> kgs | grms | years | months | ...
NUMBER -> THOUSAND HUNDRED NUMBER_BELOW_HUNDRED | THOUSAND HUNDRED
THOUSAND -> "" | NUMBER_BELOW_HUNDRED thousand
HUNDRED -> "" | NUMBER_BELOW_HUNDRED hundred
NUMBER_BELOW_HUNDRED -> one | two | three | ... | ninety nine | 99 | 98 | ... | 1
WORD -> /* all other */
You can write the parser yourself (in this case it seems to be pretty easy) or use a ready-made solution like Bison/Flex.
The usual alternative to hard-coded HashMap entries is to load them from configuration files.
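As a rough illustration of the hand-written route, here is a minimal sketch that covers only a small slice of such a grammar: single digit words, optionally preceded by "double" or "triple", with adjacent results joined into one number. The class name `SpokenDigits`, its two maps, and the rule that trailing punctuation ends a number are assumptions made for this example; units, "thousand" and "hundred" are not handled.

```java
import java.util.HashMap;
import java.util.Map;

final class SpokenDigits {
    // Hypothetical vocabularies; in practice these could be loaded from a configuration file.
    private static final Map<String, Integer> DIGITS = new HashMap<>();
    private static final Map<String, Integer> REPEATS = new HashMap<>();
    static {
        String[] names = {"zero", "one", "two", "three", "four",
                          "five", "six", "seven", "eight", "nine"};
        for (int d = 0; d < names.length; d++) DIGITS.put(names[d], d);
        REPEATS.put("double", 2);
        REPEATS.put("triple", 3);
    }

    /** Lower-cased token without trailing punctuation ("Eight." -> "eight"). */
    private static String core(String token) {
        return token.replaceAll("[^\\p{Alnum}]+$", "").toLowerCase();
    }

    /** Replaces runs of spoken digits with digits and leaves all other words untouched. */
    static String convert(String utterance) {
        String[] tokens = utterance.trim().split("\\s+");
        StringBuilder out = new StringBuilder();
        boolean joinToPrevious = false; // true while we are inside a run of digits
        for (int i = 0; i < tokens.length; i++) {
            int repeat = 1;
            Integer times = REPEATS.get(core(tokens[i]));
            // "triple eight": consume the multiplier only if a digit word follows it.
            if (times != null && i + 1 < tokens.length
                    && DIGITS.containsKey(core(tokens[i + 1]))) {
                repeat = times;
                i++;
            }
            String token = tokens[i];
            Integer digit = DIGITS.get(core(token));
            if (digit == null) {            // ordinary word: copy it through
                if (out.length() > 0) out.append(' ');
                out.append(token);
                joinToPrevious = false;
                continue;
            }
            if (!joinToPrevious && out.length() > 0) out.append(' ');
            for (int r = 0; r < repeat; r++) out.append(digit);
            String punctuation = token.substring(core(token).length());
            out.append(punctuation);
            joinToPrevious = punctuation.isEmpty(); // punctuation ends the number
        }
        return out.toString();
    }
}
```

Calling `SpokenDigits.convert("Dial number triple eight triple four three nine eight.")` returns `Dial number 888444398.`. Supporting a new multiplier such as "quadruple" would then be one more entry in the vocabulary rather than a code change.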


Best way to implement friends list into a database? MySQL

So my project has a "friends list" and in the MySQL database I have created a table:
nameA
nameB
Primary Key (nameA, nameB)
This will lead to a lot of entries, but I'm not sure how else to do it while keeping the database normalised.
My project also uses Redis, so I could store them there instead.
When a person joins the server, I would then have to search all of the entries to see whether their name is nameA or nameB, and then pair up the two names as friends. This may also be inefficient.
Cheers.
The task is quite common. You want to store pairs where A|B has the same meaning as B|A. As a table has columns, one of the two will be stored in the first column and the other in the second, but which one should go first, which second, and why?
One solution is to always store the lesser ID first and the greater ID second:
userid1 | userid2
--------+--------
1 | 2
2 | 5
2 | 6
4 | 5
This has the advantage that you store each pair only once, as feels natural, but has the disadvantage that you must look up a person in both columns and find their friend sometimes in the first and sometimes in the second column. That may make queries kind of clumsy.
Another method is to store the pairs redundantly (by using a trigger typically):
userid1 | userid2
--------+--------
1 | 2
2 | 1
2 | 5
2 | 6
4 | 5
5 | 2
5 | 4
6 | 2
Here querying is easier: Look the person up in one column and find their friends in the other. However, it looks kind of weird to have all pairs duplicated. And you rely on a trigger, which some people don't like.
A third method is to store numbered friendships:
friendship | user_id
-----------+--------
1 | 1
1 | 2
2 | 2
2 | 5
3 | 2
3 | 6
4 | 4
4 | 5
This gives both users in the pair equal value. But in order to find friends, you need two passes: find the friendships for a user, then find the friends in these friendships. However, the design is very clear and even extensible, i.e. you could have friendships of three, four, or more users.
None of these methods is clearly better than the others.
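To make the first layout (lesser ID first) concrete, here is a small in-memory sketch of the normalisation step an application could apply before writing a row, together with the two-column lookup it implies. The class `FriendPairs` and its methods are hypothetical names for this illustration; it is not database code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class FriendPairs {
    // Each friendship is kept once as [lesserId, greaterId], like a (userid1, userid2) row.
    private final Set<List<Integer>> pairs = new HashSet<>();

    /** Stores the pair in canonical order, so (2, 1) and (1, 2) are the same entry. */
    void addFriendship(int a, int b) {
        pairs.add(Arrays.asList(Math.min(a, b), Math.max(a, b)));
    }

    int size() {
        return pairs.size();
    }

    /** The "clumsy" part: the user may appear in either column. */
    Set<Integer> friendsOf(int user) {
        Set<Integer> friends = new TreeSet<>();
        for (List<Integer> pair : pairs) {
            if (pair.get(0) == user) {
                friends.add(pair.get(1));
            } else if (pair.get(1) == user) {
                friends.add(pair.get(0));
            }
        }
        return friends;
    }
}
```

With the pairs from the first table, `friendsOf(2)` yields 1, 5 and 6, and adding (2, 1) after (1, 2) does not create a second entry.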

Searching words in array [duplicate]

Possible Duplicate:
How to find list of possible words from a letter matrix [Boggle Solver]
I have a String[][] array such as
h,b,c,d
e,e,g,h
i,l,k,l
m,l,o,p
I need to match an ArrayList against this array to find the words specified in the ArrayList. When searching for the word hello, I need to get a positive match and the locations of the letters, for example in this case (0,0), (1,1), (2,1), (3,1) and (3,2).
Going letter by letter, suppose we have successfully located the first l. The program should then try to find the next letter (l) in the places next to it, so it should match against e, e, g, k, o, l, m and i, meaning all the letters around it: horizontally, vertically and diagonally. The same position can't be used twice in one word, so (0,0), (1,1), (2,1), (2,1) and (3,2) wouldn't be acceptable, because the position (2,1) is matched twice. Both paths would spell the word, since diagonal moves are allowed, but the second l has to come from a different position because a position cannot be used more than once.
This grid should also match:
h,b,c,d
e,e,g,h
l,l,k,l
m,o,f,p
If we try to search for helllo, it won't match. Neither (x1, y1) (x1, y1) nor (x1, y1) (x2, y2) (x1, y1) can be matched.
What I want to know is the best way to implement this kind of feature. If I have a 4x4 String[][] array and 100 000 words in an ArrayList, what is the most efficient and the easiest way to do this?
I think you will probably spend most of your time trying to match words that can't possibly be built by your grid. So, the first thing I would do is try to speed up that step and that should get you most of the way there.
I would re-express the grid as a table of possible moves that you index by the letter. Start by assigning each letter a number (usually A=0, B=1, C=2, ... and so forth). For your example, let's just use the alphabet of the letters you have (in the second grid where the last row reads " m o f p "):
b  | c  | d  | e  | f  | g  | h  | k  | l  | m  | o  | p
---+----+----+----+----+----+----+----+----+----+----+----
0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11
Then you make a 2D boolean array that tells whether you have a certain letter transition available (T = available, . = not available):
      |  0  1  2  3  4  5  6  7  8  9 10 11   <- from letter
      |  b  c  d  e  f  g  h  k  l  m  o  p
------+-------------------------------------
 0  b |  .  T  .  T  .  T  T  .  .  .  .  .
 1  c |  T  .  T  T  .  T  T  .  .  .  .  .
 2  d |  .  T  .  .  .  T  T  .  .  .  .  .
 3  e |  T  T  .  T  .  T  T  T  T  .  .  .
 4  f |  .  .  .  .  .  .  .  T  T  .  T  T
 5  g |  T  T  T  T  .  .  T  T  T  .  .  .
 6  h |  T  T  T  T  .  T  .  T  T  .  .  .
 7  k |  .  .  .  T  T  T  T  .  T  .  T  T
 8  l |  .  .  .  T  T  T  T  T  T  T  T  T
 9  m |  .  .  .  .  .  .  .  .  T  .  T  .
10  o |  .  .  .  .  T  .  .  T  T  T  .  .
11  p |  .  .  .  .  T  .  .  T  T  .  .  .
    ^
    to letter
Now go through your word list and convert the words to transitions (you can pre-compute this):
hello (6, 3, 8, 8, 10):
6 -> 3, 3 -> 8, 8 -> 8, 8 -> 10
Then check if these transitions are allowed by looking them up in your table:
[6][3] : T
[3][8] : T
[8][8] : T
[8][10] : T
If they are all allowed, there's a chance that this word might be found.
For example the word "helmet" can be ruled out on the 4th transition (m to e: helMEt), since that entry in your table is false.
And the word hamster can be ruled out, since the first (h to a) transition is not allowed (doesn't even exist in your table).
Now, for the remaining words that you didn't eliminate, try to actually find them in the grid the way you're doing it now or as suggested in some of the other answers here. This is to avoid false positives that result from jumps between identical letters in your grid. For example, the word "help" is allowed by the table but not by the grid: the only l next to the p is not next to an e.
Let me know when your boggle phone-app is done! ;)
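The pre-filter described in this answer could be sketched as follows. This is a simplified variant under stated assumptions: it indexes a fixed 26x26 table by `letter - 'a'` instead of compacting the alphabet to the letters present, it assumes the grid and the words contain only lower-case a to z, and `TransitionFilter` and `mightContain` are hypothetical names.

```java
final class TransitionFilter {
    // allowed[x][y] is true if some cell holding letter x touches a cell holding letter y.
    private final boolean[][] allowed = new boolean[26][26];

    TransitionFilter(char[][] grid) {
        for (int r = 0; r < grid.length; r++) {
            for (int c = 0; c < grid[r].length; c++) {
                for (int dr = -1; dr <= 1; dr++) {
                    for (int dc = -1; dc <= 1; dc++) {
                        int nr = r + dr;
                        int nc = c + dc;
                        boolean isNeighbour = (dr != 0 || dc != 0)
                                && nr >= 0 && nr < grid.length
                                && nc >= 0 && nc < grid[nr].length;
                        if (isNeighbour) {
                            allowed[grid[r][c] - 'a'][grid[nr][nc] - 'a'] = true;
                        }
                    }
                }
            }
        }
    }

    /**
     * False means the word is certainly not in the grid.
     * True only means it survived the filter and still needs a real grid search.
     */
    boolean mightContain(String word) {
        for (int i = 0; i + 1 < word.length(); i++) {
            if (!allowed[word.charAt(i) - 'a'][word.charAt(i + 1) - 'a']) {
                return false;
            }
        }
        return true;
    }
}
```

Built from the second grid, the filter accepts "hello" and "help" and rejects "helmet" and "hamster", which matches the examples above; "help" is the false positive that the follow-up grid search has to weed out.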
Although I am sure there is a beautiful and efficient answer to this question academically, you can use the same approach, but with a list of possibilities. For the word 'hello', when you find the letter 'h', you next add the possible 'e' letters, and so on. Every possibility forms a path of letters.
I would start by thinking of your grid as a graph, where each grid position is a node and each node connect to its eight neighbors (you shouldn't need to explicitly code it as a graph in code, however). Once you find the potential starting letters, all you need to do is to do a depth first search of the graph from each start position. The key is to remember where you've already searched, so you don't end up making more work for yourself (or worse, get stuck in a cycle).
Depending on the size of character space being used, you might also be able to benefit from building lookup tables. Let's assume English (26 contiguous character codepoints); if you start by building a 26-element List<Point>[] array, you can populate that array once from your grid, and then can quickly get a list of locations to start your search for any word. For example, to get the locations of h I would write arr['h'-'a']
You can even leverage this further if you apply the same strategy and build lookup tables for each edge list in the graph. Instead of having to search all 8 edges for each node, you already know which edges to search (if any).
(One note - if your character space is non-contiguous, you can still do a lookup table, but you'll need to use a HashMap<Character,List<Point>> and map.get('h') instead.)
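A minimal sketch of the depth-first search described above, without the optional lookup tables: it simply tries every cell as a starting point. It assumes a rectangular `char[][]` grid, and the names `GridSearch` and `find` are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class GridSearch {
    /** Returns the {row, col} positions that spell the word, or an empty list if there are none. */
    static List<int[]> find(char[][] grid, String word) {
        List<int[]> path = new ArrayList<>();
        if (word.isEmpty() || grid.length == 0) {
            return path;
        }
        boolean[][] used = new boolean[grid.length][grid[0].length];
        for (int r = 0; r < grid.length; r++) {
            for (int c = 0; c < grid[r].length; c++) {
                if (dfs(grid, word, 0, r, c, used, path)) {
                    return path;
                }
            }
        }
        return path; // empty: every attempt was fully backtracked
    }

    private static boolean dfs(char[][] grid, String word, int index,
                               int r, int c, boolean[][] used, List<int[]> path) {
        if (r < 0 || r >= grid.length || c < 0 || c >= grid[r].length) {
            return false;
        }
        if (used[r][c] || grid[r][c] != word.charAt(index)) {
            return false;
        }
        used[r][c] = true;               // remember where we have been
        path.add(new int[] {r, c});
        if (index == word.length() - 1) {
            return true;
        }
        for (int dr = -1; dr <= 1; dr++) {
            for (int dc = -1; dc <= 1; dc++) {
                if ((dr != 0 || dc != 0)
                        && dfs(grid, word, index + 1, r + dr, c + dc, used, path)) {
                    return true;
                }
            }
        }
        used[r][c] = false;              // backtrack so other paths may use this cell
        path.remove(path.size() - 1);
        return false;
    }
}
```

The `used` array is what prevents a position from being matched twice, so "helllo" is not found in the second grid from the question. The search returns the first valid path it encounters, which may differ from another equally valid one (for example, it may pick a different e).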
One approach to investigate is to generate all the possible sequences of letters (strings) from the grid, then check if each word exists in this set of strings, rather than checking each word against the grid. E.g. starting at h in your first grid:
h
hb
he
he // duplicate, but different path
hbc
hbg
hbe
hbe // ditto
heb
hec
heg
...
This is only likely to be faster for very large lists of words because of the overhead of generating the sequences. For small lists of words it would be much faster to test them individually against the grid.
You would either need to store the entire path (including coordinates) or have a separate step to work out the path for the words that match. Which is faster will depend on the hit rate (i.e. what proportion of input words you actually find in the grid).
Depending on what you need to achieve, you could perhaps compare the sequences against a list of dictionary words to eliminate the non-words before beginning the matching.
Update 2: In the linked question there are several working, fast solutions that generate sequences from the grid, deepening recursively to generate longer sequences. However, they test these against a Trie generated from the word list, which lets them abandon a subtree of sequences early; this prunes the search and improves efficiency greatly. It has a similar effect to the transition filtering suggested by Markus.

Dijkstra algorithm alternatives - shortest path in graph, bus routes

I am using a slightly modified Dijkstra algorithm in my app, but it's quite slow and I know there has to be a much better approach. My input data are bus stops with specified travel times between each other (~400 nodes and ~800 paths; max. result depth = 4, i.e. at most 4 bus changes, otherwise no result).
Input data (bus routes) :
bus_id | location-from | location-to | travel-time | calendar_switch_for_today
XX | A | B | 12 | 1
XX | B | C | 25 | 1
YY | C | D | 5 | 1
ZZ | A | D | 15 | 0
dijkstraResolve(A,D, '2012-10-10') -> (XX,A,B,12),(XX,B,C,25),(YY,C,D,5)
=> one bus change, 3 bus stops to final destination
* A->D can't be used because its calendar switch is OFF
As you can imagine, in more complicated graphs, e.g. where a main city (node) has 170 connections to different cities, Dijkstra is slower (more than ~5 seconds), because it computes all neighbours one by one first; it is not "trying" to reach the target destination in any more direct way...
Could you recommend another algorithm that would fit well?
I was looking at:
http://xlinux.nist.gov/dads//HTML/bellmanford.html (is it faster?)
http://jboost.sourceforge.net/examples.html (I do not see a straightforward example here...)
It would be great to have (optional):
- an option to prefer a minimal number of bus changes or minimal time
- an option to look at alternative routes (if the travel time is similar)
Thank you for any tips.
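Sounds like you're looking for A*. It's a variant of Dijkstra's algorithm which uses a heuristic to speed up the search. Given a consistent heuristic (one that, in particular, never overestimates the remaining cost), A* still returns an optimal path and expands essentially no more nodes than any other optimal algorithm using the same heuristic. Just make sure to always break ties towards the endpoint.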
There are also variants of A* which can provide near-optimal paths in much shorter time. See for example here and here.
Bellman-Ford (as suggested in your question) tends to be slower than either Dijkstra's or A*. It is primarily used when there are negative edge weights, which there are not here.
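For orientation, a compact A* sketch over a directed graph of stops might look like the following. Everything here is an assumption for illustration: the `Edge` type, the `estimateToGoal` parameter (a hypothetical lower bound on the remaining travel time, e.g. straight-line distance divided by the top bus speed), and the absence of calendar or bus-change handling. The tie-breaking advice above is not implemented.

```java
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.function.ToDoubleFunction;

final class AStar {
    static final class Edge {
        final String to;
        final double minutes;
        Edge(String to, double minutes) { this.to = to; this.minutes = minutes; }
    }

    private static final class Item {
        final String node;
        final double cost;      // travel time from the start to this node
        final double priority;  // cost + estimated remaining time
        Item(String node, double cost, double priority) {
            this.node = node; this.cost = cost; this.priority = priority;
        }
    }

    /** Returns the stops from start to goal, or an empty list if the goal is unreachable. */
    static List<String> shortestPath(Map<String, List<Edge>> graph, String start, String goal,
                                     ToDoubleFunction<String> estimateToGoal) {
        Map<String, Double> best = new HashMap<>();
        Map<String, String> previous = new HashMap<>();
        PriorityQueue<Item> open =
                new PriorityQueue<>(Comparator.comparingDouble((Item i) -> i.priority));
        best.put(start, 0.0);
        open.add(new Item(start, 0.0, estimateToGoal.applyAsDouble(start)));
        while (!open.isEmpty()) {
            Item current = open.poll();
            if (current.cost > best.get(current.node)) {
                continue; // stale queue entry: a cheaper route to this node was found later
            }
            if (current.node.equals(goal)) {
                LinkedList<String> path = new LinkedList<>();
                for (String n = goal; n != null; n = previous.get(n)) {
                    path.addFirst(n);
                }
                return path;
            }
            for (Edge edge : graph.getOrDefault(current.node, Collections.<Edge>emptyList())) {
                double cost = current.cost + edge.minutes;
                if (cost < best.getOrDefault(edge.to, Double.POSITIVE_INFINITY)) {
                    best.put(edge.to, cost);
                    previous.put(edge.to, current.node);
                    open.add(new Item(edge.to, cost, cost + estimateToGoal.applyAsDouble(edge.to)));
                }
            }
        }
        return Collections.emptyList();
    }
}
```

Passing `node -> 0.0` as the estimate turns this into plain Dijkstra with an early exit at the goal, which is a useful baseline. The speed-up comes from a better estimate, and the result is only guaranteed optimal as long as the estimate never exceeds the true remaining time.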
Maybe A* algorithm? See: http://en.wikipedia.org/wiki/A-star_algorithm
Maybe contraction hierarchies? See: http://en.wikipedia.org/wiki/Contraction_hierarchies.
Contraction hierarchies are implemented by the very nice, very fast Open Source Routing Machine (OSRM):
http://project-osrm.org/
and by OpenTripPlanner:
http://opentripplanner.com/
A* is implemented by a number of routing systems. Just do a search with Google.
OpenTripPlanner is a multi-modal routing system and, as far as I can see, should be very similar to your project.
The A* algorithm would be great for this; it achieves better performance by using heuristics.
Here is a simple tutorial to get you started: Link

How to calculate Centered Moving Average of a set of data in Hadoop Map-Reduce?

I want to calculate the centered moving average of a set of data.
Example input format:
quarter | sales
Q1'11 | 9
Q2'11 | 8
Q3'11 | 9
Q4'11 | 12
Q1'12 | 9
Q2'12 | 12
Q3'12 | 9
Q4'12 | 10
Mathematical representation of the data, the calculation of the moving average (MA) and then of the centered moving average:
Period | Value | MA   | Centered
1      | 9     |      |
1.5    |       |      |
2      | 8     |      |
2.5    |       | 9.5  |
3      | 9     |      | 9.5
3.5    |       | 9.5  |
4      | 12    |      | 10.0
4.5    |       | 10.5 |
5      | 9     |      | 10.5
5.5    |       | 10.5 |
6      | 12    |      |
6.5    |       |      |
7      | 9     |      |
I am stuck on the implementation of a RecordReader that would provide the mapper with the sales values of one year, i.e. of four quarters.
This is actually totally doable in the MapReduce paradigm; it does not have to be thought of as a 'sliding window'. Instead, think of the fact that each data point is relevant to a maximum of four MA calculations, and remember that each call to the map function can emit more than one key-value pair. Here is pseudo-code:
First MR job:
    map(quarter, sales)
        emit(quarter - 1.5, sales)
        emit(quarter - 0.5, sales)
        emit(quarter + 0.5, sales)
        emit(quarter + 1.5, sales)
    reduce(quarter, list_of_sales)
        if (list_of_sales.length == 4):
            emit(quarter, average(list_of_sales))
        endif
Second MR job:
    map(quarter, MA)
        emit(quarter - 0.5, MA)
        emit(quarter + 0.5, MA)
    reduce(quarter, list_of_MA)
        if (list_of_MA.length == 2):
            emit(quarter, average(list_of_MA))
        endif
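The two jobs can be tried out without a cluster. The sketch below simulates the map, shuffle and reduce phases in plain Java (it does not use the Hadoop API), assuming the quarters are numbered 1, 2, 3, ... so that the half-step keys are exact doubles. `CenteredMovingAverage`, `job` and `centered` are names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

final class CenteredMovingAverage {
    /**
     * One simulated MR job: map emits every value under key + offset for each offset,
     * the TreeMap plays the role of the shuffle, and reduce averages only the keys
     * that received a full window of offsets.length values.
     */
    static SortedMap<Double, Double> job(SortedMap<Double, Double> input, double[] offsets) {
        SortedMap<Double, List<Double>> grouped = new TreeMap<>();
        for (Map.Entry<Double, Double> record : input.entrySet()) {          // map phase
            for (double offset : offsets) {
                grouped.computeIfAbsent(record.getKey() + offset, k -> new ArrayList<>())
                       .add(record.getValue());
            }
        }
        SortedMap<Double, Double> output = new TreeMap<>();
        for (Map.Entry<Double, List<Double>> group : grouped.entrySet()) {   // reduce phase
            if (group.getValue().size() == offsets.length) {
                double sum = 0;
                for (double value : group.getValue()) {
                    sum += value;
                }
                output.put(group.getKey(), sum / offsets.length);
            }
        }
        return output;
    }

    /** Chains the two jobs: 4-quarter moving average, then the centering step. */
    static SortedMap<Double, Double> centered(double[] sales) {
        SortedMap<Double, Double> input = new TreeMap<>();
        for (int i = 0; i < sales.length; i++) {
            input.put(i + 1.0, sales[i]);   // quarters numbered from 1 (an assumption)
        }
        SortedMap<Double, Double> movingAverage = job(input, new double[] {-1.5, -0.5, 0.5, 1.5});
        return job(movingAverage, new double[] {-0.5, 0.5});
    }
}
```

For the eight quarters in the question (9, 8, 9, 12, 9, 12, 9, 10) this produces centered values for periods 3 to 6 only: 9.5, 10.0, 10.5 and 10.25. The periods at the edges are dropped because their windows are incomplete.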
To the best of my understanding, a moving average does not map nicely to the MapReduce paradigm, since its calculation is essentially a "sliding window" over sorted data, while MR processes non-intersecting ranges of sorted data.
The solution I see is the following:
a) Implement a custom partitioner that can produce two different partitionings in two runs. In each run your reducers will get different ranges of data and calculate the moving average where appropriate.
I will try to illustrate:
In the first run the data for the reducers should be:
R1: Q1, Q2, Q3, Q4
R2: Q5, Q6, Q7, Q8
...
Here you will calculate the moving average for some Qs.
In the next run your reducers should get data like:
R1: Q1...Q6
R2: Q6...Q10
R3: Q10..Q14
and calculate the rest of the moving averages.
Then you will need to aggregate results.
The idea of the custom partitioner is that it has two modes of operation, each time dividing into equal ranges but with some shift. In pseudocode it looks like this:
partition = (key+SHIFT) / (MAX_KEY/numOfPartitions) ;
where:
SHIFT will be taken from the configuration.
MAX_KEY = maximum value of the key. I assume for simplicity that they start with zero.
A RecordReader, IMHO, is not a solution, since it is limited to a specific split and cannot slide over the split's boundary.
Another solution would be to implement custom logic for splitting the input data (it is part of the InputFormat). It can be made to produce 2 different slides, similar to the partitioning.

Combinations Array Java

Given an array of any length (for instance [1 2 3]), I need a function that gives all combinations, like:
1 |
1 2 |
1 2 3 |
1 3 |
2 |
2 1 3 |
2 3 |
...
Since I'm guessing this is homework, I'll try to refrain from giving a complete answer.
Suppose you already had all combinations (or permutations if that is what you are looking for) of an array of size n-1. If you had that, you could use those combinations/permutations as a basis for forming the new combinations/permutations by adding the nth element to them in the appropriate way. That is the basis for what computer scientists call recursion (and mathematicians like to call a very similar idea induction).
So you could write a method that handles the n case, assuming the n-1 case has been handled, and add a check to handle the base case as well.
