Calculate term frequency on large data set

I want to calculate term frequency for a large list from an even larger data set.
The list (of pairs) is in the format of
{
source_term0, target_term0;
source_term1, target_term1;
...
source_termX, target_termX }
Where X is about 3.9 million.
The searching data set (pairs) is in the format of
{
source_sentence0, target_sentence0;
source_sentence1, target_sentence1;
...
source_sentenceY, target_sentenceY }
Where Y is about 12 million.
A term pair N is counted for sentence pair M when source_termN appears in source_sentenceM AND target_termN appears in target_sentenceM.
My challenge is the computation time. I can run a nested loop, but it takes very long to complete. Is there a better algorithm for this case?

One way to do this is to build posting lists from the source sentences and target sentences. That is, for the source sentences, you have a dictionary that contains the term and a list of source sentences the term appears in. You do the same thing for the target sentences.
So given this:
source_sentence1 = "Joe married Sue."
target_sentence1 = "The bridge is blue."
source_sentence2 = "Sue has big feet."
target_sentence2 = "Blue water is best."
Then you have:
source_sentence_terms:
joe, [1]
married,[1]
sue,[1,2]
has,[2]
big,[2]
feet,[2]
target_sentence_terms:
the,[1]
bridge,[1]
is,[1,2]
blue,[1,2]
water,[2]
best,[2]
Now you can go through your search terms. Let's say that your first pair is:
source_term1=sue, target_term1=bridge
You look "sue" up in the source_sentence_terms and you get the list [1,2], meaning that the term occurs in those two source sentences.
You look "bridge" up in the target_sentence_terms and you get the list [1].
Now you do a set intersection on those two lists and you wind up with [1].
Building the posting lists from the sentences is O(n), where n is the total number of words in all of the sentences. You only have to do that once.
For each pair, you do two hash table lookups. Those are O(1). Doing a set intersection is O(m + n), where m and n are the sizes of the individual sets. It's hard to say how large those sets will be. It depends on the frequency of terms overall, and whether you're querying frequent terms.
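The posting-list lookup described above can be sketched in Java roughly as follows. This is a minimal sketch under stated assumptions, not a tuned implementation: the names PostingLists, build, countCommon and frequency are made up for illustration, sentence ids are 0-based, tokenization is a plain split on non-word characters (ASCII only), and every term is assumed to be a single word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PostingLists {

    // Maps each lower-cased word to the ascending list of sentence ids (0-based) containing it.
    static Map<String, List<Integer>> build(List<String> sentences) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int id = 0; id < sentences.size(); id++) {
            for (String word : sentences.get(id).toLowerCase().split("\\W+")) {
                if (word.isEmpty()) continue;
                List<Integer> ids = index.computeIfAbsent(word, k -> new ArrayList<>());
                // Record a sentence only once even if the word repeats in it.
                if (ids.isEmpty() || ids.get(ids.size() - 1) != id) ids.add(id);
            }
        }
        return index;
    }

    // Size of the intersection of two ascending id lists, in O(m + n).
    static int countCommon(List<Integer> a, List<Integer> b) {
        int i = 0, j = 0, count = 0;
        while (i < a.size() && j < b.size()) {
            int x = a.get(i), y = b.get(j);
            if (x == y) { count++; i++; j++; }
            else if (x < y) i++;
            else j++;
        }
        return count;
    }

    // Number of sentence pairs where sourceTerm is in the source and targetTerm in the target.
    static int frequency(Map<String, List<Integer>> sourceIndex,
                         Map<String, List<Integer>> targetIndex,
                         String sourceTerm, String targetTerm) {
        List<Integer> none = Collections.emptyList();
        return countCommon(sourceIndex.getOrDefault(sourceTerm, none),
                           targetIndex.getOrDefault(targetTerm, none));
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> src = build(Arrays.asList("Joe married Sue.", "Sue has big feet."));
        Map<String, List<Integer>> tgt = build(Arrays.asList("The bridge is blue.", "Blue water is best."));
        System.out.println(frequency(src, tgt, "sue", "bridge")); // prints 1
    }
}
```

Because the ids are appended in increasing order, the lists are already sorted, so the intersection needs no hashing or extra sorting.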

An idea comes to mind: sort the whole set of data. A good sorting algorithm is O(n log n). You said you are currently at O(n^2), so this would be an improvement. Once the data is sorted, you can iterate over it linearly.
I'm not sure if I understood your situation correctly, so this might be inappropriate.

Map<String, Map<String, Integer>> terms = new HashMap<>();
for each sourceTerm, targetTerm {
    // Java 7 or earlier
    Map<String, Integer> targetTerms = terms.get(sourceTerm);
    if (targetTerms == null)
        terms.put(sourceTerm, targetTerms = new HashMap<>());
    // Java 8 (alternative to the lines above)
    Map<String, Integer> targetTerms =
        terms.computeIfAbsent(sourceTerm, k -> new HashMap<>());
    targetTerms.put(targetTerm, 0);
}
for each sourceSentence, targetSentence {
    String[] sourceSentenceTerms = sourceSentence.split("\\s+");
    String[] targetSentenceTerms = targetSentence.split("\\s+");
    for (String sourceSentenceTerm : sourceSentenceTerms) {
        // look up once per source word instead of once per word pair
        Map<String, Integer> targetTerms = terms.get(sourceSentenceTerm);
        if (targetTerms == null)
            continue; // no term pair starts with this word
        for (String targetSentenceTerm : targetSentenceTerms) {
            // Java 7 or earlier
            Integer termFreq = targetTerms.get(targetSentenceTerm);
            if (termFreq != null)
                targetTerms.put(targetSentenceTerm, termFreq + 1);
            // Java 8 (alternative to the lines above)
            targetTerms.computeIfPresent(targetSentenceTerm,
                (k, f) -> f + 1);
        }
    }
}

Related

Optimal way to solve the below problem based on Data Structure

I was recently asked this in an interview.
Given below are the candidates and the times at which they received a vote.
Q. Given a time, print the person winning till that time.
Cand. Time
A 4
B 10
C 15
C 18
C 21
B 35
B 40
B 42
E.g. in the question above, if we are asked to find the winner at time 20, the answer would be C, since C has 2 votes.
Tried Solution
Have a Map<String, List> that maps each candidate to its [time, votes] entries.
We can iterate through the array and fetch only the times which are less than 20 (as per the question).
But I believe there must be a more efficient way to solve this type of problem.
Essentially, store the given data in a proper data structure which gives us the result in optimal time.
Thanks

What I would do is as follows:
First, implement a naive solution, such as the one you thought of, or the one suggested by Pp88 above.
Then, immediately write a test for it, which shows(1) that it works.
Then, implement a more optimal solution, such as the one which follows.
Finally, re-use the previous test to show(1) that the more optimal solution also works.
A more optimal solution could be as follows:
Build a LinkedHashMap where the key is a time coordinate, and the value is one more map, in which the keys are the names of all candidates, and the values are the accumulated vote count of each candidate at that time coordinate.
Traverse this LinkedHashMap and create a new map, where the key is again a time coordinate, and the value is the name of the winning candidate at that time coordinate. Throw away the previous map.
Build an ArrayList containing all the keys in the map.
Once you have done all of the above, any query of the type "who is the winner at time X" can be answered by performing a binary search in the ArrayList to find the time coordinate which is closest to but does not exceed X, and then a look-up of the time coordinate in the map to find the name of the candidate who was winning at that moment.
Ignoring the overhead of preparing the data structures, the time complexity of each query is that of binary search, O(log n), plus a constant-time map look-up. This is sub-linear, so it is better than O(N).
(1) At best, we can say that a test "shows that it works"; it does not prove anything. The most accurate way of putting it is that it "gives sufficient reason to believe that it works", but that's too long, so "it shows that it works" is a decent alternative.

This is an O(N) solution. I'm assuming the input consists of two arrays, one for candidates and one for times. It is not clear who wins when candidates have the same number of votes, so in that case the first one to reach that count (in input order) wins.
public Optional<Character> findCandidateWithMaxVotes(Character[] candidates, int[] times, int timeLimit) {
    Character candidateWithMaxVotes = null;
    int max = 0, count = 0;
    Map<Character, Integer> numberOfVotesForCandidate = new HashMap<>();
    for (int i = 0; i < times.length; i++) {
        if (times[i] <= timeLimit) {
            // increment this candidate's vote count and read the new total
            count = numberOfVotesForCandidate.merge(candidates[i], 1, Integer::sum);
            if (max < count) {
                max = count;
                candidateWithMaxVotes = candidates[i];
            }
        }
    }
    return Optional.ofNullable(candidateWithMaxVotes);
}

It's a bit unclear whether we should answer just one question ("who is the winner at time 20") or whether we are supposed to preprocess the data provided and then answer several queries ("who is the winner at time 20", "who is the winner at time 8", etc.).
The first problem is easy:
// Nobody is leading before elections with 0 votes
String leader = null;
int leaderVotes = 0;
HashMap<String, Integer> ballots = new HashMap<String, Integer>();
for (int i = 0; i < candidates.length; ++i) {
// too late, don't count this vote
if (times[i] > givenTime)
continue;
// number of votes
int current = ballots.containsKey(candidates[i])
? ballots.get(candidates[i]) + 1
: 1;
ballots.put(candidates[i], current);
// do we have a leader change?
//TODO: add tie breaking logic here
if (current > leaderVotes) {
leaderVotes = current;
leader = candidates[i];
}
}
// at givenTime we have leader with leaderVotes
The second problem is trickier:
We sort the votes by time
Scan them as we do in the first problem
On every leader change we add a record into (time, leader) list
Once all this is done, we have a list sorted by time that is ready for binary search: for a given time we look for the latest record that is not later than that time.
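The preprocessing steps above could look roughly like this in Java. It is only a sketch: the class name LeaderTimeline and the method winnerAt are invented for illustration, votes are passed as two parallel arrays, and ties are resolved in favour of whoever reached the vote count first, which is an assumption since the question does not define tie breaking.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LeaderTimeline {
    // Parallel lists: at changeTimes.get(i) the leader became leaders.get(i).
    private final List<Integer> changeTimes = new ArrayList<>();
    private final List<String> leaders = new ArrayList<>();

    LeaderTimeline(String[] candidates, int[] times) {
        // Sort vote indices by time (stable, so equal times keep input order).
        Integer[] order = new Integer[times.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Integer.compare(times[a], times[b]));

        Map<String, Integer> votes = new HashMap<>();
        String leader = null;
        int leaderVotes = 0;
        for (int idx : order) {
            int count = votes.merge(candidates[idx], 1, Integer::sum);
            if (count > leaderVotes) {            // strict: a tie does not change the leader
                leaderVotes = count;
                if (!candidates[idx].equals(leader)) {
                    leader = candidates[idx];
                    changeTimes.add(times[idx]);  // record only real leader changes
                    leaders.add(leader);
                }
            }
        }
    }

    // Binary search for the latest leader change that is not later than the given time.
    String winnerAt(int time) {
        int lo = 0, hi = changeTimes.size() - 1, found = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (changeTimes.get(mid) <= time) { found = mid; lo = mid + 1; }
            else hi = mid - 1;
        }
        return found < 0 ? null : leaders.get(found); // null: nobody has a vote yet
    }

    public static void main(String[] args) {
        String[] c = {"A", "B", "C", "C", "C", "B", "B", "B"};
        int[] t = {4, 10, 15, 18, 21, 35, 40, 42};
        System.out.println(new LeaderTimeline(c, t).winnerAt(20)); // prints C
    }
}
```

Building the timeline costs O(n log n) for the sort; each query afterwards is O(log n).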

Combination Java Performance

I want to use this function with a large number of possibilities, such as 700 integers, but the function takes too much time to execute. Does someone have an idea to improve the performance? Thank you :)
public static Set<Set<Integer>> combinations(List<Integer> groupSize, int k) {
Set<Set<Integer>> allCombos = new HashSet<Set<Integer>> ();
// base cases for recursion
if (k == 0) {
// There is only one combination of size 0, the empty team.
allCombos.add(new HashSet<Integer>());
return allCombos;
}
if (k > groupSize.size()) {
// There can be no teams with size larger than the group size,
// so return allCombos without putting any teams in it.
return allCombos;
}
// Create a copy of the group with one item removed.
List<Integer> groupWithoutX = new ArrayList<Integer> (groupSize);
Integer x = groupWithoutX.remove(groupWithoutX.size() - 1);
Set<Set<Integer>> combosWithoutX = combinations(groupWithoutX, k);
Set<Set<Integer>> combosWithX = combinations(groupWithoutX, k - 1);
for (Set<Integer> combo : combosWithX) {
combo.add(x);
}
allCombos.addAll(combosWithoutX);
allCombos.addAll(combosWithX);
return allCombos;
}

What features of Set are you going to need to use on the returned value?
If you only need some of them - perhaps just iterator() or contains(...) - then you could consider returning an Iterator which calculates the combinations on the fly.
There's an interesting mechanism to generate the nth combination of a lexicographically ordered set here.
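An iterator that calculates the combinations on the fly might look like the sketch below. It is one possible way to do it, not the mechanism linked above: the class name CombinationIterator is made up, combinations are produced in lexicographic order of item positions, and each call to next() builds a fresh list. Note that this only avoids holding all combinations in memory; the number of combinations to visit stays the same.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class CombinationIterator implements Iterator<List<Integer>> {
    private final List<Integer> items;
    private final int[] idx;      // current combination as ascending positions in items
    private boolean hasNext;

    CombinationIterator(List<Integer> items, int k) {
        if (k < 0) throw new IllegalArgumentException("k must not be negative");
        this.items = items;
        this.idx = new int[k];
        for (int i = 0; i < k; i++) idx[i] = i;   // first combination: positions 0..k-1
        this.hasNext = k <= items.size();          // no combination exists if k > n
    }

    @Override
    public boolean hasNext() {
        return hasNext;
    }

    @Override
    public List<Integer> next() {
        if (!hasNext) throw new NoSuchElementException();
        List<Integer> combo = new ArrayList<>(idx.length);
        for (int i : idx) combo.add(items.get(i));
        advance();
        return combo;
    }

    // Moves idx to the next combination in lexicographic order, or marks the end.
    private void advance() {
        int n = items.size(), k = idx.length;
        int i = k - 1;
        // Find the rightmost position that has not reached its maximum value yet.
        while (i >= 0 && idx[i] == n - k + i) i--;
        if (i < 0) { hasNext = false; return; }
        idx[i]++;
        for (int j = i + 1; j < k; j++) idx[j] = idx[j - 1] + 1;
    }

    public static void main(String[] args) {
        Iterator<List<Integer>> it = new CombinationIterator(Arrays.asList(1, 2, 3, 4), 2);
        while (it.hasNext()) System.out.println(it.next());
    }
}
```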

Other data structure. You could try a BitSet instead of the Set<Integer>. If the integer values have a wild range (negative, larger gaps), use an index in groupSize.
Using indices instead of integer values has other advantages: all subsets as bits can be done in a for-loop (BigInteger as set).
No data. Or make an iterator (Stream) of all combinations to repeatedly apply to your processing methods.
Concurrency.
Parallelism would only mean a factor of 4 or 8. ThreadPoolExecutor and Future, maybe.
OPTIMIZING THE ALGORITHM ITSELF
The set of sets would be better as a List. That makes adding a set far cheaper.
It also shows whether the algorithm erroneously creates identical sets.

Which data structures to use when storing multiple entities with multiple query criteria?

There is a storage unit which has a capacity of N items. Initially this unit is empty.
The space is arranged in a linear manner, i.e. one beside the other in a line.
Each storage space has a number, increasing till N.
When someone drops their package, it is assigned the first available space. The packages could also be picked up, in this case the space becomes vacant.
Example: If the total capacity was 4. and 1 and 2 are full the third person to come in will be assigned the space 3. If 1, 2 and 3 were full and the 2nd space becomes vacant, the next person to come will be assigned the space 2.
The packages they drop have 2 unique properties, assigned for immediate identification. First they are color coded based on their content and second they are assigned a unique identification number(UIN).
What we want is to query the system:
When the input is color, show all the UIN associated with this color.
When the input is color, show all the numbers where these packages are placed(storage space number).
Show where an item with a given UIN is placed, i.e. storage space number.
I would like to know which data structures to use for this case, so that the system works as efficiently as possible.
I am not told which of these operations is most frequent, which means I will have to optimise for all the cases.
Please note that even though no query asks directly by storage space number, an item is removed from the store by its storage space number.

You have mentioned three queries that you want to make. Let's handle them one by one.
I cannot think of a single Data Structure that can help you with all three queries at the same time. So I'm going to give an answer that has three Data Structures and you will have to maintain all the three DS's state to keep the application running properly. Consider that as the cost of getting a respectably fast performance from your application for the desired functionality.
When the input is color, show all the UIN associated with this color.
Use a HashMap that maps Color to a Set of UIN. Whenever an item:
is added - See if the color is present in the HashMap. If yes, add this UIN to the set else create a new entry with a new set and add the UIN then.
is removed - Find the set for this color and remove this UIN from the set. If the set is now empty, you may remove this entry altogether.
When the input is color, show all the numbers where these packages are placed.
Maintain a HashMap that maps UIN to the number where an incoming package is placed. From the HashMap that we created in the previous case, you can get the list of all UINs associated with the given Color. Then using this HashMap you can get the number for each UIN which is present in the set for that Color.
So now, when a package is to be added, you will have to add the entry to previous HashMap in the specific Color bucket and to this HashMap as well. On removing, you will have to .Remove() the entry from here.
Finally,
Show where an item with a given UIN is placed.
If you have done the previous, you already have the HashMap mapping UINs to numbers. This problem is only a sub-problem of the previous one.
The third DS, as I mentioned at the top, will be a Min-Heap of ints. The heap is initialized with the first N integers at the start. Then, as packages come in, the heap is polled. The number returned is the storage space where the package is to be put. If the storage unit is full, the heap is empty. Whenever a package is removed, its number is added back to the heap. Since it is a min-heap, the minimum number bubbles up to the top, satisfying your case that when 4 and 2 are empty, the next space to be filled will be 2.
Let's do a Big O analysis of this solution for completion.
Time for initialization: O(N), because we have to initialize a heap of N numbers. The two HashMaps are empty to begin with and therefore incur no time cost.
Time for adding a package: will include time to get a number and then make appropriate entries in the HashMaps. To get a number from heap will take O(Log N) time at max. Addition of entries in HashMaps will be O(1). Hence a worst case overall time of O(Log N).
Time for removing a package: will also be O(Log N) at worst because the time to remove from the HashMaps will be O(1) only while, the time to add the freed number back to min-heap will be upper bounded by O(Log N).
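The three structures described above can be wired together roughly as in the following sketch, which uses only the JDK. The class name StorageUnit and all method names are hypothetical. One addition that is not part of the description above: since packages are removed by storage space number, the sketch also keeps two arrays indexed by slot (uinAt and colorAt) to find the UIN and color of the package being removed.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;
import java.util.TreeSet;

final class StorageUnit {
    private final PriorityQueue<Integer> freeSlots = new PriorityQueue<>(); // min-heap of vacant slots
    private final Map<String, Set<Long>> uinsByColor = new HashMap<>();
    private final Map<Long, Integer> slotByUin = new HashMap<>();
    private final Long[] uinAt;      // assumption: reverse lookup for removal by slot
    private final String[] colorAt;  // assumption: reverse lookup for removal by slot

    StorageUnit(int capacity) {
        uinAt = new Long[capacity + 1];      // slots are numbered 1..capacity
        colorAt = new String[capacity + 1];
        for (int slot = 1; slot <= capacity; slot++) freeSlots.add(slot);
    }

    // Stores a package in the lowest vacant slot; returns the slot, or -1 when full.
    int add(long uin, String color) {
        Integer slot = freeSlots.poll();
        if (slot == null) return -1;
        uinsByColor.computeIfAbsent(color, c -> new HashSet<>()).add(uin);
        slotByUin.put(uin, slot);
        uinAt[slot] = uin;
        colorAt[slot] = color;
        return slot;
    }

    // Frees a slot; returns false if the slot is out of range or already vacant.
    boolean remove(int slot) {
        if (slot < 1 || slot >= uinAt.length || uinAt[slot] == null) return false;
        Long uin = uinAt[slot];
        String color = colorAt[slot];
        Set<Long> uins = uinsByColor.get(color);
        uins.remove(uin);
        if (uins.isEmpty()) uinsByColor.remove(color);
        slotByUin.remove(uin);
        uinAt[slot] = null;
        colorAt[slot] = null;
        freeSlots.add(slot);
        return true;
    }

    Set<Long> uinsFor(String color) {
        return uinsByColor.getOrDefault(color, Collections.emptySet());
    }

    Set<Integer> slotsFor(String color) {
        Set<Integer> slots = new TreeSet<>();
        for (long uin : uinsFor(color)) slots.add(slotByUin.get(uin));
        return slots;
    }

    int slotOf(long uin) {
        return slotByUin.getOrDefault(uin, -1);
    }

    public static void main(String[] args) {
        StorageUnit unit = new StorageUnit(4);
        unit.add(101L, "red");
        unit.add(102L, "blue");
        unit.add(103L, "red");
        unit.remove(2);
        System.out.println(unit.add(104L, "green")); // prints 2, the lowest vacant slot
    }
}
```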

This smells of homework or really bad management.
Either way, I have decided to do a version of this where you care most about query speed but don't care about memory or a little extra overhead to inserts and deletes. That's not to say that I think that I'm going to be burning memory like crazy or taking forever to insert and delete, just that I'm focusing most on queries.
Tl;DR - to solve your problem, I use a PriorityQueue, an Array, a HashMap, and an ArrayListMultimap (from guava, a common external library), each one to solve a different problem.
The following section is working code that walks through a few simple inserts, queries, and deletes. This next bit isn't actually Java, since I chopped out most of the imports, class declaration, etc. Also, it references another class called 'Packg'. That's just a simple data structure which you should be able to figure out just from the calls made to it.
Explanation is below the code
import com.google.common.collect.ArrayListMultimap;
private PriorityQueue<Integer> openSlots;
private Packg[] currentPackages;
Map<Long, Packg> currentPackageMap;
private ArrayListMultimap<String, Packg> currentColorMap;
private Object $outsideCall;
public CrazyDataStructure(int howManyPackagesPossible) {
$outsideCall = new Object();
this.currentPackages = new Packg[howManyPackagesPossible];
openSlots = new PriorityQueue<>();
IntStream.range(0, howManyPackagesPossible).forEach(i -> openSlots.add(i));//populate the open slots priority queue
currentPackageMap = new HashMap<>();
currentColorMap = ArrayListMultimap.create();
}
/*
* args[0] = integer, maximum # of packages
*/
public static void main(String[] args)
{
int howManyPackagesPossible = Integer.parseInt(args[0]);
CrazyDataStructure cds = new CrazyDataStructure(howManyPackagesPossible);
cds.addPackage(new Packg(12345, "blue"));
cds.addPackage(new Packg(12346, "yellow"));
cds.addPackage(new Packg(12347, "orange"));
cds.addPackage(new Packg(12348, "blue"));
System.out.println(cds.getSlotsForColor("blue"));//should be a list of {0,3}
System.out.println(cds.getSlotForUIN(12346));//should be 1 (0-indexed, remember)
System.out.println(cds.getSlotsForColor("orange"));//should be a list of {2}
System.out.println(cds.removePackage(2));//should be the orange one
cds.addPackage(new Packg(12349, "green"));
System.out.println(cds.getSlotForUIN(12349));//should be 2, since that's open
}
public int addPackage(Packg packg)
{
synchronized($outsideCall)
{
int result = openSlots.poll();
packg.setSlot(result);
currentPackages[result] = packg;
currentPackageMap.put(packg.getUIN(), packg);
currentColorMap.put(packg.getColor(), packg);
return result;
}
}
public Packg removePackage(int slot)
{
synchronized($outsideCall)
{
if(currentPackages[slot] == null)
return null;
else
{
Packg packg = currentPackages[slot];
currentColorMap.remove(packg.getColor(), packg);
currentPackageMap.remove(packg.getUIN());
currentPackages[slot] = null;
openSlots.add(slot);//return slot to priority queue
return packg;
}
}
}
public List<Packg> getUINsForColor(String color)
{
synchronized($outsideCall)
{
return currentColorMap.get(color);
}
}
public List<Integer> getSlotsForColor(String color)
{
synchronized($outsideCall)
{
return currentColorMap.get(color).stream().map(packg -> packg.getSlot()).collect(Collectors.toList());
}
}
public int getSlotForUIN(long uin)
{
synchronized($outsideCall)
{
if(currentPackageMap.containsKey(uin))
return currentPackageMap.get(uin).getSlot();
else
return -1;
}
}
I use 4 different data structures in my class.
PriorityQueue I use the priority queue to keep track of all the open slots. It's O(log n) both for inserts (add) and for removing the smallest slot (poll), so that shouldn't be too bad. Memory-wise, it's not particularly efficient, but it's also linear, so that won't be too bad.
Array I use a regular Array to track by slot #. This is linear for memory, and constant for insert and delete. If you needed more flexibility in the number of slots you could have, you might have to switch this out for an ArrayList or something, but then you'd have to find a better way to keep track of 'empty' slots.
HashMap Ah, the HashMap, the golden child of BigO complexity. In return for some memory overhead and an annoying capital letter 'M', it's an awesome data structure. Insertions are reasonable, and queries are constant. I use it to map between the UINs and the slot for a Packg.
ArrayListMultimap The only data structure I use that's not plain Java. This one comes from Guava (Google, basically), and it's just a nice little shortcut to writing your own Map of Lists. Also, it plays nicely with nulls, and that's a bonus to me. This one is probably the least efficient of all the data structures, but it's also the one that handles the hardest task, so... can't blame it. This one allows us to grab the list of Packg's by color, in constant time relative to the number of slots and in linear time relative to the number of Packg objects it returns.
When you have this many data structures, it makes inserts and deletes a little cumbersome, but those methods should still be pretty straight-forward. If some parts of the code don't make sense, I'll be happy to explain more (by adding comments in the code), but I think it should be mostly fine as-is.

Query 3: Use a hash map whose key is the UIN and whose value is an object (storage space number, color), plus any other information about the package. Cost is O(1) to query, insert or delete. Space is O(k), where k is the current number of UINs.
Query 1 and 2: Use a hash map plus multiple linked lists.
Hash map: the key is the color, the value is a pointer (or reference in Java) to the linked list of UINs for that color.
Each linked list contains UINs.
For query 1: ask the hash map, then return the corresponding linked list. Cost is O(k1), where k1 is the number of UINs for the queried color. Space is O(m+k1), where m is the number of unique colors.
For query 2: do query 1, then apply query 3 to each UIN. Cost is O(k1), where k1 is the number of UINs for the queried color. Space is O(m+k1), where m is the number of unique colors.
To insert: given color, number and UIN, insert an object (num, color) into the hash map of query 3; hash(color) to go to the corresponding linked list and insert the UIN.
To delete: given a UIN, ask query 3 for its color, then delete the UIN from the linked list found via query 1. Then delete the UIN from the hash map of query 3.
Bonus: managing the storage space is the same situation as memory management in an OS.

This is very simple to do with a segment tree.
Store each position's number in its leaf and query the minimum: the result is the first vacant place. When you capture a place, overwrite its leaf with a sentinel that the minimum query never picks (written as 0 in the illustration below; the implementation uses int.MaxValue).
Package information can be stored in a separate array.
Initially the leaves have the following values:
1 2 3 4
After capturing one place it looks like this:
0 2 3 4
After capturing one more:
0 0 3 4
After capturing one more:
0 0 0 4
After freeing place 2:
0 2 0 4
After capturing one more:
0 0 0 4
and so on.
With a segment tree that fetches the minimum over a range, each operation can be done in O(log N).
Here is my implementation in C#; it is easy to translate to C++ or Java.
public class SegmentTree
{
private int Mid;
private int[] t;
public SegmentTree(int capacity)
{
this.Mid = 1;
while (Mid <= capacity) Mid *= 2;
this.t = new int[Mid + Mid];
for (int i = Mid; i < this.t.Length; i++) this.t[i] = int.MaxValue;
for (int i = 1; i <= capacity; i++) this.t[Mid + i] = i;
for (int i = Mid - 1; i > 0; i--) t[i] = Math.Min(t[i + i], t[i + i + 1]);
}
public int Capture()
{
int answer = this.t[1];
if (answer == int.MaxValue)
{
throw new Exception("Empty space not found.");
}
this.Update(answer, int.MaxValue);
return answer;
}
public void Erase(int index)
{
this.Update(index, index);
}
private void Update(int i, int value)
{
t[i + Mid] = value;
for (i = (i + Mid) >> 1; i >= 1; i = (i >> 1))
t[i] = Math.Min(t[i + i], t[i + i + 1]);
}
}
Here is an example of usage:
int n = 4;
var st = new SegmentTree(n);
Console.WriteLine(st.Capture());
Console.WriteLine(st.Capture());
Console.WriteLine(st.Capture());
st.Erase(2);
Console.WriteLine(st.Capture());
Console.WriteLine(st.Capture());

For getting the storage space number I used a min heap approach, PriorityQueue. This works in O(log n) time, removal and insertion both.
I used 2 BiMaps, self-created data structures, for storing the mapping between UIN, color and storage space number. These BiMaps used internally a HashMap and an array of size N.
In the first BiMap (BiMap1), a HashMap<color, Set<StorageSpace>> stores the mapping from color to the set of storage spaces, and a String array, String[] colorSpace, stores the color at each storage space index.
In the second BiMap (BiMap2), a HashMap<UIN, storageSpace> stores the mapping between UIN and storage space, and a String array, String[] uinSpace, stores the UIN at each storage space index.
Querying is straight forward with this approach:
When the input is color, show all the UIN associated with this color.
Get the List of storage spaces from BiMap1, for these spaces use the array in BiMap2 to get the corresponding UIN's.
When the input is color, show all the numbers where these packages are placed(storage space number). Use BiMap1's HashMap to get the list.
Show where an item with a given UIN is placed, i.e. storage space number. Use BiMap2 to get the values from the HashMap.
Now when we are given a storage space to remove, both BiMaps have to be updated. In BiMap1, get the color from the array, get the corresponding Set, and remove the space number from this set. In BiMap2, get the UIN from the array, remove it, and also remove it from the HashMap.
For both BiMaps the removal and insert operations are O(1), and the min-heap works in O(log N), hence the total time complexity is O(log N).

java - Remove nearly duplicates from a List

I have a List of Tweet objects (homegrown class) and I want to remove NEARLY duplicates based on their text, using the Levenshtein distance. I have already removed the identical duplicates by hashing the tweets' texts but now I want to remove texts that are identical but have up to 2-3 different characters. Since this is a O(n^2) approach, I have to check every single tweet text with all the others available. Here's my code so far:
int distance;
for(Tweet tweet : this.tweets) {
distance = 0;
Iterator<Tweet> iter = this.tweets.iterator();
while(iter.hasNext()) {
Tweet currentTweet = iter.next();
distance = Levenshtein.distance(tweet.getText(), currentTweet.getText());
if(distance < 3 && (tweet.getID() != currentTweet.getID())) {
iter.remove();
}
}
}
The first problem is that the code throws ConcurrentModificationException at some point and never completes. The second one: can I do anything better than this double loop? The list of tweets contains nearly 400,000 tweets, so we're talking about 160 billion iterations!

This solution works for the question at hand (so far tested with possible inputs), but the normal set operations to remove duplicates won't work unless you implement the full contract for compare, returning 1, 0 and -1.
Why don't you implement your own compare operation and use a Set, which can hold only distinct values? It is going to be O(n log(n)).
Set<Tweet> set = new TreeSet<Tweet>(new Comparator<Tweet>() {
    @Override
    public int compare(Tweet first, Tweet second) {
        int distance = Levenshtein.distance(first.getText(), second.getText());
        if (distance < 3) {
            return 0; // treated as the same element, so the set keeps only one
        }
        return 1; // never returns -1, so this does not honour the full Comparator contract
    }
});
set.addAll(this.tweets);
this.tweets = new ArrayList<Tweet>(set);

As for the ConcurrentModificationException: As the others pointed out, I was removing elements from a list that I was also iterating in the outer for-each. Changing the for-each into a normal for resolved the problem.
As for the O(n^2) approach: there's no "better" algorithm, in terms of complexity, than an O(n^2) approach. What I'm trying to do is an "all-to-all" comparison to find nearly duplicate elements. Of course there are optimizations to reduce n, and parallelization to process sub-lists of the original list concurrently, but the complexity remains quadratic.
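As an illustration of both points, here is a minimal sketch that avoids the ConcurrentModificationException by building a new list instead of removing from the list being iterated, and that skips some comparisons cheaply. It works on plain strings rather than Tweet objects, the names NearDuplicates, levenshtein and dedupe are invented for this example, and it ships its own Levenshtein implementation since the library used in the question is not shown. The shortcut relies on the fact that the Levenshtein distance between two strings is at least the difference of their lengths. The worst case stays quadratic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NearDuplicates {

    // Standard dynamic-programming Levenshtein distance using two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    // Keeps a text only if no already kept text is closer than maxDistance edits.
    static List<String> dedupe(List<String> texts, int maxDistance) {
        List<String> kept = new ArrayList<>();
        for (String text : texts) {
            boolean duplicate = false;
            for (String k : kept) {
                // The distance is at least the length difference, so skip hopeless pairs.
                if (Math.abs(k.length() - text.length()) >= maxDistance) continue;
                if (levenshtein(k, text) < maxDistance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) kept.add(text);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> texts = Arrays.asList("hello world", "hello w0rld", "goodbye");
        System.out.println(dedupe(texts, 3)); // prints [hello world, goodbye]
    }
}
```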

I want to arrange the order of the output in descending order by the number of values found in an array. How could that be possible using Java?

For example, these will be my arrays. I will input some symptoms, and it should then give me the entries of x ordered by the number of those symptoms found in s.
String [] x=new String[] {
"Allergic Rhinitis",
"Diabetes",
"Diarrhea",
"Dysmenorrhea",
"Anemia"
};
String [] s=new String[] {
"Runny nose,Nasal congestion,Itchy eyes,Sneezing,Cough,Itchy nose,Sinus pressure,Facial pain,Decreased sense of smell or taste",
"Unexplained weight loss,Increase frequency of urination,Increase volume of urine,Increase thirst,Overweight",
"Abdominal cramps,Fever,Feeling the urge to defecate,Fatigue,Loss of appetite,Unintentional weight loss",
"Cramping pain extending to the lower back and thighs",
"Fatigue,Weakness,Pale skin,Fast or irregular heartbeat,Shortness of breath,Chest pain,Dizziness,Cognitive problems,Headache"
};

Sounds like you need a mapping of diseases to their symptoms. Here's a rudimentary example using collection classes in java.util package to demonstrate how you might declare such a mapping and use it.
// setup a map of diseases to their symptoms
Map<String, List<String>> symptomsByDisease = new HashMap<String, List<String>>();
symptomsByDisease.put("Allergic Rhinitis", Arrays.asList("Runny nose"));
symptomsByDisease.put("Anemia", Arrays.asList("Runny nose", "Nasal congestion"));
symptomsByDisease.put("Diabetes", Arrays.asList("Nasal congestion", "Itchy eyes"));
// accept symptoms from user input
List<String> userSymptoms = Arrays.asList("Itchy eyes", "Nasal congestion");
// map diseases to their count of symptoms matching user input
final Map<String, Integer> countsByDisease = new HashMap<String, Integer>();
for (Map.Entry<String, List<String>> diseaseSymptoms: symptomsByDisease.entrySet())
{
Set<String> matchingSymptoms = new HashSet<String>(diseaseSymptoms.getValue());
matchingSymptoms.retainAll(userSymptoms);
countsByDisease.put(diseaseSymptoms.getKey(),
Integer.valueOf(matchingSymptoms.size()));
}
// sort diseases by descending count of matching symptoms
List<String> diseaseNames = new ArrayList<String>(symptomsByDisease.keySet());
Collections.sort(diseaseNames, new Comparator<String>() {
@Override public int compare(String disease1, String disease2) {
int count1 = countsByDisease.get(disease1).intValue();
int count2 = countsByDisease.get(disease2).intValue();
int compare = count2 - count1; // descending symptom match
if (compare == 0) { // default to alphabetical disease name
compare = disease1.compareTo(disease2);
}
return compare;
}
});
// show results
System.out.println(diseaseNames);

I am not completely sure what you are asking but I think you have two arrays of Strings one containing names of illnesses and the other containing the symptoms of each of those diseases. I am guessing that the symptoms in s correspond to the illnesses in x at the same index. So Diabetes's symptoms are Unexplained weight loss,Increase frequency of urination,Increase volume of urine,Increase thirst, Overweight.
So I think your question is how do you get the number of symptoms from that String and compare it to the number of symptoms from the other illnesses. (I don't have enough rep to comment so I might as well give a full answer to what I think your question is).
For this task you need to count the number of symptoms first. To do that I just counted the number of commas in the String and added one, this assumes that the symptoms are separated by commas and don't end with a comma, but it seems like that is the format.
int sympNum[] = new int[s.length];
for(int i=0;i<s.length;i++)
{
for(int j=0;j<s[i].length();j++)
{
if(s[i].charAt(j)==',')
sympNum[i]++;
}
sympNum[i]++;
}
Now that we know the number of each you want to sort the array and then print the illnesses accordingly. Well that is a little tricky because the array of the number of lengths only relates to the array of illnesses by the indexes. I made a new array of the symptom numbers sorted and then just compared that to the unsorted array which related back to original array of diseases because they have the same index.
int[] sorted = Arrays.copyOf(sympNum, sympNum.length);
Arrays.sort(sorted); //sorts the array into ascending order
for(int i=sorted.length-1;i>=0;i--) //you want descending so count backwards
{
int spot = 0;
while(sorted[i]!=sympNum[spot])
spot++;
System.out.println(x[spot]); //when they match it prints the illnesses that corresponds to that number of symptoms
sympNum[spot] = Integer.MAX_VALUE; //I did this so that if there are multiple diseases with the same number of symptoms one doesn't get printed multiple times
}
It would really be better to make the illnesses objects, each with its own array of Strings containing symptoms and also an int with the number of symptoms.
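A rough sketch of that object-based version is shown here. The class name Illness and the helper namesBySymptomCountDescending are invented for illustration, and the symptom count is derived from the array length instead of being stored in a separate int. Illnesses with the same count keep their original relative order because List.sort is stable.

```java
import java.util.ArrayList;
import java.util.List;

final class Illness {
    final String name;
    final String[] symptoms;

    Illness(String name, String commaSeparatedSymptoms) {
        this.name = name;
        this.symptoms = commaSeparatedSymptoms.split(",");
    }

    int symptomCount() {
        return symptoms.length;
    }

    // Pairs names[i] with symptomLists[i] and returns the names, most symptoms first.
    static List<String> namesBySymptomCountDescending(String[] names, String[] symptomLists) {
        List<Illness> illnesses = new ArrayList<>();
        for (int i = 0; i < names.length; i++) {
            illnesses.add(new Illness(names[i], symptomLists[i]));
        }
        illnesses.sort((a, b) -> Integer.compare(b.symptomCount(), a.symptomCount()));
        List<String> result = new ArrayList<>();
        for (Illness illness : illnesses) result.add(illness.name);
        return result;
    }

    public static void main(String[] args) {
        String[] x = {"Diabetes", "Dysmenorrhea"};
        String[] s = {"Unexplained weight loss,Increase thirst,Overweight",
                      "Cramping pain extending to the lower back and thighs"};
        System.out.println(namesBySymptomCountDescending(x, s)); // prints [Diabetes, Dysmenorrhea]
    }
}
```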
Next time please word your question more thoughtfully so we can figure out what you are asking. I hope that instead of just copying my code and turning this in you learn about object oriented programming and try your own version.
http://docs.oracle.com/javase/tutorial/java/javaOO/
