The lines are as follows
A1;B1;C1
A2;B2;C2
How can I take a set of unique strings and split it into non-intersecting groups by the following criterion: if two lines share the same non-empty value in one or more columns, they belong to the same group? For example, the lines
1,2,3
4,5,6
1,5,7
belong to one group.
Initially I thought to use three HashSets (one per column) to quickly check whether a row's values had been seen before, and then add the row either to an already-formed group or to the list of ungrouped rows. But this algorithm has a performance bottleneck: when two groups need to be merged, you have to walk through every group in the list. On a large amount of data (> 1 million records) with a large number of merges, it runs slowly; when merges are few (on the order of thousands), it runs quickly. I am stuck at this point and do not know how to optimize this bottleneck, or whether I should use other data structures and algorithms entirely. Can someone tell me which direction to dig in? I will be grateful for any thoughts on the matter.
I'd suggest the following approach:
Create a Set<String> ungroupedLines, initially containing all the lines. You'll remove the lines as you assign them to groups.
Build three Map<String, Collection<String>>-s, as you've suggested, one per column.
Initialize an empty Collection<Collection<String>> result.
While ungroupedLines is not empty:
Create a new Collection<String> group.
Remove an element, add it to group.
Perform "depth-first search" from that element, using your three maps.
Ignore (skip) any elements that have already been removed from your ungroupedLines.
For the rest, remove them from ungroupedLines and add them to group before recursing on them.
Alternatively, you can use breadth-first search.
Add group to result.
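The steps above can be sketched as follows. I'm assuming semicolon-separated rows with a fixed column count, as in the question's sample data; the class and method names (`RowGrouper`, `groupRows`) are purely illustrative, and the DFS is written iteratively to avoid deep recursion on million-row inputs:

```java
import java.util.*;

public class RowGrouper {
    public static List<List<String>> groupRows(List<String> rows, int columns) {
        // One map per column: non-empty value -> rows containing it in that column.
        List<Map<String, List<String>>> index = new ArrayList<>();
        for (int c = 0; c < columns; c++) index.add(new HashMap<>());
        for (String row : rows) {
            String[] cells = row.split(";", -1);
            for (int c = 0; c < columns; c++) {
                if (!cells[c].isEmpty()) {
                    index.get(c).computeIfAbsent(cells[c], k -> new ArrayList<>()).add(row);
                }
            }
        }

        Set<String> ungrouped = new LinkedHashSet<>(rows);
        List<List<String>> result = new ArrayList<>();
        while (!ungrouped.isEmpty()) {
            List<String> group = new ArrayList<>();
            // Iterative depth-first search from an arbitrary remaining row.
            Deque<String> stack = new ArrayDeque<>();
            String seed = ungrouped.iterator().next();
            ungrouped.remove(seed);
            stack.push(seed);
            while (!stack.isEmpty()) {
                String row = stack.pop();
                group.add(row);
                String[] cells = row.split(";", -1);
                for (int c = 0; c < columns; c++) {
                    if (cells[c].isEmpty()) continue;
                    for (String neighbor : index.get(c).getOrDefault(cells[c], List.of())) {
                        // Skip rows already removed from ungrouped; take the rest.
                        if (ungrouped.remove(neighbor)) stack.push(neighbor);
                    }
                }
            }
            result.add(group);
        }
        return result;
    }
}
```

Each row is pushed onto the stack exactly once, so the whole pass is roughly linear in the total number of cells, with no group-merging step at all.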
I'm completely new to programming and to Java in particular, and I am trying to determine which data structure to use for a specific situation. Since I'm not familiar with data structures in general, I have no idea what structure does what and what the limitations of each are.
So I have a CSV file with a bunch of items in it, let's say characters and matching numbers. So my list looks like this:
A,1,B,2,B,3,C,4,D,5,E,6,E,7,E,8,E,9,F,10......etc.
I need to be able to read this in, and then:
1) display just the letters or just the numbers sorted alphabetically or numerically
2) search to see if an element is contained in either list.
3) search to see if an element pair (for example A-1 or B-10) is contained in the matching list.
Think of it as an excel spreadsheet with two columns. I need to be able to sort by either column while maintaining the relationship and I need to be able to do an IF column A = some variable AND the corresponding column B contains some other variable, then do such and such.
I need to also be able to insert a pair into the original list at any location. So insert A into list 1 and insert 10 into list 2 but make sure they retain the relationship A-10.
I hope this makes sense and thank you for any help! I am working on purchasing a Data Structures in Java book to work through and trying to sign up for the class at our local college, but it's only offered every spring...
You could use two sorted Maps such as TreeMap.
One would map Characters to numbers (Map<Character,Number> or something similar). The other would perform the reverse mapping (Map<Number, Character>)
Let's look at your requirements:
1) display just the letters or just the numbers sorted alphabetically or numerically
Just iterate over one of the maps. The iteration will be ordered.
2)search to see if an element is contained in either list.
Just check the corresponding map. Looking for a number? Check the Map whose keys are numbers.
3) search to see if an element pair (for example A-1 or B-10) is contained in the matching list.
Just get() the value for A from the Character map, and check whether that value is 10. If so, then A-10 exists. If there's no value, or the value is not 10, then A-10 doesn't exist.
When adding or removing elements you'd need to take care to modify both maps to keep them in sync.
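A minimal sketch of the two synchronized maps (the class and method names are mine, and I've used Integer instead of Number so the reverse map's keys have well-defined ordering; note a plain map holds only one number per letter, so duplicate letters like the question's B,2/B,3 would need a multimap instead):

```java
import java.util.*;

public class PairIndex {
    private final NavigableMap<Character, Integer> byLetter = new TreeMap<>();
    private final NavigableMap<Integer, Character> byNumber = new TreeMap<>();

    public void put(char letter, int number) {
        byLetter.put(letter, number);
        byNumber.put(number, letter);   // keep both maps in sync
    }

    // Requirement 2: membership checks against the appropriate map.
    public boolean containsLetter(char letter) { return byLetter.containsKey(letter); }
    public boolean containsNumber(int number)  { return byNumber.containsKey(number); }

    // Requirement 3: does the exact pair letter-number exist?
    public boolean containsPair(char letter, int number) {
        Integer value = byLetter.get(letter);
        return value != null && value == number;
    }

    // Requirement 1: iteration over a TreeMap's keys is already sorted.
    public Set<Character> lettersSorted() { return byLetter.keySet(); }
    public Set<Integer> numbersSorted()   { return byNumber.keySet(); }
}
```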
Is there any fast algorithm to search an ArrayList of Strings for a particular string?
For example:
I have an ArrayList:
{"white house","yellow house","black door","house in heaven","wife"}
And want to search strings contain "house".
It should return {"white house","yellow house","house in heaven"} but in a minimum time.
My problem is dealing with a large amount of data (a list of about 167,000 strings) without an index.
Thanks!
There are two answers to your question, depending on whether you are planning to run multiple queries or not:
If you need to run the query only once, you are out of luck: you must search the entire array from the beginning to the end.
If you need to run a significant number of queries, you can reduce the amount of work by building an index.
Make a data structure Map<String,List<String>>, go through the strings in your List<String>, and split them into words. For each word on the list of tokens, add the original string to the corresponding list.
This operation runs in O(N*W), where N is the number of long strings and W is the average number of words per string. With such a map in hand, you can run a query in O(1).
Note that this approach pays off only when the number of queries significantly exceeds the average number of words in each string. For example, if your strings have ten words on the average, and you need to run five to eight queries, a linear search would be faster.
I agree with Josh Engelsma. Iterating the list and checking each element is the simplest way. And 167,000 is really not big data, unless each String in the list is quite long. A linear search can finish in only a few seconds on a normal PC.
Following the usual coding conventions, the code might look like this:
for (String s : list) {
    if (s.contains("house")) {
        // do sth.
    }
}
If search will be performed many times on the same list with different keywords, you can build a reverse index to speed up searching.
In your example:
{"white house","yellow house","black door","house in heaven","wife"}
You could pre-process the list, separate each sentence into words, and build an index like:
"house" --> {0,1,3}
"white" --> {0}
"yellow" --> {1}
...
which means "house" is contained in the 0,1 and 3 -th elements of the list, and so on. The index can be implemented with HashMap:
Map<String, LinkedList<Integer>> index = new HashMap<String, LinkedList<Integer>>();
And the search operation will ideally be sped up to O(1) complexity.
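Putting the reverse-index idea together, a sketch might look like this (the class and method names are mine, and splitting on whitespace is an assumption about how "words" are delimited):

```java
import java.util.*;

public class ReverseIndex {
    // word -> indices of the list elements that contain that word.
    private final Map<String, List<Integer>> index = new HashMap<>();
    private final List<String> sentences;

    public ReverseIndex(List<String> sentences) {
        this.sentences = sentences;
        for (int i = 0; i < sentences.size(); i++) {
            for (String word : sentences.get(i).split("\\s+")) {
                index.computeIfAbsent(word, k -> new ArrayList<>()).add(i);
            }
        }
    }

    // One O(1) map lookup, plus O(k) to materialize the k matching strings.
    public List<String> search(String word) {
        List<String> result = new ArrayList<>();
        for (int i : index.getOrDefault(word, Collections.emptyList())) {
            result.add(sentences.get(i));
        }
        return result;
    }
}
```

Note this only matches whole words; a query for "hous" would return nothing, unlike the contains() loop above.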
Imagine you have a huge cache of data that is to be searched through by 4 ways :
exact match
prefix%
%suffix
%infix%
I'm using Trie for the first 3 types of searching, but I can't figure out how to approach the fourth one other than sequential processing of huge array of elements.
If your dataset is huge, consider using a search platform like Apache Solr so that you don't end up in a performance mess.
You can construct a navigable map or set (e.g. TreeMap or TreeSet) for case 2 (with keys in normal order) and case 3 (with keys reversed).
For option 4 you can construct a collection with a key for every starting letter. You can simplify this depending on your requirements. This can lead to more space being used, but gets you O(log n) lookup times.
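For cases 2 and 3, the navigable-set idea might be sketched like this (names are mine; the `key + Character.MAX_VALUE` trick bounds the range of strings starting with the key):

```java
import java.util.*;

public class AffixSearch {
    private final NavigableSet<String> forward = new TreeSet<>();   // for prefix%
    private final NavigableSet<String> reversed = new TreeSet<>();  // for %suffix

    public void add(String word) {
        forward.add(word);
        reversed.add(new StringBuilder(word).reverse().toString());
    }

    // Case 2: all words starting with the prefix; O(log n) to locate the range.
    public SortedSet<String> byPrefix(String prefix) {
        return forward.subSet(prefix, prefix + Character.MAX_VALUE);
    }

    // Case 3: all words ending with the suffix, via the reversed-key set.
    public List<String> bySuffix(String suffix) {
        String key = new StringBuilder(suffix).reverse().toString();
        List<String> result = new ArrayList<>();
        for (String r : reversed.subSet(key, key + Character.MAX_VALUE)) {
            result.add(new StringBuilder(r).reverse().toString());
        }
        return result;
    }
}
```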
For #4, I am thinking that if you pre-compute the number of occurrences of each character, then you can look in that table for entries that have at least as many occurrences of each character as the search string does.
How efficient this algorithm is will probably depend on the nature of the data and of the search string. It might be useful to give some examples of both here to get better answers.
I have a source of strings (let us say, a text file) and many strings repeat multiple times. I need to get the top X most common strings in the order of decreasing number of occurrences.
The idea that came to mind first was to create a sortable Bag (something like org.apache.commons.collections.bag.TreeBag) and supply a comparator that will sort the entries in the order I need. However, I cannot figure out what is the type of objects I need to compare. It should be some kind of an internal map that combines my object (String) and the number of occurrences, generated internally by TreeBag. Is this possible?
Or would I be better off by simply using a hashmap and sort it by value as described in, for example, Java sort HashMap by value
Why don't you put the strings in a map: a map from each string to the number of times it appears in the text.
In step 2, traverse the entries in the map and keep adding them to a min-heap of size X. If the heap is full, extract the minimum first before inserting.
Takes O(n log x) time.
Otherwise, after step 1, sort the entries by number of occurrences and take the first x items. A TreeMap would come in handy here :) (I'd add a link to the javadocs, but I'm on a tablet.)
Takes O(n log n) time.
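The heap variant might look like this (the class and method names are mine; ties between equal counts come out in arbitrary order):

```java
import java.util.*;

public class TopX {
    // Step 1: count occurrences; step 2: keep a size-x min-heap of entries,
    // evicting the entry with the smallest count. O(n log x) overall.
    public static List<String> topX(List<String> strings, int x) {
        Map<String, Integer> counts = new HashMap<>();
        for (String s : strings) counts.merge(s, 1, Integer::sum);

        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(Map.Entry.comparingByValue());
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > x) heap.poll();   // drop the current minimum
        }

        // Drain the heap; reverse so the most common string comes first.
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) result.add(heap.poll().getKey());
        Collections.reverse(result);
        return result;
    }
}
```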
With Guava's TreeMultiset, just use Multisets.copyHighestCountFirst.
I'm working on a piece of software that very frequently needs to return a single list that consists of the first (up to) N elements of a number of other lists. The return is not modified by its clients -- it's read-only.
Currently, I am doing something along the lines of (code simplified for readability):
List<String> ret = new ArrayList<String>();
for (List<String> aList : lists) {
    // add the first N elements, if they exist
    ret.addAll(aList.subList(0, Math.min(aList.size(), MAXMATCHESPERLIST)));
    if (ret.size() >= MAXMATCHESTOTAL) {
        break;
    }
}
return ret;
I'd like to avoid the creation of the new list and the use of addAll() as I don't need to be returning a new list, and I'm dealing with thousands of elements per second. This method is a major bottleneck for my application.
What I'm looking for is an implementation of List that simply consists of the subList() results (those are cheap views, not actual copies) of each of the contained lists.
I've looked through the usual suspects including java.util, Commons Collections, Commons Lang, etc., but can't for the life of me find any such implementation. I'm pretty sure it has to have been implemented at some point though and hopefully I've missed something obvious.
So I'm turning to you, Stack Overflow -- is anyone aware of such an implementation? I can write one myself, but I hate re-inventing the wheel if the wheel is out there.
Suggestions for alternative more efficient approaches are very welcome!
Optional background detail (probably not all that relevant to my question, but just in case it helps you understand what I'm trying to do): this is for a program to fill crossword-style grids with words that revolve around a theme. Each theme may have any number of candidate word lists, ordered in decreasing order of theme relevancy. For instance, the "film" theme may start with a list of movie titles, then a list of actors, then a generic list of places that may or may not be film-relevant, then a generic list of english words. The lists are each stored in a wildcarded trie structure to allow fast lookups that meet the grid constraints (e.g. "CAT" would be stored in trie'd lists against the keys "CAT", "CA?", "C??", "?AT", ... "???" etc.) Lists vary from a few words to several tens of thousands of words.
For any given query, e.g. "C??", I want to return a list that contains up to N (say 50) matching words, ordered in the same order as the source lists. So if list 1 contains 3 matches for "C??", list 2 contains 7, and list 3 contains 100, I need a return list that contains first the 3 matches from list 1, then the 7 matches from list 2, then 40 of the matches from list 3. And I want that returned "conjoined list view" operation to be more efficient than having to continuously call addAll(), in a similar manner to the implementation of subList().
Caching the returned lists is not an option due to memory constraints -- my trie is already consuming the vast majority of my (32 bit) max-sized heap.
PS this isn't homework, it's for a real project. Any help much appreciated!
Do you need random access for the resulting list? Or does your client code only iterate over the result?
If you only need to iterate over the result, create a custom list implementation that holds the list of the original lists :) as an instance field. Return a custom iterator that takes items from each list one by one and stops when there are no more items in any of the underlying lists, or when MAXMATCHESTOTAL items have already been returned.
With some thoughts you can do the same for random access.
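A minimal sketch of such a read-only view for the iteration-only case (class and parameter names are mine; AbstractCollection supplies the remaining read-only methods on top of iterator() and size(), and nothing is ever copied):

```java
import java.util.*;

public class LimitedConcatView<T> extends AbstractCollection<T> {
    private final List<List<T>> lists;
    private final int maxPerList, maxTotal;

    public LimitedConcatView(List<List<T>> lists, int maxPerList, int maxTotal) {
        this.lists = lists;
        this.maxPerList = maxPerList;
        this.maxTotal = maxTotal;
    }

    @Override public Iterator<T> iterator() {
        return new Iterator<T>() {
            private int listIdx = 0, inList = 0, total = 0;

            @Override public boolean hasNext() {
                while (listIdx < lists.size() && total < maxTotal) {
                    if (inList < Math.min(maxPerList, lists.get(listIdx).size())) return true;
                    listIdx++;          // this list's quota is exhausted; move on
                    inList = 0;
                }
                return false;
            }

            @Override public T next() {
                if (!hasNext()) throw new NoSuchElementException();
                total++;
                return lists.get(listIdx).get(inList++);
            }
        };
    }

    @Override public int size() {
        int n = 0;
        for (List<T> l : lists) n += Math.min(maxPerList, l.size());
        return Math.min(n, maxTotal);
    }
}
```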
Use list.addAll() multiple times. Simple, does not require external jars, but inefficient.
The Jakarta Commons Collections framework has such a list. It is efficient but requires an external jar and does not support generics.
Check Guava from Google. I think it has something that you are looking for.
What's wrong with returning the sublist? That is the fastest way, since the sublist is not a copy but uses a reference to the backing array, and clients are read-only - seems perfect to me.
EDIT:
I understand why you want to group up the contents of several lists to make a larger chunk, but can you change your clients to not need such a large chunk? See my other answer re BlockingQueue and the producer/consumer approach.
Have you considered using a BlockingQueue and having consumers pull items from the queue one by one as they need them, rather than getting items in chunks (lists)? It seems you are attempting to reinvent the producer/consumer pattern here.
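For illustration, a minimal producer/consumer sketch with a bounded BlockingQueue (the class name, method names, and the "EOF" sentinel are all my own conventions, not part of any library):

```java
import java.util.*;
import java.util.concurrent.*;

public class MatchPipeline {
    // The matcher pushes words into a bounded queue as it finds them; a
    // consumer pulls them one at a time instead of receiving chunk lists.
    public static Thread produce(BlockingQueue<String> queue, List<String> matches) {
        Thread producer = new Thread(() -> {
            try {
                for (String word : matches) queue.put(word); // blocks when full
                queue.put("EOF");                            // signal completion
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        return producer;
    }

    public static List<String> consumeAll(BlockingQueue<String> queue)
            throws InterruptedException {
        List<String> out = new ArrayList<>();
        String word;
        while (!(word = queue.take()).equals("EOF")) out.add(word);
        return out;
    }
}
```

The bounded capacity means the producer never builds more than a queue's worth of matches ahead of what the consumer has used.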