Given a list/array of strings:
document
document (1)
document (2)
document (3)
mypdf (1)
mypdf
myspreadsheet (1)
myspreadsheet
myspreadsheet (2)
How do I remove all the duplicates and keep only the entry with the highest copy number?
The end result should be:
document (3)
mypdf (1)
myspreadsheet (2)
You asked a broad question, so here comes an unspecific (but nonetheless "complete") answer:
Iterate over all your strings to identify the ones that contain parentheses.
In other words: identify all the strings that look like "X (n)".
Then, for each distinct X that you found, iterate over the list again so that you can find all occurrences of "X", "X (1)", and so on.
Doing so will allow you to detect the maximum n for each of your Xs.
Push that "maximum" "X (n)" into your results list.
In other words: it only takes this simple recipe to solve the problem; now it only takes your time to turn these pseudo-code instructions into real code.
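The steps above can be sketched in Java roughly as follows. This is a minimal sketch, not the answerer's code: the class name HighestCopy, the method keepHighest and the regular expression are assumptions made for illustration, and it tracks the maximum in a map during a single pass instead of iterating over the list again for each X.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HighestCopy {
    // Matches "X (n)": group 1 is the name X, group 2 is the copy number n.
    private static final Pattern COPY = Pattern.compile("^(.*?)\\s*\\((\\d+)\\)$");

    static List<String> keepHighest(List<String> names) {
        // LinkedHashMap keeps names in first-seen order; 0 means "no numbered copy seen".
        Map<String, Integer> highest = new LinkedHashMap<>();
        for (String raw : names) {
            String s = raw.trim();
            Matcher m = COPY.matcher(s);
            boolean numbered = m.matches();
            String name = numbered ? m.group(1) : s;
            int n = numbered ? Integer.parseInt(m.group(2)) : 0;
            highest.merge(name, n, Math::max); // keep the larger copy number
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : highest.entrySet()) {
            // Re-attach the suffix only when a numbered copy exists.
            result.add(e.getValue() == 0 ? e.getKey() : e.getKey() + " (" + e.getValue() + ")");
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("document", "document (1)", "document (2)",
                "document (3)", "mypdf (1)", "mypdf", "myspreadsheet (1)",
                "myspreadsheet", "myspreadsheet (2)");
        System.out.println(keepHighest(input)); // [document (3), mypdf (1), myspreadsheet (2)]
    }
}
```

Because the copy number is parsed as an integer, this does not depend on the order of the input or on the numbers having a single digit.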
For the record: if the layout of your file is really as shown above, then things become a bit easier - as it seems that your numbers are just increasing. What I mean is:
X (1)
X (2)
X (3)
is easier to treat than
X (1)
X (3)
X (2)
As in your case, it seems safe to assume that the last X (n) contains the largest n, which makes using a HashMap (as suggested by cainiaofei) a nice solution.
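An alternative solution:
Use a HashMap whose key is the base name (for example, document, document (1), document (2) and document (3) all share the key document).
For strings that contain a '(' the key can be extracted with str.substring(0, str.indexOf('(')).trim(); for a string without one, indexOf returns -1, so use the whole string as the key.
The value is the number of times that key appears. Finally, traverse the map: for each key, the highest copy is key (value - 1), or just key when the value is 1. Note that this relies on the copies being numbered consecutively from 1, as in the example.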
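I would advise you to use a dictionary:
Map<String, Integer> dict = new HashMap<>();
for (String s : listOfInput) {
    String[] parts = s.split(" ");
    String name = parts[0];
    // A name without a "(n)" suffix counts as version 0
    int version = parts.length > 1
            ? Integer.parseInt(parts[1].substring(1, parts[1].length() - 1))
            : 0;
    Integer best = dict.get(name);
    // Skip versions lower than the one already stored
    if (best != null && version < best) {
        continue;
    }
    dict.put(name, version);
}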
The data would be at the end in the dictionary:
key | value
document | 3
mypdf | 1
myspreadsheet | 2
This is a simple solution that makes use of a Map. First you loop through your list, split each String, and add it to the map with the name as the key and the number inside the parentheses as the value. For each entry you check whether the key already exists; if it does, you compare the values and only overwrite the stored one when the new value is bigger. At the end you loop through the map to build your list.
This should probably work with any kind of input. I think...
Of course this can be done better. If anybody has suggestions, please feel free to share them.
public static void main(String[] args) {
List<String> list = Arrays.asList("document", "document (1)", "document (2)", "document (3)", "mypdf (1)", "mypdf", "myspreadsheet (1)",
"myspreadsheet", "myspreadsheet (2)");
Map<String, Integer> counterMap = new HashMap<>();
List<String> newList = new ArrayList<>();
for (String item : list) {
if (item.indexOf(')') != -1) {
String namePart = item.substring(0, item.indexOf('(')).trim();
Integer numberPart = Integer.parseInt(item.substring(item.indexOf('(') + 1, item.indexOf(')')));
Integer existingValue = counterMap.get(namePart);
if (existingValue != null) {
if (numberPart > existingValue) {
counterMap.put(namePart, numberPart);
}
} else {
counterMap.put(namePart, numberPart);
}
} else {
newList.add(item);
}
}
Iterator<Entry<String, Integer>> iterator = counterMap.entrySet().iterator();
while (iterator.hasNext()) {
Entry<String, Integer> next = iterator.next();
String key = next.getKey();
Integer value = next.getValue();
if (newList.contains(key)) {
newList.remove(key);
}
newList.add(key + " (" + value + ")");
}
System.out.println(newList);
}
Here is a possible approach, but it only works if the version number doesn't exceed 9 (*):
1) Sort the list in reverse order, so that the most recent version appears first.
(*) Because the sorting is alphabetical, you should be fine unless a version number has more than one digit: with alphabetical sorting, 10 appears before 9, for instance.
Your list will turn into :
myspreadsheet (2)
myspreadsheet (1)
myspreadsheet
mypdf (1)
mypdf
document (3)
document (2)
document (1)
document
2) Iterate over the list and keep only the first occurrence of a given document (i.e. the most recent, thanks to the reverse sorting).
3) If you want to, sort the remaining list back into a more natural order.
List<String> documents = new ArrayList<String>();
documents.add("document");
documents.add("document (1)");
documents.add("document (2)");
documents.add("document (3)");
documents.add("mypdf (1)");
documents.add("mypdf");
documents.add("myspreadsheet (1)");
documents.add("myspreadsheet");
documents.add("myspreadsheet (2)");
// 1) Sort in reverse order, so that the most recent document version appears first
Collections.sort(documents, Collections.reverseOrder());
String lastDocumentName = "";
ListIterator<String> iter = documents.listIterator();
// 2)
while (iter.hasNext()) {
String document = iter.next();
// Store the first part of the String , i.e the document name (without version)
String firstPart = document.split("\\s+")[0];
// Check if this document is a version of the last checked document
// If it is the case, this version is anterior, remove it from the list
if (lastDocumentName.equals(firstPart)) {
iter.remove();
}
// Store this document's name as the last one checked
lastDocumentName = firstPart;
}
// 3) Sort back to natural order
Collections.sort(documents);
for (String doc : documents) {
System.out.println(doc);
}
Let's use the Stream API to group our documents and simply pick the newest revision by comparing Strings by their revision number. Keep in mind that the static helper methods are implemented naively because you did not give us much information about the naming strategy, but the idea should be clear.
Algorithm:
Group revisions of the same String together
Pick the String with the highest revision number from each group
Solution:
Map<String, List<String>> grouped = input.stream()
        .collect(Collectors.groupingBy(preprocessedString(), Collectors.toList()));
List<String> finalResult = grouped.entrySet().stream()
        .map(e -> e.getValue().stream()
                .max(Comparator.comparing(revisionNumber())).get()) // each group has at least one element
        .collect(Collectors.toList());
Helper parsing functions:
private static Function<String, Integer> revisionNumber() {
return s -> s.contains("(") ? Integer.valueOf(s.substring(s.indexOf('(') + 1, s.indexOf(')'))) : 0;
}
private static Function<String, String> preprocessedString() {
return s -> s.contains("(") ? s.substring(0, s.lastIndexOf("(")).trim() : s.trim();
}
Input:
List<String> input = Arrays.asList(
"document",
"document (1)",
"document (2)",
"document (3)",
"mypdf (1)",
"mypdf",
"myspreadsheet (12)",
"myspreadsheet",
"myspreadsheet (2)",
"single");
Result:
[single, myspreadsheet (12), document (3), mypdf (1)]
We do not actually need to know whether an element contains more than one whitespace or anything like that. We can always start from the end and check whether the element is a duplicate (that is, whether it ends with a ")").
Also, iterating once through the List is enough to get all the information we need. On that basis, I am providing a solution that stores the highest copy number as the VALUE in a Map whose KEYs are the base names of the elements in the given input list.
After that you can create your result List with one more iteration through the Map.
public List<String> removeDuplicates(List<String> inputArray) {
    // Maps each base name to the highest copy number seen (0 = no "(n)" suffix)
    Map<String, Integer> map = new HashMap<String, Integer>();
    List<String> result = new ArrayList<String>();
    for (String element : inputArray) {
        int numberOfOccurrences = 0;
        int open = element.lastIndexOf('(');
        if (element.endsWith(")") && open > 0) {
            // Parse everything between the last "(" and the final ")",
            // so that multi-digit copy numbers work as well
            numberOfOccurrences = Integer.parseInt(element.substring(open + 1, element.length() - 1));
            element = element.substring(0, open).trim();
        }
        Integer current = map.get(element);
        if (current == null || current < numberOfOccurrences) {
            map.put(element, numberOfOccurrences);
        }
    }
    for (Map.Entry<String, Integer> entry : map.entrySet()) {
        // Only append the suffix when a numbered copy exists
        result.add(entry.getValue() == 0
                ? entry.getKey()
                : entry.getKey() + " (" + entry.getValue() + ")");
    }
    return result;
}
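Set<T> mySet = new HashSet<T>(Arrays.asList(yourArray));
I found this in an answer from another Stack Overflow user; try it and see if it works. Note that a Set only removes exact duplicates, so on its own it will not pick the highest copy number. Good luck :)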
Related
In Java (with or without external libraries) I need to take a list of approximately 500,000 values and find the 1000 most frequently occurring ones (the modes), while keeping the complexity to a minimum.
What I've tried so far: building a hash map, but it would have to be backwards (key = count, value = string), otherwise getting the top 1000 would have terrible complexity. The backwards way doesn't work well either, because insertion becomes expensive: I have to search for where my string is in order to remove it and reinsert it one count higher.
I've tried using a binary search tree, but that had the same issue of what to sort on, the count or the string. If it's sorted on the string, getting the counts for the top 1000 is bad; if it's sorted on the count, insertion is bad.
I could sort the list first (by string) and then iterate over it, keeping a count until the string changes. But what data structure should I use to keep track of the top 1000?
Thanks
I would first create a Map<String, Long> to store the frequency of each word. Then, I'd sort this map by value in descending order and finally I'd keep the first 1000 entries.
In code:
List<String> top1000Words = listOfWords.stream()
.collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
.entrySet().stream()
.sorted(Map.Entry.comparingByValue(Comparator.reverseOrder())) // chaining .reversed() here fails type inference
.limit(1000)
.map(Map.Entry::getKey)
.collect(Collectors.toList());
You might find it cleaner to separate the above into 2 steps: first collecting to the map of frequencies and then sorting its entries by value and keeping the first 1000 entries.
I'd separate this into three phases:
Count word occurrences (e.g. by using a HashMap<String, Integer>)
Sort the results (e.g. by converting the map into a list of entries and ordering by value descending)
Output the top 1000 entries of the sorted results
The sorting will be slow if the counts are small (e.g. if you've actually got 500,000 separate words) but if you're expecting lots of duplicate words, it should be fine.
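If sorting every distinct word is a concern, one alternative (not what the answers here use) is to keep a min-heap bounded at k entries while walking the frequency map. The sketch below assumes hypothetical names TopK and topK; with n input words and m distinct words it runs in roughly O(n + m log k) instead of O(n + m log m).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopK {
    static List<String> topK(List<String> words, int k) {
        // Phase 1: count occurrences.
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        // Phase 2: min-heap ordered by count; its root is the weakest of the current top k.
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) {
                heap.poll(); // evict the entry with the smallest count
            }
        }
        // Phase 3: drain the heap (ascending by count) and reverse to get most frequent first.
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey());
        }
        Collections.reverse(result);
        return result;
    }
}
```

Ties between equal counts are broken arbitrarily in this sketch.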
I have had this question open for a few days now and have decided to rebel against Federico's elegant Java 8 answer and submit the least Java 8 answer possible.
The following code makes use of a helper class that associates a tally with a string.
public class TopOccurringValues {
static HashMap<String, StringCount> stringCounts = new HashMap<>();
// set low for demo. Change to 1000 (or whatever)
static final int TOP_NUMBER_TO_COLLECT = 10;
public static void main(String[] args) {
// load your strings in here
List<String> strings = loadStrings();
// tally up string occurrences
for (String string: strings) {
StringCount stringCount = stringCounts.get(string);
if (stringCount == null) {
stringCount = new StringCount(string);
}
stringCount.increment();
stringCounts.put(string, stringCount);
}
// sort which have most
ArrayList<StringCount> sortedCounts = new ArrayList<>(stringCounts.values());
Collections.sort(sortedCounts);
// collect the top occurring strings
ArrayList<String> topCollection = new ArrayList<>();
int upperBound = Math.min(TOP_NUMBER_TO_COLLECT, sortedCounts.size());
System.out.println("string\tcount");
for (int i = 0; i < upperBound; i++) {
StringCount stringCount = sortedCounts.get(i);
topCollection.add(stringCount.string);
System.out.println(stringCount.string + "\t" + stringCount.count);
}
}
// in this demo, strings are randomly generated numbers.
private static List<String> loadStrings() {
Random random = new Random(1);
ArrayList<String> randomStrings = new ArrayList<>();
for (int i = 0; i < 5000000; i++) {
randomStrings.add(String.valueOf(Math.round(random.nextGaussian() * 1000)));
}
return randomStrings;
}
static class StringCount implements Comparable<StringCount> {
int count = 0;
String string;
StringCount(String string) {this.string = string;}
void increment() {count++;}
@Override
public int compareTo(StringCount o) {return o.count - count;}
}
}
55 lines of code! It's like reverse code golf. The String generator creates 5 million strings instead of 500,000 because: why not?
string count
-89 2108
70 2107
77 2085
-4 2077
36 2077
65 2072
-154 2067
-172 2064
194 2063
-143 2062
The randomly generated strings are integers drawn from a Gaussian distribution with mean 0 and standard deviation 1000, so they are not limited to a fixed range, but values closer to 0 occur more often and therefore get higher counts.
The solution I chose was to first build a hash map with key-value pairs of type <String, Integer>. I got the counts by iterating over a linked list and inserting each key-value pair; before insertion I would check whether the key already existed and, if so, increase its count. That part was quite straightforward.
For the next part, where I needed to sort by value, I used Google's Guava library, which made it very easy to sort by value instead of by key using what it calls a multimap. In a sense it reverses the hash and allows multiple values to be mapped to one key, so that I can get all of my top 1000, as opposed to some of the solutions mentioned above, which didn't allow that and would give me just one value per key.
The last step was to iterate over the multimap (backwards) to get the 1000 most frequent occurrences.
Have a look at the code of the function if you're interested:
private static void FindNMostFrequentOccurences(ArrayList profileName,int n) {
HashMap<String, Integer> hmap = new HashMap<String, Integer>();
//iterate through our data
for(int i = 0; i< profileName.size(); i++){
String current_id = profileName.get(i).toString();
if(hmap.get(current_id) == null){
hmap.put(current_id, 1);
} else {
int current_count = hmap.get(current_id);
current_count += 1;
hmap.put(current_id, current_count);
}
}
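// Tree-ordered keys, so the last key is the highest count
// (ArrayListMultimap.create() does not guarantee any key order)
ListMultimap<Integer, String> multimap = MultimapBuilder.treeKeys().arrayListValues().build();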
hmap.entrySet().forEach(entry -> {
multimap.put(entry.getValue(), entry.getKey());
});
for (int i = 0; i < n; i++){
if (!multimap.isEmpty()){
int lastKey = Iterables.getLast(multimap.keys());
String lastValue = Iterables.getLast(multimap.values());
multimap.remove(lastKey, lastValue);
System.out.println(i+1+": "+lastValue+", Occurences: "+lastKey);
}
}
}
You can do that with the java stream API :
List<String> input = Arrays.asList(new String[]{"aa", "bb", "cc", "bb", "bb", "aa"});
// First we compute a map of word -> occurrences
final Map<String, Long> collect = input.stream()
.collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
// Here we sort the map and collect the first 1000 entries
final List<Map.Entry<String, Long>> entries = new ArrayList<>(collect.entrySet());
final List<Map.Entry<String, Long>> result = entries.stream()
.sorted(Comparator.comparing(Map.Entry::getValue, Comparator.reverseOrder()))
.limit(1000)
.collect(Collectors.toList());
result.forEach(System.out::println);
I know this question has been answered on "how to find" many times, however I have a few additional questions. Here is the code I have
public static void main (String [] args){
List<String> l1= new ArrayList<String>();
l1.add("Apple");
l1.add("Orange");
l1.add("Apple");
l1.add("Milk");
//List<String> l2=new ArrayList<String>();
//HashSet is a good choice as it does not allow duplicates
HashSet<String> set = new HashSet<String>();
for( String e: l1){
//if(!(l2).add(e)) -- did not work
if(!(set).add(e)){
System.out.println(e);
}
} // end of for loop
} // end of main
Question 1: The list did not work because a List allows duplicates while a HashSet does not. Is that a correct assumption?
Question 2: What does this line mean: if(!(set).add(e))
In the for loop we go through each String e in the list l1; what, then, does the line if(!(set).add(e)) check?
This code prints Apple as output, since it is the duplicate value.
Question 3: How can I have it print the non-duplicate values, just Orange and Milk but not Apple? I tried this approach but it still prints Apple.
List unique= new ArrayList(new HashSet(l1));
Thanks in advance for your time.
1) Yes that is correct. We often use sets to remove duplicates.
2) The add method of HashSet returns false when the item is already in the set. That's why it is used to check whether the item exists in the set.
3) To do this, you need to count the number of occurrences of each item in the list, store the counts in a hash map, then print the items that have a count of 1. Or, you could just do this (which is a little dirty and slower, but takes a little less space than using a hash map):
List<String> l1= new ArrayList<>();
l1.add("Apple");
l1.add("Orange");
l1.add("Apple");
l1.add("Milk");
HashSet<String> set = new HashSet<>(l1);
for (String item : set) {
if (l1.stream().filter(x -> !x.equals(item)).count() == l1.size() - 1) {
System.out.println(item);
}
}
You're right.
Well... adding to a collection doesn't necessarily need to return anything. Fortunately the people at Sun (now Oracle) decided to return a value indicating whether the item was successfully added to the collection. This is the true/false return value: true for success.
You can extend your current code with the following logic: if an element wasn't added successfully to the set, it was a duplicate, so add it to another set (Set<String> duplicates) and later remove all of those duplicates from the first Set.
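The logic described above could look roughly like this. It is only a sketch: the class name UniqueOnly and the method nonDuplicated are made up for illustration, and a LinkedHashSet is used (an assumption, not part of the answer) so that the output keeps the order of the input list.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class UniqueOnly {
    static Set<String> nonDuplicated(List<String> items) {
        Set<String> seen = new LinkedHashSet<>(); // every distinct value, in first-seen order
        Set<String> duplicates = new HashSet<>(); // values that occurred more than once
        for (String item : items) {
            // add returns false when the value is already in the set
            if (!seen.add(item)) {
                duplicates.add(item);
            }
        }
        seen.removeAll(duplicates); // what remains occurred exactly once
        return seen;
    }

    public static void main(String[] args) {
        List<String> l1 = new ArrayList<>();
        l1.add("Apple");
        l1.add("Orange");
        l1.add("Apple");
        l1.add("Milk");
        System.out.println(nonDuplicated(l1)); // [Orange, Milk]
    }
}
```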
Question 1:The list did not work because List allows Duplicate while HashSet does not- is that correct assumption?
That is correct.
Question 2: What does this line mean: if(!(set).add(e)) In the for loop we are checking if String e is in the list l1 and then what does this line validates if(!(set).add(e))
This code will print apple as output as it is the duplicate value.
set.add(e) attempts to add an element to the set, and it returns a boolean indicating whether it was added. Negating the result will cause new elements to be ignored and duplicates to be printed. Note that if an element is present 3 times it will be printed twice, and so on.
Question 3: How can i have it print non Duplicate values, just Orange and Milk but not Apple? I tried this approach but it still prints Apple. List<String> unique= new ArrayList<String>(new HashSet<String>(l1));
There are a number of ways to approach it. This one doesn't have the best performance but it's pretty straightforward:
for (int i = 0; i < l1.size(); i++) {
boolean hasDup = false;
for (int j = 0; j < l1.size(); j++) {
if (i != j && l1.get(i).equals(l1.get(j))) {
hasDup = true;
break;
}
}
if (!hasDup) {
System.out.println(l1.get(i));
}
}
With the power of Java 8...
public static void main(String[] args) {
List<String> l1 = new ArrayList<>();
l1.add("Apple");
l1.add("Orange");
l1.add("Apple");
l1.add("Milk");
// remove duplicates
List<String> li = l1.parallelStream().distinct().collect(Collectors.toList());
System.out.println(li);
// map with duplicates frequency
Map<String, Long> countsList = l1.stream().collect(Collectors.groupingBy(fe -> fe, Collectors.counting()));
System.out.println(countsList);
// filter the map where only once
List<String> l2 = countsList.entrySet().stream().filter(map -> map.getValue().longValue() == 1)
.map(map -> map.getKey()).collect(Collectors.toList());
System.out.println(l2);
}
In the problem statement, I have n families, each with some number of family members.
eg:
John jane (family 1)
tiya (family 2)
Erika (family 3)
I have to assign every member a partner in such a way that no one is paired with a member of their own family,
and the output should be:
John => tiya
jane => Erika
tiya => jane
Erika => john
I have created a Person object (name, familyID, isAllocated).
I created a list and added personName_id entries to it.
I am thinking of using a map for the association, so that john_1 will be a key and tiya_2 its value.
I am failing to associate those pairs through the map. How can I shuffle the members in the list?
Also, it would be nice if anyone could suggest a better solution.
Code:
Getting person:
public static List getperson()
{
Scanner keyboard = new Scanner(System.in);
String line = null;
int count = 0;
List <Person> people = new ArrayList<>();
while(!(line = keyboard.nextLine()).isEmpty()) {
String[] values = line.split("\\s+");
//System.out.print("entered: " + Arrays.toString(values) + "\n");
int familyid = count++;
for(String name :values)
{
Person person = new Person();
person.setFamilyId(familyid);
person.setName(name);
person.setAllocated(false);
people.add(person);
}
}
return people;
}
Mapping:
public static List mapGifts(List pesonList)
{
Map<String , String> personMap = new HashMap<String , String>();
Iterator<Person> itr = pesonList.iterator();
Iterator<Person> itr2 = pesonList.iterator();
List<String> sender = new ArrayList<>();
while(itr.hasNext())
{
Person p = itr.next();
sender.add(p.getName()+"_"+p.getFamilyId());
personMap.put(p.getName()+"_"+p.getFamilyId(), "");
// p.setAllocated(true);
}
while(itr2.hasNext())
{
/*if(p.isAllocated())
{*/
// Separate Sender name and id from sender list
//check this id match with new p1.getFamilyId()
for(String sendername :sender)
{
// System.out.println("Sender "+sendername);
personMap.put(sendername, "");
String[] names = sendername.split("_");
String part1 = names[0]; // 004
String familyId = names[1]; // 004
Person p2 = itr2.next();
System.out.println(p2.getFamilyId() +" "+familyId +" "+p2.isAllocated());
if(p2.isAllocated())
{
for ( String value: personMap.values()) {
if ( value != sendername) {
}
}
}
if( p2.getFamilyId() != Integer.parseInt(familyId))
{
// add values in map
}
}
break;
// Person newPerson = personLists.get(j);
}
for (Iterator it = personMap.entrySet().iterator(); it.hasNext();)
{
Map.Entry entry = (Map.Entry) it.next();
Object key = entry.getKey();
Object value = entry.getValue();
System.out.println("Gifts "+key+"=>"+value);
}
return pesonList;
}
Thanks
From what I've read, you only care that you match people. How they match doesn't matter. That said, I'll assume you have a list of FamilyID's, and a list of names of everyone, and that you can sort the list of people according to family IDs.
Let's call them:
List<FamilyID> families; and
LinkedList<Person> people;, respectively. (You can make FamilyID an enumerated class)
We need two hashmaps. One to generate a list (essentially an adjacency list) of family members given a familyID:
HashMap<FamilyID, List<Person>> familyMembers; ,
and one to generate a list of sender(key) and receiver(value) pairs:
HashMap<Person, Person> pairs;
A useful function may be that, when given a person and their family ID, we can find the next available person who can receive from them.
String generateReceiver(Person newSender, FamilyID familyID);
Implementation of this method should be pretty straightforward. You can iterate through the list of people and check whether the current person is not a family member. If that condition passes, you remove them from the "people" list so you don't try to iterate through them again. If you're using a linked list for this, removal is O(1) since you'll already have the reference. In the worst case (i.e. you have one large family and many small ones) the traversals of the list take n + (n - 1) + ... + 2 steps, which gives O(n^2) time. A workaround would be to ditch the LinkedList, use an array-based list, and keep a separate array of index values corresponding to each "currently available receiver of a specified family". You'd initialize these values as you added the people from each family to the people list (i.e. the start of family 1 is index 0; if family 1 has 2 people, the start of family 2 would be index 2). This would make the function O(1) time if you just incremented the current available receiver index every time you added a sender-receiver pair. (Message me if you want more details on this!)
Last but not least, the loop does this for all people.
for (List<Person> family : familyMembers.values())
for (Person person : family)
{
// get the next available receiver who is not a family member
// add the family member and its receiver to "pairs" hash
}
Note that the above loop is pseudocode. If you're wondering whether this method could generate conflicting receivers/senders, it won't. The list of people is essentially acting as a list of receivers. Whichever way you implement the people list, generateReceiver(...) eliminates the chance that the algorithm would see a faulty receiver. As for efficiency, if you do the array-based implementation then you're at O(N) time for generating all pair values, where N is the total number of people. The program itself would be O(N) space as well.
Of course, this is all based on the assumption you have enough people to match for sender-receiver pairs. You'd need to add bells and whistles to check for special cases.
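For comparison, here is a different and simpler technique than the generateReceiver approach described above: list everyone family by family and let each person give to the person a fixed number of positions further on, where that number is the size of the largest family. This is a sketch under stated assumptions: the names GiftPairs and assign are hypothetical, people are plain Strings with unique names, and every person gives once and receives once. Under those rules no valid assignment exists when one family makes up more than half of all people, so the sketch rejects that case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GiftPairs {
    // families: each inner list holds the (unique) names of one family.
    static Map<String, String> assign(List<List<String>> families) {
        List<String> people = new ArrayList<>();
        int largest = 0;
        for (List<String> family : families) {
            people.addAll(family); // members of one family stay adjacent
            largest = Math.max(largest, family.size());
        }
        int n = people.size();
        if (2 * largest > n) {
            throw new IllegalArgumentException("a family has more than half of all people");
        }
        // Shifting by the largest family size always steps past the sender's own
        // family and, because 2 * largest <= n, never wraps around back into it.
        Map<String, String> pairs = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            pairs.put(people.get(i), people.get((i + largest) % n));
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<List<String>> families = Arrays.asList(
                Arrays.asList("John", "jane"),
                Arrays.asList("tiya"),
                Arrays.asList("Erika"));
        assign(families).forEach((from, to) -> System.out.println(from + " => " + to));
    }
}
```

For the example families this pairs John with tiya, jane with Erika, tiya with John and Erika with jane. The result is deterministic; shuffling the order of the families and of the members within each family before calling assign would give varied pairings.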
Hope this helps! Good luck!
So I have a hashmap which contains key as Strings and value as Integers of the count of those strings occurring in my Set
for eg I would have a hashMap as follows
Key Value
abcd 4 (meaning there are 4 duplicate strings of abcd in my Set defined someplace)
----- 13
b-b- 7
and so on..
Now what I am trying to do is remove all the empty strings entries from my HashMap. So in the above example I would want to remove all the empty strings with value 13. So my resulting HashMap would be
Key Value
abcd 4
b-b- 7
This is my code that tries to do that. generateFeedbackMap() is a function that returns the HashMap in question, and StringIterator is a class I have defined that iterates over each character of my Strings.
for(String key : generateFeedbackMap().keySet()) {
StringIterator it = new StringIterator(key);
int counter = 0;
while(it.hasNext()){
String nextChar = it.next();
if(nextChar.equals("-")){
counter++;
}
Iterator<Map.Entry<String, Integer>> mapIterator = generateFeedbackMap().entrySet().iterator();
if(counter >= key.length()){
while(mapIterator.hasNext()){
Map.Entry<String, Integer> entry = mapIterator.next();
if(entry.getKey().equals(key)){
mapIterator.remove();
}
}
}
}
}
So I increment the counter wherever I find a "-" character. When the counter equals my key string length which means it is an empty string, I remove it using Map Iterator but this does not remove the entry from my Map. What am I doing wrong?
generateFeedbackMap() makes it sound like you’re getting a copy of the underlying map, in which case removing a key from the copy won’t affect the underlying map. If you’re actually getting the map, then you should rename your method.
Regardless, the following would accomplish the same as your original code (but will only remove from the copy).
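Map<String,Integer> feedbackMap = generateFeedbackMap();
// Remove through the key set's removeIf: calling feedbackMap.remove(key) inside a
// for-each over keySet() would throw a ConcurrentModificationException
feedbackMap.keySet().removeIf(key -> key.matches("-+"));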
If you’re stuck getting a copy of the underlying map, then you do need to create your new helpfulMap. But you can still use a regular expression and other Map functions to speed things up:
Map<String,Integer> helpfulMap = new HashMap<>();
for ( Map.Entry<String,Integer> entry : generateFeedbackMap().entrySet() ) {
if ( ! entry.getKey().matches("-+") ) {
helpfulMap.put(entry.getKey(),entry.getValue());
}
}
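To illustrate the copy-versus-underlying-map point made above, here is a small self-contained sketch. The class CopyVsLive and its two methods are hypothetical stand-ins; they are not the asker's generateFeedbackMap().

```java
import java.util.HashMap;
import java.util.Map;

class CopyVsLive {
    // Removes dash-only keys from a copy; the caller's map is left untouched.
    static Map<String, Integer> removeFromCopy(Map<String, Integer> source) {
        Map<String, Integer> copy = new HashMap<>(source);
        copy.keySet().removeIf(key -> key.matches("-+"));
        return copy;
    }

    // Removes dash-only keys from the very map that was passed in.
    static void removeInPlace(Map<String, Integer> source) {
        source.keySet().removeIf(key -> key.matches("-+"));
    }

    public static void main(String[] args) {
        Map<String, Integer> feedback = new HashMap<>();
        feedback.put("abcd", 4);
        feedback.put("-----", 13);
        feedback.put("b-b-", 7);

        removeFromCopy(feedback);
        System.out.println(feedback.size()); // 3: only the copy changed

        removeInPlace(feedback);
        System.out.println(feedback.size()); // 2: "-----" is gone
    }
}
```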
Okay guys, I think I figured out a solution. I just copied all my current entries from oldMap to a new defined HashMap which would contain at least one letter in their keys. So essentially I got rid of all the removing and iterating over strings and just use another HashMap instead as below
Map<String, Integer> HelpfulMap = new HashMap<String,Integer>();
for(String key : generateFeedbackMap().keySet()) {
StringIterator it = new StringIterator(key);
while(it.hasNext()){
String nextChar = it.next();
if(!nextChar.equals("-")){
HelpfulMap.put(key, generateFeedbackMap().get(key));
}
}
}
I don't know what I was doing previously. I went for a good shower and came up with this idea and it worked. I love programming!
Thanks everyone for your inputs!
Introduction:
One of the features used by many sentiment analysis programs is calculated by assigning to relevant unigrams, bigrams or pairs a specific score according to a lexicon. More in detail:
An example lexicon could be:
//unigrams
good 1
bad -1
great 2
//bigrams
good idea 1
bad idea -1
//pairs (--- stands for whatever):
hold---up -0.62
how---i still -0.62
Given a sample text T, for each unigram, bigram or pair in T I want to check whether a corresponding entry is present in the lexicon.
The unigram/bigram part is easy: I load the lexicon into a Map and then iterate over my text, checking whether each word is present in the dictionary. My problem is with detecting pairs.
My Problem:
One way to check if specific pairs are present in my text would be to iterate the whole lexicon of pairs and use a regex on the text. Checking for each word in the lexicon if "start_of_pair.*end_of_pair" is present in the text. This seems very wasteful, because i'd have to iterate the WHOLE lexicon for each text to analyze. Any ideas on how to do this in a smarter way?
Related questions: Most Efficient Way to Check File for List of Words and Java: Most efficient way to check if a String is in a wordlist
One could realize a frequency map of bigrams as:
Map<String, Map<String, Integer>> bigramFrequencyMap = new TreeMap<>();
Fill the map with the desired bigrams with initial frequency 0.
First lexeme, second lexeme, to frequency count.
static final int MAX_DISTANCE = 5;
Then a lexical scan would keep the last #MAX_DISTANCE lexemes.
// Second-lexeme frequency maps of the last MAX_DISTANCE registered first lexemes
List<Map<String, Integer>> lastLexemes = new ArrayList<>();
void processLexeme() {
    String lexeme = readLexeme();
    // Check whether there is a bigram:
    for (Map<String, Integer> prior : lastLexemes) {
        Integer freq = prior.get(lexeme);
        if (freq != null) {
            prior.put(lexeme, 1 + freq);
        }
    }
    Map<String, Integer> lexemeSecondFrequencies =
            bigramFrequencyMap.get(lexeme);
    if (lexemeSecondFrequencies != null) {
        // Could remove lexemeSecondFrequencies if present in lastLexemes.
        lastLexemes.add(0, lexemeSecondFrequencies); // addFirst
        if (lastLexemes.size() > MAX_DISTANCE) {
            lastLexemes.remove(lastLexemes.size() - 1); // removeLast
        }
    }
}
The optimization is to keep only the bigrams' second halves, and to handle only registered bigrams.
In the end I solved it this way:
I loaded the pair lexicon as Map<String, Map<String, Float>>, where the outer key is the first half of a pair and the inner map holds all the possible endings for that start together with the corresponding sentiment value.
Basically I keep a List of possible endings (enabledTokens) which grows each time I read a new token, and I search this list to see whether the current token is the ending of some previous pair.
With a few modifications to prevent the previous token from being used right away for an ending, this is my code:
private Map<String, Map<String, Float>> firstPartMap;
private List<LexiconPair> enabledTokensForUnigrams, enabledTokensForBigrams;
private Queue<List<LexiconPair>> pairsForBigrams; //is initialized with two empty lists
private Token oldToken;
public void parseToken(Token token) {
String unigram = token.getText();
String bigram = null;
if (oldToken != null) {
bigram = oldToken.getText() + " " + token.getText();
}
checkIfPairMatchesAndUpdateFeatures(unigram, enabledTokensForUnigrams);
checkIfPairMatchesAndUpdateFeatures(bigram, enabledTokensForBigrams);
List<LexiconPair> pairEndings = toPairs(firstPartMap.get(unigram));
if(bigram!=null)pairEndings.addAll(toPairs(firstPartMap.get(bigram)));
pairsForBigrams.add(pairEndings);
enabledTokensForUnigrams.addAll(pairEndings);
enabledTokensForBigrams.addAll(pairsForBigrams.poll());
oldToken = token;
}
private void checkIfPairMatchesAndUpdateFeatures(String text, List<LexiconPair> listToCheck) {
Iterator<LexiconPair> iter = listToCheck.iterator();
while (iter.hasNext()) {
LexiconPair next = iter.next();
if (next.getText().equals(text)) {
float val = next.getValue();
POLARITY polarity = getPolarity(val);
for (LexiconFeatureSubset lfs : lexiconsFeatures) {
lfs.handleNewValue(Math.abs(val), polarity);
}
//iter.remove();
//return; //remove only 1 occurrence
}
}
}