I'm new to Java and Trident. I imported a project that fetches tweets, but I want to understand something: how does this code get more than one tweet? From what I can tell, tuple.getValue(0) means the first tweet only.
My problem is collecting all tweets in a HashSet or HashMap so I can get the total number of distinctive words in each tweet.
public void execute(TridentTuple tuple, TridentCollector collector) {
This method is used to run computations on each tweet.
public Values getValues(Tweet tweet, String[] words){
}
This code gets the first tweet, then gets its body and converts it to an array of strings. I know what I need to solve, but I couldn't write it well.
My idea:
Make a for loop like
for (int i = 0; i < 10; i++) {
    Tweet tweet = (Tweet) tuple.getValue(i);
}
For each tweet:
For each word in the tweet:
Add the word to a set (a set ignores duplicates, so each word is kept only once).
Count the size of the set of words for that tweet.
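Outside of Storm, the per-tweet distinct-word count sketched above can be written in plain Java like this (a minimal sketch; the tweet body is a made-up example):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class DistinctWordCount {
    // Counts the distinct words in a single tweet body.
    static int countDistinctWords(String tweetBody) {
        // A set keeps each word at most once, so its size is the distinct count.
        Set<String> words = new HashSet<>(
                Arrays.asList(tweetBody.toLowerCase().split("\\s+")));
        return words.size();
    }

    public static void main(String[] args) {
        // "java" and "is" each appear twice, so only 5 words are distinct
        System.out.println(countDistinctWords("Java is great and Java is fun")); // 5
    }
}
```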
The "problem" is a mismatch between "get the count of distinct words over all tweets" and Storm as a stream processor. The query you want to answer can only be computed over a finite set of tweets. However, in stream processing you process a potentially infinite stream of input data.
If you have a finite set of tweets, you might want to use a batch processing framework such as Flink, Spark, or MapReduce. If you indeed have an infinite number of tweets, you must rephrase your question...
As you mentioned already, you actually want to "loop over all tweets". As you do stream processing, there is no such concept. You have an infinite number of input tuples, and Storm applies execute() to each of them (i.e., you can think of it as if Storm "loops over the input" automatically, even if "looping" is not the correct term for it). As your computation is "over all tweets", you need to maintain state in your bolt code so you can update that state for each tweet. The simplest form of state in Storm is a member variable in your bolt class.
public class MyBolt implements ??? {
    // this is your "state" variable
    private final Set<String> allWords = new HashSet<String>();

    public void execute(TridentTuple tuple, TridentCollector collector) {
        Tweet tweet = (Tweet) tuple.getValue(0);
        String tweetBody = tweet.getBody();
        String[] words = tweetBody.toLowerCase().split(regex);
        for (String w : words) {
            // as allWords is a set, you cannot add the same word twice;
            // the second add() call on the same word is simply ignored,
            // so allWords contains each word exactly once
            this.allWords.add(w);
        }
    }
}
Right now, this code does not emit anything, because it is unclear what you actually want to emit. As there is no end in stream processing, you cannot say "emit the final count of the words contained in allWords". What you could do is emit the current count after each update... For this, add collector.emit(new Values(this.allWords.size())); at the end of execute().
Furthermore, the presented solution only works correctly if no parallelism is applied to MyBolt -- otherwise, the sets in different instances might contain the same word. To resolve this, you would need to tokenize each tweet into its words in a stateless bolt and feed this stream of words into an adapted MyBolt that uses an internal Set as state. MyBolt must also receive its input via fieldsGrouping to ensure disjoint sets of words on each instance.
Related
Question: Is there an effective and efficient way to return a list of Strings that show up in a message given a list of words using Stream/Parallel Stream?
Let's say I have 'ArrayList banWords' which contains a list of words players cannot say. Now let's assume 'message' represents the message a player types. How would I check to see if 'message' contains any words in 'banWords' and if so, return all the words that appear in 'message' using Stream?
I'm asking this since I'm not very familiar with Stream and haven't found a suitable question that has been asked in the past. Currently, the code loops through every word in 'banWords' and checks if 'message' contains that word. If so, it gets added to a separate ArrayList.
for (String word: banWords)
if (message.contains(word))
// Adds word to a separate arraylist
However, I'm trying to see if there's a way I can use Stream or Parallel Stream to return the words. This is the closest I've found
if (banWords.parallelStream().anyMatch(message::contains)) {
    // Adds the word to another list using banWords.parallelStream().filter(message::contains).findAny().get()
}
However, that only returns the last word that appears in banWords. For example, if banWords contains 'hello' and 'hey' and the message is 'hello hey,' instead of adding "hello" and "hey" as two separate words, it just adds "hey."
Any ideas on how I can effectively get a list of words in message? At this point, I'm looking for the most effective or quickest way to do this so if you have another way that doesn't use Streams, I would be happy to hear.
Suppose you have an ArrayList<String> banWords. You can create a stream of strings and use filter to keep the ones that contain a banned word:
List<String> list = Stream.of("ArrayList banWords").filter(s -> s.contains("banWords"))
        .collect(Collectors.toList());
You can create a stream from multiple strings as well:
List<String> list = Stream.of("ArrayList banWords", "Set banWords", "map").filter(s -> s.contains("banWords"))
        .collect(Collectors.toList());
So for your case, this is what you need to do:
List<String> list = banWords.stream()
        .filter(s -> message.contains(s))
        .collect(Collectors.toList());
Arrays.stream(message.split(" ")).filter(bannedWordSet::contains).collect(Collectors.toList());
Something to note, it's important to use a set for your list of banned words instead of a list. It'll be much more efficient.
You can collect to a list after filter():
List<String> foundWords = banWords.parallelStream().filter(message::contains).collect(Collectors.toList());
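Putting the pieces together, here is a minimal self-contained sketch (the variable names banWords and message are taken from the question; the word lists are made-up examples):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class BanWordFilter {
    // Returns every banned word that occurs in the message.
    static List<String> findBanned(List<String> banWords, String message) {
        return banWords.stream()
                .filter(message::contains)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> banWords = Arrays.asList("hello", "hey");
        // both banned words occur, so both are returned
        System.out.println(findBanned(banWords, "hello hey")); // [hello, hey]
    }
}
```

Note that filter() keeps every match, which is why it returns both "hello" and "hey" here, whereas findAny() only ever gives back a single element.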
I'm looking for a flexible/generic way to build up conditions using metadata stored in a database, then validate incoming requests at runtime against those conditions and concatenate value(s) if a condition is met.
My use case looks something like this:
1) A business user selects an operation from a UI, i.e. an IF condition from a dropdown, then selects an appropriate field to evaluate, i.e. "language", then selects a value for the condition, i.e. "Java", followed by some values to concatenate, i.e. "Java 9" and "is coming soon!"
2) This metadata gets stored in a database (let's say as a List for the moment), i.e. ["language","Java","Java 9","is coming soon"]
When my application starts I want to build the appropriate concatenation conditions:
private String concatenateString(String condition, String conditionValue, String concatValue1, String concatValue2) {
    StringBuilder sb = new StringBuilder();
    if (condition.equals(conditionValue)) {
        sb.append(concatValue1);
        sb.append(concatValue2);
    }
    return sb.toString();
}
3) So at runtime, when I receive a request, I want to compare the values on my incoming request to the various conditions that were built at startup:
if language == "Java" then the output would look like => "Java 9 is coming soon"
While the above might work for two string concatenations, how can I achieve the same for a variable number of conditions and concatenation values?
So you want the user to create a program by selecting options from a GUI, which will be stored in a DB, and when the options are read back from the DB you want to parse them into a compilable program and run it?
Use StringBuilder to build a string of code from the data returned from the DB, something like this:
"if (language == \"Java\") { doSomething(); }" (you'll need to take care to escape strings inside your string if you are storing strings in the DB).
You can then use the Compiler class to compile the string into a program which you can run (all at runtime; google "dynamically compiling C# at runtime").
However, you'll probably want to question why you are thinking of going down that route... I've been there before, and dynamic compilation has a very narrow use case.
You could, for instance, create a Dictionary which maps selected languages to some output string and simply use it to get your output, like:
Dictionary<string, string> languageOutputMap = new Dictionary<string, string>();
languageOutputMap.Add("Java", "Java9 is coming soon");

private string ConcatString(string userChosenString) {
    if (languageOutputMap.ContainsKey(userChosenString)) {
        return languageOutputMap[userChosenString];
    }
    return string.Empty;
}
If you then want to manage multiple conditions, you could have one Dictionary per condition type, hold them in a collection, iterate over them when given a variable-sized set of conditions, and make sure that all the conditions evaluate through the use of ContainsKey().
Also, you can use params to specify variable length function arguments like so:
public string manyArgs(params string[] stringArgs) {
}
Also, look at PredicateBuilder:
http://www.albahari.com/nutshell/predicatebuilder.aspx
I am using the Ignite tutorial code (link below), but I want to modify it so that it operates on a different type of data and the counts are computed differently: rather than incrementing a counter by 1, I want to add the current value.
So let's assume I have the number of occurrences of a certain word in different documents, something like this:
'the' 6586
'the' 925
So I want the cache to hold
'the' 7511
So given this:
try (Stream<String> lines = Files.lines(path)) {
    lines.forEach(line -> {
        Stream<String> words = Stream.of(line.split(" "));
        List<String> tokens = words.collect(Collectors.toList());
        // this is just to emphasize that I want to pass the value
        Long value = Long.parseLong(tokens.get(1));
        stmr.addData(tokens.get(0), value);
    });
}
I would like the value to be passed to the stmr.receiver() method so I can add it to val.
I have even tried creating a class variable in StreamWords to store the value, but the value does not get updated, and in stmr.receiver() it is still 0 (as initialized).
Link to tutorial:
Word Count Ignite Example
I managed to figure it out. In stmr.receiver(), arg is actually the value I want to insert, so just cast it to the desired type and you can get the value.
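Independent of Ignite's streamer API, the aggregation itself (adding the incoming value instead of incrementing by 1) can be sketched with a plain map; Map.merge does exactly this "add the current value" update:

```java
import java.util.HashMap;
import java.util.Map;

public class WordSum {
    // Accumulates "word count" lines, adding each count to the running total.
    static Map<String, Long> sumCounts(String[] lines) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            String[] tokens = line.split(" ");
            // merge() inserts the value if the key is new,
            // otherwise it adds the value to the existing one
            counts.merge(tokens[0], Long.parseLong(tokens[1]), Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // 6586 + 925 = 7511, matching the expected cache contents
        System.out.println(sumCounts(new String[]{"the 6586", "the 925"}).get("the")); // 7511
    }
}
```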
I need to examine millions of strings for abbreviations and replace them with the full version. Due to the data, only abbreviations terminated by a comma should be replaced. Strings can contain multiple abbreviations.
I have a lookup table that contains Abbreviation->Fullversion pairs, it contains about 600 pairs.
My current setup looks something like this. On startup I create a list of ShortForm instances from a CSV file using Jackson and hold them in a singleton:
public static class ShortForm{
public String fullword;
public String abbreviation;
}
List<ShortForm> shortForms = new ArrayList<ShortForm>();
//csv code omitted
And some code that uses the list
for (ShortForm f : shortForms) {
    if (address.contains(f.abbreviation + ","))
        address = address.replace(f.abbreviation + ",", f.fullword + ",");
}
Now this works, but it's slow. Is there a way I can speed it up? The first step is to load the ShortForm objects with commas already in place, but what else could I do?
====== UPDATE
Changed the code to work the other way around: split strings into words and check a map to see if each word is an abbreviation.
StringBuilder fullFormed = new StringBuilder();

for (String s : Splitter.on(" ").split(add)) {
    if (shortFormMap.containsKey(s))
        fullFormed.append(shortFormMap.get(s));
    else
        fullFormed.append(s);
    fullFormed.append(" ");
}

return fullFormed.toString().trim();
Testing shows this to be over 13x faster than the original approach. Cheers davecom!
It would already be a bit faster if you skip the contains() part :)
What could really improve performance would be a better data structure than a simple list for storing your ShortForms. All of the ShortForms could be stored sorted alphabetically by abbreviation. You could therefore reduce the lookup time from O(N) to something more like a binary search.
I haven't used it before, but perhaps the standard library's SortedMap fits the bill instead of a custom object at all:
http://docs.oracle.com/javase/7/docs/api/java/util/SortedMap.html
Here's what I'm thinking:
Put abbreviation/full word pairs into TreeMap
Tokenize the address into words.
Check each word to see if it is a key in the TreeMap
Replace it if it is
Put the corrected tokens back together as an address
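The steps above can be sketched as follows (a minimal sketch; the abbreviation pairs and address are made-up examples):

```java
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

public class AbbreviationExpander {
    // Steps 2-5: tokenize the address, look each word up in the map,
    // replace it if present, and join the tokens back together.
    static String expand(Map<String, String> shortForms, String address) {
        StringJoiner result = new StringJoiner(" ");
        for (String word : address.split(" ")) {
            result.add(shortForms.getOrDefault(word, word));
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Step 1: put abbreviation/full word pairs into a TreeMap
        Map<String, String> shortForms = new TreeMap<>();
        shortForms.put("St", "Street");
        shortForms.put("Rd", "Road");

        System.out.println(expand(shortForms, "12 Main St Springfield")); // 12 Main Street Springfield
    }
}
```

This way each word costs one map lookup instead of one pass over all ~600 pairs per address.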
I think I'd do this with a HashMap. The key would be the abbreviation and the value would be the full term. Then just search through a string for a comma and see if the text that precedes the comma is in the dictionary. You could probably map all the replacements in a single string in one pass and then make all the replacements after that.
This makes each lookup O(1) for a total of O(n) lookups where n is the number of abbreviations found and I don't think there's likely a more efficient method.
I have a lot of words at hand. What I need to do is save them and count every different word. The original data may contain duplicate words. At first I wanted to use a Set, since that guarantees I only keep the different words. But how can I count their occurrences? Does anyone have a "clever" idea?
You can use Multiset from the Guava library.
http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/collect/Multiset.html
You can use a Map to solve this problem.
String sample = " I have a problem here. I have a lot of words at hand. What I need to do is to save them and count every different word. The original data may contains duplicate words.Firstly, I want to use Set, then I can guarantee that I only get the different wrods. But how can I count their times? Is there someone having any clever idea?";
String[] array = sample.split("[\\s\\.,\\?]");
Map<String, Integer> statistic = new HashMap<String, Integer>();
for (String elem : array) {
    String trimElem = elem.trim();
    Integer count = 0;
    if (!"".equals(trimElem)) {
        if (statistic.containsKey(trimElem)) {
            count = statistic.get(trimElem);
        }
        count++;
        statistic.put(trimElem, count);
    }
}
Maybe you can use a hash; in Java, that's HashMap (or HashSet?).
You can hash every word, and if the word has been seen before, increment the value associated with it by one. That is the idea.
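The idea above in a compact runnable form (a HashMap keyed by word, with the value holding that word's occurrence count; the input sentence is a made-up example):

```java
import java.util.HashMap;
import java.util.Map;

public class WordFrequency {
    // Counts how many times each word occurs in the text.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : text.split(" ")) {
            // start at 0 for unseen words, then add one per occurrence
            counts.put(w, counts.getOrDefault(w, 0) + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("to be or not to be");
        System.out.println(counts.get("to")); // 2
        System.out.println(counts.size());    // 4 distinct words
    }
}
```

The map's key set plays the role of the Set from the question, so you get the distinct words and their counts from a single structure.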