I need advice on designing a system for tweet analysis.
Objective: For a given hashtag, find the frequency of co-occurrence with other hashtags, and find the hourly pattern. We should be able to answer queries of this format: for a given date (say 13/Apr/2013) and a given one-hour time period (say 3:00-4:00 PM), what are the top 5 hashtags co-occurring with "#iPhone"?
My Approach: I am using the "twitter4j" library to access Twitter data. I can query and get 100 tweets per call (Twitter only allows that many). I can extract the time and other relevant data. I am planning to have a thread that queries Twitter every 5 minutes, so that I can observe hourly patterns. Here is where I am stuck: How should I store this information in the DB? Should I maintain a hashmap with the co-occurring hashtag as key and its frequency of occurrence with "#iPhone" as value? Or should I store unaggregated data directly in the DB? What is the best way to query Twitter to observe hourly patterns? Should I store the time in epoch format in the DB, or as one column for the date and another column for the hour?
Thanks a lot for your valuable inputs.
I would suggest you use Twitter's Streaming API. It lets you keep a persistent HTTP connection to Twitter so that you can search over tweets as they arrive. Twitter recommends the Streaming API for tweet-analysis-type applications.
But you will have to pre-process some of the data so that the analysis is faster. Also look into twitter4j's built-in Streaming API support.
For an example, please look at the following GitHub code.
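As a rough illustration only (a minimal sketch, not the GitHub example above; the class name, hashtag handling, and in-memory map are assumptions), collecting co-occurring hashtags from the stream with twitter4j could look like this:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import twitter4j.FilterQuery;
import twitter4j.HashtagEntity;
import twitter4j.Status;
import twitter4j.StatusAdapter;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;

public class HashtagCollector {
    // co-occurring hashtag -> frequency, aggregated in memory and flushed periodically
    private final Map<String, Integer> counts = new ConcurrentHashMap<>();

    public void start() {
        // credentials are read from twitter4j.properties
        TwitterStream stream = new TwitterStreamFactory().getInstance();
        stream.addListener(new StatusAdapter() {
            @Override
            public void onStatus(Status status) {
                for (HashtagEntity tag : status.getHashtagEntities()) {
                    String text = tag.getText().toLowerCase();
                    if (!text.equals("iphone")) {
                        counts.merge(text, 1, Integer::sum); // atomic increment
                    }
                }
            }
        });
        stream.filter(new FilterQuery().track("#iPhone")); // stream tweets mentioning #iPhone
    }
}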
As ay89 said: use the tag as key and its frequency as value, aggregate before storing to the DB, and use epoch time.
In addition, because this is a multithreaded program, you have two options for synchronization:
Option 1 is to use a ConcurrentHashMap. When the aggregator runs, it will use:
for (Key key : hashMap.keySet()) {
    Database.save(key, hashMap.get(key));
    hashMap.replace(key, 0); // reset the count once it has been persisted
}
In other words, set a tag's freq to 0 after writing it to the database. And the method adding tweet data will use
public void increment(Key key) {
    boolean done = false;
    while (!done) {
        int current = hashMap.get(key);                 // assumes the key is already present
        int newValue = current + 1;
        done = hashMap.replace(key, current, newValue); // succeeds only if no other thread changed it meanwhile
    }
}
This is a thread-safe way to increment the frequency.
Option 2 probably makes more sense. Your aggregator will replace the hashmap with a new instance.
class DataStore {
    // ideally declared volatile so the swap in aggregate() is visible to the querying thread
    private Map<Key, Value> map = new HashMap<>();

    public void add(Key key, Value value) {
        // called by the method querying tweet data
    }

    public void aggregate() {
        // called by the aggregator thread every five minutes
        Map<Key, Value> oldMap = map;
        map = new HashMap<>();
        Database.save(oldMap);
    }
}
Bottom line is that you don't want to modify the hashmap in an uncontrolled fashion while the aggregator is saving it to the database. The second option is simpler: it creates a new hashmap for the querying thread to modify while the aggregator saves the old hashmap to the database.
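If it helps, here is a minimal sketch of driving that aggregator on a schedule (ScheduledExecutorService is just one option; any timer would do), reusing the DataStore class above:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
DataStore store = new DataStore();
// flush the in-memory counts to the database every five minutes
scheduler.scheduleAtFixedRate(store::aggregate, 5, 5, TimeUnit.MINUTES);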
Since you only have to retrieve the frequency, it's better to store it in a hash map (key: tag, value: freq), because storing non-aggregated data in the DB would take more space (mostly for information that is not required) and you would ultimately have to aggregate it later anyway.
Epoch time is a good way to store the time, since you can localize it to a time zone later on if required.
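For example (a minimal sketch with java.time; the time zone is just an assumption), an epoch timestamp stored in a single column can be bucketed by date and hour at query time:

import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;

long epochMillis = System.currentTimeMillis();      // what you store in the single DB column
ZonedDateTime local = Instant.ofEpochMilli(epochMillis)
        .atZone(ZoneId.of("Asia/Kolkata"));         // localize only when querying
int hourBucket = local.getHour();                   // 0-23; combine with local.toLocalDate() to bucket counts by hour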
Related
For one of my school assignments, I have to parse GenBank files using Java. I have to store and retrieve the content of the files together with the extracted information maintaining the smallest time complexity possible. Is there a difference between using HashMaps and storing the data as records? I know that using HashMaps would be O(1), but the readability and immutability of using records lead me to prefer using them instead. The objects will be stored in an array.
This is my approach now:
public static GenBankRecord parseGenBankFile(File gbFile) throws IOException {
    try (var fileReader = new FileReader(gbFile); var reader = new BufferedReader(fileReader)) {
        String organism = null;
        List<String> contentList = new ArrayList<>();
        while (true) {
            String line = reader.readLine();
            if (line == null) break; // Breaking out if file end has been reached
            contentList.add(line);
            if (line.startsWith(" ORGANISM ")) {
                // Organism type found
                organism = line.substring(12); // Selecting the correct part of the line
            }
        }
        // Loop ended
        var content = String.join("\n", contentList);
        return new GenBankRecord(gbFile.getName(), organism, content);
    }
}
with GenBankRecord being the following:
record GenBankRecord(String fileName, String organism, String content) {
    @Override
    public String toString() {
        return organism;
    }
}
Is there a difference between using a record and a HashMap, assuming the key-value pairs are the same as the fields of the record?
String current_organism = gbRecordInstance.organism();
and
String current_organism = gbHashMap.get("organism");
I have to store and retrieve the content of the files together with the extracted information maintaining the smallest time complexity possible.
Firstly, I am somewhat doubtful that your teachers actually stated the requirements like that. It doesn't make a lot of sense to optimize just for time complexity.
Complexity is not efficiency.
Big O complexity is not about the value of the measure (e.g. time taken) itself. It is actually about how the measure (e.g. time taken) changes as some variable gets very large.
For example, HashMap.get(nameStr) and someRecord.name are both O(1) complexity.
But they are not equivalent in terms of efficiency. Using Java 17 record types or regular Java classes with named fields will be orders of magnitude faster than using a HashMap. (And it will use orders of magnitude less memory.)
Assuming that your objects have a fixed number of named fields, the complexity (i.e. how the performance changes with an ever-increasing number of fields) is not even relevant.
Performance is not everything.
The main differences between a HashMap and a record class are actually in the functionality that they provide (a short sketch follows the two lists below):
A Map<String, SomeType> provides a set of name / value pairs where:
the number of pairs in the set is not fixed
the names are not fixed
the types of the values are all instances of SomeType or a subtype.
A record (or a classic class) can be viewed as a set of field name / value pairs where:
the number of pairs is fixed at compile time
the field names are fixed at compile time
the field types don't have to be subtypes of any single given type.
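For illustration (a minimal sketch; OrganismEntry and the field names are hypothetical), the same data can be modelled both ways:

import java.util.HashMap;
import java.util.Map;

// Fixed fields with fixed types, checked at compile time
record OrganismEntry(String fileName, String organism, String content) { }

public class MapVsRecord {
    public static void main(String[] args) {
        OrganismEntry entry = new OrganismEntry("cat.gb", "Felis catus", "...");
        String organism1 = entry.organism();    // misspelling the accessor would be a compile-time error

        // Flexible key set, but every value has the same declared type
        Map<String, String> map = new HashMap<>();
        map.put("fileName", "cat.gb");
        map.put("organism", "Felis catus");
        map.put("content", "...");
        String organism2 = map.get("organism"); // misspelling the key would just return null at runtime
    }
}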
As Louis Wasserman commented:
Records and HashMap are apples and oranges -- it doesn't really make sense to compare them.
So really, you should be choosing between records and hashmaps by comparing the functionality / constraints that they provide versus what your application actually needs.
(The problem description in your question is not clear enough for us to make that judgement.)
Efficiency concerns may be relevant, but it is a secondary concern. (If the code doesn't meet functional requirements, efficiency is moot.)
Is Complexity relevant to your assignment?
Well ... maybe yes. But not in the area that you are looking at.
My reading of the requirements is that one of them is that you be able to retrieve information from your in-memory data structures efficiently.
But so far you have been thinking about storing individual records. Retrieval implies that you have a collection of records and you have to (efficiently) retrieve a specific record, or maybe a set of records matching some criteria. So that implies you need to consider the data structure to represent the collection.
Suppose you have a collection of N records (or whatever) representing (say) N organisms (a sketch follows the list):
If the collection is a List<SomeRecord>, you need to iterate the list to find the record for (say) "cat". That is O(N).
If the collection is a HashMap<String, SomeRecord> keyed by the organism name, you can find the "cat" record in O(1).
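A minimal sketch of that difference (GenBankRecord as defined in the question; the method and parameter names are hypothetical):

import java.util.List;
import java.util.Map;

// O(N): walk the whole list until the organism matches
static GenBankRecord findByOrganismList(List<GenBankRecord> records, String organism) {
    for (GenBankRecord r : records) {
        if (organism.equals(r.organism())) {
            return r;
        }
    }
    return null;
}

// O(1) on average: one hash lookup, assuming the map was keyed by organism when it was loaded
static GenBankRecord findByOrganismMap(Map<String, GenBankRecord> byOrganism, String organism) {
    return byOrganism.get(organism);
}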
According to the documentation, the code below can set a timestamp as the key of the node using push() in the Realtime Database.
public void uploadToDB(String s) {
    databaseReference.push().setValue(s);
}
The keys returned by my push() calls look like this, for example:
a) -MpfCu14jtIkEk28D3CB
b) -MpfCxv_Nzv3YJ87MfZH
My questions are:
Are they timestamps?
If yes, can I decode them back to a readable timestamp?
Are they timestamps?
No, those push IDs are not timestamps. However, they do contain a time component.
As Michael Lehenbauer mentioned in this blog article:
Push IDs are string identifiers that are generated client-side. They are a combination of a timestamp and some random bits. The timestamp ensures they are ordered chronologically, and the random bits ensure that each ID is unique, even if thousands of people are creating push IDs at the same time.
And to answer the second question:
If yes, can I decode them back to a readable timestamp?
If you reverse-engineer them, probably yes. Please check the following answer:
How are Firebase IDs generated?
But I would not count on that. To order your data by a time component, you should add a property of type "timestamp", as explained in my answer to the following post:
How to save the current date/time when I add new value to Firebase Realtime Database
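A minimal sketch of that approach (the "value" and "timestamp" field names and the parameterized reference are assumptions; push(), setValue() and ServerValue.TIMESTAMP are the Realtime Database API):

import com.google.firebase.database.DatabaseReference;
import com.google.firebase.database.ServerValue;
import java.util.HashMap;
import java.util.Map;

public void uploadToDB(DatabaseReference databaseReference, String s) {
    Map<String, Object> node = new HashMap<>();
    node.put("value", s);
    node.put("timestamp", ServerValue.TIMESTAMP); // resolved to epoch millis on the server
    databaseReference.push().setValue(node);      // push() still generates the unique key
}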
How can I select all items within a given date range?
SELECT * FROM GameScores where createdAt >= start_date && createdAt <= end_date
I want to make a query like this. Do I need to create a global secondary index or not?
I've tried this
public void getItemsByDate(Date start, Date end) {
    SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
    String stringStart = df.format(start);
    String stringEnd = df.format(end);

    ScanSpec scanSpec = new ScanSpec();
    scanSpec.withFilterExpression("CreatedAt BETWEEN :from AND :to")
            .withValueMap(
                    new ValueMap()
                            .withString(":from", stringStart)
                            .withString(":to", stringEnd));

    ItemCollection<ScanOutcome> items = gamesScoresTable.scan(scanSpec);
}
But it doesn't work; I'm getting fewer results than expected.
I can answer your questions, but to suggest any real solution, I would need to see the general shape of your data, as well as what your GameScore's primary key is.
TLDR;
Set up your table so that you can retrieve data with queries rather than scans and filters, and then create indexes to support less-used access patterns and improve querying flexibility. Reads are fast when you provide the full (or, although not as fast, a partial) primary key, i.e. when you query, so DynamoDB performs best when the table structure is driven by the application's access patterns.
When designing your tables, keep in mind NoSQL design best practices, as well as best practices for querying and scanning and it will pay dividends in the long run.
Explanations
Question 1
How can I select all items within a given date range?
To answer this, I'd like to break that question down a little more. Let's start with: How can I select all items?
This, you have already accomplished. A scan is a great way to retrieve all items in your table, and unless you have all your items within one partition, it is the only way to retrieve all the items in your table. Scans can be helpful when you have to access data by unknown keys.
Scans, however, have limitations, and as your table grows in size they'll cost you in both performance and dollars. A single scan can only retrieve a maximum of 1MB of data from a single partition, and is capped at that partition's read capacity. When a scan tops out at either limitation, consecutive scans happen sequentially, meaning a scan of a large table can take multiple round trips.
On top of that, with scans you consume read capacity based on the size of the items, no matter how much (or little) data is returned. If you only request a small number of attributes in your ProjectionExpression, and your FilterExpression eliminates 90% of the items in your table, you still pay to read the entire table.
You can optimize performance of scans using Parallel Scans, but if you require an entire table scan for an access pattern that happens frequently for your application, you should consider restructuring your table. More about scans.
Let's now look at: How can I select all items, based on some criteria?
The ideal way to accomplish retrieving data based on some criteria (in your case SELECT * FROM GameScores where createdAt >= start_date && createdAt <= end_date) would be to query the base table (or index). To do so, per the documentation:
You must provide the name of the partition key attribute and a single value for that attribute. Query returns all items with that partition key value.
Like the documentation says, querying a partition will return all of its values. If your GameScores table has a partition key of GameName, then a query for GameName = PacMan will return all Items with that partition key. Other GameName partitions, however, will not be captured in this query.
If you need more depth in your query:
Optionally, you can provide a sort key attribute and use a comparison operator to refine the search results.
Here's a list of all the possible comparison operators you can use with your sort key. This is where you can leverage a between comparison operator in the KeyConditionExpression of your query operation. Something like: GameName = PacMan AND createdAt BETWEEN time1 AND time2 will work, if createdAt is the sort key of the table or index that you are querying.
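For instance (a minimal sketch using the same Document API as your scan code; the GameName partition key and CreatedAt sort key are assumptions about your schema):

import com.amazonaws.services.dynamodbv2.document.ItemCollection;
import com.amazonaws.services.dynamodbv2.document.QueryOutcome;
import com.amazonaws.services.dynamodbv2.document.Table;
import com.amazonaws.services.dynamodbv2.document.spec.QuerySpec;
import com.amazonaws.services.dynamodbv2.document.utils.ValueMap;

public ItemCollection<QueryOutcome> getScoresByDate(Table gamesScoresTable, String game, String from, String to) {
    QuerySpec querySpec = new QuerySpec()
            .withKeyConditionExpression("GameName = :game AND CreatedAt BETWEEN :from AND :to")
            .withValueMap(new ValueMap()
                    .withString(":game", game)
                    .withString(":from", from)   // ISO-8601 strings, same format as in your scan code
                    .withString(":to", to));
    // a query reads only the matching items, instead of scanning and filtering the whole table
    return gamesScoresTable.query(querySpec);
}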
If it is not the sort key, you might have the answer to your second question.
Question 2
Do I need to create a Global Secondary Index?
Let's start with: Do I need to create an index?
If your base table's data structure does not fit some of your application's access patterns, you might need to. However, in DynamoDB, denormalizing data can also support more access patterns. I would recommend watching this video on how to structure your data.
Moving onto: Do I need to create a GSI?
GSIs do not support strong read consistency, so if you need that, you'll need to go with a Local Secondary Index (LSI). However, if you've already created your base table, you won't be able to create an LSI. Another difference between the two is the primary key: a GSI can have a different partition and sort key from the base table, while an LSI can only differ in the sort key. More about indexes.
In my dataflow pipeline, I'll have two PCollections<TableRow> that have been read from BigQuery tables. I plan to merge those two PCollections into one PCollection with a Flatten.
Since BigQuery is append-only, the goal is to write-truncate the second table in BigQuery with the new PCollection.
I've read through the documentation, and it's the middle steps I'm confused about. With my new PCollection, the plan is to use a Comparator DoFn to look at the max last update date and return that row. I'm unsure whether I should be using a filter transform, or whether I should be doing a GroupByKey and then using a filter.
All PCollection<TableRow>s will contain the same values, i.e. string, integer and timestamp. When it comes to key-value pairs, most of the documentation on Cloud Dataflow includes just simple strings. Is it possible to have a key-value pair that is the entire row of the PCollection<TableRow>?
The rows would look similar to:
customerID, customerName, lastUpdateDate
0001, customerOne, 2016-06-01 00:00:00
0001, customerOne, 2016-06-11 00:00:00
In the example above, I would want to filter the PCollection to just return the second row to a PCollection that would be written to BigQuery. Also, is it possible to apply these ParDos on the third PCollection without creating a fourth?
You've asked a few questions. I have tried to answer them in isolation, but I may have misunderstood the whole scenario. If you provided some example code, it might help to clarify.
With my new PCollection, the plan is to use a Comparator DoFn to look at the max last update date and return that row. I'm unsure whether I should be using a filter transform, or whether I should be doing a GroupByKey and then using a filter.
Based on your description, it seems that you want to take a PCollection of elements and for each customerID (the key) find the most recent update to that customer's record. You can use the provided transforms to accomplish this via Top.largestPerKey(1, timestampComparator) where you set up your timestampComparator to look only at the timestamp.
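A minimal sketch of that idea against the Apache Beam Java SDK (here using Top.perKey, the variant that takes an explicit comparator; the key-extraction DoFn, the comparator, and the field names are assumptions based on your example rows):

import java.io.Serializable;
import java.util.Comparator;
import java.util.List;
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.DoFn.ProcessElement;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Top;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Orders rows by lastUpdateDate; lexicographic order works for zero-padded "yyyy-MM-dd HH:mm:ss" strings
class LastUpdateComparator implements Comparator<TableRow>, Serializable {
    @Override
    public int compare(TableRow a, TableRow b) {
        return ((String) a.get("lastUpdateDate")).compareTo((String) b.get("lastUpdateDate"));
    }
}

static PCollection<KV<String, List<TableRow>>> latestPerCustomer(PCollection<TableRow> merged) {
    PCollection<KV<String, TableRow>> keyed = merged.apply("KeyByCustomer",
            ParDo.of(new DoFn<TableRow, KV<String, TableRow>>() {
                @ProcessElement
                public void processElement(ProcessContext c) {
                    TableRow row = c.element();
                    c.output(KV.of((String) row.get("customerID"), row));
                }
            }));

    // For each customerID, keep only the single row with the greatest lastUpdateDate
    return keyed.apply("LatestPerCustomer", Top.perKey(1, new LastUpdateComparator()));
}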
Is it possible to have a key value pair that is the entire row of the PCollection?
A KV<K, V> can have any type for the key (K) and value (V). If you want to group by key, then the coder for the keys needs to be deterministic. TableRowJsonCoder is not deterministic, because a TableRow may contain arbitrary objects. But it sounds like you want the customerID as the key and the entire TableRow as the value.
is it possible to apply these ParDos on the third PCollection without creating a fourth?
When you apply a PTransform to a PCollection, it results in a new PCollection. There is no way around that, and you don't need to try to minimize the number of PCollections in your pipeline.
A PCollection is a conceptual object; it does not have intrinsic cost. Your pipeline is going to be heavily optimized so that many intermediate PCollections - especially those in a sequence of ParDo transforms - will never be materialized anyhow.
TL;DR I am looking for a way to store, increment and retrieve ranges of event counts by minute.
I am looking for a solution for creating an incrementing time series in Redis. I want to store counts per minute, and my goal is to be able to look up a time range and get the values. So, for instance, if an event occurred 30 times in a given minute for a specific key, I would want to do something like ZRANGE and get the key values back. I am also hoping to use something like ZINCRBY to increment the value. I have of course looked at a sorted set, which seemed like a perfect fit until I realized that I can only do a range scan on the score and not the value. The optimal solution would be to use the minute as the score and the member in the sorted set as the number of events for that minute; the problem I ran into is that ZINCRBY only increments the score and not the member, and I was unable to find a way to increment the member atomically. I also looked into a hash using the current minute as the field and the event count as the value. I was able to increment the value using HINCRBY, but the problem is that it doesn't support fetching a range of fields.
Any help would be appreciated.
Actually, your question already contains the answer, since you have already listed the Redis ways to solve your problem:
Use ZSET - key as time and value as counter.
Use HSET - key as time and value as counter.
Use string keys - key name as time and value as counter.
Why only these cases? Because only these structures (ZSET, HSET, and string keys) have atomic methods to increment values.
So actually you need to:
Make the right choice of data structure.
Solve the problem of selecting a range of data.
The answer to the first question is a compromise between memory and performance. From your question, you do not need any kind of sorting, so sorted sets are not the best solution: they consume a lot of memory, and ZINCRBY has O(log(N)) time complexity, whereas HINCRBY and INCRBY are O(1). So we should choose between hashes and string keys. Please look at this question and answer about memory optimization in Redis; based on that, I think you should use hashes as the data type for your solution.
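For the increment side, a minimal sketch with Jedis (the hash name is reused from the retrieval sample below; the minute-bucket naming scheme is an assumption):

import redis.clients.jedis.Jedis;

Jedis jedis = new Jedis("localhost");
// fields of the hash are minute buckets (e.g. minutes since epoch), values are event counts
String minuteField = String.valueOf(System.currentTimeMillis() / 60000L);
jedis.hincrBy("your_set_name", minuteField, 1); // HINCRBY is atomic and O(1)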
The second question is common to all of these data structures, because none of them has a "select a range of fields by name" feature or an analog of it. We can use HMGET or Lua scripting to solve this problem; in either case the solution has time complexity O(n).
Here is a sample with Jedis (I'm not a Java programmer, sorry for possible errors):
import java.util.ArrayList;
import java.util.List;
import redis.clients.jedis.Jedis;

int fromMinute = 1;
int toMinute = 10;
List<String> list = new ArrayList<String>();
for (int i = fromMinute; i < toMinute; i++) {   // note: toMinute is exclusive here
    list.add(String.valueOf(i));                // minute buckets as hash field names
}

Jedis jedis = new Jedis("localhost");
List<String> values = jedis.hmget("your_set_name", list.toArray(new String[0]));
This solution is atomic, fast, has time complexity O(n), and consumes as little memory as possible in Redis.