I have a question that came up while working through this guide:
https://kafka.apache.org/20/documentation/streams/developer-guide/dsl-api.html
My question is about the section "Example of semantics for table aggregations". In particular, look at the table in that section at timestamp 4: what mechanism lets the aggregator perform "(E, 5 - 5)"?
My confusion is that the key has already been transformed from the name ("alice") to the region ("A") at the grouping step. How can "groupedTable" still know the original key in the aggregate step and perform the subtraction?
Thanks in advance.
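There are two mechanisms in place here:
the base store can get the old value for a key from the store before it puts the new value into the store
if required, the upstream operator hosting the base store will send both the new and the old value to the downstream operator
The downstream aggregation then applies the subtractor to the old value (under the old grouping key) and the adder to the new value (under the new grouping key).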
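To make the two mechanisms concrete, here is a minimal plain-Java simulation. It does not use the Kafka Streams API. The class name, the maps, and the "sum of user-name lengths per region" aggregate are assumptions loosely modeled on the guide's example, not Kafka's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java simulation (no Kafka dependency); all names here are hypothetical.
class TableAggregationSketch {
    // Stand-in for the base store of the upstream table: user -> region.
    final Map<String, String> baseStore = new HashMap<>();
    // Stand-in for the downstream aggregate: region -> sum of user-name lengths.
    final Map<String, Long> aggregate = new HashMap<>();

    void update(String user, String newRegion) {
        // Mechanism 1: put() returns the old value before it is overwritten.
        String oldRegion = baseStore.put(user, newRegion);

        // Mechanism 2: both old and new value reach the aggregation, so the
        // old grouping key can be recomputed from the old value.
        if (oldRegion != null) {
            // "Subtractor": remove the old contribution under the old region.
            aggregate.merge(oldRegion, -(long) user.length(), Long::sum);
        }
        // "Adder": add the new contribution under the new region.
        aggregate.merge(newRegion, (long) user.length(), Long::sum);
    }

    public static void main(String[] args) {
        TableAggregationSketch t = new TableAggregationSketch();
        t.update("alice", "E");
        t.update("bob", "A");
        t.update("charlie", "A");
        t.update("alice", "A"); // alice moves from E to A
        System.out.println("E=" + t.aggregate.get("E") + ", A=" + t.aggregate.get("A"));
    }
}
```

The last update is the "(E, 5 - 5)" case: the old value "E" is still available when alice's record changes, so her contribution of 5 can be subtracted from region E before it is added to region A.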
I'm having a bit of trouble with a Firebase query. I want to query for objects where the object's child value contains a certain string. So far I have something that looks like this:
Firebase *ref = [[Firebase alloc] initWithUrl:@"https://dinosaur-facts.firebaseio.com/dinosaurs"];
[[[[ref queryOrderedByKey] queryStartingAtValue:@"b"] queryEndingAtValue:@"b~"]
    observeEventType:FEventTypeChildAdded withBlock:^(FDataSnapshot *snapshot) {
        NSLog(@"%@", snapshot.key);
    }];
But that only gives objects that start with "b". I want objects that contain the string "b". How do I do that?
There are no contains or fuzzy matching methods in the query API, which you have probably already guessed if you've scanned the API and the guide on queries.
Not only has this subject been discussed ad nauseam on SO [1] [2] [3] [4] [5], but I've touched several times on why one should use a real search engine, instead of attempting this sort of half-hearted search approach.
There is a reason it's often easier to Google a website to find results than to use the built-in search, and this is a primary component of that failure.
With all of that said, the answer to your question of how to do this manually, since there is no built-in contains, is to set up a server-side process that loads/streams data into memory and does manual searching of the contents, preferably with some sort of caching.
But honestly, ElasticSearch is faster, simpler, and more efficient here. Since that's a vast topic, I'll refer you to the blog post on this subject.
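For the manual approach mentioned above, a minimal sketch of the server-side matching step could look like the following. It assumes the keys have already been loaded into memory by some other process. The class name, method name, and sample keys are hypothetical, and no Firebase API is involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical in-memory "contains" filter; loading the data is out of scope.
class ContainsSearchSketch {
    // Returns every key that contains the needle, ignoring case.
    static List<String> keysContaining(List<String> keys, String needle) {
        String lowered = needle.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (String key : keys) {
            if (key.toLowerCase(Locale.ROOT).contains(lowered)) {
                hits.add(key);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Sample keys are assumptions for illustration only.
        List<String> keys = Arrays.asList(
                "bruhathkayosaurus", "lambeosaurus", "stegosaurus", "linhenykus");
        System.out.println(keysContaining(keys, "b"));
    }
}
```

This is a linear scan over everything in memory, which is exactly why it only makes sense for small data sets or with caching, and why a search engine is the better tool beyond that.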
I am new to Elasticsearch (ES) and have gone through basic tutorials like this mkyong tutorial.
I have a question about the create step for a document.
CREATE Operation Example
To insert a new Document with /mkyong/posts/1001 and the following Request Data:
{
"title": "Java 8 Optional In Depth",
"category":"Java",
"published_date":"23-FEB-2017",
"author":"Rambabu Posa"
}
Question 1 :- Will ES create the inverted index on all attributes of the above document (i.e. title/category/published_date/author) by default and provide full-text search, or do I need to specify it explicitly?
Question 2 :- In the above example we already have a unique ID, i.e. 1001. That's fine if I am already storing the document in a DB that generates the ID. What if I need ES to generate the ID and do not have any DB?
Update :-
Got the answer for question 1 from Specify which fields are indexed in ElasticSearch
Question 1 :- Yes, by default ES will index a string field twice, as two separate types: once as "text" and once as "keyword" in a sub-field such as "title.keyword". The "text" type runs through an analyzer to support the standard search case (by default it tokenizes and lowercases; language analyzers can additionally remove stop words, stem words, etc). The "keyword" type makes no changes and indexes the data exactly as it is, to support exact matches and aggregations. You can explicitly give ES a mapping for any field, but if you don't, this is the default behavior.
Here is some information on the text vs keyword behavior:
https://www.elastic.co/blog/strings-are-dead-long-live-strings
Question 2 :- ES will automatically create its own internal ID for every document you index, in a field called "_id". You can technically replace this with your own ID, but typically you don't want to do that, because it can hurt performance by causing the hashing that ES uses to spread out the data to perform poorly. It is usually better just to add any IDs you would like as new fields in the document and let ES index them for you, ideally as the keyword type.
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-id-field.html
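As a rough illustration of the text vs keyword difference, here is a toy sketch in plain Java. It is not how ES is implemented: the analyzer below simply lowercases and splits on non-alphanumeric characters, and the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy approximation of "text" vs "keyword" matching; not ES's real analyzer.
class TextVsKeywordSketch {
    // "text"-like analysis: lowercase, then split into word tokens.
    static List<String> analyze(String value) {
        List<String> tokens = new ArrayList<>();
        for (String token : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // A "text"-like match: the analyzed query term equals one of the tokens.
    static boolean matchesText(String fieldValue, String term) {
        return analyze(fieldValue).contains(term.toLowerCase(Locale.ROOT));
    }

    // A "keyword"-like match: the whole value must be identical.
    static boolean matchesKeyword(String fieldValue, String term) {
        return fieldValue.equals(term);
    }

    public static void main(String[] args) {
        String title = "Java 8 Optional In Depth";
        System.out.println(analyze(title));
        System.out.println(matchesText(title, "optional"));
        System.out.println(matchesKeyword(title, "optional"));
    }
}
```

In this sketch a search for "optional" finds the title through the text-style match, while the keyword-style match only succeeds for the complete, unmodified string.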
When writing endpoints in Java for finding items by their keys, should I use the ID or the webSafeString of the key? In what situations does this matter?
It's up to you.
Do the entities have parents? Then you probably want to use the urlsafe representation as a single string will contain the full path to the entity. If you used an ID instead - you would somehow need to manually include the IDs of all parents up to the root.
No parents & IDs are numeric / alphanumeric? Then just use the IDs as they look cleaner (again, this is not a rule and is completely up to you).
No parents but IDs have special characters in them? Use the urlsafe representation as you might have issues with not being able to use some special characters without encoding them in HTTP.
Note #1: the urlsafe representation contains the entity names in an encoded form that can be easily decoded. This is unlikely to be a privacy issue, but you should still be aware of it. The actual data (IDs) are also simply encoded and can be easily decoded, so be careful when you use personal information such as emails as IDs; they are not hidden by urlsafe.
Note #2: if you decide to change the structure of your data in the future (parents <-> children), you might get stuck with some urlsafe strings that you issued to users who are not aware of the changes you have made.
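To illustrate Note #1, the sketch below shows that URL-safe Base64 is an encoding, not encryption. It is a simplified stand-in: a real urlsafe key encodes a serialized key structure rather than a bare string, and the class name and sample email are assumptions. Only the JDK's java.util.Base64 is used.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Stand-in demo: URL-safe Base64 hides nothing from anyone who decodes it.
class UrlSafeSketch {
    static String encode(String raw) {
        return Base64.getUrlEncoder().withoutPadding()
                .encodeToString(raw.getBytes(StandardCharsets.UTF_8));
    }

    static String decode(String token) {
        return new String(Base64.getUrlDecoder().decode(token), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical ID containing personal information.
        String token = encode("alice@example.com");
        System.out.println(token);         // safe to put in a URL
        System.out.println(decode(token)); // but trivially reversible
    }
}
```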
What is the best solution, in terms of performance and "readability/good coding style", for representing a (Java) enumeration (a fixed set of constants) on the DB layer: an integer (or any numeric datatype in general) or a string representation?
Caveat: Some database systems support "enums" directly, but this would require keeping the database enum definition in sync with the business-layer implementation. Furthermore, this kind of datatype might not be available on all database systems and might differ in syntax => I am looking for a solution that is easy to manage and available on all database systems. (So my question only addresses the number vs string representation.)
The number representation of a constant seems very efficient to store (for example, it consumes only two bytes as an integer) and is most likely very fast in terms of indexing, but it is hard to read ("0" vs. "1" etc.).
The string representation is more readable (storing "enabled" and "disabled" compared to "0" and "1"), but it consumes much more storage space and is most likely also slower in regard to indexing.
My question is: did I miss some important aspects? What would you suggest for an enum representation on the database layer?
Thank you very much!
In most cases, I prefer to use a short alphanumeric code, and then have a lookup table with the expanded text. When necessary I build the enum table in the program dynamically from the database table.
For example, suppose we have a field that is supposed to contain, say, transaction type, and the possible values are Sale, Return, Service, and Layaway. I'd create a transaction type table with code and description, make the codes maybe "SA", "RE", "SV", and "LY", and use the code field as the primary key. Then in each transaction record I'd post that code. A two-character code typically takes no more space than a 4-byte integer key, both in the record itself and in the index. Exactly how it is processed depends on the database engine but it shouldn't be dramatically less efficient than an integer key. And because it's mnemonic it's very easy to use. You can dump a record and easily see what the values are and likely remember which is which. You can display the codes without translation in user output and the users can make sense of them. Indeed, this can give you a performance gain over integer keys: In many cases the abbreviation is good for the users -- they often want abbreviations to keep displays compact and avoid scrolling -- so you don't need to join on the transaction type table to get a translation.
I would definitely NOT store a long text value in every record. In this example, I would not want to dispense with the transaction type table and store "Layaway". Not only is this inefficient, but it is quite possible that someday the users will say that they want it changed to "Layaway sale", or even some subtle difference like "Lay-away". Then you not only have to update every record in the database, but you have to search through the program for every place this text occurs and change it. Also, the longer the text, the more likely that somewhere along the line a programmer will misspell it and create obscure bugs.
Also, having a transaction type table provides a convenient place to store additional information about the transaction type. Never ever ever write code that says "if whatevercode='A' or whatevercode='C' or whatevercode='X' then ..." Whatever it is that makes those three codes somehow different from all other codes, put a field for it in the transaction type table and test that field. If you say, "Well, those are all the tax-related codes" or whatever, then fine, create a field called "tax_related" and set it to true or false for each code value as appropriate. Otherwise when someone creates a new transaction type, they have to look through all those if/or lists and figure out which ones this type should be added to and which it shouldn't. I've read plenty of baffling programs where I had to figure out why some logic applied to these three code values but not others, and when you think a fourth value ought to be included in the list, it's very hard to tell whether it is missing because it is really different in some way, or if the programmer made a mistake.
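A minimal sketch of the "test a field instead of an if/or list" idea follows. The class, the hard-coded rows standing in for the lookup table, and the tax_related values chosen for each code are all assumptions for illustration. In a real program the map would be built from the database table.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory copy of a transaction type lookup table.
class TransactionTypeSketch {
    static final class TransactionType {
        final String code;
        final String description;
        final boolean taxRelated; // mirrors a "tax_related" column

        TransactionType(String code, String description, boolean taxRelated) {
            this.code = code;
            this.description = description;
            this.taxRelated = taxRelated;
        }
    }

    // Stand-in for loading the table; the flag values are made up.
    static Map<String, TransactionType> loadTypes() {
        Map<String, TransactionType> types = new LinkedHashMap<>();
        types.put("SA", new TransactionType("SA", "Sale", true));
        types.put("RE", new TransactionType("RE", "Return", true));
        types.put("SV", new TransactionType("SV", "Service", false));
        types.put("LY", new TransactionType("LY", "Layaway", false));
        return types;
    }

    // Callers test the field, never a hard-coded list of codes.
    static boolean isTaxRelated(Map<String, TransactionType> types, String code) {
        TransactionType type = types.get(code);
        if (type == null) {
            throw new IllegalArgumentException("Unknown transaction type: " + code);
        }
        return type.taxRelated;
    }

    public static void main(String[] args) {
        Map<String, TransactionType> types = loadTypes();
        System.out.println(isTaxRelated(types, "SA"));
        System.out.println(isTaxRelated(types, "LY"));
    }
}
```

With this shape, adding a new transaction type means adding one row with its flags set, and no calling code has to be revisited.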
The only time I don't create the translation table is when the list is very short, there is no additional data to keep, and it is clear from the nature of the universe that it is unlikely to ever change so the values can be safely hard-coded. Like true/false or positive/negative/zero or male/female. (And hey, even that last one, obvious as it seems, there are people insisting we now include "transgendered" and the like.)
Some people dogmatically insist that every table have an auto-generated sequential integer key. Such keys are an excellent choice in many cases, but for code lists, I prefer the short alpha key for the reasons stated above.
I would store the string representation, as this is easy to correlate back to the enum and much more stable. Using ordinal() would be bad because the ordinals change if you add a new enum constant in the middle of the series, so you would have to implement your own numbering system.
In terms of performance, it all depends on what the enums would be used for, but it is most likely a premature optimization to develop a whole separate representation with conversion rather than just use the natural String representation.
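As a small sketch of the options discussed here, the enum below can be persisted either by name() or by an explicit short code that stays stable no matter how the constants are ordered. The enum, the dbCode field, and the codes "EN" and "DI" are hypothetical names chosen for this example.

```java
// Hypothetical enum showing name-based and code-based persistence.
class EnumStorageSketch {
    enum Status {
        ENABLED("EN"),
        DISABLED("DI");

        private final String dbCode; // value written to the database column

        Status(String dbCode) {
            this.dbCode = dbCode;
        }

        String dbCode() {
            return dbCode;
        }

        // Maps a stored code back to the constant, independent of ordinal().
        static Status fromDbCode(String code) {
            for (Status status : values()) {
                if (status.dbCode.equals(code)) {
                    return status;
                }
            }
            throw new IllegalArgumentException("Unknown status code: " + code);
        }
    }

    public static void main(String[] args) {
        // Round trip via the natural string representation.
        System.out.println(Status.valueOf(Status.ENABLED.name()));
        // Round trip via the explicit short code.
        System.out.println(Status.fromDbCode(Status.DISABLED.dbCode()));
    }
}
```

Neither round trip depends on ordinal(), so inserting a new constant between ENABLED and DISABLED would not change what existing rows mean.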