Appengine ID/Name vs WebSafeKey - java

When writing endpoints in Java, should I look up items by the ID or by the web-safe string of the key? In what situations does this matter?

It's up to you.
Do the entities have parents? Then you probably want to use the urlsafe representation as a single string will contain the full path to the entity. If you used an ID instead - you would somehow need to manually include the IDs of all parents up to the root.
No parents & IDs are numeric / alphanumeric? Then just use the IDs as they look cleaner (again, this is not a rule and is completely up to you).
No parents but IDs have special characters in them? Use the urlsafe representation as you might have issues with not being able to use some special characters without encoding them in HTTP.
Note #1: the urlsafe representation has the entity names encoded in it, and they can be easily decoded. This is unlikely to be a privacy issue, but you should still be aware of it. The actual data (IDs) is also simply encoded and can be easily decoded, so be careful when you use personal information such as emails as IDs; they are not protected by the urlsafe form.
Note #2: if you decide to change the structure of your data in the future (parents <-> children), you might get stuck with urlsafe strings that you have already issued to users who are not aware of the changes you have made.
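To illustrate Note #1, here is a minimal sketch of why an encoded string hides nothing. It does not reproduce the real urlsafe key format; it only assumes, as the note says, that the string is a plain reversible encoding. The class name and the sample value are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical demo class: shows that URL-safe Base64 is an encoding, not encryption.
class UrlSafeEncodingDemo {

    // Encode text as URL-safe Base64 without padding.
    static String encode(String plain) {
        return Base64.getUrlEncoder().withoutPadding()
                .encodeToString(plain.getBytes(StandardCharsets.UTF_8));
    }

    // Anyone holding the encoded string can reverse it; no secret is needed.
    static String decode(String encoded) {
        return new String(Base64.getUrlDecoder().decode(encoded), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Made-up "kind/name" pair that uses an email address as the ID.
        String token = encode("User/alice@example.com");
        System.out.println(token);
        System.out.println(decode(token)); // the email comes straight back
    }
}
```

If an ID is personal data, handing out its encoded form is effectively the same as handing out the ID itself.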


S3 prefix for listing objects with partial uuids

I was curious whether it is possible to build an S3 prefix that scopes an object listing to a particular folder when I only have partial data.
The structure is shown below. I can only provide uuid1 and uuid2; I can't retrieve ignoreUuid in order to build the full prefix.
Is it possible to filter by providing only uuid1 and uuid2?
At the moment I can only filter by uuid1, but that listing can contain thousands of objects and is quite time-intensive.
preferred: S3_SCAN_RESULT_PREFIX = "{uuid1}/files/{ignoreUuid}/{uuid2}";
currently: S3_SCAN_RESULT_PREFIX = "{uuid1}/files/"; (not optimal, as this can be quite a huge and expensive object listing)
ObjectListing objects = amazonS3Client().listObjects(bucket, format(S3_SCAN_RESULT_PREFIX, uuid1, uuid2));
No, it's not possible to natively list objects that match known1/known2/unknown/known3.
You would need to rearrange the prefix to bring all known parts to the front, or maintain an index elsewhere (in DynamoDB or an RDBMS, for example).
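As a rough sketch of the first option, the key layout could put both known UUIDs ahead of the unknown one. The layout, class name, and method names below are assumptions for illustration, and the prefix match is simulated on an in-memory list rather than through the S3 client.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: a key layout where every known part precedes the unknown part.
class ScanResultKeys {

    // Assumed rearranged layout: {uuid1}/{uuid2}/files/{ignoreUuid}/{fileName}
    static String objectKey(String uuid1, String uuid2, String ignoreUuid, String fileName) {
        return uuid1 + "/" + uuid2 + "/files/" + ignoreUuid + "/" + fileName;
    }

    // The listing prefix needs only the two UUIDs the caller actually has.
    static String listingPrefix(String uuid1, String uuid2) {
        return uuid1 + "/" + uuid2 + "/files/";
    }

    // Stand-in for a prefix listing: keeps the keys that start with the prefix.
    static List<String> filterByPrefix(List<String> keys, String prefix) {
        return keys.stream()
                .filter(key -> key.startsWith(prefix))
                .collect(Collectors.toList());
    }
}
```

The string returned by listingPrefix would then be passed as the prefix argument of the listing call, so the service only walks keys under both known UUIDs.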

Built-in libraries to perform effective searching on 100GB files

Is there any built-in library in Java for searching for strings in large files of about 100GB? I am currently using binary search, but it is not that efficient.
As far as I know Java does not contain any file search engine, with or without an index. There is a very good reason for that too: search engine implementations are intrinsically tied to both the input data set and the search pattern format. A minor variation in either could result in massive changes in the search engine.
For us to be able to provide a more concrete answer you need to:
Describe exactly the data set: the number, path structure and average size of files, the format of each entry and the format of each contained token.
Describe exactly your search patterns: are those fixed strings, glob patterns or, say, regular expressions? Do you expect the pattern to match a full line or a specific token in each line?
Describe exactly your desired search results: do you want exact or approximate matches? Do you want to get a position in a file, or extract specific tokens?
Describe exactly your requirements: are you able to build an index beforehand? Is the data set expected to be modified in real time?
Explain why you can't use third-party libraries such as Lucene that are designed exactly for this kind of work.
Explain why your current binary search, which should have a complexity of O(log n), is not efficient enough. The only thing that might be faster, with constant-time lookups, would involve the use of a hash table.
It might be best if you described your problem in broader terms. For example, one might assume from your sample data set that what you have is a set of words and associated offset or document identifier lists. A simple way to approach searching in such a set would be to store a word/file-position index in a hash table, so that each associated list can be accessed in constant time.
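A minimal sketch of such a word/file-position index, assuming the index fits in memory and that something else (not shown) scans the file and reports each word with its byte offset. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory index: word -> byte offsets where the word occurs.
class WordPositionIndex {
    private final Map<String, List<Long>> positions = new HashMap<>();

    // Record one occurrence of a word at the given file offset.
    void add(String word, long filePosition) {
        positions.computeIfAbsent(word, w -> new ArrayList<>()).add(filePosition);
    }

    // Average constant-time lookup; returns an empty list for unknown words.
    List<Long> lookup(String word) {
        return positions.getOrDefault(word, Collections.emptyList());
    }
}
```

With the offsets in hand, the caller could seek directly to each position (for example with RandomAccessFile) instead of searching the file. Whether an index for 100GB of data fits in memory depends on the number of distinct words, which the question does not state.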
If you don't want to use tools built for search, then store the data in a database and use SQL.

How to best represent Constants (Enums) in the Database (INT vs VARCHAR)?

What is the best solution, in terms of performance and "readability/good coding style", for representing a (Java) enumeration (a fixed set of constants) at the DB layer: an integer (or any numeric datatype in general) or a string representation?
Caveat: Some database systems support "enums" directly, but this would require keeping the database enum definition in sync with the business-layer implementation. Furthermore, this kind of datatype might not be available on all database systems and might differ in syntax, so I am looking for a solution that is easy to manage and available on all database systems. (So my question only addresses the number vs. string representation.)
The number representation of a constant seems very efficient to store (for example, it consumes only two bytes as an integer) and is most likely very fast in terms of indexing, but it is hard to read ("0" vs. "1", etc.).
The string representation is more readable (storing "enabled" and "disabled" compared to "0" and "1"), but it consumes much more storage space and is most likely also slower in terms of indexing.
My question is: did I miss any important aspects? What would you suggest for representing an enum at the database layer?
Thank you very much!
In most cases, I prefer to use a short alphanumeric code, and then have a lookup table with the expanded text. When necessary I build the enum table in the program dynamically from the database table.
For example, suppose we have a field that is supposed to contain, say, transaction type, and the possible values are Sale, Return, Service, and Layaway. I'd create a transaction type table with code and description, make the codes maybe "SA", "RE", "SV", and "LY", and use the code field as the primary key. Then in each transaction record I'd post that code. This takes less space than an integer key in the record itself and in the index. Exactly how it is processed depends on the database engine but it shouldn't be dramatically less efficient than an integer key. And because it's mnemonic it's very easy to use. You can dump a record and easily see what the values are and likely remember which is which. You can display the codes without translation in user output and the users can make sense of them. Indeed, this can give you a performance gain over integer keys: In many cases the abbreviation is good for the users -- they often want abbreviations to keep displays compact and avoid scrolling -- so you don't need to join on the transaction table to get a translation.
I would definitely NOT store a long text value in every record. Like in this example, I would not want to dispense with the transaction table and store "Layaway". Not only is this inefficient, but it is quite possible that someday the users will say that they want it changed to "Layaway sale", or even some subtle difference like "Lay-away". Then you not only have to update every record in the database, but you have to search through the program for every place this text occurs and change it. Also, the longer the text, the more likely that somewhere along the line a programmer will mis-spell it and create obscure bugs.
Also, having a transaction type table provides a convenient place to store additional information about the transaction type. Never ever ever write code that says "if whatevercode='A' or whatevercode='C' or whatevercode='X' then ..." Whatever it is that makes those three codes somehow different from all other codes, put a field for it in the transaction table and test that field. If you say, "Well, those are all the tax-related codes" or whatever, then fine, create a field called "tax_related" and set it to true or false for each code value as appropriate. Otherwise when someone creates a new transaction type, they have to look through all those if/or lists and figure out which ones this type should be added to and which it shouldn't. I've read plenty of baffling programs where I had to figure out why some logic applied to these three code values but not others, and when you think a fourth value ought to be included in the list, it's very hard to tell whether it is missing because it is really different in some way, or if the programmer made a mistake.
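Here is a small sketch of that idea on the Java side, assuming the rows of the transaction type table have been loaded into memory. The class name, the field names, and which codes count as tax-related are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory copy of the transaction type lookup table.
class TransactionTypeTable {

    // One row of the table: code, description, and a descriptive flag.
    static final class Row {
        final String code;
        final String description;
        final boolean taxRelated;

        Row(String code, String description, boolean taxRelated) {
            this.code = code;
            this.description = description;
            this.taxRelated = taxRelated;
        }
    }

    private final Map<String, Row> rowsByCode = new HashMap<>();

    // In a real program these rows would come from the database table.
    void register(String code, String description, boolean taxRelated) {
        rowsByCode.put(code, new Row(code, description, taxRelated));
    }

    // Callers test the flag instead of hard-coding a list of codes.
    boolean isTaxRelated(String code) {
        Row row = rowsByCode.get(code);
        if (row == null) {
            throw new IllegalArgumentException("Unknown transaction type code: " + code);
        }
        return row.taxRelated;
    }
}
```

When a new transaction type is added, only its row needs the flag set; no if/or list elsewhere in the program has to be found and updated.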
The only time I don't create the translation table is when the list is very short, there is no additional data to keep, and it is clear from the nature of the universe that it is unlikely to ever change so the values can be safely hard-coded. Like true/false or positive/negative/zero or male/female. (And hey, even that last one, obvious as it seems, there are people insisting we now include "transgendered" and the like.)
Some people dogmatically insist that every table have an auto-generated sequential integer key. Such keys are an excellent choice in many cases, but for code lists, I prefer the short alpha key for the reasons stated above.
I would store the string representation, as this is easy to correlate back to the enum and much more stable. Using ordinal() would be bad because it can change if you add a new enum to the middle of the series, so you would have to implement your own numbering system.
In terms of performance, it all depends on what the enums would be used for, but it is most likely a premature optimization to develop a whole separate representation with conversion rather than just use the natural String representation.
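One way to avoid depending on ordinal() is to give each constant an explicit stored value. The sketch below borrows the two-letter codes from the other answer as sample values; the enum and its method names are assumptions for illustration.

```java
// Hypothetical enum that carries its own stable database code.
enum TransactionType {
    SALE("SA"),
    RETURN("RE"),
    SERVICE("SV"),
    LAYAWAY("LY");

    private final String code;

    TransactionType(String code) {
        this.code = code;
    }

    // Value written to the database column.
    String code() {
        return code;
    }

    // Map a stored code back to the constant; fails loudly on unknown values.
    static TransactionType fromCode(String code) {
        for (TransactionType type : values()) {
            if (type.code.equals(code)) {
                return type;
            }
        }
        throw new IllegalArgumentException("Unknown transaction type code: " + code);
    }
}
```

Reordering the constants or inserting a new one in the middle changes ordinal() but leaves the stored codes untouched. Storing name() directly also works, as long as the constants are never renamed.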

Exposing Entity IDs of Google Datastore data

Is it safe to expose entity ids of data that is in Google Datastore?
For example, in my code I have an entity with this id:
@PrimaryKey
@Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
@Extension(vendorName="datanucleus", key="gae.encoded-pk", value="true")
private String id;
The id is going to be similar to this: agptZeERtzaWYvSQadLEgZDdRsUYRs
Can anyone extract password, application url and any other information from this string? What is the meaning of that string?
That entity ID contains the object ID, application ID, and object class name. It's just an encoded string, not really any sort of security risk.
You can use KeyFactory to convert between a Key and this string with keyToString() and stringToKey(). I believe the ID is a unique identifier for the entity in Google App Engine's data storage. From the Google App Engine documentation:
Key instances can be converted to and from the encoded string representation using the KeyFactory methods keyToString() and stringToKey(). When using encoded key strings, you can provide access to an object's string or numeric ID with an additional field.
I hope it helps.
Tiger.
If you navigate to localhost:4321/_ah/admin , you can take advantage of the sdk datastore viewer, where you will see that every kind of entity has a KEY field and a NAME/ID field;
Whether you use long, String or Key as your @PrimaryKey, there will be an ID/Name column with a String/number, and a KEY column with the encoded key for said ID. As mentioned in other posts, this encoding packs together {it is a reversible encoding, not a hash} your appspot application id, the fully qualified classname of the data object, and whatever you specify as the @PrimaryKey.
The only times you will want direct access to this field are when you absolutely don't care what the data is named {your program needs to find it, but humans won't be searching for it by guessing words into a text box}, or when you WANT multiple objects of the same type and name {maybe using a version int?}; in those cases, use the encoded key syntax. Both KEY and ID are present in the db whether or not you put a field in your class; using the encoded key syntax just gives you access to this value.
Also, there is an available speed bonus for applications that use encoded keys... There are only two types of queries: SELECT * and SELECT __key__. For large data sets in AJAX apps, the only efficient way to paginate data is to select all the keys, send them to the client, and have the client ask for records 0->X, build links for the remaining X->Y results, query the server with the first set of encoded keys for full data, parse the response into nice little lists, and avoid loading 397 server data objects that aren't immediately useful.
Sending encoded keys up and down the wire might take a little more bandwidth than unencoded keys {unless you're as long winded at naming things as I am!}; but it shaves those cpu cycles on appengine, makes your quotas happier, and everybody's app runs just a fraction faster!
This key, even when decoded, will only expose data as sensitive as whatever you make a PrimaryKey. Your app password is not involved, nor will user passwords be in any sane data model. About the only thing that might {BIG might} leak is a user email address, if you use the provided User class for authentication, or the class names you use in your source.
...Basically, only information already available in watching a firebug request or two could possibly be exposed.

Storing a 2 dimensional table (decision table) in XML for efficient Query(ies)

I need to implement a routing table with a number of parameters.
For example, here are five attributes of the incoming message, plus the target:
Customer  Txn Group  Txn Type  Sender  Priority  Target
UTI       CORP       ONEOFF    ABC     LOW       TRG1
UTI       GOV        ONEOFF    ABC     LOW       TRG2
What is the best way to represent this data in XML so that it can be queried efficiently?
I want to store this data in XML; using Java, I would load it into memory, and when a message comes in I want to identify the target based on the attributes.
Appreciate any inputs.
Thanks,
Manglu
Here is a pure XML representation that can be processed very efficiently as is, without the need to be converted into any other internal data structure:
<table>
  <record Customer="UTI" Txn-Group="CORP"
          Txn-Type="ONEOFF" Sender="ABC1"
          Priority="LOW" Target="TRG1"/>
  <record Customer="UTI" Txn-Group="Gov"
          Txn-Type="ONEOFF" Sender="ABC2"
          Priority="LOW" Target="TRG2"/>
</table>
There is an extremely efficient way to query data in this format using the <xsl:key> instruction and the XSLT key() function:
This transformation:
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes"/>
  <xsl:key name="kRec" match="record"
           use="concat(@Customer,'+',@Sender)"/>
  <xsl:template match="/">
    <xsl:copy-of select="key('kRec', 'UTI+ABC2')"/>
  </xsl:template>
</xsl:stylesheet>
when applied on the above XML document produces the desired result:
<record Customer="UTI"
Txn-Group="Gov" Txn-Type="ONEOFF"
Sender="ABC2" Priority="LOW"
Target="TRG2"/>
Do note the following:
There can be multiple <xsl:key>s defined that identify a record using different combinations of values to be concatenated together (whatever will be considered "keys" and/or "primary keys").
If an <xsl:key> is defined to use the concatenation of "primary keys" then a unique record (or no record) will be found when the key() function is evaluated.
If an <xsl:key> is defined to use the concatenation of "non-primary keys", then more than one record may be found when the key() function is evaluated.
The <xsl:key> instruction is the equivalent of defining an index in a database. This makes using the key() function extremely efficient.
In many cases there is no need to convert the above XML form into an intermediate data structure, either for understandability or for efficiency.
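Since the question mentions Java, here is a rough sketch of running an equivalent key() lookup through the JDK's built-in javax.xml.transform API. It is a variation rather than the exact transformation above: the lookup value is passed in as a stylesheet parameter (named lookup here, an assumption) and only the Target attribute is returned as text. The class and method names are made up.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper that resolves a routing target with xsl:key and key().
class XslKeyLookup {

    // The routing table from the answer, inlined as a string for the sketch.
    static final String TABLE =
        "<table>"
      + "<record Customer='UTI' Txn-Group='CORP' Txn-Type='ONEOFF'"
      + " Sender='ABC1' Priority='LOW' Target='TRG1'/>"
      + "<record Customer='UTI' Txn-Group='Gov' Txn-Type='ONEOFF'"
      + " Sender='ABC2' Priority='LOW' Target='TRG2'/>"
      + "</table>";

    // Same key definition as above; the looked-up value is a parameter.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:param name='lookup'/>"
      + "<xsl:key name='kRec' match='record'"
      + " use=\"concat(@Customer,'+',@Sender)\"/>"
      + "<xsl:template match='/'>"
      + "<xsl:value-of select=\"key('kRec', $lookup)/@Target\"/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Returns the Target of the first matching record, or "" if none matches.
    static String findTarget(String customer, String sender) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            transformer.setParameter("lookup", customer + "+" + sender);
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(TABLE)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT lookup failed", e);
        }
    }
}
```

This sketch re-parses both the stylesheet and the document on every call, which is only acceptable for a demo. A real router would probably compile the stylesheet once (for example into a Templates object) and keep the parsed table around.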
If you're loading it into memory, it doesn't really matter what form the XML takes - make it the easiest to read or write by hand, I would suggest. When you load it into memory, then you should transform it into an appropriate data structure. (The exact nature of the data structure would depend on the exact nature of the requirements.)
EDIT: This is to counter the arguments made in comments by Dimitre:
I'm not sure whether you thought I was suggesting that people implement their own hashtable - I certainly wasn't. Just keep a straight hashtable or perhaps a MultiMap for each column which you want to use as a key. Developers know how to use hashtables.
As for the runtime efficiency, which do you think is going to be more efficient:
You build some XSLT (and bear in mind this is foreign territory, at least relatively speaking, for most developers)
XSLT engine parses it. This step may be avoidable if you're using an XSLT library which lets you just parameterise an existing query. Even so, you've got some extra work to do.
XSLT engine hits hashtables (you hope, at least) and returns a node
You convert the node into a more useful data structure
Or:
You look up appropriate entries in your hashtable based on the keys you've been given, getting straight to a useful data structure
I think I'd trust the second one, personally. Using XSLT here feels like using a screwdriver to bash in a nail...
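For comparison, the hashtable route could look roughly like the sketch below. It assumes all five attributes form the key and that the routes have already been read out of the XML (parsing not shown); the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory routing table keyed on the five message attributes.
class RoutingTable {

    // A List is used as the composite key because its equals/hashCode compare elements.
    private final Map<List<String>, String> targets = new HashMap<>();

    // Register one row of the decision table.
    void addRoute(String customer, String txnGroup, String txnType,
                  String sender, String priority, String target) {
        targets.put(Arrays.asList(customer, txnGroup, txnType, sender, priority), target);
    }

    // Returns the target for an incoming message, or null if no route matches.
    String findTarget(String customer, String txnGroup, String txnType,
                      String sender, String priority) {
        return targets.get(Arrays.asList(customer, txnGroup, txnType, sender, priority));
    }
}
```

Loading the two sample rows and calling findTarget("UTI", "GOV", "ONEOFF", "ABC", "LOW") would return the target registered for that row. Lookups by a subset of the attributes would need an extra map per key combination, as suggested above.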
That depends on what is repeating and what could be empty. XML is not known for its efficient queryability, as it is neither fixed-length nor compact.
I agree with the previous two posters - you should definitely not keep the internal representation of this data in XML when querying as messages come in.
The XML representation can be anything, you could do something like this:
<routes>
<route customer="UTI" txn-group="CORP" txn-type="ONEOFF" .../>
...
</routes>
My internal representation would depend on the format of the message coming in, and the language. A simple representation would be a map, mapping a structure of data (i.e. the key fields from which the routing decision is made) to the info on the target route.
Depending on your performance requirements, you could keep the key/target information as strings, though in any high-performing system you'd probably want to do a straight memory comparison (in C/C++) or some form of integer comparison.
Yeah, your basic problem is that you're using "XML" and "efficient" in the same sentence.
Edit: No, seriously, yer killin' me. The fact that several people in this thread are using "highly efficient" to describe anything to do with operations on a data format that require string parsing just to find out where your fields are shows that several people in this thread do not even know what the word "efficient" means. Downvote me as much as you like for saying it. I can take it, coach.
