S3 prefix for listing objects with partial UUIDs - Java

I was curious whether it is possible to create an S3 prefix that scopes the object listing to a particular folder when I have only partial data.
Basically the structure is like below, where I can only provide uuid1 and uuid2. I can't retrieve the ignoreUuid needed to build up the full prefix.
Is it possible for me to filter by providing only uuid1 and uuid2?
I can only filter by uuid1 at the moment, but that listing can run into the thousands of objects and is quite time intensive.
Preferred: S3_SCAN_RESULT_PREFIX = "{uuid1}/files/{ignoreUuid}/{uuid2}";
Currently: S3_SCAN_RESULT_PREFIX = "{uuid1}/files/"; (not optimal, as this can be quite a huge and expensive object listing)
ObjectListing objects = amazonS3Client().listObjects(bucket, format(S3_SCAN_RESULT_PREFIX, uuid1, uuid2));

No, it's not possible to natively list objects that match known1/known2/unknown/known3.
You would need to rearrange the prefix to bring all known parts to the front, or maintain an index elsewhere (in DynamoDB or an RDBMS, for example).
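If rearranging the key layout isn't feasible, the practical fallback is to list under the known {uuid1}/files/ prefix and filter the keys client-side by the uuid2 segment. A minimal sketch with the AWS SDK for Java v1 (the listAndFilter method and the assumed key layout {uuid1}/files/{ignoreUuid}/{uuid2}/... are illustrative, not from the question):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ListObjectsV2Request;
import com.amazonaws.services.s3.model.ListObjectsV2Result;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import java.util.ArrayList;
import java.util.List;

public class PartialUuidFilter {
    // Lists everything under "{uuid1}/files/" and keeps only keys whose
    // fourth path segment equals uuid2 (assumed layout, see above).
    static List<String> listAndFilter(AmazonS3 s3, String bucket, String uuid1, String uuid2) {
        List<String> matches = new ArrayList<>();
        ListObjectsV2Request req = new ListObjectsV2Request()
                .withBucketName(bucket)
                .withPrefix(uuid1 + "/files/");
        ListObjectsV2Result result;
        do {
            result = s3.listObjectsV2(req);
            for (S3ObjectSummary summary : result.getObjectSummaries()) {
                String[] parts = summary.getKey().split("/");
                if (parts.length >= 4 && parts[3].equals(uuid2)) {
                    matches.add(summary.getKey());
                }
            }
            req.setContinuationToken(result.getNextContinuationToken());
        } while (result.isTruncated());
        return matches;
    }
}

Note that this still pays for the full listing under uuid1, so an external index remains the better option once the listings run into the thousands.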

Related

Searching for keys in an S3 bucket with prefix, suffix or regex?

I have an S3 bucket that contains a million objects; the object keys are quite different from each other and follow no standard pattern at all.
I want to know if there's a way to search for specific key patterns and return those objects using the Amazon S3 SDK for Java.
For example, can I search for the keys using
Prefix
Suffix
or Regex
What are the possible ways to search for keys with S3?
You can ListObjects() with a given Prefix. Amazon S3 does not support listing via suffix or regex.
The Prefix includes the full path of the object, so an object with a Key of 2020/06/10/foo.txt could be found with a prefix of 2020/06/10/, but not a prefix of foo.
The Java method is listObjects().
See also: Performing Operations on Amazon S3 Objects - AWS SDK for Java
With millions of objects, it can be quite slow to list them (even with a Prefix), since each API call returns a maximum of 1000 objects.
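For illustration, a sketch of paginated listing with a Prefix plus client-side suffix filtering (AWS SDK for Java v1; the bucket name, prefix, and .txt suffix are placeholders):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class KeySearch {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        // S3 filters server-side by prefix only; suffix/regex matching must happen client-side.
        ObjectListing listing = s3.listObjects("my-bucket", "2020/06/10/");
        while (true) {
            for (S3ObjectSummary summary : listing.getObjectSummaries()) {
                if (summary.getKey().endsWith(".txt")) { // client-side suffix filter
                    System.out.println(summary.getKey());
                }
            }
            if (!listing.isTruncated()) break;
            listing = s3.listNextBatchOfObjects(listing); // each call returns up to 1000 keys
        }
    }
}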
Alternatively you might want to use Amazon S3 Inventory, which can provide a daily or weekly CSV file containing a list of all objects.

Efficiently copy large timeseries results in Java

I am querying data from a timeseries database (Influx in my case) using Java.
I have approximately 20,000-100,000 values (Strings) in the database.
Mapping the results that I get via the Influx Java API to my domain objects seems to be very inefficient (ca. 0.5 s on a small machine).
I suppose this is due to "resource intensive" object creation of the domain objects.
I am currently using StreamsAPI:
QueryResult.Series series = result.getResults().get(0).getSeries().get(0);
List<ItemHistoryEntity> mappedList = series.getValues().stream().parallel()
        .map(ItemHistoryEntity::new).collect(Collectors.toList());
Unfortunately, downsampling my data at the database is not an option in my case.
How can I do this more efficiently in Java?
EDIT:
The next thing I will do with the list is downsample it. The problem is that for further downsampling I need the oldest timestamp in the list, and to get that timestamp I have to iterate over the full list. Would it be more efficient never to call Collectors.toList() until I have reduced the size of the list, even though I then need to iterate it at least twice? Or should I find the oldest timestamp using an additional DB query, iterate the list only once, and call the collector only for the reduced list?
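For illustration, the second pass could be avoided by tracking the minimum while mapping, using a plain loop instead of the stream pipeline; a sketch, assuming ItemHistoryEntity exposes a getTimestamp() accessor (not shown in the question):

import java.util.ArrayList;
import java.util.List;
import org.influxdb.dto.QueryResult;

class SinglePassMapping {
    final List<ItemHistoryEntity> mapped;
    long oldestTimestamp = Long.MAX_VALUE;

    // Maps every raw row to an entity and tracks the oldest timestamp in the
    // same pass, so the list is iterated only once before downsampling.
    SinglePassMapping(QueryResult.Series series) {
        mapped = new ArrayList<>(series.getValues().size());
        for (List<Object> valueList : series.getValues()) {
            ItemHistoryEntity entity = new ItemHistoryEntity(valueList);
            mapped.add(entity);
            oldestTimestamp = Math.min(oldestTimestamp, entity.getTimestamp()); // hypothetical accessor
        }
    }
}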

Appengine ID/Name vs WebSafeKey

When writing endpoints in Java for finding items by their keys, should I use the ID or the webSafeString of the key? In what situations does this matter?
It's up to you.
Do the entities have parents? Then you probably want to use the urlsafe representation as a single string will contain the full path to the entity. If you used an ID instead - you would somehow need to manually include the IDs of all parents up to the root.
No parents & IDs are numeric / alphanumeric? Then just use the IDs as they look cleaner (again, this is not a rule and is completely up to you).
No parents but IDs have special characters in them? Use the urlsafe representation as you might have issues with not being able to use some special characters without encoding them in HTTP.
Note #1: the urlsafe representation has the entity kind names encoded in it, and they can be easily decoded; this is unlikely to be a privacy issue, but you should still be aware of it. The actual data (the IDs) is also simply encoded and can be easily decoded, so be careful when you use personal information such as email addresses as IDs; they are not safe in the urlsafe form.
Note #2: if you decide to change the structure of your data in the future (parents <-> children), you might get stuck with urlsafe keys you issued to users who are not aware of the changes you made.
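For reference, a minimal sketch of the two representations, using the App Engine Datastore API inside an App Engine context (the Album/Photo kinds and IDs are placeholders):

import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;

public class KeyDemo {
    public static void demo() {
        // A child entity under a parent: the urlsafe string captures the whole path.
        Key parent = KeyFactory.createKey("Album", 42L);
        Key child = KeyFactory.createKey(parent, "Photo", 7L);

        String webSafe = KeyFactory.keyToString(child); // full ancestor path in one string
        Key decoded = KeyFactory.stringToKey(webSafe);  // round-trips back to the same key

        System.out.println(decoded.getId());             // 7 - just the child's numeric ID
        System.out.println(decoded.getParent().getId()); // 42 - parent recovered from the string
    }
}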

Built-in libraries to perform efficient searching on 100 GB files

Is there any built-in library in Java for searching strings in large files of about 100 GB? I am currently using binary search, but it is not that efficient.
As far as I know Java does not contain any file search engine, with or without an index. There is a very good reason for that too: search engine implementations are intrinsically tied to both the input data set and the search pattern format. A minor variation in either could result in massive changes in the search engine.
For us to be able to provide a more concrete answer you need to:
Describe exactly the data set: the number, path structure and average size of files, the format of each entry and the format of each contained token.
Describe exactly your search patterns: are those fixed strings, glob patterns or, say, regular expressions? Do you expect the pattern to match a full line or a specific token in each line?
Describe exactly your desired search results: do you want exact or approximate matches? Do you want to get a position in a file, or extract specific tokens?
Describe exactly your requirements: are you able to build an index beforehand? Is the data set expected to be modified in real time?
Explain why can't you use third party libraries such as Lucene that are designed exactly for this kind of work.
Explain why your current binary search, which should have a complexity of O(log n), is not efficient enough. The only thing that might be faster, with constant complexity, would involve the use of a hash table.
It might be best if you described your problem in broader terms. For example, one might assume from your sample data set that what you have is a set of words and associated offset or document identifier lists. A simple approach to searching such a set would be to store a word/file-position index in a hash table, so that each associated list can be accessed in constant time, as sketched below.
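A sketch of that idea under the stated assumptions (line-oriented text, word tokens; the file name and query are placeholders). Note that the index itself may not fit in memory for 100 GB of input; this only illustrates the constant-time lookup:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OffsetIndex {
    // Builds a word -> byte-offset index so each lookup is O(1) on average.
    static Map<String, List<Long>> buildIndex(String file) throws IOException {
        Map<String, List<Long>> index = new HashMap<>();
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            long offset = raf.getFilePointer();
            String line;
            while ((line = raf.readLine()) != null) {
                for (String word : line.split("\\W+")) {
                    if (!word.isEmpty()) {
                        index.computeIfAbsent(word.toLowerCase(), k -> new ArrayList<>()).add(offset);
                    }
                }
                offset = raf.getFilePointer();
            }
        }
        return index;
    }

    public static void main(String[] args) throws IOException {
        Map<String, List<Long>> index = buildIndex("data.txt"); // placeholder file
        System.out.println(index.getOrDefault("example", List.of())); // byte offsets of matching lines
    }
}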
If you don't want to use tools built for search, then store the data in a database and use SQL.

Data structure for search engine in Java?

I'm an MCS 2nd-year student. I'm doing a project in Java in which I have different images. For storing the description of, say, IMAGE-1, I have an ArrayList named IMAGE-1; similarly for IMAGE-2 an ArrayList IMAGE-2, and so on.
Now I need to develop a search engine in which I need to find all images whose description matches a word entered into the search engine.
For example, if I enter "computer", I should be able to find all images whose description contains "computer".
So my questions are:
How should I do this efficiently?
How should I maintain all those ArrayLists, since I can have hundreds of them? Or should I use another data structure instead of ArrayList?
A simple implementation is to tokenize the description and use a Map<String, Collection<Item>> to store all items for a token.
Building:
for (String token : tokenize(description)) map.computeIfAbsent(token, k -> new ArrayList<>()).add(item);
(A collection is needed as multiple entries can be found for a token; computeIfAbsent creates the collection on first use. tokenize() is left to you, but the idea should be clear.)
Use:
Collection<Item> result = map.get("Computer");
The general-purpose HashMap implementation is not the most memory-efficient in this case. When you start running into memory problems, you can look into a more space-efficient tree implementation (such as a radix tree).
The next step could be to use some (in-memory) database. These could be relational (HSQL) or not (Berkeley DB).
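Putting the pieces together, a minimal self-contained sketch of the inverted-index idea (the Item record and the tokenizer are illustrative stand-ins):

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ImageIndex {
    record Item(String name, String description) {}

    private final Map<String, Collection<Item>> map = new HashMap<>();

    // Tokenize on non-word characters and fold case so "Computer" matches "computer".
    void add(Item item) {
        for (String token : item.description().toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                map.computeIfAbsent(token, k -> new ArrayList<>()).add(item);
            }
        }
    }

    Collection<Item> search(String word) {
        return map.getOrDefault(word.toLowerCase(), List.of());
    }

    public static void main(String[] args) {
        ImageIndex index = new ImageIndex();
        index.add(new Item("IMAGE-1", "A computer on a desk"));
        index.add(new Item("IMAGE-2", "A cat sleeping"));
        System.out.println(index.search("Computer")); // finds IMAGE-1
    }
}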
If you have a small number of images and short descriptions (< 1000 characters), load them into an array and search for words using String.indexOf() (i.e. one entry in the array == one complete image description). This is efficient enough for, say, fewer than 10,000 images.
Use toLowerCase() to fold the case of the characters (so users will find "Computer" when they type "computer"). String.indexOf() will also match partial words (using "comp" to find "Computer" or "compare").
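A sketch of that linear scan (the descriptions array is a placeholder for your loaded data):

import java.util.ArrayList;
import java.util.List;

public class LinearSearch {
    // Returns the indexes of all descriptions containing the query, case-folded.
    static List<Integer> find(String[] descriptions, String query) {
        List<Integer> hits = new ArrayList<>();
        String q = query.toLowerCase();
        for (int i = 0; i < descriptions.length; i++) {
            if (descriptions[i].toLowerCase().indexOf(q) >= 0) {
                hits.add(i); // indexOf also matches partial words, e.g. "comp" in "Computer"
            }
        }
        return hits;
    }
}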
If you have lots of images and long descriptions and/or you want to give your users some comforts for the search (like Google does), then use Lucene.
There is no simple, easy-to-use data structure that supports efficient fulltext search.
But do you actually need efficiency? Is this a desktop app or a web app? In the former case, don't worry about efficiency, a modern CPU can search through megabytes of text in fractions of a second - simply look through all your descriptions using String.contains() (or a regexp to allow more flexible searches).
If you really need efficiency (such as for a webapp where many people could do searches at the same time), look into Apache Lucene.
As for your ArrayLists, it seems strange to use one for the description of a single image. Why a list, what does the index represent? Lines? If so, and unless you actually need to access lines directly, replace the lists with a simple String - it can contain newline characters just fine.
I would suggest you use the Hashtable class, or organize your content into a tree, to optimize searching.
