I have a puzzling requirement.
Basically I need to create a unique ID with these criteria:
9-digit number, unique for the day (meaning it's OK if the number appears again the next day)
generated in real time; Java only (meaning no sequence number generation from a database - in fact, no database access at all)
the number is generated to populate a requestID, and around 1,000,000 IDs will be generated per day.
UUID or UID should not be used (more than 9 digits)
Here are my considerations:
Using a sequence number sounds good, but if the JVM restarts, the same requestId might be generated again.
Using the time as HHmmssSSS (hour, minute, second, millisecond) has two issues:
a. The system clock might be changed by a server admin.
b. Collisions can occur if two requests arrive in the same millisecond.
Any idea?
no sequence number generation from database
I hate silly requirements like that. I say you cheat and use an embedded database like H2 or HSQLDB and generate the identifier through a sequence.
Edit: Let me expand a bit on why I propose this "cheat": my understanding of the "no database" requirement is that either no database software should be installed to handle it, or that the existing database schema cannot be changed. Using an embedded database is the same thing as adding a new jar file to your project. Why shouldn't you do this? Why implement something yourself when relational databases have already solved this problem for you?
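For instance, a minimal sketch with H2 (the file path and sequence name are just examples; a file-based database is used so the sequence survives JVM restarts, and you would still need to reset or wrap the value daily to stay within 9 digits):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class SequenceIdGenerator {
    private final Connection con;

    public SequenceIdGenerator() throws SQLException {
        // File-based H2 database, so the sequence survives JVM restarts
        con = DriverManager.getConnection("jdbc:h2:./request-ids");
        try (Statement st = con.createStatement()) {
            st.execute("CREATE SEQUENCE IF NOT EXISTS req_seq");
        }
    }

    public long nextId() throws SQLException {
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT NEXT VALUE FOR req_seq")) {
            rs.next();
            return rs.getLong(1);
        }
    }
}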
Nine digits to handle 1,000,000 IDs gives us three digits to play with (we need the other six for the 0-999,999 range of the ID itself).
I assume you have a multi-server setup. Assign each server a three-digit server ID, and then you can allocate unique ID values within each server without worrying about overlap between them. It can just be an ever-increasing value in memory, except to survive JVM restarts, we need to echo the most recently allocated value to disk (well, to anywhere you want to store it — local disk, memcache, whatever).
To ensure you don't hit the overhead of file/whatever I/O on each request, you allocate the IDs in blocks, echoing the endpoint of the block back to the storage.
So it ends up being:
Give each server an ID
Have storage on the server which stores the last allocated value for the day (a file, for instance)
Have the ID allocator work in blocks (10 IDs at a time, 100, whatever)
To allocate a block:
Read the file, write back a number increased by your blocksize
Use IDs from the block
The ID would be, e.g., 012000027 for the 28th ID allocated by server #12 (the server ID in the top three digits, the counter value 27 in the bottom six)
When the day changes (e.g., midnight), throw away your current block and allocate a new one for the new day
In pseudocode:
class IDAllocator {
    Storage storage;
    int serverId;
    int nextId;
    int idCount;
    int blockSize;
    long lastIdTime;

    /**
     * Creates an IDAllocator with the given server ID, backing storage,
     * and block size.
     *
     * @param serverId the ID of the server (e.g., 12)
     * @param s the backing storage to use
     * @param size the block size to use
     * @throws SomeException if something goes wrong
     */
    IDAllocator(int serverId, Storage s, int size)
        throws SomeException {
        // Remember our info
        this.serverId = serverId * 1000000; // Scale by a million to make life easy
        this.storage = s;
        this.nextId = 0;
        this.idCount = 0;
        this.blockSize = size;
        this.lastIdTime = this.getDayMilliseconds();
        // Get the first block. If you like and depending on
        // what container this code is running in, you could
        // spin this out to a separate thread.
        this.getBlock();
    }

    public synchronized int getNextId()
        throws SomeException {
        int id;
        // If we're out of IDs, or if the day has changed (the
        // milliseconds-since-midnight clock has wrapped around
        // to a smaller value), get a new block
        if (idCount == 0 || this.getDayMilliseconds() < this.lastIdTime) {
            this.getBlock();
        }
        // Alloc from the block
        id = this.nextId;
        --this.idCount;
        ++this.nextId;
        // If you wanted (and depending on what container this
        // code is running in), you could proactively retrieve
        // the next block here if you were getting low...
        // Return the ID
        return id + this.serverId;
    }

    protected long getDayMilliseconds() {
        // Milliseconds elapsed since midnight (UTC)
        return System.currentTimeMillis() % 86400000L;
    }

    protected void getBlock()
        throws SomeException {
        int id;
        synchronized (this) {
            synchronized (this.storage.syncRoot()) {
                // (assumes the storage side resets its counter for a
                // new day, per the scheme above)
                id = this.storage.readIntFromStorage();
                this.storage.writeIntToStorage(id + this.blockSize);
            }
            this.nextId = id;
            this.idCount = this.blockSize;
            this.lastIdTime = this.getDayMilliseconds();
        }
    }
}
...but again, that's pseudocode, and you might want to throw some proactive stuff in there so you never block on I/O waiting for an ID when you need one.
The above is written assuming you already have some kind of application-wide singleton, and the IDAllocator instance would just be a data member in that single instance. If not, you could readily make the above a singleton instead, by giving it the classic getInstance method and having it read its configuration from the environment rather than receiving it as arguments to the constructor.
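To make the storage part concrete, here is a minimal sketch of a file-backed implementation matching the calls used in the pseudocode above (readIntFromStorage, writeIntToStorage, and syncRoot come from that sketch, not from a real library; the atomic-move detail is my assumption about how to avoid a corrupt counter on a crash):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

class FileStorage {
    private final Path path;
    private final Object lock = new Object();

    FileStorage(String fileName) {
        this.path = Paths.get(fileName);
    }

    Object syncRoot() {
        return lock;
    }

    int readIntFromStorage() throws IOException {
        if (!Files.exists(path)) {
            return 0; // no block allocated yet today
        }
        String s = new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
        return Integer.parseInt(s.trim());
    }

    void writeIntToStorage(int value) throws IOException {
        // Write to a temp file and atomically move it into place, so a
        // crash mid-write can't leave a corrupt counter behind
        Path tmp = path.resolveSibling(path.getFileName() + ".tmp");
        Files.write(tmp, Integer.toString(value).getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, path, StandardCopyOption.ATOMIC_MOVE,
                StandardCopyOption.REPLACE_EXISTING);
    }
}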
What about counting from 1 to 999,999,999 for server 1,
and counting from -999,999,999 to -1 for server 2?
I guess due to load balancing the split would be about 50:50, so you get the same ID capacity on each server. In addition, you store the last generated ID on your filesystem. To avoid the performance hit of writing on every request, only persist every 1,000th value (or every 10,000th; it doesn't really matter). After restarting your application, read the last persisted value and add 1,000. I guess that would work; see the sketch below.
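A minimal sketch of that checkpointing idea, assuming a local file as the persistent store (the class and file names are illustrative):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class CheckpointedCounter {
    private static final long CHECKPOINT_EVERY = 1_000;
    private final Path file;
    private long next;

    CheckpointedCounter(String fileName) throws IOException {
        this.file = Paths.get(fileName);
        long last = Files.exists(file)
                ? Long.parseLong(new String(Files.readAllBytes(file),
                        StandardCharsets.UTF_8).trim())
                : 0L;
        // Skip ahead by the checkpoint interval so IDs handed out after
        // the last checkpoint (but before a crash) are never reissued
        this.next = last + CHECKPOINT_EVERY;
    }

    synchronized long nextId() throws IOException {
        long id = next++;
        if (id % CHECKPOINT_EVERY == 0) {
            Files.write(file, Long.toString(id).getBytes(StandardCharsets.UTF_8));
        }
        return id;
    }
}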
You could maybe try Apache's RandomStringUtils (String random(int count, boolean letters, boolean numbers)), or try the Java TRNG Client library, which in turn makes use of RANDOM.ORG:
This library provides a SecureRandom service, integrated with the Java Security API, for accessing random.org and random.irb.hr (true random number generators that generate randomness via atmospheric noise or photonic emission).
I think that if you take one of those and combine it with a timestamp, you should get what you are after.
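For the first suggestion, a quick illustration (assuming Commons Lang 3, hence the org.apache.commons.lang3 import; note this gives you randomness, not guaranteed uniqueness, so you would still have to detect collisions within the day):

import org.apache.commons.lang3.RandomStringUtils;

public class RandomIdDemo {
    public static void main(String[] args) {
        // letters=false, numbers=true: digits only
        String nineDigits = RandomStringUtils.random(9, false, true);
        System.out.println(nineDigits); // e.g. "482915307"
    }
}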
I am not able to understand the concept of groupBy/groupByKey and windowing in Kafka Streams. My goal is to aggregate stream data over some time period (e.g., 5 seconds). My streaming data looks something like:
{"value":0,"time":1533875665509}
{"value":10,"time":1533875667511}
{"value":8,"time":1533875669512}
The time is in milliseconds (epoch). Here my timestamp is in my message, not in the key, and I want to average the value over a 5-second window.
Here is the code that I am trying, but I can't seem to get it to work:
builder.<String, String>stream("my_topic")
    .map((key, val) -> {
        TimeVal tv = TimeVal.fromJson(val);
        return new KeyValue<Long, Double>(tv.time, tv.value);
    })
    .groupByKey(Serialized.with(Serdes.Long(), Serdes.Double()))
    .windowedBy(TimeWindows.of(5000))
    .count()
    .toStream()
    .foreach((key, val) -> System.out.println(key + " " + val));
This code does not print anything even though the topic is generating messages every two seconds. When I press Ctrl+C, it prints something like
[1533877059029@1533877055000/1533877060000] 1
[1533877061031@1533877060000/1533877065000] 1
[1533877063034@1533877060000/1533877065000] 1
[1533877065035@1533877065000/1533877070000] 1
[1533877067039@1533877065000/1533877070000] 1
This output does not make sense to me.
Related code:
public class MessageTimeExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long previousTimestamp) {
        String str = (String) record.value();
        TimeVal tv = TimeVal.fromJson(str);
        return tv.time;
    }
}
public class TimeVal {
    final public long time;
    final public double value;

    public TimeVal(long tm, double val) {
        this.time = tm;
        this.value = val;
    }

    public static TimeVal fromJson(String val) {
        Gson gson = new GsonBuilder().create();
        TimeVal tv = gson.fromJson(val, TimeVal.class);
        return tv;
    }
}
Questions:
Why do you need to pass a serializer/deserializer to groupBy? Some of the overloads also take a ValueStore; what is that? When grouped, how does the data look in the grouped stream?
How is the windowed stream related to the grouped stream?
I was expecting the above to print in a streaming way, that is, buffer for every 5 seconds, then count, then print. It only prints once I press Ctrl+C on the command prompt, i.e., it prints and then exits.
It seems you don't have keys in your input data (correct me if this is wrong), and it further seems that you want to do a global aggregation?
In general, grouping is for splitting a stream into sub-streams. Those sub-streams are built by key (i.e., one logical sub-stream per key). In your code snippet, you set your timestamp as the key and thus generate one sub-stream per timestamp. I assume this is not intended.
If you want to do a global aggregation, you will need to map all records to a single sub-stream, i.e., assign the same key to all records in groupBy() (see the sketch below). Note that global aggregations don't scale, as the aggregation must be computed by a single thread; thus, this will only work for small workloads.
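A hedged sketch of that single-key idea, reusing the topic name, the TimeVal class, and the API style from the question (adjust the serdes and window size to your setup):

builder.<String, String>stream("my_topic")
    .map((key, val) -> {
        TimeVal tv = TimeVal.fromJson(val);
        // same key ("all") for every record -> one single sub-stream
        return new KeyValue<>("all", tv.value);
    })
    .groupByKey(Serialized.with(Serdes.String(), Serdes.Double()))
    .windowedBy(TimeWindows.of(5000))
    .count()
    .toStream()
    .foreach((key, val) -> System.out.println(key + " " + val));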
Windowing is applied to each generated sub-stream to build the windows, and the aggregation is computed per window. The windows are built based on the timestamp returned by the TimestampExtractor. It seems you already have an implementation that extracts the timestamp from the value for this purpose.
This code does not print anything even though the topic is generating messages every two seconds. When I press Ctrl+C then it prints something like
By default, Kafka Streams uses some internal caching, and the cache is flushed on commit -- this happens every 30 seconds by default, or when you stop your application. You would need to disable caching to see results earlier (cf. https://docs.confluent.io/current/streams/developer-guide/memory-mgmt.html).
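For example, one way to disable the record cache entirely (a sketch; the application ID and bootstrap servers are placeholders to be set alongside your other configs):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// A cache size of 0 disables caching, so every update is forwarded downstream
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);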
Why do you need to pass serializer/deserializer to group by.
Because data needs to be redistributed, and this happens via a topic in Kafka. Note that Kafka Streams is built for a distributed setup, with multiple instances of the same application running in parallel to scale out horizontally.
Btw: you might also be interested in this blog post about the execution model of Kafka Streams: https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/
It seems like you misunderstand the nature of the windowing DSL.
It works on the message timestamps handled by the Kafka platform, not on arbitrary properties in your specific message type that encode time information. Also, this window does not group into fixed intervals; it is a sliding window, meaning any aggregation you get is for the last 5 seconds before the current message.
Also, you need the same key for all elements that should be combined into the same group, for example null. In your example, the key is a timestamp, which is essentially unique per entry, so there will be only a single element in each group.
I am developing a system which loads a huge CSV file (more than 1 million lines) and saves it into a database. Every line also has more than one thousand fields. A CSV file is considered one batch, and each line is considered a child object of it. During the process of adding objects, every object is kept in the List of a single batch, and at some point I run out of memory because the List ends up holding more than 1 million objects. I cannot split the file into N parts, since there are dependencies between lines which are not in serial order (any line can have a dependency on other lines).
Following is the general logic:
Batch batch = new Batch();
while (csvLine != null) {
    String[] values = csvLine.split(",", -1);

    Transaction txn = new Transaction();
    txn.setType(values[0]);
    txn.setAmount(values[1]);
    /*
     * There are more than one thousand transaction fields in one line
     */
    batch.addTransaction(txn);

    csvLine = reader.readLine(); // advance: reader is whatever supplies the lines
}
batch.save();
Is there any way to handle this kind of situation on a server with low memory?
In the old times, we used to process large quantities of data stored on sequential tapes with little memory and disk. But it took a looong time!
Basically, you build a buffer of lines that fits in your memory, scan the whole file to resolve the dependencies of those lines, and fully process them. Then you iterate on the next buffer until you have processed the whole file. This requires a full read of the file per buffer, but saves memory; a skeleton follows below.
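A rough skeleton of that multi-pass approach; the dependency resolution is application-specific, so resolveDependencies and process here are hypothetical helpers:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

static void processInPasses(Path file, int linesPerPass) throws IOException {
    long alreadyProcessed = 0;
    boolean more = true;
    while (more) {
        List<String> buffer = new ArrayList<>(linesPerPass);
        try (BufferedReader r = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            // Skip the lines fully processed in earlier passes
            for (long i = 0; i < alreadyProcessed; i++) {
                r.readLine();
            }
            // Fill the buffer with the next chunk of lines
            String line;
            while (buffer.size() < linesPerPass && (line = r.readLine()) != null) {
                buffer.add(line);
            }
        }
        // Re-read the whole file to resolve the buffered lines'
        // dependencies, then process the buffer (application-specific)
        resolveDependencies(file, buffer); // hypothetical helper
        process(buffer);                   // hypothetical helper
        alreadyProcessed += buffer.size();
        more = !buffer.isEmpty();
    }
}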
There may be another problem here, because you want to store all records in a single batch. The batch will require enough memory to store all the records, so here again you risk exhausting memory. But you can again use the good old methods, and save several batches of smaller size.
If you want to make sure that everything will either be fully inserted into the database or fully rejected, you can simply use a transaction (see the JDBC sketch after this list):
declare the transaction at the beginning of your job
save all your batches inside this single transaction
commit the transaction when everything is done
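A minimal JDBC sketch of those three steps (the connection details and the per-batch save method are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.List;

static void saveAll(List<Batch> batches, String url, String user, String pass)
        throws SQLException {
    try (Connection con = DriverManager.getConnection(url, user, pass)) {
        con.setAutoCommit(false); // declare the job-wide transaction
        try {
            for (Batch b : batches) {
                b.save(con); // hypothetical: each batch inserts its rows via con
            }
            con.commit(); // everything done: make it all visible at once
        } catch (SQLException e) {
            con.rollback(); // any failure: reject everything
            throw e;
        }
    }
}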
Professional-grade databases (MySQL, PostgreSQL, Oracle, etc.) can use rollback segments on disk to process one transaction without exhausting memory. Of course it is far slower than in-memory operations (not to mention if for any reason you have to roll back such a transaction!), but at least it works unless you exhaust the available physical disk...
Dedicate a separate database table just for the CSV import. Maybe with additional fields for those cross-references you mentioned.
If you need to analyze CSV fields in Java, limit the number of distinct String instances by caching:
import java.util.HashMap;
import java.util.Map;

public class SharedStrings {
    private Map<String, String> sharedStrings = new HashMap<>();

    public String share(String s) {
        if (s.length() <= 15) {
            String t = sharedStrings.putIfAbsent(s, s); // Since Java 8
            if (t != null) {
                s = t;
            }
            /*
            // Older Java:
            String t = sharedStrings.get(s);
            if (t == null) {
                sharedStrings.put(s, s);
            } else {
                s = t;
            }
            */
        }
        return s;
    }
}
In your case, with long records, it might even make sense to compress each line you read, as bytes, into a shorter byte array with a GZIPOutputStream.
But then a database seems more logical.
The following will possibly not apply if you are using all fields of a csvLine.
In older JVMs (before Java 7u6), String#split uses String#substring, which does not create a new string but keeps the original string in memory and references the respective portion.
So this code would keep the original string in memory:
String a = "...very long and comma separated";
String[] split = a.split(",");
String b = split[1];
a = null;
So if you are not using all of the data in the csvLine, you should wrap every entry of values in a new String, i.e., in the above example you would do
String b = new String(split[1]);
otherwise the GC is unable to free string a.
I ran into this while I was extracting one column of a CSV file with millions of lines.
I have a piece of logging- and tracing-related code which is called often throughout the codebase, especially when tracing is switched on. A StringBuilder is used to build each String. The strings have a reasonable maximum length, I suppose on the order of hundreds of characters.
Question: Is there an existing library to do something like this:
// In reality, StringBuilder is final; you would have to create a
// delegating version instead, which is quite a big class because
// of all the append() overloads.
public class SmarterBuilder extends StringBuilder {
    private final AtomicInteger capRef;

    SmarterBuilder(AtomicInteger capRef) {
        // optionally save memory at the expense of worst-case resizes:
        // super(capRef.get() * 3 / 4);
        super(capRef.get());
        this.capRef = capRef;
    }

    public void syncCap() {
        // call when the string is fully built
        int cap;
        do {
            cap = capRef.get();
            if (cap >= length()) break;
        } while (!capRef.compareAndSet(cap, length()));
    }
}
To take advantage of this, my logging-related class would have a shared capRef variable with a suitable scope.
(Bonus question: I'm curious, is it possible to do syncCap() without looping?)
Motivation: I know the default capacity of StringBuilder is always too little. I could (and currently do) throw in an ad-hoc initial capacity value of 100, which still results in a resize in some number of cases, but not always. However, I do not like magic numbers in the source code, and this feature is a case of "optimize once, use in every project".
Make sure you do performance measurements to confirm that you really are getting some benefit for the extra work.
As an alternative to a StringBuilder-like class, consider a StringBuilderFactory. It could provide two static methods: one to get a StringBuilder, and another to be called when you finish building a string. You would pass the finished StringBuilder as an argument, and the factory would record its length. The getStringBuilder method would use the statistics recorded by the other method to choose the initial size. A sketch follows below.
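A minimal sketch of that factory (the names are illustrative, not an existing library; this variant simply tracks the maximum observed length):

import java.util.concurrent.atomic.AtomicInteger;

public class StringBuilderFactory {
    // Start from a modest default; grows as longer strings are observed
    private static final AtomicInteger observedMax = new AtomicInteger(100);

    public static StringBuilder getStringBuilder() {
        return new StringBuilder(observedMax.get());
    }

    public static void finished(StringBuilder sb) {
        int len = sb.length();
        int cur;
        // CAS loop: record len only if it exceeds the current maximum
        while ((cur = observedMax.get()) < len
                && !observedMax.compareAndSet(cur, len)) {
            // lost a race; re-read and retry
        }
    }
}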
There are two ways you could avoid looping in syncCap:
Synchronize.
Ignore failures.
The argument for ignoring failures in this situation is that you only need a random sampling of the actual lengths. If another thread updates the value at the same time, you are still getting an up-to-date view of the string lengths anyway.
You could store the length of each string in a statistics array, run your app, and at shutdown take the 90th percentile of your string lengths (sort all length values and take the value at array position sortedLengths.size() * 0.9).
That way you get an initial StringBuilder size that 90% of your strings will fit into; see the sketch below.
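A quick sketch of that percentile calculation (assuming the lengths were collected in a list during the run):

import java.util.Collections;
import java.util.List;

static int percentile90(List<Integer> lengths) {
    Collections.sort(lengths);
    // index at 90% of the sorted list, clamped to the last element
    int idx = Math.min((int) (lengths.size() * 0.9), lengths.size() - 1);
    return lengths.get(idx);
}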
Update
The value could be hard-coded (like Java does with the value 10 for ArrayList), or read from a config file, or calculated automatically in a test phase. But the percentile calculation is not free, so it's best to run your project for some time, measure the 90th percentile on the fly inside the SmartBuilder, output it from time to time, and later change the property file to use that value.
That way you would get optimal results for each project.
Or, going one step further: let your smart builder update that value in the config file from time to time.
But all this is not worth the effort; you would only do it for data with millions of entries, like digital road maps, etc.
I have a requirement in which I continuously receive messages that need to be written to files. Every time a new message is received, it needs to be written to a separate file. What I want is to generate a unique identifier to use as the file name, while also preserving the order of the messages. By this I mean the generated identifiers should always be incremental.
I was using UUID.randomUUID() to generate file names, but the problem with this approach is that UUID only assures randomness of the identifier, not incremental order. As a result I am losing the ordering of the files (I want the file generated first to appear first in the list).
Approaches known:
1. Use System.currentTimeMillis(), but I can receive multiple messages with the same timestamp.
2. Use a static long value and increment it whenever a file is to be created, and use that long value as the file name. But I am not sure about this approach; it doesn't seem to be a proper solution to my problem. I think there could be far better solutions than this one.
If someone could suggest a better solution to this problem, it would be highly appreciated.
If you want your ID values to rise monotonically even between server restarts, then you must either base them on the system time or have some elaborately robust logic that persists the last ID used. Note that achieving robustness on its own is not hard, but achieving it in a performant and scalable way is.
If you additionally need the ID to be unique across multiple nodes in a redundant server cluster, then you need even more elaborate logic, which definitely involves a persistent store to which all the boxes synchronize access. Making this performant is, of course, even harder.
The best option I can see is to make the ID quite long so there's room for these parts:
System.currentTimeMillis() for long-term uniqueness (across restarts);
System.nanoTime() for finer granularity;
a unique ID for each server node (determined in a platform-specific way).
The method will still have to remember the last value generated and retry in case of a duplicate. It won't have to retry many times, though, just until the next nanoTime clock tick; it could even busy-wait for it.
Sketch of code without point 3 (single-node implementation):
private static long lastNanos;

public static synchronized String uniqueId() {
    for (;/*ever*/;) {
        final long n = System.nanoTime();
        if (n == lastNanos) continue;
        lastNanos = n;
        return "" + System.currentTimeMillis() + n;
    }
}
OK, hands up. My last answer was fairly flaky and I've deleted it.
Keeping with the spirit of the site, I thought I'd try a different tack.
If you are keeping these messages in a single file, then you could try creating a unique ID out of the size of the file.
Before you write a message to the file, its ID could be the current size of the file.
You could use the filename + size as the ID if these messages need to be unique across a number of files.
I'll leave the hot potato of synchronization to another day. But you could wrap all of this up in a synchronized object that keeps track of things.
Also, I am assuming that any messages written to the file will not be removed in the future.
ADDITIONAL NOTE:
You could create a message-processing object that opens the file on construction (or via a create method).
This object gets the initial size of the file, and this is used as the first unique ID.
As each message is added (in a synchronized manner), the ID is incremented by the size of the message.
This addresses the performance issues. It will not work if more than one JVM/node accesses the same file.
Skeletal Idea:
public class MessageSink {
    private long id = 0;

    public MessageSink(String filename) {
        // Initial unique ID is the current size of the file
        id = new java.io.File(filename).length();
    }

    public synchronized void addMessage(Message msg) {
        msg.setId(id);
        // ... write to file + flush ...
        // ... or add to a stack of messages that need to be
        // ... written to the file at a later stage.
        id = id + msg.getSize();
    }

    public void flushMessages() {
        // ... open file
        // ... for each message in the stack, write ...
        // ... flush and close file
    }
}
I had the same requirement and found a suitable solution: Twitter Snowflake uses a simple algorithm to generate sortable 64-bit (long) IDs. Snowflake is written in Scala, but the approach is simple and can easily be reproduced in Java.
The ID is composed of:
timestamp - 41 bits (millisecond precision with a custom epoch gives us 69 years);
machine ID - 10 bits (the MAC address could be used as a hardware ID);
sequence number - 12 bits - rolls over every 4096 IDs per machine (with protection to avoid rollover in the same ms).
The formula looks like: ((timestamp - customEpoch) << timestampShift) | (machineId << machineIdShift) | sequenceNumber;
The shift for each component depends on its bit position in the ID. A sketch follows below.
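A hedged, minimal Java sketch of the layout described above (the custom epoch here is Twitter's original one, used purely for illustration; a clock moving backwards is not handled; see the linked repositories for production-ready code):

public class SnowflakeSketch {
    private static final long CUSTOM_EPOCH = 1288834974657L; // illustrative
    private static final long MACHINE_ID_BITS = 10;
    private static final long SEQUENCE_BITS = 12;
    private static final long MACHINE_ID_SHIFT = SEQUENCE_BITS;
    private static final long TIMESTAMP_SHIFT = SEQUENCE_BITS + MACHINE_ID_BITS;
    private static final long SEQUENCE_MASK = (1L << SEQUENCE_BITS) - 1;

    private final long machineId; // 0..1023
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public SnowflakeSketch(long machineId) {
        this.machineId = machineId;
    }

    public synchronized long nextId() {
        long ts = System.currentTimeMillis();
        if (ts == lastTimestamp) {
            sequence = (sequence + 1) & SEQUENCE_MASK;
            if (sequence == 0) {
                // rolled over within the same millisecond: wait for the next one
                while ((ts = System.currentTimeMillis()) <= lastTimestamp) {
                    // spin
                }
            }
        } else {
            sequence = 0;
        }
        lastTimestamp = ts;
        return ((ts - CUSTOM_EPOCH) << TIMESTAMP_SHIFT)
                | (machineId << MACHINE_ID_SHIFT)
                | sequence;
    }
}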
A detailed description and source code can be found on GitHub:
Twitter Snowflake
Basic Java implementation of the Snowflake algorithm
Summary: I'm developing a persistent Java web application, and I need to make sure that all resources I persist have globally unique identifiers to prevent duplicates.
The Fine Print:
I'm not using an RDBMS, so I don't have any fancy sequence generators (such as the one provided by Oracle)
I'd like it to be fast, preferably all in memory - I'd rather not have to open up a file and increment some value
It needs to be thread safe (I'm anticipating that only one JVM at a time will need to generate IDs)
There needs to be consistency across instantiations of the JVM. If the server shuts down and starts up, the ID generator shouldn't re-generate the same IDs it generated in previous instantiations (or at least the chance has to be really, really slim - I anticipate many millions of persisted resources)
I have seen the examples in the EJB unique ID pattern article. They won't work for me (I'd rather not rely solely on System.currentTimeMillis() because we'll be persisting multiple resources per millisecond).
I have looked at the answers proposed in this question. My concern about them is, what is the chance that I will get a duplicate ID over time? I'm intrigued by the suggestion to use java.util.UUID for a UUID, but again, the chances of a duplicate need to be infinitesimally small.
I'm using JDK6
Pretty sure UUIDs are "good enough". There are 340,282,366,920,938,463,463,374,607,431,768,211,456 (2^128) possible UUIDs.
http://www.wilybeagle.com/guid_store/guid_explain.htm
"To put these numbers into perspective, one's annual risk of being hit by a meteorite is estimated to be one chance in 17 billion, that means the probability is about 0.00000000006 (6 × 10−11), equivalent to the odds of creating a few tens of trillions of UUIDs in a year and having one duplicate. In other words, only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%. The probability of one duplicate would be about 50% if every person on earth owns 600 million UUIDs"
http://en.wikipedia.org/wiki/Universally_Unique_Identifier
public class UniqueID {
    private static long startTime = System.currentTimeMillis();
    private static long id;

    public static synchronized String getUniqueID() {
        return "id." + startTime + "." + id++;
    }
}
If it needs to be unique per PC: you could probably use (System.currentTimeMillis() << 4) | (staticCounter++ & 15) or something like that.
That would allow you to generate 16 per ms. If you need more, shift by 5 and AND with 31...
If it needs to be unique across multiple PCs, you should also mix in your primary network card's MAC address.
Edit: to clarify
private static int staticCounter = 0;
private static final int nBits = 4;

public long getUnique() {
    // The mask is (2^nBits) - 1; note that ^ is XOR in Java,
    // so build the mask with a shift instead of writing 2^nBits-1
    return (System.currentTimeMillis() << nBits)
            | (staticCounter++ & ((1 << nBits) - 1));
}
and set nBits to the number of bits needed for the largest count you should need to generate per ms (i.e., its base-2 logarithm rounded up, not its square root).
It will eventually roll over. Probably 20 years or something with nBits at 4.
From memory, the RMI remote packages contain a UID generator (java.rmi.server.UID); I don't know whether that's worth looking into.
When I've had to generate them, I typically use an MD5 hash of the current date and time, the user name, and the IP address of the computer. Basically the idea is to take everything that you can find out about the computer/person and then generate an MD5 hash of that information.
It works really well and is incredibly fast (once you've initialized the MessageDigest for the first time). A sketch follows below.
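A small sketch of that idea (the choice of inputs is illustrative; note that two calls in the same millisecond with the same inputs would collide, so you may want to mix in a counter as well):

import java.net.InetAddress;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class Md5IdDemo {
    public static void main(String[] args) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        String input = System.currentTimeMillis()
                + System.getProperty("user.name")
                + InetAddress.getLocalHost().getHostAddress();
        byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b)); // two hex chars per byte
        }
        System.out.println(hex); // 32-character hex ID
    }
}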
Why not do something like this?
Random rnd = new Random();
String id = Long.toString(System.currentTimeMillis())
        + rnd.nextInt(1000)
        + rnd.nextInt(1000);
If you want a shorter and faster implementation than Java's UUID, take a look at:
https://code.google.com/p/spf4j/source/browse/trunk/spf4j-core/src/main/java/org/spf4j/concurrent/UIDGenerator.java
See the implementation choices and limitations in the javadoc.
Here is a unit test showing how to use it:
https://code.google.com/p/spf4j/source/browse/trunk/spf4j-core/src/test/java/org/spf4j/concurrent/UIDGeneratorTest.java