Logging a very large string in Java

I'm using logback and I need to log all data queried by clients to a log file. All the data queried by clients needs to be logged to the same file. The logging process simply looks like this:
private static final Logger OUTPUTFILELOGGER = Logger.getLogger(...);

String outputString = null;
try {
    Map<String, Object> outputMap = doService(); // queries the data requested by clients
    ....                                         // do something after the business logic
    outputString = outputMap.toString();         // critical!!
} catch (Throwable e) {
    // handle the exception
} finally {
    OUTPUTFILELOGGER.info(outputString);
}
It usually works fine, but sometimes the call to toString() on outputMap throws an OutOfMemoryError when the requested data is too big to build into a single string.
So I want to do the logging in a streaming fashion, without hurting performance, but I don't know how to do that effectively and gracefully.
Any ideas?

Loop through the map so that you're only working with a small part at a time:
LOGGER.info("Map contains:")
map.forEach( (key, value) -> LOGGER.info("{}: {}", key, value));
(Assumes Java 8 and SLF4J)
However, if the map is big enough for the code you've given to generate OOMs, you should probably consider whether it's appropriate to log it in such detail, or whether your service ought to be capping the response size.
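If you do keep logging the contents, a minimal sketch of logging only a bounded amount of detail might look like this (assuming SLF4J as above; the MAX_ENTRIES limit and the helper class are made up for illustration and would need tuning):

import java.util.Map;
import org.slf4j.Logger;

// Hypothetical helper: logs the entry count plus at most MAX_ENTRIES entries,
// so no single huge string is ever built.
class CappedMapLogger {
    private static final int MAX_ENTRIES = 1000; // assumed limit, tune as needed

    static void logCapped(Logger logger, Map<String, Object> outputMap) {
        logger.info("Map contains {} entries; logging at most {}:", outputMap.size(), MAX_ENTRIES);
        outputMap.entrySet().stream()
                 .limit(MAX_ENTRIES)
                 .forEach(e -> logger.info("{}: {}", e.getKey(), e.getValue()));
    }
}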

Related

How to manually read data from Flink's checkpoint file and keep it in Java memory

We need to read data from our checkpoints manually for various reasons (say we need to change our state object/class structure, so we want to read, restore and copy the data into a new type of object).
Reading works fine, but when we try to keep/store the data in memory and deploy to the Flink cluster we end up with an empty list/map. In the logs we see that we are reading and adding all our data properly to the list/map, but as soon as our method completes its work the data is lost and the list/map is empty :(
val env = ExecutionEnvironment.getExecutionEnvironment();
val savepoint = Savepoint.load(env, checkpointSavepointLocation, new HashMapStateBackend());
private List<KeyedAssetTagWithConfig> keyedAssetsTagWithConfigs = new ArrayList<>();
val keyedStateReaderFunction = new KeyedStateReaderFunctionImpl();
savepoint.readKeyedState("my-uuid", keyedStateReaderFunction)
.setParallelism(1)
.output(new MyLocalCollectionOutputFormat<>(keyedAssetsTagWithConfigs));
env.execute("MyJobName");
private static class KeyedStateReaderFunctionImpl extends KeyedStateReaderFunction<String, KeyedAssetTagWithConfig> {
private MapState<String, KeyedAssetTagWithConfig> liveTagsValues;
private Map<String, KeyedAssetTagWithConfig> keyToValues = new ConcurrentHashMap<>();
@Override
public void open(final Configuration parameters) throws Exception {
liveTagsValues = getRuntimeContext().getMapState(ExpressionsProcessor.liveTagsValuesStateDescriptor);
}
@Override
public void readKey(final String key, final Context ctx, final Collector<KeyedAssetTagWithConfig> out) throws Exception {
liveTagsValues.iterator().forEachRemaining(entry -> {
keyToValues.put(entry.getKey(), entry.getValue());
log.info("key {} -> {} val", entry.getKey(), entry.getValue());
out.collect(entry.getValue());
});
}
public Map<String, KeyedAssetTagWithConfig> getKeyToValues() {
return keyToValues;
}
}
As soon as this code executes, I expect to have all values inside the map returned by keyedStateReaderFunction.getKeyToValues(). But it returns an empty map, even though I can see in the logs that we are reading all entries properly. The keyedAssetsTagWithConfigs list we collect the output into is empty as well.
If anyone has any idea it would be very helpful, because I'm lost; I've never had the experience of putting data into a map and then losing it :) When I serialize my map or list, write it to a text file and then deserialize it from there (using Jackson), I can see my data exists, but that's not a solution, just a workaround.
Thanks in advance.
The code you show creates and submits a Flink job to be executed in its own environment orchestrated by the Flink framework: https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/flink-architecture/#flink-application-execution
The job runs independently of the code that builds and submits it, so when you call keyedStateReaderFunction.getKeyToValues() you are calling the method on the object that was used to build the job, not on the actual object that ran in the Flink execution environment.
Your workaround seems like a valid option to me. You can then submit the file with your savepoint contents to your new job to recreate its state as you'd like.
You have an instance of KeyedStateReaderFunctionImpl in the Flink client which gets serialized and sent to each task manager. Each task manager then deserializes a copy of that KeyedStateReaderFunctionImpl and calls its open and readKey methods, and gradually builds up a private Map containing its share of the data extracted from the savepoint/checkpoint.
Meanwhile the original KeyedStateReaderFunctionImpl back in the Flink client has never had its open or readKey methods called, and doesn't hold any data.
In your case the parallelism is one, so there is only one task manager, but in general you will need to collect the output from each task manager and assemble the complete results from these pieces. These results are not available in the Flink client process because the work hasn't been done there.
I found a solution: start the job in attached mode and collect the results in the main thread.
val env = ExecutionEnvironment.getExecutionEnvironment();
val configuration = env.getConfiguration();
configuration
.setBoolean(DeploymentOptions.ATTACHED, true);
...
val myresults = dataSource.collect();
Hope this helps somebody else, because I wasted a couple of days trying to find a solution.
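For reference, here is a hedged sketch of how that attached-mode approach might fit together end to end, based on the DataSet-style State Processor API. The uid, state backend and class names come from the question; the surrounding class and the getConfiguration() call mirror the answer above and are illustrative only:

import java.util.List;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.DeploymentOptions;
import org.apache.flink.runtime.state.hashmap.HashMapStateBackend;
import org.apache.flink.state.api.ExistingSavepoint;
import org.apache.flink.state.api.Savepoint;

public class ReadSavepointAttached {
    public static void main(String[] args) throws Exception {
        String checkpointSavepointLocation = args[0]; // path to the savepoint/checkpoint

        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        env.getConfiguration().setBoolean(DeploymentOptions.ATTACHED, true);

        ExistingSavepoint savepoint =
                Savepoint.load(env, checkpointSavepointLocation, new HashMapStateBackend());

        // collect() triggers execution and ships every task manager's output back
        // to this client process, so the data ends up here rather than in remote copies.
        List<KeyedAssetTagWithConfig> values = savepoint
                .readKeyedState("my-uuid", new KeyedStateReaderFunctionImpl())
                .collect();

        System.out.println("Read " + values.size() + " entries from the savepoint");
    }
}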

Making a static method synchronized or not

I have a web service call to get an authorization token, which is then used for subsequent web service calls. What we had done earlier was that whenever we make any web service call, we first call the token web service and then make the actual web service call.
The method to get the token is shown below. Basically, this code calls the web service to get the token and uses Gson to parse the response and extract the ticket.
public static String getAuthTicket() {
String authTicket = null;
HttpResponse httpResponse = getAuthResponse();
String body;
if (httpResponse.getStatusLine().getStatusCode() == 200) {
try {
body = IOUtils.toString(httpResponse.getEntity().getContent());
Gson gson = new GsonBuilder().disableHtmlEscaping().create();
ResponseTicket responseTicket = gson.fromJson(body, ResponseTicket.class);
authTicket = responseTicket.getTicket();
} catch (UnsupportedOperationException e) {
LOGGER.error("UnsupportedOperationException : ",e);
} catch (IOException e) {
LOGGER.error("IO Exception : ",e);
}
}
return authTicket;
}
This has obviously led to a performance issue. Hence the party providing the token web service has made the token valid for 30 minutes.
So in the above method we are thinking of putting the token in a cache along with its timestamp, and checking whether the current time minus the cached time is less than 30 minutes. If it is greater than 30 minutes, we will call the service to get a new token and update the token and its timestamp in the cache.
The only thing I am worried about is synchronization, so that I don't get a corrupt auth token due to a race condition.
I am thinking of making this static method synchronized. Do you think there is a better way?
The answer is: it depends.
Race conditions occur when more than one thread accesses shared data at the same point in time. So, if you had code such as:
private static final Map<X, Y> sharedCache = new HashMap<>();

public static String getAuthTicket() {
    if (!sharedCache.containsKey...) {
        sharedCache.put(...
        ...
You would be subject to a race condition: two threads could come in at the same time and update that shared map at the very same moment, leading to all kinds of problems.
If I understand your code correctly, you have something like this:
private static String cachedToken = null;
public static String getAuthTicket() {
if (cachedToken == null || isTooOld(cachedToken)) {
cachedToken = getAuthTicketForReal();
}
return cachedToken;
}
You probably do not want that two threads call getAuthTicketForReal() in parallel.
So, yes, making that method synchronized is a valid approach.
The real question is: is it sufficient to add that keyword? Given my code, the answer is yes. You simply want to avoid this cache being set up "in parallel" by more than one thread.
Finally, in case you are worried about the performance impact of using synchronized here: simply forget about it. You are talking about a multi-second, network-based operation, so you should not worry about the millisecond of overhead that synchronized might add (I'm making that number up; the key point is that it is so small it doesn't matter in the context of the operation you are doing).
Regarding your comment: of course, using synchronized means that the JVM will serialize calls to that method. This means that when the method needs 1 minute to return, any other calls to it will block for that minute.
In that sense, it might be a good exercise to look into ways of writing this method that do not require synchronized at the method level, for example by using data structures that can cope with multiple threads manipulating them; see the sketch below.
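To illustrate that last point, here is a minimal sketch (not the poster's code) that keeps the token and its fetch time in one immutable holder behind an AtomicReference, so no method-level synchronized is needed. The 30-minute validity window and the getAuthTicketForReal() stub are assumptions taken from the discussion above:

import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;

// A sketch only: token plus fetch time live in one immutable holder, so readers
// never observe a half-updated pair.
final class TokenCache {
    private static final Duration VALIDITY = Duration.ofMinutes(30); // assumed from the question
    private static final AtomicReference<Entry> CACHE = new AtomicReference<>();

    private static final class Entry {
        final String ticket;
        final Instant fetchedAt;
        Entry(String ticket, Instant fetchedAt) {
            this.ticket = ticket;
            this.fetchedAt = fetchedAt;
        }
    }

    static String getAuthTicket() {
        Entry current = CACHE.get();
        if (current != null && current.fetchedAt.plus(VALIDITY).isAfter(Instant.now())) {
            return current.ticket;
        }
        // Worst case a few threads refresh at the same time; each still returns a valid token.
        Entry refreshed = new Entry(getAuthTicketForReal(), Instant.now());
        CACHE.compareAndSet(current, refreshed);
        return refreshed.ticket;
    }

    // Placeholder for the existing HTTP call that actually fetches the token.
    private static String getAuthTicketForReal() {
        throw new UnsupportedOperationException("call the token web service here");
    }
}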

Can Spark Streaming do Anything Other Than Word Count?

I'm trying to get to grips with Spark Streaming but I'm having difficulty. Despite reading the documentation and analysing the examples, I wish to do something more than a word count on a text file/stream/Kafka queue, which is about the only thing the docs let us understand.
I wish to listen to an incoming Kafka message stream, group messages by key and then process them. The code below is a simplified version of the process: get the stream of messages from Kafka, reduce by key to group messages by message key, then process them.
JavaPairDStream<String, byte[]> groupByKeyList = kafkaStream.reduceByKey((bytes, bytes2) -> bytes);
groupByKeyList.foreachRDD(rdd -> {
List<MyThing> myThingsList = new ArrayList<>();
MyCalculationCode myCalc = new MyCalculationCode();
rdd.foreachPartition(partition -> {
while (partition.hasNext()) {
Tuple2<String, byte[]> keyAndMessage = partition.next();
MyThing aSingleMyThing = MyThing.parseFrom(keyAndMessage._2); //parse from protobuffer format
myThingsList.add(aSingleMyThing);
}
});
List<MyResult> results = myCalc.doTheStuff(myThingsList);
//other code here to write results to file
});
When debugging I see that, inside the while (partition.hasNext()) loop, myThingsList has a different memory address than the List<MyThing> myThingsList declared in the outer foreachRDD.
When List<MyResult> results = myCalc.doTheStuff(myThingsList); is called there are no results, because myThingsList is a different instance of the List.
I'd like a solution to this problem, but would prefer a reference to documentation to help me understand why this is not working (as anticipated) and how I can solve it for myself (I don't just mean a link to the single page of Spark documentation, but a section/paragraph, or preferably a link to Javadoc that doesn't consist of Scala examples with non-functional commented code).
The reason you're seeing different list addresses is that Spark doesn't execute foreachPartition locally on the driver; it has to serialize the function and send it over to the Executor handling the processing of that partition. You have to remember that although working with the code feels like everything runs in a single location, the calculation is actually distributed.
The first problem I see with your code has to do with your reduceByKey, which takes two byte arrays and returns the first; is that really what you want to do? It means you're effectively dropping parts of the data. Perhaps you're looking for combineByKey, which will allow you to return a JavaPairDStream<String, List<byte[]>>.
Regarding the parsing of your protobuf, it looks to me like you don't want foreachRDD; you need an additional map to parse the data:
kafkaStream
.combineByKey(/* implement logic */)
.flatMap(x -> x._2)
.map(proto -> MyThing.parseFrom(proto))
.map(myThing -> myCalc.doStuff(myThing))
.foreachRDD(/* After all the processing, do stuff with result */)
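In case it helps, here is a hedged sketch of what the combineByKey step might look like when collecting the byte arrays per key into a list, assuming kafkaStream is the JavaPairDStream<String, byte[]> from the question; the partition count passed to HashPartitioner is an arbitrary placeholder:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.HashPartitioner;
import org.apache.spark.streaming.api.java.JavaPairDStream;

JavaPairDStream<String, List<byte[]>> grouped = kafkaStream.combineByKey(
        value -> {                       // createCombiner: start a list for the first value of a key
            List<byte[]> list = new ArrayList<>();
            list.add(value);
            return list;
        },
        (list, value) -> {               // mergeValue: add further values of the same key
            list.add(value);
            return list;
        },
        (left, right) -> {               // mergeCombiners: merge lists built on different partitions
            left.addAll(right);
            return left;
        },
        new HashPartitioner(4));         // placeholder partition count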

JT400 - get messages from Queue

I would like to get messages in AS400 from a queue other than one in QSYS.LIB. I am using the following code, which runs well only if I use a queue from within QSYS.LIB:
public String getMessagesFromQsysopr(boolean needReply) {
    String messageStr = "";
    try {
        MessageQueue queue = new MessageQueue(this.as400, "/qsys.lib/qsysopr.msgq");
        // want only inquiry messages
        queue.setSelectMessagesNeedReply(needReply);
        queue.setSelectMessagesNoNeedReply(!needReply);
        queue.setSelectSendersCopyMessagesNeedReply(needReply);
        queue.setListDirection(false);
        Enumeration e = queue.getMessages();
        while (e.hasMoreElements()) {
            QueuedMessage message = (QueuedMessage) e.nextElement();
            messageStr += message.getText() + "\n";
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return messageStr;
}
If I replace /qsys.lib/qsysopr.msgq with any other queue from another library, for example "/yaclib.lib/queueName.msgq", I get the following error:
com.ibm.as400.access.IllegalPathNameException: /yaclib.lib/queueName.msgq: Object not in QSYS file system.
at com.ibm.as400.access.QSYSObjectPathName.parse(QSYSObjectPathName.java:599)
at com.ibm.as400.access.QSYSObjectPathName.<init>(QSYSObjectPathName.java:169)
at com.ibm.as400.access.QSYSObjectPathName.<init>(QSYSObjectPathName.java:177)
at com.ibm.as400.access.MessageQueue.<init>(MessageQueue.java:299)
at br.com.operation.AS400Inspector.getMessagesFromYaclib(AS400Inspector.java:225)
at br.com.operation.Main.main(Main.java:43)
Question 1: What am I doing wrong?
Question 2: Is there any way to limit the messages that don't need a reply? Like getting messages after a specific date, or just the messages from the last 2 days?
Thanks.
@user2338816 is correct.
QSYS is a special library; it actually contains every other library in the system. From a 5250 session, WRKOBJ *ALL *LIB will confirm that every library in the system is in the QSYS library. Interestingly, QSYS itself is contained in QSYS.
When using IFS naming, to refer to a library named YACLIB you need to use /QSYS.LIB/YACLIB.LIB.
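As a hedged sketch, opening a message queue called MYQUEUE in library YACLIB might look like this (the queue name and connection details are placeholders):

import com.ibm.as400.access.AS400;
import com.ibm.as400.access.MessageQueue;
import com.ibm.as400.access.QSYSObjectPathName;

// Build the fully qualified IFS path /QSYS.LIB/YACLIB.LIB/MYQUEUE.MSGQ
// (QSYSObjectPathName.toPath does the formatting for you).
String path = QSYSObjectPathName.toPath("YACLIB", "MYQUEUE", "MSGQ");

AS400 as400 = new AS400("mysystem", "myuser", "mypassword"); // placeholder credentials
MessageQueue queue = new MessageQueue(as400, path);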
As for selecting by date: no, there's no way to do that. If you look at the Javadoc, the closest you'll find is NEW, NEWEST, OLD, OLDEST.

Why does RESTlet sometimes take quite some time to print the XML?

I am implementing REST through RESTlet. This is an amazing framework for building a RESTful web service; it is easy to learn and its syntax is compact. However, I have found that when somebody or some program wants to access a resource, it often takes time to print/output the XML. I use JaxbRepresentation. Let's see my code:
@Override
@Get
public Representation toXml() throws IOException {
if (this.requireAuthentication) {
if (!this.app.authenticate(getRequest(), getResponse()))
{
return new EmptyRepresentation();
}
}
//check if the representation already tried to be requested before
//and therefore the data has been in cache
Object dataInCache = this.app.getCachedData().get(getURI);
if (dataInCache != null) {
System.out.println("Representing from Cache");
//this cast produces an unchecked warning; unless we can check that dataInCache
//is of type T, we cannot get rid of it
this.dataToBeRepresented = (T) dataInCache;
} else {
System.out.println("NOT IN CACHE");
this.dataToBeRepresented = whenDataIsNotInCache();
//automatically add data to cache
this.app.getCachedData().put(getURI, this.dataToBeRepresented, cached_duration);
}
//now represent it (unless the EmptyRepresentation was returned earlier)
JaxbRepresentation<T> jaxb = new JaxbRepresentation<T>(dataToBeRepresented);
jaxb.setFormattedOutput(true);
return jaxb;
}
As you can see (and you might ask me about it), yes, I am implementing a cache through Kitty-Cache. So if some XML is expensive to produce and really looks like it will never change for 7 decades, then I cache it... I also use it for mostly static data. The maximum time a cached entry remains in memory is an hour.
Even when I cache the output, sometimes the output is unresponsive: it hangs, prints partially, and takes time before printing the remaining document. The XML document is accessible through a browser and also through a program, using GET.
What is actually the problem? I would humbly also like to hear an answer from a RESTlet developer, if possible. Thanks.
