AWS Firehose transformation Lambda putting all messages in the same S3 folder - Java

I have a Kinesis stream, and I created a Firehose delivery stream that saves all the data to S3. It was saving correctly into hourly folders. Then I wrote a Firehose transformation Lambda; after deploying it, all the messages go to the same folder, and I am not sure what I am missing. I have the below fields in the response from my Lambda function:
result.put("recordId", record.getRecordId());
result.put("result", "Ok");
result.put("approximateArrivalEpoch", record.getApproximateArrivalEpoch());
result.put("approximateArrivalTimestamp",record.getApproximateArrivalTimestamp());
result.put("kinesisRecordMetadata", record.getKinesisRecordMetadata());
result.put("data", Base64.getEncoder().encodeToString(jsonData.getBytes()));
Edit:
Here is my code in Java. I am using KinesisFirehoseEvent; decoding was not needed in my case, since I get a ByteBuffer from the KinesisFirehoseEvent.
public JSONObject handler(KinesisFirehoseEvent kinesisFirehoseEvent, Context context) {
    final LambdaLogger logger = context.getLogger();
    final JSONArray resultArray = new JSONArray();
    for (final KinesisFirehoseEvent.Record record : kinesisFirehoseEvent.getRecords()) {
        final byte[] data = record.getData().array();
        final Optional<TestData> testData = deserialize(data, logger);
        if (testData.isPresent()) {
            final JSONObject jsonObj = new JSONObject();
            final String jsonData = gson.toJson(testData.get());
            jsonObj.put("recordId", record.getRecordId());
            jsonObj.put("result", "Ok");
            jsonObj.put("approximateArrivalEpoch", record.getApproximateArrivalEpoch());
            jsonObj.put("approximateArrivalTimestamp", record.getApproximateArrivalTimestamp());
            jsonObj.put("kinesisRecordMetadata", record.getKinesisRecordMetadata());
            jsonObj.put("data", Base64.getEncoder().encodeToString(jsonData.getBytes()));
            resultArray.add(jsonObj);
        } else {
            logger.log("testData not deserialized");
        }
    }
    final JSONObject jsonFinalObj = new JSONObject();
    jsonFinalObj.put("records", resultArray);
    return jsonFinalObj;
}

The data your Lambda function returns is not in the correct format.
Check out the example below:
'use strict';
console.log('Loading function');

/* Stock Ticker format parser */
const parser = /^\{\"TICKER_SYMBOL\"\:\"[A-Z]+\"\,\"SECTOR\"\:"[A-Z]+\"\,\"CHANGE\"\:[-.0-9]+\,\"PRICE\"\:[-.0-9]+\}/;

exports.handler = (event, context, callback) => {
    let success = 0; // Number of valid entries found
    let failure = 0; // Number of invalid entries found
    let dropped = 0; // Number of dropped entries

    /* Process the list of records and transform them */
    const output = event.records.map((record) => {
        const entry = (new Buffer(record.data, 'base64')).toString('utf8');
        let match = parser.exec(entry);
        if (match) {
            let parsed_match = JSON.parse(match);
            var milliseconds = new Date().getTime();
            /* Add timestamp and convert to CSV */
            const result = `${milliseconds},${parsed_match.TICKER_SYMBOL},${parsed_match.SECTOR},${parsed_match.CHANGE},${parsed_match.PRICE}` + "\n";
            const payload = (new Buffer(result, 'utf8')).toString('base64');
            if (parsed_match.SECTOR != 'RETAIL') {
                /* Dropped event, notify and leave the record intact */
                dropped++;
                return {
                    recordId: record.recordId,
                    result: 'Dropped',
                    data: record.data,
                };
            } else {
                /* Transformed event */
                success++;
                return {
                    recordId: record.recordId,
                    result: 'Ok',
                    data: payload,
                };
            }
        } else {
            /* Failed event, notify the error and leave the record intact */
            console.log("Failed event : " + record.data);
            failure++;
            return {
                recordId: record.recordId,
                result: 'ProcessingFailed',
                data: record.data,
            };
        }
        /* This transformation is the "identity" transformation, the data is left intact
        return {
            recordId: record.recordId,
            result: 'Ok',
            data: record.data,
        } */
    });
    console.log(`Processing completed. Successful records ${output.length}.`);
    callback(null, { records: output });
};
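For the Java handler in the question, the fix is to return only the three fields Firehose expects per record: recordId, result, and data. The following is a minimal sketch (not the original poster's final code), assuming the same JSONObject/JSONArray style and KinesisFirehoseEvent classes used in the question:

public JSONObject handler(KinesisFirehoseEvent kinesisFirehoseEvent, Context context) {
    final JSONArray resultArray = new JSONArray();
    for (final KinesisFirehoseEvent.Record record : kinesisFirehoseEvent.getRecords()) {
        final JSONObject jsonObj = new JSONObject();
        // Only recordId, result and data are allowed per record; extra fields
        // (arrival timestamps, Kinesis metadata, ...) make the response invalid.
        jsonObj.put("recordId", record.getRecordId());
        jsonObj.put("result", "Ok");
        // identity transformation: re-encode the original payload as base64
        jsonObj.put("data", Base64.getEncoder().encodeToString(record.getData().array()));
        resultArray.add(jsonObj);
    }
    final JSONObject response = new JSONObject();
    response.put("records", resultArray);
    return response;
}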
The documentation below gives more details on the expected return format:
https://aws.amazon.com/blogs/compute/amazon-kinesis-firehose-data-transformation-with-aws-lambda/
Hope it helps.

I got this working using the above code; it just seems the stream is slow, so the data for the new hours had not arrived yet.

Configure multiple base URLs in Karate [duplicate]

I have more than 6 environments against which I have to run the same set of REST API scripts. For that reason, I have stored all the test data and the endpoints/resource paths in a JSON file. I then try to read this JSON file into my karate-config.js file, because I want to fetch the data corresponding to the environment that is passed from the command line (karate.env), which I read in my karate-config.js file.
Below is a sample of my JSON file:
[
    {
        "qa": {
            "username_cm_on": "test_cm_on_qa",
            "password_cm_on": "Test123$",
            "nonadmin_username_cm_on": "test_non_admin_cm_on_qa",
            "nonadmin_password_cm_on": "Test123$",
            "username_cm_off": "test_cm_off_qa",
            "password_cm_off": "Test123$",
            "nonadmin_username_cm_off": "test_non_admin_cm_off_qa",
            "nonadmin_password_cm_off": "Test123$",
            "zuul_urls": {
                "home-sec-uri": "https://qa.abc.com/qa/home-sec-uri",
                "home-res-uri": "https://qa.abc.com/qa/home-res-uri"
            }
        }
    },
    {
        "uat": {
            "username_cm_on": "test_cm_on_uat",
            "password_cm_on": "Test123$",
            "nonadmin_username_cm_on": "test_non_admin_cm_on_uat",
            "nonadmin_password_cm_on": "Test123$",
            "username_cm_off": "test_cm_off_uat",
            "password_cm_off": "Test123$",
            "nonadmin_username_cm_off": "test_non_admin_cm_off_uat",
            "nonadmin_password_cm_off": "Test123$",
            "zuul_urls": {
                "home-sec-uri": "https://uat.abc.com/qa/home-sec-uri",
                "home-res-uri": "https://uat.abc.com/qa/home-res-uri"
            }
        }
    }
]
and below is my karate-config.js file:
function() {
    // var env = karate.env; // get system property 'karate.env'
    var env = 'qa';
    var cm = 'ON';
    var envData = call read('classpath:env_data.json'); // require("./env_data.json");
    // write logic to read data from the json file _ Done, need testing
    karate.log('karate.env system property was:', env);
    switch (env) {
        case "qa":
            if (cm === 'ON') {
                config.adminusername_cm_on = getData().username_cm_on;
                config.adminpassword_cm_on = "";
                config.nonadminusername_cm_on = getData().nonadmin_username_cm_on;
                config.nonadminpassword_cm_on = "";
            } else if (cm === "OFF") {
                config.adminusername_cm_off = getData().username_cm_off;
                config.adminpassword_cm_off = "";
                config.nonadminusername_cm_off = getData().nonadmin_username_cm_off;
                config.nonadminpassword_cm_off = "";
            }
            break;
        case "uat":
            break;
        default:
            break;
    }
    // This method will return the data from the env_data.json file
    var getData = function() {
        for (var i = 0; i < obj.length; i++) {
            for (var e in obj[i]) {
                var username_cm_on = obj[i][e]['username_cm_on'];
                var nonadmin_username_cm_on = obj[i][e]['nonadmin_username_cm_on'];
                var username_cm_off = obj[i][e]['username_cm_off'];
                var nonadmin_username_cm_off = obj[i][e]['nonadmin_username_cm_off'];
                return {
                    username_cm_on: username_cm_on,
                    nonadmin_username_cm_on: nonadmin_username_cm_on,
                    username_cm_off: username_cm_off,
                    nonadmin_username_cm_off: nonadmin_username_cm_off
                }
            }
        }
    }
    var config = {
        env: env,
        data: getData(),
    }
    return config;
}
I tried several ways to load the env_data.json file into karate-config.js, as below:
var envData = call read('classpath:env_data.json');
I know the above is not valid, from this Stack Overflow answer by Peter Thomas: Karate - How to import json data
So I tried the ones below:
var envData = read('classpath:env_data.json');
var envData = require("./env_data.json");
var envData = require('./env_data.json');
But I am still facing issues with reading the JSON file. I would appreciate help on this.
I think you over-complicated your JSON. You just need one object and no top-level array. Just use this as env_data.json:
{
    "qa": {
        "username_cm_on": "test_cm_on_qa",
        "password_cm_on": "Test123$",
        "nonadmin_username_cm_on": "test_non_admin_cm_on_qa",
        "nonadmin_password_cm_on": "Test123$",
        "username_cm_off": "test_cm_off_qa",
        "password_cm_off": "Test123$",
        "nonadmin_username_cm_off": "test_non_admin_cm_off_qa",
        "nonadmin_password_cm_off": "Test123$",
        "zuul_urls": {
            "home-sec-uri": "https://qa.abc.com/qa/home-sec-uri",
            "home-res-uri": "https://qa.abc.com/qa/home-res-uri"
        }
    },
    "uat": {
        "username_cm_on": "test_cm_on_uat",
        "password_cm_on": "Test123$",
        "nonadmin_username_cm_on": "test_non_admin_cm_on_uat",
        "nonadmin_password_cm_on": "Test123$",
        "username_cm_off": "test_cm_off_uat",
        "password_cm_off": "Test123$",
        "nonadmin_username_cm_off": "test_non_admin_cm_off_uat",
        "nonadmin_password_cm_off": "Test123$",
        "zuul_urls": {
            "home-sec-uri": "https://uat.abc.com/qa/home-sec-uri",
            "home-res-uri": "https://uat.abc.com/qa/home-res-uri"
        }
    }
}
And then this karate-config.js will work:
function() {
    var env = 'qa'; // karate.env
    var temp = read('classpath:env_data.json');
    return temp[env];
}
And your tests can be more readable:
Given url zuul_urls['home-sec-uri']
If you have trouble understanding how this works, refer to this answer: https://stackoverflow.com/a/59162760/143475
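As a small extension of the sketch above (my own addition, not part of the original answer), you can fall back to 'qa' only when karate.env is not set, and keep the selected environment name visible to the tests:

function() {
    var env = karate.env || 'qa'; // use the -Dkarate.env system property when provided
    var temp = read('classpath:env_data.json');
    var config = temp[env];
    config.env = env; // expose the selected environment to tests
    return config;
}

The environment can then be selected from the command line, e.g. mvn test -Dkarate.env=uat.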

Need advice regarding a design and performance issue

I have data retrieved from a server at a very high rate. The data is sent in the form of messages that resemble the following format:
$FMSn,par1,par2,...,par20 // where n is a number ranging from 1 to 12
I need to process these messages to parse some data.
Less frequently, the server sends another message in a different format. That message is not important and can be discarded. The difference in format is that the important messages start with $FMS while the other message does not.
To distinguish between these messages and decide which one should be processed, I created the class FMSParser shown below, and I check whether the message header is $FMS or not.
My question is: should I create a new FMSParser object inside the loop in which the messages from the server are received, or create one object for the whole program and, inside the receive loop, just call the isValid() and getParam() methods? In other words, in code: should I choose solution 1 or 2?
solution 1:
loop for messages receiving:
    msg = receiveMessage();
    fmsParser = new FMSParser(msg);
    if (fmsParser.isValid()) {
        params = fmsParser.getParam();
    }
solution 2:
fmsParser = new FMSParser();
loop for messages receiving:
    msg = receiveMessage();
    if (fmsParser.isValid(msg)) {
        params = fmsParser.getParam();
    }
code:
private class FMSParser {
    private final static String HEADER = "$FMS";
    private String[] mSplittedMsg;

    FMSParser() {}

    public boolean isValid(String msg) {
        boolean valid = false;
        this.mSplittedMsg = msg.split(",");
        // valid headers are $FMS1 .. $FMS12
        for (int i = 1; i <= 12; i++) {
            if (mSplittedMsg[0].equals(HEADER + i)) {
                valid = true;
                break;
            }
        }
        return valid;
    }

    public String[] getParam() {
        return this.mSplittedMsg;
    }
}
If you construct a new FMSParser each time through the loop, it will require memory allocation and garbage collection.
I would choose option 3, below, which makes the FMSParser stateless and therefore thread-safe:
FMSParser fmsParser = new FMSParser();
while (messageIterator.hasNext()) {
    String msg = messageIterator.next();
    if (fmsParser.isValid(msg)) {
        params = fmsParser.getParams(msg);
    }
}
E.g.:
public class FMSParser {
    public boolean isValid(String msg) {
        return msg.startsWith("$FMS");
    }

    public String[] getParams(String msg) {
        return msg.split(",");
    }
}
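Note that startsWith("$FMS") also accepts headers outside the $FMS1..$FMS12 range described in the question. If that range matters, a stricter check is possible; this is a hypothetical variant of isValid, not part of the original answer:

public boolean isValid(String msg) {
    if (!msg.startsWith("$FMS")) {
        return false;
    }
    // the header runs up to the first comma, e.g. "$FMS7,..."
    int comma = msg.indexOf(',');
    String suffix = (comma >= 0) ? msg.substring(4, comma) : msg.substring(4);
    try {
        int n = Integer.parseInt(suffix);
        return n >= 1 && n <= 12; // only $FMS1 .. $FMS12 are valid
    } catch (NumberFormatException e) {
        return false;
    }
}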

How to determine the message type in protobuf so that I can use Type.parseFrom(byte[])

I am trying to send protobuf data from the C++ side to the Java side.
I have multiple message types defined in my .proto file.
On the C++ side, I have an enum for every message type, and I am adding it to the buffer output as follows:
uint8_t* __temp = (uint8_t*)(buf);
*__temp++ = (type) >> 8;
*__temp = (type) & 0x00FF;
How do I get this 'type' that I added to the buffer, so that on the Java side I can achieve something like:
MessageType parseFrom(byte[] data);
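For reference, the two-byte type prefix written by the C++ snippet above could be read back on the Java side as sketched below. This only mirrors the byte order the snippet writes (high byte first); the MESSAGE1_TYPE constant and the dispatch are hypothetical:

// given the received bytes in data (2-byte type prefix + protobuf payload):
int type = ((data[0] & 0xFF) << 8) | (data[1] & 0xFF); // reassemble the 16-bit type
byte[] payload = java.util.Arrays.copyOfRange(data, 2, data.length);
// then dispatch on 'type' to the matching generated parser, e.g.:
// if (type == MESSAGE1_TYPE) { Message1 m = Message1.parseFrom(payload); }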
It is not clear what the exact requirement is, but I assume you are trying to send different types of messages and the receiver should be able to parse the correct object out of the received bytes. This can be done as shown in the example below:
message Message1 {
    required string a = 1;
    required string b = 2;
}

message Message2 {
    required int64 id = 1;
    required string data = 2;
}

message WrapperMessage {
    required int64 commonField = 1;
    oneof msg {
        Message1 m1 = 2;
        Message2 m2 = 3;
    }
}
Basically, a WrapperMessage object is always sent over the wire, wrapping either a Message1 or a Message2 object.
On the receiving side, we can parse the WrapperMessage object first, check which of the m1 or m2 fields is present in the wrapped object, and then parse the Message1 or Message2 object out of it.
Note that the oneof feature may not be available in older versions of the protobuf compiler.
Protobuf 3 introduced a new concept, Any, that handles this. A good description can be found here.
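As a minimal illustration of Any, independent of the Bigtable example below (this sketch assumes a generated message class named Item; the checked InvalidProtocolBufferException from parseFrom/unpack is omitted here):

// given some generated message instance: Item item = ...;
Any any = Any.pack(item);            // wrap a message of any type
byte[] bytes = any.toByteArray();    // serialize for the wire or storage
Any decoded = Any.parseFrom(bytes);
if (decoded.is(Item.class)) {        // check the wrapped type before unpacking
    Item roundTripped = decoded.unpack(Item.class);
}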
Below is example code for reading and writing with the Any type in proto3, using Bigtable for the read and write examples.
public void writeToBigtable(Item item) {
    try {
        RowMutation rowMutation = RowMutation.create("item", String.join("#", item.getHqLine(), item.getPartNo()))
                .setCell("item-info-cf", ByteString.copyFromUtf8("item-info-proto"), ByteString.copyFrom(Any.pack(item).toByteArray()));
        bigtableDataClient.mutateRow(rowMutation);
    } catch (RuntimeException exception) {
        log.error("Error occurred while inserting data into DB", exception);
    }
}

public Set<Item> readFromBigtable(String rowKey) {
    Row row = bigtableDataClient.readRow("item", rowKey, FILTERS.chain().filter(FILTERS.limit().cellsPerColumn(1)));
    return row.getCells("item-info-cf", ByteString.copyFromUtf8("item-info-proto"))
            .stream()
            .map(rowCell -> {
                Item item = null;
                try {
                    Any any = Any.parseFrom(rowCell.getValue().toByteArray());
                    if (any.is(Item.class)) {
                        item = any.unpack(Item.class);
                    }
                } catch (InvalidProtocolBufferException e) {
                    throw new RuntimeException(e);
                }
                return item;
            }).collect(Collectors.toSet());
}

Grabbing tagged Instagram photos in real time

I'm trying to download photos posted with a specific tag in real time. I found the real-time API pretty useless, so I'm using a long-polling strategy. Below is pseudocode, with comments pointing out the subtle bugs in it:
newMediaCount = getMediaCount();
delta = newMediaCount - mediaCount;
if (delta > 0) {
    // if mediaCount changed by now, realDelta > delta, so realDelta - delta photos
    // won't be grabbed; and on the next poll, if mediaCount didn't change again,
    // realDelta - delta photos would be duplicated
    // if a photo was posted from a private account, the last photo will be duplicated,
    // because the counter changes but nothing is added to recent
    recentMedia = getRecentMedia(delta);
    // persist recentMedia
    mediaCount = newMediaCount;
}
The second issue can be addressed with a Set of some sort, I guess. But the first one really bothers me. I've moved the two calls to the Instagram API as close together as possible, but is this enough?
Edit:
As Amir suggested, I've rewritten the code to use min/max_tag_id. But it still skips photos. I couldn't find a better way to test this than saving images to disk for some time and comparing the result with instagram.com/explore/tags/.
public class LousyInstagramApiTest {

    @Test
    public void testFeedContinuity() throws Exception {
        Instagram instagram = new Instagram(Settings.getClientId());
        final String TAG_NAME = "portrait";
        String id = instagram.getRecentMediaTags(TAG_NAME).getPagination().getMinTagId();
        HashtagEndpoint endpoint = new HashtagEndpoint(instagram, TAG_NAME, id);

        for (int i = 0; i < 10; i++) {
            Thread.sleep(3000);
            endpoint.recentFeed().forEach(d -> {
                try {
                    URL url = new URL(d.getImages().getLowResolution().getImageUrl());
                    BufferedImage img = ImageIO.read(url);
                    ImageIO.write(img, "png", new File("D:\\tmp\\" + d.getId() + ".png"));
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
    }
}
class HashtagEndpoint {

    private final Instagram instagram;
    private final String hashtag;
    private String minTagId;

    public HashtagEndpoint(Instagram instagram, String hashtag, String minTagId) {
        this.instagram = instagram;
        this.hashtag = hashtag;
        this.minTagId = minTagId;
    }

    public List<MediaFeedData> recentFeed() throws InstagramException {
        TagMediaFeed feed = instagram.getRecentMediaTags(hashtag, minTagId, null);
        List<MediaFeedData> dataList = feed.getData();
        if (dataList.size() == 0) return Collections.emptyList();

        String maxTagId = feed.getPagination().getNextMaxTagId();
        if (maxTagId != null && maxTagId.compareTo(minTagId) > 0) dataList.addAll(paginateFeed(maxTagId));
        Collections.reverse(dataList);
        // dataList.removeIf(d -> d.getId().compareTo(minTagId) < 0);

        minTagId = feed.getPagination().getMinTagId();
        return dataList;
    }

    private Collection<? extends MediaFeedData> paginateFeed(String maxTagId) throws InstagramException {
        System.out.println("pagination required");
        List<MediaFeedData> dataList = new ArrayList<>();
        do {
            TagMediaFeed feed = instagram.getRecentMediaTags(hashtag, null, maxTagId);
            maxTagId = feed.getPagination().getNextMaxTagId();
            dataList.addAll(feed.getData());
        } while (maxTagId.compareTo(minTagId) > 0);
        return dataList;
    }
}
When you use the Tag endpoints to get the recent media with a desired tag, the response includes a min_tag_id in its pagination info, which is tied to the most recently tagged media at the time of your call. Since the API also accepts a min_tag_id parameter, you can pass the value from your last query to receive only the media tagged after your last query.
So, based on whatever polling mechanism you have, you just call the API to get the new recent media, if any, based on the last received min_tag_id.
You will also need to pass a large count parameter and follow the pagination of the response to receive all the data without losing anything when the speed of tagging is faster than your polling.
Update:
Based on your updated code:
public List<MediaFeedData> recentFeed() throws InstagramException {
    TagMediaFeed feed = instagram.getRecentMediaTags(hashtag, minTagId, null, 100000);
    List<MediaFeedData> dataList = feed.getData();
    if (dataList.size() == 0) return Collections.emptyList();

    // follow the pagination
    MediaFeed recentMediaNextPage = instagram.getRecentMediaNextPage(feed.getPagination());
    while (recentMediaNextPage.getPagination() != null) {
        dataList.addAll(recentMediaNextPage.getData());
        recentMediaNextPage = instagram.getRecentMediaNextPage(recentMediaNextPage.getPagination());
    }

    Collections.reverse(dataList);
    minTagId = feed.getPagination().getMinTagId();
    return dataList;
}

Stumbling with dynamic parameters, passing Object[] to Object

I've been using a system in which I can tack on as many parameters as I want, and the method determines the data type based on each object. The method's skeleton is as follows:
public void sendPacket(int id, Object... data) {
    ....
}
This has allowed me to easily send packets with all sorts of information, by just supplying the ID and then the data in the order I want it sent over the network.
This became a problem when I needed to call sendPacket(Integer, Object...) dynamically.
Usually I know exactly how much data I need to pass to the sendPacket method and I pass it manually; however, in this case I don't know how many parameters I'm going to send, so the amount of data I'm sending over the network is unknown.
The approach I tried was to create an Object[] buffer, which isn't doing what I wanted it to. Example below:
Object[] buffer = new Object[list.size() * 3];
int bufferIndex = 0;
for (int i = 0; i < list.size(); i++) {
    buffer[bufferIndex++] = list.get(i).getId();
    buffer[bufferIndex++] = list.get(i).getName();
    buffer[bufferIndex++] = list.get(i).getLevel();
}
sendPacket(5, true, list.size(), buffer);
This produces the following [DEBUG] output:
[DEBUG]: Packet ID: 5 Data Passed[Boolean]: true
[DEBUG]: Packet ID: 5 Data Passed[Integer]: 1
[Ljava.lang.Object;
The [Ljava.lang.Object; output appears because I have it set up to tell me the class name of any object that failed to be converted into usable data.
Here's an example of how I'm currently interpreting the data passed to sendPacket:
for (Object o : data) {
    if (o.getClass().getName().endsWith("Integer")) {
        out.writeInt((int) o);
    }
}
There are probably more efficient ways to figure out which type to cast the data to, so if you know one, that information would also be beneficial.
Thanks for any help.
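For context on the [Ljava.lang.Object; line: buffer is an array, so in sendPacket(5, true, list.size(), buffer) it arrives as a single element of the varargs array rather than being flattened. A minimal sketch of one way to flatten everything into a single Object[] before the call (an illustration, not part of the original question or answer):

// copy the fixed arguments and the buffer contents into one array
Object[] args = new Object[2 + buffer.length];
args[0] = true;
args[1] = list.size();
System.arraycopy(buffer, 0, args, 2, buffer.length);

// an Object[] argument is used directly as the varargs array
sendPacket(5, args);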
public class ConvertUtil {

    private ConvertUtil() {}

    private final static Map<Class<?>, Method> METHOD_MAP = new HashMap<Class<?>, Method>();
    private static Logger log = LoggerFactory.getLogger(ConvertUtil.class);

    static {
        try {
            METHOD_MAP.put(Byte.class, Byte.class.getMethod("valueOf", String.class));
            METHOD_MAP.put(Short.class, Short.class.getMethod("valueOf", String.class));
            METHOD_MAP.put(Integer.class, Integer.class.getMethod("valueOf", String.class));
            METHOD_MAP.put(Long.class, Long.class.getMethod("valueOf", String.class));
            METHOD_MAP.put(Boolean.class, Boolean.class.getMethod("valueOf", String.class));
            METHOD_MAP.put(Float.class, Float.class.getMethod("valueOf", String.class));
            METHOD_MAP.put(Double.class, Double.class.getMethod("valueOf", String.class));
            METHOD_MAP.put(String.class, String.class.getMethod("valueOf", Object.class));
        } catch (Exception e) {
            log.error("ConvertUtil static init error: " + e.getLocalizedMessage());
        }
    }

    @SuppressWarnings("unchecked")
    public static <T> T castValue(Object val, T defaultVal) {
        Method method = METHOD_MAP.get(defaultVal.getClass());
        try {
            if (val != null && val instanceof String) {
                // valueOf is static, so the target object passed to invoke() is ignored
                defaultVal = (T) method.invoke(defaultVal.getClass(), val.toString());
            }
            if (val != null && val.getClass().getName().equals(defaultVal.getClass().getName())) {
                defaultVal = (T) val;
            }
        } catch (Exception e) {
            log.error("ConvertUtil castValue error: " + e.getLocalizedMessage());
        }
        return defaultVal;
    }
}
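A hypothetical usage sketch of the castValue helper above, where the second argument both supplies the default value and selects the target type:

int port = ConvertUtil.castValue("8080", 0);         // String parsed via Integer.valueOf -> 8080
boolean flag = ConvertUtil.castValue("true", false); // -> true
long id = ConvertUtil.castValue(42L, 0L);            // same type: passed through unchanged
double ratio = ConvertUtil.castValue(null, 1.5);     // null: the default is returned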
