Apache Flink: process XML documents and write them to a database (Java)

I have the following use case.
XML files are written to a Kafka topic, which I want to consume and process via Flink.
The XML attributes have to be renamed to match the database table columns. These renames have to be flexible and maintainable from outside the Flink job.
At the end, the attributes have to be written to the database.
Each XML document represents a database record.
As a second step, some attributes of all XML documents from the last x minutes have to be aggregated.
As far as I know, Flink is capable of all the mentioned steps, but I lack an idea of how to implement it correctly.
Currently I have implemented the Kafka source, where I retrieve the XML document and parse it via a custom MapFunction. There I create a POJO and store each attribute name and value in a HashMap.
public class Data {
    private Map<String, String> attributes = new HashMap<>();
}
The HashMap then contains entries like:
Key: path.to.attribute.one, Value: value of attribute one
Now I would like to use broadcast state to change the original attribute names to the database column names.
At this stage I'm stuck: I have my POJO data with the attributes inside the HashMap, but I don't know how to connect it with the mapping via broadcasting.
Another way would be to flatMap the XML document attributes into single records. This leaves me with two problems (one way to handle both is sketched below):
How to ensure that attributes from one document don't get mixed with those from another document within the stream
How to merge all the attributes of one document back together to insert them as one record into the database
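For what it's worth, the flatMap route is usually handled by carrying a document ID with every attribute record and keying the stream by it. Below is a rough, untested sketch, assuming the XmlData POJO and dataDataStream from the update below, a Tuple3 of (documentId, attributeName, attributeValue), and a processing-time session window; the generated ID and the session gap are my assumptions, not part of the original code:
// Each attribute record carries the ID of the document it came from.
DataStream<Tuple3<String, String, String>> attributeStream = dataDataStream
        .flatMap(new FlatMapFunction<XmlData, Tuple3<String, String, String>>() {
            @Override
            public void flatMap(XmlData xmlData, Collector<Tuple3<String, String, String>> out) {
                String docId = UUID.randomUUID().toString(); // one ID per document
                for (Map.Entry<String, String> e : xmlData.getColumns().entrySet()) {
                    out.collect(new Tuple3<>(docId, e.getKey(), e.getValue()));
                }
            }
        });

// Keying by the document ID keeps the attributes of one document together;
// a session window then merges them back into a single record.
DataStream<XmlData> merged = attributeStream
        .keyBy(new KeySelector<Tuple3<String, String, String>, String>() {
            @Override
            public String getKey(Tuple3<String, String, String> t) {
                return t.f0;
            }
        })
        .window(ProcessingTimeSessionWindows.withGap(Time.seconds(5)))
        .process(new ProcessWindowFunction<Tuple3<String, String, String>, XmlData, String, TimeWindow>() {
            @Override
            public void process(String docId, Context ctx,
                    Iterable<Tuple3<String, String, String>> attrs, Collector<XmlData> out) {
                XmlData record = new XmlData();
                for (Tuple3<String, String, String> a : attrs) {
                    record.addAttribute(a.f1, a.f2);
                }
                out.collect(record);
            }
        });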
For the second stage I am aware of the window functions, even though I haven't understood them in every detail, but I guess they would fit my requirement. The question at this stage is whether I can use more than one sink in one job, where one would be a stream of the raw data and one of the aggregated data (see the sketch below).
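On the multiple-sinks question: a Flink job can have any number of sinks; you simply attach one to each stream. A minimal, untested sketch, where rawSink, aggregateSink, and MyAggregateFunction are hypothetical placeholders (e.g. a JDBC SinkFunction and an AggregateFunction over the attributes of interest):
DataStream<XmlData> parsed = ...; // the stream parsed from Kafka

// Sink 1: the raw records.
parsed.addSink(rawSink);

// Sink 2: a windowed aggregation over the last x minutes of the same stream.
parsed
        .keyBy(new KeySelector<XmlData, String>() {
            @Override
            public String getKey(XmlData d) {
                return d.getAttributeValue("third.path.to.attribute.eventSource");
            }
        })
        .window(TumblingProcessingTimeWindows.of(Time.minutes(5)))
        .aggregate(new MyAggregateFunction())
        .addSink(aggregateSink);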
Can someone help with a hint?
Cheers
UPDATE
Here is what I've got so far. I simplified the code; the XmlData POJO represents my parsed XML document.
public class StreamingJob {

    static Logger LOG = LoggerFactory.getLogger(StreamingJob.class);

    public static void main(String[] args) throws Exception {
        // set up the streaming execution environment
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        XmlData xmlData1 = new XmlData();
        xmlData1.addAttribute("path.to.attribute.eventName", "Start");
        xmlData1.addAttribute("second.path.to.attribute.eventTimestamp", "2020-11-18T18:00:00.000");
        xmlData1.addAttribute("third.path.to.attribute.eventSource", "Source1");
        xmlData1.addAttribute("path.to.attribute.additionalAttribute", "Lorem");

        XmlData xmlData2 = new XmlData();
        xmlData2.addAttribute("path.to.attribute.eventName", "Start");
        xmlData2.addAttribute("second.path.to.attribute.eventTimestamp", "2020-11-18T18:00:01.000");
        xmlData2.addAttribute("third.path.to.attribute.eventSource", "Source2");
        xmlData2.addAttribute("path.to.attribute.additionalAttribute", "First");

        XmlData xmlData3 = new XmlData();
        xmlData3.addAttribute("path.to.attribute.eventName", "Start");
        xmlData3.addAttribute("second.path.to.attribute.eventTimestamp", "2020-11-18T18:00:01.000");
        xmlData3.addAttribute("third.path.to.attribute.eventSource", "Source1");
        xmlData3.addAttribute("path.to.attribute.additionalAttribute", "Day");

        Mapping mapping1 = new Mapping();
        mapping1.addMapping("path.to.attribute.eventName", "EVENT_NAME");
        mapping1.addMapping("second.path.to.attribute.eventTimestamp", "EVENT_TIMESTAMP");

        DataStream<Mapping> mappingDataStream = env.fromElements(mapping1);

        MapStateDescriptor<String, Mapping> mappingStateDescriptor = new MapStateDescriptor<String, Mapping>(
                "MappingBroadcastState",
                BasicTypeInfo.STRING_TYPE_INFO,
                TypeInformation.of(new TypeHint<Mapping>() {}));

        BroadcastStream<Mapping> mappingBroadcastStream = mappingDataStream.broadcast(mappingStateDescriptor);

        DataStream<XmlData> dataDataStream = env.fromElements(xmlData1, xmlData2, xmlData3);

        // Convert the xml with all attributes to a stream of attribute names and values
        DataStream<Tuple2<String, String>> recordDataStream = dataDataStream
                .flatMap(new CustomFlatMapFunction());

        // Map the attributes with the mapping information
        DataStream<Tuple2<String, String>> outputDataStream = recordDataStream
                .connect(mappingBroadcastStream)
                .process(new CustomBroadcastingFunction());

        env.execute("Process xml data and write it to database");
    }

    static class XmlData {

        private Map<String, String> attributes = new HashMap<>();

        public XmlData() {
        }

        public String toString() {
            return this.attributes.toString();
        }

        public Map<String, String> getColumns() {
            return this.attributes;
        }

        public void addAttribute(String key, String value) {
            this.attributes.put(key, value);
        }

        public String getAttributeValue(String attributeName) {
            return attributes.get(attributeName);
        }
    }

    static class Mapping {
        // First string is the attribute path and name
        // Second string is the database column name
        Map<String, String> mappingTuple = new HashMap<>();

        public Mapping() {
        }

        public void addMapping(String attributeNameWithPath, String databaseColumnName) {
            this.mappingTuple.put(attributeNameWithPath, databaseColumnName);
        }

        public Map<String, String> getMappingTuple() {
            return mappingTuple;
        }

        public void setMappingTuple(Map<String, String> mappingTuple) {
            this.mappingTuple = mappingTuple;
        }
    }

    static class CustomFlatMapFunction implements FlatMapFunction<XmlData, Tuple2<String, String>> {

        @Override
        public void flatMap(XmlData xmlData, Collector<Tuple2<String, String>> collector) throws Exception {
            for (Map.Entry<String, String> entrySet : xmlData.getColumns().entrySet()) {
                collector.collect(new Tuple2<>(entrySet.getKey(), entrySet.getValue()));
            }
        }
    }

    static class CustomBroadcastingFunction extends BroadcastProcessFunction<Tuple2<String, String>, Mapping, Tuple2<String, String>> {

        @Override
        public void processElement(Tuple2<String, String> attribute, ReadOnlyContext readOnlyContext, Collector<Tuple2<String, String>> collector) throws Exception {
            // This is where I'm stuck: how do I apply the broadcast mapping here?
        }

        @Override
        public void processBroadcastElement(Mapping mapping, Context context, Collector<Tuple2<String, String>> collector) throws Exception {
        }
    }
}

Here's some example code of how to do this using a BroadcastStream. There's a subtle issue: the attribute remapping data might show up after one of the records. Normally you'd use a timer with state to hold onto any records that are missing remapping data (a sketch of that follows after the example), but in your case it's unclear whether a missing remapping means "need to wait longer" or "no mapping exists". In any case, this should get you started...
private static MapStateDescriptor<String, String> REMAPPING_STATE = new MapStateDescriptor<>("remappings", String.class, String.class);

@Test
public void testUnkeyedStreamWithBroadcastStream() throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment(2);

    List<Tuple2<String, String>> attributeRemapping = new ArrayList<>();
    attributeRemapping.add(new Tuple2<>("one", "1"));
    attributeRemapping.add(new Tuple2<>("two", "2"));
    attributeRemapping.add(new Tuple2<>("three", "3"));
    attributeRemapping.add(new Tuple2<>("four", "4"));
    attributeRemapping.add(new Tuple2<>("five", "5"));
    attributeRemapping.add(new Tuple2<>("six", "6"));

    BroadcastStream<Tuple2<String, String>> attributes = env.fromCollection(attributeRemapping)
            .broadcast(REMAPPING_STATE);

    List<Map<String, Integer>> xmlData = new ArrayList<>();
    xmlData.add(makePOJO("one", 10));
    xmlData.add(makePOJO("two", 20));
    xmlData.add(makePOJO("three", 30));
    xmlData.add(makePOJO("four", 40));
    xmlData.add(makePOJO("five", 50));

    DataStream<Map<String, Integer>> records = env.fromCollection(xmlData);

    records.connect(attributes)
            .process(new MyRemappingFunction())
            .print();

    env.execute();
}

private Map<String, Integer> makePOJO(String key, int value) {
    Map<String, Integer> result = new HashMap<>();
    result.put(key, value);
    return result;
}

@SuppressWarnings("serial")
private static class MyRemappingFunction extends BroadcastProcessFunction<Map<String, Integer>, Tuple2<String, String>, Map<String, Integer>> {

    @Override
    public void processBroadcastElement(Tuple2<String, String> in, Context ctx, Collector<Map<String, Integer>> out) throws Exception {
        ctx.getBroadcastState(REMAPPING_STATE).put(in.f0, in.f1);
    }

    @Override
    public void processElement(Map<String, Integer> in, ReadOnlyContext ctx, Collector<Map<String, Integer>> out) throws Exception {
        final ReadOnlyBroadcastState<String, String> state = ctx.getBroadcastState(REMAPPING_STATE);

        Map<String, Integer> result = new HashMap<>();
        for (String key : in.keySet()) {
            if (state.contains(key)) {
                result.put(state.get(key), in.get(key));
            } else {
                result.put(key, in.get(key));
            }
        }
        out.collect(result);
    }
}
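To make the "timer with state" remark above concrete, here is a rough, untested sketch (all names are mine) of buffering records whose mapping hasn't arrived yet. It assumes the record stream is keyed, e.g. by attribute name, since timers and keyed state require a KeyedStream:
private static class BufferingRemappingFunction extends
        KeyedBroadcastProcessFunction<String, Tuple2<String, String>, Tuple2<String, String>, Tuple2<String, String>> {

    private transient ListState<Tuple2<String, String>> pending;

    @Override
    public void open(Configuration parameters) {
        pending = getRuntimeContext().getListState(new ListStateDescriptor<>(
                "pending", TypeInformation.of(new TypeHint<Tuple2<String, String>>() {})));
    }

    @Override
    public void processBroadcastElement(Tuple2<String, String> in, Context ctx,
            Collector<Tuple2<String, String>> out) throws Exception {
        ctx.getBroadcastState(REMAPPING_STATE).put(in.f0, in.f1);
    }

    @Override
    public void processElement(Tuple2<String, String> in, ReadOnlyContext ctx,
            Collector<Tuple2<String, String>> out) throws Exception {
        ReadOnlyBroadcastState<String, String> state = ctx.getBroadcastState(REMAPPING_STATE);
        if (state.contains(in.f0)) {
            out.collect(new Tuple2<>(state.get(in.f0), in.f1));
        } else {
            // No mapping yet: buffer the record and check again in five seconds.
            pending.add(in);
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + 5000);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx,
            Collector<Tuple2<String, String>> out) throws Exception {
        // Emit buffered records whose mapping has arrived in the meantime; keep
        // the rest (or pass them through, depending on the semantics you choose).
        ReadOnlyBroadcastState<String, String> state = ctx.getBroadcastState(REMAPPING_STATE);
        List<Tuple2<String, String>> stillPending = new ArrayList<>();
        for (Tuple2<String, String> record : pending.get()) {
            if (state.contains(record.f0)) {
                out.collect(new Tuple2<>(state.get(record.f0), record.f1));
            } else {
                stillPending.add(record);
            }
        }
        pending.update(stillPending);
    }
}
You would wire it up with something like records.keyBy(r -> r.f0).connect(attributes).process(new BufferingRemappingFunction()).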

Related

ehcache Map<String, Entry> not working with Spring Boot

I tried to cache a Map<String, Entry>, but every time I found that getEntries() hit the database without caching.
I also serialize the Entry object. I would appreciate your support.
@Cacheable("stocks")
public Map<String, Entry> getEntries() {
    // get the entries from the database, then convert them to a map
    return map;
}
This works for me:
@Service
public class OrderService {
    public static int counter = 0;

    @Cacheable("stocks")
    public Map<String, Entry> getEntries() {
        counter++;
        final Map<String, Entry> map = new HashMap<>();
        map.put("key", new Entry(123L, "interesting entry"));
        return map;
    }
}
Here's a test to prove that the method body is not executed on the second call.
@Test
public void entry() throws Exception {
    OrderService.counter = 0;
    orderService.getEntries();
    assertEquals(1, OrderService.counter);
    orderService.getEntries();
    assertEquals(1, OrderService.counter);
}
I've added it all to my GitHub example.
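One more thing worth checking, as an assumption about the original setup (the question doesn't show the configuration): Spring's annotation-driven caching must be enabled, and @Cacheable only takes effect when the method is invoked through the Spring proxy, i.e. from another bean rather than via this.getEntries(). A minimal configuration sketch:
@SpringBootApplication
@EnableCaching // without this, @Cacheable annotations are silently ignored
public class Application {
    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }
}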

sorting a List<Map<String, String>>

I have a list of maps of strings:
List<Map<String, String>> list = new ArrayList<Map<String, String>>();
This gets populated with the following:
Map<String, String> action1 = new LinkedHashMap<>();
action1.put("name", "CreateFirstName");
action1.put("nextAction", "CreateLastName");
Map<String, String> action2 = new LinkedHashMap<>();
action2.put("name", "CreateAddress");
action2.put("nextAction", "CreateEmail");
Map<String, String> action3 = new LinkedHashMap<>();
action3.put("name", "CreateLastName");
action3.put("nextAction", "CreateAddress");
Map<String, String> action4 = new LinkedHashMap<>();
action4.put("name", "CreateEmail");
list.add(action1);
list.add(action2);
list.add(action3);
list.add(action4);
action4 doesn't have a nextAction because it is the last action; would it be easier to just give it a nextAction that is a placeholder for "no next action"?
Question: How can I sort my list, so that the actions are in order?
i.e. the nextAction of an action is the same as the name of the next action in the list.
Although this seems to be a case of the XY problem, and this list of maps is certainly not a "nicely designed data model", and there is likely a representation that is "better" in many ways (although nobody can recommend the "best" model as long as the overall goal is not known), this is the task that you have at hand, and here is how it could be solved:
First of all, you have to determine the first element of the sorted list. This is exactly the map that has a "name" entry that does not appear as the "nextAction" entry of any other map.
After you have this first map, you can add it to the (sorted) list. Then, determining the next element boils down to finding the map whose "name" is the same as the "nextAction" of the previous map. To quickly find these successors, you can build a map that maps each "name" entry to the map itself.
Here is a basic implementation of this sorting approach:
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SortListWithMaps
{
    public static void main(String[] args)
    {
        List<Map<String, String>> list = new ArrayList<Map<String, String>>();
        Map<String, String> action1 = new LinkedHashMap<>();
        action1.put("name", "CreateFirstName");
        action1.put("nextAction", "CreateLastName");
        Map<String, String> action2 = new LinkedHashMap<>();
        action2.put("name", "CreateAddress");
        action2.put("nextAction", "CreateEmail");
        Map<String, String> action3 = new LinkedHashMap<>();
        action3.put("name", "CreateLastName");
        action3.put("nextAction", "CreateAddress");
        Map<String, String> action4 = new LinkedHashMap<>();
        action4.put("name", "CreateEmail");
        list.add(action1);
        list.add(action2);
        list.add(action3);
        list.add(action4);

        // Make it a bit more interesting...
        Collections.shuffle(list);

        System.out.println("Before sorting");
        for (Map<String, String> map : list)
        {
            System.out.println(map);
        }

        List<Map<String, String>> sortedList = sort(list);

        System.out.println("After sorting");
        for (Map<String, String> map : sortedList)
        {
            System.out.println(map);
        }
    }

    private static List<Map<String, String>> sort(
        List<Map<String, String>> list)
    {
        // Compute a map from "name" to the actual map
        Map<String, Map<String, String>> nameToMap =
            new LinkedHashMap<String, Map<String, String>>();
        for (Map<String, String> map : list)
        {
            String name = map.get("name");
            nameToMap.put(name, map);
        }

        // Determine the first element for the sorted list. For that,
        // create the set of all names, and remove all of them that
        // appear as the "nextAction" of another entry
        Set<String> names =
            new LinkedHashSet<String>(nameToMap.keySet());
        for (Map<String, String> map : list)
        {
            String nextAction = map.get("nextAction");
            names.remove(nextAction);
        }
        if (names.size() != 1)
        {
            System.out.println("Multiple possible first elements: " + names);
            return null;
        }

        // Insert the elements, in sorted order, into the result list
        List<Map<String, String>> result =
            new ArrayList<Map<String, String>>();
        String currentName = names.iterator().next();
        while (currentName != null)
        {
            Map<String, String> element = nameToMap.get(currentName);
            result.add(element);
            currentName = element.get("nextAction");
        }
        return result;
    }
}
Instead of using a Map to store the properties of an action (the name and the nextAction), create your own type that's composed of those properties:
class Action {
    private String name;
    //nextAction

    public void perform() {
        //do current action
        //use nextAction to perform the next action
    }
}
The nextAction can now be a reference to the next action:
abstract class Action {
    private String name;
    private Action nextAction;

    public Action(String name, Action nextAction) {
        this.name = name;
        this.nextAction = nextAction;
    }

    public final void perform() {
        perform(name);
        nextAction.perform();
    }

    protected abstract void perform(String name);
}
You can now create your actions by subtyping the Action class:
class CreateFirstName extends Action {
    public CreateFirstName(Action nextAction) {
        super("CreateFirstName", nextAction);
    }

    protected final void perform(String name) {
        System.out.println("Performing " + name);
    }
}
And chain them together:
Action action = new CreateFirstName(new CreateLastName(new CreateEmail(...)));
The nested expressions can get pretty messy, but we'll get to that later. There's a bigger problem here.
action4 doesn't have a nextAction because it is the last action, but might be easier to just give it a nextAction that is a placeholder for no next action
The same problem applies to the code above.
Right now, every action must have a next action, due to the constructor Action(String, Action). We could take the easy route and pass in a placeholder for no next action (null being the easiest route):
class End extends Action {
    public End() {
        super("", null);
    }

    protected final void perform(String name) {
        //nothing to do
    }
}
And do a null check:
//class Action
public void perform() {
    perform(name);
    if (nextAction != null) {
        nextAction.perform(); //performs next action
    }
}
But this would be a code smell. You can stop reading here and use the simple fix, or continue below for the more involved (and educational) route.
There's a good chance that when you do use null, you're falling victim to a code smell. Although it doesn't apply to all cases (due to Java's poor null safety), you should try to avoid null if possible. Instead, rethink your design as in this example. If all else fails, use Optional.
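For completeness, a minimal sketch of what the Optional variant could look like (my own illustration, not part of the design discussed below):
abstract class Action {
    private final String name;
    private final Optional<Action> nextAction;

    public Action(String name, Optional<Action> nextAction) {
        this.name = name;
        this.nextAction = nextAction;
    }

    public final void perform() {
        perform(name);
        // an absent next action simply ends the chain; no null check needed
        nextAction.ifPresent(Action::perform);
    }

    protected abstract void perform(String name);
}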
The last action is not the same as the other actions. It can still perform like the other, but it has different property requirements.
This means they could both share the same behavior abstraction, but must differ when it comes to defining properties:
interface Action {
    void perform();
}

abstract class ContinuousAction implements Action {
    private String name;
    private Action nextAction;

    public ContinuousAction(String name, Action nextAction) {
        this.name = name;
        this.nextAction = nextAction;
    }

    public final void perform() {
        perform(name);
        nextAction.perform();
    }

    protected abstract void perform(String name);
}

abstract class PlainAction implements Action {
    private String name;

    public PlainAction(String name) {
        this.name = name;
    }

    public final void perform() {
        perform(name);
    }

    protected abstract void perform(String name);
}
The last action would extend PlainAction, while the others would extend ContinuousAction.
Lastly, to prevent long chains:
new First(new Second(new Third(new Fourth(new Fifth(new Sixth(new Seventh(new Eighth(new Ninth(new Tenth())))))))))
You could specify the next action within each concrete action:
class CreateFirstName extends ContinuousAction {
    public CreateFirstName() {
        super("CreateFirstName", new CreateLastName());
    }
    //...
}

class CreateLastName extends ContinuousAction {
    public CreateLastName() {
        super("CreateLastName", new CreateEmail());
    }
    //...
}

class CreateEmail extends PlainAction {
    public CreateEmail() {
        super("CreateEmail");
    }
    //...
}
The ContinuousAction and PlainAction can be abstracted further. They are both named actions (they have names), and that property affects their contract in the same way (it is passed to the template method perform(String)):
abstract class NamedAction implements Action {
    private String name;

    public NamedAction(String name) {
        this.name = name;
    }

    public final void perform() {
        perform(name);
    }

    protected abstract void perform(String name);
}

//class ContinuousAction extends NamedAction
//class PlainAction extends NamedAction

Save and Read Key-Value pair in Spark

I have a JavaPairRDD in the following format:
JavaPairRDD<String, Tuple2<String, List<String>>> myData;
I want to save it in a key-value format (String, Tuple2<String, List<String>>):
myData.saveAsXXXFile("output-path");
So my next job could read the data directly into my JavaPairRDD:
JavaPairRDD<String, Tuple2<String, List<String>>> newData = context.XXXFile("output-path");
I am using Java 7, Spark 1.2, and the Java API. I tried saveAsTextFile and saveAsObjectFile; neither works. And I don't see a saveAsSequenceFile option in my Eclipse.
Does anyone have any suggestion for this problem?
Thank you very much!
You could use SequenceFileRDDFunctions, which is used through implicits in Scala; however, that might be nastier than the usual suggestion for Java:
myData.saveAsHadoopFile(fileName, Text.class, CustomWritable.class,
SequenceFileOutputFormat.class);
with CustomWritable implementing
org.apache.hadoop.io.Writable
Something like this should work (did not check for compilation):
public class MyWritable implements Writable {
    private String _1;
    private String[] _2;

    // Hadoop needs a no-arg constructor to instantiate the Writable on read.
    public MyWritable() {
    }

    public MyWritable(Tuple2<String, String[]> data) {
        _1 = data._1;
        _2 = data._2;
    }

    public Tuple2<String, String[]> get() {
        return new Tuple2<>(_1, _2);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        _1 = WritableUtils.readString(in);
        ArrayWritable _2Writable = new ArrayWritable(Text.class);
        _2Writable.readFields(in);
        _2 = _2Writable.toStrings();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        Text.writeString(out, _1);
        // Wrap the strings as Text so the value class matches on read.
        Text[] texts = new Text[_2.length];
        for (int i = 0; i < _2.length; i++) {
            texts[i] = new Text(_2[i]);
        }
        new ArrayWritable(Text.class, texts).write(out);
    }
}
such that it fits your data model.
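For reading the data back in the next job (the XXXFile part of the question), something along these lines should work, assuming the MyWritable class above (untested, like the rest of this answer):
// Read the sequence file back and immediately unwrap the custom Writable.
// Hadoop may reuse Writable instances, so convert right away rather than caching them.
JavaPairRDD<Text, MyWritable> raw =
        context.sequenceFile("output-path", Text.class, MyWritable.class);

JavaPairRDD<String, Tuple2<String, String[]>> newData = raw.mapToPair(
        new PairFunction<Tuple2<Text, MyWritable>, String, Tuple2<String, String[]>>() {
            @Override
            public Tuple2<String, Tuple2<String, String[]>> call(Tuple2<Text, MyWritable> t) {
                return new Tuple2<>(t._1().toString(), t._2().get());
            }
        });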

Right definition of HashMap for static variables

I have trouble with the logical definition of the HashMap.
For example, I created the following class to store some mandatory data. I just want to know whether this is a good implementation or not. I used a static HashMap because I need these HashMaps the whole time my application is alive.
public abstract class DataTable {

    private static HashMap<String, String[]> mainData = new HashMap<String, String[]>();

    public static void putData(String[] data) {
        // put some data
    }

    public static String[] getData(String alias) {
        // return entered data with the given alias
    }
}
Any suggestion would be appreciated...
Your understanding of static is OK.
Your methods (setter and getter) should be:
public static void putData(String key, String[] data) {
    mainData.put(key, data);
}

public static String[] getData(String alias) {
    return mainData.get(alias);
}
Of course, proper guarding and exception handling is mandatory.
Update: As you mentioned in the comments, you are asking about thread safety on the map; use a ConcurrentHashMap:
Map<String, String[]> mainData = new ConcurrentHashMap<String, String[]>();
which is "a hash table supporting full concurrency of retrievals and adjustable expected concurrency for updates."

My method is too specific. How can I make it more generic?

I have a class, the outline of which is basically listed below.
import org.apache.commons.math.stat.Frequency;

public class WebUsageLog {
    private Collection<LogLine> logLines;
    private Collection<Date> dates;

    WebUsageLog() {
        this.logLines = new ArrayList<LogLine>();
        this.dates = new ArrayList<Date>();
    }

    SortedMap<Double, String> getFrequencyOfVisitedSites() {
        SortedMap<Double, String> frequencyMap = new TreeMap<Double, String>(Collections.reverseOrder()); //we reverse order to sort from the highest percentage to the lowest.
        Collection<String> domains = new HashSet<String>();
        Frequency freq = new Frequency();
        for (LogLine line : this.logLines) {
            freq.addValue(line.getVisitedDomain());
            domains.add(line.getVisitedDomain());
        }
        for (String domain : domains) {
            frequencyMap.put(freq.getPct(domain), domain);
        }
        return frequencyMap;
    }
}
The intention of this application is to allow our Human Resources folks to view the web usage logs we send to them. However, I'm sure that over time I'd like to offer the option to view not only the frequency of visited sites, but also other members of LogLine (things like the frequency of assigned categories, accessed content types [text/html, image/jpeg, etc.], filter verdicts, and so on). Ideally, I'd like to avoid writing individual methods for compiling the data for each of those types, since they would each end up looking nearly identical to the getFrequencyOfVisitedSites() method.
So, my question is twofold: first, can you see anywhere this method should be improved, from a mechanical standpoint? And secondly, how would you make this method more generic, so that it can handle an arbitrary set of data?
This is basically the same thing as Eugene's solution; I just left all the frequency-calculation stuff in the original method and use the strategy only for getting the field to work on.
If you don't like enums, you could certainly do this with an interface instead (a sketch follows after the usage example below).
public class WebUsageLog {
    private Collection<LogLine> logLines;
    private Collection<Date> dates;

    WebUsageLog() {
        this.logLines = new ArrayList<LogLine>();
        this.dates = new ArrayList<Date>();
    }

    SortedMap<Double, String> getFrequency(LineProperty property) {
        SortedMap<Double, String> frequencyMap = new TreeMap<Double, String>(Collections.reverseOrder()); //we reverse order to sort from the highest percentage to the lowest.
        Collection<String> values = new HashSet<String>();
        Frequency freq = new Frequency();
        for (LogLine line : this.logLines) {
            freq.addValue(property.getValue(line));
            values.add(property.getValue(line));
        }
        for (String value : values) {
            frequencyMap.put(freq.getPct(value), value);
        }
        return frequencyMap;
    }

    public enum LineProperty {
        VISITED_DOMAIN {
            @Override
            public String getValue(LogLine line) {
                return line.getVisitedDomain();
            }
        },
        CATEGORY {
            @Override
            public String getValue(LogLine line) {
                return line.getCategory();
            }
        },
        VERDICT {
            @Override
            public String getValue(LogLine line) {
                return line.getVerdict();
            }
        };

        public abstract String getValue(LogLine line);
    }
}
Then given an instance of WebUsageLog you could call it like this:
WebUsageLog usageLog = ...
SortedMap<Double, String> visitedSiteFrequency = usageLog.getFrequency(LineProperty.VISITED_DOMAIN);
SortedMap<Double, String> categoryFrequency = usageLog.getFrequency(LineProperty.CATEGORY);
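If you'd rather go the interface route mentioned above, the same idea looks like this (a sketch; the interface shape is mine):
public interface LineProperty {
    String getValue(LogLine line);
}

// getFrequency(LineProperty) stays exactly the same; call sites pass an
// anonymous class (or a lambda on Java 8+) instead of an enum constant:
SortedMap<Double, String> categoryFrequency = usageLog.getFrequency(new LineProperty() {
    @Override
    public String getValue(LogLine line) {
        return line.getCategory();
    }
});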
I'd introduce an abstraction like "data processor" for each computation type, so you can just call individual processors for each line:
...
void process(Collection<Processor> processors) {
    for (LogLine line : this.logLines) {
        for (Processor processor : processors) {
            processor.process(line);
        }
    }
    for (Processor processor : processors) {
        processor.complete();
    }
}
...

public interface Processor {
    public void process(LogLine line);
    public void complete();
}
public class FrequencyProcessor implements Processor {
    SortedMap<Double, String> frequencyMap = new TreeMap<Double, String>(Collections.reverseOrder()); //we reverse order to sort from the highest percentage to the lowest.
    Collection<String> domains = new HashSet<String>();
    Frequency freq = new Frequency();

    public void process(LogLine line) {
        String property = getProperty(line);
        freq.addValue(property);
        domains.add(property);
    }

    protected String getProperty(LogLine line) {
        return line.getVisitedDomain();
    }

    public void complete() {
        for (String domain : domains) {
            frequencyMap.put(freq.getPct(domain), domain);
        }
    }
}
You could also change the LogLine API to be more like a Map, i.e. instead of the strongly typed line.getVisitedDomain() you could use line.get("VisitedDomain"). Then you can write a generic FrequencyProcessor for all properties and just pass a property name in its constructor, as sketched below.
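A sketch of that last idea, assuming LogLine gains a map-like get(String) accessor (the accessor is my assumption):
public class FrequencyProcessor implements Processor {
    private final String propertyName;
    private final SortedMap<Double, String> frequencyMap =
            new TreeMap<Double, String>(Collections.reverseOrder());
    private final Collection<String> values = new HashSet<String>();
    private final Frequency freq = new Frequency();

    public FrequencyProcessor(String propertyName) {
        this.propertyName = propertyName;
    }

    public void process(LogLine line) {
        String value = line.get(propertyName); // assumed map-like accessor
        freq.addValue(value);
        values.add(value);
    }

    public void complete() {
        for (String value : values) {
            frequencyMap.put(freq.getPct(value), value);
        }
    }
}
A processor for a different property is then just new FrequencyProcessor("Category").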
