As shown in the code below, we have an FtpInboundFileSynchronizingMessageSource with a FileSystemPersistentAcceptOnceFileListFilter backed by a PropertiesPersistingMetadataStore.
@Bean
public PropertiesPersistingMetadataStore getMetadataStore() {
    final PropertiesPersistingMetadataStore metadataStore = new PropertiesPersistingMetadataStore() {
        @Override
        public String putIfAbsent(final String key, final String value) {
            try {
                // Re-read the backing properties file before each insert
                super.afterPropertiesSet();
            } catch (final Exception e) {
                e.printStackTrace();
            }
            return super.putIfAbsent(key, value);
        }
    };
    metadataStore.setBaseDirectory(getRegistryValue("LOCALMETASTOREDIRECTORY"));
    return metadataStore;
}
@Bean
@InboundChannelAdapter(value = "CSVChannel", poller = @Poller(fixedRate = "30000", maxMessagesPerPoll = "1"))
public MessageSource<File> ftpMessageSource() {
    final String METHODNAME = "ftpMessageSource()";
    if (LoggingHelper.isEntryExitTraceEnabled(LOGGER)) {
        LOGGER.entering(CLASSNAME, METHODNAME);
    }
    final Comparator<File> fileLastModifiedDateComparator = new Comparator<File>() {
        @Override
        public int compare(final File f1, final File f2) {
            return Long.valueOf(f1.lastModified())
                    .compareTo(f2.lastModified());
        }
    };
    final FtpInboundFileSynchronizingMessageSource source =
            new FtpInboundFileSynchronizingMessageSource(ftpInboundFileSynchronizer(), fileLastModifiedDateComparator);
    source.setLocalDirectory(new File(getRegistryValue("LOCALDIRECTORY")));
    final FileSystemPersistentAcceptOnceFileListFilter fileSystemPersistentAcceptOnceFileListFilter =
            new FileSystemPersistentAcceptOnceFileListFilter(getMetadataStore(),
                    getRegistryValue("REMOTEFILENAMEPATTERN_ANAG_CLI"));
    fileSystemPersistentAcceptOnceFileListFilter.setFlushOnUpdate(true);
    source.setLocalFilter(fileSystemPersistentAcceptOnceFileListFilter);
    if (LoggingHelper.isEntryExitTraceEnabled(LOGGER)) {
        LOGGER.exiting(CLASSNAME, METHODNAME);
    }
    return source;
}
We run four instances of the application in production, and the local directory and the metadata store directory are on a location shared by all four instances.
The problem we are facing now is that invalid characters are being written to the metadata-store.properties file. Sometimes some process keeps writing the character \u0000 continuously, which makes the file grow very large (around 1 GB in a few minutes). And since the framework reads the metadata into memory, this causes an OutOfMemoryError when the file is very big.
Please see below some entries from the metadata-store.properties file (the long runs of \u0000 are truncated here for readability):
ANAG_CLI_*.CSV/opt/user-integration/anagcli/input/20200609113855907_ANAG_CLI_20200609113846.CSV.a=1591695480000
\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000=
ANAG_CLI_*.CSV/opt/user-integration/anagcli/input/20200610105125916_ANAG_CLI_20200610105118.CSV.a.writing=1591779085951
\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000 ... \u0000=
\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000=
ANAG_CLI_*.CSV/opt/user-integration/anagcli/input/20200609133155929_ANAG_CLI_20200609133146.CSV.a=1591702315917
Is it safe to use the PropertiesPersistingMetadataStore like this, shared between more than one application instance? How can we find out what is causing this invalid-character issue, and how can we avoid it?
Any help would be appreciated!
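For what it's worth, PropertiesPersistingMetadataStore keeps its whole state in a single properties file and rewrites that file on every flush, with no cross-process locking, so four JVMs flushing the same shared file can interleave their writes and corrupt it. One way to avoid this, assuming a shared database is available and the spring-integration-jdbc module can be added, is a metadata store designed for concurrent access; a minimal sketch:
// Sketch only: assumes spring-integration-jdbc is on the classpath and the
// INT_METADATA_STORE table from its schema scripts exists in the shared DB.
@Bean
public ConcurrentMetadataStore metadataStore(DataSource dataSource) {
    // JdbcMetadataStore delegates putIfAbsent to the database, so all four
    // instances can safely share one store.
    return new JdbcMetadataStore(dataSource);
}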
Related
In my Flink job I have a stream that I get from one Kafka topic, manipulate, and send back to Kafka using the sink.
public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(1);
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

    Properties p = new Properties();
    p.setProperty("bootstrap.servers", servers_ip_list);
    p.setProperty("group.id", "Flink");

    FlinkKafkaConsumer<Event_N> kafkaData_N =
            new FlinkKafkaConsumer<>("CorID_0", new Ev_Des_Sch_N(), p);

    WatermarkStrategy<Event_N> wmStrategy =
            WatermarkStrategy
                    .<Event_N>forMonotonousTimestamps()
                    .withIdleness(Duration.ofMinutes(1))
                    .withTimestampAssigner((Event, timestamp) -> {
                        return Event.get_Time();
                    });

    DataStream<Event_N> stream_N = env.addSource(
            kafkaData_N.assignTimestampsAndWatermarks(wmStrategy));
The part above works fine, no problems at all; the part below is where I'm getting the issue.
String ProducerTopic = "CorID_0_f1";

DataStream<Stream_Blocker_Pojo.block> box_stream_p = stream_N
        .keyBy((Event_N CorrID) -> CorrID.get_CorrID())
        .map(new Stream_Blocker_Pojo());

FlinkKafkaProducer<Stream_Blocker_Pojo.block> myProducer = new FlinkKafkaProducer<>(
        ProducerTopic,
        new ObjSerializationSchema(ProducerTopic),
        p,
        FlinkKafkaProducer.Semantic.EXACTLY_ONCE); // fault-tolerance

box_stream_p.addSink(myProducer);
No errors, everything works fine. This is the Stream_Blocker_Pojo where I map a stream, manipulate it, and send out a new one. (I have simplified my code, keeping just 4 variables and removing all the math and data processing.)
public class Stream_Blocker_Pojo extends RichMapFunction<Event_N, Stream_Blocker_Pojo.block> {

    public class block {
        public Double block_id;
        public Double block_var2;
        public Double block_var3;
        public Double block_var4;
    }

    private transient ValueState<block> state_a;

    @Override
    public void open(Configuration parameters) throws Exception {
        state_a = getRuntimeContext().getState(new ValueStateDescriptor<>("BoxState_a", block.class));
    }

    @Override
    public block map(Event_N input) throws Exception {
        p1.Stream_Blocker_Pojo.block current_a = state_a.value();
        if (current_a == null) {
            current_a = new p1.Stream_Blocker_Pojo.block();
            current_a.block_id = 0.0;
            current_a.block_var2 = 0.0;
            current_a.block_var3 = 0.0;
            current_a.block_var4 = 0.0;
        }
        current_a.block_id = input.f_num_id;
        current_a.block_var2 = input.f_num_2;
        current_a.block_var3 = input.f_num_3;
        current_a.block_var4 = input.f_num_4;
        state_a.update(current_a);
        return new block();
    }
}
This is the implementation of the Kafka serialization schema.
public class ObjSerializationSchema implements KafkaSerializationSchema<Stream_Blocker_Pojo.block> {

    private String topic;
    private ObjectMapper mapper;

    public ObjSerializationSchema(String topic) {
        super();
        this.topic = topic;
    }

    @Override
    public ProducerRecord<byte[], byte[]> serialize(Stream_Blocker_Pojo.block obj, Long timestamp) {
        byte[] b = null;
        if (mapper == null) {
            mapper = new ObjectMapper();
        }
        try {
            b = mapper.writeValueAsBytes(obj);
        } catch (JsonProcessingException e) {
        }
        return new ProducerRecord<byte[], byte[]>(topic, b);
    }
}
When I open the messages that I sent from my Flink job using Kafka, I find that all the variables are "null":
CorrID b'{"block_id":null,"block_var1":null,"block_var2":null,"block_var3":null,"block_var4":null}
It looks like I'm sending out an empty object with no values, but I'm struggling to understand what I'm doing wrong. I think the problem could be in my implementation of Stream_Blocker_Pojo or maybe in ObjSerializationSchema. Any help would be really appreciated. Thanks
There are two probable issues here:
Are you sure the variable of type block that you are passing doesn't have null fields? You may want to debug that part to be sure.
The reason may also be the ObjectMapper: you should have getters and setters available for your block, otherwise Jackson may not be able to access its fields.
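A minimal sketch of the second suggestion, reusing the field names from the question: making the nested class static and giving it getters and setters is the usual way to make sure Jackson can see the fields.
// Sketch only: "static" lets Jackson (and Flink) instantiate the class on
// its own, and the getters expose the values for serialization.
public static class block {
    private Double block_id;
    private Double block_var2;
    private Double block_var3;
    private Double block_var4;

    public Double getBlock_id() { return block_id; }
    public void setBlock_id(Double block_id) { this.block_id = block_id; }
    public Double getBlock_var2() { return block_var2; }
    public void setBlock_var2(Double block_var2) { this.block_var2 = block_var2; }
    public Double getBlock_var3() { return block_var3; }
    public void setBlock_var3(Double block_var3) { this.block_var3 = block_var3; }
    public Double getBlock_var4() { return block_var4; }
    public void setBlock_var4(Double block_var4) { this.block_var4 = block_var4; }
}
Note also that map() in the question ends with return new block(), which emits a freshly constructed object whose fields are all null rather than the populated current_a; returning current_a would match the intent and would on its own explain the all-null output.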
I added a step in my application to persist files via GridFS and added a metadata field called "processed" to work as a flag for a scheduled task that retrieves each new file and sends it on for processing. Since the Java driver for GridFS doesn't have a method allowing metadata to be updated, I used the MongoCollection for the "fs.files" collection to update "metadata.processed" to true.
I use GridFSBucket.find(eq("metadata.processed", false)) to get the new files for processing and then update metadata.processed to true once processing is completed. This works if I add a new file while the application is running. However, if I have an existing file with "metadata.processed" set to false and start the application, the above find call returns no results. Similarly, if I have a file that was already processed and I set the "metadata.processed" field back to false, the above find call also stops working.
private static final String FILTER_STR = "'{'\"filename\" : \"{0}\"'}'";
private static final String UPDATE_STR =
        "'{'\"$set\": '{'\"metadata.processed\": \"{0}\"'}}'";

@Autowired
private GridFSBucketFactory gridFSBucketFactory;
@Autowired
private MongoCollectionFactory mongoCollectionFactory;

public void storeFile(String filename, DateTime publishTime,
        InputStream inputStream) {
    if (fileExists(filename)) {
        LOGGER.info("File named {} already exists.", filename);
    } else {
        uploadToGridFS(filename, publishTime, inputStream);
        LOGGER.info("Stored file named {}.", filename);
    }
}

public GridFSDownloadStream getFile(BsonValue id) {
    return gridFSBucketFactory.getGridFSBucket().openDownloadStream(id);
}

public GridFSDownloadStream getFile(String filename) {
    final GridFSFile file = getGridFSFile(filename);
    return file == null ? null : getFile(file.getId());
}

public GridFSFindIterable getUnprocessedFiles() {
    return gridFSBucketFactory.getGridFSBucket()
            .find(eq("metadata.processed", false));
}

public void setProcessed(String filename, boolean isProcessed) {
    final BasicDBObject filter =
            BasicDBObject.parse(format(FILTER_STR, filename));
    final BasicDBObject update =
            BasicDBObject.parse(format(UPDATE_STR, isProcessed));
    if (updateOne(filter, update)) {
        LOGGER.info("Set metadata for {} to {}", filename, isProcessed);
    }
}

private void uploadToGridFS(String filename, DateTime publishTime,
        InputStream inputStream) {
    gridFSBucketFactory.getGridFSBucket().uploadFromStream(filename,
            inputStream, createMetadata(publishTime));
}

private GridFSUploadOptions createMetadata(DateTime publishTime) {
    final Document metadata = new Document();
    metadata.put("processed", false);
    // metadata.put("publishTime", publishTime.toString());
    return new GridFSUploadOptions().metadata(metadata);
}

private boolean fileExists(String filename) {
    return getGridFSFile(filename) != null;
}

private GridFSFile getGridFSFile(String filename) {
    return gridFSBucketFactory.getGridFSBucket()
            .find(eq("filename", filename)).first();
}

private boolean updateOne(BasicDBObject filter, BasicDBObject update) {
    try {
        mongoCollectionFactory.getFsFilesCollection().updateOne(filter,
                update, new UpdateOptions().upsert(true));
    } catch (final MongoException e) {
        LOGGER.error(
                "The following failed to update, filter:{0} update:{1}",
                filter, update, e);
        return false;
    }
    return true;
}
Any idea what I can do to ensure that
GridFSBucket.find(eq("metadata.processed", false))
returns the proper results for existing files and/or files that have had their metadata changed?
The issue was due to setting the metadata.processed value as a String instead of a boolean.
When initially creating the metadata I set its value with a boolean:
private GridFSUploadOptions createMetadata(DateTime publishTime) {
    final Document metadata = new Document();
    metadata.put("processed", false);
    // metadata.put("publishTime", publishTime.toString());
    return new GridFSUploadOptions().metadata(metadata);
}
And later I check for a boolean:
public GridFSFindIterable getUnprocessedFiles() {
    return gridFSBucketFactory.getGridFSBucket()
            .find(eq("metadata.processed", false));
}
But when updating the metadata using the "fs.files" MongoCollection I incorrectly added quotes around the boolean value here:
private static final String UPDATE_STR =
        "'{'\"$set\": '{'\"metadata.processed\": \"{0}\"'}}'";
This caused the metadata value to be saved as a String instead of a boolean.
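The corrected template, following that explanation, simply drops the quotes so the MessageFormat substitution produces an unquoted boolean literal:
// Without quotes around {0}, "$set" receives true/false as a boolean,
// matching the boolean written by createMetadata() and queried by find().
private static final String UPDATE_STR =
        "'{'\"$set\": '{'\"metadata.processed\": {0}'}}'";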
I've tried to build a route that copies files from one directory to another directory. But instead of using:
from(file://source-directory).to(file://destination-directory)
I want to do something like this:
from(direct:start)
    .to(direct:doStuff)
    .to(direct:readDirectory)
    .to(file://destination-folder)
I've done the following stuff:
Route
@Component
public class Route extends AbstractRouteBuilder {

    @Override
    public void configure() throws Exception {
        from("direct:start")
                .bean(lookup(ReadDirectory.class))
                .split(body())
                .setHeader("FILENAME", method(lookup(CreateFilename.class)))
                .to("file:///path/to/my/output/directory/?fileName=${header.FILENAME}");
    }
}
Processor
@Component
public class ReadDirectory implements CamelProcessorBean {

    @Handler
    public ImmutableList<File> apply(@Header("SOURCE_DIR") final String sourceDir) {
        final File directory = new File(sourceDir);
        final File[] files = directory.listFiles();
        if (files == null) {
            return ImmutableList.copyOf(Lists.<File>newArrayList());
        }
        return ImmutableList.copyOf(files);
    }
}
I can start my route by using the following pseudo-test (the point is that I can manually start my route with producer.sendBodyAndHeaders(..)):
public class RouteIT extends StandardIT {

    @Produce
    private ProducerTemplate producer;

    @Test
    public void testRoute() throws Exception {
        final String uri = "direct:start";
        producer.sendBodyAndHeaders(uri, InOut, null, header());
    }

    private Map<String, Object> header() {
        final Map<String, Object> header = Maps.newHashMap();
        header.put("SOURCE_DIR", "/path/to/my/input/directory/");
        return header;
    }
}
AbstractRouteBuilder extends SpringRouteBuilder
CamelProcessorBean is only a marker interface
StandardIT loads the Spring context and related test infrastructure
The problem is that I must set the filename. I've read that Camel sets the header CamelFileNameProduced (at the file endpoint); it is a generic string with a timestamp, and if I don't set the filename, the written files get this generic string as their filename.
My question is: Is there a more elegant solution to copy files (starting with a direct endpoint and reading the directory in the middle of the route) while keeping the original filename for the destination? (I don't have to set the filename when I use from("file:source").to("file:destination"), so why must I do it now?)
You can set the file name when you send using the producer template; as long as the header is propagated during the routing between the routes you are fine, which Camel does by default.
For example
@Test
public void testRoute() throws Exception {
    final String uri = "direct:start";
    final Map<String, Object> headers = Maps.newHashMap();
    headers.put(Exchange.FILE_NAME, "myfile.txt");
    producer.sendBodyAndHeaders(uri, InOut, null, headers);
}
The file component documentation talks more about how to control the file name:
http://camel.apache.org/file2
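As a side note (an assumption, not part of the original answer): since the splitter body in the route above is a java.io.File, the file name can also be copied into Camel's standard header inside the route itself, which removes the need for a separate CreateFilename bean:
// Sketch: ${body.name} calls File.getName() via Camel's Simple language,
// and the file endpoint uses Exchange.FILE_NAME (CamelFileName) for output.
from("direct:start")
        .bean(lookup(ReadDirectory.class))
        .split(body())
        .setHeader(Exchange.FILE_NAME, simple("${body.name}"))
        .to("file:///path/to/my/output/directory/");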
I have an app where I filter messages according to some rules (the presence of certain keywords or regexps). These rules have to be stored in a .properties file (as they must be persistent). I've figured out how to read data from this file; here is the relevant part of the code:
public class Config {

    private static final Config ourInstance = new Config();
    private static final CompositeConfiguration prop = new CompositeConfiguration();

    public static Config getInstance() {
        return ourInstance;
    }

    public Config() {
    }

    public synchronized void load() {
        try {
            prop.addConfiguration(new SystemConfiguration());
            System.out.println("Loading /rules.properties");
            final PropertiesConfiguration p = new PropertiesConfiguration();
            p.setPath("/home/mikhail/bzrrep/DLP/DLPServer/src/main/resources/rules.properties");
            p.load();
            prop.addConfiguration(p);
        } catch (ConfigurationException e) {
            e.printStackTrace();
        }
        final int processors = prop.getInt("server.processors", 1);
        // If you don't see this line - likely config name is wrong
        System.out.println("Using processors:" + processors);
    }

    public void setKeyword(String customerId, String keyword) {
    }

    public void setRegexp(String customerId, String regexp) {
    }
}
As you can see, I'm going to add values to some properties. Here is the .properties file itself:
users = admin, root, guest
users.admin.keywords = admin
users.admin.regexps = test-5, test-7
users.root.keywords = root
users.root.regexps = *
users.guest.keywords = guest
users.guest.regexps =
I have a GUI for the user to add keywords and regexps to this config. So, how do I implement the methods setKeyword and setRegexp?
The easiest way I found is to read the current values of the property into a String[], add the new value there, and set the property back:
props.setProperty(fieldName, values);
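A minimal sketch of that idea for setKeyword, assuming Commons Configuration 1.x and the users.<id>.keywords key layout from the rules.properties above (setRegexp would be symmetrical):
public synchronized void setKeyword(String customerId, String keyword) {
    // Key layout assumed from the rules.properties shown above
    final String key = "users." + customerId + ".keywords";
    // Read the current values, append the new one, and write them back
    final List<Object> values = new ArrayList<Object>(prop.getList(key));
    values.add(keyword);
    prop.setProperty(key, values);
    // To keep the change persistent, also save the underlying
    // PropertiesConfiguration (e.g. keep a reference to p and call p.save()).
}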
What my application does is create a large CSV file (it's a report), and the idea is to deliver the contents of the CSV file without actually saving a file for it. Here's my code:
String csvData; //this is the string that contains the csv contents
byte[] csvContents = csvData.getBytes();
response.contentType = "text/csv";
response.headers.put("Content-Disposition", new Header(
        "Content-Disposition", "attachment;" + "test.csv"));
response.headers.put("Cache-Control", new Header("Cache-Control",
        "max-age=0"));
response.out.write(csvContents);
ok();
The CSV files that are being generated are rather large, and the error I am getting is:
org.jboss.netty.handler.codec.frame.TooLongFrameException: An HTTP line is larger than 4096 bytes.
What's the best way to overcome this issue?
My tech stack is Java 6 with Play framework 1.2.5.
Note: the origin of the response object is play.mvc.Controller.response
Please use a ServletOutputStream, like:
String csvData; //this is the string that contains the csv contents
byte[] csvContents = csvData.getBytes();
ServletOutputStream sos = response.getOutputStream();
response.setContentType("text/csv");
response.setHeader("Content-Disposition", "attachment; filename=test.csv");
sos.write(csvContents);
We use this to show the results of an action directly in the browser:
window.location='data:text/csv;charset=utf8,' + encodeURIComponent(your-csv-data);
I am not sure about the out-of-memory error, but I would at least try this:
request.format = "csv";
renderBinary(new ByteArrayInputStream(csvContents));
Apparently Netty complains that the HTTP header is too long; maybe it somehow thinks that your file is part of the header, see also
http://lists.jboss.org/pipermail/netty-users/2010-November/003596.html
As nylund states, using renderBinary should do the trick.
We use writeChunk ourselves to output large reports on the fly, like:
Controller:
public static void getReport() {
    final Report report = new Report(code, from, to);
    try {
        while (report.hasMoreData()) {
            final String data = await(report.getData());
            response.writeChunk(data);
        }
    } catch (final Exception e) {
        final Throwable cause = e.getCause();
        if (cause != null && cause.getMessage().contains("HTTP output stream closed")) {
            logger.warn(e, "user cancelled download");
        } else {
            logger.error(e, "error retrieving data");
        }
    }
}
In the report code:
public class Report {

    public Report(final String code, final Date from, final Date to) {
    }

    public boolean hasMoreData() {
        // find out if there is more data
    }

    public Future<String> getData() {
        final Job<String> queryJob = new Job<String>() {
            @Override
            public String doJobWithResult() throws Exception {
                // grab data (e.g. read from db) and return it
                return data;
            }
        };
        return queryJob.now();
    }
}