We are currently working on importing huge JSON files (~100 MB) into MongoDB using the Java driver. We currently split the files into smaller chunks, since we first ran into problems importing the whole file. Of course we are aware of MongoDB's limitation that the maximum document size is 16 MB; however, the chunks we are now importing are far smaller than that.
Strangely enough, the import procedure works when running it on Linux (Eclipse), yet the same program throws an exception stating "can't say something" on Windows (Eclipse).
Looking at the log from the database, the error message says
> "Thu Sep 13 11:38:48 [conn1] recv(): message len 1835627538 is too
> large1835627538"
Rerunning the import on the same dataset always leads to the same error message regarding the message length. We investigated the size of our documents to import (using .toString().length()); the chunk that caused the error was only a few kB large.
It makes no difference which OS the Mongo database runs on; it only depends on where the import code is executed (using the same java-mongo-driver).
"we are currently working on importing huge JSON files (~100 MB) into
MongoDB using the java driver"
Are we talking about a JSON file containing thousands of JSON objects, OR one JSON object that is ~100 MB in size? Because if I remember correctly, the 16 MB limit is per object, not per JSON file containing thousands of JSON objects.
Also!
"Thu Sep 13 11:38:48 [conn1] recv(): message len 1835627538 is too
large1835627538"
the chunk that caused the error was only some kB large.
If 1835627538 is indeed in kB, that is pretty big, because that's around ~1,750 gigabytes!!
To get around a JSON file containing thousands of JSON objects, why don't you iterate through your data file line by line and do your inserts that way? With this method it doesn't matter how large your data file is; the iterator is just a pointer to a specific line. It doesn't load the WHOLE FILE into memory and insert it.
NOTE: This assumes your data file contains one JSON object per line.
Using Apache Commons IO FileUtils, you can use its LineIterator to iterate through your file, for example (you will still need to import the correct libs):
LineIterator line_iter = null;
try {
    line_iter = FileUtils.lineIterator(data_file);
    while (line_iter.hasNext()) {
        String line = line_iter.nextLine();
        // skip blank lines and anything that isn't a JSON object
        if (!line.isEmpty() && line.charAt(0) == '{') {
            this.mongodb.insert(line);
        }
    }
} catch (IOException e) {
    e.printStackTrace();
} finally {
    LineIterator.closeQuietly(line_iter); // close the iterator
}
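One caveat on the snippet above: with the legacy Java driver, insert() expects a DBObject rather than a raw String, so each line usually has to be parsed first. A rough sketch, assuming the old com.mongodb.util.JSON helper and a DBCollection called collection (my names, not part of the original code):

DBObject doc = (DBObject) com.mongodb.util.JSON.parse(line); // parse one line of JSON
collection.insert(doc);                                      // insert it as a single document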
Related
I have a Java class that parses an XML file and writes its content to MySQL. Everything works fine, but the problem is that when the XML file contains invalid Unicode characters, an exception is thrown and the program stops parsing the file.
My provider sends this XML file on a daily basis with a list of products with their price, quantity, etc., and I have no control over this, so invalid characters will always be there.
All I'm trying to do is catch these errors, ignore them, and continue parsing the rest of the XML file.
I've added try-catch statements to the startElement, endElement and characters methods of the SAXHandler class; however, they don't catch any exception and execution stops whenever the parser finds an invalid character.
It seems that I can only catch these exceptions from the method that calls the parser:
try {
myIS = new FileInputStream(xmlFilePath);
parser.parse(myIS, handler);
retValue = true;
} catch(SAXParseException err) {
System.out.println("SAXParseException " + err);
}
However, that's useless in my case: even though the exception tells me where the invalid character is, execution stops, so the list of products is far from complete. This list has about 8,000 products and only a couple of invalid characters; if an invalid character occurs within the first 100 products, then all of the remaining 7,900 products are not updated in the database. I've also noticed that the endDocument method is not called if an exception occurs.
Somebody asked the same question here some years ago, but didn't get any solution.
I'd really appreciate any ideas or workarounds for this.
Data Sample (as requested):
<Producto>
<Brand>
<Description>Epson</Description>
<ManufacturerId>eps</ManufacturerId>
<BrandId>eps</BrandId>
</Brand>
<New>false</New>
<OnSale>null</OnSale>
<Type>Physical</Type>
<Description>Epson TM T88V - Impresora de recibos - línea térmica - rollo 8 cm - hasta 300 mm/segundo - paralelo, USB</Description>
<Category>
<CategoryId>pos</CategoryId>
<Description>Puntos de Venta</Description>
<Subcategories>
<CategoryId>pos.printer</CategoryId>
<Description>Impresoras para Recibos</Description>
</Subcategories>
</Category>
<InStock>0</InStock>
<Price>
<UnitPrice>4865.6042</UnitPrice>
<CurrencyId>MXN</CurrencyId>
</Price>
<Manufacturer>
<Description>Epson</Description>
<ManufacturerId>eps</ManufacturerId>
</Manufacturer>
<Mpn>C31CA85814</Mpn>
<Sku>PT910EPS27</Sku>
<CompilationDate>2020-02-25T12:30:14.6607135Z</CompilationDate>
</Producto>
The XML philosophy is that you don't process bad data. If it's not well-formed XML, the parser is supposed to give up, and user applications are supposed to give up. Culturally, this is a reaction against the HTML culture, where it was found that if it's generally expected that data users will tolerate bad data, the consequence is that suppliers will produce bad data.
Standards deliver cost reduction because you can use readily available off-the-shelf tools both for creating valid data and for reading it at the other end. The benefits are totally neutralised if you decide you're going to interchange things that are almost XML but not quite. If you were downloading software you wouldn't put up with it if it didn't compile. So why are you prepared to put up with bad data? Send it back and demand a refund.
Having said that, if the problem is "invalid Unicode characters" then it's possible that it started out as good XML and got corrupted in transit. Find out what went wrong and get it fixed as close to the source of the problem as you can.
I solved it by removing the invalid characters from the XML file before processing it.
I couldn't do what I was originally trying to do (catch the error and continue), but this workaround worked.
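For reference, that pre-processing step can look roughly like the sketch below. It rests on my own assumptions: that "invalid" means characters outside the XML 1.0 character range, that the feed is UTF-8, and that the file fits in memory; the class and method names are made up:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class XmlSanitizer {

    // Everything NOT in the XML 1.0 character range:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    private static final String INVALID_XML_CHARS =
            "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]";

    public static Path sanitize(Path dirtyXml, Path cleanXml) throws Exception {
        String content = new String(Files.readAllBytes(dirtyXml), StandardCharsets.UTF_8);
        String cleaned = content.replaceAll(INVALID_XML_CHARS, "");
        Files.write(cleanXml, cleaned.getBytes(StandardCharsets.UTF_8));
        return cleanXml; // hand this file to the SAX parser instead of the original
    }

    public static void main(String[] args) throws Exception {
        sanitize(Paths.get("products-dirty.xml"), Paths.get("products-clean.xml"));
    }
}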
Summary
I need to build a set of statistics during a Camel server in-modify-out process and emit those statistics as one object (a single JSON log line).
Those statistics need to include:
input file metrics (size/chars/bytes and other, file-section specific measures)
processing time statistics (start/end/duration of processing time, start/end/duration of metrics gathering time)
output file metrics (same as input file metrics, and will be different numbers, output file being changed)
The output file metrics are the problem, as I can't access the file until it's written to disk, and
it's not written to disk until 'process'ing finishes.
Background
A log4j implementation is being used for service logging, but after some tinkering we realised it really doesn't suit the requirement here, as it would output multi-line JSON and embed the JSON in a single top-level field. We need varying top-level fields, depending on the file processed.
The server is expected to deal with multiple file operations asynchronously, and the files vary in size (from tiny to fairly immense - which is one reason we need to iterate on stats and measures before we start to tune or review).
Current State
Input file stats and even the processing-time stats are working OK, and I'm using the following technique to get them:
Inside the 'process' override method of "MyProcessor" I create a new instance of my JsonLogWriter class (shortened pseudocode with ellipses):
import org.apache.camel.Exchange;
import org.apache.camel.Processor;
...

@Component
public class MyProcessor implements Processor {
    ...
    @Override
    public void process(Exchange exchange) throws Exception {
        ...
        JsonLogWriter jlw = new JsonLogWriter();
        jlw.logfilePath = jsonLogFilePath;
        jlw.inputFilePath = inFilePath;
        jlw.outputfilePath = outFilePath;
        ...
        jlw.metricsInputFile(); // gathers metrics using inputFilePath - OK
        ...
        // input file is processed / changed and returned as an InputStream:
        InputStream result = myEngine.readAndUpdate(inFilePath);
        // ... get timings
        jlw.write();
    }
From this you can see that JsonLogWriter has:
properties for the file paths (input file, output file, log output),
a set of methods to populate data,
and a method to emit the data to a file (once ready).
Once I have populated all the JSON objects in the class, I call the write() method; the class pulls all the JSON objects together and
the stats all arrive in a log file (in a single line of JSON) - OK.
Error - no output file (yet)
If I use the metricsOutputFile method however:
InputStream result = myEngine.readAndUpdate(inFilePath);
// ... get timings
jlw.metricsOutputFile(); // using outputfilePath
jlw.write();
}
... the JsonLogWriter fails as the file doesn't exist yet.
java.nio.file.NoSuchFileException: aroute\output\a_long_guid_filename
When debugging, I can't see any part of the exchange or result objects that I might pipe into a file-read/statistics-gathering step.
Will this require more Camel routes to solve? What might be an alternative approach where I can get all the stats from the input and output files and keep them in one object / line of JSON?
(very happy to receive constructive criticism - as in why is your Java so heavy-handed - and yes it may well be, I am prototyping solutions at this stage, so this isn't production code, nor do I profess deep understanding of Java internals - I can usually get stuff working though)
Use one route and two processors: one for writing the file and the next for reading it, so the first finishes writing before the second starts reading.
Or you can use two routes: one that writes the file (to:file) and another that listens for and reads the file (from:file).
You can check the common EIP patterns that will solve most of these questions here:
https://www.enterpriseintegrationpatterns.com/patterns/messaging/
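A minimal sketch of the first suggestion (one route, two processors). The endpoint URIs and the MetricsProcessor class are placeholders of mine, not names from the question:

import org.apache.camel.builder.RouteBuilder;

public class StatsRoute extends RouteBuilder {
    @Override
    public void configure() throws Exception {
        from("file:aroute/input")             // pick up the input file
            .process(new MyProcessor())       // transform; gather input-file and timing stats
            .to("file:aroute/output")         // the output file is fully written at this point
            .process(new MetricsProcessor()); // hypothetical second processor: the file now
                                              // exists on disk, so output-file metrics and the
                                              // final write() can happen here
    }
}

If I remember correctly, the file producer also sets a CamelFileNameProduced header on the exchange, which the second processor can use to locate the file it should measure.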
I wrote a tool to convert data in an Excel file to JSON.
I only have 100 records, a total of 1,000 cells of data.
The JSON that gets created is a whopping 265 MB.
Data Sample: കേന്ദ്ര തിരഞ്ഞെടുപ്പ് കമ്മീഷൻ എല്ലാ
തിരഞ്ഞെടുപ്പുകൾക്കും ഉപയോഗിക്കുന്നത് മൈസൂർ പെയിന്റ്സ് ആന്റ് വാർണ
The code is nothing special: Object.toJsonString(), written out using a FileWriter.
In the JSON output file I can observe lots of \u200D and \u200C.
The Excel file, by contrast, is merely 40 KB.
Please help; I am expecting a file size of < 1 MB.
Part of the JSON output is
െ അമേരിക്കന്\u200D പ്രസിഡന്റ്? ","options":["എബ്രഹാംലിങ്കണ്\u200D","ജോര്\u200Dജ് വാഷിംഗ്ടണ്\u200D","തോമസ് ജഫേഴ്\u200Cസണ്\u200D","കെന്നഡി"]}]}]}]}
I would like to remove the \u200D and \u200C characters.
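For what it's worth, a minimal sketch of stripping those two code points from each cell value before it is serialized (the class and method names are mine; note that removing ZWJ/ZWNJ may change how the Malayalam text renders, so only do this if that is acceptable):

import java.util.regex.Pattern;

public class ZeroWidthStripper {

    // U+200C = zero-width non-joiner, U+200D = zero-width joiner
    private static final Pattern ZERO_WIDTH = Pattern.compile("[\\u200C\\u200D]");

    // Strip the zero-width characters from one cell value before it goes into the JSON object.
    public static String strip(String cellValue) {
        return cellValue == null ? null : ZERO_WIDTH.matcher(cellValue).replaceAll("");
    }
}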
I have a Java program that gets BLOB data from the database and then emails this file to a specific email address. My problem is that I have to use some framework functions (I can make DB calls only through these) and I think they don't handle BLOB datatypes... All I can get is the string representation of the result; this is the code (framework call) and the log line it produces:
String s = String.valueOf(result.get(j).getValue("BLOB_DATA"));
System.out.println(s);
Log result:
<binary data> 50 KB
So this is the data I have to convert SOMEHOW into a valid PDF file, but right now I'm stuck...
Is it even possible to convert it into a valid byte[]? I've tried several ways, but all I get are invalid files... :(
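In case it helps, here is a hedged sketch of the direction I would try first. It only works if the framework's getValue() actually hands back the raw bytes (a byte[] or a java.sql.Blob) rather than a pre-rendered string; the instanceof branches and the output path are my assumptions, and the fragment would live inside a method that declares the checked exceptions:

Object raw = result.get(j).getValue("BLOB_DATA");

byte[] pdfBytes;
if (raw instanceof byte[]) {
    pdfBytes = (byte[]) raw;                          // already raw bytes
} else if (raw instanceof java.sql.Blob) {
    java.sql.Blob blob = (java.sql.Blob) raw;
    pdfBytes = blob.getBytes(1, (int) blob.length()); // BLOB positions are 1-based
} else {
    throw new IllegalStateException("Unexpected type: " + raw.getClass());
}

java.nio.file.Files.write(java.nio.file.Paths.get("attachment.pdf"), pdfBytes);

If the framework really only exposes that "<binary data> 50 KB" string, it is most likely a placeholder toString() and the actual bytes are not in it, so no string-level conversion will recover the PDF.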
I was digging into the Apache POI API, trying to see which properties it can fetch out of an MSG file.
I parsed the MSG file using POIFSChunkParser.
Here is the code:
try
{
InputStream is = new FileInputStream("C:\\path\\email.msg");
POIFSFileSystem poifs = new POIFSFileSystem(is);
POIFSChunkParser poifscprsr = new POIFSChunkParser();
ChunkGroup[] chkgrps = poifscprsr.parse(poifs);
for(ChunkGroup chunkgrp : chkgrps )
{
for(Chunk chunk : chunkgrp.getChunks())
{
System.out.println(chunk.getEntryName() + " ("
+ chunk.getChunkId() + ") " + chunk);
}
}
}
catch(FileNotFoundException fnfe)
{
System.out.println(fnfe.getMessage());
}
catch(IOException ioe)
{
System.out.println(ioe.getMessage());
}
In the output it listed all accessible properties of the MSG. One of them looked like this:
__substg1.0_800A001F (32778) 04
I tried to find out the significance of the property with hex ID 800A here. (The subnodes of this topic list the properties.)
Q1. However, I didn't find a property corresponding to hex 800A. So what should I infer?
Also, I have some other, somewhat related questions:
Q2. Does Apache POI expose all properties through MAPIMessage? (I tried exploring all the methods of MAPIMessage too and started thinking it does not.)
Q3. If not, is there any other way to access all MAPI properties in Java, with or without Apache POI?
First up, be a little wary of using the very low-level HSMF classes if you're not following the Apache POI Dev List. There have been some updates to HSMF fairly recently to start adding support for fixed-length properties, and more are needed. Generally the high-level classes will have a pretty stable API (even with the scratchpad warnings), while the lower-level ones can (and sometimes do) change as new support gets added. If you're not on the dev list, this might be a shock...
Next up - working out what stuff is. This is where the HSMF Dev Tools come in. The simple TypesLister will let you check all the types that POI knows about (slightly more than it supports), while HSMFDump will do its best to decode the file for you. If your chunk is of any kind of known type, between those two you can hopefully work out what it is and what it contains.
Finally - getting all properties. As alluded to above, Apache POI used to only support variable-length properties in .msg files. That has partly been corrected, with some fixed-length support in there too, but more work is needed. Volunteers welcomed on the Dev List! MAPIMessage will give you all the common bits, but will also give you access to the different chunk groups. (A given message will be spread across a few different chunks, such as the main one, recipient ones, attachment ones, etc.) From there, you can get all the properties, along with the PropertiesChunk which gives access to the fixed-length properties.
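For the "common bits" route, a minimal sketch of the higher-level API; the file path is reused from the question's snippet, and the getters shown are just the ones I happen to know about, not an exhaustive list:

import org.apache.poi.hsmf.MAPIMessage;
import org.apache.poi.hsmf.exceptions.ChunkNotFoundException;

public class MsgSummary {
    public static void main(String[] args) throws Exception {
        MAPIMessage msg = new MAPIMessage("C:\\path\\email.msg");
        try {
            System.out.println("From:    " + msg.getDisplayFrom());
            System.out.println("To:      " + msg.getDisplayTo());
            System.out.println("Subject: " + msg.getSubject());
            System.out.println("Body:    " + msg.getTextBody());
        } catch (ChunkNotFoundException cnfe) {
            // thrown when the message simply doesn't contain that chunk
            System.out.println("Missing chunk: " + cnfe.getMessage());
        }
    }
}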