I have a really big JSON file to read and store into a database. I am using a mix of stream and object mode with Gson. If the file format is correct it works like a charm, but if the format is not correct within an object then the whole file is skipped with an exception (reader.hasNext() throws the exception).
Is there a way to skip a particular bad record and continue reading the rest of the file?
Sample JSON file structure:
[{
"A":1,
"B":2,
"C":3
}]
and let's say a comma or colon is missing in this object.
Another example: there are multiple objects and the comma between two objects is missing, i.e. } (no comma) {.
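For reference, the mixed stream/object pattern I'm using looks roughly like this (the Record class and file name are simplified placeholders):
import com.google.gson.Gson;
import com.google.gson.stream.JsonReader;
import java.io.FileReader;

// Minimal sketch of the mixed stream/object Gson pattern described above.
// Record is a placeholder POJO with fields A, B, C; "records.json" is a placeholder path.
public class StreamRead {
    static class Record { int A; int B; int C; }

    public static void main(String[] args) throws Exception {
        Gson gson = new Gson();
        try (JsonReader reader = new JsonReader(new FileReader("records.json"))) {
            reader.beginArray();
            while (reader.hasNext()) {   // this is where the exception surfaces on bad input
                Record record = gson.fromJson(reader, Record.class);
                // store the record into the database here
            }
            reader.endArray();
        }
    }
}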
let's say a comma or colon is missing in this object
Unfortunately if you're missing a comma or a colon, then it's impossible to parse the JSON data.
But:
it's actually a good thing that the parser doesn't accept this data, because it protects you from accidentally reading garbage. Since you are putting this data into a database, it also protects you from potentially filling that database with garbage.
I believe the best solution is to fix the producer of this JSON data and implement the necessary safeguards to prevent bad JSON data in the future.
I was tasked with comparing the data size and processing speed (the time to create the data) of equivalent data in JSON format vs. Protobuf, in Java.
For JSON, I'm using Jackson. I created a Subscriptions class with a List<HashMap<String,String>> field called subscriptionList, where each HashMap corresponds to one subscription. I am reading from a file whose lines are "|"-delimited, with 523 fields per line. For each field I put an entry into the subscription's HashMap, with the column name as the key and the column value as the value. I loop through each line to create all 1000 subscriptions, put them into an ArrayList<HashMap<String,String>>, then create a Subscriptions object and set subscriptionList to the ArrayList with the 1000 subscriptions. Finally, I convert the Subscriptions object to a JSON string, write it to a text file, and measure the size of that text file; that is how I measure the size of the data.
For Protobuf, the .proto file looks something like this:
message Subscriptions {
repeated Subscription subscription = 1;
}
message Subscription {
map<string, string> attr = 1;
}
I loop through each line and each column again, creating 1000 Subscription messages, and repeatedly add them to a Subscriptions message. I then use the getSerializedSize() method, and that's how I measure the size of the Protobuf data.
Currently, the two formats give me basically the same data sizes, and I don't understand why. Protobuf messages are supposed to be inherently more compact, and they're known to be less space-demanding than raw JSON. I don't know what I am doing wrong, and I have run out of ideas to try.
You won't get much of a difference between a map turned into a JSON object and the same map turned into a Protobuf message. For example, in Protobuf there are no quotes around object keys and no braces around the first-level message, but this doesn't make much of a difference, especially in large datasets.
You will see a difference in Protobuf when using packed repeated fields (primitive types only), since those are encoded as binary values and therefore shrink considerably compared to the UTF-8 text encoding in JSON.
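As a rough sanity check on the first point, you can compare the serialized size of one map-backed message with the UTF-8 length of the equivalent JSON. A minimal sketch (the Subscription class is the one generated from the .proto above; its import, the class name, and the sample values are assumptions):
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

import com.fasterxml.jackson.databind.ObjectMapper;
// import for the generated Subscription class omitted; its package depends on the .proto's java_package

// Rough size-comparison sketch; class names and sample values are assumptions.
public class SizeCheck {
    public static void main(String[] args) throws Exception {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("column1", "value1");
        row.put("column2", "value2");

        // JSON size: length of the UTF-8 bytes of the string Jackson produces.
        String json = new ObjectMapper().writeValueAsString(row);
        int jsonBytes = json.getBytes(StandardCharsets.UTF_8).length;

        // Protobuf size: serialized size of the equivalent map-backed message.
        Subscription subscription = Subscription.newBuilder().putAllAttr(row).build();
        int protoBytes = subscription.getSerializedSize();

        System.out.println("JSON bytes:  " + jsonBytes);
        System.out.println("Proto bytes: " + protoBytes);
    }
}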
My question seems very similar to this question, but what happens if there are duplicate keys in the JSON file?
The duplicate keys appear in the JSON file because its contents originate from Postgres, which allowed duplicate keys to be inserted in older JSON-format files.
My input looks like this:
{
"61":{"value":5,"to_value":5},
"58":{"r":0,"g":0,"b":255}, "58":{"r":165,"g":42,"b":42},"58:{"r":0,"g":255,"b":0},
"63":{"r":0,"g":0,"b":0},
"57":{"r":0,"g":0,"b":255},"57":{"r":0,"g":255,"b":0}
}
If you look carefully, the key "58" appears multiple times. The top-level keys such as "61" and "58" are mapped to nested map types with different keys.
Now, to simplify what I want to achieve, the output for the above input should look like the following. An approach or a solution is equally appreciated, in Java only.
{
"61":[5,5],
"58": [{"r":0,"g":0,"b":255},{"r":165,"g":42,"b":42},{"r":0,"g":255,"b":0}],
"63":[{"r":0,"g":0,"b":0}],
"57":[{"r":0,"g":0,"b":255},{"r":0,"g":255,"b":0}]
}
A good tool for parsing JSON is this library: JSONObject
Here is an example of its usage in a previous SO question:
Parsing JSON which contains duplicate keys
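Alternatively, if you want to keep every duplicate rather than have a parser overwrite or reject them, a streaming parser lets you handle each repeated key yourself. A minimal sketch using Gson's JsonReader instead of JSONObject (the input string is shortened, and turning the "61" object into [5,5] is left as a further step):
import com.google.gson.JsonElement;
import com.google.gson.JsonParser;
import com.google.gson.stream.JsonReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Streaming sketch: every occurrence of a duplicate key is collected into a list
// instead of being overwritten. The input string here is a shortened sample.
public class DuplicateKeyMerge {
    public static void main(String[] args) throws Exception {
        String json = "{\"58\":{\"r\":0,\"g\":0,\"b\":255},\"58\":{\"r\":165,\"g\":42,\"b\":42}}";

        Map<String, List<JsonElement>> merged = new LinkedHashMap<>();
        try (JsonReader reader = new JsonReader(new StringReader(json))) {
            reader.beginObject();
            while (reader.hasNext()) {
                String key = reader.nextName();
                // Parse the nested object for this key; duplicates land in the same list.
                JsonElement value = JsonParser.parseReader(reader);
                merged.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
            }
            reader.endObject();
        }
        System.out.println(merged);  // {58=[{"r":0,"g":0,"b":255}, {"r":165,"g":42,"b":42}]}
    }
}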
I have a problem parsing a String I got from a MongoCursor so I can work with it as a JsonNode. I'm trying to get the JSON returned by the MongoCursor to work with my Spring SQL POJO, so I can insert it into my SQL database. Basically this is a database conversion, and the SQL end is just for history storage. I didn't use Spring's Mongo support, because the fields are somewhat different from the POJO's (MongoDB and SQL have slightly different schemas).
Currently, it works by using a pattern matcher, string split, and replace, then putting the pieces into a HashMap so I can get a key-value pair for each field and insert that into my Spring POJO. I know I could also map directly to a Jackson POJO, but I was told that using JsonNode is a better solution. There must be something I'm missing.
In the Jackson docs, the format of a "json" string is:
{ \"color\" : \"Black\", \"type\" : \"BMW\" }
However, that is not what the MongoCursor returns to me. With the cursor, I get something like:
Document{{_id=G8HQW9123, User=test}}
which I used a string pattern matcher and replaceAll to reduce to:
{_id:G8HQW9123, User:test}
However, Jackson's slashes and double quotes are throwing me off, and it is unable to parse that. Am I missing something, or do I have to actually add those slashes and quotes in my code to make things work? Currently I'm getting a parse error that asks for double quotes.
I think you're missing something here.
MongoCursor is returning you a Document Object, not a String.
Are you calling Document.toString() and working with the String result?
There should be no need for you to do any String parsing at all. You should be able to take the Document object from Mongo and call its getters to pull out the fields you need, which also preserves their data types (strings, numbers, booleans, and dates). For example, see all the methods in the BsonDocument class Javadocs: https://mongodb.github.io/mongo-java-driver/3.4/javadoc/org/bson/BsonDocument.html
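A minimal sketch of that approach, assuming a MongoCursor<Document> and that _id is stored as a string (the SQL insert itself is left out):
import org.bson.Document;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.mongodb.client.MongoCursor;

// Minimal sketch: no manual String parsing. Field names come from the example
// above; the assumption is that _id is stored as a string.
public class MongoToSqlSketch {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    static void copy(MongoCursor<Document> cursor) throws Exception {
        while (cursor.hasNext()) {
            Document doc = cursor.next();

            // Option 1: read fields directly, keeping their BSON types.
            String id = doc.getString("_id");
            String user = doc.getString("User");

            // Option 2: if a JsonNode is really needed, serialize with toJson()
            // (valid JSON) instead of toString(), then parse with Jackson.
            JsonNode node = MAPPER.readTree(doc.toJson());

            // ... map id/user (or node) onto the SQL POJO and insert here ...
        }
    }
}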
I'm trying to parse data obtained via Apache HTTPClient in the fastest and most efficient way possible.
The data returned by the response is a string, but in a CSV-like format:
e.g. the String looks like this:
date, price, status, ...
2014-02-05, 102.22, OK,...
2014-02-05, NULL, OK
I thought about taking the string and manually parsing it, but this may be too slow as I have to do this for multiple requests.
Also, the data returned is about 23,000 lines from one source, and I may potentially have to parse several sources.
I'm also storing the data in a hash map of type:
Map<String, Map<String, MyObject>>
where the outer key is the source name, and the value is a map whose values are the parsed objects.
So I have two questions: what is the best way to parse a 23,000-line file into objects, and what is the best way to store them?
I tried a CSV parser; however, missing double values are stored as NULL rather than 0, so I will need to parse them manually.
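For concreteness, the kind of manual parse I have in mind looks roughly like this (MyObject is a simplified placeholder, and treating NULL as 0 is the behaviour I want):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the manual parse described above. MyObject is a placeholder for the
// real value type; "NULL" price fields default to 0.0 instead of null.
public class CsvParseSketch {
    static class MyObject {
        final String date;
        final double price;
        final String status;
        MyObject(String date, double price, String status) {
            this.date = date; this.price = price; this.status = status;
        }
    }

    static Map<String, MyObject> parse(List<String> lines) {
        Map<String, MyObject> byDate = new HashMap<>();
        for (int i = 1; i < lines.size(); i++) {            // skip the header row
            String[] cols = lines.get(i).split(",\\s*");
            double price = "NULL".equals(cols[1]) ? 0.0 : Double.parseDouble(cols[1]);
            byDate.put(cols[0], new MyObject(cols[0], price, cols[2]));
        }
        return byDate;
    }

    public static void main(String[] args) throws IOException {
        Map<String, Map<String, MyObject>> bySource = new HashMap<>();
        bySource.put("source1", parse(Files.readAllLines(Paths.get("source1.csv"))));
        System.out.println(bySource.get("source1").size());
    }
}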
Thanks
Here is what I want to do. Now I have some text files like this:
<page>
<url>xxx.example.com</url>
<title>xxx</title>
<content>abcdef</content>
</page>
<page>
<url>yyy.example.com</url>
<title>yyy</title>
<content>abcdef</content>
</page>
...
I want to read the file split in a mapper and convert it to key-value pairs, where each value is the content of one <page> element.
My problem is the key. I could use the URLs as keys because they are globally unique. However, due to the context of my job, I want to generate a globally unique number as the key for each key-value pair. I know this somewhat goes against the horizontal scalability of Hadoop, but is there any solution?
If you're going to process such files with MapReduce, I'd take the following strategy:
Use the general text input format, line by line. As a result, each different file goes to a different mapper task.
In the mapper, build a loop that reads the next lines via context.nextKeyValue() instead of being called once per line.
Feed the lines to some syntax analyzer (maybe it is enough to just read 6 non-empty lines, maybe you will use something like libxml), but in the end you will get a number of objects.
If you intend to pass the objects you read to a reducer, you need to wrap them in something that implements the Writable interface.
To form keys I'd use the UUID implementation in java.util.UUID. Something like:
UUID key = UUID.randomUUID();
It's enough if you're not generating billions of records per second and your job does not take 100 years. :-)
Just note: the UUID should probably be encoded in an ImmutableBytesWritable, which is useful for such things.
That's all: context.write(key, object).
OK, your reducer (if any) and the output format are another story. You will definitely need an output format to store your objects if you don't convert them to something like Text during the map phase.
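A minimal sketch of such a mapper, as a simplified per-line variant of the loop described above (the class name and the line-accumulation detail are assumptions):
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: accumulate the lines of one <page>...</page> block and emit it under a
// random UUID key. No handling of malformed input is included.
public class PageMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Text> {

    private final StringBuilder page = new StringBuilder();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String trimmed = line.toString().trim();
        page.append(trimmed).append('\n');
        if (trimmed.equals("</page>")) {
            // One complete <page> block collected: key it with a UUID.
            byte[] key = UUID.randomUUID().toString().getBytes(StandardCharsets.UTF_8);
            context.write(new ImmutableBytesWritable(key), new Text(page.toString()));
            page.setLength(0);
        }
    }
}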
Not sure if this answers your question directly, but I am taking advantage of the input file format.
You could use NLineInputFormat and set N = 6, since each record encompasses 6 lines:
<page>
<url>xxx.example.com</url>
<title>xxx</title>
<content>abcdef</content>
</page>
...
With each record, the mapper would get the offset position in the file. This offset would be unique for each record.
PS: This would work only if the schema is fixed. I am doubtful whether it would work properly across multiple input text files.
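For reference, a minimal sketch of the driver-side setup for this approach (the job name is an assumption, and the mapper/output wiring is omitted):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

// Driver-side sketch: limit each input split to 6 lines so one map task handles
// exactly one <page> record (map() is still called once per line).
public class PageJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "page-import");
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 6);
        // ... set mapper, output format, and input/output paths here ...
    }
}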