How can I read data from an XML file in a generic way? By generic I mean that if I change the XML file later, the output format should not be affected.
It should read the whole content of the XML file as key/value pairs.
You should try with a SAX Parser and put the key/value pairs in a Map.
http://docs.oracle.com/javase/7/docs/api/org/xml/sax/helpers/DefaultHandler.html
Using this Handler you can simply parse it from start to end.
See also here for an example.
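A minimal sketch of that idea, assuming the key should be the slash-separated path of each leaf element and the value its text. The class names XmlToMap and PathHandler are made up for illustration, and repeated paths simply overwrite each other, so a real solution would need a multimap or indexed keys:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class XmlToMap {
    /** Hypothetical handler: stores the text of every leaf element under its path. */
    static class PathHandler extends DefaultHandler {
        final Map<String, String> values = new LinkedHashMap<>();
        private final Deque<String> path = new ArrayDeque<>();
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            path.addLast(qName);   // descend one level
            text.setLength(0);     // start collecting text for this element
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times per element, so append rather than assign.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            String value = text.toString().trim();
            if (!value.isEmpty()) {
                values.put(String.join("/", path), value);
            }
            path.removeLast();     // go back up one level
            text.setLength(0);
        }
    }

    static Map<String, String> parse(String xml) throws Exception {
        PathHandler handler = new PathHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.values;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<config><db><host>localhost</host><port>5432</port></db>"
                   + "<name>demo</name></config>";
        parse(xml).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Because the handler never refers to specific tag names, adding or renaming elements in the XML file changes the keys in the map but not the code.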
I have a really big JSON file to read and store in a database. I am using Gson in mixed mode, combining streaming and object binding. If the file format is correct, it works like a charm. But if the format is wrong inside a single object, the whole file is skipped with an exception (reader.hasNext() throws an exception).
Is there a way to skip a particular bad record and continue reading the rest of the file?
Sample JSON file structure:
[{
"A":1,
"B":2,
"C":3
}]
Let's say a comma or colon is missing in this object.
Another example: there are multiple objects and the comma between two of them is missing, i.e. }{ instead of },{.
"Let's say a comma or colon is missing in this object"
Unfortunately if you're missing a comma or a colon, then it's impossible to parse the JSON data.
But it's actually a good thing that the parser doesn't accept this data, because it protects you from accidentally reading garbage. Since you are putting this data into a database, it protects you from potentially filling your database with garbage.
I believe the best solution is to fix the producer of this JSON data and implement the necessary safeguards to prevent bad JSON data in the future.
My question seems very similar to this question, but what happens if there are duplicate keys in the JSON file?
The duplicates are in the JSON file because its contents originate from Postgres, which allows duplicate values to be inserted in its older JSON format.
My input looks like this:
{
"61":{"value":5,"to_value":5},
"58":{"r":0,"g":0,"b":255}, "58":{"r":165,"g":42,"b":42},"58":{"r":0,"g":255,"b":0},
"63":{"r":0,"g":0,"b":0},
"57":{"r":0,"g":0,"b":255},"57":{"r":0,"g":255,"b":0}
}
If you look carefully, "58" appears as a key multiple times. The top-level keys such as "61" and "58" are mapped to nested maps with different keys.
An approach or a full solution is equally appreciated, in Java only.
To simplify what I want to achieve, the output for the above input JSON should look like this:
{
"61":[5,5],
"58": [{"r":0,"g":0,"b":255},{"r":165,"g":42,"b":42},{"r":0,"g":255,"b":0}],
"63":[{"r":0,"g":0,"b":0}],
"57":[{"r":0,"g":0,"b":255},{"r":0,"g":255,"b":0}]
}
A good tool for parsing JSON is this library: JSONObject
Here is an example of its usage in a previous SO question:
Parsing JSON which contains duplicate keys
Here is what I want to do. I have some text files like this:
<page>
<url>xxx.example.com</url>
<title>xxx</title>
<content>abcdef</content>
</page>
<page>
<url>yyy.example.com</url>
<title>yyy</title>
<content>abcdef</content>
</page>
...
I want to read the file split in the mapper and convert it to key-value pairs, where each value is the content of one <page> tag.
My problem is with the key. I could use the URLs as keys because they are globally unique. However, due to the context of my job, I want to generate a globally unique number as the key for each key-value pair. I know this somewhat goes against the horizontal scalability of Hadoop. But is there any solution to this?
If you're going to process such files with MapReduce, I'd take the following strategy:
Use the general text input format, line by line. As a result, each file (or each split of a large file) goes to a different mapper task.
In the mapper, build a loop that reads the following lines itself through context.nextKeyValue(), instead of relying on being called once per line.
Feed the lines to some syntax analyzer (maybe it's enough to read 6 non-empty lines, maybe you will use something like libxml); in the end you will get a number of objects.
If you intend to pass the objects you read to a reducer, you need to wrap them in something that implements the Writable interface.
To form keys, I'd use the UUID implementation java.util.UUID. Something like:
UUID key = UUID.randomUUID();
That's enough unless you're generating billions of records per second or your job takes 100 years. :-)
Just note that the UUID should probably be encoded in the ImmutableBytesWritable class, which is useful for such things.
That's all: context.write(key, object).
OK, your reducer (if any) and output format are another story. You will definitely need an output format that can store your objects if you don't convert them to something like Text during the mapping.
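The key generation step can be tried outside Hadoop. The sketch below uses only the JDK and shows one way to turn a random UUID into the 16 raw bytes that a byte-oriented writable would wrap. The class name UuidKeys and its helper methods are hypothetical, and wrapping the bytes in ImmutableBytesWritable or another Writable is left out because it depends on the libraries on your classpath:

```java
import java.nio.ByteBuffer;
import java.util.UUID;

class UuidKeys {
    /** Encodes a UUID as 16 bytes: most significant 8 bytes first. */
    static byte[] toBytes(UUID id) {
        return ByteBuffer.allocate(16)
            .putLong(id.getMostSignificantBits())
            .putLong(id.getLeastSignificantBits())
            .array();
    }

    /** Reverses toBytes; expects exactly the 16 bytes produced above. */
    static UUID fromBytes(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long most = buf.getLong();
        long least = buf.getLong();
        return new UUID(most, least);
    }

    public static void main(String[] args) {
        UUID key = UUID.randomUUID();
        byte[] raw = toBytes(key);
        System.out.println(key + " -> " + raw.length + " bytes");
        System.out.println("round trip ok: " + key.equals(fromBytes(raw)));
    }
}
```

Each mapper can call UUID.randomUUID() independently, so no coordination between tasks is needed. The trade-off is that the keys are 128-bit identifiers rather than sequential numbers.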
Not sure if this answers your question directly, but I am taking advantage of the input file format.
You might use NLineInputFormat and set N = 6, as each record encompasses 6 lines:
<page>
<url>xxx.example.com</url>
<title>xxx</title>
<content>abcdef</content>
</page>
With each record, the mapper would get the offset position in the file. This offset would be unique for each record.
PS: It would work only if the schema is fixed. I am doubtful if it would work properly for multiple input text files.
This is my XML file:
<waveform>
<Ivalue>12,13,14,15,16,17,18</Ivalue>
<IIvalue>1,4,15,23,22,44</IIvalue>
</waveform>
<waveform>
<Ivalue>12,13,14,15,16,17,18</Ivalue>
<IIvalue>1,4,15,23,22,44</IIvalue>
</waveform>
I know how to retrieve the values by tag, but is it possible to store these values in separate int[] arrays?
Thanks
You may use JAXB to extract elements like Ivalue as a String.
To my knowledge, it is at least not easy to get the value directly as an int array with JAXB.
However, it is easy to split the string using String.split and convert the results with Integer.parseInt.
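The conversion step might look like the sketch below. It assumes the element text has already been extracted as a String (by JAXB or any other parser); the class name CsvInts and the method toIntArray are made up for illustration:

```java
import java.util.Arrays;

class CsvInts {
    /** Converts text such as "12,13,14" into an int[]; blank input gives an empty array. */
    static int[] toIntArray(String csv) {
        String trimmed = csv.trim();
        if (trimmed.isEmpty()) {
            return new int[0];
        }
        // Split on commas, tolerating whitespace around them.
        String[] parts = trimmed.split("\\s*,\\s*");
        int[] result = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            result[i] = Integer.parseInt(parts[i]); // throws NumberFormatException on bad input
        }
        return result;
    }

    public static void main(String[] args) {
        int[] iValues = toIntArray("12,13,14,15,16,17,18");
        int[] iiValues = toIntArray("1,4,15,23,22,44");
        System.out.println(Arrays.toString(iValues));  // prints [12, 13, 14, 15, 16, 17, 18]
        System.out.println(Arrays.toString(iiValues)); // prints [1, 4, 15, 23, 22, 44]
    }
}
```

Calling toIntArray once per Ivalue and once per IIvalue gives a separate int[] for each element.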
<item>
<RelatedPersons>
<RelatedPerson>
<Name>xy</Name>
<Title>asd</Title>
<Address>abc</Address>
</RelatedPerson>
<RelatedPerson>
<Name>xy</Name>
<Title>asd</Title>
<Address>abc</Address>
</RelatedPerson>
</RelatedPersons>
</item>
I'd like to parse this data with a SAXParser. How can I do this?
I know the SAX tutorials and I can parse any normal RSS feed, but I can't parse this particular data.
Define your problem: What you can probably do is create a value object (POJO) called Person with the properties name, title and address. Your aim in parsing this XML would then be to create an ArrayList<Person>. Settling on a definite data structure helps you build logic around it.
Choose a parser: You can then use a SAX parser or an XML pull parser to walk through the tags; see this link for a tutorial on DOM, SAX and XML pull parsers in Android.
Data population logic: While parsing, whenever you encounter an opening <RelatedPerson> tag, instantiate a new Person object. When you encounter one of its property tags (<Name>, <Title>, <Address>), read the value and set it on this object. When you encounter the closing </RelatedPerson> tag, add the Person object to the ArrayList. Depending on the parser you use, you will need the appropriate methods to move to child or nested nodes (refer to the link for details).
By the time you are done parsing the last item node, you will have all the values in your ArrayList.
Note that this is more of a theoretical answer; I hope it helps.
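To make this concrete, here is a minimal sketch with the standard SAX API. The names Person, PersonHandler and RelatedPersonParser are made up for illustration, and the input is read from a String only to keep the example self-contained:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class RelatedPersonParser {
    /** Hypothetical value object for one <RelatedPerson> element. */
    static class Person {
        String name;
        String title;
        String address;
    }

    static class PersonHandler extends DefaultHandler {
        final List<Person> persons = new ArrayList<>();
        private Person current;                          // person being filled, or null
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if ("RelatedPerson".equals(qName)) {
                current = new Person();
            }
            text.setLength(0);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (current == null) {
                return;                                  // outside any <RelatedPerson>
            }
            String value = text.toString().trim();
            switch (qName) {
                case "Name":    current.name = value; break;
                case "Title":   current.title = value; break;
                case "Address": current.address = value; break;
                case "RelatedPerson":
                    persons.add(current);                // element complete
                    current = null;
                    break;
                default:
                    break;
            }
        }
    }

    static List<Person> parse(String xml) throws Exception {
        PersonHandler handler = new PersonHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.persons;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<item><RelatedPersons>"
            + "<RelatedPerson><Name>xy</Name><Title>asd</Title><Address>abc</Address></RelatedPerson>"
            + "<RelatedPerson><Name>xy</Name><Title>asd</Title><Address>abc</Address></RelatedPerson>"
            + "</RelatedPersons></item>";
        for (Person p : parse(xml)) {
            System.out.println(p.name + " / " + p.title + " / " + p.address);
        }
    }
}
```

The handler only reacts to the inner tags, so the wrapping <item> and <RelatedPersons> elements need no special handling.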