I have the following data in a CSV file:
video1duration,video2duration,video3duration
00:01:00, 00:00:24, 00:00:15
00:01:00, 00:00:24, 00:00:15
00:01:00, 00:00:24, 00:00:15
The file is stored locally in a folder on my computer.
I need help with writing code to do the following:
- take the path of the CSV file, read its contents as the actual data, and then validate each cell/value against expected data that will be written in the IDE as follows:
video1duration,video2duration,video3duration
00:02:00, 00:05:24, 00:00:15
00:04:00, 00:10:24, 00:00:15
00:01:00, 00:00:24, 00:00:15
As I understand your question, you have a two-stage process. Trying to merge these two separate stages into one will certainly result in less legible and harder-to-maintain code (everything as one giant package/class/function).
Your first stage is to import a .csv file and parse it using any of these three methods:
Using java.util.Scanner
Using the String.split() method
Using a third-party library such as OpenCSV
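For example, here is a minimal sketch of the String.split() approach (it assumes every row is a plain comma-separated line with no quoted fields; the class name is made up):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Reads a CSV file into a list of String[] rows (header row included).
public class CsvReader {

    static List<String[]> read(String path) throws IOException {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Split on commas and trim the whitespace around each cell.
                String[] cells = line.split(",");
                for (int i = 0; i < cells.length; i++) {
                    cells[i] = cells[i].trim();
                }
                rows.add(cells);
            }
        }
        return rows;
    }
}
```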
It is possible to validate that your .csv is well formed and contains tabular data without knowing or caring what the data will later be used for.
In the second stage, take the tabular data (e.g. an array of arrays) and turn it into a tree. At this point your hierarchy package will be doing validation, but it will only be validating the tree structure (e.g. every node except the root has exactly one parent, etc.). If you want to delve further, this might be interesting: https://www.npmjs.com/package/csv-file-validator.
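For the comparison itself, a hedged sketch that checks each cell of the parsed file against expected values hard-coded in the IDE (the expected rows below are the ones from the question; CsvReader is the hypothetical helper sketched above):

```java
import java.util.List;

// Compares parsed CSV rows against expected values and reports any mismatching cells.
public class CsvValidator {

    static final String[][] EXPECTED = {
        {"video1duration", "video2duration", "video3duration"},
        {"00:02:00", "00:05:24", "00:00:15"},
        {"00:04:00", "00:10:24", "00:00:15"},
        {"00:01:00", "00:00:24", "00:00:15"}
    };

    static boolean validate(List<String[]> actualRows) {
        boolean allMatch = true;
        for (int r = 0; r < EXPECTED.length && r < actualRows.size(); r++) {
            String[] actual = actualRows.get(r);
            for (int c = 0; c < EXPECTED[r].length && c < actual.length; c++) {
                if (!EXPECTED[r][c].equals(actual[c])) {
                    System.out.printf("Mismatch at row %d, column %d: expected %s but found %s%n",
                            r, c, EXPECTED[r][c], actual[c]);
                    allMatch = false;
                }
            }
        }
        return allMatch;
    }

    public static void main(String[] args) throws Exception {
        // Usage: pass the path of the CSV file on the command line.
        System.out.println(validate(CsvReader.read(args[0])) ? "All cells match" : "Differences found");
    }
}
```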
I'm currently working on a MapReduce job that processes XML data, and I think there's something about the data flow in Hadoop that I'm not getting.
I'm running on Amazon's Elastic MapReduce service.
Input data: large files (significantly above 64 MB, so they should be splittable), consisting of many small XML files that a previous s3distcp operation has concatenated into one.
I am using a slightly modified version of Mahout's XmlInputFormat to extract the individual XML snippets from the input.
As a next step I'd like to parse those XML snippets into business objects, which should then be passed to the mapper.
Now here is where I think I'm missing something: in order for that to work, my business objects need to implement the Writable interface, defining how to read/write an instance from/to a DataInput or DataOutput.
However, I don't see where this comes into play - the logic needed to read an instance of my object is already in the InputFormat's record reader, so why does the object have to be capable of reading/writing itself?
I have done quite a bit of research already and I know (or rather assume) that WritableSerialization is used when transferring data between nodes in the cluster, but I'd like to understand the reasons behind that architecture.
The InputSplits are defined upon job submission - so if the name node sees that data needs to be moved to a specific node for a map task to work, would it not be sufficient to simply send the raw data as a byte stream? Why do we need to decode it into Writables if the RecordReader of our input format does the same thing anyway?
I really hope someone can show me the error in my thinking above - many thanks in advance!
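For context, the Writable contract mentioned above comes down to two methods. A minimal sketch with a made-up business object (the class and field names are hypothetical):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Hypothetical business object produced from one XML snippet.
public class OrderRecord implements Writable {

    private String orderId;
    private long timestamp;

    // Hadoop calls this when it needs to serialize the object into a byte stream
    // (e.g. when handing map output to the shuffle).
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(orderId);
        out.writeLong(timestamp);
    }

    // Hadoop calls this to rebuild the object from a byte stream, typically
    // reusing a single instance for many records.
    @Override
    public void readFields(DataInput in) throws IOException {
        orderId = in.readUTF();
        timestamp = in.readLong();
    }
}
```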
Writing Java objects or a List to a text file is fine. But I want to know how I can update or rewrite an object that was written previously, without writing all the objects again. For example, let's assume a java.util.List holds a set of objects, and that list is written to a text file. Later the file is read again, all the objects are retrieved from the list, and one object's value is changed at runtime by a Java application. I then don't want to write the entire list back to the text file; instead, only the updated object in the list should be rewritten or updated in the text file, without rewriting the whole list again. Any suggestions, or helpful sources with sample code, please.
Take a look at RandomAccessFile. This will let you seek to the place in the file you want and update only the part you want to change.
Also take a look at this question on stackoverflow.
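A minimal sketch of that approach, assuming fixed-length records so that the replacement occupies exactly the same number of bytes (the record length, file layout, and class name are hypothetical):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Overwrites one fixed-length record in place without touching the rest of the file.
public class RecordUpdater {

    static final int RECORD_LENGTH = 64; // bytes per record, padded with spaces

    static void updateRecord(String path, int recordIndex, String newValue) throws IOException {
        // Pad (or cut) the value to exactly RECORD_LENGTH bytes so the file layout is preserved.
        String padded = String.format("%-" + RECORD_LENGTH + "s", newValue)
                              .substring(0, RECORD_LENGTH);
        try (RandomAccessFile raf = new RandomAccessFile(path, "rw")) {
            raf.seek((long) recordIndex * RECORD_LENGTH); // jump straight to the record
            raf.write(padded.getBytes(StandardCharsets.US_ASCII)); // overwrite it in place
        }
    }
}
```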
Without some fairly complex logic, you won't usually be able to update an object without rewriting the entire file. For example, if one of the objects on your list contains a string "shortstring", and you need to update it with string "muchmuchlongerstring", there will be no space in the file for the longer string without rewriting all the following content in the file.
If you want to persist large object trees to a file and still have the ability to update them, your code will be less buggy and your life simpler if you use one of the many file-based databases out there, such as:
SQLite (see Java and SQLite)
Derby
H2 (disk-based tables)
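For instance, a rough sketch with H2 over plain JDBC (the table, columns, and database file name are made up); a single row can then be updated without rewriting anything else:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

// Stores the list in a file-based H2 database (created next to the working directory)
// and updates one entry in place.
public class ObjectStore {

    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./objects", "sa", "")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS item (id INT PRIMARY KEY, value VARCHAR(255))");
                st.execute("MERGE INTO item KEY(id) VALUES (1, 'shortstring')");
            }
            // Update just one object; the database handles the storage layout.
            try (PreparedStatement ps = conn.prepareStatement("UPDATE item SET value = ? WHERE id = ?")) {
                ps.setString(1, "muchmuchlongerstring");
                ps.setInt(2, 1);
                ps.executeUpdate();
            }
        }
    }
}
```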
Consider an application that wants to use Hadoop in order to process large amounts of proprietary binary-encoded text data in approximately the following simplified MapReduce sequence:
Gets a URL to a file or a directory as input
Reads the list of the binary files found under the input URL
Extracts the text data from each of those files
Saves the text data into new, extracted plain text files
Classifies the extracted files into (sub)formats with special characteristics (call this the "context")
Splits each of the extracted text files according to its context, if necessary
Processes each of the splits using the context of the original (unsplit) file
Submits the processing results to a proprietary data repository
The format-specific characteristics (context) identified in Step 5 are also saved in a (small) text file as key-value pairs, so that they are accessible by Step 6 and Step 7.
Splitting in Step 6 takes place using custom InputFormat classes (one per custom file format).
In order to implement this scenario in Hadoop, one could integrate Step 1 - Step 5 in a Mapper and use another Mapper for Step 7.
A problem with this approach is how to make a custom InputFormat know which extracted files to use in order to produce the splits. For example, format A may represent 2 extracted files with slightly different characteristics (e.g., different line delimiter), hence 2 different contexts, saved in 2 different files.
Based on the above, the getSplits(JobConf) method of each custom InputFormat needs access to the context of each file before splitting it. However, there can be (at most) one InputFormat class per format, so how would one correlate the appropriate set of extracted files with the correct context file?
A solution could be to use some specific naming convention for associating extracted files and contexts (or vice versa), but is there any better way?
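To make the naming-convention idea concrete, here is a hedged sketch (hypothetical class, property, and file names, using the old mapred API mentioned in the question) of an InputFormat whose getSplits() looks up a sidecar context file next to each extracted file:

```java
import java.io.IOException;
import java.util.Properties;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical InputFormat for "format A": each extracted file foo.txt is assumed to
// have a sibling context file foo.txt.context written by the extraction step (Step 5).
public class FormatAInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
        // Let the base class enumerate the extracted files and build default splits.
        InputSplit[] splits = super.getSplits(job, numSplits);
        FileSystem fs = FileSystem.get(job);

        for (InputSplit split : splits) {
            Path file = ((FileSplit) split).getPath();
            Path contextFile = new Path(file.toString() + ".context"); // naming convention

            // Read the per-file context (e.g. its line delimiter) as key-value pairs.
            Properties context = new Properties();
            if (fs.exists(contextFile)) {
                try (FSDataInputStream in = fs.open(contextFile)) {
                    context.load(in);
                }
            }
            // Stash the context in the JobConf, keyed by file name, so the record
            // reader (and the mapper for Step 7) can look it up later.
            job.set("context." + file.getName(), context.getProperty("line.delimiter", "\n"));
        }
        return splits;
    }

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf job,
            Reporter reporter) throws IOException {
        // A real implementation would return a reader that honours the per-file context;
        // it is omitted in this sketch.
        throw new UnsupportedOperationException("record reader omitted in this sketch");
    }
}
```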
This sounds more like a Storm (stream processing) problem, with a spout that loads the list of binary files from a URL and subsequent bolts in the topology performing each of the steps listed above.
I'm writing a tool to analyze stock market data. For this I download data and then save all the data corresponding to a stock as a 20*100000 double[][] array in a data.bin file on my hard drive. I know I should put it in a database, but performance-wise this is simply the best method.
Now here is my problem: I need to do updates and search on the data:
Updates: I have to append new data to the end of the array as time progresses.
Search: I want to iterate over different data files to find a minimum or calculate moving averages etc.
I could do both by reading in the whole file and then either updating and rewriting it or searching in a specific area... but this is somewhat overkill, since I don't need all the data.
So my question is: is there a library (in Java), or something similar, for opening/reading/changing parts of a binary file without having to load the whole file? Or for searching through the file starting at a specific point?
RandomAccessFile allows seeking to a particular position in a file and updating parts of it or appending new data at the end without rewriting everything. See the tutorial here: http://docs.oracle.com/javase/tutorial/essential/io/rafs.html
You could try looking at Random Access Files:
Tutorial: http://docs.oracle.com/javase/tutorial/essential/io/rafs.html
API: http://docs.oracle.com/javase/6/docs/api/java/io/RandomAccessFile.html
... but you will still need to figure out the exact positions you want to read in a binary file.
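A minimal sketch of that, assuming the data.bin file is a flat sequence of 8-byte doubles (the layout, class, and method names are hypothetical): appending goes to the end of the file, and a search reads only the window it needs:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Appends to and reads slices of a binary file of doubles without loading the whole file.
public class PriceFile {

    // Append new values without touching the existing data.
    static void append(String path, double[] newValues) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "rw")) {
            raf.seek(raf.length());              // jump to the end of the file
            for (double v : newValues) {
                raf.writeDouble(v);              // 8 bytes per value
            }
        }
    }

    // Read only the values in [fromIndex, fromIndex + count), not the whole file.
    static double[] read(String path, long fromIndex, int count) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(fromIndex * 8);             // each double occupies 8 bytes
            double[] out = new double[count];
            for (int i = 0; i < count; i++) {
                out[i] = raf.readDouble();
            }
            return out;
        }
    }
}
```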
You might want to consider moving to a database, maybe a small embedded one like H2 (http://www.h2database.com)
I am making a Java program that has a collection of flash-card-like objects. I store the objects in a JTree composed of DefaultMutableTreeNodes. Each node has a user object attached to it with a few string/native data type parameters. However, I also want each of these objects to have an image (typical formats: JPG, PNG, etc.).
I would like to be able to store all of this information, including the images and the tree data, to disk in a single file, so the file can be transferred between users and the entire tree, including the images and parameters for each object, can be reconstructed.
I had not approached a problem like this before, so I was not sure what the best practices were. I found XMLEncoder (http://java.sun.com/j2se/1.4.2/docs/api/java/beans/XMLEncoder.html) to be a very effective way of storing my tree and the native data type information. However, I couldn't figure out how to save the image data itself inside the XML file, and I'm not sure it is possible since the data is binary (so restricted characters would be invalid). My next thought was to associate a hash string instead of an image with each user object, and then gzip together all of the images, with the hash strings as the names, along with the XML-encoded tree in the same compressed file. That seemed really contrived, though.
Does anyone know a good approach for this type of issue?
Thanks!
Assuming this isn't just a serializable graph, consider bundling the files together in Jar format. If you already have your data structures working with XMLEncoder, you can reuse this code by saving the data as a jar entry.
If memory serves, the jar library has better support for Unicode name entries than the zip package, which is why I would favour it.
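A rough sketch of that idea (the treeModel object and the map from entry names to image file paths are hypothetical): the XMLEncoder output becomes one jar entry and each image becomes its own binary entry:

```java
import java.beans.XMLEncoder;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Map;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

// Bundles the XML-encoded tree plus the images into one jar file.
public class CardBundleWriter {

    static void write(String bundlePath, Object treeModel, Map<String, String> imagePaths)
            throws IOException {
        try (JarOutputStream jar = new JarOutputStream(new FileOutputStream(bundlePath))) {
            // 1) Serialize the tree with XMLEncoder into memory, then store it as one entry.
            ByteArrayOutputStream xml = new ByteArrayOutputStream();
            XMLEncoder encoder = new XMLEncoder(xml);
            encoder.writeObject(treeModel);
            encoder.close(); // closes only the in-memory buffer; the jar stream stays open

            jar.putNextEntry(new JarEntry("tree.xml"));
            jar.write(xml.toByteArray());
            jar.closeEntry();

            // 2) Store each image as its own binary entry, keyed by the name that the
            //    corresponding user object refers to (e.g. "images/card42.png").
            byte[] buffer = new byte[8192];
            for (Map.Entry<String, String> e : imagePaths.entrySet()) {
                jar.putNextEntry(new JarEntry(e.getKey()));
                try (FileInputStream in = new FileInputStream(e.getValue())) {
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        jar.write(buffer, 0, n);
                    }
                }
                jar.closeEntry();
            }
        }
    }
}
```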
You might consider using an MS JET database (.mdb file) and storing all the stuff in there. That'll also make it easy to examine and edit the data in (for example) MS Access.
You can employ a virtual file system, which stores its data in a single container file. We develop and offer one such file system, SolFS; however, right now there is no Java binding for it. We will release a Java JNI interface for SolFS within a month.