I have a SequenceFile which is the output of a Hadoop map-reduce job.
In this file the data is written as key-value pairs, and the value itself is a map.
I want to read the value as a Map object so that I can process it further.
Configuration config = new Configuration();
Path path = new Path("D:\\OSP\\sample_data\\data\\part-00000");
SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(config), path, config);
WritableComparable key = (WritableComparable) reader.getKeyClass().newInstance();
Writable value = (Writable) reader.getValueClass().newInstance();
long position = reader.getPosition();
while(reader.next(key,value))
{
System.out.println("Key is: "+textKey +" value is: "+val+"\n");
}
Output of the program: Key is: [this is key] value is: {abc=839177, xyz=548498, lmn=2, pqr=1}
Here I am getting the value as a string, but I want it as a Map object.
Check the API documentation for SequenceFile.Reader#next(Writable, Writable).
while(reader.next(key,value))
{
System.out.println("Key is: "+textKey +" value is: "+val+"\n");
}
should be replaced with
while(reader.next(key,value))
{
System.out.println("Key is: "+key +" value is: "+value+"\n");
}
Use SequenceFile.Reader#getValueClassName to get the value type in the SequenceFile; a SequenceFile stores its key/value types in the file header.
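Since the value here is a map, the value class is most likely Hadoop's MapWritable. Below is a minimal sketch of reading it as a map, assuming that value class and a Text key (the path is taken from the question; verify the actual classes with getValueClassName first):

```java
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class ReadMapValues {
    public static void main(String[] args) throws Exception {
        Configuration config = new Configuration();
        Path path = new Path("D:\\OSP\\sample_data\\data\\part-00000");
        SequenceFile.Reader reader =
                new SequenceFile.Reader(FileSystem.get(config), path, config);
        // Check the value type recorded in the file header before assuming MapWritable
        System.out.println(reader.getValueClassName());
        Text key = new Text();               // assumption: keys are Text
        MapWritable value = new MapWritable();
        while (reader.next(key, value)) {
            // MapWritable implements java.util.Map<Writable, Writable>,
            // so each value can be processed entry by entry
            for (Map.Entry<Writable, Writable> e : value.entrySet()) {
                System.out.println(key + " -> " + e.getKey() + " = " + e.getValue());
            }
        }
        reader.close();
    }
}
```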
Say that I have the following two configuration files:
File 1:
key1 = ${common.key1}
key2 = ${common.key2}
File 2:
common.key1 = value1
common.key2 = value2
And I have the following code:
import org.apache.commons.configuration.PropertiesConfiguration;
...
PropertiesConfiguration newConfig = new PropertiesConfiguration();
File configFile1 = new File("...path to file 1");
File configFile2 = new File("...path to file 2");
newConfig.setDelimiterParsingDisabled(true);
newConfig.load(configFile2);
newConfig.load(configFile1);
Iterator<String> props = newConfig.getKeys();
while (props.hasNext()) {
String propName = props.next();
String propValue = newConfig.getProperty(propName).toString();
System.out.println(propName + " = " + propValue);
}
I have the following output:
common.key1 = value1
common.key2 = value2
key1 = ${common.key1}
key2 = ${common.key2}
Why are the placeholders not resolved?
See this page in the documentation, which says:
Below is some more information related to variable interpolation users should be aware of:
...
Variable interpolation is done by all property access methods. One exception is the generic getProperty() method which returns the raw property value.
And that's exactly what you are using in your code.
The API docs of getProperty() mentions this as well:
Gets a property from the configuration. ... On this level variable substitution is not yet performed.
Use other methods available in PropertiesConfiguration to get the actual, interpolated value. For example, call getProperties() on the PropertiesConfiguration to convert it to a java.util.Properties object and iterate on that instead.
It is also possible to do this generically, with placeholder substitution, like below:
config.get(Object.class, propName);
Unlike the getProperty method, the get method with an Object.class parameter returns a value of the original class, with variables interpolated.
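Putting it together, here is a sketch of the iteration with interpolation applied, using the same Commons Configuration 1.x API as the question (the file names are placeholders, not the real paths):

```java
import java.io.File;
import java.util.Iterator;
import org.apache.commons.configuration.PropertiesConfiguration;

public class InterpolationDemo {
    public static void main(String[] args) throws Exception {
        PropertiesConfiguration newConfig = new PropertiesConfiguration();
        newConfig.setDelimiterParsingDisabled(true);
        newConfig.load(new File("file2.properties")); // common.key1 = value1, ...
        newConfig.load(new File("file1.properties")); // key1 = ${common.key1}, ...
        Iterator<String> props = newConfig.getKeys();
        while (props.hasNext()) {
            String propName = props.next();
            // getString() performs variable interpolation;
            // getProperty() would return the raw ${...} value
            System.out.println(propName + " = " + newConfig.getString(propName));
        }
    }
}
```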
So I have two CSV files I wish to compare.
Each file could be as much as 20 MB.
Each line has the key followed by the data, so: key,data.
But the data is itself separated by commas as well.
csv1.csv
KEY , DATA
AB45,12,15,65,NN
AB46,12,15,64,YY
AB47,45,85,95,YN
csv2.csv
AB45,12,15,65,NN
AB46,15,15,65,YY
AB48,65,45,60,YY
What I want to do is read both files and compare the data for each key.
I was thinking read each file line by line adding into a TreeMap. I can then compare each set of data for a given key and if there is a difference write it to another file.
Any advice?
I am unsure of how to read the files and extract just the keys and data in an efficient way.
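For what it's worth, the TreeMap approach described above can be sketched with plain JDK code (the rows are taken from the two sample files; the naive split assumes the key is everything before the first comma and that fields contain no quoted or embedded commas):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CsvDiff {
    // Builds key -> full line; assumes no quoted/embedded commas.
    public static Map<String, String> toMap(List<String> lines) {
        Map<String, String> map = new TreeMap<String, String>();
        for (String line : lines) {
            map.put(line.substring(0, line.indexOf(',')), line);
        }
        return map;
    }

    public static void main(String[] args) {
        // Sample rows from csv1.csv and csv2.csv in the question
        Map<String, String> m1 = toMap(Arrays.asList(
                "AB45,12,15,65,NN", "AB46,12,15,64,YY", "AB47,45,85,95,YN"));
        Map<String, String> m2 = toMap(Arrays.asList(
                "AB45,12,15,65,NN", "AB46,15,15,65,YY", "AB48,65,45,60,YY"));
        for (Map.Entry<String, String> e : m1.entrySet()) {
            String other = m2.get(e.getKey());
            if (other == null) {
                System.out.println(e.getKey() + " is missing from csv2");
            } else if (!other.equals(e.getValue())) {
                System.out.println(e.getKey() + " differs: " + e.getValue() + " vs " + other);
            }
        }
    }
}
```

Swapping System.out for a BufferedWriter would write the differences to a file instead.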
Use a dedicated CSV parsing library to speed things up. With uniVocity-parsers you can parse these 20 MB files in 100 ms or less. The following solution is a bit involved, to prevent loading too much data into memory. Check the tutorial I linked above; there are many ways to accomplish what you need with this library.
First we read one of the CSV files, and generate a Map:
public static void main(String... args) {
//First we parse one file (ideally the smaller one)
CsvParserSettings settings = new CsvParserSettings();
//here we tell the parser to read the CSV headers
settings.setHeaderExtractionEnabled(true);
CsvParser parser = new CsvParser(settings);
//Parse all data into a list.
List<String[]> records = parser.parseAll(new File("/path/to/csv1.csv"));
//Convert that list into a map. The first column of this input will produce the keys.
Map<String, String[]> mapOfRecords = toMap(records);
//this is where the magic happens.
processFile(new File("/path/to/csv2.csv"), new File("/path/to/diff.csv"), mapOfRecords);
}
This is the code to generate a Map from the list of records:
/* Converts a list of records to a map. Uses element at index 0 as the key */
private static Map<String, String[]> toMap(List<String[]> records) {
HashMap<String, String[]> map = new HashMap<String, String[]>();
for (String[] row : records) {
//column 0 will always have an ID.
map.put(row[0], row);
}
return map;
}
With the map of records, we can process your second file and generate another with any updates found:
private static void processFile(final File input, final File output, final Map<String, String[]> mapOfExistingRecords) {
//configures a new parser again
CsvParserSettings settings = new CsvParserSettings();
settings.setHeaderExtractionEnabled(true);
//All parsed rows will be submitted to the following Processor. This way you won't have to store all rows in memory.
settings.setProcessor(new RowProcessor() {
//will write the changed rows to another file
CsvWriter writer;
@Override
public void processStarted(ParsingContext context) {
CsvWriterSettings writerSettings = new CsvWriterSettings(); //configure at will
writer = new CsvWriter(output, writerSettings);
}
@Override
public void rowProcessed(String[] row, ParsingContext context) {
// Incoming rows will have the ID at index 0.
// If the map contains the ID, we'll get the matching row
String[] existingRow = mapOfExistingRecords.get(row[0]);
if (!Arrays.equals(row, existingRow)) {
writer.writeRow(row);
}
}
@Override
public void processEnded(ParsingContext context) {
writer.close();
}
});
CsvParser parser = new CsvParser(settings);
//the parse() method will submit all rows to the RowProcessor defined above. All differences will be
//written to the output file.
parser.parse(input);
}
This should work just fine. I hope it helps you.
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
I work with a lot of CSV file comparisons for my job. I didn't know Python before I started working, but I picked it up really quickly. If you want to compare CSV files quickly, Python is a wonderful way to go, and it's fairly easy to pick up if you know Java.
I modified a script I use to fit your basic use case (you'll need to modify it a bit more to do exactly what you want). It runs in a few seconds when I use it to compare CSV files with millions of rows. If you need to do this in Java, you can pretty much transfer this to some Java methods. There are similar CSV libraries you can use that will replace all the CSV functions below.
import csv, sys

def getKeyPosition(header_row, key_value):
    counter = 0
    for header in header_row:
        if header == key_value:
            return counter
        counter += 1

# This will create a dictionary of your rows keyed by their key.
# (key_position is the column location of the key)
def getKeyDict(csv_reader, key_position):
    key_dict = {}
    row_counter = 0
    unique_records = 0
    for row in csv_reader:
        row_counter += 1
        if row[key_position] not in key_dict:
            key_dict.update({row[key_position]: row})
            unique_records += 1
    # My use case requires a lot of checking for duplicates
    if unique_records != row_counter:
        print "Duplicate Keys in File"
    return key_dict

def main():
    f1 = open(sys.argv[1])
    f2 = open(sys.argv[2])
    f1_csv = csv.reader(f1)
    f2_csv = csv.reader(f2)
    f1_header = next(f1_csv)
    f2_header = next(f2_csv)
    f1_header_key_position = getKeyPosition(f1_header, "KEY")
    f2_header_key_position = getKeyPosition(f2_header, "KEY")
    f1_row_dict = getKeyDict(f1_csv, f1_header_key_position)
    f2_row_dict = getKeyDict(f2_csv, f2_header_key_position)
    outputFile = open("KeyDifferenceFile.csv", 'w')
    writer = csv.writer(outputFile)
    writer.writerow(f1_header)
    # Here's the logic for comparing rows
    for key, row_1 in f1_row_dict.iteritems():
        # Do whatever comparisons you need here.
        if key not in f2_row_dict:
            print "Oh no, this key doesn't exist in file 2"
        if key in f2_row_dict:
            row_2 = f2_row_dict.get(key)
            if row_1 != row_2:
                print "oh no, the two rows don't match!"
            # You can fetch more header keys to compare by if you want.
            data_position = getKeyPosition(f2_header, "DATA")
            row_1_data = row_1[data_position]
            row_2_data = row_2[data_position]
            if row_1_data != row_2_data:
                print "oh no, the data doesn't match!"
                # Here's how you'd write the differing rows
                writer.writerow([key, row_1_data, row_2_data])
    # Make sure to close those files!
    f1.close()
    f2.close()
    outputFile.close()

main()
I want to get the unique identifier of an X509Certificate using Java.
I tried to get the value using the code below:
java.security.cert.X509Certificate certificate = // certificate object
certificate.getSubjectX500Principal().getName();
But I am unable to get the unique identifier value alone. This is the value I am getting:
2.5.4.45=#0309000000db000000a01a,OU=06
I want to get the value for "2.5.4.45" alone.
I also tried to get the value using the code below:
String dn2 = certificate.getSubjectX500Principal().getName();
LdapName ldapDN;
ldapDN = new LdapName(dn2);
for(Rdn rdn: ldapDN.getRdns()) {
System.out.println(rdn.getType() + " -> " + rdn.getValue());
if(rdn.getType().equalsIgnoreCase("2.5.4.45")){
System.out.println(rdn.getValue());
    }
}
I am getting an object as the value for the unique identifier, and I am not able to parse that object to get the value.
Update:
I am still not able to figure out a way to get the unique identifier. Any help is appreciated.
You need to provide a map of known OIDs; then you will get a human-readable DN string. The value of a known OID becomes readable once you define the OID. For example:
Map<String, String> knownOids = new HashMap<String, String>();
knownOids.put("2.5.4.45", "uniqueIdentifier");
String humanReadableDN = certToken.getCertificate().getSubjectX500Principal().getName(X500Principal.RFC2253, knownOids);
Example OID repository you can find here: http://oid-info.com/get/2.5.4.45
For Example, this:
CN=Krzysiek,1.2.840.113549.1.9.1=#160f3334353334354064666766642e706c
Will be translated to this:
commonName=Krzysiek,emailAddress=345345@dfgfd.pl
when you provide a map entry like:
knownOids.put("1.2.840.113549.1.9.1", "emailAddress");
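If only the raw value is needed rather than a readable label, the JDK's own LdapName (which the question already uses) can pull it out directly; hex-encoded RDN values (those starting with '#') come back from Rdn.getValue() as the decoded byte array. A sketch using the DN string from the question:

```java
import javax.naming.ldap.LdapName;
import javax.naming.ldap.Rdn;

public class UniqueIdExtractor {
    // Returns the raw value of the RDN whose type matches the given OID, or null.
    public static Object extractRdnValue(String dn, String oid) throws Exception {
        LdapName ldapDN = new LdapName(dn);
        for (Rdn rdn : ldapDN.getRdns()) {
            if (rdn.getType().equalsIgnoreCase(oid)) {
                return rdn.getValue(); // byte[] for '#'-prefixed hex values
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        Object v = extractRdnValue("2.5.4.45=#0309000000db000000a01a,OU=06", "2.5.4.45");
        // For a hex-encoded value this is the decoded bytes of the
        // BER-encoded attribute (here, a BIT STRING: tag 0x03, length 0x09)
        System.out.println(((byte[]) v).length + " bytes");
    }
}
```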
Learning Accumulo at the moment, and I noticed there wasn't a direct call that I could find for figuring out the column family of an entry. I need data from an Accumulo table in this format, for example:
{key:"XPZ-878-S12",
columns:[{name:"NAME",value:"FOO BAR"},
{name:"JOB",value:"ENGINEER"}
]
}
And these spots are where I am trying to take data from:
{key:"key value from table",
columns:[{name:"name of column family",value:"value from table"},
{name:"name of column family",value:"value from table"}
]
}
So obviously the key and value are easy to get hold of, but what I call the "name", i.e. the column family name, is extremely important to me as well.
Yes, it is possible. For example, take a look at this:
for (Entry<Key, Value> entry : scan) {
Text key = entry.getKey().getRow();
Value val = entry.getValue();
returnVal.append("KEY" + key + " " + entry.getKey().getColumnFamily() + ": " + val + "\n");
}
The solution: for whatever entry you are looking at, call entry.getKey().getColumnFamily().
I want to append a value to the following key, like this:
[Section]
Key=value1,value2
I tried Wini and the Section getAll() and putAll() functions, but it always replaces value1 with value2 instead of appending value2. And I didn't find any tutorial about this online. How can I do this using ini4j? Or another INI writing and parsing library?
I finally treated it as a single key-value pair and appended to the string after "Key=".
This topic is a little old, but I faced exactly the same problem, so...
To read all:
//open the file
Ini ini = new Ini(new File(iniFileName));
//load all values at once
Ini.Section names = ini.get("mySectionX");
String[] myStr = names.getAll("myKey1", String[].class);
To put all (with the same ini and names):
//if myStr[] have changes
names.putAll("myKey1", myStr);
In the end you'll have an ini file like this ("myKey1" is ALWAYS the same):
[mySectionX]
myKey1 = value1
myKey1 = value2
myKey1 = value3
Adding more information: if you want to create a new file:
Ini ini = new Ini();
ini.setComment(" Main comment "); //comment about the file
//add a section comment, a section and a value
ini.putComment("mySectionX", " Comment about the section");
ini.put("mySectionX", "myKey1", "value1");
//adding many parameters at one in a section
String[] keyList = {"value1", "value2", "value3"};
ini.add("mySectionY");
Ini.Section names = ini.get("mySectionY");
names.putAll("myKey1", keyList); //put all new elements at once
...
ini.store(new File(iniFileName));