My input is a CSV/TSV (or any other delimiter-separated) file with a header. I want to map the rows by an arbitrary column as key, with the whole row as value. The code below ran fine on my machine but failed when tested in yarn-cluster mode.
public class SparkController implements java.io.Serializable {
    String DELIMITER;
    String[] header;
    String path;
    transient JavaSparkContext sc; // set up during initialization (not shown)

    public static void main(String[] args) {
        // some parse function
        // say the input file is a CSV like: (id,timestamp,ip)
        // header = [ "id", "timestamp", "ip" ]
        // DELIMITER = ","
        SparkController sparkController = new SparkController();
        sparkController.parseArgs(args);
        JavaPairRDD<String, String> pairRdd = sparkController.map2PairRdd("ip");
    }

    private JavaPairRDD<String, String> map2PairRdd(String column) {
        JavaRDD<String> rawFile = sc.textFile(path);
        JavaPairRDD<String, String> pairRdd = rawFile.mapToPair((s) -> {
            // DELIMITER can be accessed normally
            String[] fields = s.split(DELIMITER);
            // turns out header is empty when this runs on YARN,
            // but it works fine in standalone mode
            return new Tuple2<>(fields[java.util.Arrays.asList(header).indexOf("ip")], s);
        });
        // other operations continue
        return pairRdd;
    }
}
I understand that variables like DELIMITER and header are serialized to the workers in cluster mode. But how can the header array be empty inside the RDD operation?
I modified the code by declaring a final int variable index outside the mapToPair and accessing index inside it, and that fixed the error.
But I'm still confused about why header is empty when accessed inside mapToPair. Can anybody provide some insight?
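For reference, here is a minimal sketch of the workaround described above (method and field names mirror the question; sc is assumed to be the initialized JavaSparkContext): resolve everything the closure needs into local variables on the driver, so the lambda captures only those values instead of the enclosing object's fields.

private JavaPairRDD<String, String> map2PairRdd(String column) {
    JavaRDD<String> rawFile = sc.textFile(path);

    // Resolve everything on the driver, outside the closure.
    final String delimiter = DELIMITER;
    final int index = java.util.Arrays.asList(header).indexOf(column);

    return rawFile.mapToPair(s -> {
        String[] fields = s.split(delimiter);
        return new Tuple2<>(fields[index], s);
    });
}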
I am attempting to generate an Avro schema from java to describe a table that I can access via JDBC.
I use the JDBC getMetaData() method to retrieve the relevant column metadata and store in an array list of "columnDetail" objects.
The columnDetail class is defined as:
private static class columnDetail {
public String tableName;
public String columnName;
public String dataTypeName;
public int dataTypeId;
public String size;
public String scale;
}
I then iterate through this array list and build up the Avro schema using the org.apache.avro.SchemaBuilder class.
My issue is around decimal logical types.
I iterate through the array list twice: the first time to add all fields to the FieldAssembler, the second to modify certain byte fields to add the decimal logical datatype.
The issue I am experiencing is that I get an error if the decimal scale value changes between iterations.
As it iterates through the columnDetail array, it works as long as the value of "scale" does not change. If it does change, the following occurs:
Exception in thread "main" org.apache.avro.AvroRuntimeException: Can't overwrite property: scale
at org.apache.avro.JsonProperties.addProp(JsonProperties.java:187)
at org.apache.avro.Schema.addProp(Schema.java:134)
at org.apache.avro.JsonProperties.addProp(JsonProperties.java:191)
at org.apache.avro.Schema.addProp(Schema.java:139)
at org.apache.avro.LogicalTypes$Decimal.addToSchema(LogicalTypes.java:193)
at GenAvroSchema.main(GenAvroSchema.java:85)
I can prevent this by hardcoding the decimal size. i.e. I can replace
org.apache.avro.LogicalTypes.decimal(Integer.parseInt(cd.size),Integer.parseInt(cd.scale)).addToSchema(schema.getField(cd.columnName).schema());
with
org.apache.avro.LogicalTypes.decimal(18,2).addToSchema(schema.getField(cd.columnName).schema());
This, however, ends up with the same decimal size for all decimal fields, which is not desirable.
Can someone help with this?
Java: 1.8.0_202
Avro: avro-1.8.2.jar
My java code:
public static void main(String[] args) throws Exception{
String jdbcURL = "jdbc:sforce://login.salesforce.com";
String jdbcUser = "userid";
String jdbcPassword = "password";
String avroDataType = "";
HashMap<String, String> dtmap = new HashMap<String, String>();
dtmap.put("VARCHAR", "string");
dtmap.put("BOOLEAN", "boolean");
dtmap.put("NUMERIC", "bytes");
dtmap.put("INTEGER", "int");
dtmap.put("TIMESTAMP", "string");
dtmap.put("DATE", "string");
ArrayList<columnDetail> columnDetails = new ArrayList<columnDetail>();
columnDetails = populateMetadata(jdbcURL, jdbcUser, jdbcPassword); // This works so have not included code here
SchemaBuilder.FieldAssembler<Schema> fields = SchemaBuilder.builder().record("account").doc("Account Details").fields();
for(columnDetail cd:columnDetails) {
avroDataType = dtmap.get(JDBCType.valueOf(cd.dataTypeId).getName());
switch(avroDataType)
{
case "string":
fields.name(cd.columnName).type().unionOf().nullType().and().stringType().endUnion().nullDefault();
break;
case "int":
fields.name(cd.columnName).type().unionOf().nullType().and().intType().endUnion().nullDefault();
break;
case "boolean":
fields.name(cd.columnName).type().unionOf().booleanType().and().nullType().endUnion().booleanDefault(false);
break;
case "bytes":
if(Integer.parseInt(cd.scale) == 0) {
fields.name(cd.columnName).type().unionOf().nullType().and().longType().endUnion().nullDefault();
} else {
fields.name(cd.columnName).type().bytesType().noDefault();
}
break;
default:
fields.name(cd.columnName).type().unionOf().nullType().and().stringType().endUnion().nullDefault();
break;
}
}
Schema schema = fields.endRecord();
for(columnDetail cd:columnDetails) {
avroDataType = dtmap.get(JDBCType.valueOf(cd.dataTypeId).getName());
if("bytes".equals(avroDataType) && Integer.parseInt(cd.scale) != 0) {
//org.apache.avro.LogicalTypes.decimal(Integer.parseInt(cd.size),Integer.parseInt(cd.scale)).addToSchema(schema.getField(cd.columnName).schema());
org.apache.avro.LogicalTypes.decimal(18,2).addToSchema(schema.getField(cd.columnName).schema());
}
}
BufferedWriter writer = new BufferedWriter(new FileWriter("./account.avsc"));
writer.write(schema.toString());
writer.close();
}
Thanks,
Eoin.
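For what it's worth, here is a minimal sketch of an alternative (an assumption, not verified against the Avro internals) that avoids calling addToSchema() on an already-built schema: attach each decimal logical type to its own freshly created bytes schema while the field is being added, e.g. in the "bytes" branch of the switch above.

case "bytes":
    if (Integer.parseInt(cd.scale) == 0) {
        fields.name(cd.columnName).type().unionOf().nullType().and().longType().endUnion().nullDefault();
    } else {
        // Build a dedicated bytes schema for this field and attach the decimal
        // logical type (precision = size, scale = scale) to it up front.
        Schema decimalSchema = org.apache.avro.LogicalTypes
                .decimal(Integer.parseInt(cd.size), Integer.parseInt(cd.scale))
                .addToSchema(Schema.create(Schema.Type.BYTES));
        fields.name(cd.columnName).type(decimalSchema).noDefault();
    }
    break;

With this, each field gets its own schema object carrying its own precision and scale, and the second loop that calls addToSchema() after endRecord() would no longer be needed.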
I have two CSV files. One is a master CSV file with around 500,000 records. The other is a daily CSV file with 50,000 records.
The daily CSV file is missing a few columns, which have to be fetched from the master CSV file.
For example
DailyCSV File
id,name,city,zip,occupation
1,Jhon,Florida,50069,Accountant
MasterCSV File
id,name,city,zip,occupation,company,exp,salary
1, Jhon, Florida, 50069, Accountant, AuditFirm, 3, $5000
What I have to do is read both files and match the records by ID; if the ID is present in the master file, I have to fetch company, exp and salary and write them to a new CSV file.
How can I achieve this?
What I have done currently:
while (true) {
    line = bstream.readLine();
    lineMaster = bstreamMaster.readLine();
    if (line == null || lineMaster == null) {
        break;
    } else {
        while (lineMaster != null)
            readlineSplit = line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)", -1);
        String splitId = readlineSplit[4];
        String[] readLineSplitMaster = lineMaster.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)", -1);
        String SplitIDMaster = readLineSplitMaster[13];
        System.out.println(splitId + "|" + SplitIDMaster);
        //System.out.println(splitId.equalsIgnoreCase(SplitIDMaster));
        if (splitId.equalsIgnoreCase(SplitIDMaster)) {
            String writeLine = readlineSplit[0] + "," + readlineSplit[1] + "," + readlineSplit[2] + "," + readlineSplit[3] + "," + readlineSplit[4] + "," + readlineSplit[5] + "," + readLineSplitMaster[15] + "," + readLineSplitMaster[16] + "," + readLineSplitMaster[17];
            System.out.println(writeLine);
            pstream.print(writeLine + "\r\n");
        }
    }
}
pstream.close();
fout.flush();
bstream.close();
bstreamMaster.close();
First of all, your current parsing approach will be painfully slow. Use a dedicated CSV parsing library to speed things up. With uniVocity-parsers you can process your 500K records in less than a second. This is how you can use it to solve your problem:
First let's define a few utility methods to read/write your files:
//opens the file for reading (using UTF-8 encoding)
private static Reader newReader(String pathToFile) {
try {
return new InputStreamReader(new FileInputStream(new File(pathToFile)), "UTF-8");
} catch (Exception e) {
throw new IllegalArgumentException("Unable to open file for reading at " + pathToFile, e);
}
}
//creates a file for writing (using UTF-8 encoding)
private static Writer newWriter(String pathToFile) {
try {
return new OutputStreamWriter(new FileOutputStream(new File(pathToFile)), "UTF-8");
} catch (Exception e) {
throw new IllegalArgumentException("Unable to open file for writing at " + pathToFile, e);
}
}
Then, we can start reading your daily CSV file, and generate a Map:
public static void main(String... args){
//First we parse the daily update file.
CsvParserSettings settings = new CsvParserSettings();
//here we tell the parser to read the CSV headers
settings.setHeaderExtractionEnabled(true);
//and to select ONLY the following columns.
//This ensures rows with a fixed size will be returned in case some records come with fewer or more columns than anticipated.
settings.selectFields("id", "name", "city", "zip", "occupation");
CsvParser parser = new CsvParser(settings);
//Here we parse all data into a list.
List<String[]> dailyRecords = parser.parseAll(newReader("/path/to/daily.csv"));
//And convert them to a map. ID's are the keys.
Map<String, String[]> mapOfDailyRecords = toMap(dailyRecords);
... //we'll get back here in a second.
This is the code to generate a Map from the list of daily records:
/* Converts a list of records to a map. Uses element at index 0 as the key */
private static Map<String, String[]> toMap(List<String[]> records) {
HashMap<String, String[]> map = new HashMap<String, String[]>();
for (String[] row : records) {
//column 0 will always have an ID.
map.put(row[0], row);
}
return map;
}
With the map of records, we can process your master file and generate the list of updates:
private static List<Object[]> processMasterFile(final Map<String, String[]> mapOfDailyRecords) {
//we'll put the updated data here
final List<Object[]> output = new ArrayList<Object[]>();
//configures the parser to process only the columns you are interested in.
CsvParserSettings settings = new CsvParserSettings();
settings.setHeaderExtractionEnabled(true);
settings.selectFields("id", "company", "exp", "salary");
//All parsed rows will be submitted to the following RowProcessor. This way the bigger Master file won't
//have all its rows stored in memory.
settings.setRowProcessor(new AbstractRowProcessor() {
@Override
public void rowProcessed(String[] row, ParsingContext context) {
// Incoming rows from MASTER will have the ID as index 0.
// If the daily update map contains the ID, we'll get the daily row
String[] dailyData = mapOfDailyRecords.get(row[0]);
if (dailyData != null) {
//We got a match. Let's join the data from the daily row with the master row.
Object[] mergedRow = new Object[8];
for (int i = 0; i < dailyData.length; i++) {
mergedRow[i] = dailyData[i];
}
for (int i = 1; i < row.length; i++) { //starts from 1 to skip the ID at index 0
mergedRow[i + dailyData.length - 1] = row[i];
}
output.add(mergedRow);
}
}
});
CsvParser parser = new CsvParser(settings);
//the parse() method will submit all rows to the RowProcessor defined above.
parser.parse(newReader("/path/to/master.csv"));
return output;
}
Finally, we can get the merged data and write everything to another file:
... // getting back to the main method here
//Now we process the master data and get a list of updates
List<Object[]> updatedData = processMasterFile(mapOfDailyRecords);
//And write the updated data to another file
CsvWriterSettings writerSettings = new CsvWriterSettings();
writerSettings.setHeaders("id", "name", "city", "zip", "occupation", "company", "exp", "salary");
writerSettings.setHeaderWritingEnabled(true);
CsvWriter writer = new CsvWriter(newWriter("/path/to/updates.csv"), writerSettings);
//Here we write everything, and get the job done.
writer.writeRowsAndClose(updatedData);
}
This should work like a charm. Hope it helps.
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
I will approach the problem in a step-by-step manner.
First, I will parse/read the master CSV file and keep its content in a HashMap, where the key will be each record's unique 'id'. For the value, you can store another hash, or simply create a Java class to hold the information.
Example of such a hash:
{
'1' : { 'name': 'Jhon',
'City': 'Florida',
'zip' : 50069,
....
}
}
Next, read your daily CSV file. For each row, read the 'id' and check if that key exists in the hashmap you created earlier.
If it exists, access the information you need from the hashmap and write it to a new CSV file.
Also, you might want to consider using a third-party CSV parser to make this task easier.
If you use Maven, you can follow this example I found on the net; otherwise you can just google for an Apache 'CSV parser' example:
http://examples.javacodegeeks.com/core-java/apache/commons/csv-commons/writeread-csv-files-with-apache-commons-csv-example/
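To illustrate the idea, here is a minimal sketch using Apache Commons CSV (the file names, paths and output columns are assumptions based on the example above, not part of the original question):

import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVPrinter;
import org.apache.commons.csv.CSVRecord;

public class CsvMergeSketch {
    public static void main(String[] args) throws IOException {
        // 1. Load the master file into a map keyed by id.
        Map<String, CSVRecord> masterById = new HashMap<String, CSVRecord>();
        try (Reader in = new FileReader("master.csv");
             CSVParser master = CSVFormat.DEFAULT.withFirstRecordAsHeader().parse(in)) {
            for (CSVRecord record : master) {
                masterById.put(record.get("id"), record);
            }
        }

        // 2. Stream the daily file, look up each id, and write the merged rows.
        try (Reader in = new FileReader("daily.csv");
             CSVParser daily = CSVFormat.DEFAULT.withFirstRecordAsHeader().parse(in);
             CSVPrinter out = new CSVPrinter(new FileWriter("merged.csv"),
                     CSVFormat.DEFAULT.withHeader("id", "name", "city", "zip",
                             "occupation", "company", "exp", "salary"))) {
            for (CSVRecord row : daily) {
                CSVRecord masterRow = masterById.get(row.get("id"));
                if (masterRow != null) {
                    out.printRecord(row.get("id"), row.get("name"), row.get("city"),
                            row.get("zip"), row.get("occupation"),
                            masterRow.get("company"), masterRow.get("exp"),
                            masterRow.get("salary"));
                }
            }
        }
    }
}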
I'm searching through some data files (~20GB). I'd like to find some specific terms in that data and mark the offset for the matches. Is there a way to have Spark identify the offset for the chunk of data I'm operating on?
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;
import java.util.regex.*;

public class Grep {
    public static void main(String args[]) {
        SparkConf conf = new SparkConf().setMaster("spark://ourip:7077");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        JavaRDD<String> data = jsc.textFile("hdfs://ourip/test/testdata.txt"); // load the data from HDFS
        JavaRDD<String> filterData = data.filter(new Function<String, Boolean>() {
            // I'd like to do something here to get the offset in the original file of the string "babe ruth"
            public Boolean call(String s) { return s.toLowerCase().contains("babe ruth"); } // case-insensitive matching
        });

        long matches = filterData.count(); // count the hits (this executes the RDD filter)

        System.out.println("Lines with search terms: " + matches);
    } // end main
} // end class Grep
I'd like to do something in the "filter" operation to compute the offset of "babe ruth" in the original file. I can get the offset of "babe ruth" in the current line, but what's the process or function that tells me the offset of the line within the file?
In Spark, the common Hadoop input formats can be used. To read the byte offset from the file you can use the TextInputFormat class from Hadoop (org.apache.hadoop.mapreduce.lib.input). It is already bundled with Spark.
It will read the file as key (byte offset) and value (text line); from the Hadoop documentation:
An InputFormat for plain text files. Files are broken into lines. Either linefeed or carriage-return are used to signal end of line. Keys are the position in the file, and values are the line of text.
In Spark it can be used by calling newAPIHadoopFile()
SparkConf conf = new SparkConf().setMaster("");
JavaSparkContext jsc = new JavaSparkContext(conf);
// read the content of the file using Hadoop format
JavaPairRDD<LongWritable, Text> data = jsc.newAPIHadoopFile(
"file_path", // input path
TextInputFormat.class, // used input format class
LongWritable.class, // class of the key
Text.class, // class of the value
new Configuration());
JavaRDD<String> mapped = data.map(new Function<Tuple2<LongWritable, Text>, String>() {
@Override
public String call(Tuple2<LongWritable, Text> tuple) throws Exception {
// you will get each line as a tuple (offset, text)
long pos = tuple._1().get(); // extract offset
String line = tuple._2().toString(); // extract text
return pos + " " + line;
}
});
You could use the wholeTextFiles(String path, int minPartitions) method from JavaSparkContext to return a JavaPairRDD<String,String> where the key is the filename and the value is a string containing the entire content of the file (thus, each record in this RDD represents a file). From here, simply run a map() that calls indexOf(String searchString) on each value. This should return the first index of the search string within each file.
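As a rough illustration of that idea (a sketch only; jsc is assumed to be an existing JavaSparkContext, the path and search term are placeholders, and mapValues is used instead of a plain map to keep the filename key):

JavaPairRDD<String, String> files = jsc.wholeTextFiles("hdfs://ourip/test/");
// For each (filename, content) pair, find the index of the first occurrence
// of the search string within that file, or -1 if it does not occur.
JavaPairRDD<String, Integer> firstOffsets = files.mapValues(new Function<String, Integer>() {
    public Integer call(String content) {
        return content.toLowerCase().indexOf("babe ruth");
    }
});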
(EDIT:)
So finding the offset in a distributed fashion for one file (per your use case below in the comments) is possible. Below is an example that works in Scala.
val searchString = *search string*
val rdd1 = sc.textFile(*input file*, *num partitions*)
// Zip RDD lines with their indices
val zrdd1 = rdd1.zipWithIndex()
// Find the first RDD line that contains the string in question
val firstFind = zrdd1.filter { case (line, index) => line.contains(searchString) }.first()
// Grab all lines before the line containing the search string and sum up all of their lengths (and then add the inline offset)
val filterLines = zrdd1.filter { case (line, index) => index < firstFind._2 }
val offset = filterLines.map { case (line, index) => line.length }.reduce(_ + _) + firstFind._1.indexOf(searchString)
Note that you would additionally need to add any new line characters manually on top of this since they are not accounted for (the input format uses new lines as demarcations between records). The number of new lines is simply the number of lines before the line containing the search string so this is trivial to add.
I'm not entirely familiar with the Java API, unfortunately, and it's not exactly easy to test, so I'm not sure if the code below works, but have at it. (Also, I used Java 1.7, but 1.8 compresses a lot of this code with lambda expressions.)
final String searchString = *search string*;
JavaRDD<String> data = jsc.textFile("hdfs://ourip/test/testdata.txt");

// Zip RDD lines with their indices
JavaPairRDD<String, Long> zrdd1 = data.zipWithIndex();

// Find the first line that contains the search string
final Tuple2<String, Long> firstFind = zrdd1.filter(new Function<Tuple2<String, Long>, Boolean>() {
    public Boolean call(Tuple2<String, Long> input) { return input._1().contains(searchString); }
}).first();

// Grab all lines before that line
JavaPairRDD<String, Long> filterLines = zrdd1.filter(new Function<Tuple2<String, Long>, Boolean>() {
    public Boolean call(Tuple2<String, Long> input) { return input._2() < firstFind._2(); }
});

// Sum up their lengths and add the in-line offset of the search string
long offset = filterLines.map(new Function<Tuple2<String, Long>, Integer>() {
    public Integer call(Tuple2<String, Long> input) { return input._1().length(); }
}).reduce(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer a, Integer b) { return a + b; }
}) + firstFind._1().indexOf(searchString);
This can only be done when your input is one file (since otherwise, zipWithIndex() wouldn't guarantee offsets within a file) but this method works for an RDD of any number of partitions so feel free to partition your file up into any number of chunks.
I'm currently porting a test suite originally written in Ruby to Java.
List<String[]>
The first step I'm trying to port parses CSV data into a List<String[]>
@Then("test 1")
public void test1( DataTable expectedTable ) {
List<String[]> tableData = getCsvData( fileName );
// I have also tried List<Map<String,String>> here.
// when List<String[]> includes the column names in element 0, TableConverter.toTable()
// for List<Map<String,String>>, TableConverter.toTable() ends up with
// writer: { columnNames:(as provided in element 0),
// fieldNames: ["entry", "entry", "entry"...]
// fieldValues: [colName0, row1Value0, colName1, row1Value1...] }
// and then ComplexTypeWriter.getValues() calls
// int index = fieldNames.indexOf(converter.map(columnName));
// where columnName is correct, but index is evaluated as -1, so getValues() returns
// [, , , ,...]
// so .diff() displays a table of empty strings.
expectedTable.diff( tableData );
}
...cucumber-jvm does not display the actual CSV data correctly.
List<Map<String,String>>
In our ruby implementation other test steps use Cucumber::Ast::Table.diff! to display reasons for failure:
failure = {'line number' => line, 'reason' => 'bad data in column 2', 'data' => column2}
failures.push failure
Cucumber::Ast::Table.new([[]]).diff!( failures, {surplus_col: true, surplus_row: true} ) unless failures.empty?
I've tried to port this to Java using java.util.Map, as shown below. The trouble is that although cucumber-jvm identifies that there is a difference between the empty DataTable and my list of maps, it doesn't parse (or display) my List<Map> correctly.
Map<String,String> failure = new HashMap<String,String>();
failure.put("line number", Integer.toString(line));
failure.put("reason", "bad data in column 2");
failure.put("data", Arrays.toString(column2));
List<Map<String,String>> failures = new ArrayList<Map<String,String>>();
failures.add(failure);
// We're expecting an empty list of failures, so create one to compare against.
String[] columnNames = failures.get(0).keySet().toArray(new String[]{});
ArrayList<Map<String, String>> emptyList = new ArrayList<Map<String,String>>();
HashMap<String, String> emptyData = new HashMap<String, String>();
for( String columnName : failures.get(0).keySet() ) {
emptyData.put(columnName, null);
}
emptyList.add(emptyData);
DataTable empty = DataTable.create( emptyList, Locale.getDefault(), columnNames );
empty.diff( failures );
I've implemented support for this:
https://github.com/cucumber/cucumber-jvm/pull/434
I want to implement 2 reports with OO. The reports all look like this (but with different columns and data):
name age gender phone_number
A 10 male 1234
B 20 female 5678
C 30 n/a 9012
As you can see, in the report each column has its own header and parser (for parsing the data). I have designed a Column class:
class Column<T>
{
    private String header;
    private ColumnParser<T> parser;

    public Column(String header)
    {
        this.header = header;
        this.parser = new ColumnParser<T>()
        {
            public String parse(T t)
            {
                return t.toString();
            }
        };
    }

    public Column(String header, ColumnParser<T> parser)
    {
        this.header = header;
        this.parser = parser;
    }

    public interface ColumnParser<T>
    {
        public String parse(T t);
    }
}
So each column has its own parser for the data in that column. But after this, I don't know how to store the data so that it can be mapped to each column and parsed.
Please advise.
First, it would be helpful to know what format your original data (in memory) is in - e.g. is it an Object[][]?
Second, the output you require looks like it's tab separated. Is that correct?
Third, to write to a text file you have to append row by row. Your current code seems to suggest you want to append column by column - this would be much harder to implement.
If you can convert your data into a String[][] - which should be straightforward - you can then use the following to write to a file. If you want tab-delimited output, use "\t" as the delimiter (unlike line endings, tab characters are not OS-specific).
public static void writeToFile(File file, String[][] data, String delimiter) throws IOException {
    PrintWriter out = new PrintWriter(new FileWriter(file));
    for (String[] row : data) {
        out.write(makeLine(row, delimiter));
    }
    out.close();
}

private static String makeLine(String[] row, String delimiter) {
    StringBuilder str = new StringBuilder();
    for (String cell : row) {
        str.append("\"" + cell + "\"").append(delimiter);
    }
    str.deleteCharAt(str.length() - 1);
    str.append("\n");
    return str.toString();
}