We are using Hazelcast as our in-memory cache and HBase as the persistent database in our project. We load the data from HBase into Hazelcast using a MapLoader when the application starts. Below is a snippet of the loadAll() function.
@Override
public synchronized Map<String, Object> loadAll(Collection<String> rows) {
    final Map<String, Object> mapObject = new HashMap<String, Object>();
    try {
        if (!rows.isEmpty()) {
            // Get scanner from HBase
            final ResultScanner scanner = loadAllData(rows, this.mapName);
            final byte[] family = this.mapName.getBytes();
            final byte[] qualifier = this.mapName.getBytes();
            // Load data from HBase into Hazelcast
            for (Result result = scanner.next(); result != null; result = scanner.next()) {
                mapObject.put(Bytes.toString(result.getRow()),
                        convertToObject(result.getValue(family, qualifier)));
            }
            LOGGER.logDebug(Constants.EXIT, methodName,
                    "All the data is successfully loaded for table " + this.mapName);
        }
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
    return mapObject;
}
We have a table with 350k entries (total data size of 300 MB), and it's taking close to 1.5 hours to load the data. Is this normal behavior? Does Hazelcast usually take this much time to load data?
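Whether that is normal depends less on Hazelcast than on how loadAllData scans HBase. One setting worth checking (this is an assumption, since loadAllData isn't shown) is the scanner caching: with a low caching value, 350k rows can mean hundreds of thousands of round trips to the region servers. Below is a minimal sketch of a batched scan, assuming the HBase 1.x client API; the helper name and connection handling are hypothetical, not your actual code.
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical stand-in for loadAllData(); the point is setCaching and addColumn,
// which cut the number of RPCs and the amount of data each row carries back.
public final class ScanSketch {
    static ResultScanner openScanner(Connection connection, String mapName) throws Exception {
        Table table = connection.getTable(TableName.valueOf(mapName));
        Scan scan = new Scan();
        scan.addColumn(Bytes.toBytes(mapName), Bytes.toBytes(mapName)); // only the column loadAll reads
        scan.setCaching(1000); // rows fetched per RPC; low defaults make large scans very slow
        return table.getScanner(scan);
    }
}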
I am trying to implement a DMN (Decision Model and Notation) evaluation service, where the user can upload a CSV file with test cases to be evaluated and receive the results for every test case in the input file, also as a CSV file.
Reading the input CSV file and evaluating the test cases works without problems, but I have some issues writing the results to a CSV file using OpenCSV.
Here is the mapped bean, which should be converted to a CSV row:
@Data
@AllArgsConstructor
@NoArgsConstructor
public class DmnTestCaseResult {
private Map<String, Object> testInput;
private Map<String, Object> expectedOutput;
private List<Map<String, Object>> testOutput;
private String errorMessage;
}
As you can see, a test case result can in some situations have multiple testOutputs, defined as a list of maps.
What I want is to write a separate row in the CSV file for every map in testOutput. But with the code I wrote below, only the first entry of testOutput is written, as a single row in the CSV file.
public String convertDmnRuleTestResultToCsv(DmnRuleTestResult result) {
List<DmnTestCaseResult> results = result.getTestCases();
try(StringWriter sw = new StringWriter(); CSVWriter writer = new CSVWriter(sw, CSVWriter.DEFAULT_SEPARATOR, CSVWriter.NO_QUOTE_CHARACTER, CSVWriter.NO_ESCAPE_CHARACTER, CSVWriter.DEFAULT_LINE_END)) {
StatefulBeanToCsv<DmnTestCaseResult> beanToCsv = new StatefulBeanToCsvBuilder<DmnTestCaseResult>(writer)
.withApplyQuotesToAll(false)
.build();
beanToCsv.write(results);
return sw.toString();
} catch(Exception ex){
throw new CsvParseException(ex.getMessage());
}
}
How can I tell OpenCSV to create a separate row for each entry in testOutput?
EDIT: Added more information
[Screenshots omitted: the UI, the resulting incorrect CSV, and the expected correct CSV.]
As you can see from the screenshots, one input can have multiple test outputs. Therefore I want to create a separate line in the CSV file for every test output.
As StatefulBeanToCsv does not seem to be capable of generating multiple lines for a single bean, I suggest implementing a custom mapping function. This also requires you to print the header line manually.
public static String convertDmnRuleTestResultToCsv(DmnRuleTestResult result) {
List<DmnTestCaseResult> results = result.getTestCases();
try (StringWriter sw = new StringWriter();
CSVWriter writer = new CSVWriter(sw, CSVWriter.DEFAULT_SEPARATOR,
CSVWriter.NO_QUOTE_CHARACTER, CSVWriter.NO_ESCAPE_CHARACTER,
CSVWriter.DEFAULT_LINE_END)) {
writeHeader(writer);
for (DmnTestCaseResult r : results) {
for (Map<String, Object> map : r.getTestOutput())
writer.writeNext(map(r, map));
}
return sw.toString();
} catch (Exception ex) {
throw new RuntimeException(ex.getMessage());
}
}
private static void writeHeader(CSVWriter writer) {
List<String> header = new ArrayList<>();
header.add("ERRORMESSAGE");
header.add("EXPECTEDOUTPUT");
header.add("INPUT");
header.add("OUTPUT");
writer.writeNext(header.toArray(new String[] {}));
}
private static String[] map(DmnTestCaseResult r, Map<String, Object> testOutput) {
// you can adjust the formats here as needed; the entrySet() call can be left out, it only changes the formatting slightly. Do what you like more.
List<String> line = new ArrayList<>();
line.add(r.getErrorMessage());
line.add(r.getExpectedOutput().entrySet().toString());
line.add(r.getTestInput().entrySet().toString());
line.add(testOutput.entrySet().toString());
return line.toArray(new String[] {});
}
And this prints:
ERRORMESSAGE,EXPECTEDOUTPUT,INPUT,OUTPUT
errorMessage,[expectedOutput1=expectedOutput1, expectedOutput2=expectedOutput2],[input2=testInput2, input1=testInput1],[testOut2=testOut2, testOut=testOut1]
errorMessage,[expectedOutput1=expectedOutput1, expectedOutput2=expectedOutput2],[input2=testInput2, input1=testInput1],[testOut3=testOut3, testOut4=testOut4]
I would like to save an RDD to text files grouped by key. Currently I can't figure out how to split the output into multiple files; it seems that output for multiple keys sharing the same partition gets written to the same file. I would like to have a different file for each key. Here's my code snippet:
JavaPairRDD<String, Iterable<Customer>> groupedResults = customerCityPairRDD.groupByKey();
groupedResults.flatMap(x -> x._2().iterator())
.saveAsTextFile(outputPath + "/cityCounts");
This can be achieved by using foreachPartition to save each partition into a separate file.
You can develop your code as follows:
groupedResults.foreachPartition(new VoidFunction<Iterator<Tuple2<String, Iterable<Customer>>>>() {
    @Override
    public void call(Iterator<Tuple2<String, Iterable<Customer>>> rec) throws Exception {
        FSDataOutputStream fsOutputStream = null;
        BufferedWriter writer = null;
        try {
            // "path1" is a placeholder; use a path that is unique per partition/key
            fsOutputStream = FileSystem.get(new Configuration()).create(new Path("path1"));
            writer = new BufferedWriter(new OutputStreamWriter(fsOutputStream));
            while (rec.hasNext()) {
                Tuple2<String, Iterable<Customer>> entry = rec.next();
                for (Customer cust : entry._2()) {
                    writer.write(cust.toString());
                    writer.newLine();
                }
            }
        } catch (Exception exp) {
            exp.printStackTrace();
            // Handle exception
        } finally {
            if (writer != null) {
                writer.close();
            }
        }
    }
});
So I figured out how to solve this: convert the RDD to a DataFrame and then just partition by key during the write.
Dataset<Row> dataFrame = spark.createDataFrame(customerRDD, Customer.class);
dataFrame.write()
.partitionBy("city")
.text("cityCounts"); // write as text file at file path cityCounts
I wrote a file deduplication processor which gets the MD5 hash of each file, adds it to a hashmap, then takes all of the files with the same hash and adds them to a hashmap called dupeList. But while scanning large directories such as C:\Program Files\ it throws the following error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.nio.file.Files.read(Unknown Source)
at java.nio.file.Files.readAllBytes(Unknown Source)
at com.embah.FileDupe.Utils.FileUtils.getMD5Hash(FileUtils.java:14)
at com.embah.FileDupe.FileDupe.getDuplicateFiles(FileDupe.java:43)
at com.embah.FileDupe.FileDupe.getDuplicateFiles(FileDupe.java:68)
at ImgHandler.main(ImgHandler.java:14)
I'm sure it's due to the fact that it handles so many files, but I'm not sure of a better way to handle it. I'm trying to get this working so I can sift through all my kids' baby pictures and remove duplicates before I put them on my external hard drive for long-term storage. Thanks everyone for the help!
My code
public class FileUtils {
public static String getMD5Hash(String path){
try {
byte[] bytes = Files.readAllBytes(Paths.get(path)); //LINE STACK THROWS ERROR
byte[] hash = MessageDigest.getInstance("MD5").digest(bytes);
bytes = null;
String hexHash = DatatypeConverter.printHexBinary(hash);
hash = null;
return hexHash;
} catch(Exception e){
System.out.println("Having problem with file: " + path);
return null;
}
}
}
public class FileDupe {
public static Map<String, List<String>> getDuplicateFiles(String dirs){
Map<String, List<String>> allEntrys = new HashMap<>(); //<hash, file loc>
Map<String, List<String>> dupeEntrys = new HashMap<>();
File fileDir = new File(dirs);
if(fileDir.isDirectory()){
ArrayList<File> nestedFiles = getNestedFiles(fileDir.listFiles());
File[] fileList = new File[nestedFiles.size()];
fileList = nestedFiles.toArray(fileList);
for(File file:fileList){
String path = file.getAbsolutePath();
String hash = "";
if((hash = FileUtils.getMD5Hash(path)) == null)
continue;
if(!allEntrys.containsValue(path))
put(allEntrys, hash, path);
}
fileList = null;
}
allEntrys.forEach((hash, locs) -> {
if(locs.size() > 1){
dupeEntrys.put(hash, locs);
}
});
allEntrys = null;
return dupeEntrys;
}
public static Map<String, List<String>> getDuplicateFiles(String... dirs){
ArrayList<Map<String, List<String>>> maps = new ArrayList<Map<String, List<String>>>();
Map<String, List<String>> dupeMap = new HashMap<>();
for(String dir : dirs){ //Get all dupe files
maps.add(getDuplicateFiles(dir));
}
for(Map<String, List<String>> map : maps){ //iterate thru each map, and add all items not in the dupemap to it
dupeMap.putAll(map);
}
return dupeMap;
}
protected static ArrayList<File> getNestedFiles(File[] fileDir){
ArrayList<File> files = new ArrayList<File>();
return getNestedFiles(fileDir, files);
}
protected static ArrayList<File> getNestedFiles(File[] fileDir, ArrayList<File> allFiles){
for(File file:fileDir){
if(file.isDirectory()){
getNestedFiles(file.listFiles(), allFiles);
} else {
allFiles.add(file);
}
}
return allFiles;
}
protected static <KEY, VALUE> void put(Map<KEY, List<VALUE>> map, KEY key, VALUE value) {
map.compute(key, (s, strings) -> strings == null ? new ArrayList<>() : strings).add(value);
}
}
public class ImgHandler {
private static Scanner s = new Scanner(System.in);
public static void main(String[] args){
System.out.print("Please enter locations to scan for dupelicates\nSeperate Location via semi-colon(;)\nLocations: ");
String[] locList = s.nextLine().split(";");
Map<String, List<String>> dupes = FileDupe.getDuplicateFiles(locList);
System.out.println(dupes.size() + " dupes detected!");
dupes.forEach((hash, locs) -> {
System.out.println("Hash: " + hash);
locs.forEach((loc) -> System.out.println("\tLocation: " + loc));
});
}
}
Reading the entire file into a byte array not only requires sufficient heap space, it's also limited to file sizes up to Integer.MAX_VALUE in principle (the practical limit for the HotSpot JVM is even a few bytes smaller).
The best solution is not to load the data into the heap memory at all:
public static String getMD5Hash(String path) {
MessageDigest md;
try { md = MessageDigest.getInstance("MD5"); }
catch(NoSuchAlgorithmException ex) {
System.out.println("FileUtils.getMD5Hash(): "+ex);
return null;// TODO better error handling
}
try(FileChannel fch = FileChannel.open(Paths.get(path), StandardOpenOption.READ)) {
for(long pos = 0, rem = fch.size(), chunk; rem>pos; pos+=chunk) {
chunk = Math.min(Integer.MAX_VALUE, rem-pos);
md.update(fch.map(FileChannel.MapMode.READ_ONLY, pos, chunk));
}
} catch(IOException e){
System.out.println("Having problem with file: " + path);
return null;// TODO better error handling
}
return String.format("%032X", new BigInteger(1, md.digest()));
}
If the underlying MessageDigest implementation is a pure Java implementation, it will transfer data from the direct buffer to the heap, but that’s outside your responsibility then (and it will be a reasonable trade-off between consumed heap memory and performance).
The method above will handle files beyond the 2GiB size without problems.
Whatever implementation FileUtils has, it is trying to read in whole files to calculate the hash. This is not necessary: the calculation is possible by reading the content in smaller chunks. In fact, it is arguably bad design to require whole-file reads instead of simply reading the chunks that are needed (64 bytes?). So maybe you need to use a better library.
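For example, the JDK alone can do the chunked calculation with a DigestInputStream. Here is a minimal sketch of a drop-in replacement for the getMD5Hash above (error handling kept as simple as in the original):
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import javax.xml.bind.DatatypeConverter;

public class ChunkedHashUtils {
    public static String getMD5Hash(String path) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            try (InputStream in = new DigestInputStream(Files.newInputStream(Paths.get(path)), md)) {
                byte[] buf = new byte[8192];
                // reading through the DigestInputStream updates the digest chunk by chunk
                while (in.read(buf) != -1) {
                    // the bytes themselves are not needed here
                }
            }
            return DatatypeConverter.printHexBinary(md.digest());
        } catch (Exception e) {
            System.out.println("Having problem with file: " + path);
            return null;
        }
    }
}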
You have several solutions:
Don't read all the bytes at once; try using a BufferedInputStream and read a chunk of bytes at a time, but never the whole file.
try (BufferedInputStream fileInputStream = new BufferedInputStream(
        Files.newInputStream(Paths.get("your_file_here"), StandardOpenOption.READ))) {
    byte[] buf = new byte[2048];
    int len;
    while ((len = fileInputStream.read(buf)) != -1) {
        // Add only the bytes actually read to your calculation
        doSomethingWithBytes(buf, len);
    }
} catch (IOException ex) {
    ex.printStackTrace();
}
Use C/C++ for this kind of task (though this is less safe, because you handle the memory yourself).
Consider using Guava:
private final static HashFunction HASH_FUNCTION = Hashing.goodFastHash(32);
//somewhere later
final HashCode hash = Files.asByteSource(file).hash(HASH_FUNCTION);
Guava will buffer the reading of the file for you.
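If the output needs to stay MD5 (for example to compare against hashes you already computed), Guava also offers Hashing.md5(), although newer Guava versions mark it deprecated in favor of faster non-cryptographic functions. A minimal sketch; the file path is a placeholder:
import java.io.File;
import com.google.common.hash.HashCode;
import com.google.common.hash.Hashing;
import com.google.common.io.Files;

public class GuavaMd5Sketch {
    public static void main(String[] args) throws Exception {
        // Guava streams the file in chunks internally, so the whole file never sits on the heap
        HashCode md5 = Files.asByteSource(new File("your_file_here")).hash(Hashing.md5());
        System.out.println(md5); // lowercase hex string
    }
}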
I had this Java heap space error on my Windows machine and spent weeks searching online for a solution. I tried increasing my -Xmx value, but with no success. I even tried running my Spring Boot app with a parameter to increase the heap size at run time, with a command like the one below:
mvn spring-boot:run -Dspring-boot.run.jvmArguments="-Xms2048m -Xmx4096m"
but still no success, until I figured out I was running a 32-bit JDK, which has a limited maximum heap size. I had to uninstall the 32-bit JDK and install the 64-bit one, which solved the issue for me. I hope this helps someone with a similar problem.
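As a side note, a quick way to check which JVM you are actually running (sun.arch.data.model is HotSpot-specific, so treat it as a convenience check rather than a guarantee):
public class JvmBitnessCheck {
    public static void main(String[] args) {
        // Prints "32" or "64" on HotSpot JVMs; os.arch is a more portable hint
        System.out.println("data model: " + System.getProperty("sun.arch.data.model"));
        System.out.println("os.arch:    " + System.getProperty("os.arch"));
    }
}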
I am trying to implement linear regression over a CSV file. Here is the content of the CSV file:
X1;X2;X3;X4;X5;X6;X7;X8;Y1;Y2;
0.98;514.50;294.00;110.25;7.00;2;0.00;0;15.55;21.33;
0.98;514.50;294.00;110.25;7.00;3;0.00;0;15.55;21.33;
0.98;514.50;294.00;110.25;7.00;4;0.00;0;15.55;21.33;
0.98;514.50;294.00;110.25;7.00;5;0.00;0;15.55;21.33;
0.90;563.50;318.50;122.50;7.00;2;0.00;0;20.84;28.28;
0.90;563.50;318.50;122.50;7.00;3;0.00;0;21.46;25.38;
0.90;563.50;318.50;122.50;7.00;4;0.00;0;20.71;25.16;
0.90;563.50;318.50;122.50;7.00;5;0.00;0;19.68;29.60;
0.86;588.00;294.00;147.00;7.00;2;0.00;0;19.50;27.30;
0.86;588.00;294.00;147.00;7.00;3;0.00;0;19.95;21.97;
0.86;588.00;294.00;147.00;7.00;4;0.00;0;19.34;23.49;
0.86;588.00;294.00;147.00;7.00;5;0.00;0;18.31;27.87;
0.82;612.50;318.50;147.00;7.00;2;0.00;0;17.05;23.77;
...
0.71;710.50;269.50;220.50;3.50;2;0.40;5;12.43;15.59;
0.71;710.50;269.50;220.50;3.50;3;0.40;5;12.63;14.58;
0.71;710.50;269.50;220.50;3.50;4;0.40;5;12.76;15.33;
0.71;710.50;269.50;220.50;3.50;5;0.40;5;12.42;15.31;
0.69;735.00;294.00;220.50;3.50;2;0.40;5;14.12;16.63;
0.69;735.00;294.00;220.50;3.50;3;0.40;5;14.28;15.87;
0.69;735.00;294.00;220.50;3.50;4;0.40;5;14.37;16.54;
0.69;735.00;294.00;220.50;3.50;5;0.40;5;14.21;16.74;
0.66;759.50;318.50;220.50;3.50;2;0.40;5;14.96;17.64;
0.66;759.50;318.50;220.50;3.50;3;0.40;5;14.92;17.79;
0.66;759.50;318.50;220.50;3.50;4;0.40;5;14.92;17.55;
0.66;759.50;318.50;220.50;3.50;5;0.40;5;15.16;18.06;
0.64;784.00;343.00;220.50;3.50;2;0.40;5;17.69;20.82;
0.64;784.00;343.00;220.50;3.50;3;0.40;5;18.19;20.21;
0.64;784.00;343.00;220.50;3.50;4;0.40;5;18.16;20.71;
0.64;784.00;343.00;220.50;3.50;5;0.40;5;17.88;21.40;
0.62;808.50;367.50;220.50;3.50;2;0.40;5;16.54;16.88;
0.62;808.50;367.50;220.50;3.50;3;0.40;5;16.44;17.11;
0.62;808.50;367.50;220.50;3.50;4;0.40;5;16.48;16.61;
0.62;808.50;367.50;220.50;3.50;5;0.40;5;16.64;16.03;
I read this CSV file and implement linear regression. Here is the source code in Java:
public static void main(String[] args) throws IOException
{
String csvFile = null;
CSVLoader loader = null;
Remove remove =null;
Instances data =null;
LinearRegression model = null;
int numberofFeatures = 0;
try
{
csvFile = "C:\\Users\\Taha\\Desktop/ENB2012_data.csv";
loader = new CSVLoader();
// load CSV
loader.setSource(new File(csvFile));
data = loader.getDataSet();
//System.out.println(data);
numberofFeatures = data.numAttributes();
System.out.println("number of features: " + numberofFeatures);
data.setClassIndex(data.numAttributes() - 2);
//remove last attribute Y2
remove = new Remove();
remove.setOptions(new String[]{"-R", data.numAttributes()+""});
remove.setInputFormat(data);
data = Filter.useFilter(data, remove);
// data.setClassIndex(data.numAttributes() - 2);
model = new LinearRegression();
model.buildClassifier(data);
System.out.println(model);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
I am getting an error, weka.core.UnassignedClassException: Class index is negative (not set)!, at the line model.buildClassifier(data);. The number of features is 1; however, it is expected to be 9. They are X1;X2;X3;X4;X5;X6;X7;X8;Y1;Y2. What am I missing?
Thanks in advance.
You can add the following lines after data = loader.getDataSet(); they will resolve your exception:
if (data.classIndex() == -1) {
    System.out.println("reset index...");
    data.setClassIndex(data.numAttributes() - 1);
}
This worked for me.
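On the other part of the question (the loader reporting only one attribute): that is usually because CSVLoader splits on commas by default, while this file uses semicolons. Assuming a recent Weka (3.7 or later), the field separator can be changed before loading; a minimal sketch:
import java.io.File;
import weka.core.Instances;
import weka.core.converters.CSVLoader;

public class SemicolonCsvSketch {
    public static void main(String[] args) throws Exception {
        CSVLoader loader = new CSVLoader();
        loader.setFieldSeparator(";"); // the file uses ';' instead of the default ','
        loader.setSource(new File("C:\\Users\\Taha\\Desktop\\ENB2012_data.csv"));
        Instances data = loader.getDataSet();
        System.out.println("number of features: " + data.numAttributes()); // should list X1..X8, Y1, Y2
    }
}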
Since I could not find any solution to this problem, I decided to put the data into an Oracle database and read it from there. There is an import utility in Oracle SQL Developer, and I used it. That solved my problem. I am writing this for people who have the same problem.
Here is detailed information about connecting to an Oracle database from Weka:
http://tahasozgen.blogspot.com.tr/2016/10/connection-to-oracle-database-in-weka.html
I have some questions regarding reading from and writing to CSV files (or whether there is a simpler alternative).
Scenario:
I need to have a simple database of people and some basic information about them. I need to be able to add new entries and search through the file for entries. I also need to be able to find an entry and modify it (i.e change their name or fill in a currently empty field).
Now I'm not sure whether a CSV reader/writer is the best route or not. I wouldn't know where to begin with SQL in Java, but if anyone knows of a good resource for learning that, that would be great.
Currently I am using SuperCSV; I put together a test project based on some example code:
class ReadingObjects {
// private static UserBean userDB[] = new UserBean[2];
private static ArrayList<UserBean> arrUserDB = new ArrayList<UserBean>();
static final CellProcessor[] userProcessors = new CellProcessor[] {
new StrMinMax(5, 20),
new StrMinMax(8, 35),
new ParseDate("dd/MM/yyyy"),
new Optional(new ParseInt()),
null
};
public static void main(String[] args) throws Exception {
ICsvBeanReader inFile = new CsvBeanReader(new FileReader("foo.csv"), CsvPreference.EXCEL_PREFERENCE);
try {
final String[] header = inFile.getCSVHeader(true);
UserBean user;
int i = 0;
while( (user = inFile.read(UserBean.class, header, userProcessors)) != null) {
UserBean addMe = new UserBean(user.getUsername(), user.getPassword(), user.getTown(), user.getDate(), user.getZip());
arrUserDB.add(addMe);
i++;
}
} finally {
inFile.close();
}
for(UserBean currentUser:arrUserDB){
if (currentUser.getUsername().equals("Klaus")) {
System.out.println("Found Klaus! :D");
}
}
WritingMaps.add();
}
}
And a writer class:
class WritingMaps {
public static void add() throws Exception {
ICsvMapWriter writer = new CsvMapWriter(new FileWriter("foo.csv", true), CsvPreference.EXCEL_PREFERENCE);
try {
final String[] header = new String[] { "username", "password", "date", "zip", "town"};
String test = System.getProperty("line.seperator");
// set up some data to write
final HashMap<String, ? super Object> data1 = new HashMap<String, Object>();
data1.put(header[0], "Karlasa");
data1.put(header[1], "fdsfsdfsdfs");
data1.put(header[2], "17/01/2010");
data1.put(header[3], 1111);
data1.put(header[4], "New York");
System.out.println(data1);
// the actual writing
// writer.writeHeader(header);
writer.write(data1, header);
// writer.write(data2, header);
} finally {
writer.close();
}
}
}
Issues:
I'm struggling to get the writer to add a new line to the CSV file. Purely for human readability purposes, not such a big deal.
I'm not sure how I would add data to an existing record to modify it. (remove and add it again? Not sure how to do this).
Thanks.
Have you considered an embedded database like H2, HSQL or SQLite? They can all persist to the filesystem and you'll discover a more flexible datastore with less code.
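For example, with H2 on the classpath, a file-backed database is a few lines of JDBC. A minimal sketch; the table layout and file name are only illustrative:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class H2PeopleSketch {
    public static void main(String[] args) throws Exception {
        // "./peopledb" persists to a file next to the application
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./peopledb", "sa", "")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS person(username VARCHAR PRIMARY KEY, town VARCHAR)");
            }
            // insert or update in one statement
            try (PreparedStatement ps = conn.prepareStatement("MERGE INTO person KEY(username) VALUES(?, ?)")) {
                ps.setString(1, "Klaus");
                ps.setString(2, "Berlin");
                ps.executeUpdate();
            }
            // look an entry up again
            try (PreparedStatement ps = conn.prepareStatement("SELECT town FROM person WHERE username = ?")) {
                ps.setString(1, "Klaus");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("town"));
                    }
                }
            }
        }
    }
}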
The easiest solution is to read the file at application startup into an in-memory structure (a list of UserBean, for example), add, remove, and modify beans in this in-memory structure, and write the whole list of UserBean back to the file when the app closes, or when the user chooses to save.
Regarding newlines when writing, the javadoc seems to indicate that the writer will take care of that. Just call write for each of your user beans, and the writer will automatically insert newlines between the rows.
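A minimal sketch of that in-memory approach (class and method names are illustrative, and it assumes UserBean has the matching getters and setters; loading and saving would reuse the CsvBeanReader and CsvMapWriter code from the question):
import java.util.ArrayList;
import java.util.List;

public class UserStore {
    private final List<UserBean> users = new ArrayList<UserBean>();

    public void add(UserBean user) {
        users.add(user);
    }

    public UserBean findByUsername(String username) {
        for (UserBean u : users) {
            if (u.getUsername().equals(username)) {
                return u;
            }
        }
        return null;
    }

    public void updateTown(String username, String newTown) {
        UserBean u = findByUsername(username);
        if (u != null) {
            u.setTown(newTown); // modified in memory; persisted on the next save
        }
    }

    public List<UserBean> snapshot() {
        return new ArrayList<UserBean>(users); // hand this to the CSV writer on save/exit
    }
}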