I have a big data project for school that requires us to build and query an 8-node Cassandra cluster. The system must contain at least seven terabytes of data, all of which I have to generate myself. There is no requirement that the data be "relevant" to the assignment, i.e. each column can just be a random int. That said, each value must be random or based on a random sequence.
So I wrote a simple Java program that just generates random ints. It can generate ~200 MB of random test data in ~120 s. Unless my math is off, I think I'm in a pickle.
There are 35,000 units of 200 MB in 7 terabytes.
35,000 * 120 s = 4,200,000 seconds
4,200,000 s / 3600 ≈ 1,167 hours
1,167 / 24 ≈ 49 days
So, it appears that it will take 49 days to generate all the test data needed. Obviously, this is impractical. I'm looking for suggestions that will increase the rate at which I can generate data.
I've considered / am considering:
setting the replication factor to 8 to reduce the amount of data that needs to be generated, and also running the data generation program on all 8 nodes in parallel.
Edit: here is how I'm generating the data.
private void initializeCols(){
    cols = new ArrayList<Generator>();
    cols.add(new IntGenerator(400));
}

public ArrayList<String> generatePage(){
    ArrayList<String> page = new ArrayList<String>();
    String line = "";
    for(int i = 0; i < PAGE_SIZE; i++){
        line = "";
        for(Generator column : cols){
            line += column.gen();
        }
        page.add(line);
    }
    return page;
}
Originally I was generating more test-specific data like phone numbers etc., but then I decided to just generate random ints to shave some time off -- not much savings. Here is the IntGenerator class.
public IntGenerator(int series){
    this.series = series;
}

public String gen(){
    String output = "";
    for(int i = 0; i < series; i++){
        output += Integer.toString(randomInt(1,1000));
        output += SEPERATOR;
    }
    return output;
}
Use cassandra-stress (2.1).
And this tool to generate your YAML.
You'll have random data in C* in minutes, no coding!
As you are performing a lot of concatenation in loops, I highly recommend you check out StringBuilder. It will dramatically increase the speed of your loops. For example,
public String gen(){
    StringBuilder sb = new StringBuilder();
    for(int i = 0; i < series; i++){
        sb.append(Integer.toString(randomInt(1,1000)));
        sb.append(SEPERATOR);
    }
    return sb.toString();
}
And you should do the same in your generatePage method as well.
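For reference, a sketch of what generatePage could look like with the same idea applied (assuming the PAGE_SIZE and cols fields from your class):

public ArrayList<String> generatePage(){
    ArrayList<String> page = new ArrayList<String>();
    StringBuilder line = new StringBuilder();
    for(int i = 0; i < PAGE_SIZE; i++){
        line.setLength(0);                  // reset the builder instead of allocating a new one
        for(Generator column : cols){
            line.append(column.gen());
        }
        page.add(line.toString());
    }
    return page;
}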
Speed at volume, as well as more data realism, can be had via third-party test data tools. One of these (RowGen) creates flat files you can load into DataStax; see:
Creating Test Data for Cassandra DataStax
Related
I have a Java program using the H5 libraries that tries to read a dataset in an H5 file (the file's size is 769 MB).
The code that reads the dataset is the following (very simple):
// Open file using the default properties.
fileId = H5.H5Fopen(filepath, HDF5Constants.H5F_ACC_RDONLY, HDF5Constants.H5P_DEFAULT);

// Open dataset using the default properties.
if (fileId >= 0) {
    datasetId = H5.H5Dopen(fileId, "/data/0_u0/20050103", HDF5Constants.H5P_DEFAULT);
}
if (datasetId >= 0) {
    dataSpaceId = H5.H5Dget_space(datasetId);
}

// Get the dimensions of the dataset.
int ndims = -1;
if (dataSpaceId >= 0)
    ndims = H5.H5Sget_simple_extent_ndims(dataSpaceId);

if (ndims > 0) {
    long[] dims = new long[ndims];
    H5.H5Sget_simple_extent_dims(dataSpaceId, dims, null);
    H5.H5Sclose(dataSpaceId);

    int dimX = (int) dims[0];
    int dimY = (int) dims[1];

    Double[][] dsetData = new Double[dimX][dimY];
    H5.H5Dread(datasetId, HDF5Constants.H5T_NATIVE_DOUBLE,
               HDF5Constants.H5S_ALL, HDF5Constants.H5S_ALL,
               HDF5Constants.H5P_DEFAULT, dsetData);
}
And it takes forever (more than 15 minutes; I stopped it after that).
What I don't understand is that I have roughly the same code in Python, and there it takes a few seconds.
When I debug the Java program and pause it mid-execution, it's inside the byteToDouble() function of the H5 library. It's a lot of doubles, but it shouldn't take that much time, right?
Thanks for your help!
I think the issue is that you're reading the data into a 2D Double[][] array. When you do this, the HDF5 implementation is very slow (the bottleneck is probably in HDFArray.arrayify). Try reading the data into a 1D double[].
Also, you are using the boxed Double; it would probably be better to use the primitive double.
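If it helps, here is a minimal sketch of the 1D variant, assuming the same fileId/datasetId handles and the dimX/dimY values read above (a sketch only, not tested against your file):

// Read the dataset into a flat primitive double[] instead of Double[][].
double[] dsetData = new double[dimX * dimY];
H5.H5Dread(datasetId, HDF5Constants.H5T_NATIVE_DOUBLE,
           HDF5Constants.H5S_ALL, HDF5Constants.H5S_ALL,
           HDF5Constants.H5P_DEFAULT, dsetData);
// If 2D access is needed afterwards, index as dsetData[row * dimY + col].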
Summary: I need to turn the output into a function (a distribution). The problem is
connecting Eclipse, or Java code in general, with another piece of software.
I'm studying physics. I needed a program that works the following way:
first it declares a random number n; then it outputs a "winner" number (based on some rules; the code itself is irrelevant here, I think), 20 times (it should be more, but first I need a way to record the values).
So I have n and 20 other numbers, each between 1 and n (inclusive). After running the code once, I want to see the 20 values and how they are distributed (for example, are they grouped around one particular number or in a region, is there a pattern; this depends on the rules, of course).
The problem is that I'm only familiar with basic Java (I use Eclipse) and have no clue how to record, say, 2000 numbers instead of 20 (so for a given n the code should print 2000 or more numbers, presented as a function: the domain is 1, 2, ..., n and the range is 0, 1, ..., 2000, since it could happen that all 2000 numbers are the same). I thought of Excel, but how could I connect Java code with it? Visual presentation is not necessary, although it would make my work easier (I hope, at least).
The code:
import java.util.Random;

public class korbeadosjo {
    public static void main(String[] args) {
        Random rand = new Random();
        int n = rand.nextInt(300) + 2;
        System.out.println("n= " + n);
        int narrayban = n - 1;
        int jatekmester = n / 2;
        int jatekmesterarrayban = jatekmester - 1;
        System.out.println("n/2: " + jatekmester);

        for (int i = 0; i < 400; i++) {
            int hanyembernelvoltmar = 1;
            int voltmar[] = new int[n];
            voltmar[jatekmesterarrayban] = 1;
            int holvan = jatekmester;
            int holvanarrayban = holvan - 1;

            fori:
            for (;;) {
                int jobbravagybalra = rand.nextInt(2);
                switch (jobbravagybalra) {
                case 0: // balra (left)
                    if (holvanarrayban == 0) {
                        holvanarrayban = narrayban;
                    } else {
                        --holvanarrayban;
                    }
                    if (voltmar[holvanarrayban] == 0) {
                        voltmar[holvanarrayban] = 1;
                        ++hanyembernelvoltmar;
                    }
                    break;
                case 1: // jobbra (right)
                    if (holvanarrayban == narrayban) {
                        holvanarrayban = 0;
                    } else {
                        ++holvanarrayban;
                    }
                    if (voltmar[holvanarrayban] == 0) {
                        voltmar[holvanarrayban] = 1;
                        ++hanyembernelvoltmar;
                    }
                    break;
                }
                if (hanyembernelvoltmar == n) {
                    System.out.println(holvanarrayban + 1);
                    break fori;
                }
            }
        }
    }
}
basic Java (I used eclipse)
That part is unrelated; the IDE doesn't matter here.
I could only find two actual questions in your post:
How to create statistics from the output of Java code?
You likely don't want to work with the printed output alone. Use those numbers inside your Java program to compute what you want, then output the result.
How did you store the 2000 values -- an array, a list, a queue? Iterate over that data structure and generate the statistics you need.
I thought of Excel, but how could I connect a Java code with it?
There is this site.
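If a simple route is enough, here is a rough sketch of the bookkeeping idea from above: tally each winner into an array indexed 1..n and write the counts as a CSV file that Excel can open. The method and file names (runOneGame, results.csv) are placeholders, not anything from your code:

// Count how often each winner value 1..n occurs, then write a CSV that Excel can open.
static void recordStatistics(int n, int runs) throws java.io.FileNotFoundException {
    int[] counts = new int[n + 1];          // indices 1..n are used
    for (int k = 0; k < runs; k++) {
        int winner = runOneGame();          // hypothetical helper returning the winner (1..n)
        counts[winner]++;
    }
    try (java.io.PrintWriter out = new java.io.PrintWriter("results.csv")) {
        out.println("value,count");
        for (int v = 1; v <= n; v++) {
            out.println(v + "," + counts[v]);
        }
    }
}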
I am trying to build an OCR by calculating the correlation coefficient between characters extracted from an image and every character I have pre-stored in a database. My implementation is in Java, and the pre-stored characters are loaded into an ArrayList when the application starts, i.e.
ArrayList<byte[]> storedCharacters, extractedCharacters;
storedCharacters = load_all_characters_from_database();
extractedCharacters = extract_characters_from_image();

// Calculate the coefficient between every extracted character
// and every character in the database.
double maxCorr = -1;
for(byte[] extractedCharacter : extractedCharacters)
    for(byte[] storedCharacter : storedCharacters)
    {
        double corr = findCorrelation(extractedCharacter, storedCharacter);
        if (corr > maxCorr)
            maxCorr = corr;
    }
...
...

public double findCorrelation(byte[] extractedCharacter, byte[] storedCharacter)
{
    double mag1 = 0, mag2 = 0, corr = 0;
    for(int i = 0; i < extractedCharacter.length; i++)
    {
        mag1 += extractedCharacter[i] * extractedCharacter[i];
        mag2 += storedCharacter[i] * storedCharacter[i];
        corr += extractedCharacter[i] * storedCharacter[i];
    } // for
    corr /= Math.sqrt(mag1 * mag2);
    return corr;
}
The number of extractedCharacters is around 100-150 per image, but the database has 15,600 stored binary characters. Checking the correlation coefficient between every extracted character and every stored character hurts performance: it takes around 15-20 seconds per image on an Intel i5 CPU.
Is there a way to improve the speed of this program, or can you suggest another approach that gives similar results? (The results produced by comparing every character against such a large dataset are quite good.)
Thank you in advance.
UPDATE 1
public static void run() {
    ArrayList<byte[]> storedCharacters, extractedCharacters;
    storedCharacters = load_all_characters_from_database();
    extractedCharacters = extract_characters_from_image();

    // Calculate the coefficient between every extracted character
    // and every character in the database.
    computeNorms(storedCharacters, extractedCharacters);
    double maxCorr = -1;
    for (int e = 0; e < extractedCharacters.size(); e++)
        for (int s = 0; s < storedCharacters.size(); s++)
        {
            double corr = findCorrelation(extractedCharacters.get(e), storedCharacters.get(s), s, e);
            if (corr > maxCorr)
                maxCorr = corr;
        }
}
private static double[] storedNorms;
private static double[] extractedNorms;

// Correlation between two binary images.
public static double findCorrelation(byte[] arr1, byte[] arr2, int strCharIndex, int extCharNo) {
    final int dotProduct = dotProduct(arr1, arr2);
    final double corr = dotProduct * storedNorms[strCharIndex] * extractedNorms[extCharNo];
    return corr;
}

public static void computeNorms(ArrayList<byte[]> storedCharacters, ArrayList<byte[]> extractedCharacters) {
    storedNorms = computeInvNorms(storedCharacters);
    extractedNorms = computeInvNorms(extractedCharacters);
}

private static double[] computeInvNorms(List<byte[]> a) {
    final double[] result = new double[a.size()];
    for (int i = 0; i < result.length; ++i)
        result[i] = 1 / Math.sqrt(dotProduct(a.get(i), a.get(i)));
    return result;
}

private static int dotProduct(byte[] arr1, byte[] arr2) {
    int dotProduct = 0;
    for (int i = 0; i < arr1.length; i++)
        dotProduct += arr1[i] * arr2[i];
    return dotProduct;
}
Nowadays it's hard to find a CPU with a single core (even in mobiles), and the tasks here are nicely independent, so parallelizing the outer loop across cores takes only a few lines. I'd go for it, though the gain is limited to the number of cores.
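As a minimal sketch of that idea with parallel streams (assuming the two-argument findCorrelation from the question, which only reads its arguments and is therefore safe to call from several threads):

// Spread the outer loop over all cores with a parallel stream.
double maxCorr = extractedCharacters.parallelStream()
        .mapToDouble(extracted -> {
            double best = -1;
            for (byte[] stored : storedCharacters)
                best = Math.max(best, findCorrelation(extracted, stored));
            return best;
        })
        .max()
        .orElse(-1);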
In case you really mean cross-correlation, a transform like the DFT or DCT could help. They surely do for big images, but with yours at 12x16, I'm not sure.
Maybe you mean just a dot product? And maybe you should tell us?
Note that you actually don't need to compute the correlation; most of the time you only need to find out whether it's bigger than a threshold:
corr = findCorrelation(extractedCharacter, storedCharacter)
..... more code to check if this is the best match ......
This may lead to some optimizations or not, depending on what the images look like.
Note also that a simple low-level optimization can give you nearly a factor of 4, as in this question of mine. Maybe you really should tell us what you're doing?
UPDATE 1
I guess that due to the computation of three products in the loop, there's enough instruction level parallelism, so a manual loop unrolling like in my above question is not necessary.
However, I see that those three products get computed some 100 * 15600 times, while only one of them depends on both extractedCharacter and storedCharacter. So you can compute
100 + 15600 + 100 * 15600
dot products instead of
3 * 100 * 15600
This way you may get a factor of three pretty easily.
Or not. After this step there's a single sum computed in the relevant step and the problem linked above applies. And so does its solution (unrolling manually).
Factor 5.2
While byte[] is nicely compact, the computation involves extending the bytes to ints, which costs some time, as my benchmark shows. Converting the byte[]s to int[]s before all the correlations get computed saves time. Even better, the conversion of the storedCharacters can be done once beforehand.
Manual loop unrolling twice helps, but unrolling more doesn't.
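A rough sketch of those two ideas combined (converting to int[] once up front and unrolling the dot product twice); the method names are mine, and arrays of equal length are assumed, as in your dotProduct:

// Convert each byte[] to int[] once, before the correlation loop.
static int[] toInts(byte[] a) {
    int[] result = new int[a.length];
    for (int i = 0; i < a.length; i++)
        result[i] = a[i];
    return result;
}

// Dot product on int[] with the loop unrolled twice (two independent accumulators).
static int dotProduct(int[] a, int[] b) {
    int sum0 = 0, sum1 = 0;
    int i = 0;
    for (; i + 1 < a.length; i += 2) {
        sum0 += a[i] * b[i];
        sum1 += a[i + 1] * b[i + 1];
    }
    if (i < a.length)                       // odd-length tail
        sum0 += a[i] * b[i];
    return sum0 + sum1;
}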
I have a file of around 4-5 GB (nearly a billion lines). From every line of the file, I have to parse an array of integers plus one additional integer, and update my custom data structure. The class that holds this information looks like:
class Holder {
    private int[][] arr = new int[1000000000][5]; // assuming that max array size is 5
    private int[] meta = new int[1000000000];
}
A sample line from the file looks like
(1_23_4_55) 99
Every index in arr and meta corresponds to a line number in the file. From the above line, I extract the array of integers first and then the meta information. In that case:
--pseudo_code--
arr[line_num] = new int[]{1, 23, 4, 55}
meta[line_num] = 99
Right now I am using a BufferedReader and its readLine method to read each line, then character-level operations to parse the integer array and meta information from each line and populate the Holder instance. But it takes almost half an hour to complete the entire operation.
I tried both Java Serialization and Externalizable (writing meta and arr) to serialize and deserialize this HUGE Holder instance. With both of them, the time to serialize is almost half an hour, and the time to deserialize is also almost half an hour.
I would appreciate your suggestions on dealing with this kind of problem, and would definitely love to hear your stories if you have any.
P.S. Main memory is not a problem; I have almost 50 GB of RAM on my machine. I have also increased the BufferedReader size to 40 MB (of course, I could increase this up to 100 MB, considering that disk access runs at approx. 100 MB/sec). Even cores and CPU are not a problem.
EDIT I
The code that I am using to do this task is provided below (after anonymizing very little information):
public class BigFileParser {

    private int parsePositiveInt(final String s) {
        int num = 0;
        int sign = -1;
        final int len = s.length();
        final char ch = s.charAt(0);
        if (ch == '-')
            sign = 1;
        else
            num = '0' - ch;
        int i = 1;
        while (i < len)
            num = num * 10 + '0' - s.charAt(i++);
        return sign * num;
    }

    private void loadBigFile() {
        long startTime = System.nanoTime();
        Holder holder = new Holder();
        String line;
        try {
            Reader fReader = new FileReader("/path/to/BIG/file");
            // 40 MB buffer size
            BufferedReader bufferedReader = new BufferedReader(fReader, 40960);
            String tempTerm;
            int i, meta, ascii, len;
            boolean consumeNextInteger;
            // GNU Trove primitive int array list
            TIntArrayList arr;
            char c;
            while ((line = bufferedReader.readLine()) != null) {
                consumeNextInteger = true;
                tempTerm = "";
                arr = new TIntArrayList(5);
                for (i = 0, len = line.length(); i < len; i++) {
                    c = line.charAt(i);
                    ascii = c - 0;
                    // 95 is the ASCII value of the '_' char
                    if (consumeNextInteger && ascii == 95) {
                        arr.add(parsePositiveInt(tempTerm));
                        tempTerm = "";
                    } else if (ascii >= 48 && ascii <= 57) { // '0' - '9'
                        tempTerm += c;
                    } else if (ascii == 9) { // '\t'
                        arr.add(parsePositiveInt(tempTerm));
                        consumeNextInteger = false;
                        tempTerm = "";
                    }
                }
                meta = parsePositiveInt(tempTerm);
                holder.update(arr, meta);
            }
            bufferedReader.close();
            long endTime = System.nanoTime();
            System.out.println("#time -> " + (endTime - startTime) * 1.0
                    / 1000000000 + " seconds");
        } catch (IOException exp) {
            exp.printStackTrace();
        }
    }
}
public class Holder {
    private static final int SIZE = 500000000;

    private TIntArrayList[] arrs;
    private TIntArrayList metas;
    private int idx;

    public Holder() {
        arrs = new TIntArrayList[SIZE];
        metas = new TIntArrayList(SIZE);
        idx = 0;
    }

    public void update(TIntArrayList arr, int meta) {
        arrs[idx] = arr;
        metas.add(meta);
        idx++;
    }
}
It sounds like the time taken for file I/O is the main limiting factor, given that serialization (a binary format) and your own custom format take about the same time.
Therefore, the best thing you can do is reduce the size of the file. If your numbers are generally small, you could get a huge boost from Google protocol buffers, which encode small integers in one or two bytes.
Or, if you know that all your numbers are in the 0-255 range, you could use a byte[] rather than an int[] and cut the size (and hence the load time) to a quarter of what it is now (assuming you go back to serialization or just write to a ByteChannel).
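For illustration, a sketch of the byte-per-value idea written through a plain buffered stream rather than a ByteChannel (the file name holder.bin and the lineCount variable are placeholders, the values are assumed to fit in a byte, and this belongs in a method that throws IOException):

// Write each line as 6 raw bytes: the 5 array values plus meta.
try (java.io.DataOutputStream out = new java.io.DataOutputStream(
        new java.io.BufferedOutputStream(new java.io.FileOutputStream("holder.bin"), 1 << 16))) {
    for (int row = 0; row < lineCount; row++) {
        for (int col = 0; col < 5; col++)
            out.writeByte(arr[row][col]);
        out.writeByte(meta[row]);
    }
}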
It simply can't take that long. You're working with some 6e9 ints, which is 24 GB. Writing 24 GB to disk takes some time, but nothing like half an hour.
I'd put all the data into a single one-dimensional array and access it via methods like int getArr(int row, int col) which transform row and col into a single index. Depending on how the array is usually accessed (row-wise or column-wise), this index would be computed as N * row + col or N * col + row to maximize locality. I'd also store meta in the same array.
Writing a single huge int[] into memory should be pretty fast, surely not half an hour.
Because of the data volume, the above doesn't work directly, as you can't have an array with 6e9 entries. But you can use a couple of big arrays instead, and all of the above applies (compute a long index from row and col and split it into two ints for accessing the 2D array).
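A minimal sketch of that layout, with the chunking done by splitting a long index across a few big int[]s (the class name, the 6-columns-per-line choice and the chunk size are mine, not from the question):

// One logical 2D array (5 values + meta per line) stored row-wise across a few big int[] chunks.
class FlatHolder {
    private static final int COLS = 6;            // 5 array values + 1 meta per line
    private static final int CHUNK = 1 << 27;     // ints per chunk (512 MB each)
    private final int[][] chunks;

    FlatHolder(long rows) {
        long cells = rows * COLS;
        chunks = new int[(int) ((cells + CHUNK - 1) / CHUNK)][];
        for (int i = 0; i < chunks.length; i++)
            chunks[i] = new int[(int) Math.min(CHUNK, cells - (long) i * CHUNK)];
    }

    int get(long row, int col) {
        long idx = row * COLS + col;              // row-wise for locality
        return chunks[(int) (idx / CHUNK)][(int) (idx % CHUNK)];
    }

    void set(long row, int col, int value) {
        long idx = row * COLS + col;
        chunks[(int) (idx / CHUNK)][(int) (idx % CHUNK)] = value;
    }
}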
Make sure you aren't swapping. Swapping is the most probable reason for the slow speed I can think of.
There are several alternative Java file I/O libraries. This article is a little old, but it gives an overview that's still generally valid. The author reads about 300 MB per second with a 6-year-old Mac, so for 4 GB you have under 15 seconds of read time. Of course, my experience is that Mac I/O channels are very good; YMMV if you have a cheap PC.
Note that there is no advantage above a buffer size of about 4 KB. In fact, you're more likely to cause thrashing with a big buffer, so don't do that.
The implication is that parsing characters into the data you need is the bottleneck.
I have found in other apps that reading into a block of bytes and writing C-like code to extract what I need goes faster than the built-in Java mechanisms like split and regular expressions.
If that still isn't fast enough, you'd have to fall back to a native C extension.
If you randomly pause it, you will probably see that the bulk of the time goes into parsing the integers and/or all the new-ing, as in new int[]{1, 23, 4, 55}. You should be able to allocate the memory just once and stick the numbers into it at better than I/O speed if you code it carefully.
But there's another way: why is the file in ASCII?
If it were in binary, you could just slurp it up.
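To make the binary idea concrete, here is a sketch of slurping such a file with a memory-mapped channel. The fixed layout of 6 ints (5 values + meta) per line is an assumption about how the binary file might be written, not your current format, and the enclosing method needs to throw IOException:

// Memory-map a binary file of raw ints and read it with no text parsing at all.
final int BYTES_PER_LINE = 6 * 4;
final long CHUNK = (long) BYTES_PER_LINE * (1 << 24);    // ~384 MB, whole lines per chunk
try (java.nio.channels.FileChannel ch = java.nio.channels.FileChannel.open(
        java.nio.file.Paths.get("/path/to/BIG/file.bin"),
        java.nio.file.StandardOpenOption.READ)) {
    for (long pos = 0; pos < ch.size(); pos += CHUNK) {
        long len = Math.min(CHUNK, ch.size() - pos);
        java.nio.IntBuffer ints =
                ch.map(java.nio.channels.FileChannel.MapMode.READ_ONLY, pos, len).asIntBuffer();
        while (ints.hasRemaining()) {
            int v1 = ints.get(), v2 = ints.get(), v3 = ints.get();
            int v4 = ints.get(), v5 = ints.get(), meta = ints.get();
            // ... feed these into the Holder here instead of parsing characters ...
        }
    }
}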
I need the steps to perform document clustering using the k-means algorithm in Java.
It would be very helpful if you could lay the steps out simply.
Thanks in advance.
You need to count the words in each document and build a feature vector, generally called a bag of words. Before that you need to remove stop words (very common words that don't carry much information, like "the", "a", etc.). You can generally take the top n most common words from your documents, count the frequency of these words, and store the counts in an n-dimensional vector.
For the distance measure you can use cosine similarity.
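For example, a small sketch of cosine similarity between two such count vectors (one count per vocabulary word; equal-length vectors assumed):

// Cosine similarity between two bag-of-words count vectors of equal length.
static double cosineSimilarity(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
        dot   += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}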
Here is a simple algorithm for 2-means on 1-dimensional data points. You can extend it to k means and n-dimensional data points easily. Let me know if you want an n-dimensional implementation.
double[] x = {1, 2, 2.5, 3, 3.5, 4, 4.5, 5, 7, 8, 8.5, 9, 9.5, 10};
double[] center = new double[2];
double[] precenter = new double[2];
ArrayList<Double>[] cluster = new ArrayList[2];
cluster[0] = new ArrayList<Double>();
cluster[1] = new ArrayList<Double>();

// Pick 2 distinct random indices from 0 to x.length - 1 as initial centers.
int[] randIdx = new int[2];
Random rand = new Random();
randIdx[0] = rand.nextInt(x.length);
randIdx[1] = rand.nextInt(x.length);
while (randIdx[0] == randIdx[1]) {
    randIdx[1] = rand.nextInt(x.length);
}
center[0] = x[randIdx[0]];
center[1] = x[randIdx[1]];
// (There is a better way to generate k random indices without replacement; just search.)

do {
    cluster[0].clear();
    cluster[1].clear();
    // Assignment step: put every point into the cluster of its nearest center.
    for (int i = 0; i < x.length; ++i) {
        if (Math.abs(x[i] - center[0]) <= Math.abs(x[i] - center[1])) {
            cluster[0].add(x[i]);
        } else {
            cluster[1].add(x[i]);
        }
    }
    // Update step: move each center to the mean of its cluster.
    precenter[0] = center[0];
    precenter[1] = center[1];
    center[0] = mean(cluster[0]);
    center[1] = mean(cluster[1]);
} while (precenter[0] != center[0] || precenter[1] != center[1]);

double mean(ArrayList<Double> list) {
    double sum = 0;
    for (int index = 0; index < list.size(); index++) {
        sum += list.get(index);
    }
    return sum / list.size();
}
The cluster[0] and cluster[1] lists contain the points in each cluster, and center[0], center[1] are the two means.
You may still need to do some debugging, because I originally wrote the code in R and just converted it to Java for you :)
Does this help you? The wiki article also has some links to implementations in other languages, ready to be ported to Java.
Steps of the algorithm:
Define the number of clusters you want to have.
Distribute the points randomly in your problem space.
Link every observation to the nearest point.
Calculate the center of mass for each cluster and move the point to the middle.
Link the points to the center points again and repeat until the points don't move any more.
What do you want to cluster the documents on? If it's similarity, you'll need to do some natural language processing first, and then you'll need a metric (some kind of assignment algorithm) to place the documents into clusters (CRP works and is relatively straightforward).
The hardest part will be the NLP (language processing) if you're not clustering them on something simple like length. I can provide more info on all of these, but I won't dive down the rabbit hole if you don't need it.