search a line in preprocessed big text file - java

I have a data file which contains 100,000+ lines, each line containing just two fields, key and value, split by a comma, and all the keys are unique. I want to query a value by key from this file. Loading it into a map is out of the question as that consumes too much memory (the code will run on an embedded device), and I don't want a DB involved. What I do so far is to preprocess the file on my PC, i.e., sort the lines, then use binary search like below on the preprocessed file:
public long findKeyOffset(RandomAccessFile raf, String key)
        throws IOException {
    int blockSize = 8192;
    long fileSize = raf.length();
    long min = 0;
    long max = (long) fileSize / blockSize;
    long mid;
    String line;
    while (max - min > 1) {
        mid = min + (long) ((max - min) / 2);
        raf.seek(mid * blockSize);
        if (mid > 0)
            line = raf.readLine(); // probably a partial line
        line = raf.readLine();
        String[] parts = line.split(",");
        if (key.compareTo(parts[0]) > 0) {
            min = mid;
        } else {
            max = mid;
        }
    }
    // find the right line
    min = min * blockSize;
    raf.seek(min);
    if (min > 0)
        line = raf.readLine();
    while (true) {
        min = raf.getFilePointer();
        line = raf.readLine();
        if (line == null)
            break;
        String[] parts = line.split(",");
        // compare the line's key field against the search key (not the whole line)
        if (parts[0].compareTo(key) >= 0)
            break;
    }
    raf.seek(min);
    return min;
}
I think there are better solutions than this. Can anyone give me some enlightenment?

Data is immutable and keys are unique (as mentioned in the comments on the question).
A simple solution: write your own hashing code to map a key to a line number.
This means: leave out the sorting and instead write your data to the file in the order that your hashing algorithm dictates.
When a key is queried, you hash the key, get the specific line number and then read the value.
In theory, you have an O(1) solution to your problem.
Ensure that the hashing algorithm has few collisions, but I think that depending upon your exact case, a few collisions should be fine. Example: 3 keys map to the same line number, so you write all three of them on the same line, and when any of the collided keys is searched, you read all 3 entries from that line. Then do a linear (aka O(3), aka constant time in this case) search on the entire line.
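A minimal sketch of that idea, assuming fixed-width records so the hash maps directly to a byte offset (the record width, bucket count, hash function and collision separator are all invented here for illustration; the same scheme must be used when the file is written):
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Sketch: the preprocessed file holds BUCKETS fixed-width lines; a key is hashed
// to a bucket number, which maps directly to a byte offset in the file.
public class HashedLineLookup {
    private static final int RECORD_WIDTH = 64;   // bytes per bucket line, space-padded (assumption)
    private static final int BUCKETS = 131072;    // must match the value used when writing the file

    private static int bucketOf(String key) {
        return Math.floorMod(key.hashCode(), BUCKETS);
    }

    public static String lookup(RandomAccessFile raf, String key) throws IOException {
        raf.seek((long) bucketOf(key) * RECORD_WIDTH);
        byte[] record = new byte[RECORD_WIDTH];
        raf.readFully(record);
        // A bucket may hold a few colliding "key,value" pairs separated by ';'
        for (String entry : new String(record, StandardCharsets.US_ASCII).trim().split(";")) {
            String[] kv = entry.split(",", 2);
            if (kv.length == 2 && kv[0].equals(key)) {
                return kv[1];
            }
        }
        return null; // key not present
    }
}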

An easy algorithm to optimise performance for your specific constraints:
let n be the number of lines in the original, immutable, sorted file.
let k < n be a number (we'll discuss the ideal number later).
Divide the file into k files, with an approximately equal number of lines in each (so each file has n/k lines). The files will be referred to as F1...Fk. If you prefer to keep the original file intact, just consider F1...Fk as line-number ranges within the file, cutting it into segments.
Create a new file called P with k lines, where each line i is the first key of Fi.
When looking for a key, first do a binary search over P, using O(log k), to find which file/segment (F1...Fk) you need to go to. Then go to that file/segment and search within it.
If k is big enough, then the size of Fi (n/k) will be small enough to load into a HashMap and retrieve the key with O(1). If that is still not practical, do a binary search of O(log(n/k)).
The total search will be O(log k) + O(log(n/k)), which is an improvement on the O(log n) of your original solution.
I would suggest finding a k that is big enough to allow you to load a specific Fi file/segment into a HashMap, but not so big that it fills up space on your device. The most balanced k is sqrt(n), which makes the solution run in O(log(sqrt(n))), but that may give quite a large P file. If you get a k which allows you to load both P and Fi into a HashMap for O(1) retrieval, that would be the best solution.
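A rough sketch of that two-level lookup, assuming P and the segments are written as separate sorted files named p.idx and seg_0 ... seg_k-1 (all names and the one-key-per-line format of P are assumptions):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartitionedLookup {
    // First keys of F1..Fk, read once from the small P file and kept in memory.
    private final List<String> firstKeys = new ArrayList<>();

    public PartitionedLookup(String pFile) throws IOException {
        try (BufferedReader r = new BufferedReader(new FileReader(pFile))) {
            String line;
            while ((line = r.readLine()) != null) firstKeys.add(line.trim());
        }
    }

    // Binary search over P: index of the last segment whose first key is <= key.
    private int segmentOf(String key) {
        int lo = 0, hi = firstKeys.size() - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (firstKeys.get(mid).compareTo(key) <= 0) lo = mid; else hi = mid - 1;
        }
        return lo;
    }

    // Load only the n/k lines of the chosen segment and look the key up there.
    public String lookup(String key) throws IOException {
        Map<String, String> segment = new HashMap<>();
        try (BufferedReader r = new BufferedReader(new FileReader("seg_" + segmentOf(key)))) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] kv = line.split(",", 2);
                segment.put(kv[0], kv[1]);
            }
        }
        return segment.get(key);
    }
}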

What about this?
#include <iostream>
#include <fstream>
#include <boost/algorithm/string.hpp>
#include <vector>

using namespace std;

int main(int argc, char *argv[])
{
    ifstream f(argv[1], ios::ate);
    if (!f.is_open())
        return 0;
    string key(argv[2]), value;
    int max = f.tellg();
    int min = 0, mid = 0;
    string s;
    while (max - min > 1)
    {
        mid = min + (max - min) / 2;
        f.seekg(mid);
        f >> s;
        std::vector<std::string> strs;
        if (!f)
        {
            break;
        }
        if (mid)
        {
            f >> s;
        }
        boost::split(strs, s, boost::is_any_of(","));
        int comp = key.compare(strs[0]);
        if (comp < 0)
        {
            max = mid;
        }
        else if (comp > 0)
        {
            min = mid;
        }
        else
        {
            value = strs[1];
            break;
        }
    }
    cout << "key " << key;
    if (!value.empty())
    {
        cout << " found! value = " << value << endl;
    }
    else
    {
        cout << " not found..." << endl;
    }
    f.close();
    return 0;
}


Java - Return random index of specific character in string

So given a string such as: 0100101, I want to return a random single index of one of the positions of a 1 (1, 5, 6).
So far I'm using:
protected int getRandomBirthIndex(String s) {
    ArrayList<Integer> birthIndicies = new ArrayList<Integer>();
    for (int i = 0; i < s.length(); i++) {
        if (s.charAt(i) == '1') {
            birthIndicies.add(i);
        }
    }
    return birthIndicies.get(Randomizer.nextInt(birthIndicies.size()));
}
However, it's causing a bottle-neck on my code (45% of CPU time is in this method), as the strings are over 4000 characters long. Can anyone think of a more efficient way to do this?
If you're interested in a single index of one of the positions with 1, and assuming there is at least one 1 in your input, you can just do this:
String input = "0100101";
final int n = input.length();
Random generator = new Random();
char c = 0;
int i = 0;
do {
    i = generator.nextInt(n);
    c = input.charAt(i);
} while (c != '1');
System.out.println(i);
This solution is fast and does not consume much memory when 1s and 0s are distributed roughly uniformly. As highlighted by @paxdiablo, it can perform poorly in some cases, for example when 1s are scarce.
You could use String.indexOf(int) to find each 1 (instead of iterating every character). I would also prefer to program to the List interface and to use the diamond operator <>. Something like,
private static Random rand = new Random();

protected int getRandomBirthIndex(String s) {
    List<Integer> birthIndicies = new ArrayList<>();
    int index = s.indexOf('1');
    while (index > -1) {
        birthIndicies.add(index);
        index = s.indexOf('1', index + 1);
    }
    return birthIndicies.get(rand.nextInt(birthIndicies.size()));
}
Finally, if you need to do this many times, save the List as a field and re-use it (instead of calculating the indices every time). For example with memoization,
private static Random rand = new Random();
private static Map<String, List<Integer>> memo = new HashMap<>();

protected int getRandomBirthIndex(String s) {
    List<Integer> birthIndicies;
    if (!memo.containsKey(s)) {
        birthIndicies = new ArrayList<>();
        int index = s.indexOf('1');
        while (index > -1) {
            birthIndicies.add(index);
            index = s.indexOf('1', index + 1);
        }
        memo.put(s, birthIndicies);
    } else {
        birthIndicies = memo.get(s);
    }
    return birthIndicies.get(rand.nextInt(birthIndicies.size()));
}
Well, one way would be to remove the creation of the list each time, by caching the list based on the string itself, assuming the strings are used more often than they're changed. If they're not, then caching methods won't help.
The caching method involves, rather than having just a string, have an object consisting of:
current string;
cached string; and
list based on the cached string.
You can provide a function to the clients to create such an object from a given string and it would set the string and the cached string to whatever was passed in, then calculate the list. Another function would be used to change the current string to something else.
The getRandomBirthIndex() function then receives this structure (rather than the string) and follows the rule set:
if the current and cached strings are different, set the cached string to be the same as the current string, then recalculate the list based on that.
in any case, return a random element from the list.
That way, if the list changes rarely, you avoid the expensive recalculation where it's not necessary.
In pseudo-code, something like this should suffice:
# Constructs fastie from string.
# Sets cached string to something other than
# that passed in (lazy list creation).
def fastie.constructor(string s):
    me.current = s
    me.cached = s + "!"

# Changes current string in fastie. No list update in
# case you change it again before needing an element.
def fastie.changeString(string s):
    me.current = s

# Get a random index, will recalculate list first but
# only if necessary. Empty list returns index of -1.
def fastie.getRandomBirthIndex():
    me.recalcListFromCached()
    if me.list.size() == 0:
        return -1
    return me.list[random(me.list.size())]

# Recalculates the list from the current string.
# Done on an as-needed basis.
def fastie.recalcListFromCached():
    if me.current != me.cached:
        me.cached = me.current
        me.list = empty
        for idx = 0 to me.cached.length() - 1 inclusive:
            if me.cached[idx] == '1':
                me.list.append(idx)
You also have the option of speeding up the actual search for the 1 character by, for example, using indexOf() to locate them with the underlying Java libraries, rather than checking each character individually in your own code (again, pseudo-code):
def fastie.recalcListFromCached():
    if me.current != me.cached:
        me.cached = me.current
        me.list = empty
        idx = me.cached.indexOf('1')
        while idx != -1:
            me.list.append(idx)
            idx = me.cached.indexOf('1', idx + 1)
This method can be used even if you don't cache the values. It's likely to be faster using Java's probably-optimised string search code than doing it yourself.
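For completeness, a compact Java sketch of the caching wrapper described above (class and field names are invented here); the index list is rebuilt lazily, only when the string has changed since the last call:
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Caches the positions of '1' and only recomputes them when the string changes.
class OneIndexCache {
    private final Random rand = new Random();
    private String current;
    private String cached;                       // string the list was built from
    private List<Integer> ones = new ArrayList<>();

    OneIndexCache(String s) { current = s; cached = null; }

    void setString(String s) { current = s; }    // no list update yet (lazy)

    int getRandomBirthIndex() {
        if (!current.equals(cached)) {           // rebuild only if needed
            ones = new ArrayList<>();
            for (int i = current.indexOf('1'); i != -1; i = current.indexOf('1', i + 1)) {
                ones.add(i);
            }
            cached = current;
        }
        if (ones.isEmpty()) return -1;
        return ones.get(rand.nextInt(ones.size()));
    }
}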
However, you should keep in mind that your supposed problem of spending 45% of time in that code may not be an issue at all. It's not so much the proportion of time spent there as it is the absolute amount of time.
By that, I mean it probably makes no difference what percentage of the time being spent in that function if it finishes in 0.001 seconds (and you're not wanting to process thousands of strings per second). You should only really become concerned if the effects become noticeable to the user of your software somehow. Otherwise, optimisation is pretty much wasted effort.
You can even try this: the best-case complexity is O(1), while the worst case may go to O(n), or in principle be unbounded, since it depends purely on the Randomizer function that you are using.
private static Random rand = new Random();

protected int getRandomBirthIndex(String s) {
    List<Integer> birthIndicies = new ArrayList<>();
    int index = s.indexOf('1');
    while (index > -1) {
        birthIndicies.add(index);
        index = s.indexOf('1', index + 1);
    }
    return birthIndicies.get(rand.nextInt(birthIndicies.size()));
}
If your Strings are very long and you're sure they contain a lot of 1s (or whatever character you're looking for), it's probably faster to randomly "poke around" in the String until you find what you are looking for. That way you save the time spent iterating over the String:
String s = "0100101";
int index = ThreadLocalRandom.current().nextInt(s.length());
while (s.charAt(index) != '1') {
    System.out.println("got not a 1, trying again");
    index = ThreadLocalRandom.current().nextInt(s.length());
}
System.out.println("found: " + index + " - " + s.charAt(index));
I'm not sure about the statistics, but in rare cases this solution may take much longer than the iterating solution. One such case is a long String with only a few occurrences of the search character.
If the source String doesn't contain the search character at all, this code will run forever!
One possibility is to use a short-circuited Fisher-Yates style shuffle. Create an array of the indices and start shuffling it. As soon as the next shuffled element points to a one, return that index. If you find you've iterated through indices without finding a one, then this string contains only zeros so return -1.
If the length of the strings is always the same, the array indices can be static as shown below, and doesn't need reinitializing on new invocations. If not, you'll have to move the declaration of indices into the method and initialize it each time with the correct index set. The code below was written for strings of length 7, such as your example of 0100101.
// delete this and uncomment below if string lengths vary
private static int[] indices = { 0, 1, 2, 3, 4, 5, 6 };

protected int getRandomBirthIndex(String s) {
    int tmp;
    /*
     * int[] indices = new int[s.length()];
     * for (int i = 0; i < s.length(); ++i) indices[i] = i;
     */
    for (int i = 0; i < s.length(); i++) {
        int j = randomizer.nextInt(indices.length - i) + i;
        if (j != i) { // swap to shuffle
            tmp = indices[i];
            indices[i] = indices[j];
            indices[j] = tmp;
        }
        if (s.charAt(indices[i]) == '1') {
            return indices[i];
        }
    }
    return -1;
}
This approach terminates quickly if 1's are dense, guarantees termination after s.length() iterations even if there aren't any 1's, and the locations returned are uniform across the set of 1's.

Coefficient Correlation Over a Large Binary Image Data-Set - Slow Performance

I am trying to build an OCR by calculating the Coefficient Correlation between characters extracted from an image with every character I have pre-stored in a database. My implementation is based on Java and pre-stored characters are loaded into an ArrayList upon the beginning of the application, i.e.
ArrayList<byte[]> storedCharacters, extractedCharacters;
storedCharacters = load_all_characters_from_database();
extractedCharacters = extract_characters_from_image();
// Calculate the coefficient between every extracted character
// and every character in the database.
double maxCorr = -1;
for (byte[] extractedCharacter : extractedCharacters)
    for (byte[] storedCharacter : storedCharacters) {
        double corr = findCorrelation(extractedCharacter, storedCharacter);
        if (corr > maxCorr)
            maxCorr = corr;
    }
...
...
public double findCorrelation(byte[] extractedCharacter, byte[] storedCharacter) {
    double mag1 = 0, mag2 = 0, corr = 0;
    for (int i = 0; i < extractedCharacter.length; i++) {
        mag1 += extractedCharacter[i] * extractedCharacter[i];
        mag2 += storedCharacter[i] * storedCharacter[i];
        corr += extractedCharacter[i] * storedCharacter[i];
    } // for
    corr /= Math.sqrt(mag1 * mag2);
    return corr;
}
The number of extractedCharacters is around 100-150 per image, but the database has 15600 stored binary characters. Checking the coefficient correlation between every extracted character and every stored character has an impact on performance, as it needs around 15-20 seconds to complete for every image on an Intel i5 CPU.
Is there a way to improve the speed of this program, or can you suggest another way of building this that brings similar results? (The results produced by comparing every character with such a large dataset are quite good.)
Thank you in advance
UPDATE 1
public static void run() {
    ArrayList<byte[]> storedCharacters, extractedCharacters;
    storedCharacters = load_all_characters_from_database();
    extractedCharacters = extract_characters_from_image();
    // Calculate the coefficient between every extracted character
    // and every character in the database.
    computeNorms(storedCharacters, extractedCharacters);
    double maxCorr = -1;
    for (byte[] extractedCharacter : extractedCharacters)
        for (byte[] storedCharacter : storedCharacters) {
            double corr = findCorrelation(extractedCharacter, storedCharacter);
            if (corr > maxCorr)
                maxCorr = corr;
        }
}
private static double[] storedNorms;
private static double[] extractedNorms;

// Correlation between two binary images
public static double findCorrelation(byte[] arr1, byte[] arr2, int strCharIndex, int extCharNo) {
    final int dotProduct = dotProduct(arr1, arr2);
    final double corr = dotProduct * storedNorms[strCharIndex] * extractedNorms[extCharNo];
    return corr;
}

public static void computeNorms(ArrayList<byte[]> storedCharacters, ArrayList<byte[]> extractedCharacters) {
    storedNorms = computeInvNorms(storedCharacters);
    extractedNorms = computeInvNorms(extractedCharacters);
}

private static double[] computeInvNorms(List<byte[]> a) {
    final double[] result = new double[a.size()];
    for (int i = 0; i < result.length; ++i)
        result[i] = 1 / Math.sqrt(dotProduct(a.get(i), a.get(i)));
    return result;
}

private static int dotProduct(byte[] arr1, byte[] arr2) {
    int dotProduct = 0;
    for (int i = 0; i < arr1.length; i++)
        dotProduct += arr1[i] * arr2[i];
    return dotProduct;
}
Nowadays, it's hard to find a CPU with a single core (even in mobiles). As the tasks are nicely separated, you can parallelize them with only a few lines. So I'd go for it, though the gain is limited.
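As a possible few-lines version (a sketch, not the answerer's code): the outer loop over extracted characters can be spread across cores with parallelStream, reusing the question's findCorrelation and assuming it is accessible as a static helper:
import java.util.List;

// Sketch: each parallel task scans the whole stored set and keeps its own best
// score, so there is no shared mutable state to synchronize.
static double bestCorrelation(List<byte[]> extractedCharacters, List<byte[]> storedCharacters) {
    return extractedCharacters.parallelStream()
            .mapToDouble(ext -> {
                double best = -1;
                for (byte[] stored : storedCharacters) {
                    best = Math.max(best, findCorrelation(ext, stored));
                }
                return best;
            })
            .max()
            .orElse(-1);
}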
In case you really mean cross-correlation, then a transform like DFT or DCT could help. They surely do for big images, but with yours 12x16, I'm not sure.
Maybe you mean just a dot product? And maybe you should tell us?
Note that you actually don't need to compute the correlation; most of the time all you need is to find out whether it's bigger than a threshold:
corr = findCorrelation(extractedCharacter, storedCharacter)
..... more code to check if this is the best match ......
This may lead to some optimizations or not, depending on what the images look like.
Note also that a simple low level optimization can give you nearly a factor of 4 as in this question of mine. Maybe you really should tell us what you're doing?
UPDATE 1
I guess that due to the computation of three products in the loop, there's enough instruction level parallelism, so a manual loop unrolling like in my above question is not necessary.
However, I see that those three products get computed some 100 * 15600 times, while only one of them depends on both extractedCharacter and storedCharacter. So you can compute
100 + 15600 + 100 * 15600
dot products instead of
3 * 100 * 15600
This way you may get a factor of three pretty easily.
Or not. After this step there's a single sum computed in the relevant step and the problem linked above applies. And so does its solution (unrolling manually).
Factor 5.2
While byte[] is nicely compact, the computation involves extending the values to ints, which costs some time, as my benchmark shows. Converting the byte[]s to int[]s before all the correlations get computed saves time. Even better is to make use of the fact that this conversion for storedCharacters can be done beforehand.
Manual loop unrolling twice helps but unrolling more doesn't.
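Roughly what the pre-conversion could look like (a sketch; the list and method names are mine): each character bitmap is widened to int[] once, outside the hot loop, and the dot product then runs on ints directly:
import java.util.List;

// Convert each byte[] character bitmap to int[] once, so the inner dot-product
// loop works on ints instead of widening bytes on every multiplication.
private static int[][] toIntArrays(List<byte[]> chars) {
    int[][] out = new int[chars.size()][];
    for (int i = 0; i < chars.size(); i++) {
        byte[] src = chars.get(i);
        int[] dst = new int[src.length];
        for (int j = 0; j < src.length; j++) dst[j] = src[j];
        out[i] = dst;
    }
    return out;
}

private static int dotProduct(int[] a, int[] b) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
}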

How to parse a huge file line by line, serialize & deserialize a huge object efficiently?

I have a file of around 4-5 GB in size (nearly a billion lines). From every line of the file, I have to parse an array of integers plus one additional integer, and update my custom data structure. My class to hold such information looks like
class Holder {
    private int[][] arr = new int[1000000000][5]; // assuming that max array size is 5
    private int[] meta = new int[1000000000];
}
A sample line from the file looks like
(1_23_4_55) 99
Every index in the arr & meta corresponds to the line number in the file. From the above line, I extract the array of integers first and then the meta information. In that case,
--pseudo_code--
arr[line_num] = new int[]{1, 23, 4, 55}
meta[line_num]=99
Right now, I am using a BufferedReader object and its readLine method to read each line, and I use character-level operations to parse the integer array and meta information from each line and populate the Holder instance. But it takes almost half an hour to complete this entire operation.
I used both Java Serialization and Externalizable (writing the meta and arr) to serialize and deserialize this HUGE Holder instance. With both of them, the time to serialize is almost half an hour, and the time to deserialize is also almost half an hour.
I would appreciate your suggestions on dealing with this kind of problem, and would definitely love to hear about your own experiences, if any.
P.S. Main memory is not a problem. I have almost 50 GB of RAM in my machine. I have also increased the BufferedReader size to 40 KB (of course, I could increase this up to 100 MB considering that disk access runs at approx. 100 MB/sec). Even cores and CPU are not a problem.
EDIT I
The code that I am using to do this task is provided below (after anonymizing a few details):
public class BigFileParser {

    private int parsePositiveInt(final String s) {
        int num = 0;
        int sign = -1;
        final int len = s.length();
        final char ch = s.charAt(0);
        if (ch == '-')
            sign = 1;
        else
            num = '0' - ch;
        int i = 1;
        while (i < len)
            num = num * 10 + '0' - s.charAt(i++);
        return sign * num;
    }

    private void loadBigFile() {
        long startTime = System.nanoTime();
        Holder holder = new Holder();
        String line;
        try {
            Reader fReader = new FileReader("/path/to/BIG/file");
            // 40 KB buffer size (40960 bytes)
            BufferedReader bufferedReader = new BufferedReader(fReader, 40960);
            String tempTerm;
            int i, meta, ascii, len;
            boolean consumeNextInteger;
            // GNU Trove primitive int array list
            TIntArrayList arr;
            char c;
            while ((line = bufferedReader.readLine()) != null) {
                consumeNextInteger = true;
                tempTerm = "";
                arr = new TIntArrayList(5);
                for (i = 0, len = line.length(); i < len; i++) {
                    c = line.charAt(i);
                    ascii = c - 0;
                    // 95 is the ASCII value of the '_' char
                    if (consumeNextInteger && ascii == 95) {
                        arr.add(parsePositiveInt(tempTerm));
                        tempTerm = "";
                    } else if (ascii >= 48 && ascii <= 57) { // '0' - '9'
                        tempTerm += c;
                    } else if (ascii == 9) { // '\t'
                        arr.add(parsePositiveInt(tempTerm));
                        consumeNextInteger = false;
                        tempTerm = "";
                    }
                }
                meta = parsePositiveInt(tempTerm);
                holder.update(arr, meta);
            }
            bufferedReader.close();
            long endTime = System.nanoTime();
            System.out.println("#time -> " + (endTime - startTime) * 1.0
                    / 1000000000 + " seconds");
        } catch (IOException exp) {
            exp.printStackTrace();
        }
    }
}
public class Holder {
    private static final int SIZE = 500000000;
    private TIntArrayList[] arrs;
    private TIntArrayList metas;
    private int idx;

    public Holder() {
        arrs = new TIntArrayList[SIZE];
        metas = new TIntArrayList(SIZE);
        idx = 0;
    }

    public void update(TIntArrayList arr, int meta) {
        arrs[idx] = arr;
        metas.add(meta);
        idx++;
    }
}
It sounds like the time taken for file I/O is the main limiting factor, given that serialization (binary format) and your own custom format take about the same time.
Therefore, the best thing you can do is to reduce the size of the file. If your numbers are generally small, then you could get a huge boost from using Google protocol buffers, which will encode small integers generally in one or two bytes.
Or, if you know that all your numbers are in the 0-255 range, you could use a byte[] rather than int[] and cut the size (and hence load time) to a quarter of what it is now. (assuming you go back to serialization or just write to a ByteChannel)
It simply can't take that long. You're working with some 6e9 ints, which means 24 GB. Writing 24 GB to the disk takes some time, but nothing like half an hour.
I'd put all the data in a single one-dimensional array and access it via methods like int getArr(int row, int col) which transform row and col onto a single index. According to how the array gets accessed (usually row-wise or usually column-wise), this index would be computed as N * row + col or N * col + row to maximize locality. I'd also store meta in the same array.
Writing a single huge int[] into memory should be pretty fast, surely no half an hour.
Because of the data amount, the above doesn't work as you can't have a 6e9 entries array. But you can use a couple of big arrays instead and all of the above applies (compute a long index from row and col and split it into two ints for accessing the 2D-array).
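A minimal sketch of that chunked flat layout (COLS and CHUNK are illustrative choices): a single long index is split into a chunk number and an offset within the chunk:
// Flat storage for ~6e9 ints: several big int[] chunks addressed by one long index.
class FlatHolder {
    private static final int COLS = 6;         // 5 padded array slots + 1 meta slot per line (assumption)
    private static final int CHUNK = 1 << 30;  // ints per chunk, small enough for an int array index
    private final int[][] chunks;

    FlatHolder(long lines) {
        long cells = lines * COLS;
        chunks = new int[(int) ((cells + CHUNK - 1) / CHUNK)][];
        for (int i = 0; i < chunks.length; i++) {
            long remaining = cells - (long) i * CHUNK;
            chunks[i] = new int[(int) Math.min(CHUNK, remaining)];
        }
    }

    // Row-major layout keeps the 5 values and the meta of one line next to each other.
    int get(long row, int col) {
        long idx = row * COLS + col;
        return chunks[(int) (idx / CHUNK)][(int) (idx % CHUNK)];
    }

    void set(long row, int col, int value) {
        long idx = row * COLS + col;
        chunks[(int) (idx / CHUNK)][(int) (idx % CHUNK)] = value;
    }
}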
Make sure you aren't swapping. Swapping is the most probable reason for the slow speed I can think of.
There are several alternative Java file I/O libraries. This article is a little old, but it gives an overview that's still generally valid. He's reading about 300 MB per second with a six-year-old Mac. So for 4 GB you have under 15 seconds of read time. Of course my experience is that Mac I/O channels are very good. YMMV if you have a cheap PC.
Note there is no advantage above a buffer size of 4K or so. In fact you're more likely to cause thrashing with a big buffer, so don't do that.
The implication is that parsing characters into the data you need is the bottleneck.
I have found in other apps that reading into a block of bytes and writing C-like code to extract what I need goes faster than the built-in Java mechanisms like split and regular expressions.
If that still isn't fast enough, you'd have to fall back to a native C extension.
If you randomly pause it you will probably see that the bulk of the time goes into parsing the integers, and/or all the new-ing, as in new int[]{1, 23, 4, 55}. You should be able to just allocate the memory once and stick numbers into it at better than I/O speed if you code it carefully.
But there's another way - why is the file in ASCII?
If it were in binary, you could just slurp it up.
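For instance, if the producer wrote each line as six 32-bit ints (five padded array slots plus the meta value, a layout assumed purely for illustration), reading it back needs no text parsing at all:
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

// Sketch: read a binary file where every record is 6 ints
// (5 padded array values + 1 meta value) into pre-allocated arrays.
static void slurpBinary(String path, int[][] arr, int[] meta) throws IOException {
    try (DataInputStream in = new DataInputStream(
            new BufferedInputStream(new FileInputStream(path), 1 << 16))) {
        for (int line = 0; line < meta.length; line++) {
            for (int j = 0; j < 5; j++) {
                arr[line][j] = in.readInt();
            }
            meta[line] = in.readInt();
        }
    }
}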

Java Optimizing arithmetic and Assignment Operators for large input

I have a piece of code that must run extremely fast in terms of wall-clock time. The algorithm is already O(N). It takes 2 s and it needs to take 1 s. For most inputs with A.length ~ 100,000 it takes 0.3 s, unless a particular line of code is invoked an extreme number of times. (This is for an esoteric programming challenge.)
It uses a calculation of the arithmetic series 1, 2, ..., N -> 1, 3, 6, 10, 15, ...
that can be represented by n*(n+1)/2.
I loop through this equation hundreds of thousands of times.
I do not have access to the input, nor can I display it. The only information I am able to get returned is the time it took to run.
particularly the equation is:
s+=(n+c)-((n*(n+1))/2);
s and c can have values range from 0 to 1Billion
n can range 0 to 100,000
What is the most efficient way to write this statement in terms of execution speed?
I have heard division takes more time than multiplication, but beyond that I could not determine whether writing this in one line or in multiple assignment lines was more efficient.
Dividing and multiplying versus multiplying and then dividing?
Also would creating custom integers types significantly help?
Edit as per request, full code with small input case (sorry if it's ugly, I've just kept stripping it down):
public static void main(String[] args) {
    int A[] = {3, 4, 8, 5, 1, 4, 6, 8, 7, 2, 2, 4}; // output 44
    int K = 6;
    // long start = System.currentTimeMillis();
    // for (int i = 0; i < 100000; i++) {
    System.out.println(mezmeriz4r(A, K));
    // }
    // long end = System.currentTimeMillis();
    // System.out.println((end - start) + " ms");
}

public static int mezmeriz4r(int[] A, int K) {
    int s = 0;
    int ml = s;
    int mxl = s;
    int sz = 1;
    int t = s;
    int c = sz;
    int lol = 50000;
    int end = A.length;
    for (int i = sz; i < end; i++) {
        if (A[i] > A[mxl]) {
            mxl = i;
        } else if (A[i] < A[ml]) {
            ml = i;
        }
        if (Math.abs(A[ml] - A[mxl]) <= K) {
            sz++;
            if (sz >= lol) return 1000000000;
            if (sz > 1) {
                c += sz;
            }
        } else {
            if (A[ml] != A[i]) {
                t = i - ml;
                s += (t + c) - ((t * (t + 1)) / (short) 2);
                i = ml;
                ml++;
                mxl = ml;
            } else {
                t = i - mxl;
                s += (t + c) - ((t * (t + 1)) / (short) 2);
                i = mxl;
                mxl++;
                ml = mxl;
            }
            c = 1;
            sz = 0;
        }
    }
    if (s > 1000000000) return 1000000000;
    return s + c;
}
Returned from Challenge:
Detected time complexity: O(N)

test                description                                 time       result
example             example test                                0.290 s.   OK
single              single element                              0.290 s.   OK
double              two elements                                0.290 s.   OK
small_functional    small functional tests                      0.280 s.   OK
small_random        small random sequences, length = ~100       0.300 s.   OK
small_random2       small random sequences, length = ~100       0.300 s.   OK
medium_random       chaotic medium sequences, length = ~3,000   0.290 s.   OK
large_range         large range test, length = ~100,000         2.200 s.   TIMEOUT ERROR
                    (running time: >2.20 sec., time limit: 1.02 sec.)
large_random        random large sequences, length = ~100,000   0.310 s.   OK
large_answer        test with large answer                      0.320 s.   OK
large_extreme       all maximal value = ~100,000                0.340 s.   OK
With a little algebra, you can simplify the expression (n+c)-((n*(n+1))/2) to c-((n*(n-1))/2), removing an addition operation. Then you can replace the division by 2 with a right shift by one bit, which is faster than division. Try replacing
s+=(n+c)-((n*(n+1))/2);
with
s+=c-((n*(n-1))>>1);
I don't have access to validate all inputs and the full time range, but this one runs in O(N) for sure and has improved. Run it and let me know your feedback; I will provide details if necessary.
public static int solution(int[] A, int K) {
    int minIndex = 0;
    int maxIndex = 0;
    int end = A.length;
    int slize = end;
    int startIndex = 0;
    int diff = 0;
    int minMaxIndexDiff = 0;
    for (int currIndex = 1; currIndex < end; currIndex++) {
        if (A[currIndex] > A[maxIndex]) {
            maxIndex = currIndex;
        } else if (A[currIndex] < A[minIndex]) {
            minIndex = currIndex;
        }
        if ((A[maxIndex] - A[minIndex]) > K) {
            minMaxIndexDiff = currIndex - startIndex;
            if (minMaxIndexDiff > 1) {
                slize += ((minMaxIndexDiff * (minMaxIndexDiff - 1)) >> 1);
                if (diff > 0) {
                    slize = slize + (diff * minMaxIndexDiff);
                }
            }
            if (minIndex == currIndex) {
                diff = currIndex - (maxIndex + 1);
            } else {
                diff = currIndex - (minIndex + 1);
            }
            if (slize > 1000000000) {
                return 1000000000;
            }
            minIndex = currIndex;
            maxIndex = currIndex;
            startIndex = currIndex;
        }
    }
    if ((startIndex + 1) == end) {
        return slize;
    }
    if (slize > 1000000000) {
        return 1000000000;
    }
    minMaxIndexDiff = end - startIndex;
    if (minMaxIndexDiff > 1) {
        slize += ((minMaxIndexDiff * (minMaxIndexDiff - 1)) >> 1);
        if (diff > 0) {
            slize = slize + (diff * minMaxIndexDiff);
        }
    }
    return slize;
}
Get rid of the System.out.println() in the for loop :) you will be amazed how much faster your calculation will be
Nested assignments, i.e. instead of
t=i-ml;
s+=(t+c)-((t*(t+1))/(short)2);
i=ml;
ml++;
mxl=ml;
something like
s+=((t=i-ml)+c);
s-=((t*(t+1))/(short)2);
i=ml;
mxl=++ml;
sometimes occurs in OpenJDK sources. It mainly results in replacing *load bytecode instructions with *dups. According to my experiments, it really gives only a very small speedup, and it is ultra hardcore; I don't recommend writing such code manually.
I would try the following and profile the code after each change to check if there is any gain in speed.
replace:
if(Math.abs(A[ml]-A[mxl])<=K)
by
int diff = A[ml]-A[mxl];
if(diff<=K && diff>=-K)
replace
/2
by
>>1
replace
ml++;
mxl=ml;
by
mxl=++ml;
Maybe avoid repeated array access to the same element (Java's internal bounds checks may take some time).
So store at least A[i] in a local variable.
I would create a C version first and see how fast it can go with "direct access to the metal". Chances are, you are trying to optimize a calculation which is already optimized to the limit.
I would try to eliminate the line if(Math.abs(A[ml]-A[mxl])<=K)
with a faster, self-calculated abs version which is inlined, not a method call!
The cast to (short) does not help,
but try the right-shift operator x >> 1 instead of x / 2.
Removing the System.out.println() can speed things up by a factor of 1000.
But be careful: otherwise your whole algorithm can be removed by the VM because you don't use its result.
Old code:
for (int i = 0; i < 100000; i++) {
    System.out.println(mezmeriz4r(A, K));
}
New code:
int dummy = 0;
for (int i = 0; i < 100000; i++) {
    dummy = mezmeriz4r(A, K);
}
// Use dummy, otherwise optimisation can remove mezmeriz4r
System.out.print("finished: " + dummy);

How To Mark a String in a File?

I have a text file. It is designed as following:
#1{1,12,345,867}
#2{123, 3243534, 2132131231}
#3{234, 35345}
#4{}
...
(at the end of an each entry stands "\n")
That is an example. In fact my strings #number{number,number,...,number} could be really long...
Here is a template of a constructor of a class which works with this file:
public Submatrix(String matrixFilePath, int startPos, int endPos) throws FileNotFoundException{
}
As you can see, the submatrix is determined by the startPos and endPos numbers of the lines (strings) of the matrix.
My question is: "How could I count the lines to reach the right one?"
My file can contain billions of lines. Should I use LineNumberReader->readLine() billions of times?
I would be tempted to read each line sequentially until I reached the desired line. However, since the lines are numbered in the file and delimited with newlines, you can treat the file as random access and employ various strategies. For example, you can use a variant of binary search to quickly find the starting line. You can estimate the average line length from the first N lines and then try to make a more accurate guess as to the starting location, and so on.
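One way the estimation idea could look (purely illustrative: sample the first lines for an average length, jump a bit short of the guessed offset, then scan the #number{...} prefixes until the exact entry is reached):
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch: guess the byte offset of line `target` (1-based) from the average length
// of the first `sample` lines, then resynchronize by parsing the "#<n>{" prefixes.
static String findLineByEstimate(RandomAccessFile raf, int target, int sample) throws IOException {
    raf.seek(0);
    long bytes = 0;
    int counted = 0;
    for (int i = 0; i < sample; i++) {
        if (raf.readLine() == null) break;
        counted++;
        bytes = raf.getFilePointer();
    }
    long avg = Math.max(1, bytes / Math.max(1, counted));
    long guess = Math.max(0, (long) (avg * (target - 1) * 0.9)); // undershoot on purpose
    raf.seek(guess);
    if (guess > 0) raf.readLine();                               // discard a probably partial line
    String line;
    while ((line = raf.readLine()) != null) {
        int n = Integer.parseInt(line.substring(1, line.indexOf('{')));
        if (n == target) return line;
        if (n > target) break;   // overshot: a real version would back up and rescan
    }
    return null;
}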
I think the answer would be yes, you read billions of lines using readLine, unless you think it's worth the trouble using either
the strategy outlined by GregS, that is, estimating the line length and using that to start reading somewhere near the correct line, or
you use a separate index, either at the start of the file or in a separate file, which is very predictable and is something like
0000001 000000001024
0000002 000000001064
0000003 000000002010
That is, line number and starting position of that line in bytes in a strictly defined fashion which makes it possible to determine the position of the index by something like:
I want to read line 3, so I find the position of line 3 by going to position (3-1) * 20,
and read 0000003 000000002010, parse that and know that line 3 is at byte position 2010, seek that position and start reading.
Calculating or maintaining the index might not be easy if it's in the main data file, as it would mean that you precalculate positions before you actually write the file. I think I would use a separate index file and either calculate indices during writing, or have a separate utility to create an index file from a given data file.
EDIT Added example code to demonstrate my proposal
I have made a smallish Python script which reads a data file and creates an index file. The index file contains the position of a line in the data file and is designed to be easily searchable.
This example script formats index entries with %06d, which is good enough for data files of up to 999,999 lines; for your case it might have to be adjusted (don't forget INDEX_LENGTH). It creates an index file, and then uses that index file to read a given line out of the data file (for demonstration purposes; you would use Java for that part).
The script is called like:
python create_index.py data.txt data.idx 3
my example data file is:
#1{1,12,345,867}
#2{123, 3243534, 2132131231}
#3{234, 35345}
#4{}
and the script itself is:
import sys

# Usage: python this_script.py datafile indexfile lineno
# indexfile will be overwritten
# lineno is the data line which will be printed using the
# index file, as a demonstration
datafilename = sys.argv[1]
indexfilename = sys.argv[2]
lineno = int(sys.argv[3])

# max 999999 lines in this format
format = "%06d\n"
INDEX_LENGTH = 6 + 1  # +1 for newline

def create_indexfile():
    indexfile = open(indexfilename, "wb")
    # Print index of first line
    indexfile.write(format % 0)
    f = open(datafilename, "rb")
    line = f.readline()
    while len(line) > 0:
        indexfile.write(format % f.tell())
        line = f.readline()
    f.close()
    indexfile.close()

# Retrieve the data of 1 line in the data file
# using the index file
def get_line():
    linepos = INDEX_LENGTH * (lineno - 1)
    indexfile = open(indexfilename, "rb")
    indexfile.seek(linepos)
    datapos = int(indexfile.readline())
    indexfile.close()
    datafile = open(datafilename, "rb")
    datafile.seek(datapos)
    print datafile.readline()
    datafile.close()

if __name__ == '__main__':
    create_indexfile()
    get_line()
The index file needs to be rebuilt after a change in the data file. You can verify that you read the right data by comparing the line number from the data read (#3{...}) with the input line number, so it's fairly safe.
Whether you choose to use it or not, I think the example is pretty clear and easy.
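And the Java read side of that scheme might look roughly like this (assuming the 6-digit-plus-newline entries produced by the script above; adjust INDEX_LENGTH if the format changes):
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch: use the fixed-width index file produced above to seek straight
// to line `lineNo` (1-based) of the data file.
static String readLineViaIndex(String dataPath, String indexPath, int lineNo) throws IOException {
    final int INDEX_LENGTH = 6 + 1;   // 6 digits + newline, as in the Python script
    try (RandomAccessFile index = new RandomAccessFile(indexPath, "r");
         RandomAccessFile data = new RandomAccessFile(dataPath, "r")) {
        index.seek((long) (lineNo - 1) * INDEX_LENGTH);
        long dataPos = Long.parseLong(index.readLine().trim());
        data.seek(dataPos);
        return data.readLine();
    }
}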
@extraneon
This is the class I want to use to represent a string #number{number, number,...}
package logic;
public class DenominatedBinaryRow{
private int sn;
private BinaryRow row;
public DenominatedBinaryRow(int sn, BinaryRow row){
this.sn = sn;
this.row = row;
}
public DenominatedBinaryRow plus(int sn, DenominatedBinaryRow addend){
return new DenominatedBinaryRow(sn, this.row.plus(addend.row));
}
public int getSn(){
return this.sn;
}
public BinaryRow getRow(){
return this.row;
}
public boolean equals(Object obj){
DenominatedBinaryRow res = (DenominatedBinaryRow) obj;
if (this.getSn() == res.getSn() && this.getRow().equals(res.getRow())){
return true;
}
return false;
}
}
Maybe it would be efficient to serialize it, instead of converting the BinaryRow (its implementation goes below) to a string?
If I serialize many instances of it to a file, how will I deserialize the necessary string (the necessary instance) back? (I hope I understood your question correctly.)
package logic;

import java.util.*;

public class BinaryRow {
    private List<Integer> row;

    public BinaryRow() {
        this.row = new ArrayList<Integer>();
    }

    public List<Integer> getRow() {
        return this.row;
    }

    public void add(Integer arg) {
        this.getRow().add(arg);
    }

    public Integer get(int index) {
        return this.getRow().get(index);
    }

    public int size() {
        return this.getRow().size();
    }

    public BinaryRow plus(BinaryRow addend) {
        BinaryRow result = new BinaryRow();
        // suppose rows are already sorted (ascending order)
        int i = this.size();
        int j = addend.size();
        while (i > 0 && j > 0)
            if (this.get(this.size() - i) < addend.get(addend.size() - j)) {
                result.add(this.get(this.size() - i));
                i--;
            } else if (this.get(this.size() - i) > addend.get(addend.size() - j)) {
                result.add(addend.get(addend.size() - j));
                j--;
            } else {
                result.add(this.get(this.size() - i));
                i--;
                j--;
            }
        if (i > 0) {
            for (int k = this.size() - i; k < this.size(); k++)
                result.add(this.get(k));
        }
        if (j > 0) {
            for (int k = addend.size() - j; k < addend.size(); k++)
                result.add(addend.get(k));
        }
        return result;
    }

    public boolean equals(Object obj) {
        BinaryRow binRow = (BinaryRow) obj;
        if (this.size() == binRow.size()) {
            for (int i = 0; i < this.size(); i++) {
                // use equals(): != on Integer compares references, not values
                if (!this.getRow().get(i).equals(binRow.getRow().get(i))) return false;
            }
            return true;
        }
        return false;
    }

    public long convertToDec() {
        long result = 0;
        for (Integer next : this.getRow()) {
            result += Math.pow(2, next);
        }
        return result;
    }
}
I am afraid that to get to the x-th line, you will have to call readLine() x times.
This means reading all the data until you reach that line. Every character could be a line end, so there is no way of getting to the x-th line without reading every character before it.
