Maximum number of items in a J2ME List - java

I'm working on a J2ME project that involves getting a list of users from an online database. I then intend to populate a List with the users' names, and the number can be very large. My question is: are there limits to the number of items you can append to a List?
HttpConnection hc = (HttpConnection) Connector.open("http://www.xxxxxxxxxxxx.com/......?xx=xx");
String reply = "";
InputStream is = hc.openInputStream();
int ch;
// Check the Content-Length first
long len = hc.getLength();
if (len != -1) {
    for (int i = 0; i < len; i++) {
        if ((ch = is.read()) != -1) {
            reply += (char) ch;
        }
    }
} else {
    // If the content length is not available
    while ((ch = is.read()) != -1) {
        reply += (char) ch;
    }
}
is.close();
hc.close();
DataParser parser = new DataParser(reply); // This is a custom class I created to process the XML data returned from the server, split it into groups and put it in an array.
List userList = new List("Users", List.IMPLICIT);
if (parser.moveToNext()) {
    do {
        userList.append(parser.get(), null);
    } while (parser.moveToNext());
}
This code seems to be working fine, but my problem is: if I keep calling userList.append("", null), will it get to a point where some exception is thrown, say in the case of 50,000 names (list items)?

There is no limit to the number of items in a List. You can also use StringItems appended to a Form and then add item commands to them... I hope this helps.
J2ME tutorial at http://www.tutorialmasterng.blogspot.com
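For reference, a minimal sketch of that Form/StringItem idea (assuming MIDP 2.0; display and names are placeholders for the MIDlet's Display and your parsed user names):
import javax.microedition.lcdui.*;

// Sketch only: 'display' is the MIDlet's Display, 'names' the parsed user names.
void showUsersAsForm(Display display, String[] names) {
    Form form = new Form("Users");
    Command select = new Command("Select", Command.ITEM, 1);
    ItemCommandListener listener = new ItemCommandListener() {
        public void commandAction(Command c, Item item) {
            // react to the chosen user here, e.g. ((StringItem) item).getText()
        }
    };
    for (int i = 0; i < names.length; i++) {
        StringItem item = new StringItem(null, names[i]);
        item.addCommand(select);
        item.setItemCommandListener(listener);
        form.append(item);
    }
    display.setCurrent(form);
}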

Some implementations may have a limit. Older Sony Ericsson phones have a limit of 256 items in a List. Anyway, as Meier pointed out, lists with really many items can be slow or difficult to use. And 50k strings may easily cause an OOM on low-heap devices (1-2 MB).
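If you want to play it safe on such devices, one option is to append the names a page at a time and stop when the implementation refuses more items. A rough sketch (the page size is arbitrary, and the catch is deliberately broad because the exception type is implementation-specific):
// Sketch: append at most one "page" of names and stop if the device refuses more.
int maxVisible = 200; // arbitrary page size
List userList = new List("Users", List.IMPLICIT);
int shown = 0;
while (shown < maxVisible && parser.moveToNext()) {
    try {
        userList.append(parser.get(), null);
        shown++;
    } catch (RuntimeException e) {
        // some MIDP implementations (or a tiny heap) may give up here;
        // stop appending and offer a "Show more" command instead
        break;
    }
}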

Related

Java program slows down abruptly when indexing corpus for k-grams

I have a problem which is puzzling me. I'm indexing a corpus (17 000 files) of text files, and while doing this, I'm also storing all the k-grams (k-long parts of words) for each word in a HashMap to be used later:
public void insert( String token ) {
//For example, car should result in "^c", "ca", "ar" and "r$" for a 2-gram index
// Check if token has already been seen. if it has, all the
// k-grams for it have already been added.
if (term2id.get(token) != null) {
return;
}
id2term.put(++lastTermID, token);
term2id.put(token, lastTermID);
// is word long enough? for example, "a" can be bigrammed and trigrammed but not four-grammed.
// K must be <= token.length + 2. "ab". K must be <= 4
List<KGramPostingsEntry> postings = null;
if(K > token.length() + 2) {
return;
}else if(K == token.length() + 2) {
// insert the one K-gram "^<String token>$" into index
String kgram = "^"+token+"$";
postings = index.get(kgram);
SortedSet<String> kgrams = new TreeSet<String>();
kgrams.add(kgram);
term2KGrams.put(token, kgrams);
if (postings == null) {
KGramPostingsEntry newEntry = new KGramPostingsEntry(lastTermID);
ArrayList<KGramPostingsEntry> newList = new ArrayList<KGramPostingsEntry>();
newList.add(newEntry);
index.put("^"+token+"$", newList);
}
// No need to do anything if the posting already exists, so no else clause. There is only one possible term in this case
// Return since we are done
return;
}else {
// We get here if there is more than one k-gram in our term
// insert all k-grams in token into index
int start = 0;
int end = start+K;
//add ^ and $ to token.
String wrappedToken = "^"+token+"$";
int noOfKGrams = wrappedToken.length() - end + 1;
// get K-Grams
String kGram;
int startCurr, endCurr;
SortedSet<String> kgrams = new TreeSet<String>();
for (int i=0; i<noOfKGrams; i++) {
startCurr = start + i;
endCurr = end + i;
kGram = wrappedToken.substring(startCurr, endCurr);
kgrams.add(kGram);
postings = index.get(kGram);
KGramPostingsEntry newEntry = new KGramPostingsEntry(lastTermID);
// if this k-gram has been seen before
if (postings != null) {
// Add this token to the existing postingsList.
// We can be sure that the list doesn't contain the token
// already, else we would previously have terminated the
// execution of this function.
int lastTermInPostings = postings.get(postings.size()-1).tokenID;
if (lastTermID == lastTermInPostings) {
continue;
}
postings.add(newEntry);
index.put(kGram, postings);
}
// if this k-gram has not been seen before
else {
ArrayList<KGramPostingsEntry> newList = new ArrayList<KGramPostingsEntry>();
newList.add(newEntry);
index.put(kGram, newList);
}
}
Clock c = Clock.systemDefaultZone();
long timestart = c.millis();
System.out.println(token);
term2KGrams.put(token, kgrams);
long timestop = c.millis();
System.out.printf("time taken to put: %d\n", timestop-timestart);
System.out.print("put ");
System.out.println(kgrams);
System.out.println();
}
}
The insertion into the HashMap happens on the lines term2KGrams.put(token, kgrams); (there are two of them in the code snippet). When indexing, everything works fine until, at around 15 000 indexed files, things suddenly go bad. Everything slows down immensely, and the program doesn't finish in a reasonable time, if at all.
To try to understand this problem, I've added some prints at the end of the function. This is the output they generate:
http://soccer.org
time taken to put: 0
put [.or, //s, /so, ://, ^ht, cce, cer, er., htt, occ, org, p:/, r.o, rg$, soc, tp:, ttp]
aysos
time taken to put: 0
put [^ay, ays, os$, sos, yso]
http://www.davisayso.org/contacts.htm
time taken to put: 0
put [.da, .ht, .or, //w, /co, /ww, ://, ^ht, act, avi, ays, con, cts, dav, g/c, htm, htt, isa, nta, o.o, ont, org, p:/, rg/, s.h, say, so., tac, tm$, tp:, ts., ttp, vis, w.d, ww., www, yso]
playsoccer
time taken to put: 0
put [^pl, ays, cce, cer, er$, lay, occ, pla, soc, yso]
This looks fine to me: the put doesn't seem to be taking a long time, and the k-grams (in this case trigrams) are correct.
But one can see strange behaviour in the pace at which my computer prints this information. In the beginning, everything prints at very high speed. But at 15 000 files that speed stops, and instead my computer starts printing a few lines at a time, which of course means that indexing the remaining 2 000 files of the corpus will take an eternity.
Another interesting thing I observed was when doing a keyboard interrupt (ctrl+c) after it had been printing erratically and slowly as described for a while. It gave me this message:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.base/java.lang.StringLatin1.newString(StringLatin1.java:549)
sahandzarrinkoub#Sahands-MBP:~/Documents/Programming/Information Retrieval/lab3 2$ sh compile_all.sh
Note: ir/PersistentHashedIndex.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Does this mean I'm out of memory? Is that the issue? If so, that's surprising, because I've been storing quite a lot of things in memory before, such as a HashMap containing the document ID's of every single word in the corpus, a HashMap containing every single word where every single k-gram appears, etc.
Please let me know what you think and what I can do to fix this problem.
To understand this, you must first understand that Java does not allocate memory dynamically (or, at least, not indefinitely). The JVM is by default configured to start with a minimum heap size and a maximum heap size. When the maximum heap size would be exceeded by some allocation, you get an OutOfMemoryError.
You can change the minimum and maximum heap size for your execution with the VM parameters -Xms and -Xmx respectively. An example of an execution with at least 2 GB but at most 4 GB of heap would be
java -Xms2g -Xmx4g ...
You can find more options on the man page for java.
Before changing the heap memory, however, take a close look at your system resources, especially whether your system starts swapping. If your system swaps, a larger heap size may let the program run longer, but with equally bad performance. The only thing possible then would be to optimize your program in order to use less memory or to upgrade the RAM of your machine.
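If you want to confirm at runtime which limits are actually in effect, the standard Runtime API reports the heap bounds; a small diagnostic sketch:
// Print the heap limit (-Xmx), the currently committed heap and the used portion.
Runtime rt = Runtime.getRuntime();
long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
System.out.println("max heap: " + rt.maxMemory() / (1024 * 1024) + " MB"
        + ", committed: " + rt.totalMemory() / (1024 * 1024) + " MB"
        + ", used: " + usedMb + " MB");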

Java data structure for providing random <String><Float> pair based on a large data set at run-time

Is there a smart way to create a 'JSON-like' structure of String-Float pairs? A 'key' is not needed, as data will be grabbed randomly, although an incremented key from 0 to n might aid random retrieval of the associated data. Due to the size of the data set (10k pairs of values), I need this to be saved out to an external file.
The reason is how my data will be compiled. To save someone entering data into an array manually, the items will be Excel-based, saved out to CSV and parsed using a temporary Java program into a file format (for example JSON) which can be added to my project's resources folder. I can then retrieve data from this set without my application having to load a huge array into memory on application creation. I can quite easily parse the CSV to 'fill up' an array (or similar) at run-time, but I fear that on a mobile device the memory overhead will be significant.
I have reviewed the answers to: Suitable Java data structure for parsing large data file and Data structure options for efficiently storing sets of integer pairs on disk? and have not been able to draw a definitive conclusion.
I have tried saving to a .JSON file; however, I'm not sure if I can request a random entry, plus this seems quite cumbersome for holding a simple structure. Is a TreeMap or Hashtable where I should be focusing my search?
To provide some context to my query: my application will be running on Android, and needs to reference a definition (an approx. 500-character String) and a conversion factor (a Float). I need to retrieve a random data entry. The user may only make 2 or 3 requests during a session, so I see no point in loading a 10k-element array into memory. QUERY: perhaps modern technology on Android phones will easily munch through this type of query, and it's only an issue if I am parsing millions of entries at run-time?
I am open to using SQLite to hold my data if this will provide the functionality required. Please note that the data set must be derived from an easily exportable file format from Excel (CSV, TXT etc.).
Any advice you can give me would be much appreciated.
Here's one possible design that requires a minimal memory footprint while providing fast access:
Start with a data file of comma-separated or tab-separated values so you have line breaks between your data pairs.
Keep an array of long values corresponding to the indexes of the lines in the data file. When you know where the lines are, you can use InputStream.skip() to advance to the desired line. This leverages the fact that skip() is typically quite a bit faster than read for InputStreams.
You would have some setup code that would run at initialization time to index the lines.
An enhancement would be to only index every nth line so that the array is smaller. So if n is 100 and you're accessing line 1003, you take the 10th index to skip to line 1000, then read past two more lines to get to line 1003. This allows you to tune the size of the array to use less memory.
I thought this was an interesting problem, so I put together some code to test my idea. It uses a sample 4MB CSV file that I downloaded from some big data website that has about 36K lines of data. Most of the lines are longer than 100 chars.
Here's code snippet for the setup phase:
long start = SystemClock.elapsedRealtime();
int lineCount = 0;
try (InputStream in = getResources().openRawResource(R.raw.fl_insurance_sample)) {
    int index = 0;
    int charCount = 0;
    int cIn;
    while ((cIn = in.read()) != -1) {
        charCount++;
        char ch = (char) cIn; // this was for debugging
        if (ch == '\n' || ch == '\r') {
            lineCount++;
            if (lineCount % MULTIPLE == 0) {
                index = lineCount / MULTIPLE;
                if (index == mLines.length) {
                    mLines = Arrays.copyOf(mLines, mLines.length + 100);
                }
                mLines[index] = charCount;
            }
        }
    }
    mLines = Arrays.copyOf(mLines, index + 1);
} catch (IOException e) {
    Log.e(TAG, "error reading raw resource", e);
}
long elapsed = SystemClock.elapsedRealtime() - start;
I discovered my data file was actually separated by carriage returns rather than line feeds. It must have been created on an Apple computer. Hence the test for '\r' as well as '\n'.
Here's a snippet from the code to access the line:
long start = SystemClock.elapsedRealtime();
int ch;
int line = Integer.parseInt(editText.getText().toString().trim());
if (line < 1 || line >= mLines.length) {
    mTextView.setText("invalid line: " + line + 1);
}
line--;
int index = (line / MULTIPLE);
in.skip(mLines[index]);
int rem = line % MULTIPLE;
while (rem > 0) {
    ch = in.read();
    if (ch == -1) {
        return; // readLine will fail
    } else if (ch == '\n' || ch == '\r') {
        rem--;
    }
}
BufferedReader reader = new BufferedReader(new InputStreamReader(in));
String text = reader.readLine();
long elapsed = SystemClock.elapsedRealtime() - start;
My test program used an EditText so that I could input the line number.
So to give you some idea of performance, the first phase averaged around 1600ms to read through the entire file. I used a MULTIPLE value of 10. Accessing the last record in the file averaged about 30ms.
To get down to 30ms access with only a 29312-byte memory footprint is pretty good, I think.
You can see the sample project on GitHub.

Assigning values to zip codes

What I am trying to do is assign a number value to a group of zip codes and then use that number to do a calculation later on in the program. What I have now is
if (zip == 93726 || zip == 93725 || zip == 92144) {
    return 100;
}
else if (zip == 94550 || zip == 34599 || zip == 77375) {
    return 150;
}
and then I take the variable and use it in a calculation. Assigning the number to the variable and the calculation all work, but what I have run into is that apparently Android only allows you to have so many lines of code, and I have run out of lines just using if/else statements. My question is: what would be a better way to go about this?
I am not trying to assign a city to each zip; I have seen from other posters that there are services that do that.
a. You can either use a
switch (zip)
{
case 93726: case 93725: case 92144: return 100;
case 94550: case 34599: case 77375: return 150;
}
-
b. or create a HashMap<Integer,Integer> and fill it with 'zip to value' entries, which should give you much better performance if you have that many cases.
Map<Integer,Integer> m = new HashMap<>();
m.put(93726,100);
later you could call
return m.get(zip);
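One caveat: m.get(zip) returns null for a zip that is not in the map, which turns into a NullPointerException when the result is unboxed to int. A small sketch of a safer lookup (the default of 0 is arbitrary):
Integer value = m.get(zip);
return value != null ? value : 0; // or m.getOrDefault(zip, 0) on Java 8+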
-
c. If your zip count is in the tens of thousands and you want to work all in memory, then you should consider just holding a hundred-thousand-element int array:
int[] arr=new int[100000];
and filling it with the data:
arr[93726]=100;
.
.
.
You should probably use String constants for your ZIP codes since
in some places, some start with 0
in some places, they may contain letters as well as numbers
If you are using an Object (either String or Integer) for your constants, I have often used Collection.contains(zip) as a lazy shortcut for the condition in each if statement. That collection contains all the constants for that condition, and you would probably use an implementation geared towards fast lookups, such as HashSet. Keep in mind that if you use a HashMap solution as suggested elsewhere, your keys will be Integer objects too, so you will be hashing the keys in either case; with the Set approach you just don't store the result values in the collection.
I suspect that for a large collection of constants, hashing may turn out to be faster than having to work through the large number of == conditions in the if statement until you get to the right condition. (It may help a bit if the most-used constants come first, and in the first if statement...)
On a completely different note (i.e. strategy instead of code), you should see if you can group your ZIPs. What I mean is, for example, that if you know that all (or most) ZIPs of the forms "923xx" and "924xx" result in a return of 250, you could potentially shorten your conditionals considerably. E.g. zip.startsWith("923") || zip.startsWith("924") for String ZIPs, or (zip / 100) == 923 || (zip / 100) == 924 for int.
A small number of more specific exceptions to the groups can still be handled, you just need to have the more specific conditionals before the more general conditionals.
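A sketch of how the contains lookup and the prefix grouping might fit together; the groups and the default are purely illustrative:
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative groups only; the specific sets are checked before the general prefix rules.
private static final Set<Integer> GROUP_100 = new HashSet<>(Arrays.asList(93726, 93725, 92144));
private static final Set<Integer> GROUP_150 = new HashSet<>(Arrays.asList(94550, 34599, 77375));

static int valueForZip(int zip) {
    if (GROUP_100.contains(zip)) return 100;
    if (GROUP_150.contains(zip)) return 150;
    if (zip / 100 == 923 || zip / 100 == 924) return 250; // whole "923xx"/"924xx" blocks
    return 0; // hypothetical default
}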
Use declarative data. Especially as the zip codes might get updated, extended, corrected:
Using for instance a zip.properties
points = 100, 150
zip100 = 93726, 93725, 92144
zip150 = 94550, 34599, 77375,\
88123, 12324, 23424
And
Map<Integer, Integer> zipToPoints = new HashMap<>();
If you have ZIP codes with leading zeroes, it may be better to use String, or take care to parse them with base 10 (an integer literal with a leading zero is base 8, octal).
Whether such a limitation really exists I do not know, but the small extra coding effort is worth having everything as data.
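A sketch of loading such a zip.properties into the map, assuming the layout shown above, where each key's numeric suffix (zip100, zip150, ...) is the point value:
import java.io.FileInputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

static Map<Integer, Integer> loadZipToPoints(String path) throws IOException {
    Properties props = new Properties();
    try (FileInputStream in = new FileInputStream(path)) {
        props.load(in);
    }
    Map<Integer, Integer> zipToPoints = new HashMap<>();
    for (String key : props.stringPropertyNames()) {
        if (!key.startsWith("zip")) continue;            // skip e.g. the "points" entry
        int points = Integer.parseInt(key.substring(3)); // "zip150" -> 150
        for (String zip : props.getProperty(key).split(",")) {
            zipToPoints.put(Integer.parseInt(zip.trim(), 10), points); // base 10 on purpose
        }
    }
    return zipToPoints;
}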
Have a map of the zip codes and just return the value.
Map<Integer,Integer> zipValues = new HashMap<Integer,Integer>();
zipValues.put(93726,100);
.
.
And so on. Or you can read from a properties file and populate the map.
Then, instead of using the ifs,
return zipValues.get(zipCode);
So if zipCode = 93726, it will return 100.
Cheers.
You could create a Map which maps each zip code to an integer. For example:
Map<Integer, Integer> zipCodeMap = new HashMap<>();
zipCodeMap.put(93726, 100);
And later you can retrieve values from there.
Integer value = zipCodeMap.get(93726);
Now you still have to map each zipcode to a value. I would not do that in Java code but rather use a database or read from a text file (csv for example). This depends mostly on your requirements.
Example csv file:
93726, 100
93727, 100
93726, 150
Reading from a csv file:
InputStream is = getClass().getResourceAsStream("/data.csv");
BufferedReader reader = new BufferedReader(new InputStreamReader(is));
try {
    String line;
    while ((line = reader.readLine()) != null) {
        String[] row = line.split(",");
        int zip = Integer.parseInt(row[0].trim());
        int value = Integer.parseInt(row[1].trim());
        zipCodeMap.put(zip, value);
    }
}
catch (IOException ex) {
    // handle exceptions here
}
finally {
    try {
        is.close();
    }
    catch (IOException e) {
        // handle exceptions here
    }
}

java.lang.OutOfMemoryError: Java heap space?

I am writing list objects into a CSV file using a StringBuffer object. When the list contains little data our logic works perfectly, but when there is a large amount of data in the list there's a problem and I get the error: java.lang.OutOfMemoryError: Java heap space
The code snippet is as follows:
StringBuffer report = new StringBuffer();
String[] column = null;
StringReader stream = null;
for (MassDetailReportDto dto : newList.values()) {
    int i = 0;
    column = new String[REPORT_INDEX];
    column[i++] = dto.getCommodityCode() == null ? " " : dto.getCommodityCode();
    column[i++] = dto.getOaId() == null ? " " : dto.getOaId();
    // like this we are calling some other getter methods
    // After all getter methods we are appending columns to stringBuffer object
    report.append(StringUtils.join(column, PIPE));
    report.append(NEW_LINE);
    // now stringbuffer object we are writing to file
    stream = new StringReader(report.toString());
    int count;
    char buffer[] = new char[4096];
    while ((count = stream.read(buffer)) > -1) {
        // writing into file
        writer.write(buffer, 0, count);
    }
    writer.flush();
    // clearing the buffer
    report.delete(0, report.length());
}
Error is :
java.lang.OutOfMemoryError: Java heap space
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:99)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:393)
at java.lang.StringBuilder.append(StringBuilder.java:120)
Could you please look into the above code snippet and help me? It would be a great help.
Where does column get initialized? I don't see it, but it seems like a likely culprit. You are building a string array (column[i++]) without clearing it out. Where do you clear out that array? It should be scoped to the loop body, not outside of it. So inside the loop, declare your String[] column and use it within that scope.
This seems a logical out-of-memory error to get when the list size is big enough. Increasing the JVM heap size (using the -Xmx and -Xms JVM args) would resolve the issue temporarily. However, ideally you should use paged access to the source of the items in the list. If the list is populated from a database or web service, it can easily be accessed in a paged way.
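A rough sketch of that paged approach, where fetchPage(offset, size) and toCsvLine(dto) are assumed stand-ins for your data-access call and for building one PIPE-joined row:
// Hypothetical paging loop: write each page out as it arrives instead of
// accumulating the whole report in one StringBuffer.
final int PAGE_SIZE = 1000;
int offset = 0;
java.util.List<MassDetailReportDto> page;
while (!(page = fetchPage(offset, PAGE_SIZE)).isEmpty()) { // fetchPage: assumed DAO/service method
    for (MassDetailReportDto dto : page) {
        writer.write(toCsvLine(dto)); // toCsvLine: assumed helper, e.g. StringUtils.join(columns, PIPE) + NEW_LINE
    }
    writer.flush();
    offset += PAGE_SIZE;
}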

How to parse a huge file line by line, serialize & deserialize a huge object efficiently?

I have a file of around 4-5 GB in size (nearly a billion lines). From every line of the file, I have to parse an array of integers and some additional integer info and update my custom data structure. My class to hold such information looks like
class Holder {
    private int[][] arr = new int[1000000000][5]; // assuming that max array size is 5
    private int[] meta = new int[1000000000];
}
A sample line from the file looks like
(1_23_4_55) 99
Every index in the arr & meta corresponds to the line number in the file. From the above line, I extract the array of integers first and then the meta information. In that case,
--pseudo_code--
arr[line_num] = new int[]{1, 23, 4, 55}
meta[line_num]=99
Right now, I am using a BufferedReader object and its readLine method to read each line, and using character-level operations to parse the integer array and meta information from each line and populate the Holder instance. But it takes almost half an hour to complete this entire operation.
I used both Java Serialization & Externalizable (writing the meta and arr) to serialize and deserialize this HUGE Holder instance. And with both of them, the time to serialize is almost half an hour, and to deserialize is also almost half an hour.
I would appreciate your suggestions on dealing with this kind of problem & would definitely love to hear your part of story if any.
P.S. Main memory is not a problem. I have almost 50 GB of RAM in my machine. I have also increased the BufferedReader size to 40 MB (of course, I can increase this up to 100 MB considering that disk access takes approx. 100 MB/sec). Even cores and CPU are not a problem.
EDIT I
The code that I am using to do this task is provided below (after anonymizing a few details):
public class BigFileParser {
private int parsePositiveInt(final String s) {
int num = 0;
int sign = -1;
final int len = s.length();
final char ch = s.charAt(0);
if (ch == '-')
sign = 1;
else
num = '0' - ch;
int i = 1;
while (i < len)
num = num * 10 + '0' - s.charAt(i++);
return sign * num;
}
private void loadBigFile() {
long startTime = System.nanoTime();
Holder holder = new Holder();
String line;
try {
Reader fReader = new FileReader("/path/to/BIG/file");
// 40 MB buffer size
BufferedReader bufferedReader = new BufferedReader(fReader, 40960);
String tempTerm;
int i, meta, ascii, len;
boolean consumeNextInteger;
// GNU Trove primitive int array list
TIntArrayList arr;
char c;
while ((line = bufferedReader.readLine()) != null) {
consumeNextInteger = true;
tempTerm = "";
arr = new TIntArrayList(5);
for (i = 0, len = line.length(); i < len; i++) {
c = line.charAt(i);
ascii = c - 0;
// 95 is the ascii value of _ char
if (consumeNextInteger && ascii == 95) {
arr.add(parsePositiveInt(tempTerm));
tempTerm = "";
} else if (ascii >= 48 && ascii <= 57) { // '0' - '9'
tempTerm += c;
} else if (ascii == 9) { // '\t'
arr.add(parsePositiveInt(tempTerm));
consumeNextInteger = false;
tempTerm = "";
}
}
meta = parsePositiveInt(tempTerm);
holder.update(arr, meta);
}
bufferedReader.close();
long endTime = System.nanoTime();
System.out.println("#time -> " + (endTime - startTime) * 1.0
/ 1000000000 + " seconds");
} catch (IOException exp) {
exp.printStackTrace();
}
}
}
public class Holder {
private static final int SIZE = 500000000;
private TIntArrayList[] arrs;
private TIntArrayList metas;
private int idx;
public Holder() {
arrs = new TIntArrayList[SIZE];
metas = new TIntArrayList(SIZE);
idx = 0;
}
public void update(TIntArrayList arr, int meta) {
arrs[idx] = arr;
metas.add(meta);
idx++;
}
}
It sounds like the time taken for file I/O is the main limiting factor, given that serialization (binary format) and your own custom format take about the same time.
Therefore, the best thing you can do is to reduce the size of the file. If your numbers are generally small, then you could get a huge boost from using Google protocol buffers, which will encode small integers generally in one or two bytes.
Or, if you know that all your numbers are in the 0-255 range, you could use a byte[] rather than int[] and cut the size (and hence load time) to a quarter of what it is now. (assuming you go back to serialization or just write to a ByteChannel)
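To illustrate why small integers shrink so much, here is a minimal varint encoder/decoder in the same spirit as the protocol buffers encoding (a sketch, not the protobuf library itself):
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Small non-negative ints take one or two bytes instead of a fixed four.
static void writeVarInt(DataOutputStream out, int value) throws IOException {
    while ((value & ~0x7F) != 0) {
        out.writeByte((value & 0x7F) | 0x80); // low 7 bits, continuation bit set
        value >>>= 7;
    }
    out.writeByte(value);                     // final 7 bits, no continuation bit
}

static int readVarInt(DataInputStream in) throws IOException {
    int result = 0, shift = 0, b;
    do {
        b = in.readUnsignedByte();
        result |= (b & 0x7F) << shift;
        shift += 7;
    } while ((b & 0x80) != 0);
    return result;
}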
It simply can't take that long. You're working with some 6e9 ints, which means 24 GB. Writing 24 GB to the disk takes some time, but nothing like half an hour.
I'd put all the data in a single one-dimensional array and access it via methods like int getArr(int row, int col) which transform row and col into a single index. Depending on how the array gets accessed (usually row-wise or usually column-wise), this index would be computed as N * row + col or N * col + row to maximize locality. I'd also store meta in the same array.
Writing a single huge int[] into memory should be pretty fast, surely not half an hour.
Because of the data volume, the above doesn't work as you can't have an array with 6e9 entries. But you can use a couple of big arrays instead and all of the above applies (compute a long index from row and col and split it into two ints for accessing the 2D array).
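A sketch of that flattened, chunked layout; the column count and chunk size are illustrative:
// Flatten (row, col) into one long index, then split it across fixed-size chunks,
// because a single Java array cannot hold ~6e9 ints.
static final int COLS = 6;          // 5 array values + 1 meta value per line
static final int CHUNK = 1 << 28;   // ints per chunk (~268M), i.e. 1 GB each
static int[][] chunks;              // allocate once: new int[numChunks][CHUNK]

static int get(long row, int col) {
    long idx = row * COLS + col;
    return chunks[(int) (idx / CHUNK)][(int) (idx % CHUNK)];
}

static void set(long row, int col, int value) {
    long idx = row * COLS + col;
    chunks[(int) (idx / CHUNK)][(int) (idx % CHUNK)] = value;
}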
Make sure you aren't swapping. Swapping is the most probable reason for the slow speed I can think of.
There are several alternative Java file I/O libraries. This article is a little old, but it gives an overview that's still generally valid. He's reading about 300 MB per second with a 6-year-old Mac. So for 4 GB you have under 15 seconds of read time. Of course my experience is that Mac I/O channels are very good. YMMV if you have a cheap PC.
Note there is no advantage above a buffer size of 4K or so. In fact you're more likely to cause thrashing with a big buffer, so don't do that.
The implication is that parsing characters into the data you need is the bottleneck.
I have found in other apps that reading into a block of bytes and writing C-like code to extract what I need goes faster than the built-in Java mechanisms like split and regular expressions.
If that still isn't fast enough, you'd have to fall back to a native C extension.
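As an illustration of that C-like style for this particular line format, here is a sketch that pulls the integers straight out of a byte buffer; it assumes well-formed lines such as (1_23_4_55) 99:
// Parse one line directly from ASCII bytes, with no String.split or regex.
// Returns how many array values were written into 'values'; the meta value
// ends up in metaOut[0].
static int parseLine(byte[] buf, int start, int end, int[] values, int[] metaOut) {
    int num = 0, count = 0;
    for (int i = start; i < end; i++) {
        byte b = buf[i];
        if (b >= '0' && b <= '9') {
            num = num * 10 + (b - '0');
        } else if (b == '_') {
            values[count++] = num;   // one element of the integer array finished
            num = 0;
        } else if (b == ')') {
            values[count++] = num;   // last array element; the rest of the line is the meta value
            num = 0;
        }
        // '(' and the separator before the meta value are simply skipped
    }
    metaOut[0] = num;                // digits after ')' form the meta value
    return count;
}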
If you randomly pause it you will probably see that the bulk of the time goes into parsing the integers, and/or all the new-ing, as in new int[]{1, 23, 4, 55}. You should be able to just allocate the memory once and stick numbers into it at better than I/O speed if you code it carefully.
But there's another way - why is the file in ASCII?
If it were in binary, you could just slurp it up.
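For example, if each line were re-written as six fixed-width ints (the five array values plus the meta value), reading it back becomes a tight loop of readInt() calls; a sketch under that assumption:
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

// Assumes the file has been converted to fixed-width binary records of six ints
// per line, written with DataOutputStream (big-endian).
static void slurp(String path) throws IOException {
    File f = new File(path);
    long records = f.length() / (6 * 4);   // six 4-byte ints per record
    try (DataInputStream in = new DataInputStream(
            new BufferedInputStream(new FileInputStream(f), 1 << 16))) {
        for (long r = 0; r < records; r++) {
            int a = in.readInt(), b = in.readInt(), c = in.readInt(),
                d = in.readInt(), e = in.readInt(), meta = in.readInt();
            // feed (a, b, c, d, e) and meta into the Holder here
        }
    }
}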
