I'm using StringBuilder.append() to parse and process a file as follows:
StringBuilder csvString = new StringBuilder();
bufferedReader.lines().filter(line -> !line.startsWith(HASH) && !line.isEmpty()).map(line -> line.trim())
.forEachOrdered(line -> csvString.append(line).append(System.lineSeparator()));
int startOfFileTagIndex = csvString.indexOf(START_OF_FILE_TAG);
int startOfFieldsTagIndex = csvString.indexOf(START_OF_FIELDS_TAG, startOfFileTagIndex);
int endOfFieldsTagIndex = csvString.indexOf(END_OF_FIELDS_TAG, startOfFieldsTagIndex);
int startOfDataTagIndex = csvString.indexOf(START_OF_DATA_TAG, endOfFieldsTagIndex);
int endOfDataTagIndex = csvString.indexOf(END_OF_DATA_TAG, startOfDataTagIndex);
int endOfFileTagIndex = csvString.indexOf(END_OF_FILE_TAG, endOfDataTagIndex);
int timeStartedIndex = csvString.indexOf("TIMESTARTED", endOfFieldsTagIndex);
int dataRecordsIndex = csvString.indexOf("DATARECORDS", endOfDataTagIndex);
int timeFinishedIndex = csvString.indexOf("TIMEFINISHED", endOfDataTagIndex);
if (startOfFileTagIndex != 0 || startOfFieldsTagIndex == -1 || endOfFieldsTagIndex == -1
|| startOfDataTagIndex == -1 || endOfDataTagIndex == -1 || endOfFileTagIndex == -1) {
log.error("not in correct format");
throw new Exception("not in correct format.");
}
The problem is that when the file is quite large, I get an OutOfMemoryError.
Can you help me change my code to avoid that error with large files?
Edit:
As I understand it, loading a huge file into a StringBuilder is not a good idea and won't work.
So the question is: which structure in Java is the most appropriate for parsing my huge file, deleting some lines, finding the index of some lines, separating the file into parts according to the found indexes (and where should those parts, which can be huge, be stored?), and then creating an output file at the end?
The OOM seems to be due to the fact that you are storing all lines in the StringBuilder. When the file has too many lines, it will take up a huge amount of memory and may lead to OOM.
The strategy to avoid this depends upon what you are doing with appended strings.
As I can see in your code, you are only trying to verify the structure of the input file. In that case, you don't need to store all the lines in a StringBuilder instance. Instead,
Have multiple ints to hold each index you are interested in, (or have an array of ints)
Instead of adding the line to the StringBuilder, detect the presence of the "tag" or "index" you are looking for and save it in its designated int variable.
Finally, the check you are already doing may need to undergo a change to test not as -1 but relative to other indices. (This you are currently achieving using a start index in the indexOf() call.)
If there is a risk of a tag spanning across lines, then you may not be able to use streams, but will have to use a simple for loop in which to save some previous lines, append them and check. (Just one idea; you may have a better one.)
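The approach described in this answer can be sketched roughly as follows, assuming each tag sits on its own line and never spans lines. The class name TagScanner, the tag literals, and the "#" comment prefix are hypothetical placeholders for the constants in the question, and positions are line numbers rather than character offsets.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Stream;

// Hypothetical helper: records the line number at which each tag first appears,
// without ever holding the whole file in memory.
class TagScanner {
    // Placeholder tag values (assumption); substitute the real constants.
    static final String[] TAGS = {
        "START-OF-FILE", "START-OF-FIELDS", "END-OF-FIELDS",
        "START-OF-DATA", "END-OF-DATA", "END-OF-FILE"
    };

    // Pass bufferedReader.lines() here; lines are consumed one at a time.
    static Map<String, Integer> findTags(Stream<String> lines) {
        Map<String, Integer> positions = new LinkedHashMap<>();
        int lineNo = 0;
        Iterator<String> it = lines.iterator();
        while (it.hasNext()) {
            String line = it.next().trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments, roughly as in the question
            }
            for (String tag : TAGS) {
                if (line.startsWith(tag)) {
                    positions.putIfAbsent(tag, lineNo); // keep the first occurrence only
                }
            }
            lineNo++;
        }
        return positions;
    }

    // Valid when every tag was found, the tags appear in the expected order,
    // and the first tag is on the first kept line.
    static boolean isWellFormed(Map<String, Integer> positions) {
        int previous = -1;
        for (String tag : TAGS) {
            Integer pos = positions.get(tag);
            if (pos == null || pos <= previous) {
                return false;
            }
            previous = pos;
        }
        return positions.get(TAGS[0]) == 0;
    }
}
```

Because only a handful of ints are kept, memory use stays constant no matter how large the file is. If the parts between tags must be written out, a second pass that streams lines straight to an output writer would avoid buffering them.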
Related
I am using String.split because I need to take certain parts of a string and then print the first part. The part size may vary, so I can't use substring or a math formula. I am attempting to store everything I need in a string array so that I can selectively print what I need based on position; this much I can control. However, I am not sure what to do, because I know that when I do a split, it takes the two parts and stores them in the array, and there is one case where I need the value stored in the array untouched. I'm afraid that if I do
format[0] = rename
that it will overwrite that value and mess up the entire array. My question is: how do I assign a position to this value when I don't know what the positions of the others will be? Do I need to preemptively assign it a position, or give it the last possible position in the array? I have attached the segment of the code that deals with my question. The only thing I can add is that this is inside a bigger loop and rename's value changes every iteration. Don't pay too much attention to the comments; those are more reminders for me about what to do than descriptions of what the code is supposed to do. Any pointers, tips, or help are greatly appreciated.
String format[];
rename = workbook.getSheet(sheet).getCell(column,row).getContents();
for(int i = 0; i < rename.length(); i++) {
//may need to add[i] so it has somewhere to go and store
if(rename.charAt(i) == '/') {
format = rename.split("/");
}
else if(rename.charAt(i) == '.') {
if(rename.charAt(0) == 0) {
//just put that value in the array
format = rename;
} else {
//round it to the tenths place and then put it into the array
format = rename.split("\\.");
}
} else if(rename.charAt(i) == '%') {
//space between number and percentage
format = rename.split(" ");
}
}
Whenever you assign a variable it gets overwritten
format[0] = rename
Will overwrite the first index of this array of Strings.
In your example, the 'format' array is being overwritten with each iteration of the for loop. After the loop has been completed 'format' will contain only the values for the most recent split.
I would suggest looking into using an ArrayList. It is much easier to manage than a traditional array, and you can simply iterate through the split values and append them at the end.
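As a rough sketch of that suggestion (the class and method names here are made up for illustration, and only the "/" case from the question is handled), each value's pieces are appended to one growing list, so nothing from an earlier iteration is overwritten:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating "append instead of overwrite".
class SplitCollector {
    static List<String> collect(List<String> values) {
        List<String> parts = new ArrayList<>();
        for (String value : values) {
            if (value.contains("/")) {
                // split() returns a new array; addAll appends its elements
                parts.addAll(Arrays.asList(value.split("/")));
            } else {
                parts.add(value); // keep the value untouched
            }
        }
        return parts;
    }
}
```

Calling collect with the values "a/b" and "0.5" would give a list holding "a", "b", and "0.5" in that order.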
I've looked around and couldn't find anything specifically related to what I want to accomplish, so I'm here to ask if some of you folks could help. I am a uni student and am struggling to wrap my head around a specific task.
The task revolves around the following:
Being able to have the program we develop check each line of data in a file we input, and report any errors (such as missing data) to the console via messages.
I am currently using Scanner to scan the file and .split to split the text at each hyphen that it finds and then placing that data into a String[] splitText array... the code for that is as follows:
File Fileobject = new File(importFile);
Scanner fileReader = new Scanner(Fileobject);
while(fileReader.hasNext())
{
String line = fileReader.nextLine();
String[] splitText = line.split("-");
}
The text contained within the file we are scanning, is formatted as follows:
Title - Author - Price - Publisher - ISBN
Title, Author and Publisher are of varying lengths, ISBN is 11 characters, and Price is to two decimal places. I am able to easily print valid data to the console; it's the validating and printing of errors (such as "The book title may be missing.") to the console that has my head twisted.
Would IF statements be suited to checking each line of data? And if so, how would those be structured?
If you want to check the length/presence of each of the five columns, then consider the following:
while (fileReader.hasNextLine()) {
    String line = fileReader.nextLine();
    String[] splitText = line.split("-");
    if (splitText.length < 5) {
        System.out.println("One or more columns is entirely missing.");
        continue; // skip this line
    }
    // trim() removes the spaces around each " - " separator
    if (splitText[0].trim().isEmpty()) {
        System.out.println("Title is missing");
    }
    if (splitText[1].trim().isEmpty()) {
        System.out.println("Author is missing");
    }
    boolean isValidPrice = true;
    try {
        Double.parseDouble(splitText[2]);
    }
    catch (NumberFormatException e) {
        isValidPrice = false;
    }
    if (!isValidPrice) {
        System.out.println("Found an invalid price " + splitText[2] + " but expected a decimal.");
    }
    if (splitText[4].trim().length() != 11) {
        System.out.println("Found an invalid ISBN.");
    }
}
I do a two level validation above. If splitting the current line on dash does not yield 5 terms, then we have missing columns and we do not attempt to even guess what data might actually be there. If there are 5 expected columns, then we do a validation on each field by length and/or by expected value.
Yes, your best bet is to use if statements (I can't think of another way?). For cleanliness, I recommend you create a validateData(String data) method, or multiple validator functions.
For example, because you know each line is going to be in the Title - Author - Price - Publisher - ISBN format, you can write code like this:
public void validatePrice(String data) {
//Write your logic to validate.
}
public void validateAuthor(String data) {
//Write your logic to validate.
}
...
Then in your while loop you can call
validatePrice(splitText[2]);
validateAuthor(splitText[1]);
for each validator method.
Depending on your needs, you can make this a bit more OOP in style, but this is one cleanish way to do it.
The first thing you want to check for validation is that you have the proper number of entries (in this case, check that the array is of size 5). After that, you want to check each piece of data.
If statements are a good way to go, and you can do something as simple as:
if(title.isBad()) print("error");
if(author.isBad()) print("error");
if(price.isBad()) print("error");
if(publisher.isBad()) print("error");
if(isbn.isBad()) print("error");
Replacing the .isBad with which ever clauses you are checking, such as string[i].isEmpty(), the length of the ISBN, etc.
For ones that take longer to check, such as the Price, you'll want to write some nested for loops, checking that it contains a period, contains only numbers, and only has 2 digits after the period.
Something helpful to know about is the wrapper classes for the primitive data types, which allow you to do
Character.isLetter(strings[i].charAt(j))
in place of
(strings[i].charAt(j) >= 'A' && strings[i].charAt(j) <= 'Z') ||
(strings[i].charAt(j) >= 'a' && strings[i].charAt(j) <= 'z')
and
try {
    Double.parseDouble(strings[i]);
} catch (NumberFormatException e) {
    print("error");
}
instead of manually checking the price.
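Note that Double.parseDouble also accepts values such as "12.5" or "1e3", so if exactly two decimal places matter, a regular expression is one possible alternative to the nested loops. This is only a sketch; PriceCheck and isValidPrice are hypothetical names, and it assumes prices are plain digits with no currency symbol or thousands separator.

```java
// Hypothetical validator for prices written as digits, a period, and exactly two digits.
class PriceCheck {
    static boolean isValidPrice(String text) {
        // trim() drops the spaces left over from splitting on "-"
        return text.trim().matches("\\d+\\.\\d{2}");
    }
}
```

With this check, " 12.99 " passes, while "12.9", "1e3", and "abc" are rejected.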
Hope this helps!
I want to read each line of input, store the numbers in an int[] array, perform some calculations, and then move on to my next line of input as fast as possible.
Input (stdin)
2 4 8
15 10 5
12 14 3999 -284 -71
0 -213 18 4 2
0
This is a pure optimization problem and not entirely good practice in the real world, as I'm assuming perfect input. I'm interested in how to improve my current method for taking input from stdin and representing it as an integer array. I have seen methods using Scanner and its nextInt() method; however, I've read in multiple places that Scanner is a lot slower than BufferedReader.
Can this taking in of input step be improved?
Current Method
BufferedReader bufferedInput = new BufferedReader(new InputStreamReader(System.in));
String line;
String[] lineArray;
try {
    // a line with just "0" indicates end of std input
    while ((line = bufferedInput.readLine()) != null && !line.equals("0")) {
        lineArray = line.split("\\s+"); // is "\\s+" the optimized regex
        int arrlength = lineArray.length;
        int[] lineInt = new int[arrlength];
        for (int i = 0; i < arrlength; i++) {
            lineInt[i] = Integer.parseInt(lineArray[i]);
        }
        // Perform some operations on lineInt, then regenerate a new
        // lineInt with inputs from next line of stdin
    }
} catch (IOException e) {
    e.printStackTrace();
}
Judging from other questions (such as "Difference between parseInt and valueOf in java?"), parseInt seems to be the most efficient method for converting strings to integers. Any enlightenment would be of great help.
Thank you :)
Edit 1: removed GCD information and 'algorithm' tag
Edit 2: (hopefully) made question more concise, grammatical fix ups
First of all, I just want to point out that it is totally pointless to optimize in your particular example.
For your example, most people would agree that the best solution is not the optimal one. Rather, the most readable solution will be the best.
Having said that, if you want the most optimal solution, then don't use Scanner, don't use BufferedReader.readLine(), don't use String.split and don't use Integer.parseInt(...).
Instead read characters one at a time using BufferedReader.read() and parse and convert them to int by hand. You also need to implement your own "extendable array of int" type that behaves like an ArrayList<Integer>.
This is a lot of (unnecessary) work, and many more lines of code to maintain. BAD IDEA ...
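Purely to illustrate how much code that means, here is a hedged sketch of the character-at-a-time idea. The class name FastIntLine is invented, a plain int[] grown with Arrays.copyOf stands in for the "extendable array of int", and the input is assumed to be well-formed ASCII digits, minus signs, and spaces.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical reader: parses one line of space-separated ints, one char at a time.
class FastIntLine {
    // Returns the ints on the next line, or null when the input is exhausted.
    static int[] readLine(Reader in) {
        int[] values = new int[8];
        int count = 0;
        int current = 0;
        boolean negative = false;
        boolean inNumber = false;
        boolean sawAnything = false;
        try {
            int c;
            while ((c = in.read()) != -1 && c != '\n') {
                sawAnything = true;
                if (c >= '0' && c <= '9') {
                    current = current * 10 + (c - '0');
                    inNumber = true;
                } else if (c == '-') {
                    negative = true;
                } else if (inNumber) { // a space or '\r' ends the current number
                    if (count == values.length) {
                        values = Arrays.copyOf(values, count * 2); // grow by doubling
                    }
                    values[count++] = negative ? -current : current;
                    current = 0;
                    negative = false;
                    inNumber = false;
                }
            }
            if (inNumber) { // last number on the line has no trailing separator
                if (count == values.length) {
                    values = Arrays.copyOf(values, count * 2);
                }
                values[count++] = negative ? -current : current;
            }
            if (c == -1 && !sawAnything) {
                return null; // end of input
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Arrays.copyOf(values, count); // trim to the number of ints read
    }
}
```

The reader passed in should be a BufferedReader so that each read() call does not hit the underlying stream.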
I second what Stephen said: parsing is likely to be far faster than the actual I/O, so improving the parsing won't gain you much.
Seriously, don't do this unless you've built the whole system, profiled it and found that inefficient parsing is what keeps it from hitting its performance targets.
But strictly just as an exercise, and because the general principle may be useful elsewhere, here's an example of how to parse it straight from a string.
The assumptions are:
You will use a sensible encoding, where the characters 0..9 are consecutive.
The only characters in the stream will be 0..9, minus sign and space.
All the numbers are well-formed.
Another important caveat is that, for the sake of simplicity, I used ArrayList, which is a bad idea for storing primitives: the overhead of boxing/unboxing probably wipes out all the improvement in parsing speed. In the real world I'd use a list variant custom-made for primitives.
public static List<Integer> parse(String s) {
List<Integer> ret = new ArrayList<Integer>();
int sign = 1;
int current = 0;
boolean inNumber = false;
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (c >= '0' && c <= '9') { //we assume a sensible encoding
current = current * 10 + sign * (c-'0');
inNumber = true;
}
else if (c == ' ' && inNumber) {
ret.add(current);
current = 0;
inNumber = false;
sign = 1;
}
else if (c == '-') {
sign = -1;
}
}
if (inNumber) {
ret.add(current);
}
return ret;
}
I have a script which visits links from a text file. I am trying to delete the string if value returned is null
Example:
1. some link (returned value 'hi')
2. some link (returned null value) //DELETE STRING FROM FILE BECAUSE NULL VALUE RETURNED
3. some link (returned value 'hello')
Some code:
while ((input = in.readLine()) != null) {
System.out.println(input);
if ((input = in.readLine())=="0"){
System.out.println("1 String deleted from file because null value returned ");
}
I'm aware that I'm checking for the String "0" instead of the integer 0; I suppose that is because the server stores it as a string.
I think that, rather than trying to remove lines from the file mid-read (I don't even really know how you'd do that, and if you could, it'd be a horrible idea), you might have an easier time of this by just reading the entire file in and storing each value in an index of an ArrayList<String>:
ArrayList<String> lines = new ArrayList<String>();
while ((input = in.readLine()) != null) {
lines.add(input);
}
Then write the file again after you've finished reading it, skipping any index of lines that's equal to "0":
for (String line : lines)
{
// skip "0"
if (line.equals("0")) {
continue;
}
// write to file if not
writer.write(line);
writer.newLine();
}
Note that == compares reference equality in Java, and .equals compares value equality, so for almost all cases you want to use .equals.
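A small illustration of that difference, with made-up names; the string built with new String stands in for a line returned by readLine(), which is a separate object from the literal "0":

```java
// Hypothetical demo of reference equality versus value equality for strings.
class EqualsDemo {
    static boolean sameReference(String a, String b) {
        return a == b; // true only if both variables point at the same object
    }

    static boolean sameValue(String a, String b) {
        return a.equals(b); // true if the characters match
    }

    public static void main(String[] args) {
        String fromFile = new String("0"); // distinct object, same characters
        System.out.println(sameReference(fromFile, "0")); // prints false
        System.out.println(sameValue(fromFile, "0"));     // prints true
    }
}
```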
Granted, if as your comment states above, you have another file constantly writing to this one, you're better off looking for an entirely new idea. For that matter, if you've got a script writing these, why not change the script so that it just doesn't write lines for null values in the first place? Unless you have literally no way at all of changing the script, spinning another one up to constantly rewrite parts of its work (on the same constantly-accessed file!) is going to be a. ineffective and b. extremely problematic.
Scanner s = null;
try {
s = new Scanner(new BufferedReader(new FileReader("rates.txt")));
for (int i=0; i<9; i++){
while(s.hasNext()){rates[i] = s.next();}
System.out.println(rates[i]);
}
}catch (IOException e){
System.out.println(e);
}
finally {
if (s != null) {
s.close();
}
}
When I run this code, it reads the last chunk of characters in my txt file, places them in rates[0], sticks null in 1-9. I'm not sure why it's reading the end of my file first. The contents of the txt are below..
USD 1.34
EUR 1.00
JPY 126.28
GBP 0.88
INR 60.20
It reads the 60.20, which is all it is recording in the array. Any help would be appreciated. I guess I could give you the results of running this code:
run:
60.20
null
null
null
null
null
null
null
null
BUILD SUCCESSFUL (total time: 0 seconds)
while(s.hasNext()){rates[i] = s.next();}
In plain English, this says: While there are tokens left, put the next token into rates[i].
So it will put the first token into rates[i], then the next token into rates[i], then the next token into rates[i], ..., and finally the last token into rates[i]. Since i is not modified, all the values are written into the same element of the array, overwriting the previously read values.
I recommend:
Using List instead of array
More flexible, much easier to work with, takes advantage of Java Collections Framework, etc
Not storing the currency symbol and the numeric exchange rate all in one mixed bag
...but using a class to encapsulate the pair
Using Scanner.nextDouble() to read the numeric exchange rate (which presumably you'll want to convert to double anyway)
So, something like this:
List<ExchangeRate> allRates = new ArrayList<ExchangeRate>();
while (sc.hasNext()) {
String symbol = sc.next();
double rate = sc.nextDouble();
allRates.add(new ExchangeRate(symbol, rate));
}
Note how:
You no longer need to know how many elements to allocate in an array
The symbol and the rate aren't all thrown into one mixed bag
List.add means no counter that you need to keep track of and manage
i.e. the bug in your original question!
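The snippet above does not define ExchangeRate, so here is one possible minimal shape for it, offered as an assumption rather than the only design. RateReader and readAll are hypothetical names, and the Scanner is switched to Locale.US because nextDouble() is locale-sensitive and the file uses "." as the decimal separator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;

// Hypothetical value class pairing a currency symbol with its rate.
class ExchangeRate {
    final String symbol;
    final double rate;

    ExchangeRate(String symbol, double rate) {
        this.symbol = symbol;
        this.rate = rate;
    }
}

// Hypothetical helper that reads "SYMBOL rate" pairs until the input runs out.
class RateReader {
    static List<ExchangeRate> readAll(Scanner sc) {
        sc.useLocale(Locale.US); // make "1.34" parse the same way on every system
        List<ExchangeRate> allRates = new ArrayList<>();
        while (sc.hasNext()) {
            String symbol = sc.next();
            double rate = sc.nextDouble();
            allRates.add(new ExchangeRate(symbol, rate));
        }
        return allRates;
    }
}
```

For the five-line rates.txt in the question, readAll would return five entries in file order, starting with USD.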
I think the problem is that line 5, which contains your while loop, reads the entire file input. So you read your entire file on the first for loop iteration, where i = 0. The next time your for loop runs, there is nothing left to read.
You probably want something like this instead:
List<String> rates = new ArrayList<String>();
while (s.hasNext()) {
rates.add(s.next());
}
One other potential problem: FileReader uses the platform default encoding. This can be appropriate to process user-supplied files, but if the files are part of the application, they can get corrupted when the application is run on a system with an incompatible default encoding (and no, using only ASCII characters does not protect you completely against this).
To avoid the problem, use new InputStreamReader(new FileInputStream(filename), encoding) instead - and realize that you actually have to pick an encoding for your file.
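As a sketch of that advice, assuming the file is UTF-8 (that choice is an assumption; use whatever encoding the file was actually saved in). EncodedInput and utf8Scanner are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Hypothetical helper: builds a Scanner that decodes bytes as UTF-8,
// regardless of the platform default encoding.
class EncodedInput {
    static Scanner utf8Scanner(InputStream in) {
        return new Scanner(new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8)));
    }
}
```

In the question's code, EncodedInput.utf8Scanner(new FileInputStream("rates.txt")) would take the place of the Scanner built on a FileReader.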