Guaranteeing order of file content when fetched through multithreading - Java

Suppose there are 100 files numbered from 1-100 and you need to read these files in parallel using multithreading. Is there any way to print the content of these files in order, i.e. 1-100?

Yes, provided you can hold the contents of all of them in memory.
The basic idea is to store a Future for each file as you submit the reads in order, and then get the values from the futures in the same order they were created.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;

List<String> filePathsInOrder = new ArrayList<>(); // fill with the 100 paths, in order
List<Future<String>> fileOutputsInOrder = new ArrayList<>();
for (String filePath : filePathsInOrder) {
    fileOutputsInOrder.add(CompletableFuture.supplyAsync(() -> {
        try {
            return Files.readString(Paths.get(filePath));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }));
}
for (Future<String> fileOutput : fileOutputsInOrder) {
    System.out.println(fileOutput.get()); // get() throws checked exceptions; handle or declare them
}
You would of course need to take care of subtleties like exception handling in case some of your reads fail, etc. That is not done above, as it is beyond the scope of this question.

Yes, of course. You can create a String array of 100 elements and fill in the element at the proper index: if you read file 55, you set the 54th String (remember, indexing starts from 0). If you wait for all threads to finish, you can then just loop over the array and print its contents. You can also decide not to wait; in that case you can keep a numeric value n (initialized to -1) denoting the last file successfully printed, and at the end of each thread print out whatever contiguous files you can at that point.
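Here is a hedged sketch of the wait-for-all variant, assuming an ExecutorService and files named 1.txt through 100.txt (both are assumptions, not from the question):
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class OrderedReads {
    public static void main(String[] args) throws InterruptedException {
        String[] contents = new String[100]; // slot i-1 holds the content of file i
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 1; i <= 100; i++) {
            final int fileNo = i;
            pool.execute(() -> {
                try {
                    // hypothetical naming scheme "1.txt".."100.txt"
                    contents[fileNo - 1] = Files.readString(Paths.get(fileNo + ".txt"));
                } catch (Exception e) {
                    contents[fileNo - 1] = ""; // real code should report the failure
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES); // wait for all reads to finish
        for (String c : contents) {
            System.out.println(c); // prints in file order 1-100
        }
    }
}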

Related

Trying to add substrings from newLines in a large file to a list

I downloaded my extended listening history from Spotify and I am trying to make a program to turn the data into a list of artists, without duplicates, that I can easily make sense of. The file is rather huge because it has data on every stream I have done since 2016 (307790 lines of text in total). This is what 2 lines of the file look like:
{"ts":"2016-10-30T18:12:51Z","username":"edgymemes69endmylifepls","platform":"Android OS 6.0.1 API 23 (HTC, 2PQ93)","ms_played":0,"conn_country":"US","ip_addr_decrypted":"68.199.250.233","user_agent_decrypted":"unknown","master_metadata_track_name":"Devil's Daughter (Holy War)","master_metadata_album_artist_name":"Ozzy Osbourne","master_metadata_album_album_name":"No Rest for the Wicked (Expanded Edition)","spotify_track_uri":"spotify:track:0pieqCWDpThDCd7gSkzx9w","episode_name":null,"episode_show_name":null,"spotify_episode_uri":null,"reason_start":"fwdbtn","reason_end":"fwdbtn","shuffle":true,"skipped":null,"offline":false,"offline_timestamp":0,"incognito_mode":false},
{"ts":"2021-03-26T18:15:15Z","username":"edgymemes69endmylifepls","platform":"Android OS 11 API 30 (samsung, SM-F700U1)","ms_played":254120,"conn_country":"US","ip_addr_decrypted":"67.82.66.3","user_agent_decrypted":"unknown","master_metadata_track_name":"Opportunist","master_metadata_album_artist_name":"Sworn In","master_metadata_album_album_name":"Start/End","spotify_track_uri":"spotify:track:3tA4jL0JFwFZRK9Q1WcfSZ","episode_name":null,"episode_show_name":null,"spotify_episode_uri":null,"reason_start":"fwdbtn","reason_end":"trackdone","shuffle":true,"skipped":null,"offline":false,"offline_timestamp":1616782259928,"incognito_mode":false},
It is formatted in the actual text file so that each stream is on its own line. NetBeans is telling me the exception is happening at line 19, and it only fails when I am looking for a substring bounded by the indexOf function. My code is below. I have no idea why this isn't working; any ideas?
import java.io.*;
import java.util.*;

public class MainClass {
    public static void main(String args[]) {
        File dat = new File("SpotifyListeningData.txt");
        List<String> list = new ArrayList<String>();
        Scanner swag = null;
        try {
            swag = new Scanner(dat);
        }
        catch (Exception e) {
            System.out.println("pranked");
        }
        while (swag.hasNextLine())
            if (swag.nextLine().length() > 1)
                if (list.contains(swag.nextLine().substring(swag.nextLine().indexOf("artist_name"), swag.nextLine().indexOf("master_metadata_album_album"))))
                    System.out.print("");
                else
                    try {
                        list.add(swag.nextLine().substring(swag.nextLine().indexOf("artist_name"), swag.nextLine().indexOf("master_metadata_album_album")));
                    }
                    catch (Exception e) {}
        System.out.println(list);
    }
}
Find a JSON parser you like.
Create a class with the fields you care about, marked up to the parser's specs.
Read the file into a collection of objects. Most parsers will stream the contents, so you're not storing a massive string.
You can then load the data into objects and store them as you see fit. For your purposes, a TreeSet is probably what you want.
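A hedged sketch of that approach using Gson (any JSON parser works; the class name StreamEntry and the assumption that the file is one big JSON array are illustrative, not from the question):
import java.io.FileReader;
import java.io.IOException;
import java.util.Set;
import java.util.TreeSet;

import com.google.gson.Gson;
import com.google.gson.stream.JsonReader;

public class ArtistExtractor {
    // only the field we care about; Gson ignores the rest of each object
    static class StreamEntry {
        String master_metadata_album_artist_name;
    }

    public static void main(String[] args) throws IOException {
        Gson gson = new Gson();
        Set<String> artists = new TreeSet<>(); // sorted, no duplicates
        try (JsonReader reader = new JsonReader(new FileReader("SpotifyListeningData.txt"))) {
            reader.beginArray(); // stream the array entry by entry
            while (reader.hasNext()) {
                StreamEntry e = gson.fromJson(reader, StreamEntry.class);
                if (e.master_metadata_album_artist_name != null) {
                    artists.add(e.master_metadata_album_artist_name);
                }
            }
            reader.endArray();
        }
        System.out.println(artists);
    }
}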
Your code throws exceptions partly because you don't use braces. Please do use braces in every block, whether it is an if, an else, a loop, whatever. It's good practice and prevents unnecessary bugs.
More importantly, every time scanner.nextLine() is called, it reads the next line from the file, so you must not call it repeatedly when you mean to inspect the same line.
The best way to deal with this is to write a class containing fields matching the JSON in each line of the file, map the JSON to that class, and read the desired field from it.
Your way is risky and dependent on the structure of the data, even on whitespace. However, I fixed some lines in your code and this will work for your purpose, although I don't really recommend operating on strings this way:
while (swag.hasNextLine()) {
    String swagNextLine = swag.nextLine(); // read each line exactly once
    if (swagNextLine.length() > 1) {
        String toBeAdded = swagNextLine.substring(
                swagNextLine.indexOf("artist_name") + "artist_name".length() + 2,
                swagNextLine.indexOf("master_metadata_album_album") - 2);
        if (list.contains(toBeAdded)) {
            System.out.print("Match");
        } else {
            try {
                list.add(toBeAdded);
            } catch (Exception e) {
                System.out.println("Add to list failed");
            }
        }
        System.out.println(list);
    }
}

Does Java create an object even if it's not initialized directly?

If I initialize a String array directly, like this: String[] Distro = Distros.split(","); then it'll create an object, because the variable Distro is holding the array.
But if I do it this way, will it also create an object?
String Distros = "CentOS,RHEL,Debian,Ubuntu";
for (String s : Distros.split(",")) {
    System.out.println(s);
}
My goal is to reduce object creation to minimize garbage.
Your reasoning “then it'll create an object because variable Distro is holding the array” indicates that you are confusing object creation with variable assignment.
The object is created by the expression Distros.split(","), not the subsequent assignment. It should become obvious when you consider that the split method is an ordinary Java method creating and returning the array without any knowledge about what the caller will do with the result.
When the operation happens in performance-critical code, you might use
int p = 0;
for (int e; (e = Distros.indexOf(',', p)) >= 0; p = e + 1)
    System.out.println(Distros.substring(p, e));
System.out.println(Distros.substring(p));
instead. It’s worth pointing out that this saves the array creation but still performs the creation of the substrings, which is the more expensive aspect of it. Without knowing what you are actually going to do with the substrings, it’s impossible to say whether there are alternatives which can save the substring creation¹.
But this loop still has an advantage over the split method. The split method creates all substrings and returns an array holding references to them, forcing them to exist at the same time, during the entire loop. The loop above calls substring when needed and doesn’t keep a reference when going to the next. Hence, the strings are not forced to exist all the time and the garbage collector is free to decide when to collect them, depending on the current memory utilization.
¹ I assume that printing is just an example. But to stay with the example, you could replace
System.out.println(Distros.substring(p, e));
with
System.out.append(Distros, p, e).println();
The problem is, this only hides the substring creation, at least in the reference implementation which will eventually perform the substring creation behind the scenes.
An alternative is
BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(FileDescriptor.out)));
try {
    int p = 0;
    for (int e; (e = Distros.indexOf(',', p)) >= 0; p = e + 1) {
        bw.write(Distros, p, e - p);
        bw.write(System.lineSeparator());
    }
    bw.write(Distros, p, Distros.length() - p);
    bw.write(System.lineSeparator());
    bw.flush();
}
catch (IOException ex) {
    ex.printStackTrace();
}
which truly writes the strings without creating substrings. But it forces us to deal with potential exceptions, which PrintStream normally hides.
The split(delimiter) method returns a String array derived from the string based on the delimiter. What you did creates the string array inside the for-each, and its scope ends after the for-each, so it is eligible for the GC to release it.
String Distros = "CentOS,RHEL,Debian,Ubuntu";
for (String s : Distros.split(",")) {
    System.out.println(s);
}
is equivalent to
String Distros = "CentOS,RHEL,Debian,Ubuntu";
System.out.println("start scope");
{
    String[] splitArray = Distros.split(",");
    for (String s : splitArray) {
        System.out.println(s);
    }
}
System.out.println("end scope");

Threading a recursive function

I have this recursive function that finds hrefs on a URL and adds them all to a global list. This is done synchronously and takes a long time. I have tried to do this with threading but have failed to get all threads to write to the one list. Could someone please show me how to do this with threading?
private static void buildList(String BaseURL, String base) {
    try {
        Document doc = Jsoup.connect(BaseURL).get();
        org.jsoup.select.Elements links = doc.select("a");
        for (Element e : links) {
            // only if this website has not been visited yet
            if (!urls.contains(e.attr("abs:href"))) {
                // eliminates pictures and pdfs
                if (!e.attr("abs:href").contains(".jpg")) {
                    if (!e.attr("abs:href").contains("#")) {
                        if (!e.attr("abs:href").contains(".pdf")) {
                            // makes sure it doesn't leave the website
                            if (e.attr("abs:href").contains(base)) {
                                urls.add(e.attr("abs:href"));
                                System.out.println(e.attr("abs:href"));
                                // recursive call
                                buildList(e.attr("abs:href"), base);
                            }
                        }
                    }
                }
            }
        }
    } catch (IOException ex) {
    }
    // to print out all urls:
    /*
     * for (int i = 0; i < urls.size(); i++) {
     *     System.out.println(urls.get(i));
     * }
     */
}
This is a great use case for ForkJoin. It'll provide excellent concurrency with very simple code.
For the set of URLs parsed, use Collections.synchronizedSet(new HashSet<String>()).
You can also create a larger ForkJoinPool than the number of cores you have, since there's network I/O involved (the common usage expects each thread to be doing CPU work at ~100%).
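A hedged sketch of that suggestion, not the poster's actual code: the class name CrawlTask is illustrative, and the URL filters are copied from the question. Each task parses one page and forks a subtask per newly discovered link.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.RecursiveAction;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class CrawlTask extends RecursiveAction {
    private final String url;
    private final String base;
    private final Set<String> visited; // thread-safe set of URLs already seen

    CrawlTask(String url, String base, Set<String> visited) {
        this.url = url;
        this.base = base;
        this.visited = visited;
    }

    @Override
    protected void compute() {
        List<CrawlTask> subTasks = new ArrayList<>();
        try {
            Document doc = Jsoup.connect(url).get();
            for (Element e : doc.select("a")) {
                String href = e.attr("abs:href");
                // same filters as the question; add() returns false if another
                // thread already claimed this URL, so each page is visited once
                if (href.contains(base) && !href.contains(".jpg")
                        && !href.contains(".pdf") && !href.contains("#")
                        && visited.add(href)) {
                    System.out.println(href);
                    subTasks.add(new CrawlTask(href, base, visited));
                }
            }
        } catch (IOException ignored) {
            // skip unreachable pages, as the original code does
        }
        invokeAll(subTasks); // fork all children and wait for them to finish
    }
}
Hypothetical usage, with a pool larger than the core count since threads block on I/O:
Set<String> visited = Collections.synchronizedSet(new HashSet<String>());
new ForkJoinPool(32).invoke(new CrawlTask("https://example.com/", "example.com", visited));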
Use any collection from the java.util.concurrent package to store the values you get from different threads (e.g. an ArrayBlockingQueue).
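For instance (an illustrative snippet; ConcurrentLinkedQueue is one such collection):
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// safe for many threads to add() to concurrently, no external locking needed
Queue<String> foundUrls = new ConcurrentLinkedQueue<>();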
You can use fork and join once you break your problem down into a divide-and-conquer algorithm.

Android string content loading performance

I have 1000 lines in a file which will be served to the user every time he/she loads the application.
My current approach is:
MainActivity: onCreate: start an AsyncTask
AsyncTask onPreExecute: show a progress dialog
AsyncTask doInBackground: check if the key/value is present in SharedPreferences. If yes, do nothing in doInBackground. If no (first-time user), read from the raw file into a StringBuilder, then store the content of the StringBuilder as a key/value pair in SharedPreferences.
AsyncTask onPostExecute: populate the TextView from SharedPreferences and dismiss the progress dialog.
The code to read from file in the doInBackground method is:
StringBuilder sb = new StringBuilder();
InputStream textStream = getBaseContext().getResources().openRawResource(R.raw.file);
BufferedReader bReader = new BufferedReader(new InputStreamReader(textStream));
String aJsonLine = null;
try {
    while ((aJsonLine = bReader.readLine()) != null) {
        sb.append(aJsonLine + System.getProperty("line.separator"));
    }
} catch (IOException e) {
    e.printStackTrace();
} finally {
    try {
        bReader.close();
        textStream.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
I am seeing that the user has to wait around 9-10 seconds on first launch and 4-5 seconds on subsequent launches. Any suggestions to improve the performance in my case?
You don't need to make your user wait for the whole list to load. Once you have enough data to fill the screen (10-20 items, maybe?), populate the on-screen list or whatever with the data you already have; this will make the delay totally insignificant.
You may check http://developer.android.com/reference/android/content/AsyncTaskLoader.html to see how it's supposed to be done.
As a small sideline to the other comments: since aJsonLine is a String, it's a better idea to append its value and the newline with two append() calls instead of a single one:
sb.append(aJsonLine);
sb.append(System.getProperty("line.separator"));
instead of:
sb.append(aJsonLine + System.getProperty("line.separator"));
With the latter, aJsonLine and the result of System.getProperty("line.separator") are first concatenated (through a temporary StringBuilder) before the combined value can be passed as a parameter.
Of course, you should also cache the value of System.getProperty("line.separator") instead of looking it up on every iteration.
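For instance, a trivial sketch of the cached-separator variant, reusing the loop from the question:
final String newline = System.getProperty("line.separator"); // looked up once
while ((aJsonLine = bReader.readLine()) != null) {
    sb.append(aJsonLine);
    sb.append(newline);
}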
I'd rather read the JSON stream through a JsonReader and extract the name/value pairs I'm interested in. String concatenation and garbage collection are expensive operations, and the way the code is written now, they will slow down the task. There are also inefficiencies in the code, like looking up the line separator with System.getProperty("line.separator") on every iteration of the loop.
You should see a significant performance boost just by using a JsonReader.
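A hedged sketch of the JsonReader approach (android.util.JsonReader), assuming the raw resource is a JSON array of objects and that a field called "name" is what's wanted; both are assumptions, so adjust to the real schema:
import android.util.JsonReader;

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// inside an Activity (or anything with a Context); call from doInBackground()
private List<String> readNames() throws IOException {
    InputStream in = getResources().openRawResource(R.raw.file);
    List<String> names = new ArrayList<>();
    try (JsonReader reader = new JsonReader(
            new InputStreamReader(in, StandardCharsets.UTF_8))) {
        reader.beginArray();
        while (reader.hasNext()) {
            reader.beginObject();
            while (reader.hasNext()) {
                if (reader.nextName().equals("name")) {
                    names.add(reader.nextString());
                } else {
                    reader.skipValue(); // ignore fields we don't need
                }
            }
            reader.endObject();
        }
        reader.endArray();
    }
    return names;
}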

Scanner.findInLine() leaks memory massively

I'm running a simple scanner to parse a string; however, I've discovered that if it is called often enough I get OutOfMemoryErrors. This code is called as part of the constructor of an object that is built repeatedly for an array of strings:
Edit: here's the constructor for more info; not much more happens outside of the try-catch regarding the Scanner.
public Header(String headerText) {
    char[] charArr;
    charArr = headerText.toCharArray();
    // Check that all characters are printable characters
    if (charArr.length > 0 && !commonMethods.isPrint(charArr)) {
        throw new IllegalArgumentException(headerText);
    }
    // Check for header suffix
    Scanner sc = new Scanner(headerText);
    MatchResult res;
    try {
        sc.findInLine("(\\D*[a-zA-Z]+)(\\d*)(\\D*)");
        res = sc.match();
    } finally {
        sc.close();
    }
    if (res.group(1) == null || res.group(1).isEmpty()) {
        throw new IllegalArgumentException("Missing header keyword found"); // Empty header to store
    } else {
        mnemonic = res.group(1).toLowerCase(); // Store header
    }
    if (res.group(2) == null || res.group(2).isEmpty()) {
        suffix = -1;
    } else {
        try {
            suffix = Integer.parseInt(res.group(2)); // Store suffix if it exists
        } catch (NumberFormatException e) {
            throw new NumberFormatException(headerText);
        }
    }
    if (res.group(3) == null || res.group(3).isEmpty()) {
        isQuery = false;
    } else {
        if (res.group(3).equals("?")) {
            isQuery = true;
        } else {
            throw new IllegalArgumentException(headerText);
        }
    }
    // If command was of the form *ABC, reject suffixes and prefixes
    if (mnemonic.contains("*") && suffix != -1) {
        throw new IllegalArgumentException(headerText);
    }
}
A profiler memory snapshot shows the read(char) method used by Scanner.findInLine() being allocated massive amounts of memory as I scan through a few hundred thousand strings; after a few seconds, over 38 MB is already allocated.
I would think that calling close() on the scanner after using it in the constructor would make the old object eligible for collection by the GC, but somehow it remains, and the read method accumulates gigabytes of data before filling the heap.
Can anybody point me in the right direction?
You haven't posted all your code, but given that you are scanning for the same regex repeatedly, it would be much more efficient to compile a static Pattern beforehand and use it for the scanner's findInLine:
static Pattern p = Pattern.compile("(\\D*[a-zA-Z]+)(\\d*)(\\D*)");
and in the constructor:
sc.findInLine(p);
This may or may not be the source of the OOM issue, but it will definitely make your parsing a bit faster.
Related: java.util.regex - importance of Pattern.compile()?
Update: after you posted more of your code, I see some other issues. If you're calling this constructor repeatedly, it means you are probably tokenizing or breaking up the input beforehand. Why create a new Scanner to parse each line? They are expensive; you should be using the same Scanner to parse the entire file, if possible. Using one Scanner with a precompiled Pattern will be much faster than what you are doing now, which is creating a new Scanner and a new Pattern for each line you are parsing.
The strings that are filling up your memory were created in findInLine(). Therefore, the repeated Pattern creation is not the problem.
Without knowing what the rest of the code does, my guess would be that one of the groups you get out of the matcher is being kept in a field of your object. Then that string would have been allocated in findInLine(), as you see here, but the fact that it is being retained would be due to your code.
Edit:
Here's your problem:
mnemonic = res.group(1).toLowerCase();
What you might not realize is that toLowerCase() returns this if there are no uppercase letters in the string. Also, group(int) returns a substring(), which (in the JDKs of the time) creates a new string backed by the same char[] as the full string. So mnemonic actually holds the char[] for the entire line.
The fix would just be:
mnemonic = new String(res.group(1).toLowerCase());
I think your code snippet is not complete. I believe you are calling scanner.findInLine() in a loop. Anyway, try calling scanner.reset(). I hope this will solve your problem.
The JVM apparently does not have time to garbage collect, possibly because it's using the same code (the constructor) repeatedly to create multiple instances of the same class. The JVM may not do anything about GC until something changes on the runtime stack, and in this case that's not happening. I've been warned in the past about doing "too much" in a constructor, as some of the memory management behaviors are not quite the same while other methods are being called.
Your problem is that you are scanning through a couple hundred thousand strings, and since you pass the pattern in as a string, a new Pattern object is created for every single iteration of the loop. You can pull the pattern out of the loop, like so:
Pattern toMatch = Pattern.compile("(\\D*[a-zA-Z]+)(\\d*)(\\D*)");
Scanner sc = new Scanner(headerText);
MatchResult res;
try {
    sc.findInLine(toMatch);
    res = sc.match();
} finally {
    sc.close();
}
Then you will only be passing the object reference toMatch, instead of incurring the overhead of creating a new Pattern object for every attempted match. This will fix your leak.
Well, I've found the source of the problem: it wasn't the Scanner exactly, but the list holding the objects doing the scanning in the constructor.
The problem had to do with the overrun of a list holding references to the objects containing the parsing; essentially, more strings were received per unit of time than could be processed, and the list grew and grew until there was no more RAM. Bounding this list to a maximum size now prevents the parser from overloading the memory; I'll be adding some synchronization between the parser and the data source to avoid this overrun in the future.
Thank you all for your suggestions; I've already made some performance changes regarding the scanner. And thank you to @RobI for pointing me to jvisualvm, which allowed me to trace back the exact culprits holding the references; the memory dump wasn't showing the reference linking.
