I have a human-readable file with several hundred rows.
Each row is quite short (~20-30 characters).
From time to time I need to compare the file's contents for equality against another set of strings.
If they are different, I need to find the first row that differs. Sure, I can do it manually:
in a loop, find the first character that differs, then find the previous and following '\n', but that code is not beautiful from my point of view.
Is there any other way to achieve this, using some external library?
There's no need for any library; what you ask is rather straightforward. But it's unique enough that no library would have it, so just write it yourself.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.*;
...
Optional<String> findFirstDifferentLine(Path file, Collection<String> rows) throws IOException {
    try (var fileStream = Files.lines(file)) { // the stream must be closed to release the file handle
        var fileIt = fileStream.iterator();
        var rowIt = rows.iterator();
        while (fileIt.hasNext() && rowIt.hasNext()) {
            var fileItem = fileIt.next();
            if (!Objects.equals(fileItem, rowIt.next())) {
                return Optional.of(fileItem);
            }
        }
        // If the file has more lines than rows, the first extra line is the first difference
        return Optional.of(fileIt)
                .filter(Iterator::hasNext)
                .map(Iterator::next);
    }
}
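For instance, a caller might use it like this (the file name and expected rows here are hypothetical):

import java.nio.file.Paths;
import java.util.List;
...
// Hypothetical usage: print the first line of notes.txt that differs
// from the expected rows, if any.
var file = Paths.get("notes.txt");
var expected = List.of("alpha", "beta", "gamma");
findFirstDifferentLine(file, expected)
        .ifPresent(line -> System.out.println("First differing line: " + line));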
I downloaded my extended listening history from Spotify and I am trying to make a program that turns the data into a list of artists, without duplicates, that I can easily make sense of. The file is rather large because it has data on every stream I have done since 2016 (307790 lines of text in total). This is what 2 lines of the file look like:
{"ts":"2016-10-30T18:12:51Z","username":"edgymemes69endmylifepls","platform":"Android OS 6.0.1 API 23 (HTC, 2PQ93)","ms_played":0,"conn_country":"US","ip_addr_decrypted":"68.199.250.233","user_agent_decrypted":"unknown","master_metadata_track_name":"Devil's Daughter (Holy War)","master_metadata_album_artist_name":"Ozzy Osbourne","master_metadata_album_album_name":"No Rest for the Wicked (Expanded Edition)","spotify_track_uri":"spotify:track:0pieqCWDpThDCd7gSkzx9w","episode_name":null,"episode_show_name":null,"spotify_episode_uri":null,"reason_start":"fwdbtn","reason_end":"fwdbtn","shuffle":true,"skipped":null,"offline":false,"offline_timestamp":0,"incognito_mode":false},
{"ts":"2021-03-26T18:15:15Z","username":"edgymemes69endmylifepls","platform":"Android OS 11 API 30 (samsung, SM-F700U1)","ms_played":254120,"conn_country":"US","ip_addr_decrypted":"67.82.66.3","user_agent_decrypted":"unknown","master_metadata_track_name":"Opportunist","master_metadata_album_artist_name":"Sworn In","master_metadata_album_album_name":"Start/End","spotify_track_uri":"spotify:track:3tA4jL0JFwFZRK9Q1WcfSZ","episode_name":null,"episode_show_name":null,"spotify_episode_uri":null,"reason_start":"fwdbtn","reason_end":"trackdone","shuffle":true,"skipped":null,"offline":false,"offline_timestamp":1616782259928,"incognito_mode":false},
The actual text file is formatted so that each stream is on its own line. NetBeans is telling me the exception happens at line 19, and it only fails when I am looking for a substring bounded by the indexOf calls. My code is below. I have no idea why this isn't working; any ideas?
import java.io.File;
import java.util.*;

public class MainClass {
    public static void main(String args[]) {
        File dat = new File("SpotifyListeningData.txt");
        List<String> list = new ArrayList<String>();
        Scanner swag = null;
        try {
            swag = new Scanner(dat);
        }
        catch (Exception e) {
            System.out.println("pranked");
        }
        while (swag.hasNextLine())
            if (swag.nextLine().length() > 1)
                if (list.contains(swag.nextLine().substring(swag.nextLine().indexOf("artist_name"), swag.nextLine().indexOf("master_metadata_album_album"))))
                    System.out.print("");
                else
                    try {list.add(swag.nextLine().substring(swag.nextLine().indexOf("artist_name"), swag.nextLine().indexOf("master_metadata_album_album")));}
                    catch (Exception e) {}
        System.out.println(list);
    }
}
Find a JSON parser you like.
Create a class with the fields you care about, marked up to the parser's spec.
Read the file into a collection of objects. Most parsers will stream the contents so you're not storing a massive string.
You can then load the data into objects and store them as you see fit. For your purposes, a TreeSet is probably what you want.
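For instance, here is a minimal sketch of that approach using Gson. It assumes the export is a complete JSON array of stream objects; the class and variable names are made up, the file name is from the question, and the field name comes from the sample lines above.

import com.google.gson.Gson;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.TreeSet;

public class ArtistExtractor {
    // Only the field we care about; Gson ignores the rest of each object.
    static class StreamEntry {
        String master_metadata_album_artist_name;
    }

    public static void main(String[] args) throws Exception {
        try (Reader r = Files.newBufferedReader(Paths.get("SpotifyListeningData.txt"))) {
            StreamEntry[] entries = new Gson().fromJson(r, StreamEntry[].class);
            TreeSet<String> artists = new TreeSet<>(); // sorted, no duplicates
            for (StreamEntry e : entries) {
                if (e.master_metadata_album_artist_name != null) {
                    artists.add(e.master_metadata_album_artist_name);
                }
            }
            System.out.println(artists);
        }
    }
}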
Your code throws a lot of exceptions simply because you don't use braces. Please do use braces in every block, whether it is an if, an else, a loop, whatever. It's good practice and prevents unnecessary bugs.
Also, every time scanner.nextLine() is called, it reads the next line from the file, so you need to avoid calling it repeatedly the way you do.
The best way to deal with this is to write a class with the same fields as the JSON on each line of the file, map the JSON to the class, and get the desired field value from that.
Your way is too risky and depends on the structure of the data, even on whitespace. However, I fixed some lines in your code and this will work for your purpose, although I honestly don't prefer manipulating strings this way.
while (swag.hasNextLine()) {
    String swagNextLine = swag.nextLine();
    if (swagNextLine.length() > 1) {
        String toBeAdded = swagNextLine.substring(swagNextLine.indexOf("artist_name") + "artist_name".length() + 2,
                swagNextLine.indexOf("master_metadata_album_album") - 2);
        if (list.contains(toBeAdded)) {
            System.out.print("Match");
        } else {
            try {
                list.add(toBeAdded);
            } catch (Exception e) {
                System.out.println("Add to list failed");
            }
        }
        System.out.println(list);
    }
}
I have a method that starts creating JSON files in each of the folders in my tree.
public static void fill(List<String> subFoldersPaths) {
    for (int i = 0; i < subFoldersPaths.size(); i++) {
        String fullFileName = subFoldersPaths.get(i) + FILE_NAME;
        String formatFullFileName = String.format(fullFileName, i) + "%d";
        Runnable runnable = new JsonCreator(formatFullFileName);
        new Thread(runnable).start();
    }
}
List<String> subFoldersPaths is a list that contains paths to each folder in order.
Here is my folder structure (screenshot omitted; the root folder is "foo" with nested subfolders).
I want each folder to be filled with files in a separate thread, one file every 0.08 seconds. But my class does not fill every folder.
Here is a class that implements Runnable, which should perform the filling:
import com.epam.lab.model.Author;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import net.andreinc.mockneat.MockNeat;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

import java.io.FileWriter;
import java.io.IOException;

public class JsonCreator implements Runnable {

    private static Logger logger = LogManager.getLogger();
    private static String fileName;
    private static final int FILES_COUNT = 100;

    public JsonCreator(String s) {
        this.fileName = s;
    }

    @Override
    public void run() {
        for (int i = 0; i < FILES_COUNT; i++) {
            try {
                String formatFullFileName = String.format(fileName, i) + ".json";
                FileWriter fileWriter = new FileWriter(formatFullFileName);
                fileWriter.write(createJsonString());
                fileWriter.close();
                Thread.sleep(80);
            } catch (IOException | InterruptedException e) {
                logger.error("File was not created", e);
            }
        }
    }

    private static String createJsonString() {
        MockNeat mockNeat = MockNeat.threadLocal();
        Gson gson = new GsonBuilder()
                .setPrettyPrinting()
                .create();
        String json = mockNeat
                .reflect(Author.class)
                .field("authorName", mockNeat.names().first())
                .field("authorSurname", mockNeat.names().last())
                .map(gson::toJson)
                .val();
        return json;
    }
}
But this class does not fill every folder with files (maybe there is a problem with the file names); I cannot figure it out.
I want each folder below "foo" to be filled, in a separate thread, with FILES_COUNT JSON files.
Some examples of the algorithm's execution (screenshots omitted):
The folder structure is generated randomly, so it is almost always different, but that does not affect the fact that files are not created in all folders.
Your code is buggy; you cannot ever use that FileWriter constructor. Use new FileWriter(formatFullFileName, StandardCharsets.UTF_8), which is only available from JDK 11. If you're not on JDK 11, you can't use FileWriter at all (it uses the platform default encoding, and that is not acceptable; JSON must be in UTF-8 per the JSON spec, and you have no guarantee that UTF-8 is your platform default).
You aren't guarding your FileWriter with an ARM block (try-with-resources); you should add that.
In the initial block, formatFullFileName is a variable holding a format string. In the run() method, it's the opposite: it's the result of running a String.format op on one. That makes your code very hard to read.
Most likely your filenames are incorrect. You should be using List<Path>, which would have removed any doubt. If your List<String> subFoldersPaths contains, for example, /home/misnomer/project/foo/1stLayerSubFolder0 in it, and the constant FILE_NAME (which you did not put in your pastes) is, say, example, then the path for the very first file to be created becomes /home/misnomer/project/foo/1stLayerSubFolder0example0.json, which is not what you wanted: you're missing a slash.
NB: With the newer java.nio.file API, writing a string to a file becomes vastly simpler: Files.writeString(path, string) (JDK 11+) is all you need (and note that the Files API defaults to UTF-8, unlike most other parts of the Java libraries that involve turning strings into bytes or vice versa).
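Putting those fixes together, here is a minimal sketch of run() assuming JDK 11+ (it needs java.nio.file.Files and java.nio.file.Path imported). It also assumes fileName is an instance field: the static field in the original means every JsonCreator that is constructed overwrites the same name, which by itself could explain folders being skipped.

// Minimal sketch of run() with the above fixes, assuming JDK 11+
// and that fileName is a non-static instance field.
@Override
public void run() {
    for (int i = 0; i < FILES_COUNT; i++) {
        try {
            Path target = Path.of(String.format(fileName, i) + ".json");
            Files.writeString(target, createJsonString()); // UTF-8 by default
            Thread.sleep(80);
        } catch (IOException e) {
            logger.error("File was not created", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the interrupt flag and stop
            return;
        }
    }
}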
The paste needs more info, or you should debug this on your own: print when you write a file, preferably including the thread name (you can get it with Thread.currentThread().getName()). That's how programming works: you don't just stare at it, go --heck, I dunno, better ask Stack Overflow!-- and then give up. You debug it. Use a debugger, or if you can't/don't want to, use the poor man's debugger: add a whole bunch of System.out.println statements. Go through your code and imagine (write it down if you have to) what each step is doing. Then add a println statement that confirms it. The very place where what the program says it is doing does not match what you thought it would do? That's where a bug is. Fix it, and keep going until all bugs are eliminated.
I have been attempting to program a solution for ImageJ to process my images.
I understand how to get a directory, run commands on it, etc. However, I've run into a situation where I now need some kind of search function in order to pair two images together in a directory full of image pairs.
I'm hoping that you guys can confirm I am heading in the right direction and that my idea is right. So far it is proving difficult for me to understand, as I have less than a month's worth of experience with Java. Since this project is directly for my research, I really do have plenty of drive to get it done; I just need some direction on which functions are useful to me.
I initially thought of using regex, but I saw that when you start processing a lot of images (especially with ImageJ, which it seems does not release memory well, if that's the correct way to say it), regex is very slow.
The general format of these images is:
someString_DAPI_0001.tif
someString_GFP_0001.tif
someString_DAPI_0002.tif
someString_GFP_0002.tif
someString_DAPI_0003.tif
someString_GFP_0003.tif
They are in alphabetical order, so it should be possible to go to the next image in the list. I'm just a bit lost on which functions I should use to accomplish this, but I think my overall while structure is correct, thanks to some help from Java forums. However, I'm still stuck on where to go next.
So far, here is my code (thanks to this SO answer for partial code):
int count = 0;
getFile("C:\\");
String DAPI;
String GFP;

private void getFile(String dirPath) {
    File f = new File(dirPath);
    File[] files = f.listFiles();
    while (files.length > 0) {
        if (/* file name contains "DAPI" */) {
            // DAPI = this file's name
            // substitute within the name to get the 'GFP' file name
            // store the GFP file name into a variable
            // doSomething(DAPI, GFP);
        }
        // advance to the next file name in the list
    }
}
As of right now I don't really know how to search for a string within a string. I've seen regex capture groups and other solutions, but I do not know the "best" one for processing hundreds of images.
I also have no clue which function would be used to substitute substrings.
I'd much appreciate it if you guys could point me towards the functions best suited for this case. I like to figure out how to make it on my own; I just need help getting to the right information. I also want to make sure I am not making major logic mistakes here.
It doesn't seem like you need regex if your file names follow the simple pattern you mentioned. You can simply iterate over the files and filter based on whether the file name contains DAPI; see below. This code may be an oversimplification of your requirements, but I couldn't tell based on the details you've provided.
import java.io.*;

public class Temp {
    int count = 0;

    private void getFile(String dirPath) {
        File f = new File(dirPath);
        File[] files = f.listFiles();
        if (files != null) {
            for (File file : files) {
                if (file.getName().contains("DAPI")) {
                    String dapiFile = file.getName();
                    String gfpFile = dapiFile.replace("DAPI", "GFP");
                    doSomething(dapiFile, gfpFile);
                }
            }
        }
    }

    // doSomething does nothing right now; expand on it.
    private void doSomething(String dapiFile, String gfpFile) {
        System.out.println(new File(dapiFile).getAbsolutePath());
        System.out.println(new File(gfpFile).getAbsolutePath());
    }

    public static void main(String[] args) {
        Temp app = new Temp();
        app.getFile("C:\\tmp\\");
    }
}
NOTE: As per Vogel612's answer, if you have Java 8 and like a functional solution you can have:
private void getFile(String dirPath) {
    try {
        Files.find(Paths.get(dirPath), 1,
                (path, basicFileAttributes) -> path.toFile().getName().contains("DAPI"))
             .forEach(dapiPath -> {
                 Path gfpPath = dapiPath.resolveSibling(
                         dapiPath.getFileName().toString().replace("DAPI", "GFP"));
                 doSomething(dapiPath, gfpPath);
             });
    } catch (IOException e) {
        e.printStackTrace();
    }
}

// Dummy method, does nothing yet.
private void doSomething(Path dapiPath, Path gfpPath) {
    System.out.println(dapiPath.toAbsolutePath().toString());
    System.out.println(gfpPath.toAbsolutePath().toString());
}
Using java.io.File is the wrong way to approach this problem. What you're looking for is a Stream-based solution using Files.find that would look something like this:
// dirPath is a java.nio.file.Path here
Files.find(dirPath, 1, (path, attributes) -> {
    return path.getFileName().toString().contains("DAPI");
}).forEach(path -> {
    Path gfpFile = path.resolveSibling(/* build GFP name */);
    doSomething(path, gfpFile);
});
What this does is:
Iterate over all Paths below dirPath 1 level deep (may be adjusted)
Check that the File's name contains "DAPI"
Use these files to find the relevant "GFP"-File
Give them to doSomething
This is preferable to the File-based solution for multiple reasons:
It's significantly more informative when failing
It's cleaner and terser than your File-based solution and doesn't have to check for null
It's forward compatible, and thus preferable over a File-based solution
Files.find is available from Java 8 onwards
I was playing with Mahout and found that FileDataModel accepts data in the format
userId,itemId,pref (long,long,Double).
I have some data which is of the format
String,long,double
What is the best/easiest way to work with this dataset in Mahout?
One way to do this is by creating an extension of FileDataModel. You'll need to override the readUserIDFromString(String value) method to use some kind of resolver to do the conversion. You can use one of the implementations of IDMigrator, as Sean suggests.
For example, assuming you have an initialized MemoryIDMigrator, you could do this:
@Override
protected long readUserIDFromString(String stringID) {
    long result = memoryIDMigrator.toLongID(stringID);
    memoryIDMigrator.storeMapping(result, stringID);
    return result;
}
This way you could use memoryIDMigrator to do the reverse mapping, too. If you don't need that, you can just hash it the way it's done in their implementation (it's in AbstractIDMigrator).
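Pulling that together, here is a minimal sketch of such a subclass (the class name is hypothetical; it assumes Mahout's FileDataModel and MemoryIDMigrator from the 0.x API). One subtlety: FileDataModel's constructor already parses the file, so readUserIDFromString can be called before this class's field initializers have run, hence the lazy creation of the migrator.

import java.io.File;
import java.io.IOException;
import org.apache.mahout.cf.taste.impl.model.MemoryIDMigrator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;

// Hypothetical subclass mapping string user IDs to longs on the fly.
public class StringUserFileDataModel extends FileDataModel {

    private MemoryIDMigrator memoryIDMigrator;

    public StringUserFileDataModel(File dataFile) throws IOException {
        super(dataFile);
    }

    @Override
    protected long readUserIDFromString(String stringID) {
        long result = migrator().toLongID(stringID);
        migrator().storeMapping(result, stringID);
        return result;
    }

    // Lazy init: the superclass constructor calls readUserIDFromString
    // before this class's field initializers have run.
    private MemoryIDMigrator migrator() {
        if (memoryIDMigrator == null) {
            memoryIDMigrator = new MemoryIDMigrator();
        }
        return memoryIDMigrator;
    }

    // Reverse mapping, e.g. to print recommendations for a string user ID.
    public String userIDToString(long longID) {
        return migrator().toStringID(longID);
    }
}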
userId and itemId can be strings; this CustomFileDataModel will convert your strings into integers and keep the (String, id) map in memory; after computing recommendations you can get the string back from the id.
Assuming that your input fits in memory, loop through it. Track the ID for each string in a dictionary. If it does not fit in memory, use sort and then group by to accomplish the same idea.
In Python:
import sys

next_id = 0
str_to_id = {}
for line in sys.stdin:
    fields = line.strip().split(',')
    this_id = str_to_id.get(fields[0])
    if this_id is None:
        next_id += 1
        this_id = next_id
        str_to_id[fields[0]] = this_id
    fields[0] = str(this_id)
    print(','.join(fields))
I'm using the XMLStreamReader interface from javax.xml to parse an XML file. The file contains huge amounts of data, including single text nodes of several KB.
The validating and reading generally work very well, but I'm having trouble with text nodes that are larger than about 15k characters. The problem occurs in this function:
String foo = "";
if (xsr.getEventType() == XMLStreamConstants.CHARACTERS) {
foo = xsr.getText();
xsr.next(); // read next tag
}
return foo;
xsr being the stream reader. The text in the text node is 53'337 characters long in this particular case (it varies), but the xsr.getText() method only returns the first 15'537 of them. Of course I could loop over the function and concatenate the strings, but somehow I don't think that's the idea...
I did not find anything about this in the documentation or anywhere else. Is it intended behavior, or can someone confirm/deny it? Am I using it the wrong way somehow?
Thanks
Of course I could loop over the function and concatenate the strings, but somehow I don't think that's the idea...
Actually, that is the idea :)
The parser is permitted to break up the event stream however it wishes, as long as it's consistent with the original document. That means it can, and often will, break up your text data into multiple events. How and when it chooses to do so is an implementation detail internal to the parser, and is essentially unpredictable.
So yes, if you receive multiple sequential CHARACTERS events, you need to append them manually. This is the price you pay for a low-level API.
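A minimal sketch of that accumulation, reusing the question's xsr reader:

// Append every consecutive CHARACTERS event before moving on.
StringBuilder text = new StringBuilder();
while (xsr.getEventType() == XMLStreamConstants.CHARACTERS) {
    text.append(xsr.getText());
    xsr.next();
}
return text.toString();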
Another option is the javax.xml.stream.isCoalescing property (documented in XMLStreamReader.next() and Using StAX), which makes the parser automatically coalesce adjacent character data into a single event. The following JUnit 3 test passes.
Warning: isCoalescing probably shouldn't be used in production, because if the document has lots of character references (e.g. &#160;) or entity references (e.g. &lt;), it can cause a StackOverflowError!
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import junit.framework.TestCase;

public class XmlStreamTest extends TestCase {

    public void testLengthInXMlStreamReader() throws XMLStreamException {
        StringBuilder b = new StringBuilder();
        b.append("<root>");
        for (int i = 0; i < 65536; i++)
            b.append("hello\n");
        b.append("</root>");
        InputStream is = new ByteArrayInputStream(b.toString().getBytes());
        XMLInputFactory inputFactory = XMLInputFactory.newFactory();
        inputFactory.setProperty("javax.xml.stream.isCoalescing", true);
        XMLStreamReader reader = inputFactory.createXMLStreamReader(is);
        reader.nextTag(); // position on <root>
        reader.next();    // the coalesced CHARACTERS event
        assertEquals(6 * 65536, reader.getTextLength());
    }
}