What is the difference between toString and mkString in scala?

I have a file that contains 10 lines. I want to read it and then split its contents on the newline ("\n") delimiter.
Here's what I did:
val data = io.Source.fromFile("file.txt").toString;
But this causes an error when I try to split the file on newlines.
I then tried
val data = io.Source.fromFile("file.txt").mkString;
And it worked.
What the heck? Can someone tell me what the difference between the two methods is?

Let's look at the types, shall we?
scala> import scala.io._
import scala.io._
scala> val foo = Source.fromFile("foo.txt")
foo: scala.io.BufferedSource = non-empty iterator
scala>
Now, the variable that you have read the file foo.txt into is an iterator. If you call toString on it, you don't get the contents of the file back, just the String representation of the iterator you've created. mkString, on the other hand, reads the iterator (that is, iterates over it) and builds one long String from the values it yields.
For more info, look at this console session:
scala> foo.toString
res4: java.lang.String = non-empty iterator
scala> res4.foreach(print)
non-empty iterator
scala> foo.mkString
res6: String =
"foo
bar
baz
quux
dooo
"
scala>

The toString method is supposed to return the string representation of an object. It is often overridden to provide a meaningful representation. The mkString method is defined on collections and is a method which joins the elements of the collection with the provided string. For instance, try something like:
val a = List("a", "b", "c")
println(a.mkString(" : "))
and you will get "a : b : c" as the output. The mkString method has created a string from your collection by joining the elements of the collection with the string you provided. In the particular case you posted, the mkString call joined the elements returned by the BufferedSource iterator with the empty string (because you called mkString with no arguments). Since a Source iterates over the characters of the file, this simply concatenates all of those characters into one String holding the file's contents.
On the other hand, calling toString here doesn't really make sense, as what you are getting (when you don't get an error) is the string representation of the BufferedSource iterator, which just tells you that the iterator is non-empty.
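For readers coming from the Java side, here is a rough analogue of the same distinction. It is only a sketch: the class name JoinVersusToString and the helper join are made up for illustration, and String.join stands in for Scala's mkString.

```java
import java.util.Iterator;
import java.util.List;

class JoinVersusToString {
    // Joins the elements with a separator, roughly what Scala's mkString(sep) does.
    static String join(List<String> items, String separator) {
        return String.join(separator, items);
    }

    public static void main(String[] args) {
        List<String> items = List.of("a", "b", "c");
        Iterator<String> it = items.iterator();

        // toString() describes the iterator object itself, not the values it yields.
        System.out.println(it.toString());

        // Joining actually walks the elements and builds one String from them.
        System.out.println(join(items, " : "));
        System.out.println(join(items, ""));
    }
}
```

The first line printed is an implementation-specific description of the iterator object; the other two are built from the elements themselves.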

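They're different methods in different classes. In this case, mkString is a method in the trait GenTraversableOnce (in the collections library current when this was written; the hierarchy was reworked in Scala 2.13). toString is defined on Any (and is very often overridden).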
The easiest way (or at least the way I usually use) to find this out is to use the documentation at http://www.scala-lang.org/api/current/index.html. Start with the type of your variable:
val data = io.Source.fromFile("file.txt")
is of type
scala.io.BufferedSource
Go to the doc for BufferedSource, and look for mkString. In the doc for mkString (hit the down arrow over to the left) you'll see that it comes from
Definition Classes TraversableOnce → GenTraversableOnce
And do the same thing with toString.

I think the problem is understanding what the Source class does. From your code it seems you expect Source.fromFile to retrieve the content of the file, when what it really does is point to the start of the file.
This is typical of I/O operations: you open a "connection" to a resource (in this case your filesystem), read or write several times, and then close that "connection". In your example you open a connection to a file and then have to read its contents line by line until you reach the end. Keep in mind that everything you read is loaded into memory, so in most scenarios it's not a good idea to load the whole file at once (which is what mkString is going to do).
mkString, on the other hand, is made to iterate over all the elements of a collection, so in this case it reads the whole file and builds one String in memory. Be careful, because if the file is big your code may run out of memory. Normally when working with I/O you should use a buffer: read some content, process or save it, then load more content into the same buffer. For example: read 5 lines --> parse --> save parsed lines --> read next 5 lines --> etc.
As for toString, it retrieves nothing from the file; it just tells you "you can read lines, the file is not empty".
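The "read a bit, process it, read more" idea can be sketched in Java as follows. This is only an illustration under assumptions: the class LineByLine, its two helpers and the sample lines are invented for the example, and counting non-empty lines stands in for whatever processing you actually need.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class LineByLine {
    // Counts non-empty lines while keeping only the current line in memory.
    static long countNonEmptyLines(Path file) {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            long count = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    count++;
                }
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes a small throwaway file so the example is self-contained.
    static Path writeSample(List<String> lines) {
        try {
            Path file = Files.createTempFile("sample", ".txt");
            Files.write(file, lines, StandardCharsets.UTF_8);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = writeSample(List.of("foo", "bar", "", "baz"));
        System.out.println(countNonEmptyLines(file));
    }
}
```

The try-with-resources block also closes the "connection" for you once the end of the file is reached.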

Related

Most efficient way to check string array and then write it into file

I need to check whether a List of Strings contains certain predefined strings and, if all of the predefined strings are in the list, write the list to a file.
As a first approach I thought of doing something like
if(doesTheListContainPredefinedStrings(list))
writeListIntoFile(list);
where doesTheListContainPredefinedStrings and writeListIntoFile each loop over the list, to check for the predefined strings and to write every element to a file, respectively.
But since I have to worry about performance in this case, I wanted to leverage the fact that doesTheListContainPredefinedStrings already visits every element of the list once.
I also thought about something like
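// needs java.io.FileWriter, java.io.PrintWriter and java.util.Arrays
String[] predefinedStrings = {...};
...
PrintWriter pw = new PrintWriter(new FileWriter("fileName"));
int predefinedStringsFound = 0;
for (String string : list)
{
    // a plain array has no contains(), so go through a List view of it
    if (Arrays.asList(predefinedStrings).contains(string))
        predefinedStringsFound++;
    pw.println(string);
}
if (predefinedStringsFound == predefinedStrings.length)
    pw.close();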
I close the stream only in that case because I observed that, at least on the system where I'm developing (Ubuntu 19.04), the strings aren't written to the file if I don't close the stream.
Nevertheless, this solution seems really bad, and the file would still be created, so if the list didn't pass the check I'd have to delete it anyway (which requires another access to the storage).
Could someone suggest a better approach and explain why it is better?

Check the reverse case: is any string from predefs missing from the list of strings to check?
Collection<String> predefs; // Your certain predefined strings
List<String> list; // Your list of strings to check
if( ! predefs.parallelStream().anyMatch( s -> ! list.contains( s ) ) )
writeListIntoFile(list);
The anyMatch call stops as soon as the first string from predefs can't be found in the list and returns true; in that case you must not write the file, which is why the result is negated.
It does not check whether the list contains additional strings that are not in predefs.
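A variant of the same check-first idea, sketched with a HashSet so that each predefined string costs one hash lookup instead of a scan of the list. The names CheckThenWrite, containsAllPredefs and writeIfComplete and the sample data are assumptions made for this example, not something from the question.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CheckThenWrite {
    // True when every predefined string occurs somewhere in the list.
    static boolean containsAllPredefs(List<String> list, Collection<String> predefs) {
        Set<String> seen = new HashSet<>(list); // one pass over the list
        return seen.containsAll(predefs);       // one hash lookup per predefined string
    }

    // Writes the list only if the check passes, so no file is created otherwise.
    static boolean writeIfComplete(List<String> list, Collection<String> predefs, Path target) {
        if (!containsAllPredefs(list, predefs)) {
            return false;
        }
        try {
            Files.write(target, list, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> list = List.of("alpha", "beta", "gamma");
        Path target = Paths.get(System.getProperty("java.io.tmpdir"), "checked-list.txt");
        System.out.println(writeIfComplete(list, List.of("alpha", "gamma"), target));
        System.out.println(writeIfComplete(list, List.of("alpha", "delta"), target));
    }
}
```

Because the check finishes before any writer is opened, a list that fails the check never touches the storage, so there is nothing to delete afterwards.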

Expected equal JSON strings are not equal when compared using assertEquals

I wrote a JUnit test for a web service client that submits a JSON document to the service.
I saved the "correct" JSON document to a file, and after the test runs I compare it with the actual result.
They do not match, although the lines look identical:
org.junit.ComparisonFailure:
Expected :{"Callback":null,"Data":
{"MarketCode":"ISEM",,............"Price":2.99}]}]}]}]}}
Actual :{"Callback":null,"Data":
{"MarketCode":"ISEM",,............"Price":2.99}]}]}]}]}}
The lines are very long, about 4K characters, so I cut most of it here, but their lengths are identical. I compared the string lengths in the debugger, and I also trim both strings before comparing, to remove any invisible or whitespace characters at the end that a text editor might insert implicitly.
Also, the test passes when run in isolation, but it fails when I run it as part of a bigger suite.
There are no global/static variables, so tests overwriting each other's state should not be an issue.
I'm mocking the web service client to capture the request string, like this:
StringBuilder pd = new StringBuilder();
doAnswer((invocation) -> {
String postDocument = ((String)invocation.getArguments()[0]).trim();
pd.append(postDocument);
return null;
}).when(client).doPost(anyString(), anyObject());
client is a mocked class.
Then I compare trimmed versions of the strings, but it doesn't help:
String expectedSubmit = TestUtils.readXmlFromFile("strategyexecution\\ireland_bm_strategy_override_expected.json").trim();
assertEquals(expectedSubmit, pd.toString().trim());

I found the answer myself :-)
The problem is with the JSON specification itself.
JSON does not guarantee the order of the members of an object: an object is basically an unordered set of name/value pairs (only arrays are ordered).
So the content can come out in a different order from run to run, and two produced JSON documents should not be compared as strings.
I deserialized both into Java objects, and the object comparison works!
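To illustrate why comparing objects succeeds where comparing strings fails, here is a small sketch that uses plain Java maps in place of real deserialized JSON. The class OrderInsensitiveCompare and its two helpers are hypothetical; only the MarketCode and Price values are taken from the output shown in the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class OrderInsensitiveCompare {
    // Same members, inserted in one order...
    static Map<String, Object> marketFirst() {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("MarketCode", "ISEM");
        m.put("Price", 2.99);
        return m;
    }

    // ...and in the opposite order.
    static Map<String, Object> priceFirst() {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("Price", 2.99);
        m.put("MarketCode", "ISEM");
        return m;
    }

    public static void main(String[] args) {
        // The textual forms differ because the members appear in a different order.
        System.out.println(marketFirst().toString().equals(priceFirst().toString()));
        // Map equality ignores order and compares the entries themselves.
        System.out.println(marketFirst().equals(priceFirst()));
    }
}
```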

Same old issue as we had with XML. For XML there is XMLUnit, which compares XML documents semantically. For JSON I'd try a similar tool, like JsonUnit. JSONAssert looks promising too.

A way to strip returned values from java.io.File.listFiles in Clojure

I call a java function in Clojure to get a list of files.
(require '[clojure.java.io :as io])
(str (.listFiles (io/file "/home/loluser/loldir")))
And I get a whole bunch of strings like these
#<File /home/loluser/loldir/lolfile1>
etc. How do I get rid of the brackets and put them in some form of an array so another function can access it?

Those strings are just the print format for a Java File object.
See the File javadoc for which operations are available.
If you want the file paths as strings, it would be something like
(map #(.getPath %)
(.listFiles (io/file "/home/loluser/loldir")))
Or you could just use list, which returns strings in the first place:
(.list (io/file "/home/loluser/loldir"))
If you want to read the file, you might as well keep it as a File object to pass into the core slurp or other clojure.java.io or clojure.contrib.duck-streams functions.

Why can I only convert a Representation to a string once in RESTlet?

So,
I'm trying to convert a Representation to a String or a StringWriter either using the getText() or write() method. It seems I can only call this method once successfully on a Representation... If I call the method again, it returns null or empty string on the second call. Why is this? I'd expect it to return the same thing every time:
public void SomeMethod(Representation rep)
{
String repAsString = rep.getText(); // returns valid text for example: <someXml>Hello WOrld</someXml>
String repAsString2 = rep.getText(); // returns null... wtf?
}
If I'm "doing it wrong" then I'd be open to any suggestions as to how I can get to that data.

The javadocs explain this:
The content of a representation can be retrieved several times if there is a stable and accessible source, like a local file or a string. When the representation is obtained via a temporary source like a network socket, its content can only be retrieved once.
So presumably it's being read directly from the network or something similar.
You can check this by calling isTransient(). If you need to be able to read it multiple times, presumably you should convert it to a string and then create a new Representation from that string.

It's because in general the Representation doesn't actually get read in from the InputStream until you ask for it with getText(), and once you've asked for it, all the bytes have been read and converted into the String.
This is the natural implementation for efficiency: rather than creating a potentially very large String and then converting that String into something useful (a JSON object, a DOM tree, or whatever), you write your converter to operate on the InputStream instead, avoiding the costs of making and reading that huge String.
So for example if you have a large XML file being PUT into a web service, you can feed the InputStream right into a SAX parser.
(As @John notes, a StringRepresentation wraps a String, and so can be read multiple times. But you must be reading a Request's representation, which is most likely an InputRepresentation.)
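The same read-once behaviour can be reproduced with a plain java.io stream, without any Restlet classes. This is only a sketch of the underlying mechanism: the class ReadOnce and the helper drain are invented for the example, and InputStream.readAllBytes needs Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ReadOnce {
    // Reads whatever is left in the stream and decodes it as UTF-8.
    static String drain(InputStream in) {
        try {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(
                "<someXml>Hello World</someXml>".getBytes(StandardCharsets.UTF_8));

        String first = drain(in);  // consumes every byte
        String second = drain(in); // nothing left to read, so this is empty

        System.out.println(first);
        System.out.println(second.isEmpty());
    }
}
```

If the text is needed more than once, keep the String from the first read and reuse that instead of going back to the stream.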

Why does appending "" to a String save memory?

I used a variable with a lot of data in it, say String data.
I wanted to use a small part of this string in the following way:
this.smallpart = data.substring(12,18);
After some hours of debugging (with a memory visualizer) I found out that the object's field smallpart retained all the data from data, although it only contained the substring.
When I changed the code into:
this.smallpart = data.substring(12,18)+"";
..the problem was solved! Now my application uses very little memory!
How is that possible? Can anyone explain this? I think this.smallpart kept referencing towards data, but why?
UPDATE:
How can I clear the big String then? Will data = new String(data.substring(0,100)) do the thing?

Doing the following:
data.substring(x, y) + ""
creates a new (smaller) String object, and throws away the reference to the String created by substring(), thus enabling garbage collection of the original's character array.
The important thing to realise is that substring() gives a window onto an existing String - or rather, the character array underlying the original String. Hence it keeps that whole array alive and so consumes the same memory as the original String. This can be advantageous in some circumstances, but problematic if you want to get a substring and dispose of the original String (as you've found out).
Take a look at the substring() method in the JDK String source for more info.
EDIT: To answer your supplementary question, constructing a new String from the substring will reduce your memory consumption, provided you bin any references to the original String.
NOTE (Jan 2013). The above behaviour has changed in Java 7u6. The flyweight pattern is no longer used and substring() will work as you would expect.

If you look at the source of substring(int, int), you'll see that it returns:
new String(offset + beginIndex, endIndex - beginIndex, value);
where value is the original char[]. So you get a new String but with the same underlying char[].
When you do, data.substring() + "", you get a new String with a new underlying char[].
Actually, your use case is the only situation where you should use the String(String) constructor:
String tiny = new String(huge.substring(12,18));

When you use substring, it doesn't actually create a new string. It still refers to your original string, with an offset and size constraint.
So, to allow your original string to be collected, you need to create a new string (using new String, or what you've got).

I think this.smallpart kept referencing towards data, but why?
Because Java strings consist of a char array, a start offset and a length (and a cached hashCode). Some String operations like substring() create a new String object that shares the original's char array and simply has different offset and/or length fields. This works because the char array of a String is never modified once it has been created.
This can save memory when many substrings refer to the same basic string without replicating overlapping parts. As you have noticed, in some situations, it can keep data that's not needed anymore from being garbage collected.
The "correct" way to fix this is the new String(String) constructor, i.e.
this.smallpart = new String(data.substring(12,18));
BTW, the overall best solution would be to avoid having very large Strings in the first place, and to process any input in smaller chunks, a few KB at a time.
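A minimal sketch of that fix, under the assumption that you are on a JDK older than 7u6 where substring() still shares the parent's char[]. The class SubstringCopy and the method detach are made-up names for illustration.

```java
class SubstringCopy {
    // Returns a String with its own backing array instead of a window onto the
    // (possibly huge) source string. On JDKs from 7u6 onwards substring()
    // already copies, so the extra new String(...) is redundant but harmless.
    static String detach(String source, int beginIndex, int endIndex) {
        return new String(source.substring(beginIndex, endIndex));
    }

    public static void main(String[] args) {
        String data = "0123456789abcdefghijklmnop";
        String smallPart = detach(data, 12, 18);
        System.out.println(smallPart);
    }
}
```

Once the only remaining reference is the detached copy, the large source string becomes eligible for garbage collection.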

In Java, strings are immutable objects, and once a string is created it remains in memory until it is cleaned up by the garbage collector (and this cleaning is not something you can take for granted).
When you call the substring method, Java does not create a truly new string, but just stores a range of characters inside the original string.
So, when you created a new string with this code:
this.smallpart = data.substring(12, 18) + "";
you actually created a new string when you concatenated the result with the empty string.
That's why.

As documented by jwz in 1997:
If you have a huge string, pull out a substring() of it, hold on to the substring and allow the longer string to become garbage (in other words, the substring has a longer lifetime) the underlying bytes of the huge string never go away.

Just to sum up, if you create lots of substrings from a small number of big strings, then use
String substring = string.substring(5,23);
since you then only use the space needed to store the big strings. But if you are extracting just a handful of small strings from lots of big strings, then
String substring = new String(string.substring(5,23));
Will keep your memory use down, since the big strings can be reclaimed when no longer needed.
That you call new String is a helpful reminder that you really are getting a new string, rather than a reference to the original one.

Firstly, calling java.lang.String.substring creates a new window onto the original String, using an offset and a length, instead of copying the relevant part of the underlying array.
If we take a closer look at the substring method, we will notice a call to the String(int, int, char[]) constructor that passes it the whole char[] representing the original string. That means the substring keeps as much memory alive as the original string.
OK, but why does + "" result in less memory being used?
Doing a + on strings is implemented via StringBuilder.append calls. Looking at the implementation of this method in the AbstractStringBuilder class tells us that it eventually does an arraycopy of just the part we really need (the substring).
Any other workarounds?
this.smallpart = new String(data.substring(12,18));
this.smallpart = data.substring(12,18).intern();

Appending "" to a string will sometimes save memory.
Let's say I have a huge string containing a whole book, one million characters.
Then I create 20 strings containing the chapters of the book as substrings.
Then I create 1000 strings containing all paragraphs.
Then I create 10,000 strings containing all sentences.
Then I create 100,000 strings containing all the words.
I still only use 1,000,000 characters. If you add "" to each chapter, paragraph, sentence and word, you use 5,000,000 characters.
Of course it's entirely different if you only extract one single word from the whole book, and the whole book could be garbage collected but isn't because that one word holds a reference to it.
And it's different again if you have a one-million-character string and remove tabs and spaces at both ends, making say 10 calls to create a substring. The way Java works (or worked) avoids copying a million characters each time. There is a trade-off, and it's good to know what the trade-offs are.
