In Java, I'm trying to log into an FTP server and find all the files newer than x for retrieval.
Currently I have an input stream that reads the directory contents and prints them out line by line, which works well enough, but the output is fairly vague. It looks like this:
-rw------- 1 vuser 4773 Jun 10 2008 .bash_history
-rw-r--r-- 1 vuser 1012 Dec 9 2007 .bashrc
lrwxrwxrwx 1 root 7 Dec 9 2007 .profile -> .bashrc
drwx------ 2 vuser 4096 Jan 30 01:08 .spamassassin
drwxr-xr-x 2 vuser 4096 Dec 9 2007 backup.upgrade_3_7
dr-xr-xr-x 2 root 4096 Dec 10 2007 bin
etc...
However, I need the actual timestamp in seconds or milliseconds so that I can determine whether I want that file or not.
I assume this has something to do with the FTP server's configuration, but I'm kind of stumped.
It might be worth having a look at the Apache Commons Net API (formerly Jakarta Commons Net), which has FTP functionality.
With it you can call listFiles(), which gives you FTPFile objects that you can call getTimestamp() on; you should then be able to pick out just the ones you need.
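For example, a rough sketch with Commons Net's FTPClient (the host, credentials and cutoff time below are placeholders, not from the original setup):

import java.util.Calendar;
import org.apache.commons.net.ftp.FTPClient;
import org.apache.commons.net.ftp.FTPFile;

public class NewerFiles {
    public static void main(String[] args) throws Exception {
        // Only keep files modified in the last 24 hours (adjust to your "x")
        long cutoffMillis = System.currentTimeMillis() - 24L * 60 * 60 * 1000;
        FTPClient ftp = new FTPClient();
        ftp.connect("ftp.example.com");      // placeholder host
        ftp.login("user", "password");       // placeholder credentials
        ftp.enterLocalPassiveMode();
        try {
            for (FTPFile file : ftp.listFiles()) {
                Calendar ts = file.getTimestamp(); // parsed from the server's listing
                if (file.isFile() && ts != null && ts.getTimeInMillis() > cutoffMillis) {
                    System.out.println(file.getName() + " -> " + ts.getTime());
                    // retrieve with ftp.retrieveFile(file.getName(), someOutputStream)
                }
            }
        } finally {
            ftp.logout();
            ftp.disconnect();
        }
    }
}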
Unfortunately, the output of an FTP server's directory listing is not mandated by the standard. In fact, it is explicitly described as not intended for machine parsing. The format can vary quite a bit between servers and operating systems.
RFC 3659 describes some extensions to the FTP standard including the "MDTM" command for querying modification times. If you are lucky, the servers that you want to talk to have implemented this extension.
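If the server does support it, Commons Net exposes MDTM through FTPClient.getModificationTime(); a small sketch, assuming an already connected FTPClient named ftp:

// Query MDTM for one file (requires RFC 3659 support on the server).
// The reply carries a UTC timestamp in yyyyMMddHHmmss[.sss] form, which you
// can parse and compare against your cutoff; expect null or a failure if the
// command is not implemented.
String mdtm = ftp.getModificationTime(".bashrc");
System.out.println("MDTM reply: " + mdtm);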
getTimestamp() gets you the last modification time of the file. If you're trying to get the time when the file was uploaded to the server, that won't work. I'm trying to do something similar: I want to get all the files that were uploaded after X time, and I couldn't find a clean way of doing it without actually doing an ls and parsing through that.
I am exporting the TZ variable in POSIX format to set the timezone on Linux. For instance:
export TZ="EST+5EDT,M3.2.0/02:00,M11.1.0/02:00"
Linux date command returns:
Wed Mar 14 03:47 EDT 2018
Java ZonedDateTime.now() returns:
2018-03-14T02:47:36.808[GMT-05:00]
It seems that Java does not take the DST rule into account. What can be wrong?
I'm not sure which Linux version you're using, but I've tested on Red Hat 4.4 and it accepts IANA names:
export TZ=America/New_York
date
And the output is:
Qua Mar 14 08:37:25 EDT 2018
I've also checked some articles on the web, and all the examples use names like "America/New_York", "Europe/London" and so on.
But anyway, if your Linux version doesn't work with this, it's better to change your code so it doesn't rely on the JVM's default timezone:
ZonedDateTime.now(ZoneId.of("America/New_York"));
Actually, I think it's better to use a specific timezone anyway, because the default can be changed at runtime by any application running in the same JVM. Even if you have control over what the applications do, some infrastructure/environment maintenance can change it, either on purpose or by accident - it happened to me once, which made me start using explicit timezone names everywhere.
And always prefer IANA names, in the Continent/Region format. Names like EST+5EDT are fixed, in the sense that they represent just an offset (in this case, GMT-05:00), without any Daylight Saving rules. Only names like America/New_York contain DST rules.
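A minimal sketch of the difference (the zone name is just an example):

import java.time.ZoneId;
import java.time.ZonedDateTime;

public class ZoneDemo {
    public static void main(String[] args) {
        // Whatever the JVM resolved as the default zone (possibly a fixed offset)
        System.out.println(ZonedDateTime.now());
        // Explicit IANA zone: the region's DST rules are applied
        System.out.println(ZonedDateTime.now(ZoneId.of("America/New_York")));
    }
}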
I have a Java jar application running in the background on my Linux servers (Debian 7). The servers generally receive files through Apache, then the jar polls the database regularly and uploads the files to their final destination using a pool of HTTP connections from an HttpClient (v2.5).
The servers have 16 GB of RAM, 8 CPU cores, a 1 TB disk and 2 Gb/s of internet bandwidth. My problem is that the uploads made by the jar only use 10% or 20% of that bandwidth. After much investigation I think the capacity of the remote server might be the bottleneck.
So I wanted to launch more threads on my servers to process more files at the same time and use all the bandwidth I have; unfortunately, file upload with HttpClient seems to eat a lot of CPU!
Currently the jars run 20 simultaneous upload Runnable threads and the CPU is constantly at 100%. If I try to start more threads the load average climbs and breaks records, making the system slow and unusable.
The strange thing is that iowait seems to be zero, so I really don't know what is causing the load average.
I have run hprof using only one thread; here is the result:
CPU SAMPLES BEGIN (total = 4617) Thu Jul 28 17:42:35 2016
rank self accum count trace method
1 52.76% 52.76% 2436 301157 java.net.SocketOutputStream.socketWrite0
2 33.53% 86.29% 1548 300806 java.net.SocketInputStream.socketRead0
3 1.62% 87.91% 75 301138 org.sqlite.core.NativeDB.step
4 1.47% 89.39% 68 301158 java.io.FileInputStream.readBytes
5 1.06% 90.45% 49 300078 java.lang.ClassLoader.defineClass1
6 0.26% 90.71% 12 300781 com.mysql.jdbc.SingleByteCharsetConverter.<clinit>
7 0.19% 90.90% 9 300386 java.lang.Throwable.fillInStackTrace
8 0.19% 91.10% 9 300653 java.lang.ClassLoader.loadClass
9 0.19% 91.29% 9 300780 com.mysql.jdbc.SingleByteCharsetConverter.<clinit>
10 0.17% 91.47% 8 300387 java.net.URLClassLoader.findClass
11 0.17% 91.64% 8 300389 java.util.zip.Inflater.inflateBytes
12 0.15% 91.79% 7 300090 java.lang.ClassLoader.findBootstrapClass
13 0.15% 91.94% 7 300390 java.util.zip.ZipFile.getEntry
14 0.13% 92.07% 6 300805 java.net.PlainSocketImpl.socketConnect
The files are sent with a standard HttpClient POST execute request, with an overridden writeTo() method on the FileBody class that uses a BufferedInputStream with an 8 KB buffer.
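Roughly like this (a sketch of the kind of override described, not the actual code):

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.http.entity.mime.content.FileBody;

// Streams the file into the request body through an 8 KB BufferedInputStream.
class BufferedFileBody extends FileBody {
    BufferedFileBody(File file) {
        super(file);
    }

    @Override
    public void writeTo(OutputStream out) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream(getFile()), 8 * 1024)) {
            byte[] buffer = new byte[8 * 1024];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
        out.flush();
    }
}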
Do you think it is possible to reduce the performance impact of the file uploads and solve my problem of unused bandwidth?
Thanks in advance for your help.
You should try https://hc.apache.org/
You could change your HttpClient library to a library based on java.nio:
https://docs.oracle.com/javase/7/docs/api/java/nio/package-summary.html
HttpCore is a set of low level HTTP transport components that can be used to build custom client and server side HTTP services with a minimal footprint. HttpCore supports two I/O models: blocking I/O model based on the classic Java I/O and non-blocking, event driven I/O model based on Java NIO.
Please take a look at https://hc.apache.org/httpcomponents-asyncclient-dev/index.html
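For instance, a minimal sketch with the async client (the URL, file and content type are placeholders):

import java.io.File;
import java.util.concurrent.Future;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.FileEntity;
import org.apache.http.impl.nio.client.CloseableHttpAsyncClient;
import org.apache.http.impl.nio.client.HttpAsyncClients;

public class AsyncUpload {
    public static void main(String[] args) throws Exception {
        try (CloseableHttpAsyncClient client = HttpAsyncClients.createDefault()) {
            client.start();
            HttpPost post = new HttpPost("http://example.com/upload"); // placeholder URL
            post.setEntity(new FileEntity(new File("/tmp/example.dat"),
                    ContentType.APPLICATION_OCTET_STREAM));
            // The callback is omitted here; blocking on get() keeps the sketch short
            Future<HttpResponse> future = client.execute(post, null);
            System.out.println(future.get().getStatusLine());
        }
    }
}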
Hope this is the right idea.
In fact, it appears that the problem was not coming from the Java process but from Apache consuming all the I/O of the hard drive, ruining the performance.
Thanks for your help anyway.
External tools are giving me trouble. Is there a way to get simple CPU usage/time spent per function without using some external GUI tool?
I've been trying to profile my Java program with VisualVM, but I'm having terrible, soul-crushing, ambition-killing results. It will only display heap usage; what I'm interested in is CPU usage, but that panel simply says "Not supported for this JVM." It doesn't tell me which JVM to use, by the way. I've downloaded JDK 6 and launched it using that, and I made sure my program targets the same VM, but nothing! Still the same, unhelpful error message.
My needs are pretty simple. I just want to find out where the program is spending its time. Python has an excellent built-in profiler that prints out where time was spent in each function, in both per-call and total-time formats. That's really the extent of what I'm looking for right now. Anyone have any suggestions?
It's not pretty, but you could use the built-in HPROF profiling mechanism by adding a switch to the command line.
-Xrunhprof:cpu=times
There are many options available; see the Oracle documentation page for HPROF for more information.
So, for example, if you had an executable jar you wanted to profile, you could type:
java -Xrunhprof:cpu=times -jar Hello.jar
When the run completes, you'll have a (large) text file called "java.hprof.txt".
That file will contain a pile of interesting data, but the part you're looking for is the part which starts:
CPU TIME (ms) BEGIN (total = 500) Wed Feb 27 16:03:18 2013
rank self accum count trace method
1 8.00% 8.00% 2000 301837 sun.nio.cs.UTF_8$Encoder.encodeArrayLoop
2 5.40% 13.40% 2000 301863 sun.nio.cs.StreamEncoder.writeBytes
3 4.20% 17.60% 2000 301844 sun.nio.cs.StreamEncoder.implWrite
4 3.40% 21.00% 2000 301836 sun.nio.cs.UTF_8.updatePositions
Alternatively, if you've not already done so, I would try installing the VisualVM-Extensions, VisualGC, Threads Inspector, and at least the Swing, JVM, Monitor, and Jvmstat Tracer Probes.
Go to Tools->Plugins to install them. If you need more details, comment, and I'll extend this answer further.
I'm a newbie to Hadoop.
Recently I implemented the WordCount example.
But when I run this program on my single node with 2 input files containing just 9 words, it takes nearly 33 seconds, which seems crazy and has me thoroughly confused.
Can anyone tell me whether this is normal?
How can I fix this problem? Remember, I only created 2 input files with 9 words in them.
Submit Host Address: 127.0.0.1
Job-ACLs: All users are allowed
Job Setup: Successful
Status: Succeeded
Started at: Fri Aug 05 14:27:22 CST 2011
Finished at: Fri Aug 05 14:27:53 CST 2011
Finished in: 30sec
Hadoop is not efficient for very small jobs, as most of the time goes into JVM startup, process initialization and other overhead. It can, however, be optimized to some extent by enabling JVM reuse.
http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Task+JVM+Reuse
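For example, with the old mapred API you can enable it when configuring the job (a sketch; the property name applies to Hadoop 0.20/1.x):

import org.apache.hadoop.mapred.JobConf;

public class JvmReuse {
    public static JobConf configure(Class<?> jobClass) {
        JobConf conf = new JobConf(jobClass);
        // -1 lets an unlimited number of tasks for this job share one JVM,
        // instead of starting a fresh JVM per task
        conf.set("mapred.job.reuse.jvm.num.tasks", "-1");
        return conf;
    }
}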
Also, there is some work going on in Apache Hadoop to address this:
https://issues.apache.org/jira/browse/MAPREDUCE-1220
Not sure in which release this will be included or what the state of the JIRA is.
This is not unusual. Hadoop comes into its own with large datasets. What you are seeing is probably the initial startup overhead of Hadoop.
First, please don't skip this because you think it's a common question; it is not. I know how to find the size of a file and a directory using File.length() and Apache FileUtils.sizeOfDirectory().
My problem is that in my case the files and directories are very big (hundreds of MB). When I try to find the size using the code above (i.e. by creating File objects), my program becomes very resource-hungry and performance suffers.
Is there any way to know the size of a file without creating an object?
I am using:
for files: File file1 = new File(fileName); long size = file1.length();
and for directories: File dir1 = new File(dirPath); long size = FileUtils.sizeOfDirectory(dir1);
I have one parameter which enables size computing. If the parameter is false then everything runs smoothly; if it is true, the program lags or hangs. I am calculating the size of 4 directories and 2 database files.
File objects are very lightweight. Either there is something wrong with your code, or the problem is not with the File objects but with the disk access necessary for getting the file sizes. If you do that for a large number of files (say, tens of thousands), then the hard disk will do a lot of seeks, which is pretty much the slowest operation possible on a modern PC (by several orders of magnitude).
A File is just a wrapper for the file path. It doesn't matter how big the file is, only its file name.
When you want to get the size of all the files in a directory, the OS needs to read the directory and then look up each file to get its size. Each access takes about 10 ms (because that's a typical seek time for a hard drive), so if you have 100,000 files it will take you about 17 minutes to get all their sizes.
The only way to speed this up is to get a faster drive. For example, solid state drives have an average seek time of 0.1 ms, but it would still take 10 seconds or more to get the sizes of 100K files.
BTW: The size of each file doesn't matter, because the OS doesn't actually read the file, only the file entry, which holds its size.
EDIT: For example, if I try to get the size of a large directory, it is slow at first but much faster once the data is cached.
$ time du -s /usr
2911000 /usr
real 0m33.532s
user 0m0.880s
sys 0m5.190s
$ time du -s /usr
2911000 /usr
real 0m1.181s
user 0m0.300s
sys 0m0.840s
$ find /usr | wc -l
259934
The reason the lookup is so fast the first time is that the files were all installed at once and most of the information is laid out contiguously on disk. Once the information is in memory, it takes next to no time to read the file information.
Timing FileUtils.sizeOfDirectory("/usr") takes under 8.7 seconds. This is relatively slow compared with the time du takes, but it is still processing around 30K files per second.
An alternative might be to run Runtime.exec("du -s " + directory); however, this will only make a few seconds' difference at most. Most of the time is likely to be spent waiting for the disk if it's not in cache.
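Something along these lines (a sketch that assumes a Unix-like system with du on the PATH):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class DuSize {
    // Returns the size reported by du -s (in blocks, 1 KB each on most Linux systems).
    static long sizeOf(String directory) throws IOException, InterruptedException {
        Process p = Runtime.getRuntime().exec(new String[] {"du", "-s", directory});
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line = r.readLine();              // e.g. "2911000\t/usr"
            p.waitFor();
            return Long.parseLong(line.split("\\s+")[0]);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sizeOf("/usr"));
    }
}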
We had a similar performance problem with File.listFiles() on directories with large number of files.
Our setup was one folder with 10 subfolders each with 10,000 files.
The folder was on a network share and not on the machine running the test.
We were using a FileFilter to only accept files with known extensions or directories, so we could recurse down the directories.
Profiling revealed that about 70% of the time was spent calling File.isDirectory (which I assume Apache is calling). There were two calls to isDirectory for each file (one in the filter and one in the file processing stage).
File.isDirectory was slow because it had to hit the network share for each file.
Reversing the order of the check in the filter to check for valid name before valid directory saved a lot of time, but we still needed to call isDirectory for the recursive lookup.
My solution was to implement a version of listFiles in native code, that would return a data structure that contained all the metadata about a file instead of just the filename like File does.
This got rid of the performance problem, but added the maintenance problem of having native code maintained by Java developers (luckily we only supported one OS).
I think that you need to read the metadata of the file.
Read this tutorial for more information. This might be the solution you are looking for:
http://download.oracle.com/javase/tutorial/essential/io/fileAttr.html
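For instance, something like this with NIO.2 (Java 7+; the path is just a placeholder):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;

public class FileMeta {
    public static void main(String[] args) throws IOException {
        Path path = Paths.get("/tmp/example.dat"); // placeholder path
        // A single attribute read returns size, timestamps, etc. without reading the file contents
        BasicFileAttributes attrs = Files.readAttributes(path, BasicFileAttributes.class);
        System.out.println("size: " + attrs.size() + " bytes");
        System.out.println("last modified: " + attrs.lastModifiedTime());
    }
}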
Answering my own question.
This is not the best solution, but it works in my case.
I have created a batch script to get the size of the directory and then read it in the Java program. It gives me a shorter execution time when the number of files in the directory is more than 1L (which is always the case for me): sizeOfDirectory takes around 30255 ms, and with the batch script I get 1700 ms. For a smaller number of files the batch script is more costly.
I'll add to what Peter Lawrey answered: when a directory has a lot of files inside it (directly, not in sub-directories), the time file.listFiles() takes is extremely slow (I don't have exact numbers, I know it from experience). The number of files has to be large, several thousand if I remember correctly. If this is your case, what FileUtils will do is actually try to load all of their names into memory at once, which can be quite consuming.
If that is your situation, I would suggest restructuring the directory to have some sort of hierarchy that ensures a small number of files in each sub-directory.