Wikipedia deletion log download - java

I need the Wikipedia deletion log for my project. I was able to find the deletion logs here:
http://en.wikipedia.org/w/index.php?title=Special:Log&type=delete&user=&page=&year=&month=-1&tagfilter=&hide_review_log=1
I can download 5000 entries at a time, but that will take a lot of time because of the large number of pages. Is there a dump available?
Thank you
Bala

Why not ask at Wikipedia? There are various dumps available, including tools on the toolserver that may be of use. Your best bet is asking at the technical pump.

Related

Processing large number of text files in java

I am working on an application which has to read and process ~29K files (~500GB) every day. The files will be in zipped format and available on an FTP server.
What I have done: I plan to download the files from the FTP server, unzip them and process them using multi-threading, which has reduced the processing time significantly (when the number of active threads is fixed to a smaller number). I've written some code and tested it for ~3.5K files (~32GB). Details here: https://stackoverflow.com/a/32247100/3737258
However, the estimated processing time, for ~29K files, still seems to be very high.
What I am looking for: Any suggestion/solution which could help me bring the processing time of ~29K files, ~500GB, to 3-4 hours.
Please note that each file has to be read line by line, and each line has to be written to a new file with some modification (some information removed and some new information added).
You should profile your application and see where the current bottleneck is, and fix that. Proceed until you are at your desired speed or cannot optimize further.
For example:
Maybe you unzip to disk. This is slow; try to do it in memory.
Maybe there is a load of garbage collection. See if you can reuse objects.
Maybe the network is the bottleneck, etc.
You can, for example, use visualvm.
It's hard to provide you one solution for your issue, since it might be that you simply reached the hardware limit.
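As a rough illustration of the "unzip in memory" idea above, here is a minimal sketch. It assumes the archives are in zip format and UTF-8 encoded (the question only says "zipped"); the class name ZipLineRewriter and the transform parameter are made up for this example. Each entry is decompressed as a stream and rewritten line by line, so nothing is unpacked to disk first.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.function.UnaryOperator;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not taken from the linked answer.
final class ZipLineRewriter {

    /**
     * Reads every file entry of a zip stream line by line, applies the
     * transform to each line and writes the result to out.
     * Returns the number of lines written. The caller owns both streams.
     */
    static long rewrite(InputStream zipped, OutputStream out,
                        UnaryOperator<String> transform) throws IOException {
        long lines = 0;
        ZipInputStream zin = new ZipInputStream(zipped);
        BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(out, StandardCharsets.UTF_8));
        ZipEntry entry;
        while ((entry = zin.getNextEntry()) != null) {
            if (entry.isDirectory()) {
                continue;
            }
            // A fresh reader per entry; it is deliberately not closed,
            // because closing it would close the underlying zip stream.
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(zin, StandardCharsets.UTF_8));
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(transform.apply(line));
                writer.write('\n');
                lines++;
            }
        }
        writer.flush();
        return lines;
    }
}
```

The input stream could come straight from the FTP client, so download, decompression and rewriting overlap instead of running as separate passes. Whether this actually helps depends on where the profiler says the time goes.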
Some Ideas:
You can parallelize the processing of the information you read. For example, hand batches of read lines to a thread from a pool, which processes each batch sequentially.
Use java.nio instead of java.io; see: Java NIO FileChannel versus FileOutputstream performance / usefulness
Use a profiler
Instead of a profiler, simply write log messages and measure the duration of multiple parts of your application
Optimize the hardware (use SSD drives, experiment with block size, filesystem, etc.)
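The first idea in the list (batches of lines handed to pooled threads) could look roughly like the sketch below. BatchedLineProcessor, batchSize and threads are names invented for this example, and for brevity it keeps all lines in memory, which the real 500GB job could not do; there you would feed batches from a reader and write results out as they complete.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.UnaryOperator;

// Hypothetical helper illustrating batch-per-task processing.
final class BatchedLineProcessor {

    /** Transforms lines in batches on a fixed pool, preserving input order. */
    static List<String> process(List<String> lines, int batchSize, int threads,
                                UnaryOperator<String> transform)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> pending = new ArrayList<>();
            for (int start = 0; start < lines.size(); start += batchSize) {
                int end = Math.min(start + batchSize, lines.size());
                List<String> batch = lines.subList(start, end);
                // Each task processes its own batch sequentially.
                pending.add(pool.submit(() -> {
                    List<String> out = new ArrayList<>(batch.size());
                    for (String line : batch) {
                        out.add(transform.apply(line));
                    }
                    return out;
                }));
            }
            // Collect in submission order so the output order matches the input.
            List<String> result = new ArrayList<>(lines.size());
            for (Future<List<String>> future : pending) {
                result.addAll(future.get());
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }
}
```

Batching keeps the per-task overhead small compared with submitting one task per line; the right batch size and thread count are things to measure, not guess.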
If you are interested in parallel computing, then please try Apache Spark; it is meant to do exactly what you are looking for.

Tool to count stacktraces in a logfile

Is there a tool that is able to collect and count (Java) stacktraces in a large logfile, such that you get an overview which errors occur most often?
I am not aware of any automatic tool but logmx will give you a nice clean overview of your log file with search options.
This probably isn't the best answer but I am going to try to answer the spirit of your question. You should try Dynatrace. It's not free and it doesn't work with log files per se, but it can get you very detailed reports of what types of exceptions are thrown from where and when, on top of a lot of other info.
I'm not too sure if there is a tool available to evaluate log files but you may have more success with a tool like AppDynamics. This is a monitoring tool which can be used to evaluate Live application performance and can be configured to monitor exception frequency.
Good luck.
Mark.
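If no ready-made tool fits, a rough counter is not hard to sketch. The version below assumes each stack trace headline starts at the beginning of a log line (optionally after "Caused by: " or "Exception in thread "..." ") and that the class name ends in Exception or Error; both are assumptions about the log layout, and StackTraceCounter is a made-up name.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Hypothetical helper: counts exception class names found in log lines.
final class StackTraceCounter {

    // Matches e.g. "java.lang.IllegalStateException: boom"
    // or "Caused by: java.io.IOException: disk full".
    private static final Pattern HEADLINE = Pattern.compile(
            "^(?:Caused by: |Exception in thread \"[^\"]*\" )?"
            + "((?:[\\w$]+\\.)+[\\w$]*(?:Exception|Error))\\b");

    /** Returns exception class name mapped to number of occurrences. */
    static Map<String, Integer> count(Stream<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        // Assumes a sequential stream; the map is not thread-safe.
        lines.forEach(line -> {
            Matcher m = HEADLINE.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        });
        return counts;
    }
}
```

It could be fed with Files.lines(path) for a large file, and the resulting map sorted by value to see which errors occur most often. It counts exception types only; grouping by the full trace would need the "at ..." lines as part of the key.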

Technology to transfer data with external system

We have an interface with an external system in which we get flat files from them and process those files. At present we run a job a few times a day that checks if the file is at the ftp location and then processes if it exists.
I recently read that it is a bad idea to use a file system as a message broker, which is why I am asking this question. Can someone clarify whether a situation like this one is a good fit for some other tool, and if so, which one?
Ours is a java based application.
The first question you should ask is "is it working?".
If the answer to that is yes, then you should be circumspect about change just because you read it was a bad idea. I've read that chocolate may be bad for you but I'm not giving it up :-)
There are potential problems that you can run into, such as files being deleted without your knowledge, or trying to process files that are only half-transferred (though there are ways to mitigate both of those, such as permissions in the former case, or the use of sentinel files or content checking in the latter case).
Myself, I would prefer a message queueing system such as IBM's MQ or JMS (since that's what they're built for, and they do make life a little easier) but, as per the second paragraph above, only if either:
problems appear or become evident with the current solution; or
you have some spare time and money lying around for unnecessary rework.
The last bullet needs expansion. While the work may be unnecessary (in terms of fixing a non-existent problem), that doesn't necessarily make it useless, especially if it can improve performance or security, or reduce the maintenance effort.
I would use a database to synchronize your files. Have a database that points to the file locations. Put an entry into the database only when the files have been fully transferred. This would ensure that you are picking up completed files. You can poll the database to check if new entries are present instead of polling the file system. This is a very simple setup for a polling mechanism. If you would like to be notified when a new file appears in the folder, then you would need to go for a message queue.
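The sentinel-file mitigation mentioned in the first answer can be sketched as follows. It assumes a convention that the sender, once a transfer is complete, writes an empty marker file named after the data file plus ".done"; that suffix and the SentinelScanner class are invented for this example and would have to be agreed with the external system.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: only picks up files whose transfer is marked complete.
final class SentinelScanner {

    // Assumed naming convention: "data.csv" is complete once "data.csv.done" exists.
    static final String SENTINEL_SUFFIX = ".done";

    /** Returns the data files in dir that have a matching marker file, sorted by path. */
    static List<Path> readyFiles(Path dir) throws IOException {
        List<Path> ready = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                String name = p.getFileName().toString();
                // Skip the markers themselves and anything that is not a plain file.
                if (name.endsWith(SENTINEL_SUFFIX) || !Files.isRegularFile(p)) {
                    continue;
                }
                if (Files.exists(dir.resolve(name + SENTINEL_SUFFIX))) {
                    ready.add(p);
                }
            }
        }
        Collections.sort(ready);
        return ready;
    }
}
```

The existing scheduled job would then process only what readyFiles returns and leave half-transferred files for the next run.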

Approach bottlenecks in a code

All,
Given code whose functionality and implementation you know nothing about, how would you go about finding the performance bottlenecks in it? Please list any specific tools / standard approaches that you might use.
I assume you have the source code, and that you can run it under a debugger, and that there is a "pause" button (or Ctrl-C, or Esc) with which you can simply stop it in its tracks.
I do that several times while it's making me wait, like 10 or 20, and each time study the call stack, and maybe some other state information, so I can give a verbal explanation of what it is doing and why.
That's the important thing - to know why it's doing what it's doing.
Typically what I see is that on, say, 20%, or 50%, or 90% of samples, it is doing something, and often that thing could be done more efficiently or not at all. So fixing that thing reduces execution time by (roughly) that percent.
The bigger a problem is, the quicker you see it.
In the limit, you can diagnose an infinite loop in 1 sample.
This gets a lot of flak from profiler-aficionados, but people who try it know it works very well. It's based on different assumptions.
If you're looking for the elephant in the room, you don't need to measure him.
Here's a more detailed explanation, and a list of common myths.
The next best thing would be a wall-time stack sampler that reports percent at the line or instruction level, such as Zoom or LTProf, but they still leave you puzzling out the why.
Good luck.
You should use a profiling tool; which one depends on the platform:
.NET: Visual Studio performance tools, JetBrains dotTrace
Java: JProfiler
The above tools work very well for applications, but the features vary. For example, Visual Studio can summarize performance data based on tiers.
How to approach the problem is highly dependent on the type of the program, and the performance problem you're facing. But basically, you'll repeat the following cycle:
Record performance data (maybe change the settings for higher / lower granularity on recorded data)
Identify hot spots, where most of the application time is consumed
Maybe use reverse call tables to identify how the hot spot is invoked, and from where in the code
Try to refactor / optimize the hot spot
Start over, and check how much your optimization was effective.
It might take several iterations of the above cycle to get you to a point that you have acceptable performance.
Note that these tools provide many different features and ways to look at performance data, or record them. Provided that you don't have any knowledge of the internal structure of the application, you should start playing with different features and reports that the tools provide, so that you can pinpoint where to optimize.
Use differential analysis. Pick one part of the program and artificially slow it down (add a bunch of code that does nothing but waste time). Re-run your test and observe the results. Do this for a variety of aspects of your program. If adding the delays does not alter performance, then that aspect is not your bottleneck. The aspect that results in the largest performance hit might be the first place to look for bottlenecks.
This works even better if the severity of the delay code is adjustable while the program is running. You can increase and decrease the artificial delay and see how that affects the performance. If you encounter a test where the change in observed performance seems to follow the artificial delay linearly, then that aspect of the program might be your bottleneck.
This is just a poor man's way of doing it. The best method is probably to use a profiler. If you specify your language and platform, someone could probably recommend a good profiler.
Without having an idea on the kind of system you are working with, these pieces of gratuitous advice:
Try to build up knowledge on how the system scales: how are 10 times more users handled, how does it cope with 100 times more data, or with a 100 times slower network environment...
Find the proper 'probing' points in the system: a distributed system is, of course, harder to analyze than a desktop app.
Find proper technology to analyze the data received from the probes. Profilers do a great job visualizing bottleneck functions, but I can imagine they are of no help for your cloud service. Try to graphically visualize your data, your brain is much better at recognizing graphical patterns than numerical, let alone textual.
Oh, and find out what the expectations are! It's no use optimizing the boot time of your app if it's only booted three times a year.
I'd say the steps would be:
Identify the actual functionality that is slow, based on use of the system or interviewing users. That should narrow down the problem areas (and if nobody is complaining, maybe there's no problem.)
Run a code profiler (such as dotTrace / Compuware) and data layer profiler (e.g. SQL Profiler, NHibernate Profiler, depending on what you're using.) You should get some good results after a day or so of real use.
If you can't get a good idea of the problems from this, add some extra stopwatch code to the next live build that logs the number of milliseconds in each operation.
That should give you a pretty good picture of the multiple database queries that should be combined into one, or code that can be moved out of an inner loop or pre-calculated, etc.

How to determine why is Java app slow

We have a Java ERP type of application. Communication between server and client is via RMI. In peak hours there can be up to 250 users logged in and about 20 of them are working at the same time. This means that about 20 threads are live at any given time in peak hours.
The server can run for hours without any problems, but all of a sudden response times get higher and higher. Response times can be in minutes.
We are running on Windows 2008 R2 with Sun's JDK 1.6.0_16. We have been using perfmon and Process Explorer to see what is going on. The only thing that we find odd is that when the server starts to slow down, the number of handles the java.exe process has open is around 3500. I'm not saying that this is the actual problem.
I'm just curious if there are some guidelines I should follow to be able to pinpoint the problem. What tools should I use? ....
Can you access the log configuration of this application?
If you can, you should change the log level to "DEBUG". Tracing the DEBUG logs of a request could give you useful information about the contention point.
If you can't, profiler tools can help you:
VisualVM (Free, and good product)
Eclipse TPTP (Free, but more complicated than VisualVM)
JProbe (not Free but very powerful. It is my favorite Java profiler, but it is expensive)
If the application has been developed with JMX control points, you can plug in a JMX viewer to get information.
If you want to stress the application to trigger the problem (to verify whether it is a load problem), you can use stress tools like JMeter.
Sounds like the garbage collection cannot keep up and starts "stop-the-world" collecting for some reason.
Attach with jvisualvm in the JDK when starting and have a look at the collected data when the performance drops.
The problem you're describing is quite typical but also very general. Causes can range from memory leaks and resource contention to bad GC policies and heap/PermGen space allocation. To pinpoint the exact problems with your application, you need to profile it (I am aware of tools like YourKit and JProfiler). If you profile your application wisely, a few application cycles should be enough to reveal the problems; otherwise, profiling isn't very easy in itself.
In a similar situation, I have coded a simple profiling code myself. Basically I used a ThreadLocal that has a "StopWatch" (based on a LinkedHashMap) in it, and I then insert code like this into various points of the application: watch.time("OperationX");
then after the thread finishes a task, I'd call watch.logTime(); and the class would write a log that looks like this: [DEBUG] StopWatch time:Stuff=0, AnotherEvent=102, OperationX=150
After this I wrote a simple parser that generates CSV from this log (per code path). The best thing you can do is to create a histogram (this can easily be done using Excel). Averages, medians and even modes can fool you, so I highly recommend creating a histogram.
Together with this histogram, you can create line graphs using the average/median/mode (whichever represents the data best; you can determine this from the histogram).
This way, you can be 100% sure exactly what operation is taking time. If you can't determine the culprit, binary search is your friend (fine grain the events).
Might sound really primitive, but works. Also, if you make a library out of it, you can use it in any project. It's also cool because you can easily turn it on in production as well..
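A minimal reconstruction of such a stopwatch might look like the sketch below. The answer does not show its code, so the details are assumptions: here each recorded value is the number of milliseconds since the watch was started (or last logged), logTime() returns the formatted line instead of calling a logger, and the clock is injectable so the class can be tested.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;
import java.util.stream.Collectors;

// Hypothetical reconstruction; semantics of the original class are assumed.
final class StopWatch {

    // One watch per thread, so concurrent requests do not mix their timings.
    private static final ThreadLocal<StopWatch> CURRENT =
            ThreadLocal.withInitial(() -> new StopWatch(() -> System.nanoTime() / 1_000_000));

    private final LongSupplier clock;                                // milliseconds
    private final Map<String, Long> marks = new LinkedHashMap<>();   // keeps insertion order
    private long start;

    StopWatch(LongSupplier clock) {
        this.clock = clock;
        this.start = clock.getAsLong();
    }

    /** Returns the watch belonging to the calling thread. */
    static StopWatch current() {
        return CURRENT.get();
    }

    /** Records the elapsed milliseconds since start under the given event name. */
    void time(String event) {
        marks.put(event, clock.getAsLong() - start);
    }

    /** Formats the recorded events as one log line and resets the watch for the next task. */
    String logTime() {
        String line = marks.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(", ", "[DEBUG] StopWatch time:", ""));
        marks.clear();
        start = clock.getAsLong();
        return line;
    }
}
```

In application code this would be used as StopWatch.current().time("OperationX") at the points of interest, with the string from StopWatch.current().logTime() passed to whatever logger the application uses once the task finishes.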
Aside from the GC that others have mentioned, try taking thread dumps every 5-10 seconds for about 30 seconds during your slowdown. There could be a case where DB calls, a web service, or some other dependency becomes slow. If you take a look at the thread dumps you will be able to see threads which don't appear to move, and you could narrow down your culprit that way.
From the GC standpoint, do you monitor your CPU usage during these times? If the GC is running frequently you will see a jump in your overall CPU usage.
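The "threads which don't appear to move" check is normally done by eye on jstack output, but it can also be sketched in-process with the standard ThreadMXBean API. ThreadDumpSampler and its parameters are names made up for this example; it takes several dumps and reports the threads whose whole stack was identical in every one.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: compares successive thread dumps of the current JVM.
final class ThreadDumpSampler {

    /**
     * Takes the given number of dumps, sleeping intervalMillis between them,
     * and returns "name#id" mapped to the stack of every thread whose stack
     * did not change across all dumps.
     */
    static Map<String, String> unchangedThreads(int dumps, long intervalMillis)
            throws InterruptedException {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        Map<Long, String> candidates = new HashMap<>();  // thread id -> stack from first dump
        Map<Long, String> names = new HashMap<>();
        for (int i = 0; i < dumps; i++) {
            Map<Long, String> current = new HashMap<>();
            for (ThreadInfo info : bean.dumpAllThreads(false, false)) {
                if (info == null) {
                    continue;
                }
                current.put(info.getThreadId(), Arrays.toString(info.getStackTrace()));
                names.put(info.getThreadId(), info.getThreadName());
            }
            if (i == 0) {
                candidates.putAll(current);
            } else {
                // Drop threads that moved or terminated since the first dump.
                candidates.entrySet()
                        .removeIf(e -> !e.getValue().equals(current.get(e.getKey())));
            }
            if (i < dumps - 1) {
                Thread.sleep(intervalMillis);
            }
        }
        Map<String, String> result = new TreeMap<>();
        candidates.forEach((id, stack) -> result.put(names.get(id) + "#" + id, stack));
        return result;
    }
}
```

Idle pool threads and JVM housekeeping threads will show up as unchanged too, so the interesting entries are request threads parked inside a DB driver, socket read or lock.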
If only this was a Solaris box, prstat would be your friend.
For acute issues like this a quick jstack <pid> should quickly point out the problem area. Probably no need to get all fancy on it.
If I had to guess, I'd say HotSpot jumped in and tightly optimised some badly written code. NetBeans grinds to a halt where it uses a WeakHashMap with newly created objects to cache file data. When optimised, the entries can be removed from the map straight after being added. Obviously, if the cache is being relied upon, much file activity follows. You probably won't see the drive light up, because it'll all be cached by the OS.
