Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 3 years ago.
I've been wanting to learn Python and do some NLP, so I have finally gotten round to starting. I downloaded the English Wikipedia mirror for a nice chunky dataset to start on, and have been playing around a bit, at this stage just getting some of it into a SQLite db (I haven't worked with databases before, unfortunately).
But I'm guessing SQLite is not the way to go for a full-blown NLP project (or experiment :)), so what sort of things should I look at? HBase (and Hadoop) seem interesting; I guess I could run them in Java, prototype in Python, and maybe migrate the really slow bits to Java. Alternatively I could just run MySQL, but the dataset is 12 GB and I wonder if that will be a problem. I also looked at Lucene, but I'm not sure how I'd get that to work (other than by breaking the wiki articles into chunks).
What comes to mind for a really flexible NLP platform? (I don't really know at this stage WHAT I want to do; I just want to learn large-scale language analysis, to be honest.)
Many thanks.
NLTK is where you should start from (it's Python-based -- not sure why you're already thinking about parallelizing your processing at such an early stage... start with a more flexible experimental setup, is my advice). sqlite should be fine for a few GB -- if you need more advanced and standard SQL power you could consider postgresql.
There is a related talk from PyCon 2010: "The Python and the Elephant: Large Scale Natural Language Processing with NLTK and Dumbo".
The link has introductory information, slides and video.
I think sqlite is still a good choice for 12 GB of data. I have a text classification training set of similar size, and both sqlite and plain text work fine as long as you just iterate over it line by line.
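To make the "iterate over it line by line" point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `articles`, its columns, and the helper `iter_articles` are made up for illustration, and the in-memory database only stands in for the real on-disk file.

```python
import sqlite3


def iter_articles(conn, batch_size=1000):
    """Yield (title, body) rows in batches so the table never sits in memory at once."""
    cursor = conn.execute("SELECT title, body FROM articles ORDER BY id")
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield from rows


# Tiny in-memory stand-in for the real database file (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO articles (title, body) VALUES (?, ?)",
    [
        ("Python", "Python is a programming language."),
        ("SQLite", "SQLite is an embedded database engine."),
    ],
)
conn.commit()

# Stream over the rows one at a time, here just counting whitespace-separated words.
total_words = sum(len(body.split()) for _title, body in iter_articles(conn))
print(total_words)
```

Because the rows are streamed, memory use depends on `batch_size` rather than on the size of the database.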
You will most likely use the Vector Space Model to represent the text while doing the analysis.
In which case, you should look at platforms that can help you store term vectors with term frequencies. It makes your life so much easier.
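As a rough illustration of what a term vector with term frequencies looks like, here is a small standard-library sketch. The tokenizer regex and the function names `term_frequencies` and `cosine_similarity` are my own choices for the example; Lucene or Elasticsearch would do the storing and scoring for you.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Map each lowercase alphanumeric token to the number of times it occurs."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(tf_a[term] * tf_b[term] for term in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(count * count for count in tf_a.values()))
    norm_b = math.sqrt(sum(count * count for count in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = term_frequencies("The cat saw the dog")
doc_b = term_frequencies("The dog saw the bird")
print(doc_a)
print(cosine_similarity(doc_a, doc_b))
```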
Have a look at Apache Lucene which has a python library to access Java Lucene. Elasticsearch is also a good alternative, which uses Apache Lucene underneath and has a really good python package. Elasticsearch also exposes a REST API.
Postgresql is also really good at storing tokens. Check out this article to learn more.
I have worked with sizable language data before and I personally prefer Lucene/Elasticsearch for analysis projects.
Cheers.
Summary from the internet:
Spacy is a natural language processing (NLP) library for Python designed to have fast performance, and with word embedding models built in, it’s perfect for a quick and easy start.
Gensim is a topic modelling library for Python that provides access to Word2Vec and other word embedding algorithms for training, and it also allows pre-trained word embeddings that you can download from the internet to be loaded.
NLTK details already given above.
Stanford NLP has recently released a Python framework that supports 50+ languages. You should definitely check it out.
There are many others, but the above 4 are the most usable in terms of community support and latest features.
I personally prefer spaCy.
spaCy is one of the fastest of all and can integrate gensim and other APIs into its models.
Moreover, spaCy has models for many languages in alpha stage, making it a good choice for multilingual apps.
Scaling is a whole different thing (you can use a lot of tools). But let's stick to scaling in NLP: spaCy gives you so much control over its pipeline that you can disable unwanted components, making it faster.
Look into it, try it yourself, and explore.
Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 9 years ago.
I'm soon to start on a new project where I am going to do lots of text processing tasks like searching, categorization/classifying, clustering, and so on.
There's going to be a huge amount of documents that need to be processed; probably millions of documents. After the initial processing, it also has to be able to be updated daily with multiple new documents.
Can I use Python to do this, or is Python too slow? Is it best to use Java?
If possible, I would prefer Python since that's what I have been using lately. Plus, I would finish the coding part much faster. But it all depends on Python's speed. I have used Python for some small scale text processing tasks with only a couple of thousand documents, but I am not sure how well it scales up.
Both are good. Java has a lot of steam going into text processing. Stanford's text processing system, OpenNLP, UIMA, and GATE seem to be the big players (I know I am missing some). You can literally run the StanfordNLP module on a large corpus after a few minutes of playing with it. But, it has major memory requirements (3 GB or so when I was using it).
NLTK, Gensim, Pattern, and many other Python modules are very good at text processing. Their memory usage and performance are very reasonable.
Python scales up because text processing is a very easily scalable problem. You can use multiprocessing very easily when parsing/tagging/chunking/extracting documents. Once you get your text into any sort of feature vector, you can use numpy arrays, and we all know how great numpy is...
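A minimal sketch of the multiprocessing idea, assuming each document can be processed independently. `count_tokens` is a stand-in for whatever per-document parsing or tagging you actually do, and `count_corpus` is a hypothetical helper name; the serial branch exists so that small runs and tests do not need to start worker processes.

```python
from collections import Counter
from multiprocessing import Pool


def count_tokens(document):
    """Per-document work; replace with real parsing, tagging, or chunking."""
    return Counter(document.lower().split())


def count_corpus(documents, processes=1):
    """Merge per-document counts, optionally spreading the work over processes."""
    totals = Counter()
    if processes <= 1:
        # Serial fallback: same result, no worker processes.
        for partial in map(count_tokens, documents):
            totals.update(partial)
        return totals
    with Pool(processes=processes) as pool:
        # Documents are independent, so the order of results does not matter.
        for partial in pool.imap_unordered(count_tokens, documents, chunksize=64):
            totals.update(partial)
    return totals
```

When using more than one process, call `count_corpus(docs, processes=4)` from under an `if __name__ == "__main__":` guard so that worker processes can import the module cleanly.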
I learned with NLTK, and Python has helped me greatly in reducing development time, so I suggest you give that a shot first. They have a very helpful mailing list as well, which I suggest you join.
If you have custom scripts, you might want to check out how well they perform with PyPy.
It's very difficult to answer questions like this without trying. So why don't you:
Figure out what would be a difficult operation
Implement that (and I mean the simplest, quickest hack that you can make work)
Run it with a lot of data, and see how long it takes
Figure out if it's too slow
I've done this in the past, and it's really the way to see whether something performs well enough for your purpose.
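The steps above can be sketched in a few lines. Everything here is a placeholder: `quick_hack` stands for your simplest implementation of the difficult operation, and the document counts are invented. The extrapolation assumes the cost grows roughly linearly with the number of documents, which you should check on your own data.

```python
import time


def quick_hack(document):
    """Placeholder for the difficult operation, implemented the simplest way."""
    return sorted(set(document.lower().split()))


def time_operation(operation, items):
    """Run the operation over every item and return the elapsed seconds."""
    start = time.perf_counter()
    for item in items:
        operation(item)
    return time.perf_counter() - start


def estimate_seconds(sample_elapsed, sample_size, full_size):
    """Linear extrapolation from a sample run to the full data set."""
    return sample_elapsed / sample_size * full_size


sample = ["some text to process in a hurry"] * 10_000
elapsed = time_operation(quick_hack, sample)
projected = estimate_seconds(elapsed, len(sample), 5_000_000)
print(f"sample: {elapsed:.3f} s, projected for 5M documents: {projected:.0f} s")
```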
Just write it. The biggest flaw people have in programming is premature optimization. Work on the project, write it out, and get it working. Then go back, fix the bugs, and make sure it's optimized. There are going to be a number of people harping on about the speed of x vs. y and how y is better than x, but at the end of the day it's just a language. It's not what a language is but how it does it.
It's not the language you have to evaluate, but the frameworks and app servers for clustering, data storage/retrieval, etc. that are available for the language.
You can use Jython to get access to all the Java enterprise technologies for high-load systems and still do the text parsing in Python.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 5 years ago.
I would like to simulate some scenarios using the multiagent paradigm, and it seems NetLogo and Repast are the most popular tools for that.
I'd like to know if anyone has had any experience with either one and could tell me more about them. For example, I've noticed that there is a flowchart-like modeling option for Repast, but I believe it is rather limited. I've looked through the tutorials and documentation on the official site, and the documentation seems to be lacking. While there are some examples, extending it to simulate an environment it was not specifically prepared for seems like an unreachable goal at the moment, even though Repast is obviously very robust and apparently able to handle it, given enough familiarity with it.
On the other hand, NetLogo has more examples, and overall I've liked it more for its simplicity, but it seems to be more focused on simulating the propagation of diseases or similar models. I've found a programming book teaching Logo, so I figure it'd be easier to get started with it too.
Currently, I am thinking of simulating botnets and IDSes as multiagents. The problem, however, is that I would have to abstract the network and transport layers to an extent to be able to do it, as well as generate traffic between the nodes. Repast is apparently more fitting for this, but given its complexity and lack of documentation I'm thinking of using NetLogo. While there are some examples of NetLogo with traditional applications (ex: Tetris or Pac-Man), I'm not sure about how appropriate it'd be for that.
I have a webpage with a couple dozen netlogo multiagent simulations. I use netlogo for teaching and I have found that, once you get past the learning curve, you can develop simulations amazingly fast. Stuff that would take you 80 man-hours in other so-called agent environments (Jade, Repast, which are really mostly just programming libraries) can be done in 2 hours.
On the other hand, netlogo is not really good for simulations that require immense amounts of detail, like, say, simulating a network all the way from TCP/IP to HTTP. That would just require large amounts of code, regardless of programming language, and netlogo currently sucks if your program ends up being more than 10 pages long. Having said that, most people would be amazed at what you can get done in 10 pages of netlogo code.
Short answer: it depends on the programming paradigm or language you want to use, and the design you want for your agents:
If you want a low-entry-high-ceiling language allowing quick prototyping but sophisticated simulations, and are willing to learn a new paradigm (avoiding loops) use NetLogo. Good documentation.
If you want to make a real application to use on highly-parallelized clusters or just want to use Java Groovy or need a specific Java library for your purpose, use Repast or better Repast for High Performance Computing (but avoid ReLogo which is very slow). Mild documentation.
If you want to model cognitive agents (instead of reactive) with FIPA communications, better use Jason or better JaCaMo which supports AgentSpeak + Java (so you can also use your favourite Java libraries), and there's no Groovy required. Bad documentation (a lot of non detailed features and commands and bad too-complex-not-commented examples).
Long answer:
Disclaimer: I am more experienced with NetLogo but I also used Repast and a few others like Jason.
Basically, the difference between NetLogo and Repast is that with NetLogo you will have a simpler framework but you'll need to learn how to program in a turtle-and-patch-oriented paradigm, while in Repast you will have to learn that + the mechanisms behind Java Groovy but you will eventually get more flexibility. Speed isn't really a criterion here (see below).
To be more clear, you can program efficiently in NetLogo if you make maximum use of the native turtle and patch functions. For example, if you want to implement A*, instead of implementing a list of nodes, you should directly use the patches and filter them using things like this:
ask patches with [criteria1 = value and criteria2 = value2] [ do-some-stuff ]
ask patches with-min [criteria] [ do ]
let var [somevalue] of min-one-of patches [criteria]
Also if you can't find a way to efficiently do what you want, be sure to check if maybe an extension exists (check also here under Libraries and Tools) for your purpose, like the now native matrix extension which allowed me to make an efficient neural network in NetLogo.
On the other hand, Repast is potentially more flexible than NetLogo (since you have access to the whole range of Java libraries), but a bit more complex since you have to know how to handle Groovy.
If you are solely interested in speed, do NOT use ReLogo (NetLogo-like syntax for Repast), which has been shown to be a whole lot slower than NetLogo (see the 2012 paper below). In any case, your best bet would be either to try an implementation with NetLogo using the tricks above, or, if you want to use your application for real later, to use the distribution called Repast for High Performance Computing, which removes most of the overhead that comes with turtle and patch objects and can thus be used for real applications. A similar extension exists for NetLogo to compute on clusters with parallelization, but it's not an official distribution.
If you want more information about the various platforms, here is a nice review from 2006:
Railsback, S. F., Lytinen, S. L., & Jackson, S. K. (2006). Agent-based Simulation Platforms: Review and Development Recommendations. SIMULATION, 82(9), 609-623.
And an updated version of this paper in 2012 dealing with NetLogo vs ReLogo:
Lytinen, S. L., & Railsback, S. F. (2012, April). The evolution of agent-based simulation platforms: A review of netlogo 5.0 and relogo. In Proceedings of the Fourth International Symposium on Agent-Based Modeling and Simulation.
/EDIT: I cited Jason but didn't give any more details. If you want to model cognitive agents (instead of reactive agents), you can do that in NetLogo using the unofficial BDI extension which works well but is a bit limited (but it's easily extensible since it's pure NetLogo), but your best bet is to use a framework specifically designed to model cognitive agent with full support of AgentSpeak.
Jason is very nice since you have access to a full AgentSpeak language + JAVA to implement the technical side. In fact, you can do whole projects using only AgentSpeak (which I did), but you can also make more Java-oriented versions, it's up to you how you want to design your program, the result will be more or less the same. This offers you a lot of flexibility in your design workflow.
Tip: search for "Jason internal actions" in the documentation to get a good description of the available AgentSpeak commands.
Also, if you are interested in Jason, you might be interested in JaCaMo (= Jason + Cartago + Moise), which is the result of a cooperation between the authors of three projects to make a full-fledged cognitive agents framework that can also model complex environments (with artifacts theory) and multi-agent organisations (roles, groups, missions, etc.).
A last framework I know of is Mason, which supports 2D and 3D environments. I never had a chance to try this one, so I don't know how it compares with the others, but you can try it out.
Here's a generic comparison.
http://www.duncanrobertson.com/research/AMLE.pdf
I had more or less the same problem a few months ago when I had to choose a framework for my simulation. I looked at Repast, NetLogo, Swarm and Jade.
NetLogo was nice and I tried to write some simple test applications, but since I wanted to use Java as my programming language, NetLogo wasn't the best candidate. Repast has pretty much everything you need to write larger simulations, and there are many projects (especially in social sciences) where Repast is used. My problems with Repast were: bad API documentation, parameters passed to methods or constructors that are never used and don't make any sense at all (have a look at the source code), and a lot of boilerplate code.
I'm using Jade (http://jade.tilab.com/) now and I'm really happy with it. The community is good and their mailing list is VERY active. Okay, Jade is just a library and a framework for agent-based modelling. You don't get anything like the visual editors in Repast, and you'll have to write your own tool for visualising the results.
Cheers
You could simulate the traffic using an agent type called "packet" that is spawned and sent from an agent called "bot" to another agent called "bot" or "server". Instead of sending the packets to an IP address, you would be sending them to a pair of X and Y coordinates.
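To show the idea independently of any particular toolkit, here is a small plain-Python sketch of packets travelling between coordinates. It is not NetLogo or Repast code; the class names `Node` and `Packet`, the speed of one unit per tick, and the coordinates are all assumptions for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    """A "bot" or "server" agent sitting at a fixed position."""
    name: str
    x: float
    y: float
    received: int = 0


@dataclass
class Packet:
    """A packet agent that walks towards its target's coordinates."""
    x: float
    y: float
    target: Node
    speed: float = 1.0
    delivered: bool = False

    def step(self):
        dx = self.target.x - self.x
        dy = self.target.y - self.y
        distance = math.hypot(dx, dy)
        # The small tolerance absorbs floating-point drift from earlier steps.
        if distance <= self.speed + 1e-9:
            self.x, self.y = self.target.x, self.target.y
            self.delivered = True
            self.target.received += 1
        else:
            self.x += self.speed * dx / distance
            self.y += self.speed * dy / distance


def send(source, target, speed=1.0):
    """Spawn a packet at the source's position, addressed to the target."""
    return Packet(source.x, source.y, target, speed)


def run(packets, max_ticks=1000):
    """Advance all undelivered packets tick by tick; return the ticks used."""
    ticks = 0
    while ticks < max_ticks and any(not p.delivered for p in packets):
        for packet in packets:
            if not packet.delivered:
                packet.step()
        ticks += 1
    return ticks


bot = Node("bot", 0.0, 0.0)
server = Node("server", 3.0, 4.0)
ticks = run([send(bot, server)])
print(ticks, server.received)
```

In NetLogo the same roles would map onto turtle breeds, with the movement handled by the built-in turtle commands.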
NetLogo has an example of how a virus spreads in a network; this might be a good starting point.
I have never tried NetLogo, but I have tried Repast-J and Simphony. It seems Simphony is good, but at the moment I am stuck at changing the Edge type from a straight line to a curved one. There is not enough documentation and not enough examples are available.
I once tried Mason, which is based on Java, too. It is similar to Repast-J, yet it was faster. But there has not been much development in Mason recently.
I would like to try out Jade later.
If you can already code in Java, you can also look at the following paper for a comparison between RePast, Swarm, Quicksilver, and VSEit, four freely available programming libraries that support social-scientific agent-based computer simulation:
Tobias, Robert, and Carole Hofmann. "Evaluation of free Java-libraries for social-scientific agent based simulation." Journal of Artificial Societies and Social Simulation 7.1 (2004).
Repast is definitely more flexible than NetLogo, but the documentation for Repast Simphony is not very detailed.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
I have to write a prototype app for an engineering company. Most of the work is calculating various engineering properties (I'm talking pipes and real things here, not software engineering).
However, there will also have to be a GUI for:
parameter entry
displaying results
some basic diagramming
The calculation work at the moment doesn't involve complex math elements (no matrices at the moment), just logs, square roots, relatively simple formulae. Later I will have to do some:
curve fitting
numerical approximation
I was wondering if Java has been used for real world engineering apps?
Are there libraries available for this sort of thing?
Or am I better off writing in Matlab and then connecting to the code through Java?
Also open to other languages (although we are a Java shop).
I have some experience of both Matlab and of Java for scientific/engineering type codes. Yes, Java is used for real-world scientific and engineering codes, and yes there are libraries available. You can certainly do what you want using either so I'm not sure that you could sensibly distinguish between the two on the sole basis of your current requirements. I'd ask myself the following questions:
How good am I at programming advanced mathematical operations, such as function minimisation, differential equation solvers, and matrix algebra? If the answer is "not very", then lean towards Matlab, which will provide all of these out-of-the-box (though you may need additional toolboxes). If you opt for Java, make sure you are very comfortable with floating-point arithmetic and with the sorts of errors that occur when you use it.
Do I want to code everything in Java, everything in Matlab, or am I happy to use both and to wrestle with, say, a Java GUI on a Matlab engine? I think you can do much better (in a vague sense) GUIs with Java than with Matlab, but Matlab's GUI facilities are good enough for most of its users that the added complexity of integrating Matlab with Java is not worth tackling. But then many Matlab users are not software engineers.
What speed of development do I need for the prototype work? If you were equally skilled in Java and in Matlab then I'd guess that you could do it quicker in Matlab: because the numeric stuff is already provided, you could concentrate on the GUI. But if you are a skilled Java programmer coming newly to Matlab, you might decide to stick with what you know.
How will I develop and deploy the production app if the prototype is successful? If Matlab doesn't fit your deployment ideas, then learning it and forgetting it may not be rewarding.
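On the floating-point point in the first question above, here is a tiny illustration of the kind of error meant. It is written in Python for brevity, but Python floats and Java doubles are both IEEE 754 double precision, so the same arithmetic surprises apply; the tolerance used is just an example value.

```python
import math

# Ten additions of 0.1 do not land exactly on 1.0 in binary floating point.
naive_total = sum([0.1] * 10)
print(naive_total == 1.0)

# Compare with a tolerance instead of testing for exact equality.
print(math.isclose(naive_total, 1.0, rel_tol=1e-9))

# Compensated summation keeps track of the rounding error of each addition.
careful_total = math.fsum([0.1] * 10)
print(careful_total == 1.0)
```

In engineering formulae the practical consequences are the usual ones: compare results with a tolerance, and be careful when subtracting nearly equal quantities.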
Finally, since you solicit other language recommendations: forget Java, forget Matlab, forget Python, forget R, use Mathematica, it's way more fun and very powerful.
This sounds like a job for Matlab: you don't give any reason not to use it. There's some code for evaluating Matlab expressions from Java: http://www.cs.virginia.edu/~whitehouse/matlab/JavaMatlab.html
I have done some work where I had to reimplement Matlab code in Java so it is certainly possible. The Java code can end up being quite verbose compared to the Matlab original due to Matlab being able to operate directly on matrices/arrays etc.
Some math libraries that you might want to look at to see if they support the functionality you are looking for:
Commons Math
Colt
JSci
I guess that Java would be a good choice, even though it is not considered a typical language for rapid application development.
Pros:
versatile GUI toolkit for desktop applications in standard library (Swing),
(relatively) cross-platform,
great libraries, e.g. from Apache; a great math library to look at would be Colt; for charts and diagrams, you may like JFreeChart.
Cons:
"not so rapid" prototyping capabilities
Further reading:
Technical Java: Applications for Science and Engineering
Python has several decent GUI toolkits as well as NumPy, and is easy and fun to write in.
This depends mainly on how easy it will be in your environment to include Matlab or another math engine in your product. If this is easy, I would suggest using Matlab, but if not, e.g. if you have licensing or deployment issues, you are probably better off just using plain Java code.
You may also want to consider the R language.
I would do a search for software that's written to do piping calculations. This problem has been done. (As you've noted, the calculations need not be difficult.) At a minimum I'd recommend that you know what's available to you, how much it would cost, and where the break even point was for development costs.
A commercial product will have one huge advantage over anything you'll write: It will have a larger user community that's been banging on it and finding bugs for a longer period of time than your prototype. That's worth something as well.
What's your opportunity cost? What else could you have done with the development time that would drive more revenue?
Don't forget numpy or scipy. Both allow you to call fast matrix libraries from Python.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 8 years ago.
I'd like to compute power spectral density of time series; do some bandpass, lowpass, and highpass filtering; maybe some other basic stuff.
Is there a nice open-source Java library to do this?
I've hunted a bit without success (e.g., Googling "power spectral density java" or "signal processing java" and clicking through links, looking in Apache Commons, Sourceforge, java.net, etc.).
There are lots of applets, books, tutorials, commercial products, etc., that don't meet my needs.
Update: I found org.apache.commons.math.transform for Fourier transforms. This doesn't implement power spectral density, bandpass, etc., but it is something.
My first suggestion is to not do your DSP implementation in Java. My second suggestion would be to roll your own simple DSP implementations yourself in Java.
Why not to use Java:
I have lots of experience writing DSP code over the last 10+ years... and almost none of the DSP code is in Java... so forgive me when I am hesitant to read about someone who wants to implement DSP in Java.
If you are going to be doing non-trivial DSP then you shouldn't be using Java. DSP is painful to implement in Java because all the good DSP implementations use low-level memory management tricks, pointers (crazy amounts of pointers), large raw data arrays, etc.
Why to use Java:
If you are doing simple DSP stuff, roll your own Java implementation. Simple DSP things like PSD and filtering are both relatively easy to implement (easy to implement, but they won't be fast) because there are so many implementation examples and so much well-documented theory online.
In my case I implemented a PSD function in Java once because I was graphing the PSD in a Java GUI so it was easiest to just take the performance hit in Java and have the PSD computed in the java GUI and then plot it.
How to implement a PSD:
The PSD is usually just the squared magnitude of the FFT displayed in dB (which is the same as 20*log10 of the magnitude). There are many examples from academic, commercial and open-source projects showing how to compute the magnitude of the FFT in dB. For example, Apache has a Java implementation that gives you the FFT output, and then you just need to convert to magnitude and dB. Anything after the FFT should be tailored to what you need/want.
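Here is a rough sketch of that magnitude-to-dB step. It is in Python rather than Java to keep it short, and it uses a slow textbook DFT in place of a real FFT library, but the arithmetic after the transform is the same in any language. The 1/N scaling and the -120 dB floor are my own choices; scaling conventions differ between tools.

```python
import cmath
import math


def dft(samples):
    """Naive O(N^2) discrete Fourier transform; swap in a real FFT for speed."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def power_spectrum_db(samples):
    """Squared DFT magnitude in dB for bins 0..N/2 (one-sided spectrum)."""
    n = len(samples)
    spectrum = dft(samples)
    result = []
    for k in range(n // 2 + 1):
        power = abs(spectrum[k]) ** 2 / n
        # Clamp tiny values so that log10 never sees zero (floor at -120 dB).
        result.append(10.0 * math.log10(max(power, 1e-12)))
    return result


# A sine wave that completes exactly 4 cycles in 32 samples.
signal = [math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]
psd = power_spectrum_db(signal)
peak_bin = psd.index(max(psd))
print(peak_bin)
```

For real data you would normally also apply a window and average several segments, which this sketch leaves out.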
How to implement lowpass, bandpass filtering:
The easiest implementation (not the most computationally efficient) would in my opinion be using an FIR filter and doing time domain convolution.
Convolution is very easy to implement: it is two nested for loops, and there are literally millions of code examples on the net.
The FIR filter will be the tricky part if you don't know anything about filter design. The easiest method would be to use Matlab to generate your FIR filter and then copy the coefficients into Java. I suggest using firpmord() and firpm() from Matlab. Shoot for -30 to -50 dB attenuation in the stopband and 3 dB ripple in the passband.
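The two nested loops look like this. The sketch is in Python for brevity and translates line for line to Java arrays. The coefficients used are a plain 4-tap moving average, picked only to have something to run; they are not the output of firpm(), so substitute your designed coefficients.

```python
def convolve(signal, coefficients):
    """Full time-domain convolution of a signal with FIR filter coefficients."""
    if not signal or not coefficients:
        return []
    output = [0.0] * (len(signal) + len(coefficients) - 1)
    for i, sample in enumerate(signal):
        for j, coefficient in enumerate(coefficients):
            # Each input sample contributes to len(coefficients) output samples.
            output[i + j] += sample * coefficient
    return output


# Hypothetical lowpass: a 4-tap moving average (placeholder coefficients).
moving_average = [0.25, 0.25, 0.25, 0.25]
step_input = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
filtered = convolve(step_input, moving_average)
print(filtered)
```

The output is longer than the input by the number of coefficients minus one; trim the tail if you need the lengths to match.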
I found the book Java Digital Signal Processing and its example source code. You might look through the code to see if it fits your needs.
You can also check out DSP Laboratory.
As duffymo and basszero mentioned in the comments, there have been changes to Java since the publication of Java DSP that may impact some of the code examples. In particular, the (relatively) new Concurrency Utilities package might prove useful.
I have written a collection of some Java DSP classes, e.g. IIR filters:
Java DSP collection
It looks pretty sparse. Try Signalgo or jein or the Intel Signal Processing Library, although I think the last one is just a JNI wrapper.
I saw a lot of those applets you were talking about. I think you may be able to get the JARs for them and use the class APIs inside. May have to use eclipse and jad to decompile and figure out what they do, though, due to lack of documentation. Try the source on this page for example.
I found another resource, although it's not a library: http://www.dickbaldwin.com/tocdsp.htm. It's just a basic discussion of signal processing and Fourier transform, with some Java examples. See for example tutorials 1478, 1482, 1486. Not sure what the license on the code is.
Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 5 years ago.
I'm now in search for a Java Text to Speech (TTS) framework. During my investigations I've found several JSAPI1.0-(partially)-compatible frameworks listed on JSAPI Implementations page, as well as a pair of Java TTS frameworks which do not appear to follow JSAPI spec (Mary, Say-It-Now). I've also noted that currently no reference implementation exists for JSAPI.
Brief tests I've done with FreeTTS (the first one listed on the JSAPI implementations page) show that it struggles to read even simple and obvious words (examples: ABC, blackboard). Other tests are currently in progress.
And here goes the question (6, actually):
Which of the Java-based TTS frameworks have you used?
Which ones, in your opinion, are capable of reading the largest wordbase?
What about their voice quality?
What about their performance?
Which non-Java frameworks with Java bindings are there on the scene?
Which of them would you recommend?
Thank you in advance for your comments and suggestions.
I've actually had pretty good luck with FreeTTS
Google Translate has a secret tts api:
https://translate.google.com/translate_tts?ie=utf-8&tl=en&q=Hello%20World
Actually, there is not a big choice:
Festival, the oldest. Written in C++ but has bindings to Java.
eSpeak, quick and simple, used by Google Translate
mbrola
Pure Java:
FreeTTS, whose code was ported from Festival; it was then open-sourced and development stopped.
MaryTTS - more powerful and looks production ready.
There are also other proprietary programs like:
Acapela
Nuance Vocalizer
If your software is Windows only, you can use Microsoft Speech API.
I've used Mary before and I was very impressed with the quality of the voices. Unfortunately, I haven't used any of the other ones.
I've used AT&T Natural Voices which provides JSAPI and MS SAPI hooks. It provides excellent quality voices, a good "general" speech dictionary, many controls over pronunciation, and multiple languages. It's a little pricey, but works very well.
I used it to read important sensor telemetry to drivers in a mobile sensor application. We had no complaints about the voice quality. It had about 75% out-of-the-box accuracy with scientific terms and a much higher (maybe 90%+) with normal dialogue. We got it up to about 99+% accuracy by using markups (most errors were on scientific terms with unusual phoneme combinations).
It was a bit hard on the processor (we were running on a Pentium-III equivalent machine and it was pushing 50%-75% peak CPU). This uses a native speech engine (Windows, Linux, and Mac compatible) with a Java interface.
There's a huge variety of voices and languages...
I used FreeTTS but had a major problem getting the MBrola voices to run on my MacBook Pro. I did get MBrola voices to run on Windows (painfully) and Linux. I've had no luck loading any other voice packages on FreeTTS, which is a shame because the supplied voices are horrible IMO. Outside of that I had a little success with Cloudgarden as well, but that only runs on Windows AFAIK. I'd be interested to hear others' successes/failures with voice engines, as this type of work is particularly challenging. I'm also toying a bit with Sphinx4. I just pulled down JVXML (which appears to be based on Sphinx4) last night but could not get it to run for some strange reason.
I've contributed to Mary. I feel it has potential if someone smarter than me separated the HMM voices out of the core (those voices don't need large data sets and sound OK). I'm also trying to add an event system to FreeTTS that sends events when it says a word. I've had success, but it is broken on Linux now (probably because of a timer bug).
Thanks a lot everyone, the trick is in the FreeTTS source. Briefly: when run as java -jar freetts.jar some-more-args-here, it reads fewer words than when run via bin/Server.jar and bin/Client.jar.
I am fairly comfortable with MaryTTS. It supports multiple languages and has a clear voice that is easy to understand.
To convert speech to text, the better option is sphinx4-5prealpha.
I give it a thumbs up because its recognizer and grammar are adjustable, flexible, and modifiable.