What is your opinion on using camel case for web resources?
I am coming from a Java background where camel case is second nature, but when naming web resources such as HTML, CSS, and JavaScript files, camel case doesn't feel right.
(e.g. http://localhost/application/editUserForm.html vs http://localhost/application/edit/user/form.html)
Any comments, suggestions are welcome!
The main consideration in choosing a naming scheme is its impact on SEO. From my understanding, Google (and presumably other engines) can 'read' amalgamated words in a single string, so camel case should be OK, as would a single case-insensitive string. Splitting the scheme by directory using rewrites would be clearer for less capable spiders. One piece of advice Google gives is to use hyphens (-) rather than underscores (_), but that's not relevant here.
If you expect a real person to ever have to type the full address, using something easy to read would be a bonus in order to minimise error.
I don't find anything wrong with such naming.
My personal preference is to name web resources with hyphens, like edit-user.jsp. I think it's more a question of personal taste. I don't like underscores; hyphens make it easier (at least for me) to visually pick out the separate words in the browser's address bar. And as far as I've seen, hyphens are pretty common.
It might open the door to problems with two resources whose names differ only in casing, if you are deploying your site in a Windows environment (either for development or hosting).
But if you avoid 'double' filenames like that, it's more or less a matter of taste.
A naming scheme like http://localhost/application/edit/user/form.htm does show the separate words better, and might be easier to parse as something to do with "user".
I don't find camel casing very appealing. While it is the convention for Java and we should follow it when writing Java, we don't have to follow it when naming other things.
It's not that we find it tedious to insert a separator between every two words; it's that the underscore is really hard to type, requiring both pinky fingers to stretch quite far. Unfortunately, the underscore is favored by language designers for identifiers.
Whoever invents the next programming language, please use '/' as the namespace separator and '.' as the word separator, so instead of
java.beans.beancontext.BeanContext.getResourceAsStream()
we have
java/beans/bean.context/Bean.Context/get.resource.as.stream()
wait... '/' is already used for division. never mind.
I want to create an algorithm that searches job descriptions for given words (like Java, Angular, Docker, etc.). My algorithm works, but it is rather naive. For example, it cannot detect the word Java if it is contained in another word (such as JavaEE). When I check for substrings instead, I have the problem that, for example, Java is recognized within the word JavaScript, which I want to avoid. I could of course add an explicit special case here, but I'm looking for a more general solution.
Are there any particular techniques or approaches that try to solve this problem?
Unfortunately, I don't have the amount of data necessary for data-driven approaches like machine learning.
Train a simple word2vec model on your whole corpus of job description text. Then use your own logic to find the keywords; when a match is not exact, consult the model's list of similar words.
For example, if you're searching for Java but also find JavaScript, use your word vectors to check whether there is any similarity between them (in other words, whether they have ever been used in a similar context). Java and JavaEE have probably been used in the same sentence before, but Java and JavaScript, or Angular and Angularentwicklung, have not.
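A minimal sketch of that matching logic in plain Java, assuming the word2vec vectors have already been trained elsewhere and exported in the usual text format (one word followed by its vector per line); the file path and similarity threshold are placeholders to tune:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class KeywordMatcher {
    private final Map<String, double[]> vectors = new HashMap<>();

    // Load vectors exported in word2vec text format: "word v1 v2 ... vN" per line.
    KeywordMatcher(Path vectorFile) throws Exception {
        for (String line : Files.readAllLines(vectorFile)) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length < 3) continue; // skips blank lines and an optional "count dims" header
            double[] v = new double[parts.length - 1];
            for (int i = 1; i < parts.length; i++) v[i - 1] = Double.parseDouble(parts[i]);
            vectors.put(parts[0].toLowerCase(), v);
        }
    }

    // Cosine similarity between two words; 0.0 if either is out of vocabulary.
    double similarity(String a, String b) {
        double[] va = vectors.get(a.toLowerCase()), vb = vectors.get(b.toLowerCase());
        if (va == null || vb == null) return 0.0;
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < va.length; i++) {
            dot += va[i] * vb[i];
            na += va[i] * va[i];
            nb += vb[i] * vb[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
    }

    // Count a token as a hit if it equals the keyword, or if it merely contains it
    // but appears in a sufficiently similar context (the threshold is a guess to tune).
    boolean matches(String keyword, String token, double threshold) {
        if (token.equalsIgnoreCase(keyword)) return true;
        if (!token.toLowerCase().contains(keyword.toLowerCase())) return false;
        return similarity(keyword, token) >= threshold;
    }

    public static void main(String[] args) throws Exception {
        KeywordMatcher matcher = new KeywordMatcher(Paths.get("vectors.txt")); // placeholder path
        System.out.println(matcher.matches("Java", "JavaEE", 0.5));      // hopefully true
        System.out.println(matcher.matches("Java", "JavaScript", 0.5));  // hopefully false
    }
}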
It may seem a bit like over-engineering, but it's not :).
I spent some time researching my problem, and I found that identifying certain words, even if they don't match 1:1, is not a trivial problem. You could solve the problem by listing synonyms for the words you are looking for, or you could build a rule-based named entity recognition service. But that is both error-prone and maintenance-intensive.
Probably the best way to solve my problem is to build a named entity recognition service using machine learning. I am currently watching a video series that looks very promising for the given problem. --> https://www.youtube.com/playlist?list=PL2VXyKi-KpYs1bSnT8bfMFyGS-wMcjesM
I will comment on this answer when I am done with my work to give feedback to those who are facing the same problem.
I'm wondering what I should use as a separator in complex file names, for example "Monthly Project Report".
I see a lot of people using hyphens, i.e. 'monthly-project-report.php'.
But I have some doubts about that, because a hyphen can be mistaken for an arithmetic minus in programming.
Wouldn't it be better to use an underscore (_)?
So which separator is appropriate to use?
That is not a problem. You'd be better off using an underscore (_) if the file is not going to appear in Google search results; most developers normally use a hyphen (-) when targeting Google SEO.
You'd better use an underscore (_), since, as you mentioned, someone might mistake a hyphen for a minus :)
I'm doing a lot of natural language processing with somewhat unusual requirements. I often get tasks similar to lemmatization: given a word (or just a piece of text), I need to find some pattern and transform the word somehow. For example, I may need to correct misspellings, e.g. transform the word "eatin" into "eating". Or I may need to transform the words "ahahaha", "ahahahaha", etc. into just "ahaha", and so on.
So I'm looking for a generic tool that allows me to define transformation rules for such cases. Rules might look something like this:
{w}in -> {w}ing
aha(ha)+ -> ahaha
That is, I need to be able to use patterns captured on the left side of a rule on its right side.
I work with linguists who don't know programming at all, so ideally this tool should use external files and simple language for rules.
I'm doing this project in Clojure, so ideally this tool should be a library for one of JVM languages (Java, Scala, Clojure), but other languages or command line tools are ok too.
There are several very cool NLP projects, including GATE, Stanford CoreNLP, NLTK and others, but I'm not an expert in all of them, so I could have missed the tool I need. If so, please let me know.
Note that I'm working with several languages and performing very different tasks, so concrete lemmatizers, stemmers, misspelling correctors and so on for specific languages do not fit my needs; I really need a more generic tool.
UPD. It seems like I need to give some more details/examples of what I need.
Basically, I need a function for replacing text via some kind of regex (similar to Java's String.replaceAll()), but with the possibility of using captured text in the replacement string. For example, in real-world text people often repeat characters to put emphasis on a particular word, e.g. someone may write "This film is soooo boooring...". I need to be able to replace such repetitive "oooo" runs with a single character. So there may be a rule like this (in a syntax similar to what I used earlier in this post):
{chars1}<char>+{chars2}? -> {chars1}<char>{chars2}
that is, replace a word starting with some characters (chars1), followed by at least three repetitions of <char>, and possibly ending with some other characters (chars2), with a similar string that contains only a single <char>. The key point here is that we capture text on the left side of a rule and use it on the right side.
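To make the intent concrete, here is roughly what those example rules would look like as plain Java regex replacements with backreferences (just a sketch of the behaviour I'm after, not the rule-file format I'm asking about):

public class RuleDemo {
    public static void main(String[] args) {
        // {w}in -> {w}ing : reuse the captured stem as $1
        System.out.println("eatin".replaceAll("\\b(\\w+)in\\b", "$1ing"));      // eating

        // aha(ha)+ -> ahaha : collapse long laughter to a canonical form
        System.out.println("ahahahahaha".replaceAll("aha(?:ha)+", "ahaha"));    // ahaha

        // soooo -> so : collapse 3+ repeats of the same character via backreference \1
        System.out.println("This film is soooo boooring"
                .replaceAll("(\\w)\\1{2,}", "$1"));                             // This film is so boring
    }
}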
I am not an expert in NLP, but I believe Snowball might be of interest to you. It's a language for defining stemming algorithms, and its stemmers are used in the Lucene search engine.
I've also found http://userguide.icu-project.org/transforms/general useful for general pattern/transform tasks like this. Ignore the stuff about transliteration; it's nice for doing a lot of things.
You can just load up rules from a file into a String and register them, etc.
http://userguide.icu-project.org/transforms/general/rules
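For what it's worth, here is a rough sketch of how that could look with ICU4J; the rule strings are illustrative guesses written against the rule syntax described on that page, not tested rules:

import com.ibm.icu.text.Transliterator;

public class IcuRuleDemo {
    public static void main(String[] args) {
        // The rules could just as easily be read from a file into this String.
        String rules =
              "eatin > eating ;"      // simple literal correction
            + "ahahaha > ahaha ;";    // trim one extra "ha"
        Transliterator fix = Transliterator.createFromRules(
                "fix-words", rules, Transliterator.FORWARD);
        System.out.println(fix.transliterate("I was eatin and laughing ahahaha"));
    }
}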
I have come across similar problems a few times in the past and want to know what language or methodology, if any, is used to solve them (I am a J2EE/Java developer):
Problem: out of a probable set of words, with a given rule (say a word can be a combination of A and X and always starts with an X, and each word is delimited by a space), you have to read a sequence of words and parse the input to decide which of the words are syntactically correct. In a nutshell, these are problems that involve parsing techniques, e.g. simulating the logic of a vending machine in Java.
So what I want to know is: what are the techniques or best approaches for solving problems that involve parsing inputs, like the alien language processing problem in Google Code Jam?
Google code jam problem
Do we use something like ANTLR or some other Java library?
I know this question is slightly generic, but I had no other way of expressing it.
P.S.: I do not want a solution; I am looking for the best way to approach such recurring problems.
You can use JavaCC for complex parsing.
For relatively simple parsing and event processing I use enum(s) as a state machine, especially as a push parser.
For very simple parsing, you can use indexOf or split(" ") together with equals, switch, or startsWith.
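As a toy illustration of the enum-as-state-machine idea, here is a hand-rolled validator for the rule in the question (words built from 'A' and 'X' that must start with 'X'); the class and state names are made up:

public class WordValidator {
    enum State { START, VALID, INVALID }

    static boolean isValid(String word) {
        State state = State.START;
        for (char c : word.toCharArray()) {
            switch (state) {
                case START:
                    state = (c == 'X') ? State.VALID : State.INVALID;
                    break;
                case VALID:
                    state = (c == 'A' || c == 'X') ? State.VALID : State.INVALID;
                    break;
                case INVALID:
                    return false; // no need to read further
            }
        }
        return state == State.VALID;
    }

    public static void main(String[] args) {
        for (String w : "XAXA AXA X XAB".split(" ")) {
            System.out.println(w + " -> " + isValid(w)); // true, false, true, false
        }
    }
}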
If you want to simulate the logic of something that is essentially a finite state automaton, you can simply code the FSA by hand; this is a standard computer science solution. A less obvious way is to use a lexer generator (there are lots of them) to generate the FSA from descriptions of the valid sequences of events (in lexer-generator speak, these are called "characters", but you can cheat and substitute event occurrences for characters).
If you have complex recursive rules about matching, you'll want a more traditional parser.
You can code these by hand, too, if the grammar isn't complicated; see my SO answer on "how to build a recursive descent parser". If your grammar is complex or changes quickly, you'll want to use a standard parser generator. Other answers here suggest specific ones, but there are many to choose from, all generally very capable.
[FWIW, I applied parser generators to recognizing valid transaction sequences in 1974, in TRW POS terminals for the May Company department store. Worked pretty well.]
You can use ANTLR, which is good and will help with complex problems. But you can also use regular expressions, e.g. split("\\s+").
I'm working on a web project that will (hopefully) be available in several languages one day (I say "hopefully" because while we only have an English language site planned today, other products of my company are multilingual and I am hoping we are successful enough to need that too).
I understand that the best practice (I'm using Java, Spring MVC, and Velocity here) is to put all text that the user will see in external files, and refer to them in the UI files by name, such as:
#in messages_en.properties:
welcome.header = Welcome to AppName!
#in the markup
<title>#springMessage("welcome.header")</title>
But, having never had to go through this process on a project myself before, I'm curious what the best way to deal with this is when you have some segments of the UI that are heavy on markup, such as:
<p>We are excited to announce that Company1 has been acquired by
Division X,
a fast-growing division of Company 2, Inc.
(Nasdaq: BLAH), based in...
One option I can think of would be to store this "low-level" markup in messages.properties itself as part of the message, but this seems like the worst possible option.
Other options that I can think of are:
Store each non-markup inner fragment in messages.properties, such as acquisitionAnnounce1, acquisitionAnnounce2, acquisitionAnnounce3. This seems very tedious though.
Break this message into more reusable components, such as Company1.name, Company2.name, Company2.ticker, etc., as each of these is likely reused in many other messages. This would probably account for 80% of the words in this particular message.
Are there any best practices for dealing with internationalizing text that is heavy with markup such as this? Do you just have to bite down and bear the pain of breaking up every piece of text? What is the best solution from any projects you've personally dealt with?
Typically, if you use a template engine such as Sitemesh or Velocity, you can manage these smaller HTML building blocks more effectively as subtemplates.
By doing so, you can incrementally boil down the purely internationalized strings into groups and tie them to the relevant markup subtemplates. Having done this sort of work using templates for an app that spanned multiple languages in the same locale, as well as multiple locales, we never placed markup in our message bundles.
I'd suggest that a key good practice is to avoid placing markup (even at the low level you describe) inside message properties files at all costs! The potential this has for unleashing hell is not something to overlook; biting the bullet and breaking things up correctly is far less painful than having to manage many files with scattered HTML markup. It's important to be able to visualise markup as holistic chunks, and scattering it everywhere would make everyday development a chore, since:
You would lose IDE color highlighting and syntax validation
High possibility that one locale file or another can easily be missed when changes to designs / markup filter down
Breaking things down (to a realistic point, e.g. logical sentence structures but no finer) is somewhat hard work upfront, but worth the effort.
Regarding string breakdown granularity, here's a sample of what we did:
comment.atom-details=Subscribe To Comments
comment.username-mandatory=You must supply your name
comment.useremail-mandatory=You must supply your email address
comment.email.notification=Dear {0}, the comment thread you are watching has been updated.
comment.feed.title=Comments on {0}
comment.feed.title.default=Comments
comment.feed.entry.title=Comment on {0} at {1,date,medium} {2,time,HH:mm} by {3}
comment.atom-details=Suscribir a Comentarios
comment.username-mandatory=Debes indicar tu nombre
comment.useremail-mandatory=Debes indicar tu direcci\u00f3n de correo electr\u00f3nico
comment.email.notification=La conversaci\u00f3n que estas viendo ha sido actualizada
comment.feed.title=Comentarios sobre {0}
comment.feed.title.default=Comentarios
comment.feed.entry.title=Comentarios sobre {0} a {1,date,medium} {2,time,HH:mm} por {3}
So you can do interesting things with how you perform string replacement in the message bundle, which may also help you preserve a message's logical meaning while still allowing you to manipulate it mid-sentence.
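For instance, assuming the English and Spanish entries above live in messages_en.properties and messages_es.properties on the classpath, the standard ResourceBundle/MessageFormat machinery fills the placeholders per locale (a sketch; the argument values are made up):

import java.text.MessageFormat;
import java.util.Date;
import java.util.Locale;
import java.util.ResourceBundle;

public class FeedTitleDemo {
    public static void main(String[] args) {
        Locale locale = new Locale("es");
        ResourceBundle bundle = ResourceBundle.getBundle("messages", locale);

        // Pattern: "Comentarios sobre {0} a {1,date,medium} {2,time,HH:mm} por {3}"
        String pattern = bundle.getString("comment.feed.entry.title");
        MessageFormat format = new MessageFormat(pattern, locale);

        Date now = new Date();
        System.out.println(format.format(new Object[] { "My Post", now, now, "Maria" }));
        // e.g. "Comentarios sobre My Post a 3 mar 2024 14:05 por Maria"
    }
}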
As others have said, please never split the strings into segments. You will cause translators grief as they have to coerce their language syntax to the ad-hoc rules you inadvertently create. Often it will not be possible to provide a grammatically correct translation, especially if you reuse certain segments in different contexts.
Do not remove the markup, either.
Please do not assume professional translators work in Notepad :) Computer-aided translation (CAT) tools, such as the Trados suite, know about markup perfectly well. If the tagging is HTML, rather than some custom XML format, no special preparation is required. Trados will protect the tags from accidental modification, while still allowing changes where necessary. Note that certain elements of tags often need to be localized, e.g. alt text or some query strings, so just stripping all the markup won't do.
Best of all, unless you're working on a zero-budget personal project, consider contacting a localization vendor. Localization is a service just like web design. A competent vendor will help you pick the optimal solution/format for your project and guide you through the preparation of the source material and incorporating the localized result. And of course they and their translators will have all the necessary tools. (Full disclosure: I am a translator / localization specialist. And don't split up strings :)
First off, don't split up your strings. This makes it much harder for localizers to translate text because they can't see the entire string to translate.
I would probably try to use placeholders around the links, so that the link markup is passed in rather than embedded, something like:
{0}Division X{1}
That's how I did it when I was localizing a site into 30 languages. It's not perfect, but it works.
I don't think it's possible (or easy) to remove all markup from strings, you need to have a way to insert the urls and any extra markup.
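A sketch of what that placeholder approach might look like (the key name, URLs, and exact wording of the pattern are just illustrative): the whole sentence stays as one translatable string, and only the anchor tags are passed in as arguments.

import java.text.MessageFormat;

public class AcquisitionMessageDemo {
    public static void main(String[] args) {
        // Hypothetical bundle entry, e.g. in messages_en.properties:
        // announce.acquisition=We are excited to announce that Company1 has been acquired by \
        //   {0}Division X{1}, a fast-growing division of {2}Company 2, Inc.{3} (Nasdaq: BLAH).
        String pattern = "We are excited to announce that Company1 has been acquired by "
                + "{0}Division X{1}, a fast-growing division of {2}Company 2, Inc.{3} (Nasdaq: BLAH).";

        // Only the markup is injected; translators see (and can reorder) the whole sentence.
        String html = MessageFormat.format(pattern,
                "<a href=\"/division-x\">", "</a>",
                "<a href=\"https://www.company2.example\">", "</a>");
        System.out.println(html);
    }
}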
You should avoid breaking up your strings. Not only does this become a nightmare to translate, but it also makes grammatical assumptions which may not be correct in the target language.
While placeholders can be helpful for many things, I would not recommend using placeholders for URLs. Keeping the URL inside the localized string allows you to customize it for each locale; after all, there's no sense sending users to an English-language page when their locale is Argentine Spanish!