Determine file extension for image urls - java

Is there a reliable and fast way to determine the file extension of an image URL? There are a few options I see, but none of them work consistently for images of the format below:
https://cdn-image.blay.com/sites/default/files/styles/1600x1000/public/images/12.jpg?itok=e-zA1T
I have tried:
new MimetypesFileTypeMap().getContentType(url)
This results in the generic "application/octet-stream", in which case I fall back to these two:
Files.getFileExtension
FilenameUtils.getExtension
I would like to avoid regex where possible, so is there another utility that properly handles links with query arguments (.jpeg?blahblah)? I would also like to avoid downloading the image or connecting to the URL in any way, as this needs to be a performant call.

If you can trust that the URLs are not malformed, how about this:
FilenameUtils.getExtension(URI.create(url).getPath())
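For instance, a quick sketch with the URL from the question (assuming Apache Commons IO is on the classpath; the class name is just for illustration):

import java.net.URI;
import org.apache.commons.io.FilenameUtils;

public class ExtensionDemo {
    public static void main(String[] args) {
        String url = "https://cdn-image.blay.com/sites/default/files/styles/1600x1000/public/images/12.jpg?itok=e-zA1T";
        // URI.getPath() drops the query string, so the "?itok=..." part never
        // reaches FilenameUtils.getExtension().
        String ext = FilenameUtils.getExtension(URI.create(url).getPath());
        System.out.println(ext); // prints "jpg"
    }
}

No connection is opened; this is pure string handling on the URL.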

Can't you just look at the file extension in the URL? That would be something like:
public static String getFileExtension(String url) {
    // Find the start of the query string, if any
    int phpChar = url.length();
    for (int i = 0; i < url.length(); i++) {
        if (url.charAt(i) == '?') {
            phpChar = i;
            break;
        }
    }
    // Walk backwards from there to the last '.' before the query string
    int character = phpChar - 1;
    while (url.charAt(character) != '.') character -= 1;
    return url.substring(character + 1, phpChar);
}
Maybe not the most elegant solution, but it works, even with the query string (?) in the URL.


Using the split function in a proper way

I'm trying to understand what is the proper way to achieve the following goal. Consider the following string:
/some/path/to/some/dir
I would like to split the path by / and take the last two strings, joining them with _ so the output would be:
some_dir
I'm familiar with the split function, but I'm not sure what the proper way to write this code is in terms of code style.
I know that I have to check first if the string is valid. For example, the string dir is not valid.
What is the proper way to solve it?
You can play with the following. I omit error checks for the sake of simplicity.
class Test {
    public static void main(String[] args) {
        String s = "/some/path/to/some/dir";
        String[] parts = s.split("/");
        int len = parts.length;
        String theLastTwoParts = parts[len - 2] + "_" + parts[len - 1];
        System.out.println(theLastTwoParts);
    }
}
You can use the function shown below for this purpose:
public String convertPath(String path) {
    String[] str = path.split("/");
    int length = str.length;
    if (length < 2) {
        // Customize the result here for this specific case
        return "";
    }
    return str[length - 2] + "_" + str[length - 1];
}
If you're actually handling paths, you probably want to use the standard library's Path ecosystem. You can use it like this:
Path path = Paths.get(p);
int nameCount = path.getNameCount();
if (nameCount < 2) throw new RuntimeException();
String result = String.format("%s_%s", path.getName(nameCount-2), path.getName(nameCount-1));
The advantage is that when you're working on Windows, it will also handle the different path separator, so it's more platform independent.
The question of "dir" being "invalid" raises the follow-up question of how you want it handled. Throwing a RuntimeException like I do is probably not going to hold up.

How to Strip Out the Text From an HTML String in Java

I want to analyze the structure of HTML pages. For a page, I have it as a string, and I want to strip out the text and keep only the HTML structure. I don't want to use a DOM parser, and I need something robust that works on regular HTML, not only XHTML. I know regular expressions are good enough to strip HTML tags out of a string, but can they be used to strip out the text and keep only the HTML tags?
Do you know any other option/framework I could use?
I doubt that there is an easy way to do this using regex.
Jericho is a pretty neat HTML parser with a small footprint and a single jar without additional external libraries.
Do you know any other option/framework I could use?
You might want to look at JSoup. Seems to be designed to solve exactly this type of problem.
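If it helps, here is a rough sketch of that idea (assuming Jsoup is on the classpath; the exact whitespace of the output depends on Jsoup's pretty-printer): parse the string, drop every text node, and emit what is left, which is just the tags.

import java.util.ArrayList;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class StripText {
    public static void main(String[] args) {
        String html = "<body><p>hello <b>world</b></p></body>";
        Document doc = Jsoup.parse(html);
        // Remove the text nodes of every element, leaving only the markup.
        for (Element el : doc.getAllElements()) {
            for (TextNode tn : new ArrayList<>(el.textNodes())) {
                tn.remove();
            }
        }
        System.out.println(doc.body().html());
    }
}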
If you've stripped out tags before, you know the basic gist is to strip out everything between < and >. Stripping out text is very similar, except you're stripping out everything between > and <. So yes, regular expressions would serve you very well in stripping out the text and leaving just the tags. They could also be used to strip out tag attributes as well if you didn't want to deal with them.
This might give you a decent start. I don't have much experience with HTML so I don't know if there is anything else to parse out of the string besides < tags >.
public static void main(String[] args) {
    String html = "<body> text text text text </body>";
    String htmlTags = null;
    char c;
    for (int i = 0; i < html.length(); i++) {
        c = html.charAt(i);
        if (tagStart(Character.toString(c))) {
            // Collect everything from this '<' up to and including the next '>'
            for (int j = i; j < html.length(); j++) {
                if (htmlTags != null) {
                    htmlTags += Character.toString(html.charAt(j));
                } else {
                    htmlTags = Character.toString(html.charAt(j));
                }
                c = html.charAt(j);
                if (tagStop(Character.toString(c))) {
                    break;
                }
            }
        }
    }
    System.out.println(htmlTags); // e.g. "<body></body>"
}

private static boolean tagStart(String check) {
    return check.equals("<");
}

private static boolean tagStop(String check) {
    return check.equals(">");
}
Something along the lines of:
pageSource.replaceAll(">[^<]*<", "><");
should get you started (using [^<] instead of a greedy .* keeps a single match from swallowing the tags between two pieces of text).

regular expression replace 2 characters with one

I would like to use a regular expression for the following problem:
SOME_RANDOM_TEXT
should be converted to:
someRandomText
So, the _ (plus the character that follows it) should be replaced with just that letter in upper case. I found something like this, using the tool:
_\w and $&
How do I get only the second letter in the replacement? Any advice? Thanks.
It might be easier simply to String.split("_") and then rejoin, capitalising the first letter of each string in your collection.
Note that Apache Commons has lots of useful string-related stuff, including a join() method.
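A minimal sketch of that approach with the plain JDK (no Commons; the method name is just for illustration):

public static String camelCaseBySplit(String input) {
    String[] parts = input.toLowerCase().split("_");
    StringBuilder sb = new StringBuilder(parts[0]);
    for (int i = 1; i < parts.length; i++) {
        if (parts[i].isEmpty()) continue; // tolerate doubled underscores
        sb.append(Character.toUpperCase(parts[i].charAt(0)))
          .append(parts[i].substring(1));
    }
    return sb.toString();
}
// camelCaseBySplit("SOME_RANDOM_TEXT") -> "someRandomText"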
The problem is that the case conversion from lowercase to uppercase is not supported by java.util.regex.Pattern.
This means you will need to do the conversion programmatically as Brian suggested. See also this thread
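For reference, a small sketch of doing that conversion programmatically with java.util.regex (assuming the all-caps input from the question):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CamelCaseRegex {
    public static void main(String[] args) {
        String input = "SOME_RANDOM_TEXT".toLowerCase();
        Matcher m = Pattern.compile("_(\\w)").matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Upper-case the captured letter ourselves, since the regex engine can't do it
            m.appendReplacement(sb, m.group(1).toUpperCase());
        }
        m.appendTail(sb);
        System.out.println(sb); // someRandomText
    }
}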
You can also write a simple method to do this. It's more complicated but more optimized:
public static String toCamelCase(String value) {
    value = value.toLowerCase();
    byte[] source = value.getBytes();
    int maxLen = source.length;
    byte[] target = new byte[maxLen];
    int targetIndex = 0;
    for (int sourceIndex = 0; sourceIndex < maxLen; sourceIndex++) {
        byte c = source[sourceIndex];
        if (c == '_') {
            // Skip the underscore and upper-case the character that follows it
            if (sourceIndex < maxLen - 1)
                source[sourceIndex + 1] = (byte) Character.toUpperCase(source[sourceIndex + 1]);
            continue;
        }
        target[targetIndex++] = source[sourceIndex];
    }
    return new String(target, 0, targetIndex);
}
I like Apache commons libraries, but sometimes it's good to know how it works and be able to write some specific code for jobs like this.

Debugging Java Out of Memory Error

I'm still a relatively new programmer, and an issue I keep having in Java is Out of Memory Errors. I don't want to increase the memory using -Xmx, because I feel that the error is due to poor programming, and I want to improve my coding rather than rely on more memory.
The work I do involves processing lots of text files, each around 1GB when compressed. The code I have here is meant to loop through a directory where new compressed text files are being dropped. It opens the second most recent text file (not the most recent, because this is still being written to), and uses the Jsoup library to parse certain fields in the text file (fields are separated with custom delimiters: "|nTa|" designates a new column and "|nLa|" designates a new row).
I feel there should be no reason for using a lot of memory. I open a file, scan through it, parse the relevant bits, write the parsed version into another file, close the file, and move onto the next file. I don't need to store the whole file in memory, and I certainly don't need to store files that have already been processed in memory.
I'm getting errors when I start parsing the second file, which suggests that memory from the first file isn't being garbage collected. Please have a look at the code and see if you can spot anything that makes me use more memory than I should. I want to learn how to do this right so I stop getting memory errors!
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Scanner;
import java.util.TreeMap;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import org.jsoup.Jsoup;

public class ParseHTML {

    public static int commentExtractField = 3;
    public static int contentExtractField = 4;
    public static int descriptionField = 5;

    public static void main(String[] args) throws Exception {
        File directoryCompleted = null;
        File filesCompleted[] = null;
        while (true) {
            // find second most recent file in completed directory
            directoryCompleted = new File(args[0]);
            filesCompleted = directoryCompleted.listFiles();
            if (filesCompleted.length > 1) {
                TreeMap<Long, File> timeStamps = new TreeMap<Long, File>(Collections.reverseOrder());
                for (File f : filesCompleted) {
                    timeStamps.put(getTimestamp(f), f);
                }
                File fileToProcess = null;
                int counter = 0;
                for (Long l : timeStamps.keySet()) {
                    fileToProcess = timeStamps.get(l);
                    if (counter == 1) {
                        break;
                    }
                    counter++;
                }
                // start processing file
                GZIPInputStream gzipInputStream = null;
                if (fileToProcess != null) {
                    gzipInputStream = new GZIPInputStream(new FileInputStream(fileToProcess));
                } else {
                    System.err.println("No file to process!");
                    System.exit(1);
                }
                Scanner scanner = new Scanner(gzipInputStream);
                scanner.useDelimiter("\\|nLa\\|");
                GZIPOutputStream output = new GZIPOutputStream(new FileOutputStream("parsed/" + fileToProcess.getName()));
                while (scanner.hasNext()) {
                    Scanner scanner2 = new Scanner(scanner.next());
                    scanner2.useDelimiter("\\|nTa\\|");
                    ArrayList<String> row = new ArrayList<String>();
                    while (scanner2.hasNext()) {
                        row.add(scanner2.next());
                    }
                    for (int index = 0; index < row.size(); index++) {
                        if (index == commentExtractField ||
                            index == contentExtractField ||
                            index == descriptionField) {
                            output.write(jsoupParse(row.get(index)).getBytes("UTF-8"));
                        } else {
                            output.write(row.get(index).getBytes("UTF-8"));
                        }
                        String delimiter = "";
                        if (index == row.size() - 1) {
                            delimiter = "|nLa|";
                        } else {
                            delimiter = "|nTa|";
                        }
                        output.write(delimiter.getBytes("UTF-8"));
                    }
                }
                output.finish();
                output.close();
                scanner.close();
                gzipInputStream.close();
            }
        }
    }

    public static Long getTimestamp(File f) {
        String name = f.getName();
        String removeExt = name.substring(0, name.length() - 3);
        String timestamp = removeExt.substring(7, removeExt.length());
        return Long.parseLong(timestamp);
    }

    public static String jsoupParse(String s) {
        if (s.length() == 4) {
            return s;
        } else {
            return Jsoup.parse(s).text();
        }
    }
}
How can I make sure that when I finish with objects, they are destroyed and not using any resources? For example, each time I close the GZIPInputStream, GZIPOutputStream and Scanner, how can I make sure they're completely destroyed?
For the record, the error I'm getting is:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2882)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:572)
at java.lang.StringBuilder.append(StringBuilder.java:203)
at org.jsoup.parser.TokeniserState$47.read(TokeniserState.java:1171)
at org.jsoup.parser.Tokeniser.read(Tokeniser.java:42)
at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:101)
at org.jsoup.parser.TreeBuilder.parse(TreeBuilder.java:53)
at org.jsoup.parser.Parser.parse(Parser.java:24)
at org.jsoup.Jsoup.parse(Jsoup.java:44)
at ParseHTML.jsoupParse(ParseHTML.java:125)
at ParseHTML.main(ParseHTML.java:81)
I haven't spent very long analysing your code (nothing stands out), but a good general-purpose start would be to familiarise yourself with the free VisualVM tool. This is a reasonable guide to its use, though there are many more articles.
There are better commercial profilers in my opinion - JProfiler for one - but it will at the very least show you what objects/classes most memory is being assigned to, and possibly the method stack traces that caused that to happen. More simply it shows you heap allocation over time, and you can use this to judge whether you are failing to clear something or whether it is an unavoidable spike.
I suggest this rather than looking at the specifics of your code because it is a useful diagnostic skill to have.
Update: This issue was fixed in JSoup 1.6.2
It looks to me like it's probably a bug in the JSoup parser that you're using...at present the documentation for JSoup.parse() has a warning "BETA: if you do get an exception raised, or a bad parse-tree, please file a bug." Which suggests they aren't confident that it's completely safe for use in production code.
I also found several bug reports mentioning out of memory exceptions, one of which suggests that it's due to parse error objects being held statically by JSoup, and that downgrading from JSoup 1.6.1 to 1.5.2 may be a work-around.
I am wondering if your parse is failing because you have bad HTML (e.g. unclosed tags, unpaired quotes or whatnot) being parsed? You could do an output/println to see how far you are getting in the document, if at all. The Java library may not find the end of the document/file before running out of memory.
parse
public static Document parse(String html) — Parse HTML into a Document. As no base URI is specified, absolute URL detection relies on the HTML including a <base href> tag.
http://jsoup.org/apidocs/org/jsoup/Jsoup.html#parse(java.lang.String)
It's a little hard to tell what's going on but two things come to my mind.
1) In some weird circumstances (depending on the input file), the following loop might load the entire file into memory:
while(scanner2.hasNext()) {
row.add(scanner2.next());
}
2) By looking at the stackTrace it seems that the jsoupParse is the problem. I believe that this line Jsoup.parse(s).text(); loads s into memory first and depending on the string size (that again depends on the particular file input) this might cause the OutOfMemoryError
Maybe a combination of the two points above is the issue. Again, it's hard to tell by just looking at the code.
Does this happen always with the same file? Did you check the input content and the custom delimiters in it?
Assuming the problem is not in the JSoup code, we can do some memory optimization. For example, the ArrayList<String> row could be dropped, since it buffers an entire row's fields in memory when each field can be written out as soon as it is parsed.
Inner loop with row removed:
// Caution! May contain obvious bugs!
while (scanner.hasNext()) {
    String scanStr = scanner.next();
    // manually count the fields to replace 'row.size()'
    int rowCount = 0;
    int offset = 0;
    while ((offset = scanStr.indexOf("|nTa|", offset)) >= 0) {
        rowCount++;
        offset++;
    }
    rowCount++;
    Scanner scanner2 = new Scanner(scanStr);
    scanner2.useDelimiter("\\|nTa\\|");
    int index = 0;
    while (scanner2.hasNext()) {
        String curRow = scanner2.next();
        if (index == commentExtractField
                || index == contentExtractField
                || index == descriptionField) {
            output.write(jsoupParse(curRow).getBytes("UTF-8"));
        } else {
            output.write(curRow.getBytes("UTF-8"));
        }
        String delimiter = "";
        if (index == rowCount - 1) {
            delimiter = "|nLa|";
        } else {
            delimiter = "|nTa|";
        }
        output.write(delimiter.getBytes("UTF-8"));
        index++; // advance the field counter (otherwise every field is treated as field 0)
    }
}

How to insert a StringBuilder element into a GWT app?

So, I am getting a StringBuilder element as the return parameter from already established code, and I need to insert it into my GWT app. This StringBuilder element has been formatted into a table before being returned.
For more clarity, below is the code showing how the StringBuilder is generated and what is returned.
private static String formatStringArray(String header, String[] array, int[] removeCols) {
    StringBuilder buf = new StringBuilder("<table bgcolor=\"DDDDDD\" border=\"1\" cellspacing=\"0\" cellpadding=\"3\">");
    if (removeCols != null)
        Arrays.sort(removeCols);
    if (header != null) {
        buf.append("<tr bgcolor=\"99AACC\">");
        String[] tokens = header.split(",");
        //StringTokenizer tokenized = new StringTokenizer(header, ",");
        //while (tokenized.hasMoreElements()) {
        for (int i = 0; i < tokens.length; i++) {
            if (removeCols == null || Arrays.binarySearch(removeCols, i) < 0) {
                buf.append("<th>");
                buf.append(tokens[i]);
                buf.append("</th>");
            }
        }
        buf.append("</tr>");
    }
    if (array.length > 0) {
        for (String element : array) {
            buf.append("<tr>");
            String[] tokens = element.split(",");
            if (tokens.length > 1) {
                for (int i = 0; i < tokens.length; i++) {
                    if (removeCols == null || Arrays.binarySearch(removeCols, i) < 0) {
                        buf.append("<td>");
                        buf.append(tokens[i]);
                        buf.append("</td>");
                    }
                }
            } else {
                // Let any non tokenized row get through
                buf.append("<td>");
                buf.append(element);
                buf.append("</td>");
            }
            buf.append("</tr>");
        }
    } else {
        buf.append("<tr><td>No results returned</td></tr>");
    }
    buf.append("</table>");
    return buf.toString();
}
So, the buf.toString() returned above is to be received in a GWT class, added to a panel and displayed... Now the question is: how do I make all this happen?
I'm absolutely clueless as I'm a newbie and would be very thankful for any help.
Regards,
Chirayu
Could you be more specific, Chirayu? The "already established code" (is that a servlet? Does it run on the server side or the client side?) that supposedly returns a StringBuilder obviously returns a String, which can be easily transferred via GWT-RPC, JSON, etc.
But like Eyal mentioned, "you are doing it wrong" - you are generating HTML code by hand, which is additional work, leads to security holes (XSS, etc) and is more error-prone. The correct way would be:
Instead of generating the view/HTML code on the server (I'm assuming the above code is executed on the server), you just fetch the relevant data - via any transport that is available in GWT
On the client, put the data from the server in some nice Widgets. If you prefer to work with HTML directly, check out UiBinder. Otherwise, the old widgets, composites, etc way is ok too.
This way, you'll minimize the data sent between the client and the server and get better separation (to take it further, check out MVP). Plus, less load on the server - win-win.
And to stop being a newbie, RTFM - it's all there. Notice that all the links I've provided here lead to the official docs :)
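That said, if you do end up with the pre-built HTML string on the client (say, inside an RPC callback), a minimal sketch of displaying it would look like this (widgets from the standard com.google.gwt.user.client.ui package; the callback wiring and the "tableContainer" element id are placeholders):

// inside the onSuccess() of your GWT-RPC / RequestBuilder callback
String tableHtml = result;                          // the String built by formatStringArray()
HTML tableWidget = new HTML(tableHtml);             // the HTML widget renders raw markup as-is
RootPanel.get("tableContainer").add(tableWidget);   // or add it to any Panel you already have

Just keep in mind the security and separation concerns above: rendering server-built HTML verbatim is the quick route, not the recommended one.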
