I was wondering if theres a way to use Java to get the encoding type of a string on the clipboard. (I'd give more details but the question is fairly straight forward).
Ex. I go into a unicode program, copy text, use the Java program to decipher it, and the java program spits out "UTF-16"
To access the clipboard, you can use the awt datatransfer classes.
To detect the charset, you can use the CharsetDetector from ICU project.
Here is the code :
public static String getClipboardCharset () throws UnsupportedCharsetException, UnsupportedFlavorException, IOException {
String clipText = null;
final Clipboard clipboard = Toolkit.getDefaultToolkit().getSystemClipboard();
final Transferable contents = clipboard.getContents(null);
if ((contents != null) && contents.isDataFlavorSupported(DataFlavor.stringFlavor))
clipText = (String) contents.getTransferData(DataFlavor.stringFlavor);
if (contents!=null && clipText!=null) {
final CharsetDetector cd = new CharsetDetector();
cd.setText(clipText.getBytes());
final CharsetMatch cm = cd.detect();
if (cm != null)
return cm.getName();
}
throw new UnsupportedCharsetException("Unknown");
}
Here are the imports needed :
import java.awt.Toolkit;
import java.awt.datatransfer.Clipboard;
import java.awt.datatransfer.DataFlavor;
import java.awt.datatransfer.Transferable;
import java.awt.datatransfer.UnsupportedFlavorException;
import java.io.IOException;
import java.nio.charset.UnsupportedCharsetException;
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;
Related
I am trying to edit a run using Apache POI, and I want this edit to show up as a tracked revision (so that I can accept/reject this suggestion) in a .docx file.
package apache;
import java.io.FileOutputStream;
import java.util.List;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
public class Main {
public static void main(String[] args) throws Exception {
XWPFDocument doc = new XWPFDocument(OPCPackage.open("input.docx"));
doc.setTrackRevisions(true);
for (XWPFParagraph p : doc.getParagraphs()) {
List<XWPFRun> runs = p.getRuns();
if (runs != null) {
for (XWPFRun r : runs) {
String text = r.getText(0);
if (text != null && text.contains("needle")) {
text = text.replace("needle", "haystack");
r.setText(text, 0);
}
}
}
doc.write(new FileOutputStream("/Users/srt/Desktop/output.docx"));
}
}
This code is able to replace the text perfectly, but I cannot show this edit as a tracked revision. I made use of the doc.setTrackRevisions(true) method, but it still does not track the revision. Any help here will be really appreciated!
My final goal is to convert a file from ANSI to UTF-8. To do so, I use some code with Java :
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
public class ConvertFromAnsiToUtf8 {
public static void main(String[] args) throws IOException {
try {
Path p = Paths.get("C:\\shared_to_vm\\test_encode\\test.csv");
ByteBuffer bb = ByteBuffer.wrap(Files.readAllBytes(p));
CharBuffer cb = Charset.forName("windows-1252").decode(bb);
bb = Charset.forName("UTF-8").encode(cb);
Files.write(p, bb.array());
} catch (Exception e) {
System.out.println(e);
}
}
}
The code works perfectly when I test it on small files. My file is convert from ANSI to UTF-8 and all characters are recognize and well encoded. But as soon as I try to use it on the file I need to convert, I get the error java.lang.OutOfMemoryError: Java heap space.
So far as my understanding goes, I got like 1.5 million lines in my file so I am pretty sure I create too many objects with my application.
Of course, I have checked what this error means and how I could solve it (like here or here for example) but is improving the memory capacity of my JVM the only way to solve it ? And if it is, how much more should i use ?
Any kind of help (tip, advice, link or else) would be greatly appreciated !
Don't read the whole file at once:
ByteBuffer bb = ByteBuffer.wrap(Files.readAllBytes(p));
Instead, try to read line-by-line:
Files.lines(p, Charset.forName("windows-1252")).forEach(line -> {
// Convert your line, write to file
});
Stream the input, convert the character encoding, and write the output as you go. This way, you don't need to read the entire file into memory, but only as much as you want.
If you want to minimize the number of (slowish) system calls, you could use a similar approach, but explicitly create a BufferedInputStream with a larger internal buffer, and then wrap that in an InputStreamReader. But the simple approach shown here is unlikely to be a critical point in many applications.
private static final Charset WINDOWS1252 = Charset.forName("windows-1252");
private static final int DEFAULT_BUF_SIZE = 8192;
public static void transcode(Path input, Path output) throws IOException {
try (Reader r = Files.newBufferedReader(input, WINDOWS1252);
Writer w = Files.newBufferedWriter(output, StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW)) {
char[] buf = new char[DEFAULT_BUF_SIZE];
while (true) {
int n = r.read(buf);
if (n < 0) break;
w.write(buf, 0, n);
}
}
}
If you have a large file, which is larger then available random access memory you should convert characters chunk-by-chunk.
Following you can found the example:
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
public class Iconv {
private static void iconv(Charset toCode, Charset fromCode, Path src, Path dst) throws IOException {
CharsetDecoder decoder = fromCode.newDecoder();
CharsetEncoder encoder = toCode.newEncoder();
try (ReadableByteChannel source = FileChannel.open(src, StandardOpenOption.READ);
WritableByteChannel destination = FileChannel.open(dst, StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING,
StandardOpenOption.WRITE);) {
ByteBuffer readBytes = ByteBuffer.allocate(4096);
while (source.read(readBytes) > 0) {
readBytes.flip();
destination.write(encoder.encode(decoder.decode(readBytes)));
readBytes.clear();
}
}
}
public static void main(String[] args) throws Exception {
iconv(Charset.forName("UTF-8"), Charset.forName("Windows-1252"), Paths.get("test.csv") , Paths.get("test-utf8.csv") );
}
}
I used Eclipse Luna 64bit, Maven, docx4j API for PDF conversion, template letter format on which I want my HTML code. This template is saved in my database.
I want to include a hyperlink in the PDF, so my users can on click this link and open it in their browser.
This is my main class:
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Properties;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.faces.bean.ManagedBean;
import javax.faces.bean.ManagedProperty;
import javax.faces.bean.ViewScoped;
import javax.faces.model.SelectItem;
import org.apache.commons.lang.StringUtils;
import org.docx4j.Docx4J;
import org.docx4j.XmlUtils;
import org.docx4j.convert.in.xhtml.XHTMLImporterImpl;
import org.docx4j.jaxb.Context;
import org.docx4j.openpackaging.exceptions.Docx4JException;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.relationships.Namespaces;
import org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart;
import org.docx4j.wml.Body;
import org.docx4j.wml.BooleanDefaultTrue;
import org.docx4j.wml.Document;
import org.docx4j.wml.P;
import org.docx4j.wml.PPrBase;
import org.docx4j.wml.R;
import org.docx4j.wml.Text;
import org.primefaces.context.RequestContext;
import org.primefaces.model.DefaultStreamedContent;
import org.primefaces.model.StreamedContent;
import org.primefaces.model.UploadedFile;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class LetterMaintenanceBean extends BaseManagedBean implements
Serializable {
public StreamedContent previewLetter() {
String content = this.letter.getHtmlContent();
String regex = "<a href=(\"[^\"]*\")[^<]*</a>"; //Digvijay
Pattern p = Pattern.compile(regex); //Digvijay
System.out.println("p: "+p);
Matcher m = p.matcher(content); //Digvijay
System.out.println("m: "+m);
content = m.replaceAll("<strong><u><span style=\"color:#0099cc\">$1</span></u></strong>"); //Digvijay
System.out.println("regex1: "+regex); //Digvijay
Map<String, String> previewExamples = this.getPreviewExamples(this.letter.getMessageTypeCode());
for (Entry<String, String> example : previewExamples.entrySet()) {
if (StringUtils.isNotBlank(example.getKey()) && StringUtils.isNotBlank(example.getValue())) {
content = content.replace(example.getKey(), example.getValue());
System.out.println("content after map date");
}
}
System.out.println("content1:: "+content);
if (!content.startsWith("<div>")) {
content = "<div>" + content + "</div>";
}
// Docx4j does not understand HTML codes for special characters. So replacing with Unicode values.
content = content.replace(" ", " ");
content = content.replace("’", "’");
content = content.replaceAll("</p>", "</p><br/>");
content = content.replaceAll("\"</span>", "</span>");
InputStream stream = null;
try {
System.out.println("content:"+content);
if (this.letter.getHtmlContent().getBytes() != null && this.letter.getWfTemplateId() != null) {
stream = new ByteArrayInputStream(this.HTMLToPDF(content.getBytes(), this.letter.getWfTemplateId()));
} else {
stream = new ByteArrayInputStream(this.HTMLToPDFWithoutTemplate(content.getBytes()));
}
StreamedContent file = new DefaultStreamedContent(stream, "application/pdf", this.letter.getLetterName() + ".pdf");
return file;
} catch (LetterMaintenanceException e) {
this.processServiceException(e);
StreamedContent file = new DefaultStreamedContent(
new ByteArrayInputStream(
"Unable to process your request. If the problem persists, please contact application support."
.getBytes()), "application/pdf", "error" + ".pdf");
return file;
} catch (Exception e) {
this.processGenericException(e);
StreamedContent file = new DefaultStreamedContent(
new ByteArrayInputStream(
"Unable to process your request. If the problem persists, please contact application support."
.getBytes()), "application/pdf", "error" + ".pdf");
return file;
}
}
This is my HTMLToPDF() method:
private byte[] HTMLToPDF(final byte[] htmlContent, final String templateId)
throws Docx4JException, LetterMaintenanceException {
LetterMaintenanceDelegate letterMaintenanceDelegate = new LetterMaintenanceDelegate();
Template template = letterMaintenanceDelegate.retrieveTemplateById(templateId);
if (template == null || template.getContent() == null) {
throw new LetterMaintenanceException("Could not retrieve template");
}
InputStream is = new ByteArrayInputStream(template.getContent());
WordprocessingMLPackage templatePackage = WordprocessingMLPackage.load(is);
// Convert HTML to docx
XHTMLImporterImpl XHTMLImporter = new XHTMLImporterImpl(templatePackage);
XHTMLImporter.setHyperlinkStyle("Hyperlink");
templatePackage
.getMainDocumentPart()
.getContent()
.addAll(XHTMLImporter.convert(new ByteArrayInputStream(htmlContent), null));
// Add content of content docx to template
templatePackage.getMainDocumentPart().getContent().addAll(templatePackage.getMainDocumentPart().getContent());
// Handle page breaks
templatePackage = this.handlePagebreaksInDocx(templatePackage);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
Docx4J.toPDF(templatePackage, baos);
return baos.toByteArray();
}
}
In this code I am trying to convert HTML (with href tag) to PDF file and in the PDF output the hyperlink must work.
The current output of this program is a PDF but there are no working links in it.
How can I activate my links?
So I'm trying to use jsoup to scrape Reddit for images, but when I scrape certain subreddits such as /r/wallpaper, I get a 429 error and am wondering how to fix this. Totally understand that this code is horrible and this is a pretty noob question, but I'm completely new to this. Anyways:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.io.*;
import java.net.URL;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.io.*;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attributes;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.net.URL;
import java.util.Scanner;
public class javascraper{
public static void main (String[]args) throws MalformedURLException
{
Scanner scan = new Scanner (System.in);
System.out.println("Where do you want to store the files?");
String folderpath = scan.next();
System.out.println("What subreddit do you want to scrape?");
String subreddit = scan.next();
subreddit = ("http://reddit.com/r/" + subreddit);
new File(folderpath + "/" + subreddit).mkdir();
//test
try{
//gets http protocol
Document doc = Jsoup.connect(subreddit).timeout(0).get();
//get page title
String title = doc.title();
System.out.println("title : " + title);
//get all links
Elements links = doc.select("a[href]");
for(Element link : links){
//get value from href attribute
String checkLink = link.attr("href");
Elements images = doc.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
if (imgCheck(checkLink)){ // checks to see if img link j
System.out.println("link : " + link.attr("href"));
downloadImages(checkLink, folderpath);
}
}
}
catch (IOException e){
e.printStackTrace();
}
}
public static boolean imgCheck(String http){
String png = ".png";
String jpg = ".jpg";
String jpeg = "jpeg"; // no period so checker will only check last four characaters
String gif = ".gif";
int length = http.length();
if (http.contains(png)|| http.contains("gfycat") || http.contains(jpg)|| http.contains(jpeg) || http.contains(gif)){
return true;
}
else{
return false;
}
}
private static void downloadImages(String src, String folderpath) throws IOException{
String folder = null;
//Exctract the name of the image from the src attribute
int indexname = src.lastIndexOf("/");
if (indexname == src.length()) {
src = src.substring(1, indexname);
}
indexname = src.lastIndexOf("/");
String name = src.substring(indexname, src.length());
System.out.println(name);
//Open a URL Stream
URL url = new URL(src);
InputStream in = url.openStream();
OutputStream out = new BufferedOutputStream(new FileOutputStream( folderpath+ name));
for (int b; (b = in.read()) != -1;) {
out.write(b);
}
out.close();
in.close();
}
}
Your issue is caused by the fact that your scraper is violating reddit's API rules. Error 429 means "Too many requests" – you're requesting too many pages too fast.
You can make one request every 2 seconds, and you also need to set a proper user agent (they format they recommend is <platform>:<app ID>:<version string> (by /u/<reddit username>)). The way it currently looks, your code is running too fast and doesn't specify one, so it's going to be severely rate-limited.
To fix it, first off, add this to the start of your class, before the main method:
public static final String USER_AGENT = "<PUT YOUR USER AGENT HERE>";
(Make sure to specify an actual user agent).
Then, change this (in downloadImages)
URL url = new URL(src);
InputStream in = url.openStream();
to this:
URLConnection connection = (new URL(src)).openConnection();
Thread.sleep(2000); //Delay to comply with rate limiting
connection.setRequestProperty("User-Agent", USER_AGENT);
InputStream in = connection.getInputStream();
You'll also want to change this (in main)
Document doc = Jsoup.connect(subreddit).timeout(0).get();
to this:
Document doc = Jsoup.connect(subreddit).userAgent(USER_AGENT).timeout(0).get();
Then your code should stop running into that error.
Note that using reddit's API (IE, /r/subreddit.json instead of /r/subreddit) would probably make this project easier, but it isn't required and your current code will work.
As you can look up at Wikipedia the 429 status code tells you that you have too many requests:
The user has sent too many requests in a given amount of time. Intended for use with rate limiting schemes.
A solution would be to slow down your scraper. There are some options how to do this, one would be to use sleep.
I have images of codes that I want to decode. How can I use zxing so that I specify the image location and get the decoded text back, and in case the decoding fails (it will for some images, that's the project), it gives me an error.
How can I setup zxing on my Windows machine? I downloaded the jar file, but I don't know where to start. I understand I'll have to create a code to read the image and supply it to the library reader method, but a guide how to do that would be very helpful.
I was able to do it. Downloaded the source and added the following code. Bit rustic, but gets the work done.
import com.google.zxing.NotFoundException;
import com.google.zxing.ChecksumException;
import com.google.zxing.FormatException;
import com.google.zxing.BarcodeFormat;
import com.google.zxing.DecodeHintType;
import com.google.zxing.Reader;
import com.google.zxing.BinaryBitmap;
import com.google.zxing.Result;
import com.google.zxing.LuminanceSource;
import com.google.zxing.client.j2se.BufferedImageLuminanceSource;
import com.google.zxing.common.HybridBinarizer;
import java.awt.image.BufferedImage;
import javax.imageio.ImageIO;
import java.io.File;
import java.io.IOException;
import java.util.*;
import com.google.zxing.qrcode.QRCodeReader;
class qr
{
public static void main(String args[])
{
Reader xReader = new QRCodeReader();
BufferedImage dest = null;
try
{
dest = ImageIO.read(new File(args[0]));
}
catch(IOException e)
{
System.out.println("Cannot load input image");
}
LuminanceSource source = new BufferedImageLuminanceSource(dest);
BinaryBitmap bitmap = new BinaryBitmap(new HybridBinarizer(source));
Vector<BarcodeFormat> barcodeFormats = new Vector<BarcodeFormat>();
barcodeFormats.add(BarcodeFormat.QR_CODE);
HashMap<DecodeHintType, Object> decodeHints = new HashMap<DecodeHintType, Object>(3);
decodeHints.put(DecodeHintType.POSSIBLE_FORMATS, barcodeFormats);
decodeHints.put(DecodeHintType.TRY_HARDER, Boolean.TRUE);
Result result = null;
try
{
result = xReader.decode(bitmap, decodeHints);
System.out.println("Code Decoded");
String text = result.getText();
System.out.println(text);
}
catch(NotFoundException e)
{
System.out.println("Decoding Failed");
}
catch(ChecksumException e)
{
System.out.println("Checksum error");
}
catch(FormatException e)
{
System.out.println("Wrong format");
}
}
}
The project includes a class called CommandLineRunner which you can simply call from the command line. You can also look at its source to see how it works and reuse it.
There is nothing to install or set up. It's a library. Typically you don't download the jar but declare it as a dependency in your Maven-based project.
If you just want to send an image to decode, use http://zxing.org/w/decode.jspx