Android parse KML file for time - java

I've been trying to work out how to obtain the travel time between two locations (walking, driving etc...).
As I understand it, the only way to do this accurately is by retrieving a KML file from Google and then parsing it.
Research has shown that it then needs to be parsed with SAX. The problem is, I can't seem to work out how to extract the correct variables (the time). Does anybody know if / how this can be done?
Many thanks for your help,
Pete.

Parsing XML (what KML basically is), using a SAX-Parser: http://www.dreamincode.net/forums/blog/324/entry-2683-parsing-xml-in-java-part-1-sax/
<kml>
<Document>
<Placemark>
<name>Route</name>
<description>Distance: 1.4 mi (about 30 mins)<br/>Map data ©2011 Tele Atlas </description>
</Placemark>
</Document>
</kml>
In the example you can see that the estimated time is stored in the <description> tag. It sits in the last <Placemark> of the KML file, the one that contains <name>Route</name>.
Getting this tag with the SAX parser and extracting the time with a regex should be easy.
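The approach described above could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the class name KmlTravelTime, the method names, and the regex are assumptions made for illustration, and the pattern only covers descriptions shaped like the sample ("... (about 30 mins)").

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class; the name is not part of any library.
class KmlTravelTime {
    // Assumed format: the duration sits in parentheses after "about".
    private static final Pattern DURATION = Pattern.compile("\\(about ([^)]+)\\)");

    /** Returns the duration text of a description, or null if none is found. */
    static String extractDuration(String description) {
        Matcher m = DURATION.matcher(description);
        return m.find() ? m.group(1) : null;
    }

    /** Parses KML text and returns the duration of the Placemark named "Route". */
    static String parse(String kml) throws Exception {
        final String[] duration = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder buffer = new StringBuilder();
            private boolean capturing;
            private String name;
            private String description;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if (qName.equals("Placemark")) {
                    name = null;
                    description = null;
                } else if (qName.equals("name") || qName.equals("description")) {
                    buffer.setLength(0);
                    capturing = true;
                }
                // Other tags (such as <br/> inside the description) leave the buffer alone.
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // SAX may deliver text in several chunks, so accumulate them.
                if (capturing) {
                    buffer.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("name")) {
                    name = buffer.toString().trim();
                    capturing = false;
                } else if (qName.equals("description")) {
                    description = buffer.toString();
                    capturing = false;
                } else if (qName.equals("Placemark") && "Route".equals(name) && description != null) {
                    duration[0] = extractDuration(description);
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(kml.getBytes(StandardCharsets.UTF_8)), handler);
        return duration[0];
    }
}
```

Calling KmlTravelTime.parse(kmlText) with the downloaded KML as a string returns the duration text of the "Route" placemark, or null when no such placemark or no matching description is present.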

Here's my jsoup implementation for getting tracks:
public ArrayList<ArrayList<LatLng>> getAllTracks() {
    ArrayList<ArrayList<LatLng>> allTracks = new ArrayList<ArrayList<LatLng>>();
    // try-with-resources closes the asset stream even if parsing fails
    try (InputStream kml = MyApplication.getInstance().getAssets().open("track.kml")) {
        // Let jsoup read the stream directly so line breaks between coordinate tuples are not lost
        Document doc = Jsoup.parse(kml, "UTF-8", "", Parser.xmlParser());
        for (Element e : doc.select("coordinates")) {
            ArrayList<LatLng> oneTrack = new ArrayList<LatLng>();
            // Tuples are separated by whitespace; trim() avoids an empty first token
            for (String tuple : e.text().trim().split("\\s+")) {
                String[] parts = tuple.split(",");
                if (parts.length < 2) {
                    continue; // skip empty or malformed tuples
                }
                // KML stores longitude,latitude[,altitude]; LatLng expects (latitude, longitude)
                oneTrack.add(new LatLng(Double.parseDouble(parts[1]), Double.parseDouble(parts[0])));
            }
            allTracks.add(oneTrack);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return allTracks;
}

Related

How to replace text in a pdf with correct encoding using Itext

I created a Java program for translating PDFs. I am using the Google API for translation. The translation is correct in my Eclipse IDE console, but when I check the newly created PDF, either it is copied as is without being translated, or only a few words are translated, or the new PDF is empty and sometimes corrupted.
I suppose it has something to do with encoding and font types.
I have already gone through the iText page and all the related questions, but none worked for my case. I am trying to translate Portuguese, Spanish, Finnish, French, Hungarian, etc. into English.
Here is my code:
public static final String SRC = "5587309Finnish.pdf";
public static final String DEST = "changed.pdf";
public static void main(String[] args) throws java.io.IOException, DocumentException {
Translate translate = TranslateOptions.getDefaultInstance().getService();
PdfReader reader = new PdfReader(SRC);
int pages = reader.getNumberOfPages();
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(DEST));
for(int i=1;i<=pages;i++) {
PdfDictionary dict = reader.getPageN(i);
PdfObject object = dict.getDirectObject(PdfName.CONTENTS);
if (object instanceof PRStream) {
String pageContent =
PdfTextExtractor.getTextFromPage(reader, i);
String[] word = pageContent.split(" ");
PRStream stream = (PRStream) object;
byte[] data = PdfReader.getStreamBytes(stream);
String dd = new String(data, BaseFont.CP1252);
for (int j=0; j < word.length; j++)
{
Translation translation = translate.translate(word[j],Translate.TranslateOption.sourceLanguage("fi"),
Translate.TranslateOption.targetLanguage("en"));
System.out.println(word[j]+"-->>"+translation.getTranslatedText());//here i can check the translation is correct.
dd = dd.replace(word[j],translation.getTranslatedText());
}
stream.setData(dd.getBytes());
}
}
stamper.close();
reader.close();
}
Please help.
According to a comment, you have improved your code and are
"getting the update dd(i.e. content stream which I am printing) correctly with the replaced text. I don't know why I am getting a blank pdf"
Thus, I assume that your (hopefully representative) test PDFs have all their fonts of interest encoded in ANSI'ish encodings, and that the text arguments of the text drawing instructions contain whole words or even phrases which can be processed properly; otherwise text replacement would not have been possible.
Here is an example of how one can replace text pieces with similarly long ones under such benign circumstances without breaking the content stream syntax. In this example I simply use a Map containing replacement strings. You can do your translation there.
First a frame loading the source, creating a stamper, iterating over the pages, and calling a helper to create a content stream replacement:
Map<String, String> replacements = new HashMap<>();
replacements.put("Förfallodatum", "Ablaufdatum");
try ( InputStream resource = SOURCE_INPUTSTREAM;
OutputStream result = new FileOutputStream(RESULT_FILE) ) {
PdfReader pdfReader = new PdfReader(resource);
PdfStamper pdfStamper = new PdfStamper(pdfReader, result);
for (int pageNum = 1; pageNum <= pdfReader.getNumberOfPages(); pageNum++) {
PdfDictionary page = pdfReader.getPageN(pageNum);
byte[] pageContentInput = ContentByteUtils.getContentBytesForPage(pdfReader, pageNum);
page.remove(PdfName.CONTENTS);
replaceInStringArguments(pageContentInput, pdfStamper.getUnderContent(pageNum), replacements);
}
pdfStamper.close();
}
(EditPageContentSimple test testReplaceInStringArgumentsForklaringAvFakturan)
The method replaceInStringArguments now parses the instructions in the given content stream, isolates string arguments, and calls another helper for each string argument doing the replacement.
void replaceInStringArguments(byte[] contentBytesBefore, PdfContentByte canvas, Map<String, String> replacements) throws IOException {
PRTokeniser tokeniser = new PRTokeniser(new RandomAccessFileOrArray(new RandomAccessSourceFactory().createSource(contentBytesBefore)));
PdfContentParser ps = new PdfContentParser(tokeniser);
ArrayList<PdfObject> operands = new ArrayList<PdfObject>();
while (ps.parse(operands).size() > 0){
for (int i = 0; i < operands.size(); i++) {
PdfObject pdfObject = operands.get(i);
if (pdfObject instanceof PdfString) {
operands.set(i, replaceInString((PdfString)pdfObject, replacements));
} else if (pdfObject instanceof PdfArray) {
PdfArray pdfArray = (PdfArray) pdfObject;
for (int j = 0; j < pdfArray.size(); j++) {
PdfObject arrayObject = pdfArray.getPdfObject(j);
if (arrayObject instanceof PdfString) {
pdfArray.set(j, replaceInString((PdfString)arrayObject, replacements));
}
}
}
}
for (PdfObject object : operands)
{
object.toPdf(canvas.getPdfWriter(), canvas.getInternalBuffer());
canvas.getInternalBuffer().append((byte) ' ');
}
canvas.getInternalBuffer().append((byte) '\n');
}
}
(EditPageContentSimple helper method)
The method replaceInString in turn retrieves a single string operand (a PdfString instance), manipulates it, and returns the manipulated string version:
PdfString replaceInString(PdfString string, Map<String, String> replacements) {
String value = PdfEncodings.convertToString(string.getBytes(), PdfObject.TEXT_PDFDOCENCODING);
for (Map.Entry<String, String> entry : replacements.entrySet()) {
value = value.replace(entry.getKey(), entry.getValue());
}
return new PdfString(PdfEncodings.convertToBytes(value, PdfObject.TEXT_PDFDOCENCODING));
}
(EditPageContentSimple helper method)
Instead of that for loop, you would call your translation routine here and translate value.
As has been mentioned before, this code only works under certain benign circumstances. Don't expect it to work for arbitrary documents from the wild, in particular not for documents with other than Western European glyphs.

Character encoding of parsed strings is wrong only after building jar

I am writing a program that generates PDF files of printable exams. I have all the exam questions stored in a JSON file. The catch is that the exam is in Czech, so there are many special characters (specifically ěščřžýáíé). When I run the program in IDEA, it works perfectly: the output is exactly as it is supposed to be.
But when I build the executable jar, the generated files have chunks of wrongly encoded text, specifically anything that went through the JSON parser. Everything hard-coded, like headers etc., is encoded properly, so the mistake must be in the parser.
The JSON input file is encoded in UTF-8.
I use these two methods to parse the JSON file.
private static Category[] parseJSON(){
JSONParser jsonParser = new JSONParser();
Category[] categories = new Category[0];
try (FileReader reader = new FileReader("otazky.json")){
// Read JSON file
Object obj = jsonParser.parse(reader);
JSONArray categoryJSONList = (JSONArray) obj;
java.util.List<JSONObject> categoryList = new ArrayList<>(categoryJSONList);
categories = new Category[categoryJSONList.size()];
int i = 0;
for (JSONObject category : categoryList) {
categories[i] = parseCategoryObject(category);
i++;
}
} catch (ParseException | IOException e) {
e.printStackTrace();
}
return categories;
}
private static Category parseCategoryObject(JSONObject category) {
String categoryName = (String) category.get("name");
int generateCount = (int) (long) category.get("generateCount");
JSONArray questionsJSONArray = (JSONArray) category.get("questions");
java.util.List<JSONObject> questionJSONList = new ArrayList<>(questionsJSONArray);
Question[] questions = new Question[questionJSONList.size()];
int j = 0;
for (JSONObject question : questionJSONList) {
JSONArray answers = (JSONArray) question.get("answers");
String s = (String) question.get("question");
String[] a = new String[answers.size()];
for (int i = 0; i < answers.size(); i++) {
a[i] = answers.get(i).toString();
}
int c = (int) (long) question.get("correct");
Question q = new Question(s, a, c);
questions[j] = q;
j++;
}
return new Category(categoryName, questions, generateCount);
}
The output looks like this:
...
Právnà norma:
a) je obecnÄ› závaznĂ© pravidlo chovánĂ, kterĂ© nemusĂ mĂt urÄŤitou formu,
b) nemĹŻĹľe bĂ˝t součástĂ právnĂho pĹ™edpisu,
...
While I would need it to look like this:
...
Právní norma:
a) je obecně závazné pravidlo chování, které nemusí mít určitou formu,
b) nemůže být součástí právního předpisu,
...
Benjamin Urquhart suggested that I try using InputStreamReader and FileInputStream instead of FileReader to read the file, because with FileReader you cannot specify the encoding (the system default is used). I find those two classes hard to use, but I found an alternative, Files.readAllLines, which is fairly easy to use and decodes as UTF-8 by default, and it worked.
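For reference, the InputStreamReader variant is not much longer. The following is a minimal sketch using only the standard library; the class name Utf8Read and its method names are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; the name is not part of any library.
class Utf8Read {
    /** Opens a reader that always decodes as UTF-8, regardless of the platform default. */
    static Reader openUtf8(Path path) throws IOException {
        return new BufferedReader(
                new InputStreamReader(Files.newInputStream(path), StandardCharsets.UTF_8));
    }

    /** Reads the whole file into a String, decoding it as UTF-8. */
    static String readAll(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }
}
```

The reader returned by openUtf8(Paths.get("otazky.json")) should be usable in place of the FileReader in the jsonParser.parse(reader) call from the question, ideally inside the same try-with-resources statement so that it gets closed.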

Getting wrong text sequence when image scanned by offline google mobile vision API

public StringBuilder scanImage(Bitmap bp)
{
StringBuilder sb=null;
TextRecognizer tcx = new
TextRecognizer.Builder(getApplicationContext()).build();
if (!tcx.isOperational())
{
Toast.makeText(getApplicationContext(), "could not get text", Toast.LENGTH_SHORT).show();
} else
{
Frame fame = new Frame.Builder().setBitmap(bp).build();
SparseArray<TextBlock> items = tcx.detect(fame);
sb = new StringBuilder();
for (int i = 0; i < items.size(); ++i)
{
TextBlock mytext = items.valueAt(i);
sb.append(mytext.getValue());
sb.append("\n");
}
}
return sb;
}
This is my code. I'm using the Google Mobile Vision API. I just pass in the image bitmap to scan, but this method returns the scanned text in the wrong sequence. Please tell me how to get the text in the proper sequence. Thank you in advance.
The detected blocks are not provided in sequence. You will need to check the position of each text block and do some maths to arrange them.
Use methods such as mytext.getBoundingBox(), mytext.getCornerPoints(), etc. to find the position of the text blocks.
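One simple way to do that arranging is to sort the blocks by the top edge of their bounding box and, for equal tops, by the left edge. The sketch below is self-contained and therefore uses a hypothetical Block class in place of TextBlock; in the real code, top and left would come from the Rect returned by getBoundingBox().

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper; the names are not part of the Mobile Vision API.
class ReadingOrder {
    /** Stand-in for a detected block: its text and the top-left corner of its bounding box. */
    static final class Block {
        final String text;
        final int left;
        final int top;

        Block(String text, int left, int top) {
            this.text = text;
            this.left = left;
            this.top = top;
        }
    }

    /** Joins the block texts top to bottom, then left to right. */
    static String join(List<Block> blocks) {
        List<Block> sorted = new ArrayList<>(blocks);
        Collections.sort(sorted, (a, b) -> a.top != b.top
                ? Integer.compare(a.top, b.top)
                : Integer.compare(a.left, b.left));
        StringBuilder sb = new StringBuilder();
        for (Block b : sorted) {
            sb.append(b.text).append('\n');
        }
        return sb.toString();
    }
}
```

A strict top-then-left comparison can misorder blocks that sit on the same visual line but whose top edges differ by a few pixels, so real scans may need the tops grouped into rows with some tolerance first.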

Converting part of .docx document to html using Apache POI

I use XHTMLConverter to convert .docx to HTML in order to preview the document. Is there any way to convert only a few pages of the original document? I'll be grateful for any help.
You have to parse the complete .docx file; it is not possible to read just parts of it. As for selecting a specific page number, I'm afraid (at least I believe) that Word does not store page numbers, so there is no function in the library to access a specified page.
(I read this on another forum, so it might actually be false information.)
PS: POI's Excel API has a getSheetAt() method (this might help your research).
But there are other ways to access your pages. For instance, you could read the lines of your document and search for the page numbers (this may misfire if your text contains those numbers, though). Another, more accurate way would be to search for the header of the page:
HeaderStories headerStore = new HeaderStories( doc);
String header = headerStore.getHeader(pageNumber);
This should give you the header of the specified page (note that HeaderStories works on an HWPFDocument, i.e. the older .doc format). The same goes for the footer:
HeaderStories headerStore = new HeaderStories( doc);
String footer = headerStore.getFooter(pageNumber);
If this doesn't work... I am not really into that API.
Here is a little example of a very sloppy solution:
import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class ReadDocFile
{
    public static void main(String[] args)
    {
        // try-with-resources closes the input stream when done
        try (FileInputStream fis = new FileInputStream("c:\\New.doc"))
        {
            HWPFDocument document = new HWPFDocument(fis);
            WordExtractor extractor = new WordExtractor(document);
            String[] fileData = extractor.getParagraphText();
            // -1 means the header text was not found
            int firstLineOfPageOne = -1;
            int lastLineOfPageOne = -1;
            for (int i = 0; i < fileData.length; i++)
            {
                // paragraph text may carry a trailing line break, so trim before comparing
                String line = fileData[i].trim();
                if (line.equals("headerPageOne")) {
                    firstLineOfPageOne = i;
                }
                if (line.equals("headerPageTwo")) {
                    lastLineOfPageOne = i;
                }
            }
            System.out.println("Page one spans paragraphs " + firstLineOfPageOne + " to " + lastLineOfPageOne);
        }
        catch (Exception exep)
        {
            exep.printStackTrace();
        }
    }
}
If you go with this, I would recommend creating a list of your headers and refactoring the for loop into a separate getPages() method. Your loop would then look like:
List<String> headers = new ArrayList<String>(Arrays.asList("header1","header2","header3","header4"));
int x = 0; // index of the current page's header
int firstLineOfPage = -1;
int lastLineOfPage = -1;
for (int i = 0; i < fileData.length; i++)
{
    //well there should be a loop for "x" too
    if (fileData[i].equals(headers.get(x))) {
        firstLineOfPage = i;
    }
    if (fileData[i].equals(headers.get(x + 1))) {
        lastLineOfPage = i;
    }
}
You could create an object (int pageStart, int pageStop), which would be the result of your method.
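A self-contained sketch of such a getPages() method might look like the following. It assumes that every page begins with a distinct, known header paragraph and that the headers appear in order; the PageRanges and PageRange names are made up for illustration, and the paragraph array stands in for the result of extractor.getParagraphText().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the names are not part of Apache POI.
class PageRanges {
    /** Paragraph indices of one page: pageStart inclusive, pageStop exclusive. */
    static final class PageRange {
        final int pageStart;
        final int pageStop;

        PageRange(int pageStart, int pageStop) {
            this.pageStart = pageStart;
            this.pageStop = pageStop;
        }
    }

    /** Splits the paragraphs into pages, each page starting at its header paragraph. */
    static List<PageRange> getPages(String[] paragraphs, List<String> headers) {
        List<PageRange> pages = new ArrayList<>();
        int start = -1; // start of the page currently being collected, -1 if none yet
        int next = 0;   // index of the header expected next
        for (int i = 0; i < paragraphs.length && next < headers.size(); i++) {
            if (paragraphs[i].trim().equals(headers.get(next))) {
                if (start >= 0) {
                    pages.add(new PageRange(start, i)); // previous page ends here
                }
                start = i;
                next++;
            }
        }
        if (start >= 0) {
            // The last page found runs to the end of the document.
            pages.add(new PageRange(start, paragraphs.length));
        }
        return pages;
    }
}
```

Only the paragraphs between pageStart and pageStop of the wanted pages would then be passed on for conversion.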
I hope it helped you :)

Open Microsoft Word in Java

I'm trying to open an MS Word 2003 document in Java, search for a specified String and replace it with a new String. I use Apache POI to do that. My code looks like this:
public void searchAndReplace(String inputFilename, String outputFilename,
HashMap<String, String> replacements) {
File outputFile = null;
File inputFile = null;
FileInputStream fileIStream = null;
FileOutputStream fileOStream = null;
BufferedInputStream bufIStream = null;
BufferedOutputStream bufOStream = null;
POIFSFileSystem fileSystem = null;
HWPFDocument document = null;
Range docRange = null;
Paragraph paragraph = null;
CharacterRun charRun = null;
Set<String> keySet = null;
Iterator<String> keySetIterator = null;
int numParagraphs = 0;
int numCharRuns = 0;
String text = null;
String key = null;
String value = null;
try {
// Create an instance of the POIFSFileSystem class and
// attach it to the Word document using an InputStream.
inputFile = new File(inputFilename);
fileIStream = new FileInputStream(inputFile);
bufIStream = new BufferedInputStream(fileIStream);
fileSystem = new POIFSFileSystem(bufIStream);
document = new HWPFDocument(fileSystem);
docRange = document.getRange();
numParagraphs = docRange.numParagraphs();
keySet = replacements.keySet();
for (int i = 0; i < numParagraphs; i++) {
paragraph = docRange.getParagraph(i);
text = paragraph.text();
numCharRuns = paragraph.numCharacterRuns();
for (int j = 0; j < numCharRuns; j++) {
charRun = paragraph.getCharacterRun(j);
text = charRun.text();
System.out.println("Character Run text: " + text);
keySetIterator = keySet.iterator();
while (keySetIterator.hasNext()) {
key = keySetIterator.next();
if (text.contains(key)) {
value = replacements.get(key);
charRun.replaceText(key, value);
docRange = document.getRange();
paragraph = docRange.getParagraph(i);
charRun = paragraph.getCharacterRun(j);
text = charRun.text();
}
}
}
}
bufIStream.close();
bufIStream = null;
outputFile = new File(outputFilename);
fileOStream = new FileOutputStream(outputFile);
bufOStream = new BufferedOutputStream(fileOStream);
document.write(bufOStream);
} catch (Exception ex) {
System.out.println("Caught an: " + ex.getClass().getName());
System.out.println("Message: " + ex.getMessage());
System.out.println("Stacktrace follows.............");
ex.printStackTrace(System.out);
}
}
I call this function with following arguments:
HashMap<String, String> replacements = new HashMap<String, String>();
replacements.put("AAA", "BBB");
searchAndReplace("C:/Test.doc", "C:/Test1.doc", replacements);
When the Test.doc file contains a simple line like "AAA EEE", it works. With a more complicated file, the content is read successfully and Test1.doc is generated, but when I try to open it, I get the following error:
Word unable to read this document. It may be corrupt.
Try one or more of the following:
* Open and repair the file.
* Open the file with Text Recovery converter.
(C:\Test1.doc)
Please tell me what to do, because I'm a beginner in POI and I have not found a good tutorial for it.
First of all you should be closing your document.
Besides that, what I suggest doing is resaving your original Word document as a Word XML document, then changing the extension manually from .XML to .doc . Then look at the XML of the actual document you're working with and trace the content to make sure you're not accidentally editing hexadecimal values (AAA and EEE could be hex values in other fields).
Without seeing the actual Word document it's hard to say what's going on.
There is not much documentation about POI at all, especially for Word document unfortunately.
I don't know whether it's OK to answer my own question, but I'll do so to share the knowledge.
After searching the web, the final solution I found is:
The library docx4j is very good for dealing with MS docx files. Its documentation is still sparse and its forum is still in its early days, but overall it helped me do what I needed.
Thanks to all who helped me.
You could try the OpenOffice API, but there aren't many resources out there to tell you how to use it.
You can also try this one: http://www.dancrintea.ro/doc-to-pdf/
Looks like this could be the issue.
