How to get specific lines in a text file - java

I am trying to read the contents of a text file. The idea is to find the first line with the 'title :' keyword, keep reading, find the next 'title :' line, and so on until the whole file has been read. I want to store the results in a database. Other ideas for doing this are welcome as well. Thanks.
This is the text file I am trying to read from.
title : Mothers Day
mattiebelle : YEA! A movie that grabbed me from beginning to end! Love to come across this kind of movie. A must see for all! Enjoy!
title : Pregnant in Heels
CuittePie : I CAN'T WATCH ANY OF THEES. :#
title : The Flintstones
Row_Sweet_Girl : Nice one to watch
title : Barter Kings
dragon3476 : Barter Kings - Season 1 Episode 4 - Rock and a Hard Place Air Date: 19/06/2012 Summary:Traders barter for a car and a pool table.

I think the easiest way would be using FileUtils from Apache Commons IO like this:
import java.io.File;
import java.io.IOException;
import java.util.List;
import org.apache.commons.io.FileUtils;
public class ReadFileLines {
public static void main(final String[] args) throws IOException {
// readLines returns one String per line, so no raw List or String.valueOf is needed
List<String> lines = FileUtils.readLines(new File("/tmp/myFile.txt"), "UTF-8");
for (String line : lines) {
if (line.startsWith("title : ")) {
System.out.println(line); // here you store it
}
}
}
}
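If adding Commons IO is not an option, the same idea can be sketched with the standard library alone. The class name TitleGrouper and the decision to group each comment line under the title that precedes it are assumptions made for illustration; the question does not say how the data should be laid out in the database.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TitleGrouper {
    private static final String TITLE_PREFIX = "title : ";

    // Groups every non-blank, non-title line under the most recent title line.
    public static Map<String, List<String>> group(List<String> lines) {
        Map<String, List<String>> byTitle = new LinkedHashMap<>();
        List<String> current = null;
        for (String line : lines) {
            if (line.startsWith(TITLE_PREFIX)) {
                String title = line.substring(TITLE_PREFIX.length()).trim();
                current = byTitle.computeIfAbsent(title, k -> new ArrayList<>());
            } else if (current != null && !line.trim().isEmpty()) {
                current.add(line);
            }
        }
        return byTitle;
    }

    public static void main(String[] args) throws IOException {
        // Same path as in the answer above; adjust it to your file.
        List<String> lines = Files.readAllLines(Paths.get("/tmp/myFile.txt"), StandardCharsets.UTF_8);
        // Replace the println with your database insert.
        group(lines).forEach((title, comments) -> System.out.println(title + " -> " + comments));
    }
}
```

A LinkedHashMap keeps the titles in the order they appear in the file, which may or may not matter for your table.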


Why is JSoup printing a question mark

I'm trying to understand the following. I have some code reading a page from gutenberg.org. Almost everything is ok but some characters are not. They are ok in the browser.
package nl.atticworks.gutenberg;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
public class Gutenberg {
private static final String GET_URL = "http://www.gutenberg.org/browse/languages/nl";
public static void main(String[] args) {
try {
Document doc = Jsoup.connect(GET_URL).get();
Elements data = doc.select("div.pgdbbylanguage");
for (Element d : data) {
Elements children = d.select("*");
for (Element child : children) {
if (child.tagName().equals("ul")) {
Element author = children.get(children.indexOf(child) - 1);
String a1 = author.select("a:last-child").text();
if (a1.startsWith("Kara")) {
System.out.println(a1);
Elements titles = child.select("li.pgdbetext a");
for (Element title : titles) {
System.out.println("\t" + title.text());
}
}
}
}
}
} catch (IOException ex) {
// do something...
}
}
}
The string a1 prints "Karadži?, Vuk Stefanovi?, 1787-1864" but should print "Karadžić, Vuk Stefanović, 1787-1864"
I'm pretty sure that the encoding is ok (UTF-8) but the c with acute isn't encoded properly.
Still, browsers do show the correct char, Jsoup doesn't. Why?
Regards,
Hans
Since you haven't said what environment you are running your program in, it is difficult to give a definitive answer, but there is basically nothing wrong with your code. JSoup is not responsible for the display problem; the console you are printing to is.
If you set your console (or IDE) to UTF-8 encoding, the text should display correctly.
I ran this code in my own IDE (IDEA), and the output was exactly what you expected.
So I maintain that the console encoding is the problem.
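If changing the console setting alone does not help, one thing to try is wrapping System.out in a PrintStream that always encodes as UTF-8, instead of relying on the platform default. This is only a sketch: the class name Utf8Console is made up for the example, and the terminal or IDE console still has to interpret the bytes as UTF-8 for the characters to show up correctly.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Console {
    // Wraps a stream so text is encoded as UTF-8 regardless of the platform default.
    public static PrintStream utf8(OutputStream target) throws UnsupportedEncodingException {
        return new PrintStream(target, true, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        PrintStream out = utf8(System.out);
        // Unicode escapes keep this source file independent of the compiler's encoding:
        // \u017E is z with caron, \u0107 is c with acute.
        out.println("Karad\u017Ei\u0107, Vuk Stefanovi\u0107, 1787-1864");
    }
}
```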

Renaming video files with a Java program

I have a folder containing many videos that I'd like to rename, and I can't think of a convenient way of doing so. The naming convention is "SeasonX, EpisodeY: Episode name", which I'll shorten to "SXEY:Name".
An example: S01E01:JavaCode
That would be Season One, Episode One of Episode called JavaCode.
I wrote something that is able to change the file names, but I need different and unique file names for every episode because it's a TV show.
Here's the code:
import java.io.File;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class BatchFileRenamer {
public static void main(String[] args) {
// TODO Auto-generated method stub
File folder = new File("C:\\Users\\Tony\\Videos\\New folder");
TreeMap map = new TreeMap();
String name = "name";
File[] files = folder.listFiles();
Pattern p = Pattern.compile("\\..*");
for (int i = 0; i != files.length; i++) {
Matcher m = p.matcher(files[i].getName());
System.out.println(files[i].getName());
m.find();
// Pad single-digit episode numbers with a leading zero ("E07", not "E17")
files[i].renameTo(new File(folder.getAbsolutePath() + "\\" + name + " S01E" +
(i < 10 ? "0" : "") + i + m.group()));
}
}
}
I was thinking of creating an array containing the episode names but that's just as much work as manually renaming them in Windows. I guess if I had a txt file to download for all the TV shows with the names of the episodes in it it'd be useful.
Anyway, any suggestions would be greatly appreciated!
I think the best way to do this would be to use the Open Movie Database API. With this, you can get a REST response including a list of episodes for each season of a show. (Example request).
With this, you could use Gson or another parser to deserialize the list of episodes:
Here is a Gist of some sample code. (There is probably a better getter method, but you get the point)
The code fetches the response for the sample request above via the API, then uses Gson to deserialize it into a basic POJO defined in the Episodes.java class:
Gson gson = new Gson();
Episodes episodes = gson.fromJson(download, Episodes.class);
System.out.println(episodes);
You can then use this information to create the individual file names for the video files.
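As a rough sketch of that last step, the snippet below turns a season number, an episode number and a title into a file name. The class EpisodeNames, the "S01E01 - Title" layout and the underscore replacement are assumptions for illustration. Note that Windows does not allow ':' in file names, so the "SXEY:Name" form from the question cannot be used literally there.

```java
import java.util.Arrays;
import java.util.List;

public class EpisodeNames {
    // Builds a name such as "S01E02 - Title.mp4" (the layout is an assumption).
    public static String fileName(int season, int episode, String title, String extension) {
        // Replace characters that Windows forbids in file names: \ / : * ? " < > |
        String safeTitle = title.replaceAll("[\\\\/:*?\"<>|]", "_").trim();
        return String.format("S%02dE%02d - %s%s", season, episode, safeTitle, extension);
    }

    public static void main(String[] args) {
        // Hypothetical titles; in practice these would come from the parsed API response.
        List<String> titles = Arrays.asList("JavaCode", "What: Now?");
        for (int i = 0; i < titles.size(); i++) {
            System.out.println(fileName(1, i + 1, titles.get(i), ".mp4"));
        }
    }
}
```

The resulting string can be passed to renameTo in the question's loop in place of the hard-coded name.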

Is there a way to disable spell checking for a Word document in Java?

I'm creating an MS Word document using docx4j. In MS Word, all the text is flagged by the spell checker because it is attributed as English, although it is in a different language. Is there a way to disable spell checking using docx4j or Apache POI?
There is not a single place where you can turn off proofing for the entire document. Instead, it must be turned off for each and every run in the document. You can do that with Apache POI, but you must use the CT classes, as this property has not yet been surfaced. Here is one way you might go about it for a single run.
public static void setNoProof (XWPFRun run) {
CTR ctR = run.getCTR();
CTRPr ctRPr = ctR.isSetRPr() ? ctR.getRPr() : ctR.addNewRPr();
if (!ctRPr.isSetNoProof()) {
// If the noProof property is missing, add it
ctRPr.addNewNoProof();
} else {
// If the noProof property is present, make sure it is not
// FALSE, OFF, or X_0
CTOnOff noProof = ctRPr.getNoProof();
if (noProof.isSetVal() &&
(noProof.getVal() == STOnOff.FALSE ||
noProof.getVal() == STOnOff.OFF ||
noProof.getVal() == STOnOff.X_0)) {
noProof.setVal(STOnOff.TRUE);
}
}
}
Now loop through your runs, and call this method for each run.
There is a possibility to "Hide spelling errors in this document only" and "Hide grammar errors in this document only". See How to Temporarily Disable Spell Check in Word.
Using XWPF (*.docx) the XML for this is contained in /word/settings.xml and looks like:
<w:settings ...>
...
<w:hideSpellingErrors/>
<w:hideGrammaticalErrors/>
...
</w:settings >
We can set this without creating the whole XWPFDocument, using OPCPackage, PackagePart and org.openxmlformats.schemas.wordprocessingml.x2006.main.* objects.
Example:
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackagePart;
import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.regex.Pattern;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.SettingsDocument;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSettings;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.STOnOff;
import org.openxmlformats.schemas.officeDocument.x2006.relationships.STRelationshipId;
import org.apache.xmlbeans.XmlOptions;
import javax.xml.namespace.QName;
import java.util.Map;
import java.util.HashMap;
public class XWPFDisableSpellCheck {
public static void main(String[] args) throws Exception {
File file = new File("XWPFDisableSpellCheck.docx");
OPCPackage opcPackage = OPCPackage.open(file);
PackagePart settingsPart = opcPackage.getPartsByName(Pattern.compile("/word/settings.xml")).get(0);
SettingsDocument settingsDocument = SettingsDocument.Factory.parse(settingsPart.getInputStream());
CTSettings settings = settingsDocument.getSettings();
if (settings.getHideSpellingErrors() == null) settings.addNewHideSpellingErrors();
if (settings.getHideGrammaticalErrors() == null) settings.addNewHideGrammaticalErrors();
//settings.getHideSpellingErrors().setVal(STOnOff.ON);
//settings.getHideGrammaticalErrors().setVal(STOnOff.ON);
//create XmlOptions for saving the settings
XmlOptions xmlOptions = new XmlOptions();
xmlOptions.setSaveOuter();
xmlOptions.setUseDefaultNamespace();
xmlOptions.setSaveAggressiveNamespaces();
xmlOptions.setCharacterEncoding("UTF-8");
xmlOptions.setSaveSyntheticDocumentElement(new QName(CTSettings.type.getName().getNamespaceURI(), "settings"));
Map<String, String> map = new HashMap<String, String>();
map.put(STRelationshipId.type.getName().getNamespaceURI(), "w");
map.put(STRelationshipId.type.getName().getNamespaceURI(), "m");
map.put(STRelationshipId.type.getName().getNamespaceURI(), "o");
xmlOptions.setSaveSuggestedPrefixes(map);
//save the settings
OutputStream out = settingsPart.getOutputStream();
settings.save(out, xmlOptions);
out.close();
opcPackage.close();
}
}
After this code the options "Hide spelling errors in this document only" and "Hide grammar errors in this document only" are set in document XWPFDisableSpellCheck.docx.
Disclosure: I maintain docx4j
When you use XHTMLImporter, you should be importing into a docx with suitable language settings.
Typically this is done in the styles part, w:styles/w:docDefaults/w:rPrDefault/w:rPr:
<w:styles xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
<w:docDefaults>
<w:rPrDefault>
<w:rPr>
<w:lang w:val="en-US" w:eastAsia="ko-KR" w:bidi="ar-SA"/>
</w:rPr>
</w:rPrDefault>
</w:docDefaults>
This value will be effective unless overridden in a style, or in the direct formatting which some of the other answers discuss.
Also, in /word/settings.xml, check w:themeFontLang:
<w:settings xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" >
<w:themeFontLang w:val="en-US" w:eastAsia="ko-KR"/>
You can either maintain per-language templates, or you can use docx4j to alter these settings dynamically. If you want to do that and have any trouble, please post a separate question.
Regarding hiding spelling/grammar errors, assuming MainDocumentPart mdp:
DocumentSettingsPart dsp = mdp.getDocumentSettingsPart();
dsp.getContents().setHideGrammaticalErrors(new BooleanDefaultTrue());
dsp.getContents().setHideSpellingErrors(new BooleanDefaultTrue());
As far as I know, POI can't reach that user preference. But you can change it yourself and use that document as a template for all future documents.

jsoup: How to extract correct data from this website

I am trying to extract data from a Spanish dictionary using jsoup. Essentially, the user will input words he wants to define as command line arguments and the program will return a formatted list of definitions. Here is what I have done so far:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
public class Main {
public static void main(String[] args) {
String[] urls = new String[args.length];
for(int i=0; i<args.length; i++) {
urls[i] = "http://www.diccionarios.com/detalle.php?palabra="
+ args[i]
+ "&Buscar.x=0&Buscar.y=0&Buscar=submit&dicc_100=on&dicc_100=on";
try{
Document doc = Jsoup.connect(urls[i]).get();
Elements htmly = doc.getElementsByTag("html");
String untokenized = htmly.text();
System.out.println(untokenized);
}catch (Exception e) {
System.out.println("EXCEPTION: Word is probably not in this dictionary.");
}
}
}
}
That url array gives the correct urls where the information for the definition is.
Now, what I'm expecting to be returned is what you would get if you went to the try.jsoup website and used (for example) this : http://www.diccionarios.com/detalle.php?palabra=libro&Buscar.x=0&Buscar.y=0&Buscar=submit&dicc_100=on&dicc_100=on
as the link and typed in html as the CSS Query. I need that data so I can tokenize the definition from that.
So I guess my question is: what method would I use to obtain the same data that you can see on the try.jsoup website? Thanks a lot!
Edit: This is about interpreting the data from the url. The end result data I want (in this example) is "Conjunto de hojas escritas unidas o cosidas por uno de sus lados y cubiertas por tapas de cartón u otro material." That is the definition on the website. However, I noticed that on that try.jsoup website that if I put the html text in the CSS Query box then the result was a huge bunch of text. My assumption was that the following 2 lines of code would capture this huge bunch of text and save it as a string:
Elements htmly = doc.getElementsByTag("html");
String untokenized = htmly.text();
However, the output when I print untokenized is instead this: "Usuario Clave ¿Olvidaste tu clave? Condiciones Privacidad Versión completa © 2011 Larousse Editorial, SL." So my question is: how do I obtain the string data for that huge bunch of text found on the try.jsoup website?
EDIT: I followed the advice of the question here: Jsoup - CSS Query selector issue (?) and it worked great.

How extract the data from a list?

I'm developing an Android application and I want to recognize hashtags, mentions and links. I have Objective-C code that does what I need. I asked about it, and now I have this code:
import java.net.MalformedURLException;
import java.net.URL;
import java.util.List;
String input = /* text from edit text */;
String[] words = input.split("\\s");
List<URL> urls=null;
for (String s : words){
try
{
urls.add(new URL(s));
}
catch (MalformedURLException e) {
// not a url
}
}
Now I want to put these in a tweet. I have written the code to post it, and the tweet is built from a string. My question is: how do I put the data from the list into the string?
// I tried this
String tweet="Using my app"+urls;
But the tweet reads "Using my appnull".
How can I reuse this code to recognize hashtags and mentions?
I think that is changing the input.split("\\s") by "#\\s" or "#\\s"
You could just use a library here:
https://github.com/twitter/twitter-text-java
that does what you're trying to do.
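If you would rather not add a dependency, here is a rough standard-library sketch of both parts of the question. It is not how twitter-text works internally: the class TweetEntities and the simplified patterns ('#' or '@' followed by word characters) are assumptions, and they will also match things like the "@example" in an email address. The "appnull" output comes from the list never being initialised, so the sketch creates it with new ArrayList and joins its items with String.join.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TweetEntities {
    // Simplified patterns: '#' or '@' followed by letters, digits or underscores.
    private static final Pattern HASHTAG = Pattern.compile("#\\w+");
    private static final Pattern MENTION = Pattern.compile("@\\w+");

    // Collects every match of the pattern, in order of appearance.
    private static List<String> findAll(Pattern pattern, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static List<String> hashtags(String input) {
        return findAll(HASHTAG, input);
    }

    public static List<String> mentions(String input) {
        return findAll(MENTION, input);
    }

    public static void main(String[] args) {
        String input = "Hello #java and #android from @hans";
        // An initialised (non-null) list of the items to append to the tweet.
        List<String> items = new ArrayList<>(hashtags(input));
        items.addAll(mentions(input));
        String tweet = "Using my app " + String.join(" ", items);
        System.out.println(tweet);
    }
}
```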
