Balabolka Text Extract Utility
The utility extracts text from various types of files. The extracted text can be combined into one file and/or split into several files. A list of pronunciation correction rules from Balabolka can be applied to the text.
Supported formats for input files: AZW, AZW3, CHM, DOC, DOCX, EPUB, FB2, HTML, MHT, MOBI, ODT, PDB, PDF, PRC, RTF, TCR, TXT, WPD. The IFilter interface will be used for files with unknown extensions.
The utility runs from the command line and does not display any user interface. This makes it suitable, for example, for integrating its text processing options into other applications.
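As a rough illustration of such an integration, the following Python sketch builds a blb2txt command line and runs it as a child process. The helper names build_command and extract are hypothetical, and the sketch assumes that blb2txt can be found on the PATH. Only the options -f, -v, -p and -e are used, as they appear in the examples in this document.

```python
import subprocess
from typing import List, Optional


def build_command(sources: List[str], out_dir: str,
                  name: Optional[str] = None,
                  encoding: Optional[str] = None,
                  exe: str = "blb2txt") -> List[str]:
    """Build an argument list from the -f, -v, -p and -e options.

    'exe' is an assumption: it presumes the utility is reachable on PATH.
    """
    cmd = [exe]
    for src in sources:
        cmd += ["-f", src]       # one -f per input file or mask
    cmd += ["-v", out_dir]       # output folder
    if name is not None:
        cmd += ["-p", name]      # output file name
    if encoding is not None:
        cmd += ["-e", encoding]  # output encoding, e.g. "utf8"
    return cmd


def extract(sources: List[str], out_dir: str, **kwargs) -> int:
    """Run the utility and return the exit code of the child process."""
    return subprocess.run(build_command(sources, out_dir, **kwargs)).returncode


if __name__ == "__main__":
    # Hypothetical paths, mirroring the examples in this document.
    print(build_command([r"d:\Docs\book.doc"], "d:\\Text\\", name="New Book"))
```

Passing the arguments as a list leaves the quoting of paths with spaces to the subprocess module, so the calling application does not have to assemble a quoted command string itself.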
Execution order of operations:
The utility accepts command line parameters that control how text is extracted from files. The command line syntax is "blb2txt [options ...]"; all parameters must be separated by spaces. Options can appear in any order on the command line as long as they are paired with their related parameters. Run "blb2txt -?" to get help on the command line syntax and parameters.
Extract text from BOOK.DOC and save as "New Book.txt":
blb2txt -f "d:\Docs\book.doc" -v "d:\Text\" -p "New Book"
Extract text from Microsoft Word and RTF documents, remove empty lines and save the text files in UTF-8 encoding:
blb2txt -f "d:\Docs\*.doc" -f "d:\Docs\*.rtf" -v "d:\Text\" -e "utf8" --replace-empty-lines
Extract text from all files in the specified folder, combine it into one file and save as "Document.txt":
blb2txt -f "d:\Docs\*.*" -v "d:\Text\" -p "Document" -u
Extract text from 1.DOC, split it into parts of 100 KB each and save them as text files "Document 20.txt", "Document 21.txt", etc.:
blb2txt -f "d:\Docs\1.doc" -v "d:\Text\" -p "Document" -a -n 20 -t 100
Extract text from BOOK.FB2, use the words "CHAPTER" and "CONTENTS" to split the text into parts and save the parts as files named "Book 1.txt", "Book 2.txt", etc.:
blb2txt -f "d:\Book\book.fb2" -v "d:\Text\" -p "Book" -k "CHAPTER" -k "CONTENTS"
Extract text from BOOK.EPUB, split the text into parts at each "###", remove "###" from the text and save each part as a new file:
blb2txt -f "d:\Book\book.epub" -v "d:\Text\" -p "Book" -r "###"
Read text from STDIN, remove excess spaces, line breaks and empty lines, and write the updated text to STDOUT:
blb2txt -i -o --remove-spaces --remove-linebreaks --replace-empty-lines
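The STDIN/STDOUT command above can also be driven from another program. The following Python sketch is a minimal illustration, not a definitive implementation: the function name clean_text is hypothetical, blb2txt is assumed to be on the PATH, and the text is passed as raw bytes because the encoding the utility expects on STDIN is not stated here.

```python
import subprocess

# Options copied from the STDIN/STDOUT example in this document.
PIPE_OPTIONS = ["-i", "-o", "--remove-spaces", "--remove-linebreaks",
                "--replace-empty-lines"]


def clean_text(raw: bytes, exe: str = "blb2txt") -> bytes:
    """Send raw bytes to the utility on STDIN and return its STDOUT bytes.

    'exe' is an assumption: it presumes the utility is reachable on PATH.
    """
    result = subprocess.run([exe, *PIPE_OPTIONS], input=raw,
                            stdout=subprocess.PIPE)
    return result.stdout
```

The caller is responsible for encoding the text before the call and decoding the result afterwards.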
The command line options can be stored in a configuration file named "blb2txt.cfg" located in the same folder as the utility.
The sample configuration file:
The utility can combine options from the configuration file with options given on the command line.
You are free to use and distribute the software for noncommercial purposes. For commercial use or distribution, you need permission from the copyright holder.