Balabolka :: Utility for Extracting Text from Files

Utility for Extracting Text from Files

This console application extracts text from various types of files. The extracted text can be combined into a single file and/or split into several files. The list of pronunciation-correction rules from Balabolka can be applied to the text.

Supported input file formats: AZW, AZW3, CHM, DjVu, DOC, DOCX, EML, EPUB, FB2, FB3, HTML, LIT, MD, MHT, MOBI, ODP, ODS, ODT, PDB, PDF, PPT, PPTX, PRC, RTF, TCR, TXT, TXTZ, WPD, WRI, XLS, XLSX. The IFilter interface is used for files with unknown extensions.

The utility runs from the command line and displays no user interface. This is useful, for example, when text-processing features need to be integrated into other applications.

Order of operations:

Extract text from the input file (or files).
Format the text: remove spaces, line breaks, etc. (if the options are specified).
Combine the files into a single file (if the option is specified).
Split the text (if the options are specified).
Apply the pronunciation-correction rules (if the option is specified).
Save the output file (or files).
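
The order of the steps above matters, for instance because files are combined before the text is split. The following Python sketch is a toy model of that order only; it is not the utility's implementation. The function names, the simplified formatting (collapsing repeated spaces), the fixed-size splitting, and the plain string replacement used for rules are all assumptions made for illustration.

```python
import re


def format_text(text):
    # Stand-in for the formatting options: collapse runs of spaces.
    return re.sub(r" {2,}", " ", text)


def split_text(text, size):
    # Stand-in for splitting: fixed-size chunks counted in characters.
    return [text[i:i + size] for i in range(0, len(text), size)]


def apply_rules(text, rules):
    # Stand-in for pronunciation rules: plain substring replacement.
    for old, new in rules.items():
        text = text.replace(old, new)
    return text


def process(extracted, combine=False, part_size=None, rules=None):
    """Run the stages in the documented order on already-extracted texts."""
    texts = [format_text(t) for t in extracted]   # step 2: format
    if combine:
        texts = ["\n".join(texts)]                # step 3: combine
    if part_size:
        texts = [p for t in texts for p in split_text(t, part_size)]  # step 4: split
    if rules:
        texts = [apply_rules(t, rules) for t in texts]                # step 5: rules
    return texts                                  # step 6: the caller saves the parts
```

In this model, `process(["a  b", "c"], combine=True, part_size=3)` formats each text first, joins the results, and only then cuts the combined text into parts.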

License: Free (freeware)

Command line

The utility accepts a number of command-line parameters for extracting text from files. The command line uses the syntax "blb2txt [options ...]"; all parameters must be separated by a space. Options can appear in any order on the command line. All parameters are case-insensitive. Run "blb2txt -?" to get help on the command-line syntax and parameters.
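
When the utility is called from another program, it can be convenient to assemble the command line as a list of arguments rather than as one string. The Python sketch below shows one way to do that; the helper names `build_command` and `run_blb2txt` are hypothetical, and it assumes that blb2txt is reachable through the PATH (otherwise pass a full path as `exe`).

```python
import subprocess


def build_command(options, exe="blb2txt"):
    """Return an argument list: the executable, then each option and its value.

    `options` is a list of (name, value) pairs; use None for options
    that take no value, such as -s or -u.
    """
    argv = [exe]
    for name, value in options:
        argv.append(name)
        if value is not None:
            argv.append(str(value))
    return argv


def run_blb2txt(options, exe="blb2txt"):
    # Passing a list lets subprocess handle quoting of paths with spaces.
    # The meaning of the exit code is not documented here, so it is
    # returned to the caller instead of being interpreted.
    return subprocess.run(build_command(options, exe))
```

For example, `build_command([("-f", r"d:\Docs\book.doc"), ("-v", "d:\\Text\\"), ("-rm", None)])` yields the argument list for extracting one document and removing empty lines.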

-f file_mask

Sets the name of the input file or a mask for a group of input files. The command line may contain several -f options.

-fl file_name

Sets the name of the text file with the list of input files (one file name per line).

-v folder_name

Sets the name of the output folder for saving text files.

-p text

Sets the pattern for the output file name (for example, "Text Document"). If absent, the input file name will be used.

Use the %FileName% variable to insert the input file name into the output file name.
Use the %FirstLine% variable to insert the first line of text.
Use the %Header% variable to insert the chapter title.
Use the %Number% variable to change the position of the sequence number inside the output file name.
Use the %Title% variable to insert the title of the HTML document (for HTML files only).

Warning! Percent signs (%) must be doubled in a batch script. For example: -p %%Number%%
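
If batch scripts are generated by a program, the doubling described in the warning can be applied automatically. This is a minimal Python sketch; `escape_for_batch` is a hypothetical helper name, and it only handles percent signs, not other characters that are special in batch files.

```python
def escape_for_batch(argument):
    """Double every percent sign so a batch script passes it through literally."""
    return argument.replace("%", "%%")


# Build a line for a .bat file; the pattern comes from the variables listed above.
pattern = "%Number% - %Header%"
line = 'blb2txt -f "book.fb2" -c -p "{}"'.format(escape_for_batch(pattern))
```

The doubling is needed only inside batch scripts. When the arguments are passed directly to the process (for example as a list from another program), the pattern should be left unescaped.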

-ext text

Sets the extension for output file names. The default is "txt".

-out file_name

Sets the full name of the output file. Specifying this option is recommended only when the utility is used as part of other software. When the utility is used for custom document import, the external program runs the utility from the command line and passes the full name of the text file to create.
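
The import scenario described for -out can be sketched in Python as follows. The function name `import_document` and the `run` parameter (which exists only so the call can be replaced in tests) are hypothetical. The sketch also assumes that -e utf8 produces UTF-8 output, possibly with a byte order mark; the "utf-8-sig" codec reads the file either way.

```python
import os
import subprocess
import tempfile


def import_document(path, exe="blb2txt", run=subprocess.run):
    """Extract text from `path` through a temporary file named with -out."""
    with tempfile.TemporaryDirectory() as tmp:
        out_path = os.path.join(tmp, "out.txt")
        # The external program chooses the full output name and passes it on.
        run([exe, "-f", path, "-out", out_path, "-e", "utf8"])
        # Assumption: the output is UTF-8, with or without a BOM.
        with open(out_path, encoding="utf-8-sig") as handle:
            return handle.read()
```

The temporary folder is removed when the function returns, so the caller receives only the extracted text.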

-s

Searches for input files in subfolders.

-cf

Creates a subfolder for each input file. The input file name is used as the name of the output subfolder.

-i

Reads data from STDIN. The file format is auto-detected from the data. If this option is specified, the -f option is ignored.

-o

Writes text to STDOUT. If this option is specified, the -v and -p options are ignored.
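
Together, -i and -o allow the utility to be used as a filter without any files on disk. The Python sketch below pipes document bytes in and reads text back; `extract_from_bytes` and its `run` parameter are hypothetical names. Whether -e also controls the encoding of STDOUT is an assumption here, so the decoding step may need to be adjusted.

```python
import subprocess


def extract_from_bytes(data, exe="blb2txt", run=subprocess.run):
    """Send raw document bytes to STDIN and return the text read from STDOUT."""
    result = run(
        [exe, "-i", "-o", "-e", "utf8"],
        input=data,               # document bytes; the format is auto-detected
        stdout=subprocess.PIPE,   # capture the extracted text
    )
    # Assumption: STDOUT honours -e utf8. "utf-8-sig" drops a leading BOM if present.
    return result.stdout.decode("utf-8-sig", errors="replace")
```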

-u

Combines text files into one output file.

-b

Adds a sequence number before the output file name (when text is split).

-a

Adds a sequence number after the output file name (when text is split).

-n integer

Sets the starting sequence number for output files (when text is split). The default is 1.

-e encoding

Sets the encoding for output files ("ansi", "utf8" or "unicode"). The default is "ansi".
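
A program that reads the output files needs to open them with a matching codec. The mapping below is a guess based on common Windows conventions, not something this document states: "ansi" is taken to mean the system code page (Python's Windows-only "mbcs" codec), "utf8" is read with "utf-8-sig" so that an optional byte order mark is dropped, and "unicode" is taken to mean UTF-16. The function name `python_codec` is hypothetical.

```python
def python_codec(e_value):
    """Map a value of the -e option to a Python codec name (assumed mapping)."""
    codecs_by_value = {
        "ansi": "mbcs",        # assumption: system ANSI code page (Windows only)
        "utf8": "utf-8-sig",   # tolerates a leading BOM
        "unicode": "utf-16",   # assumption: UTF-16, BOM-detected
    }
    return codecs_by_value[e_value.lower()]
```

A typical use would be `open(path, encoding=python_codec("utf8"))` after running the utility with -e utf8.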

-t integer

Splits text into parts of a target size. The number is a count of characters.

-k keyword

Splits text at a special keyword in the input file. The option is case-sensitive. The command line may contain several -k options.

-r keyword

Splits text at a keyword and removes the keyword from the output files. The option is case-sensitive. The command line may contain several -r options.

-w

Splits text at two consecutive empty lines.

-l

Splits text at lines where all letters are capitals.

-c

Splits text by a table of contents. The application extracts positions of chapter beginnings from the input file (if the file contains such information).

-toc

Generates a table of contents and splits text. The application splits the extracted text by keywords (like "chapter" or "volume"). If the option is used together with the option -c, the application will try to extract a table of contents from the document; if it fails, a new table of contents will be generated.

-m integer

Sets the minimum size of text parts for splitting (as a number of characters).

-j integer

Ignores the chapter beginning if the size of the previous chapter is less than the specified value (in characters). The option is used together with the option -c or -toc.

-hh text

Inserts text in front of headings (for example: ## Chapter 1).

-d file_name

Uses a dictionary for pronunciation correction (*.BXD, *.REX or *.DIC). The command line may contain several -d options.

-if

Uses the IFilter interface to extract text. If this fails, the application falls back to its default method.

-g folder_name

Sets the name of the output folder for saving images from documents.

-cvr folder_name

Sets the name of the output folder for saving the book cover image.

-cft

Clones the Created/Modified/Accessed time of the input file into the output file. If the application combines text files or splits the extracted text, the option is ignored.

-x file_type

Sets the input file type. This makes it possible to specify the format of input documents with unknown file name extensions. For example: -x doc.

-pwd text

Sets the password for encrypted PDF files.

-dll file_name

Sets the path and name of 7z.dll (32-bit). This library helps to extract text and images from documents inside archives (ZIP, RAR, etc.). 7z.dll is a part of the 7-Zip software. If the option is not specified, the application and the library must be in the same folder; otherwise, the application will not be able to extract data from archive files.

-dex file_types

Sets the list of file types to extract from archives. The option contains a comma-separated list of file types, for example: -dex "fb2,epub"
The command line may contain several -dex options. If the option is not specified, the application will extract text from all files in an archive. To extract text for all file types supported by the application, use the value "all-". For example: -dex all-

-dne file_types

Sets the list of file types to ignore when documents are extracted from archives. The option contains a comma-separated list of file types, for example: -dne "exe,dll"
The command line may contain several -dne options. If the option is not specified, the application will extract text from all files in an archive.

-dp

Displays progress information in the console window.

-cfg file_name

Sets the name of the configuration file with the command line options (a text file where each line contains one option). If the option is not specified, the file blb2txt.cfg in the same folder as the utility will be used.

-h

Prints the list of available command line options.

--remove-spaces or -rs

Removes excess spaces (two or more blank spaces in succession, no-break spaces).

--remove-hyphens or -rh

Removes hyphens at the ends of lines in the text.

--remove-linebreaks or -rl

Removes linebreaks inside paragraphs.

--remove-empty-lines or -rm

Removes empty lines.

--replace-empty-lines or -rp

Replaces several consecutive empty lines with one empty line.

--remove-square-brackets or -rsb

Removes text in [square brackets].

--remove-curly-brackets or -rcb

Removes text in {curly brackets}.

--remove-angle-brackets or -rab

Removes text in <angle brackets>.

--remove-round-brackets or -rrb

Removes text in (round brackets).

--remove-comments or -rc

Removes comments. Single-line comments start with // and continue until the end of the line. Multiline comments start with /* and end with */.

--remove-page-numbers or -rpn

Removes page numbers (it may be useful for DjVu/PDF files).

--fix-ocr-errors or -ocr

Fixes OCR errors (for languages with Cyrillic alphabets only).

--fix-letter-spacing or -ls

Fixes letter-spacing in words (for example: s p a c e, _w_o_r_d).

--add-period or -ap

Adds a period if there is no punctuation after the last word of the paragraph.

--extract-summary integer or -es integer

Extracts a summary (also called "annotation") from FB2/FB3 files and inserts it at the beginning of the text. Possible values for the integer parameter:

0 – skips the summary (used by default);
1..5 – extracts the summary (the value determines the order in which the author name and the book title are listed).

--skip-notes or -sn

Skips notes when the application extracts text from DOCX/FB2/FB3/MD/ODT files.

--include-notes integer or -in integer

Includes notes inside the text when the application extracts text from DOCX/FB2/FB3/MD/ODT files.
Possible values for the integer parameter:

0 – removes links to notes from text;
1 – keeps default positions of notes inside text (this value is used by default);
2 – places notes at the end of sentences;
3 – places notes at the end of paragraphs.

--insert-note-begin text or -inb text

Inserts words at the beginning of notes when notes are included inside the text (for example: Editor's note.).
The option is used for DOCX/FB2/FB3/MD/ODT files.

--insert-note-end text or -ine text

Inserts words at the end of notes when notes are included inside the text (for example: End of note.).
The option is used for DOCX/FB2/FB3/MD/ODT files.

--extract-tables integer or -et integer

Extracts tables from DOCX/FB2/FB3/ODT files. Possible values for the integer parameter:

0 – skips tables;
1 – extracts data from each cell as a new text line (this value is used by default);
2 – keeps formatting when extracting a table.

--csv-comma

Separates columns with a comma when the application extracts data from XLS/XLSX/ODS files (the default delimiter for CSV files).

--csv-semicolon

Separates columns with a semicolon when the application extracts data from XLS/XLSX/ODS files.

--csv-space

Separates columns with a blank space when the application extracts data from XLS/XLSX/ODS files.

--csv-tab

Separates columns with a tab when the application extracts data from XLS/XLSX/ODS files.

--csv-double-quote

Uses double-quote characters when a field must be quoted (export from XLS/XLSX/ODS files).

--csv-single-quote

Uses single-quote characters when a field must be quoted (export from XLS/XLSX/ODS files).

--eml-save folder_name

Extracts attachments from EML files and saves them to the specified folder.

--eml-att

Extracts the list of attachments from EML files (names of files attached to the message).

--eml-cc

Extracts the header field "Cc" from EML files ("carbon copy"; it specifies additional recipients of the message).

--eml-date date_format

Extracts the header field "Date" from EML files (the local time and date when the message was composed and sent). The date format is defined by specifiers (such as "d", "m", "y", etc.). For example: "dd.mm.yyyy hh:nn:ss".

--eml-from

Extracts the header field "From" from EML files (the email address, and optionally the name of the author).

--eml-org

Extracts the header field "Organization" from EML files (the name of the organization through which the sender of the message has net access).

--eml-rt

Extracts the header field "Reply-To" from EML files (the address for replies to go to).

--eml-subj

Extracts the header field "Subject" from EML files (the subject of the message).

--eml-to

Extracts the header field "To" from EML files (the email address, and optionally the name of the message's recipient).

Examples

Extract text from BOOK.DOC and save it as BOOK.TXT in the output folder:

blb2txt -f "d:\Docs\book.doc" -v "d:\Text\"

Alternatively, when only one input file is specified, this form can be used:

blb2txt -f "d:\Docs\book.doc" -out "d:\Text\book.txt"

Extract text from Microsoft Word and RTF documents, remove empty lines, and save the text files in UTF-8 encoding:

blb2txt -f "d:\Docs\*.doc" -f "d:\Docs\*.rtf" -v "d:\Text\" -e utf8 -rm

Extract text from all files in the specified folder, combine it, and save it as DOCUMENT.TXT:

blb2txt -f "d:\Docs\*.*" -v "d:\Text\" -p "Document" -u

Extract text from DOCUMENT.DOCX, split it into parts of 100,000 characters (about 100 KB), and save them as the text files "Document 20.txt", "Document 21.txt", etc.:

blb2txt -f "d:\Docs\document.docx" -v "d:\Text\" -p "Document" -a -n 20 -t 100000

Extract text from BOOK.FB2, split it at the words "CAPITOL" and "CUPRINS", and save the parts as files named "Cartea 1.txt", "Cartea 2.txt", etc.:

blb2txt -f "d:\Book\book.fb2" -v "d:\Text\" -p "Cartea" -k "CAPITOL" -k "CUPRINS"

Extract text from BOOK.EPUB, split it at "###", remove "###" from the text, and save each part as a new file:

blb2txt -f "d:\Book\book.epub" -v "d:\Text\" -p "Book" -r "###"

Extract text from BOOK.FB2, split it by the table of contents, and save the files using the chapter titles as file names. The new text files must not be smaller than one kilobyte:

blb2txt -f "d:\Book\book.fb2" -v "d:\Text\" -p "%Number% - %Header%" -c -j 1024

Read text from STDIN, remove excess spaces and line breaks inside paragraphs, replace runs of empty lines with a single empty line, and write the updated text to STDOUT:

blb2txt -i -o --remove-spaces --remove-linebreaks --replace-empty-lines

Extract text from all Microsoft Word documents in ZIP archives:

blb2txt -f "d:\Archive\*.zip" -v "d:\Text\" -dll "e:\7-Zip\7z.dll" -dex doc,docx

Configuration file

Command-line options can be stored in a configuration file named "blb2txt.cfg" in the same folder as the utility.

Example of a configuration file:

-f d:\Docs\*.rtf
-f d:\Books\*.epub
-f d:\Books\*.fb2
-v d:\Text
-b
-n 1
-t 25000
-e utf8
-d d:\Dict\rules.bxd
--remove-spaces
--remove-linebreaks
--replace-empty-lines

The utility can combine options from the configuration file and from the command line.
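
A configuration file in this one-option-per-line layout can also be generated by a program. The Python sketch below builds the file contents from a list of options; `config_text` is a hypothetical helper name. Values are written without quotes, as in the example above. How the utility treats quoted values or a particular file encoding is not covered in this document, so both should be verified before relying on paths that contain spaces or non-ASCII characters.

```python
def config_text(options):
    """Return configuration file contents, one option per line.

    `options` is a list of (name, value) pairs; use None for options
    that take no value, such as -b.
    """
    lines = []
    for name, value in options:
        lines.append(name if value is None else "{} {}".format(name, value))
    return "\n".join(lines) + "\n"


text = config_text([
    ("-f", r"d:\Docs\*.rtf"),
    ("-v", r"d:\Text"),
    ("-b", None),
    ("-t", 25000),
    ("--remove-spaces", None),
])
```

The resulting string can be saved as blb2txt.cfg next to the utility, or under another name that is then passed with -cfg.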

License

You may use and distribute the software for non-commercial purposes. For commercial use or distribution, you must obtain permission from the copyright holder.