=pod

=head1 NAME

README for dta-tokwrap - programs, scripts, and perl modules for DTA XML corpus tokenization

=cut

##======================================================================
=pod

=head1 DESCRIPTION

This package contains various utilities for tokenization of DTA "base-format" XML documents.
See L</INSTALLATION> for requirements and installation instructions,
see L</USAGE> for a brief introduction to the high-level command-line interface,
and see L</TOOLS> for an overview of the individual tools included in this distribution.

=cut

##======================================================================
=pod

=head1 INSTALLATION

=cut

##--------------------------------------------------------------
=pod

=head2 Requirements

=cut

##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
=pod

=head3 C Libraries

=over 4

=item expat

tested version(s): 1.95.8, 2.0.1

=item libxml2

tested version(s): 2.7.3, 2.7.8

=item libxslt

tested version(s): 1.1.24, 1.1.26

=back

=cut

##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
=pod

=head3 Perl Modules

See F for a full list of required perl modules.

=cut

##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
=pod

=head3 Development Tools

=over 4

=item C compiler

tested version(s): gcc / linux: v4.3.3, 4.4.6

=item GNU flex (development only)

tested version(s): 2.5.33, 2.5.35

Only needed if you plan on making changes to the lexer sources.

=item GNU autoconf (SVN only)

tested version(s): 2.61, 2.67

Required for building from SVN sources.

=item GNU automake (SVN only)

tested version(s): 1.9.6, 1.11.1

Required for building from SVN sources.

=back

=cut

##--------------------------------------------------------------
=pod

=head2 Building from SVN

To build this package from SVN sources, you must first run the shell command:

 bash$ sh ./autoreconf.sh

from the distribution root directory B<before> running F<./configure>.
Building from SVN sources requires additional development tools to be present on the build system.
Then, follow the instructions in L</"Building from Source">.

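
As a rough pre-flight check before running F<./autoreconf.sh>, the following shell sketch reports which build tools cannot be found on the current C<PATH>. The function name C<check_tools> is hypothetical and not part of this distribution, and the command names you pass to it are assumptions about how the development tools listed above are installed on your system.

```shell
# Hypothetical helper (NOT part of dta-tokwrap): print one line per tool
# saying whether it was found on PATH; return non-zero if any is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}
```

One possible use, assuming the tools are installed under their usual command names, is C<check_tools gcc flex autoconf automake && sh ./autoreconf.sh>, which only starts the SVN bootstrap step when every listed tool was found. The check does not compare installed versions against the tested versions listed above.
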
=cut

##--------------------------------------------------------------
=pod

=head2 Building from Source

To build and install the entire package, issue the following commands to the shell:

 bash$ cd dta-tokwrap-0.01   # (or wherever you unpacked this distribution)
 bash$ sh ./configure        # configure the package
 bash$ make                  # build the package
 bash$ make install          # install the package on your system

More details on the top-level installation process can be found in
the file F in the distribution root directory.

More details on building and installing the DTA::TokWrap perl module
included in this distribution can be found in the F manpage.

=cut

##======================================================================
=pod

=head1 USAGE

The perl program L installed from the F distribution subdirectory
provides a flexible high-level command-line interface to the tokenization
of DTA XML documents.

=cut

##--------------------------------------------------------------
=pod

=head2 Input Format

The L script takes as its input DTA "base-format" XML files,
which are simply (TEI-conformant) UTF-8 encoded XML files with one C<E<lt>cE<gt>> element per character:

=over 4

=item *

the document B<must> be encoded in UTF-8,

=item *

all text nodes to be tokenized should be descendants of a C<E<lt>textE<gt>> element,
and may optionally be immediate daughters of a C<E<lt>cE<gt>> element (XPath C).
C<E<lt>cE<gt>> elements may not be nested.

Prior to dta-tokwrap v0.38, C<E<lt>cE<gt>> elements were required.

=back

=cut

##--------------------------------------------------------------
=pod

=head2 Example: Tokenizing a single XML file

Assume we wish to tokenize a single DTA "base-format" XML file F<doc1.xml>.
Issue the following command to the shell:

 bash$ dta-tokwrap.perl doc1.xml

... This will create the following output files:

=over 4

=item F

"Master" tokenizer output file encoding sentence boundaries, token boundaries, and tokenizer-provided token analyses.
Source for various stand-off annotation formats.
This format can also be passed directly to and from the L analysis suite using the L formatter class.

=back

=cut

##--------------------------------------------------------------
=pod

=head2 Example: Tokenizing multiple XML files

Assume we wish to tokenize a corpus of three DTA "base-format" XML files
F<doc1.xml>, F<doc2.xml>, and F<doc3.xml>.
This is as easy as:

 bash$ dta-tokwrap.perl doc1.xml doc2.xml doc3.xml

For each input document specified on the command line, master output files
and stand-off annotation files will be created.

See L<"the dta-tokwrap.perl manpage"|dta-tokwrap.perl> for more details.

=head2 Example: Tracing execution progress

Assume we wish to tokenize a large corpus of XML input files F<doc*.xml>,
and would like to have some feedback on the progress of the tokenization process.
Try:

 bash$ dta-tokwrap.perl -verbose=1 doc*.xml

or:

 bash$ dta-tokwrap.perl -verbose=2 doc*.xml

or even:

 bash$ dta-tokwrap.perl -traceAll doc*.xml

=cut

##--------------------------------------------------------------
=pod

=head2 Example: From TEI to TCF and Back

Assume we have a TEI-like document F<doc1.tei.xml> which we want to encode as TCF
to the file F, using only whitespace tokenizer "hints",
but not actually tokenizing the document yet. This can be accomplished by:

 $ dta-tokwrap.perl -t=tei2tcf -weak-hints doc1.tei.xml

If the output should instead be written to STDOUT, just call:

 $ dta-tokwrap.perl -t=tei2tcf -weak-hints -dO=tcffile=- doc1.tei.xml

Assume that the resulting TCF document has undergone further processing
(e.g. via L) to produce an annotated TCF document C<doc.out.tcf>.
Selected TCF layers (in particular the C and C layers) can be spliced
back into the TEI document as F<doc.out.xml> by calling:

 $ dta-tokwrap.perl -t=tcf2tei doc.out.tcf -dO=tcffile=doc.out.tcf -dO=tcfcwsfile=doc.out.xml

=cut

##======================================================================
=pod

=head1 TOOLS

This section provides a brief overview of the individual tools included
in the dta-tokwrap distribution.

=cut

##--------------------------------------------------------------
=pod

=head2 Perl Scripts & Programs

The perl scripts and programs included with this distribution
are installed by default in F and/or wherever your perl installs scripts by default
(e.g. in C<`perl -MConfig -e 'print $Config{installsitescript}'`>).

=over 4

=item dta-tokwrap.perl

Top-level wrapper script for document tokenization using the L perl API.

=item dtatw-add-c.perl

Script to insert C<E<lt>cE<gt>> elements and/or C attributes for such elements
into an XML document which does not yet contain them.
Guaranteed not to clobber any existing //c IDs.
//c/@xml:id attributes are generated by a simple document-global counter ("c1", "c2", ..., "c65536").

See L<"the dtatw-add-c.perl manpage"|dtatw-add-c.perl> for more details.

=item dtatw-cids2local.perl

Script to convert C attributes to page-local encoding.
Never really used.

See L<"the dtatw-cids2local.perl manpage"|dtatw-cids2local.perl> for more details.

=item dtatw-add-ws.perl

Script to splice C<E<lt>sE<gt>> and C<E<lt>wE<gt>> elements encoded from a standoff (.t.xml or .u.xml) XML file
into the I "base-format" (.chr.xml) file, producing a .cws.xml file.
A tad too generous with partial word segments, due to strict adjacency and boundary criteria.

In earlier versions of dta-tokwrap, this functionality was split between the scripts C and C,
which required only an I base-format (.chr.xml) file as the splice target.
As of dta-tokwrap v0.35, the splice target base-format file must be I source file itself,
since the current implementation uses byte offsets to perform the splice.

See L<"the dtatw-add-ws.perl manpage"|dtatw-add-ws.perl> for more details.

=item dtatw-splice.perl

Script to splice generic standoff attributes and/or content into a base file;
useful e.g. for merging flat DTA::CAB standoff analyses into TEI-structured *.cws.xml files.

See L<"the dtatw-splice.perl manpage"|dtatw-splice.perl> for more details.

=item dtatw-get-ddc-attrs.perl

Script to insert DDC-relevant attributes extracted from a base file into a *.t.xml file,
producing a pre-DDC XML format file (by convention *.ddc.t.xml, a subset of the *.t.xml format).

See L<"the dtatw-get-ddc-attrs.perl manpage"|dtatw-get-ddc-attrs.perl> for more details.

=item dtatw-get-header.perl

Simple script to extract a single header element from an XML file
(e.g. for later inclusion in a DDC XML format file).

See L<"the dtatw-get-header.perl manpage"|dtatw-get-header.perl> for more details.

=item dtatw-pn2p.perl

Script to insert E<lt>pE<gt>...E<lt>/pE<gt> wrappers for C key attributes in "flat" *.t.xml files.

=item dtatw-xml2ddc.perl

Script to convert *.ddc.t.xml files and optional headers to DDC-XML format.

See L<"the dtatw-xml2ddc.perl manpage"|dtatw-xml2ddc.perl> for more details.

=item dtatw-t-check.perl

Simple script to check consistency of tokenizer output (*.t) offset + length fields with input (*.txt) file.

=item dtatw-add-c.perl

Script to add C<E<lt>cE<gt>> elements to an XML document which does not already contain them.
Not really useful as of dta-tokwrap v0.38.

=item dtatw-rm-c.perl

Script to remove C<E<lt>cE<gt>> elements from an XML document.
Regex hack, fast but not exceedingly robust, use with caution.
See also L

=item dtatw-rm-w.perl

Fast regex hack to remove C<E<lt>wE<gt>> elements from an XML document.

=item dtatw-rm-s.perl

Fast regex hack to remove C<E<lt>sE<gt>> elements from an XML document.

=item dtatw-rm-lb.perl

Script to remove C<E<lt>lbE<gt>> (line-break) elements from an XML document, replacing them with newlines.
Regex hack, fast but not robust, use with caution.
See also L

=item dtatw-lb-encode.perl

Encodes newlines under //text//text() in an XML document as C<E<lt>lbE<gt>> (line-break) elements
using high-level file heuristics only.
Regex hack, fast but not robust, use with caution.
See also L, L, L.

=item dtatw-ensure-lb.perl

Script to ensure that all //text//text() newlines in an XML document are explicitly encoded
with C<E<lt>lbE<gt>> (line-break) elements, using optional file-, element-, and line-level heuristics.
Robust but slow, since it actually parses XML input documents.
See also L, L, L.

=item dtatw-tt-dictapply.perl

Script to apply a type-"dictionary" in one-word-per-line (.tt) format
to a token corpus in one-word-per-line (.tt) format.
Especially useful together with standard UNIX utilities such as cut, grep, sort, and uniq.

=item dtatw-cabtt2xml.perl

Script to convert DTA::CAB::Format::TT (one-word-per-line with variable analysis fields identified by conventional prefixes)
files to the expanded .t.xml format used by dta-tokwrap.
The expanded format should be identical to that used by the DTA::CAB::Format::Xml class.
See also L.

=item file-substr.perl

Script to extract a portion of a file, specified by byte offset and length.
Useful for debugging index files created by other tools.

=back

=cut

##--------------------------------------------------------------
=pod

=head2 GNU make build system template

The distribution directory F contains a "template" for using GNU F<make> to organize
the conversion of large corpora with the dta-tokwrap utilities.
This is useful because:

=over 4

=item *

F<make>'s intuitive, easy-to-read syntax provides a wonderful vehicle for user-defined configuration files,
obviating the need to remember the names of all 64 (at last count) C options,

=item *

F<make> is very good at tracking complex dependencies of the sort that exist between
the various temporary files generated by the dta-tokwrap utilities,

=item *

F<make> jobs can be made "robust" simply by adding C<-k> (C<--keep-going>) to the command line, and

=item *

last but certainly not least, F<make> has built-in support for parallelization of complex tasks
by means of the C<-j N> (C<--jobs=N>) option, allowing us to take advantage of multiprocessor systems.

=back

By default, the contents of the distribution F subdirectory are installed to F.
See the comments at the top of F for instructions.

=cut

##--------------------------------------------------------------
=pod

=head2 Perl Modules

=over 4

=item L

Top-level tokenization-wrapper module, used by L.

=item L

Object-oriented wrapper for documents to be processed.

=item L

Abstract base class for elementary document-processing operations.

=back

See the L manpage for more details on included modules, APIs, calling conventions, etc.

=cut

##--------------------------------------------------------------
=pod

=head2 XSL stylesheets

The XSL stylesheets included with this distribution are installed by default in F.

=over 4

=item dtatw-add-lb.xsl

Replaces newlines with C<E<lt>lb/E<gt>> elements in input document.

=item dtatw-assign-cids.xsl

Assigns missing C attributes using the XSL C function.

=item dtatw-rm-c.xsl

Removes C<E<lt>cE<gt>> elements from the input document.
Slow but robust.

=item dtatw-rm-lb.xsl

Replaces C<E<lt>lb/E<gt>> elements with newlines.

=item dtatw-txml2tt.xsl

Converts "master" tokenized XML output format (F<*.t.xml>)
to TAB-separated one-word-per-line format
(F<*.mr.t> aka F<*.t> aka F<*.tt> aka "tt" aka "CSV" aka DTA::CAB::Format::TT aka "TnT" aka "TreeTagger" aka "vertical" aka "moot-native" aka ...).
See the F manpage for basic format details, and see the top of the XSL script
for some influential transformation parameters.

=back

=cut

##--------------------------------------------------------------
=pod

=head2 C Programs

Several C programs are included with the distribution.
These are used by the L script to perform various intermediate
document processing operations, and should not need to be called by the user directly.

B: The following programs are meant for internal use by the C modules only,
and their names, calling conventions, and very presence are subject to change without notice.

=over 4

=item dtatw-mkindex

Splits input document F into a "character index" F (CSV),
a "structural index" F (XML), and a "text index" F (UTF-8 text).

=item dtatw-rm-namespaces

Removes namespaces from any XML document by renaming "C" attributes to "C"
and "C" attributes to "C".
Useful because XSL's namespace handling is annoyingly slow and ugly.

=item dtatw-tokenize-dummy

Dummy C tokenizer.
Useful for testing.

=item dtatw-txml2sxml

Converts "master" tokenized XML output format (F<*.t.xml>)
to sentence-level stand-off XML format (F<*.s.xml>).

=item dtatw-txml2wxml

Converts "master" tokenized XML output format (F<*.t.xml>)
to token-level stand-off XML format (F<*.w.xml>).

=item dtatw-txml2axml

Converts "master" tokenized XML output format (F<*.t.xml>)
to token-analysis-level stand-off XML format (F<*.a.xml>).

=back

=cut

##======================================================================
=pod

=head1 SEE ALSO

perl(1).

=head1 AUTHOR

Bryan Jurish E<lt>moocow@cpan.orgE<gt>

=head1 COPYRIGHT AND LICENSE

Copyright (C) 2009-2018 by Bryan Jurish

This package is free software.
Redistribution and modification of C portions of this package
are subject to the terms of version 3 or greater of the
GNU Lesser General Public License; see the files COPYING and COPYING.LESSER
which came with the distribution for details.

Redistribution and/or modification of the Perl portions of this package
are subject to the same terms as Perl itself, either Perl version 5.24.1 or,
at your option, any later version of Perl 5 you may have available.

=cut