ZIP export
----------

ZIP export is the way to go if you would like to export a larger number of objects, whether transformed or not. Its features include:

- export of aggregations and their children in original format, even large collections
- export of whole search results, i.e. you could export 'all verse text' from TextGrid
- export of metadata records together with their files
- transformation of all XML documents, e.g., to plain text to facilitate use of statistical tools that cannot deal with TEI markup
- export in a form that is re-importable into, e.g., the TextGridLab
- rewriting of links from textgrid: towards relative file names in the ZIP
- customization of the file names in the ZIP

Exporting large data sets
^^^^^^^^^^^^^^^^^^^^^^^^^

While you can export really large datasets, there is a caveat: In normal mode, the ZIP export (unlike TEIcorpus) needs to collect all objects' metadata before it starts to deliver anything. This is required because all objects' filenames must be known in order to rewrite links between the objects correctly. Thus the ZIP tool might need quite some time before it delivers the first bytes. If you are unlucky, this head start time exceeds the timeout of your browser or of an intermediate proxy, and you will get a timeout instead of the ZIP.

To still be able to export these large data sets, the ZIP export offers a special *streaming mode*. When you pass the query parameter ``stream=true``, the Aggregator will deliver data as soon as possible, even if it does not yet have enough information to perform correct link rewriting. This may lead to exported files still containing ``textgrid:`` URIs, but at least you get files :-)

Export Map
^^^^^^^^^^

Each ZIP file that is exported contains an additional file at the root level called ``.INDEX.imex``.
This is an XML file that contains a list of all exported objects and that maps textgrid URIs to the file names used in the actual export. If you do not rename or move anything, the TextGridLab can use it to re-import your files.

Example::

Synopsis
^^^^^^^^

Synopsis::

    /zip/{objects}?sid&title&filenames&metanames&dirnames&only&meta&transform&query&filter&target&start&stop&stream

General Request Query Parameters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

+--------------------------+--------------------------+--------------------------+
| parameter                | value                    | description              |
+==========================+==========================+==========================+
| **sid**                  | string                   | Session ID to access     |
|                          |                          | protected resources      |
+--------------------------+--------------------------+--------------------------+
| **stream**               | boolean                  | if true, favor fast      |
|                          |                          | results over ideal       |
|                          |                          | rewriting                |
|                          |                          |                          |
|                          | Default: ``false``       |                          |
+--------------------------+--------------------------+--------------------------+
| **title**                | string                   | (optional) title for the |
|                          |                          | exported data, currently |
|                          |                          | only used for generating |
|                          |                          | the filename. If none is |
|                          |                          | given, the first title   |
|                          |                          | of the first object will |
|                          |                          | be used.                 |
+--------------------------+--------------------------+--------------------------+

Choosing what to export
^^^^^^^^^^^^^^^^^^^^^^^

There are basically two options for what to export:

Aggregation tree or list of objects
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You export one or more objects/aggregations and everything that they aggregate.
To do that, specify the URI(s) as the ``objects`` part of the request path, as with the other exporters:

+--------------------------+--------------------------+--------------------------+
| parameter                | value                    | description              |
+==========================+==========================+==========================+
| **objects**              | string                   | The TextGrid URIs of the |
|                          |                          | TEI documents or         |
|                          |                          | aggregations to zip,     |
|                          |                          | separated by commas (,)  |
+--------------------------+--------------------------+--------------------------+

Search results
~~~~~~~~~~~~~~

Alternatively, specify a query to TG-search. To do so, specify an (unused) object string plus query parameters, so a possible URL may look like . You have the full power of the query language, but only a limited set of parameters is passed on to TG-search:

+--------------------------+--------------------------+--------------------------+
| parameter                | value                    | description              |
+==========================+==========================+==========================+
| **query**                | string                   | (EXPERIMENTAL) perform   |
|                          |                          | the given TGsearch query |
|                          |                          | and use its result as    |
|                          |                          | root objects instead of  |
|                          |                          | the objects.             |
+--------------------------+--------------------------+--------------------------+
| **filter**               | string                   | for query: additional    |
|                          |                          | filters                  |
|                          |                          |                          |
|                          | (repeating)              |                          |
+--------------------------+--------------------------+--------------------------+
| **target**               | string                   | if query is used, the    |
|                          |                          | query target (metadata,  |
|                          |                          | fulltext or both)        |
|                          |                          |                          |
|                          | Default: ``both``        |                          |
+--------------------------+--------------------------+--------------------------+
| **start**                | int                      | for query: start at      |
|                          |                          | result no.               |
|                          |                          |                          |
|                          | Default: ``0``           |                          |
+--------------------------+--------------------------+--------------------------+
| **stop**                 | int                      | for query: max. number   |
|                          |                          | of results               |
|                          |                          |                          |
|                          | Default: ``65535``       |                          |
+--------------------------+--------------------------+--------------------------+

Please note that you typically will *not* need to specify the start and stop parameters, but you may want to use ``stream=true`` (cf. above).

Further Filters
~~~~~~~~~~~~~~~

In both cases, you can further narrow down what to export by specifying one or more content types and by specifying whether metadata and TextGrid-specific technical files (i.e. the aggregation files) should be exported:

+--------------------------+--------------------------+--------------------------+
| parameter                | value                    | description              |
+==========================+==========================+==========================+
| **only**                 | string                   | If at least one ``only`` |
|                          |                          | parameter is given,      |
|                          |                          | restrict export to       |
|                          | (repeating)              | objects with the given   |
|                          |                          | MIME types               |
+--------------------------+--------------------------+--------------------------+
| **meta**                 | boolean                  | Include metadata and     |
|                          |                          | aggregation files in the |
|                          |                          | ZIP file.                |
|                          |                          |                          |
|                          | Default: ``true``        |                          |
+--------------------------+--------------------------+--------------------------+

Converting TEI to something else
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Sometimes you want the text, but you don't want it in the original form. Since the aggregator has a built-in XSLT processor, you can use it to convert the documents. This typically does not slow down the export process considerably.

+--------------------------+--------------------------+--------------------------+
| parameter                | value                    | description              |
+==========================+==========================+==========================+
| **transform**            | string                   | (EXPERIMENTAL) Transform |
|                          |                          | each XML document before |
|                          |                          | zipping. Values          |
|                          |                          | currently available are  |
|                          |                          | text, html, or the       |
|                          |                          | textgrid: URI of an XSLT |
|                          |                          | stylesheet.              |
+--------------------------+--------------------------+--------------------------+

If you specify ``transform=text``, a default plain-text transformation is applied to each file. It is the to-plain-text transformation of the bundled TEI XSLTs, so expect reasonably domain-aware output. ``transform=html`` uses the built-in HTML transformation instead. You can also specify a ``textgrid:`` URI that points to an XSLT stylesheet. Keep in mind that this stylesheet must either be public, or you need to pass a valid session ID.

Influencing file and directory names
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is possible to modify the filenames used inside the ZIP file (and for rewritten links) by providing file name patterns using three parameters:

+--------------------------+---------------------------------------+--------------------------+
| parameter                | value                                 | description              |
+==========================+=======================================+==========================+
| **filenames**            | string                                | Pattern for the          |
|                          |                                       | generated filenames in   |
|                          |                                       | the ZIP files.           |
|                          |                                       |                          |
|                          | Default:                              |                          |
|                          | ``{parent|/}{author}-{title}*.{ext}`` |                          |
+--------------------------+---------------------------------------+--------------------------+
| **metanames**            | string                                | Pattern for the          |
|                          |                                       | filenames for the        |
|                          |                                       | metadata files in the    |
|                          |                                       | ZIP files.               |
|                          |                                       |                          |
|                          | Default:                              |                          |
|                          | ``{filename}.meta``                   |                          |
+--------------------------+---------------------------------------+--------------------------+
| **dirnames**             | string                                | Pattern for the          |
|                          |                                       | directory names          |
|                          |                                       | generated for            |
|                          |                                       | aggregations etc. This   |
|                          |                                       | pattern applied to the   |
|                          | Default:                              | parent aggregation is    |
|                          | ``{parent|/}{title}*``                | available as {parent} in |
|                          |                                       | filenames and metanames. |
+--------------------------+---------------------------------------+--------------------------+

The filenames are generated from the metadata available to the aggregator when it adds the object to its internal list, so some fields, especially the author field, may be undefined. By default, each metadata field is transformed to a safe character set containing only ASCII letters, numbers and a limited set of special characters by running an automatic transcription (so *Luſtige Märchen* will become *Lustige_Maerchen*, and *ηελλασ* will become *hellas*). A literal ``*`` in the pattern is replaced by either nothing or, if the same name would otherwise be generated for different objects, a disambiguation number. The filename extension ``{ext}`` depends on the format actually exported, so it is ``txt`` if you use ``transform=text``.

Pattern Syntax
~~~~~~~~~~~~~~

A pattern string is a string containing *patterns* enclosed in curly braces. Each pattern starts with a *variable* and is optionally followed by one or more options, each introduced by a vertical bar (``|``). Please note that all whitespace is significant.

As an example, the string ``{author|fallback|20}-{title|sep=,}.{uri}.{ext}`` contains the variable author with the options ``fallback`` and ``20``, the variable title with the option ``sep=,``, and the variables uri and ext, each without any option.

Basic Variables
~~~~~~~~~~~~~~~

The following basic variables are available in all policies:

+------------+--------------------------+----------------------------------------------------+
| Variable   | Supported Options        | Description                                        |
+============+==========================+====================================================+
| author     | ``fallback``,            | The object's author. This tries to find the        |
|            | ``sep=``\ String,        | nearest work object in the aggregation tree and    |
|            | Number, ``raw``          | extracts its author or authors. If the             |
|            |                          | ``fallback`` option is included and the matching   |
|            |                          | work *does not* include author fields, use all     |
|            |                          | agents regardless of their role instead.           |
+------------+--------------------------+----------------------------------------------------+
| title      | ``sep=``\ String,        | The object's title or titles.                      |
|            | Number, ``raw``          |                                                    |
+------------+--------------------------+----------------------------------------------------+
| uri        | (none)                   | The object's TextGrid URI. This only includes the  |
|            |                          | scheme-specific part.                              |
+------------+--------------------------+----------------------------------------------------+
| ext        | (none)                   | A filename extension that is suitable for the      |
|            |                          | object's MIME type, or ``dat`` if none is found.   |
|            |                          | This does *not* include a leading dot.             |
+------------+--------------------------+----------------------------------------------------+
| \*         | ``pre=``\ String         | A filename disambiguation pattern, only inserted   |
|            | (Default ``.``),         | if required. If filename disambiguation is on      |
|            | ``post=``\ String        | (``setUniqueFilenames(boolean)``),                 |
|            |                          | ``getFilename(IAggregationEntry)`` will first      |
|            |                          | generate a filename candidate with this pattern    |
|            |                          | expanding to the empty string. If this filename    |
|            |                          | has already been used for a *different* entry,     |
|            |                          | it will re-run the filename generation with this   |
|            |                          | pattern expanding to the empty string for the      |
|            |                          | first object resolving to the candidate and to     |
|            |                          | prefix + n-1 + postfix for every other object.     |
|            |                          | For example, three XML documents by Goethe and     |
|            |                          | the pattern ``{author}*.{ext}`` yield              |
|            |                          | ``Goethe.xml``, ``Goethe.1.xml`` and               |
|            |                          | ``Goethe.2.xml``. Instead of ``{*}`` without       |
|            |                          | options you can also simply write ``*``.           |
+------------+--------------------------+----------------------------------------------------+

**If you generate multiple filenames, your pattern should include either {uri} or \*, or you risk getting the same filename for different objects!**

Nested Patterns
~~~~~~~~~~~~~~~

+--------------+---------------------------------------------------------------------------+
| Variable     | Description                                                               |
+==============+===========================================================================+
| ``parent``   | This is the ``dirnames`` pattern applied to the parent aggregation of     |
|              | the current object, if any. In the form ``{parent|/}`` it appends ``/``   |
|              | if and only if there *is* a parent. It is available in all patterns,      |
|              | including in ``dirnames`` itself.                                         |
+--------------+---------------------------------------------------------------------------+
| ``filename`` | The file name generated for the object whose metadata we are              |
|              | processing. Only available in ``metanames``.                              |
+--------------+---------------------------------------------------------------------------+

Options
~~~~~~~

+--------------------------------------+--------------------------------------+
| Number                               | If you pass any positive integer as  |
|                                      | an option, the expanded value of the |
|                                      | variable will be trimmed to at most  |
|                                      | Number characters. Trimming occurs   |
|                                      | after all other processing steps for |
|                                      | the variable.                        |
+--------------------------------------+--------------------------------------+
| ``raw``                              | Insert the result of this variable   |
|                                      | as-is, without character             |
|                                      | sanitization.                        |
|                                      | If you *do not* include this option, |
|                                      | the result of the metadata-based     |
|                                      | variables will be transcribed from   |
|                                      | its original characters to a safe    |
|                                      | subset of US-ASCII characters in     |
|                                      | order to be safe from all kinds of   |
|                                      | encoding and filename issues. This   |
|                                      | tries to do something sensible with, |
|                                      | e.g., umlauts and non-Latin scripts. |
+--------------------------------------+--------------------------------------+
| ``sep=``\ String                     | If present and the respective        |
|                                      | metadata field contains multiple     |
|                                      | values, use all values, joined       |
|                                      | together with the given separator    |
|                                      | String. Otherwise, only use the      |
|                                      | first value.                         |
+--------------------------------------+--------------------------------------+
| ``fallback``                         | See the corresponding variable       |
|                                      | descriptions.                        |
+--------------------------------------+--------------------------------------+
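
As a rough illustration of how the request parameters described on this page combine into one URL, here is a minimal Python sketch. The base URL ``https://example.org/aggregator``, the helper name ``zip_export_url`` and the object URIs ``textgrid:abc.0`` and ``textgrid:def.1`` are assumptions made up for this example; only the ``/zip/{objects}`` path and the parameter names come from the synopsis. Repeating parameters such as ``only`` and ``filter`` are passed as lists, and booleans are written as ``true`` or ``false``.

```python
from urllib.parse import quote, urlencode

# Hypothetical base URL; replace it with the address of your Aggregator.
BASE_URL = "https://example.org/aggregator"


def zip_export_url(objects, base_url=BASE_URL, **params):
    """Build a /zip/{objects}?... request URL (illustrative helper only)."""
    # Several root objects are separated by commas in the path segment.
    path = ",".join(objects)
    query = []
    for name, value in params.items():
        if value is None:
            continue  # leave the parameter out so the server default applies
        # Repeating parameters (only, filter) may be given as a list.
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            if isinstance(item, bool):
                item = "true" if item else "false"
            query.append((name, str(item)))
    url = f"{base_url}/zip/{quote(path, safe=':,')}"
    # urlencode percent-encodes braces, '|' and '*' in file name patterns.
    return f"{url}?{urlencode(query)}" if query else url


if __name__ == "__main__":
    # Plain-text export of two (made-up) objects without metadata files,
    # in streaming mode and with a custom file name pattern.
    print(zip_export_url(
        ["textgrid:abc.0", "textgrid:def.1"],
        transform="text",
        meta=False,
        stream=True,
        filenames="{parent|/}{title|20}*.{ext}",
    ))
```

Percent-encoding the pattern strings matters because curly braces, the vertical bar and ``*`` are all significant in the pattern syntax and are not safe to send unescaped in a query string.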