ZIP export

ZIP export is the tool of choice if you would like to export a larger number of objects, whether transformed or not. Its features include:

  • export of aggregations and their children in original format, even large collections
  • export of whole search results, e.g., you could export ‘all verse text’ from TextGrid
  • export of metadata records together with their files
  • transformation of all XML documents, e.g., to plain text to facilitate use of statistical tools that cannot deal with TEI markup
  • export in a form that is re-importable into, e.g., the TextGridLab
  • rewriting of links from textgrid: towards relative file names in the ZIP
  • customization of the file names in the ZIP

Exporting large data sets

While you can export very large datasets, there is a caveat: in normal mode, the ZIP export needs to collect the metadata of all objects before it starts to deliver anything. This is required because all objects’ filenames must be calculated before the links between the objects can be rewritten correctly. Thus the ZIP export (unlike, e.g., the TEIcorpus export) might need quite some time before it delivers the first bytes. If you are unlucky, this startup time exceeds the timeout of your browser or of an intermediate proxy, and you get a timeout instead of the ZIP.

In order to still be able to export these large data sets, the ZIP export offers a special streaming mode. When you pass the query parameter stream=true, the Aggregator will deliver data as soon as possible, even if it does not yet have enough information to perform correct link rewriting. This may lead to exported files still containing textgrid: URIs, but at least you get files :-)
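As a minimal sketch, a streaming export URL could be assembled as follows. The base URL is taken from the search example in this document and may differ for your instance; the helper name zip_export_url is hypothetical and not part of the Aggregator.

```python
from urllib.parse import quote, urlencode

# Assumption: base URL as used in the search example of this document.
AGGREGATOR_ZIP = "https://textgridlab.org/1.0/aggregator/zip/"


def zip_export_url(objects, stream=False, sid=None):
    """Build a ZIP export URL for a list of TextGrid URIs (hypothetical helper)."""
    params = {}
    if stream:
        # Favor fast delivery over complete link rewriting.
        params["stream"] = "true"
    if sid:
        params["sid"] = sid
    # The objects are passed in the request path, separated by commas.
    path = quote(",".join(objects), safe=":,")
    query = urlencode(params)
    return AGGREGATOR_ZIP + path + ("?" + query if query else "")


url = zip_export_url(["textgrid:k2kp.0", "textgrid:k2k1.0"], stream=True)
print(url)
```

When downloading from such a URL, it is probably still wise to set a generous client-side timeout, since streaming only shortens the wait for the first bytes.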

Export Map

Each ZIP file that is exported contains an additional file at the root level called .INDEX.imex. This is an XML file that contains a list of all exported objects and that maps textgrid URIs to the file names used in the actual export. If you don’t rename stuff or move stuff around, this can be used by the TextGridLab to re-import your files.

Example:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<importSpec xmlns="http://textgrid.info/import">
    <importObject textgrid-uri="textgrid:k2kp.0" local-data="Romane/Goethes_Briefwechsel_mit_einem_Kinde/Arnim,_Bettina_von-Goethes_Briefwechsel_mit_einem_Kinde.xml" local-metadata="Romane/Goethes_Briefwechsel_mit_einem_Kinde/Arnim,_Bettina_von-Goethes_Briefwechsel_mit_einem_Kinde.xml.meta" rewrite-method="xml" rewrite-config="internal:tei#tei"/>
    <importObject textgrid-uri="textgrid:k2k1.0" local-data="Romane/Die_Guenderode/Arnim,_Bettina_von-Die_Guenderode.xml" local-metadata="Romane/Die_Guenderode/Arnim,_Bettina_von-Die_Guenderode.xml.meta" rewrite-method="xml" rewrite-config="internal:tei#tei"/>
    <importObject textgrid-uri="textgrid:k2k7.0" local-data="Romane/Clemens_Brentanos_Fruehlingskranz/Arnim,_Bettina_von-Clemens_Brentanos_Fruehlingskranz.xml" local-metadata="Romane/Clemens_Brentanos_Fruehlingskranz/Arnim,_Bettina_von-Clemens_Brentanos_Fruehlingskranz.xml.meta" rewrite-method="xml" rewrite-config="internal:tei#tei"/>
</importSpec>
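The export map can be read with any XML parser. The following sketch uses Python's standard library to turn .INDEX.imex into a dictionary from textgrid: URI to file names; the function name read_export_map is hypothetical, and the demo builds a tiny in-memory ZIP from one entry of the example above instead of contacting a server.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the import specification, as shown in the example above.
IMPORT_NS = {"imp": "http://textgrid.info/import"}


def read_export_map(zip_file):
    """Return {textgrid URI: (data file, metadata file)} for an export ZIP."""
    with zipfile.ZipFile(zip_file) as archive:
        root = ET.fromstring(archive.read(".INDEX.imex"))
    return {
        obj.get("textgrid-uri"): (obj.get("local-data"), obj.get("local-metadata"))
        for obj in root.findall("imp:importObject", IMPORT_NS)
    }


# Demo data: a single entry copied from the example above.
INDEX = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<importSpec xmlns="http://textgrid.info/import">
    <importObject textgrid-uri="textgrid:k2k1.0" local-data="Romane/Die_Guenderode/Arnim,_Bettina_von-Die_Guenderode.xml" local-metadata="Romane/Die_Guenderode/Arnim,_Bettina_von-Die_Guenderode.xml.meta" rewrite-method="xml" rewrite-config="internal:tei#tei"/>
</importSpec>
"""

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(".INDEX.imex", INDEX)

export_map = read_export_map(buffer)
data_file, meta_file = export_map["textgrid:k2k1.0"]
print(data_file)
```

For a real export, pass the path of the downloaded ZIP file to read_export_map instead of the in-memory buffer.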

Synopsis

/zip/{objects}?sid&title&filenames&metanames&dirnames&only&meta&transform&query&filter&target&start&stop&stream

General Request Query Parameters

parameter value description
sid string Session ID to access protected resources
stream boolean (Default: false) If true, favor fast results over ideal rewriting
title string (optional) title for the exported data, currently only used for generating the filename. If none is given, the first title of the first object will be used.

Choosing what to export

There are two ways to choose what to export:

Aggregation tree or list of objects

You export one or more objects/aggregations and everything that they aggregate. To do that, specify the URI(s) as the objects part of the request path, as with the other exporters:

parameter value description
objects string The TextGridURIs of the TEI documents or aggregations to zip, separated by commas (,)

Search results

Alternatively, specify a query to TG-search. To do so, specify an (unused) object string plus query parameters, so a possible URL may look like <https://textgridlab.org/1.0/aggregator/zip/query?query=waldeinsamkeit>.

You have the full power of the query language, but only a limited set of parameters that will be passed to TG-search:

parameter value description
query string (EXPERIMENTAL) perform the given TGsearch query and use its result as root objects instead of the objects.
filter string (repeating) For query: additional filters
target string (Default: both) If query is used, the query target (metadata, fulltext or both)
start int (Default: 0) For query: start at result no.
stop int (Default: 65535) For query: max. number of results

Please note that you typically will not need to specify the start and stop parameters, but you may want to use stream=true (cf. above).
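A search-based export URL might be put together like this. This is a sketch only: the helper search_export_url is hypothetical, the path segment query is the unused object string from the example URL above, and the filter values in the second call are placeholders, since the filter syntax is defined by TG-search and not described here.

```python
from urllib.parse import urlencode

# Assumption: base URL and unused object string "query" as in the example URL above.
SEARCH_EXPORT = "https://textgridlab.org/1.0/aggregator/zip/query"


def search_export_url(query, filters=(), target=None, stream=False):
    """Build a ZIP export URL for a TG-search query (hypothetical helper)."""
    params = [("query", query)]
    # filter is a repeating parameter: one key/value pair per filter.
    params += [("filter", f) for f in filters]
    if target:
        params.append(("target", target))  # metadata, fulltext or both
    if stream:
        params.append(("stream", "true"))
    return SEARCH_EXPORT + "?" + urlencode(params)


url = search_export_url("waldeinsamkeit", target="metadata", stream=True)
print(url)

# "a" and "b" are placeholders, not real TG-search filters.
filtered = search_export_url("waldeinsamkeit", filters=["a", "b"])
print(filtered)
```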

Further Filters

In both cases, you can further strip down what to export by specifying one or more content types and by specifying whether metadata and textgrid-specific technical files (i.e. the aggregation files) should be exported:

parameter value description
only string (repeating) If at least one only parameter is given, restrict export to objects with the given MIME types
meta boolean (Default: true) Include metadata and aggregation files in the ZIP file.

Converting TEI to something else

Sometimes you want the text, but you don’t want it in the original form. Since the aggregator has a built-in XSLT processor, you can use it to convert the documents. This typically does not considerably slow down the export process.

parameter value description
transform string (EXPERIMENTAL) Transform each XML document before zipping. Values currently available are text, html, or the textgrid: URI of an XSLT stylesheet.

If you specify transform=text, a default plain-text transformation will be used on each file. We use the to-plain-text transformation of the bundled TEI XSLTs, so expect a sensible, TEI-aware result. transform=html will use the built-in HTML transformation instead.

You can also specify a textgrid: URI that points to an XSLT stylesheet – however, keep in mind that this stylesheet must be either public or you need to pass in a valid session ID.
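To illustrate how the filter and transformation parameters combine, here is a small sketch that requests only XML documents, without metadata files, converted to plain text. The MIME type text/xml and the base URL are assumptions for illustration; check which content types your objects actually carry.

```python
from urllib.parse import urlencode

# Assumption: base URL as in the search example of this document.
AGGREGATOR_ZIP = "https://textgridlab.org/1.0/aggregator/zip/"

params = {
    "only": ["text/xml"],  # repeating parameter; MIME type is an assumption
    "meta": "false",       # leave out metadata and aggregation files
    "transform": "text",   # built-in plain-text transformation
}

# doseq=True expands list values into repeated key=value pairs.
url = AGGREGATOR_ZIP + "textgrid:k2kp.0" + "?" + urlencode(params, doseq=True)
print(url)
```

To use your own stylesheet, the value of transform would be its textgrid: URI instead, plus a sid parameter if the stylesheet is not public.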

Influencing file and directory names

It is possible to modify the filenames used inside the ZIP file (and for rewritten links) by providing file name patterns using three parameters:

parameter value description
filenames string (Default: {parent|/}{author}-{title}*.{ext}) Pattern for the generated filenames in the ZIP files.
metanames string (Default: {filename}.meta) Pattern for the filenames for the metadata files in the ZIP files.
dirnames string (Default: {parent|/}{title}*) Pattern for the directory names generated for aggregations etc. This pattern applied to the parent aggregation is available as {parent} in filenames and metanames.

The filenames will be generated from the metadata available to the aggregator when it adds the object to its internal list, so some fields, especially the author field, may be undefined. By default, each metadata field will be transformed to a safe character set containing only ASCII letters and numbers and a limited set of special characters, by running an automatic transcription (so Luſtige Märchen will become Lustige_Maerchen, and ηελλασ will become hellas). A literal * in the pattern will be replaced either by nothing or, if the same name would otherwise be generated for different objects, by a disambiguation number. The filename extension {ext} will depend on the format actually exported, so it is txt if you use transform=text.

Pattern Syntax

A pattern string is a string containing patterns enclosed in curly braces. Each pattern starts with a variable and is optionally followed by one or more options, each introduced by a vertical bar (|). Please note that all whitespace is significant.

As an example, the string {author|fallback|20}-{title|sep=,}.{uri}.{ext} contains the variables author with the options fallback and 20, the variable title with the option sep=,, and the variables uri and ext, each without any option.
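The syntax can be illustrated with a few lines of Python that split a pattern string into variables and options. This is only a sketch of the syntax as described here, not the Aggregator's actual parser; in particular it assumes that option values never contain | or curly braces.

```python
import re

# One pattern: everything between a pair of curly braces.
PLACEHOLDER = re.compile(r"\{([^{}]*)\}")


def parse_pattern(pattern):
    """Return a list of (variable, [options]) for each {...} in the string."""
    result = []
    for match in PLACEHOLDER.finditer(pattern):
        # The first part is the variable, the rest are its options.
        variable, *options = match.group(1).split("|")
        result.append((variable, options))
    return result


parsed = parse_pattern("{author|fallback|20}-{title|sep=,}.{uri}.{ext}")
for variable, options in parsed:
    print(variable, options)
```

For the example string this yields author with the options fallback and 20, title with sep=,, and uri and ext with empty option lists.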

Basic Variables

The following basic variables are available in all policies:

Variable Supported Options Description
author fallback, sep=String, Number, raw The object’s author. This tries to find the nearest work object in the aggregation tree and extracts its author or authors. If the fallback option is included and the matching work does not include author fields, use all agents regardless of their role instead.
title sep=String, Number, raw The object’s title or titles.
uri The object’s TextGrid URI. This only includes the scheme-specific part.
ext A filename extension that is suitable for the object’s MIME type, or dat if none found. This does not include a leading dot.
* pre=String (Default .), post=String A filename disambiguation pattern, only inserted if required. If filename disambiguation is on (setUniqueFilenames(boolean) <http://dev.digital-humanities.de/ci/job/link-rewriter/site/apidocs/info/textgrid/utils/export/filenames/ConfigurableFilenamePolicy.html#setUniqueFilenames%28boolean%29>), getFilename(IAggregationEntry) <http://dev.digital-humanities.de/ci/job/link-rewriter/site/apidocs/info/textgrid/utils/export/filenames/ConfigurableFilenamePolicy.html#getFilename%28info.textgrid.utils.export.aggregations.IAggregationEntry%29> will first generate a filename candidate with this pattern expanding to the empty string. If this filename has already been used for a different entry, it will re-run the filename generation with this pattern expanding to the empty string for the first object resolving to the candidate and to prefix + n-1 + postfix for every other object. I.e., for three XML documents by Goethe and the pattern {author}*.{ext} you will get Goethe.xml, Goethe.1.xml and Goethe.2.xml. Instead of {*} without options you can also simply write *.

If you generate multiple filenames, your pattern should include either {uri} or *; otherwise you risk getting the same filename for different objects!
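The disambiguation rule can be sketched in a few lines of Python. This is a simplified illustration, not the implementation in ConfigurableFilenamePolicy: it takes names in which all variables except * are already expanded, and it does not check whether a numbered name collides with some other existing name.

```python
def unique_filenames(names, pre=".", post=""):
    """Expand the * in each name so that equal candidates become distinct."""
    seen = {}  # candidate filename -> how often it has been produced so far
    result = []
    for name in names:
        # The candidate is the name with * expanding to the empty string.
        candidate = name.replace("*", "", 1)
        count = seen.get(candidate, 0)
        seen[candidate] = count + 1
        # First object: nothing inserted; every other one: pre + n-1 + post.
        suffix = "" if count == 0 else f"{pre}{count}{post}"
        result.append(name.replace("*", suffix, 1))
    return result


names = unique_filenames(["Goethe*.xml", "Goethe*.xml", "Goethe*.xml"])
print(names)
```

This reproduces the example above: the three documents end up as Goethe.xml, Goethe.1.xml and Goethe.2.xml.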

Nested Patterns

Variable Description
parent This is the dirnames pattern applied to the parent aggregation of the current object, if any. In the form {parent|/} it appends / if and only if there is a parent. It is available in all patterns, including in dirnames itself.
filename The name of the corresponding file whose metadata we are processing. Only available in metanames.

Options

Number If you pass any positive integer as an option, the expanded value of the variable will be trimmed to at most Number characters. Trimming occurs after all other processing steps for the variable.
raw Insert the result of this variable as-is, without character sanitization. If you do not include this option, the result of the metadata-based variables will be transcribed from its original characters to a safe subset of US-ASCII characters in order to be safe from all kinds of encoding and filename issues. This tries to do something sensible with, e.g., umlauts and non-latin scripts.
sep=String If present and the respective metadata field contains multiple values, use all values, joined together with the given separator String. Otherwise, only use the first value.
fallback See the description of the corresponding variable.
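How the sep and Number options interact can be sketched as follows. The function expand is hypothetical and deliberately leaves out character sanitization (it behaves as if raw were always given), because the exact transcription rules are not specified here; the titles are taken from the export map example.

```python
def expand(values, options):
    """Expand one variable from its metadata values and pattern options."""
    sep = None
    limit = None
    for option in options:
        if option.startswith("sep="):
            sep = option[len("sep="):]
        elif option.isdigit() and int(option) > 0:
            limit = int(option)
    if not values:
        return ""
    # With sep: join all values; without: only use the first value.
    text = sep.join(values) if sep is not None else values[0]
    # Sanitization would happen here unless raw is given (omitted in this sketch).
    # Trimming is the last processing step; limit=None keeps the full text.
    return text[:limit]


titles = ["Die Guenderode", "Clemens Brentanos Fruehlingskranz"]
print(expand(titles, []))
print(expand(titles, ["sep=,"]))
print(expand(titles, ["sep=,", "10"]))
```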