ZIP export¶
ZIP export is the way to go if you would like to export a larger number of objects, whether transformed or not. Its features include:
- export of aggregations and their children in original format, even large collections
- export of whole search results, i.e., you could export ‘all verse text’ from TextGrid
- export of metadata records together with their files
- transformation of all XML documents, e.g., to plain text to facilitate use of statistical tools that cannot deal with TEI markup
- export in a form that is re-importable into, e.g., the TextGridLab
- rewriting of links from textgrid: towards relative file names in the ZIP
- customization of the file names in the ZIP
Exporting large data sets¶
While you can export really large data sets, there is a caveat: in normal mode, the ZIP export needs to collect the metadata of all objects before it starts to deliver anything. This is required because all objects’ filenames must be calculated in order to rewrite links between the objects correctly. Thus the ZIP tool (unlike, e.g., the TEIcorpus export) might need quite some time before it delivers the first bytes. If you are unlucky, this lead time exceeds the timeout of your browser or of an intermediate proxy, and you will get a timeout instead of the ZIP file.
In order to still be able to export these large data sets, the ZIP export offers a special streaming mode. When you pass the query parameter stream=true, the Aggregator will deliver data as soon as possible, even if it does not have enough information to perform correct link rewriting. This may lead to exported files still containing textgrid: URIs, but at least you get files :-)
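A minimal sketch of how a streamed export could be requested from Python. The base URL is the one used in the search example of this documentation; the function names `build_stream_url` and `download` are illustrative assumptions and not part of the Aggregator.

```python
import shutil
import urllib.parse
import urllib.request

# Base URL as it appears in the search example of this documentation.
BASE_URL = "https://textgridlab.org/1.0/aggregator/zip/"


def build_stream_url(uris, sid=None):
    """Build a ZIP export URL for the given TextGrid URIs with stream=true."""
    # Multiple objects are separated by commas in the request path.
    path = ",".join(uris)
    params = {"stream": "true"}
    if sid is not None:
        params["sid"] = sid  # only needed for protected resources
    return BASE_URL + path + "?" + urllib.parse.urlencode(params)


def download(url, target_path, timeout=60):
    """Copy the response body to a file in chunks instead of holding it in memory."""
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(target_path, "wb") as out:
        shutil.copyfileobj(response, out)


url = build_stream_url(["textgrid:k2kp.0"])
print(url)
# download(url, "export.zip")  # uncomment to perform the network request
```

Because data arrives as soon as it is available, writing the response to disk in chunks avoids waiting for the whole archive before anything is stored.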
Export Map¶
Each ZIP file that is exported contains an additional file at the root level called .INDEX.imex. This is an XML file that contains a list of all exported objects and that maps textgrid URIs to the file names used in the actual export. If you don’t rename stuff or move stuff around, this can be used by the TextGridLab to re-import your files.
Example:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<importSpec xmlns="http://textgrid.info/import">
<importObject textgrid-uri="textgrid:k2kp.0" local-data="Romane/Goethes_Briefwechsel_mit_einem_Kinde/Arnim,_Bettina_von-Goethes_Briefwechsel_mit_einem_Kinde.xml" local-metadata="Romane/Goethes_Briefwechsel_mit_einem_Kinde/Arnim,_Bettina_von-Goethes_Briefwechsel_mit_einem_Kinde.xml.meta" rewrite-method="xml" rewrite-config="internal:tei#tei"/>
<importObject textgrid-uri="textgrid:k2k1.0" local-data="Romane/Die_Guenderode/Arnim,_Bettina_von-Die_Guenderode.xml" local-metadata="Romane/Die_Guenderode/Arnim,_Bettina_von-Die_Guenderode.xml.meta" rewrite-method="xml" rewrite-config="internal:tei#tei"/>
<importObject textgrid-uri="textgrid:k2k7.0" local-data="Romane/Clemens_Brentanos_Fruehlingskranz/Arnim,_Bettina_von-Clemens_Brentanos_Fruehlingskranz.xml" local-metadata="Romane/Clemens_Brentanos_Fruehlingskranz/Arnim,_Bettina_von-Clemens_Brentanos_Fruehlingskranz.xml.meta" rewrite-method="xml" rewrite-config="internal:tei#tei"/>
</importSpec>
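The export map can be read with any XML parser. The following sketch, using only the Python standard library, turns it into a dictionary from TextGrid URI to the data and metadata file names. The function name `read_export_map` is an assumption for illustration; the sample data is shortened from the example above.

```python
import xml.etree.ElementTree as ET

# Namespace of the export map, as declared on the importSpec root element.
IMEX_NS = "{http://textgrid.info/import}"


def read_export_map(xml_bytes):
    """Return a dict mapping each TextGrid URI to (data file, metadata file)."""
    root = ET.fromstring(xml_bytes)
    mapping = {}
    for obj in root.iter(IMEX_NS + "importObject"):
        mapping[obj.get("textgrid-uri")] = (
            obj.get("local-data"),
            obj.get("local-metadata"),
        )
    return mapping


SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<importSpec xmlns="http://textgrid.info/import">
<importObject textgrid-uri="textgrid:k2kp.0" local-data="Romane/Goethes_Briefwechsel_mit_einem_Kinde/Arnim,_Bettina_von-Goethes_Briefwechsel_mit_einem_Kinde.xml" local-metadata="Romane/Goethes_Briefwechsel_mit_einem_Kinde/Arnim,_Bettina_von-Goethes_Briefwechsel_mit_einem_Kinde.xml.meta" rewrite-method="xml" rewrite-config="internal:tei#tei"/>
<importObject textgrid-uri="textgrid:k2k1.0" local-data="Romane/Die_Guenderode/Arnim,_Bettina_von-Die_Guenderode.xml" local-metadata="Romane/Die_Guenderode/Arnim,_Bettina_von-Die_Guenderode.xml.meta" rewrite-method="xml" rewrite-config="internal:tei#tei"/>
</importSpec>"""

export_map = read_export_map(SAMPLE)
for uri, (data_file, meta_file) in export_map.items():
    print(uri, "->", data_file)
```

For a downloaded export, the same bytes could be obtained with `zipfile.ZipFile(path).read(".INDEX.imex")`.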
Synopsis¶
Synopsis:
/zip/{objects}?sid&title&filenames&metanames&dirnames&only&meta&transform&query&filter&target&start&stop&stream
General Request Query Parameters¶
parameter | value | description |
---|---|---|
sid | string | Session ID to access protected resources |
stream | boolean Default: | If true, favor fast results over ideal rewriting |
title | string | (optional) title for the exported data, currently only used for generating the filename. If none is given, the first title of the first object will be used. |
Choosing what to export¶
There are two ways to choose what to export:
Aggregation tree or list of objects¶
You export one or more objects/aggregations and everything that they aggregate. To do that, specify the URI(s) as the objects part of the request path, as with the other exporters:
parameter | value | description |
---|---|---|
objects | string | The TextGridURIs of the TEI documents or aggregations to zip, separated by commas (,) |
Search results¶
Alternatively, specify a query to TG-search. To do so, specify an (unused) object string plus query parameters, so a possible URL may look like <https://textgridlab.org/1.0/aggregator/zip/query?query=waldeinsamkeit>.
You have the full power of the query language, but only a limited set of parameters that will be passed to TG-search:
parameter | value | description |
---|---|---|
query | string | (EXPERIMENTAL) perform the given TGsearch query and use its result as root objects instead of the objects. |
filter | string (repeating) | for query: additional filters |
target | string Default: | if query is used, the query target (metadata, fulltext or both) |
start | int Default: | for query: start at result no. |
stop | int Default: | for query: max. number of results |
Please note that you typically will not need to specify the start and stop parameters, but you may want to use stream=true (cf. above).
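As a rough illustration, a search-based export URL could be assembled as follows. The helper `build_query_url` is hypothetical; only the parameter names, the target values and the base URL come from this documentation. The format of filter values is not described here, so the sketch passes them through unchanged.

```python
from urllib.parse import urlencode

# Base URL as it appears in the example URL above.
BASE_URL = "https://textgridlab.org/1.0/aggregator/zip/"


def build_query_url(query, filters=(), target=None, stream=False):
    """Build a ZIP export URL that exports the results of a TG-search query."""
    params = [("query", query)]
    # filter is a repeating parameter: one key/value pair per filter.
    params += [("filter", f) for f in filters]
    if target is not None:
        params.append(("target", target))  # metadata, fulltext or both
    if stream:
        params.append(("stream", "true"))
    # "query" in the path is the unused placeholder object string.
    return BASE_URL + "query?" + urlencode(params)


print(build_query_url("waldeinsamkeit"))
print(build_query_url("waldeinsamkeit", target="fulltext", stream=True))
```

Using `urlencode` with a list of pairs keeps repeated parameters intact and percent-encodes query strings that contain spaces or special characters.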
Further Filters¶
In both cases, you can further strip down what to export by specifying one or more content types and by specifying whether metadata and textgrid-specific technical files (i.e. the aggregation files) should be exported:
parameter | value | description |
---|---|---|
only | string (repeating) | If at least one only parameter is given, restrict export to objects with the given MIME types |
meta | boolean Default: | Include metadata and aggregation files in the ZIP file. |
Converting TEI to something else¶
Sometimes you want the text, but you don’t want it in the original form. Since the aggregator has a built-in XSLT processor, you can use it to convert the documents. This typically does not considerably slow down the export process.
parameter | value | description |
---|---|---|
transform | string | (EXPERIMENTAL) Transform each XML document before zipping. Values currently available are text, html, or the textgrid: URI of an XSLT stylesheet. |
If you specify transform=text, a default plain-text transformation will be used on each file. We use the to-plain-text transformation of the bundled TEI XSLTs, so expect a sensible, TEI-aware result. transform=html will use the built-in HTML transformation instead.
You can also specify a textgrid: URI that points to an XSLT stylesheet – however, keep in mind that this stylesheet must be either public or you need to pass in a valid session ID.
Influencing file and directory names¶
It is possible to modify the filenames used inside the ZIP file (and for rewritten links) by providing file name patterns using three parameters:
parameter | value | description |
---|---|---|
filenames | string Default: | Pattern for the generated filenames in the ZIP files. |
metanames | string Default: | Pattern for the filenames for the metadata files in the ZIP files. |
dirnames | string Default: | Pattern for the directory names generated for aggregations etc. This pattern applied to the parent aggregation is available as {parent} in filenames and metanames. |
The filenames will be generated from the metadata available to the aggregator when it adds the object to its internal list, so some fields, especially the author field, may be undefined. By default, each metadata field will be transformed to a safe character set containing only ASCII letters and numbers and a limited set of special characters, by running an automatic transcription (so Luſtige Märchen will become Lustige_Maerchen, and ηελλασ will become hellas).
A literal * in the pattern will be replaced by either nothing or a disambiguation number if the same name would otherwise be generated for different objects. The filename extension {ext} will depend on the format actually exported, so it is txt if you use transform=text.
Pattern Syntax¶
A pattern string is a string containing patterns enclosed in curly braces. Each pattern starts with a variable and is optionally followed by one or more options, each introduced by a vertical bar ( | ). Please note that all whitespace is significant.
As an example, the string {author|fallback|20}-{title|sep=,}.{uri}.{ext} contains the variable author with the options fallback and 20, the variable title with the option sep=,, and the variables uri and ext, each without any option.
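To make the syntax concrete, here is a small sketch that splits a pattern string into its variables and options. It is a simplified illustration and not the Aggregator's actual parser: the name `parse_pattern` is an assumption, a bare * outside of braces is not handled, and option values containing a vertical bar are not supported.

```python
import re

# One {...} group: a variable name followed by zero or more |option parts.
PATTERN_RE = re.compile(r"\{([^{}|]+)((?:\|[^{}|]*)*)\}")


def parse_pattern(pattern):
    """Return a list of (variable, [options]) for every {...} group in the pattern."""
    result = []
    for match in PATTERN_RE.finditer(pattern):
        variable = match.group(1)
        raw_options = match.group(2)
        # raw_options looks like "|fallback|20"; drop the leading bar and split.
        # Nothing is stripped, because whitespace is significant.
        options = raw_options[1:].split("|") if raw_options else []
        result.append((variable, options))
    return result


for variable, options in parse_pattern("{author|fallback|20}-{title|sep=,}.{uri}.{ext}"):
    print(variable, options)
```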
Basic Variables¶
The following basic variables are available in all policies:
Variable | Supported Options | Description |
---|---|---|
author | fallback, sep= String, Number, raw | The object’s author. This tries to find the nearest work object in the aggregation tree and extracts its author or authors. If the fallback option is included and the matching work does not include author fields, use all agents regardless of their role instead. |
title | sep= String, Number, raw | The object’s title or titles. |
uri | — | The object’s TextGrid URI. This only includes the scheme-specific part. |
ext | — | A filename extension that is suitable for the object’s MIME type, or dat if none found. This does not include a leading dot. |
* | pre= String (Default . ), post= String | A filename disambiguation pattern, only inserted if required. If filename disambiguation is on (setUniqueFilenames(boolean), <http://dev.digital-humanities.de/ci/job/link-rewriter/site/apidocs/info/textgrid/utils/export/filenames/ConfigurableFilenamePolicy.html#setUniqueFilenames%28boolean%29>), getFilename(IAggregationEntry) (<http://dev.digital-humanities.de/ci/job/link-rewriter/site/apidocs/info/textgrid/utils/export/filenames/ConfigurableFilenamePolicy.html#getFilename%28info.textgrid.utils.export.aggregations.IAggregationEntry%29>) will first generate a filename candidate with this pattern expanding to the empty string. If this filename has already been used for a different entry, it will re-run the filename generation with this pattern expanding to the empty string for the first object resolving to the candidate and to prefix + n-1 + postfix for every other object. I.e. for three XML documents by Goethe and the pattern {author}*.{ext} you will get Goethe.xml, Goethe.1.xml and Goethe.2.xml. Instead of {*} without options you can also simply write *. |
If you generate multiple filenames, your pattern should include either {uri} or *, or you risk getting the same filename for different objects!
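The numbering behaviour of the * pattern can be sketched in a few lines of Python. This is a simplified single-pass illustration that reproduces the documented Goethe example, not the two-pass logic of the actual implementation; the function name `disambiguate` is an assumption, and collisions between a numbered name and another candidate are ignored.

```python
def disambiguate(candidates, pre=".", post=""):
    """Expand the literal '*' in each candidate filename.

    The first object with a given candidate gets the empty string, every
    further object with the same candidate gets pre + running number + post.
    """
    seen = {}
    names = []
    for candidate in candidates:
        count = seen.get(candidate, 0)  # how often this candidate occurred before
        seen[candidate] = count + 1
        marker = "" if count == 0 else f"{pre}{count}{post}"
        names.append(candidate.replace("*", marker, 1))
    return names


# Three XML documents by Goethe, pattern {author}*.{ext} after variable expansion:
print(disambiguate(["Goethe*.xml", "Goethe*.xml", "Goethe*.xml"]))
```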
Nested Patterns¶
Variable | Description |
---|---|
parent | This is the dirnames pattern applied to the parent aggregation of the current object, if any. In the form {parent /} it appends / if and only if there is a parent. It is available in all patterns, including in dirnames itself. |
filename | The filename of the object whose metadata is being processed. Only available in metanames. |
Options¶
Option | Description |
---|---|
Number | If you pass any positive integer as an option, the expanded value of the variable will be trimmed to at most Number characters. Trimming occurs after all other processing steps for the variable. |
raw | Insert the result of this variable as-is, without character sanitization. If you do not include this option, the result of the metadata-based variables will be transcribed from its original characters to a safe subset of US-ASCII characters in order to be safe from all kinds of encoding and filename issues. This tries to do something sensible with, e.g., umlauts and non-Latin scripts. |
sep= String | If present and the respective metadata field contains multiple values, use all values, joined together with the given separator String. Otherwise, only use the first value. |
fallback | See the descriptions of the corresponding variables. |
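The interplay of sep= and Number can be illustrated with a short sketch. The function `expand_values` is hypothetical and omits character sanitization entirely, so it behaves as if raw were always set; in the real export, sanitization would run before trimming.

```python
def expand_values(values, sep=None, limit=None):
    """Combine metadata values as described for the sep= and Number options."""
    if not values:
        return ""
    # Without sep= only the first value is used; with it, all values are joined.
    text = sep.join(values) if sep is not None else values[0]
    # Trimming is the last step, after all other processing.
    return text[:limit] if limit is not None else text


authors = ["Arnim, Bettina von", "Goethe"]  # sample values for illustration
print(expand_values(authors))                      # first value only
print(expand_values(authors, sep=";"))             # all values, joined
print(expand_values(authors, sep=";", limit=10))   # joined, then trimmed
```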