combineExport is a Perl module that exports records in XML from a Combine database.
combineExport --jobname <name> [--profile alvis|dc|combine --charset utf8|isolatin --number <n> --recordid <n> --md5 <MD5> --incremental --xsltscript ...]
OPTIONS AND ARGUMENTS
jobname is used to find the appropriate configuration (mandatory)
profile selects one of three profiles: alvis, dc, and combine. The alvis and combine profiles are similar XML formats.
The 'alvis' profile format is defined by the Alvis enriched document format DTD. It uses charset UTF-8 by default.
The 'combine' profile is more compact, with less redundancy.
The 'dc' profile is XML-encoded Dublin Core data.
charset selects a specific character set, either UTF-8 or iso-latin-1. Overrides --profile settings.
collapseinlinks skips inlinks with duplicate anchor texts (i.e. just one inlink per unique anchor text).
nooutlinks excludes all outlinks from the exported records.
ZebraIndex sends XML records directly to the Zebra server defined in Combine configuration variable 'ZebraHost'. It uses the default Zebra configuration: profile=combine, nooutlinks, collapseinlinks and is compatible with the direct Zebra indexing done during harvesting when 'ZebraHost' is defined in the Combine configuration. Requires that the Zebra server is running.
SolrIndex sends XML records directly to the Solr server defined in Combine configuration variable 'SolrHost'. It uses the default Solr configuration: profile=combine, nooutlinks, collapseinlinks and is compatible with the direct Solr indexing done during harvesting when 'SolrHost' is defined in the Combine configuration. Requires that the Solr server is running.
xsltscript generates records in Combine native format and converts them using this XSLT script before output. See example scripts in /etc/combine/*.xsl
number is the maximum number of records to be exported.
recordid exports just the one record with this recordid.
md5 exports just the one record with this MD5 checksum.
pipehost and pipeport specify the server name and port to connect to when exporting data using the Alvis Pipeline. The export is incremental, i.e. it covers all changes since the last call to combineExport with the same pipehost and pipeport.
incremental exports incrementally, i.e. all changes since the last call to combineExport using --incremental.
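As an illustration of how the options above combine, the following is a minimal Python sketch that assembles a combineExport command line. It is not part of Combine: the helper name build_export_command and the job name 'myjob' are hypothetical, only flags that appear in the synopsis are used, and it assumes each value is passed as a separate argument after its flag, as the synopsis suggests.

```python
import shlex
from typing import List, Optional

# Values listed in the synopsis for --profile and --charset.
PROFILES = ("alvis", "dc", "combine")
CHARSETS = ("utf8", "isolatin")


def build_export_command(
    jobname: str,
    profile: Optional[str] = None,
    charset: Optional[str] = None,
    number: Optional[int] = None,
    recordid: Optional[int] = None,
    md5: Optional[str] = None,
    incremental: bool = False,
) -> List[str]:
    """Return an argument list for combineExport (hypothetical helper)."""
    if not jobname:
        raise ValueError("jobname is mandatory")
    if profile is not None and profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile}")
    if charset is not None and charset not in CHARSETS:
        raise ValueError(f"unknown charset: {charset}")

    argv = ["combineExport", "--jobname", jobname]
    if profile is not None:
        argv += ["--profile", profile]
    if charset is not None:
        argv += ["--charset", charset]
    if number is not None:
        argv += ["--number", str(number)]
    if recordid is not None:
        argv += ["--recordid", str(recordid)]
    if md5 is not None:
        argv += ["--md5", md5]
    if incremental:
        argv.append("--incremental")
    return argv


if __name__ == "__main__":
    # Assemble a command that exports at most 10 Dublin Core records for 'myjob'.
    print(shlex.join(build_export_command("myjob", profile="dc", number=10)))
```

The returned list can be handed to subprocess.run on a machine where combineExport is installed; building a list rather than a single string avoids shell quoting problems with job names or checksums.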
What's New in This Release:
· Fixed some tests
· Added support for exceptions to GeoIP
· Better handling of special characters
· Added support for new URL scheduling algorithms (including score based)
· Improved HTML -> text extraction