Automatic Data Migrations Performed by Ambra 0.9.1-SNAPSHOT

Like 0.9, ambra 0.9.1 contains automatic data Migrators. The idea is that when upgrading from a previous version, ambra knows exactly what changes need to be made to the data so that the new version can work with it. This is usually the safest approach to upgrading and the one with the least down-time.

Changes in 0.9.1-SNAPSHOT

  • 0.9 Data Migrator is turned off (See r6587)
  • Updated Bootstrap Migrator
  • New Citation Migrator
  • New Search Migrator

Bootstrap Migrator

This is run before OTM is initialized. Changes are:

  • r6542 (Convert topaz:state to use an xsd:int data-type. Addresses #418)
  • r6570 (drop an obsolete graph for the model to graph rename in r6564) (Note that r6818 moved this change into the Migrator)

Also note that the transition from the 0.9 Migrator to the 0.9.1 Migrator involved two significant wiring changes:

  • r6827 (checks for the presence of graphs before attempting migration)
  • r6938 (makes the Migrator Spring managed instead of wiring it from web.xml)

Note: A failure in this migrator will result in a startup failure for Ambra.

Citation Migrator

This is run after OTM has been initialized. It updates Mulgara with additional triples for Article Citations, as required by the changes in r6692, r6774 and r6867.

Please see r6953 for the Migrator tuning details; in particular, applicationContext.xml lists the configuration options. The configuration is not hooked to the ambra commons config mechanism, though that could easily be done if the need arises. Note that applicationContext.xml can be found in the WEB-INF directory of the ambra war file.

This Migrator is also 'state-less'. Any article that is missing a dc:identifier on its bibliographic Citation is a candidate for this Migrator. Hence the status report log message from this Migrator contains only the number of successful Migrations and the number of failed Migrations. (See http://lists.topazproject.org/pipermail/ambra-dev/2008-December/001048.html for why this number should not be extrapolated into a failure percentage, as was reported in #921.)

An article Migration will fail in the following cases:

  • failure to access Mulgara
  • failure to access Fedora
  • missing article XML in Fedora
  • duplicate bibtex:hasKey on Citation
  • mismatch between the author list in existing Citations in Mulgara and the author list found in the article XML

Search Migrator

Search Migrator is also run after OTM has been initialized. Strictly speaking, it is not a data migrator like the others. It creates Lucene search indexes for all ambra data model objects that carry an @Searchable annotation (see r6667, r6771 and #744). No attempt is made to migrate the old (v 0.9) Lucene index.

Please see r6954 for the Migrator tuning details. The configuration was changed in r7143 to increase the timeout and flip the default for 'finalize'. The configuration is not hooked to the ambra commons config mechanism, though that could easily be done if the need arises. For now it lives in the applicationContext.xml file for ambra, which can be found in the WEB-INF directory of the ambra war file. The current 'searchMigrator' bean definition is given below:

  <bean id="searchMigrator" class="org.topazproject.ambra.migration.SearchMigrator"
      init-method="init">
    <!-- reIndex: set to true to re-build search-index. Will re-build all indexes even if
         all search-migrations were completed and 'finalized'. -->
    <property name="reIndex" value="false"/>
    <!-- finalize: set to 'true' to write a marker on success to denote completion of all migrations.
         Setting to 'false' is really for running a migration elsewhere and copy
         the data to a real instance. For everywhere else the recommendation is to set this to 'true'.
    -->
    <property name="finalize" value="true"/> 
    <!-- background: set to true to allow web-traffic during migrations -->
    <property name="background" value="true"/>
    <!-- txnTimeout: the txn timeout in seconds -->
    <property name="txnTimeout" value="1800"/>
    <!-- blobThrottle: number of blobs to index per transaction -->
    <property name="blobThrottle" value="20"/>
    <!-- rdfThrottle: number of mulgara only entity instances (eg. Citation) to index per transaction -->
    <property name="rdfThrottle" value="3000"/>
  </bean>
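
To check which values a deployed instance is actually using, the bean can be read back out of applicationContext.xml. The following is a minimal Python sketch, not part of ambra: the helper names and the trimmed SAMPLE_CONTEXT string are hypothetical, and it assumes the file is a regular Spring beans document (with or without an XML namespace).

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical stand-in for WEB-INF/applicationContext.xml.
SAMPLE_CONTEXT = """
<beans xmlns="http://www.springframework.org/schema/beans">
  <bean id="searchMigrator" class="org.topazproject.ambra.migration.SearchMigrator"
      init-method="init">
    <!-- comments are skipped by the parser -->
    <property name="reIndex" value="false"/>
    <property name="finalize" value="true"/>
    <property name="background" value="true"/>
    <property name="txnTimeout" value="1800"/>
    <property name="blobThrottle" value="20"/>
    <property name="rdfThrottle" value="3000"/>
  </bean>
</beans>
"""


def local_name(tag):
    """Drop a '{namespace}' prefix so the lookup works with or without xmlns."""
    return tag.rsplit("}", 1)[-1]


def bean_properties(xml_text, bean_id):
    """Return {property name: value} for the bean with the given id, or None."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if local_name(elem.tag) == "bean" and elem.get("id") == bean_id:
            return {
                child.get("name"): child.get("value")
                for child in elem
                if local_name(child.tag) == "property"
            }
    return None


if __name__ == "__main__":
    for name, value in bean_properties(SAMPLE_CONTEXT, "searchMigrator").items():
        print(name, "=", value)
```

For a real instance, the text of the applicationContext.xml extracted from the war file would be passed in place of SAMPLE_CONTEXT.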

Since the SearchMigrator is not really a Migrator, there is no 'end' condition it can check for, so it maintains state in a separate Mulgara graph ('sm'). This graph is not needed for running ambra after the migrations are completed and finalized. However, as of r7143, the SearchMigrator does not drop this graph automatically on finalize. This allows migrations to be continued at a later point in time by setting finalize to false and back.

Also note that some entities are indexed by an 'insert-select ... into <...lucene>' TQL command that is performed entirely in Mulgara. No separate state is maintained for these in the 'sm' graph, so the 'insert-select' is performed every single time ambra is started, unless the 'finalize' flag is set. insert-select can be quite slow for large data-sets, so it is better to set 'finalize' to true to avoid slowing down ambra startup.

Stopping SearchMigrator

The finalized flag is stored in the 'ri' graph as:

   <migrator:migrations> <method:searchMigrated> '1'^^<xsd:int>  

    ('1' for finalized, '0' or missing statement for not)

You could change the value here in the triple-store to prevent or re-enable migrations.

Note that SearchMigrator will set the value to '1' only after a run without any errors. So even if the finalize flag in applicationContext.xml is set to true, if prior runs had a failure, the SearchMigrator will keep attempting to migrate until an error-free run completes.

If you can live with certain errors, and want the migrator to stop further attempts, you can set this flag in the triple-store directly.
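
The marker rules described above can be summarised in a small Python model. This is only an illustration of the documented behaviour, not the SearchMigrator code: the function names are hypothetical, the marker is modelled as '1', '0' or None (statement missing), and what happens when 'finalize' is false while a '1' marker already exists is deliberately left out because this page does not spell it out.

```python
from typing import Optional

FINALIZED = "1"  # value of <method:searchMigrated> once migrations are finalized


def migration_needed(marker: Optional[str], re_index: bool = False) -> bool:
    """Decide whether another migration attempt is due at startup.

    marker is the value stored in the 'ri' graph: '1', '0', or None when the
    statement is missing. reIndex forces a rebuild even when finalized.
    """
    if re_index:
        return True
    return marker != FINALIZED


def marker_after_run(marker: Optional[str], finalize: bool, error_count: int) -> Optional[str]:
    """Return the marker value after one migration run.

    The marker only becomes '1' when finalize is enabled and the run had
    no errors; otherwise it is left as it was.
    """
    if finalize and error_count == 0:
        return FINALIZED
    return marker


if __name__ == "__main__":
    marker = None
    # A first run with errors leaves the marker unset, so the next start retries.
    marker = marker_after_run(marker, finalize=True, error_count=2)
    print(migration_needed(marker))
    # An error-free run finalizes, and later starts skip the migration.
    marker = marker_after_run(marker, finalize=True, error_count=0)
    print(migration_needed(marker))
```

Setting the value to '1' directly in the triple-store corresponds to handing migration_needed a marker of '1' even though no error-free run ever happened.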

Re-Index Option

Setting 'reIndex' to true is a heavy hammer that is rarely needed. However, it is available in case the Lucene index is ever corrupted, at any time in the future and not just during migration.

Re-Index will drop the 'sm' graph, re-create it and remove the marker set up by finalize, thus causing the SearchMigrator to re-build the Lucene index completely. SearchMigrator does not have hard-coded knowledge of the Ambra content model. It works by querying OTM for all searchable properties and indexing them. So the re-index option can work across content model changes.

Note that the re-Indexer cannot make Mulgara's Lucene resolver delete the Lucene database. So prior to running the re-Indexer, the administrator may have to manually delete the old Lucene database to really start from scratch.

BOM issue

Certain article xml files seem to contain a Byte-Order-Mark. See http://en.wikipedia.org/wiki/Byte-order_mark.

The XML specification has been changed to accommodate this, as per: http://www.w3.org/XML/xml-V10-2e-errata (search for bom)

However, the parser API that the xml tag stripper in search uses forbids a BOM. See http://java.sun.com/j2se/1.4.2/docs/api/org/xml/sax/InputSource.html#InputSource(java.io.Reader)

So until OTM is updated to look for and remove the BOM, the best option is to remove the BOM from the article xml files manually. (No need to re-ingest. Edit the file in Fedora and remove the 3 bytes 'efbbbf' that form the UTF-8 BOM.)

For example, od shows this:

  od -x /tmp/journal.pgen.1000137.xml | head -1
  0000000 bbef 3cbf 783f 6c6d 7620 7265 6973 6e6f

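  Note the 'efbbbf' before the <?xml.... od -x prints 16-bit words, so on a little-endian machine the bytes ef bb bf 3c show up as 'bbef 3cbf'.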

Note that ambra expects a UTF-8 encoded article.xml. So if the BOM indicates any other encoding, then a UTF-8 encoded article version needs to be re-ingested.
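
The check and the manual fix can be scripted. The following Python sketch is not part of ambra and its function names are hypothetical. It detects and strips a UTF-8 BOM from raw bytes, and also turns little-endian 'od -x' words like the ones shown above back into bytes in file order.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """True if the data starts with the 3-byte UTF-8 BOM (ef bb bf)."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM; other data is unchanged."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data


def od_words_to_bytes(words: str) -> bytes:
    """Convert the 16-bit hex words printed by 'od -x' on a little-endian
    machine back into bytes in file order, e.g. 'bbef 3cbf' -> ef bb bf 3c."""
    out = bytearray()
    for word in words.split():
        out += int(word, 16).to_bytes(2, "little")
    return bytes(out)


if __name__ == "__main__":
    head = od_words_to_bytes("bbef 3cbf 783f 6c6d 7620 7265 6973 6e6f")
    print(has_utf8_bom(head))
    print(strip_utf8_bom(head))
```

To clean a file, read it in binary mode, pass the bytes through strip_utf8_bom and write the result back before replacing the copy in Fedora. Only the UTF-8 BOM is handled here; a BOM for any other encoding still calls for a re-ingest as described above.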

A BOM in the article XML will create an error in ambra.log on migration similar to:

2008-12-29 21:36:40,013 ERROR SearchMigrator()> Failed to create search indexes for
 ClassMetadata[name=TextRepresentation, type=[http://rdf.topazproject.org/RDF/Representation]] 
 with id: info:doi/10.1371/representation/e000595f-7d65-44e4-9644-a16cb0c47e11 
 [Search-Migrator org.topazproject.ambra.migration.SearchMigrator]
 org.topazproject.otm.OtmException: Error parsing document
        at org.topazproject.otm.search.XmlTagStripper.process(XmlTagStripper.java:63)
        at org.topazproject.ambra.models.TextRepresentation$BodyPreProcessor.process(TextRepresentation.java:143)
        at org.topazproject.otm.stores.ItqlStore.buildInsert(ItqlStore.java:304)
        at org.topazproject.otm.stores.ItqlStore.insertInternal(ItqlStore.java:225)
        at org.topazproject.otm.stores.ItqlStore.index(ItqlStore.java:175)
        at org.topazproject.ambra.migration.SearchMigrator.migrate(SearchMigrator.java:313)
        ...
Caused by: org.xml.sax.SAXParseException: Content is not allowed in prolog.
        ... 13 more
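
To find out which objects need their article XML fixed, the failing ids can be pulled out of ambra.log. The following Python sketch is not part of ambra: the function name is hypothetical and the regular expression only assumes the message layout shown in the log entry above.

```python
import re

# Matches the "Failed to create search indexes for ... with id: ..." message,
# whether it is logged on a single line or wrapped over several lines.
FAILURE_RE = re.compile(
    r"Failed to create search indexes for\s+"
    r"ClassMetadata\[name=(?P<entity>[^,\]]+).*?"
    r"with id:\s+(?P<id>\S+)",
    re.DOTALL,
)

# Excerpt of the log entry shown above.
SAMPLE_LOG = """\
2008-12-29 21:36:40,013 ERROR SearchMigrator()> Failed to create search indexes for
 ClassMetadata[name=TextRepresentation, type=[http://rdf.topazproject.org/RDF/Representation]]
 with id: info:doi/10.1371/representation/e000595f-7d65-44e4-9644-a16cb0c47e11
 [Search-Migrator org.topazproject.ambra.migration.SearchMigrator]
 org.topazproject.otm.OtmException: Error parsing document
"""


def failed_entities(log_text):
    """Return (entity name, id) pairs for every indexing failure in the log text."""
    return [(m.group("entity"), m.group("id")) for m in FAILURE_RE.finditer(log_text)]


if __name__ == "__main__":
    for entity, object_id in failed_entities(SAMPLE_LOG):
        print(entity, object_id)
```

Not every failure reported this way is caused by a BOM; the 'Caused by' line in the stack trace still has to be checked for each entry.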