Ace:DspaceNotes

Revision as of 14:23, 11 September 2008 by Toaster

File Identifiers

DSpace renames files stored in its storage pools. Internally, it uses a 38-digit integer to identify each bitstream in local storage. The first 6 digits of the identifier determine the path within a storage pool, and the file itself is named with the full identifier. For example, a bitstream with the identifier 20677408179490428002330043551579494031 stores its data in /path/to/storage/20/67/74/20677408179490428002330043551579494031.
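The mapping from identifier to path can be sketched in a few lines of Python. This is an illustration of the layout described above, not DSpace code; the function name bitstream_path and its arguments are hypothetical.

```python
from pathlib import PurePosixPath


def bitstream_path(storage_root: str, internal_id: str) -> str:
    """Map an internal identifier to its location inside a storage pool."""
    if not (internal_id.isascii() and internal_id.isdigit()) or len(internal_id) < 6:
        raise ValueError("identifier must be a decimal string of at least 6 digits")
    # The first six digits, taken two at a time, name three nested directories.
    parts = [internal_id[i:i + 2] for i in range(0, 6, 2)]
    # The file itself is named with the full identifier.
    return str(PurePosixPath(storage_root, *parts, internal_id))


print(bitstream_path("/path/to/storage", "20677408179490428002330043551579494031"))
# prints /path/to/storage/20/67/74/20677408179490428002330043551579494031
```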

DSpace can also use a handle-based identifier to retrieve individual bitstreams. Each item in DSpace is assigned a GUID via a handle, and each bitstream in an item is assigned a sequence number that is appended to the item handle. The DSpace documentation describes the scheme as follows:

Each bitstream has a sequence ID, unique within an item. This sequence ID is used to create a persistent ID, of the form:

dspace url/bitstream/handle/sequence ID/filename

For example:

https://dspace.myu.edu/bitstream/123.456/789/24/foo.html

The above refers to the bitstream with sequence ID 24 in the item with the Handle hdl:123.456/789.
(from http://www.dspace.org/index.php?option=com_content&task=view&id=149#data_model)
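As a rough illustration of the persistent ID format quoted above, the following Python sketch builds and parses such URLs. The function names are hypothetical, and the parser assumes the handle always has exactly two parts (prefix/suffix), as in the quoted example.

```python
from urllib.parse import quote, unquote, urlsplit


def bitstream_url(base_url: str, handle: str, sequence_id: int, filename: str) -> str:
    """Build a URL of the form: dspace url/bitstream/handle/sequence ID/filename."""
    return f"{base_url.rstrip('/')}/bitstream/{handle}/{sequence_id}/{quote(filename)}"


def parse_bitstream_url(url: str):
    """Split a persistent bitstream URL into (handle, sequence ID, filename)."""
    segments = urlsplit(url).path.split("/")
    start = segments.index("bitstream") + 1
    # Assumption: the handle is exactly "prefix/suffix".
    prefix, suffix, sequence_id, *name = segments[start:]
    return f"{prefix}/{suffix}", int(sequence_id), unquote("/".join(name))


print(parse_bitstream_url("https://dspace.myu.edu/bitstream/123.456/789/24/foo.html"))
# prints ('123.456/789', 24, 'foo.html')
```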

While these two systems provide unique identifiers within a single DSpace installation, neither addresses the problem of globally unique per-bitstream identifiers. DSpace GUID support is limited to the per-item scope. ACE is able to function with either identifier.

Registration

Database scanning - All versions

DSpace tracks all bitstreams under its control through the bitstream table. This table includes fields containing the DSpace identifier that specifies the physical storage location, the database identifier, the checksum, the checksum type, and the current state of the item (deleted or present). Since the deleted field is the only field ever updated, we can use the table as a queue for items that have not yet been seen by ACE. The periodic ACE registration process will gather items from the bitstream table, add them to its own queue, and store the last processed bitstream table ID. ACE keeps its own queue rather than using the bitstream table directly so that it can properly handle IMS failures and registration retries.
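A minimal sketch of this scan, using an in-memory SQLite database as a stand-in for the PostgreSQL or Oracle database behind DSpace. The column names (bitstream_id, internal_id, checksum, checksum_algorithm, deleted) are assumptions for illustration and should be checked against the actual schema; the function name and sample rows are hypothetical.

```python
import sqlite3


def fetch_new_bitstreams(conn, last_seen_id, batch_size=100):
    """Return rows added after last_seen_id, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT bitstream_id, internal_id, checksum, checksum_algorithm "
        "FROM bitstream WHERE bitstream_id > ? AND deleted = 0 "
        "ORDER BY bitstream_id LIMIT ?",
        (last_seen_id, batch_size),
    ).fetchall()
    # Only advance the mark when something was actually read.
    new_last_id = rows[-1][0] if rows else last_seen_id
    return rows, new_last_id


# Stand-in table with made-up rows; row 2 is flagged as deleted.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bitstream (bitstream_id INTEGER PRIMARY KEY, internal_id TEXT, "
    "checksum TEXT, checksum_algorithm TEXT, deleted INTEGER)"
)
conn.executemany(
    "INSERT INTO bitstream VALUES (?, ?, ?, ?, ?)",
    [(1, "111111", "aa", "MD5", 0), (2, "222222", "bb", "MD5", 1), (3, "333333", "cc", "MD5", 0)],
)

batch, last_id = fetch_new_bitstreams(conn, last_seen_id=0)
print([row[0] for row in batch], last_id)
```

The caller would copy each returned row into the ACE queue and persist the returned ID only after that copy succeeds, so a failed run is retried from the same point.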


Media Filter Plugin - 1.4.2

DSpace 1.4.2 provides functionality for executing media filters against items, collections, or entire communities. Media filters are manually executed processes that were designed to allow administrators to automatically generate previews for images or perform format conversions. Normally, a media filter specifies a list of format types that it handles and writes a new bitstream to DSpace with the converted data.

To support ACE, the filter will need to handle all formats and not write a bitstream to DSpace. Creating a media filter that does not write to DSpace is trivial: the MediaFilter abstract class that all filters extend has a processBitstream method that can be overridden to prevent writing new bitstreams. Media filters are configured to handle specific file formats in the dspace.cfg configuration file. Currently, there is no wildcard support for specifying all formats in DSpace. This means that as new formats are added to the Bitstream Format Registry, the ACE media filter entry in the dspace.cfg file will need to be updated to include each new format. This may become unwieldy for DSpace installations with large numbers of formats or frequently changing format types.

Since DSpace expects all filters to write a new bitstream to an object, it uses the existence of these new bitstreams to determine whether a filter has been run or not. As ACE will not be writing bitstreams into DSpace, this method of determining whether to execute the filter will not work. The media filter will need an external table to track which bitstreams have been registered.
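One possible shape for such an external tracking table is sketched below in Python with SQLite. The table name ace_registered and the helper functions are hypothetical; a real filter would live inside DSpace and could keep the table in any database it can reach.

```python
import sqlite3


def open_tracker(path=":memory:"):
    """Open (and create if needed) the table of already-registered bitstreams."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ace_registered (bitstream_id INTEGER PRIMARY KEY)"
    )
    return conn


def needs_registration(conn, bitstream_id):
    """True if the filter has not yet registered this bitstream."""
    row = conn.execute(
        "SELECT 1 FROM ace_registered WHERE bitstream_id = ?", (bitstream_id,)
    ).fetchone()
    return row is None


def mark_registered(conn, bitstream_id):
    """Record a successful registration; repeated calls are harmless."""
    conn.execute(
        "INSERT OR IGNORE INTO ace_registered (bitstream_id) VALUES (?)", (bitstream_id,)
    )
    conn.commit()


tracker = open_tracker()
if needs_registration(tracker, 42):
    # ... register bitstream 42 with ACE here ...
    mark_registered(tracker, 42)
print(needs_registration(tracker, 42))
```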

Event System - 1.5.0alpha1

The development version of DSpace has implemented an event system that can be used to synchronously or asynchronously notify event consumers that certain actions have occurred. The event system will allow ACE integration that more closely follows the in-line queuing of items that Portico's workflow uses.

To use the synchronous event system, an implementation of the Consumer interface needs to be provided. The ACE Consumer implementation is specified in the dspace.cfg file and set to handle Create events on Bitstream objects. The ACE Consumer is provided with a DSpace Event object that can be used to retrieve the checksum information for the new bitstream. The consumer will use this event to load the checksum information and the bitstream identifier. This information will be added to the ACE Audit Manager work queue.
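The consumer logic can be outlined as follows. This is a language-neutral sketch in Python, not the DSpace Consumer interface: the Event and AceConsumer classes and their fields are hypothetical, and the event here carries the checksum directly, whereas a real consumer would look it up through the DSpace Event object.

```python
from dataclasses import dataclass
from queue import Queue


@dataclass(frozen=True)
class Event:
    """Hypothetical stand-in for a DSpace event."""
    subject_type: str        # e.g. "Bitstream"
    event_type: str          # e.g. "Create"
    bitstream_id: int
    checksum: str
    checksum_algorithm: str


class AceConsumer:
    """Queue newly created bitstreams for registration; ignore everything else."""

    def __init__(self, work_queue: Queue):
        self.work_queue = work_queue

    def consume(self, event: Event) -> bool:
        if (event.subject_type, event.event_type) != ("Bitstream", "Create"):
            return False
        # Hand the identifier and checksum information to the Audit Manager queue.
        self.work_queue.put(
            (event.bitstream_id, event.checksum_algorithm, event.checksum)
        )
        return True


queue = Queue()
consumer = AceConsumer(queue)
consumer.consume(Event("Bitstream", "Create", 17, "d41d8cd9", "MD5"))
consumer.consume(Event("Item", "Modify", 3, "", ""))
print(queue.qsize())
```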

The asynchronous notification uses JMS, specifically ActiveMQ, to reliably dispatch messages. Asynchronous support is currently experimental in DSpace and will be supported at a later date. The planned ACE registration workflow would not benefit from asynchronous notification, as it is already designed to run out of band from the process issuing registration requests.

Auditing

There are two mechanisms for accessing data in DSpace to audit. First is direct access to data where absolute paths are stored in the audit manager. The second is to use DSpace libraries to access data.

Direct auditing of items in DSpace is possible with the audit manager. The audit manager would need to be aware of every type of storage that DSpace supports, currently SRB and the local filesystem. The AM would need to store the base path to each storage asset, connection information, and the 38-digit identifier of all bitstreams in that asset. Alternatively, the AM could store an absolute path to each file, but that may present problems if storage assets are moved in the future. The advantage of this approach is that it has no dependency on the availability of DSpace; assuming networked filesystems or the SRB, auditing could also run on separate resources.
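For the local filesystem case, a direct audit amounts to resolving the file from the stored base path and identifier, recomputing its digest, and comparing it with the recorded checksum. The sketch below assumes the directory layout described under File Identifiers; the function names and sample data are hypothetical, SRB access is not covered, and the digest algorithm is passed in because it depends on the checksum type recorded for the bitstream.

```python
import hashlib
import os
import tempfile


def file_digest(path, algorithm, chunk_size=1 << 20):
    """Hash a file in chunks so large bitstreams need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_bitstream(base_path, internal_id, expected_hex, algorithm):
    """Return True if the stored file still matches its recorded checksum."""
    path = os.path.join(
        base_path, internal_id[0:2], internal_id[2:4], internal_id[4:6], internal_id
    )
    return file_digest(path, algorithm).lower() == expected_hex.lower()


# Demonstration against a throwaway storage pool with one made-up bitstream.
with tempfile.TemporaryDirectory() as pool:
    sample_id = "123456789"
    os.makedirs(os.path.join(pool, "12", "34", "56"))
    with open(os.path.join(pool, "12", "34", "56", sample_id), "wb") as out:
        out.write(b"example bitstream contents")
    recorded = hashlib.sha256(b"example bitstream contents").hexdigest()
    print(audit_bitstream(pool, sample_id, recorded, "sha256"))
```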

DSpace also presents an API that allows direct access to the underlying bitstreams in its storage. The DSpace API uses the current DSpace configuration to connect to the PostgreSQL or Oracle database supporting DSpace. The MediaFilterManager provides an example of how to use the DSpace configuration to retrieve items and bitstreams from DSpace. ACE needs only to store the 38-digit identifier for the object. It will not need to store information regarding the storage asset of an item, as the DSpace API provides bit-level access.

Token Storage

Token storage can follow the Portico model and store tokens in the Audit Manager database, store tokens natively in DSpace, or both. Storing tokens only in DSpace would introduce an additional dependency on the ACE audit manager, and require the audit manager to reside on the same host as the DSpace installation.

The easiest way to add tokens to DSpace would be through the use of a media filter. A media filter would be written that would query the ACE audit manager for the registered token of any bitstream. This is different from the filter above in that bitstreams would need to be registered with the IMS before token metadata is available. Due to the delayed token issuing, plugging tokens into DSpace items is not a reliable way to determine which items have been registered with ACE. The filter would store the token as metadata in a METS item. Any future ACE registration process would have to be aware of the token file type and not register it for auditing.

Recommended Setup

  • Bitstream registration will occur using table scanning for DSpace versions 1.4.2 and earlier and an ACE event consumer for DSpace 1.5.0 and above.
  • Auditing will occur through the DSpace libraries so that configuration changes to DSpace storage will not require a change in the Audit Manager.
  • An optional media filter should be developed to insert tokens from registered items in the Audit Manager as metadata in DSpace. The media filter will have no effect on the operation of the Audit Manager.

-- Main.MikeSmorul - 25 Oct 2007