Pawn:Processing

Development Notes for v.7

Receiving Server Processing

PAWN will allow pluggable modules to be installed on receiving servers, where they execute against items in a package. These processes may be invoked either at ingestion time or manually by a user, and they may change or annotate items in a package. Permissions governing which actions a process may take should be controlled on a per-instance basis. The older archival gateway infrastructure for module loading and configuration storage will serve as the basis for the new server processing.

Beyond individual processes, several processes may be grouped together to define a workflow.

Process components

There are four components involved in executing a process. These are briefly outlined here.

  • Driver file(s) - Set of jars required to invoke the process. A factory method in the driver is called to create instances.
    • Contains GUIs for both configuring the driver and invoking the driver.
    • Contains the process that will be invoked.
  • Global Configuration - Global configuration for each instance of a driver.
    • Split into two parts: configuration for the GUI (client) and configuration for the process (receiving server).
  • Invoking Configuration - Configuration supplied from the client at time of invocation.
  • Affected Items - List of items that the process is to be executed on.

The driver files and global configuration are stored on the scheduler and retrieved at startup/reload by each receiver. The invoking configuration and affected items may be supplied manually by a client, or created automatically by the manager and receiving server, depending on how the process is invoked.
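The four components can be pictured as plain Java types. This is only a minimal sketch: `ProcessDriver`, `PawnProcess`, `GlobalConfiguration`, and `ProcessRequest` are hypothetical names chosen for illustration and are not taken from the PAWN code base.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical types for illustration; none of these names come from PAWN.

/** Global configuration for one driver instance, split into its two parts. */
final class GlobalConfiguration {
    final Map<String, String> guiSettings;     // used by the client-side GUI
    final Map<String, String> processSettings; // used on the receiving server

    GlobalConfiguration(Map<String, String> guiSettings,
                        Map<String, String> processSettings) {
        // Copy so later changes by the caller do not leak into the instance.
        this.guiSettings = Collections.unmodifiableMap(new HashMap<>(guiSettings));
        this.processSettings = Collections.unmodifiableMap(new HashMap<>(processSettings));
    }
}

/** Invoking configuration plus affected items for a single invocation. */
final class ProcessRequest {
    final Map<String, String> invokingConfiguration;
    final List<String> affectedItems;

    ProcessRequest(Map<String, String> invokingConfiguration,
                   List<String> affectedItems) {
        this.invokingConfiguration =
                Collections.unmodifiableMap(new HashMap<>(invokingConfiguration));
        this.affectedItems =
                Collections.unmodifiableList(new ArrayList<>(affectedItems));
    }
}

/** The process that a driver creates and the receiving server runs. */
interface PawnProcess {
    void execute(ProcessRequest request);
}

/** Factory entry point that would be packaged in the driver jars. */
interface ProcessDriver {
    PawnProcess createInstance(GlobalConfiguration config);
}
```

Keeping the global configuration separate from the per-invocation request mirrors the split described above: the former is fetched from the scheduler once, the latter arrives with each call.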

Process instance lifecycle

The lifecycle for configuring a process instance in PAWN is shown below. A process instance is a configuration for a process driver that is associated with a domain. For example, PAWN may contain an FTP driver to ingest files from an FTP server. An instance of the FTP driver may be tied to the University of Maryland domain and configured to talk to ftp.umiacs.umd.edu with the username bob and password fred. For each execution of the process, an end user may need to run a GUI to select specific directories to crawl using the global server/user information.

Steps 1-4 below configure an instance of a driver, while steps 5-8 are repeated for each invocation of that driver.

  1. Upload driver files to the PAWN scheduler.
  2. Admin downloads driver files from the scheduler.
  3. Admin invokes the configuration GUI to create an instance of the process. (optional)
  4. Instance configuration (GUI and receiving server) is stored on the PAWN scheduler.
  5. Client selects the files that the process will be invoked on.
  6. Client downloads the GUI configuration and driver files.
  7. Client runs the execution GUI to finish configuration. (optional)
  8. Client-side configuration and the list of files to process are sent to the receiving server for invocation.

Process invoke workflow

The above workflow implies manual execution of client configuration, file selection, and receiving server invocation. Those three steps may instead be performed automatically during the ingest of new items: the list of files to process corresponds to the ingested files, and the client configuration is pre-configured by a manager and included in the ingest. This leads to the three possible invocations of a process instance.

Manual Invocation
The client calls the receiving server and invokes a set of processes. The processes passed by the client must be listed in the client's SAML assertion. The client supplies the invoking configuration and affected items in the call. This overrides any configuration information in the SAML assertion.
Automatic immediate invocation
Before the ingest call returns, processes are automatically invoked on the data. The process group configuration in the SAML assertion determines which processes will be invoked. The invoking configuration for each process is pulled from the client's SAML assertion, and the affected items are the items included in the current call. As these processes run in-line with the ingest call, care should be taken with regard to execution time.
Automatic batched invocation
This is similar to immediate invocation, except that the invocation occurs out of band, after all ingest calls have returned, and the list of items is the union of all items ingested. Longer-running processes should be executed here.

A process instance may be configured to run in all of the above contexts or in any subset of them.
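One way to model the three contexts and the per-instance restriction is an enum plus a set of permitted contexts. The names `InvocationContext` and `ProcessInstance` below are assumptions made for this sketch, not identifiers from PAWN.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical names; the three values mirror the contexts described above.
enum InvocationContext { MANUAL, IMMEDIATE, BATCHED }

/** A configured process instance and the contexts it may run in. */
final class ProcessInstance {
    final String name;
    private final Set<InvocationContext> allowedContexts;

    ProcessInstance(String name, Set<InvocationContext> allowedContexts) {
        this.name = name;
        // Build the copy via noneOf + addAll so an empty input set is accepted.
        EnumSet<InvocationContext> copy = EnumSet.noneOf(InvocationContext.class);
        copy.addAll(allowedContexts);
        this.allowedContexts = copy;
    }

    /** True if this instance is configured to run in the given context. */
    boolean mayRunIn(InvocationContext context) {
        return allowedContexts.contains(context);
    }
}
```

For example, a quick virus check might be created with `EnumSet.of(InvocationContext.IMMEDIATE, InvocationContext.MANUAL)`, while a slow format migration would be limited to `BATCHED`.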

Invocation

  1. For each process group, a process engine is created.
  2. The process engine is loaded with each process and its parameters.
  3. The first process is loaded with the initial affected item list.
  4. The engine is executed: each driver is executed against the specified files, and events are logged.
  5. If any files are added or removed, a new affected item list is created and supplied to the subsequent engine.
  6. Processing returns to step 3 if there is a subsequent engine.
  7. Logs are committed and the reservation is unlocked.
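The steps above can be read as a simple loop in which each stage receives the item list left behind by the previous one. The sketch below is one such reading, not the actual engine: `ProcessStep` and `ProcessEngine` are hypothetical names, items are represented as plain strings, and reservation locking is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical stage: receives the current item list, returns the resulting list. */
interface ProcessStep {
    String label();
    List<String> run(List<String> items);

    /** Convenience factory so a stage can be written as a lambda. */
    static ProcessStep of(final String label, final UnaryOperator<List<String>> body) {
        return new ProcessStep() {
            public String label() { return label; }
            public List<String> run(List<String> items) { return body.apply(items); }
        };
    }
}

/** Hypothetical engine that chains stages and records one log entry per stage. */
final class ProcessEngine {
    private final List<ProcessStep> steps = new ArrayList<>();
    private final List<String> log = new ArrayList<>();

    ProcessEngine add(ProcessStep step) {
        steps.add(step);
        return this;
    }

    /** Runs every stage in order; each stage sees the list produced by the previous one. */
    List<String> execute(List<String> initialItems) {
        List<String> items = new ArrayList<>(initialItems);
        for (ProcessStep step : steps) {
            // Hand the stage a copy so it cannot mutate the engine's own list.
            List<String> result = new ArrayList<>(step.run(new ArrayList<>(items)));
            log.add(step.label() + ": " + items.size() + " in, " + result.size() + " out");
            items = result; // new affected item list for the next stage
        }
        return items;
    }

    /** Snapshot of the log entries recorded so far. */
    List<String> committedLog() {
        return new ArrayList<>(log);
    }
}
```

A stage that removes a file simply returns a shorter list, and the next stage never sees the removed item.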

Details for Manual Invocation

  1. The invokeProcesses() webservice is called, specifying a list of processes to run, parameters, and affected items.
  2. The SAML assertion is consulted to determine whether the user is permitted to invoke the processes.
  3. The client is returned a handle to monitor the state of execution.
  4. Normal invocation proceeds, using the affected items and configuration from the invokeProcesses call.
  5. The client may call monitorProcess with the handle returned above to watch actions.
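The handle-based flow can be sketched as follows. The source names the invokeProcesses and monitorProcess calls but does not give their signatures, so the parameter lists, the `State` values, and the `ManualInvoker` class here are all assumptions made for illustration. The permitted process names stand in for whatever the SAML assertion would carry.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical, single-threaded stand-in for the receiving server's manual entry points. */
final class ManualInvoker {
    enum State { QUEUED, DONE }

    private final Map<String, State> states = new HashMap<>();
    private int nextHandle = 1;

    /**
     * Checks the requested processes against those permitted by the assertion,
     * then returns a handle the client can use to monitor execution.
     */
    String invokeProcesses(Set<String> permittedByAssertion,
                           List<String> requestedProcesses,
                           List<String> affectedItems) {
        if (!permittedByAssertion.containsAll(requestedProcesses)) {
            throw new SecurityException("process not listed in the client's assertion");
        }
        // A real server would hand affectedItems to a process engine here.
        String handle = "handle-" + nextHandle++;
        states.put(handle, State.QUEUED);
        return handle;
    }

    /** Returns the current state for a handle issued by invokeProcesses. */
    State monitorProcess(String handle) {
        State state = states.get(handle);
        if (state == null) {
            throw new IllegalArgumentException("unknown handle: " + handle);
        }
        return state;
    }

    /** Called when execution finishes (normally by the engine, not the client). */
    void markDone(String handle) {
        monitorProcess(handle); // validates the handle
        states.put(handle, State.DONE);
    }
}
```

Returning the handle before the work runs lets the client poll for progress instead of blocking on a long call.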

Details for Immediate Invocation

  1. The client calls ingest with a set of objects.
  2. Normal invocation proceeds, using the affected items from the call and the configuration from the SAML assertion.
  3. Items marked for deletion are removed, and the id returned to the client for each such item is -1.
  4. The call returns the list of dbIds to the client.
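The -1 convention for removed items can be illustrated with a small helper. This is a sketch under assumptions: `IngestResult` and `toReturnedIds` are hypothetical names, and items are identified by plain strings rather than PAWN's actual item type.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical helper that builds the id list returned from an ingest call. */
final class IngestResult {
    static final long REMOVED = -1L;

    /**
     * Maps each ingested item to its database id, or to -1 when a process
     * marked the item for deletion during immediate invocation.
     */
    static List<Long> toReturnedIds(List<String> items,
                                    List<Long> dbIds,
                                    Set<String> markedForDeletion) {
        if (items.size() != dbIds.size()) {
            throw new IllegalArgumentException("items and dbIds must have the same length");
        }
        List<Long> returned = new ArrayList<>(items.size());
        for (int i = 0; i < items.size(); i++) {
            boolean removed = markedForDeletion.contains(items.get(i));
            returned.add(removed ? REMOVED : dbIds.get(i));
        }
        return returned;
    }
}
```

Keeping the returned list the same length as the submitted list lets the client match ids to objects by position.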

Details for Delayed (Batched) Invocation

Identical to manual invocation, except that invokeProcesses is run after all items have been pushed, and a client must query the receiving server to obtain a handle for monitoring the processes.


Limiting processes

In addition to limiting when a process may be executed (manual, immediate, or batched), the actions that a process can perform are limited as well. A set of limitations similar to the PAWN role system is specified for each process instance in a chain. The following actions may be granted to or revoked from processes.

LOCK
A process may set an item as archived and supply a final destination for the item.
ANNOTATE
A process may add descriptive metadata to a file or directory.
APPROVE
A process is assigned a user that it will act on behalf of to approve items
REJECT
A process is assigned a user that it will act on behalf of to reject items
REMOVE
A process may delete an item from a package (limited to the supplied list of items).

A PAWN process is able to ask its environment which actions it is allowed to perform against a certain file.

As with roles, nonsensical configurations are possible. For example, a process that virus-checks items may be granted the ability to archive items, even though it never will.
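A process asking its environment what it may do to a file could look like the sketch below. The enum values are the five actions listed above. `ProcessAction`, `ProcessEnvironment`, and their methods are hypothetical names, and treating every action as limited to the supplied item list is an assumption of this sketch (the text states that limit only for REMOVE).

```java
import java.util.EnumSet;
import java.util.HashSet;
import java.util.Set;

// The five actions listed above; the type name itself is hypothetical.
enum ProcessAction { LOCK, ANNOTATE, APPROVE, REJECT, REMOVE }

/** Hypothetical environment handed to a running process. */
final class ProcessEnvironment {
    private final EnumSet<ProcessAction> granted;
    private final Set<String> affectedItems;

    ProcessEnvironment(Set<ProcessAction> granted, Set<String> affectedItems) {
        // noneOf + addAll accepts an empty grant set without special handling.
        this.granted = EnumSet.noneOf(ProcessAction.class);
        this.granted.addAll(granted);
        this.affectedItems = new HashSet<>(affectedItems);
    }

    /** Actions the process may perform on the given item; empty if the item is out of scope. */
    Set<ProcessAction> allowedActions(String item) {
        if (!affectedItems.contains(item)) {
            return EnumSet.noneOf(ProcessAction.class);
        }
        return EnumSet.copyOf(granted); // granted is an EnumSet, so an empty copy is fine
    }

    /** Convenience check for a single action on a single item. */
    boolean isAllowed(ProcessAction action, String item) {
        return allowedActions(item).contains(action);
    }
}
```

A virus checker granted only REMOVE would see `isAllowed(ProcessAction.LOCK, item)` return false, so it could not archive an item even if its code tried to.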

Use Cases

Malicious Data Prevention

Some data suppliers may be known to supply malicious or blatantly incorrect data. A single process group with several small processes can be set up to ensure incoming data is not malicious, and to do a quick smell test on items to make sure they meet broad criteria for packages. Examples of processes in the group include metadata verification (size and MIME type limits) and virus checking.

  • Execution Context: immediate
  • Permissions: REMOVE

Remote Harvest

Clients wish to archive videos hosted on a remote web server. The videos are too large to download locally, so receiving servers must fetch the items and add them to a package as appropriate. A call to ingestObjects contains a list of remote objects and optionally supplies a checksum for each. Once the call has finished, a process group may be set up to validate the pulled content.

  • Execution Context: Manual
  • Permissions: CREATE

Automatic Archiving

A contractor is submitting data that is easily verifiable, and submissions occur frequently enough that items should be archived as soon as automated tests have completed. At least two process groups are required. The first is a validation group that marks items in a package as accepted or rejected. The second is an archive group that processes the package if the first group did not encounter any invalid files. A third process may be added to remove the package if the archive was successful.

  • Execution Context: batched
  • Permissions: LOCK, REJECT for validators

-- Main.MikeSmorul - 03 Oct 2007