Pawn:Overview of Concepts
Introduction
PAWN is designed to facilitate archival ingestion from a variety of distributed sources. It provides common infrastructure for assembling and organizing data from producers and publishing it into an archive. Because PAWN provides common ground for both producers and an archive, responsibility for everything from account creation to approval of data can be configured in a variety of ways to support centralized or decentralized approaches.
In PAWN, producing sites are grouped into administrative domains. A domain can be as broad as a whole government agency or a college on a campus, or as narrow as a group of offices or labs. The domain groups users into related administrative units. As a rule of thumb, domains and IT organizations should be considered roughly equivalent. Each domain has its own set of users, managers, types of records it ingests, and data suppliers.
Each domain contains a record schedule listing all types of records that domain will produce. The record schedule is a hierarchy that allows you to organize record types. Any data that is to be archived in a domain will map to some item on the record schedule.
End users shouldn't be expected to determine what type of records they are producing and how those records map into the record schedule, so managers can create shortcuts for them called record sets. A record set is a convenient grouping of items from a record schedule, together with the users who are allowed to fill it. Record sets can be thought of as package templates. Each record set is attached to some point in the organizational structure of a domain. For example, a business office would likely have several accounts, one for each employee, and those employees would be presented with a list of record sets containing all the financial items from the record schedule.
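The relationship between a record schedule and a record set can be sketched as a small data model. This is only an illustration of the concepts above, not PAWN's actual implementation: the class names (`ScheduleItem`, `RecordSet`), the helper `record_sets_for`, and the sample record types and users are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ScheduleItem:
    """One node in a domain's record schedule (hypothetical model)."""
    name: str
    children: List["ScheduleItem"] = field(default_factory=list)

    def find(self, name):
        """Depth-first search for a record type by name; None if absent."""
        if self.name == name:
            return self
        for child in self.children:
            hit = child.find(name)
            if hit is not None:
                return hit
        return None


@dataclass
class RecordSet:
    """A package template: chosen schedule items plus the users who may fill it."""
    name: str
    attached_to: str            # point in the domain's organizational structure
    items: List[ScheduleItem]
    users: Set[str]


def record_sets_for(user, record_sets):
    """Names of the record sets a given end user would be offered."""
    return [rs.name for rs in record_sets if user in rs.users]


# Sample data, invented for illustration only.
schedule = ScheduleItem("All records", [
    ScheduleItem("Financial", [ScheduleItem("Invoices"), ScheduleItem("Payroll")]),
    ScheduleItem("Research", [ScheduleItem("Lab notebooks")]),
])

financial = RecordSet(
    name="Business office financials",
    attached_to="Business office",
    items=[schedule.find("Invoices"), schedule.find("Payroll")],
    users={"alice", "bob"},
)
```

In this sketch an employee of the business office never browses the full schedule; calling `record_sets_for("alice", [financial])` yields only the record sets a manager has attached to that account.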
Setup / Workflow Overview
Setup Steps
- Install the scheduler and receiving servers.
- Install the management server.
- Introduce the management server's certificate to the scheduler so that clients from that server will be trusted.
Archive Workflow
- Create a domain on the management server and manager accounts in that domain.
- Create a record schedule for the domain describing what types of records that domain produces.
- Create client accounts for end users.
- Create an administrative structure reflecting how your domain is organized (e.g., offices, labs, projects).
- For each group that will be producing records, create a record set from items in the record schedule and attach accounts to it.
- Clients can connect, choose a record set to work with, and start loading data into a package.
- A manager at the archive or producer site can view, modify, or approve the package. In addition, any number of tests may be run on the data in the package.
- After the package is approved, it is transferred in whole or in part into a long-term archive.
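The last three workflow steps (load, approve, transfer) can be read as a small package lifecycle. The sketch below is an assumption-laden illustration of that ordering, not PAWN code: the `Package` class, its state names, and the `not_empty` check are hypothetical, and the sketch marks a package as transferred even when only part of it is sent.

```python
from enum import Enum


class PackageState(Enum):
    LOADING = "loading"
    APPROVED = "approved"
    TRANSFERRED = "transferred"


class Package:
    """Hypothetical package that enforces load -> approve -> transfer."""

    def __init__(self, record_set, owner):
        self.record_set = record_set
        self.owner = owner
        self.state = PackageState.LOADING
        self.approved_by = None
        self.files = {}

    def load(self, path, data):
        """A client adds data while the package is still open."""
        if self.state is not PackageState.LOADING:
            raise RuntimeError("package is no longer accepting data")
        self.files[path] = data

    def approve(self, manager, checks=()):
        """A manager approves the package after any number of checks pass."""
        failures = [check.__name__ for check in checks if not check(self)]
        if failures:
            raise ValueError("checks failed: %s" % failures)
        self.state = PackageState.APPROVED
        self.approved_by = manager

    def transfer(self, paths=None):
        """Return the files sent to the archive: all of them, or a chosen part."""
        if self.state is not PackageState.APPROVED:
            raise RuntimeError("package must be approved before transfer")
        selected = self.files if paths is None else {p: self.files[p] for p in paths}
        self.state = PackageState.TRANSFERRED
        return dict(selected)


def not_empty(package):
    """Example check: the package must contain at least one file."""
    return bool(package.files)
```

A transfer attempted before approval raises `RuntimeError`, which mirrors the rule that only approved packages move into the long-term archive.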
Component Overview
There are several physical pieces to PAWN. These are the client, management server, scheduler, and receiving server.
- Client
- The component used to ingest data, manage users and record organization, and to trigger transfer into an archive.
- Management server
- At least one management server is required in PAWN. This server tracks accounts, record schedules, record sets, package lists, and provides security for multiple domains.
- Scheduler
- When a client is ready to start loading data, the scheduler allocates space on a receiving server for the transfer. It also controls security and configuration for all receiving servers.
- Receiving server
- Receives data from clients into a package, allows modification of data depending on user credentials, and transfers data to a backend archive at the direction of an approved user.
Package Overview
(or what the client really sends)
Creating a package in PAWN is a two-step process. First, a client sends the directory structure of a package to the PAWN receiving server. This is the equivalent of sending multiple mkdir commands in one call. Second, the client sends data to the receiving server. This data is attached as metadata or data to existing directories or items in PAWN.
In the past, PAWN used XML documents to represent the directory structure of a package, but this involved too much overhead for validation and document processing.
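The two-step process can be sketched with an in-memory stand-in for a package on a receiving server. Everything here is hypothetical (the `ReceivingPackage` class and its method names are not PAWN's API), and the sketch assumes that creating a nested directory also creates its parents.

```python
class ReceivingPackage:
    """In-memory stand-in for a package on a receiving server (illustrative)."""

    def __init__(self):
        self.dirs = {""}        # "" is the package root, which always exists
        self.data = {}          # item path -> bytes
        self.metadata = {}      # directory or item path -> dict of fields

    def create_structure(self, paths):
        """Step 1: create many directories in one call, like a batch of mkdirs."""
        for path in paths:
            parts = path.strip("/").split("/")
            # Assumption: parents are created along with each nested directory.
            for depth in range(1, len(parts) + 1):
                self.dirs.add("/".join(parts[:depth]))

    def put_data(self, path, payload):
        """Step 2: attach data to an item inside an existing directory."""
        parent = path.rpartition("/")[0]
        if parent not in self.dirs:
            raise KeyError("no such directory: %r" % parent)
        self.data[path] = payload

    def put_metadata(self, path, **fields):
        """Step 2: attach metadata to an existing directory or item."""
        if path not in self.dirs and path not in self.data:
            raise KeyError("no such directory or item: %r" % path)
        self.metadata.setdefault(path, {}).update(fields)
```

Usage follows the same order as the text: one `create_structure(["reports/2024", "images"])` call, then `put_data` and `put_metadata` calls against paths that now exist. Data sent to a directory that was never created is rejected.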
Security Overview
Because a centralized authentication and authorization server (e.g., LDAP or an MS domain controller) is not feasible in a large environment, there are two distinct zones of security in PAWN. The management server and clients form the first; the scheduler and receiving servers form the second. The management server authenticates clients, managers, and administrators for each domain it houses. The scheduler authenticates receiving servers and system administrators at the archive. Since the scheduler, and by extension the receiving servers, do not know about individual accounts at the management server, a trust must be established between the management server and the scheduler. This trust consists of registering a namespace for the management server (similar to Kerberos) and a certificate the management server will use to authorize clients.
Each incoming request from a client therefore carries the namespace of its management server and a signature that serves as a voucher from that management server. The scheduler or receiving server checks the integrity of the signature and can then trust the information in the client's voucher. The voucher states what the client is allowed to access on the scheduler or receiving server. Even if a management server is compromised, malicious clients can still access only data ingested through that management server and cannot affect clients of other management servers.
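The voucher check can be sketched as follows. Note the substitution: PAWN registers a certificate for each management server, whereas this sketch uses an HMAC over a per-namespace shared key, because Python's standard library has no public-key signing. The function and class names, the voucher layout, and the permission strings are all hypothetical.

```python
import hashlib
import hmac
import json


def issue_voucher(namespace, key, user, permissions):
    """Management server side: sign a statement of what the client may access."""
    claims = {"namespace": namespace, "user": user, "permissions": sorted(permissions)}
    body = json.dumps(claims, sort_keys=True)
    signature = hmac.new(key, body.encode(), hashlib.sha256).hexdigest()
    return {"body": body, "signature": signature}


class Scheduler:
    """Scheduler side: knows namespaces and their keys, not individual accounts."""

    def __init__(self):
        self.trusted = {}       # namespace -> key registered for that server

    def register(self, namespace, key):
        """Establish trust with one management server."""
        self.trusted[namespace] = key

    def verify(self, voucher):
        """Return the voucher's claims if its signature is valid, else None."""
        try:
            claims = json.loads(voucher["body"])
            key = self.trusted[claims["namespace"]]
        except (KeyError, ValueError, TypeError):
            return None
        expected = hmac.new(key, voucher["body"].encode(), hashlib.sha256).hexdigest()
        # Constant-time comparison avoids leaking how much of the signature matched.
        if not hmac.compare_digest(expected, voucher["signature"]):
            return None
        return claims
```

Because the key is looked up by the namespace named in the voucher, a management server that holds only its own key cannot produce a voucher that verifies under another server's namespace. That is the isolation property described above, under the stated HMAC assumption.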