Rapid development in the software industry has given birth to an enormous number of file formats, and deciding which format a given file belongs to is no longer a trivial matter. The extension of a file, such as '.doc', '.pdf' or '.xls' on Windows, can give a useful hint about its actual format, but one may also be faced with lesser-known extensions like '.skl' or '.pdz', or with no extension at all.
The file extension can also be very misleading. A familiar extension does not necessarily mean that the file is of the familiar format. For example, a file with the '.pdf' extension can be any of the following: Adobe Acrobat Portable Document Format, Analyser Protocol Definition, ESRI ArcView Preferences Definition File, Netware Printer Definition File, P-CAD Database Interchange Format, Microsoft Package Definition File, or Corel Ventura Publisher EPS-variation Page. In such a case, having '.pdf' as an extension could be worse than having no extension at all. The only perfect way to be certain of a file's format would be to open the file with every possible application in the world, which is of course impractical.
This framework for file format identification therefore intends to provide an easier way to identify a file format. Although it might not be 100% precise, it aims to be as accurate as possible while returning the final result in as little time as possible.
How Are Formats Identified?
There are three methods to identify a file format:
- By looking at its extension, such as '.pdf' or '.doc', if any.
- By looking at its magic numbers inside, such as '%PDF-1.', if any.
- By parsing the file according to the format specification.
Obviously, the first method is the fastest, but it can be very misleading in some cases. The second method gives a fairly fast and accurate result, as the probability of an accidental match should be negligible. However, in the many cases where the format specification does not define anything like a magic number, the only option left, beyond the first method, is to parse the entire file to see whether it is well-formed according to the format's rules. This third method is the most accurate but also the most expensive.
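To illustrate why the second method is cheap, here is a minimal sketch of a magic-number check in Java. It is not Fider's actual code: the class and method names are hypothetical, and it assumes the magic number sits at the very start of the file, which holds for PDF 1.x but not for every format.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of Fider's actual API. */
final class MagicNumberCheck {

    /** ASCII bytes that open a PDF 1.x file. */
    static final byte[] PDF_MAGIC = "%PDF-1.".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given bytes begin with the given magic number. */
    static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false; // too short to contain the magic number
        }
        return Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }

    /** Reads only as many leading bytes as the magic number needs. */
    static boolean fileStartsWith(Path file, byte[] magic) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = in.readNBytes(magic.length); // requires Java 11 or later
            return startsWith(head, magic);
        }
    }
}
```

Only a handful of bytes are read regardless of file size, which is what makes this method so much faster than parsing the whole file.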
This file identification framework combines all three methods. It uses extensions and magic numbers to make an initial guess at the file format and, if necessary or desired, provides a way to verify the guess by parsing the entire file. Moreover, it also provides a way to convert a file from one format to another, if desired.
To achieve this, the framework consists of two main software components and many auxiliary ones. The two main components are the Format Identifier (Fider) and the Global Digital Format Registry (GDFR); the auxiliary components are verifiers (parsers) and/or converters for each format.
Fider is a Java library (or package) and is based on the second format identification method. Therefore, it only looks for conspicuous clues to make a decision. Fider is designed to be scalable: support for more file formats can be added very easily, minimizing future hassle in updating.
The GDFR is an LDAP directory service built on top of OpenLDAP. The LDAP tree has two major intermediate nodes: one for application information and one for file format information. Below the application information node, each node contains information about one application. For example, Adobe Acrobat Reader 6.0 takes up a node that has many attributes describing this application. Likewise, each node below the format information node contains information about one format.
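One way such extensibility could be achieved is a table of signatures in which each supported format is a single entry. The sketch below is an assumption about the general approach, not Fider's real design; `SignatureTable`, `register` and `identify` are hypothetical names.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical signature table; not Fider's actual data structure. */
final class SignatureTable {

    // Insertion order is preserved, so earlier registrations are tried first.
    private final Map<String, byte[]> signatures = new LinkedHashMap<>();

    /** Supporting a new format is one call; the matching code stays unchanged. */
    SignatureTable register(String formatName, byte[] magic) {
        signatures.put(formatName, magic.clone());
        return this;
    }

    /** Returns the first registered format whose magic number prefixes the bytes. */
    Optional<String> identify(byte[] head) {
        for (Map.Entry<String, byte[]> entry : signatures.entrySet()) {
            byte[] magic = entry.getValue();
            if (head.length >= magic.length
                    && Arrays.equals(Arrays.copyOf(head, magic.length), magic)) {
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty();
    }
}
```

With this shape, a call such as `table.register("PNG", pngMagic)` would be all that is needed to recognize one more format; nothing in `identify` has to change.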
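The two-branch layout can be pictured through the distinguished names (DNs) its entries would have. The sketch below only builds DN strings and does not talk to a server. The base DN, the branch names and the use of `cn` as the naming attribute are all assumptions made for illustration; the real GDFR tree may be organized differently. `Rdn.escapeValue` is the standard JDK helper for escaping special characters in a DN value.

```java
import javax.naming.ldap.Rdn;

/** Hypothetical DN layout for illustration; the real GDFR schema may differ. */
final class RegistryLayout {

    // Assumed base DN and branch names, not taken from the GDFR itself.
    static final String BASE_DN = "dc=gdfr,dc=example,dc=org";
    static final String APPLICATIONS_DN = "ou=applications," + BASE_DN;
    static final String FORMATS_DN = "ou=formats," + BASE_DN;

    /** DN of the node describing one application. */
    static String applicationDn(String applicationName) {
        return "cn=" + Rdn.escapeValue(applicationName) + "," + APPLICATIONS_DN;
    }

    /** DN of the node describing one file format. */
    static String formatDn(String formatName) {
        return "cn=" + Rdn.escapeValue(formatName) + "," + FORMATS_DN;
    }
}
```

Under these assumptions, the Adobe Acrobat Reader 6.0 node mentioned above would live at `cn=Adobe Acrobat Reader 6.0,ou=applications,dc=gdfr,dc=example,dc=org`.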
How Is LDAP Constructed for the Format Registry Purpose?
Below is a screenshot from phpLDAPadmin that shows the initial structure of GFR.
For more information, go to the GFR development page.
How Does Fider Work?
For more information, go to the Fider development page.
How Do Things Work Altogether?
The Agent in the figure above can be either a stand-alone application or a web service.
- The user wants to know the format of the file named 'foo'.
- The Agent forwards the file 'foo' to Fider.
- From magic-number matching, Fider guesses that the format is 'Adobe Acrobat PDF 1.*'.
- The Agent queries the GDFR for the location of a PDF verifier.
- The GDFR returns the verifier's location.
- The Agent now forwards the file 'foo' to the PDF verifier.
- After verification, the PDF verifier returns true.
- The Agent now returns a message about the well-formedness and validity of the file 'foo' to the user.
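The steps above can be sketched as a small orchestration class. This is an illustrative outline only: the three interfaces are hypothetical stand-ins for Fider, the GDFR and a per-format verifier, and the sketch simplifies the flow by having the registry hand back a verifier object directly rather than a location that the Agent must then contact.

```java
import java.util.Optional;

/** Hypothetical Agent sketch; the interfaces are stand-ins, not real APIs. */
final class AgentSketch {

    /** Stand-in for Fider: guesses a format name from the file's bytes. */
    interface Identifier {
        Optional<String> guess(byte[] content);
    }

    /** Stand-in for a GDFR lookup: maps a format name to a verifier. */
    interface Registry {
        Optional<Verifier> findVerifier(String formatName);
    }

    /** Stand-in for a per-format verifier (parser). */
    interface Verifier {
        boolean verify(byte[] content);
    }

    private final Identifier identifier;
    private final Registry registry;

    AgentSketch(Identifier identifier, Registry registry) {
        this.identifier = identifier;
        this.registry = registry;
    }

    /** Runs the guess, lookup and verify steps and builds a message for the user. */
    String report(String fileName, byte[] content) {
        // Forward the file to the identifier and collect its guess.
        Optional<String> guess = identifier.guess(content);
        if (!guess.isPresent()) {
            return fileName + ": format unknown";
        }
        // Ask the registry for a verifier for the guessed format.
        Optional<Verifier> verifier = registry.findVerifier(guess.get());
        if (!verifier.isPresent()) {
            return fileName + ": " + guess.get() + " (not verified)";
        }
        // Forward the file to the verifier and report the outcome.
        boolean ok = verifier.get().verify(content);
        return fileName + ": " + guess.get()
                + (ok ? " (verified)" : " (verification failed)");
    }
}
```

Keeping the three roles behind interfaces mirrors the separation in the walkthrough: the Agent itself knows nothing about any particular format, so the same code path would serve a stand-alone application or a web service.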