CHAPTER 10 HISTORY OF SAD

PREFACE

This standard is the result of a joint investigation carried out by the Anglo-Australian Observatory and the Mount Stromlo and Siding Spring Observatories. The members of the investigating committee were:

A. Bosma (MSSSO)
R. Ekers (University of Groningen/MSSSO)
B. Newell (MSSSO)
J. Straede (AAO)
P. Wallace (AAO)
D. Warne (MSSSO)

10.1 INTRODUCTION

Data formats fall into three categories: (i) those recorded at the telescope, (ii) those used to transport data from institution to institution, and (iii) those used by data reduction programs. Most discussions on standardized data formats center on the transportability of the data (case ii). Here we aim for two additional goals. First, the data interchange format is to be suitable for recording at the telescope in order to eliminate the need to copy tapes. In addition, a common data format must be defined so that software can be transferred.

Two formats have, in fact, been developed: an interchange format and an access format. The interchange format is designed for a sequential medium such as magnetic tape. It can be kept simple enough to be recorded under the constraints operating at most telescopes. On the other hand, it can be extended in a standard manner to meet the demands of interchange of data which has passed through reduction programs that add descriptive information.

During data reduction, flexibility and speed of access are the prime requirements. Data reduction programs therefore use the access format. In this standard, it is assumed that the data storage medium allows random access. In situations where this is not available, data reduction programs can operate on the interchange format directly.

Although the interchange and access formats are necessarily different, they have enough common ground to ensure ease of translation.
This common ground is provided by building each out of the same basic unit of data, called an image, and using the same format for astronomical descriptions in both cases.

The FITS (Flexible Image Transport System) format developed by Wells (KPNO) and Greisen (NRAO) was released at about the same time as the original version of this standard was proposed. This standard was then altered to make it more compatible with FITS by the introduction of the keyword subdivision (see Section 10.2.3). This change has made some features of the original standard redundant, particularly with regard to comments appended to the data. These redundant features have been left in the standard.

10.2 IMAGES

10.2.1 Outline

The basic unit of astronomical data is an image. An image is defined as that set of data which it is appropriate to collect under one astronomical description. For example, a single spectrum, whether one- or two-dimensional, qualifies as an image. A digitized representation of a photographic plate as produced by a microdensitometer is also an image. A complete set of Aperture Synthesis Radio-Telescope maps may be collected into one image.

An image is made up of a three-dimensional data array plus header information. The header may be supplemented by a trailer in the interchange format to cope with the limitations of sequential recording.

The format allows for an area around the edges of the data cube to be excluded from the active data. This area has two main uses. It can be used for descriptive information such as scan line identifiers or a wavelength scale. Alternatively, where a reduction process destroys edge information, the bounds of the active data can be tightened to compensate.

Two classes of comments are available in the format. One is a fixed length field in the header (and trailer in the case of the interchange format). The other is an extensible set of comments which are normally present only in data which has undergone reduction.
Both comment fields have been made largely redundant by the introduction of the keyword subdivision. As they may be removed in a future revision of the format, their use is not recommended.

10.2.2 Image Header

The image header is made up of subdivisions. One subdivision, the control subdivision, holds all the information necessary to physically access the data. This is the only mandatory part of the header. The remaining subdivisions, the astronomical subdivisions, contain astronomical descriptions and are included as required.

In the interchange format, the entire control subdivision is in a standard character code (such as ASCII or EBCDIC). For reasons of efficiency, however, most fields are converted to their binary equivalents in the access format. Astronomical subdivisions use identical codes in both formats and, as far as is reasonable, contain only character code.

10.2.3 Keyword Subdivision

The keyword subdivision contains descriptive astronomical parameters identified by keywords. Parameter definitions are of the form:

    KEYWORD = Value(s) / Comments <CR>

The value or list of values must be given in the appropriate character code and must conform to the Fortran 77 list directed I/O conventions. If there are no comments, the "/" may be omitted unless the value list is truncated. Keyword "END=" terminates the list.

Although the structure is not identical to the Flexible Image Transport System (FITS) structure, the same keywords and units should be used to enable easy translation to and from FITS.

The keyword subdivision follows the same conventions as the special purpose astronomical subdivisions, i.e. it starts with a tag ($KYWRD01) followed by a character count. In this case, the character count is redundant and the END keyword always takes precedence.

10.2.4 Special Purpose Subdivisions

Although most descriptive parameters can conveniently be specified in the keyword subdivision, provision is made for special purpose subdivisions.
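The keyword parameter form given in Section 10.2.3 can be illustrated with a minimal sketch. It is not part of the standard. It assumes the text passed in is the body of the keyword subdivision after the tag and character count, that "/" inside an apostrophe-delimited Fortran character constant does not start a comment, and that values are left as raw text rather than decoded under the full list directed I/O rules. The function names and the sample parameters are hypothetical.

```python
def split_comment(rest):
    """Split 'value(s) / comment' at the first '/' outside apostrophes."""
    in_quote = False
    for i, ch in enumerate(rest):
        if ch == "'":
            in_quote = not in_quote  # a doubled '' toggles twice, so state is kept
        elif ch == "/" and not in_quote:
            return rest[:i], rest[i + 1:]
    return rest, ""


def parse_keyword_subdivision(text):
    """Return {keyword: (raw value text, comment)} up to the END keyword."""
    params = {}
    for record in text.split("\r"):      # records end with a carriage return
        record = record.strip()          # also drops any stray line feed
        if not record:
            continue
        keyword, sep, rest = record.partition("=")
        keyword = keyword.strip()
        if not sep:
            raise ValueError(f"missing '=' in record: {record!r}")
        if keyword == "END":             # END always terminates the list
            break
        value, comment = split_comment(rest)
        params[keyword] = (value.strip(), comment.strip())
    return params


# Hypothetical sample records, for illustration only.
sample = (
    "OBJECT = 'NGC 1365' / target name\r"
    "NAXIS1 = 512\r"
    "RATIO = 'A/B' / slash inside a character constant\r"
    "END=\r"
    "IGNORED = 1\r"
)
params = parse_keyword_subdivision(sample)
```

Anything after the END keyword is discarded, which mirrors the rule that END takes precedence over the character count.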
It may be desirable, for example, to store a histogram of data values, which could not conveniently be achieved using keywords. Special purpose subdivisions would generally be of fixed format and must comply with the following conventions:

1. If a particular subdivision is included in the header, the complete subdivision must be present, though some fields may be undefined. An undefined character field is blank-filled while an undefined binary field is identified by some convention appropriate to the field (e.g., if the standard deviation is zero, both the mean and standard deviation are considered undefined).

2. A subdivision can be standard or private. A subdivision is made standard by its acceptance by those responsible for maintaining the format standard at the cooperating observatories. Users who wish to set up a private subdivision for their own use are free to do so.

3. Each subdivision starts with a six character tag, followed by a two character version number and a four character subdivision length. Standard subdivision tags must start with a "$" to avoid confusion with private subdivisions. Private subdivision tags may not contain a "$". The subdivision length is the number of bytes in the subdivision including tag, version number and length fields. It is expressed in formatted character code with I4 format.

4. It is recommended practice that all fields be a multiple of four bytes long. Where a number of consecutive fields are normally dealt with as a unit, it is acceptable for the group as a whole to be a multiple of four bytes.

5. Where an astronomical subdivision contains values in binary code, it must also contain a descriptor specifying the binary code format. This descriptor follows the conventions defined for the first four bytes of the data format description in the image control subdivision (e.g., PDP-11 REAL4 would be "R4PD").

6. It is not permissible to fill an undefined field with "garbage". Character code fields must at least be blank filled.
Filling with nulls is not acceptable. Binary fields must contain a value which signifies the field is undefined.

7. Where a character field represents a number, leading blanks are permitted and the absence of a sign signifies a positive value. When the accuracy of the value does not merit the number of decimal places available, trailing blanks are recommended in preference to trailing zeroes.

10.3 INTERCHANGE FORMAT

10.3.1 Outline

An image is recorded on the interchange medium as a header, followed by the data and, optionally, a trailer and comments. Access to the image is assumed to be sequential. Where a file concept is appropriate to the medium, there may be several images to the file.

Except for the control subdivision, interchange format image headers are the same format as their access format counterparts. Details of the image header and trailer control subdivisions are given in Appendix A.

It is not possible to update values of control parameters in a trailer. Where data has been truncated for some reason, it is assumed that this is determined from the length of the data actually encountered rather than from the trailer. When translating to access format, trailer astronomical subdivisions are appended to the corresponding header subdivisions.

Comments following the trailer correspond to the extensible comments and genealogy discussed in Section 10.4.4. They are normally present only in reduced data. Comments made at the time of observation are allowed for in the fixed length comment fields of the header and trailer. (See note in preface regarding obsolescence of these comment fields.)

10.3.2 File and Record Structure

This discussion assumes magnetic tape is the medium in use. Appropriate modifications would need to be made for other media.

An interchange file is bounded by end-of-file marks except at the beginning of the tape where the initial end-of-file mark may be omitted.
File and volume labels, if present, must be separated from the body of data by end-of-file marks.

It is recommended that physical blocks be of fixed length. To allow error recovery, headers, trailers and comments must start on a physical block boundary. The header, trailer and comment tags can then be used to locate the bounds of a corrupted image. This method of recovery fails in the unlikely event that data looks like a tag. It is recommended practice that, if a tag is encountered when data is expected during tape reading, the contents of the record be printed out assuming the tag is valid. The decision as to whether the record is data or not is then determined by operator inspection.

A technique for automatic error recovery, based on tagging data blocks, is allowed for. It is not made compulsory due to the difficulty under some operating systems of manipulating physical blocks under time critical conditions. This method uses the area which can optionally be set aside at the beginning of physical blocks containing data. Such a tag must comply with the following conventions:

(i) If four or more bytes are present, the first four bytes must contain the characters "DATA".

(ii) If eight or more bytes are present, the second four bytes must contain a block count in character code in I4 format. Block 1 is the first block in the current image.

(iii) Remaining bytes are not interpreted by the error recovery procedure.

Tape reading programs which do not support automatic error recovery ignore this tag field.

Tags on the header and trailer blocks are the tags of the control subdivision, i.e. "$IMHDRnn" and "$IMTLRnn", where nn is the version number of the subdivision. Comments are tagged with the characters "COMMENTS".

Data can start on a physical block boundary or immediately following the header within the same block. The method used is specified in the control subdivision. Data logical records immediately follow each other and may lie across block boundaries.
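The block tagging conventions above can be sketched in a few lines. This is an illustration under stated assumptions, not a prescribed implementation: the character code is taken to be ASCII, unused tag bytes are blank filled, and the function names are hypothetical.

```python
def make_data_tag(block_number, tag_length):
    """Build the optional tag area at the start of a data block."""
    tag = b""
    if tag_length >= 4:
        tag += b"DATA"                                  # convention (i)
    if tag_length >= 8:
        if not 1 <= block_number <= 9999:
            raise ValueError("block count does not fit I4 format")
        tag += f"{block_number:4d}".encode("ascii")     # convention (ii), I4
    return tag.ljust(tag_length, b" ")                  # (iii): not interpreted


def classify_block(block, tag_length=8):
    """Identify a physical block from its leading tag characters."""
    head = block[:8]
    if head[:6] == b"$IMHDR":
        return ("header", head[6:8].decode("ascii"))    # nn = version number
    if head[:6] == b"$IMTLR":
        return ("trailer", head[6:8].decode("ascii"))
    if head == b"COMMENTS":
        return ("comments", None)
    if tag_length >= 4 and head[:4] == b"DATA":
        count = int(head[4:8].decode("ascii")) if tag_length >= 8 else None
        return ("data", count)
    return ("untagged", None)


tagged_block = make_data_tag(7, 8) + bytes(16)          # tag plus dummy data
kind, count = classify_block(tagged_block)
```

A reader that supports automatic recovery could compare successive block counts to detect a lost or repeated block. As the text notes, data that happens to resemble a tag still needs operator inspection.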
Physical details of the tape (number of tracks, density, block length) and the character code used (ASCII, EBCDIC, BCD, etc.) should be noted on a label affixed to the reel.

10.4 ACCESS FORMAT

10.4.1 Tree Structure

In the access format, images are grouped into trees which in most operating systems will correspond to files. In the simplest case there is one image per file. More commonly, there are several images within a file and these are grouped under a node which acts as an index to the images.

These simple structures will suffice in most cases. However, to allow an astronomer to group his data in a flexible and astronomically meaningful way, the access format allows for an indefinitely extensible tree structure. Individual images can be collected under a common node which may in turn be collected under another node. While, in principle, there can be an indefinite number of levels in the tree, in practice there are rarely more than one or two.

As an example, consider a surface photometry project in which there are two calibration images (zero and flat-field), and a number of observations of each of two objects. This data might be collected in a tree as shown in figure 1. During the reduction of this data, the astronomer might perform the step "OBJECT A: 5 - ZERO".

Node records and image header control subdivisions have been made similar so that the same subroutine can access them both. This allows programs to be written in such a way that the tree structure need not be known in advance. Each node and image is named so that it can be accessed by name or number.

At the head of the tree, usually the first record in the file, there is a tree/file header which contains implementation-dependent information. Under most systems the only information required is the record size and a pointer to the next vacant record. Space for a name is provided though this is normally redundant as it will be contained in a directory maintained by the operating system.
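The surface photometry example can be sketched as a small in-memory tree. Figure 1 is not reproduced here, so the layout below is only a guess at it. The root name, the observation names and counts, and the convention that numbers count from 1 are all assumptions made for illustration. A single class stands for both nodes and images, echoing the remark that one subroutine can access both.

```python
class Entry:
    """A node or an image; both are reached through the same lookup routine."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)   # empty for an image

    def child(self, key):
        """Return a child by name (str) or by number (int, counted from 1)."""
        if isinstance(key, int):
            if not 1 <= key <= len(self.children):
                raise IndexError(f"{self.name} has no entry number {key}")
            return self.children[key - 1]
        for entry in self.children:
            if entry.name == key:
                return entry
        raise KeyError(key)


def resolve(root, path):
    """Walk down the tree, one name or number per level."""
    item = root
    for key in path:
        item = item.child(key)
    return item


def observations(prefix, count):
    """Hypothetical helper: make `count` image entries named prefix1, prefix2, ..."""
    return [Entry(f"{prefix}{n}") for n in range(1, count + 1)]


tree = Entry("PHOTOMETRY", [
    Entry("ZERO"),
    Entry("FLAT"),
    Entry("OBJECT A", observations("A", 6)),
    Entry("OBJECT B", observations("B", 4)),
])

# The two operands of the reduction step "OBJECT A: 5 - ZERO".
fifth = resolve(tree, ["OBJECT A", 5])
zero = resolve(tree, ["ZERO"])
```

Because lookup works level by level, a program written this way need not know the depth of the tree in advance.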
A fixed length comment is also allowed for in the tree/file header.

10.4.2 Record Structure

Efficiency requirements dictate that the record structure of the access format be optimized for a particular computer and operating system. The structure must, however, meet the following requirements:

(i) It is capable of random access;

(ii) Records are fixed length;

(iii) Headers, extensible comments, and data all start on record boundaries; and

(iv) Within the headers, extensible comments, and data, logical records immediately succeed each other and may overlap record boundaries.

10.4.3 Header Chains

During data reduction, astronomical subdivisions are frequently added to the header, possibly overflowing the original space set aside. This is provided for by chaining, whereby the first four bytes of each random access record are set aside for an integer record pointer to the next record in the header. All record pointers assume the first record in the file is zero. A zero pointer indicates the last record in the chain.

When a subdivision that causes a header overflow is written, the header writing subroutines automatically find the next available record from the tree/file header. Chain pointers are not considered as part of the header logical record.

10.4.4 Extensible Comments and Genealogy

Comments and genealogy are a record of the data reduction process. Comments are entered by the person operating the reduction program while the genealogy is a reduction history automatically recorded by the program itself. Genealogy entries are enclosed between "$" signs.

Logical records in the comments are terminated with a carriage return character and the last carriage return is followed by an end-of-text character. Line feeds are ignored.

Extensible comments and genealogy, like headers, must be capable of indefinite extension and so use the same chaining technique.

10.5 CONCLUSIONS

This format can be kept simple yet, at the same time, can be expanded to satisfy quite complex requirements.
It contains only one mandatory item - the control subdivision of the image header. The concept of astronomical subdivisions provides sufficient flexibility to deal with new observing techniques as they arise. Simplicity is maintained for the computer at the recording instrument since it will only have to deal with a header appropriate to that instrument. A subset of the format came into operation at Mt Stromlo Observatory during the third quarter, 1979. Documentation of this implementation is available.
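As a closing illustration, the header chaining of Section 10.4.3 can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation referred to above. The file is modelled as a list of fixed length records, the pointer is taken to be a little-endian 32-bit integer, and the record size, blank padding, and function names are hypothetical. Record 0 holds the tree/file header, which is consistent with a zero pointer marking the end of a chain.

```python
import struct

RECORD_SIZE = 16                 # hypothetical; real record sizes are system dependent
BODY_SIZE = RECORD_SIZE - 4      # the first four bytes hold the chain pointer


def make_record(next_index, body):
    """Pack a chain pointer followed by a blank-padded body."""
    if len(body) > BODY_SIZE:
        raise ValueError("body does not fit in one record")
    return struct.pack("<i", next_index) + body.ljust(BODY_SIZE, b" ")


def chain_indices(records, first):
    """Yield the record numbers of a chain, stopping at a zero pointer."""
    seen = set()
    index = first
    while True:
        if index in seen:
            raise ValueError("pointer loop in header chain")
        seen.add(index)
        yield index
        (next_index,) = struct.unpack("<i", records[index][:4])
        if next_index == 0:
            return
        index = next_index


def read_chain(records, first):
    """Return the header logical record; chain pointers are not part of it."""
    return b"".join(records[i][4:] for i in chain_indices(records, first))


def extend_chain(records, first, next_vacant, body):
    """Append one record to a chain; return the updated next vacant record."""
    last = list(chain_indices(records, first))[-1]
    records[last] = struct.pack("<i", next_vacant) + records[last][4:]
    while len(records) <= next_vacant:
        records.append(make_record(0, b""))
    records[next_vacant] = make_record(0, body)
    return next_vacant + 1


records = [
    make_record(0, b"FILE HEADER"),    # record 0: tree/file header
    make_record(3, b"$IMHDR01 AAA"),   # record 1: header, continues in record 3
    make_record(0, b"OTHER"),          # record 2: belongs to something else
    make_record(0, b"BBB"),            # record 3: end of the header chain
]
before = read_chain(records, 1)
next_vacant = extend_chain(records, 1, 4, b"CCC")
after = read_chain(records, 1)
```

In a real file the next vacant record number would be read from and written back to the tree/file header rather than passed as an argument.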