Summary

  • Every filename should be unique.
  • Filenames should conform to the following format:

    YYYYMMDD-HHMM_experiment_sample_experimenter.extension

  • If instruments have a short filename limit, use the following filesystem structure to parallel the long filename format above:

    experimenter/YYYYMMDD/experiment/sample/HHMM.extension

Introduction

It seems like most groups have an ad-hoc and byzantine approach to managing the computer data files that instruments generate in the course of research. Different researchers often have conflicting categorization schemes using many levels of nested folders. These schemes likely aren't even self-consistent. Versions of the same data are scattered across many different computers under inconsistent names and at various stages of analysis or processing. Some are in the original format, some have been combined with other data files in MS Excel sheets, and the original pristine copies of some have been overwritten during processing.

Herein I describe a scheme to eliminate the problems above that stem from non-uniform or conflicting naming schemes for computer data files. The scheme should work for a small group (10 to 20 people, perhaps more). It assumes the files aren't too big (tens to perhaps a few hundred MiB on the high end) and that up to tens of data files are produced per day. It is based on the use of unique identifiers (UIDs) and combinations of disjoint sets of metadata.

File naming

In a nutshell, files should be named using the following format:

YYYYMMDD-HHMM_experiment_sample_experimenter.extension

Filenames are thus constructed from several disjoint pieces of metadata, separated by the underscore (_) character. A hyphen joins the date to the time, and a period precedes the extension. Each component is described below:

  • YYYY: four-digit year (e.g. 2014, not 14)
  • MM (before the hyphen): two-digit month (e.g. 02, not 2)
  • DD: two-digit day (e.g. 05, not 5)
  • HH: two-digit hour in 24-hour format (e.g. 14, not 2pm; 09, not 9)
  • MM (after the hyphen): two-digit minute (e.g. 07, not 7)
  • experiment: name of technique (e.g. xps, afm, iv, etc.)
  • sample: unique identifier of the sample
  • experimenter: initials of the person who took the data (e.g. jrs).
  • extension: the file extension (e.g. .tif, .dat).
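
As an illustration, the components above could be assembled programmatically. This is a minimal Python sketch, not a definitive implementation: the function name build_filename and the sample ID s042 are hypothetical, and rejecting underscores inside a field is an assumption made here so that the underscore stays unambiguous as a separator.

```python
from datetime import datetime

def build_filename(when, experiment, sample, experimenter, extension):
    """Assemble YYYYMMDD-HHMM_experiment_sample_experimenter.extension."""
    fields = [experiment, sample, experimenter]
    for field in fields:
        # Assumption: the underscore separates fields, so it may not
        # appear inside one; empty fields are rejected as well.
        if not field or "_" in field:
            raise ValueError(f"invalid field: {field!r}")
    # strftime zero-pads every part and uses the 24-hour clock.
    stamp = when.strftime("%Y%m%d-%H%M")
    ext = extension.lstrip(".")  # accept "dat" or ".dat"
    return f"{stamp}_{'_'.join(fields)}.{ext}".lower()

name = build_filename(datetime(2014, 2, 5, 14, 7), "xps", "s042", "jrs", ".dat")
print(name)  # 20140205-1407_xps_s042_jrs.dat
```

Letting the computer read the clock and format the timestamp removes the chance of typing 2 instead of 02 by hand.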

The first thing you might notice about this scheme is that it does not use an index; in other words, there's no xps0001.dat, xps0002.dat, etc. Also, the date goes year-month-day instead of the month-day-year order many Americans are used to. Finally, this scheme seems like it will produce long filenames.

Time as an index (and other benefits)

The structure of the date/timestamp is a key part of this naming scheme, so I'm going to spend some time explaining the reasoning behind it. By specifying time to the resolution of minutes, one gets implicit indexing, even though the indices likely won't be sequential. I've found that sequential indexing of data filenames simply isn't that useful so long as every data file has a unique name. Moreover, manually indexing filenames is distracting for the experimenter; a small part of the experimenter's mind is occupied with the running index instead of focused on the experiment. Unless a computer is keeping track of the index, an experimenter is likely to skip an index or repeat one, which adds to the confusion.

Typically experiments take longer than a minute to run, so temporal resolution to the minute is usually sufficient to avoid name collisions. If higher temporal resolution is required, simply append a seconds field (and finer fields, if needed) after the minutes.

The order of the temporal fields is important as well: they run from the largest unit of time to the smallest. This ordering is based on the ISO 8601 standard. The use of 24-hour clock format reduces ambiguity and saves an extra character or two in filename length (no am or pm suffix). Formatting the date and time in this way is precise and unambiguous, which should be the aspiration of a scientist.

Writing the date/timestamp in this way and putting it at the beginning of the filename yields the benefit that most computers will end up displaying the filenames sorted chronologically by default. Contrast this with the default sorting that would occur if the date were written MM-DD-YY or even MM-DD-YYYY: files would be grouped by month first, interleaving data from different years.
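
The sorting behavior is easy to check. The short Python sketch below sorts the same three made-up acquisition times written both ways; a plain alphabetical sort, which is roughly what a file browser does by default, is chronological only for the year-first form.

```python
# The same three acquisition times written two ways (hypothetical examples).
iso_style = ["20140205-1407", "20131120-0930", "20140111-1615"]
us_style = ["02-05-2014", "11-20-2013", "01-11-2014"]

iso_sorted = sorted(iso_style)
us_sorted = sorted(us_style)

print(iso_sorted)
# ['20131120-0930', '20140111-1615', '20140205-1407']  (chronological)
print(us_sorted)
# ['01-11-2014', '02-05-2014', '11-20-2013']  (the oldest file sorts last)
```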

The non-temporal metadata parts of the filename

The rest of the filename is composed of a few other bits of metadata. The experiment field is necessary because the filename extension may not give enough information to determine the technique. For example, our scanning electron microscope generates TIFF files regardless of whether it was imaging in secondary electron mode, backscatter mode, EBIC, or some other mode. I recommend using a string of no more than three or four characters for the technique.

The sample field gives the name of the sample, which should be unique. I'll write more on choosing unique sample names later. The experimenter field is useful for two reasons. First: attribution. Someone preparing a manuscript for publication can easily determine who contributed the data if this field is present in the filename. Second: responsibility. If this field is present, someone analyzing the data can track down the person who took the data and ask them questions.

Some advice and pitfalls to avoid

There were a few design criteria I was hoping to meet with this naming scheme. I wanted a rubric for generating a filename that was guaranteed to be unique. I also wanted to build in enough information that a person could get a good idea of the contents and context of the file simply by looking at the name. Of course, you could always add even more metadata, but at some point the filename becomes unwieldy and people won't follow the convention. To that end, the final design criterion was a convention that was long enough to be sufficiently descriptive, yet short enough that people would still use it. I would recommend that you not add additional metadata fields to this scheme.

One last piece of advice: I recommend using all lowercase letters in your filenames.

A big advantage of this scheme is that it allows you to leverage the search functionality of your operating system. For example, you could easily find all of the AFM images taken by me by searching for "afm" and "jrs". Additionally, it is simple to break apart the filename into useful metadata.
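
To illustrate breaking a filename apart, here is a minimal Python sketch. The names PATTERN, parse_filename, and is_mine, along with the example filenames, are hypothetical, and the regular expression assumes that no field contains an underscore.

```python
import re
from datetime import datetime

# Assumption: fields never contain "_", and the experimenter has no ".".
PATTERN = re.compile(
    r"^(?P<stamp>\d{8}-\d{4})_(?P<experiment>[^_]+)_(?P<sample>[^_]+)"
    r"_(?P<experimenter>[^_.]+)\.(?P<extension>.+)$"
)

def parse_filename(name):
    """Return a dict of metadata, or None if the name doesn't fit the scheme."""
    match = PATTERN.match(name)
    if match is None:
        return None
    meta = match.groupdict()
    try:
        meta["timestamp"] = datetime.strptime(meta.pop("stamp"), "%Y%m%d-%H%M")
    except ValueError:  # digits present but not a real date/time
        return None
    return meta

def is_mine(name, experiment="afm", experimenter="jrs"):
    """True if the file was taken with this technique by this person."""
    meta = parse_filename(name)
    return (meta is not None
            and meta["experiment"] == experiment
            and meta["experimenter"] == experimenter)

names = [
    "20140205-1407_afm_s042_jrs.tif",
    "20140205-1530_xps_s042_abc.dat",
    "notes.txt",
]
mine = [n for n in names if is_mine(n)]
print(mine)  # ['20140205-1407_afm_s042_jrs.tif']
```

Matching on parsed fields rather than raw substrings avoids false hits, such as a sample ID that happens to contain the letters "afm".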

Dealing with old OSs that don't support long filenames

Many labs have old but perfectly usable instruments that are controlled by old computers. Depending on its age, the operating system may not support long filenames, in which case the file naming format I suggest won't work. Instead, I recommend using a nested directory structure to capture the full set of metadata -- just reverse the order of the metadata components.

experimenter/YYYYMMDD/experiment/sample/HHMM.extension
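
The two layouts carry the same metadata, so one can be converted to the other mechanically. Below is a minimal Python sketch of that mapping; the function names to_nested_path and to_long_filename are hypothetical, and the code assumes that no field contains an underscore or a period.

```python
from pathlib import PurePosixPath

def to_nested_path(filename):
    """Long filename -> experimenter/YYYYMMDD/experiment/sample/HHMM.extension"""
    stem, _, extension = filename.partition(".")
    stamp, experiment, sample, experimenter = stem.split("_")
    date, time = stamp.split("-")
    return PurePosixPath(experimenter, date, experiment, sample,
                         f"{time}.{extension}")

def to_long_filename(path):
    """Nested path -> YYYYMMDD-HHMM_experiment_sample_experimenter.extension"""
    experimenter, date, experiment, sample, leaf = PurePosixPath(path).parts
    time, _, extension = leaf.partition(".")
    return f"{date}-{time}_{experiment}_{sample}_{experimenter}.{extension}"

nested = to_nested_path("20140205-1407_afm_s042_jrs.tif")
print(nested)                    # jrs/20140205/afm/s042/1407.tif
print(to_long_filename(nested))  # 20140205-1407_afm_s042_jrs.tif
```

A round trip like this is one way files collected on an old instrument computer could later be renamed into the long format.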

It probably occurs to you that you wouldn't want a heterogeneous system containing short filename files within the nested directory structure along with the long filename files. In a future post I will discuss a computer data file workflow to deal with the issues that such a heterogeneous system would create.