Data Dictionary Guidelines

Recommendations for documenting data carried by the Data Interchange

Overview

A "Data Dictionary" documents the structure, format, meaning, and other important characteristics of a database or a data transmission. Data Dictionaries are critical to data consumers, who will rely on such documentation to correctly interpret and process the data.

This document provides guidelines for creating Data Dictionaries to describe data that will be exchanged on the Cold Chain Data Interchange. The primary audience for this document is Telemetry Providers (i.e., organizations that deliver cold-chain data to the Data Interchange) and their technical representatives.

Flexible Data Exchange

The Data Interchange was designed to accept and distribute digital data with very few limits on the types of data that are handled. In practice, this means that the Data Interchange is flexible enough to support existing monitoring and reporting programs with minimal change to existing data systems.

The following are examples of data that can be handled by the Data Interchange:  

Structured data

  • CSV or other delimited text

  • Text data with fixed-length fields

  • JSON or YAML data

  • XML data

  • Spreadsheets

Semi-structured data

  • Logs

Unstructured data

  • Freeform text documents

  • Images and PDFs

  • Compressed file archives

Because the Data Interchange does not publish (or enforce) standards on what types of data are exchanged, it is the responsibility of Telemetry Providers to document the structure and details of any data that they publish to the Data Interchange. Such documentation is critical to Telemetry Consumers, who depend on this documentation to interpret and analyze the data.

Documenting the transmission (high-level)

Each distinct type of transmission to the Data Interchange – whether it contains structured or unstructured data – should be documented in a Data Dictionary. The following are the high-level recommendations for describing each distinct type of transmission. Recommendations for documenting the low-level data details are provided in a subsequent section.

Name of the overall message / payload: Provides an identifier for a specific type of data transmission that will be handled by the Data Interchange. For example, this might be "Daily Temperature Report", or "My 60DTR", or whatever name best describes the contents of the data transmission.

Description of the overall message / payload: Provides additional, descriptive details to help the consumer understand the type / purpose of this transmission. Depending on the payload, this description might be a few simple sentences or might require multiple paragraphs to clarify the context and intended use.

Payload Type: Provides high-level information to describe the structure and format of the payload. The following are reasonable values, though by no means an exhaustive set:

  • JSON object

  • Tab-delimited text data

  • Text data with fixed-length fields

  • PNG images

  • Semi-structured log data

  • Complex report structure (with additional description)

Frequency / volume of transmission

Provides the anticipated frequency of data transmissions and, if relevant, the expected intervals between data samples. For example, it may be that data is sampled every 30 seconds, but the samples are aggregated and sent to the Data Interchange every 8 hours. Such details will help data consumers and Data Interchange administrators to anticipate load on systems and infrastructure. They may also assist data consumers in analyzing the completeness of transmitted datasets.

If the frequency of data transmission and/or data samples is variable, this should be described. For example, some systems will adjust the sampling rate in relation to power availability. It is recommended to describe the behavior of the data in such cases.

Good documentation might look like the following:

This report is transmitted every 12 hours from each of ~200 devices, representing approximately 400 messages each day. Each report will contain multiple telemetry samples, on 10 second intervals, unless power is unavailable; in such cases, the sampling interval will be variable between 10s and 5m. During normal operation, each report will be approximately 64KB in size.
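The figures in a statement like this can be cross-checked with a little arithmetic. The following is a minimal sketch in Python; the variable names are illustrative only, and the values are taken from the example text above.

```python
# Values taken from the example documentation above.
DEVICES = 200                # approximate number of reporting devices
REPORT_INTERVAL_HOURS = 12   # each device transmits every 12 hours
REPORT_SIZE_KB = 64          # approximate size of one report

# Each device sends 24 / 12 = 2 reports per day.
reports_per_device_per_day = 24 // REPORT_INTERVAL_HOURS
messages_per_day = DEVICES * reports_per_device_per_day

# Approximate daily volume across all devices, in KB.
daily_volume_kb = messages_per_day * REPORT_SIZE_KB

print(messages_per_day)  # 400
print(daily_volume_kb)   # 25600
```

Stating the number of devices, the interval, and the report size separately lets a consumer redo this estimate when any one of them changes.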

Documenting the transmission (low-level)

Documentation of structured data is typically called a "Data Dictionary", which is the term used through the rest of this section. In contrast to example datasets – which are valuable in their own right – a Data Dictionary provides metadata to help an analyst or other consumer understand what the data means, what data types are present, how the data is formatted, and other information that may be helpful in consuming the data.

This section describes the recommended minimum set of information to document the low-level details (typically, the field-level details) of a structured data transmission. For unstructured or semi-structured data, these recommendations are more flexible; provide whatever information may be helpful to a downstream data consumer.

Field name or identifier: Provides information to unambiguously identify the name or position of the data in the payload. In a delimited text file, this might be the name of the header or the numbered position of the field. For JSON, YAML, or XML data, this should be the name of the field, typically in a dot-delimited format (e.g., facility.logger[].serialnumber).

Field datatype: Provides information to clarify what type of data is contained in this field. For example, is the data numeric or a string? For more complex structures, perhaps the data is an object (i.e., dictionary) or an array.

Field description: Provides information to clarify what the data in the field means or represents. This would generally include a reference to the units of measure. For example, does this string of numbers represent a latitude or longitude? Does it represent the median temperature, in Celsius, over the past 1 hour, as calculated over a set of samples that are taken every 1 minute? Should the number be represented as a percentage or a decimal?

Even if data fields are named very carefully, understanding the low-level details is frequently impossible without such a detailed description.

Required: Provides information to clarify whether a field is required (or optional). In cases where a field is contingent upon some other condition (e.g., an error state), it is helpful to describe the condition(s) that influence the presence or absence of the field.
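To illustrate the dot-delimited naming convention and the datatype column, the following Python sketch lists the field paths found in a parsed JSON payload, together with a coarse datatype for each. The function name `field_paths` and the sample payload are assumptions made for illustration; they are not part of the Data Interchange.

```python
def field_paths(value, prefix=""):
    """Yield (path, datatype) pairs for each leaf field in a parsed JSON value.

    Object keys are joined with "." and array members are marked with "[]".
    """
    if isinstance(value, dict):
        for key, child in value.items():
            yield from field_paths(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for item in value:
            yield from field_paths(item, f"{prefix}[]")
    else:
        yield prefix, type(value).__name__


# Hypothetical payload, shaped to match the path quoted in the text.
payload = {
    "facility": {
        "name": "Example Clinic",
        "logger": [{"serialnumber": "A100"}, {"serialnumber": "B200"}],
    }
}

# A set removes the duplicates produced by repeated array members.
paths = sorted(set(field_paths(payload)))
for path, datatype in paths:
    print(path, datatype)
```

A listing like this can serve as a starting point for the name and datatype columns. The description and the required/optional status still have to be written by someone who knows the data.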

Optional details

The following information should be considered optional (although frequently valuable) when documenting structured data:

Data length: Provides information to clarify the maximum expected length of data in the associated field.

Can be null? Provides information to clarify whether the field accepts null values. Note that this is distinct from whether the field is required.

Acceptable values: Provides information to implement data validation, if relevant. For example, if numeric data is only expected to fall within a range from 0-99, this may be helpful for a consumer to understand. The information is especially valuable if a data consumer will discard or provide special handling for unacceptable / unexpected values. Including other details, such as anticipated averages or min/max values (where relevant), is also helpful, especially in cases where data-processing software should make automated decisions about the quality of data in a report. It may also be useful to note further descriptors for each parameter, such as the time interval over which an average, minimum, or maximum applies.

Spreadsheet tab: For spreadsheets that have multiple tabs, it is important to identify the tab where a given field appears. Note that processing data from spreadsheets with multiple tabs puts a special burden on data consumers, so it should be avoided wherever possible.
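The following Python sketch shows how a data consumer might turn these details into automated checks. It is a sketch under stated assumptions, not a reference implementation: the `FieldSpec` class, the `check_record` function, and the field names `reading` and `note` are all hypothetical, and the 0-99 range is borrowed from the example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldSpec:
    """One Data Dictionary row, reduced to the details needed for validation."""
    name: str
    required: bool = True
    nullable: bool = False
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    max_length: Optional[int] = None


def check_record(record: dict, specs: list) -> list:
    """Return a list of human-readable problems found in one record."""
    problems = []
    for spec in specs:
        if spec.name not in record:
            # "Required" concerns whether the field is present at all.
            if spec.required:
                problems.append(f"{spec.name}: missing required field")
            continue
        value = record[spec.name]
        if value is None:
            # "Can be null?" concerns the value of a field that is present.
            if not spec.nullable:
                problems.append(f"{spec.name}: null not allowed")
            continue
        if spec.max_length is not None and len(str(value)) > spec.max_length:
            problems.append(f"{spec.name}: longer than {spec.max_length} characters")
        if isinstance(value, (int, float)):
            if spec.min_value is not None and value < spec.min_value:
                problems.append(f"{spec.name}: {value} is below the minimum of {spec.min_value}")
            if spec.max_value is not None and value > spec.max_value:
                problems.append(f"{spec.name}: {value} is above the maximum of {spec.max_value}")
    return problems


# Hypothetical dictionary entries for illustration.
specs = [
    FieldSpec("reading", min_value=0, max_value=99),
    FieldSpec("note", required=False, nullable=True, max_length=10),
]

print(check_record({"reading": 42, "note": None}, specs))
print(check_record({"reading": 120}, specs))
print(check_record({"note": "this note is too long"}, specs))
```

The sketch keeps "required" and "can be null" as separate checks, mirroring the distinction drawn above: a field may be optional yet never null when present, or required yet allowed to be null.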

Formatting the Data Dictionary

In general, the most straightforward approach is to use a table to format your Data Dictionary. The following is a simple template, which can be extended or modified to suit your specific data.

Data Dictionary Template (Simple)
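One possible way to produce such a table is to write it as delimited text that opens in any spreadsheet tool. The Python sketch below assumes column names drawn from the recommended field details in the previous section. The column set, the field names, and the descriptions are hypothetical placeholders, not a prescribed layout.

```python
import csv
import io

# Assumed column headings, based on the recommended minimum field details.
COLUMNS = ["Field name", "Datatype", "Description", "Required"]

# Hypothetical rows for illustration only.
entries = [
    ["device_id", "string", "Unique identifier of the reporting device", "Yes"],
    ["temperature_c", "number", "Storage temperature in degrees Celsius", "Yes"],
    ["error_code", "string", "Present only when the device reports an error", "No"],
]

# Write the table to an in-memory buffer as comma-delimited text.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(COLUMNS)
writer.writerows(entries)
table_text = buffer.getvalue()

print(table_text)
```

Further columns, such as data length or acceptable values, can be appended to `COLUMNS` as the data requires.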

Documenting the transmission (examples)

In addition to the Data Dictionary, it is recommended to include one or more example payloads for each type of transmission. This is especially valuable for structured data, so the data consumer has access to real data for analysis and integration purposes.

Summary

Telemetry Providers are responsible for documenting each type of transmission that is exchanged by the Cold Chain Data Interchange. Good documentation will contain three important sections:

  1. High-level description and details of the transmission itself. Refer to section: Documenting the transmission (high-level)
  2. Low-level description of the data (especially structured data). Refer to section: Documenting the transmission (low-level)
  3. Examples of the data transmission. Refer to section: Documenting the transmission (examples)

Data Dictionary: An Example

The following spreadsheet provides a Data Dictionary to interpret telemetry reports from a simple (and hypothetical) cold-chain device. An example report that is described by this Data Dictionary is provided below.

Data Dictionary Example

Example data report

[
  {
    "BATHOLD":10.97,
    "LAMPS":0.06,
    "BATCHG":100,
    "LFREQ":50.43,
    "ERR":[
      {
        "ID":"6184",
        "INFO":"0 (360)"
      }
    ],
    "BATVOL":4.14,
    "LVOL":236.70,
    "LWATT":237.68,
    "LWH":3425.25,
    "FRIDGEID":2928041723101708391,
    "HUMREL":79.37,
    "FSTATE":0,
    "COMPWR":1,
    "AMTMP":17.91,
    "VCBTMP":3.31,
    "VCTTMP":3.19,
    "DATETIME":"2020-04-28T23:56:59+00:00"
  },
  {
    "BATHOLD":10.97,
    "LAMPS":0.06,
    "BATCHG":100,
    "LFREQ":50.43,
    "BATVOL":4.14,
    "LVOL":236.70,
    "LWATT":237.68,
    "LWH":3425.25,
    "FRIDGEID":2928041723101708391,
    "HUMREL":79.37,
    "FSTATE":0,
    "COMPWR":1,
    "AMTMP":17.91,
    "VCBTMP":3.31,
    "VCTTMP":3.19,
    "DATETIME":"2020-04-28T23:56:49+00:00"
  },
  {
    <additional records on 10 second intervals>
  }
]
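As a brief illustration of how a consumer might use such a report, the Python sketch below parses two abbreviated records and checks the interval between their timestamps. Only the AMTMP and DATETIME values are copied from the example above. The variable names are illustrative, and the sketch assumes DATETIME is an ISO 8601 timestamp, as its format suggests.

```python
import json
from datetime import datetime

# Two records abbreviated from the example report (other fields omitted).
report_text = """
[
  {"AMTMP": 17.91, "DATETIME": "2020-04-28T23:56:59+00:00"},
  {"AMTMP": 17.91, "DATETIME": "2020-04-28T23:56:49+00:00"}
]
"""

records = json.loads(report_text)

# Assumption: DATETIME holds an ISO 8601 timestamp with a UTC offset.
timestamps = [datetime.fromisoformat(r["DATETIME"]) for r in records]

# Gap between each pair of consecutive records, in seconds.
# abs() is used because the records shown are ordered newest first.
gaps = [abs((a - b).total_seconds()) for a, b in zip(timestamps, timestamps[1:])]

print(gaps)  # [10.0]
```

A gap other than the documented 10 seconds would suggest missing samples or a variable sampling rate, which is why the expected interval belongs in the Data Dictionary.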