Skip to content

Data types

Computers store all data as sequences of 1 and 0 in memory. To be useful, these need to be interpreted as strings, numbers, dates, booleans, etc. This is called data typing. Different file formats have varying levels of support for one or more data types. When working with complex data in any computer software, you will quickly run into situations where you need to explicitly specify which type you except some data to be.

The dataTypes argument

Various custom functions allow you to specify the data type of each column in a table using the dataTypes argument.

This argument is a mapping of column names to data type specifiers. You can use the dict function to create the mapping.

Examples

  • dict()
  • dict("columnA", "string")
  • dict("columnB", "int:decimal=round", "columnC", "float:error=null", "columnD", "bool")
  • dict("columnE", "date", "columnF", "datetime", "columnG", "time")
  • dict("columnE", "date:%Y-%m-%d", "columnF", "datetime:%Y-%m-%d %H:%M:%S", "columnG", "time:%H:%M:%S")

Column names

  • Column names are case-insensitive
  • If a column is present in dataTypes but is not present in the table or in the SQL query results, it is ignored.
  • If a column is present in the source table or in the SQL query results, but not in the dataTypes mapping, the behaviour depends on the function. See the documentation for the specific function for details.

Data type specifiers

The data type specifier is a string that describes the expected or desired data type of the column. It may be one of the following:

  • int
  • int:error=…,decimal=…
  • float
  • float:error=…
  • bool
  • string
  • date
  • date:FORMAT
  • datetime
  • datetime:FORMAT
  • time
  • time:FORMAT

Date and time formats

FORMAT is a strptime/strftime format string. When omitted, a few variants of ISO 8601 are attempted when reading, and the following defaults are used when writing:

  • datedate:%Y-%m-%d
  • datetimedatetime:%Y-%m-%d %H:%M:%S
  • timetime:%H:%M:%S

Numbers (int and float)

  • int means INTEGER, i.e., a whole number.
  • float means FLOAT, i.e., a real number, stored using the IEEE 754 standard.
  • int:error=… and float:error=… are available since Autory v3. They allow you to specify how to handle errors when converting strings to numbers. The following options are available:
    • error=raise (default): Raise an error. This stops the run. This is useful for ensuring that all numbers are valid.
    • error=null: Return NULL (a.k.a. None in Python, or an empty cell in Excel).
    • error=0: Return a specific number (e.g., 0).
  • int:decimal=… is available since Autory v3. It allows you to specify how floats are converted to ints.
    • decimal=raise: Raise an error. This stops the run. This is useful when you need to ensure that you do not lose any precision.
    • decimal=floor (default): Round towards negative infinity. For positive numbers, this is equivalent to throwing away the decimals. This uses Python's floor function.
    • decimal=ceil: Round towards positive infinity. For negative numbers, this is equivalent to throwing away the decimals. This uses Python's ceil function.
    • decimal=round: Round the number if the string cannot be converted to a number. This uses Python's round function, which rounds to the nearest even number in case of a tie.
  • int:error=…,decimal=… is available since Autory v3. It allows you to specify both error handling and rounding.

Rationale

  • Since CSV files are text files, they contain only strings. There is no standard way of representing data types in CSV files, so we need to explicitly state which columns are expected to contain dates, booleans, etc.
  • Excel tables do represent data types, but SQL queries like SELECT foo AS bar may create columns that are not present in the original table, and so we might want to explicitly specify the data type of these columns.
  • Not all SQL databases support all data types, so we might want to re-interpret a string as a date/time after running the SQL query, according to our preferred date/time format. See SQLite data types for more details.
  • We might want to override the data type of a column for whatever reason.
  • There is no single standard for representing dates/times, booleans or numbers as strings.

Interpreting strings as dates and times

There are many valid ways to represent a date/time as a string, e.g.:

  • 2 November 2020
  • 20201102T00:00:00+Z
  • 2020-11-02
  • 2020-02-11

Some strings or numbers may look like dates, while they may represent something else, e.g.:

  • The number 20201102
  • The name Jan

Although there are good standards for representing dates as strings, like ISO 8601, you will encounter many other standards in practice. Therefore, it is impossible to reliably detect date/time columns from strings without false positives and misinterpreted data. Instead, you should explicitly state which columns are expected to contain dates, especially when reading from a CSV file.

Interpreting strings as booleans

There are many valid ways to represent a boolean as a string, e.g.:

  • true / false
  • True / False (Python)
  • TRUE / FALSE (Excel)
  • yes / no
  • 1 / 0 (SQLite)
  • on / off

Some strings or numbers may look like booleans, while they may represent something else, e.g.:

  • The number 1
  • The abbreviation no

Therefore, it is impossible to reliably detect boolean columns from strings without false positives and misinterpreted data. Instead, you should explicitly state which columns are expected to contain booleans, especially when reading from a CSV file.

Interpreting strings as numbers

There are many valid ways to represent a number as a string. Each way is further complicated by the fact that there are regional differences in the characters used to separate thousands and decimals, e.g.:

  • Integers: 1, 1000 1 000, 1_000
  • Floating point numbers: 1.1, 1.10, 1,000.1, 1,000.10, 1 000.1, 1 000.10, 1_000.0, 1_000.10
  • Scientific notation: 1e3, 1e+3, 1e-3, 1.0e3, 1.0e+3, 1.0e-3, 1,000.0e3, 1,000.0e+3, 1,000.0e-3, 1 000.0e3, 1 000.0e+3, 1 000.0e-3, 1_000.0e3, 1_000.0e+3, 1_000.0e-3
  • Numbers that use bases other than decimal:
    • Binary: 0b1010, 1010 (easily confused with base 10)
    • Octal: 0o12, 12 (easily confused with base 10)
    • Hexadecimal: 0xDEADBEEF, DEADBEEF (easily confused with words)

Some strings may look like numbers, while they may represent something else, e.g.:

  • RSA ID numbers.
  • Cellphone numbers.
  • Dates in YYYYMMDD format.

Therefore, it is impossible to reliably detect number columns from strings without false positives and misinterpreted data. Luckily, Excel tables and all known SQL implementations support numeric data types. Numeric detection is usually only required when reading from CSV files. See csvToTable for details.

NULL

What is NULL?

NULL (a.k.a. empty / missing values) is a special value that represents the absence of a value.

Each programming language and each file format has different ways of representing NULL:

  • NULL (SQL)
  • None (Python)
  • null (JSON, JavaScript, Java, C#, etc.)
  • JavaScript has another special value called undefined, which is distinct from null, but similar.
  • nil (Ruby, Swift, etc.)
  • Empty cells in Excel are often considered NULL, but unfortunately not consistently so. It depends on the formula that references the cell, and the number format of the cell.

Things that can't be represented in a specific file format are frequently replaced with NULL when there is no better alternative.

NULL is sometimes conflated with, but technically distinct from:

  • An empty string, which is a string with zero characters.
  • Whitespace-only strings.
  • The number zero.
  • The boolean false.
  • Not a Number (NaN), which is a special floating point value that represents an undefined or unrepresentable value in some programming languages.
  • Other special values, like infinity.
  • Error values, such as #N/A or #DIV/0! in Excel.
  • See the Wikipedia disambiguation page.

Interpreting strings as NULL

While some text-based file formats (like JSON and YAML) have explicit support for NULL values, others (like CSV) have no standard way of representing NULL values. In practice, there are many different ways to do this in text files, e.g.:

  • An empty string.
  • The string NULL, null, NaN, nan, N/A, n/a, -, etc.

The pandas.read_csv function in Python uses an extensive list of these by default.

However, there are many situations in which these detections may be false positives, e.g.:

  • The string N/A may be a valid value in a column that contains survey responses.
  • "Null" is a surname. Yes, really.

Putting quotes around such strings does not help, either, as quotes in CSV files are treated as optional in almost all CSV parsers, with no standard to specify that "null" is any different from null.

How Autory handles NULL detection

When reading from CSV files, Autory will not make any assumptions about which strings represent NULL, except for the empty string "" (with or without quotes). See csvToTable for details.

The reason for treating empty strings as NULL by default can be illustrated with an example CSV file:

a,b,c
1,"one",True
,,
2,"two",False
"","",""
3,"three",True

In this example, the second and fourth rows contain empty fields for all columns. Column a is clearly numeric, column b is clearly string (text), and column c is clearly boolean.

Treating empty string as strings would not cause problems in column b, but it would cause columns a and c to have heterogeneous data types, which is usually not desired:

  • Column a: Mixture of integers and strings.
  • Column b: Just strings. No problem.
  • Column c: Mixture of booleans and strings.

Treating empty strings as NULL resolves this issue and yields homogeneous data types, since NULL is not considered as a separate data type, but rather as the absence of a value:

  • Column a: Integers.
  • Column b: Strings.
  • Column c: Booleans.

Nullability

A nullable type is a data type that does not require a value to be present. In other words, it allows NULL values.

In many SQL databases, columns may be defined as NULL (i.e., some rows might have nothing in that column) or NOT NULL (i.e., every row must have a value in that column).

In Autory all data types are considered nullable.

SQLite data types

SQLite has no explicit date/time or boolean data types. It is left to the user to decide how to represent these values in the database as strings or integers. Good choices are ISO 8601 (strings) for dates/times and 0/1 (integers) for booleans. However, you will find many other representations in practice, like Unix time or regional date/time formats.

SQLite does not enforce data types on columns by default. This feels like a great feature at first, but quickly leads to unexpected results in serious use cases. When Autory writes data to an SQLite database, it will emit a warning if a column contains values of more than one data type.

See the official SQLite documentation on data types for more details.

Homogeneous and heterogeneous data types

Definitions

A column (or any collection of values) is said to have a

  • homogeneous data type if all values in the column are of the same data type (or NULL, since all data types are nullable in Autory).
  • heterogeneous data type if it contains values of more than one data type (excluding NULL values, since all data types are nullable in Autory).

Examples

  • 1, 2, 3, NULL, 5 ⇒ Homogeneous (there are only integers)
  • 1.0, 2.5, 3.14, NULL, 0.0 ⇒ Homogeneous (there are only floats)
  • True, False, NULL, True ⇒ Homogeneous (there are only booleans)
  • "apple", "banana", "cherry", NULL ⇒ Homogeneous (there are only strings)
  • 1, 2.5, 3, NULL, 0.0
    • In theory, integers and floats are different data types, so this is heterogeneous.
    • In practice, since integers can easily be represented as floats (1 == 1.0), this is treated as homogeneous.
  • 1, True, "apple", NULL ⇒ Heterogeneous, since there are integers, booleans, and strings.

Why homogeneous data types are important

  • It is faster to process.
  • It uses less memory.
  • It is easier to analyze and visualize, e.g.:
    • If a column contains both numbers and strings, you cannot calculate the sum or average of the numbers.
    • If a column contains both dates and strings, you cannot plot a time series.

Error values

When Autory encounters an error while calculating the result of a variable, it will store an error value in the results instead of crashing the entire run. This can happen when dividing by zero, trying to convert an invalid string to a number, misspelling a variable name or function name, trying to optimize something that does not converge, etc.

As mentioned above, error values are not the same thing as NULL values. Therefore, any column that contains one or more error values is considered to have a heterogeneous data type. Autory will emit a warning when this happens.

Sometimes it's useful to ignore errors, replacing them with NULL instead. The autory convert results command has the --encode-error-values null option for this purpose. Use this with caution, as it may hide problems in your model or data.