Data types¶
Computers store all data as sequences of 1 and 0 in memory. To be useful, these need to be interpreted as strings,
numbers, dates, booleans, etc. This is called data typing.
Different file formats have varying levels of support for one or more data types. When working with complex data in
any computer software, you will quickly run into situations where you need to explicitly specify which type you except
some data to be.
The dataTypes argument¶
Various custom functions allow you to specify the data type of each column in a table using the dataTypes argument.
This argument is a mapping of column names to data type specifiers.
You can use the dict function to create the mapping.
Examples¶
dict()dict("columnA", "string")dict("columnB", "int:decimal=round", "columnC", "float:error=null", "columnD", "bool")dict("columnE", "date", "columnF", "datetime", "columnG", "time")dict("columnE", "date:%Y-%m-%d", "columnF", "datetime:%Y-%m-%d %H:%M:%S", "columnG", "time:%H:%M:%S")
Column names¶
- Column names are case-insensitive
- If a column is present in
dataTypesbut is not present in the table or in the SQL query results, it is ignored. - If a column is present in the source table or in the SQL query results, but not in the
dataTypesmapping, the behaviour depends on the function. See the documentation for the specific function for details.
Data type specifiers¶
The data type specifier is a string that describes the expected or desired data type of the column. It may be one of the following:
intint:error=…,decimal=…floatfloat:error=…boolstringdatedate:FORMATdatetimedatetime:FORMATtimetime:FORMAT
Date and time formats¶
FORMAT is a
strptime/strftime format string.
When omitted, a few variants of ISO 8601 are attempted when reading, and the
following defaults are used when writing:
date⇒date:%Y-%m-%ddatetime⇒datetime:%Y-%m-%d %H:%M:%Stime⇒time:%H:%M:%S
Numbers (int and float)¶
intmeans INTEGER, i.e., a whole number.floatmeans FLOAT, i.e., a real number, stored using the IEEE 754 standard.int:error=…andfloat:error=…are available since Autory v3. They allow you to specify how to handle errors when converting strings to numbers. The following options are available:error=raise(default): Raise an error. This stops the run. This is useful for ensuring that all numbers are valid.error=null: ReturnNULL(a.k.a.Nonein Python, or an empty cell in Excel).error=0: Return a specific number (e.g.,0).
int:decimal=…is available since Autory v3. It allows you to specify howfloats are converted toints.decimal=raise: Raise an error. This stops the run. This is useful when you need to ensure that you do not lose any precision.decimal=floor(default): Round towards negative infinity. For positive numbers, this is equivalent to throwing away the decimals. This uses Python'sfloorfunction.decimal=ceil: Round towards positive infinity. For negative numbers, this is equivalent to throwing away the decimals. This uses Python'sceilfunction.decimal=round: Round the number if the string cannot be converted to a number. This uses Python'sroundfunction, which rounds to the nearest even number in case of a tie.
int:error=…,decimal=…is available since Autory v3. It allows you to specify both error handling and rounding.
Rationale¶
- Since CSV files are text files, they contain only strings. There is no standard way of representing data types in CSV files, so we need to explicitly state which columns are expected to contain dates, booleans, etc.
- Excel tables do represent data types, but SQL queries like
SELECT foo AS barmay create columns that are not present in the original table, and so we might want to explicitly specify the data type of these columns. - Not all SQL databases support all data types, so we might want to re-interpret a string as a date/time after running the SQL query, according to our preferred date/time format. See SQLite data types for more details.
- We might want to override the data type of a column for whatever reason.
- There is no single standard for representing dates/times, booleans or numbers as strings.
Interpreting strings as dates and times¶
There are many valid ways to represent a date/time as a string, e.g.:
2 November 202020201102T00:00:00+Z2020-11-022020-02-11
Some strings or numbers may look like dates, while they may represent something else, e.g.:
- The number
20201102 - The name
Jan
Although there are good standards for representing dates as strings, like ISO 8601, you will encounter many other standards in practice. Therefore, it is impossible to reliably detect date/time columns from strings without false positives and misinterpreted data. Instead, you should explicitly state which columns are expected to contain dates, especially when reading from a CSV file.
Interpreting strings as booleans¶
There are many valid ways to represent a boolean as a string, e.g.:
true/falseTrue/False(Python)TRUE/FALSE(Excel)yes/no1/0(SQLite)on/off
Some strings or numbers may look like booleans, while they may represent something else, e.g.:
- The number
1 - The abbreviation
no
Therefore, it is impossible to reliably detect boolean columns from strings without false positives and misinterpreted data. Instead, you should explicitly state which columns are expected to contain booleans, especially when reading from a CSV file.
Interpreting strings as numbers¶
There are many valid ways to represent a number as a string. Each way is further complicated by the fact that there are regional differences in the characters used to separate thousands and decimals, e.g.:
- Integers:
1,10001 000,1_000 - Floating point numbers:
1.1,1.10,1,000.1,1,000.10,1 000.1,1 000.10,1_000.0,1_000.10 - Scientific notation:
1e3,1e+3,1e-3,1.0e3,1.0e+3,1.0e-3,1,000.0e3,1,000.0e+3,1,000.0e-3,1 000.0e3,1 000.0e+3,1 000.0e-3,1_000.0e3,1_000.0e+3,1_000.0e-3 - Numbers that use bases other than decimal:
- Binary:
0b1010,1010(easily confused with base 10) - Octal:
0o12,12(easily confused with base 10) - Hexadecimal:
0xDEADBEEF,DEADBEEF(easily confused with words)
- Binary:
Some strings may look like numbers, while they may represent something else, e.g.:
- RSA ID numbers.
- Cellphone numbers.
- Dates in
YYYYMMDDformat.
Therefore, it is impossible to reliably detect number columns from strings without false positives and misinterpreted
data. Luckily, Excel tables and all known SQL implementations support numeric data types. Numeric detection is usually
only required when reading from CSV files. See csvToTable for details.
NULL¶
What is NULL?¶
NULL (a.k.a. empty / missing values) is a special value that represents the absence of a value.
Each programming language and each file format has different ways of representing NULL:
NULL(SQL)None(Python)null(JSON, JavaScript, Java, C#, etc.)- JavaScript has another special value called
undefined, which is distinct fromnull, but similar. nil(Ruby, Swift, etc.)- Empty cells in Excel are often considered
NULL, but unfortunately not consistently so. It depends on the formula that references the cell, and the number format of the cell.
Things that can't be represented in a specific file format are frequently replaced with NULL
when there is no better alternative.
NULL is sometimes conflated with, but technically distinct from:
- An empty string, which is a string with zero characters.
- Whitespace-only strings.
- The number zero.
- The boolean
false. - Not a Number (
NaN), which is a special floating point value that represents an undefined or unrepresentable value in some programming languages. - Other special values, like infinity.
- Error values, such as
#N/Aor#DIV/0!in Excel. - See the Wikipedia disambiguation page.
Interpreting strings as NULL¶
While some text-based file formats (like JSON and YAML) have explicit support for NULL values,
others (like CSV) have no standard way of representing NULL values.
In practice, there are many different ways to do this in text files, e.g.:
- An empty string.
- The string
NULL,null,NaN,nan,N/A,n/a,-, etc.
The pandas.read_csv function in Python uses
an extensive list of these
by default.
However, there are many situations in which these detections may be false positives, e.g.:
- The string
N/Amay be a valid value in a column that contains survey responses. - "Null" is a surname. Yes, really.
Putting quotes around such strings does not help, either,
as quotes in CSV files are treated as optional in almost all CSV parsers,
with no standard to specify that "null" is any different from null.
How Autory handles NULL detection¶
When reading from CSV files, Autory will not make any assumptions about which strings represent NULL,
except for the empty string "" (with or without quotes).
See csvToTable for details.
The reason for treating empty strings as NULL by default can be illustrated with an example CSV file:
a,b,c
1,"one",True
,,
2,"two",False
"","",""
3,"three",True
In this example, the second and fourth rows contain empty fields for all columns.
Column a is clearly numeric, column b is clearly string (text), and column c is clearly boolean.
Treating empty string as strings would not cause problems in column b,
but it would cause columns a and c to have heterogeneous data types, which is usually not desired:
- Column
a: Mixture of integers and strings. - Column
b: Just strings. No problem. - Column
c: Mixture of booleans and strings.
Treating empty strings as NULL resolves this issue and yields homogeneous data types,
since NULL is not considered as a separate data type, but rather as the absence of a value:
- Column
a: Integers. - Column
b: Strings. - Column
c: Booleans.
Nullability¶
A nullable type is a data type that does not require a value to be present.
In other words, it allows NULL values.
In many SQL databases, columns may be defined as
NULL (i.e., some rows might have nothing in that column)
or NOT NULL (i.e., every row must have a value in that column).
In Autory all data types are considered nullable.
SQLite data types¶
SQLite has no explicit date/time or boolean data types.
It is left to the user to decide how to represent these values in the database as strings or integers.
Good choices are ISO 8601 (strings) for dates/times
and 0/1 (integers) for booleans.
However, you will find many other representations in practice,
like Unix time or regional date/time formats.
SQLite does not enforce data types on columns by default. This feels like a great feature at first, but quickly leads to unexpected results in serious use cases. When Autory writes data to an SQLite database, it will emit a warning if a column contains values of more than one data type.
See the official SQLite documentation on data types for more details.
Homogeneous and heterogeneous data types¶
Definitions¶
A column (or any collection of values) is said to have a
- homogeneous data type
if all values in the column are of the same data type
(or
NULL, since all data types are nullable in Autory). - heterogeneous data type
if it contains values of more than one data type
(excluding
NULLvalues, since all data types are nullable in Autory).
Examples¶
1,2,3,NULL,5⇒ Homogeneous (there are only integers)1.0,2.5,3.14,NULL,0.0⇒ Homogeneous (there are only floats)True,False,NULL,True⇒ Homogeneous (there are only booleans)"apple","banana","cherry",NULL⇒ Homogeneous (there are only strings)1,2.5,3,NULL,0.0- In theory, integers and floats are different data types, so this is heterogeneous.
- In practice, since integers can easily be represented as floats (
1 == 1.0), this is treated as homogeneous.
1,True,"apple",NULL⇒ Heterogeneous, since there are integers, booleans, and strings.
Why homogeneous data types are important¶
- It is faster to process.
- It uses less memory.
- It is easier to analyze and visualize, e.g.:
- If a column contains both numbers and strings, you cannot calculate the sum or average of the numbers.
- If a column contains both dates and strings, you cannot plot a time series.
Error values¶
When Autory encounters an error while calculating the result of a variable, it will store an error value in the results instead of crashing the entire run. This can happen when dividing by zero, trying to convert an invalid string to a number, misspelling a variable name or function name, trying to optimize something that does not converge, etc.
As mentioned above, error values are not the same thing as NULL values.
Therefore, any column that contains one or more error values is considered to have a heterogeneous data type.
Autory will emit a warning when this happens.
Sometimes it's useful to ignore errors, replacing them with NULL instead.
The autory convert results command
has the --encode-error-values null option for this purpose.
Use this with caution, as it may hide problems in your model or data.