1.1.8. etl
The etl command line interface (CLI) provides functionality to preprocess
battery data provided by foxBMS 2.
etl is an abbreviation for Extract, Transform and Load, a common
approach in data engineering.
With etl, a data pipeline can be defined that extracts data from
different data sources, transforms it into a uniform data format, and loads
it into a database (data warehouse) for further analysis.
Currently, the fox CLI supports a log file from a CAN bus as data
source and provides a filter (extract) function to select the desired CAN
messages from it.
Subsequently, the filtered data can be decoded (transform) and converted to
a structured data format (transform).
This description is divided into the following sections:
1.1.8.1. Preprocessing Concept
The goal is to transform the data sent over the CAN bus into a structured data format (table) that contains all information of the system at each point in time for later analysis or visualization.
Typically, multiple devices are connected to a CAN bus, and each device periodically sends its messages. CAN messages contain an ID and one or multiple signals, where each signal has a name, a value in hexadecimal representation and a physical unit. A log file of a CAN bus could look like
...
1212239.051044 8 240 Rx D 8 00 08 00 40 02 00 10 01
1212239.057110 8 35C Rx D 6 00 0A FF FF F6 FB
1212239.070987 8 240 Rx D 8 00 08 00 40 02 00 10 01
1212239.077085 8 35C Rx D 6 00 0B FF FF F7 03
1212239.091120 8 240 Rx D 8 00 08 00 3F FE 00 10 01
1212239.097439 8 35C Rx D 6 00 0C FF FF F6 E9
1212239.111037 8 240 Rx D 8 00 08 00 3F FE 00 10 01
1212239.117353 8 35C Rx D 6 00 0D FF FF F6 E8
1212239.130953 8 240 Rx D 8 00 08 00 3F FE 00 10 01
1212239.137263 8 35C Rx D 6 00 0E FF FF F6 D0
1212239.151173 8 240 Rx D 8 00 08 00 40 02 00 10 01
1212239.157239 8 35C Rx D 6 00 0F FF FF F6 CE
...
where the first column is the timestamp, the third column is the CAN ID and the data of the signals begins at column 7. For simplification, the CAN log file contains only the system current and a few cell voltages. The data in the CAN log file can be seen as an irregular time series (varying time interval between two timestamps) with missing values (no current value is available at the moment the cell voltages were sent) in a semi-structured data format, caused by the serial communication of the CAN bus.
To avoid unnecessary payload in the later preprocessing steps,
specific CAN messages can be filtered out by etl.
The messages can be filtered by their ID or by their number of occurrences,
so that, e.g., only every 10th occurrence of a message remains in the
resulting CAN log file.
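Conceptually, the ID filtering and sampling combine as in the following minimal sketch (a simplification for illustration, not etl's actual implementation; the column layout matches the log excerpt above):

from collections import defaultdict

# Keep only IDs 240 and 35C; of ID 240, keep every 10th occurrence
SAMPLING = {"240": 10, "35C": 1}
ID_POS = 2  # position of the CAN ID column in the log file

def filter_messages(lines):
    counts = defaultdict(int)
    for line in lines:
        can_id = line.split()[ID_POS]
        if can_id not in SAMPLING:
            continue  # drop messages with unwanted IDs
        counts[can_id] += 1
        if counts[can_id] % SAMPLING[can_id] == 0:
            yield line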
Afterwards, the filtered CAN messages can be decoded and sorted by their ID into separate files, so that each file contains a regular time series without missing values. The format used for the decoded messages is JSON (one object per line). One of those files is depicted below
...
{"Timestamp": 1212239.057110,...,"0x35C_IVT0_Result_I_mA":-2309}
{"Timestamp": 1212239.077085,...,"0x35C_IVT0_Result_I_mA":-2301}
{"Timestamp": 1212239.097439,...,"0x35C_IVT0_Result_I_mA":-2327}
{"Timestamp": 1212239.117353,...,"0x35C_IVT0_Result_I_mA":-2328}
{"Timestamp": 1212239.137263,...,"0x35C_IVT0_Result_I_mA":-2352}
...
where the first key-value pair is the timestamp and the following
pairs are the signals.
For simplification, all decoded CAN messages were shortened.
The key of a signal is a compound of the CAN ID (hex),
the signal name and the physical unit (e.g., mA).
Note
By default the physical unit of the timestamps is set to seconds.
To transform the decoded CAN messages into a structured format (table), the keys are set as column names and the values are used as rows. At this point, the timestamp column is replaced by a date column in UTC format, where each timestamp is mapped to a date with respect to the start date of the logging. Each of the aforementioned tables contains a regular time series. The resulting table for one type of CAN message is depicted in the table below.
| Date | … | 860_CurrentSensor_SIG_Current_mA |
|---|---|---|
| … | … | … |
| 2024-01-01 00:00:00.057110 | … | -2309 |
| 2024-01-01 00:00:00.077085 | … | -2301 |
| 2024-01-01 00:00:00.097439 | … | -2327 |
| 2024-01-01 00:00:00.117353 | … | -2328 |
| 2024-01-01 00:00:00.137263 | … | -2352 |
| … | … | … |
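The conversion of a file with decoded CAN messages (JSON lines) into such a table can be sketched with pyarrow (the file name is hypothetical; the timestamp-to-date conversion is omitted here):

import pyarrow.json

# Each line of the decoded file is one JSON object -> one table row
table = pyarrow.json.read_json("decoded_0x35C.json")
print(table.column_names)  # ['Timestamp', ..., '0x35C_IVT0_Result_I_mA']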
To obtain one regular time series, all time series can be combined by
a left join.
As left join method for time series, etl uses
join_asof
of Apache Arrow.
In the context of lithium-ion batteries, most measurements are galvanostatic
and therefore the current, as the system excitation, is a good candidate for
the left table in the join.
All other tables are used as right tables in the join_asof step.
A table after the join could look like
| Date | 576_cellVoltage_000_mV | 860_CurrentSensor_SIG_Current_mA | … |
|---|---|---|---|
| … | … | … | … |
| 2024-01-01 00:00:00.057110 | 4096 | -2309 | … |
| 2024-01-01 00:00:00.077085 | 4096 | -2301 | … |
| 2024-01-01 00:00:00.097439 | 4096 | -2327 | … |
| 2024-01-01 00:00:00.117353 | 4096 | -2328 | … |
| 2024-01-01 00:00:00.137263 | 4096 | -2352 | … |
| … | … | … | … |
where all columns are sorted alphanumerically.
Important
If the values of the selected left table occur less frequently
than values from the right tables, some rows of the right tables will
be discarded in the join_asof step!
Note
By default, the method join_asof is configured to use previous values
to fill missing values; therefore, the first rows of the joined table can
contain missing values, because no previous values are available at that
point in time.
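This join behavior can be illustrated with the pyarrow API; a minimal sketch with made-up table contents (etl's internal setup may differ):

import pyarrow as pa

# Current (system excitation) as left table; Date as int64 microseconds
current = pa.table({
    "Date": pa.array([57110, 77085, 97439], type=pa.int64()),
    "860_CurrentSensor_SIG_Current_mA": [-2309, -2301, -2327],
})
# Cell voltages as right table
voltages = pa.table({
    "Date": pa.array([51044, 70987, 91120], type=pa.int64()),
    "576_cellVoltage_000_mV": [4096, 4096, 4096],
})

# Negative tolerance: for each left row, match the closest right row that is
# at most 100000 microseconds earlier (i.e., fill with previous values);
# by=[] since no additional grouping key is needed here
joined = current.join_asof(voltages, on="Date", by=[], tolerance=-100000)
print(joined.to_pandas())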
Furthermore, to extract data from specific file formats (e.g., from
measurement devices such as the Gamry Reference 3000), the fox CLI provides
the convert subcommand, which converts the aforementioned files to .csv or
.parquet files.
For use cases without a specific “system excitation”, e.g., debugging, the
etl command offers a method to outer join all tables.
Important
Outer joins are slower and generate significantly larger tables, especially when the input tables are large!
1.1.8.2. Database & Data Analytics Engine/Libraries
Battery data processed as described above are in a structured data format; therefore, in the following, we only consider databases and data analytics engines/libraries that handle such structured data.
The considered database management systems (DBMS) as well as the data
models of data analytics libraries/engines can be categorized into row or
column oriented.
Most queries in the battery context will read many values
from a few columns.
Hence, from a performance perspective, column oriented systems should be
preferred, and therefore the etl command mostly supports column
oriented file formats and databases.
A DBMS stores the data and usually provides a SQL interface to query data exceeding the main memory of the host system. Known column oriented databases are DuckDB and ClickHouse; known row oriented databases are MySQL and PostgreSQL. Time series databases such as InfluxDB are a special class of DBMS providing high performance for time series data with queries affecting the timestamp/date column.
Data analytics libraries provide no data storage and are limited by the main memory of the host system. Most data analytics libraries use an object oriented interface to analyze data stored in files, which increases their usability. Usually, the data model of such libraries is column oriented; known examples are Pandas and Apache Arrow.
To overcome the hardware limitations of data analytics libraries, data analytics engines were developed, which additionally provide a task manager to distribute analytic tasks between computers within a cluster. Apache Spark is one such engine.
At the moment, a load functionality to directly upload the data to a database is not implemented. Hence, the data can only be uploaded into a database via the native file import of the database or by other tools. Most data analytics libraries/engines are able to query the data directly from files, where we recommend using the supported Apache Parquet file format.
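For example, a joined table stored as Parquet can be queried in place; a minimal sketch with DuckDB (the file name is hypothetical, the column name is taken from the examples above):

import duckdb

# Query the Parquet file directly; no prior import into a database is needed
result = duckdb.sql(
    'SELECT avg("860_CurrentSensor_SIG_Current_mA") AS mean_current_mA '
    "FROM 'joined.parquet'"
)
print(result)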
1.1.8.3. Usage
The etl command is divided into multiple subcommands, each providing
specific functionalities described in the previous sections.
Below, the general help text of the etl command gives an overview of all
subcommands.
Usage: fox.py etl [OPTIONS] COMMAND [ARGS]...
ETL command group entry point.
These scripts and tools will simplify the collection of foxBMS 2 data and
their analysis.
Options:
-h, --help Show this message and exit.
Commands:
convert Configure conversion defaults for ETL convert subcommands.
decode Decode CAN messages from standard input.
filter Filter out unwanted CAN messages from input stream or file.
table Convert files with decoded CAN message (JSON) to one or...
The filter and decode subcommands expect a data stream as input, which can
be provided by the command cat.
Note
Piping to standard input only works when the required Python environment is installed.
Get-Content -Raw CAN_LOG_FILE | .\fox.ps1 etl decode -c DECODE_CONFIG_FILE -o OUTPUT_DIRECTORY
cat CAN_LOG_FILE | ./fox.sh etl decode -c DECODE_CONFIG_FILE \
-o OUTPUT_DIRECTORY
Moreover, the filter subcommand provides a data stream as output and can therefore be used ahead of the decode subcommand.
Get-Content -Raw CAN_LOG_FILE | .\fox.ps1 etl filter -c FILTER_CONFIG_FILE | .\fox.ps1 etl decode -c DECODE_CONFIG_FILE -o OUTPUT_DIRECTORY
cat CAN_LOG_FILE | ./fox.sh etl filter -c FILTER_CONFIG_FILE \
| ./fox.sh etl decode -c DECODE_CONFIG_FILE -o OUTPUT_DIRECTORY
More complex data pipelines can be created with Apache Airflow or Azure Data Factory.
Important
The counting for all positions in the following command options starts at 0.
1.1.8.3.1. filter Usage
The filter subcommand is used to filter out CAN messages from a CAN log file as described above. The input and output of the command are data streams. The subcommand is executed as described below.
Usage: fox.py etl filter [OPTIONS] INPUT
Filter out unwanted CAN messages from input stream or file.
The subcommand writes the filtered CAN messages to standard output or to a
file.
Options:
-c, --config FILE A configuration file (YML) to define the
filter setup
-o, --output FILE Stores the filter output to a file;
otherwise stdout is used
-ip, --id-pos INTEGER Index of the CAN ID column in the log file
-i, --ids TEXT IDs to be filtered
-s, --sampling <TEXT INTEGER>...
-v, --verbose Verbose information.
-h, --help Show this message and exit.
A configuration file of the subcommand could look like
id_pos: 2
ids: ["240", "35C"]
sampling:
"240": 10
"35C": 1
The key id_pos defines the position of the CAN IDs in the CAN log file, and
ids is a list of all CAN IDs that should be included in the resulting file.
The optional parameter sampling filters the CAN messages based on their
occurrence, e.g., a value of 10 keeps only every 10th occurrence of the
respective message.
The example configuration file can be downloaded
here.
1.1.8.3.2. decode Usage
The decode subcommand is used to decode CAN messages as described above. The input of the command is a data stream. The subcommand is executed as described below.
Usage: fox.py etl decode [OPTIONS]
Decode CAN messages from standard input.
Decoded CAN messages are saved in separate files (JSON) in the output
directory.
Options:
-c, --config FILE A configuration file (YML) to define the
decoding parameters
-d, --dbc FILE Path to the DBC file
-tp, --timestamp-pos INTEGER Index of the timestamp column in the log file
-ip, --id-pos INTEGER Index of the CAN ID column in the log file
-dp, --data-pos INTEGER Index of the data column in the log file
-o, --output DIRECTORY Directory in which the files with decoded CAN
messages are saved
-v, --verbose Verbose information.
-h, --help Show this message and exit.
A configuration file of the subcommand could look like
dbc: foxbms.dbc
timestamp_pos: 0
id_pos: 2
data_pos: 6
The key dbc defines the path to the used DBC file, timestamp_pos is the
column position of the timestamp within the CAN log file, id_pos is the
column position of the CAN IDs, and data_pos is the start column position of
the data in each message. The example configuration file can be downloaded
here.
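For illustration, decoding a single frame with a DBC file can be sketched with the cantools library (illustration only; this does not imply that fox uses cantools internally, and the decoded signal names depend on the DBC file):

import cantools

# Load the DBC file that defines the messages and their signal scaling
db = cantools.database.load_file("foxbms.dbc")

# One raw frame from the log excerpt above: CAN ID 0x35C with 6 data bytes
decoded = db.decode_message(0x35C, bytes.fromhex("000AFFFFF6FB"))
print(decoded)  # dict mapping signal names to physical values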
1.1.8.3.3. table Usage
The table subcommand is used to convert the decoded CAN messages to tables as described above. Optionally, the table subcommand can be used to join these tables into one table. The subcommand is executed as described below.
Usage: fox.py etl table [OPTIONS] DATA
Convert files with decoded CAN message (JSON) to one or multiple tables.
The input is either one file with decoded CAN messages or a directory
containing files with decoded CAN messages. The output of the table command
depends on the use-case, which is defined by the configuration file and
whether the input or output is a file or directory. The use-cases are
One to One: etl table -c config.yml -o output_file.csv input_file.json
Many to One: etl table -c config.yml -o output_file.csv input_directory
Many to Many: etl table -c config.yml -o output_directory input_directory
Options:
-c, --config FILE A configuration file to define the conversion of decoded
data (JSON) to a table [required]
-o, --output PATH Directory or file in which the table(s) should be saved
[required]
-v, --verbose Verbose information.
-h, --help Show this message and exit.
1.1.8.3.3.1. One to One:
In case only one file of decoded CAN messages should be converted to a table, the configuration file for the table subcommand could look like
start_date: "2024-01-01T00:00:00"
with the output parameter being the path to a file.
start_date defines the date (UTC) at which the CAN logging
started.
Note
If an output format is specified in the configuration file, the table command converts each JSON file into a separate table, as explained below in the Many to Many use case.
.\fox.ps1 etl table -c table_one_one.yml -o OUTPUT_FILE.csv INPUT_FILE.csv
./fox.sh etl table -c table_one_one.yml -o OUTPUT_FILE.csv INPUT_FILE.csv
1.1.8.3.3.2. Many to One:
If multiple files with decoded CAN messages should be converted and joined into one table, the configuration file for the table subcommand could look like
join_on: 860_CurrentSensor_SIG_Current_mA
start_date: "2024-01-01T00:00:00"
with join_on defining the column of the left table in the join.
Note
If ALL is used as join_on parameter, an outer join is used instead
of a left join.
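A configuration for an outer join of all tables could therefore look like

join_on: ALL
start_date: "2024-01-01T00:00:00"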
The output parameter must be the path to a file.
Note
If an output format is specified in the configuration file, the table command converts each JSON file into a separate table, as explained below in the Many to Many use case.
.\fox.ps1 etl table -c table_many_one.yml -o OUTPUT_FILE.csv INPUT_DIRECTORY
./fox.sh etl table -c table_many_one.yml -o OUTPUT_FILE.csv INPUT_DIRECTORY
1.1.8.3.3.3. Many to Many:
In case each file with decoded CAN messages should be converted to a table, without any join, the configuration file for the table command could look like
start_date: "2024-01-01T00:00:00"
output_format: "csv"
with output_format as csv or parquet defining the file format
in which all tables are saved.
The output parameter must be the path to a directory.
.\fox.ps1 etl table -c table_many_many.yml -o OUTPUT_DIRECTORY INPUT_DIRECTORY
./fox.sh etl table -c table_many_many.yml -o OUTPUT_DIRECTORY INPUT_DIRECTORY
Note
One file with decoded CAN messages cannot be converted to multiple tables!
If the timestamp values of a CAN log are not in seconds, the table subcommand
is able to correctly convert these values to the needed physical unit with the
optional parameter timestamp_factor.
Internally, all timestamp values are multiplied by the timestamp_factor
to interpret these values as a duration in microseconds.
The default value of timestamp_factor is 1000000, for timestamp values in
seconds.
If the timestamp values are in milliseconds, the timestamp_factor needs
to be 1000.
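The resulting mapping to the Date column can be sketched as follows (the reference point for the elapsed time, here the whole second of the first log entry, is an assumption for illustration):

from datetime import datetime, timedelta

start_date = datetime.fromisoformat("2024-01-01T00:00:00")
timestamp_factor = 1_000_000  # timestamps in seconds

log_start = 1212239.000000  # assumed start of the logging
timestamp = 1212239.057110  # a timestamp from the log excerpt above

elapsed_us = (timestamp - log_start) * timestamp_factor
print(start_date + timedelta(microseconds=elapsed_us))
# 2024-01-01 00:00:00.057110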
The aforementioned
join_asof defines with a tolerance parameter how
missing values are filled with previous values (forward fill).
By default, the tolerance value is set to -100000, where the minus sign
indicates a forward fill and the 100000 indicates the maximum considered
time difference.
This tolerance can be changed with the optional parameter tolerance
in the configuration file.
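For example, a Many to One configuration with a tolerance of 200 ms (200000 µs, assuming microsecond-based internal timestamps) could look like

join_on: 860_CurrentSensor_SIG_Current_mA
start_date: "2024-01-01T00:00:00"
tolerance: -200000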
All example configuration files for the table subcommand can be downloaded below:
1.1.8.3.4. convert Usage
The convert subcommand group provides utilities to convert measurement files into structured formats for downstream analysis.
Note
In contrast to the other commands, convert does not use a configuration file. Instead, all options are passed via the command line.
Currently supported file formats:
Input: Gamry .DTA files and GRAPHTEC .csv files
Output: CSV or Parquet
Usage: fox.py etl convert [OPTIONS] COMMAND [ARGS]...
Configure conversion defaults for ETL convert subcommands.
Options:
-r, --recursive Enables recursive file search in a
directory.
-o, --output-format [csv|parquet]
Defines the target file format.
-v, --verbose Verbose information.
-h, --help Show this message and exit.
Commands:
gamry Convert Gamry DTA files to CSV or PARQUET output.
graphtec Convert GRAPHTEC CSV files to normalized output formats.
Options (group-level):
--recursive: If set, subdirectories are scanned when the input is a directory.
--output-format: Target file format (CSV or PARQUET).
Subcommands:
Usage: fox.py etl convert gamry [OPTIONS] DATA
Convert Gamry DTA files to CSV or PARQUET output.
Options:
-s, --skip_footer INTEGER Skips a number of lines at the end (footer) of a
.dta file.
-h, --help Show this message and exit.
Usage: fox.py etl convert graphtec [OPTIONS] DATA
Convert GRAPHTEC CSV files to normalized output formats.
Options:
-s, --skip INTEGER Skips a specific number of lines at the beginning of a
file.
-h, --help Show this message and exit.
Behavior:
If DATA points to a file, exactly that file is converted.
If DATA points to a directory, all files matching the input suffix (e.g.,
.dta) are collected. When --recursive is set, subdirectories are included.
Output files are written next to the source with the appropriate extension
(.csv or .parquet).
# Convert a single .dta file to CSV
.\fox.ps1 etl convert --output-format CSV gamry -s 0 PATH\TO\INPUT.dta
# Convert all .dta files in a directory (non-recursive) to Parquet
.\fox.ps1 etl convert --output-format PARQUET gamry -s 0 PATH\TO\DIRECTORY
# Convert all .dta files in a directory (recursive) to CSV
.\fox.ps1 etl convert --recursive --output-format CSV gamry -s 0 PATH\TO\DIRECTORY
# Convert a single .dta file to CSV
./fox.sh etl convert --output-format CSV gamry -s 0 PATH/TO/INPUT.dta
# Convert all .dta files in a directory (non-recursive) to Parquet
./fox.sh etl convert --output-format PARQUET gamry -s 0 PATH/TO/DIRECTORY
# Convert all .dta files in a directory (recursive) to CSV
./fox.sh etl convert --recursive --output-format CSV gamry -s 0 PATH/TO/DIRECTORY
Notes:
--skip_footer applies to Gamry .dta files and is passed to the CSV reader to
ignore trailing lines.
Non-existent paths or permission issues are reported and cause the command to
exit with a non-zero status.