roboto.domain.topics.parquet.arrow_to_roboto
Module Contents
- roboto.domain.topics.parquet.arrow_to_roboto.arrow_type_to_canonical_type(arrow_type)
- Parameters:
arrow_type (pyarrow.DataType)
- Return type:
- roboto.domain.topics.parquet.arrow_to_roboto.compute_boolean_statistics(data)
- Parameters:
data (Union[pyarrow.Array, pyarrow.ChunkedArray])
- Return type:
dict[str, Any]
- roboto.domain.topics.parquet.arrow_to_roboto.compute_dictionary_metadata(column_name, data, max_dictionary_size=2048)
- Parameters:
column_name (str)
data (Union[pyarrow.Array, pyarrow.ChunkedArray])
max_dictionary_size (int)
- Return type:
dict[str, Any]
- roboto.domain.topics.parquet.arrow_to_roboto.compute_field_metadata(parser, column_name, field_path, canonical_data_type, is_inside_list=False)
Compute metadata including statistics for a field.
Handles both top-level and nested fields, extracting data appropriately based on the field’s location in the schema hierarchy.
- Parameters:
parser (roboto.domain.topics.parquet.parquet_parser.ParquetParser) – ParquetParser instance to read data from.
column_name (str) – Name of the top-level column.
field_path (list[str]) – List of field names to traverse (empty for top-level fields).
canonical_data_type (roboto.domain.topics.record.CanonicalDataType) – The canonical type of the field.
is_inside_list (bool) – Whether this field is inside a list (affects data extraction).
- Returns:
Metadata dictionary with statistics if applicable.
- Return type:
dict[str, Any]
- roboto.domain.topics.parquet.arrow_to_roboto.compute_numeric_statistics(data)
- Parameters:
data (Union[pyarrow.Array, pyarrow.ChunkedArray])
- Return type:
dict[str, Any]
- roboto.domain.topics.parquet.arrow_to_roboto.generate_message_path_requests(parser, timestamp, max_depth=10)
Generate AddMessagePathRequest objects for all fields in a Parquet schema.
Traverses the schema recursively to generate message paths for nested types (structs, lists) in addition to top-level fields.
- Parameters:
parser (roboto.domain.topics.parquet.parquet_parser.ParquetParser) – ParquetParser instance containing the schema and data.
timestamp (roboto.domain.topics.parquet.timestamp.TimestampInfo) – Timestamp information for the topic.
max_depth (int) – Maximum recursion depth for nested types (default: 10).
- Yields:
AddMessagePathRequest objects for each field and nested field in the schema.
- Return type:
Generator[roboto.domain.topics.operations.AddMessagePathRequest, None, None]
Examples
For a schema with a struct column position: struct<x: float, y: float>:
- Yields position (Object)
- Yields position.x (Number)
- Yields position.y (Number)
For a schema with values: list<float64>:
- Yields values (NumberArray)
For a schema with points: list<struct<x: float, y: float>>:
- Yields points (Array)
- Yields points.x (Number)
- Yields points.y (Number)
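The traversal order in these examples can be illustrated with a minimal, standard-library-only sketch. It does not use the module's real code or PyArrow: the tuple-based type model, the `NUMERIC` set, and the `canonical_name` and `walk` helpers are all hypothetical stand-ins, and the `"Unknown"` fallback is an assumption. Only the canonical names (Object, Number, NumberArray, Array) and the dotted path layout are taken from the examples above.

```python
from typing import Iterator, Tuple, Union

# Hypothetical stand-in for Arrow types (not the pyarrow API):
#   primitive -> a string such as "float"
#   struct    -> ("struct", {field_name: type})
#   list      -> ("list", element_type)
ArrowLike = Union[str, tuple]

# Assumed set of numeric primitive names for this sketch only.
NUMERIC = {"float", "float64", "int64"}


def canonical_name(t: ArrowLike) -> str:
    """Map a modeled type to the canonical names used in the examples."""
    if isinstance(t, str):
        return "Number" if t in NUMERIC else "Unknown"  # "Unknown" is an assumption
    kind, inner = t
    if kind == "struct":
        return "Object"
    # kind == "list": a list of numeric primitives becomes NumberArray
    if isinstance(inner, str) and inner in NUMERIC:
        return "NumberArray"
    return "Array"


def walk(
    name: str, t: ArrowLike, max_depth: int = 10, depth: int = 0
) -> Iterator[Tuple[str, str]]:
    """Yield (path, canonical type) for a field and its nested fields."""
    yield name, canonical_name(t)
    if depth >= max_depth or isinstance(t, str):
        return
    kind, inner = t
    if kind == "list":
        # For list<struct>, child paths hang directly off the list path.
        if isinstance(inner, tuple) and inner[0] == "struct":
            for child, child_type in inner[1].items():
                yield from walk(f"{name}.{child}", child_type, max_depth, depth + 1)
    elif kind == "struct":
        for child, child_type in inner.items():
            yield from walk(f"{name}.{child}", child_type, max_depth, depth + 1)


if __name__ == "__main__":
    points = ("list", ("struct", {"x": "float", "y": "float"}))
    for path, type_name in walk("points", points):
        print(path, type_name)
```

In this sketch, `max_depth` only bounds how far the recursion descends; how the real function behaves at the depth limit is not stated in this reference.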
- roboto.domain.topics.parquet.arrow_to_roboto.get_list_element_data(parser, column_name, field_path)
Extract flattened data from list columns for statistics computation.
For list<primitive> columns, flattens all list elements into a single array. For list<struct> columns, flattens and then accesses the struct field.
- Parameters:
parser (roboto.domain.topics.parquet.parquet_parser.ParquetParser) – ParquetParser instance to read data from.
column_name (str) – Name of the top-level column.
field_path (list[str]) – List of field names to traverse after flattening the list.
- Returns:
The flattened Array or ChunkedArray suitable for statistics computation.
- Return type:
Union[pyarrow.Array, pyarrow.ChunkedArray]
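As a rough illustration of the "flatten, then access the struct field" order described above, here is a sketch on plain Python lists rather than Arrow arrays. The `flatten_list_column` helper is hypothetical, and its handling of null lists and null elements is an assumption, not documented behavior of `get_list_element_data`.

```python
from itertools import chain
from typing import Any, List, Optional, Sequence


def flatten_list_column(
    rows: Sequence[Optional[list]], field_path: Sequence[str]
) -> List[Any]:
    """Flatten a list column, then follow field_path into each element.

    `rows` models a list column as one Python list (or None) per row.
    """
    # Step 1: concatenate every row's elements into one flat sequence.
    # Null rows are skipped (an assumption made for this sketch).
    flat: List[Any] = list(chain.from_iterable(r for r in rows if r is not None))

    # Step 2: for list<struct>, descend into the requested struct field(s).
    for name in field_path:
        flat = [None if element is None else element[name] for element in flat]
    return flat


if __name__ == "__main__":
    values = [[1.0, 2.0], [], [3.0]]
    print(flatten_list_column(values, []))

    points = [[{"x": 1.0, "y": 2.0}], None, [{"x": 3.0, "y": 4.0}]]
    print(flatten_list_column(points, ["x"]))
```

Flattening first means statistics are computed over every element across all rows, not per row.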
- roboto.domain.topics.parquet.arrow_to_roboto.get_nested_column_data(parser, column_name, field_path)
Extract data for nested fields from a PyArrow table.
Navigates through struct fields using the provided field path to extract the data for a nested field.
- Parameters:
parser (roboto.domain.topics.parquet.parquet_parser.ParquetParser) – ParquetParser instance to read data from.
column_name (str) – Name of the top-level column.
field_path (list[str]) – List of field names to traverse (excluding the column name).
- Returns:
The extracted Array or ChunkedArray for the nested field.
- Raises:
KeyError – If a field in the path does not exist.
- Return type:
Union[pyarrow.Array, pyarrow.ChunkedArray]
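The field-path navigation and the `KeyError` contract can be sketched with nested dictionaries standing in for struct columns. This is an assumption-laden model, not the module's implementation: the `nested_column` helper and the dict-based table layout are hypothetical.

```python
from typing import Any, Mapping, Sequence


def nested_column(
    table: Mapping[str, Any], column_name: str, field_path: Sequence[str]
) -> Any:
    """Walk from a top-level column down through struct fields.

    A struct column is modeled as a dict of child columns; a leaf column
    is modeled as a plain list of values.
    """
    data = table[column_name]  # raises KeyError if the column is missing
    for name in field_path:
        if not isinstance(data, Mapping) or name not in data:
            raise KeyError(f"Field '{name}' not found under column '{column_name}'")
        data = data[name]
    return data


if __name__ == "__main__":
    table = {
        "pose": {"position": {"x": [1.0, 2.0], "y": [3.0, 4.0]}},
        "speed": [0.5, 0.7],
    }
    print(nested_column(table, "pose", ["position", "x"]))
    print(nested_column(table, "speed", []))  # empty path: the top-level column
```

An empty `field_path` returns the top-level column unchanged, which matches the documented convention that the path excludes the column name.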
- roboto.domain.topics.parquet.arrow_to_roboto.logger
- roboto.domain.topics.parquet.arrow_to_roboto.sanitize_column_name(field)
- Parameters:
field (pyarrow.Field)
- Return type:
str