roboto.domain.topics.parquet.arrow_to_roboto

Module Contents

roboto.domain.topics.parquet.arrow_to_roboto.arrow_type_to_canonical_type(arrow_type)

Parameters: arrow_type (pyarrow.DataType)
Return type: roboto.domain.topics.record.CanonicalDataType

roboto.domain.topics.parquet.arrow_to_roboto.compute_boolean_statistics(data)

Parameters: data (Union[pyarrow.Array, pyarrow.ChunkedArray])
Return type: dict[str, Any]

roboto.domain.topics.parquet.arrow_to_roboto.compute_dictionary_metadata(column_name, data, max_dictionary_size=2048)

Parameters:

column_name (str)
data (Union[pyarrow.Array, pyarrow.ChunkedArray])
max_dictionary_size (int)

Return type:

dict[str, Any]

roboto.domain.topics.parquet.arrow_to_roboto.compute_field_metadata(parser, column_name, field_path, canonical_data_type, is_inside_list=False)

Compute metadata including statistics for a field.

Handles both top-level and nested fields, extracting data appropriately based on the field’s location in the schema hierarchy.

Parameters:

parser (roboto.domain.topics.parquet.parquet_parser.ParquetParser) – ParquetParser instance to read data from.
column_name (str) – Name of the top-level column.
field_path (list[str]) – List of field names to traverse (empty for top-level fields).
canonical_data_type (roboto.domain.topics.record.CanonicalDataType) – The canonical type of the field.
is_inside_list (bool) – Whether this field is inside a list (affects data extraction).

Returns:

Metadata dictionary with statistics if applicable.

Return type:

dict[str, Any]

roboto.domain.topics.parquet.arrow_to_roboto.compute_numeric_statistics(data)

Parameters: data (Union[pyarrow.Array, pyarrow.ChunkedArray])
Return type: dict[str, Any]

roboto.domain.topics.parquet.arrow_to_roboto.generate_message_path_requests(parser, timestamp, max_depth=10)

Generate AddMessagePathRequest objects for all fields in a Parquet schema.

Traverses the schema recursively to generate message paths for nested types (structs, lists) in addition to top-level fields.

Parameters:

parser (roboto.domain.topics.parquet.parquet_parser.ParquetParser) – ParquetParser instance containing the schema and data.
timestamp (roboto.domain.topics.parquet.timestamp.TimestampInfo) – Timestamp information for the topic.
max_depth (int) – Maximum recursion depth for nested types (default: 10).

Yields:

AddMessagePathRequest objects for each field and nested field in the schema.

Return type:

Generator[roboto.domain.topics.operations.AddMessagePathRequest, None, None]

Examples

For a schema with a struct column position: struct<x: float, y: float>:

- Yields position (Object)
- Yields position.x (Number)
- Yields position.y (Number)

For a schema with values: list<float64>:

- Yields values (NumberArray)

For a schema with points: list<struct<x: float, y: float>>:

- Yields points (Array)
- Yields points.x (Number)
- Yields points.y (Number)

roboto.domain.topics.parquet.arrow_to_roboto.get_list_element_data(parser, column_name, field_path)

Extract flattened data from list columns for statistics computation.

For list<primitive> columns, flattens all list elements into a single array. For list<struct> columns, flattens and then accesses the struct field.

Parameters:

parser (roboto.domain.topics.parquet.parquet_parser.ParquetParser) – ParquetParser instance to read data from.
column_name (str) – Name of the top-level column.
field_path (list[str]) – List of field names to traverse after flattening the list.

Returns:

The flattened Array or ChunkedArray suitable for statistics computation.

Return type:

Union[pyarrow.Array, pyarrow.ChunkedArray]

roboto.domain.topics.parquet.arrow_to_roboto.get_nested_column_data(parser, column_name, field_path)

Extract data for nested fields from a PyArrow table.

Navigates through struct fields using the provided field path to extract the data for a nested field.

Parameters:

parser (roboto.domain.topics.parquet.parquet_parser.ParquetParser) – ParquetParser instance to read data from.
column_name (str) – Name of the top-level column.
field_path (list[str]) – List of field names to traverse (excluding the column name).

Returns:

The extracted Array or ChunkedArray for the nested field.

Raises:

KeyError – If a field in the path does not exist.

Return type:

Union[pyarrow.Array, pyarrow.ChunkedArray]

roboto.domain.topics.parquet.arrow_to_roboto.logger

roboto.domain.topics.parquet.arrow_to_roboto.sanitize_column_name(field)

Parameters: field (pyarrow.Field)
Return type: str