roboto.domain.topics.parquet.arrow_to_roboto

Module Contents

roboto.domain.topics.parquet.arrow_to_roboto.arrow_type_to_canonical_type(arrow_type)
Parameters:

arrow_type (pyarrow.DataType)

Return type:

roboto.domain.topics.record.CanonicalDataType

roboto.domain.topics.parquet.arrow_to_roboto.compute_boolean_statistics(data)
Parameters:

data (Union[pyarrow.Array, pyarrow.ChunkedArray])

Return type:

dict[str, Any]

roboto.domain.topics.parquet.arrow_to_roboto.compute_dictionary_metadata(column_name, data, max_dictionary_size=2048)
Parameters:
  • column_name (str)

  • data (Union[pyarrow.Array, pyarrow.ChunkedArray])

  • max_dictionary_size (int)

Return type:

dict[str, Any]

roboto.domain.topics.parquet.arrow_to_roboto.compute_field_metadata(parser, column_name, field_path, canonical_data_type, is_inside_list=False)

Compute metadata including statistics for a field.

Handles both top-level and nested fields, extracting data appropriately based on the field’s location in the schema hierarchy.

Parameters:
Returns:

Metadata dictionary with statistics if applicable.

Return type:

dict[str, Any]

roboto.domain.topics.parquet.arrow_to_roboto.compute_numeric_statistics(data)
Parameters:

data (Union[pyarrow.Array, pyarrow.ChunkedArray])

Return type:

dict[str, Any]

roboto.domain.topics.parquet.arrow_to_roboto.generate_message_path_requests(parser, timestamp, max_depth=10)

Generate AddMessagePathRequest objects for all fields in a Parquet schema.

Traverses the schema recursively to generate message paths for nested types (structs, lists) in addition to top-level fields.

Parameters:
Yields:

AddMessagePathRequest objects for each field and nested field in the schema.

Return type:

Generator[roboto.domain.topics.operations.AddMessagePathRequest, None, None]

Examples

For a schema with a struct column position: struct<x: float, y: float>:
  • Yields position (Object)
  • Yields position.x (Number)
  • Yields position.y (Number)

For a schema with values: list<float64>:
  • Yields values (NumberArray)

For a schema with points: list<struct<x: float, y: float>>:
  • Yields points (Array)
  • Yields points.x (Number)
  • Yields points.y (Number)

roboto.domain.topics.parquet.arrow_to_roboto.get_list_element_data(parser, column_name, field_path)

Extract flattened data from list columns for statistics computation.

For list<primitive> columns, flattens all list elements into a single array. For list<struct> columns, flattens and then accesses the struct field.

Parameters:
Returns:

The flattened Array or ChunkedArray suitable for statistics computation.

Return type:

Union[pyarrow.Array, pyarrow.ChunkedArray]

roboto.domain.topics.parquet.arrow_to_roboto.get_nested_column_data(parser, column_name, field_path)

Extract data for nested fields from a PyArrow table.

Navigates through struct fields using the provided field path to extract the data for a nested field.

Parameters:
Returns:

The extracted Array or ChunkedArray for the nested field.

Raises:

KeyError – If a field in the path does not exist.

Return type:

Union[pyarrow.Array, pyarrow.ChunkedArray]

roboto.domain.topics.parquet.arrow_to_roboto.logger

roboto.domain.topics.parquet.arrow_to_roboto.sanitize_column_name(field)
Parameters:

field (pyarrow.Field)

Return type:

str