Formatting data

treecat.format

treecat.format.cat(*paths)

Print .pkz files in human-readable form.

treecat.format.export_rows(schema, data)

Export multiple rows of internal data to json format.

Args:
schema: A schema dict as returned by load_schema().
data: An [N, R]-shaped numpy array of ragged data, where N is the number of rows and R = schema['ragged_index'][-1].
Returns:
An N-long list of sparse dicts mapping feature names to json values, where N is the number of rows.

treecat.format.guess_feature_type(count, values)

Guess the type of a feature, given statistics about the feature.

Args:
count: Total number of observations of the feature.
values: A list of unique observed values of the feature.
Returns:
One of: 'ordinal', 'categorical', or '' (the empty string).

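The rule treecat applies is not stated here, so the following is only a toy heuristic that shows how the two statistics (count and the list of unique values) can drive the three possible return values. The function name, the max_categories threshold, and every rule in it are assumptions made for illustration, not treecat's implementation.

```python
def sketch_guess_feature_type(count, values, max_categories=20):
    """Toy heuristic; the threshold and rules are assumptions, not treecat's."""
    if len(values) <= 1:
        # A feature with at most one distinct value carries no information.
        return ''
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        # Integer-valued features have a natural order.
        return 'ordinal'
    if len(values) <= max_categories and len(values) < count:
        # Few distinct values, each repeated at least sometimes.
        return 'categorical'
    # Too many distinct values, or every observation is unique.
    return ''


guessed = [
    sketch_guess_feature_type(100, [1, 2, 3]),
    sketch_guess_feature_type(100, ['red', 'green']),
    sketch_guess_feature_type(5, ['x']),
    sketch_guess_feature_type(3, ['a', 'b', 'c']),
]
```

In this sketch an empty string marks a feature that is left untyped, such as a constant column or a column of unique identifiers.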
treecat.format.guess_schema(data_csvs_in, types_csv_out, values_csv_out, encoding='utf-8')

Create best-guess types and values for a given dataset.

Common encodings include: utf-8, cp1252.
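The encoding argument matters because the same bytes decode differently under different codecs. The short sketch below uses only Python's built-in codecs (no treecat calls) to show text stored as cp1252 failing to decode as utf-8.

```python
# Bytes as a cp1252-encoded csv file would store the word "cafe" with an acute e.
raw = 'caf\u00e9'.encode('cp1252')

try:
    raw.decode('utf-8')
    decodes_as_utf8 = True
except UnicodeDecodeError:
    # 0xE9 starts a multi-byte sequence in utf-8, so decoding fails here.
    decodes_as_utf8 = False

# Decoding with the codec the file was written in recovers the text.
text = raw.decode('cp1252')
```

If reading a file with the default utf-8 raises a UnicodeDecodeError, passing encoding='cp1252' is a reasonable next thing to try for files exported from Windows tools.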

treecat.format.import_data(data_csvs_in, types_csv_in, values_csv_in, groups_csv_in, dataset_out, encoding='utf-8')

Import a comma-delimited list of csv files into internal treecat format.

Common encodings include: utf-8, cp1252.

treecat.format.import_rows(schema, rows)

Import multiple rows of json data to internal format.

Args:
schema: A schema dict as returned by load_schema().
rows: An N-long list of sparse dicts mapping feature names to values, where N is the number of rows. Extra keys and invalid values will be silently ignored.
Returns:
An [N, R]-shaped numpy array of ragged data, where N is the number of rows and R = schema['ragged_index'][-1].

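To make the [N, R] shape concrete, the sketch below assumes that 'ragged_index' holds cumulative column offsets, so that feature i owns columns ragged_index[i] to ragged_index[i+1], and that each value is stored as a one-hot count. Only the 'ragged_index' key comes from the text above; the other schema keys, the helper names, the one-hot encoding, and the use of nested lists in place of a numpy array are assumptions for illustration.

```python
# Hypothetical schema: only 'ragged_index' is named in the documentation;
# 'feature_names' and 'values' are illustrative assumptions.
schema = {
    'feature_names': ['color', 'size'],
    'values': {'color': ['red', 'green', 'blue'], 'size': ['S', 'L']},
    'ragged_index': [0, 3, 5],  # cumulative offsets, so R = 5
}


def sketch_import_rows(schema, rows):
    """Encode sparse dicts as N rows of R one-hot counts (nested lists)."""
    width = schema['ragged_index'][-1]
    data = []
    for row in rows:
        encoded = [0] * width
        for i, name in enumerate(schema['feature_names']):
            value = row.get(name)
            if value in schema['values'][name]:  # skip missing or invalid values
                offset = schema['ragged_index'][i]
                encoded[offset + schema['values'][name].index(value)] = 1
        data.append(encoded)
    return data


def sketch_export_rows(schema, data):
    """Decode one-hot rows back into sparse dicts, omitting unobserved features."""
    rows = []
    for encoded in data:
        row = {}
        for i, name in enumerate(schema['feature_names']):
            beg, end = schema['ragged_index'][i:i + 2]
            block = encoded[beg:end]
            if 1 in block:
                row[name] = schema['values'][name][block.index(1)]
        rows.append(row)
    return rows


rows = [
    {'color': 'green', 'size': 'L'},
    {'size': 'S', 'extra': 1},  # extra key is ignored
    {'color': 'pink'},          # invalid value is ignored
]
data = sketch_import_rows(schema, rows)
exported = sketch_export_rows(schema, data)
```

Under these assumptions the round trip drops the extra key and the invalid value, which is why the exported dicts are sparse.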
treecat.format.join_csvs(column, csvs_in, csv_out, encoding_in='utf-8', encoding_out='utf-8')

Outer join a comma-delimited list of csvs on a given column.

Common encodings include: utf-8, cp1252.

treecat.format.pd_outer_join(dfs, on)

Outer-join an iterable of pandas dataframes on a given column.

Args:
dfs: An iterable of pandas dataframes.
on: A column name or list of column names.
Returns:
A pandas dataframe whose columns are the union of columns in dfs, and whose rows are the union of rows joined on 'on'.

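One way to obtain the behavior described above is to fold pandas.merge with how='outer' over the dataframes. This is a minimal sketch under that assumption; the helper name sketch_outer_join is hypothetical and the code is not taken from treecat.

```python
from functools import reduce

import pandas as pd


def sketch_outer_join(dfs, on):
    """Hypothetical stand-in: outer-join an iterable of dataframes on `on`."""
    return reduce(
        lambda left, right: pd.merge(left, right, on=on, how='outer'),
        dfs,
    )


a = pd.DataFrame({'id': [1, 2], 'color': ['red', 'green']})
b = pd.DataFrame({'id': [2, 3], 'size': ['S', 'L']})

# Rows with id 1 and id 3 appear in only one input and get NaN in the other's column.
joined = sketch_outer_join([a, b], on='id')
```

The result has one row per distinct id and the columns id, color, and size.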
treecat.format.pickle_dump(data, filename)

Serialize data to file using gzip compression.

treecat.format.pickle_load(filename)

Deserialize data from a gzip-compressed file.
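The pair of functions above can be approximated with the standard library's gzip and pickle modules. The sketch below is an assumption about how such helpers might look, not treecat's code; the sketch_ names are hypothetical, and the .pkz extension is borrowed from the description of cat().

```python
import gzip
import os
import pickle
import tempfile


def sketch_pickle_dump(data, filename):
    """Hypothetical stand-in: pickle `data` into a gzip-compressed file."""
    with gzip.open(filename, 'wb') as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)


def sketch_pickle_load(filename):
    """Hypothetical stand-in: read back an object written by sketch_pickle_dump."""
    with gzip.open(filename, 'rb') as f:
        return pickle.load(f)


# Round trip through a temporary file.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'example.pkz')
    original = {'feature_names': ['color', 'size'], 'counts': [3, 1, 4]}
    sketch_pickle_dump(original, path)
    restored = sketch_pickle_load(path)
```

As with any pickle-based format, such files should only be loaded from trusted sources, because unpickling can execute arbitrary code.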