Formatting data

treecat.format

treecat.format.export_rows(schema, data)

Export multiple rows of internal data to json format.

Args:
    schema: A schema dict as returned by load_schema().
    data: An [N, R]-shaped numpy array of ragged data, where N is the number of rows and R = schema['ragged_index'][-1].
Returns:
    An N-long list of sparse dicts mapping feature names to json values, where N is the number of rows.
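The relationship R = schema['ragged_index'][-1] suggests that ragged_index holds cumulative column offsets, one block of columns per feature. The text does not state this, so the sketch below is only an illustration under that assumption; the feature names and widths are made up, and plain lists stand in for the numpy array.

```python
# ASSUMPTION: ragged_index[i] is the first column of feature i and the
# last entry is the total width R. Only the key name 'ragged_index' and
# the identity R = ragged_index[-1] come from the documentation.
ragged_index = [0, 3, 5, 9]                # three features of widths 3, 2, 4
feature_names = ["color", "flag", "size"]  # hypothetical names

R = ragged_index[-1]
row = list(range(R))  # stand-in for one row of an [N, R]-shaped array


def split_row(row, ragged_index):
    """Slice one flat row into a block of columns per feature."""
    bounds = zip(ragged_index[:-1], ragged_index[1:])
    return [row[beg:end] for beg, end in bounds]


blocks = dict(zip(feature_names, split_row(row, ragged_index)))
print(blocks)
```

Under this reading, every row has the same width R even though individual features occupy blocks of different sizes.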

treecat.format.guess_feature_type(count, values)

Guess the type of a feature, given statistics about the feature.

Args:
    count: Total number of observations of the feature.
    values: A list of unique observed values of the feature.
Returns:
    One of: 'ordinal', 'categorical', or '' (the empty string).
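The documentation does not describe how the guess is made. As a rough illustration of what a guess based on count and values could look like, here is a hypothetical heuristic. The function name, the max_categories threshold, and the rules themselves are assumptions for this sketch and are not treecat's actual logic.

```python
def _is_int(value):
    """Return True if value can be parsed as an integer."""
    try:
        int(value)
        return True
    except (TypeError, ValueError):
        return False


def sketch_guess_feature_type(count, values, max_categories=20):
    """Hypothetical heuristic, NOT treecat's implementation.

    Returns 'ordinal', 'categorical', or '' (no usable type).
    """
    if count == 0 or len(values) < 2:
        return ''  # nothing observed, or a constant column
    if all(_is_int(v) for v in values):
        return 'ordinal'  # integer-like values are treated as ordered
    if len(values) <= max_categories:
        return 'categorical'  # a small set of distinct labels
    return ''  # too many distinct non-numeric values, e.g. free text


print(sketch_guess_feature_type(100, ['1', '2', '3']))
print(sketch_guess_feature_type(100, ['red', 'green']))
```

An empty return value here plays the role of "no type guessed", matching the third option listed above.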

treecat.format.guess_schema(data_csvs_in, types_csv_out, values_csv_out, encoding='utf-8')

Create best-guess types and values for a given dataset.

Common encodings include: utf-8, cp1252.
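The choice of encoding matters because the same bytes can be valid in one encoding and invalid in another. The following standard-library snippet, independent of treecat, shows text written as cp1252 failing to decode as utf-8.

```python
# Bytes as they would appear in a file saved with the cp1252 encoding.
raw = "café".encode("cp1252")

# Decoding with the matching encoding recovers the text.
decoded = raw.decode("cp1252")

# Decoding the same bytes as utf-8 fails, because the byte for 'é'
# is not a complete utf-8 sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(decoded, utf8_ok)
```

If a CSV import fails with a UnicodeDecodeError under the default utf-8, retrying with cp1252 is a reasonable first step for files produced on Windows.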

treecat.format.import_data(data_csvs_in, types_csv_in, values_csv_in, groups_csv_in, dataset_out, encoding='utf-8')

Import a comma-delimited list of csv files into internal treecat format.

Common encodings include: utf-8, cp1252.

treecat.format.import_rows(schema, rows)

Import multiple rows of json data to internal format.

Args:
    schema: A schema dict as returned by load_schema().
    rows: An N-long list of sparse dicts mapping feature names to values, where N is the number of rows. Extra keys and invalid values will be silently ignored.
Returns:
    An [N, R]-shaped numpy array of ragged data, where N is the number of rows and R = schema['ragged_index'][-1].
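To make the "sparse dicts" input and the "silently ignored" behaviour concrete, here is a small pure-Python sketch. It does not call treecat; the allowed_values mapping is a hypothetical stand-in for whatever the real schema records about each feature, and keep_valid is an illustrative helper, not part of the library.

```python
# Hypothetical stand-in for schema information: valid values per feature.
allowed_values = {
    "color": {"red", "green", "blue"},
    "size": {"small", "large"},
}

# Sparse rows: each dict lists only the features observed for that row.
rows = [
    {"color": "red", "size": "large"},
    {"color": "purple"},             # value not known to the schema
    {"size": "small", "weight": 3},  # 'weight' is an extra key
    {},                              # nothing observed
]


def keep_valid(row, allowed_values):
    """Drop unknown keys and unknown values without raising an error."""
    return {
        name: value
        for name, value in row.items()
        if name in allowed_values and value in allowed_values[name]
    }


cleaned = [keep_valid(row, allowed_values) for row in rows]
print(cleaned)
```

Because nothing is reported when a key or value is dropped, it can be worth checking input rows against the schema before importing them.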

treecat.format.join_csvs(column, csvs_in, csv_out, encoding_in='utf-8', encoding_out='utf-8')

Outer join a comma-delimited list of csvs on a given column.

Common encodings include: utf-8, cp1252.

treecat.format.pd_outer_join(dfs, on)

Outer-join an iterable of pandas dataframes on a given column.

Args:
    dfs: An iterable of pandas dataframes.
    on: A column name or list of column names.
Returns:
    A pandas dataframe whose columns are the union of columns in dfs, and whose rows are the union of rows joined on 'on'.
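One plausible way to obtain this behaviour is to fold pandas.merge with how='outer' over the dataframes. The sketch below assumes that approach; sketch_outer_join is a hypothetical name, and the real pd_outer_join may be implemented differently.

```python
from functools import reduce

import pandas as pd


def sketch_outer_join(dfs, on):
    """Outer-join an iterable of dataframes by repeated pd.merge calls.

    Illustrative only; not necessarily how treecat does it.
    """
    return reduce(
        lambda left, right: pd.merge(left, right, on=on, how="outer"),
        dfs,
    )


a = pd.DataFrame({"id": [1, 2], "x": ["a", "b"]})
b = pd.DataFrame({"id": [2, 3], "y": [0.5, 1.5]})
c = pd.DataFrame({"id": [1, 3], "z": [True, False]})

joined = sketch_outer_join([a, b, c], on="id")
print(joined)
```

Rows that are missing from one input get missing values in that input's columns, so the result keeps every id seen in any of the dataframes.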

treecat.format.pickle_dump(data, filename)

Serialize data to file using gzip compression.

treecat.format.pickle_load(filename)

Deserialize data from a gzip-compressed file.
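A gzip-compressed pickle round trip can be sketched with the standard library alone. The helper names, the pickle protocol, and the file name below are assumptions for illustration; the actual treecat functions may differ in detail.

```python
import gzip
import os
import pickle
import tempfile


def sketch_pickle_dump(data, filename):
    """Pickle data into a gzip-compressed file."""
    with gzip.open(filename, "wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)


def sketch_pickle_load(filename):
    """Read back an object written by sketch_pickle_dump."""
    with gzip.open(filename, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.pkz")  # hypothetical file name
    original = {"ragged_index": [0, 3, 5], "names": ["a", "b"]}
    sketch_pickle_dump(original, path)
    restored = sketch_pickle_load(path)

print(restored == original)
```

As with any pickle-based format, such files should only be loaded from trusted sources, since unpickling can execute arbitrary code.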