Formatting data
treecat.format
-
treecat.format.cat(*paths)
Print .pkz files in human-readable form.
-
treecat.format.export_rows(schema, data)
Export multiple rows of internal data to JSON format.
- Args:
- schema: A schema dict as returned by load_schema().
- data: An [N, R]-shaped numpy array of ragged data, where N is the number of rows and R = schema['ragged_index'][-1].
- Returns:
- An N-long list of sparse dicts mapping feature names to JSON values, where N is the number of rows.
-
treecat.format.guess_feature_type(count, values)
Guess the type of a feature, given statistics about the feature.
- Args:
- count: Total number of observations of the feature.
- values: A list of unique observed values of the feature.
- Returns:
- One of: 'ordinal', 'categorical', or '' (the empty string).
-
treecat.format.guess_schema(data_csvs_in, types_csv_out, values_csv_out, encoding='utf-8')
Create best-guess types and values for a given dataset.
Common encodings include: utf-8, cp1252.
-
treecat.format.import_data(data_csvs_in, types_csv_in, values_csv_in, groups_csv_in, dataset_out, encoding='utf-8')
Import a comma-delimited list of CSV files into internal treecat format.
Common encodings include: utf-8, cp1252.
-
treecat.format.import_rows(schema, rows)
Import multiple rows of JSON data to internal format.
- Args:
- schema: A schema dict as returned by load_schema().
- rows: An N-long list of sparse dicts mapping feature names to values, where N is the number of rows. Extra keys and invalid values will be silently ignored.
- Returns:
- An [N, R]-shaped numpy array of ragged data, where N is the number of rows and R = schema['ragged_index'][-1].
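The shape contract above can be illustrated with a toy encoder. This is only a sketch under stated assumptions, not treecat's implementation: the text names only `schema['ragged_index']`, so the `feature_names` and `values` keys, the one-hot encoding, and the function `toy_import_rows` are all hypothetical. It treats `ragged_index` as cumulative column offsets, so that feature i occupies columns `ragged_index[i]:ragged_index[i+1]` and the last entry gives the row width R.

```python
import numpy as np

# Hypothetical toy schema. Only 'ragged_index' is named in the reference text;
# 'feature_names' and 'values' are assumptions made for this sketch.
toy_schema = {
    "feature_names": ["color", "size"],
    "values": {"color": ["red", "green", "blue"], "size": ["S", "L"]},
    # Cumulative offsets: feature i owns columns ragged_index[i]:ragged_index[i+1].
    "ragged_index": [0, 3, 5],
}


def toy_import_rows(schema, rows):
    """Encode sparse dicts as an [N, R] array with one one-hot slice per feature."""
    R = schema["ragged_index"][-1]
    data = np.zeros((len(rows), R), dtype=np.int8)
    for n, row in enumerate(rows):
        for i, name in enumerate(schema["feature_names"]):
            allowed = schema["values"][name]
            value = row.get(name)
            if value in allowed:
                offset = schema["ragged_index"][i]
                data[n, offset + allowed.index(value)] = 1
            # Missing or unrecognized values leave the feature's slice at zero.
    return data


rows = [
    {"color": "green", "size": "L"},
    {"color": "purple", "unknown_key": 1},  # invalid value and extra key are ignored
    {},                                     # a row with nothing observed
]
data = toy_import_rows(toy_schema, rows)
print(data.shape)
```

The point of the sketch is the sparse-in, dense-out behaviour: each input dict may omit features or carry unknown keys, while the output always has exactly R columns per row.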
-
treecat.format.join_csvs(column, csvs_in, csv_out, encoding_in='utf-8', encoding_out='utf-8')
Outer join a comma-delimited list of CSV files on a given column.
Common encodings include: utf-8, cp1252.
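To show how the separate input and output encodings might interact with an outer join, here is a minimal standard-library sketch. It is not treecat's code: `join_csvs_sketch` is a hypothetical stand-in that mirrors the documented parameter names, and it assumes that `csvs_in` is a single comma-delimited string of paths and that each key value appears at most once per file.

```python
import csv
import os
import tempfile


def join_csvs_sketch(column, csvs_in, csv_out,
                     encoding_in="utf-8", encoding_out="utf-8"):
    """Outer-join CSV files on `column` (hypothetical stand-in, not treecat's code)."""
    merged = {}            # key value -> combined row dict, in first-seen order
    fieldnames = [column]  # union of all column names, join column first
    for path in csvs_in.split(","):
        with open(path, newline="", encoding=encoding_in) as f:
            reader = csv.DictReader(f)
            for name in reader.fieldnames:
                if name not in fieldnames:
                    fieldnames.append(name)
            for row in reader:
                merged.setdefault(row[column], {}).update(row)
    with open(csv_out, "w", newline="", encoding=encoding_out) as f:
        # restval fills cells for columns a given key never had.
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(merged.values())


with tempfile.TemporaryDirectory() as tmp:
    a, b, out = (os.path.join(tmp, n) for n in ("a.csv", "b.csv", "out.csv"))
    with open(a, "w", newline="", encoding="cp1252") as f:
        f.write("id,name\r\n1,café\r\n2,tea\r\n")
    with open(b, "w", newline="", encoding="cp1252") as f:
        f.write("id,price\r\n2,3\r\n3,4\r\n")
    # Read cp1252 inputs, write the joined result as utf-8.
    join_csvs_sketch("id", a + "," + b, out, encoding_in="cp1252")
    with open(out, newline="", encoding="utf-8") as f:
        result = list(csv.DictReader(f))

print(result)
```

Rows present in only one input keep their key and get empty cells for the other file's columns, which is what distinguishes an outer join from an inner one.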
-
treecat.format.pd_outer_join(dfs, on)
Outer-join an iterable of pandas dataframes on a given column.
- Args:
- dfs: An iterable of pandas dataframes.
- on: A column name or list of column names.
- Returns:
- A pandas dataframe whose columns are the union of columns in dfs, and whose rows are the union of rows joined on 'on'.
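One plausible way to get this behaviour is to fold `pandas.merge` with `how="outer"` over the dataframes. The sketch below is an assumption about how such a helper could be written, not a copy of treecat's implementation; `outer_join_sketch` and the sample frames are hypothetical.

```python
from functools import reduce

import pandas as pd


def outer_join_sketch(dfs, on):
    """Outer-join an iterable of dataframes pairwise, left to right."""
    return reduce(
        lambda left, right: pd.merge(left, right, on=on, how="outer"),
        dfs,
    )


a = pd.DataFrame({"id": [1, 2], "x": [10, 20]})
b = pd.DataFrame({"id": [2, 3], "y": ["b", "c"]})
c = pd.DataFrame({"id": [1, 3], "z": [0.5, 1.5]})

joined = outer_join_sketch([a, b, c], on="id")
print(joined)
```

Every `id` that appears in any input survives in the result, and cells with no source value are filled with NaN.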
-
treecat.format.pickle_dump(data, filename)
Serialize data to file using gzip compression.
-
treecat.format.pickle_load(filename)
Deserialize data from a gzip-compressed file.
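A gzip-compressed pickle round trip can be sketched with the standard library alone. The helpers `dump_sketch` and `load_sketch` below are hypothetical illustrations of the technique, not treecat's own code, and using the `.pkz` extension for the file is an assumption based on the extension mentioned for `cat`.

```python
import gzip
import os
import pickle
import tempfile


def dump_sketch(data, filename):
    """Pickle `data` into a gzip-compressed file."""
    with gzip.open(filename, "wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_sketch(filename):
    """Read back an object written by dump_sketch."""
    with gzip.open(filename, "rb") as f:
        return pickle.load(f)


original = {"feature_names": ["color", "size"], "counts": [3, 2]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.pkz")
    dump_sketch(original, path)
    restored = load_sketch(path)
    with open(path, "rb") as f:
        magic = f.read(2)  # gzip files start with the bytes 0x1f 0x8b

print(restored == original)
```

As with any pickle-based format, files should only be loaded from trusted sources, since unpickling can execute arbitrary code.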