Serving a trained model

treecat.serving

class treecat.serving.DataServer(dataset, ensemble)

A schema-aware server interface for TreeCat and ensemble models.

edge_logits

A [K]-shaped array of log odds of edges in the complete graph, where K = V(V-1)/2 is the number of edges in the complete graph on V vertices.

estimate_tree()

Returns a tuple of edges. Each edge is a (vertex,vertex) pair.
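For intuition, a maximum-weight spanning tree over the edge log odds can be recovered with a Prim-style greedy pass. This is only a sketch: `max_spanning_tree` is a hypothetical helper, and both the edge enumeration order and the algorithm treecat actually uses may differ.

```python
import numpy as np

def max_spanning_tree(edge_logits, V):
    """Prim-style maximum-weight spanning tree over the complete graph on
    V vertices. Edge k carries weight edge_logits[k], with edges assumed
    to be enumerated as (0,1), (0,2), (1,2), (0,3), ... (v1 < v2)."""
    weight = np.full((V, V), -np.inf)
    k = 0
    for v2 in range(V):
        for v1 in range(v2):
            weight[v1, v2] = weight[v2, v1] = edge_logits[k]
            k += 1
    in_tree = {0}
    edges = []
    while len(in_tree) < V:
        # Greedily attach the heaviest edge crossing the cut.
        _, u, v = max((weight[u, v], u, v)
                      for u in in_tree for v in range(V) if v not in in_tree)
        edges.append((min(u, v), max(u, v)))
        in_tree.add(v)
    return tuple(sorted(edges))

print(max_spanning_tree(np.array([2.0, 0.1, 1.5]), V=3))  # -> ((0, 1), (1, 2))
```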

feature_density()

Returns a [V]-shaped array of feature densities in [0, 1].

feature_names

Returns a tuple containing the names of all features.

latent_correlation()

Compute correlation matrix among latent features.

This computes the generalization of Pearson’s correlation to discrete data. Let I(X;Y) be the mutual information. Then define correlation as

rho(X,Y) = sqrt(1 - exp(-2 I(X;Y)))
Returns:
A [V, V]-shaped numpy array of feature-feature correlations.
latent_perplexity()

Compute perplexity = exp(entropy) of latent variables.

Perplexity is an information theoretic measure of the number of clusters or latent classes. Perplexity is a real number in the range [1, M], where M is model_num_clusters.

Returns:
A [V]-shaped numpy array of perplexity.
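Perplexity itself is easy to sketch for a single discrete distribution; this is generic information theory, not treecat's internal code:

```python
import numpy as np

def perplexity(probs):
    """Perplexity = exp(entropy) of a discrete distribution (0 log 0 := 0)."""
    probs = np.asarray(probs, dtype=float)
    nonzero = probs[probs > 0]
    entropy = -np.sum(nonzero * np.log(nonzero))
    return float(np.exp(entropy))

# A uniform distribution over M classes has perplexity exactly M;
# a deterministic distribution has perplexity 1.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # -> 4.0
print(perplexity([1.0, 0.0, 0.0]))           # -> 1.0
```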
logprob(rows, evidence=None)

Compute non-normalized log probabilities of many rows of data.

If evidence is specified, compute conditional log probability; otherwise compute unconditional log probability.

Args:
rows: A list of rows of data, where each row is a sparse dict
mapping feature name to feature value.
evidence: An optional row of conditioning data, as a sparse dict
mapping feature name to feature value.
Returns:
A [len(rows)]-shaped numpy array of log probabilities.
median(evidence)

Compute an L1-loss-minimizing row of data conditioned on evidence.

Args:
evidence: A single row of conditioning data, as a sparse dict
mapping feature name to feature value.
Returns:
A row of data as a full dict mapping feature name to feature value.
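For a single discrete feature, the L1-loss-minimizing value is a median of its conditional marginal. A minimal sketch of that selection rule, independent of treecat (the distribution below is hypothetical):

```python
import numpy as np

# The L1-minimizing point estimate of a discrete distribution is a median:
# the smallest value whose cumulative probability reaches 0.5.
probs = np.array([0.2, 0.1, 0.4, 0.3])  # hypothetical marginal over 4 values
median_value = int(np.searchsorted(np.cumsum(probs), 0.5))
print(median_value)  # -> 2
```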
mode(evidence)

Compute a maximum a posteriori row of data conditioned on evidence.

Args:
evidence: A single row of conditioning data, as a sparse dict
mapping feature name to feature value.
Returns:
A row of data as a full dict mapping feature name to feature value.
observed_perplexity()

Compute perplexity = exp(entropy) of observed variables.

Perplexity is an information theoretic measure of the number of clusters or observed classes. Perplexity is a real number in the range [1, dim[v]], where dim[v] is the number of categories in an observed categorical variable or 2 for an ordinal variable.

Returns:
A [V]-shaped numpy array of perplexity.
sample(N, evidence=None)

Draw N samples from the posterior distribution.

Args:
N: The number of samples to draw.
evidence: An optional single row of conditioning data, as a sparse
dict mapping feature name to feature value.
Returns:
An [N, R]-shaped numpy array of sampled multinomial data.
sample_tree(num_samples)

Returns a num_samples-long list of trees, each a list of pairs.

class treecat.serving.EnsembleServer(ensemble)

Class for serving queries against a trained TreeCat ensemble.

latent_correlation()

Compute correlation matrix among latent features.

This computes the generalization of Pearson’s correlation to discrete data. Let I(X;Y) be the mutual information. Then define correlation as

rho(X,Y) = sqrt(1 - exp(-2 I(X;Y)))
Returns:
A [V, V]-shaped numpy array of feature-feature correlations.
latent_perplexity()

Compute perplexity = exp(entropy) of latent variables.

Perplexity is an information theoretic measure of the number of clusters or latent classes. Perplexity is a real number in the range [1, M], where M is model_num_clusters.

Returns:
A [V]-shaped numpy array of perplexity.
marginals(data)

Compute observed marginals conditioned on data.

mode(counts, data)

Compute a maximum a posteriori data value conditioned on data.

observed_perplexity(counts)

Compute perplexity = exp(entropy) of observed variables.

Perplexity is an information theoretic measure of the number of clusters or observed classes. Perplexity is a real number in the range [1, dim[v]], where dim[v] is the number of categories in an observed categorical variable or 2 for an ordinal variable.

Args:
counts: A [V]-shaped array of multinomial counts.
Returns:
A [V]-shaped numpy array of perplexity.
class treecat.serving.ServerBase(ragged_index)

Base class for TreeCat and Ensemble servers.

edge_logits

Get edge log probabilities on the complete graph.

estimate_tree()

Return the maximum a posteriori estimated tree structure.

latent_correlation()

Compute correlation matrix among latent features.

latent_perplexity()

Compute perplexity = exp(entropy) of latent variables.

logprob(data)

Compute non-normalized log probabilities of many rows of data.

make_zero_row()

Make an empty data row.

marginals(data)

Compute observed marginals conditioned on data.

median(counts, data)

Compute L1-loss-minimizing quantized marginals conditioned on data.

Args:
counts: A [V]-shaped numpy array of quantization resolutions.
data: An [N, R]-shaped ragged numpy array of conditioning data, as
multinomial counts, where R = server.ragged_size.
Returns:
An array of the same shape as data, but with specified counts.
mode(counts, data)

Compute a maximum a posteriori data value conditioned on data.

observed_perplexity(counts)

Compute perplexity = exp(entropy) of observed variables.

sample(N, counts, data=None)

Draw N samples from the posterior distribution.

sample_tree(num_samples)

Return a num_samples-long list of trees, each a list of pairs.

class treecat.serving.TreeCatServer(model)

Class for serving queries against a trained TreeCat model.

latent_correlation()

Compute correlation matrix among latent features.

This computes the generalization of Pearson’s correlation to discrete data. Let I(X;Y) be the mutual information. Then define correlation as

rho(X,Y) = sqrt(1 - exp(-2 I(X;Y)))
Returns:
A [V, V]-shaped numpy array of feature-feature correlations.
latent_perplexity()

Compute perplexity = exp(entropy) of latent variables.

Perplexity is an information theoretic measure of the number of clusters or latent classes. Perplexity is a real number in the range [1, M], where M is model_num_clusters.

Returns:
A [V]-shaped numpy array of perplexity.
logprob(data)

Compute non-normalized log probabilities of many rows of data.

To compute conditional probability, use the identity:

log P(data|evidence) = server.logprob(data + evidence)
                     - server.logprob(evidence)
Args:
data: An [N, R]-shaped ragged numpy array of multinomial count data,
where N is the number of rows, and R = server.ragged_size.
Returns:
An [N]-shaped numpy array of log probabilities.
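The conditional-probability identity can be checked on a toy joint distribution standing in for the model; all names below are hypothetical:

```python
import numpy as np

# Hypothetical joint P(X, Y) over two binary features.
joint = np.array([[0.1, 0.2],
                  [0.3, 0.4]])

def logprob(x=None, y=None):
    """Log probability of the observed cells, marginalizing out the rest
    (plays the role of server.logprob on this toy model)."""
    p = joint
    if x is not None:
        p = p[x:x + 1, :]
    if y is not None:
        p = p[:, y:y + 1]
    return np.log(p.sum())

# log P(Y=1 | X=0) = logprob(data + evidence) - logprob(evidence)
lhs = np.log(joint[0, 1] / joint[0, :].sum())
rhs = logprob(x=0, y=1) - logprob(x=0)
print(np.allclose(lhs, rhs))  # -> True
```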
marginals(data)

Compute observed marginals conditioned on data.

Args:
data: An [N, R]-shaped ragged numpy array of conditioning data, as
multinomial counts, where R = server.ragged_size.
Returns:
A real-valued array of the same shape as data.
mode(counts, data)

Compute a maximum a posteriori data value conditioned on data.

observed_perplexity(counts)

Compute perplexity = exp(entropy) of observed variables.

Perplexity is an information theoretic measure of the number of clusters or observed classes. Perplexity is a real number in the range [1, dim[v]], where dim[v] is the number of categories in an observed categorical variable or 2 for an ordinal variable.

Args:
counts: A [V]-shaped array of multinomial counts.
Returns:
A [V]-shaped numpy array of perplexity.
sample(N, counts, data=None)

Draw N samples from the posterior distribution.

Args:
N: The number of samples to draw.
counts: A [V]-shaped numpy array of requested counts of
multinomials to sample.
data: An optional single row of conditioning data, as a [R]-shaped
ragged numpy array of multinomial counts, where R = server.ragged_size.
Returns:
An [N, R]-shaped numpy array of sampled multinomial data.
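The shape contract above can be illustrated with plain numpy multinomial draws. The feature sizes, counts, and packing below are hypothetical, chosen only to show how per-feature counts pack into a ragged [N, R] row:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetically, two features with 3 and 2 categories; counts[v] draws are
# taken per feature and the results packed into a ragged [N, R] count array.
probs = [np.array([0.2, 0.3, 0.5]), np.array([0.6, 0.4])]
counts = [1, 1]                 # one draw per feature
N = 4                           # number of sampled rows
R = sum(len(p) for p in probs)  # ragged row size

samples = np.zeros((N, R), dtype=np.int8)
for n in range(N):
    col = 0
    for p, c in zip(probs, counts):
        samples[n, col:col + len(p)] = rng.multinomial(c, p)
        col += len(p)

print(samples.shape)  # -> (4, 5)
```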
sample_tree(num_samples)

Returns a num_samples-long list of trees, each a list of pairs.

treecat.serving.correlation(probs)

Compute correlation rho(X,Y) = sqrt(1 - exp(-2 I(X;Y))).

Args:
probs: An [M, M]-shaped numpy array representing a joint distribution.
Returns:
A number in [0,1) representing the information-theoretic correlation.
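A minimal sketch of this computation, assuming only that probs is a normalized joint over two discrete variables; this is not the library's exact implementation:

```python
import numpy as np

def correlation(probs):
    """rho(X, Y) = sqrt(1 - exp(-2 I(X; Y))) for a joint distribution."""
    probs = np.asarray(probs, dtype=float)
    px = probs.sum(axis=1, keepdims=True)   # marginal P(X)
    py = probs.sum(axis=0, keepdims=True)   # marginal P(Y)
    mask = probs > 0                        # skip 0 log 0 terms
    mutual_info = np.sum(probs[mask] * np.log(probs[mask] / (px * py)[mask]))
    return float(np.sqrt(1.0 - np.exp(-2.0 * mutual_info)))

# Independent variables have zero mutual information, hence rho = 0;
# perfect dependence pushes rho toward 1.
independent = np.outer([0.5, 0.5], [0.5, 0.5])
dependent = np.array([[0.5, 0.0], [0.0, 0.5]])
print(correlation(independent))  # -> 0.0
print(correlation(dependent))
```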
treecat.serving.multinomial_entropy(probs, count)

Compute entropy of multinomial distribution with given probs and count.

Args:
probs: A 1-dimensional array of normalized probabilities.
count: The number of draws in a multinomial distribution.
Returns:
A number in [0, count * len(probs)] representing entropy.
treecat.serving.serve_model(dataset, model)

Create a server object from the given dataset and model.

Args:
dataset: Either the filename of a pickled dataset or an already loaded
dataset.
model: Either the filename of a pickled TreeCat model or ensemble, or
an already loaded model or ensemble.
Returns:
A DataServer object.