superduperdb package

Subpackages

Module contents

class superduperdb.CodeModel(identifier: str, artifacts: dataclasses.InitVar[typing.Optional[typing.Dict]] = None, *, signature: ~typing.Literal['*args', '**kwargs', '*args, **kwargs', 'singleton'] = '*args, **kwargs', datatype: ~superduperdb.components.datatype.DataType | ~superduperdb.backends.ibis.field_types.FieldType | None = None, output_schema: ~superduperdb.components.schema.Schema | None = None, flatten: bool = False, model_update_kwargs: ~typing.Dict = <factory>, predict_kwargs: ~typing.Dict = <factory>, compute_kwargs: ~typing.Dict = <factory>, validation: ~superduperdb.components.model.Validation | None = None, metric_values: ~typing.Dict = <factory>, num_workers: int = 0, object: ~superduperdb.base.code.Code)[source]

Bases: _ObjectModel

Model component which wraps a Model to become serializable.

Base class for components which can predict.

Parameters:
  • signature – Model signature.

  • datatype – DataType instance.

  • output_schema – Output schema (mapping of encoders).

  • flatten – Flatten the model outputs.

  • model_update_kwargs – The kwargs to use for model update.

  • predict_kwargs – Additional arguments to use at prediction time.

  • compute_kwargs – Kwargs used for compute backend job submit. Example (Ray backend): compute_kwargs = dict(resources=…).

  • validation – The validation Dataset instances to use.

  • metric_values – The metrics to evaluate on.

  • object – Code object

full_import_path = 'superduperdb.components.model.CodeModel'
classmethod handle_integration(kwargs)[source]

Handle integration from the UI.

Parameters:

kwargs – Integration kwargs.

object: Code
ui_schema: t.ClassVar[t.List[t.Dict]] = [{'default': 'from superduperdb import code\n\n@code\ndef my_code(x):\n    return x\n', 'name': 'object', 'type': 'code'}]
class superduperdb.DataType(identifier: str, artifacts: dataclasses.InitVar[Optional[Dict]] = None, *, encoder: Callable | None = None, decoder: Callable | None = None, info: Dict | None = None, shape: Sequence | None = None, directory: str | None = None, encodable: str = 'encodable', bytes_encoding: str | None = BytesEncoding.BYTES, intermidia_type: str | None = 'bytes', media_type: str | None = None)[source]

Bases: Component

A data type component that defines how data is encoded and decoded.

Base class for all components in SuperDuperDB.

Class to represent SuperDuperDB serializable entities that can be saved into a database.

Parameters:
  • identifier – A unique identifier for the component.

  • artifacts – List of artifacts which represent entities that are not serializable by default.

  • encoder – A callable that converts an encodable object of this encoder to bytes.

  • decoder – A callable that converts bytes to an encodable object of this encoder.

  • info – An optional information dictionary.

  • shape – The shape of the data.

  • directory – The directory to store file types.

  • encodable – The type of encodable object (‘encodable’, ‘lazy_artifact’, or ‘file’).

  • bytes_encoding – The encoding type for bytes (‘base64’ or ‘bytes’).

  • intermidia_type – Type of the intermediate data (IntermidiaType.BYTES or IntermidiaType.STRING).

  • media_type – The media type.

__call__(x: Any | None = None, uri: str | None = None) _BaseEncodable[source]

Create an instance of the encodable class.

Parameters:
  • x – The optional content.

  • uri – The optional URI.

__post_init__(artifacts)[source]

Post-initialization hook.

Parameters:

artifacts – The artifacts.

bytes_encoding: str | None = 'Bytes'
bytes_encoding_after_encode(data)[source]

Encode the data to base64, if the bytes_encoding is BASE64 and the intermidia_type is BYTES.

Parameters:

data – Encoded data

bytes_encoding_before_decode(data)[source]

Decode the data from base64, if the bytes_encoding is BASE64 and the intermidia_type is BYTES.

Parameters:

data – Decoded data

decode_data(item, info: Dict | None = None)[source]

Decode the item from bytes.

Parameters:
  • item – The item to decode.

  • info – The optional information dictionary.

decoder: Callable | None = None
dict()[source]

Get the dictionary representation of the object.

directory: str | None = None
encodable: str = 'encodable'
encode_data(item, info: Dict | None = None)[source]

Encode the item into bytes.

Parameters:
  • item – The item to encode.

  • info – The optional information dictionary.

encoder: Callable | None = None
full_import_path = 'superduperdb.components.datatype.DataType'
info: Dict | None = None
intermidia_type: str | None = 'bytes'
media_type: str | None = None
classmethod register_datatype(instance)[source]

Register a datatype.

Parameters:

instance – The datatype instance to register.

registered_types: ClassVar[Dict[str, DataType]] = {'dill': DataType(identifier='dill', encoder=<function dill_encode>, decoder=<function dill_decode>, info=None, shape=None, directory=None, encodable='artifact', bytes_encoding=<BytesEncoding.BYTES: 'Bytes'>, intermidia_type='bytes', media_type=None), 'dill_lazy': DataType(identifier='dill_lazy', encoder=<function dill_encode>, decoder=<function dill_decode>, info=None, shape=None, directory=None, encodable='lazy_artifact', bytes_encoding=<BytesEncoding.BYTES: 'Bytes'>, intermidia_type='bytes', media_type=None), 'file': DataType(identifier='file', encoder=<function file_check>, decoder=<function file_check>, info=None, shape=None, directory=None, encodable='file', bytes_encoding=<BytesEncoding.BYTES: 'Bytes'>, intermidia_type='bytes', media_type=None), 'file_lazy': DataType(identifier='file_lazy', encoder=<function file_check>, decoder=<function file_check>, info=None, shape=None, directory=None, encodable='lazy_file', bytes_encoding=<BytesEncoding.BYTES: 'Bytes'>, intermidia_type='bytes', media_type=None), 'json': DataType(identifier='json', encoder=<function json_encode>, decoder=<function json_decode>, info=None, shape=None, directory=None, encodable='encodable', bytes_encoding=<BytesEncoding.BASE64: 'Str'>, intermidia_type='string', media_type=None), 'pickle': DataType(identifier='pickle', encoder=<function pickle_encode>, decoder=<function pickle_decode>, info=None, shape=None, directory=None, encodable='artifact', bytes_encoding=<BytesEncoding.BYTES: 'Bytes'>, intermidia_type='bytes', media_type=None), 'pickle_lazy': DataType(identifier='pickle_lazy', encoder=<function pickle_encode>, decoder=<function pickle_decode>, info=None, shape=None, directory=None, encodable='lazy_artifact', bytes_encoding=<BytesEncoding.BYTES: 'Bytes'>, intermidia_type='bytes', media_type=None), 'torch': DataType(identifier='torch', encoder=<function torch_encode>, decoder=<function torch_decode>, info=None, shape=None, 
directory=None, encodable='lazy_artifact', bytes_encoding=<BytesEncoding.BYTES: 'Bytes'>, intermidia_type='bytes', media_type=None)}
shape: Sequence | None = None
type_id: ClassVar[str] = 'datatype'
ui_schema: ClassVar[List[Dict]] = [{'choices': ['pickle', 'dill', 'torch'], 'default': 'dill', 'name': 'serializer', 'type': 'string'}, {'name': 'info', 'optional': True, 'type': 'json'}, {'name': 'shape', 'optional': True, 'type': 'json'}, {'name': 'directory', 'optional': True, 'type': 'str'}, {'choices': ['encodable', 'lazy_artifact', 'file'], 'default': 'lazy_artifact', 'name': 'encodable', 'type': 'str'}, {'choices': ['base64', 'bytes'], 'default': 'bytes', 'name': 'bytes_encoding', 'type': 'str'}, {'name': 'media_type', 'optional': True, 'type': 'str'}]
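The encoder/decoder pair and the bytes_encoding option together define a round trip: Python object → bytes → (optionally) base64 text, and back. A minimal standalone sketch of that contract, in plain Python rather than superduperdb's implementation (the class name here is illustrative):

```python
import base64
import pickle


class MiniDataType:
    """Illustrative sketch of the DataType encoder/decoder round trip."""

    def __init__(self, encoder, decoder, bytes_encoding='bytes'):
        self.encoder = encoder
        self.decoder = decoder
        self.bytes_encoding = bytes_encoding

    def encode_data(self, item):
        data = self.encoder(item)              # object -> bytes
        if self.bytes_encoding == 'base64':
            data = base64.b64encode(data)      # bytes -> base64
        return data

    def decode_data(self, data):
        if self.bytes_encoding == 'base64':
            data = base64.b64decode(data)      # base64 -> bytes
        return self.decoder(data)              # bytes -> object


dt = MiniDataType(encoder=pickle.dumps, decoder=pickle.loads, bytes_encoding='base64')
roundtrip = dt.decode_data(dt.encode_data({'x': [1, 2, 3]}))
```

The registered 'pickle' and 'dill' datatypes above follow the same pattern, with their respective serializers as encoder/decoder.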
class superduperdb.Dataset(identifier: str, artifacts: dataclasses.InitVar[Optional[Dict]] = None, *, select: Select | None = None, sample_size: int | None = None, random_seed: int | None = None, creation_date: str | None = None, raw_data: Sequence[Any] | None = None)[source]

Bases: Component

A dataset is an immutable collection of documents.

Base class for all components in SuperDuperDB.

Class to represent SuperDuperDB serializable entities that can be saved into a database.

Parameters:
  • identifier – A unique identifier for the component.

  • artifacts – List of artifacts which represent entities that are not serializable by default.

  • select – A query to select the documents for the dataset.

  • sample_size – The number of documents to sample from the query.

  • random_seed – The random seed to use for sampling.

  • creation_date – The date the dataset was created.

  • raw_data – The raw data for the dataset.

__post_init__(artifacts)[source]

Post-initialization method.

Parameters:

artifacts – Optional additional artifacts for initialization.

creation_date: t.Optional[str] = None
property data

Property representing the dataset’s data.

full_import_path = 'superduperdb.components.dataset.Dataset'
init()[source]

Initialization method.

pre_create(db: Datalayer) None[source]

Pre-create hook for database operations.

Parameters:

db – The database to use for the operation.

property random

Cached property representing the random number generator.

random_seed: t.Optional[int] = None
raw_data: t.Optional[t.Sequence[t.Any]] = None
sample_size: t.Optional[int] = None
select: t.Optional[Select] = None
type_id: t.ClassVar[str] = 'dataset'
class superduperdb.Document[source]

Bases: MongoStyleDict

A wrapper around an instance of dict or a Encodable.

The document data is used to dump the resource to a mix of JSON-able content, IDs, and bytes.

static decode(r: Dict, db: Datalayer | None = None) Any[source]

Decode the object from encoded data.

Parameters:
  • r – Encoded data.

  • db – Datalayer instance.

encode(schema: Schema | None = None, leaf_types_to_keep: Sequence[Type] = ()) Dict[source]

Make a copy of the content with all the Leaves encoded.

Parameters:
  • schema – The schema to encode with.

  • leaf_types_to_keep – The types of leaves to keep.

get_leaves(*leaf_types: str)[source]

Get all the leaves in the document.

Parameters:

*leaf_types – The types of leaves to get.

set_variables(db: Datalayer, **kwargs) Document[source]

Set free variables of self.

Parameters:

db – The datalayer to use.

unpack(db=None, leaves_to_keep: Sequence = ()) Any[source]

Returns the content, but with any encodables replaced by their contents.

Parameters:
  • db – The datalayer to use.

  • leaves_to_keep – The types of leaves to keep.

property variables: List[str]

Return a list of variables in the object.
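encode and unpack form a pair: encode replaces encodable leaves with a serialized form, and unpack restores the original content. A rough standalone illustration of that idea, not superduperdb's Document implementation (the '_content' key is illustrative):

```python
import pickle


def encode(r):
    """Replace non-JSON-able leaves with a bytes payload (illustrative)."""
    out = {}
    for k, v in r.items():
        if isinstance(v, (str, int, float, bool, type(None))):
            out[k] = v                               # already JSON-able
        else:
            out[k] = {'_content': pickle.dumps(v)}   # leaf -> bytes payload
    return out


def unpack(r):
    """Inverse of encode: restore encoded leaves."""
    out = {}
    for k, v in r.items():
        if isinstance(v, dict) and '_content' in v:
            out[k] = pickle.loads(v['_content'])
        else:
            out[k] = v
    return out


doc = {'title': 'a', 'vector': [0.1, 0.2]}
restored = unpack(encode(doc))
```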

superduperdb.Encoder

alias of DataType

class superduperdb.Listener(artifacts: dataclasses.InitVar[typing.Optional[typing.Dict]] = None, *, identifier: str = '', key: str | ~typing.List[str] | ~typing.Tuple[~typing.List[str], ~typing.Dict[str, str]], model: ~superduperdb.components.model.Model, select: ~superduperdb.backends.base.query.CompoundSelect, active: bool = True, predict_kwargs: ~typing.Dict | None = <factory>)[source]

Bases: Component

Listener component.

Listener object which is used to process a column/key of a collection or table, and store the outputs.

Parameters:
  • key – Key to be bound to the model.

  • model – Model for processing data.

  • select – Object for selecting which data is processed.

  • active – Toggle to False to deactivate change data triggering.

  • predict_kwargs – Keyword arguments to self.model.predict().

  • identifier – A string used to identify the model.

active: bool = True
cleanup(database: Datalayer) None[source]

Clean up when the listener is deleted.

Parameters:

database – Data layer instance to process.

classmethod create_output_dest(db: Datalayer, predict_id, model: Model)[source]

Create output destination.

Parameters:
  • db – Data layer instance.

  • predict_id – Predict ID.

  • model – Model instance.

property dependencies: List[ComponentTuple]

Listener model dependencies.

depends_on(other: Component)[source]

Check if the listener depends on another component.

Parameters:

other – Another component.

classmethod from_predict_id(db: Datalayer, predict_id) Listener[source]

Instantiate a Listener from a predict ID.

Parameters:
  • db – Data layer instance.

  • predict_id – Predict ID.

full_import_path = 'superduperdb.components.listener.Listener'
classmethod handle_integration(kwargs)[source]

Method to handle integration.

Parameters:

kwargs – Integration keyword arguments.

property id_key: str

Get identifier key.

identifier: str = ''
key: str | List[str] | Tuple[List[str], Dict[str, str]]
property mapping

Mapping property.

model: Model
property outputs

Get reference to outputs of listener model.

property outputs_key

Model outputs key.

property outputs_select

Get query reference to model outputs.

post_create(db: Datalayer) None[source]

Post-create hook.

Parameters:

db – Data layer instance.

pre_create(db: Datalayer) None[source]

Pre-create hook.

Parameters:

db – Data layer instance.

property predict_id

Get predict ID.

predict_kwargs: Dict | None
schedule_jobs(db: Datalayer, dependencies: Sequence[Job] = (), overwrite: bool = False) Sequence[Any][source]

Schedule jobs for the listener.

Parameters:
  • db – Data layer instance to process.

  • dependencies – A list of dependencies.

  • overwrite – Overwrite all documents or only new documents.

select: CompoundSelect
type_id: ClassVar[str] = 'listener'
ui_schema: ClassVar[List[Dict]] = [{'default': '', 'name': 'identifier', 'type': 'str'}, {'name': 'key', 'type': 'json'}, {'name': 'model', 'type': 'component/model'}, {'default': {'documents': [], 'query': '<collection_name>.find()'}, 'name': 'select', 'type': 'json'}, {'default': True, 'name': 'active', 'type': 'bool'}, {'default': {}, 'name': 'predict_kwargs', 'type': 'json'}]
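Conceptually, a Listener maps model over key for every document matched by select and stores the result under an outputs key. A plain-Python sketch of that flow (the function and key names here are illustrative, not superduperdb API):

```python
def run_listener(documents, key, model, outputs_key):
    """Apply `model` to `doc[key]` for each document and store the result."""
    for doc in documents:
        doc[outputs_key] = model(doc[key])
    return documents


docs = [{'x': 1}, {'x': 2}]
run_listener(docs, key='x', model=lambda v: v * 10, outputs_key='_outputs.x')
```

In superduperdb the select query does the document matching, and change-data capture re-runs the model on new documents while active is True.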
class superduperdb.Metric(identifier: str, artifacts: dataclasses.InitVar[Optional[Dict]] = None, *, object: Callable)[source]

Bases: Component

Metric base object used to evaluate performance on a dataset.

These objects are callable and are applied row-wise to the data, and averaged.

Base class for all components in SuperDuperDB.

Class to represent SuperDuperDB serializable entities that can be saved into a database.

Parameters:
  • identifier – A unique identifier for the component.

  • artifacts – List of artifacts which represent entities that are not serializable by default.

  • object – Callable or an Artifact to be applied to the data.

public_api(beta): This API is in beta and may change before becoming stable.

__call__(x: Sequence[int], y: Sequence[int]) bool[source]

Call the metric object on the x and y data.

Parameters:
  • x – First sequence of data.

  • y – Second sequence of data.

full_import_path = 'superduperdb.components.metric.Metric'
object: Callable
type_id: ClassVar[str] = 'metric'
ui_schema: ClassVar[List[Dict]] = [{'name': 'object', 'type': 'artifact'}]
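Since a Metric's callable is applied row-wise and averaged, an accuracy metric amounts to comparing two parallel sequences element by element. A standalone sketch of such a callable (plain Python, illustrative only):

```python
def accuracy(x, y):
    """Row-wise equality between two sequences, averaged."""
    return sum(int(a == b) for a, b in zip(x, y)) / len(x)


score = accuracy([1, 0, 1, 1], [1, 1, 1, 1])  # 3 of 4 rows match
```

A callable like this could be passed as the object of a Metric component.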
class superduperdb.Model(identifier: str, artifacts: dataclasses.InitVar[typing.Optional[typing.Dict]] = None, *, signature: ~typing.Literal['*args', '**kwargs', '*args, **kwargs', 'singleton'] = '*args, **kwargs', datatype: ~superduperdb.components.datatype.DataType | ~superduperdb.backends.ibis.field_types.FieldType | None = None, output_schema: ~superduperdb.components.schema.Schema | None = None, flatten: bool = False, model_update_kwargs: ~typing.Dict = <factory>, predict_kwargs: ~typing.Dict = <factory>, compute_kwargs: ~typing.Dict = <factory>, validation: ~superduperdb.components.model.Validation | None = None, metric_values: ~typing.Dict = <factory>)[source]

Bases: Component

Base class for components which can predict.

Parameters:
  • signature – Model signature.

  • datatype – DataType instance.

  • output_schema – Output schema (mapping of encoders).

  • flatten – Flatten the model outputs.

  • model_update_kwargs – The kwargs to use for model update.

  • predict_kwargs – Additional arguments to use at prediction time.

  • compute_kwargs – Kwargs used for compute backend job submit. Example (Ray backend): compute_kwargs = dict(resources=…).

  • validation – The validation Dataset instances to use.

  • metric_values – The metrics to evaluate on.

__call__(*args, outputs: str | None = None, **kwargs)[source]

Connect the models to build a graph.

Parameters:
  • args – Arguments to be passed to the model.

  • outputs – Identifier for the model outputs.

  • kwargs – Keyword arguments to be passed to the model.

_infer_auto_schema(outputs, predict_id)[source]

Infer datatype from outputs of the model.

Parameters:

outputs – Outputs to infer datatype from.

compute_kwargs: t.Dict
datatype: EncoderArg = None
encode_outputs(outputs)[source]

Method that encodes outputs of a model for saving in the database.

Parameters:

outputs – outputs to encode.

encode_with_schema(outputs)[source]

Encode model outputs corresponding to the provided output_schema.

Parameters:

outputs – Encode the outputs with the given schema.

flatten: bool = False
full_import_path = 'superduperdb.components.model.Model'
static handle_input_type(data, signature)[source]

Method to transform data with respect to signature.

Parameters:
  • data – Data to be transformed

  • signature – Data signature for transforming

property inputs: Inputs

Instance of Inputs to represent model params.

metric_values: t.Dict
model_update_kwargs: t.Dict
output_schema: t.Optional[Schema] = None
abstract predict(dataset: List | QueryDataset) List[source]

Execute on a series of data points defined in the dataset.

Parameters:

dataset – Series of data points to predict on.

predict_in_db(X: ModelInputType, db: Datalayer, predict_id: str, select: CompoundSelect, ids: t.Optional[t.List[str]] = None, max_chunk_size: t.Optional[int] = None, in_memory: bool = True, overwrite: bool = False) t.Any[source]

Predict on the data points in the database.

Execute a single prediction on a data point given by positional and keyword arguments as a job.

Parameters:
  • X – combination of input keys to be mapped to the model

  • db – Datalayer instance

  • predict_id – Identifier for saving outputs.

  • select – CompoundSelect query

  • ids – Iterable of ids

  • max_chunk_size – Chunks of data

  • in_memory – Load data into memory or not

  • overwrite – Overwrite all documents or only new documents

predict_in_db_job(X: ModelInputType, db: Datalayer, predict_id: str, select: t.Optional[CompoundSelect], ids: t.Optional[t.List[str]] = None, max_chunk_size: t.Optional[int] = None, dependencies: t.Sequence[Job] = (), in_memory: bool = True, overwrite: bool = False)[source]

Run a prediction job in the database.

Execute a single prediction on the data points given by positional and keyword arguments as a job.

Parameters:
  • X – combination of input keys to be mapped to the model

  • db – Datalayer instance

  • predict_id – Model outputs identifier

  • select – CompoundSelect query

  • ids – Iterable of ids

  • max_chunk_size – Chunks of data

  • dependencies – List of dependencies (jobs)

  • in_memory – Load data into memory or not

  • overwrite – Overwrite all documents or only new documents

predict_kwargs: t.Dict
abstract predict_one(*args, **kwargs) int[source]

Predict on a single data point.

Execute a single prediction on a data point given by positional and keyword arguments.

signature: Signature = '*args,**kwargs'
to_listener(key: str | List[str] | Tuple[List[str], Dict[str, str]], select: CompoundSelect, identifier='', predict_kwargs: dict | None = None, **kwargs)[source]

Convert the model to a listener.

Parameters:
  • key – Key to be bound to the model

  • select – Object for selecting which data is processed

  • identifier – A string used to identify the model.

  • predict_kwargs – Keyword arguments to self.model.predict

type_id: t.ClassVar[str] = 'model'
ui_schema: t.ClassVar[t.Dict] = [{'name': 'datatype', 'optional': True, 'type': 'component/datatype'}, {'default': {}, 'name': 'predict_kwargs', 'type': 'json'}, {'default': '*args,**kwargs', 'name': 'signature', 'type': 'str'}]
validate(X, dataset: Dataset, metrics: t.Sequence[Metric])[source]

Validate dataset on metrics.

Parameters:
  • X – Define input map

  • dataset – Dataset to run validation on.

  • metrics – Metrics for performing validation

validate_in_db(db)[source]

Validation job in database.

Parameters:

db – DataLayer instance.

validate_in_db_job(db, dependencies: Sequence[Job] = ())[source]

Perform a validation job.

Parameters:
  • db – DataLayer instance

  • dependencies – dependencies on the job

validation: t.Optional[Validation] = None
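Subclasses implement predict_one for a single data point and predict for a whole dataset; typically the latter maps the former over the dataset. A minimal plain-Python sketch of that contract (illustrative, not a usable superduperdb Model subclass):

```python
class MiniModel:
    """Illustrative predict_one / predict pair."""

    def predict_one(self, x):
        # Single-point prediction; a real model runs inference here.
        return x + 1

    def predict(self, dataset):
        # Batch prediction as a map over single-point predictions.
        return [self.predict_one(x) for x in dataset]


preds = MiniModel().predict([1, 2, 3])
```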
class superduperdb.ObjectModel(identifier: str, artifacts: dataclasses.InitVar[typing.Optional[typing.Dict]] = None, *, signature: ~typing.Literal['*args', '**kwargs', '*args, **kwargs', 'singleton'] = '*args, **kwargs', datatype: ~superduperdb.components.datatype.DataType | ~superduperdb.backends.ibis.field_types.FieldType | None = None, output_schema: ~superduperdb.components.schema.Schema | None = None, flatten: bool = False, model_update_kwargs: ~typing.Dict = <factory>, predict_kwargs: ~typing.Dict = <factory>, compute_kwargs: ~typing.Dict = <factory>, validation: ~superduperdb.components.model.Validation | None = None, metric_values: ~typing.Dict = <factory>, num_workers: int = 0, object: ~typing.Any)[source]

Bases: _ObjectModel

Model component which wraps a Model to become serializable.

Base class for components which can predict.

Parameters:
  • signature – Model signature.

  • datatype – DataType instance.

  • output_schema – Output schema (mapping of encoders).

  • flatten – Flatten the model outputs.

  • model_update_kwargs – The kwargs to use for model update.

  • predict_kwargs – Additional arguments to use at prediction time.

  • compute_kwargs – Kwargs used for compute backend job submit. Example (Ray backend): compute_kwargs = dict(resources=…).

  • validation – The validation Dataset instances to use.

  • metric_values – The metrics to evaluate on.

full_import_path = 'superduperdb.components.model.ObjectModel'
ui_schema: t.ClassVar[t.List[t.Dict]] = [{'name': 'object', 'type': 'artifact'}]
class superduperdb.QueryModel(identifier: str, artifacts: dataclasses.InitVar[typing.Optional[typing.Dict]] = None, *, datatype: ~superduperdb.components.datatype.DataType | ~superduperdb.backends.ibis.field_types.FieldType | None = None, output_schema: ~superduperdb.components.schema.Schema | None = None, flatten: bool = False, model_update_kwargs: ~typing.Dict = <factory>, predict_kwargs: ~typing.Dict = <factory>, compute_kwargs: ~typing.Dict = <factory>, validation: ~superduperdb.components.model.Validation | None = None, metric_values: ~typing.Dict = <factory>, preprocess: ~typing.Callable | None = None, postprocess: ~typing.Callable | ~superduperdb.base.code.Code | None = None, select: ~superduperdb.backends.base.query.CompoundSelect)[source]

Bases: Model

QueryModel component.

Model which can be used to query data and return those precomputed queries as Results.

Parameters:
  • preprocess – Preprocess callable

  • postprocess – Postprocess callable

  • select – query used to find data (can include like)

full_import_path = 'superduperdb.components.model.QueryModel'
classmethod handle_integration(kwargs)[source]

Handle integration from UI.

Parameters:

kwargs – Integration kwargs.

property inputs: Inputs

Instance of Inputs to represent model params.

postprocess: t.Optional[t.Union[t.Callable, Code]] = None
predict(dataset: List | QueryDataset) List[source]

Execute on a series of data points defined in the dataset.

Parameters:

dataset – Series of data points to predict on.

predict_one(*args, **kwargs)[source]

Predict on a single data point.

Method to perform a single prediction on args and kwargs. This method is also used for debugging the model.

preprocess: t.Optional[t.Callable] = None
select: CompoundSelect
signature: t.ClassVar[Signature] = '**kwargs'
ui_schema: t.ClassVar[t.List[t.Dict]] = [{'default': 'from superduperdb import code\n\n@code\ndef my_code(x):\n    return x\n', 'name': 'postprocess', 'type': 'code'}, {'default': {'documents': [{'<key-1>': '$my_value'}, {'_id': 0, '_outputs': 0}], 'query': "<collection_name>.like(_documents[0], vector_index='<index_id>').find({}, _documents[1]).limit(10)"}, 'name': 'select', 'type': 'json'}]
class superduperdb.Schema(identifier: str, artifacts: dataclasses.InitVar[Optional[Dict]] = None, *, fields: Mapping[str, DataType])[source]

Bases: Component

A component containing information about the types or encoders of a table.

Base class for all components in SuperDuperDB.

Class to represent SuperDuperDB serializable entities that can be saved into a database.

Parameters:
  • identifier – A unique identifier for the component.

  • artifacts – List of artifacts which represent entities that are not serializable by default.

  • fields – A mapping of field names to types or encoders.

public_api(beta): This API is in beta and may change before becoming stable.

__call__(data: dict[str, Any]) dict[str, Any][source]

Encode data using the schema’s encoders.

Parameters:

data – Data to encode.

decode_data(data: dict[str, Any]) dict[str, Any][source]

Decode data using the schema’s encoders.

Parameters:

data – Data to decode.

property encoded_types

List of fields of type DataType.

property encoders

An iterable to list DataType fields.

fields: Mapping[str, DataType]
full_import_path = 'superduperdb.components.schema.Schema'
pre_create(db) None[source]

Database pre-create hook to add datatype to the database.

Parameters:

db – Datalayer instance.

property raw

Return the raw fields.

Get a dictionary of fields as keys and datatypes as values. This is used to create ibis tables.

property trivial

Determine if the schema contains only trivial fields.

type_id: ClassVar[str] = 'schema'
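A Schema applies each field's datatype encoder to the matching key of a document and leaves other keys untouched. A standalone sketch of that behavior, with plain callables standing in for DataType fields (illustrative, not the superduperdb implementation):

```python
import pickle


def apply_schema(fields, data):
    """Encode each key of `data` that has an encoder in `fields`."""
    return {
        k: fields[k](v) if k in fields else v
        for k, v in data.items()
    }


encoded = apply_schema({'img': pickle.dumps}, {'img': [1, 2], 'id': 'a'})
```

decode_data is the mirror image, applying each field's decoder instead.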
class superduperdb.Stack(identifier: str, artifacts: dataclasses.InitVar[Optional[Dict]] = None, *, components: Sequence[Component])[source]

Bases: Component

Component to hold a list of components under a namespace and package.

A placeholder to hold a list of components under a namespace and package them as a tarball. This tarball can be retrieved back to a Stack instance with the load method.

Base class for all components in SuperDuperDB.

Class to represent SuperDuperDB serializable entities that can be saved into a database.

Parameters:
  • identifier – A unique identifier for the component.

  • artifacts – List of artifacts which represent entities that are not serializable by default.

  • components – List of components to stack together and add to the database.

public_api(alpha): This API is in alpha and may change before becoming stable.

components: Sequence[Component]
property db

Datalayer property.

static from_list(identifier, content, db: Datalayer | None = None)[source]

Helper method to create a Stack from list content.

Parameters:
  • identifier – Unique identifier.

  • content – Content to create a stack.

  • db – Datalayer instance.

full_import_path = 'superduperdb.components.stack.Stack'
type_id: ClassVar[str] = 'stack'
class superduperdb.Validation(identifier: str, artifacts: dc.InitVar[t.Optional[t.Dict]] = None, *, metrics: t.Sequence[Metric] = (), key: t.Optional[ModelInputType] = None, datasets: t.Sequence[Dataset] = ())[source]

Bases: Component

Component which represents a validation definition.

Parameters:
  • metrics – List of metrics for validation

  • key – Model input type key

  • datasets – Sequence of datasets.

datasets: t.Sequence[Dataset] = ()
full_import_path = 'superduperdb.components.model.Validation'
key: t.Optional[ModelInputType] = None
metrics: t.Sequence[Metric] = ()
type_id: t.ClassVar[str] = 'validation'
class superduperdb.VectorIndex(identifier: str, artifacts: dataclasses.InitVar[typing.Optional[typing.Dict]] = None, *, indexing_listener: ~superduperdb.components.listener.Listener, compatible_listener: ~superduperdb.components.listener.Listener | None = None, measure: ~superduperdb.vector_search.base.VectorIndexMeasureType = VectorIndexMeasureType.cosine, metric_values: ~typing.Dict | None = <factory>)[source]

Bases: Component

A component carrying the information to apply a vector index to a DB instance.

Base class for all components in SuperDuperDB.

Class to represent SuperDuperDB serializable entities that can be saved into a database.

Parameters:
  • identifier – A unique identifier for the component.

  • artifacts – List of artifacts which represent entities that are not serializable by default.

  • indexing_listener – Listener which is applied to created vectors

  • compatible_listener – Listener which is applied to vectors to be compared

  • measure – Measure to use for comparison

  • metric_values – Metric values for this index

compatible_listener: Listener | None = None
property dimensions: int

Get dimension for vector database.

This dimension will be used to prepare vectors in the vector database.

full_import_path = 'superduperdb.components.vector_index.VectorIndex'
get_nearest(like: Document, db: Any, id_field: str = '_id', outputs: Dict | None = None, ids: Sequence[str] | None = None, n: int = 100) Tuple[List[str], List[float]][source]

Get nearest results in this vector index.

Given a document, find the nearest results in this vector index, returned as two parallel lists of result IDs and scores.

Parameters:
  • like – The document to compare against

  • db – The datalayer to use

  • id_field – Identifier field

  • outputs – An optional dictionary

  • ids – A list of ids to match

  • n – Number of items to return

get_vector(like: Document, models: List[str], keys: str | List | Dict, db: Any = None, outputs: Dict | None = None)[source]

Perform vector search.

Perform vector search with the query like, using outputs from db, on the self.identifier vector index.

Parameters:
  • like – The document to compare against

  • models – List of models to retrieve outputs

  • keys – Keys available to retrieve outputs of model

  • db – A datalayer instance.

  • outputs – (optional) update like with outputs

indexing_listener: Listener
measure: VectorIndexMeasureType = 'cosine'
metric_values: Dict | None
property models_keys: Tuple[List[str], List[str | List[str] | Tuple[List[str], Dict[str, str]]]]

Return a list of model and keys for each listener.

on_load(db: Datalayer) None[source]

On-load hook to load the indexing listener and compatible listener.

Automatically loads the listeners if they are not already loaded.

Parameters:

db – A DataLayer instance

schedule_jobs(db: Datalayer, dependencies: Sequence[Job] = ()) Sequence[Any][source]

Schedule jobs for the listener.

Parameters:
  • db – The DB instance to process

  • dependencies – A list of dependencies

type_id: ClassVar[str] = 'vector_index'
ui_schema: ClassVar[List[Dict]] = [{'name': 'indexing_listener', 'type': 'component/listener'}, {'name': 'compatible_listener', 'optional': True, 'type': 'component/listener'}, {'choices': ['cosine', 'dot', 'l2'], 'name': 'measure', 'type': 'str'}]
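With the default cosine measure, get_nearest amounts to ranking stored vectors by cosine similarity against the query vector and returning the top n IDs with their scores as two parallel lists. A small standalone sketch (plain Python; these functions are illustrative, not the superduperdb implementation):

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def get_nearest(query, index, n=2):
    """Return (ids, scores) for the n most similar vectors in `index`."""
    scored = sorted(index.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    top = scored[:n]
    return [k for k, _ in top], [cosine(query, v) for _, v in top]


ids, scores = get_nearest([1.0, 0.0], {'a': [1.0, 0.0], 'b': [0.0, 1.0], 'c': [1.0, 1.0]})
```

A dedicated vector-search backend replaces this brute-force scan in practice, but the ranking contract is the same.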
superduperdb.code(my_callable)[source]

Decorator to mark a function as remote code.

Parameters:

my_callable – The callable to mark as remote code.

superduperdb.logging

alias of Logging

superduperdb.objectmodel(item: Callable | None = None, identifier: str | None = None, datatype=None, model_update_kwargs: Dict | None = None, flatten: bool = False, output_schema: Schema | None = None)[source]

Decorator to wrap a function with ObjectModel.

When a function is wrapped with this decorator, it becomes an ObjectModel.

Parameters:
  • item – Callable to wrap with ObjectModel.

  • identifier – Identifier for the ObjectModel.

  • datatype – Datatype for the model outputs.

  • model_update_kwargs – Dictionary to define update kwargs.

  • flatten – If True, flatten the outputs and save.

  • output_schema – Schema for the model outputs.

superduperdb.superduper(item: Any | None = None, **kwargs) Any[source]

Superduper API to automatically wrap an object as a db or a component.

Attempts to automatically wrap an item in a superduperdb component by using duck typing to recognize it.

Parameters:

item – A database or model

superduperdb.vector(shape, identifier: str | None = None)[source]

Create an encoder for a vector (list of ints/floats) of a given shape.

Parameters:
  • shape – The shape of the vector

  • identifier – The identifier of the vector
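A vector datatype of a fixed shape only needs a byte-level round trip for a flat list of numbers. A minimal sketch using struct (illustrative; superduperdb's own vector encoder may differ):

```python
import struct


def encode_vector(v):
    """Pack a list of floats into raw float64 bytes."""
    return struct.pack(f'{len(v)}d', *v)


def decode_vector(data):
    """Unpack raw float64 bytes back into a list of floats."""
    n = len(data) // 8  # 8 bytes per float64
    return list(struct.unpack(f'{n}d', data))


vec = [0.5, 1.5, -2.0]
restored = decode_vector(encode_vector(vec))
```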