superduperdb package
Subpackages
- superduperdb.backends package
- superduperdb.base package
- Submodules
- superduperdb.base.build module
- superduperdb.base.code module
- superduperdb.base.config module
BaseConfig
BytesEncoding
CDCConfig
CDCStrategy
Cluster
Compute
Config
Config.artifact_store
Config.bytes_encoding
Config.cluster
Config.comparables
Config.data_backend
Config.diff()
Config.downloads
Config.envs
Config.fold_probability
Config.hybrid_storage
Config.lance_home
Config.log_level
Config.logging_type
Config.match()
Config.metadata_store
Config.retries
Config.to_yaml()
Downloads
LogBasedStrategy
LogLevel
LogType
PollingStrategy
Rest
Retry
VectorSearch
_diff()
- superduperdb.base.config_dicts module
- superduperdb.base.configs module
- superduperdb.base.cursor module
- superduperdb.base.datalayer module
Datalayer
Datalayer.__init__()
Datalayer._add_component_to_cache()
Datalayer._build_task_workflow()
Datalayer._delete()
Datalayer._insert()
Datalayer._select()
Datalayer._update()
Datalayer._write()
Datalayer.add()
Datalayer.apply()
Datalayer.backfill_vector_search()
Datalayer.close()
Datalayer.drop()
Datalayer.execute()
Datalayer.get_compute()
Datalayer.infer_schema()
Datalayer.initialize_vector_searcher()
Datalayer.load()
Datalayer.refresh_after_delete()
Datalayer.refresh_after_update_or_insert()
Datalayer.remove()
Datalayer.replace()
Datalayer.select_nearest()
Datalayer.server_mode
Datalayer.set_compute()
Datalayer.show()
Datalayer.type_id_to_cache_mapping
LoadDict
- superduperdb.base.decorators module
- superduperdb.base.document module
- superduperdb.base.enums module
- superduperdb.base.exceptions module
- superduperdb.base.leaf module
- superduperdb.base.logger module
- superduperdb.base.serializable module
- superduperdb.base.superduper module
- Module contents
- superduperdb.cdc package
- Submodules
- superduperdb.cdc.app module
- superduperdb.cdc.cdc module
BaseDatabaseListener
BaseDatabaseListener.IDENTITY_SEP
BaseDatabaseListener.Packet
BaseDatabaseListener._build_identifier()
BaseDatabaseListener.create_event()
BaseDatabaseListener.event_handler()
BaseDatabaseListener.identity
BaseDatabaseListener.info()
BaseDatabaseListener.listen()
BaseDatabaseListener.next_cdc()
BaseDatabaseListener.on_create()
BaseDatabaseListener.on_delete()
BaseDatabaseListener.on_update()
BaseDatabaseListener.setup_cdc()
BaseDatabaseListener.stop()
CDCHandler
DBEvent
DatabaseChangeDataCapture
DatabaseListenerFactory
DatabaseListenerThreadScheduler
Packet
- superduperdb.cdc.deployed_app module
- Module contents
- superduperdb.cli package
- superduperdb.components package
- Submodules
- superduperdb.components.component module
Component
Component.artifact_schema
Component.artifacts
Component.changed
Component.create_validation_job()
Component.db
Component.decode()
Component.deep_flat_encode()
Component.dependencies
Component.dict()
Component.encode()
Component.export()
Component.full_import_path
Component.get_ui_schema()
Component.handle_integration()
Component.id
Component.id_tuple
Component.identifier
Component.init()
Component.leaf_type
Component.make_unique_id()
Component.on_load()
Component.post_create()
Component.pre_create()
Component.schedule_jobs()
Component.set_post_init
Component.set_variables()
Component.type_id
Component.ui_schema
Component.unique_id
ComponentTuple
ensure_initialized()
getdeepattr()
import_()
- superduperdb.components.dataset module
- superduperdb.components.datatype module
Artifact
DataType
DataType.__call__()
DataType.__post_init__()
DataType.bytes_encoding
DataType.bytes_encoding_after_encode()
DataType.bytes_encoding_before_decode()
DataType.decode_data()
DataType.decoder
DataType.dict()
DataType.directory
DataType.encodable
DataType.encode_data()
DataType.encoder
DataType.full_import_path
DataType.identifier
DataType.info
DataType.intermidia_type
DataType.media_type
DataType.register_datatype()
DataType.registered_types
DataType.shape
DataType.type_id
DataType.ui_schema
DataTypeFactory
DecodeTorchStateDict
Empty
Encodable
Encoder
File
IntermidiaType
LazyArtifact
LazyFile
Native
_BaseEncodable
_BaseEncodable.__post_init__()
_BaseEncodable._deep_flat_encode()
_BaseEncodable._get_object()
_BaseEncodable.datatype
_BaseEncodable.decode()
_BaseEncodable.file_id
_BaseEncodable.full_import_path
_BaseEncodable.get_encodable_cls()
_BaseEncodable.get_hash()
_BaseEncodable.id
_BaseEncodable.reference
_BaseEncodable.sha1
_BaseEncodable.unique_id
_BaseEncodable.unpack()
_BaseEncodable.uri
_find_descendants()
base64_to_bytes()
build_torch_state_serializer()
bytes_to_base64()
dill_decode()
dill_encode()
encode_torch_state_dict()
file_check()
json_decode()
json_encode()
pickle_decode()
pickle_encode()
torch_decode()
torch_encode()
- superduperdb.components.graph module
- superduperdb.components.listener module
Listener
Listener.active
Listener.cleanup()
Listener.create_output_dest()
Listener.dependencies
Listener.depends_on()
Listener.from_predict_id()
Listener.full_import_path
Listener.handle_integration()
Listener.id_key
Listener.identifier
Listener.key
Listener.mapping
Listener.model
Listener.outputs
Listener.outputs_key
Listener.outputs_select
Listener.post_create()
Listener.pre_create()
Listener.predict_id
Listener.predict_kwargs
Listener.schedule_jobs()
Listener.select
Listener.type_id
Listener.ui_schema
- superduperdb.components.metric module
- superduperdb.components.model module
APIBaseModel
APIModel
CallableInputs
CodeModel
IndexableNode
Inputs
Mapping
Model
Model.__call__()
Model._infer_auto_schema()
Model.compute_kwargs
Model.datatype
Model.encode_outputs()
Model.encode_with_schema()
Model.flatten
Model.full_import_path
Model.handle_input_type()
Model.identifier
Model.inputs
Model.metric_values
Model.model_update_kwargs
Model.output_schema
Model.predict()
Model.predict_in_db()
Model.predict_in_db_job()
Model.predict_kwargs
Model.predict_one()
Model.signature
Model.to_listener()
Model.type_id
Model.ui_schema
Model.validate()
Model.validate_in_db()
Model.validate_in_db_job()
Model.validation
ObjectModel
QueryModel
QueryModel.compute_kwargs
QueryModel.full_import_path
QueryModel.handle_integration()
QueryModel.identifier
QueryModel.inputs
QueryModel.metric_values
QueryModel.model_update_kwargs
QueryModel.postprocess
QueryModel.predict()
QueryModel.predict_kwargs
QueryModel.predict_one()
QueryModel.preprocess
QueryModel.select
QueryModel.signature
QueryModel.ui_schema
SequentialModel
Trainer
Validation
_DeviceManaged
_Fittable
_ObjectModel
codemodel()
objectmodel()
- superduperdb.components.schema module
- superduperdb.components.stack module
- superduperdb.components.vector_index module
DecodeArray
EncodeArray
VectorIndex
VectorIndex.compatible_listener
VectorIndex.dimensions
VectorIndex.full_import_path
VectorIndex.get_nearest()
VectorIndex.get_vector()
VectorIndex.identifier
VectorIndex.indexing_listener
VectorIndex.measure
VectorIndex.metric_values
VectorIndex.models_keys
VectorIndex.on_load()
VectorIndex.schedule_jobs()
VectorIndex.type_id
VectorIndex.ui_schema
sqlvector()
vector()
- Module contents
- superduperdb.ext package
- Subpackages
- superduperdb.ext.anthropic package
- superduperdb.ext.auto package
- superduperdb.ext.cohere package
- superduperdb.ext.jina package
- superduperdb.ext.llamacpp package
- superduperdb.ext.llm package
- superduperdb.ext.numpy package
- superduperdb.ext.openai package
- superduperdb.ext.pillow package
- superduperdb.ext.sentence_transformers package
- superduperdb.ext.sklearn package
- superduperdb.ext.torch package
- superduperdb.ext.transformers package
- superduperdb.ext.unstructured package
- superduperdb.ext.vllm package
- Submodules
- superduperdb.ext.utils module
- Module contents
- Subpackages
- superduperdb.misc package
- Subpackages
- Submodules
- superduperdb.misc.annotations module
- superduperdb.misc.anonymize module
- superduperdb.misc.archives module
- superduperdb.misc.auto_schema module
- superduperdb.misc.colors module
- superduperdb.misc.compat module
- superduperdb.misc.data module
- superduperdb.misc.download module
- superduperdb.misc.files module
- superduperdb.misc.hash module
- superduperdb.misc.retry module
- superduperdb.misc.run module
- superduperdb.misc.serialization module
- superduperdb.misc.server module
- superduperdb.misc.special_dicts module
- Module contents
- superduperdb.server package
- superduperdb.vector_search package
- Subpackages
- Submodules
- superduperdb.vector_search.atlas module
- superduperdb.vector_search.base module
- superduperdb.vector_search.in_memory module
- superduperdb.vector_search.interface module
- superduperdb.vector_search.lance module
- superduperdb.vector_search.update_tasks module
- Module contents
Module contents
- class superduperdb.CodeModel(identifier: str, artifacts: dataclasses.InitVar[typing.Optional[typing.Dict]] = None, *, signature: ~typing.Literal['*args', '**kwargs', '*args, **kwargs', 'singleton'] = '*args, **kwargs', datatype: ~superduperdb.components.datatype.DataType | ~superduperdb.backends.ibis.field_types.FieldType | None = None, output_schema: ~superduperdb.components.schema.Schema | None = None, flatten: bool = False, model_update_kwargs: ~typing.Dict = <factory>, predict_kwargs: ~typing.Dict = <factory>, compute_kwargs: ~typing.Dict = <factory>, validation: ~superduperdb.components.model.Validation | None = None, metric_values: ~typing.Dict = <factory>, num_workers: int = 0, object: ~superduperdb.base.code.Code)[source]¶
Bases:
_ObjectModel
Model component which wraps a Model to become serializable.
Base class for components which can predict.
- Parameters:
signature – Model signature.
datatype – DataType instance.
output_schema – Output schema (mapping of encoders).
flatten – Flatten the model outputs.
model_update_kwargs – The kwargs to use for model update.
predict_kwargs – Additional arguments to use at prediction time.
compute_kwargs – Kwargs used for compute backend job submit. Example (Ray backend): compute_kwargs = dict(resources=…).
validation – The validation Dataset instances to use.
metric_values – The metrics to evaluate on.
object – Code object
- full_import_path = 'superduperdb.components.model.CodeModel'¶
- classmethod handle_integration(kwargs)[source]¶
Handle integration from the UI.
- Parameters:
kwargs – integration kwargs
- ui_schema: t.ClassVar[t.List[t.Dict]] = [{'default': 'from superduperdb import code\n\n@code\ndef my_code(x):\n return x\n', 'name': 'object', 'type': 'code'}]¶
- class superduperdb.DataType(identifier: str, artifacts: dataclasses.InitVar[Optional[Dict]] = None, *, encoder: Callable | None = None, decoder: Callable | None = None, info: Dict | None = None, shape: Sequence | None = None, directory: str | None = None, encodable: str = 'encodable', bytes_encoding: str | None = BytesEncoding.BYTES, intermidia_type: str | None = 'bytes', media_type: str | None = None)[source]¶
Bases:
Component
A data type component that defines how data is encoded and decoded.
Base class for all components in SuperDuperDB.
Class to represent SuperDuperDB serializable entities that can be saved into a database.
- Parameters:
identifier – A unique identifier for the component.
artifacts – List of artifacts which represent entities that are not serializable by default.
encoder – A callable that converts an encodable object of this encoder to bytes.
decoder – A callable that converts bytes to an encodable object of this encoder.
info – An optional information dictionary.
shape – The shape of the data.
directory – The directory to store file types.
encodable – The type of encodable object (‘encodable’, ‘lazy_artifact’, or ‘file’).
bytes_encoding – The encoding type for bytes (‘base64’ or ‘bytes’).
intermidia_type – Type of the intermediate data (IntermidiaType.BYTES or IntermidiaType.STRING).
media_type – The media type.
- __call__(x: Any | None = None, uri: str | None = None) _BaseEncodable [source]¶
Create an instance of the encodable class.
- Parameters:
x – The optional content.
uri – The optional URI.
- bytes_encoding: str | None = 'Bytes'¶
- bytes_encoding_after_encode(data)[source]¶
Encode the data to base64 if the bytes_encoding is BASE64 and the intermidia_type is BYTES.
- Parameters:
data – Encoded data
- bytes_encoding_before_decode(data)[source]¶
Decode the data from base64 if the bytes_encoding is BASE64 and the intermidia_type is BYTES.
- Parameters:
data – Decoded data
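The two hooks above describe a base64 step that wraps the encoder and decoder. The following is a minimal sketch of that round trip using only the standard library. The function names `after_encode` and `before_decode` and the `use_base64` flag are hypothetical stand-ins for illustration, not the library's implementation.

```python
import base64
from typing import Union


def after_encode(data: bytes, use_base64: bool) -> Union[bytes, str]:
    """Hypothetical post-encode step: turn raw bytes into a base64 string."""
    if use_base64:
        return base64.b64encode(data).decode("utf-8")
    return data


def before_decode(data: Union[bytes, str], use_base64: bool) -> bytes:
    """Hypothetical pre-decode step: turn a base64 string back into bytes."""
    if use_base64:
        return base64.b64decode(data)
    return data


raw = b"\x00\x01binary payload"
text = after_encode(raw, use_base64=True)        # safe to embed in JSON
restored = before_decode(text, use_base64=True)  # original bytes again
```

A string representation is useful when the storage layer cannot hold raw bytes; when it can, both steps pass the data through unchanged.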
- decode_data(item, info: Dict | None = None)[source]¶
Decode the item from bytes.
- Parameters:
item – The item to decode.
info – The optional information dictionary.
- decoder: Callable | None = None¶
- directory: str | None = None¶
- encodable: str = 'encodable'¶
- encode_data(item, info: Dict | None = None)[source]¶
Encode the item into bytes.
- Parameters:
item – The item to encode.
info – The optional information dictionary.
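As an illustration of the encoder/decoder pairing described above, here is a small sketch built on the standard `pickle` module. The names `sketch_encode` and `sketch_decode` are hypothetical, and the `(item, info)` call shape is an assumption modeled on the `encode_data` and `decode_data` signatures; the library's own `pickle_encode` and `pickle_decode` may differ.

```python
import pickle
from typing import Any, Dict, Optional


def sketch_encode(item: Any, info: Optional[Dict] = None) -> bytes:
    """Hypothetical encoder: serialize a Python object to bytes."""
    return pickle.dumps(item)


def sketch_decode(data: bytes, info: Optional[Dict] = None) -> Any:
    """Hypothetical decoder: rebuild the object from bytes."""
    return pickle.loads(data)


payload = {"label": "cat", "scores": [0.1, 0.9]}
encoded = sketch_encode(payload)   # bytes suitable for storage
decoded = sketch_decode(encoded)   # equal to the original payload
```

Only unpickle data from sources you trust, since `pickle.loads` can execute arbitrary code.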
- encoder: Callable | None = None¶
- full_import_path = 'superduperdb.components.datatype.DataType'¶
- info: Dict | None = None¶
- intermidia_type: str | None = 'bytes'¶
- media_type: str | None = None¶
- classmethod register_datatype(instance)[source]¶
Register a datatype.
- Parameters:
instance – The datatype instance to register.
- registered_types: ClassVar[Dict[str, DataType]] = {
'dill': DataType(identifier='dill', encoder=<function dill_encode>, decoder=<function dill_decode>, info=None, shape=None, directory=None, encodable='artifact', bytes_encoding=<BytesEncoding.BYTES: 'Bytes'>, intermidia_type='bytes', media_type=None),
'dill_lazy': DataType(identifier='dill_lazy', encoder=<function dill_encode>, decoder=<function dill_decode>, info=None, shape=None, directory=None, encodable='lazy_artifact', bytes_encoding=<BytesEncoding.BYTES: 'Bytes'>, intermidia_type='bytes', media_type=None),
'file': DataType(identifier='file', encoder=<function file_check>, decoder=<function file_check>, info=None, shape=None, directory=None, encodable='file', bytes_encoding=<BytesEncoding.BYTES: 'Bytes'>, intermidia_type='bytes', media_type=None),
'file_lazy': DataType(identifier='file_lazy', encoder=<function file_check>, decoder=<function file_check>, info=None, shape=None, directory=None, encodable='lazy_file', bytes_encoding=<BytesEncoding.BYTES: 'Bytes'>, intermidia_type='bytes', media_type=None),
'json': DataType(identifier='json', encoder=<function json_encode>, decoder=<function json_decode>, info=None, shape=None, directory=None, encodable='encodable', bytes_encoding=<BytesEncoding.BASE64: 'Str'>, intermidia_type='string', media_type=None),
'pickle': DataType(identifier='pickle', encoder=<function pickle_encode>, decoder=<function pickle_decode>, info=None, shape=None, directory=None, encodable='artifact', bytes_encoding=<BytesEncoding.BYTES: 'Bytes'>, intermidia_type='bytes', media_type=None),
'pickle_lazy': DataType(identifier='pickle_lazy', encoder=<function pickle_encode>, decoder=<function pickle_decode>, info=None, shape=None, directory=None, encodable='lazy_artifact', bytes_encoding=<BytesEncoding.BYTES: 'Bytes'>, intermidia_type='bytes', media_type=None),
'torch': DataType(identifier='torch', encoder=<function torch_encode>, decoder=<function torch_decode>, info=None, shape=None, directory=None, encodable='lazy_artifact', bytes_encoding=<BytesEncoding.BYTES: 'Bytes'>, intermidia_type='bytes', media_type=None)}
- shape: Sequence | None = None¶
- type_id: ClassVar[str] = 'datatype'¶
- ui_schema: ClassVar[List[Dict]] = [{'choices': ['pickle', 'dill', 'torch'], 'default': 'dill', 'name': 'serializer', 'type': 'string'}, {'name': 'info', 'optional': True, 'type': 'json'}, {'name': 'shape', 'optional': True, 'type': 'json'}, {'name': 'directory', 'optional': True, 'type': 'str'}, {'choices': ['encodable', 'lazy_artifact', 'file'], 'default': 'lazy_artifact', 'name': 'encodable', 'type': 'str'}, {'choices': ['base64', 'bytes'], 'default': 'bytes', 'name': 'bytes_encoding', 'type': 'str'}, {'name': 'media_type', 'optional': True, 'type': 'str'}]¶
- class superduperdb.Dataset(identifier: str, artifacts: dataclasses.InitVar[Optional[Dict]] = None, *, select: Select | None = None, sample_size: int | None = None, random_seed: int | None = None, creation_date: str | None = None, raw_data: Sequence[Any] | None = None)[source]¶
Bases:
Component
A dataset is an immutable collection of documents.
Base class for all components in SuperDuperDB.
Class to represent SuperDuperDB serializable entities that can be saved into a database.
- Parameters:
identifier – A unique identifier for the component.
artifacts – List of artifacts which represent entities that are not serializable by default.
select – A query to select the documents for the dataset.
sample_size – The number of documents to sample from the query.
random_seed – The random seed to use for sampling.
creation_date – The date the dataset was created.
raw_data – The raw data for the dataset.
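The `sample_size` and `random_seed` parameters suggest reproducible sampling from the selected documents. The sketch below shows one way that could work with the standard library; the helper `sample_documents` is hypothetical, and how `Dataset` actually draws its sample is not stated here.

```python
import random
from typing import Any, List, Optional, Sequence


def sample_documents(
    docs: Sequence[Any],
    sample_size: Optional[int],
    random_seed: Optional[int],
) -> List[Any]:
    """Hypothetical helper: draw a reproducible sample of documents."""
    if sample_size is None or sample_size >= len(docs):
        return list(docs)  # nothing to sample, keep everything
    rng = random.Random(random_seed)  # private generator, global state untouched
    return rng.sample(list(docs), sample_size)


docs = [{"_id": i} for i in range(100)]
first = sample_documents(docs, sample_size=5, random_seed=42)
second = sample_documents(docs, sample_size=5, random_seed=42)
```

Using a dedicated `random.Random` instance keeps the sample stable across runs with the same seed, which fits the description of a dataset as an immutable collection.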
- __post_init__(artifacts)[source]¶
Post-initialization method.
- Parameters:
artifacts – Optional additional artifacts for initialization.
- creation_date: t.Optional[str] = None¶
- property data¶
Property representing the dataset’s data.
- full_import_path = 'superduperdb.components.dataset.Dataset'¶
- pre_create(db: Datalayer) None [source]¶
Pre-create hook for database operations.
- Parameters:
db – The database to use for the operation.
- property random¶
Cached property representing the random number generator.
- random_seed: t.Optional[int] = None¶
- raw_data: t.Optional[t.Sequence[t.Any]] = None¶
- sample_size: t.Optional[int] = None¶
- type_id: t.ClassVar[str] = 'dataset'¶
- class superduperdb.Document[source]¶
Bases:
MongoStyleDict
A wrapper around an instance of dict or an Encodable.
The document data is used to dump that resource to a mix of JSON-able content, ids, and bytes.
- static decode(r: Dict, db: Datalayer | None = None) Any [source]¶
Decode the object from encoded data.
- Parameters:
r – Encoded data.
db – Datalayer instance.
- encode(schema: Schema | None = None, leaf_types_to_keep: Sequence[Type] = ()) Dict [source]¶
Make a copy of the content with all the Leaves encoded.
- Parameters:
schema – The schema to encode with.
leaf_types_to_keep – The types of leaves to keep.
- get_leaves(*leaf_types: str)[source]¶
Get all the leaves in the document.
- Parameters:
*leaf_types – The types of leaves to get.
- set_variables(db: Datalayer, **kwargs) Document [source]¶
Set free variables of self.
- Parameters:
db – The datalayer to use.
- unpack(db=None, leaves_to_keep: Sequence = ()) Any [source]¶
Returns the content, but with any encodables replaced by their contents.
- Parameters:
db – The datalayer to use.
leaves_to_keep – The types of leaves to keep.
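To make the idea of replacing encodables by their contents concrete, here is a small recursive sketch over nested dicts and lists. The `Wrapped` class and the free function `unpack` are hypothetical stand-ins for illustration; they do not reproduce `Document.unpack` or its `leaves_to_keep` handling.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Wrapped:
    """Hypothetical stand-in for an encodable leaf holding a value."""
    x: Any


def unpack(value: Any) -> Any:
    """Return the same structure with every Wrapped leaf replaced by its content."""
    if isinstance(value, Wrapped):
        return value.x
    if isinstance(value, dict):
        return {k: unpack(v) for k, v in value.items()}
    if isinstance(value, list):
        return [unpack(v) for v in value]
    return value  # plain values pass through unchanged


doc = {"title": "a", "img": Wrapped(b"\x89PNG"), "tags": [Wrapped("x"), "y"]}
plain = unpack(doc)
```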
- property variables: List[str]¶
Return a list of variables in the object.
- class superduperdb.Listener(artifacts: dataclasses.InitVar[typing.Optional[typing.Dict]] = None, *, identifier: str = '', key: str | ~typing.List[str] | ~typing.Tuple[~typing.List[str], ~typing.Dict[str, str]], model: ~superduperdb.components.model.Model, select: ~superduperdb.backends.base.query.CompoundSelect, active: bool = True, predict_kwargs: ~typing.Dict | None = <factory>)[source]¶
Bases:
Component
Listener component.
Listener object which is used to process a column/key of a collection or table, and store the outputs.
- Parameters:
key – Key to be bound to the model.
model – Model for processing data.
select – Object for selecting which data is processed.
active – Toggle to False to deactivate change data triggering.
predict_kwargs – Keyword arguments to self.model.predict().
identifier – A string used to identify the model.
- active: bool = True¶
- cleanup(database: Datalayer) None [source]¶
Clean up when the listener is deleted.
- Parameters:
database – Data layer instance to process.
- classmethod create_output_dest(db: Datalayer, predict_id, model: Model)[source]¶
Create output destination.
- Parameters:
db – Data layer instance.
predict_id – Predict ID.
model – Model instance.
- property dependencies: List[ComponentTuple]¶
Listener model dependencies.
- depends_on(other: Component)[source]¶
Check if the listener depends on another component.
- Parameters:
other – Another component.
- classmethod from_predict_id(db: Datalayer, predict_id) Listener [source]¶
Split predict ID.
- Parameters:
db – Data layer instance.
predict_id – Predict ID.
- full_import_path = 'superduperdb.components.listener.Listener'¶
- classmethod handle_integration(kwargs)[source]¶
Method to handle integration.
- Parameters:
kwargs – Integration keyword arguments.
- property id_key: str¶
Get identifier key.
- identifier: str = ''¶
- key: str | List[str] | Tuple[List[str], Dict[str, str]]¶
- property mapping¶
Mapping property.
- property outputs¶
Get reference to outputs of listener model.
- property outputs_key¶
Model outputs key.
- property outputs_select¶
Get query reference to model outputs.
- property predict_id¶
Get predict ID.
- predict_kwargs: Dict | None¶
- schedule_jobs(db: Datalayer, dependencies: Sequence[Job] = (), overwrite: bool = False) Sequence[Any] [source]¶
Schedule jobs for the listener.
- Parameters:
db – Data layer instance to process.
dependencies – A list of dependencies.
- select: CompoundSelect¶
- type_id: ClassVar[str] = 'listener'¶
- ui_schema: ClassVar[List[Dict]] = [{'default': '', 'name': 'identifier', 'type': 'str'}, {'name': 'key', 'type': 'json'}, {'name': 'model', 'type': 'component/model'}, {'default': {'documents': [], 'query': '<collection_name>.find()'}, 'name': 'select', 'type': 'json'}, {'default': True, 'name': 'active', 'type': 'bool'}, {'default': {}, 'name': 'predict_kwargs', 'type': 'json'}]¶
- class superduperdb.Metric(identifier: str, artifacts: dataclasses.InitVar[Optional[Dict]] = None, *, object: Callable)[source]¶
Bases:
Component
Metric base object used to evaluate performance on a dataset.
These objects are callable and are applied row-wise to the data, and averaged.
Base class for all components in SuperDuperDB.
Class to represent SuperDuperDB serializable entities that can be saved into a database.
- Parameters:
identifier – A unique identifier for the component.
artifacts – List of artifacts which represent entities that are not serializable by default.
object – Callable or an Artifact to be applied to the data.
public_api(beta): This API is in beta and may change before becoming stable.
- __call__(x: Sequence[int], y: Sequence[int]) bool [source]¶
Call the metric object on the x and y data.
- Parameters:
x – First sequence of data.
y – Second sequence of data.
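The description says a metric is applied row-wise and then averaged. The sketch below illustrates that pattern with plain Python; `evaluate` and `exact_match` are hypothetical names, and the per-row call shape is an assumption rather than the library's evaluation code.

```python
from typing import Callable, Sequence


def evaluate(
    metric: Callable[[int, int], bool],
    predictions: Sequence[int],
    targets: Sequence[int],
) -> float:
    """Hypothetical helper: apply a metric to each row and average the results."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        return 0.0
    scores = [metric(p, t) for p, t in zip(predictions, targets)]
    return sum(scores) / len(scores)  # booleans count as 0 or 1


def exact_match(x: int, y: int) -> bool:
    return x == y


accuracy = evaluate(exact_match, [1, 0, 1, 1], [1, 1, 1, 0])
```

With two matching rows out of four, `accuracy` here is 0.5.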
- full_import_path = 'superduperdb.components.metric.Metric'¶
- object: Callable¶
- type_id: ClassVar[str] = 'metric'¶
- ui_schema: ClassVar[List[Dict]] = [{'name': 'object', 'type': 'artifact'}]¶
- class superduperdb.Model(identifier: str, artifacts: dataclasses.InitVar[typing.Optional[typing.Dict]] = None, *, signature: ~typing.Literal['*args', '**kwargs', '*args, **kwargs', 'singleton'] = '*args, **kwargs', datatype: ~superduperdb.components.datatype.DataType | ~superduperdb.backends.ibis.field_types.FieldType | None = None, output_schema: ~superduperdb.components.schema.Schema | None = None, flatten: bool = False, model_update_kwargs: ~typing.Dict = <factory>, predict_kwargs: ~typing.Dict = <factory>, compute_kwargs: ~typing.Dict = <factory>, validation: ~superduperdb.components.model.Validation | None = None, metric_values: ~typing.Dict = <factory>)[source]¶
Bases:
Component
Base class for components which can predict.
- Parameters:
signature – Model signature.
datatype – DataType instance.
output_schema – Output schema (mapping of encoders).
flatten – Flatten the model outputs.
model_update_kwargs – The kwargs to use for model update.
predict_kwargs – Additional arguments to use at prediction time.
compute_kwargs – Kwargs used for compute backend job submit. Example (Ray backend): compute_kwargs = dict(resources=…).
validation – The validation Dataset instances to use.
metric_values – The metrics to evaluate on.
- __call__(*args, outputs: str | None = None, **kwargs)[source]¶
Connect the models to build a graph.
- Parameters:
args – Arguments to be passed to the model.
outputs – Identifier for the model outputs.
kwargs – Keyword arguments to be passed to the model.
- _infer_auto_schema(outputs, predict_id)[source]¶
Infer datatype from outputs of the model.
- Parameters:
outputs – Outputs to infer datatype from.
- compute_kwargs: t.Dict¶
- datatype: EncoderArg = None¶
- encode_outputs(outputs)[source]¶
Method that encodes outputs of a model for saving in the database.
- Parameters:
outputs – outputs to encode.
- encode_with_schema(outputs)[source]¶
Encode model outputs corresponding to the provided output_schema.
- Parameters:
outputs – Encode the outputs with the given schema.
- flatten: bool = False¶
- full_import_path = 'superduperdb.components.model.Model'¶
- static handle_input_type(data, signature)[source]¶
Method to transform data with respect to signature.
- Parameters:
data – Data to be transformed
signature – Data signature for transforming
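One plausible reading of the four signature values is that they decide how a data point is split into positional and keyword arguments. The sketch below shows that interpretation; `split_inputs` is a hypothetical helper and the mapping is an assumption, not the body of `handle_input_type`. Spaces are stripped first because the signature appears both as '*args, **kwargs' and '*args,**kwargs' in this reference.

```python
from typing import Any, Dict, Tuple


def split_inputs(data: Any, signature: str) -> Tuple[tuple, Dict[str, Any]]:
    """Hypothetical helper: turn a data point into (args, kwargs) for a call."""
    signature = signature.replace(" ", "")
    if signature == "singleton":
        return (data,), {}          # the whole value is one argument
    if signature == "*args":
        return tuple(data), {}      # a sequence of positional arguments
    if signature == "**kwargs":
        return (), dict(data)       # a mapping of keyword arguments
    if signature == "*args,**kwargs":
        args, kwargs = data         # a pair: (positional, keyword)
        return tuple(args), dict(kwargs)
    raise ValueError(f"unknown signature: {signature!r}")


args, kwargs = split_inputs(([1, 2], {"scale": 3}), "*args, **kwargs")
```

A model function `f` would then be called as `f(*args, **kwargs)`.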
- metric_values: t.Dict¶
- model_update_kwargs: t.Dict¶
- abstract predict(dataset: List | QueryDataset) List [source]¶
Execute on a series of data points defined in the dataset.
- Parameters:
dataset – Series of data points to predict on.
- predict_in_db(X: ModelInputType, db: Datalayer, predict_id: str, select: CompoundSelect, ids: t.Optional[t.List[str]] = None, max_chunk_size: t.Optional[int] = None, in_memory: bool = True, overwrite: bool = False) t.Any [source]¶
Predict on the data points in the database.
Execute a single prediction on a data point given by positional and keyword arguments as a job.
- Parameters:
X – combination of input keys to be mapped to the model
db – Datalayer instance
predict_id – Identifier for saving outputs.
select – CompoundSelect query
ids – Iterable of ids
max_chunk_size – Maximum size of each chunk of data
in_memory – Load data into memory or not
overwrite – Overwrite all documents or only new documents
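The `ids` and `max_chunk_size` parameters point to processing the selected documents in batches. A minimal sketch of such chunking follows; `iter_chunks` is a hypothetical helper, and treating `None` as "one chunk with everything" is an assumption.

```python
from typing import Iterator, List, Optional, Sequence


def iter_chunks(
    ids: Sequence[str], max_chunk_size: Optional[int]
) -> Iterator[List[str]]:
    """Hypothetical helper: yield ids in chunks of at most max_chunk_size."""
    if max_chunk_size is None:
        yield list(ids)  # assumption: no limit means a single chunk
        return
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    for start in range(0, len(ids), max_chunk_size):
        yield list(ids[start:start + max_chunk_size])


chunks = list(iter_chunks(["a", "b", "c", "d", "e"], max_chunk_size=2))
```

Bounding the chunk size keeps memory use predictable when the selection is large.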
- predict_in_db_job(X: ModelInputType, db: Datalayer, predict_id: str, select: t.Optional[CompoundSelect], ids: t.Optional[t.List[str]] = None, max_chunk_size: t.Optional[int] = None, dependencies: t.Sequence[Job] = (), in_memory: bool = True, overwrite: bool = False)[source]¶
Run a prediction job in the database.
Execute a single prediction on the data points given by positional and keyword arguments as a job.
- Parameters:
X – combination of input keys to be mapped to the model
db – Datalayer instance
predict_id – Model outputs identifier
select – CompoundSelect query
ids – Iterable of ids
max_chunk_size – Maximum size of each chunk of data
dependencies – List of dependencies (jobs)
in_memory – Load data into memory or not
overwrite – Overwrite all documents or only new documents
- predict_kwargs: t.Dict¶
- abstract predict_one(*args, **kwargs) int [source]¶
Predict on a single data point.
Execute a single prediction on a data point given by positional and keyword arguments.
- signature: Signature = '*args,**kwargs'¶
- to_listener(key: str | List[str] | Tuple[List[str], Dict[str, str]], select: CompoundSelect, identifier='', predict_kwargs: dict | None = None, **kwargs)[source]¶
Convert the model to a listener.
- Parameters:
key – Key to be bound to the model
select – Object for selecting which data is processed
identifier – A string used to identify the model.
predict_kwargs – Keyword arguments to self.model.predict
- type_id: t.ClassVar[str] = 'model'¶
- ui_schema: t.ClassVar[t.Dict] = [{'name': 'datatype', 'optional': True, 'type': 'component/datatype'}, {'default': {}, 'name': 'predict_kwargs', 'type': 'json'}, {'default': '*args,**kwargs', 'name': 'signature', 'type': 'str'}]¶
- validate(X, dataset: Dataset, metrics: t.Sequence[Metric])[source]¶
Validate dataset on metrics.
- Parameters:
X – Define input map
dataset – Dataset to run validation on.
metrics – Metrics for performing validation
- validate_in_db_job(db, dependencies: Sequence[Job] = ())[source]¶
Perform a validation job.
- Parameters:
db – Datalayer instance
dependencies – Jobs that this job depends on
- validation: t.Optional[Validation] = None¶
- class superduperdb.ObjectModel(identifier: str, artifacts: dataclasses.InitVar[typing.Optional[typing.Dict]] = None, *, signature: ~typing.Literal['*args', '**kwargs', '*args, **kwargs', 'singleton'] = '*args, **kwargs', datatype: ~superduperdb.components.datatype.DataType | ~superduperdb.backends.ibis.field_types.FieldType | None = None, output_schema: ~superduperdb.components.schema.Schema | None = None, flatten: bool = False, model_update_kwargs: ~typing.Dict = <factory>, predict_kwargs: ~typing.Dict = <factory>, compute_kwargs: ~typing.Dict = <factory>, validation: ~superduperdb.components.model.Validation | None = None, metric_values: ~typing.Dict = <factory>, num_workers: int = 0, object: ~typing.Any)[source]¶
Bases:
_ObjectModel
Model component which wraps a Model to become serializable.
Base class for components which can predict.
- Parameters:
signature – Model signature.
datatype – DataType instance.
output_schema – Output schema (mapping of encoders).
flatten – Flatten the model outputs.
model_update_kwargs – The kwargs to use for model update.
predict_kwargs – Additional arguments to use at prediction time.
compute_kwargs – Kwargs used for compute backend job submit. Example (Ray backend): compute_kwargs = dict(resources=…).
validation – The validation Dataset instances to use.
metric_values – The metrics to evaluate on.
- full_import_path = 'superduperdb.components.model.ObjectModel'¶
- ui_schema: t.ClassVar[t.List[t.Dict]] = [{'name': 'object', 'type': 'artifact'}]¶
- class superduperdb.QueryModel(identifier: str, artifacts: dataclasses.InitVar[typing.Optional[typing.Dict]] = None, *, datatype: ~superduperdb.components.datatype.DataType | ~superduperdb.backends.ibis.field_types.FieldType | None = None, output_schema: ~superduperdb.components.schema.Schema | None = None, flatten: bool = False, model_update_kwargs: ~typing.Dict = <factory>, predict_kwargs: ~typing.Dict = <factory>, compute_kwargs: ~typing.Dict = <factory>, validation: ~superduperdb.components.model.Validation | None = None, metric_values: ~typing.Dict = <factory>, preprocess: ~typing.Callable | None = None, postprocess: ~typing.Callable | ~superduperdb.base.code.Code | None = None, select: ~superduperdb.backends.base.query.CompoundSelect)[source]¶
Bases:
Model
QueryModel component.
Model which can be used to query data and return those precomputed queries as Results.
- Parameters:
preprocess – Preprocess callable
postprocess – Postprocess callable
select – query used to find data (can include like)
- full_import_path = 'superduperdb.components.model.QueryModel'¶
- classmethod handle_integration(kwargs)[source]¶
Handle integration from UI.
- Parameters:
kwargs – Integration kwargs.
- predict(dataset: List | QueryDataset) List [source]¶
Execute on a series of data points defined in the dataset.
- Parameters:
dataset – Series of data points to predict on.
- predict_one(*args, **kwargs)[source]¶
Predict on a single data point.
Method to perform a single prediction on args and kwargs. This method is also used for debugging the model.
- preprocess: t.Optional[t.Callable] = None¶
- select: CompoundSelect¶
- signature: t.ClassVar[Signature] = '**kwargs'¶
- ui_schema: t.ClassVar[t.List[t.Dict]] = [{'default': 'from superduperdb import code\n\n@code\ndef my_code(x):\n return x\n', 'name': 'postprocess', 'type': 'code'}, {'default': {'documents': [{'<key-1>': '$my_value'}, {'_id': 0, '_outputs': 0}], 'query': "<collection_name>.like(_documents[0], vector_index='<index_id>').find({}, _documents[1]).limit(10)"}, 'name': 'select', 'type': 'json'}]¶
- class superduperdb.Schema(identifier: str, artifacts: dataclasses.InitVar[Optional[Dict]] = None, *, fields: Mapping[str, DataType])[source]¶
Bases:
Component
A component containing information about the types or encoders of a table.
Base class for all components in SuperDuperDB.
Class to represent SuperDuperDB serializable entities that can be saved into a database.
- Parameters:
identifier – A unique identifier for the component.
artifacts – List of artifacts which represent entities that are not serializable by default.
fields – A mapping of field names to types or encoders.
public_api(beta): This API is in beta and may change before becoming stable.
- __call__(data: dict[str, Any]) dict[str, Any] [source]¶
Encode data using the schema’s encoders.
- Parameters:
data – Data to encode.
- decode_data(data: dict[str, Any]) dict[str, Any] [source]¶
Decode data using the schema’s encoders.
- Parameters:
data – Data to decode.
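The encode/decode round trip described for `__call__` and `decode_data` can be sketched without the library. This is a minimal illustration only: `ToyDataType` and `ToySchema` are hypothetical stand-ins invented for this example, not SuperDuperDB classes, and the assumption that fields without an encoder pass through unchanged is ours, not something the reference states.

```python
import base64
from typing import Any, Callable, Dict, Mapping, NamedTuple


class ToyDataType(NamedTuple):
    """Hypothetical stand-in for a DataType: a pair of encode/decode callables."""

    encoder: Callable[[Any], Any]
    decoder: Callable[[Any], Any]


class ToySchema:
    """Hypothetical stand-in that mirrors the shape of __call__ and decode_data."""

    def __init__(self, identifier: str, fields: Mapping[str, Any]):
        self.identifier = identifier
        self.fields = dict(fields)

    def _convert(self, data: Dict[str, Any], direction: str) -> Dict[str, Any]:
        out = {}
        for key, value in data.items():
            field_type = self.fields.get(key)
            # Assumption: only fields backed by an encoder are transformed;
            # everything else is copied through untouched.
            if isinstance(field_type, ToyDataType):
                value = getattr(field_type, direction)(value)
            out[key] = value
        return out

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        return self._convert(data, "encoder")

    def decode_data(self, data: Dict[str, Any]) -> Dict[str, Any]:
        return self._convert(data, "decoder")


# A toy bytes <-> base64 text encoder, chosen only for illustration.
b64 = ToyDataType(
    encoder=lambda raw: base64.b64encode(raw).decode("ascii"),
    decoder=lambda text: base64.b64decode(text),
)

schema = ToySchema("toy_schema", fields={"img": b64, "label": "str"})
row = {"img": b"\x00\x01", "label": "cat"}
encoded = schema(row)                 # "img" becomes base64 text, "label" is unchanged
restored = schema.decode_data(encoded)
```

In this sketch, decoding an encoded row gives back the original row, which is the property a schema's encoders are expected to preserve.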
- property encoded_types¶
List of fields of type DataType.
- property encoders¶
An iterable over the DataType fields.
- full_import_path = 'superduperdb.components.schema.Schema'¶
- pre_create(db) None [source]¶
Database pre-create hook to add datatype to the database.
- Parameters:
db – Datalayer instance.
- property raw¶
Return the raw fields.
Get a dictionary of fields as keys and datatypes as values. This is used to create ibis tables.
- property trivial¶
Determine if the schema contains only trivial fields.
- type_id: ClassVar[str] = 'schema'¶
- class superduperdb.Stack(identifier: str, artifacts: dataclasses.InitVar[Optional[Dict]] = None, *, components: Sequence[Component])[source]¶
Bases:
Component
Component to hold a list of components under a namespace and package.
A placeholder to hold a list of components under a namespace and package them as a tarball. This tarball can be retrieved back to a Stack instance with the load method.
Base class for all components in SuperDuperDB.
Class to represent SuperDuperDB serializable entities that can be saved into a database.
- Parameters:
identifier – A unique identifier for the component.
artifacts – List of artifacts which represent entities that are not serializable by default.
components – List of components to stack together and add to the database.
public_api(alpha): This API is in alpha and may change before becoming stable.
- property db¶
Datalayer property.
- static from_list(identifier, content, db: Datalayer | None = None)[source]¶
Helper method to create a Stack from a list of content.
- Parameters:
identifier – Unique identifier.
content – Content to create a stack.
db – Datalayer instance.
- full_import_path = 'superduperdb.components.stack.Stack'¶
- type_id: ClassVar[str] = 'stack'¶
- class superduperdb.Validation(identifier: str, artifacts: dc.InitVar[t.Optional[t.Dict]] = None, *, metrics: t.Sequence[Metric] = (), key: t.Optional[ModelInputType] = None, datasets: t.Sequence[Dataset] = ())[source]¶
Bases:
Component
Component which represents a validation definition.
- Parameters:
metrics – List of metrics for validation
key – Model input type key
datasets – Sequence of datasets.
- full_import_path = 'superduperdb.components.model.Validation'¶
- key: t.Optional[ModelInputType] = None¶
- type_id: t.ClassVar[str] = 'validation'¶
- class superduperdb.VectorIndex(identifier: str, artifacts: dataclasses.InitVar[typing.Optional[typing.Dict]] = None, *, indexing_listener: ~superduperdb.components.listener.Listener, compatible_listener: ~superduperdb.components.listener.Listener | None = None, measure: ~superduperdb.vector_search.base.VectorIndexMeasureType = VectorIndexMeasureType.cosine, metric_values: ~typing.Dict | None = <factory>)[source]¶
Bases:
Component
A component carrying the information to apply a vector index to a DB instance.
Base class for all components in SuperDuperDB.
Class to represent SuperDuperDB serializable entities that can be saved into a database.
- Parameters:
identifier – A unique identifier for the component.
artifacts – List of artifacts which represent entities that are not serializable by default.
indexing_listener – Listener which is applied to created vectors
compatible_listener – Listener which is applied to vectors to be compared
measure – Measure to use for comparison
metric_values – Metric values for this index
- property dimensions: int¶
Get dimension for vector database.
This dimension will be used to prepare vectors in the vector database.
- full_import_path = 'superduperdb.components.vector_index.VectorIndex'¶
- get_nearest(like: Document, db: Any, id_field: str = '_id', outputs: Dict | None = None, ids: Sequence[str] | None = None, n: int = 100) Tuple[List[str], List[float]] [source]¶
Get nearest results in this vector index.
Given a document, find the nearest results in this vector index, returned as two parallel lists of result IDs and scores.
- Parameters:
like – The document to compare against
db – The datalayer to use
id_field – Identifier field
outputs – An optional dictionary
ids – A list of ids to match
n – Number of items to return
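The return shape of `get_nearest` (two parallel lists of IDs and scores) can be illustrated with a brute-force search in plain Python. This is a sketch under stated assumptions, not the library's implementation: `toy_get_nearest` and the in-memory `vectors` dictionary are hypothetical, the cosine measure is picked because it is the documented default, and treating `ids` as a filter on the candidate set is our reading of "A list of ids to match".

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def toy_get_nearest(
    query: Sequence[float],
    vectors: Dict[str, Sequence[float]],
    ids: Optional[Sequence[str]] = None,
    n: int = 100,
) -> Tuple[List[str], List[float]]:
    """Hypothetical brute-force search returning parallel ID and score lists."""
    # Assumption: `ids`, when given, restricts which stored vectors are compared.
    if ids is None:
        candidates = vectors
    else:
        candidates = {i: vectors[i] for i in ids if i in vectors}
    scored = sorted(
        ((cosine(query, vec), key) for key, vec in candidates.items()),
        key=lambda pair: pair[0],
        reverse=True,  # highest similarity first
    )
    top = scored[:n]
    return [key for _, key in top], [score for score, _ in top]


vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
nearest_ids, nearest_scores = toy_get_nearest([1.0, 0.0], vectors, n=2)
```

Position `k` in the ID list corresponds to position `k` in the score list, so callers can zip the two lists to pair each result with its score.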
- get_vector(like: Document, models: List[str], keys: str | List | Dict, db: Any = None, outputs: Dict | None = None)[source]¶
Perform vector search.
Perform vector search with query like from outputs in db on self.identifier vector index.
- Parameters:
like – The document to compare against
models – List of models to retrieve outputs
keys – Keys available to retrieve outputs of model
db – A datalayer instance.
outputs – (optional) update like with outputs
- measure: VectorIndexMeasureType = 'cosine'¶
- metric_values: Dict | None¶
- property models_keys: Tuple[List[str], List[str | List[str] | Tuple[List[str], Dict[str, str]]]]¶
Return a list of model and keys for each listener.
- on_load(db: Datalayer) None [source]¶
On-load hook for the indexing listener and compatible listener.
Automatically loads the listeners if they are not already loaded.
- Parameters:
db – A Datalayer instance
- schedule_jobs(db: Datalayer, dependencies: Sequence[Job] = ()) Sequence[Any] [source]¶
Schedule jobs for the listener.
- Parameters:
db – The DB instance to process
dependencies – A list of dependencies
- type_id: ClassVar[str] = 'vector_index'¶
- ui_schema: ClassVar[List[Dict]] = [{'name': 'indexing_listener', 'type': 'component/listener'}, {'name': 'compatible_listener', 'optional': True, 'type': 'component/listener'}, {'choices': ['cosine', 'dot', 'l2'], 'name': 'measure', 'type': 'str'}]¶
- superduperdb.code(my_callable)[source]¶
Decorator to mark a function as remote code.
- Parameters:
my_callable – The callable to mark as remote code.
- superduperdb.objectmodel(item: Callable | None = None, identifier: str | None = None, datatype=None, model_update_kwargs: Dict | None = None, flatten: bool = False, output_schema: Schema | None = None)[source]¶
Decorator to wrap a function with ObjectModel.
When a function is wrapped with this decorator, the function comes out as an ObjectModel.
- Parameters:
item – Callable to wrap with ObjectModel.
identifier – Identifier for the ObjectModel.
datatype – Datatype for the model outputs.
model_update_kwargs – Dictionary to define update kwargs.
flatten – If True, flatten the outputs and save.
output_schema – Schema for the model outputs.
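Because `item` defaults to `None`, the signature suggests the decorator can be applied either bare or with keyword arguments. The pattern can be sketched as follows. `ToyObjectModel` and `toy_objectmodel` are hypothetical stand-ins written for this example; they do not import or reproduce SuperDuperDB, and falling back to the function name when no identifier is given is an assumption.

```python
import functools
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class ToyObjectModel:
    """Hypothetical stand-in for the object a wrapped function turns into."""

    identifier: str
    object: Callable
    flatten: bool = False
    model_update_kwargs: Dict = field(default_factory=dict)

    def predict_one(self, *args: Any, **kwargs: Any) -> Any:
        # Delegate a single prediction to the wrapped callable.
        return self.object(*args, **kwargs)


def toy_objectmodel(
    item: Optional[Callable] = None,
    identifier: Optional[str] = None,
    flatten: bool = False,
    model_update_kwargs: Optional[Dict] = None,
):
    """Decorator usable as @toy_objectmodel or @toy_objectmodel(identifier=...)."""
    if item is None:
        # Called with keyword arguments only: return a decorator that
        # remembers them and waits for the function.
        return functools.partial(
            toy_objectmodel,
            identifier=identifier,
            flatten=flatten,
            model_update_kwargs=model_update_kwargs,
        )
    return ToyObjectModel(
        # Assumption: default the identifier to the function's name.
        identifier=identifier or item.__name__,
        object=item,
        flatten=flatten,
        model_update_kwargs=model_update_kwargs or {},
    )


@toy_objectmodel
def double(x):
    return 2 * x


@toy_objectmodel(identifier="inc", flatten=True)
def add_one(x):
    return x + 1
```

After decoration, `double` and `add_one` are no longer plain functions but model objects, which matches the documented behaviour that "the function comes out as an ObjectModel".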