Skip to main content

Encoding data

In AI, typical types of data are:

  • Numbers (integers, floats, etc.)
  • Text
  • Images
  • Audio
  • Videos
  • ...bespoke in house data

Most databases don't support any data other than numbers and text. SuperDuperDB enables the use of these more interesting data-types using the Document wrapper.

Document​

The Document wrapper, wraps dictionaries, and is the container which is used whenever data is exchanged with your database. That means inputs, and queries, wrap dictionaries used with Document and also results are returned wrapped with Document.

Whenever the Document contains data which is in need of specialized serialization, then the Document instance contains calls to DataType instances.

DataType​

The DataType class, allows users to create and encoder custom datatypes, by providing their own encoder/decoder pairs.

Here is an example of applying an DataType to add an image to a Document:

import pickle
import PIL.Image
from superduperdb import DataType, Document

image = PIL.Image.open('my_image.jpg')

my_image_encoder = DataType(
identifier='my-pil',
encoder=lambda x: pickle.dumps(x),
decoder=lambda x: pickle.loads(x),
)

document = Document({'img': my_image_encoder(image)})

The bare-bones dictionary may be exposed with .unpack():

>>> document.unpack()
{'img': <PIL.PngImagePlugin.PngImageFile image mode=P size=400x300>}

By default, data encoded with DataType is saved in the database, but developers may alternatively save data in the db.artifact_store instead.

This may be achiever by specifying the encodable=... parameter:

my_image_encoder = DataType(
identifier='my-pil',
encoder=lambda x: pickle.dumps(x),
decoder=lambda x: pickle.loads(x),
encodable='artifact', # saves to disk/ db.artifact_store
# encodable='lazy_artifact', # Just in time loading
)

The encodable specifies the type of the output of the __call__ method, which will be a subclass of superduperdb.components.datatype._BaseEncodable. These encodables become leaves in the tree defines by a Document.

Schema​

A Schema allows developers to connect named fields of dictionaries or columns of pandas.DataFrame objects with DataType instances.

A Schema is used, in particular, for SQL databases/ tables, and for models that return multiple outputs.

Here is an example Schema, which is used together with text and image fields:

s = Schema('my-schema', fields={'my-text': 'str', 'my-image': my_image_encoder})