Working with external data sources


Note: this functionality is currently supported for MongoDB only.

Through the MongoDB query API, superduperdb can insert data that lives in external data sources. The following kinds of location are supported:

  • web URLs
  • URIs of objects in s3 buckets

The trick is to pass the uri parameter to an encoder instead of the raw data. Here is an example where we add a .pdf file directly from a location on the public internet.

import io
from PyPDF2 import PdfReader
from superduperdb import Document, Encoder
from superduperdb.backends.mongodb import Collection

collection = Collection('pdf-files')

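def load_pdf(raw_bytes):
    # Decode the stored PDF bytes into plain text, one chunk per page
    text = []
    for page in PdfReader(io.BytesIO(raw_bytes)).pages:
        text.append(page.extract_text())
    return '\n----NEW-PAGE----\n'.join(text)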

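# No `encoder=...` parameter is required, since text is never converted back to `.pdf` format
pdf_enc = Encoder('my-pdf-encoder', decoder=load_pdf)

# Placeholder: replace with the URL of a publicly accessible .pdf file
PDF_URI = 'https://example.com/path/to/file.pdf'

# This command inserts a record which refers to this URI
# and also downloads the content from the URI and saves
# it in the record.
# `db` is assumed to be an existing superduperdb connection to MongoDB.
db.execute(
    collection.insert_one(Document({'txt': pdf_enc(uri=PDF_URI)}))
)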

Now when the record is read back from the database, the decoder is applied and the field is returned as text:

>>> r = db.execute(collection.find_one())
>>> print(r['txt'])
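
To make the flow above concrete, here is a minimal standard-library sketch of the general idea: a record keeps a reference to the URI, the bytes behind it are downloaded and stored, and a decoder turns those bytes into a usable object at read time. This is not superduperdb's implementation. The names fetch_bytes and decode_text and the layout of the record dictionary are hypothetical, chosen only for illustration, and a local file:// URI stands in for a web URL so the sketch runs without network access.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen


def fetch_bytes(uri: str) -> bytes:
    """Download the raw bytes behind a URI (hypothetical helper)."""
    with urlopen(uri) as response:
        return response.read()


def decode_text(raw: bytes) -> str:
    """Hypothetical decoder: turn stored bytes back into a Python object."""
    return raw.decode("utf-8")


# A local file:// URI stands in for a web URL so no network is needed.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.txt"
    path.write_bytes(b"hello from a URI")
    uri = path.as_uri()

    # "Insert": keep the reference and the downloaded content together.
    record = {"uri": uri, "bytes": fetch_bytes(uri)}

# "Load": the decoder runs on the stored bytes, not on the original URI,
# so this still works after the temporary file has been deleted.
loaded = decode_text(record["bytes"])
print(loaded)
```

In the PDF example, load_pdf plays the role of decode_text: the stored bytes stay in .pdf form and are only converted to text when the record is read.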