Fine-tune LLM on database
Configure your production system​
If you would like to use the production features of SuperDuperDB, then you should set the relevant connections and configurations in a configuration file. Otherwise you are welcome to use "development" mode to get going with SuperDuperDB quickly.
import os
os.makedirs('.superduperdb', exist_ok=True)
os.environ['SUPERDUPERDB_CONFIG'] = '.superduperdb/config.yaml'
- MongoDB Community
- MongoDB Atlas
- SQLite
- MySQL
- Oracle
- PostgreSQL
- Snowflake
- Clickhouse
CFG = '''
data_backend: mongodb://127.0.0.1:27017/documents
artifact_store: filesystem://./artifact_store
cluster:
  cdc:
    strategy: null
    uri: ray://127.0.0.1:20000
  compute:
    uri: ray://127.0.0.1:10001
  vector_search:
    backfill_batch_size: 100
    type: in_memory
    uri: http://127.0.0.1:21000
'''
CFG = '''
artifact_store: filesystem://<path-to-artifact-store>
cluster:
  compute: ray://<ray-host>
  cdc:
    uri: http://<cdc-host>:<cdc-port>
  vector_search:
    uri: http://<vector-search-host>:<vector-search-port>
    type: native
data_backend: mongodb+srv://<user>:<password>@<mongo-host>:27017/documents
'''
CFG = '''
artifact_store: filesystem://<path-to-artifact-store>
cluster:
  compute: ray://<ray-host>
  cdc:
    uri: http://<cdc-host>:<cdc-port>
  vector_search:
    uri: http://<vector-search-host>:<vector-search-port>
data_backend: sqlite://<path-to-db>.db
'''
CFG = '''
artifact_store: filesystem://<path-to-artifact-store>
cluster:
  compute: ray://<ray-host>
  cdc:
    uri: http://<cdc-host>:<cdc-port>
  vector_search:
    uri: http://<vector-search-host>:<vector-search-port>
data_backend: mysql://<user>:<password>@<host>:<port>/<database>
'''
CFG = '''
artifact_store: filesystem://<path-to-artifact-store>
cluster:
  compute: ray://<ray-host>
  cdc:
    uri: http://<cdc-host>:<cdc-port>
  vector_search:
    uri: http://<vector-search-host>:<vector-search-port>
data_backend: mssql://<user>:<password>@<host>:<port>
'''
CFG = '''
artifact_store: filesystem://<path-to-artifact-store>
cluster:
  compute: ray://<ray-host>
  cdc:
    uri: http://<cdc-host>:<cdc-port>
  vector_search:
    uri: http://<vector-search-host>:<vector-search-port>
data_backend: postgres://<user>:<password>@<host>:<port>/<database>
'''
CFG = '''
artifact_store: filesystem://<path-to-artifact-store>
metadata_store: sqlite://<path-to-sqlite-db>.db
cluster:
  compute: ray://<ray-host>
  cdc:
    uri: http://<cdc-host>:<cdc-port>
  vector_search:
    uri: http://<vector-search-host>:<vector-search-port>
data_backend: snowflake://<user>:<password>@<account>/<database>
'''
CFG = '''
artifact_store: filesystem://<path-to-artifact-store>
metadata_store: sqlite://<path-to-sqlite-db>.db
cluster:
  compute: ray://<ray-host>
  cdc:
    uri: http://<cdc-host>:<cdc-port>
  vector_search:
    uri: http://<vector-search-host>:<vector-search-port>
data_backend: clickhouse://<user>:<password>@<host>:<port>
'''
with open(os.environ['SUPERDUPERDB_CONFIG'], 'w') as f:
    f.write(CFG)
Start your cluster​
Starting a SuperDuperDB cluster is useful in production, and during model development when you want scalable compute, shared access to models for collaboration, and monitoring.
If you don't need this, then it is simpler to start in development mode.
- Experimental Cluster
- Docker-Compose
!python -m superduperdb local-cluster up
!make testenv_image
!make testenv_init
Connect to SuperDuperDB​
Note that this is only relevant if you are running SuperDuperDB in development mode. Otherwise refer to "Configuring your production system".
- MongoDB
- SQLite
- MySQL
- Oracle
- PostgreSQL
- Snowflake
- Clickhouse
- DuckDB
- Pandas
- MongoMock
from superduperdb import superduper
db = superduper('mongodb://localhost:27017/documents')
from superduperdb import superduper
db = superduper('sqlite://my_db.db')
from superduperdb import superduper
user = 'superduper'
password = 'superduper'
port = 3306
host = 'localhost'
database = 'test_db'
db = superduper(f"mysql://{user}:{password}@{host}:{port}/{database}")
from superduperdb import superduper
user = 'sa'
password = 'Superduper#1'
port = 1433
host = 'localhost'
db = superduper(f"mssql://{user}:{password}@{host}:{port}")
!pip install psycopg2
from superduperdb import superduper
user = 'postgres'
password = 'postgres'
port = 5432
host = 'localhost'
database = 'test_db'
db_uri = f"postgres://{user}:{password}@{host}:{port}/{database}"
db = superduper(db_uri, metadata_store=db_uri.replace('postgres://', 'postgresql://'))
from superduperdb import superduper
user = "superduperuser"
password = "superduperpassword"
account = "XXXX-XXXX" # ORGANIZATIONID-USERID
database = "FREE_COMPANY_DATASET/PUBLIC"
snowflake_uri = f"snowflake://{user}:{password}@{account}/{database}"
db = superduper(
    snowflake_uri,
    metadata_store='sqlite:///your_database_name.db',
)
from superduperdb import superduper
user = 'default'
password = ''
port = 8123
host = 'localhost'
db = superduper(f"clickhouse://{user}:{password}@{host}:{port}", metadata_store=f'mongomock://meta')
from superduperdb import superduper
db = superduper('duckdb://mydb.duckdb')
from superduperdb import superduper
db = superduper(['my.csv'], metadata_store=f'mongomock://meta')
from superduperdb import superduper
db = superduper('mongomock:///test_db')
Install related dependencies​
!pip install transformers torch accelerate trl peft datasets
Get LLM Finetuning Data​
The following are examples of training data in different formats.
- Text
- Prompt-Response
- Chat
from datasets import load_dataset
from superduperdb.base.document import Document
dataset_name = "timdettmers/openassistant-guanaco"
dataset = load_dataset(dataset_name)
train_dataset = dataset["train"]
eval_dataset = dataset["test"]
train_documents = [
    Document({**example, "_fold": "train"})
    for example in train_dataset
]
eval_documents = [
    Document({**example, "_fold": "valid"})
    for example in eval_dataset
]
datas = train_documents + eval_documents
from datasets import load_dataset
from superduperdb.base.document import Document
dataset_name = "mosaicml/instruct-v3"
dataset = load_dataset(dataset_name)
train_dataset = dataset["train"]
eval_dataset = dataset["test"]
train_documents = [
    Document({**example, "_fold": "train"})
    for example in train_dataset
]
eval_documents = [
    Document({**example, "_fold": "valid"})
    for example in eval_dataset
]
datas = train_documents + eval_documents
from datasets import load_dataset
from superduperdb.base.document import Document
dataset_name = "philschmid/dolly-15k-oai-style"
dataset = load_dataset(dataset_name)['train'].train_test_split(0.9)
train_dataset = dataset["train"]
eval_dataset = dataset["test"]
train_documents = [
    Document({**example, "_fold": "train"})
    for example in train_dataset
]
eval_documents = [
    Document({**example, "_fold": "valid"})
    for example in eval_dataset
]
datas = train_documents + eval_documents
We can define different training parameters to handle each of these data formats.
- Text
- Prompt-Response
- Chat
# Function for transformation after extracting data from the database
transform = None
key = ('text')
training_kwargs=dict(dataset_text_field="text")
# Function for transformation after extracting data from the database
def transform(prompt, response):
    return {'text': prompt + response + "</s>"}
key = ('prompt', 'response')
training_kwargs=dict(dataset_text_field="text")
# Function for transformation after extracting data from the database
transform = None
key = ('messages')
training_kwargs=None
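As a quick sanity check of the Prompt-Response variant above (a minimal sketch; the strings are invented and only illustrate the format), the transform simply concatenates prompt and response and appends an end-of-sequence token:
# Illustrative only: inspect what the Prompt-Response transform produces.
example = transform("### Instruction: Say hello.\n### Response: ", "Hello!")
print(example["text"])  # '### Instruction: Say hello.\n### Response: Hello!</s>'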
Example input_text and output_text
- Text
- Prompt-Response
- Chat
data = datas[0]
input_text, output_text = data["text"].rsplit("### Assistant: ", maxsplit=1)
input_text += "### Assistant: "
output_text = output_text.rsplit("### Human:")[0]
print("Input: --------------")
print(input_text)
print("Response: --------------")
print(output_text)
data = datas[0]
input_text = data["prompt"]
output_text = data["response"]
print("Input: --------------")
print(input_text)
print("Response: --------------")
print(output_text)
data = datas[0]
messages = data["messages"]
input_text = messages[:-1]
output_text = messages[-1]["content"]
print("Input: --------------")
print(input_text)
print("Response: --------------")
print(output_text)
Setup simple tables or collections​
- MongoDB
- SQL
# If our data is in a format natively supported by MongoDB, we don't need to do anything.
from superduperdb.backends.mongodb import Collection
table_or_collection = Collection('documents')
select = table_or_collection.find({})
from superduperdb.backends.ibis import Table
from superduperdb import Schema, DataType
from superduperdb.backends.ibis.field_types import dtype
for index, data in enumerate(datas):
    data["id"] = str(index)

# Infer a field type for each column; use a loop variable other than `key`,
# which is reserved for the training key defined above.
fields = {}
for field_name, value in data.items():
    fields[field_name] = dtype(type(value))

schema = Schema(identifier="schema", fields=fields)
table_or_collection = Table('documents', schema=schema)
db.apply(table_or_collection)
select = table_or_collection.select("id", "prompt", "response")
Insert simple data​
In order to insert the data, we may need to create a Schema for encoding any special Datatype column(s) in the databackend.
- MongoDB
- SQL
from superduperdb import Document
ids, _ = db.execute(table_or_collection.insert_many(datas))
ids, _ = db.execute(table_or_collection.insert(datas))
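To confirm that the insert worked, you can re-run the select query defined earlier and count the results; this is a minimal sketch and should work for either of the variants above:
# Sanity check (optional): count the documents/rows returned by `select`.
results = list(db.execute(select))
print(len(results))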
Select a Model​
model_name = "facebook/opt-125m"
model_kwargs = dict()
tokenizer_kwargs = dict()
# or
# model_name = "mistralai/Mistral-7B-Instruct-v0.2"
# token = "hf_xxxx"
# model_kwargs = dict(token=token)
# tokenizer_kwargs = dict(token=token)
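Optionally, you can verify that the chosen checkpoint name (and any access token) resolves before building the trainer. This is a minimal sketch using transformers directly, not part of the SuperDuperDB API:
# Optional check: load just the tokenizer to confirm the model name resolves.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, **tokenizer_kwargs)
print(tokenizer("Hello world")["input_ids"])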
Build A Trainable LLM​
Create an LLM Trainer for training
The parameters of this LLM Trainer are basically the same as transformers.TrainingArguments, but some additional parameters have been added for easier training setup.
from superduperdb.ext.transformers import LLM, LLMTrainer
trainer = LLMTrainer(
    identifier="llm-finetune-trainer",
    output_dir="output/finetune",
    overwrite_output_dir=True,
    num_train_epochs=3,
    save_total_limit=3,
    logging_steps=10,
    evaluation_strategy="steps",
    save_steps=100,
    eval_steps=100,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    max_seq_length=512,
    key=key,
    select=select,
    transform=transform,
    training_kwargs=training_kwargs,
)
- Lora
- QLora
- Deepspeed
- Multi-GPUS
trainer.use_lora = True
trainer.use_lora = True
trainer.bits = 4
!pip install deepspeed
deepspeed = {
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 2,
    },
}
trainer.use_lora = True
trainer.bits = 4
trainer.deepspeed = deepspeed
trainer.use_lora = True
trainer.bits = 4
trainer.num_gpus = 2
Create a trainable LLM model and add it to the database; the training task will then run automatically.
llm = LLM(
    identifier="llm",
    model_name_or_path=model_name,
    trainer=trainer,
    model_kwargs=model_kwargs,
    tokenizer_kwargs=tokenizer_kwargs,
)
db.apply(llm)
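To verify that the model (and its training job) was registered, db.show can list what is stored; this mirrors the db.show("checkpoint") call used further below:
# Optional: list registered models; checkpoints appear once training has run.
print(db.show("model"))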
Load the trained model​
There are two methods to load a trained model:
- Load the model directly: This loads the model with the best metrics (if the transformers best-model saving strategy is enabled) or the latest version of the model.
- Use a specified checkpoint: This method downloads the specified checkpoint, then initializes the base model, and finally merges the checkpoint with the base model. This approach supports custom operations such as resetting flash_attentions, model quantization, etc., during initialization.
- Load Trained Model Directly
- Use a specified checkpoint
llm = db.load("model", "llm")
from superduperdb.ext.transformers import LLM, LLMTrainer
experiment_id = db.show("checkpoint")[-1]
version = None # None means the last checkpoint
checkpoint = db.load("checkpoint", experiment_id, version=version)
llm = LLM(
    identifier="llm",
    model_name_or_path=model_name,
    adapter_id=checkpoint,
    model_kwargs=dict(load_in_4bit=True),
)
llm.predict_one(input_text, max_new_tokens=200)
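Putting the pieces together, a minimal end-to-end usage sketch (the prompt string below is only illustrative and should follow the same format as your training data):
# Reload the trained model from the database and generate from a new prompt.
llm = db.load("model", "llm")
prompt = "### Human: Summarize what this model was fine-tuned on.### Assistant: "
print(llm.predict_one(prompt, max_new_tokens=200))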