5 Steps to Build a Machine Learning Model from Scratch in Databricks

Building, training, and deploying a complete Machine Learning model end-to-end in a single environment is easier than it looks, especially when using Databricks.

Anyone working in data science knows that exploration, modeling, and deployment often happen in separate tools, making the process slower and more error-prone.
Databricks solves this by unifying everything into one collaborative and scalable space, where you can handle everything from data engineering to model deployment.

In this article, you’ll learn how to build a complete Machine Learning model in Databricks, following 5 practical steps, from exploratory analysis to production.

What is Machine Learning

Simply put, Machine Learning is the field of artificial intelligence that teaches computers to learn from data.

When working with millions of records, identifying patterns manually becomes impossible. Machine learning does that for us, uncovering relationships and behaviors that aren’t visible to the naked eye.

This technology is everywhere, from streaming recommendations and e-commerce product suggestions to financial transaction analysis. Every time your bank blocks a suspicious purchase, there’s a machine learning model making that decision in the background.

What is Databricks

Databricks is a unified data and AI platform that combines data engineering, data science, and machine learning in a collaborative notebook-based environment.

It’s compatible with all major cloud providers (AWS, Azure, and GCP) and natively integrates powerful tools like Apache Spark and MLflow, allowing you to process massive datasets and track model experiments seamlessly.

It also supports multiple programming languages (Python, SQL, R, and Scala) within the same workspace, making collaboration across technical teams much easier.

Why Use Databricks for Machine Learning

The main reason to use Databricks for machine learning is integration. With Databricks, you can manage the entire machine learning lifecycle in one place:

  • Data ingestion and exploration
  • Feature engineering and preprocessing
  • Model training and evaluation
  • Model versioning and deployment

Another major advantage is MLflow, which automatically records parameters, metrics, and model versions during experiments.

This makes it easy to compare results and select the best-performing model, while keeping a full experiment history.
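In many Databricks ML runtimes this automatic tracking is enabled by default; otherwise it can be switched on with a single call (a minimal sketch):

import mlflow

# Automatically log parameters, metrics, and models for supported
# libraries such as scikit-learn and XGBoost
mlflow.autolog()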

Step 1 – Setting Up the Environment and Loading the Dataset

Let’s start by creating the cluster and loading the dataset.
In this example, we’ll use a credit card transaction dataset with over 1 million records. The goal is to predict whether a transaction is fraudulent (1) or legitimate (0).

As is typical in fraud detection problems, the dataset is highly imbalanced: only about 0.6% of transactions are fraudulent.

We begin by loading the data into Databricks using Spark:

				
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("/FileStore/tables/creditcard.csv")

df.display()

The display() function in Databricks allows for a quick preview of your data right inside the notebook. From here, you can check the column names, data types, and initial statistics directly in the web interface.
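If you prefer a programmatic check of what Spark inferred, the schema and row count are one call away:

# Column names and inferred data types
df.printSchema()

# Total number of transactions
print(f"Rows: {df.count():,}")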

Step 2 – Exploratory Data Analysis

Once the dataset is loaded, the next step is exploratory data analysis (EDA).

Databricks makes it easy to switch between pandas and Spark DataFrames. For small data samples, pandas is fast and straightforward; for large volumes, Spark handles distributed computation efficiently.

				
import pandas as pd

# Convert to pandas for a quick preview
pdf = df.limit(10000).toPandas()

# Check for null values and duplicates
print(pdf.isnull().sum())
print(pdf.duplicated().sum())

During this stage, we check for:

  • missing or null values,
  • duplicate records,
  • variable distributions,
  • and the balance of the target class (fraud vs. non-fraud).

Databricks also provides built-in visualization tools to plot correlations and value distributions directly from your dataframe.
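For the class balance specifically, a quick Spark check confirms the imbalance mentioned earlier. A minimal sketch, assuming the label column is named Class (as in the classic credit card fraud dataset); adjust to your schema:

from pyspark.sql.functions import col

# Count transactions per class to confirm the imbalance
df.groupBy("Class").count().display()

# Fraction of fraudulent transactions
fraud_rate = df.filter(col("Class") == 1).count() / df.count()
print(f"Fraud rate: {fraud_rate:.4%}")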

Step 3 – Feature Engineering

After exploring the data, we move on to feature engineering, the process of creating new features that help the model learn more effectively.

For example, we can extract the card brand (Visa, MasterCard, etc.) from the first digit of the card number:

				
from pyspark.sql.functions import col, substring, when

df = df.withColumn("card_brand",
    when(substring(col("card_number"), 1, 1) == "4", "Visa")
    .when(substring(col("card_number"), 1, 1) == "5", "MasterCard")
    .otherwise("Other"))

Other useful transformations include:

  • converting categorical variables into one-hot encoded columns,
  • normalizing numerical values (when appropriate),
  • deriving time-based features (such as transaction hour or day of the week), as sketched below.
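For the time-based features, here is a minimal sketch. It assumes a raw timestamp column, here called transaction_time, which is not among the columns shown so far:

from pyspark.sql.functions import col, to_timestamp, hour, dayofweek

# Hypothetical timestamp column "transaction_time"; adjust to your schema
df = df.withColumn("transaction_ts", to_timestamp(col("transaction_time")))
df = df.withColumn("transaction_hour", hour(col("transaction_ts")))
df = df.withColumn("transaction_day_of_week", dayofweek(col("transaction_ts")))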

Important note: in fraud detection, outliers are often the key to identifying fraud. That means you shouldn’t remove or normalize extreme values; they contain valuable information that helps the model learn unusual behaviors.

Step 4 – Model Training and Experiment Tracking with MLflow

With the data ready, it’s time to move into model training.

Databricks integrates directly with MLflow, which automatically logs model parameters, metrics, and artifacts for each run.

Here, we’ll train three classification models for comparison:

				
import mlflow
import mlflow.sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# X and y are the feature matrix and fraud label prepared from the
# engineered dataset (e.g., by converting the Spark DataFrame to pandas
# and separating the label column)

# Split training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "RandomForest": RandomForestClassifier(),
    "XGBoost": XGBClassifier(),
    "XGBoost Balanced": XGBClassifier(scale_pos_weight=100)
}

for name, model in models.items():
    with mlflow.start_run(run_name=name):
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        acc = accuracy_score(y_test, preds)
        f1 = f1_score(y_test, preds)
        auc = roc_auc_score(y_test, preds)

        mlflow.log_metric("accuracy", acc)
        mlflow.log_metric("f1_score", f1)
        mlflow.log_metric("auc", auc)
        mlflow.sklearn.log_model(model, name)

MLflow records all metrics automatically, allowing you to compare model results visually in the Databricks UI.
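If you prefer to compare them programmatically, the same runs can be pulled into a DataFrame. A minimal sketch, assuming the runs above were logged to the notebook's current experiment:

import mlflow

# Pull the logged runs into a pandas DataFrame, best F1-score first
runs = mlflow.search_runs(order_by=["metrics.f1_score DESC"])
print(runs[["tags.mlflow.runName", "metrics.accuracy", "metrics.f1_score", "metrics.auc"]])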

In our case, the balanced XGBoost model achieved the best results, with accuracy above 70% and the highest F1-score of the three models, making it the best candidate for deployment.

Step 5 – Saving and Deploying the Model

Once the best model has been selected, it’s time to register and deploy it. The Databricks Model Registry lets you store model versions, assign stages (Staging, Production), and expose them via API endpoints for real-time use.

				
import mlflow.pyfunc

# Register model in Model Registry
# (replace <run_id> with the run ID of the best experiment from Step 4)
result = mlflow.register_model(
    "runs:/<run_id>/model",
    "FraudDetectionModel"
)

With the model registered, you can serve it through a REST API to make live predictions.

For example, in a banking scenario, each new transaction sent to the model could return:

  • 0 → legitimate transaction
  • 1 → potential fraud

This output can be connected to automated decision systems to block suspicious transactions or trigger alerts for manual review.
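As a minimal sketch of that last step: it assumes the registered version is promoted to the Production stage (newer MLflow releases favor model aliases over stages), and new_transactions stands in for a pandas DataFrame of incoming features.

from mlflow.tracking import MlflowClient
import mlflow.pyfunc

# Promote the registered version to Production
client = MlflowClient()
client.transition_model_version_stage(
    name="FraudDetectionModel",
    version=result.version,
    stage="Production"
)

# Load the production model and score incoming transactions
model = mlflow.pyfunc.load_model("models:/FraudDetectionModel/Production")
predictions = model.predict(new_transactions)  # 0 = legitimate, 1 = potential fraud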

FAQ

Why use Databricks instead of another environment?
Databricks unifies data exploration, processing, training, and deployment in one environment, eliminating the need to switch tools or languages.

Do I need to know Spark to follow this tutorial?
Not necessarily; you can start with pandas and scale up to Spark when working with larger datasets.

How can I handle class imbalance?
We used balanced models such as XGBoost with scale_pos_weight. You can also try oversampling or undersampling techniques depending on your dataset.
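As a rough sketch of that weighting heuristic, scale_pos_weight can be derived from the class counts themselves (using the y_train split from Step 4):

from xgboost import XGBClassifier

# Common heuristic: weight the positive (fraud) class by the
# negative-to-positive ratio of the training data
ratio = (y_train == 0).sum() / (y_train == 1).sum()
balanced_model = XGBClassifier(scale_pos_weight=ratio)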

Can I apply this workflow to problems other than fraud detection?
Yes. These steps apply to any classification or regression problem; simply adjust the dataset and evaluation metrics as needed.

Next Steps

Databricks enables you to develop, test, and deploy machine learning models in a simple, collaborative, and scalable way. 

If you’d like to try it yourself, create a free account in the Databricks Community Edition and experiment with your first MLflow project.

And if you want to apply this process to your business, our team can help! Click the banner below and schedule a call.
