October 28, 2023

Aleo ZKML on the Boston Housing dataset [PART 1]

Introduction

Zero-knowledge machine learning (ZKML) is a relatively new field that focuses on training machine learning models without revealing the underlying data. Aleo provides a framework with inherent privacy-preserving properties, which makes it well suited to ZKML.

In this post we explain how Aleo can be applied to ZKML using the Boston housing dataset, which is widely used in statistics and machine learning. The dataset describes features of houses in Boston, Massachusetts, in the United States. With 506 data points, it is typically used in regression tasks that predict house prices from the given features.

We will apply a simple linear regression algorithm to the Boston dataset to establish a linear relationship between these features and the housing prices.

Prerequisites

This project was done in a Jupyter notebook.

Set up Aleo and Leo

You'll need to install Aleo and Leo and set up the required environment. The commands below download the Leo release archive and unpack it on your system or server.

Run the code below:

!wget https://github.com/AleoHQ/leo/releases/download/v1.7.0/leo-v1.7.0-x86_64-unknown-linux-musl.zip
!unzip leo-v1.7.0-x86_64-unknown-linux-musl.zip
!rm -rf leo-v1.7.0-x86_64-unknown-linux-musl.zip

Wait for the process to complete. The cell output shows the archive being downloaded and extracted.

Get the dataset

Obtain the Boston dataset which we will be using for processing.

# Get the boston housing data
!wget https://gist.githubusercontent.com/icodragon/63ede7ff4478c680049aa353d1cefb11/raw/3985fb403e3c0839000c62396b49116be7d67692/BostonHousing.csv

Clone the Aleo zk-ML initiative

!git clone https://github.com/stakemepro/aleo-zkml-initiative-1.git && mv aleo-zkml-initiative-1/interp_leo interp_leo && rm -rf aleo-zkml-initiative-1

Without ZK

Import the libraries you will need for the analysis.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import numpy as np
import pandas as pd

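Build a function to load the data into your workspace.

def load_data():
    # Read the CSV file downloaded earlier
    data = pd.read_csv('./BostonHousing.csv')
    # 'medv' is the target column; every other column is a feature
    X = data.drop('medv', axis=1)
    y = data['medv']
    # Round features and target to the nearest integer
    X = np.round(X).astype(int)
    y = np.round(y).astype(int)
    return X, y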
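
Rounding every column to an integer is lossy, and it can help to see how much. The sketch below uses Python's built-in round, which breaks ties the same way np.round does (halves go to the nearest even integer). The sample values are made up for illustration and are not rows from the dataset.

```python
# Hypothetical feature values (not taken from BostonHousing.csv)
samples = [0.469, 0.538, 2.5, 3.5, 6.575, 24.0]

# round() with no ndigits returns an int and rounds halves to even,
# the same tie-breaking rule np.round applies.
rounded = [round(v) for v in samples]

for raw, r in zip(samples, rounded):
    print(f"{raw:>6} -> {r}")
```

Any column whose values all lie between 0 and 1 collapses to just 0 or 1 after this step, which is worth keeping in mind when reading the model's error later on.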

Load the data into X and y

X, y = load_data()

Split the data into train and test sets in a ratio of 80:20

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
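
With 506 rows, a test_size of 0.2 cannot give an exact 80:20 split. As a rough sketch, assuming scikit-learn rounds the test count up and gives the remainder to the training set (its documented behaviour for a float test_size), the sizes work out as follows:

```python
import math

n_samples = 506   # rows in the Boston housing dataset
test_size = 0.2   # fraction passed to train_test_split

# Assumption: the test count is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 404 102
```

You can confirm the actual sizes in the notebook with X_train.shape and X_test.shape.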

Convert the train data to a list

X_train_leo = X_train.values.tolist()
y_train_leo = y_train.values.tolist()

print(X_train_leo)
print(y_train_leo)

Model

We use a simple linear model for this analysis: we create the model and fit it on the training data (X_train, y_train).

model = LinearRegression()
model.fit(X_train, y_train)

Next, we get the model's weights and intercept and assign them to the variables weights and bias.

weights = model.coef_
bias = model.intercept_
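
For reference, weights holds one coefficient per feature column and bias is the intercept. Writing the features of a single house as x_1, ..., x_n (the symbols here are notation introduced for illustration), the fitted model predicts:

```latex
\hat{y} = b + \sum_{i=1}^{n} w_i x_i
```

Here w_i corresponds to weights[i-1], b to bias, and n to the number of feature columns in X.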

View the weights by calling the variable weights

weights

Prediction

Here we use our linear model to make predictions on the test dataset, obtaining a predicted value of y for each test sample.

Build the prediction function for linear regression

# Perform the linear prediction
def linear_regression_predict(weights, features, bias):
    prediction = 0
    # Accumulate the weighted sum of the features
    for i in range(len(weights)):
        prediction += weights[i] * features[i]
    # Add the intercept once, after the loop
    prediction += bias
    return prediction
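
To sanity-check the function before running it on the real data, it can be tried on numbers small enough to verify by hand. The two-feature model below is hypothetical and chosen only to keep the arithmetic easy; the function is repeated so that the snippet runs on its own.

```python
def linear_regression_predict(weights, features, bias):
    """Return bias + sum(weights[i] * features[i])."""
    prediction = 0
    for i in range(len(weights)):
        prediction += weights[i] * features[i]
    prediction += bias
    return prediction

# Hypothetical toy model (not the fitted Boston model)
toy_weights = [2.0, -1.0]
toy_bias = 5.0
toy_features = [3, 4]

# 2.0*3 + (-1.0)*4 + 5.0 = 7.0
result = linear_regression_predict(toy_weights, toy_features, toy_bias)
print(result)  # 7.0

# The same computation written with zip and sum
compact = sum(w * x for w, x in zip(toy_weights, toy_features)) + toy_bias
print(compact == result)  # True
```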

Get the predicted values of y and assign the values as y_pred.

y_pred = [linear_regression_predict(weights, x, bias) for x in X_test.values.tolist()]

Next, inspect y_pred

y_pred

Evaluation

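Next we evaluate our model. Although the function below is named rmse, it divides each residual by the true value, so what it computes is the root mean squared percentage error (RMSPE), a relative variant of the root mean squared error (RMSE).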

Build the RMSE function

def rmse(y_true, y_pred):    
    '''    
    Compute Root Mean Square Percentage Error between two arrays.    
    '''    
    loss = np.sqrt(np.mean(np.square(((y_true - y_pred) / y_true)), axis=0))
    return loss
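
Because the function above scales each residual by y_true, its result is a relative error rather than one measured in the units of y. A minimal sketch of the difference, using made-up numbers and two hypothetical helper names (rmse_plain and rmspe):

```python
import math

def rmse_plain(y_true, y_pred):
    """Root mean squared error, in the same units as y."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def rmspe(y_true, y_pred):
    """Root mean squared percentage error: residuals are scaled by y_true."""
    squared = [((t - p) / t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up values for illustration only
y_true = [10, 20, 40]
y_pred = [12, 18, 44]

plain = rmse_plain(y_true, y_pred)   # sqrt(8), about 2.83
relative = rmspe(y_true, y_pred)     # sqrt(0.02), about 0.14
print(plain, relative)
```

The relative form is undefined when a true value is 0, so it only suits targets that are strictly non-zero.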

Print the error value

print('🔥 RMSE error:', rmse(np.array(y_test), np.array(y_pred)))

From the output we can see that the reported error is 0.28489.

We will continue with Part II in the next update.

Follow this link to access Part II


Website: https://www.aleo.org/

Twitter: https://twitter.com/AleoHQ

GitHub: https://github.com/AleoHQ

Discord: https://discord.com/invite/aleohq