Introduction.
These days I'm working, with my colleague C. De Bari, on setting up training for customers and partners where we will talk about Oracle Machine Learning for Python (OML4Py). Cool, but what is it?
Oracle DB has long had an option called Oracle Advanced Analytics. It gives you access to many widely used Machine Learning algorithms and lets you run them without moving data out of the DB.
Many algorithms are available: regression, classification, anomaly detection, clustering, feature extraction and so on. The list is long.
Using these algorithms inside Oracle DB gives you the power of Machine Learning without moving data out of the DB, with all the positive implications for security and efficiency. It also lets you leverage the underlying parallel DB engine.
Until now, the only available interfaces were PL/SQL and R. OML4Py lets you invoke all these algorithms, plus some added recently, from what is now the most widely used language in Data Science: Python.
OML4Py will soon be available as part of Oracle Database 19c on-premises, and then as part of Oracle Cloud.
A Python client library.
As a starting point, we can think of OML4Py as a client library that you can use, in your Python program or inside your Jupyter Notebook, to invoke Oracle Machine Learning algorithms.
The nice thing is that the library has been designed so that people already using open source frameworks like scikit-learn can start writing ML code easily. The API is very similar.
With OML4Py you can:
- Connect to the DB
- Create Python DataFrames that are proxy objects for Oracle DB tables
- Train your selected algorithms on data coming from these tables
- Score your model and improve its performance
- Make predictions on new data
OML4Py will be available through a Python module (oml) based on cx_Oracle.
The deployment architecture.
Proxy Objects.
This is one of the nice features.
If you've studied ML with scikit-learn, you know that many examples in the ML field use data wrapped in a Pandas DataFrame. A DataFrame is a tabular structure, in many ways similar to a DB table, with rows and columns. You can, with a single line of Python code, load a Pandas DataFrame from a csv file, or from an Oracle DB table. But a Pandas DataFrame is entirely held in memory. Therefore its size is constrained by the available memory.
With OML4Py, after having established a connection to Oracle DB, you can easily create an OML DataFrame, in sync with a DB table.
data = oml.sync(table = 'BREASTCANCER')
What is the difference? The difference is that "data" is a "proxy object". The sync invocation doesn't pull the data out of the DB into memory. Instead, it connects the Python object to the DB table, and all subsequent invocations on the OML DataFrame object are translated into SQL statements executed inside the DB.
So, for example, you can call:
data.head()
where data is the OML DataFrame, and you will see, in your Jupyter Notebook, the list of the first 10 rows. The head call is translated into a SQL SELECT statement, and only those first ten rows are extracted from the DB and returned to the Python process.
From the ML viewpoint, what is important is that you will use the OML DataFrame as the starting point for the training of your ML algorithm, in the same way you have used a Pandas DataFrame in scikit-learn.
If you need it, you can also extract from the OML DataFrame a Pandas DataFrame, using:
data.pull()
This way you can use all the Python code you have already developed.
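To make the proxy idea concrete, here is a toy sketch that uses SQLite from the Python standard library instead of Oracle DB. It is not how OML4Py is implemented: the TableProxy class, its methods and the sample rows are all made up for illustration. It only shows the principle that the object holds a connection and a table name, each call becomes a SQL statement, and rows reach the client only when you ask for them.

```python
import sqlite3


class TableProxy:
    """Toy proxy (hypothetical): holds a connection and a table name, never the rows."""

    def __init__(self, conn, table):
        self.conn = conn
        self.table = table

    def head(self, n):
        # Only n rows cross the DB boundary; the LIMIT is applied by the engine.
        sql = f'SELECT * FROM "{self.table}" LIMIT ?'
        return self.conn.execute(sql, (n,)).fetchall()

    def count(self):
        # The aggregation is pushed down as SQL too; one number comes back.
        sql = f'SELECT COUNT(*) FROM "{self.table}"'
        return self.conn.execute(sql).fetchone()[0]

    def pull(self):
        # Explicit request to materialise the whole table in client memory.
        return self.conn.execute(f'SELECT * FROM "{self.table}"').fetchall()


# In-memory stand-in for the DB table, filled with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE BREASTCANCER (ID INTEGER, TARGET INTEGER)")
conn.executemany(
    "INSERT INTO BREASTCANCER VALUES (?, ?)",
    [(i, i % 2) for i in range(100)],
)

data = TableProxy(conn, "BREASTCANCER")  # no rows fetched yet
first_rows = data.head(3)                # fetches exactly three rows
total = data.count()                     # fetches a single number
everything = data.pull()                 # fetches all 100 rows

print(first_rows)
print(total, len(everything))
```

The design point is that creating the proxy is cheap regardless of table size; the cost is paid per call, and only pull() is proportional to the full table.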
Algorithms.
The list of supported ML algorithms is rather long and complete.
Just to give you an idea, you can use:
- Naive Bayes
- Generalized Linear Model
- Decision Tree
- Random Forest
- Support Vector Machine (Linear, Gaussian)
- K-Means (clustering)
- One-class SVM (anomaly detection)
- Neural Network (Dense, for now)
You get not only a nice Python interface to the Oracle DB ML implementation, but also important features like:
- Auto Tuning of Hyper-parameters
- Automatic Model selection
- Automatic Feature Extraction
In other words, AutoML, which can greatly simplify the tuning of your Machine Learning model and reduce the time needed to reach the right model performance for your business case.
Another important feature designed to simplify ML adoption is that many algorithms automatically handle null values, categorical feature encoding and feature scaling. In scikit-learn you have to write code for all of these yourself if you want good performance.
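For readers who have never done these three steps by hand, here is a plain-Python sketch of what they amount to. The tiny dataset and its column names are invented for the example, and mean imputation, standardisation and one-hot encoding are just common choices for each step, not a description of what Oracle DB does internally.

```python
from statistics import mean, pstdev

# Made-up rows: one numeric column with a missing value, one categorical column.
rows = [
    {"age": 50.0, "stage": "I"},
    {"age": None, "stage": "II"},
    {"age": 70.0, "stage": "I"},
]

# 1. Null handling: replace missing ages with the mean of the known ones.
known = [r["age"] for r in rows if r["age"] is not None]
fill = mean(known)
ages = [r["age"] if r["age"] is not None else fill for r in rows]

# 2. Feature scaling: standardise to zero mean and unit variance.
mu, sigma = mean(ages), pstdev(ages)
scaled = [(a - mu) / sigma for a in ages]

# 3. Categorical encoding: one-hot encode the "stage" column.
categories = sorted({r["stage"] for r in rows})
one_hot = [[1 if r["stage"] == c else 0 for c in categories] for r in rows]

# Final numeric feature matrix: scaled age followed by the one-hot columns.
features = [[s] + oh for s, oh in zip(scaled, one_hot)]
for row in features:
    print(row)
```

With scikit-learn you would normally use its ready-made transformers rather than loops like these, but you still have to choose and wire each step yourself.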
I have tested OML4Py against scikit-learn, using the Wisconsin Breast Cancer dataset, and in all the cases with AutoML I have seen better accuracy than with scikit-learn. With scikit-learn I reached comparable accuracy only after adding feature scaling and careful hyper-parameter selection, which is sometimes a rather long process.
What is more?
In addition, with OML4Py you can easily load data from the DB and store data into it.
But what is most interesting is the feature called "Embedded execution". It means that you can also write your own Python code, for example a function using a scikit-learn algorithm, store this code in the DB, and call the function to have it applied to data selected from the DB, all with only a few lines of code.
This is an interesting extension mechanism that I'll explore more and more in the coming weeks.
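The store, look up and apply flow can be pictured with a toy stand-in built on SQLite from the standard library. Everything here is hypothetical: the SCRIPTS and MEASURES tables, the run_stored helper and the mean_radius function are invented, and none of it is OML4Py API. In this sketch the function runs in the local Python process, so treat it purely as a picture of the idea, not of the real mechanism.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SCRIPTS (NAME TEXT PRIMARY KEY, SOURCE TEXT)")
conn.execute("CREATE TABLE MEASURES (ID INTEGER, RADIUS REAL)")
conn.executemany(
    "INSERT INTO MEASURES VALUES (?, ?)",
    [(1, 10.0), (2, 20.0), (3, 30.0)],
)

# Step 1: store the source of a user-defined function in the DB, by name.
source = '''
def mean_radius(rows):
    values = [radius for _id, radius in rows]
    return sum(values) / len(values)
'''
conn.execute("INSERT INTO SCRIPTS VALUES (?, ?)", ("mean_radius", source))


def run_stored(conn, name, query):
    """Load a stored function by name and apply it to the rows selected by query."""
    row = conn.execute(
        "SELECT SOURCE FROM SCRIPTS WHERE NAME = ?", (name,)
    ).fetchone()
    namespace = {}
    # exec is tolerable here only because we stored the source ourselves.
    exec(row[0], namespace)
    rows = conn.execute(query).fetchall()
    return namespace[name](rows)


# Step 2: call the stored function on data selected from the DB.
result = run_stored(conn, "mean_radius", "SELECT ID, RADIUS FROM MEASURES")
print(result)
```

The appeal of the pattern is that the function lives next to the data and can be reused by name, instead of being copied into every client script.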
One example.
In the following code example, I show how you can create a Machine Learning model, using Support Vector Machine, for classifying patients. The example is based on the Wisconsin Breast Cancer Dataset, with data stored in an Oracle DB table.
import oml
from oml import automl
from oml import algo
import config
# local helper module (not shown here) that provides print_metrics
import myutils
# connection to Oracle DB...
oml.connect(config.USER, config.PWD,
            '(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=localhost)(PORT=1521))(CONNECT_DATA=(service_name=OAA1)))',
            automl=True)
# test if connected
oml.isconnected()
# create an OML dataframe connected to DB table
BREASTCANCER = oml.sync(table = 'BREASTCANCER')
# train, test set split
train_dat, test_dat = BREASTCANCER.split(ratio=(0.8, 0.2), seed=1234)
# features are all the columns except TARGET; TARGET is the label
train_x = train_dat.drop('TARGET')
train_y = train_dat['TARGET']
test_x = test_dat.drop('TARGET')
test_y = test_dat['TARGET']
# I'll be using AutoTuning
from oml.automl import ModelTuning
at = ModelTuning(mining_function = 'classification', parallel=4)
# tune the hyper-parameters of a linear SVM and fit it
results = at.run('svm_linear', train_x, train_y, score_metric='accuracy')
tuned_model = results['best_model']
# make the prediction on test set
pred_y = tuned_model.predict(test_x)
test_y_df = test_y.pull()
pred_y_df = pred_y.pull()
# utility function based on sklearn for displaying accuracy, precision, recall, F1 score
myutils.print_metrics(pred_y_df, test_y_df)
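The myutils module is not shown in this article. As a rough idea of what such a helper could look like, here is a standard-library sketch that computes the four metrics for binary labels. It is an assumption, not the actual utility (which is based on sklearn): the signature, the positive parameter and the sample labels are invented, and it expects plain sequences of labels, so the pulled DataFrames would first have to be converted to lists.

```python
def print_metrics(pred_y, true_y, positive=1):
    """Print and return accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(pred_y, true_y))
    tp = sum(1 for p, t in pairs if p == positive and t == positive)
    fp = sum(1 for p, t in pairs if p == positive and t != positive)
    fn = sum(1 for p, t in pairs if p != positive and t == positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    metrics = {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")
    return metrics


# Made-up predictions and ground truth, just to exercise the function.
metrics = print_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
```

In a medical setting like this one, recall on the positive class usually deserves more attention than raw accuracy, which is why it helps to print all four values together.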
Conclusion.
OML4Py is an extension to Oracle Advanced Analytics that gives you a nice and easy-to-use Python interface to Oracle DB Machine Learning algorithms.
Using OML4Py you can develop ML models, running inside an Oracle DB, writing Python code that is in many ways similar to the code you would be writing with scikit-learn.
In addition, you get functionality like automatic feature selection, automatic model selection and auto-tuning.
It will soon be generally available, and we're running some training in Italy for customers and partners around it.
I will soon write some more articles on the subject, with more details.