Introduction.
It is 2020: there is no more time for large, expensive clusters.
These days, a modern Data Lake, built in a Cloud environment, should rely as much as possible on Cloud Native, serverless services, to get the full agility, elasticity, and efficiency provided by the Public Cloud Paradigm.
In this note, I want to briefly highlight how such a Data Lake can be built using Oracle Cloud Infrastructure (OCI) and how we can use Oracle Autonomous Data Warehouse (ADWH) to provide SQL access to data stored in the Data Lake.
I'll briefly describe the best option for storing the data and the services available to extract, transform, and load it. Then, I'll address the steps needed to enable SQL access.
I'll also provide and describe some utilities that simplify the task.
Files as Objects.
In a Data Lake environment, you should be able to provide, efficiently and effectively, elastic storage for all the datasets that your Data Scientists, BI experts, and Business Users want to explore and exploit. And you shouldn't forget the "schema-on-read" requirement: when you store the data you don't want to be obliged to specify a schema.
Object Storage is the perfect place to store all your datasets. You can organize them in different buckets, creating for example a "raw data area", where datasets are stored as they're provided by the source systems, and a "ready-for-analysis" area, where you store all the datasets, verified, filtered, and aggregated, ready for Analytics and for the Notebooks of your Data Scientists.
Object Storage can scale as big as you need, without any planning, and is inexpensive enough (around $25.5 per terabyte per month) to store petabytes of data. Besides, you can use many OCI features to manage the life cycle of your files (for example, you can define rules to archive files older than a certain age) and even add metadata, to make it easier to find the data you need for your work.
A quick view of processing.
Processing in the Data Lake, in my view, must be done using serverless Cloud services, and the preferred engine should be based on Apache Spark.
Apache Spark is a great environment for parallel, distributed, large-scale processing. It is a well-established, long-lived Open Source project, backed by many companies. You can develop in several languages (Scala, Python, Java, SQL), and you'll find libraries with support for DataFrames, Streaming, and distributed Machine Learning (MLlib).
(Image source: https://spark.apache.org/)
And Spark is much faster than Hadoop MapReduce.
(Image source: https://spark.apache.org/)
In OCI we have two services providing a complete and integrated processing environment based on a Spark Serverless engine: OCI Data Integration and OCI Data Flow.
With these services, you can develop fast, scalable ETL. Thanks to their serverless nature, you specify the resources (CPUs) to be used only when you launch a job and, what's more, you pay only for the execution time.
The first one allows you to develop Data Flows with a "no-code" approach, using a graphical UI. With the second one, you have the full power and flexibility of writing code, based on the Apache Spark libraries, in the language of your choice. Combined, the two provide all you need.
This picture shows an architecture integrating all these services:
(image by author)
Serverless services like OCI Data Flow and OCI Data Integration, combined with the Object Storage, can provide all the computational power and storage you need to process and prepare your datasets, even if they grow bigger than anything you planned for at the beginning of your Big Data journey.
Ok, Spark is cool, but we want to use SQL.
But then, when the data are ready and well-structured, your BI experts, Business Users, and Data Scientists will probably want to query them using the well-known SQL language, for example to build rich and powerful Data Visualizations using Oracle Analytics Cloud (OAC).
Oracle Autonomous Data Warehouse (ADWH) is well integrated with Object Storage and gives you two choices: load the data inside ADWH storage or use "external tables" and query the data using the ADWH SQL query engine, while the datasets remain in buckets in the Object Storage.
If you leave the datasets on Object Storage, you combine the ease of use and the efficiency of Object Storage with the power, robustness, and security of the Oracle SQL engine provided by ADWH. Later on, you can decide to load into the ADWH storage only the data for which you need even better performance.
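If, at that point, you do want to load a dataset into the ADWH storage, a minimal sketch could look like the following; it uses the DBMS_CLOUD.COPY_DATA procedure and assumes the ADWH2 credential (created in the "Security" section below) and a hypothetical target table MOVIES_LOCAL:
CREATE TABLE movies_local (
  movieid NUMBER,
  title   VARCHAR2(4000),
  genres  VARCHAR2(4000)
);

BEGIN
  -- load the CSV file from Object Storage into the regular table created above
  DBMS_CLOUD.COPY_DATA(
    table_name      => 'MOVIES_LOCAL',
    credential_name => 'ADWH2',
    file_uri_list   => 'https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/emeaseitalyp/b/datalake_in/o/movies.csv',
    format          => json_object('type' value 'csv', 'skipheaders' value '1')
  );
END;
/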
Having said that, I want to provide you, in the remainder of the article, with some suggestions to make it easy to access files stored on Object Storage. And I'll provide you also with a link to some utilities I have developed for my own work and my customers.
Security, first of all.
To enable access from ADWH to the Object Storage, you need a credential based on an Auth Token. This token must be stored in the DB, to enable the PL/SQL code to access the Object Storage bucket.
First, create the token: go to the OCI Console and, inside your profile (top-right icon, then User Settings), create an Auth Token:
Then, copy the Auth Token from the UI. Remember to do it right after it is created; otherwise, you will not be able to read it again.
Second, store the Cloud credential inside the DB by executing this code in SQL Developer. This step needs to be done only once:
BEGIN
  DBMS_CLOUD.CREATE_CREDENTIAL(
    credential_name => 'ADWH2',
    username        => 'oracleidentitycloudservice/myname',
    password        => '<AUTH Token>'
  );
END;
Be careful: if your user is a federated one, the username is the fully qualified name, with a prefix like:
oracleidentitycloudservice/
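To double-check that the credential has been stored, you can query the data dictionary; a quick sketch (the ALL_CREDENTIALS view should list the credentials created with DBMS_CLOUD.CREATE_CREDENTIAL):
SELECT credential_name, username
FROM   all_credentials;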
The DDL to create the external table.
The next step is to create the "external table". It is an object stored in the DB Data Dictionary that enables SQL access to the file stored in the Object Storage bucket.
In the following, I'll assume that the data file is a CSV file and that it contains, in the first row, a header, with the names of the fields.
In my initial developments, I have used the files coming from the MovieLens dataset. Therefore one of the files (movies.csv) contains records of this kind:
movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
Using the Python utility provided (see the link in the last section), I was able to generate the following SQL DDL statement (to be executed, again, in SQL Developer):
BEGIN
  DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
    table_name      => 'MOVIES',
    credential_name => 'ADWH2',
    file_uri_list   => 'https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/emeaseitalyp/b/datalake_in/o/movies.csv',
    format          => json_object('type' value 'csv', 'skipheaders' value '1'),
    column_list     => '
      MOVIEID NUMBER,
      TITLE   VARCHAR2(4000),
      GENRES  VARCHAR2(4000)
    '
  );
END;
/
Some explanation:
- credential_name is the name of the credential wrapping the Auth Token (see above);
- column_list contains the name/type pairs for all the columns; the datatypes are Oracle types;
- format specifies that the file is a CSV and that the first row is a header to be skipped;
- file_uri_list contains the URL to access the file in the Object Storage.
This is the format for the URL:
https://<object storage region endpoint>/n/<namespace>/b/<bucket name>/o/<object name>
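Before querying, it can be useful to verify that ADWH is actually able to read and parse the file. A quick check, using the DBMS_CLOUD.VALIDATE_EXTERNAL_TABLE procedure and a simple count:
BEGIN
  -- scans the file and raises an error if rows cannot be parsed
  DBMS_CLOUD.VALIDATE_EXTERNAL_TABLE(
    table_name => 'MOVIES'
  );
END;
/

SELECT COUNT(*) FROM movies;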
Querying the data.
After having created the external table, I can use the full power of the SQL language to explore and extract all the information I need.
For example, again from the MovieLens datasets, assume I want to know which are the most-rated War films. Using SQL Developer, I get the result in about 20 seconds:
(Image by author)
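The query looks more or less like this (a sketch: it assumes a RATINGS external table, created from the MovieLens ratings.csv file in the same way as MOVIES):
SELECT m.title, COUNT(*) AS num_ratings
FROM   movies m
JOIN   ratings r ON r.movieid = m.movieid
WHERE  m.genres LIKE '%War%'
GROUP  BY m.title
ORDER  BY num_ratings DESC
FETCH  FIRST 10 ROWS ONLY;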
Link to the utility Python code.
What is a bit tedious and error-prone is writing the DDL code to create the external table.
I have prepared a Python utility that reads the CSV file (only the first N rows), infers the data types, and prints the DDL on the standard output, ready to be executed.
You can find the latest version of the utility in one of my GitHub repositories: https://github.com/luigisaetta/adw_external_tables.
I hope you'll find it useful. Happy Data Science!