Using JupyterHub in Big Data Service 3.0.26 or Earlier
Use JupyterHub to manage Big Data Service 3.0.26 or earlier ODH 1.x notebooks for groups of users.
Prerequisites
Before JupyterHub can be accessed from a browser, an administrator must:
Make the node available to incoming connections from users. The node's private IP address needs to be mapped to a public IP address. Alternatively, the cluster can be set up to use a bastion host or Oracle FastConnect. See Connecting to Cluster Nodes with Private IP Addresses.
Open port 8000 on the node by configuring the ingress rules in the network security list. See Defining Security Rules.
JupyterHub Default Credentials
The default admin sign-in credentials for JupyterHub in Big Data Service 3.0.21 and earlier are:
Username: jupyterhub
Password: Apache Ambari admin password. This is the cluster admin password that was specified when the cluster was created.
Principal name for HA cluster: jupyterhub
Keytab for HA cluster: /etc/security/keytabs/jupyterhub.keytab
The default admin sign-in credentials for JupyterHub in Big Data Service 3.0.22 through 3.0.26 are:
User name: jupyterhub
Password: Apache Ambari admin password. This is the cluster admin password that was specified when the cluster was created.
Principal name for HA cluster: jupyterhub/<FQDN-OF-UN1-Hostname>
Keytab for HA cluster: /etc/security/keytabs/jupyterhub.keytab
Example:
Principal name for HA cluster: jupyterhub/pkbdsv2un1.rgroverprdpub1.rgroverprd.oraclevcn.com
Keytab for HA cluster: /etc/security/keytabs/jupyterhub.keytab
The admin creates additional users and their sign-in credentials, and provides the sign-in credentials to those users. For more information, see Manage Users and Permissions.
Note
Unless explicitly referenced as some other type of administrator, the use of administrator or admin throughout this section refers to the JupyterHub administrator, jupyterhub.
Accessing JupyterHub
Access JupyterHub through a browser on Big Data Service 3.0.26 or earlier clusters after the prerequisites are met.
The prerequisites must be met for the user trying to spawn notebooks.
Access JupyterHub.
Sign in with the admin credentials. Authorization works only if the user exists on the Linux host; JupyterHub looks for the user on the Linux host when it tries to spawn the notebook server.
You're redirected to a Server Options page where you must request a Kerberos ticket. The ticket can be requested using either the Kerberos principal and keytab file, or the Kerberos password. The cluster admin can provide the Kerberos principal and keytab file, or the Kerberos password.
The Kerberos ticket is needed to access the HDFS directories and the other big data services that you want to use.
Manage JupyterHub
A JupyterHub admin user can perform the following tasks to manage notebooks in JupyterHub on Big Data Service 3.0.26 or earlier ODH 1.x nodes.
Configure JupyterHub for Big Data Service 3.0.26 or earlier clusters.
Connect as the opc user to the utility node where JupyterHub is installed. In an HA (highly available) cluster, this is the second utility node; in a non-HA cluster, it's the first and only utility node.
Use sudo to manage the JupyterHub configs that are stored in /opt/jupyterhub/jupyterhub_config.py.
For example, to change the port number of JupyterHub, edit the config file and restart the service:
sudo vi /opt/jupyterhub/jupyterhub_config.py
# search for c.JupyterHub.bind_url, edit the port number, and save the file
sudo systemctl restart jupyterhub.service
sudo systemctl status jupyterhub.service
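For reference, the setting being edited is the standard JupyterHub bind_url option. A minimal sketch of the relevant line in /opt/jupyterhub/jupyterhub_config.py (the scheme, address, and port are examples only):
# Example only: change 8000 to the port JupyterHub should listen on.
c.JupyterHub.bind_url = 'http://0.0.0.0:8000'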
Stop or start JupyterHub for Big Data Service 3.0.26 or earlier clusters.
As an admin, you can stop or disable the application so it doesn't consume resources, such as memory. Restarting might also help with unexpected issues or behavior.
Connect as the opc user to the utility node where JupyterHub is installed. In an HA (highly available) cluster, this is the second utility node; in a non-HA cluster, it's the first and only utility node.
As an admin, you can limit the number of active notebook servers in the Big Data Service cluster.
By default, the number of active notebook servers is set to twice the number of OCPUs in the node. The default OCPU limit is three, and the default memory limit is 2 GB. The default minimum number of active notebooks is 10 and the default maximum is 80.
Connect as the opc user to the utility node where JupyterHub is installed. In an HA (highly available) cluster, this is the second utility node; in a non-HA cluster, it's the first and only utility node.
Use sudo to edit the JupyterHub configs that are stored in /opt/jupyterhub/jupyterhub_config.py.
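As a sketch, the standard JupyterHub options that control these limits look like the following. The values shown are illustrative, and the exact options used by the Big Data Service deployment may differ:
# Illustrative values only; adjust to your cluster.
c.JupyterHub.active_server_limit = 80   # maximum number of concurrently active notebook servers
c.Spawner.cpu_limit = 3                 # CPU limit per single-user notebook server
c.Spawner.mem_limit = '2G'              # memory limit per single-user notebook server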
By default, notebooks are stored in the HDFS directory of the cluster.
You must have access to the HDFS directory hdfs:///user/<username>/. The notebooks are saved in hdfs:///user/<username>/notebooks/.
Connect as the opc user to the utility node where JupyterHub is installed. In an HA (highly available) cluster, this is the second utility node; in a non-HA cluster, it's the first and only utility node.
Use sudo to manage the JupyterHub configs that are stored in /opt/jupyterhub/jupyterhub_config.py.
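The /user/<username>/notebooks layout matches what an HDFS contents manager such as jupyter-hdfscm provides. As an illustration only (assuming hdfscm is the contents manager in use, which may not match your deployment), the related settings would look similar to:
# Illustrative only: keep each user's notebooks under /user/<username>/notebooks in HDFS.
c.NotebookApp.contents_manager_class = 'hdfscm.HDFSContentsManager'
c.HDFSContentsManager.root_dir_template = '/user/{username}/notebooks'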
As an admin user, you can store individual user notebooks in Object Storage instead of HDFS. When you change the content manager from HDFS to Object Storage, existing notebooks aren't copied to Object Storage; only new notebooks are saved in Object Storage.
Connect as the opc user to the utility node where JupyterHub is installed. In an HA (highly available) cluster, this is the second utility node; in a non-HA cluster, it's the first and only utility node.
Use sudo to manage the JupyterHub configs that are stored in /opt/jupyterhub/jupyterhub_config.py. See generate access and secret key to learn how to generate the required keys.
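As a sketch, one way to back notebooks with Object Storage is an S3-compatible contents manager (such as s3contents) pointed at the Object Storage S3 Compatibility endpoint, using the access and secret key generated above. The package, option names, and endpoint below are assumptions for illustration and may not match the manager used by Big Data Service:
# Illustrative only: replace the bucket, namespace, region, and keys with your own.
c.NotebookApp.contents_manager_class = 's3contents.S3ContentsManager'
c.S3ContentsManager.access_key_id = '<access-key>'
c.S3ContentsManager.secret_access_key = '<secret-key>'
c.S3ContentsManager.endpoint_url = 'https://<namespace>.compat.objectstorage.<region>.oraclecloud.com'
c.S3ContentsManager.bucket = '<bucket-name>'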
Integrate Spark with Object Storage for use with Big Data Service clusters.
In JupyterHub, for Spark to work with Object Storage, you must define some system properties and populate them into the spark.driver.extraJavaOptions and spark.executor.extraJavaOptions properties in the Spark configs.
The properties you must define in Spark configs are:
TenantID
Userid
Fingerprint
PemFilePath
PassPhrase
Region
To retrieve the values for these properties:
Open the navigation menu and click Analytics & AI. Under Data Lake, click Big Data Service.
Under Compartment, select the compartment that hosts the cluster.
In the list of clusters, click the cluster you're working with that has JupyterHub.
Under Resources click Object Storage API keys.
From the actions menu of the API key you want to view, click View configuration file.
The configuration file has all the system property details except the passphrase. The passphrase is specified when you create the Object Storage API key, and you must use that same passphrase.
Copy and paste the following commands to connect to Spark.
import findspark
findspark.init()
import pyspark
Copy and paste the following commands to create a Spark session with the specified configurations. Replace the variables with the system properties values you retrieved before.
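The following is a minimal sketch that passes the properties listed earlier as Java system properties on both the driver and the executors; all values, as well as the oci:// bucket path, are placeholders.
from pyspark.sql import SparkSession

# Placeholder values taken from the Object Storage API key configuration file.
java_options = (
    "-DTenantID=<tenancy-ocid> "
    "-DUserid=<user-ocid> "
    "-DFingerprint=<api-key-fingerprint> "
    "-DPemFilePath=<path-to-private-key-pem> "
    "-DPassPhrase=<api-key-passphrase> "
    "-DRegion=<region-identifier>"
)

spark = (
    SparkSession.builder
    .appName("ObjectStorageExample")
    .config("spark.driver.extraJavaOptions", java_options)
    .config("spark.executor.extraJavaOptions", java_options)
    .getOrCreate()
)

# Write a small sample DataFrame to the bucket and read it back.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.mode("overwrite").csv("oci://<bucket>@<namespace>/jupyterhub-test")
spark.read.csv("oci://<bucket>@<namespace>/jupyterhub-test").show()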
The output of the code is displayed. You can navigate to the Object Storage bucket from the Console and find the file created in the bucket.
Manage Users and Permissions
Use one of the two authentication methods to authenticate users to JupyterHub so that they can create notebooks, and optionally administer JupyterHub.
By default, ODH 1.x clusters support native authentication. However, authentication for JupyterHub and other big data services must be handled differently. To spawn single-user notebooks, the user signing in to JupyterHub must exist on the Linux host and must have permission to write to the root directory in HDFS. Otherwise, the spawner fails because the notebook process runs as the Linux user.
These prerequisites must be met to authorize a user in a Big Data Service HA cluster using native authentication.
The user must exist on the Linux host. Run the following command to add a new Linux user on all the nodes of the cluster.
# Add linux user
dcli -C "useradd -d /home/<username> -m -s /bin/bash <username>"
To start a notebook server, a user must provide the principal and the keytab file path or password, and request a Kerberos ticket from the JupyterHub interface. To create a keytab, the cluster admin must add the Kerberos principal with a password or with a keytab file. Run the following commands on the first master node (mn0) in the cluster.
# Create a kdc principal with password or give access to existing keytabs.
kadmin.local -q "addprinc <principalname>"
Password Prompt: Enter password
# Create a kdc principal with keytab file or give access to existing keytabs.
kadmin.local -q 'ktadd -k /etc/security/keytabs/<principal>.keytab <principal>'
The new user must have the correct Ranger permissions to store files in the HDFS directory hdfs:///user/<username>, as the individual notebooks are stored in /user/<username>/notebooks. The cluster admin can add the required permission from the Ranger interface by opening the following URL in a web browser.
https://<un0-host-ip>:6182
The new user must have the correct permissions on YARN, Hive, and Object Storage to read and write data and to run Spark jobs. Alternatively, the user can use Livy impersonation (run Big Data Service jobs as the Livy user) without being given explicit permissions on Spark, YARN, and other services.
Run the following command to give the new user access to the HDFS directory.
# Give access to hdfs directory
# kdc realm is by default BDSCLOUDSERVICE.ORACLE.COM
kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs-<clustername>@<kdc_realm>
sudo su hdfs -c "hdfs dfs -mkdir /user/<username>"
sudo su hdfs -c "hdfs dfs -chown -R jupy5 /user/<username>"
These prerequisites must be met to authorize a user in a Big Data Service non-HA cluster using native authentication.
The user must exist on the Linux host. Run the following command to add a new Linux user on all the nodes of the cluster.
# Add linux user
dcli -C "useradd -d /home/<username> -m -s /bin/bash <username>"
The new user must have the correct permissions to store files in the HDFS directory hdfs:///user/<username>. Run the following command to give the new user access to the HDFS directory.
# Give access to hdfs directory
sudo su hdfs -c "hdfs dfs -mkdir /user/<username>"
sudo su hdfs -c "hdfs dfs -chown -R jupy5 /user/<username>"
Admin users are responsible for configuring and managing JupyterHub. Admin users are also responsible for authorizing newly signed up users on JupyterHub.
Trino must be installed and configured in the Big Data Service cluster.
Install the following Python module on the JupyterHub node (un1 for an HA cluster, un0 for a non-HA cluster).
Note
Ignore this step if the Trino-Python module is already present in the node.
python3.6 -m pip install trino[sqlalchemy]
Offline installation:
Download the required Python module on any machine that has internet access. For example:
python3 -m pip download trino[sqlalchemy] -d /tmp/package
Copy the contents of this folder to the offline node and install the packages:
python3 -m pip install ./package/*
Note
trino.sqlalchemy is compatible with the latest 1.3.x and 1.4.x SQLAlchemy versions. Big Data Service cluster nodes include Python 3.6 and SQLAlchemy 1.4.46 by default.
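To confirm the module is available, an optional quick check from Python:
# Optional check: both imports should succeed after installation.
import sqlalchemy
import trino.sqlalchemy
print(sqlalchemy.__version__)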
If the Trino-Ranger-Plugin is enabled, be sure to add the provided keytab user to the respective Trino Ranger policies. See Integrating Trino with Ranger.
By default, Trino uses the full Kerberos principal name as the user. Therefore, when adding or updating Trino Ranger policies, you must use the full Kerberos principal name as the username.
For the following code sample, use jupyterhub@BDSCLOUDSERVICE.ORACLE.COM as the user in the Trino Ranger policies.
Note
If the Trino-Ranger-Plugin is enabled, be sure to add the provided keytab user in the respective Trino Ranger policies. For more details see Enabling Ranger for Trino.
Provide Ranger permissions for JupyterHub to the following policies:
from sqlalchemy import create_engine
from sqlalchemy.schema import Table, MetaData
from sqlalchemy.sql.expression import select, text
from trino.auth import KerberosAuthentication
from subprocess import Popen, PIPE
import pandas as pd
# Provide a user-specific keytab_path and principal. To run queries with a
# different keytab, update keytab_path and user_principal below. Otherwise,
# use the same keytab_path and principal that were used to start the
# notebook session.
keytab_path='/etc/security/keytabs/jupyterhub.keytab'
user_principal='jupyterhub@BDSCLOUDSERVICE.ORACLE.COM'
# Cert path is required for SSL.
cert_path= '/etc/security/serverKeys/oraclerootCA.crt'
# trino url = 'trino://<trino-coordinator>:<port>'
trino_url='trino://trinohamn0.sub03011425120.hubvcn.oraclevcn.com:7778'
# Optional: required only if you want to run queries with a different keytab.
kinit_args = [ '/usr/bin/kinit', '-kt', keytab_path, user_principal]
subp = Popen(kinit_args, stdin=PIPE, stdout=PIPE, stderr=PIPE)
subp.wait()
engine = create_engine(
    trino_url,
    connect_args={
        "auth": KerberosAuthentication(service_name="trino", principal=user_principal, ca_bundle=cert_path),
        "http_scheme": "https",
        "verify": True
    }
)
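As a quick usage check (the query is only an example), run a statement through the engine and load the result into a pandas DataFrame:
# Example only: run a simple query and display the result.
with engine.connect() as connection:
    result = connection.execute(text("SHOW CATALOGS"))
    df = pd.DataFrame(result.fetchall(), columns=result.keys())
print(df)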