One of the key features of Alluxio Enterprise Edition is Transparent URI, which provides ease of integration of Alluxio with your existing data stack without any changes to the location metadata of the Hive Metastore. This article provides a tutorial on employing the Alluxio Transparent URI capability with Trino, Hive Metastore and Spark, and with MinIO as the S3-compatible object store.
1. What is Transparent URI and how does it work
Alluxio is often deployed with existing applications accessing their existing storage systems. When adding Alluxio to the existing stack, the URIs used by the application need to be changed. For example, to change from Trino + S3 to Trino + Alluxio + S3, the s3:// URI
must be changed to the alluxio:// URI
.
This is where Transparent URI helps. Instead of manually changing all of the URI usages from S3 to Alluxio, Alluxio's Transparent URI switches the storage scheme mapping, so that URIs with s3://
are routed to Alluxio's URI. The code path to S3 stays as s3://
instead of alluxio://
, so that client code or table definition doesn’t need to change. Therefore, the Transparent URI feature significantly simplifies the DevOps of introducing Alluxio to your existing systems.
2. Employ the Alluxio Transparent URI with Trino, Hive Metastore and Spark and with MinIO as the S3-compatible object store
In this part, we provide a step-by-step guide to using Transparent URI. You can find more details in the git repo, which provides a complete environment for demonstrating how to configure Alluxio's Transparent URI capability for use with Trino as a query engine and Hive as the metastore, as well as with Spark and SparkSQL.
This docker compose package was based on the docker compose setup described here: https://github.com/bitsondatadev/trino-getting-started/tree/main/hive/trino-minio
This modified compose package deploys Trino, Alluxio, Hive metastore, Spark and Minio. It configures a Trino Hive catalog to use the Alluxio Transparent URI feature. See: https://docs.alluxio.io/ee/user/stable/en/operation/Transparent-Uri.html
Step 1. Install Docker desktop
Install Docker desktop on your laptop, including the docker-compose command. See: https://www.docker.com/products/docker-desktop/
Step 2. Clone this repo
Use the git command to clone this repo (or download the zip file from the github.com site).
$ git clone https://github.com/gregpalmr/alluxio-trino-minio $ cd alluxio-trino-minio
Step 3. Specify your Alluxio Enterprise Edition license key
Get a base64 formatted version of your Alluxio Enterprise Edition license key by running the following command:
$ export ALLUXIO_LICENSE_BASE64=$(cat ./my-alluxio-license-file.json , base64)
Step 4. Launch the docker containers.
Launch the containers defined in the docker-compose.yml file using the command:
$ docker-compose up -d
The command will create the network object and the docker volumes, then it will take some time to pull the various docker images. When it is complete, you see this output:
$ docker-compose up -d Creating network "trino-minio_trino-network" with driver "bridge" Creating volume "trino-minio_minio-data" with local driver Creating volume "trino-minio_alluxio-data" with local driver Creating trino-coordinator ... done Creating spark-master ... done Creating minio ... done Creating mariadb ... done Creating spark-worker ... done Creating minio-create-buckets ... done Creating alluxio-master ... done Creating alluxio-mount-minio-bucket ... done Creating alluxio-worker1 ... done Creating hive-metastore ... done
Step 5. View the buckets in Minio
Use the Minio web console to view the pre-staged "hive" bucket in Minio. Follow these steps:
- Point your web browser to http://localhost:9001
- Log in using the user id "minio" and the password "minio123"
- Click on the "Buckets" link in the upper left margin
- Click on the buck link named "hive"
- View the folder named "warehouse" in the hive bucket
Step 6. View the mounted Minio bucket in Alluxio
Use the Alluxio web console to view the pre-staged mount of the Minio "hive" bucket in Alluxio. Follow these steps:
- Point your web browser to http://localhost:19999
- Click on the "MountTable" tab at the top of the page
- View the "/hive" mount point for the "s3a://hive" Minio bucket
- Now, click on the "Browse" tab at the top of the page
- Click on the "hive" directory name
- View the /hive/warehouse directory in the "hive" bucket
If you want to use the Alluxio command line to view the bucket mount, you can follow these steps:
$ docker exec -it alluxio-master bash $ alluxio fs mount s3a://hive/ on /hive (s3, capacity=-1B, used=-1B, not read-only, not shared, properties={alluxio.underfs.s3.inherit.acl=false, alluxio.underfs.s3.disable.dns.buckets=true, alluxio.underfs.s3.endpoint=http://minio:9000, s3a.accessKeyId=******, s3a.secretKey=******}) /opt/alluxio/underFSStorage on / (local, capacity=195.80GB, used=87.07GB(44%), not read-only, not shared, properties={})
Also, you can view the Alluxio audit log to see when Trino queries access the Alluxio virtual file system. Use this command:
$ tail -f /opt/alluxio/logs/master_audit.log
Step 7. (Optional) Enable debug mode on the Alluxio master
Run the following command if you would like to enable debugging:
$ alluxio logLevel --logName=alluxio --target master --level=DEBUG
Later, you can disable debugging mode with the command:
$ alluxio logLevel --logName=alluxio --level=INFO
Step 8. Explore the Trino configuration
Launch a bash session in the Trino coordinator container and explore the setup of the "minio" catalog and the integration with Alluxio. Use the following commands:
$ docker exec -it trino-coordinator bash
First, see how the Alluxio client jar file was added to the Trino Hive plugin directory (and classpath). This is required for accessing Alluxio using the Transparent URI capability with s3a URIs (s3a://).
$ ls -al /usr/lib/trino/plugin/hive/alluxio*client.jar -rw-r--r-- 1 trino trino 89061794 Nov 17 22:53 /usr/lib/trino/plugin/hive/alluxio-enterprise-2.9.0-1.0-client.jar
Display the Minio catalog's minio.properties file and notice that the Minio endpoint and credentials are NOT configured here. Instead, Trino will use the Alluxio Transparent URI capability when accessing s3a URIs. Also, see the "hive.config.resources" property pointing to the Alluxio cores-site.xml file:
$ cat /etc/trino/catalog/minio/minio.properties connector.name=hive-hadoop2 hive.s3-file-system-type=HADOOP_DEFAULT hive.metastore.uri=thrift://hive-metastore:9083 hive.non-managed-table-writes-enabled=true hive.s3select-pushdown.enabled=true hive.storage-format=ORC hive.allow-drop-table=true hive.config.resources=/etc/trino/core-site.xml
Next see how the Alluxio "shim" file system is configured to handle references to s3a URIs. See the properties named "fs.s3a.impl" and "fs.AbstractFileSystem.s3a.impl". Also, the "alluxio.master.hostname" property is defined to point to the Alluxio master node. If you are using Alluxio in high availability (HA) mode, with 3, 5 or 7 master nodes, then you would use the "alluxio.master.rpc.addresses" property instead.
Display the Alluxio core-site.xml file contents:
$ cat /etc/trino/core-site.xml <?xml version="1.0"?><configuration><!-- Enable the Alluxio Transparent URI feature for s3 and s3a end-points --><property><name>fs.s3a.impl</name><value>alluxio.hadoop.ShimFileSystem</value></property><property><name>fs.AbstractFileSystem.s3a.impl</name><value>alluxio.hadoop.AlluxioShimFileSystem</value></property><!-- Enable the Alluxio Transparent URI feature for HDFS end-points --><property><name>fs.hdfs.impl</name><value>alluxio.hadoop.ShimFileSystem</value></property><property><name>fs.AbstractFileSystem.hdfs.impl</name><value>alluxio.hadoop.AlluxioShimFileSystem</value></property><!-- Don't apply Transparent URI for these files --><property><name>alluxio.user.shimfs.bypass.prefix.list</name><value></value></property><!-- Don't auto mount Alluxio mounts --><property><name>alluxio.master.shimfs.auto.mount.enabled</name><value>false</value></property><!-- Specify the Alluxio master node --><property><name>alluxio.master.hostname</name><value>alluxio-master</value></property><!-- Tell Alluxio to CACHE data when it is read for the first time --><property><name>alluxio.user.file.readtype.default</name><value>CACHE</value></property><!-- Enable client-side (northbound) impersonation in Alluxio --><property><name>alluxio.security.login.impersonation.username</name><value>_HDFS_USER_</value></property>
Step 9. View the Trino Web console
Point your web browser to the Trino Web console to view query status and history
http://localhost:8080
The user id is "trino" and there is no password.
Step 10. Test Trino using the Alluxio Transparent URI feature
The Alluxio Transparent URI feature will redirect references to s3 and s3a URIs to the native Alluxio URI (alluxio://). Therefore Hive table definitions with the "external_location=s3a://<bucket_name>/" will be redirected to Alluxio instead of to native Minio. All the Alluxio data orchestration and data caching capabilities will be employed.
Launch a bash session in the Trino coordinator container and run a CREATE TABLE command to create a table using the "minio" Trino cagtalog setup and the "s3a" URI. Then query the data. Use these commands:
$ docker exec -it trino-coordinator bash $ trino --catalog minio --debug trino> USE default; CREATE TABLE default.customer_s3a WITH ( format = 'ORC', external_location = 's3a://hive/warehouse/customer_s3a/' ) AS SELECT * FROM tpch.tiny.customer; SELECT * FROM default.customer_s3a WHERE acctbal > 3500.00 AND acctbal < 9000.00 ORDER BY acctbal LIMIT 25;
Step 11. View the Alluxio cache storage usage
When Trino queries data using Alluxio's Transparent URI feature, it will cache data to Alluxio cache storage, when it is first read from the under file system (in this case, Minio).
To see if the customer_s3a data files are cached, go back to the bash session on the alluxio-master container and list the data files. You should see that some data is cached for the cusotmer_s3a data set by the fact that the 8th column shows a percentage being cached. If it shows 100%, then Alluxio cached all of the data in that file.
$ docker exec -it alluxio-master bash $ alluxio fs ls -R /hive/warehouse/customer_s3a -rwx------ alluxio alluxio 78509 PERSISTED 11-17-2022 22:29:47:610 100% /hive/warehouse/customer_s3a/20221117_222944_00003_76isd_03efa9e5-a56a-4d83-8f8a-5bbdbaf9bc1f
You can also view the Alluxio Web console you launched in Step 6 to see if any data is being cached by Alluxio, or run the following Alluxio CLI command:
$ alluxio fsadmin report
Step 12. Explore the Spark configuration
Launch a bash session in the Spark master container and explore how the Spark environment is integrated with the Alluxio Transparent URI capability. Use the following commands:
$ docker exec -it spark-master bash
See that the Alluxio client jar file was added to the Spark "jars" directory (and classpath). This is required for accessing Alluxio using both the native protocol ("alluxio://") and the Transparent URI s3a protocol (s3a://).
$ ls -al $SPARK_HOME/jars/*alluxio* -rw-r--r-- 1 1001 root 89061794 Nov 28 16:34 /opt/bitnami/spark/jars/alluxio-enterprise-2.9.0-1.0-client.jar
Next see how the Alluxio "shim" file system is configured to handle references to s3a URIs. See the properties named "fs.s3a.impl" and "fs.AbstractFileSystem.s3a.impl". Also, the "alluxio.master.hostname" property is defined to point to the Alluxio master node. If you are using Alluxio in high availability (HA) mode, with 3, 5 or 7 master nodes, then you would use the "alluxio.master.rpc.addresses" property instead.
$ cat $SPARK_HOME/conf/core-site.xml <?xml version="1.0"?><configuration><!-- Enable the Alluxio Transparent URI feature for s3 and s3a end-points --><property><name>fs.s3a.impl</name><value>alluxio.hadoop.ShimFileSystem</value></property><property><name>fs.AbstractFileSystem.s3a.impl</name><value>alluxio.hadoop.AlluxioShimFileSystem</value></property><!-- Enable the Alluxio Transparent URI feature for HDFS end-points --><property><name>fs.hdfs.impl</name><value>alluxio.hadoop.ShimFileSystem</value></property><property><name>fs.AbstractFileSystem.hdfs.impl</name><value>alluxio.hadoop.AlluxioShimFileSystem</value></property><!-- Don't apply Transparent URI for these files --><property><name>alluxio.user.shimfs.bypass.prefix.list</name><value></value></property><!-- Don't auto mount Alluxio mounts --><property><name>alluxio.master.shimfs.auto.mount.enabled</name><value>false</value></property><!-- Specify the Alluxio master node --><property><name>alluxio.master.hostname</name><value>alluxio-master</value></property><!-- Tell Alluxio to CACHE data when it is read for the first time --><property><name>alluxio.user.file.readtype.default</name><value>CACHE</value></property><!-- Enable client-side (northbound) impersonation in Alluxio --><property><name>alluxio.security.login.impersonation.username</name><value>_HDFS_USER_</value></property></configuration>
Finally, see how Spark is configured to integrate with the Hive metastore:
$ cat $SPARK_HOME/conf/spark-defaults.conf spark.hadoop.hive.metastore.uris=thrift://hive-metastore:9083
Step 13. Test Spark using the Alluxio Transparent URI feature
The Alluxio Transparent URI feature will redirect references to s3 and s3a URIs to the native Alluxio URI (alluxio://). Therefore Hive table definitions with the "external_location=s3a://<bucket_name>/" will be redirected to Alluxio instead of native Minio. All the Alluxio data orchestration and data caching capabilities will be employed.
Launch a bash session in the Spark master container and run some Spark Scala commands to access the Hive table via Alluxio and also access the data file directly without using the Hive metastore. Use these commands:
$ docker exec -it spark-master bash $ spark-shell --master "spark://spark-master:7077" \ --driver-java-options "-Dspark.hadoop.hive.metastore.uris=thrift://hive-metastore:9083" scala> // Read the Hive table using the hive metastore // Note: It will read via Alluxio's Transparent URI capability import org.apache.spark.sql.hive.HiveContext val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) hiveContext.sql("USE default") hiveContext.sql("SHOW TABLES").show() hiveContext.sql("SELECT * FROM default.customer_s3a WHERE acctbal > 3500.00 AND acctbal < 9000.00 ORDER BY acctbal LIMIT 25").show() // Read the S3 bucket directly without using the Hive metastore // Note: It will read via Alluxio's Transparent URI capability val df=spark.read.orc("s3a://hive/warehouse/customer_s3a/").show(25) // Create a new table in the Hive warehouse val df = Seq((1, 2, 3),(2, 3, 4),(3, 4, 5)).toDF("a", "b", "c") df.write.saveAsTable("default.test_table") spark.catalog.listTables.show(false) spark.sql("SELECT * FROM default.test_table LIMIT 10").show()
To view the data files created by the Spark df.write.saveAsTable() operation, go back to the Alluxio web console at:
http://localhost:19999
And click on the "Browse" tab at the top, then click on the "hive" folder and then the "warehouse" folder. The "test_table" folder will show the parquet files created for the new table.
Step 14. Destroy the containers
When finished, destroy the docker containers and clean up the docker volumes using these commands:
$ docker-compose down $ docker volume prune
3. Learn More
By using Transparent URI, you can simplify the DevOps of implementing Alluxio. To learn more, visit our documentation page: https://docs.alluxio.io/ee/user/stable/en/operation/Transparent-Uri.html
For any future questions or comments, please contact our Lead Solutions Engineer, Greg Palmer, at greg.palme@alluxio.com.
Blog
We are thrilled to announce the general availability of Alluxio Enterprise for Data Analytics 3.2! With data volumes continuing to grow at exponential rates, data platform teams face challenges in maintaining query performance, managing infrastructure costs, and ensuring scalability. This latest version of Alluxio addresses these challenges head-on with groundbreaking improvements in scalability, performance, and cost-efficiency.
We’re excited to introduce Rapid Alluxio Deployer (RAD) on AWS, which allows you to experience the performance benefits of Alluxio in less than 30 minutes. RAD is designed with a split-plane architecture, which ensures that your data remains secure within your AWS environment, giving you peace of mind while leveraging Alluxio’s capabilities.
PyTorch is one of the most popular deep learning frameworks in production today. As models become increasingly complex and dataset sizes grow, optimizing model training performance becomes crucial to reduce training times and improve productivity.