Introduction
Alluxio enables data orchestration for compute in any cloud. It unifies data silos on-premises and across cloud environments to provide the data locality, accessibility, and elasticity needed to reduce the complexities associated with orchestrating data for today’s big data and AI/ML workloads.
Alluxio is designed to help any framework access any data, from any storage at high performance regardless of the environment, which enables an organization to remain agile and competitive in adopting and experimenting with new and existing technologies.
Apache Ranger
Many organizations have expanded access to their data lake beyond their initial ETL and batch analytics users and they need a way to centralize how they define and enforce fine-grained access permissions. Increasingly, enterprise data managers are adopting Apache Ranger to meet that need.
Apache Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform. Ranger was created to meet the following goals:
- Provide centralized security administration to manage all security related tasks in a central UI or using REST APIs.
- Provide fine-grained authorization for specific actions or operations on a Hadoop component or tool, managed through a central administration tool.
- Standardize the authorization method across all Hadoop components.
- Enhance support for different authorization methods, such as role-based access control and attribute-based access control.
- Centralize auditing of user access and administrative actions (security related) within all the components of Hadoop.
Alluxio and Apache Ranger
Alluxio implements a virtual file system that provides access to heterogeneous data stores, offering a unified namespace along with metadata caching, data caching and policy-driven data management services. To make the Alluxio virtual file system secure, Alluxio provides the following:
- User Authentication
- User Authorization
- Access Control Lists (ACLs)
- Data Path Authorization
- Client-side Hadoop Impersonation
- Auditing
- Encryption
Alluxio integrates with Apache Ranger using a Ranger Plugin to support the user authorization and auditing mechanisms as shown in Figure 2 below.
As Apache Ranger administrators define centralized access policies in Ranger, those policies are retrieved and cached locally by the Alluxio master node and are enforced by Alluxio when users make read or write requests to the Alluxio virtual file system.
Best Practices
Alluxio supports using Apache Ranger to manage and enforce access to directories and files. There are two ways to use Ranger with Alluxio:
- Use Ranger to directly manage access permissions on Alluxio virtual file system paths. Use this method when the Alluxio under file system (UFS) is not HDFS, when Alluxio has two or more under file systems, or when Alluxio’s unified namespace features are in use and Alluxio will be the main access layer. For example, Alluxio may have an HDFS UFS and an S3-compatible UFS that are mounted using a UNION UFS.
- Have Alluxio enforce existing Ranger policies for an HDFS under file system. Use this method when there are existing HDFS access policies being managed in Ranger and there are no under file systems other than HDFS.
While it is possible to use Ranger to manage permissions for both Alluxio and the under file system, it is not recommended to enable both at the same time because it can be confusing to have multiple sources of truth.
Option 1. Ranger manages Alluxio file system permissions
With this option, the Alluxio service plugin needs to be enabled in the Ranger admin console. Since Alluxio uses the HDFS Ranger plugin type, a new HDFS service can be defined in the Service Manager page.
Step 1. Create the Alluxio HDFS Service
In the Ranger admin console’s Service Manager page, click on the plus sign (+) to create a new service.
Ranger will display the Create Service page, where the Alluxio master node is referenced as the service to be targeted. On that page, enter the details for the Alluxio service, including a unique Service Name. If multiple Alluxio environments exist (for example, one for dev, one for test and several production environments in different data centers), then specific names for the Alluxio service should be used (such as alluxio-datacenter1-test). Again, since Alluxio uses the HDFS plugin, the Create Service page shows HDFS properties. In the Namenode URL property, enter the Alluxio master node URI (such as alluxio://alluxio-master:19998).
Setting Authorization Enabled to Yes will require that all users are authenticated and most organizations will set the Authentication Type to Kerberos. If the Ranger Admin service is configured with SSL certificates, then the Common Name for Certificate property should be set correctly (based on the CN specification for the SSL certificate) and the Alluxio master node should have access to those certificate files. Note that the Username and Password are set to the Ranger admin username and password, and not the Alluxio admin username and password. Clicking on the Create button will create the new HDFS Service and show it on the Service Manager page.
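Since Ranger can also be administered through REST APIs, the same service definition could be scripted instead of entered in the console. The following is a minimal sketch that only builds the request body; it does not contact Ranger. The configuration key names, the `/service/public/v2/api/service` endpoint mentioned in the comment, and the credential values are assumptions that should be checked against the Ranger version in use.

```python
import json


def build_alluxio_service_payload(service_name, alluxio_master_uri,
                                  ranger_admin_user, ranger_admin_password,
                                  kerberos=True, cert_common_name=""):
    """Build a JSON body for an HDFS-type Ranger service that targets Alluxio.

    Assumption: the config keys below mirror the fields of the Create Service
    page (Username, Password, Namenode URL, Authorization Enabled,
    Authentication Type, Common Name for Certificate).
    """
    configs = {
        "username": ranger_admin_user,          # Ranger admin user, not the Alluxio admin
        "password": ranger_admin_password,
        "fs.default.name": alluxio_master_uri,  # the "Namenode URL" field
        "hadoop.security.authorization": "true",
        "hadoop.security.authentication": "kerberos" if kerberos else "simple",
    }
    if cert_common_name:
        configs["commonNameForCertificate"] = cert_common_name
    return {
        "name": service_name,
        "type": "hdfs",  # Alluxio uses the HDFS Ranger plugin type
        "isEnabled": True,
        "configs": configs,
    }


# Hypothetical values; the body would be POSTed to
# <ranger-admin-url>/service/public/v2/api/service (assumed endpoint).
payload = build_alluxio_service_payload(
    "alluxio-datacenter1-test",
    "alluxio://alluxio-master:19998",
    "admin",
    "changeme",
)
print(json.dumps(payload, indent=2))
```

Keeping the payload construction separate from the HTTP call makes it easy to review the service definition for each environment (dev, test, production) before anything is sent to Ranger.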
Step 2. Configure Alluxio Master Nodes
Once the Alluxio Ranger HDFS service is created using the Ranger admin console, the Alluxio master nodes can be configured to use the Ranger HDFS plugin to retrieve and cache Ranger policies. First, copy the core-site.xml, hdfs-site.xml, ranger-hdfs-security.xml, ranger-hdfs-audit.xml and ranger-policymgr-ssl.xml files from the $HADOOP_CONF directory on the HDFS namenode server to the $ALLUXIO_HOME/conf directory on the Alluxio master node servers. Then modify the ranger-hdfs-security.xml file so that it names the Alluxio Ranger HDFS service defined in the Ranger admin console in Step 1 above, like this:
<property>
  <name>ranger.plugin.hdfs.service.name</name>
  <value>alluxio-datacenter1-test</value>
  <description>
    Name of the Ranger service containing policies for this Alluxio instance
  </description>
</property>
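A quick way to confirm that the edited file names the intended service is to parse it and read the property back. The sketch below uses Python's standard XML parser on an inline sample; the `<configuration>` root element is assumed from the usual Hadoop configuration layout, and `read_hadoop_property` is a helper introduced here for illustration.

```python
import xml.etree.ElementTree as ET


def read_hadoop_property(xml_text, name):
    """Return the <value> of the named <property>, or None if it is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            return prop.findtext("value", default="").strip()
    return None


# Inline sample standing in for $ALLUXIO_HOME/conf/ranger-hdfs-security.xml.
sample = """<configuration>
  <property>
    <name>ranger.plugin.hdfs.service.name</name>
    <value>alluxio-datacenter1-test</value>
  </property>
</configuration>"""

service_name = read_hadoop_property(sample, "ranger.plugin.hdfs.service.name")
print(service_name)
```

In practice the file contents would be read from disk on each Alluxio master node and compared with the Service Name entered in Step 1.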
The alluxio-site.properties file on the Alluxio master nodes should be changed to enable Ranger integration, like this:
alluxio.security.authorization.plugins.enabled=true
alluxio.security.authorization.plugin.name=<plugin name>
alluxio.security.authorization.plugin.paths=/opt/alluxio/conf
alluxio.security.authorization.permission.umask=077
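Because a missing or misspelled property is an easy mistake to make across several master nodes, a small check can be run against each copy of the file. This is a sketch only: it parses simple `key=value` lines, and the helper names and the sample plugin name are chosen for illustration.

```python
REQUIRED_KEYS = (
    "alluxio.security.authorization.plugins.enabled",
    "alluxio.security.authorization.plugin.name",
    "alluxio.security.authorization.plugin.paths",
)


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_ranger_settings(props):
    """Return a list of problems; an empty list means the settings look complete."""
    problems = [f"missing {key}" for key in REQUIRED_KEYS if not props.get(key)]
    enabled = props.get("alluxio.security.authorization.plugins.enabled", "")
    if enabled and enabled.lower() != "true":
        problems.append("alluxio.security.authorization.plugins.enabled is not true")
    return problems


# Sample content; the plugin name here is just one of the documented examples.
sample_properties = """
# Ranger integration
alluxio.security.authorization.plugins.enabled=true
alluxio.security.authorization.plugin.name=ranger-hdp-2.6
alluxio.security.authorization.plugin.paths=/opt/alluxio/conf
alluxio.security.authorization.permission.umask=077
"""

print(check_ranger_settings(parse_properties(sample_properties)))
```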
The plugin name tells Alluxio to use a specific Ranger HDFS plugin, located in .jar files in the $ALLUXIO_HOME/lib directory. Several versions of Apache Ranger are supported and are implemented with these jar files:
alluxio-authorization-ranger-2.0-cdp-7.1-enterprise-2.7.0-2.4.jar
alluxio-authorization-ranger-0.5-hdp-2.4-enterprise-2.7.0-2.4.jar
alluxio-authorization-ranger-0.7-hdp-2.6-enterprise-2.7.0-2.4.jar
alluxio-authorization-ranger-1.1-hdp-3.0-enterprise-2.7.0-2.4.jar
alluxio-authorization-ranger-1.2-hdp-3.1-enterprise-2.7.0-2.4.jar
alluxio-authorization-ranger-0.6-hdp-2.5-enterprise-2.7.0-2.4.jar
alluxio-authorization-ranger-2.1-privacera-4.7-enterprise-2.7.0-2.4.jar
For example, if Privacera 4.7 is being used, then the plugin name would be specified as ranger-privacera-4.7, and if Hortonworks HDP 2.6 is being used, then the plugin name would be specified as ranger-hdp-2.6.
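The two examples suggest that the plugin name is the jar's distribution and distribution version prefixed with `ranger-`. The sketch below derives the name from a jar file name under that assumption; the pattern is inferred from the Privacera and HDP examples only, so names for the other jars should be confirmed against the Alluxio documentation.

```python
import re

# Assumed layout:
# alluxio-authorization-ranger-<ranger version>-<distro>-<distro version>-enterprise-<...>.jar
JAR_PATTERN = re.compile(
    r"^alluxio-authorization-ranger-(?P<ranger>[\d.]+)"
    r"-(?P<distro>[a-z]+)-(?P<distro_version>[\d.]+)-enterprise-.*\.jar$"
)


def plugin_name_from_jar(jar_file_name):
    """Derive a plugin name such as 'ranger-hdp-2.6' from a plugin jar file name."""
    match = JAR_PATTERN.match(jar_file_name)
    if match is None:
        raise ValueError(f"unrecognized plugin jar name: {jar_file_name}")
    return f"ranger-{match['distro']}-{match['distro_version']}"


for jar in (
    "alluxio-authorization-ranger-2.1-privacera-4.7-enterprise-2.7.0-2.4.jar",
    "alluxio-authorization-ranger-0.7-hdp-2.6-enterprise-2.7.0-2.4.jar",
):
    print(jar, "->", plugin_name_from_jar(jar))
```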
After copying the Ranger XML files and modifying the alluxio-site.properties file, restart the Alluxio master daemons.
Step 3. Restrict Alluxio permissions on sensitive directories
When a Ranger policy is not available for a specific path, Alluxio falls back to its own POSIX-style permissions to determine whether a user can access a directory or file. Therefore, it is recommended that all users except the privileged root user be denied access to the directories that hold sensitive data, while shared directories such as /tmp remain open. To enforce this, run the following Alluxio CLI commands:
alluxio fs chmod 777 /
alluxio fs chmod 777 /user
alluxio fs chmod 777 /tmp
alluxio fs chmod 700 /sensitive_data1
alluxio fs chmod 700 /sensitive_data2
Run chmod 700 … on any sub-directories that should be managed by Ranger policies.
When a terminal session is opened to one of the Alluxio nodes and an attempt is made to access the /sensitive_data1 directory as a non-root user, a permission denied message like this should be displayed:
$ id
uid=1001(user1) gid=1001(alluxio-users)
$ alluxio fs ls /sensitive_data1
Permission denied by authorization plugin: alluxio.exception.AccessControlException: Permission denied: user=user1, access=--x, path=/sensitive_data1: failed at /, inode owner=root, inode group=root, inode mode=rwx------
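The fallback behavior described in this step can be illustrated with a toy model. It is not Alluxio's implementation: it only shows how a mode of 700 on a root-owned directory leaves a non-owner with no permission bits when no Ranger decision is available. The function names are hypothetical, and superuser handling and parent-directory traversal are deliberately left out.

```python
PERMISSION_BITS = {"r": 4, "w": 2, "x": 1}


def posix_allows(mode, owner, group, user, user_groups, wanted):
    """Check POSIX-style mode bits; `wanted` is a string such as 'rx'."""
    if user == owner:
        shift = 6          # owner bits
    elif group in user_groups:
        shift = 3          # group bits
    else:
        shift = 0          # other bits
    granted = (mode >> shift) & 0o7
    needed = 0
    for flag in set(wanted):
        needed |= PERMISSION_BITS[flag]
    return granted & needed == needed


def is_allowed(ranger_decision, mode, owner, group, user, user_groups, wanted):
    """Use the Ranger decision when a policy applies, else fall back to mode bits.

    `ranger_decision` is True or False when a Ranger policy covers the path,
    and None when no policy is available for it.
    """
    if ranger_decision is not None:
        return ranger_decision
    return posix_allows(mode, owner, group, user, user_groups, wanted)


# No Ranger policy yet: user1 is neither owner nor in group root, so 700 denies.
print(is_allowed(None, 0o700, "root", "root", "user1", ["alluxio-users"], "x"))
# The same user on a 777 directory such as /tmp.
print(is_allowed(None, 0o777, "root", "root", "user1", ["alluxio-users"], "rwx"))
```

This is why restricting the mode bits first matters: until an Allow policy exists, the mode bits are the only thing standing between users and the data.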
Step 4. Create Ranger Allow Policies
At this point the data management team and the data security team should review each directory or folder path in the under file system (HDFS, S3, GCS etc.) and determine which user groups or users should be granted access to each path.
Use the Ranger admin console to define an Allow policy by clicking on the alluxio-datacenter1-test HDFS Service link to display the list of defined policies.
By default Ranger will create several policies for the admin users, but no policies exist yet for Alluxio users. Click on the Add New Policy button to display the Create Policy page.
In the Create Policy page, define an Allow policy for a specific user group on the user directory (/sensitive_data1), recursively. Allow only the Read and Execute permissions. In this example, using the group name alluxio-users accomplishes that for all the users in that group.
Click the Add button to create the new policy and display the new policy in the list.
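For teams that prefer to script policy creation through Ranger's REST APIs, the same Allow policy can be expressed as a JSON document. The sketch below only builds that document. The field layout follows Ranger's policy model as commonly documented, but the exact fields, the `/service/public/v2/api/policy` endpoint named in the comment, and the policy name are assumptions to verify against the Ranger version in use.

```python
import json


def build_allow_policy(service_name, policy_name, path, groups, access_types,
                       recursive=True):
    """Build a JSON body for a path-based Allow policy on an HDFS-type service."""
    return {
        "service": service_name,
        "name": policy_name,
        "isEnabled": True,
        "resources": {
            "path": {"values": [path], "isRecursive": recursive},
        },
        "policyItems": [
            {
                "groups": list(groups),
                "accesses": [{"type": t, "isAllowed": True} for t in access_types],
            }
        ],
    }


# Hypothetical policy name; the body would be POSTed to
# <ranger-admin-url>/service/public/v2/api/policy (assumed endpoint).
policy = build_allow_policy(
    "alluxio-datacenter1-test",
    "sensitive_data1-read-only",
    "/sensitive_data1",
    ["alluxio-users"],
    ["read", "execute"],
)
print(json.dumps(policy, indent=2))
```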
Wait a minute for the policy to be retrieved and cached by the Alluxio master node. Then open a terminal session on an Alluxio node to test the allow policy. Run the alluxio fs ls command again and it should successfully show a listing of the sub-directory, like this:
$ id
uid=1001(user1) gid=1001(alluxio-users)
$ alluxio fs ls /sensitive_data1/dataset1/
-rw------- root root 283 PERSISTED 02-01-2022 14:59:45:457 100% /sensitive_data1/dataset1/data-file-001
$ alluxio fs copyFromLocal my_data-file-002 /sensitive_data1/dataset1/
Permission denied by authorization plugin: alluxio.exception.AccessControlException: Permission denied: user=user1, access=--x, path=/sensitive_data1/dataset1/my_data-file-002: failed at /, inode owner=root, inode group=root, inode mode=rwx------
Notice that the Ranger policy allowed read access to the /sensitive_data1/dataset1/ directory, but did not allow write access to it (the copyFromLocal command failed). This is because the Ranger policy only specified Read and Execute permissions on the /sensitive_data1 directory tree.
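The outcome above can be reasoned about with a simplified evaluator. This is a sketch of the matching logic only, not Ranger's policy engine: it handles a single recursive path policy with group-based Allow items and ignores wildcards, deny items and user-level grants. The policy dictionary and function names are illustrative.

```python
def path_matches(policy_path, request_path, recursive):
    """True if the request path equals the policy path or lies beneath it."""
    policy_path = policy_path.rstrip("/") or "/"
    request_path = request_path.rstrip("/") or "/"
    if request_path == policy_path:
        return True
    return recursive and request_path.startswith(policy_path.rstrip("/") + "/")


def policy_permits(policy, request_path, user_groups, access):
    """Return True if an Allow item grants `access` to one of the user's groups."""
    resource = policy["resources"]["path"]
    recursive = resource.get("isRecursive", False)
    if not any(path_matches(p, request_path, recursive) for p in resource["values"]):
        return False
    for item in policy["policyItems"]:
        if set(item["groups"]) & set(user_groups):
            if any(a["type"] == access and a["isAllowed"] for a in item["accesses"]):
                return True
    return False


# Illustrative policy mirroring the one defined in the Ranger admin console.
read_only_policy = {
    "resources": {"path": {"values": ["/sensitive_data1"], "isRecursive": True}},
    "policyItems": [{
        "groups": ["alluxio-users"],
        "accesses": [{"type": "read", "isAllowed": True},
                     {"type": "execute", "isAllowed": True}],
    }],
}

groups = ["alluxio-users"]
print(policy_permits(read_only_policy, "/sensitive_data1/dataset1/", groups, "read"))
print(policy_permits(read_only_policy, "/sensitive_data1/dataset1/", groups, "write"))
```

Listing the directory needs read and execute, which the policy grants, while copyFromLocal needs write, which no policy item grants, so the request falls through to a denial.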
Later, use Ranger to add or remove user groups or specific users from the Allow and Deny policies. Alluxio will rescan the policies and will update its local policy cache, and enforce the policies when users make read or write requests to the Alluxio virtual file system.
Option 2. Alluxio enforces existing Ranger policies
With this option, there is no need to enable an Alluxio service plugin in the Ranger admin console, because Alluxio can use the policies defined in the existing HDFS service. The HDFS service should already exist in the Admin console as shown in Figure 10.
However, the Alluxio master node will need to be configured to use Ranger as an authorizer.
Step 1. Configure Alluxio Master Nodes
The Alluxio master nodes can be configured to use the Ranger HDFS plugin to retrieve and cache Ranger policies. Copy the core-site.xml, hdfs-site.xml, ranger-hdfs-security.xml, ranger-hdfs-audit.xml and ranger-policymgr-ssl.xml files from the $HADOOP_CONF directory on the HDFS namenode server to the $ALLUXIO_HOME/conf directory on the Alluxio master node servers.
Then, the alluxio-site.properties file on the Alluxio master nodes should be changed in two ways.
First, Ranger integration should be enabled, like this:
alluxio.security.authorization.plugins.enabled=true
alluxio.security.authorization.permission.umask=077
Then, if HDFS is mounted as the root UFS, the Ranger plugin should be referenced as the plugin to use for the root UFS, like this:
alluxio.master.mount.table.root.option.alluxio.underfs.security.authorization.plugin.name=<plugin name>
alluxio.master.mount.table.root.option.alluxio.underfs.security.authorization.plugin.paths=/opt/alluxio/conf
If HDFS is not being mounted as the root UFS, but is being mounted using the nested mount method, then the Alluxio mount command should include the options to specify the Ranger plugin name and plugin paths, like this:
alluxio fs mount \
  --option alluxio.underfs.security.authorization.plugin.name=<plugin name> \
  --option alluxio.underfs.security.authorization.plugin.paths=/opt/alluxio/conf \
  --option alluxio.underfs.version=2.7 \
  /my_hdfs_mount \
  hdfs://<name node>:<port>/
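When several HDFS clusters are mounted this way, assembling the command programmatically helps keep the option names consistent. The sketch below builds the argument list for the mount command shown above and prints it; it does not run anything. The host name, port and plugin name in the example call are placeholders chosen for illustration.

```python
import shlex


def build_mount_command(alluxio_path, ufs_uri, plugin_name, plugin_paths,
                        underfs_version=None):
    """Return the argv list for a nested mount that uses the Ranger plugin."""
    options = {
        "alluxio.underfs.security.authorization.plugin.name": plugin_name,
        "alluxio.underfs.security.authorization.plugin.paths": plugin_paths,
    }
    if underfs_version:
        options["alluxio.underfs.version"] = underfs_version
    argv = ["alluxio", "fs", "mount"]
    for key, value in options.items():
        argv += ["--option", f"{key}={value}"]
    # The Alluxio path comes first, followed by the UFS URI.
    argv += [alluxio_path, ufs_uri]
    return argv


# Placeholder namenode address and plugin name.
mount_argv = build_mount_command(
    "/my_hdfs_mount",
    "hdfs://namenode1:8020/",
    "ranger-hdp-2.6",
    "/opt/alluxio/conf",
    underfs_version="2.7",
)
print(shlex.join(mount_argv))
```

The resulting list could be passed to `subprocess.run` on an Alluxio node once the values have been reviewed.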
The plugin name tells Alluxio to use a specific Ranger HDFS plugin, located in .jar files in the $ALLUXIO_HOME/lib directory. Several versions of Apache Ranger are supported and are implemented with these jar files:
alluxio-authorization-ranger-2.0-cdp-7.1-enterprise-2.7.0-2.4.jar
alluxio-authorization-ranger-0.5-hdp-2.4-enterprise-2.7.0-2.4.jar
alluxio-authorization-ranger-0.7-hdp-2.6-enterprise-2.7.0-2.4.jar
alluxio-authorization-ranger-1.1-hdp-3.0-enterprise-2.7.0-2.4.jar
alluxio-authorization-ranger-1.2-hdp-3.1-enterprise-2.7.0-2.4.jar
alluxio-authorization-ranger-0.6-hdp-2.5-enterprise-2.7.0-2.4.jar
alluxio-authorization-ranger-2.1-privacera-4.7-enterprise-2.7.0-2.4.jar
For example, if Privacera 4.7 is being used, then the plugin name would be specified as ranger-privacera-4.7, and if Hortonworks HDP 2.6 is being used, then the plugin name would be specified as ranger-hdp-2.6.
After copying the Ranger XML files and modifying the alluxio-site.properties file, restart the Alluxio master daemons.
Step 2. Re-format Alluxio Masters
For these changes to take effect, the Alluxio master nodes need to be re-formatted, using the following command:
alluxio formatJournal
If using an embedded journal (alluxio.master.journal.type=EMBEDDED), run the command on each master node. If using a journal type of UFS, then simply run the command once on any master node.
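The rule for where to run the format command can be captured in a small helper, which is handy inside a deployment script. This is a sketch with hypothetical host names; it only decides which masters need the command and does not execute it.

```python
def hosts_to_format(journal_type, master_hosts):
    """Return the master hosts on which the journal format command should run."""
    journal_type = journal_type.upper()
    if journal_type == "EMBEDDED":
        # Each master keeps its own copy of the embedded journal.
        return list(master_hosts)
    if journal_type == "UFS":
        # The journal lives in shared storage, so one run is enough.
        return list(master_hosts[:1])
    raise ValueError(f"unsupported journal type: {journal_type}")


# Hypothetical master host names.
masters = ["alluxio-master-1", "alluxio-master-2", "alluxio-master-3"]
print(hosts_to_format("EMBEDDED", masters))
print(hosts_to_format("UFS", masters))
```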
Now Alluxio should use the existing Ranger HDFS service policies to determine access permissions to HDFS UFS directories and files.
Summary
As data stewards and security teams provide broader access to their organization’s data lake environments, having a centralized way to manage fine-grained access policies becomes increasingly important. Alluxio can use Apache Ranger’s centralized access policies in two ways: 1) directly controlling access to virtual paths in the Alluxio virtual file system or 2) enforcing existing access policies for the HDFS under stores.
To gain some hands-on experience using Alluxio with Apache Ranger, you may deploy Alluxio and Apache Ranger on your own computer using the Alluxio Ranger Best Practices sandbox at: https://github.com/gregpalmr/alluxio-ranger-sandbox. To learn more about Alluxio’s security, refer to the Alluxio documentation at: https://docs.alluxio.io/ee/user/stable/en/operation/Security.html.
1 Apache Ranger - https://ranger.apache.org
2 Alluxio Security - https://docs.alluxio.io/ee/user/stable/en/operation/Security.html