Everything you want to know about how to decouple SQL engines from Hive Data Warehouse

Are you using SQL engines, such as Presto, to query existing Hive data warehouse and experiencing challenges including overloaded Hive Metastore with slow and unpredictable access, unoptimized data formats and layouts such as too many small files, or lack of influence over the existing Hive system and other Hive applications? 

To address these challenges, we launched Alluxio Structured Data Management in the 2.1 open source release, which includes the Alluxio Catalog Service. In the latest office hour, Gene Pang, the project lead of Alluxio Catalog Service, provided an overview of the key challenges and demo’d Alluxio Catalog Service (watch full video on-demand). Gene will be holding online open office hours on Wed April 1 at 10am PT, and you’re welcome to bring your questions or share feedback to help improve the Alluxio open source project.

Here is the FAQ based on the top questions from the community:

How does the Alluxio Catalog Service work with Hive Metastore?

The Alluxio catalog manages metadata for structured data, such as schema and table information. In order to do so, The Alluxio catalog manages connections to external metadata services (like Hive Metastore or AWS Glue) via an abstraction called Under Database (UDB). With the UDB abstraction, the Alluxio catalog can interact with other external metadata services for database and table information.

In order for the Alluxio catalog to manage and maintain interactions with external metadata services, the external service must be “attached” to the Alluxio catalog. This attaching process adds the relationship between the external service and the Alluxio catalog. Attaching an external metadata service is initiated via the “attachdb” CLI command.

In the Alluxio 2.2 release, there is an implementation of the UDB for the Hive Metastore. This means databases in a Hive Metastore can be attached to the Alluxio catalog, so that the Alluxio catalog can manage and serve this metadata. In future versions, support for the AWS Glue UDB will be added.

After an external database is attached, the Alluxio catalog syncs all the information from the external database and imports the table definitions into the catalog. The Alluxio catalog can then serve the information to the client, which in this case is in the Alluxio connector in Presto. The “sync” CLI command triggers the Alluxio catalog to sync with the UDB again.

Once a UDB is attached, the Alluxio catalog manages the table metadata. This means the Alluxio catalog will transparently use Alluxio file locations instead of the locations found in the Hive Metastore, without having to modify the Hive Metastore. This allows simple deployment of the Alluxio catalog service with existing Hive Metastores.

How does the Alluxio Catalog Service work with Presto?

Presto interacts with the Alluxio catalog via the Alluxio connector. The Alluxio connector in Presto is new functionality for the existing Hive connector. The new functionality has already been merged into the codebases of PrestoSQL and PrestoDB. The Presto catalog simply needs to be configured to interact with the Alluxio catalog, instead of the Hive Metastore. Once Presto is configured in this way, all of the existing Hive connector operations will interact with the Alluxio catalog for the metadata of tables, and not the Hive Metastore.

In the Alluxio 2.2 release, only read-only queries are supported with the Alluxio connector and Alluxio catalog. In future versions, write queries like DDL and DML statements will be supported.

How do Alluxio transformations work?

Alluxio transformations are performed by an Alluxio-internal job service, which executes distributed, asynchronous tasks. When a transformation is initiated, this is converted into a distributed job, and executed across the various Alluxio job workers. Once the job is complete, the Alluxio catalog is updated with the newest transformed locations, and when the Alluxio catalog serves the table metadata, the new transformed locations are used transparently, without any modification to the UDB.

When transformations are generated, they are written through to the underlying persistent store, and not cached in Alluxio. Once the transformed data is written, it can be read from the new transformed locations. The first read of the transformed data will cache the data in Alluxio on-demand.

In Alluxio 2.2, the output file format is the parquet format, since it is a more compute-optimized file format. In future versions, additional file formats and parquet output options will be available. For coalescing many small files together, the default behavior is to generate files of about 2G, but that is configurable.

What is the architecture of Alluxio Structured Data Services?

Alluxio Structured Data Services is a part of the Alluxio system and has 3 main components: Alluxio connector in Presto, Alluxio Catalog Service, and Alluxio Transformation Service.

Alluxio connector: The Alluxio connector is part of the Presto SQL engine. It is a connector based on the existing Hive connector, and is the client to the Alluxio catalog.

Alluxio Catalog Service: The catalog service manages the metadata for structured data, and is a part of the Alluxio master processes.

Alluxio Transformation Service: The transformation service is managed by the Alluxio catalog, and the transformations are performed by the Alluxio job service, which is an internal service for executing distributed, asynchronous tasks. The jobs are executed and managed by the Alluxio job masters and Alluxio job workers.