AWS Athena makes it easy to analyze semi-structured and non-structured data like json, csv & xml directly in Amazon S3 using SQL. However, it also allows you to easily query a number of relational databases hosted in AWS such as mySQL and PostgreSQL. For any Data Scientist, this opens up a world of potential because now it’s possible to write SQL queries that combine data from both relational and non-relational data that’s stored in your AWS data lake in a single query!
In this post, I’ll bring you through all the steps ( and all the gotchas!) to get up and running with the Athena federated query service.
Create a new Data Source
The first step, to get going, is to open the Athena service and click Connect data Source to setup a new connection.
Choose Query a data source then PostgreSQL as the data source that you want to query:
Athena uses data source connectors that run on AWS Lambda to run federated queries. Luckily, AWS have a number of prebuilt Athena data source connectors for JDBC-compliant relational data sources. In this next step we will choose to Configure a new AWS Lambda function to run our federated queries.
Selecting Configure a new AWS Lambda will open up the prebuilt Lambda function in the AWS Lambda service.
Lambda Configuration
The first thing to configure is the SpillBucket. Any query result sets larger than ~6MB, will spill this S3 bucket. The value supplied should be the bucket name and NOT the ARN. Also, do make sure that the account that you will be running the queries with has write access to the bucket.
Next add the default connection string for your PostgreSQL DB. It takes the form: postgres://jdbc:postgresql://<database endpoint>:<database port>/<database name>?user=<username>&password=<password>. You will find connection string details in the Connectivity & security section of your DB
*It’s highly recommended that you use the secrets management service to supply any sensitive values
The LambdaFunctionName will be both the name of the Lambda function that you are about to deploy and also the name you can use to specify this source inside of your Athena query.
The security group added to SecurityGroupIDs, must have inbound rules configured for Postgres. Otherwise, you’ll get a “connection attempt failed” error message when you query the DB.
Finally, configure the SubnetIds that the lambda function needs to access your PostgreSQL DB. You will find the subnets on the connectivity & security tab for your DB. It is recommend to add all subnetIds that are listed on the connectivity & security tab separated by a comma.
Once you’ve configured your function, it should look something like the below:
Querying the postgres DB from Athena
At this point, you can now start querying the PostgreSQL DB by referencing the lambda function directly in the query using the syntax “lambda:<function name> in the FROM clause, as seen in the below screen shot:
Pre-register the federated source with Athena
Instead of using the registration-less option (as above), Athena offers you the ability to register these federated sources. To do so:
- Return to Athena Query Editor page
- Click Connect data source
- Next click Query a data source
- Finally choose PostgreSQL as the data source that you want to query.
On the connection details page this time select the Lambda function you previously created in the drop down. For the Catalog name, choose a descriptive name for your data source, as it’ll appear in the data source dropdown in the Athena query editor. Click Connect to register the federated source.
You should now see the catalog name in the Data source drop down. However, you still wont be able query the DB using the pre-registered catalog option and will receive this error:
“Catalog is not supported in multiplexer. After registering the catalog in Athena, must set ‘jdbc_postgres_catalog_connection_string’ environment variable in Lambda“
Reading the error, Athena wants us to add a new environment variable to the lambda function we created. Switching back to the jdbc_postgres function, scroll down to the environment variables, click edit to add the new variable. The syntax for the key is the catalog name you specified earlier followed by “_connection_string”. The value is the connection string for the DB, which you can copy from the default key/value pair.
Once you’ve added the environment variable, and refreshed your Athena Query Editor page, the databases and it’s tables should now be visible in the left pane and you’ll be able to query the database without having to specify the lambda connector directly.
If you enjoyed this article, please like and share or check out some others at www.dataengineeringacademy.com
I could not resist commenting. Perfectly written!
I think the admin of this web page is actually working hard for his website, for the
reason that here every stuff is quality based information.