AWS Glue is an Extract, Transform, Load (ETL) service available as part of Amazon’s hosted web services. Glue is intended to make it easy for users to connect their data in a variety of data stores, edit and clean the data as needed, and load the data into an AWS-provisioned store for a unified view.
Glue supports accessing data via JDBC, and currently the databases supported out of the box are PostgreSQL, MySQL, Redshift, and Aurora. Of course, JDBC drivers exist for many other databases besides these four. Using the DataDirect JDBC connectors, you can access many other data sources for use in AWS Glue.
This tutorial demonstrates accessing Salesforce data with AWS Glue, but the same steps apply with any of the DataDirect JDBC drivers.
Download the DataDirect Salesforce JDBC driver from the Progress DataDirect website.
To install the driver, execute the JAR package, either by double-clicking it or by running the following command in a terminal:
java -jar PROGRESS_DATADIRECT_JDBC_SF_ALL.jar
This will launch an interactive Java installer, which you can use to install the Salesforce JDBC driver to your desired location as either a licensed or evaluation installation.
Note that the installer also places a number of other DataDirect drivers in the same folder for trial purposes.
Go to AWS Glue Console on your browser, under ETL > Jobs, click on the Add Job button to create a new job. You should see an interface as shown below.
Fill in the name of the job, and choose/create an IAM role that gives permissions to your Amazon S3 sources, targets, temporary directory, scripts, and any libraries used by the job. For this tutorial, we only need access to Amazon S3, since the JDBC driver is stored there and the destination is also S3.
Under This job runs, choose A new script to be authored by you.
Give your script a name and choose a temporary directory for the Glue job in S3.
Under Script Libraries and job parameters (optional), for Dependent Jars path, choose the sforce.jar file in your S3 bucket. Your configuration should look as shown below.
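For reference, the value entered under Dependent Jars path is simply the S3 URI of the jar you uploaded. A minimal sketch (the bucket and key names here are hypothetical examples):

```python
# Build the S3 URI entered under "Dependent jars path" in the Glue job
# configuration. The bucket and key below are made-up placeholders.

def dependent_jars_path(bucket: str, key: str) -> str:
    """Return the s3:// URI Glue expects for a dependent jar."""
    return f"s3://{bucket}/{key}"


if __name__ == "__main__":
    print(dependent_jars_path("my-glue-assets", "jars/sforce.jar"))
```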
Click the Next button and you should see Glue asking if you want to add any connections that might be required by the job. In this tutorial, we don't need any connections, but if you plan to use another destination such as Redshift, SQL Server, Oracle, etc., you can create connections to those data sources in AWS Glue and they will show up here.
Click on Next, review your configuration, and click Finish to create the job.
You should now see an editor to write a Python script for the job. Here, you write your custom Python code to extract data from Salesforce using the DataDirect JDBC driver and write it to S3 or any other destination.
You can use this code sample to get an idea of how you can extract data from Salesforce using the DataDirect JDBC driver and write it to S3 in CSV format. Feel free to make any changes to suit your needs. Save the job.
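As a rough sketch of what such a script might look like: the connection URL, credentials, table name, and S3 path below are all placeholders, and the driver class name assumes the standard DataDirect Salesforce JDBC driver supplied via Dependent Jars path.

```python
# Sketch of a Glue job script that reads Salesforce data over JDBC and
# writes it to S3 as CSV. URL, credentials, table, and S3 path are
# placeholders; the driver class assumes the DataDirect Salesforce driver.

JDBC_URL = "jdbc:datadirect:sforce://login.salesforce.com"
DRIVER_CLASS = "com.ddtek.jdbc.sforce.SForceDriver"


def jdbc_read_options(url, driver, table, user, password):
    """Build the options map passed to spark.read.format('jdbc')."""
    return {
        "url": url,
        "driver": driver,
        "dbtable": table,
        "user": user,
        "password": password,
    }


if __name__ == "__main__":
    # Inside a real Glue job, these options would be used roughly like this:
    #
    #   from pyspark.context import SparkContext
    #   from awsglue.context import GlueContext
    #
    #   glueContext = GlueContext(SparkContext.getOrCreate())
    #   spark = glueContext.spark_session
    #   opts = jdbc_read_options(JDBC_URL, DRIVER_CLASS, "SFORCE.ACCOUNT",
    #                            "user@example.com", "password+token")
    #   df = spark.read.format("jdbc").options(**opts).load()
    #   df.write.mode("overwrite").csv("s3://my-bucket/salesforce/account/")
    print(jdbc_read_options(JDBC_URL, DRIVER_CLASS, "SFORCE.ACCOUNT", "u", "p"))
```

The read options are kept in a plain dict so the same connection details can be reused if you later switch the destination from CSV on S3 to, say, a Redshift table.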
You can use similar steps with any of the DataDirect JDBC suite of drivers, available for relational, big data, SaaS, and NoSQL data sources. Feel free to try any of our drivers with AWS Glue for your ETL jobs during a 15-day trial period.