Data analytics has become a major focus for businesses looking to gain insights and a competitive advantage. However, before insights can be extracted from data, it must be cleaned, transformed, and put into a usable format. Raw data is often scattered across multiple systems, so a system is needed to collect, process, and store it. This system is called a data pipeline. AWS Glue is a data integration service that makes it easy to move data among Amazon Web Services storage and processing services. It is serverless, scalable, and fully managed, allowing data to be processed without having to manage any infrastructure.
In this tutorial, we will learn how to create a data pipeline using AWS Glue to transform data stored in an Amazon S3 bucket. We will cover the following topics:
- Setting up the AWS Glue environment
- Creating an Amazon S3 bucket and folder structure
- Defining the Glue Data Catalog
- Creating a crawler to populate the Data Catalog
- Creating a Glue job to transform the data
- Running a job and verifying the output
Setting up the AWS Glue environment
To use AWS Glue, you must first sign up for an AWS account. Once you have signed up, navigate to the AWS Management Console and select AWS Glue from the list of services. This will take you to the AWS Glue console, where you can create and manage your data pipelines.
Creating an Amazon S3 bucket and folder structure
To store the data we want to process, we need an Amazon S3 bucket where we can create a folder structure to organize our data. To create a new S3 bucket, navigate to the S3 console and click the “Create Bucket” button. Give your bucket a unique name and choose your desired region. After creating the bucket, create two new folders within it, “input” and “output”. The input folder will contain the raw data we want to process, and the output folder will contain the transformed data.
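If you prefer to script this step instead of using the console, a minimal boto3 sketch along these lines can create the bucket and the two folder placeholders. The bucket name and region are placeholders; replace them with your own values.

import boto3

# Placeholder values; replace with a globally unique bucket name and your region.
BUCKET_NAME = "mybucket"
REGION = "us-east-1"

s3 = boto3.client("s3", region_name=REGION)

# Create the bucket. Outside us-east-1, a LocationConstraint must be supplied.
if REGION == "us-east-1":
    s3.create_bucket(Bucket=BUCKET_NAME)
else:
    s3.create_bucket(
        Bucket=BUCKET_NAME,
        CreateBucketConfiguration={"LocationConstraint": REGION},
    )

# S3 has no real folders; zero-byte objects with a trailing slash act as
# folder placeholders in the console.
s3.put_object(Bucket=BUCKET_NAME, Key="input/")
s3.put_object(Bucket=BUCKET_NAME, Key="output/")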
Defining the Glue Data Catalog
The Glue Data Catalog is a central metadata repository that stores information about your data sources, their schemas, and associated properties. The metadata in the Data Catalog serves as the foundation for Glue ETL jobs. To define the Glue Data Catalog, we first need to create a database. To create a new database, click on the “Databases” tab in the AWS Glue console and then click the “Add database” button. Give your database a name and click “Create.” Once the database is created, you should see it listed in the Databases tab of the AWS Glue console.
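The database can also be created programmatically. Here is a minimal boto3 sketch, assuming the placeholder database name "mydatabase" that the ETL script later in this tutorial refers to:

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create the Data Catalog database; the name matches the placeholder
# referenced later in the ETL script ("mydatabase").
glue.create_database(
    DatabaseInput={
        "Name": "mydatabase",
        "Description": "Database for the S3 data pipeline tutorial",
    }
)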
Creating a crawler to populate the Data Catalog
A crawler is used to populate the Glue Data Catalog with metadata information about the data being processed. To create a new crawler, click on the “Crawlers” tab in the AWS Glue console and then click “Add crawler.” Give your crawler a name and select the data store you want to crawl. In our case, we will select the input folder in our S3 bucket.
Next, we need to configure the crawler to populate the Glue Data Catalog with metadata about the data. To do this, click the “Configure crawler” button and, in the “Configure the crawler’s output” section, associate the crawler with the database we created earlier. Here we can also set an optional prefix for the tables the crawler creates and decide whether paths matching a specific pattern should be included or excluded.
The last step in configuring the crawler is to set up a schedule for it. The default setting is to run one time, but we will change this to run hourly. Finally, start the crawler by clicking on the “Run it now” button. The crawler will then begin the process of analyzing the data in our S3 bucket and populating the Glue Data Catalog with metadata.
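For reference, an equivalent crawler could be defined and started with boto3 along these lines. The crawler name, IAM role ARN, and bucket path are placeholders, and the cron expression mirrors the hourly schedule chosen above.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Placeholder IAM role; it needs the AWSGlueServiceRole managed policy plus
# read access to the S3 bucket.
ROLE_ARN = "arn:aws:iam::123456789012:role/MyGlueServiceRole"

glue.create_crawler(
    Name="my-input-crawler",
    Role=ROLE_ARN,
    DatabaseName="mydatabase",
    Targets={"S3Targets": [{"Path": "s3://mybucket/input/"}]},
    # Hourly schedule, matching the setting chosen in the console.
    Schedule="cron(0 * * * ? *)",
)

# Kick off the first run immediately instead of waiting for the schedule.
glue.start_crawler(Name="my-input-crawler")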
Creating a Glue job to transform the data
Once the crawler has finished running and populated the Glue Data Catalog, we can create a Glue job to transform the data. To create a new job, navigate to the “Jobs” tab in the AWS Glue console and click “Add job.” Give your job a unique name, select the IAM role you want to use for the job, and then click “Next.”
In the next step, we will select the “data source.” Here, we specify where the data we want to transform is located, in our case, the input folder in the S3 bucket. Select the data source type as “Data Catalog” and then click on the “Browse” button to select the table created by the crawler. After selecting the table, click “Next.”
In the “Map the source columns to target columns” step, we define the transformation logic using code written in Python or Scala. AWS Glue provides a range of built-in transformations that can be applied to the data, and these transformations can be chained together to perform more complex processing. We can write the transformation code either in the script editor in the console or in an external tool and upload the script.
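As a quick illustration of chaining built-in transforms, a sketch like the following could cast columns and drop null fields. Here source_frame stands for a DynamicFrame read from the Data Catalog (as in the full script below), and the column names are illustrative only.

from awsglue.transforms import ApplyMapping, DropNullFields

# source_frame is assumed to be a DynamicFrame read from the Data Catalog,
# as shown in the full script below; the column names are placeholders.
mapped = ApplyMapping.apply(
    frame=source_frame,
    mappings=[("timestamp", "string", "timestamp", "timestamp"),
              ("value", "double", "value", "double")],
)
cleaned = DropNullFields.apply(frame=mapped)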
In this example, we will use the script editor provided on the console. The following Python code can be used to transform the raw data:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from pyspark.context import SparkContext
from pyspark.sql.functions import avg

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

## @type: DataSource
## @args: [database = "mydatabase", table_name = "mytable"]
## @return: myinputdata
## @inputs: []
## Read the source table registered by the crawler into a DynamicFrame
myinputdata = glueContext.create_dynamic_frame.from_catalog(database="mydatabase", table_name="mytable")

## Convert to a Spark DataFrame for SQL-style transformations
mydata = myinputdata.toDF()

## Transform the data: group by timestamp and compute the average value
mydata = mydata.select('timestamp', 'value').groupBy('timestamp').agg(avg('value').alias('average'))

## Convert back to a DynamicFrame and write the result to S3 as Parquet
myoutputdata = DynamicFrame.fromDF(mydata, glueContext, "transformed_output")
glueContext.write_dynamic_frame.from_options(myoutputdata, connection_type="s3", connection_options={"path": "s3://mybucket/output"}, format="parquet")

job.commit()
This transformation logic reads the raw data, groups it by timestamp, and calculates the average value for each timestamp. It then writes the transformed data to the output folder in Parquet format.
Running a job and verifying the output
Once we have finished defining the transformation logic, we can run the job. To run the job, navigate to the Jobs tab in the AWS Glue console, select the job you want to run, and then click the “Run” button. Once the job is complete, we can verify the output by navigating to the S3 console, selecting the output folder, and viewing the transformed data files.
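If you prefer to script this step, a boto3 sketch along these lines can start the job, check its status, and list the output files. The job name is a placeholder, and the bucket name matches the placeholder used in the script above.

import boto3

glue = boto3.client("glue", region_name="us-east-1")
s3 = boto3.client("s3", region_name="us-east-1")

# Start the job and check the run status (the job name is a placeholder).
run = glue.start_job_run(JobName="my-transform-job")
status = glue.get_job_run(JobName="my-transform-job", RunId=run["JobRunId"])
print(status["JobRun"]["JobRunState"])  # e.g. RUNNING, SUCCEEDED, FAILED

# Once the run has SUCCEEDED, list the Parquet files written to the output folder.
response = s3.list_objects_v2(Bucket="mybucket", Prefix="output/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])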
Conclusion
In this tutorial, we learned how to create a data pipeline using AWS Glue to transform data stored in an Amazon S3 bucket. AWS Glue simplifies the process of building data pipelines by providing a serverless, fully managed, scalable platform for processing and transforming data. By following these steps, you can transform raw data and load it into a structured format for further analysis.