{"id":4195,"date":"2023-11-04T23:14:08","date_gmt":"2023-11-04T23:14:08","guid":{"rendered":"http:\/\/localhost:10003\/creating-a-data-pipeline-to-process-data-using-aws-glue\/"},"modified":"2023-11-05T05:47:56","modified_gmt":"2023-11-05T05:47:56","slug":"creating-a-data-pipeline-to-process-data-using-aws-glue","status":"publish","type":"post","link":"http:\/\/localhost:10003\/creating-a-data-pipeline-to-process-data-using-aws-glue\/","title":{"rendered":"Creating a Data Pipeline to process data using AWS Glue"},"content":{"rendered":"
Data analytics has become a key way for businesses to gain insights and competitive advantage. However, before insights can be extracted, data must first be cleaned, transformed, and stored in a usable format. Raw data is often scattered across multiple systems, so a system is needed to collect, process, and store it; this system is called a data pipeline. AWS Glue is a data integration service that makes it easy to move data among AWS data storage and processing services. It is serverless, scalable, and fully managed, so data can be processed without having to manage any infrastructure.<\/p>\n
In this tutorial, we will learn how to create a data pipeline using AWS Glue to transform data stored in an Amazon S3 bucket. We will cover setting up AWS Glue, creating an S3 bucket for the input and output data, defining the Glue Data Catalog, creating and running a crawler, writing and running a Glue ETL job, and verifying the transformed output.<\/p>\n
To use AWS Glue, you must first sign up for an AWS account. Once you have signed up, navigate to the AWS Management Console and select AWS Glue from the list of services. This will take you to the AWS Glue console, where you can create and manage your data pipelines.<\/p>\n
To store the data we want to process, we need an Amazon S3 bucket where we can create a folder structure to organize our data. To create a new S3 bucket, navigate to the S3 console and click the “Create Bucket” button. Give your bucket a unique name and choose your desired region. After creating the bucket, create two new folders within it, “input” and “output”. The input folder will contain the raw data we want to process, and the output folder will contain the transformed data.<\/p>\n
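If you prefer to script this step, the bucket and folder markers can also be created with the AWS SDK for Python (boto3). This is a minimal sketch, assuming your credentials are already configured; the bucket name “mybucket” and the region are placeholders, and the bucket name must be globally unique in practice.<\/p>\n
import boto3\n\n# Assumed placeholder bucket name -- replace with a globally unique name of your own\nBUCKET = \"mybucket\"\n\ns3 = boto3.client(\"s3\", region_name=\"us-east-1\")\n\n# Create the bucket (outside us-east-1 a CreateBucketConfiguration with a\n# LocationConstraint is required)\ns3.create_bucket(Bucket=BUCKET)\n\n# S3 has no real folders; zero-byte objects whose keys end in \"\/\" act as\n# folder placeholders in the console\ns3.put_object(Bucket=BUCKET, Key=\"input\/\")\ns3.put_object(Bucket=BUCKET, Key=\"output\/\")\n<\/code><\/pre>\n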
The Glue Data Catalog is a central metadata repository that stores metadata information about data sources, schemas, and associated metadata. The data in the Data Catalog serves as a foundation for the Glue ETL jobs. To define the Glue Data Catalog, we need to first create a database. To create a new database, click on the “Databases” tab in the AWS Glue console and then click the “Add database” button. Give your database a name and click “Create.” Once the database is created, you should see it listed in the Databases tab of the AWS Glue console.<\/p>\n
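The same database can also be created programmatically. Below is a minimal boto3 sketch; the database name “mydatabase” is an assumed placeholder that matches the name used in the job script later in this tutorial.<\/p>\n
import boto3\n\nglue = boto3.client(\"glue\", region_name=\"us-east-1\")\n\n# Create a Glue Data Catalog database; \"mydatabase\" is an assumed name\nglue.create_database(\n    DatabaseInput={\n        \"Name\": \"mydatabase\",\n        \"Description\": \"Database for the S3 data pipeline tutorial\"\n    }\n)\n<\/code><\/pre>\n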
A crawler is used to populate the Glue Data Catalog with metadata information about the data being processed. To create a new crawler, click on the “Crawlers” tab in the AWS Glue console and then click “Add crawler.” Give your crawler a name and select the data store you want to crawl. In our case, we will select the input folder in our S3 bucket.<\/p>\n
Next, we need to configure the crawler to populate the Glue Data Catalog with the metadata about the data. To do this, click on the “Configure crawler” button and select “Add database” to associate the crawler with the database we created earlier. Once the crawler is associated with the database, the “Configure the crawler’s output” section should be populated. Here, we choose the output path for the crawler and decide if the crawler should crawl tables matching a specific pattern.<\/p>\n
The last step in configuring the crawler is to set up a schedule for it. The default setting is to run one time, but we will change this to run hourly. Finally, start the crawler by clicking on the “Run it now” button. The crawler will then begin the process of analyzing the data in our S3 bucket and populating the Glue Data Catalog with metadata.<\/p>\n
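For reference, an equivalent crawler can be defined and started with boto3. This is a sketch under a few assumptions: the crawler name, the IAM role “AWSGlueServiceRole-tutorial”, and the bucket name are placeholders, and the cron expression mirrors the hourly schedule chosen above.<\/p>\n
import boto3\n\nglue = boto3.client(\"glue\", region_name=\"us-east-1\")\n\n# Assumed names -- replace with your own crawler name, role, and bucket\nglue.create_crawler(\n    Name=\"my-input-crawler\",\n    Role=\"AWSGlueServiceRole-tutorial\",\n    DatabaseName=\"mydatabase\",\n    Targets={\"S3Targets\": [{\"Path\": \"s3:\/\/mybucket\/input\/\"}]},\n    Schedule=\"cron(0 * * * ? *)\"  # run at the top of every hour\n)\n\n# Start the first run immediately instead of waiting for the schedule\nglue.start_crawler(Name=\"my-input-crawler\")\n<\/code><\/pre>\n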
Once the crawler has finished running and populated the Glue Data Catalog, we can create a Glue job to transform the data. To create a new job, navigate to the “Jobs” tab in the AWS Glue console and click “Add job.” Give your job a unique name, select the IAM role you want to use for the job, and then click “Next.”<\/p>\n
In the next step, we will select the “data source.” Here, we specify where the data we want to transform is located, in our case, the input folder in the S3 bucket. Select the data source type as “Data Catalog” and then click on the “Browse” button to select the table created by the crawler. After selecting the table, click “Next.”<\/p>\n
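If you would rather create the job programmatically than through the console wizard, a sketch along these lines should work. It assumes the ETL script (shown later in this tutorial) has already been uploaded to S3; the job name, role, and script path are placeholders.<\/p>\n
import boto3\n\nglue = boto3.client(\"glue\", region_name=\"us-east-1\")\n\n# Assumed role name and script location -- the script itself appears later in\n# this tutorial and must be uploaded to S3 first\nglue.create_job(\n    Name=\"transform-input-data\",\n    Role=\"AWSGlueServiceRole-tutorial\",\n    Command={\n        \"Name\": \"glueetl\",\n        \"ScriptLocation\": \"s3:\/\/mybucket\/scripts\/transform.py\",\n        \"PythonVersion\": \"3\"\n    },\n    GlueVersion=\"4.0\"\n)\n<\/code><\/pre>\n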
In the “Map the source columns to target columns” step, we define the transformation logic using code written in Python or Scala. AWS Glue provides a range of transformations that can be applied to data, and these transformations can be chained together to perform more complex data processing. We can write the transformation code either in the script editor on the console or using external tools and uploading the code as a package.<\/p>\n
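As a hypothetical illustration of chaining, the snippet below applies two built-in Glue transforms in sequence, renaming a column with ApplyMapping and then keeping only certain rows with Filter. The frame and column names are assumptions for the sake of the example; the DynamicFrame is assumed to have been read from the Data Catalog as shown in the full script below.<\/p>\n
from awsglue.transforms import ApplyMapping, Filter\n\n# Rename and retype columns on an existing DynamicFrame (assumed to be named\n# \"myinputdata\"), then keep only rows with a positive reading\nmapped = ApplyMapping.apply(\n    frame=myinputdata,\n    mappings=[(\"timestamp\", \"string\", \"timestamp\", \"string\"),\n              (\"value\", \"double\", \"reading\", \"double\")]\n)\nfiltered = Filter.apply(frame=mapped, f=lambda row: row[\"reading\"] > 0)\n<\/code><\/pre>\n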
In this example, we will use the script editor provided on the console. The following Python code can be used to transform the raw data:<\/p>\n
import sys\nfrom awsglue.transforms import *\nfrom awsglue.utils import getResolvedOptions\nfrom awsglue.context import GlueContext\nfrom awsglue.dynamicframe import DynamicFrame\nfrom awsglue.job import Job\nfrom pyspark.context import SparkContext\nfrom pyspark.sql.functions import avg\n\n## @params: [JOB_NAME]\nargs = getResolvedOptions(sys.argv, ['JOB_NAME'])\n\nsc = SparkContext()\nglueContext = GlueContext(sc)\nspark = glueContext.spark_session\njob = Job(glueContext)\njob.init(args['JOB_NAME'], args)\n\n## Read the source table from the Glue Data Catalog into a DynamicFrame\nmyinputdata = glueContext.create_dynamic_frame.from_catalog(database=\"mydatabase\", table_name=\"mytable\")\n\n## Convert the DynamicFrame to a Spark DataFrame for the transformation\nmydata = myinputdata.toDF()\n\n## Aggregate by timestamp and calculate the average value per timestamp\nmydata = mydata.select('timestamp', 'value').groupBy('timestamp').agg(avg('value').alias('average'))\n\n## Convert back to a DynamicFrame and write the result to S3 in Parquet format\nmyoutputdata = DynamicFrame.fromDF(mydata, glueContext, \"transformed_output\")\nglueContext.write_dynamic_frame.from_options(myoutputdata, connection_type=\"s3\", connection_options={\"path\": \"s3:\/\/mybucket\/output\"}, format=\"parquet\")\n\njob.commit()\n<\/code><\/pre>\nThis transformation logic reads the raw data, groups it by timestamp, and calculates the average value for each timestamp. It then writes the transformed data to the output folder in Parquet format.<\/p>\n
Running a job and verifying the output<\/h3>\n
Once we have finished defining the transformation logic, we can run the job. To run the job, navigate to the Jobs tab in the AWS Glue console, select the job you want to run, and then click the “Run” button. Once the job is complete, we can verify the output by navigating to the S3 console, selecting the output folder, and viewing the transformed data files.<\/p>\n
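The same run-and-verify cycle can be scripted. The sketch below starts the job with boto3, polls its status, and lists the Parquet files written to the output prefix; the job, bucket, and prefix names are the placeholders used earlier.<\/p>\n
import time\nimport boto3\n\nglue = boto3.client(\"glue\", region_name=\"us-east-1\")\ns3 = boto3.client(\"s3\", region_name=\"us-east-1\")\n\n# Start the job and poll until it reaches a terminal state (names are assumed placeholders)\nrun_id = glue.start_job_run(JobName=\"transform-input-data\")[\"JobRunId\"]\nwhile True:\n    state = glue.get_job_run(JobName=\"transform-input-data\", RunId=run_id)[\"JobRun\"][\"JobRunState\"]\n    if state in (\"SUCCEEDED\", \"FAILED\", \"STOPPED\", \"TIMEOUT\"):\n        break\n    time.sleep(30)\nprint(\"Job finished with state:\", state)\n\n# List the transformed Parquet files in the output folder\nresp = s3.list_objects_v2(Bucket=\"mybucket\", Prefix=\"output\/\")\nfor obj in resp.get(\"Contents\", []):\n    print(obj[\"Key\"], obj[\"Size\"])\n<\/code><\/pre>\n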
Conclusion<\/h2>\n
In this tutorial, we learned how to create a data pipeline using AWS Glue to transform data stored in an Amazon S3 bucket. AWS Glue simplifies the process of building data processing pipelines by providing a serverless, fully managed, scalable platform for processing and transforming data. By following these steps, you can transform raw data and load it in a structured form for further analysis.<\/p>\n","protected":false},"excerpt":{"rendered":"
Data analytics is the new trend among businesses looking to gain insights and competitive advantage. Nevertheless, to extract insights from data, it must first be cleaned, transformed, and analyzed in a usable format. Raw data, often scattered across multiple systems, requires a system in place to collect, process and store Continue Reading<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_import_markdown_pro_load_document_selector":0,"_import_markdown_pro_submit_text_textarea":"","footnotes":""},"categories":[1],"tags":[317,202,387,30,468,95,1637],"yoast_head":"\nCreating a Data Pipeline to process data using AWS Glue - Pantherax Blogs<\/title>\n\n\n\n\n\n\n\n\n\n\n\n\n\n\t\n\t\n\t\n