{"id":4136,"date":"2023-11-04T23:14:05","date_gmt":"2023-11-04T23:14:05","guid":{"rendered":"http:\/\/localhost:10003\/big-data-processing-with-spark\/"},"modified":"2023-11-05T05:47:58","modified_gmt":"2023-11-05T05:47:58","slug":"big-data-processing-with-spark","status":"publish","type":"post","link":"http:\/\/localhost:10003\/big-data-processing-with-spark\/","title":{"rendered":"Big data processing with Spark"},"content":{"rendered":"
Apache Spark is an open-source distributed computing system designed for big data processing. It was initially developed at the University of California, Berkeley, and has become one of the most popular big data frameworks in the industry. With its powerful processing engine and intuitive API, Spark makes it easy to process large volumes of data quickly and efficiently. In this tutorial, we will be covering the basics of big data processing with Spark.<\/p>\n
To follow along with this tutorial, you will need the following prerequisites:<\/p>\n
- Java 8 or later installed on your machine<\/li>\n
- Python 3 installed, since the examples in this tutorial use PySpark<\/li>\n
- A recent Apache Spark release downloaded from the official Apache Spark website and extracted to a local directory (this tutorial assumes \/opt\/spark<\/code>)<\/li>\n<\/ul>\n
Once you have downloaded and installed Spark, you will need to set up your environment variables. Here are the steps to follow:<\/p>\n
- First, open a terminal and navigate to the directory where Spark is installed (this tutorial assumes \/opt\/spark<\/code>):<\/li>\n<\/ol>\ncd \/opt\/spark\n<\/code><\/pre>\n\n- Next, you\u2019ll need to configure your
SPARK_HOME<\/code> environment variable to point to the directory where Spark is installed:<\/li>\n<\/ol>\nexport SPARK_HOME=\/opt\/spark\n<\/code><\/pre>\n\n- Add the Spark binaries to your
PATH<\/code> environment variable to make them available in your terminal:<\/li>\n<\/ol>\nexport PATH=$PATH:$SPARK_HOME\/bin\n<\/code><\/pre>\n\n- To make these settings persistent across sessions, you can also add the two export<\/code> lines above to your shell profile (for example, ~\/.bashrc<\/code>).<\/li>\n<\/ol>\n\n- Finally, you can confirm that Spark is installed correctly by typing
spark-shell<\/code> in your terminal. This will open the Spark shell, where you can test your Spark code.<\/li>\n<\/ol>\nSpark Context<\/h2>\n
The Spark context is the entry point for all Spark functionality. It represents the connection to a Spark cluster and can be used to create RDDs, accumulators, and broadcast variables on that cluster. Here’s how you can create a Spark context in Python:<\/p>\n
from pyspark import SparkContext\n\nsc = SparkContext(\"local\", \"myApp\")\n<\/code><\/pre>\nThe above code creates a Spark context with the name “myApp” running locally on your machine. You can replace \"local\"<\/code> with the master URL of your Spark cluster to connect to a remote cluster instead.<\/p>\nRDDs<\/h2>\n
Resilient Distributed Datasets (RDDs) are the primary data abstraction in Spark. They are fault-tolerant collections of elements that can be processed in parallel across a cluster. RDDs can be created by parallelizing an existing collection or by loading data from external storage systems such as the Hadoop Distributed File System (HDFS), Apache Cassandra, Amazon S3, or the local file system. RDDs are immutable, which means that they cannot be changed once they are created; transformations always produce a new RDD, as the short sketch below illustrates.<\/p>\n
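To make the immutability point concrete, here is a minimal sketch (it uses sc.parallelize<\/code>, which is introduced in the next section): applying a transformation such as map<\/code> returns a new RDD and leaves the original one untouched.<\/p>\n
from pyspark import SparkContext\n\nsc = SparkContext(\"local\", \"myApp\")\n\n# Create an RDD from a small in-memory collection\nnumbers = sc.parallelize([1, 2, 3])\n\n# map() does not modify the existing RDD; it returns a new one\ndoubled = numbers.map(lambda x: x * 2)\n\nprint(numbers.collect())  # [1, 2, 3] -- the original RDD is unchanged\nprint(doubled.collect())  # [2, 4, 6] -- the transformation produced a new RDD\n<\/code><\/pre>\n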
Creating RDDs<\/h3>\n
You can create RDDs by parallelizing an existing collection in your program or by loading data from an external storage system. Here’s how you can create an RDD from a list of integers:<\/p>\n
from pyspark import SparkContext\n\nsc = SparkContext(\"local\", \"myApp\")\nrdd = sc.parallelize([1, 2, 3, 4, 5])\n<\/code><\/pre>\nThe above code creates an RDD called rdd<\/code> with the elements [1, 2, 3, 4, 5]<\/code>. You can also create RDDs by loading data from a file system using the textFile<\/code> method. For example:<\/p>\nfrom pyspark import SparkContext\n\nsc = SparkContext(\"local\", \"myApp\")\nrdd = sc.textFile(\"\/path\/to\/data\")\n<\/code><\/pre>\nThe above code creates an RDD called rdd<\/code> by loading the data from a file located at \/path\/to\/data<\/code>. By default, textFile<\/code> creates one partition for each block of the input file (for files stored on HDFS, the block size defaults to 128 MB in Hadoop 2.x and later; older Hadoop versions used 64 MB). If you need more partitions, you can pass a minimum partition count as the second argument, for example sc.textFile(\"\/path\/to\/data\", minPartitions=8)<\/code>.<\/p>\nTransformations<\/h3>\n
RDDs support two types of operations: transformations and actions. Transformations create a new RDD from an existing one, whereas actions produce a result or side effect. Transformations are lazy, which means that they do not execute immediately; instead, they build a lineage of transformations to execute when an action is called. This allows Spark to optimize the execution plan and schedule tasks to run in parallel across a cluster.<\/p>\n
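As a minimal sketch of this lazy behavior, the map<\/code> transformation below is only recorded in the RDD's lineage when it is defined; the actual computation runs when the count<\/code> action is called.<\/p>\n
from pyspark import SparkContext\n\nsc = SparkContext(\"local\", \"myApp\")\nrdd = sc.parallelize(range(1, 1001))\n\n# Defining the transformation is cheap: nothing is computed yet,\n# Spark only records it in the RDD's lineage\nsquares = rdd.map(lambda x: x * x)\n\n# The action triggers execution of the recorded lineage across the partitions\ntotal = squares.count()\nprint(total)  # 1000\n<\/code><\/pre>\n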
There are many built-in transformations in Spark, such as map<\/code>, flatMap<\/code>, filter<\/code>, union<\/code>, and distinct<\/code>, among others. Here are some examples:<\/p>\nfrom pyspark import SparkContext\n\nsc = SparkContext(\"local\", \"myApp\")\nrdd = sc.parallelize([\"hello world\", \"goodbye world\"])\n\n# Map each item in the RDD to its length\nlengths = rdd.map(lambda s: len(s))\n\n# Flatten each item in the RDD to a list of words\nwords = rdd.flatMap(lambda s: s.split())\n\n# Keep only the items in the RDD that do not contain the word \"goodbye\"\nfiltered = rdd.filter(lambda s: \"goodbye\" not in s)\n\n# Union two RDDs together\nrdd2 = sc.parallelize([\"hello again\", \"see you later\"])\nunion = rdd.union(rdd2)\n\n# Remove duplicate elements from the RDD\nunique = rdd.distinct()\n<\/code><\/pre>\nActions<\/h3>\n
Actions are operations that return a result or side effect. Unlike transformations, actions are eager, which means that they execute immediately and trigger the execution of any queued transformations. Some built-in actions in Spark include collect<\/code>, count<\/code>, first<\/code>, reduce<\/code>, and saveAsTextFile<\/code>, among others. Here are some examples:<\/p>\nfrom pyspark import SparkContext\n\nsc = SparkContext(\"local\", \"myApp\")\nrdd = sc.parallelize([\"hello world\", \"goodbye world\"])\n\n# Collect the RDD elements to a list in memory\ncollected = rdd.collect()\n\n# Count the number of elements in the RDD\ncount = rdd.count()\n\n# Get the first element in the RDD\nfirst = rdd.first()\n\n# Reduce the RDD to find the longest string\nlongest = rdd.reduce(lambda a, b: a if len(a) > len(b) else b)\n\n# Save the RDD elements to a text file\nrdd.saveAsTextFile(\"\/path\/to\/output\")\n<\/code><\/pre>\nDataFrames and Datasets<\/h2>\n
DataFrames and Datasets are higher-level abstractions in Spark for working with structured data. They are built on the same concepts as RDDs but benefit from the Catalyst query optimizer and a more efficient execution engine. DataFrames are similar to tables in a relational database and support SQL-like operations such as select<\/code>, filter<\/code>, groupBy<\/code>, and join<\/code>. Datasets additionally provide compile-time type safety, but they are only available in Scala and Java; in Python you work with DataFrames.<\/p>\nCreating DataFrames<\/h3>\n
You can create a DataFrame by loading data from a structured data source, such as a CSV file, or by converting an existing RDD to a DataFrame. Here’s how you can create a DataFrame from a CSV file:<\/p>\n
from pyspark.sql import SparkSession\n\nspark = SparkSession.builder.appName(\"myApp\").getOrCreate()\ndf = spark.read.csv(\"\/path\/to\/data.csv\", header=True, inferSchema=True)\n<\/code><\/pre>\nThe above code creates a DataFrame called df<\/code> by loading the data from a CSV file located at \/path\/to\/data.csv<\/code>. The header<\/code> argument specifies whether the first row of the file contains column names, while the inferSchema<\/code> argument enables Spark to automatically infer the data types of each column.<\/p>\nYou can also create a DataFrame from an existing RDD by specifying the schema of the DataFrame. Here’s how you can create a DataFrame from an RDD of tuples:<\/p>\n
from pyspark.sql import SparkSession\nfrom pyspark.sql.types import StructField, StructType, StringType, IntegerType\n\nspark = SparkSession.builder.appName(\"myApp\").getOrCreate()\n\nschema = StructType([\n StructField(\"name\", StringType(), True),\n StructField(\"age\", IntegerType(), True)\n])\n\nrdd = spark.sparkContext.parallelize([(\"Alice\", 20), (\"Bob\", 25), (\"Charlie\", 30)])\ndf = spark.createDataFrame(rdd, schema)\n<\/code><\/pre>\nThe above code creates a DataFrame called df<\/code> with the schema name:string, age:int<\/code> by converting an RDD of tuples to a DataFrame.<\/p>\nTransformations<\/h3>\n
DataFrames support a wide range of transformations, including select<\/code>, filter<\/code>, groupby<\/code>, join<\/code>, and many others. Transformations on DataFrames are lazy, just like transformations on RDDs, and build a query plan that is optimized for execution by the Spark engine. For example:<\/p>\nfrom pyspark.sql import SparkSession\n\nspark = SparkSession.builder.appName(\"myApp\").getOrCreate()\ndf = spark.read.csv(\"\/path\/to\/data.csv\", header=True, inferSchema=True)\n\n# Select two columns from the DataFrame\nselected = df.select(\"name\", \"age\")\n\n# Filter the DataFrame by a condition\nfiltered = df.filter(df.age > 25)\n\n# Group the DataFrame by a column and aggregate the results\ngrouped = df.groupBy(\"gender\").agg({\"age\": \"avg\"})\n\n# Join two DataFrames together\ndf2 = spark.read.csv(\"\/path\/to\/data2.csv\", header=True, inferSchema=True)\njoined = df.join(df2, \"id\")\n<\/code><\/pre>\nActions<\/h3>\n
DataFrames support a wide range of actions, such as count<\/code>, collect<\/code>, head<\/code>, take<\/code>, and many others. Actions on DataFrames trigger the execution of any queued transformations and return a result or side effect. For example:<\/p>\nfrom pyspark.sql import SparkSession\n\nspark = SparkSession.builder.appName(\"myApp\").getOrCreate()\ndf = spark.read.csv(\"\/path\/to\/data.csv\", header=True, inferSchema=True)\n\n# Count the number of rows in the DataFrame\ncount = df.count()\n\n# Collect the DataFrame rows to a list in memory\ncollected = df.collect()\n\n# Get the first row of the DataFrame\nfirst = df.first()\n<\/code><\/pre>\nSpark SQL<\/h2>\n
Spark SQL is a module in Spark that provides support for SQL-like queries on structured data. It is built on top of the Spark DataFrame API and supports a wide range of SQL-like operations, including select<\/code>, filter<\/code>, groupby<\/code>, join<\/code>, and union<\/code>, among others. Spark SQL can also read data from a wide range of sources, such as traditional Hive tables, Parquet files, and JSON data.<\/p>\nCreating a Spark Session<\/h3>\n
To use Spark SQL, you need to create a SparkSession. Here’s how you can create a SparkSession in Python:<\/p>\n
from pyspark.sql import SparkSession\n\nspark = SparkSession.builder.appName(\"myApp\").getOrCreate()\n<\/code><\/pre>\nCreating a DataFrame<\/h3>\n
You can create a DataFrame from an external data source, such as a CSV file, by calling the read<\/code> method on the SparkSession and specifying the source data:<\/p>\nfrom pyspark.sql import SparkSession\n\nspark = SparkSession.builder.appName(\"myApp\").getOrCreate()\ndf = spark.read.csv(\"\/path\/to\/data.csv\", header=True, inferSchema=True)\n<\/code><\/pre>\nThe above code creates a DataFrame called df<\/code> by loading data from a CSV file located at \/path\/to\/data.csv<\/code>.<\/p>\nExecuting a SQL Query<\/h3>\n
Once you have a DataFrame, you can execute a SQL query on it by registering the DataFrame as a temporary table and then issuing a SQL query against that table:<\/p>\n
from pyspark.sql import SparkSession\n\nspark = SparkSession.builder.appName(\"myApp\").getOrCreate()\ndf = spark.read.csv(\"\/path\/to\/data.csv\", header=True, inferSchema=True)\n\n# Register the DataFrame as a temporary table\ndf.createOrReplaceTempView(\"myTable\")\n\n# Execute a SQL query on the temporary table\nresult = spark.sql(\"SELECT name, age FROM myTable WHERE gender = 'M'\")\n<\/code><\/pre>\nThe above code registers the DataFrame as a temporary table called myTable<\/code>, and then executes a SQL query against that table, selecting only the name<\/code> and age<\/code> columns where the gender<\/code> column is equal to 'M'<\/code>.<\/p>\nConclusion<\/h2>\n
In this tutorial, we covered the basics of big data processing with Apache Spark. We learned about the Spark context, RDDs, DataFrames, Datasets, and Spark SQL. We also covered how to create RDDs and DataFrames, how to perform transformations and actions on them, and how to execute SQL queries against them. As you continue to use Spark, you’ll find that it has many other powerful features and abstractions to help you process and analyze large volumes of data.<\/p>\n","protected":false},"excerpt":{"rendered":"
Introduction Apache Spark is an open-source distributed computing system designed for big data processing. It was initially developed at the University of California, Berkeley, and has become one of the most popular big data frameworks in the industry. With its powerful processing engine and intuitive API, Spark makes it easy Continue Reading<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_import_markdown_pro_load_document_selector":0,"_import_markdown_pro_submit_text_textarea":"","footnotes":""},"categories":[1],"tags":[1436,996,1317,193,1435,96,1434,92]}