Showing posts with the label Apache Spark

RDD Collect Issue

I configured a new system, Spark 2.3.0, Python 3.6.0, DataFrame read and other operations working a… Read more RDD Collect Issue

Error PythonUDFRunner: Python Worker Exited Unexpectedly (Crashed)

I am running a PySpark job that calls UDFs. I know UDFs are bad with memory and slow due to seriali… Read more Error PythonUDFRunner: Python Worker Exited Unexpectedly (Crashed)

SparkException: Python Worker Failed To Connect Back When Executing Spark Action

When I try to execute this command line in PySpark: arquivo = sc.textFile('dataset_analise_senti… Read more SparkException: Python Worker Failed To Connect Back When Executing Spark Action

Sum In Spark Gone Bad

Based on Unbalanced factor of KMeans?, I am trying to compute the Unbalanced Factor, but I fail. Ev… Read more Sum In Spark Gone Bad

Cartesian Product Of Two RDDs In Spark

I am completely new to Apache Spark and I am trying to take the Cartesian product of two RDDs. As an example I have… Read more Cartesian Product Of Two RDDs In Spark
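In Spark, `rdd1.cartesian(rdd2)` yields every (x, y) pair across the two RDDs. A minimal local sketch of the same semantics using `itertools.product` (the lists here are made-up sample data, not from the post):

```python
from itertools import product

# Two small "RDDs" represented as plain Python lists (hypothetical sample data).
rdd1 = [1, 2]
rdd2 = ["a", "b", "c"]

# rdd1.cartesian(rdd2) in Spark produces one (x, y) tuple per combination;
# itertools.product gives the same pairs locally.
pairs = list(product(rdd1, rdd2))
print(pairs)

# In PySpark this would be:
#   sc.parallelize(rdd1).cartesian(sc.parallelize(rdd2)).collect()
```

Note that the Cartesian product of RDDs with m and n partitions produces m × n partitions, which gets expensive quickly on large inputs.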

PySpark OutOfMemoryErrors When Performing Many DataFrame Joins

There are many posts about this issue, but none have answered my question. I'm running into O… Read more PySpark OutOfMemoryErrors When Performing Many DataFrame Joins

How To Split A Text File Into Multiple Columns With Spark

I'm having difficulty splitting a text data file with delimiter '|' into a data frame … Read more How To Split A Text File Into Multiple Columns With Spark
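One common approach is to split each line on the '|' delimiter; in Spark the same idea is `rdd.map(lambda line: line.split("|"))` followed by `toDF(...)`, or `spark.read.option("sep", "|").csv(path)` for the DataFrame reader. A local sketch with made-up sample lines:

```python
# Hypothetical sample of a '|'-delimited text file (not from the post).
lines = ["alice|30|NY", "bob|25|LA"]

# Split each line into its columns. In PySpark the equivalent is
#   rdd.map(lambda line: line.split("|")).toDF(["name", "age", "city"])
# or, for the DataFrame API,
#   spark.read.option("sep", "|").csv(path)
rows = [line.split("|") for line in lines]
print(rows)
```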

Downloading Files From Google Storage Using Spark (Python) And Dataproc

I have an application that parallelizes the execution of Python objects that process data to be dow… Read more Downloading Files From Google Storage Using Spark (Python) And Dataproc

MongoDB Spark Connector py4j.protocol.Py4JJavaError: An Error Occurred While Calling o50.load

I have been able to load this MongoDB database before, but am now receiving an error I haven't … Read more MongoDB Spark Connector py4j.protocol.Py4JJavaError: An Error Occurred While Calling o50.load

Removing Characters From Python Output

I did a lot of work to remove the characters from the Spark Python output like u u' u' [()/&… Read more Removing Characters From Python Output

Spark - Merge / Union DataFrames With Different Schemas (Column Names And Order) Into A DataFrame With A Common Master Schema

I tried taking a schema as the common schema via df.schema() and loading all the CSV files into it, but fai… Read more Spark - Merge / Union DataFrames With Different Schemas (Column Names And Order) Into A DataFrame With A Common Master Schema

Reading And Writing From Hive Tables With Spark After Aggregation

We have a Hive warehouse, and wanted to use Spark for various tasks (mainly classification). At tim… Read more Reading And Writing From Hive Tables With Spark After Aggregation

Elephas Not Loaded In PySpark: No Module Named elephas.spark_model

I am trying to distribute Keras training on a cluster and use Elephas for that. But when running t… Read more Elephas Not Loaded In PySpark: No Module Named elephas.spark_model

Spark Stream - 'utf8' Codec Can't Decode Bytes

I'm fairly new to stream programming. We have a Kafka stream which uses Avro. I want to connect a… Read more Spark Stream - 'utf8' Codec Can't Decode Bytes
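This error typically appears because Avro-encoded Kafka payloads are binary, not UTF-8 text, so decoding the raw bytes as a string fails. A small sketch reproducing the symptom with an arbitrary invalid-UTF-8 byte sequence (the payload bytes are made up for illustration):

```python
# An arbitrary byte sequence that is not valid UTF-8, standing in for a
# binary Avro-encoded Kafka message (hypothetical, not real Avro data).
payload = b"\x00\x9c\x01Obj"

try:
    payload.decode("utf8")
    decoded_ok = True
except UnicodeDecodeError:
    # This is the "'utf8' codec can't decode bytes" failure from the title.
    decoded_ok = False

print(decoded_ok)  # False

# The usual fix is not to decode at all: deserialize the payload with the
# Avro schema instead (e.g. via fastavro or a Confluent Avro deserializer),
# rather than treating the message as a UTF-8 string.
```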

PySpark: Remove UTF Null Character From PySpark DataFrame

I have a PySpark DataFrame similar to the following: df = sql_context.createDataFrame([ Row(a=3, … Read more PySpark: Remove UTF Null Character From PySpark DataFrame
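One way to strip NUL characters is a regex replacement; in PySpark the analogous call is `pyspark.sql.functions.regexp_replace(col, "\x00", "")` applied to each affected string column. A local sketch of the same substitution (the sample value is made up):

```python
import re

# Hypothetical cell value containing a UTF NUL character (\x00).
value = "foo\x00bar"

# Strip the NUL with a regex. In PySpark the per-column equivalent is:
#   from pyspark.sql import functions as F
#   df = df.withColumn("a", F.regexp_replace("a", "\x00", ""))
cleaned = re.sub("\x00", "", value)
print(cleaned)  # -> foobar
```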

How To Read An Avro File In PySpark

I am writing a Spark job using Python. However, I need to read in a whole bunch of Avro files. Thi… Read more How To Read An Avro File In PySpark

Is It Possible To Scale Data By Group In Spark?

I want to scale data with StandardScaler (from pyspark.mllib.feature import StandardScaler), by now… Read more Is It Possible To Scale Data By Group In Spark?
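`StandardScaler` from `pyspark.mllib.feature` fits one global mean and standard deviation; scaling by group means z-scoring each value against its own group's statistics. A local sketch of that per-group logic with made-up (group, value) pairs:

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical (group, value) pairs, not from the post.
data = [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 14.0)]

# Collect values by group, then z-score each value against its own group's
# mean and population standard deviation -- the per-group analogue of what
# StandardScaler(withMean=True, withStd=True) does globally.
groups = defaultdict(list)
for key, value in data:
    groups[key].append(value)

stats = {k: (mean(v), pstdev(v)) for k, v in groups.items()}
scaled = [(k, (v - stats[k][0]) / stats[k][1]) for k, v in data]
print(scaled)
```

In Spark the same idea can be expressed with a groupBy/join of per-group aggregates, or with window functions partitioned by the group column.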

Spark: How To Transform Data From Multiple Nested XML Files With Attributes Into A Data Frame

How to transform the values below from multiple XML files into a Spark data frame: attribute Id0 from Lev… Read more Spark: How To Transform Data From Multiple Nested XML Files With Attributes Into A Data Frame

PySpark - Create New Column From Operations On DataFrame Columns Gives Error "Column Is Not Iterable"

I have a PySpark DataFrame and I have tried many examples showing how to create a new column based … Read more PySpark - Create New Column From Operations On DataFrame Columns Gives Error "Column Is Not Iterable"

How To Get The Postgres Function 'nth_value' Equivalent In PySpark Hive SQL?

I was solving this example: https://www.windowfunctions.com/questions/grouping/5 Here, they use Or… Read more How To Get The Postgres Function 'nth_value' Equivalent In PySpark Hive SQL?
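`nth_value(col, n)` over a window returns the n-th row's value within each ordered partition. Spark 3.1+ exposes `pyspark.sql.functions.nth_value` directly; a local sketch of the semantics with made-up (department, salary) rows:

```python
from collections import defaultdict

# Hypothetical rows: (department, salary), not from the post.
rows = [("sales", 100), ("sales", 90), ("sales", 80), ("eng", 120), ("eng", 110)]

# Emulate nth_value locally: partition by department, order salaries
# descending, take the 2nd value in each partition.
parts = defaultdict(list)
for dept, salary in rows:
    parts[dept].append(salary)

second_highest = {d: sorted(s, reverse=True)[1] for d, s in parts.items()}
print(second_highest)

# Since Spark 3.1 the direct equivalent is
#   F.nth_value("salary", 2).over(
#       Window.partitionBy("dept").orderBy(F.desc("salary")))
# usually with an explicit rowsBetween frame spanning the whole partition.
```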