I am trying to compute the cartesian product of 1.5 million records (7 columns) with itself. I am currently launching the job with this configuration:

```
./pyspark --master yarn-client --num-executors 50 --name "LND_TEST" \
  --conf "spark.executor.cores=2" \
  --conf "spark.executor.memory=14g" \
  --conf "spark.driver.memory=14g" \
  --conf "spark.shuffle.compress=true" \
  --conf "spark.io.compression.codec=org.apache.spark.io.LZ4CompressionCodec"
```
However, I am not able to scale the application. I would like to split the data into 15 RDDs and perform the computation incrementally.
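To illustrate the incremental approach I have in mind, here is a minimal pure-Python sketch (no Spark, names are mine): split the records into chunks, then take the cartesian product of each chunk against the full set, so only one chunk's worth of pairs is materialized per step. The Spark analogue would be calling `chunk_rdd.cartesian(full_rdd)` for each of the 15 smaller RDDs instead of one giant `rdd.cartesian(rdd)`.

```python
from itertools import islice, product

def chunked(iterable, size):
    """Yield successive lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

records = list(range(10))  # stand-in for the 1.5M rows

pairs = []
for chunk in chunked(records, 3):
    # Cartesian product of one chunk against the full record set;
    # in Spark this step would be chunk_rdd.cartesian(full_rdd).
    pairs.extend(product(chunk, records))

# The union of the per-chunk products equals the full cross product.
assert len(pairs) == len(records) ** 2
```

Each incremental step produces `chunk_size * N` pairs rather than `N * N`, which is what should let the per-step shuffle fit in executor memory.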