Hi,
I’m using Couchbase 4.1 and Spark 1.6.2 with spark-connector 1.2.1.
The following code ends with an OutOfMemoryError for large files (more than 50 GB):
sc.textFile(file_name) // 50 GB
  .map(line => toJsonObject(line))
  .zipWithIndex() // or zipWithUniqueId()
  .map(pair => JsonDocument.create(pair._2.toString, pair._1))
  .saveToCouchbase() // OutOfMemoryError here
This problem occurs because of the saveToCouchbase implementation.
Simplified:
rdd.foreachPartition(iter => {
  val bucket = CouchbaseConnection().bucket(cbConfig).async()
  Observable
    .from(OnceIterable(iter))
    .flatMap(doc => { bucket.upsert(doc) /* ...error handling... */ })
    .toBlocking
    .last
})
Writing to Couchbase is slower than reading from the file, so documents pile up and the whole partition eventually ends up loaded into RAM. (Sorry for my English if I made a mistake.)
A possible solution is rdd.repartition(???).foreachPartition(…), but what should I put in place of the ‘???’ when the amount of data is unknown?
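When the input is a file, its size can at least be read from the filesystem and a partition count derived from it. A rough sketch (untested; the Hadoop-compatible filesystem and the ~128 MB-per-partition target are just assumptions):

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val inputBytes = fs.getContentSummary(new Path(file_name)).getLength
val targetBytesPerPartition = 128L * 1024 * 1024 // assumption: ~128 MB fits in executor memory
val numPartitions = math.max(1, (inputBytes / targetBytesPerPartition).toInt)
rdd.repartition(numPartitions).foreachPartition(iter => { /* same body as above */ })

But this does not help when the data does not come from a file.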
The following solution works better:
val asyncPartitionSize = 1000 // user defined
rdd.foreachPartition(iter => {
  val bucket = CouchbaseConnection().bucket(cbConfig).async()
  iter
    .grouped(asyncPartitionSize) // split the partition into chunks of at most 1000 docs
    .map(chunk => {
      Observable
        .from(OnceIterable(chunk.iterator))
        .flatMap(doc => { bucket.upsert(doc) /* ...error handling... */ })
        .toBlocking
        .toIterable
        .iterator
    })
    .flatten // lazily flatten Iterator[Iterator[T]] to Iterator[T]
    .last    // force the whole iterator to be consumed
})
With this approach I was able to save a 300 GB file on a cluster with a total of 20 GB of RAM.
However, the same problem shows up when reading data from Couchbase using ‘CouchbaseView’, and I have not been able to solve it there.
I do not have experience with Apache Spark, so maybe I missed something?