Issue
I am receiving a WARN message as follows whenever the RDD is large. The task still completes and the output is shown.
Sample Code
>>> ord_items = sc.textFile('practice/retail_db/Order_items',1)
>>> ord_items.take(60)
Output Warning
WARN BlockReaderFactory: I/O error constructing remote block reader
java.nio.channels.ClosedByInterruptException
at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:658)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:191)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:586)
at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3033)
at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:829)
at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:754)
at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:381)
at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:755)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:685)
at org.apache.hadoop.hdfs.DFSInputStream.seekToBlockSource(DFSInputStream.java:1647)
at org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:851)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:893)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:957)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.fillBuffer(UncompressedSplitLineReader.java:62)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:227)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:185)
at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:94)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:261)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:50)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:314)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:245)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:307)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:732)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:438)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2066)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:272)
The WARN message goes away if I increase the number of partitions to 20 or more; currently it is passed as 1, as shown in the sample code.
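The partition count matters because the minPartitions argument of sc.textFile feeds into Hadoop's old-API FileInputFormat, which sizes each input split as max(minSize, min(goalSize, blockSize)) with goalSize = totalSize / numSplits. A minimal pure-Python sketch of that formula, assuming the common 128 MB default block size (the real values come from the cluster configuration):

```python
def hadoop_split_size(total_size, min_partitions,
                      block_size=128 * 1024 * 1024, min_size=1):
    """Sketch of FileInputFormat.computeSplitSize:
    max(minSize, min(goalSize, blockSize)),
    where goalSize = totalSize / numSplits."""
    goal_size = total_size // max(min_partitions, 1)
    return max(min_size, min(goal_size, block_size))

# With minPartitions=1 a single split covers the whole file;
# with minPartitions=20 each split shrinks to roughly 1/20th,
# so each task reads far less data per remote block reader.
print(hadoop_split_size(1_000_000, 1))   # 1000000
print(hadoop_split_size(1_000_000, 20))  # 50000
```

This is only a model of the split arithmetic, not of the warning itself, but it shows why raising minPartitions changes how much each task reads from HDFS in one go.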
I also find it strange that the STAGE ====> progress indicator never shows up, even for large inputs.
I am running Spark version 3.3.0 locally in client mode via the pyspark command, on Ubuntu 20.04.5 LTS with Python version 3.8.10.
UPDATE: I have found that this happens if I have 4470 lines in my text file; it does not throw the warning at 4469 lines. Link to the file I used: test.txt
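To reproduce the line-count threshold described above, a small helper like the following can generate a text file with an exact number of lines (this helper and its line format are my own illustration, not the poster's actual test.txt):

```python
def write_test_file(path, n_lines):
    # Write exactly n_lines newline-terminated lines,
    # e.g. 4470 (warning appears) vs 4469 (no warning).
    with open(path, 'w') as f:
        for i in range(1, n_lines + 1):
            f.write(f"line {i}\n")

write_test_file('test.txt', 4470)
```

The generated file can then be uploaded to HDFS and read with sc.textFile to check whether the warning appears at the same threshold.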
Solution
The warning disappeared after switching to the spark-3.1.3-bin-hadoop3.2.tgz distribution.
Answered By - cpt.John
Answer Checked By - Pedro (WPSolving Volunteer)