PySpark: Learn & Test Your Knowledge (PySpark Interview Preparation) - Level: Advanced
Data Structures and Operations - Level 3 (Advanced)
1. Which of the following is NOT a valid way to create a DataFrame in PySpark?
From an existing RDD.
Reading from a CSV file.
Reading from a relational database.
Using a Python list of dictionaries.
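For reference, a minimal sketch of each creation path (assuming an active SparkSession named spark; the file path, JDBC URL, and table name are placeholders):

    # From an existing RDD of tuples
    rdd = spark.sparkContext.parallelize([(1, "Alice"), (2, "Bob")])
    df1 = spark.createDataFrame(rdd, ["id", "name"])
    # Reading from a CSV file
    df2 = spark.read.csv("people.csv", header=True, inferSchema=True)
    # Reading from a relational database over JDBC (driver properties omitted)
    df3 = spark.read.jdbc(url="jdbc:postgresql://host/db", table="people")
    # From a Python list of dictionaries
    df4 = spark.createDataFrame([{"id": 1, "name": "Alice"}])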
2. Which operation is used to repartition a DataFrame in PySpark?
repartition()
partition()
redistribute()
rearrange()
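A minimal sketch, assuming df is an existing DataFrame and country is a placeholder column:

    df_eight = df.repartition(8)           # full shuffle into 8 partitions
    df_by_key = df.repartition("country")  # or partition by a column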
3. Which of the following methods can be used to cache a DataFrame in memory for faster access?
persist()
store()
cache()
save()
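A short sketch of both caching calls (df is a placeholder DataFrame):

    from pyspark import StorageLevel

    df.cache()  # persist with the default storage level (MEMORY_AND_DISK for DataFrames)
    # persist() takes an explicit storage level instead, e.g.:
    # df.persist(StorageLevel.DISK_ONLY)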
4. Which function can be used to convert a PySpark DataFrame into an RDD?
toRDD()
rdd()
asRDD()
convertToRDD()
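Worth noting: in current PySpark the underlying RDD is exposed as the rdd property rather than a method (df is a placeholder DataFrame):

    rows = df.rdd                # RDD of Row objects
    tuples = df.rdd.map(tuple)   # e.g. turn each Row into a plain tuple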
5. What does the coalesce method in PySpark do?
It increases the number of partitions in a DataFrame.
It sorts the DataFrame.
It joins two DataFrames.
It reduces the number of partitions in a DataFrame without a full shuffle.
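A minimal sketch (df is a placeholder DataFrame):

    df_small = df.coalesce(2)  # merge down to 2 partitions, avoiding a full shuffle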
6. Which of the following operations can be performed using the DataFrame API in PySpark?
Filtering rows based on a condition.
Adding a new column.
Grouping and aggregating data.
All of the above.
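All three operations in one hedged sketch; df and the column names are placeholders:

    from pyspark.sql import functions as F

    result = (df.filter(F.col("age") > 21)                   # filter rows on a condition
                .withColumn("is_senior", F.col("age") > 60)  # add a new column
                .groupBy("country")                          # group and aggregate
                .agg(F.avg("age").alias("avg_age")))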
7. Which method is used to rename columns in a PySpark DataFrame?
renameColumn()
withColumnRenamed()
alterColumn()
modifyColumn()
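A one-line sketch (df and the column names are placeholders):

    df_renamed = df.withColumnRenamed("old_name", "new_name")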
8. How can you drop duplicate rows from a PySpark DataFrame based on selected columns?
dropDuplicates()
removeDuplicates()
deduplicate()
excludeDuplicates()
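A sketch assuming df has placeholder columns user_id and event_date:

    deduped = df.dropDuplicates(["user_id", "event_date"])  # one row per key combination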
9. Which of the following is a valid way to perform an inner join between two PySpark DataFrames?
Using the merge() method.
Using the concat() method.
Using the combine() method.
Using the join() method with the "inner" option.
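A sketch assuming two placeholder DataFrames, df and other_df, that share an id column:

    joined = df.join(other_df, on="id", how="inner")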
10. Which of the following methods allows you to perform a left outer join between two DataFrames in PySpark?
join("columnName", "inner")join("columnName", "left_outer")
join("columnName", "right_outer")
join("columnName", "full_outer")
11. In PySpark, which method is used to pivot a DataFrame using one or more columns?
stack()
melt()
pivot()
spread()
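pivot() is called on the grouped object returned by groupBy(); a sketch with placeholder columns:

    pivoted = df.groupBy("year").pivot("country").sum("sales")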
12. Which function allows you to calculate the cumulative sum of a column in PySpark?
pyspark.sql.functions.rank()
pyspark.sql.functions.window()
pyspark.sql.functions.cumsum()
pyspark.sql.functions.accumulate()
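Note that pyspark.sql.functions does not actually ship a cumsum(); the usual idiom for a running total is sum() over an ordered window. A sketch, with user_id, ts, and amount as placeholder columns:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    w = (Window.partitionBy("user_id")
               .orderBy("ts")
               .rowsBetween(Window.unboundedPreceding, Window.currentRow))
    df_cum = df.withColumn("running_total", F.sum("amount").over(w))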
13. How can you convert a Spark DataFrame to a Pandas DataFrame in PySpark?
Using the toPandas() method.
Using the collect() method.
Using the asPandas() method.
Using the convertToPandas() method.
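A one-liner, assuming pandas is installed on the driver:

    pdf = df.toPandas()  # collects the whole DataFrame to the driver; small results only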
14. Which of the following is NOT a method to handle missing data in PySpark?
dropna()
fillna()
replace()
fillnones()
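The three real methods in one hedged sketch (df, age, and city are placeholders):

    cleaned = (df.dropna(subset=["age"])                  # drop rows with null age
                 .fillna({"city": "unknown"})             # fill remaining nulls
                 .replace("N/A", None, subset=["city"]))  # rewrite sentinel values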
15. Which PySpark method is used to compute the frequency of unique values in a column?
uniqueItems()
countDistinct()
freqItems()
valueCounts()
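freqItems() finds frequent values approximately; exact per-value counts usually come from groupBy().count(). A sketch with country as a placeholder column:

    freq = df.freqItems(["country"], support=0.10)  # approximate frequent items
    counts = df.groupBy("country").count()          # exact value frequencies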
16. Which function is used in PySpark to perform a rolling window operation on a column?
roll()
window()
slide()
rollWindow()
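pyspark.sql.functions.window() buckets rows by a timestamp; a row-based rolling computation goes through a Window spec instead. A sketch with day and value as placeholder columns:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    w = Window.orderBy("day").rowsBetween(-6, 0)  # current row plus the 6 preceding
    df_roll = df.withColumn("rolling_avg", F.avg("value").over(w))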
17. How can you broadcast a variable to all nodes in a PySpark cluster?
Using the distribute() method.
Using the scatter() method.
Using the broadcast() method.
Using the send() method.
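A minimal sketch, assuming an active SparkSession named spark and a placeholder lookup table:

    lookup = {"US": "United States", "DE": "Germany"}
    b = spark.sparkContext.broadcast(lookup)  # shipped once to each executor
    names = df.rdd.map(lambda row: b.value.get(row["country"]))  # read via b.value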
18. Which method in PySpark is used to apply a function to all elements in a DataFrame or Series?
apply()
map()
transform()
invoke()
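On the RDD API the element-wise operation is map(); on DataFrames the closest equivalents are column expressions or DataFrame.transform() (Spark 3.0+). A sketch with df and value as placeholders:

    from pyspark.sql import functions as F

    doubled = df.rdd.map(lambda row: row["value"] * 2)       # element-wise over rows
    df_expr = df.withColumn("doubled", F.col("value") * 2)   # column expression
    df_t = df.transform(lambda d: d.limit(100))              # function over the whole DataFrame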
19. How can you increase the number of partitions in a PySpark DataFrame?
Using the partitionBy() method.
Using the repartition() method.
Using the shuffle() method.
Using the setPartitions() method.
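A sketch (df is a placeholder DataFrame):

    print(df.rdd.getNumPartitions())  # inspect the current partition count
    df_more = df.repartition(200)     # full shuffle up to 200 partitions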
20. Which method in PySpark is used to create a new column based on a conditional expression?
Using the addColumn() method.
Using the appendColumn() method.
Using the insertColumn() method.
Using the withColumn() method.
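A sketch using when()/otherwise() inside withColumn(); df and age are placeholders:

    from pyspark.sql import functions as F

    df_grouped = df.withColumn(
        "age_group",
        F.when(F.col("age") >= 18, "adult").otherwise("minor"),
    )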
Bonus. How can you sort the rows of a PySpark DataFrame based on multiple columns?
Using the sort() method.
Using the orderBy() method.
Using the groupBy() method.
Using the shuffle() method.
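A sketch with placeholder columns; sort() is an alias for orderBy():

    from pyspark.sql import functions as F

    ordered = df.orderBy(F.col("country").asc(), F.col("age").desc())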