[pyspark] Finding duplicate values within a single column

jmHan 2023. 6. 17. 12:46

Suppose we have the following Spark DataFrame.

df = spark.createDataFrame([(1,), (1,), (4,), (4,), (4,), (5,), (6,), (8,), (3,)], ('col1',))
df.show()

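To find the values that appear more than once in a column, group by that column, count the rows in each group, and keep only the groups whose count is greater than 1: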

df.groupBy('col1').count().where('count > 1').show()

If you don't need to see the count values, append drop('count').

df.groupBy('col1').count().where('count > 1').drop('count').show()
