[hdfs] Apache Flume takes more time than the copyFromLocal command

필살기쓰세요 2021. 2. 19. 22:28

If you're looking simply at read and write operations, Flume is going to be at least 2x slower with your configuration because you're using a file channel. Every file read from disk is encapsulated into a Flume event (in memory) and then serialized back down to disk via the file channel. The sink then reads the event back from the file channel (disk) before pushing it up to HDFS.

You also haven't set a blob deserializer on your spoolDir source, so it reads your source files one line at a time, wraps each line in a Flume event, and writes that event to the file channel. Paired with the HDFS sink's default rollXXX values (rollCount = 10 events, rollInterval = 30 seconds, rollSize = 1024 bytes), this means a new file is rolled in HDFS whenever any of those thresholds is reached, rather than the one file per input file that you'd get with copyFromLocal.

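All of these factors add up to poorer performance. To get more comparable performance, you should use the BlobDeserializer on the spoolDir source, coupled with a memory channel (but understand that a memory channel doesn't guarantee delivery of an event if the JRE terminates prematurely).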
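
As a rough illustration of that suggestion, an agent configuration might look something like the sketch below. The agent and component names (agent1, src1, ch1, sink1), the local spool directory, and the HDFS path are all hypothetical placeholders, and the roll settings are just one way to get roughly one HDFS file per input file. It assumes the Flume module that ships BlobDeserializer is on the agent's classpath.

```properties
# Hypothetical names and paths throughout; adjust to your own setup.
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Spooling directory source: read each file as one event instead of one event per line.
agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /var/spool/flume-in
agent1.sources.src1.channels = ch1
agent1.sources.src1.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
# Upper bound (bytes) on what is read and buffered into a single event.
agent1.sources.src1.deserializer.maxBlobLength = 100000000

# Memory channel: avoids the extra disk write/read of the file channel,
# but events still in the channel are lost if the JVM exits early.
# Whole-file events are held in the heap, so size capacity and heap together.
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 100
agent1.channels.ch1.transactionCapacity = 100

# HDFS sink: roll after every event, and disable time- and size-based rolling.
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.channel = ch1
agent1.sinks.sink1.hdfs.path = hdfs://namenode/flume/in
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.rollCount = 1
agent1.sinks.sink1.hdfs.rollInterval = 0
agent1.sinks.sink1.hdfs.rollSize = 0
```

With one event per input file, rollCount = 1 closes an HDFS file after each event, while a value of 0 for rollInterval and rollSize turns those two triggers off. Because every event now carries an entire file in memory, this approach only suits files that comfortably fit in the agent's heap.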

-------------------

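Apache Flume is not meant for moving or copying folders from a local file system to HDFS. Flume is meant for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store. (Reference: Flume User Guide)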

If you want to move large files or directories, you should use hdfs dfs -copyFromLocal, as you already mentioned.

Source
https://stackoverflow.com/questions/39940149
