Dong-A Ilbo

Expanding the donga.com crawl data

Assigned To

Date

2026/02/05

Status

Done

Type

Feature

Server

•

#63

github.com

Issue Point

•

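Expand the range of article data collected from the donga.com main page

•

In addition to the existing top and sub headlines, also collect the lower article partitions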

Detail

donga-data-pipeline/

•

external/donga/crawler.py

•

lambda/extractors/collect_realtime_main_articles/

•

lambda/extractors/get_realtime_main_articles/

•

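crawler.py

→ Uses the selectolax parser. The Lambdas listed above import methods from crawler.py and run them. The Lambdas all seem to be split out as handlers, with the main logic implemented separately in the dashboard module, which looks like a good structure.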

◦

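Main page article collection (fetch_main_page_articles)

▪

Scrapes articles from the predefined sections (MAIN_PAGE_POSITIONS) of the donga.com main page. The sections are held in a variable and collected by label, e.g. 'headline', 'issue news', 'sub news'

▪

Deduplication logic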

◦

Per-article detail parsing (extract_thumbnail, extract_page_title)

▪

Things like thumbnail extraction

▪

Extracts the <title> tag. Seems to be for GA4

◦

Bulk parallel processing (fetch_page_titles, process_thumbnails)

▪

Requests multiple article URLs and titles concurrently with asyncio.gather
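The concurrent title fetching and the thin-handler layout described above might look roughly like the sketch below. It is not the project's code: fetch_html is a hypothetical stub that returns canned HTML instead of making a network request, the standard-library html.parser stands in for selectolax so the example runs without dependencies, and the signatures of extract_page_title, fetch_page_titles, and handler are assumptions.

```python
import asyncio
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_page_title(html):
    """Return the stripped <title> text, or None when the page has none."""
    parser = _TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip() or None


async def fetch_html(url):
    """Hypothetical stand-in for the real HTTP request; returns canned HTML."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"<html><head><title>Title for {url}</title></head></html>"


async def fetch_one(url):
    return url, extract_page_title(await fetch_html(url))


async def fetch_page_titles(urls):
    """Fetch every URL concurrently and map url -> title (None on failure)."""
    results = await asyncio.gather(
        *(fetch_one(u) for u in urls), return_exceptions=True
    )
    titles = {}
    for url, result in zip(urls, results):
        # With return_exceptions=True a failed request comes back as an
        # exception object instead of aborting the whole batch.
        titles[url] = None if isinstance(result, BaseException) else result[1]
    return titles


def handler(event, context):
    """Thin Lambda-style entry point that only delegates to the crawler logic."""
    return asyncio.run(fetch_page_titles(event["urls"]))
```

Calling handler({"urls": [...]}, None) returns a dict keyed by URL. Keeping the handler this small is what lets the same crawler function be reused from more than one Lambda.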

•

Previous crawl targets

# Each entry appears to be (CSS selector, section group, position label).
MAIN_PAGE_POSITIONS = [
    ('.topnews_left .top_headline_sec', 'topnews', 'topnews_left > top_headline_sec'),
    ('.topnews_left .sub_headline_sec', 'topnews', 'topnews_left > sub_headline_sec'),
    ('.topnews_right .sub_headline_sec', 'topnews', 'topnews_right > sub_headline_sec'),
    ('.sub_headline_right', 'sub_headline', 'sub_headline_right'),
    ('.sub_headline_left', 'sub_headline', 'sub_headline_left'),
]

•

Updated crawl targets

# Each entry appears to be (CSS selector, section group, position label).
MAIN_PAGE_POSITIONS = [
    ('.topnews_left .top_headline_sec', 'topnews', 'topnews_left > top_headline_sec'),
    ('.topnews_left .sub_headline_sec', 'topnews', 'topnews_left > sub_headline_sec'),
    ('.topnews_right .sub_headline_sec', 'topnews', 'topnews_right > sub_headline_sec'),
    ('.sub_headline_right', 'sub_headline', 'sub_headline_right'),
    ('.sub_headline_left', 'sub_headline', 'sub_headline_left'),
    # Newly added: lower article partitions under .main_news_inner
    ('.main_news_inner .issue_news_sec', 'main_news', 'main_news_inner > issue_news_sec'),
    ('.main_news_inner .sub_news_type02', 'main_news', 'main_news_inner > sub_news_type02'),
]
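With the two main_news entries added, the same article could plausibly appear in more than one section, which is where the deduplication step matters. The sketch below is one way that step might work, not the crawler's actual logic: it assumes articles are keyed by URL and that the position listed earliest in MAIN_PAGE_POSITIONS wins. The dedupe_articles function, the record shape, and the example.com URLs are all made up for illustration.

```python
# Same tuples as the updated list above.
MAIN_PAGE_POSITIONS = [
    ('.topnews_left .top_headline_sec', 'topnews', 'topnews_left > top_headline_sec'),
    ('.topnews_left .sub_headline_sec', 'topnews', 'topnews_left > sub_headline_sec'),
    ('.topnews_right .sub_headline_sec', 'topnews', 'topnews_right > sub_headline_sec'),
    ('.sub_headline_right', 'sub_headline', 'sub_headline_right'),
    ('.sub_headline_left', 'sub_headline', 'sub_headline_left'),
    ('.main_news_inner .issue_news_sec', 'main_news', 'main_news_inner > issue_news_sec'),
    ('.main_news_inner .sub_news_type02', 'main_news', 'main_news_inner > sub_news_type02'),
]

# Rank of each position label, taken from its index in the list.
POSITION_ORDER = {
    label: index for index, (_, _, label) in enumerate(MAIN_PAGE_POSITIONS)
}


def dedupe_articles(articles):
    """Keep one record per URL, preferring the earliest-listed position."""
    # sorted() is stable, so records sharing a position keep their input order.
    ranked = sorted(articles, key=lambda a: POSITION_ORDER[a["position"]])
    seen = set()
    unique = []
    for article in ranked:
        if article["url"] in seen:
            continue
        seen.add(article["url"])
        unique.append(article)
    return unique


# Hypothetical scrape output; article 1 shows up in two sections.
scraped = [
    {"url": "https://example.com/article/1", "position": "main_news_inner > issue_news_sec"},
    {"url": "https://example.com/article/2", "position": "sub_headline_right"},
    {"url": "https://example.com/article/1", "position": "topnews_left > top_headline_sec"},
]

unique_articles = dedupe_articles(scraped)
```

Here unique_articles holds two records, and article 1 keeps its topnews position because that entry comes first in MAIN_PAGE_POSITIONS.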

Task

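Add the new crawl targets to MAIN_PAGE_POSITIONS.

Add a UI display area to the MonitoringDonga.jsx file.

Lambda file that runs at 20-minute intervals → edited at 17:10, cron at 20-minute intervals → ran at 17:20, next run expected at 17:30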