Crawling a GitHub Blog and GitHub Commit History

Date: 2023.04.13 Updated: 2023.04.13

Category: Python

Tags: BeautifulSoup, Crawling, python, selenium

This post was written as part of an assignment for Professor 신은경's course 'Data Science and Sociology' at Korea University, first semester of 2023.

1. import

First, import the libraries needed for crawling.

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
from datetime import datetime, timedelta

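2. Crawling the GitHub blog

Set the address of the blog to crawl. Here I use the #page-title view of the categories page, where every post is listed on a single page, because one request is then enough to reach all posts.

To locate the element with the class "archive", I use BeautifulSoup's find method. The "archive" element contains every post I have published.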

page = requests.get("https://heestogram.github.io/categories/#page-title")
soup = bs(page.text, "html.parser")
soup = soup.find(class_="archive")
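One detail worth keeping in mind is that find returns None when nothing matches, so a later find_all call on the result would fail with an AttributeError. The following is a minimal sketch of a guard for that case. The helper name extract_archive and the miniature SAMPLE_PAGE are hypothetical and exist only for illustration; they are not part of the original notebook.

```python
from bs4 import BeautifulSoup


def extract_archive(html):
    """Return the element with class "archive", or raise if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    archive = soup.find(class_="archive")
    if archive is None:
        # find() returns None on no match, so fail early with a clear message
        raise ValueError('no element with class "archive" found')
    return archive


# Hypothetical miniature page, used only to show the call
SAMPLE_PAGE = '<main><div class="archive"><h2>Post A</h2></div></main>'
archive = extract_archive(SAMPLE_PAGE)
print(archive.h2.get_text())
```

With the live page, the same helper could be called as extract_archive(page.text), ideally after page.raise_for_status() so that an HTTP error is not parsed as if it were the blog.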

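Inspecting the page's HTML shows that the text for each post lives inside an element with the class archive__item. Going one level deeper, the archive__item-title no_toc class holds the title, archive__item-excerpt holds the posting date and category, and the element marked itemprop="description" holds the summary of the post.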

dict_ = {}
title_list = []
date_list = []
category_list = []
summary_list = []
num_of_posts = len(soup.find_all(class_='archive__item-title no_toc'))

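# Loop once per post
for i in range(num_of_posts):
    # Title: text of the link inside the title element, minus its last character
    step1 = soup.find_all(class_='archive__item-title no_toc')[i]
    step2 = step1.find('a').text[:-1]
    title_list.append(step2)
    
    # The excerpt class is read at index 2*i: every other match holds "date category"
    step1 = soup.find_all(class_='archive__item-excerpt')[2*i].text
    
    # Convert the date from MM/DD/YYYY to YYYY-MM-DD; the second token is the category
    date_list.append(datetime.strptime(step1.split()[0], '%m/%d/%Y').strftime("%Y-%m-%d"))
    category_list.append(step1.split()[1])
    
    # Summary: text of the description element, minus its last character
    step1 = soup.find_all(itemprop='description')[i].text[:-1]
    summary_list.append(step1)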
    
dict_['title'] = title_list
dict_['date'] = date_list
dict_['category'] = category_list
dict_['summary'] = summary_list

df_ = pd.DataFrame(dict_)

df_.to_csv('github_blog_crawling.csv')
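As an alternative to indexing three parallel find_all results, the same fields can be read one post at a time. This is only a sketch under assumptions: SAMPLE_HTML is a hypothetical fragment modeled on the class names described above, and it assumes each post is wrapped in its own archive__item element whose first archive__item-excerpt holds the date and category. The real theme markup may differ.

```python
from datetime import datetime

from bs4 import BeautifulSoup

# Hypothetical markup modeled on the class names mentioned in this post
SAMPLE_HTML = """
<div class="archive">
  <div class="archive__item">
    <h2 class="archive__item-title no_toc"><a href="/post-1/">First post</a></h2>
    <p class="archive__item-excerpt">04/13/2023 Python</p>
    <p class="archive__item-excerpt" itemprop="description">Summary one</p>
  </div>
  <div class="archive__item">
    <h2 class="archive__item-title no_toc"><a href="/post-2/">Second post</a></h2>
    <p class="archive__item-excerpt">03/31/2023 NLP</p>
    <p class="archive__item-excerpt" itemprop="description">Summary two</p>
  </div>
</div>
"""


def parse_posts(html):
    """Collect title, date, category and summary for each archive item."""
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for item in soup.find_all(class_="archive__item"):
        title = item.find(class_="archive__item-title").get_text(strip=True)
        # The first excerpt inside the item is assumed to be "MM/DD/YYYY category"
        raw_date, category = item.find(class_="archive__item-excerpt").get_text().split()[:2]
        summary = item.find(itemprop="description").get_text(strip=True)
        posts.append({
            "title": title,
            "date": datetime.strptime(raw_date, "%m/%d/%Y").strftime("%Y-%m-%d"),
            "category": category,
            "summary": summary,
        })
    return posts


posts = parse_posts(SAMPLE_HTML)
for post in posts:
    print(post)
```

A list of dictionaries like this can be passed directly to pd.DataFrame. Scoping each lookup to a single item also removes the need for the 2*i index arithmetic.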

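3. Crawling the GitHub commit history

On GitHub, every act of uploading or modifying a file is recorded as a single unit called a commit. The more commits there are, the more actively the account has presumably been maintained.

Crawling the commit history is even simpler.

The GitHub profile page shows commit activity per day as a grid of green cells. Each small square represents one day, and the darker its shade of green, the more commits were made on that date.

Each square carries a text label stating how many commits were made on that date.

So by collecting all of these squares with find_all and extracting only their text, we can tell how many commits were made on which day.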

page = requests.get("https://github.com/heestogram")
soup = bs(page.text, "html.parser")
soup = soup.find_all(class_="ContributionCalendar-day")

text_list = []
date_list = []
dict_ = {}
length = len(soup)
for i in range(length):
    text_list.append(soup[i].text)
    try:
        date_list.append(soup[i]['data-date'])
    except KeyError:
        date_list.append('no date')

dict_['date']=date_list
dict_['contribution']=text_list
df_commit = pd.DataFrame(dict_)

	date	contribution
0	2022-03-27	No contributions on Sunday, March 27, 2022
1	2022-03-28	No contributions on Monday, March 28, 2022
2	2022-03-29	No contributions on Tuesday, March 29, 2022
3	2022-03-30	No contributions on Wednesday, March 30, 2022
4	2022-03-31	No contributions on Thursday, March 31, 2022
...	...	...
371	no date
372	no date
373	no date
374	no date
375	no date

376 rows × 2 columns
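The 'no date' rows come from cells that carry the ContributionCalendar-day class but have no data-date attribute. A possible variation is to skip such cells while parsing, using Tag.get, which returns None for a missing attribute instead of raising KeyError. The following is a minimal sketch. SAMPLE_CELLS is a hypothetical fragment for illustration, not GitHub's actual markup, and parse_days is an illustrative name.

```python
from bs4 import BeautifulSoup

# Hypothetical cells; the last one has no data-date attribute
SAMPLE_CELLS = """
<table>
  <tr>
    <td class="ContributionCalendar-day" data-date="2023-03-30">1 contribution on Thursday, March 30, 2023</td>
    <td class="ContributionCalendar-day" data-date="2023-03-31">11 contributions on Friday, March 31, 2023</td>
    <td class="ContributionCalendar-day"></td>
  </tr>
</table>
"""


def parse_days(html):
    """Return one row per calendar cell that has a data-date attribute."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for cell in soup.find_all(class_="ContributionCalendar-day"):
        date = cell.get("data-date")  # None when the attribute is missing
        if date is None:
            continue  # skip cells that do not represent a day
        rows.append({"date": date, "contribution": cell.get_text(strip=True)})
    return rows


rows = parse_days(SAMPLE_CELLS)
print(rows)
```

The resulting list can be handed to pd.DataFrame as is, which yields a frame that contains only dated rows.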

df_commit = df_commit[df_commit.date != 'no date']

	date	contribution
0	2022-03-27	No contributions on Sunday, March 27, 2022
1	2022-03-28	No contributions on Monday, March 28, 2022
2	2022-03-29	No contributions on Tuesday, March 29, 2022
3	2022-03-30	No contributions on Wednesday, March 30, 2022
4	2022-03-31	No contributions on Thursday, March 31, 2022
...	...	...
366	2023-03-28	No contributions on Tuesday, March 28, 2023
367	2023-03-29	No contributions on Wednesday, March 29, 2023
368	2023-03-30	1 contribution on Thursday, March 30, 2023
369	2023-03-31	11 contributions on Friday, March 31, 2023
370	2023-04-01	No contributions on Saturday, April 1, 2023

371 rows × 2 columns

# Take an explicit copy of the filtered frame first; assigning to a column of a
# filtered slice otherwise triggers pandas' SettingWithCopyWarning
df_commit = df_commit.copy()
df_commit['date'] = pd.to_datetime(df_commit['date'])

df_commit.to_csv('github_commit_crawling.csv')
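The contribution column still holds free text such as "No contributions on ..." or "11 contributions on ...". If numeric counts are wanted, a small regular expression is one way to extract them. This is a sketch that assumes the labels always follow the wording seen in the output above; contribution_count is a hypothetical helper name.

```python
import re


def contribution_count(text):
    """Extract the leading count from a contribution label; "No ..." means 0."""
    match = re.match(r"\s*(\d+|No)\s+contributions?\b", text)
    if match is None:
        return None  # label does not follow the expected wording
    token = match.group(1)
    return 0 if token == "No" else int(token)


labels = [
    "No contributions on Tuesday, March 28, 2023",
    "1 contribution on Thursday, March 30, 2023",
    "11 contributions on Friday, March 31, 2023",
]
counts = [contribution_count(label) for label in labels]
print(counts)  # [0, 1, 11]
```

Applied to the DataFrame, this could look like df_commit['contribution'].map(contribution_count), with the result stored in a new column.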


Hee Jun Kim🍀
