๐Ÿ“study archive/web-crawling

์›น ์Šคํฌ๋ ˆ์ดํผ ์ž…๋ฌธ

Hush 2022. 6. 9. 17:19

Beautiful Soup์˜ ์‹คํ–‰

Beautiful Soup์—์„œ ๊ฐ€์žฅ ๋„๋ฆฌ ์“ฐ์ด๋Š” ๊ฐ์ฒด๋Š” BeautifulSoup ๊ฐ์ฒด์ด๋‹ค.

๋‹ค์Œ ์˜ˆ์ œ๋ฅผ ๋ณด์ž.

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://pythonscraping.com/pages/page1.html")
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs.h1)

The output is as follows.

<h1>An Interesting Title</h1>

 

Let's look at each piece in turn.

 

์›น ํŽ˜์ด์ง€ ์ฃผ์†Œ๋กœ BeautifulSoup ๊ฐ์ฒด ์ƒ์„ฑํ•˜๊ธฐ

 

The urlopen function fetches the page and returns a response object; reading it gives the page's source code. So the variable html holds the response for the current web page.

 

We then create the object with BeautifulSoup, which takes two arguments.

 

The first argument is the HTML text on which the object is based.

Printing html.read() shows the page's source code.

 

The second argument is the parser BeautifulSoup uses to build the object, and we can specify it ourselves.

์ƒ๊ธฐ ์˜ˆ์ œ์—์„œ ์‚ฌ์šฉํ•œ html.parser๋Š” ํŒŒ์ด์ฌ3์™€ ํ•จ๊ป˜ ์„ค์น˜๋˜๋ฏ€๋กœ ๋”ฐ๋กœ ์„ค์น˜ํ•  ํ•„์š” ์—†๋‹ค.

Unless you specifically need a different parser, you can keep using it.

lxml ๋˜๋Š” html5lib ๋„ ๋ถ„์„๊ธฐ๋กœ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด๊ฒƒ๋“ค์€ ์ƒํ™ฉ์— ๋”ฐ๋ผ parser๋ณด๋‹ค ๋” ์ข‹์€ ๊ธฐ๋Šฅ์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์ง€๋งŒ ๋ณ„๋„๋กœ ์„ค์น˜ํ•ด์•ผ ํ•œ๋‹ค.

 

๋ถ„์„๊ธฐ๋ฅผ ๊ฑฐ์ณ ์ƒ์„ฑ๋œ BeautifulSoup ๊ฐ์ฒด์˜ ๊ตฌ์กฐ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

html : <html><head>...</head><body>...</body></html>

- head : <head><title>A Useful Page</title></head>

-- title : <title>A Useful Page</title>

- body : ......

And so on.

 

์ƒ๊ธฐ ์˜ˆ์ œ์—์„œ ์ถ”์ถœํ•œ <h1> ํƒœ๊ทธ๋Š” ๊ฐ์ฒด์—์„œ ๋‘ ๋‹จ๊ณ„๋งŒํผ ์ค‘์ฒฉ๋˜์–ด์žˆ๋‹ค. ํ•˜์ง€๋งŒ ์šฐ๋ฆฌ๊ฐ€ ๊ฐ์ฒด์—์„œ ๊ฐ€์ ธ์˜ฌ ๋•Œ๋Š” h1 ํƒœ๊ทธ๋ฅผ ์ง์ ‘ ๊ฐ€์ ธ์™”๋‹ค. ์‚ฌ์‹ค ๋‹ค์Œ ์ค‘ ๋ฌด์—‡์„ ์‚ฌ์šฉํ•ด๋„ ๊ฒฐ๊ณผ๋Š” ๊ฐ™๋‹ค.

bs.html.body.h1

bs.body.h1

bs.html.h1
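The three access paths above are easy to check directly; a minimal sketch using an inline HTML string (no network needed):

```python
from bs4 import BeautifulSoup

html_text = "<html><head><title>A Useful Page</title></head><body><h1>An Interesting Title</h1></body></html>"
bs = BeautifulSoup(html_text, 'html.parser')

# All three paths resolve to the very same <h1> tag object in the tree
print(bs.html.body.h1)
print(bs.body.h1)
print(bs.html.h1)
print(bs.html.body.h1 is bs.body.h1 is bs.html.h1)  # True
```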

 

์›น์‚ฌ์ดํŠธ์—์„œ์˜ ํฌ๋กค๋ง ์ฐจ๋‹จ

ํฌ๋กค๋ง์„ ์ฐจ๋‹จํ•˜์—ฌ 406 ์—๋Ÿฌ๊ฐ€ ๋‚˜์˜ฌ ์ˆ˜๋„ ์žˆ๋‹ค.

In that case, you can bypass the block by specifying the user agent information yourself.

Here is the basic code, followed by the bypass code.

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://pythonscraping.com/pages/page1.html")
bs = BeautifulSoup(html.read(), 'html.parser')

import urllib.request
from bs4 import BeautifulSoup

# Specify the user agent information manually
headers_info = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}

req = urllib.request.Request('https://hushinn.tistory.com/manage/newpost/112', headers=headers_info)
bs = BeautifulSoup(urllib.request.urlopen(req).read(), 'html.parser')

 

Exception handling (important!!)

์›น์€ ์—‰๋ง์ง•์ฐฝ์ด๋‹ค! ๋ฐ์ดํ„ฐ ํ˜•์‹์€ ์ œ๋Œ€๋กœ ์ง€์ผœ์ง€์ง€ ์•Š๊ณ  ์›น์‚ฌ์ดํŠธ๋Š” ์ž์ฃผ ๋‹ค์šด๋˜๊ณ  ๋‹ซ๋Š” ํƒœ๊ทธ๋„ ์ข…์ข… ๋น ์ ธ์žˆ๋‹ค...

์›ป ์Šคํฌ๋ ˆ์ดํผ๊ฐ€ ์˜ˆ๊ธฐ์น˜ ๋ชปํ•œ ๋ฐ์ดํ„ฐ ํ˜•์‹์— ๋ถ€๋”›ํ˜€ ๋ฉˆ์ถฐ๋ฒ„๋ฆฌ๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งค์šฐ ๋งŽ๋‹ค!

๋”ฐ๋ผ์„œ ์šฐ๋ฆฌ๋Š” ์›น ์Šคํฌ๋ž˜์ดํผ๋ฅผ ๋งŒ๋“ค ๋•Œ ์šฐ๋ฆฌ๊ฐ€ ๋ชฉํ‘œ๋กœ ํ•˜๋Š” ์š”์†Œ์— ์–ด๋–ป๊ฒŒ ์ ‘๊ทผํ•  ๊ฒƒ์ธ๊ฐ€ ๋ฟ ์•„๋‹ˆ๋ผ ์˜ˆ์ƒ์น˜ ๋ชปํ•œ ์˜ค๋ฅ˜๋ฅผ ๋งŒ๋‚ฌ์„ ๋•Œ ์ด๋ฅผ ์–ด๋–ป๊ฒŒ ์ฒ˜๋ฆฌํ•  ์ง€์— ๋Œ€ํ•ด์„œ๋„ ๊นŠ์ด ๊ณ ๋ฏผํ•ด์•ผํ•œ๋‹ค.

 

With the earlier example in mind, let's look at the errors that can occur and how to handle them.

html = urlopen("https://pythonscraping.com/pages/page1.html")

์ด ํ–‰์—์„œ ๋ฌธ์ œ๊ฐ€ ์ƒ๊ธธ ์ˆ˜ ์žˆ๋Š” ๋ถ€๋ถ„์€ ํฌ๊ฒŒ ๋‘ ๊ฐ€์ง€์ด๋‹ค.

  • ํŽ˜์ด์ง€๋ฅผ ์ฐพ์„ ์ˆ˜ ์—†๊ฑฐ๋‚˜, URL ํ•ด์„์—์„œ ์—๋Ÿฌ๊ฐ€ ์ƒ๊ธด ๊ฒฝ์šฐ.
  • ์„œ๋ฒ„๋ฅผ ์ฐพ์„ ์ˆ˜ ์—†๋Š” ๊ฒฝ์šฐ.

Without separate exception handling, the web scraper simply stops when it runs into one of these problems.

 

์ฒซ ๋ฒˆ์งธ ์ƒํ™ฉ์—์„œ๋Š” HTTP ์—๋Ÿฌ๋ฅผ ๋ฐ˜ํ™˜ํ•  ๊ฒƒ์ด๋‹ค.

์ด ์—๋Ÿฌ๋Š” "404 Page Not Found", "500 Internal Server Error" ๋“ฑ์ด๋‹ค.

In all of these cases, the urlopen function raises HTTPError. Handle the exception like this.

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

try:
    html = urlopen("https://pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
    # return None, break out of a loop, etc.
else:
    # continue the program
    pass  # a block cannot be empty, so use pass (or real code) here

With the code written this way, when an HTTP error code comes back, the program prints the error and does not execute the else block.

 

๋‘ ๋ฒˆ์งธ ์ƒํ™ฉ์€ ์„œ๋ฒ„๋ฅผ ์ „ํ˜€ ์ฐพ์„ ์ˆ˜ ์—†์„๋•Œ ๋ฐœ์ƒํ•œ๋‹ค.

์›น์‚ฌ์ดํŠธ๊ฐ€ ๋‹ค์šด๋๊ฑฐ๋‚˜, URL์— ์˜คํƒ€๊ฐ€์žˆ์„๋•Œ urlopen์€ URLError ์˜ˆ์™ธ๋ฅผ ์ผ์œผํ‚จ๋‹ค.

์ด๊ฒƒ๋„ ์บ์น˜ํ•˜๋ ค๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ฝ”๋“œ๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค.

from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

from bs4 import BeautifulSoup

try:
    html = urlopen("https://pythonscraping.com/pages/page1.html")
except HTTPError as e:
    # HTTPError is a subclass of URLError, so it must be caught first
    print(e)
except URLError as e:
    print('The server could not be found!')
else:
    print('It Worked!')

 

Of course, even when a page is fetched from the server successfully, its content may differ from what you expect and cause errors.

๋”ฐ๋ผ์„œ BeautifulSoup ๊ฐ์ฒด์— ๋“ค์–ด์žˆ๋Š” ํƒœ๊ทธ์— ์ ‘๊ทผํ•  ๋•Œ๋งˆ๋‹ค ๊ทธ ํƒœ๊ทธ๊ฐ€ ์‹ค์ œ ์กด์žฌํ•˜๋Š”์ง€ ์ฒดํฌํ•˜๋Š” ํŽธ์ด ์ข‹๋‹ค.

If you try to access a tag that does not exist, BeautifulSoup returns a None object.

If you then access that None object as if it were a tag, an AttributeError is raised.

๋”ฐ๋ผ์„œ ์• ์ดˆ์— ํƒœ๊ทธ์— ์ ‘๊ทผํ•  ๋•Œ ํŠน์ • ํƒœ๊ทธ๊ฐ€ ์กด์žฌํ•˜๋Š” ์ง€๋ฅผ ์‚ฌ์ „์— ๊ฒ€์‚ฌํ•˜๋ฉฐ ์ง„ํ–‰ํ•˜๋Š” ๊ฒƒ์ด ์ข‹๋‹ค.

 

For example, suppose we expected the bs object from the earlier example to contain a tag called nonExistingTag, with a tag called anotherTag beneath it. What is a good way to access it?

๋‹ค์Œ ์ฝ”๋“œ์™€ ๊ฐ™์ด ์ž‘์„ฑํ•˜๋ฉด ๋œ๋‹ค.

try:
    target_content = bs.nonExistingTag.anotherTag
except AttributeError as e:
    print("Tag was not found")  # nonExistingTag does not exist
else:
    if target_content is None:
        print("Tag was not found")  # anotherTag does not exist
    else:
        print(target_content)  # everything worked!

 

Checking and handling every possible error like this can feel tedious, but a small rearrangement makes the code much easier to read.

๋‹ค์Œ ์ฝ”๋“œ๋ฅผ ๋ณด์ž.

 

from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup

def GetTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    except URLError as e:
        return None
    try:
        bs = BeautifulSoup(html.read(), 'html.parser')
        title = bs.body.h1
    except AttributeError as e:
        return None
    return title

title = GetTitle('https://pythonscraping.com/pages/page1.html')
if title is None:
    print('Title could not be found')
else:
    print(title)

 

์ด ์˜ˆ์ œ์—์„œ๋Š” GetTitle ํ•จ์ˆ˜์— ๋งํฌ๋ฅผ ์ „๋‹ฌํ•  ๊ฒฝ์šฐ ํƒ€์ดํ‹€์ด ์žˆ์œผ๋ฉด ํƒ€์ดํ‹€์„, ํƒ€์ดํ‹€์ด ์—†์œผ๋ฉด none๊ฐ’์„ ๋ฐ›์„ ์ˆ˜ ์žˆ๋‹ค.

Whenever we need a website's title, we pass the link to this function and then only have to check whether the result is None before working with the title.

Build your scrapers this way, encapsulating functionality so it can be reused.

If you skip error handling because it feels cumbersome and build a scraper that ignores error cases, your crawl will eventually break.