7.1. Getting Started with BeautifulSoup

To get started with BeautifulSoup, we will scrape a few quotes from the website https://quotes.toscrape.com/. For this exercise we will use two packages: Requests and BeautifulSoup. Open a new Jupyter notebook in JupyterLab and copy the code from this page into it step by step. First, we install and import the packages requests and BeautifulSoup:

#import sys
#!conda install --yes --prefix {sys.prefix} requests
#!conda install --yes --prefix {sys.prefix} beautifulsoup4
import requests
from bs4 import BeautifulSoup

7.1.1. Making an HTTP Request with requests

Requests documentation: https://requests.readthedocs.io/

Before sending the request, we should check whether the site has a robots.txt: https://quotes.toscrape.com/robots.txt. The site has no robots.txt, which means there are no site-specific crawling rules we have to follow when scraping it. Now we can send the request to the web server. All we need is the URL of the page and the function get() from the requests package. The get() function builds the request according to the HTTP protocol. HTTP request headers such as "User-Agent" or "Accept", which we already saw in the example in the chapter on HTTP, can be passed to the function as a dictionary via its headers parameter. To get started, we simply use the default parameters:

# Send an HTTP GET request
URL = "https://quotes.toscrape.com/"
page = requests.get(URL)
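
The manual robots.txt check mentioned above could also be done in code. The following is a minimal sketch using urllib.robotparser from the Python standard library, which is not part of the original exercise. The rules in example_rules and the user agent name my-scraper are made up for illustration and are not the rules of quotes.toscrape.com. To avoid network access, the rules are parsed from a list of lines rather than downloaded.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only
example_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(example_rules)

# can_fetch() reports whether the given user agent may request a URL
allowed_home = parser.can_fetch("my-scraper", "https://quotes.toscrape.com/")
allowed_private = parser.can_fetch("my-scraper", "https://quotes.toscrape.com/private/page")
print(allowed_home, allowed_private)
```

For a live site, calling set_url() followed by read() on the parser would download the file instead of parsing a hand-written list.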

Now we can inspect the HTTP response:

# Our object page, which holds the HTTP response, has the type 'requests.models.Response'
# Objects of this type have the attributes status_code, headers and text, which we can access
type(page)
requests.models.Response
# Get the status code attribute
page.status_code
200
# Get the headers of the HTTP response
page.headers
{'Date': 'Wed, 29 May 2024 10:30:40 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Content-Length': '11054', 'Connection': 'keep-alive', 'Strict-Transport-Security': 'max-age=0; includeSubDomains; preload'}
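
The headers above were sent by the server. The request headers mentioned earlier, such as "User-Agent" and "Accept", could be set through the headers parameter of requests.get(). The sketch below builds the request without sending it, so that the headers can be inspected offline. The header values are placeholder assumptions and not recommendations from the original text.

```python
import requests

URL = "https://quotes.toscrape.com/"

# Placeholder header values, chosen only for illustration
custom_headers = {
    "User-Agent": "my-course-scraper/0.1",
    "Accept": "text/html",
}

# Build the request without sending it, to inspect what would be sent
prepared = requests.Request("GET", URL, headers=custom_headers).prepare()
print(prepared.method, prepared.url)
print(prepared.headers["User-Agent"])

# Sending the same request would look like this:
# page = requests.get(URL, headers=custom_headers)
```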
# Get the body of the HTTP response
page.text
'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bootstrap.min.css">\n    <link rel="stylesheet" href="/static/main.css">\n</head>\n<body>\n    <div class="container">\n        <div class="row header-box">\n            <div class="col-md-8">\n                <h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n                </h1>\n            </div>\n            <div class="col-md-4">\n                <p>\n                \n                    <a href="/login">Login</a>\n                \n                </p>\n            </div>\n        </div>\n    \n\n<div class="row">\n    <div class="col-md-8">\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>\n        <span>by <small class="author" itemprop="author">Albert Einstein</small>\n        <a href="/author/Albert-Einstein">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world" /    > \n            \n            <a class="tag" href="/tag/change/page/1/">change</a>\n            \n            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>\n            \n            <a class="tag" href="/tag/thinking/page/1/">thinking</a>\n            \n            <a class="tag" href="/tag/world/page/1/">world</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>\n        <span>by <small class="author" itemprop="author">J.K. 
Rowling</small>\n        <a href="/author/J-K-Rowling">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="abilities,choices" /    > \n            \n            <a class="tag" href="/tag/abilities/page/1/">abilities</a>\n            \n            <a class="tag" href="/tag/choices/page/1/">choices</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>\n        <span>by <small class="author" itemprop="author">Albert Einstein</small>\n        <a href="/author/Albert-Einstein">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="inspirational,life,live,miracle,miracles" /    > \n            \n            <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>\n            \n            <a class="tag" href="/tag/life/page/1/">life</a>\n            \n            <a class="tag" href="/tag/live/page/1/">live</a>\n            \n            <a class="tag" href="/tag/miracle/page/1/">miracle</a>\n            \n            <a class="tag" href="/tag/miracles/page/1/">miracles</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>\n        <span>by <small class="author" itemprop="author">Jane Austen</small>\n        <a href="/author/Jane-Austen">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="aliteracy,books,classic,humor" /    > \n  
          \n            <a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>\n            \n            <a class="tag" href="/tag/books/page/1/">books</a>\n            \n            <a class="tag" href="/tag/classic/page/1/">classic</a>\n            \n            <a class="tag" href="/tag/humor/page/1/">humor</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it&#39;s better to be absolutely ridiculous than absolutely boring.”</span>\n        <span>by <small class="author" itemprop="author">Marilyn Monroe</small>\n        <a href="/author/Marilyn-Monroe">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="be-yourself,inspirational" /    > \n            \n            <a class="tag" href="/tag/be-yourself/page/1/">be-yourself</a>\n            \n            <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">“Try not to become a man of success. 
Rather become a man of value.”</span>\n        <span>by <small class="author" itemprop="author">Albert Einstein</small>\n        <a href="/author/Albert-Einstein">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="adulthood,success,value" /    > \n            \n            <a class="tag" href="/tag/adulthood/page/1/">adulthood</a>\n            \n            <a class="tag" href="/tag/success/page/1/">success</a>\n            \n            <a class="tag" href="/tag/value/page/1/">value</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.”</span>\n        <span>by <small class="author" itemprop="author">André Gide</small>\n        <a href="/author/Andre-Gide">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="life,love" /    > \n            \n            <a class="tag" href="/tag/life/page/1/">life</a>\n            \n            <a class="tag" href="/tag/love/page/1/">love</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">“I have not failed. I&#39;ve just found 10,000 ways that won&#39;t work.”</span>\n        <span>by <small class="author" itemprop="author">Thomas A. 
Edison</small>\n        <a href="/author/Thomas-A-Edison">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="edison,failure,inspirational,paraphrased" /    > \n            \n            <a class="tag" href="/tag/edison/page/1/">edison</a>\n            \n            <a class="tag" href="/tag/failure/page/1/">failure</a>\n            \n            <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>\n            \n            <a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">“A woman is like a tea bag; you never know how strong it is until it&#39;s in hot water.”</span>\n        <span>by <small class="author" itemprop="author">Eleanor Roosevelt</small>\n        <a href="/author/Eleanor-Roosevelt">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="misattributed-eleanor-roosevelt" /    > \n            \n            <a class="tag" href="/tag/misattributed-eleanor-roosevelt/page/1/">misattributed-eleanor-roosevelt</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">“A day without sunshine is like, you know, night.”</span>\n        <span>by <small class="author" itemprop="author">Steve Martin</small>\n        <a href="/author/Steve-Martin">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="humor,obvious,simile" /    > \n            \n            <a class="tag" href="/tag/humor/page/1/">humor</a>\n            \n            <a class="tag" href="/tag/obvious/page/1/">obvious</a>\n            \n            
<a class="tag" href="/tag/simile/page/1/">simile</a>\n            \n        </div>\n    </div>\n\n    <nav>\n        <ul class="pager">\n            \n            \n            <li class="next">\n                <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>\n            </li>\n            \n        </ul>\n    </nav>\n    </div>\n    <div class="col-md-4 tags-box">\n        \n            <h2>Top Ten tags</h2>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 28px" href="/tag/love/">love</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 26px" href="/tag/inspirational/">inspirational</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 26px" href="/tag/life/">life</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 24px" href="/tag/humor/">humor</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 22px" href="/tag/books/">books</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 14px" href="/tag/reading/">reading</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 10px" href="/tag/friendship/">friendship</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 8px" href="/tag/friends/">friends</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 8px" href="/tag/truth/">truth</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 6px" href="/tag/simile/">simile</a>\n            </span>\n       
     \n        \n    </div>\n</div>\n\n    </div>\n    <footer class="footer">\n        <div class="container">\n            <p class="text-muted">\n                Quotes by: <a href="https://www.goodreads.com/quotes">GoodReads.com</a>\n            </p>\n            <p class="copyright">\n                Made with <span class=\'zyte\'>❤</span> by <a class=\'zyte\' href="https://www.zyte.com">Zyte</a>\n            </p>\n        </div>\n    </footer>\n</body>\n</html>'
# The body of the response is a string
type(page.text)
str
# Body as a bytes object
page.content
b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bootstrap.min.css">\n    <link rel="stylesheet" href="/static/main.css">\n</head>\n<body>\n    <div class="container">\n        <div class="row header-box">\n            <div class="col-md-8">\n                <h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n                </h1>\n            </div>\n            <div class="col-md-4">\n                <p>\n                \n                    <a href="/login">Login</a>\n                \n                </p>\n            </div>\n        </div>\n    \n\n<div class="row">\n    <div class="col-md-8">\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\xe2\x80\x9d</span>\n        <span>by <small class="author" itemprop="author">Albert Einstein</small>\n        <a href="/author/Albert-Einstein">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world" /    > \n            \n            <a class="tag" href="/tag/change/page/1/">change</a>\n            \n            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>\n            \n            <a class="tag" href="/tag/thinking/page/1/">thinking</a>\n            \n            <a class="tag" href="/tag/world/page/1/">world</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cIt is our choices, Harry, that show what we truly are, far more than our abilities.\xe2\x80\x9d</span>\n        <span>by <small class="author" 
itemprop="author">J.K. Rowling</small>\n        <a href="/author/J-K-Rowling">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="abilities,choices" /    > \n            \n            <a class="tag" href="/tag/abilities/page/1/">abilities</a>\n            \n            <a class="tag" href="/tag/choices/page/1/">choices</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\xe2\x80\x9d</span>\n        <span>by <small class="author" itemprop="author">Albert Einstein</small>\n        <a href="/author/Albert-Einstein">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="inspirational,life,live,miracle,miracles" /    > \n            \n            <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>\n            \n            <a class="tag" href="/tag/life/page/1/">life</a>\n            \n            <a class="tag" href="/tag/live/page/1/">live</a>\n            \n            <a class="tag" href="/tag/miracle/page/1/">miracle</a>\n            \n            <a class="tag" href="/tag/miracles/page/1/">miracles</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\xe2\x80\x9d</span>\n        <span>by <small class="author" itemprop="author">Jane Austen</small>\n        <a href="/author/Jane-Austen">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" 
itemprop="keywords" content="aliteracy,books,classic,humor" /    > \n            \n            <a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>\n            \n            <a class="tag" href="/tag/books/page/1/">books</a>\n            \n            <a class="tag" href="/tag/classic/page/1/">classic</a>\n            \n            <a class="tag" href="/tag/humor/page/1/">humor</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cImperfection is beauty, madness is genius and it&#39;s better to be absolutely ridiculous than absolutely boring.\xe2\x80\x9d</span>\n        <span>by <small class="author" itemprop="author">Marilyn Monroe</small>\n        <a href="/author/Marilyn-Monroe">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="be-yourself,inspirational" /    > \n            \n            <a class="tag" href="/tag/be-yourself/page/1/">be-yourself</a>\n            \n            <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cTry not to become a man of success. 
Rather become a man of value.\xe2\x80\x9d</span>\n        <span>by <small class="author" itemprop="author">Albert Einstein</small>\n        <a href="/author/Albert-Einstein">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="adulthood,success,value" /    > \n            \n            <a class="tag" href="/tag/adulthood/page/1/">adulthood</a>\n            \n            <a class="tag" href="/tag/success/page/1/">success</a>\n            \n            <a class="tag" href="/tag/value/page/1/">value</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cIt is better to be hated for what you are than to be loved for what you are not.\xe2\x80\x9d</span>\n        <span>by <small class="author" itemprop="author">Andr\xc3\xa9 Gide</small>\n        <a href="/author/Andre-Gide">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="life,love" /    > \n            \n            <a class="tag" href="/tag/life/page/1/">life</a>\n            \n            <a class="tag" href="/tag/love/page/1/">love</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cI have not failed. I&#39;ve just found 10,000 ways that won&#39;t work.\xe2\x80\x9d</span>\n        <span>by <small class="author" itemprop="author">Thomas A. 
Edison</small>\n        <a href="/author/Thomas-A-Edison">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="edison,failure,inspirational,paraphrased" /    > \n            \n            <a class="tag" href="/tag/edison/page/1/">edison</a>\n            \n            <a class="tag" href="/tag/failure/page/1/">failure</a>\n            \n            <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>\n            \n            <a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cA woman is like a tea bag; you never know how strong it is until it&#39;s in hot water.\xe2\x80\x9d</span>\n        <span>by <small class="author" itemprop="author">Eleanor Roosevelt</small>\n        <a href="/author/Eleanor-Roosevelt">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="misattributed-eleanor-roosevelt" /    > \n            \n            <a class="tag" href="/tag/misattributed-eleanor-roosevelt/page/1/">misattributed-eleanor-roosevelt</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cA day without sunshine is like, you know, night.\xe2\x80\x9d</span>\n        <span>by <small class="author" itemprop="author">Steve Martin</small>\n        <a href="/author/Steve-Martin">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="humor,obvious,simile" /    > \n            \n            <a class="tag" href="/tag/humor/page/1/">humor</a>\n            \n            <a class="tag" 
href="/tag/obvious/page/1/">obvious</a>\n            \n            <a class="tag" href="/tag/simile/page/1/">simile</a>\n            \n        </div>\n    </div>\n\n    <nav>\n        <ul class="pager">\n            \n            \n            <li class="next">\n                <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>\n            </li>\n            \n        </ul>\n    </nav>\n    </div>\n    <div class="col-md-4 tags-box">\n        \n            <h2>Top Ten tags</h2>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 28px" href="/tag/love/">love</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 26px" href="/tag/inspirational/">inspirational</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 26px" href="/tag/life/">life</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 24px" href="/tag/humor/">humor</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 22px" href="/tag/books/">books</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 14px" href="/tag/reading/">reading</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 10px" href="/tag/friendship/">friendship</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 8px" href="/tag/friends/">friends</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 8px" href="/tag/truth/">truth</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" 
style="font-size: 6px" href="/tag/simile/">simile</a>\n            </span>\n            \n        \n    </div>\n</div>\n\n    </div>\n    <footer class="footer">\n        <div class="container">\n            <p class="text-muted">\n                Quotes by: <a href="https://www.goodreads.com/quotes">GoodReads.com</a>\n            </p>\n            <p class="copyright">\n                Made with <span class=\'zyte\'>\xe2\x9d\xa4</span> by <a class=\'zyte\' href="https://www.zyte.com">Zyte</a>\n            </p>\n        </div>\n    </footer>\n</body>\n</html>'
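
The escape sequences such as \xe2\x80\x9c in the bytes output are the UTF-8 encoding of characters outside ASCII, here the curly quotation marks. As a small illustration that needs no network access, the following sketch decodes two byte strings copied from the output above. The variable names are chosen for this example only.

```python
# Byte strings copied from the page.content output above
raw_quote = b"\xe2\x80\x9cA day without sunshine is like, you know, night.\xe2\x80\x9d"
raw_author = b"Andr\xc3\xa9 Gide"

# Decoding with UTF-8 turns the bytes into the text seen in page.text
quote = raw_quote.decode("utf-8")
author = raw_author.decode("utf-8")
print(quote)
print(author)

# Encoding goes the other way, from str back to bytes
assert author.encode("utf-8") == raw_author
```

requests performs this decoding step when page.text is accessed, using the encoding it determined for the response (available as page.encoding).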

7.1.2. Parsing the HTML Document and Extracting Content with BeautifulSoup

BeautifulSoup documentation: https://beautiful-soup-4.readthedocs.io

Documentation specifically on the BeautifulSoup classes: https://tedboy.github.io/bs4_doc/generated/bs4.html#classes

The get() function returned the body of the HTTP response as a string. We could search a string directly, but it would be much more convenient if we could access the individual HTML elements as attributes, just as we accessed the status code, headers and body of the HTTP response as attributes of a Response object. This is exactly what the BeautifulSoup package was developed for. The following line of code creates a BeautifulSoup object from the body of the HTTP response; common HTML elements such as head, title and body are available as its attributes:

# BeautifulSoup() accepts the body of an HTTP response as a bytes object (or as a string)
soup = BeautifulSoup(page.content, "html.parser")
type(soup)
bs4.BeautifulSoup

What is a BeautifulSoup object?

“The BeautifulSoup object represents the parsed document as a whole.”

Source: https://beautiful-soup-4.readthedocs.io/en/latest/#beautifulsoup

Accessing attributes of the BeautifulSoup object:

# HTML element head
soup.head
<head>
<meta charset="utf-8"/>
<title>Quotes to Scrape</title>
<link href="/static/bootstrap.min.css" rel="stylesheet"/>
<link href="/static/main.css" rel="stylesheet"/>
</head>
# HTML element title
soup.title
<title>Quotes to Scrape</title>
# HTML element body
# Commented out here because the output is very long
# soup.body
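
To try attribute access without a network connection, a BeautifulSoup object can also be built from a small HTML string. The following sketch uses a shortened, hand-written HTML snippet modelled on the page (not the real response) and shows that attribute access returns the first matching tag, or None if no such tag exists. The name mini_soup is chosen for this example.

```python
from bs4 import BeautifulSoup

# Hand-written snippet modelled on the page, for illustration only
html = """
<html>
<head><title>Quotes to Scrape</title></head>
<body><h1><a href="/">Quotes to Scrape</a></h1></body>
</html>
"""
mini_soup = BeautifulSoup(html, "html.parser")

# Attribute access returns a Tag object for the first matching element
title_tag = mini_soup.title
print(title_tag.name)    # name of the tag
print(title_tag.string)  # text inside the tag

# Attributes can be chained to walk down the tree
link = mini_soup.body.h1.a
print(link["href"])

# A tag that does not occur in the document yields None
print(mini_soup.table)
```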

Calling methods of the BeautifulSoup object:

# .prettify() formats the body of the HTTP response with clear indentation
# Commented out here because the output is very long
# print(soup.prettify())

As explained at the beginning, we want to scrape the quotes from the page. If we inspect the page in the Chrome developer tools, we see that each quote sits inside an HTML span element. To extract the quotes from the page, we therefore first have to extract the HTML elements that contain them:

# Print the first span element
soup.find("span")
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
# Print all span elements
soup.find_all("span")
[<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>,
 <span>by <small class="author" itemprop="author">Albert Einstein</small>
 <a href="/author/Albert-Einstein">(about)</a>
 </span>,
 <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>,
 <span>by <small class="author" itemprop="author">J.K. Rowling</small>
 <a href="/author/J-K-Rowling">(about)</a>
 </span>,
 <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>,
 <span>by <small class="author" itemprop="author">Albert Einstein</small>
 <a href="/author/Albert-Einstein">(about)</a>
 </span>,
 <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>,
 <span>by <small class="author" itemprop="author">Jane Austen</small>
 <a href="/author/Jane-Austen">(about)</a>
 </span>,
 <span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>,
 <span>by <small class="author" itemprop="author">Marilyn Monroe</small>
 <a href="/author/Marilyn-Monroe">(about)</a>
 </span>,
 <span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>,
 <span>by <small class="author" itemprop="author">Albert Einstein</small>
 <a href="/author/Albert-Einstein">(about)</a>
 </span>,
 <span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.”</span>,
 <span>by <small class="author" itemprop="author">André Gide</small>
 <a href="/author/Andre-Gide">(about)</a>
 </span>,
 <span class="text" itemprop="text">“I have not failed. I've just found 10,000 ways that won't work.”</span>,
 <span>by <small class="author" itemprop="author">Thomas A. Edison</small>
 <a href="/author/Thomas-A-Edison">(about)</a>
 </span>,
 <span class="text" itemprop="text">“A woman is like a tea bag; you never know how strong it is until it's in hot water.”</span>,
 <span>by <small class="author" itemprop="author">Eleanor Roosevelt</small>
 <a href="/author/Eleanor-Roosevelt">(about)</a>
 </span>,
 <span class="text" itemprop="text">“A day without sunshine is like, you know, night.”</span>,
 <span>by <small class="author" itemprop="author">Steve Martin</small>
 <a href="/author/Steve-Martin">(about)</a>
 </span>,
 <span aria-hidden="true">→</span>,
 <span class="tag-item">
 <a class="tag" href="/tag/love/" style="font-size: 28px">love</a>
 </span>,
 <span class="tag-item">
 <a class="tag" href="/tag/inspirational/" style="font-size: 26px">inspirational</a>
 </span>,
 <span class="tag-item">
 <a class="tag" href="/tag/life/" style="font-size: 26px">life</a>
 </span>,
 <span class="tag-item">
 <a class="tag" href="/tag/humor/" style="font-size: 24px">humor</a>
 </span>,
 <span class="tag-item">
 <a class="tag" href="/tag/books/" style="font-size: 22px">books</a>
 </span>,
 <span class="tag-item">
 <a class="tag" href="/tag/reading/" style="font-size: 14px">reading</a>
 </span>,
 <span class="tag-item">
 <a class="tag" href="/tag/friendship/" style="font-size: 10px">friendship</a>
 </span>,
 <span class="tag-item">
 <a class="tag" href="/tag/friends/" style="font-size: 8px">friends</a>
 </span>,
 <span class="tag-item">
 <a class="tag" href="/tag/truth/" style="font-size: 8px">truth</a>
 </span>,
 <span class="tag-item">
 <a class="tag" href="/tag/simile/" style="font-size: 6px">simile</a>
 </span>,
 <span class="zyte">❤</span>]
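
The difference between the two methods lies in what they return: .find() gives back a single element (or None if nothing matches), while .find_all() gives back a list-like collection of all matches. A minimal sketch on a hand-written HTML snippet, which is an assumption for illustration and not the real page:

```python
from bs4 import BeautifulSoup

# Hand-written snippet, for illustration only
html = """
<div class="quote">
  <span class="text">First quote</span>
  <span>by <small class="author">Author A</small></span>
</div>
<span class="tag-item"><a class="tag" href="/tag/love/">love</a></span>
"""
mini_soup = BeautifulSoup(html, "html.parser")

first_span = mini_soup.find("span")     # a single Tag
all_spans = mini_soup.find_all("span")  # all matching Tags

print(first_span)
print(len(all_spans))

# The result of find_all() can be looped over like a list;
# .get("class") returns None for tags without a class attribute
for span in all_spans:
    print(span.get("class"))

# No match: find() returns None, find_all() returns an empty collection
print(mini_soup.find("table"), len(mini_soup.find_all("table")))
```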

Looking through the output of .find_all(), we notice that the span elements found include some that do not contain quotes. How can we access only the span elements that contain quotes? These all have a class attribute with the value "text", whereas the span elements without quotes either have a class attribute with the value "tag-item" or "zyte", or no class attribute at all. In other words, there are different classes of span elements, so we can extract the span elements containing quotes by means of the class attribute. But how exactly do we do that? We can use the .find_all() method again. It has a parameter class_ that restricts the search to HTML elements with a particular value for the class attribute:

# All span elements with an attribute class="text"
# Note the underscore in the parameter class_. It is needed because class is a
# reserved keyword in Python that introduces a class definition. The underscore
# has no meaning of its own; it is a formality that keeps Python from
# misreading the parameter as the start of a class definition.
soup.find_all("span", class_="text")
[<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>,
 <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>,
 <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>,
 <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>,
 <span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>,
 <span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>,
 <span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.”</span>,
 <span class="text" itemprop="text">“I have not failed. I've just found 10,000 ways that won't work.”</span>,
 <span class="text" itemprop="text">“A woman is like a tea bag; you never know how strong it is until it's in hot water.”</span>,
 <span class="text" itemprop="text">“A day without sunshine is like, you know, night.”</span>]
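The effect of the class_ filter can also be tried without a network request. The following is a minimal sketch, assuming a small hand-written HTML snippet (hypothetical, modelled on the output above) and Python's built-in "html.parser". It also shows two alternative spellings of the same filter, attrs={"class": ...} and the CSS selector method .select(), which should return the same elements here.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page, modelled on the span elements shown above
html = """
<span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>
<span class="text" itemprop="text">“A day without sunshine is like, you know, night.”</span>
<span class="tag-item"><a class="tag" href="/tag/love/">love</a></span>
<span aria-hidden="true">→</span>
"""
soup = BeautifulSoup(html, "html.parser")

# Without a filter, every span element is returned
all_spans = soup.find_all("span")

# Three ways to keep only the span elements with class="text"
by_keyword = soup.find_all("span", class_="text")
by_attrs = soup.find_all("span", attrs={"class": "text"})
by_selector = soup.select("span.text")

print(len(all_spans))   # 4
print(len(by_keyword))  # 2
```

The class_ keyword is the shortest form; the attrs dictionary is handy when filtering on several attributes at once.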

Now .find_all() returns exactly the span elements we need. All that remains is to extract the text between the tags of each span element. According to the BeautifulSoup documentation, we can use the method .get_text() for this:

zitate = soup.find_all("span", class_="text")
# zitate.get_text()

I have commented out the line zitate.get_text() because it produces an error message: AttributeError: ResultSet object has no attribute 'get_text'. You're probably treating a list of elements like a single element. What does that mean? zitate is apparently a “ResultSet” object, and no method .get_text() is defined for objects of this type. As a reminder: in Python, methods are also a kind of attribute. The word attribute in the error message therefore refers not to a data attribute but to a method attribute, which we simply call a method to keep the two apart.

# zitate is a ResultSet object
type(zitate)
bs4.element.ResultSet
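The error can be reproduced safely by wrapping the call in try/except. The sketch below uses a small hand-written HTML snippet (an assumption, standing in for the real page) and also probes a few things the object does support, such as len() and indexing.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page
html = """
<span class="text">“I have not failed. I've just found 10,000 ways that won't work.”</span>
<span class="text">“A day without sunshine is like, you know, night.”</span>
"""
soup = BeautifulSoup(html, "html.parser")
zitate = soup.find_all("span", class_="text")

# Calling .get_text() on the whole result raises an AttributeError
try:
    zitate.get_text()
except AttributeError as fehler:
    print("AttributeError:", fehler)

# The result behaves like a list: it has a length and can be indexed
print(isinstance(zitate, list))
print(len(zitate))

# A single element is a Tag object, and Tag objects do have .get_text()
erstes_zitat = zitate[0]
print(type(erstes_zitat).__name__)
print(erstes_zitat.get_text())
```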

To learn more about the ResultSet object, we can consult the page https://tedboy.github.io/bs4_doc/generated/generated/bs4.ResultSet.html. A ResultSet object is thus essentially a Python list (technically, a subclass of list). The method .get_text(), however, is only defined for the individual elements in a ResultSet. So what do we have to do in order to apply .get_text()? We need a loop that iterates over the ResultSet object zitate and calls .get_text() on the current element in each iteration.

# Extract the content of the span elements with the method .get_text() and print it
for zitat in zitate:
    print(zitat.get_text())
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
“Try not to become a man of success. Rather become a man of value.”
“It is better to be hated for what you are than to be loved for what you are not.”
“I have not failed. I've just found 10,000 ways that won't work.”
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
“A day without sunshine is like, you know, night.”
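Printing is fine for inspection, but for further processing it is usually more convenient to collect the texts in a list. A minimal sketch, again assuming a hand-written HTML snippet in place of the real page; the variable names texte and bereinigt are chosen for illustration only. The second step, stripping the typographic quotation marks, is optional.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page
html = """
<span class="text">“Try not to become a man of success. Rather become a man of value.”</span>
<span class="text">“A day without sunshine is like, you know, night.”</span>
"""
soup = BeautifulSoup(html, "html.parser")
zitate = soup.find_all("span", class_="text")

# Same loop as above, written as a list comprehension that keeps the results
texte = [zitat.get_text() for zitat in zitate]

# Optional: remove the curly quotation marks at both ends of each quote
bereinigt = [text.strip("“”") for text in texte]

print(texte[1])      # “A day without sunshine is like, you know, night.”
print(bereinigt[1])  # A day without sunshine is like, you know, night.
```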

Voilà! We have successfully extracted the quotes from the web page https://quotes.toscrape.com/. Next week you will learn how to save the quotes to a file on your computer as a next step. We will also generalize the steps that we have so far applied to a single web page so that we can apply them to all pages of the website without first inspecting each page in the developer tools. After all, that is exactly what we want to achieve with web scraping: a script that automatically extracts the same kind of data from many web pages and saves it on our computer.
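As a first step towards that generalization, the parsing and extraction steps can be wrapped in a function. The following is only a sketch under assumptions: the function name extract_quotes and the dictionary pages are hypothetical, and the two HTML strings stand in for pages that would normally be fetched with requests.get().

```python
from bs4 import BeautifulSoup


def extract_quotes(html):
    """Return the text of all span elements with class="text" in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    return [span.get_text() for span in soup.find_all("span", class_="text")]


# Hypothetical stand-ins for the HTML of two different pages
pages = {
    "page-1": '<span class="text">“Try not to become a man of success. Rather become a man of value.”</span>',
    "page-2": '<span class="text">“A day without sunshine is like, you know, night.”</span>',
}

# Apply the same extraction to every page and collect the results
all_quotes = []
for name, html in pages.items():
    all_quotes.extend(extract_quotes(html))

print(len(all_quotes))  # 2
```

Because the function only takes an HTML string, it does not matter where the HTML comes from, which makes it easy to test without a network connection.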

7.1.2.1. Sources

  1. Quotes to Scrape – JavaScript Version. URL: https://quotes.toscrape.com/js/.

  2. Quotes to Scrape. URL: https://quotes.toscrape.com/js/.

  3. bs4 Classes. 2016. URL: https://tedboy.github.io/bs4_doc/generated/bs4.html#classes.

  4. bs4 ResultSet. 2016. URL: https://tedboy.github.io/bs4_doc/generated/generated/bs4.ResultSet.html.

  5. Kenneth Reitz. Requests Documentation. Developer Interface: get. 2023. URL: https://requests.readthedocs.io/en/latest/api/?highlight=get#requests.Session.get.

  6. Kenneth Reitz. Requests Documentation. 2023. URL: https://requests.readthedocs.io/en/latest/.

  7. Leonard Richardson. BeautifulSoup Documentation. BeautifulSoup. 2015. URL: https://beautiful-soup-4.readthedocs.io/en/latest/#beautifulsoup.

  8. Leonard Richardson. BeautifulSoup Documentation. Navigating Using Tag Names. 2015. URL: https://beautiful-soup-4.readthedocs.io/en/latest/index.html#navigating-using-tag-names.

  9. Leonard Richardson. BeautifulSoup Documentation. Searching By CSS Class. 2015. URL: https://beautiful-soup-4.readthedocs.io/en/latest/index.html#searching-by-css-class.

  10. Leonard Richardson. BeautifulSoup Documentation. 2015. URL: https://beautiful-soup-4.readthedocs.io.