7.2. BeautifulSoup, continued
Last week we extracted all quotes from the first page of the website https://quotes.toscrape.com. Today we will extend that code in three ways:
Extract quotes from the first page together with their metadata
Extract quotes from all pages, with and without metadata
Write data to files, using the example of a pandas DataFrame written to an Excel spreadsheet
To run the code we need, in addition to the packages requests and bs4, the package pandas and one function from the package urllib3. Optionally, you can also install the package memory_profiler, which measures how much memory a code cell needs when it runs. We also install the package openpyxl, which pandas uses to write DataFrames to Excel files. We do not have to import openpyxl ourselves, because it is loaded automatically when we later write a pandas DataFrame to an Excel file.
#import sys
#!conda install --yes --prefix {sys.prefix} memory_profiler
#!conda install --yes --prefix {sys.prefix} openpyxl
import requests
from bs4 import BeautifulSoup
import pandas as pd
# %load_ext memory_profiler
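To see which of the packages mentioned above are already available in the current environment, a quick check like the following can help. This is only a sketch using the standard library; the list of names is taken from the text above, and the variable names `packages` and `available` are chosen here for illustration.

```python
import importlib.util

# Package names taken from the text above; urllib3 is normally
# installed together with requests as one of its dependencies.
packages = ["requests", "bs4", "pandas", "urllib3", "openpyxl", "memory_profiler"]

# find_spec() returns None when a top-level package cannot be found
available = {name: importlib.util.find_spec(name) is not None for name in packages}

for name, found in available.items():
    print(f"{name:16} {'installed' if found else 'missing'}")
```

Packages reported as missing can then be installed with the commented-out conda commands in the cell above.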
7.2.1. Recap: extracting quotes from the first page, without metadata
URL = "https://quotes.toscrape.com/"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
zitate = soup.find_all("span", class_="text")
for zitat in zitate:
    print(zitat.get_text())
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
“Try not to become a man of success. Rather become a man of value.”
“It is better to be hated for what you are than to be loved for what you are not.”
“I have not failed. I've just found 10,000 ways that won't work.”
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
“A day without sunshine is like, you know, night.”
7.2.2. Extracting quotes from the first page, with metadata
In the last session we used the BeautifulSoup methods .find() and .find_all() to locate HTML elements in a BeautifulSoup object. We can proceed in the same way to extract some metadata along with the quotes, for example the name of the person the quote is attributed to and the tags assigned to the quote.
quotes = soup.find_all("div", class_="quote")
quotes_dict = {"Text":[], "Author":[], "Tags":[]}
for quote in quotes:
    quote_text = quote.find("span", class_="text").get_text()
    quote_author = quote.find("small", class_="author").get_text()
    quote_tags = quote.find_all("a", class_="tag")
    tags_text = []
    for tag in quote_tags:
        tags_text.append(tag.get_text())
    quotes_dict["Text"].append(quote_text)
    quotes_dict["Author"].append(quote_author)
    quotes_dict["Tags"].append(tags_text)
# a dictionary is not particularly easy to read
quotes_dict
{'Text': ['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
'“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
'“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
'“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
"“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
'“Try not to become a man of success. Rather become a man of value.”',
'“It is better to be hated for what you are than to be loved for what you are not.”',
"“I have not failed. I've just found 10,000 ways that won't work.”",
"“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
'“A day without sunshine is like, you know, night.”'],
'Author': ['Albert Einstein',
'J.K. Rowling',
'Albert Einstein',
'Jane Austen',
'Marilyn Monroe',
'Albert Einstein',
'André Gide',
'Thomas A. Edison',
'Eleanor Roosevelt',
'Steve Martin'],
'Tags': [['change', 'deep-thoughts', 'thinking', 'world'],
['abilities', 'choices'],
['inspirational', 'life', 'live', 'miracle', 'miracles'],
['aliteracy', 'books', 'classic', 'humor'],
['be-yourself', 'inspirational'],
['adulthood', 'success', 'value'],
['life', 'love'],
['edison', 'failure', 'inspirational', 'paraphrased'],
['misattributed-eleanor-roosevelt'],
['humor', 'obvious', 'simile']]}
# a DataFrame is easier to read
quotes_df = pd.DataFrame.from_dict(quotes_dict)
quotes_df
 | Text | Author | Tags |
---|---|---|---|
0 | “The world as we have created it is a process ... | Albert Einstein | [change, deep-thoughts, thinking, world] |
1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | [abilities, choices] |
2 | “There are only two ways to live your life. On... | Albert Einstein | [inspirational, life, live, miracle, miracles] |
3 | “The person, be it gentleman or lady, who has ... | Jane Austen | [aliteracy, books, classic, humor] |
4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe | [be-yourself, inspirational] |
5 | “Try not to become a man of success. Rather be... | Albert Einstein | [adulthood, success, value] |
6 | “It is better to be hated for what you are tha... | André Gide | [life, love] |
7 | “I have not failed. I've just found 10,000 way... | Thomas A. Edison | [edison, failure, inspirational, paraphrased] |
8 | “A woman is like a tea bag; you never know how... | Eleanor Roosevelt | [misattributed-eleanor-roosevelt] |
9 | “A day without sunshine is like, you know, nig... | Steve Martin | [humor, obvious, simile] |
Instead of the method .get_text() you can also access the attribute .text: both return the text content of the element.
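A small sketch of this equivalence, using a made-up HTML snippet rather than the real page (the snippet and the names `snippet`, `span`, `same`, `stripped` and `joined` are assumptions for illustration). It also shows that .get_text() accepts the optional arguments strip and separator, which the plain .text attribute cannot take:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, loosely modelled on one quote element
html = (
    '<div class="quote"><span class="text"> “Hello, world.” </span>'
    '<small class="author">Jane Doe</small></div>'
)
snippet = BeautifulSoup(html, "html.parser")
span = snippet.find("span", class_="text")

# .text and .get_text() without arguments return the same string
same = span.text == span.get_text()

# strip=True removes leading and trailing whitespace from each text piece
stripped = span.get_text(strip=True)

# separator is placed between the text pieces of nested elements
joined = snippet.div.get_text(separator=" | ", strip=True)

print(same)
print(stripped)
print(joined)
```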
But what happens when an element cannot be found, for example because no tags were assigned to a quote or because the author is missing? In that case .find() returns None, while .find_all() returns an empty list. Calling .get_text() (or accessing .text) on the result of .find() would then operate on an object of type NoneType, and NoneType objects have neither a .get_text() method nor a .text attribute. The code would raise an AttributeError and stop. To prevent this, we can first check whether an element was actually found and extract the text only if it was. The following example uses the attribute .text instead of the method .get_text():
quotes = soup.find_all("div", class_="quote")
quotes_dict = {"Text":[], "Author":[], "Tags":[]}
for quote in quotes:
    quote_text = quote.find("span", class_="text")
    quote_author = quote.find("small", class_="author")
    quote_tags = quote.find_all("a", class_="tag")
    tags_text = []
    for tag in quote_tags:
        tags_text.append(tag.text)
    if quote_text is not None:
        quotes_dict["Text"].append(quote_text.text)
    if quote_author is not None:
        quotes_dict["Author"].append(quote_author.text)
    if len(tags_text) != 0:
        quotes_dict["Tags"].append(tags_text)
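One caveat with skipping missing values: if a quote lacks an author or tags, nothing is appended to the corresponding list, so the three lists end up with different lengths and pd.DataFrame.from_dict() would raise a ValueError. A possible variant, sketched here on a made-up HTML snippet (the snippet and the names `demo_soup`, `demo_dict` and `demo_df` are assumptions, not part of the original code), always appends exactly one value per list and uses None or an empty list as a placeholder:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical HTML: the second quote has neither an author nor tags
html = """
<div class="quote"><span class="text">“First.”</span>
  <small class="author">Author A</small>
  <a class="tag">one</a><a class="tag">two</a></div>
<div class="quote"><span class="text">“Second.”</span></div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

demo_dict = {"Text": [], "Author": [], "Tags": []}
for quote in demo_soup.find_all("div", class_="quote"):
    quote_text = quote.find("span", class_="text")
    quote_author = quote.find("small", class_="author")
    tags_text = [tag.text for tag in quote.find_all("a", class_="tag")]
    # Append one value per list for every quote so the lists stay aligned
    demo_dict["Text"].append(quote_text.text if quote_text is not None else None)
    demo_dict["Author"].append(quote_author.text if quote_author is not None else None)
    demo_dict["Tags"].append(tags_text)  # an empty list means "no tags"

demo_df = pd.DataFrame.from_dict(demo_dict)
print(demo_df)
```

With this variant, row i of the DataFrame always refers to quote i, and missing metadata shows up as None or an empty list instead of shifting later rows.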
Searching with find() and find_all() produces cluttered code for more complex queries, however, because the CSS class of the element has to be passed as an additional argument. A simpler way to search by CSS class directly is offered by the methods .select_one() and .select(), which accept CSS selectors.
Further reading: https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=select#css-selectors
quotes = soup.select("div.quote")
quotes_dict = {"Text":[], "Author":[], "Tags":[]}
for quote in quotes:
    quote_text = quote.select_one("span.text")  # or simply '.text'
    quote_author = quote.select_one("small.author")
    quote_tags = quote.select("a.tag")
    tags_text = []
    for tag in quote_tags:
        tags_text.append(tag.text)
    if quote_text is not None:
        quotes_dict["Text"].append(quote_text.text)
    if quote_author is not None:
        quotes_dict["Author"].append(quote_author.text)
    if len(tags_text) != 0:
        quotes_dict["Tags"].append(tags_text)
Searching with CSS selectors via .select_one() and .select() has a further advantage: selectors can target child or sibling elements of an element directly. With .find() and .find_all() we used a for loop to restrict the search to elements inside the div elements with the class 'quote'. In some cases, using .select() instead of .find_all() can replace such a for loop. In our example, searching for the quote texts and author names would not require a for loop:
# Quote texts: 'div.quote > span.text' finds span.text elements that are direct children of a div element with the class 'quote'
soup.select("div.quote > span.text")
[<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>,
<span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>,
<span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>,
<span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>,
<span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>,
<span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>,
<span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.”</span>,
<span class="text" itemprop="text">“I have not failed. I've just found 10,000 ways that won't work.”</span>,
<span class="text" itemprop="text">“A woman is like a tea bag; you never know how strong it is until it's in hot water.”</span>,
<span class="text" itemprop="text">“A day without sunshine is like, you know, night.”</span>]
# Authors: 'div.quote small.author' finds all small elements with the class 'author' anywhere inside a div element with the class 'quote', including "grandchildren"
soup.select("div.quote small.author")
[<small class="author" itemprop="author">Albert Einstein</small>,
<small class="author" itemprop="author">J.K. Rowling</small>,
<small class="author" itemprop="author">Albert Einstein</small>,
<small class="author" itemprop="author">Jane Austen</small>,
<small class="author" itemprop="author">Marilyn Monroe</small>,
<small class="author" itemprop="author">Albert Einstein</small>,
<small class="author" itemprop="author">André Gide</small>,
<small class="author" itemprop="author">Thomas A. Edison</small>,
<small class="author" itemprop="author">Eleanor Roosevelt</small>,
<small class="author" itemprop="author">Steve Martin</small>]
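The difference between the child combinator (`>`) and the descendant combinator (a space) can be tried out on a tiny example. The HTML below is made up for illustration, and the names `demo_soup`, `direct_text`, `direct_author`, `nested_author` and `authors` are assumptions: the author element is wrapped in an extra span, so it is a "grandchild" of div.quote.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: the text is a direct child of div.quote,
# the author is nested one level deeper inside another span
html = """
<div class="quote">
  <span class="text">“Example quote.”</span>
  <span>by <small class="author">Jane Doe</small></span>
</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

direct_text = demo_soup.select("div.quote > span.text")       # direct child: found
direct_author = demo_soup.select("div.quote > small.author")  # not a direct child: empty
nested_author = demo_soup.select("div.quote small.author")    # any descendant: found

# .select() returns Tag objects; a list comprehension extracts their text
authors = [element.text for element in nested_author]
print(len(direct_text), len(direct_author), authors)
```

The list comprehension in the last step is one way to turn the Tag objects returned by .select() into plain strings.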
7.2.3. Extracting quotes from all pages of the website, without metadata
7.2.3.1. Solution with a for loop: for exactly 10 pages
%%time
# %%memit
# Approach by GitHub user Bhavya Bindela: https://bhavyasree.github.io/PythonClass/Notebooks/18.scrape-quotes/ . Adapted to extract quotes instead of authors
# for loop with a set
base_url = "http://quotes.toscrape.com/page/"
quotes = set()
for i in range(1, 11):
    scrape_url = base_url + str(i)
    page = requests.get(scrape_url)
    soup = BeautifulSoup(page.content, "html.parser")
    for quote in soup.select(".quote > .text"):
        quotes.add(quote.text)  # type(quote) is bs4.element.Tag, which has a .text attribute; add() is a set method
CPU times: user 110 ms, sys: 6.4 ms, total: 116 ms
Wall time: 1.12 s
quotes # sets are unordered, which is impractical here
{'“... a mind needs books as a sword needs a whetstone, if it is to keep its edge.”',
'“A day without sunshine is like, you know, night.”',
"“A lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.”",
'“A lie can travel half way around the world while the truth is putting on its shoes.”',
"“A person's a person, no matter how small.”",
'“A reader lives a thousand lives before he dies, said Jojen. The man who never reads lives only one.”',
"“A wise girl kisses but doesn't love, listens but doesn't believe, and leaves before she is left.”",
"“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
"“All you need is love. But a little chocolate now and then doesn't hurt.”",
'“Any fool can know. The point is to understand.”',
'“Anyone who has never made a mistake has never tried anything new.”',
'“Anyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.”',
'“Beauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.”',
'“But better to get hurt by the truth than comforted with a lie.”',
'“Do not pity the dead, Harry. Pity the living, and, above all those who live without love.”',
'“Do one thing every day that scares you.”',
'“Finish each day and be done with it. You have done what you could. Some blunders and absurdities no doubt crept in; forget them as soon as you can. Tomorrow is a new day. You shall begin it serenely and with too high a spirit to be encumbered with your old nonsense.”',
'“For every minute you are angry you lose sixty seconds of happiness.”',
'“Good friends, good books, and a sleepy conscience: this is the ideal life.”',
"“He's like a drug for you, Bella.”",
'“I am free of all prejudice. I hate everyone equally. ”',
'“I am good, but not an angel. I do sin, but I am not the devil. I am just a small girl in a big world trying to find someone to love.”',
'“I believe in Christianity as I believe that the sun has risen: not only because I see it, but because by it I see everything else.”',
'“I declare after all there is no enjoyment like reading! How much sooner one tires of any thing than of a book! -- When I have a house of my own, I shall be miserable if I have not an excellent library.”',
'“I have always imagined that Paradise will be a kind of library.”',
"“I have heard there are troubles of more than one kind. Some come from ahead and some come from behind. But I've bought a big bat. I'm all ready you see. Now my troubles are going to have troubles with me!”",
'“I have never let my schooling interfere with my education.”',
"“I have not failed. I've just found 10,000 ways that won't work.”",
'“I like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.”',
'“I love you without knowing how, or when, or from where. I love you simply, without problems or pride: I love you in this way because I do not know any other way of loving but this, in which there is no I or you, so intimate that your hand upon my chest is my hand, so intimate that when I fall asleep your eyes close.”',
'“I may not have gone where I intended to go, but I think I have ended up where I needed to be.”',
"“I'm the one that's got to die when it's time for me to die, so let me live my life the way I want to.”",
'“If I had a flower for every time I thought of you...I could walk through my garden forever.”',
'“If I were not a physicist, I would probably be a musician. I often think in music. I live my daydreams in music. I see my life in terms of music.”',
'“If you can make a woman laugh, you can make her do anything.”',
"“If you can't explain it to a six year old, you don't understand it yourself.”",
'“If you judge people, you have no time to love them.”',
'“If you only read the books that everyone else is reading, you can only think what everyone else is thinking.”',
'“If you want your children to be intelligent, read them fairy tales. If you want them to be more intelligent, read them more fairy tales.”',
"“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
'“It is better to be hated for what you are than to be loved for what you are not.”',
'“It is impossible to live without failing at something, unless you live so cautiously that you might as well not have lived at all - in which case, you fail by default.”',
'“It is never too late to be what you might have been.”',
'“It is not a lack of love, but a lack of friendship that makes unhappy marriages.”',
'“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
'“It matters not what someone is born, but what they grow to be.”',
'“It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”',
'“It takes courage to grow up and become who you really are.”',
'“Life is like riding a bicycle. To keep your balance, you must keep moving.”',
'“Life is what happens to us while we are making other plans.”',
"“Life isn't about finding yourself. Life is about creating yourself.”",
'“Logic will get you from A to Z; imagination will get you everywhere.”',
'“Love does not begin and end the way we seem to think it does. Love is a battle, love is a war; love is a growing up.”',
'“Never tell the truth to people who are not worthy of it.”',
'“Not all of us can do great things. But we can do small things with great love.”',
'“Not all those who wander are lost.”',
'“Of course it is happening inside your head, Harry, but why on earth should that mean that it is not real?”',
'“One good thing about music, when it hits you, you feel no pain.”',
'“Only in the darkness can you see the stars.”',
'“Remember, if the time should come when you have to make a choice between what is right and what is easy, remember what happened to a boy who was good, and kind, and brave, because he strayed across the path of Lord Voldemort. Remember Cedric Diggory.”',
"“Remember, we're madly in love, so it's all right to kiss me anytime you feel like it.”",
'“Some day you will be old enough to start reading fairy tales again.”',
'“Some people never go crazy. What truly horrible lives they must lead.”',
"“That's the problem with drinking, I thought, as I poured myself a drink. If something bad happens you drink in an attempt to forget; if something good happens you drink in order to celebrate; and if nothing happens you drink to make something happen.”",
'“The difference between genius and stupidity is: genius has its limits.”',
'“The fear of death follows from the fear of life. A man who lives fully is prepared to die at any time.”',
"“The more that you read, the more things you will know. The more that you learn, the more places you'll go.”",
"“The opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.”",
'“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
"“The question isn't who is going to let me; it's who is going to stop me.”",
'“The real lover is the man who can thrill you by kissing your forehead or smiling into your eyes or just staring into space.”',
'“The reason I talk to myself is because I’m the only one whose answers I accept.”',
'“The trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.”',
'“The truth is, everyone is going to hurt you. You just got to find the ones worth suffering for.”',
'“The truth." Dumbledore sighed. "It is a beautiful and terrible thing, and should therefore be treated with great caution.”',
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
'“There are few people whom I really love, and still fewer of whom I think well. The more I see of the world, the more am I dissatisfied with it; and every day confirms my belief of the inconsistency of all human characters, and of the little dependence that can be placed on the appearance of merit or sense.”',
'“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
'“There is no friend as loyal as a book.”',
'“There is nothing I would not do for those who are really my friends. I have no notion of loving people by halves, it is not my nature.”',
'“There is nothing to writing. All you do is sit down at a typewriter and bleed.”',
'“Think left and think right and think low and think high. Oh, the thinks you can think up if only you try!”',
"“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.”",
'“To die will be an awfully big adventure.”',
'“To love at all is to be vulnerable. Love anything and your heart will be wrung and possibly broken. If you want to make sure of keeping it intact you must give it to no one, not even an animal. Wrap it carefully round with hobbies and little luxuries; avoid all entanglements. Lock it up safe in the casket or coffin of your selfishness. But in that casket, safe, dark, motionless, airless, it will change. It will not be broken; it will become unbreakable, impenetrable, irredeemable. To love is to be vulnerable.”',
'“To the well-organized mind, death is but the next great adventure.”',
'“Today you are You, that is truer than true. There is no one alive who is Youer than You.”',
'“Try not to become a man of success. Rather become a man of value.”',
'“We are not necessarily doubting that God will do the best for us; we are wondering how painful the best will turn out to be.”',
"“We read to know we're not alone.”",
"“What really knocks me out is a book that, when you're all done reading it, you wish the author that wrote it was a terrific friend of yours and you could call him up on the phone whenever you felt like it. That doesn't happen much, though.”",
'“When one door of happiness closes, another opens; but often we look so long at the closed door that we do not see the one which has been opened for us.”',
'“You believe lies so you eventually learn to trust no one but yourself.”',
'“You can never get a cup of tea large enough or a book long enough to suit me.”',
'“You don’t forget the face of the person who was your last hope.”',
'“You have to write the book that wants to be written. And if the book will be too difficult for grown-ups, then you write it for children.”',
"“You may not be her first, her last, or her only. She loved before she may love again. But if she loves you now, what else matters? She's not perfect—you aren't either, and the two of you may never be perfect together but if she can make you laugh, cause you to think twice, and admit to being human and making mistakes, hold onto her and give her the most you can. She may not be thinking about you every second of the day, but she will give you a part of her that she knows you can break—her heart. So don't hurt her, don't change her, don't analyze and don't expect more than she can give. Smile when she makes you happy, let her know when she makes you mad, and miss her when she's not there.”",
"“You may say I'm a dreamer, but I'm not the only one. I hope someday you'll join us. And the world will live as one.”",
'“You never really understand a person until you consider things from his point of view... Until you climb inside of his skin and walk around in it.”',
"“′Classic′ - a book which people praise and don't read.”"}
# did we extract all quotes?
len(quotes)
100
This solution stores the extracted elements in a Python set. That is somewhat impractical, because sets are unordered, so the quotes are not kept in the order in which they appear on the website. It is therefore better to use a list instead:
%%time
# %%memit
base_url = "http://quotes.toscrape.com/page/"
quotes = []
for i in range(1, 11):
    scrape_url = base_url + str(i)
    page = requests.get(scrape_url)
    soup = BeautifulSoup(page.content, "html.parser")
    for quote in soup.select(".quote > .text"):
        quotes.append(quote.text)
CPU times: user 106 ms, sys: 4.56 ms, total: 111 ms
Wall time: 944 ms
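One thing a set does that a list does not is drop duplicates. If that behaviour is wanted while still keeping the order of appearance, one common option is dict.fromkeys(), since dictionaries preserve insertion order. The sketch below uses a made-up list (`scraped`, `as_set` and `in_order` are names chosen for illustration), not the scraped quotes:

```python
# Hypothetical list with one repeated entry
scraped = ["quote A", "quote B", "quote A", "quote C"]

as_set = set(scraped)                    # removes duplicates, but has no defined order
in_order = list(dict.fromkeys(scraped))  # removes duplicates, keeps first-seen order

print(len(scraped), len(as_set), in_order)
```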
quotes
['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
'“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
'“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
'“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
"“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
'“Try not to become a man of success. Rather become a man of value.”',
'“It is better to be hated for what you are than to be loved for what you are not.”',
"“I have not failed. I've just found 10,000 ways that won't work.”",
"“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
'“A day without sunshine is like, you know, night.”',
"“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.”",
'“It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”',
"“If you can't explain it to a six year old, you don't understand it yourself.”",
"“You may not be her first, her last, or her only. She loved before she may love again. But if she loves you now, what else matters? She's not perfect—you aren't either, and the two of you may never be perfect together but if she can make you laugh, cause you to think twice, and admit to being human and making mistakes, hold onto her and give her the most you can. She may not be thinking about you every second of the day, but she will give you a part of her that she knows you can break—her heart. So don't hurt her, don't change her, don't analyze and don't expect more than she can give. Smile when she makes you happy, let her know when she makes you mad, and miss her when she's not there.”",
'“I like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.”',
'“I may not have gone where I intended to go, but I think I have ended up where I needed to be.”',
"“The opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.”",
'“It is not a lack of love, but a lack of friendship that makes unhappy marriages.”',
'“Good friends, good books, and a sleepy conscience: this is the ideal life.”',
'“Life is what happens to us while we are making other plans.”',
'“I love you without knowing how, or when, or from where. I love you simply, without problems or pride: I love you in this way because I do not know any other way of loving but this, in which there is no I or you, so intimate that your hand upon my chest is my hand, so intimate that when I fall asleep your eyes close.”',
'“For every minute you are angry you lose sixty seconds of happiness.”',
'“If you judge people, you have no time to love them.”',
'“Anyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.”',
'“Beauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.”',
'“Today you are You, that is truer than true. There is no one alive who is Youer than You.”',
'“If you want your children to be intelligent, read them fairy tales. If you want them to be more intelligent, read them more fairy tales.”',
'“It is impossible to live without failing at something, unless you live so cautiously that you might as well not have lived at all - in which case, you fail by default.”',
'“Logic will get you from A to Z; imagination will get you everywhere.”',
'“One good thing about music, when it hits you, you feel no pain.”',
"“The more that you read, the more things you will know. The more that you learn, the more places you'll go.”",
'“Of course it is happening inside your head, Harry, but why on earth should that mean that it is not real?”',
'“The truth is, everyone is going to hurt you. You just got to find the ones worth suffering for.”',
'“Not all of us can do great things. But we can do small things with great love.”',
'“To the well-organized mind, death is but the next great adventure.”',
"“All you need is love. But a little chocolate now and then doesn't hurt.”",
"“We read to know we're not alone.”",
'“Any fool can know. The point is to understand.”',
'“I have always imagined that Paradise will be a kind of library.”',
'“It is never too late to be what you might have been.”',
'“A reader lives a thousand lives before he dies, said Jojen. The man who never reads lives only one.”',
'“You can never get a cup of tea large enough or a book long enough to suit me.”',
'“You believe lies so you eventually learn to trust no one but yourself.”',
'“If you can make a woman laugh, you can make her do anything.”',
'“Life is like riding a bicycle. To keep your balance, you must keep moving.”',
'“The real lover is the man who can thrill you by kissing your forehead or smiling into your eyes or just staring into space.”',
"“A wise girl kisses but doesn't love, listens but doesn't believe, and leaves before she is left.”",
'“Only in the darkness can you see the stars.”',
'“It matters not what someone is born, but what they grow to be.”',
'“Love does not begin and end the way we seem to think it does. Love is a battle, love is a war; love is a growing up.”',
'“There is nothing I would not do for those who are really my friends. I have no notion of loving people by halves, it is not my nature.”',
'“Do one thing every day that scares you.”',
'“I am good, but not an angel. I do sin, but I am not the devil. I am just a small girl in a big world trying to find someone to love.”',
'“If I were not a physicist, I would probably be a musician. I often think in music. I live my daydreams in music. I see my life in terms of music.”',
'“If you only read the books that everyone else is reading, you can only think what everyone else is thinking.”',
'“The difference between genius and stupidity is: genius has its limits.”',
"“He's like a drug for you, Bella.”",
'“There is no friend as loyal as a book.”',
'“When one door of happiness closes, another opens; but often we look so long at the closed door that we do not see the one which has been opened for us.”',
"“Life isn't about finding yourself. Life is about creating yourself.”",
"“That's the problem with drinking, I thought, as I poured myself a drink. If something bad happens you drink in an attempt to forget; if something good happens you drink in order to celebrate; and if nothing happens you drink to make something happen.”",
'“You don’t forget the face of the person who was your last hope.”',
"“Remember, we're madly in love, so it's all right to kiss me anytime you feel like it.”",
'“To love at all is to be vulnerable. Love anything and your heart will be wrung and possibly broken. If you want to make sure of keeping it intact you must give it to no one, not even an animal. Wrap it carefully round with hobbies and little luxuries; avoid all entanglements. Lock it up safe in the casket or coffin of your selfishness. But in that casket, safe, dark, motionless, airless, it will change. It will not be broken; it will become unbreakable, impenetrable, irredeemable. To love is to be vulnerable.”',
'“Not all those who wander are lost.”',
'“Do not pity the dead, Harry. Pity the living, and, above all those who live without love.”',
'“There is nothing to writing. All you do is sit down at a typewriter and bleed.”',
'“Finish each day and be done with it. You have done what you could. Some blunders and absurdities no doubt crept in; forget them as soon as you can. Tomorrow is a new day. You shall begin it serenely and with too high a spirit to be encumbered with your old nonsense.”',
'“I have never let my schooling interfere with my education.”',
"“I have heard there are troubles of more than one kind. Some come from ahead and some come from behind. But I've bought a big bat. I'm all ready you see. Now my troubles are going to have troubles with me!”",
'“If I had a flower for every time I thought of you...I could walk through my garden forever.”',
'“Some people never go crazy. What truly horrible lives they must lead.”',
'“The trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.”',
'“Think left and think right and think low and think high. Oh, the thinks you can think up if only you try!”',
"“What really knocks me out is a book that, when you're all done reading it, you wish the author that wrote it was a terrific friend of yours and you could call him up on the phone whenever you felt like it. That doesn't happen much, though.”",
'“The reason I talk to myself is because I’m the only one whose answers I accept.”',
"“You may say I'm a dreamer, but I'm not the only one. I hope someday you'll join us. And the world will live as one.”",
'“I am free of all prejudice. I hate everyone equally. ”',
"“The question isn't who is going to let me; it's who is going to stop me.”",
"“′Classic′ - a book which people praise and don't read.”",
'“Anyone who has never made a mistake has never tried anything new.”',
"“A lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.”",
'“Remember, if the time should come when you have to make a choice between what is right and what is easy, remember what happened to a boy who was good, and kind, and brave, because he strayed across the path of Lord Voldemort. Remember Cedric Diggory.”',
'“I declare after all there is no enjoyment like reading! How much sooner one tires of any thing than of a book! -- When I have a house of my own, I shall be miserable if I have not an excellent library.”',
'“There are few people whom I really love, and still fewer of whom I think well. The more I see of the world, the more am I dissatisfied with it; and every day confirms my belief of the inconsistency of all human characters, and of the little dependence that can be placed on the appearance of merit or sense.”',
'“Some day you will be old enough to start reading fairy tales again.”',
'“We are not necessarily doubting that God will do the best for us; we are wondering how painful the best will turn out to be.”',
'“The fear of death follows from the fear of life. A man who lives fully is prepared to die at any time.”',
'“A lie can travel half way around the world while the truth is putting on its shoes.”',
'“I believe in Christianity as I believe that the sun has risen: not only because I see it, but because by it I see everything else.”',
'“The truth." Dumbledore sighed. "It is a beautiful and terrible thing, and should therefore be treated with great caution.”',
"“I'm the one that's got to die when it's time for me to die, so let me live my life the way I want to.”",
'“To die will be an awfully big adventure.”',
'“It takes courage to grow up and become who you really are.”',
'“But better to get hurt by the truth than comforted with a lie.”',
'“You never really understand a person until you consider things from his point of view... Until you climb inside of his skin and walk around in it.”',
'“You have to write the book that wants to be written. And if the book will be too difficult for grown-ups, then you write it for children.”',
'“Never tell the truth to people who are not worthy of it.”',
"“A person's a person, no matter how small.”",
'“... a mind needs books as a sword needs a whetstone, if it is to keep its edge.”']
The output of %%time and %%memit shows that using a set has no efficiency advantage over a list here, so we can just as well use a list.
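Although the timings are similar, the two containers do not behave identically. The following is a minimal sketch with made-up strings (hypothetical sample data, not scraped output): a list keeps the page order and any duplicates, while a set drops duplicates and guarantees no particular order.

```python
# Hypothetical sample data: one entry appears twice.
scraped = ["quote B", "quote A", "quote B", "quote C"]

as_list = list(scraped)   # keeps order and duplicates
as_set = set(scraped)     # removes duplicates, no guaranteed order

print(len(as_list))       # 4
print(len(as_set))        # 3

# If duplicates should be removed but the original order kept,
# dict.fromkeys() is one common idiom (dicts preserve insertion
# order since Python 3.7).
unique_in_order = list(dict.fromkeys(scraped))
print(unique_in_order)    # ['quote B', 'quote A', 'quote C']
```

For the quotes website this mainly matters for the order of the result: the list returns the quotes in the order in which they appear on the pages.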
7.2.3.2. Solution with a while loop: unknown number of subpages
%%time
# %%memit
# while loop with a set
# Solution again by GitHub user Bhavya Bindela: https://bhavyasree.github.io/PythonClass/Notebooks/18.scrape-quotes/ . Adapted to extract quotes instead of authors.
page_no = 1
quotes = set()
base_url = "http://quotes.toscrape.com/page/"
while True:
    # Build the URL of the current subpage and request it
    scrape_url = base_url + str(page_no)
    page = requests.get(scrape_url)
    # This check only works for quotes.toscrape.com.
    # For other sites, the condition page.status_code != 200
    # could be tested here instead.
    if "No quotes found!" in page.text:
        break
    soup = BeautifulSoup(page.content, "html.parser")
    for quote in soup.select(".quote > .text"):
        quotes.add(quote.text)
    page_no += 1
CPU times: user 149 ms, sys: 1.25 ms, total: 151 ms
Wall time: 1.07 s
quotes
{'“... a mind needs books as a sword needs a whetstone, if it is to keep its edge.”',
'“A day without sunshine is like, you know, night.”',
"“A lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.”",
'“A lie can travel half way around the world while the truth is putting on its shoes.”',
"“A person's a person, no matter how small.”",
'“A reader lives a thousand lives before he dies, said Jojen. The man who never reads lives only one.”',
"“A wise girl kisses but doesn't love, listens but doesn't believe, and leaves before she is left.”",
"“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
"“All you need is love. But a little chocolate now and then doesn't hurt.”",
'“Any fool can know. The point is to understand.”',
'“Anyone who has never made a mistake has never tried anything new.”',
'“Anyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.”',
'“Beauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.”',
'“But better to get hurt by the truth than comforted with a lie.”',
'“Do not pity the dead, Harry. Pity the living, and, above all those who live without love.”',
'“Do one thing every day that scares you.”',
'“Finish each day and be done with it. You have done what you could. Some blunders and absurdities no doubt crept in; forget them as soon as you can. Tomorrow is a new day. You shall begin it serenely and with too high a spirit to be encumbered with your old nonsense.”',
'“For every minute you are angry you lose sixty seconds of happiness.”',
'“Good friends, good books, and a sleepy conscience: this is the ideal life.”',
"“He's like a drug for you, Bella.”",
'“I am free of all prejudice. I hate everyone equally. ”',
'“I am good, but not an angel. I do sin, but I am not the devil. I am just a small girl in a big world trying to find someone to love.”',
'“I believe in Christianity as I believe that the sun has risen: not only because I see it, but because by it I see everything else.”',
'“I declare after all there is no enjoyment like reading! How much sooner one tires of any thing than of a book! -- When I have a house of my own, I shall be miserable if I have not an excellent library.”',
'“I have always imagined that Paradise will be a kind of library.”',
"“I have heard there are troubles of more than one kind. Some come from ahead and some come from behind. But I've bought a big bat. I'm all ready you see. Now my troubles are going to have troubles with me!”",
'“I have never let my schooling interfere with my education.”',
"“I have not failed. I've just found 10,000 ways that won't work.”",
'“I like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.”',
'“I love you without knowing how, or when, or from where. I love you simply, without problems or pride: I love you in this way because I do not know any other way of loving but this, in which there is no I or you, so intimate that your hand upon my chest is my hand, so intimate that when I fall asleep your eyes close.”',
'“I may not have gone where I intended to go, but I think I have ended up where I needed to be.”',
"“I'm the one that's got to die when it's time for me to die, so let me live my life the way I want to.”",
'“If I had a flower for every time I thought of you...I could walk through my garden forever.”',
'“If I were not a physicist, I would probably be a musician. I often think in music. I live my daydreams in music. I see my life in terms of music.”',
'“If you can make a woman laugh, you can make her do anything.”',
"“If you can't explain it to a six year old, you don't understand it yourself.”",
'“If you judge people, you have no time to love them.”',
'“If you only read the books that everyone else is reading, you can only think what everyone else is thinking.”',
'“If you want your children to be intelligent, read them fairy tales. If you want them to be more intelligent, read them more fairy tales.”',
"“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
'“It is better to be hated for what you are than to be loved for what you are not.”',
'“It is impossible to live without failing at something, unless you live so cautiously that you might as well not have lived at all - in which case, you fail by default.”',
'“It is never too late to be what you might have been.”',
'“It is not a lack of love, but a lack of friendship that makes unhappy marriages.”',
'“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
'“It matters not what someone is born, but what they grow to be.”',
'“It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”',
'“It takes courage to grow up and become who you really are.”',
'“Life is like riding a bicycle. To keep your balance, you must keep moving.”',
'“Life is what happens to us while we are making other plans.”',
"“Life isn't about finding yourself. Life is about creating yourself.”",
'“Logic will get you from A to Z; imagination will get you everywhere.”',
'“Love does not begin and end the way we seem to think it does. Love is a battle, love is a war; love is a growing up.”',
'“Never tell the truth to people who are not worthy of it.”',
'“Not all of us can do great things. But we can do small things with great love.”',
'“Not all those who wander are lost.”',
'“Of course it is happening inside your head, Harry, but why on earth should that mean that it is not real?”',
'“One good thing about music, when it hits you, you feel no pain.”',
'“Only in the darkness can you see the stars.”',
'“Remember, if the time should come when you have to make a choice between what is right and what is easy, remember what happened to a boy who was good, and kind, and brave, because he strayed across the path of Lord Voldemort. Remember Cedric Diggory.”',
"“Remember, we're madly in love, so it's all right to kiss me anytime you feel like it.”",
'“Some day you will be old enough to start reading fairy tales again.”',
'“Some people never go crazy. What truly horrible lives they must lead.”',
"“That's the problem with drinking, I thought, as I poured myself a drink. If something bad happens you drink in an attempt to forget; if something good happens you drink in order to celebrate; and if nothing happens you drink to make something happen.”",
'“The difference between genius and stupidity is: genius has its limits.”',
'“The fear of death follows from the fear of life. A man who lives fully is prepared to die at any time.”',
"“The more that you read, the more things you will know. The more that you learn, the more places you'll go.”",
"“The opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.”",
'“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
"“The question isn't who is going to let me; it's who is going to stop me.”",
'“The real lover is the man who can thrill you by kissing your forehead or smiling into your eyes or just staring into space.”',
'“The reason I talk to myself is because I’m the only one whose answers I accept.”',
'“The trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.”',
'“The truth is, everyone is going to hurt you. You just got to find the ones worth suffering for.”',
'“The truth." Dumbledore sighed. "It is a beautiful and terrible thing, and should therefore be treated with great caution.”',
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
'“There are few people whom I really love, and still fewer of whom I think well. The more I see of the world, the more am I dissatisfied with it; and every day confirms my belief of the inconsistency of all human characters, and of the little dependence that can be placed on the appearance of merit or sense.”',
'“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
'“There is no friend as loyal as a book.”',
'“There is nothing I would not do for those who are really my friends. I have no notion of loving people by halves, it is not my nature.”',
'“There is nothing to writing. All you do is sit down at a typewriter and bleed.”',
'“Think left and think right and think low and think high. Oh, the thinks you can think up if only you try!”',
"“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.”",
'“To die will be an awfully big adventure.”',
'“To love at all is to be vulnerable. Love anything and your heart will be wrung and possibly broken. If you want to make sure of keeping it intact you must give it to no one, not even an animal. Wrap it carefully round with hobbies and little luxuries; avoid all entanglements. Lock it up safe in the casket or coffin of your selfishness. But in that casket, safe, dark, motionless, airless, it will change. It will not be broken; it will become unbreakable, impenetrable, irredeemable. To love is to be vulnerable.”',
'“To the well-organized mind, death is but the next great adventure.”',
'“Today you are You, that is truer than true. There is no one alive who is Youer than You.”',
'“Try not to become a man of success. Rather become a man of value.”',
'“We are not necessarily doubting that God will do the best for us; we are wondering how painful the best will turn out to be.”',
"“We read to know we're not alone.”",
"“What really knocks me out is a book that, when you're all done reading it, you wish the author that wrote it was a terrific friend of yours and you could call him up on the phone whenever you felt like it. That doesn't happen much, though.”",
'“When one door of happiness closes, another opens; but often we look so long at the closed door that we do not see the one which has been opened for us.”',
'“You believe lies so you eventually learn to trust no one but yourself.”',
'“You can never get a cup of tea large enough or a book long enough to suit me.”',
'“You don’t forget the face of the person who was your last hope.”',
'“You have to write the book that wants to be written. And if the book will be too difficult for grown-ups, then you write it for children.”',
"“You may not be her first, her last, or her only. She loved before she may love again. But if she loves you now, what else matters? She's not perfect—you aren't either, and the two of you may never be perfect together but if she can make you laugh, cause you to think twice, and admit to being human and making mistakes, hold onto her and give her the most you can. She may not be thinking about you every second of the day, but she will give you a part of her that she knows you can break—her heart. So don't hurt her, don't change her, don't analyze and don't expect more than she can give. Smile when she makes you happy, let her know when she makes you mad, and miss her when she's not there.”",
"“You may say I'm a dreamer, but I'm not the only one. I hope someday you'll join us. And the world will live as one.”",
'“You never really understand a person until you consider things from his point of view... Until you climb inside of his skin and walk around in it.”',
"“′Classic′ - a book which people praise and don't read.”"}
The line that probably looks surprising at first glance is the if statement: if "No quotes found!" in page.text: break. What is it for? The makers of quotes.toscrape.com decided that pages beyond the last page of quotes should still exist, so an HTTP request for such a page returns the success code 200. If those pages did not exist, the while loop could simply be terminated based on the status code.
We can illustrate this by comparing with another website:
# Why if "No quotes found!" ...?
# Page 9999 exists: this is a peculiarity of quotes.toscrape.com.
page = requests.get("http://quotes.toscrape.com/page/9999")
page.status_code
200
# This page contains just a single sentence.
# Search the string for that sentence: which one is it?
page.text
'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n <link rel="stylesheet" href="/static/bootstrap.min.css">\n <link rel="stylesheet" href="/static/main.css">\n</head>\n<body>\n <div class="container">\n <div class="row header-box">\n <div class="col-md-8">\n <h1>\n <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n </h1>\n </div>\n <div class="col-md-4">\n <p>\n \n <a href="/login">Login</a>\n \n </p>\n </div>\n </div>\n \n\n<div class="row">\n <div class="col-md-8">\n\nNo quotes found!\n\n <nav>\n <ul class="pager">\n \n <li class="previous">\n <a href="/page/9998/"><span aria-hidden="true">←</span> Previous</a>\n </li>\n \n \n </ul>\n </nav>\n </div>\n <div class="col-md-4 tags-box">\n \n <h2>Top Ten tags</h2>\n \n <span class="tag-item">\n <a class="tag" style="font-size: 28px" href="/tag/love/">love</a>\n </span>\n \n <span class="tag-item">\n <a class="tag" style="font-size: 26px" href="/tag/inspirational/">inspirational</a>\n </span>\n \n <span class="tag-item">\n <a class="tag" style="font-size: 26px" href="/tag/life/">life</a>\n </span>\n \n <span class="tag-item">\n <a class="tag" style="font-size: 24px" href="/tag/humor/">humor</a>\n </span>\n \n <span class="tag-item">\n <a class="tag" style="font-size: 22px" href="/tag/books/">books</a>\n </span>\n \n <span class="tag-item">\n <a class="tag" style="font-size: 14px" href="/tag/reading/">reading</a>\n </span>\n \n <span class="tag-item">\n <a class="tag" style="font-size: 10px" href="/tag/friendship/">friendship</a>\n </span>\n \n <span class="tag-item">\n <a class="tag" style="font-size: 8px" href="/tag/friends/">friends</a>\n </span>\n \n <span class="tag-item">\n <a class="tag" style="font-size: 8px" href="/tag/truth/">truth</a>\n </span>\n \n <span class="tag-item">\n <a class="tag" style="font-size: 6px" href="/tag/simile/">simile</a>\n </span>\n \n \n </div>\n</div>\n\n </div>\n <footer class="footer">\n <div class="container">\n 
<p class="text-muted">\n Quotes by: <a href="https://www.goodreads.com/quotes">GoodReads.com</a>\n </p>\n <p class="copyright">\n Made with <span class=\'zyte\'>❤</span> by <a class=\'zyte\' href="https://www.zyte.com">Zyte</a>\n </p>\n </div>\n </footer>\n</body>\n</html>'
# It would be different here, for example:
page = requests.get("https://www.projekt-gutenberg.org/balzac/kurtisa2/chap001.html")
page.status_code # 200
# There is no page 99999
page = requests.get("https://www.projekt-gutenberg.org/balzac/kurtisa2/chap99999.html")
page.status_code # 404
404
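The two stopping conditions compared above (the marker text and the status code) can be combined into one helper. The following is a minimal sketch, not part of the original solution: the names is_last_page, collect_pages, fetch, extract, and max_pages are hypothetical, and fetch is assumed to return a (status_code, html) pair so that the loop can be tried without network access.

```python
def is_last_page(status_code, html, marker="No quotes found!"):
    """Return True when the loop should stop requesting further pages."""
    # Most sites answer a missing page with a non-200 status code (e.g. 404).
    if status_code != 200:
        return True
    # quotes.toscrape.com answers 200 even for empty pages, so the
    # marker text has to be checked as well.
    return marker in html


def collect_pages(fetch, extract, max_pages=1000):
    """Call fetch(page_no) for page 1, 2, ... and gather extract(html) results.

    fetch     -- hypothetical callable returning (status_code, html)
    extract   -- hypothetical callable returning a list of items for one page
    max_pages -- safety limit so a changed website cannot cause an endless loop
    """
    items = []
    for page_no in range(1, max_pages + 1):
        status_code, html = fetch(page_no)
        if is_last_page(status_code, html):
            break
        items.extend(extract(html))
    return items


# Offline demonstration with fake pages instead of real HTTP requests.
fake_site = {1: "quote A|quote B", 2: "quote C", 3: "No quotes found!"}

def fake_fetch(page_no):
    return 200, fake_site.get(page_no, "No quotes found!")

def fake_extract(html):
    return html.split("|")

print(collect_pages(fake_fetch, fake_extract))  # ['quote A', 'quote B', 'quote C']
```

With requests and BeautifulSoup, fetch could wrap requests.get(base_url + str(page_no)) and return (page.status_code, page.text), and extract could return [q.text for q in BeautifulSoup(html, "html.parser").select(".quote > .text")]. The max_pages limit is a design choice of this sketch: a while True loop never terminates if the website changes its "No quotes found!" message.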
The while loop, too, can be used to build a list instead of a set:
%%time
# %%memit
# while loop with a list
page_no = 1
quotes = []
base_url = "http://quotes.toscrape.com/page/"
while True:
    # Build the URL of the current subpage and request it
    scrape_url = base_url + str(page_no)
    page = requests.get(scrape_url)
    # Stop as soon as a page without quotes is reached
    if "No quotes found!" in page.text:
        break
    soup = BeautifulSoup(page.content, "html.parser")
    for quote in soup.select(".quote > .text"):
        quotes.append(quote.text)
    page_no += 1
CPU times: user 115 ms, sys: 1.06 ms, total: 116 ms
Wall time: 1.03 s
quotes
['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
'“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
'“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
'“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
"“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
'“Try not to become a man of success. Rather become a man of value.”',
'“It is better to be hated for what you are than to be loved for what you are not.”',
"“I have not failed. I've just found 10,000 ways that won't work.”",
"“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
'“A day without sunshine is like, you know, night.”',
"“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.”",
'“It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”',
"“If you can't explain it to a six year old, you don't understand it yourself.”",
"“You may not be her first, her last, or her only. She loved before she may love again. But if she loves you now, what else matters? She's not perfect—you aren't either, and the two of you may never be perfect together but if she can make you laugh, cause you to think twice, and admit to being human and making mistakes, hold onto her and give her the most you can. She may not be thinking about you every second of the day, but she will give you a part of her that she knows you can break—her heart. So don't hurt her, don't change her, don't analyze and don't expect more than she can give. Smile when she makes you happy, let her know when she makes you mad, and miss her when she's not there.”",
'“I like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.”',
'“I may not have gone where I intended to go, but I think I have ended up where I needed to be.”',
"“The opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.”",
'“It is not a lack of love, but a lack of friendship that makes unhappy marriages.”',
'“Good friends, good books, and a sleepy conscience: this is the ideal life.”',
'“Life is what happens to us while we are making other plans.”',
'“I love you without knowing how, or when, or from where. I love you simply, without problems or pride: I love you in this way because I do not know any other way of loving but this, in which there is no I or you, so intimate that your hand upon my chest is my hand, so intimate that when I fall asleep your eyes close.”',
'“For every minute you are angry you lose sixty seconds of happiness.”',
'“If you judge people, you have no time to love them.”',
'“Anyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.”',
'“Beauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.”',
'“Today you are You, that is truer than true. There is no one alive who is Youer than You.”',
'“If you want your children to be intelligent, read them fairy tales. If you want them to be more intelligent, read them more fairy tales.”',
'“It is impossible to live without failing at something, unless you live so cautiously that you might as well not have lived at all - in which case, you fail by default.”',
'“Logic will get you from A to Z; imagination will get you everywhere.”',
'“One good thing about music, when it hits you, you feel no pain.”',
"“The more that you read, the more things you will know. The more that you learn, the more places you'll go.”",
'“Of course it is happening inside your head, Harry, but why on earth should that mean that it is not real?”',
'“The truth is, everyone is going to hurt you. You just got to find the ones worth suffering for.”',
'“Not all of us can do great things. But we can do small things with great love.”',
'“To the well-organized mind, death is but the next great adventure.”',
"“All you need is love. But a little chocolate now and then doesn't hurt.”",
"“We read to know we're not alone.”",
'“Any fool can know. The point is to understand.”',
'“I have always imagined that Paradise will be a kind of library.”',
'“It is never too late to be what you might have been.”',
'“A reader lives a thousand lives before he dies, said Jojen. The man who never reads lives only one.”',
'“You can never get a cup of tea large enough or a book long enough to suit me.”',
'“You believe lies so you eventually learn to trust no one but yourself.”',
'“If you can make a woman laugh, you can make her do anything.”',
'“Life is like riding a bicycle. To keep your balance, you must keep moving.”',
'“The real lover is the man who can thrill you by kissing your forehead or smiling into your eyes or just staring into space.”',
"“A wise girl kisses but doesn't love, listens but doesn't believe, and leaves before she is left.”",
'“Only in the darkness can you see the stars.”',
'“It matters not what someone is born, but what they grow to be.”',
'“Love does not begin and end the way we seem to think it does. Love is a battle, love is a war; love is a growing up.”',
'“There is nothing I would not do for those who are really my friends. I have no notion of loving people by halves, it is not my nature.”',
'“Do one thing every day that scares you.”',
'“I am good, but not an angel. I do sin, but I am not the devil. I am just a small girl in a big world trying to find someone to love.”',
'“If I were not a physicist, I would probably be a musician. I often think in music. I live my daydreams in music. I see my life in terms of music.”',
'“If you only read the books that everyone else is reading, you can only think what everyone else is thinking.”',
'“The difference between genius and stupidity is: genius has its limits.”',
"“He's like a drug for you, Bella.”",
'“There is no friend as loyal as a book.”',
'“When one door of happiness closes, another opens; but often we look so long at the closed door that we do not see the one which has been opened for us.”',
"“Life isn't about finding yourself. Life is about creating yourself.”",
"“That's the problem with drinking, I thought, as I poured myself a drink. If something bad happens you drink in an attempt to forget; if something good happens you drink in order to celebrate; and if nothing happens you drink to make something happen.”",
'“You don’t forget the face of the person who was your last hope.”',
"“Remember, we're madly in love, so it's all right to kiss me anytime you feel like it.”",
'“To love at all is to be vulnerable. Love anything and your heart will be wrung and possibly broken. If you want to make sure of keeping it intact you must give it to no one, not even an animal. Wrap it carefully round with hobbies and little luxuries; avoid all entanglements. Lock it up safe in the casket or coffin of your selfishness. But in that casket, safe, dark, motionless, airless, it will change. It will not be broken; it will become unbreakable, impenetrable, irredeemable. To love is to be vulnerable.”',
'“Not all those who wander are lost.”',
'“Do not pity the dead, Harry. Pity the living, and, above all those who live without love.”',
'“There is nothing to writing. All you do is sit down at a typewriter and bleed.”',
'“Finish each day and be done with it. You have done what you could. Some blunders and absurdities no doubt crept in; forget them as soon as you can. Tomorrow is a new day. You shall begin it serenely and with too high a spirit to be encumbered with your old nonsense.”',
'“I have never let my schooling interfere with my education.”',
"“I have heard there are troubles of more than one kind. Some come from ahead and some come from behind. But I've bought a big bat. I'm all ready you see. Now my troubles are going to have troubles with me!”",
'“If I had a flower for every time I thought of you...I could walk through my garden forever.”',
'“Some people never go crazy. What truly horrible lives they must lead.”',
'“The trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.”',
'“Think left and think right and think low and think high. Oh, the thinks you can think up if only you try!”',
"“What really knocks me out is a book that, when you're all done reading it, you wish the author that wrote it was a terrific friend of yours and you could call him up on the phone whenever you felt like it. That doesn't happen much, though.”",
'“The reason I talk to myself is because I’m the only one whose answers I accept.”',
"“You may say I'm a dreamer, but I'm not the only one. I hope someday you'll join us. And the world will live as one.”",
'“I am free of all prejudice. I hate everyone equally. ”',
"“The question isn't who is going to let me; it's who is going to stop me.”",
"“′Classic′ - a book which people praise and don't read.”",
'“Anyone who has never made a mistake has never tried anything new.”',
"“A lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.”",
'“Remember, if the time should come when you have to make a choice between what is right and what is easy, remember what happened to a boy who was good, and kind, and brave, because he strayed across the path of Lord Voldemort. Remember Cedric Diggory.”',
'“I declare after all there is no enjoyment like reading! How much sooner one tires of any thing than of a book! -- When I have a house of my own, I shall be miserable if I have not an excellent library.”',
'“There are few people whom I really love, and still fewer of whom I think well. The more I see of the world, the more am I dissatisfied with it; and every day confirms my belief of the inconsistency of all human characters, and of the little dependence that can be placed on the appearance of merit or sense.”',
'“Some day you will be old enough to start reading fairy tales again.”',
'“We are not necessarily doubting that God will do the best for us; we are wondering how painful the best will turn out to be.”',
'“The fear of death follows from the fear of life. A man who lives fully is prepared to die at any time.”',
'“A lie can travel half way around the world while the truth is putting on its shoes.”',
'“I believe in Christianity as I believe that the sun has risen: not only because I see it, but because by it I see everything else.”',
'“The truth." Dumbledore sighed. "It is a beautiful and terrible thing, and should therefore be treated with great caution.”',
"“I'm the one that's got to die when it's time for me to die, so let me live my life the way I want to.”",
'“To die will be an awfully big adventure.”',
'“It takes courage to grow up and become who you really are.”',
'“But better to get hurt by the truth than comforted with a lie.”',
'“You never really understand a person until you consider things from his point of view... Until you climb inside of his skin and walk around in it.”',
'“You have to write the book that wants to be written. And if the book will be too difficult for grown-ups, then you write it for children.”',
'“Never tell the truth to people who are not worthy of it.”',
"“A person's a person, no matter how small.”",
'“... a mind needs books as a sword needs a whetstone, if it is to keep its edge.”']
Instead of terminating the while loop when “No quotes found!” appears on the page, as we did before, we can also use the Next button at the bottom right of the page to formulate the termination condition. The Next button exists only on pages that are followed by another page of quotes; on the last page it is missing. We can therefore construct a while loop that first extracts the content of the first page and then the content of every subpage that is reached through an element with the class “next”. The element with the class “next” has an a element as its child, whose “href” attribute holds the relative path of the next subpage, for example “/page/2”, “/page/3”, and so on. In each iteration we check whether the current page contains an a element that is a child of an element with the class “next”. If no such element is found, the loop is terminated with a break statement.
%%time
# %%memit
base_url = "http://quotes.toscrape.com"
current_url = base_url
quotes = []
while True:
    page = requests.get(current_url)
    soup = BeautifulSoup(page.content, "html.parser")
    for quote in soup.select(".quote > .text"):
        quotes.append(quote.text)
    next_elem = soup.select_one(".next > a")
    if next_elem is not None:
        next_path = next_elem['href']  # access the value of the "href" attribute
        current_url = base_url + next_path
    else:
        break
CPU times: user 88.7 ms, sys: 8.95 ms, total: 97.6 ms
Wall time: 619 ms
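To see how the termination condition behaves without sending any requests, the selector can be tried on two small hand-written HTML fragments. The sketch below is only an illustration: the fragments `page_with_next` and `last_page` are made up to mimic the pager of the site, `next_url()` is a hypothetical helper, and `urljoin` from Python's standard library is used here as one possible way to combine the base URL with the relative path.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical, simplified fragments that imitate the pager at the bottom of a page
page_with_next = '<ul class="pager"><li class="next"><a href="/page/2/">Next</a></li></ul>'
last_page = '<ul class="pager"><li class="previous"><a href="/page/9/">Previous</a></li></ul>'

base_url = "http://quotes.toscrape.com"


def next_url(html, base_url):
    """Return the absolute URL of the next page, or None if there is no Next button."""
    soup = BeautifulSoup(html, "html.parser")
    next_elem = soup.select_one(".next > a")
    if next_elem is None:
        return None
    return urljoin(base_url, next_elem["href"])


print(next_url(page_with_next, base_url))  # http://quotes.toscrape.com/page/2/
print(next_url(last_page, base_url))       # None
```

When `select_one()` finds no match it returns `None`, which is exactly the case the `else: break` branch of the loop handles.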
7.2.4. Extracting quotes from all pages, with metadata
7.2.4.1. Solution with a for loop
%%time
# %%memit
base_url = 'http://quotes.toscrape.com/page/'
quotes_dict = {"Text": [], "Author": [], "Tags": []}
for i in range(1, 11):
    scrape_url = base_url + str(i)
    page = requests.get(scrape_url)
    soup = BeautifulSoup(page.content, "html.parser")
    quotes_divs = soup.select(".quote")
    for div in quotes_divs:
        quotes_dict["Text"].append(div.select_one(".text").get_text())  # or .text
        quotes_dict["Author"].append(div.select_one(".author").get_text())
        # Selects the a elements that are children of the element with the class "tags".
        # Tags from the "Top Ten tags" list are not matched, because we only search inside the current quote div.
        quote_tags = div.select(".tags > a")
        tags_text = []
        for tag in quote_tags:
            tags_text.append(tag.get_text())
        quotes_dict["Tags"].append(tags_text)
quotes_df = pd.DataFrame.from_dict(quotes_dict)
CPU times: user 120 ms, sys: 330 µs, total: 121 ms
Wall time: 883 ms
quotes_df
| | Text | Author | Tags |
|---|---|---|---|
| 0 | “The world as we have created it is a process ... | Albert Einstein | [change, deep-thoughts, thinking, world] |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | [abilities, choices] |
| 2 | “There are only two ways to live your life. On... | Albert Einstein | [inspirational, life, live, miracle, miracles] |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen | [aliteracy, books, classic, humor] |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe | [be-yourself, inspirational] |
| ... | ... | ... | ... |
| 95 | “You never really understand a person until yo... | Harper Lee | [better-life-empathy] |
| 96 | “You have to write the book that wants to be w... | Madeleine L'Engle | [books, children, difficult, grown-ups, write,... |
| 97 | “Never tell the truth to people who are not wo... | Mark Twain | [truth] |
| 98 | “A person's a person, no matter how small.” | Dr. Seuss | [inspirational] |
| 99 | “... a mind needs books as a sword needs a whe... | George R.R. Martin | [books, mind] |
100 rows × 3 columns
7.2.4.2. Solution with a while loop
%%time
# %%memit
page_no = 1
quotes_dict = {"Text": [], "Author": [], "Tags": []}
base_url = "http://quotes.toscrape.com/page/"
while True:
    scrape_url = base_url + str(page_no)
    page = requests.get(scrape_url)
    if "No quotes found!" in page.text:
        break
    soup = BeautifulSoup(page.content, "html.parser")
    quote_divs = soup.select(".quote")
    for div in quote_divs:
        quotes_dict["Text"].append(div.select_one(".text").get_text())
        quotes_dict["Author"].append(div.select_one(".author").get_text())
        quote_tags = div.select(".tags > a")
        tags_text = []
        for tag in quote_tags:
            tags_text.append(tag.get_text())
        quotes_dict["Tags"].append(tags_text)
    page_no += 1
quotes_df = pd.DataFrame.from_dict(quotes_dict)
CPU times: user 139 ms, sys: 4 ms, total: 143 ms
Wall time: 1.06 s
quotes_df
| | Text | Author | Tags |
|---|---|---|---|
| 0 | “The world as we have created it is a process ... | Albert Einstein | [change, deep-thoughts, thinking, world] |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | [abilities, choices] |
| 2 | “There are only two ways to live your life. On... | Albert Einstein | [inspirational, life, live, miracle, miracles] |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen | [aliteracy, books, classic, humor] |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe | [be-yourself, inspirational] |
| ... | ... | ... | ... |
| 95 | “You never really understand a person until yo... | Harper Lee | [better-life-empathy] |
| 96 | “You have to write the book that wants to be w... | Madeleine L'Engle | [books, children, difficult, grown-ups, write,... |
| 97 | “Never tell the truth to people who are not wo... | Mark Twain | [truth] |
| 98 | “A person's a person, no matter how small.” | Dr. Seuss | [inspirational] |
| 99 | “... a mind needs books as a sword needs a whe... | George R.R. Martin | [books, mind] |
100 rows × 3 columns
Alternative solution that again uses the Next button to formulate the termination condition:
%%time
# %%memit
base_url = "http://quotes.toscrape.com"
current_url = base_url
quotes_dict = {"Text": [], "Author": [], "Tags": []}
while True:
    page = requests.get(current_url)
    soup = BeautifulSoup(page.content, "html.parser")
    quote_divs = soup.select(".quote")
    for div in quote_divs:
        quotes_dict["Text"].append(div.select_one(".text").get_text())
        quotes_dict["Author"].append(div.select_one(".author").get_text())
        quote_tags = div.select(".tags > a")
        tags_text = []
        for tag in quote_tags:
            tags_text.append(tag.get_text())
        quotes_dict["Tags"].append(tags_text)
    next_elem = soup.select_one(".next > a")
    if next_elem is not None:
        next_path = next_elem['href']  # access the value of the "href" attribute
        current_url = base_url + next_path
    else:
        break
quotes_df = pd.DataFrame.from_dict(quotes_dict)
CPU times: user 135 ms, sys: 7.61 ms, total: 143 ms
Wall time: 667 ms
In all solutions we nested two for loops inside each other in order to extract the tags of each quote as a list. The inner loop, for tag in quote_tags ..., can also be replaced by a more compact and slightly more efficient construct called a “list comprehension”. We will learn what exactly that is in the next session. This is what the while loop looks like with a list comprehension:
%%time
# %%memit
base_url = "http://quotes.toscrape.com"
current_url = base_url
quotes_dict = {"Text": [], "Author": [], "Tags": []}
while True:
    page = requests.get(current_url)
    soup = BeautifulSoup(page.content, "html.parser")
    quote_divs = soup.select(".quote")
    for div in quote_divs:
        quotes_dict["Text"].append(div.select_one(".text").get_text())
        quotes_dict["Author"].append(div.select_one(".author").get_text())
        # The list comprehension replaces the inner for loop over the tag elements
        tags_text = [tag.get_text() for tag in div.select(".tags > a")]
        quotes_dict["Tags"].append(tags_text)
    next_elem = soup.select_one(".next > a")
    if next_elem is not None:
        next_path = next_elem['href']  # access the value of the "href" attribute
        current_url = base_url + next_path
    else:
        break
quotes_df = pd.DataFrame.from_dict(quotes_dict)
CPU times: user 158 ms, sys: 1.25 ms, total: 159 ms
Wall time: 694 ms
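That the inner for loop and the list comprehension produce the same list can be checked on a plain list of strings, without any scraping. The following is a small sketch; the list `raw_tags` is invented sample data, and `str.strip()` merely stands in for `tag.get_text()`.

```python
# Invented sample data standing in for the tag elements of one quote
raw_tags = [" change ", "deep-thoughts", " thinking", "world "]

# Variant 1: explicit for loop that appends to an initially empty list
tags_loop = []
for tag in raw_tags:
    tags_loop.append(tag.strip())

# Variant 2: list comprehension that builds the same list in one expression
tags_comprehension = [tag.strip() for tag in raw_tags]

print(tags_loop == tags_comprehension)  # True
print(tags_comprehension)  # ['change', 'deep-thoughts', 'thinking', 'world']
```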
7.2.4.3. Solution with functions
Finally, a solution using function definitions is also conceivable:
def get_soup(url):
    """
    Arguments: `url` (string): The URL of the web page.
    Return value: `soup` (BeautifulSoup object): The BeautifulSoup object that represents the parsed HTML content of the web page.
    This function retrieves the HTML content of the web page at the given URL with the requests.get() function. A BeautifulSoup object is then created by parsing the HTML content with the "html.parser" parser. The resulting BeautifulSoup object is returned.
    """
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    return soup
def extract_quotes(soup, quotes_dict):
    """
    Arguments:
    `soup` (BeautifulSoup object): The BeautifulSoup object that represents the parsed HTML content of a web page.
    `quotes_dict` (dictionary): A dictionary that holds the quotes.
    Return value:
    `quotes_dict` (dictionary): The updated dictionary that contains the extracted quotes.
    The function extracts quotes, authors and tags from a BeautifulSoup object. To do so, it selects the HTML elements with the corresponding classes and adds the extracted information to the quotes_dict dictionary, which is then returned.
    """
    quote_divs = soup.select(".quote")
    for div in quote_divs:
        quotes_dict["Text"].append(div.select_one(".text").get_text())
        quotes_dict["Author"].append(div.select_one(".author").get_text())
        tags_text = [tag.get_text() for tag in div.select(".tags > a")]
        quotes_dict["Tags"].append(tags_text)
    return quotes_dict
from urllib.parse import urljoin

def scrape_all_quotes(url, quotes_dict=None):
    """
    Arguments:
    `url` (string): The URL of the web page from which the quotes are to be collected.
    `quotes_dict` (dictionary, optional): A dictionary that holds the quotes. If it is not given, a new dictionary is created.
    Return value:
    `quotes_df` (DataFrame): A pandas DataFrame that contains the collected quotes.
    The function collects quotes from the given web page and all of its subpages. If no quotes_dict dictionary is provided, a new dictionary with empty lists is created. The BeautifulSoup object is retrieved with get_soup(), and extract_quotes() extracts the quotes and updates the dictionary. If there is no next page (determined by the absence of the ".next" element), a pandas DataFrame (quotes_df) is created from quotes_dict and returned. Otherwise the URL of the next page is built and scrape_all_quotes() is called recursively with the URL of the next page and the current quotes_dict.
    """
    if quotes_dict is None:
        quotes_dict = {"Text": [], "Author": [], "Tags": []}
    soup = get_soup(url)
    quotes_dict = extract_quotes(soup, quotes_dict)
    next_elem = soup.select_one(".next > a")
    if next_elem is not None:
        next_path = next_elem['href']  # access the value of the "href" attribute
        # Resolve the relative path against the current URL instead of relying on a global base_url
        next_url = urljoin(url, next_path)
        return scrape_all_quotes(next_url, quotes_dict)
    else:
        quotes_df = pd.DataFrame.from_dict(quotes_dict)
        return quotes_df
%%time
# %%memit
# Call the function
quotes_df = scrape_all_quotes("https://quotes.toscrape.com")
CPU times: user 123 ms, sys: 0 ns, total: 123 ms
Wall time: 684 ms
quotes_df
| | Text | Author | Tags |
|---|---|---|---|
| 0 | “The world as we have created it is a process ... | Albert Einstein | [change, deep-thoughts, thinking, world] |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | [abilities, choices] |
| 2 | “There are only two ways to live your life. On... | Albert Einstein | [inspirational, life, live, miracle, miracles] |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen | [aliteracy, books, classic, humor] |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe | [be-yourself, inspirational] |
| ... | ... | ... | ... |
| 95 | “You never really understand a person until yo... | Harper Lee | [better-life-empathy] |
| 96 | “You have to write the book that wants to be w... | Madeleine L'Engle | [books, children, difficult, grown-ups, write,... |
| 97 | “Never tell the truth to people who are not wo... | Mark Twain | [truth] |
| 98 | “A person's a person, no matter how small.” | Dr. Seuss | [inspirational] |
| 99 | “... a mind needs books as a sword needs a whe... | George R.R. Martin | [books, mind] |
100 rows × 3 columns
A few things need to be kept in mind when defining functions for web scraping:
The function scrape_all_quotes() is defined so that it replaces the while loop. To achieve this, a principle called recursion is applied: a function calls itself as long as a certain condition is met. As its return value, the function returns the result of another call to itself. Once the condition is no longer met, the DataFrame quotes_df is returned instead. This is why quotes_dict is passed as an additional argument on every recursive call: it allows the dictionary to be filled further with each call. It is important to note, however, that Python limits how often a function may call itself (the “maximum recursion depth”, 1000 nested calls by default in CPython). If this limit is exceeded, Python raises a RecursionError; the limit exists to protect against a serious error known as a “stack overflow”. Caution is therefore advised when using recursion for web scraping! We are still far from exceeding this number when scraping quotes.toscrape.com, but for larger web scraping projects this is a problem that must be taken into account.
More information on recursion in Python can be found here.
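The recursion limit can be observed directly with a toy function. The sketch below does not scrape anything; `count_pages_recursive()` and `count_pages_iterative()` are hypothetical helpers that merely count down from a given number of pages, once recursively and once with a loop.

```python
import sys


def count_pages_recursive(pages_left):
    """Visit one 'page' per call and recurse until no pages are left."""
    if pages_left == 0:
        return 0
    return 1 + count_pages_recursive(pages_left - 1)


def count_pages_iterative(pages_left):
    """Same result with a while loop; the call stack does not grow."""
    visited = 0
    while pages_left > 0:
        visited += 1
        pages_left -= 1
    return visited


limit = sys.getrecursionlimit()
too_many_pages = limit + 100  # more nested calls than Python allows

try:
    count_pages_recursive(too_many_pages)
    recursion_failed = False
except RecursionError:
    recursion_failed = True

print(recursion_failed)  # True
print(count_pages_iterative(too_many_pages) == too_many_pages)  # True
```

For a site with many more subpages than the recursion limit allows, a loop-based solution like the while loops shown earlier would therefore be the safer choice.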
Another aspect to consider when defining functions is the use of default arguments. Default arguments are values that are specified for a parameter in the function definition itself. These values are used automatically when no explicit value is passed for the corresponding argument in the function call. At first glance it might seem intuitive to set an empty dictionary as the default argument for quotes_dict in scrape_all_quotes(), which is then filled in the subsequent calls. However, this is not a good idea: Python evaluates default values only once, when the function is defined. If a default argument has a mutable data type, the same object is therefore “carried over” to every later call instead of being reset to an empty default. This means that the quotes_df DataFrame would grow with every new call of the function, because quotes_dict would contain not only the elements of the current call but also those of all previous calls. Instead of setting the dictionary directly as the default argument, None should be used as the default value. The dictionary can then be created with a conditional statement in the function body, namely exactly when quotes_dict is None at the time of the call. This is only the case in the first, non-recursive call.
More information on handling default arguments can be found here.
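The effect of a mutable default argument can be reproduced in a few lines. The two functions below are hypothetical toy examples, not part of the scraper; they only contrast a list as default value with the None pattern described above.

```python
def collect_bad(item, items=[]):
    """Mutable default: the same list object is reused on every call."""
    items.append(item)
    return items


def collect_good(item, items=None):
    """None as default: a fresh list is created whenever none is passed."""
    if items is None:
        items = []
    items.append(item)
    return items


first_bad = list(collect_bad("a"))   # copy the result so later calls cannot change it
second_bad = list(collect_bad("b"))

first_good = collect_good("a")
second_good = collect_good("b")

print(first_bad, second_bad)    # ['a'] ['a', 'b']
print(first_good, second_good)  # ['a'] ['b']
```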
7.2.5. Writing a pandas DataFrame to an Excel file
Finally, we want to save the extracted data on our computer. There are various methods for saving a pandas DataFrame. The .to_excel() method, for example, saves a DataFrame in an Excel spreadsheet, that is, in a file with the extension .xlsx. In this context, “saving” is also called “writing”: data is written to a file.
Under the hood, the .to_excel() method relies on a package called openpyxl. That is why we had to install this package at the beginning.
Dokumentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html
# quotes_df.to_excel("quotes_df.xlsx", index=False) # default encoding is UTF-8
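Because the Tags column holds Python lists, it can be convenient to turn each list into a single string before writing the file, so that every Excel cell contains plain text. The sketch below uses a tiny hand-made DataFrame in place of `quotes_df`; the column name `Tags_joined` and the file name are assumptions, and the `to_excel()` call is left commented out, as in the cell above.

```python
import pandas as pd

# Small hand-made stand-in for quotes_df
sample_df = pd.DataFrame({
    "Text": ["“Quote one.”", "“Quote two.”"],
    "Author": ["Author A", "Author B"],
    "Tags": [["books", "mind"], ["truth"]],
})

# Join each list of tags into one comma-separated string
sample_df["Tags_joined"] = sample_df["Tags"].apply(", ".join)

print(sample_df["Tags_joined"].tolist())  # ['books, mind', 'truth']

# sample_df.drop(columns="Tags").to_excel("quotes_sample.xlsx", index=False)
```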
7.2.6. %%time and %%memit: what are they?
At the beginning of each code cell in this Jupyter notebook we added two lines:
%%time measures the run time of a Jupyter notebook code cell.
%%memit measures how much memory is needed to execute the code cell.
This lets us compare which solution is the most efficient. As always, more information can be found in the documentation pages: