I’m working on a Python exercise and need support.
Using the code on Harvesting the Web Intro
1. Scrape Marymount’s website for any news stories and display their title in the console.
Harvesting the Web Intro
In many jobs you will be required to perform some simple data analysis to answer an open question your company or boss may have. You can see this in our first Data assignment, for instance, if your company wanted to quantify how secure its users were being with their passwords.
Often the data is not as readily available as a nice .txt or .csv file. You need to go out and collect the data. If it is a page or two, you may spend an hour and just go collect the data manually. If you need to repeat the same steps across thousands of sites, this becomes a larger issue. For instance, if I want to collect all the recipe names on a recipe website, I could go through and copy and paste a few thousand times, or we could write a Python program to do it for us.
We will start off this week using the simplest of the libraries, which is requests. This library simulates a web call, such as going to cnn.com, and gives you the packet of data that is sent back.
Let’s start with importing the library
This library may not be installed on your system, so you will need to either run 'pip install requests' or, if you are using PyCharm, install the library through the GUI. For how to do this, see https://www.jetbrains.com/help/pycharm/installing-uninstalling-and-upgrading-packages.html
Once we have the library we can make our first web call.
We’ll look at two things first, the headers and the status:
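A minimal sketch of such a call might look like the following. It imports requests, sends a GET request, and prints the status code and headers. The helper names fetch and summarize and the 10-second timeout are assumptions made for illustration; they are not from the tutorial.

```python
import requests


def fetch(url, timeout=10):
    """Send a GET request and return the requests.Response object.

    The timeout value is an arbitrary choice for this sketch.
    """
    return requests.get(url, timeout=timeout)


def summarize(response):
    """Print the status code and every response header.

    Returns True when the status code is 200, False otherwise.
    """
    print("Status:", response.status_code)
    for name, value in response.headers.items():
        print(f"{name}: {value}")
    return response.status_code == 200
```

Calling summarize(fetch(some_url)) on a page you are allowed to scrape would print the status line followed by whatever headers that server sends back.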
The headers give a lot of information about the request/response, such as encodings. Some of this information will be used in advanced scripting. The status code is more immediately useful to us, as it tells us whether we loaded the page correctly. A code of 200 means success.
You can see all of the possible status codes here: https://docs.python.org/3/library/http.html
Some you should know, such as 404 Not Found.
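To make the status codes concrete, here is a small sketch that uses the standard library's http.HTTPStatus enum (documented at the link above) to turn a numeric code into its standard phrase. The function name describe_status is a hypothetical helper, not something from the tutorial.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a label such as '200 OK' for a numeric HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # HTTPStatus raises ValueError for numbers it does not define
        return f"{code} (not a standard status code)"
    return f"{status.value} {status.phrase}"


for code in (200, 404, 500):
    print(describe_status(code))
# prints:
# 200 OK
# 404 Not Found
# 500 Internal Server Error
```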
Ultimately what we are looking for is some text or attribute of the website. In order to get that we need the text.
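One way to get at that text, sketched under the assumption that we only want the body when the request succeeded, is to read the response's text attribute after checking the status. The function name get_page_text and the timeout are illustrative choices, not part of the tutorial.

```python
import requests


def get_page_text(url, timeout=10):
    """Return the page body as a string, or raise if the request failed."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError on 4xx/5xx responses
    response.raise_for_status()
    # .text is the response body decoded to a str
    return response.text
```

The returned string is the raw HTML of the page, which is what the pattern matching later in this section operates on.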
For instance, if we wanted to scrape Marymount's site for all events happening, we would look at the page and the text given to us and see these lines in the HTML:
<a href="/Home/News-Events/Calendar-of-Events/Event?eventid=1719"><h6>Sciences, Math, and Education Webinar</h6></a>
<a href="/Home/News-Events/Calendar-of-Events/Event?eventid=1744"><h6>Virtual Pizza and Conversation- Life transitions. Wh…</h6></a>
What pattern do these have? The events are inside an H6. That is good but there are other H6 tags on the page such as
<a href="/Home/saints-on-the-go/news?newsId=52"><h6>MU student named 2020 Honors Scholar of the Year</h6></a></br>
and that is a news item not an event.
So we could define a pattern that is an h6 preceded by the word eventid.
for event in matches:
    print(event)
This regular expression captures the events going on at Marymount whose titles fit that pattern. The regular expression breaks down as follows:
- eventid=\d+ This means the literal string eventid= followed by one or more digits
- "><h6> this is the end of the event link's opening tag and the start of the h6, matched literally in the text
- ([\w,\s]+)</h6> this looks for any combination of word characters (letters, digits, underscore), commas, and whitespace after the preceding <h6>; the parentheses tell the regular expression to return this value as a group. The match must then end with </h6>, which is the end of our pattern. Note that a title containing other characters, such as the hyphen and period in the second sample line, will not match this character class.
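Putting the three pieces together, a sketch of the pattern in use could look like this. The sample HTML is adapted from the lines quoted earlier; the names SAMPLE_HTML and EVENT_PATTERN are illustrative, and the pattern is assembled from the breakdown above rather than copied from the original listing.

```python
import re

# Sample lines adapted from the HTML excerpt in the tutorial
SAMPLE_HTML = '''
<a href="/Home/News-Events/Calendar-of-Events/Event?eventid=1719"><h6>Sciences, Math, and Education Webinar</h6></a>
<a href="/Home/News-Events/Calendar-of-Events/Event?eventid=1744"><h6>Virtual Pizza and Conversation- Life transitions. Wh…</h6></a>
<a href="/Home/saints-on-the-go/news?newsId=52"><h6>MU student named 2020 Honors Scholar of the Year</h6></a></br>
'''

# eventid=\d+   literal "eventid=" followed by one or more digits
# "><h6>        end of the link's opening tag, start of the h6
# ([\w,\s]+)    captured title: word characters, commas, whitespace only
# </h6>         closing tag that ends the pattern
EVENT_PATTERN = re.compile(r'eventid=\d+"><h6>([\w,\s]+)</h6>')

# With a single group, findall returns a list of the captured titles
matches = EVENT_PATTERN.findall(SAMPLE_HTML)
for event in matches:
    print(event)
# prints: Sciences, Math, and Education Webinar
```

Only the first event prints here. The second title contains a hyphen and a period, which fall outside [\w,\s], and the news item has no eventid, so it is skipped as intended.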
As you can see, detecting these patterns can be tricky, but it is feasible. This makes web scraping and data collection a unique but very useful skill in the new economy that we are in.
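For question 1, the same idea can be pointed at news stories instead of events. The sketch below assumes that news links on the page look like the sample news line quoted earlier, that is, a newsId=<number> in the href followed by an h6 title. The real page may differ, so check its HTML first. It also swaps the character class for [^<]+ (anything up to the next tag) so that titles with punctuation still match; that is a deliberate change from the tutorial's pattern. The names NEWS_PATTERN and extract_news_titles are hypothetical.

```python
import re

# Assumption: news links contain newsId=<digits> and wrap the title in <h6>
NEWS_PATTERN = re.compile(r'newsId=\d+"><h6>([^<]+)</h6>')


def extract_news_titles(html):
    """Return the list of news titles found in an HTML string."""
    return [title.strip() for title in NEWS_PATTERN.findall(html)]


# Sample lines adapted from the tutorial's HTML excerpt
SAMPLE_HTML = '''
<a href="/Home/News-Events/Calendar-of-Events/Event?eventid=1719"><h6>Sciences, Math, and Education Webinar</h6></a>
<a href="/Home/saints-on-the-go/news?newsId=52"><h6>MU student named 2020 Honors Scholar of the Year</h6></a></br>
'''

for title in extract_news_titles(SAMPLE_HTML):
    print(title)
# prints: MU student named 2020 Honors Scholar of the Year
```

To run this against the live site, pass the page's HTML (for example the text attribute of a requests response) to extract_news_titles in place of SAMPLE_HTML.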
2. Fix the error in the code
- Rewrite Assignment 1: Dog Years to be a function called calculate_dog_years()
- It should take 3 parameters (firstname, lastname, age)
- firstname will default to John, lastname will default to Doe, and age will default to 18
- It should print out the same results as Lab 1 except this time enforce that the first letter of both firstname and last name are capitalized. So nathan becomes Nathan and green becomes Green
- These commands should work
- calculate_dog_years(lastname="Green", firstname="nathan", age=37)
# Fixes: straight quotes, age converted to int, age_in_dog_years defined before it is used
first_name = input('Please enter your first name: ')
last_name = input('Please enter your last name: ')
your_age = int(input('Please enter your age: '))
# Assumption: Lab 1 used the common 7-dog-years-per-human-year rule; adjust if it differed
age_in_dog_years = your_age * 7
print("Hi", first_name.capitalize(), last_name.capitalize(), "you may be", your_age, "years old but in dog years you are", age_in_dog_years, "old so get busy living!")
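The function version asked for in the bullets above could be sketched as follows. The multiplier of 7 is an assumption, since the original Lab 1 formula is not shown here. Returning the message in addition to printing it is an extra convenience for testing, not a stated requirement.

```python
# Assumption: one human year counts as 7 dog years (Lab 1's formula is not shown)
DOG_YEARS_PER_HUMAN_YEAR = 7


def calculate_dog_years(firstname="John", lastname="Doe", age=18):
    """Print (and return) the dog-years greeting with capitalized names."""
    first = firstname.capitalize()  # "nathan" becomes "Nathan"
    last = lastname.capitalize()    # "green" becomes "Green"
    age_in_dog_years = int(age) * DOG_YEARS_PER_HUMAN_YEAR
    message = (
        f"Hi {first} {last} you may be {age} years old "
        f"but in dog years you are {age_in_dog_years} old so get busy living!"
    )
    print(message)
    return message


# Keyword arguments can be given in any order
calculate_dog_years(lastname="Green", firstname="nathan", age=37)
# Calling with no arguments uses the defaults John, Doe, 18
calculate_dog_years()
```

Because every parameter has a default, the function works with all, some, or none of the arguments supplied.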