Extra credit (simple Python)

I’m working on a Python exercise and need support.

Using the code on Harvesting the Web Intro

1. Scrape Marymount’s website for any news stories and display their title in the console.

Harvesting the Web Intro

In many jobs you will be required to perform some simple data analysis to answer an open question your company or boss may have. You saw this in our first Data assignment, for instance, where a company wanted to quantify how secure their users were being with their passwords.

Often the data is not as readily available as a nice .txt or .csv file. You need to go out and collect the data. If it is a page or two, you may spend an hour and just go collect the data manually. If you need to repeat some behavior across thousands of sites, this becomes a larger issue. For instance, if I want to collect all the recipe names on a recipe website, I could go through and copy and paste a few thousand times, or … we could write a Python program to do it for us.

We will start off this week using the simplest of the libraries, which is requests. This library makes a web call, such as going to cnn.com, and gives you back the packet of data that the server sends in reply.

Let’s start with importing the library

import requests

This library may not be installed on your system, so you will need to either run 'pip install requests' or, if you are using PyCharm, install the library through the GUI. For how to do this, see https://www.jetbrains.com/help/pycharm/installing-uninstalling-and-upgrading-packages.html

Once we have the library we can make our first web call.

response=requests.get('http://www.marymount.edu')

We are doing two things in this call. First, our library goes to the Marymount webpage and returns the HTML, JavaScript, and full network response. Second, we assign that response to a variable called 'response'.

We’ll look at two things first, the headers and the status:

print(response.headers)
print(response.status_code)

The headers give a lot of information about the request/response, such as encodings. Some of this information will be used in advanced scripting. The status code is more immediately useful for us, as it tells us whether we correctly loaded the page. A code of 200 means success.

You can see all of the possible status codes here: https://docs.python.org/3/library/http.html

Some you should know, such as 404 Not Found.
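As a small illustration of the status codes discussed above, Python's standard http module (the page linked above) exposes them as HTTPStatus, so you can look up what a number means instead of memorizing the table. The helper name describe_status is made up for this sketch; with a real call you would pass it response.status_code.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short, human-readable description of an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # Not one of the codes the standard library knows about
        return f"{code} Unknown status"
    return f"{status.value} {status.phrase}"


print(describe_status(200))  # 200 OK
print(describe_status(404))  # 404 Not Found
```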

Ultimately what we are looking for is some text or attribute of the website. In order to get that we need the text.

print(response.text)

This will return a whole mess of text. If you have had web programming it should look familiar; if not, it may look like Greek. What is returned is anything that is considered ‘client-side’ code. This includes mainly HTML, JavaScript, and CSS. This code is what your web browser (Chrome, Firefox, Safari, ...) needs in order to render a screen. Unfortunately, web systems do not know the inherent meaning behind text, since every developer styles and structures their page differently. So in the end we get a blob of text and code. HTML does have a structure, though, so in order to get some information we just need to look for patterns … which should make you think of regular expressions!

For instance, if we wanted to scrape Marymount’s site for all events happening, we would look at the page and the text given to us and see this line in the HTML:

<a href="/Home/News-Events/Calendar-of-Events/Event?eventid=1719"><h6>Sciences, Math, and Education Webinar</h6></a>

and another

<a href="/Home/News-Events/Calendar-of-Events/Event?eventid=1744"><h6>Virtual Pizza and Conversation- Life transitions. Wh…</h6></a>

What pattern do these have? The events are inside an H6. That is good but there are other H6 tags on the page such as

<a href="/Home/saints-on-the-go/news?newsId=52"><h6>MU student named 2020 Honors Scholar of the Year</h6></a></br>

and that is a news item not an event.

So we could define a pattern that is an h6 preceded by the word eventid:

import re
# \d+ matches the event number; the parentheses capture the title inside the h6
matches=re.findall(r'eventid=\d+"><h6>([\w,\s]+)</h6>',response.text)
for event in matches:
    print(event)

This regular expression captures all the events going on at Marymount according to that pattern. The regular expression breaks down as follows

  • eventid=\d+ This means the literal string eventid= followed by one or more digits.
  • "><h6> This is the end of the event's link tag and the start of the h6, matched literally in the text.
  • ([\w,\s]+)</h6> This looks for any combination of letters, digits, spaces, and commas after the previous <h6>; the parentheses tell the regular expression to return this value via grouping. After this group the text must end with </h6>, which is the end of our pattern.
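To make the breakdown above concrete, here is a minimal sketch that runs the pattern against the three sample lines quoted in this tutorial. It also shows something worth noticing: the character class does not allow hyphens or periods, so the second sample event is skipped. The looser pattern stored in loose is an assumption of this sketch, not part of the tutorial.

```python
import re

# Sample markup copied from the three example lines in this tutorial
sample_html = '''
<a href="/Home/News-Events/Calendar-of-Events/Event?eventid=1719"><h6>Sciences, Math, and Education Webinar</h6></a>
<a href="/Home/News-Events/Calendar-of-Events/Event?eventid=1744"><h6>Virtual Pizza and Conversation- Life transitions. Wh…</h6></a>
<a href="/Home/saints-on-the-go/news?newsId=52"><h6>MU student named 2020 Honors Scholar of the Year</h6></a></br>
'''

# The tutorial's pattern: titles made only of word characters, commas, and whitespace
strict = re.findall(r'eventid=\d+"><h6>([\w,\s]+)</h6>', sample_html)

# Assumed looser variant (not from the tutorial): accept anything up to the next tag
loose = re.findall(r'eventid=\d+"><h6>([^<]+)</h6>', sample_html)

print(strict)  # only the webinar title
print(loose)   # both event titles; the news item is excluded either way
```

Whether the stricter or looser character class is the better choice depends on what the titles on the live page actually contain.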

As you can see, detecting these patterns can be tricky, but it is feasible. This makes web scraping and data collection a unique but very useful skill in the new economy that we are in.
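The same approach could be pointed at the news stories asked for in question 1. The sketch below assumes news links still look like the newsId example quoted earlier; the function name extract_news_titles and the pattern are illustrative guesses, and the live page's markup may have changed.

```python
import re


def extract_news_titles(html):
    """Return the text of every <h6> that directly follows a newsId link."""
    # Assumed pattern, modeled on the news example in the tutorial
    return re.findall(r'newsId=\d+"><h6>([^<]+)</h6>', html)


# Sample markup copied from the news example in the tutorial
sample_html = (
    '<a href="/Home/saints-on-the-go/news?newsId=52">'
    '<h6>MU student named 2020 Honors Scholar of the Year</h6></a></br>'
)

for title in extract_news_titles(sample_html):
    print(title)
```

Against the real site you would pass in response.text from requests.get('http://www.marymount.edu') instead of sample_html, after checking the page source to confirm the news links still follow this shape.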

______________________________________________________________________________

2. Fix the error in the code

  • Rewrite Assignment 1: Dog Years to be a function called calculate_dog_years()
  • It should take 3 parameters (firstname, lastname, age)
  • First name will have a default of John, Last Name will default to Doe, and Age will Default to 18
  • It should print out the same results as Lab 1 except this time enforce that the first letter of both firstname and last name are capitalized. So nathan becomes Nathan and green becomes Green
  • These commands should work
    • calculate_dog_years("Nathan","Green",37)
    • calculate_dog_years(lastname="Green", firstname="nathan",age=37)
    • calculate_dog_years(age=21)

Code:

def calculate_dog_years():
    #print("Hi",first_name.capitalize(),last_name.capitalize(),"you may be",your_age,"years old but in dog years you are",age_in_dog_years,"old so get busy living!")
    first_name=input('please Enter your First Name ')
    last_name=input('please Enter your Last Name ')
    your_age=input('please Enter your Your Age ')
    age_in_dog_years=int(your_age)*7
    print("Hi",first_name.capitalize(),last_name.capitalize(),"you may be",your_age,"years old but in dog years you are",age_in_dog_years,"old so get busy living!")

calculate_dog_years()
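One way the function could be repaired is sketched below. The likely problem is that the function takes no parameters and reads its values with input(), so the three required calls would fail with a TypeError. This version takes firstname, lastname, and age with the defaults the assignment lists. Building the message as a single string and returning it as well as printing it are choices made for this sketch, not requirements from the assignment.

```python
def calculate_dog_years(firstname="John", lastname="Doe", age=18):
    """Print the Lab 1 dog-years message using parameters instead of input()."""
    age_in_dog_years = int(age) * 7
    # capitalize() upper-cases the first letter and lower-cases the rest
    message = (
        f"Hi {firstname.capitalize()} {lastname.capitalize()} you may be {age} "
        f"years old but in dog years you are {age_in_dog_years} old so get busy living!"
    )
    print(message)
    # Returning the text too is an addition in this sketch, to make checking easy
    return message


# The three calls the assignment says should work
calculate_dog_years("Nathan", "Green", 37)
calculate_dog_years(lastname="Green", firstname="nathan", age=37)
calculate_dog_years(age=21)
```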
