With the widespread information that can be found on the internet, manually searching for data can be tedious and time-consuming. Automation through web scraping is a strategy that many companies and individuals adopt in order to parse information from various sources quickly. Web scraping can be a powerful tool, but there are some challenges when trying to parse data that could continuously change. In addition to being a software engineer at Snapsheet, I am a soccer referee on the weekends, and I spend a lot of my time looking through the various referee scheduling websites for open games.

Different soccer leagues and tournaments use different websites, and it can be annoying to flip back and forth between them to figure out the optimal schedule. In addition, I found myself spending hours on end sifting through the various locations, trying to figure out which games were closest to my home. And once I've signed up for games, the websites don't have calendar integrations for exporting the schedule to a central location. This is where I got the idea to automate the process: I learned how to scrape websites in order to more efficiently find game information and put it all in one place.

Most of my programming experience is in Ruby on Rails, Java, and SQL, and for this project I wanted to teach myself a new language. Python is a widely used language, and I figured it would be fairly easy to pick up. A friend had suggested Python's Beautiful Soup package as a good tool for HTML parsing. Thus, I learned how to combine Beautiful Soup with Python's requests package to scrape websites.

The first step in web scraping is knowing what information you are trying to scrape. For me, I need to log into each website, see the list of games I have signed up for, and put them in my Google Calendar so I have a single source of truth to look at for my soccer games.

Inspecting The HTML

But how do I get Python to log in for me? The first step is to figure out the login and make sure the session is stored for other pages. Chrome dev tools were a good place to start. Using them, we can inspect the request the website makes in order to log into the system, as shown in Image 1. We can see what POST request is being made and then emulate it within our code. If all goes well, the resulting BeautifulSoup object will contain the same HTML content we would see if we had logged in using a browser.

From here, I was able to continue this process until I got to the page with the game data I was looking for. This is where inspecting the elements on the page becomes really important. From the page's HTML, we can see a deeply nested table holding the relevant data that needs to be parsed. In order to extract that data, we must code to the structure of the page. This makes our code very brittle to any changes the website might make, but it's the only way to get the data we need: it requires the scraper to know exactly how the page will be formatted.

Using BeautifulSoup, we can parse this HTML for the games we need. Luckily, most referee websites use tables to show the list of games, and although the tables might be slightly different, the parsing concept is the same. First, we get the table we are looking for. Then we can get all the games with games = main_table.find_all('tr'); each row we find corresponds to a separate game.

Finally, we can manipulate the game data for each row. This requires a lot of knowledge about the page. Here's an example of how I extract data: for my initial use case, I take note of whether my name shows up in the list of referees, and if so, I connect the game to my Google Calendar to sync it with games on my schedule from other websites.

Once I got the main scraper working for one website, I broke the code into more modular classes so that the scraper scales to different websites. With this structure, supporting a new website only requires creating a new scraper to parse its HTML; once the game data is abstracted, adding the game to Google Calendar is the same regardless of the website.

Throughout my experience with this scraper, there were a few challenges to overcome. Some challenges are easier to combat than others and depend on the type of website the scraper is making requests against.

Authentication

All the websites I needed to scrape required login credentials.
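Putting the pieces together, here is a minimal sketch of the flow described above: a requests.Session that emulates the login POST, followed by BeautifulSoup parsing of the games table. The URLs, the form field names, and the table's id attribute are all hypothetical placeholders — the real values come from inspecting the site's login request in dev tools.

```python
# Minimal sketch: log in with a persistent session, then parse the games
# table. LOGIN_URL, GAMES_URL, the form field names, and the "games"
# table id are hypothetical placeholders -- inspect the actual POST
# request in Chrome dev tools to find the values your site expects.
import requests
from bs4 import BeautifulSoup


def parse_games(html):
    """Extract one dict per <tr> row of the games table."""
    soup = BeautifulSoup(html, "html.parser")
    main_table = soup.find("table", {"id": "games"})  # assumed table id
    games = []
    for row in main_table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) >= 3:  # skip the <th> header row and short rows
            games.append({"date": cells[0], "field": cells[1], "referees": cells[2]})
    return games


def fetch_games(username, password):
    """Log in, then fetch and parse the schedule page."""
    with requests.Session() as session:
        # Emulate the login POST; the session stores the auth cookie so
        # later requests look like they come from a logged-in browser.
        session.post("https://referee-site.example/login",
                     data={"username": username, "password": password})
        page = session.get("https://referee-site.example/games")
        return parse_games(page.text)
```

Keeping the parsing separate from the network call is deliberate: parse_games can be exercised against saved HTML without logging in, which makes the brittle, page-specific logic easier to re-verify whenever the website changes.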