In this post, we will scrape and prepare data for analyzing PCSO lottery results in Excel to take your lottery play to the next level.
Scraping the Data
Let us first install the modules we’re going to need for scraping the website (skip this if you already have them).
|
|
And then import them for our project:
|
|
Simulate clicks in the Browser with selenium
For simulating clicks in a web browser, we are going to use selenium
. We’re also going to need to download a web driver such as Microsoft Edge’s
webdriver (of course, you need the corresponding browser
Microsoft Edge to be installed). Since I’m on Windows, MS Edge comes right out of the box so I will use it, but feel free to use your own preferred browser that has a driver such as
Chrome or
Firefox.
Now set the path to where the webdriver executable is as well as the URL to the lottery data is located.
|
|
Now initialize a Selenium session by directing it to the webdriver executable.
|
|
One problem with the page though is that if you inspect the page’s source there is class called pre-con
in a div. If you would try to just have the driver proceed without waiting for a few seconds some of the buttons are unclickable and blocked by this div container, so we have to tell the WebDriver to wait for a set amount of time. I discovered this after a while of troubleshooting why the selenium cannot give any input to the web form.
|
|
[OUT]: <selenium.webdriver.remote.webelement.WebElement (session="fb056d28fee8b152833d3ed6d8827c99", element="a7ca8ad3-7f46-47c8-8b8b-1b80414f1e51")>
Now that the wait is over let us now proceed to entering our parameters in the ASP.NET web form (the form with filters we see if we visit the web page) for the data we need. We are going to do that by using the .find_element_by_id
method here for the options because we know their id
s by inspecting the page’s source at the place where the dropdown menus are.
Tip: To inspect the dropdown menu, right-click on it and navigate to “Developer Tools” and select “Inspect” (or press F12
). We then get the value inside of the id
parameter.
We want the end date to be today to get the latest data from all the games and the start date to be the earliest possible option which is January 1, 2012
in the dropdown menu. As for the lotto game we want all games. We will just split the data up later into smaller dataframes using pandas for each lotto game, so that we only need to scrape the website every time we want to update our data. But first get today’s date which will be used later.
|
|
[OUT]: Today is May 9, 2022.
Now let’s have Selenium and the webdriver find the elements of the form and select the parameters we want in the form options.
|
|
Scraping the data using BeautifulSoup
Now it’s time to scrape the data from the current page’s session with BeautifulSoup
to get the data we need.
Firstly, feed the page’s source code into Beautiful Soup and then have it find our results by id
. Inspect the source again) and get all the table’s rows by their attributes such as class
.
|
|
Now time to put the data in a python list/dictionary.
|
|
Processing the Data
Cleaning Up the Data with pandas
Now that we have the data in a list, it is now time to put it in a pandas
dataframe and clean it up. There are duplicates in the data if you examine it closely so we have to remove those. We also need to get the data into the proper data types to make it easier for us to process down the line (i.e. sanitization).
|
|
The part df = df[df["Combination"] != "- "]
above is to look for and remove entries that do not have a combination. I also found this after hours of figuring out why I cannot do certain operations on the data like converting them into the proper data types. Speaking of data types, let’s go convert the data now.
|
|
Now let’s look at our data so far:
|
|
Game | Combination | Date | Prize | Winners | |
---|---|---|---|---|---|
0 | Superlotto 6/49 | 18-24-04-26-47-36 | 2022-05-08 | 67522822.8 | 0 |
1 | Suertres Lotto 4PM | 1-3-9 | 2022-05-08 | 4500.0 | 279 |
2 | EZ2 Lotto 11AM | 08-04 | 2022-05-08 | 4000.0 | 251 |
3 | EZ2 Lotto 9PM | 07-04 | 2022-05-08 | 4000.0 | 784 |
4 | Lotto 6/42 | 14-24-08-16-22-37 | 2022-05-07 | 6051682.0 | 0 |
... | ... | ... | ... | ... | ... |
15880 | 4D Vismin | 0-6-3-3 | 2012-01-02 | 40672.0 | 7 |
15881 | Suertres Lotto 4PM | 3-9-3 | 2012-01-02 | 4500.0 | 256 |
15882 | EZ2 Lotto 9PM | 03-13 | 2012-01-02 | 4000.0 | 540 |
15883 | EZ2 Lotto 11AM | 31-24 | 2012-01-02 | 4000.0 | 68 |
15884 | Grand Lotto 6/55 | 44-14-51-52-39-08 | 2012-01-02 | 71768080.8 | 0 |
15878 rows × 5 columns
Saving the data to an MS Excel workbook
So far it’s looking good. Since we’re now here it’s time for us to split this huge dataframe of ours into smaller dataframes by the type of lotto game. While we’re at it let’s also fix the time for the Suertres Lotto and EZ2 Lotto games so that they are included in the data and not as a separate category.
After doing that, let’s save that into an Excel workbook so that we do not have to scrape every time we want to analyze the data.
|
|
Let’s look at one of the dataframes:
|
|
Game | Combination | Date | Prize | Winners | |
---|---|---|---|---|---|
11555 | EZ2 Lotto 11AM | 29-28 | 2014-09-24 | 4000.0 | 54 |
15533 | EZ2 Lotto 11AM | 27-16 | 2012-03-12 | 4000.0 | 55 |
15563 | EZ2 Lotto 11AM | 03-08 | 2012-03-06 | 4000.0 | 206 |
9533 | EZ2 Lotto 11AM | 24-30 | 2016-01-06 | 4000.0 | 121 |
2394 | EZ2 Lotto 11AM | 27-08 | 2020-11-07 | 4000.0 | 127 |
For the Suertres Lotto and EZ2 Lotto games the games are split into 11:00 AM, 4:00 PM, and 9:00 PM games. Let’s fix that by assigning them the proper datetime values in the Date column and combining them into bigger dataframes
|
|
|
|
Now let’s look at one of them again to see if we were successful:
|
|
Game | Combination | Date | Prize | Winners | |
---|---|---|---|---|---|
2877 | EZ2 Lotto | 25-06 | 2020-02-21 11:00:00 | 4000.0 | 54 |
3858 | EZ2 Lotto | 05-06 | 2019-07-18 11:00:00 | 4000.0 | 164 |
7987 | EZ2 Lotto | 30-25 | 2016-12-24 11:00:00 | 4000.0 | 223 |
3876 | EZ2 Lotto | 29-29 | 2019-07-14 11:00:00 | 4000.0 | 231 |
14423 | EZ2 Lotto | 19-25 | 2012-11-12 11:00:00 | 4000.0 | 135 |
... | ... | ... | ... | ... | ... |
11555 | EZ2 Lotto | 29-28 | 2014-09-24 11:00:00 | 4000.0 | 54 |
15533 | EZ2 Lotto | 27-16 | 2012-03-12 11:00:00 | 4000.0 | 55 |
15563 | EZ2 Lotto | 03-08 | 2012-03-06 11:00:00 | 4000.0 | 206 |
9533 | EZ2 Lotto | 24-30 | 2016-01-06 11:00:00 | 4000.0 | 121 |
2394 | EZ2 Lotto | 27-08 | 2020-11-07 11:00:00 | 4000.0 | 127 |
1815 rows × 5 columns
Let’s see also one of the combined dataframes:
|
|
Game | Combination | Date | Prize | Winners | |
---|---|---|---|---|---|
2877 | EZ2 Lotto | 25-06 | 2020-02-21 11:00:00 | 4000.0 | 54 |
3858 | EZ2 Lotto | 05-06 | 2019-07-18 11:00:00 | 4000.0 | 164 |
7987 | EZ2 Lotto | 30-25 | 2016-12-24 11:00:00 | 4000.0 | 223 |
3876 | EZ2 Lotto | 29-29 | 2019-07-14 11:00:00 | 4000.0 | 231 |
14423 | EZ2 Lotto | 19-25 | 2012-11-12 11:00:00 | 4000.0 | 135 |
... | ... | ... | ... | ... | ... |
4686 | EZ2 Lotto | 16-27 | 2019-01-11 21:00:00 | 4000.0 | 402 |
4681 | EZ2 Lotto | 15-08 | 2019-01-12 21:00:00 | 4000.0 | 249 |
12361 | EZ2 Lotto | 16-29 | 2014-03-16 21:00:00 | 4000.0 | 530 |
4695 | EZ2 Lotto | 06-20 | 2019-01-09 21:00:00 | 4000.0 | 335 |
4290 | EZ2 Lotto | 01-20 | 2019-04-09 21:00:00 | 4000.0 | 339 |
5334 rows × 5 columns
So far so good. The final stretch here is going to be to save our data to Microsoft Excel and we can achieve that easily with pandas
with pandas.DataFrame.to_excel
.
|
|
Conclusion
With that we have saved an Excel file named lotto.xlsx
where all of our scraped data have been put in.