In this post, I will show you how to scrape and prepare data for analyzing PCSO lottery results in Excel to take your lottery play to the next level. Discover powerful data-driven insights that can help you make smarter, more informed decisions and maximize your chances of winning.
Scraping the Data
Let us first install the modules we’re going to need for scraping the website (skip this if you already have them).
python
1
python-mpipinstallseleniumbeautifulsoup4pandas
And then import them for our project:
python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
# Selenium for simulating clicks in a browserfromseleniumimportwebdriverfromselenium.webdriver.support.uiimportSelectfromselenium.webdriver.common.byimportByfromselenium.webdriver.support.uiimportWebDriverWaitfromselenium.webdriver.supportimportexpected_conditionsasec# BeautifulSoup for scrapingfrombs4importBeautifulSoup# pandas for processing the dataimportpandasaspd# datetime for getting today's date and formattingfromdatetimeimportdatetimefromdatetimeimportdatefromdatetimeimportdatetime,timedelta
Simulate clicks in the Browser with selenium
For simulating clicks in a web browser, we are going to use selenium. We’re also going to need to download a web driver such as Microsoft Edge’s webdriver (of course, you need the corresponding browser Microsoft Edge to be installed). Since I’m on Windows, MS Edge comes right out of the box so I will use it, but feel free to use your own preferred browser that has a driver such as Chrome or Firefox.
Now set the path to where the webdriver executable is as well as the URL to the lottery data is located.
python
1
2
3
4
5
6
7
# Set the path to where the webdriver executuble is# For me it's the following:path=("C:\\Users\\<Username>\\AppData\\Local\\Programs\\Python"+"\\Python39\\Scripts\\msedgedriver.exe")# Designate the url to be scraped to a variableurl="https://www.pcso.gov.ph/SearchLottoResult.aspx"
Now initialize a Selenium session by directing it to the webdriver executable.
python
1
2
3
4
5
# Initialize the Edge webdriverdriver=webdriver.Edge(executable_path=path)# Grab the web pagedriver.get(url)
One problem with the page though is that if you inspect the page’s source there is class called pre-con in a div. If you would try to just have the driver proceed without waiting for a few seconds some of the buttons are unclickable and blocked by this div container, so we have to tell the WebDriver to wait for a set amount of time. I discovered this after a while of troubleshooting why the selenium cannot give any input to the web form.
python
1
2
3
4
5
# Designate a variable for waiting for the page to loadwait=WebDriverWait(driver,5)# wait for the div class "pre-con" to be invisible to ensure clicks workwait.until(ec.invisibility_of_element_located((By.CLASS_NAME,"pre-con")))
Now that the wait is over let us now proceed to entering our parameters in the ASP.NET web form (the form with filters we see if we visit the web page) for the data we need. We are going to do that by using the .find_element_by_id method here for the options because we know their ids by inspecting the page’s source at the place where the dropdown menus are.
Tip: To inspect the dropdown menu, right-click on it and navigate to “Developer Tools” and select “Inspect” (or press F12). We then get the value inside of the id parameter.
The PCSO search Lotto form
python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Wewanttheenddatetobetodaytogetthelatestdatafromallthegamesandthestartdatetobetheearliestpossibleoptionwhichis`January1,2012`inthedropdownmenu.Asforthelottogamewewantallgames.Wewilljustsplitthedatauplaterintosmallerdataframesusingpandasforeachlottogame,sothatweonlyneedtoscrapethewebsiteeverytimewewanttoupdateourdata.Butfirstgettoday's date which will be used later.```python# Get today's date with the datetime importtoday=date.today()# Store the current year, month, and day to variablestd_year=today.strftime("%Y")td_month=today.strftime("%B")td_day=today.strftime("%d").lstrip("0").replace(" 0"," ")startyr=int(td_year)-10startyr=str(startyr)print("Today is "+td_month+" "+td_day+", "+td_year+".\n")
[OUT]: Today is May 9, 2022.
Now let’s have Selenium and the webdriver find the elements of the form and select the parameters we want in the form options.
# Select Start Date as January 1, 2012start_month=Select(driver.find_element_by_id("cphContainer_cpContent_ddlStartMonth"))start_month.select_by_value("January")start_day=Select(driver.find_element_by_id("cphContainer_cpContent_ddlStartDate"))start_day.select_by_value("1")start_year=Select(driver.find_element_by_id("cphContainer_cpContent_ddlStartYear"))start_year.select_by_value(startyr)# Select End Date as Todayend_month=Select(driver.find_element_by_id("cphContainer_cpContent_ddlEndMonth"))end_month.select_by_value(td_month)end_day=Select(driver.find_element_by_id("cphContainer_cpContent_ddlEndDay"))end_day.select_by_value(td_day)end_year=Select(driver.find_element_by_id("cphContainer_cpContent_ddlEndYear"))end_year.select_by_value(td_year)# Lotto Gamegame=Select(driver.find_element_by_id("cphContainer_cpContent_ddlSelectGame"))# If you inspect the page, the value of the option for "All Games" is 0game.select_by_value('0')# Submit the parameters by clicking the search buttonsearch_button=driver.find_element_by_id("cphContainer_cpContent_btnSearch")search_button.click()
Scraping the data using BeautifulSoup
Now it’s time to scrape the data from the current page’s session with BeautifulSoup to get the data we need.
Firstly, feed the page’s source code into Beautiful Soup and then have it find our results by id. Inspect the source again) and get all the table’s rows by their attributes such as class.
python
1
2
3
4
5
6
# Feed the page's source code into Beautiful Soupdoc=BeautifulSoup(driver.page_source,"html.parser")# Find the table of the results by id (rows=doc.find('table',id='cphContainer_cpContent_GridView1').find_all('tr',attrs={'class':"alt"})
Now time to put the data in a python list/dictionary.
python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
# Initialize a list to hold our dataentries=[]# Now loop through the rows and put the data into the list to make a tableforrowinrows:cells=row.select("td")entry={"Game":cells[0].text,"Combination":cells[1].text,"Date":cells[2].text,"Prize":cells[3].text,"Winners":cells[4].text,}entries.append(entry)
Processing the Data
Cleaning Up the Data with pandas
Now that we have the data in a list, it is now time to put it in a pandas dataframe and clean it up. There are duplicates in the data if you examine it closely so we have to remove those. We also need to get the data into the proper data types to make it easier for us to process down the line (i.e. sanitization).
python
1
2
3
4
5
6
7
8
# Turn the list into a DataFramedf=pd.DataFrame(entries)# Remove duplicate rowsdf.drop_duplicates(inplace=True,keep=False)# Remove rows that have no combination associateddf=df[df["Combination"]!="- "]
The part df = df[df["Combination"] != "- "] above is to look for and remove entries that do not have a combination. I also found this after hours of figuring out why I cannot do certain operations on the data like converting them into the proper data types. Speaking of data types, let’s go convert the data now.
python
1
2
3
4
5
6
7
8
9
# Convert the dates to datetime typedf["Date"]=df["Date"].astype('datetime64[ns]')# Remove the commas in the prize amountsdf["Prize"]=df["Prize"].replace(',','',regex=True)# Convert data types of Prize and Winners to float and integersdf["Prize"]=df["Prize"].astype(float)# float because there are still centavosdf["Winners"]=df["Winners"].astype(int)
Now let’s look at our data so far:
python
1
df
Game
Combination
Date
Prize
Winners
0
Superlotto 6/49
18-24-04-26-47-36
2022-05-08
67522822.8
0
1
Suertres Lotto 4PM
1-3-9
2022-05-08
4500.0
279
2
EZ2 Lotto 11AM
08-04
2022-05-08
4000.0
251
3
EZ2 Lotto 9PM
07-04
2022-05-08
4000.0
784
4
Lotto 6/42
14-24-08-16-22-37
2022-05-07
6051682.0
0
...
...
...
...
...
...
15880
4D Vismin
0-6-3-3
2012-01-02
40672.0
7
15881
Suertres Lotto 4PM
3-9-3
2012-01-02
4500.0
256
15882
EZ2 Lotto 9PM
03-13
2012-01-02
4000.0
540
15883
EZ2 Lotto 11AM
31-24
2012-01-02
4000.0
68
15884
Grand Lotto 6/55
44-14-51-52-39-08
2012-01-02
71768080.8
0
15878 rows × 5 columns
Saving the data to an MS Excel workbook
So far it’s looking good. Since we’re now here it’s time for us to split this huge dataframe of ours into smaller dataframes by the type of lotto game. While we’re at it let’s also fix the time for the Suertres Lotto and EZ2 Lotto games so that they are included in the data and not as a separate category.
After doing that, let’s save that into an Excel workbook so that we do not have to scrape every time we want to analyze the data.
# Sort the DataFrame by gamedf.sort_values(by=["Game"],inplace=True)# Get a list of the gamesgames=df["Game"].unique().tolist()# Now we can create DataFrames for each gamelotto_658=df.loc[df["Game"]=="Ultra Lotto 6/58"].copy()lotto_655=df.loc[df["Game"]=="Grand Lotto 6/55"].copy()# Sidenote: Super Lotto 6/49 and Mega Lotto 6/45 have different values from# what was on the dropdown menulotto_649=df.loc[df["Game"]=="Superlotto 6/49"].copy()lotto_645=df.loc[df["Game"]=="Megalotto 6/45"].copy()# Anyways, continuing...lotto_642=df.loc[df["Game"]=="Lotto 6/42"].copy()lotto_6d=df.loc[df["Game"]=="6Digit"].copy()lotto_4d=df.loc[df["Game"]=="4Digit"].copy()lotto_3da=df.loc[df["Game"]=="Suertres Lotto 11AM"].copy()lotto_3db=df.loc[df["Game"]=="Suertres Lotto 4PM"].copy()lotto_3dc=df.loc[df["Game"]=="Suertres Lotto 9PM"].copy()lotto_2da=df.loc[df["Game"]=="EZ2 Lotto 11AM"].copy()lotto_2db=df.loc[df["Game"]=="EZ2 Lotto 4PM"].copy()lotto_2dc=df.loc[df["Game"]=="EZ2 Lotto 9PM"].copy()
Let’s look at one of the dataframes:
python
1
lotto_2da.tail()
Game
Combination
Date
Prize
Winners
11555
EZ2 Lotto 11AM
29-28
2014-09-24
4000.0
54
15533
EZ2 Lotto 11AM
27-16
2012-03-12
4000.0
55
15563
EZ2 Lotto 11AM
03-08
2012-03-06
4000.0
206
9533
EZ2 Lotto 11AM
24-30
2016-01-06
4000.0
121
2394
EZ2 Lotto 11AM
27-08
2020-11-07
4000.0
127
For the Suertres Lotto and EZ2 Lotto games the games are split into 11:00 AM, 4:00 PM, and 9:00 PM games. Let’s fix that by assigning them the proper datetime values in the Date column and combining them into bigger dataframes
python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
# Add 11 hours to the datetime for Suertres Lotto 11AM game to match because of time zoneslotto_3da["Date"]=lotto_3da["Date"]+timedelta(hours=11)# Add 16 hours to the datetime for Suertres Lotto 4PM game to matchlotto_3db["Date"]=lotto_3db["Date"]+timedelta(hours=16)# Add 21 hours to the datetime for Suertres Lotto 9PM game to matchlotto_3dc["Date"]=lotto_3dc["Date"]+timedelta(hours=21)# Rename all the game entries as just Suertres Lottolotto_3da["Game"]="Suertres Lotto"lotto_3db["Game"]="Suertres Lotto"lotto_3dc["Game"]="Suertres Lotto"# Combine the three Suertres Lotto DataFrames into onelotto_3d=lotto_3dalotto_3d=lotto_3d.append(lotto_3db)lotto_3d=lotto_3d.append(lotto_3dc)
python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
# Do the same for EZ2 Lottolotto_2da["Date"]=lotto_2da["Date"]+timedelta(hours=11)lotto_2db["Date"]=lotto_2db["Date"]+timedelta(hours=16)lotto_2dc["Date"]=lotto_2dc["Date"]+timedelta(hours=21)# Rename all the game entries as just EZ2 Lottolotto_2da["Game"]="EZ2 Lotto"lotto_2db["Game"]="EZ2 Lotto"lotto_2dc["Game"]="EZ2 Lotto"# Combine the three EZ2 Lotto DataFrames into onelotto_2d=lotto_2dalotto_2d=lotto_2d.append(lotto_2db)lotto_2d=lotto_2d.append(lotto_2dc)
Now let’s look at one of them again to see if we were successful:
python
1
lotto_2da
Game
Combination
Date
Prize
Winners
2877
EZ2 Lotto
25-06
2020-02-21 11:00:00
4000.0
54
3858
EZ2 Lotto
05-06
2019-07-18 11:00:00
4000.0
164
7987
EZ2 Lotto
30-25
2016-12-24 11:00:00
4000.0
223
3876
EZ2 Lotto
29-29
2019-07-14 11:00:00
4000.0
231
14423
EZ2 Lotto
19-25
2012-11-12 11:00:00
4000.0
135
...
...
...
...
...
...
11555
EZ2 Lotto
29-28
2014-09-24 11:00:00
4000.0
54
15533
EZ2 Lotto
27-16
2012-03-12 11:00:00
4000.0
55
15563
EZ2 Lotto
03-08
2012-03-06 11:00:00
4000.0
206
9533
EZ2 Lotto
24-30
2016-01-06 11:00:00
4000.0
121
2394
EZ2 Lotto
27-08
2020-11-07 11:00:00
4000.0
127
1815 rows × 5 columns
Let’s see also one of the combined dataframes:
python
1
lotto_2d
Game
Combination
Date
Prize
Winners
2877
EZ2 Lotto
25-06
2020-02-21 11:00:00
4000.0
54
3858
EZ2 Lotto
05-06
2019-07-18 11:00:00
4000.0
164
7987
EZ2 Lotto
30-25
2016-12-24 11:00:00
4000.0
223
3876
EZ2 Lotto
29-29
2019-07-14 11:00:00
4000.0
231
14423
EZ2 Lotto
19-25
2012-11-12 11:00:00
4000.0
135
...
...
...
...
...
...
4686
EZ2 Lotto
16-27
2019-01-11 21:00:00
4000.0
402
4681
EZ2 Lotto
15-08
2019-01-12 21:00:00
4000.0
249
12361
EZ2 Lotto
16-29
2014-03-16 21:00:00
4000.0
530
4695
EZ2 Lotto
06-20
2019-01-09 21:00:00
4000.0
335
4290
EZ2 Lotto
01-20
2019-04-09 21:00:00
4000.0
339
5334 rows × 5 columns
So far so good. The final stretch here is going to be to save our data to Microsoft Excel and we can achieve that easily with pandas with pandas.DataFrame.to_excel.
python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
# Create Excel writer objectwriter=pd.ExcelWriter("lotto.xlsx")# Write DataFrames to excel worksheetsdf.to_excel(writer,"All Data")lotto_658.to_excel(writer,"Ultra Lotto 6-58")lotto_655.to_excel(writer,"Grand Lotto 6-55")lotto_649.to_excel(writer,"Super Lotto 6-49")lotto_645.to_excel(writer,"Mega Lotto 6-45")lotto_642.to_excel(writer,"Lotto 6-42")lotto_6d.to_excel(writer,"6 Digit")lotto_4d.to_excel(writer,"4 Digit")lotto_3d.to_excel(writer,"Suertres Lotto")lotto_2d.to_excel(writer,"EZ2 Lotto")# Save the Excel workbookwriter.save()
Conclusion
With that we have saved an Excel file named lotto.xlsx where all of our scraped data have been put in. Now let’s proceed to making the scripts for analyzing the Lotto data. For example, we might look at the frequency of combinations, how frequent some numbers appear compared to others, or the expectancy value is for each drawing date in part two.