Title: Sports Data Sources and Data Extraction
1. Sports Data Sources and Data Extraction
- Gavin Zhang
- MIS580
- University of Arizona
- 02-06-2008
2. Outline
- Sports Data Sources
  - Baseball
  - Basketball
  - Football
  - Olympics
  - Greyhound
- Data Extraction
- Case Study: AZGreyhound System
3. Baseball Data Source
Download the database
4. Data Download
- This database contains pitching, hitting, and
fielding statistics for Major League Baseball
from 1871 through 2007.
- The data are provided in Microsoft Access, CSV,
and other formats.
- The newest version is Version 5.5.
- The database can be downloaded at
- http://baseball1.com/content/view/57/82/
5. Database
AwardPlayers.csv
- A detailed description of the database is available at
- http://baseball1.com/content/view/57/82/
- The database has 21 tables. The main tables include:
  - MASTER table: player names, DOB, and biographical
  info
  - Batting table: batting statistics
  - Pitching table: pitching statistics
  - Fielding table: fielding statistics
- A detailed description of each data field in
each table is available.
6. Basketball Data Source
Download all of the player and team statistics
- http://databaseBasketball.com/
7. Data Download
- The website contains NBA data from 1947 to
2007 and ABA data from 1968 to 1976 on players,
teams, leagues, all-star games, awards, and
coaches.
- Download at
- http://databasebasketball.com/stats_download.htm
8. Database
Teams.txt
team|location|name|leag
ANA|Anaheim|Amigos|A
AND|Anderson|Duffey Packers|N
ATL|Atlanta|Hawks|N
BA1|Baltimore|Bullets|N
BAL|Baltimore|Bullets|N
BOS|Boston|Celtics|N
BUF|Buffalo|Braves|N
CAP|Capital|Bullets|N
CAR|Carolina|Cougars|A
CH1|Chicago|Stags|N
CH2|Chicago|Zephyrs|N
CHA|Charlotte|Hornets|N
CHI|Chicago|Bulls|N
- This download contains nine column-delimited
files (.txt format), each of which represents a
table in the database.
- If you open the files in Excel, you may need
to select Data > Text to Columns, then use the
bar ("|") character as the delimiter.
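The same bar-delimited files can also be read programmatically rather than through Excel. The following is a minimal Java sketch, assuming each row looks like `ANA|Anaheim|Amigos|A` and that the first line of the file is a header; the class and method names are hypothetical and not part of the download.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: splitting bar-delimited rows such as those assumed for Teams.txt. */
class PipeDelimitedDemo {

    /** Splits one row on the bar character, keeping empty fields. */
    static String[] splitRow(String line) {
        // "|" is a regex metacharacter, so it must be escaped for String.split;
        // the -1 limit keeps trailing empty fields instead of dropping them.
        return line.split("\\|", -1);
    }

    /** Parses every non-blank line after the (assumed) header line. */
    static List<String[]> parseRows(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (int i = 1; i < lines.size(); i++) { // index 0 is assumed to be the header
            String line = lines.get(i).trim();
            if (!line.isEmpty()) {
                rows.add(splitRow(line));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Sample rows in the assumed layout: team|location|name|leag
        List<String> sample = Arrays.asList(
                "team|location|name|leag",
                "ANA|Anaheim|Amigos|A",
                "AND|Anderson|Duffey Packers|N");
        for (String[] row : parseRows(sample)) {
            System.out.println(row[0] + ": " + row[1] + " " + row[2] + " (" + row[3] + ")");
        }
    }
}
```

Splitting on the unescaped string `"|"` would be treated as a regular-expression alternation and break the row into single characters, which is why the sketch escapes it.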
9. Football Data Source
- http://www.pro-football-reference.com/
10. Data Download
- A copy of the data set (in CSV format) can be
downloaded from
http://ai.arizona.edu/hchen/chencourse/SportsData/Pro-football-refernce_CSV.zip
- This version contains game data from 1995 to 2006.
The dataset contains 64,327 players and the
games they played in.
- Tables include
  - Master: information about players
  - Seasons: player statistics by season
  - Games: player statistics by game
- A detailed description of each data field in
each table is available.
11. Database
Master.csv
12. Some Other Football Data Sources
- http://www.databasefootball.com/
  - The website contains National Football League
  (NFL) data from 1922 to 2005 and American
  Football League (AFL) data from 1960 to 1969 on
  players, teams, leagues, awards, and coaches.
  - The data set cannot be downloaded directly. The data
  need to be extracted from the HTML Web pages
  using parsing programs.
- http://www.jt-sw.com/football/
  - The website contains NFL player/coach statistics
  from 1920 to the present and AFL statistics
  from 1960 to 1969.
  - The data set cannot be downloaded directly. The data
  need to be extracted from the HTML Web pages
  using parsing programs.
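As an illustration of what such a parsing program might do, the sketch below pulls the text of each table cell out of one HTML table row using regular expressions. The HTML snippet, class name, and method name are hypothetical; the real pages on these sites may be structured differently, and a dedicated HTML parser is usually more robust than regular expressions for irregular markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: extracting cell text from an HTML table row with regular expressions. */
class HtmlCellExtractor {

    // Matches <td ...>content</td>, case-insensitively and across line breaks.
    private static final Pattern CELL =
            Pattern.compile("<td[^>]*>(.*?)</td>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Matches any tag left inside a cell, for example a link around a player name.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /** Returns the trimmed text of every td cell found in the given HTML. */
    static List<String> extractCells(String html) {
        List<String> cells = new ArrayList<>();
        Matcher m = CELL.matcher(html);
        while (m.find()) {
            // Strip nested tags so only the visible text remains.
            cells.add(TAG.matcher(m.group(1)).replaceAll("").trim());
        }
        return cells;
    }

    public static void main(String[] args) {
        // Hypothetical row; not copied from any of the sites listed above.
        String row = "<tr><td><a href=\"/players/x.htm\">Sample Player</a></td>"
                + "<td align=\"right\">1998</td><td>12</td></tr>";
        System.out.println(extractCells(row));
    }
}
```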
13. Olympics Data Source
- http://www.databaseolympics.com/
14. Data Format
- DatabaseOlympics.com is your source for every
Summer and Winter Olympics medal winner.
  - Summer Olympics from 1896-2004
  - Winter Olympics from 1924-2002
- You'll find every medal winner for every country
with easy links to each Olympics, sports, and
athletes.
15. Data Format
16. Greyhound
- http://66.236.122.233:8080/tracklink/
17. Data Format
- Data include daily race programs (videos) and
odds charts (.txt file format) for all US
greyhound tracks.
- Some tracks have both afternoon and evening
programs.
18. Data Format
Chart.txt
1st    Grade: B    Distance: 550    Condition: Fast
DOG               WT    P  O  1/8  Str  Fin    Time   Odds   Comment
PTL Jane          63.5  6  3  1    1    1 ns   32.00  11.60  Held At Wire Inside
Silver Speck      68.5  1  1  2    2    2 ns   32.01   2.80  Cutff 1st, Stayd Cls
Jain't It Doug    75    7  7  6    6    3 1.5  32.10   7.50  Closed For Show Outs
Flyer Whitesocks  75.5  8  8  7    3    4 1.5  32.11   2.30  In The Hunt
Flying Detroit    69    5  5  4    4    5 2    32.15   9.00  Not Far Behind Mdtrk
VP Twix Twizala   59.5  3  4  3    5    6 4.5  32.31   4.20  Losing Position Ins
Sergio            73    4  6  5    7    7 5    32.34  13.30  Blocked 1st Turn
Heartattack Jack  71.5  2  2  8    8    8 5.5  32.39   7.10  Bumped 1st Turn
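A chart row like the ones above can be split into fields with a regular expression. The sketch below assumes each row follows the header order (DOG, WT, P, O, 1/8, Str, Fin, Time, Odds, Comment), that the Fin column carries a position followed by one extra token such as `ns` or `1.5`, and that dog names contain no standalone numbers. The class and field names are hypothetical, and real chart files may need additional handling.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: parsing one result row of a greyhound odds chart (assumed layout). */
class ChartRowParser {

    /** One parsed row; field names follow the chart's column headers. */
    static final class ChartRow {
        String dog;                  // DOG
        double weight;               // WT
        int p, o, eighth, str, fin;  // P, O, 1/8, Str, Fin position
        String margin;               // token after the Fin position, e.g. "ns" (assumed meaning)
        double time;                 // Time
        double odds;                 // Odds
        String comment;              // Comment
    }

    // name, WT, five integer columns, margin token, Time, Odds, optional comment
    private static final Pattern ROW = Pattern.compile(
            "^(.+?)\\s+(\\d+(?:\\.\\d+)?)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)"
            + "\\s+(\\S+)\\s+(\\d+\\.\\d+)\\s+(\\d+\\.\\d+)\\s*(.*)$");

    /** Returns the parsed row, or null when the line does not fit the assumed layout. */
    static ChartRow parse(String line) {
        Matcher m = ROW.matcher(line.trim());
        if (!m.matches()) {
            return null; // header lines and malformed rows end up here
        }
        ChartRow r = new ChartRow();
        r.dog = m.group(1);
        r.weight = Double.parseDouble(m.group(2));
        r.p = Integer.parseInt(m.group(3));
        r.o = Integer.parseInt(m.group(4));
        r.eighth = Integer.parseInt(m.group(5));
        r.str = Integer.parseInt(m.group(6));
        r.fin = Integer.parseInt(m.group(7));
        r.margin = m.group(8);
        r.time = Double.parseDouble(m.group(9));
        r.odds = Double.parseDouble(m.group(10));
        r.comment = m.group(11).trim();
        return r;
    }

    public static void main(String[] args) {
        ChartRow r = parse("PTL Jane 63.5 6 3 1 1 1 ns 32.00 11.60 Held At Wire Inside");
        System.out.println(r.dog + " finished " + r.fin + " in " + r.time + " at odds " + r.odds);
    }
}
```

Because the pattern separates fields on any run of whitespace, it works on both single-spaced and column-aligned rows.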
19. Case Study: AZGreyhound System
20. AZGreyhound System Design
[Figure: AZGreyhound system design diagram. Components shown:
Greyhound Data (Race Data, Odds Data), AZGreyhound DB, Model Building
(Training / Testing), Prediction, Betting Engine, and Metrics
(Accuracy, Payout, Efficiency). Bet types shown: Traditional (Win,
Place, Show), Straight Bets (Exacta, Trifecta, Superfecta), and Box
Bets (Quiniela, Trifecta, Superfecta).]
21. Greyhound Data Extraction
- Greyhound data was gathered from
www.trackinfo.com. The Web site links to
  - GreyMatter: http://66.236.122.233:8080/tracklink/
  - TrackInfo: http://www.trackinfo.com/index2.html
- The race and odds data was parsed into a SQL
Server database, and then the data was sent to the
AZGreyhound system for prediction.
22. Example code
This method picks up the overall race
information and puts it in the database.

    public void RacePrograms() throws Exception {
        ... ...
        String URL1 = "http://www.trackinfo.com/trakdocs/hound/";
        String URL2 = "/Rpages";
        ... ...
        OpenConnection2();
        try {
            ... ...
            // Data parsing URL: build the page address for this track
            TrackAbbrev = rSet.getString("TrackAbbrev");
            String URL = URL1 + TrackAbbrev + URL2;
            Feed = web.Scraper(URL, 1);
            ... ...
            NumItems = web.NumItems(Feed, "icons/html.gif");
            // Parsing out each data field.
            // The loop bound and the HTML marker strings are missing from the
            // slide text; NumItems is assumed as the bound, and the missing
            // markers are left as "...".
            for (int y = 1; y <= NumItems; y++) {
                Feed = Feed.substring(Feed.indexOf("icons/html.gif"));
                FileName = web.ExtractText(Feed, ..., "");
                Feed = Feed.substring(Feed.indexOf(...));
                FileDate = web.ExtractText(Feed, "NOWRAP", "");
                FileContents = web.Scraper(URL + "/" + FileName, 1);
                // Replace single quotes so they cannot break the SQL insert
                FileContents = FileContents.replaceAll("'", "-");
                // Insert into DB
                db.Insert2DBProgram(FileName, FileDate, FileContents);
            }
            CloseConnection2();
        } catch (SQLException e) {
            System.out.println(e);
        }
    }
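The slides do not show how the `web` helper is implemented. As a rough sketch of the two string operations the method relies on, the code below counts how often a marker occurs in a page and extracts the text between two delimiters. `ScrapeHelpers`, `countOccurrences`, and `extractBetween` are hypothetical stand-ins for `web.NumItems` and `web.ExtractText`, and the sample listing is invented for illustration.

```java
/** Sketch: hypothetical stand-ins for the marker-counting and text-extraction helpers. */
class ScrapeHelpers {

    /** Counts non-overlapping occurrences of marker in text. */
    static int countOccurrences(String text, String marker) {
        if (marker.isEmpty()) {
            return 0; // avoid an endless loop on an empty marker
        }
        int count = 0;
        int from = 0;
        while ((from = text.indexOf(marker, from)) != -1) {
            count++;
            from += marker.length();
        }
        return count;
    }

    /** Returns the text between the first start marker and the next end marker, or null. */
    static String extractBetween(String text, String start, String end) {
        int s = text.indexOf(start);
        if (s < 0) {
            return null;
        }
        s += start.length();
        int e = text.indexOf(end, s);
        return e < 0 ? null : text.substring(s, e);
    }

    public static void main(String[] args) {
        String marker = "icons/html.gif";
        // Invented directory listing with one marker image per file link.
        String feed = "<img src=\"icons/html.gif\"> <a href=\"A1.HTM\">A1.HTM</a>"
                + "<img src=\"icons/html.gif\"> <a href=\"E1.HTM\">E1.HTM</a>";
        int numItems = countOccurrences(feed, marker);
        for (int y = 1; y <= numItems; y++) {
            feed = feed.substring(feed.indexOf(marker));      // jump to the next entry
            System.out.println(extractBetween(feed, "<a href=\"", "\">"));
            feed = feed.substring(marker.length());           // step past this marker
        }
    }
}
```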
23. You can use the sports data sources introduced in
this set of slides for your data mining
project. You are strongly encouraged to identify
other interesting public sports data sets for
your project.
Thanks!