Title: Earthquake Prediction using Data Mining Tools
1- Earthquake Prediction using Data Mining Tools
- Mrinalini Kabbur
- Ritu Chinya
- Progress Report
2 Introduction
- An earthquake is a sudden movement of the Earth,
caused by the abrupt release of strain that has
accumulated over a long time.
- Earthquakes remain one of the least predictable
natural hazards so far.
- The goal of earthquake prediction is to give
warning of potentially damaging earthquakes early
enough to allow an appropriate response to the
disaster, enabling people to minimize loss of
life and property.
3 Project Design
- This project deals with earthquake classification
and prediction using data mining tools.
- Weka was used to develop the model.
- Naïve Bayesian was used to classify instances with
an unknown class label.
- Used C4.5 with a 66% split to classify the data and
10-fold cross-validation to evaluate accuracy.
4 Method
- Installation of Weka
- Weka is a set of software for machine learning
and data mining
- Developed at the University of Waikato in New
Zealand
- Available for free
- Easy-to-use graphical user interface
5- Learning Weka
- Both of us were new to Weka
- Used the tutorial by Svetlana Aksanova
- Searched the internet for additional information
on Weka
- Gathering the earthquake data set
- Consists of the earthquakes that happened in
the Northern California region during 2005.
- Data gathered from the United States Geological
Survey (USGS) website.
6- Data preprocessing
- Weka algorithms work on the ARFF format
- But the data was in HTML format, as shown below.
7 The data was in HTML format as shown below.
8 Data Preprocessing (Contd)
- So the data had to be transferred to an Excel
file.
- It was difficult to convert directly from HTML to
Excel.
- So the data was first saved in Word format.
9 Excel Format
10- Conversion from Excel to ARFF format.
- Saved the Excel file as CSV.
- Used awk commands to format the data.
- Keyed in some missing data.
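The CSV to ARFF step above was done with awk. As a rough illustration of what that conversion has to produce, here is a minimal Python sketch. The function name, the column names and the two sample rows are assumptions made for this example (they are not the project's actual commands or USGS data); empty cells are written as '?', which is how ARFF marks a missing value.

```python
import csv
import io

def csv_to_arff(csv_text, relation, numeric_columns):
    """Convert CSV text with a header row into ARFF text.

    Only the columns named in numeric_columns are kept, and each one is
    declared as a numeric attribute. Empty cells become '?', the ARFF
    marker for a missing value.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    lines = [f"@relation {relation}", ""]
    for name in numeric_columns:
        lines.append(f"@attribute {name} numeric")
    lines += ["", "@data"]
    for row in rows:
        values = [(row.get(name) or "").strip() or "?" for name in numeric_columns]
        lines.append(",".join(values))
    return "\n".join(lines) + "\n"

# Hypothetical two-row sample; the values are illustrative, not USGS data.
sample = (
    "Latitude,Longitude,Depth,Magnitude\n"
    "38.5,-122.3,8.1,3.2\n"
    "37.9,-121.8,,3.6\n"
)
columns = ["Latitude", "Longitude", "Depth", "Magnitude"]
arff = csv_to_arff(sample, "Earthquake", columns)
print(arff)
```

The missing Depth in the second sample row comes out as '?', which is the kind of gap that was keyed in by hand in the project.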
11- Data Cleansing
- The earthquake data contained many parameters.
They include
- Date and time
- Longitude
- Latitude
- Depth
- Magnitude
- Event ID
- Source
- Magt
- Nst
- Gap
- Clo
- Attributes of interest include
- Date and Time
- Longitude
- Latitude
- Depth
- Magnitude
12 Date and time fields are not considered while
applying the classification algorithm. The filter
weka.filters.unsupervised.attribute.Remove is
applied to remove the date and time attribute.
This is shown below.
13- Discretize
- Attributes contain numeric data.
- Some Weka algorithms, such as ID3, require nominal
attribute values.
- Conversion of numeric attributes to nominal.
- The attributes Longitude, Latitude, Depth and
Magnitude are all discretized by using the
filter weka.filters.unsupervised.attribute.Discretize.
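To make the discretization step concrete, the sketch below splits a numeric range into 10 equal-width intervals, which is one common way to turn numeric values into nominal bins. The magnitude range of 3.0 to 7.1 is an assumption inferred from the bin boundaries reported in the run output later in this report; the function names are made up for this example.

```python
from bisect import bisect_left

def equal_width_cut_points(lo, hi, bins=10):
    """Return the bins-1 interior cut points that divide [lo, hi] evenly."""
    width = (hi - lo) / bins
    return [lo + width * i for i in range(1, bins)]

def bin_index(value, cuts):
    """Index of the interval (previous cut, cut] that contains value."""
    return bisect_left(cuts, value)

# Assumed magnitude range, inferred from the reported bin boundaries.
cuts = equal_width_cut_points(3.0, 7.1, bins=10)
print([round(c, 2) for c in cuts])
# Magnitudes 3.2, 4.0 and 7.0 fall into the first, third and last bin.
print(bin_index(3.2, cuts), bin_index(4.0, cuts), bin_index(7.0, cuts))
```

Under that assumed range the cut points come out as 3.41, 3.82, 4.23 and so on up to 6.69, matching the class intervals listed in the Naïve Bayesian results.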
14- Apply classification algorithms to come up with
- Decision trees
- Rule sets
- Algorithms used for modelling
- C4.5
- Naïve Bayesian
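As an illustration of how a Naïve Bayesian classifier works on nominal (discretized) attributes, here is a minimal sketch. It is not Weka's implementation: the add-one (Laplace) smoothing, the function names and the six-row toy data set are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-attribute value frequencies per class."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (attribute index, class) -> value counts
    values_seen = defaultdict(set)        # attribute index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen

def predict(model, row):
    """Return the class with the highest log posterior for one instance."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior
        for i, value in enumerate(row):
            count = value_counts[(i, label)][value]
            # Add-one smoothing (an assumption of this sketch) avoids log(0).
            score += math.log((count + 1) / (n + len(values_seen[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data: (depth bin, region bin) -> magnitude class.
rows = [("shallow", "north"), ("shallow", "north"), ("shallow", "south"),
        ("deep", "south"), ("deep", "south"), ("deep", "north")]
labels = ["low", "low", "low", "high", "high", "low"]
model = train_naive_bayes(rows, labels)
print(predict(model, ("deep", "south")))
print(predict(model, ("shallow", "north")))
```

On this toy data the first instance is predicted as "high" and the second as "low", because each attribute value contributes an independent factor to the class score.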
15 C4.5
- We have considered two cases.
- Cross-validation: evaluates the classifier by
cross-validation, using the number of folds that
are entered in the Folds text field.
- Percentage split: evaluates the classifier on how
well it predicts a certain percentage of the
data, which is held out for testing. The amount
of data held out depends on the value entered in
the % field.
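The two evaluation modes can be sketched as index bookkeeping over the 113 instances. This is a simplified illustration, not Weka's code: the shuffling seed, the rounding of the training size and the function names are assumptions, and the folds here are not stratified by class.

```python
import random

def percentage_split(n, train_percent=66, seed=1):
    """Shuffle n instance indices and split them into train and test parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(n * train_percent / 100)
    return indices[:n_train], indices[n_train:]

def k_fold_indices(n, k=10, seed=1):
    """Yield (train, test) index lists; each instance is tested exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

train, test = percentage_split(113)
print(len(train), len(test))

fold_list = list(k_fold_indices(113))
print([len(fold_test) for _, fold_test in fold_list])
```

With a 66% split, 75 of the 113 instances are used for training and 38 are held out. With 10 folds, every instance is used for testing once and for training nine times.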
16 First we will evaluate the classifier with a
66% percentage split: 66% of the data is used for
training and the rest is held out for testing, as
shown below.
17 Run Analysis
18 Run Information gives you the following
information:
- the algorithm you used: J48
- the relation name: Earthquake
- the number of instances in the relation: 113
- the number of attributes in the relation: 4
- the list of the attributes: Longitude, Latitude,
Depth, Magnitude
- the test mode you selected: split 66%
Classifier model is an un-pruned decision tree in
textual form that was produced on the full
training data. As you can see, the first split
is on the Longitude attribute; at the second
level, the splits are on Latitude and
Longitude.
Below the tree structure, the output lists the
number of leaves (which is 10) and the number of
nodes in the tree, i.e. the size of the tree
(which is 19). The program also gives the time it
took to build the model, which is 0.06 seconds.
In this case only 67% of 113 training instances
have been classified correctly. This indicates
that the results obtained from the training data
are not optimistic compared with what might
be obtained from an independent test set from
the same source.
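C4.5 chooses the attribute for each split by gain ratio, i.e. information gain divided by the split information of the attribute. The sketch below computes that quantity for a nominal attribute. It is a simplified illustration (no numeric thresholds, no pruning, no missing values), and the attribute values and class labels are hypothetical toy data.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in Counter(items).values())

def gain_ratio(values, labels):
    """Information gain of splitting on values, divided by split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0

# Hypothetical toy data: two candidate attributes and one class label.
labels    = ["low", "low", "high", "high"]
longitude = ["west", "west", "east", "east"]   # separates the classes perfectly
depth     = ["shallow", "deep", "shallow", "deep"]  # tells us nothing

print(gain_ratio(longitude, labels))
print(gain_ratio(depth, labels))
```

In this toy case the first attribute has a gain ratio of 1 and the second of 0, so a tree builder would split on the first one.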
19 Weka also lets you visualize the decision tree
20- Accuracy Estimation
- Ten-fold cross-validation
- Snapshot of Naïve Bayesian classification
using Weka
21 Run Information
- Run information
- Scheme: weka.classifiers.bayes.NaiveBayes
- Relation: Earthquake-weka.filters.unsupervised.attribute.Discretize-B10-M-1.0-Rlast
- Instances: 113
- Attributes: 4
- Latitude
- Longitude
- Depth
- Magnitude
- Test mode: 10-fold cross-validation
- Classifier model (full training set)
- Naive Bayes Classifier
- Time taken to build model: 0.06 seconds
- Stratified cross-validation
- Summary
- Correctly Classified Instances: 69 (61.0619 %)
- Incorrectly Classified Instances: 44 (38.9381 %)
- Kappa statistic: -0.0061
- Mean absolute error: 0.1187
22 Run Information (Cont)
- Detailed Accuracy By Class
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.972    0.976    0.627      0.972   0.762      '(-inf-3.41]'
0        0        0          0       0          '(3.41-3.82]'
0        0.019    0          0       0          '(3.82-4.23]'
0        0        0          0       0          '(4.23-4.64]'
0        0        0          0       0          '(4.64-5.05]'
0        0        0          0       0          '(5.05-5.46]'
0        0        0          0       0          '(5.46-5.87]'
0        0        0          0       0          '(5.87-6.28]'
0        0        0          0       0          '(6.28-6.69]'
0        0.009    0          0       0          '(6.69-inf)'
- Confusion Matrix
  a  b  c  d  e  f  g  h  i  j   <-- classified as
 69  0  1  0  0  0  0  0  0  1 | a = '(-inf-3.41]'
 24  0  1  0  0  0  0  0  0  0 | b = '(3.41-3.82]'
  8  0  0  0  0  0  0  0  0  0 | c = '(3.82-4.23]'
  6  0  0  0  0  0  0  0  0  0 | d = '(4.23-4.64]'
  2  0  0  0  0  0  0  0  0  0 | e = '(4.64-5.05]'
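The summary figures can be checked by hand from the counts reported above. The short sketch below redoes the arithmetic using only numbers that appear in the run output (113 instances, 69 correct, and the first row of the confusion matrix). The variable names are made up for this example, and the majority-class baseline is our own added comparison, not part of the Weka output.

```python
total = 113
correct = 69
row_a = [69, 0, 1, 0, 0, 0, 0, 0, 0, 1]  # first confusion matrix row (class a)

accuracy = correct / total
error_rate = (total - correct) / total
recall_a = row_a[0] / sum(row_a)          # TP rate of class a

# Share of instances in class a: the accuracy of always predicting class a.
majority_baseline = sum(row_a) / total

print(f"accuracy          {accuracy:.4%}")
print(f"error rate        {error_rate:.4%}")
print(f"TP rate, class a  {recall_a:.3f}")
print(f"majority baseline {majority_baseline:.1%}")
```

The accuracy, error rate and TP rate reproduce the reported 61.0619 %, 38.9381 % and 0.972. Class a alone holds 71 of the 113 instances, so always predicting class a would score about 62.8%, which is consistent with the Kappa statistic being close to zero.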
23 Learnings from the project
- We were both new to Weka and learnt to use the
Weka software.
- It was challenging to analyze a large amount of
data compared to what we did in our homework.
- We realized that data pre-processing indeed takes
a long time.
- We got a clear understanding of the C4.5 and Naïve
Bayesian classification algorithms.
24 Division of work
- We worked together on all the tasks.
Conclusion
We realized that data mining tools are very
powerful and save a lot of time when classifying
large amounts of data. We found that using the C4.5
algorithm with 66% of the data as training data gave
an accuracy of 67%, whereas 10-fold
cross-validation gave an accuracy of 62% in the
case of the earthquake data. The Naïve Bayesian
algorithm correctly classified 61% of the
test data. So the results were fairly close. All
in all, the project was very interesting and
challenging, and we enjoyed working on it.
25 References
- http://www.studentprogress.com/appln/colleges/cogrec/Papers/D_05.pdf
- www.meteoquake.org/our.html
- http://www.cs.waikato.ac.nz/ml/weka/index.html
- http://gaia.ecs.csus.edu/mei/215/tutorial.html
- http://www.ngdc.noaa.gov/seg/hazard/sig_srch_idb.shtml
- Weka Explorer tutorial by Svetlana Aksanova