Title: Chapter-7: Case Study
1- Chapter-7 Case Study
- www.AhsanAbdullah.com
2Background
- Data being recorded for decades by several
organization, mostly never digitized and never
used for decision making. Under-utilization. - Data is horizontally wide i.e. 100 attributes
and vertically deep i.e. tens of thousands of
rows. - Huge potential for long-term and short-term
decision making. - Decision making not data driven, but based on
expert judgment, sometimes with tragic results.
3Area under study
- Punjab is the bread-basket of Pakistan, and
administratively divided into eight divisions. - Multan division is the cotton hub of Punjab,
which consists of district Multan. - District Multan has three tehsils, which are
furhter divided into central points or Markaz. - This study is about 10 tehsils of Multan using
data for years 2001 and 2002.
4Area under study Map
- Key Markaz
- 1 Bosan
- 2 Qadirpurran
- 3 Multan
- Makhdum Rashid
- Mumtazabad
- 6 Shujabad
- 7 Hafizwala
- 8 Jalalpur Pirwala
- 9 Qasba Marral
5Major Players
- A pest is an insect that eats the crop. There are
two types of cotton pests - Sucking pests (White Fly, Jassid, Thrips etc.)
- Boll Worms (spotted boll-worm, pink boll-worm)
- A predator is an insect that eats the pest.
- Such as Lady bug Beetle, Spiders, Ants etc.
- Cotton is also effected by virus
- Cotton Leaf Curl Virus (CLCV)
6Economic Threshold Level (ETL_A)
- The pest population beyond which it is cost
effective to use pesticide. - Pesticide is a poison which is used to kill
pests. - Note that eradicating pests is NOT feasible,
controlling pest population is feasible.
7Pests Predators
Pests
Predators
Source United States Department of Agriculture
(USDA)
8ETL_A Graph
Economic Injury
ETL_A
Pest Population
Time
9The need
- Extensive losses to cotton crop due to pest
attacks in 1983 resulted in the establishment of
Directorate of Pest Warning in 1984. - Since 1984 scouts from the Directorate have been
sampling fields and recording data and advising
farmers. - During 2003-04 season, Boll Worm attack on the
cotton crop resulted in a loss of nearly 0.5 M
bales. - Weather not the only factor, but points to a
multitude of factors, requiring efficient and
effective data analysis, for better decision
making.
10The need IT in Agriculture
- The volume of pest scouting data accumulated
todate is enormous both horizontally and
vertically. - A typical pest scouting sheet consists of 35
variables or attributes. - Metrological data consists of 50 variables.
- Coarse estimate of pest scouting data recorded
for the cotton crop alone stands at 5 million
records, and growing. - Tasking the human brain alone, for synthesis of
information from this data is not only
impractical but dangerous too. - Need a Data Warehouse, OLAP tools and Data Mining
to analyze the data.
11Agro-Informatics
- I.T. sector is at the heart of the economic
revival of Pakistan President of Pakistan,
Launching of VU, Mar. 23, 2003. - Agriculture is the backbone of our economy, upto
74 of the population is dependent on it. - IT is an enabler, and has the potential to
benefit everyone when applied in Agriculture. - IT Agriculture A win-win scenario.
12Agro-Informatics Relevant URL
To know more about Agro-Informatics,
visit www.agroict.org
13How to go about?
- Discussed several DWH implementation
methodologies in lectures 32-35. - Will adopt a pilot project approach, because
- A full-blown DWH requires extensive investment.
- Show users the value of DSS.
- Establish blue print for full-blown system.
- Identify problem areas.
- Reveal true data demographics.
- Pilot projects are supposed to work with limited
data.
14The 12-step Approach of Shaku Atre
15Step-1 Determine Users needs
- A data warehouse or a data mart?
- User not educated, think DWH is a big Excel
sheet. - Some confuse DSS with Decision Making Systems.
- DWH at Federal level.
- Data marts at district level.
- What are the business objectives?
- Good governance via E-Governance.
- Timely and data driven decision making.
- Fore warning of farmers.
- Timely planning for purchase/stocking of
pesticides.
16Step-1 Determine Users needs
- Cost/benefit analysis, project estimation risk
assessment - Cost of a prototype system fraction of the loss
in one year. - Required data either not made available or
expensive. - Required data in very raw format.
- People hungry for ideas looking for a solution on
a plate. - Where do I get the data from?
- Data available at Directorate of pest warning.
- Raw form
- Metrological department asking too much for data
collected using public funds.. - Weather forecasts downloaded from web site of
www.dawn.com
17Steps-23 Determine DBMS Sever Hardware
Platform
- Hardware and Software along with trained people
was already available on Campus. - The pilot DWH was implemented using NCR Teradata
Data Warehousing solution for NT running on a
dual Intel 950 Mhz Xeon processors server and 1GB
of RAM. - Total internal Hard Disk capacity of the server
amounted to 36 GB while external RAID control
supports 8 additional SCSIs of 18 GB each.
18Step-4 Information Data Modeling
Dimensional Modeling
19Step-4 Information Data Modeling
Simplified ERD
Other field inputs such as irrigation, fertilizer
etc. not included as data not available
20Step-4 Information Data Modeling
KEY WFN White Fly Nymph WFA White Fly Adult W
White B Brown S Small Larvae L Large Larvae
(optional)
Logical Physical Design
21Step-5 Construct Metadata Repository
- What means what was important.
- Who access what was not important as single user
environment. - Meetings held similar to requirements definition,
and questions transformed into SQL queries. - Meta data issues still cropped up, more of
business rules.
22Step-5 Surprise case
Sucking pests
Ball Worm Complex
SBW Spotted Ball Worm ABW Army Ball Worm PBW
Pink Ball Worm
If pest population is low, predator population
will also be low, because there will be less
food for predators to live on i.e. pests.
23Step-6 Data Acquisition Cleansing
Hand filled pest scouting sheet
Typed pest scouting sheet
24Step-6 Issues
- The pest scouting sheets are larger than A4 size
(8.5 x 11), hence the right end was cropped
when scanned on a flat-bed A4 size scanner. - The right part of the scouting sheet is also the
most troublesome, because of pesticide names for
a single record typed on multiple lines i.e. for
multiple farmers. - As a first step, OCR (Optical Character Reader)
based image to text transformation of the pest
scouting sheets was attempted. But it did not
work even for relatively clean sheets with very
high scanning resolutions. - Subsequently DEOs (Data Entry Operators) were
employed to digitize the scouting sheets by
typing.
25Step-6 Why the issues?
- Major issues of data cleansing had arisen due to
data processing and handling at four levels by
different groups of people - Hand recordings by the scouts at the field level.
- Typing hand recordings into data sheets at the
DPWQCP office. - Photocopying of the typed sheets by DPWQCP
personnel. - Data entry or digitization by hired data entry
operators.
26Step-7 Transform, Transport Populate
27Motivation For Transformation
- Trivial queries give wrong results.
- Static and dynamic attributes
- Static attributes recorded repeatedly.
28Step-7 Resolving the issue
- Solution Individualization of cultivated fields.
- Technique similar to BSN used to fix names.
- Unique ID assigned to farmers.
- BSN used again, and unique ID assigned to fields.
- Results
- Limitation Field individualization not perfect.
Some cases of farmers with same geography, sowing
date, same variety and same area. Such cases
were dropped.
29Step-8 Middleware Connectivity
- Since the source data is maintained in a non
digital format, hence connectivity with the data
warehouse was irrelevant. - Once digitized, it was rather straightforward to
load data into the warehouse. - Furthermore, in the foreseeable future, it was
not anticipated that the scouting sheets were
going to be maintained in a digitized form.
30Step-9-11 Prototyping, Querying Reporting
- Implemented the prototype with user involvement.
- Applications developed
- 10. A data mining tool was also developed based
on an indigenous technique that used crossing
minimization paradigm for unsupervised
clustering. - 11. A low-cost OLAP tool was indigenously
developed actually it was a Multi dimensional
OLAP or MOLAP. - Use querying reporting tools
- The following SQL query was used for validation
- SELECT Date_of_Visit, AVG(Predators),
AVG(Dose1Dose2Dose3Dose4) - FROM Scouting_Data
- WHERE Date_of_Visit lt 12/31/2001 and predators
gt 0 - GROUP BY Date_of_Visit
31Step-12 Deployment System Management
- Since a pilot project, therefore, the
traditional deployment methodologies and system
management techniques were not followed to the
word, and are not discussed here.
32Agri-DSS usage Data Validation
- Quality and validity of the underlying data is
the key to meaningful and authentic analysis. - After ensuring a satisfactory level of data
quality (based on cost-benefit trade-off)
extremely important to scientifically validate
the data that the DWH will constitute. - Some very natural checks were employed for this
purpose. Relationship between the pesticide
spraying and predator (beneficial insects)
population is a fact well understood by
agriculturists. - Predator population decreases as pesticide spray
increases and then continually decreases till the
end of season.
33Agri-DSS usage Data Validation Graph
34Agri-DSS usage Spray Dates
- Working Behaviors at Field Level Spray dates
- As expected, apparently spray dates and spray
frequency for 2001 and 2002 do not displayed any
correlation as it is dependent on pest
populations, availability of pesticides etc. - For deep analysis, moving average of sprays for
five days, and a moving correlation of sprays for
five days were calculated and plotted. - For the sake of uniformity, the moving average of
spray was normalized using the maximum spray
frequency.
35Agri-DSS usage Spray Dates Graph
No relationship should have existed for the two
years. But note the surprising finding that most
sprays occurring on and around 12th Aug. in BOTH
years with high correlation, appearing as a
spike. Also note the dip in sprays around 11th
Sep.! Sowing at predetermined time makes sense,
as it is under the control of the farmer, but
that is not true for spraying. Pests dont follow
calendars therefore, whenever, ETL_A is crossed
pesticides are sprayed.
36Agri-DSS usage Explaining Findings
14th Aug. is the independence day of Pakistan and
a national holiday. In Pakistan, people are in
a habit of sandwiching gazetted holidays with
casual leaves consequently businesses are closed
for a longer period, including that of pesticide
suppliers. 14th Aug. occurred on Tue and Wed in
2001 and 2002, respectively, thus making it ideal
to stretch the weekend. During Aug/Sep.
humidity is also high, with correspondingly high
chances of pest infestations. Therefore,
apparently the farmers decided not to take any
chances, and started spraying around 11th Aug.
evidently even when it was not required. The
weather forecast for 13 Aug. 2001 and 2002 was
showers and cloudy, respectively. Therefore, most
likely the pesticide sprayed was washed-off.
Decline in sprays around 9/11 could not be
explained.
37Agri-DSS usage Sowing Dates
The results of querying the sowing date based on
the day of the week are shown below Obs
erve least number of sowings done on Thursdays,
in each year. This finding was later confirmed by
extension personnel. Multan is famous for its
shrines. Thursdays are usually related with
religious festivals and activities, a mix of
devotion and recreation, and usually held at
shrines, hence a tendency of doing less work on
Thursdays. Similar behavior was observed for
spraying too.
38Conclusions Lessons
- ETL is a big issue. There are no digitized
operational databases, data entry of these sheets
is very expensive, slow and prone to errors. - Each farmer is repeatedly visited by agriculture
extension people. This results in repetition of
information, about land, sowing date, variety
etc. Hence, farmer and land individualization are
critical. Such an individualization task is hard
to implement for multiple reasons. - There is a skewness in the scouting data. Public
extension personnel (scouts) are more likely to
visit educated or progressive farmers, as it
makes their job of data collection easy.
Furthermore, large land owners and influential
farmers are also more frequently visited by the
scouts. Thus the data does not give a true
statistical picture of the farmer demographics. - All pests are not present all the time (such as
ball worms), most of the time second spray is not
done (or not recorded), and hence tables are
sparse. Therefore, it is better to split tables
so as to decrease header size and table space. - Unlike traditional data warehouse where the end
users are decision makers, here the
decision-making goes all the way down to the
extension level. This presents a challenge to the
analytical operations designer, as the findings
must be fairly simple to understand and
communicate.