Title: Failure in the PATHFINDER Mission
1Failure in the PATHFINDER Mission
- Chandan Kumar
- EE 585 Fault Tolerant Computing
2Outline
- Background
- Simplified view of H/W architecture
- S/W architecture
- Failure
- Cause
- Correction
3Background
- Launched Dec 4, 1996.
- Landed July 4, 1997.
- Mission Objectives
- To prove that the development of "faster, better and cheaper" spacecraft is possible (with three years for development and a cost under US$150 million).
- To show that it is possible to send a load of scientific instruments to another planet with a simple system and at one fifth the cost of a Viking mission.
4Background Contd.
- To demonstrate NASA's commitment to low-cost planetary exploration by finishing the mission with a total expenditure of US$280 million, including the launch vehicle and mission operations.
- To demonstrate the mobility and usefulness of a micro rover on the surface of Mars.
- It carried a number of scientific instruments:
- Mars Pathfinder Lander
- Imager for Mars Pathfinder (IMP) (includes magnetometer and anemometer)
- Atmospheric and meteorological sensors (ASI/MET)
5Background Contd.
- Rover Sojourner
- Imaging system (three cameras: front B&W stereo pair, one rear color)
- Laser striper hazard detection system
- Alpha Proton X-ray Spectrometer (APXS)
- Wheel Abrasion Experiment
- Material Adherence Experiment
- Accelerometers
- Potentiometers
- Final transmission Sept 27, 1997.
- 16,500 images sent from the lander, 550 from the rover.
- 15 analyses of rocks.
6Simplified view of Hardware Architecture
- A single CPU controls the spacecraft.
- The CPU resides on a VME bus.
- The VME bus holds interface cards for the radio and the camera.
- It also holds an interface to a 1553 bus.
- The 1553 bus connects to the cruise and lander stages.
- H/W on the cruise stage controls thrusters, etc.
- H/W on the lander interfaces to instruments such as the accelerometers, the radar altimeter and the ASI/MET.
7The Software Architecture
|<--------------------------- .125 seconds ---------------------------->|
|<- bus active ->|<- bc_dist active ->|     |<- bc_sched active ->|     |
|----------------|--------------------|-----|---------------------|-----|
t1               t2                   t3    t4                    t5    t1
There are periods when tasks other than the ones listed are executing, and there is some idle time.
- t1 - bus hardware starts via hardware control on the 8 Hz boundary. The transactions for this cycle had been set up by the previous execution of the bc_sched task.
- t2 - 1553 traffic is complete and the bc_dist task is awakened.
- t3 - bc_dist task has completed all of the data distribution.
- t4 - bc_sched task is awakened to set up transactions for the next cycle.
- t5 - bc_sched activity is complete.
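The cycle layout above can be checked with a little arithmetic. The sketch below is only an illustration: the function name `cycle_breakdown` and the millisecond offsets in the usage line are hypothetical values, not measurements from the mission. It takes t1 as 0 and splits one 8 Hz cycle into the intervals named on the slide.

```python
BUS_RATE_HZ = 8
CYCLE_MS = 1000 // BUS_RATE_HZ  # 125 ms, i.e. the .125 second cycle

def cycle_breakdown(t2, t3, t4, t5, cycle_ms=CYCLE_MS):
    """Split one bus cycle into intervals; all times are ms offsets from t1 = 0."""
    # The events must occur in order and fit inside one cycle.
    if not (0 <= t2 <= t3 <= t4 <= t5 <= cycle_ms):
        raise ValueError("events out of order or outside the cycle")
    return {
        "bus_active": t2,              # t1..t2: 1553 traffic
        "bc_dist_active": t3 - t2,     # t2..t3: data distribution
        "gap_before_bc_sched": t4 - t3,  # t3..t4: other tasks or idle
        "bc_sched_active": t5 - t4,    # t4..t5: set up next cycle
        "tail": cycle_ms - t5,         # t5..next t1
    }

# Hypothetical offsets, chosen only to exercise the function.
print(cycle_breakdown(30, 70, 100, 110))
```

The gap between t3 and t4 is the margin bc_dist has before bc_sched wakes up; if bc_dist is delayed past t4, the ordering check fails.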
8The Failure
- The spacecraft began experiencing total system resets.
- Each reset reinitializes all of the hardware and software. It also terminates the execution of the current ground-commanded activities.
- The remainder of the activities for that day were not accomplished until the next day.
9The Cause
- The failure was a case of priority inversion.
- In scheduling, priority inversion is the scenario where a low priority task holds a shared resource that is required by a high priority task.
- This causes the execution of the high priority task to be blocked until the low priority task has released the resource, effectively "inverting" the relative priorities of the two tasks.
- If some other medium priority task attempts to run in the interim, it will take precedence over both the low priority task and the high priority task.
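The scenario described above can be reproduced with a toy simulation. This is a minimal sketch under simplifying assumptions, not the flight scheduler: time advances in whole ticks, scheduling is fixed-priority preemptive, there is one plain mutex without priority inheritance, and a task that needs the mutex holds it for its entire run. The `Task` and `simulate` names and all numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    priority: int        # larger number = higher priority
    release: int         # tick at which the task becomes ready
    work: int            # ticks of CPU time it needs
    needs_lock: bool = False
    done_at: int = -1    # filled in by simulate()

def simulate(tasks, max_ticks=100):
    """Return the name of the task that ran in each tick."""
    remaining = {t.name: t.work for t in tasks}
    lock_owner = None
    timeline = []
    for tick in range(max_ticks):
        if all(left == 0 for left in remaining.values()):
            break
        ready = [t for t in tasks
                 if t.release <= tick and remaining[t.name] > 0]
        # A task that needs the mutex is blocked while another task owns it.
        runnable = [t for t in ready
                    if not (t.needs_lock and lock_owner not in (None, t.name))]
        if not runnable:
            timeline.append("idle")
            continue
        current = max(runnable, key=lambda t: t.priority)
        if current.needs_lock:
            lock_owner = current.name
        remaining[current.name] -= 1
        timeline.append(current.name)
        if remaining[current.name] == 0:
            current.done_at = tick + 1
            if lock_owner == current.name:
                lock_owner = None  # mutex released when the task finishes
    return timeline

low = Task("low", priority=1, release=0, work=3, needs_lock=True)
high = Task("high", priority=3, release=1, work=2, needs_lock=True)
medium = Task("medium", priority=2, release=1, work=4)
print(simulate([low, high, medium]))
# ['low', 'medium', 'medium', 'medium', 'medium', 'low', 'low', 'high', 'high']
```

The high priority task becomes ready at tick 1 but finishes only at tick 9, because the medium priority task keeps the mutex holder off the CPU for four ticks.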
10The Cause Contd.
- The failure was identified by the spacecraft as a failure of the bc_dist task to complete its execution before the bc_sched task started.
- The ASI/MET task is delivered its information via an interprocess communication (IPC) mechanism.
- The IPC mechanism is based on pipes.
- The higher priority bc_dist task was blocked by the much lower priority ASI/MET task that was holding a shared resource.
11The Cause Contd.
- The resource that caused this problem was a mutual exclusion semaphore used within the select() mechanism.
- The ASI/MET task had acquired this resource and then been preempted by several of the medium priority tasks.
- The bc_dist task attempted to send the newest ASI/MET data via the IPC mechanism, which wrote to the pipe. The pipe write blocked while trying to take the semaphore.
12The Cause Contd.
- The medium priority tasks ran, still not allowing the ASI/MET task to run, until the bc_sched task was awakened.
- At that point, the bc_sched task determined that the bc_dist task had not completed its cycle (a hard deadline in the system) and declared the error that initiated the reset.
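The deadline check that triggered the reset can be modeled as a completion flag that bc_sched inspects when it wakes up. The class below is a hypothetical sketch of that logic only; the `BusCycleMonitor` name, its methods, and the "RESET"/"OK" return values are assumptions for illustration and do not reflect the flight software's actual interfaces.

```python
class BusCycleMonitor:
    """Toy model of the bc_dist-before-bc_sched hard deadline."""

    def __init__(self):
        self.dist_complete = False
        self.reset_count = 0

    def start_cycle(self):
        # t1: a new bus cycle begins; bc_dist has not run yet.
        self.dist_complete = False

    def bc_dist_done(self):
        # t3: bc_dist finished distributing the data.
        self.dist_complete = True

    def bc_sched_wakeup(self):
        # t4: bc_sched checks that bc_dist met its deadline.
        if not self.dist_complete:
            self.reset_count += 1
            return "RESET"
        return "OK"

monitor = BusCycleMonitor()

# Normal cycle: bc_dist completes before bc_sched wakes.
monitor.start_cycle()
monitor.bc_dist_done()
print(monitor.bc_sched_wakeup())   # OK

# Inverted cycle: bc_dist is still blocked on the semaphore at t4.
monitor.start_cycle()
print(monitor.bc_sched_wakeup())   # RESET
```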
13Correction
- The fix was to change the creation flags for the semaphore so as to enable priority inheritance.
- Modifying the semaphore associated with the pipe used for bc_dist task to ASI/MET task communications corrected the problem.
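Priority inheritance means that a task holding the semaphore temporarily runs at the priority of the highest priority task waiting on it. The model below is a minimal sketch of that rule, assuming a single mutex and tasks that hold nothing else; the `Task`, `InheritanceMutex`, and `pick_next` names are hypothetical and this is not the real-time OS's implementation.

```python
class Task:
    def __init__(self, name, priority):
        self.name = name
        self.base_priority = priority
        self.priority = priority   # effective priority seen by the scheduler

class InheritanceMutex:
    def __init__(self):
        self.owner = None
        self.waiters = []

    def take(self, task):
        """Return True if the task got the mutex, False if it must wait."""
        if self.owner is None:
            self.owner = task
            return True
        self.waiters.append(task)
        # Inheritance: the owner is boosted to the waiter's priority.
        if task.priority > self.owner.priority:
            self.owner.priority = task.priority
        return False

    def give(self):
        # The owner drops back to its own priority on release.
        self.owner.priority = self.owner.base_priority
        if self.waiters:
            nxt = max(self.waiters, key=lambda t: t.priority)
            self.waiters.remove(nxt)
            self.owner = nxt
        else:
            self.owner = None

def pick_next(ready):
    """Fixed-priority scheduler: run the highest effective priority."""
    return max(ready, key=lambda t: t.priority)

low, medium, high = Task("low", 1), Task("medium", 2), Task("high", 3)
mutex = InheritanceMutex()
mutex.take(low)                        # low priority task holds the mutex
mutex.take(high)                       # high priority task blocks on it
print(pick_next([low, medium]).name)   # low (boosted above medium)
mutex.give()
print(mutex.owner.name, low.priority)  # high 1
```

With the boost in place, the medium priority task can no longer starve the mutex holder, so the high priority task waits only as long as the critical section lasts.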
14S/W modification on the spacecraft
- Patching is a specialised process.
- The ground sends the difference between what is onboard and what is wanted on the spacecraft.
- S/W on the spacecraft modifies the onboard copy.
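The difference-based update can be sketched as a pair of functions: one run on the ground to compute the difference, one run onboard to apply it. This is an illustrative sketch only; the patch format here (a target length plus a list of offset and replacement-bytes pairs), the function names, and the sample byte strings are all assumptions, since the slides do not describe the actual uplink format.

```python
def make_patch(onboard: bytes, wanted: bytes):
    """Ground side: list the byte runs where the wanted image differs."""
    chunks = []
    i, n = 0, len(wanted)
    while i < n:
        if i < len(onboard) and onboard[i] == wanted[i]:
            i += 1
            continue
        start = i
        # Extend the run while bytes differ or the onboard image has ended.
        while i < n and (i >= len(onboard) or onboard[i] != wanted[i]):
            i += 1
        chunks.append((start, wanted[start:i]))
    return n, chunks

def apply_patch(onboard: bytes, patch):
    """Spacecraft side: rewrite the onboard copy using the patch."""
    length, chunks = patch
    # Truncate or zero-pad the onboard copy to the target length first.
    image = bytearray(onboard[:length].ljust(length, b"\x00"))
    for offset, data in chunks:
        image[offset:offset + len(data)] = data
    return bytes(image)

# Hypothetical images that differ in a single byte.
onboard = b"options=0x01;rate=8"
wanted = b"options=0x09;rate=8"
patch = make_patch(onboard, wanted)
print(patch)                                   # (19, [(11, b'9')])
print(apply_patch(onboard, patch) == wanted)   # True
```

Only the changed byte and its offset need to be sent, which is the point of sending a difference rather than the whole image.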
15Questions??
16References
- http://mars.jpl.nasa.gov/missions/past/pathfinder.html
- http://research.microsoft.com/~mbj/Mars_Pathfinder/Authoritative_Account.html
- http://en.wikipedia.org/wiki/Mars_Pathfinder