Mining the LAMOST Spectral Archive - PowerPoint PPT Presentation

Slides: 21

Provided by: Luo83

Learn more at: http://www.china-vo.org

Transcript and Presenter's Notes

1
Mining the LAMOST Spectral Archive

A-Li Luo, Yan-Xia Zhang, Jian-Nan Zhang, and
Yong-Heng Zhao
National Astronomical Observatories
Chinese Academy of Sciences
lal_at_lamost.bao.ac.cn

2
Brief Introduction About LAMOST Telescope

Aperture: 4 m
Focal length: 20 m
Field of view: 5 degrees
Focal plane: 1.75 m
Number of fibers: 4000
Spectra of objects as faint as magnitude 20.5 with an exposure of 1.5 hours
Wavelength range: 370 nm-900 nm
Resolution: R = 1000, 2000
(high-R spectrographs for stellar spectra are planned)

[Figure: schematic of LAMOST, a meridian reflecting Schmidt telescope, with labels for the focal plane, MB (spherical primary mirror) and MA (reflecting Schmidt corrector).]
3
Introduction About the LAMOST Archive

Two main data sets:
a spectroscopic catalogue (with catalogue subsets)
a set of individual spectra
Expected size of the LAMOST archive

The telescope will be used to survey 1,000,000 QSOs, 10,000,000 galaxies, and 1,000,000 stars. The final one-dimensional spectral dataset will exceed 1 terabyte in size.
4
The archive will provide

The archive will provide reference data for stars, galaxies and quasars in FITS format, and will also provide a variety of services, including a flexible user interface on the web that will allow sophisticated queries within the database. Mining tools are also indispensable in order to extract novel information.
Two main kinds of query
1. Simple and advanced search
Enables users to retrieve data subsets based on search limits that they choose.
2. SQL search
Gives users the option of supplying their search criteria in standard SQL.
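As an illustration of the SQL search option, here is a minimal sketch using Python's built-in sqlite3 module. The table name `spectra`, its columns, and the sample rows are hypothetical; the actual LAMOST catalogue schema is not described in this presentation.

```python
import sqlite3

# Hypothetical catalogue schema: the real LAMOST archive schema is not given here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE spectra (obj_id INTEGER PRIMARY KEY, obj_class TEXT, "
    "redshift REAL, magnitude REAL)"
)
conn.executemany(
    "INSERT INTO spectra VALUES (?, ?, ?, ?)",
    [
        (1, "QSO", 2.10, 19.8),
        (2, "GALAXY", 0.08, 17.2),
        (3, "STAR", 0.00, 15.4),
        (4, "QSO", 0.95, 18.9),
    ],
)


def search(sql, params=()):
    """Run a user-supplied SELECT statement and return all matching rows."""
    return conn.execute(sql, params).fetchall()


# Example query: all QSOs brighter than magnitude 20, ordered by redshift.
rows = search(
    "SELECT obj_id, redshift FROM spectra "
    "WHERE obj_class = ? AND magnitude < ? ORDER BY redshift",
    ("QSO", 20.0),
)
print(rows)
```

Passing the limits as bound parameters rather than pasting them into the SQL string is a design choice of this sketch; a public web interface would also need to restrict users to read-only statements.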
Mining tools
The LAMOST software system will contain a spectra-based data miner that incorporates data mining functions such as clustering, characterization, and classification.

5
Introduction to Data Mining

Data mining (DM) is also known as:
knowledge discovery in databases,
knowledge extraction,
data archaeology,
data dredging,
information harvesting,
business intelligence, etc.
Two kinds of DM:
Event-based mining
Relationship-based mining

6
Mining the LAMOST archive

Clustering
Characterization
Classification

7
1. Clustering

What is clustering?
It divides a database into different groups.
The goal of clustering?
To find groups that are very different from one another, and whose members are similar to each other.
Same as classification?
No. One does not know a priori either which objects one's clusters will include or by which attributes the data will be clustered.
How to use clusters?
When we find clusters that segment the database meaningfully, these clusters may then be used to classify new data.

8
Clustering

Some of the common algorithms used to perform clustering include:
(1) partitioning-based algorithms, which enumerate various partitions and then score them by some criterion, e.g. K-means and K-medoids;
(2) hierarchy-based algorithms, which create a hierarchical decomposition of the set of data (or objects) using some criterion;
(3) model-based algorithms, in which a model is hypothesized for each of the clusters.
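To illustrate the partitioning-based family, here is a minimal K-means sketch (Lloyd's algorithm) in NumPy. The farthest-point initialization and the synthetic two-blob data are assumptions made to keep the example deterministic; they are not taken from the presentation.

```python
import numpy as np


def kmeans(points, k, n_iter=100):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    # Farthest-point initialization (an assumption of this sketch):
    # start at the first point, then repeatedly add the point that is
    # farthest from all centroids chosen so far.
    centroids = [points[0]]
    for _ in range(k - 1):
        chosen = np.array(centroids)
        d = np.linalg.norm(points[:, None, :] - chosen[None, :, :], axis=2)
        centroids.append(points[d.min(axis=1).argmax()])
    centroids = np.array(centroids)

    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members (keep it if empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Two well-separated synthetic blobs as toy data.
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=(5.0, 5.0), scale=0.3, size=(50, 2))
data = np.vstack([blob_a, blob_b])
centroids, labels = kmeans(data, k=2)
print(centroids)
```

K-medoids differs mainly in that each cluster centre must be an actual data point rather than a mean.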

9
An example of clustering: Searching for NLQs

What is an NLQ?
QSOs are active galactic nuclei (AGN) in which two different regions of ionized gas can be distinguished: a broad-line region (BLR) and a narrow-line region (NLR).
The full widths at half maximum (FWHM) of emission lines in spectra of broad-line QSOs (BLQs) often exceed 5000 km/s, whereas in narrow-line QSOs (NLQs) the FWHMs are generally narrower than 1000 km/s.
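The FWHM criterion can be sketched as follows. This is a minimal example, assuming a single continuum-subtracted emission line that lies fully inside the wavelength window. The function names, the synthetic Gaussian line at 4861 Å, and the use of 1000 km/s and 5000 km/s as hard cuts are illustrative assumptions; the slide gives these values only as typical ranges.

```python
import numpy as np

C_KM_S = 299792.458  # speed of light in km/s


def fwhm_velocity(wavelength, flux, rest_wavelength):
    """Estimate the FWHM of one continuum-subtracted emission line in km/s."""
    half = flux.max() / 2.0
    above = np.where(flux >= half)[0]
    lo, hi = above[0], above[-1]
    # Linearly interpolate the half-maximum crossing on each side of the peak.
    left = np.interp(half, [flux[lo - 1], flux[lo]],
                     [wavelength[lo - 1], wavelength[lo]])
    right = np.interp(half, [flux[hi + 1], flux[hi]],
                      [wavelength[hi + 1], wavelength[hi]])
    # Convert the width in wavelength to a velocity width.
    return C_KM_S * (right - left) / rest_wavelength


def classify_by_fwhm(fwhm_km_s, narrow_max=1000.0, broad_min=5000.0):
    """Label a QSO by line width; the hard thresholds are an assumption."""
    if fwhm_km_s < narrow_max:
        return "NLQ"
    if fwhm_km_s > broad_min:
        return "BLQ"
    return "intermediate"


# Synthetic Gaussian line with a true FWHM of 800 km/s (illustrative values).
rest = 4861.0
wavelength = np.arange(4761.0, 4961.0, 0.1)
true_fwhm_kms = 800.0
sigma = (true_fwhm_kms / C_KM_S) * rest / (2.0 * np.sqrt(2.0 * np.log(2.0)))
flux = np.exp(-0.5 * ((wavelength - rest) / sigma) ** 2)
measured = fwhm_velocity(wavelength, flux, rest)
print(round(measured), classify_by_fwhm(measured))
```

Real spectra would first need continuum subtraction and deblending of overlapping lines, which this sketch does not attempt.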

10
An example of clustering: Searching for NLQs

Why do we select this example?
In the LAMOST archive there will be 1 million QSO spectra, including large numbers of NLQs. While NLRs in Seyfert galaxies are already relatively well studied, there are no comparable studies of NLRs in quasars.
Why do we use a clustering technique?
Under the framework of the unified AGN model, we will need to compare the spectra of NLQs statistically with those of Seyfert galaxies.
In which space do we do the clustering?
The basic goal of principal component analysis (PCA) is to reduce the dimensions of the multi-parameter space defined by one's data with as little loss of information as possible. Such a reduction in dimensions has important benefits, especially as projection onto a 2-d or 3-d subspace is often useful for visualizing the data.
What kind of clustering algorithm will we use?
The K-means clustering algorithm.
Experimental data?
15,000 QSO spectra from SDSS DR1.
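The projection of spectra onto a 2-d PCA subspace can be sketched as below. This is a minimal NumPy example on synthetic spectra built from two made-up templates; the function name `pca_project` and the toy data are assumptions, and the preprocessing applied to the SDSS spectra is not described in the presentation.

```python
import numpy as np


def pca_project(spectra, n_components=2):
    """Project rows of `spectra` (n_spectra x n_pixels) onto the leading PCs."""
    mean = spectra.mean(axis=0)
    centered = spectra - mean
    # SVD of the centered data: the rows of vt are the principal components.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    # Fraction of the total variance carried by each retained component.
    explained = (s[:n_components] ** 2) / np.sum(s ** 2)
    return centered @ components.T, components, explained


# Synthetic "spectra": random mixtures of two templates plus small noise.
rng = np.random.default_rng(0)
pixels = np.linspace(0.0, 1.0, 500)
template_1 = np.sin(2 * np.pi * pixels)
template_2 = np.cos(2 * np.pi * pixels)
weights = rng.normal(size=(200, 2)) * np.array([3.0, 1.0])
spectra = weights @ np.vstack([template_1, template_2])
spectra += rng.normal(scale=0.01, size=spectra.shape)

projected, components, explained = pca_project(spectra)
print(projected.shape, explained.sum())
```

The `explained` fractions indicate how much information is lost by keeping only two components; here almost none is lost because the toy spectra have only two underlying degrees of freedom.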

11
An example of clustering: Searching for NLQs

SDSS DR1 QSO spectra for more than 15,000 objects, projected onto a 2-d PCA subspace. The x-axis is the first principal component, PC1, and the y-axis is the second principal component, PC2. Each small asterisk in the figure represents the projection of one spectrum. We found that most of the spectra lie within a roughly spherical region. A quick check revealed that most BLQs lie within this region, while most NLQs (which are less numerous) lie outside it. Using a K-means algorithm, we adjusted the size of the region in order to achieve an optimal separation between BLQs and NLQs.
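The idea of tuning the size of the region can be illustrated with a simple radius scan. Note that this is not the K-means procedure mentioned on the slide, whose details are not given; it is a simpler stand-in that assumes a subsample with known BLQ/NLQ labels. The function name and the synthetic PC1-PC2 coordinates are hypothetical.

```python
import numpy as np


def best_radius(pc_coords, is_nlq, radii):
    """Pick the radius that best separates inside (BLQ-like) from outside (NLQ-like)."""
    center = np.median(pc_coords, axis=0)
    dist = np.linalg.norm(pc_coords - center, axis=1)
    # Fraction of objects whose side of the circle matches their known label.
    scores = [np.mean((dist > r) == is_nlq) for r in radii]
    best = int(np.argmax(scores))
    return radii[best], scores[best]


# Synthetic PC1-PC2 coordinates: a dense central cloud (BLQ-like)
# surrounded by a sparse ring (NLQ-like).
rng = np.random.default_rng(0)
blq = rng.normal(scale=1.0, size=(500, 2))
angles = rng.uniform(0.0, 2.0 * np.pi, size=50)
ring = 6.0 + rng.normal(scale=0.3, size=50)
nlq = np.column_stack([ring * np.cos(angles), ring * np.sin(angles)])
coords = np.vstack([blq, nlq])
labels = np.concatenate([np.zeros(500, dtype=bool), np.ones(50, dtype=bool)])

radius, score = best_radius(coords, labels, np.linspace(0.5, 8.0, 16))
print(radius, score)
```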

12
2. Characterization

What is data characterization?
It is a summarization of the general features of objects in a target class, and it produces characteristic rules.
How does characterization work?
The data relevant to a user-specified class are normally retrieved by a database query and run through a summarization module to extract the essence of the data at different levels of abstraction.
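The query-then-summarize flow can be sketched in a few lines of Python. The record fields (`class`, `teff`, `magnitude`) and the sample values are hypothetical, and a real summarization module would work at several levels of abstraction rather than just reporting simple statistics.

```python
from statistics import mean


def characterize(records, target_class, attributes):
    """Summarize the chosen attributes for all records of one class."""
    # "Database query" step: keep only the user-specified class.
    selected = [r for r in records if r["class"] == target_class]
    # "Summarization" step: basic statistics per attribute.
    # Assumes at least one record matches; min() fails on an empty list.
    summary = {"count": len(selected)}
    for attr in attributes:
        values = [r[attr] for r in selected]
        summary[attr] = {"min": min(values), "mean": mean(values), "max": max(values)}
    return summary


# Hypothetical catalogue records.
records = [
    {"class": "STAR", "teff": 5500, "magnitude": 14},
    {"class": "STAR", "teff": 6000, "magnitude": 16},
    {"class": "QSO", "teff": 0, "magnitude": 19},
]
summary = characterize(records, "STAR", ["teff", "magnitude"])
print(summary)
```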

13
An example of characterization: effective temperatures of stars

Difference from direct measurement
DM methods for estimating stellar parameters differ from traditional methods based on direct measurement: they do not need to measure each stellar spectrum individually.
Other automated estimation methods for Teff
Bailer-Jones (2000, 2002) has trained an artificial neural network (ANN) to estimate stellar parameters. Soubiran et al. (1998) and Katz et al. (1998) have established a template library containing 211 stellar spectra, and used cross-correlation techniques to match their observations with their templates.
Our method
Here we present a surface-fitting technique to estimate the distribution function of stellar effective temperature. We estimate the temperature distribution in PCA space, and the effective temperature of each star is just one point in such a distribution.

14
An example of characterization: effective temperatures of stars
The data set we used is a comprehensive library of synthetic stellar spectra from Lejeune et al. (1997), which is based on three original grids of model atmosphere spectra by Kurucz et al. (1979), Fluks et al. (1994), and Bessell et al. (1989, 1991).

First of all, the spectra in this data set were processed by means of a PCA, yielding the figure above, in which all 1599 stellar spectra are projected onto a 2-d PCA plane.

15
An example of characterization: effective temperatures of stars
Consider the data distribution in PCA space as a locus X, with the effective temperature T a function of X: T = f(X). T is then a surface in a 3-d space, as shown in the figure above.
16
An example of characterization: effective temperatures of stars

By experimentation, we found that the following equations fit the surface well:
T = 10^{P(x,y)}
where P(x,y) is a polynomial of the form
P(x,y) = 25.0069 - 1.80461x + 0.0525264x^2 - 0.000450855x^3 + 3.22394y - 0.181638xy + 0.00256156x^2y + 0.173964y^2 - 0.00434289xy^2 + 0.00358684y^3.
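A surface of the form T = 10^P(x,y), with P a full cubic polynomial in the two PCA coordinates, can be fitted by ordinary least squares on log10(T). The sketch below shows one way to do this in NumPy. The synthetic coordinates and the "true" coefficients are made up for the demonstration, and the presentation does not say which fitting procedure the authors actually used.

```python
import numpy as np


def cubic_terms(x, y):
    """Design matrix with the ten terms of a full bivariate cubic.

    Column order: 1, x, x^2, x^3, y, xy, x^2 y, y^2, x y^2, y^3.
    """
    return np.column_stack([
        np.ones_like(x), x, x**2, x**3,
        y, x * y, x**2 * y,
        y**2, x * y**2,
        y**3,
    ])


def fit_log_teff(x, y, teff):
    """Least-squares fit of log10(Teff) = P(x, y); returns the 10 coefficients."""
    coeffs, *_ = np.linalg.lstsq(cubic_terms(x, y), np.log10(teff), rcond=None)
    return coeffs


def predict_teff(coeffs, x, y):
    """Evaluate T = 10**P(x, y) for the fitted coefficients."""
    return 10.0 ** (cubic_terms(x, y) @ coeffs)


# Synthetic demonstration: made-up coefficients, noise-free temperatures.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=300)
y = rng.uniform(-1.0, 1.0, size=300)
true_coeffs = np.array([3.8, 0.2, -0.05, 0.01, 0.1, 0.02, 0.0, -0.03, 0.0, 0.005])
teff = 10.0 ** (cubic_terms(x, y) @ true_coeffs)

coeffs = fit_log_teff(x, y, teff)
print(np.round(coeffs, 4))
print(predict_teff(coeffs, np.array([0.0]), np.array([0.0])))
```

Fitting in log space keeps the problem linear in the coefficients, so no iterative optimizer is needed.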

17
An example of characterization: effective temperatures of stars

This figure shows the isotherms of effective temperature in the PCA plane. When an observed spectrum is projected onto this PCA space, we can estimate the effective temperature of the object in question.
We are presently working on optimizing characterization algorithms in order to obtain distributions of stellar parameters, such as Teff, g, and [Fe/H].

18
3. Classification

What is classification?
Classification is also called "predictive data mining", in that the aim is to identify the characteristics of a group in advance.
What data of LAMOST needs to be classified?
For the LAMOST data archive, the data analysis pipeline will give the classification result, e.g. QSO, galaxy, or star of a particular spectral type. But for galaxies, the pipeline will not classify them further. The archive will include 10^7 galaxies, and the classification of galaxy spectra is a complex problem.
Why should we classify galaxy spectra?
A good classification scheme should be useful in understanding the evolutionary relationship between different types.

19
Classification of galaxy spectra

Galaxy spectral classification can be based on different methods:
Line strength
Correlation between morphology and spectrum
Objective methods (ANN or PCA)
Evolutionary models
We are now looking for objective methods with more physical meaning to explain the evolution of galaxies.

20
DM and VO

An objective of LAMOST DM is to provide software tools that will also be useful for the development of China's Virtual Observatory (VO).
The LAMOST data set, including all its sub-catalogues and FITS files of 1-d spectra, will of course be another important contribution to the VO.
The true relationship between LAMOST and the VO lies in using data mining and knowledge discovery to explore the LAMOST data.