Title: Mining the LAMOST Spectral Archive
1 Mining the LAMOST Spectral Archive
- A-Li Luo, Yan-Xia Zhang, Jian-Nan Zhang, and Yong-Heng Zhao
- National Astronomical Observatories
- Chinese Academy of Sciences
- lal_at_lamost.bao.ac.cn
2 Brief Introduction to the LAMOST Telescope
- Aperture: 4 m
- Focal length: 20 m
- Field of view: 5 degrees
- Focal plane: 1.75 m
- Number of fibers: 4000
- Spectra of objects as faint as magnitude 20.5 with an exposure of 1.5 hours
- Range: 370 nm - 900 nm
- Resolution: R = 1000 - 2000
- (High-R spectrographs for stellar spectra are planned)
[Figure: a meridian reflecting Schmidt telescope. Labels: Focal Plane; MB, Spherical Primary Mirror; MA, Reflecting Schmidt Corrector]
3 Introduction to the LAMOST Archive
- Two main data sets
  - a spectroscopic catalogue (catalogue subsets)
  - a set of individual spectra
- Expected size of the LAMOST archive
  - The telescope will be used to survey 1,000,000 QSOs, 10,000,000 galaxies, and 1,000,000 stars. The final one-dimensional spectral dataset will exceed 1 terabyte.
4 The archive will provide
- The archive will provide reference data for stars, galaxies and quasars in FITS format, and will also provide a variety of services, including a flexible web user interface that allows sophisticated queries within the database. Mining tools are also indispensable for extracting novel information.
- Two main kinds of query
  - 1. Simple and advanced search: enables users to retrieve data subsets based on search limits they choose.
  - 2. SQL search: gives users the option of supplying their search criteria in standard SQL.
- Mining tools
  - The LAMOST software system will contain a spectra-based data miner for knowledge discovery, which incorporates data mining functions such as clustering, characterization, and classification.
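To make the SQL search option concrete, here is a minimal sketch of the kind of query a user might supply. The table name `spectra`, its columns, and the sample rows are hypothetical, invented for illustration; the slides do not describe the actual LAMOST schema. An in-memory SQLite database stands in for the archive.

```python
import sqlite3


def build_demo_archive():
    """Create an in-memory table standing in for a spectroscopic catalogue.

    The schema and rows are made up for this example.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE spectra ("
        "obj_id INTEGER PRIMARY KEY, obj_class TEXT, redshift REAL, magnitude REAL)"
    )
    rows = [
        (1, "QSO", 2.10, 19.8),
        (2, "GALAXY", 0.08, 17.5),
        (3, "STAR", 0.00, 15.2),
        (4, "QSO", 0.45, 18.9),
        (5, "QSO", 3.02, 20.4),
    ]
    conn.executemany("INSERT INTO spectra VALUES (?, ?, ?, ?)", rows)
    return conn


def query_qsos(conn, max_magnitude, min_redshift):
    """Return (obj_id, redshift, magnitude) for QSOs within the given limits."""
    sql = (
        "SELECT obj_id, redshift, magnitude FROM spectra "
        "WHERE obj_class = 'QSO' AND magnitude <= ? AND redshift >= ? "
        "ORDER BY redshift"
    )
    return conn.execute(sql, (max_magnitude, min_redshift)).fetchall()


conn = build_demo_archive()
selected = query_qsos(conn, max_magnitude=20.0, min_redshift=0.4)
```

The "simple and advanced search" form would presumably build a statement like this from the limits a user fills in, while the SQL search lets the user write the `WHERE` clause directly.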
5 Introduction to Data Mining
- Data mining (DM) is also known as
  - knowledge discovery in databases,
  - knowledge extraction,
  - data archaeology,
  - data dredging,
  - information harvesting,
  - business intelligence, etc.
- Two kinds of DM
  - Event-based mining
  - Relationship-based mining
6 Mining the LAMOST archive
- Clustering
- Characterization
- Classification
7 1. Clustering
- What is clustering?
  - It divides a database into different groups.
- The goal of clustering?
  - To find groups of objects that are very different from one another, and whose members are similar to each other.
- Same as classification?
  - No: one does not know a priori either which objects the clusters will include or by which attributes the data will be clustered.
- How to use clusters?
  - Once we find clusters that segment the database meaningfully, they can be used to classify new data.
8 Clustering
- Some of the common algorithms used to perform clustering include
  - (1) partitioning-based algorithms, which enumerate various partitions and then score them by some criterion, e.g. K-means, K-medoids, etc.
  - (2) hierarchy-based algorithms, which create a hierarchical decomposition of the set of data (or objects) using some criterion
  - (3) model-based algorithms, in which a model is hypothesized for each of the clusters.
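As an illustration of the partitioning-based family, the following is a minimal sketch of K-means (Lloyd's algorithm) in plain Python. The function names, the random initialisation from the input points, and the two synthetic 2-d blobs are choices made for this example and are not taken from the LAMOST software.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [
        min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
        for p in points
    ]


def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's K-means on a list of coordinate tuples.

    Returns (centers, labels), where labels[i] is the index of the
    center nearest to points[i].
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct input points
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # New center: coordinate-wise mean of the cluster members.
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[j])  # leave an empty cluster in place
        if new_centers == centers:
            break  # assignments can no longer change
        centers = new_centers
    return centers, assign(points, centers)


# Synthetic data: two well-separated 2-d blobs.
rng = random.Random(1)
blob_a = [(rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)) for _ in range(50)]
blob_b = [(rng.gauss(10.0, 0.5), rng.gauss(10.0, 0.5)) for _ in range(50)]
centers, labels = kmeans(blob_a + blob_b, k=2)
```

The score being minimised is the sum of squared distances from each point to its assigned center; each iteration alternates an assignment step and a center update.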
9 An example of clustering: Searching for NLQs
- What is an NLQ?
  - QSOs are active galactic nuclei (AGN) in which two different regions of ionized gas can be distinguished: a broad-line region (BLR) and a narrow-line region (NLR).
  - The full widths at half maximum (FWHM) of emission lines in the spectra of broad-line QSOs (BLQs) often exceed 5000 km/s, whereas in narrow-line QSOs (NLQs) the FWHMs are generally narrower than 1000 km/s.
10 An example of clustering: Searching for NLQs
- Why do we select this example?
  - The LAMOST archive will contain 1 million QSO spectra, including large numbers of NLQs. While NLRs in Seyfert galaxies are already relatively well studied, there are no comparable studies of NLRs in quasars.
- Why do we use a clustering technique?
  - Within the framework of the unified AGN model, we need to compare the spectra of NLQs statistically with those of Seyfert galaxies.
- In which space do we do the clustering?
  - The basic goal of principal component analysis (PCA) is to reduce the dimensions of the multi-parameter space defined by the data with minimal loss of information. Such a reduction has important benefits, especially as projection onto a 2-d or 3-d subspace is often useful for visualizing the data.
- What kind of clustering algorithm will we use?
  - The K-means clustering algorithm.
- Experiment data?
  - More than 15,000 QSO spectra from SDSS DR1
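The PCA projection described above can be sketched as follows. This is a minimal example assuming NumPy and a singular value decomposition of the mean-centred data; the synthetic "spectra" (random mixtures of two made-up templates plus noise) are placeholders and not SDSS data.

```python
import numpy as np


def pca_project(spectra, n_components=2):
    """Project spectra (n_objects x n_pixels) onto the leading principal components.

    Returns (scores, components, explained), where `explained` holds the
    fraction of the total variance carried by each retained component.
    """
    X = np.asarray(spectra, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    scores = X_centered @ components.T
    explained = s**2 / np.sum(s**2)
    return scores, components, explained[:n_components]


# Synthetic stand-in for a set of spectra: mixtures of two templates plus noise.
rng = np.random.default_rng(0)
wave = np.linspace(0.0, 1.0, 50)
templates = np.vstack([np.sin(2 * np.pi * wave), np.cos(2 * np.pi * wave)])
weights = rng.normal(size=(200, 2))
spectra = weights @ templates + 0.01 * rng.normal(size=(200, 50))

scores, components, explained = pca_project(spectra)
# scores[:, 0] and scores[:, 1] play the role of PC1 and PC2 in a 2-d plot.
```

Checking `explained` shows how much variance the 2-d projection retains, which is a direct way to judge how much information the dimension reduction discards.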
11 An example of clustering: Searching for NLQs
- SDSS DR1 QSO spectra for more than 15,000
objects projected onto a 2-d PCA subspace. The
x-axis is the first principal component, PC1,
while the y-axis is the second principal
component, PC2. Each small asterisk in the figure
represents a projection of a spectrum. We found
that most of the spectra were located within a
spherical space. A quick check revealed that most
BLQs lie within the spherical space, while most
NLQs (which are less numerous) lie outside it.
Using a K-means algorithm, we altered the size of
the spherical space in order to achieve an
optimal separation between BLQs and NLQs.
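The slides do not say exactly how K-means was used to set the size of the region. One possible reading, assumed here purely for illustration, is to run a two-cluster K-means on each point's distance from the centre of the PC1-PC2 distribution, so that the boundary between the "inner" and "outer" clusters defines the radius. The function name `radial_split` and the use of the coordinate-wise median as the centre are assumptions of this sketch.

```python
import math
from statistics import median


def radial_split(points, max_iter=100):
    """Split 2-d points into an inner and an outer group by radius.

    Runs 1-d K-means with k=2 on the distances from the median centre.
    Returns (centre, threshold_radius, is_outside), where is_outside[i]
    is True if points[i] lies beyond the threshold radius.
    """
    cx = median(p[0] for p in points)
    cy = median(p[1] for p in points)
    radii = [math.hypot(px - cx, py - cy) for px, py in points]

    # Initialise the two 1-d cluster centres at the smallest and largest radius.
    lo, hi = min(radii), max(radii)
    for _ in range(max_iter):
        threshold = 0.5 * (lo + hi)
        inner = [r for r in radii if r <= threshold]
        outer = [r for r in radii if r > threshold]
        if not outer:
            break  # all radii identical: nothing to separate
        new_lo = sum(inner) / len(inner)
        new_hi = sum(outer) / len(outer)
        if (new_lo, new_hi) == (lo, hi):
            break  # converged
        lo, hi = new_lo, new_hi

    threshold = 0.5 * (lo + hi)
    return (cx, cy), threshold, [r > threshold for r in radii]


# Toy projection: a dense core plus a few distant points.
core = [(0.5 * math.cos(k), 0.5 * math.sin(k)) for k in range(90)]
outliers = [(8.0 * math.cos(k), 8.0 * math.sin(k)) for k in range(10)]
centre, radius, is_outside = radial_split(core + outliers)
```

In practice a labelled subset of known BLQs and NLQs would be needed to check whether a radius found this way actually separates the two classes.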
12 2. Characterization
- What is data characterization?
  - It is a summarization of the general features of objects in target classes, and it produces characteristic rules.
- How is characterization done?
  - The data relevant to a user-specified class are normally retrieved by a database query and run through a summarization module to extract the essence of the data at different levels of abstraction.
13 An example of characterization: effective temperatures of stars
- Different from direct measurement
  - DM methods for estimating stellar parameters differ from traditional methods based on direct measurement: they do not need to measure each stellar spectrum individually.
- Other automated estimation methods for Teff
  - Bailer-Jones (2000, 2002) trained an artificial neural network (ANN) to estimate stellar parameters. Soubiran et al. (1998) and Katz et al. (1998) established a template library containing 211 stellar spectra, and used cross-correlation techniques to match their observations with their templates.
- Our method
  - Here we present a surface-fitting technique to estimate the distribution function of stellar effective temperature. We estimate the temperature distribution in PCA space, and the effective temperature of each star is just one point in that distribution.
14 An example of characterization: effective temperatures of stars
The data set we used is a comprehensive library of synthetic stellar spectra from Lejeune et al. (1997), which is based on three original grids of model atmosphere spectra by Kurucz et al. (1979), Fluks et al. (1994), and Bessell et al. (1989, 1991).
- First, the spectra in this data set were processed by means of a PCA, yielding the figure above, in which all 1599 stellar spectra are projected onto a 2-d PCA plane.
15 An example of characterization: effective temperatures of stars
Consider the data distribution in PCA space as a locus X, with the effective temperature T a function of X: $T = f(X)$. Thus, T is a surface in a 3-d space, as shown in the figure above.
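A surface of this kind can be approximated by a low-order polynomial in the two PCA coordinates. The slides do not state the fitting procedure, so the sketch below is an assumption: ordinary least squares (via NumPy) on $\log_{10} T$ with a cubic polynomial in $(x, y)$. The function names and the synthetic training data are invented for the example.

```python
import numpy as np


def cubic_terms(x, y):
    """Design matrix with the ten monomials of a cubic polynomial in (x, y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.column_stack(
        [np.ones_like(x), x, x**2, x**3, y, x * y, x**2 * y, y**2, x * y**2, y**3]
    )


def fit_log_teff_surface(x, y, teff):
    """Least-squares coefficients of a cubic surface fitted to log10(teff)."""
    coeffs, *_ = np.linalg.lstsq(cubic_terms(x, y), np.log10(teff), rcond=None)
    return coeffs


def predict_teff(coeffs, x, y):
    """Evaluate the fitted surface and convert back from log10(T) to T."""
    return 10.0 ** (cubic_terms(x, y) @ coeffs)


# Synthetic check: generate temperatures from known coefficients, then refit.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=300)
y = rng.uniform(-1.0, 1.0, size=300)
true_coeffs = np.array([3.7, 0.1, -0.05, 0.02, 0.08, 0.03, -0.01, 0.04, 0.02, -0.03])
teff = 10.0 ** (cubic_terms(x, y) @ true_coeffs)

fitted = fit_log_teff_surface(x, y, teff)
```

With real data, `x` and `y` would be the PC1 and PC2 scores of the library spectra and `teff` their known model temperatures.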
16 An example of characterization: effective temperatures of stars
- By experimentation, we found that the following equations fit the surface well:
- $T = 10^{P(x,y)}$
- where $P(x,y)$ is a polynomial of the form
- $P(x,y) = 25.0069 - 1.80461x + 0.0525264x^2 - 0.000450855x^3 + 3.22394y - 0.181638xy + 0.00256156x^2y + 0.173964y^2 - 0.00434289xy^2 + 0.00358684y^3$
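The fitted relation can be evaluated directly. The sketch below uses the coefficients printed on the slide and assumes that the temperature is ten raised to the power of the polynomial, and that terms printed without a sign are added (minus signs survive in the text, plus signs appear to have been lost). The slides do not give the range of the PCA coordinates $x$ and $y$, so no physically meaningful input values are suggested here.

```python
# Coefficients from the slide, keyed by (power of x, power of y).
# Assumption: terms printed without a sign are added.
COEFFS = {
    (0, 0): 25.0069,
    (1, 0): -1.80461,
    (2, 0): 0.0525264,
    (3, 0): -0.000450855,
    (0, 1): 3.22394,
    (1, 1): -0.181638,
    (2, 1): 0.00256156,
    (0, 2): 0.173964,
    (1, 2): -0.00434289,
    (0, 3): 0.00358684,
}


def log10_teff(x, y):
    """Evaluate the polynomial P(x, y) at a point of the PCA plane."""
    return sum(c * x**i * y**j for (i, j), c in COEFFS.items())


def teff(x, y):
    """Effective temperature under the assumed relation T = 10 ** P(x, y)."""
    return 10.0 ** log10_teff(x, y)
```

Calling `teff(pc1, pc2)` with the projected coordinates of an observed spectrum gives the temperature estimate that the isotherm plot encodes graphically.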
17 An example of characterization: effective temperatures of stars
- This figure gives the isotherms of effective temperature in a PCA plane. When an observed spectrum is projected onto this PCA space, we can read off the effective temperature of the object in question.
- We are presently working on optimizing characterization algorithms in order to obtain distributions of stellar parameters, such as Teff, g, and Fe/H.
18 3. Classification
- What is classification?
  - Classification is also called "predictive data mining", in that the aim is to identify the characteristics of a group in advance.
- What LAMOST data need to be classified?
  - For the LAMOST data archive, the data analysis pipeline will give a classification result, e.g. QSO, galaxy, or star of a particular spectral type. For galaxies, however, the pipeline will not classify them further. The archive will include $10^7$ galaxies, and the classification of galaxy spectra is a complex problem.
- Why should we classify galaxy spectra?
  - A good classification scheme should be useful in understanding the evolutionary relationship between different types.
19 Classification of galaxy spectra
- Galaxy spectral classification can be based on different methods
  - Line strength
  - Correlation between morphology and spectrum
  - Objective method (ANN or PCA)
  - Evolutionary models
- We are now looking for objective methods with more physical meaning to explain the evolution of galaxies.
20 DM and VO
- An objective of LAMOST DM is to provide software tools that will also be useful for the development of China's Virtual Observatory (VO).
- The LAMOST data set, including all its sub-catalogues and FITS files of 1-d spectra, will of course be another important contribution to the VO.
- The true relationship between LAMOST and the VO lies in using data mining and knowledge discovery to explore the LAMOST data.