Large Scale Machine Translation Architectures

Learn more at: http://www.cs.cmu.edu


1
Large Scale Machine Translation Architectures

Qin Gao

2
Outline

Typical Problems in Machine Translation
Programming Model for Machine Translation
MapReduce
Required System Components
Supporting software
Distributed streaming data storage system
Distributed structured data storage system
Integration: how to build a fully distributed system

3
Why large scale MT

We need more data..
But

4
Some representative MT problems

Counting events in corpora: n-gram counts
Sorting: phrase table extraction
Preprocessing data: parsing, tokenizing, etc.
Iterative optimization: GIZA (all EM algorithms)

5
Characteristics of different tasks

Counting events in corpora: extract knowledge from data
Sorting: process data; the knowledge is inside the data
Preprocessing data: process data; requires external knowledge
Iterative optimization: in each iteration, process data using the existing knowledge, then update the knowledge

6-8
Components required for large scale MT
[Diagram built up over three slides; labels: Stream Data, Processor, Knowledge, Structured Knowledge]
9
Problem for each component

Stream data
As the amount of data grows, even one complete pass over it becomes infeasible.
Processor
A single processor's computing power is not enough.
Knowledge
The table is too large to fit into memory.
A cache-based or distributed knowledge base suffers from low speed.

10
Make it simple: what is the underlying problem?

We have a huge cake, and we want to cut it into pieces and eat it.
Different cases:
We just need to eat the cake.
We also want to count how many peanuts are inside the cake.
(Sometimes) we have only one fork!

11
Parallelization
[Diagram; surviving label: Knowledge]
12
Solutions

Large-scale distributed processing
MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean, Sanjay Ghemawat, Communications of the ACM, vol. 51, no. 1 (2008), pp. 107-113.
Handling huge streaming data
The Google File System, Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, Proceedings of the 19th ACM Symposium on Operating Systems Principles, 2003, pp. 29-43.
Handling structured data
Large Language Models in Machine Translation, Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, Jeffrey Dean, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 858-867.
Bigtable: A Distributed Storage System for Structured Data, Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber, 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2006, pp. 205-218.

13
MapReduce

MapReduce can refer to:
A programming model that deals with massive, unordered, streaming data (MUD) processing tasks
A supporting software environment implemented by Google Inc.
Alternative implementation:
Hadoop, by the Apache Software Foundation

14
MapReduce programming model

Abstracts the computation into two functions:
Map
Reduce
The user is responsible for implementing the Map and Reduce functions; the supporting software takes care of executing them

15
Representation of data

The streaming data is abstracted as a sequence of key/value pairs
Example:
(sentence_id, sentence_content)

16
Map function

The Map function takes an input key/value pair and outputs a set of intermediate key/value pairs

(Key1, Value1) -> Map() -> (Key1, Value1), (Key2, Value2), (Key3, Value3), ...
(Key2, Value2) -> Map() -> (Key1, Value2), (Key2, Value1), (Key3, Value3), ...
17
Reduce function

The Reduce function accepts one intermediate key and a set of intermediate values, and produces the result

(Key1, Value1), (Key1, Value2), (Key1, Value3), ... -> Reduce() -> Result
(Key2, Value1), (Key2, Value2), (Key2, Value3), ... -> Reduce() -> Result
18
The architecture of MapReduce
[Diagram: Map function -> Distributed Sort -> Reduce function]
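
The three stages named on this slide can be imitated inside one process. The sketch below is a toy illustration only, not Google's or Hadoop's implementation; `run_mapreduce`, `length_by_lang` and the sample records are hypothetical names and data made up for the example.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy single-process MapReduce: map, sort by key, reduce per key."""
    # Map phase: every input pair may emit any number of intermediate pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Stand-in for the distributed sort: bring equal keys next to each other.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: one call per intermediate key, with all of its values.
    results = {}
    for key, group in groupby(intermediate, key=itemgetter(0)):
        results[key] = reduce_fn(key, [value for _, value in group])
    return results

def length_by_lang(sentence_id, tagged_sentence):
    """Hypothetical Map function: emit (language tag, number of words)."""
    lang, text = tagged_sentence
    yield lang, len(text.split())

records = [(1, ("en", "the cat sat")), (2, ("de", "die Katze")), (3, ("en", "hello"))]
totals = run_mapreduce(records, length_by_lang, lambda key, values: sum(values))
print(totals)
```

In a real cluster the map calls, the sort, and the reduce calls each run on many machines; only the user-supplied `map_fn` and `reduce_fn` would stay the same.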
19
Benefit of MapReduce

Automatic data splitting
Fault tolerance
High-throughput computing; uses the nodes efficiently
Most important: simplicity. You only need to express your algorithm in the MapReduce model.

20
Requirements for expressing an algorithm in MapReduce

Process unordered data
The data must be unordered: no matter in what order the data is processed, the result should be the same
Produce independent intermediate keys
The Reduce function cannot see the values of other keys

21
Example

Distributed Word Count (1)
Input key: word
Input value: 1
Intermediate key: a constant
Intermediate value: 1
Reduce(): sum all intermediate values
Distributed Word Count (2)
Input key: document/sentence ID
Input value: document/sentence content
Intermediate key: a constant
Intermediate value: number of words in the document/sentence
Reduce(): sum all intermediate values

22
Example 2

Distributed unigram count
Input key: document/sentence ID
Input value: document/sentence content
Intermediate key: word
Intermediate value: number of occurrences of the word in the document/sentence
Reduce(): sum all intermediate values
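
A minimal sketch of the unigram count described above, run in a single process. The function names (`map_unigrams`, `reduce_sum`, `unigram_count`) and the two-sentence corpus are assumptions for illustration; the dictionary of lists stands in for the grouping that the framework would do between Map and Reduce.

```python
from collections import Counter, defaultdict

def map_unigrams(sentence_id, sentence):
    """Map: emit (word, occurrences of that word in this sentence)."""
    return Counter(sentence.split()).items()

def reduce_sum(word, counts):
    """Reduce: add up the per-sentence counts for one word."""
    return sum(counts)

def unigram_count(corpus):
    grouped = defaultdict(list)  # stand-in for the framework's grouping by key
    for sentence_id, sentence in corpus:
        for word, n in map_unigrams(sentence_id, sentence):
            grouped[word].append(n)
    return {word: reduce_sum(word, counts) for word, counts in grouped.items()}

corpus = [(0, "a rose is a rose"), (1, "a cat")]
counts = unigram_count(corpus)
print(counts)
```

Because addition is commutative and associative, the result does not depend on the order in which sentences are processed, which is the "unordered data" requirement from the earlier slide.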

23
Example 3

Distributed Sort
Input key: entry key
Input value: entry content
Intermediate key: entry key (may need to be modified for ascending/descending order)
Intermediate value: entry content
Reduce(): emit all the entry content
Makes use of the built-in sorting functionality
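
The sort example can be sketched the same way: Map passes the entry key through (negated here for descending order, which assumes numeric keys), the framework's sort does the real work, and Reduce only emits the contents. All names below are hypothetical, and Python's list sort stands in for the distributed sort.

```python
from itertools import groupby

def map_sort(entry_key, entry_content, descending=False):
    """Map: emit the entry under its (possibly negated) key."""
    yield (-entry_key if descending else entry_key), entry_content

def reduce_emit(key, contents):
    """Reduce: emit every entry content that arrived for this key."""
    return list(contents)

def mapreduce_sort(entries, descending=False):
    intermediate = []
    for key, content in entries:
        intermediate.extend(map_sort(key, content, descending))
    # The framework's built-in sort on the intermediate key.
    intermediate.sort(key=lambda pair: pair[0])
    output = []
    for key, group in groupby(intermediate, key=lambda pair: pair[0]):
        output.extend(reduce_emit(key, (content for _, content in group)))
    return output

entries = [(3, "c"), (1, "a"), (2, "b")]
ascending = mapreduce_sort(entries)
descending = mapreduce_sort(entries, descending=True)
print(ascending, descending)
```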

24
Supporting MapReduce: Distributed Storage

Reminder of what we are dealing with in MapReduce:
Massive, unordered, streaming data
Motivation
We need to store a large amount of data
Make use of the storage on all the nodes
Automatic replication
Fault tolerance
Avoid hot spots: clients can read from many servers
Google FS and Hadoop FS (HDFS)

25
Design principle of Google FS

Optimized for a special workload
Large streaming reads, small random reads
Large streaming writes, rare modifications
Supports concurrent appending
It effectively assumes the data are unordered
High sustained bandwidth matters more than low latency; fast response time is not a priority
Fault tolerant

26
Google FS Architecture

Optimized for large streaming reads and large, concurrent writes
Small random reads/writes are also supported, but not optimized
Allows appending to existing files
Files are split into chunks and stored on several chunk servers
A master is responsible for storing and querying chunk information

27
Google FS architecture
28
Replication

When a chunk is read frequently, or by many readers at once, the machine serving it may fail
A fault in that one machine can make the whole file unusable
Solution: store the chunks on multiple machines
The number of replicas of each chunk is the replication factor
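
To make the replication factor concrete, here is a small sketch that assigns chunks to servers and checks whether a file survives failures. The round-robin placement and the names `place_replicas` and `readable` are assumptions chosen for brevity; this is not how Google FS actually picks replica locations.

```python
def place_replicas(num_chunks, servers, replication_factor):
    """Assign each chunk to `replication_factor` distinct servers, round-robin."""
    if replication_factor > len(servers):
        raise ValueError("replication factor exceeds the number of servers")
    placement = {}
    for chunk in range(num_chunks):
        placement[chunk] = [servers[(chunk + r) % len(servers)]
                            for r in range(replication_factor)]
    return placement

def readable(placement, failed):
    """A file is readable while every chunk has at least one live replica."""
    return all(any(server not in failed for server in replicas)
               for replicas in placement.values())

servers = ["s0", "s1", "s2", "s3"]
single = place_replicas(4, servers, replication_factor=1)
triple = place_replicas(4, servers, replication_factor=3)
print(readable(single, {"s0"}), readable(triple, {"s0", "s1"}))
```

With one copy per chunk, losing a single server breaks the file; with three copies, any two servers can fail in this layout and every chunk still has a live replica.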

29
HDFS

HDFS shares the design principles of Google FS
Write-once-read-many: a file can be written only once; even appending is not allowed
Moving computation is cheaper than moving data

30
Are we done?
No: there are problems with the existing architecture
31
We are good at dealing with data

What about knowledge, i.e. structured data?
What if the size of the knowledge is HUGE?

32
A good example: GIZA

A typical EM algorithm

[Flowchart: Word Alignment -> Collect Counts -> Has More Sentences? (Y: back to Word Alignment; N: Normalize Counts) -> Has More Iterations? (Y: back to Word Alignment; N: stop)]
33
When parallelized: seems to be a perfect MapReduce application
[Flowchart: three workers run on the cluster, each looping Word Alignment -> Collect Counts -> Has More Sentences? (Y: repeat; N: finish); then Normalize Counts -> Has More Iterations? (Y: back to the workers; N: stop)]
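
The align / collect counts / normalize loop maps onto MapReduce roughly as follows. This sketch uses IBM Model 1 as the EM algorithm, which is an assumption (the slides only say "a typical EM algorithm"); the function names and the two-sentence corpus are made up, and everything runs in one process.

```python
from collections import defaultdict

def map_collect_counts(pair, t):
    """E-step for one sentence pair: emit ((src, tgt), fractional count)."""
    src_words, tgt_words = pair
    for tgt in tgt_words:
        norm = sum(t[(src, tgt)] for src in src_words)
        for src in src_words:
            yield (src, tgt), t[(src, tgt)] / norm

def em_iteration(corpus, t):
    # Reduce: sum the fractional counts per (src, tgt) key.
    counts = defaultdict(float)
    for pair in corpus:  # each pair could be handled by a different worker
        for key, count in map_collect_counts(pair, t):
            counts[key] += count
    # Normalize counts into the next lexicon table t(tgt | src).
    totals = defaultdict(float)
    for (src, _), count in counts.items():
        totals[src] += count
    return {key: count / totals[key[0]] for key, count in counts.items()}

corpus = [(["das", "haus"], ["the", "house"]),
          (["das", "buch"], ["the", "book"])]
t = defaultdict(lambda: 0.25)  # uniform starting point
for _ in range(5):
    t = defaultdict(float, em_iteration(corpus, t))
print(round(t[("das", "the")], 3))
```

Note that every call to `map_collect_counts` needs the full table `t`, which is exactly the memory problem the following slides raise.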
34
However
[Diagram: Large parallel corpus -> Corpus chunks -> Map (Memory) -> Count tables -> Data I/O -> Reduce (Memory) -> Combined count table -> Renormalization -> Statistical lexicon -> Redistribute for next iteration]
35
Huge tables

Lexicon probability table: the T-table
Up to 3 GB in the early stages
As the number of workers increases, they all need to load this 3 GB file!
And every node needs 3 GB of memory: do we need a cluster of supercomputers?

36
Another example, decoding

Consider language models: what can we do if the language model grows to several TB?
We need a storage/query mechanism for large, structured data
Considerations
Distributed storage
Fast access: the network has high latency

37
Google Language Model

Storage
Central storage or distributed storage
How to deal with latency?
Modify the decoder: collect a number of queries and send them in one batch.
This is a specific application; we still need something more general.
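
The batching idea can be sketched as a two-pass scorer: first collect every n-gram the hypotheses need, then fetch them all in one request. `RemoteLM` is a hypothetical in-memory stand-in for a language model served over the network, and returning 0.0 for unseen n-grams is a placeholder, not a real backoff scheme.

```python
class RemoteLM:
    """Hypothetical stand-in for a language model behind a network call."""
    def __init__(self, table):
        self.table = table
        self.round_trips = 0

    def lookup_batch(self, ngrams):
        self.round_trips += 1  # one network round trip per batch
        return {ngram: self.table.get(ngram, 0.0) for ngram in ngrams}

def ngrams_of(words, order):
    return [tuple(words[i:i + order]) for i in range(len(words) - order + 1)]

def score_hypotheses(lm, hypotheses, order=2):
    # First pass: collect every n-gram that any hypothesis will need.
    needed = set()
    for words in hypotheses:
        needed.update(ngrams_of(words, order))
    # One request instead of one request per n-gram.
    scores = lm.lookup_batch(needed)
    # Second pass: score locally from the returned values.
    return [sum(scores[ngram] for ngram in ngrams_of(words, order))
            for words in hypotheses]

lm = RemoteLM({("the", "cat"): -1.0, ("cat", "sat"): -2.0})
result = score_hypotheses(lm, [["the", "cat", "sat"], ["the", "cat"]])
print(result, lm.round_trips)
```

The latency cost is paid once per batch rather than once per query, at the price of restructuring the decoder around two passes.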

38
Again, made in Google: Bigtable

It is specially optimized for structured data
It serves many applications now
It is not a complete database
Definition
A Bigtable is a sparse, distributed, persistent, multi-dimensional, sorted map

39
Data model in Bigtable

Four-dimensional table:
Row
Column family
Column
Timestamp
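
One way to picture the four dimensions is as a map whose key is (row, column family, column, timestamp). The class below is an in-memory toy with hypothetical method names; it is not the Bigtable API, and it ignores distribution and persistence entirely.

```python
class ToyTable:
    """In-memory imitation of a (row, family, column, timestamp) -> value map."""
    def __init__(self):
        self.cells = {}

    def put(self, row, family, column, timestamp, value):
        self.cells[(row, family, column, timestamp)] = value

    def get(self, row, family, column):
        """Return the value with the largest timestamp, or None if absent."""
        versions = [(ts, value) for (r, f, c, ts), value in self.cells.items()
                    if (r, f, c) == (row, family, column)]
        return max(versions)[1] if versions else None

    def rows(self):
        """Row keys in sorted order, mirroring the 'sorted map' property."""
        return sorted({row for (row, _, _, _) in self.cells})

table = ToyTable()
table.put("row-b", "fam", "col1", 1, b"old")
table.put("row-b", "fam", "col1", 2, b"new")
table.put("row-a", "fam", "col2", 1, b"x")
print(table.get("row-b", "fam", "col1"), table.rows())
```

Sparseness falls out naturally: cells that were never written simply have no entry, and `get` returns `None` for them.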
40
Distributed storage unit: tablet

A tablet consists of a range of rows
Tablets can be stored on different nodes and served by different servers
Reading multiple rows concurrently can be fast
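
Since a tablet is a contiguous range of rows, finding the server for a row key reduces to a binary search over the range boundaries. The sketch below assumes sorted split keys and one server per tablet; `TabletMap` and its fields are hypothetical names, not part of Bigtable.

```python
import bisect

class TabletMap:
    """Maps a row key to the server holding its tablet (a row range)."""
    def __init__(self, split_keys, servers):
        # split_keys[i] is the first row key of tablet i + 1; must be sorted.
        if len(servers) != len(split_keys) + 1:
            raise ValueError("need exactly one server per tablet")
        self.split_keys = split_keys
        self.servers = servers

    def server_for(self, row_key):
        index = bisect.bisect_right(self.split_keys, row_key)
        return self.servers[index]

tablets = TabletMap(split_keys=["h", "p"], servers=["A", "B", "C"])
lookups = [tablets.server_for(key) for key in ["apple", "h", "house", "zebra"]]
print(lookups)
```

Rows that fall into different ranges are served by different machines, which is why reading many rows at once can proceed in parallel.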

41
Random access unit: column family

Each tablet is a string-to-string map
(Not stated explicitly, but the API suggests that) at the column family level the index is loaded into memory, so fast random access is possible
The set of column families should be fixed

42
Tables inside a table: column and timestamp

A column can be an arbitrary string value
A timestamp is an integer
A value is a byte array
It is effectively a table of tables

43
Performance

Number of 1000-byte values read/written per second.
What is striking:
Effective I/O for random reads (from GFS) is more than 100 MB/second
Effective I/O for random reads from memory is more than 3 GB/second

44
An example: phrase table

Row: first bigram/trigram of the source phrase
Column family: length of the source phrase, or some hashed number of the remaining part of the source phrase
Column: remaining part of the source phrase
Value: all the phrase pairs of the source phrase
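
The layout above can be expressed as a function from a source phrase to its (row, column family, column) key. This sketch picks the first bigram for the row and the phrase length for the column family, which is one of the options the slide lists; the `len%d` naming, the function name and the sample entry are assumptions.

```python
def phrase_key(source_phrase, prefix_len=2):
    """Split a source phrase into (row, column family, column)."""
    words = source_phrase.split()
    row = " ".join(words[:prefix_len])         # first bigram of the phrase
    family = "len%d" % len(words)              # length of the source phrase
    column = " ".join(words[prefix_len:])      # remaining part of the phrase
    return row, family, column

# A plain dict stands in for the distributed table.
phrase_table = {}
phrase_table[phrase_key("the big red house")] = [("the big red house", "das grosse rote Haus")]

key = phrase_key("the big red house")
print(key, phrase_table[key])
```

Phrases sharing the same first bigram land in the same row, so a decoder can fetch every candidate extension of a prefix with one row read.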

45
Benefit

Different source phrases come from different servers
The load is balanced, and reads can be concurrent and much faster.
Filtering the phrase table before decoding becomes much more efficient.

46
Another example: GIZA

Lexicon table
Row: source word ID
Column family: none
Column: target word ID
Value: the probability value
With a simple local cache, loading the table can be far more efficient than in the current implementation
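
A simple local cache in front of such a table might look like the sketch below. It caches whole rows (one source word ID and all of its target columns), which is an assumption about granularity; `CachedLexicon`, `fetch_row` and the in-memory `remote` dictionary are hypothetical stand-ins for a real remote table.

```python
from functools import lru_cache

class CachedLexicon:
    """Local cache in front of a remote lexicon table (row = source word ID)."""
    def __init__(self, fetch_row, max_rows=10000):
        self.remote_reads = 0

        def counted_fetch(src_id):
            self.remote_reads += 1  # count how often the remote table is hit
            return fetch_row(src_id)

        # Keep the most recently used rows in local memory.
        self._row = lru_cache(maxsize=max_rows)(counted_fetch)

    def prob(self, src_id, tgt_id):
        return self._row(src_id).get(tgt_id, 0.0)

remote = {7: {3: 0.6, 4: 0.4}, 8: {3: 1.0}}
lexicon = CachedLexicon(lambda src_id: remote.get(src_id, {}))
values = [lexicon.prob(7, 3), lexicon.prob(7, 4), lexicon.prob(8, 3), lexicon.prob(7, 3)]
print(values, lexicon.remote_reads)
```

Each worker then holds only the rows its own sentences touch, instead of loading the full table into memory.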

47
Conclusion