Title: SoQL
1 SoQL: A Language for Querying and Creating Data in Social Networks
- Royi Ronen and Oded Shmueli
- Technion, Israel Institute of Technology
March 29th, 2009, M3SN, Shanghai, China
2 Introduction
- As social networks become popular:
- A lot of data
- Many participants
- Many connections
- Sizable participant records
- Proliferation to business and organizational cultures
- Many querying scenarios can benefit from a domain-specific language
- SoQL is proposed as a step in this direction
3 Example
[Figure: an example social network. Nodes: Alice, Bob, Charlie, Dave, Eve, Frank, Gloria. Edges are labeled with ids 1 to 9.]

TF(id, weight)
id  weight
1   0.7
2   0.6
3   0.4
4   0.4
5   0.3
6   0.9
7   0.85
8   0.85
9   0.5
TN(name, company, e-mail, position, experience)
name     company  e-mail            position    experience
Alice    HAL      alice_at_hal.com  Manager     4
Bob      ACME     bob_at_acme.net   Manager     3
Charlie  CIA      cha_at_cia.gov    Engineer    2
Dave     MTV      dave_at_mtv.com   Teacher     5
Eve      ACME     eve_at_acme.net   Scientist   6
Frank    HAL      fr_at_hal.com     Technician  7
Gloria   ABC      glor_at_abc.org   Producer    6
4 Bob's Information Needs
- Bob works for ACME and is looking for a job at HAL
- Bob is looking for a path that connects him to a manager at HAL and that, in addition,
- is at most 4 nodes long, and
- does not include any participant, other than Bob, who works for ACME
- Results are to be ordered by the product of the weights along the path, excluding the first edge
- This favors higher quality social paths
5 Bob's Query
    SELECT COUNT(PATH.nodes.*), PATH
    FROM PATH (Bob TO X AS P1 TO Y AS P2)
    WHERE Y.company = 'HAL' and
          Y.position = 'manager' and
          ATMOST 0 IN P2.nodes SATISFY (company = 'ACME') and
          COUNT(P1.nodes.*) = 2 and
          COUNT(PATH.nodes.*) <= 4
    ORDER BY MULT(P2.edges.weight)
Parts of the query highlighted on the slide:
- The path (the FROM clause)
- Conditions on attributes
- Path predicates
- Aggregation path predicates
6 Result
- 4 (Bob, Dave, Gloria, Alice)
- 3 (Bob, Charlie, Alice)
- 4 (Bob, Dave, Eve, Alice)
- The corresponding products of weights are 0.765, 0.3, and 0.16
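As a quick arithmetic check, the three ordering values can be reproduced from weights that appear in TF. This is only a sketch: which edge ids lie on which path is not recoverable from the slide text, so the weight lists below are an assumption chosen because their products match the reported values.

```python
import math

# Assumed weights of the edges that follow the first edge of each result path.
# The assignment of weights to paths is an assumption, not taken from the figure.
p2_weights = {
    ("Bob", "Dave", "Gloria", "Alice"): [0.9, 0.85],
    ("Bob", "Charlie", "Alice"): [0.3],
    ("Bob", "Dave", "Eve", "Alice"): [0.4, 0.4],
}

# Order the paths by the product of their weights, largest first.
ranked = sorted(p2_weights, key=lambda p: math.prod(p2_weights[p]), reverse=True)

for path in ranked:
    # Node count, the path itself, and the rounded product of weights.
    print(len(path), path, round(math.prod(p2_weights[path]), 3))
```

Under this assumption the paths come out in the same order as listed in the result.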
7 Model
- Undirected graph
- Reciprocal friends model
- Nodes and edges have attributes
- New data types
- Path: an ordered set of distinct nodes in which every two successive nodes are connected
- Group: a set of nodes
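The two data types can be illustrated with a minimal sketch. The function `is_path` and the tiny graph are hypothetical and only mirror the definitions on this slide; they are not part of SoQL itself.

```python
def is_path(nodes, edges):
    """True if nodes are distinct and every two successive nodes share an edge."""
    distinct = len(set(nodes)) == len(nodes)
    connected = all(frozenset(pair) in edges for pair in zip(nodes, nodes[1:]))
    return distinct and connected

# Hypothetical undirected graph: each edge is an unordered pair of node names.
edges = {frozenset(e) for e in [("u", "v"), ("v", "w")]}

# A Group is simply a set of nodes; no connectivity is required.
group = frozenset({"u", "w"})

print(is_path(("u", "v", "w"), edges))  # successive nodes are connected
print(is_path(("u", "w"), edges))       # u and w share no edge
print(is_path(("u", "v", "u"), edges))  # u is repeated
```

Storing edges as unordered pairs reflects the undirected, reciprocal friends model.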
8 Model
- Results are finite
- Social networks are constantly growing
- But finite at any point
9 Aggregation over Path/Group
- Aggregation over a path/group is possible
- E.g., the number of nodes in path P1:
- SELECT COUNT(*) FROM P1.nodes
- Or, as in the previous example:
- MULT(P2.edges.weight)
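A rough sketch of what these two aggregates compute, assuming a path is stored as a list of node records plus a list of edge records. The dictionary layout and the helper names `count` and `mult` are illustrative assumptions.

```python
import math

# Hypothetical path P with three nodes and two weighted edges.
P = {
    "nodes": [{"name": "u"}, {"name": "v"}, {"name": "w"}],
    "edges": [{"id": 1, "weight": 0.5}, {"id": 2, "weight": 0.8}],
}

def count(items):
    """COUNT over a collection of nodes or edges."""
    return len(items)

def mult(items, attribute):
    """MULT over one attribute of a collection of nodes or edges."""
    return math.prod(item[attribute] for item in items)

node_count = count(P["nodes"])               # COUNT(*) over P.nodes
weight_product = mult(P["edges"], "weight")  # MULT(P.edges.weight)
print(node_count, round(weight_product, 3))
```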
10 Path Predicates
- ALL SATISFY (condition)
- ATMOST n
- ATLEAST n
- ALL EXCEPT UPTO n
- MAJORITY
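One possible reading of these operators is as counting tests over the nodes of a path. The sketch below assumes that reading. In particular, treating MAJORITY as "strictly more than half" is an assumption, and all function names are hypothetical. The sample nodes take their company values from the TN table.

```python
def count_satisfying(nodes, condition):
    """Number of nodes for which the condition holds."""
    return sum(1 for node in nodes if condition(node))

def all_satisfy(nodes, condition):
    return count_satisfying(nodes, condition) == len(nodes)

def atmost(n, nodes, condition):
    return count_satisfying(nodes, condition) <= n

def atleast(n, nodes, condition):
    return count_satisfying(nodes, condition) >= n

def all_except_upto(n, nodes, condition):
    # At most n nodes are allowed to violate the condition.
    return len(nodes) - count_satisfying(nodes, condition) <= n

def majority(nodes, condition):
    # Assumption: majority means strictly more than half of the nodes.
    return 2 * count_satisfying(nodes, condition) > len(nodes)

# Two nodes with companies as listed in TN.
p2_nodes = [
    {"name": "Charlie", "company": "CIA"},
    {"name": "Alice", "company": "HAL"},
]

def works_for_acme(node):
    return node["company"] == "ACME"

# Corresponds to: ATMOST 0 IN P2.nodes SATISFY (company = 'ACME')
print(atmost(0, p2_nodes, works_for_acme))
```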
11 Another Information Need
- Bob would like to find a group such that
- The group contains Bob and three others
- There exists a path of up to three edges from Bob to each of the three
- There exists a path of up to two edges between every two of the three
- All three have experience > 5
12 SELECT FROM GROUP
    SELECT GROUP
    FROM GROUP (Bob AS G1, DISTINCT(X,Y,Z) AS G2)
    WITH PATH (Bob TO X AS P1), PATH (Bob TO Y AS P2), PATH (Bob TO Z AS P3)
    WHERE COUNT(P1.edges.*) <= 3 and
          COUNT(P2.edges.*) <= 3 and
          COUNT(P3.edges.*) <= 3 and
          ALL IN G2.nodes SATISFY (experience > 5) and
          ALL SUBGROUPS(U,V) IN G2 SATISFY (PATH(U TO V AS P4) COUNT(P4.edges.*) <= 2) and
          COUNT(GROUP.nodes.*) < 5
Parts of the query highlighted on the slide:
- Group with paths
- Aggregation on paths
- Group predicate
- Subgroups
13 Group Predicates
- Group predicates refer to either
- nodes in a group, or
- paths involving members of the group
- When referring to nodes, the operators are the same as for paths:
- ALL IN G2.nodes SATISFY (experience > 5)
- When referring to paths, as in
- ALL SUBGROUPS(U,V) IN G2 SATISFY (PATH(U TO V AS P4) COUNT(P4.edges.*) <= 2)
- the operators are
- ALL SUBGROUPS,
- ATLEAST n SUBGROUPS,
- ALL EXCEPT UPTO n SUBGROUPS,
- MAJORITY SUBGROUPS
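The subgroup operators can be sketched as tests over all pairs of group members. The code below makes two assumptions: SUBGROUPS(U,V) ranges over unordered pairs, and the per-pair condition is "some path of at most k edges exists", checked with breadth-first search. The graph and all function names are hypothetical.

```python
from collections import deque
from itertools import combinations

def within(graph, u, v, max_edges):
    """True if some path from u to v uses at most max_edges edges (BFS)."""
    frontier, seen = deque([(u, 0)]), {u}
    while frontier:
        node, dist = frontier.popleft()
        if node == v:
            return True
        if dist < max_edges:
            for nxt in graph[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, dist + 1))
    return False

def pair_results(graph, group, max_edges):
    """Evaluate the path condition for every unordered pair in the group."""
    return [within(graph, u, v, max_edges)
            for u, v in combinations(sorted(group), 2)]

def all_subgroups(results):
    return all(results)

def atleast_n_subgroups(n, results):
    return sum(results) >= n

def all_except_upto_n_subgroups(n, results):
    return results.count(False) <= n

def majority_subgroups(results):
    # Assumption: majority means strictly more than half of the pairs.
    return 2 * sum(results) > len(results)

# Hypothetical chain x - y - z - w, stored as adjacency sets.
graph = {"x": {"y"}, "y": {"x", "z"}, "z": {"y", "w"}, "w": {"z"}}

close_group = pair_results(graph, {"x", "y", "z"}, max_edges=2)
spread_group = pair_results(graph, {"x", "y", "w"}, max_edges=2)
print(all_subgroups(close_group), all_subgroups(spread_group))
```

Breadth-first search suffices here because a pair is connected by a path of at most k edges exactly when its shortest-path distance is at most k.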
14 CONNECT
- Let R be a one-column relation of paths
- The paths are used for an automated referral process intended to create a connection to the last node in each path
    CONNECT USING PATH FROM R
    WHERE TIMEOUT = 36, ATTEMPTS = 5, PARALLEL = 2, HISTORY = true
15 CONNECT
- Let R be a one-column relation of groups
- An automated process will attempt to either
- form a group, as in, e.g., Facebook, or
- create an edge between each pair in the group
    CONNECT GROUP FROM R
    WHERE TIMEOUT = 48, ATTEMPTS = 1, PARALLEL = 1
16 Implementation Issues
- Path/Group sizes are not necessarily predefined or known a priori
- Deployment parameters are needed:
- Maximum number of tuples in a result (e.g., Google's 1k)
- Maximal length of any path
- Maximal size of any group
- Time limit
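To illustrate how such deployment parameters might bound evaluation, here is a minimal sketch of a depth-first path enumeration that stops at a maximum path length, a maximum number of results, and a deadline. The `Limits` class, its field names, and its default values are assumptions for illustration only.

```python
import time
from dataclasses import dataclass
from itertools import islice

@dataclass
class Limits:
    max_tuples: int = 1000      # maximum tuples in a result
    max_path_nodes: int = 4     # maximal length of any path, in nodes
    max_group_nodes: int = 4    # maximal group size (unused in this path-only sketch)
    time_limit_s: float = 1.0   # time limit, in seconds

def simple_paths(graph, start, limits, deadline):
    """Yield simple paths from start, depth-first, until the deadline passes."""
    stack = [(start,)]
    while stack and time.monotonic() < deadline:
        path = stack.pop()
        yield path
        if len(path) < limits.max_path_nodes:
            # Reverse-sorted push so that paths are popped in sorted order.
            for nxt in sorted(graph[path[-1]], reverse=True):
                if nxt not in path:
                    stack.append(path + (nxt,))

def bounded_paths(graph, start, limits):
    """Collect at most limits.max_tuples paths within the time limit."""
    deadline = time.monotonic() + limits.time_limit_s
    return list(islice(simple_paths(graph, start, limits, deadline),
                       limits.max_tuples))

# Hypothetical triangle graph.
graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
print(bounded_paths(graph, "a", Limits(max_path_nodes=2)))
print(bounded_paths(graph, "a", Limits(max_tuples=2, max_path_nodes=3)))
```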
17 Finding Paths
- Top-k self joins can be used to avoid large intermediate results
- For, e.g., distributed data, random walks can be used to extract candidates for paths in the result
- At any point, if the path cannot satisfy the query, the walk aborts
- Many walking agents can provide a good approximation
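The random-walk idea can be sketched as follows. This is not the authors' implementation: the callbacks `is_result` and `can_still_satisfy`, the graph, and the agent count are hypothetical, and the agents run sequentially here rather than in a distributed setting.

```python
import random

def random_walk(graph, start, is_result, can_still_satisfy, max_nodes, rng):
    """One walking agent: returns a candidate path, or None if it gives up."""
    path = [start]
    while True:
        if is_result(path):
            return tuple(path)
        if len(path) >= max_nodes:
            return None                  # length limit reached
        options = [n for n in sorted(graph[path[-1]]) if n not in path]
        if not options:
            return None                  # dead end
        path.append(rng.choice(options))
        if not can_still_satisfy(path):
            return None                  # the path cannot satisfy the query: abort

def sample_paths(graph, start, is_result, can_still_satisfy,
                 max_nodes, agents, seed=0):
    """Run many agents and collect the distinct candidate paths they find."""
    rng = random.Random(seed)
    walks = (random_walk(graph, start, is_result, can_still_satisfy,
                         max_nodes, rng) for _ in range(agents))
    return {p for p in walks if p is not None}

# Hypothetical graph: s reaches the target t via a; the branch via b leads to x.
graph = {"s": {"a", "b"}, "a": {"s", "t"}, "b": {"s", "x"},
         "x": {"b"}, "t": {"a"}}

found = sample_paths(
    graph, "s",
    is_result=lambda p: p[-1] == "t",
    can_still_satisfy=lambda p: "x" not in p,   # e.g., x violates a path predicate
    max_nodes=3, agents=200,
)
print(found)
```

The result is a sample rather than an exhaustive answer. With more agents, the sample is more likely to cover the qualifying paths.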
18 Conclusions
- SoQL is a domain-specific, SQL-like query language for the social networks domain
- Creation of data is possible using the Path and the Group data types
- Future work
- More expressive predicates, e.g., disjointness of two paths
- Implementation
- Advanced, optimized evaluation techniques for centralized and distributed environments
19 Thank You