
Spark GraphX


Concept

GraphX is Apache Spark’s API for graphs and graph-parallel computation.

At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge. To support graph computation, GraphX exposes a set of fundamental operators (e.g., subgraph, joinVertices, and aggregateMessages) as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
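To make the abstraction concrete, here is a minimal sketch in plain Python (not the GraphX API) of a directed multigraph with properties, plus simplified versions of two of the operators named above. The names `aggregate_messages`, `in_degree`, and `follows_only` and the sample data are made up for illustration, and the helper only sends messages to the destination vertex.

```python
# Vertex table: id -> property (here, a name).
vertices = {1: "Alice", 2: "Bob", 3: "Carol"}

# Edge table: (source id, destination id, property). Repeated (src, dst)
# pairs are allowed, which is what makes this a multigraph.
edges = [
    (1, 2, "follows"),
    (1, 2, "emails"),
    (2, 3, "follows"),
    (3, 1, "follows"),
]


def aggregate_messages(edges, send, merge):
    """Send one message per edge to its destination, then merge per vertex."""
    inbox = {}
    for src, dst, prop in edges:
        msg = send(src, dst, prop)
        inbox[dst] = merge(inbox[dst], msg) if dst in inbox else msg
    return inbox


# In-degree: every edge sends 1 to its destination; messages are summed.
in_degree = aggregate_messages(edges, lambda s, d, p: 1, lambda a, b: a + b)

# A subgraph keeps only the edges that pass a predicate.
follows_only = [e for e in edges if e[2] == "follows"]

print(in_degree)
print(follows_only)
```

The send/merge split is the useful part of the design: because merging is done per destination vertex, the work can in principle be distributed across partitions and combined afterwards.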

GraphX is a layer on top of Spark that provides a graph data structure composed of Spark RDDs, along with an API to operate on those graph data structures. GraphX comes with the standard Spark distribution, and you use it through a combination of the GraphX-specific API and the regular Spark API.

GraphX is not a database. Instead, it’s a graph processing system, which is useful, for example, for fielding web service queries or performing one-off, long-running standalone computations. Because GraphX isn’t a database, it doesn’t handle updates and deletes the way graph databases such as Neo4j and Titan do.

Apache Giraph is another example of a graph processing system, but Giraph runs on Hadoop MapReduce, which is slow by comparison.

GraphX, Giraph, and GraphLab are all separate implementations of the ideas expressed in the Google Pregel paper. Such graph processing systems are optimized for running algorithms on the entire graph in a massively parallel manner, as opposed to working with small pieces of graphs, as graph databases do.
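The Pregel idea of running an algorithm over the whole graph in synchronized rounds ("supersteps") can be sketched in a few lines. The following is a single-machine illustration in plain Python, not how GraphX or Giraph implement it: it labels connected components by repeatedly passing the smallest known vertex id along every edge. The function name `pregel_min_label` and the sample graph are assumptions made for this example.

```python
def pregel_min_label(vertex_ids, edges, max_supersteps=20):
    """Label each vertex with the smallest vertex id in its component."""
    # Every vertex starts out labelled with its own id.
    label = {v: v for v in vertex_ids}
    for _ in range(max_supersteps):
        # Message phase: each edge offers the smaller endpoint label
        # to the other endpoint (edges are treated as undirected here).
        inbox = {}
        for src, dst in edges:
            for sender, receiver in ((src, dst), (dst, src)):
                if label[sender] < label[receiver]:
                    best = inbox.get(receiver, label[sender])
                    inbox[receiver] = min(best, label[sender])
        # No messages means no vertex can improve, so the run halts.
        if not inbox:
            break
        # Update phase: vertices that received a message adopt the minimum.
        for v, msg in inbox.items():
            label[v] = min(label[v], msg)
    return label


components = pregel_min_label([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
print(components)
```

Each superstep only reads the labels from the previous round, which is what lets a real system process all edges of a round in parallel.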

To draw a comparison to the world of standard relational databases, graph databases like Neo4j are like OLTP (Online Transaction Processing), whereas graph processing systems like GraphX are like OLAP (Online Analytical Processing).

Six Degrees of Kevin Bacon

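Also known as six degrees of separation.
A US talk show once invited three college students onto the program with the aim of proving that any other Hollywood star can be linked to the actor Kevin Bacon through five people. They even managed to link the late Charlie Chaplin to Kevin Bacon through three people. The show drew a huge response.

The theory of six degrees of separation states: "No more than five people stand between you and any stranger; in other words, you can get to know any stranger through at most five people." According to this theory, only five people separate you from anyone else in the world, regardless of their country, ethnicity, or skin color.

The phenomenon does not mean that every connection between two people must pass through six levels. Rather, it expresses an important idea: any two strangers can always be linked through some chain of connections. Clearly, depending on the means and capacity for connection, the chances of realizing one's personal expectations will differ markedly.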
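In graph terms, the number of degrees between two people is the length of the shortest path between them. A minimal sketch using breadth-first search in plain Python follows; the link data and the names `a` through `d` are placeholders, not real co-appearance records, and `degrees_of_separation` is a name chosen for this example.

```python
from collections import deque


def degrees_of_separation(adjacency, start):
    """Return the hop count from start to every reachable person."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for neighbour in adjacency.get(person, ()):
            if neighbour not in hops:
                hops[neighbour] = hops[person] + 1
                queue.append(neighbour)
    return hops


# Hypothetical "appeared together" links, treated as undirected.
links = [("bacon", "a"), ("a", "b"), ("b", "c"), ("bacon", "d")]
adjacency = {}
for x, y in links:
    adjacency.setdefault(x, set()).add(y)
    adjacency.setdefault(y, set()).add(x)

hops = degrees_of_separation(adjacency, "bacon")
print(hops["c"])
```

A person at hop count n is connected to the start through n - 1 intermediaries, so "linked through three people" corresponds to a hop count of four.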

RDDs and Partitioning

GraphX represents a graph using two RDDs, one for vertices and one for edges. Representing graphs in this way allows GraphX to deal with one of the major issues in processing large graphs: partitioning.

GraphX stores a graph’s edges in one table and vertices in another.

Although GraphX stores edges and vertices in separate tables, much as one might design an RDBMS schema, internally it has special indexes to rapidly traverse the graph, and it exposes an API that makes graph querying and processing easier than trying to do the same in SQL.
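The two-table layout and its effect on partitioning can be sketched in plain Python. This is an illustration only, not GraphX internals: the partitioning rule (source id modulo the partition count) is an assumption chosen for simplicity, and `partition_edges` and `vertex_replicas` are made-up helper names.

```python
# Vertex table and edge table, kept separately.
vertices = {1: "Alice", 2: "Bob", 3: "Carol", 4: "Dave"}
edges = [(1, 2, "knows"), (2, 3, "knows"), (3, 4, "knows"), (4, 1, "knows")]

# Joining the two tables gives a view of each edge together with the
# properties of both of its endpoints.
triplets = [(vertices[s], prop, vertices[d]) for s, d, prop in edges]


def partition_edges(edges, num_partitions):
    """Assign each edge to a partition based on its source id."""
    parts = [[] for _ in range(num_partitions)]
    for edge in edges:
        parts[edge[0] % num_partitions].append(edge)
    return parts


def vertex_replicas(parts):
    """For each vertex, list the partitions that hold one of its edges."""
    replicas = {}
    for index, part in enumerate(parts):
        for src, dst, _ in part:
            replicas.setdefault(src, set()).add(index)
            replicas.setdefault(dst, set()).add(index)
    return replicas


parts = partition_edges(edges, 2)
replicas = vertex_replicas(parts)
print(triplets[0])
print(replicas)
```

Splitting the edge table means a vertex's property may be needed in several partitions at once, which is why the choice of partitioning rule matters for large graphs.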

Various GraphX data flows

Because GraphX’s capabilities for reading graph data files are so limited, data files usually have to be massaged and transformed using the Spark Core API into the graph format that GraphX uses. The output of a GraphX algorithm can be another graph, a number, some subgraphs, or a machine learning model.
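As an illustration of the kind of massaging involved, the sketch below turns a name-based CSV into a vertex table keyed by numeric ids and an edge table of id pairs. It is plain Python rather than Spark code, and the column names `follower` and `followed`, the sample rows, and the helper `to_graph_tables` are all assumptions made for this example.

```python
import csv
import io

# Hypothetical input: one relationship per row, identified by name.
raw = """follower,followed
alice,bob
bob,carol
alice,carol
"""


def to_graph_tables(text):
    """Convert name-based rows into (vertices, edges) keyed by numeric id."""
    reader = csv.DictReader(io.StringIO(text))
    ids = {}
    edges = []
    for row in reader:
        # Give each new name the next unused integer id.
        src = ids.setdefault(row["follower"], len(ids) + 1)
        dst = ids.setdefault(row["followed"], len(ids) + 1)
        edges.append((src, dst))
    vertices = {vid: name for name, vid in ids.items()}
    return vertices, edges


vertex_table, edge_table = to_graph_tables(raw)
print(vertex_table)
print(edge_table)
```

On a real cluster the id assignment would have to be done in a distributed way; the point here is only the shape of the output, one table of vertices and one table of edges.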

GraphX limitations

GraphX is still young, and some of its limitations stem from the limitations of Spark. For example, GraphX datasets, like all Spark datasets, can’t normally be shared by multiple Spark programs unless a REST server add-on like Spark JobServer is used. GraphX is also limited by the immutability of Spark RDDs (Resilient Distributed Datasets, the foundation of Spark), which is an issue for large graphs. That will remain the case until the IndexedRDD capability (Jira ticket SPARK-2365), effectively a mutable (that is, updatable) HashMap version of an RDD, is added to Spark. Although faster for some uses, GraphX is often slower than systems written in C++, such as GraphLab/PowerGraph, due to GraphX’s reliance on the JVM.

Storing the graphs: distributed file storage vs. graph database

Because GraphX is strictly an in-memory processing system, you need a place to store graph data. Spark expects distributed storage, such as HDFS, Cassandra, or S3, and storing graphs in distributed storage is the usual way to go.
But some use GraphX, a graph processing system, in conjunction with a graph database to get the best of both worlds. GraphX versus Neo4j is a frequent debate, but for some use cases, using both is better than using either alone. The open source project Mazerunner is an extension to Neo4j that offloads graph analytics such as PageRank to GraphX.
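For readers unfamiliar with the kind of whole-graph analytics being offloaded, here is a minimal sketch of PageRank in plain Python. It uses the textbook power-iteration formulation with a damping factor of 0.85 and is not the GraphX or Mazerunner implementation; the function name `pagerank` and the sample graph are assumptions, and the sketch assumes every vertex has at least one outgoing edge.

```python
def pagerank(vertex_ids, edges, damping=0.85, iterations=30):
    """Textbook PageRank by power iteration over a directed edge list."""
    n = len(vertex_ids)
    out_degree = {v: 0 for v in vertex_ids}
    for src, _ in edges:
        out_degree[src] += 1

    # Start from a uniform distribution.
    rank = {v: 1.0 / n for v in vertex_ids}
    for _ in range(iterations):
        # Every vertex gets the baseline share, then each edge passes on
        # an equal fraction of its source's current rank.
        new_rank = {v: (1.0 - damping) / n for v in vertex_ids}
        for src, dst in edges:
            new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


ranks = pagerank([1, 2, 3, 4], [(1, 2), (2, 3), (3, 1), (4, 3)])
print(max(ranks, key=ranks.get))
```

Every iteration touches every edge of the graph, which is the access pattern a graph processing system is built for and a transactional graph database is not.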

Graph vs Graphics

Graph and graphics are related terms.
As nouns, they differ as follows: a graph is a diagram displaying data, in particular one showing the relationship between two or more quantities, measurements, or indicative numbers, which may or may not be related to each other by a specific mathematical formula; graphics is the making of architectural or design drawings.
