The
goal of this project is classifying nodes in graphs considering its
position in the graph. Classic machine learning classification
techniques assume that the label of each object is independent from
others, however in a graph this assumption is not true and connected
nodes will probably be in the same class. In other words, we cannot
assume the label of each node as an independent entity from other
nodes. For instance in a graph of friendship if a person is infected,
her\his friends will have higher probability of infection. Another
example can be in scientific paper author graphs, which represent the
connection between nodes of the graph when they have at least one
common author. In this data set the topic of connected papers are
probably the same. In these 2 examples it is obvious that the label of
data points are not independent of each other. This idea is the motivation for us to explore new methods, which consider the dependencies between nodes in a graph. These methods are named "relational classification methods." Our goal is to solve the collective classification problem when the labels of objects are related to each other. |
Figure 1. A friendship graph
|
Figure 2. Feature vector in ICA
|
In
order to construct the test and train sets we used snowball sampling
evaluation strategy (SS) to become sure there is sufficient number of
observed data for each node. As we can see in Figure 3 we start from a
random node in a graph and then expand our sampling by sampling from
neighbors of observed nodes. The implementation was done in Matlab, which is available here. |
Figure 3 An overview of snowball sampling
|
In
this project we used the Cora dataset, which consists of 2708 Machine
Learning papers. These papers are classified into one of the following
seven classes: * Case_Based * Genetic_Algorithms * Neural_Networks * Probabilistic_Methods * Reinforcement_Learning * Rule_Learning * Theory publications classifed into For each paper there is a feature vector of size 1433, which corresponds to a word. An element in a vector is one if the corresponding word has appeared in the paper otherwise it is 0. The graph of data set is a directed graph and each edge between apair of papers means paper 1 cited paper 2. The citation network consists of 5429 links. Figure 4 is a representation of the graph, which was computed using Gephi application You can find the dataset here. |
Figure 4. A visualization of Cora dataset
|
Figure 5 denotes the achieved accuracy using different
classification methods. The first observation in the results is
that relational learning methods perform much better than content based
learning algorithms. The next observation is that changing the base
classifier does not affect the results of ICA method. Figure 6 shows the obtained accuracy of LBP algorithm with different penalty rates. In some cases we saw that ICA did not converge (Figure 7). |
Figure 5. Achieved accuracy using different classification methods |
Figure 6. Results with different penalties in LBP algorithm
|
Figure 7. An example of ICA algorithm when it did not converge
|
© Sorour Ekhtiari Amiri