# Uniform Random Sampling on Graph: To Be or Not To Be?

Hao Wang and Jianguo Lu

Two of the basic approaches to sampling graph nodes are uniform random node (RN) sampling and PPS (probability proportional to size) sampling. PPS sampling corresponds to random edge (RE) sampling in graph.
While uniform random sampling is often used and preferred in practice, the question as for which method is better remains open. The answer to this question may vary depending on the properties to estimate, and the data under investigation. By focusing on the global properties such as average degree and population size, this paper shows that RE outperforms uniform random sampling on dozens of large graphs in real life when degree variation is large. Since most networks, especially the large ones, are scale-free networks that induce large degree variation, RE sampling is better than RN in most cases.

Since RE sampling is hard to implement in practice, we use simple random walk (RW) sampling to approximate RE sampling. Comparing RN and RW samplings, the answer depends on not only the variation, but also the structure, of the data. We show that RW outperforms RN sampling when the degree variation is large and there are not loosely connected components.

The results are supported by dozens of large graphs including Twitter and Facebook user networks.

Since RE sampling is hard to implement in practice, we use simple random walk (RW) sampling to approximate RE sampling. Comparing RN and RW samplings, the answer depends on not only the variation, but also the structure, of the data. We show that RW outperforms RN sampling when the degree variation is large and there are not loosely connected components.

The results are supported by dozens of large graphs including Twitter and Facebook user networks.

## Data

Most datasets we are using can refer to SNAP Project and they are selected as representatives in each catagory.Two datasets Facebook-1 and Facebook-2 could not be shared due to the policy of facebook and agreement with the author of the data which we are not allowed to distribute the data.

### Social Graphs

As the datasets we choose are in different types, precisely, directed and undirected, we unify them into undirected graphs.**Download**

**Format**

SourceNode \t DestinationNode \n**Note**

* Each file contains two parts: header and body. The header starts with character "#", the first line indicates the number of nodes and the number of edges.

The second line is the symbol of "FromNodeId" and "ToNodeId". The body is described above, each line contains two numeric IDs.

* These numeric IDs are anonymous so they don't have real meannings just differ from each other.

* IDs are not in sequence.

* Each line represents an edge, is unique in the file. That is to say, you will find only one "a"->"b", additionally, as we convert directed graphs into undirected

ones, you will neither find "b"->"a" in the file if it oraginally exists in the raw data.

### Sampled Data

In this paper, we propose three sampling methods--Uniform Random Sampling, Random Walking Sampling and Random Edge Sampling. Using these sampling techniques, we estimate averagedegree of each dataset. In this section, we release the data containing average degree by different sampling methods with five thousand runs.

**Download**

**File Name**

The file name contains the information about how we sample this data. For example, Stanford1000_5000STARD.dat,

Stanford means the data source, 1000 means the size of this sample, 5000 means how many times we run or how many samples we get, STAR means Star sampling method.

We have four kinds of sampling results:

URD: Uniform Random Sampling

RWD: Random Walk Sampling

RED: Random Edge Sampling

STAR: Star sampling

**Format**

Average Degree \t Estimator Variation \nThe first two lines of the file represent average degree \t number of nodes \n number of edges \t number of samples

### Degrees

**Download**

**Format**

Node ID \t Degrees \n

## Figures and Descriptions

Description | Figures | Eps File Download | Comments |
---|---|---|---|

Comparison of RN, RE and RW | Eps File | ||

Degree Distribution | Eps File | ||

Random Edge Sampling Distribution | Eps File | ||

Ratio of RN and RE Correlation(Gamma) | Eps File | ||

Berkley and Stanford Network Graphviz | Eps File | ||

Flicker Network Graphviz | Eps File | ||

NotreDame Network Graphviz | Eps File | ||

RoadNet-CA Network Graphviz | Eps File |

## Programs

We also release our programs for scientific interest.**Note:The input file should be same with my descriptions above to cater for the program.

**Getting Samples From Network**

Descriptions: This program is written in Cplusplus in 64-bits env.

Input: The graph files, described in the first section.

Parameters: -i: input files

-o: output path

-r: sample rate, i.e. "-r 15" means 15% of the total population as the sample size.

-s: turns of runs.

-c: top n nodes' ranking correlation, i.e. "-c 300" means calculate tp 300 nodes' ranking correlation.

-sm: sampling methods, options--NSWE, i.e. -sm NSW means generating samples using RN, RW and Star sampling. options can be alternative.

-x: generating sample with degrees. options--0 or 1. "1" means enable while "0" means disable.

Output: A file containing average degree of each sample.