Uniform Random Sampling on Graph: To Be or Not To Be?

Hao Wang and Jianguo Lu

Two of the basic approaches to sampling graph nodes are uniform random node (RN) sampling and PPS (probability proportional to size) sampling. PPS sampling corresponds to random edge (RE) sampling in graph. While uniform random sampling is often used and preferred in practice, the question as for which method is better remains open. The answer to this question may vary depending on the properties to estimate, and the data under investigation. By focusing on the global properties such as average degree and population size, this paper shows that RE outperforms uniform random sampling on dozens of large graphs in real life when degree variation is large. Since most networks, especially the large ones, are scale-free networks that induce large degree variation, RE sampling is better than RN in most cases.

 Since RE sampling is hard to implement in practice, we use simple random walk (RW) sampling to approximate RE sampling. Comparing RN and RW samplings, the answer depends on not only the variation, but also the structure, of the data. We show that RW outperforms RN sampling when the degree variation is large and there are not loosely connected components.

 The results are supported by dozens of large graphs including Twitter and Facebook user networks.

Data

Most datasets we are using can refer to SNAP Project and they are selected as representatives in each catagory.
Two datasets Facebook-1 and Facebook-2 could not be shared due to the policy of facebook and agreement with the author of the data which we are not allowed to distribute the data.

Social Graphs

As the datasets we choose are in different types, precisely, directed and undirected, we unify them into undirected graphs.

Sampled Data

In this paper, we propose three sampling methods--Uniform Random Sampling, Random Walking Sampling and Random Edge Sampling. Using these sampling techniques, we estimate average
degree of each dataset. In this section, we release the data containing average degree by different sampling methods with five thousand runs.

Degrees

Figures and Descriptions

Description Figures Eps File Download Comments
Comparison of RN, RE and RW Eps File
Degree Distribution Eps File
Random Edge Sampling Distribution Eps File
Ratio of RN and RE Correlation(Gamma) Eps File
Berkley and Stanford Network Graphviz Eps File
Flicker Network Graphviz Eps File
NotreDame Network Graphviz Eps File
RoadNet-CA Network Graphviz Eps File

Programs

We also release our programs for scientific interest.
**Note:The input file should be same with my descriptions above to cater for the program.

  • Getting Samples From Network
  • Download can be found here.
    Descriptions: This program is written in Cplusplus in 64-bits env.
    Input: The graph files, described in the first section.
    Parameters: -i: input files
                    -o: output path
                    -r: sample rate, i.e. "-r 15" means 15% of the total population as the sample size.
                    -s: turns of runs.
                    -c: top n nodes' ranking correlation, i.e. "-c 300" means calculate tp 300 nodes' ranking correlation.
                    -sm: sampling methods, options--NSWE, i.e. -sm NSW means generating samples using RN, RW and Star sampling. options can be alternative.
                    -x: generating sample with degrees. options--0 or 1. "1" means enable while "0" means disable.
    Output: A file containing average degree of each sample.
  • Matlab files