# Shuffle and Sampling - 随机抽样和洗牌
### 洗牌算法
- [Shuffle a given array - GeeksforGeeks](http://www.geeksforgeeks.org/shuffle-a-given-array/)
Given an array, write a program to generate a random permutation of array elements. This question is also asked as “shuffle a deck of cards” or “randomize a given array”.
### 题解
这里以 Fisher–Yates shuffle 算法为例,伪代码如下:
~~~
To shuffle an array a of n elements (indices 0..n-1):
for i from 0 downto i do
j ← random integer such that 0 ≤ j ≤ i
exchange a[j] and a[i]
~~~
转化为代码为:
~~~
/*
* shuffle cards
*/
public static void shuffleCard(int[] cards) {
if (cards == null || cards.length == 0) return;
Random rand = new Random();
for (int i = 0; i < cards.length; i++) {
int k = rand.nextInt(i + 1); // 0~i (inclusive)
int temp = cards[i];
cards[i] = cards[k];
cards[k] = temp;
}
}
~~~
看了算法和代码后让我们来使用归纳法简单证明下这个洗牌算法的正确性。我们要证明的问题是:**数组中每个元素在每个索引处出现的概率均相等。**
对于单个元素来讲,以上算法显然正确,因为交换后仍然只有一个元素。接下来我们不妨假设其遍历到数组索引为`i-1`时满足随机排列特性,那么当遍历到数组索引为`i`时,随机数`k`为`i`的概率为`1/i`, 为`0~i-1`的概率为`(i-1)/i`. 接下来与索引为`i`的值交换,可以得知`card[i]`出现在索引`i`的位置的概率为`1/i`, 在其他索引位置的概率也为`1/i`; 而对于`card[i]`之前的元素,以索引`j`处的元素`card[j]`为例进行分析可知其在位置`j`的概率为`1/(i-1) * (i-1)/i = 1/i`, 具体含义为遍历到索引`i-1`时`card[j]`位于索引`j`的概率(`1/(i-1)`)乘以遍历到索引`i`时随机数未选择与索引`j`的数进行交换的概率(`(i-1)/i`).
需要注意的是前面的`j <= i-1`, 那么`card[j]`位于索引`i`的概率又是多少呢?要位于索引`i`,则随机数`k`须为`i`, 这种概率为`1/i`.
综上,以上算法可以实现完美洗牌(等概率)。
### Random sampling - 随机抽样
随机抽样也称为水池抽样,Randomly choosing a sample of k items from a list S containing n items. 大意是从大小为 n 的数组中随机选出 m 个整数,要求每个元素被选中的概率相同。
### 题解
比较简洁的有算法 Algorithm R, 伪代码如下:
~~~
/*
S has items to sample, R will contain the result
*/
ReservoirSample(S[1..n], R[1..k])
// fill the reservoir array
for i = 1 to k
R[i] := S[i]
// replace elements with gradually decreasing probability
for i = k+1 to n
j := random(1, i) // important: inclusive range
if j <= k
R[j] := S[i]
~~~
转化为代码为:
~~~
/*
* random sample
*/
public static int[] randomSample(int[] nums, int m) {
if (nums == null || nums.length == 0 || m <= 0) return new int[]{};
int[] sample = new int[m];
for (int i = 0; i < m; i++) {
sample[i] = nums[i];
}
Random random = new Random();
for (int i = m; i < nums.length; i++) {
int k = random.nextInt(i + 1); // 0~i(inclusive)
if (k < m) {
sample[k] = nums[i];
}
}
return sample;
}
~~~
和洗牌算法类似,我们要证明的问题是:**数组中每个元素在最终采样的数组中出现的概率均相等且为`m/n`.** 洗牌算法中是排列,而对于随机抽样则为组合。
维基百科上的证明相对容易懂一些,这里我稍微复述下。首先将数组前 m 个元素填充进新数组`sample`, 然后从`m`开始遍历直至数组最后一个索引。随机数`k`的范围为`0~i`, 如果`k < m`,新数组的索引为 k 的元素则和原数组索引为`i`的元素交换;如果`k >= m`, 则不进行交换。`i == m`时,以原数组中第`j`个元素为例,它被`nums[m]`替换的概率为`1/(m+1)`, 也就是说保留在`sample`数组中的概率为`m/(m+1)`. 对与第`m+1`个元素`nums[m]`来说,其位于`sample`数组中的概率则为`m*1/(m+1)`(可替换 m 个不同的元素).
接下来仍然使用数学归纳法证明,若`i`遍历到`r`时,其之前的元素出现的概率为`m/(r-1)`, 那么其之前的元素中任一元素`nums[j]`被替换的概率为`m/r * 1/m = 1/r`, 不被替换的概率则为`(r-1)/r`. 故元素`nums[j]`在`i`遍历完`r`后仍然保留的概率为`m/(r-1) * (r-1)/r = m/r`. 而对于元素`nums[r]`来说,其要被替换至`sample`数组中的概率则为`m/r`(随机数小于m 的个数为 m).
综上,以上算法在遍历完长度为 n 的数组后每个数出现在最终`sample`数组中的概率都为`m/n`.
### Implementation and Test case
**Talk is cheap, show me the code!**
### Java
~~~
import java.util.*;
import java.util.Random;
public class Probability {
public static void main(String[] args) {
int[] cards = new int[10];
for (int i = 0; i < 10; i++) {
cards[i] = i;
}
// 100000 times test
final int times = 100000;
final int m = 5;
int[][] count = new int[cards.length][cards.length];
int[][] count2 = new int[cards.length][m];
for (int i = 0; i < times; i++) {
shuffleCard(cards);
shuffleTest(cards, count);
int[] sample = randomSample(cards, m);
shuffleTest(sample, count2);
}
System.out.println("Shuffle cards");
shufflePrint(count);
System.out.println();
System.out.println("Random sample");
shufflePrint(count2);
}
/*
* shuffle cards
*/
public static void shuffleCard(int[] cards) {
if (cards == null || cards.length == 0) return;
Random rand = new Random();
for (int i = 0; i < cards.length; i++) {
int k = rand.nextInt(i + 1);
int temp = cards[i];
cards[i] = cards[k];
cards[k] = temp;
}
}
/*
* random sample
*/
public static int[] randomSample(int[] nums, int m) {
if (nums == null || nums.length == 0 || m <= 0) return new int[]{};
m = Math.min(m, nums.length);
int[] sample = new int[m];
for (int i = 0; i < m; i++) {
sample[i] = nums[i];
}
Random random = new Random();
for (int i = m; i < nums.length; i++) {
int k = random.nextInt(i + 1);
if (k < m) {
sample[k] = nums[i];
}
}
return sample;
}
/*
* nums[i] = j, num j appear in index i ==> count[j][i]
*/
public static void shuffleTest(int[] nums, int[][] count) {
if (nums == null || nums.length == 0) return;
for (int i = 0; i < nums.length; i++) {
count[nums[i]][i]++;
}
}
/*
* print shuffle test
*/
public static void shufflePrint(int[][] count) {
if (count == null || count.length == 0) return;
// print index
System.out.print(" ");
for (int i = 0; i < count[0].length; i++) {
System.out.printf("%-7d", i);
}
System.out.println();
// print num appear in index i in total
for (int i = 0; i < count.length; i++) {
System.out.print(i + ": ");
for (int j = 0; j < count[i].length; j++) {
System.out.printf("%-7d", count[i][j]);
}
System.out.println();
}
}
}
~~~
以十万次试验为例,左侧是元素`i`, 列代表在相应索引位置出现的次数。可以看出分布还是比较随机的。
~~~
Shuffle cards
0 1 2 3 4 5 6 7 8 9
0: 10033 9963 10043 9845 9932 10020 9964 10114 10043 10043
1: 9907 9951 9989 10071 10059 9966 10054 10023 10015 9965
2: 10042 10046 9893 10080 10050 9994 10024 9852 10098 9921
3: 10039 10023 10039 10024 9919 10057 10188 9916 9907 9888
4: 9944 9913 10196 10059 9838 10205 9899 9945 9850 10151
5: 10094 9971 10054 9958 10022 9922 10047 9978 9965 9989
6: 9995 10147 9824 10015 10023 9804 10050 10192 9939 10011
7: 9941 10131 9902 9920 10040 10121 10010 9928 9984 10023
8: 10010 9926 9883 10098 10083 10028 9801 9936 10200 10035
9: 9995 9929 10177 9930 10034 9883 9963 10116 9999 9974
Random sample
0 1 2 3 4
0: 9966 10026 10078 9966 9891
1: 9958 9806 10066 10022 10039
2: 9923 9936 9964 10051 10083
3: 10165 10088 10184 9928 9916
4: 9998 9990 9973 9931 9832
5: 10026 9932 9873 10085 10035
6: 9942 9972 9990 10030 10026
7: 9903 10153 9997 10051 10044
8: 10082 10066 9804 9899 10147
9: 10037 10031 10071 10037 9987
~~~
### Reference
- [Fisher–Yates shuffle](https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#The_modern_algorithm) - 洗牌算法的详述,比较简洁的算法
- [Reservoir sampling ](https://en.wikipedia.org/wiki/Reservoir_sampling) - 水池抽样算法
- [如何测试洗牌程序 | 酷 壳 - CoolShell.cn](http://coolshell.cn/articles/8593.html) - 借鉴了其中的一些测试方法
- 《计算机程序设计艺术》第二卷(半数值算法) - 3.4.2 随机抽样和洗牌
- 《编程珠玑》第十二章 - 抽样问题
- Preface
- Part I - Basics
- Basics Data Structure
- String
- Linked List
- Binary Tree
- Huffman Compression
- Queue
- Heap
- Stack
- Set
- Map
- Graph
- Basics Sorting
- Bubble Sort
- Selection Sort
- Insertion Sort
- Merge Sort
- Quick Sort
- Heap Sort
- Bucket Sort
- Counting Sort
- Radix Sort
- Basics Algorithm
- Divide and Conquer
- Binary Search
- Math
- Greatest Common Divisor
- Prime
- Knapsack
- Probability
- Shuffle
- Basics Misc
- Bit Manipulation
- Part II - Coding
- String
- strStr
- Two Strings Are Anagrams
- Compare Strings
- Anagrams
- Longest Common Substring
- Rotate String
- Reverse Words in a String
- Valid Palindrome
- Longest Palindromic Substring
- Space Replacement
- Wildcard Matching
- Length of Last Word
- Count and Say
- Integer Array
- Remove Element
- Zero Sum Subarray
- Subarray Sum K
- Subarray Sum Closest
- Recover Rotated Sorted Array
- Product of Array Exclude Itself
- Partition Array
- First Missing Positive
- 2 Sum
- 3 Sum
- 3 Sum Closest
- Remove Duplicates from Sorted Array
- Remove Duplicates from Sorted Array II
- Merge Sorted Array
- Merge Sorted Array II
- Median
- Partition Array by Odd and Even
- Kth Largest Element
- Binary Search
- Binary Search
- Search Insert Position
- Search for a Range
- First Bad Version
- Search a 2D Matrix
- Search a 2D Matrix II
- Find Peak Element
- Search in Rotated Sorted Array
- Search in Rotated Sorted Array II
- Find Minimum in Rotated Sorted Array
- Find Minimum in Rotated Sorted Array II
- Median of two Sorted Arrays
- Sqrt x
- Wood Cut
- Math and Bit Manipulation
- Single Number
- Single Number II
- Single Number III
- O1 Check Power of 2
- Convert Integer A to Integer B
- Factorial Trailing Zeroes
- Unique Binary Search Trees
- Update Bits
- Fast Power
- Hash Function
- Count 1 in Binary
- Fibonacci
- A plus B Problem
- Print Numbers by Recursion
- Majority Number
- Majority Number II
- Majority Number III
- Digit Counts
- Ugly Number
- Plus One
- Linked List
- Remove Duplicates from Sorted List
- Remove Duplicates from Sorted List II
- Remove Duplicates from Unsorted List
- Partition List
- Two Lists Sum
- Two Lists Sum Advanced
- Remove Nth Node From End of List
- Linked List Cycle
- Linked List Cycle II
- Reverse Linked List
- Reverse Linked List II
- Merge Two Sorted Lists
- Merge k Sorted Lists
- Reorder List
- Copy List with Random Pointer
- Sort List
- Insertion Sort List
- Check if a singly linked list is palindrome
- Delete Node in the Middle of Singly Linked List
- Rotate List
- Swap Nodes in Pairs
- Remove Linked List Elements
- Binary Tree
- Binary Tree Preorder Traversal
- Binary Tree Inorder Traversal
- Binary Tree Postorder Traversal
- Binary Tree Level Order Traversal
- Binary Tree Level Order Traversal II
- Maximum Depth of Binary Tree
- Balanced Binary Tree
- Binary Tree Maximum Path Sum
- Lowest Common Ancestor
- Invert Binary Tree
- Diameter of a Binary Tree
- Construct Binary Tree from Preorder and Inorder Traversal
- Construct Binary Tree from Inorder and Postorder Traversal
- Subtree
- Binary Tree Zigzag Level Order Traversal
- Binary Tree Serialization
- Binary Search Tree
- Insert Node in a Binary Search Tree
- Validate Binary Search Tree
- Search Range in Binary Search Tree
- Convert Sorted Array to Binary Search Tree
- Convert Sorted List to Binary Search Tree
- Binary Search Tree Iterator
- Exhaustive Search
- Subsets
- Unique Subsets
- Permutations
- Unique Permutations
- Next Permutation
- Previous Permuation
- Unique Binary Search Trees II
- Permutation Index
- Permutation Index II
- Permutation Sequence
- Palindrome Partitioning
- Combinations
- Combination Sum
- Combination Sum II
- Minimum Depth of Binary Tree
- Word Search
- Dynamic Programming
- Triangle
- Backpack
- Backpack II
- Minimum Path Sum
- Unique Paths
- Unique Paths II
- Climbing Stairs
- Jump Game
- Word Break
- Longest Increasing Subsequence
- Palindrome Partitioning II
- Longest Common Subsequence
- Edit Distance
- Jump Game II
- Best Time to Buy and Sell Stock
- Best Time to Buy and Sell Stock II
- Best Time to Buy and Sell Stock III
- Best Time to Buy and Sell Stock IV
- Distinct Subsequences
- Interleaving String
- Maximum Subarray
- Maximum Subarray II
- Longest Increasing Continuous subsequence
- Longest Increasing Continuous subsequence II
- Graph
- Find the Connected Component in the Undirected Graph
- Route Between Two Nodes in Graph
- Topological Sorting
- Word Ladder
- Bipartial Graph Part I
- Data Structure
- Implement Queue by Two Stacks
- Min Stack
- Sliding Window Maximum
- Longest Words
- Heapify
- Problem Misc
- Nuts and Bolts Problem
- String to Integer
- Insert Interval
- Merge Intervals
- Minimum Subarray
- Matrix Zigzag Traversal
- Valid Sudoku
- Add Binary
- Reverse Integer
- Gray Code
- Find the Missing Number
- Minimum Window Substring
- Continuous Subarray Sum
- Continuous Subarray Sum II
- Longest Consecutive Sequence
- Part III - Contest
- Google APAC
- APAC 2015 Round B
- Problem A. Password Attacker
- Microsoft
- Microsoft 2015 April
- Problem A. Magic Box
- Problem B. Professor Q's Software
- Problem C. Islands Travel
- Problem D. Recruitment
- Microsoft 2015 April 2
- Problem A. Lucky Substrings
- Problem B. Numeric Keypad
- Problem C. Spring Outing
- Microsoft 2015 September 2
- Problem A. Farthest Point
- Appendix I Interview and Resume
- Interview
- Resume