Generating Random Text with Markov Chains · The Art and Practice of Programming Algorithms

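### Problem description

Second-order Markov chain: each word is chosen based on the two words that precede it.

**Example: of the people, by the people, for the people**

### Analysis

| Prefix (from the suffix array) | Suffix |
|---|---|
| of the | people |
| the people | by |
| people by | the |
| by the | people |
| the people | for |
| people for | the |
| for the | people |
| the people | (empty) |

Punctuation is ignored in this table. For example, "the people" can be followed by "by", "for", or the empty suffix, and one of them is chosen according to its probability. If "for" is chosen, the state transitions to "people for", from which the next word must be "the", and so on. Given a starting prefix, we can keep following the chain until we reach the empty suffix.

**Several implementation approaches** (note: in C++ this can be done with an STL `map`, for example one that maps each prefix to a list of its suffixes):

1) A plain hash table that records each prefix together with its suffixes (for example, the suffixes of the prefix "the people" are "by", "for", and empty). Given a prefix, the hash table finds its suffixes quickly; one suffix is then chosen with probability proportional to how often it occurs among the suffixes, and the chain moves to the next state.

2) A suffix array in which every element points to the start of a word. **First sort the suffix array**, then **use binary search to find the first occurrence of the prefix**, and finally scan forward over the matching entries, choosing one suffix by probability.

3) A combination of the two: use the suffix array to build the hash table.

**One problem to solve first**

When a prefix has several suffixes, how do we choose one with the right probability? For "the people" the candidates are "by", "for", and empty.

~~~
int nmatch=0;
for everyone in suffix
    if( rand()%++nmatch==0 )   // the i-th candidate replaces the choice with probability 1/i
        select=this_suffix;
~~~

The test is run for every suffix. The first suffix is always selected, the second replaces it with probability 1/2, the third replaces the current choice with probability 1/3, and so on. After all n candidates have been visited, each one is the final choice with probability 1/n, so a suffix that occurs more often in the text is proportionally more likely to be picked.

~~~
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define NHASH 49979
#define MULT 31
#define MAXWORDS 80000

char inputchars[4300000]; // stores the input text
char *word[MAXWORDS];     // suffix array: word[i] points to the start of the i-th word
int nword = 0;            // number of words read
int k = 2;                // order of the chain (second order)
int next[MAXWORDS];       // links entries that share a hash bucket
int bin[NHASH];           // bin[h] is the head of the chain for hash value h

// Hash the k words that start at str
unsigned int hash(char *str) {
    int n;
    unsigned int h = 0;
    // hash bytes as unsigned so that non-ASCII input cannot sign-extend
    unsigned char *p = (unsigned char *)str;
    for (n = k; n > 0; ++p) {
        h = MULT * h + *p;
        if (*p == '\0')
            --n;
    }
    return h % NHASH;
}

// Compare the first k words at p and q
int wordncmp(char *p, char *q) {
    int n;
    for (n = k; *p == *q; ++p, ++q) {
        if (*p == '\0' && (--n) == 0)
            return 0;
    }
    return *p - *q;
}

// Starting at the current word, skip the next n words
char *skip(char *p, int n) {
    for (; n > 0; ++p) {
        if (*p == '\0')
            --n;
    }
    return p;
}

int main(void) {
    int i, j;

    // Step 1: build the suffix array
    word[0] = inputchars;
    // scanf("%s") splits on whitespace and appends '\0' to each word;
    // the bound keeps word[nword + 1] inside the array
    while (nword < MAXWORDS - 1 && scanf("%s", word[nword]) != EOF) {
        word[nword + 1] = word[nword] + strlen(word[nword]) + 1;
        ++nword;
    }
    // Append k null characters so that hash() and wordncmp() stop cleanly at
    // the end of the text (the original author suspects this is unnecessary;
    // inputchars is a zero-initialized global, so these bytes are already 0)
    for (i = 0; i < k; ++i)
        word[nword][i] = '\0';
    // With fewer than k words there is no prefix to start from
    if (nword < k)
        return 0;

    // Step 2: build the hash table
    // Initialize the hash table
    for (i = 0; i < NHASH; ++i)
        bin[i] = -1;
    // New entries are inserted at the head of each chain. For example, if
    // word[0], word[1] and word[5] all hash to 15, then:
    // bin[15](5)->next[5](1)->next[1](0)->next[0](-1)
    for (i = 0; i <= nword - k; ++i) {
        j = hash(word[i]);
        next[i] = bin[j];
        bin[j] = i;
    }

    // Step 3: generate random text
    int wordsleft; // number of words still to generate
    int psofar;    // number of matching entries seen so far
    char *phrase, *p;
    phrase = inputchars;
    for (wordsleft = 10000; wordsleft > 0; --wordsleft) {
        psofar = 0;
        // Among the entries with the same hash value, find those whose k words
        // equal the current phrase, and choose one of them by probability
        for (j = bin[hash(phrase)]; j >= 0; j = next[j])
            if (wordncmp(phrase, word[j]) == 0 && rand() % (++psofar) == 0)
                p = word[j];
        // Advance the phrase by one word
        phrase = skip(p, 1);
        // Output the word that follows the k-word prefix of the chosen entry;
        // an empty word means the end of the text was reached
        if (strlen(skip(phrase, k - 1)) == 0)
            break;
        printf("%s\n", skip(phrase, k - 1));
    }
    return 0;
}
~~~

**Please credit the source when reposting:** [http://blog.csdn.net/utimes/article/details/8864122](http://blog.csdn.net/utimes/article/details/8864122)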
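The selection rule used above (`rand() % ++nmatch == 0`) can be isolated into a small function to see why it is fair. The sketch below is only an illustration: `pick_uniform` is a hypothetical helper that does not appear in the original program. Candidate i (counting from 0) is selected with probability 1/(i+1) and then survives each later candidate j with probability 1 - 1/(j+1), so the product telescopes to 1/n, ignoring the small modulo bias of `rand()`.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: returns an index in [0, n), each with probability
   1/n, without knowing n in advance while scanning. */
int pick_uniform(int n) {
    int nmatch = 0;   /* number of candidates seen so far */
    int select = -1;  /* index of the current choice */
    int i;
    assert(n > 0);
    for (i = 0; i < n; ++i) {
        /* candidate i replaces the current choice with probability 1/(i+1) */
        if (rand() % ++nmatch == 0)
            select = i;
    }
    return select;
}
```

Because the rule needs only a running count, the generator can pick a suffix in a single pass over a hash chain or a run of equal suffix-array entries, without first counting how many entries match.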
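The program above implements approach 3. Approach 2 (sort the suffix array, binary-search for the first occurrence of the prefix, then scan forward) could look roughly like the following sketch. It is not the original author's code: the names `build`, `first_match`, `next_word`, `K` and `MAXW`, the fixed buffer sizes, and reading the text from a string instead of standard input are all assumptions made for illustration. A prefix is passed as K words separated by `'\0'`, matching the in-memory layout of the text.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define K 2        /* order of the chain (assumed fixed in this sketch) */
#define MAXW 1000  /* hypothetical capacity */

char text[10000];  /* words stored back to back, each ending in '\0' */
char *words[MAXW]; /* suffix array: pointers to word starts, sorted */
int nwords = 0;

/* Compare the K words starting at p and q. */
int wordncmp(const char *p, const char *q) {
    int n = K;
    for (; *p == *q; ++p, ++q)
        if (*p == '\0' && --n == 0)
            return 0;
    return (unsigned char)*p - (unsigned char)*q;
}

int sortcmp(const void *a, const void *b) {
    return wordncmp(*(char *const *)a, *(char *const *)b);
}

/* Skip n words forward from p. */
const char *skip(const char *p, int n) {
    for (; n > 0; ++p)
        if (*p == '\0')
            --n;
    return p;
}

/* Split src on spaces, record every word start, then sort the pointers. */
void build(const char *src) {
    char *out = text;
    assert(strlen(src) + K + 1 < sizeof text);
    memset(text, 0, sizeof text); /* also provides the trailing end marker */
    nwords = 0;
    while (*src) {
        while (*src == ' ')
            ++src;
        if (*src == '\0')
            break;
        assert(nwords < MAXW);
        words[nwords++] = out;
        while (*src && *src != ' ')
            *out++ = *src++;
        *out++ = '\0';
    }
    qsort(words, nwords, sizeof words[0], sortcmp);
}

/* Binary search: index of the first entry whose K words are >= phrase. */
int first_match(const char *phrase) {
    int l = -1, u = nwords;
    while (l + 1 != u) {
        int m = (l + u) / 2;
        if (wordncmp(words[m], phrase) < 0)
            l = m;
        else
            u = m;
    }
    return u;
}

/* Pick one occurrence of phrase at random and return the word after it.
   Returns "" at the end of the text, or NULL if phrase never occurs. */
const char *next_word(const char *phrase) {
    int u = first_match(phrase);
    const char *chosen = NULL;
    int i;
    for (i = 0; u + i < nwords && wordncmp(phrase, words[u + i]) == 0; ++i)
        if (rand() % (i + 1) == 0)
            chosen = words[u + i];
    return chosen ? skip(chosen, K) : NULL;
}
```

After `build("of the people by the people for the people")`, a call such as `next_word("the\0people")` returns "by", "for", or the empty string. Compared with the hash table, this version needs no extra `bin` and `next` arrays, but every lookup costs a binary search over the sorted pointers.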