Using Coreseek on Windows and Linux

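Installing and Using Sphinx

I. Why use Sphinx

Suppose you run a forum whose data has grown past 1,000,000 records, and many users report that forum search is very slow. At that point Sphinx is worth considering.

II. What Sphinx is

Sphinx is a high-performance full-text search package. Full-text search is an information retrieval technique that treats the entire text of a document as the search target: the match may be in an article's title, its author, its abstract, or its body.

III. Sphinx features

- High indexing speed (close to 10 MB/s on recent CPUs)
- High search speed (average query time under 0.1 s on 2-4 GB of text)
- High scalability (up to 100 GB of text and 100 M documents on a single CPU)
- Good relevance ranking
- Distributed search
- Document excerpt generation
- Multiple attributes per document
- Word segmentation

IV. Downloading and installing Sphinx

Go to http://www.coreseek.cn/news/7/52/ and pick the version that matches your operating system. On Linux, download the source package, then compile and install it. Coreseek is software built on top of Sphinx; it modifies Sphinx in a few places and handles Chinese better than stock Sphinx does. After the download finishes, extract the archive wherever you like, for example to the root of drive E:, and rename the folder to coreseek. Coreseek is now installed.

[Figure: directory layout after extraction]

V. Using Sphinx

Using Sphinx takes the following steps:

1) Have data in the database
2) Create a Sphinx configuration file
3) Build the index
4) Start Sphinx
5) Call Sphinx from your program through the API and fetch the data

1) Import the data

In the extracted files, find var/test/documents.sql and execute it to create the documents table.

2) Create the configuration file

Next, create a Sphinx configuration file, E:\coreseek\etc\mysql.conf, with the following contents:

    source mysql
    {
        type          = mysql
        sql_host      = localhost
        sql_user      = root
        sql_pass      =
        sql_db        = test
        sql_port      = 3306
        sql_query_pre = SET NAMES utf8
        sql_query     = SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, title, content FROM documents WHERE id
        [...]

    <?php
    // The opening lines are missing from the source; loading sphinxapi.php
    // (shipped in the api/ directory) and creating the client is assumed.
    require 'sphinxapi.php';
    $sc = new SphinxClient();
    $sc->SetServer('localhost', 9312);
    $sc->SetMatchMode(SPH_MATCH_ANY);   // match any of the query words
    $keyword = '谷歌';                   // "Google"
    $res = $sc->Query($keyword, 'main,delta');
    print_r($res);
    ?>

Printed result:

    Array
    (
        [matches] => Array
            (
                [2] => Array
                    (
                        [weight] => 2
                        [attrs] => Array
                            (
                                [date_added] => 2014-10-10
                                [group_id] => 2
                            )
                    )
                [4] => Array
                    (
                        [weight] => 2
                        [attrs] => Array
                            (
                                [date_added] => 2014-10-09
                                [group_id] => 5
                            )
                    )
            )
    )

The matches array holds the query result, but it is not yet the data we want: the title and content fields are what we are after, and they are not here. Sphinx does not connect to MySQL to fetch rows at query time; it only computes over its own index. To get the actual data through the Sphinx API, you therefore have to take the query result and query MySQL again with it.

The keys in the result above mean:

- 2: the unique primary key, for example the id in the documents table
- weight: the relevance weight
- attrs: the attributes configured through sql_attr_*

At this point the search engine is more than half done; how you fetch the remaining data is up to you. For example, continuing from the code above:

    [...]

VI. Real-time index updates

If you have followed along step by step, a question may have occurred to you: the index was built from the rows already in the table, so users can find those, but how do rows added afterwards become searchable? Does the whole index have to be rebuilt each time?

Consider also the case where the data set is so large that frequent rebuilds are impractical, while the number of new records is small. A typical example: a forum has 1,000,000 archived posts but only 1,000 new posts per day. Here the "main index + delta index" (main + delta) scheme gives near-real-time index updates.

The basic idea is to set up two data sources and two indexes: a main index over the data that rarely changes, and a delta index over newly added documents. In the example above, the 1,000,000 archived posts go into the main index and the 1,000 new posts per day go into the delta index. The delta index can be rebuilt very frequently; on Linux, for instance, a shell script can run once a minute, so new documents become searchable without rebuilding every index and wasting resources.

How main + delta works: create a table sph_counter in the database with two columns, counter_id and max_doc_id. counter_id can be any value you choose; max_doc_id holds the largest primary-key id at the time the main index was last built from the table. For example, my real-time index configuration for the documents table is as follows:

    # source main
    source main
    {
        type          = mysql
        sql_host      = 127.0.0.1
        sql_user      = root
        sql_pass      =
        sql_db        = test
        sql_port      = 3306
        sql_query_pre = SET NAMES utf8
        sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM documents
        sql_query     = SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, title, content FROM documents WHERE id <= ( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
        [...]

    # index delta
    index delta : main
    {
        source           = delta
        path             = E:/coreseek/var/data/mysql
        docinfo          = extern
        mlock            = 0
        morphology       = none
        min_word_len     = 1
        html_strip       = 0
        # Chinese word segmentation settings
        charset_dictpath = E:/coreseek/etc/
        charset_type     = zh_cn.utf-8
        min_prefix_len   = 0
        min_infix_len    = 1
        ngram_len        = 1
    }

    # global indexer settings
    indexer
    {
        mem_limit = 256M
    }

    # searchd service settings
    searchd
    {
        listen          = 3312
        read_timeout    = 5
        max_children    = 30
        max_matches     = 10000
        seamless_rotate = 0
        preopen_indexes = 0
        unlink_old      = 1
        pid_file        = E:/coreseek/var/log/searchd_discuzx.pid  # on Windows, prefer a full path
        log             = E:/coreseek/var/log/searchd_discuzx.log  # on Windows, prefer a full path
        query_log       = E:/coreseek/var/log/query_discuzx.log    # on Windows, prefer a full path
    }

Use it from a PHP program like this:

    <?php
    // The opening lines are missing from the source; loading sphinxapi.php
    // and creating the client is assumed.
    require 'sphinxapi.php';
    $sc = new SphinxClient();
    $sc->SetServer('localhost', 3312);
    $sc->SetMatchMode(SPH_MATCH_ANY);
    $keyword = '谷歌';                   // "Google"
    $res = $sc->Query($keyword, 'main,delta');

    // Collect the matching document ids (the keys of $res['matches']).
    $tids = array();
    if ($res) {
        if (is_array($res['matches'])) {
            $tids = array_keys($res['matches']);
        }
    }
    $tidStr = implode(',', $tids);

    // Fetch the real rows from MySQL by id.
    // The mysql_* extension was removed in PHP 7; use mysqli or PDO there.
    $dbhost   = '127.0.0.1';
    $username = 'root';
    $userpass = '';
    $database = 'test';
    $db_connect = mysql_connect($dbhost, $username, $userpass) or die('Unable to connect to MySQL!');
    mysql_query('set names utf8');
    mysql_select_db($database, $db_connect);
    $result = mysql_query("SELECT title,content FROM documents where id in ($tidStr)");

    // The markup string literals below were lost in the source; they are left empty.
    $preStr = '';
    while ($row = mysql_fetch_array($result)) {
        $preStr .= '';
        $preStr .= bat_highlight($row['title'], $keyword);
        $preStr .= '';
        $preStr .= '';
        $preStr .= bat_highlight($row['content'], $keyword);
        $preStr .= '';
    }
    $preStr .= '';
    $preStr .= '';
    echo $preStr;
    exit;

    // Wraps each word in highlight markup (the literals were lost in the source).
    function highlight($text, $words, $prepend) {
        $text = str_replace('', '', $text);
        foreach ($words as $key => $replaceword) {
            $text = str_replace($replaceword, '' . $replaceword . '', $text);
        }
        return "$prepend$text";
    }

    function bat_highlight($message, $words, $color = '#ff0000') {
        if (!empty($words)) {
            $highlightarray = explode(' ', $words);
            $sppos = strrpos($message, chr(0) . chr(0) . chr(0));
            // strrpos() can return 0, so compare strictly against false.
            if ($sppos !== false) {
                $specialextra = substr($message, $sppos + 3);
                $message = substr($message, 0,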