Learn Lucene with Me, Step by Step (11): Search Result Highlighting with highlighter
Introduction to highlighter
I have been working overtime these past few days and the blog has gone three days without an update; my apologies. When we run a search, we want the places in the results that match the search input to stand out, as in the effect below.
Here the search term is "一步一步跟我学习lucene" (the name of this series); on the results page the search engine renders the user's input in a different color. This effect of visually separating the matched input from the normal text is what we call highlighting.
The benefits of doing this:
- Visually, it makes it easy to spot the blocks of text that correspond to the search;
- The results page is friendlier to read;
Lucene provides the highlighter module to achieve this kind of effect;
it applies highlighting to the query keywords in the results.
The highlight package contains the functionality for highlighting query terms on the results page. Its core component is the Highlighter class, which works together with a Fragmenter, a fragment Scorer, and a Formatter to let users customize how highlights are rendered.
Example program
Here I reuse the directory/file index built in an earlier post.
package com.lucene.search.util;
import java.io.IOException;
import java.io.StringReader;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.util.BytesRef;
public class HighlighterTest {
public static void main(String[] args) {
IndexSearcher searcher;
TopDocs docs;
ExecutorService service = Executors.newCachedThreadPool();
try {
searcher = SearchUtil.getMultiSearcher("index", service);
Term term = new Term("content",new BytesRef("lucene"));
TermQuery termQuery = new TermQuery(term);
docs = SearchUtil.getScoreDocsByPerPage(1, 30, searcher, termQuery);
ScoreDoc[] hits = docs.scoreDocs;
QueryScorer scorer = new QueryScorer(termQuery);
SimpleHTMLFormatter simpleHtmlFormatter = new SimpleHTMLFormatter("<B>","</B>");//set the highlight format, <B>keyword</B>; these are the default tags
Highlighter highlighter = new Highlighter(simpleHtmlFormatter,scorer);
highlighter.setTextFragmenter(new SimpleFragmenter(20));//set the number of characters returned per fragment
Analyzer analyzer = new StandardAnalyzer();
for(int i=0;i<hits.length;i++){
Document doc = searcher.doc(hits[i].doc);
String str = highlighter.getBestFragment(analyzer, "content", doc.get("content")) ;
System.out.println(str);
}
} catch (IOException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
} catch (InvalidTokenOffsetsException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}finally{
service.shutdown();
}
}
}
How Lucene's highlighter works:
- A Highlighter object is created from a Formatter and a Scorer; the formatter defines how a highlight is rendered, while the scorer defines how matched fragments are scored;
The scoring algorithm first takes the term's score to obtain the weight of the corresponding document, then walks through the text, counting how often the term occurs and recording the positions at which it appears (which is what makes highlighting possible). The method that scores each token is:
public float getTokenScore() {
position += posIncAtt.getPositionIncrement();//track the current token position
String termText = termAtt.toString();
WeightedSpanTerm weightedSpanTerm;
if ((weightedSpanTerm = fieldWeightedSpanTerms.get(
termText)) == null) {
return 0;
}
if (weightedSpanTerm.positionSensitive &&
!weightedSpanTerm.checkPosition(position)) {
return 0;
}
float score = weightedSpanTerm.getWeight();//look up the term weight
// found a query term - is it unique in this doc?
if (!foundTerms.contains(termText)) {//count each distinct term only once per document
totalScore += score;
foundTerms.add(termText);
}
return score;
}
The formatter works as follows: it inspects each token group of the analyzed text, and if the total score obtained from the scorer is greater than 0, i.e. the query terms occur in that token group, the text is wrapped in the form preTag + matched text + postTag.
The detailed logic is as follows:
public String highlightTerm(String originalText, TokenGroup tokenGroup) {
if (tokenGroup.getTotalScore() <= 0) {
return originalText;
}
// Allocate StringBuilder with the right number of characters from the
// beginning, to avoid char[] allocations in the middle of appends.
StringBuilder returnBuffer = new StringBuilder(preTag.length() + originalText.length() + postTag.length());
returnBuffer.append(preTag);
returnBuffer.append(originalText);
returnBuffer.append(postTag);
return returnBuffer.toString();
}
The default tags are <B> and </B>, so by default the output has the form <B>keyword</B>; a sketch with custom tags follows.
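The pre and post tags are simply whatever you pass to the SimpleHTMLFormatter constructor, so swapping the default <B>...</B> for, say, a span carrying a CSS class only takes a different pair of arguments. A minimal sketch, assuming the field "content" and term "lucene" from the examples in this post and an illustrative CSS class name "highlight":
package com.lucene.search.util;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.util.BytesRef;
public class CssFormatterExample {
    public static Highlighter cssHighlighter() {
        //query whose terms should be emphasized; the field "content" matches the index used in this post
        TermQuery query = new TermQuery(new Term("content", new BytesRef("lucene")));
        //replace the default <B>...</B> tags with a span carrying a CSS class
        SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("<span class=\"highlight\">", "</span>");
        return new Highlighter(formatter, new QueryScorer(query));
    }
}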
- Using the scorer and the formatter, Highlighter analyzes the document: for each hit it calls getBestTextFragments(TokenStream tokenStream, String text, boolean mergeContiguousFragments, int maxNumFragments); a sketch of calling this method directly follows this list. The process is:
- The scorer first initializes the position index of the query terms, and a PositionIncrementAttribute is attached to the token stream; this attribute records the position at which each token occurs;
- The text is then walked token by token; if the analyzed text exceeds the configured maximum length, the overflow is treated as too long and is not processed;
- When a matching token is found, the content is first cut into fragments (the fragment size is determined by the value passed to setTextFragmenter), and then the formatter's highlightTerm method is called to rebuild the highlighted text;
- Finally the text between the current token and the next occurrence is appended to the fragment.
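To see these steps end to end, you can call getBestTextFragments yourself instead of the getBestFragment convenience method used in the example program. Below is a minimal sketch that analyzes an in-memory string with StandardAnalyzer; the sample text, field name, and fragment settings are made up purely for illustration:
package com.lucene.search.util;
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.TextFragment;
import org.apache.lucene.util.BytesRef;
public class BestFragmentsExample {
    public static void main(String[] args) throws IOException, InvalidTokenOffsetsException {
        //sample text for illustration only
        String text = "Lucene is a full text search library; the lucene highlighter marks query terms in search results.";
        TermQuery query = new TermQuery(new Term("content", new BytesRef("lucene")));
        Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter("<B>", "</B>"), new QueryScorer(query));
        highlighter.setTextFragmenter(new SimpleFragmenter(40));//fragments of roughly 40 characters
        Analyzer analyzer = new StandardAnalyzer();
        try (TokenStream tokenStream = analyzer.tokenStream("content", text)) {
            //mergeContiguousFragments = true joins adjacent fragments, maxNumFragments = 3
            TextFragment[] fragments = highlighter.getBestTextFragments(tokenStream, text, true, 3);
            for (TextFragment fragment : fragments) {
                if (fragment != null && fragment.getScore() > 0) {//keep only fragments that actually contain query terms
                    System.out.println(fragment.toString());
                }
            }
        }
    }
}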
The search utility class
package com.lucene.search.util;
import java.io.File;
import java.io.IOException;
import java.nio.file.Paths;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;
/**Utility class for querying a Lucene index
* @author lenovo
*
*/
public class SearchUtil {
/**Get an IndexSearcher that searches every index under the given parent directory
* @param parentPath
* @param service
* @return
* @throws IOException
*/
public static IndexSearcher getIndexSearcherByParentPath(String parentPath,ExecutorService service) throws IOException{
MultiReader reader = null;
//open a reader for every child index directory and combine them into one MultiReader
try {
File[] files = new File(parentPath).listFiles();
IndexReader[] readers = new IndexReader[files.length];
for (int i = 0 ; i < files.length ; i ++) {
readers[i] = DirectoryReader.open(FSDirectory.open(Paths.get(files[i].getPath(), new String[0])));
}
reader = new MultiReader(readers);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return new IndexSearcher(reader,service);
}
/**Multi-directory, multi-threaded search
* @param parentPath parent directory of the index directories
* @param service executor service used for parallel search
* @return
* @throws IOException
*/
public static IndexSearcher getMultiSearcher(String parentPath,ExecutorService service) throws IOException{
File file = new File(parentPath);
File[] files = file.listFiles();
IndexReader[] readers = new IndexReader[files.length];
for (int i = 0 ; i < files.length ; i ++) {
readers[i] = DirectoryReader.open(FSDirectory.open(Paths.get(files[i].getPath(), new String[0])));
}
MultiReader multiReader = new MultiReader(readers);
IndexSearcher searcher = new IndexSearcher(multiReader,service);
return searcher;
}
/**Get an IndexReader for the given index path
* @param indexPath
* @return
* @throws IOException
*/
public static DirectoryReader getIndexReader(String indexPath) throws IOException{
return DirectoryReader.open(FSDirectory.open(Paths.get(indexPath, new String[0])));
}
/**Get an IndexSearcher for the given index path
* @param indexPath
* @param service
* @return
* @throws IOException
*/
public static IndexSearcher getIndexSearcherByIndexPath(String indexPath,ExecutorService service) throws IOException{
IndexReader reader = getIndexReader(indexPath);
return new IndexSearcher(reader,service);
}
/**If the index directory may change, use this method to obtain a fresh IndexSearcher; reopening only the changed parts consumes fewer resources
* @param oldSearcher
* @param service
* @return
* @throws IOException
*/
public static IndexSearcher getIndexSearcherOpenIfChanged(IndexSearcher oldSearcher,ExecutorService service) throws IOException{
DirectoryReader reader = (DirectoryReader) oldSearcher.getIndexReader();
DirectoryReader newReader = DirectoryReader.openIfChanged(reader);
if(newReader == null){//openIfChanged returns null when the index has not changed, so keep using the old searcher
return oldSearcher;
}
return new IndexSearcher(newReader, service);
}
/**Combine several queries, similar to SQL "in" (any clause may match)
* @param querys
* @return
*/
public static Query getMultiQueryLikeSqlIn(Query ... querys){
BooleanQuery query = new BooleanQuery();
for (Query subQuery : querys) {
query.add(subQuery,Occur.SHOULD);
}
return query;
}
/**Combine several queries, similar to SQL "and" (every clause must match)
* @param querys
* @return
*/
public static Query getMultiQueryLikeSqlAnd(Query ... querys){
BooleanQuery query = new BooleanQuery();
for (Query subQuery : querys) {
query.add(subQuery,Occur.MUST);
}
return query;
}
/**Build a query for the given field configuration
* @param field the field name
* @param fieldType the field type
* @param queryStr the query string
* @param range whether this is a range query
* @return
*/
public static Query getQuery(String field,String fieldType,String queryStr,boolean range){
Query q = null;
try {
if(queryStr != null && !"".equals(queryStr)){
if(range){
String[] strs = queryStr.split("\\|");
if("int".equals(fieldType)){
int min = new Integer(strs[0]);
int max = new Integer(strs[1]);
q = NumericRangeQuery.newIntRange(field, min, max, true, true);
}else if("double".equals(fieldType)){
Double min = new Double(strs[0]);
Double max = new Double(strs[1]);
q = NumericRangeQuery.newDoubleRange(field, min, max, true, true);
}else if("float".equals(fieldType)){
Float min = new Float(strs[0]);
Float max = new Float(strs[1]);
q = NumericRangeQuery.newFloatRange(field, min, max, true, true);
}else if("long".equals(fieldType)){
Long min = new Long(strs[0]);
Long max = new Long(strs[1]);
q = NumericRangeQuery.newLongRange(field, min, max, true, true);
}
}else{
if("int".equals(fieldType)){
q = NumericRangeQuery.newIntRange(field, new Integer(queryStr), new Integer(queryStr), true, true);
}else if("double".equals(fieldType)){
q = NumericRangeQuery.newDoubleRange(field, new Double(queryStr), new Double(queryStr), true, true);
}else if("float".equals(fieldType)){
q = NumericRangeQuery.newFloatRange(field, new Float(queryStr), new Float(queryStr), true, true);
}else{
Analyzer analyzer = new StandardAnalyzer();
q = new QueryParser(field, analyzer).parse(queryStr);
}
}
}else{
q= new MatchAllDocsQuery();
}
System.out.println(q);
} catch (ParseException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return q;
}
/**Build a TermQuery from a field name and value
* @param fieldName
* @param fieldValue
* @return
*/
public static Query getQuery(String fieldName,Object fieldValue){
Term term = new Term(fieldName, new BytesRef(fieldValue.toString()));
return new TermQuery(term);
}
/**Get the full document for the given IndexSearcher and docID
* @param searcher
* @param docID
* @return
* @throws IOException
*/
public static Document getDefaultFullDocument(IndexSearcher searcher,int docID) throws IOException{
return searcher.doc(docID);
}
/**Get the document for the given IndexSearcher and docID, loading only the fields in listField
* @param searcher
* @param docID
* @param listField
* @return
* @throws IOException
*/
public static Document getDocumentByListField(IndexSearcher searcher,int docID,Set<String> listField) throws IOException{
return searcher.doc(docID, listField);
}
/**Paged search
* @param page current page number
* @param perPage number of hits per page
* @param searcher the searcher
* @param query the query
* @return
* @throws IOException
*/
public static TopDocs getScoreDocsByPerPage(int page,int perPage,IndexSearcher searcher,Query query) throws IOException{
TopDocs result = null;
if(query == null){
System.out.println(" Query is null return null ");
return null;
}
ScoreDoc before = null;
if(page != 1){
TopDocs docsBefore = searcher.search(query, (page-1)*perPage);
ScoreDoc[] scoreDocs = docsBefore.scoreDocs;
if(scoreDocs.length > 0){
before = scoreDocs[scoreDocs.length - 1];
}
}
result = searcher.searchAfter(before, query, perPage);
return result;
}
public static TopDocs getScoreDocs(IndexSearcher searcher,Query query) throws IOException{
TopDocs docs = searcher.search(query, getMaxDocId(searcher));
return docs;
}
/**Highlight a field in the search results
* @param searcher
* @param field
* @param keyword
* @param preTag
* @param postTag
* @param fragmentSize
* @return
* @throws IOException
* @throws InvalidTokenOffsetsException
*/
public static String[] highlighter(IndexSearcher searcher,String field,String keyword,String preTag, String postTag,int fragmentSize) throws IOException, InvalidTokenOffsetsException{
Term term = new Term(field,new BytesRef(keyword));//use the field and keyword passed in rather than hard-coded values
TermQuery termQuery = new TermQuery(term);
TopDocs docs = getScoreDocs(searcher, termQuery);
ScoreDoc[] hits = docs.scoreDocs;
QueryScorer scorer = new QueryScorer(termQuery);
SimpleHTMLFormatter simpleHtmlFormatter = new SimpleHTMLFormatter(preTag,postTag);//set the highlight format; <B>keyword</B> is the default
Highlighter highlighter = new Highlighter(simpleHtmlFormatter,scorer);
highlighter.setTextFragmenter(new SimpleFragmenter(fragmentSize));//set the number of characters returned per fragment
Analyzer analyzer = new StandardAnalyzer();
String[] result = new String[hits.length];
for (int i = 0; i < result.length ; i++) {
Document doc = searcher.doc(hits[i].doc);
result[i] = highlighter.getBestFragment(analyzer, field, doc.get(field));
}
return result;
}
/**Count the number of documents; equivalent to searching with a MatchAllDocsQuery
* @param searcher
* @return
*/
public static int getMaxDocId(IndexSearcher searcher){
return searcher.getIndexReader().maxDoc();
}
}
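To round things off, here is a minimal sketch of how the highlighter utility method above might be called; the index path "index" and the field "content" follow the earlier examples and should be adjusted to your own index layout:
package com.lucene.search.util;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.lucene.search.IndexSearcher;
public class SearchUtilHighlightDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService service = Executors.newCachedThreadPool();
        try {
            //"index" is the parent directory holding the per-directory indexes built in the earlier posts
            IndexSearcher searcher = SearchUtil.getMultiSearcher("index", service);
            //highlight the keyword "lucene" in the "content" field, 20 characters per fragment
            String[] fragments = SearchUtil.highlighter(searcher, "content", "lucene", "<B>", "</B>", 20);
            for (String fragment : fragments) {
                System.out.println(fragment);
            }
        } finally {
            service.shutdown();
        }
    }
}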
Source code download
This "learn Lucene with me, step by step" series is a summary of my recent work on Lucene indexing. If you have any questions, contact me on QQ: 891922381, or join my new QQ group: 106570134 (lucene, solr, netty, hadoop) so we can discuss together. I will try to post every day; please keep following, there will be surprises.