Solr Official Documentation Series: Text Analysis

2019-03-27 01:21 | Source: Internet

Text fields are typically indexed by breaking the text into words and applying various transformations such as lowercasing, removing plurals, or stemming to increase relevancy. The same text transformations are normally applied to any queries in order to match what is indexed.
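As a rough illustration of why the same transformations are applied at index time and at query time, here is a minimal Python sketch. It is not Solr code: the `analyze`, `build_index`, and `search` functions and the sample documents are hypothetical, and the plural handling is a deliberately crude stand-in for a real stemmer.

```python
import re


def analyze(text):
    """Toy analysis chain: tokenize, lowercase, strip a trailing plural 's'."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]
    # Crude plural removal; real stemmers handle far more cases.
    return [
        t[:-1] if len(t) > 3 and t.endswith("s") and not t.endswith("ss") else t
        for t in tokens
    ]


def build_index(docs):
    """Map each analyzed term to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query):
    """Run the query through the same chain, then require every term to match."""
    terms = analyze(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))


docs = {1: "Rechargeable battery pack", 2: "USB cables and adapters"}
index = build_index(docs)
print(search(index, "CABLE"))
```

Because both sides pass through `analyze`, the query `CABLE` and the indexed word `cables` reduce to the same term. If the query skipped the chain, the uppercase form would find nothing.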

The schema defines the fields in the index and what type of analysis is applied to them. The current schema your collection is using may be viewed directly via the Schema tab in the Admin UI, or explored dynamically using the Schema Browser tab.

The best analysis components (tokenization and filtering) for your textual content depend heavily on language. As you can see in the Schema Browser, many of the fields in the example schema use a fieldType named text_general, which has defaults appropriate for most languages.

If you know your textual content is English, as is the case for the example documents in this tutorial, and you'd like to apply English-specific stemming and stop word removal, as well as split compound words, you can use the text_en_splitting fieldType instead. Go ahead and edit the schema.xml in the solr/example/solr/collection1/conf directory, to use the text_en_splitting fieldType for the text and features fields like so:

   <field name="features" type="text_en_splitting" indexed="true" stored="true" multiValued="true"/>
   <field name="text" type="text_en_splitting" indexed="true" stored="false" multiValued="true"/>

Stop and restart Solr after making these changes and then re-post all of the example documents using java -jar post.jar *.xml. Now queries like the ones listed below will demonstrate English-specific transformations:

  • A search for power-shot can match PowerShot, and adata can match A-DATA by using the WordDelimiterFilter and LowerCaseFilter.

  • A search for features:recharging can match Rechargeable using the stemming features of PorterStemFilter.

  • A search for "1 gigabyte" can match 1GB, and the commonly misspelled pixima can match Pixma using the SynonymFilter.
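The first bullet above can be approximated in a few lines of Python. This is only a sketch of the idea behind splitting on delimiters and case changes and then lowercasing. It is not the WordDelimiterFilter implementation, the function names are hypothetical, and it treats any shared term as a match, which is looser than what Solr does.

```python
import re


def split_word_parts(token):
    """Split on non-alphanumerics and on case or letter/digit changes."""
    parts = []
    for chunk in re.split(r"[^A-Za-z0-9]+", token):
        # Uppercase runs, capitalized or lowercase words, and digit runs.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", chunk))
    return parts


def analyze(token):
    """Return the lowercased parts plus their concatenation."""
    parts = [p.lower() for p in split_word_parts(token)]
    terms = set(parts)
    if len(parts) > 1:
        terms.add("".join(parts))  # e.g. power + shot -> powershot
    return terms


def matches(query, indexed):
    """Simplification: any term in common counts as a match."""
    return bool(analyze(query) & analyze(indexed))


print(matches("power-shot", "PowerShot"))
print(matches("adata", "A-DATA"))
```

Here `power-shot` and `PowerShot` both produce the terms `power`, `shot`, and `powershot`, while `A-DATA` produces `a`, `data`, and `adata`, so the unhyphenated query `adata` still finds it.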

A full description of the analysis components, Analyzers, Tokenizers, and TokenFilters available for use is here.

Analysis Debugging

There is a handy Analysis tab where you can see how a text value is broken down into words by both the index-time and query-time analysis chains for a field or field type. This page shows the resulting tokens after they pass through each filter in the chains.

This URL shows the tokens created from "Canon Power-Shot SD500" using the text_en_splitting type. Each section of the table shows the resulting tokens after having passed through the next TokenFilter in the (Index) analyzer. Notice how both powershot and power, shot are indexed, using tokens that have the same "position". (Compare the previous output with the tokens produced using the text_general field type.)
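Tokens can share a position because each token in a Lucene token stream carries a position increment, and an increment of 0 stacks a token on top of the previous one. The sketch below illustrates that bookkeeping. The token stream in it is hand-written for illustration and is not the actual output of text_en_splitting, whose exact position assignment may differ.

```python
from collections import defaultdict


def absolute_positions(stream):
    """Turn (term, position_increment) pairs into {position: [terms]}."""
    by_position = defaultdict(list)
    position = 0
    for term, increment in stream:
        position += increment  # an increment of 0 keeps the same position
        by_position[position].append(term)
    return dict(by_position)


# Illustrative stream only: "powershot" is stacked on "power" via increment 0.
stream = [("canon", 1), ("power", 1), ("powershot", 0), ("shot", 1)]
positions = absolute_positions(stream)
print(positions)
```

In this model `power` and `powershot` land on position 2, so a query for either form lines up with the same place in the original text.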

Mousing over the section label to the left of the section will display the full name of the analyzer component at that stage of the chain. Toggling the "Verbose Output" checkbox will show/hide the detailed token attributes.

When both Index and Query values are provided, two tables will be displayed side by side showing the results of each chain. Terms in the Index chain results that are equivalent to the final terms produced by the Query chain will be highlighted.
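The same kind of comparison can be requested over HTTP through Solr's field analysis request handler. The sketch below only builds the request URL. It assumes the handler is registered at `/analysis/field` and that the core is reachable at `http://localhost:8983/solr/collection1`, neither of which is stated in this tutorial, so check your solrconfig.xml before relying on it. The `analysis_url` helper is hypothetical.

```python
from urllib.parse import urlencode


def analysis_url(base_url, field_type, index_value, query_value=None):
    """Build a field-analysis request URL (assumed handler path: /analysis/field)."""
    params = {
        "analysis.fieldtype": field_type,
        "analysis.fieldvalue": index_value,  # text run through the index chain
        "wt": "json",
    }
    if query_value is not None:
        params["analysis.query"] = query_value  # text run through the query chain
        params["analysis.showmatch"] = "true"   # ask for matching terms to be flagged
    return base_url.rstrip("/") + "/analysis/field?" + urlencode(params)


# Assumed base URL for the example core; adjust to your installation.
url = analysis_url(
    "http://localhost:8983/solr/collection1",
    "text_en_splitting",
    "Canon Power-Shot SD500",
    "power-shot",
)
print(url)
```

Fetching that URL with any HTTP client should return the per-stage tokens for both chains, which is the information the Analysis tab renders as side-by-side tables.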

Other interesting examples:



