Technology, November 10, 2022

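PageRank ranks pages by:

  1. The in-degree of links (how many links point to a page)
  2. A “flow” model: the traffic flowing to a page indicates how much it is preferred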

So what is the traditional approach?

[IR] Ranking – top k

Each term has a weight (its importance) within a given doc.

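For the terms shared by the query and the doc, multiply each term's weight in the query by its weight in the doc (both after length normalization), then sum all the products. The result is the cosine similarity.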
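The computation above can be sketched as follows. This is a minimal illustration that assumes the query and the doc are stored as sparse `term -> weight` dictionaries; the function names and the sample weights are hypothetical, not taken from the original notes.

```python
import math


def normalize(weights):
    """Scale a sparse term->weight vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {term: w / norm for term, w in weights.items()}


def cosine_similarity(query_weights, doc_weights):
    """Sum the products of the normalized weights over the shared terms."""
    q = normalize(query_weights)
    d = normalize(doc_weights)
    shared_terms = q.keys() & d.keys()
    return sum(q[t] * d[t] for t in shared_terms)


# Hypothetical weights, for illustration only.
query = {"ranking": 1.0, "cosine": 1.0}
doc = {"ranking": 2.0, "heap": 1.0}
score = cosine_similarity(query, doc)
```

Only terms present in both dictionaries contribute, so a doc that shares no term with the query scores 0.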

Efficient cosine ranking

The traditional heap-based approach costs O(n log k).

Using Quickselect costs O(n + k log k) on average: “find top k” + “sort top k”.
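Both strategies can be sketched in a few lines. This is an illustrative sketch, not a tuned implementation: the function names are hypothetical, the input is assumed to be a plain list of cosine scores, and the Quickselect variant below copies sublists rather than partitioning in place, which keeps it short at the cost of extra memory.

```python
import heapq
import random


def top_k_heap(scores, k):
    """Scan once while keeping a min-heap of size k: O(n log k)."""
    if k <= 0:
        return []
    heap = []
    for s in scores:
        if len(heap) < k:
            heapq.heappush(heap, s)
        elif s > heap[0]:
            # The smallest kept score is beaten, so replace it.
            heapq.heapreplace(heap, s)
    return sorted(heap, reverse=True)


def top_k_quickselect(scores, k):
    """Find the k largest by partitioning (expected O(n)), then sort them."""
    items = list(scores)
    found = []
    while items and k > 0:
        pivot = random.choice(items)
        larger = [x for x in items if x > pivot]
        equal = [x for x in items if x == pivot]
        if k <= len(larger):
            # All k answers lie among the larger items; narrow the search.
            items = larger
        elif k <= len(larger) + len(equal):
            found += larger + equal[: k - len(larger)]
            k = 0
        else:
            # Everything >= pivot is an answer; the rest comes from below.
            found += larger + equal
            k -= len(larger) + len(equal)
            items = [x for x in items if x < pivot]
    return sorted(found, reverse=True)  # "sort top k": O(k log k)
```

When k is much smaller than n, the final sort is cheap in both versions, and the difference lies in the scan: log k work per element for the heap versus expected constant work per element for the partitioning.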

Threshold Methods

  Solution: 

An inexact approach is also possible: why insist on an exactly correct top k?

Index Elimination (heuristic function)

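  1. Skip query terms with low idf: they are very likely stop words.
  2. Only consider docs that contain several of the query terms. The risk is that fewer than k documents are returned.

For example, require docs to contain at least 3 of the 4 query terms.

In other words, deliberately restrict attention to the subset of docs that satisfy some hand-crafted conditions.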
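The two heuristics can be combined into a single candidate filter. The sketch below assumes a hypothetical in-memory index (`postings` maps a term to a set of doc ids, `idf` maps a term to its idf value); the thresholds `min_idf` and `min_match` are illustrative parameters, not values from the original notes.

```python
def eliminate_index(query_terms, postings, idf, min_idf, min_match):
    """Return the doc ids that survive both index-elimination heuristics."""
    # Heuristic 1: drop low-idf query terms (likely stop words).
    kept_terms = [t for t in query_terms if idf.get(t, 0.0) >= min_idf]

    # Heuristic 2: count how many kept terms each doc contains.
    match_counts = {}
    for term in kept_terms:
        for doc_id in postings.get(term, ()):
            match_counts[doc_id] = match_counts.get(doc_id, 0) + 1

    return {doc_id for doc_id, n in match_counts.items() if n >= min_match}


# Hypothetical toy index, for illustration only.
postings = {"a": {1, 2, 3}, "b": {1, 2}, "c": {1, 3}, "the": {1, 2, 3, 4}}
idf = {"a": 2.0, "b": 2.0, "c": 2.0, "the": 0.1}
candidates = eliminate_index(["a", "b", "c", "the"], postings, idf,
                             min_idf=1.0, min_match=3)
```

Cosine similarity is then computed only for `candidates`. If that set holds fewer than k docs, the caller has to relax `min_match` or accept a shorter result list, which is the risk noted above.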

Champion List

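Term 1: the R docs with the highest weights

Term 2: the R docs with the highest weights

Term 3: the R docs with the highest weights

Take the union of the results above to get the champion set, then compute cosine similarity only within that set.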
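A possible sketch of building and using champion lists follows. It assumes per-term weights are available as a hypothetical nested dictionary `term -> {doc_id: weight}`; the function names and the sample data are illustrative.

```python
import heapq


def build_champion_lists(term_doc_weights, r):
    """For each term, keep the r doc ids with the highest weight."""
    return {
        term: heapq.nlargest(r, doc_weights, key=doc_weights.get)
        for term, doc_weights in term_doc_weights.items()
    }


def champion_set(query_terms, champions):
    """Union of the champion lists of all query terms."""
    result = set()
    for term in query_terms:
        result.update(champions.get(term, ()))
    return result


# Hypothetical weights, for illustration only.
term_doc_weights = {
    "t1": {1: 0.9, 2: 0.5, 3: 0.1},
    "t2": {2: 0.8, 4: 0.7, 5: 0.2},
}
champions = build_champion_lists(term_doc_weights, r=2)
candidates = champion_set(["t1", "t2"], champions)
```

The champion lists can be built once at indexing time, so at query time only the union and the cosine scoring over `candidates` remain.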

Cluster Pruning Method

Can you propose some modification to this method such that it guarantees returning
the closest vector for any query? Describe your method and illustrate it with a small
example.

Step 1: Sort the leaders by their similarity to the query.
Step 2: In the high-dimensional space, check whether the query is surrounded by the top k leaders. The initial value of k is greater than 1.
Step 3: If the query is surrounded by the top k leaders, retrieve all the followers around those k leaders.
Step 4: If not, set k = k + 1 and go to Step 2.

Let’s illustrate it in 2D space.

[Figure: 2D illustration of query Q1 and the leaders A1, A2, A3, ...]

When k = 3, Q1 is not surrounded by the top 3 leaders (A1, A2, A3). When k = 4, Q1 is surrounded by the top 4 leaders. We retrieve all the followers around the top 4 leaders and get the result. In this case, the followers around the other leaders cannot be closer than this result. This guarantees returning the closest vector for any query.
This method depends on how we define “surround” in a high-dimensional space. Normally, at least d+1 points are needed to surround a point in d-dimensional space.

If Q1 (query terms: a, b, c) is surrounded by 4 leaders as follows,
Query (a, b, c)
leader 1: (A1, B1, C1)
leader 2: (A2, B2, C2)
leader 3: (A3, B3, C3)
leader 4: (A4, B4, C4)
a must be between min(A1, A2, A3, A4) and max(A1, A2, A3, A4).
b must be between min(B1, B2, B3, B4) and max(B1, B2, B3, B4).
c must be between min(C1, C2, C3, C4) and max(C1, C2, C3, C4).
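The per-coordinate condition above and the loop from Steps 1 to 4 can be sketched as follows. Two assumptions are made here that the notes do not fix: leaders are sorted by Euclidean distance to the query (a cosine-based ordering would work the same way), and "surrounded" means exactly the per-coordinate min/max test above. The function names are hypothetical.

```python
import math


def is_surrounded(query, leaders):
    """Per-coordinate test: each query coordinate lies between the
    minimum and maximum of that coordinate over the given leaders."""
    return all(
        min(leader[i] for leader in leaders)
        <= q_i
        <= max(leader[i] for leader in leaders)
        for i, q_i in enumerate(query)
    )


def leaders_to_search(query, leaders, k=2):
    """Grow k until the top k leaders surround the query (Steps 1 to 4)."""
    # Step 1: sort leaders, closest first (assumed: Euclidean distance).
    ranked = sorted(leaders, key=lambda leader: math.dist(query, leader))
    while k <= len(ranked):
        # Steps 2 and 3: stop as soon as the top k leaders surround the query.
        if is_surrounded(query, ranked[:k]):
            return ranked[:k]
        k += 1  # Step 4
    # Never surrounded: fall back to searching every leader.
    return ranked


# Hypothetical 2D leaders, for illustration only.
chosen = leaders_to_search((0, 0), [(1, 1), (2, 1), (-3, -3)])
```

The followers of the returned leaders are then scanned for the nearest vector. Note that the min/max test is only one possible definition of "surround", so this sketch illustrates the procedure rather than proving the guarantee.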
