BOW-Based Image Retrieval (Python)


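I. The BOW Principle

Bag of words (BOW), literally a "bag" of words, is a document representation widely used in information retrieval. A text is treated purely as a collection of words: word order, grammar and similar structure are ignored, and each word occurrence is assumed to be independent of the others. A descriptor for the text is then built from its word-frequency distribution.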

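For example, Text 1: Tom likes to eat cake, Jake likes too.

Text 2: Tom likes to eat fruit.

Build a dictionary from these two texts:

dictionary = {1: "Tom", 2: "likes", 3: "to", 4: "eat", 5: "cake", 6: "fruit", 7: "Jake", 8: "too"}

The dictionary contains 8 distinct words. Using each word's index in the dictionary, each of the two texts can be represented as an 8-dimensional vector whose entries count how many times the corresponding word occurs in the text.

Text 1: [1, 2, 1, 1, 1, 0, 1, 1]

Text 2: [1, 1, 1, 1, 0, 1, 0, 0]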

From these per-word counts (optionally weighted), we can build a word-frequency histogram for each text.
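The two count vectors above can be reproduced with a few lines of Python. This is only a minimal sketch: the helper name bow_vector and the simple letters-only tokenizer are assumptions made for illustration, not part of the original post.

```python
import re
from collections import Counter


def bow_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in text, ignoring word order."""
    # Hypothetical tokenizer: keep runs of letters, drop punctuation.
    counts = Counter(re.findall(r"[A-Za-z]+", text))
    return [counts[word] for word in vocabulary]


# Same word order as the dictionary in the text (indices 1..8).
vocabulary = ["Tom", "likes", "to", "eat", "cake", "fruit", "Jake", "too"]
text1 = "Tom likes to eat cake, Jake likes too."
text2 = "Tom likes to eat fruit ."

vec1 = bow_vector(text1, vocabulary)
vec2 = bow_vector(text2, vocabulary)
print(vec1)  # [1, 2, 1, 1, 1, 0, 1, 1]
print(vec2)  # [1, 1, 1, 1, 0, 1, 0, 0]
```

Because only counts are kept, any reordering of the words in a text yields the same vector.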

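II. Applying the Bag-of-Words Model to Image Retrieval

To represent an image, we can treat it as a document, that is, as a collection of "visual words". As with text, the visual words have no order.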

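1. Feature extraction

Unlike a text document, an image does not come with ready-made words, so we first have to extract mutually independent visual words from it. The first step in creating the visual vocabulary is therefore to extract feature descriptors.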

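SIFT is one of the most widely used algorithms for extracting locally invariant image features, so we use it for feature extraction here. For a detailed analysis of how SIFT features work, see the blog post: <>

The descriptors extracted from each image are then saved to a file and used to build the visual vocabulary.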

2. Learning the "visual vocabulary"

Clustering is the key to building the visual vocabulary (codebook). The most common clustering method is the K-means algorithm:

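(1) Randomly initialize K cluster centers.

(2) Assign each feature to a center (class) according to distance. The Euclidean distance can be used: d(x, c) = sqrt(sum_i (x_i - c_i)^2), where x is a feature vector and c is a cluster center.

(3) For each class, recompute the cluster center from the features assigned to it.

Repeat steps (2) and (3) until the algorithm converges.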
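The three steps can be sketched directly in NumPy. This is an illustrative toy version, not the clustering routine used by PCV: the function names, the fixed random seed, the choice of initial centers from the data, and the 2-D test data are all assumptions.

```python
import numpy as np


def assign(features, centers):
    """Step (2): index of the nearest center (Euclidean distance) for each feature."""
    dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(features, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step (1): pick k distinct features at random as the initial centers.
    centers = features[rng.choice(len(features), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = assign(features, centers)
        # Step (3): recompute each center; keep the old one if its cluster is empty.
        new_centers = np.array([
            features[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return centers, assign(features, centers)


# Toy data: two well-separated 2-D blobs (real SIFT descriptors are 128-D).
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(50, 2))
features = np.vstack([blob_a, blob_b])

centers, labels = kmeans(features, k=2)
print(centers)
```

In the image setting, the rows of features would be SIFT descriptors pooled from all training images, and the returned centers would be the visual words.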

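3. Quantizing the input features against the visual vocabulary

To quantize an input feature, map it to the nearest visual word and increment that word's count. The size of the visual vocabulary has to be chosen with care. If it is too small, the visual words cannot cover all the cases that may occur. If it is too large, the computation becomes expensive and the model tends to overfit. In practice, a suitable vocabulary size is found by repeated testing.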
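As a rough sketch of this quantization step, the snippet below maps each descriptor to its nearest visual word and counts the hits. The function names and the tiny 2-D vocabulary are made up for illustration; in the PCV code this role is played by voc.project().

```python
import numpy as np


def quantize(descriptors, vocabulary):
    """Return, for each descriptor, the index of the nearest visual word."""
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    return dists.argmin(axis=1)


def word_counts(descriptors, vocabulary):
    """Count how many descriptors fall on each visual word."""
    words = quantize(descriptors, vocabulary)
    return np.bincount(words, minlength=len(vocabulary))


# Hypothetical vocabulary of three 2-D "visual words" and four descriptors.
vocabulary = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
descriptors = np.array([[1.0, 1.0], [9.0, 0.5], [8.0, -1.0], [0.5, 9.0]])

words = quantize(descriptors, vocabulary)
counts = word_counts(descriptors, vocabulary)
print(words)   # [0 1 1 2]
print(counts)  # [1 2 1]
```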

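4. Converting the input image into a frequency histogram of visual words

SIFT extracts many feature points from each image. Counting how many times each visual word of the vocabulary occurs in the image gives a histogram with one bin per visual word (for example, four bins if the vocabulary contains only four visual words).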
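For the four-word case, such a histogram might look as follows. The word indices are invented for illustration and do not come from a real image.

```python
import numpy as np

# Hypothetical visual-word index of each feature found in one image.
word_ids = np.array([0, 2, 2, 1, 2, 3, 0, 2])

# One bin per visual word; minlength keeps empty bins for unused words.
hist = np.bincount(word_ids, minlength=4)

# Plain-text rendering of the histogram.
for word, count in enumerate(hist):
    print(f"word {word}: {'#' * int(count)} ({int(count)})")
```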

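5. Building an inverted index from features to images for fast lookup of related images

Image retrieval is performed with the k-nearest-neighbor (KNN) algorithm: given the BOW histogram of the input image, find its k nearest neighbors in the database. For image classification, the class can be decided by a vote over the labels of these k neighbors.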
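A brute-force version of this nearest-neighbor search and vote can be sketched as below. The histograms, labels and function names are hypothetical, and Euclidean distance is only one possible choice of histogram distance.

```python
from collections import Counter

import numpy as np


def knn_search(query_hist, db_hists, k):
    """Return indices and distances of the k database histograms closest to the query."""
    dists = np.linalg.norm(db_hists - query_hist, axis=1)
    order = np.argsort(dists)[:k]
    return order, dists[order]


def knn_classify(query_hist, db_hists, labels, k):
    """Majority vote over the labels of the k nearest neighbors."""
    idx, _ = knn_search(query_hist, db_hists, k)
    votes = Counter(labels[i] for i in idx)
    return votes.most_common(1)[0][0]


# Hypothetical database of five 4-bin BOW histograms with class labels.
db_hists = np.array([
    [5, 0, 1, 0],
    [4, 1, 0, 0],
    [0, 3, 0, 4],
    [1, 4, 0, 5],
    [0, 0, 6, 1],
], dtype=float)
labels = ["cat", "cat", "dog", "dog", "bird"]
query = np.array([5, 1, 0, 0], dtype=float)

idx, dists = knn_search(query, db_hists, k=3)
label = knn_classify(query, db_hists, labels, k=3)
print(idx, label)  # [1 0 2] cat
```

Comparing the query against every database histogram is linear in the database size, which is the cost an inverted index is meant to reduce.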

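6. Histogram matching based on the index results

Using the index, we can find all images that contain a particular word. To obtain candidate images that contain several words, there are two approaches.

(1) Loop over every word, fetch all images containing that word, and merge the lists. Then count how many times each image id appears in the merged list and sort by that count; the images at the top of the list are the best matches.

(2) To avoid looping over all the words, sort the words by their inverse document frequency (IDF) weight and loop only over those with the highest weights, which reduces computation and improves efficiency.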
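Approach (1) can be sketched with a plain dictionary as the inverted index. This is a simplified in-memory stand-in for the SQLite tables used later; the image names, word ids and function names are assumptions.

```python
from collections import Counter, defaultdict


def build_inverted_index(image_words):
    """Map each visual word to the set of images that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for w in words:
            index[w].add(image_id)
    return index


def candidates(index, query_words):
    """Rank images by how many of the query's words they share."""
    hits = Counter()
    for w in set(query_words):
        hits.update(index.get(w, ()))
    # Most shared words first; ties broken by image id for a stable order.
    return sorted(hits, key=lambda image_id: (-hits[image_id], image_id))


# Hypothetical images, each described by the visual words it contains.
image_words = {
    "a.jpg": [1, 2, 3],
    "b.jpg": [2, 3, 4],
    "c.jpg": [5],
}
index = build_inverted_index(image_words)
result = candidates(index, [2, 3, 4])
print(result)  # ['b.jpg', 'a.jpg']
```

Approach (2) would only change the loop in candidates(): instead of iterating over all query words, it would iterate over the few query words with the highest IDF weight.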

TF-IDF weights of words
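In the usual definition, the term frequency of word w in image d is tf = n_w / sum_j n_j (the count of w divided by the total number of words in d), and the inverse document frequency is idf = log(N / df_w), where N is the number of images and df_w is the number of images containing w. The weight is the product tf * idf. The sketch below assumes this standard form and that every image has at least one feature; the function name and data are illustrative.

```python
import numpy as np


def tf_idf(histograms):
    """Return TF-IDF weights (one row per image) and the IDF vector (one value per word)."""
    histograms = np.asarray(histograms, dtype=float)
    n_images = histograms.shape[0]
    # Term frequency: each row normalized by the image's total word count.
    tf = histograms / histograms.sum(axis=1, keepdims=True)
    # Document frequency: number of images in which each word occurs.
    df = (histograms > 0).sum(axis=0)
    # Guard against words that occur in no image at all.
    idf = np.log(n_images / np.maximum(df, 1))
    return tf * idf, idf


# Hypothetical word counts for three images over a three-word vocabulary.
histograms = [
    [2, 1, 0],
    [1, 0, 1],
    [3, 0, 0],
]
weights, idf = tf_idf(histograms)
print(idf)
print(weights)
```

Word 0 occurs in every image, so its IDF is log(3/3) = 0 and it contributes nothing to the match, while the rarer words 1 and 2 get weight log(3).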

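III. Code Walkthrough

1. Generating the vocabulary

SIFT is used to extract the feature descriptors, and the vocabulary is then saved. The Vocabulary class holds the word cluster centers voc together with a vector of inverse document frequency values, one per word. To train the vocabulary on a set of images, the train() method takes a list of descriptor files with the .sift suffix and the number of vocabulary words k. In the code below, the vocabulary has 1000 words, and the third argument, 10, is the subsampling factor: only every tenth descriptor is used as training data in the K-means clustering step.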


# -*- coding: utf-8 -*-
import pickle
from PCV.imagesearch import vocabulary
from PCV.tools.imtools import get_imlist
from PCV.localdescriptors import sift

# Get the image list
imlist = get_imlist('first1000/')
nbr_images = len(imlist)
# Get the feature file list (same names with a .sift suffix)
featlist = [imlist[i][:-3]+'sift' for i in range(nbr_images)]

# Extract SIFT features for every image in the folder
for i in range(nbr_images):
    sift.process_image(imlist[i], featlist[i])

# Build the vocabulary
voc = vocabulary.Vocabulary('ukbenchtest')
voc.train(featlist, 1000, 10)

# Save the vocabulary
with open('first1000/vocabulary.pkl', 'wb') as f:
    pickle.dump(voc, f)
print('vocabulary is:', voc.name, voc.nbr_words)

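2. Image indexing

Before we start, we need to create the tables, the indexes and the indexer (the Indexer class) so that image data can be written to the database. Once the database tables exist, we can add images to the index.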


# -*- coding: utf-8 -*-
import pickle
from PCV.imagesearch import imagesearch
from PCV.localdescriptors import sift
from sqlite3 import dbapi2 as sqlite
from PCV.tools.imtools import get_imlist

# Get the image list
imlist = get_imlist('first1000/')
nbr_images = len(imlist)
# Get the feature file list
featlist = [imlist[i][:-3]+'sift' for i in range(nbr_images)]

# Load the vocabulary
with open('first1000/vocabulary.pkl', 'rb') as f:
    voc = pickle.load(f)

# Create the indexer and the database tables
indx = imagesearch.Indexer('testImaAdd.db', voc)
indx.create_tables()

# Go through all images, project their features onto the vocabulary and insert them
for i in range(nbr_images)[:1000]:
    locs, descr = sift.read_features_from_file(featlist[i])
    indx.add_to_index(imlist[i], descr)

# Commit to the database
indx.db_commit()

# Check the contents of the database
con = sqlite.connect('testImaAdd.db')
print(con.execute('select count (filename) from imlist').fetchone())
print(con.execute('select * from imlist').fetchone())

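3. Searching the database for images

Once the index is built, we can search the database for similar images. The quality of the results is evaluated by counting how many of the top 4 results are correct matches: a score of 4 is ideal, and a score of 0 means none of them is correct.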


# -*- coding: utf-8 -*-
import pickle
from PCV.imagesearch import imagesearch
from PCV.localdescriptors import sift
from sqlite3 import dbapi2 as sqlite
from PCV.tools.imtools import get_imlist

# Get the image list
imlist = get_imlist('first1000/')
nbr_images = len(imlist)
# Get the feature file list
featlist = [imlist[i][:-3]+'sift' for i in range(nbr_images)]

# Load the vocabulary
with open('first1000/vocabulary.pkl', 'rb') as f:
    voc = pickle.load(f)

src = imagesearch.Searcher('testImaAdd.db', voc)

# Project the descriptors of the first image onto the vocabulary
locs, descr = sift.read_features_from_file(featlist[0])
iw = voc.project(descr)

print('ask using a histogram...')
print(src.candidates_from_histogram(iw)[:10])

print('try a query...')
nbr_results = 10
# Each query result is a (distance, image id) pair; keep only the ids
res = [w[1] for w in src.query(imlist[68])[:nbr_results]]
imagesearch.plot_results(src, res)

print('Search result score:')
print(imagesearch.compute_ukbench_score(src, imlist[:10]))
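To make the 0-to-4 score concrete, here is a standalone sketch of the idea. It is not the PCV implementation: the function name and the rankings are invented, and it assumes the ukbench convention that images are ordered so that each consecutive group of four shows the same object.

```python
def ukbench_score(rankings):
    """Average number of top-4 results that belong to the query's own group of four.

    rankings[i] is the ranked list of 0-based image indices returned for query image i.
    Assumption: images 0-3 show one object, images 4-7 the next, and so on.
    """
    total = 0
    for i, ranked in enumerate(rankings):
        total += sum(1 for j in ranked[:4] if j // 4 == i // 4)
    return total / len(rankings)


# Hypothetical rankings for the four query images of group 0.
rankings = [
    [0, 1, 2, 3],  # all four results correct -> 4
    [1, 0, 7, 2],  # three correct            -> 3
    [2, 5, 6, 9],  # only the query itself    -> 1
    [3, 2, 1, 8],  # three correct            -> 3
]
score = ukbench_score(rankings)
print(score)  # 2.75
```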

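4. Building a demo and web application

To build the demo we use the CherryPy package. CherryPy is a lightweight web server written in pure Python that uses an object-oriented model.

After configuring the server, we initialize the page with some HTML tags and load the data with pickle. We also need the vocabulary for the Searcher object that interacts with the database. As you can see, this simple demo consists of a single class with an initialization method __init__() and an "index" page method index(). Methods are automatically mapped to URLs, and arguments to the methods can be passed directly in the URL.

The last line of the code starts the CherryPy web server using the service.conf configuration file, shown below. The first part specifies the IP address and port to use, and the second part sets the directory where the images are stored.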


[global]
server.socket_host = "127.0.0.1"
server.socket_port = 8080
server.thread_pool = 50
tools.sessions.on = True

[/]
tools.staticdir.root = "F:\ ch07"
tools.staticdir.on = True
tools.staticdir.dir = ""


# -*- coding: utf-8 -*-
"""
This is the image search demo in Section 7.6.
"""
import os
import pickle
import random

import cherrypy
from PCV.imagesearch import imagesearch


class SearchDemo:

    def __init__(self):
        # Load the image list
        self.path = 'first1000/'
        self.imlist = [os.path.join(self.path, f) for f in os.listdir(self.path) if f.endswith('.jpg')]
        self.nbr_images = len(self.imlist)
        print(self.imlist)
        print(self.nbr_images)
        self.ndx = list(range(self.nbr_images))
        print(self.ndx)

        # Load the vocabulary
        with open('first1000/vocabulary.pkl', 'rb') as f:
            self.voc = pickle.load(f)

        # Maximum number of search results to display
        self.maxres = 10

        # Header and footer HTML
        self.header = """
            <!doctype html>
            <html>
            <head>
            <title>Image search</title>
            </head>
            <body>
            """
        self.footer = """
            </body>
            </html>
            """

    def index(self, query=None):
        self.src = imagesearch.Searcher('testImaAdd.db', self.voc)

        html = self.header
        html += """
            <br />
            Click an image to search. <a href='?query='> Random selection </a> of images.
            <br /><br />
            """
        if query:
            # Query the database and get the top images
            res = self.src.query(query)[:self.maxres]
            for dist, ndx in res:
                imname = self.src.get_filename(ndx)
                html += "<a href='?query="+imname+"'>"
                html += "<img src='"+imname+"' alt='"+imname+"' width='100' height='100'/>"
                print(imname+"################")
                html += "</a>"
        else:
            # Show a random selection of images if there is no query
            random.shuffle(self.ndx)
            for i in self.ndx[:self.maxres]:
                imname = self.imlist[i]
                html += "<a href='?query="+imname+"'>"
                html += "<img src='"+imname+"' alt='"+imname+"' width='100' height='100'/>"
                print(imname+"################")
                html += "</a>"

        html += self.footer
        return html

    # Expose index() so that CherryPy maps it to a URL
    index.exposed = True


# Start the server using the settings in service.conf next to this script
cherrypy.quickstart(SearchDemo(), '/', config=os.path.join(os.path.dirname(os.path.abspath(__file__)), 'service.conf'))