冯龙宇's Tech Column · NLP and ML Coder

Text Similarity

2019-12-18 · fly · NLP

Text similarity in NLP (phrases, words, sentences)

Preface

In natural language processing, we constantly need to compute the similarity between two pieces of text, for example to improve the recall of an information-retrieval system, or to score a query against the candidate set in a question-answering system.

Text similarity methods

  • Statistics-based methods: mostly unsupervised learning, applied to larger-grained text such as sentences and paragraphs.

  • Semantics-based methods: mostly supervised deep learning, applied to smaller-grained text such as words or sentences.

Article structure

  1. Several common similarity-comparison methods
  • Based on word vectors

  • Based on characters

  • Based on probability and statistics

  • Based on word-embedding models

  2. Several representative state-of-the-art methods

Part 1

  1. Levenshtein distance (edit distance): the minimum number of single-character edits needed to turn one string into the other. For example, 度假自由行 and 自由行度假 mean exactly the same thing in Chinese, yet their edit distance is 4 out of a maximum of 5, which suggests very low similarity.

  2. Euclidean and cosine distance: distance measures defined over a vector space. Cosine similarity is the dot product of the two vectors divided by the product of their norms.
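A minimal sketch of both measures: a pure-Python Levenshtein DP plus cosine similarity (NumPy), using the example pair from point 1. In practice the vectors passed to `cosine` would come from whatever embedding is chosen:

```python
import numpy as np

def levenshtein(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance (insert / delete / substitute)."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two dense vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(levenshtein("度假自由行", "自由行度假"))  # 4: identical meaning, large edit distance
```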

Comparison with BERT-based similarity: BERT embeddings measured with cosine similarity

  1. Using BERT to compute the similarity between words or phrases.

  • Vector similarity of 中国足协 and 中国足球: 0.94828343
  • Vector similarity of 中国足协 and 中国篮协: 0.9531492
  • Similarity of 自由行度假 and 度假自由行: 0.97110826
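The post does not say which toolkit produced these scores; a minimal sketch, assuming a bert-as-service server (bert-serving-server) is running with a Chinese BERT checkpoint, could look like this:

```python
import numpy as np
from bert_serving.client import BertClient  # assumes bert-serving-server is already running

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

bc = BertClient()
pairs = [("中国足协", "中国足球"), ("中国足协", "中国篮协"), ("自由行度假", "度假自由行")]
for s1, s2 in pairs:
    v1, v2 = bc.encode([s1, s2])   # one fixed-size vector per phrase
    print(s1, s2, cos_sim(v1, v2))
```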

  2. Clustering results using BERT embeddings.

The decimal number at the start of each block below is the similarity threshold used for clustering.
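The post does not name the clustering algorithm; a minimal sketch, assuming average-linkage hierarchical clustering over cosine distance, with the similarity threshold converted into a distance cut (SciPy), and reusing the bert-as-service encoder assumed above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from bert_serving.client import BertClient  # assumption: same encoder as in the sketch above

words = ["月光", "月色", "夕阳", "高楼", "楼阁", "黄河", "长江"]  # illustrative subset
vecs = np.asarray(BertClient().encode(words))

threshold = 0.92                                    # similarity threshold reported below
Z = linkage(vecs, method="average", metric="cosine")
labels = fcluster(Z, t=1.0 - threshold, criterion="distance")  # cosine distance = 1 - similarity

clusters = {}
for w, lab in zip(words, labels):
    clusters.setdefault(lab, []).append(w)
print(list(clusters.values()))
```

The clusters obtained at thresholds 0.92, 0.9, and 0.85 follow.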

0.92 [‘月光’, ‘月色’, ‘夕阳’, ‘大海’, ‘春色’, ‘落花’]

[‘斜阳’]

[‘空中’, ‘家中’]

[‘高楼’, ‘楼阁’, ‘楼台’, ‘高山’, ‘山上’, ‘山峦’]

[‘小路’, ‘小船’]

[‘城外’, ‘宫中’, ‘渡口’, ‘江上’]

[‘门前’, ‘窗外’, ‘江边’, ‘头上’, ‘地上’, ‘枝头’]

[‘人家’]

[‘黄河’, ‘长江’]

[‘江面’, ‘水面’]

[‘山水’]

[‘山中’, ‘山谷’, ‘原野’]

[‘山林’, ‘野草’, ‘树林’]

[‘野外’]

[‘春光’, ‘花儿’, ‘芳草’]

[‘浮云’, ‘云雾’]

[‘东风’]

[‘狂风’, ‘大雪’]

[‘露珠’]

[‘美酒’]

[‘栏杆’, ‘窗户’]

[‘车马’]

[‘衣裳’, ‘衣衫’, ‘羽毛’]

[‘头发’]

[‘琵琶’]

[‘花朵’, ‘鲜花’, ‘花丛’, ‘树枝’, ‘蝴蝶’]

[‘草木’]

[‘树木’, ‘花草’]

[‘大雁’, ‘骏马’, ‘乌鸦’]

[‘鱼儿’]

[‘美女’, ‘美人’]

[‘妻子’, ‘儿子’]

[‘父母’]

[‘将军’]

[‘皇上’]

[‘战士’, ‘士兵’]

[‘军队’, ‘战争’]

[‘歌舞’, ‘歌唱’]

[‘饮酒’]

[‘弹奏’]

0.9 [‘月光’, ‘月色’, ‘夕阳’, ‘原野’, ‘大海’, ‘春色’, ‘云雾’, ‘微风’, ‘露珠’, ‘影子’, ‘羽毛’, ‘花朵’, ‘落花’, ‘芳草’, ‘大雁’, ‘鱼儿’, ‘蝴蝶’, ‘眼泪’]

[‘斜阳’, ‘落日’]

[‘银河’, ‘黄河’, ‘江上’]

[‘楼阁’, ‘高楼’, ‘楼台’, ‘台阶’]

[‘道路’]

[‘京城’, ‘城外’, ‘宫中’]

[‘房屋’]

[‘窗外’, ‘空中’, ‘家中’, ‘江边’, ‘头上’, ‘地上’, ‘衣襟’, ‘枝头’]

[‘水边’, ‘门前’]

[‘春水’, ‘春光’]

[‘泉水’, ‘渡口’, ‘水面’, ‘河水’]

[‘山谷’, ‘高山’, ‘山峰’, ‘山峦’, ‘沙洲’, ‘野草’, ‘树林’, ‘树枝’]

[‘细雨’, ‘大雪’, ‘花儿’]

[‘烟雾’, ‘狂风’, ‘鲜花’]

[‘寒风’]

[‘美酒’, ‘美女’]

[‘栏杆’, ‘窗户’]

[‘车马’, ‘山水’, ‘衣裳’, ‘草木’, ‘骏马’]

[‘黄金’]

[‘红花’, ‘花丛’, ‘荷叶’]

[‘枝条’, ‘花草’]

[‘鸳鸯’, ‘浮云’, ‘美人’]

[‘儿子’, ‘妻子’]

[‘亲人’]

[‘行人’]

[‘天子’]

[‘士兵’, ‘小船’, ‘战士’, ‘军队’, ‘战争’]

[‘歌声’, ‘歌唱’]

[‘饮酒’]

[‘弹奏’, ‘琵琶’]

0.85 [‘月光’, ‘月色’, ‘斜阳’, ‘太阳’, ‘银河’, ‘楼台’, ‘城外’, ‘宫中’, ‘渡口’, ‘水面’, ‘江面’, ‘春水’, ‘山水’, ‘高山’, ‘山上’, ‘山林’, ‘沙洲’, ‘春光’, ‘细雨’, ‘烟雾’, ‘东风’, ‘狂风’, ‘露珠’, ‘积雪’, ‘酒杯’, ‘窗户’, ‘车马’, ‘衣裳’, ‘影子’, ‘黄金’, ‘羽毛’, ‘花朵’, ‘落花’, ‘花丛’, ‘芳草’, ‘树林’, ‘树枝’, ‘荷叶’, ‘大雁’, ‘骏马’, ‘乌鸦’, ‘美女’, ‘士兵’, ‘歌声’, ‘眼泪’, ‘漫步’]

[‘夕阳’, ‘落日’, ‘小路’, ‘黄河’, ‘江上’, ‘山中’, ‘山谷’, ‘原野’, ‘大海’, ‘浮云’, ‘微风’, ‘大雪’, ‘小船’, ‘琵琶’, ‘红花’, ‘野草’, ‘梧桐’, ‘鱼儿’, ‘美人’, ‘战士’]

[‘楼阁’, ‘高楼’, ‘房屋’, ‘泉水’, ‘山峦’, ‘栏杆’, ‘鲜花’, ‘花草’, ‘枝条’, ‘客人’]

[‘京城’, ‘空中’, ‘家中’, ‘边塞’, ‘春色’, ‘军队’, ‘歌唱’, ‘游玩’, ‘战争’]

[‘窗外’, ‘门前’, ‘江边’, ‘头上’, ‘地上’, ‘寒风’, ‘衣襟’, ‘枝头’]

[‘河水’, ‘长江’]

[‘云雾’, ‘美酒’, ‘衣衫’, ‘花儿’, ‘鸳鸯’]

[‘头发’, ‘水边’]

[‘蝴蝶’, ‘草木’]

[‘妻子’, ‘儿子’, ‘父母’]

[‘将军’, ‘天子’, ‘太守’]

[‘匈奴’]

[‘饮酒’]

Summary:

1. Clearly, the similarity of 自由行度假 and 度假自由行 is higher than that of the two 中国足协 pairs, showing that a BERT-based measure handles reordered but semantically identical phrases far better than edit distance does.

2. The quality of the clustering is decided by the similarity metric. To get a better word-vector space, the whole BERT model should be fine-tuned on a metric task (a sketch follows this summary).

3. As in BM25 + BERT pipelines for text search, BERT can serve as a general-purpose initialization of the embedding space. The overall system should then be fine-tuned, even deliberately over-fitted, to the concrete task scenario to build domain-specific vector representations. For a long time to come this will remain the main direction of exploration for the major NLU tasks, and it will keep topping the SOTA leaderboards.
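Point 2 above calls for fine-tuning the whole encoder on a metric task; a minimal sketch of one way to do that with the sentence-transformers library (the checkpoint name and the toy labelled pairs are placeholders, not from the original post):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder checkpoint; any Chinese BERT-style encoder would do.
model = SentenceTransformer("bert-base-chinese")

# Toy labelled pairs: the label is the target cosine similarity.
train_examples = [
    InputExample(texts=["度假自由行", "自由行度假"], label=1.0),
    InputExample(texts=["中国足协", "中国篮协"], label=0.3),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)

# Fine-tune the whole encoder so that cosine similarity matches the labels.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```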

  3. N-gram distance (the core idea behind BLEU). Let G_N(S) be the set of N-grams of string S; then Distance(S, T) = |G_N(S)| + |G_N(T)| - 2·|G_N(S) ∩ G_N(T)|. Two identical strings have distance 0. (A code sketch follows this list.)
  • Vector similarity of 中国足协 and 中国足球: 0.94828343
  • Vector similarity of 中国足协 and 中国篮协: 0.9531492
  • Vector similarity of 中国足协 and 足协中国: 0.9493892
  • Vector similarity of 美丽中国 and 中国美丽: 0.95919204
  • Similarity of 自由行度假 and 度假自由行: 0.97110826

  Compared with the N-gram distance, this representation brings in much more semantic information about the words.
  4. BM25. Okapi BM25 (BM stands for Best Matching) is the ranking function search engines use to rank matching documents by their relevance to a given search query. It comes from the probabilistic retrieval framework developed by Stephen E. Robertson, Karen Spärck Jones, and others. Given a query, it scores each candidate document D.

  5. WMD. An extension of EMD (Earth Mover's Distance) to NLP; EMD is a class of transportation problems. WMD (Word Mover's Distance), proposed in 2015, measures the distance between two documents: the smaller the WMD, the more similar they are.
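A minimal sketch of the N-gram distance defined in point 3 and a bare-bones Okapi BM25 scorer (k1 and b at their usual defaults; the tokenized corpus is only a placeholder):

```python
import math
from collections import Counter

def ngrams(s, n=2):
    """Set of character n-grams of a string."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_distance(s, t, n=2):
    # |G_N(S)| + |G_N(T)| - 2*|G_N(S) ∩ G_N(T)|; identical strings give 0
    gs, gt = ngrams(s, n), ngrams(t, n)
    return len(gs) + len(gt) - 2 * len(gs & gt)

def bm25_score(query_tokens, doc_tokens, corpus, k1=1.5, b=0.75):
    """Score one document against a query with Okapi BM25 (non-negative idf variant)."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_tokens:
        df = sum(1 for d in corpus if term in d)
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        f = tf[term]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_tokens) / avgdl))
    return score

corpus = [["度假", "自由行"], ["中国", "足协"], ["中国", "篮协"]]
print(ngram_distance("度假自由行", "自由行度假"))   # 2 with bigrams
print(bm25_score(["中国", "足协"], corpus[1], corpus))
```

For WMD there is no need to implement the transport problem by hand: gensim's KeyedVectors provides a wmdistance method built on pretrained word vectors.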

Doc2Vec and SimHash: how they work

The champion solution of the 2016 Kaggle text-similarity competition, before BERT appeared

This solution ensembles many traditional feature-extraction schemes and uses GloVe, the best pre-BERT word-vector representation, which combines matrix factorization with a word2vec-style approach to generating word vectors.

Some of the main ideas from the solution

  1. Features: three kinds of features: (1) word embeddings; (2) sentence embeddings (Doc2Vec, Sent2Vec); (3) an encoding of the question pair taken from the dense layer of an ESIM model trained on SNLI
  2. Classical text-mining features
    • Similarity measures on LDA and LSI embeddings
    • Similarity measures on bags of character n-grams (TF-IDF reweighted or not) from 1-gram to 8-gram (see the sketch after this list)
    • Abhishek's and owl's kindly shared features
    • Edit and sequence-matching distances; percentage of common tokens up to 1, 2, ..., 6 when the questions end the same or start the same

    • Length of the questions, difference of lengths
    • Number of capital letters, question marks, etc.
    • Indicators for question 1/2 starting with "are", "can", "how", etc., and all the corresponding feature-engineering combinations
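A minimal sketch of the character n-gram TF-IDF similarity feature from the list above (scikit-learn; the question pair is only illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

q1 = "How can I learn machine learning?"
q2 = "What is the best way to learn machine learning?"

# Character n-grams from 1 to 8, TF-IDF reweighted, as in the feature list above.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 8))
X = vec.fit_transform([q1, q2])
print(cosine_similarity(X[0], X[1])[0, 0])
```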

Tokenizer: Stanford CoreNLP for tokenization

POS tagger and NER: used to preprocess the text input for some of the deep learning models

Structural features (i.e., from the graph)

Extracting features from the graph

How to build the graph and how to extract features from it (a small sketch follows the list):

  1. We built density features from the graph whose edges are the question pairs in the concatenated train and test datasets.

  2. We had counts of the neighbors of question 1 and question 2, the min, the max, intersections, unions, and the shortest path length when the main edge is cut.

  3. We went further and built density features counting the neighbors of the questions' neighbors, and the neighbors of those neighbors (inception).

  4. We also counted higher-order neighbors that were also lower-order neighbors (loops).

  5. We tried different graph structures: we built undirected and directed graphs (edges directed from question 1 to question 2), and we also tried to separate the density features of question 1 from those of question 2 to generate non-commutative features in addition to commutative ones.

  6. We built features describing the connected subgraph the pair belongs to: number of edges, number of nodes, percentage of edges in train.

  7. We also computed the same features on subgraphs built only from the edges of questions that both appear more than once. The idea was to remove fake questions, which we thought were damaging the graph features by changing the graph's structure.

  8. Finally, we added weights to the graph, using one of our similarity features as the edge weight.
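A minimal sketch of the neighbor-count density features described above, assuming the question pairs sit in a pandas DataFrame with question1/question2 columns (the column names and toy pairs are placeholders):

```python
import networkx as nx
import pandas as pd

# Hypothetical concatenation of the train and test question pairs.
pairs = pd.DataFrame({
    "question1": ["q1", "q1", "q2", "q3"],
    "question2": ["q2", "q3", "q3", "q4"],
})

# Undirected graph whose edges are the question pairs.
G = nx.Graph()
G.add_edges_from(zip(pairs["question1"], pairs["question2"]))

def density_features(row):
    n1 = set(G.neighbors(row["question1"]))
    n2 = set(G.neighbors(row["question2"]))
    return pd.Series({
        "deg_q1": len(n1),
        "deg_q2": len(n2),
        "deg_min": min(len(n1), len(n2)),
        "deg_max": max(len(n1), len(n2)),
        "common_neighbors": len(n1 & n2),
        "union_neighbors": len(n1 | n2),
    })

features = pairs.join(pairs.apply(density_features, axis=1))
print(features)
```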

Models

  1. We worked on two main architectures for our NNets: Siamese and attention neural networks.

  2. Siamese LSTM with pretrained GloVe embeddings (a Keras sketch follows this list).

  3. Decomposable attention with pretrained FastText embeddings. This model achieves ~0.3 on CV.

  4. ESIM with pretrained FastText embeddings. This is our best pure deep-learning NLP model; it achieves ~0.27 on CV. However, it takes too long to run, so we only add it once in the first stacking layer.

  5. We noticed that complex DL models work well on the first stacking layer but do not beat a simple MLP on the second layer.
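A minimal sketch of the Siamese LSTM idea in Keras; MAX_LEN, the vocabulary size, and embedding_matrix are placeholders, and in the original solution the embedding layer would be loaded with pretrained GloVe weights:

```python
import numpy as np
from tensorflow.keras import layers, Model

MAX_LEN, VOCAB, DIM = 30, 50000, 300
embedding_matrix = np.zeros((VOCAB, DIM))  # placeholder for pretrained GloVe weights

def encoder():
    """One branch: embedding lookup followed by an LSTM sentence encoder."""
    inp = layers.Input(shape=(MAX_LEN,))
    emb = layers.Embedding(VOCAB, DIM, weights=[embedding_matrix], trainable=False)(inp)
    return Model(inp, layers.LSTM(128)(emb))

shared = encoder()                      # both questions go through the same weights
q1 = layers.Input(shape=(MAX_LEN,))
q2 = layers.Input(shape=(MAX_LEN,))
v1, v2 = shared(q1), shared(q2)

# Combine the two sentence vectors and classify the pair as duplicate / not duplicate.
diff = layers.Subtract()([v1, v2])
mul = layers.Multiply()([v1, v2])
merged = layers.Concatenate()([v1, v2, diff, mul])
hidden = layers.Dense(128, activation="relu")(merged)
pred = layers.Dense(1, activation="sigmoid")(hidden)

model = Model([q1, q2], pred)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```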

A typical BERT text-similarity case: a BERT-based semantic matching model built from a pretrained Chinese checkpoint, trained on the LCQMC dataset.

https://github.com/pengming617/bert_textMatching

LSTM sentence-similarity analysis

https://github.com/zqhZY/semanaly

Rule-based query matching

Fast, controllable, easy to implement, and efficient.

  1. Segment the standard and similar questions in the FAQ library, distill a large set of concepts from them, and combine those concepts into a large set of sentence patterns. For example, 华为mate30 belongs to the cellphone concept, price questions belong to the askMoney concept, and "现在" belongs to the time concept (a sketch follows).
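A minimal sketch of that kind of concept dictionary plus sentence-pattern matching; the concept names and patterns are invented for illustration:

```python
import re

# Hypothetical concept dictionaries distilled from the FAQ library.
CONCEPTS = {
    "cellphone": ["华为mate30", "iphone11", "小米9"],
    "askMoney": ["多少钱", "什么价格", "价格是多少"],
    "time": ["现在", "今天", "本月"],
}

# Hypothetical sentence patterns built by combining concepts.
PATTERNS = {
    ("time", "cellphone", "askMoney"): "faq_phone_price",
}

def tag_concepts(query: str):
    """Map every dictionary hit to its concept name, keeping surface order."""
    found = []
    for concept, words in CONCEPTS.items():
        for w in words:
            for m in re.finditer(re.escape(w), query):
                found.append((m.start(), concept))
    return tuple(c for _, c in sorted(found))

def match(query: str):
    return PATTERNS.get(tag_concepts(query), "no_rule_matched")

print(match("现在华为mate30多少钱"))  # -> faq_phone_price
```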

Deep-learning semantic matching

  • DSSM, LSTM-DSSM
  • Pretrained models such as BERT

Knowledge-Based Question Answering

The most critical step in KBQA is building the knowledge graph (KG), which gives a very large boost to the vast majority of NLP tasks.

Semantic similarity matching and intelligent search based on a knowledge graph

