Distance calculation between rows in a Pandas DataFrame using a distance matrix
Problem description
I have the following Pandas DataFrame:
In [31]:
import pandas as pd
sample = pd.DataFrame({'Sym1': ['a', 'a', 'a', 'd'],
                       'Sym2': ['a', 'c', 'b', 'b'],
                       'Sym3': ['a', 'c', 'b', 'd'],
                       'Sym4': ['b', 'b', 'b', 'a']},
                      index=['Item1', 'Item2', 'Item3', 'Item4'])
In [32]: print(sample)
Out[32]:
      Sym1 Sym2 Sym3 Sym4
Item1    a    a    a    b
Item2    a    c    c    b
Item3    a    b    b    b
Item4    d    b    d    a
I want an elegant way to compute the distance between each pair of items according to this distance matrix:
In [34]:
DistMatrix = pd.DataFrame({'a': [0, 0, 0.67, 1.34],
                           'b': [0, 0, 0, 0.67],
                           'c': [0.67, 0, 0, 0],
                           'd': [1.34, 0.67, 0, 0]},
                          index=['a', 'b', 'c', 'd'])
print(DistMatrix)
Out[34]:
      a     b     c     d
a  0.00  0.00  0.67  1.34
b  0.00  0.00  0.00  0.67
c  0.67  0.00  0.00  0.00
d  1.34  0.67  0.00  0.00
For example, comparing Item1 to Item2 means comparing aaab with accb position by position; using the distance matrix, this gives 0 + 0.67 + 0.67 + 0 = 1.34.
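As a quick check of the arithmetic above, the Item1 vs Item2 distance can be computed directly. This is only a sketch: `pair_distance` is a hypothetical helper name, and it assumes every symbol in `sample` appears in both the index and the columns of `DistMatrix`.

```python
import pandas as pd

sample = pd.DataFrame({'Sym1': ['a', 'a', 'a', 'd'],
                       'Sym2': ['a', 'c', 'b', 'b'],
                       'Sym3': ['a', 'c', 'b', 'd'],
                       'Sym4': ['b', 'b', 'b', 'a']},
                      index=['Item1', 'Item2', 'Item3', 'Item4'])
DistMatrix = pd.DataFrame({'a': [0, 0, 0.67, 1.34],
                           'b': [0, 0, 0, 0.67],
                           'c': [0.67, 0, 0, 0],
                           'd': [1.34, 0.67, 0, 0]},
                          index=['a', 'b', 'c', 'd'])


def pair_distance(item_a, item_b):
    """Sum the per-position symbol distances between two rows of `sample`.

    Hypothetical helper for illustration; assumes all symbols are present
    in DistMatrix's index and columns.
    """
    return sum(DistMatrix.loc[x, y]
               for x, y in zip(sample.loc[item_a], sample.loc[item_b]))


# aaab vs accb: 0 + 0.67 + 0.67 + 0
d12 = pair_distance('Item1', 'Item2')
print(round(d12, 2))
```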
Ideal output:
       Item1  Item2  Item3  Item4
Item1      0   1.34      0   2.68
Item2   1.34      0      0   1.34
Item3      0      0      0   2.01
Item4   2.68   1.34   2.01      0
The following does twice as much work as needed, but it also works for non-symmetric distance matrices (whatever that would mean here):
pd.DataFrame({idx1: {idx2: sum(DistMatrix[x][y]
                               for (x, y) in zip(row1, row2))
                     for (idx2, row2) in sample.iterrows()}
              for (idx1, row1) in sample.iterrows()})
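If the distance matrix is symmetric with a zero diagonal, as the one in the question is, each unordered pair only needs to be computed once and can then be mirrored. A minimal sketch under that assumption; `pairwise_symmetric` is a hypothetical helper name, not something pandas provides:

```python
from itertools import combinations

import pandas as pd

sample = pd.DataFrame({'Sym1': ['a', 'a', 'a', 'd'],
                       'Sym2': ['a', 'c', 'b', 'b'],
                       'Sym3': ['a', 'c', 'b', 'd'],
                       'Sym4': ['b', 'b', 'b', 'a']},
                      index=['Item1', 'Item2', 'Item3', 'Item4'])
DistMatrix = pd.DataFrame({'a': [0, 0, 0.67, 1.34],
                           'b': [0, 0, 0, 0.67],
                           'c': [0.67, 0, 0, 0],
                           'd': [1.34, 0.67, 0, 0]},
                          index=['a', 'b', 'c', 'd'])


def pairwise_symmetric(frame, dist):
    """Pairwise row distances, visiting each unordered pair once.

    Assumption (not checked here): `dist` is symmetric with a zero
    diagonal, so the result's diagonal can stay at 0.
    """
    labels = list(frame.index)
    out = pd.DataFrame(0.0, index=labels, columns=labels)
    for i, j in combinations(labels, 2):
        d = sum(dist.loc[x, y] for x, y in zip(frame.loc[i], frame.loc[j]))
        out.loc[i, j] = d  # upper triangle
        out.loc[j, i] = d  # mirrored into the lower triangle
    return out


result = pairwise_symmetric(sample, DistMatrix)
print(result)
```

For a non-symmetric matrix this shortcut would give wrong values in one triangle, so the full double loop is the safer choice in that case.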
The nested comprehension can be made more readable by writing it in pieces:
# a helper function to compute the distance between two items
def dist(xs, ys):
    return sum(DistMatrix[x][y] for (x, y) in zip(xs, ys))

# a second helper function to compute the distances from a given item to every item
def xdist(x):
    return {idx: dist(x, y) for (idx, y) in sample.iterrows()}

# the pairwise distance matrix
pd.DataFrame({idx: xdist(x) for (idx, x) in sample.iterrows()})
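For larger frames, the Python-level loops above may become slow. One possible alternative, not part of the original answer, is to translate the symbols to integer positions and let NumPy broadcasting do the lookup. The sketch below assumes every symbol in `sample` occurs in both the index and the columns of `DistMatrix`, and it builds an intermediate array of shape (n_items, n_items, n_columns), so memory use grows quickly with the number of items.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({'Sym1': ['a', 'a', 'a', 'd'],
                       'Sym2': ['a', 'c', 'b', 'b'],
                       'Sym3': ['a', 'c', 'b', 'd'],
                       'Sym4': ['b', 'b', 'b', 'a']},
                      index=['Item1', 'Item2', 'Item3', 'Item4'])
DistMatrix = pd.DataFrame({'a': [0, 0, 0.67, 1.34],
                           'b': [0, 0, 0, 0.67],
                           'c': [0.67, 0, 0, 0],
                           'd': [1.34, 0.67, 0, 0]},
                          index=['a', 'b', 'c', 'd'])

# Flatten the symbols, then map each one to its position along
# DistMatrix's rows and columns (get_indexer returns -1 if missing).
symbols = sample.to_numpy().ravel()
row_codes = DistMatrix.index.get_indexer(symbols).reshape(sample.shape)
col_codes = DistMatrix.columns.get_indexer(symbols).reshape(sample.shape)
if (row_codes < 0).any() or (col_codes < 0).any():
    raise KeyError("sample contains a symbol missing from DistMatrix")

D = DistMatrix.to_numpy()
# per_position[i, j, k] = D[symbol k of item i, symbol k of item j]
per_position = D[row_codes[:, None, :], col_codes[None, :, :]]

result = pd.DataFrame(per_position.sum(axis=2),
                      index=sample.index, columns=sample.index)
print(result)
```

Like the double loop, this makes no symmetry assumption about `DistMatrix`.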
