Get the count of matching words in a pandas string column using a predefined list
Problem description
I have a DataFrame that contains index and text columns.
For example:
index | text
1 | "I have a pen, but I lost it today."
2 | "I have pineapple and pen, but I lost it today."
Now I have a long list, and I want to match each of the words in text against that list.
Suppose:
long_list = ['pen', 'pineapple']
I want to create a FunctionTransformer that matches the words in long_list against each word of the column value and, if there are matches, returns the count.
index | text | count
1 | "I have a pen, but I lost it today." | 1
2 | "I have pineapple and pen, but I lost it today." | 2
This is how I did it:
def count_words(df):
    long_list = ['pen', 'pineapple']
    count = 0
    for c in df['tweet_text']:
        if c in long_list:
            count = count + 1
    df['count'] = count
    return df

count_word = FunctionTransformer(count_words, validate=False)
Here is an example of how I wrote one of my other FunctionTransformers:
def convert_twitter_datetime(df):
    df['hour'] = pd.to_datetime(df['created_at'], format='%a %b %d %H:%M:%S +0000 %Y').dt.strftime('%H').astype(int)
    return df

convert_datetime = FunctionTransformer(convert_twitter_datetime, validate=False)
Recommended answer
Inspired by @Quang Hoang's answer
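import pandas as pd
from sklearn.preprocessing import FunctionTransformer

y = ['pen', 'pineapple']

def count_strings(X, y):
    # Join the list into one regex alternation, e.g. 'pen|pineapple',
    # and count its matches in every row of the text column
    pattern = '|'.join(y)
    return X['text'].str.count(pattern)

string_transformer = FunctionTransformer(count_strings, kw_args={'y': y})
# df is the DataFrame from the question
df['count'] = string_transformer.fit_transform(X=df)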
Result
text count
1 "I have a pen, but I lost it today." 1
2 "I have pineapple and pen, but I lost it today." 2
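Note that a pattern such as 'pen|pineapple' counts substring matches, so "pencil" or "opened" would each add to the count as well. If only whole words should count, one possible variant is sketched below. The function name count_whole_words, its words and column parameters, and the third sample row are assumptions made for illustration; they are not part of the original answer.

```python
import re

import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def count_whole_words(X, words, column="text"):
    # Escape each word so regex metacharacters are matched literally,
    # then wrap the alternation in \b so only whole words are counted.
    pattern = r"\b(?:{})\b".format("|".join(re.escape(w) for w in words))
    return X[column].str.count(pattern)


df = pd.DataFrame(
    {
        "text": [
            "I have a pen, but I lost it today.",
            "I have pineapple and pen, but I lost it today.",
            "I opened a pencil case.",  # hypothetical row: substrings only
        ]
    },
    index=[1, 2, 3],
)

whole_word_counter = FunctionTransformer(
    count_whole_words, kw_args={"words": ["pen", "pineapple"]}
)
counts = whole_word_counter.fit_transform(df)
print(counts)
```

With the word boundaries in place, the third row should contribute no matches, whereas the plain alternation would count both "opened" and "pencil".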
For the following df2:
#df2
text
1 "I have a pen, but I lost it today. pen pen"
2 "I have pineapple and pen, but I lost it today."
we get
string_transformer.transform(X=df2)
#result
1 3
2 2
Name: text, dtype: int64
This shows that we have converted the function into an sklearn-style object. To abstract this even further, we can pass the column name as a keyword argument to count_strings.
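A minimal sketch of that last idea, assuming the extra parameter is simply called column (the name is a choice made here, not something the answer specifies), could look like this:

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def count_strings(X, y, column):
    # Same substring count as before, but the column to search
    # is now supplied by the caller instead of being hard-coded.
    pattern = "|".join(y)
    return X[column].str.count(pattern)


df2 = pd.DataFrame(
    {
        "text": [
            "I have a pen, but I lost it today. pen pen",
            "I have pineapple and pen, but I lost it today.",
        ]
    },
    index=[1, 2],
)

string_transformer = FunctionTransformer(
    count_strings,
    kw_args={"y": ["pen", "pineapple"], "column": "text"},
)
result = string_transformer.fit_transform(df2)
print(result)
```

Because both the word list and the column name travel through kw_args, the same function can be reused for other text columns by building another FunctionTransformer with different arguments.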
