
Streaming data processing in Python


1. Log in to the server directly: ssh 2014210***@thumedia.org -p 6349

Create streaming.py with touch streaming.py, then edit it as follows:

#!/usr/bin/python

import logging
import time

# Log to output.log. The handler is attached once, before the loop;
# attaching it inside the loop would duplicate every message on each pass.
logger = logging.getLogger()
handler = logging.FileHandler("output.log")
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

pg2count = {}  # page -> [cumulative count, t]
t = 1

while True:
    times = None  # timestamp of the last line read in this pass
    # Note: the whole file is re-read on every pass, so lines still present
    # from an earlier pass are counted again.
    with open('/tmp/hw3.log', 'r') as fp:
        for line in fp:
            fields = line.split()
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            times, page, count = fields[0], fields[1], fields[2]
            if count.isdigit() and page.startswith('Page-'):
                previous = pg2count.get(page, [0, t])[0]
                pg2count[page] = [previous + int(count), t]

    # Rank pages by cumulative count, highest first, and keep at most 10.
    top = sorted(pg2count.items(), key=lambda item: item[1][0], reverse=True)[:10]

    header = 'the page rank at current time %s is:' % times
    print(header)
    logger.info(header)
    for page, (total, _) in top:
        row = '%s\t%d' % (page, total)
        print(row)
        logger.info(row)

    time.sleep(60)
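The loop above opens /tmp/hw3.log and reads it from the beginning on every pass. If the log is only ever appended to, one possible alternative is to remember the byte offset reached so far and read only what was added since. The following is a minimal sketch under that assumption; the helper names read_new_lines, update_counts and top_pages are hypothetical and not part of the original script.

```python
from collections import Counter


def read_new_lines(path, offset):
    """Return (complete lines appended since `offset`, new offset)."""
    with open(path, 'rb') as fp:
        fp.seek(offset)
        data = fp.read()
    # Consume only complete lines; a half-written last line waits for the next pass.
    end = data.rfind(b'\n') + 1
    lines = data[:end].decode('utf-8', errors='replace').splitlines()
    return lines, offset + end


def update_counts(counts, lines):
    """Add the count of every valid 'timestamp Page-N count' line to `counts`."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 3 and fields[1].startswith('Page-') and fields[2].isdigit():
            counts[fields[1]] += int(fields[2])


def top_pages(counts, n=10):
    """Return up to n (page, total) pairs, highest total first."""
    return counts.most_common(n)
```

A caller would keep one Counter and one offset across passes, call read_new_lines and update_counts on each pass, and print top_pages(counts). This sketch does not handle log rotation or truncation, which would require resetting the offset.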

2. After writing the code, run a test with python streaming.py. The output is as follows:

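When the script is started with nohup (as in the final step below), the process ignores input and appends its output to nohup.out. Seeing this message means the background process started successfully, and its output will be stored in nohup.out.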

You can also check the output in output.log:

Finally, keep it running in the background with nohup python streaming.py &. Output:

[1] 8994

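A day later, we check the results again:

As you can see, the cumulative results differ from those of the first run.

3. Kill the process: ps -ef|grep 1020 gives the following output: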

2014210***@cluster-3-1:~$ ps -ef|grep 1020

1020      7512  7471  0 Jan10 ?        00:00:00 sshd: 2014210***@pts/30
1020      7513  7512  0 Jan10 pts/30   00:00:00 -bash
1020      7574  7508  0 20:55 ?        00:00:00 sshd: 2014210***@pts/52
1020      7575  7574  0 20:55 pts/52   00:00:00 -bash
1020      8282  7575  0 21:04 pts/52   00:00:00 ps -ef
1020      8283  7575  0 21:04 pts/52   00:00:00 grep --color=auto 1020
1020      8994     1  0 13:20 ?        00:01:46 python streaming.py
1020     12260 12232  0 Jan10 ?        00:00:00 sshd: 2014210***@pts/35
1020     12261 12260  0 Jan10 pts/35   00:00:01 -bash

Enter kill 8994:

2014210***@cluster-3-1:~$ kill 8994

2014210***@cluster-3-1:~$ ps -ef|grep 1020

1020      7512  7471  0 Jan10 ?        00:00:00 sshd: 2014210***@pts/30
1020      7513  7512  0 Jan10 pts/30   00:00:00 -bash
1020      7574  7508  0 20:55 ?        00:00:00 sshd: 2014210***@pts/52
1020      7575  7574  0 20:55 pts/52   00:00:00 -bash
1020      8335  7575  0 21:05 pts/52   00:00:00 ps -ef
1020      8336  7575  0 21:05 pts/52   00:00:00 grep --color=auto 1020
1020     12260 12232  0 Jan10 ?        00:00:00 sshd: 2014210***@pts/35
1020     12261 12260  0 Jan10 pts/35   00:00:01 -bash

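At this point, streaming.py has stopped running; the python streaming.py entry (PID 8994) no longer appears in the process list.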
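The same start-in-background and kill-by-PID sequence can also be driven from Python. The sketch below assumes a POSIX system and uses a short-lived stand-in child process rather than streaming.py itself; the function names start_background and kill_by_pid are hypothetical.

```python
import os
import signal
import subprocess
import sys


def start_background(args):
    """Start a child process without waiting for it and return its handle."""
    return subprocess.Popen(args)


def kill_by_pid(pid):
    """Send SIGTERM, the signal the shell's `kill PID` sends by default."""
    os.kill(pid, signal.SIGTERM)


# Stand-in worker that would otherwise sleep for ten minutes.
worker = start_background([sys.executable, '-c', 'import time; time.sleep(600)'])
kill_by_pid(worker.pid)
# On POSIX, a child ended by a signal reports the negative signal number.
exit_code = worker.wait()
```

Because the script does not install a SIGTERM handler, the process ends immediately wherever it happens to be, which is acceptable here since each pass only prints and logs results.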

Question

What challenges does your design present when dealing with massive streaming data and intricate calculations?

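Answer: First, work out how long each program cycle takes. Next, make sure that all the data arriving during that time can be stored. With that in place, the next program cycle can process the real-time data stored during the previous cycle.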
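One way to picture this answer is a buffer that collects records while one cycle runs and hands them over when the next cycle starts. The sketch below is only an illustration under assumed names (CycleBuffer, required_capacity) and an assumed steady arrival rate; none of it comes from the original script.

```python
import math
import threading


def required_capacity(records_per_second, cycle_seconds):
    """Records that can arrive during one cycle at the given rate."""
    return int(math.ceil(records_per_second * cycle_seconds))


class CycleBuffer(object):
    """Stores records during one cycle and releases them to the next."""

    def __init__(self, max_records):
        self._lock = threading.Lock()
        self._records = []
        self._max = max_records
        self.dropped = 0  # records rejected because the buffer was full

    def add(self, record):
        """Store one incoming record, or count it as dropped if full."""
        with self._lock:
            if len(self._records) < self._max:
                self._records.append(record)
            else:
                self.dropped += 1

    def swap(self):
        """Return everything stored so far and start an empty buffer."""
        with self._lock:
            records, self._records = self._records, []
            return records
```

A collector thread would call add() for each incoming record, and the processing loop would call swap() at the start of each cycle. A non-zero dropped counter suggests the capacity was sized too small for the cycle length.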
