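Python, pandas: DataFrame operations (methods): count, describe
First, create a DataFrame. The session assumes that `import numpy as np` and `import pandas as pd` have already been run.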
In [41]: df = pd.DataFrame(np.random.randn(10,5))

In [42]: df
Out[42]:
          0         1         2         3         4
0  0.077817 -0.759317 -0.229288  1.330958  0.083665
1  0.356439 -0.977023  1.361504 -0.573934  0.391663
2  0.441791 -1.374674  2.725946  0.301030 -1.610476
3  0.344326  1.048565  0.425799 -0.458068  1.657750
4 -0.949965  0.460259  0.183424 -0.325335  1.090816
5  0.732572 -0.670865 -1.052756 -0.121706  1.520297
6  1.376255 -0.870515 -1.682764  1.858966 -1.593190
7  1.742534 -0.149480 -0.031331  0.292201 -2.005993
8 -1.571542  0.557061 -1.280689  0.350430  0.482291
9  0.661776 -0.581982 -0.720512  0.334272 -0.043694

In [43]: df.columns = list('abcde')

In [44]: df
Out[44]:
          a         b         c         d         e
0  0.077817 -0.759317 -0.229288  1.330958  0.083665
1  0.356439 -0.977023  1.361504 -0.573934  0.391663
2  0.441791 -1.374674  2.725946  0.301030 -1.610476
3  0.344326  1.048565  0.425799 -0.458068  1.657750
4 -0.949965  0.460259  0.183424 -0.325335  1.090816
5  0.732572 -0.670865 -1.052756 -0.121706  1.520297
6  1.376255 -0.870515 -1.682764  1.858966 -1.593190
7  1.742534 -0.149480 -0.031331  0.292201 -2.005993
8 -1.571542  0.557061 -1.280689  0.350430  0.482291
9  0.661776 -0.581982 -0.720512  0.334272 -0.043694
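The random values above differ on every run. As a minimal sketch, a reproducible version of the same setup could look like the following. The seed value 0 and the name `rng` are assumptions chosen for illustration, and the column labels are passed directly to the constructor instead of being assigned afterwards.

```python
import numpy as np
import pandas as pd

# Seeded generator so the frame is the same on every run (seed 0 is arbitrary)
rng = np.random.default_rng(0)

# 10 rows x 5 columns of standard normal values, labelled a..e at creation time
df = pd.DataFrame(rng.standard_normal((10, 5)), columns=list("abcde"))

print(df.shape)
print(list(df.columns))
```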
Next, describe(), which returns summary statistics for each column.
In [46]: df.describe()
Out[46]:
               a          b          c          d          e
count  10.000000  10.000000  10.000000  10.000000  10.000000
mean    0.321200  -0.331797  -0.030067   0.298881  -0.002687
std     0.982764   0.782483   1.318384   0.774907   1.324976
min    -1.571542  -1.374674  -1.682764  -0.573934  -2.005993
25%     0.144444  -0.842716  -0.969695  -0.274428  -1.205816
50%     0.399115  -0.626424  -0.130309   0.296615   0.237664
75%     0.714873   0.307824   0.365205   0.346391   0.938685
max     1.742534   1.048565   2.725946   1.858966   1.657750
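The result of describe() is itself a DataFrame indexed by statistic name, so individual rows can be pulled out by label. The sketch below, on an illustrative seeded frame (not the data shown above), checks that the `mean` and `50%` rows agree with df.mean() and df.median().

```python
import numpy as np
import pandas as pd

# Illustrative data; the seed is an arbitrary assumption
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 5)), columns=list("abcde"))

summary = df.describe()

# Rows of the summary are addressed by their index labels
mean_row = summary.loc["mean"]
median_row = summary.loc["50%"]

# Compare against the dedicated methods
same_mean = np.allclose(mean_row, df.mean())
same_median = np.allclose(median_row, df.median())
print(same_mean, same_median)
```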
Now some other methods.
In [47]: df.count()
Out[47]:
a    10
b    10
c    10
d    10
e    10
dtype: int64

In [48]: df.mean()
Out[48]:
a    0.321200
b   -0.331797
c   -0.030067
d    0.298881
e   -0.002687
dtype: float64
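Every count above is 10 because the random frame has no missing values. count() reports the number of non-missing entries per column, and mean() skips NaN by default. A small sketch with a hand-made frame (the values are made up for illustration) shows the difference:

```python
import numpy as np
import pandas as pd

# Hand-made frame: 'a' has one NaN, 'b' is entirely NaN, 'c' is complete
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan],
    "b": [np.nan, np.nan, np.nan],
    "c": [1.0, 2.0, 3.0],
})

counts = df.count()  # non-missing entries per column
means = df.mean()    # NaN values are skipped by default

print(counts)
print(means)
```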
The return value of df.count() is a Series, so the usual Series operations can be applied to it.
In [49]: se=df.count()

In [50]: se[se == 0]
Out[50]: Series([], dtype: int64)

In [51]: se[se == 10]
Out[51]:
a    10
b    10
c    10
d    10
e    10
dtype: int64

In [52]: se.keys()
Out[52]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [53]: se.values
Out[53]: array([10, 10, 10, 10, 10])
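One practical use of this kind of filtering is finding columns that are empty or incomplete. The following is a sketch on a hand-made frame with missing values. The names `empty`, `incomplete`, and `cleaned` are illustrative, not part of pandas.

```python
import numpy as np
import pandas as pd

# Hand-made frame: 'a' has one NaN, 'b' is entirely NaN, 'c' is complete
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan],
    "b": [np.nan, np.nan, np.nan],
    "c": [1.0, 2.0, 3.0],
})

counts = df.count()

# Boolean mask on the Series, then take the matching labels
empty = counts[counts == 0].index             # columns with no data at all
incomplete = counts[counts < len(df)].index   # columns with at least one NaN

# Drop the columns that are entirely empty
cleaned = df.drop(columns=empty)

print(list(empty), list(incomplete), list(cleaned.columns))
```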
Finally, a trick for selecting columns based on their mean. The key is df.loc[:, expression], where the expression is a boolean Series indexed by column name. (The older df.ix indexer accepts the same expression on old pandas versions, but it was deprecated in pandas 0.20 and removed in pandas 1.0.)
In [60]: se = df.mean()

In [61]: se
Out[61]:
a    0.321200
b   -0.331797
c   -0.030067
d    0.298881
e   -0.002687
dtype: float64

In [62]: df.loc[:, se > 0]
Out[62]:
          a         d
0  0.077817  1.330958
1  0.356439 -0.573934
2  0.441791  0.301030
3  0.344326 -0.458068
4 -0.949965 -0.325335
5  0.732572 -0.121706
6  1.376255  1.858966
7  1.742534  0.292201
8 -1.571542  0.350430
9  0.661776  0.334272

In [63]: df.loc[:, se < 0]
Out[63]:
          b         c         e
0 -0.759317 -0.229288  0.083665
1 -0.977023  1.361504  0.391663
2 -1.374674  2.725946 -1.610476
3  1.048565  0.425799  1.657750
4  0.460259  0.183424  1.090816
5 -0.670865 -1.052756  1.520297
6 -0.870515 -1.682764 -1.593190
7 -0.149480 -0.031331 -2.005993
8  0.557061 -1.280689  0.482291
9 -0.581982 -0.720512 -0.043694
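The same column selection can be tried on a tiny frame where the means are easy to check by hand. This sketch is written with df.loc, the label-based indexer available in current pandas. The data and the names `col_means`, `positive`, and `negative` are made up for illustration.

```python
import pandas as pd

# Tiny frame with easily verified column means
df = pd.DataFrame({
    "a": [1.0, 3.0],    # mean  2.0
    "b": [-2.0, 1.0],   # mean -0.5
    "c": [0.5, -0.25],  # mean  0.125
})

col_means = df.mean()

# A boolean Series indexed by column name selects columns via .loc
positive = df.loc[:, col_means > 0]
negative = df.loc[:, col_means < 0]

print(list(positive.columns))
print(list(negative.columns))
```

The mask is aligned on column labels, so it must carry the same index as df.columns, which is the case for anything computed column-wise from df itself.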