<Python, pandas> Reading data with a varying number of columns


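When pandas reads data whose number of columns is not constant, it throws an error. (The session below assumes `import io` and `import pandas as pd` were run earlier.)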

In [18]: csv_data = '''
    ...: 1,2,3,4,5
    ...: 1,2,3,
    ...: 1,2,3,4,5,6
    ...: '''

In [19]: df = pd.read_csv(io.StringIO(csv_data))
---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-19-4220f4db8df7> in <module>()
----> 1 df = pd.read_csv(io.StringIO(csv_data))

C:\Anaconda3\Lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    527                     skip_blank_lines=skip_blank_lines)
    528 
--> 529         return _read(filepath_or_buffer, kwds)
    530 
    531     parser_f.__name__ = name

C:\Anaconda3\Lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
    303         return parser
    304 
--> 305     return parser.read()
    306 
    307 _parser_defaults = {

C:\Anaconda3\Lib\site-packages\pandas\io\parsers.py in read(self, nrows)
    761                 raise ValueError('skip_footer not supported for iteration')
    762 
--> 763         ret = self._engine.read(nrows)
    764 
    765         if self.options.get('as_recarray'):

C:\Anaconda3\Lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   1211     def read(self, nrows=None):
   1212         try:
-> 1213             data = self._reader.read(nrows)
   1214         except StopIteration:
   1215             if self._first_chunk:

pandas\parser.pyx in pandas.parser.TextReader.read (pandas\parser.c:7988)()

pandas\parser.pyx in pandas.parser.TextReader._read_low_memory (pandas\parser.c:8244)()

pandas\parser.pyx in pandas.parser.TextReader._read_rows (pandas\parser.c:8970)()

pandas\parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:8838)()

pandas\parser.pyx in pandas.parser.raise_parser_error (pandas\parser.c:22649)()

CParserError: Error tokenizing data. C error: Expected 5 fields in line 4, saw 6

I was wondering what on earth this was, but, as usual,
someone out there has already worked out a nice solution.
That helps.

Workaround.

In [20]: import csv

In [21]: rows = csv.reader(io.StringIO(csv_data))

In [22]: rows
Out[22]: <_csv.reader at 0xb898db0>

In [23]: data = []

In [24]: for row in rows:
    ...:     data.append(row)
    ...:     

In [25]: data
Out[25]: 
[[],
 ['1', '2', '3', '4', '5'],
 ['1', '2', '3', ''],
 ['1', '2', '3', '4', '5', '6']]

In [26]: df = pd.DataFrame(data)

In [27]: df
Out[27]: 
      0     1     2     3     4     5
0  None  None  None  None  None  None
1     1     2     3     4     5  None
2     1     2     3        None  None
3     1     2     3     4     5     6
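The session above can be condensed into a small helper. This is only a sketch using the standard `csv` module: the function name `read_ragged_rows` and the `fill` parameter are made up for illustration, and dropping the empty row (which comes from the leading newline in `csv_data`) is my own assumption about what is wanted.

```python
import csv
import io


def read_ragged_rows(text, fill=None):
    """Parse CSV text whose rows differ in length and pad them to equal width.

    Hypothetical helper, not part of pandas or the csv module.
    """
    # csv.reader yields an empty list for a blank line; drop those rows.
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    # The widest row decides how many columns every row should have.
    width = max((len(row) for row in rows), default=0)
    # Pad the shorter rows on the right with the fill value.
    return [row + [fill] * (width - len(row)) for row in rows]


csv_data = '''
1,2,3,4,5
1,2,3,
1,2,3,4,5,6
'''

padded = read_ragged_rows(csv_data)
```

The padded list can then be handed to `pd.DataFrame(padded)` as in the session above. Note that every value is still a string at this point, so any numeric conversion has to be done afterwards.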

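Alternatively, it apparently works if you pass the column names via `names`. If `names` covers the widest row, the shorter rows are filled with NaN. If it has fewer entries than the data has fields, the leading fields end up in the index, as the second example shows.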

In [28]: csv_data = '''
    ...: 1,2,3,4,5
    ...: 1,2,3,
    ...: 1,2,3,4,5,6
    ...: '''

In [29]: df = pd.read_csv(io.StringIO(csv_data), names=[1,2,3,4,5,6])

In [30]: df
Out[30]: 
   1  2  3    4    5    6
0  1  2  3  4.0  5.0  NaN
1  1  2  3  NaN  NaN  NaN
2  1  2  3  4.0  5.0  6.0

In [31]: df = pd.read_csv(io.StringIO(csv_data), names=[1,2,3])

In [32]: df
Out[32]: 
     1    2    3
1 2  3  4.0  5.0
  2  3  NaN  NaN
  2  3  4.0  5.0
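If the widest row is not known in advance, one option is to scan the data once to find it and then build `names` from that. The following is a minimal sketch under that assumption. The variable `width` and the choice of 0-based integer column names are mine, and the data is parsed twice, which may be too slow for large files.

```python
import csv
import io

import pandas as pd

csv_data = '''
1,2,3,4,5
1,2,3,
1,2,3,4,5,6
'''

# First pass: find the largest number of fields in any row.
width = max(len(row) for row in csv.reader(io.StringIO(csv_data)))

# Second pass: give read_csv one name per column so that shorter rows
# are padded with NaN instead of triggering a tokenizing error.
df = pd.read_csv(io.StringIO(csv_data), header=None, names=list(range(width)))
```

Here `width` is 6, so the result has three rows and six columns, matching the `names=[1,2,3,4,5,6]` example above apart from the column labels.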

I see.

References.

ameblo.jp

pandasでカラムサイズが一定でないcsv/tsvを読み込む (Reading csv/tsv files with a non-constant column size in pandas) : mwSoft blog