<Python, pandas> Reading data whose number of columns varies from row to row
When pandas reads data in which the number of columns is not the same on every row, it throws an error.
In [18]: csv_data = '''
    ...: 1,2,3,4,5
    ...: 1,2,3,
    ...: 1,2,3,4,5,6
    ...: '''

In [19]: df = pd.read_csv(io.StringIO(csv_data))
---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-19-4220f4db8df7> in <module>()
----> 1 df = pd.read_csv(io.StringIO(csv_data))

C:\Anaconda3\Lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    527                     skip_blank_lines=skip_blank_lines)
    528
--> 529         return _read(filepath_or_buffer, kwds)
    530
    531     parser_f.__name__ = name

C:\Anaconda3\Lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
    303         return parser
    304
--> 305     return parser.read()
    306
    307 _parser_defaults = {

C:\Anaconda3\Lib\site-packages\pandas\io\parsers.py in read(self, nrows)
    761                 raise ValueError('skip_footer not supported for iteration')
    762
--> 763         ret = self._engine.read(nrows)
    764
    765         if self.options.get('as_recarray'):

C:\Anaconda3\Lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   1211     def read(self, nrows=None):
   1212         try:
-> 1213             data = self._reader.read(nrows)
   1214         except StopIteration:
   1215             if self._first_chunk:

pandas\parser.pyx in pandas.parser.TextReader.read (pandas\parser.c:7988)()

pandas\parser.pyx in pandas.parser.TextReader._read_low_memory (pandas\parser.c:8244)()

pandas\parser.pyx in pandas.parser.TextReader._read_rows (pandas\parser.c:8970)()

pandas\parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:8838)()

pandas\parser.pyx in pandas.parser.raise_parser_error (pandas\parser.c:22649)()

CParserError: Error tokenizing data. C error: Expected 5 fields in line 4, saw 6
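The last line of the traceback is the important part. The parser takes its expected field count (5) from the first non-blank row, pads the shorter row quietly, and only gives up on the row that has more fields than expected. Below is a minimal sketch that reproduces this as a script and catches the exception. It assumes a newer pandas, where the exception is exposed as `pd.errors.ParserError` rather than the `CParserError` shown in the old traceback above.

```python
import io

import pandas as pd

# Same ragged data as in the session above: rows with 5, 4 and 6 fields.
csv_data = """
1,2,3,4,5
1,2,3,
1,2,3,4,5,6
"""

try:
    pd.read_csv(io.StringIO(csv_data))
    error_message = None
except pd.errors.ParserError as exc:
    # Assumption: a recent pandas, where the exception class is
    # pd.errors.ParserError (older versions raised CParserError).
    error_message = str(exc)

print(error_message)
```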
I was thinking "what on earth is this?", but as usual, someone out there had already worked out a nice solution.
That helps.
The fix.
In [20]: import csv

In [21]: rows = csv.reader(io.StringIO(csv_data))

In [22]: rows
Out[22]: <_csv.reader at 0xb898db0>

In [23]: data = []

In [24]: for row in rows:
    ...:     data.append(row)
    ...:

In [25]: data
Out[25]:
[[],
 ['1', '2', '3', '4', '5'],
 ['1', '2', '3', ''],
 ['1', '2', '3', '4', '5', '6']]

In [26]: df = pd.DataFrame(data)

In [27]: df
Out[27]:
      0     1     2     3     4     5
0  None  None  None  None  None  None
1     1     2     3     4     5  None
2     1     2     3        None  None
3     1     2     3     4     5     6
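The result above still has two rough edges: the leading blank line becomes an all-`None` row, and every value is a string. Here is a rough sketch of the same `csv.reader` idea as a script that drops empty rows and converts to numbers. Using `pd.to_numeric` with `errors="coerce"` is my own addition, not part of the original fix, and it assumes every field is meant to be numeric.

```python
import csv
import io

import pandas as pd

csv_data = """
1,2,3,4,5
1,2,3,
1,2,3,4,5,6
"""

# csv.reader does not care how many fields each row has.
# `if row` drops the empty list produced by the leading blank line.
rows = [row for row in csv.reader(io.StringIO(csv_data)) if row]

# DataFrame pads short rows, so the frame is as wide as the longest row.
df = pd.DataFrame(rows)

# Assumption: all fields are numeric. Empty strings and missing cells
# become NaN instead of raising.
df = df.apply(pd.to_numeric, errors="coerce")

print(df)
```

If some columns are text, skip the `to_numeric` step or apply it only to the columns that need it.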
Alternatively, it seems you can just pass the column names with names.
In [28]: csv_data = '''
    ...: 1,2,3,4,5
    ...: 1,2,3,
    ...: 1,2,3,4,5,6
    ...: '''

In [29]: df = pd.read_csv(io.StringIO(csv_data), names=[1,2,3,4,5,6])

In [30]: df
Out[30]:
   1  2  3    4    5    6
0  1  2  3  4.0  5.0  NaN
1  1  2  3  NaN  NaN  NaN
2  1  2  3  4.0  5.0  6.0

In [31]: df = pd.read_csv(io.StringIO(csv_data), names=[1,2,3])

In [32]: df
Out[32]:
     1    2    3
1 2  3  4.0  5.0
  2  3  NaN  NaN
  2  3  4.0  5.0
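As the second call shows, passing too few names gives a strange result, so the list has to be at least as long as the widest row. When the width is not known in advance, one option is to scan the data once with `csv.reader` first. The helper below, `read_ragged_csv`, is a hypothetical name of my own and only a sketch of that idea, assuming the data is small enough to read twice.

```python
import csv
import io

import pandas as pd


def read_ragged_csv(text):
    """Read CSV text whose rows have different numbers of fields.

    Hypothetical helper: scans once to find the widest row, then lets
    read_csv pad the shorter rows with NaN.
    """
    # First pass: find the largest number of fields on any row.
    width = max(len(row) for row in csv.reader(io.StringIO(text)))
    # Second pass: supply exactly that many column names (0, 1, 2, ...).
    return pd.read_csv(io.StringIO(text), header=None, names=list(range(width)))


csv_data = """
1,2,3,4,5
1,2,3,
1,2,3,4,5,6
"""

df = read_ragged_csv(csv_data)
print(df)
```

For a file on disk the same two passes would work on the opened file, at the cost of reading it twice.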
I see.
Reference.