# v0.20.1 (May 5, 2017)

This is a major release from 0.19.2 and includes a number of API changes, deprecations, new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

Highlights include:

  • New .agg() API for Series/DataFrame similar to the groupby-rolling-resample APIs, see here
  • Integration with the feather-format, including a new top-level pd.read_feather() and DataFrame.to_feather() method, see here.
  • The .ix indexer has been deprecated, see here
  • Panel has been deprecated, see here
  • Addition of an IntervalIndex and Interval scalar type, see here
  • Improved user API when grouping by index levels in .groupby(), see here
  • Improved support for UInt64 dtypes, see here
  • A new orient for JSON serialization, orient='table', that uses the Table Schema spec and that gives the possibility for a more interactive repr in the Jupyter Notebook, see here
  • Experimental support for exporting styled DataFrames (DataFrame.style) to Excel, see here
  • Window binary corr/cov operations now return a MultiIndexed DataFrame rather than a Panel, as Panel is now deprecated, see here
  • Support for S3 handling now uses s3fs, see here
  • Google BigQuery support now uses the pandas-gbq library, see here

Warning

Pandas has changed the internal structure and layout of the code base. This can affect imports that are not from the top-level pandas.* namespace; please see the changes here.

Check the API Changes and deprecations before updating.

Note

This is a combined release for 0.20.0 and 0.20.1. Version 0.20.1 contains one additional change for backwards-compatibility with downstream projects using pandas’ utils routines. (GH16250)

What’s new in v0.20.0

# New features

# agg API for DataFrame/Series

Series & DataFrame have been enhanced to support the aggregation API. This is a familiar API from groupby, window operations, and resampling. It allows aggregation operations to be expressed concisely using agg() and transform(). The full documentation is here (GH1623).

Here is a sample:

In [1]: df = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
   ...:                   index=pd.date_range('1/1/2000', periods=10))
   ...: 

In [2]: df.iloc[3:7] = np.nan

In [3]: df
Out[3]: 
                   A         B         C
2000-01-01  0.469112 -0.282863 -1.509059
2000-01-02 -1.135632  1.212112 -0.173215
2000-01-03  0.119209 -1.044236 -0.861849
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.113648 -1.478427  0.524988
2000-01-09  0.404705  0.577046 -1.715002
2000-01-10 -1.039268 -0.370647 -1.157892

[10 rows x 3 columns]

One can operate using string function names, callables, lists, or dictionaries of these.

Using a single function is equivalent to .apply.

In [4]: df.agg('sum')
Out[4]: 
A   -1.068226
B   -1.387015
C   -4.892029
Length: 3, dtype: float64

Multiple aggregations with a list of functions.

In [5]: df.agg(['sum', 'min'])
Out[5]: 
            A         B         C
sum -1.068226 -1.387015 -4.892029
min -1.135632 -1.478427 -1.715002

[2 rows x 3 columns]

Using a dict provides the ability to apply specific aggregations per column. You will get a matrix-like output of all of the aggregators. The output has one row per unique function. Functions that were not applied to a particular column show up as NaN:

In [6]: df.agg({'A': ['sum', 'min'], 'B': ['min', 'max']})
Out[6]: 
            A         B
max       NaN  1.212112
min -1.135632 -1.478427
sum -1.068226       NaN

[3 rows x 2 columns]
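
The NaN pattern is easier to see with fixed inputs. The following is a minimal sketch using a small, made-up frame (the values are illustrative and not taken from the session above):

```python
import pandas as pd

# Small deterministic frame; the values are illustrative only.
df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]})

# Each column gets its own list of aggregations.
result = df.agg({"A": ["sum", "min"], "B": ["min", "max"]})

# The rows are the union of the requested function names. A cell is NaN
# when that function was not requested for that column.
print(result)
```

Here 'sum' was requested only for A and 'max' only for B, so the other cell in each of those rows is NaN.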

The API also supports a .transform() function for broadcasting results.

In [7]: df.transform(['abs', lambda x: x - x.min()])
Out[7]: 
                   A                   B                   C          
                 abs  <lambda>       abs  <lambda>       abs  <lambda>
2000-01-01  0.469112  1.604745  0.282863  1.195563  1.509059  0.205944
2000-01-02  1.135632  0.000000  1.212112  2.690539  0.173215  1.541787
2000-01-03  0.119209  1.254841  1.044236  0.434191  0.861849  0.853153
2000-01-04       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-08  0.113648  1.249281  1.478427  0.000000  0.524988  2.239990
2000-01-09  0.404705  1.540338  0.577046  2.055473  1.715002  0.000000
2000-01-10  1.039268  0.096364  0.370647  1.107780  1.157892  0.557110

[10 rows x 6 columns]
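
As a rough illustration of the difference between the two methods, the sketch below (with made-up data) applies a single callable through .transform(), which keeps the input's shape, and a reduction through .agg(), which collapses each column to one value:

```python
import pandas as pd

# Illustrative data, not from the session above.
df = pd.DataFrame({"A": [3.0, 1.0, 2.0], "B": [-1.0, 0.0, 4.0]})

# transform: the result has the same index and columns as the input.
shifted = df.transform(lambda col: col - col.min())

# agg with a reducing function: one value per column.
minimum = df.agg("min")

print(shifted)
print(minimum)
```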

When presented with mixed dtypes that cannot be aggregated, .agg() will only take the valid aggregations. This is similar to how groupby .agg() works. (GH15015)

In [8]: df = pd.DataFrame({'A': [1, 2, 3],
   ...:                    'B': [1., 2., 3.],
   ...:                    'C': ['foo', 'bar', 'baz'],
   ...:                    'D': pd.date_range('20130101', periods=3)})
   ...: 

In [9]: df.dtypes
Out[9]: 
A             int64
B           float64
C            object
D    datetime64[ns]
Length: 4, dtype: object

In [10]: df.agg(['min', 'sum'])
Out[10]: 
     A    B          C          D
min  1  1.0        bar 2013-01-01
sum  6  6.0  foobarbaz        NaT

[2 rows x 4 columns]

# dtype keyword for data IO

The 'python' engine for read_csv(), as well as the read_fwf() function for parsing fixed-width text files and read_excel() for parsing Excel files, now accept the dtype keyword argument for specifying the types of specific columns (GH14295). See the io docs for more information.

In [11]: data = "a  b\n1  2\n3  4"

In [12]: pd.read_fwf(StringIO(data)).dtypes
Out[12]: 
a    int64
b    int64
Length: 2, dtype: object

In [13]: pd.read_fwf(StringIO(data), dtype={'a': 'float64', 'b': 'object'}).dtypes
Out[13]: 
a    float64
b     object
Length: 2, dtype: object
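
The session above relies on StringIO already being imported. A self-contained version of the same comparison might look like this (the variable names default and typed are chosen here for illustration):

```python
from io import StringIO

import pandas as pd

data = "a  b\n1  2\n3  4"

# Without dtype, both columns are inferred as integers.
default = pd.read_fwf(StringIO(data))

# With dtype, each column is parsed as the requested type.
typed = pd.read_fwf(StringIO(data), dtype={"a": "float64", "b": "object"})

print(default.dtypes)
print(typed.dtypes)
```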

# .to_datetime() has gained an origin parameter

to_datetime() has gained a new parameter, origin, to define a reference date from where to compute the resulting timestamps when parsing numerical values with a specific unit specified. (GH11276, GH11745)

For example, with 1960-01-01 as the starting date:

In [14]: pd.to_datetime([1, 2, 3], unit='D', origin=pd.Timestamp('1960-01-01'))
Out[14]: DatetimeIndex(['1960-01-02', '1960-01-03', '1960-01-04'], dtype='datetime64[ns]', freq=None)

The default is origin='unix', which corresponds to 1970-01-01 00:00:00, commonly called the ‘unix epoch’ or POSIX time. This was the previous behavior, so the change is backward compatible.

In [15]: pd.to_datetime([1, 2, 3], unit='D')
Out[15]: DatetimeIndex(['1970-01-02', '1970-01-03', '1970-01-04'], dtype='datetime64[ns]', freq=None)
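
To make the effect of origin concrete, the sketch below parses the same day counts against both reference dates and measures the gap between the results. The variable names are illustrative:

```python
import pandas as pd

days = [1, 2, 3]

# Day counts measured from a custom reference date.
custom = pd.to_datetime(days, unit="D", origin=pd.Timestamp("1960-01-01"))

# Day counts measured from the default unix epoch.
default = pd.to_datetime(days, unit="D")

# The two results differ by the distance between the two origins.
gap = default[0] - custom[0]

print(custom)
print(default)
print(gap)
```

The gap equals the number of days between 1960-01-01 and 1970-01-01, since only the reference point has changed.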

# Groupby enhancements

Strings passed to DataFrame.groupby() as the by parameter may now reference either column names or index level names. Previously, only column names could be referenced. This makes it easy to group by a column and an index level at the same time. (GH5677)

In [16]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
   ....:           ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
   ....: 

In [17]: index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])

In [18]: df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3, 3],
   ....:                    'B': np.arange(8)},
   ....:                   index=index)
   ....: 

In [19]: df
Out[19]: 
              A  B
first second      
bar   one     1  0
      two     1  1
baz   one     1  2
      two     1  3
foo   one     2  4
      two     2  5
qux   one     3  6
      two     3  7

[8 rows x 2 columns]

In [20]: df.groupby(['second', 'A']).sum()
Out[20]: 
          B
second A   
one    1  2
       2  4
       3  6
two    1  4
       2  5
       3  7

[6 rows x 1 columns]
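
For comparison, the sketch below rebuilds the frame above and groups it twice: once with the plain string for the index level, and once with the more explicit pd.Grouper(level=...) spelling. The names mixed and explicit are illustrative, and the equivalence is checked here rather than assumed:

```python
import numpy as np
import pandas as pd

arrays = [["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
          ["one", "two", "one", "two", "one", "two", "one", "two"]]
index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
df = pd.DataFrame({"A": [1, 1, 1, 1, 2, 2, 3, 3], "B": np.arange(8)},
                  index=index)

# 'second' is an index level and 'A' is a column; both are given as strings.
mixed = df.groupby(["second", "A"]).sum()

# The same grouping with the index level named explicitly via pd.Grouper.
explicit = df.groupby([pd.Grouper(level="second"), "A"]).sum()

print(mixed)
print(explicit)
```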

# Better support for compressed URLs in read_csv

The compression code was refactored (GH12688). As a result, reading dataframes from URLs in read_csv() or read_table() now supports additional compression methods: xz, bz2, and zip (GH14570). Previously, only gzip compression was supported. By default, compression of URLs and paths is now inferred from their file extensions. Additionally, support for bz2 compression in the Python 2 C engine has improved (GH14874).

In [21]: url = ('https://github.com/{repo}/raw/{branch}/{path}'
   ....:        .format(repo='pandas-dev/pandas',
   ....:                branch='master',
   ....:                path='pandas/tests/io/parser/data/salaries.csv.bz2'))
   ....: 

# default, infer compression
In [22]: df = pd.read_csv(url, sep='\t', compression='infer')

# explicitly specify compression
In [23]: df = pd.read_csv(url, sep='\t', compression='bz2')

In [24]: df.head(2)
Out[24]: 
       S  X  E  M
0  13876  1  1  1
1  11608  1  3  0

[2 rows x 4 columns]
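
The same inference applies to local paths, which makes it possible to try without network access. The sketch below writes a tiny bz2-compressed, tab-separated file to a temporary directory (the file name and contents are made up for illustration) and reads it back both ways:

```python
import bz2
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv.bz2")

    # Write a small tab-separated file with bz2 compression.
    with bz2.open(path, "wt") as fh:
        fh.write("S\tX\n13876\t1\n11608\t1\n")

    # Compression is inferred from the .bz2 extension by default.
    inferred = pd.read_csv(path, sep="\t")

    # The compression method can also be stated explicitly.
    explicit = pd.read_csv(path, sep="\t", compression="bz2")

print(inferred)
```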

# Pickle file I/O now supports compression

read_pickle(), DataFrame.to_pickle() and Series.to_pickle() can now read from and write to compressed pickle files. The compression method can be passed explicitly or inferred from the file extension. See the docs here.

In [25]: df = pd.DataFrame({'A': np.random.randn(1000),
   ....:                    'B': 'foo',
   ....:                    'C': pd.date_range('20130101', periods=1000, freq='s')})
   ....:

Using an explicit compression type:

In [26]: df.to_pickle("data.pkl.compress", compression="gzip")

In [27]: rt = pd.read_pickle("data.pkl.compress", compression="gzip")

In [28]: rt.head()
Out[28]: 
          A    B                   C
0 -1.344312  foo 2013-01-01 00:00:00
1  0.844885  foo 2013-01-01 00:00:01
2  1.075770  foo 2013-01-01 00:00:02
3 -0.109050  foo 2013-01-01 00:00:03
4  1.643563  foo 2013-01-01 00:00:04

[5 rows x 3 columns]

The default is to infer the compression type from the extension (compression='infer'):

In [29]: df.to_pickle("data.pkl.gz")

In [30]: rt = pd.read_pickle("data.pkl.gz")

In [31]: rt.head()
Out[31]: 
          A    B                   C
0 -1.344312  foo 2013-01-01 00:00:00
1  0.844885  foo 2013-01-01 00:00:01
2  1.075770  foo 2013-01-01 00:00:02
3 -0.109050  foo 2013-01-01 00:00:03
4  1.643563  foo 2013-01-01 00:00:04

[5 rows x 3 columns]

In [32]: df["A"].to_pickle("s1.pkl.bz2")

In [33]: rt = pd.read_pickle("s1.pkl.bz2")

In [34]: rt.head()
Out[34]: 
0   -1.344312
1    0.844885
2    1.075770
3   -0.109050
4    1.643563
Name: A, Length: 5, dtype: float64
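
A round trip that cleans up after itself can be sketched with a temporary directory. The file names mirror the ones above, while the small frame and the check of the gzip magic bytes are additions made here for illustration:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"A": [1.5, 2.5, 3.5], "B": "foo"})

with tempfile.TemporaryDirectory() as tmp:
    # Explicit compression: the extension gives no hint, so state the method.
    explicit_path = os.path.join(tmp, "data.pkl.compress")
    df.to_pickle(explicit_path, compression="gzip")
    explicit = pd.read_pickle(explicit_path, compression="gzip")

    # Inferred compression: the .gz extension selects gzip.
    inferred_path = os.path.join(tmp, "data.pkl.gz")
    df.to_pickle(inferred_path)
    inferred = pd.read_pickle(inferred_path)

    # A gzip stream starts with the bytes 0x1f 0x8b.
    with open(inferred_path, "rb") as fh:
        magic = fh.read(2)

print(explicit.equals(df), inferred.equals(df), magic)
```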

# UInt64 support improved

Pandas has significantly improved support for operations involving unsigned, or purely non-negative, integers. Previously, handling these integers would result in improper rounding or data-type casting, leading to incorrect results. Notably, a new numerical index, UInt64Index, has been created. (GH14937)

In [35]: idx = pd.UInt64Index([1, 2, 3])

In [36]: df = pd.DataFrame({'A': ['a', 'b', 'c']}, index=idx)

In [37]: df.index
Out[37]: UInt64Index([1, 2, 3], dtype='uint64')
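
The kind of rounding problem this addresses shows up with values above the int64 maximum. The sketch below is an illustration under one assumption: it requests the dtype through pd.Index(..., dtype='uint64') rather than naming the UInt64Index class directly, a spelling chosen here for portability across pandas versions:

```python
import pandas as pd

# One more than 2**63, so it does not fit in a signed 64-bit integer.
big = 2**63 + 1

# Passing through float64 silently rounds the value.
rounded = int(float(big))

# An unsigned 64-bit index keeps the value exactly.
idx = pd.Index([1, big], dtype="uint64")

print(rounded == big)
print(idx.dtype, int(idx[1]) == big)
```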

# GroupBy on categoricals

In previous versions, .groupby(..., sort=False) would fail with a ValueError when grouping on a categorical series with some categories not appearing in the data. (GH13179)

In [38]: chromosomes = np.r_[np.arange(1, 23).astype(str), ['X', 'Y']]

In [39]: df = pd.DataFrame({
   ....:     'A': np.random.randint(100),
   ....:     'B': np.random.randint(100),
   ....:     'C': np.random.randint(100),
   ....:     'chromosomes': pd.Categorical(np.random.choice(chromosomes, 100),
   ....:                                   categories=chromosomes,
   ....:                                   ordered=True)})
   ....: 

In [40]: df
Out[40]: 
     A   B   C chromosomes
0   87  22  81           4
1   87  22  81          13
2   87  22  81          22
3   87  22  81           2
4   87  22  81           6
..  ..  ..  ..         ...
95  87  22  81           8
96  87  22  81          11
97  87  22  81           X
98  87  22  81           1
99  87  22  81          19

[100 rows x 4 columns]

Previous behavior:

In [3]: df[df.chromosomes != '1'].groupby('chromosomes', sort=False).sum()
---------------------------------------------------------------------------
ValueError: items in new_categories are not the same as in old categories

New behavior:

In [41]: df[df.chromosomes != '1'].groupby('chromosomes', sort=False).sum()
Out[41]: 
               A    B    C
chromosomes               
2            348   88  324
3            348   88  324
4            348   88  324
5            261   66  243
6            174   44  162
...          ...  ...  ...
22           348   88  324
X            348   88  324
Y            435  110  405
1              0    0    0
21             0    0    0

[24 rows x 3 columns]

# Table schema output

The new orient 'table' for DataFrame.to_json() will generate a Table Schema compatible string representation of the data.

In [42]: df = pd.DataFrame(
   ....:     {'A': [1, 2, 3],
   ....:      'B': ['a', 'b', 'c'],
   ....:      'C': pd.date_range('2016-01-01', freq='d', periods=3)},
   ....:     index=pd.Index(range(3), name='idx'))
   ....: 

In [43]: df
Out[43]: 
     A  B          C
idx                 
0    1  a 2016-01-01
1    2  b 2016-01-02
2    3  c 2016-01-03

[3 rows x 3 columns]

In [44]: df.to_json(orient='table')
Out[44]: '{"schema": {"fields":[{"name":"idx","type":"integer"},{"name":"A","type":"integer"},{"name":"B","type":"string"},{"name":"C","type":"datetime"}],"primaryKey":["idx"],"pandas_version":"0.20.0"}, "data": [{"idx":0,"A":1,"B":"a","C":"2016-01-01T00:00:00.000Z"},{"idx":1,"A":2,"B":"b","C":"2016-01-02T00:00:00.000Z"},{"idx":2,"A":3,"B":"c","C":"2016-01-03T00:00:00.000Z"}]}'
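
Since the result is an ordinary JSON string, it can be inspected with the standard json module. The sketch below uses a smaller frame with only numeric columns (a simplification made here) and pulls out the schema fields and primary key:

```python
import json

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "C": [0.5, 1.5, 2.5]},
                  index=pd.Index(range(3), name="idx"))

# Parse the Table Schema payload back into plain Python objects.
payload = json.loads(df.to_json(orient="table"))

# Map each field name to its declared Table Schema type.
fields = {f["name"]: f["type"] for f in payload["schema"]["fields"]}

print(fields)
print(payload["schema"]["primaryKey"])
print(payload["data"][0])
```

The payload has two top-level keys: schema describes the fields and the primary key, and data holds one record per row.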

See IO: Table Schema for more information.

Additionally, the repr for DataFrame and Series can now publish this JSON Table schema representation of the Series or DataFrame if you are using IPython (or another frontend like nteract using the Jupyter messaging protocol). This gives frontends like the Jupyter notebook and nteract more flexibility in how they display pandas objects, since they have more information about the data. You must enable this by setting the display.html.table_schema option to True.

# SciPy sparse matrix from/to SparseDataFrame

Pandas now supports creating sparse dataframes directly from scipy.sparse.spmatrix instances. See the documentation for more information. (GH4343)

All sparse formats are supported, but matrices that are not in COOrdinate format will be converted, copying data as needed.

In [45]: from scipy.sparse import csr_matrix

In [46]: arr = np.random.random(size=(1000, 5))

In [47]: arr[arr < .9] = 0

In [48]: sp_arr = csr_matrix(arr)

In [49]: sp_arr
Out[49]: 
<1000x5 sparse matrix of type '<class 'numpy.float64'>'
	with 501 stored elements in Compressed Sparse Row format>

In [50]: sdf = pd.SparseDataFrame(sp_arr)

In [51]: sdf
Out[51]: 
      0   1         2   3         4
0   NaN NaN  0.977426 NaN       NaN
1   NaN NaN       NaN NaN  0.969340
2   NaN NaN       NaN NaN       NaN
3   NaN NaN       NaN NaN       NaN
4   NaN NaN       NaN NaN       NaN
..   ..  ..       ...  ..       ...
995 NaN NaN       NaN NaN  0.917524
996 NaN NaN       NaN NaN       NaN
997 NaN NaN       NaN NaN  0.968178
998 NaN NaN       NaN NaN  0.901563
999 NaN NaN       NaN NaN       NaN

[1000 rows x 5 columns]

To convert a SparseDataFrame back to sparse SciPy matrix in COO format, you can use:

In [52]: sdf.to_coo()
Out[52]: 
<1000x5 sparse matrix of type '<class 'numpy.float64'>'
	with 501 stored elements in COOrdinate format>

# Excel output for styled DataFrames

Experimental support has been added to export DataFrame.style formats to Excel using the openpyxl engine. (GH15530)

For example, after running the following, styled.xlsx renders as below:

In [53]: np.random.seed(24)

In [54]: df = pd.DataFrame({'A': np.linspace(1, 10, 10)})

In [55]: df = pd.concat([df, pd.DataFrame(np.random.RandomState(24).randn(10, 4),
   ....:                                  columns=list('BCDE'))],
   ....:                axis=1)
   ....: 

In [56]: df.iloc[0, 2] = np.nan

In [57]: df
Out[57]: 
      A         B         C         D         E
0   1.0  1.329212       NaN -0.316280 -0.990810
1   2.0 -1.070816 -1.438713  0.564417  0.295722
2   3.0 -1.626404  0.219565  0.678805  1.889273
3   4.0  0.961538  0.104011 -0.481165  0.850229
4   5.0  1.453425  1.057737  0.165562  0.515018
5   6.0 -1.336936  0.562861  1.392855 -0.063328
6   7.0  0.121668  1.207603 -0.002040  1.627796
7   8.0  0.354493  1.037528 -0.385684  0.519818
8   9.0  1.686583 -1.325963  1.428984 -2.089354
9  10.0 -0.129820  0.631523 -0.586538  0.290720

[10 rows x 5 columns]

In [58]: styled = (df.style
   ....:           .applymap(lambda val: 'color: %s' % ('red' if val < 0 else 'black'))
   ....:           .highlight_max())
   ....: 

In [59]: styled.to_excel('styled.xlsx', engine='openpyxl')

[Figure: styled.xlsx as rendered in Excel]

See the Style documentation for more detail.

# IntervalIndex

pandas has gained an IntervalIndex with its own dtype, interval, as well as the Interval scalar type. These allow first-class support for interval notation, specifically as a return type for the categories in cut() and qcut(). The IntervalIndex allows some unique indexing, see the docs. (GH7640, GH8625)

Warning

These indexing behaviors of the IntervalIndex are provisional and may change in a future version of pandas. Feedback on usage is welcome.

Previous behavior:

The returned categories were strings, representing Intervals.

In [1]: c = pd.cut(range(4), bins=2)

In [2]: c
Out[2]:
[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3], (1.5, 3]]
Categories (2, object): [(-0.003, 1.5] < (1.5, 3]]

In [3]: c.categories
Out[3]: Index(['(-0.003, 1.5]', '(1.5, 3]'], dtype='object')

New behavior:

In [60]: c = pd.cut(range(4), bins=2)

In [61]: c
Out[61]: 
[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]

In [62]: c.categories
Out[62]: 
IntervalIndex([(-0.003, 1.5], (1.5, 3.0]],
              closed='right',
              dtype='interval[float64]')

Furthermore, this allows one to bin other data with these same bins, with NaN representing a missing value similar to other dtypes.

In [63]: pd.cut([0, 3, 5, 1], bins=c.categories)
Out[63]: 
[(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]]
Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]
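
The re-binning step can be checked programmatically through the category codes, where -1 marks a value that falls outside every bin. This is a small sketch of the session above; the variable names are illustrative:

```python
import pandas as pd

# Bin four integers into two equal-width intervals.
c = pd.cut(range(4), bins=2)
bins = c.categories  # an IntervalIndex, closed on the right

# Reuse the same bins for new data; 5 lies outside both intervals.
rebinned = pd.cut([0, 3, 5, 1], bins=bins)

# Interval scalars support membership tests.
first = bins[0]
print(0 in first, 3 in first)
print(rebinned.codes)
```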

An IntervalIndex can also be used in Series and DataFrame as the index.

In [64]: df = pd.DataFrame({'A': range(4),
   ....:                    'B': pd.cut([0, 3, 1, 1], bins=c.categories)
   ....:                    }).set_index('B')
   ....: 

In [65]: df
Out[65]: 
               A
B               
(-0.003, 1.5]  0
(1.5, 3.0]     1
(-0.003, 1.5]  2
(-0.003, 1.5]  3

[4 rows x 1 columns]

Selecting via a specific interval:

In [66]: df.loc[pd.Interval(1.5, 3.0)]
Out[66]: 
A    1
Name: (1.5, 3.0], Length: 1, dtype: int64

Selecting via a scalar value that is contained in the intervals.

In [67]: df.loc[0]
Out[67]: 
               A
B               
(-0.003, 1.5]  0
(-0.003, 1.5]  2
(-0.003, 1.5]  3

[3 rows x 1 columns]

# Other enhancements

# Backwards incompatible API changes

# Possible incompatibility for HDF5 formats created with pandas < 0.13.0

pd.TimeSeries was officially deprecated in 0.17.0, though it had already been an alias since 0.13.0. It has been dropped in favor of pd.Series. (GH15098)

This may cause HDF5 files that were created in prior versions to become unreadable if pd.TimeSeries was used. This is most likely to be the case for pandas < 0.13.0. If you find yourself in this situation, you can use a recent prior version of pandas to read in your HDF5 files, then write them out again after applying the procedure below.

In [2]: s = pd.TimeSeries([1, 2, 3], index=pd.date_range('20130101', periods=3))

In [3]: s
Out[3]:
2013-01-01    1
2013-01-02    2
2013-01-03    3
Freq: D, dtype: int64

In [4]: type(s)
Out[4]: pandas.core.series.TimeSeries

In [5]: s = pd.Series(s)

In [6]: s
Out[6]:
2013-01-01    1
2013-01-02    2
2013-01-03    3
Freq: D, dtype: int64

In [7]: type(s)
Out[7]: pandas.core.series.Series

# Map on Index types now returns other Index types

map on an Index now returns an Index, not a numpy array. (GH12766)

In [68]: idx = pd.Index([1, 2])

In [69]: idx
Out[69]: Int64Index([1, 2], dtype='int64')

In [70]: mi = pd.MultiIndex.from_tuples([(1, 2), (2, 4)])

In [71]: mi
Out[71]: 
MultiIndex([(1, 2),
            (2, 4)],
           )

Previous behavior:

In [5]: idx.map(lambda x: x * 2)
Out[5]: array([2, 4])

In [6]: idx.map(lambda x: (x, x * 2))
Out[6]: array([(1, 2), (2, 4)], dtype=object)

In [7]: mi.map(lambda x: x)
Out[7]: array([(1, 2), (2, 4)], dtype=object)

In [8]: mi.map(lambda x: x[0])
Out[8]: array([1, 2])

New behavior:

In [72]: idx.map(lambda x: x * 2)
Out[72]: Int64Index([2, 4], dtype='int64')

In [73]: idx.map(lambda x: (x, x * 2))
Out[73]: 
MultiIndex([(1, 2),
            (2, 4)],
           )

In [74]: mi.map(lambda x: x)
Out[74]: 
MultiIndex([(1, 2),
            (2, 4)],
           )

In [75]: mi.map(lambda x: x[0])
Out[75]: Int64Index([1, 2], dtype='int64')
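
Code that needs to work regardless of the exact Index subclass can check the return type rather than the repr. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

idx = pd.Index([1, 2])

# A scalar-returning function yields a flat Index.
doubled = idx.map(lambda x: x * 2)

# A tuple-returning function yields a MultiIndex.
pairs = idx.map(lambda x: (x, x * 2))

print(type(doubled).__name__, doubled.tolist())
print(type(pairs).__name__, pairs.tolist())
```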

map on a Series with datetime64 values may return int64 dtypes rather than int32:

In [76]: s = pd.Series(pd.date_range('2011-01-02T00:00', '2011-01-02T02:00', freq='H')
   ....:               .tz_localize('Asia/Tokyo'))
   ....: 

In [77]: s
Out[77]: 
0   2011-01-02 00:00:00+09:00
1   2011-01-02 01:00:00+09:00
2   2011-01-02 02:00:00+09:00
Length: 3, dtype: datetime64[ns, Asia/Tokyo]

Previous behavior:

In [9]: s.map(lambda x: x.hour)
Out[9]:
0    0
1    1
2    2
dtype: int32

New behavior:

In [78]: s.map(lambda x: x.hour)
Out[78]: 
0    0
1    1
2    2
Length: 3, dtype: int64

# Accessing datetime fields of Index now returns Index

The datetime-related attributes (see here for an overview) of DatetimeIndex, PeriodIndex and TimedeltaIndex previously returned numpy arrays. They will now return a new Index object, except in the case of a boolean field, where the result will still be a boolean ndarray. (GH15022)

Previous behavior:

In [1]: idx = pd.date_range("2015-01-01", periods=5, freq='10H')

In [2]: idx.hour
Out[2]: array([ 0, 10, 20,  6, 16], dtype=int32)

New behavior:

In [79]: idx = pd.date_range("2015-01-01", periods=5, freq='10H')

In [80]: idx.hour
Out[80]: Int64Index([0, 10, 20, 6, 16], dtype='int64')

This has the advantage that specific Index methods are still available on the result. On the other hand, this might have backward incompatibilities: e.g. compared to numpy arrays, Index objects are not mutable. To get the original ndarray, you can always convert explicitly using np.asarray(idx.hour).
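
The three cases described above (an Index for numeric fields, an ndarray for boolean fields, and explicit conversion back to an ndarray) can be sketched as follows. The timestamps are spelled out explicitly here instead of using date_range, purely for illustration:

```python
import numpy as np
import pandas as pd

idx = pd.DatetimeIndex(["2015-01-01 00:00",
                        "2015-01-01 10:00",
                        "2015-01-01 20:00"])

# A numeric field comes back as an Index, so Index methods are available.
hours = idx.hour

# A boolean field is still a plain boolean ndarray.
month_start = idx.is_month_start

# Convert explicitly when a mutable ndarray is needed.
hours_array = np.asarray(hours)

print(type(hours).__name__, hours.tolist())
print(type(month_start).__name__, type(hours_array).__name__)
```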

# pd.unique will now be consistent with extension types

In prior versions, using Series.unique() and pandas.unique() on Categorical and tz-aware data-types would yield different return types. These are now made consistent. (GH15903)

  • Datetime tz-aware

Previous behavior:

# Series
In [5]: pd.Series([pd.Timestamp('20160101', tz='US/Eastern'),
   ...:            pd.Timestamp('20160101', tz='US/Eastern')]).unique()
Out[5]: array([Timestamp('2016-01-01 00:00:00-0500', tz='US/Eastern')], dtype=object)

In [6]: pd.unique(pd.Series([pd.Timestamp('20160101', tz='US/Eastern'),
   ...:                      pd.Timestamp('20160101', tz='US/Eastern')]))
Out[6]: array(['2016-01-01T05:00:00.000000000'], dtype='datetime64[ns]')

# Index
In [7]: pd.Index([pd.Timestamp('20160101', tz='US/Eastern'),
   ...:           pd.Timestamp('20160101', tz='US/Eastern')]).unique()
Out[7]: DatetimeIndex(['2016-01-01 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None)

In [8]: pd.unique([pd.Timestamp('20160101', tz='US/Eastern'),
   ...:            pd.Timestamp('20160101', tz='US/Eastern')])
Out[8]: array(['2016-01-01T05:00:00.000000000'], dtype='datetime64[ns]')

New behavior:

# Series, returns an array of Timestamp tz-aware
In [81]: pd.Series([pd.Timestamp(r'20160101', tz=r'US/Eastern'),
   ....:            pd.Timestamp(r'20160101', tz=r'US/Eastern')]).unique()
   ....: 
Out[81]: 
<DatetimeArray>
['2016-01-01 00:00:00-05:00']
Length: 1, dtype: datetime64[ns, US/Eastern]

In [82]: pd.unique(pd.Series([pd.Timestamp('20160101', tz='US/Eastern'),
   ....:           pd.Timestamp('20160101', tz='US/Eastern')]))
   ....: 
Out[82]: 
<DatetimeArray>
['2016-01-01 00:00:00-05:00']
Length: 1, dtype: datetime64[ns, US/Eastern]

# Index, returns a DatetimeIndex
In [83]: pd.Index([pd.Timestamp('20160101', tz='US/Eastern'),
   ....:           pd.Timestamp('20160101', tz='US/Eastern')]).unique()
   ....: 
Out[83]: DatetimeIndex(['2016-01-01 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None)

In [84]: pd.unique(pd.Index([pd.Timestamp('20160101', tz='US/Eastern'),
   ....:                     pd.Timestamp('20160101', tz='US/Eastern')]))
   ....: 
Out[84]: DatetimeIndex(['2016-01-01 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None)

  • Categoricals

Previous behavior:

In [1]: pd.Series(list('baabc'), dtype='category').unique()
Out[1]:
[b, a, c]
Categories (3, object): [b, a, c]

In [2]: pd.unique(pd.Series(list('baabc'), dtype='category'))
Out[2]: array(['b', 'a', 'c'], dtype=object)

New behavior:

# returns a Categorical
In [85]: pd.Series(list('baabc'), dtype='category').unique()
Out[85]: 
[b, a, c]
Categories (3, object): [b, a, c]

In [86]: pd.unique(pd.Series(list('baabc'), dtype='category'))
Out[86]: 
[b, a, c]
Categories (3, object): [b, a, c]
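
A quick way to confirm the consistency for categorical data is to compare the types and contents of both results. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

s = pd.Series(list("baabc"), dtype="category")

# Method and top-level function now agree on the return type.
from_method = s.unique()
from_function = pd.unique(s)

print(type(from_method).__name__, list(from_method))
print(type(from_function).__name__, list(from_function))
```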

# S3 file handling

pandas now uses s3fs for handling S3 connections. This shouldn’t break any code. However, since s3fs is not a required dependency, you will need to install it separately, like boto in prior versions of pandas. (GH11915)

# Partial string indexing changes

DatetimeIndex Partial String Indexing now works as an exact match, provided that string resolution coincides with index resolution, including a case when both are seconds (GH14826). See Slice vs. Exact Match for details.

In [87]: df = pd.DataFrame({'a': [1, 2, 3]}, pd.DatetimeIndex(['2011-12-31 23:59:59',
   ....:                                                       '2012-01-01 00:00:00',
   ....:                                                       '2012-01-01 00:00:01']))
   ....:

Previous behavior:

In [4]: df['2011-12-31 23:59:59']
Out[4]:
                       a
2011-12-31 23:59:59  1

In [5]: df['a']['2011-12-31 23:59:59']
Out[5]:
2011-12-31 23:59:59    1
Name: a, dtype: int64

New behavior:

In [4]: df['2011-12-31 23:59:59']
KeyError: '2011-12-31 23:59:59'

In [5]: df['a']['2011-12-31 23:59:59']
Out[5]: 1
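
The distinction between an exact match and a slice depends on how the resolution of the string compares with that of the index. The sketch below uses .loc on a Series (a choice made here for illustration) to contrast a second-resolution string with a minute-resolution one, and shows the frame lookup failing:

```python
import pandas as pd

index = pd.DatetimeIndex(["2011-12-31 23:59:59",
                          "2012-01-01 00:00:00",
                          "2012-01-01 00:00:01"])
s = pd.Series([1, 2, 3], index=index)
df = pd.DataFrame({"a": [1, 2, 3]}, index=index)

# Same resolution as the index (seconds): exact match, returns a scalar.
exact = s.loc["2011-12-31 23:59:59"]

# Coarser resolution (minutes): treated as a slice, returns a Series.
coarser = s.loc["2011-12-31 23:59"]

# On a DataFrame, [] with the exact string is not a row lookup.
try:
    df["2011-12-31 23:59:59"]
    frame_lookup_failed = False
except KeyError:
    frame_lookup_failed = True

print(exact, len(coarser), frame_lookup_failed)
```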

# Concat of different float dtypes will not automatically upcast

Previously, concat of multiple objects with different float dtypes would automatically upcast results to a dtype of float64. Now the smallest acceptable dtype will be used. (GH13247)

In [88]: df1 = pd.DataFrame(np.array([1.0], dtype=np.float32, ndmin=2))

In [89]: df1.dtypes
Out[89]: 
0    float32
Length: 1, dtype: object

In [90]: df2 = pd.DataFrame(np.array([np.nan], dtype=np.float32, ndmin=2))

In [91]: df2.dtypes
Out[91]: 
0    float32
Length: 1, dtype: object

Previous behavior:

In [7]: pd.concat([df1, df2]).dtypes
Out[7]:
0    float64
dtype: object

New behavior:

In [92]: pd.concat([df1, df2]).dtypes
Out[92]: 
0    float32
Length: 1, dtype: object
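
To see what "smallest acceptable dtype" means when the inputs actually differ, the sketch below adds a float16 frame, which is an extra case not shown in the session above:

```python
import numpy as np
import pandas as pd

f32 = pd.DataFrame(np.array([[1.0]], dtype=np.float32))
f32_nan = pd.DataFrame(np.array([[np.nan]], dtype=np.float32))
f16 = pd.DataFrame(np.array([[2.0]], dtype=np.float16))

# Same dtype on both sides: the dtype is preserved.
same = pd.concat([f32, f32_nan]).dtypes.iloc[0]

# float16 with float32: float32 is the smallest dtype that holds both.
mixed = pd.concat([f16, f32]).dtypes.iloc[0]

print(same, mixed)
```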

# Pandas Google BigQuery support has moved

pandas has split off Google BigQuery support into a separate package, pandas-gbq. You can conda install pandas-gbq -c conda-forge or pip install pandas-gbq to get it. The functionality of read_gbq() and DataFrame.to_gbq() remains the same with the currently released version of pandas-gbq=0.1.4. Documentation is now hosted here. (GH15347)

# Memory usage for Index is more accurate

In previous versions, calling .memory_usage() on a pandas structure that has an index would only include the actual index values and not the structures that facilitate fast indexing. The difference will generally be noticeable for Index and MultiIndex and less so for other index types. (GH15237)

Previous behavior:

In [8]: index = pd.Index(['foo', 'bar', 'baz'])

In [9]: index.memory_usage(deep=True)
Out[9]: 180

In [10]: index.get_loc('foo')
Out[10]: 0

In [11]: index.memory_usage(deep=True)
Out[11]: 180

New behavior:

In [8]: index = pd.Index(['foo', 'bar', 'baz'])

In [9]: index.memory_usage(deep=True)
Out[9]: 180

In [10]: index.get_loc('foo')
Out[10]: 0

In [11]: index.memory_usage(deep=True)
Out[11]: 260

# DataFrame.sort_index changes

In certain cases, calling .sort_index() on a MultiIndexed DataFrame would return the same DataFrame without seeming to sort. This would happen with lexsorted but non-monotonic levels. (GH15622, GH15687, GH14015, GH13431, GH15797)

This is unchanged from prior versions, but shown for illustration purposes:

In [93]: df = pd.DataFrame(np.arange(6), columns=['value'],
   ....:                   index=pd.MultiIndex.from_product([list('BA'), range(3)]))
   ....: 

In [94]: df
Out[94]: 
     value
B 0      0
  1      1
  2      2
A 0      3
  1      4
  2      5

[6 rows x 1 columns]

In [95]: df.index.is_lexsorted()
Out[95]: False

In [96]: df.index.is_monotonic
Out[96]: False

Sorting works as expected:

In [97]: df.sort_index()
Out[97]: 
     value
A 0      3
  1      4
  2      5
B 0      0
  1      1
  2      2

[6 rows x 1 columns]

In [98]: df.sort_index().index.is_lexsorted()
Out[98]: True

In [99]: df.sort_index().index.is_monotonic
Out[99]: True

However, this example, which has a non-monotonic 2nd level, doesn’t behave as desired.

In [100]: df = pd.DataFrame({'value': [1, 2, 3, 4]},
   .....:                   index=pd.MultiIndex([['a', 'b'], ['bb', 'aa']],
   .....:                                       [[0, 0, 1, 1], [0, 1, 0, 1]]))
   .....: 

In [101]: df
Out[101]: 
      value
a bb      1
  aa      2
b bb      3
  aa      4

[4 rows x 1 columns]

Previous behavior:

In [11]: df.sort_index()
Out[11]:
      value
a bb      1
  aa      2
b bb      3
  aa      4

In [14]: df.sort_index().index.is_lexsorted()
Out[14]: True

In [15]: df.sort_index().index.is_monotonic
Out[15]: False

New behavior:

In [102]: df.sort_index()
Out[102]: 
      value
a aa      2
  bb      1
b aa      4
  bb      3

[4 rows x 1 columns]

In [103]: df.sort_index().index.is_lexsorted()
Out[103]: True

In [104]: df.sort_index().index.is_monotonic
Out[104]: True
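
The new behavior can be verified without reading the repr by comparing the sorted labels directly. The sketch below rebuilds the frame above, whose second level is stored in the non-sorted order ['bb', 'aa'], and uses is_monotonic_increasing as the check (a choice made here for illustration):

```python
import pandas as pd

# Levels are given in non-sorted order for the second level.
index = pd.MultiIndex([["a", "b"], ["bb", "aa"]],
                      [[0, 0, 1, 1], [0, 1, 0, 1]])
df = pd.DataFrame({"value": [1, 2, 3, 4]}, index=index)

sorted_df = df.sort_index()

# After sorting, the labels are in ascending order within each group.
print(sorted_df.index.tolist())
print(sorted_df["value"].tolist())
print(sorted_df.index.is_monotonic_increasing)
```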

# Groupby describe formatting

The output formatting of groupby.describe() now labels the describe() metrics in the columns instead of the index. This format is consistent with groupby.agg() when applying multiple functions at once. (GH4792)

Previous behavior:

In [1]: df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]})

In [2]: df.groupby('A').describe()
Out[2]:
                B
A
1 count  2.000000
  mean   1.500000
  std    0.707107
  min    1.000000
  25%    1.250000
  50%    1.500000
  75%    1.750000
  max    2.000000
2 count  2.000000
  mean   3.500000
  std    0.707107
  min    3.000000
  25%    3.250000
  50%    3.500000
  75%    3.750000
  max    4.000000

In [3]: df.groupby('A').agg([np.mean, np.std, np.min, np.max])
Out[3]:
     B
  mean       std amin amax
A
1  1.5  0.707107    1    2
2  3.5  0.707107    3    4

New behavior:

In [105]: df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]})

In [106]: df.groupby('A').describe()
Out[106]: 
      B                                          
  count mean       std  min   25%  50%   75%  max
A                                                
1   2.0  1.5  0.707107  1.0  1.25  1.5  1.75  2.0
2   2.0  3.5  0.707107  3.0  3.25  3.5  3.75  4.0

[2 rows x 8 columns]

In [107]: df.groupby('A').agg([np.mean, np.std, np.min, np.max])
Out[107]: 
     B                    
  mean       std amin amax
A                         
1  1.5  0.707107    1    2
2  3.5  0.707107    3    4

[2 rows x 4 columns]
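
Because the metrics now live in the second level of the columns, an individual statistic can be selected with a tuple key, and all statistics for one column with a single label. A minimal sketch using the same frame (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]})
desc = df.groupby('A').describe()

# One statistic for one column: a Series indexed by the group keys.
b_mean = desc[('B', 'mean')]

# All statistics for column 'B': a DataFrame with flat columns.
b_stats = desc['B']

print(b_mean)
print(b_stats.columns.tolist())
```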

# Window binary corr/cov operations return a MultiIndex DataFrame

A binary window operation, like .corr() or .cov(), when operating on a .rolling(..), .expanding(..), or .ewm(..) object, will now return a 2-level MultiIndexed DataFrame rather than a Panel, as Panel is now deprecated, see here. These are equivalent in function, but a MultiIndexed DataFrame enjoys more support in pandas. See the section on Windowed Binary Operations for more information. (GH15677)

In [108]: np.random.seed(1234)

In [109]: df = pd.DataFrame(np.random.rand(100, 2),
   .....:                   columns=pd.Index(['A', 'B'], name='bar'),
   .....:                   index=pd.date_range('20160101',
   .....:                                       periods=100, freq='D', name='foo'))
   .....: 

In [110]: df.tail()
Out[110]: 
bar                A         B
foo                           
2016-04-05  0.640880  0.126205
2016-04-06  0.171465  0.737086
2016-04-07  0.127029  0.369650
2016-04-08  0.604334  0.103104
2016-04-09  0.802374  0.945553

[5 rows x 2 columns]

Previous behavior:

In [2]: df.rolling(12).corr()
Out[2]:
<class 'pandas.core.panel.Panel'>
Dimensions: 100 (items) x 2 (major_axis) x 2 (minor_axis)
Items axis: 2016-01-01 00:00:00 to 2016-04-09 00:00:00
Major_axis axis: A to B
Minor_axis axis: A to B

New behavior:

In [111]: res = df.rolling(12).corr()

In [112]: res.tail()
Out[112]: 
bar                    A         B
foo        bar                    
2016-04-07 B   -0.132090  1.000000
2016-04-08 A    1.000000 -0.145775
           B   -0.145775  1.000000
2016-04-09 A    1.000000  0.119645
           B    0.119645  1.000000

[5 rows x 2 columns]

Retrieving a correlation matrix for a cross-section

In [113]: df.rolling(12).corr().loc['2016-04-07']
Out[113]: 
bar                   A        B
foo        bar                  
2016-04-07 A    1.00000 -0.13209
           B   -0.13209  1.00000

[2 rows x 2 columns]
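
Selecting by an explicit index level can make the intent clearer than partial string indexing. The following sketch uses `DataFrame.xs()` on the first index level; the variable names are illustrative, and the choice of `xs()` over `.loc` is a matter of preference rather than something the release requires.

```python
import numpy as np
import pandas as pd

np.random.seed(1234)
df = pd.DataFrame(np.random.rand(100, 2),
                  columns=pd.Index(['A', 'B'], name='bar'),
                  index=pd.date_range('20160101', periods=100,
                                      freq='D', name='foo'))

res = df.rolling(12).corr()

# Pull the 2 x 2 correlation matrix for a single date from level 0.
matrix = res.xs(pd.Timestamp('2016-04-07'), level=0)
print(matrix)
```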

# HDFStore where string comparison

In previous versions, most types could be compared to a string column in an HDFStore, usually resulting in an invalid comparison that returned an empty result frame. These comparisons will now raise a TypeError (GH15492)

In [114]: df = pd.DataFrame({'unparsed_date': ['2014-01-01', '2014-01-01']})

In [115]: df.to_hdf('store.h5', 'key', format='table', data_columns=True)

In [116]: df.dtypes
Out[116]: 
unparsed_date    object
Length: 1, dtype: object

Previous behavior:

In [4]: pd.read_hdf('store.h5', 'key', where='unparsed_date > ts')
File "<string>", line 1
  (unparsed_date > 1970-01-01 00:00:01.388552400)
                        ^
SyntaxError: invalid token

New behavior:

In [18]: ts = pd.Timestamp('2014-01-01')

In [19]: pd.read_hdf('store.h5', 'key', where='unparsed_date > ts')
TypeError: Cannot compare 2014-01-01 00:00:00 of
type <class 'pandas.tslib.Timestamp'> to string column

# Index.intersection and inner join now preserve the order of the left Index

Index.intersection() now preserves the order of the calling Index (left) instead of the other Index (right) (GH15582). This affects inner joins (DataFrame.join() and merge()) and the .align method.

  • Index.intersection

In [117]: left = pd.Index([2, 1, 0])

In [118]: left
Out[118]: Int64Index([2, 1, 0], dtype='int64')

In [119]: right = pd.Index([1, 2, 3])

In [120]: right
Out[120]: Int64Index([1, 2, 3], dtype='int64')

Previous behavior:

In [4]: left.intersection(right)
Out[4]: Int64Index([1, 2], dtype='int64')

New behavior:

In [121]: left.intersection(right)
Out[121]: Int64Index([2, 1], dtype='int64')

  • DataFrame.join and pd.merge

In [122]: left = pd.DataFrame({'a': [20, 10, 0]}, index=[2, 1, 0])

In [123]: left
Out[123]: 
    a
2  20
1  10
0   0

[3 rows x 1 columns]

In [124]: right = pd.DataFrame({'b': [100, 200, 300]}, index=[1, 2, 3])

In [125]: right
Out[125]: 
     b
1  100
2  200
3  300

[3 rows x 1 columns]

Previous behavior:

In [4]: left.join(right, how='inner')
Out[4]:
   a    b
1  10  100
2  20  200

New behavior:

In [126]: left.join(right, how='inner')
Out[126]: 
    a    b
2  20  200
1  10  100

[2 rows x 2 columns]
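
If downstream code relied on the ascending result that earlier versions happened to produce for this data, one option is to sort explicitly after the operation. The sketch below assumes that sorted order is what is wanted; it is not the only way to restore the earlier layout.

```python
import pandas as pd

left = pd.DataFrame({'a': [20, 10, 0]}, index=[2, 1, 0])
right = pd.DataFrame({'b': [100, 200, 300]}, index=[1, 2, 3])

# New behavior: the result follows the order of the calling (left) object.
common = left.index.intersection(right.index)
joined = left.join(right, how='inner')

# Sorting afterwards gives the ascending order seen in earlier versions.
joined_sorted = joined.sort_index()

print(list(common))
print(list(joined.index))
print(list(joined_sorted.index))
```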

# Pivot table always returns a DataFrame

The documentation for pivot_table() states that a DataFrame is always returned. A bug that allowed it to return a Series under certain circumstances has been fixed. (GH4386)

In [127]: df = pd.DataFrame({'col1': [3, 4, 5],
   .....:                    'col2': ['C', 'D', 'E'],
   .....:                    'col3': [1, 3, 9]})
   .....: 

In [128]: df
Out[128]: 
   col1 col2  col3
0     3    C     1
1     4    D     3
2     5    E     9

[3 rows x 3 columns]

Previous behavior:

In [2]: df.pivot_table('col1', index=['col3', 'col2'], aggfunc=np.sum)
Out[2]:
col3  col2
1     C       3
3     D       4
9     E       5
Name: col1, dtype: int64

New behavior:

In [129]: df.pivot_table('col1', index=['col3', 'col2'], aggfunc=np.sum)
Out[129]: 
           col1
col3 col2      
1    C        3
3    D        4
9    E        5

[3 rows x 1 columns]
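
Code that depended on the earlier Series result can select the single column explicitly. A minimal sketch, passing `values` by keyword and the aggregation as the string `'sum'` (both are choices made for this example, not requirements of the change):

```python
import pandas as pd

df = pd.DataFrame({'col1': [3, 4, 5],
                   'col2': ['C', 'D', 'E'],
                   'col3': [1, 3, 9]})

table = df.pivot_table(values='col1', index=['col3', 'col2'], aggfunc='sum')

# pivot_table returns a DataFrame; pick the column to get a Series back.
series = table['col1']

print(type(table).__name__)
print(series.tolist())
```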

# Other API changes

  • numexpr version is now required to be >= 2.4.6 and it will not be used at all if this requirement is not fulfilled (GH15213).
  • CParserError has been renamed to ParserError in pd.read_csv(), and the old name will be removed in the future (GH12665)
  • SparseArray.cumsum() and SparseSeries.cumsum() will now always return SparseArray and SparseSeries respectively (GH12855)
  • DataFrame.applymap() with an empty DataFrame will return a copy of the empty DataFrame instead of a Series (GH8222)
  • Series.map() now respects default values of dictionary subclasses with a __missing__ method, such as collections.Counter (GH15999)
  • .loc is now compatible with .ix in accepting iterators and NamedTuples (GH15120)
  • interpolate() and fillna() will raise a ValueError if the limit keyword argument is not greater than 0. (GH9217)
  • pd.read_csv() will now issue a ParserWarning whenever there are conflicting values provided by the dialect parameter and the user (GH14898)
  • pd.read_csv() will now raise a ValueError for the C engine if the quote character is larger than one byte (GH11592)
  • inplace arguments now require a boolean value, else a ValueError is raised (GH14189)
  • pandas.api.types.is_datetime64_ns_dtype will now report True on a tz-aware dtype, similar to pandas.api.types.is_datetime64_any_dtype
  • DataFrame.asof() will return a null-filled Series instead of the scalar NaN if a match is not found (GH15118)
  • Added specific support for copy.copy() and copy.deepcopy() functions on NDFrame objects (GH15444)
  • Series.sort_values() accepts a one-element list of bool for consistency with the behavior of DataFrame.sort_values() (GH15604)
  • .merge() and .join() on category dtype columns will now preserve the category dtype when possible (GH10409)
  • SparseDataFrame.default_fill_value will be 0 (previously nan) in the return from pd.get_dummies(..., sparse=True) (GH15594)
  • The default behaviour of Series.str.match has changed from extracting groups to matching the pattern. The extracting behaviour has been deprecated since pandas version 0.13.0 and can be done with the Series.str.extract method (GH5224). As a consequence, the as_indexer keyword is ignored (it is no longer needed to specify the new behaviour) and is deprecated.
  • NaT will now correctly report False for datetimelike boolean operations such as is_month_start (GH15781)
  • NaT will now correctly return np.nan for Timedelta and Period accessors such as days and quarter (GH15782)
  • NaT will now return NaT for tz_localize and tz_convert methods (GH15830)
  • DataFrame and Panel constructors with invalid input will now raise ValueError rather than pandas.core.common.PandasError, if called with scalar inputs and not axes. The exception PandasError is removed as well. (GH15541)
  • The exception pandas.core.common.AmbiguousIndexError is removed as it is not referenced (GH15541)
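
Two of the changes above can be illustrated briefly: `Series.map()` honoring `__missing__`, and `Series.str.match()` returning booleans while `Series.str.extract()` returns the groups. This is a sketch with made-up data.

```python
from collections import Counter

import pandas as pd

# Counter defines __missing__, so unknown keys map to 0 rather than NaN.
counts = Counter({'a': 2, 'b': 1})
mapped = pd.Series(['a', 'b', 'z']).map(counts)

# str.match tests the pattern; str.extract pulls out the capture groups.
codes = pd.Series(['a1', 'b2', 'cc'])
matched = codes.str.match(r'([a-z])(\d)')
extracted = codes.str.extract(r'([a-z])(\d)', expand=True)

print(mapped.tolist())
print(matched.tolist())
print(extracted)
```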

# Reorganization of the library: privacy changes

# Modules privacy has changed

Some formerly public python/c/c++/cython extension modules have been moved and/or renamed. These are all removed from the public API. Furthermore, the pandas.core, pandas.compat, and pandas.util top-level modules are now considered to be PRIVATE. If indicated, a deprecation warning will be issued if you reference these modules. (GH12588)

| Previous Location | New Location | Deprecated |
| --- | --- | --- |
| pandas.lib | pandas._libs.lib | X |
| pandas.tslib | pandas._libs.tslib | X |
| pandas.computation | pandas.core.computation | X |
| pandas.msgpack | pandas.io.msgpack | |
| pandas.index | pandas._libs.index | |
| pandas.algos | pandas._libs.algos | |
| pandas.hashtable | pandas._libs.hashtable | |
| pandas.indexes | pandas.core.indexes | |
| pandas.json | pandas._libs.json / pandas.io.json | X |
| pandas.parser | pandas._libs.parsers | X |
| pandas.formats | pandas.io.formats | |
| pandas.sparse | pandas.core.sparse | |
| pandas.tools | pandas.core.reshape | X |
| pandas.types | pandas.core.dtypes | X |
| pandas.io.sas.saslib | pandas.io.sas._sas | |
| pandas._join | pandas._libs.join | |
| pandas._hash | pandas._libs.hashing | |
| pandas._period | pandas._libs.period | |
| pandas._sparse | pandas._libs.sparse | |
| pandas._testing | pandas._libs.testing | |
| pandas._window | pandas._libs.window | |

Some new subpackages are created with public functionality that is not directly exposed in the top-level namespace: pandas.errors, pandas.plotting and pandas.testing (more details below). Together with pandas.api.types and certain functions in the pandas.io and pandas.tseries submodules, these are now the public subpackages.

Further changes:

# pandas.errors

We are adding a standard public module for all pandas exceptions & warnings, pandas.errors (GH14800). Previously these exceptions & warnings could be imported from pandas.core.common or pandas.io.common. These exceptions and warnings will be removed from the *.common locations in a future release. (GH15541)

The following are now part of this API:

['DtypeWarning',
 'EmptyDataError',
 'OutOfBoundsDatetime',
 'ParserError',
 'ParserWarning',
 'PerformanceWarning',
 'UnsortedIndexError',
 'UnsupportedFunctionCall']
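
As a sketch of importing from the new location, the following catches `EmptyDataError` raised by `pd.read_csv()` on empty input. The empty `StringIO` buffer and the `outcome` variable exist only to trigger and record the error.

```python
import io

import pandas as pd
from pandas.errors import EmptyDataError

try:
    pd.read_csv(io.StringIO(''))
    outcome = 'parsed'
except EmptyDataError:
    # Raised when there is nothing to parse.
    outcome = 'empty input'

print(outcome)
```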

# pandas.testing

We are adding a standard module that exposes the public testing functions in pandas.testing (GH9895). Those functions can be used when writing tests for functionality using pandas objects.

The following testing functions are now part of this API:
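
For instance, `assert_frame_equal` from `pandas.testing` can compare a computed result against an expected frame. A minimal sketch with illustrative data:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

result = pd.DataFrame({'a': [1, 2]}).assign(b=lambda d: d['a'] * 2)
expected = pd.DataFrame({'a': [1, 2], 'b': [2, 4]})

# Raises AssertionError with a description of the difference on mismatch.
assert_frame_equal(result, expected)
```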

# pandas.plotting

A new public pandas.plotting module has been added that holds plotting functionality that was previously in either pandas.tools.plotting or in the top-level namespace. See the deprecations sections for more details.

# Other Development Changes

# Deprecations

# Deprecate .ix

The .ix indexer is deprecated in favor of the stricter .iloc and .loc indexers. .ix offers a lot of magic on the inference of what the user wants to do. To wit, .ix can decide to index positionally OR via labels, depending on the data type of the index. This has caused quite a bit of user confusion over the years. The full indexing documentation is here. (GH14218)

The recommended methods of indexing are:

  • .loc if you want to index by label.
  • .iloc if you want to index by position.

Using .ix will now show a DeprecationWarning with a link to some examples of how to convert code here.

In [130]: df = pd.DataFrame({'A': [1, 2, 3],
   .....:                    'B': [4, 5, 6]},
   .....:                   index=list('abc'))
   .....: 

In [131]: df
Out[131]: 
   A  B
a  1  4
b  2  5
c  3  6

[3 rows x 2 columns]

Previous behavior, where you wish to get the 0th and the 2nd elements from the index in the ‘A’ column.

In [3]: df.ix[[0, 2], 'A']
Out[3]:
a    1
c    3
Name: A, dtype: int64

Using .loc. Here we will select the appropriate indexes from the index, then use label indexing.

In [132]: df.loc[df.index[[0, 2]], 'A']
Out[132]: 
a    1
c    3
Name: A, Length: 2, dtype: int64

Using .iloc. Here we will get the location of the ‘A’ column, then use positional indexing to select things.

In [133]: df.iloc[[0, 2], df.columns.get_loc('A')]
Out[133]: 
a    1
c    3
Name: A, Length: 2, dtype: int64
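
The ambiguity that motivated the deprecation is easiest to see with an integer index, where a label and a position can disagree. A small sketch with made-up data:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=[2, 1, 0])

by_label = s.loc[0]      # the row whose label is 0
by_position = s.iloc[0]  # the first row, whatever its label

print(by_label, by_position)
```

With .loc and .iloc the caller states which of the two is meant, instead of leaving the choice to inference.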

# Deprecate Panel

Panel is deprecated and will be removed in a future version. The recommended way to represent 3-D data is with a MultiIndex on a DataFrame via the to_frame() method or with the xarray package. Pandas provides a to_xarray() method to automate this conversion (GH13563).

In [133]: import pandas.util.testing as tm

In [134]: p = tm.makePanel()

In [135]: p
Out[135]:
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 3 (major_axis) x 4 (minor_axis)
Items axis: ItemA to ItemC
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to D

Convert to a MultiIndex DataFrame

In [136]: p.to_frame()
Out[136]:
                     ItemA     ItemB     ItemC
major      minor
2000-01-03 A      0.628776 -1.409432  0.209395
           B      0.988138 -1.347533 -0.896581
           C     -0.938153  1.272395 -0.161137
           D     -0.223019 -0.591863 -1.051539
2000-01-04 A      0.186494  1.422986 -0.592886
           B     -0.072608  0.363565  1.104352
           C     -1.239072 -1.449567  0.889157
           D      2.123692 -0.414505 -0.319561
2000-01-05 A      0.952478 -2.147855 -1.473116
           B     -0.550603 -0.014752 -0.431550
           C      0.139683 -1.195524  0.288377
           D      0.122273 -1.425795 -0.619993

[12 rows x 3 columns]

Convert to an xarray DataArray

In [137]: p.to_xarray()
Out[137]:
<xarray.DataArray (items: 3, major_axis: 3, minor_axis: 4)>
array([[[ 0.628776,  0.988138, -0.938153, -0.223019],
        [ 0.186494, -0.072608, -1.239072,  2.123692],
        [ 0.952478, -0.550603,  0.139683,  0.122273]],

       [[-1.409432, -1.347533,  1.272395, -0.591863],
        [ 1.422986,  0.363565, -1.449567, -0.414505],
        [-2.147855, -0.014752, -1.195524, -1.425795]],

       [[ 0.209395, -0.896581, -0.161137, -1.051539],
        [-0.592886,  1.104352,  0.889157, -0.319561],
        [-1.473116, -0.43155 ,  0.288377, -0.619993]]])
Coordinates:
  * items       (items) object 'ItemA' 'ItemB' 'ItemC'
  * major_axis  (major_axis) datetime64[ns] 2000-01-03 2000-01-04 2000-01-05
  * minor_axis  (minor_axis) object 'A' 'B' 'C' 'D'
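
For data that starts out as a 3-D NumPy array, the same MultiIndex layout can be built without going through Panel. The following is a sketch under the assumption that the array axes are ordered (items, major, minor); the labels and values are illustrative.

```python
import numpy as np
import pandas as pd

items = ['ItemA', 'ItemB', 'ItemC']
major = pd.date_range('2000-01-03', periods=3)
minor = list('ABCD')

# Assumed axis order: items x major x minor.
values = np.arange(36, dtype=float).reshape(3, 3, 4)

# Rows are every (major, minor) pair; each item becomes one column.
index = pd.MultiIndex.from_product([major, minor], names=['major', 'minor'])
frame = pd.DataFrame(values.reshape(len(items), -1).T,
                     index=index, columns=items)

print(frame.head())
```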

# Deprecate groupby.agg() with a dictionary when renaming

The .groupby(..).agg(..), .rolling(..).agg(..), and .resample(..).agg(..) syntax can accept a variety of inputs, including scalars, lists, and a dict of column names to scalars or lists. This provides a useful syntax for constructing multiple (potentially different) aggregations.

However, .agg(..) can also accept a dict that allows ‘renaming’ of the result columns. This is a complicated and confusing syntax, as well as not consistent between Series and DataFrame. We are deprecating this ‘renaming’ functionality.

  • We are deprecating passing a dict to a grouped/rolled/resampled Series. This allowed one to rename the resulting aggregation, but this had a completely different meaning than passing a dictionary to a grouped DataFrame, which accepts column-to-aggregations.
  • We are deprecating passing a dict-of-dicts to a grouped/rolled/resampled DataFrame in a similar manner.

This is an illustrative example:

In [134]: df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
   .....:                    'B': range(5),
   .....:                    'C': range(5)})
   .....: 

In [135]: df
Out[135]: 
   A  B  C
0  1  0  0
1  1  1  1
2  1  2  2
3  2  3  3
4  2  4  4

[5 rows x 3 columns]

Here is a typical, useful syntax for computing different aggregations for different columns. We aggregate with a dict that maps each column to a function, so each specified column gets its own aggregation. Passing a list of functions per column (dict-to-list) works the same way and returns a MultiIndex for the columns. Neither form is deprecated.

In [136]: df.groupby('A').agg({'B': 'sum', 'C': 'min'})
Out[136]: 
   B  C
A      
1  3  0
2  7  3

[2 rows x 2 columns]
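
For comparison, passing a list of functions per column is the dict-to-list form, and it yields MultiIndex columns. A minimal sketch on the same frame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': range(5),
                   'C': range(5)})

res = df.groupby('A').agg({'B': ['sum', 'mean'], 'C': ['min']})

# Columns are (column, function) pairs.
print(res.columns.tolist())
print(res)
```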

Here’s an example of the first deprecation, passing a dict to a grouped Series. This is a combination aggregation & renaming:

In [6]: df.groupby('A').B.agg({'foo': 'count'})
FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version

Out[6]:
   foo
A
1    3
2    2

You can accomplish the same operation more idiomatically with:

In [137]: df.groupby('A').B.agg(['count']).rename(columns={'count': 'foo'})
Out[137]: 
   foo
A     
1    3
2    2

[2 rows x 1 columns]

Here’s an example of the second deprecation, passing a dict-of-dict to a grouped DataFrame:

In [23]: (df.groupby('A')
    ...:    .agg({'B': {'foo': 'sum'}, 'C': {'bar': 'min'}})
    ...:  )
FutureWarning: using a dict with renaming is deprecated and
will be removed in a future version

Out[23]:
     B   C
   foo bar
A
1   3   0
2   7   3

You can accomplish nearly the same by:

In [138]: (df.groupby('A')
   .....:    .agg({'B': 'sum', 'C': 'min'})
   .....:    .rename(columns={'B': 'foo', 'C': 'bar'})
   .....:  )
   .....: 
Out[138]: 
   foo  bar
A          
1    3    0
2    7    3

[2 rows x 2 columns]

# Deprecate .plotting

The pandas.tools.plotting module has been deprecated in favor of the top-level pandas.plotting module. All the public plotting functions are now available from pandas.plotting (GH12548).

Furthermore, the top-level pandas.scatter_matrix and pandas.plot_params are deprecated. Users can import these from pandas.plotting as well.

Previous script:

pd.tools.plotting.scatter_matrix(df)
pd.scatter_matrix(df)

Should be changed to:

pd.plotting.scatter_matrix(df)
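
Code that must run on both sides of this change could guard the import. This is a sketch; the fallback branch is only relevant for versions that predate pandas.plotting.

```python
try:
    # Public location introduced in this release.
    from pandas.plotting import scatter_matrix
except ImportError:
    # Older releases only provide the pandas.tools.plotting location.
    from pandas.tools.plotting import scatter_matrix

print(scatter_matrix.__name__)
```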

# Other deprecations

  • SparseArray.to_dense() has deprecated the fill parameter, as that parameter was not being respected (GH14647)
  • SparseSeries.to_dense() has deprecated the sparse_only parameter (GH14647)
  • Series.repeat() has deprecated the reps parameter in favor of repeats (GH12662)
  • The Series constructor and .astype method have deprecated accepting timestamp dtypes without a frequency (e.g. np.datetime64) for the dtype parameter (GH15524)
  • Index.repeat() and MultiIndex.repeat() have deprecated the n parameter in favor of repeats (GH12662)
  • Categorical.searchsorted() and Series.searchsorted() have deprecated the v parameter in favor of value (GH12662)
  • TimedeltaIndex.searchsorted(), DatetimeIndex.searchsorted(), and PeriodIndex.searchsorted() have deprecated the key parameter in favor of value (GH12662)
  • DataFrame.astype() has deprecated the raise_on_error parameter in favor of errors (GH14878)
  • Series.sortlevel and DataFrame.sortlevel have been deprecated in favor of Series.sort_index and DataFrame.sort_index (GH15099)
  • Importing concat from pandas.tools.merge has been deprecated in favor of imports from the pandas namespace. This should only affect explicit imports (GH15358)
  • Series/DataFrame/Panel.consolidate() has been deprecated as a public method. (GH15483)
  • The as_indexer keyword of Series.str.match() has been deprecated (ignored keyword) (GH15257).
  • The following top-level pandas functions have been deprecated and will be removed in a future version (GH13790, GH15940):
      • pd.pnow(), replaced by Period.now()
      • pd.Term, is removed, as it is not applicable to user code. Instead use in-line string expressions in the where clause when searching in HDFStore
      • pd.Expr, is removed, as it is not applicable to user code.
      • pd.match(), is removed.
      • pd.groupby(), replaced by using the .groupby() method directly on a Series/DataFrame
      • pd.get_store(), replaced by a direct call to pd.HDFStore(...)
  • is_any_int_dtype, is_floating_dtype, and is_sequence are deprecated from pandas.api.types (GH16042)
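
A few of the keyword and method renames listed above, sketched with illustrative data. Only the replacement spellings are shown.

```python
import pandas as pd

s = pd.Series([1, 2])
idx = pd.Index(['x', 'y'])

# 'repeats' replaces the deprecated 'reps' (Series) and 'n' (Index).
s_rep = s.repeat(repeats=2)
idx_rep = idx.repeat(repeats=2)

# sort_index(level=...) replaces sortlevel().
mi = pd.MultiIndex.from_tuples([('b', 1), ('a', 2), ('a', 1)])
frame = pd.DataFrame({'v': [10, 20, 30]}, index=mi)
by_first = frame.sort_index(level=0)

print(s_rep.tolist(), idx_rep.tolist())
print(by_first)
```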

# Removal of prior version deprecations/changes

# Performance improvements

# Bug fixes

# Conversion

# Indexing

# I/O

# Plotting

# Groupby/resample/rolling

# Sparse

# Reshaping

# Numeric

# Other

# Contributors

A total of 204 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.

  • Adam J. Stewart +
  • Adrian +
  • Ajay Saxena
  • Akash Tandon +
  • Albert Villanova del Moral +
  • Aleksey Bilogur +
  • Alexis Mignon +
  • Amol Kahat +
  • Andreas Winkler +
  • Andrew Kittredge +
  • Anthonios Partheniou
  • Arco Bast +
  • Ashish Singal +
  • Baurzhan Muftakhidinov +
  • Ben Kandel
  • Ben Thayer +
  • Ben Welsh +
  • Bill Chambers +
  • Brandon M. Burroughs
  • Brian +
  • Brian McFee +
  • Carlos Souza +
  • Chris
  • Chris Ham
  • Chris Warth
  • Christoph Gohlke
  • Christoph Paulik +
  • Christopher C. Aycock
  • Clemens Brunner +
  • D.S. McNeil +
  • DaanVanHauwermeiren +
  • Daniel Himmelstein
  • Dave Willmer
  • David Cook +
  • David Gwynne +
  • David Hoffman +
  • David Krych
  • Diego Fernandez +
  • Dimitris Spathis +
  • Dmitry L +
  • Dody Suria Wijaya +
  • Dominik Stanczak +
  • Dr-Irv
  • Dr. Irv +
  • Elliott Sales de Andrade +
  • Ennemoser Christoph +
  • Francesc Alted +
  • Fumito Hamamura +
  • Giacomo Ferroni
  • Graham R. Jeffries +
  • Greg Williams +
  • Guilherme Beltramini +
  • Guilherme Samora +
  • Hao Wu +
  • Harshit Patni +
  • Ilya V. Schurov +
  • Iván Vallés Pérez
  • Jackie Leng +
  • Jaehoon Hwang +
  • James Draper +
  • James Goppert +
  • James McBride +
  • James Santucci +
  • Jan Schulz
  • Jeff Carey
  • Jeff Reback
  • JennaVergeynst +
  • Jim +
  • Jim Crist
  • Joe Jevnik
  • Joel Nothman +
  • John +
  • John Tucker +
  • John W. O’Brien
  • John Zwinck
  • Jon M. Mease
  • Jon Mease
  • Jonathan Whitmore +
  • Jonathan de Bruin +
  • Joost Kranendonk +
  • Joris Van den Bossche
  • Joshua Bradt +
  • Julian Santander
  • Julien Marrec +
  • Jun Kim +
  • Justin Solinsky +
  • Kacawi +
  • Kamal Kamalaldin +
  • Kerby Shedden
  • Kernc
  • Keshav Ramaswamy
  • Kevin Sheppard
  • Kyle Kelley
  • Larry Ren
  • Leon Yin +
  • Line Pedersen +
  • Lorenzo Cestaro +
  • Luca Scarabello
  • Lukasz +
  • Mahmoud Lababidi
  • Mark Mandel +
  • Matt Roeschke
  • Matthew Brett
  • Matthew Roeschke +
  • Matti Picus
  • Maximilian Roos
  • Michael Charlton +
  • Michael Felt
  • Michael Lamparski +
  • Michiel Stock +
  • Mikolaj Chwalisz +
  • Min RK
  • Miroslav Šedivý +
  • Mykola Golubyev
  • Nate Yoder
  • Nathalie Rud +
  • Nicholas Ver Halen
  • Nick Chmura +
  • Nolan Nichols +
  • Pankaj Pandey +
  • Pawel Kordek
  • Pete Huang +
  • Peter +
  • Peter Csizsek +
  • Petio Petrov +
  • Phil Ruffwind +
  • Pietro Battiston
  • Piotr Chromiec
  • Prasanjit Prakash +
  • Rob Forgione +
  • Robert Bradshaw
  • Robin +
  • Rodolfo Fernandez
  • Roger Thomas
  • Rouz Azari +
  • Sahil Dua
  • Sam Foo +
  • Sami Salonen +
  • Sarah Bird +
  • Sarma Tangirala +
  • Scott Sanderson
  • Sebastian Bank
  • Sebastian Gsänger +
  • Shawn Heide
  • Shyam Saladi +
  • Sinhrks
  • Stephen Rauch +
  • Sébastien de Menten +
  • Tara Adiseshan
  • Thiago Serafim
  • Thoralf Gutierrez +
  • Thrasibule +
  • Tobias Gustafsson +
  • Tom Augspurger
  • Tong SHEN +
  • Tong Shen +
  • TrigonaMinima +
  • Uwe +
  • Wes Turner
  • Wiktor Tomczak +
  • WillAyd
  • Yaroslav Halchenko
  • Yimeng Zhang +
  • abaldenko +
  • adrian-stepien +
  • alexandercbooth +
  • atbd +
  • bastewart +
  • bmagnusson +
  • carlosdanielcsantos +
  • chaimdemulder +
  • chris-b1
  • dickreuter +
  • discort +
  • dr-leo +
  • dubourg
  • dwkenefick +
  • funnycrab +
  • gfyoung
  • goldenbull +
  • hesham.shabana@hotmail.com
  • jojomdt +
  • linebp +
  • manu +
  • manuels +
  • mattip +
  • maxalbert +
  • mcocdawc +
  • nuffe +
  • paul-mannino
  • pbreach +
  • sakkemo +
  • scls19fr
  • sinhrks
  • stijnvanhoey +
  • the-nose-knows +
  • themrmax +
  • tomrod +
  • tzinckgraf
  • wandersoncferreira
  • watercrossing +
  • wcwagner
  • xgdgsc +
  • yui-knk