Travis Oliphant

Conda Packaging for the PyData Ecosystem

Developers at Continuum Analytics have been using Python actively for a long time. We have all seen and made poor software engineering decisions in the Python ecosystem because of the lack of a solid, cross-platform, binary packaging mechanism, particularly for Python extensions in the NumPy stack. We have seen most of these problems in enterprise use of Python, where many people have struggled to manage internal Python distributions. In some cases this basically means the developer population is frozen on a particular snapshot of development history, and it becomes expensive and difficult to move to the next version. This presents a specific challenge for the Python community now that Python 3 is available and is where most Python development is taking place. The majority of installations continue to use Python 2, and because of packaging challenges, migrating is more difficult than it needs to be. We posit that the problem of binary packaging is fundamentally an OS-level problem, and it should not be solved with a Python-centric worldview. Until you can install Python 2 and Python 3 and PyPy and R and Stackless and PyParallel within the same system and easily switch environments between all of them, a Python-only system will not fundamentally solve the problems we have addressed with conda.

We built conda because we needed to make it easy and simple to re-create environments for full-stack scientific computing and data analysis with Python. This system needed to be easy for users, developers, and IT administrators. Users of this tool would include occasional developers who need to reproduce the work of a colleague or quickly create an environment that matches one they’ve seen someone else use. Developers would be people who are extending their own system, building packages, or contributing their own unique Python libraries to the growing Python ecosystem. IT administrators would be people who are managing systems for other users and/or developers. One of Python’s great strengths is its ability to have both users and developers cooperate in the same ecosystem, with some people gliding back and forth between both roles. This strength is especially relevant and important to the use of Python for Science and Data Analysis, where almost all developers of the NumPy stack are or began as users of the NumPy stack. It is the fundamental reason that Python is used in the same sentence as R and also in the same sentence as Java when discussing large-scale systems.

Conda Works Today

While conda is continually improving, we emphasize that it works and works well today and solves the problems we faced. It is currently being used by hundreds of thousands of people and is getting better every week as more and more people contribute their recipes and packages to the ecosystem. Conda supports potentially any platform someone has enough resources to build binary packages for. Continuum has made freely available a wide variety of attribution-requested binary files that use conda for installation. Many of these binaries are bundled into a single-click installed distribution of Python we call Anaconda. However, anyone can build conda binaries of whatever stack they want, build collections of these packages, and make them available as public or private repositories at binstar.org.

A reasonable question to ask is why not just extend pip/virtualenv using the new .whl format. The answer is addressed in some detail by my recent blog post. In summary, pip and virtualenv address a fundamentally different problem. It overlaps with the problem conda addresses, but the roadmaps and goals do not overlap enough for us to use that approach in a cost-effective way. Another question to ask is: “Why write NumPy instead of just extending the array object in Python?” Those two objects also have overlapping use-cases, but instead Numeric, Numarray, and finally NumPy were all written as separately installed objects. Conda is separately installed, but otherwise freely available.

The other packaging tools do not fully consider the problems that we have solved, nor do they offer a mechanism to solve them today. Eventually, the .whl format and additional meta-data will likely grow enough information and features that we could technically use it with conda, and we look forward to conda2whl and whl2conda utilities emerging as that transpires. Even then, however, our use-cases for conda go beyond Python packages. For example, we have experimental support for R and R-packages with conda, and we package python itself as a conda package. It is also quite easy today to use pip within a conda environment, or conversely to use conda install in any Python environment (by first pip installing conda and running conda init).

Conda and Anaconda are Open and Free

Anyone can make their own conda binaries, create conda repositories (both locally and served via http), and put them anywhere they want using the tools we have created. You can also extend these tools as desired because conda is open-source. Conda solves a real problem for people today on Mac, Windows, and Linux. It is an open-source, BSD-licensed solution and used by our free, attribution-requested Anaconda binaries. Binaries can be hosted anywhere by anyone, but we also provide a simple and free binstar.org service (currently in beta until it gets more testing – request an invite to get immediate access). This service allows anyone to host as many binary packages as they like and to point to those packages using any mechanism they like. Anaconda could be considered a “reference” distribution and a proof of concept, but we have made tools available for anyone to re-create this distribution.

We encourage developers of open source Python extensions to make conda recipes and publish them with their source code. Recipes are simple to make, and we have made a wide variety of recipes available here. In addition, you can make conda packages using the conda package command, which allows you to package up any “untracked” files. In this way you can install using whatever method you like and still get a binary package that you can redistribute or place on binstar.org. There are a few caveats that you will need to take into account, such as script rewriting and using RPATH for shared libraries on Mac and Linux. More detail will be coming to conda.pydata.org.
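To give a feel for how small a recipe can be, here is a minimal sketch that writes a recipe directory containing a meta.yaml file and a build.sh script. The package name, version, and source URL are placeholders made up for illustration, the write_recipe helper is hypothetical, and the build script assumes a conventional configure/make C library. Treat it as an illustration rather than a complete recipe.

```python
import os
import tempfile

# Minimal recipe metadata. The name, version, and URL are filled in by the
# caller; the values used in the example below are placeholders.
META_YAML = """\
package:
  name: {name}
  version: "{version}"

source:
  url: {url}

about:
  summary: Example recipe generated for illustration
"""

# Build script for a conventional configure/make project (an assumption).
# $PREFIX is set by the build tool to the install location.
BUILD_SH = """\
#!/bin/bash
./configure --prefix=$PREFIX
make
make install
"""


def write_recipe(recipe_dir, name, version, url):
    """Write meta.yaml and build.sh into recipe_dir (hypothetical helper)."""
    os.makedirs(recipe_dir, exist_ok=True)
    meta_path = os.path.join(recipe_dir, "meta.yaml")
    with open(meta_path, "w") as f:
        f.write(META_YAML.format(name=name, version=version, url=url))
    build_path = os.path.join(recipe_dir, "build.sh")
    with open(build_path, "w") as f:
        f.write(BUILD_SH)
    return meta_path, build_path


if __name__ == "__main__":
    recipe = os.path.join(tempfile.mkdtemp(), "libexample")
    print(write_recipe(recipe, "libexample", "1.0",
                       "https://example.org/libexample-1.0.tar.gz"))
```

Running conda build on the resulting directory is what turns a recipe like this into a binary package. Note that nothing here requires a setup.py file.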

Specific Differences with .whl and virtualenv

It is useful to list out specifically a few of the main differences between conda packages and the .whl format as well as the differences between conda OS-level environments and virtualenv environments. Here is a list of high-level differences for the curious.

Differences between conda and .whl:

  • Conda has meta-data support on the server-side that works today.
  • Conda is not Python-specific in its binary layout.
  • Conda has a build command that is not dependent on a setup.py file. (We do not believe you should have to write a setup.py file or execute any Python code to package a C/C++ or Fortran library.)
  • There are far more conda packages available today than .whl packages.

Differences between conda environments and virtualenv environments:

  • Conda environments are OS-level and not Python-specific, which means that a conda environment behaves like a full copy of the entire installation instead of just a copy of python + site-packages.
  • You can create environments of any binary dependency tree (different versions of Python, R, Julia, etc.).
  • Conda environments use hard-linking and soft-linking for maximal efficiency.
  • All conda packages are deployed to an environment, so environments are first-class ideas in the conda ecosystem. Even root is an environment.
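The hard-linking point is worth a small illustration. The sketch below is not conda's implementation. It only shows the general idea that one file in a package cache can appear in several environment directories without being copied. The directory names (pkgs, envs/py2, envs/py3), the file name, and the link_into_env helper are all hypothetical, and it assumes a filesystem that supports hard links.

```python
import os
import tempfile


def link_into_env(cache_file, env_dir):
    """Hard-link a cached file into an environment directory (hypothetical helper)."""
    os.makedirs(env_dir, exist_ok=True)
    target = os.path.join(env_dir, os.path.basename(cache_file))
    os.link(cache_file, target)  # new directory entry, same data on disk
    return target


def demo():
    """Link one cached file into two environments and inspect the result."""
    root = tempfile.mkdtemp()
    pkgs = os.path.join(root, "pkgs")
    os.makedirs(pkgs)

    # Stand-in for a binary file unpacked once into a package cache.
    cached = os.path.join(pkgs, "libexample.so")
    with open(cached, "wb") as f:
        f.write(b"\0" * 1024)

    in_py2 = link_into_env(cached, os.path.join(root, "envs", "py2"))
    in_py3 = link_into_env(cached, os.path.join(root, "envs", "py3"))

    # Both environment entries refer to the same underlying file, and the
    # link count covers the cache entry plus the two environments.
    return os.path.samefile(in_py2, in_py3), os.stat(cached).st_nlink


if __name__ == "__main__":
    print(demo())
```

Because each environment holds only another directory entry for the same data, creating an additional environment that reuses already-cached packages costs very little disk space.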

There are good things happening in the larger Python packaging space. Some of it is helpful to users of the NumPy stack, or eventually can be. Especially for the purposes of the NumPy stack, conda is useful now. We invite people to try conda and see if it can solve their problems today. It should be remembered that all Python packaging tools started as projects, like conda, written by someone to solve a problem they faced. With conda, we have shared a solution to Python packaging that provides general-purpose cross-platform package management to everyone, not just Python users. We have given that solution to everyone with a very generous licensing scheme (BSD) and freely available binary packages so that all can enjoy its benefits.

Transitioning between Python 2 and Python 3

There have been several blog posts and numerous forum dialogues recently by prominent Python developers and others about the Python 3 story and the challenge people face migrating between Python 2 and Python 3. Armin Ronacher, Alex Gaynor, and Nick Coghlan have all weighed in on the discussion which can be considered to follow from a presentation made by Brett Cannon earlier in the year.

We believe that conda can make this transition between Python 2 and Python 3 easy for individuals and companies so that it can occur on your timeline as you see fit. This is where the choice to make conda a Python-agnostic tool really pays off. Conda itself works with both Python 2 and Python 3. As a result, it is as easy to create a Python 3 environment with a Python 2 “system” as it is to create a Python 2 environment on a Python 3 “system” (the “system” is defined as what version of Python conda itself is using). We have published a previous blog post that showed how to do this. I won’t repeat that explanation here, but it is remarkably easy to do.
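As a rough sketch of what this looks like in practice, the helper below builds the conda create command for an environment pinned to a given Python version and runs it only if conda is found on the PATH. The environment names py2 and py3 and the helper functions are made up for illustration. This is not a reproduction of the steps in the earlier blog post.

```python
import shutil
import subprocess


def create_env_command(name, python_version):
    """Build the argument list for creating a named environment
    with a specific Python version (hypothetical helper)."""
    return ["conda", "create", "--yes", "-n", name, "python=" + python_version]


def create_env(name, python_version):
    """Run the command if conda is installed; otherwise just return it."""
    cmd = create_env_command(name, python_version)
    if shutil.which("conda") is not None:
        subprocess.check_call(cmd)
    return cmd


if __name__ == "__main__":
    # Print the commands for one Python 2 and one Python 3 environment
    # that could live side by side on the same machine.
    print(" ".join(create_env_command("py2", "2.7")))
    print(" ".join(create_env_command("py3", "3")))
```

The same command works regardless of which Python version conda itself is running under, which is the point of keeping the tool Python-agnostic.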

Whether you are part of a company sitting on top of a lot of Python 2 legacy code but want to migrate to Python 3 for new code, or you primarily use Python 3 with the occasional need for access to Python 2 code and libraries, conda makes it easy to deploy and manage all of your code on a single system using the same tool. In the coming weeks we will be releasing additional commercial tools that make deploying and managing Python installations across your organization even easier. Stay tuned or email us at info@continuum.io if you want the details. Conda itself will remain free and open source. Anaconda will become even more generously licensed for non-commercial entities. We will still continue to ask commercial entities to simply acknowledge that they use Anaconda if they redistribute binaries obtained via Anaconda in their commercial products.

Try Anaconda and/or conda Today

Using Python for Scientific Computing and Data Analysis should be a joyful experience. Installing Python and the NumPy Stack should be as enjoyable as coding with Python can be. I have found that to be the case with conda. Every time I hear of someone struggling with a scipy install or a scikit-learn install, I want to reach out and help them understand how the work we have done with conda really can make things easier for them. We are also working to expand and improve the documentation for conda, which now lives at its own website: conda.pydata.org. I invite you to see if conda can help you solve your packaging and deployment issues. Download the full Anaconda distribution or the Miniconda package and request an invite for a binstar account. See if the good things people are saying about conda and Anaconda apply to your situation as well.

Tags: conda Packaging Python PyData NumPy SciPy