I was among 900 attendees at the recent PyData Seattle 2015 conference, an event focused on the use of Python in data management, analysis and machine learning. Nearly all of the tutorials & talks I attended last weekend were very interesting and informative, and several were positively inspiring. I haven't been as excited to experiment with new tools since I discovered IPython Notebooks at Strata 2014.

I often find it helpful to organize my notes after attending a conference or workshop, and am sharing these notes publicly in case they are helpful to others. The following is a chronological listing of some of the highlights of my experience at the conference. My notes from some sessions are rather sparse, so it is a less comprehensive compilation than I would have liked to assemble. I'll also include some links to some sessions I did not attend at the end.

This was my first time at a PyData conference, and I spoke with several others who were attending their first PyData. Apparently, this was the largest turnout for a PyData conference yet. I gave a 2-hour tutorial on Python for Data Science, designed as a rapid on-ramp primer for programmers new to Python or Data Science. Responses to a post-tutorial survey confirm my impression that I was unrealistically optimistic about being able to fit so much material into a 2-hour time slot, but I hope the tutorial still helped participants get more out of other talks & tutorials throughout the rest of the conference, many of which presumed at least an intermediate level of experience with Python and/or Data Science. As is often the case, I missed the session prior to the one in which I was speaking - the opening keynote - as I scrambled with last-minute preparations (ably aided by friends, former colleagues & assistant tutors Alex Thomas and Bryan Tinsley).

Running and re-running data science experiments in which many steps are repeated, some of which are varied (e.g., with different parameter settings), and several take a long time are all part of a typical data science workflow. Every company in which I've worked as a data scientist has rolled their own workflow pipeline framework to support this process, and each homegrown solution has offered some benefits while suffering from some shortcomings. Jonathan Dinu demonstrated Luigi, an open source library initially created by Spotify for managing batch pipelines that might encompass a large number of local and/or distributed computing cluster processing steps. Luigi offers a framework in which each stage of the pipeline has input, processing and output specifications; the stages can be linked together in a dependency graph which can be used to visualize progress. He illustrated how Luigi could be used for a sample machine learning pipeline (Data Engineering 101), in which a corpus of text documents is converted to TF-IDF vectors, and then models are trained and evaluated with different hyperparameters, and then deployed.
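Luigi's core idea - each stage declares its dependencies, and the framework walks the resulting dependency graph, running each stage exactly once - can be sketched in plain Python. This is a simplified illustration of the pattern only, not Luigi's actual API (which uses `luigi.Task` subclasses with `requires()`, `output()` and `run()` methods, plus targets, scheduling and a progress visualizer):

```python
# A minimal, Luigi-inspired sketch of a batch pipeline: each stage declares
# its upstream dependencies, and stages run in dependency order, each once.

class Task:
    requires = ()  # upstream Task classes this stage depends on

    def run(self, inputs):
        raise NotImplementedError

class LoadCorpus(Task):
    def run(self, inputs):
        return ["the cat sat", "the dog ran"]

class Vectorize(Task):
    requires = (LoadCorpus,)

    def run(self, inputs):
        (corpus,) = inputs
        return [doc.split() for doc in corpus]  # stand-in for TF-IDF

class TrainModel(Task):
    requires = (Vectorize,)

    def run(self, inputs):
        (vectors,) = inputs
        return {"n_docs": len(vectors)}  # stand-in for a trained model

def build(task_cls, _cache=None):
    """Run a task after its dependencies, memoizing results (the DAG walk)."""
    cache = {} if _cache is None else _cache
    if task_cls not in cache:
        inputs = [build(dep, cache) for dep in task_cls.requires]
        cache[task_cls] = task_cls().run(inputs)
    return cache[task_cls]

model = build(TrainModel)
```

In real Luigi, the memoization step is replaced by checking whether a task's output target already exists, which is what makes re-running a partially completed pipeline cheap.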

Joseph Sirosh sprinkled several pithy quotes throughout his presentation, starting off with a prediction that while software is eating the world, the cloud is eating software (& data). He also introduced what may have been the most viral meme at the conference - the connected cow - as a way of emphasizing that every company is a data company ... even a dairy farm. In an illustration of where AI [artificial intelligence] meets AI [artificial insemination], he described a project in which data from pedometers worn by cows boosted estrus detection accuracy from 55% to 95%, which in turn led to more successful artificial insemination and increased pregnancy rates from 40% to 67%. Turning his attention from cattle to medicine, he observed that every hospital is a data company, and suggested that Florence Nightingale's statistical evaluation of the relationship between sanitation and infection made her the world's first woman data scientist. Sirosh noted that while data scientists often complain that data wrangling is the most time-consuming and challenging part of the data modeling process, that is because deploying and monitoring models in production environments - which he argued is even more time-consuming and challenging - is typically handed off to other groups. And, as might be expected, he demonstrated how some of these challenging problems can be addressed by Azure ML, Microsoft's cloud-based predictive analytics system.

IPython (Jupyter) Notebooks are one of my primary data science tools. The ability to interleave code, data, text and a variety of [other] media makes the notebooks a great way both to conduct and to describe experiments. Jonathan described the upcoming Big Split(tm), in which IPython will be separated from the Notebooks, to better emphasize the increasingly language-agnostic capabilities of the notebooks, which [will soon] have support for 48 language kernels, including Julia, R, Haskell, Ruby, Spark and C++. Version 4.0 will offer capabilities to

ease the transition from large notebooks to small notebooks

import notebooks as packages

test notebooks

verify that a notebook is reproducible

As a former educator, I find one new capability particularly exciting: nbgrader, which uses the JupyterHub collaborative platform, providing support for releasing, fetching, submitting and collecting assignments. Among the tidbits I found personally most interesting during this session was that IPython started out as Fernando Pérez's "procrastination project" while he was in PhD thesis avoidance mode in 2001 ... an outstanding illustration of the benefits of structured procrastination.

One of the most challenging aspects of attending a talk by Paco Nathan is figuring out how to bide my time between listening, watching, searching for or typing in referenced links ... and taking notes. He is a master speaker, with compelling visual augmentations and links to all kinds of interesting related material. Unfortunately, while my browser fills up with tabs during his talks, my notes typically end up rather sparse. In this talk, Paco talked about the ways that O'Reilly Media is embracing Jupyter Notebooks as a primary interface for authors using their multi-channel publishing platform. An impressive collection of these notebooks can be viewed on the O'Reilly Learning page. Paco observed that the human learning curve is often the most challenging aspect to leading data science teams, as data, problems and techniques change over time. The evolution of user expertise, e.g., among connoisseurs of beer, is another interesting area involving human learning curves that was referenced during this session.

Fraud detection presents some special challenges in evaluating the performance of machine learning models. If a model is trained on past transactions that are labeled based on whether or not they turned out to be fraudulent, once the model is deployed, the new transactions classified by the model as fraud are blocked. Thus, the transactions that are allowed to go through after the model is deployed may be substantially different - especially with respect to the proportion of fraudulent transactions - than those that were allowed before the model was deployed. This makes evaluation of the model performance difficult, since the training data may be very different from the data used to evaluate the model. It also complicates the training of new models, since the new training data (post model deployment) will be biased. Michael Manapat presented some techniques to address these challenges, involving allowing a small proportion of potentially fraudulent transactions through and using a propensity function to control the "exploration/exploitation tradeoff".
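The exploration idea can be sketched in a few lines: admit a small random fraction of would-be-blocked transactions, and record the inverse of that admission probability as a weight, so that later evaluation and retraining can correct for the sampling bias. The function and the fixed explore rate below are illustrative assumptions of mine, not Manapat's actual implementation:

```python
import random

# Sketch of exploration/exploitation in fraud blocking: transactions scored
# above the threshold are normally blocked, but a small random fraction is
# admitted ("exploration"), with an inverse-propensity weight attached so
# that observed outcomes can be reweighted into unbiased training data.

EXPLORE_PROB = 0.05  # propensity for admitting a flagged transaction (assumed)

def decide(fraud_score, threshold, rng=random.random):
    """Return (allow, weight): whether to let the transaction through,
    and the inverse-propensity weight to use when its outcome is observed."""
    if fraud_score < threshold:
        return True, 1.0                 # admitted with certainty
    if rng() < EXPLORE_PROB:
        return True, 1.0 / EXPLORE_PROB  # rare exploratory admission
    return False, 0.0                    # blocked: outcome never observed

allow, weight = decide(0.9, 0.5, rng=lambda: 0.01)  # forces exploration
```

Transactions admitted via exploration count for `1 / EXPLORE_PROB` copies of themselves in later evaluation, which is what compensates for how rarely they are let through.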

Once again, I had a hard time keeping up with the multi-sensory inputs during a talk by Paco Nathan. Although I can't find his slides from PyData, I was able to find a closely related slide deck (embedded below). The gist of the talk is that many real-world problems can often be represented as graphs, and that there are a number of tools - including Spark and GraphLab - that can be utilized for efficient processing of large graphs. One example of a problem amenable to graph processing is the analysis of a social network, e.g., contributors to open source forums, which reminded me of some earlier work by Weiser, et al (2007), on Visualizing the signatures of social roles in online discussion groups. The session included a number of interesting code examples, some of which I expect are located in Paco's spark-exercises GitHub repository. Other interesting references included TextBlob, a Python library for text processing, and TextRank, a graph-based ranking model for text processing, a paper by Mihalcea & Tarau from EMNLP 2004.
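At its core, TextRank is an ordinary PageRank iteration run over a word co-occurrence graph. The following is a minimal sketch of that computation, not Mihalcea & Tarau's exact formulation (the paper adds windowed co-occurrence, edge weights and convergence checks):

```python
# Minimal PageRank iteration over a word graph, the core computation behind
# TextRank: a node's score is a damped sum of its neighbors' scores, each
# divided by that neighbor's out-degree.

def pagerank(graph, damping=0.85, iterations=50):
    """graph: dict mapping node -> list of neighbor nodes."""
    n = len(graph)
    scores = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        new = {}
        for node in graph:
            rank = sum(scores[nb] / len(graph[nb])
                       for nb in graph if node in graph[nb])
            new[node] = (1 - damping) / n + damping * rank
        scores = new
    return scores

# A toy co-occurrence graph over four words (undirected: edges listed both ways).
graph = {
    "graph": ["spark", "processing"],
    "spark": ["graph", "processing"],
    "processing": ["graph", "spark", "text"],
    "text": ["processing"],
}
scores = pagerank(graph)
```

In this toy graph, "processing" co-occurs with the most other words, so it ends up with the highest score - the same mechanism TextRank uses to surface keywords.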

Pandas - the Python open source data analysis library - may take 10 minutes to learn, but I have found that it takes a long time to master. Jeff Tratner, a key contributor to Pandas - an open source community he described as "really open to small contributors" - shared a number of insights into how Pandas works, how it addresses some of the problems that make Python slow, and how the use of certain features can lead to improved performance. For example, specifying the data types of columns in a CSV file via the dtype parameter in read_csv can help Pandas save space and time while loading the data from the file. Also, the DataFrame.append operation is very expensive, and should be avoided wherever possible (e.g., by using merge, join or concat). One of my favorite lines: "The key to doing many small operations in Python: don't do them in Python."
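Two of those tips can be shown in a few lines; the column names and values here are my own toy data:

```python
import io
import pandas as pd

# Tip 1: pass dtype to read_csv so pandas doesn't have to infer (and
# possibly upcast) column types while loading.
csv = io.StringIO("id,price\n1,9.99\n2,4.50\n3,12.00\n")
df = pd.read_csv(csv, dtype={"id": "int32", "price": "float64"})

# Tip 2: repeated DataFrame.append copies the whole frame each time
# (quadratic overall); collect the pieces and concatenate once instead.
pieces = [pd.DataFrame({"id": [i]}, dtype="int32") for i in range(3)]
combined = pd.concat(pieces, ignore_index=True)
```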

There were a few sessions about which I read or heard great things, but which I did not attend. I'll include information I could find about them, in the chronological order in which they were listed in the schedule, to wrap things up.

I thought it would be fun to experiment with all 3: implementing the recursive C function used by Gustavo Duarte in Python, doing so inside PythonTutor (which can generate an iframe) and then embedding the PythonTutor iframe inside an IPython Notebook, which I would then embed in this blog post.

Unfortunately, I haven't achieved the trifecta: I couldn't figure out how to embed the PythonTutor iframe inside an IPython Notebook, so I will embed both of them separately here in this blog post. Across the collection, 3 flavors of visualizing recursion are shown:

simple print statement output tracing the call stack (ASCII version)

a static call stack image created by Gustavo

a dynamic call stack created automatically by PythonTutor
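The first flavor - tracing the call stack with indented print statements - can be illustrated with a simple factorial function. This is my own minimal sketch, not Gustavo's example:

```python
# Tracing recursive calls with indented print output: each level of
# recursion indents its trace two more spaces, making the growth and
# unwinding of the call stack visible in plain ASCII.

def factorial(n, depth=0):
    indent = "  " * depth
    print(f"{indent}factorial({n}) called")
    if n <= 1:
        result = 1
    else:
        result = n * factorial(n - 1, depth + 1)
    print(f"{indent}factorial({n}) returns {result}")
    return result

factorial(3)
```

The "called" lines trace the stack growing as calls descend; the "returns" lines trace it unwinding as results propagate back up.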

I'll start out with embedding Motivating and Visualizing Recursion in Python, an IPython Notebook I created to interweave text, images and code in summarizing Gustavo Duarte's compelling critique and counter-proposal for how best to teach and visualize recursion, and reimplementing his examples in Python.

I really like the way that PythonTutor enables stepping through and visualizing the call stack (and other elements of the computation). It may not be visible in the iframe above (you have to scroll to the right to see it), so I'll include a snapshot of it below.

If anyone knows how to embed a PythonTutor iframe within an IPython Notebook, please let me know, as I'd still welcome the opportunity to achieve a trifecta ... and I suspect that combining these two tools would represent even more enhanced educational opportunities for Pythonistas.

"this one got me the most excited about getting home (or back to work) to practice what I learned"

Well, I got back to work, and learned how to create an IPython Notebook. Specifically, I created one to provide a rapid "on-ramp" for computer programmers who are already familiar with basic concepts and constructs in other programming languages to learn enough about Python to effectively use the Atigeo xPatterns analytics framework (or other data science tools). The Notebook also includes some basic data science concepts, utilizing material I'd earlier excerpted in a blog post in which I waxed poetic about the book Data Science for Business, by Foster Provost and Tom Fawcett, and other resources I have found useful in articulating the fundamentals of data science.

The rapid on-ramp approach was motivated, in part, by my experience with the Natural Language Toolkit (NLTK) book, which provides a rapid on-ramp for learning Python in conjunction with the open-source NLTK library to develop programs using natural language processing techniques (many of which involve machine learning). I find that IPython Notebooks are such a natural and effective way of integrating instructional information and "active" exercises that I wish I'd discovered it back when I was teaching courses using Python at the University of Washington (e.g., what came to be known as the socialbots course). I feel like a TiVO fanatic now, wanting to encourage anyone and everyone sharing any knowledge about Python to use IPython Notebooks as a vehicle for doing so.

Not only was this my first IPython Notebook, but I'm somewhat embarrassed to admit that the Python for Data Science repository represents my first contribution to GitHub. When I was teaching at UW, I regularly encouraged students to contribute to open source projects. Now I'm finally walking the talk ... better late than never, I suppose.

I'll include the contents of the repo's README.md file below. Any questions, comments or other feedback is most welcome.

This short primer on Python is designed to provide a rapid "on-ramp" for computer programmers who are already familiar with basic concepts and constructs in other programming languages to learn enough about Python to effectively use open-source and proprietary Python-based machine learning and data science tools.

The primer is spread across a collection of IPython Notebooks, and the easiest way to use the primer is to install IPython Notebook on your computer. You can also install Python, and manually copy and paste the pieces of sample code into the Python interpreter, as the primer only makes use of the Python standard libraries.

There are three versions of the primer. Two versions contain the entire primer in a single notebook:

SimpleDecisionTree.py: a Python class to encapsulate a simplified version of a popular machine learning model

There are also 2 data files, based on the mushroom dataset in the UCI Machine Learning Repository, used for coding examples, exploratory data analysis and building and evaluating decision trees in Python:

agaricus-lepiota.data: a machine-readable list of examples or instances of mushrooms, represented by a comma-separated list of attribute values

I've been using the Delicious social bookmarking web service for many years as a way to archive links to interesting web pages and associate tags to personally categorize - and later search for - their content [my tags can be found under the username gump.tion, a riff on the original Delicious URL, del.icio.us]. In December 2010, a widely circulated rumor reported that Yahoo was planning to shut down Delicious, and a number of my friends abandoned the service for other services. I was in the midst of yet another career change, rejoining academia after a 21-year hiatus, with little time for browsing, much less bookmarking, so I did not make any changes at the time.

It turns out that rather than being shut down, Delicious was sold in April 2011, and various changes have since been made to the service and its user interface. The Delicious UI initially interpreted spaces in the TAGS field as tag separators, e.g., typing in the string "education mooc disruption" (as shown in the screenshot below) would be interpreted as tagging a page with the 3 tags "education", "mooc" and "disruption"; if you wanted a single tag with those 3 terms, you had to remove or replace the spaces, e.g., "educationmoocdisruption" or "education_mooc_disruption". Sometime in October 2011, the specifications changed, and commas rather than spaces were used to separate tags, allowing spaces to be used in the tags themselves, e.g., "education mooc disruption" was interpreted as a single tag (equivalent to "educationmoocdisruption"). Unfortunately, I did not see an announcement or notice this change for quite some time, and so I had hundreds of web pages archived on Delicious with tags I did not intend.

This problem surfaced recently when I was sharing my bookmarks on MOOCs (massive open online courses) with a group of students working on a project investigating MOOCs in a small, closed, offline course, Computing Technology and Public Policy. There were several pages I remembered bookmarking that did not appear in pages associated with my MOOC tag. Searching through my archive for the titles of some of those pages, I discovered several pages tagged with terms including spaces. I started manually renaming tags, replacing the multi-term tags with the multiple tags I'd intended to associate with the pages. After a dozen or so manual replacements, I scanned my tag set and saw many, many more, and so decided to try a different approach.

The Delicious API provides a programmatic way to access or change tags associated with an authenticated user's account. Ever since my first socialbots experiment, my programming language of first resort in accessing any web service API is Python, and as I expected, there is a Python package for accessing the Delicious API, aptly named pydelicious. Using pydelicious, I discovered that my Delicious account had over 200 tags with unintended spaces in them. I'm sharing the process I used to convert these tags in case it is of interest / use to others in a similar predicament. [Note: my MacBook Pro, running Mac OS X 10.8.3, comes prebundled with Python 2.7.2; instructions for installing and using Python can be found at python.org.]

Replacing all the tags containing unintended spaces with comma-delimited equivalents (e.g., replacing "education mooc disruption" with "education", "mooc", "disruption") was relatively straightforward, using the following sequence:

Install pydelicious: type easy_install pydelicious on the command line (on Mac OS X, this can be done in a Terminal window; on Windows, in a Command Prompt window)
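The heart of the conversion is mechanical: find every tag containing spaces, and replace it with its space-separated components. The following sketch isolates that logic in a plain, testable function; the commented-out lines show roughly how it would be applied via pydelicious, with method names as I recall them - verify them against the package's own documentation before use:

```python
# Map each multi-word tag to the list of single-word tags it should become.

def split_unintended_tags(tags):
    """Return {tag_with_spaces: [intended, component, tags]} for each
    tag in the input that contains one or more spaces."""
    return {tag: tag.split() for tag in tags if " " in tag}

# Applying the fix through the Delicious API (pydelicious call names are
# from memory and should be checked against the package docs):
#
# import pydelicious
# api = pydelicious.apiNew('username', 'password')
# all_tags = [t['tag'] for t in api.tags_get()['tags']]
# for old_tag, new_tags in split_unintended_tags(all_tags).items():
#     api.tags_rename(old_tag, ','.join(new_tags))

fixes = split_unintended_tags(["education mooc disruption", "python"])
```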

Having replaced all the tags with unintended spaces, I've reduced my tag set from 881+ to 680. I now see that I have a number of misspelled tags (e.g., commumity), and a number of singleton tags that are semantically similar to other tags I've used more regularly (e.g., comics (2) and humor (27)) - an inconsistency that similarly affects the category tags on this blog - but I'll leave further fixes for another time when I want to engage in structured procrastination.

The cover of Gayle Laakmann McDowell's book, Cracking the Coding Interview, and links to her Career Cup web site and Technology Woman blog are included in the slides I use on the first day of every senior (400-level) computer science course I have taught over the last two years. These are some of the most valuable resources I have found for preparing for interviews for software engineering - as well as technical program manager, product manager or project manager - positions. I recently discovered she has another book, The Google Resume, that offers guidance on how to prepare for a career in the technology industry, so I've added that reference to my standard introductory slides.

While my Computing and Software Systems faculty colleagues and I strive to prepare students with the knowledge and skills they will need to succeed in their careers, the technical interview process can prove to be an extremely daunting barrier to entry. The resources Gayle has made available - based on her extensive interviewing experience while a software engineer at Google, Microsoft and Apple - can help students (and others) break through those barriers. The updated edition of her earlier book focuses on how to prepare for interviews for technical positions, and her latest book complements this by offering guidance - to students and others who are looking to change jobs or fields - on how to prepare for careers in the computer technology world.

I have been looking for an opportunity to invite Gayle to the University of Washington Bothell to present her insights and experiences directly to our computer science students since I started teaching there last fall, and was delighted when she was able to visit us last week. Given the standing room only crowd, I was happy to see that others appreciated the opportunity to benefit from some of her wisdom. I will include fragments of this wisdom in my notes below, but for the full story, I recommend perusing her slides (embedded below) or watching a video of a similar talk she gave in May (also embedded further below), and for anyone serious about preparing for tech interviews and careers, I recommend reading her books.

Gayle emphasized the importance of crafting a crisp resume. Hiring managers typically spend no more than 15-30 seconds per resume to make a snap judgment about the qualifications of a candidate. A junior-level software engineer should be able to fit everything on one page, use verbs emphasizing accomplishments (vs. activities or responsibilities), and quantify accomplishments wherever possible. Here are links to some of the relevant resources available at her different web sites:

One important element of Gayle's advice [on Slide 13] that aligns with my past experience - and ongoing bias - in hiring researchers, designers, software engineers and other computing professionals is the importance of working on special projects (or, as Gayle puts it, "Build something!"). While graduates of computer science programs are in high demand, I have always looked for people who have done something noteworthy and relevant, above and beyond the traditional curriculum, and it appears that this is a common theme in filtering prospective candidates in many technology companies. This is consistent with advice given in another invited talk at UWB last year by Jake Homan on the benefits of contributing to open source projects, and is one of the motivations behind the UWB CSS curriculum requiring a capstone project for all our computer science and software engineering majors.

Gayle spoke of "the CLRS book" during her talk at UWB and her earlier talk at TheEasy, a reference to the classic textbook, Introduction to Algorithms, by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein. She said that entry-level software engineer applicants typically won't need to know data structures and algorithms at the depth or breadth presented in that book, and she offers a cheat sheet / overview of the basics on Slides 23-40, and an elaboration in Chapters 8 & 9 of her CtCI book. However, for those who are interested in delving more deeply into the topic, an online course based on the textbook is now part of the MIT Open CourseWare project, and includes video & audio lectures, selected lecture notes, assignments, exams and solutions.

One potential pitfall for candidates who prepare thoroughly for technical interviews is that they may get an interview question they have already seen (and perhaps studied). She recommended that candidates admit to having seen a question before, equating not doing so with cheating on an exam, and that they avoid simply reciting solutions from memory, because simple slip-ups are both common and easy to catch.

Gayle stressed that there is no correlation between how well a candidate thinks he or she did in an interview and how well their interviewers thought they did. In addition to natural biases, the candidate evaluation process is always relative: candidates' responses to questions are assessed in the context of the responses of other candidates for the same position. So even if a candidate thinks he or she did well on a question, it may not be as well as other candidates, and even if a candidate thinks he or she totally blew a question, it may not have been blown as badly as other candidates blew the question.

Another important factor to bear in mind is that most of the big technology companies tend to be very conservative in making offers; they generally prefer to err on the side of false negatives rather than false positives. When they have a candidate who seems pretty good, but they don't feel entirely confident about the candidate's strength, they have so many [other] strong candidates that they would rather reject someone who may have turned out great than risk hiring someone who does not turn out well. Of course, different companies have different evaluation and ranking schemes, and many of these details can be found in her CtCI book.

Gayle visits the Seattle area on a semi-regular basis, so I'm hoping I will be able to entice her to return each fall to give a live presentation to our students. However, for the benefit of those who are not able to see her present live, here is a video of her Cracking the Coding Interview presentation at this year's Canadian University Software Engineering Conference (CUSEC 2012) [which was also the site of another great presentation I blogged about a few months ago, Bret Victor's Inventing on Principle].

Finally, I want to round things out on a lighter note, with a related video that I also include in my standard introductory slides, Vj Vijai's Hacking the Technical Interview talk at Ignite Seattle in 2008:

I recently encountered some challenges using Java 6 Swing and AWT components to develop a GUI with a top-level BorderLayout, which did not prevent the shrinking of components in its CENTER area in response to an expanding JList in its WEST area. I provided more context about this problem (and its solution) in my last post, in which I described How to Prevent a JList in a JScrollPane on a JPanel from Resizing. As I mentioned in that post, one way I attempted to resolve the problem was to substitute a different LayoutManager, first GridBagLayout and then BoxLayout. Neither of those substitutions resolved the problem, but in exploring this trajectory, I learned how to approximate the 5 areas of BorderLayout - NORTH, WEST, CENTER, EAST and SOUTH - using GridBagLayout and BoxLayout, and wanted to share simplified versions of these approximations here, in case they may be of use to others.

BorderLayout

My simplified base example, SimpleBorderLayoutDemo, creates a 320x160 pixel GUI with a JButton in each of the 5 areas:

The "north" and "south" buttons are stretched to fill the entire width of the window. This is the case regardless of whether one extends JButton and overrides the getPreferredSize() or getMaximumSize() methods, or uses setPreferredSize() or setMaximumSize(). More specifically, BorderLayout ignores preferred and maximum widths for components in the NORTH and SOUTH areas. Note also that the "west" and "east" buttons are stretched vertically to fill the space between the NORTH and SOUTH areas. This is because BorderLayout ignores the preferred and maximum heights for components in the WEST and EAST areas. Just for completeness, it ignores all preferred and maximum dimensions (width and height) for components added to the CENTER pane.

To get buttons that have the preferred widths and heights, each JButton can be added to a separate JPanel, which has a default FlowLayout (a LayoutManager which respects the preferred heights and widths of its components), and then each of those JPanels can be added to the main contentPane. Using separate JPanels for each JButton results in the following GUI (SimpleBorderLayoutDemo2):

GridBagLayout

To approximate this 5-area layout using GridBagLayout, we need to use 3 rows and 3 columns, in which the north and south components each span all 3 columns. We can use the GridBagConstraints fields gridx and gridy to specify each component's column and row, and use the gridwidth field to enable the north and south components to occupy all 3 columns of the top and bottom rows.

GridBagLayout components tend to gravitate toward the middle; in order to spread them out, we can insert Insets specifying the number of pixels to use for padding above, to the left of, below, and to the right of each component. If we create a new Insets object initialized with values of 10 for each of its edges (top, left, bottom and right), and assign it to the GridBagConstraints insets field for the GridBagLayout, we get the following layout:

BoxLayout

BoxLayout can also be used to approximate the BorderLayout. BoxLayout allows components to be laid out horizontally (along the X_AXIS) or vertically (along the Y_AXIS). In horizontal arrangements, BoxLayout attempts to respect the preferred widths of components; in vertical arrangements, BoxLayout attempts to respect the preferred heights of components.

We could create a top-level BoxLayout with vertical arrangement, in which case (assuming the default top-to-bottom orientation), we would

add the north button

create and add another JPanel having a BoxLayout with horizontal orientation to which we would add the west, center and east buttons

add the south button

Alternately, we could create a top-level BoxLayout with horizontal arrangement, in which case (assuming the default left-to-right orientation), we would

add the west button

create and add another JPanel having a BoxLayout with vertical orientation to which we would add north, center and south buttons

add the east button

In the first BoxLayout configuration, everything is shifted toward the top, and the north and south buttons appear off-center; in the second BoxLayout configuration, everything seems shifted to the left. These misalignments are due to the default alignments in a BoxLayout. There is a whole section of the Java tutorial on How to Use BoxLayout devoted to Fixing Alignment Problems, so I won't go into the full details here.

The same tutorial also describes Using Invisible Components as Filler, which includes a few different internal padding options. A flexible "glue" filler can be used to add as much filler as necessary to evenly space out the components, in the horizontal or vertical dimensions. So, in addition to alignment adjustments, in the first BoxLayout we'll also want to createVerticalGlue() between the north button and the middle panel (containing the west, center and east buttons), and between the middle panel and the south button, along with createHorizontalGlue() between the buttons within the middle panel. In the second BoxLayout, we'd do the opposite, using createHorizontalGlue() between the west button and the center panel (containing the north, center and south buttons), and between the center panel and the east button, along with createVerticalGlue() between the buttons within the center panel. Whichever combination we use, we end up with the following layout:

Note that each of the final layouts above has some subtle differences: in BorderLayout, the buttons in the WEST, CENTER and EAST areas are all aligned at the top rather than at the center of their respective areas; in GridBagLayout, the spaces between the buttons are all very uniform; in BoxLayout, the buttons are pushed out toward the edges. All of these can be modified with further adjustments, but my main goal here was simply to demonstrate how the basic NORTH, WEST, CENTER, EAST and SOUTH areas of a BorderLayout can be approximated using other LayoutManagers.

Potential Problems with Growing and Shrinking Components

Just to round things out, in view of the previous post on preventing the resizing of components, I'll include variations on each of the layouts above where the label of the "west" button is changed to "westwestwest" (3x), "westwestwestwestwest" (5x) and "westwestwestwestwestwestwest" (7x):

BorderLayout (version 1)

In the first case, the EAST area is shrunk to the preferred size of its button; in the second case, the CENTER area is shrunk below the preferred size of its button; in the third case, the CENTER area is entirely consumed by the WEST area.

BorderLayout (version 2)

The same pattern of shrinking and eventual elimination of the CENTER area is repeated.

GridBagLayout

In the GridBagLayout, the expanding "west*" button simply pushes the components to its right further to the right, but their sizes are not changed.

BoxLayout (version 1)

In the first version of BoxLayout, where the center row is a separate JPanel with a BoxLayout.X_AXIS arrangement, we see a pattern similar to that of GridBagLayout, where the expanding "west*" button keeps pushing the other buttons further to the right, without changing their sizes.

BoxLayout (version 2)

A similar pattern can be seen in version 2 of BoxLayout, where the center column is a separate JPanel with a BoxLayout.Y_AXIS arrangement, except that in this case the entire center column is pushed further to the right.

The solution to the problem of expanding components resulting in the shrinking or elimination of other components, as I mentioned in the previous post, is to use setPreferredSize() and/or setMaximumSize() - depending on the LayoutManager - for any components that might expand dynamically during execution and encroach upon other components ... such as a JList that accepts new String elements that may be longer than any previous element.
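As a concrete illustration (the helper name is mine, not from the earlier post): pinning a component at its current preferred size keeps a later content change from enlarging it, since BoxLayout consults the maximum size when distributing extra space.

```java
import java.awt.Dimension;
import javax.swing.*;

public class PinSizeSketch {
    // Pin a component at its current preferred size so that later content
    // changes (e.g., a longer button label) cannot enlarge it and encroach
    // on neighboring components.
    static void pinSize(JComponent c) {
        Dimension d = c.getPreferredSize();
        c.setPreferredSize(d);
        c.setMaximumSize(d);
    }
}
```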

I've been working on a graphical user interface to enable a user to view, modify and add relevance judgments for a set of results returned by a search engine in response to a set of topics. The GUI was developed as part of some work I've been doing on the Text REtrieval Conference (TREC) 2012 Medical Records Track. I hope to write more about the GUI and the work on TREC in the future. For now, I want to share a solution I developed to a problem with Java 6 Swing and AWT components that I was struggling with over the past few days.

The problem was a JList, contained within a JScrollPane, contained within a JPanel, that was resizing whenever a new String element added to the JList was wider than the list's current width. I initially implemented the GUI, which extends JFrame, using a BorderLayout:
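The layout looked roughly like this (a sketch with stand-in components; the NORTH/SOUTH labels and helper name are hypothetical, not the actual TREC GUI code):

```java
import java.awt.BorderLayout;
import javax.swing.*;

public class JudgmentPaneSketch {
    // Approximate shape of the original frame's content pane: the results
    // JList (inside a JScrollPane) in the WEST area, a JTextArea in the
    // CENTER area, and a second JList in the EAST area.
    static JPanel buildContentPane() {
        JPanel pane = new JPanel(new BorderLayout());
        pane.add(new JLabel("topics"), BorderLayout.NORTH);
        pane.add(new JScrollPane(new JList(new DefaultListModel())),
                 BorderLayout.WEST);   // the offending, growing JList
        pane.add(new JScrollPane(new JTextArea()), BorderLayout.CENTER);
        pane.add(new JScrollPane(new JList(new DefaultListModel())),
                 BorderLayout.EAST);
        pane.add(new JLabel("status"), BorderLayout.SOUTH);
        return pane;
    }
}
```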

The offending JList was in the West (leftmost) area, and when it grew (after adding a string longer than its initial width), it shrank the JTextArea component I have in the Center area. I wanted to prevent any resizing from happening.

I read that the Center area of a BorderLayout cannot be protected from resizing - it fills whatever space is available after all the other components have been rendered - so I experimented with different LayoutManagers (GridBagLayout and BoxLayout) ... which will be the subject of a future blog post [update: see How to approximate BorderLayout with GridBagLayout or BoxLayout]. I couldn't manage to get GridBagLayout to work and play nicely with my components; BoxLayout - using a BoxLayout.X_AXIS JPanel for the West, Center and East area components, which was contained inside a BoxLayout.Y_AXIS JPanel (which was wedged between BoxLayout.Y_AXIS JPanels for North and South areas) - reduced the shrinking, but did not eliminate it.

I considered extending JPanel and overriding the getPreferredSize() or getMaximumSize() methods, but decided against this as the log files revealed that some resizing occurs after all the components in the GUI are initially populated. I wanted to allow the component sizes to settle into an initial populated state, and then prevent subsequent resizing.

After trying several other possibilities, I finally wrote a method to update the width associated with the PreferredSize and MaximumSize, which I call after populating the components. I'll share it below, in case it is of use to others, or in case others have better solutions.
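The method was along these lines (a reconstruction, not the exact original; passing the settled width in explicitly is my assumption):

```java
import java.awt.Dimension;
import javax.swing.JComponent;

public class WidthFreezer {
    // Lock in a component's width by writing it into both the preferred
    // and maximum sizes; the heights are left as-is, so vertical behavior
    // is unaffected. Called once, after the GUI is initially populated.
    static void freezeWidth(JComponent c, int width) {
        c.setPreferredSize(new Dimension(width, c.getPreferredSize().height));
        c.setMaximumSize(new Dimension(width, c.getMaximumSize().height));
    }
}
```

In practice it would be called with the component's settled width (e.g., getWidth()) once the initial population has finished.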

Among the shortcomings of this solution is that any JPanel restricted by this method will not automatically resize if/when the containing window (JFrame) is resized. My expectation is that my GUI will fill the screen - however large that screen is - and that users will generally not want to make it any smaller than full screen. If my assumption proves unwarranted, I suppose I could reintroduce some flexibility via the componentResized() method of a ComponentAdapter, but I'm going to wait to see if any of the [currently] small group of users asks for this capability.

Update: I discovered another possible approach, which I have now tried [see update 2 below], described in the Advanced JList Programming article at the Sun Developer Network (SDN). In the section entitled "JList Performance: Fixed Size Cells, Fast Renderers", Hans Muller suggests restricting the width of cells in a JList using the setFixedCellWidth() method, or the setPrototypeCellValue() method - which is passed a prototype (e.g., String) value whose width and height determine the cell width and cell height. He also notes "be sure to set the prototypeCellValue property after setting the cell renderer". The shortcoming I envision with this approach is that I would have to call setFixedCellWidth() any time the window is resized ... though, as noted before, this is a scenario I don't currently handle in the GUI anyway.

Update 2: When I loaded the GUI with a JList containing cells wider than the desired width, the original problem recurred, so my fix was not effective. I have since used setFixedCellWidth() for the two JLists in the West and East areas, and all is well in my [GUI] world again.
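For reference, the fix amounts to something like this (a sketch; the helper name and width value are arbitrary):

```java
import javax.swing.*;

public class FixedCellSketch {
    // Fix the cell width of a JList so that long strings are clipped
    // instead of widening the list (and its enclosing JScrollPane).
    static JList makeFixedWidthList(ListModel model, int cellWidth) {
        JList list = new JList(model);
        list.setFixedCellWidth(cellWidth);
        // Alternative: setPrototypeCellValue() derives both cell width and
        // height from a sample value:
        // list.setPrototypeCellValue("a string as wide as the widest cell");
        return list;
    }
}
```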

I recently graded the first Python programming assignments in the course I'm teaching on Social and Computational Intelligence in the Computing and Software Systems program at University of Washington Bothell. Most of the students are learning Python as a second (or third) language, approaching it from the perspective of C++ and Java programming, the languages we use in nearly all our other courses. Both of those languages require the definition of a main() function in any source file that is intended to be run as an executable program, and so many of the submissions include the definition of a main() function in their Python scripts.

In reviewing some recurring issues from the first programming assignment during class, I highlighted this practice, and suggested it was unPythonistic (a la Code like a Pythonista). I recommended that the students discontinue this practice in future programming assignments, as unlike in C++ and Java, a function named main has no special meaning to the Python interpreter. Several students asked why they should refrain from this practice - i.e., what harm is there in defining a main() function? - and one sent me several examples of web sites with Python code including main() functions as evidence of its widespread use.

In my experience, the greatest benefit to teaching is learning, and the students in my classes regularly offer me opportunities to move out of my comfort zone and into my growth zone (and occasionally into my panic zone). I didn't have a good answer for why def main() in Python was a bad practice during that teachable moment ... but after lingering in the growth zone for a while, I think I do now.

The potential problem with this practice is that any function defined at the top level of a Python module becomes part of the namespace for that module, and if the function is imported from that module into the current namespace, it will replace any function previously associated with the function name. This may lead to unanticipated consequences if it is combined with a practice of using wildcards when importing, e.g., from module import * (though it should be noted that wildcard imports are also considered harmful by Pythonistas).

I wrote a couple of simple Python modules - main1.py and main2.py - to illustrate the problem:
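The modules were along these lines (a reconstruction based on the description below, not the original files; writing them to a temporary directory is my device for making the demonstration self-contained in one script):

```python
import os
import sys
import tempfile
import textwrap

# Template for the two hypothetical modules, main1.py and main2.py,
# which differ only in the hard-coded string they print.
MODULE = textwrap.dedent('''\
    import sys

    def main():
        print({label!r})
        print(__name__)
        print(sys.argv[0])

    if __name__ == '__main__':
        main()
''')

tmpdir = tempfile.mkdtemp()
for name in ('main1', 'main2'):
    with open(os.path.join(tmpdir, name + '.py'), 'w') as f:
        f.write(MODULE.format(label=name + '.py'))
sys.path.insert(0, tmpdir)

from main1 import *   # binds main to main1's main()
from main2 import *   # silently rebinds main to main2's main()

main()  # prints 'main2.py', not 'main1.py'
```

The final call illustrates the hazard: after the second wildcard import, main1's main() is no longer reachable by its own name.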

The main functions are identical except that one has the string 'main1.py' where the other has 'main2.py'. If either of these modules is executed from the command line, it executes its own main() function, printing out the hard-coded string and the values of __name__ and sys.argv[0] (the latter of which will only have a value when the module is executed from the command line).

Now, this may all be much ado about little, especially given the aforementioned caveat about the potential harm of using wildcards in import statements. I suppose if one were to execute the more Pythonistic selective imports, i.e., from main1 import main and from main2 import main, at least the prospect of the main() function being overridden might be more apparent. But people learning a programming language for the first time - er, or teaching a programming language for the second time - often use shortcuts (such as wildcard imports), and so I offer all this as a plausible rationale for refraining from def main() in Python.

As part of my practice of leaning into discomfort and staying in the growth zone, I welcome any relevant insights or experiences that others may want to share.