This is a guest post from Michael Li of The Data Incubator. The Data Incubator runs a free eight-week data science fellowship to help transition its Fellows from academia to industry. This post runs through some of the tools you'll need to know to kickstart your data science career.

If you're an aspiring data scientist but still processing your data in Excel, you might want to upgrade your toolset. Why? First, while advanced features like Excel pivot tables can do a lot, they don't offer nearly the flexibility, control, and power of tools like SQL, or of their functional equivalents in Python (Pandas) or R (data frames). Second, Excel caps a worksheet at roughly a million rows, making it suitable for "small data," not "big data."

In this blog entry we'll talk about SQL. This should cover your "medium data" needs, which we'll define as the next level of data: datasets too large for Excel's roughly one-million-row limit. SQL stores data in tables, which you can think of as a spreadsheet layout but with more structure. Each row represents a specific record (e.g. an employee at your company) and each column corresponds to an attribute (e.g. name, department id, salary). Critically, all values in a column must be of the same "type." Here is a sample of the table Employees:

EmployeeId | Name  | StartYear | Salary | DepartmentId
1          | Bob   | 2001      | 10.5   | 10
2          | Sally | 2004      | 20     | 10
3          | Alice | 2005      | 25     | 20
4          | Fred  | 2004      | 12.5   | 20

SQL has many keywords in its query language, but the ones most relevant to data scientists are SELECT, WHERE, GROUP BY, and JOIN. We'll go through each of these individually.

SELECT

SELECT is the foundational keyword in SQL: it retrieves rows from a table and lets you pick out specific columns. For example

SELECT Name, StartYear FROM Employees

returns

Name  | StartYear
Bob   | 2001
Sally | 2004
Alice | 2005
Fred  | 2004

WHERE

The WHERE clause filters the rows. For example

SELECT * FROM Employees WHERE StartYear=2004

returns

EmployeeId | Name  | StartYear | Salary | DepartmentId
2          | Sally | 2004      | 20     | 10
4          | Fred  | 2004      | 12.5   | 20

GROUP BY

Next, the GROUP BY clause combines rows that share a value and aggregates them using functions like COUNT (count) and AVG (average). For example,

SELECT StartYear, COUNT(*) AS Num, AVG(Salary) AS AvgSalary
FROM Employees
GROUP BY StartYear

returns

StartYear | Num | AvgSalary
2001      | 1   | 10.5
2004      | 2   | 16.25
2005      | 1   | 25

JOIN

Finally, the JOIN clause allows us to join in other tables. For example, assume we have a table called Departments:

DepartmentId | DepartmentName
10           | Sales
20           | Engineering

We could use JOIN to combine the Employees and Departments tables on their DepartmentId fields:

SELECT Employees.Name, Departments.DepartmentName
FROM Employees
JOIN Departments ON Employees.DepartmentId = Departments.DepartmentId

returns

Name  | DepartmentName
Bob   | Sales
Sally | Sales
Alice | Engineering
Fred  | Engineering

We've ignored a lot of details about joins (for instance, there are actually at least four types of join), but hopefully this gives you a good picture.

Conclusion and Further Reading

With these basic commands, you can get a lot of data processing done. Don't forget that you can nest queries and build really complicated joins. SQL is a lot more powerful than Excel and gives you much better control of your data. Of course, there's a lot more to SQL than what we've mentioned; this is only intended to whet your appetite and give you a taste of what you're missing.
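To illustrate the nesting mentioned above, here is a small sketch using Python's built-in sqlite3 module. The data matches the sample Employees table, and the question asked ("who earns above the average salary?") is just an illustration, not something from the post:

```python
import sqlite3

# In-memory database loaded with the sample Employees table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Employees (
        EmployeeId INTEGER, Name TEXT, StartYear INTEGER,
        Salary REAL, DepartmentId INTEGER
    )
""")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?, ?, ?)",
    [(1, "Bob", 2001, 10.5, 10), (2, "Sally", 2004, 20, 10),
     (3, "Alice", 2005, 25, 20), (4, "Fred", 2004, 12.5, 20)],
)

# A nested query: the inner SELECT computes the company-wide average
# salary, and the outer WHERE filters against it.
rows = conn.execute("""
    SELECT Name, Salary FROM Employees
    WHERE Salary > (SELECT AVG(Salary) FROM Employees)
""").fetchall()
print(rows)  # [('Sally', 20.0), ('Alice', 25.0)]
```

The average here is (10.5 + 20 + 25 + 12.5) / 4 = 17, so only Sally and Alice qualify.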

And when you’re ready to step it up from “medium data” to “big data”, you should apply for a fellowship at The Data Incubator where we work with current-generation data-processing technologies like MapReduce and Spark!

Dear Future Developers

Please store as much data as possible. Do not worry about the cost of the extra storage disks. The value in the data will far outweigh the cost of the hardware. Here are some examples of data that could be stored but is typically not.

Start storing data about the order in which pages on your site get visited. Where do visitors most often land, and where do they go from there? Is there a path that leads to visitors becoming customers? Is there a path that leads to visitors leaving? Both would be good to know. Given enough of this data, it would be possible to predict what pages eventually lead to the most customers.

Start storing log information in a database. Some places do this, but far too many do not. As developers, we should give this a higher priority. It is never fun to debug a problem only to find the log file has been overwritten. Setting up a database for logs would definitely save on debugging time. Plus, the log data could help in spotting trends, or parts of the system that frequently have issues. It is important to remember that not all bugs produce errors, so it is important to store all the log data.
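One minimal sketch of the idea, using only Python's standard library: a custom logging handler that appends every record to a SQLite table instead of a rotating file. The table name and columns here are illustrative, not from the post:

```python
import logging
import sqlite3

class SQLiteHandler(logging.Handler):
    """Logging handler that appends records to a SQLite table."""

    def __init__(self, path):
        super().__init__()
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS logs "
            "(ts REAL, level TEXT, logger TEXT, message TEXT)"
        )

    def emit(self, record):
        # One row per log record; nothing ever gets rotated away.
        self.conn.execute(
            "INSERT INTO logs VALUES (?, ?, ?, ?)",
            (record.created, record.levelname, record.name, record.getMessage()),
        )
        self.conn.commit()

handler = SQLiteHandler(":memory:")  # a file path in production
logger = logging.getLogger("app")
logger.setLevel(logging.INFO)  # store all the log data, not just errors
logger.addHandler(handler)

logger.info("user 42 checked out")
stored = handler.conn.execute("SELECT level, message FROM logs").fetchall()
print(stored)  # [('INFO', 'user 42 checked out')]
```

Because the records land in a table, old entries stay queryable for trend analysis instead of disappearing with the next log rotation.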

Start storing data about the errors that occur and what (screen/page) caused each error. This information is typically kept in a log file somewhere and is too frequently lost after a couple of days. It would be much better to store it in a database for archival purposes. This is closely related to the previous paragraph.

Start storing information about which fields on a form get updated. Then you can notice if users are constantly returning to the same form to update a different field. Maybe the user was unaware that both fields could be updated simultaneously. Rearranging the fields might create a better user experience, and it will decrease the number of updates hitting the database.

Start storing data about which buttons and links users click. This is not just the pages visited but the actual user actions. A good web analytics program can cover some of this, but why not store all of it yourself? Then you can do with it as you please. It would be great to know which buttons on your site users click the most. Is it the color, the location, both, or neither that determines a popular button? Which buttons and links never get clicked? How frequently does the same user click each button? If a user keeps coming back and clicking the same button, it may indicate a navigation issue. There are some nice usability enhancements that can be made with this data.
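A sketch of what such a click store might look like, again using SQLite for the sake of a runnable example (a NoSQL store as suggested below would work just as well); the schema and element names are hypothetical:

```python
import sqlite3
import time

# Hypothetical click-event table: one row per user action.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clicks (ts REAL, user_id TEXT, page TEXT, element TEXT)"
)

def record_click(user_id, page, element):
    """Append one click event; writes are append-only, so inserts stay fast."""
    conn.execute(
        "INSERT INTO clicks VALUES (?, ?, ?, ?)",
        (time.time(), user_id, page, element),
    )
    conn.commit()

record_click("u1", "/pricing", "signup-button")
record_click("u2", "/pricing", "signup-button")
record_click("u1", "/docs", "search-box")

# Which elements get clicked the most?
top = conn.execute(
    "SELECT element, COUNT(*) AS n FROM clicks "
    "GROUP BY element ORDER BY n DESC"
).fetchall()
print(top)  # [('signup-button', 2), ('search-box', 1)]
```

With the raw events in hand, questions like "does the same user keep clicking the same button?" become a simple GROUP BY over user_id and element.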

Start storing data that you cannot immediately see as useful. The big data movement is continually showing the advantage of having more data. You never know when, or for what, the data will be useful.

Many of the current NoSQL options would be good candidates for storing the above data. This data will obviously grow very quickly, and speedy inserts are a must, so a database like MongoDB, Cassandra, or Redis might be a good choice.

What other data do you think could be collected? I am sure there are lots of other possibilities. Also, I am going to take myself up on this challenge: I would like to store more information about the software I build.