
Wojtek Kulikowski


I am a software engineer writing about AI, startups and product development. I have yet to include the words Python and cloud for SEO purposes


We won an ML hackathon and did it so wrong

TL;DR. We participated in a hackathon and won. In this post we describe our (rather basic) architecture and the mistakes we made along the way. REST, Django, React and Apache are discussed. Main theme - sticking to tutorials vs doing actual development

Narcissistic photo, sorry

Please note that this post is directed to people starting out with full stack development and probably will bring no value to any experienced programmer. But you can enjoy our fuck ups anyway :wink:

Like most college students at some point, my friends and I recently participated in a hackathon. Winning wasn’t on our minds from the beginning; we just wanted to build a fully functional app over a weekend that we would be at least somewhat proud of. I described the hackathon event and our idea in detail in the previous post. This is the second part, in which I focus fully on the technologies that we used. Honestly, we took the wrong approach to almost every step of our development, so follow along for a funny ride. Snippets of shame included.

What have we built

The final product of our development was a simple app consisting of two views (unfortunately both mainly in Polish):

starting one:

Home page

and the final one:

Result

The first view is a form that lets users enter data about their business. The app then shows them the perfect place for it. Simple.

Backend

We decided to go with Django for our backend development. Why not use Flask, which is much more convenient and better suited to such lightweight apps? The answer is simple - at the time I didn’t know either of them well and wanted to learn Django. :man_shrugging: Also, we needed a Python environment for easy integration with our future machine learning model, so frameworks in other languages were out of the question.

Quick lesson on RESTful APIs

… which I had barely heard of at the start and which are apparently really important in the software development process

REST API stands for REpresentational State Transfer Application Programming Interface. It sounds scary for newbies like me, but it is much simpler in reality. In our case, it just means:

You can access the same URL with several HTTP methods. The most popular are GET and POST.

For a detailed explanation please head over here. The example I like the most is that you could use GET on twitter.com to get a list of the freshest tweets, but also use POST with your credentials and the tweet’s content as arguments to post it online.
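To make the "same URL, different methods" idea concrete, here is a minimal sketch that uses only the Python standard library instead of Django. The `/tweets` path, the `TWEETS` list and the `TweetHandler` class are made up for illustration; this is not how Twitter or our app works, just the smallest thing that shows GET reading data and POST adding to it.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

TWEETS = ["first tweet"]  # hypothetical in-memory "database"


class TweetHandler(BaseHTTPRequestHandler):
    """Serves one URL that behaves differently for GET and POST."""

    def _send_json(self, payload, status=200):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # GET: read-only, return the current list of tweets.
        self._send_json({"tweets": TWEETS})

    def do_POST(self):
        # POST: read the JSON body and store the new tweet.
        length = int(self.headers.get("Content-Length", 0))
        data = json.loads(self.rfile.read(length))
        TWEETS.append(data["content"])
        self._send_json({"tweets": TWEETS}, status=201)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def demo():
    """Start a throwaway server, send one GET and one POST to the same URL."""
    server = HTTPServer(("127.0.0.1", 0), TweetHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/tweets"
    # Ignore any system-wide proxy settings, we only talk to localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

    with opener.open(url) as resp:  # GET
        before = json.load(resp)

    request = urllib.request.Request(
        url,
        data=json.dumps({"content": "hello hackathon"}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with opener.open(request) as resp:  # POST
        after = json.load(resp)

    server.shutdown()
    server.server_close()
    return before, after
```

Calling `demo()` returns the tweet list as seen by the GET request and the list returned by the POST request, which contains one more entry.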

How did we use it

The gist of our simple app was, surprisingly, quite simple. When users enter our app’s address, they send a GET request. The backend then sends the needed resources to the frontend and the app responds with a nice front page view. After the user fills in the form and hits the big red button, they send us a POST request with the chosen options as arguments. The backend computes the answer, serves it to the frontend and we are done with the whole functionality. Sounds simple.

Yeah, so let’s see the implementation.

If all you have is a hammer, everything looks like a nail

A good REST programmer should know that API communication usually boils down to sending JSON back and forth and that there is no magic involved. I didn’t.

However, I had heard about django-rest-framework and thought that it would be useful, so Friday night was spent on tutorials and all the stuff that was sooo unnecessary.

We ended up with models.py and serializers.py crawling around the project and views.py looking like this:

from django.http import JsonResponse

from eventify.models import OperationType, Profile, YearlyMaxRevenue, BusinessType, Area
from eventify.serializers import OperationTypeSerializer, ProfileSerializer, YearlyMaxRevenueSerializer, BusinessTypeSerializer

def businesstypes_list(request):

    if request.method == 'GET':
        operation_types = OperationType.objects.all()
        operation_types_serializer = OperationTypeSerializer(operation_types, many=True)
        profiles = Profile.objects.all()
        profiles_serializer = ProfileSerializer(profiles, many=True)
        revenues = YearlyMaxRevenue.objects.all()
        revenues_serializer = YearlyMaxRevenueSerializer(revenues, many=True)

        return JsonResponse({'operation_types': operation_types_serializer.data,
                             'profiles': profiles_serializer.data,
                             'yearly_max_revenues': revenues_serializer.data}, safe=False)
    
    elif request.method == 'POST':
        ... # This wasn't as bloated

instead of this:

def businesstypes_list(request):

    if request.method == 'GET':
        operation_types = []  # Lists filled with the entry data
        profiles = []
        revenues = []

        response_dict = {
            'operation_types': operation_types,
            'profiles': profiles,
            'revenues': revenues,
        }

        return JsonResponse(response_dict, safe=False)

    elif request.method == 'POST':
        ... # This wasn't as bloated

The first version probably scales better and is overall a good solution for a large system, but it consumed a good couple of hours that we could have used much more efficiently. It also increased our BugO notation polynomially. We made 6 migrations of the database and the migrations directory ended up looking like this:

migrations
    0001_initial.py
    0002_auto_20190223_1321.py
    0003_auto_20190223_1421.py
    0004_area.py
    0005_fake.py
    0006_fake2.py
    __init__.py
    __pycache__

THIS IS TERRIBLE, especially knowing that we didn’t need models anyway. We will probably go with an improved version of this approach, but for a 24 hour hackathon you don’t need the full package. Just write your tiny data to a Python list and then put it in JSON using JsonResponse. Done.
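Stripped of Django, the "list to JSON" approach can be sketched in a few lines. The option values below are invented placeholders and `build_form_options` and `to_json` are illustrative names; `JsonResponse` does roughly the `json.dumps` step for you and wraps the result in an HTTP response.

```python
import json

# Hypothetical form options, invented for this sketch.
OPERATION_TYPES = ["online", "stationary"]
PROFILES = ["retail", "services"]
REVENUES = ["< 100k", "100k - 1M"]


def build_form_options():
    """Collect the form options in a plain dict, no models or serializers."""
    return {
        "operation_types": OPERATION_TYPES,
        "profiles": PROFILES,
        "revenues": REVENUES,
    }


def to_json(payload):
    """Turn the dict into the JSON text that travels to the frontend."""
    return json.dumps(payload)


body = to_json(build_form_options())
print(body)
```

The frontend only ever sees the resulting string, so for tiny, static data it cannot tell whether it came from a database or from three hardcoded lists.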

Frontend (guest chapter written by Dominik)

For our frontend we used mainly TypeScript and React. We started by running eject on a generic codebase generated by Create React App. At the beginning I had some ambitious plans to use Redux and its patterns, so we ended up with a folder structure ready for a Redux store. It didn’t take long to realize that such an approach would take too much time :grin:

Fortunately, our solution would consist of only two major views, so there was no problem with state or props transfer. The biggest issue for me was the styling. I am a backend developer on a daily basis and frontend development always seemed complicated to me. However, I had to learn it in my current job, so fortunately I knew what to do during the hackathon :relieved:

Most of our styles were written completely by hand and required lots of trial and error. It wasn’t really proper design, but I am happy with what we achieved during the hackathon. The site has a responsive design, which turned out to be crucial during the live demo. Most people in the audience accessed the page on their phones, while we had developed it on computers and could easily have forgotten about the mobile experience.

The last big challenge for us was generating the map of Warsaw. The first idea was to go with Google Maps, but their API has shockingly low usage limits. We chose the free plan of Mapbox, which allows up to 40000 API calls per day and lets you search for a place by its name instead of raw coordinates. With the maps set up, I could display what our ML model running on the backend would output.

Machine learning

This part is the most shameful - after all, it was our selling point during the hackathon, but in the end it was the part that worked the worst.

Idea & data engineering

We were given access to 58M unique transaction records. Such a huge amount of data is a blessing and a curse in one. We decided that the minimum set of features we needed was a triplet: the business category in which the payment occurred, the district of Warsaw in which it took place and, finally, the amount of money that changed hands. As simple as it sounds, obtaining the data required a three-table join, which at the time took what felt like an infinite amount of computation time. We couldn’t figure out what was wrong until one of the hackathon experts told us that the SQL database we had been given access to was half-baked and required us to put our own indices on the columns of interest.

What is an index on an SQL table?

From the MySQL documentation:

Indexes are used to find rows with specific column values quickly. Without an index, MySQL must begin with the first row and then read through the entire table to find the relevant rows. The larger the table, the more this costs. If the table has an index for the columns in question, MySQL can quickly determine the position to seek to in the middle of the data file without having to look at all the data. This is much faster than reading every row sequentially.

Indexes are usually implemented as B-trees, which allow lookups on indexed columns in logarithmic rather than linear time. Such a complexity reduction would improve the time of our queries from infinity to a mere couple of seconds.

When we found out that we could add our own indices, we didn’t have much time left. Michał, the guy responsible for data engineering in our team, managed to dump 2000 records from the database to a CSV file that the model would later be trained on. It doesn’t take a genius to suspect that it wouldn’t be a good model :sweat_smile:
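The effect of an index is easy to see on a toy table. The sketch below uses SQLite from the Python standard library rather than the MySQL database we had at the hackathon, and the `transactions` table, its columns and the index name are made up for illustration. It asks the database for its query plan before and after adding an index on the filtered column.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's plan for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the description


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (id INTEGER PRIMARY KEY, district TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO transactions (district, amount) VALUES (?, ?)",
    [(f"district_{i % 18}", float(i)) for i in range(10_000)],
)

sql = "SELECT amount FROM transactions WHERE district = ?"
plan_before = query_plan(conn, sql, ("district_3",))

# Index the column that the query filters on (the same idea applies to join columns).
conn.execute("CREATE INDEX idx_transactions_district ON transactions (district)")
plan_after = query_plan(conn, sql, ("district_3",))

print(plan_before)  # a scan over the whole table
print(plan_after)   # a search that mentions idx_transactions_district
```

On 10,000 rows the difference is invisible to the eye, but the plan shows the change from reading every row to jumping straight to the matching ones, which is what matters on tens of millions of records.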

Model development

So, having 2000 records meant two things:

  1. our model will be rather bad anyway
  2. computation will be rather light and I can compute it on my laptop :smile:

Knowing 1. and 2., and that a demo was all we needed, we settled on a simple SVM model. In practice our almighty machine learning algorithm looked like this:

import pandas as pd
from sklearn import svm

df = pd.read_csv("data_dump.csv")

... # Simple data engineering

regr = svm.SVR()
regr.fit(X_train, y_train)

Our model would work like this: given a category (and possibly some additional features), predict the value of an average transaction in every Warsaw district. Then return the district with the highest score to the user.

We felt kinda bad, because with less than 2 hours left our model turned out to be massively skewed and always showed the same district of Warsaw. Almost everybody noticed that and it was certainly a bummer in terms of user experience. However, this is something easy to fix in the future and we are curious whether we can obtain interesting results using some more advanced algorithms.
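The "score every district, return the best one" step can be sketched independently of the model. In the snippet below `best_district` is an illustrative helper, and `fake_predict` with its lookup table is a stand-in with invented numbers; in the real app the prediction would come from the trained regressor instead.

```python
from typing import Callable, Iterable


def best_district(
    category: str,
    districts: Iterable[str],
    predict: Callable[[str, str], float],
) -> str:
    """Return the district with the highest predicted average transaction."""
    scores = {district: predict(category, district) for district in districts}
    return max(scores, key=scores.get)


# Stand-in for the trained model: a hypothetical table of predicted values.
FAKE_PREDICTIONS = {
    ("cafe", "Wola"): 31.0,
    ("cafe", "Ochota"): 42.5,
    ("cafe", "Bielany"): 28.0,
}


def fake_predict(category: str, district: str) -> float:
    return FAKE_PREDICTIONS[(category, district)]


winner = best_district("cafe", ["Wola", "Ochota", "Bielany"], fake_predict)
print(winner)
```

If the predictions barely depend on the category, as ours did, the same district wins every time, no matter what the user types into the form.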

Deployment

The single most important thing that I wanted to achieve during the hackathon was an app that would be accessible to anybody at a public URL.

Easier said than done.

To put things into perspective: for your app to have a public IP, you need to put it on a server and use some magi.. cough, software to take care of everything for you. Nowadays, there are two popular tools for that: Apache and Nginx.

Server

For the heart of our app we needed the cheapest and smallest option possible. There are lots of server providers offering different plans, but not every provider has a 30-minute video of Sentdex deploying his app on their machine. A $20 promo code also came in handy.

Needless to say, we went with https://linode.com. This part was easy.

Apache setup

In his video, Sentdex uses Apache for the deployment and so did we. The process should be simple, but we managed to make it hard.

Watching a tutorial and adjusting stuff is ok-ish in a verbose environment, where you can quickly adjust your configuration to the error messages popping up in the terminal. Apache does provide logs, but it is not obvious where to find them. For a long time we were blindly looking for mistakes in the configuration, only to discover that the logs are stored in /var/log/apache2/ (at least on Debian and Ubuntu) and can be easily followed with tail -f running on the side.

WSGI and how Apache knows about Django

This is still a mystery to me and a topic to research in a future post.
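Until that post exists, here is a minimal sketch of the basic idea as I currently understand it: WSGI is a calling convention in which the web server calls a Python function named `application` once per request. Django generates a wsgi.py file that exposes such an object, and Apache's mod_wsgi is pointed at that file. The `call_like_a_server` helper below is a made-up stand-in that plays the server's role, not how Apache actually does it.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """A bare WSGI app: the server calls this once per HTTP request."""
    body = f"You asked for {environ['PATH_INFO']}".encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call_like_a_server(app, path):
    """Play the server's role: build environ, call the app, collect the output."""
    environ = {}
    setup_testing_defaults(environ)  # fills in a minimal, valid WSGI environ
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body


status, body = call_like_a_server(application, "/result")
print(status, body)
```

A Django project plugs into the same contract; the framework just hides the `environ` dictionary behind its request objects and views.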

React on the server

We ended up messing with our ports quite badly: React was listening on port 3000 and Django on another port. It should have been the other way around, but in the end it was just a minor inconvenience in using the app. React was running in debug mode the whole time, because I had no time to ask my frontend friend how to turn it off. It was crazy and I will try to cover all of this in the future.

Summary

I never expected that this post would run to over 2000 words and take a week to write. After all, all programmers are bad at planning :wink:

I hope that after reading the post you know more about a simple software development process. There are many mistakes to make and repeat continuously, but the Internet gives us the power to share them and learn together.

Frankly speaking, at first I wanted to cover more technical aspects and explain them in my own words. However, I decided to put them in separate posts in the future, as I ran out of space (and time) in this one. Hope you got to learn something anyway!

Till the next one,

Wojtek