The Israeli Ministry of Defense’s R&D Directorate (MAFAT) launched a fascinating data-science challenge on April 13th, 2022: MAFAT Challenge – WiFi Sensing – Non Invasive Human Presence Detection. The participants' goal was to identify whether a particular room was empty or occupied, and to infer how many people were present in the room, by analyzing the RSSI measurements of a wireless network.
We interviewed Harel Rom, the first-place prize-winner in track 1 and the second-place prize-winner in track 2.
The challenge was to detect and count human occupancy within rooms using WiFi RSSI data. In the first track of the competition the goal was to detect whether a particular room was empty or occupied, and in the second track the goal was to count how many people were present in the room. The data consists of WiFi Received Signal Strength Indicator (RSSI) measurements recorded by a router. The RSSI data was collected from WiFi channels established between the router and various devices: laptops, cell phones, tablets, and Raspberry Pis.
For further information please see the competition CodaLab page.
Hi Harel, congratulations! Please tell us about yourself.
My background lies in computer science and data science. In the past, I was involved in data communication research and did a lot of research and analysis. I have a B.Sc. in computer science, and I’m currently working on my M.Sc. in neuroscience. My thesis focuses on the identification of Parkinson's disease from face videos.
Recently, I started a new job, and as part of its training program I was asked to join a specific data science competition. My manager wasn't at work that day and I couldn't find the competition he had talked about, but I came across the MAFAT WiFi Challenge. This challenge interested me because I felt it might be similar to some of the issues I'm dealing with in my thesis. Although I didn't have any previous experience with ML competitions, I saw that I got nice scores on the public test set, so I wanted to keep trying and get better.
How did you approach the challenge? What was your general strategy?
My general strategy was to use knowledge and principles from my thesis. I’m familiar with the tsfresh Python package, which is very powerful for feature engineering on time series, and I assumed it might be helpful in this challenge.
How did you handle the features in the model? Did you perform any feature engineering?
I knew that each sample contains the signal measured at each of the router's two antennas, left and right, and I wanted my model to combine those two values into one. It seemed intuitive to me that the difference between the values might be meaningful, so I built features mainly on those differences. After trying several variants, I found that the absolute difference between the antenna values worked best. I extracted all the features tsfresh offers, then kept those with the highest correlation with the target variables and those that contributed the most to the models. From those features, I created some more features with different parameters.
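The antenna-difference idea can be illustrated with a minimal sketch. The RSSI values below are invented, the function names are hypothetical, and the hand-written summary statistics are only a stand-in for the much larger feature set that tsfresh produces; this is not the winner's actual code.

```python
from statistics import mean, pstdev


def antenna_abs_diff(left, right):
    """Element-wise absolute difference between the two antennas' RSSI series."""
    if len(left) != len(right):
        raise ValueError("antenna series must have equal length")
    return [abs(l - r) for l, r in zip(left, right)]


def summary_features(series):
    """A few simple statistics (stand-ins for tsfresh's automatic features)."""
    return {
        "mean": mean(series),
        "std": pstdev(series),
        "min": min(series),
        "max": max(series),
    }


# Invented RSSI readings (dBm) for one short segment.
left = [-40, -42, -41, -45]
right = [-43, -42, -38, -40]

diff = antenna_abs_diff(left, right)
features = summary_features(diff)
```

In practice the difference series would be passed to a feature-extraction library, and only the most informative features would be kept.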
Did you split the given training data into training and validation sets? How?
I divided the data into training and validation sets based on the rooms the segments were taken from. By doing so, I realized that room 6 gave very different results from the other rooms, but I didn’t know why. I checked this by submitting different models that weren't trained on this room's data, and found that room 6 consistently hurt the model's results. I tried to analyze whether there was something different about this room but didn’t find anything, so I decided to leave room 6 out of my models and give up on that data.
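A room-based split like the one described above could be sketched as follows. The segment structure (a dict with a `room` key) and the choice of validation room are assumptions made for illustration; only the exclusion of room 6 comes from the interview.

```python
def split_by_room(segments, validation_rooms, excluded_rooms=()):
    """Split segments so that no room appears in both training and validation."""
    train, validation = [], []
    for segment in segments:
        room = segment["room"]
        if room in excluded_rooms:
            continue  # drop rooms that are left out entirely
        if room in validation_rooms:
            validation.append(segment)
        else:
            train.append(segment)
    return train, validation


# Invented segments; "label" is 1 for occupied and 0 for empty.
segments = [
    {"room": 1, "label": 0},
    {"room": 1, "label": 1},
    {"room": 3, "label": 1},
    {"room": 6, "label": 0},
]

train, validation = split_by_room(segments, validation_rooms={3}, excluded_rooms={6})
```

Holding out whole rooms gives a validation score that reflects how the model behaves in a room it has never seen, which a random split over segments would not.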
The training data wasn’t balanced. How did you handle it?
I tried different balancing methods. Eventually, what worked best was computing an in-room weight and multiplying it by an out-room weight. This helped keep the samples balanced between the rooms. Also, because I realized room 1 improved the model's results, I increased the weight for this room and, by the same logic, removed room 6.
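The interview does not define the two weights, so the sketch below is one possible reading and not the winner's formula: the in-room weight balances the classes inside each room, the out-room weight balances the rooms against each other, and an optional per-room boost mimics the extra weight given to room 1. The function name, the boost value, and the data are hypothetical.

```python
from collections import Counter


def sample_weights(rooms, labels, room_boost=None):
    """Per-sample weight = in-room class balance * between-room balance * boost."""
    room_boost = room_boost or {}
    n_total = len(rooms)
    room_sizes = Counter(rooms)
    pair_counts = Counter(zip(rooms, labels))
    # Number of distinct classes observed in each room.
    classes_per_room = Counter(room for room, _ in pair_counts)

    weights = []
    for room, label in zip(rooms, labels):
        # Assumed "in-room" weight: inverse class frequency within the room.
        in_room = room_sizes[room] / (
            classes_per_room[room] * pair_counts[(room, label)]
        )
        # Assumed "out-room" weight: inverse room size relative to all rooms.
        out_room = n_total / (len(room_sizes) * room_sizes[room])
        weights.append(in_room * out_room * room_boost.get(room, 1.0))
    return weights


# Invented data: room 1 has 3 samples, room 2 has 5.
rooms = [1, 1, 1, 2, 2, 2, 2, 2]
labels = [0, 0, 1, 0, 1, 1, 1, 1]

weights = sample_weights(rooms, labels)
boosted = sample_weights(rooms, labels, room_boost={1: 2.0})
```

Under this reading, every room contributes the same total weight, and within a room each class contributes the same total weight. Weights of this kind can be passed to most training APIs as per-sample weights.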
What augmentations have you performed over the data?
To make the most of the data, I performed different augmentations and created more data for training. After a cross-validation process, I settled on a 20-second sliding window, and I also swapped the right and left antennas as a data augmentation.
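Both augmentations can be sketched in a few lines. The window is expressed in samples here because the interview does not state the sampling rate, and the step size is an assumed parameter; neither value reflects the actual competition setup.

```python
def sliding_windows(left, right, window, step):
    """Cut paired antenna series into overlapping (left, right) windows."""
    if len(left) != len(right):
        raise ValueError("antenna series must have equal length")
    return [
        (left[start:start + window], right[start:start + window])
        for start in range(0, len(left) - window + 1, step)
    ]


def augment_with_swap(windows):
    """Add a copy of every window with the two antennas exchanged."""
    return windows + [(right, left) for left, right in windows]


# Invented readings: 6 samples per antenna, windows of 4 samples, step of 2.
left = [-40, -41, -42, -43, -44, -45]
right = [-50, -51, -52, -53, -54, -55]

windows = sliding_windows(left, right, window=4, step=2)
augmented = augment_with_swap(windows)
```

Overlapping windows multiply the number of training segments, and the swap doubles them again, at the cost of strongly correlated samples that should stay on the same side of any train/validation split.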
What was your final model architecture?
For my main architecture, I used XGBoost, which is a gradient boosting library that creates a boosted ensemble of decision trees. The model included 100 trees, and each tree was trained on the full training set. After the cross-validation process, I set the maximum tree depth to 2, the minimal leaf weight to 36, and alpha to 138. To avoid overfitting, I took a few steps: augmentations, adding new features, adding weights, and calculating the absolute difference between the router antennas.
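For readers who want to try a similar setup, the quoted hyperparameters might map onto XGBoost's scikit-learn interface roughly as below. The mapping is an assumption: "minimal leaf weight" is read as `min_child_weight`, "alpha" as the L1 regularization term `reg_alpha`, and "trained over the full training set" as `subsample=1.0`. The function name is hypothetical, and the interview does not say which XGBoost interface was used.

```python
# Hyperparameters quoted in the interview, expressed with XGBoost's
# scikit-learn parameter names (the name mapping is an assumption).
XGB_PARAMS = {
    "n_estimators": 100,     # 100 trees
    "max_depth": 2,          # maximum tree depth
    "min_child_weight": 36,  # assumed meaning of "minimal leaf weight"
    "reg_alpha": 138,        # assumed meaning of "alpha" (L1 regularization)
    "subsample": 1.0,        # every tree sees the full training set
}


def build_occupancy_model():
    """Create an untrained classifier with the parameters above."""
    # Imported lazily so the parameter dictionary can be inspected
    # even where xgboost is not installed.
    from xgboost import XGBClassifier

    return XGBClassifier(**XGB_PARAMS)
```

Shallow trees, a large minimum child weight, and strong L1 regularization all restrict how closely the ensemble can fit individual training segments, which is consistent with the emphasis on avoiding overfitting.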
You won first place in the occupancy track and second place in the people counting track. How did you approach the people counting track?
Unfortunately, I didn’t have a lot of time to dedicate to track 2’s problem. I tried different approaches, and eventually used my track 1 solution as a baseline model and set thresholds to determine the number of occupants in the room: 0, 1, or 2. I gave up on trying to predict 3. I felt that the track 2 problem was unsolvable, and I felt that the model I submitted reflected that.
Thank you for this interview! And congratulations on winning this challenge! Do you have any interesting insights you want to share?
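Thresholding an occupancy score into a count could look like the sketch below. The interview does not say which model output was thresholded or what the cut-offs were, so the score range, the threshold values, and the function name are all hypothetical.

```python
from bisect import bisect_right


def count_from_score(score, thresholds=(0.5, 0.9)):
    """Map an occupancy score to 0, 1 or 2 people using ordered cut-offs.

    The default thresholds are invented for illustration; a score at or
    above a threshold falls into the higher bucket.
    """
    return bisect_right(thresholds, score)


# Invented occupancy scores from a track 1 style model.
scores = [0.2, 0.5, 0.7, 0.95]
counts = [count_from_score(s) for s in scores]
```

A scheme like this can never output 3, which matches the decision to give up on predicting that class.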
Overall it was a very rewarding challenge and allowed me to deal with new, interesting data. The challenge gave me a glimpse of the WiFi and signals world, and the challenges in this area. On a personal level, participating and winning in this challenge was a wonderful opportunity for me to demonstrate my machine learning and data science capabilities.