Street Image Recognition
[ Last Updated: 2025-4-21 22:34 ]
Speaking of CNNs, one cannot avoid the topic of Street View Recognition.
This is perhaps where my personal projects have the most direct connection to neural networks. Although I haven't yet found the time to build a small street view recognition project of my own, I am forcing myself to sit down and finish this article first.
1. Basic Methods of Image Decomposition
The image classification problems we discussed previously regarding CNNs only touch on a very superficial question: "What object does this image contain?" For example, a dog, a house, or a person. This can actually be handled by simple neural networks or even logistic regression (the classic handwritten digit problem can be solved without using CNNs at all).
However, the information carried in images is vast. This requires deeper learning to decompose and understand the latent information within images, or in other words, "letting the machine read the picture." Consequently, we have two paths to further extract information from images: Object Localization and Semantic Segmentation.
- Object Localization answers the question: 👉 "Where is the object in the image?" It not only identifies what it is but also frames its position. For instance, below we have circled the dog, the bicycle, and the car.

- Semantic Segmentation goes a step further by answering: 👉 "What is this pixel?" It assigns every pixel of the image to a category, so as to understand what each part of the image represents. As shown in the image below, the picture is clearly divided into different categories: traffic signs, pedestrians, ground, roadway, cars, sky, trees, and buildings.

2. Implementation Methods for Object Localization
2.1 Dividing the Image into a Grid
To enumerate every possible "localization box" more efficiently, we generally do not use individual pixels as the basic unit. Instead, the image is divided into larger grid cells: the image below shows a 19×19 grid.
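As a toy illustration of what using the grid as the basic unit means, the little sketch below (my own example; the 608×608 input size is an assumption chosen so that it splits evenly into 19 cells of 32 pixels) maps an object's centre point to the grid cell responsible for it.

```python
# Toy sketch: which grid cell is responsible for an object, assuming a
# 19x19 grid over a 608x608 input image (608 is just an assumed size
# that divides evenly into 19 cells of 32 pixels each).
S = 19            # grid cells per side
IMG_SIZE = 608    # assumed input width/height in pixels
CELL = IMG_SIZE / S

def owning_cell(cx, cy):
    """Return (row, col) of the grid cell containing the point (cx, cy)."""
    return int(cy // CELL), int(cx // CELL)

# An object whose centre sits at pixel (300, 150) belongs to this cell:
print(owning_cell(300, 150))   # (4, 9)
```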

2.2 Defining Bounding Boxes
Next, for each grid cell, we check it against several pre-defined anchor boxes of various shapes (imagine tall, square, flat, etc.; five are chosen in the image above). For each anchor box, the convolutional neural network estimates the probability that it contains an object to be recognized, which class the object belongs to, the coordinates of the box's centre point, and the box's height and width (expressed here as adjustments relative to the original anchor proportions).

Bounding boxes of different sizes

2.3 Redefining the Output Layer
For the output layer, each bounding box needs to record the probability for each specific class.
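To make the bookkeeping concrete, here is one way such an output tensor could be laid out. The 19×19 grid and the 5 anchor boxes follow the figures above, while the 80 classes and the exact ordering of the values are just assumptions for illustration; real detectors differ in these details.

```python
import numpy as np

S = 19   # grid cells per side (the 19x19 grid above)
B = 5    # anchor boxes per cell (five shapes were chosen above)
C = 80   # number of classes; an assumed value for illustration

# For each cell and each anchor box the network predicts:
#   1 objectness score + 4 box values (x, y, w, h) + C class probabilities
values_per_box = 1 + 4 + C

output = np.zeros((S, S, B, values_per_box))
print(output.shape)   # (19, 19, 5, 85)
print(output.size)    # 153425 numbers describing one image
```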


2.4 What About Duplicate Detections?
However, there is still an important problem to solve: if a car is recorded multiple times in different bounding boxes, how do we remove the duplicates?
This requires an important measure, IoU (Intersection over Union), which quantifies the degree of overlap between two boxes.
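As a quick illustration with made-up coordinates, IoU simply divides the area where two boxes overlap by the area they cover together:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Overlapping rectangle (empty if the boxes don't intersect).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Heavily overlapping boxes give an IoU close to 1, disjoint boxes give 0.
print(iou((100, 100, 200, 180), (105, 95, 205, 185)))   # ~0.81
print(iou((100, 100, 200, 180), (400, 400, 450, 450)))  # 0.0
```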

Here, we use non-max suppression to prune the redundant boxes. Unlike the simple maximum-value extraction used previously, this involves the following steps (a minimal code sketch follows the list):
- First, we need to remove boxes with low probabilities, whether because no object was detected or because the model isn't certain about the category. This usually requires setting a threshold beforehand, such as filtering out anything below 60%.
- Then, for the remaining boxes, we select the box with the highest probability of containing an object. After determining its most likely category, we calculate its degree of overlap with all other boxes detected as the same category, and delete those whose overlap exceeds another preset threshold.
- Repeat the above step until there are no more boxes left to delete, resulting in the final predicted output.
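Putting the steps together, here is a minimal pure-Python sketch of this greedy suppression loop. The 60% score threshold and the 0.5 overlap threshold are arbitrary example values, and the `iou` helper is the one from the sketch above, repeated so this snippet runs on its own.

```python
def iou(a, b):
    """Same IoU as in the earlier sketch, repeated so this snippet is standalone."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def non_max_suppression(detections, score_thresh=0.6, iou_thresh=0.5):
    """detections: list of (score, class_name, (x1, y1, x2, y2))."""
    # Step 1: drop boxes the model isn't confident about.
    boxes = sorted((d for d in detections if d[0] >= score_thresh),
                   key=lambda d: d[0], reverse=True)
    kept = []
    # Steps 2-3: keep the best remaining box, delete same-class boxes that
    # overlap it too much, and repeat until no boxes are left.
    while boxes:
        best = boxes.pop(0)
        kept.append(best)
        boxes = [d for d in boxes
                 if d[1] != best[1] or iou(d[2], best[2]) < iou_thresh]
    return kept

detections = [
    (0.9, "car", (100, 100, 200, 180)),
    (0.8, "car", (105, 95, 205, 185)),   # duplicate of the first car
    (0.7, "dog", (300, 200, 380, 300)),
    (0.3, "car", (400, 400, 450, 450)),  # filtered out by the 60% threshold
]
for det in non_max_suppression(detections):
    print(det)   # keeps one car and the dog
```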
Example:

Attached: The architecture when YOLO was first published:

As this architecture shows, all the issues regarding box overlap and class probabilities discussed above concern only the final output layer; the layers before it are not significantly different from the CNNs used for plain classification.
3. Implementation of Semantic Segmentation
Based on my superficial understanding, I see semantic segmentation as a process of "blurring first and then sharpening."
Taking the classic U-Net architecture shown below as an example, the first half "descends," moving from individual pixels to abstract "semantics"—the level of object category information. Afterward, it moves from the bottom back up, reasoning from semantics back to the category each pixel should belong to.

Encoder - Convolutional Layers
The sinking part on the left half of the image above is the Encoder. It typically consists of multiple Convolutional layers and Pooling layers; we can treat it as a standard CNN. Similar to other recognition and classification tasks, it gradually extracts high-level semantic features or global features from the image through a series of convolutional layers, without being distracted by local details. (We can understand this as: if we only needed the image category, we would already have the necessary information by this point.)
But if we want to be precise down to every pixel, we must complete the following second half:
Decoder - Transposed Convolution
In this second half, the Decoder gradually restores the low-resolution feature map produced by the encoder back to the original image dimensions through a series of transposed convolutions (sometimes loosely called deconvolutions), achieving a sense of "sharpening" and producing a precise segmentation map.
- Transposed Convolutional Layers: The role of these layers is to gradually increase the spatial resolution of the feature map through upsampling operations, restoring it to the size of the original input image.
This process also scans with different filters and can be understood as the reverse of convolution, except that it expands a small image into a larger one. The two images below show the difference with and without stride; blue represents the input layer, and blue-green represents the transposed output. In this way, the high-dimensional feature map recovers the same resolution as the original image.


- Skip Connections (or "crop and concatenate" in the image): At each step of the upsampling process, the decoder takes the feature map from the corresponding encoder layer (via a skip connection) and concatenates it with its own. This way, it expands the image using not only the compressed high-level semantic information but also the high-resolution local details, which ensures the decoder retains fine structure (like edges) in the segmentation result instead of relying solely on high-level semantics.
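Putting the encoder, the transposed-convolution decoder, and the skip connection together, here is a deliberately tiny U-Net-style sketch in PyTorch. It is my own minimal example rather than the architecture from the original paper: only one down step and one up step, arbitrary channel counts, and padded convolutions so that no cropping is needed before concatenation.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """A toy U-Net: one downsampling step, one upsampling step, one skip connection."""

    def __init__(self, in_channels=3, num_classes=8):
        super().__init__()
        # Encoder: convolution extracts features, pooling shrinks the map.
        self.enc = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.bottom = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU())
        # Decoder: a transposed convolution with stride=2 doubles the
        # spatial resolution (32x32 -> 64x64 for a 64x64 input).
        self.up = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
        # After concatenating the skip connection we have 16 + 16 channels.
        self.dec = nn.Sequential(
            nn.Conv2d(32, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, num_classes, kernel_size=1))  # per-pixel class scores

    def forward(self, x):
        skip = self.enc(x)                     # high-resolution local details
        bottom = self.bottom(self.pool(skip))  # low-resolution "semantics"
        up = self.up(bottom)                   # back to the skip's resolution
        merged = torch.cat([up, skip], dim=1)  # the skip connection
        return self.dec(merged)                # one score map per class

model = TinyUNet()
out = model(torch.randn(1, 3, 64, 64))   # a fake 3-channel 64x64 image
print(out.shape)   # torch.Size([1, 8, 64, 64]): a score per class per pixel
```

Because the convolutions here keep the spatial size unchanged, the skip feature map can be concatenated directly; the original U-Net uses unpadded convolutions, which is why its diagram says "crop and concatenate".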
Honks
I plan to write a post on the practical implementation later. Writing theory like this is still a bit abstract; only hands-on work can truly deepen understanding. However, there is too much to learn lately, and the practical project schedule is quite congested... I haven't gotten around to it yet.
I feel that after wrapping up the CNN summary, I want to write some lighter posts. These two chapters were a bit painful to write; TBH my personal interest in all these neural network things isn't actually that great, even though they are super powerful... It feels like the entire system is built on an extremely radical, "brute-force" stacking of computing power and massive datasets, plus days and nights of hyperparameter fine-tuning. (Maybe I'm wrong. ;)
Also, because I've been writing a lot of interesting things in C# recently (which I should find a chance to write up as well), my progress on neural networks has been quite slow. Still, I want to find some time soon to look into Self-Organizing Maps and how they compare to or combine with K-means (digging another hole). It seems there are many things on my to-do list.
— Untiled Penguin 2025/04/14 22:35