Thesis: An adaptive learning solution based on deep learning networks for recognizing traffic participants

118 pages Hà Tiên 27/02/2024 1820

For each image, the CNN extracts rich features, namely pedestrian postures, roadways, roadsides, and the positions of pedestrians on the road (Figure 2.5). The extracted features are then used to train the SVM classifier model.
a) Input image. 
b) Rich features simulation. 
Figure 2.4 Input image and simulated rich features of the image
In the CNN model, features can be extracted from many layers, such as a convolution layer or a fully connected layer, but the most advantageous one is layer 19 (fc7, a fully connected layer with 4096 outputs), the layer right before the classification layer.
In object recognition tasks involving animals, things, and vehicles, the recognition rate is high (90% to 100%). When predicting the action of pedestrians, however, the features of the input image cover not only a specific object but also others such as vehicles, buildings, trees, and things around the roadside, as shown in Figure 2.5.
Figure 2.5 Influence of other objects on the road on pedestrian movement prediction 
For this reason, to improve accuracy, the ACF algorithm is used to detect pedestrians before the ROI is extracted and the pedestrian action is classified and predicted.
2.2.1.2 Pedestrian action prediction 
Pedestrian detection by ACF. The ACF classification model for person detection is specified as 'inria-100x41' or 'caltech-50x21'. The 'inria-100x41' model was trained using the INRIA Person dataset, and the 'caltech-50x21' model was trained using the Caltech Pedestrian dataset. The 'inria-100x41' model (the default) is the one proposed for ACF here. The ACF detector returns the detection scores (confidence values) as an M-by-1 vector of classification scores in the range [0, 1]. When a pedestrian is detected, a bounding box is drawn, and the score shown on top of the bounding box is the confidence value (as a percentage). Larger scores indicate higher confidence in the detection. In some complex images, the ACF algorithm produces false detections. Based on real-time experiments, a score threshold of 0.25 is proposed to avoid these cases. For example, with a score threshold of 0.1 the result is inaccurate in some cases (Figure 2.7 (a)), whereas with a threshold of 0.25 the result is more accurate (Figure 2.7 (b)).
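The score threshold described above can be illustrated with a minimal Python sketch. It is not the detector used in the thesis: the function name `filter_detections`, the `(x, y, W, H)` box layout, and the choice to keep scores equal to the threshold are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x, y, W, H)


def filter_detections(
    boxes: Sequence[Box],
    scores: Sequence[float],
    threshold: float = 0.25,  # value proposed in the text
) -> List[Tuple[Box, float]]:
    """Keep only detections whose confidence score reaches the threshold."""
    if len(boxes) != len(scores):
        raise ValueError("boxes and scores must have the same length")
    # Assumption: a score exactly equal to the threshold is kept.
    return [(box, score) for box, score in zip(boxes, scores) if score >= threshold]


boxes = [(10, 20, 40, 100), (200, 30, 35, 90), (400, 50, 38, 95)]
scores = [0.10, 0.25, 0.80]
kept = filter_detections(boxes, scores)
```

With the sample values, the detection scored 0.10 is discarded and the other two are kept.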
Figure 2.6 Example input image for recognition 
Figure 2.7 Pedestrian detection with scores = 0.1 (a) and scores = 0.25 (b) 
In particular, when the AV moves on the road, many pedestrians may appear in a single video frame. Therefore, to ensure accuracy, each frame is split into several separate regions, one per detected pedestrian, so that each case can be recognized easily. The extracted region is called the ROI (Figure 2.8). In addition, in real time the image received from the AV is large and contains a lot of irrelevant data. Hence, for each detected pedestrian it is necessary to extract an ROI at a certain scale, which removes the irrelevant surrounding objects. Extracting the ROI helps the CNN model extract accurate features and reduces the error rate of action recognition and classification by the SVM. The size of the ROI is proposed as follows:
Suppose that H and W are the height and width of the rectangle covering the pedestrian, x and y are the coordinates of the top-left corner of the rectangle, and Width and Height are the dimensions of the input image. The values x1, y1, W1, H1, which describe the position and size of the ROI, are defined as follows:
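$$
\begin{aligned}
x_1 &= x - 1.5H \\
y_1 &= y - W \\
W_1 &= W + 3H \\
H_1 &= H + 2W \\
&\text{if } (W_1 > \mathit{Width}) \text{ then } W_1 = \mathit{Width} \\
&\text{if } (H_1 > \mathit{Height}) \text{ then } H_1 = \mathit{Height}
\end{aligned}
\tag{1}
$$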
In special cases, when the ROI given by x1, y1, W1, H1 falls below the edge values of the frame or extends beyond the size of the input image, the values are set to the edge values of the image.
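$$
\begin{aligned}
&\text{if } (x_1 < 0) \text{ then } x_1 = 0 \\
&\text{if } (y_1 < 0) \text{ then } y_1 = 0 \\
&\text{if } (x_1 + W_1 > \mathit{Width}) \text{ then } x_1 = \mathit{Width} - W_1 \\
&\text{if } (y_1 + H_1 > \mathit{Height}) \text{ then } y_1 = \mathit{Height} - H_1
\end{aligned}
\tag{2}
$$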
Moreover, when the ROI extends beyond the input image on one side, it is shifted by the corresponding offset toward the opposite side, as illustrated in Figure 2.8.
Figure 2.8 ROI extraction from pedestrian image 
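The ROI rules can be sketched in Python as follows. The sketch reads Eq. (1) as x1 = x - 1.5H, y1 = y - W, W1 = W + 3H, H1 = H + 2W, and Eq. (2) as clamping at 0 and at the image border. The comparison operators (> and <) are an assumption, because they are not legible in the extracted text, and the function name `compute_roi` is hypothetical.

```python
from typing import Tuple


def compute_roi(
    x: float, y: float, w: float, h: float, width: float, height: float
) -> Tuple[float, float, float, float]:
    """Return (x1, y1, W1, H1) for a pedestrian box (x, y, W, H) in a Width x Height image."""
    # Eq. (1): enlarge the box around the pedestrian.
    x1 = x - 1.5 * h
    y1 = y - w
    w1 = w + 3 * h
    h1 = h + 2 * w
    # Eq. (1), last two rules: the ROI cannot be larger than the image (assumed '>').
    if w1 > width:
        w1 = width
    if h1 > height:
        h1 = height
    # Eq. (2): shift the ROI back inside the image (assumed '<' and '>').
    if x1 < 0:
        x1 = 0
    if y1 < 0:
        y1 = 0
    if x1 + w1 > width:
        x1 = width - w1
    if y1 + h1 > height:
        y1 = height - h1
    return x1, y1, w1, h1


roi_left = compute_roi(100, 50, 40, 100, 640, 480)    # box near the left edge
roi_corner = compute_roi(600, 400, 40, 100, 640, 480)  # box near the bottom-right corner
```

For a 640x480 image, the first call yields an ROI starting at x1 = 0 because x - 1.5H is negative, and the second call shifts the ROI back so that it ends exactly at the right and bottom borders.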
Pedestrian movement prediction: After each ROI is extracted into a single image, its features are extracted by the CNN model and classified by the SVM classifier model. The outputs are labeled according to the predicted pedestrian case (i.e., Pedestrian_crossing, Pedestrian_waiting, Pedestrian_walking):
(i) Pedestrian_crossing: the pedestrian is crossing the road or walking in the lane used by vehicles.
(ii) Pedestrian_waiting: the pedestrian is standing on the roadside and waiting to cross.
(iii) Pedestrian_walking: the pedestrian is walking along the edge of the road.
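The flow from detection to action label can be outlined as a small Python sketch. The four stage functions (`detect`, `crop_roi`, `extract_features`, `classify`) are hypothetical placeholders passed in by the caller; they stand for the ACF detector, the ROI extraction, the CNN feature extractor, and the SVM classifier, none of which is implemented here.

```python
from typing import Any, Callable, Iterable, List, Tuple

LABELS = ("Pedestrian_crossing", "Pedestrian_waiting", "Pedestrian_walking")
Box = Tuple[float, float, float, float]


def predict_actions(
    image: Any,
    detect: Callable[[Any], Iterable[Tuple[Box, float]]],  # placeholder for ACF
    crop_roi: Callable[[Any, Box], Any],                   # placeholder for ROI extraction
    extract_features: Callable[[Any], Any],                # placeholder for the CNN
    classify: Callable[[Any], str],                        # placeholder for the SVM
    score_threshold: float = 0.25,
) -> List[Tuple[Box, str]]:
    """Return one (box, label) pair per pedestrian detected with enough confidence."""
    results = []
    for box, score in detect(image):
        if score < score_threshold:
            continue  # skip low-confidence detections
        roi = crop_roi(image, box)
        label = classify(extract_features(roi))
        if label not in LABELS:
            raise ValueError(f"unexpected label: {label}")
        results.append((box, label))
    return results
```

Passing the stages as arguments keeps the sketch independent of any particular detector or classifier library.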
Figure 2.9 The order of classification of pedestrians when there are many pedestrians on the road in an input image: (1) Pedestrian_crossing, (2) Pedestrian_waiting, (3) Pedestrian_walking
2.2.2 Solution to vehicle recognition 
2.2.2.1 Sequential Deep Learning architecture 
Available pre-trained network models can usually be retrained for vehicle recognition. In our approach, however, reusing a trained model is inappropriate because the input size of the existing models differs from the size of the images actually acquired, and their training parameters do not help improve accuracy. Some proposed models, such as AlexNet [53] and GoogLeNet [28], are effective only for general recognition problems, not for this specific one. There are many different approaches to building a CNN model for vehicle recognition. In this study, we constructed a 24-layer CNN architecture, shown in Table 2.1, consisting of an input layer, convolution layers, rectified linear unit (ReLU) layers, cross-channel normalization layers, max-pooling layers, and fully connected layers. The network transforms the input image into a sequential hierarchical descriptor. The input to the network is the intensity values of the image, and each input sample is a 128×128×3 image. In this model, the filters of the first layer operate on the three color channels R, G, and B. Filters operate independently and jointly among the hidden layers, involving the three channels of the input image. The feature vector produced by the final layers is passed to the classification layer. A convolutional layer convolves its input maps with filters of size nx × ny.
Table 2.1 CNN architecture with 22 hidden layers, 1 input layer, and the final 
classification layer 
No.  Layer type             Parameter
 1   Image Input            Image size 128x128x3
 2   Convolution            64 7x7x3 convolutions with stride [1 1]
 3   ReLU                   ReLU
 4   Normalization          Cross channel normalization
 5   Max Pooling            3x3 max pooling with stride [1 1]
 6   Convolution            64 7x7x64 convolutions with stride [1 1]
 7   ReLU                   ReLU
 8   Max Pooling            2x2 max pooling with stride [1 1]
 9   Convolution            64 7x7x64 convolutions with stride [1 1]
10   ReLU                   ReLU
11   Normalization          Cross channel normalization
12   Max Pooling            2x2 max pooling with stride [1 1]
13   Convolution            64 7x7x64 convolutions with stride [1 1]
14   ReLU                   ReLU
15   Max Pooling            2x2 max pooling with stride [1 1]
16   Convolution            64 7x7x64 convolutions with stride [1 1]
17   ReLU                   ReLU
18   Normalization          Cross channel normalization
19   Max Pooling            2x2 max pooling with stride [1 1]
20   Fully Connected        1024 fully connected layer
21   ReLU                   ReLU
22   Fully Connected        4 fully connected layer
23   Softmax                softmax
24   Classification Output  crossentropyex with 4 other classes
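As a rough check of the model size, the convolution rows of Table 2.1 can be turned into parameter counts. The Python sketch below assumes one bias term per filter, which the table does not state; the fully connected layers are left out because their input size depends on the padding, which is also not given.

```python
# (number of filters, kernel height, kernel width, input channels)
# taken from the Convolution rows 2, 6, 9, 13 and 16 of Table 2.1.
CONV_LAYERS = [
    (64, 7, 7, 3),
    (64, 7, 7, 64),
    (64, 7, 7, 64),
    (64, 7, 7, 64),
    (64, 7, 7, 64),
]


def conv_params(filters: int, kh: int, kw: int, in_channels: int, bias: bool = True) -> int:
    """Number of learnable parameters of one convolution layer."""
    per_filter = kh * kw * in_channels + (1 if bias else 0)  # bias is an assumption
    return filters * per_filter


counts = [conv_params(*layer) for layer in CONV_LAYERS]
total_conv_params = sum(counts)
print(counts, total_conv_params)
```

Under these assumptions the first convolution layer has 9,472 parameters and each of the other four has 200,768, for 812,544 convolution parameters in total.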
2.2.2.2 Data augmentation 
The training dataset, classified during collection, is shown in Figure 2.11. To improve the accuracy of vehicle recognition, we propose to augment the data about 10 times. Images are rotated within [-50, 50], flipped, or have noise added, while the image quality is left unchanged during training. The training dataset after augmentation is shown in Table 2.5.
2.3. Experimental evaluation 
 2.3.1 Pedestrian detection 
 2.3.1.1 Extracting features and training classifier model 
The experiment is carried out with about 3,000 images whose features are extracted by the CNN model. These features are used to train the SVM classifier model. Table 2.2 shows the image and label datasets used for feature extraction and training.
Table 2.2 Image and label datasets of extracted and trained features 
Class Number Label 
Pedestrian crossing 1,000 Pedestrian_crossing 
Pedestrian waiting 1,000 Pedestrian_waiting 
Pedestrian walking 1,000 Pedestrian_walking 
In each class, 90% of the images are used for training and the remaining 10% for validation.
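A per-class 90/10 split of this kind can be sketched in Python as follows. The random shuffling, the fixed seed, and the function name `split_per_class` are assumptions for illustration; the text does not say how the images were assigned to each part.

```python
import random
from typing import Dict, List, Tuple


def split_per_class(
    items_by_class: Dict[str, List[str]],
    train_fraction: float = 0.9,
    seed: int = 0,  # fixed seed so the sketch is reproducible (assumption)
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Split every class separately into a training part and a validation part."""
    rng = random.Random(seed)
    train, validation = {}, {}
    for label, items in items_by_class.items():
        shuffled = list(items)
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train[label] = shuffled[:cut]
        validation[label] = shuffled[cut:]
    return train, validation


# Placeholder file names standing in for the 1,000 images per class of Table 2.2.
dataset = {
    label: [f"{label}_{i:04d}.jpg" for i in range(1000)]
    for label in ("Pedestrian_crossing", "Pedestrian_waiting", "Pedestrian_walking")
}
train_set, validation_set = split_per_class(dataset)
```

With 1,000 images per class, each class contributes 900 training images and 100 validation images.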
2.3.1.2 Pedestrian detection and action prediction 
With the input images (e.g., Figure 2.6), the output of the ACF pedestrian detection algorithm is as shown in Figure 2.10. When an input frame contains many pedestrians, the ROI of each pedestrian is extracted into a single image for action prediction by the SVM classifier, as shown in Figure 2.10. Features are extracted from each image in Figure 2.10; finally, the system relies on the SVM classification model to predict the pedestrian action and issues the appropriate alerts to the AV, as in Figure 2.9.
Figure 2.10 Pedestrians detected and ROI extracted 
The best recognition rates obtained after training and evaluating on the dataset in Table 2.2 are as follows:
Table 2.3 Maximum confusion matrix for pedestrian action prediction
                     Pedestrian crossing  Pedestrian waiting  Pedestrian walking
Pedestrian crossing  0.9796               0.0204              0
Pedestrian waiting   0.0612               0.9286              0.0102
Pedestrian walking   0.0102               0.0408              0.9490
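The per-class rates of Table 2.3 can be summarized with a short Python sketch. It assumes that each row is normalized to sum to 1 and that the classes are equally represented in the evaluation set, in which case the mean of the diagonal equals the overall accuracy; the thesis does not report this summary figure itself.

```python
# Rows of Table 2.3, in the order crossing, waiting, walking.
CONFUSION = [
    [0.9796, 0.0204, 0.0],
    [0.0612, 0.9286, 0.0102],
    [0.0102, 0.0408, 0.9490],
]

# Diagonal entries: rate of correct predictions for each class.
per_class_rate = [row[i] for i, row in enumerate(CONFUSION)]

# Mean of the diagonal (overall accuracy only under the equal-class-size assumption).
mean_rate = sum(per_class_rate) / len(per_class_rate)
print(per_class_rate, round(mean_rate, 4))
```

Under these assumptions the mean rate is about 0.9524.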
In experiments on real-time road video, the accuracy ranges from a minimum of 82% to a maximum of 97%, with a processing time of 0.6 seconds per detected pedestrian. These results are promising for self-driving applications.
2.3.2 Vehicle recognition 
2.3.2.1 Experimental data 
We conducted experiments on a real database of vehicles, including motors, cars, coaches, and trucks, taken from actual traffic situations. The camera systems typically capture the vehicles in traffic from the front or from behind. The dataset was collected in different practical contexts on different traffic routes in Nha Trang city, Khanh Hoa province, Vietnam. It contains 8,558 vehicle images divided into 4 vehicle classes (motors, cars, coaches, and trucks), illustrated in Figure 2.11. The dataset is partitioned into 60% for training and the remaining 40% for evaluation, as shown in Table 2.4.
(a) Motor 
(b) Car 
(c) Coach 
(d) Truck 
Figure 2.11 Some examples of vehicle categories 
Table 2.4 Training data
Categories  Number of samples            Sample size
            Overall  Train  Evaluation
Motor       2673     1604   1069         128x128
Car         2808     1685   1123         128x128
Coach       1640     984    656          128x128
Truck       1437     862    575          128x128
Table 2.5 Training data after augmentation and balancing
Categories Number of samples 
Motor 16040 
Car 16850 
Coach 17712 
Truck 17240 
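The effect of balancing can be read off by dividing the counts in Table 2.5 by the Train column of Table 2.4. The Python sketch below only performs this arithmetic; interpreting the ratios as per-class augmentation factors is an assumption.

```python
# Train column of Table 2.4 and sample counts of Table 2.5.
TRAIN = {"Motor": 1604, "Car": 1685, "Coach": 984, "Truck": 862}
AUGMENTED = {"Motor": 16040, "Car": 16850, "Coach": 17712, "Truck": 17240}

# Ratio of augmented to original training samples for each class.
factors = {name: AUGMENTED[name] / TRAIN[name] for name in TRAIN}
print(factors)
```

The ratios are 10 for Motor and Car, 18 for Coach, and 20 for Truck, which is consistent with the smaller classes being augmented more heavily so that all four classes end up with a similar number of samples.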
2.3.2.2 Training CNN 
The results obtained after training the CNN model are as follows:
(i) Filter parameters: The first convolution layer uses 64 filters, whose weights are shown in Figure 2.12.
Figure 2.12 The weight values of the filters of the first convolution layer. This layer consists of 64 filters of size 7x7, each of which is connected to the three RGB input channels
(ii) Convolution result: The sample images are fed into the network and passed through the convolution filters. The resulting data show components distinct from the original RGB image, with varied feature responses that capture a variety of vehicle features. The output of a convolution layer contains negative values, which are then adjusted by linear correction (the ReLU layer in Table 2.1). The outputs of some layers for an input sample of the Motor class are shown below.
(a) The output of 64 convolutions at the first convolution layer 
(b) The linear correction value after the first convolution layer 
(c) The output of 64 samples at the second Convolution layer 
Figure 2.13 Some results of convolution and linear correction for input images of motors
2.3.2.3 Categorical vehicle recognition 
In the experiment, three methods were evaluated on the same sample dataset, shown in Table 2.4: (i) the traditional method of HOG and SVM; (ii) the CNN network; (iii) the CNN network combined with data augmentation.
The accuracy of the HOG and SVM method on the sample dataset was 89.31%. The number of samples of each type and the recognition results are shown in Table 2.6.
Table 2.6 Confusion matrix of vehicle recognition using HOG and SVM
        Motor          Car            Coach          Truck
        1069           1123           656            575
        #Num  Per(%)   #Num  Per(%)   #Num  Per(%)   #Num  Per(%)
Motor   1029  97.26    16    1.53     15    1.87     9     1.75
Car     25    2.36     989   94.37    77    9.59     32    6.23
Coach   1     0.09     23    2.19     599   74.60    33    6.42
Truck   3     0.28     20    1.91     112   13.95    440   85.60
The CNN method trained on the original data achieved an average accuracy of 90.10%, as shown in Table 2.7.
Table 2.7 Confusion matrix of vehicle recognition using CNN
        Motor          Car            Coach          Truck
        1069           1123           656            575
        #Num  Per(%)   #Num  Per(%)   #Num  Per(%)   #Num  Per(%)
Motor   1026  95.98    38    3.38     1     0.15     5     0.87
Car     32    2.99     953   84.86    17    2.59     24    4.17
Coach   6     0.56     104   9.26     617   94.05    58    10.09
Truck   5     0.47     28    2.49     21    3.20     488   84.87
The CNN method trained with data augmentation achieved an average accuracy of 95.59%, as shown in Table 2.8.
Table 2.8 Confusion matrix of vehicle recognition using CNN and data augmentation
        Motor          Car            Coach          Truck
        1069           1123           656            575
        #Num  Per(%)   #Num  Per(%)   #Num  Per(%)   #Num  Per(%)
Motor   1060  99.16    11    0.98     0     0        1     0.17
Car     5     0.47     1057  94.12    8     1.22     13    2.26
Coach   0     0        41    3.65     645   98.32    51    8.87
Truck   4     0.37     14    1.25     3     0.46     510   88.70
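The accuracies quoted for the three methods can be reproduced from the #Num columns of Tables 2.6 to 2.8 as the number of correctly classified samples (the diagonal) divided by the total number of evaluation samples. The Python sketch below shows this arithmetic; the dictionary keys and the function name `overall_accuracy` are illustrative only.

```python
# #Num entries of Tables 2.6, 2.7 and 2.8, row by row (Motor, Car, Coach, Truck).
TABLES = {
    "HOG+SVM": [[1029, 16, 15, 9], [25, 989, 77, 32], [1, 23, 599, 33], [3, 20, 112, 440]],
    "CNN": [[1026, 38, 1, 5], [32, 953, 17, 24], [6, 104, 617, 58], [5, 28, 21, 488]],
    "CNN+augmentation": [[1060, 11, 0, 1], [5, 1057, 8, 13], [0, 41, 645, 51], [4, 14, 3, 510]],
}


def overall_accuracy(matrix):
    """Percentage of samples on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total


accuracies = {name: round(overall_accuracy(m), 2) for name, m in TABLES.items()}
print(accuracies)
```

Each matrix contains 3,423 samples in total, and the computed values are 89.31, 90.10, and 95.59 percent, matching the figures given in the text.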
In this study, we also compared the proposed CNN model with a traditional approach based on the HOG feature descriptor and an SVM classifier. The results of the comparison are shown in Figure 2.14.
Figure 2.14 Comparison of HOG+SVM, the CNN model, and the CNN model with data augmentation
 2.4 Conclusion 
Artificial intelligence, driven by the development of machine learning and especially of recent Deep Learning networks, has brought great improvements to computer systems. The study in Chapter 2 demonstrates the ability of CNN models to recognize objects and their intelligence in specific cases. Although the study was conducted in a narrow recognition domain, it clearly demonstrates basic Deep Learning techniques for recognizing objects and their application potential. However, a limitation of artificial intelligence is its lack of self-study, self-update, and self-thinking capabilities. Artificial intelligence would become perfect if learning and data training did not need human intervention. Therefore, Chapter 3 aims to build an Adaptive Learning method that helps autonomous systems with self-study, self-update, and self-thinking, narrowing the gap between artificial and human intelligence. Chapter 2 covers the two research works reported in papers PP 1.1, PP 1.2, and PP 1.3.
CHAPTER 3: DEVELOPMENT OF ADAPTIVE LEARNING 
TECHNIQUE IN OBJECT RECOGNITION 
In this chapter, building on the research results presented in Chapter 2, we go on to propose an Adaptive Learning solution for the data of self-driving vehicle systems. The proposed model is capable of self-learning and self-intelligence without any human intervention.
 3.1 Adaptive learning problem in object recognition 
 Nowadays, object recognition techniques achieve high accuracy thanks to advanced technologies such as the deep convolutional neural network. With the growing support of computer hardware, CNN models have increasingly complex structures, more layers, and larger amounts of training data. These systems can identify most object classes with high accuracy. However, the models recognize objects well only when the objects are highly similar to the training data. Meanwhile, the status and appearance of objects in practice vary considerably, and image acquisition is affected by environmental conditions such as brightness, rain, fog, and vibration caused by movement. Thus, the training dataset, however large, cannot cover nearly all the states of objects in practice. In addition, training on very large datasets becomes infeasible because of limited computing resources and time. To deal with these problems, we propose an adaptive approach that automatically upgrades the recognition model, with the expectation of reaching higher accuracy.
 3.2 Suggested solutions 
 3.2.1 Overview of solutions 
 In this chapter, a solution based on Adaptive Learning with CNN models is suggested. In this method, the recognition model is updated automatically by collecting data directly during the normal operation of an ADAS, training, comparing the accuracy, and updating the model. The update focuses on data that differ from those used in previous training. The solution aims to update the old model so that it becomes more adaptive and accurate. In the Adaptive Learning method, recognition systems can learn and add information by themselves without the help of experts in data labeling. In particular, thanks to increasingly developed online storage technology and to the infrastructure and data transmission solutions available on new platforms (5G, cloud data, etc.), the storage and updating needs of the proposed model are expected to be handled online. The suggested solution includes five main stages:
(1) Detecting objects with low reliability.
(2) Tracking the objects through the following n images to identify whether they are objects of interest.
(3) If the objects are recognized with high reliability, labeling as Positive the data recognized with low reliability in the previous steps. If the recognized objects are not of interest, labeling as Negative all objects tracked in the previous images.
(4) Establishing a training dataset by combining the existing training dataset with the new dataset.
(5) Retraining and updating the model if the new version has higher accuracy than the old one.
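Stages (3) to (5) can be sketched in Python as three small functions. This is a simplified illustration under stated assumptions, not the thesis implementation: an object is treated as not of interest when none of its tracked frames reaches the high-confidence threshold, and the names `label_track`, `build_training_set`, and `choose_model` are hypothetical.

```python
from typing import Any, List, Sequence, Tuple


def label_track(track_scores: Sequence[float], conf_high: float) -> str:
    """Stage 3: label a tracked object from the confidence scores of its n follow-up frames.

    Assumption: the object counts as an object of interest as soon as one
    tracked frame reaches the high-confidence threshold.
    """
    return "Positive" if any(s >= conf_high for s in track_scores) else "Negative"


def build_training_set(
    old_samples: List[Tuple[Any, str]], tracked_samples: Sequence[Any], label: str
) -> List[Tuple[Any, str]]:
    """Stage 4: combine the existing training data with the newly labeled samples."""
    return list(old_samples) + [(sample, label) for sample in tracked_samples]


def choose_model(old_model: Any, old_accuracy: float, new_model: Any, new_accuracy: float) -> Any:
    """Stage 5: keep the retrained model only if it is more accurate than the old one."""
    return new_model if new_accuracy > old_accuracy else old_model


# Example with made-up scores: low confidence at first, high confidence later.
label = label_track([0.30, 0.45, 0.92], conf_high=0.9)
training_set = build_training_set([("old_img", "Positive")], ["frame_1", "frame_2"], label)
```

In the example, the object reaches the threshold in the third frame, so the earlier low-confidence frames are added to the training set with the label Positive.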
 Trials were conducted to compare the suggested model, PDNet, with modern models such as AlexNet and Vgg. The results showed that the suggested model, which teaches itself over time, achieves higher accuracy. Furthermore, the suggested Adaptive Learning method can be applied to conventional recognition models such as AlexNet and Vgg to improve their accuracy.
 3.2.2. Analysis 
 3.2.2.1 Concept Definitions of System Components 
Before the functional blocks of the system are described in detail, some concepts are defined as follows:
(1) Adaptive learning: the self-learning and self-adaptation of a Deep Learning model. The adaptive process automatically improves the system's ability to recognize objects without manual data supplementation or expert support.
(2) Interest objects (IO): the objects of interest to be detected and recognized, for example traffic signs, vehicles, etc.
(3) Confidence scores: a measure of reliability when an object is detected as an IO. The confidence score of object O is denoted Conf(O). ConfidenceH is the high-confidence threshold.
(4) Confident tracking: the process of tracking an object once it is detected as an IO.
(5) Lost object (LO) Objects initially recognized as low confi