Image inpainting is a computer vision technique that aims to restore occluded regions or remove undesired elements from digital images, so that the result appears perceptually natural and structurally complete. It has become an active research area, with applications in film processing, watermark removal, photo processing, and other fields. Traditional image inpainting methods fill the missing area using adjacent pixels, which incurs high computational costs and also suffers from ghost artifacts and blur. With the emergence of large-scale datasets, deep learning-based image inpainting methods have been proposed, significantly improving restoration quality. However, current state-of-the-art methods still perform poorly on images with large occluded regions. Additionally, technological advances in related image fields bring new opportunities and challenges to image inpainting. This paper discusses three aspects: (1) a review of relevant datasets for image inpainting, (2) a detailed description and summary of state-of-the-art methods, and (3) an introduction of evaluation metrics with performance comparisons of representative approaches. Finally, we address existing challenges and future opportunities in this field.
This study investigates the optimization of broadband communication channel capacity through an integrative information-theoretic framework. Leveraging Shannon’s theory, it examines fundamental constraints such as bandwidth limitations, channel noise, modulation techniques, error correction mechanisms, and adaptive systems. A comprehensive literature review of 118 articles identified 18 critical enablers, which were evaluated by domain experts. The Fuzzy DEMATEL method was employed to prioritize enablers based on interdependencies and influence. Results indicate that Security Considerations, Channel Access Protocols, and Propagation Characteristics exert the most significant impact on capacity optimization. The findings offer a structured decision-making model for stakeholders, enabling efficient allocation of technological, infrastructural, and human resources. By bridging theoretical principles with practical implementation, this research provides actionable insights for academic researchers and industry practitioners in designing robust, high-capacity broadband systems. The integrative modeling approach advances the application of information theory in modern communication networks, supporting informed technology adoption and system integration.
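The bandwidth and noise constraints mentioned above can be made concrete with the Shannon-Hartley formula, C = B log2(1 + SNR). The following is a minimal sketch for an additive white Gaussian noise channel only; the function name and the 20 MHz example bandwidth are illustrative assumptions and are not taken from the study.

```python
import math


def shannon_capacity_bps(bandwidth_hz: float, snr_db: float) -> float:
    """Shannon-Hartley capacity C = B * log2(1 + SNR) in bits per second.

    Assumes an ideal AWGN channel; the SNR is given in decibels.
    """
    if bandwidth_hz <= 0:
        raise ValueError("bandwidth must be positive")
    snr_linear = 10.0 ** (snr_db / 10.0)  # convert dB to a linear power ratio
    return bandwidth_hz * math.log2(1.0 + snr_linear)


if __name__ == "__main__":
    # Hypothetical 20 MHz channel evaluated at several SNR levels.
    for snr_db in (0, 10, 20, 30):
        capacity = shannon_capacity_bps(20e6, snr_db)
        print(f"{snr_db:>2} dB -> {capacity / 1e6:.1f} Mbit/s")
```

Capacity grows linearly with bandwidth but only logarithmically with SNR, which is one reason bandwidth limitations and channel noise are treated as separate constraints.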
Accurate medical image segmentation is crucial for diagnosis and treatment planning. Convolutional Neural Networks (CNNs) are very good at capturing local information, but they struggle with long-range dependencies. Vision Transformers (ViTs), on the other hand, are good at modeling global context, but they require substantial computing power and labeled data. To overcome these difficulties, we propose PSwinUNet, a hybrid CNN-Transformer network based on a semi-supervised learning framework. Adding a Swin Transformer block to a U-shaped structure improves PSwinUNet's ability to learn global semantics and to perform up-sampling. It also uses a polarized self-attention mechanism in the skip connections to limit the loss of spatial information caused by downsampling. PSwinUNet outperforms current state-of-the-art approaches on the BUSI, DRIVE, and CVC-ClinicDB datasets. For instance, it achieved Dice Similarity Coefficient (DSC) scores of 0.781, 0.896, and 0.960 on the BUSI dataset with 1/8, 1/2, and all of the labeled data, respectively. These scores are substantially better than those of the conventional UNet and UNet++ models.
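For reference, the Dice Similarity Coefficient reported above is defined as twice the overlap of the predicted and ground-truth masks divided by the sum of their areas. The sketch below computes it for binary masks stored as nested lists. The function name and the convention of returning 1.0 when both masks are empty are assumptions made for illustration, not details from the paper.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[Sequence[int]],
                     target: Sequence[Sequence[int]]) -> float:
    """Return DSC = 2 * |P and T| / (|P| + |T|) for two binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of rows")
    intersection = 0
    pred_area = 0
    target_area = 0
    for pred_row, target_row in zip(pred, target):
        if len(pred_row) != len(target_row):
            raise ValueError("mask rows must have the same length")
        for p, t in zip(pred_row, target_row):
            p_on, t_on = bool(p), bool(t)
            pred_area += int(p_on)
            target_area += int(t_on)
            intersection += int(p_on and t_on)
    denominator = pred_area + target_area
    if denominator == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / denominator


if __name__ == "__main__":
    predicted = [[1, 1], [0, 0]]
    ground_truth = [[1, 0], [0, 0]]
    print(f"DSC = {dice_coefficient(predicted, ground_truth):.3f}")
```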
This research investigates the development of a custom hybrid operating system (OS) for a Mars rover experimental prototype using the Raspberry Pi platform. Focusing on operating system optimization, the work enhances computational efficiency, real-time responsiveness, and AI integration. Key innovations include overclocking (boosting CPU performance by 28%), custom threading (reducing task scheduling latency by 22%), and networking improvements for stable remote operation. Codec refinements and framework adaptations improved real-time video analysis throughput by 30%. Integration of a Power-over-Ethernet (PoE) HAT enhanced thermal regulation and stabilized system runtime. Experimental results show the customized OS effectively supports intensive tasks such as image processing, sensor data acquisition, and edge AI workloads. The findings demonstrate a scalable, modular OS framework for real-time vision systems and intelligent robotics in resource-constrained environments.
To address the challenges of low reconstruction accuracy and insufficient model generalization in image super-resolution (ISR) under complex degradation scenarios, this paper proposes an improved method that integrates generative adversarial networks (GAN) and vision transformers (ViT). First, in the generator module of Real-ESRGAN, some residual-in-residual dense blocks (RRDB) are replaced with ViT modules, leveraging the self-attention mechanism to enhance global feature modeling. This enables the model to better capture global information while preserving local details in complex scenes. Experimental results demonstrate that the improved model achieves PSNR gains of 0.59dB/0.45dB and SSIM improvements of 0.018/0.056 in ×2/×4 upscaling tasks on the Urban100 dataset, while also exhibiting excellent performance on benchmark datasets such as Set14. This method significantly enhances image reconstruction quality under complex degradation conditions, providing an effective technical solution for practical applications such as security surveillance, remote sensing monitoring, and target reconnaissance.
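As an illustration of the PSNR metric quoted above, the following sketch computes PSNR = 10 log10(MAX^2 / MSE) between two images given as flat sequences of pixel values. It assumes 8-bit intensities (MAX = 255) and a single channel; the paper's exact evaluation protocol (for example colour space or border cropping) is not stated here, so this is not a reproduction of it.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float],
         restored: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("images must be non-empty and of equal size")
    squared_error = sum((r - s) ** 2 for r, s in zip(reference, restored))
    mse = squared_error / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    ground_truth = [100, 100, 120, 130]
    upscaled = [110, 90, 120, 130]
    print(f"PSNR = {psnr(ground_truth, upscaled):.2f} dB")
```

Because the scale is logarithmic, a gain of a few tenths of a dB corresponds to a reduction of a few percent in mean squared error.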
As virtual reality and augmented reality technologies advance rapidly, demand for high-quality 3D models continues to grow. However, the Multi-View Stereo Network (MVSNet) for 3D reconstruction extracts global image information and depth cues inaccurately. In response to these challenges, this paper presents SelfRes-MVSNet, an enhanced version of MVSNet. First, a self-attention mechanism is introduced to improve the network's ability to capture global information in images. Second, a residual structure is added to mitigate the accuracy loss caused by the downsampling and upsampling of feature maps during cost volume regularization, thus preserving information integrity and transmission efficiency. Experimental results indicate that, in comparison with the original MVSNet, SelfRes-MVSNet reduces the error rate by 1.3% in terms of overall accuracy and completeness, thereby improving reconstruction from 2D images to 3D models.
Tool wear detection in mechanical machining is a critical step for ensuring product quality and improving production efficiency. However, this field faces challenges such as scarce annotated data and interference from complex working conditions, which make it difficult to deploy advanced detection models. To address the fundamental mismatch between model capacity and data availability, this paper proposes a data-efficient hybrid detection architecture named MD-YOLOV12. This architecture integrates the general visual representations learned by the self-supervised vision transformer model DINOv3 into the YOLOv12 object detection framework. Specifically, we perform feature enhancement at two key locations, input preprocessing and the middle layer of the backbone network, thereby improving the model's perception and recognition of tool wear features without relying on massive annotated data. To validate the method's effectiveness, we constructed a specialized tool wear detection dataset containing 8083 high-resolution images annotated into three categories: "No Wear," "Moderate Wear," and "Severe Wear." Extensive experimental results demonstrate that the proposed MD-YOLOV12 method surpasses existing state-of-the-art techniques in the tool wear detection task, providing a viable technical pathway for data-efficient industrial vision applications.
Small object detection remains a formidable challenge in computer vision, primarily because conventional models like SSD suffer from two critical limitations: weak semantic information in shallow feature maps and a mismatch between the receptive field and the actual size of small targets. To address these deficiencies, this paper introduces Lite-RFB SSD, an innovative architecture that strategically integrates a lightweight Receptive Field Block (RFB) module into the SSD framework. This module is meticulously reconstructed using depthwise separable convolutions and channel pruning techniques, resulting in a remarkable 62% reduction in parameters. By embedding this optimized module into the shallow conv4_3 layer, the model preserves high-resolution features crucial for small object detection while significantly enhancing computational efficiency. Experimental validation on the PASCAL VOC dataset demonstrates that Lite-RFB SSD achieves an average precision for small objects (APs) of 22.9%, a substantial 4.2% improvement over the original SSD. Furthermore, it operates at an impressive 28 FPS on edge devices, establishing a superior balance between accuracy and efficiency that outperforms competing methods such as standard RFB and MobileNet-SSD.
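To see why depthwise separable convolutions shrink a module, it can help to count weights for a single layer. The sketch below compares a standard k x k convolution (k * k * C_in * C_out weights) with its depthwise separable counterpart (k * k * C_in + C_in * C_out weights), ignoring biases. The 3 x 3, 512-channel layer is a hypothetical example and does not reproduce the 62% module-wide reduction reported above, which also depends on channel pruning and the module's layout.

```python
def standard_conv_params(kernel: int, in_channels: int, out_channels: int) -> int:
    """Weights in a standard kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * in_channels * out_channels


def depthwise_separable_params(kernel: int, in_channels: int, out_channels: int) -> int:
    """Weights in a depthwise conv followed by a 1x1 pointwise conv (bias ignored)."""
    depthwise = kernel * kernel * in_channels  # one spatial filter per input channel
    pointwise = in_channels * out_channels     # 1x1 convolution mixing channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer: 3x3 kernel, 512 input and 512 output channels.
    standard = standard_conv_params(3, 512, 512)
    separable = depthwise_separable_params(3, 512, 512)
    reduction = 1.0 - separable / standard
    print(f"standard:  {standard:,} weights")
    print(f"separable: {separable:,} weights")
    print(f"reduction: {reduction:.1%}")
```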
To address the challenge of accurate gaze estimation in unconstrained environments susceptible to various interfering factors, this paper proposes AG-HybridNet, an end-to-end gaze estimation model with a dual-branch architecture that combines CNN and Transformer components. The model employs Swin Transformer as the backbone for global feature extraction and incorporates an enhanced CNN branch dedicated to local feature capture. We introduce the TDConv-Block, which replaces standard convolution with partial convolution combined with a reparameterization technique, significantly reducing computational load and memory access while forming a T-shaped receptive field focused on central facial regions. Additionally, we design Efficient Additive Attention (ED-Attention), which resolves the computational bottleneck of long-sequence processing in Transformers by restructuring the computational workflow. Comprehensive experiments on the MPIIFaceGaze and Gaze360 datasets validate the model's effectiveness. Experimental results demonstrate that AG-HybridNet achieves mean angular errors of 3.72° and 10.82° on the MPIIFaceGaze and Gaze360 datasets, respectively. Comparative studies with other mainstream 3D gaze estimation methods confirm that our network can accurately estimate 3D gaze directions while reducing computational complexity.
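The mean angular error used above is the average angle between predicted and ground-truth 3D gaze vectors. A minimal sketch follows, assuming gaze directions are given as 3D vectors of arbitrary non-zero length; the function names are illustrative and not taken from the paper.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def angular_error_deg(pred: Vector, truth: Vector) -> float:
    """Angle in degrees between two non-zero 3D gaze vectors."""
    dot = sum(p * t for p, t in zip(pred, truth))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in truth))
    if norm == 0:
        raise ValueError("gaze vectors must be non-zero")
    # Clamp to [-1, 1] to guard against floating-point overshoot before acos.
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))


def mean_angular_error_deg(preds: Sequence[Vector], truths: Sequence[Vector]) -> float:
    """Average angular error over paired predictions and ground truths."""
    if len(preds) != len(truths) or not preds:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(angular_error_deg(p, t) for p, t in zip(preds, truths)) / len(preds)


if __name__ == "__main__":
    predictions = [(0.0, 0.0, -1.0), (0.1, 0.0, -1.0)]
    ground_truth = [(0.0, 0.0, -1.0), (0.0, 0.0, -1.0)]
    print(f"mean angular error = {mean_angular_error_deg(predictions, ground_truth):.2f} deg")
```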
This paper presents an improved version of the DeepLabV3+ network to address issues such as large parameter count, difficulties in mobile deployment, limited receptive field, and insufficient utilization of low-level semantic information in existing deep learning semantic segmentation networks. The main enhancement approach is as follows: we utilize the lightweight MobileNetV2 as the backbone feature extraction network, while an improved multi-scale atrous convolution module (AS-ASPP) and convolutional block attention mechanism (CBAM) are introduced. Tests conducted on the PASCAL VOC 2012 dataset demonstrate that the optimized model retains merely around one-tenth the parameters of the original network, while attaining superior segmentation precision and computational effectiveness. Specifically, it reaches a mIoU of 73.21% and a Precision of 80.56%, with the training time reduced by approximately 50% and the inference speed significantly improved.
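The mIoU figure above averages the per-class intersection over union, IoU = TP / (TP + FP + FN). The sketch below derives it from a confusion matrix whose rows are ground-truth classes and whose columns are predicted classes. That layout, the function names, and the choice to skip classes that never appear are assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def per_class_iou(confusion: Sequence[Sequence[int]]) -> List[Optional[float]]:
    """IoU for each class; None when a class is absent from both labels and predictions."""
    num_classes = len(confusion)
    ious: List[Optional[float]] = []
    for c in range(num_classes):
        true_pos = confusion[c][c]
        false_neg = sum(confusion[c]) - true_pos  # rest of the ground-truth row
        false_pos = sum(confusion[r][c] for r in range(num_classes)) - true_pos
        union = true_pos + false_pos + false_neg
        ious.append(true_pos / union if union else None)
    return ious


def mean_iou(confusion: Sequence[Sequence[int]]) -> float:
    """Mean IoU over the classes that occur at least once."""
    valid = [iou for iou in per_class_iou(confusion) if iou is not None]
    if not valid:
        raise ValueError("confusion matrix contains no samples")
    return sum(valid) / len(valid)


if __name__ == "__main__":
    # Toy two-class confusion matrix (pixel counts).
    matrix = [[3, 1],
              [1, 5]]
    print(f"mIoU = {mean_iou(matrix):.4f}")
```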