Contributed 32 merged upstream pull requests to OpenCV's CUDA stack — 29 in opencv and 3 in opencv_contrib
- Designed and implemented GpuMatND — an N-dimensional GPU matrix type for CUDA. (#19259, interoperability test code in contrib #2805)
- Eliminated multi-stream race conditions across the CUDA module — data races and hangs in FAST, Canny, StereoBM, and TVL1 optical flow. (#10906, #11483, #13850, #17556)
- Achieved bit-exact GPU/CPU parity for histogram equalization. (#18136)
- Improved CUDA toolkit compatibility across multiple releases (CUDA 9–10.1), keeping the build green. (#14000, #13958, #13658, #13596)
- Added in-place NPP paths for cuda::flip and GpuMat::convertTo, avoiding extra GPU buffers. (#17863, ported to contrib #2612; #17982)
- [#19259] Minimal implementation of GpuMatND (test code in contrib #2805)
- [#17982] cuda::GpuMat::convertTo — fix for in-place arguments
- [#10751] cuda_stream: do not allocate GPU memory by default
- [#17556] cuda optflow TVL1: run safely in async mode
- [#13850] cuda::StereoBM — fix hanging and racing issue
- [#13810] cudalegacy: use safe block scan function
- [#11572] NPP: NppStreamHandler fix
- [#11483] cuda_canny: multi stream safety
- [#11110] test_buffer_pool: synchronize after async copy
- [#10906] cuda_fast: multi stream safety
- [#18136] bit-exact cuda::equalizeHist
- [#17863] cuda::flip — use in-place npp function for inplace arguments (ported to contrib #2612)
- [#13764] Add CV_16UC1 support for cuda::CLAHE
- [#13625] Fix Farneback Optical Flow algorithm
- [#11526] cuda_meanStdDev: bug fix
- [#10987] SSE2: use _mm_cvtpd_epi32 when converting from CV_64F to CV_32S
- [#10861] Fix for CUDA_Arithm/Dft.Algorithm/0 test
- [#10640] cv::cuda::cvtColor bug fix
- [#14000] CUDA 10.1 build issue fix on master branch
- [#13960] Windows build issue fix
- [#13958] CUDA 10.1 build issue fix
- [#13658] __shfl_up_sync with mask for CUDA >= 9
- [#13596] Remove build warning message with CUDA 10.0
- [#22041] Remove const from functions returning by value (contrib counterpart #3266)
- [#14041] Extract Ptr-related code from lut.cu to new lut.cpp
- [#13903] cudev — rework some code
- [#11155] Update GpuMat, GpuMat::download, GpuMat::upload documentation
- [#13364] Fix error in LineIterator example code in doc
- [#10803] Update BufferReader documentation with example code
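
For readers unfamiliar with the intrinsic named in #10987: `_mm_cvtpd_epi32` converts two packed doubles to two 32-bit integers using the current rounding mode, which is round-to-nearest-even by default. The following is a minimal sketch of that behavior, not the code from the PR. The helper name `round_pair` is hypothetical, and the `std::lrint` fallback for non-SSE2 targets is an assumption added only to keep the sketch portable.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#endif

// Hypothetical helper (not an OpenCV function): rounds two doubles to int32
// using the current rounding mode (round-to-nearest-even unless changed).
static std::array<std::int32_t, 2> round_pair(double a, double b)
{
#ifdef SKETCH_HAVE_SSE2
    __m128d v = _mm_set_pd(b, a);    // low lane = a, high lane = b
    __m128i r = _mm_cvtpd_epi32(v);  // two int32 results in the low 64 bits
    std::int32_t out[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), r);
    return {out[0], out[1]};
#else
    // Portable fallback: lrint also honors the current rounding mode.
    return {static_cast<std::int32_t>(std::lrint(a)),
            static_cast<std::int32_t>(std::lrint(b))};
#endif
}
```

With the default rounding mode, `round_pair(2.5, 3.5)` yields 2 and 4, and `round_pair(-1.5, 0.5)` yields -2 and 0, since ties go to the nearest even integer.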
Contributed via review, design discussion, and code to PRs authored by others — most of which merged:
- [#16666] [WIP] Add GpuMatND with arbitrary dimension support — earlier community design attempt; later delivered as my merged #19259
- [#19534] cudafilters: remove dangerous race condition — merged; CUDA filter thread-safety, my core area
- [#13695] Fix cuda::filter corrupted output across threads/streams — merged; multi-stream correctness in cudafilters
- [#11064] cudaarithm: make the asynchronous call to NPP safe — merged; aligns with my own NPP / stream-safety work (#11572, #11483, #10906)
Full list of PRs I was involved in but did not author (for reference) — opencv, opencv_contrib
[#19534] cudafilters: remove dangerous race condition (merged)
[#19286] add cuda::Stream constructor with cuda stream flags (merged)
[#17671] CUDA: fix native detection on Jetson (merged)
[#17581] CUDA: fix build error on Jetson TX1 and TX2 (merged)
[#17432] CUDA: choose supported CC automatically (merged)
[#16666] [WIP] Add GpuMatND with arbitrary dimension support (closed)
[#13695] Fix cuda::filter corrupted output across threads/streams (merged)
[#12722] cudafilters: fix test failure of Median_Accuracy (merged)
[#12585] cuda: move CUDA modules to opencv_contrib (merged)
[#11951] cmake: allow to use external FindCUDA from modern CMake (merged)
[#11064] cudaarithm: make the asynchronous call to NPP safe (merged)
Diagnosed community bug reports and shipped the merged fix:
- #18035 (non-deterministic CUDA equalizeHist) → fixed in #18136
- #17840 (in-place GpuMat flip artifacts) → fixed in #17863
- #13092 (GpuMat::convertTo in-place) → fixed in #17982
- #16013 / #18155 (TVL1 optical flow unsafe in async/multithreaded use) → fixed in #17556
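
The in-place items above (#17840, #13092) belong to a general class of aliasing hazards: a routine written for separate source and destination buffers can corrupt data when both point to the same memory. The sketch below illustrates that hazard on the CPU with a single row of integers. It is an analogy under stated assumptions, not a reproduction of the reported GPU behavior or of the NPP code paths, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Out-of-place horizontal flip of one row: dst[x] = src[w - 1 - x].
// Correct only when src and dst do not alias.
static void flip_row_out_of_place(const int* src, int* dst, std::size_t w)
{
    for (std::size_t x = 0; x < w; ++x)
        dst[x] = src[w - 1 - x];
}

// In-place horizontal flip: swap mirrored elements, touching each pair once.
static void flip_row_in_place(int* row, std::size_t w)
{
    for (std::size_t x = 0; x < w / 2; ++x)
        std::swap(row[x], row[w - 1 - x]);
}
```

Calling `flip_row_out_of_place` with the same buffer as source and destination on `{1, 2, 3, 4, 5}` produces `{5, 4, 3, 4, 5}`, because the first half is overwritten before it is read. `flip_row_in_place` produces the expected `{5, 4, 3, 2, 1}`. This is why an in-place call benefits from a dedicated code path rather than reusing the out-of-place one.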
Reported and fixed myself:
Design proposals / RFCs I opened:
Full list of issues I reported or was involved in (for reference) — opencv, opencv_contrib
[#24115] RFC cuda::Stream — documentation issue and usage inconsistency
[#18347] cudaarithm: inplace version of NPP flip fails with odd number ROI
[#18155] cuda_OpticalFlowDual_TVL1 is not thread-safe in python
[#18051] CUDA GoodFeaturesToTrackDetector is not ThreadSafe ?
[#18035] CUDA equalizeHist does not produce identical result
[#17840] In-place flip of GpuMat produces image artifacs
[#16433] GpuMat as input/output to cv::dnn::Net
[#16013] Corrupted optical flow using cuda::DenseOpticalFlow asynchronously in multithreaded environment
[#13092] cv::cuda::GpuMat.convertTo() seems not to support in-place, while cv::Mat does
[#2724] (contrib) Error building with BUILD_CUDA_STUB on machine without CUDA
[#2361] (contrib) Bug in cv::cuda::warpPerspective
[#14052] an illegal memory access was encountered in function 'download'
[#14017] Opencv 4.0.1 with Cuda
[#13996] opencv-4.0.1, CUDA10.1, failed to build cudaimageproc
[#13984] Problem compiling clahe.cu — identifier "PtrStepus" is undefined
[#13952] OpenCV 4.0.1 + Cuda 10.1, failed to build?
[#13897] Failed to build OpenCV 4.0.1 with CUDA 10 10.0
[#13883] Template Matching is not threadsafe
[#13761] cudalegacy NCVHaarObjectDetection hangs with RTX 2080 Ti
[#13491] Error when building with CUDA. VS 2017, Win10.
[#1958] (contrib) Feature request: Cuda CLAHE for 16 bit images
[#13477] cuda::createTemplateMatching not work with CUDA10.0
[#13014] cuda blockScanInclusive hangs with RTX 2080
[#12895] cudaoptflow: test failure of FarnebackOpticalFlow
[#12721] cudafilters: Median_Accuracy fails with CUDA 9.0 and after
[#12320] cv::cuda::integral hangs on Titan V
[#11622] CUDA Median filter tests fail with CUDA 9.1 but pass with CUDA 8.0
[#11606] Suggestion for the CUDA stream module
[#11511] unneeded cudaStreamSynchronize(stream_)
[#11298] bug in MemoryReturn in cuda module
[#11063] cudaarithm: async call to NPP fails
[#8938] Can --default-stream per-thread be used with opencv ?
[#8725] Calling cv::cuda::Stream::Null() results in a stray cudaMalloc() call
[#6742] cv::cuda::Filter thread safety