Provisional Application for Patent

Title of Invention:

ADAPTIVE AMBIENT AUDIO SYSTEM USING COMPUTER VISION FOR REAL-TIME AUDIENCE DETECTION, BEHAVIORAL FEEDBACK ANALYSIS, AND GESTURE-BASED CONTROL

Cross-Reference to Related Applications

[0001] None.

Field of the Invention

[0002] The present invention relates generally to ambient audio systems for shared physical spaces, and more particularly to systems and methods that employ computer vision, machine learning, edge computing, and reinforcement learning to automatically select, play, and adapt music or sonic soundscapes based on real-time detection and analysis of persons present in a monitored space, their activities, their demographic characteristics, their behavioral responses to currently playing audio, and their gestural commands.

Background of the Invention

[0003] Ambient audio — background music, soundscapes, and environmental audio — plays a significant role in shaping the atmosphere of commercial, hospitality, and public spaces. Restaurants, retail stores, hotel lobbies, fitness centers, co-working spaces, and outdoor venues all rely on background audio to influence mood, encourage desired behaviors (such as lingering or purchasing), and create distinctive brand experiences.

[0004] Current approaches to ambient audio selection suffer from several fundamental limitations. Static playlists, the most common method, involve a pre-curated set of tracks played in sequence without regard for who is actually present or what they are doing. Manual disc jockey (DJ) control requires a dedicated human operator and is cost-prohibitive for most venues during regular operating hours. Time-based scheduling allows venue operators to assign different playlists to different time slots (e.g., "morning jazz," "evening pop"), but this approach cannot adapt to variations in actual crowd composition or energy level within those time windows.

[0005] Subscription-based ambient music services (e.g., Muzak, Mood Media, Rockbot) offer curated playlists organized by genre or mood, but the selection remains fundamentally static once chosen by the venue operator. These services do not observe or respond to the people actually present in the space.

[0006] Prior art in emotion-based music selection exists but is limited. U.S. Pat. No. 9,489,934 describes a system that captures a single user's face, detects their emotional state, and selects music to guide that emotion toward a target state. However, this system is designed for single-user, personal-device use cases and does not address crowd-level analysis, activity detection, behavioral feedback loops, or shared physical spaces. U.S. Pat. No. 9,570,091 describes music selection based on speech emotion recognition, but relies solely on audio input rather than visual observation. U.S. Pat. No. 10,846,517 describes content modification based on emotion detection, but addresses generic content rather than music specifically and does not incorporate spatial or environmental awareness.

[0007] None of the existing approaches implement a closed-loop behavioral feedback system in which the audio selection system observes how persons in the space react to the music that the system itself selected, and then uses those reactions as reinforcement learning signals to continuously improve future selections. This closed-loop capability represents the central innovation of the present invention.

Summary of the Invention

[0008] The present invention provides an integrated system for adaptive ambient audio selection in shared physical spaces. The system comprises six principal modules: (1) an image capture device with a computer vision module that detects persons, estimates demographics, classifies activities, recognizes gestures, and detects behavioral responses to currently playing audio; (2) an audio intelligence engine that selects and manages audio playback using weighted inputs from the vision module and reinforcement learning from behavioral feedback; (3) an audio output system with zone-aware speakers; (4) a QR-based companion interface that enables location-gated web-based feedback and control without requiring a native application install; (5) an AI voice onboarding module that synthesizes spoken announcements explaining available interaction methods, with adaptive frequency based on crowd turnover; and (6) a flexible processing architecture supporting edge, cloud, and hybrid inference modes.

[0009] A key innovation of the present invention is the behavioral feedback loop. Unlike prior art systems that perform one-shot emotion detection to select initial content, the present system continuously observes how detected persons react to the audio that the system itself chose to play. Positive behavioral signals (such as dancing, rhythmic movement, head-nodding, remaining in the monitored area, smiling, and clapping) are treated as positive reinforcement learning rewards, encouraging similar future selections. Negative behavioral signals (such as covering ears, leaving the monitored area, grimacing, and gesturing disapproval) are treated as negative rewards, causing the system to adjust its selection strategy. This creates a continuously self-improving system that adapts to the preferences and responses of the actual people present.

Brief Description of the Drawings

[0010] FIG. 1 is a block diagram illustrating the overall system architecture of the adaptive ambient audio system, showing the six principal modules and their interconnections.

[0011] FIG. 2 is a flowchart illustrating the behavioral feedback loop process, from video capture through person detection, behavioral classification, and audio selection adjustment.

[0012] FIG. 3 is a reference diagram illustrating the gesture vocabulary recognized by the computer vision module, showing each gesture, its detection method, and its corresponding system action.

[0013] FIG. 4 is a sequence diagram illustrating the QR-based companion interface interaction flow, from QR code scanning through location verification to feedback submission.

[0014] FIG. 5 is a topology diagram illustrating the three processing modes (edge, cloud, and hybrid), showing data flow, latency characteristics, and privacy implications of each mode.

[0015] FIG. 6 is a sequence diagram illustrating the AI voice onboarding process, from new arrival detection through announcement generation and adaptive frequency control.

Detailed Description of Preferred Embodiments

1. System Overview

[0016] Referring now to FIG. 1, the adaptive ambient audio system comprises an image capture device 100, a computer vision module 200, an audio selection engine 300, an audio output device 400, a QR-based companion interface 500, an AI voice onboarding module 600, and a flexible processing architecture 700. These modules are interconnected by a behavioral feedback loop 800 that continuously monitors audience reaction to the system's own audio selections.

[0017] The image capture device 100 comprises one or more cameras with wide-angle lenses positioned to observe a monitored physical space. In preferred embodiments, the camera includes a charge-coupled device (CCD) or complementary metal-oxide-semiconductor (CMOS) image sensor capable of capturing video at a minimum of 15 frames per second at a resolution of at least 720p. The device may optionally include an infrared sensor for low-light operation and a depth sensor (such as a time-of-flight camera or structured light projector) for improved person detection and gesture recognition. The device may further include a microphone array for ambient noise level detection.

[0018] The computer vision module 200 receives video frames from the image capture device 100 and performs a plurality of analysis functions organized as sub-modules: person detection and tracking 210, demographic estimation 220, activity classification 230, behavioral response detection 240, gesture recognition 250, and crowd turnover detection 260.

FIG. 1 — System Architecture Block Diagram

2. Computer Vision Module

[0019] The person detection and tracking sub-module 210 employs a real-time object detection model (such as YOLO, SSD, or an equivalent architecture) to identify and localize individual persons within each video frame. A multi-object tracker (such as DeepSORT or ByteTrack) maintains persistent identity for each detected person across frames, enabling the system to track how long each person remains in the space and to correlate behavioral responses with specific individuals over time.
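By way of illustration, the identity-maintenance step performed by trackers such as DeepSORT or ByteTrack can be reduced to a greedy intersection-over-union (IoU) association, sketched below. This simplified stand-in (the class name, threshold value, and box format are illustrative, not limiting) shows how persistent identifiers survive across frames:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

class GreedyTracker:
    """Assigns persistent IDs to per-frame detections by greedy IoU matching."""
    def __init__(self, iou_threshold=0.3):
        self.iou_threshold = iou_threshold
        self.tracks = {}          # track_id -> last known box
        self.next_id = 0

    def update(self, detections):
        assigned = {}
        unmatched = list(self.tracks.items())
        for box in detections:
            best_id, best_iou = None, self.iou_threshold
            for tid, prev in unmatched:
                score = iou(box, prev)
                if score > best_iou:
                    best_id, best_iou = tid, score
            if best_id is None:
                best_id = self.next_id   # no overlap: a new person
                self.next_id += 1
            else:
                unmatched = [(t, b) for t, b in unmatched if t != best_id]
            assigned[best_id] = box
        self.tracks = assigned       # tracks absent this frame are dropped
        return assigned
```

A production tracker adds motion prediction and appearance features; the sketch captures only the frame-to-frame identity correlation that lets dwell time and behavioral responses be attributed to specific individuals.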

[0020] The demographic estimation sub-module 220 performs approximate analysis of detected persons to estimate age range (e.g., child, young adult, middle-aged, senior) and group composition (e.g., number of people, whether they appear to be in groups or alone). This sub-module explicitly does not perform facial recognition or store biometric data; it operates on body proportions, posture, and group spatial relationships rather than facial features, thereby preserving privacy while still enabling demographically informed audio selection.

[0021] The activity classification sub-module 230 employs a pose estimation model (such as MediaPipe Pose, OpenPose, or an equivalent) to classify the activities being performed by detected persons. Recognized activity categories include but are not limited to: seated dining, standing conversation, walking/browsing, dancing, exercising, working at a desk, and waiting. The classified activity informs the audio selection engine about the appropriate energy level, tempo, and genre for the current audience.

[0022] The behavioral response detection sub-module 240 is the core component of the feedback loop 800. This sub-module analyzes the body language and movements of detected persons specifically in the context of audio that is currently playing. Referring to FIG. 2, the sub-module classifies observed behaviors into positive feedback signals and negative feedback signals. Positive signals include: rhythmic movement or dancing in synchrony with the audio, head-nodding in time with the beat, remaining in the monitored area (as opposed to departing), visible smiling or laughing, and clapping. Negative signals include: covering ears with hands, departing the monitored area shortly after a track change, grimacing or displaying facial expressions of displeasure, and making gestural signals of disapproval.

FIG. 2 — Behavioral Feedback Loop Flowchart

[0023] The gesture recognition sub-module 250 detects and interprets intentional hand gestures made by persons in the monitored space. Referring to FIG. 3, the recognized gesture vocabulary includes: an open-palm horizontal swipe 251 indicating "skip track," a thumbs-up gesture 252 indicating positive feedback ("like"), a thumbs-down gesture 253 indicating negative feedback ("dislike"), an upward palm raise 254 indicating "volume up," and a downward palm push 255 indicating "volume down." To prevent false positives from casual hand movements, gestures must be sustained for a minimum of 0.5 seconds to be registered. When multiple persons gesture simultaneously, a majority voting module 256 determines the aggregate intent.

FIG. 3 — Gesture Vocabulary Reference

[0024] The crowd turnover detection sub-module 260 monitors the rate at which new persons enter the monitored space and existing persons depart. This metric is used by the AI voice onboarding module 600 to determine when to play instructional announcements, and by the audio selection engine 300 to determine when the audience composition has changed sufficiently to warrant a selection reassessment.

3. Audio Intelligence Engine

[0025] The audio selection engine 300 comprises a music and soundscape library 310, a selection algorithm 320, a reinforcement learning feedback processor 330, and a crossfade and mixing module 340.

[0026] The music and soundscape library 310 stores audio content tagged with metadata including: energy level (1-10 scale), genre, sub-genre, mood (e.g., relaxed, energetic, melancholic, uplifting), tempo (BPM), instrumentation characteristics, and demographic affinity scores derived from historical preference data.

[0027] The selection algorithm 320 receives a weighted feature vector from the computer vision module 200 comprising: person count, demographic distribution, primary activity classification, current energy level estimate, and time-of-day context. The algorithm computes a similarity score between the current audience feature vector and the metadata of each track in the library, selecting tracks that maximize expected audience satisfaction based on the current audience profile.

[0028] The reinforcement learning (RL) feedback processor 330 implements a reward-based learning loop. The processor treats positive behavioral signals from the behavioral response detection sub-module 240 as positive rewards and negative behavioral signals as negative rewards. Over time, the RL processor learns which audio attributes (genre, tempo, energy level) produce positive responses from audiences with specific demographic and activity profiles. The system maintains per-location preference profiles that accumulate learning across multiple sessions, enabling venue-specific optimization.
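A minimal embodiment of the reward-based update is sketched below, using an exponentially smoothed preference score per (audience profile, audio attribute) pair. The signal names follow the behaviors described above, but the reward magnitudes and learning rate are illustrative assumptions:

```python
# Illustrative reward mapping; numeric values are assumptions, not
# specified by the system description.
REWARDS = {
    "dancing": 1.0, "head_nodding": 0.5, "smiling": 0.5, "clapping": 1.0,
    "lingering": 0.25,
    "covering_ears": -1.0, "departing": -0.75, "grimacing": -0.5,
    "gesture_disapproval": -1.0,
}

class PreferenceLearner:
    """Exponentially smoothed preference score per (audience profile,
    audio attribute) pair; a simple stand-in for the RL processor 330."""
    def __init__(self, learning_rate=0.1):
        self.lr = learning_rate
        self.q = {}    # (profile, attribute) -> learned preference

    def update(self, profile, attributes, signals):
        reward = sum(REWARDS.get(s, 0.0) for s in signals)
        for attr in attributes:
            key = (profile, attr)
            old = self.q.get(key, 0.0)
            self.q[key] = old + self.lr * (reward - old)

    def preference(self, profile, attribute):
        return self.q.get((profile, attribute), 0.0)
```

Persisting the `q` table per venue gives the per-location preference profiles that accumulate across sessions.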

[0029] The crossfade and mixing module 340 manages smooth transitions between audio tracks, implementing beat-matched crossfades when possible and gradual volume transitions to prevent jarring track changes that could disrupt the ambient atmosphere.

4. Audio Output System

[0030] The audio output device 400 comprises one or more speakers positioned to serve the monitored physical space. In multi-zone deployments, separate camera-speaker pairs may operate independently, allowing different areas of a single venue (e.g., bar area versus dining area) to have different audio selections tailored to the specific audience and activity in each zone. The system adjusts playback volume based on detected ambient noise levels and crowd density, increasing volume in noisy environments and reducing it in quieter settings.
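The ambient-noise-tracking volume rule may be embodied as simply as the following sketch; the quiet floor, gain, and clamp constants are illustrative assumptions:

```python
def target_volume(base_db, ambient_db, quiet_floor_db=45.0, gain=0.5,
                  min_db=-20.0, max_db=6.0):
    """Raise the output level as ambient noise rises above a quiet floor,
    and lower it below that floor. All constants are illustrative."""
    adjustment = gain * (ambient_db - quiet_floor_db)
    return max(min_db, min(max_db, base_db + adjustment))
```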

5. QR-Based Companion Interface

[0031] Referring to FIG. 4, the QR-based companion interface 500 provides a web-based control and feedback mechanism that requires no native application installation on the user's mobile device. The interface comprises a QR code display component 510, a location gate 520, and a web user interface 530.

FIG. 4 — QR Companion Interface Sequence Diagram

[0032] The QR code display component 510 generates and displays a QR code in the monitored physical space — for example, on a wall placard, table tent, or digital display. The QR code encodes a URL pointing to the web user interface 530.

[0033] When a user scans the QR code with their mobile device camera, the web user interface 530 loads in the device's browser. Before granting access to controls, the location gate 520 requests the device's geolocation via the browser Geolocation API and compares the device's coordinates to the known coordinates of the monitored physical space. Access is granted only if the device's location falls within a configurable radius (default: 100 meters) of the venue. This location-gating mechanism prevents remote users from interfering with audio selections — for example, it prevents someone outside the venue from maliciously skipping tracks or adjusting volume.
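The location gate's radius check may be embodied as a great-circle (haversine) distance comparison, as sketched below; the coordinates in the usage example are illustrative:

```python
import math

def within_radius(device_lat, device_lon, venue_lat, venue_lon, radius_m=100.0):
    """Haversine distance check corresponding to the location gate 520.
    Coordinates are in decimal degrees; the radius is in meters."""
    r_earth = 6_371_000.0                        # mean Earth radius, meters
    phi1, phi2 = math.radians(device_lat), math.radians(venue_lat)
    dphi = math.radians(venue_lat - device_lat)
    dlam = math.radians(venue_lon - device_lon)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    distance = 2 * r_earth * math.asin(math.sqrt(a))
    return distance <= radius_m
```

In the web user interface 530, the device coordinates would come from the browser Geolocation API (`navigator.geolocation.getCurrentPosition`), with access denied when the check fails or the user refuses the location permission.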

[0034] Once access is granted, the web user interface 530 displays the currently playing track information (title, artist, album art), a "like" button providing positive feedback, a "skip/dislike" button providing negative feedback, a volume slider, a song or genre request input, and the upcoming queue. Feedback submitted through the companion interface is routed to the audio selection engine 300 where it is treated as explicit feedback alongside the implicit behavioral feedback from the vision module. Users may optionally subscribe to push notifications to provide periodic feedback even when the web interface is not actively open.

6. AI Voice Onboarding Module

[0035] Referring to FIG. 6, the AI voice onboarding module 600 comprises a text-to-speech synthesis engine 610, a set of announcement templates 620, and a frequency controller 630.

FIG. 6 — AI Voice Onboarding Sequence Diagram

[0036] When the crowd turnover detection sub-module 260 detects a significant influx of new persons into the monitored space, it signals the frequency controller 630. The frequency controller checks whether sufficient time has elapsed since the last announcement (the "cooldown period"). If the cooldown has expired, the frequency controller selects an appropriate announcement template from the template set 620 and passes it to the TTS synthesis engine 610.

[0037] The announcement templates 620 are parameterized by venue type. A casual venue (bar, outdoor festival) might generate the announcement: "Hey — wave your hand to skip a track, give a thumbs up if you're digging this, or scan the QR code on the wall to control the music from your phone." A professional venue (hotel lobby, conference center) might generate: "Welcome. You can interact with our audio system using hand gestures or by scanning the QR code displayed nearby." The TTS synthesis engine 610 generates natural-sounding speech audio from the selected template, and the audio is played through the speaker 400 between tracks or during natural pauses.

[0038] The frequency controller 630 adapts its cooldown period based on crowd turnover rate. In high-turnover environments (such as a busy bar), the cooldown may be as short as 5 minutes to ensure new arrivals are informed. In low-turnover environments (such as a hotel lobby), the cooldown may extend to 30 minutes or longer. The system may also employ person re-identification (based on clothing and body characteristics, without biometric storage) to avoid replaying announcements for persons who have already heard them.
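The adaptive cooldown may be embodied as in the following sketch, which interpolates between the 5-minute and 30-minute cooldowns described above as a function of arrival rate; the saturation constant is an illustrative assumption:

```python
class FrequencyController:
    """Adapts the announcement cooldown to crowd turnover: high turnover
    shortens the cooldown toward min_cooldown_s, low turnover stretches
    it toward max_cooldown_s. Thresholds are illustrative."""
    def __init__(self, min_cooldown_s=300.0, max_cooldown_s=1800.0):
        self.min_s = min_cooldown_s
        self.max_s = max_cooldown_s
        self.last_announcement = None

    def cooldown(self, arrivals_per_minute, saturation=5.0):
        # Normalize the arrival rate to [0, 1], saturating at `saturation`
        # arrivals per minute, then interpolate linearly.
        rate = min(arrivals_per_minute, saturation) / saturation
        return self.max_s - rate * (self.max_s - self.min_s)

    def should_announce(self, now, arrivals_per_minute):
        if self.last_announcement is None or \
           now - self.last_announcement >= self.cooldown(arrivals_per_minute):
            self.last_announcement = now
            return True
        return False
```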

7. Flexible Processing Architecture

[0039] Referring to FIG. 5, the flexible processing architecture 700 enables the computer vision module 200 to operate in one of three modes: edge processing mode 710, cloud processing mode 720, or hybrid processing mode 730.

FIG. 5 — Edge/Cloud/Hybrid Processing Topology

[0040] In edge processing mode 710, all AI inference executes directly on a system-on-chip (SoC) integrated with or attached to the image capture device 100. Suitable edge computing platforms include NVIDIA Jetson series, Google Coral, Intel Neural Compute Stick, or equivalent. In this mode, no video data leaves the device, providing maximum privacy protection. Lightweight models (such as MobileNet, EfficientNet-Lite, or MoveNet) are employed to achieve inference latency below 50 milliseconds per frame. Edge mode is preferred for venues with strict privacy requirements or limited network connectivity.

[0041] In cloud processing mode 720, the image capture device 100 transmits video frames (or extracted features) to a remote server over a network connection. The remote server executes the full computer vision pipeline using larger, more powerful models (such as a Vision Transformer, YOLOX, or equivalent) that may exceed the computational capacity of edge hardware. Cloud mode enables centralized fleet management, where a single server manages vision processing for multiple venues and distributes model updates simultaneously. Inference latency in cloud mode is typically 100-500 milliseconds depending on network conditions.

[0042] In hybrid processing mode 730, the system partitions the vision pipeline between edge and cloud. Time-critical tasks — specifically gesture recognition 250 and person detection 210 — execute on the edge device for sub-50ms latency, ensuring gestures are recognized immediately. Less time-sensitive tasks — specifically demographic estimation 220 and model updates — are offloaded to the cloud server. A dynamic mode selector monitors available network bandwidth, current latency measurements, and the venue's configured privacy policy to automatically determine the optimal partition between edge and cloud processing.
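One possible embodiment of the dynamic mode selector is sketched below; the bandwidth and latency thresholds are illustrative assumptions, not specified values:

```python
def select_mode(bandwidth_mbps, rtt_ms, privacy_policy):
    """Dynamic mode selection sketch for the hybrid architecture 730.
    Threshold values are illustrative assumptions."""
    if privacy_policy == "strict":
        return "edge"                    # no video may leave the device
    if bandwidth_mbps < 5.0 or rtt_ms > 200.0:
        return "edge"                    # network too weak for offloading
    if bandwidth_mbps >= 50.0 and rtt_ms <= 50.0:
        return "cloud"                   # full pipeline can run remotely
    return "hybrid"                      # time-critical tasks stay on edge
```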

8. Behavioral Feedback Loop

[0043] The behavioral feedback loop 800 is the central innovation of the present invention and distinguishes it from all known prior art. Referring again to FIG. 2, the loop operates as follows: (a) the image capture device 100 captures continuous video of the monitored space; (b) the computer vision module 200 detects persons, classifies their activities, and observes their behavioral responses; (c) the audio selection engine 300 selects audio content based on the current audience profile; (d) the audio output device 400 plays the selected audio; (e) the computer vision module 200 then observes how the detected persons respond to the audio that the system just selected; (f) the behavioral response detection sub-module 240 classifies these responses as positive or negative; and (g) the RL feedback processor 330 uses these classified responses as reward signals to adjust the selection algorithm's parameters for future selections. This loop repeats continuously, creating a system that improves with every iteration.
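Steps (a) through (g) can be summarized as one loop iteration, sketched below with the modules passed in as objects; all interfaces shown are hypothetical, not part of the specification:

```python
def feedback_loop_iteration(capture, vision, engine, output, learner):
    """One pass through feedback loop 800, steps (a)-(g)."""
    frame = capture()                                   # (a) capture video
    audience = vision.analyze(frame)                    # (b) detect & classify
    track = engine.select(audience)                     # (c) choose audio
    output.play(track)                                  # (d) play it
    reaction_frame = capture()                          # (e) observe reactions
    signals = vision.reactions(reaction_frame, track)   # (f) classify +/-
    learner.update(audience["profile"],                 # (g) reward update
                   track["attributes"], signals)
    return signals
```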

[0044] Unlike prior art systems such as U.S. Pat. No. 9,489,934, which perform a single emotion detection and selection step, the present invention's feedback loop is continuous and self-referential — the system observes reactions to its own outputs. This closed-loop architecture enables the system to discover and exploit non-obvious preferences that would not be predicted by demographic or activity classification alone.

9. Use Cases

[0045] The system is applicable to a variety of venue types, including but not limited to: restaurants (adapting from calm background music during dining to higher-energy selections as the atmosphere shifts toward socializing), retail stores (matching audio to the browsing pace and demographic of current shoppers), fitness centers (synchronizing tempo to workout intensity detected through activity classification), hotel lobbies (maintaining a sophisticated ambient atmosphere calibrated to guest demographics), co-working spaces (selecting focus-friendly audio during work hours and social audio during breaks), and outdoor venues such as festivals, parks, and courtyards.

Abstract

[0046] An adaptive ambient audio system for shared physical spaces that employs computer vision to detect persons, estimate demographics, classify activities, recognize hand gestures, and detect behavioral responses to currently playing audio. An audio selection engine uses these vision inputs, combined with a reinforcement learning feedback loop, to automatically select and adapt music or sonic soundscapes in real time. The system continuously observes audience reactions (such as dancing, head-nodding, or covering ears) to audio that the system itself selected, treating these observations as reward signals to improve future selections. A QR-code-based web interface enables location-gated user feedback and control without requiring a native application. An AI voice onboarding module synthesizes spoken announcements explaining interaction methods, with adaptive frequency based on crowd turnover. A flexible processing architecture supports edge, cloud, and hybrid inference modes, enabling privacy-preserving on-device processing or more powerful cloud-based analysis as deployment requirements dictate.