what to do with a 32-tap FIR filter

Postby steph_tsf » Mon Mar 09, 2020 3:59 am

I have been inspecting the ConvoRev7fixed.frm that Martin Vicanek posted on Wed Jun 05, 2019 11:00 pm.
http://www.dsprobotics.com/support/viewtopic.php?f=4&t=3879&start=50

Martin Vicanek's aim was to let Flowstone users experiment with FFT-then-iFFT transversal filters, regarded as optimal substitutes for long FIR filters. The FFT-then-iFFT trick is required for simulating the Taj Mahal listening environment, whose -20 dB Impulse Response is approximately 5 seconds long. The FIR filter that would be required is 220,500 taps long (at Fs = 44.1 kHz). This requires computing about 10 giga 32-bit floating point multiply-accumulates per second. Times two in case of stereo. Using the FFT-then-iFFT trick, the computational load is several orders of magnitude lighter.

The FFT-then-iFFT efficiency gain comes from the fact that in the Taj Mahal application, the FIR filter weights are defined once and for all. They never change.
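For readers who want to verify this outside Flowstone, here is a minimal Python sketch (my own illustration, assuming NumPy and SciPy; the impulse response is a toy stand-in, not the actual Taj Mahal IR) showing that FFT-based convolution matches direct convolution while costing O(N log N) instead of O(N x taps):

Code:
import numpy as np
from scipy.signal import fftconvolve

fs = 44100
n = np.arange(5 * fs)                               # 220,500 taps = 5 s at 44.1 kHz
ir = np.random.randn(5 * fs) * 10 ** (-n / len(n))  # toy IR decaying to -20 dB at 5 s
x = np.random.randn(fs)                             # 1 s of input signal

y = fftconvolve(x, ir)             # FFT, multiply, iFFT: a few tens of ms on a PC
# y_direct = np.convolve(x, ir)    # direct form: ~10 G multiply-accumulates
# np.allclose(y, y_direct)         # same result, orders of magnitude slower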

A FIR filter is indeed a powerful construction, capable of learning, whose learning capability gets destroyed in case its weights are not allowed to vary with time. The answer to the "what to do with a 32-tap FIR filter" question is: "build an intelliger". Let us adopt such a name for designating the Widrow-Hoff LMS machinery.

An intelliger that's monitoring the input and the output of some given "plant" continuously delivers as outputs:

- the putative "plant" IR (Impulse Response)
- the putative "plant" C (Causality)
- and as a free bonus, the putative "plant" output signal as it would be if causality were 100% instead of the actual causality

In case the "plant" is adding little noise, the C (causality) will be a strong 99%.
In case the "plant" is adding a lot of noise, the C will be a weak 80%.

What I am describing here is a kindergarten SISO (Single Input, Single Output) intelliger.

Once you are dealing with a real-world "plant", you need to distinguish between what's relevant (strong causality) and what's irrelevant (weak causality). You thus need to hook up a MISO (Many Inputs, Single Output) intelliger.

One expects a MISO intelliger that's hooked to a "many-inputs plant" to continuously deliver as outputs:

- the putative "input_1 plant" IR
- the putative "input_2 plant" IR
- the putative "input_n plant" IR
- the putative "input_1 plant" C
- the putative "input_2 plant" C
- the putative "input_n plant" C
- and as a free bonus, the putative "plant" output signal as a function of artificially imposed causalities (C factors)

Still today, the tremendous capabilities of FIR filters remain ignored because the intelliger is not yet mainstream. Instead, a FIR filter is mostly considered as featuring non-time-varying coefficients. Consequently, 99% of the FIR filter capabilities remain unexplored and unexploited.

One may think that all static FIR filters should be computed using the FFT-then-iFFT trick. Unfortunately, the FFT-then-iFFT emulation introduces latency. The whole FFT-then-iFFT calculation must be completed before the first audio sample can go out. This means a 100 ms latency when the FIR filter being emulated is 4,410 taps long.

Such is the latency you will incur when carrying out a high precision frequency response equalization along with a high precision phase response equalization, relying on the FFT-then-iFFT trick.

To mitigate such a latency issue, the ConvoRev7fixed.frm embeds a Direct Convolution (DC) routine, purely operating in the time domain, that's automatically selected when dealing with FIR filters that are at most 32 taps long. This also covers the first 32 taps of the 220,500-tap FIR filter being emulated.

Code:
streamin in;   // audio stream in (L*R*)
streamin IRL;   // left impulse response
streamin IRR;   // right impulse response
streamin ix;   // index (0...31)
streamout out;   // convoluted audio out (L_IRL,L_IRR,R_IRL,R_IRR)

float x[32];   int index=-16;
float h[32];   int indexh=-16;
float F0=0;

stage2;
// update IR

movaps xmm2,IRL;
movaps xmm3,IRR;
shufps xmm2,xmm3,0;      // 0044
shufps xmm2,xmm2,136;   // 0202
cvtps2dq xmm0,ix;
movd eax,xmm0;
shl eax,4;
movaps h[eax],xmm2;   // (u,v,u,v)

// convolve
mov eax,index[0]; add eax,16; and eax,496; mov index[0],eax; // advance circular write index (32 slots of 16 bytes, wrap at 512)
movaps xmm0,in; shufps xmm0,xmm0,160; // 0022
movaps x[eax],xmm0;
add eax,-16; and eax,496; movaps xmm1,x[eax];
add eax,-16; and eax,496; movaps xmm2,x[eax];
add eax,-16; and eax,496; movaps xmm3,x[eax];
push eax;

mov eax,indexh[0];
add eax,16; movaps xmm4,h[eax];
add eax,16; movaps xmm5,h[eax];
add eax,16; movaps xmm6,h[eax];
add eax,16; movaps xmm7,h[eax];
mov indexh[0],eax;

mulps xmm0,xmm4; mulps xmm1,xmm5;
mulps xmm2,xmm6; mulps xmm3,xmm7;
addps xmm0,xmm1; addps xmm2,xmm3; addps xmm0,xmm2;
movaps out,xmm0;
pop eax;

loop:   // remaining 28 taps, 4 SSE slots per iteration
add eax,-16; and eax,496; movaps xmm0,x[eax];
add eax,-16; and eax,496; movaps xmm1,x[eax];
add eax,-16; and eax,496; movaps xmm2,x[eax];
add eax,-16; and eax,496; movaps xmm3,x[eax];
push eax;

mov eax,indexh[0];
add eax,16; movaps xmm4,h[eax]; mulps xmm0,xmm4;
add eax,16; movaps xmm5,h[eax]; mulps xmm1,xmm5; addps xmm0,xmm1;
add eax,16; movaps xmm6,h[eax]; mulps xmm2,xmm6; addps xmm0,xmm2;
add eax,16; movaps xmm7,h[eax]; mulps xmm3,xmm7; addps xmm0,xmm3;
mov indexh[0],eax;

addps xmm0,out; movaps out,xmm0;

cmp eax,448; // 512-64
pop eax; jl loop;
mov eax,-16; mov indexh[0],eax;

Such a trick-on-the-trick allows the emulated 220,500-tap FIR filter to deliver a meaningful audio output in less than a millisecond. Indeed, a 32 audio sample delay is a 0.73 millisecond delay at Fs = 44.1 kHz. When I say "a meaningful" audio output, I mean some audio output that's neither silence nor garbage, a kind of approximation of the filtering that's supposed to build up in parallel.

To avoid a perceived discontinuity after 5 seconds, which is the time the 220,500-tap FIR filter emulation takes to deliver its first "completely convolved" audio sample, the 220,500-tap FIR filter emulation gets partitioned into several 64-tap FFT-then-iFFT segments, several 4k-tap FFT-then-iFFT segments, and several 16k-tap FFT-then-iFFT segments. Thus, during the first 5 seconds, the user benefits from a progressively maturing convolution, made of many discrete incremental convolutions that progressively and seamlessly add up. Call it a masterpiece. Thanks 1,000,000 times Martin Vicanek.
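For the principle only (the real .frm uses non-uniform partition sizes plus the 32-tap direct part, whereas this sketch uses one uniform size), here is a minimal uniform-partition overlap-add convolution in Python/NumPy:

Code:
import numpy as np

# Uniform partitioned convolution sketch: split the IR into B-tap blocks,
# convolve each block by FFT, and add each partial result B*k samples later.
def partitioned_convolve(x, ir, B=64):
    parts = [ir[i:i + B] for i in range(0, len(ir), B)]
    H = [np.fft.rfft(p, 2 * B) for p in parts]        # partition spectra, precomputed
    y = np.zeros(len(x) + len(ir) + 2 * B)
    for n in range(0, len(x), B):                     # one input block at a time
        X = np.fft.rfft(x[n:n + B], 2 * B)
        for k, Hk in enumerate(H):                    # partition k lands k*B samples later
            y[n + k * B : n + k * B + 2 * B] += np.fft.irfft(X * Hk)
    return y[:len(x) + len(ir) - 1]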

Anyway,

I wanted to implement a 31-tap FIR filter as a 2-way crossover operating at 3,400 Hz. I've designed the following .fsm, exploiting Martin Vicanek's 32-tap Direct Convolution (DC) routine:


Attachment: 32-tap FIR filter.fsm
Attachment: 32-tap FIR filter (650 pix).jpg

It incorporates comparisons with other crossovers.

- I have compared with the subtractive delay-compensated crossover. For such purpose, in order to provide a quasi 4th-order highpass slope, I selected a double 2nd-order Butterworth lowpass filter (similar to the lowpass branch of a 4th-order Linkwitz-Riley) whose delay gets slightly overcompensated on purpose. John Kreskovsky presented such a trick in his technical note "A New High Slope, Linear Phase Crossover Using the Subtractive Delayed Approach", dating back to December 2002 and revised in April 2005. Unfortunately, all the graphs John Kreskovsky published were formatted in a way that keeps us unaware of the non-monotonic behavior below -35 dB. Moreover, John Kreskovsky did not warn about the delay sensitivity & granularity issue. Such an effect shows when crossing above 2,000 Hz at Fs = 44.1 kHz. When crossing at 3,400 Hz, you shall hesitate between a delay of 5 samples, 6 samples, or 7 samples. We are talking about integer numbers. You need to obey the discrete steps. There is no question of "fine tuning" the delay, thus. Kind of fatal deception, would you say. No possibility of fine tuning the delay, would you say. Wrong! You still can do a fine tuning, by choosing a 6 sample delay and fine tuning the crossover frequency around 3,400 Hz using the 32-bit floating point precision of the x86 SSE. Or using the 24-bit fixed point precision of the DSP56K. It is that simple. Kind of Columbus egg.

Now comes the truth. Continue reading, because there is a miracle occurring. When you base on a lowpass filter that's a double 2nd-order Butterworth (instead of the 3rd-order Butterworth lowpass that Lipshitz-Vanderkooy suggest in their 1981 publication at the AES), and when you slightly oversize the compensating delay (or tweak the crossover frequency for a given delay), the lowpass and highpass slopes both become quasi 4th-order slopes in the transition band, down to -30 dB. This is a miracle because in 1981, in their 51-page communication to the AES, Lipshitz-Vanderkooy take 20 pages full of math for telling that the quasi 3rd-order highpass slope they attain with their subtractive delay-compensated scheme is the maximum highpass selectivity one can expect, whatever the order of the lowpass filter.

After having seen such a miracle, you may think that the show ends here. Not at all. There is a second miracle. Indeed, when you base on such a double 2nd-order Butterworth lowpass filter, the relative phase mismatch between the lowpass output and the highpass output, in the transition band, is approximately 30 degrees or so. Thus, such a subtractive delay-compensated crossover is quasi-synchronous. Consequently, such a crossover is indeed a) quasi 4th-order, b) symmetrical, c) quasi-synchronous, and d) transient perfect. In other words, such a crossover is very close to perfection, so close to perfection that there is no need for a 32-tap FIR filter acting as crossover when crossing above 2,000 Hz. Such a subtractive delay-compensated crossover requires little computational power, as there are only two IIR biquads to compute per lowpass / highpass pair, plus one delay, plus one subtraction. John Kreskovsky was right. The slightly over-compensated delay subtracting scheme applied to a double 2nd-order Butterworth lowpass filter is probably the best scheme one can imagine, and it requires no more computational power than a plain normal 4th-order Linkwitz-Riley. Unfortunately, below -35 dB, the attenuation curves appear to be non-monotonic.

Which means that in case one is not publishing the global phase curves, not publishing the transient responses, and only publishing the attenuation curves on a 100 dB scale, the Linkwitz-Riley crossover appears as perfect, and the slightly over-compensated delay subtracting crossover applied to a double 2nd-order Butterworth lowpass filter appears as zero-value crap.

Now is the time to find a better name for the crossover I am advocating, like John Kreskovsky did long before me. It should be named SOCDS-DB2: Slightly Over-Compensated Delay Subtracting - Double Butterworth 2nd-order. All digital crossovers should be SOCDS-DB2. The attached .fsm shows how to implement a 3-way SOCDS-DB2 crossover crossing at 170 Hz and 3,400 Hz. Its propagation time is 2.8 ms. In case of crossing at 85 Hz instead of 170 Hz, targeting a subwoofer + 13 cm woofer + tweeter combination, the propagation time is approx 5.0 ms. Nothing to worry about, thus. Later on I will show how the natural unfiltered 2nd-order lowpass slope of a two-chamber 4th-order vented subwoofer can get 2nd-order lowpass filtered, for emulating the 4th-order lowpass transfer function the SOCDS-DB2 environment requires.
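For experimenting outside Flowstone, here is a minimal Python/SciPy sketch of the SOCDS-DB2 topology (my own illustration; the 6 sample delay at Fs = 44.1 kHz follows the discussion above, and the exact delay / frequency fine tuning is left to the reader):

Code:
import numpy as np
from scipy.signal import butter, lfilter

fs, fc, delay = 44100, 3400.0, 6          # crossover frequency, delay in samples
b, a = butter(2, fc / (fs / 2))           # one 2nd-order Butterworth lowpass

def socds_db2(x):
    lp = lfilter(b, a, lfilter(b, a, x))  # double 2nd-order Butterworth lowpass
    delayed = np.concatenate([np.zeros(delay), x[:-delay]])
    hp = delayed - lp                     # subtractive, delay-compensated highpass
    return lp, hp                         # lp + hp reconstructs the delayed input

By construction, lp + hp equals the delayed input exactly, which is where the transient-perfect property comes from; the delay (or the crossover frequency) then gets fine tuned for the quasi 4th-order slopes.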

What I am telling here is for a 3,400 Hz crossover.

Imagine now dealing with a 340 Hz crossover.
The SOCDS-DB2 crossover will perform just as usual, requiring little computing power.
What about a 32-tap FIR filter crossover, crossing at 340 Hz?
No way! A 320-tap FIR filter is now required. Quite expensive!
Consequently, the SOCDS-DB2 remains the only practical, quasi-perfect & affordable solution for building a near-perfect 3-way crossover.

- I have compared the SOCDS-DB2 solution against the Linkwitz-Riley solution. The frequency-domain responses are essentially the same. The time-domain responses are fundamentally different, in favor of the SOCDS-DB2 solution, which is quasi transient-perfect.

- I have not compared with the 4th-order synchronous Duelund 3-way crossover, nor with the 4th-order synchronous René Christensen crossover, because they are both low-Q 4th-orders, less selective than the Linkwitz-Riley, and still transient-distorting instead of transient-perfect.

- I have not compared with the Universal State Variable 2nd-order 3-way crossover, because it is far from synchronous, and because the low Q highpass and lowpass slopes don't provide enough selectivity.

So, finally, what to do with a 32-tap FIR filter?

Use #1. Speaking of crossovers, two 32-tap FIR filters are required in case one is willing to individually idealize the speakers in the 3,400 Hz crossover transition band.
Take as example a 3-way loudspeaker. The 3,400 Hz SOCDS-DB2 crossover is incapable of individually idealizing the two speaker drivers before they get filtered.
A 32-tap FIR filter can do this, acting as a helping hand.
There will be a 32-tap FIR filter idealizing the medium or midwoofer speaker in the transition band.
There will be a 32-tap FIR filter idealizing the tweeter in the transition band.
The additional computational load remains moderate.
The additional latency is approx 0.73 millisecond (32 samples).
Such is the first, easy use of a 32-tap FIR filter.
It should be mainstream. Unfortunately, it is not yet mainstream.

Use #2. One can try getting rid of the 3,400 Hz SOCDS-DB2 crossover.
One 32-tap FIR filter may suffice for idealizing the medium speaker, and for lowpass filtering it.
One 32-tap FIR filter may suffice for idealizing the tweeter, and for highpass filtering it.

Problem is, how to compute the required FIR weights? This applies to Use #1 and Use #2.

One is obliged to measure the gain and the phase of the unfiltered medium speaker and of the unfiltered tweeter. A dual-channel analyzer and a microphone can do this: send white noise or pink noise, then progressively acquire and average 100 or 1,000 acoustic pressure frames, each 32 samples long, for determining the gain and phase of the raw medium speaker and of the raw tweeter.
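This frame-averaging measurement is what the classic H1 cross-spectral estimator does. A Python/SciPy sketch (my illustration; `noise` is the excitation actually sent, `mike` the captured pressure signal):

Code:
import numpy as np
from scipy.signal import csd, welch

# Dual-channel analyzer sketch: average many 32-sample frames and estimate
# the complex transfer function H1 = Pxy / Pxx (gain and phase of the driver).
def measure_response(noise, mike, fs=44100, frame=32):
    f, Pxy = csd(noise, mike, fs=fs, nperseg=frame)   # averaged cross-spectrum
    _, Pxx = welch(noise, fs=fs, nperseg=frame)       # averaged input auto-spectrum
    H = Pxy / Pxx
    return f, 20 * np.log10(np.abs(H)), np.angle(H)   # gain [dB], phase [rad]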

Use #1 is the most sophisticated, best performing one.
One compares the measured gain and phase with a target gain and phase.
One must remain reasonable concerning the target gain and phase for a given speaker driver.
One must respect and keep the low end behavior as is. Try a Q=0.5 2nd-order highpass at some F1.
One must respect and keep the high end behavior as is. Try a Q=0.5 2nd-order lowpass at some F2.
Simply ask the FIR filter to apply a gain and phase correction whose aim is to transform the measured gain and phase into the F1_modelled * F2_modelled transfer function, along with the important requirement of targeting a flat frequency response and a linear phase around the Ft frequency (transition band frequency), where the speaker is due to fifty-fifty seamlessly merge with some other speaker.
It is there, around Ft, that you should allocate most of the correction power the FIR filter is delivering.
Therefore, below F1 and above F2, you need to relax the target gain and phase requirements.
Below F1 and above F2, you need to apply a "don't care" gain and phase correction strategy, so the FIR filter power doesn't get wasted in trying to correct the gain and phase below F1 or above F2.
You'll be surprised to realize how effective a 32-tap FIR filter can be when intelligently teaming up with a 3,400 Hz SOCDS-DB2 crossover. The "don't care" gain and phase correction strategy below F1 and above F2 features the beneficial functional advantage of never applying a strong, abrupt, spectacular correction. One never tries to build "another speaker driver" into some existing speaker driver.
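One standard way to express this "don't care" strategy is a weighted least-squares FIR design: heavy weight around Ft, near-zero weight below F1 and above F2. A Python/NumPy sketch (my assumption of a workable formulation; `H_target` is the complex measured-to-target correction, sampled on the frequency grid `f`):

Code:
import numpy as np

# Weighted least-squares design of the correction FIR: the weight vector W
# implements the "don't care" bands (tiny weight below F1 and above F2).
def design_fir(H_target, f, fs, F1, F2, taps=32):
    W = np.where((f > F1) & (f < F2), 1.0, 0.01)
    n = np.arange(taps)
    E = np.exp(-2j * np.pi * np.outer(f, n) / fs)     # FIR frequency response matrix
    h, *_ = np.linalg.lstsq(W[:, None] * E, W * H_target, rcond=None)
    return h.real                                     # assumes a near linear-phase target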

Use #2 is the simplest one.
For the midbass or medium driver, it suffices to define the lowpass filtered transfer function as the target gain and phase curve. For the tweeter, it suffices to define the highpass filtered transfer function as the target gain and phase curve.
One can already try this using a 32-tap FIR filter.
The whole fitting process can happen continuously, in realtime, and interactively.
In less than one minute, you will watch the 32-tap FIR filter converging, then stabilizing.

Unfortunately, people will turn away from this because a) it is easy as 1-2-3 (no complicated math required), b) it is inexpensive, c) the aim doesn't consist in doing a room equalization, and d) the sound the 32-tap or 128-tap FIR filter delivers will be a function of the room where you did the measurement, a function of the mike you relied on (dynamic or condenser, omni-directional or uni-directional), and a function of the geometry (mike distance and mike angle).

Thus, each time you power up your audio system for listening to some audio material, you will constantly ask yourself:

Was I right in relying on such a mike? Should I have done the measurement in a perfectly silent and damped room? Should I have done the measurement from a 200 cm distance instead of a 100 cm distance, and at a 30 degree angle off-axis instead of on-axis?

People will realize a bit too late that there are always invisible guests acting as men-in-the-middle. In audio, this starts with the boom operator (sound capture), and this ends with you (sound delivery). You need to trust yourself the same way you are used to trusting the boom operator.

Now we can formulate an answer to the inevitable next question: "what to do with a 32-tap intelliger?"

One should exploit the intelliger's mimetic tropism.
A 32-tap FIR filter may suffice for idealizing the medium speaker, and for lowpass filtering it.
In the digital domain, we build a SOCDS-DB2 crossover acting as "plant", crossing at 3,400 Hz, consisting of two IIR biquads, a delay, and a subtractor.
The "plant" receives a white noise signal at its input.
Consequently, the lowpass output delivers the filtered signal. This is the "wanted" signal.
In case we ask a 32-tap intelliger to monitor the input and the lowpass output of such a "plant", the 32 FIR weights of such an intelliger will converge in less than 100 milliseconds to some steady state.
The mimetic tropism of the intelliger forces the intelliger to emulate the "plant".
The 32 FIR weights thus describe the impulse response of the "plant".
Let us assume that we have long known the impulse response of the bare, unfiltered medium speaker.
We can thus compute the impulse response of the 32-tap FIR filter that's required for rendering the filtered medium speaker impulse response equal to the "plant", in other words equal to the SOCDS-DB2 crossover lowpass output.
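That last computation is a deconvolution. One simple way to carry it out (my choice of method, not something prescribed above) is a regularized frequency-domain division, in Python/NumPy:

Code:
import numpy as np

# Given the plant IR learned by the intelliger and the long-known raw speaker IR,
# compute the 32-tap FIR such that speaker * fir approximates the plant.
def correction_fir(plant_ir, speaker_ir, taps=32, eps=1e-3):
    N = 256                                        # FFT size, comfortably > taps
    P = np.fft.rfft(plant_ir, N)
    S = np.fft.rfft(speaker_ir, N)
    H = P * np.conj(S) / (np.abs(S) ** 2 + eps)    # regularized division P / S
    return np.fft.irfft(H)[:taps]                  # truncate to the 32 taps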

One can exploit the intelliger's mimetic tropism in a more elaborate, more ambitious way.
A 32-tap FIR filter may suffice for idealizing the medium speaker, and for lowpass filtering it.
In the digital domain, we build a SOCDS-DB2 crossover acting as "plant", crossing at 3,400 Hz, consisting of two IIR biquads, a delay, and a subtractor.
The "plant" receives a white noise signal at its input.
Consequently, the lowpass output delivers the filtered signal. This is the "wanted" signal.
The intelliger also receives a white noise signal at its input, but this time we place a DAC, a power amplifier, the medium speaker, the mike and an ADC just after the intelliger.
Our ambition is to determine if the intelliger's mimetic tropism is strong and robust enough to compensate the medium speaker transfer function that follows it. Will the intelliger still converge? Will the 32 FIR weights of the intelliger duly describe the impulse response of the 32-tap FIR filter that's required for rendering the filtered medium speaker impulse response equal to the "plant" impulse response?
I guess yes, because such an arrangement resembles the "adaptive equalizer" arrangement that's widely used in landline communication systems. Admittedly, that one is about compensating the transfer function (attenuation, phase) of electric cables instead of speakers.
Anyway, we need to be careful with the audio delay introduced by the DAC and the ADC that surround the medium speaker. The easiest way to annihilate the perverse effects of such audio delay is to apply the exact same delay, by construction, to the "plant". Thus, during the setup, we need to connect an apparently useless DAC + wire + ADC chain to the SOCDS-DB2 crossover lowpass output.
We shall add a 2 or 3 sample delay cell after the SOCDS-DB2 crossover lowpass output, mirroring the sound time-of-flight caused by the speaker-to-mike distance. When the intelliger has duly converged, it suffices to read its 32 FIR weights and transfer them into a static (dumb) 32-tap FIR filter.
One needs to repeat the process for generating the static (dumb) highpass FIR filter. This time the "plant" needs to be the SOCDS-DB2 crossover highpass output.

In case the more elaborate way exhibits convergence issues, one needs to dig into the "Filtered-X LMS Algorithm" arrangement and determine if it can help in our case: http://www.ijsps.com/uploadfile/2015/0915/20150915102843852.pdf

If I correctly understand the "Filtered-X LMS Algorithm", one needs to manage two 32-long circular audio input buffers. The first one is the main one, conveying the last 32 samples seen at the "plant" input, and serving the adaptive FIR filter, just as usual. The second one is the auxiliary one, conveying an allpass filtered version of the last 32 samples seen at the "plant" input. The allpass filter must feature the same phase as the speaker phase. I won't discuss here the way such an allpass filter can get implemented using a 1st-order IIR allpass filter (speaker highpass phase cloning) followed by a short FIR filter (speaker lowpass phase cloning). The aim of the "Filtered-X LMS Algorithm" is to restore the synchronicity of the etiaw (Error Times Input Add to the Weight) learning process. Clearly, the audio signal originating from the branch where the FIR filter is gets impacted by the speaker phase. The etiaw learning process will remain immune from this in case one provides to the etiaw learning process an "input" signal that's no longer the "plant input" signal, but an allpass filtered "plant input" signal, equally impacted by the speaker phase. The additional cost of a 32-tap Filtered-X LMS Algorithm only consists in maintaining an allpass filter and allocating a 32-long circular audio buffer.

There are videos of Bernard Widrow on Youtube :
Part 1 https://www.youtube.com/watch?v=hc2Zj55j1zU
Part 2 https://www.youtube.com/watch?v=skfNlwEbqck
He is still with us. He co-discovered a simple training/learning algorithm a long time ago with Marcian (Ted) Hoff, his first doctoral student, who became Intel employee #12 in 1968. They had no name for such an algorithm. They doubted it would work, because of some "crazy" statistical simplification they made at a certain point inside their math derivation. They were surprised that it actually worked. A name had to be found. A student (whose name got lost) proposed "Least Mean Square" as the name. The name was okay because it remained vague about the many applications that were supposed to follow: categorizing, discriminating, cloning, filtering, rejecting, anticipating, etc. I remain under the impression that Bernard Widrow might not refute "intelliger" as a friendly name.

I found excellent company over here : http://www-isl.stanford.edu/~widrow/publications.html
Please read the two Technical Reports (first one is dated 1960, second one is dated 1966).

I am now describing the etiaw (Error Times Input Add to the Weight) learning process that transforms a dumb 32-tap FIR filter into a 32-tap intelliger. To keep everything simple, I am describing the Widrow-Hoff LMS implementation.

initialize the plant_in, plant_out, FIR_out stream audio pointers
initialize the 32 weight coefficients, along with the weights pointer
do each time there is a new audio sample arriving,
- reset the global accumulator that's building up the FIR_out value
- compute the error [error <- plant_out - FIR_out] (one should apply a scale factor)
- do 32 times in a row, for i = 0 to 31
-- globally and progressively elaborate the FIR_out value [FIR_out <- FIR_out + (plant_in(i) * weight(i))]
-- locally update the current weight [weight(i) <- weight(i) + (error * plant_in(i))]
- next do
- transfer the globally and progressively accumulated FIR_out value into the FIR_out audio stream
next do

One must provide read/write access to the 32 weights.
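Here is the etiaw process rendered in Python (a sketch; the scale factor is the usual LMS step size mu, and I keep the pseudocode's peculiarity of computing the error from the previous sample's FIR_out while accumulating the current one):

Code:
import numpy as np

def etiaw(plant_in, plant_out, taps=32, mu=0.01):
    w = np.zeros(taps)                       # the 32 weights (read/write access)
    x = np.zeros(taps)                       # last 32 plant input samples
    fir_out, outs = 0.0, []
    for p_in, p_out in zip(plant_in, plant_out):
        x = np.roll(x, 1); x[0] = p_in       # push the new input sample
        error = mu * (p_out - fir_out)       # error, from the previous FIR_out
        fir_out = 0.0                        # reset the global accumulator
        for i in range(taps):                # one pass: accumulate + update
            fir_out += x[i] * w[i]
            w[i] += error * x[i]             # etiaw: error times input, add to weight
        outs.append(fir_out)
    return w, np.array(outs)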

I am now describing the etfiaw (Error Times Filtered Input Add to the Weight) learning process that transforms a dumb 32-tap FIR filter into a 32-tap intelliger, following the Filtered-X LMS implementation; a Python rendering follows the pseudocode.

initialize the plant_in, filtered_x, plant_out, FIR_out stream audio pointers
initialize the 32 weight coefficients, along with the weights pointer
do each time there is a new audio sample arriving,
- reset the global accumulator that's building up the FIR_out value
- compute the filtered_x audio sample, from the plant_in audio sample
- compute the error [error <- plant_out - FIR_out] (one should apply a scale factor)
- do 32 times in a row, for i = 0 to 31
-- globally and progressively elaborate the FIR_out value [FIR_out <- FIR_out + (plant_in(i) * weight(i))]
-- locally update the current weight [weight(i) <- weight(i) + (error * filtered_x(i))]
- next do
- transfer the globally and progressively accumulated FIR_out value into the FIR_out audio stream
next do
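And the etfiaw process in Python (same sketch conventions as above; `allpass` stands for a caller-supplied, stateful filter cloning the speaker phase, which is an assumption of this sketch, not code from this post):

Code:
import numpy as np

def etfiaw(plant_in, plant_out, allpass, taps=32, mu=0.01):
    w = np.zeros(taps)
    x = np.zeros(taps)                             # main line, feeds the FIR output
    fx = np.zeros(taps)                            # auxiliary line, feeds the update
    fir_out, outs = 0.0, []
    for p_in, p_out in zip(plant_in, plant_out):
        x = np.roll(x, 1); x[0] = p_in
        fx = np.roll(fx, 1); fx[0] = allpass(p_in) # filtered-x sample
        error = mu * (p_out - fir_out)
        fir_out = 0.0
        for i in range(taps):
            fir_out += x[i] * w[i]                 # output from the plain input line
            w[i] += error * fx[i]                  # update from the filtered input line
        outs.append(fir_out)
    return w, np.array(outs)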

Unfortunately, the x86 instruction set and architecture appear to have been created completely outside of the whys and hows of digital signal processing. This is quite surprising, as Marcian (Ted) Hoff joined Intel as employee #12 as early as 1968. Around 1968, Robert Noyce, one of the two founders of Intel, visited Busicom, a Japanese company specializing in tabletop calculators. He was in search of customers for the new in-house developed DRAM chips. The DRAM chip was part of the Intel MCS-4 chipset (CPU, ROM, DRAM, serial-to-parallel shift register). The MCS-4 chipset allowed Intel (or the tabletop calculator brand) to easily add new features to a calculator.

During the summer of 1981, I spent my pocket money purchasing a few Intel databooks.
The Intel MCS-86 (x86) 8086 CPU chip (16-bit registers) looked miserable against the quite intimidating Motorola 68000 (said to be a DEC PDP-11 on a chip: 32-bit registers, orthogonal instruction set, many addressing modes, etc.).
Consequently, the already announced (for 1982) Intel 80186 and 80286 looked ambiguous, strange.
The MCS-51, successor of the MCS-48, looked mature, ideal for embedded applications, featuring an 8 x 8 bits = 16 bits multiplier, until you realized that despite the whopping 12 MHz clock, it only executes one instruction every microsecond.
The Intel iAPX-432 looked over-engineered, imposing delirious complications and bottlenecks nobody would tolerate. It looked like the Intel iAPX-432 could only work with some unspecified operating system and unspecified programming language or programming style. I was 20, had no experience, but felt that Intel's ambition was to trap, lock in and isolate customers. The Intel iAPX-432 was evil or delirious, or both.

My lack of experience (I was only 20 years old) prevented me from perceiving the validity of the Intel 80286 philosophy.

Today, I know more. The Intel 80286 was a "high integration" chip. It only required an external UVPROM and some RAM or DRAM to work. This way you build a qualitative 16-bit computer embedding a 16-bit hardware multiply assist and divide assist. This is way better than a MCS-51. In case you connect a daughter-board hosting a raster-scan CRT controller and a keyboard scanner, your 16-bit embedded computer becomes your development system. You may activate the protected mode. This was already a fantastic innovation.

Sooner or later comes the question of determining how much better such a system could perform after replacing its Intel 80286 chip (internal 16-bit registers) by a Motorola 68000 chip (internal 32-bit registers), and adapting the software to the Motorola 68000 chip. Comes the answer: only marginally better. Why? Here is the answer. For single-user, single-task operation, or a few simultaneous tasks, there is no need to rely on a CPU that's a "DEC PDP-11 on a chip" like the Motorola 68000 is. There is no need for a complicated multi-user multi-task secure preempting operating system, automatically transforming blocking code into non-blocking code. A single-user cooperative multi-task operating system consisting of less than 100 lines of code may suffice. Yes indeed, this works, provided the person writing the application program always replaces a "wait ... something" source code by a "wait ... something" macro-instruction or routine that manipulates the stack pointer. Naturally, because of distraction or inattention, the person will depart from such a rule. Contrary to popular belief, there exist many embedded applications written this (apparently stupid) way. One can exploit the 50 Hz or 60 Hz mains frequency for constantly applying reset pulses.

Yes indeed, the inexpensive 16-bit processing speed (much better than a Zilog Z-80 CPU) attained by the Intel 80286 CPU allowed producing unbeatable results in the real, unpredictable world. It suffices to organize for getting out as quickly as possible from a worst case situation, possibly a blocking code situation. This is more effective than over-engineering the hardware and the software. It suffices to regard an apparently "constantly monitoring" system as a damn fast sampling system that gets a new life 50 or 60 times per second, commits to completing its job within a certain time, say 200 milliseconds, and also commits to cold-restarting itself in case the 200 millisecond max processing time could not be honored.

1968, Intel wants to sell DRAM chips
1969, Intel wants to sell DRAM chips designed for their customizable tabletop calculator 4-bit CPU chip
1971, Intel is selling DRAM chips along with their customizable tabletop calculator 4-bit CPU chip (4004)
1972, Intel is selling DRAM chips along with their customizable tabletop calculator 8-bit CPU chip (8008)
1973, Intel recognizes that end-user programmable tabletop calculators are in strong demand
1974, Intel recognizes that raster-scan CRT displays allow end-users to comfortably edit their programs
1975, Intel wants to sell DRAM chips designed for raster-scan CRT display based user-programmable computers
1976, Intel is selling DRAM chips along with the MCS-85 chips family (ROM, 8085 CPU, CRT controller, etc)
1976, Shugart Associates is launching the 5 1/4-inch FDD (Floppy Disk Drive)
1977, Intel 8085 sales are down because the Zilog Z-80 (designed by ex-Intel people) is better and cheaper
1977, Tandy RadioShack is launching the TRS-80 Desktop Computer basing on the Z-80 CPU (max 64 kbyte DRAM)
1978, Sharp is launching the MZ-80K Desktop Computer basing on the Z-80 CPU (max 64 kbyte DRAM)
1978, Intel is launching the 16-bit bus 16-bit 8086 40-pin CPU (max 16 pages of 64 kbyte DRAM)
1979, Intel is launching the 8-bit bus 16-bit 8088 40-pin CPU (max 16 pages of 64 kbyte DRAM)
1979, Motorola is launching the 16-bit bus 32-bit 68000 64-pin CPU able to address 16 Mbyte of DRAM
1980, Seagate (a company related to Shugart) is launching the 5 Mbyte ST-506 HDD (Hard Disk Drive)
1981, IBM is launching the PC5150 basing on the Intel 8088 CPU able to address 16 pages of 64 kbyte DRAM
1982, Intel is launching the 80186 CPU (high integration 8086 with more on-board peripheral devices)
1982, Intel is launching the 80286 CPU (80186 + protected mode)
1984, IBM is launching the PC5170/AT basing on the Intel 80286 CPU able to address 16 Mbyte of DRAM
1985, Intel is launching the 80386 CPU (32-bit, three-stage instruction pipeline, memory management unit)
1989, Intel is launching the 80486 CPU (L1 cache, floating-point unit, five-stage tightly-coupled pipeline)
1993, Intel is launching the 80586 P5 CPU, Pentium (much faster floating-point unit, later found to carry the FDIV bug)
1995, Intel is launching the Pentium Pro, the first P6 CPU
1997, Intel is launching the Pentium II, a P6 CPU - MMX
1999, Intel is launching the Pentium III, a P6 CPU - MMX, SSE
2000, Intel is launching the Pentium 4, a NetBurst CPU ( - disappointing - )
2003, Intel is launching the Pentium M, a P6-derived CPU - MMX, SSE, SSE2
2006, Intel is launching the Core CPUs
2020, Intel has long since stopped with DRAM, and keeps very busy with CPU chips

Our Personal Computers should feature three clocks:
- user clock (keystroke, mouse move, mouse click) - highly variable & unpredictable clock
- video clock (say 60 Hz)
- audio clock (say 44.1 kHz)
The audio should be processed sample-by-sample. Kind of a localized 44,100 Hz interrupt, not disturbing the general-purpose CPU. Unfortunately, nowadays Windows-x86 computers are not organized this way. They group 128 or 256 audio samples in a buffer, and wait for the buffer to get full before actually batch-processing it. That's bad. Despite the 1999 Pentium III SSE (Streaming SIMD Extensions), all Personal Computers still process audio the same way, using buffers for batch-processing 128 or 256 consecutive audio samples, congesting the buses and the CPU.

1983, the audio CD gets launched.

1989, Intel is launching the 80486 CPU chip (L1 cache, floating-point unit, five-stage tightly-coupled pipeline).
You would expect Intel to put some audio CD related hardware in such a chip. Not at all. The built-in math co-processor (Floating Point Unit) renders the Excel calculations (spreadsheet calculations) quasi instantaneous. This is very welcome indeed. But wait a minute. This has nothing to do with audio. In 1989, Intel was still considering a Personal Computer as a colorful, glorified programmable tabletop calculator. The fact that digital was everywhere, including in audio, remained ignored by Intel.

1996, Windows 95 OSR2 (with the FAT32 file system) allows the PC to store and process the equivalent of several audio CDs on the hard disk.

1997, Intel is launching the 80586 P5 CPU chips, Pentium II - MMX
This is 15 years after the audio CD launch.
A PC becomes capable of quickly transforming an audio CD into MP3 audio files taking 10 times less hard disk space, and playing them at will using Winamp, through Winamp playlists.

1997, the DVD gets launched.

1999, Intel is launching the 80686 P6 CPU chips, Pentium III - MMX, SSE
A Windows 98 PC becomes capable of quickly transforming a DVD movie into AVI or MP4 video files taking 5 times less hard disk space, and playing them at will using the Windows 98 Media Player, or the VLC Media Player released in January 2001.

2006, Intel is launching the Core CPU chips - embedding SLMS (Streamed LMS)
The new fir (Finite Impulse Response) instruction computes an 8-channel ASIO FIR filter tap every CPU cycle. The execution of the main loop of an 8-channel 128-tap FIR filter only takes 132 CPU cycles.
The new fir_etiaw (Error Times Input Add to the Weight) instruction computes an 8-channel ASIO adaptive FIR filter tap every 2 CPU cycles, designed for the LMS learning capability. The execution of the main loop of an 8-channel ASIO adaptive 128-tap FIR filter only takes 264 CPU cycles.
The new fir_etfiaw (Error Times Filtered Input Add to the Weight) instruction computes an 8-channel ASIO adaptive FIR filter tap every 2 CPU cycles, designed for the Filtered-X LMS learning capability. This is of course not including the auxiliary filter output computation.
Microsoft and Steinberg provided the required software drivers for inserting FIR filters used as equalizers everywhere at will in the new Windows sound system based on ASIO.
Microsoft and Intel provided the required software drivers for implementing fast neural networks.


Such demand emanated from everywhere. The general public wanted the PC to become an audio-video jukebox. The audio and video professionals wanted the PC to become a digital audio and digital video editing station. A smooth, continuous evolution had taken place over a 30 year span.

Kind of a success story, would you say. Maybe. Or not. Because the Intel "Core" CPUs do not yet feature SLMS (Streamed LMS). What I've just described is fiction. In reality, the novelty in the Intel "Core" CPUs is that they can get juxtaposed on the same chip, implementing a multi-core CPU.

In 2006, something bad happened. The Windows PC Audio System did not approve the Steinberg ASIO VST standard. Or possibly, Steinberg and/or Microsoft wanted the PC Audio System to remain different from the Steinberg ASIO VST standard.
Back in 1996, just after the Windows 95 OSR2 release, a company named Steinberg proposed ASIO and VST, the "Audio Streaming Input Output" standard and the "Virtual Studio Technology" standard, allowing a x86 PC to acquire many audio channels from soundcards, process and mix many audio channels in realtime, and deliver many audio channels to soundcards. Microsoft did not commit to this, probably because ASIO and VST only dealt with audio, not with video. The fact that Steve Ballmer, Microsoft CEO, never committed to this appears technologically detrimental. From a marketing point of view, Steve Ballmer didn't want Microsoft and Windows to become perceived by their traditional "office" users as endorsing the idea of introducing music, movies and videogames into their "office" computers. Steve Ballmer wanted to ensure a long-term "office" stability for Microsoft and Windows.
Anyway, ten years after 1996, despite the good streaming capabilities of the "Core" CPU launched by Intel in 2006, Microsoft was still not interested in jumping on the Steinberg ASIO VST bandwagon.

Since 2001, Microsoft organized for supplying :
- Office computers (Windows 98 or ME Professional)
- Home computers (Windows 98 or ME for Home)
- Personal Digital Assistants (Windows CE 3.0) (resistive touchscreen + stylus)
- Gaming (xBox) (733 MHz Intel Pentium III, 233 MHz nVidia NV2A graphics, no keyboard)

In 2006, Steve Ballmer and Bill Gates wanted Windows 98 and Windows ME to remain the friendly forge, the friendly crucible enabling people to carry out software innovation the exact same way they carry out their daily office duties. This is why Steve Ballmer and Bill Gates wanted a VisualBasic and a keyboard in every Windows 98 and Windows ME computer. And by extension, Steve Ballmer also wanted a hardware keyboard in every "business" smartphone. Here you can identify where Steve Ballmer got it wrong. His "by extension" thinking was invalid. Nevertheless, he persisted. Soon after 2007, Microsoft lost the PDA/smartphone market. Today in 2020, I regret that Steve Ballmer did not denounce loud and clear, back in 2007, what was going to happen once software editors like Apple or Google committed to getting paid not through software licences, but through the exploitation of user preferences, web browsing habits, social networking, etc.

In 2006, Steve Ballmer could not imagine the lethal blast the Apple iPhone (2007) and the Google Android smartphones (2008) would represent for all Microsoft CE and Windows Mobile PDAs and Smartphones. The Microsoft CE and Windows Mobile PDAs and Smartphones instantly appeared as obsolete, because:
- the best ones were basing on a 640 x 480 transflective resistive touchscreen that required a stylus,
- there was no Windows Mobile application store,
- one could not select the language (no possibility to switch from English to French, etc.)

In 2006, the only competing Phone OS was Symbian. Here is the way Symbian got conceived, and died.
1980, the British company Psion (Potter Scientific Instruments) was founded by David Potter.
1984, Psion is launching the Psion Organiser.
1987, Psion is releasing EPOC, a preemptive multitasking operating system written in C for the Intel 8086 CPU.
1997, Psion is releasing the 32-bit version of EPOC for the Psion Organiser series 5.
1998, Symbian Ltd. gets formed as a partnership between Nokia, Ericsson, Motorola and Psion
1998, EPOC becomes Symbian OS.
2000, Ericsson is launching the R380 smartphone, running EPOC version 5 rebadged Symbian OS 5.1
2000, Nokia is launching the 9210 Communicator, running Symbian OS 6.0 able to run third-party apps
2003, Symbian OS 7.0
2005, Symbian OS 8.1
2005, Symbian OS 9.1
2007, Symbian OS 9.3
2007, Symbian OS 9.5 supporting real-time multimedia
2008, Nokia acquires Symbian Ltd.
Contrary to popular belief, Symbian OS did not incorporate a GUI (Graphical User Interface).
A third-party GUI was required, running on top of Symbian OS.
The most popular GUI for Symbian OS was "S60", written by Nokia, requiring a license.
There were other GUIs written by other brands. This caused incompatibility among brands.
Such incompatibility prevented third-party apps from being intensively developed.
Consequently, there was no application store.

After such a long digression about year 2006, let's discuss streaming audio in general terms.

1989, Steinberg is launching Cubase, originally only a MIDI sequencer, running on the Atari ST computer.
1993, Cubase gets a boost thanks to the Atari Falcon 030, embedding a Motorola DSP56001 digital signal processor chip, allowing many audio effects with 8-track audio recording and playback.
In the following years, Cubase got ported to Apple computers.
1998, Cubase VST 3.5.5 for Windows 95 starts the modern Windows Cubase era. Such a release, optimally exploiting the Microsoft DirectX architecture and DirectX plugins, is based on VST, and is capable of dealing with 32 tracks and 128 equalizers, all in realtime.
2011, Cubase 6.0 gets launched for running on 64-bit Windows 7. The Mac (Apple) versions are of course still there.

There exists a kind of Cubase for iOS (Apple iPads).
A temptation is to rely on Cubase for materializing a kind of music instrument one can play live.

Such a market has been covered since 2001 by a Windows/Mac application whose name is Ableton Live, not originating from Steinberg. Ableton Live got developed in Berlin by Bernd Roggendorf and Gerhard Behles. Wikipedia says "Ableton Live is a digital audio workstation for macOS and Windows. In contrast to many other software sequencers, Ableton Live is designed to be an instrument for live performances as well as a tool for composing, recording, arranging, mixing, and mastering, as shown by Ableton's companion hardware product, Ableton Push. It is also used by DJs, as it offers a suite of controls for beatmatching, crossfading, and other different effects used by turntablists, and was one of the first music applications to automatically beatmatch songs."

There can't be audio stuttering or hiccups. There can't be any de-synchronization from track to track, or from audio to video. The operating system, the graphical user interface and the hardware are obliged to teamwork in a tight manner.

Let us consider a fixed-weights streamed 32-tap FIR filter. The algorithm is very straightforward, even in streaming; a Python rendering follows the pseudocode.

initialize the FIR_in, FIR_out stream audio pointers
initialize the 32 weight coefficients, along with the weights pointer
do each time there is a new audio sample arriving,
- reset the accumulator that's building up the FIR_out value
- do 32 times in a row, for i = 0 to 31
-- progressively elaborate the FIR_out value [accumulator <- accumulator + (FIR_in(i) * weight(i))]
- next do
- transfer the accumulated value into the FIR_out audio stream
next do
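In Python, this is only a few lines (a sketch; real-time code would maintain a circular index instead of shifting the whole buffer):

Code:
import numpy as np

# Fixed-weights streaming FIR: push the new sample, then 32 multiply-accumulates.
def fir_stream(sample, state, weights):
    state = np.roll(state, 1); state[0] = sample
    return float(np.dot(state, weights)), state

# usage, once per arriving audio sample:
# y, state = fir_stream(x_n, state, weights)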

You may try Googling "FIR Filter Algorithm Implementation Using Intel SSE (Fritz Gerneth, 2010)".
The SSE acronym stands for "Streaming SIMD Extensions".
The SIMD acronym stands for "Single Instruction Multiple Data".
Try reading such a document. It emanates from Intel. Remember, it dates from 2010.
It is about implementing an off-line 63-tap FIR filter, processing 640 audio samples in mono (not stereo). The audio samples are fixed point 16-bit, instead of fixed point 24-bit or floating point 32-bit.

Still in 2020, in the x86 SSE environment and the Windows 7/8/10 environment, we are forced to code a fixed-weights streamed 32-tap FIR filter the way shown in the 32-tap SSE Direct Convolution routine quoted near the top of this post.

How come? Can anyone estimate the overhead generated by such unjustified complexity? Clearly, there is a gross mismatch between supply and demand. We must identify an already existing, more suitable environment that also executes all the glorified tabletop calculations (non-streaming .exe and .dll files) to which we are accustomed. In case such an environment does not yet exist, it must be created from scratch. Such a creation cost will remain an order of magnitude lower than an "intelligate" scandal which could wipe out Intel and the x86 silicon industry. This is of importance. Modern jobs, done at the office or at home, require streaming audio and video, along with speech enhancement and echo cancellation. Same for road transport. And this will include the driver, once autonomous driving reigns. A foreseeable extension is that on top of audio and video, it will be required to continuously stream "object folders", requiring fast, heavy processing. Think about objects or people constantly entering and leaving the visual field of autonomous vehicles. The whole content needs to get profiled, classified, and identified within milliseconds. Think about digitized objects or people entering a synthetic movie scene, or a virtual reality environment. This will be mainstream. In case we leave the situation as is, the overhead generated by the unjustified complexity is going to silently kill many advances, many cost savings.

Perhaps the biggest prejudice comes from silently killing many advances. One can observe this in the flesh, here. I mean, on the Synthmaker forum, then on the Flowstone forum.

First illustration. On Flowstone, nobody seems aware of the existence of the Adaline sisters. I mean Knobby Adaline, and Memistor Adaline. Knobby changed the course of humanity. Bernard Widrow and Marcian (Ted) Hoff built Knobby using parts purchased in Zack's electronics shop in downtown Palo Alto on High Street. Anybody pretending to be an "Artificial Intelligence" specialist who never took a Sunday for re-creating Knobby is a patented clown. With Flowstone, such a re-creation takes less than one hour. Knobby's sister, Memistor thus, came later. She can't get re-created. She remains unique. Memistor embeds current-controlled variable resistors, actually replacing knobs. A positive current entering the control port plates copper atoms onto a carbon substrate acting as variable resistor, decreasing its value. A negative current entering the control port strips copper atoms from the carbon substrate acting as variable resistor, increasing its value. The above described resistor thus materializes a permanent, analog memory, necessitating no standby power. It suffices to enter a positive or negative current into the control port for decreasing or increasing the resistor value. Memistor allows the learning to happen by depressing a button labelled "Please learn this", for the time required for Memistor to answer "OK, I've got that". Full automatic learning is thus within reach.

Second illustration. On Flowstone, FIR filters remain 90% ignored, and as yet, there are no Widrow-Hoff LMS machines (aka intelligers). Consequently, Flowstone gets maintained in the stone age. This has tremendous consequences, speaking of human intellectual development. Instead of becoming able to explore and exploit FIR filters and intelligers at will, young people rapidly get confined to adding bells and whistles to non-learning devices that have existed since the MiniMoog synthesizer launched in 1970. This is quite bizarre, as back in July 2013, as a newbie on Flowstone, I managed to write a 19-tap FIR filter, a 39-tap FIR filter and a 99-tap FIR filter, all working fine, using the Flowstone "DSP Code Component" (Flowstone blue frame).

The 19-tap FIR "DSP Code Component" (Flowstone blue frame) is :
Code:
streamin input;
streamout output;

// FIR coeff
streamin c00;
streamin c01;
streamin c02;
streamin c03;
streamin c04;
streamin c05;
streamin c06;
streamin c07;
streamin c08;
streamin c09;
streamin c10;
streamin c11;
streamin c12;
streamin c13;
streamin c14;
streamin c15;
streamin c16;
streamin c17;
streamin c18;

// MEM
float fir;
float epsilon = 0.00000000001;
float m00;
float m01;
float m02;
float m03;
float m04;
float m05;
float m06;
float m07;
float m08;
float m09;
float m10;
float m11;
float m12;
float m13;
float m14;
float m15;
float m16;
float m17;
float m18;

input = input + epsilon;   // nudge the input by a tiny epsilon (presumably to avoid denormals)
m01 = input;
fir = (c00*m00)+(c01*m01)+(c02*m02)+(c03*m03)+(c04*m04)+(c05*m05)+(c06*m06)+(c07*m07)+(c08*m08)+(c09*m09);
fir = fir+(c10*m10)+(c11*m11)+(c12*m12)+(c13*m13)+(c14*m14)+(c15*m15)+(c16*m16)+(c17*m17)+(c18*m18);
output = fir;

// update MEM
input = input - epsilon;
m18 = m17;
m17 = m16;
m16 = m15;
m15 = m14;
m14 = m13;
m13 = m12;
m12 = m11;
m11 = m10;
m10 = m09;
m09 = m08;
m08 = m07;
m07 = m06;
m06 = m05;
m05 = m04;
m04 = m03;
m03 = m02;
m02 = m01;
m01 = m00;


The assembly code that gets generated is:
Code:
streamin input;streamout output;streamin c00;streamin c01;streamin c02;streamin c03;streamin c04;streamin c05;streamin c06;streamin c07;streamin c08;streamin c09;streamin c10;streamin c11;streamin c12;streamin c13;streamin c14;streamin c15;streamin c16;streamin c17;streamin c18;float fir=0;
float epsilon=1e-011;
float m00=0;
float m01=0;
float m02=0;
float m03=0;
float m04=0;
float m05=0;
float m06=0;
float m07=0;
float m08=0;
float m09=0;
float m10=0;
float m11=0;
float m12=0;
float m13=0;
float m14=0;
float m15=0;
float m16=0;
float m17=0;
float m18=0;
movaps xmm0,input;
addps xmm0,epsilon;
//Assignment> sLeft=xmm0
movaps input,xmm0;
//Assignment> sLeft=input
movaps xmm0,input;
movaps m01,xmm0;
movaps xmm0,c00;
mulps xmm0,m00;
movaps xmm1,c01;
mulps xmm1,m01;
addps xmm0,xmm1;
movaps xmm1,c02;
mulps xmm1,m02;
addps xmm0,xmm1;
movaps xmm1,c03;
mulps xmm1,m03;
addps xmm0,xmm1;
movaps xmm1,c04;
mulps xmm1,m04;
addps xmm0,xmm1;
movaps xmm1,c05;
mulps xmm1,m05;
addps xmm0,xmm1;
movaps xmm1,c06;
mulps xmm1,m06;
addps xmm0,xmm1;
movaps xmm1,c07;
mulps xmm1,m07;
addps xmm0,xmm1;
movaps xmm1,c08;
mulps xmm1,m08;
addps xmm0,xmm1;
movaps xmm1,c09;
mulps xmm1,m09;
addps xmm0,xmm1;
//Assignment> sLeft=xmm0
movaps fir,xmm0;
movaps xmm0,c10;
mulps xmm0,m10;
movaps xmm1,fir;
addps xmm1,xmm0;
movaps xmm2,c11;
mulps xmm2,m11;
addps xmm1,xmm2;
movaps xmm2,c12;
mulps xmm2,m12;
addps xmm1,xmm2;
movaps xmm2,c13;
mulps xmm2,m13;
addps xmm1,xmm2;
movaps xmm2,c14;
mulps xmm2,m14;
addps xmm1,xmm2;
movaps xmm2,c15;
mulps xmm2,m15;
addps xmm1,xmm2;
movaps xmm2,c16;
mulps xmm2,m16;
addps xmm1,xmm2;
movaps xmm2,c17;
mulps xmm2,m17;
addps xmm1,xmm2;
movaps xmm2,c18;
mulps xmm2,m18;
addps xmm1,xmm2;
//Assignment> sLeft=xmm1
movaps fir,xmm1;
//Assignment> sLeft=fir
movaps xmm0,fir;
movaps output,xmm0;
movaps xmm0,input;
subps xmm0,epsilon;
//Assignment> sLeft=xmm0
movaps input,xmm0;
//Assignment> sLeft=m17
movaps xmm0,m17;
movaps m18,xmm0;
//Assignment> sLeft=m16
movaps xmm0,m16;
movaps m17,xmm0;
//Assignment> sLeft=m15
movaps xmm0,m15;
movaps m16,xmm0;
//Assignment> sLeft=m14
movaps xmm0,m14;
movaps m15,xmm0;
//Assignment> sLeft=m13
movaps xmm0,m13;
movaps m14,xmm0;
//Assignment> sLeft=m12
movaps xmm0,m12;
movaps m13,xmm0;
//Assignment> sLeft=m11
movaps xmm0,m11;
movaps m12,xmm0;
//Assignment> sLeft=m10
movaps xmm0,m10;
movaps m11,xmm0;
//Assignment> sLeft=m09
movaps xmm0,m09;
movaps m10,xmm0;
//Assignment> sLeft=m08
movaps xmm0,m08;
movaps m09,xmm0;
//Assignment> sLeft=m07
movaps xmm0,m07;
movaps m08,xmm0;
//Assignment> sLeft=m06
movaps xmm0,m06;
movaps m07,xmm0;
//Assignment> sLeft=m05
movaps xmm0,m05;
movaps m06,xmm0;
//Assignment> sLeft=m04
movaps xmm0,m04;
movaps m05,xmm0;
//Assignment> sLeft=m03
movaps xmm0,m03;
movaps m04,xmm0;
//Assignment> sLeft=m02
movaps xmm0,m02;
movaps m03,xmm0;
//Assignment> sLeft=m01
movaps xmm0,m01;
movaps m02,xmm0;
//Assignment> sLeft=m00
movaps xmm0,m00;
movaps m01,xmm0;

Third illustration. The first digital filter primitive on Synthmaker was not an IIR Biquad Direct Form I or II filter, along with some IIR Biquad controller in charge of doing a Bilinear Transform. Instead, it was a digital replica of the Moog lowpass filter (4th-order lowpass, adjustable cutoff frequency, adjustable resonance). And because of violating the Small Angle Approximation, the replica was not accurate in the high frequencies.

Fourth illustration. In 1975, Ramesh C. Agarwal and C. Sidney Burrus showed how to circumvent digital arithmetic saturation when dealing with digital filters exhibiting a very low cutoff frequency. The solution consists in reproducing, in digital, the layout of the Universal State Variable filter, made of two integrators (approximated in digital by two accumulating delay cells with gain) maintained in control by two feedback loops. In 1980, Hal Chamberlin detailed a working example in his seminal, famous book entitled "Musical Applications of Microprocessors". Such an advance remained ignored by most Synthmaker and Flowstone users, because they were more interested in polishing their own creations than in coding an Agarwal-Burrus filter operating in streaming.
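For reference, here is the Chamberlin-style state variable structure in a short Python sketch (my transcription of the well-known topology, per-sample like a streaming implementation; f = 2 sin(pi fc / Fs), q = 1/Q):

Code:
import numpy as np

# Digital state-variable filter: two accumulating delay cells (integrators)
# held in control by two feedback loops; well behaved at very low cutoffs.
def svf_lowpass(x, fs=44100.0, fc=50.0, q=1.0):
    f = 2.0 * np.sin(np.pi * fc / fs)     # integrator gain
    lp = bp = 0.0
    y = np.empty(len(x))
    for i, xn in enumerate(x):
        lp += f * bp                      # first accumulator
        hp = xn - lp - q * bp             # the two feedback loops
        bp += f * hp                      # second accumulator
        y[i] = lp                         # lowpass tap (hp and bp also available)
    return y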

As you can see above, generating "DSP Code" on Flowstone is easy as a-b-c.
Mrs. Knobby Adaline only requires 16 inputs and 16 knobs.
Comes the idea of reproducing Mrs. Adaline in "DSP Code".
Apparently, nobody has tried this yet.
Same for Mrs. Memistor Adaline.

Comes the question of taking advantage of the x86 SSE instruction set.
On Flowstone, one must code "by hand" using the x86 SSE assembly (green frame).

Outside of Flowstone, most C compilers do not generate x86 SSE code because, in case you don't specify anything about data alignment (very crucial with x86 SSE), the x86 SSE code will be less efficient than normal x86 code. By "normal" x86 code, I mean the assembly code that gets generated by the Flowstone "DSP Code Component" (blue frame).
Outside of Flowstone, most C compilers act fuzzy when dealing with out-of-order execution issues. The generated SSE code (if any) may not be reliable.

One must picture the endless questions about reliability and performance that occupy the minds of perfectly valuable people engulfed in the x86 lineage.

An army of people spend hours, days, weeks searching for one trick or another able to circumvent this or that x86 or Windows quirk. They all do this in good faith, because after identifying a challenge, they want to win it. All the time they allocate to this transforms them into specialized ants, using some specialized jargon that prevents them from realizing the relative common-world insignificance of their commitment.
At the moment, nobody is offering them a kind of progression consisting in letting them articulate how better x86 ASIO systems could be designed.
For a long time, I have been asking myself why nobody is offering such an opportunity, despite the return being possibly huge.
It is only recently that I understood the reason. They have been engulfed so long in the x86 lineage that they won't articulate anything that's not about the x86. They feel that in case they dare articulating something that's not strictly x86, they'll be disapproved of, considered traitors, excluded by the x86 community.
Thus, the only way to get valuable people working again on applied computer science, learning systems, etc., is to explain to them that several computer architectures will get banned, regarded as so inadequate that allocating resources (time, money, intelligence, people's careers) to them gets identified as pure waste. Several computer architectures will thus get banned, just like several petrol or diesel cars will get banned.
Oh, and anyway, computers that execute more than one million bit moves before they can actually operate will also get banned, for obvious national security reasons and for obvious private life preservation reasons. This should give you an idea of the importance of the global cleansing that must occur in the computer industry. The sooner, the better, isn't it?

Have a nice day

Re: what to do with a 32-tap FIR filter

Postby Duckett » Tue Mar 10, 2020 5:53 pm

Even if your posts tend to be rather long reads, I do appreciate the detailed, in-depth discussion of DSP subjects.

As far as the "attachment limit" situation, even Trog has to wait for Maik to come 'round the FS forum and reset everybody; therefore my selfish advice (because I really want to look at your example) would be to use Dropbox or similar, until Maik's able to drop by and reset the forum attachment limits :)
We have to train ourselves so that we can improvise on anything... a bird, a sock, a fuming beaker! This, too, can be music. Anything can be music. -Biff Debris

Re: what to do with a 32-tap FIR filter

Postby steph_tsf » Mon Mar 16, 2020 11:52 am

Duckett wrote:I really want to look at your example ... use Dropbox or similar, until Maik's able to drop by and reset the forum attachment limits.
Regarding Dropbox or similar, I agree this is a possible solution in the short term, but I am reluctant about it at the moment. It would be a kind of Flowstone bypass, detrimental to Flowstone. I prefer granting Flowstone the opportunity to remain in total control, solving the "board attachment quota reached" issue at the source, once and for all.

Today, Mon Mar 16, 2020 12:52 pm, the "board attachment quota reached" issue appears to be solved. The .fsm example is now part of the initial post dated Mon Mar 09, 2020 4:59 am.

I am now attaching the iDFT XO Visual Basic application I wrote in August 2012 for calculating FIR filter coefficients (weights) in the context of 2-way phase-linear synchronous complementary (ideal) crossovers.

Attachment: iDFT XO (exe).zip
Attachment: iDFT XO (650 pix).jpg

One can rely on Flowstone Ruby for generating a .fsm that does the exact same.

I am attaching the iDFT XO source code:
Attachment: iDFT XO (source).zip

Use Notepad++ for navigating into the iDFT XO source code.
Notepad++ can be downloaded here:
https://notepad-plus-plus.org/downloads/

Have a nice day

Re: what to do with a 32-tap FIR filter

Postby wlangfor@uoguelph.ca » Sat Mar 21, 2020 3:52 pm

You would be able to skip a few steps with double precision. In fact maybe three, roughly.
My youtube channel: DSPplug
My Websites: www.dspplug.com KVRaudio flowstone products

