Dissecting the Uniswap V3 data cleaning process
Written by: Zelos
Introduction
In the previous issue, we calculated users' net worth and rate of return on Uniswap from the perspective of user addresses. This time the goal remains the same, but we must also count the cash held by those addresses, to obtain a total net worth and rate of return.
The statistics cover two pools:
The final data is at hourly granularity. Note: **the data in each row represents the value at the last moment of the hour.**
Overall process
1. Get Uniswap data
To provide data sources for demeter, we developed the demeter-fetch tool, which fetches Uniswap pool logs from various channels and parses them into different formats. Supported data sources are:
Output formats include:
This time we mainly fetch tick data, which is used to compile position information: fund amounts, income per minute, life cycle, holders, and so on.
These data come from the pool's event logs, such as Mint, Burn, Collect, and Swap. However, the pool's logs do not contain the token id, which makes it impossible to determine which position a pool operation belongs to.
In fact, Uniswap LP interests are managed through NFTs, and the manager of these NFT tokens is the proxy contract. The token id only exists in the proxy's event logs. Therefore, to obtain complete LP position information, we must fetch the proxy's event logs and combine them with the pool's event logs.
Take this transaction [3] as an example: we need to pay attention to the two logs with log index 227 and 229. They are the Mint emitted by the pool contract and the IncreaseLiquidity emitted by the proxy contract. Their amount (that is, liquidity), amount0, and amount1 are identical, which serves as the basis for correlation. By correlating these two logs, we obtain this LP operation's tick range, liquidity, token id, and the amounts of both tokens.
For Burn and Collect, the same method can find the corresponding position id for a pool event. But there is a complication: sometimes the amounts in the two events are not identical and show a slight deviation, for example in this transaction,
whose amount0 and amount1 differ slightly. Although this situation is rare, it is not an isolated case, so when matching Burn and Collect we leave some room for error in the values.
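As a sketch of this matching, assuming logs parsed into dictionaries (the field names such as tx_hash, liquidity, and token_id are hypothetical, not demeter-fetch's actual schema):

```python
from decimal import Decimal

# Relative tolerance to absorb the slight deviations observed between
# pool and proxy amounts (the concrete value is an assumption).
MATCH_TOLERANCE = Decimal("1e-6")

def amounts_match(a, b, tolerance=MATCH_TOLERANCE):
    """Compare two token amounts, leaving room for small deviations."""
    if a == b:
        return True
    denom = max(abs(a), abs(b))
    return denom > 0 and abs(a - b) / denom <= tolerance

def match_mint_to_increase(pool_mints, proxy_increases):
    """Associate each pool Mint log with the proxy IncreaseLiquidity log
    from the same transaction whose liquidity/amount0/amount1 agree."""
    matched = []
    for mint in pool_mints:
        for inc in proxy_increases:
            if (inc["tx_hash"] == mint["tx_hash"]
                    and amounts_match(inc["liquidity"], mint["liquidity"])
                    and amounts_match(inc["amount0"], mint["amount0"])
                    and amounts_match(inc["amount1"], mint["amount1"])):
                # The proxy log carries the token id the pool log lacks.
                matched.append({**mint, "token_id": inc["token_id"]})
                break
    return matched
```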
The next question is who initiated the transaction. For withdrawals, we use the recipient in the Collect event as the holder of the position. For Mint, we can only get the sender (see the picture with the Mint event).
If the user operates the pool contract directly, the sender is the LP provider; but if an ordinary user operates through the proxy, the sender is the proxy's address, because the funds really are transferred from the proxy to the pool. The good news is that the proxy mints an NFT token, and this NFT token is always transferred to the LP provider. Therefore, by watching the transfers of the proxy contract (that is, the NFT token contract), we can find the LP provider corresponding to each Mint.
In addition, if the NFT is transferred, the position's holder changes. We measured this and found it rare, so to keep things simple we did not consider NFT transfers after Mint.
2. Get the cash held by the address
The goal of this stage is to obtain the number of tokens held by an address at each moment during the statistical period. Two kinds of data are needed:
By adding and subtracting transfer amounts, the balance at each moment can be inferred from the starting balance.
The balance at the starting time can be queried through the RPC interface. With an archive node, you can set the block height in the query parameters to get the balance at any point in time. Both native-token and ERC-20 balances can be obtained this way.
ERC-20 transfer records are relatively easy to obtain through any channel (BigQuery, RPC, chifra).
ETH transfer records, however, must be derived from transactions and traces. Transactions are manageable, but querying and processing traces is very computationally intensive. Fortunately, chifra can export the ETH balance, emitting a record whenever the balance changes. Although it records only the change in quantity, not the counterparty, this still meets our requirements, and it is the lowest-cost method that does.
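The balance reconstruction described above can be sketched with pandas (the column name and DataFrame shape are illustrative):

```python
import pandas as pd

def reconstruct_balances(start_balance: float, transfers: pd.DataFrame) -> pd.Series:
    """Infer the balance at each transfer moment from the balance at the
    start of the period (queried via RPC at a fixed block height) plus
    signed transfer records.

    transfers: DataFrame with a datetime index and a signed 'amount'
    column (positive = received, negative = sent, fees already deducted).
    """
    deltas = transfers.sort_index()["amount"].cumsum()
    return start_balance + deltas
```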
3. Obtain price
Uniswap is an exchange: whenever a token swap occurs, a Swap event is emitted. We can read the token price from the sqrtPriceX96 field and the total liquidity at that moment from the liquidity field.
Since our pools contain a stablecoin, obtaining the USD price is easy. But this price is not absolutely accurate. First, it is affected by trading frequency: if no swap occurs, the price lags. In addition, when the stablecoin depegs, a gap opens between this price and the true USD price. Under normal circumstances, however, this price is accurate enough, and poses no problem for market research.
Finally, the token price is resampled to produce a per-minute price list.
In addition, since the event's liquidity field also carries the pool's current total liquidity, we include the total liquidity as well. The final table looks like this:
4. Handling fee statistics
Fees are the main source of income for a position. Every time a user performs a swap on the pool, the positions whose tick range covers the current tick (that is, lower ≤ current tick ≤ upper) receive a fee. The income is related to the liquidity share, the pool's fee rate, and the tick range.
To count users' fee income, we record, minute by minute, the swap amount that occurred on each tick of the pool, then calculate that tick's fee income for the minute:
Finally, this becomes a table like this:
This statistical method does not account for the case where the current tick's liquidity is exhausted mid-swap. However, since our statistical target is LPs and we aggregate over tick ranges, this error is mitigated to a certain extent.
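The per-tick, per-minute fee accumulation described above can be sketched as follows (the swap-record fields are illustrative):

```python
from collections import defaultdict

def fee_per_tick_minute(swaps, fee_rate):
    """Accumulate fee income per (minute, tick): every swap pays
    amount * fee_rate to the liquidity at the tick it executed on.

    swaps: iterable of dicts with 'minute', 'tick', and 'amount' keys
    (hypothetical field names).
    """
    fees = defaultdict(float)
    for s in swaps:
        fees[(s["minute"], s["tick"])] += s["amount"] * fee_rate
    return dict(fees)
```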
5. Get position list
To obtain a list of positions, you must first specify the identifier of the position.
Through this identifier, we can integrate all of an LP's operations into a list describing the position's entire life cycle. For example:
Note, however, that the statistics cover the year 2023, not the pool's entire history since creation. Inevitably, for some positions we cannot obtain their operations before January 1, 2023, which forces us to estimate how much liquidity such a position held at the start of the statistical period. We adopt an economical way to estimate it:
This method avoids downloading pre-2023 data, saving costs. However, it faces the problem of sunken liquidity: if an LP performs no operations during the year, that LP cannot be found. This problem is not serious. Since the statistical period is a full year, we assume users will generally adjust their LP positions during that time: over the span of a year, the price of ETH changes greatly, and users have many reasons to adjust, such as the price moving outside the tick range or moving funds into other DeFi protocols. An active user will therefore adjust their LP according to the price. Users who deposit funds in the pool and never adjust are considered inactive and excluded from the statistics.
A more troublesome case is a position that minted some liquidity before 2023, performed some mint/burn operations during the period, and still had unburned liquidity at the end of the statistics, so we can count only part of its liquidity. In this case, the sunk liquidity affects the position's fee estimate and causes abnormal returns; the specific reasons are discussed later.
In the final statistics, Polygon has 73,278 positions and Ethereum has 21,210. Each chain has no more than 10 abnormal returns, suggesting the assumption is credible.
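One plausible reading of the "economical" estimate above (the exact rule is our assumption, not stated in the source) is to replay a position's in-period operations and treat any liquidity deficit as liquidity that must have existed before the period started:

```python
def infer_initial_liquidity(ops):
    """Estimate a position's liquidity at the start of the statistical
    period from its in-period operations alone.

    ops: list of (op_type, liquidity) tuples in time order,
    op_type in {'mint', 'burn'}. Hypothetical reconstruction: whenever
    the running balance would go negative, the deficit is attributed to
    pre-period liquidity.
    """
    balance = 0
    initial = 0
    for op_type, liq in ops:
        balance += liq if op_type == "mint" else -liq
        if balance < 0:
            initial += -balance  # deficit implies pre-period liquidity
            balance = 0
    return initial
```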
6. Get the corresponding relationship between address and position
Since the ultimate goal is per-address income, we also need the correspondence between addresses and positions. Through this association we can see a user's specific investment behavior.
In step 1 we already did the work of finding the users associated with fund operations (mint/collect). Therefore, by finding the sender of each Mint and the recipient of each Collect, we obtain the correspondence between positions and addresses.
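A minimal sketch of building that correspondence (field names are illustrative):

```python
def build_position_owners(mints, collects):
    """Map each position id to the set of addresses associated with it:
    the sender of Mint and the recipient of Collect."""
    owners = {}
    for m in mints:
        owners.setdefault(m["position_id"], set()).add(m["sender"])
    for c in collects:
        owners.setdefault(c["position_id"], set()).add(c["recipient"])
    return owners
```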
7. Calculate the net value and rate of return of the position
In this step, we calculate each position's net value, then derive the rate of return from the net value.
Net Worth
A position's net value consists of two parts. One is the LP's liquidity, equivalent to the market-making principal: after the user invests funds in a position, the amount of liquidity does not change, but its net value fluctuates with the price. The other part is fee income, which is independent of liquidity and stored separately in two fields, fee0 and fee1. The net value of fees increases over time.
Therefore, for any minute, combining the liquidity with that minute's price gives the net value of the principal. Calculating the fees requires the fee table computed in step 4.
First, divide this position's liquidity by the pool's current total liquidity to get the sharing ratio. Then multiply this ratio by the sum of the fees of all ticks within the position's tick range to get the fee income for the minute.
Expressed as:
Finally, add the fee0 and fee1 values to get the total fees, then add that to the net value of the liquidity to get the total net worth.
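The per-minute fee sharing described above can be sketched as:

```python
def position_fee_for_minute(liquidity, total_liquidity, lower, upper, tick_fees):
    """Fee income of one position for one minute: its share of the
    pool's liquidity times the fees accrued on every tick inside its
    tick range.

    tick_fees: dict mapping tick -> fee accrued on that tick this
    minute (hypothetical structure, one dict per token).
    """
    share = liquidity / total_liquidity
    in_range = sum(fee for tick, fee in tick_fees.items() if lower <= tick <= upper)
    return share * in_range
```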
When calculating net worth, we divide the life cycle of the position based on mint/burn/collect transactions.
Finally, the net-value lists obtained from each collect are summarized, then resampled and further aggregated to produce the final result.
In addition, to improve accuracy, we made two optimizations.
First, for hours in which a transaction occurred (mint/burn/collect), we compute minute-level statistics; for hours with no transaction, hour-level statistics. The results are then resampled back to hour level.
Second, the Collect event gives us the sum of liquidity plus fees, so we can compare the actual collected value with our theoretical value to obtain the difference between theoretical and actual fees (strictly, this difference also includes the difference in LP principal, but that error is tiny and can essentially be treated as 0). We spread this fee difference across the rows to improve the accuracy of the fee estimate (these are the fee_modify0 and fee_modify1 fields in the table above).
Notice:
Since each row holds the data at the last moment of its hour, a position that has been fully closed would show a net value of 0, losing the net value at the moment of closing. To preserve it, a row timestamped 2038-01-01 00:00:00 is appended at the end of the file, storing the net value and other data at the moment the position closed, for use by other projects' statistics.
Rate of return
Normally, the rate of return is calculated by dividing the ending net value by the starting net value. But that does not apply here, for the following reasons:
For problem 1, we compute a per-minute return by dividing consecutive minutes' net values, then multiply the per-minute returns together to get the total return.
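Chaining per-minute returns, as described, amounts to:

```python
import math

def total_return(net_values):
    """Total return over a period, as the product of per-minute returns.

    net_values: net worth sampled once per minute, in time order.
    """
    ratios = (net_values[i] / net_values[i - 1] for i in range(1, len(net_values)))
    return math.prod(ratios)
```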
But this algorithm has a serious problem: a data error in any single minute's return causes a large deviation in the total. The statistical process becomes a tightrope walk where no mistake is allowed. On the bright side, this leaves statistical errors nowhere to hide.
For problem 2, if funds are transferred in or out during a minute, directly dividing net values still produces an outrageous return. The per-minute return algorithm therefore needs refinement.
Our first attempt was to decompose the change in net worth in detail and then eliminate the fund flows. We split the change into three parts: (1) the change in principal driven by the price; (2) the fees accumulated during the minute; (3) the inflow and outflow of funds. Obviously, (3) must be excluded from the statistics. For this we formulated the following calculation method:
Based on the above assumptions, the per-minute return divides the ending liquidity/price/fees by the starting liquidity/price/fees, expressed as follows, where f denotes the function converting liquidity to net worth.
This method looks very good: it completely avoids changes in liquidity and reflects the impact of price and fees on net worth, which is exactly what we want. In practice, however, some rows show huge returns. Investigation showed the problem occurs when liquidity is withdrawn. Recall our rule: each row represents the end of its minute/hour. This keeps the data's time scale uniform, but note that the meaning of each column differs:
So, for the hour in which liquidity is burned,
this reduces the above formula to:
This situation occurs not only at the end of the position's life cycle; burning part of the liquidity also changes the ratio of fee growth to the LP's net worth.
For simplicity, whenever the LP's net value changes, we set the return to 1. This introduces error into the return calculation, but for a normal, continuously invested position, the hours containing transactions are a very small fraction of the whole life cycle, so the impact is not significant.
8. Calculate the total LP income of the address
With the rate of return of each position, plus the corresponding relationship between position and address, we can get the rate of return of the user address in each position.
The algorithm here is relatively simple: an address's positions from different periods are chained together. For gaps with no investment, the net value is set to 0 and the return to 1 (the net value is 0 both before and after, so there is no change and the return is 1).
If multiple positions overlap in the same period, their net values are summed to get the total net value. When merging returns, we weight each position's return by its net value.
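A minimal sketch of this net-value-weighted merge:

```python
def merge_positions(net_values, returns):
    """Merge several positions held over the same period: net values
    add, and the combined return is the net-value-weighted average of
    each position's return.

    net_values[i], returns[i]: the i-th position's net value and rate
    of return over the period.
    """
    total_nv = sum(net_values)
    if total_nv == 0:
        return 0.0, 1.0  # no investment: net value 0, return 1
    weighted_r = sum(nv * r for nv, r in zip(net_values, returns)) / total_nv
    return total_nv, weighted_r
```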
9. Combine total cash and LP returns
Finally, combining the cash held by the user address with the LP investment gives the final result.
Merging net worth is simpler than the previous step (merging positions): find the time range of the LP net worth, look up the cash held over the corresponding range and the ETH price, and the total net worth follows.
For the rate of return, we again compute per-minute returns and multiply them. At first we used the flawed return algorithm described in step 7, which requires separating each minute's fixed part (cash held, liquidity in LP) from its variable part (price changes, fee accumulation, fund transfers). Compared with per-position statistics, the complexity is much higher: for Uniswap fund flows we only need to watch the Mint and Collect events, but tracing cash is very troublesome. We must distinguish whether funds were transferred into the LP or to the outside: transfers into the LP leave the principal unchanged, while transfers to the outside require correcting the principal. This means tracking the destination addresses of ERC-20 and ETH transfers, which is laborious. During mint/collect, the destination may be a pool or a proxy; ETH transfers are even more complicated, because ETH is a native token and some of its transfers can only be found in trace records, whose volume exceeds our processing capacity.
The last straw was discovering that each row's net value is the instantaneous value at the hour's end, while its fee is the cumulative value over the hour, so the two cannot meaningfully be added directly. This problem was indeed discovered very late.
We therefore abandoned that algorithm and instead simply divide each minute's net value by the previous minute's. This method is much simpler, but it still has a problem: when funds are transferred in or out, the return is unreasonable. From the discussion above, we know separating fund flows is very difficult, so here we sacrifice some accuracy and set the return to 1 whenever a fund transfer occurs.
The remaining question is how to identify fund inflows and outflows in the current hour. My initial algorithm was very simple: using the previous hour's token balances and the current price, estimate what this hour's net value would be if we simply held those tokens, then subtract the estimate from the actual value. When the difference is nonzero, funds were transferred. The formula is:
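A sketch of this naive detection, with a hypothetical tolerance to absorb small estimation errors:

```python
def has_fund_transfer(prev_balances, prices, actual_net_value, tolerance=0.005):
    """Flag a fund transfer when the actual net value deviates from the
    estimate obtained by repricing last hour's token balances at the
    current price.

    prev_balances / prices: dicts keyed by token symbol (illustrative).
    tolerance: relative deviation threshold (an assumed value).
    """
    estimated = sum(prev_balances[t] * prices[t] for t in prev_balances)
    if estimated == 0:
        return actual_net_value != 0
    return abs(actual_net_value - estimated) / estimated > tolerance
```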
However, this algorithm ignores the complexity of Uniswap LP: in an LP position, the token amounts change as the price changes, and the net value changes accordingly. The method also ignores fee changes. As a result, the estimate deviates from the actual value by about 0.1%.
To improve accuracy, we refined the structure of the funds, calculating the LP's value change separately and taking fees into account.
This way, the estimation error can be kept within 0.001%.
In addition, we limit the precision of the data to avoid dividing by very small numbers (usually below 10^-10). These tiny values are errors accumulated through the various calculations and resamples; if divided directly without treatment, the errors are amplified and seriously distort the rate of return.
Other questions
native token
These statistics include the USDC-ETH pool on Ethereum, where ETH is a native token and requires some special handling.
ETH cannot be used directly in DeFi and must be wrapped into WETH, so this pool is actually a USDC-WETH pool. Users who operate the pool directly simply transfer WETH in and out, the same as with an ordinary pool.
Users who add LP through the proxy must attach ETH to the transaction's value and send it to the proxy contract, which converts the ETH into WETH and deposits it into the pool. When collecting, USDC can be transferred directly to the user, but ETH cannot: it must first be transferred from the pool to the proxy, converted back to ETH by the proxy contract, and finally sent to the user via an internal transfer. For an example, see this transaction [4].
Therefore, the only difference between the USDC-ETH pool and an ordinary pool is how funds move in and out, which only affects matching positions to addresses. To solve this, we pulled all NFT transfer data since the pool's creation, then found each position's holder through its token id.
missing position
In the statistics, some positions did not make it into the final list. These positions have certain peculiarities.
A large part of them are MEV transactions. MEV trades are pure arbitrage, not normal investment, so they fall outside our statistical scope; counting them accurately would also require trace-level data. Here we use a simple strategy to filter MEV transactions: a lifetime from start to end of less than one minute. In fact, since our data's finest granularity is one minute, a position existing for less than a minute cannot be counted anyway.
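The lifetime filter described above, as a pandas sketch (column names are illustrative):

```python
import pandas as pd

def filter_mev(positions: pd.DataFrame) -> pd.DataFrame:
    """Drop positions whose lifetime from first mint to final burn is
    under one minute, treating them as likely MEV arbitrage."""
    lifetime = positions["end_time"] - positions["start_time"]
    return positions[lifetime >= pd.Timedelta(minutes=1)]
```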
Another possibility is a position with no Collect transaction. As step 7 shows, our income calculation is triggered by Collect; without a Collect operation, the preceding net value and return are never computed. Normally, users harvest their LP income or principal in a timely manner, but we cannot rule out special users who insist on storing assets in the Uniswap pool's fee0 and fee1. We also treat such users as special and exclude them from the statistics.